# Pipelines

### With MLOps, ML teams build machine learning pipelines that automatically collect and prepare data, select optimal features, run training using different parameter sets or algorithms, evaluate models, and run various model and system tests. All executions, along with their data, metadata, code, and results, must be versioned and logged so that teams can quickly visualize results, compare them with past results, and understand which data was used to produce each model.

## ML pipelines can be started manually or (preferably) triggered automatically when:

- The code, packages, or parameters change.

- The input data or feature engineering logic changes.

- Concept drift is detected, and the model needs to be retrained with fresh data.
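The trigger conditions above can be sketched as a comparison between the last successful run and the current state. This is a minimal illustration, not a real scheduler: the `fingerprint` and `trigger_reasons` helpers and the dictionary keys are hypothetical names chosen for this example.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Return a stable SHA-256 hash of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def trigger_reasons(last_run: dict, current: dict, drift_detected: bool = False) -> list:
    """List the reasons (if any) why the pipeline should run again."""
    reasons = []
    # Hypothetical keys: each one describes an input that can change between runs.
    for key in ("code", "packages", "parameters", "data", "feature_logic"):
        if fingerprint(last_run.get(key)) != fingerprint(current.get(key)):
            reasons.append(f"{key} changed")
    if drift_detected:
        reasons.append("concept drift detected")
    return reasons


last_run = {
    "code": "commit-a1",
    "packages": ["pkg==1.0"],
    "parameters": {"lr": 0.1},
    "data": "v1",
    "feature_logic": "v1",
}
current = dict(last_run, parameters={"lr": 0.05})
print(trigger_reasons(last_run, current))  # ['parameters changed']
```

An empty list means nothing changed and the pipeline does not need to run.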

## ML pipelines have the following features:

- Built using microservices (containers or serverless functions), usually over Kubernetes.

- Track all their inputs (code, package dependencies, data, parameters) and the outputs (logs, metrics, data/features, artifacts, models) for every step in the pipeline in order to reproduce or explain experiment results.

- Version all the data and artifacts used throughout the pipeline.

- Store code and configuration in versioned Git repositories.

- Use CI techniques to automate the pipeline initiation, test automation, review, and approval process.
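To make the tracking requirement concrete, here is a small sketch of a decorator that records the inputs and outputs of each pipeline step. It assumes an in-memory list (`RUN_LOG`) as a stand-in for a real metadata store; `tracked_step`, `code_version`, and the placeholder `train` step are hypothetical and do not belong to any specific framework.

```python
import functools
import hashlib
import json
import time

RUN_LOG = []  # in-memory stand-in for a metadata store (assumption)


def tracked_step(code_version: str):
    """Record parameters, outputs, and timing for every call of a step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**params):
            started = time.time()
            outputs = func(**params)
            RUN_LOG.append({
                "step": func.__name__,
                "code_version": code_version,  # e.g. a Git commit id
                "parameters": params,
                "outputs": outputs,
                # Hash of the outputs, so two runs can be compared quickly
                "outputs_hash": hashlib.sha256(
                    json.dumps(outputs, sort_keys=True).encode("utf-8")
                ).hexdigest(),
                "duration_sec": time.time() - started,
            })
            return outputs
        return wrapper
    return decorator


@tracked_step(code_version="git:abc123")
def train(learning_rate: float, epochs: int) -> dict:
    # Placeholder "training": the metric is made up for illustration only.
    return {"accuracy": round(1 - learning_rate / epochs, 3)}


train(learning_rate=0.1, epochs=10)
print(RUN_LOG[-1]["step"], RUN_LOG[-1]["parameters"])
```

With every step logged this way, a past result can be traced back to the exact code version and parameters that produced it.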



## Pipelines should be executed over scalable services or functions that can scale elastically across multiple servers or containers. This way, jobs complete faster, and compute resources are freed once the jobs complete, which reduces costs.

## You can find projects where data preparation, training, evaluation, and even prediction are all done in one huge notebook, but this approach leads to challenges when moving to production. For example:

- It is very hard to track code changes across versions (in Git).

- It is almost impossible to implement test harnesses and unit testing.

- Functions cannot be reused across projects.

- Moving to production requires refactoring the code and removing visualization or scratch code.

- Proper documentation is often lacking.

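## Tests in an ML pipeline fall into the following categories, each verifying a different property:

### Data quality tests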
- The dataset used for training is of high quality and does not carry bias.

### Model performance tests
- The model produces accurate results.

### Serving application tests
- The deployed model along with the data pre- or post-processing steps are robust and provide adequate performance.

### Pipeline tests
- The automated development pipeline handles various exceptions and the desired scale.

## Here are some examples of data quality tests:

- There are no missing values.

- Values are of the correct type and fall within an expected range (for example, user age is between 0 and 120, with the anticipated average and standard deviation).

- Category values fall within the possible options (for example, city names match the options in a city name list).

- There is no bias in the data (for example, the gender feature has the anticipated percentage of men and women).
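The data quality checks listed above can be sketched as a single validation function. The column names, the allowed city list, and the numeric bounds below are assumptions made for this example, not values from the text.

```python
from statistics import mean

ALLOWED_CITIES = {"London", "Paris", "Tokyo"}  # hypothetical category list


def data_quality_issues(rows):
    """Return a list of human-readable data quality problems."""
    issues = []

    # 1. No missing values
    for i, row in enumerate(rows):
        missing = [key for key, value in row.items() if value is None]
        if missing:
            issues.append(f"row {i}: missing {missing}")

    # 2. Correct type and expected range (age between 0 and 120)
    ages = [row["age"] for row in rows if row["age"] is not None]
    for age in ages:
        if not isinstance(age, int) or not 0 <= age <= 120:
            issues.append(f"invalid age: {age!r}")
    valid_ages = [a for a in ages if isinstance(a, int) and 0 <= a <= 120]
    # Assumed anticipated average: between 25 and 55
    if valid_ages and not 25 <= mean(valid_ages) <= 55:
        issues.append(f"unexpected mean age: {mean(valid_ages):.1f}")

    # 3. Category values fall within the possible options
    for row in rows:
        if row["city"] is not None and row["city"] not in ALLOWED_CITIES:
            issues.append(f"unknown city: {row['city']!r}")

    # 4. Anticipated distribution of a sensitive feature (assumed 40-60%)
    genders = [row["gender"] for row in rows if row["gender"] is not None]
    if genders:
        share = genders.count("female") / len(genders)
        if not 0.4 <= share <= 0.6:
            issues.append(f"gender imbalance: {share:.0%} female")

    return issues


clean = [
    {"age": 34, "city": "Paris", "gender": "female"},
    {"age": 41, "city": "Tokyo", "gender": "male"},
]
dirty = [
    {"age": 150, "city": "Atlantis", "gender": "male"},
    {"age": None, "city": "Paris", "gender": "male"},
]
print(data_quality_issues(clean))
for issue in data_quality_issues(dirty):
    print(issue)
```

In a pipeline, a non-empty result would typically fail the data preparation step before any training starts.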

## The following tests can improve model quality:

- Verify the performance is maintained across essential slices of the data (for example, devices by model, users by country or other categories, movies by genre) and that it does not drop significantly for a specific group.

- Compare the model results with previous versions or a baseline version and verify the performance does not degrade.

- Test different parameter combinations (hyperparameter search) to verify you chose the best parameter combination.

- Test for bias and fairness by verifying that the performance is maintained per gender and specific populations.

- Check feature importances and whether there are features with a marginal contribution that can be removed from the model.

- Test for immunity to fake, random, or malicious input vectors to increase robustness and defend against adversarial attacks.
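Two of the model tests above, per-slice performance and comparison against a baseline, can be sketched as follows. The record layout (`label`, `prediction`, plus a slice column) and the thresholds are assumptions for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Compute accuracy separately for each value of slice_key."""
    hits, totals = defaultdict(int), defaultdict(int)
    for record in records:
        totals[record[slice_key]] += 1
        hits[record[slice_key]] += int(record["label"] == record["prediction"])
    return {name: hits[name] / totals[name] for name in totals}


def weak_slices(records, slice_key, max_drop=0.1):
    """Return slices whose accuracy is more than max_drop below the overall accuracy."""
    overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
    per_slice = accuracy_by_slice(records, slice_key)
    return sorted(name for name, acc in per_slice.items() if overall - acc > max_drop)


def no_regression(new_accuracy, baseline_accuracy, tolerance=0.01):
    """True if the new model is not worse than the baseline beyond the tolerance."""
    return new_accuracy >= baseline_accuracy - tolerance


records = (
    [{"country": "US", "label": 1, "prediction": 1}] * 4
    + [{"country": "DE", "label": 1, "prediction": 1}] * 2
    + [{"country": "DE", "label": 1, "prediction": 0}] * 2
)
print(accuracy_by_slice(records, "country"))  # {'US': 1.0, 'DE': 0.5}
print(weak_slices(records, "country"))        # ['DE']
```

The same slicing function can be reused for bias and fairness checks by slicing on a sensitive attribute instead of country.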

## Monitoring Data and Concept Drift
### Concept drift is a phenomenon where the statistical properties of the target variable (y, which the model is trying to predict) change over time. Data drift (virtual drift) happens when the statistical properties of the inputs change. When drift occurs, the model built on past data no longer applies, and assumptions made by the model on past data need to be revised based on current data. Figure 2-18 illustrates the differences between concept drift and virtual (data) drift.


## The monitoring system saves the various feature statistics (min, max, average, stddev, histogram, and so on), and the drift level is calculated using one or more of the following metrics:

- Kolmogorov–Smirnov test

- Kullback–Leibler divergence

- Jensen–Shannon divergence

- Hellinger distance

- Standard score (Z-score)

- Chi-squared test

- Total variation distance
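Several of these metrics can be computed directly from the saved feature histograms. The sketch below uses only the standard library and assumes that both histograms share the same bins; the function names and the sample counts are illustrative.

```python
import math


def normalize(hist):
    """Turn raw bin counts into a probability distribution."""
    total = sum(hist)
    return [count / total for count in hist]


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q); eps guards against empty bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def hellinger_distance(p, q):
    """Hellinger distance, in the range [0, 1]."""
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def total_variation_distance(p, q):
    """Half of the L1 distance between the two distributions, in [0, 1]."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical histograms of one feature: training data vs. recent production data
reference = normalize([10, 20, 40, 20, 10])
current = normalize([5, 10, 30, 30, 25])

for name, metric in [
    ("KL", kl_divergence),
    ("JS", js_divergence),
    ("Hellinger", hellinger_distance),
    ("TVD", total_variation_distance),
]:
    print(f"{name}: {metric(reference, current):.4f}")
```

All four values are zero for identical distributions and grow as the production data moves away from the reference, so any of them can feed a drift threshold.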

## Monitoring Model Performance and Accuracy

### An important metric to track is model accuracy in production. To measure it, you must have the ground truth (the actual result that matches the prediction). For some models, obtaining the ground truth is relatively simple. For example, if we predict that a stock price will go up today, we can wait a few hours and know whether the prediction was accurate. The same applies to other prediction applications, such as predicting customer churn or machine failure, where the actual result arrives with some delay.
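Because the ground truth arrives later than the prediction, accuracy can only be computed over the predictions that already have a matching actual result. A minimal sketch, assuming predictions and ground truth are both keyed by a hypothetical request ID:

```python
def production_accuracy(predictions, ground_truth):
    """Return (accuracy, matched_count) over predictions that have ground truth.

    predictions:  {request_id: predicted_label}
    ground_truth: {request_id: actual_label}, filled in as results arrive
    """
    matched = [rid for rid in predictions if rid in ground_truth]
    if not matched:
        return None, 0  # no ground truth has arrived yet
    correct = sum(predictions[rid] == ground_truth[rid] for rid in matched)
    return correct / len(matched), len(matched)


predictions = {"r1": "up", "r2": "down", "r3": "up", "r4": "up"}
ground_truth = {"r1": "up", "r2": "up", "r3": "up"}  # r4 has not arrived yet

accuracy, matched = production_accuracy(predictions, ground_truth)
print(f"accuracy={accuracy:.2f} over {matched} matched predictions")
```

Reporting the matched count next to the accuracy shows how much of the recent traffic the number actually covers.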

## Application Pipeline Development
### There are two types of application pipelines: real-time (or online) pipelines, which constantly accept events or requests and respond immediately, and batch pipelines, which are triggered through an API or on a given schedule. Batch pipelines usually read and process larger datasets on every run.

## Real-time pipelines can be implemented manually by chaining individual containerized functions or can be automated by using a real-time pipeline framework such as MLRun serving graphs, Apache Beam, or AWS Step Functions. 
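The manual approach, chaining individual functions, can be sketched in plain Python. This is not the API of MLRun, Apache Beam, or AWS Step Functions; the `chain` helper, the step functions, and the scoring rule are all hypothetical stand-ins.

```python
from functools import reduce


def chain(*steps):
    """Compose steps so that each one receives the previous step's output."""
    return lambda event: reduce(lambda acc, step: step(acc), steps, event)


def preprocess(event):
    # Derive model features from the raw event
    return {**event, "features": [event["amount"] / 100.0]}


def predict(event):
    # Stand-in for a real model call: the score is just the clipped feature
    return {**event, "score": min(1.0, event["features"][0])}


def postprocess(event):
    # Keep only what the caller needs in the response
    return {"id": event["id"], "is_fraud": event["score"] > 0.5}


handler = chain(preprocess, predict, postprocess)
print(handler({"id": "tx-1", "amount": 80}))  # {'id': 'tx-1', 'is_fraud': True}
```

In a deployed pipeline each step would typically run as its own container or serverless function, and the chaining would happen over the network rather than in process.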

## A CI/CD pipeline for an ML application will likely implement the following steps:

- Data preparation

- Model training with hyperparameter tuning (for example, grid search)

- Model evaluation

- Application pipeline deployment (with the best model)
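The four steps above can be sketched end to end with a toy model. Everything here is illustrative: the dataset is synthetic, the model is a tiny one-feature logistic regression, and `deploy` is a hypothetical placeholder that only stores the best model in a dictionary.

```python
import math
from itertools import product

DEPLOYED = {}  # stand-in for a serving environment (assumption)


def prepare_data():
    # Toy dataset of (feature, label) pairs; a real step would load and clean data
    data = [(x / 10, int(x >= 6)) for x in range(10)]
    return data[::2], data[1::2]  # train set, test set


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def train(train_set, learning_rate, epochs):
    # One-feature logistic regression trained with stochastic gradient descent
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in train_set:
            error = sigmoid(w * x + b) - y
            w -= learning_rate * error * x
            b -= learning_rate * error
    return {"w": w, "b": b}


def evaluate(model, test_set):
    correct = sum(
        int(sigmoid(model["w"] * x + model["b"]) >= 0.5) == y for x, y in test_set
    )
    return correct / len(test_set)


def deploy(result):
    DEPLOYED.update(model=result["model"], params=result["params"])


def run_pipeline(grid):
    train_set, test_set = prepare_data()                      # data preparation
    results = []
    for lr, epochs in product(grid["learning_rate"], grid["epochs"]):
        model = train(train_set, lr, epochs)                  # training (grid search)
        results.append({
            "params": {"learning_rate": lr, "epochs": epochs},
            "model": model,
            "accuracy": evaluate(model, test_set),            # evaluation
        })
    best = max(results, key=lambda r: r["accuracy"])
    deploy(best)                                              # deployment of the best model
    return best, results


GRID = {"learning_rate": [0.1, 1.0], "epochs": [10, 100]}
best, results = run_pipeline(GRID)
print(best["params"], best["accuracy"])
```

In a real CI/CD setup each function would be a separate, tracked pipeline step, and the deployment step would usually require tests and an approval gate first.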

### You probably want to avoid constantly staring at dashboards for model or data performance problems. Instead, you can define triggering policies and actions. For example, when a certain threshold is reached, a notification can alert the administrator or initiate an automated process for retraining a model or mitigating potential errors.
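A triggering policy of this kind can be sketched as a list of threshold rules evaluated against the latest metrics. The metric names, thresholds, and action names below are hypothetical; a real system would dispatch the returned actions to a notification service or a retraining pipeline.

```python
def evaluate_policies(metrics, policies):
    """Return the (action, metric, value) tuples whose thresholds were crossed."""
    actions = []
    for policy in policies:
        value = metrics.get(policy["metric"])
        if value is None:
            continue  # metric not reported in this monitoring window
        crossed = (
            value > policy["threshold"]
            if policy["when"] == "above"
            else value < policy["threshold"]
        )
        if crossed:
            actions.append((policy["action"], policy["metric"], value))
    return actions


policies = [
    {"metric": "drift_score", "when": "above", "threshold": 0.3, "action": "retrain_model"},
    {"metric": "accuracy", "when": "below", "threshold": 0.85, "action": "notify_admin"},
]
metrics = {"drift_score": 0.42, "accuracy": 0.91}

print(evaluate_policies(metrics, policies))  # [('retrain_model', 'drift_score', 0.42)]
```

Keeping the policies as data rather than code lets thresholds be tuned without redeploying the monitoring service.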