# Pipelines

### With MLOps, ML teams build machine learning pipelines that automatically collect and prepare data, select optimal features, run training using different parameter sets or algorithms, evaluate models, and run various model and system tests. All executions, along with their data, metadata, code, and results, must be versioned and logged so that teams can quickly visualize results, compare them with past results, and understand which data was used to produce each model.

## ML pipelines can be started manually or (preferably) triggered automatically when:

- The code, packages, or parameters change.

- The input data or feature engineering logic changes.

- Concept drift is detected, and the model needs to be retrained with fresh data.
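The trigger conditions above can be sketched as a comparison between fingerprints recorded at the last run and the current state. This is a minimal illustration, not a real CI integration: `fingerprint`, `should_trigger`, the three keys (`code`, `params`, `data`), and the default drift threshold of 0.2 are all hypothetical names and values chosen for the example.

```python
import hashlib
import json


def fingerprint(obj):
    """Return a stable SHA-256 hash of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def should_trigger(last_run, current, drift_score, drift_threshold=0.2):
    """List the reasons (if any) why the pipeline should run again.

    last_run:  fingerprints stored by the previous run, keyed by
               "code", "params", and "data" (hypothetical keys).
    current:   the current raw values under the same keys.
    """
    reasons = []
    for key in ("code", "params", "data"):
        if fingerprint(current[key]) != last_run[key]:
            reasons.append(f"{key} changed")
    # The drift score is assumed to come from a separate monitoring job.
    if drift_score > drift_threshold:
        reasons.append("drift detected")
    return reasons
```

An empty list means nothing changed and the run can be skipped; any other result can be logged as the reason for the new run.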

## ML pipelines have the following features:

- Built using microservices (containers or serverless functions), usually over Kubernetes.

- Track all their inputs (code, package dependencies, data, parameters) and the outputs (logs, metrics, data/features, artifacts, models) for every step in the pipeline in order to reproduce or explain experiment results.

- Version all the data and artifacts used throughout the pipeline.

- Store code and configuration in versioned Git repositories.

- Use CI techniques to automate the pipeline initiation, test automation, review, and approval process.



## Pipelines should be executed over scalable services or functions, which can scale elastically across multiple servers or containers. This way, jobs complete faster, and computation resources are freed up once the jobs are complete, which reduces costs.

## You can find projects where the data preparation, training, evaluation, and even prediction are all done in one huge notebook, but this approach can lead to challenges when moving to production, for example:

- Very hard to track the code changes across versions (in Git).

- Almost impossible to implement test harnesses and unit testing.

- Functions cannot be reused across projects.

- Moving to production requires code refactoring and removal of visualization or scratch code.

- Lack of proper documentation.

### Data quality tests
- The dataset used for training is of high quality and does not carry bias.

### Model performance tests
- The model produces accurate results.

### Serving application tests
- The deployed model along with the data pre- or post-processing steps are robust and provide adequate performance.

### Pipeline tests
- The automated development pipeline handles various exceptions and the desired scale.

## Here are some examples of data quality tests:

- There are no missing values.

- Values are of the correct type and fall within an expected range (for example, user age is between 0 and 120, with the anticipated average and standard deviation).

- Category values fall within the possible options (for example, city names match the options in a city name list).

- There is no bias in the data (for example, the gender feature has the anticipated percentage of men and women).
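As a rough sketch, the four checks above can be written as one function that returns a list of violations. The field names (`age`, `city`, `gender`), the expected mean-age range, and the expected share range are assumptions made up for this example. The sketch checks only the mean, not the standard deviation.

```python
from statistics import mean


def check_data_quality(rows, allowed_cities, age_mean_range=(25.0, 55.0),
                       female_share_range=(0.4, 0.6)):
    """Return a list of data quality violations; an empty list means all checks passed."""
    problems = []

    # 1. Missing values: every field in every row must be present.
    for i, row in enumerate(rows):
        missing = sorted(k for k, v in row.items() if v is None)
        if missing:
            problems.append(f"row {i}: missing values in {missing}")

    # 2. Type and range: age must be an integer between 0 and 120,
    #    and the mean of the valid ages must fall in the anticipated range.
    valid_ages = []
    for i, row in enumerate(rows):
        age = row.get("age")
        if age is None:
            continue
        if isinstance(age, int) and 0 <= age <= 120:
            valid_ages.append(age)
        else:
            problems.append(f"row {i}: invalid age {age!r}")
    low, high = age_mean_range
    if valid_ages and not low <= mean(valid_ages) <= high:
        problems.append(f"mean age {mean(valid_ages):.1f} outside [{low}, {high}]")

    # 3. Categories: city must be one of the allowed options.
    for i, row in enumerate(rows):
        city = row.get("city")
        if city is not None and city not in allowed_cities:
            problems.append(f"row {i}: unknown city {city!r}")

    # 4. Bias: the share of one gender value must match the anticipated share.
    genders = [row["gender"] for row in rows if row.get("gender") is not None]
    if genders:
        share = genders.count("female") / len(genders)
        low, high = female_share_range
        if not low <= share <= high:
            problems.append(f"female share {share:.2f} outside [{low}, {high}]")

    return problems
```

A pipeline step could fail the run, or only raise a warning, when the returned list is not empty.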

## The following tests can improve model quality:

- Verify the performance is maintained across essential slices of the data (for example, devices by model, users by country or other categories, movies by genre) and that it does not drop significantly for a specific group.

- Compare the model results with previous versions or a baseline version and verify the performance does not degrade.

- Test different parameter combinations (hyperparameter search) to verify you chose the best parameter combination.

- Test for bias and fairness by verifying that the performance is maintained per gender and specific populations.

- Check feature importances and whether there are features with a marginal contribution that can be removed from the model.

- Test for immunity to fake, random, or malicious input vectors to increase robustness and defend against adversarial attacks.
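Two of the tests above, per-slice performance and comparison with a baseline, can be sketched in a few lines. This example uses plain accuracy as the metric, and the tolerances (`max_slice_drop`, `max_regression`) are illustrative values rather than recommendations.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_by_slice(y_true, y_pred, slices):
    """Compute accuracy separately for each slice value (for example, country)."""
    grouped = defaultdict(lambda: ([], []))
    for t, p, s in zip(y_true, y_pred, slices):
        grouped[s][0].append(t)
        grouped[s][1].append(p)
    return {s: accuracy(ts, ps) for s, (ts, ps) in grouped.items()}


def validate_model(y_true, y_pred, slices, baseline_accuracy,
                   max_slice_drop=0.10, max_regression=0.01):
    """Return a list of failed checks; an empty list means the model passed."""
    failures = []
    overall = accuracy(y_true, y_pred)

    # Baseline test: the new model must not be noticeably worse than the old one.
    if overall < baseline_accuracy - max_regression:
        failures.append(
            f"overall accuracy {overall:.2f} below baseline {baseline_accuracy:.2f}")

    # Slice test: no group may fall far below the overall accuracy.
    for name, acc in sorted(accuracy_by_slice(y_true, y_pred, slices).items()):
        if acc < overall - max_slice_drop:
            failures.append(
                f"slice {name!r} accuracy {acc:.2f} far below overall {overall:.2f}")
    return failures
```

The same structure extends to bias and fairness checks by slicing on the relevant population attribute.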

## Monitoring Data and Concept Drift
### Concept drift is a phenomenon where the statistical properties of the target variable (y, which the model is trying to predict) change over time. Data drift (virtual drift) happens when the statistical properties of the inputs change. In drift, the model built on past data no longer applies, and assumptions made by the model on past data need to be revised based on current data. Figure 2-18 illustrates the differences between concept drift and virtual (data) drift.


## The monitoring system saves the various feature statistics (min, max, average, stddev, histogram, and so on), and the drift level is calculated using one or more of the following metrics:

- Kolmogorov–Smirnov test

- Kullback–Leibler divergence

- Jensen–Shannon divergence

- Hellinger distance

- Standard score (Z-score)

- Chi-squared test

- Total variation distance
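To make some of these metrics concrete, here is a minimal standard-library sketch that compares a reference sample (for example, training data) with a recent production sample. The function names are chosen for this example. The divergences are computed on normalized histograms, and the small `eps` in the KL computation is a smoothing assumption to avoid division by zero. In practice, a statistics library would normally be used instead.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def histogram(values, edges):
    """Normalized histogram; values outside the edges are clipped into the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = min(max(bisect_right(edges, v) - 1, 0), len(counts) - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) in nats, with eps smoothing."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln(2) when using natural log."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q)))


def total_variation_distance(p, q):
    """Total variation distance: half the L1 distance, in [0, 1]."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
```

All of these return 0 for identical distributions and grow as the production data moves away from the reference, so each can be compared against a threshold to raise a drift alert.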

## Monitoring Model Performance and Accuracy

### An important metric to monitor is model accuracy in production. To measure it, you must have the ground truth (the actual result that matches the prediction). For some models, obtaining the ground truth is relatively simple. For example, if we predict that a stock price will go up today, we can wait a few hours and know if the prediction was accurate. The same is true for other prediction applications, such as predicting customer churn or machine failure, where the actual result arrives with some delay.
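Because the ground truth arrives later than the prediction, accuracy in production is usually computed by joining logged predictions with the actual results once they are known. The sketch below assumes each prediction was logged under a request ID, and `production_accuracy` is a hypothetical helper, not part of any monitoring product.

```python
def production_accuracy(predictions, ground_truth):
    """Join logged predictions with late-arriving ground truth.

    predictions:  {request_id: predicted_label}, logged at serving time.
    ground_truth: {request_id: actual_label}, filled in when results arrive.
    Returns (accuracy over matched requests or None, count still pending).
    """
    matched = [
        (pred, ground_truth[request_id])
        for request_id, pred in predictions.items()
        if request_id in ground_truth
    ]
    pending = len(predictions) - len(matched)
    if not matched:
        return None, pending
    correct = sum(pred == actual for pred, actual in matched)
    return correct / len(matched), pending
```

Reporting the pending count alongside the accuracy shows how much of the recent traffic the figure actually covers.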

## Application Pipeline Development
### There are two types of application pipelines: real-time (or online) pipelines, which constantly accept events or requests and respond immediately, and batch pipelines, which are triggered through an API or on a given schedule. Batch pipelines usually read and process larger datasets on every run.

## Real-time pipelines can be implemented manually by chaining individual containerized functions or can be automated by using a real-time pipeline framework such as MLRun serving graphs, Apache Beam, or AWS Step Functions. 
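The manual approach, chaining individual functions, can be illustrated in plain Python. This sketch does not use the API of MLRun, Apache Beam, or AWS Step Functions. The step names (`parse`, `enrich`, `predict`) and the rule standing in for a model are made up for the example. In a real deployment each step would typically run in its own container or serverless function.

```python
def chain(*steps):
    """Compose steps into a single handler that passes each event through in order."""
    def handler(event):
        for step in steps:
            event = step(event)
        return event
    return handler


def parse(event):
    """Convert raw request fields to typed values."""
    return {**event, "amount": float(event["amount"])}


def enrich(event):
    """Add a derived feature to the event."""
    return {**event, "is_large": event["amount"] > 1000}


def predict(event):
    """Stand-in for a model: a fixed rule used only for illustration."""
    return {**event, "prediction": "review" if event["is_large"] else "approve"}


# One handler that processes each incoming request end to end.
pipeline = chain(parse, enrich, predict)
```

Calling `pipeline({"amount": "1500"})` returns the original event plus the typed amount, the derived feature, and the prediction.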

## A CI/CD pipeline for an ML application will likely implement the following steps:

- Data preparation

- Model training using hyperparameters and grid search

- Model evaluation

- Application pipeline deployment (with the best model)

### You probably want to avoid constantly staring at dashboards for model or data performance problems. Instead, you can define triggering policies and actions. For example, when a certain threshold is reached, a notification can alert the administrator or initiate an automated process for retraining a model or mitigating potential errors.
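A triggering policy of this kind can be sketched as a threshold plus an action. `Policy` and `evaluate_policies` are hypothetical names for this example. In practice the action would call a notification service or start a retraining pipeline rather than a local function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    metric: str                            # name of the monitored metric
    threshold: float                       # value that triggers the action
    action: Callable[[str, float], None]   # called with (metric, value)
    higher_is_worse: bool = True           # False for metrics such as accuracy


def evaluate_policies(metrics, policies):
    """Run the action of every policy whose threshold is breached; return their metric names."""
    fired = []
    for policy in policies:
        value = metrics.get(policy.metric)
        if value is None:
            continue  # the metric was not reported in this monitoring cycle
        if policy.higher_is_worse:
            breached = value > policy.threshold
        else:
            breached = value < policy.threshold
        if breached:
            policy.action(policy.metric, value)
            fired.append(policy.metric)
    return fired
```

For example, one policy can start retraining when a drift score rises above 0.2, while another alerts the administrator when accuracy falls below 0.9.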

### Source data is processed and stored as features for use in model training and model flows. In many cases, features are stored in two storage systems: one for batch access (training, batch prediction, and so on) and one for online retrieval (for real-time or online serving). As a result, there may be two separate data processing pipelines, one using batch processing and the other using real-time (stream) processing.
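The two storage systems can be pictured with an in-memory sketch: an append-only history for batch access and a latest-value table for online retrieval. `FeatureStore` here is a hypothetical toy class, not the API of any feature store product. For brevity, a single `ingest` call writes to both stores, whereas real systems often feed them from separate batch and streaming pipelines.

```python
class FeatureStore:
    """Toy feature store with an offline (history) and an online (latest) view."""

    def __init__(self):
        self.offline = []   # every feature row ever written, for training and batch jobs
        self.online = {}    # latest feature row per entity, for real-time serving

    def ingest(self, entity_id, timestamp, features):
        """Write one feature row to both stores."""
        self.offline.append({"entity_id": entity_id, "timestamp": timestamp, **features})
        current = self.online.get(entity_id)
        # Keep only the most recent row per entity in the online store.
        if current is None or timestamp >= current["timestamp"]:
            self.online[entity_id] = {"timestamp": timestamp, **features}

    def get_online(self, entity_id):
        """Low-latency lookup of the latest features for one entity."""
        return self.online.get(entity_id)

    def get_offline(self, as_of=None):
        """Historical rows, optionally limited to those known at time `as_of`."""
        if as_of is None:
            return list(self.offline)
        return [row for row in self.offline if row["timestamp"] <= as_of]
```

The `as_of` filter matters for training: it lets a job use only the feature values that were available at a given point in time.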

### The data sources and processing logic will likely change over time, resulting in changes to the processed features and the model produced from that data. Therefore, versioning the data and processing logic and tracking data lineage are critical elements in any MLOps solution.

## Data versioning, lineage, and metadata management are a set of essential MLOps practices that address the following:

### Data quality
- Tracing data through an organization’s systems and collecting metadata and lineage information can help identify errors and inconsistencies. This makes it possible to take corrective action and improve data quality.

### Model reproducibility and traceability
- Access to historical data versions allows us to reproduce model results and can be used for model debugging, troubleshooting, and trying out different parameter sets.

### Data governance and auditability
- By understanding the origin and history of data, organizations can ensure that data follows expected policies and regulations, track sources of errors, and perform audits or investigations.

### Compliance
- Data lineage can help organizations demonstrate compliance with regulations such as GDPR and HIPAA.

### Simpler data management
- Metadata and lineage information enable better data discovery, mappings, profiling, integration, and migrations.

### Better collaboration
- Data versioning and lineage can facilitate cooperation between data scientists and ML engineers by providing a clear and consistent view of the data used in ML models and when handling upgrades.

### Dependency tracking
- Understanding how each data, parameter, or code change contributes to the results, and providing insight into which data or model objects need to change when a data source is modified.
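Versioning, lineage, and dependency tracking can be illustrated together with a small content-addressed registry. `LineageStore` is a hypothetical class written for this example. It derives each version ID from the artifact's content, parameters, and parent versions, so any upstream change produces a new version.

```python
import hashlib
import json


class LineageStore:
    """Toy registry that versions artifacts by content and records their parents."""

    def __init__(self):
        self.artifacts = {}  # version id -> {"name", "parents", "params"}

    def register(self, name, content, parents=(), params=None):
        """Store an artifact and return its version ID (a short content hash)."""
        payload = json.dumps(
            {"name": name, "content": content,
             "parents": sorted(parents), "params": params or {}},
            sort_keys=True,
        )
        version = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
        self.artifacts[version] = {
            "name": name, "parents": list(parents), "params": params or {}}
        return version

    def upstream(self, version):
        """Lineage: every artifact version this one was derived from."""
        seen = []
        stack = list(self.artifacts[version]["parents"])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.append(current)
                stack.extend(self.artifacts[current]["parents"])
        return seen

    def downstream(self, version):
        """Dependency tracking: every artifact affected if this version changes."""
        affected = []
        frontier = [version]
        while frontier:
            current = frontier.pop()
            for vid, record in self.artifacts.items():
                if current in record["parents"] and vid not in affected:
                    affected.append(vid)
                    frontier.append(vid)
        return affected
```

Registering a raw dataset, a feature set derived from it, and a model trained on the features yields a chain in which `upstream(model)` answers "which data produced this model?" and `downstream(raw)` answers "what must be rebuilt if the source changes?"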

## Pachyderm
### Pachyderm is a data pipeline and versioning tool built on a containerized infrastructure. It provides a versioned file system and allows users to construct multistage pipelines, where each stage runs on a container, accepts input data (as files), and generates output data files.

## Here are some examples of analytic transformations that can be performed on structured data:

### Filtering
- Selecting a subset of the data based on certain criteria, such as a specific date range or specific values in a column.

### Sorting
- Ordering the data based on one or more columns, such as sorting by date or by a specific value.

### Grouping
- Organizing the data into groups based on one or more columns, such as grouping by product category or by city.

### Aggregation
- Calculating summary statistics, such as count, sum, average, maximum, and standard deviation, for one or more columns.

### Joining
- Combining data from multiple tables or datasets based on common columns, such as joining a table of sales data with a table of customer data.

### Mapping
- Mapping values from one or more columns to new column values using user-defined operations or code. Stateful mapping can calculate new values based on original values and accumulated states from older entries (for example, time passed from the last login).

### Time series analysis
- Analyzing or aggregating data over time, such as identifying trends, patterns, or anomalies.
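Several of these transformations are shown below on a tiny in-memory dataset, using only the Python standard library. The `sales` and `customers` records are made-up sample data. On real workloads, the same operations would usually be expressed with a dataframe library or SQL.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample data for illustration.
sales = [
    {"customer_id": 1, "day": 1, "amount": 20.0},
    {"customer_id": 2, "day": 2, "amount": 35.0},
    {"customer_id": 1, "day": 4, "amount": 15.0},
    {"customer_id": 3, "day": 5, "amount": 50.0},
]
customers = {1: "Paris", 2: "Berlin", 3: "Paris"}

# Filtering: keep only sales from day 2 onward.
recent = [s for s in sales if s["day"] >= 2]

# Sorting: order sales by amount, largest first.
by_amount = sorted(sales, key=lambda s: s["amount"], reverse=True)

# Joining: attach the customer's city to each sale via the common customer_id.
joined = [{**s, "city": customers[s["customer_id"]]} for s in sales]

# Grouping and aggregation: summary statistics of the amount per city.
groups = defaultdict(list)
for row in joined:
    groups[row["city"]].append(row["amount"])
summary = {
    city: {"count": len(amounts), "sum": sum(amounts), "avg": mean(amounts)}
    for city, amounts in groups.items()
}

# Stateful mapping: days since each customer's previous purchase,
# computed from state accumulated over earlier entries.
last_seen = {}
with_gap = []
for s in sorted(sales, key=lambda s: s["day"]):
    previous_day = last_seen.get(s["customer_id"])
    gap = None if previous_day is None else s["day"] - previous_day
    with_gap.append({**s, "days_since_last": gap})
    last_seen[s["customer_id"]] = s["day"]
```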

## The following techniques can be used to process unstructured data or convert it to structured data:

### Text mining
- Using NLP techniques to extract meaning and insights from text data. Text mining can extract information such as sentiment, entities, and topics from text data.

### Computer vision
- Using image and video processing techniques to extract information from visual data. Computer vision tasks include object recognition, facial recognition, and image classification.

### Audio and speech recognition
- Using speech-to-text and audio processing techniques to extract meaning and insights from audio data. Audio and speech recognition can extract information such as transcribed text, sentiment, and speaker identity.

### Data extraction
- Using techniques such as web scraping to pull structured data out of unstructured data sources.
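As a small illustration of data extraction, the sketch below pulls structured records out of free text with a regular expression. The message format, the pattern, and the field names (`order_id`, `amount`) are assumptions invented for this example. Real sources generally need more robust parsing.

```python
import re

# Hypothetical pattern: an order number followed, later in the same line,
# by a dollar amount.
ORDER_PATTERN = re.compile(
    r"order\s+#(?P<order_id>\d+).*?\$(?P<amount>\d+(?:\.\d{2})?)",
    re.IGNORECASE,
)


def extract_orders(text):
    """Turn free text into a list of structured {order_id, amount} records."""
    return [
        {"order_id": int(match["order_id"]), "amount": float(match["amount"])}
        for match in ORDER_PATTERN.finditer(text)
    ]
```

The resulting records can then be processed with the same transformations used for structured data.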