# MLOps

MLOps stands for Machine Learning Operations.

MLOps is focused on streamlining the process of deploying machine learning models to production, and then maintaining and monitoring them. 
MLOps is a collaborative function, often consisting of data scientists, ML engineers, and DevOps engineers. 
The word MLOps is a compound of two fields: machine learning and DevOps, the latter borrowed from software engineering.


An optimal MLOps experience is one where Machine Learning assets are treated consistently with all other software assets within a CI/CD environment. Machine Learning models can be deployed alongside the services that wrap them and the services that consume them as part of a unified release process.

While MLOps started as a set of best practices, it is slowly evolving into an independent approach to ML lifecycle management. MLOps applies to the entire lifecycle - from integrating with model generation (software development lifecycle and continuous integration/continuous delivery), orchestration, and deployment, to health, diagnostics, governance, and business metrics.

## Why MLOps?

There are many goals enterprises want to achieve through MLOps. Some of the common ones are:

Automation

Scalability

Reproducibility

Monitoring

Governance


## MLOps vs DevOps

DevOps is an iterative approach to shipping software applications into production. MLOps borrows the same principles to take machine learning models to production. In both DevOps and MLOps, the eventual objective is higher quality and better control of software applications and ML models.

## MLOps stages and their Output

Development & Experimentation (ML algorithms, new ML models) >>>>> Source code for pipelines: Data extraction, validation, preparation, model training, model evaluation, model testing

Pipeline Continuous Integration (Build source code and run tests) >>>>> Pipeline components to be deployed: packages and executables.

Pipeline Continuous Delivery (Deploy pipelines to the target environment) >>>>> Deployed pipeline with new implementation of the model.

Automated Triggering (Pipeline is automatically executed in production, using a schedule or a trigger) >>>>> Trained model that is stored in the model registry.

Model Continuous Delivery (Model serving for prediction) >>>>> Deployed model prediction service (e.g. model exposed as REST API)

Monitoring (Collecting data about the model performance on live data) >>>>> Trigger to execute the pipeline or to start a new experiment cycle.

## MLOps setup components

Source Control >>>>> Versioning the Code, Data, and ML Model artifacts.

Test & Build Services >>>>> Using CI tools for (1) Quality assurance for all ML artifacts, and (2) Building packages and executables for pipelines.

Deployment Services >>>>> Using CD tools for deploying pipelines to the target environment.

Model Registry >>>>> A registry for storing already trained ML models.

Feature Store >>>>> Preprocessing input data as features to be consumed in the model training pipeline and during the model serving.

ML Metadata Store >>>>> Tracking metadata of model training, for example model name, parameters, training data, test data, and metric results.

ML Pipeline Orchestrator >>>>> Automating the steps of the ML experiments.
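
As a rough illustration of what the model registry and the ML metadata store hold, the sketch below keeps trained-model records in memory. The names `ModelRecord` and `InMemoryRegistry`, the field names, and the example values are hypothetical and not taken from any particular tool; a real setup would use a persistent store.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelRecord:
    """Metadata tracked for one training run (hypothetical field names)."""
    name: str
    version: int
    parameters: dict
    training_data: str   # identifier of the training data set
    test_data: str       # identifier of the test data set
    metrics: dict


@dataclass
class InMemoryRegistry:
    """Toy stand-in for a model registry; keeps every version of every model."""
    _records: dict = field(default_factory=dict)

    def register(self, name, parameters, training_data, test_data, metrics):
        versions = self._records.setdefault(name, [])
        record = ModelRecord(name, len(versions) + 1, parameters,
                             training_data, test_data, metrics)
        versions.append(record)
        return record

    def latest(self, name):
        return self._records[name][-1]

    def get(self, name, version):
        return self._records[name][version - 1]


# Made-up example values, for illustration only.
registry = InMemoryRegistry()
registry.register("churn", {"max_depth": 3}, "train-2024-01", "test-2024-01", {"auc": 0.81})
registry.register("churn", {"max_depth": 5}, "train-2024-02", "test-2024-02", {"auc": 0.84})
```

Keeping every version, rather than overwriting the latest one, is what makes rollback and audits possible later.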

## Continuous

MLOps is an ML engineering culture that includes the following practices:

Continuous Integration (CI) extends the testing and validation of code and components by adding the testing and validation of data and models.

Continuous Delivery (CD) concerns the delivery of an ML training pipeline that automatically deploys another service: the ML model prediction service.

Continuous Training (CT) is a property unique to ML systems: it automatically retrains ML models for re-deployment.

Continuous Monitoring (CM) concerns the monitoring of production data and model performance metrics, which are bound to business metrics.

## Versioning

The goal of versioning is to treat ML training scripts, ML models, and data sets for model training as first-class citizens in DevOps processes by tracking them with version control systems (VCS). Every ML model specification should be versioned in a VCS to make the training of ML models auditable and reproducible.
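
One way to picture versioning of a model specification or a data set is as a content fingerprint: the same content always maps to the same identifier, and any change produces a new one. The sketch below is only an illustration of that idea using a SHA-256 hash; the function name `content_version` and the sample data are assumptions, and an actual version control system would track far more (history, authorship, diffs).

```python
import hashlib
import json


def content_version(obj) -> str:
    """Return a short, deterministic fingerprint of a JSON-serializable object."""
    # Sorting keys makes the fingerprint independent of dict insertion order.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical model specification and data set.
model_spec = {"algorithm": "logistic_regression", "C": 1.0, "features": ["age", "income"]}
dataset = [{"age": 31, "income": 52000, "label": 0},
           {"age": 45, "income": 87000, "label": 1}]

spec_version = content_version(model_spec)
data_version = content_version(dataset)
```

Storing `spec_version` and `data_version` next to each trained model lets an audit tie a model back to the exact specification and data it was trained on.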

## Reasons for changes in ML models or data

ML models can be retrained based upon new training data.

Models may be retrained based upon new training approaches.

Models may be self-learning.

Models may degrade over time.

Models may be deployed in new applications.

Models may be subject to attack and require revision.

Models can be quickly rolled back to a previous serving version.

Corporate or government compliance may require an audit or investigation of both the ML model and the data, hence we need access to all versions of the productionized ML model.

Data may reside across multiple systems.

Data may only be able to reside in restricted jurisdictions.

Data storage may not be immutable.

Data ownership may be a factor.



## Experiments Tracking

One way to track multiple experiments is to use different Git branches, each dedicated to a separate experiment. The output of each branch is a trained model.

Depending on the selected metric, the trained ML models are compared with each other and the appropriate model is selected. 
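
The comparison step can be sketched as follows. The branch names and metric values are invented for illustration, and `select_best` is a hypothetical helper; the point is only that the chosen metric (and whether higher or lower is better) determines which experiment wins.

```python
def select_best(experiments, metric, higher_is_better=True):
    """Return the experiment whose value for `metric` is best."""
    chooser = max if higher_is_better else min
    return chooser(experiments, key=lambda e: e["metrics"][metric])


# One entry per experiment branch (made-up numbers).
experiments = [
    {"branch": "exp/baseline", "metrics": {"accuracy": 0.86, "log_loss": 0.41}},
    {"branch": "exp/more-features", "metrics": {"accuracy": 0.89, "log_loss": 0.35}},
    {"branch": "exp/deeper-model", "metrics": {"accuracy": 0.88, "log_loss": 0.33}},
]

best_by_accuracy = select_best(experiments, "accuracy")
best_by_log_loss = select_best(experiments, "log_loss", higher_is_better=False)
```

Note that the two metrics pick different branches here, which is why the metric should be agreed on before the experiments are compared.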

## Testing

The complete development pipeline includes three essential components:

Data pipeline,

ML model pipeline, and

Application pipeline. 





In accordance with this separation we distinguish three scopes for testing in ML systems: 

tests for features and data, 

tests for model development, and 

tests for ML infrastructure.

## Features and Data Tests

## Data validation:
Automatic check for data and features schema/domain.

Action: In order to build a schema (domain values), calculate statistics from the training data. This schema can be used as expectation definition or semantic role for input data during training and serving stages.
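
A minimal sketch of this action, assuming rows are plain dictionaries: numeric columns get a min/max range computed from the training data, other columns get the set of observed values, and incoming rows are checked against those expectations. The helper names `build_schema` and `validate` and the sample columns are hypothetical.

```python
def build_schema(rows):
    """Derive per-column expectations from training rows.

    Numeric columns get a [min, max] range; other columns get a set of allowed values.
    """
    schema = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
            schema[column] = {"kind": "numeric", "min": min(values), "max": max(values)}
        else:
            schema[column] = {"kind": "categorical", "allowed": set(values)}
    return schema


def validate(row, schema):
    """Return a list of human-readable violations for one input row."""
    problems = []
    for column, rule in schema.items():
        if column not in row:
            problems.append(f"{column}: missing")
        elif rule["kind"] == "numeric":
            value = row[column]
            if not isinstance(value, (int, float)) or not rule["min"] <= value <= rule["max"]:
                problems.append(f"{column}: {value!r} outside [{rule['min']}, {rule['max']}]")
        elif row[column] not in rule["allowed"]:
            problems.append(f"{column}: unexpected value {row[column]!r}")
    return problems


training_rows = [{"age": 23, "country": "DE"}, {"age": 61, "country": "FR"}]
schema = build_schema(training_rows)
```

The same `schema` object can then be applied to both training batches and serving requests, so that both stages share one definition of valid input.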

## Feature importance test to understand whether new features add predictive power.
Action: Compute correlation coefficient on features columns.

Action: Train model with one or two features.

Action: Use subsets of features (“one of k left out”) and train a set of different models.
Measure data dependencies, inference latency, and RAM usage for each new feature. Compare it with the predictive power of the newly added features.

Drop out unused/deprecated features from your infrastructure and document it.
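
Two of the actions above, computing a correlation coefficient per feature column and enumerating "one of k left out" feature subsets, can be sketched in plain Python. Pearson correlation is used here as one possible choice of coefficient; the column names and values are made up, and the actual model training on each subset is left out.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def leave_one_out_subsets(features):
    """Yield (left_out_feature, remaining_features) pairs."""
    for i, name in enumerate(features):
        yield name, features[:i] + features[i + 1:]


# Hypothetical feature columns and target.
columns = {"rooms": [1, 2, 3, 4], "noise": [5, 5, 5, 5]}
target = [10, 20, 30, 40]
correlations = {name: pearson(values, target) for name, values in columns.items()}
```

A model would then be trained once per subset from `leave_one_out_subsets`, and the drop in quality when a feature is left out indicates how much that feature contributes.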


Features and data pipelines should be policy-compliant (e.g. GDPR). These requirements should be programmatically checked in both development and production environments.

Feature creation code should be tested by unit tests (to capture bugs in features).

## ***Tests for Reliable Model Development***

We need to provide specific testing support for detecting ML-specific errors.

Testing ML training should include routines that verify that algorithms make decisions aligned with the business objective. This means that ML algorithm loss metrics (MSE, log-loss, etc.) should correlate with business impact metrics (revenue, user engagement, etc.).

Action: The relationship between loss metrics and impact metrics can be measured in small-scale A/B testing using an intentionally degraded model.
Further reading: Selecting the Right Metric for evaluating Machine Learning Models.


## Model staleness test. 
The model is defined as stale if the trained model does not include up-to-date data and/or does not satisfy the business impact requirements. Stale models can affect the quality of prediction in intelligent software.

Action: Run an A/B experiment with older models. Include a range of model ages to produce an Age vs. Prediction Quality curve, which helps in understanding how often the ML model should be retrained.

## Assessing the cost of more sophisticated ML models.

Action: ML model performance should be compared to the simple baseline ML model (e.g. linear model vs neural network).

## Validating performance of a model.

It is recommended to separate the teams and procedures that collect the training and test data, to remove dependencies and avoid a flawed methodology propagating from the training set to the test set.

Action: Use an additional test set, which is disjoint from the training and validation sets. Use this test set only for a final evaluation.
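
One way to keep the test set disjoint from the training and validation sets is to assign each example to a split based on a hash of a stable identifier. This is a swapped-in technique, not something the checklist prescribes; the function name `assign_split`, the ID format, and the percentages are assumptions.

```python
import hashlib


def assign_split(example_id: str, validation_pct=10, test_pct=10) -> str:
    """Deterministically map an example ID to 'train', 'validation', or 'test'."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100   # stable bucket in the range 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + validation_pct:
        return "validation"
    return "train"


# Hypothetical example IDs.
ids = [f"user-{i}" for i in range(1000)]
splits = {"train": set(), "validation": set(), "test": set()}
for example_id in ids:
    splits[assign_split(example_id)].add(example_id)
```

Because the assignment depends only on the ID, an example never moves between splits when the data set grows or is re-extracted.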

## Fairness/Bias/Inclusion testing for the ML model performance.

Action: Collect more data that includes potentially under-represented categories.

Action: Examine input features if they correlate with protected user categories.

Further reading: “Tour of Data Sampling Methods for Imbalanced Classification”

## Conventional unit testing for any feature creation, ML model specification code (training) and testing.

## Model governance testing

## ML infrastructure test
Training the ML models should be reproducible, which means that training the ML model on the same data should produce identical ML models.>>>>>
Diff-testing of ML models relies on deterministic training, which is hard to achieve due to non-convexity of the ML algorithms, random seed generation, or distributed ML model training.
Action: determine the non-deterministic parts in the model training code base and try to minimize non-determinism.
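
As a small illustration of minimizing non-determinism, the sketch below routes all randomness (initialization and example shuffling) through a single seeded generator, so two runs with the same seed and data yield the same weight. The toy model `y = w * x`, the data, and the hyperparameters are assumptions; real frameworks have additional sources of non-determinism (for example parallel or GPU execution) that a seed alone does not remove.

```python
import random


def train(data, seed, epochs=20, lr=0.05):
    """Fit y = w * x with SGD; all randomness comes from a local, seeded generator."""
    rng = random.Random(seed)      # local RNG instead of the global random state
    w = rng.uniform(-1.0, 1.0)     # random initialization
    examples = list(data)
    for _ in range(epochs):
        rng.shuffle(examples)      # random example order
        for x, y in examples:
            w -= lr * 2 * (w * x - y) * x   # gradient of the squared error
    return w


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
first_run = train(data, seed=7)
second_run = train(data, seed=7)
```

Comparing `first_run` and `second_run` for exact equality is the simplest form of a reproducibility test.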


Test ML API usage. Stress testing.>>>>>>
Action: Write unit tests that randomly generate input data and train the model for a single optimization step (e.g. gradient descent).
Action: Crash tests for model training. The ML model should restore from a checkpoint after a mid-training crash.

Test the algorithmic correctness.>>>>>>
Action: Write a unit test that is not intended to complete the ML model training but to train for a few iterations and ensure that the loss decreases while training.
Avoid: Diff-testing with previously built ML models, because such tests are hard to maintain.
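
A test of algorithmic correctness along these lines might look like the following sketch: it generates random inputs, runs a handful of full-batch gradient descent steps on a toy linear model, and checks that the loss goes down at every step. The model, learning rate, and data-generating process are assumptions chosen to keep the example self-contained.

```python
import random


def mse(w, b, data):
    """Mean squared error of the model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def gradient_step(w, b, data, lr=0.1):
    """One full-batch gradient descent step for the model y = w * x + b."""
    n = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return w - lr * grad_w, b - lr * grad_b


def test_loss_decreases(iterations=5):
    rng = random.Random(0)
    # Randomly generated inputs with a known linear relationship plus noise.
    data = [(x, 3.0 * x + 1.0 + rng.gauss(0, 0.1))
            for x in (rng.uniform(-1, 1) for _ in range(50))]
    w, b = 0.0, 0.0
    losses = [mse(w, b, data)]
    for _ in range(iterations):
        w, b = gradient_step(w, b, data)
        losses.append(mse(w, b, data))
    # The loss should shrink on every one of the few iterations.
    assert all(later < earlier for earlier, later in zip(losses, losses[1:]))
    return losses
```

The test deliberately stops after a few iterations: it checks that training moves in the right direction, not that it converges.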

Integration testing: The full ML pipeline should be integration tested.>>>>>>>>>>>
Action: Create a fully automated test that regularly triggers the entire ML pipeline. The test should validate that the data and code successfully finish each stage of training and the resulting ML model performs as expected.
All integration tests should be run before the ML model reaches the production environment.

Validating the ML model before serving it.>>>>>>>>
Action: Setting a threshold and testing for slow degradation in model quality over many versions on a validation set.
Action: Setting a threshold and testing for sudden performance drops in a new version of the ML model.
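
Both threshold checks can be combined into a single release gate, sketched below under the assumption that a higher validation score is better. The function name `passes_release_gate`, the threshold values, and the score history are hypothetical.

```python
def passes_release_gate(history, candidate, max_sudden_drop=0.02, max_total_drift=0.05):
    """Decide whether a candidate model's validation score is acceptable.

    history   -- validation scores of previously released versions, oldest first
    candidate -- validation score of the new version (higher is better)
    """
    if not history:
        return True
    sudden_drop = history[-1] - candidate    # versus the previous version
    slow_drift = max(history) - candidate    # versus the best version so far
    return sudden_drop <= max_sudden_drop and slow_drift <= max_total_drift


# Made-up validation scores of four released versions.
history = [0.90, 0.89, 0.88, 0.87]
```

Comparing against the best score so far, not only the previous one, is what catches a slow decline in which no single version looks bad on its own.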

ML models are canaried before serving.>>>>>>>
Action: Testing that an ML model successfully loads into production serving and the prediction on real-life data is generated as expected.

Testing that the model in the training environment gives the same score as the model in the serving environment.>>>>>>>>>
Action: Measure the difference between the performance on the holdout data and the “next-day” data. Some difference will always exist. Pay attention to large differences in performance between holdout and “next-day” data because they may indicate that some time-sensitive features cause ML model degradation.
Action: Avoid result differences between training and serving environments. Applying a model to an example in the training data and the same example at serving should result in the same prediction. A difference here indicates an engineering error.

## Monitoring

Once the ML model has been deployed, it needs to be monitored to ensure that it performs as expected. The following checklist of model monitoring activities in production is adopted:

Monitor dependency changes throughout the complete pipeline; any change should result in a notification:
Data version change.
Changes in source system.
Dependencies upgrade.

Monitor data invariants in training and serving inputs:>>>>>> Alert if data does not match the schema, which has been specified in the training step.
Action: tuning of alerting threshold to ensure that alerts remain useful and not misleading.

Monitor whether training and serving features compute the same value.>>>>>>>
Since the generation of training and serving features might take place on physically separated locations, we must carefully test that these different code paths are logically identical.
Action: (1) Log a sample of the serving traffic. 
(2) Compute distribution statistics (min, max, avg, values, % of missing values, etc.) on the training features and the sampled serving features and ensure that they match.
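
The comparison in step (2) could be sketched as follows for a single numeric feature, with `None` standing for a missing value. The helper names, the relative tolerance of 10%, and the sample values are assumptions; a production check would likely use proper statistical tests.

```python
def summarize(values):
    """Min, max, mean, and share of missing (None) entries for one feature."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "avg": sum(present) / len(present),
        "missing_pct": 100.0 * (len(values) - len(present)) / len(values),
    }


def skewed_statistics(training, serving, tolerance=0.1):
    """Names of statistics whose relative difference exceeds the tolerance."""
    train_stats, serve_stats = summarize(training), summarize(serving)
    flagged = []
    for name, expected in train_stats.items():
        observed = serve_stats[name]
        scale = max(abs(expected), abs(observed), 1e-12)  # avoid division by zero
        if abs(expected - observed) / scale > tolerance:
            flagged.append(name)
    return flagged


training_feature = [10, 12, 11, 13, None, 12]
serving_sample_ok = [10, 13, 12, 11, None, 12]
serving_sample_bad = [100, 120, 110, 130, None, 120]   # e.g. a unit mismatch
```

A non-empty result from `skewed_statistics` suggests that the training and serving code paths no longer compute the same feature.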

Monitor the numerical stability of the ML model.>>>>>>>
Action: trigger alerts for the occurrence of any NaNs or infinities.
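
A minimal version of this check, assuming predictions arrive as a list of floats, is shown below. The `print` call is a placeholder for whatever alerting mechanism is in use.

```python
import math


def non_finite_indices(values):
    """Indices of predictions that are NaN or +/- infinity."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]


predictions = [0.12, float("nan"), 0.87, float("inf"), 0.5]
bad = non_finite_indices(predictions)
if bad:
    # Placeholder for a real alert (pager, chat message, metric counter, ...).
    print(f"ALERT: non-finite predictions at positions {bad}")
```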

Monitor the computational performance of an ML system. Both dramatic and slow-leak regressions in computational performance should trigger a notification.
Action: measure the performance of versions and components of code, data, and model by pre-setting the alerting threshold.
Action: collect system usage metrics like GPU memory allocation, network traffic, and disk usage. These metrics are useful for cloud costs estimations.

Monitor how stale the system in production is.
Measure the age of the model. Older ML models tend to decay in performance.
Action: Model monitoring is a continuous process, therefore it is important to identify the elements for monitoring and create a strategy for the model monitoring before reaching production.

Monitor the processes of feature generation as they have impact on the model.
Action: re-run feature generation on a frequent basis.


Monitor degradation of the predictive quality of the ML model on served data. Both dramatic and slow-leak regressions in prediction quality should trigger a notification.
Degradation might happen due to changes in data, differing code paths, etc.
Action: Measure statistical bias in predictions (avg in predictions in a slice of data). Models should have nearly zero bias.
Action: If a label is available immediately after the prediction is made, we can measure the quality of prediction in real-time and identify problems.
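
The bias measurement can be sketched per data slice. Here bias is interpreted as the mean prediction minus the mean observed label within a slice, which is one reading of the action above and should be treated as an assumption; the slice names and records are made up.

```python
from collections import defaultdict


def bias_by_slice(records):
    """Mean prediction minus mean label for each slice of the data."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])   # [sum of predictions, sum of labels, count]
    for r in records:
        acc = sums[r["slice"]]
        acc[0] += r["prediction"]
        acc[1] += r["label"]
        acc[2] += 1
    return {name: (pred - label) / n for name, (pred, label, n) in sums.items()}


records = [
    {"slice": "mobile", "prediction": 0.25, "label": 0},
    {"slice": "mobile", "prediction": 0.75, "label": 1},
    {"slice": "desktop", "prediction": 0.75, "label": 0},
    {"slice": "desktop", "prediction": 0.75, "label": 1},
]
biases = bias_by_slice(records)
```

In this made-up sample the "mobile" slice is unbiased while the "desktop" slice over-predicts, which an overall average would partly hide.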




## “ML Test Score” System

The “ML Test Score” measures the overall readiness of the ML system for production. The final ML Test Score is computed as follows:

For each test, half a point is awarded for executing the test manually, with the results documented and distributed.
A full point is awarded if there is a system in place to run that test automatically on a repeated basis.
Sum the score of each of the four sections individually: Data Tests, Model Tests, ML Infrastructure Tests, and Monitoring.
The final ML Test Score is computed by taking the minimum of the scores aggregated for each of the sections: Data Tests, Model Tests, ML Infrastructure Tests, and Monitoring.

After computing the ML Test Score, we can reason about the readiness of the ML system for production.
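
The scoring rule described above translates directly into code. In the sketch below, the status labels (`"automated"`, `"manual"`, `"none"`) and the example test statuses are hypothetical; the arithmetic (half a point, full point, sum per section, minimum across sections) follows the description.

```python
def ml_test_score(sections):
    """Compute the ML Test Score.

    sections maps a section name to a list of per-test statuses:
    "automated" (1 point), "manual" (0.5 points), or "none" (0 points).
    Returns the final score and the per-section scores.
    """
    points = {"automated": 1.0, "manual": 0.5, "none": 0.0}
    section_scores = {name: sum(points[s] for s in statuses)
                      for name, statuses in sections.items()}
    return min(section_scores.values()), section_scores


# Made-up status of three tests per section.
sections = {
    "Data Tests": ["automated", "automated", "manual"],
    "Model Tests": ["manual", "manual", "none"],
    "ML Infrastructure Tests": ["automated", "manual", "automated"],
    "Monitoring": ["automated", "none", "manual"],
}
final_score, per_section = ml_test_score(sections)
```

Taking the minimum means the weakest section determines the final score, so strong data tests cannot compensate for missing model tests.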

## Reproducibility
Reproducibility in a machine learning workflow means that every phase (data processing, ML model training, and ML model deployment) should produce identical results given the same input.

Below are the different phases, the challenges in each, and how to achieve reproducibility.

(A) Phase: Collecting Data

Challenges: Generation of the training data can't be reproduced (e.g. due to constant database changes or random data loading).

Method to Ensure Reproducibility: 
1) Always back up your data.
2) Save a snapshot of the data set (e.g. on cloud storage).
3) Data sources should be designed with timestamps so that a view of the data at any point can be retrieved.
4) Data versioning.
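
The timestamp-based approach in point 3 can be illustrated with an append-only list of records, from which the state of the data at any point in time is rebuilt. The field names `id`, `value`, and `written_at` and the helper `view_as_of` are hypothetical.

```python
from datetime import datetime


def view_as_of(rows, as_of):
    """Latest version of each record written at or before `as_of`."""
    snapshot = {}
    for row in sorted(rows, key=lambda r: r["written_at"]):
        if row["written_at"] <= as_of:
            snapshot[row["id"]] = row["value"]   # later writes overwrite earlier ones
    return snapshot


# Append-only history: record "a" is updated on 5 January.
rows = [
    {"id": "a", "value": 1, "written_at": datetime(2024, 1, 1)},
    {"id": "b", "value": 2, "written_at": datetime(2024, 1, 2)},
    {"id": "a", "value": 3, "written_at": datetime(2024, 1, 5)},
]
training_view = view_as_of(rows, datetime(2024, 1, 3))
current_view = view_as_of(rows, datetime(2024, 1, 6))
```

Recording the `as_of` timestamp used for a training run is then enough to regenerate exactly the same training data later.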

(B) Phase: Feature Engineering

Challenges: Feature engineering can be non-reproducible in scenarios such as:
1) Missing values are imputed with random or mean values.
2) Labels are removed based on the percentage of observations.
3) Feature extraction methods are non-deterministic.

Method to Ensure Reproducibility: 
1) Feature generation code should be taken under version control.
2) Require reproducibility of the previous step "Collecting Data"

(C) Phase: Model Training / Model Build

Challenges: Non-determinism.

Method to Ensure Reproducibility: 
1) Ensure the order of features is always the same.
2) Document and automate feature transformation, such as normalization.
3) Document and automate hyperparameter selection.
4) For ensemble learning: document and automate the combination of ML models.
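
Point 1 can be enforced in code by building every feature vector from one explicit, ordered list of feature names instead of relying on the order in which keys happen to arrive. The feature names and the helper `to_vector` below are hypothetical.

```python
FEATURE_ORDER = ("age", "income", "tenure_months")  # hypothetical feature names


def to_vector(example, feature_order=FEATURE_ORDER):
    """Build a feature vector in a fixed column order, regardless of dict ordering."""
    missing = [name for name in feature_order if name not in example]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [example[name] for name in feature_order]


vector_a = to_vector({"age": 3, "income": 5, "tenure_months": 1})
vector_b = to_vector({"tenure_months": 1, "income": 5, "age": 3})
```

Using the same `FEATURE_ORDER` constant in training and serving code keeps the two from silently disagreeing about column positions.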

(D) Phase: Model Deployment

Challenges:
1) Training the ML model has been performed with a software version that is different from the production environment.
2) The input data required by the ML model is missing in the production environment.

Method to Ensure Reproducibility: 
1) Software versions and dependencies should match the production environment.
2) Use a container (Docker) and document its specification, such as image version.
3) Ideally, the same programming language is used for training and deployment.

## Loosely Coupled Architecture (Modularity)
This key architectural property enables teams to easily test and deploy individual components or services even as the organization and the number of systems it operates grow—that is, it allows organizations to increase their productivity as they scale.

According to Gene Kim et al., in their book “Accelerate”, “high performance [in software delivery] is possible with all kinds of systems, provided that systems—and the teams that build and maintain them — are loosely coupled. ”

Using a loosely coupled architecture affects the extent to which a team can test and deploy their applications on demand, without requiring orchestration with other services. 

Having a loosely coupled architecture allows your teams to work independently, without relying on other teams for support and services, which in turn enables them to work quickly and deliver value to the organization.

In ML-based software systems, it can be more difficult to achieve loose coupling between machine learning components than between traditional software components. ML systems have weak component boundaries in several ways. For example, the outputs of one ML model can be used as the inputs to another ML model, and such interleaved dependencies might affect one another during training and testing.

Basic modularity can be achieved by structuring the machine learning project.

## ML-based Software Delivery Metrics

Four key metrics capture the effectiveness of software development and delivery in elite/high-performing organisations:

(A) Deployment Frequency,

ML Model Deployment Frequency depends on
1) Model retraining requirements (ranging from less frequent to online training). Two aspects are crucial for model retraining
1.1) Model decay metric.
1.2) New data availability.
2) The level of automation of the deployment process, which might range between *manual deployment* and *fully automated CI/CD pipeline*.

(B) Lead Time for Changes,

ML Model Lead Time for Changes depends on
1) Duration of the explorative phase in Data Science in order to finalize the ML model for deployment/serving.
2) Duration of the ML model training.
3) The number and duration of manual steps during the deployment process.

(C) Mean Time To Restore (MTTR)

ML Model MTTR depends on the number and duration of manually performed model debugging and model deployment steps. When the ML model must be retrained, MTTR also depends on the duration of the ML model training. Alternatively, MTTR refers to the duration of the rollback of the ML model to the previous version.

(D) Change Fail Percentage.

ML Model Change Failure Rate can be expressed as the difference between the performance metrics of the currently deployed ML model and those of the previous model, such as precision, recall, F1, accuracy, AUC, ROC, false positives, etc. ML Model Change Failure Rate is also related to A/B testing.