# MLOps - Machine Learning OPerations

Some terms:

| Term | Definition |
| - | - |
| 

# ML project lifecycle

<img src="Media/ml_project_lifecycle.png" width=700>

Before moving from "Modelling" to "Deployment", it is good practice to run a final performance audit.

## Scoping

1. Define project
   1. Decide on key performance metrics, e.g. accuracy, latency, throughput
2. Define desired input and output
3. Estimate resources needed



## Data

1. Define data
   1. Is the data labeled consistently?
   2. Data normalization?
2. Establish baseline
3. Collect data
4. Label and organise data

Data augmentation:
- Most useful for the kinds of data points on which the model performs poorly but a human performs well
- Augmented examples must still be recognisable by a human
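The two points above can be sketched with a tiny example. This is a minimal illustration, assuming grayscale images stored as lists of pixel rows with values in 0..255; the function names and the `max_delta` limit are hypothetical, and a horizontal flip is only appropriate when it does not change the label (it would not be for text, for example).

```python
import random

def horizontal_flip(image):
    """Mirror a grayscale image (list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]

def adjust_brightness(image, delta):
    """Shift every pixel by delta, clipping to the valid 0..255 range."""
    return [[max(0, min(255, px + delta)) for px in row] for row in image]

def augment(image, rng, max_delta=40):
    """Return one randomly augmented copy; the label stays unchanged.

    max_delta is kept small so the result is still recognisable by a human.
    """
    if rng.random() < 0.5:
        out = horizontal_flip(image)
    else:
        out = [row[:] for row in image]  # plain copy, no flip
    return adjust_brightness(out, rng.randint(-max_delta, max_delta))

# Hypothetical 2x3 image standing in for a hard example
image = [[0, 50, 100],
         [150, 200, 250]]
rng = random.Random(0)
augmented = [augment(image, rng) for _ in range(4)]
```

In practice the augmented copies would be added only for the categories where error analysis shows a gap between the model and a human.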

**Data iteration**

<img src="Media/data_iteration_loop.png" width=400>

Unstructured data:
- Add data
- Data augmentation

Structured data:
- Add features


## Modelling

---

**Select and train model**

Model development is a highly-iterative process - **model iteration**.

The first step is to <u>establish a baseline level of performance</u>, e.g. desired accuracy, which can be established from:
- Human Level Performance (HLP): usually the more effective baseline for unstructured data such as images, text, and audio. For unstructured data problems, HLP gives an estimate of the irreducible error / Bayes error and of what performance is reasonable to achieve.
- Literature search for state-of-the-art / open source
- Performance of older ML system (previous version of your ML model)

<u>Data-centric vs model-centric AI development</u>
- **Data-centric**: keep the algorithm / code fixed and iteratively improve the data
- **Model-centric**: keep the data fixed and iteratively work to improve / optimise algorithm / model
- *Most academic research tends to be model-centric, with fixed data as a benchmark.*
- *A reasonable algorithm with good data will often outperform a great algorithm with not-so-good data.*

Milestones in model development:
1. Doing well on the training set - FIRST MILESTONE
2. Doing well on the dev/test set
   1. Doing well on the test set on average is not enough
   2. For example, the model can perform well on average on the test set but worse on disproportionately important data points, which wouldn't be acceptable
   3. The ML model can be biased and discriminate by gender, ethnicity, etc.
   4. Rare classes / skewed data distribution: check accuracy on rare classes
3. Doing well on business metrics / project goals

Before training on a large dataset, overfit a small portion of it first, just to check that the pipeline works and to find bugs.
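A minimal sketch of this sanity check, assuming a toy one-feature logistic regression written from scratch (the data, function names, and hyperparameters are all hypothetical). The idea is only that a working training loop should reach perfect training accuracy on a handful of easy examples; if it cannot, there is probably a bug.

```python
import math

def train_logistic(xs, ys, epochs=2000, lr=0.5):
    """Fit a 1-feature logistic regression with plain gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted probability
            grad_w += (p - y) * x
            grad_b += (p - y)
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return w, b

def accuracy(w, b, xs, ys):
    """Fraction of examples classified correctly at a 0.5 threshold."""
    preds = [1 if w * x + b > 0 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

# Tiny, clearly separable subset standing in for "a small portion of the data"
small_x = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
small_y = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(small_x, small_y)
train_acc = accuracy(w, b, small_x, small_y)
assert train_acc == 1.0, "cannot overfit a tiny subset: look for bugs"
```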

---

**Perform error analysis**

Error analysis is also an iterative process.

Prioritizing what to work on:
- Check how much room for improvement there is compared to the baseline (e.g. HLP)
- how frequently a category appears
- how easy it is to improve accuracy in a category
- how important it is to improve in a category
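The first two criteria can be combined into a rough number: the gap to the baseline in a category multiplied by the fraction of data in that category. The sketch below uses made-up categories and accuracies purely for illustration; how easy or how important an improvement is remains a judgment call and is not modelled here.

```python
# (category, model accuracy, baseline accuracy e.g. HLP, fraction of data)
# All numbers are hypothetical.
categories = [
    ("clean speech",  0.94, 0.95, 0.60),
    ("car noise",     0.89, 0.93, 0.04),
    ("people noise",  0.85, 0.89, 0.30),
    ("low bandwidth", 0.70, 0.70, 0.06),
]

def potential_gain(model_acc, baseline_acc, fraction):
    """Upper bound on overall accuracy gained by closing the gap to the baseline."""
    return max(0.0, baseline_acc - model_acc) * fraction

ranked = sorted(categories, key=lambda c: potential_gain(*c[1:]), reverse=True)

for name, model_acc, baseline_acc, fraction in ranked:
    gain = potential_gain(model_acc, baseline_acc, fraction)
    print(f"{name:14s} potential overall gain: {gain:.2%}")
```

With these numbers, "car noise" has the largest gap to the baseline but ranks low because it is rare, while "low bandwidth" has no room for improvement at all relative to the baseline.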

Improving performance on specific categories:
- collect more data for that category
- data augmentation
- improve label accuracy / data quality

Skewed datasets
- if the dataset is highly skewed, use precision and recall instead of accuracy

You can also check precision, recall, and F1 score for each group / class.
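A minimal sketch of per-class metrics in plain Python, computed one-vs-rest. The labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not the only option.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)}, computed one-vs-rest."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics

# Hypothetical skewed data: 9 negatives, 1 positive; the model always predicts 0
y_true = [0] * 9 + [1]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)
```

Here accuracy is 0.9 even though recall on the rare class is 0, which is why accuracy alone is misleading on skewed data.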



## Deployment

1) Deploy in production
2) Monitor & maintain the system
3) Monitor the data; if it changes, the model may need to be retrained
4) Deployment patterns:
   1) Shadow mode deployment: the model shadows the humans and runs in parallel; the ML system's output is not used for any decisions during this phase. The purpose is to monitor how the system performs compared to human performance. Example: you’ve built a new system for making loan approval decisions. For now, its output is not used in any decision-making process, and a human loan officer is solely responsible for deciding which loans to approve, but the system’s output is logged for analysis.
   2) Canary deployment: roll out to a small fraction (e.g. 5%) of traffic initially, then monitor the system and gradually ramp up traffic. This lets you spot problems with your ML system early on.
   3) Blue-green deployment: switch the router from sending data to the old version of the model to sending it to the new one. This makes it easy to roll back to the older model.
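A canary rollout needs some way to decide which requests go to the new model. One possible sketch, assuming requests carry a string user ID: hash the ID into a stable bucket so that each user consistently sees the same model, and raise `canary_fraction` over time. The function and model names are hypothetical.

```python
import hashlib

def route(user_id, canary_fraction):
    """Deterministically send a stable fraction of users to the new model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # roughly uniform value in [0, 1)
    return "new_model" if bucket < canary_fraction else "old_model"

# Start at 5% of users, then ramp up by raising the fraction
users = [f"user-{i}" for i in range(10_000)]
share = sum(route(u, 0.05) == "new_model" for u in users) / len(users)
print(f"share of users on the new model: {share:.1%}")
```

Because the bucket depends only on the user ID, a user who is in the canary group at 5% stays in it when the fraction is raised, and setting the fraction back to 0 routes everyone to the old model again.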

<u>Model can be run at:</u>
- Cloud deployment
- Edge deployment
  - Can function even when network connection is down
  - Less network bandwidth needed
  - Lower latency
  - Downside: less computational power is available

<u>Problems</u>
- **Concept drift**: the x -> y mapping changes post-deployment, i.e. what we want to predict changes after deployment. It occurs when the patterns the model learned no longer hold because the very meaning of what we are trying to predict evolves.
- **Data drift** (feature drift, population shift, covariate shift): the input data distribution changes post-deployment, so the trained model is less relevant for the new data. It can also happen because instances of a class on which the model didn't perform well become more frequent.
- **Model decay / drift / staleness**: degradation of model performance over time, as measured by some model quality metric (accuracy, mean error rate) or a downstream business KPI (e.g. click-through rate).
  - Reasons for model decay: data quality, data drift, concept drift
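Data drift on a single numeric feature can be checked by comparing its distribution in production against the training data. The sketch below uses the population stability index (PSI), which is one common choice and not something these notes prescribe; the binning scheme, the `1e-6` floor, and the synthetic data are all assumptions made for illustration.

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population stability index between two samples of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1  # clamp out-of-range values
        # small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]    # training-time feature
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # production, no drift
shifted = [rng.gauss(1.5, 1.0) for _ in range(5000)]  # production, drifted

print("PSI without drift:", round(psi(train, same), 4))
print("PSI with drift:   ", round(psi(train, shifted), 4))
```

A larger PSI means the production distribution has moved further from the training distribution. Note that this only detects data drift; concept drift changes the x -> y mapping and generally needs labelled production data to detect.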



<img src="Media/mlops.png" width=400>

<img src="Media/degree-of-automation.png" width=400>

Example of partial automation: 

You’re building a healthcare screening system, where you input a patient’s symptoms, and for the easy cases (such as an obvious case of the common cold) the system will give a recommendation directly, and for the harder cases it will pass the case on to a team of in-house doctors who will form their own diagnosis independently. What degree of automation are you implementing in this example for patient care?


---

**monitoring**

A monitoring dashboard can track:
- server load
- fraction of non-null outputs
- fraction of missing input values
- other things that could go wrong
- set thresholds for alarms
- metrics and thresholds may need to be adapted over time
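A threshold-based alarm can be as simple as comparing each tracked metric against an allowed range. This is a minimal sketch; the metric names mirror the list above, but the threshold values and the `check_alarms` helper are hypothetical.

```python
def check_alarms(metrics, thresholds):
    """Return the names of metrics that are missing or outside their (low, high) range."""
    alarms = []
    for name, (low, high) in thresholds.items():
        value = metrics.get(name)
        if value is None or not (low <= value <= high):
            alarms.append(name)
    return alarms

# Hypothetical thresholds; these would be tuned and revisited over time
thresholds = {
    "server_load": (0.0, 0.85),
    "fraction_non_null_outputs": (0.95, 1.0),
    "fraction_missing_inputs": (0.0, 0.02),
}

# Hypothetical snapshot of current metric values
metrics = {
    "server_load": 0.40,
    "fraction_non_null_outputs": 0.91,
    "fraction_missing_inputs": 0.01,
}

alarms = check_alarms(metrics, thresholds)
print(alarms)  # ['fraction_non_null_outputs']
```

A metric that is not reported at all is treated as an alarm here, since a missing value usually means something in the pipeline is broken.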

Examples of metrics to track:
- Software metrics: memory, compute, latency, throughput, server load
- Input metrics: average input length, average input volume, number of missing values, average image brightness
- Output metrics: number of times system returns null, number of times user redoes search, CTR

<img src="Media/mlops2.png" width=400>

Model maintenance:
- manual retraining
- automatic retraining



**pipeline monitoring**

Metrics to monitor:
- software metrics
- input metrics
- output metrics

How quickly does the data change?
- user data generally drifts slowly (exceptions: COVID-19, a new movie or trend)
- enterprise data (B2B applications) can shift fast (e.g. a new coating for mobile phones, or a change in the way the company operates)

