# MLOps - Machine Learning OPerations

Production ML combines two key disciplines: ML Development and Modern Software Development. 

Some terms:

| Term | Definition |
| - | - |
| Data Schema | a schema describes standard characteristics of your data such as column data types and expected data value range. |
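
As a rough illustration of the definition above, a schema can be sketched as a mapping from column names to an expected type and value range. The names `SCHEMA` and `check_row`, and the columns used, are hypothetical and not taken from any library.

```python
# Hypothetical hand-written schema: column -> (expected type, min, max).
SCHEMA = {
    "age": (int, 0, 120),
    "height_cm": (float, 30.0, 250.0),
}


def check_row(row, schema=SCHEMA):
    """Return a list of human-readable schema violations for one row."""
    problems = []
    for column, (expected_type, low, high) in schema.items():
        if column not in row:
            problems.append(f"{column}: missing")
            continue
        value = row[column]
        if not isinstance(value, expected_type):
            problems.append(f"{column}: expected {expected_type.__name__}")
        elif not low <= value <= high:
            problems.append(f"{column}: {value} outside [{low}, {high}]")
    return problems


print(check_row({"age": 26, "height_cm": 180.0}))    # no violations
print(check_row({"age": 150, "height_cm": "tall"}))  # one range and one type violation
```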



# ML project lifecycle

<img src="Media/ml_project_lifecycle.png" width=800>
<br>
<img src="Media/ml-deployment-lifecycle.png" width=800>

When going from "Modelling" to "Deployment", it is good practice to run a final performance audit.

## Scoping

1. Define project
   1. generate ideas on how to improve a business
   2. pick the idea that is the most valuable / will result in the most improvement
   3. What project should we work on?
2. Define desired input and output
   1. Decide on key performance metrics, e.g. accuracy, latency, throughput
3. Estimate resources needed
   1. What are the resources (data, time, people) needed?


Scoping process:
1. **PROBLEM: Brainstorm business problems (not AI problems)**
   1. think about what you want to achieve
   2. "I want to hear your business problems, what needs to be improved business-wise, and it is my job to come up with an AI solution"
   3. "What are the top 3 things you wish were working better?" - e.g, Increase conversion, Reduce inventory, Increase margin (profit per item)
2. **SOLUTION: Brainstorm AI solutions**
   1. now, think about how to achieve it
3. **DILIGENCE: Assess the feasibility and value of potential solutions**
   1. Diligence on feasibility: is this project technically feasible?
      1. Use an external benchmark: literature, other companies, competitors
   2. Diligence on value:
      1. Have technical and business teams try to agree on (performance / error) metrics that both are comfortable with
      2. Fermi estimates - roughly estimate how much an improvement in the ML engineering metrics would improve the business metrics
4. **Determine milestones**
   1. Key specifications:
      1. ML metrics: accuracy, precision, recall, etc.
      2. Software metrics: latency, throughput, etc. given compute resources
      3. Business metrics: revenue, etc.
      4. resources needed: data, personnel, help from other teams
      5. timeline
   2. if unsure, consider benchmarking to other projects, or building a PoC (proof-of-concept) first
5. **Budget for resources**


Assessing technical feasibility:

| | Unstructured data | Structured |
| - | - | - |
| New (you haven't worked on that type of project before) | HLP (human-level performance) | Are predictive features available? Do we have features (past data) that seem predictive of future events? |
| Existing | HLP; history of the project (from the model error recorded at regular intervals, you can estimate what the error will approach in the future) | Identify new predictive features; look at the history of the project |




## Data

1. Define data
   1. Is the data labeled consistently?
   2. Data normalization?
2. Establish baseline
3. Collect data
4. Label and organise data

For production, real-world dynamic data is used. After deployment, model performance needs to be continuously monitored, and new data needs to be ingested and used to retrain the model.

Garbage in, garbage out.

Key points:
- Translate user needs into data problems.
- Ensure data coverage and high predictive signal
- Source, store, and monitor quality data responsibly
- Data availability and collection
  - what kind of / how much data is available
  - how often does the new data come in
  - is it annotated?

**Data iteration Loop**

<img src="Media/data_iteration_loop.png" width=400>

**Define data and establish baseline**

Major types of data problems

| | Unstructured | Structured | *Characteristics* |
| - | - | - | - |
| **Small data** <br>($n \le 10000$) | Manufacturing visual inspection from 100 training examples | Housing price prediction based on square footage, etc. from 50 training examples. | *Clean labels and label consistency are critical; it is possible to manually look through the dataset and fix labels* |
| **Big data** <br>($n \gt 10000$)| Speech recognition from 50 million training examples | Online shopping recommendations for 1 million users. | *With too much data to inspect manually, the emphasis is on the data process; label consistency is still important. Big data can also have small data challenges, e.g. rare events / classes and model performance on them* |
| *Characteristics* | *Obtain more data by data augmentation; humans can label more (and more effectively & efficiently) unstructured data* | *Harder to obtain more data (e.g. by data augmentation); human labelling is also harder.* | |

Improving label consistency:
- **Have multiple labelers label the same example**
- When labelers disagree, have the MLE, a subject matter expert, and the labelers discuss the definition of y to reach agreement
- Consider changing the data points (inputs) that labelers think don't contain enough information to be labeled
- **standardise the labels**
- **merge classes**: e.g. "deep scratch" and "shallow scratch" -> "scratch"

---

Data augmentation:
- Augment data for the cases on which the algorithm performs poorly but a human does well
- Augmented examples must still be recognisable by a human
- Can be done with GANs

Unstructured data:
- Add data
- Data augmentation
  - can be done with GANs

Structured data:
- Add features
  - E.g. restaurant recommendation system, could add feature "is_vegetarian?", "restaurant_has_veg_option?"

---

**Label and organise data**

Try to get into the Data Iteration Loop as soon as possible. Don't spend too much time collecting data initially.

You can start with little data and increase it afterwards. Don't increase the data by more than 10x at a time, so that you can see whether adding data points leads to improvement.

Data Pipelines (Cascades):
- E.g. `Raw Data -> Data Cleaning (with scripts) -> ML`
- How replicable the data processing needs to be depends on the stage of the work:
  - Proof-of-Concept phase: data processing can be manual (with lots of comments), with the sole aim of making things work. The purpose of a PoC system is to check feasibility and help decide whether an application is workable and worth deploying.
  - Production phase, use sophisticated tools to ensure the replicability of the entire data pipeline.
  - What if some stage of data pipeline changes? Need to keep track of the following (by e.g. extensive documentation, use of metadata):
    - Data provenance: where it comes from
    - Data lineage: sequence of steps
  - Metadata can be useful for:
    - Error analysis - spotting unexpected effects
    - keeping track of data provenance





### Data Lifecycle

Managing the entire lifecycle of data:
- Labeling
- Feature space coverage
- Minimal dimensionality
- Maximum predictive data
- Fairness
- Rare conditions


### Data collection

Data pipeline: a series of data processing steps such as data collection, data ingestion, and data preparation.

Feature engineering: helps to maximise the predictive signal. 

Feature selection: helps to measure the predictive signals. 

Data sources:
- Build synthetic dataset
- Open source dataset
- Web scraping
- Build your own dataset
- Collect live data

---

**Responsible data**

Security and privacy:
- Protect personally identifiable information:
  - Aggregation: replace unique values with summary values
  - Redaction: remove some data to create less complete picture
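
The two techniques can be sketched on a single record as follows. This is one possible interpretation, assuming that dropping identifying fields counts as redaction and that replacing an exact age with an age band counts as aggregation. The function and field names are hypothetical.

```python
def redact(record, fields=("name", "email")):
    """Redaction: drop directly identifying fields from a record."""
    return {k: v for k, v in record.items() if k not in fields}


def aggregate_age(record, bucket=10):
    """Aggregation: replace an exact age with a coarser age band."""
    low = (record["age"] // bucket) * bucket
    out = dict(record)
    out["age"] = f"{low}-{low + bucket - 1}"
    return out


person = {"name": "A. Example", "email": "a@example.com", "age": 37, "city": "Lisbon"}
print(aggregate_age(redact(person)))
```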

Fairness:
- Avoid bias in the data, e.g. model doesn't work well on photos of black people compared to white people
- Group fairness, equal accuracy
- Bias in human-labeled / collected data, e.g. because one group is underrepresented
  - accurate labels are necessary for supervised learning
  - bias can arise when the data contains more data points for one group than for another, so that people's diversity is not represented

Problems:
- Representational harm: the amplification or negative reflection of certain groups' stereotypes.
- Opportunity denial
- Disproportionate product failure
- Harm by disadvantage

Mitigate bias in data:
- Collect data in equal proportions from different user groups








### Data labeling

Labeling can be done by:
- automation
- raters

Types of human raters (people who assign labels for training supervised models):
- Generalists: crowdsourcing tools
- Subject matter experts: specialised tools, e.g. radiologists labeling medical images for automated diagnosis tools
- Your users: derived labels, e.g. tagging photos


<u>Methods of obtaining labels:</u>
- **Process Feedback** (direct labeling):
  - Ex: actual vs predicted click-through - if person clicked on an ad, label as "positive"
  - Advantages: training dataset continuous creation; labels evolve quickly; captures strong label signals
  - Disadvantages: not possible for many problems; tends to be custom-designed for each problem
  - Log analysis tools: Logstash, Fluentd, Google Cloud Logging, AWS ElasticSearch, Azure Monitor
- **Human Labeling**
  - Raters labeling data, e.g. cardiologists labeling MRI images
  - Advantages: more labels; pure supervised learning
  - Disadvantages: can be hard depending on the data (e.g. X-ray images); quality consistency; slow; expensive; small dataset curation
- **Semi-Supervised Labeling**
- **Active Learning**
- **Weak Supervision**





### Validating data

TensorFlow Data Validation (TFDV) 
- Generates data statistics and browser visualisations
- Infers the data schema
- Performs validity checks against schema
- Detects training/serving skew

Schema skew: training and serving data do not conform to the same schema, e.g. `int != float`

Feature skew: feature values are different between training and serving.

Distribution skew: distribution of serving and training dataset is significantly different. 
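
To make schema skew concrete, the sketch below compares the value types observed per feature in a training sample and a serving sample. It is a hand-rolled illustration using plain Python dictionaries, not how TFDV implements the check. `infer_types` and `schema_skew` are hypothetical names.

```python
def infer_types(rows):
    """Map each feature name to the set of Python type names seen in rows."""
    types = {}
    for row in rows:
        for name, value in row.items():
            types.setdefault(name, set()).add(type(value).__name__)
    return types


def schema_skew(train_rows, serving_rows):
    """Return features whose observed types differ between the two datasets."""
    train_types, serving_types = infer_types(train_rows), infer_types(serving_rows)
    return {
        name: (train_types.get(name), serving_types.get(name))
        for name in sorted(set(train_types) | set(serving_types))
        if train_types.get(name) != serving_types.get(name)
    }


train = [{"age": 26, "income": 41000.0}, {"age": 50, "income": 58000.0}]
serving = [{"age": 31.0, "income": 39000.0}]   # age arrives as float at serving time
print(schema_skew(train, serving))
```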





## Modelling

---

**Select and train model**

Model development is a highly-iterative process - **model iteration**.

The first step is to <u>establish a baseline level of performance</u>, e.g. desired accuracy, which can be established from:
- Human Level Performance (HLP): usually is more effective for establishing baseline for unstructured data, such as images, text, audio. For unstructured data problems, using human-level performance as the baseline can give an estimate of the irreducible error/Bayes error and what performance is reasonable to achieve.
  - HLP estimates Bayes error / irreducible error due to random chance. 
  - HLP can establish a respectable benchmark of performance to beat 
  - Raising / establishing HLP:
    - When the ground truth label is externally defined (e.g. how you vs the doctor predict some medical outcome compared to a <u>biopsy</u>), HLP gives an estimate for Bayes error / irreducible error;
    - Often ground truth is just label of a human (e.g. an inspector labeling the photos). In this case, low HLP could indicate inconsistent labeling instructions
    - HLP can be raised by making the labeling instructions more consistent
    - If a photo cannot be classified well even by a qualified person, then the quality of the photo(s) needs to be improved
- Literature search for state-of-the-art / open source
- Performance of older ML system (previous version of your ML model)

<u>Data-centric vs model-centric AI development</u>
- **Data-centric**: keep the algorithm / code fixed and iteratively improve the data
- **Model-centric**: keep the data fixed and iteratively work to improve / optimise algorithm / model
- *most academic research tends to be model-centric with fixed data as a benchmark.*
- *A reasonable algorithm with good data will often outperform a great algorithm with not-so-good data*

Milestones in the model development:
1. doing well on training set - FIRST MILESTONE
2. doing well on dev/test set
   1. doing well on average on the test set is not enough.
   2. for example, your model can perform well on average on the test set, but perform worse on disproportionately important data points, which wouldn't be acceptable
   3. the ML model can be biased and discriminate by gender, ethnicity, etc.
   4. rare classes / skewed data distribution; accuracy on rare classes
3. doing well on business metrics / project goals

Before training on a large dataset, try to overfit a small portion of it first, to check that the pipeline works and to find bugs.

---

**Perform error analysis**

Error analysis is also an iterative process.

Prioritizing what to work on:
- Check how much room for improvement there is compared to the baseline (e.g. HLP)
- how frequently a category appears
- how easy it is to improve accuracy in a category
- how important it is to improve in a category

Improving performance on specific categories:
- collect more data for that category
- data augmentation
- improve label accuracy / data quality

Skewed datasets:
- if the dataset is highly skewed, use precision and recall instead of accuracy

It is also worth checking precision, recall, and F1 score for each of the groups / classes.
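
A per-class check can be sketched in plain Python as below, computing precision, recall, and F1 one-vs-rest for every class. The toy labels are made up for illustration, and `per_class_metrics` is a hypothetical helper. Libraries such as scikit-learn provide equivalent, better tested functions.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision, recall and F1 for every class, one-vs-rest."""
    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Toy skewed data: 8 "ok" examples and 2 "defect" examples.
y_true = ["ok"] * 8 + ["defect"] * 2
y_pred = ["ok"] * 8 + ["ok", "defect"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                                     # overall accuracy looks high
print(per_class_metrics(y_true, y_pred)["defect"])  # recall on the rare class is much lower
```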



## Deployment

1) Deploy in production
2) Monitor & maintain the system
3) Monitor the data; if it changes, consider retraining the model
4) Deployment patterns:
   1) Shadow mode deployment: the model shadows the humans and runs in parallel; the ML system's output is not used for any decisions during this phase. The purpose is to monitor how the system performs compared to human performance. Example: you’ve built a new system for making loan approval decisions. For now, its output is not used in any decision-making process, and a human loan officer is solely responsible for deciding which loans to approve, but the system’s output is logged for analysis.
   2) Canary deployment: roll out to a small fraction (say 5%) of traffic initially, then monitor the system and gradually ramp up traffic. This allows you to spot problems with your ML system early on.
   3) Blue-green deployment: the router simply switches traffic from the old version of the model (blue) to the new one (green). This makes it easy to roll back to the older model.
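
The traffic split behind a canary rollout can be sketched with a deterministic hash of the user id, so that each user consistently hits the same model while the fraction is ramped up. This is a simplified illustration: `route`, the bucket count, and the user ids are hypothetical, and real routers usually live in the serving infrastructure.

```python
import hashlib


def route(user_id, canary_fraction):
    """Deterministically send a fixed fraction of users to the new model."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000          # stable bucket in [0, 9999]
    return "new_model" if bucket < canary_fraction * 10_000 else "old_model"


users = [f"user-{i}" for i in range(10_000)]
for fraction in (0.05, 0.25, 1.0):             # gradual ramp-up
    share = sum(route(u, fraction) == "new_model" for u in users) / len(users)
    print(f"target {fraction:.0%} -> observed {share:.1%}")
```

In this sketch, setting the fraction to `1.0` or back to `0.0` behaves like a blue-green switch and rollback.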

Model can be run at:
- **Cloud deployment**
- Edge deployment
  - Can function even when network connection is down
  - Less network bandwidth needed
  - Lower latency
  - Also less computational power is available


<img src="Media/mlops.png" width=400>

<img src="Media/degree-of-automation.png" width=400>

Example of partial automation: 

You’re building a healthcare screening system, where you input a patient’s symptoms, and for the easy cases (such as an obvious case of the common cold) the system will give a recommendation directly, and for the harder cases it will pass the case on to a team of in-house doctors who will form their own diagnosis independently. What degree of automation are you implementing in this example for patient care?


---

**Monitoring**

A monitoring dashboard can track:
- server load
- fraction of non-null outputs
- fraction of missing input values
- other things that could go wrong

Set thresholds for alarms. Metrics and thresholds may need to be adapted over time.

Examples of metrics to track:
- Software metrics: memory, compute, latency, throughput, server load
- input metrics: average input length, average input volume, number of missing values, average image brightness
- Output metrics: number of times system returns null, number of times user redoes search, CTR
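
A minimal sketch of threshold-based alarms on input and output metrics, assuming requests are logged as dictionaries. The log format, the metric names, and the threshold values are all hypothetical.

```python
# Hypothetical alarm thresholds; in practice they are tuned over time.
THRESHOLDS = {"missing_input_fraction": 0.10, "null_output_fraction": 0.05}


def monitor(requests):
    """Compute input/output metrics for a batch of logged requests and flag alarms.

    Each request is a dict with an "input" dict and an "output"
    (None if the system returned nothing).
    """
    n = len(requests)
    values = [v for r in requests for v in r["input"].values()]
    metrics = {
        "missing_input_fraction": sum(v is None for v in values) / len(values),
        "null_output_fraction": sum(r["output"] is None for r in requests) / n,
    }
    alarms = [name for name, value in metrics.items() if value > THRESHOLDS[name]]
    return metrics, alarms


log = [
    {"input": {"query": "shoes", "locale": "en"}, "output": "result-1"},
    {"input": {"query": "boots", "locale": None}, "output": None},
    {"input": {"query": None, "locale": "en"}, "output": "result-2"},
    {"input": {"query": "hats", "locale": "en"}, "output": "result-3"},
]
metrics, alarms = monitor(log)
print(metrics)
print(alarms)
```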

<img src="Media/mlops2.png" width=400>

Model maintenance:
- manual retraining
- automatic retraining



**Pipeline monitoring**

Metrics to monitor:
- Software metrics
- Input metrics
- Output metrics

How quickly do they change?
- User data generally drifts slowly (exceptions: COVID-19, a new movie or trend)
- Enterprise data (B2B applications) can shift fast (e.g. a new coating for mobile phones, a change in the way the company operates)



### Software

Modern software development:
- Scalability
- Extensibility: can you extend it easily to add more stuff
- Configuration
- Consistency & reproducibility
- Safety & security
- Modularity
- Testability
- Monitoring
- Best practices in the industry




### Problems

Model performance decays over time.

Drift and skew:
- Drift: changes in data over time, such as data collected once a day.
- Skew: difference between two static versions, or different sources, such as training set and serving set.

Problems can be gradual and sudden:
- **Gradual**:
  - Data changes:
    - **Data drift** (feature drift, population shift, covariate shift): the data distribution changes post-deployment, so the trained model is less relevant for the new data. It can also arise when instances of a class on which the model performed poorly become more frequent. It can be measured, for example, with the Chebyshev distance (L-infinity).
  - World changes:
    - **Concept drift**: the x -> y mapping changes post-deployment. In other words, the patterns the model learned no longer hold because the very meaning of what we are trying to predict evolves.
  - **Model decay / drift / staleness** (SLOW): degradation of model performance over time, as measured by some model quality metric (accuracy, mean error rate, or a downstream business KPI such as click-through rate). Reasons for model decay: data quality, data drift, concept drift.
- **Sudden**:
  - Data collection problem: bad sensor/camera or moved position, bad log data
  - System problem: bad software update, loss of network connectivity, system down, bad credentials
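
The Chebyshev (L-infinity) distance mentioned above can be sketched for a categorical feature as the largest absolute difference in category frequency between a baseline sample and a current sample. The feature values and the threshold below are made up for illustration.

```python
from collections import Counter


def l_infinity_distance(baseline, current):
    """Largest absolute difference in category frequency between two samples."""
    p, q = Counter(baseline), Counter(current)
    categories = set(p) | set(q)
    return max(abs(p[c] / len(baseline) - q[c] / len(current)) for c in categories)


training = ["card"] * 70 + ["cash"] * 30
serving = ["card"] * 40 + ["cash"] * 30 + ["wallet"] * 30   # a new category appears

distance = l_infinity_distance(training, serving)
DRIFT_THRESHOLD = 0.2   # hypothetical value; choose per feature
print(round(distance, 2), distance > DRIFT_THRESHOLD)
```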

Shift:
- Dataset shift: $P_{train}(y,x) \ne P_{serve}(y,x)$
- Covariate shift: $P_{train}(y|x) = P_{serve}(y|x)$, $P_{train}(x) \ne P_{serve}(x)$
- Concept shift: $P_{train}(y|x) \ne P_{serve}(y|x)$, $P_{train}(x) = P_{serve}(x)$

<img src="Media/skew-detection-workflow.png">

<u>Problems by level of difficulty:</u>
- **Easy problems**: 
  - E.g. classifying dogs and cats.
  - Ground truth changes slowly (months, years)
  - Model retraining driven by:
    - Model improvements, better data
    - Changes in software and / or systems
  - Labeling:
    - curated datasets
    - crowd-based
- **Harder problems**:
  - E.g. shoes
  - Ground truth changes faster (weeks)
  - Model retraining driven by:
    - *Declining model performance*
    - Model improvements, better data
    - Changes in software and/or system
  - Labeling:
    - direct feedback
    - crowd-based
- **Really hard problems**:
  - E.g. predicting financial markets
  - Ground truth changes very fast
  - Model retraining driven by:
    - *Declining model performance*
    - Model improvements, better data
    - Changes in software and/or system
  - Labeling:
    - direct feedback
    - weak supervision






# ML pipeline workflows

ML pipeline workflows are almost always DAGs (directed acyclic graphs).
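
Because a pipeline is a DAG, a valid execution order can be derived with a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the step names are hypothetical and do not correspond to specific TFX components.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "ingest": set(),
    "validate": {"ingest"},
    "transform": {"validate"},
    "train": {"transform"},
    "evaluate": {"train", "validate"},
    "deploy": {"evaluate"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

If the dependencies contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is why pipelines are modelled as acyclic graphs.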

TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines. Its components include:
- TensorFlow Data Validation (TFDV): helps to understand, validate, and monitor production machine learning data at scale.

**TFDV**

After inferring a schema from the training data, you can validate new datasets (e.g. the serving dataset from your customers) against it to detect and fix anomalies. This helps prevent the different types of skew, so you can be confident that your model is training on, or predicting from, data that is consistent with the expected feature types and distribution.

TFDV helps to understand, validate, and monitor production machine learning data at scale. It provides insight into some key questions in the data analysis process such as:

- What are the underlying statistics of my data?
- What does my training dataset look like?
- How do my evaluation and serving datasets compare to the training dataset?
- How can I find and fix data anomalies?

<img src="Media/tfdv-pipe.png" width=600>

Can capture schema of each feature:
- The expected type of each feature;
- The expected presence of each feature - minimum count and fraction of examples that must contain the feature
- Minimum and maximum number of values
- Possible categories for string feature, or range for an integer feature

How you would use TensorFlow Data Validation in a machine learning project:
- It allows you to scale the computation of statistics over datasets.
- You can infer the schema of a given dataset and revise it based on your domain knowledge.
- You can inspect discrepancies between the training and evaluation datasets by visualizing the statistics and detecting anomalies.
- You can analyze specific slices of your dataset.


```python
import pandas as pd

# tensorflow_data_validation has lots of dependency conflicts. What worked:
#   python==3.9.0
#   tensorflow-data-validation==1.13.0
#   protobuf==3.20.0
import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto.v0 import schema_pb2

# Training data
df = pd.DataFrame({
    'name': ['Evgenii', 'Carolina', 'Rebeca', 'Judi'],
    'age': [26, 28, 25, 50]
})

# New data to validate against the schema inferred from the training data
df2 = pd.DataFrame({
    'name': ['Joel'],
    'age': [100]
})

df
```


```python
# Generate training dataset statistics
train_stats = tfdv.generate_statistics_from_dataframe(df)
# Visualize training dataset statistics
tfdv.visualize_statistics(train_stats)

# Generate statistics for the new data to be checked
check_stats = tfdv.generate_statistics_from_dataframe(df2)
```

```python
# Infer schema from the computed statistics.
schema = tfdv.infer_schema(statistics=train_stats)
# Restrict the range of the `age` feature
tfdv.set_domain(schema, 'age', schema_pb2.IntDomain(name='age', min=17, max=90))
# Display the inferred schema
tfdv.display_schema(schema)
```

| Feature name | Type | Presence | Valency | Domain |
| - | - | - | - | - |
| 'name' | STRING | required | | 'name' |
| 'age' | INT | required | | min: 17; max: 90 |

| Domain | Values |
| - | - |
| 'name' | 'Carolina', 'Evgenii', 'Judi', 'Rebeca' |


```python
# Check the new data for errors by validating its statistics against the reference schema
anomalies = tfdv.validate_statistics(statistics=check_stats, schema=schema)

# Visualize anomalies
tfdv.display_anomalies(anomalies)
```

| Feature name | Anomaly short description | Anomaly long description |
| - | - | - |
| 'name' | Unexpected string values | Examples contain values missing from the schema: Joel (~100%). |
| 'age' | Out-of-range values | Unexpectedly large value: 100. |
