# MLOps

Machine Learning Operations

Concept drift / data drift: the data seen in production changes after the model has already been deployed.
- Concept drift: the x -> y mapping changes post-deployment
- Data drift: the distribution of the input x changes post-deployment

Where does the model run: cloud vs edge/browser
- Edge deployment: can function even when network connection is down; less network bandwidth needed; lower latency; however, less computational power is available. 

Latency vs throughput: latency is how quickly a prediction is returned to the user; throughput is how many requests the system handles, measured in QPS (queries per second)
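
To make the two quantities concrete, here is a minimal sketch that times a sequence of requests and reports mean latency, 95th-percentile latency, and throughput. The `predict` function is a hypothetical stand-in for a real model call, and the requests are served one at a time; a real serving system would usually handle them concurrently.

```python
import statistics
import time


def predict(x):
    """Hypothetical stand-in for a real model call."""
    return sum(x) / len(x)


def measure(predict_fn, requests):
    """Return (mean latency in s, p95 latency in s, throughput in QPS)."""
    latencies = []
    start = time.perf_counter()
    for request in requests:
        t0 = time.perf_counter()
        predict_fn(request)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    latencies.sort()
    # Nearest-rank style index for the 95th percentile, clamped to the last element.
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    qps = len(requests) / elapsed
    return statistics.mean(latencies), p95, qps


requests = [[1.0, 2.0, 3.0]] * 1000
mean_latency, p95_latency, qps = measure(predict, requests)
print(f"mean latency: {mean_latency * 1e6:.1f} us")
print(f"p95 latency:  {p95_latency * 1e6:.1f} us")
print(f"throughput:   {qps:.0f} QPS")
```

Tail latency (p95 here) is often more informative than the mean, because a few slow requests can hurt the user experience while leaving the average almost unchanged.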

Logging: for analysing data, performance, and errors

An edge device is any piece of hardware that controls data flow at the boundary between two networks. Edge devices are pieces of equipment that serve to transmit data between the local network and the cloud.

---

ML project lifecycle:
1) **Scoping**: 
   1) define project
      1) decide on key performance metrics, e.g. accuracy, latency, throughput
   2) define desired input and output
   3) estimate resources needed
2) **Data**
   1) define data
      1) is the data labeled consistently?
      2) data normalization?
   2) establish baseline
   3) collect data
   4) label and organise data
3) **Modeling**
   1) select and train model
      1) usually, in research / academia we optimise the code (algorithm/model) and hyperparameters
      2) in a product team, it may be better to focus on hyperparameters and data
   2) perform error analysis
4) **Deployment**
   1) deploy in production
   2) monitor & maintain system
   3) monitor the data; if it changes, consider retraining the model
   4) deployment patterns:
      1) Shadow mode deployment: the ML system runs in parallel with the humans, and its output is not used for any decisions during this phase. The purpose is to compare the system's performance with human performance. Example: you've built a new system for making loan approval decisions. For now, a human loan officer is solely responsible for deciding which loans to approve, and the system's output is only logged for analysis.
      2) Canary deployment: roll out the new model to a small fraction of traffic initially (say, 5%), monitor the system, then gradually ramp up the traffic. This lets you spot problems with the ML system early on.
      3) Blue-green deployment: the router switches traffic from the old version of the model (blue) to the new one (green). This makes it easy to roll back to the older model.
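
A canary rollout needs some rule for deciding which requests go to the new model. One common approach, sketched below under assumptions, is to hash a user id into a stable value in [0, 1) and compare it with the current canary fraction. The names `bucket`, `route`, `"old_model"`, and `"new_model"` are hypothetical, and a production router would normally live in the serving infrastructure rather than in application code.

```python
import hashlib


def bucket(user_id: str) -> float:
    """Map a user id to a stable value in [0, 1) using a hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 2**32


def route(user_id: str, canary_fraction: float) -> str:
    """Send roughly `canary_fraction` of users to the new model."""
    return "new_model" if bucket(user_id) < canary_fraction else "old_model"


users = [f"user-{i}" for i in range(10_000)]
for fraction in (0.05, 0.25, 1.0):
    share = sum(route(u, fraction) == "new_model" for u in users) / len(users)
    print(f"canary_fraction={fraction:.2f} -> {share:.1%} of users on the new model")
```

Because the bucket of a user never changes, anyone routed to the new model at 5% stays on it as the fraction is ramped up, which keeps the experience consistent. In this sketch, rolling back amounts to setting `canary_fraction` to 0.0.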

<img src="Media/ml_project_lifecycle.png">

<img src="Media/mlops.png">

---

<img src="Media/degree-of-automation.png">

Example of partial automation: 

You’re building a healthcare screening system, where you input a patient’s symptoms, and for the easy cases (such as an obvious case of the common cold) the system will give a recommendation directly, and for the harder cases it will pass the case on to a team of in-house doctors who will form their own diagnosis independently. What degree of automation are you implementing in this example for patient care?
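
One way such a split between easy and hard cases could be implemented is a confidence threshold on the model's output. The sketch below is an assumption for illustration, not part of the original example: the `triage` function, the class probabilities, and the 0.9 threshold are all hypothetical.

```python
from typing import Dict, Optional, Tuple


def triage(probabilities: Dict[str, float],
           threshold: float = 0.9) -> Tuple[str, Optional[str]]:
    """Handle confident cases automatically; refer the rest to a doctor.

    Hard cases return no suggested diagnosis, so that the doctors form
    their own diagnosis independently.
    """
    diagnosis, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence >= threshold:
        return "automated", diagnosis
    return "refer_to_doctor", None


print(triage({"common cold": 0.97, "flu": 0.03}))  # ('automated', 'common cold')
print(triage({"flu": 0.55, "pneumonia": 0.45}))    # ('refer_to_doctor', None)
```

Raising the threshold sends more cases to the doctors; lowering it automates more cases at the cost of more automated mistakes.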


# Monitoring

A monitoring dashboard can track:
- server load
- fraction of non-null outputs
- fraction of missing input values
- other things that could go wrong

Set thresholds for alarms. Metrics and thresholds may need to be adapted over time.
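
As a rough illustration of threshold-based alarms, the sketch below computes two of the dashboard metrics for a batch of requests and reports which ones fall outside their allowed range. The metric names, the bounds in `THRESHOLDS`, and the toy batch are assumptions chosen for the example.

```python
def fraction(predicate, items):
    """Fraction of items for which predicate(item) is true (0.0 for an empty batch)."""
    items = list(items)
    if not items:
        return 0.0
    return sum(1 for item in items if predicate(item)) / len(items)


# Assumed alarm bounds as (minimum, maximum); None means "no bound on this side".
THRESHOLDS = {
    "missing_input_fraction": (None, 0.25),
    "non_null_output_fraction": (0.90, None),
}


def check_alarms(metrics, thresholds):
    """Return the names of metrics that fall outside their allowed range."""
    alarms = []
    for name, value in metrics.items():
        low, high = thresholds.get(name, (None, None))
        if (low is not None and value < low) or (high is not None and value > high):
            alarms.append(name)
    return alarms


# Toy batch: None marks a missing input or a null output.
inputs = ["cat photos", None, "weather", "news", "sports"]
outputs = ["result", None, "result", "result", None]

metrics = {
    "missing_input_fraction": fraction(lambda x: x is None, inputs),
    "non_null_output_fraction": fraction(lambda y: y is not None, outputs),
}
print(metrics)
print(check_alarms(metrics, THRESHOLDS))  # ['non_null_output_fraction']
```

Keeping the bounds in a plain dictionary makes them easy to adjust as the metrics and thresholds are revised over time.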

Examples of metrics to track:
- Software metrics: memory, compute, latency, throughput, server load
- Input metrics: average input length, average input volume, number of missing values, average image brightness
- Output metrics: number of times the system returns null, number of times the user redoes a search, click-through rate (CTR)

<img src="Media/mlops2.png">

Model maintenance:
- manual retraining
- automatic retraining



**pipeline monitoring**

Metrics to monitor:
- software metrics
- input metrics
- output metrics

How quickly does the data change?
- user data generally drifts slowly (exceptions: COVID-19, a new movie or trend)
- enterprise data (B2B applications) can shift fast (e.g. a new coating for a mobile phone, or a change in the way the company operates)




Model decay / drift / staleness: degradation of model performance over time, measured by a model quality metric (e.g. accuracy, mean error rate) or a downstream business KPI (e.g. click-through rate).

reasons for model decay:
- data quality
- data drift (also called feature drift, population shift, or covariate shift): the input data has changed, so the trained model is no longer relevant to the new data.
  - another example: instances of a class on which the model performed poorly become more frequent
- concept drift: the patterns the model learned no longer hold; the very meaning of what we are trying to predict evolves.
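
Data drift in a single numeric feature can be checked by comparing its distribution at training time with its distribution in production. The sketch below uses the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs), written in plain Python. The threshold of 0.2 and the synthetic Gaussian data are assumptions for illustration; in practice the threshold would be tuned or replaced by a significance test. Concept drift cannot be detected this way from inputs alone, since it concerns the x -> y mapping and usually requires labels or a downstream metric.

```python
import bisect
import random


def ks_statistic(reference, current):
    """Two-sample KS statistic: max absolute gap between the empirical CDFs."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref = sorted(reference)
    cur = sorted(current)
    d = 0.0
    # The largest gap between two step functions occurs at one of the sample points.
    for x in ref + cur:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        d = max(d, abs(cdf_ref - cdf_cur))
    return d


DRIFT_THRESHOLD = 0.2  # assumed value, for illustration only

rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(500)]  # training-time feature values
same = [rng.gauss(0.0, 1.0) for _ in range(500)]       # production, no drift
shifted = [rng.gauss(1.5, 1.0) for _ in range(500)]    # production, mean has shifted

for name, sample in [("same distribution", same), ("shifted", shifted)]:
    d = ks_statistic(reference, sample)
    print(f"{name}: KS = {d:.3f}, drift flagged: {d > DRIFT_THRESHOLD}")
```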