## ML System Monitoring and Deployment 
### Data Engineering

Last updated: September 29, 2022

---

### Sources

- Designing Machine Learning Systems, Chip Huyen
- Solution Architect's Handbook, 2nd Edition. Saurabh Shrivastava and Neelanjali Srivastav

---

### Concepts

- model degradation
- monitoring vs observability
- software failures vs ML failures
- data distribution shifts
- edge cases
- detecting performance issues
- performance monitoring plan
- deployment strategies including: blue-green deployment, red-black deployment, canary release

---

### 1. Model Degradation
Model performance inevitably degrades over time in production.

There are several reasons for this, some **software**-related and some **ML**-related.

**Software failures** include:  
- dependency issue: a package the system depends on changes or is no longer available
- deployment issue: wrong version deployed, not deployed to correct machine(s)
- hardware issue

**ML failures** include:  
- training data distribution differs from production (inference) data distribution
- edge cases

Next, we dive deeper into ML failures.

#### A. Training data distribution differs from production (inference) data distribution

ML works well when the patterns in production data match the patterns in the training data.  
A model that keeps performing well on such unseen data is said to *generalize*.

Several reasons why this might fail to be the case:

- **non-stationarity**: patterns change over time for various reasons:
    - major disruption like pandemic
    - seasonality
    - change in economic / market conditions
    - change in business strategy

---

**THINK ABOUT AND DISCUSS**

Can you think of other reasons why the pattern might change?

---


- **change in feature range or cardinality**: for example, the credit score range changes from 300-850 to 300-830

- **bad data** including:
  - incorrect inputs
  - unexpected data format
  - issue with data collection / pipeline  
 
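Changes like these are common.
  
They can be hard to detect because ML failures often occur silently: the system keeps returning predictions without raising an error.

A statistical test (e.g., a two-sample test) is often needed to detect a significant change.

A popular nonparametric choice is the two-sample Kolmogorov-Smirnov (KS) test, which compares two distributions.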
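
To make the two-sample KS idea concrete, here is a minimal standard-library sketch. The data are simulated, the function names are illustrative, and the rejection threshold uses the usual large-sample approximation, so treat it as a teaching aid rather than a production test. In practice a library routine such as `scipy.stats.ks_2samp` also returns a p-value.

```python
import bisect
import math
import random


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample approximation of the rejection threshold."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


# Simulated feature values (assumption: not real data).
random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(1000)]
production = [random.gauss(0.5, 1.0) for _ in range(1000)]  # shifted mean

d = ks_statistic(train, production)
threshold = ks_critical_value(len(train), len(production))
drift_detected = d > threshold
print(f"KS statistic={d:.3f}, threshold={threshold:.3f}, drift={drift_detected}")
```

If the statistic exceeds the threshold, the two samples are unlikely to come from the same distribution at the chosen significance level.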

**Retraining the model**

A common practice for dealing with changes in data over time: retrain the model.

Alternative terminology: fine-tune the model, recalibrate the model

This keeps the architecture and features the same but changes the data.

Some considerations when retraining:

- what time period to use? 

- expanding or sliding window?
  
- how often to retrain?

Run experiments to decide, e.g., compare candidate settings on recent held-out data.
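
The difference between an expanding and a sliding window can be sketched as follows. This assumes data arrive in numbered periods (say, months); `training_window` is a hypothetical helper, not part of any library.

```python
def training_window(n_periods, window_size=None):
    """Return the indices of the periods to train on, oldest first.

    window_size=None -> expanding window (all history so far).
    window_size=k    -> sliding window (only the latest k periods).
    """
    if window_size is None:
        start = 0
    else:
        start = max(0, n_periods - window_size)
    return list(range(start, n_periods))


# Retrain once per period and compare which periods each scheme uses.
for available in range(3, 7):
    expanding = training_window(available)
    sliding = training_window(available, window_size=3)
    print(available, expanding, sliding)
```

An expanding window keeps all history, which helps when old patterns stay relevant. A sliding window drops old periods, which helps when patterns change.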

#### B. Edge Cases

An edge case is a situation where the model performs poorly.

Example: Model trained on financial data when interest rates were always positive.  
In production, it is fed negative interest rates. This might produce poor results.

**Including edge cases in the training data helps make the model more robust.**

Next, we talk about how to detect pattern changes.

---

### 2. Monitoring and Observability

*Monitoring* refers to the act of tracking, measuring and logging different metrics to help determine when something goes wrong.

*Observability* refers to setting up the system so that users have **visibility into the system** to determine when something goes wrong and where it happened. An example would be logging all events in the system as it runs. Sometimes called *instrumentation*.

Observability should allow drill-down. For example, a user may want to see all incorrect predictions for a certain subset of customers over a certain period of time.

#### What Should be Monitored?

**Operational Metrics**  
Examples may include:
- latency: time elapsed between request and returned answer
- throughput: amount of data processed over given period
- CPU/GPU utilization
- memory utilization
- number of prediction requests received over given period
- uptime: percent of time that system is available to offer reasonable performance

Uptime example: At one time AWS EC2 offered a monthly uptime percentage of at least 99.99% (four nines).
If the monthly uptime percentage fell below this level, customers would receive a credit.
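
To get a feel for what four nines means, the short sketch below converts an uptime percentage into allowed downtime. A 30-day month is an assumption, and the function names are illustrative.

```python
def allowed_downtime_minutes(uptime_pct, days=30):
    """Minutes of downtime permitted in a period at a given uptime percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_pct / 100)


def uptime_percentage(downtime_minutes, days=30):
    """Uptime percentage implied by the observed downtime in a period."""
    total_minutes = days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


print(round(allowed_downtime_minutes(99.99), 2))  # about 4.32 minutes per 30 days
print(round(uptime_percentage(60), 3))            # one hour down in 30 days
```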

**ML-Specific Metrics**  
Broad categories to monitor:
- model accuracy (very important)
- predictions (very important)
- features
- raw inputs

Since model degradation is the focus, it's most important to monitor model accuracy and predictions.  
Features may change in distribution, but if the model continues to perform well, this is not concerning.

For monitoring **model accuracy**, examples include:
- accuracy
- F1 score
- area under the ROC curve

For **monitoring predictions**, examples include:
- any invalid predicted probabilities: less than 0 or greater than 1?
- have all predictions over some period of time been identical? This would be worrisome.
- run test cases with known answers: does the prediction vary over time?
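
The prediction checks listed above could be automated along these lines. This is a sketch under assumptions: predictions are probabilities in a plain list, and `predict` stands in for whatever scoring function the system exposes.

```python
def check_predictions(probabilities):
    """Return a list of warnings for a batch of predicted probabilities."""
    warnings = []
    # NaN also fails this comparison, so it is counted as invalid.
    invalid = [p for p in probabilities if not 0.0 <= p <= 1.0]
    if invalid:
        warnings.append(f"{len(invalid)} predictions outside [0, 1]")
    if len(probabilities) > 1 and len(set(probabilities)) == 1:
        warnings.append("all predictions are identical")
    return warnings


def check_known_cases(predict, known_cases, tolerance=0.05):
    """Re-score fixed test cases and return the IDs whose prediction moved."""
    return [
        case_id
        for case_id, (features, expected) in known_cases.items()
        if abs(predict(features) - expected) > tolerance
    ]


print(check_predictions([0.2, 0.9, 1.3]))
print(check_predictions([0.5, 0.5, 0.5]))
```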

For **monitoring features**, examples include:
- statistics of each feature (quantiles, median, ...)
- do values fall within expected range (for continuous values)
- do values fall within predefined set (for discrete values)
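
A minimal sketch of these feature checks, using only the standard library. The credit score range reuses the 300-850 example from earlier in these notes; the category values are made up for illustration.

```python
import statistics


def summarize(values):
    """Quartiles of a numeric feature (the middle quartile is the median)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {"q1": q1, "median": q2, "q3": q3}


def out_of_range(values, low, high):
    """Continuous feature: values outside the expected [low, high] range."""
    return [v for v in values if not low <= v <= high]


def unexpected_categories(values, allowed):
    """Discrete feature: values not in the predefined set."""
    return sorted(set(values) - set(allowed))


credit_scores = [300, 640, 720, 810, 900]
print(summarize(credit_scores))
print(out_of_range(credit_scores, 300, 850))
print(unexpected_categories(["A", "B", "Z"], {"A", "B", "C"}))
```

Tracking the summary statistics per period makes it possible to compare them against the values seen at training time.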

For monitoring **raw inputs**, examples include:
- checking for missing data (example: your system scrapes web pages and the format has changed, returning no data)
- checking for invalid data formats (example: your system expects numeric but is capturing text)
- checking for data that falls outside expected ranges
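
Raw-input checks like these can run before any feature is computed. The sketch below assumes each raw record is a dict of strings, as it might be after scraping or parsing a CSV; the field name `price` and its range are hypothetical.

```python
def validate_record(record, numeric_fields, ranges=None):
    """Return a list of problems found in one raw input record (a dict)."""
    problems = []
    ranges = ranges or {}
    for field in numeric_fields:
        raw = record.get(field)
        if raw is None or raw == "":
            problems.append(f"{field}: missing")
            continue
        try:
            value = float(raw)
        except (TypeError, ValueError):
            problems.append(f"{field}: not numeric ({raw!r})")
            continue
        low, high = ranges.get(field, (float("-inf"), float("inf")))
        if not low <= value <= high:
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems


print(validate_record({"price": "N/A"}, ["price"]))
print(validate_record({"price": "-3"}, ["price"], {"price": (0, 1000)}))
```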

**Visualizations** can be produced over time. These can be helpful for human review and further exploration, but aren't as useful in automated alerting.

#### Performance Monitoring Plan

A **performance monitoring plan** is recommended for each model.  
This should be crafted by stakeholders to include:
- metrics to monitor
- triggers for each metric (e.g., if AUC falls by 10% between review periods, then ALERT)
- monitoring frequency
- actions to take if ALERT (who does what by when)

Oftentimes there are three status levels such as RED/AMBER/GREEN (RAG Status).  
Stakeholders would define what each level means and what should be done.

Simple Example:

| LEVEL      | METRIC | ACTION |
| ----------- | ----------- | ----------- |
| GREEN      | AUROC>=0.8 | system functioning as expected       |
| AMBER   | 0.7<AUROC<0.8 | system not functioning as expected but not critical; retrain the model on new data and monitor closely        |
| RED   |  AUROC<=0.7| system not functioning as expected and critical; cease use of model and redevelop. |
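
The table above translates directly into code. This sketch hard-codes the example thresholds; in a real plan the stakeholders would supply them.

```python
def rag_status(auroc):
    """Map an AUROC value to the RED/AMBER/GREEN levels in the example table."""
    if auroc >= 0.8:
        return "GREEN"
    if auroc > 0.7:
        return "AMBER"
    return "RED"


ACTIONS = {
    "GREEN": "no action; system functioning as expected",
    "AMBER": "retrain the model on new data and monitor closely",
    "RED": "cease use of model and redevelop",
}

for value in (0.85, 0.75, 0.65):
    status = rag_status(value)
    print(value, status, ACTIONS[status])
```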

---

### 3. Monitoring Tools

We summarize useful monitoring tools:

- logs
- dashboards
- alerts

**Logs** capture anything of interest. Ideally they include the process ID and metadata to easily track down issues. This can be hard as systems grow complex with multiple microservices running.
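
As an illustration, Python's standard `logging` module can stamp every record with the process ID, and a per-request ID can be added as metadata. The service name, the logged fields, and the `log_prediction` helper are assumptions for this sketch.

```python
import logging
import uuid

logging.basicConfig(
    level=logging.INFO,
    # %(process)d is the ID of the process that emitted the record.
    format="%(asctime)s pid=%(process)d %(name)s %(levelname)s %(message)s",
)
logger = logging.getLogger("prediction_service")  # hypothetical service name


def log_prediction(model_version, features, prediction):
    """Log one prediction with enough metadata to trace it later."""
    request_id = str(uuid.uuid4())
    logger.info(
        "request_id=%s model_version=%s features=%s prediction=%.4f",
        request_id, model_version, features, prediction,
    )
    return request_id


log_prediction("v1", {"credit_score": 720}, 0.8312)
```

Passing the same request ID between microservices is one way to follow a single request across their logs.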

Logs can be processed in batch to find issues. Spark can be a useful tool.

Anomalies can be detected in real time using a streaming tool like Kafka or Amazon Kinesis.

---

**Dashboards** can be helpful for visualizing all important metrics in one place. They are useful for non-technical audiences as well (e.g., executives).

Include only important metrics and graphs.

Powerful tools include Tableau and Power BI.

---

**Alerts** are useful for engaging the right people when the system malfunctions.  
This was discussed above in the performance monitoring plan.  

Alerts consist of:
- conditions when to alert (AUC falls by 10%, AUC < 0.8, ...)
- who to alert and how (notify MLOps Team by email, Slack, ...)
- description of the alert
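
The three parts of an alert can be captured in a small data structure. This is a sketch, not the interface of any monitoring tool; the recipient and channel names are placeholders, and the conditions mirror the AUC examples above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AlertRule:
    description: str                           # what the alert means
    condition: Callable[[float, float], bool]  # (previous, current) -> fire?
    recipients: List[str]                      # who to alert
    channel: str                               # how to alert


def relative_drop(previous, current):
    """Fractional decrease from the previous to the current value."""
    return (previous - current) / previous


rules = [
    AlertRule("AUC fell by 10% or more since last review",
              lambda prev, cur: relative_drop(prev, cur) >= 0.10,
              ["mlops-team"], "email"),
    AlertRule("AUC below 0.8",
              lambda prev, cur: cur < 0.8,
              ["mlops-team"], "slack"),
]


def evaluate(rules, previous_auc, current_auc):
    """Return the rules whose condition fires for this review period."""
    return [r for r in rules if r.condition(previous_auc, current_auc)]


for rule in evaluate(rules, previous_auc=0.90, current_auc=0.79):
    print(f"ALERT via {rule.channel} to {rule.recipients}: {rule.description}")
```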

It's important that alerts are accurate: too many false alarms lead to alert fatigue, and people start ignoring them.

Tools for setting and monitoring alerts include:
- Amazon CloudWatch
- GCP Cloud Monitoring
- [Datadog](https://www.datadoghq.com/)

Some firms build custom jobs that run the monitoring checks.

---

### 4. Deployment Strategies

As part of continuous deployment (CD), we'll want to carefully release updates to the system.  
Doing a full release to all users, while faster, carries greater risk.

In practice, different rollout patterns are used, including:

- **Red-black deployment**  
  This is the full release: instant cutover from the existing version to the new version.  
  While the new software gets out faster, if there are any issues, **all active users** might be exposed to them.  

  In practice, red-black deployment is often combined with **canary testing**.  
  It works like this:  

  1. Stand up a copy of the environment with the new code (the canary)
  2. Route 1% of production traffic to the canary
  3. If the canary clears the test, do the cutover
  4. If the canary fails, fix the error(s) and retry


- **Blue-green deployment**  
  Gradually replace the existing version with the new version.  

  It works like this:  
  1. The BLUE environment carries live traffic on the existing version  
  2. You provision a GREEN environment that is identical to BLUE but runs the new code  
  3. Route some production traffic from BLUE to GREEN  
  4. If there are issues with GREEN, fall back to BLUE. Otherwise route more traffic to GREEN until 100% runs there.
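
Both patterns rely on sending a controlled share of traffic to the new version. The sketch below shows one possible way to do that, by hashing a request or user ID into buckets. In practice a load balancer or service mesh usually handles this, and the `route` and `rollout` helpers and the health check are hypothetical.

```python
import hashlib


def route(request_id, new_version_share):
    """Send a stable fraction of requests to the new version.

    Hashing the ID keeps routing deterministic: the same ID always lands
    on the same side for a given share.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000  # bucket in 0..9999
    return "new" if bucket < new_version_share * 10_000 else "old"


def rollout(shares, is_healthy):
    """Step through increasing traffic shares; fall back on the first failure.

    Returns the share of traffic left on the new version at the end.
    """
    for share in shares:
        if not is_healthy(share):
            return 0.0  # fall back: all traffic returns to the existing version
    return shares[-1]


# Canary: about 1% of requests reach the new version.
canary_hits = sum(route(f"user-{i}", 0.01) == "new" for i in range(20_000))
print(canary_hits)

# Gradual shift, assuming every health check passes.
print(rollout([0.01, 0.10, 0.50, 1.00], is_healthy=lambda share: True))
```

For the canary, `"new"` is the canary environment and the share stays at 1% until the cutover. For the gradual pattern, `"new"` is GREEN and the share is stepped up to 100%.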

