# An end-to-end ML Model Monitoring Workflow with NannyML in Python

## Table of contents

1. Why monitor models?
2. What will you learn in this tutorial?
3. Prerequisite concepts covered
4. What does an end-to-end ML model monitoring workflow look like?
5. Step 1: Prepare the data
6. Step 2: Estimating performance
7. Step 3: Estimated vs. realized performance
8. Step 4: Calculating and estimating business value
9. Step 5: Multivariate drift detection
10. Step 6: Univariate drift detection
11. Step 7: Issue resolution
12. Conclusion

## Why monitor models?

Machine learning projects are iterative processes. You don't just stop at a successful model inside a Jupyter notebook. You don't even stop after the model is online and people can access it. Even after deployment, you have to constantly babysit it so that it works just as well as it did during the development phase.

Zillow's scandal is a perfect example of what happens if you don't. In 2021, Zillow lost a stunning 304 million dollars because of its machine learning model that estimated prices for homes to purchase. Zillow overpaid for more than 7000 homes and had to offload them at a much lower price. The company was "ripped off" by its own model and had to reduce its workforce by 25%.

These types of silent model failures are common with real-world models, so models need to be continuously monitored and updated before their production performance drops. Failing to do so damages companies' reputation, their trust with stakeholders and, ultimately, their pockets.

This article will teach you how to implement an end-to-end workflow to monitor machine learning models after deployment with NannyML.

## What is NannyML?

NannyML is a growing open-source library focused on post-deployment machine learning. It offers a wide range of features to solve all types of problems that arise in production ML environments. To name a few:

- **Drift Detection**: Detects data distribution changes between training and production data.
- **Performance Estimation**: Estimates model performance in production without immediate ground truth.
- **Automated Reporting**: Generates reports on deployed model health and performance.
- **Alerting System**: Provides alerts for data drift and performance issues.
- **Model Fairness Assessment**: Monitors model fairness to prevent biases.
- **Compatibility with ML Frameworks**: Integrates with popular machine learning frameworks.
- **User-Friendly Interface**: Offers a familiar scikit-learn like interface.

We will learn the technical bits of these features one by one.

## What will you learn in this tutorial?

Apart from NannyML's API, you will learn several key concepts regarding post-deployment machine learning. The article will provide you with a framework you can follow in any ML project for monitoring. See the table of contents at the top of the article for full details.

## Prerequisite concepts covered

We will learn the fundamental concepts of model monitoring through the analogy of a robot mastering archery.

![A high-tech robot learning to shoot a single arrow at a target in an outdoor setting.](attachment:8d2a9cdd-9678-4172-a9b7-4174c0f7205e.png)

In our analogy:
- The robot represents our machine learning model.
- The target represents the goal or objective of our model.
- We can say this is a regression problem since scores are calculated based on how close the arrows are shot to the bull's eye - the red dot in the center.
- The characteristics of the arrows and bow, alongside the robot’s physical attributes and environmental conditions (like wind and weather), are the features or input variables of our model.

So, let's start:

### Data drift

Imagine we've carefully prepared the bow, arrows and target (like data preparation). Our robot, equipped with many sensors and cameras, shoots 10000 times during training. Over time, it starts to hit the bull's eye with impressive frequency. We are thrilled with the performance and begin selling our robot and its copies to archery lovers (deploying the model).

But soon, we get a stream of complaints. Some users report that the robot is totally missing the target. Surprised, we gather a team to discuss what went wrong. 

What we find is a classic case of data drift. The environment in which the robots are operating has changed - different wind patterns, varying humidity levels, and even changes in the physical characteristics of the arrows (weight, balance) and bow. This real-world shift in input features has thrown off our robot's accuracy, similar to how a machine learning model might underperform when input data evolves over time. 


The first problem we detect is that our robots weren't shipped with a standard bow and arrows. Different users have different arrows and thus, different shooting dynamics. Our robots, which learned the correct positioning, grip strength, draw strength and shooting angles for a certain type of bow/arrows, couldn't perform well with new types of equipment. 

Besides, we didn't consider environmental factors such as time of day and weather conditions. If users tested our robots in rainy, cloudy or windy weather, they wouldn't shoot well because we trained them only in sunny and mild conditions. These factors almost certainly interfered with our robots' sensors.

One of the engineers in our team called this phenomenon data drift, because the features our robot learned from changed in the real world (production).
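To make the idea concrete, here is a minimal sketch of one common way to check a single feature for drift: comparing its training and production distributions with the two-sample Kolmogorov-Smirnov statistic. The wind-speed values, the sample sizes and the `ks_statistic` helper are all assumptions made up for this illustration, and the code uses only the Python standard library rather than NannyML's own API.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


rng = random.Random(0)

# Hypothetical wind speeds (m/s): calm during training, gusty in production.
training_wind = [rng.gauss(2.0, 0.5) for _ in range(1000)]
same_conditions = [rng.gauss(2.0, 0.5) for _ in range(1000)]
production_wind = [rng.gauss(4.0, 1.0) for _ in range(1000)]

no_drift = ks_statistic(training_wind, same_conditions)
drift = ks_statistic(training_wind, production_wind)

print(f"KS statistic, same conditions:     {no_drift:.3f}")
print(f"KS statistic, changed conditions:  {drift:.3f}")
```

A statistic close to 0 means the two samples look alike, while a value close to 1 means they barely overlap. In this sketch, the robot's "wind" feature would be flagged as drifted only in the second comparison.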

### Concept drift

After fixing these issues, we sell another batch of robots. In a few weeks, we start getting the same complaints yet again. Wondering what could possibly be the problem, we investigate. 

This time, we learn that users frequently replaced their shooting targets. The new targets differed in size and were positioned at different distances from our robots, all of which require changes in the robot's shooting dynamics.

We find out that this phenomenon is called concept drift, because our robots were applying the same patterns they learned in development to completely new types of targets.
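The difference from data drift can be sketched in a few lines of Python. In this hypothetical setup, `x` stands for any input the robot senses and `y` for the ideal shooting adjustment; the linear relationships, the noise levels and the helper names are assumptions chosen only for illustration. The inputs keep the same distribution in production, but the relationship between inputs and target changes, so a model fitted in development starts making large errors.

```python
import random
from statistics import mean

rng = random.Random(42)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def mae(model, xs, ys):
    """Mean absolute error of a fitted line on the given data."""
    slope, intercept = model
    return mean(abs(slope * x + intercept - y) for x, y in zip(xs, ys))


# Development: the target follows y = 2x plus a little noise.
dev_x = [rng.uniform(10, 50) for _ in range(500)]
dev_y = [2.0 * x + rng.gauss(0, 1) for x in dev_x]
model = fit_line(dev_x, dev_y)

# Production: inputs come from the same range, but the target now follows y = 3x.
prod_x = [rng.uniform(10, 50) for _ in range(500)]
prod_y = [3.0 * x + rng.gauss(0, 1) for x in prod_x]

dev_mae = mae(model, dev_x, dev_y)
prod_mae = mae(model, prod_x, prod_y)

print(f"Mean input, development vs. production: {mean(dev_x):.1f} vs. {mean(prod_x):.1f}")
print(f"MAE in development: {dev_mae:.2f}")
print(f"MAE in production:  {prod_mae:.2f}")
```

Because the inputs look unchanged, a check on the features alone would not reveal this kind of failure. It only shows up in the model's performance.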

## What does an end-to-end ML model monitoring workflow look like?

## Step 1: Preparing the data for NannyML

### Defining features and target

### Splitting the data into four sets

### Training a model

### Creating a reference set

### Creating an analysis set

## Step 2: Estimating performance in NannyML

### When to estimate performance?

### Estimating performance using DLE in NannyML

### Plotting estimation results in NannyML

## Step 3: Estimated vs. realized performance in monitoring

### When to calculate realized performance?

### Calculating realized performance in NannyML

### Comparing estimation vs. realized performance visually

## Step 4: Drift detection methods

### Multivariate drift detection

### Univariate drift detection

## Step 5: Solutions to monitoring problems

## Conclusion