# BLU15 - Part 1 of 2 - When to retrain your model
In this BLU, we'll talk about updating models over time. We briefly touched on this topic in the time series specialization. Some models are stable over time because the population about which we're making predictions doesn't change. Many real-world populations do change, though, so we also have to update our models by retraining them on newer data.

We'll consider several retraining strategies in this notebook. In the next one, we'll explain how to diagnose a model before retraining, and we'll retrain our baseline model from BLU14.

## 1. The need for retraining

*Train, test and deploy* – that’s it, right? Is your work done? Not quite!

So far the process was this:

1. Model building starts by learning the dependencies between a set of independent features and the target variable on a set of historical data.
2. The best model is found by minimizing the prediction error on the validation dataset, as measured by the selected metric.
3. The best model is then deployed to production with the expectation of making accurate predictions on incoming unseen data for as long as possible.

One of the biggest mistakes a data scientist can make is to assume that their models will keep working properly forever after deployment. *But what about the data, which will inevitably keep changing?*

A model deployed to production and left unattended won’t be able to adapt to changes in the data by itself.

Let's look at the following example. In a [UK bank survey from August 2020](https://www.bankofengland.co.uk/bank-overground/2021/how-has-covid-affected-the-performance-of-machine-learning-models-used-by-uk-banks), 35% of the bankers surveyed reported a negative impact on ML model performance because of the pandemic:

<img src="media/model_impact_of_change.png" alt="drift" width="800"/>

Unpredictable events like this are a great example of why continuous monitoring and retraining of ML models in production is important compared to static validation and testing techniques. 

In the simplest case, retraining involves running the entire existing pipeline with new data, without changing the code or re-building the pipeline. However, if you end up exploring a new algorithm or a feature which was not available at the time of previous model training, then incorporating these changes into the retrained model may further improve the model performance.

But what exactly can cause the decrease in model performance?

### 1.1 Data drift

To understand this, let us recall one of the most critical assumptions in ML modeling:

> The train and test data set should be drawn from the same distribution. 

The model will perform well if the new data is similar to the historical data on which it was trained. It follows that if the test data distribution deviates from that of the training data, the model will not hold up well.

There are many factors that can cause such a deviation. Depending on the business case, it can be a change in consumer preferences, a fast-moving competitive landscape, a geographic shift, a change in economic conditions, a pandemic, etc. Hence, a drifting data distribution calls for periodically checking the validity of the model. In short, it is critical to keep your machine learning model updated; the key question is when. We will discuss this in a bit.

### 1.2 Robustness

As you remember from [SLU17 - Ethics and Fairness](https://github.com/LDSSA/batch8-students/tree/main/S01%20-%20Bootcamp%20and%20Binary%20Classification/SLU17%20-%20Ethics%20and%20Fairness), a model has an impact on the world that it learned from. And that impact can change the *a priori* assumptions that once were true.

People or entities affected by the outcome of ML models may deliberately alter their responses in order to send spurious input to the model, thereby escaping the impact of its predictions. For example, models dealing with fraud detection and cyber-security receive manipulated and distorted inputs, which cause them to output misclassified predictions. Adversaries of this kind also drive down model performance.

### 1.3 When the ground truth is not available at the time of model training

In some ML problems, the ground truth labels are not available when the model is trained. For example, the target variable that captures the response of the end user may not be known yet. In that case, your best bet could be to mock the user action based on a set of rules coming from business understanding, or to leverage open source datasets to initiate model training. However, such a model might not represent the actual data and hence will not perform well until after a burn-in period, during which it starts picking up (i.e., learning) the true actions of the end user.

### 1.4 Concept drift

Concept drift is a phenomenon where the meaning of the labels of the target variable you’re trying to predict changes over time. The concept has changed, but the model doesn’t know about the change.

Concept drift happens when the original idea your model had about the target class changes. For example, you build a model to classify positive and negative sentiment of tweets around certain topics, and over time people’s sentiment about these topics changes. Tweets that once expressed positive sentiment may come to be read as negative.

## 2. How to measure the decline in model performance?

If the ground truth values are stored alongside the predictions, as with the success of a search, the model accuracy can be calculated on a continuous basis to assess the drift.

<img src="media/model_decay_retraining.png" alt="retraining" width="550"/>

But what if the prediction horizon is farther into the future and we can’t wait till the ground truth label is observed to assess the model goodness? In that case, we can roughly estimate the retraining window from back-testing. This involves using the ground truth labels and predictions from the historical data to estimate the time frame around which the accuracy begins to taper off.
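
A minimal sketch of such a back-test, assuming each historical prediction is tagged with the number of weeks since deployment. The helper names (`accuracy_by_period`, `first_period_below`), the toy data, and the 0.7 accuracy floor are illustrative assumptions, not part of any library.

```python
import numpy as np

def accuracy_by_period(periods, y_true, y_pred):
    """Return {period: accuracy} for each distinct period, in ascending order."""
    periods = np.asarray(periods)
    correct = np.asarray(y_true) == np.asarray(y_pred)
    return {int(p): float(correct[periods == p].mean()) for p in np.unique(periods)}

def first_period_below(acc_by_period, min_accuracy):
    """Return the first period whose accuracy is below min_accuracy, or None."""
    for period, acc in acc_by_period.items():
        if acc < min_accuracy:
            return period
    return None

# Toy history: weeks since deployment, observed labels, and model predictions.
weeks  = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1]

history = accuracy_by_period(weeks, y_true, y_pred)
# history -> {0: 1.0, 1: 0.75, 2: 0.25}

# Assumed minimum acceptable accuracy; pick it from your business requirements.
retrain_after = first_period_below(history, min_accuracy=0.7)
# retrain_after -> 2, so this toy model would need retraining before week 2
```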

Effectively, the whole exercise of finding the model drift boils down to inferring whether the two data sets (training and test) are coming from the same distribution, or whether the performance has fallen below an acceptable range.

Let's look at some of the ways to assess distribution drift:

### 2.1 Histogram
A quick way to compare two populations is to draw their histograms: the degree of overlap between the two histograms gives a measure of similarity.

<img src="media/histogram_data_drift.png" alt="histogram" width="600"/>
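
The visual check can also be turned into a number. The sketch below computes an overlap coefficient between two normalised histograms that share the same bin edges: 1 means identical histograms, 0 means no overlap at all. The function name `histogram_overlap`, the bin count, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def histogram_overlap(reference, current, bins=20):
    """Overlap coefficient in [0, 1] between the normalised histograms of two samples."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Shared bin edges so that the two histograms are directly comparable.
    edges = np.histogram_bin_edges(np.concatenate([reference, current]), bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    ref_share = ref_counts / ref_counts.sum()
    cur_share = cur_counts / cur_counts.sum()
    # Sum, over all bins, of the share that both histograms have in common.
    return float(np.minimum(ref_share, cur_share).sum())

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
same_feature = rng.normal(loc=0.0, scale=1.0, size=5000)     # no drift
shifted_feature = rng.normal(loc=2.0, scale=1.0, size=5000)  # mean has moved

overlap_same = histogram_overlap(train_feature, same_feature)        # close to 1
overlap_shifted = histogram_overlap(train_feature, shifted_feature)  # much lower
```

Note that the result depends on the number of bins, so it is best used to compare features over time with a fixed setting rather than as an absolute measure.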

### 2.2 K-S statistic

The [Kolmogorov–Smirnov test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) is a useful tool to check whether incoming data belongs to the same distribution as the training data. In short, this test quantifies the distance between two distributions. An implementation is available in Python through SciPy ([scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html)).

See an illustration of the Kolmogorov–Smirnov statistic below. The red line is the reference cumulative distribution function and the blue line is the sample empirical cumulative distribution function. The difference between the two curves illustrated by the black arrow is the K–S statistic.

<img src="https://upload.wikimedia.org/wikipedia/commons/c/cf/KS_Example.png" alt="ks" width="350"/>
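
As a minimal sketch, the two-sample variant of the test, `scipy.stats.ks_2samp`, compares a training sample with a new sample directly. The synthetic data and the 0.05 significance level are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(0.0, 1.0, size=2000)
new_same = rng.normal(0.0, 1.0, size=2000)     # drawn from the same distribution
new_shifted = rng.normal(0.5, 1.0, size=2000)  # mean shifted by half a standard deviation

same_result = ks_2samp(train_feature, new_same)
shifted_result = ks_2samp(train_feature, new_shifted)

# `statistic` is the largest gap between the two empirical CDFs;
# a small `pvalue` suggests the samples come from different distributions.
alpha = 0.05  # assumed significance level
drift_detected = shifted_result.pvalue < alpha
```

With large samples the test becomes very sensitive, so it is worth looking at the size of the statistic as well as the p-value.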

### 2.3 Target distribution
Another quick way to check the consistency of the predictive power of the ML model is to examine the distribution of the target variable. For example, if your training data set is imbalanced, with 99% of the data belonging to class 1 and the remaining 1% to class 0, and the predictions show a split of around 90%-10%, then this should be treated as an alert for further investigation.
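
The 99%-1% versus 90%-10% example can be checked with a few lines of code. This is only a sketch: the helper names and the 0.05 tolerance are hypothetical and should be adapted to your use case.

```python
import numpy as np

def class_shares(labels, classes):
    """Return {class: share of rows with that class}."""
    labels = np.asarray(labels)
    return {c: float(np.mean(labels == c)) for c in classes}

def max_share_gap(train_labels, predicted_labels, classes=(0, 1)):
    """Largest absolute difference in class share between training labels and predictions."""
    train = class_shares(train_labels, classes)
    pred = class_shares(predicted_labels, classes)
    return max(abs(train[c] - pred[c]) for c in classes)

train_labels = [1] * 990 + [0] * 10       # 99% class 1, 1% class 0
predicted_labels = [1] * 900 + [0] * 100  # 90% class 1, 10% class 0

gap = max_share_gap(train_labels, predicted_labels)  # about 0.09
needs_investigation = gap > 0.05  # assumed tolerance
```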

### 2.4 Correlation

Additionally, monitoring pairwise correlations between variables will help bring out an underlying drift. Variables which were previously correlated may start to diverge and vice versa.
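
A minimal sketch of this check with pandas, assuming the reference (training) data and the current data are available as DataFrames with the same columns. The function name `correlation_shift` and the synthetic data are illustrative.

```python
import numpy as np
import pandas as pd

def correlation_shift(reference, current):
    """Absolute change in pairwise Pearson correlation between two DataFrames."""
    return (current.corr() - reference.corr()).abs()

rng = np.random.default_rng(1)
n = 2000

# Reference data: column "b" closely follows column "a".
a_ref = rng.normal(size=n)
reference = pd.DataFrame({"a": a_ref, "b": a_ref + rng.normal(scale=0.3, size=n)})

# Current data: the two columns have become independent.
current = pd.DataFrame({"a": rng.normal(size=n), "b": rng.normal(size=n)})

shift = correlation_shift(reference, current)
largest_shift = shift.loc["a", "b"]  # large value: the a-b relationship has changed
```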

## 3. Retraining strategy

There are three approaches to handling model retraining.

### 3.1 Retraining at fixed periodic intervals
The model is retrained on a fixed schedule, regardless of how it is performing. If the incoming data is changing frequently, retraining can even happen daily!

### 3.2 Retraining based on monitoring results
#### 3.2.1 Trigger based on performance metrics
The model is retrained when the monitored metric crosses a threshold. This approach is more effective than the one above but the threshold specifying the acceptable level of performance divergence needs to be decided beforehand. The following factors need to be considered while choosing the threshold:
- Too low a threshold will lead to frequent retraining, which increases the overhead in terms of computing cost.
- Too high a threshold may lead to “strayed predictions”, because the model is allowed to degrade for too long before it is retrained.

<img src="media/retraining_model_graph.png" alt="histogram" width="400"/>
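
The trigger itself can be very small. In the sketch below, the threshold is expressed as the maximum acceptable drop in accuracy relative to the score measured at deployment; the function names, the scores, and the threshold values are all assumptions for illustration.

```python
def should_retrain(baseline_score, current_score, max_drop):
    """True when the monitored metric has dropped by more than max_drop."""
    return (baseline_score - current_score) > max_drop

def first_trigger(baseline_score, scores, max_drop):
    """Index of the first period that triggers retraining, or None if none does."""
    for period, score in enumerate(scores):
        if should_retrain(baseline_score, score, max_drop):
            return period
    return None

baseline_accuracy = 0.92                           # accuracy at deployment time
weekly_accuracy = [0.91, 0.90, 0.88, 0.85, 0.83]   # monitored in production

trigger_week = first_trigger(baseline_accuracy, weekly_accuracy, max_drop=0.05)
# trigger_week -> 3

# A lower threshold triggers retraining earlier (and therefore more often).
strict_trigger_week = first_trigger(baseline_accuracy, weekly_accuracy, max_drop=0.015)
# strict_trigger_week -> 1
```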

#### 3.2.2 Trigger based on data changes
By monitoring the incoming data during production, you can identify changes in the distribution of your data. This can indicate that your model is outdated or that you’re in a dynamic environment. This is a good approach in situations where you don't get immediate feedback on your predictions, so you can't compare them to the ground truth.
<img src="media/trigger_data_changes.png" alt="histogram" width="500"/>
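
One possible way to build such a trigger is to run a two-sample Kolmogorov–Smirnov test on every feature and request retraining when at least one feature has drifted. This is a sketch under assumptions: the feature names, the synthetic data, and the 0.01 significance level are made up, and other drift measures could be used instead.

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted_features(reference, incoming, alpha=0.01):
    """Names of features whose incoming sample differs from the reference sample."""
    return [
        name for name in reference
        if ks_2samp(reference[name], incoming[name]).pvalue < alpha
    ]

rng = np.random.default_rng(7)
# Reference data (seen at training time) and a batch of production data.
reference = {"age": rng.normal(40, 10, 3000), "income": rng.normal(50, 15, 3000)}
incoming = {"age": rng.normal(40, 10, 3000), "income": rng.normal(65, 15, 3000)}

drifted = drifted_features(reference, incoming)  # "income" has shifted upwards
retrain_needed = len(drifted) > 0
```

No labels are needed here, which is what makes this trigger usable when feedback on the predictions arrives late or never.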

### 3.3 Retraining on demand
Here the model is retrained manually, whenever someone decides it is necessary. Of all the options, this is the least efficient, as it does not rely on automation, but it's the simplest to implement and is therefore sometimes favoured.

## 4. How much data is needed for retraining?

In addition to knowing why and when you need to retrain your models, it’s also important to know how to select the right data for retraining and whether or not to drop the old data. 

Three things to consider when choosing the data for retraining:
- What is the size of your data?
- Is your data drifting?
- How often do you get new data?

### 4.1 Fixed window size

This is a straightforward approach to selecting the training data: the model is always retrained on the most recent data that fits within a window of fixed size. The major drawback of this approach is having to select the right window size.

- If the window size is too large, we may introduce noise into the data. 
- If it’s too narrow, it might lead to underfitting.

Overall, this approach is a simple heuristic approach that will work well in some cases, but will fail in a dynamic environment where data is constantly changing.
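
In code, a fixed window is just a slice of the most recent rows. The sketch below assumes the rows are sorted from oldest to newest; `fixed_window` is a hypothetical helper name.

```python
import numpy as np

def fixed_window(X, y, window_size):
    """Keep only the most recent `window_size` rows (rows sorted oldest to newest)."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    X = np.asarray(X)
    y = np.asarray(y)
    return X[-window_size:], y[-window_size:]

# Toy data: 10 observations with 2 features each, oldest first.
X_all = np.arange(20).reshape(10, 2)
y_all = np.arange(10)

X_recent, y_recent = fixed_window(X_all, y_all, window_size=4)
# y_recent -> array([6, 7, 8, 9]); only these rows are used for retraining
```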

### 4.2 Dynamic window size

This is an alternative to the fixed window size approach. This approach helps to determine how much historical data should be used to retrain your model by measuring a metric on a changing window size. It’s an approach to consider if your data is large and you also get new data frequently. 

<img src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Training-data-vs-test-data.png?resize=900%2C420&ssl=1" alt="histogram" width="700"/>
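
A minimal sketch of the idea: hold out the most recent observations for validation, fit a model on several candidate window sizes, and keep the window with the lowest validation error. A straight-line fit with `np.polyfit` stands in for your real model, and the function name, the candidate sizes, and the toy data (where the relationship flips for the latest 200 rows) are assumptions for illustration.

```python
import numpy as np

def pick_window(x, y, candidate_sizes, n_validation):
    """Return (best_size, {size: validation MSE}) using a line fitted on each window."""
    # The latest rows are held out; windows are taken from the rows before them.
    x_train, y_train = x[:-n_validation], y[:-n_validation]
    x_val, y_val = x[-n_validation:], y[-n_validation:]
    errors = {}
    for size in candidate_sizes:
        slope, intercept = np.polyfit(x_train[-size:], y_train[-size:], deg=1)
        predictions = slope * x_val + intercept
        errors[size] = float(np.mean((y_val - predictions) ** 2))
    best_size = min(errors, key=errors.get)
    return best_size, errors

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=600)
noise = rng.normal(scale=0.1, size=600)
# Toy drift: y = 2x for the first 400 rows, then y = -x for the most recent 200.
y = np.where(np.arange(600) < 400, 2.0 * x, -1.0 * x) + noise

best_size, errors = pick_window(
    x, y, candidate_sizes=[50, 150, 300, 550], n_validation=50
)
# Windows that reach back into the old regime (300, 550) give a much larger error.
```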

### 4.3 Combining all of the data

The simplest way to handle this problem, resources permitting, is to combine all of the data and retrain your model. This approach is suitable if the data has not drifted too much. It may not be a viable option in production due to the increase in computational load as the data continues to grow.

## 5. Final considerations

Before we move on to a more practical demonstration, I hope you're now aware that retraining and redeployment are a constant need for any ML model. The *when* and the *how* are key questions whose answers depend on the sensitivity not only of the methods, but also of the data scientists. For *data scientists*, a critical evaluation of results is a fundamental skill. Now let's get our hands dirty in Part 2!