Drift Detection for Regression Model Targets

Why Perform Drift Detection for Model Targets

The performance of a machine learning model can be affected if the distribution of its targets changes. The target distribution can change because of data drift as well as because of label shift.

A change in the target distribution may mean that the business assumptions under which the model is used need to be revisited.

NannyML uses the ~nannyml.drift.target.target_distribution.calculator.TargetDistributionCalculator to monitor drift in the target distribution. It can calculate the KS statistic (from the Kolmogorov-Smirnov test) as an aggregated drift measure and can also show the target distribution per chunk with joyplots.

Note

The Target Drift detection process can handle missing target values across all data periods.

Note

The following example uses timestamps. These are optional but have an impact on the way data is chunked and results are plotted. You can read more about them in the data requirements.

Just The Code
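
The end-to-end example is sketched below. This is a minimal sketch rather than a verbatim listing: the dataset loader, the column names (y_true, timestamp), the result accessors and the plot kinds are assumptions that may differ between NannyML versions.

    import nannyml as nml

    # Load the synthetic car price dataset shipped with NannyML
    # (loader name assumed; check your NannyML version).
    reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_price_dataset()

    # Attach the targets to the analysis data; rows are assumed to be
    # ordered identically, just like in sklearn.
    analysis_with_targets_df = analysis_df.join(analysis_targets_df)

    # Create the calculator; column names and problem type are assumptions.
    calc = nml.TargetDistributionCalculator(
        y_true='y_true',
        timestamp_column_name='timestamp',
        problem_type='regression',
    )

    # Fit on the reference period, then calculate drift on the analysis data.
    calc.fit(reference_df)
    results = calc.calculate(analysis_with_targets_df)

    # Inspect the results as a dataframe (accessor name may vary by version).
    print(results.to_df())

    # Plot the KS statistic per chunk and the per-chunk target distributions
    # (plot kinds assumed).
    results.plot(kind='target_drift').show()
    results.plot(kind='target_distribution').show()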

Walkthrough

In order to monitor a model, NannyML needs to learn about it from a reference dataset. Then it can monitor the data that is subject to actual analysis, provided as the analysis dataset. You can read more about this in our section on data periods.

Let's start by loading some synthetic car pricing data provided by the NannyML package, and setting it up as our reference and analysis dataframes.
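
A minimal sketch of this step, assuming the package exposes a load_synthetic_car_price_dataset helper (the exact loader name may differ across NannyML versions):

    import nannyml as nml

    # The loader returns the reference data, the analysis data and the
    # analysis targets as separate pandas dataframes.
    reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_price_dataset()

    print(reference_df.head())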

The analysis_targets dataframe contains the target values for the analysis period. This is kept separate in the synthetic data because it is not used during performance estimation. But it is required to detect drift for the targets, so the first thing we need to do in this case is set up the right data in the right dataframes. The analysis target values are expected to be ordered correctly, just like in sklearn.
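
One way to attach the targets, assuming the analysis and target dataframes share the same row order:

    # Join the target column onto the analysis data; rows are assumed to be
    # aligned by position, just like in sklearn.
    analysis_with_targets_df = analysis_df.join(analysis_targets_df)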

Now that the data is in place we'll create a new ~nannyml.drift.target.target_distribution.calculator.TargetDistributionCalculator, instantiating it with the appropriate parameters. We need the names of the target (y_true) and timestamp columns. We also need to specify the machine learning problem we are working on.
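
A sketch of the instantiation; the column names ('y_true', 'timestamp') and the problem_type value are assumptions based on the synthetic dataset and may need adjusting for your data:

    calc = nml.TargetDistributionCalculator(
        y_true='y_true',                    # target column name (assumed)
        timestamp_column_name='timestamp',  # timestamp column name (assumed)
        problem_type='regression',          # the machine learning problem
    )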

Afterwards, the ~nannyml.drift.target.target_distribution.calculator.TargetDistributionCalculator.fit method gets called on the reference period, which represents an accepted target distribution that we will compare against the analysis period.

Then the ~nannyml.drift.target.target_distribution.calculator.TargetDistributionCalculator.calculate method is called to calculate the target drift results on the data provided. We use the previously assembled data as an argument.
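
Putting the two steps together, assuming the calculator and dataframes created above:

    # Learn the accepted target distribution from the reference period ...
    calc.fit(reference_df)

    # ... and calculate target drift on the analysis data with targets attached.
    results = calc.calculate(analysis_with_targets_df)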

We can display the results of this calculation in a dataframe.

We can also display the results from the reference dataframe.
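
Assuming the result object exposes a dataframe accessor and a period filter (both of which may be named differently depending on your NannyML version), the results could be inspected like this:

    # Drift results for the analysis period.
    print(results.to_df())

    # Drift results computed on the reference period, for comparison.
    print(results.filter(period='reference').to_df())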

The results can also be easily plotted by using the ~nannyml.drift.target.target_distribution.result.TargetDistributionResult.plot method. We first plot the KS statistic drift results for each chunk.
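
A sketch of producing this plot; the kind value is an assumption and may be named differently in your NannyML version:

    # Plot the per-chunk KS statistic for the target (plot kind assumed).
    figure = results.plot(kind='target_drift')
    figure.show()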

Note that a dashed line, instead of a solid line, will be used for chunks that have missing target values.

[Figure: KS statistic for the target per chunk]

And then we create the joyplot to visualize the target distribution values for each chunk.
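
A corresponding sketch for the joyplot (again, the kind value is an assumption):

    # Plot the target distribution per chunk as a joyplot (plot kind assumed).
    figure = results.plot(kind='target_distribution')
    figure.show()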

[Figure: joyplot of the target distribution per chunk]

Insights

Looking at the results we can see that there has been some target drift towards lower car prices. We should also check, through realized performance monitoring, whether the performance of our model has been affected. Lastly, we would need to check with the business stakeholders to see if the observed changes can affect the company's sales and marketing policies.

What Next

The performance calculation functionality of NannyML can add context to the target drift results by showing whether there are associated performance changes. Moreover, Univariate Drift Detection as well as Multivariate Drift Detection can add further context if needed.