# Notebook overview

The goal of this notebook is to analyze data drift in the provided datasets using Evidently, an open-source Python library for evaluating data stability and data drift.

## What is Data Drift?
Data drift refers to a change over time in the statistical properties of the data a model receives, relative to the historical data that was used to train it. In the real world, the training data can become outdated, causing the trained model to behave differently and lose accuracy. That is why it is important to monitor the model with a drift detection system and to retrain it regularly on updated data to ensure consistent outputs.
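
To make "a change in statistical properties" concrete, here is a minimal sketch that measures the gap between two samples of a single numeric feature with the two-sample Kolmogorov-Smirnov statistic. The data is synthetic and the helper `ks_statistic` is a hypothetical name introduced for illustration; this is one common way to quantify drift, not necessarily the test Evidently applies to a given column.

```python
import numpy as np


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    reference = np.sort(np.asarray(reference, dtype=float))
    current = np.sort(np.asarray(current, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / reference.size
    cdf_cur = np.searchsorted(current, grid, side="right") / current.size
    return float(np.max(np.abs(cdf_ref - cdf_cur)))


# Synthetic data (an assumption for illustration, not the notebook's datasets).
rng = np.random.default_rng(seed=0)
reference = rng.normal(loc=0.0, scale=1.0, size=2_000)  # stand-in for training data
same = rng.normal(loc=0.0, scale=1.0, size=2_000)       # new data, same distribution
shifted = rng.normal(loc=1.0, scale=1.0, size=2_000)    # new data, mean shifted by one std

stat_same = ks_statistic(reference, same)
stat_shifted = ks_statistic(reference, shifted)
print(f"same distribution:    {stat_same:.3f}")
print(f"shifted distribution: {stat_shifted:.3f}")
```

The statistic stays close to 0 when both samples come from the same distribution and grows as the distributions move apart, which is the kind of signal a drift detector thresholds on.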

# Imports
## Libraries

In [2]:
import pandas as pd
import numpy as np

from sklearn.datasets import fetch_california_housing

from evidently import ColumnMapping

from evidently.report import Report
from evidently.metrics.base_metric import generate_column_metrics
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset
from evidently.metrics import *

from evidently.test_suite import TestSuite
from evidently.tests.base_test import generate_column_tests
from evidently.test_preset import DataStabilityTestPreset, NoTargetPerformanceTestPreset
from evidently.tests import *

In [5]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

## Data

The application train dataset is our reference baseline, as it was used to train the model. The second dataset is the application test dataset, which represents our current production data. Evidently will compare the current data to the reference.

In [23]:
df_reference = pd.read_csv('./data/cleaned/train_processed.csv')
df_reference.head()

Unnamed: 0,CODE_GENDER_F,NAME_EDUCATION_TYPE_Higher_education,NAME_FAMILY_STATUS_Married,ORGANIZATION_TYPE_Self_employed,FLAG_OWN_CAR,AMT_CREDIT,DAYS_BIRTH,DAYS_EMPLOYED,OWN_CAR_AGE,REGION_RATING_CLIENT,...,PREV_AMT_ANNUITY_MIN,PREV_DAYS_DECISION_MIN,PREV_CNT_PAYMENT_MEAN,REFUSED_AMT_APPLICATION_MIN,POS_MONTHS_BALANCE_MAX,INSTAL_DPD_MAX,INSTAL_PAYMENT_DIFF_MEAN,CC_AMT_BALANCE_MIN,SK_ID_CURR,TARGET
0,0,0,0,0,0,406597.5,-9461,-637.0,,2,...,9251.775,-606.0,24.0,,-1.0,0.0,0.0,,100002,1
1,1,1,1,0,0,1293502.5,-16765,-1188.0,,1,...,6737.31,-2341.0,10.0,,-18.0,0.0,0.0,,100003,0
2,0,0,0,0,1,135000.0,-19046,-225.0,26.0,2,...,5357.25,-815.0,4.0,,-24.0,0.0,0.0,,100004,0
3,1,0,0,0,0,312682.5,-19005,-3039.0,,2,...,2482.92,-617.0,23.0,688500.0,-1.0,0.0,0.0,0.0,100006,0
4,0,0,0,0,0,513000.0,-19932,-3038.0,,2,...,1834.29,-2357.0,20.666666,,-1.0,12.0,452.384318,,100007,0


In [24]:
# Drop the label and the client ID: neither is a model feature.
df_reference = df_reference.drop(columns=['TARGET', 'SK_ID_CURR'])

In [19]:
df_current = pd.read_csv('./data/cleaned/test_processed.csv')
df_current.head()

Unnamed: 0,CODE_GENDER_F,NAME_EDUCATION_TYPE_Higher_education,NAME_FAMILY_STATUS_Married,ORGANIZATION_TYPE_Self_employed,FLAG_OWN_CAR,AMT_CREDIT,DAYS_BIRTH,DAYS_EMPLOYED,OWN_CAR_AGE,REGION_RATING_CLIENT,...,ACTIVE_AMT_CREDIT_MAX_OVERDUE_MEAN,PREV_AMT_ANNUITY_MIN,PREV_DAYS_DECISION_MIN,PREV_CNT_PAYMENT_MEAN,REFUSED_AMT_APPLICATION_MIN,POS_MONTHS_BALANCE_MAX,INSTAL_DPD_MAX,INSTAL_PAYMENT_DIFF_MEAN,CC_AMT_BALANCE_MIN,SK_ID_CURR
0,1,1,1,0,0,568800.0,-19241,-2329.0,,2,...,,3951.0,-1740.0,8.0,,-53.0,11.0,0.0,,100001
1,0,0,1,1,0,222768.0,-18064,-4469.0,,2,...,0.0,4813.2,-757.0,12.0,,-15.0,1.0,0.0,,100005
2,0,1,1,0,1,663264.0,-20038,-4458.0,5.0,2,...,,4742.415,-1999.0,17.333334,,-3.0,21.0,1157.662742,0.0,100013
3,1,0,1,0,0,1575000.0,-13976,-1866.0,,2,...,0.0,6028.02,-1805.0,11.333333,,-20.0,7.0,622.550708,0.0,100028
4,0,0,1,0,1,625500.0,-13040,-2191.0,16.0,2,...,,11100.6,-821.0,24.0,,-15.0,0.0,0.0,,100038


In [25]:
# Drop the client ID so the current data has the same columns as the reference.
df_current = df_current.drop(columns=['SK_ID_CURR'])
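
Because the comparison is made column by column, it can help to confirm that both frames expose the same columns before running a report. The sketch below uses two tiny stand-in frames (column names and a few values copied from the previews above); `compare_columns`, `toy_reference` and `toy_current` are hypothetical names introduced for illustration, not part of Evidently or of this notebook.

```python
import pandas as pd


def compare_columns(reference: pd.DataFrame, current: pd.DataFrame) -> dict:
    """Report the columns that are present in only one of the two frames."""
    ref_cols, cur_cols = set(reference.columns), set(current.columns)
    return {
        "only_in_reference": sorted(ref_cols - cur_cols),
        "only_in_current": sorted(cur_cols - ref_cols),
    }


# Stand-in frames: the reference still carries the label, the current data does not.
toy_reference = pd.DataFrame(
    {"AMT_CREDIT": [406597.5, 135000.0], "DAYS_BIRTH": [-9461, -19046], "TARGET": [1, 0]}
)
toy_current = pd.DataFrame(
    {"AMT_CREDIT": [568800.0, 222768.0], "DAYS_BIRTH": [-19241, -18064]}
)

mismatch = compare_columns(toy_reference, toy_current)
print(mismatch)

# Keep only the shared columns, in the reference order.
shared = [col for col in toy_reference.columns if col in toy_current.columns]
toy_reference, toy_current = toy_reference[shared], toy_current[shared]
```

The same check could be run on `df_reference` and `df_current`; an empty result on both sides means the two frames are aligned.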

# Data Drift

Evidently Reports help explore and debug data and model quality. They calculate various metrics and generate a dashboard with rich visuals. We will store the report in an HTML file.

The Data Drift preset compares the distributions of the model features in the reference and current datasets and shows which ones have drifted.

In [26]:
report = Report(metrics=[
    DataDriftPreset(), 
])

report.run(reference_data=df_reference, current_data=df_current)
report

In [27]:
report.save_html("data_drift_report.html")

Drift was detected in 8% of the columns (2 out of 25). Since this share is well below the dataset drift threshold of 0.5, no dataset-level drift is flagged, and the detected drift is unlikely to affect model performance.

The list of drifted columns:
- AMT_CREDIT
- DAYS_LAST_PHONE_CHANGE
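
The arithmetic behind that conclusion can be sketched in a few lines. The numbers come from the report summary above; the variable names and the `>=` comparison are assumptions made for illustration and may not match Evidently's internal rule exactly.

```python
# Values taken from the report summary above.
drifted_columns = ["AMT_CREDIT", "DAYS_LAST_PHONE_CHANGE"]
n_columns = 25       # columns analysed in the report
drift_share = 0.5    # dataset-level threshold quoted above

share_drifted = len(drifted_columns) / n_columns
# Assumption: dataset drift is flagged once the share reaches the threshold.
dataset_drift = share_drifted >= drift_share

print(f"share of drifted columns: {share_drifted:.0%}")
print(f"dataset drift flagged: {dataset_drift}")
```

With 2 of 25 columns drifted, the share is 0.08, far below 0.5, so the dataset as a whole is not considered drifted.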
