# Using Evidently to Evaluate Data Drift in a Dataset

This notebook shows how to use Evidently to check a dataset for data drift.

Acknowledgments:

The dataset used in the example is from: https://www.kaggle.com/c/bike-sharing-demand/data?select=train.csv
Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
More information about the dataset can be found in UCI machine learning repository: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

## Getting Started
To run this tutorial:

1. Install MLflow
You can install MLflow with `pip install mlflow`, or install MLflow with scikit-learn via `pip install mlflow[extras]`.
More details: https://mlflow.org/docs/latest/tutorials-and-examples/tutorial.html#id5

2. Install Evidently
You can install Evidently with the following command `pip install evidently`
If you also want to display Evidently dashboards in Jupyter, install the Jupyter nbextension:
`jupyter nbextension install --sys-prefix --symlink --overwrite --py evidently`
And activate it:
`jupyter nbextension enable evidently --py --sys-prefix`
More details: https://docs.evidentlyai.com/install-evidently 

3. Optionally, you can download the data from https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset and save it locally, or skip this step and download the data with `requests` using the instructions below.
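For the `requests` route in step 3, a minimal sketch might look like the following. It assumes the downloaded archive is a zip file containing CSV files. `read_csv_from_zip` is a hypothetical helper, and the archive URL and member name are left as placeholders because the exact download link is not given here.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_bytes, member_name):
    """Return the rows of one CSV file inside a zip archive as a list of dicts."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member_name) as raw:
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8"))
            return list(reader)


# Hypothetical usage (ARCHIVE_URL and "data.csv" are placeholders, not real values):
# import requests
# response = requests.get(ARCHIVE_URL, timeout=30)
# response.raise_for_status()
# rows = read_csv_from_zip(response.content, "data.csv")
```

With pandas, the opened archive member can be passed to `pd.read_csv` instead of `csv.DictReader` to get a DataFrame directly.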

In [1]:
import json
import pandas as pd
import numpy as np
import requests
import zipfile
import io
import os

import plotly.offline as py #working offline
import plotly.graph_objs as go

from evidently.test_suite import TestSuite
from evidently.test_preset import DataStabilityTestPreset

from evidently.pipeline.column_mapping import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient

In [2]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

In [3]:
py.init_notebook_mode()

In [4]:
# Evaluate data drift with an Evidently Report
def eval_drift(reference, production, column_mapping):
    """
    Returns a list with pairs (feature_name, drift_score)
    Drift Score depends on the selected statistical test or distance and the threshold
    """    
    data_drift_report = Report(metrics=[DataDriftPreset()])
    data_drift_report.run(reference_data=reference, current_data=production, column_mapping=column_mapping)
    report = data_drift_report.as_dict()

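    # In the report produced here, the metric at index 1 holds the per-column drift results.
    drift_by_columns = report["metrics"][1]["result"]["drift_by_columns"]

    drifts = []
    for feature in column_mapping.numerical_features + column_mapping.categorical_features:
        drifts.append((feature, drift_by_columns[feature]["drift_score"]))

    return drifts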
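The lookup in `eval_drift` would raise a `KeyError` if a mapped feature were missing from the report. A minimal sketch of a more defensive extraction is shown below. `extract_drift_scores` and the sample dictionary are illustrative assumptions. Only the nesting `["metrics"][1]["result"]["drift_by_columns"][feature]["drift_score"]` is taken from the function above, and the scores are made up.

```python
def extract_drift_scores(report_dict, features, metric_index=1):
    """Return (feature, drift_score) pairs, skipping features absent from the report."""
    drift_by_columns = report_dict["metrics"][metric_index]["result"]["drift_by_columns"]
    return [
        (feature, drift_by_columns[feature]["drift_score"])
        for feature in features
        if feature in drift_by_columns
    ]


# Hand-written stand-in for the dictionary returned by as_dict(); values are invented.
sample_report = {
    "metrics": [
        {"result": {}},
        {"result": {"drift_by_columns": {
            "AMT_CREDIT": {"drift_score": 0.12},
            "CODE_GENDER": {"drift_score": 0.03},
        }}},
    ]
}

scores = extract_drift_scores(
    sample_report, ["AMT_CREDIT", "CODE_GENDER", "MISSING_COLUMN"]
)
print(scores)
```

Silently skipping missing features is a design choice. Raising an explicit error may be preferable when every mapped column is expected to appear in the report.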

## Load Data

In [5]:
# Load the data, dropping the ID column (and the target from the training set)
data_dir = os.path.join('..', '..', 'data')
application_train = pd.read_csv(os.path.join(data_dir, 'application_train.csv')).drop(['SK_ID_CURR', 'TARGET'], axis=1)
application_test = pd.read_csv(os.path.join(data_dir, 'application_test.csv')).drop(['SK_ID_CURR'], axis=1)

In [6]:
print('application_train : {} clients.'.format(application_train.shape[0]))
print('application_test : {} clients.'.format(application_test.shape[0]))

application_train : 307511 clients.
application_test : 48744 clients.


In [7]:
#observe data structure
application_train.head()

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,...,0,0,0,0,,,,,,
4,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,Unaccompanied,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
#observe data structure
application_test.head()

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,Unaccompanied,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,Unaccompanied,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,Unaccompanied,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,Unaccompanied,...,0,0,0,0,,,,,,


In [9]:
categorical_list = []
numerical_list = []
for i in application_train.columns.tolist():
    if application_train[i].dtype=='object':
        categorical_list.append(i)
    else:
        numerical_list.append(i)
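The same split can be sketched with `pandas.api.types.is_numeric_dtype`. This is an alternative, not the notebook's method: it treats every non-numeric column (including datetime columns) as categorical, whereas the loop above treats every non-`object` column as numerical. The toy DataFrame and the helper name `split_columns_by_dtype` are assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def split_columns_by_dtype(df):
    """Return (numerical, categorical) lists of column names based on dtype."""
    numerical = [col for col in df.columns if is_numeric_dtype(df[col])]
    categorical = [col for col in df.columns if col not in numerical]
    return numerical, categorical


# Toy data for illustration only
toy = pd.DataFrame({
    "AMT_CREDIT": [406597.5, 1293502.5],
    "CNT_CHILDREN": [0, 0],
    "CODE_GENDER": ["M", "F"],
})
numerical, categorical = split_columns_by_dtype(toy)
print(numerical)
print(categorical)
```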

In [10]:
# Set the column mapping for the Evidently report
data_columns = ColumnMapping()
data_columns.numerical_features = numerical_list
data_columns.categorical_features = categorical_list

In [11]:
# Sample the data if the dataset is too large
#application_train = application_train.sample(n=5000,random_state=42,replace=False)
#application_test = application_test.sample(n=5000,random_state=42,replace=False)

## Display the Results

In [12]:
data_stability = TestSuite(tests=[
    DataStabilityTestPreset(),
])
data_stability.run(current_data=application_test, reference_data=application_train, column_mapping=None)
data_stability.show(mode='inline')

In [13]:
data_drift_report = Report(metrics=[
    DataDriftPreset(),
])

data_drift_report.run(current_data=application_test, reference_data=application_train, column_mapping=None)
data_drift_report.show(mode='inline')

## MLflow Logging

In [None]:
# Log to MLflow
client = MlflowClient()

# Set the URI of the tracking server (a local MLflow server here)
mlflow.set_tracking_uri("http://127.0.0.1:5000")

# Set the experiment
mlflow.set_experiment('Data Drift Evaluation with Evidently')

# Start a new run
with mlflow.start_run() as run:  # optionally pass run_name='test'

    # Log metrics
    metrics = eval_drift(application_train, 
                         application_test, 
                         column_mapping=data_columns)
    for feature in metrics:
        mlflow.log_metric(feature[0], round(feature[1], 3))

    print(run.info)

2023/02/25 13:20:35 INFO mlflow.tracking.fluent: Experiment with name 'Data Drift Evaluation with Evidently' does not exist. Creating a new experiment.


<RunInfo: artifact_uri='mlflow-artifacts:/647971788708510716/44bfca540b9d4b4eb2d7b26fea98e866/artifacts', end_time=None, experiment_id='647971788708510716', lifecycle_stage='active', run_id='44bfca540b9d4b4eb2d7b26fea98e866', run_name='mysterious-panda-89', run_uuid='44bfca540b9d4b4eb2d7b26fea98e866', start_time=1677327636401, status='RUNNING', user_id='alexandredelaguillaumie'>


In [None]:
data_drift_report.save_html("data_drift.html")
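When logging drift scores as in the MLflow run above, note that MLflow validates metric names. It accepts alphanumerics, underscores, dashes, periods, spaces, and slashes, and the exact set may vary by version. Column names containing other characters would therefore be rejected. The sketch below is an optional pre-processing step, not part of the run above. `to_mlflow_metrics` is a hypothetical helper and the sample scores are made up.

```python
import re


def to_mlflow_metrics(drifts, ndigits=3):
    """Turn (feature, score) pairs into a dict with sanitized metric names."""
    metrics = {}
    for feature, score in drifts:
        # Replace any character outside the conservative allowed set with "_".
        name = re.sub(r"[^A-Za-z0-9_\-. /]", "_", str(feature))
        metrics[name] = round(float(score), ndigits)
    return metrics


# Invented sample input: one clean name and one with disallowed characters
sample_drifts = [("AMT_CREDIT", 0.12345), ("income (total)", 0.5)]
metrics = to_mlflow_metrics(sample_drifts)
print(metrics)
```

Inside an active run, the resulting dictionary could be passed to `mlflow.log_metrics(metrics)` to log all scores in one call. Two different column names can map to the same sanitized name, so a real pipeline may need a collision check.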