# Detect High Model Drift 
<b>With this tutorial you:</b><br />
Understand how to use Eurybia to detect data drift

Contents:
- Detect data drift  
- Compile drift over the years

This public dataset comes from:

https://www.kaggle.com/sobhanmoosavi/us-accidents/version/10

---
Acknowledgements
- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident Dataset." 2019.
- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.
---

In this tutorial, the data are not loaded raw: they have already been prepared to make the tutorial easier to follow. The preparation notebook is available here:
https://github.com/MAIF/eurybia/blob/master/eurybia/data/dataprep_US_car_accidents.ipynb

**Requirements notice**: the following tutorial may use third-party modules not included in Eurybia.  
You can find them all in one file [on our GitHub repository](https://github.com/MAIF/eurybia/blob/master/requirements.dev.txt), or you can manually install any that you are missing.

In [2]:
import pandas as pd
from category_encoders import OrdinalEncoder
import catboost
from eurybia import SmartDrift
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np

## Import the dataset and split it into training and production datasets

In [3]:
from eurybia.data.data_loader import data_loading

In [4]:
#df_car_accident = data_loading("us_car_accident")
df_car_accident = pd.read_csv("../../eurybia/data/US_Accidents_extract.csv", engine="python")

In [5]:
df_car_accident.head()

| | Start_Lat | Start_Lng | Distance(mi) | Temperature(F) | Humidity(%) | Visibility(mi) | day_of_week_acc | Nautical_Twilight | season_acc | target | target_multi | year_acc | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26.5 | -81.8 | 0.0 | 77.0 | 50.0 | 10.0 | 1 | Day | spring | 0 | 2 | 2017 | Accident on Winged Foot Dr at Lee Rd. |
| 1 | 29.5 | -98.5 | 0.0 | 50.0 | 83.0 | 10.0 | 4 | Day | winter | 0 | 2 | 2017 | Ramp to I-410 - Accident. |
| 2 | 33.9 | -118.3 | 0.0 | 49.0 | 36.0 | 10.0 | 2 | Day | winter | 1 | 3 | 2018 | Right hand shoulder blocked due to accident on... |
| 3 | 42.5 | -83.2 | 1.0 | 28.0 | 93.0 | 7.0 | 3 | Day | winter | 0 | 2 | 2023 | Slow traffic on John C Lodge Fwy S - Morris Ad... |
| 4 | 42.4 | -83.2 | 0.0 | 40.0 | 82.0 | 10.0 | 3 | Day | autumn | 0 | 2 | 2016 | Accident on 7 Mile Rd at Lindsay St. |


In [7]:
df_car_accident.shape

(50000, 13)

In [8]:
# Treat the "year_acc" column as the reference date.
# A model was trained on the 2016 data. In the following years, we want to detect data drift on the new production data that the model has to score.
df_accident_baseline = df_car_accident.loc[df_car_accident['year_acc'] == 2016]
df_accident_2017 = df_car_accident.loc[df_car_accident['year_acc'] == 2017]
df_accident_2018 = df_car_accident.loc[df_car_accident['year_acc'] == 2018]
df_accident_2019 = df_car_accident.loc[df_car_accident['year_acc'] == 2019]
df_accident_2020 = df_car_accident.loc[df_car_accident['year_acc'] == 2020]
df_accident_2021 = df_car_accident.loc[df_car_accident['year_acc'] == 2021]
df_accident_2022 = df_car_accident.loc[df_car_accident["year_acc"] == 2022]
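
The per-year filtering above can also be expressed as a single grouping step. The following is a minimal sketch on a toy frame, not the tutorial's data: only the `year_acc` column name comes from the cell above, while `toy` and `frames_by_year` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy stand-in for df_car_accident (hypothetical values).
toy = pd.DataFrame({
    "year_acc": [2016, 2016, 2017, 2018, 2018, 2018],
    "target":   [0, 1, 0, 1, 0, 0],
})

# One sub-frame per year, keyed by the year as a plain int.
frames_by_year = {int(year): frame for year, frame in toy.groupby("year_acc")}

# The baseline year is then just one dictionary lookup.
baseline = frames_by_year[2016]
print({year: len(frame) for year, frame in frames_by_year.items()})
```

Keeping the frames in a dictionary makes it easy to loop over the years later instead of naming one variable per year.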

In [9]:
# We will train a classification model to predict the severity of an accident: 0 for a less severe accident, 1 for a severe accident.
# Check the percentage of each class per year.
pd.crosstab(df_car_accident.year_acc, df_car_accident.target, normalize = 'index')*100

| year_acc | target 0 (%) | target 1 (%) |
|---|---|---|
| 2016 | 66.512 | 33.488 |
| 2017 | 64.549672 | 35.450328 |
| 2018 | 63.914226 | 36.085774 |
| 2019 | 71.787486 | 28.212514 |
| 2020 | 81.856 | 18.144 |
| 2021 | 88.8 | 11.2 |
| 2022 | 92.657175 | 7.342825 |
| 2023 | 97.136 | 2.864 |
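
In the output above, the share of severe accidents (class 1) falls from about 33% in 2016 to about 3% in 2023. To make the `normalize='index'` option concrete, here is a small sketch on hypothetical toy data: each row is divided by its own total, so every year sums to 100 after the multiplication.

```python
import pandas as pd

# Toy data (hypothetical): 2016 has 3 mild and 1 severe accident, 2017 has 1 of each.
toy = pd.DataFrame({
    "year_acc": [2016, 2016, 2016, 2016, 2017, 2017],
    "target":   [0, 0, 0, 1, 0, 1],
})

# normalize="index" turns counts into per-row proportions; * 100 gives percentages.
class_share = pd.crosstab(toy["year_acc"], toy["target"], normalize="index") * 100
print(class_share)
```

Here 2016 yields 75% / 25% and 2017 yields 50% / 50%.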


In [10]:
y_df_learning=df_accident_baseline['target'].to_frame()
X_df_learning=df_accident_baseline[df_accident_baseline.columns.difference(["target", "target_multi", "year_acc", "Description"])]

y_df_2017=df_accident_2017['target'].to_frame()
X_df_2017=df_accident_2017[df_accident_2017.columns.difference(["target", "target_multi", "year_acc", "Description"])]

y_df_2018=df_accident_2018['target'].to_frame()
X_df_2018=df_accident_2018[df_accident_2018.columns.difference(["target", "target_multi", "year_acc", "Description"])]

y_df_2019=df_accident_2019['target'].to_frame()
X_df_2019=df_accident_2019[df_accident_2019.columns.difference(["target", "target_multi", "year_acc", "Description"])]

y_df_2020=df_accident_2020['target'].to_frame()
X_df_2020=df_accident_2020[df_accident_2020.columns.difference(["target", "target_multi", "year_acc", "Description"])]

y_df_2021=df_accident_2021['target'].to_frame()
X_df_2021=df_accident_2021[df_accident_2021.columns.difference(["target", "target_multi", "year_acc", "Description"])]

y_df_2022=df_accident_2022["target"].to_frame()
X_df_2022=df_accident_2022[df_accident_2022.columns.difference(["target", "target_multi", "year_acc", "Description"])]
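
The same feature/target split is repeated for every year, so it could be wrapped in a helper. This is a sketch under assumptions: `split_features_target` and `NON_FEATURE_COLUMNS` are hypothetical names, and the toy frame only mimics the tutorial's column names. Note that `Index.difference` returns the remaining labels in sorted order, so the feature columns may not keep the order of the original frame.

```python
import pandas as pd

# Columns excluded from the feature matrix, as in the cell above.
NON_FEATURE_COLUMNS = ["target", "target_multi", "year_acc", "Description"]


def split_features_target(df, target="target", excluded=NON_FEATURE_COLUMNS):
    """Return (X, y): X holds every column not in `excluded`, y is a one-column frame."""
    # Index.difference sorts the remaining labels alphabetically.
    feature_columns = df.columns.difference(excluded)
    return df[feature_columns], df[target].to_frame()


# Toy frame (hypothetical values) with Start_Lng deliberately placed before Start_Lat.
toy = pd.DataFrame({
    "Start_Lng": [-81.8, -98.5],
    "Start_Lat": [26.5, 29.5],
    "target": [0, 1],
    "target_multi": [2, 2],
    "year_acc": [2017, 2017],
    "Description": ["a", "b"],
})
X_toy, y_toy = split_features_target(toy)
print(list(X_toy.columns))
```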

## Building Supervised Model

In [11]:
features = ['Start_Lat', 'Start_Lng', 'Distance(mi)', 'Temperature(F)',
       'Humidity(%)', 'Visibility(mi)', 'day_of_week_acc', 'Nautical_Twilight',
       'season_acc']

In [12]:
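# Columns whose dtype is neither float64 nor int64 are treated as categorical
features_to_encode = [col for col in X_df_learning[features].columns if X_df_learning[col].dtype not in ('float64','int64')]

# Fit the ordinal encoder on the baseline (2016) features only
encoder = OrdinalEncoder(cols=features_to_encode)
encoder = encoder.fit(X_df_learning[features])

# Encode the baseline features before training
X_df_learning_encoded=encoder.transform(X_df_learning)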
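
To illustrate what fitting an ordinal encoding on the baseline and reusing it on later data means, here is a pandas-only sketch. It is not how `category_encoders` implements `OrdinalEncoder`: the names `mappings` and `apply_mappings`, the toy values, and the `-1` placeholder for categories unseen in the baseline are all assumptions of this sketch.

```python
import pandas as pd

# Toy baseline and production frames (hypothetical values).
baseline = pd.DataFrame({"season_acc": ["spring", "winter", "winter", "autumn"],
                         "Start_Lat": [26.5, 29.5, 33.9, 42.4]})
current = pd.DataFrame({"season_acc": ["winter", "summer"],
                        "Start_Lat": [42.5, 40.0]})

# Non-numeric columns are the ones that need encoding.
to_encode = baseline.select_dtypes(exclude="number").columns.tolist()

# Learn one integer code per category, in order of first appearance, on the baseline only.
mappings = {
    col: {cat: code for code, cat in enumerate(pd.unique(baseline[col]), start=1)}
    for col in to_encode
}


def apply_mappings(df):
    """Encode a frame with the codes learned on the baseline."""
    out = df.copy()
    for col, mapping in mappings.items():
        # Categories never seen in the baseline get the placeholder -1.
        out[col] = out[col].map(mapping).fillna(-1).astype(int)
    return out


print(apply_mappings(current))
```

Because the codes are learned once on the baseline, a category that only appears in production ("summer" here) stands out immediately, which is itself a simple drift signal.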

In [13]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X_df_learning_encoded, y_df_learning, train_size=0.75, random_state=1)

In [14]:
train_pool_cat = catboost.Pool(data=Xtrain, label= ytrain, cat_features = features_to_encode)
test_pool_cat = catboost.Pool(data=Xtest, label= ytest, cat_features = features_to_encode)

In [15]:
model = catboost.CatBoostClassifier(loss_function= "Logloss", eval_metric="Logloss",
                                    learning_rate=0.143852,
                                    iterations=300,
                                    l2_leaf_reg=15,
                                    max_depth = 4,
                                    use_best_model=True,
                                    custom_loss=['Accuracy', 'AUC', 'Logloss'])

model = model.fit(train_pool_cat, plot=True,eval_set=test_pool_cat, verbose=False)

In [16]:
proba = model.predict_proba(Xtest)
print(metrics.roc_auc_score(ytest,proba[:,1]))

0.7039381961780464
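
The value printed above is a ROC AUC. As a reminder of what it measures, the sketch below computes it by hand as the share of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. `pairwise_auc` is a hypothetical helper written for illustration; the tutorial itself relies on `metrics.roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Four toy examples: 3 of the 4 positive/negative pairs are ordered correctly.
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one, so the scores around 0.70 to 0.77 reported in this notebook indicate a moderate ranking ability.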


In [17]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X_df_2017, y_df_2017, train_size=0.75, random_state=1)
train_pool_cat = catboost.Pool(data=Xtrain, label= ytrain, cat_features = features_to_encode)
test_pool_cat = catboost.Pool(data=Xtest, label= ytest, cat_features = features_to_encode)
model_2017 = catboost.CatBoostClassifier(loss_function= "Logloss", eval_metric="Logloss",
                                    learning_rate=0.143852,
                                    iterations=300,
                                    l2_leaf_reg=15,
                                    max_depth = 4,
                                    use_best_model=True,
                                    custom_loss=['Accuracy', 'AUC', 'Logloss'])

model_2017 = model_2017.fit(train_pool_cat, plot=True,eval_set=test_pool_cat, verbose=False)
proba = model_2017.predict_proba(Xtest)
print(metrics.roc_auc_score(ytest,proba[:,1]))

0.7272848536244942


In [18]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X_df_2018, y_df_2018, train_size=0.75, random_state=1)
train_pool_cat = catboost.Pool(data=Xtrain, label= ytrain, cat_features = features_to_encode)
test_pool_cat = catboost.Pool(data=Xtest, label= ytest, cat_features = features_to_encode)
model_2018 = catboost.CatBoostClassifier(loss_function= "Logloss", eval_metric="Logloss",
                                    learning_rate=0.143852,
                                    iterations=300,
                                    l2_leaf_reg=15,
                                    max_depth = 4,
                                    use_best_model=True,
                                    custom_loss=['Accuracy', 'AUC', 'Logloss'])

model_2018 = model_2018.fit(train_pool_cat, plot=True,eval_set=test_pool_cat, verbose=False)
proba = model_2018.predict_proba(Xtest)
print(metrics.roc_auc_score(ytest,proba[:,1]))

0.7039528919958548


In [19]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X_df_2019, y_df_2019, train_size=0.75, random_state=1)
train_pool_cat = catboost.Pool(data=Xtrain, label= ytrain, cat_features = features_to_encode)
test_pool_cat = catboost.Pool(data=Xtest, label= ytest, cat_features = features_to_encode)
model_2019 = catboost.CatBoostClassifier(loss_function= "Logloss", eval_metric="Logloss",
                                    learning_rate=0.143852,
                                    iterations=300,
                                    l2_leaf_reg=15,
                                    max_depth = 4,
                                    use_best_model=True,
                                    custom_loss=['Accuracy', 'AUC', 'Logloss'])

model_2019 = model_2019.fit(train_pool_cat, plot=True,eval_set=test_pool_cat, verbose=False)
proba = model_2019.predict_proba(Xtest)
print(metrics.roc_auc_score(ytest,proba[:,1]))

0.7246346082588826


In [20]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X_df_2020, y_df_2020, train_size=0.75, random_state=1)
train_pool_cat = catboost.Pool(data=Xtrain, label= ytrain, cat_features = features_to_encode)
test_pool_cat = catboost.Pool(data=Xtest, label= ytest, cat_features = features_to_encode)
model_2020 = catboost.CatBoostClassifier(loss_function= "Logloss", eval_metric="Logloss",
                                    learning_rate=0.143852,
                                    iterations=300,
                                    l2_leaf_reg=15,
                                    max_depth = 4,
                                    use_best_model=True,
                                    custom_loss=['Accuracy', 'AUC', 'Logloss'])

model_2020 = model_2020.fit(train_pool_cat, plot=True,eval_set=test_pool_cat, verbose=False)
proba = model_2020.predict_proba(Xtest)
print(metrics.roc_auc_score(ytest,proba[:,1]))

0.7728054961205126


In [21]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X_df_2021, y_df_2021, train_size=0.75, random_state=1)
train_pool_cat = catboost.Pool(data=Xtrain, label= ytrain, cat_features = features_to_encode)
test_pool_cat = catboost.Pool(data=Xtest, label= ytest, cat_features = features_to_encode)
model_2021 = catboost.CatBoostClassifier(loss_function= "Logloss", eval_metric="Logloss",
                                    learning_rate=0.143852,
                                    iterations=300,
                                    l2_leaf_reg=15,
                                    max_depth = 4,
                                    use_best_model=True,
                                    custom_loss=['Accuracy', 'AUC', 'Logloss'])

model_2021 = model_2021.fit(train_pool_cat, plot=True,eval_set=test_pool_cat, verbose=False)
proba = model_2021.predict_proba(Xtest)
print(metrics.roc_auc_score(ytest,proba[:,1]))

0.7536908907624633


In [22]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X_df_2022, y_df_2022, train_size=0.75, random_state=1)
train_pool_cat = catboost.Pool(data=Xtrain, label= ytrain, cat_features = features_to_encode)
test_pool_cat = catboost.Pool(data=Xtest, label= ytest, cat_features = features_to_encode)
model_2022 = catboost.CatBoostClassifier(loss_function= "Logloss", eval_metric="Logloss",
                                    learning_rate=0.143852,
                                    iterations=300,
                                    l2_leaf_reg=15,
                                    max_depth = 4,
                                    use_best_model=True,
                                    custom_loss=['Accuracy', 'AUC', 'Logloss'])

model_2022 = model_2022.fit(train_pool_cat, plot=True,eval_set=test_pool_cat, verbose=False)
proba = model_2022.predict_proba(Xtest)
print(metrics.roc_auc_score(ytest,proba[:,1]))

0.7688205736224029
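
The per-year cells above repeat the same split, fit, and score steps. One way to factor them is a small helper called in a loop. The sketch below is illustrative only: `train_and_score` is a hypothetical name, the data are synthetic, and scikit-learn's `LogisticRegression` is used as a lightweight stand-in for the `CatBoostClassifier` of the tutorial so that the example stays self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def train_and_score(X, y, make_model, random_state=1):
    """Fit a fresh model on a 75/25 split and return (model, hold-out AUC)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.75, random_state=random_state
    )
    model = make_model().fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    return model, roc_auc_score(y_test, proba)


# Synthetic stand-in data: two numeric features, label driven mostly by the first one.
rng = np.random.default_rng(0)
datasets = {}
for year in (2017, 2018):
    X = pd.DataFrame(rng.normal(size=(200, 2)), columns=["f1", "f2"])
    y = (X["f1"] + rng.normal(scale=0.5, size=200) > 0).astype(int)
    datasets[year] = (X, y)

# One model and one AUC per year, collected in a dictionary.
auc_by_year = {}
for year, (X, y) in datasets.items():
    _, auc_by_year[year] = train_and_score(X, y, LogisticRegression)

print(auc_by_year)
```

Passing a model factory (`make_model`) rather than a fitted model guarantees that each year starts from an untrained estimator with identical hyperparameters, which is what the repeated cells do by hand.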
