# Data Monitoring

**Here we look at how to monitor data using feature drift and target drift.**

In [None]:
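### note: the evidently.dashboard and evidently.model_profile modules used below belong to Evidently's legacy API,
### which recent releases replaced with Reports, so an older Evidently release may be needed
!pip install evidently alibi_detect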

**First we test Evidently, a dashboarding tool that displays data drift between current inference data and a reference dataset.**

In [3]:
import pandas as pd
from evidently.dashboard import Dashboard
from evidently.pipeline.column_mapping import ColumnMapping
from evidently.dashboard.tabs import DataDriftTab, NumTargetDriftTab

from evidently.model_profile import Profile
from evidently.model_profile.sections import DataDriftProfileSection, NumTargetDriftProfileSection

In [4]:
### load the usual Chicago trip CSV and drop the rows with null values
df = pd.read_csv('chicagodata/trip.csv').dropna()


In [None]:
### verify you get what you want
df.head()

In [7]:
### keep only numeric columns so that every feature has a compatible type
numerics = ['int','float']
df = df.select_dtypes(include=numerics)

In [8]:
### verify you got only numeric data, and no null data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8258 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   tips                       8258 non-null   float64
 1   trip_seconds               8258 non-null   float64
 2   trip_miles                 8258 non-null   float64
 3   pickup_community_area      8258 non-null   float64
 4   pickup_centroid_latitude   8258 non-null   float64
 5   pickup_centroid_longitude  8258 non-null   float64
 6   dropoff_community_area     8258 non-null   float64
 7   fare                       8258 non-null   float64
 8   tolls                      8258 non-null   float64
 9   extras                     8258 non-null   float64
 10  trip_total                 8258 non-null   float64
dtypes: float64(11)
memory usage: 774.2 KB


In [11]:
### define the label (tips as usual)
target = 'tips'
### define the features as all numerical columns except target in the data frame
features = df.drop(columns=[target]).columns.tolist()

### Drift detection with Evidently

In [13]:
column_mapping = ColumnMapping()
### assign target to column_mapping
column_mapping.target = target
### assign features to column_mapping
column_mapping.numerical_features = features

In [14]:
### create the reference sample: 1000 rows picked at random from the first 5000 rows of the DF
ref_data_sample = df[:5000].sample(1000, random_state=0)
### create the "prod" sample: 50 rows picked at random from the remaining rows (everything after the first 5000)
prod_data_sample = df[5000:].sample(50, random_state=0)

In [20]:
### Create a dashboard bundle with feature drift and target drift
ca_data_and_target_drift_dashboard = Dashboard(
    tabs=[
        DataDriftTab(verbose_level=0),
        NumTargetDriftTab(verbose_level=0),
    ]
)


In [21]:
### Calculate the drifts
ca_data_and_target_drift_dashboard.calculate(
    ref_data_sample, 
    prod_data_sample, 
    column_mapping=column_mapping
)


In [22]:
### save dashboard as reusable html
ca_data_and_target_drift_dashboard.save('./evi_dashboard.html')

Open the HTML file to see the results. If the viewer shows a prompt at the top, click "Trust HTML".

For each feature, you can view the p-value, the distribution plot, and the drift plot.

![evid](./images/evid.png)
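
To make the per-feature p-value less of a black box, here is a minimal sketch of a two-sample Kolmogorov-Smirnov test in plain Python. It assumes that this is the test behind the numerical-feature p-values in the dashboard (our understanding of this generation of Evidently, not something the dashboard output confirms). The helper names `ks_statistic` and `ks_p_value`, the asymptotic p-value approximation, and the toy samples are ours, for illustration only.

```python
import math


def ks_statistic(a, b):
    """Largest vertical gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Move both pointers past every value <= x, then compare the ECDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_p_value(d, n, m, terms=100):
    """Approximate two-sided p-value from the asymptotic Kolmogorov distribution."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.05:
        return 1.0  # the series converges slowly here; the p-value is ~1
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
        for k in range(1, terms + 1)
    )
    return max(0.0, min(1.0, 2.0 * total))


# Toy data (hypothetical): a reference of 1000 evenly spread values in [0, 1)
reference = [i / 1000 for i in range(1000)]
# 50 "production" values covering the same range: no drift expected
prod_same = [i / 50 for i in range(50)]
# 50 "production" values squeezed into [0.5, 1): drift expected
prod_shifted = [0.5 + i / 100 for i in range(50)]

p_same = ks_p_value(ks_statistic(reference, prod_same), 1000, 50)
p_shifted = ks_p_value(ks_statistic(reference, prod_shifted), 1000, 50)
print(p_same, p_shifted)
```

A small p-value means the two samples are unlikely to come from the same distribution. In practice `scipy.stats.ks_2samp` does the same job and is the better choice outside of a teaching sketch.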

### Drift detection with Alibi detect

In [25]:
from alibi_detect.cd import TabularDrift
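### another type of drift detector, Alibi Detect, which is used more in production stacks (e.g. embedded in an InferenceService)
### categories_per_feature keys are column positions in `features`: 0 is trip_seconds and 3 is pickup_centroid_latitude;
### a value of None means the categories are inferred from x_ref
cd = TabularDrift(x_ref=ref_data_sample[features].values, p_val=0.05, categories_per_feature={0: None, 3: None})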
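
The keys of `categories_per_feature` are positions in the array passed as `x_ref`, so they follow the order of `features`, not the order of the original data frame. Hard-coded positions are easy to get wrong once the target column is dropped. The sketch below derives them from column names instead. Which columns should be treated as categorical is an assumption on our side (the two community-area columns look like natural candidates), and `categorical_columns` is a name introduced only for this example.

```python
# Feature order as produced by df.drop(columns=[target]).columns.tolist()
features = [
    "trip_seconds", "trip_miles", "pickup_community_area",
    "pickup_centroid_latitude", "pickup_centroid_longitude",
    "dropoff_community_area", "fare", "tolls", "extras", "trip_total",
]

# Assumption: these are the columns we want tested as categorical
categorical_columns = ["pickup_community_area", "dropoff_community_area"]

# Map each categorical column to its position; None lets the detector infer the categories
categories_per_feature = {features.index(name): None for name in categorical_columns}
print(categories_per_feature)
```

The resulting dictionary can be passed to `TabularDrift` in place of the literal one. Doing so changes which test is applied to which column, so the p-values would differ from the ones shown in this notebook.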

In [26]:
### get all features from df
X = df[features]

In [27]:
### Here we define 3 data scopes to look for drift:

# reference (taken from the first example)
X_ref = ref_data_sample[features]

# t0 data simulation (50 rows sampled from the rows after the first 5000)
X_t0 = df[5000:].sample(50, random_state=44)[features]

# t1 data simulation (25 rows sampled from the rows after the first 5000)
X_t1 = df[5000:].sample(25, random_state=66)[features]


X_ref.shape, X_t0.shape, X_t1.shape

((1000, 10), (50, 10), (25, 10))

In [28]:
X_t0.head()

| | trip_seconds | trip_miles | pickup_community_area | pickup_centroid_latitude | pickup_centroid_longitude | dropoff_community_area | fare | tolls | extras | trip_total |
|---|---|---|---|---|---|---|---|---|---|---|
| 9460 | 3849.0 | 34.54 | 76.0 | 41.979071 | -87.90304 | 76.0 | 84.5 | 0.0 | 31.0 | 115.5 |
| 7329 | 557.0 | 1.0 | 8.0 | 41.900266 | -87.632109 | 8.0 | 6.75 | 0.0 | 0.0 | 6.75 |
| 8840 | 478.0 | 1.37 | 32.0 | 41.880994 | -87.632746 | 28.0 | 7.0 | 0.0 | 0.0 | 7.0 |
| 9064 | 780.0 | 2.8 | 8.0 | 41.899156 | -87.626211 | 28.0 | 10.5 | 0.0 | 0.0 | 14.0 |
| 7502 | 480.0 | 0.1 | 32.0 | 41.880994 | -87.632746 | 8.0 | 8.0 | 0.0 | 0.0 | 10.0 |


In [32]:
### run the detector on the t0 batch against the reference, returning p-values and distances
preds = cd.predict(x=X_t0.values, drift_type='batch', return_p_val=True, return_distance=True)

In [53]:
### let's see the p-values for all features
pd.DataFrame(preds['data']['p_val'],index=features,columns=['p_values']).head(10)

| | p_values |
|---|---|
| trip_seconds | 0.612817 |
| trip_miles | 0.006438 |
| pickup_community_area | 0.417747 |
| pickup_centroid_latitude | 2.4e-05 |
| pickup_centroid_longitude | 0.308534 |
| dropoff_community_area | 0.139406 |
| fare | 0.012317 |
| tolls | 1.0 |
| extras | 0.087117 |
| trip_total | 0.035509 |

In [30]:
### Ask the drift detector to give a global report about feature drift
labels = ['No!', 'Yes!']
print('Drift? {}'.format(labels[preds['data']['is_drift']]))

Drift? Yes!


In [54]:
### Ask for the threshold that separates drifting data from the rest
preds['data']['threshold']

0.005
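
The threshold is 0.005 rather than the 0.05 passed as `p_val` because the detector tests ten features at once and corrects for multiple comparisons. The sketch below reproduces the batch verdict by hand, assuming the default Bonferroni correction (the `correction` argument was not set when the detector was built). The p-values are copied from the output shown earlier in this notebook; the variable names are ours.

```python
# Per-feature p-values copied from the detector output
p_values = {
    "trip_seconds": 0.612817,
    "trip_miles": 0.006438,
    "pickup_community_area": 0.417747,
    "pickup_centroid_latitude": 2.4e-05,
    "pickup_centroid_longitude": 0.308534,
    "dropoff_community_area": 0.139406,
    "fare": 0.012317,
    "tolls": 1.0,
    "extras": 0.087117,
    "trip_total": 0.035509,
}

p_val = 0.05  # significance level given to the detector

# Bonferroni: divide the significance level by the number of tests
threshold = p_val / len(p_values)

# Features that pass the corrected threshold, and those that would pass without correction
drifted = [name for name, p in p_values.items() if p < threshold]
uncorrected = [name for name, p in p_values.items() if p < p_val]

# The batch is flagged as soon as one feature falls below the corrected threshold
is_drift = int(len(drifted) > 0)
print(threshold, drifted, is_drift)
```

Only `pickup_centroid_latitude` falls below the corrected threshold, and that single feature is enough to flag the batch. Without the correction, four features would have been flagged.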