# Notebook Summary 


### Quickstart

  1. Import the etiq library. For installation instructions, see our docs (https://docs.etiq.ai/)

  2. Log in to the dashboard so that you can send results to your dashboard instance (the Etiq AWS instance if you use the SaaS version). To deploy on your own cloud instance, get in touch (info@etiq.ai)

  3. Create or open a project 
  
### Feature drift


  4. Load the Adult dataset
  
  5. Create a drifted dataset based on Adult, for example purposes
  
  6. Load your config file and create your snapshot
  
  7. Scan for feature drift 
  
  
### Concept & target drift
  
  8. Create drifted datasets based on Adult, for example purposes
  
  9. Load your config file and create your snapshot
  
  10. Scan for feature, concept and target drift
  


## What is drift?

Drift can impact your model in production and make it perform worse than you initially expected. 

There are a few different kinds of drift:

1. Feature drift

Feature drift takes place when the distributions of the input features change. For instance, perhaps you built your model on a sample dataset from the winter period and it is now summer, so your model predicting what kind of dessert people are more likely to buy is no longer as accurate.


2. Target drift

Like feature drift, target drift is a change in distribution from one time period to the next, but here it is the distribution of the predicted feature (the target) that changes.


3. Concept drift

Concept drift occurs when the relationship between the features and the predicted target changes over time.


4. Prediction drift

Prediction drift refers to cases where something changes in the model scoring itself while it runs in production, so that the same or a similar input dataset produces different predictions in the later period than it did in the earlier one. We do not include scans for prediction drift: given our main use case right now (classification), we think the other scans are probably more likely to uncover these issues.
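The difference between the first three drift types can be made concrete with a few lines of plain Python. This is only an illustrative sketch: the `(weather, bought_dessert)` rows are hypothetical, and the helper names `proportions` and `positive_rate` are made up for this example rather than taken from the etiq library.

```python
from collections import Counter

def proportions(values):
    """Share of each distinct value, e.g. {'cold': 0.8, 'hot': 0.2}."""
    counts = Counter(values)
    return {key: counts[key] / len(values) for key in sorted(counts)}

def positive_rate(rows, feature_value):
    """P(target == 1 | feature == feature_value) for (feature, target) rows."""
    targets = [y for x, y in rows if x == feature_value]
    return sum(targets) / len(targets)

# Hypothetical (weather, bought_dessert) rows for a reference period
reference = [("cold", 0)] * 60 + [("cold", 1)] * 20 + [("hot", 0)] * 5 + [("hot", 1)] * 15
# Feature drift: far more hot days, same buying behaviour per weather type
feature_drifted = [("cold", 0)] * 15 + [("cold", 1)] * 5 + [("hot", 0)] * 20 + [("hot", 1)] * 60
# Concept drift: same weather mix, different buying behaviour per weather type
concept_drifted = [("cold", 0)] * 40 + [("cold", 1)] * 40 + [("hot", 0)] * 15 + [("hot", 1)] * 5

for name, rows in [("reference", reference),
                   ("feature drift", feature_drifted),
                   ("concept drift", concept_drifted)]:
    print(name,
          "| P(weather):", proportions([x for x, _ in rows]),
          "| P(bought):", proportions([y for _, y in rows]),
          "| P(bought | hot):", positive_rate(rows, "hot"))
```

In the feature-drifted rows the weather mix changes while `P(bought | hot)` stays the same; in the concept-drifted rows the weather mix is unchanged but the conditional rate moves. Note that the target distribution `P(bought)` shifts in both cases, which is why the drift types are usually checked together.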



## Set-up

In [1]:

import etiq



Thanks for trying out the ETIQ.ai toolkit!

Visit our getting started documentation at https://docs.etiq.ai/

Visit our Slack channel at https://etiqcore.slack.com/ for support or feedback.



In [2]:
from etiq import login as etiq_login
etiq_login("https://dashboard.etiq.ai/", "<token>")


(Dashboard supplied updated license information)


Connection successful. Projects and pipelines will be displayed in the dashboard. 😀

In [3]:
# Can get/create a single named project
project = etiq.projects.open(name="Drift_Scans")

## Create the test datasets based on the Adult Income Dataset


To illustrate some of the library's features, we build a model that predicts whether an applicant earns more or less than 50K, using the Adult dataset from https://archive.ics.uci.edu/ml/datasets/adult.

In [4]:
# Loading a dataset. We're using the adult dataset
data = etiq.utils.load_sample("adultdata")
data.head()


Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [5]:
from etiq.transforms import LabelEncoder
import pandas as pd
import numpy as np 

# use a LabelEncoder to transform categorical variables
cont_vars = ['age', 'educational-num', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_vars = list(set(data.columns.values) - set(cont_vars))

label_encoders = {}
data_encoded = pd.DataFrame()
for i in cat_vars:
    label = LabelEncoder()
    data_encoded[i] = label.fit_transform(data[i])
    label_encoders[i] = label

data_encoded.set_index(data.index, inplace=True)
data_encoded = pd.concat([data.loc[:, cont_vars], data_encoded], axis=1).copy()


In [6]:

# Create the "drifted" dataset
todays_dataset_df = data_encoded.copy()
todays_dataset_df["hours-per-week"] = todays_dataset_df["hours-per-week"].multiply(1.2)
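To get a feel for how much a 1.2x scaling moves a column, the gap between the empirical distribution functions of the original and scaled values can be measured directly. The sketch below computes the two-sample Kolmogorov-Smirnov statistic in plain Python. The short `hours` list is hypothetical stand-in data, not the Adult column, and `ks_statistic` is a helper written for this example, not the etiq implementation.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The gap can only change at an observed value, so checking those is enough
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )

# Hypothetical weekly hours, then the same 1.2x scaling used in the notebook
hours = [20, 30, 40, 40, 40, 50, 60]
scaled_hours = [h * 1.2 for h in hours]

print("identical samples:", ks_statistic(hours, hours))
print("scaled sample:    ", ks_statistic(hours, scaled_hours))
```

A statistic of 0 means the two empirical distributions coincide; values closer to 1 mean they barely overlap.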


## Calculate Feature Drift

1. Load the config

2. Log the datasets & create the snapshot - no model needed for this scan

3. Run the feature drift scan

This can happen at any point in the pipeline and through a variety of ways

In [7]:
etiq.load_config("./drift-config.json")


{'dataset': {'label': 'income',
  'bias_params': {'protected': 'gender',
   'privileged': 1,
   'unprivileged': 0,
   'positive_outcome_label': 1,
   'negative_outcome_label': 0},
  'train_valid_test_splits': [0.0, 1.0, 0.0],
  'remove_protected_from_features': True},
 'scan_drift_metrics': {'thresholds': {'psi': [0.0, 0.15],
   'kolmogorov_smirnov': [0.05, 1.0]},
  'drift_measures': ['kolmogorov_smirnov', 'psi']}}
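The notebook loads `./drift-config.json` but does not show the file itself. Assuming the file simply mirrors the dictionary printed above (an assumption, since the on-disk file is not part of this notebook), it could be generated like this:

```python
import json

# Mirrors the dictionary printed by etiq.load_config above
config = {
    "dataset": {
        "label": "income",
        "bias_params": {
            "protected": "gender",
            "privileged": 1,
            "unprivileged": 0,
            "positive_outcome_label": 1,
            "negative_outcome_label": 0,
        },
        "train_valid_test_splits": [0.0, 1.0, 0.0],
        "remove_protected_from_features": True,
    },
    "scan_drift_metrics": {
        "thresholds": {
            "psi": [0.0, 0.15],
            "kolmogorov_smirnov": [0.05, 1.0],
        },
        "drift_measures": ["kolmogorov_smirnov", "psi"],
    },
}

config_text = json.dumps(config, indent=2)
print(config_text)

# To save it next to the notebook, uncomment:
# from pathlib import Path
# Path("drift-config.json").write_text(config_text)
```

Note that Python's `True` is serialized as JSON `true`, so the file will not look character-for-character like the printed dictionary.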

In [8]:
# Baseline (comparison) dataset: the original encoded data
dataset_s = etiq.SimpleDatasetBuilder.from_dataframe(data_encoded, target_feature='income').build()
# Current dataset: the copy with the drifted "hours-per-week" column
todays_dataset_s = etiq.SimpleDatasetBuilder.from_dataframe(todays_dataset_df, target_feature='income').build()


In [9]:
# Create the snapshot
snapshot = project.snapshots.create(name="Test Snapshot", dataset=todays_dataset_s, comparison_dataset=dataset_s, model=None)


In [10]:
# Run the drift scan
(segments, issues, issue_summary) = snapshot.scan_drift_metrics()

INFO:etiq.pipeline.DriftPipeline0382:Starting pipeline
INFO:etiq.pipeline.DriftPipeline0382:Calculated drift measures.
INFO:etiq.pipeline.DriftPipeline0382:Identifying drift measure issues.
INFO:etiq.pipeline.DriftPipeline0382:Completed pipeline


In [11]:
issues

Unnamed: 0,name,feature,segment,measure,measure_value,metric,metric_value,threshold
0,feature_drift_above_threshold,hours-per-week,all,<function psi at 0x7f02d045aa60>,0.182449,,,"[0.0, 0.15]"
1,feature_drift_below_threshold,hours-per-week,all,<function kolmogorov_smirnov at 0x7f02d045aaf0>,0.0,,,"[0.05, 1.0]"
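Judging from the issue names and thresholds in this table, each threshold appears to be a `[lower, upper]` range of acceptable values, with an issue raised when the measure falls outside it. That reading is inferred from the output, not taken from the etiq documentation, and the handling of values exactly on a boundary is an assumption. A minimal sketch of the logic, using a hypothetical `classify` helper:

```python
def classify(value, threshold):
    """Return an issue suffix if value lies outside [lower, upper], else None."""
    lower, upper = threshold
    if value < lower:
        return "below_threshold"
    if value > upper:
        return "above_threshold"
    return None  # inside the accepted range, so no issue

# Measure values and thresholds copied from the issues table
print("psi:", classify(0.182449, [0.0, 0.15]))
print("kolmogorov_smirnov:", classify(0.0, [0.05, 1.0]))
print("in-range example:", classify(0.10, [0.0, 0.15]))
```

Under this reading, the PSI value for `hours-per-week` is too high, while the Kolmogorov-Smirnov value is too low, which would be consistent with that measure being reported as a p-value rather than as a test statistic.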


In [12]:
issue_summary

Unnamed: 0,name,metric,measure,features,segments,total_issues_tested,issues_found,threshold
0,feature_drift_below_threshold,,<function psi at 0x7f02d045aa60>,{},{},14,0,"[0.0, 0.15]"
1,feature_drift_above_threshold,,<function psi at 0x7f02d045aa60>,{hours-per-week},{all},14,1,"[0.0, 0.15]"
2,feature_drift_below_threshold,,<function kolmogorov_smirnov at 0x7f02d045aaf0>,{hours-per-week},{all},14,1,"[0.05, 1.0]"
3,feature_drift_above_threshold,,<function kolmogorov_smirnov at 0x7f02d045aaf0>,{},{},14,0,"[0.05, 1.0]"
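For readers unfamiliar with PSI (population stability index), a common definition sums `(a_i - e_i) * ln(a_i / e_i)` over bins, where `e_i` and `a_i` are the shares of the baseline and current data in bin `i`. The sketch below follows that definition in plain Python. The binning choices (ten equal-width bins over the baseline range, out-of-range values clamped into the edge bins, a small floor for empty bins) are assumptions of this example, and the input is a hypothetical column, so it is not expected to reproduce the 0.182449 reported by etiq.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index of `actual` relative to `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width) if width > 0 else 0
            # Clamp values outside the baseline range into the edge bins
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical weekly hours and the same 1.2x scaling used in the notebook
hours = list(range(20, 61))
scaled_hours = [h * 1.2 for h in hours]

print("baseline vs itself:", psi(hours, hours))
print("baseline vs scaled:", psi(hours, scaled_hours))
```

Every term in the sum is non-negative, so PSI is 0 only when the binned shares match exactly and grows as the two histograms diverge.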


## Target Drift and Concept Drift 

1. Create the test datasets to illustrate drift

2. Load the config file

3. Create the snapshot 

4. Scan for the different drift types

In [13]:
# Create a test dataset to illustrate drift

# Load the Adult sample dataset again
data = etiq.utils.load_sample("adultdata")
# Randomly permute the targets: the income distribution stays the same,
# but its relationship to the features is broken
data_target_permutated = data.copy()
data_target_permutated['income'] = np.random.permutation(data['income'])


# use a LabelEncoder to transform categorical variables
cont_vars = ['age', 'educational-num', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_vars = list(set(data.columns.values) - set(cont_vars))

label_encoders = {}
data_encoded = pd.DataFrame()
data_target_permutated_encoded = pd.DataFrame()
for i in cat_vars:
    label = LabelEncoder()
    data_encoded[i] = label.fit_transform(data[i])
    data_target_permutated_encoded[i] = label.transform(data_target_permutated[i])
    label_encoders[i] = label

data_encoded.set_index(data.index, inplace=True)
data_encoded = pd.concat([data.loc[:, cont_vars], data_encoded], axis=1).copy()

data_target_permutated_encoded.set_index(data_target_permutated.index, inplace=True)
data_target_permutated_encoded = pd.concat([data_target_permutated.loc[:, cont_vars], data_target_permutated_encoded], axis=1).copy()
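Why permute the target? Shuffling a column keeps its overall distribution exactly as it was while destroying its link to the features, so it is a convenient way to simulate concept drift without target drift. The following standalone sketch shows the effect on hypothetical data (the `feature` and `target` lists and the `agreement` helper are made up for this example):

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data in which the target is fully determined by the feature
feature = [0] * 50 + [1] * 50
target = list(feature)

# Similar in spirit to np.random.permutation: a shuffled copy of the column
permuted_target = rng.sample(target, len(target))

def agreement(xs, ys):
    """Fraction of positions where the two columns hold the same value."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)

print("same class counts:", Counter(target) == Counter(permuted_target))
print("agreement before: ", agreement(feature, target))
print("agreement after:  ", agreement(feature, permuted_target))
```

The class counts are identical before and after, so a scan that looks only at the target distribution has nothing to find, whereas the feature-to-target relationship is clearly weakened.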



In [14]:
# Load the config file
etiq.load_config("drift-config_concept.json")

{'dataset': {'label': 'income',
  'bias_params': {'protected': 'gender',
   'privileged': 1,
   'unprivileged': 0,
   'positive_outcome_label': 1,
   'negative_outcome_label': 0},
  'train_valid_test_splits': [0.0, 1.0, 0.0],
  'remove_protected_from_features': True},
 'scan_drift_metrics': {'thresholds': {'psi': [0.0, 0.2],
   'kolmogorov_smirnov': [0.05, 1.0]},
  'drift_measures': ['kolmogorov_smirnov', 'psi']},
 'scan_concept_drift_metrics': {'thresholds': {'earth_mover_distance': [0.0,
    0.05],
   'kl_divergence': [0.0, 0.2],
   'jensen_shannon_distance': [0.0, 0.2]},
  'drift_measures': ['earth_mover_distance']},
 'scan_target_drift_metrics': {'thresholds': {'psi': [0.0, 0.2],
   'kolmogorov_smirnov': [0.05, 1.0]},
  'drift_measures': ['psi']}}

In [15]:
#Using the test datasets, check for drift 

dataset1 = etiq.SimpleDatasetBuilder.from_dataframe(data_encoded, target_feature='income').build()
dataset1.categorical_features = cat_vars
dataset1.continuous_features = cont_vars
dataset1.target_categorical = True

dataset2 = etiq.SimpleDatasetBuilder.from_dataframe(data_target_permutated_encoded, target_feature='income').build()
dataset2.categorical_features = cat_vars
dataset2.continuous_features = cont_vars
dataset2.target_categorical = True


# Creating a snapshot
snapshot = project.snapshots.create(name="Test Snapshot", dataset=dataset1, comparison_dataset=dataset2, model=None)



In [16]:
#Scan for different drift types

(segments_f, issues_f, issue_summary_f) = snapshot.scan_drift_metrics()
(segments_t, issues_t, issue_summary_t) = snapshot.scan_target_drift_metrics()
(segments_c, issues_c, issue_summary_c) = snapshot.scan_concept_drift_metrics()

INFO:etiq.pipeline.DriftPipeline0029:Starting pipeline
INFO:etiq.pipeline.DriftPipeline0029:Calculated drift measures.
INFO:etiq.pipeline.DriftPipeline0029:Identifying drift measure issues.
INFO:etiq.pipeline.DriftPipeline0029:Completed pipeline
INFO:etiq.pipeline.DriftPipeline0579:Starting pipeline
INFO:etiq.pipeline.DriftPipeline0579:Calculated target drift measures.
INFO:etiq.pipeline.DriftPipeline0579:Identifying target drift measure issues.
INFO:etiq.pipeline.DriftPipeline0579:Completed pipeline
INFO:etiq.pipeline.DriftPipeline0407:Starting pipeline
INFO:etiq.pipeline.DriftPipeline0407:Calculated concept drift measures
INFO:etiq.pipeline.DriftPipeline0407:Identifying concept drift issues.
INFO:etiq.pipeline.DriftPipeline0407:Completed pipeline


In [17]:
issue_summary_f

Unnamed: 0,name,metric,measure,features,segments,total_issues_tested,issues_found,threshold
0,feature_drift_below_threshold,,<function psi at 0x7f02d045aa60>,{},{},14,0,"[0.0, 0.2]"
1,feature_drift_above_threshold,,<function psi at 0x7f02d045aa60>,{},{},14,0,"[0.0, 0.2]"
2,feature_drift_below_threshold,,<function kolmogorov_smirnov at 0x7f02d045aaf0>,{},{},14,0,"[0.05, 1.0]"
3,feature_drift_above_threshold,,<function kolmogorov_smirnov at 0x7f02d045aaf0>,{},{},14,0,"[0.05, 1.0]"


In [18]:
issue_summary_t

Unnamed: 0,name,metric,measure,features,segments,total_issues_tested,issues_found,threshold
0,target_drift_below_threshold,,<function psi at 0x7f02d045aa60>,{},{},1,0,"[0.0, 0.2]"
1,target_drift_above_threshold,,<function psi at 0x7f02d045aa60>,{},{},1,0,"[0.0, 0.2]"


In [19]:
issue_summary_c

Unnamed: 0,name,metric,measure,features,segments,total_issues_tested,issues_found,threshold
0,concept_drift_below_threshold,,<function earth_mover_distance at 0x7f02d045ab80>,{},{},14,0,"[0.0, 0.05]"
1,concept_drift_above_threshold,,<function earth_mover_distance at 0x7f02d045ab80>,"{gender, workclass, race, native-country, educ...",{all},14,11,"[0.0, 0.05]"
2,concept_drift_below_threshold,,<function jensen_shannon_distance at 0x7f02d04...,{},{},14,0,"[0.0, 0.2]"
3,concept_drift_above_threshold,,<function jensen_shannon_distance at 0x7f02d04...,"{workclass, native-country, education, marital...",{all},14,7,"[0.0, 0.2]"
4,concept_drift_below_threshold,,<function kl_divergence at 0x7f02d045aca0>,{},{},14,0,"[0.0, 0.2]"
5,concept_drift_above_threshold,,<function kl_divergence at 0x7f02d045aca0>,"{workclass, native-country, education, age, oc...",{all},14,6,"[0.0, 0.2]"
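The three measures named in this summary are standard distances between probability distributions. The sketch below gives textbook versions for two discrete distributions over the same ordered bins, using natural logarithms. How etiq builds the distributions it compares for concept drift is not shown in this notebook, so the inputs `p` and `q` here are hypothetical and the functions are not the library's implementation.

```python
import math

def kl_divergence(p, q):
    """KL(p || q); requires q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (symmetric in p and q)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt((kl_divergence(p, m) + kl_divergence(q, m)) / 2)

def earth_mover_distance(p, q):
    """1-D earth mover's distance on unit-spaced bins: summed |CDF gap|."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total

# Hypothetical class probabilities before and after a change
p = [0.75, 0.25]
q = [0.5, 0.5]

print("KL(p || q):", kl_divergence(p, q))
print("JS distance:", jensen_shannon_distance(p, q))
print("earth mover's distance:", earth_mover_distance(p, q))
```

All three are 0 for identical distributions. KL divergence is asymmetric and unbounded, while the Jensen-Shannon distance is symmetric and bounded, which is one reason the same numeric threshold can flag different sets of features under each measure.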
