# Notebook Summary 


### Quickstart

  1. Import the etiq library. For installation instructions, see our docs (https://docs.etiq.ai/)

  2. Log in to the dashboard so that scan results are sent to your dashboard instance (the Etiq AWS instance if you use the SaaS version). To deploy on your own cloud instance, get in touch (info@etiq.ai)

  3. Create or open a project 
  
### Feature drift


  4. Load Adult dataset
  
  5. Create drifted dataset based on Adult - for example purposes
  
  6. Load your config file and create your snapshot
  
  7. Scan for feature drift 
  
  
### Concept & target drift
  
  8. Create drifted datasets based on Adult - for example purposes 
  
  9. Load your config file and create your snapshot
  
  10. Scan for feature, concept and target drift
  


## What is drift?

Drift can impact your model in production and make it perform worse than you initially expected. 

There are a few different kinds of drift:

1. Feature drift

Feature drift takes place when the distributions of the input features change. For instance, perhaps you built your model on a sample dataset from the winter period and it is now summer, so your model predicting which kind of dessert people are more likely to buy is no longer as accurate.


2. Target drift 

Like feature drift, target drift is a change in distribution from one time period to the next, but in the target (the predicted feature) rather than in the inputs.


3. Concept drift 

Concept drift occurs when the relationship between the features and the target changes over time.


4. Prediction drift

Prediction drift refers to cases where something happens to the model scoring itself when running in production, so that the same or a similar input dataset yields different predictions in the post-period than it did in the previous period. We do not include scans for prediction drift: given our main use case right now (classification), we think the other scans are more likely to uncover these issues.
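
The first three drift types can be told apart on a toy version of the dessert example. The sketch below is purely illustrative: the data and the helper names `proportions` and `purchase_rate_by_season` are made up for this example and are not part of the Etiq library.

```python
from collections import Counter

def proportions(values):
    """Share of each distinct value in a sample."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}

def purchase_rate_by_season(seasons, bought):
    """P(bought = 1 | season): a crude view of the feature-target relationship."""
    totals, purchases = Counter(), Counter()
    for season, label in zip(seasons, bought):
        totals[season] += 1
        purchases[season] += label
    return {season: purchases[season] / totals[season] for season in totals}

# Reference period: balanced seasons, ice cream mostly bought in summer
ref_seasons = ["winter"] * 5 + ["summer"] * 5
ref_bought = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Feature drift: the mix of seasons changes
drifted_seasons = ["winter"] * 2 + ["summer"] * 8

# Target drift: the share of purchases changes
drifted_bought = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Concept drift: same seasons, same share of purchases, different pairing
flipped_bought = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

print(proportions(ref_seasons), proportions(drifted_seasons))
print(proportions(ref_bought), proportions(drifted_bought))
print(purchase_rate_by_season(ref_seasons, ref_bought),
      purchase_rate_by_season(ref_seasons, flipped_bought))
```

In the last comparison both marginal distributions are identical to the reference period, yet the purchase rate per season is reversed. That is the pattern a concept drift scan is meant to catch.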



## Set-up

In [1]:

import etiq



Thanks for trying out the ETIQ.AI toolkit!

This is a trial version, you have 14 days remaining in your trial period.
Please consider purchasing the full version to continue enjoying all the features of our library.

Visit our getting started documentation at https://docs.etiq.ai/

Visit our Slack channel at https://etiqcore.slack.com/ for support or feedback.
Help improve our product: Call `etiq.enable_telemetry()` to provide
anonymous library usage statistics.
        


In [3]:
from etiq import login as etiq_login
etiq_login("https://dashboard.etiq.ai/", "<token>")


(Dashboard supplied updated license information)


'Connection successful. Projects and pipelines will be displayed in the dashboard. 😀'

In [2]:
# Can get/create a single named project
project = etiq.projects.open(name="Drift_Scans")

## Create the test datasets based on the Adult Income Dataset


To illustrate some of the library's features, we build a model that predicts whether an applicant makes over or under 50K using the Adult dataset from https://archive.ics.uci.edu/ml/datasets/adult.

In [3]:
# Loading a dataset. We're using the adult dataset
data = etiq.utils.load_sample("adultdataset.csv")
data.head()


Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [4]:
from etiq.transforms import LabelEncoder
import pandas as pd
import numpy as np 

# use a LabelEncoder to transform categorical variables
cont_vars = ['age', 'educational-num', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_vars = list(set(data.columns.values) - set(cont_vars))

label_encoders = {}
data_encoded = pd.DataFrame()
for i in cat_vars:
    label = LabelEncoder()
    data_encoded[i] = label.fit_transform(data[i])
    label_encoders[i] = label

data_encoded.set_index(data.index, inplace=True)
data_encoded = pd.concat([data.loc[:, cont_vars], data_encoded], axis=1).copy()
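
For readers unfamiliar with label encoding, the sketch below shows the idea in plain Python on the `workclass` values from the first five rows above. It assumes that `etiq.transforms.LabelEncoder` follows the scikit-learn convention of mapping the sorted distinct values to 0, 1, 2, ...; the notebook does not confirm this, and the helper names are hypothetical.

```python
def fit_label_encoding(values):
    """Map each distinct category to an integer code, in sorted order (assumed convention)."""
    return {category: code for code, category in enumerate(sorted(set(values)))}

def apply_label_encoding(values, mapping):
    """Replace every category by its integer code."""
    return [mapping[value] for value in values]

workclass = ["Private", "Private", "Local-gov", "Private", "?"]
mapping = fit_label_encoding(workclass)
encoded = apply_label_encoding(workclass, mapping)

print(mapping)
print(encoded)
```

Two details of the cell above are worth keeping in mind. `list(set(...) - set(...))` returns the categorical columns in arbitrary order, so wrapping the expression in `sorted(...)` would give a reproducible column order. Also, the target column `income` is part of `cat_vars` and is therefore label-encoded along with the features.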


In [5]:

# Create the "drifted" dataset
todays_dataset_df = data_encoded.copy()
todays_dataset_df["hours-per-week"] = todays_dataset_df["hours-per-week"].multiply(1.2)
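
To get a feel for how far a 20% scaling moves a distribution, the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) can be computed by hand. This is a minimal sketch on a handful of made-up hours values, not the Adult data, and it is not Etiq's implementation of the `kolmogorov_smirnov` measure.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The gap can only change at observed values, so checking those is enough
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )

hours = [20, 30, 40, 40, 40, 45, 50, 60]       # hypothetical sample
drifted_hours = [h * 1.2 for h in hours]       # same transformation as above

print(ks_statistic(hours, hours))
print(ks_statistic(hours, drifted_hours))
```

A statistic of 0 means the two empirical distributions coincide, and values closer to 1 mean they barely overlap.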


## Calculate Feature Drift

1. (Optional) Define custom drift measures

2. Load the config

3. Log the datasets & create the snapshot - no model needed for this scan

4. Run the feature drift scan

Feature drift can be introduced at any point in the pipeline and in a variety of ways.


It can be useful to define a custom measure for drift detection. Etiq supports this through the `drift_measure` decorator.
The example below defines an earth mover's distance between two empirical distributions (after binning them).

In [6]:
import numpy as np
from etiq.drift_measures import drift_measure
from scipy.stats import wasserstein_distance

@drift_measure
def earth_mover_drift_measure(expected_dist, new_dist, number_of_bins=10, bucket_type='bins', **kwargs) -> float:
    def scale_range(values, lower, upper):
        # Rescale values in place so that they span [lower, upper]
        values += -(np.min(values))
        values *= (upper - lower) / np.max(values)
        values += lower
        return values

    # Start from evenly spaced percentages between 0 and 100
    breakpoints = np.arange(0, number_of_bins + 1) / (number_of_bins) * 100
    if bucket_type == 'bins':
        # Equal-width bins across the range of the expected data
        breakpoints = scale_range(breakpoints, np.min(expected_dist), np.max(expected_dist))
    elif bucket_type == 'quantiles':
        # Bin edges at the percentiles of the expected data
        breakpoints = np.stack([np.percentile(expected_dist, b) for b in breakpoints])

    # Proportion of each dataset that falls into each bin
    expected_percents = np.histogram(expected_dist, breakpoints)[0] / len(expected_dist)
    actual_percents = np.histogram(new_dist, breakpoints)[0] / len(new_dist)

    return wasserstein_distance(expected_percents, actual_percents)
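
One point to be aware of: `wasserstein_distance(expected_percents, actual_percents)` receives only the bin proportions, so SciPy compares them as two samples of numbers and has no information about which bin each proportion belongs to. If a position-aware distance is wanted, the one-dimensional earth mover's distance between two histograms on the same equal-width bins is the area between their cumulative distributions. The sketch below is a plain-Python illustration of that alternative. It is not the measure defined above, and `emd_between_histograms` is a hypothetical helper name.

```python
from itertools import accumulate

def emd_between_histograms(expected_percents, actual_percents, bin_width=1.0):
    """Earth mover's distance between two histograms on the same equal-width bins.

    In one dimension this is the area between the two cumulative
    distributions: the running surplus of mass that still has to be
    carried to the next bin, summed over all bins.
    """
    surplus = accumulate(e - a for e, a in zip(expected_percents, actual_percents))
    return bin_width * sum(abs(s) for s in surplus)

expected = [0.5, 0.5, 0.0]
shifted = [0.0, 0.5, 0.5]   # same shape, moved one bin to the right

print(emd_between_histograms(expected, expected))
print(emd_between_histograms(expected, shifted))
```

With SciPy, a comparable result can be obtained by passing the bin centres as `u_values` and `v_values` and the proportions as `u_weights` and `v_weights`. Note also that `np.histogram` ignores values outside the outermost bin edges, so new data that exceeds the range of the expected data is dropped before the proportions are computed.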

Once decorated, the custom measure can be used in exactly the same way as a built-in measure.

In [7]:
with etiq.etiq_config('drift-config.json'):
    # Create a dataset with the comparison (original) data
    dataset_s = etiq.SimpleDatasetBuilder.dataset(data_encoded)
    # Create a dataset with today's (drifted) data
    todays_dataset_s = etiq.SimpleDatasetBuilder.dataset(todays_dataset_df)
    # Create the snapshot
    snapshot = project.snapshots.create(name="Test Snapshot", dataset=todays_dataset_s, comparison_dataset=dataset_s, model=None)
    # Run the drift scan
    (segments, issues, issue_summary) = snapshot.scan_drift_metrics()
    
issues, issue_summary

INFO:etiq.charting:Created histogram summary of data (15 fields)
INFO:etiq.charting:Created histogram summary of data (15 fields)
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0712:Starting pipeline
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0712:Calculated breakpoints for capital-loss
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0712:Calculated breakpoints for fnlwgt
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0712:Calculated breakpoints for hours-per-week
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0712:Calculated breakpoints for capital-gain
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0712:Calculated breakpoints for age
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0712:Calculated drift measures.
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0712:Identifying drift measure issues.
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0712:psi for native-country = 0.0
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0712:Threshold = [0.0, 0.15]
INFO:etiq.pipeline.IdentifyFeat

(                                                name         feature segment  \
 0                feature_drift_(psi)_above_threshold  hours-per-week     all   
 1  feature_drift_(earth_mover_drift_measure)_abov...  hours-per-week     all   
 
                                              measure  measure_value metric  \
 0                   <function psi at 0x7ff3194a5ca0>       6.584039   None   
 1  <function earth_mover_drift_measure at 0x7ff30...       0.012590   None   
 
    metric_value    threshold value record  
 0           NaN  [0.0, 0.15]  None   None  
 1           NaN  [0.0, 0.01]  None   None  ,
                                                 name metric  \
 0                feature_drift_(psi)_above_threshold   None   
 1  feature_drift_(kolmogorov_smirnov)_above_thres...   None   
 2  feature_drift_(earth_mover_drift_measure)_abov...   None   
 
                                              measure          features  \
 0                   <function psi at 0x7ff3194
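
The log above reports `psi` values against a threshold of [0.0, 0.15]. For orientation, the sketch below computes the Population Stability Index in its standard textbook form on two hypothetical sets of bin proportions. How Etiq's built-in `psi` bins the data and treats empty bins is not shown in this notebook, so the `epsilon` floor used here is an assumption of the sketch.

```python
import math

def population_stability_index(expected_percents, actual_percents, epsilon=1e-6):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    total = 0.0
    for expected, actual in zip(expected_percents, actual_percents):
        # Floor empty bins so the logarithm stays finite (assumption of this sketch)
        expected = max(expected, epsilon)
        actual = max(actual, epsilon)
        total += (actual - expected) * math.log(actual / expected)
    return total

expected_bins = [0.25, 0.25, 0.25, 0.25]   # hypothetical reference proportions
actual_bins = [0.10, 0.20, 0.30, 0.40]     # hypothetical current proportions

print(population_stability_index(expected_bins, expected_bins))
print(population_stability_index(expected_bins, actual_bins))
```

Every term of the sum is non-negative, so PSI is 0 only when the two sets of proportions match and grows as mass moves between bins.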

## Target Drift and Concept Drift 

1. Create the test datasets to illustrate drift

2. (Optional) Define custom concept measures 

3. Load the config file

4. Create the snapshot 

5. Scan for the different drift types

In [8]:
# Create a test dataset to illustrate drift

# Load the Adult dataset
data = etiq.utils.load_sample("adultdataset.csv")
# Randomly permute the targets
data_target_permutated = data.copy()
data_target_permutated['income'] = np.random.permutation(data['income'])


# use a LabelEncoder to transform categorical variables
cont_vars = ['age', 'educational-num', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_vars = list(set(data.columns.values) - set(cont_vars))

label_encoders = {}
data_encoded = pd.DataFrame()
data_target_permutated_encoded = pd.DataFrame()
for i in cat_vars:
    label = LabelEncoder()
    data_encoded[i] = label.fit_transform(data[i])
    data_target_permutated_encoded[i] = label.transform(data_target_permutated[i])
    label_encoders[i] = label

data_encoded.set_index(data.index, inplace=True)
data_encoded = pd.concat([data.loc[:, cont_vars], data_encoded], axis=1).copy()

data_target_permutated_encoded.set_index(data_target_permutated.index, inplace=True)
data_target_permutated_encoded = pd.concat([data_target_permutated.loc[:, cont_vars], data_target_permutated_encoded], axis=1).copy()



Define a new concept drift measure.

In [9]:
from etiq.drift_measures import concept_drift_measure

@concept_drift_measure
def total_variational_distance(expected_dist, new_dist):
    # Half the sum of the absolute differences between the two distributions
    return sum(0.5 * abs(x-y) for (x,y) in zip(expected_dist, new_dist))
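
The formula can be sanity-checked outside Etiq by calling an undecorated copy on small probability vectors. This sketch assumes that both arguments are probability vectors over the same categories in the same order; the notebook does not show what Etiq actually passes to a concept drift measure.

```python
def total_variational_distance(expected_dist, new_dist):
    # Undecorated copy of the measure, for experimenting without Etiq
    return sum(0.5 * abs(x - y) for (x, y) in zip(expected_dist, new_dist))

identical = total_variational_distance([0.7, 0.3], [0.7, 0.3])
partial = total_variational_distance([0.7, 0.3], [0.5, 0.5])
disjoint = total_variational_distance([1.0, 0.0], [0.0, 1.0])

print(identical, partial, disjoint)
```

For valid probability vectors the result lies between 0 (identical distributions) and 1 (no overlap at all), which makes thresholds easy to interpret.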
    

In [11]:
# Load the config file
with etiq.etiq_config("drift-config_concept.json"):
    #Using the test datasets, check for drift 

    dataset1 = etiq.SimpleDatasetBuilder.dataset(data_encoded,
                                                 cat_col=cat_vars,
                                                 cont_col=cont_vars)

    dataset2 = etiq.SimpleDatasetBuilder.dataset(data_target_permutated_encoded,
                                                 cat_col=cat_vars,
                                                 cont_col=cont_vars)

    # Creating a snapshot
    snapshot = project.snapshots.create(name="Test Snapshot", 
                                        dataset=dataset1, 
                                        comparison_dataset=dataset2,
                                        model=None)
    
    #Scan for different drift types
    (segments_f, issues_f, issue_summary_f) = snapshot.scan_drift_metrics()
    (segments_t, issues_t, issue_summary_t) = snapshot.scan_target_drift_metrics()
    (segments_c, issues_c, issue_summary_c) = snapshot.scan_concept_drift_metrics()



INFO:etiq.charting:Histogram summary already created for this data.
INFO:etiq.charting:Created histogram summary of data (15 fields)
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0655:Starting pipeline
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0655:Calculated breakpoints for educational-num
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0655:Calculated breakpoints for capital-loss
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0655:Calculated breakpoints for fnlwgt
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0655:Calculated breakpoints for hours-per-week
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0655:Calculated breakpoints for capital-gain
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0655:Calculated breakpoints for age
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0655:Calculated drift measures.
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0655:Identifying drift measure issues.
INFO:etiq.pipeline.IdentifyFeatureDriftPipeline0655:psi for native-country = 0.0
INFO:etiq

In [20]:
issue_summary_f

Unnamed: 0,0
0,IssueAggregate(name='feature_drift_above_thres...
1,IssueAggregate(name='feature_drift_above_thres...
2,IssueAggregate(name='feature_drift_above_thres...


In [21]:
issue_summary_t

Unnamed: 0,0
0,IssueAggregate(name='target_drift_above_thresh...


In [22]:
issue_summary_c

Unnamed: 0,0
0,IssueAggregate(name='concept_drift_above_thres...
1,IssueAggregate(name='concept_drift_above_thres...
2,IssueAggregate(name='concept_drift_above_thres...
3,IssueAggregate(name='concept_drift_above_thres...


The results above show concept drift but no target or feature drift. The comparison dataset contains exactly the same feature values and the same target values as the original, so the distribution of each is unchanged. Shuffling the targets, however, creates new pairings between features and targets. This changes the overall relationship between them, which is concept drift.
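
The same effect can be reproduced without Etiq. The sketch below uses a made-up feature and target, with Python's `random` module standing in for `np.random.permutation`. It checks that a shuffle leaves the target distribution untouched while changing the joint distribution of (feature, target) pairs.

```python
import random
from collections import Counter

features = ["low"] * 50 + ["high"] * 50
targets = [0] * 50 + [1] * 50          # the target perfectly follows the feature

rng = random.Random(0)                 # fixed seed for reproducibility
shuffled_targets = targets[:]
rng.shuffle(shuffled_targets)          # plays the role of np.random.permutation

# Marginal distribution of the target: identical before and after the shuffle
same_target_distribution = Counter(targets) == Counter(shuffled_targets)

# Joint distribution of (feature, target) pairs: changed by the shuffle
joint_before = Counter(zip(features, targets))
joint_after = Counter(zip(features, shuffled_targets))

print(same_target_distribution)
print(joint_before)
print(joint_after)
```

Before the shuffle only the pairs `("low", 0)` and `("high", 1)` occur. Afterwards the other two combinations typically appear as well, even though each column on its own looks exactly as it did before.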