# Notebook Summary


### Quickstart

  1. Import the etiq library. For installation instructions, see our docs (https://docs.etiq.ai/)

  2. Log in to the dashboard so that you can send the results to your dashboard instance (the Etiq AWS instance if you use the SaaS version). To deploy on your own AWS instance, you can do so via the AWS Marketplace. To deploy on your own cloud instance, get in touch (info@etiq.ai)

  3. Create or open a project

### Feature drift (RCA)


  4. Load Adult dataset

  5. Create drifted dataset based on a defined segment of the Adult data - for example purposes

  6. Load your config file and create your snapshot

  7. Scan for RCA feature drift


### Target drift (RCA)

  8. Create drifted datasets based on a defined segment of the Adult dataset - for example purposes

  9. Load your config file and create your snapshot

  10. Scan for RCA feature and target drift
  

## What is drift?

Drift can impact your model in production and make it perform worse than you initially expected.

There are a few different kinds of drift:

1. Feature drift

Feature drift takes place when the distributions of the input features change. For instance, suppose you built your model on a sample dataset from the winter period and it is now summer: a model that predicts what kind of dessert people are likely to buy may no longer be as accurate.


2. Target drift

Similarly to feature drift, target drift occurs when the distribution of the predicted feature (the target) changes from one time period to the next.
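
To make the idea of a changing distribution concrete, here is a minimal sketch of the population stability index (PSI), the measure used later in this notebook's config. It is an illustrative NumPy implementation, not etiq's own code: the function name `psi`, the choice of 10 equal-width bins and the small clipping constant are all assumptions, and the "winter" and "summer" samples are synthetic.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population stability index between two 1-D samples (illustrative sketch)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from the baseline sample so both samples share the same bins.
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Widen the outer edges so comparison values outside the baseline range are counted.
    edges[0] = min(edges[0], actual.min())
    edges[-1] = max(edges[-1], actual.max())
    exp_pct = np.histogram(expected, edges)[0] / len(expected)
    act_pct = np.histogram(actual, edges)[0] / len(actual)
    # Clip empty bins to a small value to avoid log(0) and division by zero.
    eps = 1e-6
    exp_pct = np.clip(exp_pct, eps, None)
    act_pct = np.clip(act_pct, eps, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Synthetic example: the same feature sampled in two periods with a shifted mean.
rng = np.random.default_rng(0)
winter = rng.normal(loc=5.0, scale=2.0, size=5000)
summer = rng.normal(loc=8.0, scale=2.0, size=5000)

same = psi(winter, winter)      # identical samples, so no drift
shifted = psi(winter, summer)   # shifted mean, so a clearly positive PSI
print(f"PSI (no change): {same:.3f}, PSI (shifted): {shifted:.3f}")
```

The same function can be applied to a target column instead of an input feature, which is the sense in which target drift mirrors feature drift.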

## What is RCA Drift?

If your drift tests have picked up issues, this test finds out exactly which segment has the issue, which should help you fix it sooner. In addition, if only part of the data has drifted, your overall tests might not pick up on it, but this test would. The scan auto-discovers problematic segments on its own, without the user having to specify which segments to test.

Currently we only provide RCA drift for feature and target drift.

This test pipeline is experimental.
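
The point about overall tests missing a partial drift can be illustrated with a small synthetic sketch. This is not how etiq discovers segments (the notebook does not describe that algorithm); it simply computes an assumed PSI per pre-defined segment with pandas to show how a drift confined to 4% of the rows is diluted in the overall measure. All names (`psi`, `segment`, `hours`) and the data are hypothetical.

```python
import numpy as np
import pandas as pd

def psi(expected, actual, bins=10):
    """Illustrative PSI with baseline-derived bins (assumed definition)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    edges = np.histogram_bin_edges(expected, bins=bins)
    edges[0] = min(edges[0], actual.min())
    edges[-1] = max(edges[-1], actual.max())
    exp_pct = np.clip(np.histogram(expected, edges)[0] / len(expected), 1e-6, None)
    act_pct = np.clip(np.histogram(actual, edges)[0] / len(actual), 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(42)
n = 20_000
base = pd.DataFrame({
    "segment": np.arange(n) % 25,        # 25 equal segments, 4% of rows each
    "hours": rng.normal(40.0, 10.0, n),  # synthetic numeric feature
})

# Drift only one segment: scale its values by 25%, leave everything else unchanged.
comparison = base.copy()
in_segment = comparison["segment"] == 3
comparison.loc[in_segment, "hours"] *= 1.25

overall_psi = psi(base["hours"], comparison["hours"])
segment_psi = pd.Series({
    seg: psi(rows["hours"], comparison.loc[rows.index, "hours"])
    for seg, rows in base.groupby("segment")
})
worst_segment = segment_psi.idxmax()

print(f"overall PSI: {overall_psi:.4f}")
print(f"worst segment: {worst_segment}, PSI: {segment_psi[worst_segment]:.4f}")
```

Against a threshold of 0.15, the overall value stays below it while the drifted segment is well above it, which is the situation the RCA scan is meant to surface.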




## QUICKSTART

In [1]:
import etiq
import numpy as np

Thanks for trying out the ETIQ.ai toolkit!

Visit our getting started documentation at https://docs.etiq.ai/

Visit our Slack channel at https://etiqcore.slack.com/ for support or feedback.



In [2]:
etiq.login("https://dashboard.etiq.ai/", "<token>")

(Dashboard supplied updated license information)


Connection successful. Projects and pipelines will be displayed in the dashboard. 😀

In [3]:
# Can get/create a single named project
project = etiq.projects.open(name="RCA_Drift_Scans")

## Create the test datasets based on the Adult Income Dataset

To illustrate some of the library's features, we build a model that predicts whether an applicant makes over or under 50K using the Adult dataset from https://archive.ics.uci.edu/ml/datasets/adult.

In [4]:
# Loading a dataset. We're using the adult dataset
data = etiq.utils.load_sample("adultdata")
data.head()
data = data.replace('?', np.nan)
data.dropna(inplace=True)

In [5]:

from etiq.transforms import LabelEncoder
import pandas as pd
import numpy as np 

# use a LabelEncoder to transform categorical variables
cont_vars = ['age', 'educational-num', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_vars = list(set(data.columns.values) - set(cont_vars))

label_encoders = {}
data_encoded = pd.DataFrame()
for i in cat_vars:
    label = LabelEncoder()
    data_encoded[i] = label.fit_transform(data[i])
    label_encoders[i] = label

data_encoded.set_index(data.index, inplace=True)
data_encoded = pd.concat([data.loc[:, cont_vars], data_encoded], axis=1).copy()


In [6]:
# Create the "drifted" dataset
data_encoded_comparison = data_encoded.copy()
# Expand hours worked by the self employed by 25% in the comparison dataset
# This is small enough that this drift will not show up for the overall dataset for 'hours-per-week'
self_emp_mask = data_encoded.eval(f'workclass in {list(label_encoders["workclass"].transform(["Self-emp-inc"]))}')
data_encoded_comparison.loc[self_emp_mask, 'hours-per-week'] = 1.25 * data_encoded[self_emp_mask]['hours-per-week']

In [7]:
# Construct etiq style datasets
base_dataset = etiq.SimpleDatasetBuilder.from_dataframe(data_encoded, target_feature='income').build()
comparison_dataset = etiq.SimpleDatasetBuilder.from_dataframe(data_encoded_comparison, target_feature='income').build()

## Calculate RCA Feature Drift

1. (Optional) Define custom drift measures (See the regular drift scan for an example)

2. Load the config

3. Log the datasets & create the snapshot - no model needed for this scan

4. Run the RCA feature drift scan

Drift can happen at any point in the pipeline and in a variety of ways.

In [8]:
with etiq.etiq_config("./drift-config-rca.json"):
    snapshot = project.snapshots.create(name="Test Drift RCA Snapshot", 
                                        dataset=base_dataset,
                                        comparison_dataset=comparison_dataset,
                                        model=None)
    # Scan for feature drift
    (segments, issues, issue_summary) = snapshot.scan_drift_metrics_rca()
    # Scan for target drift
    (segments_t, issues_t, issue_summary_t) = snapshot.scan_target_drift_metrics_rca()

INFO:etiq.pipeline.IdentifyPipeline0747:Starting pipeline
INFO:etiq.pipeline.IdentifyPipeline0747:RCA drift scan for feature hours-per-week
INFO:etiq.pipeline.IdentifyPipeline0747:Measure = psi
INFO:etiq.pipeline.IdentifyPipeline0747:Searching for segments above the threshold 0.15
INFO:etiq.pipeline.IdentifyPipeline0747:Completed pipeline
INFO:etiq.pipeline.IdentifyPipeline0614:Starting pipeline
INFO:etiq.pipeline.IdentifyPipeline0614:RCA target drift scan.
INFO:etiq.pipeline.IdentifyPipeline0614:Measure = psi
INFO:etiq.pipeline.IdentifyPipeline0614:Searching for segments above the threshold 0.15




INFO:etiq.pipeline.IdentifyPipeline0614:Completed pipeline


In [9]:
# To see which config you're using, you can also load it in the interface, but this is not required
etiq.load_config("./drift-config-rca.json")


{'dataset': {'label': 'income',
  'bias_params': {'protected': 'gender',
   'privileged': 1,
   'unprivileged': 0,
   'positive_outcome_label': 1,
   'negative_outcome_label': 0},
  'train_valid_test_splits': [0.8, 0.2, 0.0],
  'cat_col': 'cat_vars',
  'cont_col': 'cont_vars'},
 'scan_drift_metrics': {'thresholds': {'psi': [0.0, 0.15],
   'kolmogorov_smirnov': [0.05, 1.0]},
  'drift_measures': ['kolmogorov_smirnov', 'psi']},
 'scan_target_drift_metrics_rca': {'thresholds': {'psi': [0.0, 0.15]},
  'drift_measures': ['psi'],
  'ignore_lower_threshold': True,
  'ignore_upper_threshold': False,
  'minimum_segment_size': 1000},
 'scan_drift_metrics_rca': {'thresholds': {'psi': [0.0, 0.15]},
  'drift_measures': ['psi'],
  'ignore_lower_threshold': True,
  'ignore_upper_threshold': False,
  'minimum_segment_size': 1000,
  'features': ['hours-per-week']}}
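
If you do not yet have a config file, the settings printed above can be written to JSON with Python's standard `json` module. The sketch below reproduces the printed dictionary and writes it to a temporary directory so that it does not overwrite an existing file; only the file name `drift-config-rca.json` and the dictionary contents come from this notebook, and whether etiq expects any further keys is not covered here.

```python
import json
import tempfile
from pathlib import Path

# Settings shared by both RCA scans, copied from the printed config.
rca_common = {
    "thresholds": {"psi": [0.0, 0.15]},
    "drift_measures": ["psi"],
    "ignore_lower_threshold": True,
    "ignore_upper_threshold": False,
    "minimum_segment_size": 1000,
}

config = {
    "dataset": {
        "label": "income",
        "bias_params": {
            "protected": "gender",
            "privileged": 1,
            "unprivileged": 0,
            "positive_outcome_label": 1,
            "negative_outcome_label": 0,
        },
        "train_valid_test_splits": [0.8, 0.2, 0.0],
        "cat_col": "cat_vars",
        "cont_col": "cont_vars",
    },
    "scan_drift_metrics": {
        "thresholds": {"psi": [0.0, 0.15], "kolmogorov_smirnov": [0.05, 1.0]},
        "drift_measures": ["kolmogorov_smirnov", "psi"],
    },
    "scan_target_drift_metrics_rca": dict(rca_common),
    # The feature scan additionally restricts the search to one feature.
    "scan_drift_metrics_rca": {**rca_common, "features": ["hours-per-week"]},
}

# Write to a temporary directory (an assumption made to keep the sketch side-effect free).
config_path = Path(tempfile.mkdtemp()) / "drift-config-rca.json"
config_path.write_text(json.dumps(config, indent=2))

# Read it back to confirm the file round-trips.
reloaded = json.loads(config_path.read_text())
print(reloaded["scan_drift_metrics_rca"]["features"])
```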

In [10]:
issues

| | name | feature | segment | measure | measure_value | metric | metric_value | threshold |
|---|---|---|---|---|---|---|---|---|
| 0 | feature_drift_rca_above_threshold | hours-per-week | 1 | `<function psi at 0x7ff80fb23430>` | 1.44837 | | | [0.0, 0.15] |
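
One way to read this row against the config is sketched below. The helper `breaches_threshold` is hypothetical, and the meaning given to `ignore_lower_threshold` and `ignore_upper_threshold` is an assumption based on the log line "Searching for segments above the threshold 0.15"; it is not taken from the etiq source.

```python
def breaches_threshold(value, thresholds, ignore_lower=False, ignore_upper=False):
    """Return True when value falls outside the accepted [lower, upper] band.

    Hypothetical helper: the semantics of the ignore_* flags are assumed.
    """
    lower, upper = thresholds
    below = (not ignore_lower) and value < lower
    above = (not ignore_upper) and value > upper
    return below or above

# Values taken from the issue row and the RCA section of the config.
flagged = breaches_threshold(1.44837, [0.0, 0.15], ignore_lower=True, ignore_upper=False)
# A made-up value inside the band, for contrast.
not_flagged = breaches_threshold(0.05, [0.0, 0.15], ignore_lower=True, ignore_upper=False)
print(flagged, not_flagged)
```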


In [11]:
segments.iloc[1]

name                                                             1
business_rule                                    `workclass`  == 3
mask             [False, False, False, False, False, False, Fal...
Name: 1, dtype: object

In [12]:
print(f'Workclass 3 == "{label_encoders["workclass"].inverse_transform([3])[0]}"')

Workclass 3 == "Self-emp-inc"
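
The `business_rule` string uses pandas expression syntax with backtick-quoted column names, so it can be applied to a DataFrame to recover the rows of a segment. The sketch below uses a small made-up frame rather than the Adult data; the column values are invented, and only the rule string is copied from the output above.

```python
import pandas as pd

# Tiny stand-in for the encoded dataset (values are invented for illustration).
df = pd.DataFrame({
    "workclass": [3, 1, 3, 0],
    "hours-per-week": [60, 40, 50, 38],
})

# Rule string as reported in the segments output.
rule = "`workclass`  == 3"

# DataFrame.query evaluates the expression and returns the matching rows.
segment_rows = df.query(rule)
print(segment_rows)
```

Inspecting `segment_rows` in both the base and comparison datasets is a simple way to check what changed inside the flagged segment.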


In [13]:
issue_summary

| | name | metric | measure | features | segments | total_issues_tested | issues_found | threshold |
|---|---|---|---|---|---|---|---|---|
| 0 | feature_drift_rca_above_threshold | | `<function psi at 0x7ff80fb23430>` | {hours-per-week} | {1} | 85 | 1 | [0.0, 0.15] |


In [14]:
issue_summary_t

| | name | metric | measure | features | segments | total_issues_tested | issues_found | threshold |
|---|---|---|---|---|---|---|---|---|
| 0 | target_drift_rca_above_threshold | | `<function psi at 0x7ff80fb23430>` | {income} | {} | 83 | 0 | [0.0, 0.15] |


## Calculate RCA Target Drift

1. (Optional) Define custom drift measures (See the normal drift measure scan for an example)

2. Load the config

3. Log the datasets & create the snapshot - no model needed for this scan

4. Run the RCA target drift scan

Drift can happen at any point in the pipeline and in a variety of ways.

In [15]:
# Create the "drifted" dataset
data_encoded_target_drifted = data_encoded.copy()
# For all Pacific Islanders, flip income
pacific_islander_mask = data_encoded.eval(f'race in {list(label_encoders["race"].transform(["Asian-Pac-Islander"]))}')
data_encoded_target_drifted.loc[pacific_islander_mask, 'income'] = np.logical_not(data_encoded[pacific_islander_mask]['income']).astype(int)

In [16]:
comparison_dataset2 = etiq.SimpleDatasetBuilder.from_dataframe(data_encoded_target_drifted, target_feature='income').build()
with etiq.etiq_config("./drift-config-rca.json"):
    snapshot2 = project.snapshots.create(name="Test Drift RCA Snapshot", 
                                        dataset=base_dataset,
                                        comparison_dataset=comparison_dataset2,
                                        model=None)
    # Scan for feature drift
    (segments_2, issues_2, issue_summary_2) = snapshot2.scan_drift_metrics_rca()
    # Scan for target drift
    (segments_t2, issues_t2, issue_summary_t2) = snapshot2.scan_target_drift_metrics_rca()

INFO:etiq.pipeline.IdentifyPipeline0967:Starting pipeline
INFO:etiq.pipeline.IdentifyPipeline0967:RCA drift scan for feature hours-per-week
INFO:etiq.pipeline.IdentifyPipeline0967:Measure = psi
INFO:etiq.pipeline.IdentifyPipeline0967:Searching for segments above the threshold 0.15
INFO:etiq.pipeline.IdentifyPipeline0967:Completed pipeline
INFO:etiq.pipeline.IdentifyPipeline0497:Starting pipeline
INFO:etiq.pipeline.IdentifyPipeline0497:RCA target drift scan.
INFO:etiq.pipeline.IdentifyPipeline0497:Measure = psi
INFO:etiq.pipeline.IdentifyPipeline0497:Searching for segments above the threshold 0.15
INFO:etiq.pipeline.IdentifyPipeline0497:Completed pipeline


In [17]:
issue_summary_t2

| | name | metric | measure | features | segments | total_issues_tested | issues_found | threshold |
|---|---|---|---|---|---|---|---|---|
| 0 | target_drift_rca_above_threshold | | `<function psi at 0x7ff80fb23430>` | {income} | {1} | 94 | 1 | [0.0, 0.15] |
