# Notebook Summary 


### Quickstart

  1. Import the etiq library. For installation instructions, see our docs (https://docs.etiq.ai/).

  2. Log in to the dashboard so that you can send results to your dashboard instance (the Etiq AWS instance if you use the SaaS version). To deploy on your own cloud instance, get in touch (info@etiq.ai).

  3. Create or open a project 
  
### Leakage Scans


  4. Load the Adult dataset
  
  5. Create a leaked dataset based on Adult, for example purposes
  
  6. Load your config file and create your snapshot
  
  7. Scan for target leakage and demographic leakage 
  
  


## What is leakage

Leakage can be very detrimental to your model.
If a feature accidentally encodes the target, your model will appear to perform very well, with accuracy much higher than expected. This performance will not hold in production, however, and you will likely end up deploying the wrong model.

There are multiple types of leakage (including time travel), but at the moment we only provide scans for:

1. Target leakage

This occurs when information about the target has leaked into a feature, e.g. if you're trying to predict yearly income and a monthly salary feature (for the same time period) is accidentally included in your dataset.
While it seems hard to make this mistake, think of datasets with hundreds of features sourced from different databases and repositories around a business, perhaps calculated by multiple teams.
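
To make the yearly income example concrete, here is a minimal sketch on synthetic data. All values and variable names are made up for illustration (they are not taken from the Adult dataset), and the Pearson correlation is written by hand rather than calling any Etiq function. It shows that a monthly salary feature is almost perfectly correlated with a yearly income target, while an unrelated feature is not.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)

# Hypothetical target: yearly income.
yearly_income = [rng.uniform(20_000, 120_000) for _ in range(1_000)]
# Leaked feature: monthly salary for the same period, plus a little noise.
monthly_salary = [y / 12 + rng.gauss(0, 50) for y in yearly_income]
# Unrelated feature, generated independently of the target.
age = [rng.randint(18, 70) for _ in range(1_000)]

leak_corr = pearson(monthly_salary, yearly_income)
age_corr = pearson(age, yearly_income)
print(f"monthly_salary vs target: {leak_corr:.3f}")
print(f"age vs target:            {age_corr:.3f}")
```

A correlation this close to 1 between a feature and the target is the kind of signal a target leakage check looks for.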


2. Demographic leakage 

This occurs when information about one of your protected demographic features has leaked into another feature, e.g. if relationship status contains information related to a customer's gender, then using relationship status as a feature in a predictive model is highly problematic.


A good indicator for both types of leakage above is whether any feature in the dataset is highly correlated with the target or with a protected feature, as this means that one has likely leaked into the other.

This resembles the proxy issue in the bias scans; however, the main difference is the level of the thresholds.

We provide multiple correlation measures to test for this issue. You can customize the measures yourself in the config, but the recommended usage is as follows:
- "continuous_continuous_measure": "pearsons"
- "categorical_categorical_measure": "cramersv"
- "categorical_continuous_measure": "rankbiserial"
- "binary_continuous_measure": "pointbiserial"
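
As an illustration of what a categorical-to-categorical measure captures, the sketch below computes Cramér's V by hand for the relationship/gender example. This is a plain-Python approximation for intuition only, not Etiq's implementation of `cramersv`, and the rows are invented so that "Husband" and "Wife" fully determine gender.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V (no bias correction) between two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts, y_counts = Counter(xs), Counter(ys)
    k = min(len(x_counts), len(y_counts)) - 1
    if k == 0:
        # One variable is constant, so there is no association to measure.
        return 0.0
    chi2 = 0.0
    for x, x_n in x_counts.items():
        for y, y_n in y_counts.items():
            expected = x_n * y_n / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


# Invented rows: relationship status largely reveals gender.
relationship = ["Husband"] * 40 + ["Wife"] * 40 + ["Own-child"] * 20
gender = ["Male"] * 40 + ["Female"] * 40 + ["Male"] * 10 + ["Female"] * 10
# Invented rows: workclass is spread evenly across genders.
workclass = ["Private", "Local-gov"] * 50

v_relationship = cramers_v(relationship, gender)
v_workclass = cramers_v(workclass, gender)
print(f"relationship vs gender: {v_relationship:.3f}")
print(f"workclass vs gender:    {v_workclass:.3f}")
```

With these rows, the relationship feature scores above a threshold such as 0.85 (the value used in this notebook's config), while workclass scores 0.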

## Set-up


In [3]:
import etiq


In [4]:
from etiq import login as etiq_login
etiq_login("https://dashboard.etiq.ai/", "<token>")


(Dashboard supplied updated license information)


Connection successful. Projects and pipelines will be displayed in the dashboard. 😀

In [5]:
# Can get/create a single named project
project = etiq.projects.open(name="Leakage Test")

## Example dataset and model
  
To illustrate some of the library's features, we build a model that predicts whether an applicant makes over or under 50K using the Adult dataset from https://archive.ics.uci.edu/ml/datasets/adult.

First, we'll be encoding the categorical features found in this dataset.

Second, we'll log the dataset to Etiq.

In this case we encode prior to splitting into train/test/validation sets because we know in advance all the categories people can fall into for this dataset, so in production we won't run into new categories that were not included here.

However if this is not the case for your use case, you should NOT encode prior to splitting your sample, as this might lead to LEAKAGE.
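
If you cannot guarantee that all categories are known in advance, one option is to fit the encoding on the training split only and map anything unseen to a sentinel value. The sketch below is a minimal plain-Python illustration of that idea; the helper names and the `-1` sentinel are hypothetical and are not part of the etiq API.

```python
def fit_label_mapping(values):
    """Map each distinct category seen in `values` to an integer code."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def apply_label_mapping(values, mapping, unknown=-1):
    """Encode `values`, sending categories missing from `mapping` to `unknown`."""
    return [mapping.get(value, unknown) for value in values]


# Illustrative workclass values; the test split contains an unseen category.
train = ["Private", "Local-gov", "Private", "Self-emp-inc"]
test = ["Private", "Never-worked"]

# Fit on the training split only, then reuse the same mapping everywhere.
mapping = fit_label_mapping(train)
train_encoded = apply_label_mapping(train, mapping)
test_encoded = apply_label_mapping(test, mapping)

print(mapping)
print(train_encoded)
print(test_encoded)
```

Because the mapping never sees the test split, nothing about the test data can influence how the training data is encoded.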

Label encoding is itself problematic, as it assigns a numerical ranking to categorical variables. As a best practice, use one-hot encoding instead. As we limit the free library functionality to 15 features, we will not use one-hot encoding for the purposes of this example.
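
For comparison, here is a small sketch of the two encodings using plain pandas (not etiq). The three rows are illustrative only. It also shows why one-hot encoding inflates the feature count, which matters when the number of features is capped.

```python
import pandas as pd

# Illustrative rows only; the category values mirror the Adult sample.
df = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": ["Private", "Private", "Local-gov"],
    "gender": ["Male", "Male", "Female"],
})

# Label encoding: a single column, but the integer codes imply an order.
label_encoded = df["workclass"].astype("category").cat.codes

# One-hot encoding: one indicator column per category, no implied order.
one_hot = pd.get_dummies(df, columns=["workclass", "gender"])

print(list(label_encoded))
print(list(one_hot.columns))
print(f"{df.shape[1]} columns before, {one_hot.shape[1]} after")
```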

Remember: This is an example only. The use case for the majority of scans in Etiq is that you log the model to Etiq once you have the sample that you'll be training on. Usually this sample will have numeric features only, as otherwise you will not be able to use it with the training methods of the majority of supported libraries.

In [6]:
# Loading a dataset. We're using the adult dataset
data = etiq.utils.load_sample("adultdata")
data.head()

| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


In [7]:
from etiq.transforms import LabelEncoder
import pandas as pd
import numpy as np 

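# Use a LabelEncoder to transform the categorical variables into integer codes
cont_vars = ['age', 'educational-num', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
# Keep the original column order (a set difference would make the order arbitrary)
cat_vars = [col for col in data.columns if col not in cont_vars]

label_encoders = {}
data_encoded = pd.DataFrame()
for col in cat_vars:
    label = LabelEncoder()
    data_encoded[col] = label.fit_transform(data[col])
    # Keep each fitted encoder for later reference
    label_encoders[col] = label

# Align the index with the original data, then recombine the untouched
# continuous columns with the encoded categorical columns
data_encoded.set_index(data.index, inplace=True)
data_encoded = pd.concat([data.loc[:, cont_vars], data_encoded], axis=1).copy()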
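
One possible way to create a leaked dataset for example purposes is to add a feature that is a noisy copy of the target. The sketch below uses a small synthetic frame instead of the Adult data. The `leaked_income` column name and the 2% flip rate are hypothetical, and this is not necessarily how a leaked dataset would be built for this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Small synthetic stand-in for an encoded dataset with a binary target.
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=500),
    "income": rng.integers(0, 2, size=500),
})

# Copy the target into a new feature and flip roughly 2% of the values,
# so the leak is strong but not a perfect duplicate.
flip = rng.random(len(df)) < 0.02
df["leaked_income"] = np.where(flip, 1 - df["income"], df["income"])

agreement = (df["leaked_income"] == df["income"]).mean()
print(f"leaked feature matches the target in {agreement:.1%} of rows")
```

A feature like this should be flagged by a target leakage check, which makes it a convenient test case.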


## Loading the config file

In [8]:
# XXX: Make per-project.
etiq.load_config("./leakage-config.json")


{'dataset': {'label': 'income',
  'bias_params': {'protected': 'gender',
   'privileged': 1,
   'unprivileged': 0,
   'positive_outcome_label': 1,
   'negative_outcome_label': 0},
  'train_valid_test_splits': [0.8, 0.1, 0.1],
  'remove_protected_from_features': True},
 'scan_target_leakage': {'leakage_threshold': 0.85,
  'minimum_segment_size': 1000,
  'continuous_continuous_measure': 'pearsons',
  'categorical_categorical_measure': 'cramersv',
  'categorical_continuous_measure': 'rankbiserial'},
 'scan_demographic_leakage': {'leakage_threshold': 0.85,
  'minimum_segment_size': 1000,
  'continuous_continuous_measure': 'pearsons',
  'categorical_categorical_measure': 'cramersv',
  'categorical_continuous_measure': 'rankbiserial'}}
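
For reference, a config file matching the output above could be generated as sketched below. This assumes the JSON file simply mirrors the dictionary that `etiq.load_config` printed; check the Etiq docs for the authoritative schema.

```python
import json
from pathlib import Path

# Settings shared by both leakage scans, copied from the output above.
leakage_settings = {
    "leakage_threshold": 0.85,
    "minimum_segment_size": 1000,
    "continuous_continuous_measure": "pearsons",
    "categorical_categorical_measure": "cramersv",
    "categorical_continuous_measure": "rankbiserial",
}

config = {
    "dataset": {
        "label": "income",
        "bias_params": {
            "protected": "gender",
            "privileged": 1,
            "unprivileged": 0,
            "positive_outcome_label": 1,
            "negative_outcome_label": 0,
        },
        "train_valid_test_splits": [0.8, 0.1, 0.1],
        "remove_protected_from_features": True,
    },
    "scan_target_leakage": dict(leakage_settings),
    "scan_demographic_leakage": dict(leakage_settings),
}

# Write the file next to the notebook, then read it back as a sanity check.
config_path = Path("leakage-config.json")
config_path.write_text(json.dumps(config, indent=2))
reloaded = json.loads(config_path.read_text())
print(reloaded["scan_target_leakage"]["leakage_threshold"])
```

Lowering `leakage_threshold` makes the scans stricter, since weaker correlations are then reported as issues.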

## Logging the snapshot to Etiq 

This can happen at any point in the pipeline and in a variety of ways.

In [9]:
from etiq.model import DefaultXGBoostClassifier

# Load your dataset
dataset_loader = etiq.dataset(data_encoded)

# Load our model
model = DefaultXGBoostClassifier()

# Create a snapshot
snapshot = project.snapshots.create(
    name="Snapshot 1",
    dataset=dataset_loader.initial_dataset,
    model=model,
    bias_params=dataset_loader.bias_params,
)


## Leakage Scans

In [14]:
# Target leakage scan

(segments, issues, issue_summary) = snapshot.scan_target_leakage()

INFO:etiq.pipeline.IdentifyFeatureLeakPipeline0549:Starting pipeline
INFO:etiq.pipeline.IdentifyFeatureLeakPipeline0549:Completed pipeline


In [15]:
issue_summary

| | name | metric | measure | features | segments | total_issues_tested | issues_found | threshold |
|---|---|---|---|---|---|---|---|---|
| 0 | target_leakage_issue | | `<function pointbiserial at 0x7fd5d6fa6700>` | {} | {} | 5 | 0 | (0, 0.85) |
| 1 | target_leakage_issue | | `<function cramersv at 0x7fd5d6fa65e0>` | {} | {} | 8 | 0 | (0, 0.85) |


In [12]:
# Demographic leakage scan

(segments, issues, issue_summary) = snapshot.scan_demographic_leakage()

INFO:etiq.pipeline.IdentifyFeatureLeakPipeline0818:Starting pipeline
INFO:etiq.pipeline.IdentifyFeatureLeakPipeline0818:Completed pipeline


In [13]:
issue_summary

| | name | metric | measure | features | segments | total_issues_tested | issues_found | threshold |
|---|---|---|---|---|---|---|---|---|
| 0 | demographic_leakage_issue | | `<function pointbiserial at 0x7fd5d6fa6700>` | {} | {} | 5 | 0 | (0, 0.85) |
| 1 | demographic_leakage_issue | | `<function cramersv at 0x7fd5d6fa65e0>` | {} | {} | 8 | 0 | (0, 0.85) |
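
To check a summary like this programmatically, you can filter for rows where issues were found. The sketch below assumes `issue_summary` behaves like a pandas DataFrame with the columns shown above. It uses a hand-built stand-in with a subset of those columns so that it runs without etiq.

```python
import pandas as pd

# Stand-in for the `issue_summary` returned by a scan
# (values copied from the demographic leakage output above).
issue_summary = pd.DataFrame({
    "name": ["demographic_leakage_issue", "demographic_leakage_issue"],
    "measure": ["pointbiserial", "cramersv"],
    "total_issues_tested": [5, 8],
    "issues_found": [0, 0],
})

# Keep only the measures that reported at least one issue.
flagged = issue_summary[issue_summary["issues_found"] > 0]
total_tested = int(issue_summary["total_issues_tested"].sum())

if flagged.empty:
    print(f"No leakage issues found across {total_tested} tests.")
else:
    print(flagged[["name", "measure", "issues_found"]])
```

In this run no feature exceeded the 0.85 threshold, so both scans report zero issues.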
