# Notebook Summary 


### Quickstart

  1. Import the etiq library (for installation, see our docs: https://docs.etiq.ai/)

  2. Log in to the dashboard so that you can send results to your dashboard instance (the Etiq AWS instance if you use the SaaS version). To deploy on your own cloud instance, get in touch (info@etiq.ai)

  3. Create or open a project 
  
  
### Snapshot 1, pre-configured model 


  4. Load Adult dataset 
  
  5. Load your config file and create your snapshot based on an etiq wrapped xgboost model
  
  6. Scan for accuracy issues 
  
  
  
### Snapshot 2, already-trained model 


  7. Load Adult dataset
  
  8. Train a model 
  
  9. Load your config file and create your snapshot using the test dataset only 
  
  10. Scan for accuracy issues 
  
  
### Snapshot 3, in production


  11. Load Adult dataset & use the already trained model from the previous snapshot 
  
  12. Load your config file and create your snapshot using the test dataset, with the 'label' feature assumed to be actuals
  
  13. Scan for accuracy issues 
  
  

## Why run accuracy scans?

Accuracy is what I optimize my models on. Why should I have tests on accuracy metrics as well?

- High accuracy can indicate a problem just as much as low accuracy can. For instance, if a plain accuracy metric is 10% higher than you expected, you might have leakage somewhere, or another issue.

- Optimizing for a metric pre-production does not equate to optimizing for that metric in production. You will be better off getting a good model off the ground, one with no obvious issues and which is likely to be robust, than chasing an extra 1% of accuracy with a model that overfits, that unfairly discriminates against protected demographic groups, or that will experience abrupt performance decay.


Our accuracy scans so far provide 3 metrics: 

1. accuracy - % correct out of total 

2. true positive rate - the proportion of positive outcome labels that are correctly classified, out of all positive outcome labels

3. true negative rate - the proportion of negative outcome labels that are correctly classified, out of all negative outcome labels


You can also add your own metrics by using custom metrics.
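
To make the three metrics above concrete, here is a minimal plain-Python sketch of how they can be computed from labels and predictions. It does not use the etiq API; the function names and the toy data are assumptions made for illustration only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary outcome."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Share of all records that are classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def true_pos_rate(y_true, y_pred):
    """Share of positive outcome labels that are classified correctly."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else float("nan")


def true_neg_rate(y_true, y_pred):
    """Share of negative outcome labels that are classified correctly."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp) if (tn + fp) else float("nan")


# Toy data (hypothetical): 4 positive and 6 negative outcome labels.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

print(accuracy(y_true, y_pred))       # 0.7
print(true_pos_rate(y_true, y_pred))  # 0.5
print(true_neg_rate(y_true, y_pred))  # 5 of 6 negatives correct
```

In this toy case the accuracy looks acceptable while only half of the positive outcomes are caught, which is why the rates are worth checking alongside plain accuracy.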

# SET-UP

In [1]:
import etiq


Thanks for trying out the ETIQ.ai toolkit!

Visit our getting started documentation at https://docs.etiq.ai/

Visit our Slack channel at https://etiqcore.slack.com/ for support or feedback.



In [2]:
from etiq import login as etiq_login
etiq_login("https://dashboard.etiq.ai/", "<token>")


Invalid authentication token: 404

In [3]:
# Can get/create a single named project
project = etiq.projects.open(name="Accuracy Scans")

# SNAPSHOT 1: xgboost, pre-configured model


To illustrate some of the library's features, we build a model that predicts whether an applicant makes over or under 50K using the Adult dataset from https://archive.ics.uci.edu/ml/datasets/adult.


First, we'll be encoding the categorical features found in this dataset.

Second, we'll log the dataset to Etiq.

In this case we encode prior to splitting into train/validation/test sets because we know in advance all the categories people can fall into for this dataset. This means that in production we won't run into a new category that is missing from this dataset.

However, if this is not the case for your use case, you should NOT encode prior to splitting your sample, as this might lead to LEAKAGE.
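
If you cannot assume that all categories are known in advance, one option is to learn the encoding from the training rows only and map anything unseen to a reserved code. The sketch below is plain Python rather than the etiq or scikit-learn encoder; the helper names, the `-1` code for unknown categories and the sample values are assumptions for illustration.

```python
def fit_category_codes(values):
    """Learn an integer code for each distinct category seen in `values`."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def encode(values, codes, unknown=-1):
    """Apply learned codes; categories never seen during fitting get `unknown`."""
    return [codes.get(value, unknown) for value in values]


# Hypothetical 'workclass' values, already split into train and test.
train_workclass = ["Private", "Local-gov", "Private", "?"]
test_workclass = ["Private", "Self-emp-inc"]

# Fit on the training split only, then reuse the same mapping everywhere.
codes = fit_category_codes(train_workclass)
train_encoded = encode(train_workclass, codes)
test_encoded = encode(test_workclass, codes)

print(codes)          # {'?': 0, 'Local-gov': 1, 'Private': 2}
print(train_encoded)  # [2, 1, 2, 0]
print(test_encoded)   # [2, -1]
```

Because the mapping is learned before the test rows are seen, nothing about the held-out sample influences the features the model is trained on.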

Label-encoding categorical values is itself problematic, as it assigns a numerical ranking to categorical variables. The best-practice alternative is one-hot encoding, which converts each categorical value into a new column holding a binary value of 0 or 1. As the free library functionality is limited to 15 features, we will not use one-hot encoding for the purposes of this example.
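
The difference between the two encodings can be seen on a single column. This is a small plain-Python sketch with made-up `marital-status` values, not the encoder used in the cells below.

```python
# Hypothetical values for one categorical column.
column = ["Never-married", "Married-civ-spouse", "Married-civ-spouse", "Divorced"]

# Distinct categories in a fixed (alphabetical) order.
levels = sorted(set(column))

# Label encoding: one integer per category, which implies an ordering
# (Divorced < Married-civ-spouse < Never-married) that has no real meaning.
label_codes = {level: code for code, level in enumerate(levels)}
label_encoded = [label_codes[value] for value in column]

# One-hot encoding: one 0/1 column per category, with no implied ordering.
one_hot = [[1 if value == level else 0 for level in levels] for value in column]

print(levels)         # ['Divorced', 'Married-civ-spouse', 'Never-married']
print(label_encoded)  # [2, 1, 1, 0]
print(one_hot)        # [[0, 0, 1], [0, 1, 0], [0, 1, 0], [1, 0, 0]]
```

Note that one-hot encoding turns one feature into as many features as there are categories, which is why it quickly runs into a cap on the number of features.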

Remember: This is an example only. The use case for the majority of scans in Etiq is that you log the model to Etiq once you have the sample that you'll be training on. Usually this sample will have numeric features only, as otherwise you will not be able to use it with the training methods of the majority of supported libraries.

In [4]:
# Loading a dataset. We're using the adult dataset
data = etiq.utils.load_sample("adultdata")
data.head()


Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [5]:
from etiq.transforms import LabelEncoder
import pandas as pd
import numpy as np 

# use a LabelEncoder to transform categorical variables
cont_vars = ['age', 'educational-num', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_vars = list(set(data.columns.values) - set(cont_vars))

label_encoders = {}
data_encoded = pd.DataFrame()
for i in cat_vars:
    label = LabelEncoder()
    data_encoded[i] = label.fit_transform(data[i])
    label_encoders[i] = label

data_encoded.set_index(data.index, inplace=True)
data_encoded = pd.concat([data.loc[:, cont_vars], data_encoded], axis=1).copy()


In [6]:
data_encoded

Unnamed: 0,age,educational-num,fnlwgt,capital-gain,capital-loss,hours-per-week,occupation,education,relationship,workclass,race,marital-status,native-country,income,gender
0,25,7,226802,0,0,40,7,1,3,4,2,4,39,0,1
1,38,9,89814,0,0,50,5,11,0,4,4,2,39,0,1
2,28,12,336951,0,0,40,11,7,0,2,4,2,39,1,1
3,44,10,160323,7688,0,40,7,15,0,4,2,2,39,1,1
4,18,10,103497,0,0,30,0,15,3,0,4,4,39,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,12,257302,0,0,38,13,7,5,4,4,2,39,0,0
48838,40,9,154374,0,0,40,7,11,0,4,4,2,39,1,1
48839,58,9,151910,0,0,40,1,11,4,4,4,6,39,0,0
48840,22,9,201490,0,0,20,1,11,3,4,4,4,39,0,1


## Loading the config file

In [7]:
# XXX: Make per-project.
etiq.load_config("./config_accuracy.json")


{'dataset': {'label': 'income',
  'bias_params': {'protected': 'gender',
   'privileged': 1,
   'unprivileged': 0,
   'positive_outcome_label': 1,
   'negative_outcome_label': 0},
  'train_valid_test_splits': [0.8, 0.1, 0.1],
  'remove_protected_from_features': True},
 'scan_accuracy_metrics': {'thresholds': {'accuracy': [0.8, 1.0],
   'true_pos_rate': [0.7, 1.0],
   'true_neg_rate': [0.6, 1.0]}}}

## Logging the snapshot to Etiq 

This can happen at any point in the pipeline and in a variety of ways.

In [8]:
#load your dataset

dataset = etiq.BiasDatasetBuilder.dataset(data_encoded)

from etiq.model import DefaultXGBoostClassifier
# Load our model
model = DefaultXGBoostClassifier()

# Creating a snapshot
snapshot = project.snapshots.create(name="Snapshot 1", 
                                    dataset=dataset, 
                                    model=model, 
                                    bias_params=etiq.biasparams.BiasParams(protected='gender', privileged=1, unprivileged=0, positive_outcome_label=1, negative_outcome_label=0))


In [9]:
snapshot.bias_params

BiasParams(protected='gender', privileged=1, unprivileged=0, positive_outcome_label=1, negative_outcome_label=0)

## Accuracy Metrics Scan

In [10]:
(segments, issues, issue_summary) = snapshot.scan_accuracy_metrics()

INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0233:Starting pipeline
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0233:Computed acurracy metrics for the dataset
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0233:Completed pipeline


In [11]:
issue_summary

Unnamed: 0,name,metric,measure,features,segments,total_issues_tested,issues_found,threshold
0,accuracy_below_threshold,<compiled_function accuracy at 0x7ff067971130>,,{},{},1,0,"[0.8, 1.0]"
1,accuracy_above_threshold,<compiled_function accuracy at 0x7ff067971130>,,{},{},1,0,"[0.8, 1.0]"
2,true_pos_rate_below_threshold,<compiled_function true_pos_rate at 0x7ff06797...,,{},{all},1,1,"[0.7, 1.0]"
3,true_pos_rate_above_threshold,<compiled_function true_pos_rate at 0x7ff06797...,,{},{},1,0,"[0.7, 1.0]"
4,true_neg_rate_below_threshold,<compiled_function true_neg_rate at 0x7ff06797...,,{},{},1,0,"[0.6, 1.0]"
5,true_neg_rate_above_threshold,<compiled_function true_neg_rate at 0x7ff06797...,,{},{},1,0,"[0.6, 1.0]"
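
The summary above pairs each metric with a `[lower, upper]` threshold and reports a `_below_threshold` and an `_above_threshold` test for it. The sketch below shows one plausible way such a band check could work. It is not etiq's implementation: the function, the strict comparisons at the boundaries and the metric values are all assumptions for illustration.

```python
def check_thresholds(metric_values, thresholds):
    """Return the names of tests whose metric falls outside its [lower, upper] band."""
    issues = []
    for name, value in metric_values.items():
        lower, upper = thresholds[name]
        if value < lower:
            issues.append(f"{name}_below_threshold")
        elif value > upper:
            issues.append(f"{name}_above_threshold")
    return issues


# Threshold bands as in the config loaded for Snapshot 1.
thresholds = {
    "accuracy": [0.8, 1.0],
    "true_pos_rate": [0.7, 1.0],
    "true_neg_rate": [0.6, 1.0],
}

# Hypothetical metric values; the scan output above does not show the real ones.
metric_values = {"accuracy": 0.86, "true_pos_rate": 0.62, "true_neg_rate": 0.94}

print(check_thresholds(metric_values, thresholds))
# ['true_pos_rate_below_threshold']

# Lowering the upper bound turns a suspiciously high score into an issue as well.
strict = {**thresholds, "true_neg_rate": [0.6, 0.9]}
print(check_thresholds(metric_values, strict))
# ['true_pos_rate_below_threshold', 'true_neg_rate_above_threshold']
```

With an upper bound of 1.0 the above-threshold test can never fire, so an upper bound below 1.0 is what lets a scan catch accuracy that is too good to be true.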


# SNAPSHOT 2, already trained model

## Model Build 

In [12]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')

# Loading a dataset. We're using the adult dataset
data = etiq.utils.load_sample("adultdata")
data.head()



Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [13]:
# use a LabelEncoder to transform categorical variables
cont_vars = ['age', 'educational-num', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_vars = list(set(data.columns.values) - set(cont_vars))

label_encoders = {}
data_encoded = pd.DataFrame()
for i in cat_vars:
    label = LabelEncoder()
    data_encoded[i] = label.fit_transform(data[i])
    label_encoders[i] = label

data_encoded.set_index(data.index, inplace=True)
data_encoded = pd.concat([data.loc[:, cont_vars], data_encoded], axis=1).copy()



In [14]:
# prepare the training/testing/validation datasets

# separate into train/validate/test datasets of sizes 80%/10%/10%, as percentages of the initial data
data_remaining, test = train_test_split(data_encoded, test_size=0.1)
# 0.1112 of the remaining 90% is roughly 10% of the initial data
train, valid = train_test_split(data_remaining, test_size=0.1112)

# because we don't want to train on protected attributes or labels to be predicted, 
# let's remove these columns from the training dataset
protected_train = train['gender'].copy() # gender is a protected attribute
y_train = train['income'].copy() # labels we're going to train the model to predict
x_train = train.drop(columns=['gender','income'])
protected_valid = valid['gender'].copy() 
y_valid = valid['income'].copy() 
x_valid = valid.drop(columns=['gender','income'])
protected_test = test['gender'].copy() 
y_test = test['income'].copy()
x_test = test.drop(columns=['gender','income'])
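
The second `test_size` of 0.1112 looks odd at first. It is applied to the 90% that remains after the first split, so it works out to roughly 10% of the initial data. A short arithmetic check (variable names are illustrative):

```python
first_test_size = 0.1       # share of the initial data held out as test
second_valid_size = 0.1112  # share of the remaining data held out as validation

remaining = 1.0 - first_test_size
valid_fraction = remaining * second_valid_size
train_fraction = remaining * (1.0 - second_valid_size)

print(round(train_fraction, 5))   # 0.79992
print(round(valid_fraction, 5))   # 0.10008
print(round(first_test_size, 5))  # 0.1

# The exact value that would give a 10% validation share is 0.1 / 0.9.
exact_valid_size = first_test_size / remaining
print(round(exact_valid_size, 4))  # 0.1111
```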

In [15]:
# train an XGBoost model to predict 'income'

standard_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=4)    
model_fit = standard_model.fit(x_train, y_train)

In [16]:
y_train_pred = standard_model.predict(x_train)
y_valid_pred = standard_model.predict(x_valid)
print('Model accuracy on the training dataset :', 
      round(100 * accuracy_score(y_train, y_train_pred),2),'%') # round the score to 2 digits  

print('Model accuracy on the validation dataset :', 
      round(100 * accuracy_score(y_valid, y_valid_pred),2),'%')

Model accuracy on the training dataset : 90.05 %
Model accuracy on the validation dataset : 87.2 %


## Log config, dataset and model to Etiq

For already trained models, make sure you only use a sample you held out. 

As you don't want any retraining of the model to occur, set train_valid_test_splits in your config to [0.0, 1.0, 0.0]. 

In [17]:
etiq.load_config("./config_already_trained_accuracy.json")


{'dataset': {'label': 'income',
  'bias_params': {'protected': 'gender',
   'privileged': 1,
   'unprivileged': 0,
   'positive_outcome_label': 1,
   'negative_outcome_label': 0},
  'train_valid_test_splits': [0.0, 1.0, 0.0],
  'remove_protected_from_features': True},
 'scan_accuracy_metrics': {'thresholds': {'accuracy': [0.8, 1.0],
   'true_pos_rate': [0.6, 1.0],
   'true_neg_rate': [0.6, 1.0]}}}
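
The config file itself is not shown in this notebook, only the dictionary printed when it is loaded. Assuming the JSON file simply mirrors that dictionary (an assumption, so check the Etiq docs for the exact schema), a file like it could be produced as follows. The sketch writes to a temporary directory so that it does not overwrite an existing config.

```python
import json
import tempfile
from pathlib import Path

# Contents copied from the dictionary printed above.
config = {
    "dataset": {
        "label": "income",
        "bias_params": {
            "protected": "gender",
            "privileged": 1,
            "unprivileged": 0,
            "positive_outcome_label": 1,
            "negative_outcome_label": 0,
        },
        # No training share: the whole logged sample is used for validation.
        "train_valid_test_splits": [0.0, 1.0, 0.0],
        "remove_protected_from_features": True,
    },
    "scan_accuracy_metrics": {
        "thresholds": {
            "accuracy": [0.8, 1.0],
            "true_pos_rate": [0.6, 1.0],
            "true_neg_rate": [0.6, 1.0],
        }
    },
}

# Sanity check: the three split shares should add up to 1.
splits = config["dataset"]["train_valid_test_splits"]
splits_total = sum(splits)

# Write the file, then read it back to confirm it round-trips.
config_path = Path(tempfile.mkdtemp()) / "config_already_trained_accuracy.json"
config_path.write_text(json.dumps(config, indent=2))
reloaded = json.loads(config_path.read_text())

print(splits_total)        # 1.0
print(reloaded == config)  # True
```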

In [18]:
from etiq import Model

#Log your already trained model

model = Model(model_architecture=standard_model, model_fitted=model_fit)

dataset = etiq.BiasDatasetBuilder.dataset(test)

In [19]:
snapshot = project.snapshots.create(name="Snapshot 2", 
                                    dataset=dataset, 
                                    model=model, 
                                    bias_params=etiq.biasparams.BiasParams(protected='gender', privileged=1, unprivileged=0, positive_outcome_label=1, negative_outcome_label=0))


In [20]:
snapshot.scan_accuracy_metrics()


INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0890:Starting pipeline
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0890:Computed acurracy metrics for the dataset
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0890:Completed pipeline


(  name business_rule                                               mask
 0  all           all  [True, True, True, True, True, True, True, Tru...,
 Empty DataFrame
 Columns: []
 Index: [],
                             name  \
 0       accuracy_below_threshold   
 1       accuracy_above_threshold   
 2  true_pos_rate_below_threshold   
 3  true_pos_rate_above_threshold   
 4  true_neg_rate_below_threshold   
 5  true_neg_rate_above_threshold   
 
                                               metric measure features  \
 0     <compiled_function accuracy at 0x7ff067971130>    None       {}   
 1     <compiled_function accuracy at 0x7ff067971130>    None       {}   
 2  <compiled_function true_pos_rate at 0x7ff06797...    None       {}   
 3  <compiled_function true_pos_rate at 0x7ff06797...    None       {}   
 4  <compiled_function true_neg_rate at 0x7ff06797...    None       {}   
 5  <compiled_function true_neg_rate at 0x7ff06797...    None       {}   
 
   segments  total_issues_test

# SNAPSHOT 3, in-production

At the moment, Etiq does not allow actuals to be recorded separately, but we are working on it. 
This means that to run scans which use the actuals (accuracy, a good chunk of the bias metrics scans, target and concept drift), your dataset must include the actuals. When you log the dataset to Etiq, the column holding the actuals is the one named by the 'label' parameter in your config.
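
In practice this usually means joining the actuals back onto the scored records before logging the dataset. The sketch below shows the idea in plain Python. The record ids, feature values and field names other than `income` are hypothetical, and records whose actuals have not arrived yet are simply left out.

```python
# Records scored in production (hypothetical ids and feature values).
scored_records = [
    {"id": 101, "age": 25, "hours-per-week": 40},
    {"id": 102, "age": 44, "hours-per-week": 40},
    {"id": 103, "age": 38, "hours-per-week": 50},
]

# Actual outcomes that arrived later, keyed by record id (hypothetical).
actuals = {101: 0, 102: 1}

# Attach each actual as the 'income' column, the 'label' named in the config.
labelled_records = [
    {**record, "income": actuals[record["id"]]}
    for record in scored_records
    if record["id"] in actuals
]

for record in labelled_records:
    print(record)
# {'id': 101, 'age': 25, 'hours-per-week': 40, 'income': 0}
# {'id': 102, 'age': 44, 'hours-per-week': 40, 'income': 1}
```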
 
This example is for illustration purposes only, as you will not be running production scans from a Jupyter notebook. 
Etiq can be used with orchestration and model registry tools. Please email us at info@etiq.ai for help with using Etiq with your toolset and for online models. We will be adding demos on how to use Etiq with Airflow and MLflow shortly. 

## Log dataset, model, config

For the dataset, this example reuses the encoded Adult dataset, with the 'income' label standing in for the actuals. In production, any time window will work, depending on your scoring frequency. 

For the model, we will use the model we trained in the previous step. 

For the config, we will use a config whose accuracy scan treats the 'label' column as the actuals.

In [21]:
etiq.load_config("./config_production_accuracy.json")


{'dataset': {'label': 'income',
  'bias_params': {'protected': 'gender',
   'privileged': 1,
   'unprivileged': 0,
   'positive_outcome_label': 1,
   'negative_outcome_label': 0},
  'train_valid_test_splits': [0.0, 1.0, 0.0],
  'remove_protected_from_features': True},
 'scan_accuracy_metrics': {'thresholds': {'accuracy': [0.8, 1.0],
   'true_pos_rate': [0.6, 1.0],
   'true_neg_rate': [0.6, 1.0]}}}

In [22]:
dataset = etiq.BiasDatasetBuilder.dataset(data_encoded)

# Use the already trained model from the previous step
from etiq import Model
model = Model(model_architecture=standard_model, model_fitted=model_fit)

# Create a snapshot and label it as PRODUCTION (snapshots are labelled Pre-Production by default)
from etiq import SnapshotStage
snapshot = project.snapshots.create(name="Snapshot 3", 
                                    dataset=dataset, 
                                    model=model,
                                    bias_params=etiq.biasparams.BiasParams(protected='gender', privileged=1, unprivileged=0, positive_outcome_label=1, negative_outcome_label=0), 
                                    stage=SnapshotStage.PRODUCTION)


In [23]:
snapshot.scan_accuracy_metrics()

INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0967:Starting pipeline
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0967:Computed acurracy metrics for the dataset
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0967:Completed pipeline


(  name business_rule                                               mask
 0  all           all  [True, True, True, True, True, True, True, Tru...,
 Empty DataFrame
 Columns: []
 Index: [],
                             name  \
 0       accuracy_below_threshold   
 1       accuracy_above_threshold   
 2  true_pos_rate_below_threshold   
 3  true_pos_rate_above_threshold   
 4  true_neg_rate_below_threshold   
 5  true_neg_rate_above_threshold   
 
                                               metric measure features  \
 0     <compiled_function accuracy at 0x7ff067971130>    None       {}   
 1     <compiled_function accuracy at 0x7ff067971130>    None       {}   
 2  <compiled_function true_pos_rate at 0x7ff06797...    None       {}   
 3  <compiled_function true_pos_rate at 0x7ff06797...    None       {}   
 4  <compiled_function true_neg_rate at 0x7ff06797...    None       {}   
 5  <compiled_function true_neg_rate at 0x7ff06797...    None       {}   
 
   segments  total_issues_test