# Notebook Summary 


### Quickstart

  1. Import the etiq library. For installation instructions, see our docs (https://docs.etiq.ai/)

  2. Log in to the dashboard so that you can send results to your dashboard instance (the Etiq AWS instance if you use the SaaS version). To deploy on your own cloud instance, get in touch (info@etiq.ai)

  3. Create or open a project 
  
  
### Snapshot 1, pre-configured model 


  4. Load Adult dataset 
  
  5. Load your config file and create your snapshot based on an Etiq-wrapped XGBoost model
  
  6. Scan for accuracy issues 
  
  
  
### Snapshot 2, already-trained model 


  7. Load Adult dataset
  
  8. Train a model 
  
  9. Load your config file and create your snapshot using the test dataset only 
  
  10. Scan for accuracy issues 
  
  
### Snapshot 3, in production


  11. Load Adult dataset & use the already trained model from the previous snapshot 
  
  12. Load your config file and create your snapshot using the test dataset with the 'label' feature assumed to be actuals
  
  13. Scan for accuracy issues 
  
  

## Why run accuracy scans?

Accuracy is what I optimize my models on. Why should I have tests on accuracy metrics as well?

- High accuracy can indicate a problem just as much as low accuracy can. For instance, if plain accuracy is 10% higher than you expected, you might have leakage or another issue somewhere.

- Optimizing for a metric pre-production is not the same as optimizing for that metric in production. You are better off getting a good model off the ground, one with no obvious issues and which is likely to be robust, than chasing an extra 1% of accuracy with a model that overfits, unfairly discriminates against protected demographic groups, or will experience abrupt performance decay.
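
The first point above implies testing a metric against both a lower and an upper bound. A minimal sketch of such a band check in plain Python follows. The helper name `check_metric_in_band` and the example bounds are illustrative assumptions, not part of the Etiq API.

```python
def check_metric_in_band(value, lower, upper):
    """Classify a metric value against an expected [lower, upper] band.

    Hypothetical helper for illustration only; not an Etiq function.
    """
    if value < lower:
        return "below_threshold"  # the model underperforms
    if value > upper:
        return "above_threshold"  # suspiciously good: check for leakage
    return "ok"


# Example: we expected roughly 0.80 to 0.90 accuracy on held-out data.
status_expected = check_metric_in_band(0.87, 0.80, 0.90)
status_suspicious = check_metric_in_band(0.99, 0.80, 0.90)
print(status_expected, status_suspicious)
```

Treating the upper bound as a test, and not only the lower one, is what lets an unexpectedly high score surface as an issue instead of being celebrated.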


Our accuracy scans currently provide 3 metrics: 

1. accuracy - the percentage of correct predictions out of the total 

2. true positive rate - the proportion of positive outcome labels that are correctly classified, out of all positive outcome labels

3. true negative rate - the proportion of negative outcome labels that are correctly classified, out of all negative outcome labels


You can also add your own custom metrics.
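
To make the three built-in metric definitions concrete, here is a small sketch that computes them by hand with NumPy for a binary classifier. The function name `accuracy_metrics` and the toy labels are assumptions made for illustration; this is not how Etiq computes the metrics internally.

```python
import numpy as np


def accuracy_metrics(y_true, y_pred, positive_label=1):
    """Compute accuracy, true positive rate and true negative rate.

    Assumes a binary problem: every label that is not `positive_label`
    is treated as negative. Both classes must be present in `y_true`.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    positives = y_true == positive_label
    negatives = ~positives
    correct = y_true == y_pred
    return {
        "accuracy": float(correct.mean()),
        "true_pos_rate": float(correct[positives].mean()),
        "true_neg_rate": float(correct[negatives].mean()),
    }


# Toy example: 3 positive and 5 negative records.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = accuracy_metrics(y_true, y_pred)
print(metrics)
```

In this toy example 6 of 8 predictions are correct, 2 of the 3 positives are recovered, and 4 of the 5 negatives are recovered, so overall accuracy hides a noticeably weaker true positive rate.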

# SET-UP

In [1]:
import etiq

%pip install Jinja2

Thanks for trying out the ETIQ.ai toolkit!

Visit our getting started documentation at https://docs.etiq.ai/

Visit our Slack channel at https://etiqcore.slack.com/ for support or feedback.

Note: you may need to restart the kernel to use updated packages.


In [2]:
# styling the issue_summary 

# making sure it shows the name or None
def metric_format(v):
    if v is not None:
        return v.__name__
    else:
        return v

def threshold_format(v):
    return f'{v[0]} - {v[1]}'

def set_format(v):
    if len(v) == 0:
        return 'None'
    return f'{v}'

# highlighting the row with the issue
def highlight_rows(row):
    value = row.loc['issues_found']
    if value > 0:
        color = '#FFB3BA' # Red
    else:
        color = '#BAFFC9' # Green
    return ['background-color: {}'.format(color) for r in row]

# visuals: first column white
def background_first_column(col):
    color = 'white'
    return ['background-color: {}'.format(color) for c in col]

# font color - black
def font_color(v):
    color = 'black'
    return 'color: %s' % color

# complete function for issue_summary
def issue_summary_pretty(styler):
    styler.format(metric_format, subset=['metric', 'measure'])
    styler.format(threshold_format, subset=['threshold'])
    styler.format(set_format, subset=['features', 'segments'])
    styler.hide(axis="index")
    styler.apply(highlight_rows, axis=1)
    styler.apply(background_first_column, axis='columns', subset=['name'])
    styler.applymap(font_color)
    styler.set_caption("Issues Summary")
    return styler

In [3]:
from etiq import login as etiq_login
# etiq_login("https://dashboard.etiq.ai/", "<token>")


In [4]:
# Can get/create a single named project
project = etiq.projects.open(name="Accuracy Scans")

# SNAPSHOT 1: xgboost, pre-configured model


To illustrate some of the library's features, we build a model that predicts whether an applicant makes over or under 50K using the Adult dataset from https://archive.ics.uci.edu/ml/datasets/adult.


First, we'll be encoding the categorical features found in this dataset.

Second, we'll log the dataset to Etiq.

In this case we encode before splitting into train/validate/test because we know in advance all the categories people can fall into for this dataset. This means that in production we won't run into new categories that were not included in this dataset.

If this does not hold for your use case, you should NOT encode before splitting your sample, as this might lead to LEAKAGE.
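
When the categories are not known in advance, one way to avoid this kind of leakage is to learn the encoding from the training split only and map anything unseen to a sentinel value. The sketch below uses plain pandas. The toy data and the choice of `-1` as the sentinel are assumptions for illustration, not something the notebook or Etiq prescribes.

```python
import pandas as pd

# Toy splits: "Self-emp" appears in the test split but not in training.
train = pd.DataFrame({"workclass": ["Private", "Local-gov", "Private"]})
test = pd.DataFrame({"workclass": ["Private", "Self-emp"]})

# Learn the category -> integer mapping from the training split only.
categories = sorted(train["workclass"].unique())
mapping = {category: code for code, category in enumerate(categories)}

train_codes = train["workclass"].map(mapping)
# Categories never seen during training become NaN, then the sentinel -1.
test_codes = test["workclass"].map(mapping).fillna(-1).astype(int)

print(mapping)
print(test_codes.tolist())
```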

Label encoding is itself problematic, as it assigns a numerical ranking to categorical values that have no natural order. The best-practice alternative is one-hot encoding, which turns each categorical value into its own column holding a binary value of 0 or 1. As the free library functionality is limited to 15 features, we will not use one-hot encoding in this example.
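
The difference between the two encodings can be seen on a single column. This is a minimal pandas sketch on made-up values, not the encoder used in this notebook.

```python
import pandas as pd

df = pd.DataFrame({"workclass": ["Private", "Local-gov", "Private", "Self-emp"]})

# Label encoding: one integer per category, which implies an ordering
# (here Local-gov < Private < Self-emp) that has no real meaning.
label_encoded = df["workclass"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category, no implied ordering.
one_hot = pd.get_dummies(df["workclass"], prefix="workclass", dtype=int)

print(label_encoded.tolist())
print(one_hot.columns.tolist())
```

One-hot encoding replaces a single column with one column per category, so the feature count grows quickly, which is why it does not fit within a 15-feature limit for a dataset like this one.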

Remember: This is an example only. In the typical use case for most Etiq scans, you log the model to Etiq once you have the sample you will train on. Usually this sample will have numeric features only, as otherwise you would not be able to use it with the training methods of most supported libraries.

In [5]:
# Loading a dataset. We're using the adult dataset
data = etiq.utils.load_sample("adultdataset.csv")
data.head()


Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [6]:
from etiq.transforms import LabelEncoder
import pandas as pd
import numpy as np 

# use a LabelEncoder to transform categorical variables
cont_vars = ['age', 'educational-num', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_vars = list(set(data.columns.values) - set(cont_vars))

label_encoders = {}
data_encoded = pd.DataFrame()
for i in cat_vars:
    label = LabelEncoder()
    data_encoded[i] = label.fit_transform(data[i])
    label_encoders[i] = label

data_encoded.set_index(data.index, inplace=True)
data_encoded = pd.concat([data.loc[:, cont_vars], data_encoded], axis=1).copy()


In [7]:
data_encoded

Unnamed: 0,age,educational-num,fnlwgt,capital-gain,capital-loss,hours-per-week,native-country,occupation,race,relationship,gender,marital-status,income,education,workclass
0,25,7,226802,0,0,40,39,7,2,3,1,4,0,1,4
1,38,9,89814,0,0,50,39,5,4,0,1,2,0,11,4
2,28,12,336951,0,0,40,39,11,4,0,1,2,1,7,2
3,44,10,160323,7688,0,40,39,7,2,0,1,2,1,15,4
4,18,10,103497,0,0,30,39,0,4,3,0,4,0,15,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,12,257302,0,0,38,39,13,4,5,0,2,0,7,4
48838,40,9,154374,0,0,40,39,7,4,0,1,2,1,11,4
48839,58,9,151910,0,0,40,39,1,4,4,0,6,0,11,4
48840,22,9,201490,0,0,20,39,1,4,3,1,4,0,11,4


## Loading the config file

In [8]:
# XXX: Make per-project.
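
The notebook does not show the contents of `config_accuracy.json`. The sketch below builds a plausible config as a Python dict and prints it as JSON. The key names (`dataset`, `label`, `train_valid_test_splits`, `scan_accuracy_metrics`, `thresholds`) and the split values are assumptions inferred from terms used in this notebook; only the 0.6 to 1.0 thresholds are taken from the scan output shown further down. Check https://docs.etiq.ai/ for the actual schema before relying on it.

```python
import json

# ASSUMPTION: illustrative layout only; the real Etiq schema may differ.
config = {
    "dataset": {
        "label": "income",                          # column to predict
        "train_valid_test_splits": [0.8, 0.1, 0.1],  # assumed split
    },
    "scan_accuracy_metrics": {
        "thresholds": {
            # [lower, upper] band per metric, as reported by the scan
            "accuracy": [0.6, 1.0],
            "true_pos_rate": [0.6, 1.0],
            "true_neg_rate": [0.6, 1.0],
        }
    },
}

config_text = json.dumps(config, indent=4)
print(config_text)
```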


## Logging the snapshot to Etiq 

This can happen at any point in the pipeline and in a variety of ways.

In [9]:
from etiq.model import DefaultXGBoostClassifier

with etiq.etiq_config('config_accuracy.json'):
    #load your dataset

    dataset = etiq.BiasDatasetBuilder.dataset(data_encoded)

    # Load our model
    model = DefaultXGBoostClassifier()

    # Creating a snapshot
    snapshot = project.snapshots.create(name="Snapshot 1", 
                                        dataset=dataset, 
                                        model=model, 
                                        bias_params=etiq.biasparams.BiasParams(protected='gender', privileged=1, unprivileged=0, positive_outcome_label=1, negative_outcome_label=0))
    
    #accuracy metrics scan
    (segments, issues, issue_summary) = snapshot.scan_accuracy_metrics()

# issue_summary

issue_summary.style.pipe(issue_summary_pretty)




INFO:etiq.charting:Created histogram summary of data (15 fields)
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0448:Starting pipeline
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0448:Computed acurracy metrics for the dataset {'accuracy': 0.87, 'true_pos_rate': 0.6690328305235137, 'true_neg_rate': 0.9302820649281532}
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0448:Issue Aggregate = {'accuracy_below_threshold': IssueAggregate(name='accuracy_below_threshold', metric=<compiled_function accuracy at 0x14ee37f10>, measure=None, features=set(), segments=set(), total_issues_tested=0, issues_found=0, threshold=(0.0, 1.0)), 'true_pos_rate_below_threshold': IssueAggregate(name='true_pos_rate_below_threshold', metric=<compiled_function true_pos_rate at 0x14eeec040>, measure=None, features=set(), segments=set(), total_issues_tested=0, issues_found=0, threshold=(0.0, 1.0)), 'true_neg_rate_below_threshold': IssueAggregate(name='true_neg_rate_below_threshold', metric=<compiled_function true_neg

name,metric,measure,features,segments,total_issues_tested,issues_found,threshold
accuracy_below_threshold,accuracy,,,,1,0,0.6 - 1.0
true_pos_rate_below_threshold,true_pos_rate,,,{'all'},1,1,0.6 - 1.0
true_neg_rate_below_threshold,true_neg_rate,,,,1,0,0.6 - 1.0
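
Because `issue_summary` exposes `.style`, it appears to be a pandas DataFrame, so the rows that raised issues can be pulled out programmatically, for example to fail a pipeline step. The sketch below rebuilds a small frame by hand from the values printed above instead of calling Etiq, and keeps only three of the columns.

```python
import pandas as pd

# Values copied from the issue summary printed above (subset of columns).
summary = pd.DataFrame(
    {
        "name": [
            "accuracy_below_threshold",
            "true_pos_rate_below_threshold",
            "true_neg_rate_below_threshold",
        ],
        "total_issues_tested": [1, 1, 1],
        "issues_found": [0, 1, 0],
    }
)

# Keep only the checks that found at least one issue.
flagged = summary.loc[summary["issues_found"] > 0, "name"].tolist()
print(flagged)
```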


## Accuracy Metrics Scan

# SNAPSHOT 2, already trained model

## Model Build 

In [10]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')

# Loading a dataset. We're using the adult dataset
data = etiq.utils.load_sample("adultdataset.csv")
data.head()



Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [11]:
# use a LabelEncoder to transform categorical variables
cont_vars = ['age', 'educational-num', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_vars = list(set(data.columns.values) - set(cont_vars))

label_encoders = {}
data_encoded = pd.DataFrame()
for i in cat_vars:
    label = LabelEncoder()
    data_encoded[i] = label.fit_transform(data[i])
    label_encoders[i] = label

data_encoded.set_index(data.index, inplace=True)
data_encoded = pd.concat([data.loc[:, cont_vars], data_encoded], axis=1).copy()



In [12]:
# prepare the training/testing/validation datasets

# separate into train/validate/test datasets of sizes 80%/10%/10% as percentages of the initial data
data_remaining, test = train_test_split(data_encoded, test_size=0.1)
train, valid = train_test_split(data_remaining, test_size=0.1112)

# because we don't want to train on protected attributes or labels to be predicted, 
# let's remove these columns from the training dataset
protected_train = train['gender'].copy() # gender is a protected attribute
y_train = train['income'].copy() # labels we're going to train the model to predict
x_train = train.drop(columns=['gender','income'])
protected_valid = valid['gender'].copy() 
y_valid = valid['income'].copy() 
x_valid = valid.drop(columns=['gender','income'])
protected_test = test['gender'].copy() 
y_test = test['income'].copy()
x_test = test.drop(columns=['gender','income'])
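
The `test_size=0.1112` value in the second split may look arbitrary. It is chosen so that the validation set ends up at roughly 10% of the original data once 10% has already been held out for testing. The arithmetic can be checked directly:

```python
# Two-step split: first hold out 10% for test, then take 11.12% of the
# remaining 90% for validation.
test_frac = 0.1
valid_frac_of_remaining = 0.1112

valid_frac = (1 - test_frac) * valid_frac_of_remaining
train_frac = (1 - test_frac) * (1 - valid_frac_of_remaining)

# Fractions of the original dataset, which should be close to 0.8/0.1/0.1.
print(round(train_frac, 4), round(valid_frac, 4), round(test_frac, 4))
```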

In [13]:
# train an XGBoost model to predict 'income'

standard_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=4)    
model_fit = standard_model.fit(x_train, y_train)

In [14]:
y_train_pred = standard_model.predict(x_train)
y_valid_pred = standard_model.predict(x_valid)
print('Model accuracy on the training dataset :', 
      round(100 * accuracy_score(y_train, y_train_pred),2),'%') # round the score to 2 digits  

print('Model accuracy on the validation dataset :', 
      round(100 * accuracy_score(y_valid, y_valid_pred),2),'%')

Model accuracy on the training dataset : 90.05 %
Model accuracy on the validation dataset : 87.67 %


## Log config, dataset and model to Etiq

For already trained models, make sure you only use a sample you held out. 

As you don't want any retraining of the model to occur, set your train_valid_test split to [0.0, 1.0, 0.0]. 

In [15]:
from etiq import Model

with etiq.etiq_config('config_already_trained_accuracy.json'):
    #load your dataset

    dataset = etiq.BiasDatasetBuilder.dataset(test)

    #Log your already trained model
    model = Model(model_architecture=standard_model, model_fitted=model_fit)

    # Creating a snapshot
    snapshot = project.snapshots.create(name="Snapshot 2", 
                                    dataset=dataset, 
                                    model=model, 
                                    bias_params=etiq.biasparams.BiasParams(protected='gender', privileged=1, unprivileged=0, positive_outcome_label=1, negative_outcome_label=0))
    
    #accuracy metrics scan
    (segments, issues, issue_summary) = snapshot.scan_accuracy_metrics()
    
#issue_summary

issue_summary.style.pipe(issue_summary_pretty)


INFO:etiq.charting:Created histogram summary of data (15 fields)
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0517:Starting pipeline
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0517:Computed acurracy metrics for the dataset {'accuracy': 0.87, 'true_pos_rate': 0.6717687074829932, 'true_neg_rate': 0.9385279050957132}
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0517:Issue Aggregate = {'accuracy_below_threshold': IssueAggregate(name='accuracy_below_threshold', metric=<compiled_function accuracy at 0x14ee37f10>, measure=None, features=set(), segments=set(), total_issues_tested=0, issues_found=0, threshold=(0.0, 1.0)), 'true_pos_rate_below_threshold': IssueAggregate(name='true_pos_rate_below_threshold', metric=<compiled_function true_pos_rate at 0x14eeec040>, measure=None, features=set(), segments=set(), total_issues_tested=0, issues_found=0, threshold=(0.0, 1.0)), 'true_neg_rate_below_threshold': IssueAggregate(name='true_neg_rate_below_threshold', metric=<compiled_function true_neg

name,metric,measure,features,segments,total_issues_tested,issues_found,threshold
accuracy_below_threshold,accuracy,,,,1,0,0.6 - 1.0
true_pos_rate_below_threshold,true_pos_rate,,,,1,0,0.6 - 1.0
true_neg_rate_below_threshold,true_neg_rate,,,,1,0,0.6 - 1.0


# SNAPSHOT 3, in-production

At the moment, Etiq does not allow actuals to be recorded separately, but we are working on it. 
This means that to run scans which use the actuals (accuracy, a good chunk of the bias metrics scans, target and concept drift), you will have to build your dataset so that it includes the actuals. When you log it to Etiq, the column holding the actuals is the one named by the 'label' parameter in your config.
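
One way to build such a dataset is to join the records the model scored with the actual outcomes once they arrive. The pandas sketch below assumes a hypothetical `record_id` key and made-up values; neither comes from Etiq or from the Adult dataset.

```python
import pandas as pd

# Hypothetical records scored in production, keyed by record_id.
scored = pd.DataFrame(
    {
        "record_id": [101, 102, 103],
        "age": [25, 38, 44],
        "hours-per-week": [40, 50, 40],
    }
)
# Hypothetical actual outcomes that arrived later (102 is still unknown).
actuals = pd.DataFrame({"record_id": [101, 103], "income": [0, 1]})

# Inner join keeps only records whose actual outcome is already known.
production = scored.merge(actuals, on="record_id", how="inner", validate="one_to_one")
# Drop the join key so it is not treated as a feature.
production = production.drop(columns=["record_id"])

print(production)
```

The resulting frame has the features plus an `income` column, which is the column the 'label' parameter in the config would then point to.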
 
This example is for illustration purposes only, as you will not be running production scans from a Jupyter notebook. 
Etiq can be used with orchestration and model registry tools. Please email us at info@etiq.ai for help with using Etiq with your toolset and for online models. We will be adding demos on how to use Etiq with Airflow and MLflow shortly. 

## Log dataset, model, config

For the dataset, we reuse the encoded Adult dataset prepared earlier in the notebook (`data_encoded`). In production, any time window will work; it depends on your scoring frequency. 

For the model, we will use the model trained in the previous step. 

For the config, we will use a production config (`config_production_accuracy.json`). Because accuracy scans need the actuals, its 'label' parameter must point to the column that holds them.

In [16]:
from etiq import Model
from etiq import SnapshotStage

with etiq.etiq_config('config_production_accuracy.json'):
    #load your dataset
    dataset = etiq.BiasDatasetBuilder.dataset(data_encoded)

    # Use the already trained model from the previous step
    model = Model(model_architecture=standard_model, model_fitted=model_fit)

    # Creating a snapshot, label it as PRODUCTION (snapshots are labelled Pre-Production by default)
    snapshot = project.snapshots.create(name="Snapshot 3", 
                                        dataset=dataset, 
                                        model=model,
                                        bias_params=etiq.biasparams.BiasParams(protected='gender', privileged=1, unprivileged=0, positive_outcome_label=1, negative_outcome_label=0), 
                                        stage=SnapshotStage.PRODUCTION)

    #accuracy metrics scan
    (segments, issues, issue_summary) = snapshot.scan_accuracy_metrics()
   
# issue_summary

issue_summary.style.pipe(issue_summary_pretty)

INFO:etiq.charting:Created histogram summary of data (15 fields)
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0440:Starting pipeline
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0440:Computed acurracy metrics for the dataset {'accuracy': 0.9, 'true_pos_rate': 0.7065115085137332, 'true_neg_rate': 0.9549185843089759}
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0440:Issue Aggregate = {'accuracy_below_threshold': IssueAggregate(name='accuracy_below_threshold', metric=<compiled_function accuracy at 0x14ee37f10>, measure=None, features=set(), segments=set(), total_issues_tested=0, issues_found=0, threshold=(0.0, 1.0)), 'true_pos_rate_below_threshold': IssueAggregate(name='true_pos_rate_below_threshold', metric=<compiled_function true_pos_rate at 0x14eeec040>, measure=None, features=set(), segments=set(), total_issues_tested=0, issues_found=0, threshold=(0.0, 1.0)), 'true_neg_rate_below_threshold': IssueAggregate(name='true_neg_rate_below_threshold', metric=<compiled_function true_neg_

name,metric,measure,features,segments,total_issues_tested,issues_found,threshold
accuracy_below_threshold,accuracy,,,,1,0,0.6 - 1.0
true_pos_rate_below_threshold,true_pos_rate,,,,1,0,0.6 - 1.0
true_neg_rate_below_threshold,true_neg_rate,,,,1,0,0.6 - 1.0
