# Notebook Summary 


### Quickstart

  1. Import the etiq library. For installation instructions, see our docs (https://docs.etiq.ai/).

  2. Log in to the dashboard so that you can send results to your dashboard instance (the Etiq AWS instance if you use the SaaS version). To deploy on your own cloud instance, get in touch (info@etiq.ai).

  3. Create or open a project 
  
  
  
### Snapshot 1, already-trained model 


  4. Load the Adult dataset
  
  5. Train a model
  
  6. Load your config file and create your snapshot using the test dataset only
  
  7. Scan for accuracy RCA issues
  
 
  

## What are Accuracy RCA scans?

Some scans are simple tests that show whether a metric for an entire sample dataset is above or below certain thresholds set by the user. Other scans are more complex and look at whether a metric for only a part of the sample dataset is below or above the threshold. This will help you discover segments of customers or groups of records for which the model has a lower than expected accuracy or groups for which bias thresholds are not met.

##### If your accuracy tests have picked up an issue, this scan identifies exactly which segment is affected, which should help you fix it sooner.

If only a part of the data drifted or only a segment is underperforming, your overall tests might not pick up on it, but this test would. While you can pre-set segments you are interested in, if you just run the scan as is, it will discover problematic segments on its own.
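
To make the point concrete, here is a minimal sketch with made-up data. The column names `segment`, `actual` and `predicted` are illustrative and are not part of the etiq API. Overall accuracy looks acceptable, while a per-segment breakdown exposes the small group that underperforms.

```python
import pandas as pd

# Hypothetical toy data: 90 records in segment "A", 10 in segment "B".
df = pd.DataFrame({
    "segment":   ["A"] * 90 + ["B"] * 10,
    "actual":    [1] * 100,
    # Segment A: 85 of 90 correct; segment B: 4 of 10 correct.
    "predicted": [1] * 85 + [0] * 5 + [1] * 4 + [0] * 6,
})
df["correct"] = df["actual"] == df["predicted"]

# The overall figure averages over every record...
overall_accuracy = df["correct"].mean()
# ...while the grouped figure shows each segment separately.
segment_accuracy = df.groupby("segment")["correct"].mean()

print(f"overall: {overall_accuracy:.2f}")
print(segment_accuracy.round(2))
```

With a threshold of, say, 0.8, the overall figure would pass while segment "B" would not. The RCA scan automates the search for such segments instead of requiring you to define them up front.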


The RCA accuracy scans so far provide 3 metrics out-of-the-box:

1. Accuracy - % correct out of total 

2. True Positive Rate - the proportion of positive outcome labels that are correctly classified out of all positive outcome labels

3. True Negative Rate - the proportion of negative outcome labels that are correctly classified out of all negative outcome labels
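
The three definitions above can be written out in a few lines of plain Python. This is only a sketch for intuition, using made-up labels; `accuracy_metrics` is a hypothetical helper and not how etiq computes these metrics internally.

```python
def accuracy_metrics(actual, predicted, positive=1, negative=0):
    """Return accuracy, true positive rate and true negative rate.

    Assumes binary labels and at least one positive and one negative
    record; otherwise the rate denominators would be zero.
    """
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    positives = [p for a, p in pairs if a == positive]
    negatives = [p for a, p in pairs if a == negative]
    true_pos = sum(p == positive for p in positives)
    true_neg = sum(p == negative for p in negatives)
    return {
        "accuracy": correct / len(pairs),
        "true_pos_rate": true_pos / len(positives),
        "true_neg_rate": true_neg / len(negatives),
    }

# Hypothetical labels: 4 positives (3 classified correctly),
# 6 negatives (5 classified correctly).
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

metrics = accuracy_metrics(actual, predicted)
print(metrics)
```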


# SET-UP

In [1]:
import etiq

#!pip install Jinja2
#import jinja2


Thanks for trying out the ETIQ.ai toolkit!

Visit our getting started documentation at https://docs.etiq.ai/

Visit our Slack channel at https://etiqcore.slack.com/ for support or feedback.



In [2]:
# styling the issue_summary + segments_accuracy_rca

def metric_format(v):
    if v is not None:
        return v.__name__
    else:
        return v

def threshold_format(v):
    return f'{v[0]} - {v[1]}'

def set_format(v):
    if len(v) == 0:
        return 'None'
    return f'{v}'

def highlight_rows(row):
    value = row.loc['issues_found']
    if value > 0:
        color = '#FFB3BA' # Red
    else:
        color = '#BAFFC9' # Green
    return ['background-color: {}'.format(color) for r in row]

def backround_first_column(col):
    color = 'white'
    return ['background-color: {}'.format(color) for c in col]

def font_color(v):
    color = 'black'
    # if v:
    #     color = 'blue'
    return 'color: %s' % color

def issue_summary_pretty(styler):
    styler.format(metric_format, subset=['metric', 'measure'])
    styler.format(threshold_format, subset=['threshold'])
    styler.format(set_format, subset=['features', 'segments'])
    styler.hide(axis="index")
    styler.apply(highlight_rows, axis=1)
    styler.apply(backround_first_column, axis='columns', subset=['name'])
    styler.applymap(font_color)
    styler.set_caption("Issues Summary")
    return styler 

def list_of_segments(summary):
    segments = summary["segment"]
    seg_list = []
    for elem in segments:
        if elem in seg_list:
            continue
        else:
            seg_list.append(elem)
    return seg_list

def issues_accuracy_rca_pretty(styler):
    styler.hide(axis="index")
    # styler.apply(backround_first_column, axis='columns', subset=['name'])
    # styler.applymap(font_color)
    styler.set_caption("Issues Accuracy RCA")
    styler.set_properties(**{'background-color': 'white',
                           'color': 'black'})
    return styler


In [3]:
from etiq import login as etiq_login
#etiq_login("https://dashboard.etiq.ai/", "<token>")

In [4]:
# Can get/create a single named project
project = etiq.projects.open(name="Accuracy RCA Scans")

# SNAPSHOT 1: XGBoost, pre-configured model


To illustrate some of the library's features, we build a model that predicts whether an applicant makes over or under 50K using the Adult dataset from https://archive.ics.uci.edu/ml/datasets/adult.


First, we'll be encoding the categorical features found in this dataset.

Second, we'll log the dataset to Etiq.

In this case we encode before splitting into train/validation/test because the categories in this dataset are known in advance. In production we won't encounter a category that is missing from this dataset, so encoding before the split is safe here.

However, if this does not hold for your use case, you should NOT encode before splitting your sample, as this might lead to LEAKAGE. Label encoding is itself problematic because it imposes a numerical ordering on categorical variables; one-hot encoding is the better practice. Label encoding is used here for example purposes only.
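
As a rough illustration of the one-hot alternative, the sketch below derives the dummy columns from the training split only and then aligns the test split to them, so a category that appears only at test time does not add a column. The toy `workclass` values and the `toy_*` names are made up. `pd.get_dummies` and `DataFrame.reindex` are standard pandas calls, but this is not what the rest of the notebook does.

```python
import pandas as pd

# Hypothetical splits: "Never-worked" appears only in the test split.
toy_train = pd.DataFrame({"workclass": ["Private", "Local-gov", "Private"]})
toy_test = pd.DataFrame({"workclass": ["Private", "Never-worked"]})

# Learn the set of dummy columns from the training split only.
train_encoded = pd.get_dummies(toy_train, columns=["workclass"]).astype(int)

# Encode the test split, then force it onto the training columns:
# unseen categories are dropped, missing ones are filled with 0.
test_encoded = (
    pd.get_dummies(toy_test, columns=["workclass"])
    .reindex(columns=train_encoded.columns, fill_value=0)
    .astype(int)
)

print(test_encoded)
```

The unseen category ends up as a row of zeros rather than a new feature, which keeps the test matrix compatible with a model fitted on the training columns.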

Once our model is built, we will:
 - Log the relevant config to etiq 
 - Log the model together with the hold-out sample to etiq
 - Run the Accuracy RCA scan



## Model Build 

In [5]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')

# Loading a dataset. We're using the adult dataset
data = etiq.utils.load_sample("adultdataset.csv")
data.head()



Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [6]:
# use a LabelEncoder to transform categorical variables
cont_vars = ['age', 'educational-num', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_vars = list(set(data.columns.values) - set(cont_vars))

label_encoders = {}
data_encoded = pd.DataFrame()
for i in cat_vars:
    label = LabelEncoder()
    data_encoded[i] = label.fit_transform(data[i])
    label_encoders[i] = label

data_encoded.set_index(data.index, inplace=True)
data_encoded = pd.concat([data.loc[:, cont_vars], data_encoded], axis=1).copy()



In [7]:
# prepare the training/testing/validation datasets

# separate into train/validate/test datasets of sizes 80%/10%/10% as percentages of the initial data
data_remaining, test = train_test_split(data_encoded, test_size=0.1)
train, valid = train_test_split(data_remaining, test_size=0.1112)

# because we don't want to train on protected attributes or labels to be predicted, 
# let's remove these columns from the training dataset
protected_train = train['gender'].copy() # gender is a protected attribute
y_train = train['income'].copy() # labels we're going to train the model to predict
x_train = train.drop(columns=['gender','income'])
protected_valid = valid['gender'].copy() 
y_valid = valid['income'].copy() 
x_valid = valid.drop(columns=['gender','income'])
protected_test = test['gender'].copy() 
y_test = test['income'].copy()
x_test = test.drop(columns=['gender','income'])
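
The second `test_size` of 0.1112 looks odd at first because it is applied to the 90% of the data that remains after the first split, not to the full dataset. A quick check of the arithmetic (the variable names below are illustrative only):

```python
test_fraction = 0.1                               # share of the full data held out for test
valid_fraction = (1 - test_fraction) * 0.1112     # 0.1112 of the remaining 90%
train_fraction = 1 - test_fraction - valid_fraction

print(f"train={train_fraction:.4f} valid={valid_fraction:.4f} test={test_fraction:.4f}")
```

The result is roughly 80%/10%/10%. Note that `train_test_split` shuffles randomly on each run unless a `random_state` is passed, so the exact rows in each split, and therefore the accuracy figures, can vary between runs.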

In [8]:
# train an XGBoost model to predict 'income'

standard_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=4)    
model_fit = standard_model.fit(x_train, y_train)

In [9]:
y_train_pred = standard_model.predict(x_train)
y_valid_pred = standard_model.predict(x_valid)
print('Model accuracy on the training dataset :', 
      round(100 * accuracy_score(y_train, y_train_pred),2),'%') # round the score to 2 digits  

print('Model accuracy on the validation dataset :', 
      round(100 * accuracy_score(y_valid, y_valid_pred),2),'%')

Model accuracy on the training dataset : 90.1 %
Model accuracy on the validation dataset : 87.15 %


## Adding a Custom Metric

The etiq library has three built-in accuracy metrics: accuracy, true_pos_rate (true positive rate), and true_neg_rate (true negative rate). However, it is easy to create custom accuracy metrics and use them with an accuracy scan.

For example, F-score is an accuracy metric calculated from precision and recall (see https://en.wikipedia.org/wiki/F-score for more information). This can be set up as a custom accuracy metric using a set of Python decorators available in the etiq library.

NB: Decorators operate from bottom to top. Keep this in mind when adding your own custom metric.
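
The ordering rule is a general Python behaviour and can be seen with a small stand-alone example. The `tag` decorator below is hypothetical and has nothing to do with etiq; it simply records the order in which decorators are applied.

```python
def tag(label):
    """Hypothetical decorator factory that records the order of application."""
    def decorator(func):
        applied = getattr(func, "applied", [])
        func.applied = applied + [label]
        return func
    return decorator

@tag("outer")   # applied last
@tag("inner")   # applied first, because it sits closest to the function
def metric():
    return 0.0

print(metric.applied)  # ['inner', 'outer']
```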

In [10]:
@etiq.metrics.accuracy_metric
@etiq.custom_metric
@etiq.actual_values("actual")
@etiq.prediction_values("predictions")
@etiq.positive_outcome("positive_outcome_label")
@etiq.negative_outcome("negative_outcome_label")
def f_score(predictions, actual, positive_outcome_label, negative_outcome_label):
    true_pos = sum((predictions == actual) & (actual == positive_outcome_label))
    # A false negative is an actual positive that the model got wrong
    false_neg = sum((predictions != actual) & (actual == positive_outcome_label))
    # A false positive is an actual negative that the model got wrong
    false_pos = sum((predictions != actual) & (actual == negative_outcome_label))
    # F-score = TP / (TP + 0.5 * (FP + FN))
    return true_pos / (true_pos + 0.5*(false_pos + false_neg))
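
As a sanity check on the formula, the sketch below computes the F-score from hypothetical confusion-matrix counts and confirms that it matches the usual harmonic mean of precision and recall.

```python
# Hypothetical counts, for illustration only.
true_pos, false_pos, false_neg = 30, 10, 20

# Form used in the custom metric: TP / (TP + 0.5 * (FP + FN))
f_from_counts = true_pos / (true_pos + 0.5 * (false_pos + false_neg))

# Equivalent form: harmonic mean of precision and recall
precision = true_pos / (true_pos + false_pos)   # 30 / 40
recall = true_pos / (true_pos + false_neg)      # 30 / 50
f_from_pr = 2 * precision * recall / (precision + recall)

print(round(f_from_counts, 4), round(f_from_pr, 4))
```

Because false positives and false negatives enter the formula only through their sum, the score is unaffected by which of the two error counts is which.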

## Log config, dataset and model to Etiq

For already-trained models, make sure you only use a sample you held out.

As you don't want any retraining of the model to occur, set your train_valid_test split to [0.0, 1.0, 0.0]. 

In this instance we have used a low volume for the minimum segment size to illustrate the issues, but you should probably set it higher, depending on your use case and sample size. Also we have set the threshold for accuracy high so as to make sure it finds at least one issue.

In [11]:
from etiq import Model

with etiq.etiq_config("./config_rca_accuracy.json"):
    # Log your dataset (the sample you held out)
    dataset = etiq.BiasDatasetBuilder.dataset(test)
    
    # Log your already-trained model
    model = Model(model_architecture=standard_model, model_fitted=model_fit)
    
    # Create the snapshot
    snapshot = project.snapshots.create(name="Snapshot 1", 
                                        dataset=dataset,
                                        model=model, 
                                        bias_params=etiq.biasparams.BiasParams(protected='gender', 
                                                                               privileged=1, 
                                                                               unprivileged=0, 
                                                                               positive_outcome_label=1, 
                                                                               negative_outcome_label=0)
                                       )
    print("Running Accuracy Scan... \n")
    (segments_accuracy, issues_accuracy, issue_summary_accuracy) = snapshot.scan_accuracy_metrics()
    print("\n Running RCA Accuracy Scan... \n")
    (segments_accuracy_rca, issues_accuracy_rca, issue_summary_accuracy_rca)  = snapshot.scan_accuracy_metrics_rca()

INFO:etiq.charting:Created histogram summary of data (15 fields)
Running Accuracy Scan... 

INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0514:Starting pipeline
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0514:Computed acurracy metrics for the dataset {'accuracy': 0.87, 'true_pos_rate': 0.6591107236268526, 'true_neg_rate': 0.9355270197966827, 'f_score': 0.7052238805970149}
INFO:etiq.pipeline.AccuracyMetricsIssuePipeline0514:Issue Aggregate = {'accuracy_below_threshold': IssueAggregate(name='accuracy_below_threshold', metric=<compiled_function accuracy at 0x14eb9bf10>, measure=None, features=set(), segments=set(), total_issues_tested=0, issues_found=0, threshold=(0.0, 1.0)), 'true_pos_rate_below_threshold': IssueAggregate(name='true_pos_rate_below_threshold', metric=<compiled_function true_pos_rate at 0x14ec54040>, measure=None, features=set(), segments=set(), total_issues_tested=0, issues_found=0, threshold=(0.0, 1.0)), 'true_neg_rate_below_threshold': IssueAggregate(name='true_neg

In [12]:
# issue_summary_accuracy

issue_summary_accuracy.style.pipe(issue_summary_pretty)

name,metric,measure,features,segments,total_issues_tested,issues_found,threshold
accuracy_below_threshold,accuracy,,,,1,0,0.5 - 1.0
true_pos_rate_below_threshold,true_pos_rate,,,{'all'},1,1,0.5 - 1.0
true_neg_rate_below_threshold,true_neg_rate,,,,1,0,0.5 - 1.0
f_score_below_threshold,f_score,,,,1,0,0.5 - 1.0


In [13]:
# issue_summary_accuracy_rca

issue_summary_accuracy_rca.style.pipe(issue_summary_pretty)

name,metric,measure,features,segments,total_issues_tested,issues_found,threshold
accuracy_below_threshold,accuracy,,,{8},9,1,0.8 - 1.0
true_pos_rate_below_threshold,true_pos_rate,,,{9},1,1,0.7 - 1.0


In [14]:
# issues_accuracy_rca

def highlight_row_from_listofseg(row):
    list_seg = list_of_segments(issues_accuracy_rca)
    color = 'white'
    if row["name"]+1 in list_seg:
        color = '#FFB3BA' # Red
    return ['background-color: {}'.format(color) for r in row]

def segments_accuracy_rca_pretty(styler):
    styler.hide(axis="index")
    styler.apply(backround_first_column, axis='columns')
    styler.applymap(font_color)
    styler.set_caption("Segments Accuracy RCA")
    styler.apply(highlight_row_from_listofseg, axis=1)
    return styler 

issues_accuracy_rca.style.pipe(issues_accuracy_rca_pretty)

name,feature,segment,measure,measure_value,metric,metric_value,threshold,value
accuracy_below_threshold,,8,,,,0.76,"[0.8, 1.0]",
true_pos_rate_below_threshold,,9,,,,0.659111,"[0.7, 1.0]",


In [15]:
#segments_accuracy_rca

segments_accuracy_rca.style.pipe(segments_accuracy_rca_pretty)



name,business_rule,mask,tags
0,all,[ True True True ... True True True],"{'accuracy', 'true_pos_rate'}"
1,`relationship` > 0.0,[ True False True ... True True True],{'accuracy'}
2,`age` > 27.0 and `relationship` > 0.0,[ True False False ... True True False],{'accuracy'}
3,`age` > 40.0 and `relationship` > 0.0,[False False False ... True True False],{'accuracy'}
4,27.0 < `age` <= 40.0 and `relationship` > 0.0,[ True False False ... False False False],{'accuracy'}
5,`age` <= 27.0 and `relationship` > 0.0,[False False True ... False False True],{'accuracy'}
6,`age` <= 27.0 and `relationship` > 3.0,[False False False ... False False True],{'accuracy'}
7,`age` <= 27.0 and 0.0 < `relationship` <= 3.0,[False False True ... False False False],{'accuracy'}
8,`relationship` <= 0.0,[False True False ... False False False],{'accuracy'}


In terms of accuracy, the normal scan doesn't find any issues: overall accuracy is 0.87. The RCA scan, however, finds one issue: segment 8 has an accuracy of 0.76, which falls below the 0.8 threshold.
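
The check behind each row of the issues table can be thought of as a simple range test. The sketch below only illustrates that idea and is not etiq's implementation: `find_issues` is a hypothetical helper, while the accuracy values and the RCA threshold are the ones reported in the outputs above.

```python
def find_issues(metric_values, thresholds):
    """Return the names of metrics whose value falls outside [low, high]."""
    issues = []
    for name, value in metric_values.items():
        low, high = thresholds[name]
        if not (low <= value <= high):
            issues.append(name)
    return issues

thresholds = {"accuracy": (0.8, 1.0)}   # RCA scan threshold for accuracy
overall = {"accuracy": 0.87}            # whole hold-out sample
segment_8 = {"accuracy": 0.76}          # segment flagged by the RCA scan

print(find_issues(overall, thresholds))    # []
print(find_issues(segment_8, thresholds))  # ['accuracy']
```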