# Overview

The etiqai library enables users to identify and mitigate bias in predictive models. The library is similar in spirit to libraries such as PyTorch or scikit-learn, with a battery of metrics and debiasing algorithms available to help find the best debiasing method for the problem at hand. We are now adding functionality to identify additional issues in predictive models, e.g. leakage, which we have found to be very time-consuming to track down and to have many other bias- and non-bias-related implications down the line.

The DataPipeline object holds what we'd like to focus on during the debiasing process: the dataset used for training the model, the model we'd like to test, and the fairness metrics used to evaluate results. DebiasPipeline objects take the initial DataPipeline object and apply either Identify methods, which aim to flag rows at risk of being biased against, or Repair methods, which generate new "debiased" datasets. Models trained on debiased datasets are intended to perform better on fairness metrics. We have a (growing) collection of repair methods that allow our users to pick the debiased dataset version that best fits their criteria for a good solution.

We really appreciate your feedback on using this library, whether it's good or bad! The library can be downloaded from https://etiq.ai/download and documentation is also available on the site. To access our full solution, including hands-on support, or to submit comments, feature requests or issues, please log in to our Slack channel https://etiqcore.slack.com or email us at info@etiq.ai



# The prediction problem setup

To illustrate some of the library's features, we build a model that predicts whether an applicant makes over or under 50K using the Adult dataset https://archive.ics.uci.edu/ml/datasets/adult. 

In [1]:
from etiq_core import *
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')

Thanks for trying out the ETIQ.ai toolkit!

Visit our getting started documentation at https://docs.etiq.ai/

Visit our Slack channel at https://etiqcore.slack.com/ for support or feedback.



Here we're loading the Adult dataset that is already available with the library, but users can use any dataset in a pandas dataframe format for their debiasing analysis. 

In [2]:
data = load_sample('adultdata')
data.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
5,34,Private,198693,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K
6,29,?,227026,HS-grad,9,Never-married,?,Unmarried,Black,Male,0,0,40,United-States,<=50K
7,63,Self-emp-not-inc,104626,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,3103,0,32,United-States,>50K
8,24,Private,369667,Some-college,10,Never-married,Other-service,Unmarried,White,Female,0,0,40,United-States,<=50K
9,55,Private,104996,7th-8th,4,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,10,United-States,<=50K


To give an example of leakage, we have intentionally introduced a new feature that is very highly correlated with the target. When running the example we will be alerted to potential issues. In a real-life situation we of course wouldn't know where the issue is.

In [3]:
# add a variable which is highly correlated with the target: monthly salary

data['monthly_salary'] = np.where(data['income'] == '<=50K', '<4.2K', '>4.2K')
 
data.head(10)


Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income,monthly_salary
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K,<4.2K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K,<4.2K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K,>4.2K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K,>4.2K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K,<4.2K
5,34,Private,198693,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K,<4.2K
6,29,?,227026,HS-grad,9,Never-married,?,Unmarried,Black,Male,0,0,40,United-States,<=50K,<4.2K
7,63,Self-emp-not-inc,104626,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,3103,0,32,United-States,>50K,>4.2K
8,24,Private,369667,Some-college,10,Never-married,Other-service,Unmarried,White,Female,0,0,40,United-States,<=50K,<4.2K
9,55,Private,104996,7th-8th,4,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,10,United-States,<=50K,<4.2K
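
As a quick illustration of why this feature is a problem, the sketch below cross-tabulates it against the target on a small toy dataframe (a stand-in for the Adult data, so the counts are illustrative only). Because `monthly_salary` is derived directly from `income`, every value of the feature maps to exactly one target class.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Adult data: only the target column matters here.
toy = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K", ">50K", "<=50K"]})

# Same construction as in the notebook cell above.
toy["monthly_salary"] = np.where(toy["income"] == "<=50K", "<4.2K", ">4.2K")

# Cross-tabulate the new feature against the target.
table = pd.crosstab(toy["monthly_salary"], toy["income"])
print(table)

# If every row of the table has exactly one non-zero cell,
# the feature determines the target completely.
is_deterministic = bool(((table > 0).sum(axis=1) == 1).all())
print("monthly_salary fully determines income:", is_deterministic)
```

A table with a single non-zero cell per row is about the strongest leakage signal a categorical feature can give.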


In [4]:
print('List of columns:', data.columns.values)
print('Nr rows in the dataset:', data.shape[0])

List of columns: ['age' 'workclass' 'fnlwgt' 'education' 'educational-num' 'marital-status'
 'occupation' 'relationship' 'race' 'gender' 'capital-gain' 'capital-loss'
 'hours-per-week' 'native-country' 'income' 'monthly_salary']
Nr rows in the dataset: 48842


## Standard approach to training a binary classifier

We want to predict (classify) if a person's income is above or below 50K using this dataset. Following standard ML practice, after some data cleaning (removing rows with missing values and encoding categorical variables), we split the dataset into train/validate/test groups and train a model for this binary classification task, using the features in the dataset to predict the 'income' variable. 

In [5]:
# data preprocessing
# remove rows with missing values
data = data.replace('?', np.nan)
data.dropna(inplace=True)

# use a LabelEncoder to transform categorical variables 
cont_vars = ['age', 'educational-num', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_vars = list(set(data.columns.values) - set(cont_vars))

label_encoders = {}
data_encoded = pd.DataFrame()
for i in cat_vars:
    label = LabelEncoder()
    data_encoded[i] = label.fit_transform(data[i])
    label_encoders[i] = label
data_encoded.set_index(data.index, inplace=True)
data_encoded = pd.concat([data.loc[:, cont_vars], data_encoded], axis=1).copy()

In [6]:
# separate into train/validate/test datasets of sizes 80%/10%/10% as percentages of the initial data

data_remaining, test = train_test_split(data_encoded, test_size=0.1)
train, valid = train_test_split(data_remaining, test_size=0.1112)

# because we don't want to train on protected attributes or labels to be predicted, 
# let's remove these columns from the training dataset
protected_train = train['gender'].copy() # gender is a protected attribute
y_train = train['income'].copy() # labels we're going to train the model to predict
x_train = train.drop(columns=['gender','income'])
protected_valid = valid['gender'].copy() 
y_valid = valid['income'].copy() 
x_valid = valid.drop(columns=['gender','income'])
protected_test = test['gender'].copy() 
y_test = test['income'].copy()
x_test = test.drop(columns=['gender','income'])
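
The `test_size=0.1112` in the second split may look odd. The arithmetic below shows why it gives roughly an 80/10/10 split: the second split operates on the 90% left after the test set is removed, so the validation share has to be about 0.1 / 0.9 ≈ 0.1111 of that remainder.

```python
# Fractions used in the two successive train_test_split calls above.
test_frac = 0.1
valid_frac_of_remaining = 0.1112

remaining = 1.0 - test_frac                       # share left after the test split
valid_frac = remaining * valid_frac_of_remaining  # validation share of the full dataset
train_frac = remaining - valid_frac               # training share of the full dataset

print(f"train={train_frac:.4f} valid={valid_frac:.4f} test={test_frac:.4f}")
```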

In [7]:
# train an XGBoost model to predict 'income'

standard_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=4)    
model_fit = standard_model.fit(x_train, y_train)

How accurate is the model we trained?

In [8]:
y_train_pred = standard_model.predict(x_train)
y_valid_pred = standard_model.predict(x_valid)
print('Model accuracy on the training dataset :', 
      round(100 * accuracy_score(y_train, y_train_pred),2),'%') # round the score to 2 digits  

print('Model accuracy on the validation dataset :', 
      round(100 * accuracy_score(y_valid, y_valid_pred),2),'%')

Model accuracy on the training dataset : 100.0 %
Model accuracy on the validation dataset : 100.0 %


By this point in the analysis, it's clear that there is an issue with this model: the accuracy is 100%. Now imagine facing this problem with many features inherited from a data source you don't fully understand. How do you find where the problem is?
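
One generic way to narrow down the culprit by hand is to train a small model on each feature in isolation and look for features that predict the target almost perfectly on their own. The sketch below does this on synthetic data with scikit-learn. It is not how etiq detects leakage, and the column names, the helper `single_feature_accuracy` and the 0.99 threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000

# Synthetic, already-encoded data (hypothetical column names).
target = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.3
features = pd.DataFrame({
    "noise": rng.integers(0, 5, size=n),                # unrelated to the target
    "weak_signal": np.where(flip, 1 - target, target),  # agrees with the target about 70% of the time
    "leaky": target,                                    # exact copy of the target
})


def single_feature_accuracy(X, y, threshold=0.99):
    """Fit a shallow tree on each column alone.

    Returns a dict of validation accuracy per column and the list of
    columns whose accuracy is at or above `threshold`.
    """
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
    scores = {}
    for col in X.columns:
        clf = DecisionTreeClassifier(max_depth=3, random_state=0)
        clf.fit(X_tr[[col]], y_tr)
        scores[col] = clf.score(X_va[[col]], y_va)
    suspects = [col for col, score in scores.items() if score >= threshold]
    return scores, suspects


scores, suspects = single_feature_accuracy(features, target)
print(scores)
print("Suspect features:", suspects)
```

This kind of probe gets slow and noisy with many features, which is part of the motivation for an automated check.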

# Leakage analysis

The accuracy results above strongly suggest that something is wrong.

The etiq library will help us test for a potential leakage problem.

We wrap the model that we want to evaluate, the dataset used to train that model and the metrics we'd like to calculate into a DataPipeline object. We will be looking at target leakage, but also at potential leakage from a demographic feature into one of the features used in the model.
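
To make the second kind of leakage concrete: a feature can act as a proxy for a protected attribute even when that attribute is dropped before training. In the sample rows shown earlier, for instance, every 'Husband' in 'relationship' is 'Male'. The sketch below measures such an association with Cramér's V on a few made-up rows. It is a generic statistic rather than the test etiq runs, and the helper `cramers_v` is written here purely for illustration.

```python
import numpy as np
import pandas as pd


def cramers_v(a, b):
    """Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    observed = pd.crosstab(a, b).to_numpy().astype(float)
    n = observed.sum()
    # Expected counts under independence, from the row and column totals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Made-up rows using category names from the Adult dataset.
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Male", "Female", "Male", "Female"],
    "relationship": ["Husband", "Husband", "Wife", "Wife", "Husband", "Wife", "Own-child", "Own-child"],
    "workclass": ["Private", "Local-gov", "Private", "Local-gov", "Private", "Local-gov", "Local-gov", "Private"],
})

v_relationship = cramers_v(toy["relationship"], toy["gender"])
v_workclass = cramers_v(toy["workclass"], toy["gender"])
print(f"relationship vs gender: {v_relationship:.2f}")
print(f"workclass vs gender:    {v_workclass:.2f}")
```

On these rows 'relationship' is strongly associated with 'gender' while 'workclass' is not, so a model trained without 'gender' could still recover much of it from 'relationship'.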

## Logging the pipeline

Just like in the bias example, we now log a few key elements we need to run the leakage pipeline.

In [9]:
#FIRST - assign the parameters

# the protected category we'll look at is 'gender' for the demographic leakage test parameters

param = BiasParams(protected='gender', privileged='Male', unprivileged='Female', 
                   positive_outcome_label='>50K', negative_outcome_label='<=50K')

In [10]:
#some transforms

transforms = [Dropna, EncodeLabels] # preprocessing with the help of etiqai 

In [11]:
# SECOND - assigning the dataset
# for now data to be uploaded is in pandas format 

dl = DatasetLoader(data=data, label='income', transforms=transforms, bias_params=param, 
                        train_valid_test_splits=[0.8, 0.1, 0.1], cat_col=cat_vars, cont_col=cont_vars,
                        names_col = data.columns.values)

In [12]:
# MODEL - in this case we'll just use a wrapper we have in our library 
model_xgb = DefaultXGBoostClassifier()


In [13]:
# METRICS - decide which bias metrics to look at. We offer most mainstream bias ones 

metrics_initial = [accuracy]

In [14]:
#INITIAL PIPELINE 
pipeline_initial = DataPipeline(dataset_loader=dl, model=model_xgb, metrics=metrics_initial)
pipeline_initial.run()

INFO:etiq_core.pipeline.DataPipeline0747:Starting pipeline
INFO:etiq_core.pipeline.DataPipeline0747:Fitting model
INFO:etiq_core.pipeline.DataPipeline0747:Computed metrics for the initial dataset
INFO:etiq_core.pipeline.DataPipeline0747:Completed pipeline


# Run the pipeline to help find any leakage-related issues

Ideally you will find this kind of issue during the data exploration phase. However, if the model is already built and you are at the end of the process, you won't want to go through a whole exploration phase again. And because issues can seep in later (e.g. when the model gets productionized), you'll want an easy way to run these checks time and time again, ideally automated as part of an orchestration flow.
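
As a rough idea of what an automated check in an orchestration flow could look like, independently of the etiq pipeline run in the next cell, the sketch below raises an error when a low-cardinality column maps one-to-one onto the target classes. The function names and the `max_levels` cut-off are hypothetical, and the rule is deliberately crude: it only catches relabelled copies of the target.

```python
import pandas as pd


def find_target_copies(df, target, max_levels=20):
    """Return low-cardinality columns whose every value maps to a single target class."""
    offenders = []
    for col in df.columns.drop(target):
        n_levels = df[col].nunique()
        if n_levels < 2 or n_levels > max_levels:
            continue  # constants and ID-like columns are out of scope for this check
        # Count the distinct target classes observed for each value of the column.
        classes_per_value = df.groupby(col)[target].nunique()
        if (classes_per_value == 1).all():
            offenders.append(col)
    return offenders


def assert_no_target_copies(df, target):
    """Raise ValueError so that a scheduled job or CI step fails loudly."""
    offenders = find_target_copies(df, target)
    if offenders:
        raise ValueError(f"Possible target leakage in columns: {offenders}")


# Toy data: 'monthly_salary' is a relabelled copy of 'income'.
toy = pd.DataFrame({
    "age_band": ["young", "old", "young", "old"],
    "monthly_salary": ["<4.2K", ">4.2K", ">4.2K", "<4.2K"],
    "income": ["<=50K", ">50K", ">50K", "<=50K"],
})

try:
    assert_no_target_copies(toy, "income")
except ValueError as err:
    print(err)
```

Raising an exception rather than printing a warning means the surrounding job stops before a leaky model is trained or deployed.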

In [15]:
#DATA LEAKAGE PIPELINE

identify_method = IdentifyFeatureLeakPipeline(nr_groups=1, group_def=['unsupervised'])

leakage_pipeline1 = DebiasPipeline(pipeline_initial, identify_pipeline=identify_method, metrics = metrics_initial)

leakage_pipeline1.run()

INFO:etiq_core.pipeline.DebiasPipeline0469:Starting pipeline
INFO:etiq_core.pipeline.DebiasPipeline0469:Start Phase IdentifyFeatureLeakPipeline0420
INFO:etiq_core.pipeline.IdentifyFeatureLeakPipeline0420:Starting pipeline
INFO:etiq_core.pipeline.IdentifyFeatureLeakPipeline0420:Completed pipeline
INFO:etiq_core.pipeline.DebiasPipeline0469:Completed Phase IdentifyFeatureLeakPipeline0420
INFO:etiq_core.pipeline.DebiasPipeline0469:Fitting model
INFO:etiq_core.pipeline.DebiasPipeline0469:Computed metrics for the initial dataset
ERROR:etiq_core.pipeline.DebiasPipeline0469:Execution failed.
ERROR:etiq_core.pipeline.DebiasPipeline0469:list indices must be integers or slices, not str
Traceback (most recent call last):
  File "/home/hjow/Development/projects/etiq-1.2.3-release/lib/python3.8/site-packages/etiq_core/pipeline.py", line 177, in run
  File "/home/hjow/Development/projects/etiq-1.2.3-release/lib/python3.8/site-packages/etiq_core/pipelines/steps.py", line 1624, in execute
TypeError: li

In [18]:
# check issues

leakage_pipeline1.get_issues_summary()

[(0, 'monthly_salary', 1.0, 'target_leakage_issue', 36176, 'no_issue')]

The pipeline surfaces the issue we deliberately introduced into the data: 'monthly_salary' is flagged with a 'target_leakage_issue'. In this example we of course knew what the issue was to begin with, but in a real-life situation, if you had a feature leaked from the target, you wouldn't know without checking.
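
A natural follow-up is to drop the flagged features before retraining. The sketch below works on the summary exactly as printed above. The meaning of each tuple position is inferred from the printed values (second element a feature name, fourth an issue label) rather than taken from the etiq documentation, so treat the indices as assumptions.

```python
# Copied from the output above; field meanings are inferred, not documented.
issues_summary = [(0, 'monthly_salary', 1.0, 'target_leakage_issue', 36176, 'no_issue')]

# Columns of the dataset, as printed earlier in the notebook.
columns = ['age', 'workclass', 'fnlwgt', 'education', 'educational-num',
           'marital-status', 'occupation', 'relationship', 'race', 'gender',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income', 'monthly_salary']

# Collect feature names (position 1) flagged with a target leakage label (position 3).
leaky_features = sorted({row[1] for row in issues_summary
                         if row[3] == 'target_leakage_issue'})

# Keep every column that was not flagged.
kept_columns = [c for c in columns if c not in leaky_features]

print("Flagged:", leaky_features)
print("Kept:", kept_columns)
```

With a pandas dataframe the same list could be passed to `data.drop(columns=leaky_features)` before rerunning the standard training cells.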