# Comparing different fairness measures

#### This notebook is a companion to the blog post `Put in title and link`

In it we will compare two popular fairness measures, the Disparate Impact Ratio and the Equal Opportunity Difference, to a new measure of fairness called the Burden Index.

We will use CognitiveScale's Cortex Certifai toolkit to calculate the Disparate Impact Ratio, the Equal Opportunity Difference, and the Burden Index (to install it, visit https://www.cognitivescale.com/certifai/ and sign up).

We will train a logistic regression model on the Adult Census data for the task of predicting which individuals in the dataset will make more than $50,000 per year.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import numpy as np
import random
from sklearn.linear_model import LogisticRegression
from cat_encoder import CatEncoder
from sklearn.model_selection import GridSearchCV
from sklearn import svm

# import Certifai functions
from certifai.scanner.builder import (CertifaiScanBuilder, CertifaiPredictorWrapper, CertifaiModel, CertifaiModelMetric,
                                      CertifaiDataset, CertifaiGroupingFeature, CertifaiDatasetSource,
                                      CertifaiPredictionTask, CertifaiTaskOutcomes, CertifaiOutcomeValue)
from certifai.scanner.report_utils import scores, construct_scores_dataframe

np.random.seed(0)

In [2]:
# Example will use a simple logistic classifier on the Adult Census dataset
all_data_file = "adult.data"

column_names = ['age', 'workclass', 'fnlwgt', 'education',
            'education-num', 'marital-status', 'occupation', 'relationship',
            'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
            'native-country', 'income-per-year']

label_column='income-per-year'

features_to_drop = ['fnlwgt','relationship','education-num'] # we'll drop these features, 
        # "fnlwgt" was assigned after the data was gathered, so we'll drop it
        # "relationship" is the relationship of the person who answered the questions to the person who the answers were about
        # "education-num" is somewhat redundant with "education" (you could use education instead)
 
df = pd.read_csv(all_data_file, header=None, names = column_names)

df = df.drop(features_to_drop, axis=1)
df[label_column] = df[label_column].str.contains(">50K").astype(int)

cat_columns = ['workclass', 'education',
 'marital-status', 'occupation',
 'native-country','race','sex']

num_columns = [f for f in df.columns if (f not in cat_columns) and (f != label_column)]

# Separate outcome
y = df[label_column]
X = df.drop(label_column, axis=1)
X.head()

| | age | workclass | education | marital-status | occupation | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Never-married | Adm-clerical | White | Male | 2174 | 0 | 40 | United-States |
| 1 | 50 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | White | Male | 0 | 0 | 13 | United-States |
| 2 | 38 | Private | HS-grad | Divorced | Handlers-cleaners | White | Male | 0 | 0 | 40 | United-States |
| 3 | 53 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Black | Male | 0 | 0 | 40 | United-States |
| 4 | 28 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Black | Female | 0 | 0 | 40 | Cuba |


In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

encoder = CatEncoder(cat_columns, X) # this one-hot encodes the categorical features and standardizes the numerical features

def build_model(data, name, model_family, test=None):
    if test is None:
        test = data    
    if model_family == 'SVM':
        parameters = {'kernel':('linear', 'rbf', 'poly'), 'C':[0.1, .5, 1, 2, 4, 10], 'gamma':['auto']}
        m = svm.SVC()
    elif model_family == 'logistic':
        parameters = {'C': (0.5, 1.0, 2.0), 'solver': ['liblinear'], 'max_iter': [1000]}
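        m = LogisticRegression(random_state=4)
    else:
        # Fail early instead of hitting a NameError on `parameters`/`m` below
        raise ValueError(f"Unknown model_family: {model_family!r}")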
    model = GridSearchCV(m, parameters, cv=3)
    model.fit(data[0], data[1])

    # Assess on the test data
    accuracy = model.score(test[0], test[1].values)
    print(f"Model '{name}' accuracy is {accuracy}")
    return model

logistic_model = build_model((encoder(X_train.values), y_train),
                        'Logistic classifier',
                        'logistic',
                        test=(encoder(X_test.values), y_test))

Model 'Logistic classifier' accuracy is 0.8565945033010901


### Next, we'll use the Cortex Certifai toolkit to calculate the Burden Index, Disparate Impact, and Equal Opportunity measures

Note: In Certifai, Disparate Impact is referred to as Demographic Parity.

In [6]:
# Wrap the model up for use by Certifai as a local model
model_proxy = CertifaiPredictorWrapper(logistic_model, encoder=encoder)

In [7]:
# First define the possible prediction outcomes
task = CertifaiPredictionTask(CertifaiTaskOutcomes.classification(
    [
        CertifaiOutcomeValue(1, name='earned more than 50k per year', favorable=True),
        CertifaiOutcomeValue(0, name='earned less than 50k per year')
    ]),
    prediction_description='Did person earn more than 50k per year')

scan = CertifaiScanBuilder.create('test_user_case',
                                  prediction_task=task)

# Add our local model
first_model = CertifaiModel('full',
                            local_predictor=model_proxy)
scan.add_model(first_model)

# Add the eval dataset
eval_dataset = CertifaiDataset('evaluation',
                               CertifaiDatasetSource.dataframe(df))
scan.add_dataset(eval_dataset)

# Setup an evaluation for fairness on the above dataset using the model
# We'll look at disparity across the protected feature 'sex'
scan.add_fairness_grouping_feature(CertifaiGroupingFeature('sex'))
# scan.add_fairness_grouping_feature(CertifaiGroupingFeature('race')) # you can take a look at the burden for race too
scan.add_evaluation_type('fairness')
scan.evaluation_dataset_id = 'evaluation'

# We can calculate other measures of fairness as well
scan.add_fairness_metric('demographic parity')
scan.add_fairness_metric('equal opportunity')

# Because the dataset contains a ground truth outcome column which the model does not
# expect to receive as input we need to state that in the dataset schema (since it cannot
# be inferred from the CSV)
scan.dataset_schema.outcome_feature_name = label_column

# Run the scan.
# By default this will write the results into individual report files (one per model and evaluation
# type) in the 'reports' directory relative to the Jupyter root.  This may be disabled by specifying
# `write_reports=False` as below
# The result is a dictionary of dictionaries of reports.  The top level dict key is the evaluation type
# and the second level key is model id.
# Reports saved as JSON (which `write_reports=True` will do) may be visualized in the console app
result = scan.run(write_reports=False)


Starting scan with model_use_case_id: 'test_user_case' and scan_id: 'bf110bb6e020'
[--------------------] 2020-06-04 15:30:30.683216 - 0 of 1 reports (0.0% complete) - Running fairness evaluation for model: full
[####################] 2020-06-04 15:39:14.879823 - 1 of 1 reports (100.0% complete) - Completed all evaluations


## Let's take a look at the fairness metrics

In [8]:
score_df = construct_scores_dataframe(scores('fairness', result), include_confidence=False)
display(score_df)

| | context | type | overall fairness | Feature (sex) | Group details ( Female) | Group details ( Male) |
|---|---|---|---|---|---|---|
| full (burden) | full | burden | 74.917046 | 74.917046 | 0.219938 | 0.131663 |
| full (demographic parity) | full | demographic parity | 80.186787 | 80.186787 | 0.068924 | 0.253372 |
| full (equal opportunity) | full | equal opportunity | 67.537411 | 67.537411 | 0.408696 | 0.609375 |


In this table, the `type` column indicates which fairness measure is listed in that row. 

Where `type == burden`, the `overall fairness` cell shows the Burden Index, and each `Group details (<attribute value>)` cell is the average distance, for that group within the protected attribute, between the original observations and their counterfactuals that lie in the favorable class. If someone already receives the favorable outcome, the distance between their original feature values and the counterfactual in the favorable class is trivially 0. Operationally, this means each person in a group who received the favorable prediction adds 1 to the denominator of that group's average and nothing to the numerator.
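The per-group averaging described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `group_burden`, the toy distances, and the toy predictions are all hypothetical, and Certifai computes the counterfactual distances itself in a way this sketch does not reproduce.

```python
import numpy as np
import pandas as pd

def group_burden(distances, favorable, groups):
    """Average counterfactual distance per group.

    Rows that already receive the favorable prediction count as distance 0,
    so they enlarge the denominator without adding to the numerator.
    """
    frame = pd.DataFrame({
        "distance": np.where(favorable, 0.0, distances),
        "group": groups,
    })
    return frame.groupby("group")["distance"].mean()

# Hypothetical toy values, not taken from the Adult Census scan
distances = np.array([0.4, 0.2, 0.9, 0.3, 0.1, 0.5])
favorable = np.array([False, True, False, False, True, True])
groups = np.array(["Female", "Female", "Female", "Male", "Male", "Male"])

burden = group_burden(distances, favorable, groups)
print(burden)
```

In this toy data the group with more unfavorable predictions and larger distances ends up with the higher average, which is the pattern the `Group details` cells show for the real model.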

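The `overall fairness` values for `demographic parity` and `equal opportunity` are normalized versions of these measures, which is useful for comparing numbers on a similar scale. Below we instead calculate the Disparate Impact Ratio and the Equal Opportunity Difference, which are the quantities we refer to in the blog. The code below reads each group's favorable prediction rate from the `demographic parity` row and each group's true positive rate (TPR) from the `equal opportunity` row.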

In [17]:
Burden_Index = score_df.loc[score_df.type == 'burden','overall fairness'].values[0]
DI_unprivileged = score_df.loc[score_df.type == 'demographic parity','Group details ( Female)'].values[0]
DI_privileged = score_df.loc[score_df.type == 'demographic parity','Group details ( Male)'].values[0]
DI_ratio = DI_unprivileged / DI_privileged
TPR_unprivileged = score_df.loc[score_df.type == 'equal opportunity','Group details ( Female)'].values[0]
TPR_privileged = score_df.loc[score_df.type == 'equal opportunity','Group details ( Male)'].values[0]
EO_difference = TPR_unprivileged - TPR_privileged

print('Disparate Impact Ratio: ', np.round(DI_ratio,4))

print("Privileged TPR:", np.round(TPR_privileged,4))
print("Unprivileged TPR:", np.round(TPR_unprivileged,4))
print('Equal Opportunity Difference (Unprivileged TPR - Privileged TPR):', np.round(EO_difference,4))

print('Burden Index: ', np.round(Burden_Index,4))

Disparate Impact Ratio:  0.272
Privileged TPR: 0.6094
Unprivileged TPR: 0.4087
Equal Opportunity Difference (Unprivileged TPR - Privileged TPR): -0.2007
Burden Index:  74.917


A Disparate Impact Ratio below 0.8 is generally considered unfair (although this threshold is somewhat arbitrary), so at 0.272 this model would be considered unfair by that measure. However, Disparate Impact does not take the ground truth of the data into account, which can be problematic.
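For readers who want to see the ratio computed directly from predictions rather than read from the Certifai report, here is a minimal sketch. The function `disparate_impact_ratio` and the toy arrays are hypothetical, and the sketch assumes the ratio is the unprivileged group's favorable prediction rate divided by the privileged group's, as in the calculation above.

```python
import numpy as np

def disparate_impact_ratio(y_pred, group, unprivileged, privileged, favorable=1):
    """Favorable-prediction rate of the unprivileged group divided by
    the favorable-prediction rate of the privileged group."""
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    rate_unprivileged = np.mean(y_pred[group == unprivileged] == favorable)
    rate_privileged = np.mean(y_pred[group == privileged] == favorable)
    return rate_unprivileged / rate_privileged

# Hypothetical toy predictions: 1 of 4 women and 3 of 4 men get the favorable prediction
toy_pred = [1, 0, 0, 0, 1, 1, 0, 1]
toy_group = ["Female"] * 4 + ["Male"] * 4

toy_ratio = disparate_impact_ratio(toy_pred, toy_group, "Female", "Male")
print("Toy Disparate Impact Ratio:", round(toy_ratio, 4))
print("Below the 0.8 rule of thumb:", toy_ratio < 0.8)
```

Note that the ground-truth labels never appear in this calculation, which is exactly the limitation mentioned above.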

The Equal Opportunity Difference should be 0 for a model to be considered completely fair. A negative value means the model correctly predicts the favorable outcome for members of the privileged group who actually received it more often than it does for the corresponding members of the unprivileged group. There isn't a hard threshold at which the difference becomes unfair, but a discrepancy of 0.2 between the true positive rates seems to indicate that the model is not being fair to the women who, according to the ground truth, should receive the favorable outcome.
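The same quantity can be computed by hand from labels and predictions. The sketch below is illustrative only: `true_positive_rate`, `equal_opportunity_difference`, and the toy arrays are hypothetical names and data, and the difference is taken as unprivileged TPR minus privileged TPR to match the sign convention used above.

```python
import numpy as np

def true_positive_rate(y_true, y_pred, mask, favorable=1):
    """Share of truly favorable rows inside `mask` that the model also predicts as favorable."""
    positives = mask & (y_true == favorable)
    return np.mean(y_pred[positives] == favorable)

def equal_opportunity_difference(y_true, y_pred, group, unprivileged, privileged):
    """Unprivileged TPR minus privileged TPR (0 means equal true positive rates)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tpr_unprivileged = true_positive_rate(y_true, y_pred, group == unprivileged)
    tpr_privileged = true_positive_rate(y_true, y_pred, group == privileged)
    return tpr_unprivileged - tpr_privileged

# Hypothetical toy data: the model finds 1 of 3 high-earning women and 2 of 2 high-earning men
toy_true = [1, 1, 1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 0, 1, 1, 1, 0]
toy_group = ["Female"] * 4 + ["Male"] * 4

toy_eod = equal_opportunity_difference(toy_true, toy_pred, toy_group, "Female", "Male")
print("Toy Equal Opportunity Difference:", round(toy_eod, 4))
```

Unlike the Disparate Impact Ratio, this calculation needs the ground-truth labels, so it is only available where those labels exist.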

Here the Burden Index is 74.917. The Burden Index, which takes values between 0 and 100, is a Gini-like index that measures the disparity between the `Group details` values of the different groups. A low value means that the distance between the observations and their counterfactuals is much higher for at least one group than for the others, so it is much harder for members of that group to change their features enough to gain the favorable prediction. As with the Equal Opportunity Difference, there isn't a hard threshold below which a model would be considered unfair, but model developers can use the Burden Index to compare models without needing to know the ground truth. This is useful in production settings where ground truth does not exist, and unlike Disparate Impact, the Burden Index takes into account some notion of "worthiness" given the model.
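To give a feel for what a "Gini-like" index on group burdens does, here is a generic sketch. It is not Certifai's formula, which this notebook does not spell out: it assumes a plain Gini coefficient over the per-group values, rescaled so that 100 means all groups are equal and 0 means one group carries all of the burden. The function names are hypothetical, and the result should not be expected to reproduce the 74.917 reported by the scan.

```python
import numpy as np

def gini(values):
    """Plain Gini coefficient of non-negative values (assumes at least one is positive)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    pairwise_differences = np.abs(x[:, None] - x[None, :]).sum()
    return pairwise_differences / (2 * n * n * x.mean())

def gini_like_index(group_values):
    """Hypothetical 0-100 index: 100 when all groups are equal, 0 at maximal disparity."""
    n = len(group_values)
    max_gini = (n - 1) / n  # reached when a single group carries all of the burden
    return 100 * (1 - gini(group_values) / max_gini)

equal_groups = gini_like_index([0.2, 0.2])      # identical burdens
one_sided = gini_like_index([0.3, 0.0])         # only one group carries any burden
print("Equal burdens:", equal_groups)
print("One-sided burden:", one_sided)
```

The point of the sketch is the direction of the scale: the more unevenly the burden is spread across groups, the lower the index.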

While none of the measures gives us a definitive answer as to whether the model is fair (because that depends on your definition of fairness), they all say that the model favors the privileged group over the unprivileged group. Looking at this, I would say we as model developers should figure out how to do a better job of predicting which women will make more than $50k per year (and we as humans should figure out how to dismantle and reconfigure power structures so all groups have a fairer shot at achieving the outcomes they want).

So, which metric should we focus on? Typically there is no right answer: they all tell us something about our model's fairness. Disparate Impact operationalizes the idea that all groups should be equally likely to receive a favorable prediction from the model, but it can be unfair at the individual level because it does not guarantee that a "better" individual will get the favorable prediction. Equal Opportunity addresses this issue by looking at the people who are "good enough" in the eyes of the world (defined as those who had the positive outcome in the ground truth) and requiring that those people be equally likely to receive the favorable prediction from the model. However, Equal Opportunity has no concept of how hard it would be for certain groups to change their prediction from unfavorable to favorable, which is where the Burden Index comes in. A model could have an Equal Opportunity Difference of 0 and yet 1) it could be nearly impossible for one group to change their features enough to gain the favorable prediction, and/or 2) very few people in one group could have the favorable outcome in the ground truth, even though the model predicts this small group correctly, which doesn't seem great either. The Burden Index will illuminate both of these points and give a more comprehensive view of the fairness of the model.