## Import libraries and data

In [1]:
%matplotlib inline

from IPython.display import clear_output


import matplotlib.pyplot as plt 
import numpy as np 
import pandas as pd

import shap

pd.set_option('display.max_columns', None)

from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

from xgboost import XGBClassifier

from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Bias in COMPAS

In this section, we study whether the COMPAS model is biased by comparing its output scores with the actual recidivism rate. In other words, given two individuals with the same features except race, we analyze whether the model systematically assigns a higher score to one race.

COMPAS works by evaluating a range of factors including age, sex, personality traits, measures of social isolation, prior criminal history, family criminality, geography, and employment status. Northpointe gets some of this information from criminal records, and the rest from a questionnaire that asks defendants to respond to queries like, “How many of your friends/acquaintances are taking drugs illegally?” and to agree or disagree with statements like, “A hungry person has a right to steal.”

COMPAS returns a decile score from 1 to 10 indicating the risk of recidivism. To make comparisons easier, we transform the decile score into a binary label indicating High risk (5-10) or Low risk (1-4).

In [2]:
url = 'https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv'
df = pd.read_csv(url)
df['high_risk'] = (df['decile_score'] >= 5).astype(int)

In [3]:
df.head()

Unnamed: 0,id,name,first,last,compas_screening_date,sex,dob,age,age_cat,race,juv_fel_count,decile_score,juv_misd_count,juv_other_count,priors_count,days_b_screening_arrest,c_jail_in,c_jail_out,c_case_number,c_offense_date,c_arrest_date,c_days_from_compas,c_charge_degree,c_charge_desc,is_recid,r_case_number,r_charge_degree,r_days_from_arrest,r_offense_date,r_charge_desc,r_jail_in,r_jail_out,violent_recid,is_violent_recid,vr_case_number,vr_charge_degree,vr_offense_date,vr_charge_desc,type_of_assessment,decile_score.1,score_text,screening_date,v_type_of_assessment,v_decile_score,v_score_text,v_screening_date,in_custody,out_custody,priors_count.1,start,end,event,two_year_recid,high_risk
0,1,miguel hernandez,miguel,hernandez,2013-08-14,Male,1947-04-18,69,Greater than 45,Other,0,1,0,0,0,-1.0,2013-08-13 06:03:42,2013-08-14 05:41:20,13011352CF10A,2013-08-13,,1.0,F,Aggravated Assault w/Firearm,0,,,,,,,,,0,,,,,Risk of Recidivism,1,Low,2013-08-14,Risk of Violence,1,Low,2013-08-14,2014-07-07,2014-07-14,0,0,327,0,0,0
1,3,kevon dixon,kevon,dixon,2013-01-27,Male,1982-01-22,34,25 - 45,African-American,0,3,0,0,0,-1.0,2013-01-26 03:45:27,2013-02-05 05:36:53,13001275CF10A,2013-01-26,,1.0,F,Felony Battery w/Prior Convict,1,13009779CF10A,(F3),,2013-07-05,Felony Battery (Dom Strang),,,,1,13009779CF10A,(F3),2013-07-05,Felony Battery (Dom Strang),Risk of Recidivism,3,Low,2013-01-27,Risk of Violence,1,Low,2013-01-27,2013-01-26,2013-02-05,0,9,159,1,1,0
2,4,ed philo,ed,philo,2013-04-14,Male,1991-05-14,24,Less than 25,African-American,0,4,0,1,4,-1.0,2013-04-13 04:58:34,2013-04-14 07:02:04,13005330CF10A,2013-04-13,,1.0,F,Possession of Cocaine,1,13011511MM10A,(M1),0.0,2013-06-16,Driving Under The Influence,2013-06-16,2013-06-16,,0,,,,,Risk of Recidivism,4,Low,2013-04-14,Risk of Violence,3,Low,2013-04-14,2013-06-16,2013-06-16,4,0,63,0,1,0
3,5,marcu brown,marcu,brown,2013-01-13,Male,1993-01-21,23,Less than 25,African-American,0,8,1,0,1,,,,13000570CF10A,2013-01-12,,1.0,F,Possession of Cannabis,0,,,,,,,,,0,,,,,Risk of Recidivism,8,High,2013-01-13,Risk of Violence,6,Medium,2013-01-13,,,1,0,1174,0,0,1
4,6,bouthy pierrelouis,bouthy,pierrelouis,2013-03-26,Male,1973-01-22,43,25 - 45,Other,0,1,0,0,2,,,,12014130CF10A,,2013-01-09,76.0,F,arrest case no charge,0,,,,,,,,,0,,,,,Risk of Recidivism,1,Low,2013-03-26,Risk of Violence,1,Low,2013-03-26,,,2,0,1102,0,0,0


## Experiments

As we don't have the input features needed to replicate the COMPAS model, we train a classifier to predict the binarized COMPAS label (`score_text`) from `age`, `sex`, `race`, and `priors_count`. We evaluate its predictions against the observed two-year recidivism using different fairness metrics, and study how different data rebalancing methods affect these metrics.


SMOTE/Undersample/Oversample -> Train -> Evaluate different metrics

### Metrics to evaluate:

Castelnovo, A., Crupi, R., Greco, G., & Regoli, D. (2021). The zoo of Fairness metrics in Machine Learning. arXiv preprint arXiv:2106.00467.


(INDEPENDENCE)

- **Demographic parity**: Positive prediction ratio between two races.
- **Demographic parity conditioned on priors?**

(SEPARATION)

- **Predictive equality** -> FPR
- **Equality of opportunity** -> FNR

(SUFFICIENCY)

- **Predictive parity** -> Precision

In [4]:
def eval_fairness(y_pred, y_true, black_mask, white_mask):
    y_pred_black = y_pred[black_mask]
    y_true_black = y_true[black_mask]
    y_pred_white = y_pred[white_mask]
    y_true_white = y_true[white_mask]
    # False Positive Rates FPR = FP / (FP + TN)
    fpr_black = np.sum((y_pred_black == 1) * (y_true_black == 0)) / np.sum(y_true_black == 0)
    fpr_white = np.sum((y_pred_white == 1) * (y_true_white == 0)) / np.sum(y_true_white == 0)
    # True positive rates TPR = TP / (TP + FN)
    tpr_black = np.sum((y_pred_black == 1)*(y_true_black == 1)) / np.sum(y_true_black == 1)
    tpr_white = np.sum((y_pred_white == 1)*(y_true_white == 1)) / np.sum(y_true_white == 1)
    # Precision
    precision_black = precision_score(y_true_black, y_pred_black)
    precision_white = precision_score(y_true_white, y_pred_white)

    data = {}
    data['TPR_w'] = tpr_white
    data['TPR_b'] = tpr_black
    data['FPR_w'] = fpr_white
    data['FPR_b'] = fpr_black
    data['Eq. Oportunity'] = abs(tpr_white-tpr_black)
    data['Pred. Equality'] = abs(fpr_white-fpr_black)
    data['Eq. odds'] = abs(tpr_white-tpr_black) + abs(fpr_white-fpr_black)
    data['Accuracy'] = np.mean(y_pred == y_true)
    
    return data 

### SMOTE/Oversampling/Undersampling

In [5]:
def eval_resampler(df, sampler=None, resample_test=False):

    # Prepare the data
    df_temp = df[(df['race'] == 'African-American') | (df['race'] == 'Caucasian')]
    cols = ['age', 'sex', 'race', 'priors_count', 'score_text']
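    # Copy the slice so the assignment below does not trigger SettingWithCopyWarning
    X, recid = df_temp[cols].copy(), df_temp['two_year_recid']
    # Binarize the COMPAS label: Low -> 0, Medium/High -> 1
    X['score_text'] = (X['score_text'] != 'Low').astype(int)
    # dtype=int keeps the dummies as 0/1 (recent pandas versions default to bool)
    X = pd.get_dummies(X, drop_first=True, dtype=int)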
    X_train, X_test, recid_train, recid_test = train_test_split(X, recid.values, test_size=0.2, random_state=42)

    ##############################
    # RESAMPLE THE TRAINING SET  #
    ##############################

    # Build a target variable combining race and whether the person has recidivated
    #   - '00': Black, Non-recidivist
    #   - '01': Black, Recidivist
    #   - '10': White, Non-recidivist
    #   - '11': White, Recidivist
    if sampler:
        # get the race value
        y_race = X_train['race_Caucasian'].values
        # build the target variable
        y_sampler = np.array([str(a) + str(b) for a, b in zip(y_race, recid_train)])

        print("TRAINING SET:")
        print("Before Sampling: \n\tBlack, Non-recidivist: {}\n\tBlack, Recidivist: {}\
            \n\tWhite, Non-recidivist: {}\n\tWhite, Recidivist: {}".format(np.sum(y_sampler == '00'), \
            np.sum(y_sampler == '01'), np.sum(y_sampler == '10'), np.sum(y_sampler == '11')))

        # Sample the dataset according to the race and the recidivism rates
        X_train, y_sampler = sampler.fit_resample(X_train, y_sampler)

        print("After Sampling: \n\tBlack, Non-recidivist: {}\n\tBlack, Recidivist: {}\
            \n\tWhite, Non-recidivist: {}\n\tWhite, Recidivist: {}".format(np.sum(y_sampler == '00'), \
            np.sum(y_sampler == '01'), np.sum(y_sampler == '10'), np.sum(y_sampler == '11')))

        # Undo the label, i.e. recover the race and the actual recidivism outcome
        race, recid_train = np.array([int(y_i[0]) for y_i in y_sampler]), np.array([int(y_i[1]) for y_i in y_sampler])
        X_train['race_Caucasian'] = race 
        
    X_train, y_train = X_train.drop(columns='score_text'), X_train['score_text']

    ####################################
    # RESAMPLE THE TEST SET (OPTIONAL) #
    ####################################

    if resample_test and sampler:
        # get the race value
        y_race = X_test['race_Caucasian'].values
        # build the target variable
        y_sampler = np.array([str(a) + str(b) for a, b in zip(y_race, recid_test)])

        print("TEST SET:")
        print("Before Sampling: \n\tBlack, Non-recidivist: {}\n\tBlack, Recidivist: {}\
            \n\tWhite, Non-recidivist: {}\n\tWhite, Recidivist: {}".format(np.sum(y_sampler == '00'), \
            np.sum(y_sampler == '01'), np.sum(y_sampler == '10'), np.sum(y_sampler == '11')))

        # Sample the dataset according to the race and the recidivism rates
        X_test, y_sampler = sampler.fit_resample(X_test, y_sampler)

        print("After Sampling: \n\tBlack, Non-recidivist: {}\n\tBlack, Recidivist: {}\
            \n\tWhite, Non-recidivist: {}\n\tWhite, Recidivist: {}".format(np.sum(y_sampler == '00'), \
            np.sum(y_sampler == '01'), np.sum(y_sampler == '10'), np.sum(y_sampler == '11')))

        # Undo the label, i.e. recover the race and the actual recidivism outcome
        race, recid_test = np.array([int(y_i[0]) for y_i in y_sampler]), np.array([int(y_i[1]) for y_i in y_sampler])
        X_test['race_Caucasian'] = race 

    X_test, y_test = X_test.drop(columns='score_text'), X_test['score_text']

    # Train the model

    clf = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
    clf.fit(X_train, y_train)

    # Predict
    y_pred = clf.predict(X_test)

    black_mask = X_test['race_Caucasian'] == 0
    white_mask = X_test['race_Caucasian'] == 1

    # Evaluate fairness metrics
    data = eval_fairness(y_pred, recid_test, black_mask, white_mask)
    return data

In [6]:
data = []
index= []

index.append("Original Training - Original Test")
data.append(eval_resampler(df))
index.append("SMOTE Training - Original Test")
data.append(eval_resampler(df, sampler=SMOTE(random_state=42)))
index.append("SMOTE Training - SMOTE Test")
data.append(eval_resampler(df, sampler=SMOTE(random_state=42), resample_test=True))
index.append("Oversampling Training - Original Test")
data.append(eval_resampler(df, sampler=RandomOverSampler(random_state=42)))
index.append("Oversampling Training - Oversampling Test")
data.append(eval_resampler(df, sampler=RandomOverSampler(random_state=42), resample_test=True))
index.append("Undersampling Training - Original Test")
data.append(eval_resampler(df, sampler=RandomUnderSampler(random_state=42)))
index.append("Undersampling Training - Undersampling Test")
data.append(eval_resampler(df, sampler=RandomUnderSampler(random_state=42), resample_test=True))


clear_output(wait=True)

pd.DataFrame(data, index=index)

| | TPR_w | TPR_b | FPR_w | FPR_b | Eq. Oportunity | Pred. Equality | Eq. odds | Accuracy |
|---|---|---|---|---|---|---|---|---|
| Original Training - Original Test | 0.35567 | 0.713542 | 0.171617 | 0.381089 | 0.357872 | 0.209472 | 0.567343 | 0.658537 |
| SMOTE Training - Original Test | 0.340206 | 0.716146 | 0.178218 | 0.383954 | 0.37594 | 0.205736 | 0.581676 | 0.654472 |
| SMOTE Training - SMOTE Test | 0.294271 | 0.716146 | 0.1875 | 0.390625 | 0.421875 | 0.203125 | 0.625 | 0.608073 |
| Oversampling Training - Original Test | 0.396907 | 0.700521 | 0.214521 | 0.369628 | 0.303614 | 0.155106 | 0.45872 | 0.653659 |
| Oversampling Training - Oversampling Test | 0.372396 | 0.700521 | 0.205729 | 0.380208 | 0.328125 | 0.174479 | 0.502604 | 0.621745 |
| Undersampling Training - Original Test | 0.371134 | 0.721354 | 0.188119 | 0.418338 | 0.35022 | 0.230219 | 0.580439 | 0.64878 |
| Undersampling Training - Undersampling Test | 0.371134 | 0.721649 | 0.226804 | 0.453608 | 0.350515 | 0.226804 | 0.57732 | 0.603093 |


### Training a different classifier for each race

In [7]:
df_temp = df[(df['race'] == 'African-American') | (df['race'] == 'Caucasian')]
cols = ['age', 'sex', 'race', 'priors_count', 'score_text']
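# Copy the slice so the assignment below does not trigger SettingWithCopyWarning
X, recid = df_temp[cols].copy(), df_temp['two_year_recid']
# Binarize the COMPAS label: Low -> 0, Medium/High -> 1
X['score_text'] = (X['score_text'] != 'Low').astype(int)
# dtype=int keeps the dummies as 0/1 (recent pandas versions default to bool)
X = pd.get_dummies(X, drop_first=True, dtype=int)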
X_train, X_test, recid_train, recid_test = train_test_split(X, recid.values, test_size=0.2, random_state=42)

# Train a classifier for each race
X_train_black, recid_train_black = X_train[X_train['race_Caucasian'] == 0], recid_train[X_train['race_Caucasian'] == 0]
X_train_white, recid_train_white = X_train[X_train['race_Caucasian'] == 1], recid_train[X_train['race_Caucasian'] == 1]
# Get score text in order to train
X_train_black, y_train_black = X_train_black.drop(columns='score_text'), X_train_black['score_text']
X_train_white, y_train_white = X_train_white.drop(columns='score_text'), X_train_white['score_text']

clf_black = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
clf_white = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# Fit the models
clf_black.fit(X_train_black, y_train_black)
clf_white.fit(X_train_white, y_train_white)

# Make predictions
X_test_black, recid_test_black = X_test[X_test['race_Caucasian'] == 0], recid_test[X_test['race_Caucasian'] == 0]
X_test_white, recid_test_white = X_test[X_test['race_Caucasian'] == 1], recid_test[X_test['race_Caucasian'] == 1]
# Separate the COMPAS label (score_text) from the test features
X_test_black, y_test_black = X_test_black.drop(columns='score_text'), X_test_black['score_text']
X_test_white, y_test_white = X_test_white.drop(columns='score_text'), X_test_white['score_text']

y_pred_black = clf_black.predict(X_test_black)
y_pred_white = clf_white.predict(X_test_white)
y_pred = np.concatenate((y_pred_black, y_pred_white))
recid_test = np.concatenate((recid_test_black, recid_test_white))
black_mask = np.array([True]*len(y_pred_black) + [False]*len(y_pred_white))
white_mask = np.array([False]*len(y_pred_black) + [True]*len(y_pred_white))

index.append("Split by race")
data.append(eval_fairness(y_pred, recid_test, black_mask, white_mask))




### Removing race attribute

In [8]:
# Train without the race variable
df_temp = df[(df['race'] == 'African-American') | (df['race'] == 'Caucasian')]
cols = ['age', 'sex', 'race', 'priors_count', 'score_text']
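# Copy the slice so the assignment below does not trigger SettingWithCopyWarning
X, recid = df_temp[cols].copy(), df_temp['two_year_recid']
# Binarize the COMPAS label: Low -> 0, Medium/High -> 1
X['score_text'] = (X['score_text'] != 'Low').astype(int)
# dtype=int keeps the dummies as 0/1 (recent pandas versions default to bool)
X = pd.get_dummies(X, drop_first=True, dtype=int)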
X_train, X_test, recid_train, recid_test = train_test_split(X, recid.values, test_size=0.2, random_state=42)

# drop the race
X_train, y_train = X_train.drop(columns=['race_Caucasian', 'score_text']), X_train['score_text']
# Train the model without race
clf = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test.drop(columns=['race_Caucasian', 'score_text']))
black_mask = X_test['race_Caucasian'] == 0
white_mask = X_test['race_Caucasian'] == 1 

index.append("Remove race attribute")
data.append(eval_fairness(y_pred, recid_test, black_mask, white_mask))




In [9]:
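# Use the full option name; the short form 'precision' is ambiguous in recent pandas versions
pd.set_option('display.precision', 3)
display(pd.DataFrame(data, index=index))
pd.reset_option('display.precision')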

| | TPR_w | TPR_b | FPR_w | FPR_b | Eq. Oportunity | Pred. Equality | Eq. odds | Accuracy |
|---|---|---|---|---|---|---|---|---|
| Original Training - Original Test | 0.356 | 0.714 | 0.172 | 0.381 | 0.358 | 0.209 | 0.567 | 0.659 |
| SMOTE Training - Original Test | 0.34 | 0.716 | 0.178 | 0.384 | 0.376 | 0.206 | 0.582 | 0.654 |
| SMOTE Training - SMOTE Test | 0.294 | 0.716 | 0.188 | 0.391 | 0.422 | 0.203 | 0.625 | 0.608 |
| Oversampling Training - Original Test | 0.397 | 0.701 | 0.215 | 0.37 | 0.304 | 0.155 | 0.459 | 0.654 |
| Oversampling Training - Oversampling Test | 0.372 | 0.701 | 0.206 | 0.38 | 0.328 | 0.174 | 0.503 | 0.622 |
| Undersampling Training - Original Test | 0.371 | 0.721 | 0.188 | 0.418 | 0.35 | 0.23 | 0.58 | 0.649 |
| Undersampling Training - Undersampling Test | 0.371 | 0.722 | 0.227 | 0.454 | 0.351 | 0.227 | 0.577 | 0.603 |
| Split by race | 0.371 | 0.703 | 0.198 | 0.372 | 0.332 | 0.174 | 0.506 | 0.654 |
| Remove race attribute | 0.407 | 0.674 | 0.172 | 0.347 | 0.267 | 0.175 | 0.442 | 0.664 |


## Fairness metrics

It is extremely difficult to define the concept of fairness concretely, which is why many metrics are associated with it. We can divide these metrics into two classes, depending on whether they measure **individual** fairness (similar individuals should have similar outcomes) or **group** fairness (the model shouldn't discriminate against certain groups). We will focus on the latter. There are three broad notions of group fairness:

- **Independence**: Decisions should be independent of any sensitive attribute.
- **Separation**: Disparities between groups should be fully justified by the target variable, assuming that the target variable is correctly labeled, i.e. is not biased.
- **Sufficiency**: Similar to separation, but conditioning on the predicted variable instead of the target.
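
As a compact summary, and using the notation of the formulas in the following subsections ($\hat{Y}$ for the prediction, $Y$ for the target, $A$ for the sensitive attribute), these three notions are commonly written as (conditional) independence statements. This is the usual textbook formulation rather than something specific to this notebook:

$$\text{Independence: } \hat{Y} \perp A, \qquad \text{Separation: } \hat{Y} \perp A \mid Y, \qquad \text{Sufficiency: } Y \perp A \mid \hat{Y}$$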

### Independence

Decisions should be independent from any protected attributes (demographic parity):

$$P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b), \quad \forall a, b \in \mathcal{A}$$

There are situations where independence doesn't directly relate to fairness. For example, women may have a higher admission rate into a degree than men because they have better grades on average. To achieve independence, or demographic parity, the model **should favor the group** with lower admission rates, i.e. **treat groups differently**, and this might not be fair. For this reason, we have to be very careful when enforcing demographic parity, and doing so has to be justified, for example because we can't trust that the target variable is correctly labelled, or because there is historical bias in the data.

Another criterion for independence, named **conditional demographic parity**, could be more suitable in most situations. In the previous example, conditional demographic parity is achieved if the predicted variable (admission) is independent of the protected attribute (sex) for individuals with the same grades.
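
The difference between the two criteria can be illustrated with a minimal sketch. The helper names (`demographic_parity_gap`, `conditional_demographic_parity_gaps`) and the toy admission data are hypothetical and only serve as an illustration; they are not part of the experiments above.

```python
import numpy as np
import pandas as pd


def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between two groups."""
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    labels = np.unique(group)
    assert len(labels) == 2, "this sketch only handles two groups"
    rate_first = y_pred[group == labels[0]].mean()
    rate_second = y_pred[group == labels[1]].mean()
    return abs(rate_first - rate_second)


def conditional_demographic_parity_gaps(y_pred, group, condition):
    """Demographic parity gap inside each stratum of the conditioning variable."""
    frame = pd.DataFrame({"y_pred": y_pred, "group": group, "condition": condition})
    gaps = {}
    for level, stratum in frame.groupby("condition"):
        # Skip strata where only one of the groups is present
        if stratum["group"].nunique() == 2:
            gaps[level] = demographic_parity_gap(stratum["y_pred"], stratum["group"])
    return gaps


# Hypothetical toy data: admission decisions, sex, and a coarse grade band
admitted = [1, 1, 0, 0, 1, 0, 0, 0]
sex = ["F", "F", "F", "F", "M", "M", "M", "M"]
grades = ["high", "high", "low", "low", "high", "low", "low", "low"]

overall_gap = demographic_parity_gap(admitted, sex)
stratified_gaps = conditional_demographic_parity_gaps(admitted, sex, grades)

print("Demographic parity gap:", overall_gap)
print("Gap per grade band:", stratified_gaps)
```

In this toy data the overall admission rates differ between the groups, yet within each grade band they are identical, so demographic parity is violated while conditional demographic parity holds.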

### Separation

Disparities between groups should be completely justified by the target variable:

$$P(\hat{Y}=1 \mid A = a, Y=y) = P(\hat{Y} = 1 \mid A = b, Y=y),\quad \forall a,b\in \mathcal{A}, y\in\{0, 1\}$$

This can be a good fairness metric **if the target variable is free of any bias**.





There are two relaxed versions of separation:

- **Predictive equality**: Same false positive rates across groups. $FPR = \dfrac{FP}{FP+TN}$
- **Equality of opportunity**: Same false negative rates across groups (equivalently, same true positive rates, since $TPR = 1 - FNR$). $FNR = \dfrac{FN}{FN + TP}$

Depending on the problem, one may prioritize one or the other.

- Predictive equality may be prioritized when false positives are the costlier error, e.g. when the risk of innocent people being arrested should be the same across groups.
- Equality of opportunity may be prioritized when accepting people into a degree, i.e. qualified candidates from both groups should have equal opportunities.

In conclusion, separation might be a good fairness metric if the data can be trusted, specifically the target variables.
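
The two relaxed criteria can be sketched on a small example. The function `group_error_rates` and the toy labels below are hypothetical and chosen only so that the rates are easy to verify by hand.

```python
import numpy as np


def group_error_rates(y_true, y_pred):
    """Return (FPR, FNR) for binary labels and binary predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tp = np.sum((y_pred == 1) & (y_true == 1))
    return fp / (fp + tn), fn / (fn + tp)


# Hypothetical toy data for two groups, a and b
y_true_a = [0, 0, 0, 0, 1, 1]
y_pred_a = [1, 0, 0, 0, 1, 0]
y_true_b = [0, 0, 1, 1, 1, 1]
y_pred_b = [1, 0, 1, 1, 1, 0]

fpr_a, fnr_a = group_error_rates(y_true_a, y_pred_a)
fpr_b, fnr_b = group_error_rates(y_true_b, y_pred_b)

# Predictive equality compares FPRs; equality of opportunity compares FNRs
predictive_equality_gap = abs(fpr_a - fpr_b)
equal_opportunity_gap = abs(fnr_a - fnr_b)

print("FPR a/b:", fpr_a, fpr_b, "gap:", predictive_equality_gap)
print("FNR a/b:", fnr_a, fnr_b, "gap:", equal_opportunity_gap)
```

Full separation requires both gaps to be zero; each relaxed version only constrains one of them.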

### Sufficiency

When the model has the same precision across sensitive groups, it has achieved predictive parity:

$$P(Y=1 \mid A=a, \hat{Y} = 1) = P(Y=1 \mid A=b, \hat{Y} = 1),\quad \forall a,b\in \mathcal{A}$$

If we require this condition to also hold for negative predictions, i.e. when conditioning on $\hat{Y}=0$ instead of $\hat{Y}=1$, then the sufficiency criterion is satisfied.

The main difference between sufficiency and separation is in the point of view: while in separation, the observations are grouped according to the target variable, in sufficiency they are grouped according to the predicted value.
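
A minimal sketch of this grouping by predicted value is given below. The helper `outcome_rate_given_prediction` and the toy data are hypothetical; the example is built so that predictive parity holds while full sufficiency does not.

```python
import numpy as np


def outcome_rate_given_prediction(y_true, y_pred, predicted_value):
    """Estimate P(Y = 1 | Y_hat = predicted_value) from binary arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return y_true[y_pred == predicted_value].mean()


# Hypothetical toy data for two groups, a and b
y_pred_a = [1, 1, 1, 1, 0, 0, 0, 0]
y_true_a = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_b = [1, 1, 1, 1, 0, 0, 0, 0]
y_true_b = [1, 1, 1, 0, 0, 0, 0, 0]

# Precision: observations grouped by a positive prediction
precision_a = outcome_rate_given_prediction(y_true_a, y_pred_a, 1)
precision_b = outcome_rate_given_prediction(y_true_b, y_pred_b, 1)

# Same quantity for observations grouped by a negative prediction
missed_a = outcome_rate_given_prediction(y_true_a, y_pred_a, 0)
missed_b = outcome_rate_given_prediction(y_true_b, y_pred_b, 0)

predictive_parity_gap = abs(precision_a - precision_b)
negative_prediction_gap = abs(missed_a - missed_b)

print("Predictive parity gap:", predictive_parity_gap)
print("Gap among negative predictions:", negative_prediction_gap)
```

Here both groups have the same precision, so predictive parity is satisfied, but the share of actual positives among negative predictions differs, so sufficiency is not.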

## Incompatibilities between metrics

- If the target variable is binary and there is a group imbalance, then **independence** and **separation** are incompatible.
- If there is a group imbalance, **sufficiency** and **independence** are also incompatible.
- **Separation** and **sufficiency** cannot both hold either when there is an imbalance between sensitive groups.

Chouldechova (2017) showed that, if the true recidivism rate differs between Black and white defendants, then predictive parity and equalized odds cannot both hold. One must therefore reflect carefully on which of the two (or, more generally, of the many) notions is more important to pursue in the specific case at hand.
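
This incompatibility follows from an identity that links the confusion-matrix rates of any binary classifier, where $p$ is the base rate (prevalence) and $PPV$ is the precision:

$$FPR = \frac{p}{1-p}\cdot\frac{1-PPV}{PPV}\cdot(1-FNR)$$

The sketch below evaluates this identity for two groups that share the same precision and false negative rate but have different base rates. The function name `implied_fpr` and all numbers are hypothetical illustrations, not values estimated from the COMPAS data.

```python
def implied_fpr(prevalence, ppv, fnr):
    """FPR implied by the base rate, the precision (PPV), and the FNR."""
    return prevalence / (1 - prevalence) * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical values: same precision and FNR, different base rates
shared_ppv = 0.6
shared_fnr = 0.3

fpr_low_base_rate = implied_fpr(prevalence=0.4, ppv=shared_ppv, fnr=shared_fnr)
fpr_high_base_rate = implied_fpr(prevalence=0.5, ppv=shared_ppv, fnr=shared_fnr)

print("FPR with base rate 0.4:", round(fpr_low_base_rate, 3))
print("FPR with base rate 0.5:", round(fpr_high_base_rate, 3))
```

With equal precision and equal false negative rates, the group with the higher base rate necessarily ends up with the higher false positive rate, so the false positive rates can only be equalized by giving up one of the other two equalities.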