# Tutorial of Fairness Metrics for Healthcare

This tutorial introduces methods and libraries for measuring fairness and bias in machine learning models as they relate to problems in healthcare. After providing some background, it will generate a simple baseline model predicting Length of Stay (LOS) using data from the [MIMIC-III database](https://mimic.physionet.org/gettingstarted/access/). It will then use variations of that model to demonstrate common measures of "fairness" using [AIF360](http://aif360.mybluemix.net/), a prominent library for this purpose, before comparing AIF360 to another prominent library, [FairLearn](https://fairlearn.github.io/).


### Tutorial Contents
[Part 0:] Background

[Part 1:](#part1) Model Setup

[Part 2:](#part2) Testing Gender as a Sensitive Attribute

[Part 3:](#part3) Comparing Against a Second Model - Evaluating Unawareness

[Part 4:](#part4) Testing Other Sensitive Attributes

[Part 5:](#part5) Comparison to FairLearn

### Requirements
This tutorial assumes basic knowledge of machine learning implementation in Python. Before starting, please install [AIF360](http://aif360.mybluemix.net/) and [FairLearn](https://fairlearn.github.io/). Also, ensure that you have installed the pandas, NumPy, scikit-learn, and XGBoost libraries.

## Part 0: Background 
SECTIONS TO BE INCLUDED:
* what is fairness
* metrics for fairness
* list of measures that will be included in this notebook

## Part 1: Model Setup <a class="anchor" id="part1"></a>

This section introduces and loads the data subset that will be used in this tutorial. Then it generates a simple baseline model to be used throughout the tutorial.

In [28]:
from IPython.display import Image
import numpy as np
import os
import pandas as pd
import sys

# Jupyter Add-Ons from local folder
import tutorial_helpers

# Prediction Libs
from sklearn.metrics import *
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, XGBRegressor

# Metrics
from aif360.sklearn.metrics import *
from fairlearn.metrics import (
    selection_rate, demographic_parity_difference, demographic_parity_ratio,
    balanced_accuracy_score_group_summary, roc_auc_score_group_summary,
    equalized_odds_difference, difference_from_summary)
from sklearn.metrics import balanced_accuracy_score, roc_auc_score


### MIMIC III
This tutorial uses data from the MIMIC III Critical Care database, a freely accessible source of Electronic Health Records from Beth Israel Deaconess Medical Center in Boston, years 2001 through 2012. To download the MIMIC III data, please use this link: [Access to MIMIC III](https://mimic.physionet.org/gettingstarted/access/). Please save the data in a folder with the default name ("MIMIC").

The raw MIMIC download contains only a folder of zipped files. The tutorial code will automatically unzip and format the data needed for this experiment, saving the formatted data in the current folder. To enable this feature, enter the path of the MIMIC folder in the following cell. Your path should end with the directory "MIMIC".

Example: path_to_mimic_data_folder = "~/data/MIMIC"

In [5]:
# path_to_mimic_data_folder = "[path to your downloaded data folder]"
path_to_mimic_data_folder = "~/data/MIMIC"

### Data Subset
The following models use data from all years of the MIMIC-III dataset for patients aged 65 and older. Features include diagnosis and procedure codes categorized through the Clinical Classifications Software system ([HCUP](#hcup)). 

Data are imported at the encounter level, with patient identification dropped. All features are one-hot encoded and prefixed with their variable type (e.g. "GENDER_", "ETHNICITY_"). 

In [6]:
df = tutorial_helpers.load_example_data(path_to_mimic_data_folder) # note: ADMIT_ID has been masked
df.head()

Unnamed: 0,ADMIT_ID,AGE,length_of_stay,GENDER_M,ETHNICITY_AMERICAN INDIAN/ALASKA NATIVE,ETHNICITY_AMERICAN INDIAN/ALASKA NATIVE FEDERALLY RECOGNIZED TRIBE,ETHNICITY_ASIAN,ETHNICITY_ASIAN - ASIAN INDIAN,ETHNICITY_ASIAN - CAMBODIAN,ETHNICITY_ASIAN - CHINESE,...,PROCEDURE_CCS_221,PROCEDURE_CCS_222,PROCEDURE_CCS_223,PROCEDURE_CCS_224,PROCEDURE_CCS_225,PROCEDURE_CCS_226,PROCEDURE_CCS_227,PROCEDURE_CCS_228,PROCEDURE_CCS_229,PROCEDURE_CCS_231
0,1019779,65.0,1.144444,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1006687,70.0,5.496528,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,978785,75.0,6.768056,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1052125,70.0,6.988889,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
7,1017033,75.0,5.364583,1,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0


### Baseline Length of Stay Model
All models in this tutorial predict the length of time spent in the ICU, a.k.a. the "Length of Stay" (LOS). The baseline model will use only the patient's age, their diagnosis, and the use of medical procedures during their stay to predict this value.

Two target variables will be used for the following experiments: 'length_of_stay' and 'los_binary'. For this dataset, length_of_stay is the true length of the patient's stay in days. The los_binary variable is a binary variable indicating whether the admission's length of stay was at or below the mean (0) or above the mean (1).

In [7]:
mean_val=df['length_of_stay'].mean()
df['los_binary'] = df['length_of_stay'].apply(lambda x: 0 if x<=mean_val else 1)
df[['length_of_stay', 'los_binary']].describe().round(4)

Unnamed: 0,length_of_stay,los_binary
count,22434.0,22434.0
mean,9.1152,0.388
std,6.2087,0.4873
min,0.0042,0.0
25%,4.7352,0.0
50%,7.5799,0.0
75%,12.0177,1.0
max,29.9889,1.0


In [8]:
# Subset and split data for the first model
X = df.loc[:,['ADMIT_ID']+[c for c in df.columns if (c.startswith('AGE') or c.startswith('DIAGNOSIS_') or c.startswith('PROCEDURE_'))]]
y = df.loc[:, ['los_binary']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fit the baseline model
baseline_model = XGBClassifier()
baseline_model.fit(X_train, y_train)
baseline_y_pred = baseline_model.predict(X_test)

# Score the hard 0/1 predictions against the held-out labels
print('\n', "Baseline ROC_AUC Score:", roc_auc_score(y_test, baseline_y_pred) )


 Baseline ROC_AUC Score: 0.7236032721546329
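
One detail of the score above may be worth a sketch: `roc_auc_score` is given the hard 0/1 output of `predict` rather than probabilities. With only two distinct score values, the ROC curve has a single interior point, and the area under it reduces to the mean of the true positive and true negative rates, i.e. balanced accuracy. The following is a minimal NumPy illustration on made-up labels; `auc_pairwise` and `balanced_accuracy` are hypothetical helper names written for this example, not functions from scikit-learn.

```python
import numpy as np

def auc_pairwise(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count half)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive/negative pair
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == 0] == 0)
    return float((tpr + tnr) / 2)

# Made-up labels and probabilities (assumption: not MIMIC-III data)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.6, 0.3, 0.4, 0.7, 0.8, 0.9])
y_hard = (y_prob >= 0.5).astype(int)  # hard labels at a 0.5 cutoff

auc_hard = auc_pairwise(y_true, y_hard)  # AUC computed from hard labels
auc_prob = auc_pairwise(y_true, y_prob)  # AUC computed from probabilities
bal_acc = balanced_accuracy(y_true, y_hard)
print(auc_hard, bal_acc, auc_prob)
```

On this toy data the AUC of the hard labels coincides with balanced accuracy, while the AUC of the probabilities is higher. The same pattern appears in this tutorial: the FairLearn cell in Part 5 scores `predict_proba` output and reports a higher overall AUC than the values printed from hard predictions.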


## Part 2: Testing Gender as a Sensitive Attribute <a class="anchor" id="part2"></a>
Our first experiment will test the effect of including the sensitive attribute 'GENDER_M'. This attribute is encoded in our data as a boolean attribute, where 0=female and 1=male, since males are assumed to be the privileged group. For the purposes of this experiment all other sensitive attributes and potential proxies will be dropped, such that only age, gender, diagnosis, and procedure codes will be used to make the prediction.

First we will examine fairness measurements for a version of this model that includes gender as a feature, before comparing them to similar measurements for the baseline (without gender). We will see that while some measures can be used to analyze a model in isolation, others require comparison against other models.

In [9]:
df.groupby('GENDER_M')['length_of_stay'].describe().round(4)

GENDER_M,count,mean,std,min,25%,50%,75%,max
0,9987.0,9.1373,6.1984,0.0042,4.7507,7.6722,12.0667,29.9889
1,12447.0,9.0975,6.2171,0.0049,4.7149,7.4597,11.9729,29.966


In [10]:
# Generate a model that includes gender as a feature
X_train_gender = X_train.join(df[['GENDER_M']], how='inner')
X_test_gender = X_test.join(df[['GENDER_M']], how='inner')

model = XGBClassifier()
model.fit(X_train_gender, y_train)
y_pred_gender = model.predict(X_test_gender)

# Score the hard 0/1 predictions against the held-out labels
print('\n', "ROC_AUC Score with Gender Included:", roc_auc_score(y_test, y_pred_gender) )


 ROC_AUC Score with Gender Included: 0.7236032721546329


### Measuring Fairness via AIF360

AIF360 requires the sensitive attribute to be in the same dataframe (or 2-D array) as the target variable (both the ground truth and the prediction), so we add that here.


In [11]:
y_test_aif = pd.concat([X_test_gender['GENDER_M'], y_test], axis=1).set_index('GENDER_M')
y_pred_aif = pd.concat([X_test_gender['GENDER_M'].reset_index(drop=True), pd.Series(y_pred_gender)], axis=1).set_index('GENDER_M')
y_pred_aif.columns = y_test_aif.columns

### Prediction Rates
The base rate is the average value of the ground truth (optionally weighted). It provides useful context, although it is not technically a measure of fairness. 
> $base\_rate = \frac{1}{N}\sum_{i=1}^N y_i$

The selection rate is the average value of the predictions (ŷ).
> $selection\_rate = \frac{1}{N}\sum_{i=1}^N ŷ_i$

In [12]:
model_scores =  pd.DataFrame(columns=('measure','value'))
print("base_rate:", round(base_rate(y_test_aif, y_pred_aif), 4), "\n")
model_scores.loc[0] = ['selection_rate', selection_rate(y_test_aif, y_pred_aif)]
print(model_scores)

base_rate: 0.3801 

          measure     value
0  selection_rate  0.269314
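
To make the two definitions concrete, here is a minimal NumPy sketch on made-up labels. The helper names `base_rate_sketch` and `selection_rate_sketch` are hypothetical and are not the AIF360 or FairLearn functions called above; the sketch also ignores sample weights.

```python
import numpy as np

# Toy labels and predictions (assumption: hypothetical values, not MIMIC-III data)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 0])

def base_rate_sketch(y):
    """Mean of the ground-truth labels: the share of actual positives."""
    return float(np.mean(y))

def selection_rate_sketch(y_hat):
    """Mean of the predicted labels: the share of predicted positives."""
    return float(np.mean(y_hat))

toy_base_rate = base_rate_sketch(y_true)            # 4 of 8 labels are positive
toy_selection_rate = selection_rate_sketch(y_pred)  # 2 of 8 predictions are positive
print("base rate:", toy_base_rate)
print("selection rate:", toy_selection_rate)
```

A selection rate below the base rate, as in both this toy example and the tutorial output (0.2693 versus 0.3801), means the model predicts the positive class less often than it actually occurs.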


### Measures of Demographic Parity

The Disparate Impact Ratio is the ratio between the probability of positive prediction for the unprivileged group and the probability of positive prediction for the privileged group. A ratio of 1 indicates that the model is fair (it favors neither group).
> $disparate\_impact\_ratio = P(ŷ =1 | unprivileged) / P(ŷ =1 | privileged)$

Statistical Parity Difference is the selection rate of the unprivileged group minus that of the privileged group. A difference of 0 indicates that the model is fair (it favors neither group).
> $statistical\_parity\_difference = selection\_rate(ŷ_{unprivileged}) - selection\_rate(ŷ_{privileged}) $

In [13]:
model_scores.loc[1] = ['disparate_impact_ratio', disparate_impact_ratio(y_test_aif, y_pred_aif, prot_attr='GENDER_M')]
model_scores.loc[2] = ['statistical_parity_difference', statistical_parity_difference(y_test_aif, y_pred_aif, prot_attr='GENDER_M')]
model_scores.tail(2)

Unnamed: 0,measure,value
1,disparate_impact_ratio,1.099128
2,statistical_parity_difference,0.025555
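
The two demographic parity formulas can be sketched directly from per-group selection rates. This is a minimal illustration on made-up data, assuming a binary protected attribute where 1 marks the privileged group; `group_selection_rate` is a hypothetical helper, not part of AIF360.

```python
import numpy as np

# Hypothetical predictions and protected attribute (1 = privileged group)
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

def group_selection_rate(y_hat, groups, value):
    """Share of positive predictions among members of one group."""
    return float(np.mean(y_hat[groups == value]))

sr_unpriv = group_selection_rate(y_pred, group, 0)  # 2 of 4
sr_priv = group_selection_rate(y_pred, group, 1)    # 2 of 6

toy_disparate_impact = sr_unpriv / sr_priv  # 1 would mean parity
toy_stat_parity_diff = sr_unpriv - sr_priv  # 0 would mean parity
print("disparate impact ratio:", round(toy_disparate_impact, 4))
print("statistical parity difference:", round(toy_stat_parity_diff, 4))
```

Both measures describe the same gap on different scales: a ratio above 1 and a difference above 0 each indicate that the unprivileged group receives positive predictions more often, which is also the direction seen in the tutorial's gender results.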


### Measures of Equal Odds
Average Odds Difference measures the average of the difference in FPR and TPR for the unprivileged and privileged groups.
> $ average\_odds\_difference = \dfrac{(FPR_{unprivileged} - FPR_{privileged})
        + (TPR_{unprivileged} - TPR_{privileged})}{2}$

Average Odds Error is the average of the absolute difference in FPR and TPR for the unprivileged and privileged groups.
> $average\_odds\_error = \dfrac{|FPR_{unprivileged} - FPR_{privileged}|
        + |TPR_{unprivileged} - TPR_{privileged}|}{2}$
        
Equal Opportunity Difference is the difference in recall scores (TPR) between the unprivileged and privileged groups. A difference of 0 indicates that the model is fair.
> $equal\_opportunity\_difference =  Recall(ŷ_{unprivileged}) - Recall(ŷ_{privileged})$


In [14]:
model_scores.loc[3] = ['average_odds_difference', average_odds_difference(y_test_aif, y_pred_aif, prot_attr='GENDER_M')]
model_scores.loc[4] = ['average_odds_error', average_odds_error(y_test_aif, y_pred_aif, prot_attr='GENDER_M')]
model_scores.loc[5] = ['equal_opportunity_difference', equal_opportunity_difference(y_test_aif, y_pred_aif, prot_attr='GENDER_M')]
model_scores.tail(3)


Unnamed: 0,measure,value
3,average_odds_difference,0.013367
4,average_odds_error,0.015131
5,equal_opportunity_difference,0.028498
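
The three equal-odds formulas above differ only in how they combine the group gaps in TPR and FPR. The sketch below computes them from scratch on made-up data, assuming 1 marks the privileged group; `rates` is a hypothetical helper, not an AIF360 function.

```python
import numpy as np

# Hypothetical labels, predictions, and protected attribute (1 = privileged)
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])

def rates(y, y_hat):
    """Return (TPR, FPR) for one group."""
    tpr = np.mean(y_hat[y == 1] == 1)
    fpr = np.mean(y_hat[y == 0] == 1)
    return float(tpr), float(fpr)

tpr_u, fpr_u = rates(y_true[group == 0], y_pred[group == 0])  # unprivileged
tpr_p, fpr_p = rates(y_true[group == 1], y_pred[group == 1])  # privileged

toy_avg_odds_diff = ((fpr_u - fpr_p) + (tpr_u - tpr_p)) / 2
toy_avg_odds_err = (abs(fpr_u - fpr_p) + abs(tpr_u - tpr_p)) / 2
toy_equal_opp_diff = tpr_u - tpr_p
print(toy_avg_odds_diff, toy_avg_odds_err, toy_equal_opp_diff)
```

In this toy case the FPR gap and the TPR gap have opposite signs, so they partly cancel in the average odds difference but not in the average odds error. This is one reason to report both.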


### Measures Of Individual Fairness
Consistency scores measure the similarity between a given prediction and the predictions of "like" individuals. In AIF360, the consistency score is calculated as the complement of the mean absolute difference between each prediction and the mean prediction of its nearest neighbors, using scikit-learn's Nearest Neighbors algorithm (by default, 5 neighbors found with the BallTree algorithm). Below, $\mathcal{N}_k(x_i)$ denotes the $k$ nearest neighbors of individual $i$.
> $ consistency\_score = 1 - \frac{1}{n}\sum_{i=1}^n \left| ŷ_i - \frac{1}{k}\sum_{j \in \mathcal{N}_k(x_i)} ŷ_j \right| $

#### The Generalized Entropy Index and Related Measures
The Generalized Entropy (GE) Index is...
> $ GE =  \mathcal{E}(\alpha) = \begin{cases}
            \frac{1}{n \alpha (\alpha-1)}\sum_{i=1}^n\left[\left(\frac{b_i}{\mu}\right)^\alpha - 1\right],& \alpha \ne 0, 1,\\
            \frac{1}{n}\sum_{i=1}^n\frac{b_{i}}{\mu}\ln\frac{b_{i}}{\mu},& \alpha=1,\\
            -\frac{1}{n}\sum_{i=1}^n\ln\frac{b_{i}}{\mu},& \alpha=0.
        \end{cases}
        $
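
The piecewise definition above can be sketched in a few lines of NumPy. The example below is an illustration under stated assumptions, not AIF360's implementation: it uses $\alpha = 2$ as an illustrative choice, builds the benefit vector from prediction errors as $b_i = \hat{y}_i - y_i + 1$ (the choice used by the error-based measures in this section), and reads the between-group variant as replacing each individual's benefit with the mean benefit of their group. The function name `ge_index` and the toy data are hypothetical.

```python
import numpy as np

def ge_index(b, alpha=2):
    """Generalized entropy index of a benefit vector b (all b_i >= 0, mean > 0)."""
    b = np.asarray(b, dtype=float)
    ratio = b / b.mean()
    if alpha == 1:
        # Treat 0 * ln(0) as 0 by convention
        safe = np.where(ratio > 0, ratio, 1.0)
        return float(np.mean(ratio * np.log(safe)))
    if alpha == 0:
        return float(-np.mean(np.log(ratio)))
    return float(np.mean(ratio ** alpha - 1) / (alpha * (alpha - 1)))

# Hypothetical labels, predictions, and protected attribute
y_true = np.array([1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0])
group = np.array([0, 0, 0, 1, 1, 1])

# Benefit: 1 for a correct prediction, 2 for a false positive, 0 for a false negative
benefit = y_pred - y_true + 1
toy_ge_error = ge_index(benefit, alpha=2)

# Between-group variant: every individual gets their group's mean benefit
group_benefit = np.empty(len(benefit), dtype=float)
for g in np.unique(group):
    group_benefit[group == g] = benefit[group == g].mean()
toy_ge_between = ge_index(group_benefit, alpha=2)
print(toy_ge_error, toy_ge_between)
```

Here the individual-level index is positive because benefits vary from person to person, while the between-group index is zero because both groups have the same mean benefit. A near-zero between-group value alongside a larger individual-level value therefore suggests that the inequality lies within groups rather than between them.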

Generalized Entropy Error is the GE index computed over per-individual benefit values derived from the prediction errors, i.e. 1 + (ŷ == pos_label) - (y == pos_label).
> $ GE(Error) = GE(b), \quad b_i = \hat{y}_i - y_i + 1 $

Between Group Generalized Entropy Error is the GE of the set of mean errors for the two groups (privileged error and unprivileged error), weighted by the number of predictions in each group.
> $ GE(Error_{group}) =  GE( [N_{unprivileged}*mean(Error_{unprivileged}), N_{privileged}*mean(Error_{privileged})] ) $

In [15]:
model_scores.loc[6] = ['consistency_score', consistency_score(X_test_gender, y_pred_gender)]
model_scores.loc[7] = ['generalized_entropy_error', generalized_entropy_error(y_test['los_binary'], y_pred_gender)]
model_scores.loc[8] = ['between_group_generalized_entropy_error', 
                            between_group_generalized_entropy_error(y_test_aif, y_pred_aif, prot_attr=['GENDER_M'])]
model_scores.tail(3) 

Unnamed: 0,measure,value
6,consistency_score,0.682955
7,generalized_entropy_error,0.140157
8,between_group_generalized_entropy_error,1.5e-05
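
The consistency score defined earlier in this section can be sketched with a brute-force neighbor search. This is a minimal illustration, not AIF360's implementation: it uses plain Euclidean distances instead of a BallTree, assumes that each row counts as one of its own nearest neighbors, and runs on made-up data. The name `consistency_sketch` is hypothetical.

```python
import numpy as np

def consistency_sketch(X, y_hat, n_neighbors=5):
    """1 minus the mean absolute gap between each prediction and its neighbors' mean prediction."""
    X = np.asarray(X, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    # Pairwise Euclidean distances between all rows of X
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Indices of the k closest rows (a row is at distance 0 from itself)
    neighbors = np.argsort(dists, axis=1, kind="stable")[:, :n_neighbors]
    neighbor_mean = y_hat[neighbors].mean(axis=1)
    return float(1.0 - np.mean(np.abs(y_hat - neighbor_mean)))

# Two well-separated clusters of three individuals each (hypothetical feature values)
X_toy = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

y_uniform = np.array([0, 0, 0, 1, 1, 1])  # similar individuals get the same prediction
y_mixed = np.array([0, 1, 0, 1, 0, 1])    # predictions disagree within each cluster

score_uniform = consistency_sketch(X_toy, y_uniform, n_neighbors=3)
score_mixed = consistency_sketch(X_toy, y_mixed, n_neighbors=3)
print(score_uniform, round(score_mixed, 4))
```

A score of 1 means every individual receives the same prediction as their neighbors; the score drops as predictions diverge among similar individuals. Note that the measure depends on which features are passed as `X`, so scores from models with different feature sets are not strictly comparable.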


## Part 3: Comparing Against a Second Model - Evaluating Unawareness <a class="anchor" id="part3"></a>

To demonstrate the change in model scores relative to the use of a sensitive attribute, this section scores the baseline model from Part 1, which is otherwise similar but excludes the sensitive attribute. As shown below, for this sensitive attribute, nearly every score is unchanged when the attribute is excluded; only the generalized entropy error differs between the two columns.

Note: Since we have already discussed the individual measures, a helper function will be used to save space.

In [26]:
new_scores = tutorial_helpers.get_aif360_measures_df(X_test_gender, y_test, baseline_y_pred, sensitive_attributes=['GENDER_M'])

comparison = model_scores.rename(columns={'value':'gender_score'}
                                ).merge(new_scores.rename(columns={'value':'gender_score (feature removed)'}))
comparison.round(4)

base_rate: 0.3801 



Unnamed: 0,measure,gender_score,gender_score (feature removed)
0,selection_rate,0.2693,0.2693
1,disparate_impact_ratio,1.0991,1.0991
2,statistical_parity_difference,0.0256,0.0256
3,average_odds_difference,0.0134,0.0134
4,average_odds_error,0.0151,0.0151
5,equal_opportunity_difference,0.0285,0.0285
6,consistency_score,0.683,0.683
7,generalized_entropy_error,0.1402,0.312
8,between_group_generalized_entropy_error,0.0,0.0



> to do: add peformance functions like roc_auc_score_group_summary from AIF360 to process above

## Part 4: Testing Other Sensitive Attributes <a class="anchor" id="part4"></a>

Our next experiment will test the presence of bias relative to a patient's language, assuming that there is a bias toward individuals who speak English. As above, we will add a boolean feature, here named 'LANG_ENGL', to the baseline data.

In [18]:
# Here we attach the sensitive attribute to our data
lang_cols = [c for c in df.columns if c.startswith("LANGUAGE_")]
eng_cols = ['LANGUAGE_ENGL']
X_lang = df.loc[:, lang_cols].copy()  # copy so the new column is not written to a view of df
# LANG_ENGL is 1 if an English-language column is set, otherwise 0
X_lang['LANG_ENGL'] = 0
X_lang.loc[X_lang[eng_cols].eq(1).any(axis=1), 'LANG_ENGL'] = 1
# Keep only the new boolean flag
X_lang = X_lang.drop(lang_cols, axis=1).fillna(0)
X_lang.join(df['length_of_stay']).groupby('LANG_ENGL')['length_of_stay'].describe().round(4)

LANG_ENGL,count,mean,std,min,25%,50%,75%,max
0,10266.0,9.5874,6.4989,0.0042,4.9115,7.9167,12.8132,29.9792
1,12168.0,8.7169,5.9239,0.0049,4.4819,7.2174,11.6069,29.9889


In [38]:
# Here we train the model
X_lang_train = X_train.join(X_lang, how='inner')
X_lang_test = X_test.join(X_lang, how='inner')
lang_model = XGBClassifier()
lang_model.fit(X_lang_train, y_train)
y_pred_lang = lang_model.predict(X_lang_test)
print('\n', "ROC_AUC Score with Language Included:", roc_auc_score(y_test, y_pred_lang) )


 ROC_AUC Score with Language Included: 0.7227318124596439


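Again, by comparing the results with and without the sensitive attribute we can better demonstrate the effect that the attribute has on the fairness of the model. In this example we see that including the language feature moves the disparate impact ratio further from 1 (1.2227 versus 1.0930) and more than doubles the statistical parity difference (0.0546 versus 0.0240).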

In [50]:
print("Measure values with feature included:")
lang_scores = tutorial_helpers.get_aif360_measures_df(X_lang_test, y_test, y_pred_lang, sensitive_attributes=['LANG_ENGL'])
print(lang_scores.round(4).head(3))
print("\n", "Measure values with feature removed:")
lang_ko_scores = tutorial_helpers.get_aif360_measures_df(X_lang_test, y_test, baseline_y_pred, sensitive_attributes=['LANG_ENGL']) 
print(lang_ko_scores.round(4).head(3))

Measure values with feature included:
base_rate: 0.3801 

                         measure   value
0                 selection_rate  0.2704
1         disparate_impact_ratio  1.2227
2  statistical_parity_difference  0.0546

 Measure values with feature removed:
base_rate: 0.3801 

                         measure   value
0                 selection_rate  0.2693
1         disparate_impact_ratio  1.0930
2  statistical_parity_difference  0.0240


### Comparing All Four Models Against Each Other
As shown below, the scores for all four models can be merged into a single table for side-by-side comparison.

In [22]:
full_comparison = comparison.merge(lang_scores.rename(columns={'value':'lang_score'})
                            ).merge(lang_ko_scores.rename(columns={'value':'lang_score (feature removed)'})
                            )
full_comparison.round(4)

Unnamed: 0,measure,gender_score,gender_score (feature removed),lang_score,lang_score (feature removed)
0,selection_rate,0.2693,0.2693,0.2704,0.2693
1,disparate_impact_ratio,1.0991,1.0991,1.2227,1.093
2,statistical_parity_difference,0.0256,0.0256,0.0546,0.024
3,average_odds_difference,0.0134,0.0134,0.0382,0.0047
4,average_odds_error,0.0151,0.0151,0.0382,0.0047
5,equal_opportunity_difference,0.0285,0.0285,0.0516,0.0061
6,consistency_score,0.683,0.683,0.6815,0.6828
7,generalized_entropy_error,0.1402,0.312,0.3089,0.312
8,between_group_generalized_entropy_error,0.0,0.0,0.0,0.0001


## Part 5: Comparison to FairLearn <a class="anchor" id="part5"></a>

Next, some of the same metrics will be demonstrated using Microsoft's FairLearn library. The two APIs are similar, and although the measures built into FairLearn are not as comprehensive as those of AIF360, some users may find FairLearn's documentation style to be more accessible. A table comparing the measures available in each library is shown below. 

In [35]:
Image(url="img/library_measure_comparison.png", width=500)

In [23]:
print("Selection rate", 
      selection_rate(y_test, y_pred_lang) )
print("Demographic parity difference", 
      demographic_parity_difference(y_test, y_pred_lang, sensitive_features=X_lang_test['LANG_ENGL']))
print("Demographic parity ratio", 
      demographic_parity_ratio(y_test, y_pred_lang, sensitive_features=X_lang_test['LANG_ENGL']))

print("------")
y_prob_lang = lang_model.predict_proba(X_lang_test)[:, 1]
print("Overall AUC", roc_auc_score(y_test, y_prob_lang) )
print("AUC group summary", roc_auc_score_group_summary(y_test, y_prob_lang, sensitive_features=X_lang_test['LANG_ENGL']))



Selection rate 0.27039438141545113
Demographic parity difference 0.054553219850345586
Demographic parity ratio 0.8178916255964723
------
Overall AUC 0.8277471574588929
AUC group summary {'overall': 0.8277471574588929, 'by_group': {0: 0.8251191163857235, 1: 0.8284959998227068}}
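
The demographic parity ratio printed above (0.8179) differs from the disparate impact ratio that AIF360 reported for the same model in Part 4 (1.2227). One plausible reading, sketched below, is that the two libraries orient the ratio differently: AIF360 divides the unprivileged rate by the privileged rate, while FairLearn is assumed here to divide the smallest group selection rate by the largest (and to take largest minus smallest for the difference). This is a minimal illustration on made-up data, and `selection_rates_by_group` is a hypothetical helper rather than a FairLearn function.

```python
import numpy as np

def selection_rates_by_group(y_hat, groups):
    """Selection rate for each distinct value of the sensitive feature."""
    return {int(g): float(np.mean(y_hat[groups == g])) for g in np.unique(groups)}

# Hypothetical predictions and sensitive feature (1 = privileged group)
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

rates = selection_rates_by_group(y_pred, group)

# Assumed FairLearn-style definitions: order-free, based on extremes
dp_difference = max(rates.values()) - min(rates.values())
dp_ratio = min(rates.values()) / max(rates.values())

# AIF360-style ratio: unprivileged over privileged
aif_style_ratio = rates[0] / rates[1]
print(round(dp_difference, 4), round(dp_ratio, 4), round(aif_style_ratio, 4))

# Check against the values printed in this tutorial
fairlearn_ratio_printed = 0.8178916255964723
implied_aif_ratio = 1 / fairlearn_ratio_printed
print(round(implied_aif_ratio, 4))
```

When the unprivileged group has the higher selection rate, as in both the toy data and the tutorial's language model, the two ratios are reciprocals of each other, and the reciprocal of the printed FairLearn value does agree with the AIF360 value to four decimal places. The two difference measures coincide in that case (0.0546 in both libraries).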


### Balanced Error Rate Difference
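Similar to the Equal Opportunity Difference measured by AIF360, the Balanced Error Rate Difference offered by FairLearn compares performance between the unprivileged and privileged groups, here in terms of balanced accuracy. The cell below reports balanced accuracy overall and for each group; the gap between the two group values is the difference of interest.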

In [24]:

print("-----")
print("Balanced accuracy group summary",
        balanced_accuracy_score_group_summary(y_test, y_pred_lang, sensitive_features=X_lang_test['LANG_ENGL']))
print("Equalized odds difference",
      equalized_odds_difference(y_test, y_pred_lang, sensitive_features=X_lang_test['LANG_ENGL']))
      


-----
Balanced accuracy group summary {'overall': 0.7227318124596438, 'by_group': {0: 0.7289100106775893, 1: 0.7155632936639851}}
Equalized odds difference 0.05159445477325997
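
The output above contains a per-group summary rather than a single gap, so it may help to sketch how the gap and the equalized odds difference can be derived by hand. The example below assumes that the equalized odds difference is the larger of the between-group TPR gap and FPR gap, and it takes the balanced accuracy gap as the largest group value minus the smallest. The helper names and toy data are hypothetical; the `summary` dictionary is copied from the output above.

```python
import numpy as np

def tpr_fpr(y, y_hat):
    """Return (TPR, FPR) for one group."""
    return float(np.mean(y_hat[y == 1] == 1)), float(np.mean(y_hat[y == 0] == 1))

def equalized_odds_difference_sketch(y, y_hat, groups):
    """Larger of the between-group gaps in TPR and FPR."""
    per_group = [tpr_fpr(y[groups == g], y_hat[groups == g]) for g in np.unique(groups)]
    tprs, fprs = zip(*per_group)
    return float(max(max(tprs) - min(tprs), max(fprs) - min(fprs)))

# Hypothetical labels, predictions, and sensitive feature
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])

toy_eod = equalized_odds_difference_sketch(y_true, y_pred, group)
print("toy equalized odds difference:", toy_eod)

# Group summary as printed in this tutorial
summary = {'overall': 0.7227318124596438,
           'by_group': {0: 0.7289100106775893, 1: 0.7155632936639851}}
by_group = summary['by_group'].values()
balanced_accuracy_gap = max(by_group) - min(by_group)
print("balanced accuracy gap:", round(balanced_accuracy_gap, 4))
```

For the tutorial's language model the balanced accuracy gap works out to roughly 0.0133. The printed equalized odds difference (0.0516) matches the equal opportunity difference reported by AIF360 in Part 4, which is consistent with the TPR gap being the larger of the two gaps for this model.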


## Summary
This tutorial introduced multiple measures of ML fairness in the context of a healthcare model using the AIF360 and FairLearn Python libraries. A subset of the MIMIC-III database was used to generate a series of simple Length of Stay (LOS) models. It was shown that while the inclusion of a sensitive feature can significantly affect a model's bias as it relates to that feature, this is not always the case. 

# References 

MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35. Available from: http://www.nature.com/articles/sdata201635

<a id="hcup"></a>
HCUP https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp

In [37]:
Image(url="library_algorithm_comparison.png", width=500)

# TEST CELLS BELOW (TO BE REMOVED BEFORE TUTORIAL)

## AIF360

### Effect of Caucasian Ethnicity

In [82]:
eth_cols = [c for c in df.columns if c.startswith("ETHNICITY_")]
cauc_cols = [c for c in df.columns if c.startswith("ETHNICITY_WHITE")]
cauc_cols

['ETHNICITY_WHITE',
 'ETHNICITY_WHITE - BRAZILIAN',
 'ETHNICITY_WHITE - EASTERN EUROPEAN',
 'ETHNICITY_WHITE - OTHER EUROPEAN',
 'ETHNICITY_WHITE - RUSSIAN']

In [83]:
X_eth =  df.loc[:,eth_cols]
X_eth['caucasian'] = 0
X_eth.loc[X_eth[cauc_cols].eq(1).any(axis=1), 'caucasian'] = 1
X_eth = X_eth.drop(eth_cols, axis=1).fillna(0)
X_eth['caucasian'].describe()

count    23485.000000
mean         0.741750
std          0.437681
min          0.000000
25%          0.000000
50%          1.000000
75%          1.000000
max          1.000000
Name: caucasian, dtype: float64

In [84]:
X_eth.join(df['length_of_stay']).groupby('caucasian')['length_of_stay'].describe()

caucasian,count,mean,std,min,25%,50%,75%,max
0,6065.0,10.811345,10.279624,-0.797222,4.834722,7.899306,13.63125,123.127778
1,17420.0,10.547957,9.946768,-0.76875,4.844965,7.831944,12.948264,191.422917


In [85]:
X_eth_train = X_train.join(X_eth, how='inner')
X_eth_test = X_test.join(X_eth, how='inner')
print(X_train.shape, X_eth_train.shape, X_test.shape, X_eth_test.shape)


(15734, 506) (15734, 507) (7751, 506) (7751, 507)


In [86]:
eth_model = XGBClassifier()
eth_model.fit(X_eth_train, y_train)
eth_y_pred = eth_model.predict(X_eth_test) 


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


In [87]:
(tutorial_helpers.get_aif360_measures_df(X_eth_test, y_test, y_pred, sensitive_attributes=['caucasian']) ).round(4)

base_rate: 0.3465 



Unnamed: 0,measure,value
0,selection_rate,0.2344
1,disparate_impact_ratio,1.0125
2,statistical_parity_difference,0.0029
3,average_odds_difference,-0.0006
4,average_odds_error,0.0007
5,equal_opportunity_difference,-0.0013
6,generalized_entropy_error,0.268
7,between_group_generalized_entropy_error,0.0
8,consistency_score,0.7122


In [88]:
eth_scores = tutorial_helpers.get_aif360_measures_df(X_eth_test, y_test, eth_y_pred, sensitive_attributes=['caucasian'])
eth_scores.round(4)

base_rate: 0.3465 



Unnamed: 0,measure,value
0,selection_rate,0.2344
1,disparate_impact_ratio,1.0125
2,statistical_parity_difference,0.0029
3,average_odds_difference,-0.0006
4,average_odds_error,0.0007
5,equal_opportunity_difference,-0.0013
6,generalized_entropy_error,0.268
7,between_group_generalized_entropy_error,0.0
8,consistency_score,0.7122
