# Binary Classification Fairness Assessment Template - Example

## About
This example is intended as a simple illustration of the [Binary Classification Fairness Assessment Template](../templates/Template-BinaryClassificationAssessment.ipynb). It compares a Random Forest Classifier against fairness-aware alternative versions of that same classifier. For more information about the specific measures used, please see the [Measuring Fairness in Binary Classification Tutorial](../tutorials_and_examples/Tutorial-MeasuringFairnessInBinaryClassification.ipynb).

For simplicity, only two fairness-aware algorithms are compared in this notebook. However, several other fairness-aware models were tested during development. For a peek at that process, see [Supplemental - Models for Binary Classification Example](../tutorials_and_examples/Supplemental-ModelsForBinaryClassificationExample.ipynb).

## Contents

[Part 0](#part0) - Data Loading and Baseline Model Setup

[Part 1](#part1) - Measuring a Single (Baseline) Model

[Part 2](#part2) - Fairness-Aware Models

[Part 3](#part3) - Compare Several Models



## Requirements

To run this notebook, please install FairMLHealth using [the instructions posted in GitHub](https://github.com/KenSciResearch/fairMLHealth#installation_instructions). Some components of this notebook additionally require the [Fairlearn](https://github.com/fairlearn/fairlearn) package.

This example uses data from the MIMIC-III Critical Care database, a freely accessible source of electronic health records from Beth Israel Deaconess Medical Center in Boston. To download the MIMIC-III data, please use this link: [Access to MIMIC III](https://mimic.physionet.org/gettingstarted/access/) and save the data with the default directory name ("MIMIC"). No further action is required beyond remembering the download location, and you do not need to unzip any files.

A basic knowledge of ML implementation in Python is assumed. 


In [1]:
from IPython.display import Markdown, HTML
import numpy as np
import pandas as pd

from fairmlhealth import model_comparison as fhmc, analyze
from fairmlhealth.mimic_data import load_mimic3_example

from fairmlhealth.__utils import validate_notebook_requirements
validate_notebook_requirements()

----
----
# Load (or Generate) Data and Models <a name="part0"></a>

Here you should load (or generate) your test dataset and models.

## MIMIC-III

This example uses a simple data subset from the [MIMIC-III clinical database](https://mimic.physionet.org/gettingstarted/access/) to predict the length of intensive care unit (ICU) stay (LOS) for a set of encounters. MIMIC-III is a freely available database; however, all users must pass a short human subjects certification course. For this example, LOS is the total ICU time for a given hospital admission in patients aged 65 and above. The raw LOS value is then converted to a binary value specifying whether an admission's length of stay is greater than the sample mean.
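
The binarization described above can be sketched on a few made-up values. The array below is hypothetical and is not drawn from MIMIC-III; only the thresholding rule (strictly greater than the sample mean) follows the description.

```python
import numpy as np

# Hypothetical length-of-stay values in days (not MIMIC-III data)
length_of_stay = np.array([1.0, 2.0, 3.0, 10.0])

# Flag stays that are strictly longer than the sample mean
mean_val = length_of_stay.mean()
long_los = (length_of_stay > mean_val).astype(int)

print(mean_val)   # 4.0
print(long_los)   # [0 0 0 1]
```

A single long stay pulls the mean above most observations, which is one reason the positive class can end up as the minority, as it does in the classification reports in this notebook (614 of 1,596 test observations).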

Note that the code below will automatically unzip and format all necessary data for these experiments from a raw download of MIMIC-III data (saving the formatted data in the same MIMIC folder). If you would like to run this example on your own, [follow these steps to be granted access to MIMIC III](https://mimic.physionet.org/gettingstarted/access/) and download the data.


## Data Subset

Data are imported at the encounter level with all additional patient identification dropped. Boolean diagnosis and procedure features are categorized through the Clinical Classifications Software system ([HCUP](https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp)). All features other than age are one-hot encoded and prefixed with their variable type (e.g. "GENDER_", "ETHNICITY_").  


In [2]:
# path_to_mimic_data_folder = "[path to folder containing your MIMIC-III zip files]"
path_to_mimic_data_folder = "~/data/MIMIC"

In [3]:
# Load data and keep a 10K observation subset to speed processing
df = load_mimic3_example(path_to_mimic_data_folder) 
df = df.sample(n=10000, random_state=42)

# Subset to ages 65+
df = df.loc[df['AGE'].ge(65), :]
df.drop('GENDER_F', axis=1, inplace=True) # Redundant with GENDER_M


# Generate a binary target flagging whether an observation's length_of_stay value is above or below the mean. 
mean_val = df['length_of_stay'].mean()
df['long_los'] = df['length_of_stay'].apply(lambda x: 1 if x > mean_val else 0)

## Split Data

In [4]:
from sklearn.model_selection import train_test_split

# Subset and Split Data
X = df.loc[:, [c for c in df.columns 
                if c not in ['ADMIT_ID', 'length_of_stay', 'long_los']]]
y = df.loc[:, ['long_los']]
splits = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)
X_train, X_test, y_train, y_test=splits

## Test Baseline

A Scikit-Learn Random Forest Classifier serves as our basis for comparison. Parameters were tuned using Scikit-Learn's GridSearch in the [Supplemental - Models for Binary Classification Example](../tutorials_and_examples/Supplemental-ModelsForBinaryClassificationExample.ipynb).

In [5]:
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

In [6]:
# Set model parameters (tuned values from the supplemental notebook, defined here to be explicit)
rf_params = {'n_estimators': 1800, 'min_samples_split': 5, 'bootstrap': False}

# Train Model
rf_model = RandomForestClassifier(**rf_params)
rf_model.fit(X_train, y_train.iloc[:, 0])
y_pred_rf = rf_model.predict(X_test)

# display performance 
print("\n", "Random Forest Prediction Scores:", "\n", 
      classification_report(y_test, y_pred_rf, target_names=['LOS <= mean', 'LOS > mean']))


 Random Forest Prediction Scores: 
               precision    recall  f1-score   support

 LOS <= mean       0.76      0.89      0.82       982
  LOS > mean       0.76      0.55      0.64       614

    accuracy                           0.76      1596
   macro avg       0.77      0.72      0.73      1596
weighted avg       0.76      0.76      0.75      1596



----
----
# Evaluate a Single (Baseline) Model <a name="part1"></a>

### Required Variables  

- X_test = test data to be passed to the models to generate predictions.
- y_test = target data array corresponding to X_test.
- X_test['LANGUAGE_ENGL'] = protected attribute data corresponding to X_test, optionally also included in X_test.
- rf_model = the trained model to be evaluated.


In [7]:
# Generate table of fairness measures for a single model (returned as a pandas dataframe)
meas = fhmc.measure_model(X_test, y_test, X_test['LANGUAGE_ENGL'], rf_model)
analyze.flag(meas)

| Metric | Measure | Value |
|---|---|---|
| Group Fairness | Statistical Parity Difference | 0.0054 |
| Group Fairness | Disparate Impact Ratio | 1.0195 |
| Group Fairness | Equalized Odds Difference | 0.0146 |
| Group Fairness | Equalized Odds Ratio | 0.9424 |
| Group Fairness | Positive Predictive Parity Difference | 0.0226 |
| Group Fairness | Balanced Accuracy Difference | 0.0106 |
| Group Fairness | Balanced Accuracy Ratio | 1.0149 |
| Group Fairness | AUC Difference | 0.0115 |
| Individual Fairness | Consistency Score | 0.8038 |
| Individual Fairness | Between-Group Gen. Entropy Error | 0.0 |


#### Stratified Data Report

FairMLHealth includes stratified table features to aid in identifying the source of unfairness or other bias. The data analysis reporter evaluates basic statistics specific to each feature-value, in addition to relative statistics for the target value. Since the reporter can evaluate many features at once, it can be a useful option for identifying patterns of bias, either alone or in concert with other methods (e.g., visualization).

In [8]:
analyze.data(X_test['LANGUAGE_ENGL'], y_test)

| | Feature Name | Feature Value | Obs. | Missing Values | Feature Entropy | Target Max | Target Mean | Target Median | Target Min | Target Std. Dev. | Value Prevalence |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALL FEATURES | ALL VALUES | 1596.0 | 0 | | 1.0 | 0.3847 | 0.0 | 0.0 | 0.4867 | 1.0 |
| 1 | LANGUAGE_ENGL | 0 | 776.0 | 0 | 0.9995 | 1.0 | 0.3892 | 0.0 | 0.0 | 0.4879 | 0.4862 |
| 2 | LANGUAGE_ENGL | 1 | 820.0 | 0 | 0.9995 | 1.0 | 0.3805 | 0.0 | 0.0 | 0.4858 | 0.5138 |
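
As a rough guide to reading two of the columns above, the sketch below recomputes Value Prevalence and Feature Entropy from the observation counts in the table (776 and 820). It assumes that Feature Entropy is the Shannon entropy, in bits, of the feature's value distribution. That definition is an assumption made for this illustration, although it reproduces the reported value.

```python
import numpy as np

# Observation counts per LANGUAGE_ENGL value, copied from the table above
counts = np.array([776, 820])

# Value Prevalence: share of observations holding each feature value
prevalence = counts / counts.sum()

# Assumption: Feature Entropy is the Shannon entropy (base 2) of that distribution
entropy = -np.sum(prevalence * np.log2(prevalence))

print(np.round(prevalence, 4))   # [0.4862 0.5138]
print(round(float(entropy), 4))  # 0.9995
```

An entropy close to 1 indicates that the two language groups are nearly balanced in the test data, so neither group's statistics rest on a small sample.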


#### Stratified Performance Report

The stratified performance reporter evaluates model performance specific to each feature-value subset. If prediction probabilities (via the *predict_proba()* method) are available from the model, additional ROC_AUC and PR_AUC values will be included.
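
To make the idea of stratification concrete, the sketch below computes a few of the reported statistics separately for each value of a binary feature. The arrays and the `rates` helper are hypothetical. The snippet illustrates the concept and is not a description of the internals of `analyze.performance`.

```python
import numpy as np

# Hypothetical labels, predictions, and binary feature values
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])
feature = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def rates(y_true, y_pred):
    """Accuracy, TPR, FPR, and precision for binary labels and predictions.

    Assumes every denominator is non-zero, which holds for the toy data.
    """
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "Accuracy": (tp + tn) / len(y_true),
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "Precision": tp / (tp + fp),
    }

# One row of statistics per feature value
report = {}
for value in np.unique(feature):
    mask = feature == value
    report[int(value)] = rates(y_true[mask], y_pred[mask])

for value, stats in report.items():
    print(value, {name: round(float(s), 4) for name, s in stats.items()})
```

In this toy data both subsets have the same accuracy (0.75) but different TPR and FPR, which is the kind of pattern a stratified report is meant to surface.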

In [9]:
analyze.performance(X_test['LANGUAGE_ENGL'], y_test, y_pred_rf)

| | Feature Name | Feature Value | Obs. | Target Mean | Pred. Mean | Accuracy | FPR | Precision | TPR |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ALL FEATURES | ALL VALUES | 1596.0 | 0.3847 | 0.2794 | 0.7581 | 0.111 | 0.7556 | 0.5489 |
| 1 | LANGUAGE_ENGL | 0 | 776.0 | 0.3892 | 0.2822 | 0.7616 | 0.1076 | 0.7671 | 0.5563 |
| 2 | LANGUAGE_ENGL | 1 | 820.0 | 0.3805 | 0.2768 | 0.7549 | 0.1142 | 0.7445 | 0.5417 |
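
The Pred. Mean column also connects this report to the group fairness measures reported for the baseline model. As a minimal sketch, assuming that Statistical Parity Difference and Disparate Impact Ratio compare the selection rate of the LANGUAGE_ENGL = 0 group against that of the LANGUAGE_ENGL = 1 group, the values shown by `measure_model` can be reproduced from the rounded figures above. The direction of the comparison is an assumption made for this illustration.

```python
# Pred. Mean (selection rate) per group, copied from the table above
pred_mean_group0 = 0.2822  # LANGUAGE_ENGL == 0
pred_mean_group1 = 0.2768  # LANGUAGE_ENGL == 1

# Assumed convention: group 0 is compared against group 1
statistical_parity_difference = pred_mean_group0 - pred_mean_group1
disparate_impact_ratio = pred_mean_group0 / pred_mean_group1

print(round(statistical_parity_difference, 4))  # 0.0054
print(round(disparate_impact_ratio, 4))         # 1.0195
```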


#### Stratified Fairness Report

The stratified bias reporter evaluates model bias specific to each feature-value subset. It treats each feature-value in turn as the "privileged" group relative to all other possible values for the feature. To simplify the report, fairness measures have been reduced to their component parts. For example, measures of Equalized Odds can be determined by combining the True Positive Rate (TPR) Ratios & Differences with the False Positive Rate (FPR) Ratios & Differences.



In [10]:
analyze.bias(X_test['LANGUAGE_ENGL'], y_test, y_pred_rf)

| | Feature Name | Feature Value | Obs. | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | TPR Diff | TPR Ratio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LANGUAGE_ENGL | 0 | 776.0 | 0.0066 | 1.0611 | -0.0226 | 0.9705 | -0.0146 | 0.9737 |
| 1 | LANGUAGE_ENGL | 1 | 820.0 | -0.0066 | 0.9424 | 0.0226 | 1.0304 | 0.0146 | 1.027 |
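
As a sketch of how the component columns recombine, the snippet below takes the LANGUAGE_ENGL = 1 row of the table above and picks, for each pair, the value that deviates most from parity (0 for differences, 1 for ratios). This selection rule is an assumption made for illustration, not a statement of the library's implementation, although it matches the Equalized Odds Difference (0.0146) and Equalized Odds Ratio (0.9424) reported for the baseline model.

```python
# Component measures for LANGUAGE_ENGL == 1, copied from the table above
tpr_diff, fpr_diff = 0.0146, -0.0066
tpr_ratio, fpr_ratio = 1.027, 0.9424

# Assumed rule: report the component that lies farthest from parity
equalized_odds_difference = max(tpr_diff, fpr_diff, key=abs)
equalized_odds_ratio = max(tpr_ratio, fpr_ratio, key=lambda r: abs(r - 1))

print(equalized_odds_difference)  # 0.0146
print(equalized_odds_ratio)       # 0.9424
```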


----
----
# Generate Fairness-Aware Models <a name="part2"></a>


## Fairlearn Models

The [Fairlearn](https://fairlearn.github.io/) package includes several [mitigation algorithms](https://fairlearn.github.io/user_guide/mitigation.html) designed to increase the fairness of an existing model relative to a user-specified fairness metric. This example uses two of those algorithms, each paired with one of two fairness metrics. Both the algorithms and the metrics are imported in the cell below.

For more information about the specifics of these fairness metrics, see [Part 5 of the Measuring Fairness in Binary Classification Tutorial](../tutorials_and_examples/Tutorial-MeasuringFairnessInBinaryClassification.ipynb#part5).

In [11]:
# Mitigation Algorithms
from fairlearn.reductions import GridSearch, ExponentiatedGradient

# Fairness Measures
from fairlearn.reductions import EqualizedOdds, DemographicParity 

### Fair ExponentiatedGradient

Fairlearn's ExponentiatedGradient is a wrapper that runs a constrained optimization using the Exponentiated Gradient approach on a binary classification model. It treats the prediction as a sequence of cost-sensitive classification problems, returning the solution with the smallest error (constrained by the metric of choice). This approach has been demonstrated to have minimal effect on model performance by some measures [[Agarwal2018]](#Agarwal2018).

This approach is applicable to sensitive attributes that are either categorical or binary/Boolean. It can be used for classification problems only.

Note: solutions are not guaranteed for this approach.
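
For intuition only, the sketch below shows the generic exponentiated-gradient (multiplicative-weights) update that gives the approach its name: a weight vector on the probability simplex is multiplied by the exponential of the negative scaled gradient and then renormalized. It is not Fairlearn's implementation, and the function name, step size, and losses are hypothetical.

```python
import numpy as np

def exponentiated_gradient_step(weights, gradient, step_size):
    """One multiplicative update followed by renormalization onto the simplex."""
    updated = weights * np.exp(-step_size * gradient)
    return updated / updated.sum()

# Hypothetical per-option losses; for a linear objective these are the gradient
losses = np.array([0.30, 0.10, 0.20])
weights = np.full(3, 1 / 3)  # start from the uniform distribution

for _ in range(200):
    weights = exponentiated_gradient_step(weights, losses, step_size=0.5)

# Mass concentrates on the lowest-loss option (index 1)
print(weights.round(3))
```

Because the update is multiplicative, the weights stay positive and sum to one at every step, so no separate projection onto the simplex is needed.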


In [12]:
# Set seed for consistent results with Fairlearn's ExponentiatedGradient
np.random.seed(36)  

#### Fair ExponentiatedGradient Using Demographic Parity as Constraint

In [13]:
eg_rfDP_model = ExponentiatedGradient(RandomForestClassifier(**rf_params), 
                                      constraints=DemographicParity()) 
eg_rfDP_model.fit(X_train, y_train,
                  sensitive_features=X_train['LANGUAGE_ENGL'])
y_pred_eg_rfDP = eg_rfDP_model.predict(X_test)

# display performance 
print("\n", "Prediction Scores:", "\n", 
      classification_report(y_test, y_pred_eg_rfDP, 
       target_names=['LOS <= mean', 'LOS > mean']))



 Prediction Scores: 
               precision    recall  f1-score   support

 LOS <= mean       0.76      0.89      0.82       982
  LOS > mean       0.76      0.54      0.63       614

    accuracy                           0.76      1596
   macro avg       0.76      0.72      0.73      1596
weighted avg       0.76      0.76      0.75      1596



#### Fair ExponentiatedGradient Using Equalized Odds as Constraint

In [14]:
eg_rfEO_model = ExponentiatedGradient(RandomForestClassifier(**rf_params), 
                                      constraints=EqualizedOdds())  
eg_rfEO_model.fit(X_train, y_train, 
                  sensitive_features=X_train['LANGUAGE_ENGL'])
y_pred_eg_rfEO = eg_rfEO_model.predict(X_test)

# display performance 
print("\n", "Prediction Scores:", "\n", 
      classification_report(y_test, y_pred_eg_rfEO, 
       target_names=['LOS <= mean', 'LOS > mean']))



 Prediction Scores: 
               precision    recall  f1-score   support

 LOS <= mean       0.76      0.89      0.82       982
  LOS > mean       0.76      0.55      0.64       614

    accuracy                           0.76      1596
   macro avg       0.76      0.72      0.73      1596
weighted avg       0.76      0.76      0.75      1596



### Fair GridSearch

Fairlearn's GridSearch is a wrapper that runs a constrained optimization using the Grid Search approach on a binary classification or a regression model. It treats the prediction as a sequence of cost-sensitive classification problems, returning the solution with the smallest error (constrained by the metric of choice). This approach has been demonstrated to have minimal effect on model performance by some measures [[Agarwal2018]](#Agarwal2018).

This approach is applicable to sensitive attributes that are binary/Boolean only. It can be used for either binary classification or regression problems.


#### Fair GridSearch Using Equalized Odds as Constraint

In [15]:
# Train GridSearch
gs_rfEO_model = GridSearch(RandomForestClassifier(**rf_params),
                           constraints=EqualizedOdds(),
                           grid_size=45)

gs_rfEO_model.fit(X_train, y_train, 
                  sensitive_features = X_train['LANGUAGE_ENGL'])
y_pred_gs_rfEO = gs_rfEO_model.predict(X_test)

# display performance 
print("\n", "Prediction Scores:", "\n", 
      classification_report(y_test, y_pred_gs_rfEO, 
       target_names=['LOS <= mean', 'LOS > mean']))



 Prediction Scores: 
               precision    recall  f1-score   support

 LOS <= mean       0.76      0.89      0.82       982
  LOS > mean       0.76      0.55      0.64       614

    accuracy                           0.76      1596
   macro avg       0.76      0.72      0.73      1596
weighted avg       0.76      0.76      0.75      1596



#### Fair GridSearch Using Demographic Parity as Constraint

In [16]:
# Train GridSearch
gs_rfDP_model = GridSearch(RandomForestClassifier(**rf_params),
                           constraints=DemographicParity(),
                           grid_size=45)

gs_rfDP_model.fit(X_train, y_train, 
                  sensitive_features=X_train['LANGUAGE_ENGL'])
y_pred_gs_rfDP = gs_rfDP_model.predict(X_test)

# display performance 
print("\n", "Prediction Scores:", "\n", 
      classification_report(y_test, y_pred_gs_rfDP, 
       target_names=['LOS <= mean', 'LOS > mean']))



 Prediction Scores: 
               precision    recall  f1-score   support

 LOS <= mean       0.74      0.88      0.80       982
  LOS > mean       0.72      0.49      0.59       614

    accuracy                           0.73      1596
   macro avg       0.73      0.69      0.69      1596
weighted avg       0.73      0.73      0.72      1596



----
----
# Compare Several Models <a name="part3"></a>


## Set the Required Variables  

* X_data (NumPy array or similar pandas object): Test data to be passed to the models to generate predictions (or list/dict of data for each model if model inputs differ). It's recommended that these be separate data from those used to train the model.

* y_data (NumPy array or similar pandas object): Target data array corresponding to X (or list/dict of labels for each model if labels differ). It is recommended that the target is not present in the test data.

* pa_data (NumPy array or similar pandas object): Protected attributes corresponding to X, optionally also included in X (or list/dict of data for each model if attributes differ). Note that values must currently be binary- or Boolean-type.

* models (list or dict-like): The set of trained models to be evaluated. Dictionary keys are used as model names. If a list-like object is passed, the function will name each model by its index (e.g. "model_0", "model_1").
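
The naming convention described for list inputs can be pictured with the short sketch below. It illustrates the described behavior and is not the library's code. The `name_models` helper and the placeholder strings standing in for trained models are hypothetical.

```python
def name_models(models):
    """Return a dict of models, naming list entries by their position."""
    if isinstance(models, dict):
        return dict(models)
    return {f"model_{i}": m for i, m in enumerate(models)}

# Placeholder strings stand in for trained model objects
from_list = name_models(["baseline", "mitigated"])
from_dict = name_models({"rf_model": "baseline"})

print(list(from_list))  # ['model_0', 'model_1']
print(list(from_dict))  # ['rf_model']
```

Passing a dictionary, as the next cell does, keeps the column headers of the comparison table readable.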


In [17]:
X = X_test
y = y_test
protected_attr = X_test['LANGUAGE_ENGL']
models = {'rf_model': rf_model,
         'gs_rfEO_model': gs_rfEO_model, 'gs_rfDP_model': gs_rfDP_model,
         'eg_rfEO_model': eg_rfEO_model, 'eg_rfDP_model': eg_rfDP_model}
display("Models being compared in this example:", list(models.keys()))


'Models being compared in this example:'

['rf_model',
 'gs_rfEO_model',
 'gs_rfDP_model',
 'eg_rfEO_model',
 'eg_rfDP_model']

## Comparison with the FairMLHealth Tool

The FairMLHealth model comparison tool generates a table of fairness measures that can be used to quickly compare the fairness-performance tradeoff for a set of fairness-aware models. 

Note that there is some additional formatting added to the cell below simply to add highlighting for this example.

In [18]:
# Generate comparison table (returned as a pandas dataframe)
comparison = fhmc.compare_models(X, y, protected_attr, models)

# Find the index entries for the Equalized Odds measures so that they can be
#    highlighted later
idx = pd.IndexSlice
eotag = idx[:, ['Equal Opportunity Difference', 'Equalized Odds Difference',
                 'Equalized Odds Ratio']
            ]
equal_odds = comparison.loc[eotag, :].index

# Here we return the flagged table as a pandas styler so we can also highlight 
#       measures of Equal Odds
flagged = analyze.flag(comparison, as_styler=True)
flagged.apply(lambda x: ['background-color:' + "#DED8F9" 
                          if x.name in equal_odds else '' for i in x]
                , axis=1)



| Metric | Measure | rf_model | gs_rfEO_model | gs_rfDP_model | eg_rfEO_model | eg_rfDP_model |
|---|---|---|---|---|---|---|
| Data Metrics | Prevalence of Privileged Class (%) | 38.0 | 38.0 | 38.0 | 38.0 | 38.0 |
| Group Fairness | AUC Difference | 0.0115 | 0.0102 | 0.0362 | | |
| Group Fairness | Balanced Accuracy Difference | 0.0106 | 0.0022 | -0.0426 | 0.008 | -0.0488 |
| Group Fairness | Balanced Accuracy Ratio | 1.0149 | 1.003 | 0.94 | 1.0112 | 0.9335 |
| Group Fairness | Disparate Impact Ratio | 1.0195 | 1.0099 | 0.4786 | 1.0191 | 0.8001 |
| Group Fairness | Equalized Odds Difference | 0.0146 | -0.0028 | -0.2381 | 0.0113 | -0.123 |
| Group Fairness | Equalized Odds Ratio | 0.9424 | 0.9743 | 0.1989 | 0.9569 | 0.7859 |
| Group Fairness | Positive Predictive Parity Difference | 0.0226 | 0.0119 | 0.1995 | 0.0185 | 0.0078 |
| Group Fairness | Statistical Parity Difference | 0.0054 | 0.0027 | -0.1825 | 0.0052 | -0.0592 |
| Individual Fairness | Between-Group Gen. Entropy Error | 0.0 | 0.0 | 0.0059 | 0.0 | 0.0007 |


# References

<a name="Agarwal2018"></a>
Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., & Wallach, H. (2018). A reductions approach to fair classification. In International Conference on Machine Learning (pp. 60-69). PMLR. Available through [arXiv preprint:1803.02453](https://arxiv.org/pdf/1803.02453.pdf).