# Binary Classification Fairness Assessment Template - Example

This example illustrates the use of the [Binary Classification Fairness Template](../binary_classification_fairness_template.ipynb). It compares a RandomForestClassifier against fairness-aware versions of the same model. For more information about specific measures, please see the [KDD Tutorial Notebook](kdd_fairness_in_healthcare_tutorial.ipynb).

## Example Contents

[Part 1](#part1) - Data Loading and Model Setup

[Part 2](#part2) - Fairness-Aware Models

[Part 3](#part3) - Model Comparison

In [1]:
from IPython.display import Markdown
from fairMLHealth.utils import model_comparison, helpers
import numpy as np
import pandas as pd

np.random.seed(0)  # set seed for consistent results with ExponentiatedGradient

----
# Load Data and Generate Baseline Model <a name="part1"></a>

This example uses a data subset from the [MIMIC-III clinical database](https://mimic.physionet.org/gettingstarted/access/) to predict "length of stay" (LOS). Here, LOS is the total ICU time for a given hospital admission among patients aged 65 and above. The raw LOS value is converted to a binary value specifying whether an admission's length of stay is greater than the sample mean. A baseline model is then generated using the Scikit-Learn RandomForestClassifier.

Note that this set of example models was generated through a larger test of model fitness and was chosen due to its pronounced differences in scores across the fairness-aware versions that you will see shortly. For more examples of standard vs. fairness-aware classifiers for this toy problem, see [models_for_binary_classification_example.ipynb](models_for_binary_classification_example.ipynb).

Note also that the code below will automatically unzip and format all necessary data for these experiments from a raw download of MIMIC-III data (saving the formatted data in the same MIMIC folder). MIMIC-III is freely available, but all users must first pass a short human subjects certification course. If you would like to run this example on your own, [follow these steps to be granted access to MIMIC-III](https://mimic.physionet.org/gettingstarted/access/) and download the data.

## Load Data

Example models in this notebook use data from all years of the MIMIC-III dataset for patients aged 65 and older. Data are imported at the encounter level with all additional patient identification dropped. All models include an "AGE" feature, simplified to 5-year bins, as well as boolean diagnosis and procedure features categorized through the Clinical Classifications Software system ([HCUP](https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp)). All features other than age are one-hot encoded and prefixed with their variable type (e.g. "GENDER_", "ETHNICITY_").  
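
The feature preparation described above can be sketched on toy data. This is only an illustration of 5-year age binning and prefixed one-hot encoding; the records, column values, and the floor-based binning rule are assumptions and may not match what `helpers.load_mimic3_example` does internally.

```python
import pandas as pd

# Hypothetical toy records; values are illustrative and not drawn from MIMIC-III
raw = pd.DataFrame({
    "AGE": [66, 71, 89, 65],
    "GENDER": ["M", "F", "M", "F"],
    "INSURANCE": ["Medicare", "Private", "Medicare", "Medicaid"],
})

# Simplify age to the lower bound of its 5-year bin (assumed rule: 66 -> 65, 71 -> 70)
raw["AGE"] = (raw["AGE"] // 5) * 5

# One-hot encode the categorical columns; each new column is prefixed
# with the name of the variable it came from (e.g. "GENDER_F")
encoded = pd.get_dummies(raw, columns=["GENDER", "INSURANCE"], prefix_sep="_")
print(encoded.columns.tolist())
```

For a binary variable such as gender, one of the two encoded columns is redundant, which is consistent with the notebook dropping `GENDER_F` after loading.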

In [2]:
# path_to_mimic_data_folder = "[path to folder containing your MIMIC-III zip files]"
path_to_mimic_data_folder = "~/data/MIMIC"

In [3]:
# Load data and subset to ages 65+
df = helpers.load_mimic3_example(path_to_mimic_data_folder) 
df.drop('GENDER_F', axis=1, inplace=True)
df = df.loc[df['AGE'].ge(65),:]
helpers.print_feature_table(df)
display(Markdown('---'))

# Generate a binary target flagging whether an observation's length_of_stay value is above or below the mean. 
mean_val=df['length_of_stay'].mean()
df['long_los'] = df['length_of_stay'].apply(lambda x: 1 if x > mean_val else 0)
los_tbl = df[['length_of_stay', 'long_los']].describe().transpose().round(4)
display(los_tbl.style.applymap(helpers.highlight_col, subset=pd.IndexSlice[:, 'mean']))


 This data subset has 22434 total observations and 648 input features 



| | Raw Feature | Category Count (Encoded Features) |
|---|---|---|
| 0 | AGE | 1 |
| 1 | DIAGNOSIS | 282 |
| 2 | ETHNICITY | 41 |
| 3 | GENDER | 1 |
| 4 | INSURANCE | 5 |
| 5 | LANGUAGE | 69 |
| 6 | MARRIED | 7 |
| 7 | PROCEDURE | 222 |
| 8 | RELIGION | 20 |


---

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| length_of_stay | 22434.0 | 9.1152 | 6.2087 | 0.0042 | 4.7352 | 7.5799 | 12.0177 | 29.9889 |
| long_los | 22434.0 | 0.388 | 0.4873 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


## Split Data and Generate Baseline

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

In [5]:
# Subset and Split Data
X = df.loc[:,[c for c in df.columns if c not in ['ADMIT_ID','length_of_stay', 'long_los']]]
y = df.loc[:, ['long_los']]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)


In [6]:
# Set model parameters explicitly (note: these values differ from the scikit-learn defaults)
rf_params = {'n_estimators': 1800, 'min_samples_split': 5, 'bootstrap': False}

# Train Model
rf_model = RandomForestClassifier(**rf_params)
rf_model.fit(X_train, y_train.iloc[:,0])
y_pred_rf = rf_model.predict(X_test)

# display performance 
print("\n", "Random Forest Prediction Scores:", "\n", 
      classification_report(y_test, y_pred_rf, target_names=['LOS <= mean', 'LOS > mean']) )


 Random Forest Prediction Scores: 
               precision    recall  f1-score   support

 LOS <= mean       0.77      0.89      0.83      4531
  LOS > mean       0.78      0.58      0.66      2873

    accuracy                           0.77      7404
   macro avg       0.77      0.74      0.74      7404
weighted avg       0.77      0.77      0.76      7404



----
# Fairness-Aware Models <a name="part2"></a>


## FairLearn Models

The [FairLearn](https://fairlearn.github.io/) package includes three [mitigation algorithms](https://fairlearn.github.io/user_guide/mitigation.html) designed to increase the fairness of an existing model relative to one of two user-specified fairness metrics. This example uses two of those algorithms along with both metrics, all of which are imported in the cell below.

For more information about the specifics of these fairness metrics, see also [Part 5 of the KDD Tutorial Notebook](kdd_fairness_in_healthcare_tutorial.ipynb#part5).
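
As a rough intuition for the two metrics, the sketch below computes a demographic parity difference (gap in selection rates between groups) and an equalized odds difference (the larger of the between-group gaps in true positive rate and false positive rate) on toy arrays. It is a plain NumPy illustration under those definitions, not the FairLearn or FairMLHealth implementation; the function names and data are hypothetical.

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Gap between the highest and lowest group selection rates."""
    rates = [np.mean(y_pred[group == g]) for g in np.unique(group)]
    return float(max(rates) - min(rates))

def equalized_odds_difference(y_true, y_pred, group):
    """Larger of the between-group gaps in TPR and FPR."""
    gaps = []
    for label in (1, 0):  # label 1 gives TPR, label 0 gives FPR
        rates = [np.mean(y_pred[(group == g) & (y_true == label)])
                 for g in np.unique(group)]
        gaps.append(max(rates) - min(rates))
    return float(max(gaps))

# Hypothetical toy data: eight observations split evenly across two groups
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])

dp_diff = demographic_parity_difference(y_pred, group)
eo_diff = equalized_odds_difference(y_true, y_pred, group)
print(dp_diff, eo_diff)
```

A value of 0 for either measure means the groups are treated identically by that criterion; the mitigation algorithms below try to keep the chosen measure small while minimizing prediction error.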

In [7]:
# Mitigation Algorithms
from fairlearn.reductions import GridSearch, ExponentiatedGradient

# Fairness Measures
from fairlearn.reductions import EqualizedOdds, DemographicParity 

### Fair ExponentiatedGradient

ExponentiatedGradient is a wrapper that runs a constrained optimization on a binary classification model using the Exponentiated Gradient approach according to the fairness metric of choice. It treats the prediction as a sequence of cost-sensitive classification problems and returns the solution with the smallest error (constrained by the metric of choice). This approach has been demonstrated to have minimal effect on model performance by some measures ([Agarwal2018](#Agarwal2018)).

Applicable to categorical sensitive attributes.

Using Equalized Odds as constraint

In [8]:
eg_rfEO_model = ExponentiatedGradient(RandomForestClassifier(**rf_params), 
                                    constraints=EqualizedOdds())  #NOTE: this may alter the model; TODO: test to determine if this is true
eg_rfEO_model.fit(X_train, y_train, sensitive_features=X_train['LANGUAGE_ENGL'])
y_pred_eg_rfEO = eg_rfEO_model.predict(X_test)

# display performance 
print("\n", "Prediction Scores:", "\n", 
      classification_report(y_test, y_pred_eg_rfEO, target_names=['LOS <= mean', 'LOS > mean']) )



 Prediction Scores: 
               precision    recall  f1-score   support

 LOS <= mean       0.77      0.89      0.83      4531
  LOS > mean       0.78      0.58      0.66      2873

    accuracy                           0.77      7404
   macro avg       0.77      0.74      0.74      7404
weighted avg       0.77      0.77      0.76      7404



Using Demographic Parity as constraint

In [None]:
eg_rfDP_model = ExponentiatedGradient(RandomForestClassifier(**rf_params), 
                                    constraints=DemographicParity())  #NOTE: this may alter the model; TODO: test to determine if this is true
eg_rfDP_model.fit(X_train, y_train, sensitive_features=X_train['LANGUAGE_ENGL'])
y_pred_eg_rfDP = eg_rfDP_model.predict(X_test)

# display performance 
print("\n", "Prediction Scores:", "\n", 
      classification_report(y_test, y_pred_eg_rfDP, target_names=['LOS <= mean', 'LOS > mean']) )


### Fair GridSearch

GridSearch is a wrapper that runs a constrained optimization on a binary classification or regression model using the Grid Search approach according to the fairness metric of choice. It treats the prediction as a sequence of cost-sensitive classification problems and returns the solution with the smallest error (constrained by the metric of choice). This approach has been demonstrated to have minimal effect on model performance by some measures ([Agarwal2018](#Agarwal2018)).

Applicable to binary sensitive attributes.

Using Equalized Odds as constraint

In [None]:
# Train GridSearch
gs_rfEO_model = GridSearch(RandomForestClassifier(**rf_params),
                           constraints=EqualizedOdds(),
                           grid_size=40)

gs_rfEO_model.fit(X_train, y_train, sensitive_features=X_train['LANGUAGE_ENGL'])
y_pred_gs_rfEO = gs_rfEO_model.predict(X_test)

# display performance 
print("\n", "Prediction Scores:", "\n", 
      classification_report(y_test, y_pred_gs_rfEO, target_names=['LOS <= mean', 'LOS > mean']) )


Using Demographic Parity as constraint

In [None]:
# Train GridSearch
gs_rfDP_model = GridSearch(RandomForestClassifier(**rf_params),
                           constraints=DemographicParity(),
                           grid_size=40)

gs_rfDP_model.fit(X_train, y_train, sensitive_features=X_train['LANGUAGE_ENGL'])
y_pred_gs_rfDP = gs_rfDP_model.predict(X_test)

# display performance 
print("\n", "Prediction Scores:", "\n", 
      classification_report(y_test, y_pred_gs_rfDP, target_names=['LOS <= mean', 'LOS > mean']) )


----
# Model Comparison <a name="part3"></a>


## Set the Required Variables  

* X (numpy array or similar pandas object): test data to be passed to the models to generate predictions. It's recommended that these be separate data from those used to train the model.

* y (numpy array or similar pandas object): target data array corresponding to X. It is recommended that the target not be present in X.

* models (list or dict-like): the set of trained models to be evaluated. Dictionary keys are used as model names. If a list-like object is passed, the function will name models by their index (i.e. "model_0", "model_1", etc.)

* protected_attr (numpy array or similar pandas object): protected attributes corresponding to X, optionally also included in X. Note that values must currently be binary- or boolean-type.
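
The requirements above can be checked before running a comparison. The helpers below are a hypothetical sketch of the naming convention and the binary-attribute requirement; `name_models` and `is_binary` are illustrative names and are not part of the FairMLHealth API.

```python
import numpy as np

def name_models(models):
    """Return a dict of models; list-like input gets index-based names."""
    if isinstance(models, dict):
        return dict(models)
    return {f"model_{i}": m for i, m in enumerate(models)}

def is_binary(values):
    """True if every value is 0/1 or False/True."""
    return set(np.unique(np.asarray(values)).tolist()) <= {0, 1}

# Placeholder strings stand in for trained model objects
named = name_models(["first_model", "second_model"])
print(list(named.keys()))

print(is_binary([True, False, True]))
print(is_binary([0, 1, 2]))
```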


In [None]:
X = X_test
y = y_test
protected_attr = X_test['LANGUAGE_ENGL']
models = {'rf_model': rf_model,
          'gs_rfEO_model': gs_rfEO_model,
          'gs_rfDP_model': gs_rfDP_model,
          'eg_rfEO_model': eg_rfEO_model,
          'eg_rfDP_model': eg_rfDP_model}
print("Models being compared in this example:", list(models.keys()))

## Comparison with the FairMLHealth Tool

The FairMLHealth model_comparison tool generates a table of fairness measures that can be used to quickly compare the fairness-performance tradeoff for a set of fairness-aware models. 
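
To show the kind of table this produces, the sketch below builds a small measures-by-models table from toy predictions. It is not the `compare_models` implementation: the `summarize` helper, the choice of measures, and the convention that `group == 1` marks the privileged group are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def summarize(y_true, y_pred, group):
    """Accuracy plus two group-fairness measures for one set of predictions.

    Assumes group == 1 marks the privileged group (illustrative convention).
    """
    unpriv, priv = (group == 0), (group == 1)

    def tpr(mask):
        # True positive rate within the masked group
        return np.mean(y_pred[mask & (y_true == 1)])

    return {
        "Accuracy": np.mean(y_pred == y_true),
        "Disparate Impact Ratio": np.mean(y_pred[unpriv]) / np.mean(y_pred[priv]),
        "Equal Opportunity Difference": tpr(unpriv) - tpr(priv),
    }

# Hypothetical toy labels, group membership, and predictions from two models
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
predictions = {
    "model_a": np.array([1, 0, 0, 0, 1, 1, 1, 0]),
    "model_b": np.array([1, 1, 0, 0, 1, 1, 0, 0]),
}

# One column per model, one row per measure
table = pd.DataFrame({name: summarize(y_true, pred, group)
                      for name, pred in predictions.items()})
print(table.round(2))
```

Reading across a row makes the tradeoff visible: a ratio near 1 and a difference near 0 indicate similar treatment of the two groups, and these can be weighed against the accuracy row.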

In [None]:
from IPython.display import HTML  # required for the HTML(...) wrapper at the end of this cell

comparison = model_comparison.compare_models(X, y, protected_attr, models)

# Highlight Groups
idx = pd.IndexSlice
equal_odds = comparison.loc[idx['** Group Fairness **',
                            ['Equal Opportunity Difference', 'Equalized Odds Difference', 'Equalized Odds Ratio']],:].index
focus_measures = comparison.loc[idx['** Group Fairness **',
                            ['Disparate Impact Ratio', 'Consistency Score']],:].index

# Note: An HTML wrapper is used here around the .render() method to enable color rendering in GitHub. These steps are not necessary to 
#    display highlights in a standard Jupyter notebook
table = model_comparison.highlight_suspicious_scores(comparison
            ).apply(lambda x: ['background-color:aquamarine' if x.name in equal_odds else '' for i in x], axis=1)
HTML(table.render())

## Comparison with the FairLearn Dashboard

FairLearn comes with its own model comparison dashboard to allow visual comparison between models.

In [None]:
from fairlearn.widget import FairlearnDashboard

# Note: for classification models, arrays must be passed as a list
FairlearnDashboard(sensitive_features=protected_attr.to_list(), 
                   sensitive_feature_names=['LANGUAGE_ENGL'],
                   y_true=y.iloc[:,0].to_list(),
                   y_pred={k:model.predict(X) for k,model in models.items()})

# References

<a name="Agarwal2018"></a>
Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., & Wallach, H. (2018). A reductions approach to fair classification. [arXiv preprint arXiv:1803.02453](https://arxiv.org/pdf/1803.02453.pdf).