# Models - Binary Classification Fairness Assessment Template - Example

## About
This example is intended as a simple illustration for the use of the [Binary Classification Fairness Assessment Template](../templates/Template-BinaryClassificationAssessment.ipynb). It compares a Random Forest Classifier against fairness-aware alternative versions of that same classifier. For more information about the specific measures used, please see the [Measuring Fairness in Binary Classification Tutorial](../tutorials_and_examples/Tutorial-MeasuringFairnessInBinaryClassification.ipynb).

In the interest of simplicity, only two fairness-aware algorithms are compared in this notebook. However, several other fairness-aware models were tested during development. For a peek at that process, see [Supplemental - Models for Binary Classification Example](../tutorials_and_examples/Supplemental-ModelsForBinaryClassificationExample.ipynb).

## Example Contents

[Part 1](#part1) - Data Loading and Model Setup

[Part 2](#part2) - Fairness-Aware Models

[Part 3](#part3) - Model Comparison


In [1]:
from IPython.display import Markdown
from fairmlhealth import reports, tutorial_helpers as helpers, model_comparison as fhmc
import numpy as np
import pandas as pd

# Pointers to make this example more colorful
ks_magenta = '#d00095'
ks_magenta_lt = '#ff05b8'
ks_purple = '#947fed'

----
# Load Data and Generate Baseline Model <a name="part1"></a>

## MIMIC-III

This example uses a data subset from the [MIMIC-III clinical database](https://mimic.physionet.org/gettingstarted/access/) to predict length of stay (LOS). Here, LOS is the total ICU time for a given hospital admission among patients aged 65 and above. The raw LOS value is converted to a binary target indicating whether an admission's length of stay is greater than the sample mean.

Note that the code below will automatically unzip and format all data needed for these experiments from a raw download of MIMIC-III (saving the formatted data in the same MIMIC folder). MIMIC-III is a freely available database; however, all users must pass a brief human subjects certification course. If you would like to run this example on your own, [follow these steps to be granted access to MIMIC-III](https://mimic.physionet.org/gettingstarted/access/) and download the data.



## Data Subset

Data are imported at the encounter level with all additional patient identification dropped. Boolean diagnosis and procedure features are categorized through the Clinical Classifications Software system ([HCUP](https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp)). All features other than age are one-hot encoded and prefixed with their variable type (e.g. "GENDER_", "ETHNICITY_").  
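The encoding convention described above can be sketched with pandas. This is only a minimal illustration on a made-up toy frame; it is not the logic inside `helpers.load_mimic3_example`, and the values below are invented for the example.

```python
import pandas as pd

# Toy encounter-level data (hypothetical values, not taken from MIMIC-III)
toy = pd.DataFrame({
    "AGE": [67, 81, 74],
    "GENDER": ["F", "M", "F"],
    "ETHNICITY": ["WHITE", "ASIAN", "WHITE"],
})

# One-hot encode every column except AGE. By default pandas prefixes each
# new column with the source variable name, e.g. "GENDER_F", "ETHNICITY_WHITE".
encoded = pd.get_dummies(toy, columns=["GENDER", "ETHNICITY"], dtype=int)
print(encoded.columns.tolist())
```

For a two-level variable such as gender, one of the two encoded columns carries no extra information, which is why `GENDER_F` is dropped when the data are loaded in this notebook.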


In [2]:
# path_to_mimic_data_folder = "[path to folder containing your MIMIC-III zip files]"
path_to_mimic_data_folder = "~/data/MIMIC"

In [3]:
# Load data and subset to ages 65+
df = helpers.load_mimic3_example(path_to_mimic_data_folder) 
df = df.loc[df['AGE'].ge(65), :]
df.drop('GENDER_F', axis=1, inplace=True) # Redundant with GENDER_M

# Show variable count
helpers.print_feature_table(df)
display(Markdown('---'))

# Generate a binary target flagging whether an observation's length_of_stay value is above or below the mean. 
mean_val = df['length_of_stay'].mean()
df['long_los'] = df['length_of_stay'].apply(lambda x: 1 if x > mean_val else 0)
los_tbl = df[['length_of_stay', 'long_los']].describe().transpose().round(4)
tbl_style = los_tbl.style.applymap(helpers.highlight_col, 
                                    subset=pd.IndexSlice[:, 'mean'],
                                    color=ks_magenta_lt
                                  )
display(tbl_style)


 This data subset has 22434 total observations and 648 input features 



| | Raw Feature | Category Count (Encoded Features) |
|---|---|---|
| 0 | AGE | 1 |
| 1 | DIAGNOSIS | 282 |
| 2 | ETHNICITY | 41 |
| 3 | GENDER | 1 |
| 4 | INSURANCE | 5 |
| 5 | LANGUAGE | 69 |
| 6 | MARRIED | 7 |
| 7 | PROCEDURE | 222 |
| 8 | RELIGION | 20 |


---

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| length_of_stay | 22434.0 | 9.1152 | 6.2087 | 0.0042 | 4.7352 | 7.5799 | 12.0177 | 29.9889 |
| long_los | 22434.0 | 0.388 | 0.4873 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


## Split Data

In [4]:
from sklearn.model_selection import train_test_split

# Subset and Split Data
X = df.loc[:, [c for c in df.columns 
                if c not in ['ADMIT_ID','length_of_stay', 'long_los']]]
y = df.loc[:, ['long_los']]
splits = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)
X_train, X_test, y_train, y_test = splits

## Generate Baseline

A Scikit-Learn Random Forest Classifier serves as our basis for comparison. Parameters were tuned using Scikit-Learn's GridSearch in the [Supplemental - Models for Binary Classification Example](../tutorials_and_examples/Supplemental-ModelsForBinaryClassificationExample.ipynb).
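As a rough idea of what that tuning step might look like, here is a minimal sketch using scikit-learn's `GridSearchCV` on synthetic data. The dataset, the search grid, and the scoring choice are all assumptions made for illustration; the supplemental notebook holds the actual search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset standing in for the MIMIC-III features (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical search grid, kept tiny so the sketch runs quickly
param_grid = {"n_estimators": [10, 20], "min_samples_split": [2, 5]}

# Exhaustively evaluate every parameter combination with 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="f1")
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Note that scikit-learn's `GridSearchCV` tunes hyperparameters for predictive performance and is unrelated to FairLearn's `GridSearch` mitigation algorithm used later in this notebook, despite the similar name.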

In [5]:
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

In [6]:
# Set model parameters (tuned via grid search in the supplemental notebook)
rf_params = {'n_estimators': 1800, 'min_samples_split': 5, 'bootstrap': False}

# Train Model
rf_model = RandomForestClassifier(**rf_params)
rf_model.fit(X_train, y_train.iloc[:, 0])
y_pred_rf = rf_model.predict(X_test)

# display performance 
print("\n", "Random Forest Prediction Scores:", "\n", 
      classification_report(y_test, y_pred_rf, 
                            target_names=['LOS <= mean', 'LOS > mean']) )


 Random Forest Prediction Scores: 
               precision    recall  f1-score   support

 LOS <= mean       0.77      0.90      0.83      4531
  LOS > mean       0.78      0.58      0.66      2873

    accuracy                           0.77      7404
   macro avg       0.77      0.74      0.75      7404
weighted avg       0.77      0.77      0.76      7404



----
# Fairness-Aware Models <a name="part2"></a>


## FairLearn Models

The [FairLearn](https://fairlearn.github.io/) package includes [mitigation algorithms](https://fairlearn.github.io/user_guide/mitigation.html) designed to increase the fairness of an existing model relative to a user-specified fairness metric. This example uses two of those algorithms and two fairness metrics, all of which are imported in the cell below.

For more information about the specifics of these fairness metrics, see also [Part 5 of the Measuring Fairness in Binary Classification Tutorial](../tutorials_and_examples/Tutorial-MeasuringFairnessInBinaryClassification.ipynb#part5).
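To make the two metrics concrete, the sketch below computes them by hand with NumPy on a tiny invented dataset. It assumes the common definitions: the demographic parity difference is the gap in selection rates between groups, and the equalized odds difference is the larger of the gaps in true positive rate and false positive rate. All arrays and helper names here are hypothetical.

```python
import numpy as np

# Invented labels, predictions, and a binary sensitive attribute
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])


def conditional_rate(pred, true, label):
    """Share of positive predictions among observations whose true label is `label`."""
    return pred[true == label].mean()


gaps = {}
for name, label in [("tpr", 1), ("fpr", 0)]:
    rate_0 = conditional_rate(y_pred[group == 0], y_true[group == 0], label)
    rate_1 = conditional_rate(y_pred[group == 1], y_true[group == 1], label)
    gaps[name] = abs(rate_0 - rate_1)

# Demographic parity compares how often each group receives a positive prediction
dp_difference = abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

# Equalized odds compares error behavior: the worse of the TPR and FPR gaps
eo_difference = max(gaps["tpr"], gaps["fpr"])

print(dp_difference, eo_difference)
```

A value of 0 for either measure would indicate no measured disparity between the two groups on that criterion.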

In [7]:
# Mitigation Algorithms
from fairlearn.reductions import GridSearch, ExponentiatedGradient

# Fairness Measures
from fairlearn.reductions import EqualizedOdds, DemographicParity 

### Fair ExponentiatedGradient

FairLearn's ExponentiatedGradient is a wrapper that runs a constrained optimization using the Exponentiated Gradient approach on a binary classification model. It treats the prediction as a sequence of cost-sensitive classification problems, returning the solution with the smallest error (constrained by the metric of choice). This approach has been demonstrated to have minimal effect on model performance by some measures [[Agarwal2018]](#Agarwal2018).

This approach is applicable to sensitive attributes that are either categorical or binary/boolean. It can be used for classification problems only.

Note: solutions are not guaranteed for this approach.


In [8]:
# Set seed for consistent results with FairLearn's ExponentiatedGradient
np.random.seed(36)  

#### Fair ExponentiatedGradient Using Demographic Parity as Constraint

In [9]:
eg_rfDP_model = ExponentiatedGradient(RandomForestClassifier(**rf_params), 
                                      constraints=DemographicParity()) 
eg_rfDP_model.fit(X_train, y_train,
                  sensitive_features=X_train['LANGUAGE_ENGL'])
y_pred_eg_rfDP = eg_rfDP_model.predict(X_test)

# display performance 
print("\n", "Prediction Scores:", "\n", 
      classification_report(y_test, y_pred_eg_rfDP, 
                            target_names=['LOS <= mean', 'LOS > mean']))



 Prediction Scores: 
               precision    recall  f1-score   support

 LOS <= mean       0.74      0.89      0.81      4531
  LOS > mean       0.74      0.52      0.61      2873

    accuracy                           0.74      7404
   macro avg       0.74      0.70      0.71      7404
weighted avg       0.74      0.74      0.73      7404



#### Fair ExponentiatedGradient Using Equalized Odds as Constraint

In [10]:
eg_rfEO_model = ExponentiatedGradient(RandomForestClassifier(**rf_params), 
                                      constraints=EqualizedOdds())  
eg_rfEO_model.fit(X_train, y_train, 
                  sensitive_features=X_train['LANGUAGE_ENGL'])
y_pred_eg_rfEO = eg_rfEO_model.predict(X_test)

# display performance 
print("\n", "Prediction Scores:", "\n", 
      classification_report(y_test, y_pred_eg_rfEO, 
                            target_names=['LOS <= mean', 'LOS > mean']))



 Prediction Scores: 
               precision    recall  f1-score   support

 LOS <= mean       0.77      0.89      0.83      4531
  LOS > mean       0.77      0.58      0.66      2873

    accuracy                           0.77      7404
   macro avg       0.77      0.74      0.74      7404
weighted avg       0.77      0.77      0.76      7404



### Fair GridSearch

FairLearn's GridSearch is a wrapper that runs a constrained optimization using the Grid Search approach on a binary classification or a regression model. It treats the prediction as a sequence of cost-sensitive classification problems, returning the solution with the smallest error (constrained by the metric of choice). This approach has been demonstrated to have minimal effect on model performance by some measures [[Agarwal2018]](#Agarwal2018).

This approach is applicable to sensitive attributes that are binary/boolean only. It can be used for either binary classification or regression problems.
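Because of the binary-only restriction, a multi-category attribute has to be reduced to two groups before it can be used here. The sketch below shows one possible way to do that with pandas; the language codes are invented for illustration, and the choice of reference category is an assumption that should be made deliberately in practice.

```python
import pandas as pd

# Hypothetical multi-category sensitive attribute
language = pd.Series(["ENGL", "SPAN", "ENGL", "RUSS", "PTUN"], name="LANGUAGE")

# Collapse to a binary flag: 1 for the reference category, 0 for all others
language_engl = language.eq("ENGL").astype(int).rename("LANGUAGE_ENGL")
print(language_engl.tolist())
```

This mirrors the shape of the one-hot encoded `LANGUAGE_ENGL` column that this notebook passes as `sensitive_features`.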


#### Fair GridSearch Using Equalized Odds as Constraint

In [11]:
# Train GridSearch
gs_rfEO_model = GridSearch(RandomForestClassifier(**rf_params),
                           constraints=EqualizedOdds(),
                           grid_size=45)

gs_rfEO_model.fit(X_train, y_train, 
                  sensitive_features=X_train['LANGUAGE_ENGL'])
y_pred_gs_rfEO = gs_rfEO_model.predict(X_test)

# display performance 
print("\n", "Prediction Scores:", "\n", 
      classification_report(y_test, y_pred_gs_rfEO, 
                            target_names=['LOS <= mean', 'LOS > mean']))



 Prediction Scores: 
               precision    recall  f1-score   support

 LOS <= mean       0.77      0.89      0.83      4531
  LOS > mean       0.77      0.58      0.66      2873

    accuracy                           0.77      7404
   macro avg       0.77      0.74      0.74      7404
weighted avg       0.77      0.77      0.76      7404



#### Fair GridSearch Using Demographic Parity as Constraint

In [12]:
# Train GridSearch
gs_rfDP_model = GridSearch(RandomForestClassifier(**rf_params),
                           constraints=DemographicParity(),
                           grid_size=45)

gs_rfDP_model.fit(X_train, y_train, 
                  sensitive_features=X_train['LANGUAGE_ENGL'])
y_pred_gs_rfDP = gs_rfDP_model.predict(X_test)

# display performance 
print("\n", "Prediction Scores:", "\n", 
      classification_report(y_test, y_pred_gs_rfDP, 
                            target_names=['LOS <= mean', 'LOS > mean']))



 Prediction Scores: 
               precision    recall  f1-score   support

 LOS <= mean       0.75      0.89      0.81      4531
  LOS > mean       0.74      0.53      0.62      2873

    accuracy                           0.75      7404
   macro avg       0.75      0.71      0.71      7404
weighted avg       0.75      0.75      0.73      7404



----
# Model Comparison <a name="part3"></a>


## Set the Required Variables  

* X (numpy array or similar pandas object): test data to be passed to the models to generate predictions. It's recommended that these be separate data from those used to train the model.

* y (numpy array or similar pandas object): target data array corresponding to X. It is recommended that the target not be included in X.

* models (list or dict-like): the set of trained models to be evaluated. Note that the dictionary keys are assumed as model names. If a list-like object is passed, the function will set model names relative to their index (i.e. "model_0", "model_1", etc.)

* protected_attr (numpy array or similar pandas object): protected attributes corresponding to X, optionally also included in X. Note that values must currently be binary- or boolean-type.
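The naming rule for `models` described above can be sketched as follows. The helper `name_models` is hypothetical and written only to illustrate the convention; it is not part of the FairMLHealth API.

```python
def name_models(models):
    """Return a {name: model} dict (hypothetical helper, for illustration only)."""
    if hasattr(models, "items"):
        # Dict-like input: keys are taken as the model names
        return dict(models.items())
    # List-like input: names are derived from each model's position
    return {f"model_{i}": model for i, model in enumerate(models)}


# Strings stand in for fitted models in this toy usage
print(name_models(["first", "second"]))
print(name_models({"rf_model": "first"}))
```

Passing a dictionary is generally the clearer option, since the chosen names carry through to the comparison output.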


In [16]:
X = X_test
y = y_test
protected_attr = X_test['LANGUAGE_ENGL']
models = {'rf_model': rf_model,
         'gs_rfEO_model': gs_rfEO_model, 'gs_rfDP_model': gs_rfDP_model,
         'eg_rfEO_model': eg_rfEO_model, 'eg_rfDP_model': eg_rfDP_model}
print("Models being compared in this example:", list(models.keys()))


Models being compared in this example: ['rf_model', 'gs_rfEO_model', 'gs_rfDP_model', 'eg_rfEO_model', 'eg_rfDP_model']


## Comparison with the FairMLHealth Tool

The FairMLHealth model comparison tool generates a table of fairness measures that can be used to quickly compare the fairness-performance tradeoff for a set of fairness-aware models. 
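To illustrate the kind of table meant here, the sketch below builds a small comparison by hand with pandas, placing one performance measure next to one fairness measure for each model. The predictions are invented, and the layout is only an assumption about what such a table could look like; it is not the FairMLHealth tool's actual output.

```python
import numpy as np
import pandas as pd

# Invented labels, binary protected attribute, and predictions from two models
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
predictions = {
    "baseline": np.array([1, 0, 1, 1, 0, 0, 1, 0]),
    "mitigated": np.array([1, 0, 1, 0, 1, 0, 0, 0]),
}

rows = {}
for name, y_pred in predictions.items():
    rows[name] = {
        # Performance: share of correct predictions
        "accuracy": (y_pred == y_true).mean(),
        # Fairness: gap in positive prediction rates between the two groups
        "selection_rate_gap": abs(y_pred[group == 0].mean()
                                  - y_pred[group == 1].mean()),
    }

# One column per model, one row per measure
comparison = pd.DataFrame(rows)
print(comparison)
```

Reading across a row shows how a single measure changes between models, which is the tradeoff view the comparison is meant to support.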

Note that some additional formatting is included in the cell below, simply to add highlighting for this example.

In [14]:
import os, joblib

output_file = os.path.expanduser("~/data/fairMLHealth/fairml_binaryExampleModels")

model_file = output_file + "_model.joblib"
data_file = output_file + "_data.joblib"


data = {'X':X, 'y':y, 'protected_attr':protected_attr}
joblib.dump(data, data_file, compress=3)



['/Users/christineallen/data/fairMLHealth/fairml_binaryExampleModels_data.joblib']

In [None]:
import logging
logging.basicConfig()


class ValidationError(Exception):
    pass


# Refuse to overwrite an existing model file
if os.path.exists(model_file):
    raise ValidationError(f"File already exists: {model_file}.")

print("saving models...")
with open(model_file, 'wb') as file:
    for name, model in models.items():
        # Tag each model with its name so that it can be recovered on load
        setattr(model, 'fmlh_model_name', name)
        print("\t", name)
        try:
            joblib.dump(model, file, compress=3)
        except Exception as e:
            # Some models cannot be pickled (e.g. those holding local functions)
            print(f"Cannot save model {name}.", str(e))


saving models...
	 rf_model
	 gs_rfEO_model
	 gs_rfDP_model
	 eg_rfEO_model
Cannot save model eg_rfEO_model. Can't pickle local object '_Lagrangian.best_h.<locals>.h'
	 eg_rfDP_model


In [None]:
with open(model_file, "rb") as f:
    test = joblib.load(f)

In [None]:
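load_models = {}
with open(model_file, "rb") as f:
    # Models were dumped one after another, so read until the file is exhausted
    while True:
        try:
            new_model = joblib.load(f)
        except EOFError:
            break
        # Use the name attribute that was attached when the model was saved
        load_models[new_model.fmlh_model_name] = new_model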

# References

<a name="Agarwal2018"></a>
Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., & Wallach, H. (2018). A reductions approach to fair classification. [arXiv preprint arXiv:1803.02453](https://arxiv.org/pdf/1803.02453.pdf).