----
# Tutorial for Evaluating Fairness in Binary Classification

----
## Evaluating Your Own Model? Looking for a Template? 
**Please note: This notebook is intended as an introduction to and reference for evaluating the fairness of a set of machine learning models using FairMLHealth, and it is not designed to work with outside models.**  

If you are looking for a template into which you can insert existing models, we recommend our [Template-BinaryClassificationAssessment](../templates/Template-BinaryClassificationAssessment.ipynb). A worked example of that template is also available: [Example-BinaryClassificationTemplate](Example-BinaryClassificationTemplate.ipynb).


## Overview

All machine learning models can be assumed to hold biases, just as all humans hold biases and as all humans fall ill at some point in their lives. The motivation that drives us to study and prevent the harm caused by human illness drives us to prevent the harm caused by innate biases. That means building models that provide fair representation for all demographics. This starts with measurement and evaluation.

This notebook introduces concepts, methods, and libraries for measuring fairness in machine learning (ML) models as it relates to problems in healthcare. This is a revamped version of the tutorial presented at the [KDD 2020 Tutorial on Fairness in Machine Learning for Healthcare](../docs/publications/KDD2020-FairnessInHealthcareML-Slides.pptx), the notebook for which can be found here: [/docs/publications/KDD2020-FairnessInHealthcareML-TutorialNotebook.ipynb](../docs/publications/KDD2020-FairnessInHealthcareML-TutorialNotebook.ipynb).

[Part 1](#part1) will frame the problem and introduce our hypothetical example predicting Length of Stay (LOS) based on a subset of the [MIMIC-III clinical database](https://mimic.physionet.org/gettingstarted/access/). In [Part 2](#part2) this baseline model is used as an example to evaluate and discuss common metrics like *Disparate Impact*, *Equalized Odds*, and *Consistency Scores*. Then, [Part 3](#part3) compares the results of the baseline model to results for other modeling approaches: an "unaware" version, a "fairness-aware" version, and a third version that simply uses a different standard machine learning algorithm.

Many other publications cover the theoretical basis for fairness metrics, and many resources, both online and academic, cover the details of specific fairness measures (see [References (bottom)](#references) and [Additional Resources (bottom)](#additional_resources), or [Our Resources Page](../docs/Measures_QuickReference.md) for just a few). Many of these otherwise excellent references stop short of discussing edge cases and the practical and philosophical considerations raised when evaluating real models for real customers. Here we attempt to bridge that gap.



## Table of Contents
[Part 1](#part1) - Framing the Problem
  
[Part 2](#part2) - Evaluating the Baseline Model

[Part 3](#part3) - Comparing Models 
  
[References](#references)
  


## Requirements

To run this notebook, please install FairMLHealth using [the instructions posted in GitHub](https://github.com/KenSciResearch/fairMLHealth#installation_instructions). Some components of this notebook additionally require the [Fairlearn](https://github.com/fairlearn/fairlearn) package.

The tutorial uses data from the MIMIC-III Critical Care database, a freely accessible source of electronic health records from Beth Israel Deaconess Medical Center in Boston. To download the MIMIC-III data, please use this link: [Access to MIMIC III](https://mimic.physionet.org/gettingstarted/access/) and save the data with the default directory name ("MIMIC"). No further action is required beyond remembering the download location, and you do not need to unzip any files.

A basic knowledge of ML implementation in Python is assumed. 


----
---- 
# Part 1 - Framing the Problem <a class = "anchor" id = "part1"></a>

In [1]:
# Standard Libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Load Prediction Libraries
import sklearn.metrics as sk_metric
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Remove limit to the number of columns and column widths displayed by pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 0) # adjust column width as needed



### Below are helper functions that make the tutorial easier to read

In [2]:
# Helpers from local folder
from fairmlhealth.__mimic_data import load_mimic3_example
from fairmlhealth.report import compare
from fairmlhealth import measure, stat_utils


# Verify that all required packages are present
from fairmlhealth.__validation import validate_notebook_requirements
validate_notebook_requirements()

# Functions and pointers to make this tutorial more colorful
ks_magenta = '#d00095'
ks_magenta_lt = '#ff05b8'
ks_purple = '#947fed'



### Loading the MIMIC III Data Subset <a id="datasubset"></a>

As mentioned above, the MIMIC-III data download contains a folder of zipped files. The code in this section will automatically unzip and format all data needed for these experiments, saving the formatted data in the same MIMIC folder. To enable this feature, enter the correct path to the MIMIC folder in the following cell. Your path should end with the directory "MIMIC".

Example: path_to_mimic_data_folder = "~/data/MIMIC"

In [None]:
# path_to_mimic_data_folder = "[path to your downloaded data folder]"
path_to_mimic_data_folder = "~/data/MIMIC"
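
Before loading, it can help to confirm that the path points where you expect. The sketch below is not part of FairMLHealth: `check_mimic_path` is a hypothetical helper that expands `~` and checks only that the final directory is named "MIMIC", as described above.

```python
from pathlib import Path


def check_mimic_path(path_str):
    """Expand '~' and confirm the path ends with a directory named 'MIMIC'.

    Hypothetical helper for illustration; not part of FairMLHealth.
    """
    path = Path(path_str).expanduser()
    if path.name != "MIMIC":
        raise ValueError(f"Expected a path ending in 'MIMIC', got: {path}")
    return path


mimic_path = check_mimic_path("~/data/MIMIC")
print(mimic_path.name)      # MIMIC
print(mimic_path.exists())  # True only if the data was downloaded to this location
```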

Example models in this notebook use data from all years of the MIMIC-III dataset for patients aged 65 and older. Data are imported at the encounter level with all additional patient identification dropped. All models include an "AGE" feature, simplified to 5-year bins, as well as Boolean diagnosis and procedure features categorized through the Clinical Classifications Software system ([HCUP](https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp)). All features other than age are one-hot encoded and prefixed with their variable type (e.g. "GENDER_", "ETHNICITY_").  
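
To make the feature layout concrete, here is a minimal sketch of the kind of encoding described above, applied to invented toy data. It is not the code used by `load_mimic3_example`. The raw column names and the choice to represent each 5-year bin by its lower bound are assumptions made for illustration.

```python
import pandas as pd

# Toy encounter-level data; values are invented for illustration only
raw = pd.DataFrame({
    "AGE": [66, 71, 79, 88],
    "GENDER": ["F", "M", "F", "M"],
    "LANGUAGE": ["ENGL", "SPAN", "ENGL", "ENGL"],
})

# Simplify age to the lower bound of its 5-year bin (66 -> 65, 71 -> 70, ...)
# Assumption: the real data may label its bins differently
encoded = pd.DataFrame({"AGE": (raw["AGE"] // 5) * 5})

# One-hot encode the categorical columns; get_dummies prefixes each new
# column with the variable name, e.g. "GENDER_F", "LANGUAGE_ENGL"
dummies = pd.get_dummies(raw[["GENDER", "LANGUAGE"]], prefix_sep="_", dtype=int)
encoded = pd.concat([encoded, dummies], axis=1)

print(encoded.columns.tolist())
```

With two gender values, `GENDER_F` and `GENDER_M` carry the same information, which is why one of them is dropped in the next cell.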

In [None]:
# Load data and keep a 10K observation subset to speed processing
df = load_mimic3_example(path_to_mimic_data_folder) 
df = df.sample(n=10000, random_state=42).reset_index(drop=True)

# Subset to ages 65+ (copy so later changes do not act on a view of the sample)
df = df.loc[df['AGE'].ge(65), :].copy()
df = df.drop(columns='GENDER_F') # Redundant with GENDER_M

# Display insights
display(stat_utils.feature_table(df))
print("\n\n", "Below is a scrollable version of the first five rows of data:")
display(df.head())

### Generating the Model <a id="generatemodel"></a>


In [None]:
# Display LOS Distributions, one density curve per language group
ax = df.groupby('LANGUAGE_ENGL')['length_of_stay'
        ].plot(kind='kde', legend=True,
               title="Distribution of Length of Stay (in Days) Relative to English Language Status")
plt.xlabel("Length of Stay (Days)")
plt.show()

In [None]:
# Subset and Split Data
baseline_cols = [c for c in df.columns if (c.startswith('AGE') or c.startswith('DIAGNOSIS_') or c.startswith('PROCEDURE_') or c == "LANGUAGE_ENGL")]
X = df.loc[:, baseline_cols]
y = df['length_of_stay'].clip(0, 25)

# Hold out one third of the observations as a test set
splits = train_test_split(X, y, test_size=0.33, random_state=42)
X_train, X_test, y_train, y_test = splits

In [None]:
# Train New Model with Language Feature
baseline_model = LinearRegression()
baseline_model.fit(X_train, y_train)
y_pred = baseline_model.predict(X_test)


### FairMLHealth Stratified Data Table

FairMLHealth includes stratified table features to aid in identifying the source of unfairness or other bias. The data table contains basic statistics specific to each feature-value, in addition to relative statistics for the target value. Since the table can be used to evaluate many features at once, it can be a useful option for identifying patterns of bias, either alone or in concert with other methods (e.g., visual ones).
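
As a rough illustration of what "stratified" means here, the sketch below builds a small per-value summary with plain pandas on invented data. It is not FairMLHealth's implementation, and the helper name `stratified_summary` and its output columns are hypothetical. The library's own table reports its own set of statistics.

```python
import pandas as pd

# Toy test-set data; values are invented for illustration only
X_toy = pd.DataFrame({"LANGUAGE_ENGL": [1, 1, 1, 0, 0, 1]})
y_toy = pd.Series([3.0, 5.0, 4.0, 9.0, 7.0, 4.0], name="length_of_stay")


def stratified_summary(X, y, feature):
    """Count observations and summarize the target for each value of `feature`."""
    summary = y.groupby(X[feature]).agg(
        obs="count", target_mean="mean", target_median="median"
    )
    # Share of all observations that fall in each feature-value
    summary["obs_share"] = summary["obs"] / len(y)
    return summary


table = stratified_summary(X_toy, y_toy, "LANGUAGE_ENGL")
print(table)
```

Even in this toy case the summary shows two things worth checking in real data: how unevenly the groups are represented, and whether the target itself differs between groups before any model is involved.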

In [None]:
print("FairMLHealth Stratified Data Table")
measure.data(X_test, y_test, features=['LANGUAGE_ENGL'])


----
----
# Part 2 - Evaluating the Baseline Model <a id="part2"></a>
In this section we will evaluate an array of fairness measures for a single model. For more information on how to interpret these results, see [Evaluating Fairness](../docs/Evaluating_Fairness.md) in our documentation. Skip ahead to [Part 3](#part3) for an example comparing multiple models against each other.

Our experiment tests sociodemographic bias as it relates to language. Here we suppose that individuals who speak English may be given preferential treatment in an English-speaking society due to the requirement of using a translator. Language may also be used as a proxy for race, religion, or nationality, which are [explicitly protected attributes](https://www.eeoc.gov/employers/small-business/3-who-protected-employment-discrimination).


In [None]:
lang_values = X_test['LANGUAGE_ENGL']

In [None]:
print("FairMLHealth Fairness Measure Report: English Language")
fairness_report = compare(X_test, y_test, lang_values, baseline_model, pred_type="regression")

### FairMLHealth Stratified Performance Table

The stratified performance table contains model performance measures specific to each feature-value subset. In this example we can see that the Equalized Odds ratio is out of range because of the significant difference in False Positive Rates (FPR) between the two classes. More about this difference in [Comparing Group Fairness Measures](#comparing_group_measures) below.

Note that if prediction probabilities are available from the model (via its *predict_proba()* method), additional ROC_AUC and PR_AUC values will be included in the table.
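
For a regression model such as the baseline above, per-group performance can be sketched with plain pandas and NumPy as follows. This is an illustration on invented data, not the code behind `measure.performance`. The helper name `group_performance` and the choice of measures are assumptions.

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration only
toy = pd.DataFrame({
    "LANGUAGE_ENGL": [1, 1, 1, 0, 0, 0],
    "y_true": [4.0, 6.0, 5.0, 8.0, 10.0, 9.0],
    "y_pred": [5.0, 6.0, 4.0, 5.0, 8.0, 7.0],
})
toy["error"] = toy["y_pred"] - toy["y_true"]


def group_performance(frame):
    """Return regression performance measures for one subset of rows."""
    return pd.Series({
        "obs": len(frame),
        "mean_prediction": frame["y_pred"].mean(),
        "mae": frame["error"].abs().mean(),
        "rmse": np.sqrt((frame["error"] ** 2).mean()),
    })


# One row of measures per feature-value
rows = {value: group_performance(subset)
        for value, subset in toy.groupby("LANGUAGE_ENGL")}
performance = pd.DataFrame(rows).T
performance.index.name = "LANGUAGE_ENGL"
print(performance)
```

In this toy data the model's error is several times larger for the `LANGUAGE_ENGL == 0` group, the kind of gap that an overall performance score would hide.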

In [None]:
print("FairMLHealth Stratified Performance Table")
measure.performance(X_test, y_test, y_pred, features=['LANGUAGE_ENGL'], pred_type="regression")

### FairMLHealth Stratified Fairness Table

The stratified fairness table contains prediction bias measures specific to each feature-value subset. It treats each feature-value in turn as the "privileged" group relative to all other possible values for the feature. To simplify the table, fairness measures have been reduced to their component parts. For example, measures of Equalized Odds can be determined by combining the True Positive Rate (TPR) Ratios & Differences with False Positive Rate (FPR) Ratios & Differences.
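
To show how those components fit together, here is a minimal sketch for a binary classifier on invented data. It is not FairMLHealth's implementation. Two choices are assumptions: comparisons are written as unprivileged relative to privileged, and the Equalized Odds difference is summarized as the larger of the two absolute rate differences, which is one common convention.

```python
import numpy as np

# Toy binary labels and predictions, invented for illustration only
y_true = np.array([1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0])
privileged = np.array([True] * 6 + [False] * 6)


def rates(y_t, y_p):
    """Return (TPR, FPR) for binary labels and predictions."""
    tpr = np.mean(y_p[y_t == 1] == 1)  # share of actual positives predicted positive
    fpr = np.mean(y_p[y_t == 0] == 1)  # share of actual negatives predicted positive
    return tpr, fpr


tpr_priv, fpr_priv = rates(y_true[privileged], y_pred[privileged])
tpr_unpriv, fpr_unpriv = rates(y_true[~privileged], y_pred[~privileged])

# Component measures (assumed direction: unprivileged relative to privileged)
tpr_diff = tpr_unpriv - tpr_priv
fpr_diff = fpr_unpriv - fpr_priv
tpr_ratio = tpr_unpriv / tpr_priv
fpr_ratio = fpr_unpriv / fpr_priv

# One common summary: the larger of the two absolute differences
equalized_odds_diff = max(abs(tpr_diff), abs(fpr_diff))

print(tpr_diff, fpr_diff, tpr_ratio, fpr_ratio, equalized_odds_diff)
```

Equalized Odds is satisfied when both rates match across groups, that is, when both differences are 0 and both ratios are 1. Reporting the components separately shows which rate is responsible when the combined measure is out of range.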

In [None]:
print("FairMLHealth Stratified Fairness Table")
measure.bias(X_test, y_test, y_pred, features=['LANGUAGE_ENGL'], pred_type="regression")

In [None]:
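# Split the target into three quantile-based cohorts (labeled 0, 1, 2)
quantiles = pd.qcut(y_test, 3, labels=False)

# Report bias measures separately within each cohort
measure.bias(X_test, y_test, y_pred, features=['LANGUAGE_ENGL'],
             pred_type="regression", cohorts=quantiles)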
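
The idea behind cohorts can be sketched with plain pandas: assign each observation to a quantile bin of the target, then repeat the group comparison within each bin. The example below uses invented data and a simple difference in mean predictions. It illustrates the concept only and does not reproduce the measures that `measure.bias` reports.

```python
import pandas as pd

# Toy target values, predictions, and group membership (invented for illustration)
y_toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
pred_toy = pd.Series([2.0, 2.0, 2.0, 5.0, 4.0, 5.0, 6.0, 9.0, 8.0])
group = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1], name="LANGUAGE_ENGL")

# labels=False returns integer bin codes (0, 1, 2) rather than interval labels
cohort = pd.qcut(y_toy, 3, labels=False).rename("cohort")

# Mean prediction per cohort and group, then the gap between groups
mean_pred = pred_toy.groupby([cohort, group]).mean().unstack("LANGUAGE_ENGL")
mean_pred["difference"] = mean_pred[0] - mean_pred[1]
print(mean_pred)
```

Here the gap between groups grows from the shortest-stay cohort to the longest, a pattern that a single whole-sample comparison would average away.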