### About
This notebook contains simple toy examples to help you get started with FairMLHealth. The same content is mirrored in the repository's main [README](../../../README.md).

### Example Setup

In [1]:
from fairmlhealth import model_comparison as fhmc, reports


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, TweedieRegressor
np.random.seed(547)

# Load data
X = pd.DataFrame({'col1': np.random.randint(1, 50, 16), 
                  'col2': np.random.randint(1, 50, 16),
                  'col3': np.random.randint(1, 50, 16),
                  'gender': [0, 1]*8, 
                  'ethnicity': [1, 1, 0, 0]*4,
                  'other': [1, 0, 0, 1]*4
                 })

y = pd.Series(np.random.uniform(0, 8, 16), index=X.index, name="y")
# Split into training and test sets (75% of the 16 rows are held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.75, random_state=36)

# Train models
model_1 = LinearRegression().fit(X_train, y_train)
model_2 = TweedieRegressor().fit(X_train, y_train)

# Determine your set of protected attributes
prtc_attr = X_test['gender']

# Specify either a dict or a list of trained models to compare
model_dict = {'model_1': model_1, 'model_2': model_2}


In [2]:
display(X)

Unnamed: 0,col1,col2,col3,gender,ethnicity,other
0,48,36,20,0,1,1
1,24,48,44,1,1,0
2,19,18,44,0,0,0
3,38,29,49,1,0,1
4,45,33,23,0,1,1
5,35,43,5,1,1,0
6,24,33,12,0,0,0
7,39,39,46,1,0,1
8,6,11,31,0,1,1
9,36,21,16,1,1,0


### Model Measurement
The primary feature of this library is the model comparison tool. The current version supports assessment of binary prediction models and, as in this notebook, regression models (`pred_type="regression"`) through the `measure_model` and `compare_models` functions.

`measure_model` generates a report of multiple fairness metrics for a single model. Its output can be wrapped in a "flag" function to emphasize values that fall outside the "fair" range; the call below returns the unflagged report.

In [3]:
# Generate a pandas dataframe of measures
fhmc.measure_model(X_test, y_test, prtc_attr, model_1, pred_type="regression")


Unnamed: 0,Unnamed: 1,Value
Group Fairness,Mean Prediction Ratio,8.377066
Group Fairness,MAE Ratio,1.551616
Group Fairness,R2 Ratio,4.266221
Group Fairness,Mean Prediction Difference,-4.238006
Group Fairness,MAE Difference,3.457205
Group Fairness,R2 Difference,-25.702288
Individual Fairness,Consistency Score,-2.596601
Individual Fairness,Between-Group Gen. Entropy Error,0.0
Model Performance (Weighted Avg),Target Mean,4.21497
Model Performance (Weighted Avg),Pred. Mean,-2.693487
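
As a rough illustration of what flagging values outside a "fair" range means, the sketch below checks the ratio measures from the table above against a pair of bounds. The bounds (0.8 to 1.2) and the helper name `outside_range` are assumptions chosen for illustration; they are not taken from FairMLHealth, whose own thresholds may differ.

```python
# Ratio measures copied from the measure_model output above
ratios = {
    "Mean Prediction Ratio": 8.377066,
    "MAE Ratio": 1.551616,
    "R2 Ratio": 4.266221,
}

# ASSUMPTION: illustrative bounds only, not FairMLHealth's actual thresholds
LOWER, UPPER = 0.8, 1.2


def outside_range(value, lower=LOWER, upper=UPPER):
    """Return True when a ratio falls outside the [lower, upper] interval."""
    return not (lower <= value <= upper)


# Collect every measure whose ratio lies outside the illustrative bounds
flagged = {name: value for name, value in ratios.items() if outside_range(value)}
for name, value in flagged.items():
    print(f"{name}: {value:.3f} is outside [{LOWER}, {UPPER}]")
```

A ratio of 1 means both groups received the same value for the measure, so the further a ratio sits from 1, the larger the gap between groups.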


### Evaluating

FairMLHealth now also includes stratified reporting features to aid in identifying the source of unfairness or other bias: data reports, performance reports, and bias reports. Note that these stratified reports can evaluate multiple features at once, and that there are two options for identifying which features to assess.

Note that the flag tool has not yet been updated to work with stratified reports.

#### Stratified Data Reports

The data reporter is shown below using the first of its two data argument options: passing the full data set and selecting columns with the *features* argument. The alternative is to pass only the columns of interest. The reporter evaluates basic statistics specific to each feature-value, in addition to relative statistics for the target value. Since it can evaluate many features at once, it can be a useful option for identifying patterns of bias, either alone or in concert with other methods (e.g., visual methods).

In [4]:
# Arguments Option 1: pass full set of data, subsetting with *features* argument
reports.data_report(X_test, y_test, features=['gender'])

USER ALERT! The following features have more than 11 values, which will slow processing time. Consider reducing to bins or quantiles: ['y']


Unnamed: 0,Feature Name,Feature Value,Obs.,Entropy,Missing Values,Value Prevalence,y Mean,y Median,y Std. Dev.
0,ALL FEATURES,ALL VALUES,12,,0,1.0,4.21497,4.148021,2.424003
1,gender,0,6,1.0,0,0.5,4.912123,4.629036,2.299964
2,gender,1,6,1.0,0,0.5,3.517816,3.349271,2.543707
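
To make the report columns more concrete, here is a minimal standard-library sketch of how per-value statistics of this kind could be computed. It assumes the "Entropy" column is the base-2 Shannon entropy of the feature's value distribution, which is consistent with the value of 1.0 reported for an evenly split binary feature but is not confirmed by the library's source. The function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2
from statistics import mean

# Toy feature and target (hypothetical values, not the notebook's data)
gender = [0, 1, 0, 1, 0, 1]
y = [2.0, 4.0, 3.0, 5.0, 4.0, 6.0]


def shannon_entropy(values):
    """Base-2 Shannon entropy of the value distribution of a feature."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def stratified_summary(feature, target):
    """Observation count, prevalence, and target mean for each feature value."""
    summary = {}
    for value in sorted(set(feature)):
        subset = [t for f, t in zip(feature, target) if f == value]
        summary[value] = {
            "obs": len(subset),
            "prevalence": len(subset) / len(feature),
            "y_mean": mean(subset),
        }
    return summary


print(shannon_entropy(gender))
print(stratified_summary(gender, y))
```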


#### Stratified Performance Reports

The stratified performance reporter evaluates model performance specific to each feature-value subset. If prediction probabilities are available for the model, additional ROC AUC and PR AUC values will be included.

In [5]:
reports.performance_report(X_test[['gender']], y_test, 
                           model_1.predict(X_test), pred_type="regression")

Unnamed: 0,Feature Name,Feature Value,Obs.,Target Mean,Pred. Mean,Error Mean,Error Std. Dev.,MAE,MSE,Pred. Median,Pred. Std. Dev.
0,ALL FEATURES,ALL VALUES,12.0,4.21497,-2.693487,-6.908457,7.559461,7.996011,100.110102,-2.054167,6.677894
1,gender,0,6.0,4.912123,-4.81249,-9.724613,8.330387,9.724613,152.39756,-4.511152,6.163537
2,gender,1,6.0,3.517816,-0.574484,-4.0923,6.106625,6.267408,47.822645,-1.958892,7.02437
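
The per-group values in this table line up with the group fairness measures reported by `measure_model` earlier in the notebook. The sketch below is plain arithmetic on the reported numbers, not the library's implementation. It assumes that `gender == 1` serves as the reference (privileged) group, which is inferred from the outputs rather than stated in the source.

```python
# Per-group values copied from the gender rows of the performance report above
mae = {0: 9.724613, 1: 6.267408}
pred_mean = {0: -4.81249, 1: -0.574484}

# ASSUMPTION: gender == 1 is the reference (privileged) group; this is
# inferred from the reported numbers, not taken from the library's source
unpriv, priv = 0, 1

# Ratios divide the unprivileged value by the privileged value;
# differences subtract the privileged value from the unprivileged value
mae_ratio = mae[unpriv] / mae[priv]
mae_difference = mae[unpriv] - mae[priv]
mean_pred_ratio = pred_mean[unpriv] / pred_mean[priv]
mean_pred_difference = pred_mean[unpriv] - pred_mean[priv]

print(f"MAE Ratio: {mae_ratio:.4f}")
print(f"MAE Difference: {mae_difference:.4f}")
print(f"Mean Prediction Ratio: {mean_pred_ratio:.4f}")
print(f"Mean Prediction Difference: {mean_pred_difference:.4f}")
```

Under that assumption, the results agree with the "MAE Ratio", "MAE Difference", "Mean Prediction Ratio", and "Mean Prediction Difference" rows of the `measure_model` table to within rounding.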


#### Stratified Bias Reports

The stratified bias reporter presents model bias specific to each feature-value subset. Inspired by common measures of fairness, the reporter treats each feature value in turn as the "privileged" group relative to all other values of that feature. For example, the two "gender" rows in the table below each treat one gender value as the privileged group and the other value as unprivileged; each "col2" row likewise compares one observed value of "col2" against all of its other values.

To simplify the report, fairness measures are reduced to their component parts. For example, in a binary classification report, measures of Equalized Odds can be determined by combining the True Positive Rate (TPR) Ratios & Differences with the False Positive Rate (FPR) Ratios & Differences.

See also: [Fairness Quick References](../docs/Fairness_Quick_References.pdf) and the [Tutorial for Evaluating Fairness in Binary Classification](./Tutorial-EvaluatingFairnessInBinaryClassification.ipynb)

In [7]:
reports.bias_report(X_test[['gender', 'col2']], y_test, 
                    model_1.predict(X_test), pred_type="regression")




Unnamed: 0,Feature Name,MAE Difference,MAE Ratio,Mean Prediction Difference,Mean Prediction Ratio,R2 Difference,R2 Ratio
0,gender,-3.457205,0.644489,4.238006,0.119374,25.702288,0.234399
1,gender,3.457205,1.551616,-4.238006,8.377066,-25.702288,4.266221
2,col2,-14.914819,0.311664,12.384089,0.118292,,
3,col2,1.512711,1.228874,-2.435103,6.278679,,
4,col2,-3.815804,0.668013,4.501117,0.339965,,
5,col2,0.881197,1.122589,2.695131,0.478095,,
6,col2,1.604387,1.245871,-17.028231,-0.318411,,
7,col2,0.610323,1.081513,-0.353593,1.147403,2.155442,0.891794
8,col2,-6.247333,0.544746,5.88316,0.272461,,
9,col2,5.272567,2.667044,-0.105885,1.040781,,


## Special Cases

### Hypothetical Example for which Only Mid-Range Values Need Be Accurate

Consider patients undergoing a multi-stage surgical procedure whose treatment times are predicted by a machine learning model. The intervention chosen depends on the average predicted time in surgery:

| Average Predicted Time in Surgery | Intervention |
| - | - |
|0-5 hours | Outpatient Procedures |
|5-9 hours | Treatment Decision Depends on Predicted Trends in Surgery Time |
| 9+ hours | Inpatient Procedures |
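
The table above can be read as a simple lookup from predicted time to intervention. The sketch below encodes it as a function. The function name is hypothetical, and because the table does not say which band the boundary values of exactly 5 and 9 hours belong to, the code assumes they fall into the higher band.

```python
def intervention_for(predicted_hours):
    """Map an average predicted surgery time (hours) to the intervention band."""
    if predicted_hours < 0:
        raise ValueError("predicted time must be non-negative")
    # ASSUMPTION: boundaries at exactly 5 and 9 hours fall into the higher band
    if predicted_hours < 5:
        return "Outpatient Procedures"
    if predicted_hours < 9:
        return "Treatment Decision Depends on Predicted Trends in Surgery Time"
    return "Inpatient Procedures"


for hours in (3.0, 7.5, 11.0):
    print(hours, "->", intervention_for(hours))
```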

In [None]:
# Generate quantiles
quantiles = pd.qcut(y_test, 3, labels=False)

# Generate plots
g = sns.lineplot(x=quantiles, y=y_test)
g.axhspan(0, 5, alpha=0.25, color='green')
g.axhspan(5, 9, alpha=0.25, color='lightyellow')
g.axhspan(9, 15, alpha=0.25, color='blue')
plt.xticks([*range(3)], [*range(3)])
g.set_xlabel("Quantile")
g.set_ylabel("Average Time in Surgery")
g.set_title("True Target Trend Across Quantiles")
plt.show()
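
The `pd.qcut(y_test, 3, labels=False)` call above assigns each target value to one of three roughly equal-sized bins, labeled 0 to 2. The standard-library sketch below illustrates the same idea on toy data. It is an approximation for explanation only: the function name is hypothetical, and its behavior may differ from pandas at bin edges or when values are tied.

```python
from bisect import bisect_left
from statistics import quantiles


def tertile_labels(values):
    """Assign each value to bin 0, 1, or 2 using equal-frequency cut points."""
    # method="inclusive" interpolates linearly between the observed values
    cuts = quantiles(values, n=3, method="inclusive")
    # bisect_left places a value equal to a cut point into the lower bin
    return [bisect_left(cuts, v) for v in values]


# Toy target values (hypothetical, not the notebook's y_test)
toy_target = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(tertile_labels(toy_target))
```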


## Cohort Analysis

In [None]:
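# Stratify the bias report by the target quantiles generated above
# ('col2' is used here because X has no column named 'B')
reports.bias_report(X_test[['gender', 'col2']], y_test, model_1.predict(X_test), 
                    pred_type="regression", cohorts=quantiles)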