## About
This notebook contains simple toy examples to help you get started with the FairMLHealth tools. The same content is mirrored in the repository's main [README](../README.md).

## Example Setup

In [1]:
from fairmlhealth import report as fhrp, measure, stat_utils 

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier


In [2]:
# First we'll create a semi-randomized dataframe with specific columns for our attributes of interest
np.random.seed(506)
N = 240
X = pd.DataFrame({'col1': np.random.randint(1, 4, N), 
                  'col2': np.random.randint(1, 75, N),
                  'col3': np.random.randint(0, 2, N),
                  'gender': [0, 1]*int(N/2), 
                  'ethnicity': [1, 1, 0, 0]*int(N/4),
                  'other': [1, 1, 1, 1, 1, 0, 0, 1]*int(N/8)
                 })

# Next we'll create a randomized binary target value for use in the examples below.
# Note that it is drawn independently of the features defined above.
y = pd.Series(np.random.randint(0, 2, N), name='y')

# Third, we'll split the data and use it to train two generic models
splits = train_test_split(X, y, stratify=y, test_size=0.5, random_state=60)
X_train, X_test, y_train, y_test = splits

model_1 = BernoulliNB().fit(X_train, y_train)
model_2 = DecisionTreeClassifier().fit(X_train, y_train)



In [3]:
display(X.head(), y.head())

,col1,col2,col3,gender,ethnicity,other
0,1,15,0,0,1,1
1,3,51,1,1,1,1
2,1,30,1,0,0,1
3,2,28,1,1,0,1
4,1,72,0,0,1,1


0    0
1    0
2    0
3    1
4    1
Name: y, dtype: int64

## Model Measurement
The primary feature of this library is the model comparison tool. The current version supports assessment of binary prediction models through the measure_model and compare_models functions.

The measure_model function generates a report of multiple fairness metrics for a single model. Here it is called with the flag_oor argument set to True, which emphasizes values that fall outside of the "fair" range.

In [4]:
# Generate a pandas dataframe of measures
fhrp.measure_model(X_test, y_test, X_test['gender'], model_1, flag_oor=True)

Metric,Measure,Value
Group Fairness,Balanced Accuracy Difference,-0.0618
Group Fairness,Balanced Accuracy Ratio,0.9003
Group Fairness,Statistical Parity Difference,-0.2875
Group Fairness,Disparate Impact Ratio,0.5048
Group Fairness,Positive Predictive Parity Difference,-0.1373
Group Fairness,Positive Predictive Parity Ratio,0.7941
Group Fairness,Equalized Odds Difference,-0.3257
Group Fairness,Equalized Odds Ratio,0.525
Group Fairness,AUC Difference,-0.0746
Data Metrics,Prevalence of Privileged Class (%),50.0
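
As a rough illustration of what two of these measures capture, the sketch below recomputes the Statistical Parity Difference and Disparate Impact Ratio from positive-prediction counts. The counts (17 of 58 for gender 0, 36 of 62 for gender 1) are back-calculated from the Mean __y_pred values reported in the stratified performance table later in this notebook. The convention used (unprivileged relative to privileged, with gender 1 as the privileged group) is an assumption that reproduces the values above; it is not taken from the library's source.

```python
import numpy as np

# Hypothetical predictions rebuilt from counts: 17/58 positive for gender 0,
# 36/62 positive for gender 1 (back-calculated, not library output)
pred_unpriv = np.array([1] * 17 + [0] * 41)  # gender == 0
pred_priv = np.array([1] * 36 + [0] * 26)    # gender == 1

rate_unpriv = float(pred_unpriv.mean())  # selection rate, unprivileged group
rate_priv = float(pred_priv.mean())      # selection rate, privileged group

# Assumed convention: unprivileged minus (or divided by) privileged
stat_parity_diff = rate_unpriv - rate_priv
disparate_impact = rate_unpriv / rate_priv

print(round(stat_parity_diff, 4), round(disparate_impact, 4))
```

Under these assumptions the two values round to -0.2875 and 0.5048, matching the report above.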


## Evaluation

FairMLHealth also includes stratified tables to aid in identifying the source of unfairness or other bias: data tables, performance tables, and bias tables. These tables can evaluate multiple features at once, and there are two ways to specify which features to assess: pass the full dataset together with a features argument, or pass only the columns of interest.

Note that the flag tool has not yet been updated to work with stratified tables.

### Significance Testing


In [5]:
model_1_preds = pd.Series(model_1.predict(X_test))

gender_corr = X_test['gender']
#gender_corr[(gender_corr.index+1)%4!=0] = 0 # only associate with gender every other instance 
model_1_preds.loc[gender_corr.eq(1) & model_1_preds.ne(y_test)] =  y_test
display(gender_corr.head())
display(y_test.head())
display(model_1_preds.head())

model_1_results = stat_utils.binary_result_labels(y_test, model_1_preds)
display(model_1_results.head())

152    0
187    1
191    1
171    1
91     1
Name: gender, dtype: int64

152    0
187    1
191    0
171    1
91     0
Name: y, dtype: int64

0    0
1    1
2    1
3    1
4    1
dtype: int64

0    TN
1    TP
2    FP
3    TP
4    FP
Name: prediction result, dtype: object

In [6]:
isMale = X_test['gender'].eq(1)
reject_h0 = stat_utils.bootstrap_significance(func=stat_utils.kruskal_pval, 
                                              dist_a=y_test.loc[isMale], 
                                              dist_b=y_test.loc[~isMale])
print("Is it likely that the difference in the mean y value is related to gender?\n", reject_h0)

Is it likely that the difference in the mean y value is related to gender?
 False


In [7]:
reject_h0 = stat_utils.bootstrap_significance(func=stat_utils.chisquare_pval, 
                                              group=X_test['gender'], 
                                              values=model_1_results)
print("Can we reject the hypothesis that prediction results are from the same", 
      "distribution for each gender?\n", reject_h0)

Can we reject the hypothesis that prediction results are from the same distribution for each gender?
 False


In [8]:
reject_h0 = stat_utils.bootstrap_significance(func=stat_utils.chisquare_pval, 
                                              group=X_test['ethnicity'], 
                                              values=model_1_results)
print("Can we reject the hypothesis that prediction results are from the same", 
      "distribution for each ethnicity?\n", reject_h0)

Can we reject the hypothesis that prediction results are from the same distribution for each ethnicity?
 False


In [9]:
reject_h0 = stat_utils.bootstrap_significance(func=stat_utils.chisquare_pval, 
                                              group=X_test['other'], 
                                              values=model_1_results)
print("Can we reject the hypothesis that prediction results are from the same", 
      "distribution for each \"other\" category?\n", reject_h0)

Can we reject the hypothesis that prediction results are from the same distribution for each "other" category?
 False
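
The checks above rely on stat_utils helpers whose internals are not shown here. For intuition only, the sketch below implements a different but related technique, a plain permutation test on the difference in group means, using NumPy alone. The function name permutation_pval and the toy data are hypothetical and are not part of FairMLHealth.

```python
import numpy as np

def permutation_pval(a, b, n_iter=2000, seed=0):
    """Two-sided permutation p-value for the difference in means of a and b."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_iter):
        shuffled = rng.permutation(pooled)  # reassign group labels at random
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        hits += int(diff >= observed)
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_iter + 1)

# Toy outcomes: identical groups versus clearly different groups
same = permutation_pval([0, 1] * 30, [0, 1] * 30)
different = permutation_pval([0] * 45 + [1] * 15, [0] * 15 + [1] * 45)
print(same > 0.05, different < 0.05)
```

A small p-value suggests that the observed gap between groups would be unusual if group membership were unrelated to the outcome.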


### Stratified Data Tables

The data table is shown below with each of the two data argument options. It evaluates basic statistics specific to each feature-value, in addition to relative statistics for the target value. Since the table can be used to evaluate many features at once, it can be a useful option for identifying patterns of bias, either alone or in concert with other methods (e.g., visual methods).

In [10]:
# Arguments Option 1: pass full set of data, subsetting with *features* argument
measure.data(X_test, y_test, features=['gender', 'other', 'col1'])

,Feature Name,Feature Value,Obs.,Entropy,Mean y,Median y,Missing Values,Std. Dev. y,Value Prevalence
0,ALL FEATURES,ALL VALUES,120,,0.5,0.5,0,0.5021,1.0
1,gender,0,58,0.9992,0.431,0.0,0,0.4995,0.4833
2,gender,1,62,0.9992,0.5645,1.0,0,0.4999,0.5167
3,other,0,28,0.7838,0.4643,0.0,0,0.5079,0.2333
4,other,1,92,0.7838,0.5109,1.0,0,0.5026,0.7667
5,col1,1,41,1.5841,0.5366,1.0,0,0.5049,0.3417
6,col1,2,41,1.5841,0.5366,1.0,0,0.5049,0.3417
7,col1,3,38,1.5841,0.4211,0.0,0,0.5004,0.3167
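
The Entropy and Value Prevalence columns can be checked by hand from the observation counts. The sketch below assumes that Entropy is the Shannon entropy (in bits) of each feature's value distribution. That assumption reproduces the values printed above but is not confirmed from the library's source, and the helper name shannon_entropy is hypothetical.

```python
import numpy as np

def shannon_entropy(counts):
    """Shannon entropy in bits of a distribution given by raw counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # zero-count values contribute nothing
    return float(-(p * np.log2(p)).sum())

# Observation counts copied from the Obs. column of the table above
obs = {"gender": [58, 62], "other": [28, 92], "col1": [41, 41, 38]}

for feature, counts in obs.items():
    prevalence = [round(c / sum(counts), 4) for c in counts]
    print(feature, round(shannon_entropy(counts), 4), prevalence)
```

Because entropy describes the feature as a whole, the same value repeats on every row of that feature.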


In [11]:
# Multiple targets can also be passed through the data table 
measure.data(X_test, X_test, features=['gender', 'col1'], targets=['col2', 'col3'])

,Feature Name,Feature Value,Obs.,Entropy,Mean col2,Mean col3,Median col2,Median col3,Missing Values,Std. Dev. col2,Std. Dev. col3,Value Prevalence
0,ALL FEATURES,ALL VALUES,120,,36.3417,0.4917,33.5,0.0,0,21.0779,0.502,1.0
1,gender,0,58,0.9992,38.9138,0.4828,38.5,0.0,0,21.5482,0.5041,0.4833
2,gender,1,62,0.9992,33.9355,0.5,30.0,0.5,0,20.5098,0.5041,0.5167
3,col1,1,41,1.5841,39.8293,0.5122,40.0,1.0,0,21.862,0.5061,0.3417
4,col1,2,41,1.5841,31.3659,0.4634,25.0,0.0,0,18.4415,0.5049,0.3417
5,col1,3,38,1.5841,37.9474,0.5,35.5,0.5,0,22.3824,0.5067,0.3167


In [12]:
# Arguments Option 2: pass only the features of interest
# The "ALL FEATURES" overview row can be turned off via the add_overview argument
measure.data(X_test[['gender']], y_test, add_overview=False)

,Feature Name,Feature Value,Obs.,Entropy,Mean y,Median y,Missing Values,Std. Dev. y,Value Prevalence
0,gender,0,58,0.9992,0.431,0.0,0,0.4995,0.4833
1,gender,1,62,0.9992,0.5645,1.0,0,0.4999,0.5167


In [13]:
# Targets can be any columns of the second argument, here 'col2' and 'other'
measure.data(X_test, X_test, features=['gender', 'col1'], targets=['col2', 'other'])

,Feature Name,Feature Value,Obs.,Entropy,Mean col2,Mean other,Median col2,Median other,Missing Values,Std. Dev. col2,Std. Dev. other,Value Prevalence
0,ALL FEATURES,ALL VALUES,120,,36.3417,0.7667,33.5,1.0,0,21.0779,0.4247,1.0
1,gender,0,58,0.9992,38.9138,0.7759,38.5,1.0,0,21.5482,0.4207,0.4833
2,gender,1,62,0.9992,33.9355,0.7581,30.0,1.0,0,20.5098,0.4318,0.5167
3,col1,1,41,1.5841,39.8293,0.6585,40.0,1.0,0,21.862,0.4801,0.3417
4,col1,2,41,1.5841,31.3659,0.7805,25.0,1.0,0,18.4415,0.4191,0.3417
5,col1,3,38,1.5841,37.9474,0.8684,35.5,1.0,0,22.3824,0.3426,0.3167


### Stratified Performance Tables

The stratified performance table contains model performance measures specific to each feature-value subset. If prediction probabilities are provided (through the y_prob argument), additional ROC AUC and PR AUC columns will be included.

In [14]:
measure.performance(X_test[['gender']], y_test, model_1.predict(X_test), add_overview=False)

,Feature Name,Feature Value,Obs.,Accuracy,F1-Score,FPR,Mean __y_pred,Mean __y_true,Precision,TPR
0,gender,0,58.0,0.5862,0.4286,0.2424,0.2931,0.431,0.5294,0.36
1,gender,1,62.0,0.629,0.6761,0.4444,0.5806,0.5645,0.6667,0.6857


In [15]:
measure.performance(X_test[['gender']], 
                    y_true=y_test, 
                    y_pred=model_1.predict(X_test), 
                    y_prob=model_1.predict_proba(X_test)[:,1])

,Feature Name,Feature Value,Obs.,Accuracy,F1-Score,FPR,Mean __y_pred,Mean __y_true,PR AUC,Precision,ROC AUC,TPR
0,ALL FEATURES,ALL VALUES,120.0,0.6083,0.5841,0.3333,0.4417,0.5,,0.6226,0.6108,0.55
1,gender,0,58.0,0.5862,0.4286,0.2424,0.2931,0.431,,0.5294,0.5545,0.36
2,gender,1,62.0,0.629,0.6761,0.4444,0.5806,0.5645,,0.6667,0.6291,0.6857
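
For reference, the per-group measures in these tables follow from the confusion-matrix counts of each subset. The sketch below uses the standard definitions. The counts for gender 0 (TP=9, FP=8, TN=25, FN=16) are back-calculated from the rates printed above rather than taken from the library, and the function name performance_from_counts is hypothetical.

```python
def performance_from_counts(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    n = tp + fp + tn + fn
    return {
        "Obs.": n,
        "Accuracy": (tp + tn) / n,
        "F1-Score": 2 * tp / (2 * tp + fp + fn),
        "FPR": fp / (fp + tn),
        "Mean y_pred": (tp + fp) / n,
        "Mean y_true": (tp + fn) / n,
        "Precision": tp / (tp + fp),
        "TPR": tp / (tp + fn),
    }

# Counts back-calculated from the gender == 0 row (an assumption, see above)
gender_0 = performance_from_counts(tp=9, fp=8, tn=25, fn=16)
print({name: round(value, 4) for name, value in gender_0.items()})
```

Rounded to four places, these values agree with the gender 0 row of the tables above.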


### Stratified Bias Fairness Tables

The stratified bias table contains model bias measures specific to each feature-value subset. Inspired by common measures of fairness, the tool treats each feature-value as the "privileged" group relative to all other possible values for the feature. For example, row 3 in the table below displays measures for the "col3" value of 1, where 1 is considered to be the privileged group and all other values (here, only 0) are considered unprivileged.

To simplify the table, fairness measures have been reduced to their component parts. For example, measures of Equalized Odds can be determined by combining the True Positive Rate (TPR) Ratios & Differences with False Positive Rate (FPR) Ratios & Differences.

See also: [Fairness Quick References](../docs/Fairness_Quick_References.pdf) and the [Tutorial for Evaluating Fairness in Binary Classification](./Tutorial-EvaluatingFairnessInBinaryClassification.ipynb)

In [16]:
measure.bias(X_test[['gender', 'col3']], y_test, model_1.predict(X_test), flag_oor=True)

,Feature Name,Feature Value,Balanced Accuracy Difference,Balanced Accuracy Ratio,FPR Diff,FPR Ratio,PPV Diff,PPV Ratio,Selection Diff,Selection Ratio,TPR Diff,TPR Ratio
0,gender,0,0.0618,1.1107,0.202,1.8333,0.1373,1.2593,0.2875,1.981,0.3257,1.9048
1,gender,1,-0.0618,0.9003,-0.202,0.5455,-0.1373,0.7941,-0.2875,0.5048,-0.3257,0.525
2,col3,0,0.0095,1.0157,-0.0223,0.9351,0.0613,1.1034,-0.0019,0.9956,-0.0033,0.994
3,col3,1,-0.0095,0.9845,0.0223,1.0694,-0.0613,0.9063,0.0019,1.0044,0.0033,1.0061
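
To make the "component parts" idea concrete, the sketch below rebuilds part of the gender 1 row from per-group rates and then combines the TPR and FPR components into Equalized Odds measures. Two things are assumed: each difference or ratio compares the remaining values against the row's value, and Equalized Odds keeps whichever component deviates most (largest absolute difference, ratio furthest from 1). Both assumptions reproduce the values printed in this notebook, but neither is confirmed from the library's source.

```python
# Per-group rates for model_1, as fractions back-calculated from the
# stratified performance table (gender 0 is compared against gender 1)
rates = {
    0: {"TPR": 9 / 25, "FPR": 8 / 33},
    1: {"TPR": 24 / 35, "FPR": 12 / 27},
}

def compare(measure, privileged=1, unprivileged=0):
    """Return (difference, ratio) of the unprivileged vs. privileged rate."""
    u, p = rates[unprivileged][measure], rates[privileged][measure]
    return u - p, u / p

tpr_diff, tpr_ratio = compare("TPR")
fpr_diff, fpr_ratio = compare("FPR")

# Assumed combination rule: keep the component that deviates the most
eq_odds_diff = max(tpr_diff, fpr_diff, key=abs)
eq_odds_ratio = max(tpr_ratio, fpr_ratio, key=lambda r: abs(r - 1))

print(round(tpr_diff, 4), round(tpr_ratio, 4))
print(round(fpr_diff, 4), round(fpr_ratio, 4))
print(round(eq_odds_diff, 4), round(eq_odds_ratio, 4))
```

Here the TPR components dominate, so the combined values coincide with the Equalized Odds Difference and Ratio reported by measure_model for gender.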


## Comparing Results for Multiple Models

The compare_models feature can be used to generate side-by-side fairness comparisons of multiple models. Model performance metrics such as accuracy and precision are also provided to facilitate comparison.   

Below is an example output comparing the two example models defined above. Metrics that require prediction probabilities are reported as missing values for any model that cannot provide them.

In [17]:
# Pass the data and models to the compare models function, as above
fhrp.compare_models(X_test, y_test, X_test['gender'], 
                    {'model 1':model_1, 'model 2':model_2},
                    flag_oor=True)

Metric,Measure,model 1,model 2
Group Fairness,Balanced Accuracy Difference,-0.0618,-0.123
Group Fairness,Balanced Accuracy Ratio,0.9003,0.7861
Group Fairness,Statistical Parity Difference,-0.2875,-0.1429
Group Fairness,Disparate Impact Ratio,0.5048,0.789
Group Fairness,Positive Predictive Parity Difference,-0.1373,-0.232
Group Fairness,Positive Predictive Parity Ratio,0.7941,0.6253
Group Fairness,Equalized Odds Difference,-0.3257,-0.2629
Group Fairness,Equalized Odds Ratio,0.525,0.6462
Group Fairness,AUC Difference,-0.0746,-0.1367
Data Metrics,Prevalence of Privileged Class (%),50.0,50.0
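
The flag_oor option highlights values outside of a "fair" range. The sketch below illustrates the general idea with pandas on three rows copied from the table above. The thresholds used here (0.8 to 1.2 for ratios, -0.1 to 0.1 for differences) are assumptions chosen for illustration, and the helper out_of_range is hypothetical; check the library documentation for the ranges it actually applies.

```python
import pandas as pd

# Three rows copied from the comparison table above
table = pd.DataFrame(
    {"model 1": [0.9003, -0.2875, 0.5048], "model 2": [0.7861, -0.1429, 0.789]},
    index=["Balanced Accuracy Ratio", "Statistical Parity Difference",
           "Disparate Impact Ratio"],
)

# Assumed "fair" ranges, keyed by the kind of measure (illustrative only)
fair_ranges = {"Ratio": (0.8, 1.2), "Difference": (-0.1, 0.1)}

def out_of_range(row):
    """Flag each value in a row that falls outside its assumed fair range."""
    kind = "Ratio" if row.name.endswith("Ratio") else "Difference"
    low, high = fair_ranges[kind]
    return ~row.between(low, high)

flags = table.apply(out_of_range, axis=1)
print(flags)
```

With these illustrative thresholds, only the Balanced Accuracy Ratio of model 1 falls inside its range.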


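The compare_models function can also be used to measure the same model against two different protected attributes, by passing a list of attributes along with a dictionary that maps a label for each attribute to the model. Protected attributes are measured separately and cannot yet be combined with this tool.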

In [18]:
fhrp.compare_models(X_test, y_test, 
                    [X_test['gender'], X_test['ethnicity']], 
                    {'gender':model_1, 'ethnicity':model_1},
                    flag_oor=True)


Metric,Measure,gender,ethnicity
Group Fairness,Balanced Accuracy Difference,-0.0618,-0.0164
Group Fairness,Balanced Accuracy Ratio,0.9003,0.971
Group Fairness,Statistical Parity Difference,-0.2875,0.7021
Group Fairness,Disparate Impact Ratio,0.5048,9.2852
Group Fairness,Positive Predictive Parity Difference,-0.1373,-0.1958
Group Fairness,Positive Predictive Parity Ratio,0.7941,0.7552
Group Fairness,Equalized Odds Difference,-0.3257,0.7014
Group Fairness,Equalized Odds Ratio,0.525,24.8462
Group Fairness,AUC Difference,-0.0746,0.0484
Data Metrics,Prevalence of Privileged Class (%),50.0,50.0


## Analysis by Cohort

The bias table also accepts a cohorts argument, which splits the data by one or more cohort features and reports the measures separately for each cohort.

In [19]:
measure.bias(X_test['col3'], y_test, model_1.predict(X_test),
             flag_oor=True, cohorts=X_test['gender'])

,gender,Feature Name,Feature Value,Balanced Accuracy Difference,Balanced Accuracy Ratio,FPR Diff,FPR Ratio,PPV Diff,PPV Ratio,Selection Diff,Selection Ratio,TPR Diff,TPR Ratio
0,0,col3,0,-0.1944,0.693,-0.0444,0.8333,-0.4667,0.3,-0.2214,0.4464,-0.4333,0.1875
1,0,col3,1,0.1944,1.443,0.0444,1.2,0.4667,3.3333,0.2214,2.24,0.4333,5.3333
2,1,col3,0,0.0511,1.0882,0.0882,1.2143,0.2286,1.4286,0.1935,1.4,0.1905,1.3333
3,1,col3,1,-0.0511,0.919,-0.0882,0.8235,-0.2286,0.7,-0.1935,0.7143,-0.1905,0.75
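
Conceptually, a cohort analysis repeats the stratified bias calculation within each cohort. The sketch below shows that idea with a pandas groupby on a small hypothetical dataset, computing only a selection-rate difference. It assumes the same "other values minus this value" convention as the rows above and is not the library's implementation.

```python
import pandas as pd

# Small hypothetical dataset: one cohort column, one feature, one prediction
df = pd.DataFrame({
    "gender": [0, 0, 0, 0, 1, 1, 1, 1],
    "col3":   [0, 0, 1, 1, 0, 0, 1, 1],
    "y_pred": [0, 1, 1, 1, 1, 1, 0, 1],
})

def selection_diff(group, feature, value):
    """Selection rate of all other values minus that of the given value."""
    is_value = group[feature].eq(value)
    others = group.loc[~is_value, "y_pred"].mean()
    this = group.loc[is_value, "y_pred"].mean()
    return float(others - this)

rows = []
for cohort, group in df.groupby("gender"):  # one pass per cohort
    for value in sorted(group["col3"].unique()):
        rows.append({"gender": int(cohort),
                     "Feature Name": "col3",
                     "Feature Value": int(value),
                     "Selection Diff": selection_diff(group, "col3", value)})

cohort_table = pd.DataFrame(rows)
print(cohort_table)
```

Each cohort gets its own block of rows, so a disparity that is hidden in the pooled data can show up within a single cohort.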


In [20]:
measure.bias(X_test[['other']], y_test, model_1.predict(X_test),
             flag_oor=True, cohorts=X_test[['ethnicity', 'gender']])


  warn(f"Possible error in column(s) {cols}. {wr}\n")

,ethnicity,gender,Feature Name,Feature Value,Balanced Accuracy Difference,Balanced Accuracy Ratio,FPR Diff,FPR Ratio,PPV Diff,PPV Ratio,Selection Diff,Selection Ratio,TPR Diff,TPR Ratio
0,0,0,other,0,0.0,1.0,1.0,0.0,0.5294,0.0,1.0,0.0,1.0,0.0
1,0,0,other,1,0.0,1.0,-1.0,0.0,-0.5294,0.0,-1.0,0.0,-1.0,0.0
2,1,1,other,0,0.1875,1.375,0.125,0.0,0.8,0.0,0.3125,0.0,0.5,0.0
3,1,1,other,1,-0.1875,0.7273,-0.125,0.0,-0.8,0.0,-0.3125,0.0,-0.5,0.0
