## About
This notebook contains simple, toy examples to help you get started with FairMLHealth tool usage. This same content is mirrored in the repository's main [README](../README.md).

## Example Setup

In [1]:
from fairmlhealth import report, measure, stat_utils

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier


In [2]:
# First we'll create a semi-randomized dataframe with specific columns for our attributes of interest
np.random.seed(506)
N = 240
X = pd.DataFrame({'col1': np.random.randint(1, 4, N), 
                  'col2': np.random.randint(1, 75, N),
                  'col3': np.random.randint(0, 2, N),
                  'gender': [0, 1]*int(N/2), 
                  'ethnicity': [1, 1, 0, 0]*int(N/4),
                  'other': [1, 0, 0, 0, 1, 0, 0, 1]*int(N/8)
                 })

# Next we'll create a randomized target value
y = pd.Series(X['col3'].values + np.random.randint(0, 2, N), name='Example_Target').clip(upper=1)

# Third, we'll split the data and use it to train two generic models
splits = train_test_split(X, y, stratify=y, test_size=0.5, random_state=60)
X_train, X_test, y_train, y_test = splits

model_1 = BernoulliNB().fit(X_train, y_train)
model_2 = DecisionTreeClassifier().fit(X_train, y_train)



In [3]:
display(X.head(), y.head())

,col1,col2,col3,gender,ethnicity,other
0,1,15,0,0,1,1
1,3,51,1,1,1,0
2,1,30,1,0,0,0
3,2,28,1,1,0,0
4,1,72,0,0,1,1


0    0
1    1
2    1
3    1
4    1
Name: Example_Target, dtype: int64

## Generalized Reports
FairMLHealth has tools to create generalized reports of model bias and performance.

The primary reporting tool is now the **compare** function, which generates side-by-side comparisons for any number of models, for either binary classification or regression problems. Model performance metrics such as accuracy and precision (or MAE and R-squared for regression problems) are also provided to facilitate comparison. Below is an example report for the first of the two example models defined above. When a model cannot supply prediction probabilities, metrics that require them are reported as missing values.

A flagging protocol is applied by default to highlight any cells with values that are out of range.  This can be turned off by passing ***flag_oor = False*** to report.compare().

*Note that the Equal Odds Ratio has been dropped from the example below.* This is because the false positive rate is approximately zero for both the entire dataset and the privileged class, which puts a zero in the denominator of the False Positive Rate Ratio: $\frac{{FPR}_{unprivileged}}{{FPR}_{privileged}}$. The ratio is therefore undefined, so the Equal Odds Ratio cannot be computed.
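
The undefined ratio can be reproduced by hand. The sketch below is a minimal plain-Python illustration and not the library's implementation; the helper names (`false_positive_rate`, `fpr_ratio`) and the toy labels are hypothetical, and it assumes the privileged group is coded as 1.

```python
import math

def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN); NaN when the group has no actual negatives."""
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(negatives) / len(negatives) if negatives else math.nan

def fpr_ratio(y_true, y_pred, is_privileged):
    """FPR(unprivileged) / FPR(privileged); NaN when the denominator is zero."""
    def group_fpr(flag):
        rows = [(t, p) for t, p, g in zip(y_true, y_pred, is_privileged) if g == flag]
        return false_positive_rate([t for t, _ in rows], [p for _, p in rows])

    fpr_priv, fpr_unpriv = group_fpr(1), group_fpr(0)
    if math.isnan(fpr_priv) or fpr_priv == 0:
        return math.nan  # undefined, so the measure would have to be dropped
    return fpr_unpriv / fpr_priv

# Hypothetical toy labels (assumption: 1 marks the privileged group)
y_true        = [0, 0, 1, 1, 0, 0, 1, 1]
is_privileged = [1, 1, 1, 1, 0, 0, 0, 0]

# Case 1: the privileged group has no false positives, so the ratio is undefined
undefined_ratio = fpr_ratio(y_true, [0, 0, 1, 1, 1, 0, 1, 0], is_privileged)

# Case 2: both groups have an FPR of 0.5, so the ratio is defined
defined_ratio = fpr_ratio(y_true, [1, 0, 1, 1, 1, 0, 1, 0], is_privileged)

print(undefined_ratio, defined_ratio)
```

Returning NaN rather than raising keeps the remaining measures usable, which mirrors the behavior described above where the undefined measure is dropped with a warning.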

In [4]:
# Generate a measure report
report.compare(X_test, y_test, X_test['ethnicity'], model_1, flag_oor=True)


  warn(f"The following measures are undefined and have been dropped: {undefined}")


Metric,Measure,model 1
Group Fairness,Balanced Accuracy Difference,-0.3667
Group Fairness,Balanced Accuracy Ratio,0.5769
Group Fairness,Statistical Parity Difference,0.45
Group Fairness,Disparate Impact Ratio,1.8182
Group Fairness,Positive Predictive Parity Difference,-0.25
Group Fairness,Positive Predictive Parity Ratio,0.75
Group Fairness,Equal Odds Difference,1.0
Group Fairness,AUC Difference,-0.0778
Individual Fairness,Consistency Score,0.7683
Individual Fairness,Between-Group Gen. Entropy Error,0.0241


In [5]:
# Return the report as embedded html
from IPython.core.display import HTML
html_output = report.compare(X_test, y_test, X_test['gender'], model_1, pred_type="classification", output_type="html")
HTML(html_output)

Metric,Measure,model 1
Group Fairness,Balanced Accuracy Difference,0.0988
Group Fairness,Equal Odds Ratio,0.6691
Group Fairness,Equal Odds Difference,-0.2036
Group Fairness,Positive Predictive Parity Ratio,1.0133
Group Fairness,AUC Difference,-0.025
Group Fairness,Disparate Impact Ratio,0.9068
Group Fairness,Statistical Parity Difference,-0.0759
Group Fairness,Balanced Accuracy Ratio,1.1576
Group Fairness,Positive Predictive Parity Difference,0.0111
Individual Fairness,Consistency Score,0.7683
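
For intuition, two of the group fairness measures in these reports can be computed by hand. The sketch below uses the commonly cited definitions (Statistical Parity Difference as the difference in positive-prediction rates, Disparate Impact Ratio as their ratio, unprivileged relative to privileged). The helper name `selection_rates`, the toy data, and the assumption that the privileged group is coded as 1 are illustrative and not taken from the library's source.

```python
import pandas as pd

def selection_rates(y_pred, prtc_attr, privileged_value=1):
    """Return the share of positive predictions for (unprivileged, privileged)."""
    y_pred = pd.Series(list(y_pred))
    is_priv = pd.Series(list(prtc_attr)).eq(privileged_value)
    return y_pred[~is_priv].mean(), y_pred[is_priv].mean()

# Hypothetical toy predictions and protected attribute
toy_preds = [1, 0, 1, 1, 0, 0, 1, 0]
toy_attr  = [1, 1, 1, 1, 0, 0, 0, 0]

rate_unpriv, rate_priv = selection_rates(toy_preds, toy_attr)

# Statistical Parity Difference: 0 means both groups are selected at the same rate
stat_parity_diff = rate_unpriv - rate_priv

# Disparate Impact Ratio: 1 means both groups are selected at the same rate
disparate_impact = rate_unpriv / rate_priv

print(round(stat_parity_diff, 4), round(disparate_impact, 4))
```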


### Comparing Results for Multiple Models

The **compare** tool can also be used to compare multiple models or multiple protected attributes. Protected attributes are measured separately and cannot yet be combined within the **compare** tool, although they can be grouped as cohorts in the stratified tables [as shown below](#cohort).

Below is an example output comparing the two test models defined above. 

In [7]:
# Example with multiple models
report.compare(X_test, y_test, X_test['gender'],
               {'model 1':model_1, 'model 2':model_2})

Metric,Measure,model 1,model 2
Group Fairness,Balanced Accuracy Difference,0.0988,0.0054
Group Fairness,Equal Odds Ratio,0.6691,0.7647
Group Fairness,Equal Odds Difference,-0.2036,-0.1086
Group Fairness,Positive Predictive Parity Ratio,1.0133,0.9763
Group Fairness,AUC Difference,-0.025,-0.002
Group Fairness,Disparate Impact Ratio,0.9068,0.8383
Group Fairness,Statistical Parity Difference,-0.0759,-0.1234
Group Fairness,Balanced Accuracy Ratio,1.1576,1.0078
Group Fairness,Positive Predictive Parity Difference,0.0111,-0.0205
Individual Fairness,Consistency Score,0.7683,0.7267


In [None]:
# Example with different protected attributes. 
# Note that the same model is passed with two different keys to clarify the column names.
report.compare(X_test, y_test,
               [X_test['gender'], X_test['ethnicity']],
               {'gender':model_1, 'ethnicity':model_1},
               flag_oor=True)


## Detailed Analyses


### Significance Testing

It's recommended to test for the statistical significance of discrepancies in the distribution of results. This is particularly true for attributes with skewed distributions, for which small sample sizes for less common labels may affect fairness measures. Even for balanced attributes, it is generally recommended to validate that the test sample for which fairness measures are generated reflects the full dataset.

FairMLHealth comes with a bootstrapping utility and supporting functions that can be used in some statistical testing. While the selection of proper statistical tests is beyond the scope of this notebook, two examples using the bootstrap_significance tool with built-in test functions are shown below.
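
For readers new to bootstrapping, the general idea can be sketched with NumPy alone. The function below is a generic percentile bootstrap of the difference in positive-prediction rates between two groups. It is not how `bootstrap_significance` is implemented, and the function name, parameters, and toy data are assumptions made for illustration.

```python
import numpy as np

def bootstrap_rate_difference(preds_a, preds_b, n_trials=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for mean(preds_a) - mean(preds_b).

    Returns (lower, upper, reject), where reject is True when the
    (1 - alpha) interval excludes zero.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(preds_a, dtype=float)
    b = np.asarray(preds_b, dtype=float)
    diffs = np.empty(n_trials)
    for i in range(n_trials):
        # Resample each group with replacement, keeping the original group sizes
        sample_a = rng.choice(a, size=a.size, replace=True)
        sample_b = rng.choice(b, size=b.size, replace=True)
        diffs[i] = sample_a.mean() - sample_b.mean()
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    reject = not (lower <= 0 <= upper)
    return lower, upper, reject

# Hypothetical binary predictions for two groups
very_different_a = [1] * 90 + [0] * 10   # 90% predicted positive
very_different_b = [1] * 10 + [0] * 90   # 10% predicted positive
same_rate_a = [1, 0] * 50                # 50% predicted positive
same_rate_b = [0, 1] * 50                # 50% predicted positive

_, _, reject_diff = bootstrap_rate_difference(very_different_a, very_different_b)
_, _, reject_same = bootstrap_rate_difference(same_rate_a, same_rate_b)
print(reject_diff, reject_same)
```

A fixed seed keeps the sketch reproducible; in practice the number of trials and the choice of test statistic should follow the analysis plan rather than these defaults.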

In [None]:
model_1_preds = pd.Series(model_1.predict(X_test))

In [None]:
# Example Significance Test Results Applying Kruskal-Wallis to Predictions
isMale = X_test.reset_index(drop=True)['gender'].eq(1)
reject_h0 = stat_utils.bootstrap_significance(alpha=0.05,
                                              func=stat_utils.kruskal_pval, 
                                              dist_a=model_1_preds.loc[isMale], 
                                              dist_b=model_1_preds.loc[~isMale])
print("Is it likely that the difference in model predictions is related to gender?\n",
      reject_h0)

In [None]:
# Example Significance Test Results Applying Chi-Square to the Distribution of Prediction Successes/Failures
model_1_results = stat_utils.binary_result_labels(y_test, model_1_preds)
reject_h0 = stat_utils.bootstrap_significance(alpha=0.05,
                                              func=stat_utils.chisquare_pval, 
                                              group=X_test['gender'], 
                                              values=model_1_results)
print("Can we reject the hypothesis that prediction results are from the same", 
      "distribution for each gender?\n", reject_h0)

### Stratified Tables
FairMLHealth also provides tools for detailed analysis of model variance by way of stratified data, performance, and bias tables. Beyond evaluating fairness, these tools are intended for flexible use in any generic assessment of model bias. Tables can evaluate multiple features at once. *An important update starting in Version 1.0.0 is that all of these features are now contained in the **measure.py** module (previously named reports.py).*

All tables display a summary row for "All Features, All Values". This summary can be turned off by passing ***add_overview=False*** to measure.data().



#### Data Tables

The stratified data table can be used to evaluate data against one or multiple targets. Two methods are available for identifying which features to assess, as shown in the examples below. 


In [None]:
# Arguments Option 1: pass full set of data, subsetting with *features* argument
measure.data(X_test, y_test, features=['gender', 'other', 'col1'])

In [None]:
# Arguments Option 2: pass the data subset of interest without using the *features* argument
measure.data(X_test[['gender', 'other', 'col1']], y_test)

In [None]:
# Display a similar report for multiple targets, dropping the summary row
measure.data(X=X_test, # used to define rows
             Y=X_test, # used to define columns
             features=['gender', 'col1'], # optional subset of X
             targets=['col2', 'col3'], # optional subset of Y
             add_overview=False # turns off "All Features, All Values" row
             )
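
To make "stratified" concrete, the sketch below builds a comparable table with plain pandas: one row per feature-value pair, with an observation count and the mean of the target. It is only an approximation of the idea; the function name `stratified_summary`, the column labels, and the toy data are hypothetical and do not reflect the actual output format of measure.data().

```python
import pandas as pd

def stratified_summary(X, y, features):
    """Count observations and average the target for every feature-value pair."""
    rows = []
    for feature in features:
        # Group the target by the values of this feature (aligned on index)
        for value, target in y.groupby(X[feature]):
            rows.append({"Feature Name": feature,
                         "Feature Value": value,
                         "Obs.": len(target),
                         "Target Mean": target.mean()})
    return pd.DataFrame(rows)

# Hypothetical toy data
toy_X = pd.DataFrame({"gender": [0, 1, 0, 1, 0, 1],
                      "col1":   [1, 1, 2, 2, 3, 3]})
toy_y = pd.Series([0, 1, 1, 1, 0, 1], name="target")

table = stratified_summary(toy_X, toy_y, features=["gender", "col1"])
print(table)
```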

#### Stratified Performance Tables

The stratified performance table evaluates model performance specific to each feature-value subset. These tables are compatible with both classification and regression models. For classification models with the *predict_proba()* method, additional ROC_AUC and PR_AUC values are included when the prediction probabilities are passed through the *y_prob* argument.

In [None]:
# Performance table example
measure.performance(X_test[['gender']], y_test, model_1.predict(X_test))

In [None]:
# Performance table example with probabilities included
measure.performance(X_test[['gender']], 
                    y_true=y_test, 
                    y_pred=model_1.predict(X_test), 
                    y_prob=model_1.predict_proba(X_test)[:,1])
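
The idea behind a stratified performance table can be sketched with a pandas groupby: compute each metric separately within every value of a feature. The function name `stratified_accuracy`, the chosen metrics, and the toy data below are assumptions for illustration and are not the library's implementation or its full metric set.

```python
import pandas as pd

def stratified_accuracy(feature, y_true, y_pred):
    """Per-value observation count, accuracy, and true positive rate."""
    frame = pd.DataFrame({"value": list(feature),
                          "y_true": list(y_true),
                          "y_pred": list(y_pred)})
    frame["correct"] = frame["y_true"].eq(frame["y_pred"])
    # Among actual positives, a correct prediction is a true positive
    positives = frame[frame["y_true"].eq(1)]
    return pd.DataFrame({
        "Obs.": frame.groupby("value").size(),
        "Accuracy": frame.groupby("value")["correct"].mean(),
        "TPR": positives.groupby("value")["correct"].mean(),
    })

# Hypothetical toy data for a binary feature
toy_feature = [0, 0, 0, 0, 1, 1, 1, 1]
toy_true    = [1, 1, 0, 0, 1, 1, 0, 0]
toy_pred    = [1, 0, 0, 0, 1, 1, 1, 0]

perf = stratified_accuracy(toy_feature, toy_true, toy_pred)
print(perf)
```

Both groups have the same accuracy in this toy case while their true positive rates differ, which is the kind of gap that an overall score can hide.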

#### Stratified Bias Tables

The stratified bias analysis table applies fairness-related metrics to each feature-value pair. It treats a given feature value as the "privileged" group relative to all other possible values of that feature. For example, row **2** in the table below displays measures for **"col1"** with a value of **"2"**. For this row, "2" is considered to be the privileged group, while all other non-null values (namely "1" and "3") are considered unprivileged.

To simplify the table, fairness measures have been reduced to their component parts. For example, the Equal Odds Ratio has been reduced to the True Positive Rate (TPR) Ratio and False Positive Rate (FPR) Ratio.

Note that the *flag* function is compatible with both **measure.bias()** and **measure.summary()** (which is demonstrated below). However, to enable colored cells the tool returns a pandas Styler rather than a DataFrame. For this reason, *flag_oor* is set to False by default in both functions. Flagging can be turned on by passing *flag_oor=True* to either function. As an added feature, optional custom ranges can be passed to either **measure.bias()** or **measure.summary()** to facilitate regression evaluation, as shown in [Example-ToolUsage_Regression](https://nbviewer.jupyter.org/github/KenSciResearch/fairMLHealth/blob/integration/examples_and_tutorials/Example-ToolUsage_Regression.ipynb).
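
When a plain DataFrame is needed for further processing, out-of-range values can be marked with a boolean mask instead of cell colors. The sketch below shows one way to do that. The ranges in `ASSUMED_RANGES` (0.8 to 1.2 for ratios, -0.1 to 0.1 for differences) are assumptions chosen for illustration, not values read from the library, and `out_of_range_mask` is a hypothetical helper. The measure values are copied from the gender report for model 1 shown earlier.

```python
import pandas as pd

# ASSUMED "fair" ranges, keyed by the suffix of the measure name
ASSUMED_RANGES = {"Ratio": (0.8, 1.2), "Difference": (-0.1, 0.1)}

def out_of_range_mask(measures):
    """Return a boolean Series that is True where a measure leaves its range."""
    def is_oor(name, value):
        for suffix, (low, high) in ASSUMED_RANGES.items():
            if name.endswith(suffix):
                return not (low <= value <= high)
        return False  # measures without an assumed range are never flagged

    return pd.Series({name: is_oor(name, value) for name, value in measures.items()})

measures = pd.Series({
    "Disparate Impact Ratio": 0.9068,
    "Statistical Parity Difference": -0.0759,
    "Equal Odds Ratio": 0.6691,
    "Equal Odds Difference": -0.2036,
})

mask = out_of_range_mask(measures)
print(measures[mask])  # only the measures outside the assumed ranges
```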

In [None]:
# Example of bias table with flag turned on
measure.bias(X_test[['gender', 'col3']], y_test, model_1.predict(X_test), flag_oor=True)
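
The one-versus-rest logic described above can be sketched in pandas. Each value of a feature takes a turn as the privileged group, and the True Positive Rate and False Positive Rate ratios are computed as unprivileged over privileged. The helper names and toy data are hypothetical, and this sketch is not the library's implementation.

```python
import pandas as pd

def positive_rate(y_true, y_pred, actual):
    """Share of rows with true label `actual` that were predicted positive."""
    mask = y_true == actual
    return y_pred[mask].mean() if mask.any() else float("nan")

def safe_ratio(numerator, denominator):
    """Return NaN instead of dividing by a zero or missing denominator."""
    if pd.isna(denominator) or denominator == 0:
        return float("nan")
    return numerator / denominator

def one_vs_rest_ratios(feature, y_true, y_pred):
    feature, y_true, y_pred = (pd.Series(list(s)) for s in (feature, y_true, y_pred))
    rows = []
    for value in sorted(feature.unique()):
        priv = feature == value  # this value is "privileged"; all others are not
        rows.append({
            "Feature Value": value,
            "TPR Ratio": safe_ratio(positive_rate(y_true[~priv], y_pred[~priv], 1),
                                    positive_rate(y_true[priv], y_pred[priv], 1)),
            "FPR Ratio": safe_ratio(positive_rate(y_true[~priv], y_pred[~priv], 0),
                                    positive_rate(y_true[priv], y_pred[priv], 0)),
        })
    return pd.DataFrame(rows).set_index("Feature Value")

# Hypothetical toy data for a three-valued feature such as "col1"
toy_col1 = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
toy_true = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1]

ratios = one_vs_rest_ratios(toy_col1, toy_true, toy_pred)
print(ratios)
```

In this toy case the row for value 2 has an undefined FPR Ratio, because the group with value 2 has no false positives to place in the denominator.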

The **measure** module also contains a summary function that works similarly to report.compare(). While it can only be applied to one model at a time, it accepts custom "fair" ranges as well as cohort groups, as will be [shown in the next section](#cohort).

In [None]:
# Example summary with performance skipped
measure.summary(X_test[['col2']], 
                y_test, 
                model_1.predict(X_test),
                prtc_attr=X_test['gender'], 
                pred_type="classification",
                skip_performance=True
               )

## <a name="cohort"></a>Analysis by Cohort

Table-generating functions in the **measure** module can all be additionally grouped using the *cohorts* argument to specify additional labels for each observation. Cohorts may consist of either a single label or a set of labels, and may be either separate from or attached to the existing data.

In [None]:
# Separate, Single-Level Cohorts
cohort_labels = X_test['gender']
measure.bias(X_test['col3'], y_test, model_1.predict(X_test),
             flag_oor=True, cohorts=cohort_labels)

In [None]:
# Associated, Multi-Level Cohorts
measure.data(X=X_test['col3'], Y=y_test, cohorts=X_test[['gender', 'ethnicity']])

In [None]:
# Cohorts for summary tables
measure.summary(X_test[['col2']], 
                y_test, 
                model_1.predict(X_test),
                prtc_attr=X_test['gender'], 
                pred_type="classification",
                flag_oor=False,
                skip_performance=True,
                cohorts=X_test[['ethnicity', 'col3']]
               )
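
Conceptually, a cohort is an extra grouping key applied before a table is computed. The pandas sketch below shows single-level and multi-level grouping on toy data; it illustrates the grouping idea only and is not how the *cohorts* argument is implemented. All column values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two possible cohort columns
toy = pd.DataFrame({
    "gender":    [0, 1, 0, 1, 0, 1, 0, 1],
    "ethnicity": [1, 1, 0, 0, 1, 1, 0, 0],
    "y_true":    [1, 0, 1, 0, 1, 1, 0, 0],
    "y_pred":    [1, 0, 0, 0, 1, 0, 0, 1],
})
toy["correct"] = toy["y_true"].eq(toy["y_pred"])

# Single-level cohorts: one accuracy value per gender
single_level = toy.groupby("gender")["correct"].agg(["count", "mean"])

# Multi-level cohorts: one accuracy value per (gender, ethnicity) combination
multi_level = toy.groupby(["gender", "ethnicity"])["correct"].agg(["count", "mean"])

print(single_level)
print(multi_level)
```

Each added cohort level shrinks the number of observations per group (from four to two here), so measures computed on fine-grained cohorts deserve the same significance checks discussed earlier.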