## L4 - 2: Demographic Bias Analysis

### Instructions
- Using the COMPAS dataset prepared by Aequitas, perform a fairness disparity analysis with the under-25 Asian female group as the reference group. See the documentation for reference: https://github.com/dssg/aequitas. In particular, focus your analysis on fairness and disparity for the false positive rate (FPR) and, where applicable, try to leverage some of the visualizations Aequitas provides.

### Aequitas Compas Dataset
- Uses the 2016 ProPublica dataset on automated criminal risk assessment algorithms, adapted from the Aequitas notebook: https://github.com/dssg/aequitas/blob/master/docs/source/examples/compas_demo.ipynb
- Preprocessed using this script: https://github.com/dssg/aequitas/blob/master/examples/compas_data_for_aequitas.py

The Aequitas Bias() class calculates disparities between groups based on the crosstab returned by the Group() class get_crosstabs() method, which we used when preprocessing the dataset.

### Instructions from Aequitas on Calculating Disparities across Reference Groups
(adapted from https://github.com/dssg/aequitas/blob/master/docs/source/examples/compas_demo.ipynb)

Disparities are calculated as a ratio of a metric for a group of interest compared to a base group. For example, the False Negative Rate Disparity for black defendants vis-a-vis whites is:

$$Disparity_{FNR} = \frac{FNR_{black}}{FNR_{white}}$$
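The ratio can be worked through by hand with pandas. This is a minimal sketch: the group names and the counts are hypothetical, chosen only to make the arithmetic easy to follow, and are not COMPAS results.

```python
import pandas as pd

# Hypothetical confusion-matrix counts per group (illustrative only).
counts = pd.DataFrame(
    {"fn": [30, 20], "tp": [70, 30]},
    index=["group_a", "reference"],
)

# FNR = FN / (FN + TP): the share of actual positives the model missed.
counts["fnr"] = counts["fn"] / (counts["fn"] + counts["tp"])

# Disparity = each group's metric divided by the reference group's metric.
counts["fnr_disparity"] = counts["fnr"] / counts.loc["reference", "fnr"]

print(counts)
```

With these counts the reference group has a disparity of exactly 1.0, and a value below 1.0 for `group_a` means its FNR is lower than the reference group's.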

Below, we use get_disparity_predefined_groups() which allows us to choose reference groups that clarify the output for the practitioner.

The Aequitas Bias() class includes two additional get disparity functions: get_disparity_major_group() and get_disparity_min_metric(), which automate base group selection based on sample majority (across each attribute) and minimum value for each calculated bias metric, respectively.
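The two automated selection rules can be illustrated without Aequitas. The sketch below only mimics the idea for a single attribute and a single metric; the group labels, sizes, and FPR values are hypothetical, and the code is not the library's implementation.

```python
import pandas as pd

# Hypothetical per-group metrics for one attribute (illustrative values only).
groups = pd.DataFrame(
    {
        "attribute_value": ["a", "b", "c"],
        "group_size": [500, 300, 200],
        "fpr": [0.40, 0.20, 0.25],
    }
).set_index("attribute_value")

# Majority-group rule: the base group is the largest group in the sample.
major_ref = groups["group_size"].idxmax()
fpr_disparity_major = groups["fpr"] / groups.loc[major_ref, "fpr"]

# Minimum-metric rule: the base group is the one with the smallest metric value.
min_ref = groups["fpr"].idxmin()
fpr_disparity_min = groups["fpr"] / groups.loc[min_ref, "fpr"]

print(major_ref, fpr_disparity_major.to_dict())
print(min_ref, fpr_disparity_min.to_dict())
```

Under the minimum-metric rule every disparity is at least 1.0, because every other group is compared against the group with the lowest value.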

The get_disparity_predefined_groups() method allows the user to define a base group for each attribute.
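For this exercise, the call might look like the sketch below. The helper name `disparities_for_ref_groups` is hypothetical, and the reference labels are assumptions taken from the exercise wording; they must match the category values that actually appear in the data. The Aequitas import sits inside the function so the snippet still loads where the library is not installed.

```python
import pandas as pd

# Base group for each attribute (assumed labels; verify against the dataset).
REF_GROUPS = {"race": "Asian", "sex": "Female", "age_cat": "Less than 25"}


def disparities_for_ref_groups(df: pd.DataFrame, ref_groups: dict) -> pd.DataFrame:
    """Return the Aequitas disparity table for the given base groups (hypothetical helper)."""
    # Imported lazily so this module loads even without aequitas installed.
    from aequitas.bias import Bias
    from aequitas.group import Group

    # Restrict the crosstab to the attributes that have a base group,
    # so numeric helper columns are not treated as attributes.
    attr_cols = list(ref_groups)
    xtab, _ = Group().get_crosstabs(df, attr_cols=attr_cols)

    return Bias().get_disparity_predefined_groups(
        xtab,
        original_df=df,
        ref_groups_dict=ref_groups,
    )
```

A typical use would be `bias_df = disparities_for_ref_groups(compas_df, REF_GROUPS)`, followed by inspecting the `fpr_disparity` column for each `attribute_name` and `attribute_value`.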

Disparities Calculated:
- True Positive Rate Disparity: 'tpr_disparity'
- True Negative Rate Disparity: 'tnr_disparity'
- False Omission Rate Disparity: 'for_disparity'
- False Discovery Rate Disparity: 'fdr_disparity'
- False Positive Rate Disparity: 'fpr_disparity'
- False Negative Rate Disparity: 'fnr_disparity'
- Negative Predictive Value Disparity: 'npv_disparity'
- Precision Disparity: 'precision_disparity'
- Predicted Positive Ratio$_k$ Disparity: 'ppr_disparity'
- Predicted Positive Ratio$_g$ Disparity: 'pprev_disparity'

**How do I interpret calculated disparity ratios?**
The calculated disparities from the dataframe returned by the Bias() class get_disparity_ methods are in relation to a reference group, which will always have a disparity of 1.0.

The differences in False Positive Rates, noted in the discussion of the Group() class above, are clarified using the disparity ratio (fpr_disparity). Black people are falsely identified as being high or medium risk at 1.9 times the rate for white people.

As seen above, False Discovery Rates (the fraction of false positives over predicted positives in a group) have much less disparity (fdr_disparity). Because reference groups have disparity = 1 by design in Aequitas, the lower disparity shows up as an fdr_disparity value close to 1.0 (0.906) for the race attribute group 'African-American' when disparities are calculated using the predefined base group 'Caucasian'. Note that COMPAS is calibrated to balance False Positive Rate and False Discovery Rates across groups.
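The two metrics share a numerator but not a denominator, which is why one can show a large disparity while the other shows almost none. The sketch below uses hypothetical counts (not COMPAS figures) picked so that the FPR disparity is 2.0 while the FDR disparity is 1.0.

```python
# Hypothetical confusion-matrix counts per group (illustrative only).
counts = {
    "group_a": {"fp": 40, "tn": 60, "tp": 60},
    "reference": {"fp": 20, "tn": 80, "tp": 30},
}


def fpr(c):
    # False positive rate: false positives over all actual negatives.
    return c["fp"] / (c["fp"] + c["tn"])


def fdr(c):
    # False discovery rate: false positives over all predicted positives.
    return c["fp"] / (c["fp"] + c["tp"])


fpr_disparity = fpr(counts["group_a"]) / fpr(counts["reference"])
fdr_disparity = fdr(counts["group_a"]) / fdr(counts["reference"])

print(fpr_disparity, fdr_disparity)
```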


**How does the selected reference group affect disparity calculations?**
Disparities calculated in the Aequitas Bias() class, based on the crosstab returned by the Group() class get_crosstabs() method, can be derived using several different base groups. In addition to the user-specified groups illustrated above, Aequitas can automate base group selection based on dataset characteristics.

Evaluating disparities calculated in relation to a different 'race' reference group
Changing even one attribute in the predefined groups will alter the calculated disparities. When a different predefined group, 'Hispanic', is used, we can see that Black people are 2.1 times more likely than Hispanic people to be falsely identified as being high or medium risk (compared to 1.9 times more likely than white people), and even less likely to be falsely identified as low risk when compared to Hispanic people rather than white people.
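Since every disparity is a ratio against the base group's metric, switching the base group amounts to dividing by the new base group's old disparity. The sketch below shows that re-basing step with hypothetical group names and values; `rebase` is an illustrative helper, not an Aequitas function.

```python
# Hypothetical FPR disparities relative to the base group "ref_1" (illustrative values).
disparity_vs_ref_1 = {"ref_1": 1.0, "ref_2": 0.8, "group_x": 2.0}


def rebase(disparities, new_ref):
    """Re-express disparity ratios relative to a different base group."""
    base = disparities[new_ref]
    return {group: value / base for group, value in disparities.items()}


disparity_vs_ref_2 = rebase(disparity_vs_ref_1, "ref_2")

print(disparity_vs_ref_2)
```

Here `group_x` moves from 2.0 to 2.5 because the new base group has a lower metric than the old one, which mirrors the direction of the change described above.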

In [None]:
import pandas as pd
import numpy as np

In [43]:
# Load the COMPAS data that was already transformed for Aequitas
compas_df = pd.read_csv("./data/compas_for_aequitas.csv")

In [44]:
compas_df.head()

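| Unnamed: 0 | entity_id | score | label_value | race | sex | age_cat |
|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 0 | Other | Male | Greater than 45 |
| 1 | 3 | 0.0 | 1 | African-American | Male | 25 - 45 |
| 2 | 4 | 0.0 | 1 | African-American | Male | Less than 25 |
| 3 | 5 | 1.0 | 0 | African-American | Male | Less than 25 |
| 4 | 6 | 0.0 | 0 | Other | Male | 25 - 45 |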
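To see how the `score` and `label_value` columns feed the FPR, the five rows shown above can be re-entered by hand and grouped with pandas. This is only a sketch of the calculation: five rows are far too few to say anything about bias, and the same steps would be applied to the full `compas_df`.

```python
import pandas as pd

# The five rows displayed above, typed in so the sketch runs without the CSV file.
sample = pd.DataFrame(
    {
        "entity_id": [1, 3, 4, 5, 6],
        "score": [0.0, 0.0, 0.0, 1.0, 0.0],
        "label_value": [0, 1, 1, 0, 0],
        "race": ["Other", "African-American", "African-American", "African-American", "Other"],
        "sex": ["Male"] * 5,
        "age_cat": ["Greater than 45", "25 - 45", "Less than 25", "Less than 25", "25 - 45"],
    }
)

# FPR = FP / (FP + TN): among actual negatives (label_value == 0),
# the share that received a positive score. The score is binary, so the mean is that share.
negatives = sample[sample["label_value"] == 0]
fpr_by_race = negatives.groupby("race")["score"].mean()

print(fpr_by_race)
```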
