# Measure fairness in a healthcare-related database using the Aequitas toolkit

Simone Callegarin - 932342 - 232592

## Imports
All the imports needed to run the notebook.

In [62]:
!pip install aequitas
import pandas as pd
import seaborn as sns
from aequitas.preprocessing import preprocess_input_df
from aequitas.group import Group
from aequitas.bias import Bias
from aequitas.fairness import Fairness
import aequitas.plot as ap

# import warnings; warnings.simplefilter('ignore')

%matplotlib inline



## Preprocess Data

### Dataset
The original Diabetes Dataset has been taken from [Diabetes 130-Hospitals](https://www.openml.org/search?type=data&status=active&id=43903).

This dataset represents 10 years of clinical care at 130 U.S. hospitals and delivery networks, collected from 1999 to 2008. 
Each record represents the hospital admission record for a patient diagnosed with diabetes whose stay lasted between one and fourteen days.
The features describing each encounter include demographics, diagnoses, diabetic medications, number of visits in the year preceding the encounter, and payer information, as well as whether the patient was readmitted after release, and whether the readmission occurred within 30 days of the release.

The original "Diabetes 130-Hospitals" dataset was collected by Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore in 2014.

---

An already preprocessed version of the original dataset was provided; it can be found [here](https://www.kaggle.com/datasets/mathchi/diabetes-data-set).

All patients here are females at least 21 years old of Pima Indian heritage.

The columns have the following meanings:

* Pregnancies: Number of times pregnant
* Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* BloodPressure: Diastolic blood pressure (mm Hg)
* SkinThickness: Triceps skin fold thickness (mm)
* Insulin: 2-Hour serum insulin (mu U/ml)
* BMI: Body mass index (weight in kg/(height in m)^2)
* DiabetesPedigreeFunction: Diabetes pedigree function
* Age: Age (years)
* Outcome: Class variable (1 if diabetic, 0 otherwise)


### Dataset download
The data is read from the CSV file into a DataFrame below.

In [79]:
df = pd.read_csv("https://raw.githubusercontent.com/SimoneCallegarin/TIS_project/master/raw_data/diabetes.csv")

df.head()

Unnamed: 0.1,Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,AgeCategory
0,0,6.0,148,72.0,35.0,30.5,33.6,0.627,50.0,1,1
1,1,1.0,85,66.0,29.0,30.5,26.6,0.351,31.0,0,1
2,2,8.0,183,64.0,23.0,30.5,23.3,0.672,32.0,1,1
3,3,1.0,89,66.0,23.0,94.0,28.1,0.167,21.0,0,0
4,4,0.0,137,40.0,35.0,168.0,43.1,0.374,33.0,1,1


#### Preprocessing
The next step is to preprocess the data so that Aequitas can assess the fairness of the dataset.

1. The *Outcome* column is renamed to *score*.

2. The *AgeCategory* column is renamed to *label_value* because it is the attribute on which to search for discrimination.

3. The first column, which was left unnamed, is renamed to *entity_id*, a reserved column name that lets us refer to each entity.

4. Continuous values are discretized by Aequitas, which first bins the data into quartiles and then creates crosstabs with the newly defined categories.

5. The final table is reordered so that it contains all the necessary attributes.

In [88]:
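# Rename the columns to the names Aequitas reserves for scores and labels
df = df.rename(columns={'Outcome': 'score',
                        'AgeCategory': 'label_value'})

# The first (unnamed) column becomes the entity identifier
df = df.rename(columns={df.columns[0]: 'entity_id'})

# Aequitas discretizes the continuous columns into quartile bins
df, _ = preprocess_input_df(df)

df = df[['entity_id', 'score', 'label_value', 'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']]

# to_csv cannot write to an HTTP URL, so save locally (this file name is an assumption)
df.to_csv("diabetes.csv")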

df.head()

Unnamed: 0,entity_id,score,label_value,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,0,1.0,1,3.00-6.00,141.00-199.00,64.00-72.00,31.00-45.00,14.00-30.50,32.00-36.30,0.55-1.00,40.00-66.00
1,1,0.0,1,0.00-1.00,44.00-99.00,64.00-72.00,23.00-31.00,14.00-30.50,18.20-27.50,0.24-0.37,29.00-40.00
2,2,1.0,1,6.00-13.00,141.00-199.00,38.00-64.00,10.00-23.00,14.00-30.50,18.20-27.50,0.55-1.00,29.00-40.00
3,3,0.0,0,0.00-1.00,44.00-99.00,64.00-72.00,10.00-23.00,31.25-105.00,27.50-32.00,0.08-0.24,21.00-24.00
4,4,1.0,1,0.00-1.00,117.00-141.00,38.00-64.00,31.00-45.00,105.00-272.00,36.30-50.00,0.24-0.37,29.00-40.00
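
The quartile binning described in step 4 can be illustrated with a minimal, standard-library sketch. It is not Aequitas's own implementation, which may handle bin edges and labels differently; the function name `quartile_bins` and the sample ages are hypothetical.

```python
from statistics import quantiles


def quartile_bins(values):
    """Assign each value to one of four quartile bins labelled 'low-high'."""
    # Three cut points split the data into four groups of similar size.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    edges = [min(values), q1, q2, q3, max(values)]
    labels = [f"{lo:.2f}-{hi:.2f}" for lo, hi in zip(edges, edges[1:])]

    def bin_of(value):
        # A value belongs to the first bin whose upper edge is >= the value.
        for upper, label in zip(edges[1:], labels):
            if value <= upper:
                return label
        return labels[-1]

    return [bin_of(v) for v in values]


# Hypothetical ages, used only to illustrate the binning.
ages = [21, 22, 24, 25, 29, 31, 33, 40, 50, 66]
bins = quartile_bins(ages)
print(bins)
```

Each continuous value is replaced by the label of the range it falls into, which is the shape of the values shown in the table above (for example `21.00-24.00` in the *Age* column).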


The result of this preprocessing phase is a table with the following dimensions:

In [83]:
df.shape

(763, 11)

Note that patients aged 25 or younger are assigned an AgeCategory (*label_value*) of 0, while all others are assigned 1.
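
The age rule can be written as a one-line function. This is only a sketch of the rule as stated in the text (the function name `age_category` is hypothetical), checked against the *Age* and *AgeCategory* values of the five rows shown by the first `df.head()` output.

```python
def age_category(age, threshold=25):
    """Return 0 for ages up to and including the threshold, 1 otherwise."""
    return 0 if age <= threshold else 1


# (Age, AgeCategory) pairs copied from the first five rows of the raw data.
head_rows = [(50.0, 1), (31.0, 1), (32.0, 1), (21.0, 0), (33.0, 1)]
for age, expected in head_rows:
    assert age_category(age) == expected
```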

In [84]:
df['score'].value_counts()

0    497
1    266
Name: score, dtype: int64

<a id='counts_description'></a>
The **`get_crosstabs()`** method tabulates a confusion matrix for each subgroup and calculates commonly used metrics such as false positive rate and false omission rate. It also provides counts by group and group prevalences.

#### Group Counts Calculated:

| Count Type | Column Name |
| --- | --- |
| False Positive Count | 'fp' |
| False Negative Count | 'fn' |
| True Negative Count | 'tn' |
| True Positive Count | 'tp' |
| Predicted Positive Count | 'pp' |
| Predicted Negative Count | 'pn' |
| Count of Negative Labels in Group | 'group_label_neg' |
| Count of Positive Labels in Group | 'group_label_pos' |
| Group Size | 'group_size' |
| Total Entities | 'total_entities' |
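
As a rough illustration of what these counts mean, the sketch below tabulates them for a single subgroup from `(score, label_value)` pairs. It is plain Python rather than the Aequitas implementation; the function name `confusion_counts` and the toy pairs are hypothetical.

```python
def confusion_counts(pairs):
    """Tabulate confusion-matrix counts from (score, label_value) pairs."""
    tp = sum(1 for score, label in pairs if score == 1 and label == 1)
    fp = sum(1 for score, label in pairs if score == 1 and label == 0)
    fn = sum(1 for score, label in pairs if score == 0 and label == 1)
    tn = sum(1 for score, label in pairs if score == 0 and label == 0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "pp": tp + fp,               # predicted positive
        "pn": fn + tn,               # predicted negative
        "group_label_pos": tp + fn,  # positive labels in the group
        "group_label_neg": fp + tn,  # negative labels in the group
        "group_size": len(pairs),
    }


# Toy subgroup of six entities, invented for illustration.
toy_pairs = [(1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (1, 1)]
counts = confusion_counts(toy_pairs)
print(counts)
```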

#### Absolute Metrics Calculated:

| Metric | Column Name |
| --- | --- |
| True Positive Rate | 'tpr' |
| True Negative Rate | 'tnr' |
| False Omission Rate | 'for' |
| False Discovery Rate | 'fdr' |
| False Positive Rate | 'fpr' |
| False Negative Rate | 'fnr' |
| Negative Predictive Value | 'npv' |
| Precision | 'precision' |
| Predicted Positive Ratio$_k$ | 'ppr' |
| Predicted Positive Ratio$_g$ | 'pprev' |
| Group Prevalence | 'prev' |


In [85]:
g = Group()
xtab, _ = g.get_crosstabs(df)
absolute_metrics = g.list_absolute_metrics(xtab)
xtab[[col for col in xtab.columns if col not in absolute_metrics]]

Unnamed: 0,model_id,score_threshold,k,attribute_name,attribute_value,pp,pn,fp,fn,tn,tp,group_label_pos,group_label_neg,group_size,total_entities
0,0,binary 0/1,266,Pregnancies,0.00-1.00,67,176,23,55,121,44,99,144,243,763
1,0,binary 0/1,266,Pregnancies,1.00-3.00,50,132,16,51,81,34,85,97,182,763
2,0,binary 0/1,266,Pregnancies,3.00-6.00,58,115,5,99,16,53,152,21,173,763
3,0,binary 0/1,266,Pregnancies,6.00-13.00,91,74,1,73,1,90,163,2,165,763
4,0,binary 0/1,266,Glucose,117.00-141.00,74,116,16,74,42,58,132,58,190,763
5,0,binary 0/1,266,Glucose,141.00-199.00,130,57,19,43,14,111,154,33,187,763
6,0,binary 0/1,266,Glucose,44.00-99.00,14,178,2,80,98,12,92,100,192,763
7,0,binary 0/1,266,Glucose,99.00-117.00,48,146,8,81,65,40,121,73,194,763
8,0,binary 0/1,266,BloodPressure,38.00-64.00,45,152,15,46,106,30,76,121,197,763
9,0,binary 0/1,266,BloodPressure,64.00-72.00,83,143,12,82,61,71,153,73,226,763


In [74]:
xtab[['attribute_name', 'attribute_value'] + absolute_metrics].round(2)

Unnamed: 0,attribute_name,attribute_value,tpr,tnr,for,fdr,fpr,fnr,npv,precision,ppr,pprev,prev
0,Pregnancies,0.00-1.00,0.44,0.84,0.31,0.34,0.16,0.56,0.69,0.66,0.25,0.28,0.41
1,Pregnancies,1.00-3.00,0.4,0.84,0.39,0.32,0.16,0.6,0.61,0.68,0.19,0.27,0.47
2,Pregnancies,3.00-6.00,0.35,0.76,0.86,0.09,0.24,0.65,0.14,0.91,0.22,0.34,0.88
3,Pregnancies,6.00-13.00,0.55,0.5,0.99,0.01,0.5,0.45,0.01,0.99,0.34,0.55,0.99
4,Glucose,117.00-141.00,0.44,0.72,0.64,0.22,0.28,0.56,0.36,0.78,0.28,0.39,0.69
5,Glucose,141.00-199.00,0.72,0.42,0.75,0.15,0.58,0.28,0.25,0.85,0.49,0.7,0.82
6,Glucose,44.00-99.00,0.13,0.98,0.45,0.14,0.02,0.87,0.55,0.86,0.05,0.07,0.48
7,Glucose,99.00-117.00,0.33,0.89,0.55,0.17,0.11,0.67,0.45,0.83,0.18,0.25,0.62
8,BloodPressure,38.00-64.00,0.39,0.88,0.3,0.33,0.12,0.61,0.7,0.67,0.17,0.23,0.39
9,BloodPressure,64.00-72.00,0.46,0.84,0.57,0.14,0.16,0.54,0.43,0.86,0.31,0.37,0.68
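
The absolute metrics follow directly from the group counts. As a sanity check, the sketch below recomputes the first row of the table (*Pregnancies*, `0.00-1.00`) from its counts in the previous output (`tp=44`, `fp=23`, `fn=55`, `tn=121`, `k=266`). The helper name `metrics_from_counts` is hypothetical; the formulas are the standard confusion-matrix definitions.

```python
def metrics_from_counts(tp, fp, fn, tn, k):
    """Compute the absolute group metrics from confusion-matrix counts."""
    pp, pn = tp + fp, fn + tn    # predicted positive / negative
    pos, neg = tp + fn, fp + tn  # positive / negative labels
    size = pos + neg
    return {
        "tpr": tp / pos, "tnr": tn / neg,
        "for": fn / pn, "fdr": fp / pp,
        "fpr": fp / neg, "fnr": fn / pos,
        "npv": tn / pn, "precision": tp / pp,
        "ppr": pp / k,        # share of all predicted positives in this group
        "pprev": pp / size,   # share of this group predicted positive
        "prev": pos / size,   # share of this group with a positive label
    }


row = metrics_from_counts(tp=44, fp=23, fn=55, tn=121, k=266)
rounded = {name: round(value, 2) for name, value in row.items()}
print(rounded)
```

The rounded values agree with the first row of the table above, for example `tpr = 44 / 99 ≈ 0.44` and `fpr = 23 / 144 ≈ 0.16`.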


In [72]:
b = Bias()

In [73]:
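# The reference groups must be attributes and bins that exist in this dataset.
# One bin per attribute is assumed here, taken from the values in the tables above.
ref_groups = {'Pregnancies': '0.00-1.00', 'Glucose': '44.00-99.00',
              'BloodPressure': '38.00-64.00', 'SkinThickness': '10.00-23.00',
              'Insulin': '14.00-30.50', 'BMI': '18.20-27.50',
              'DiabetesPedigreeFunction': '0.08-0.24', 'Age': '21.00-24.00'}
bdf = b.get_disparity_predefined_groups(xtab, original_df=df, ref_groups_dict=ref_groups, alpha=0.05, mask_significance=True)
bdf.style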

In [None]:
calculated_disparities = b.list_disparities(bdf)
disparity_significance = b.list_significance(bdf)
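
Each `*_disparity` column is the ratio between a group's metric and the same metric for the reference group of that attribute. The sketch below recomputes the false positive rate disparity for the *Pregnancies* bins from the `fp` and `group_label_neg` counts in the crosstab output, assuming `0.00-1.00` is chosen as the reference bin. The helper name `disparities` is hypothetical.

```python
def disparities(metric_by_group, reference_group):
    """Divide each group's metric by the reference group's metric."""
    reference_value = metric_by_group[reference_group]
    return {group: value / reference_value
            for group, value in metric_by_group.items()}


# fpr = fp / group_label_neg, using the counts from the crosstab output.
fpr = {
    "0.00-1.00": 23 / 144,
    "1.00-3.00": 16 / 97,
    "3.00-6.00": 5 / 21,
    "6.00-13.00": 1 / 2,
}
fpr_disparity = disparities(fpr, reference_group="0.00-1.00")
print({group: round(value, 2) for group, value in fpr_disparity.items()})
```

The reference bin always has a disparity of 1.0. Note that the `6.00-13.00` bin has only two negative labels, so its ratio rests on very few observations.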

In [None]:
bdf[['attribute_name', 'attribute_value'] +  calculated_disparities + disparity_significance]



In [23]:
f = Fairness()
fdf = f.get_group_value_fairness(bdf)
parity_determinations = f.list_parities(fdf)
fdf[['attribute_name', 'attribute_value'] + absolute_metrics + calculated_disparities + parity_determinations].style

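
The parity determinations rest on a threshold test applied to each disparity. A minimal sketch, assuming the default Aequitas threshold `tau = 0.8` (the "80% rule"): a group passes when its disparity lies between `tau` and `1 / tau`. The function name `within_parity` is hypothetical and the disparity values are illustrative ratios only.

```python
def within_parity(disparity, tau=0.8):
    """Return True when the disparity lies inside the band [tau, 1 / tau]."""
    return tau <= disparity <= 1 / tau


# Illustrative disparity values: one reference group and three other groups.
examples = [1.0, 1.03, 0.5, 3.13]
results = {value: within_parity(value) for value in examples}
print(results)
```

A disparity of 1.0 (the reference group itself) always passes, while ratios far below 0.8 or far above 1.25 are flagged as unfair for that metric.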