<h1><center> Lab 5: Data Science, fairness metrics and avoiding bias </center></h1>

This week (Week 5) we are learning about ethical aspects of data science and about conceptual tools to think through ethical questions in this field. Several examples of bias and algorithmic discrimination in data science and Artificial Intelligence are discussed (e.g., discrimination in Amazon Prime and COMPAS). Several metrics of fairness are also discussed (e.g., Demographic Parity and Equalised Odds).

In this Lab we will learn how to deal with bias and discrimination in practice. We will practise applying new (Python) libraries to identify and mitigate bias in machine learning classification.

As some Werkcollege sessions take place before the lectures of Week 5, we suggest that, before this Lab, you watch the 7-minute [video](https://canvas.uva.nl/courses/25726/pages/lab-5-preparation-video-fairness-metrics?module_item_id=1062765) we added in *Canvas > Modules > Week 5: Lab 5 preparation - Video fairness metrics*.

This notebook will also be an opportunity to exercise the contents of the previous week (Week 4) on **Decision Trees**.

We will use [Fairlearn](https://fairlearn.org), an open-source, community-driven project to help data scientists improve fairness of AI systems.

<h3><center> Fairlearn </center></h3>

The exercise we will work on in this lab is based on the examples provided in the [Fairlearn documentation](https://fairlearn.org/v0.7.0/quickstart.html). Specifically, <span style="color:red">this example is a slightly adapted version from this</span> [notebook](https://github.com/fairlearn/fairlearn/blob/main/notebooks/Binary%20Classification%20with%20the%20UCI%20Credit-card%20Default%20Dataset.ipynb). If you are interested in knowing more about the problem analyzed here, you can also check this [white paper](https://www.microsoft.com/en-us/research/uploads/prod/2020/09/Fairlearn-EY_WhitePaper-2020-09-22.pdf); this is not necessary to complete the Lab, though.

1. First we will load and apply a typical classification algorithm to a [banking loan UCI dataset](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients). We will cover the application of Decision Trees.
2. We will then use metrics implemented in *Fairlearn* to evaluate how fair our baseline model is.
3. Finally, we will apply mitigation strategies to try to make our classifier less biased. This part is optional for Lab 5; we will explore this further in the lectures of Week 6.

In [9]:
# First of all, install fairlearn

!pip install fairlearn



In [12]:
# General imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import fairlearn 

# Data processing
from sklearn.model_selection import train_test_split

# Fairlearn algorithms and utils
from fairlearn.postprocessing import ThresholdOptimizer, plot_threshold_optimizer
from fairlearn.reductions import GridSearch, EqualizedOdds

# Metrics
from fairlearn.metrics import (
    MetricFrame,
    selection_rate, 
    demographic_parity_difference, 
    demographic_parity_ratio,
    false_positive_rate, 
    false_negative_rate,
    false_positive_rate_difference, 
    false_negative_rate_difference,
    equalized_odds_difference)

from sklearn.metrics import balanced_accuracy_score, roc_auc_score


The UCI *Default of Credit Card Clients* dataset (Taiwan) contains data on 30,000 clients and their credit card transactions at a bank in Taiwan. In addition to static client features (SEX, EDUCATION, MARRIAGE, AGE), the dataset contains the history of credit card bill payments between April and September 2005, as well as the balance limit of the client's credit card. The target is whether the client will default on a card payment in the following month, October 2005. A model trained on this data could be used, in part, to determine whether a client is eligible for another loan or a credit increase.

Dataset columns / predictors:

* `LIMIT_BAL`: credit card limit; will be replaced by a synthetic feature
* `SEX, EDUCATION, MARRIAGE, AGE`: client demographic features
* `BILL_AMT[1-6]`: amount on bill statement for April-September
* `PAY_AMT[1-6]`: payment amount for April-September
* `default payment next month`: target, whether the client defaulted the following month

In [None]:
# Load the data: UCI Default of credit card clients Data Set 
data_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
dataset = pd.read_excel(io=data_url, header=1).drop(columns=['ID']).rename(columns={'PAY_0':'PAY_1'})
print("Dataset features/predictors: ", dataset.columns[:-1])
print("Dataset response/output: ", dataset.columns[-1])
dataset.head()

### Introduce a Synthetic Feature

We are now going to manipulate the balance-limit feature `LIMIT_BAL` to make it highly predictive for the "female" group but not for the "male" group. Specifically, we set this up, so that a lower credit limit indicates that a female client is less likely to default, but provides no information on a male client's probability of default.

After the transformation, `LIMIT_BAL` will be an arbitrary feature that, for one group (female), is associated with default. It is uninformative for the male clients. 

Note that we are not implying that one of the groups is more likely to default than the other, as the Figures and metrics below will reveal. We simply assume that one group has a feature that is highly predictive.  
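One way to check this kind of construction numerically is to look at the correlation between the feature and the target within each group. The sketch below is a toy illustration under stated assumptions: the group codes, sample size, seed and variable names are hypothetical and are not the lab's data.

```python
import numpy as np

rng = np.random.default_rng(0)  # toy seed, not the one used in the notebook

n = 10_000
group = rng.integers(1, 3, size=n)  # hypothetical group codes: 1 or 2
y = rng.integers(0, 2, size=n)      # toy binary target
noise_sd = 0.5

# Start with a feature that is informative for everyone: target plus Gaussian noise
feature = y + rng.normal(scale=noise_sd, size=n)

# Overwrite group 1 with pure noise, so the feature carries no information there
mask = group == 1
feature[mask] = rng.normal(scale=noise_sd, size=mask.sum())

# Correlation between feature and target, computed separately per group
corr_group1 = np.corrcoef(feature[mask], y[mask])[0, 1]
corr_group2 = np.corrcoef(feature[~mask], y[~mask])[0, 1]
print(f"group 1: {corr_group1:.2f}, group 2: {corr_group2:.2f}")
```

In this toy setup the correlation should be close to zero for group 1 and clearly positive for group 2, even though both groups have the same base rate of the target by construction.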

In [None]:
# Extract the sensitive feature
A = dataset["SEX"]
A_str = A.map({ 2:"female", 1:"male"})

# Extract the target
Y = dataset["default payment next month"]

categorical_features = ['EDUCATION', 'MARRIAGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
for col in categorical_features:
    dataset[col] = dataset[col].astype('category')

In [None]:
# dist_standardDev controls the extent to which 'LIMIT_BAL' will be correlated with Y for the female group
# lower values of dist_standardDev imply a stronger correlation
dist_standardDev = 0.5
np.random.seed(12345)

# Make 'LIMIT_BAL' informative of the target; make it correlate with Y
dataset['LIMIT_BAL'] = Y + np.random.normal(scale=dist_standardDev, size=dataset.shape[0])

# But then make it uninformative for the male clients; make it independent of Y
dataset.loc[A==1, 'LIMIT_BAL'] = np.random.normal(scale=dist_standardDev, size=dataset[A==1].shape[0])
dataset.head()

In the following plots we will visualize the distribution of feature 'LIMIT_BAL' in the four populations (Male-Default, Male-Non-Default, Female-Default and Female-Non-Default). 

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)


# Plot distribution of LIMIT_BAL for men that pay on time (red, solid line) and those that default (blue, dashed)
dataset['LIMIT_BAL'][(A==1) & (Y==0)].plot(kind='kde', label="Payment on time", ax=ax1, 
                                           title="LIMIT_BAL distribution for \"male\" group", style='r-')

dataset['LIMIT_BAL'][(A==1) & (Y==1)].plot(kind='kde', label="Default", ax=ax1, style='b--')


# Plot distribution of LIMIT_BAL for women that pay on time (red, solid line) and those that default (blue, dashed)
dataset['LIMIT_BAL'][(A==2) & (Y==0)].plot(kind='kde', label="Payment on time", ax=ax2, 
                                           legend=True, title="LIMIT_BAL distribution for \"female\" group", style='r-')

dataset['LIMIT_BAL'][(A==2) & (Y==1)].plot(kind='kde', label="Default", ax=ax2, 
                                           legend=True, style='b--').legend(bbox_to_anchor=(-1.3, 1))

plt.show()

**Q1: Let's refresh your Pandas indexing skills: can you calculate the fraction of male and female customers that default?**

**Q2: Can you see why the new `LIMIT_BAL` feature is highly predictive for the "female" group, but not for the "male" group? Does that imply that one of the groups is more likely to default? Use the figures above to guide your answers.**

Now we will apply the Scikit-learn Estimator API, as we discussed in previous Labs (see Lab 3), to fit a Decision Tree that learns to classify individuals into classes Default and Non-Default. 

To review materials related to Decision Trees, please check the Lecture 7 slides, Chapter 8.1 of the [ISLR book](https://www.statlearning.com) and the chapter *In-Depth: Decision Trees and Random Forests* of the [PDSH book](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html).

**Q3: Can you complete, below, the missing line to use a Decision Tree with maximum depth of 2?**

In [None]:
# 1. Import model

from sklearn.tree import DecisionTreeClassifier

# 2. Choose model hyperparameters

# MISSING LINE,
# model = (...)

# 3. Arrange data into a features matrix and target vector; split data into train and test sets
# As we are not tuning hyperparameters, for now it suffices to split the data into train and test (and skip validation)

df_train, df_test, Y_train, Y_test, A_train, A_test, A_str_train, A_str_test = train_test_split(
    dataset.drop(columns=['SEX', 'default payment next month']), 
    Y, 
    A, 
    A_str,
    test_size = 0.3, 
    random_state=12345,
    stratify=Y)

# 4. Fit model to data
decision_tree = model.fit(df_train, Y_train)

In [None]:
from sklearn.tree import plot_tree

plt.figure(figsize=(12,12))
# Use the training columns as feature names: the model never saw 'SEX' or the target column
plot_tree(decision_tree, filled=True, feature_names=list(df_train.columns))
plt.show()

**Q4: What is the meaning of *gini*, *samples* and *value* in the plot above?**

We notice that the synthetic feature `LIMIT_BAL` appears as the most important feature in this model, although it has no predictive power for an entire demographic segment in the data. We can observe this by plotting the Decision Tree we trained. Notice `LIMIT_BAL` at the root of our tree.

Below we use the trained Decision Tree to classify the test set (`df_test`).
The predictions are then used to compute the confusion matrix and evaluate the model.

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# 5. Predict labels for unknown data and evaluate
y_pred = model.predict(df_test)

cm = confusion_matrix(Y_test, y_pred, labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,display_labels=model.classes_)
disp.plot()
plt.show()

**Q5: Compute the Sensitivity (or True Positive Rate), Specificity (or True Negative Rate), False Positive Rate, False Negative Rate and Precision for the previous confusion matrix.**

*Tip: Note that (1-Specificity) is also known as the False Positive Rate and (1-Sensitivity) the False Negative Rate.*

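Expected values:<br>

Sensitivity (True Positive Rate): 0.5760924158714213<br>
Specificity (True Negative Rate): 0.9161078613211585<br>
False Positive Rate: 0.08389213867884149<br>
False Negative Rate: 0.42390758412857865<br>
Precision: 0.6610951008645534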
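As a reminder of how these rates relate to the four cells of a binary confusion matrix, here is a minimal sketch. The function name `rates_from_confusion` and the counts are hypothetical and are not taken from the lab's model. With scikit-learn's layout (rows are true labels, columns are predicted labels, labels ordered 0 then 1), the four counts can be unpacked with `tn, fp, fn, tp = cm.ravel()`.

```python
def rates_from_confusion(tn, fp, fn, tp):
    """Return common rates computed from binary confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
        "false_negative_rate": fn / (fn + tp),  # equals 1 - sensitivity
        "precision": tp / (tp + fp),
    }


# Hypothetical counts, for illustration only
rates = rates_from_confusion(tn=80, fp=20, fn=10, tp=40)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```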

In [None]:
# Train AUC
roc_auc_score(Y_train, model.predict_proba(df_train)[:, 1])

**Q6: What is the meaning of the value AUC above?**

Above, we computed Sensitivity (True Positive Rate) and Specificity (True Negative Rate) for the whole test set, without distinguishing protected groups.

We next use Fairlearn's `MetricFrame` to examine the two different kinds of errors (False Positive Rate and False Negative Rate) for different groups.

With `MetricFrame` (see [here](https://fairlearn.org/v0.5.0/api_reference/fairlearn.metrics.html) for more details) we can easily compute the previous metrics independently for the two groups. The groups are specified with the option *sensitive_features*. Recall that <code>A_str_test</code> has info on who belongs to each group.

Note that both kinds of errors are more common in the "male" group than in the "female" group.

Note the difference in False Negative Rate we are obtaining for the two groups.
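Conceptually, a by-group breakdown amounts to computing the same metric on each group's subset of the data. The sketch below reproduces that idea by hand with NumPy on toy arrays. It is an illustration under stated assumptions, not Fairlearn's implementation, and the helper `fpr_fnr`, the labels and the group names are hypothetical.

```python
import numpy as np


def fpr_fnr(y_true, y_pred):
    """Return (false positive rate, false negative rate) for binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return float(fp / np.sum(y_true == 0)), float(fn / np.sum(y_true == 1))


# Toy data: two hypothetical groups "a" and "b"
y_true = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 0, 0, 0, 1, 1])
groups = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

# Apply the same metric to each group's rows separately
by_group = {
    str(g): fpr_fnr(y_true[groups == g], y_pred[groups == g])
    for g in np.unique(groups)
}
print(by_group)
```

In this toy example all the errors fall on group "a", which is the kind of pattern a by-group table makes visible and an overall error rate hides.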

In [None]:
mf = MetricFrame(metrics={
    'FPR': false_positive_rate,
    'FNR': false_negative_rate},
    y_true=Y_test, y_pred=y_pred, sensitive_features=A_str_test)

mf.by_group

**Q7: Let's reflect on those values: we are classifying datapoints as Default (Positive) or Non-default (Negative). The bank is likely to offer credit to individuals that are anticipated not to default (Negative). What does it mean that the female group has a lower FNR than the male group?**

We now calculate several performance and fairness metrics below:

In [None]:
metrics_dict = {
        "Overall selection rate":  selection_rate(Y_test, y_pred, pos_label=0),
        "Demographic parity difference": demographic_parity_difference(Y_test, y_pred, sensitive_features=A_str_test),
        "Demographic parity ratio": demographic_parity_ratio(Y_test, y_pred, sensitive_features=A_str_test),
        "------": "",
        "False positive rate difference": false_positive_rate_difference(Y_test, y_pred, sensitive_features=A_str_test),
        "False negative rate difference": false_negative_rate_difference(Y_test, y_pred, sensitive_features=A_str_test),
        "Equalized odds difference": equalized_odds_difference(Y_test, y_pred, sensitive_features=A_str_test),
         "-------": "",
        "Overall AUC": roc_auc_score(Y_test, y_pred),
        "AUC difference": MetricFrame(metrics=roc_auc_score, y_true=Y_test, y_pred=y_pred, sensitive_features=A_str_test).difference(method='between_groups'),
    }

pd.DataFrame.from_dict(metrics_dict, orient="index", columns=["Metrics"])

As the overall performance metric we use the _area under ROC curve_ (AUC), which is suited to classification problems with a large imbalance between positive and negative examples.
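To illustrate why plain accuracy can mislead on imbalanced data, the sketch below compares accuracy with AUC for a trivial classifier that always predicts the negative class. The helper `pairwise_auc` is a hypothetical, hand-written version of AUC (the fraction of positive/negative pairs that are ranked correctly, with ties counting one half), and the 90/10 class split is made up for illustration.

```python
import numpy as np


def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive score minus every negative score
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


# Toy imbalanced data: 90 negatives and 10 positives
y_true = np.array([0] * 90 + [1] * 10)
always_negative = np.zeros(100)  # a "classifier" that never predicts default

accuracy = float(np.mean(always_negative == y_true))
auc = pairwise_auc(y_true, always_negative)
print(f"accuracy: {accuracy:.2f}, AUC: {auc:.2f}")
```

The trivial classifier looks good on accuracy simply because negatives dominate, while its AUC stays at the chance level.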

As the fairness metric we use *equalized odds difference*, which quantifies the disparity in accuracy experienced by different demographics. Our goal is to ensure that neither of the two groups ("male" vs "female") has substantially larger false-positive rates or false-negative rates than the other group.
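The table also reports demographic parity metrics, which compare selection rates (the fraction of each group that is predicted positive) rather than error rates. A minimal by-hand sketch on toy predictions follows. It assumes that label 1 counts as "selected", that the difference is the largest minus the smallest group rate, and that the ratio is the smallest divided by the largest. The arrays and group names are hypothetical.

```python
import numpy as np

# Toy predictions for two hypothetical groups "a" and "b"
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
groups = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

# Selection rate per group: share of that group predicted as 1
selection_rates = {
    str(g): float(np.mean(y_pred[groups == g] == 1)) for g in np.unique(groups)
}

# Gap and ratio between the most and least selected groups
dp_difference = max(selection_rates.values()) - min(selection_rates.values())
dp_ratio = min(selection_rates.values()) / max(selection_rates.values())
print(selection_rates, dp_difference, dp_ratio)
```

A difference of 0 (or a ratio of 1) would mean both groups are selected at the same rate, regardless of their true labels.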

The table above shows an overall AUC of 0.75. There is a large disparity if we consider the equalized-odds difference.

**Q8: Why is the *equalized odds difference* the same as the *False negative rate difference*? Check the [Fairlearn documentation](https://fairlearn.org/v0.4.6/api_reference/fairlearn.metrics.html) to find an answer**

**Q9: Can you re-write the code where 'LIMIT_BAL' is changed to make it less predictive for the 'female' group?** 

**What is the variable that now is used at the root of the decision tree?**

**What changes in the values of Equalized odds difference and Overall AUC? Do you understand why that happens?**

Tip: to make 'LIMIT_BAL' less predictive you can increase the value <code>dist_standardDev</code> in <code>dist_standardDev = 0.5</code> <br>
or remove  <code> Y + </code> in <code>dataset['LIMIT_BAL'] = Y + np.random.normal(scale=dist_standardDev, size=dataset.shape[0])</code>

## Optional Material

Note: We are going to learn about methods to attenuate bias in supervised learning in Week 6 (Lecture 11). The example below will become clear after that lecture. No need to advance to this example in Lab 5 -- do it if you have spare time and are curious about how to tackle the biases identified above.

## Mitigating Equalized Odds Difference with Postprocessing

We will now attempt to mitigate the disparities in our model using the Fairlearn postprocessing algorithm `ThresholdOptimizer`.

In Lecture 7 we discussed the idea that, depending on the problem we are facing, we can use different classification thresholds. The toy example provided was a poisonous mushroom classifier that classifies images as Positive (Poisonous) or Negative (Non-poisonous). We might want to play it safe and classify a mushroom as Positive even when the predicted probability of class Poisonous is lower than 0.5, that is, use a threshold below 0.5.
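The effect of moving the threshold can be sketched with a handful of made-up probabilities. The arrays and the helper `error_counts` below are hypothetical and only illustrate the trade-off between false negatives and false positives.

```python
import numpy as np

# Hypothetical predicted probabilities of "poisonous" and the true labels
p_poisonous = np.array([0.05, 0.20, 0.40, 0.35, 0.45, 0.60, 0.90])
is_poisonous = np.array([0, 0, 0, 1, 1, 1, 1])


def error_counts(threshold):
    """Return (false negatives, false positives) when predicting 1 if p >= threshold."""
    predicted = (p_poisonous >= threshold).astype(int)
    fn = int(np.sum((is_poisonous == 1) & (predicted == 0)))
    fp = int(np.sum((is_poisonous == 0) & (predicted == 1)))
    return fn, fp


default_errors = error_counts(0.5)   # standard threshold
cautious_errors = error_counts(0.3)  # lower threshold, "play it safe"
print(default_errors, cautious_errors)
```

With the lower threshold, no poisonous mushroom is missed in this toy data, at the price of one harmless mushroom being flagged.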

We will now use the same idea (applying different classification thresholds) to correct the bias in our model's predictions.

The `ThresholdOptimizer` algorithm finds a suitable threshold for the scores (class probabilities) by optimizing the accuracy rate under the constraint that the equalized odds difference (on training data) is zero.
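The core idea of group-specific thresholds can be sketched as follows. This is a simplified, deterministic illustration with hand-picked thresholds, not Fairlearn's procedure: `ThresholdOptimizer` chooses its thresholds from the training data and may differ in detail. The scores, group names and threshold values are all hypothetical.

```python
import numpy as np

# Toy scores for two hypothetical groups "a" and "b"
scores = np.array([0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8])
groups = np.array(["a"] * 4 + ["b"] * 4)

# Hand-picked thresholds, one per group (an assumption for illustration)
thresholds = {"a": 0.5, "b": 0.3}

# Look up each row's threshold from its group, then compare score against it
row_threshold = np.array([thresholds[str(g)] for g in groups])
y_pred = (scores >= row_threshold).astype(int)
print(y_pred)
```

Identical scores can therefore lead to different decisions depending on the group, which is why the sensitive feature must also be supplied at prediction time.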

In [None]:
postprocess_est = ThresholdOptimizer(
    estimator=model,
    constraints="equalized_odds",
    prefit=True,
    predict_method='predict')

In [None]:
postprocess_est.fit(df_train, Y_train, sensitive_features=A_train)

In [None]:
postprocess_preds = postprocess_est.predict(df_test, sensitive_features=A_test)

In [None]:
metrics_dict = {
        "Overall selection rate":  selection_rate(Y_test, postprocess_preds, pos_label=0),
        "Demographic parity difference": demographic_parity_difference(Y_test, postprocess_preds, sensitive_features=A_str_test),
        "Demographic parity ratio": demographic_parity_ratio(Y_test, postprocess_preds, sensitive_features=A_str_test),
        "------": "",
        "False positive rate difference": false_positive_rate_difference(Y_test, postprocess_preds, sensitive_features=A_str_test),
        "False negative rate difference": false_negative_rate_difference(Y_test, postprocess_preds, sensitive_features=A_str_test),
        "Equalized odds difference": equalized_odds_difference(Y_test, postprocess_preds, sensitive_features=A_str_test),
         "-------": "",
        "Overall AUC": roc_auc_score(Y_test, postprocess_preds),
        "AUC difference": MetricFrame(metrics=roc_auc_score, y_true=Y_test, y_pred=postprocess_preds, sensitive_features=A_str_test).difference(method='between_groups'),
    }

pd.DataFrame.from_dict(metrics_dict, orient="index", columns=["Metrics"])

The `ThresholdOptimizer` algorithm significantly reduces the disparity according to multiple fairness metrics. However, the performance metrics (balanced error rate as well as AUC) are worse than for the fairness-unaware model. Before deploying such a model in practice, it would be important to examine in more detail why we observe such a sharp trade-off.