
<center><img src="https://raw.githubusercontent.com/dssg/aequitas/master/docs/_images/aequitas_logo.svg" width="450"></center>

# Correcting the predictions of a Model

In this notebook we will first **load a Machine Learning model** created through an **`Experiment`** of **Aequitas Flow**. We will  measure its performance and run a **fairness audit**  using the application-specific configurations.

We will then apply a **post-processing method to correct the predictions**, and observe any **changes in fairness** and **performance**.

---
## Initial Setup

This section covers the initial setup required for the notebook. We'll be **installing the most recent version of Aequitas**.

> ⚠️ **This notebook assumes that an ML Model has already been trained**. ⚠️

We'll also be retrieving the model pickle file from a previous experiment, downloading it directly from the [Aequitas Repository](https://github.com/dssg/aequitas/tree/master/examples). However, the notebook supports the use of other models or datasets.

In [24]:
# Install Aequitas
!pip install "aequitas==1.0.0" &> /dev/null
# This only needs to run once, or after your runtime environment gets deleted.

In [25]:
# This will avoid double logging in Colab
from aequitas.flow.utils.logging import clean_handlers

clean_handlers()

In [26]:
# This cell will download a model from the repository. You do not need to run it if you have your own model.
from aequitas.flow.utils.colab import get_examples

get_examples("experiment_results")

[INFO] 2024-08-23 14:30:10 utils.colab - Downloading examples from fairflow repository.
[INFO] 2024-08-23 14:30:12 utils.colab - Examples downloaded.


---
## Loading the model & datasets

In this section we will load the model for the audit and the evaluation datasets.

If you are testing your own model, make sure to upload it to the Colab environment (or whichever environment you are running this notebook in).

Starting with the model:

In [27]:
pickle_path = "examples/experiment_results/lgbm_baf_sample.pickle"

In [28]:
# Change this cell if your model is loaded in a different form.
import pickle

with open(pickle_path, "rb") as f:
    model = pickle.load(f)



In this example, we are using a sample of the **BankAccountFraud** dataset. This dataset presents a predictive task of detecting fraudulent attempts of bank account opening.

In case you want to use a different dataset, make sure it is loaded as a pandas DataFrame. If possible, configure an `aequitas.flow.datasets.GenericDataset`, so that fewer changes are needed in other cells.

Now we will load the dataset:

In [29]:
from aequitas.flow.methods.base_estimator import LightGBM
from aequitas.flow.datasets import BankAccountFraud

dataset = BankAccountFraud("Sample")
dataset.load_data()
dataset.create_splits()

validation = dataset.validation
test = dataset.test

[INFO] 2024-08-23 14:30:12 datasets.BankAccountFraud - Instantiating a BankAccountFraud dataset.
[INFO] 2024-08-23 14:30:12 datasets.BankAccountFraud - Loading data from /usr/local/lib/python3.10/dist-packages/aequitas/datasets/BankAccountFraud/Sample.parquet


---
## Obtaining the predictions and thresholding the model

In the following cells, we will generate the predictions with the model and the dataset, and create a threshold for it.

In [30]:
# If your model is not from
preds_val = model.predict_proba(validation.X, validation.s)
preds_test = model.predict_proba(test.X, test.s)

We use the `Threshold` from Aequitas Flow for thresholding. In the specific use-case of the BankAccountFraud dataset, we are aiming for a positive prediction rate of 5%. You can adjust that by changing the instantiation of this class.

The `Threshold` will be fitted on the validation set and then used to binarize the test set. Note that you can fit the threshold on the same set you binarize, but this might lead to overfitting.
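To make the fit-then-binarize idea concrete, here is a minimal pure-Python sketch of top-percentage thresholding. It is not the Aequitas Flow implementation: the helper names `fit_top_pct_threshold` and `binarize`, as well as the toy scores, are assumptions made for illustration only.

```python
import math


def fit_top_pct_threshold(scores, top_pct):
    """Return the score cut-off that flags roughly the top `top_pct` share of `scores`."""
    if not 0 < top_pct <= 1:
        raise ValueError("top_pct must be in (0, 1].")
    ranked = sorted(scores, reverse=True)
    # Number of instances we want to flag as positive (at least one).
    n_positive = max(1, math.floor(len(ranked) * top_pct))
    # The lowest score that is still inside the flagged group.
    return ranked[n_positive - 1]


def binarize(scores, cutoff):
    """Flag every score at or above the cut-off as positive (1)."""
    return [1 if score >= cutoff else 0 for score in scores]


# Toy scores standing in for the validation and test predictions.
toy_val_scores = [i / 100 for i in range(100)]  # 0.00, 0.01, ..., 0.99
toy_test_scores = [0.10, 0.50, 0.94, 0.95, 0.99]

# Fit the cut-off on "validation", then reuse it unchanged on "test".
toy_cutoff = fit_top_pct_threshold(toy_val_scores, top_pct=0.05)
toy_bin_preds = binarize(toy_test_scores, toy_cutoff)
print(toy_cutoff, toy_bin_preds)
```

The cut-off is learned once and then applied as a fixed number, which is why the share of flagged instances on the test set can differ somewhat from 5%.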

In [31]:
from aequitas.flow.methods.postprocessing import Threshold

# We will create a threshold that flags the top 5% of validation scores as positive
threshold = Threshold(threshold_type="top_pct", threshold_value=0.05)
threshold.fit(validation.X, preds_val, validation.y, validation.s)

[INFO] 2024-08-23 14:30:13 methods.postprocessing.Threshold - Instantiating postprocessing Threshold.
[INFO] 2024-08-23 14:30:13 methods.postprocessing.Threshold - Computing threshold.
[INFO] 2024-08-23 14:30:13 methods.postprocessing.Threshold - Finished computing threshold.


In [32]:
# Binarize test predictions with previously calculated threshold
bin_preds_test = threshold.transform(test.X, preds_test, test.s)

[INFO] 2024-08-23 14:30:13 methods.postprocessing.Threshold - Transforming predictions.
[INFO] 2024-08-23 14:30:13 methods.postprocessing.Threshold - Finished transforming predictions.


---
## Fairness Audit and Performance Evaluation

Now, we will create the resources necessary to perform a fairness audit, and evaluate the model performance. These are:
1. **Protected attribute**
2. **Model predictions**
3. **Labels**

But first, we will have to define some configurations of this step.

As a brief summary of the BankAccountFraud task: performance is measured by the percentage of positive instances (frauds) detected (TPR). A false positive means that a non-fraudulent individual is denied a bank account because of a false flag. Because of this, we want to equalize the false positive rate (FPR) across the groups of the protected attribute, in this case the customer age. The reference group for this task is the younger group (<50).
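The two rates discussed above can be computed by hand from labels, binary predictions, and group membership. The sketch below is an illustration only: `rates_by_group` and the toy data are hypothetical and are not part of Aequitas, which computes these metrics for you in the audit.

```python
from collections import defaultdict


def rates_by_group(labels, predictions, groups):
    """Compute TPR and FPR separately for each group value."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for label, pred, group in zip(labels, predictions, groups):
        if label == 1:
            key = "tp" if pred == 1 else "fn"
        else:
            key = "fp" if pred == 1 else "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        rates[group] = {
            # TPR: share of actual frauds that were flagged.
            "tpr": c["tp"] / positives if positives else float("nan"),
            # FPR: share of legitimate applicants that were wrongly flagged.
            "fpr": c["fp"] / negatives if negatives else float("nan"),
        }
    return rates


# Toy data: six instances in group "0" and four in group "1".
toy_labels = [1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
toy_preds = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
toy_groups = ["0", "0", "0", "0", "0", "0", "1", "1", "1", "1"]

toy_rates = rates_by_group(toy_labels, toy_preds, toy_groups)
for group, group_rates in sorted(toy_rates.items()):
    print(group, group_rates)
```

In this toy data group "1" has both the higher TPR and the higher FPR, which is the kind of imbalance the audit is meant to surface.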

> ⚠️ **Make sure to update the following configuration cell with the appropriate values for your use-case**. ⚠️

In [33]:
performance_metric = "tpr"
fairness_metric = "fpr"

# The column name of the sensitive attribute can be obtained from the aequitas.flow.Dataset object
fairness_column = test.s.name
# The reference group for the example of BAF is "0", i.e. individuals from the younger group (<50).
reference_group = "0"

In [34]:
# Creating a dataframe for the fairness audit
audit_df = test.s.astype(str).to_frame()
audit_df["label"] = test.y
# These might need to change if you are not using an aequitas.flow.Dataset object

audit_df["score"] = bin_preds_test

In the cell below we will see the minimal structure for a fairness audit DataFrame:

In [35]:
audit_df.sample(n=5, random_state=2)

|       | customer_age_bin | label | score |
|------:|-----------------:|------:|------:|
| 92592 | 0 | 0 | 0 |
| 88551 | 1 | 0 | 0 |
| 94698 | 0 | 0 | 0 |
| 88759 | 0 | 0 | 1 |
| 87353 | 0 | 0 | 0 |


We will quickly check the model's performance with the `performance()` method of the audit.

In this dataset, we are using global **TPR** as the performance metric.

In [36]:
from aequitas.audit import Audit

audit = Audit(audit_df, reference_groups={fairness_column: reference_group})
audit.performance()[performance_metric]

|   | tpr |
|--:|---------:|
| 0 | 0.476395 |


We will now perform the fairness audit:

In [37]:
audit.audit()
audit.summary_plot(fairness_metric)



In [38]:
audit.disparity_plot(fairness_metric, fairness_column)



In [39]:
audit.metrics_df[["attribute_name", "attribute_value", "tpr", "fpr", "accuracy", "precision"]]

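|   | attribute_name | attribute_value | tpr | fpr | accuracy | precision |
|--:|:---------------|:----------------|---------:|---------:|---------:|---------:|
| 0 | customer_age_bin | 0 | 0.431953 | 0.033461 | 0.959584 | 0.145418 |
| 1 | customer_age_bin | 1 | 0.59375 | 0.107914 | 0.882587 | 0.153226 |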
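A common way to read a table like this is to divide each group's metric by the reference group's value, so that the reference group sits at 1.0 by construction. The sketch below does that arithmetic on the numbers from the table above. The `disparities` helper is hypothetical and only mirrors the idea of a ratio against the reference group; it is not the Aequitas implementation.

```python
# Baseline per-group metrics, copied from the table above.
baseline_metrics = {
    "0": {"tpr": 0.431953, "fpr": 0.033461, "accuracy": 0.959584, "precision": 0.145418},
    "1": {"tpr": 0.59375, "fpr": 0.107914, "accuracy": 0.882587, "precision": 0.153226},
}
reference_group = "0"


def disparities(metrics, reference):
    """Divide each group's metrics by the reference group's metrics."""
    ref = metrics[reference]
    return {
        group: {name: value / ref[name] for name, value in values.items()}
        for group, values in metrics.items()
    }


baseline_disparities = disparities(baseline_metrics, reference_group)
for name, ratio in baseline_disparities["1"].items():
    print(f"{name} disparity (group 1 vs. group 0): {ratio:.2f}")
```

With these values the FPR of the older group is a little over 3.2 times that of the reference group, which is the gap the correction step targets.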


---
## Correcting the predictions

To correct the predictions, we will use a method available in Aequitas Flow.

This method will calculate different thresholds to equalize a target metric for all the groups (in this case the fairness metric, FPR).
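The idea of one threshold per group can be sketched in a few lines of pure Python. This is a simplified illustration, not the `BalancedGroupThreshold` implementation: the helper names and toy data are hypothetical, and the sketch only caps each group's FPR at a chosen target while ignoring any global constraint such as the overall share of flagged instances.

```python
def fit_group_fpr_thresholds(scores, labels, groups, target_fpr):
    """Pick one cut-off per group so that each group's FPR stays at or below `target_fpr`."""
    if not 0 <= target_fpr < 1:
        raise ValueError("target_fpr must be in [0, 1).")
    thresholds = {}
    for group in sorted(set(groups)):
        # Scores of legitimate (label 0) instances in this group, highest first.
        negatives = sorted(
            (s for s, y, g in zip(scores, labels, groups) if g == group and y == 0),
            reverse=True,
        )
        if not negatives:
            raise ValueError(f"Group {group!r} has no negative instances.")
        # How many negatives may be flagged without exceeding the target FPR.
        allowed_fp = int(len(negatives) * target_fpr)
        # Scores strictly above this cut-off are flagged as positive.
        thresholds[group] = negatives[allowed_fp]
    return thresholds


def apply_group_thresholds(scores, groups, thresholds):
    """Binarize each score with the cut-off of its own group."""
    return [1 if s > thresholds[g] else 0 for s, g in zip(scores, groups)]


def group_fpr(predictions, labels, groups, group):
    """Share of label-0 instances in `group` that were flagged."""
    flags = [p for p, y, g in zip(predictions, labels, groups) if g == group and y == 0]
    return sum(flags) / len(flags)


# Toy data: group "1" receives systematically higher scores than group "0".
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.35, 0.8, 0.3, 0.5, 0.6, 0.7, 0.65, 0.9]
toy_labels = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1]
toy_groups = ["0"] * 6 + ["1"] * 6

# A single global cut-off treats the two groups very differently.
global_preds = [1 if s > 0.45 else 0 for s in toy_scores]
global_fprs = {g: group_fpr(global_preds, toy_labels, toy_groups, g) for g in ("0", "1")}

# Group-specific cut-offs bring both groups to the same FPR.
group_thresholds = fit_group_fpr_thresholds(toy_scores, toy_labels, toy_groups, target_fpr=0.25)
group_preds = apply_group_thresholds(toy_scores, toy_groups, group_thresholds)
group_fprs = {g: group_fpr(group_preds, toy_labels, toy_groups, g) for g in ("0", "1")}
print(global_fprs, group_thresholds, group_fprs)
```

As in the notebook, the cut-offs would normally be fitted on validation data and then applied to the test set, so the equalization on unseen data is only approximate.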

In [40]:
from aequitas.flow.methods.postprocessing import BalancedGroupThreshold

threshold = BalancedGroupThreshold(threshold_type="top_pct", threshold_value=0.05, fairness_metric="fpr")

threshold.fit(validation.X, preds_val, validation.y, validation.s)

[INFO] 2024-08-23 14:30:19 methods.postprocessing.Threshold - Instantiating postprocessing Threshold.
[INFO] 2024-08-23 14:30:19 methods.postprocessing.Threshold - Instantiating postprocessing Threshold.




In [41]:
corrected_bin_preds_test = threshold.transform(test.X, preds_test, test.s)

[INFO] 2024-08-23 14:30:19 methods.postprocessing.Threshold - Transforming predictions.
[INFO] 2024-08-23 14:30:19 methods.postprocessing.Threshold - Finished transforming predictions.
[INFO] 2024-08-23 14:30:19 methods.postprocessing.Threshold - Transforming predictions.
[INFO] 2024-08-23 14:30:19 methods.postprocessing.Threshold - Finished transforming predictions.


In [42]:
audit_df = test.s.astype(str).to_frame().copy()
audit_df["score"] = corrected_bin_preds_test
audit_df["label"] = test.y

Let's see the impact of this correction in the global recall of the model, with the updated binarized predictions:

In [43]:
from aequitas.audit import Audit

audit_fixed = Audit(audit_df, reference_groups={fairness_column: reference_group})
audit_fixed.performance()[performance_metric]

|   | tpr |
|--:|---------:|
| 0 | 0.502146 |


Compared with the previous recall (0.476), the corrected predictions reach 0.502, so the correction does not reduce global performance here.

Let's observe the fairness audit:

In [44]:
audit_fixed.audit()
audit_fixed.summary_plot(fairness_metric)



In [45]:
audit_fixed.disparity_plot(fairness_metric, fairness_column)



In [46]:
audit_fixed.metrics_df[["attribute_name", "attribute_value", "tpr", "fpr", "accuracy", "precision"]]

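|   | attribute_name | attribute_value | tpr | fpr | accuracy | precision |
|--:|:---------------|:----------------|---------:|---------:|---------:|---------:|
| 0 | customer_age_bin | 0 | 0.52071 | 0.053038 | 0.941416 | 0.114583 |
| 1 | customer_age_bin | 1 | 0.453125 | 0.057554 | 0.926866 | 0.205674 |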
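To summarize the effect of the correction, the short sketch below compares the per-group FPR and TPR from the two metrics tables in this notebook. The helper `fpr_gap_and_ratio` is hypothetical and only performs arithmetic on the reported values.

```python
# Per-group TPR and FPR copied from the two metrics tables in this notebook.
before = {"0": {"tpr": 0.431953, "fpr": 0.033461}, "1": {"tpr": 0.59375, "fpr": 0.107914}}
after = {"0": {"tpr": 0.52071, "fpr": 0.053038}, "1": {"tpr": 0.453125, "fpr": 0.057554}}


def fpr_gap_and_ratio(metrics, group="1", reference="0"):
    """Return the absolute FPR difference and the FPR ratio versus the reference group."""
    gap = abs(metrics[group]["fpr"] - metrics[reference]["fpr"])
    ratio = metrics[group]["fpr"] / metrics[reference]["fpr"]
    return gap, ratio


gap_before, ratio_before = fpr_gap_and_ratio(before)
gap_after, ratio_after = fpr_gap_and_ratio(after)
# Change in recall for each group after the correction.
tpr_change = {group: after[group]["tpr"] - before[group]["tpr"] for group in before}

print(f"FPR gap:   {gap_before:.3f} -> {gap_after:.3f}")
print(f"FPR ratio: {ratio_before:.2f} -> {ratio_after:.2f}")
print({group: round(change, 3) for group, change in tpr_change.items()})
```

On these numbers the FPR gap between the two age groups shrinks from about 0.074 to about 0.005, while recall rises for the younger group and falls for the older group.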
