# Introduction

### Definition of Unintended Bias

Every machine learning model is designed to express a bias. For example, a model trained to identify toxic comments is intended to be biased so that toxic comments receive higher scores than non-toxic ones. The model is not intended to discriminate based on the gender of the people mentioned in a comment, so if it does, we call that unintended bias. Fairness, in contrast, refers to the potential negative impact on society, particularly when different individuals are treated differently.

This notebook focuses on the implementation and exploration of metrics to compare bias in toxic comment classification models. The validation metrics follow the implementation of [Borkan et al. (2019)](https://arxiv.org/abs/1903.04561) and [Dixon et al. (2018)](https://dl.acm.org/doi/10.1145/3278721.3278729).

### Imports

In [1]:
import base64
import io
import os
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
from sklearn import metrics

### Defines

In [2]:
SUBGROUP_AUC = 'subgroup_auc'
BPSN_AUC = 'bpsn_auc'
BNSP_AUC = 'bnsp_auc'
NEGATIVE_AEG = 'negative_aeg'
POSITIVE_AEG = 'positive_aeg'

SUBGROUP_SIZE = 'subgroup_size'
SUBGROUP = 'subgroup'

METRICS = [
    SUBGROUP_AUC, BPSN_AUC, BNSP_AUC, NEGATIVE_AEG,
    POSITIVE_AEG
]
AUCS = [SUBGROUP_AUC, BPSN_AUC, BNSP_AUC]
AEGS = [NEGATIVE_AEG, POSITIVE_AEG]

# Evaluation Metrics

**AUC-Based Metrics**:
- *Subgroup AUC*
- *Background Positive Subgroup Negative (BPSN) AUC*
- *Background Negative Subgroup Positive (BNSP) AUC*

**Average Equality Gap**:
- *Positive AEG*
- *Negative AEG*

Each metric is explained below.

## AUC-Based Metrics

These three metrics are based on the Area Under the Receiver Operating Characteristic Curve (ROC-AUC, or AUC) metric. For any classifier, AUC measures the probability that a randomly chosen negative example will receive a lower score than a randomly chosen positive example. An AUC of 1.0 means that all negative/positive pairs are correctly ordered, with every negative item receiving a lower score than every positive item.

A core benefit of AUC is that it is **threshold agnostic**. An AUC of 1.0 also means that it is possible to select a threshold that perfectly separates negative and positive examples.
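
To make the pairwise reading of AUC concrete, the following sketch counts correctly ordered (negative, positive) pairs by brute force and compares the result with scikit-learn's `roc_auc_score`. The helper name `pairwise_auc` and the toy scores are illustrative assumptions, not part of the notebook; ties are counted as one half.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, y_score):
    """Fraction of (negative, positive) pairs ordered correctly; ties count 0.5."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true]
    neg = y_score[~y_true]
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


# Hypothetical labels and scores, chosen so that two pairs are misordered.
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.7, 0.35, 0.8, 0.9]

auc_pairs = pairwise_auc(y_true, y_score)
auc_sklearn = roc_auc_score(y_true, y_score)
print(auc_pairs, auc_sklearn)
```

In this toy case 7 of the 9 pairs are ordered correctly, so both computations give 7/9. No threshold appears anywhere in the calculation, which is what "threshold agnostic" refers to.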

Here, we calculate the metrics by splitting the test data into a subgroup $D_{g}$ and the rest of the data $D$, which is called the **"background"** data, and comparing scores across the two.

*New terms:*\
Subgroup data $D_g$: Subset of full data containing examples of subgroup $g$ \
Background data $D$: Set of all examples that do not belong to the specific subgroup ($D \cap D_g = \emptyset$).

As an example, consider the following hypothetical score distributions for the *background data* (top) and the *identity subgroup* (bottom), each divided into negative examples (green) and positive examples (purple).

<img src="../images/large_score_shift_right.png">

We can clearly see that the examples within the identity subgroup receive higher scores, both for positive and negative examples. This score shift is one way that unintended bias can manifest in a model. Many types of unintended bias can be uncovered by looking at differences in the score distribution between background data and data from within a specific identity. The following three AUC-based metrics specifically measure variations in the distribution that cause misordering between negative and positive examples.

**Why not use the normal ROC-AUC?**\
As the previous example shows, both $\textrm{AUC}(D_g)$ and $\textrm{AUC}(D)$ are close to 1.0, but $\textrm{AUC}(D_g \cup D)$ is not, because the subgroup negative examples overlap the background positive examples. Isn't this score a reflection of the bias in the model? So why not use the AUC of the full data instead of separating it into subgroup and background? Simply because the ROC-AUC does not isolate the unintended bias in a model. The AUC in this example is poor because of the score shift, but in many other cases a low AUC might just indicate inferior classification performance.
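
The effect can be reproduced with a few made-up numbers. The sketch below assumes a model that separates the classes perfectly within each group while shifting every subgroup score to the right; the arrays are hypothetical and only mimic the figure above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical scores: perfect separation inside each group,
# but the whole subgroup is shifted towards higher scores.
background_labels = np.array([0, 0, 0, 1, 1, 1])
background_scores = np.array([0.1, 0.2, 0.3, 0.5, 0.6, 0.7])
subgroup_labels = np.array([0, 0, 0, 1, 1, 1])
subgroup_scores = np.array([0.55, 0.65, 0.75, 0.85, 0.9, 0.95])

auc_background = roc_auc_score(background_labels, background_scores)
auc_subgroup = roc_auc_score(subgroup_labels, subgroup_scores)
# Pooled AUC over the union of both groups.
auc_pooled = roc_auc_score(
    np.concatenate([background_labels, subgroup_labels]),
    np.concatenate([background_scores, subgroup_scores]),
)
print(auc_background, auc_subgroup, auc_pooled)
```

Both per-group AUCs are 1.0, while the pooled AUC falls to 30/36 because the subgroup negatives outrank some background positives. The pooled number alone does not reveal whether the drop comes from such a shift or from a model that is simply noisy everywhere.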

*(Optional reading)*\
*DEFINITION: Let $D^-$ be the negative examples in the background set, $D^+$ be the positive examples in the background set, $D_{g}^-$ be the negative examples in the identity subgroup, and $D_{g}^+$ be the positive examples in the identity subgroup.*

$$\begin{aligned}
\textrm{Subgroup AUC} &= \textrm{AUC}(D_{g}^- + D_{g}^+), \\
\textrm{BPSN AUC} &= \textrm{AUC}(D^+ + D_{g}^-), \\
\textrm{BNSP AUC} &= \textrm{AUC}(D^- + D_{g}^+).
\end{aligned}
$$
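
As a rough illustration of these three definitions, the sketch below evaluates them on a tiny hand-made DataFrame with scikit-learn's `roc_auc_score`. The column names `in_subgroup`, `toxic` and `score`, the helper `auc_on`, and the numbers are all assumptions made for this example.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical toy data: subgroup scores are shifted towards higher values.
toy = pd.DataFrame({
    "in_subgroup": [False] * 6 + [True] * 6,
    "toxic": [False, False, False, True, True, True] * 2,
    "score": [0.1, 0.2, 0.3, 0.6, 0.7, 0.8,
              0.4, 0.65, 0.75, 0.85, 0.9, 0.95],
})


def auc_on(mask):
    """AUC restricted to the rows selected by a boolean mask."""
    rows = toy[mask]
    return roc_auc_score(rows["toxic"], rows["score"])


g, y = toy["in_subgroup"], toy["toxic"]
subgroup_auc = auc_on(g)                 # D_g^- and D_g^+
bpsn_auc = auc_on((~g & y) | (g & ~y))   # D^+ and D_g^-
bnsp_auc = auc_on((~g & ~y) | (g & y))   # D^- and D_g^+
print(subgroup_auc, bpsn_auc, bnsp_auc)
```

Here the subgroup AUC and BNSP AUC are both 1.0, while the BPSN AUC is 6/9: only the comparison between background positives and subgroup negatives exposes the shift.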

### AUC

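Uses the scikit-learn implementation of ROC-AUC. When the AUC is undefined (for example, when only one class is present in the labels), the function returns NaN instead of raising an error.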

In [None]:
def compute_auc(y_true, y_pred) -> float:
    """Computes the area under the ROC curve (AUC) for the given true and predicted labels.
    
    Parameters
    ----------
        y_true: array-like of shape (n_samples, ) - True binary labels.
        y_pred: array-like of shape (n_samples, ) - Target scores.

    Returns
    -------
        auc: float - The AUC score, representing the probability that a randomly chosen 
        negative example will receive a lower score than a randomly chosen positive example.
    """
    try:
        return metrics.roc_auc_score(y_true, y_pred)
    except ValueError:
        # roc_auc_score raises ValueError when y_true contains a single class.
        return np.nan

### Subgroup AUC

Calculates the AUC using only examples from the subgroup. This represents model understanding and separability within the subgroup itself.

**Interpretation**: How well does the model distinguish between toxic and non-toxic comments *specifically within a given identity subgroup*?

<img src="../images/subgroup_auc.png">

In [4]:
def compute_subgroup_auc(df: pd.DataFrame, subgroup: str, label: str, pred_col: str) -> float:
    """Computes the AUC for a specific subgroup within the dataset.
    The dataframe must have the predicted scores and true labels for the subgroup.

    Parameters
    ----------
        df: pd.DataFrame - The DataFrame containing the data.
        subgroup: str - The name of the subgroup column to filter on.
        label: str - The name of the true label column.
        pred_col: str - The name of the predicted scores column.

    Returns
    -------
        auc: float - The AUC score for the specified subgroup.
    """
    # Filters the DataFrame to include only the subgroup examples (the subgroup column must be boolean)
    subgroup_examples = df[df[subgroup]]
    # Computes the AUC for the subgroup
    return compute_auc(subgroup_examples[label], subgroup_examples[pred_col])

### Background Positive Subgroup Negative (BPSN) AUC

Calculates the AUC using positive examples from the background and negative examples from the subgroup. This value would be reduced when scores for negative examples in the subgroup are higher than scores for other positive examples.

**Interpretation**: How often does the model incorrectly score non-toxic comments from a *specific subgroup* **higher** than toxic comments from background data, potentially leading to false positives for that subgroup? 


| Data               | Comment          | Predicted Score |
| ------------------ | ---------------- | ----- |
| Background Toxic   | I hate you, die! | 0.85  |
| Subgroup Non-Toxic | I am gay!        | 0.9   |


<img src="../images/bpsn_auc.png">

In [5]:
def compute_bpsn_auc(df: pd.DataFrame, subgroup: str, label: str, pred_col: str) -> float:
    """Computes the AUC of the background positive examples and the within-subgroup negative examples.
    
    Parameters
    ----------
        df: pd.DataFrame - The DataFrame containing the data.
        subgroup: str - The name of the subgroup column to filter on.
        label: str - The name of the true label column.
        pred_col: str - The name of the predicted scores column.

    Returns
    -------
        bpsn_auc: float - The AUC score for the background positive examples and subgroup negative examples.
    """
    # Filters the DataFrame to include only the subgroup NEGATIVE examples...
    subgroup_negative_examples = df[df[subgroup] & ~df[label]]
    # And the background POSITIVE examples
    non_subgroup_positive_examples = df[~df[subgroup] & df[label]]
    examples = pd.concat([subgroup_negative_examples, non_subgroup_positive_examples])
    return compute_auc(examples[label], examples[pred_col])

### Background Negative Subgroup Positive (BNSP) AUC

Calculates the AUC using negative examples from the background and positive examples from the subgroup. This value would be reduced when scores for positive examples in the subgroup are lower than scores for other negative examples.

**Interpretation**: How often does the model incorrectly score toxic comments from a *specific subgroup* **lower** than non-toxic comments from background data, potentially leading to false negatives for that subgroup? 

| Data                 | Comment            | Predicted Score |
| -------------------- | ------------------ | --------------- |
| Background Non-Toxic | What the heck!     | 0.45            |
| Subgroup Toxic       | I hate christians! | 0.40            |

<img src="../images/bnsp_auc.png">

In [6]:
def compute_bnsp_auc(df: pd.DataFrame, subgroup: str, label: str, pred_col: str) -> float:
    """Computes the AUC of the subgroup positive examples and the background negative examples.
    
    Parameters
    ----------
        df: pd.DataFrame - The DataFrame containing the data.
        subgroup: str - The name of the subgroup column to filter on.
        label: str - The name of the true label column.
        pred_col: str - The name of the predicted scores column.

    Returns
    -------
        bnsp_auc: float - The AUC score for the background negative examples and subgroup positive examples.
    """
    # Filters the DataFrame to include only the subgroup POSITIVE examples...
    subgroup_positive_examples = df[df[subgroup] & df[label]]
    # And the background NEGATIVE examples
    non_subgroup_negative_examples = df[~df[subgroup] & ~df[label]]
    examples = pd.concat([subgroup_positive_examples, non_subgroup_negative_examples])
    return compute_auc(examples[label], examples[pred_col])

## Average Equality Gap (AEG)

These are two additional threshold-agnostic metrics, built from a generalization of the Equality Gap metric.

The Equality Gap is the difference between the true positive rate of the subgroup, $\textrm{TPR}(D_{g})$, and that of the background, $\textrm{TPR}(D)$, at a specific threshold. Consider the following figure, which plots these rates against each other for every possible threshold $t$ for some hypothetical classification model.

<img src="../images/aeg.png" width=400, height=300>

Notice how the hypothetical classifier is biased against the subgroup, as $\textrm{TPR}(D_{g}) \lt \textrm{TPR}(D)$, with a gap that varies from threshold to threshold. **The shaded area captures the average bias across all thresholds for the classifier**.
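
The per-threshold Equality Gap can be sketched in a few lines. This example assumes that an example is predicted positive when its score is at or above the threshold; the helper `true_positive_rate` and both score arrays are hypothetical.

```python
import numpy as np


def true_positive_rate(positive_scores, threshold):
    """Share of positive examples whose score is at or above the threshold."""
    return float(np.mean(np.asarray(positive_scores) >= threshold))


# Hypothetical scores of POSITIVE examples only.
background_pos = np.array([0.35, 0.55, 0.7, 0.8, 0.9])
subgroup_pos = np.array([0.2, 0.3, 0.45, 0.6, 0.85])

# Equality Gap at a single threshold.
gap_at_half = (true_positive_rate(background_pos, 0.5)
               - true_positive_rate(subgroup_pos, 0.5))

# The same gap evaluated on a grid of thresholds.
thresholds = np.linspace(0.0, 1.0, 101)
gaps = np.array([
    true_positive_rate(background_pos, t) - true_positive_rate(subgroup_pos, t)
    for t in thresholds
])
print(gap_at_half, gaps.min(), gaps.max())
```

At a threshold of 0.5 the background TPR is 0.8 and the subgroup TPR is 0.4, a gap of 0.4. The gap changes with the threshold but never becomes negative for these scores, which is the situation the shaded area summarizes.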

Another way to generalize the Equality Gap metric is from the perspective of the separability of the score distributions, similar to the AUC metrics in the previous section. With the AUC metrics, we measured mis-orderings between positive and negative examples across the subgroup and background, with the goal of few mis-orderings or high separability. **For the AEGs, we compare positive examples from the subgroup with positive examples from the background, with the goal of low separability**.

*DEFINITION (POSITIVE AEG): If a point $i$ (with model score $\hat{Y}_i$) were chosen uniformly at random from the background positive examples $D^+$ and a point $j$ (with model score $\hat{Y}_j$) were chosen uniformly at random from the subgroup positive examples $D^+_g$, then the average equality gap is:*

$$\begin{aligned}
\textrm{Positive AEG} = \frac{1}{2} - P \Bigl\{\hat{Y}_i \gt \hat{Y}_j \mid i \in D^+, j \in D^+_g\Bigr\}
\end{aligned}
$$

**What does this mean?**\
Given that both data points are positive examples from the two distributions, the probability that either score is higher than the other should be the same, i.e. $\frac{1}{2}$. In other words, we want the score distributions of positive examples from the subgroup and the background to be similar.
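
A brute-force version of this definition is sketched below, under the assumption that ties count as one half. The helper `pairwise_aeg` and the score lists are hypothetical; the last lines cross-check the result against SciPy's `mannwhitneyu`, whose statistic for the first sample counts the pairs in which the first sample scores higher.

```python
import numpy as np
from scipy import stats


def pairwise_aeg(background_scores, subgroup_scores):
    """0.5 - P(background score > subgroup score); ties count as one half."""
    b = np.asarray(background_scores, dtype=float)[:, None]
    s = np.asarray(subgroup_scores, dtype=float)[None, :]
    p_background_higher = (np.sum(b > s) + 0.5 * np.sum(b == s)) / (b.size * s.size)
    return 0.5 - p_background_higher


# Hypothetical scores of positive examples.
background_pos = [0.5, 0.6, 0.7, 0.8]
shifted_right = [0.85, 0.9, 0.95]   # subgroup positives scored higher
same_range = [0.55, 0.75]           # subgroup positives interleaved with background

aeg_right = pairwise_aeg(background_pos, shifted_right)
aeg_same = pairwise_aeg(background_pos, same_range)

# Cross-check with the Mann-Whitney U statistic of the first sample.
u = stats.mannwhitneyu(background_pos, shifted_right, alternative="less").statistic
aeg_right_scipy = 0.5 - u / (len(background_pos) * len(shifted_right))
print(aeg_right, aeg_same, aeg_right_scipy)
```

When no background positive outranks a subgroup positive, the gap reaches 0.5; when the two sets of scores interleave evenly, it is 0.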

Let's compare the two images below.\
**Left Image**: The score distributions of negative examples from the background and the subgroup are identical. The positive scores from the subgroup, however, are shifted right, so the probability that a randomly chosen positive example from the background scores higher than a randomly chosen positive example from the subgroup is zero. Therefore the Positive AEG is equal to 0.5 - 0 = 0.5.

**Right Image**: Analogous to the first one, but now the Negative AEG is also 0.5. As an exercise, think about when the AEG takes a negative value.

<img src="../images/pos_aeg.png">

<img src="../images/pos_neg_aeg.png">


### Mann-Whitney U Metric

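You don't need to study this function in detail: it computes the probability described above. The Mann-Whitney U statistic of the first dataset counts the pairs in which a score from `data1` is higher than a score from `data2` (ties count as one half), and dividing by the number of pairs $n_1 n_2$ turns that count into a probability.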

In [7]:
def normalized_mwu(data1: pd.DataFrame, data2: pd.DataFrame, pred_col: str) -> float:
    """Computes the normalized Mann-Whitney U statistic between two datasets.

    Parameters
    ----------
        data1: pd.DataFrame - The first dataset.
        data2: pd.DataFrame - The second dataset.
        pred_col: str - The name of the column to compare.

    Returns
    -------
        normalized_mwu: float - The normalized Mann-Whitney U statistic.
    """
    scores_1 = data1[pred_col]
    scores_2 = data2[pred_col]

    n1 = len(scores_1)
    n2 = len(scores_2)

    if n1 == 0 or n2 == 0:
        return np.nan
    # u counts pairs where scores_1 > scores_2; `alternative` only affects the unused p-value.
    u, _ = stats.mannwhitneyu(scores_1, scores_2, alternative='less')
    
    return u / (n1 * n2)

### Negative AEG

In [8]:
def compute_negative_aeg(df: pd.DataFrame, subgroup: str, label: str, pred_col: str) -> float:
    """Computes the negative Average Equality Gap (AEG) for a specific subgroup.

    Parameters
    ----------
        df: pd.DataFrame - The DataFrame containing the data.
        subgroup: str - The name of the subgroup column to filter on.
        label: str - The name of the true label column.
        pred_col: str - The name of the predicted scores column.

    Returns
    -------
        negative_aeg: float - The negative AEG score for the specified subgroup.
    """
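    # Background negatives (first argument) vs. subgroup negatives (second argument).
    mwu = normalized_mwu(df[~df[subgroup] & ~df[label]],
                         df[df[subgroup] & ~df[label]], pred_col)
    # normalized_mwu returns NaN (never None) when either group is empty,
    # and NaN propagates through the subtraction.
    return 0.5 - mwu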

### Positive AEG

In [9]:
def compute_positive_aeg(df: pd.DataFrame, subgroup: str, label: str, pred_col: str) -> float:
    """Computes the positive Average Equality Gap (AEG) for a specific subgroup.

    Parameters
    ----------
        df: pd.DataFrame - The DataFrame containing the data.
        subgroup: str - The name of the subgroup column to filter on.
        label: str - The name of the true label column.
        pred_col: str - The name of the predicted scores column.

    Returns
    -------
        positive_aeg: float - The positive AEG score for the specified subgroup.
    """
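    # Background positives (first argument) vs. subgroup positives (second argument).
    mwu = normalized_mwu(df[~df[subgroup] & df[label]],
                         df[df[subgroup] & df[label]], pred_col)
    # normalized_mwu returns NaN (never None) when either group is empty,
    # and NaN propagates through the subtraction.
    return 0.5 - mwu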

## Examples

The table below outlines simulated score distributions that exhibit common biases. Make sure you understand what each metric says about the distribution and vice versa.\
Pay attention to the sign of the AEG: what does it say about the shift of the subgroup score distribution?

<img src="../images/score_examples.png">

# Putting it All Together

### Compute Subgroup Bias

In [10]:
def compute_bias_metrics_for_subgroup_and_model(dataset: pd.DataFrame,
                                                subgroup: str,
                                                label: str,
                                                pred_col: str) -> dict:
    """Computes bias metrics for a specific subgroup and model.

    Parameters
    ----------
        dataset: pd.DataFrame - The DataFrame containing the data.
        subgroup: str - The name of the subgroup column to filter on.
        label: str - The name of the true label column.
        pred_col: str - The name of the predicted scores column.

    Returns
    -------
        metrics_dict: dict - A dictionary containing the computed bias metrics.
    """
    metrics_dict = {
        SUBGROUP: subgroup,
        SUBGROUP_SIZE: dataset[subgroup].sum(),
        SUBGROUP_AUC: compute_subgroup_auc(dataset, subgroup, label, pred_col),
        BPSN_AUC: compute_bpsn_auc(dataset, subgroup, label, pred_col),
        BNSP_AUC: compute_bnsp_auc(dataset, subgroup, label, pred_col),
        NEGATIVE_AEG: compute_negative_aeg(dataset, subgroup, label, pred_col),
        POSITIVE_AEG: compute_positive_aeg(dataset, subgroup, label, pred_col)
    }
    return metrics_dict

### Compute Model Unintended Bias

In [11]:
def compute_bias_metrics_for_model(dataset: pd.DataFrame,
                                   subgroups: list[str],
                                   label: str,
                                   pred_col: str) -> pd.DataFrame:
    """Computes bias metrics for a model across all subgroups in the dataset.

    Parameters
    ----------
        dataset: pd.DataFrame - The DataFrame containing the data.
        subgroups: list[str] - The names of the subgroup columns to evaluate.
        label: str - The name of the true label column.
        pred_col: str - The name of the predicted scores column.

    Returns
    -------
        metrics_df: pd.DataFrame - A DataFrame containing the computed bias metrics for each subgroup.
    """
    metrics_list = [
        compute_bias_metrics_for_subgroup_and_model(dataset, subgroup, label, pred_col)
        for subgroup in subgroups
    ]
    return pd.DataFrame(metrics_list).sort_values(by=SUBGROUP_SIZE, ascending=True)
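
The functions above index with expressions such as `df[df[subgroup]]` and `~df[label]`, so they presume that the subgroup and label columns are boolean. The sketch below shows one possible way to prepare such columns when the raw data stores fractional annotations. The helper `binarize_columns`, the 0.5 threshold, and all column names are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def binarize_columns(df, columns, threshold=0.5):
    """Returns a copy of df in which the given columns are boolean (value >= threshold)."""
    out = df.copy()
    for col in columns:
        # Missing annotations (NaN) compare as False, i.e. "not in the subgroup".
        out[col] = out[col] >= threshold
    return out


# Hypothetical raw data with fractional annotations.
raw = pd.DataFrame({
    "toxicity": [0.0, 0.8, 0.2, 1.0],
    "female": [0.0, 1.0, float("nan"), 0.6],
    "score": [0.1, 0.9, 0.3, 0.7],
})
prepared = binarize_columns(raw, ["toxicity", "female"])

# Pitfall: on an integer 0/1 column, `~` is a bitwise NOT, not a logical one.
bitwise_not = (~pd.Series([0, 1])).tolist()  # [-1, -2]
print(prepared.dtypes, bitwise_not)
```

With boolean columns in place, a call such as `compute_bias_metrics_for_model(prepared, ["female"], "toxicity", "score")` would match the signatures defined above.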