# Data reweighting by Kamiran and Calders - Recruiting data

This notebook applies the pre-processing fairness intervention introduced in [Data preprocessing techniques for classification without discrimination](https://link.springer.com/article/10.1007/s10115-011-0463-8) by Kamiran and Calders (2012), using the implementation in the IBM AIF360 fairness toolbox ([github.com/IBM/AIF360](https://github.com/IBM/AIF360)).

The intervention targets demographic parity by attaching a weight to each observation, so that under-represented combinations of protected group and label are more influential during training. This balances the label distributions across the protected groups. The resulting weights can also be used to resample the data set with replacement to create a fair transformed data set.
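
To make the weighting concrete, here is a minimal sketch of the reweighing rule as we understand it from the paper: each row gets the weight P(group) * P(label) / P(group, label), estimated from counts. The function names and the toy data are made up for illustration and are not part of AIF360.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Return one weight per row: P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        group_counts[g] * label_counts[y] / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


def weighted_favourable_rate(groups, labels, weights, group):
    """Weighted share of favourable labels (label == 1) within one group."""
    total = sum(w for g, w in zip(groups, weights) if g == group)
    favourable = sum(
        w for g, y, w in zip(groups, labels, weights) if g == group and y == 1
    )
    return favourable / total


# Toy data (hypothetical): group 1 is privileged, label 1 is favourable.
groups = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
weights = reweighing_weights(groups, labels)

# After weighting, both groups have the same favourable rate.
rate_privileged = weighted_favourable_rate(groups, labels, weights, 1)
rate_unprivileged = weighted_favourable_rate(groups, labels, weights, 0)
print(rate_privileged, rate_unprivileged)
```

In this toy example the favourable unprivileged row is weighted up and the favourable privileged rows are weighted down, which is what equalises the weighted label rates.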

In [None]:
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from aif360.algorithms.preprocessing.reweighing import Reweighing
from aif360.datasets import StandardDataset
from fairlearn.metrics import demographic_parity_difference
from helpers.metrics import accuracy
from helpers.plot import group_box_plots
from sklearn.linear_model import LogisticRegression

In [None]:
from helpers import export_plot

## Load data

We have committed preprocessed data to the repository for reproducibility and we load it here. Check out the preprocessing notebook for details on how this data was obtained.

In [None]:
artifacts_dir = Path("../../../artifacts")

In [None]:
# override artifacts_dir in source notebook
# this is stripped out for the hosted notebooks
artifacts_dir = Path("../../../../artifacts")

In [None]:
data_dir = artifacts_dir / "data" / "recruiting"

train = pd.read_csv(data_dir / "processed" / "train.csv")
val = pd.read_csv(data_dir / "processed" / "val.csv")
test = pd.read_csv(data_dir / "processed" / "test.csv")

To process data for our fairness intervention we need to define special dataset objects, which are part of every intervention pipeline in the IBM AIF360 toolbox. These objects contain the original data together with some further useful information, e.g., which feature is the protected attribute and which column holds the label.

In [None]:
train_sds = StandardDataset(
    train,
    label_name="employed_yes",
    favorable_classes=[1],
    protected_attribute_names=["race_white"],
    privileged_classes=[[1]],
)
test_sds = StandardDataset(
    test,
    label_name="employed_yes",
    favorable_classes=[1],
    protected_attribute_names=["race_white"],
    privileged_classes=[[1]],
)
val_sds = StandardDataset(
    val,
    label_name="employed_yes",
    favorable_classes=[1],
    protected_attribute_names=["race_white"],
    privileged_classes=[[1]],
)
index = train_sds.feature_names.index("race_white")

Define which binary value corresponds to the privileged and the unprivileged group.

In [None]:
privileged_groups = [{"race_white": 1.0}]
unprivileged_groups = [{"race_white": 0.0}]

## Train unfair model

For maximum reproducibility we load the baseline model from disk, but the code used to train can be found in the baseline model notebook.

In [None]:
bl_model = joblib.load(
    artifacts_dir / "models" / "recruiting" / "baseline.pkl"
)

bl_test_probs = bl_model.predict_proba(test.drop("employed_yes", axis=1))[:, 1]
bl_test_pred = bl_test_probs > 0.5

## Demographic parity

We learn the data transformation due to Kamiran and Calders on the training data. The transformation attaches fair weights to the data it is applied to. A fair data set can then be generated via weighted sampling. We apply the transformation to the validation set, but instead of resampling according to the resulting weights, we train a logistic regression model on the validation set using those weights directly. Finally, we generate predictions for the test data with the learnt fair logistic regression model and analyse the outcomes for fairness and accuracy.

The intervention does not require any parameter tuning.
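
The weighted sampling mentioned above is not used in this notebook, but a minimal sketch may help to see what it would do. It assumes toy `(group, label)` rows with hand-computed reweighing weights; the helper names, the data and the seed are hypothetical and chosen for illustration only.

```python
import random


def resample_by_weight(rows, weights, k=None, seed=0):
    """Draw k rows with replacement, with probability proportional to weight."""
    rng = random.Random(seed)
    return rng.choices(rows, weights=weights, k=len(rows) if k is None else k)


def favourable_rate(rows, group):
    """Share of favourable labels (label == 1) among rows of one group."""
    group_labels = [label for g, label in rows if g == group]
    return sum(group_labels) / len(group_labels)


# Toy (group, label) rows with reweighing weights computed by hand.
rows = [(1, 1)] * 4 + [(1, 0)] * 2 + [(0, 1)] * 1 + [(0, 0)] * 3
weights = [0.75] * 4 + [1.5] * 2 + [2.0] * 1 + [2 / 3] * 3

resampled = resample_by_weight(rows, weights, k=20_000)

# Rates differ between groups in the original rows and should be
# close to each other after weighted resampling.
print(favourable_rate(rows, 1), favourable_rate(rows, 0))
print(favourable_rate(resampled, 1), favourable_rate(resampled, 0))
```

Passing the weights to the learner directly, as done below, avoids the sampling noise that resampling introduces.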

In [None]:
RW = Reweighing(
    unprivileged_groups=unprivileged_groups,
    privileged_groups=privileged_groups,
)
RW.fit(train_sds)

Apply the intervention to the validation data.

In [None]:
val_sds_transf = RW.transform(val_sds)

## Train fair model

We learn a logistic regression model on the validation set incorporating the learnt fair weights.

In [None]:
model_fair = LogisticRegression(max_iter=10000)
X_val = val_sds_transf.features
y_val = val_sds_transf.labels.flatten()
model_fair.fit(X_val, y_val, sample_weight=val_sds_transf.instance_weights)
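
As a rough illustration of what `sample_weight` does during fitting, the sketch below computes a weighted log loss in which each row's contribution is scaled by its weight. This is a simplified stand-in, not scikit-learn's actual objective (which also includes regularisation and does not normalise by the total weight); the function name and numbers are made up.

```python
import math


def weighted_log_loss(probs, labels, weights):
    """Weighted average of per-row log loss; each row counts weight-many times."""
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        row_loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += w * row_loss
    return total / sum(weights)


# A row with weight 2 contributes exactly like the same row appearing twice.
loss_weighted = weighted_log_loss([0.9, 0.2], [1, 0], [2.0, 1.0])
loss_duplicated = weighted_log_loss([0.9, 0.9, 0.2], [1, 1, 0], [1.0, 1.0, 1.0])
print(loss_weighted, loss_duplicated)
```

This is why training with the reweighing weights has a similar effect to training on a resampled data set.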

Apply the fair model to the test set.

Note that the test data itself is not reweighted. The pre-processing intervention affects the predictions only through the model, which was trained with the weights of the reweighted validation data.

In [None]:
test_sds_pred = test_sds.copy(deepcopy=True)
X_test = test_sds_pred.features
y_test = test_sds.labels
test_probs = model_fair.predict_proba(X_test)[:, 1]
test_pred = test_probs > 0.5

Analyse fairness and accuracy.

In [None]:
mask = test.race_white == 1

bl_acc = accuracy(test.employed_yes, bl_test_probs)
bl_dpd = demographic_parity_difference(
    test.employed_yes, bl_test_pred, sensitive_features=test.race_white,
)

acc = accuracy(test.employed_yes, test_probs)
dpd = demographic_parity_difference(
    test.employed_yes, test_pred, sensitive_features=test.race_white,
)

print(f"Baseline model accuracy: {bl_acc:.3f}")
print(f"Model accuracy: {acc:.3f}")

print(f"Baseline demographic parity difference: {bl_dpd:.3f}")
print(f"Model demographic parity difference: {dpd:.3f}")
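
For intuition about the metric reported above, the following sketch computes a demographic parity gap by hand as the difference between the largest and smallest group selection rate, which is our understanding of what `demographic_parity_difference` reports. The helper names and the toy predictions are hypothetical.

```python
def selection_rates(predictions, sensitive):
    """Share of positive predictions within each sensitive group."""
    totals, positives = {}, {}
    for pred, group in zip(predictions, sensitive):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + int(bool(pred))
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_gap(predictions, sensitive):
    """Largest minus smallest selection rate across groups."""
    rates = selection_rates(predictions, sensitive)
    return max(rates.values()) - min(rates.values())


# Toy predictions: 3 of 4 selected in group 1, 1 of 4 selected in group 0.
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0]
toy_group = [1, 1, 1, 1, 0, 0, 0, 0]
gap = demographic_parity_gap(toy_pred, toy_group)
print(gap)  # 0.5
```

A value of 0 means both groups are selected at the same rate; the true labels play no role in this metric.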

In [None]:
dp_box = group_box_plots(
    np.concatenate([bl_test_probs, test_probs]),
    np.tile(test.race_white.map({0: "Black", 1: "White"}), 2),
    groups=np.concatenate(
        [
            np.zeros_like(bl_test_probs),
            np.ones_like(test_probs),
        ]
    ),
    group_names=["Baseline", "Kamiran-Calders"],
    title="Score by race for model and baseline",
    xlabel="Score",
    ylabel="Method",
)
dp_box

In [None]:
export_plot(dp_box, "kamiran-calders-dp.json")