# Adversarial debiasing - Adult data

This notebook contains a simple implementation of the algorithm presented in [Mitigating Unwanted Biases with Adversarial Learning](https://dl.acm.org/doi/10.1145/3278721.3278779) by Zhang et al.

We train a model in tandem with an adversary that tries to predict sensitive data from the model outputs. By training the model not only to perform well but also to fool the adversary, we push it towards outputs that reveal little about the sensitive attribute, which is how we achieve fairness. By varying what we allow the adversary to see, we can achieve different notions of fairness with an otherwise very similar setup. In this notebook we demonstrate demographic parity, conditional demographic parity and equalised odds.

For simplicity, we'll focus on mitigating bias with respect to `sex`.

In [None]:
from pathlib import Path

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from helpers.fairness_measures import (
    accuracy,
    disparate_impact_d,
    disparate_impact_p,
    equalised_odds_d,
    equalised_odds_p,
)
from helpers.finance import bin_hours_per_week
from helpers.plot import group_box_plots
from tqdm.auto import tqdm  # noqa

In [None]:
from helpers import export_plot

The sigmoid function normalises numbers to the range $(0, 1)$, and is useful for constraining model outputs to be probabilities.

In [None]:
def sigmoid(arr):
    return 1 / (1 + np.exp(-arr))
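For very negative inputs, `np.exp(-arr)` overflows and NumPy emits a warning, although the result still rounds to 0. As a minimal sketch (the name `stable_sigmoid` is ours and not part of the notebook's helpers), one way to avoid this is to choose an algebraically equivalent form depending on the sign of the input:

```python
import numpy as np


def stable_sigmoid(arr):
    """Sigmoid that avoids overflow in np.exp for large-magnitude inputs."""
    arr = np.atleast_1d(np.asarray(arr, dtype=np.float64))
    out = np.empty_like(arr)
    pos = arr >= 0
    # for x >= 0, exp(-x) <= 1, so this form cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-arr[pos]))
    # for x < 0, rewrite as exp(x) / (1 + exp(x)), where exp(x) < 1
    exp_x = np.exp(arr[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out


logits = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
probs = stable_sigmoid(logits)
```

For the moderate logits produced in this notebook the simple version above is sufficient; the sketch only matters if the logits become very large in magnitude.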

Here we set some global hyperparameters for easy reference. Feel free to experiment with different values.

In [None]:
BATCH_SIZE = 512
ITERATIONS = 5000
WARMUP_ITERATIONS = 2000
# number of discriminator training steps per model training step
DISCRIMINATOR_STEPS = 5

MODEL_HIDDEN_UNITS = [50, 50]
MODEL_ACTIVATION = "relu"
MODEL_LEARNING_RATE = 1e-4

DISCRIMINATOR_HIDDEN_UNITS = [50, 50]
DISCRIMINATOR_ACTIVATION = "relu"
DISCRIMINATOR_LEARNING_RATE = 1e-2
DISCRIMINATOR_LOSS_WEIGHT = 0.9

Location of artifacts (model and data)

In [None]:
artifacts_dir = Path("../../../artifacts")

In [None]:
# override artifacts_dir in source notebook
# this is stripped out for the hosted notebooks
artifacts_dir = Path("../../../../artifacts")

Load the data. Check out the preprocessing notebook for details on how this data was obtained. TensorFlow expects float32 data, so we cast all columns on load.

In [None]:
data_dir = artifacts_dir / "data" / "adult"

train_oh = pd.read_csv(data_dir / "processed" / "train-one-hot.csv").astype(
    np.float32
)
val_oh = pd.read_csv(data_dir / "processed" / "val-one-hot.csv").astype(
    np.float32
)
test_oh = pd.read_csv(data_dir / "processed" / "test-one-hot.csv").astype(
    np.float32
)

# unscaled data for making plots
train = pd.read_csv(data_dir / "processed" / "train.csv")
val = pd.read_csv(data_dir / "processed" / "val.csv")
test = pd.read_csv(data_dir / "processed" / "test.csv")

Create NumPy arrays of relevant data.

In [None]:
train_features = train_oh.drop(columns=["sex", "salary"]).values
train_sex = train_oh[["sex"]].values
train_salary = train_oh["salary"].values

val_features = val_oh.drop(columns=["sex", "salary"]).values
val_sex = val_oh[["sex"]].values
val_salary = val_oh["salary"].values

test_features = test_oh.drop(columns=["sex", "salary"]).values
test_sex = test_oh[["sex"]].values
test_salary = test_oh["salary"].values

We'll also load the baseline adult model to compare results against.

In [None]:
baseline_model = joblib.load(
    artifacts_dir / "models" / "finance" / "baseline.pkl"
)

## Demographic parity

Build a model and an adversary. We use simple feed-forward networks in each case.

In [None]:
dp_model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(units, activation=MODEL_ACTIVATION)
        for units in MODEL_HIDDEN_UNITS
    ],
    name="model",
)
# no activation in last layer, model outputs logits not probabilities.
dp_model.add(tf.keras.layers.Dense(1))

dp_discriminator = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(units, activation=DISCRIMINATOR_ACTIVATION)
        for units in DISCRIMINATOR_HIDDEN_UNITS
    ],
    name="discriminator",
)
# also no activation function here.
dp_discriminator.add(tf.keras.layers.Dense(1))

Build a pipeline to manage training. This pipeline contains the original model, and feeds the outputs of the model to the discriminator.

In [None]:
features = tf.keras.Input(train_features.shape[1])
attribute = tf.keras.Input(1)

# concatenate features and protected data to pass to model
model_inputs = tf.keras.layers.concatenate([features, attribute])
model_outputs = dp_model(model_inputs)

# pass model outputs to discriminator
discriminator_outputs = dp_discriminator(model_outputs)

# pipeline outputs both model and discriminator outputs
dp_pipeline = tf.keras.Model(
    inputs=[features, attribute],
    outputs=[model_outputs, discriminator_outputs],
)

We build TensorFlow datasets from the data. These will handle batching and shuffling of the data during training.

In [None]:
train_data = (
    tf.data.Dataset.from_tensor_slices(
        ((train_features, train_sex), train_salary)
    )
    .shuffle(buffer_size=BATCH_SIZE * 16, reshuffle_each_iteration=True)
    .batch(BATCH_SIZE)
    .repeat()
)

val_data = (
    tf.data.Dataset.from_tensor_slices(((val_features, val_sex), val_salary))
    .batch(val_features.shape[0])
    .repeat()
)

test_data = (
    tf.data.Dataset.from_tensor_slices(
        ((test_features, test_sex), test_salary)
    )
    .batch(test_features.shape[0])
    .repeat()
)
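To make the roles of `shuffle`, `batch` and `repeat` concrete, here is a rough NumPy analogue of the stream that the training loop consumes with `next(...)`. It is only a sketch: the generator name `infinite_batches` is hypothetical, and it reshuffles the whole array on each pass, whereas `tf.data` shuffles through a fixed-size buffer.

```python
import numpy as np


def infinite_batches(features, labels, batch_size, seed=0):
    """Yield shuffled (features, labels) batches forever."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    while True:  # plays the role of .repeat()
        order = rng.permutation(n)  # plays the role of .shuffle(...)
        for start in range(0, n, batch_size):  # plays the role of .batch(...)
            idx = order[start:start + batch_size]
            yield features[idx], labels[idx]


toy_features = np.arange(20, dtype=np.float32).reshape(10, 2)
toy_labels = np.arange(10, dtype=np.float32)

stream = infinite_batches(toy_features, toy_labels, batch_size=4)
batch_sizes = [next(stream)[0].shape[0] for _ in range(6)]
```

With 10 rows and a batch size of 4, each pass ends with a smaller final batch, which is also what `batch` does by default when the batch size does not divide the dataset size.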

This function builds the training steps. Since we'll reuse very similar training steps later, we write a function that takes the pipeline as an argument and returns the training steps plus the metrics that get logged.

In [None]:
def make_training_steps(
    pipeline, model_learning_rate, discriminator_learning_rate
):
    # separate optimisers for the model and discriminator
    model_optim = tf.optimizers.Adam(model_learning_rate)
    discriminator_optim = tf.optimizers.Adam(discriminator_learning_rate)

    # use binary cross entropy for losses, note from_logits=True as we
    # have not normalised the model outputs into probabilities.
    binary_cross_entropy = tf.losses.BinaryCrossentropy(from_logits=True)

    # lists of variables that will be updated during training.
    model_vars = pipeline.get_layer("model").trainable_variables
    discriminator_vars = pipeline.get_layer(
        "discriminator"
    ).trainable_variables

    # create a dictionary of metrics for easy tracking of losses
    metrics = {
        "performance_loss": tf.metrics.Mean(
            "performance-loss", dtype=tf.float32
        ),
        "val_performance_loss": tf.metrics.Mean(
            "val-performance-loss", dtype=tf.float32
        ),
        "discriminator_loss": tf.metrics.Mean(
            "discriminator-loss", dtype=tf.float32
        ),
        "val_discriminator_loss": tf.metrics.Mean(
            "val-discriminator-loss", dtype=tf.float32
        ),
        "loss": tf.metrics.Mean("loss", dtype=tf.float32),
        "val_loss": tf.metrics.Mean("val-loss", dtype=tf.float32),
    }

    @tf.function
    def model_training_step(x_train, y_train, discriminator_loss_weight):
        """
        The weights of the model are trained by minimising:

        (1 - dlw) * model_loss - dlw * discriminator_loss

        The minus sign in front of the discriminator loss means we try to
        maximise it, thereby removing information about the protected
        attribute from the model outputs.
        """
        with tf.GradientTape() as tape:
            fair_logits, discriminator_logits = pipeline(x_train)
            performance_loss = binary_cross_entropy(y_train, fair_logits)
            discriminator_loss = binary_cross_entropy(
                x_train[1], discriminator_logits
            )
            loss = (
                (1 - discriminator_loss_weight) * performance_loss
                - discriminator_loss_weight * discriminator_loss
            )

        metrics["performance_loss"](performance_loss)
        metrics["discriminator_loss"](discriminator_loss)
        metrics["loss"](loss)

        # compute gradients and pass to optimiser
        grads = tape.gradient(loss, model_vars)
        model_optim.apply_gradients(zip(grads, model_vars))

    @tf.function
    def discriminator_training_step(x_train):
        """
        The weights of the discriminator are simply trained by minimising
        the discriminator loss directly.
        """
        with tf.GradientTape() as tape:
            _, discriminator_logits = pipeline(x_train)
            discriminator_loss = binary_cross_entropy(
                x_train[1], discriminator_logits
            )

        grads = tape.gradient(discriminator_loss, discriminator_vars)
        discriminator_optim.apply_gradients(zip(grads, discriminator_vars))

    @tf.function
    def val_step(x_val, y_val, discriminator_loss_weight):
        fair_logits, discriminator_logits = pipeline(x_val)
        performance_loss = binary_cross_entropy(y_val, fair_logits)
        discriminator_loss = binary_cross_entropy(
            x_val[1], discriminator_logits
        )
        loss = (
            (1 - discriminator_loss_weight) * performance_loss
            - discriminator_loss_weight * discriminator_loss
        )

        metrics["val_performance_loss"](performance_loss)
        metrics["val_discriminator_loss"](discriminator_loss)
        metrics["val_loss"](loss)

    return model_training_step, discriminator_training_step, val_step, metrics
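As a worked illustration of the objective in `model_training_step`, the following NumPy sketch computes binary cross entropy from logits and combines the two losses with the weighting described in the docstring. The function names are ours, and the formula is the standard numerically stable form of sigmoid cross entropy rather than a copy of the Keras internals.

```python
import numpy as np


def bce_from_logits(labels, logits):
    """Mean binary cross entropy, computed directly from logits."""
    labels = np.asarray(labels, dtype=np.float64)
    logits = np.asarray(logits, dtype=np.float64)
    # stable form of -[y * log(s) + (1 - y) * log(1 - s)], s = sigmoid(z)
    per_example = (
        np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
    )
    return per_example.mean()


def model_objective(performance_loss, discriminator_loss, weight):
    """Loss minimised by the model: it is rewarded for fooling the discriminator."""
    return (1 - weight) * performance_loss - weight * discriminator_loss


salary = np.array([1.0, 0.0, 1.0, 0.0])
sex = np.array([1.0, 1.0, 0.0, 0.0])
fair_logits = np.array([2.0, -1.0, 0.5, -3.0])
discriminator_logits = np.zeros(4)  # a discriminator that can only guess

performance_loss = bce_from_logits(salary, fair_logits)
discriminator_loss = bce_from_logits(sex, discriminator_logits)  # log(2)

warmup_loss = model_objective(performance_loss, discriminator_loss, 0.0)
adversarial_loss = model_objective(performance_loss, discriminator_loss, 0.9)
```

With a weight of 0 the objective reduces to the performance loss, which is the warm-up setting. A discriminator that outputs a logit of 0 for every example has loss $\log 2$, the value the model is pushing it towards.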

Make the training steps for demographic parity

In [None]:
(
    model_training_step,
    discriminator_training_step,
    val_step,
    metrics,
) = make_training_steps(
    dp_pipeline, MODEL_LEARNING_RATE, DISCRIMINATOR_LEARNING_RATE
)

Training this model typically takes a couple of minutes, so we load a trained model from disk here, but all the code used to train the model we're loading is included below.

In [None]:
dp_pipeline = tf.keras.models.load_model(
    artifacts_dir / "models" / "finance" / "adversarial-dp.h5"
)

We now have everything we need to train the model. We'll manually track the losses with a list since our setup is not too complicated, but we could also log metrics to [TensorBoard](https://www.tensorflow.org/tensorboard/) here.

In [None]:
# ds = iter(train_data)
# val_ds = iter(val_data)

# perf_losses = []
# disc_losses = []
# losses = []

# val_perf_losses = []
# val_disc_losses = []
# val_losses = []

We start by warming up the model without a fairness constraint to help optimisation later. Since the fairness and performance objectives are in tension, it's helpful to first roughly optimise for performance before bringing in the fairness constraint.

To train we'll simply loop over the training data and apply the model training step with the discriminator weight set to 0.

In [None]:
# for i in tqdm(range(WARMUP_ITERATIONS)):
#     x_train_batch, y_train_batch = next(ds)
#     model_training_step(x_train_batch, y_train_batch, 0.0)

#     if i % 25 == 0:
#         x_val_batch, y_val_batch = next(val_ds)
#         val_step(x_val_batch, y_val_batch, 0.0)

#         # log metrics every 25 iterations
#         perf_losses.append(metrics["performance_loss"].result())
#         metrics["performance_loss"].reset_states()
#         val_perf_losses.append(metrics["val_performance_loss"].result())
#         metrics["val_performance_loss"].reset_states()

#         disc_losses.append(metrics["discriminator_loss"].result())
#         metrics["discriminator_loss"].reset_states()
#         val_disc_losses.append(metrics["val_discriminator_loss"].result())
#         metrics["val_discriminator_loss"].reset_states()

#         losses.append(metrics["loss"].result())
#         metrics["loss"].reset_states()
#         val_losses.append(metrics["val_loss"].result())
#         metrics["val_loss"].reset_states()

We can validate training by making some simple plots of the loss curves. These are plots we'll make repeatedly, so we extract them into a reusable function.

In this case everything looks good.

In [None]:
def plot_losses(
    losses,
    val_losses,
    perf_losses,
    val_perf_losses,
    disc_losses,
    val_disc_losses,
):
    """
    Compare loss curves on train and validation sets.
    """
    f, ax = plt.subplots(ncols=3, figsize=(16, 5))

    def plot_loss_curves(ls, vls, ax, title):
        ax.plot([i * 25 for i, _ in enumerate(ls)], ls, label="train")
        ax.plot([i * 25 for i, _ in enumerate(vls)], vls, label="val")
        ax.set_title(title)
        ax.set_xlabel("Iteration")
        ax.legend()

    plot_loss_curves(losses, val_losses, ax[0], "Loss")
    plot_loss_curves(perf_losses, val_perf_losses, ax[1], "Performance loss")
    plot_loss_curves(disc_losses, val_disc_losses, ax[2], "Discriminator loss")


# plot_losses(
#     losses, val_losses, perf_losses, val_perf_losses, disc_losses, val_disc_losses
# )

Having warmed up, we now train the model against the adversary to remove discrimination.

In [None]:
# # full training
# for i in tqdm(range(ITERATIONS)):
#     x_train_batch, y_train_batch = next(ds)

#     model_training_step(
#         x_train_batch, y_train_batch, DISCRIMINATOR_LOSS_WEIGHT
#     )

#     for j in range(DISCRIMINATOR_STEPS):
#         x_train_batch, _ = next(ds)
#         discriminator_training_step(x_train_batch)

#     if i % 25 == 0:
#         x_val_batch, y_val_batch = next(val_ds)
#         val_step(x_val_batch, y_val_batch, DISCRIMINATOR_LOSS_WEIGHT)

#         # log metrics every 25 iterations
#         perf_losses.append(metrics["performance_loss"].result())
#         metrics["performance_loss"].reset_states()
#         val_perf_losses.append(metrics["val_performance_loss"].result())
#         metrics["val_performance_loss"].reset_states()

#         disc_losses.append(metrics["discriminator_loss"].result())
#         metrics["discriminator_loss"].reset_states()
#         val_disc_losses.append(metrics["val_discriminator_loss"].result())
#         metrics["val_discriminator_loss"].reset_states()

#         losses.append(metrics["loss"].result())
#         metrics["loss"].reset_states()
#         val_losses.append(metrics["val_loss"].result())
#         metrics["val_loss"].reset_states()
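Each iteration of the loop above draws one batch for the model update and a fresh batch for every discriminator update. A small pure-Python sketch of that schedule follows; the generator `training_schedule` is hypothetical and only names the steps, it does no training.

```python
def training_schedule(iterations, discriminator_steps):
    """Yield (iteration, step_name) in the order the steps would run."""
    for i in range(iterations):
        yield i, "model"
        for _ in range(discriminator_steps):
            yield i, "discriminator"


schedule = list(training_schedule(iterations=3, discriminator_steps=5))
batches_per_iteration = len(schedule) // 3
```

With `ITERATIONS = 5000` and `DISCRIMINATOR_STEPS = 5`, this amounts to 30,000 training batches after the warm-up phase, five sixths of which go to the discriminator.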

Again we plot the loss curves to check that training has proceeded roughly as expected. Notice that there's a step change when we change the weighting in the loss.

In [None]:
# plot_losses(
#     losses, val_losses, perf_losses, val_perf_losses, disc_losses, val_disc_losses
# )

We now calculate some metrics on the test set. We compare to the same metrics for the baseline model. We see that both the score level and decision level measures of demographic parity are drastically reduced, and that we also see a small reduction in accuracy.

In [None]:
mask = test_sex.flatten() == 1

# baseline metrics
bl_test_probs = baseline_model.predict_proba(
    test_oh.drop(columns="salary").values
)[:, 1]
bl_test_pred = bl_test_probs >= 0.5

bl_test_acc = accuracy(bl_test_probs, test_salary)
bl_test_did = disparate_impact_d(bl_test_probs, test_sex.flatten())
bl_test_dip = disparate_impact_p(bl_test_probs, test_sex.flatten())

# new model metrics
test_logits, _ = dp_pipeline((test_features, test_sex))
test_probs = sigmoid(test_logits.numpy().flatten())
test_pred = test_probs >= 0.5

test_acc = accuracy(test_probs, test_salary)
test_did = disparate_impact_d(test_probs, test_sex.flatten())
test_dip = disparate_impact_p(test_probs, test_sex.flatten())

print(f"Baseline accuracy: {bl_test_acc:.3f}")
print(f"Accuracy: {test_acc:.3f}\n")

print(f"Baseline disparate impact (dist.): {bl_test_did:.3f}")
print(f"Disparate impact (dist.): {test_did:.3f}\n")

print(f"Baseline disparate impact (prob.): {bl_test_dip:.3f}")
print(f"Disparate impact (prob.): {test_dip:.3f}")
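The metrics used here are imported from `helpers.fairness_measures`, whose definitions are not shown in this notebook. As a hedged illustration of what a decision-level demographic parity measure can look like, the sketch below computes the absolute difference in positive-decision rates between the two groups. The function `demographic_parity_gap` and the 0.5 threshold are our assumptions and may differ from what `disparate_impact_p` computes.

```python
import numpy as np


def demographic_parity_gap(probs, group, threshold=0.5):
    """Absolute difference in positive-decision rates between groups 0 and 1."""
    probs = np.asarray(probs)
    group = np.asarray(group)
    decisions = probs >= threshold
    rate_0 = decisions[group == 0].mean()
    rate_1 = decisions[group == 1].mean()
    return abs(rate_1 - rate_0)


toy_probs = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.3, 0.1, 0.2])
toy_sex = np.array([1, 1, 1, 1, 0, 0, 0, 0])

gap = demographic_parity_gap(toy_probs, toy_sex)
```

In this toy data half of group 1 and a quarter of group 0 receive a positive decision, so the gap is 0.25. A value of 0 would mean both groups are accepted at the same rate.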

We can further visualise the improvement with a box plot.

In [None]:
dp_box = group_box_plots(
    np.concatenate([bl_test_probs, test_probs]),
    np.concatenate([np.zeros_like(bl_test_probs), np.ones_like(test_probs)]),
    np.tile(test.sex.map(lambda x: "Male" if x else "Female"), 2),
    group_names=["Baseline", "Adversarial model"],
)
dp_box

In [None]:
export_plot(dp_box, "adversarial-dp.json")

The mean female and male scores are relatively close, and we have preserved accuracy pretty well also.

## Conditional demographic parity

We'll now repeat the process for conditional demographic parity, where we use `hours_per_week` as a legitimate risk factor when predicting someone's salary. As you'll see, we don't need to make many modifications to the code: the principal difference is that the discriminator gets direct access to `hours_per_week`. This means that the model gets no benefit from removing information about `hours_per_week` from its outputs.

In [None]:
cdp_model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(units, activation=MODEL_ACTIVATION)
        for units in MODEL_HIDDEN_UNITS
    ],
    name="model",
)
# no activation in last layer, model outputs logits not probabilities.
cdp_model.add(tf.keras.layers.Dense(1))

cdp_discriminator = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(units, activation=DISCRIMINATOR_ACTIVATION)
        for units in DISCRIMINATOR_HIDDEN_UNITS
    ],
    name="discriminator",
)
# also no activation function here.
cdp_discriminator.add(tf.keras.layers.Dense(1))

Build a pipeline to manage training. This pipeline contains the original model, and feeds the outputs of the model to the discriminator. We now also pass the legitimate risk factors to the discriminator directly.

In [None]:
features = tf.keras.Input(train_features.shape[1] - 1)
legitimate_risk_factors = tf.keras.Input(1)
attribute = tf.keras.Input(1)

# features, protected data and legitimate risk factors all passed to model
model_inputs = tf.keras.layers.concatenate(
    [features, legitimate_risk_factors, attribute]
)
model_outputs = cdp_model(model_inputs)

# discriminator receives model outputs and legitimate risk factors
discriminator_inputs = tf.keras.layers.concatenate(
    [model_outputs, legitimate_risk_factors]
)
discriminator_outputs = cdp_discriminator(discriminator_inputs)

# pipeline outputs both model and discriminator outputs; the input order
# matches the (features, sex, hours_per_week) tuples in the datasets, so
# that the training steps can read the protected attribute from x[1]
cdp_pipeline = tf.keras.Model(
    inputs=[features, attribute, legitimate_risk_factors],
    outputs=[model_outputs, discriminator_outputs],
)

We once again build TensorFlow datasets from the data. These will handle batching and shuffling of the data during training. Note that now we separate hours per week from the rest of the data so that we can pass it to the discriminator.

In [None]:
train_cdp_features = train_oh.drop(
    columns=["sex", "salary", "hours_per_week"]
).values
val_cdp_features = val_oh.drop(
    columns=["sex", "salary", "hours_per_week"]
).values
test_cdp_features = test_oh.drop(
    columns=["sex", "salary", "hours_per_week"]
).values

train_hpw = train_oh[["hours_per_week"]].values
val_hpw = val_oh[["hours_per_week"]].values
test_hpw = test_oh[["hours_per_week"]].values

train_data = (
    tf.data.Dataset.from_tensor_slices(
        ((train_cdp_features, train_sex, train_hpw), train_salary)
    )
    .shuffle(buffer_size=BATCH_SIZE * 16, reshuffle_each_iteration=True)
    .batch(BATCH_SIZE)
    .repeat()
)

val_data = (
    tf.data.Dataset.from_tensor_slices(
        ((val_cdp_features, val_sex, val_hpw), val_salary)
    )
    .batch(val_features.shape[0])
    .repeat()
)

test_data = (
    tf.data.Dataset.from_tensor_slices(
        ((test_cdp_features, test_sex, test_hpw), test_salary)
    )
    .batch(test_features.shape[0])
    .repeat()
)

Training steps. These are as before, but we use the `cdp_pipeline` instead of the `dp_pipeline`.

In [None]:
(
    model_training_step,
    discriminator_training_step,
    val_step,
    metrics,
) = make_training_steps(
    cdp_pipeline, MODEL_LEARNING_RATE, DISCRIMINATOR_LEARNING_RATE
)

Training this model typically takes a couple of minutes, so we load a trained model from disk here, but all the code used to train the model we're loading is included below.

In [None]:
cdp_pipeline = tf.keras.models.load_model(
    artifacts_dir / "models" / "finance" / "adversarial-cdp.h5"
)

We now have everything we need to train the model. We'll manually track the losses with a list since our setup is not too complicated, but we could also log metrics to [TensorBoard](https://www.tensorflow.org/tensorboard/) here.

In [None]:
# ds = iter(train_data)
# val_ds = iter(val_data)

# perf_losses = []
# disc_losses = []
# losses = []

# val_perf_losses = []
# val_disc_losses = []
# val_losses = []

We start by warming up the model without a fairness constraint to help optimisation later. Since the fairness and performance objectives are in tension, it's helpful to first roughly optimise for performance before bringing in the fairness constraint.

To train we'll simply loop over the training data and apply the model training step with the discriminator weight set to 0.

In [None]:
# for i in tqdm(range(WARMUP_ITERATIONS)):
#     x_train_batch, y_train_batch = next(ds)
#     model_training_step(x_train_batch, y_train_batch, 0.0)

#     if i % 25 == 0:
#         x_val_batch, y_val_batch = next(val_ds)
#         val_step(x_val_batch, y_val_batch, 0.0)

#         # log metrics every 25 iterations
#         perf_losses.append(metrics["performance_loss"].result())
#         metrics["performance_loss"].reset_states()
#         val_perf_losses.append(metrics["val_performance_loss"].result())
#         metrics["val_performance_loss"].reset_states()

#         disc_losses.append(metrics["discriminator_loss"].result())
#         metrics["discriminator_loss"].reset_states()
#         val_disc_losses.append(metrics["val_discriminator_loss"].result())
#         metrics["val_discriminator_loss"].reset_states()

#         losses.append(metrics["loss"].result())
#         metrics["loss"].reset_states()
#         val_losses.append(metrics["val_loss"].result())
#         metrics["val_loss"].reset_states()

We can validate training by making some simple plots of the loss curves. In this case everything looks good.

In [None]:
# plot_losses(
#     losses, val_losses, perf_losses, val_perf_losses, disc_losses, val_disc_losses
# )

Having warmed up, we now train the model against the adversary to remove discrimination.

In [None]:
# # full training
# for i in tqdm(range(ITERATIONS)):
#     x_train_batch, y_train_batch = next(ds)

#     model_training_step(
#         x_train_batch, y_train_batch, DISCRIMINATOR_LOSS_WEIGHT
#     )

#     for j in range(DISCRIMINATOR_STEPS):
#         x_train_batch, _ = next(ds)
#         discriminator_training_step(x_train_batch)

#     if i % 25 == 0:
#         x_val_batch, y_val_batch = next(val_ds)
#         val_step(x_val_batch, y_val_batch, DISCRIMINATOR_LOSS_WEIGHT)

#         # log metrics every 25 iterations
#         perf_losses.append(metrics["performance_loss"].result())
#         metrics["performance_loss"].reset_states()
#         val_perf_losses.append(metrics["val_performance_loss"].result())
#         metrics["val_performance_loss"].reset_states()

#         disc_losses.append(metrics["discriminator_loss"].result())
#         metrics["discriminator_loss"].reset_states()
#         val_disc_losses.append(metrics["val_discriminator_loss"].result())
#         metrics["val_discriminator_loss"].reset_states()

#         losses.append(metrics["loss"].result())
#         metrics["loss"].reset_states()
#         val_losses.append(metrics["val_loss"].result())
#         metrics["val_loss"].reset_states()

Again we plot the loss curves to check that training has proceeded roughly as expected. Notice that there's a step change when we change the weighting in the loss.

In [None]:
# plot_losses(
#     losses, val_losses, perf_losses, val_perf_losses, disc_losses, val_disc_losses
# )

We compute demographic parity conditioned on binned values of `hours_per_week` and compare against the baseline. Once again we see a major improvement but a slight drop in accuracy as a result.

In [None]:
mask = test_sex.flatten() == 1
test_binned_hpw = test.hours_per_week.map(bin_hours_per_week).values

# baseline metrics
bl_test_probs = baseline_model.predict_proba(
    test_oh.drop(columns="salary").values
)[:, 1]
bl_test_pred = bl_test_probs >= 0.5

bl_test_acc = accuracy(bl_test_probs, test_salary)
bl_test_did = 0
bl_test_dip = 0

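# average the per-bin measures over the hours_per_week bins; the loop
# variable is not called `val`, which already names the validation DataFrame
hpw_bins = set(test_binned_hpw)
for hpw_bin in hpw_bins:
    bin_mask = test_binned_hpw == hpw_bin
    bl_test_did += disparate_impact_d(
        bl_test_probs[bin_mask], test_sex[bin_mask].flatten()
    )
    bl_test_dip += disparate_impact_p(
        bl_test_probs[bin_mask], test_sex[bin_mask].flatten()
    )

bl_test_did /= len(hpw_bins)
bl_test_dip /= len(hpw_bins)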

# new model metrics
test_logits, _ = cdp_pipeline((test_cdp_features, test_sex, test_hpw))
test_probs = sigmoid(test_logits.numpy().flatten())
test_pred = test_probs >= 0.5

test_acc = accuracy(test_probs, test_salary)
test_did = 0
test_dip = 0

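# average the per-bin measures over the hours_per_week bins; the loop
# variable is not called `val`, which already names the validation DataFrame
hpw_bins = set(test_binned_hpw)
for hpw_bin in hpw_bins:
    bin_mask = test_binned_hpw == hpw_bin
    test_did += disparate_impact_d(
        test_probs[bin_mask], test_sex[bin_mask].flatten()
    )
    test_dip += disparate_impact_p(
        test_probs[bin_mask], test_sex[bin_mask].flatten()
    )

test_did /= len(hpw_bins)
test_dip /= len(hpw_bins)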

print(f"Baseline accuracy: {bl_test_acc:.3f}")
print(f"Accuracy: {test_acc:.3f}\n")

print(f"Baseline disparate impact (dist.): {bl_test_did:.3f}")
print(f"Disparate impact (dist.): {test_did:.3f}\n")

print(f"Baseline disparate impact (prob.): {bl_test_dip:.3f}")
print(f"Disparate impact (prob.): {test_dip:.3f}")
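The loops above average a per-bin measure over the `hours_per_week` bins. The same pattern is written out below as a self-contained sketch with a hypothetical per-bin measure, the difference in mean score between groups. That measure is our stand-in and is not necessarily what `disparate_impact_d` computes.

```python
import numpy as np


def mean_score_gap(probs, group):
    """Absolute difference in mean score between groups 0 and 1."""
    return abs(probs[group == 1].mean() - probs[group == 0].mean())


def conditional_gap(probs, group, bins):
    """Average the per-bin gap over the distinct values in `bins`.

    Assumes every bin contains members of both groups.
    """
    probs, group, bins = map(np.asarray, (probs, group, bins))
    gaps = [
        mean_score_gap(probs[bins == b], group[bins == b])
        for b in np.unique(bins)
    ]
    return float(np.mean(gaps))


toy_probs = np.array([0.2, 0.4, 0.3, 0.3, 0.8, 0.6, 0.9, 0.5])
toy_sex = np.array([0, 1, 0, 1, 0, 1, 0, 1])
toy_bins = np.array(["<30hrs"] * 4 + [">50hrs"] * 4)

overall = mean_score_gap(toy_probs, toy_sex)
conditional = conditional_gap(toy_probs, toy_sex, toy_bins)
```

In this toy data the overall gap is about 0.1 while the average within-bin gap is about 0.2, because the two bins favour different groups. This is why the conditional measure has to be computed bin by bin rather than read off the unconditional one.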

We can also visualise the improvement with a box plot.

In [None]:
bl_cdp_box = group_box_plots(
    bl_test_probs,
    test.hours_per_week.map(bin_hours_per_week),
    test_oh.sex.map({0: "Female", 1: "Male"}),
    group_names=["<30hrs", "30-40hrs", "40-50hrs", ">50hrs"],
)
bl_cdp_box

In [None]:
cdp_box = group_box_plots(
    test_probs,
    test.hours_per_week.map(bin_hours_per_week),
    test_oh.sex.map({0: "Female", 1: "Male"}),
    group_names=["<30hrs", "30-40hrs", "40-50hrs", ">50hrs"],
)
cdp_box

In [None]:
export_plot(bl_cdp_box, "bl-adversarial-cdp.json")
export_plot(cdp_box, "adversarial-cdp.json")

## Equalised odds

Finally we repeat the process for equalised odds. Once again the code is similar; all that changes is that we now pass the labels to the discriminator. This means that the model gets no benefit from removing from its outputs information about the protected attribute that is already contained in the labels.

On this dataset equalised odds seems harder to achieve, so we use a slightly deeper model, train for longer with larger batches and more discriminator steps, and increase the discriminator weight.

In [None]:
ITERATIONS = 10000
BATCH_SIZE = 2048
DISCRIMINATOR_STEPS = 10

MODEL_HIDDEN_UNITS = [50, 50, 50]

DISCRIMINATOR_HIDDEN_UNITS = [50, 50, 50]
DISCRIMINATOR_LOSS_WEIGHT = 0.975

In [None]:
eo_model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(units, activation=MODEL_ACTIVATION)
        for units in MODEL_HIDDEN_UNITS
    ],
    name="model",
)
eo_model.add(tf.keras.layers.Dense(1))

eo_discriminator = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(units, activation=DISCRIMINATOR_ACTIVATION)
        for units in DISCRIMINATOR_HIDDEN_UNITS
    ],
    name="discriminator",
)
eo_discriminator.add(tf.keras.layers.Dense(1))

Build a pipeline to manage training. This pipeline contains the original model, and feeds the outputs of the model to the discriminator. We now also pass the labels to the discriminator directly.

In [None]:
features = tf.keras.Input(train_features.shape[1])
salary = tf.keras.Input(1)
attribute = tf.keras.Input(1)

# features and protected attribute passed to model, NOT labels!
model_inputs = tf.keras.layers.concatenate([features, attribute])
model_outputs = eo_model(model_inputs)

# model outputs and labels passed to discriminator
discriminator_inputs = tf.keras.layers.concatenate([model_outputs, salary])
discriminator_outputs = eo_discriminator(discriminator_inputs)

eo_pipeline = tf.keras.Model(
    inputs=[features, attribute, salary],
    outputs=[model_outputs, discriminator_outputs],
)
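The three pipelines are meant to differ only in what the discriminator is given alongside the model output. A NumPy sketch of that input construction follows; the helper `discriminator_view` and the toy values are hypothetical, and the real pipelines build these inputs with `tf.keras.layers.concatenate`.

```python
import numpy as np


def discriminator_view(model_logits, extra=None):
    """Stack the model output with any extra column the adversary may see."""
    columns = [np.asarray(model_logits, dtype=np.float32).reshape(-1, 1)]
    if extra is not None:
        columns.append(np.asarray(extra, dtype=np.float32).reshape(-1, 1))
    return np.concatenate(columns, axis=1)


logits = np.array([0.3, -1.2, 2.0])
hours_per_week = np.array([40.0, 20.0, 55.0])
salary = np.array([1.0, 0.0, 1.0])

# demographic parity: model output only
dp_view = discriminator_view(logits)
# conditional demographic parity: model output plus the legitimate risk factor
cdp_view = discriminator_view(logits, hours_per_week)
# equalised odds: model output plus the true label
eo_view = discriminator_view(logits, salary)
```

In every case the discriminator's target is the protected attribute. Only the information it can use to predict that target changes.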

We once again build TensorFlow datasets from the data. These will handle batching and shuffling of the data during training. Note that now we pass the labels in as part of the data so that we can feed them to the discriminator.

In [None]:
train_data = (
    tf.data.Dataset.from_tensor_slices(
        (
            (train_features, train_sex, train_salary.reshape(-1, 1)),
            train_salary,
        )
    )
    .shuffle(buffer_size=BATCH_SIZE * 16, reshuffle_each_iteration=True)
    .batch(BATCH_SIZE)
    .repeat()
)

val_data = (
    tf.data.Dataset.from_tensor_slices(
        ((val_features, val_sex, val_salary.reshape(-1, 1)), val_salary)
    )
    .batch(val_features.shape[0])
    .repeat()
)

test_data = (
    tf.data.Dataset.from_tensor_slices(
        ((test_features, test_sex, test_salary.reshape(-1, 1)), test_salary)
    )
    .batch(test_features.shape[0])
    .repeat()
)

Training steps. These are as before, but we use the `eo_pipeline`.

In [None]:
(
    model_training_step,
    discriminator_training_step,
    val_step,
    metrics,
) = make_training_steps(
    eo_pipeline, MODEL_LEARNING_RATE, DISCRIMINATOR_LEARNING_RATE
)

Training this model typically takes a couple of minutes, so we load a trained model from disk here, but all the code used to train the model we're loading is included below.

In [None]:
eo_pipeline = tf.keras.models.load_model(
    artifacts_dir / "models" / "finance" / "adversarial-eo.h5"
)

We now have everything we need to train the model. We'll manually track the losses with a list since our setup is not too complicated, but we could also log metrics to [TensorBoard](https://www.tensorflow.org/tensorboard/) here.

In [None]:
# ds = iter(train_data)
# val_ds = iter(val_data)

# perf_losses = []
# disc_losses = []
# losses = []

# val_perf_losses = []
# val_disc_losses = []
# val_losses = []

We start by warming up the model without a fairness constraint to help optimisation later. Since the fairness and performance objectives are in tension, it's helpful to first roughly optimise for performance before bringing in the fairness constraint.

To train we'll simply loop over the training data and apply the model training step with the discriminator weight set to 0.

In [None]:
# for i in tqdm(range(WARMUP_ITERATIONS)):
#     x_train_batch, y_train_batch = next(ds)
#     model_training_step(x_train_batch, y_train_batch, 0.0)

#     if i % 25 == 0:
#         x_val_batch, y_val_batch = next(val_ds)
#         val_step(x_val_batch, y_val_batch, 0.0)

#         # log metrics every 25 iterations
#         perf_losses.append(metrics["performance_loss"].result())
#         metrics["performance_loss"].reset_states()
#         val_perf_losses.append(metrics["val_performance_loss"].result())
#         metrics["val_performance_loss"].reset_states()

#         disc_losses.append(metrics["discriminator_loss"].result())
#         metrics["discriminator_loss"].reset_states()
#         val_disc_losses.append(metrics["val_discriminator_loss"].result())
#         metrics["val_discriminator_loss"].reset_states()

#         losses.append(metrics["loss"].result())
#         metrics["loss"].reset_states()
#         val_losses.append(metrics["val_loss"].result())
#         metrics["val_loss"].reset_states()

We can validate training by making some simple plots of the loss curves. In this case everything looks good.

In [None]:
# plot_losses(
#     losses, val_losses, perf_losses, val_perf_losses, disc_losses, val_disc_losses
# )

Having warmed up, we now train the model against the adversary to remove discrimination.

In [None]:
# # full training
# for i in tqdm(range(ITERATIONS)):
#     x_train_batch, y_train_batch = next(ds)

#     model_training_step(
#         x_train_batch, y_train_batch, DISCRIMINATOR_LOSS_WEIGHT
#     )

#     for j in range(DISCRIMINATOR_STEPS):
#         x_train_batch, _ = next(ds)
#         discriminator_training_step(x_train_batch)

#     if i % 25 == 0:
#         x_val_batch, y_val_batch = next(val_ds)
#         val_step(x_val_batch, y_val_batch, DISCRIMINATOR_LOSS_WEIGHT)

#         # log metrics every 25 iterations
#         perf_losses.append(metrics["performance_loss"].result())
#         metrics["performance_loss"].reset_states()
#         val_perf_losses.append(metrics["val_performance_loss"].result())
#         metrics["val_performance_loss"].reset_states()

#         disc_losses.append(metrics["discriminator_loss"].result())
#         metrics["discriminator_loss"].reset_states()
#         val_disc_losses.append(metrics["val_discriminator_loss"].result())
#         metrics["val_discriminator_loss"].reset_states()

#         losses.append(metrics["loss"].result())
#         metrics["loss"].reset_states()
#         val_losses.append(metrics["val_loss"].result())
#         metrics["val_loss"].reset_states()

We again plot the loss curves. In this case, we found that there was quite a bit of instability compared to the other definitions of fairness.

In [None]:
# plot_losses(
#     losses, val_losses, perf_losses, val_perf_losses, disc_losses, val_disc_losses
# )

Comparing metrics to the baseline, not much has changed. The accuracy stayed roughly the same. The baseline actually performed slightly better in one metric and worse in the other. Actually optimising for equalised odds is going to take more effort.

In [None]:
# baseline metrics
bl_test_probs = baseline_model.predict_proba(
    test_oh.drop(columns="salary").values
)[:, 1]
bl_test_pred = bl_test_probs >= 0.5

bl_test_acc = accuracy(bl_test_probs, test_salary)
bl_test_eod = equalised_odds_d(bl_test_probs, test_sex.flatten(), test_salary)
bl_test_eop = equalised_odds_p(bl_test_probs, test_sex.flatten(), test_salary)

# new model metrics
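# labels are reshaped to (n, 1) to match the pipeline's `salary` input,
# as in the datasets built earlier in this section
test_logits, _ = eo_pipeline(
    (test_features, test_sex, test_salary.reshape(-1, 1))
)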
test_probs = sigmoid(test_logits.numpy().flatten())
test_pred = test_probs >= 0.5

test_acc = accuracy(test_probs, test_salary)
test_eod = equalised_odds_d(test_probs, test_sex.flatten(), test_salary)
test_eop = equalised_odds_p(test_probs, test_sex.flatten(), test_salary)

print(f"Baseline accuracy: {bl_test_acc:.3f}")
print(f"Accuracy: {test_acc:.3f}\n")

print(f"Baseline equalised odds (dist.): {bl_test_eod:.3f}")
print(f"Equalised odds (dist.): {test_eod:.3f}\n")

print(f"Baseline equalised odds (prob.): {bl_test_eop:.3f}")
print(f"Equalised odds (prob.): {test_eop:.3f}")
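The definitions of `equalised_odds_d` and `equalised_odds_p` live in `helpers.fairness_measures` and are not shown here. As a hedged sketch of the underlying idea, equalised odds at the decision level asks for equal true-positive and false-positive rates across groups. The function names and the 0.5 threshold below are our assumptions, not the helpers' implementation.

```python
import numpy as np


def rate_gap(decisions, group):
    """Absolute difference in positive-decision rate between groups 0 and 1."""
    return abs(decisions[group == 1].mean() - decisions[group == 0].mean())


def equalised_odds_gaps(probs, group, labels, threshold=0.5):
    """Return (true-positive-rate gap, false-positive-rate gap)."""
    probs, group, labels = map(np.asarray, (probs, group, labels))
    decisions = probs >= threshold
    positives = labels == 1
    # compare groups separately among true positives and true negatives
    tpr_gap = rate_gap(decisions[positives], group[positives])
    fpr_gap = rate_gap(decisions[~positives], group[~positives])
    return tpr_gap, fpr_gap


toy_probs = np.array([0.9, 0.8, 0.6, 0.2, 0.7, 0.4, 0.3, 0.1])
toy_sex = np.array([1, 1, 1, 1, 0, 0, 0, 0])
toy_salary = np.array([1, 1, 0, 0, 1, 1, 0, 0])

tpr_gap, fpr_gap = equalised_odds_gaps(toy_probs, toy_sex, toy_salary)
```

Both gaps are 0.5 in this toy data: group 1 is favoured among those who do earn more than $50k and among those who do not. Equalised odds holds when both gaps are 0.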

In [None]:
bl_eo_box = group_box_plots(
    bl_test_probs,
    test.salary,
    test_oh.sex.map({0: "Female", 1: "Male"}),
    group_names=["<= $50k", "> $50k"],
)
bl_eo_box

In [None]:
eo_box = group_box_plots(
    test_probs,
    test.salary,
    test_oh.sex.map({0: "Female", 1: "Male"}),
    group_names=["<= $50k", "> $50k"],
)
eo_box

In [None]:
export_plot(bl_eo_box, "bl-adversarial-eo.json")
export_plot(eo_box, "adversarial-eo.json")