# Reproduction of Adult dataset experiments

In this notebook we reproduce the results from Table 2 of the DECAF paper. We compare several ways of generating debiased data with the DECAF model against synthetic data generated by the benchmark models GAN, WGAN-GP and FairGAN. As described in the paper, we run every experiment implemented in this notebook 10 times and average the results.
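
The averaging over repeated runs mentioned above could be sketched as follows. This is only an illustration: `run_experiment` is a hypothetical placeholder that returns random numbers in place of one real train/sample/evaluate cycle, and `aggregate_runs` is an assumed helper, not part of this repository.

```python
import random
import statistics


def aggregate_runs(run_results):
    """Combine per-run metric dicts into {metric: (mean, std)}."""
    summary = {}
    for name in run_results[0]:
        values = [run[name] for run in run_results]
        # Sample standard deviation needs at least two runs
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[name] = (statistics.mean(values), std)
    return summary


def run_experiment(seed):
    """Hypothetical stand-in for one experiment; returns made-up metrics."""
    rng = random.Random(seed)
    return {'precision': rng.uniform(0.7, 0.9), 'dp': rng.uniform(0.0, 0.3)}


# Ten runs with different seeds, then mean and spread per metric
results = [run_experiment(seed) for seed in range(10)]
for name, (mean, std) in aggregate_runs(results).items():
    print(f'{name}: {mean:.3f} +/- {std:.3f}')
```

In practice each run would return the dictionary produced by the evaluation helper defined later in this notebook.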

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

from data import generate_synthetic_data, load_adult, preprocess_adult
from metrics import DP, FTU
from train import train_decaf, train_vanilla_gan, train_wgan_gp


## Loading and preprocessing data

We start by downloading the Adult dataset, displaying a few examples from it, and then preprocessing it into a form suitable for training the generative models that produce synthetic data.

In [2]:
# Display examples from the dataset
adult_dataset = load_adult()
adult_dataset.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


We next preprocess the data. Preprocessing scales all feature values, including the categorical ones, to lie between 0 and 1. We also split the data into training and test sets, where the test set has a size of 2000 as described in the original DECAF paper. Additionally, we stratify the split so that the proportion of positive to negative examples is the same across the train and test folds.
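
The implementation of `preprocess_adult` is not shown in this notebook. A minimal sketch of the kind of scaling described above, assuming categorical values are first mapped to integer codes and every column is then min-max scaled, might look like this (`ordinal_encode` and `min_max_scale` are hypothetical helpers):

```python
import numpy as np


def ordinal_encode(values):
    """Map each distinct category to an integer code (sorted category order)."""
    categories, codes = np.unique(np.asarray(values), return_inverse=True)
    return codes.astype(float), [str(c) for c in categories]


def min_max_scale(column):
    """Scale a numeric column linearly into [0, 1]."""
    column = np.asarray(column, dtype=float)
    span = column.max() - column.min()
    if span == 0:
        return np.zeros_like(column)  # constant column: avoid division by zero
    return (column - column.min()) / span


# A categorical column is encoded first, then scaled like any numeric column
sex_codes, sex_categories = ordinal_encode(['Male', 'Female', 'Male'])
ages = min_max_scale([39, 50, 28])
print(sex_categories, min_max_scale(sex_codes), ages)
```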

In [3]:
# Get training and testing data
X, y = preprocess_adult(adult_dataset)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=2000, stratify=y)

# Visualize train data and labels
pd.DataFrame(data=np.column_stack((X_train, y_train))).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,0.479452,0.0,0.106066,0.066667,0.6,0.333333,0.307692,0.6,0.0,0.0,0.0,0.0,0.5,0.0,1.0
1,0.068493,0.0,0.070427,0.066667,0.6,0.333333,0.615385,0.6,0.0,0.0,0.0,0.0,0.346939,0.0,1.0
2,0.027397,0.0,0.21596,0.2,0.533333,0.333333,0.923077,0.6,0.0,1.0,0.0,0.0,0.316327,0.0,1.0
3,0.438356,0.0,0.123624,0.066667,0.6,0.0,0.307692,0.4,0.0,1.0,0.072981,0.0,0.397959,0.0,0.0
4,0.479452,0.0,0.148452,0.6,0.466667,0.0,0.538462,0.4,0.0,1.0,0.0,0.0,0.44898,0.0,0.0
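
Stratification, requested above through `stratify=y`, keeps the class ratio of the test fold close to that of the full dataset. The idea can be sketched with the standard library alone. This is a simplified illustration, not scikit-learn's implementation, and `stratified_split` is a hypothetical helper:

```python
import random
from collections import Counter


def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx), sampling each class proportionally."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    test_idx = []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Each class contributes in proportion to its share of the data;
        # rounding can make the total differ slightly from test_size.
        n_test = round(test_size * len(indices) / len(labels))
        test_idx.extend(indices[:n_test])

    test_set = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in test_set]
    return train_idx, test_idx


labels = [1] * 750 + [0] * 250
train_idx, test_idx = stratified_split(labels, test_size=200)
print(Counter(labels[i] for i in test_idx))
```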


### Defining the DAG

We also need to define a DAG that captures the biases of the dataset. As described in the DECAF paper, a causal discovery algorithm is normally used to obtain it. In this notebook we simply copy the DAG described in the Zhang et al. paper, which is also the one used in the DECAF paper.

In [4]:
# Define DAG for Adult dataset
dag = [
    # Edges from race
    ['race', 'occupation'],
    ['race', 'income'],
    ['race', 'hours-per-week'],
    ['race', 'education'],
    ['race', 'marital-status'],

    # Edges from age
    ['age', 'occupation'],
    ['age', 'hours-per-week'],
    ['age', 'income'],
    ['age', 'workclass'],
    ['age', 'marital-status'],
    ['age', 'education'],
    ['age', 'relationship'],
    
    # Edges from sex
    ['sex', 'occupation'],
    ['sex', 'marital-status'],
    ['sex', 'income'],
    ['sex', 'workclass'],
    ['sex', 'education'],
    ['sex', 'relationship'],
    
    # Edges from native country
    ['native-country', 'marital-status'],
    ['native-country', 'hours-per-week'],
    ['native-country', 'education'],
    ['native-country', 'workclass'],
    ['native-country', 'income'],
    ['native-country', 'relationship'],
    
    # Edges from marital status
    ['marital-status', 'occupation'],
    ['marital-status', 'hours-per-week'],
    ['marital-status', 'income'],
    ['marital-status', 'workclass'],
    ['marital-status', 'relationship'],
    ['marital-status', 'education'],
    
    # Edges from education
    ['education', 'occupation'],
    ['education', 'hours-per-week'],
    ['education', 'income'],
    ['education', 'workclass'],
    ['education', 'relationship'],
    
    # All remaining edges
    ['occupation', 'income'],
    ['hours-per-week', 'income'],
    ['workclass', 'income'],
    ['relationship', 'income'],
]

def dag_to_idx(df, dag):
    """Convert columns in a DAG to the corresponding indices."""

    dag_idx = []
    for edge in dag:
        dag_idx.append([df.columns.get_loc(edge[0]), df.columns.get_loc(edge[1])])

    return dag_idx

# Convert the DAG to one that can be provided to the DECAF model
dag_seed = dag_to_idx(adult_dataset, dag)
print(dag_seed)

[[8, 6], [8, 14], [8, 12], [8, 3], [8, 5], [0, 6], [0, 12], [0, 14], [0, 1], [0, 5], [0, 3], [0, 7], [9, 6], [9, 5], [9, 14], [9, 1], [9, 3], [9, 7], [13, 5], [13, 12], [13, 3], [13, 1], [13, 14], [13, 7], [5, 6], [5, 12], [5, 14], [5, 1], [5, 7], [5, 3], [3, 6], [3, 12], [3, 14], [3, 1], [3, 7], [6, 14], [12, 14], [1, 14], [7, 14]]
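
Since the edge list is copied by hand, it can be worth checking that it really is acyclic. A minimal sketch using the standard-library `graphlib` module (Python 3.9 or later) is shown below; `topological_order` is a hypothetical helper and the edge list is a small subset of the one defined above.

```python
from graphlib import CycleError, TopologicalSorter


def topological_order(edges):
    """Return nodes in causal order, or raise ValueError if there is a cycle."""
    sorter = TopologicalSorter()
    for parent, child in edges:
        sorter.add(child, parent)  # `parent` must come before `child`
    try:
        return list(sorter.static_order())
    except CycleError as err:
        # The second element of args holds the detected cycle
        raise ValueError(f'Graph is not a DAG: {err.args[1]}') from err


# A small subset of the edges defined above
edges = [['sex', 'education'], ['education', 'income'], ['sex', 'income']]
print(topological_order(edges))
```

The same function works on the index pairs in `dag_seed`, since any hashable node label is accepted.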


We also need to define the edges to remove from the DAG in order to meet the fairness criteria described in the paper.

In [5]:
def create_bias_dict(df, edge_map):
    """
    Convert the given edge tuples to a bias dict used for generating
    debiased synthetic data.
    """
    bias_dict = {}
    for key, val in edge_map.items():
        bias_dict[df.columns.get_loc(key)] = [df.columns.get_loc(f) for f in val]
    
    return bias_dict

# Bias dictionary to satisfy FTU
bias_dict_ftu = create_bias_dict(adult_dataset, {'income': ['sex']})
print('Bias dict FTU:', bias_dict_ftu)

# Bias dictionary to satisfy DP
bias_dict_dp = create_bias_dict(adult_dataset, {'income': [
    'occupation', 'hours-per-week', 'marital-status', 'education', 'sex',
    'workclass', 'relationship']})
print('Bias dict DP:', bias_dict_dp)

# Bias dictionary to satisfy CF
bias_dict_cf = create_bias_dict(adult_dataset, {'income': [
    'marital-status', 'sex']})
print('Bias dict CF:', bias_dict_cf)

Bias dict FTU: {14: [9]}
Bias dict DP: {14: [6, 12, 5, 3, 9, 1, 7]}
Bias dict CF: {14: [5, 9]}
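
Each bias dictionary maps a target column index to the parent indices whose edges into that target should be dropped. How the DECAF generator applies the dictionary internally is not shown in this notebook; the sketch below only illustrates the structural effect on an edge list, using a hypothetical helper `remove_biased_edges`.

```python
def remove_biased_edges(dag_idx, bias_dict):
    """Drop every [parent, child] edge listed in bias_dict as {child: [parents]}."""
    return [
        [parent, child]
        for parent, child in dag_idx
        if parent not in bias_dict.get(child, [])
    ]


# Edges into income (14) from sex (9), marital-status (5) and occupation (6),
# plus the edge from sex (9) to marital-status (5)
dag_idx = [[9, 14], [5, 14], [6, 14], [9, 5]]
ftu_edges = remove_biased_edges(dag_idx, {14: [9]})    # FTU-style removal
cf_edges = remove_biased_edges(dag_idx, {14: [5, 9]})  # CF-style removal
print(ftu_edges)
print(cf_edges)
```

Note that only edges pointing into the target are affected: the edge from sex to marital-status stays in place in both cases.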


## Experiments

We have loaded and preprocessed the data and are ready to run the experiments. For each experiment we train a generative model, sample synthetic data from the trained model, and then obtain metrics by training a downstream multi-layer perceptron on the synthetic data and evaluating it on the test fold we generated in the previous section. We use the MLP model from `sklearn` with default parameters, which matches the settings described in Appendix D of the paper.

In [6]:
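def eval_model(mlp, X_train, y_train, X_test=X_test, y_test=y_test):
    """Helper function that returns evaluation metrics as a dict."""

    y_pred = mlp.predict(X_test)

    # Utility metrics of the downstream classifier on the real test fold
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    # Note: AUROC is computed from hard class predictions, not probabilities
    auroc = roc_auc_score(y_test, y_pred)
    # Fairness metrics: demographic parity and fairness through unawareness
    dp = DP(mlp, X_test)
    ftu = FTU(X_train, y_train, X_test)

    return {'precision': precision, 'recall': recall, 'auroc': auroc, 'dp': dp, 'ftu': ftu}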
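
The `DP` and `FTU` functions come from the `metrics` module, whose code is not shown here. As a rough illustration of what such fairness metrics measure, the sketch below computes a demographic parity gap and a fairness-through-unawareness gap for any prediction function. The names `demographic_parity_gap` and `ftu_gap`, their signatures, and the toy classifiers are assumptions made for this example and may differ from the actual implementation.

```python
import numpy as np


def demographic_parity_gap(predict, X, protected_idx):
    """|P(y_hat = 1 | A = 0) - P(y_hat = 1 | A = 1)| over the rows of X."""
    y_hat = predict(X)
    group = X[:, protected_idx] > 0.5
    return abs(y_hat[~group].mean() - y_hat[group].mean())


def ftu_gap(predict, X, protected_idx):
    """Absolute difference in mean prediction when A is forced to 0 vs 1."""
    X0, X1 = X.copy(), X.copy()
    X0[:, protected_idx] = 0.0
    X1[:, protected_idx] = 1.0
    return abs(predict(X0).mean() - predict(X1).mean())


# Toy data: column 0 is a regular feature, column 1 the protected attribute
X = np.array([[0.8, 0.0], [0.2, 0.0], [0.8, 1.0], [0.2, 1.0]])


def biased(X):
    """Toy classifier whose decision depends on the protected attribute."""
    return (X[:, 0] + X[:, 1] > 1.0).astype(float)


def unaware(X):
    """Toy classifier that ignores the protected attribute."""
    return (X[:, 0] > 0.5).astype(float)


print(demographic_parity_gap(biased, X, 1), ftu_gap(biased, X, 1))
print(demographic_parity_gap(unaware, X, 1), ftu_gap(unaware, X, 1))
```

A classifier that ignores the protected attribute has an FTU gap of zero by construction, while its demographic parity gap can still be non-zero when other features correlate with the protected attribute.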

### Original dataset

As a benchmark, we first train the downstream model on the original dataset.

In [7]:
mlp = MLPClassifier().fit(X_train, y_train)
eval_model(mlp, X_train, y_train, X_test, y_test)



{'precision': 0.8639200998751561,
 'recall': 0.9214380825565912,
 'auroc': 0.7418435392702634,
 'dp': 0.11743453851737821,
 'ftu': 0.05407935884178032}

In the following sections we train various models in order to reproduce the results from Table 2 of the DECAF paper.

### GAN

In [8]:
gan = train_vanilla_gan()

 10%|█         | 1/10 [00:10<01:33, 10.41s/it]

0 [D loss: 0.000130, acc.: 100.00%] [G loss: 33.094780]


 20%|██        | 2/10 [00:19<01:17,  9.73s/it]

1 [D loss: 0.000756, acc.: 100.00%] [G loss: 58.786240]


 30%|███       | 3/10 [00:28<01:06,  9.50s/it]

2 [D loss: 0.013608, acc.: 99.61%] [G loss: 86.380211]


 40%|████      | 4/10 [00:37<00:54,  9.12s/it]

3 [D loss: 0.000005, acc.: 100.00%] [G loss: 101.344086]


 50%|█████     | 5/10 [00:45<00:44,  8.91s/it]

4 [D loss: 0.000000, acc.: 100.00%] [G loss: 117.346016]


 60%|██████    | 6/10 [00:55<00:35,  8.96s/it]

5 [D loss: 0.000000, acc.: 100.00%] [G loss: 129.285889]


 70%|███████   | 7/10 [01:04<00:27,  9.03s/it]

6 [D loss: 0.000001, acc.: 100.00%] [G loss: 144.738297]


 80%|████████  | 8/10 [01:15<00:19,  9.84s/it]

7 [D loss: 0.000000, acc.: 100.00%] [G loss: 162.553253]


 90%|█████████ | 9/10 [01:25<00:09,  9.83s/it]

8 [D loss: 0.000000, acc.: 100.00%] [G loss: 174.425842]


100%|██████████| 10/10 [01:35<00:00,  9.51s/it]

9 [D loss: 0.000000, acc.: 100.00%] [G loss: 183.833832]





In [9]:
synth_df = gan.sample(len(adult_dataset))
X_synth, y_synth = preprocess_adult(synth_df)
synth_df.head()

Synthetic data generation: 100%|██████████| 236/236 [00:02<00:00, 103.29it/s]


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,18,Self-emp-inc,41895,Doctorate,13,Married-AF-spouse,Other-service,Wife,Black,Male,373,-97,3,Taiwan,>50K
1,19,Federal-gov,38613,HS-grad,11,Separated,Farming-fishing,Wife,Other,Female,-41,-108,3,Nicaragua,<=50K
2,18,Federal-gov,40242,1st-4th,5,Never-married,Machine-op-inspct,Not-in-family,Asian-Pac-Islander,Female,27,-92,2,Thailand,>50K
3,19,Self-emp-inc,47360,9th,2,Married-spouse-absent,Exec-managerial,Husband,White,Female,45,-107,3,Holand-Netherlands,>50K
4,19,Federal-gov,50762,Prof-school,4,Married-civ-spouse,Exec-managerial,Own-child,Amer-Indian-Eskimo,Female,-78,-96,4,Hungary,>50K


In [10]:
mlp = MLPClassifier().fit(X_synth, y_synth)
eval_model(mlp, X_synth, y_synth, X_test, y_test)



{'precision': 0.696866485013624,
 'recall': 0.6810918774966711,
 'auroc': 0.39375879015395804,
 'dp': 0.2678333332484898,
 'ftu': 0.19709672372361148}

### WGAN-GP

In [11]:
wgan_gp = train_wgan_gp()

 10%|█         | 1/10 [00:04<00:44,  4.92s/it]

Epoch: 0 | disc_loss: 1.0493888854980469 | gen_loss: 0.0024936457630246878


 20%|██        | 2/10 [00:07<00:27,  3.49s/it]

Epoch: 1 | disc_loss: 4.879275321960449 | gen_loss: 0.010172907263040543


 30%|███       | 3/10 [00:09<00:21,  3.07s/it]

Epoch: 2 | disc_loss: 5.498307228088379 | gen_loss: 0.010316535830497742


 40%|████      | 4/10 [00:13<00:18,  3.11s/it]

Epoch: 3 | disc_loss: 0.6758951544761658 | gen_loss: 0.03112035244703293


 50%|█████     | 5/10 [00:17<00:16,  3.40s/it]

Epoch: 4 | disc_loss: 0.051433153450489044 | gen_loss: 0.05552292615175247


 60%|██████    | 6/10 [00:21<00:14,  3.62s/it]

Epoch: 5 | disc_loss: -0.024509325623512268 | gen_loss: 0.04996667802333832


 70%|███████   | 7/10 [00:24<00:11,  3.70s/it]

Epoch: 6 | disc_loss: -0.029461752623319626 | gen_loss: 0.0560053326189518


 80%|████████  | 8/10 [00:28<00:07,  3.68s/it]

Epoch: 7 | disc_loss: 0.4348539113998413 | gen_loss: 0.06448204070329666


 90%|█████████ | 9/10 [00:32<00:03,  3.61s/it]

Epoch: 8 | disc_loss: 0.3692534863948822 | gen_loss: 0.0648110881447792


100%|██████████| 10/10 [00:35<00:00,  3.60s/it]

Epoch: 9 | disc_loss: -0.0524517185986042 | gen_loss: 0.07477548718452454





In [12]:
synth_df = wgan_gp.sample(len(adult_dataset))
X_synth, y_synth = preprocess_adult(synth_df)
synth_df.head()

Synthetic data generation: 100%|██████████| 61/61 [00:01<00:00, 40.94it/s]


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,48,Local-gov,59756,Assoc-voc,11,Separated,Exec-managerial,Own-child,Other,Male,-9466,189,34,Ireland,>50K
1,53,Self-emp-inc,33796,Prof-school,5,Divorced,Sales,Wife,Asian-Pac-Islander,Female,-6891,-48,39,Taiwan,>50K
2,51,Local-gov,18129,Masters,9,Divorced,Handlers-cleaners,Own-child,White,Male,-13059,200,38,Cuba,>50K
3,51,Federal-gov,46350,5th-6th,10,Never-married,Transport-moving,Not-in-family,Asian-Pac-Islander,Female,-13165,122,43,Japan,>50K
4,58,Self-emp-not-inc,26969,Prof-school,12,Separated,Other-service,Own-child,Black,Female,-10135,223,45,Peru,<=50K


In [13]:
mlp = MLPClassifier().fit(X_synth, y_synth)
eval_model(mlp, X_synth, y_synth, X_test, y_test)



{'precision': 0.6913099870298314,
 'recall': 0.3548601864181092,
 'auroc': 0.43847426991588195,
 'dp': 0.328974925705189,
 'ftu': 0.22759274591220616}

### DECAF

In [14]:
model, dm = train_decaf(X_train, y_train, dag_seed)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

  | Name          | Type             | Params
---------------------------------------------------
0 | generator     | Generator_causal | 134 K 
1 | discriminator | Discriminator    | 43.6 K
---------------------------------------------------
178 K     Trainable params
225       Non-trainable params
178 K     Total params
0.713     Total estimated model params size (MB)


Initialised adjacency matrix as parsed:
 Parameter containing:
tensor([[0., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 0., 1., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 0., 1., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 1.],
        [0., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 1., 0., 1., 0.
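
The adjacency matrix printed above appears to hold a 1 at row `parent`, column `child` for every edge in `dag_seed`. A small sketch of building such a matrix from index pairs, using a hypothetical helper `edges_to_adjacency` and NumPy rather than the model's own tensor code:

```python
import numpy as np


def edges_to_adjacency(dag_idx, n_features):
    """Build a matrix with adj[parent, child] = 1 for every edge."""
    adj = np.zeros((n_features, n_features))
    for parent, child in dag_idx:
        adj[parent, child] = 1.0
    return adj


# A few of the edges out of age (column 0) taken from the printed dag_seed
adj = edges_to_adjacency([[0, 6], [0, 12], [0, 14], [0, 1]], n_features=15)
print(adj[0])
```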

In [15]:
X_synth, y_synth = generate_synthetic_data(model, dm)
pd.DataFrame(data=np.column_stack((X_synth, y_synth))).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,0.07875,6.898178e-05,0.193788,0.292212,0.670328,0.58506,0.589974,0.924082,8.088188e-06,0.003137,0.00224,0.000161,0.101669,5.560403e-13,1.0
1,0.058457,1.623612e-08,0.23257,0.027444,0.712811,0.345917,0.129528,0.570842,1.57445e-09,1.0,0.021018,0.000955,0.494674,3.081539e-07,1.0
2,0.185823,4.318405e-09,0.125791,0.105664,0.572919,0.460711,0.078483,0.490799,6.685104e-18,1.0,0.004792,0.002169,0.542236,2.943965e-06,0.0
3,0.164186,4.41156e-06,0.032183,0.164686,0.491094,0.05412,0.185706,0.322611,0.9678963,1.0,0.001666,7e-06,0.550422,0.2928007,1.0
4,0.532269,4.156069e-08,0.138216,0.32832,0.637247,0.36835,0.715154,0.793034,1.608216e-07,0.002404,0.014398,0.000922,0.447102,2.268934e-07,1.0


#### DECAF-ND

DECAF-ND ("no debiasing") uses synthetic data sampled from the trained DECAF model without removing any edges from the DAG.

In [16]:
X_synth, y_synth = generate_synthetic_data(model, dm)
mlp = MLPClassifier().fit(X_synth, y_synth)
eval_model(mlp, X_synth, y_synth, X_test, y_test)



{'precision': 0.7733473242392445,
 'recall': 0.9813581890812251,
 'auroc': 0.5569441547815763,
 'dp': 0.3019233015366842,
 'ftu': 0.1274305204685909}