# Customized models and datasets for structured inputs

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HanXudong/fairlib/blob/main/tutorial/Structured_Inputs.ipynb)

In this tutorial we will:
- Show how to add a model for classification over structured inputs
- Show how to add a dataloader with structured data preprocessing

We will use Northpointe's Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) scores, which are used in states such as California and Florida.


## Installation

Again, the first step is to install our library.

In [1]:
!pip install fairlib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fairlib
  Downloading fairlib-0.0.9-py3-none-any.whl (85 kB)
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
Collecting pickle5
  Downloading pickle5-0.0.12-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (256 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting PyYAML
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)

In [2]:
import fairlib

## Download and preprocess the COMPAS dataset

Reference notebook for the COMPAS preprocessing: https://github.com/google-research/google-research/blob/master/group_agnostic_fairness/data_utils/CreateCompasDatasetFiles.ipynb

In [3]:
from fairlib import datasets
datasets.prepare_dataset("compas", "data")

saving to /content/data/compas-scores-two-years.csv


  0%|          | 0/1 [00:00<?, ?it/s]

In [4]:
import pandas as pd

pd.read_pickle("data/COMPAS_dev.pkl").keys()

Index(['juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count',
       'age', 'c_charge_degree_0', 'c_charge_degree_1', 'c_charge_desc_262',
       'c_charge_desc_43', 'c_charge_desc_55',
       ...
       'c_charge_desc_323', 'c_charge_desc_66', 'c_charge_desc_40',
       'c_charge_desc_33', 'age_cat_1', 'age_cat_0', 'age_cat_2', 'sex',
       'race', 'is_recid'],
      dtype='object', length=450)
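
The index above lists 450 columns: the model inputs plus `sex`, `race`, and `is_recid`. As a minimal sketch of how such a frame can be split into features, target, and protected attribute, the following uses a tiny synthetic DataFrame (the rows and values are made up for illustration, and only a few of the real column names are reused):

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for COMPAS_dev.pkl (values are made up for illustration).
toy = pd.DataFrame({
    "priors_count": [0, 3, 1],
    "age": [25, 41, 33],
    "c_charge_degree_0": [1, 0, 1],
    "sex": [0, 1, 0],
    "race": [0, 1, 2],
    "is_recid": [1, 0, 0],
})

# These three columns are not model inputs.
non_feature_columns = ["sex", "race", "is_recid"]

X = toy.drop(columns=non_feature_columns).to_numpy().astype(np.float32)
y = toy["is_recid"].to_numpy()
protected = toy["race"].to_numpy()

print(X.shape, y.shape, protected.shape)  # (3, 3) (3,) (3,)
```

Applied to the real file, dropping these three columns from the 450 listed above would leave 447 feature columns.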

## Train Models

In [5]:
from fairlib import networks, BaseOptions, dataloaders
import torch

In [6]:
Shared_options = {
    # The name of the dataset; the corresponding dataloader will be used
    "dataset":  "COMPAS",

    # Specify the path to the input data
    "data_dir": "./data",

    # Device for computing, -1 is the cpu
    "device_id": -1,

    # The default path for saving experimental results
    "results_dir":  r"results",

    # The same as the dataset
    "project_dir":  r"dev",

    # We focus on the TPR GAP, which corresponds to Equalized Odds for binary classification.
    "GAP_metric_name":  "TPR_GAP",

    # The overall performance will be measured as accuracy
    "Performance_metric_name":  "accuracy",

    # Model selections are based on DTO
    "selection_criterion":  "DTO",

    # Default dirs for saving checkpoints
    "checkpoint_dir":   "models",
    "checkpoint_name":  "checkpoint_epoch",


    "n_jobs":   1,
}
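
To make the `TPR_GAP` and `DTO` options more concrete, here is a rough sketch of both quantities. It is not fairlib's implementation: we assume the gap is the largest difference in true positive rate between groups for the positive class only, and that DTO is the Euclidean distance to an ideal point with perfect performance and perfect fairness. The function names `tpr_gap` and `dto` are hypothetical, and fairlib may aggregate over groups and classes differently.

```python
import numpy as np

def tpr_gap(y_true, y_pred, groups, positive_label=1):
    """Largest difference in true positive rate between any two groups."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    rates = []
    for g in np.unique(groups):
        positives = (groups == g) & (y_true == positive_label)
        if positives.sum() == 0:
            continue  # TPR is undefined for a group with no positive examples
        rates.append((y_pred[positives] == positive_label).mean())
    return float(max(rates) - min(rates))

def dto(performance, fairness, optimum=(1.0, 1.0)):
    """Euclidean distance from (performance, fairness) to the ideal point."""
    return float(np.hypot(optimum[0] - performance, optimum[1] - fairness))

# Toy predictions for two groups (made-up values).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
groups = [0, 0, 1, 1, 0, 1]

gap = tpr_gap(y_true, y_pred, groups)
accuracy = float(np.mean(np.array(y_true) == np.array(y_pred)))
print(gap, accuracy, dto(accuracy, 1.0 - gap))
```

Under this reading, a smaller DTO means a model sits closer to the point where accuracy is 1 and the gap is 0, which is why it can serve as a single model selection criterion.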

In [7]:
args = {
    "dataset":Shared_options["dataset"], 
    "data_dir":Shared_options["data_dir"],
    "device_id":Shared_options["device_id"],

    # Give a name to the exp, which will be used in the path
    "exp_id":"vanilla",

    "emb_size": 450-3, # 450 columns minus sex, race, and is_recid
    "lr": 0.001,
    "batch_size": 128,
    "hidden_size": 32,
    "n_hidden": 1,
    "activation_function": "ReLu",

    "num_classes": 2,
    "num_groups": 3, # Black, White, and Other
}

# Initialize the arguments
options = BaseOptions()
state = options.get_state(args=args, silence=True)

INFO:root:Unexpected args: ['-f', '/root/.local/share/jupyter/runtime/kernel-1cc80252-6b1d-427f-a63d-8b70911dfc3b.json']
INFO:root:Logging to ./results/dev/COMPAS/vanilla/output.log


2022-07-21 07:10:35 [INFO ]  Base directory is ./results/dev/COMPAS/vanilla
2022-07-21 07:10:35 [INFO ]  Exception type : AssertionError 
2022-07-21 07:10:35 [INFO ]  Exception message : Not implemented
2022-07-21 07:10:35 [INFO ]  Stack trace : ['File : /usr/local/lib/python3.7/dist-packages/fairlib/src/base_options.py , Line : 486, Func.Name : set_state, Message : train_iterator, dev_iterator, test_iterator = dataloaders.get_dataloaders(state)', 'File : /usr/local/lib/python3.7/dist-packages/fairlib/src/dataloaders/__init__.py , Line : 40, Func.Name : get_dataloaders, Message : ], "Not implemented"']
2022-07-21 07:10:35 [INFO ]  dataloaders need to be initialized!

The `AssertionError` in the log above is expected: `dataloaders.get_dataloaders` does not provide a loader for this dataset name, so fairlib reports that the dataloaders still need to be initialized. We build them ourselves in the next section.


## Customize dataset loader

In [8]:
import os
import pandas as pd
import numpy as np

In [9]:
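class CustomizedDataset(dataloaders.utils.BaseDataset):
    """COMPAS dataset that reads the preprocessed pickle for a given split."""

    def load_data(self):
        # Path to the split file, e.g. ./data/COMPAS_train.pkl
        self.data_dir = os.path.join(self.args.data_dir, "COMPAS_{}.pkl".format(self.split))

        data = pd.read_pickle(self.data_dir)

        # Inputs: every column except the two demographic attributes and the target
        self.X = data.drop(['sex', 'race', 'is_recid'], axis=1).to_numpy().astype(np.float32)
        # Target label
        self.y = list(data["is_recid"])
        # Protected attribute
        self.protected_label = list(data["race"])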

In [10]:
customized_train_data = CustomizedDataset(args=state, split="train")
customized_dev_data = CustomizedDataset(args=state, split="dev")
customized_test_data = CustomizedDataset(args=state, split="test")

# DataLoader parameters
train_dataloader_params = {
    'batch_size': state.batch_size,
    'shuffle': True,
    'num_workers': state.num_workers}

eval_dataloader_params = {
    'batch_size': state.test_batch_size,
    'shuffle': False,
    'num_workers': state.num_workers}

# Initialize the dataloaders
customized_training_generator = torch.utils.data.DataLoader(customized_train_data, **train_dataloader_params)
customized_validation_generator = torch.utils.data.DataLoader(customized_dev_data, **eval_dataloader_params)
customized_test_generator = torch.utils.data.DataLoader(customized_test_data, **eval_dataloader_params)

Loaded data shapes: (4544, 447), (4544,), (4544,)
Loaded data shapes: (505, 447), (505,), (505,)
Loaded data shapes: (2165, 447), (2165,), (2165,)
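
To show the pattern behind the dataset class and loaders above without depending on fairlib, here is a minimal stand-alone sketch built directly on `torch.utils.data.Dataset`. The class `ToyStructuredDataset` and the random data are hypothetical, and fairlib's `BaseDataset` may return additional fields per item, so treat this only as an illustration of how features, labels, and protected labels travel together through a `DataLoader`.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class ToyStructuredDataset(Dataset):
    """Hypothetical stand-in: holds features, labels, and protected labels."""

    def __init__(self, X, y, protected_label):
        self.X = np.asarray(X, dtype=np.float32)
        self.y = np.asarray(y, dtype=np.int64)
        self.protected_label = np.asarray(protected_label, dtype=np.int64)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, index):
        return self.X[index], self.y[index], self.protected_label[index]

# Random data with the same feature width as the COMPAS inputs (447 columns).
rng = np.random.default_rng(0)
dataset = ToyStructuredDataset(
    X=rng.normal(size=(10, 447)),
    y=rng.integers(0, 2, size=10),
    protected_label=rng.integers(0, 3, size=10),
)

loader = DataLoader(dataset, batch_size=4, shuffle=False)
features, labels, groups = next(iter(loader))
print(features.shape, labels.shape, groups.shape)
```

Each batch is a tuple of tensors whose first dimension is the batch size, so the feature tensor has one row per example and 447 columns.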


In [11]:
model = networks.classifier.MLP(state)

2022-07-21 07:10:35 [INFO ]  MLP( 
2022-07-21 07:10:35 [INFO ]    (output_layer): Linear(in_features=32, out_features=2, bias=True)
2022-07-21 07:10:35 [INFO ]    (AF): ReLU()
2022-07-21 07:10:35 [INFO ]    (hidden_layers): ModuleList(
2022-07-21 07:10:35 [INFO ]      (0): Linear(in_features=447, out_features=32, bias=True)
2022-07-21 07:10:35 [INFO ]      (1): ReLU()
2022-07-21 07:10:35 [INFO ]    )
2022-07-21 07:10:35 [INFO ]    (criterion): CrossEntropyLoss()
2022-07-21 07:10:35 [INFO ]  )
2022-07-21 07:10:35 [INFO ]  Total number of parameters: 14402 
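
The reported parameter count can be checked by hand from the layer sizes in the summary. A short sketch of the arithmetic, using only the sizes printed above:

```python
# Layer sizes taken from the model summary above.
input_size, hidden_size, num_classes = 447, 32, 2

# A Linear layer has (in_features * out_features) weights plus out_features biases.
hidden_params = input_size * hidden_size + hidden_size
output_params = hidden_size * num_classes + num_classes

total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)  # 14336 66 14402
```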



In [12]:
model.train_self(
    train_generator = customized_training_generator,
    dev_generator = customized_validation_generator,
    test_generator = customized_test_generator,
)

2022-07-21 07:10:35 [INFO ]  Evaluation at Epoch 0
2022-07-21 07:10:35 [INFO ]  Validation accuracy: 65.35	macro_fscore: 64.41	micro_fscore: 65.35	TPR_GAP: 37.09	FPR_GAP: 37.09	PPR_GAP: 38.61	
2022-07-21 07:10:35 [INFO ]  Test accuracy: 67.30	macro_fscore: 66.29	micro_fscore: 67.30	TPR_GAP: 28.96	FPR_GAP: 28.96	PPR_GAP: 33.07	
2022-07-21 07:10:35 [INFO ]  Evaluation at Epoch 1
2022-07-21 07:10:35 [INFO ]  Validation accuracy: 66.14	macro_fscore: 65.92	micro_fscore: 66.14	TPR_GAP: 30.20	FPR_GAP: 30.20	PPR_GAP: 32.54	
2022-07-21 07:10:35 [INFO ]  Test accuracy: 67.53	macro_fscore: 67.12	micro_fscore: 67.53	TPR_GAP: 31.45	FPR_GAP: 31.45	PPR_GAP: 36.71	
2022-07-21 07:10:36 [INFO ]  Evaluation at Epoch 2
2022-07-21 07:10:36 [INFO ]  Validation accuracy: 66.14	macro_fscore: 66.00	micro_fscore: 66.14	TPR_GAP: 31.74	FPR_GAP: 31.74	PPR_GAP: 34.83	
2022-07-21 07:10:36 [INFO ]  Test accuracy: 68.55	macro_fscore: 68.24	micro_fscore: 68.55	TPR_GAP: 32.02	FPR_GAP: 32.02	PPR_GAP: 38.27	
2022-07-21 07
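
Evaluation lines like the ones above are easier to compare across runs once they are parsed into numbers. The following is a small sketch using only the standard library. The helper `parse_metrics` is hypothetical and assumes the log format shown in this output (a `Validation` or `Test` prefix followed by `name: value` pairs).

```python
import re

# Example line copied from the training log above (fields are tab-separated).
line = (
    "2022-07-21 07:10:36 [INFO ]  Test accuracy: 68.55\tmacro_fscore: 68.24\t"
    "micro_fscore: 68.55\tTPR_GAP: 32.02\tFPR_GAP: 32.02\tPPR_GAP: 38.27\t"
)

def parse_metrics(log_line):
    """Return (split, {metric_name: value}) for an evaluation log line, or None."""
    header = re.search(r"\]\s+(Validation|Test)\s", log_line)
    if header is None:
        return None
    # Only look at the text after the split name, so the timestamp is ignored.
    pairs = re.findall(r"(\w+):\s*([0-9]+(?:\.[0-9]+)?)", log_line[header.end():])
    return header.group(1), {name: float(value) for name, value in pairs}

split, metrics = parse_metrics(line)
print(split, metrics["accuracy"], metrics["TPR_GAP"])  # Test 68.55 32.02
```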

In [13]:
debiasing_args = {
    "dataset":Shared_options["dataset"], 
    "data_dir":Shared_options["data_dir"],
    "device_id":Shared_options["device_id"],

    # Give a name to the exp, which will be used in the path
    "exp_id":"BT_Adv",

    "emb_size": 450-3,
    "lr": 0.001,
    "batch_size": 128,
    "hidden_size": 32,
    "n_hidden": 1,
    "activation_function": "ReLu",

    "num_classes": 2,
    "num_groups": 3, # Black, White, and Other

    # Perform adversarial training if True
    "adv_debiasing":True,

    # Specify the hyperparameters for Balanced Training
    "BT":"Resampling",
    "BTObj":"EO",
}

# Initialize the arguments
debias_options = BaseOptions()
debias_state = debias_options.get_state(args=debiasing_args, silence=True)

customized_train_data = CustomizedDataset(args=debias_state, split="train")
customized_dev_data = CustomizedDataset(args=debias_state, split="dev")
customized_test_data = CustomizedDataset(args=debias_state, split="test")

# DataLoader parameters (read from the debiasing state, not the vanilla one)
train_dataloader_params = {
    'batch_size': debias_state.batch_size,
    'shuffle': True,
    'num_workers': debias_state.num_workers}

eval_dataloader_params = {
    'batch_size': debias_state.test_batch_size,
    'shuffle': False,
    'num_workers': debias_state.num_workers}

# Initialize the dataloaders
customized_training_generator = torch.utils.data.DataLoader(customized_train_data, **train_dataloader_params)
customized_validation_generator = torch.utils.data.DataLoader(customized_dev_data, **eval_dataloader_params)
customized_test_generator = torch.utils.data.DataLoader(customized_test_data, **eval_dataloader_params)

debias_model = networks.classifier.MLP(debias_state)

2022-07-21 07:10:37 [INFO ]  Unexpected args: ['-f', '/root/.local/share/jupyter/runtime/kernel-1cc80252-6b1d-427f-a63d-8b70911dfc3b.json']
2022-07-21 07:10:37 [INFO ]  Logging to ./results/dev/COMPAS/BT_Adv/output.log
2022-07-21 07:10:37 [INFO ]  Base directory is ./results/dev/COMPAS/BT_Adv
2022-07-21 07:10:37 [INFO ]  Exception type : AssertionError 
2022-07-21 07:10:37 [INFO ]  Exception message : Not implemented
2022-07-21 07:10:37 [INFO ]  Stack trace : ['File : /usr/local/lib/python3.7/dist-packages/fairlib/src/base_options.py , Line : 486, Func.Name : set_state, Message : train_iterator, dev_iterator, test_iterator = dataloaders.get_dataloaders(state)', 'File : /usr/local/lib/python3.7/dist-packages/fairlib/src/dataloaders/__init__.py , Line : 40, Func.Name : get_dataloaders, Message : ], "Not implemented"']
2022-07-21 07:10:37 [INFO ]  dataloaders need to be initialized!
2022-07-21 07:10:37 [INFO ]  SubDiscriminator( 
2022-07-21 07:10:37 [INFO ]    (grad_rev): GradientReversal
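
The discriminator summary above mentions a `GradientReversal` module. As a rough illustration of the general technique, and not of fairlib's own code, a gradient reversal layer can be sketched with `torch.autograd.Function`: it acts as the identity in the forward pass and flips the sign of the gradient in the backward pass. The class name `GradientReversalFunction` and the `scale` argument are assumptions made for this sketch.

```python
import torch

class GradientReversalFunction(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in the backward pass."""

    @staticmethod
    def forward(ctx, inputs, scale):
        ctx.scale = scale
        return inputs.view_as(inputs)

    @staticmethod
    def backward(ctx, grad_output):
        # The second return value is the gradient for `scale`, which has none.
        return -ctx.scale * grad_output, None

hidden = torch.ones(2, 3, requires_grad=True)
reversed_hidden = GradientReversalFunction.apply(hidden, 1.0)
reversed_hidden.sum().backward()
print(hidden.grad)  # every entry is -1.0
```

With such a layer between the encoder and a discriminator, minimizing the discriminator loss pushes the encoder in the opposite direction, which discourages hidden representations from encoding the protected attribute.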

In [14]:
debias_model.train_self(
    train_generator = customized_training_generator,
    dev_generator = customized_validation_generator,
    test_generator = customized_test_generator,
)

2022-07-21 07:10:37 [INFO ]  Evaluation at Epoch 0
2022-07-21 07:10:38 [INFO ]  Validation accuracy: 65.54	macro_fscore: 65.27	micro_fscore: 65.54	TPR_GAP: 32.32	FPR_GAP: 32.32	PPR_GAP: 35.63	
2022-07-21 07:10:38 [INFO ]  Test accuracy: 67.44	macro_fscore: 67.00	micro_fscore: 67.44	TPR_GAP: 30.82	FPR_GAP: 30.82	PPR_GAP: 36.06	
2022-07-21 07:10:38 [INFO ]  Evaluation at Epoch 1
2022-07-21 07:10:38 [INFO ]  Validation accuracy: 64.36	macro_fscore: 64.04	micro_fscore: 64.36	TPR_GAP: 31.50	FPR_GAP: 31.50	PPR_GAP: 34.76	
2022-07-21 07:10:38 [INFO ]  Test accuracy: 67.67	macro_fscore: 67.06	micro_fscore: 67.67	TPR_GAP: 28.13	FPR_GAP: 28.13	PPR_GAP: 34.07	
2022-07-21 07:10:39 [INFO ]  Evaluation at Epoch 2
2022-07-21 07:10:39 [INFO ]  Validation accuracy: 61.58	macro_fscore: 61.53	micro_fscore: 61.58	TPR_GAP: 32.47	FPR_GAP: 32.47	PPR_GAP: 34.93	
2022-07-21 07:10:39 [INFO ]  Test accuracy: 66.65	macro_fscore: 66.41	micro_fscore: 66.65	TPR_GAP: 27.35	FPR_GAP: 27.36	PPR_GAP: 33.06	
2022-07-21 07
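
The debiased run above combines adversarial training with balanced training (`"BT": "Resampling"`, `"BTObj": "EO"`). One plausible reading of resampling under this objective is to downsample the training set so that every protected group is equally frequent within each class. The sketch below illustrates that idea only. The function `balanced_indices_within_class` is hypothetical, and fairlib's actual sampling procedure may differ.

```python
import numpy as np

def balanced_indices_within_class(y, groups, seed=0):
    """Downsample so each protected group is equally frequent within every class."""
    y, groups = np.asarray(y), np.asarray(groups)
    rng = np.random.default_rng(seed)
    selected = []
    for label in np.unique(y):
        # One index array per (class, group) cell, skipping empty cells.
        cells = [np.flatnonzero((y == label) & (groups == g)) for g in np.unique(groups)]
        cells = [cell for cell in cells if len(cell) > 0]
        target = min(len(cell) for cell in cells)
        for cell in cells:
            selected.extend(rng.choice(cell, size=target, replace=False))
    return np.sort(np.array(selected))

# Toy labels and groups (made-up values): group 0 dominates class 1, group 1 dominates class 0.
y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0])
groups = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1])

keep = balanced_indices_within_class(y, groups)
print(len(keep), y[keep], groups[keep])
```

The returned indices could then be used to build a subset of the training data before it is wrapped in a `DataLoader`.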