# Customized models and datasets for structured inputs

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HanXudong/fairlib/blob/main/tutorial/COMPAS.ipynb)

In this tutorial we will:
- Show how to add a model for classification over structured inputs
- Show how to add a dataloader with structured data preprocessing

We will use data on Northpointe's Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) score, a risk assessment tool used in states such as California and Florida.

## Installation

Again, the first step is to install our library.

In [1]:
!pip install fairlib

Collecting fairlib
  Downloading fairlib-0.0.1-py3-none-any.whl (61 kB)
Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting pickle5
  Downloading pickle5-0.0.12-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (256 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting PyYAML
  Downloading PyYAML-6.

In [2]:
import fairlib

## Download and preprocess the COMPAS dataset

Reference notebook for the preprocessing: https://github.com/google-research/google-research/blob/master/group_agnostic_fairness/data_utils/CreateCompasDatasetFiles.ipynb

In [3]:
import os

In [4]:
os.makedirs("data", exist_ok=True)

In [5]:
!wget --no-check-certificate --content-disposition "https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv" -O "data/compas-scores-two-years.csv"

--2022-04-08 15:09:58--  https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2546489 (2.4M) [text/plain]
Saving to: ‘data/compas-scores-two-years.csv’


2022-04-08 15:09:58 (70.2 MB/s) - ‘data/compas-scores-two-years.csv’ saved [2546489/2546489]



In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [7]:
pd.options.display.float_format = '{:,.2f}'.format
dataset_base_dir = "data/"
dataset_file_name = 'compas-scores-two-years.csv'

In [8]:
file_path = os.path.join(dataset_base_dir,dataset_file_name)
temp_df = pd.read_csv(file_path)

# Columns of interest
columns = ['juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count',
                'age', 
                'c_charge_degree', 
                'c_charge_desc',
                'age_cat',
                'sex', 'race',  'is_recid']
target_variable = 'is_recid'
target_value = 'Yes'

# Drop duplicates
temp_df = temp_df[['id']+columns].drop_duplicates()
df = temp_df[columns].copy()

# Convert columns of type ``object`` to ``category`` 
df = pd.concat([
        df.select_dtypes(include=[], exclude=['object']),
        df.select_dtypes(['object']).apply(pd.Series.astype, dtype='category')
        ], axis=1).reindex(df.columns, axis=1)

# Binarize target_variable
df['is_recid'] = df.apply(lambda x: 'Yes' if x['is_recid']==1.0 else 'No', axis=1).astype('category')

# Process protected-column values
race_dict = {'African-American':'Black','Caucasian':'White'}
df['race'] = df.apply(lambda x: race_dict[x['race']] if x['race'] in race_dict.keys() else 'Other', axis=1).astype('category')
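
The two row-wise `apply` calls above are simple lookups. As a minimal sketch of the same logic on plain Python values (the helper names and sample lists are made up for illustration), a dictionary lookup with a default collapses the race column into three groups:

```python
# Same mapping as in the cell above
race_dict = {'African-American': 'Black', 'Caucasian': 'White'}

def collapse_race(value):
    """Map a raw race value to 'Black', 'White', or 'Other'."""
    return race_dict.get(value, 'Other')

def binarize_recid(value):
    """Map the numeric is_recid flag to 'Yes' or 'No'."""
    return 'Yes' if value == 1.0 else 'No'

# Hypothetical sample values, for illustration only
raw_races = ['African-American', 'Caucasian', 'Hispanic', 'Asian']
raw_flags = [1, 0, 1, 0]

collapsed = [collapse_race(r) for r in raw_races]
labels = [binarize_recid(f) for f in raw_flags]
print(collapsed)  # → ['Black', 'White', 'Other', 'Other']
print(labels)     # → ['Yes', 'No', 'Yes', 'No']
```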

In [9]:
df

|  | juv_fel_count | juv_misd_count | juv_other_count | priors_count | age | c_charge_degree | c_charge_desc | age_cat | sex | race | is_recid |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 69 | F | Aggravated Assault w/Firearm | Greater than 45 | Male | Other | No |
| 1 | 0 | 0 | 0 | 0 | 34 | F | Felony Battery w/Prior Convict | 25 - 45 | Male | Black | Yes |
| 2 | 0 | 0 | 1 | 4 | 24 | F | Possession of Cocaine | Less than 25 | Male | Black | Yes |
| 3 | 0 | 1 | 0 | 1 | 23 | F | Possession of Cannabis | Less than 25 | Male | Black | No |
| 4 | 0 | 0 | 0 | 2 | 43 | F | arrest case no charge | 25 - 45 | Male | Other | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7209 | 0 | 0 | 0 | 0 | 23 | F | Deliver Cannabis | Less than 25 | Male | Black | No |
| 7210 | 0 | 0 | 0 | 0 | 23 | F | Leaving the Scene of Accident | Less than 25 | Male | Black | No |
| 7211 | 0 | 0 | 0 | 0 | 57 | F | Aggravated Battery / Pregnant | Greater than 45 | Male | Other | No |
| 7212 | 0 | 0 | 0 | 3 | 33 | M | Battery on Law Enforc Officer | 25 - 45 | Female | Black | No |


In [10]:
# Create splits
train_df, test_df = train_test_split(df, test_size=0.30, random_state=42)
train_df, dev_df = train_test_split(train_df, test_size=0.1, random_state=42)
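
The two calls give roughly a 63/7/30 train/dev/test split, because the dev set is 10% of the remaining 70%. The sketch below works through the arithmetic, assuming scikit-learn rounds a fractional test size up to the next whole row. The total of 7,214 rows is taken from the sum of the loaded data shapes reported later in this tutorial (4544 + 505 + 2165); `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), rounding the test share up to a whole row."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test

n_total = 7214  # 4544 + 505 + 2165, from the loaded data shapes

# First split: 30% held out for testing
n_train_full, n_test = split_sizes(n_total, 0.30)
# Second split: 10% of the remaining rows held out for development
n_train, n_dev = split_sizes(n_train_full, 0.1)

print(n_train, n_dev, n_test)  # → 4544 505 2165
```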

In [11]:
cat_cols = train_df.select_dtypes(include='category').columns
vocab_dict = {}
for col in cat_cols:
  vocab_dict[col] = list(set(train_df[col].cat.categories))
print(vocab_dict)

{'c_charge_degree': ['F', 'M'], 'c_charge_desc': ['Aiding Escape', 'Cash Item w/Intent to Defraud', 'Lewd or Lascivious Molestation', 'Attempted Burg/Convey/Unocc', 'Aggravated Assault W/o Firearm', 'Disorderly Conduct', 'Crlty Twrd Child Urge Oth Act', 'Agg Battery Grt/Bod/Harm', 'Poss Contr Subst W/o Prescript', 'Soliciting For Prostitution', 'Tresspass Struct/Conveyance', 'Unlaw Lic Use/Disply Of Others', 'Aggrav Battery w/Deadly Weapon', 'Opert With Susp DL 2ND Offense', 'Deliver Cannabis', 'Battery', 'Intoxicated/Safety Of Another', 'Fail Sex Offend Report Bylaw', 'Attempted Deliv Control Subst', 'Lewd Act Presence Child 16-', 'Violation Of Boater Safety Id', 'Del Morphine at/near Park', 'Possession of LSD', 'Possession Of Diazepam', 'Possession Of Anabolic Steroid', 'Aggravated Battery (Firearm)', 'Possession Of Carisoprodol', 'Possession Of Cocaine', 'Purchase/P/W/Int Cannabis', 'Attempt Armed Burglary Dwell', 'Cause Anoth Phone Ring Repeat', 'Sexual Performance by a Child', 'Pu

In [12]:
temp_dict = train_df.describe().to_dict()
mean_std_dict = {}
for key, value in temp_dict.items():
  mean_std_dict[key] = [value['mean'],value['std']]
print(mean_std_dict)

{'juv_fel_count': [0.0721830985915493, 0.5187066204966256], 'juv_misd_count': [0.09793133802816902, 0.5348148993571356], 'juv_other_count': [0.10629401408450705, 0.47329289176000755], 'priors_count': [3.504181338028169, 4.9829540064651585], 'age': [34.875220070422536, 11.929411671055068]}
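
These statistics are used to standardize the numerical columns as z = (x - mean) / std. `describe()` reports the sample standard deviation (ddof=1), which is also what `statistics.stdev` computes. Below is a minimal sketch with made-up ages, followed by the z-score of a 34-year-old under the `age` statistics printed above; `standardize` is a hypothetical helper.

```python
import statistics

def standardize(values):
    """Z-score a list of numbers using the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # ddof=1, matching DataFrame.describe()
    return [(v - mean) / std for v in values]

# Hypothetical ages, for illustration only
ages = [20, 30, 40, 50]
z = standardize(ages)
print([round(v, 3) for v in z])

# Using the training-set statistics for 'age' printed above
age_mean, age_std = 34.875220070422536, 11.929411671055068
z_34 = (34 - age_mean) / age_std
print(round(z_34, 3))  # → -0.073
```

After standardization the training column has mean 0 and standard deviation 1; the dev and test columns are scaled with the same training statistics, so their moments will only be approximately 0 and 1.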


In [13]:
def preprocessing(tmp_df):
    features = {}
    # Normalize numerical columns
    for col_name in mean_std_dict.keys():
        _mean, _std = mean_std_dict[col_name]
        features[col_name] = ((tmp_df[col_name]-_mean)/_std)
    # Encode categorical columns as indices
    for col_name in vocab_dict.keys():
        features[col_name] = tmp_df[col_name].map(
            {
                j:i for i,j in enumerate(vocab_dict[col_name])
            }
        )
    # One-hot encoding categorical features
    for col_name in ["c_charge_degree", "c_charge_desc", "age_cat"]:
        features[col_name] = pd.get_dummies(features[col_name], prefix=col_name)
    return pd.concat(features.values(), axis=1)
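
To make the two encoding steps concrete, here is a dependency-free sketch of what index mapping followed by one-hot encoding produces for a single column. The three-value vocabulary mirrors `age_cat`; the helper names are hypothetical, and this is not how pandas implements `get_dummies`.

```python
def index_encode(values, vocab):
    """Replace each category by its position in the vocabulary."""
    lookup = {cat: idx for idx, cat in enumerate(vocab)}
    return [lookup[v] for v in values]

def one_hot(indices, n_categories):
    """Turn each index into a 0/1 vector with a single 1."""
    return [[1 if i == idx else 0 for i in range(n_categories)]
            for idx in indices]

vocab = ['Less than 25', '25 - 45', 'Greater than 45']
values = ['25 - 45', 'Less than 25', 'Greater than 45']

indices = index_encode(values, vocab)
rows = one_hot(indices, len(vocab))
print(indices)  # → [1, 0, 2]
print(rows)     # → [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Each category becomes its own input column, so a column with many distinct values, such as `c_charge_desc`, accounts for most of the width of the final feature matrix.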

In [14]:
train_df = preprocessing(train_df)
dev_df =  preprocessing(dev_df)
test_df = preprocessing(test_df)

In [15]:
train_df.to_pickle(os.path.join(dataset_base_dir, "train.pkl"))
dev_df.to_pickle(os.path.join(dataset_base_dir, "dev.pkl"))
test_df.to_pickle(os.path.join(dataset_base_dir, "test.pkl"))

In [16]:
from fairlib import networks, BaseOptions, dataloaders
import torch

In [17]:
Shared_options = {
    # The name of the dataset; the corresponding dataloader will be used
    "dataset":  "COMPAS",

    # Specify the path to the input data
    "data_dir": "./data",

    # Device for computing; -1 is the CPU
    "device_id": -1,

    # The default path for saving experimental results
    "results_dir":  r"results",

    # Project sub-directory under results_dir
    "project_dir":  r"dev",

    # We focus on the TPR GAP, which corresponds to Equalized Odds for binary classification
    "GAP_metric_name":  "TPR_GAP",

    # The overall performance will be measured as accuracy
    "Performance_metric_name":  "accuracy",

    # Model selection is based on DTO
    "selection_criterion":  "DTO",

    # Default dirs for saving checkpoints
    "checkpoint_dir":   "models",
    "checkpoint_name":  "checkpoint_epoch",


    "n_jobs":   1,
}
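
The `selection_criterion` option combines the performance and fairness metrics into a single score, DTO (distance to optimum). The rough sketch below assumes that DTO is the Euclidean distance between the (performance, fairness) point and the ideal point (1, 1), with fairness taken as 1 - GAP, and that smaller is better. fairlib's own implementation may scale or normalize the scores differently, and the `dto` helper here is hypothetical. The inputs are the epoch 0 and epoch 1 validation numbers from the vanilla training log later in this tutorial.

```python
import math

def dto(performance, gap, ideal=(1.0, 1.0)):
    """Euclidean distance from (performance, 1 - gap) to the ideal point."""
    fairness = 1.0 - gap
    return math.hypot(ideal[0] - performance, ideal[1] - fairness)

# (validation accuracy, validation TPR_GAP) as fractions, from the log
epochs = {0: (0.6535, 0.2934), 1: (0.6653, 0.2861)}

scores = {epoch: dto(acc, gap) for epoch, (acc, gap) in epochs.items()}
best = min(scores, key=scores.get)  # smaller distance is better
for epoch, score in scores.items():
    print(epoch, round(score, 3))
print("selected epoch:", best)
```

Under this assumption epoch 1 would be preferred, since it is both slightly more accurate and slightly fairer than epoch 0.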

In [18]:
args = {
    "dataset":Shared_options["dataset"], 
    "data_dir":Shared_options["data_dir"],
    "device_id":Shared_options["device_id"],

    # Give a name to the experiment, which will be used in the path
    "exp_id":"vanilla",

    # Input size: 450 preprocessed columns minus sex, race, and is_recid
    "emb_size": 450-3,
    "lr": 0.001,
    "batch_size": 128,
    "hidden_size": 32,
    "n_hidden": 1,
    "activation_function": "ReLu",

    "num_classes": 2,
    "num_groups": 3, # Black, White, and Other
}

# Initialize the arguments
options = BaseOptions()
state = options.get_state(args=args, silence=True)

INFO:root:Unexpected args: ['-f', '/root/.local/share/jupyter/runtime/kernel-2b94b31d-201f-46ea-842f-e5f6c625b168.json']
INFO:root:Logging to ./results/dev/COMPAS/vanilla/output.log


2022-04-08 15:10:45 [INFO ]  Base directory is ./results/dev/COMPAS/vanilla
Not implemented
2022-04-08 15:10:45 [INFO ]  dataloaders need to be initialized!


In [19]:
class CustomizedDataset(dataloaders.utils.BaseDataset):

    def load_data(self):

        self.data_dir = os.path.join(self.args.data_dir, "{}.pkl".format(self.split))

        data = pd.read_pickle(self.data_dir)

        self.X = data.drop(['sex', 'race', 'is_recid'], axis=1).to_numpy().astype(np.float32)
        self.y = list(data["is_recid"])
        self.protected_label = list(data["race"])
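
In this example, `load_data` fills three attributes: the features `self.X`, the target labels `self.y`, and the protected attribute `self.protected_label`. As a simplified, framework-free sketch of the same idea (the class `ToyDataset` and its rows are hypothetical, and fairlib's `BaseDataset` does more than this), a map-style dataset only needs `__len__` and `__getitem__`, which is the protocol PyTorch's `DataLoader` relies on:

```python
class ToyDataset:
    """Hypothetical stand-in that separates features, label, and group."""

    def __init__(self, rows, label_key, protected_key):
        self.X, self.y, self.protected_label = [], [], []
        for row in rows:
            self.y.append(row[label_key])
            self.protected_label.append(row[protected_key])
            # Everything that is not a label or protected attribute is a feature
            self.X.append([value for key, value in row.items()
                           if key not in (label_key, protected_key)])

    def __len__(self):
        return len(self.y)

    def __getitem__(self, index):
        return self.X[index], self.y[index], self.protected_label[index]

# Made-up, already-preprocessed rows
rows = [
    {"age": 0.5, "priors_count": -0.7, "is_recid": 1, "race": 0},
    {"age": -1.2, "priors_count": 0.3, "is_recid": 0, "race": 2},
]
toy = ToyDataset(rows, label_key="is_recid", protected_key="race")
print(len(toy))  # → 2
print(toy[1])    # → ([-1.2, 0.3], 0, 2)
```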

In [20]:
customized_train_data = CustomizedDataset(args=state, split="train")
customized_dev_data = CustomizedDataset(args=state, split="dev")
customized_test_data = CustomizedDataset(args=state, split="test")

# DataLoader parameters
train_dataloader_params = {
        'batch_size': state.batch_size,
        'shuffle': True,
        'num_workers': state.num_workers}

eval_dataloader_params = {
        'batch_size': state.test_batch_size,
        'shuffle': False,
        'num_workers': state.num_workers}

# Initialize the dataloaders
customized_training_generator = torch.utils.data.DataLoader(customized_train_data, **train_dataloader_params)
customized_validation_generator = torch.utils.data.DataLoader(customized_dev_data, **eval_dataloader_params)
customized_test_generator = torch.utils.data.DataLoader(customized_test_data, **eval_dataloader_params)

Loaded data shapes: (4544, 447), (4544,), (4544,)
Loaded data shapes: (505, 447), (505,), (505,)
Loaded data shapes: (2165, 447), (2165,), (2165,)


In [21]:
model = networks.classifier.MLP(state)

2022-04-08 15:10:53 [INFO ]  MLP( 
2022-04-08 15:10:53 [INFO ]    (output_layer): Linear(in_features=32, out_features=2, bias=True)
2022-04-08 15:10:53 [INFO ]    (AF): ReLU()
2022-04-08 15:10:53 [INFO ]    (hidden_layers): ModuleList(
2022-04-08 15:10:53 [INFO ]      (0): Linear(in_features=447, out_features=32, bias=True)
2022-04-08 15:10:53 [INFO ]      (1): ReLU()
2022-04-08 15:10:53 [INFO ]    )
2022-04-08 15:10:53 [INFO ]    (criterion): CrossEntropyLoss()
2022-04-08 15:10:53 [INFO ]  )
2022-04-08 15:10:53 [INFO ]  Total number of parameters: 14402 
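
The reported parameter count can be checked by hand: each `Linear` layer has in_features × out_features weights plus out_features biases. The helper name below is made up for this check.

```python
def linear_params(in_features, out_features):
    """Number of weights plus biases in a fully connected layer."""
    return in_features * out_features + out_features

hidden = linear_params(447, 32)  # hidden layer: 447 inputs -> 32 units
output = linear_params(32, 2)    # output layer: 32 units -> 2 classes
total = hidden + output

print(hidden, output, total)  # → 14336 66 14402
```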



In [22]:
model.train_self(
    train_generator = customized_training_generator,
    dev_generator = customized_validation_generator,
    test_generator = customized_test_generator,
)

2022-04-08 15:10:56 [INFO ]  Evaluation at Epoch 0
2022-04-08 15:10:56 [INFO ]  Validation accuracy: 65.35	macro_fscore: 64.74	micro_fscore: 65.35	TPR_GAP: 29.34	FPR_GAP: 29.34	PPR_GAP: 31.29	
2022-04-08 15:10:56 [INFO ]  Test accuracy: 67.16	macro_fscore: 66.42	micro_fscore: 67.16	TPR_GAP: 30.35	FPR_GAP: 30.35	PPR_GAP: 35.61	
2022-04-08 15:10:56 [INFO ]  Evaluation at Epoch 1
2022-04-08 15:10:56 [INFO ]  Validation accuracy: 66.53	macro_fscore: 66.34	micro_fscore: 66.53	TPR_GAP: 28.61	FPR_GAP: 28.61	PPR_GAP: 31.18	
2022-04-08 15:10:56 [INFO ]  Test accuracy: 67.39	macro_fscore: 67.07	micro_fscore: 67.39	TPR_GAP: 31.21	FPR_GAP: 31.21	PPR_GAP: 36.49	
2022-04-08 15:10:57 [INFO ]  Evaluation at Epoch 2
2022-04-08 15:10:57 [INFO ]  Validation accuracy: 66.53	macro_fscore: 66.42	micro_fscore: 66.53	TPR_GAP: 30.15	FPR_GAP: 30.15	PPR_GAP: 32.84	
2022-04-08 15:10:57 [INFO ]  Test accuracy: 68.73	macro_fscore: 68.44	micro_fscore: 68.73	TPR_GAP: 32.73	FPR_GAP: 32.73	PPR_GAP: 38.85	
2022-04-08 15
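
`TPR_GAP` in the log measures how much the true positive rate differs across protected groups. fairlib aggregates per-group and per-class gaps in its own way. The sketch below is a deliberately simplified variant (largest minus smallest per-group TPR for one class) on made-up labels, intended only to show the kind of quantity being reported; the function names are hypothetical.

```python
from collections import defaultdict

def tpr_by_group(y_true, y_pred, groups, positive=1):
    """True positive rate of the `positive` class within each group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == positive:
            totals[group] += 1
            hits[group] += int(pred == positive)
    return {group: hits[group] / totals[group] for group in totals}

def tpr_gap(y_true, y_pred, groups, positive=1):
    """Simplified gap: largest minus smallest per-group TPR."""
    rates = tpr_by_group(y_true, y_pred, groups, positive)
    return max(rates.values()) - min(rates.values())

# Made-up labels, predictions, and group memberships
y_true = [1, 1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = tpr_by_group(y_true, y_pred, groups)
gap = tpr_gap(y_true, y_pred, groups)
print(rates)  # → {'a': 0.75, 'b': 0.0}
print(gap)    # → 0.75
```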

In [23]:
debiasing_args = {
    "dataset":Shared_options["dataset"], 
    "data_dir":Shared_options["data_dir"],
    "device_id":Shared_options["device_id"],

    # Give a name to the experiment, which will be used in the path
    "exp_id":"BT_Adv",

    # Input size: 450 preprocessed columns minus sex, race, and is_recid
    "emb_size": 450-3,
    "lr": 0.001,
    "batch_size": 128,
    "hidden_size": 32,
    "n_hidden": 1,
    "activation_function": "ReLu",

    "num_classes": 2,
    "num_groups": 3, # Black, White, and Other

    # Perform adversarial training if True
    "adv_debiasing":True,

    # Specify the hyperparameters for Balanced Training
    "BT":"Resampling",
    "BTObj":"EO",
}

# Initialize the arguments
debias_options = BaseOptions()
debias_state = debias_options.get_state(args=debiasing_args, silence=True)

customized_train_data = CustomizedDataset(args=debias_state, split="train")
customized_dev_data = CustomizedDataset(args=debias_state, split="dev")
customized_test_data = CustomizedDataset(args=debias_state, split="test")

# DataLoader parameters (taken from the debiasing state rather than the vanilla one)
train_dataloader_params = {
        'batch_size': debias_state.batch_size,
        'shuffle': True,
        'num_workers': debias_state.num_workers}

eval_dataloader_params = {
        'batch_size': debias_state.test_batch_size,
        'shuffle': False,
        'num_workers': debias_state.num_workers}

# Initialize the dataloaders
customized_training_generator = torch.utils.data.DataLoader(customized_train_data, **train_dataloader_params)
customized_validation_generator = torch.utils.data.DataLoader(customized_dev_data, **eval_dataloader_params)
customized_test_generator = torch.utils.data.DataLoader(customized_test_data, **eval_dataloader_params)

debias_model = networks.classifier.MLP(debias_state)

2022-04-08 15:11:04 [INFO ]  Unexpected args: ['-f', '/root/.local/share/jupyter/runtime/kernel-2b94b31d-201f-46ea-842f-e5f6c625b168.json']
2022-04-08 15:11:04 [INFO ]  Logging to ./results/dev/COMPAS/BT_Adv/output.log
2022-04-08 15:11:04 [INFO ]  Base directory is ./results/dev/COMPAS/BT_Adv
Not implemented
2022-04-08 15:11:04 [INFO ]  dataloaders need to be initialized!
2022-04-08 15:11:04 [INFO ]  SubDiscriminator( 
2022-04-08 15:11:04 [INFO ]    (grad_rev): GradientReversal()
2022-04-08 15:11:04 [INFO ]    (output_layer): Linear(in_features=300, out_features=3, bias=True)
2022-04-08 15:11:04 [INFO ]    (AF): ReLU()
2022-04-08 15:11:04 [INFO ]    (hidden_layers): ModuleList(
2022-04-08 15:11:04 [INFO ]      (0): Linear(in_features=32, out_features=300, bias=True)
2022-04-08 15:11:04 [INFO ]      (1): ReLU()
2022-04-08 15:11:04 [INFO ]      (2): Linear(in_features=300, out_features=300, bias=True)
2022-04-08 15:11:04 [INFO ]      (3): ReLU()
2022-04-08 15:11:04 [INFO ]    )
2022-04-0
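
The `"BT": "Resampling"` and `"BTObj": "EO"` options enable balanced training. As an illustration only, assume the EO objective means equalizing protected-group counts within each target class. One simple way to do that is to downsample every (class, group) cell to the size of the smallest cell in that class. The function below is hypothetical and is not fairlib's implementation.

```python
import random
from collections import Counter, defaultdict

def balanced_indices(y, groups, seed=0):
    """Downsample so that, within each class, all groups are equally frequent."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for idx, (label, group) in enumerate(zip(y, groups)):
        cells[(label, group)].append(idx)

    # Smallest group size within each class
    min_size = {}
    for (label, _), members in cells.items():
        min_size[label] = min(min_size.get(label, len(members)), len(members))

    selected = []
    for (label, _), members in cells.items():
        selected.extend(rng.sample(members, min_size[label]))
    return sorted(selected)

# Made-up target labels and protected groups
y      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
groups = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

kept = balanced_indices(y, groups)
counts = Counter((y[i], groups[i]) for i in kept)
print(dict(counts))
```

In this toy input, class 1 keeps one instance per group and class 0 keeps two per group, so group membership is no longer predictive of the label frequency within a class.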

In [24]:
debias_model.train_self(
    train_generator = customized_training_generator,
    dev_generator = customized_validation_generator,
    test_generator = customized_test_generator,
)

2022-04-08 15:11:08 [INFO ]  Evaluation at Epoch 0
2022-04-08 15:11:08 [INFO ]  Validation accuracy: 55.16	macro_fscore: 43.87	micro_fscore: 55.16	TPR_GAP: 9.77	FPR_GAP: 9.77	PPR_GAP: 8.73	
2022-04-08 15:11:08 [INFO ]  Test accuracy: 66.24	macro_fscore: 47.19	micro_fscore: 66.24	TPR_GAP: 7.53	FPR_GAP: 7.53	PPR_GAP: 5.11	
2022-04-08 15:11:08 [INFO ]  Evaluation at Epoch 1
2022-04-08 15:11:08 [INFO ]  Validation accuracy: 55.95	macro_fscore: 45.35	micro_fscore: 55.95	TPR_GAP: 7.64	FPR_GAP: 7.64	PPR_GAP: 7.14	
2022-04-08 15:11:08 [INFO ]  Test accuracy: 66.67	macro_fscore: 48.27	micro_fscore: 66.67	TPR_GAP: 10.50	FPR_GAP: 10.50	PPR_GAP: 6.18	
2022-04-08 15:11:08 [INFO ]  Evaluation at Epoch 2
2022-04-08 15:11:08 [INFO ]  Validation accuracy: 58.73	macro_fscore: 50.64	micro_fscore: 58.73	TPR_GAP: 19.20	FPR_GAP: 19.20	PPR_GAP: 16.67	
2022-04-08 15:11:08 [INFO ]  Test accuracy: 67.52	macro_fscore: 51.29	micro_fscore: 67.52	TPR_GAP: 13.91	FPR_GAP: 13.91	PPR_GAP: 8.09	
2022-04-08 15:11:09 [INF