# Learning Fair Representation: Demographic Parity vs Exempt and Non-Exempt Disparity Notions

This project is built on top of the implementation of the Learning Adversarially Fair and Transferable Representations paper ([project link](https://github.com/VectorInstitute/laftr/tree/master)), with the following modifications:

1. We modified the main files so that the type of experiment (Demographic Parity or Non-exempt Disparity) can be specified directly from the terminal.
2. We modified the trainer object to provide a plot of the training losses at the end of training in the experiment file.
3. We modified the dataset loader to take into account the feature xc for the Non-exempt Disparity experiment.
4. We created a new training function for the trainer object that takes the feature xc into account during training.
5. We modified the tester to evaluate the Non-exempt Disparity measure.
6. We created a new model class inspired by the model class "WeightedEqoddsWassGan" and called it "WeightedEqoddsWassGanNEW"; it takes the feature xc as an input to the adversary. (This also required creating new parent model classes for WeightedEqoddsWassGanNEW.)
7. We added new .json configuration files that contain the specifications of the ACSIncome dataset and of the training.

For more details, please see the [project repository](https://github.com/VectorInstitute/laftr/tree/master).

In [None]:
# Clone the repository
token = 'git@github.com:SokratALDARMINI/Learning-Fair-Representation-Demographic-Parity-vs-Fairness-with-Exempt-Disparity.git'
!git clone https://{token}@github.com/SokratALDARMINI/Representation_Learning.git

# Change directory to the repository
%cd Representation_Learning

Cloning into 'Representation_Learning'...
remote: Enumerating objects: 89, done.
remote: Counting objects: 100% (89/89), done.
remote: Compressing objects: 100% (71/71), done.
remote: Total 89 (delta 9), reused 86 (delta 9), pack-reused 0
Receiving objects: 100% (89/89), 1.18 MiB | 972.00 KiB/s, done.
Resolving deltas: 100% (9/9), done.
/content/Representation_Learning


# Dataset preparation
## Adult Dataset (this section can be skipped)
To run the code, the datasets have to be placed in the correct directories. **The Adult dataset is already included in the project files, so the next three cells can be skipped.** The Adult dataset is located in /content/Representation_Learning/data/adult/.

In [None]:
# Change directory to the dataset directory
%cd /content/Representation_Learning/data/adult/

/content/Representation_Learning/data/adult


Dataset cloning

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'
]

# Read the dataset without headers
adult_data = pd.read_csv(url, names=column_names, header=None)

# Split the dataset into training and test sets
train_data, test_data = train_test_split(adult_data, test_size=0.2, random_state=42)
print(adult_data.shape)

# Save the datasets without headers
train_data.to_csv('adult.data', index=False, header=False)
test_data.to_csv('adult.test', index=False, header=False)

# # Print the first few rows of each dataset to verify
# print("Training Data:")
# print(train_data.head())
# print("\nTest Data:")
# print(test_data.head())

(32561, 15)


Dataset preprocessing: the project expects files of type .npz
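
The preprocessing relies on two encodings: numeric values such as age are bucketed by threshold, and categorical values are one-hot encoded. The following is a minimal standalone sketch of both, mirroring the `bucket` and `onehot` helpers of the preprocessing cell; the sample inputs are illustrative only.

```python
def bucket(x, thresholds):
    """One-hot vector of length len(thresholds) + 1 marking the first threshold >= x."""
    x = float(x)
    label = len(thresholds)  # default: x lies above every threshold
    for i, t in enumerate(thresholds):
        if x <= t:
            label = i
            break
    vec = [0.0] * (len(thresholds) + 1)
    vec[label] = 1.0
    return vec


def onehot(x, choices):
    """One-hot vector of length len(choices) marking the position of x."""
    if x not in choices:
        raise ValueError('could not find "{}" in {}'.format(x, choices))
    vec = [0.0] * len(choices)
    vec[choices.index(x)] = 1.0
    return vec


age_thresholds = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
print(bucket(27, age_thresholds).index(1.0))   # 2: 25 < 27 <= 30
print(bucket(70, age_thresholds).index(1.0))   # 10: the extra bucket above 65
print(onehot('Male', ['Female', 'Male']))      # [0.0, 1.0]
```

Note that ten thresholds produce eleven columns, because values above the last threshold get a bucket of their own.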


In [None]:
import numpy as np
import sys

# Define a small epsilon to prevent division by zero
EPS = 1e-8

# Function to bucketize a value based on given bucket thresholds
def bucket(x, buckets):
    x = float(x)
    n = len(buckets)
    label = n
    for i in range(len(buckets)):
        if x <= buckets[i]:
            label = i
            break
    template = [0. for j in range(n + 1)]
    template[label] = 1.
    return template

# Function to one-hot encode a value based on given choices
def onehot(x, choices):
    if not x in choices:
        print('could not find "{}" in choices'.format(x))
        print(choices)
        raise Exception()
    label = choices.index(x)
    template = [0. for j in range(len(choices))]
    template[label] = 1.
    return template

# Function to return a value as a continuous float
def continuous(x):
    return [float(x)]

# Function to parse a row of data and return the processed features, label, and sensitive attribute
def parse_row(row, headers, headers_use):
    new_row_dict = {}
    for i in range(len(row)):
        x = row[i]
        hdr = headers[i]
        new_row_dict[hdr] = fns[hdr](x)

    sens_att = new_row_dict[sensitive]
    label = new_row_dict[target]
    new_row = []

    for h in headers_use:
        new_row = new_row + new_row_dict[h]
    return new_row, label, sens_att

# Function to standardize (whiten) the data by subtracting the mean and dividing by the standard deviation
def whiten(X, mn, std):
    mntile = np.tile(mn, (X.shape[0], 1))
    stdtile = np.maximum(np.tile(std, (X.shape[0], 1)), EPS)
    X = X - mntile
    X = np.divide(X, stdtile)
    return X

# Main function to process the dataset
if __name__ == '__main__':
    f_in_tr = 'adult.data'
    f_in_te = 'adult.test'

    f_out_np = 'adult.npz'
    hd_file = 'adult.headers'
    f_out_csv = 'adult.csv'

    header_list = open(hd_file, 'w')

    REMOVE_MISSING = True
    MISSING_TOKEN = '?'

    # Define headers and columns to use
    headers = 'age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income'.split(',')
    headers_use = 'age,workclass,education,education-num,marital-status,occupation,relationship,race,capital-gain,capital-loss,hours-per-week,native-country'.split(',')
    target = 'income'
    sensitive = 'sex'

    # Define processing options for each feature
    options = {
        'age': 'buckets',
        'workclass': 'Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked',
        'fnlwgt': 'continuous',
        'education': 'Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool',
        'education-num': 'continuous',
        'marital-status': 'Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse',
        'occupation': 'Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces',
        'relationship': 'Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried',
        'race': 'White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black',
        'sex': 'Female, Male',
        'capital-gain': 'continuous',
        'capital-loss': 'continuous',
        'hours-per-week': 'continuous',
        'native-country': 'United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands',
        'income': ' <=50K,>50K'
    }

    # Define bucket thresholds for age
    buckets = {'age': [18, 25, 30, 35, 40 ,45, 50, 55, 60, 65]}

    # Process options into sorted lists
    options = {k: [s.strip() for s in sorted(options[k].split(','))] for k in options}

    # Define processing functions for each feature
    fns = {
        'age': lambda x: bucket(x, buckets['age']),
        'workclass': lambda x: onehot(x, options['workclass']),
        'fnlwgt': lambda x: continuous(x),
        'education': lambda x: onehot(x, options['education']),
        'education-num': lambda x: continuous(x),
        'marital-status': lambda x: onehot(x, options['marital-status']),
        'occupation': lambda x: onehot(x, options['occupation']),
        'relationship': lambda x: onehot(x, options['relationship']),
        'race': lambda x: onehot(x, options['race']),
        'sex': lambda x: onehot(x, options['sex']),
        'capital-gain': lambda x: continuous(x),
        'capital-loss': lambda x: continuous(x),
        'hours-per-week': lambda x: continuous(x),
        'native-country': lambda x: onehot(x, options['native-country']),
        'income': lambda x: onehot(x.strip('.'), options['income']),
    }

    D = {}
    for f, phase in [(f_in_tr, 'training'), (f_in_te, 'test')]:
        dat = [s.strip().split(',') for s in open(f, 'r').readlines()]

        X = []
        Y = []
        A = []
        print(phase)

        for r in dat:
            row = [s.strip() for s in r]
            if MISSING_TOKEN in row and REMOVE_MISSING:
                continue
            if row in ([''], ['|1x3 Cross validator']):
                continue
            newrow, label, sens_att = parse_row(row, headers, headers_use)
            X.append(newrow)
            Y.append(label)
            A.append(sens_att)

        npX = np.array(X)
        npY = np.array(Y)
        npA = np.array(A)
        npA = np.expand_dims(npA[:, 1], 1)

        D[phase] = {}
        D[phase]['X'] = npX
        D[phase]['Y'] = npY
        D[phase]['A'] = npA

        print(npX.shape)
        print(npY.shape)
        print(npA.shape)

    # Standardize the data
    mn = np.mean(D['training']['X'], axis=0)
    std = np.std(D['training']['X'], axis=0)

    D['training']['X'] = whiten(D['training']['X'], mn, std)
    D['test']['X'] = whiten(D['test']['X'], mn, std)

    # Write headers to file, one line per output column.
    # After the conversion above, every entry of `options` is a list, so compare against lists.
    with open(hd_file, 'w') as f:
        i = 0
        for h in headers_use:
            if options[h] == ['continuous']:
                colnames = [h]
            elif options[h] == ['buckets']:
                # bucket() emits len(buckets[h]) + 1 columns; the last one catches larger values
                colnames = ['{}_{}'.format(h, b) for b in buckets[h]] + ['{}_over'.format(h)]
            else:
                colnames = ['{}_{}'.format(h, opt) for opt in options[h]]
            for colname in colnames:
                f.write('{:d},{}\n'.format(i, colname))
                i += 1

    # Split data into training and validation sets
    n = D['training']['X'].shape[0]
    shuf = np.random.permutation(n)
    valid_pct = 0.2
    valid_ct = int(n * valid_pct)
    valid_inds = shuf[:valid_ct]
    train_inds = shuf[valid_ct:]

    # Save processed data to .npz file
    np.savez(f_out_np, x_train=D['training']['X'], x_test=D['test']['X'],
             y_train=D['training']['Y'], y_test=D['test']['Y'],
             attr_train=D['training']['A'], attr_test=D['test']['A'],
             train_inds=train_inds, valid_inds=valid_inds)


training
(24157, 112)
(24157, 2)
(24157, 1)
test
(6005, 112)
(6005, 2)
(6005, 1)
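
To check the archive before training, one can list the stored arrays and their shapes. The sketch below is self-contained: it writes a tiny synthetic archive with the same keys as the `np.savez` call above (the values and the `toy.npz` path are made up) and then inspects it. `summarize_npz` is a hypothetical helper, not part of the project; in the notebook it would be called on `adult.npz`.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {array name: shape} for every array stored in the archive."""
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}


# Tiny synthetic archive with the same keys as adult.npz (values are made up).
tmp_path = os.path.join(tempfile.mkdtemp(), 'toy.npz')
rng = np.random.RandomState(0)
perm = rng.permutation(10)
np.savez(tmp_path,
         x_train=rng.randn(10, 4), x_test=rng.randn(3, 4),
         y_train=np.eye(2)[rng.randint(0, 2, 10)],
         y_test=np.eye(2)[rng.randint(0, 2, 3)],
         attr_train=rng.randint(0, 2, (10, 1)),
         attr_test=rng.randint(0, 2, (3, 1)),
         train_inds=perm[2:], valid_inds=perm[:2])

shapes = summarize_npz(tmp_path)
for name in sorted(shapes):
    print(name, shapes[name])

# train_inds / valid_inds are row indices into x_train, not separate data arrays.
with np.load(tmp_path) as data:
    x_valid = data['x_train'][data['valid_inds']]
print(x_valid.shape)  # (2, 4)
```

The validation split is stored only as indices, so the validation rows are obtained by indexing `x_train`, `y_train`, and `attr_train` with `valid_inds`.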


## ACSIncome Dataset (this section cannot be skipped)
**This section cannot be skipped because the ACSIncome dataset is not included in the repository due to its large size.**

In [None]:
# Create directory for the ACSIncome dataset.
%cd /content/Representation_Learning/data/
!mkdir ACSIncome
%cd ACSIncome

/content/Representation_Learning/data
/content/Representation_Learning/data/ACSIncome


In [None]:
# Install the folktables package. See https://github.com/socialfoundations/folktables/tree/main for more details about using the folktables datasets
!pip install folktables

Collecting folktables
  Downloading folktables-0.0.12-py3-none-any.whl.metadata (533 bytes)
Downloading folktables-0.0.12-py3-none-any.whl (17 kB)
Installing collected packages: folktables
Successfully installed folktables-0.0.12


Get the ACS data for the state of Michigan with ACSDataSource

In [None]:
from folktables import ACSDataSource, ACSIncome, ACSPublicCoverage, ACSMobility, ACSEmployment, ACSTravelTime
from folktables import generate_categories
import math
data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states = ['MI'], download=True)

Downloading data for 2018 1-Year person survey for MI...


Define a new filtering task to create the ACSIncome dataset.

In [None]:
from folktables import BasicProblem
from folktables import acs
import numpy as np
import pandas as pd
def target_fun(v):
    v [ACSIncome.target]= v[ACSIncome.target]>50000
    v [ACSEmployment.target]= v[ACSEmployment.target]==1
    return v

New_Task = BasicProblem(
    features= ACSIncome.features +[ACSIncome.target],
    target= ACSIncome.target,
    target_transform=lambda x: x > 50000,
    group='SEX',
    preprocess=acs.adult_filter,
)

df = New_Task.df_to_pandas(acs_data)
# print('Number of missing values for each attribute')
# print(df[0].isna().sum())
# print('DataFrame shape')
# print(df[0].shape)
data_frame = df[0]
data_frame[ACSIncome.target] = data_frame[ACSIncome.target].apply(lambda x: x>50000)

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(data_frame, test_size=0.2, random_state=42)
train_df.to_csv('new_dataset.data', index=False, header=False)
test_df.to_csv('new_dataset.test', index=False, header=False)

Get the category definitions

In [None]:
from folktables import generate_categories
import math
definition_df = data_source.get_definitions(download=True)
categories = generate_categories(features=New_Task.features, definition_df=definition_df)
print(categories['SCHL'].keys())
print(data_frame.columns.values.tolist())

dict_keys([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, nan])
['AGEP', 'COW', 'SCHL', 'MAR', 'OCCP', 'POBP', 'RELP', 'WKHP', 'SEX', 'RAC1P', 'PINCP']
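
The key list printed above ends with a `nan` entry, which the preprocessing cell drops before using the keys as one-hot choices. A minimal sketch of that filtering step follows; `toy_categories` is a hypothetical stand-in for the real `categories` dict, and `valid_codes` is an illustrative helper.

```python
import math

# Hypothetical stand-in for one entry of the `categories` dict printed above:
# float codes mapped to labels, plus a NaN key for missing values.
toy_categories = {'SEX': {1.0: 'Male', 2.0: 'Female', float('nan'): 'N/A'}}


def valid_codes(code_to_label):
    """Return the category codes in their original order, skipping the NaN key."""
    return [code for code in code_to_label if not math.isnan(code)]


sex_codes = valid_codes(toy_categories['SEX'])
print(sex_codes)                     # [1.0, 2.0]
print(sex_codes.index(float('2')))   # 1: a CSV field '2' parsed as float selects column 1
```

Because the sensitive attribute is later taken from column 1 of the one-hot vector (`npA[:, 1]`), the order of this list determines which group is encoded as 1.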


Dataset preprocessing
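
Both preprocessing cells standardize the features with statistics computed on the training split only and clamp the standard deviation from below with `EPS`. The sketch below reproduces that step on a made-up 2x2 example, using broadcasting instead of `np.tile`, to show what happens to a column that is constant in the training data.

```python
import numpy as np

EPS = 1e-8  # same floor as in the preprocessing cell


def whiten(X, mean, std):
    """Center and scale columns; columns with std == 0 are divided by EPS instead."""
    return (X - mean) / np.maximum(std, EPS)


X_train = np.array([[1.0, 5.0], [3.0, 5.0]])   # second column is constant
X_test = np.array([[2.0, 5.0], [5.0, 6.0]])

mean = X_train.mean(axis=0)   # statistics come from the training split only
std = X_train.std(axis=0)

print(whiten(X_train, mean, std))  # first column -> [-1, 1], second column -> [0, 0]
print(whiten(X_test, mean, std))   # a test value off the constant column becomes huge
```

A one-hot column that never fires in the training split has zero standard deviation; if it fires in the test split, the clamped division yields values on the order of 1/EPS, so such columns are worth checking.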

In [None]:
# This cell is a variant of the one for the Adult dataset, with modifications to the feature names and the categories.
import numpy as np
import sys

# Small constant that keeps whiten() from dividing by zero when a column has zero standard deviation
EPS = 1e-8

# Function to bucketize a value based on given bucket thresholds
def bucket(x, buckets):
    x = float(x)
    n = len(buckets)
    label = n
    for i in range(len(buckets)):
        if x <= buckets[i]:
            label = i
            break
    template = [0. for j in range(n + 1)]
    template[label] = 1.
    return template

def onehot(x, choices):
    # CSV fields arrive as strings: numeric category codes are parsed as floats,
    # and anything else is treated as the boolean target ('True' / 'False').
    try:
        x = float(x)
    except ValueError:
        x = (x == 'True')
    if x not in choices:
        print('could not find "{}" in choices'.format(x))
        print(choices)
        raise Exception()
    label = choices.index(x)
    template = [0. for j in range(len(choices))]
    template[label] = 1.
    return template

def continuous(x):
    return [float(x)]


def parse_row(row, headers, headers_use):
    # Encode every field of the row with the function registered for its column
    new_row_dict = {}
    for i in range(len(row)):
        x = row[i]
        hdr = headers[i]
        new_row_dict[hdr] = funs[hdr](x)
    sens_att = new_row_dict[sensitive]
    label = new_row_dict[target]
    # Concatenate the encoded input features in the order given by headers_use
    new_row = []
    for h in headers_use:
        new_row = new_row + new_row_dict[h]
    return new_row, label, sens_att

def whiten(X, mn, std):
    mntile = np.tile(mn, (X.shape[0], 1))
    stdtile = np.maximum(np.tile(std, (X.shape[0], 1)), EPS)
    X = X - mntile
    X = np.divide(X, stdtile)
    return X


if __name__ == '__main__':
    f_in_tr = 'new_dataset.data'
    f_in_te = 'new_dataset.test'

    f_out_np = 'ACSIncome.npz'
    hd_file = 'ACSIncome.headers'
    f_out_csv = 'ACSIncome.csv'

    header_list = open(hd_file, 'w')

    REMOVE_MISSING = True
    MISSING_TOKEN = '-1'

    headers =  data_frame.columns.values.tolist()
    headers_use = [item for item in headers if item != ACSIncome.target and item != 'SEX']
    target = ACSIncome.target
    sensitive = 'SEX'
    #['MIG', 'DIS', 'ANC', 'ESP', 'DEYE', 'NATIVITY', 'SCHL', 'AGEP', 'MAR', 'POBP', 'CIT', 'DEAR', 'COW', 'OCCP', 'MIL', 'DREM', 'WKHP', 'RAC1P', 'RELP', 'SEX', 'PINCP', 'ESR']
    # WKHP
    options = {
        'AGEP': 'continuous',
        'WKHP': 'continuous',
        'SCHL': 'continuous',
        'COW': [x for x in categories['COW'].keys() if not math.isnan(x)],
        'POBP': [x for x in categories['POBP'].keys() if not math.isnan(x)],
        'MAR': [x for x in categories['MAR'].keys() if not math.isnan(x)],
        'OCCP': [x for x in categories['OCCP'].keys() if not math.isnan(x)],
        'RELP': [x for x in categories['RELP'].keys() if not math.isnan(x)],
        'RAC1P': [x for x in categories['RAC1P'].keys() if not math.isnan(x)],
        'SEX': [x for x in categories['SEX'].keys() if not math.isnan(x)],
        ACSIncome.target: [False, True]
    }

    buckets = {'age': [18, 25, 30, 35, 40 ,45, 50, 55, 60, 65]}

    # options = {k: [s.strip() for s in sorted(options[k].split(','))] for k in options}
    #['MIG', 'DIS', 'ANC', 'ESP', 'DEYE', 'NATIVITY', 'SCHL', 'AGEP', 'MAR', 'POBP', 'CIT', 'DEAR', 'COW', 'OCCP', 'MIL', 'DREM', 'WKHP', 'RAC1P', 'RELP', 'SEX', 'PINCP', 'ESR']

    funs ={
        'AGEP': lambda x: continuous(x),
        'WKHP': lambda x: continuous(x),
        'SCHL': lambda x: continuous(x),
        'COW': lambda x: onehot(x, options['COW']),
        'POBP': lambda x: onehot(x, options['POBP']),
        'MAR': lambda x: onehot(x, options['MAR']),
        'OCCP': lambda x: onehot(x, options['OCCP']),
        'RELP': lambda x: onehot(x, options['RELP']),
        'RAC1P': lambda x: onehot(x, options['RAC1P']),
        'SEX': lambda x: onehot(x, options['SEX']),
        ACSIncome.target: lambda x: onehot(x, options[ACSIncome.target])
    }

    D = {}
    for f, phase in [(f_in_tr, 'training'), (f_in_te, 'test')]:
        dat = [s.strip().split(',') for s in open(f, 'r').readlines()]

        X = []
        Y = []
        A = []
        print(phase)

        for r in dat:
            row = [s.strip() for s in r]
            # print(row)
            # print(headers)
            # print(headers_use)
            if MISSING_TOKEN in row and REMOVE_MISSING:
                continue
            if row in ([''], ['|1x3 Cross validator']):
                continue
            newrow, label, sens_att = parse_row(row, headers, headers_use)
            X.append(newrow)
            Y.append(label)
            A.append(sens_att)

        npX = np.array(X)
        npY = np.array(Y)
        npA = np.array(A)
        npA = np.expand_dims(npA[:,1], 1)

        D[phase] = {}
        D[phase]['X'] = npX
        D[phase]['Y'] = npY
        D[phase]['A'] = npA

        print(npX.shape)
        print(npY.shape)
        print(npA.shape)

    # Standardize the data using training-set statistics
    mn = np.mean(D['training']['X'], axis=0)
    std = np.std(D['training']['X'], axis=0)
    print(mn, std)
    D['training']['X'] = whiten(D['training']['X'], mn, std)
    D['test']['X'] = whiten(D['test']['X'], mn, std)

    # Write headers to file
    f = open(hd_file, 'w')
    i = 0
    for h in headers_use:
        if options[h] == 'continuous':
            f.write('{:d},{}\n'.format(i, h))
            i += 1
        elif options[h][0] == 'buckets':
            for b in buckets[h]:
                colname = '{}_{:d}'.format(h, b)
                f.write('{:d},{}\n'.format(i, colname))
                i += 1
        else:
            for opt in options[h]:
                colname = '{}_{}'.format(h, opt)
                f.write('{:d},{}\n'.format(i, colname))
                i += 1

    n = D['training']['X'].shape[0]
    shuf = np.random.permutation(n)
    valid_pct = 0.2
    valid_ct = int(n * valid_pct)
    valid_inds = shuf[:valid_ct]
    train_inds = shuf[valid_ct:]

    np.savez(f_out_np, x_train=D['training']['X'], x_test=D['test']['X'],
                y_train=D['training']['Y'], y_test=D['test']['Y'],
                attr_train=D['training']['A'], attr_test=D['test']['A'],
             train_inds=train_inds, valid_inds=valid_inds)

# Setting up the environment

First, download conda and create an environment with Python 3.6: the code is implemented with TensorFlow 1.9.0, which does not support Python versions later than 3.6. Colab does not provide Python 3.6 without Anaconda.
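
As a small guard, a script can check the interpreter version before importing TensorFlow. The following is only a sketch; `at_most_py36` is a hypothetical helper, not part of the project.

```python
import sys


def at_most_py36(version_info):
    """True if the (major, minor) interpreter version is 3.6 or older."""
    return tuple(version_info[:2]) <= (3, 6)


if not at_most_py36(sys.version_info):
    # Running under the notebook's default interpreter instead of the conda environment.
    print('Python {}.{} is too recent for this code; run it inside the conda environment'.format(
        *sys.version_info[:2]))
```

This does not replace the environment setup; it only makes a wrong interpreter visible early.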

In [None]:
# Navigate to Representation_Learning directory
%cd /content/Representation_Learning/
%env PYTHONPATH = # /env/python
!wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-Linux-x86_64.sh
!chmod +x Miniconda3-py38_4.12.0-Linux-x86_64.sh
!./Miniconda3-py38_4.12.0-Linux-x86_64.sh -b -f -p /usr/local
!conda update conda
import sys
sys.path.append('/usr/local/lib/python3.8/site-packages')
!conda create -n myenv python=3.6



/content/Representation_Learning
env: PYTHONPATH=# /env/python
--2024-06-19 17:25:29--  https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.32.241, 104.16.191.158, 2606:4700::6810:20f1, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.32.241|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76120962 (73M) [application/x-sh]
Saving to: ‘Miniconda3-py38_4.12.0-Linux-x86_64.sh’


2024-06-19 17:25:29 (196 MB/s) - ‘Miniconda3-py38_4.12.0-Linux-x86_64.sh’ saved [76120962/76120962]

PREFIX=/usr/local
Unpacking payload ...
Collecting package metadata (current_repodata.json): - \ done
Solving environment: / - \ | / - done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - _libgcc_mutex==0.1=main
    - _openmp_mutex==4.5=1_gnu
    - brotlipy==0.7.0=py38h27cfd23_1003
    - ca-certificates==2022.3.29=h06a4308_1
    - certifi==20

Now install the dependencies

In [None]:
%%shell
eval "$(conda shell.bash hook)"
conda activate myenv
pip install MarkupSafe==1.1.1
pip install absl-py==0.2.2
pip install astor==0.7.1
pip install gast==0.2.0
pip install grpcio==1.13.0
pip install Jinja2==2.10
pip install Markdown==2.6.11
pip install numpy==1.14.5
pip install protobuf==3.6.0
pip install six==1.11.0
pip install tensorboard==1.9.0
pip install tensorflow-gpu==1.9.0
pip install tensorflow==1.9.0
pip install termcolor==1.1.0
pip install Werkzeug==0.14.1
pip install matplotlib==3.3.0
pip install ipykernel

Collecting MarkupSafe==1.1.1
  Downloading MarkupSafe-1.1.1-cp36-cp36m-manylinux2010_x86_64.whl (32 kB)
Installing collected packages: MarkupSafe
Successfully installed MarkupSafe-1.1.1
Collecting absl-py==0.2.2
  Downloading absl-py-0.2.2.tar.gz (82 kB)
Collecting six
  Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Building wheels for collected packages: absl-py
  Building wheel for absl-py (setup.py) ... done
  Created wheel for absl-py: filename=absl_py-0.2.2-py3-none-any.whl size=98945 sha256=5220bba3269c3d7d1a8e5989f9c3fdd07010b5d75117067613f19ebb3788a8f6
  Stored in directory: /root/.cache/pip/wheels/5d/d0/0d/fe80a7cd6b46dc7a7f99a55d2823a38f4babbebff0b24a424a
Successfully built absl-py
Installing collected packages: six, absl-py
Successfully installed absl-py-0.2.2 six-1.16.0
Collecting astor==0.7.1
  Downloading astor-0.7.1-py2.py3-none-any.whl (27 kB)
Installing collected packages: astor
Successf



## Command generation
To facilitate running the training code, we implemented a cell that generates the terminal commands with the required input options. This is needed because the code must be executed from the terminal with Python 3.6 from the conda environment.

The two Python files used for training are run_laftr.py and run_unf_clf.py, which are located in /content/Representation_Learning/src.

For command generation, we take into account the following parameters:

1. Exp_name: Name of the experiment, used for logging and saving.

2. data: Dataset used for the experiment ('adult' or 'ACSIncome').

3. num_experiment: Index of the experiment. It is appended to the experiment name to distinguish between experiments with different values of fair_coeff ($\gamma$).

4. train_epochs: Number of epochs to train the model (representation learning part).

5. transfer_epochs: Number of epochs for training the naive classifier on the representation.

6. aud_steps: Number of adversarial update steps (the adversary parameters can be updated for more than one step while the encoder, decoder, and classifier parameters are updated for one step).

7. batch_size: Size of each training batch.

8. recon_coeff: Coefficient of the reconstruction loss term.

9. fair_coeff: Coefficient of the fairness loss term ($\gamma$).

10. learning_rate: Learning rate of the optimizer.

11. transfer_epoch_number: Epoch number after which transfer learning starts (for example, transfer_epoch_number = 950).

12. New: Flag indicating whether the experiment is new (our implementation with $\Delta_{NE}$) or old (the original code).

13. index: Index of the feature used for the non-exempt discrimination measure (35 for the Adult dataset and 10 for the ACSIncome dataset).
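
The parameters above can also be assembled programmatically when sweeping fair_coeff. The sketch below builds one configuration dictionary per dataset and per value of $\gamma$; `base`, `datasets`, and `build_sweep` are hypothetical names, and the shared values are illustrative (the notebook's own experiment list varies learning_rate and aud_steps per run).

```python
# Shared settings taken from the parameter list above; values are illustrative.
base = {
    'train_epochs': 1000, 'transfer_epochs': 500, 'aud_steps': 2, 'batch_size': 128,
    'recon_coeff': 1, 'learning_rate': 0.0005, 'transfer_epoch_number': 950,
}
# data -> (experiment name, index of the feature xc)
datasets = {'adult': ('Adult', 35), 'ACSIncome': ('ACSIncome', 10)}
fair_coeffs = [0.1, 0.5, 1]


def build_sweep(new):
    """One config per (dataset, fair_coeff); num_experiment numbers the fair_coeff values."""
    sweep = []
    for data, (name, index) in datasets.items():
        for num, gamma in enumerate(fair_coeffs, start=1):
            cfg = dict(base, Exp_name=name + ('Xc' if new else ''), data=data,
                       num_experiment=num, fair_coeff=gamma, New=new, index=index)
            sweep.append(cfg)
    return sweep


experiments_sketch = build_sweep(new=False) + build_sweep(new=True)
print(len(experiments_sketch))  # 12
```

Each dictionary carries the same keys as the parameter list, so it can be passed to a command generator by keyword.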



In [None]:
%cd /content/Representation_Learning/

/content/Representation_Learning


In [None]:
def generate_experiment_commands(Exp_name,num_experiments, train_epochs, transfer_epochs, aud_steps, batch_size, recon_coeff, fair_coeff, learning_rate, transfer_epoch_number, data, New, index):
    commands = []
    i = num_experiments
    exp_name =Exp_name+ f"_Exp_{i}"
    if New:
      command1 = (
            f"python src/run_laftr.py conf/transfer/laftr_then_naive.json "
            f"-o exp_name=\"laftr_example/{exp_name}\","
            f"train.n_epochs={train_epochs},"
            f"train.aud_steps={aud_steps},"
            f"train.batch_size={batch_size},"
            f"model.class=WeightedEqoddsWassGanNEW,"
            f"model.recon_coeff={recon_coeff},"
            f"model.fair_coeff={fair_coeff},"
            f"optim.learning_rate={learning_rate},"
            f"transfer.n_epochs={transfer_epochs} "
            f"-n new={New},"
            f"index={index} "
            f"--data {data} --dirs local"
        )

      command2 = (
            f"python src/run_unf_clf.py conf/transfer/laftr_then_naive.json "
            f"-o exp_name=\"laftr_example/{exp_name}/Exp_{i}_classification_transfer\","
            f"train.n_epochs={train_epochs},"
            f"train.aud_steps={aud_steps},"
            f"train.batch_size={batch_size},"
            f"model.class=WeightedEqoddsWassGanNEW,"
            f"model.recon_coeff={recon_coeff},"
            f"model.fair_coeff={fair_coeff},"
            f"optim.learning_rate={learning_rate},"
            f"transfer.n_epochs={transfer_epochs},"
            f"transfer.epoch_number={transfer_epoch_number} "
            f"-n new={New},"
            f"index={index} "
            f"--data {data} --dirs local"
        )

    else:
      command1 = (
            f"python src/run_laftr.py conf/transfer/laftr_then_naive.json "
            f"-o exp_name=\"laftr_example/{exp_name}\","
            f"train.n_epochs={train_epochs},"
            f"train.aud_steps={aud_steps},"
            f"train.batch_size={batch_size},"
            f"model.recon_coeff={recon_coeff},"
            f"model.fair_coeff={fair_coeff},"
            f"optim.learning_rate={learning_rate},"
            f"transfer.n_epochs={transfer_epochs} "
            f"-n new={New},"
            f"index={index} "
            f"--data {data} --dirs local"
        )

      command2 = (
            f"python src/run_unf_clf.py conf/transfer/laftr_then_naive.json "
            f"-o exp_name=\"laftr_example/{exp_name}/Exp_{i}_classification_transfer\","
            f"train.n_epochs={train_epochs},"
            f"train.aud_steps={aud_steps},"
            f"train.batch_size={batch_size},"
            f"model.recon_coeff={recon_coeff},"
            f"model.fair_coeff={fair_coeff},"
            f"optim.learning_rate={learning_rate},"
            f"transfer.n_epochs={transfer_epochs},"
            f"transfer.epoch_number={transfer_epoch_number} "
            f"-n new={New},"
            f"index={index} "
            f"--data {data} --dirs local"
        )


    commands.append((command1, command2))
    return commands

# # Example usage
# Exp_name= 'Adult'
# data = 'adult'
# num_experiment = 3
# train_epochs = 1000
# transfer_epochs = 500
# aud_steps = 3
# batch_size = 128
# recon_coeff = 1
# fair_coeff = 1
# learning_rate = 0.0002
# transfer_epoch_number=950

# New=False
# index=35

# commands = generate_experiment_commands(Exp_name, num_experiment, train_epochs, transfer_epochs, aud_steps, batch_size, recon_coeff, fair_coeff, learning_rate, transfer_epoch_number, data, New, index)

# for command1, command2 in commands:
#     print(command1)
#     print(command2)
#     print()

experiments = [
    {'Exp_name': 'Adult', 'data': 'adult', 'num_experiment': 1, 'train_epochs': 1000, 'transfer_epochs': 500, 'aud_steps': 2, 'batch_size': 128, 'recon_coeff': 1, 'fair_coeff': 0.1, 'learning_rate': 0.0005, 'transfer_epoch_number': 950, 'New': False, 'index': 35},
    {'Exp_name': 'Adult', 'data': 'adult', 'num_experiment': 2, 'train_epochs': 1000, 'transfer_epochs': 500, 'aud_steps': 2, 'batch_size': 128, 'recon_coeff': 1, 'fair_coeff': 0.5, 'learning_rate': 0.0005, 'transfer_epoch_number': 950, 'New': False, 'index': 35},
    {'Exp_name': 'Adult', 'data': 'adult', 'num_experiment': 3, 'train_epochs': 1000, 'transfer_epochs': 500, 'aud_steps': 3, 'batch_size': 128, 'recon_coeff': 1, 'fair_coeff': 1, 'learning_rate': 0.0002, 'transfer_epoch_number': 950, 'New': False, 'index': 35},
    {'Exp_name': 'ACSIncome', 'data': 'ACSIncome', 'num_experiment': 1, 'train_epochs': 1000, 'transfer_epochs': 500, 'aud_steps': 2, 'batch_size': 128, 'recon_coeff': 1, 'fair_coeff': 0.1, 'learning_rate': 0.001, 'transfer_epoch_number': 950, 'New': False, 'index': 10},
    {'Exp_name': 'ACSIncome', 'data': 'ACSIncome', 'num_experiment': 2, 'train_epochs': 1000, 'transfer_epochs': 500, 'aud_steps': 2, 'batch_size': 128, 'recon_coeff': 1, 'fair_coeff': 0.5, 'learning_rate': 0.0005, 'transfer_epoch_number': 950, 'New': False, 'index': 10},
    {'Exp_name': 'ACSIncome', 'data': 'ACSIncome', 'num_experiment': 3, 'train_epochs': 1000, 'transfer_epochs': 500, 'aud_steps': 2, 'batch_size': 128, 'recon_coeff': 1, 'fair_coeff': 1, 'learning_rate': 0.0005, 'transfer_epoch_number': 950, 'New': False, 'index': 10},
    {'Exp_name': 'AdultXc', 'data': 'adult', 'num_experiment': 1, 'train_epochs': 1000, 'transfer_epochs': 500, 'aud_steps': 2, 'batch_size': 128, 'recon_coeff': 1, 'fair_coeff': 0.1, 'learning_rate': 0.0005, 'transfer_epoch_number': 950, 'New': True, 'index': 35},
    {'Exp_name': 'AdultXc', 'data': 'adult', 'num_experiment': 2, 'train_epochs': 1000, 'transfer_epochs': 500, 'aud_steps': 2, 'batch_size': 128, 'recon_coeff': 1, 'fair_coeff': 0.5, 'learning_rate': 0.0005, 'transfer_epoch_number': 950, 'New': True, 'index': 35},
    {'Exp_name': 'AdultXc', 'data': 'adult', 'num_experiment': 3, 'train_epochs': 1000, 'transfer_epochs': 500, 'aud_steps': 3, 'batch_size': 128, 'recon_coeff': 1, 'fair_coeff': 1, 'learning_rate': 0.0002, 'transfer_epoch_number': 950, 'New': True, 'index': 35},
    {'Exp_name': 'ACSIncomeXc', 'data': 'ACSIncome', 'num_experiment': 1, 'train_epochs': 1000, 'transfer_epochs': 500, 'aud_steps': 2, 'batch_size': 128, 'recon_coeff': 1, 'fair_coeff': 0.1, 'learning_rate': 0.001, 'transfer_epoch_number': 950, 'New': True, 'index': 10},
    {'Exp_name': 'ACSIncomeXc', 'data': 'ACSIncome', 'num_experiment': 2, 'train_epochs': 1000, 'transfer_epochs': 500, 'aud_steps': 2, 'batch_size': 128, 'recon_coeff': 1, 'fair_coeff': 0.5, 'learning_rate': 0.0005, 'transfer_epoch_number': 950, 'New': True, 'index': 10},
    {'Exp_name': 'ACSIncomeXc', 'data': 'ACSIncome', 'num_experiment': 3, 'train_epochs': 1000, 'transfer_epochs': 500, 'aud_steps': 2, 'batch_size': 128, 'recon_coeff': 1, 'fair_coeff': 1, 'learning_rate': 0.0005, 'transfer_epoch_number': 950, 'New': True, 'index': 10},
]

# Example: replace the full list above with a single short run for a quick test.
experiments = [
    {'Exp_name': 'Adult', 'data': 'adult', 'num_experiment': 1, 'train_epochs': 100, 'transfer_epochs': 100, 'aud_steps': 2, 'batch_size': 128, 'recon_coeff': 1, 'fair_coeff': 0.1, 'learning_rate': 0.0005, 'transfer_epoch_number': 50, 'New': False, 'index': 35}]

for exp in experiments:
    commands = generate_experiment_commands(
        Exp_name=exp['Exp_name'],
        num_experiments=exp['num_experiment'],
        train_epochs=exp['train_epochs'],
        transfer_epochs=exp['transfer_epochs'],
        aud_steps=exp['aud_steps'],
        batch_size=exp['batch_size'],
        recon_coeff=exp['recon_coeff'],
        fair_coeff=exp['fair_coeff'],
        learning_rate=exp['learning_rate'],
        transfer_epoch_number=exp['transfer_epoch_number'],
        data=exp['data'],
        New=exp['New'],
        index=exp['index']
    )
    for command1, command2 in commands:
        # Label each command pair with its experiment name and number.
        print(f"#_{exp['Exp_name']}_{exp['num_experiment']}")
        print(command1)
        print(command2)
        print()

#_Adult_1
python src/run_laftr.py conf/transfer/laftr_then_naive.json -o exp_name="laftr_example/Adult_Exp_1",train.n_epochs=100,train.aud_steps=2,train.batch_size=128,model.recon_coeff=1,model.fair_coeff=0.1,optim.learning_rate=0.0005,transfer.n_epochs=100 -n new=False,index=35 --data adult --dirs local
python src/run_unf_clf.py conf/transfer/laftr_then_naive.json -o exp_name="laftr_example/Adult_Exp_1/Exp_1_classification_transfer",train.n_epochs=100,train.aud_steps=2,train.batch_size=128,model.recon_coeff=1,model.fair_coeff=0.1,optim.learning_rate=0.0005,transfer.n_epochs=100,transfer.epoch_number=50 -n new=False,index=35 --data adult --dirs local



Run the commands in the terminal. In each pair, the first command trains the representation model and the second trains the naive classifier on the learned representation.
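
If you would rather launch a pair from Python than paste it into a terminal, the ordering can be sketched as below. This is only an illustration: `run_pair` and its `runner` parameter are hypothetical names that are not part of the repository, and the sketch assumes the required Python environment is already active in the calling process.

```python
import shlex
import subprocess


def run_pair(command1, command2, runner=subprocess.run):
    """Run the representation-training command, then the classifier command.

    Hypothetical helper: `runner` defaults to subprocess.run and can be
    swapped out (for example in a dry run).
    """
    results = []
    for command in (command1, command2):
        # shlex.split applies shell-style quoting rules, so the quoted
        # exp_name="..." override stays inside a single argument.
        args = shlex.split(command)
        # check=True raises CalledProcessError on a non-zero exit code,
        # so the classifier is not trained if the first command fails.
        results.append(runner(args, check=True))
    return results
```

Calling `run_pair(command1, command2)` inside the loop that prints the commands would execute each pair in order instead of only printing it.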

To find the training figures and the final test metrics for the example below:
1. For the first training stage (representation model), go to the directory /content/Representation_Learning/experiments/Adult_Exp_1. It contains the training-loss figure and the fairness metrics. The file test_metrics.csv contains the evaluation results on the test dataset.
2. For the second training stage (naive classifier trained on the representation), go to the directory /content/Representation_Learning/Exp_1_classification_transfer. It contains the training-loss figure and the fairness metrics. The file test_metrics.csv contains the evaluation results on the test dataset. The results reported in the paper are taken from this test_metrics.csv.
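
Once a run has finished, the metrics file can be inspected without leaving the notebook. The sketch below is a minimal reader that makes no assumption about the column layout of test_metrics.csv, since that layout is not described here. `load_metrics` is a hypothetical helper name, not a function from the repository.

```python
import csv


def load_metrics(path):
    """Read a metrics CSV into a list of rows (each row is a list of strings).

    Hypothetical helper: no header or column names are assumed, because
    the layout of test_metrics.csv is not documented in this notebook.
    """
    with open(path, newline="") as handle:
        # Skip fully empty lines so trailing newlines do not produce rows.
        return [row for row in csv.reader(handle) if row]
```

For example, after changing into one of the two directories listed above, `for row in load_metrics("test_metrics.csv"): print(row)` prints the stored values row by row.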

In [None]:
%%shell
eval "$(conda shell.bash hook)"
conda activate myenv
python src/run_laftr.py conf/transfer/laftr_then_naive.json -o exp_name="laftr_example/Adult_Exp_1",train.n_epochs=100,train.aud_steps=2,train.batch_size=128,model.recon_coeff=1,model.fair_coeff=0.1,optim.learning_rate=0.0005,transfer.n_epochs=100 -n new=False,index=35 --data adult --dirs local
python src/run_unf_clf.py conf/transfer/laftr_then_naive.json -o exp_name="laftr_example/Adult_Exp_1/Exp_1_classification_transfer",train.n_epochs=100,train.aud_steps=2,train.batch_size=128,model.recon_coeff=1,model.fair_coeff=0.1,optim.learning_rate=0.0005,transfer.n_epochs=100,transfer.epoch_number=50 -n new=False,index=35 --data adult --dirs local



  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
{'exp_name': 'laftr_example/Adult_Exp_1', 'train': {'n_epochs': '100', 'aud_steps': '2', 'batch_size': '128'}, 'model': {'recon_coeff': '1', 'fair_coeff': '0.1'}, 'optim': {'learning_rate': '0.0005'}, 'transfer': {'n_epochs': '100'}}
y shape (24157, 2)
changing shape
2024-06-19 17:39:06.542923: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
starting Epoch 0
Class DP last epoch: 100000.000; Disc DP Bound last epoch: 100000.000
E0: trained class 150, trained aud 300
DI:  0.014098703861236572
DP:  0.004413619
(4736, 1)
(4736, 8)
(4736, 1)
(4736, 1)
(4736, 1)
Test score: Class CE: 0.271, D

