### This notebook contains simulations for constructing and evaluating max-entropy distributions for the small version of the COMPAS dataset and the Adult dataset

The code is based on the following paper:

**Data preprocessing to mitigate bias: A maximum-entropy based approach** <br>
L. Elisa Celis, Vijay Keswani, Nisheeth K. Vishnoi <br>
ICML 2020

In [1]:
import sys 
sys.path.append("..")
# This project requires the IBM AIF360 package for the datasets (https://github.com/ibm/aif360)

import numpy as np
from Fair_Max_Entropy_Distributions.FairMaxEnt.domain import Domain
from Fair_Max_Entropy_Distributions.FairMaxEnt.memory import MemoryTrie
from Fair_Max_Entropy_Distributions.FairMaxEnt.maximum_entropy_distribution import MaxEnt
from Fair_Max_Entropy_Distributions.FairMaxEnt.fair_maximum_entropy import FairMaximumEntropy
from Fair_Max_Entropy_Distributions.FairMaxEnt.fair_maximum_entropy import reweightSamples
from Fair_Max_Entropy_Distributions.Codes.Utils import *
import matplotlib.pyplot as plt

from tqdm.notebook import tqdm as tqdm
# from tqdm import tqdm_notebook as tqdm



#### Loading the dataset

The cell below loads the Adult dataset; swap in the corresponding COMPAS loader for an equivalent evaluation (load functions are present in the Utils file). <br>
The rest of the code (other than the sensitive attribute index and domainArray) is mostly dataset-agnostic. <br>
Use the notebook FairMaxEnt-expts-2 for experiments on the large COMPAS dataset. <br>


In [2]:
simpleDomain, simpleSamples = getAdultDataset()
simpleDomain, len(simpleSamples)

Missing Data: 3620 rows removed from AdultDataset.


KeyError: 'sexx'

In [None]:
simpleSamples[0:5]


In [None]:
simpleDomain.labels

In [None]:
domainArray = getAdultDomain()

#### Example runs of max-entropy optimization program

The setup is done, so we can now run the experiments. <br>
C - smoothing parameter <br>
delta - error parameter

In [None]:
C = 0.1
delta = 0

sens_attr = simpleDomain.labels.index("race")    # index of the sensitive attribute ("race" was the choice for COMPAS)

# labelIndex denotes the index of class label
labelIndex = len(simpleSamples[0]) - 1

In [None]:
simpleSamples.shape

In [None]:
sens_attr

The fairness metrics, evaluated over the original raw dataset, have the following values:

In [None]:
print("Statistical Rate: ", getDisparateImpact(simpleSamples, sens_attr))
print("Representation Rate: ", getGenderRatio(simpleSamples, sens_attr))
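The implementations of `getDisparateImpact` and `getGenderRatio` live in the Utils file and are not shown here. As a rough illustration of what the two metrics measure, the sketch below assumes the usual ratio-based definitions: the representation rate is the smallest ratio between the frequencies of two sensitive-attribute values, and the statistical rate is the smallest ratio between the positive-label rates of two groups. The function names and the toy array are hypothetical and are not part of FairMaxEnt.

```python
import numpy as np

def representation_rate(samples, sens_idx):
    """Smallest ratio between the frequencies of any two sensitive-attribute values."""
    samples = np.asarray(samples)
    _, counts = np.unique(samples[:, sens_idx], return_counts=True)
    return counts.min() / counts.max()

def statistical_rate(samples, sens_idx, label_idx, positive_label=1):
    """Smallest ratio between the positive-label rates of any two groups."""
    samples = np.asarray(samples)
    rates = []
    for z in np.unique(samples[:, sens_idx]):
        group = samples[samples[:, sens_idx] == z]
        rates.append(np.mean(group[:, label_idx] == positive_label))
    rates = np.array(rates)
    if rates.max() == 0:
        # No group ever receives the positive label, so all rates are equal.
        return 1.0
    return rates.min() / rates.max()

# Toy data (assumption): column 0 is the sensitive attribute, column 1 the label.
toy = np.array([
    [0, 1],
    [0, 0],
    [0, 0],
    [0, 0],
    [1, 1],
    [1, 1],
])

print("Representation Rate: ", representation_rate(toy, sens_idx=0))
print("Statistical Rate: ", statistical_rate(toy, sens_idx=0, label_idx=1))
```

Both metrics lie in [0, 1], and a value of 1 means the groups are perfectly balanced with respect to that metric.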

Since we also need the KL-divergence from the empirical distribution of the original dataset, we first compute this distribution.

In [None]:
# This utility evaluation procedure does not work for the large COMPAS dataset, because its domain is too large to enumerate.

domain = getDomain(domainArray)
# Add a small constant to every entry so that no domain point has zero probability
rawDataDist = getDistribution(simpleSamples, domain) + np.array([0.0000001]*len(domain))

getUtility(simpleSamples, rawDataDist, domain)
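To make the utility computation concrete, here is a minimal sketch of an empirical distribution over an enumerated domain and a KL-divergence between two such distributions. It is not the code behind `getDomain`, `getDistribution` or `getUtility`; the helper names, the integer encoding of attribute values and the direction of the divergence (sample distribution relative to a reference distribution) are all assumptions made for illustration.

```python
import itertools
import numpy as np

def enumerate_domain(domain_sizes):
    """List every point of a product domain whose i-th attribute takes values 0..k_i-1."""
    return list(itertools.product(*[range(k) for k in domain_sizes]))

def empirical_distribution(samples, domain_points):
    """Fraction of samples that fall on each domain point."""
    index = {point: i for i, point in enumerate(domain_points)}
    counts = np.zeros(len(domain_points))
    for row in samples:
        counts[index[tuple(int(v) for v in row)]] += 1
    return counts / counts.sum()

def kl_divergence(p, q):
    """KL(p || q), skipping points where p is zero (they contribute nothing)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy example (assumption): two binary attributes, four samples.
points = enumerate_domain([2, 2])
toy_samples = np.array([[0, 0], [0, 1], [0, 1], [1, 1]])
p = empirical_distribution(toy_samples, points)
uniform = np.full(len(points), 1.0 / len(points))

print("Empirical distribution: ", p)
print("KL-divergence wrt uniform: ", kl_divergence(p, uniform))
```

The enumeration grows with the product of the attribute cardinalities, which is why this style of evaluation becomes impractical for large domains.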

#### We first look at the max-entropy distribution using the dataset mean and prior
The prior in this case is $q_C^d$ and the expected value is the mean of the original dataset, $\theta^d$.
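As a hedged illustration of how the smoothing parameter C might enter the prior, the sketch below assumes that $q_C^d$ is a convex combination of the uniform distribution and the empirical distribution of the data, with weight C on the uniform part. This is our reading of the setup rather than code taken from FairMaxEnt, and `smoothed_prior` is a hypothetical helper.

```python
import numpy as np

def smoothed_prior(empirical, C):
    """Mix the uniform distribution (weight C) with an empirical one (weight 1 - C)."""
    empirical = np.asarray(empirical, dtype=float)
    uniform = np.full_like(empirical, 1.0 / empirical.size)
    return C * uniform + (1.0 - C) * empirical

# Toy empirical distribution over four domain points (assumption).
empirical = np.array([0.25, 0.5, 0.0, 0.25])
prior = smoothed_prior(empirical, C=0.1)
print(prior)
```

Under this assumption, any C > 0 gives every domain point non-zero prior mass, including points that never occur in the data, while C = 0 recovers the empirical distribution.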

In [None]:
maxEnt = FairMaximumEntropy(simpleDomain, simpleSamples, C, delta, labelIndex, reweight=False, weightedMean=False)
dataset = maxEnt.sample(10000)

print("Statistical Rate: ", getDisparateImpact(dataset, sens_attr))
print("Representation Rate: ", getGenderRatio(dataset, sens_attr))
print("KL-divergence wrt raw data: ", getUtility(dataset, rawDataDist, domain))

The raw dataset, in this case, is quite biased. Hence, using just $q_C^d$ and $\theta^d$ will not lead to a fair max-entropy distribution.

#### Next we can set a different prior and expected value to debias the distribution

The parameter *reweight* can be set to true if the re-weighted prior distribution (i.e., $q_C^w$) should be used.
The parameter *weightedMean* can be set to true if the re-weighted expected value (i.e., $\theta^w$) should be used.
Using these, we get a max-entropy distribution that has high statistical and representation rates.

In [None]:
maxEnt = FairMaximumEntropy(simpleDomain, simpleSamples, C, delta, 0,
                                reweight=True, reweightXindices=[sens_attr],
                                reweightYindices=[len(simpleSamples[0])-1], weightedMean=True)
dataset = maxEnt.sample(10000)

print("Statistical Rate: ", getDisparateImpact(dataset, sens_attr))
print("Representation Rate: ", getGenderRatio(dataset, sens_attr))
print("KL-divergence wrt raw data: ", getUtility(dataset, rawDataDist, domain))
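The intuition behind the re-weighted prior and the re-weighted expected value can be sketched with a simple per-sample weighting scheme. The version below is an assumption for illustration and is not necessarily what `reweightSamples` does: each sample with label y and sensitive value z gets weight count(y) / count(y, z), so every group carries the same weighted mass within each label. The function name and the toy data are hypothetical.

```python
import numpy as np

def balancing_weights(samples, sens_idx, label_idx):
    """Weight each sample by count(y) / count(y, z), then normalise to sum to 1."""
    samples = np.asarray(samples)
    z = samples[:, sens_idx]
    y = samples[:, label_idx]
    weights = np.empty(len(samples), dtype=float)
    for y_val in np.unique(y):
        count_y = np.sum(y == y_val)
        for z_val in np.unique(z):
            cell = (z == z_val) & (y == y_val)
            if cell.any():
                weights[cell] = count_y / cell.sum()
    return weights / weights.sum()

# Toy data (assumption): column 0 is the sensitive attribute, column 1 the label.
toy = np.array([
    [0, 1],
    [0, 0],
    [0, 0],
    [0, 0],
    [1, 1],
    [1, 1],
    [1, 0],
])

w = balancing_weights(toy, sens_idx=0, label_idx=1)
# Weighted feature mean, playing the role of a re-weighted expected value.
weighted_mean = np.average(toy, axis=0, weights=w)

print("Sample weights: ", w)
print("Weighted mean: ", weighted_mean)
```

When every (label, group) combination occurs at least once, as in the toy data, the weighted data has equal mass for each group and the same positive-label rate in each group, so both fairness metrics equal 1 under the weighted distribution.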

In [None]:
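# Explicit import, in case the wildcard import from Utils does not expose the classifier
from sklearn.tree import DecisionTreeClassifier

_, testData = getTrainAndTestData(simpleSamples, 3)
getClfAccAndDI(dataset, testData, sens_attr, clf=DecisionTreeClassifier(random_state=0))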


Next we set *reweight* to true but leave *weightedMean* false, i.e., we use the fair prior but keep the mean of the original dataset as the expected value. With this combination, we get a distribution with high statistical rate but low representation rate.

In [None]:
labelIndex = len(simpleSamples[0])-1
maxEnt = FairMaximumEntropy(simpleDomain, simpleSamples, C, delta, sens_attr,
                            reweight=True, reweightXindices=[sens_attr],
                            reweightYindices=[labelIndex])

dataset = maxEnt.sample(10000)
print("Statistical Rate: ", getDisparateImpact(dataset, sens_attr))
print("Representation Rate: ", getGenderRatio(dataset, sens_attr))
print("KL-divergence wrt raw data: ", getUtility(dataset, rawDataDist, domain))

To use the balanced expected value $\theta^b$, we need to set *alterMean* to true. With this combination, we again get a distribution with high statistical rate and high representation rate.

In [None]:
%%time
# The parameter alterMean can be set true if the balanced expected value should be used


maxEnt = FairMaximumEntropy(simpleDomain, simpleSamples, C, delta, sens_attr,
                                reweight=True, reweightXindices=[sens_attr],
                                reweightYindices=[labelIndex], alterMean = True)

dataset = maxEnt.sample(10000)
print("Statistical Rate: ", getDisparateImpact(dataset, sens_attr))
print("Representation Rate: ", getGenderRatio(dataset, sens_attr))
print("KL-divergence wrt raw data: ", getUtility(dataset, rawDataDist, domain))

In [None]:
print(dataset.shape)
dataset

In [None]:
dataset[0]