### Simulations for constructing and evaluating max-entropy distributions on the small versions of the COMPAS and Adult datasets

The code is based on the following paper:

**Data preprocessing to mitigate bias: A maximum-entropy based approach** <br>
L. Elisa Celis, Vijay Keswani, Nisheeth K. Vishnoi <br>
ICML 2020

In [1]:
import sys 
sys.path.append("..")
# This project requires the IBM AIF360 package for the datasets (https://github.com/ibm/aif360)

import numpy as np
from Fair_Max_Entropy_Distributions.FairMaxEnt.domain import Domain
from Fair_Max_Entropy_Distributions.FairMaxEnt.memory import MemoryTrie
from Fair_Max_Entropy_Distributions.FairMaxEnt.maximum_entropy_distribution import MaxEnt
from Fair_Max_Entropy_Distributions.FairMaxEnt.fair_maximum_entropy import FairMaximumEntropy
from Fair_Max_Entropy_Distributions.FairMaxEnt.fair_maximum_entropy import reweightSamples
from Fair_Max_Entropy_Distributions.Codes.Utils import *
import matplotlib.pyplot as plt

from tqdm.notebook import tqdm as tqdm
# from tqdm import tqdm_notebook as tqdm



#### Loading the dataset

To run the equivalent evaluation on the Adult dataset, swap in its load functions (available in the Utils file). <br>
Apart from the sensitive attribute index and `domainArray`, the rest of the code is largely dataset-independent. <br>
For experiments on the large COMPAS dataset, use the notebook FairMaxEnt-expts-2. <br>


In [2]:
simpleDomain, simpleSamples = getSmallCompasDataset()
simpleDomain, len(simpleSamples)

Missing Data: 5 rows removed from CompasDataset.


  dfcutQ['sex'] = dfcutQ['sex'].replace({'Female': 1.0, 'Male': 0.0})


(Domain in 11 with 6, 5278)

In [3]:
simpleDomain.labels

['sex', 'race', 'age', 'priors_count', 'c_charge_degree', 'two_year_recid']

In [4]:
domainArray = getSmallCompasDomain()

#### Example runs of max-entropy optimization program

The setup is done and now we can run the experiments <br>
C - smoothing parameter <br>
delta - error parameter
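
To make the role of `C` concrete, here is a minimal sketch. It assumes that the prior $q_C^d$ is a mixture of the uniform distribution over the domain and the empirical distribution of the samples, with mixing weight `C`; that form is an assumption based on the paper's description, and the function name `smoothed_prior` and the toy data are hypothetical, not part of the FairMaxEnt package.

```python
import itertools
import numpy as np

def smoothed_prior(samples, n_features, C):
    """Assumed form of the prior: C * uniform + (1 - C) * empirical."""
    # Enumerate the binary domain {0,1}^n_features in a fixed order.
    domain = list(itertools.product([0, 1], repeat=n_features))
    index = {point: i for i, point in enumerate(domain)}

    # Empirical distribution: fraction of samples at each domain point.
    empirical = np.zeros(len(domain))
    for row in samples:
        empirical[index[tuple(int(v) for v in row)]] += 1
    empirical /= len(samples)

    uniform = np.full(len(domain), 1.0 / len(domain))
    return C * uniform + (1 - C) * empirical

# Toy data: four samples over two binary features; (1, 0) is never observed.
toy = np.array([[0, 0], [0, 0], [0, 1], [1, 1]])
prior = smoothed_prior(toy, n_features=2, C=0.1)
print(prior)
```

Under this assumption, `C = 0` recovers the empirical distribution, `C = 1` gives the uniform distribution, and any `C > 0` assigns positive mass to domain points that never appear in the data.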

In [5]:
C = 0.1
delta = 0

sens_attr = simpleDomain.labels.index("race")    # for Compas

# labelIndex denotes the index of class label
labelIndex = len(simpleSamples[0]) - 1

In [6]:
simpleSamples.shape

(5278, 11)

In [7]:
sens_attr

1

The fairness metrics, evaluated over the original raw dataset, have the following values

In [8]:
print("Statistical Rate: ", getDisparateImpact(simpleSamples, sens_attr))
print("Representation Rate: ", getGenderRatio(simpleSamples, sens_attr))

Statistical Rate:  0.7471480065031378
Representation Rate:  0.6623622047244094
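
As a rough illustration of what these two numbers measure, the sketch below assumes a binary sensitive attribute and a binary label, takes the representation rate to be the ratio of the smaller group's frequency to the larger group's, and takes the statistical rate to be the ratio of the lower positive-label rate to the higher one. The helper names are hypothetical, and `getDisparateImpact` and `getGenderRatio` from the Utils file may differ in details.

```python
import numpy as np

def representation_rate(samples, sens_idx):
    """Ratio of the smaller group's frequency to the larger group's frequency."""
    p1 = np.mean(samples[:, sens_idx] == 1)
    p0 = 1.0 - p1
    return min(p0, p1) / max(p0, p1)

def statistical_rate(samples, sens_idx, label_idx):
    """Ratio of the lower positive-label rate to the higher one across the two groups."""
    z = samples[:, sens_idx]
    y = samples[:, label_idx]
    rate0 = np.mean(y[z == 0] == 1)
    rate1 = np.mean(y[z == 1] == 1)
    return min(rate0, rate1) / max(rate0, rate1)

# Toy data with columns (z, y): group z=0 has 6 rows, 3 of them positive;
# group z=1 has 4 rows, 1 of them positive.
toy = np.array([[0, 1]] * 3 + [[0, 0]] * 3 + [[1, 1]] + [[1, 0]] * 3)
rep = representation_rate(toy, sens_idx=0)
sr = statistical_rate(toy, sens_idx=0, label_idx=1)
print(rep, sr)
```

Both quantities lie in [0, 1], and a value of 1 means the two groups are perfectly balanced with respect to that metric.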


Since we also need the KL-divergence from the empirical distribution of the original dataset, we first compute that distribution.

In [9]:
# This utility evaluation procedure does not work for the large COMPAS dataset because its domain is too large.

domain = getDomain(domainArray)
rawDataDist = getDistribution(simpleSamples, domain) + np.array([0.0000001]*len(domain))

getUtility(simpleSamples, rawDataDist, domain)

2.0032834494455346e-07
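
The computation above can be sketched on a toy binary domain as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, and the direction of the divergence (samples relative to the smoothed raw distribution) is assumed rather than taken from `getUtility`.

```python
import itertools
import numpy as np

def empirical_distribution(samples, domain):
    """Fraction of samples falling on each point of an enumerated domain."""
    index = {point: i for i, point in enumerate(domain)}
    counts = np.zeros(len(domain))
    for row in samples:
        counts[index[tuple(int(v) for v in row)]] += 1
    return counts / counts.sum()

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)), with 0 * log 0 taken as 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

domain = list(itertools.product([0, 1], repeat=2))
raw = np.array([[0, 0], [0, 0], [0, 1], [1, 1]])

# Small additive offset, mirroring the 1e-7 term in the cell above, keeps the
# reference strictly positive. It is not renormalized, so the divergence of the
# raw data from itself is near zero rather than exactly zero.
raw_dist = empirical_distribution(raw, domain) + 1e-7

same = kl_divergence(empirical_distribution(raw, domain), raw_dist)
shifted = np.array([[1, 0], [1, 0], [1, 1], [0, 0]])
other = kl_divergence(empirical_distribution(shifted, domain), raw_dist)
print(same, other)
```

A dataset that places mass on points the raw data never visits, such as `shifted` here, is penalized heavily, which is why the offset is needed to keep the divergence finite.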

#### We first look at max-entropy distribution using dataset mean and prior
The prior in this case is $q_C^d$ and the expected value is mean of original dataset, $\theta^d$
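
For intuition, a max-entropy distribution with respect to a prior $q$ and a target mean $\theta$ has the form $p(x) \propto q(x)\exp(\langle \lambda, x \rangle)$, with $\lambda$ chosen so that the mean of $p$ equals $\theta$. The sketch below finds $\lambda$ by plain gradient descent on the dual over a tiny enumerated domain. It is an illustration only: the function name is hypothetical, the error parameter `delta` is not modeled, and the solver inside `FairMaximumEntropy` may work quite differently.

```python
import itertools
import numpy as np

def max_entropy(prior, domain, theta, steps=2000, lr=1.0):
    """Return p(x) proportional to prior(x) * exp(<lam, x>) whose mean matches theta."""
    X = np.array(domain, dtype=float)
    lam = np.zeros(X.shape[1])
    for _ in range(steps):
        logits = np.log(prior) + X @ lam
        p = np.exp(logits - logits.max())   # subtract the max for numerical stability
        p /= p.sum()
        lam -= lr * (p @ X - theta)         # dual gradient is E_p[x] - theta
    return p

domain = list(itertools.product([0, 1], repeat=2))
prior = np.full(len(domain), 0.25)          # uniform prior on {0,1}^2
theta = np.array([0.7, 0.2])                # target mean of each feature
p = max_entropy(prior, domain, theta)
mean = p @ np.array(domain, dtype=float)
print(p, mean)
```

With a uniform prior the solution is the product of independent Bernoulli distributions with the given means, which makes this toy case easy to check by hand.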

In [10]:
maxEnt = FairMaximumEntropy(simpleDomain, simpleSamples, C, delta, labelIndex, reweight=False, weightedMean=False)
dataset = maxEnt.sample(10000)

print("Statistical Rate: ", getDisparateImpact(dataset, sens_attr))
print("Representation Rate: ", getGenderRatio(dataset, sens_attr))
print("KL-divergence wrt raw data: ", getUtility(dataset, rawDataDist, domain))

Statistical Rate:  0.8025073266037122
Representation Rate:  0.6666666666666666
KL-divergence wrt raw data:  0.02270855976106769


The raw dataset, in this case, is quite biased. Hence, using just $q_C^d$ and $\theta^d$ will not lead to a fair max-entropy distribution.

#### Next we can set different prior and expected values to debias the distribution

Set the parameter *reweight* to true to use the re-weighted prior distribution (i.e., $q_C^w$).
Set the parameter *weightedMean* to true to use the re-weighted expected value (i.e., $\theta^w$).
With both enabled, we get a max-entropy distribution that has a high statistical rate and a high representation rate.
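
To illustrate what a re-weighting of the samples can look like, the sketch below uses the standard Kamiran-Calders reweighing rule, $w(z, y) = P(z)P(y)/P(z, y)$, which makes the sensitive attribute and the label independent under the weighted distribution. This is a swapped-in technique for illustration; `reweightSamples` in FairMaxEnt may compute its weights differently, and the helper names here are hypothetical.

```python
import numpy as np

def reweigh(samples, sens_idx, label_idx):
    """Per-sample weights w(z, y) = P(z) * P(y) / P(z, y), normalized to sum to 1."""
    z = samples[:, sens_idx]
    y = samples[:, label_idx]
    weights = np.zeros(len(samples))
    for zv in np.unique(z):
        for yv in np.unique(y):
            cell = (z == zv) & (y == yv)
            if cell.any():
                weights[cell] = np.mean(z == zv) * np.mean(y == yv) / np.mean(cell)
    return weights / weights.sum()

def weighted_positive_rate(samples, weights, sens_idx, label_idx, group):
    """Weighted fraction of positive labels within one group."""
    in_group = samples[:, sens_idx] == group
    positive = samples[:, label_idx] == 1
    return weights[in_group & positive].sum() / weights[in_group].sum()

# Toy data with columns (z, y): positive rates are 3/6 for z=0 and 1/4 for z=1.
toy = np.array([[0, 1]] * 3 + [[0, 0]] * 3 + [[1, 1]] + [[1, 0]] * 3)
w = reweigh(toy, sens_idx=0, label_idx=1)
r0 = weighted_positive_rate(toy, w, 0, 1, group=0)
r1 = weighted_positive_rate(toy, w, 0, 1, group=1)
print(r0, r1)
```

Note that this particular rule equalizes the positive-label rates of the two groups but leaves the group sizes unchanged, so on its own it does not improve the representation rate.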

In [11]:
maxEnt = FairMaximumEntropy(simpleDomain, simpleSamples, C, delta, 0,
                                reweight=True, reweightXindices=[sens_attr],
                                reweightYindices=[len(simpleSamples[0])-1], weightedMean=True)
dataset = maxEnt.sample(10000)

print("Statistical Rate: ", getDisparateImpact(dataset, sens_attr))
print("Representation Rate: ", getGenderRatio(dataset, sens_attr))
print("KL-divergence wrt raw data: ", getUtility(dataset, rawDataDist, domain))

Statistical Rate:  0.9810178549973025
Representation Rate:  0.9657951641438962
KL-divergence wrt raw data:  0.05349458280249288


In [12]:
_, testData = getTrainAndTestData(simpleSamples, 3)    
getClfAccAndDI(dataset, testData, sens_attr, clf = DecisionTreeClassifier(random_state=0))


(0.6616113744075829, 0.9381480330255939)

Next we set *reweight* to true but leave *weightedMean* false, i.e., we use the fair prior while the expected value remains the mean of the original dataset. With this combination, we get a distribution with a high statistical rate but a low representation rate.

In [13]:
labelIndex = len(simpleSamples[0])-1
maxEnt = FairMaximumEntropy(simpleDomain, simpleSamples, C, delta, sens_attr,
                            reweight=True, reweightXindices=[sens_attr],
                            reweightYindices=[labelIndex])

dataset = maxEnt.sample(10000)
print("Statistical Rate: ", getDisparateImpact(dataset, sens_attr))
print("Representation Rate: ", getGenderRatio(dataset, sens_attr))
print("KL-divergence wrt raw data: ", getUtility(dataset, rawDataDist, domain))

Statistical Rate:  0.9943290132963473
Representation Rate:  0.6605778811026237
KL-divergence wrt raw data:  0.03441313417136953


To use the balanced expected value $\theta^b$, set *alterMean* to true. With this combination, we again get a distribution with both a high statistical rate and a high representation rate.

In [14]:
%%time
# The parameter alterMean can be set true if the balanced expected value should be used


maxEnt = FairMaximumEntropy(simpleDomain, simpleSamples, C, delta, sens_attr,
                                reweight=True, reweightXindices=[sens_attr],
                                reweightYindices=[labelIndex], alterMean = True)

dataset = maxEnt.sample(10000)
print("Statistical Rate: ", getDisparateImpact(dataset, sens_attr))
print("Representation Rate: ", getGenderRatio(dataset, sens_attr))
print("KL-divergence wrt raw data: ", getUtility(dataset, rawDataDist, domain))

Statistical Rate:  0.9921182969283651
Representation Rate:  0.9813750743015652
KL-divergence wrt raw data:  0.0614629481869033
CPU times: user 862 ms, sys: 1.67 ms, total: 864 ms
Wall time: 862 ms


In [19]:
print(dataset.shape)
dataset

(10000, 11)


array([[1., 1., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 1., 0., 1.],
       [0., 0., 1., ..., 1., 0., 0.],
       ...,
       [0., 0., 1., ..., 1., 0., 0.],
       [1., 0., 0., ..., 1., 0., 0.],
       [1., 1., 1., ..., 0., 1., 0.]])

#### Experiments in the paper
In the paper, we report values after 5-fold cross-validation. For each fold, we run 100 simulations.
We also provide results for different values of C.

The max-entropy distributions computed below cover all combinations of prior distribution, expected-value vector, and parameter $C$.
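
The 5-fold protocol mentioned above relies on splitting the samples into a training part and a held-out test fold. A minimal sketch of such a split, assuming contiguous folds without shuffling, is shown here; `train_test_fold` is a hypothetical helper, and `getTrainAndTestData` from the Utils file may partition the data differently.

```python
import numpy as np

def train_test_fold(samples, fold, n_folds=5):
    """Return (train, test), where test is the fold-th of n_folds contiguous blocks."""
    blocks = np.array_split(np.arange(len(samples)), n_folds)
    test_idx = blocks[fold]
    train_idx = np.concatenate([b for i, b in enumerate(blocks) if i != fold])
    return samples[train_idx], samples[test_idx]

# Toy data: 11 rows, so the five folds have sizes 3, 2, 2, 2, 2.
data = np.arange(22).reshape(11, 2)
train, test = train_test_fold(data, fold=0)
print(train.shape, test.shape)
```

Each row lands in exactly one test fold, so every max-entropy distribution is fitted on data disjoint from the fold used to evaluate the classifier.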

In [15]:
maxEnts = {'maxEnt_unif_wt_unif_mean' : {}, 
                  'maxEnt_re_wt_unif_mean' : {}, 
                  'maxEnt_unif_wt_alt_mean' : {}, 
                  'maxEnt_re_wt_alt_mean' : {}, 
                  'maxEnt_unif_wt_wt_mean' : {}, 
                  'maxEnt_re_wt_wt_mean' : {}}

for key in maxEnts.keys():
    for D in range(11):
        C = D/10.0
        maxEnts[key][C] = {}
        

delta = 0
labelIndex = len(simpleSamples[0]) - 1

for fold in tqdm(range(5)):
    
    trainData, testData = getTrainAndTestData(simpleSamples, fold)
    
    for D in tqdm(range(11)):
        C = D/10.0
        key = "{fold}_{C}".format(fold=fold, C=C)
        
        
        maxEnts['maxEnt_unif_wt_unif_mean'][C][fold] = FairMaximumEntropy(simpleDomain, trainData, 
                                          C, 
                                          delta, 
                                          sens_attr)
        
        
        maxEnts['maxEnt_re_wt_unif_mean'][C][fold] = FairMaximumEntropy(simpleDomain, 
                                          trainData, 
                                          C, 
                                          delta, 
                                          sens_attr,
                                          reweight=True, 
                                          reweightXindices=[sens_attr],
                                          reweightYindices=[labelIndex])


        maxEnts['maxEnt_unif_wt_alt_mean'][C][fold] = FairMaximumEntropy(simpleDomain, 
                                          trainData, 
                                          C, 
                                          delta, 
                                          sens_attr,
                                          alterMean = True)

        
        maxEnts['maxEnt_re_wt_alt_mean'][C][fold] = FairMaximumEntropy(simpleDomain, trainData, C, delta, sens_attr,
                                reweight=True, reweightXindices=[sens_attr],
                                reweightYindices=[labelIndex], alterMean = True)

        
        maxEnts['maxEnt_unif_wt_wt_mean'][C][fold] = FairMaximumEntropy(simpleDomain, trainData, C, delta, sens_attr,
                                weightedMean=True, reweightXindices=[sens_attr],
                                reweightYindices=[labelIndex])

            
        maxEnts['maxEnt_re_wt_wt_mean'][C][fold] = FairMaximumEntropy(simpleDomain, trainData, C, delta, sens_attr,
                                reweight=True, reweightXindices=[sens_attr],
                                reweightYindices=[labelIndex], weightedMean=True)
    
        


From each of these max-entropy distributions, we can sample 10000 elements to create a new dataset and compute its fairness and accuracy metrics.
We can also train a classifier on this new dataset and check the fairness and accuracy of the trained classifier on the test fold of the original dataset.

In [None]:
folds= 5
repetitions = 100  # must be an int for range(); the text above states 100 simulations per fold
samples = 10000

DIs, KLs, GRs, Accs, CDs = [], [], [], [], []
for key in tqdm(maxEnts.keys()):
    diForKey, klForKey, grForKey, accForKey, cdForKey = [], [], [], [], []
    for D in tqdm(range(11)):
        C = D/10.0
        for fold in range(folds):
            di, kl, gr, acc, cd = [] ,[], [], [], []
            for _ in range(repetitions):
                dataset = maxEnts[key][C][fold].sample(samples)
            
                di.append(getDisparateImpact(dataset, sens_attr))
                kl.append(getUtility(dataset, rawDataDist, domain))
                gr.append(getGenderRatio(dataset, sens_attr))
                
                _, testData = getTrainAndTestData(simpleSamples, fold)    
                a1, cd1 = getClfAccAndDI(dataset, testData, sens_attr, clf = DecisionTreeClassifier(random_state=0))
                acc.append(a1)
                cd.append(cd1)
            
            diForKey = diForKey + di
            klForKey = klForKey + kl
            grForKey = grForKey + gr
            accForKey = accForKey + acc
            cdForKey = cdForKey + cd
    DIs.append(diForKey)
    KLs.append(klForKey)
    GRs.append(grForKey)
    Accs.append(accForKey)
    CDs.append(cdForKey)
            
        

The plots below show the variation of the metrics with C and for different prior and expected value parameters.

In [None]:
fig = plt.figure(figsize=(12, 4))
cPlot(DIs, "Data Statistical Rate", title="Data Statistical Rate vs C")

In [None]:
fig = plt.figure(figsize=(12, 4))
cPlot(GRs, "Representation Rate", title="Representation Rate vs C")

In [None]:
fig = plt.figure(figsize=(12, 4))
cPlot(KLs, "KL-divergence wrt raw data", title="Divergence vs C")

In [None]:
fig = plt.figure(figsize=(12, 4))
cPlot(CDs, "Classifier Statistical Rate", title="Classifier fairness vs C")

In [None]:
fig = plt.figure(figsize=(12, 4))
cPlot(Accs, "Classifier Accuracy", title="Classifier Accuracy vs C")