## Missing values with Adult data
This notebook demonstrates the effect of MAR (missing at random) and MNAR (missing not at random) missing values on fairness using the Adult dataset. <br>
We first import the packages used in this notebook.

In [1]:
import sys
sys.path.append("models")
import numpy as np
from adult_model import get_distortion_adult, AdultDataset, reweight_df, get_evaluation
from aif360.algorithms.preprocessing.optim_preproc import OptimPreproc
from aif360.algorithms.preprocessing.optim_preproc_helpers.opt_tools import OptTools
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

The function below processes the data and creates missing values in the dataset. <br>
In the Adult dataset, sex is the sensitive attribute, and we use age (binned into decades) and education years as features to predict whether income is above or below \$50K per year. <br>
We create missing values in the feature "Education Years" under both MNAR and MAR mechanisms. In the function below, the mechanism is MNAR: the probability that a value is missing depends on the value of the feature itself (together with sex).

In [2]:
def load_preproc_data_adult(protected_attributes=None):
    def custom_preprocessing(df):
        """The custom pre-processing function is adapted from
            https://github.com/fair-preprocessing/nips2017/blob/master/Adult/code/Generate_Adult_Data.ipynb
        """
        np.random.seed(1)
        # Group age by decade
        df['Age (decade)'] = df['age'].apply(lambda x: x // 10 * 10)
        def group_edu(x):
            if x == -1:
                return 'missing_edu'
            elif x <= 5:
                return '<6'
            elif x >= 13:
                return '>12'
            else:
                return x

        def age_cut(x):
            if x >= 70:
                return '>=70'
            else:
                return x

        def group_race(x):
            if x == "White":
                return 1.0
            else:
                return 0.0

        # Cluster education and age attributes.
        # Limit education range
        df['Education Years'] = df['education-num'].apply(
            lambda x: group_edu(x))
        df['Education Years'] = df['Education Years'].astype('category')

        # Limit age range
        df['Age (decade)'] = df['Age (decade)'].apply(lambda x: age_cut(x))

        # Rename income variable
        df['Income Binary'] = df['income-per-year']

        # Recode sex and race
        df['sex'] = df['sex'].replace({'Female': 0.0, 'Male': 1.0})
        df['race'] = df['race'].apply(lambda x: group_race(x))

        # Here we define a column called mis_prob that holds, for each observation, the
        # probability that its Education Years value is missing
        df['mis_prob'] = 0
        for index, row in df.iterrows():
            # Here, the probability of missing values in Education Years depends on sex and 
            # Education Years, so in this case the missing values are under MNAR
            # To change the distribution of missing values, we can change the probability here
            if row['sex']==0 and row['Education Years'] =='>12':
                df.loc[index,'mis_prob'] = 0.65
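            # Note: group_edu leaves education years 6-12 as numbers, so the string '=8'
            # below appears never to match and this branch is effectively unused
            elif row['sex']==1 and row['Education Years'] =='=8':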
                df.loc[index,'mis_prob'] = 0.15
            else:
                df.loc[index,'mis_prob'] = 0.1
        new_label = []
        for index, row in df.iterrows():
            if np.random.binomial(1, float(row['mis_prob']), 1)[0] == 1:
                new_label.append('missing_edu')
            else:
                new_label.append(row['Education Years'])
        df['Education Years'] = new_label
        print('Number of missing values')
        print(len(df.loc[df['Education Years'] == 'missing_edu', :]))
        print('Total number of observations')
        print(len(df))
        return df

    XD_features = ['Age (decade)', 'Education Years', 'sex']
    D_features = [
        'sex'] if protected_attributes is None else protected_attributes
    Y_features = ['Income Binary']
    X_features = list(set(XD_features) - set(D_features))
    categorical_features = ['Age (decade)', 'Education Years']
    all_privileged_classes = {"sex": [1.0]}
    all_protected_attribute_maps = {"sex": {1.0: 'Male', 0.0: 'Female'}}

    return AdultDataset(
        label_name=Y_features[0],
        favorable_classes=['>50K', '>50K.'],
        protected_attribute_names=D_features,
        privileged_classes=[all_privileged_classes[x] for x in D_features],
        instance_weights_name=None,
        categorical_features=categorical_features,
        features_to_keep=X_features + Y_features + D_features,
        na_values=['?'],
        metadata={'label_maps': [{1.0: '>50K', 0.0: '<=50K'}],
                  'protected_attribute_maps': [all_protected_attribute_maps[x]
                                               for x in D_features]},
        custom_preprocessing=custom_preprocessing)
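
The row-by-row loop above can also be written in vectorized form. The following is a minimal sketch of the same MNAR idea on toy data, not the notebook's implementation: the function name `inject_mnar`, the toy DataFrame, and the use of `np.random.default_rng` are assumptions, and the second condition compares against the number 8, which is a guess at the intent of the `'=8'` check. It will not reproduce the exact random draws of the loop.

```python
import numpy as np
import pandas as pd

def inject_mnar(df, rng, col="Education Years", missing_label="missing_edu"):
    """Return a copy of df with values in `col` masked under an MNAR mechanism.

    Hypothetical helper: the probabilities mirror the cell above, but the
    missingness is drawn in one vectorized step instead of a Python loop.
    """
    edu = df[col].astype(object)
    # The probability of being missing depends on sex AND on the value of
    # Education Years itself, which is what makes the mechanism MNAR.
    mis_prob = np.select(
        [(df["sex"] == 0) & (edu == ">12"),
         (df["sex"] == 1) & (edu == 8)],   # assumption: '=8' was meant as 8
        [0.65, 0.15],
        default=0.1,
    )
    mask = rng.random(len(df)) < mis_prob  # True where the value is dropped
    out = df.copy()
    out["mis_prob"] = mis_prob
    out[col] = edu.where(~mask, missing_label)
    return out

# Toy data (illustrative only, not the Adult dataset)
rng = np.random.default_rng(1)
n = 30_000
levels = ["<6", 8, ">12"]
toy = pd.DataFrame({
    "sex": rng.integers(0, 2, size=n).astype(float),
    "Education Years": [levels[i] for i in rng.integers(0, 3, size=n)],
})

masked = inject_mnar(toy, rng)
# Empirical missing rate within each probability level
rates = (masked.assign(is_missing=masked["Education Years"] == "missing_edu")
               .groupby("mis_prob")["is_missing"].mean())
print(rates)
```

The empirical missing rate in each group should be close to the assigned probability, which is a quick sanity check on any missingness mechanism.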


The code below loads the data and runs the fairness-fixing algorithm proposed by Calmon et al. \[1\]. In each feature that contains missing values, we treat "missing" as a new category. <br>
Note that we modified the distortion function `get_distortion_adult`. This function defines the penalty the fairness-fixing algorithm pays for changing the value of each feature. We set the penalty to 0 when a value changes from the missing category to a non-missing category, and we set a large penalty when a value changes from a non-missing category to the missing category or remains in the missing category. <br> 
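
The penalty rule described above can be sketched as a small function. This is only an illustration of the rule: the real logic lives in `get_distortion_adult`, whose source is not shown here, and the helper name `edu_missing_penalty` and the value of `big_penalty` are assumptions.

```python
def edu_missing_penalty(old_edu, new_edu,
                        missing_label="missing_edu", big_penalty=5.0):
    """Hypothetical sketch of the missing-category penalty rule.

    Returns a penalty when the rule applies, or None when both values are
    observed and the ordinary distortion rules should decide instead.
    """
    if new_edu == missing_label:
        # Either observed -> missing, or missing -> missing: discourage both.
        return big_penalty
    if old_edu == missing_label:
        # Missing -> observed: free, so the algorithm can fill in a value.
        return 0.0
    return None

print(edu_missing_penalty("missing_edu", ">12"))          # missing -> observed
print(edu_missing_penalty(">12", "missing_edu"))          # observed -> missing
print(edu_missing_penalty("missing_edu", "missing_edu"))  # stays missing
```

Under this rule the optimizer has an incentive to move observations out of the missing category and no incentive to move them into it.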

In [3]:
privileged_groups = [{'sex': 1}]
unprivileged_groups = [{'sex': 0}]
dataset_orig = load_preproc_data_adult(['sex'])

optim_options = {
    "distortion_fun": get_distortion_adult,
    "epsilon": 0.02,
    "clist": [0.99, 1.99, 2.99],
    "dlist": [.1, 0.05, 0]
}

dataset_orig_train, dataset_orig_vt = dataset_orig.split(
    [0.7], shuffle=True)

OP = OptimPreproc(OptTools, optim_options,
                  unprivileged_groups=unprivileged_groups,
                  privileged_groups=privileged_groups)

OP = OP.fit(dataset_orig_train)

dataset_transf_cat_test = OP.transform(dataset_orig_vt, transform_Y=True)
dataset_transf_cat_test = dataset_orig_vt.align_datasets(
    dataset_transf_cat_test)

dataset_transf_cat_train = OP.transform(
    dataset_orig_train, transform_Y=True)
dataset_transf_cat_train = dataset_orig_train.align_datasets(
    dataset_transf_cat_train)

Number of missing values
6817
Total number of observations
48842
Optimized Preprocessing: Objective converged to 0.124961


In this part we use the training data obtained from the fairness fixing algorithm by Calmon et al. \[1\] to train a logistic regression classifier and validate the classifier on the test set.

In [4]:
scale_transf = StandardScaler()
X_train = scale_transf.fit_transform(dataset_transf_cat_train.features)
y_train = dataset_transf_cat_train.labels.ravel()
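# Note: fit_transform refits the scaler on the test features; calling
# scale_transf.transform(...) instead would reuse the training statistics.
# The outputs shown below were produced with fit_transform as written.
X_test = scale_transf.fit_transform(dataset_transf_cat_test.features)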

lmod = LogisticRegression()
lmod.fit(X_train, y_train)
y_pred = lmod.predict(X_test)
print('Without reweight')
get_evaluation(dataset_orig_vt,y_pred,privileged_groups,unprivileged_groups,0,1,1)

Without reweight
Accuracy
0.7601856275165495
p-rule
0.6191770420783865
FPR for unpriv group
0.14370982552800737
FNR for unpriv group
0.6374745417515275
FPR for priv group
0.16569851873366248
FNR for priv group
0.4910958904109589
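
To make the printed metrics concrete, here is a minimal sketch on toy labels. It uses a common definition of the p-rule (the ratio of positive-prediction rates between the two groups, taking the smaller of the two ratios); whether `get_evaluation` computes it in exactly this way is an assumption, since its source is not shown here. The helper names are hypothetical.

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Positive-prediction rate, FPR and FNR for each group value."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    out = {}
    for g in np.unique(group):
        in_group = group == g
        negatives = in_group & (y_true == 0)
        positives = in_group & (y_true == 1)
        out[g.item()] = {
            "positive_rate": y_pred[in_group].mean(),
            "fpr": y_pred[negatives].mean(),        # predicted 1 among true 0
            "fnr": 1.0 - y_pred[positives].mean(),  # predicted 0 among true 1
        }
    return out

def p_rule(rates, unpriv=0, priv=1):
    """Smaller of the two ratios of positive-prediction rates.

    Assumes both groups receive at least one positive prediction.
    """
    a = rates[unpriv]["positive_rate"]
    b = rates[priv]["positive_rate"]
    return min(a / b, b / a)

# Toy example: four unprivileged (0) and four privileged (1) individuals
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]

rates = group_rates(y_true, y_pred, group)
print(rates)
print("p-rule:", p_rule(rates))
```

A p-rule of 1 means both groups receive positive predictions at the same rate; values closer to 0 indicate a larger disparity.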


After getting the accuracy and fairness results, we apply our reweighting algorithm to train a new logistic regression classifier and validate the classifier on the same test set.

In [5]:
dataset_orig_train.instance_weights = reweight_df(dataset_orig_train)
scale_transf = StandardScaler()
X_train = scale_transf.fit_transform(dataset_transf_cat_train.features)
y_train = dataset_transf_cat_train.labels.ravel()

X_test = scale_transf.fit_transform(dataset_transf_cat_test.features)

lmod = LogisticRegression()
lmod.fit(X_train, y_train, sample_weight=dataset_orig_train.instance_weights)
y_pred = lmod.predict(X_test)
print('With reweight')
get_evaluation(dataset_orig_vt,y_pred,privileged_groups,unprivileged_groups,0,1,1)

With reweight
Accuracy
0.7506995154575855
p-rule
0.7662700956069585
FPR for unpriv group
0.18158861340679522
FNR for unpriv group
0.584521384928717
FPR for priv group
0.16569851873366248
FNR for priv group
0.4910958904109589


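Comparing the two results, reweighting raises the p-rule from about 0.62 to 0.77 and lowers the FNR of the unprivileged group from about 0.64 to 0.58. The cost is a higher FPR for the unprivileged group (about 0.14 to 0.18) and a drop in accuracy of roughly 1 percentage point (0.760 to 0.751). <br>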
The function below processes the data and creates missing values under the MAR mechanism: the probability that a value is missing depends only on observed variables (here sex and income), not on the value of Education Years itself. <br>

In [6]:
def load_preproc_data_adult(protected_attributes=None):
    def custom_preprocessing(df):
        """The custom pre-processing function is adapted from
            https://github.com/fair-preprocessing/nips2017/blob/master/Adult/code/Generate_Adult_Data.ipynb
        """
        np.random.seed(1)
        # Group age by decade
        df['Age (decade)'] = df['age'].apply(lambda x: x // 10 * 10)
        def group_edu(x):
            if x == -1:
                return 'missing_edu'
            elif x <= 5:
                return '<6'
            elif x >= 13:
                return '>12'
            else:
                return x

        def age_cut(x):
            if x >= 70:
                return '>=70'
            else:
                return x

        def group_race(x):
            if x == "White":
                return 1.0
            else:
                return 0.0

        # Cluster education and age attributes.
        # Limit education range
        df['Education Years'] = df['education-num'].apply(
            lambda x: group_edu(x))
        df['Education Years'] = df['Education Years'].astype('category')

        # Limit age range
        df['Age (decade)'] = df['Age (decade)'].apply(lambda x: age_cut(x))

        # Rename income variable
        df['Income Binary'] = df['income-per-year']

        # Recode sex and race
        df['sex'] = df['sex'].replace({'Female': 0.0, 'Male': 1.0})
        df['race'] = df['race'].apply(lambda x: group_race(x))
        
        # Here we define a column called mis_prob that holds, for each observation, the
        # probability that its Education Years value is missing
        df['mis_prob'] = 0
        for index, row in df.iterrows():
            # Here, the probability of missing values in Education Years depends on sex and 
            # Income Binary, so in this case the missing values are under MAR because the missingness 
            # does not depend on the feature Education Years
            # To change the distribution of missing values, we can change the probability here
            if row['sex']==0 and row['Income Binary'] =='>50K':
                df.loc[index,'mis_prob'] = 0.4
            elif row['sex']==0:
                df.loc[index,'mis_prob'] = 0.1
            else:
                df.loc[index,'mis_prob'] = 0.05
        new_label = []
        for index, row in df.iterrows():
            if np.random.binomial(1, float(row['mis_prob']), 1)[0] == 1:
                new_label.append('missing_edu')
            else:
                new_label.append(row['Education Years'])
                
        df['Education Years'] = new_label
        print('Total number of missing values')
        print(len(df.loc[df['Education Years'] == 'missing_edu', :].index))
        print('Total number of observations')
        print(len(df.index))
        return df
    XD_features = ['Age (decade)', 'Education Years', 'sex']
    D_features = [
        'sex'] if protected_attributes is None else protected_attributes
    Y_features = ['Income Binary']
    X_features = list(set(XD_features) - set(D_features))
    categorical_features = ['Age (decade)', 'Education Years']

    # privileged classes
    all_privileged_classes = {"sex": [1.0]}

    # protected attribute maps
    all_protected_attribute_maps = {"sex": {1.0: 'Male', 0.0: 'Female'}}

    return AdultDataset(
        label_name=Y_features[0],
        favorable_classes=['>50K', '>50K.'],
        protected_attribute_names=D_features,
        privileged_classes=[all_privileged_classes[x] for x in D_features],
        instance_weights_name=None,
        categorical_features=categorical_features,
        features_to_keep=X_features + Y_features + D_features,
        na_values=['?'],
        metadata={'label_maps': [{1.0: '>50K', 0.0: '<=50K'}],
                  'protected_attribute_maps': [all_protected_attribute_maps[x]
                                               for x in D_features]},
        custom_preprocessing=custom_preprocessing)
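
One way to see what MAR means in practice is to simulate the mechanism and check that, once sex and income are fixed, the missing rate does not vary with the true education value. The sketch below does this on toy data; the column names, the two-level education variable, and the sample size are illustrative assumptions, while the probabilities mirror the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 80_000

# Toy data (illustrative only, not the Adult dataset)
toy = pd.DataFrame({
    "sex": rng.integers(0, 2, size=n),
    "income": np.where(rng.random(n) < 0.25, ">50K", "<=50K"),
    "edu": np.where(rng.random(n) < 0.5, ">12", "<=12"),
})

# MAR: the probability depends on sex and income only, never on edu.
# np.select picks the first matching condition, like the if/elif chain.
mis_prob = np.select(
    [(toy["sex"] == 0) & (toy["income"] == ">50K"), toy["sex"] == 0],
    [0.4, 0.1],
    default=0.05,
)
toy["is_missing"] = rng.random(n) < mis_prob

# Missing rate by the observed variables that drive the mechanism
by_observed = toy.groupby(["sex", "income"])["is_missing"].mean()

# Within each (sex, income) cell, compare the rate across true edu values
by_all = (toy.groupby(["sex", "income", "edu"])["is_missing"]
             .mean().unstack("edu"))
max_gap = (by_all[">12"] - by_all["<=12"]).abs().max()

print(by_observed)
print("largest within-cell gap across edu levels:", max_gap)
```

Under MAR the within-cell gap should be small (sampling noise only). Under the MNAR mechanism used earlier, the same check would show a clear dependence on the education value.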

As above, we load the data and run the fairness-fixing algorithm proposed by Calmon et al. \[1\]. Note that epsilon is set to 0.03 here, compared with 0.02 in the MNAR case.

In [7]:
privileged_groups = [{'sex': 1}]
unprivileged_groups = [{'sex': 0}]
dataset_orig = load_preproc_data_adult(['sex'])

optim_options = {
    "distortion_fun": get_distortion_adult,
    "epsilon": 0.03,
    "clist": [0.99, 1.99, 2.99],
    "dlist": [.1, 0.05, 0]
}

dataset_orig_train, dataset_orig_vt = dataset_orig.split(
    [0.7], shuffle=True)

OP = OptimPreproc(OptTools, optim_options,
                  unprivileged_groups=unprivileged_groups,
                  privileged_groups=privileged_groups)

OP = OP.fit(dataset_orig_train)

dataset_transf_cat_test = OP.transform(dataset_orig_vt, transform_Y=True)
dataset_transf_cat_test = dataset_orig_vt.align_datasets(
    dataset_transf_cat_test)

dataset_transf_cat_train = OP.transform(
    dataset_orig_train, transform_Y=True)
dataset_transf_cat_train = dataset_orig_train.align_datasets(
    dataset_transf_cat_train)

Total number of missing values
3580
Total number of observations
48842
Optimized Preprocessing: Objective converged to 0.065679


As in the MNAR case, we first train a logistic regression classifier without reweighting, then train another with reweighting, and validate both on the same test set.

In [8]:
scale_transf = StandardScaler()
X_train = scale_transf.fit_transform(dataset_transf_cat_train.features)
y_train = dataset_transf_cat_train.labels.ravel()
X_test = scale_transf.fit_transform(dataset_transf_cat_test.features)

lmod = LogisticRegression()
lmod.fit(X_train, y_train)
y_pred = lmod.predict(X_test)
print('Without reweight')
get_evaluation(dataset_orig_vt,y_pred,privileged_groups,unprivileged_groups,0,1,1)

Without reweight
Accuracy
0.7677608680816215
p-rule
0.6894999359511296
FPR for unpriv group
0.1480716253443526
FNR for unpriv group
0.5193482688391038
FPR for priv group
0.15742085390647687
FNR for priv group
0.4859589041095891


In [9]:
dataset_orig_train.instance_weights = reweight_df(dataset_orig_train)
scale_transf = StandardScaler()
X_train = scale_transf.fit_transform(dataset_transf_cat_train.features)
y_train = dataset_transf_cat_train.labels.ravel()
X_test = scale_transf.fit_transform(dataset_transf_cat_test.features)
lmod = LogisticRegression()
lmod.fit(X_train, y_train, sample_weight=dataset_orig_train.instance_weights)
y_pred = lmod.predict(X_test)
print('With reweight')
get_evaluation(dataset_orig_vt,y_pred,privileged_groups,unprivileged_groups,0,1,1)

With reweight
Accuracy
0.762778953115403
p-rule
0.7773321284771667
FPR for unpriv group
0.16896235078053257
FNR for unpriv group
0.484725050916497
FPR for priv group
0.1565495207667732
FNR for priv group
0.4876712328767123


As in the MNAR case, our reweighting algorithm improves the p-rule (from about 0.69 to 0.78) with a small tradeoff in accuracy (about 0.5 percentage points, from 0.768 to 0.763). <br>
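
The size of the tradeoff in both experiments can be read directly from the printed outputs. The short sketch below recomputes the changes in accuracy and p-rule from the values reported in this notebook; the dictionary layout is just a convenience for the example.

```python
# Accuracy and p-rule copied from the printed outputs above
results = {
    "MNAR": {
        "without": {"accuracy": 0.7601856275165495, "p_rule": 0.6191770420783865},
        "with":    {"accuracy": 0.7506995154575855, "p_rule": 0.7662700956069585},
    },
    "MAR": {
        "without": {"accuracy": 0.7677608680816215, "p_rule": 0.6894999359511296},
        "with":    {"accuracy": 0.762778953115403,  "p_rule": 0.7773321284771667},
    },
}

# Change caused by reweighting (with minus without) for each metric
deltas = {
    case: {metric: r["with"][metric] - r["without"][metric]
           for metric in ("accuracy", "p_rule")}
    for case, r in results.items()
}

for case, d in deltas.items():
    print(f"{case}: accuracy {d['accuracy']:+.4f}, p-rule {d['p_rule']:+.4f}")
```

In both cases the gain in p-rule is roughly an order of magnitude larger than the loss in accuracy.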
# Reference
[1] Optimized Pre-Processing for Discrimination Prevention <br>
Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R. Varshney. <br>
31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, December 2017.