## Missing values with COMPAS data
This notebook demonstrates the effect of MAR and MNAR missing values on fairness using the COMPAS data. <br>
We first import the packages needed in this notebook.

In [1]:
import sys
sys.path.append("models")
sys.path.append("AIF360/")
import numpy as np
from compas_model import get_distortion_compas, CompasDataset, reweight_df, get_evaluation
from aif360.algorithms.preprocessing.optim_preproc import OptimPreproc
from aif360.algorithms.preprocessing.optim_preproc_helpers.opt_tools import OptTools
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

The function below processes the data and creates missing values in the dataset. <br>
In the COMPAS dataset, race is the sensitive attribute. We use age (binned into 3 bins), priors_count (number of prior criminal cases, binned into 3 bins), c_charge_degree (degree charged by the prosecutor) and score_text (a risk score assigned to the defendant) as features to predict two_year_recid (whether the person will be re-arrested for a violent offense within two years). <br>
We create missing values in the feature priors_count under both MNAR and MAR mechanisms. In the function below, the mechanism is MNAR: the probability that a value is missing depends on the value of priors_count itself.

In [2]:
def load_preproc_data_compas(protected_attributes=None):
    def custom_preprocessing(df):
        """The custom pre-processing function is adapted from
            https://github.com/fair-preprocessing/nips2017/blob/master/compas/code/Generate_Compas_Data.ipynb
        """

        df = df[['age',
                 'c_charge_degree',
                 'race',
                 'age_cat',
                 'score_text',
                 'sex',
                 'priors_count',
                 'days_b_screening_arrest',
                 'decile_score',
                 'is_recid',
                 'two_year_recid',
                 'length_of_stay']]

        # Indices of data samples to keep
        ix = df['days_b_screening_arrest'] <= 30
        ix = (df['days_b_screening_arrest'] >= -30) & ix
        ix = (df['is_recid'] != -1) & ix
        ix = (df['c_charge_degree'] != "O") & ix
        ix = (df['score_text'] != 'N/A') & ix
        df = df.loc[ix, :]
        # Restrict races to African-American and Caucasian
        dfcut = df.loc[~df['race'].isin(
            ['Native American', 'Hispanic', 'Asian', 'Other']), :]

        # Restrict the features to use
        dfcutQ = dfcut[['sex',
                        'race',
                        'age_cat',
                        'c_charge_degree',
                        'score_text',
                        'priors_count',
                        'is_recid',
                        'two_year_recid',
                        'length_of_stay']].copy()

        # Quantize priors count between 0, 1-3, and >3
        def quantizePrior(x):
            if x == 0:
                return '0'
            elif x == 1:
                return '1 to 3'
            elif x == 2:
                return 'More than 3'
            elif x == 'missing':
                return 'missing'
        # Quantize length of stay

        def quantizeLOS(x):
            if x == 0:
                return '<week'
            if x == 1:
                return '<3months'
            else:
                return '>3 months'

        # Recode age category
        def adjustAge(x):
            if x == 0:
                return '25 to 45'
            elif x == 1:
                return 'Greater than 45'
            elif x == 2:
                return 'Less than 25'

        def quantizeScore(x):
            if x == 1:
                return 'MediumHigh'
            else:
                return 'Low'

        def group_race(x):
            if x == "Caucasian":
                return 1.0
            else:
                return 0.0
        
        np.random.seed(10)
        df1 = dfcutQ[['priors_count', 'c_charge_degree', 'race',
                  'age_cat', 'score_text', 'two_year_recid']]
        # mis_prob holds the probability that priors_count is masked for each observation
        # (initialized as float so that fractional probabilities can be assigned below)
        dfcutQ['mis_prob'] = 0.0
        # Here the probability of a missing priors_count depends on race, two_year_recid
        # and priors_count itself, so the mechanism is MNAR.
        # To change the distribution of missing values, change the probabilities below.
        for index, row in dfcutQ.iterrows():
            if row['race'] == 'African-American' and row['two_year_recid']==0 and row['priors_count']==0:
                dfcutQ.loc[index, 'mis_prob'] = 0.5
            elif row['race'] != 'African-American' and row['two_year_recid']==1 and row['priors_count']==2:
                dfcutQ.loc[index, 'mis_prob'] = 0.3
            elif row['race'] == 'African-American':
                dfcutQ.loc[index, 'mis_prob'] = 0.2
            else:
                dfcutQ.loc[index, 'mis_prob'] = 0.05
        new_label = []
        for index, row in dfcutQ.iterrows():
            if np.random.binomial(1, float(row['mis_prob']), 1)[0] == 1:
                new_label.append('missing')
            else:
                new_label.append(row['priors_count'])
        dfcutQ['priors_count'] = new_label
        print('Total number of missing values')
        print(len(dfcutQ.loc[dfcutQ['priors_count'] == 'missing', :].index))
        print('Total number of observations')
        print(len(dfcutQ.index))

        dfcutQ['priors_count'] = dfcutQ['priors_count'].apply(
            lambda x: quantizePrior(x))
        dfcutQ['length_of_stay'] = dfcutQ['length_of_stay'].apply(
            lambda x: quantizeLOS(x))
        dfcutQ['score_text'] = dfcutQ['score_text'].apply(
            lambda x: quantizeScore(x))
        dfcutQ['age_cat'] = dfcutQ['age_cat'].apply(lambda x: adjustAge(x))
        # Recode sex and race
        dfcutQ['sex'] = dfcutQ['sex'].replace({'Female': 1.0, 'Male': 0.0})
        dfcutQ['race'] = dfcutQ['race'].apply(lambda x: group_race(x))

        features = ['two_year_recid', 'race',
                    'age_cat', 'priors_count', 'c_charge_degree', 'score_text']

        # Pass value to df
        df = dfcutQ[features]
        return df

    XD_features = [
        'age_cat',
        'c_charge_degree',
        'priors_count',
        'race',
        'score_text']
    D_features = [
        'race'] if protected_attributes is None else protected_attributes
    Y_features = ['two_year_recid']
    X_features = list(set(XD_features) - set(D_features))
    categorical_features = [
        'age_cat',
        'priors_count',
        'c_charge_degree',
        'score_text']

    # privileged classes
    all_privileged_classes = {"sex": [1.0],
                              "race": [1.0]}

    # protected attribute maps
    all_protected_attribute_maps = {
        "sex": {
            0.0: 'Male', 1.0: 'Female'}, "race": {
            1.0: 'Caucasian', 0.0: 'Not Caucasian'}}

    return CompasDataset(
        label_name=Y_features[0],
        favorable_classes=[0],
        protected_attribute_names=D_features,
        privileged_classes=[all_privileged_classes[x] for x in D_features],
        instance_weights_name=None,
        categorical_features=categorical_features,
        features_to_keep=X_features + Y_features + D_features,
        na_values=[],
        metadata={'label_maps': [{1.0: 'Did recid.', 0.0: 'No recid.'}],
                  'protected_attribute_maps': [all_protected_attribute_maps[x]
                                               for x in D_features]},
        custom_preprocessing=custom_preprocessing)
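
The row-by-row loop above can also be written in vectorized form. The following is a minimal sketch, not the notebook's code: the helper name `inject_mnar_missing` and the toy frame are made up for illustration, and it draws from `np.random.default_rng` rather than `np.random.seed` with `np.random.binomial`, so it will not reproduce the exact missing-value counts reported in this notebook.

```python
import numpy as np
import pandas as pd


def inject_mnar_missing(df, rng):
    """Mask priors_count with probabilities that depend on priors_count itself (MNAR).

    Hypothetical helper: mirrors the if/elif chain of the cell above.
    """
    is_aa = df["race"] == "African-American"
    conditions = [
        is_aa & (df["two_year_recid"] == 0) & (df["priors_count"] == 0),
        ~is_aa & (df["two_year_recid"] == 1) & (df["priors_count"] == 2),
        is_aa,
    ]
    # np.select uses the first condition that matches, like an if/elif chain.
    mis_prob = np.select(conditions, [0.5, 0.3, 0.2], default=0.05)

    # One Bernoulli draw per row: True means the value is masked.
    mask = rng.random(len(df)) < mis_prob

    out = df.copy()
    out["priors_count"] = out["priors_count"].astype(object)
    out.loc[mask, "priors_count"] = "missing"
    return out, mis_prob


# Toy data for illustration only (not the COMPAS data).
toy = pd.DataFrame({
    "race": ["African-American", "Caucasian", "African-American", "Caucasian"],
    "two_year_recid": [0, 1, 1, 0],
    "priors_count": [0, 2, 1, 0],
})
masked, probs = inject_mnar_missing(toy, np.random.default_rng(10))
print(probs)
print(masked)
```

Dropping the `priors_count` terms from the first two conditions would turn this into a MAR mechanism, since the probabilities would then depend only on race and two_year_recid.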

The code below loads the data and runs the fairness-fixing algorithm proposed by Calmon et al. \[1\]. Missing values are treated as a new category in each feature that contains them. <br>
Note that we modified the distortion function ```get_distortion_compas```, which defines the penalty the fairness-fixing algorithm pays for changing the value of each feature. The penalty is 0 when an observation moves from the missing category to a non-missing category, and large when it moves from a non-missing category to the missing category or stays in the missing category. <br> 
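
The missing-category rule described above can be sketched as follows. This is only an illustration of the rule, not the body of ```get_distortion_compas``` (which is not shown in this notebook): the function name, the `BIG_PENALTY` value and the `base_cost` fallback are all assumptions.

```python
BIG_PENALTY = 1e4  # assumed magnitude; the text only says the penalty is "big"


def priors_count_penalty(old_value, new_value, base_cost=lambda old, new: 0.0):
    """Hypothetical penalty for changing priors_count from old_value to new_value.

    base_cost stands in for whatever cost the real distortion function assigns
    to moves between two non-missing categories.
    """
    if old_value == "missing" and new_value != "missing":
        # Filling in a missing entry is free.
        return 0.0
    if new_value == "missing":
        # Creating a missing entry, or leaving one in place, is heavily penalized.
        return BIG_PENALTY
    # Non-missing to non-missing: defer to the ordinary cost.
    return base_cost(old_value, new_value)


print(priors_count_penalty("missing", "1 to 3"))
print(priors_count_penalty("0", "missing"))
print(priors_count_penalty("missing", "missing"))
```

Under such a rule the optimizer is pushed to move every observation out of the missing category, which is the intended effect of the modification.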

In [3]:
privileged_groups = [{'race': 1}]
unprivileged_groups = [{'race': 0}]
dataset_orig = load_preproc_data_compas(['race'])

optim_options = {
    "distortion_fun": get_distortion_compas,
    "epsilon": 0.05,
    "clist": [0.99, 1.99, 2.99],
    "dlist": [.1, 0.05, 0]
}

dataset_orig_train, dataset_orig_vt = dataset_orig.split(
    [0.7], shuffle=True)

OP = OptimPreproc(OptTools, optim_options,
                  unprivileged_groups=unprivileged_groups,
                  privileged_groups=privileged_groups)

OP = OP.fit(dataset_orig_train)

dataset_transf_cat_test = OP.transform(dataset_orig_vt, transform_Y=True)
dataset_transf_cat_test = dataset_orig_vt.align_datasets(
    dataset_transf_cat_test)

dataset_transf_cat_train = OP.transform(
    dataset_orig_train, transform_Y=True)
dataset_transf_cat_train = dataset_orig_train.align_datasets(
    dataset_transf_cat_train)

Total number of missing values
906
Total number of observations
5278
Optimized Preprocessing: Objective converged to 0.157471


In this part we use the training data obtained from the fairness fixing algorithm by Calmon et al. \[1\] to train a logistic regression classifier and validate the classifier on the test set.

In [4]:
scale_transf = StandardScaler()
X_train = scale_transf.fit_transform(dataset_transf_cat_train.features)
y_train = dataset_transf_cat_train.labels.ravel()
X_test = scale_transf.fit_transform(dataset_transf_cat_test.features)
lmod = LogisticRegression()
lmod.fit(X_train, y_train)
y_pred = lmod.predict(X_test)
print('Without reweight')
get_evaluation(dataset_orig_vt,y_pred,privileged_groups,unprivileged_groups,0,1,1)

Without reweight
Accuracy
0.6502525252525253
p-rule
0.7667486382146068
FPR for unpriv group
0.37632135306553915
FNR for unpriv group
0.34439834024896265
FPR for priv group
0.25133689839572193
FNR for priv group
0.4549019607843138
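
For readers unfamiliar with the metrics printed above, here is a minimal sketch of how a p-rule and group-wise error rates are commonly computed. The implementation of `get_evaluation` is not shown in this notebook, so the function names, the choice of favorable and positive labels, and the toy arrays below are assumptions rather than the notebook's actual definitions.

```python
import numpy as np


def p_rule(y_pred, group, favorable=0):
    """Ratio of favorable-prediction rates between the two groups (at most 1).

    Assumes both groups receive at least one favorable prediction.
    """
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_unpriv = np.mean(y_pred[group == 0] == favorable)
    rate_priv = np.mean(y_pred[group == 1] == favorable)
    return min(rate_unpriv / rate_priv, rate_priv / rate_unpriv)


def fpr_fnr(y_true, y_pred, positive=1):
    """False positive rate and false negative rate for one group."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    negatives = y_true != positive
    positives = y_true == positive
    fpr = np.mean(y_pred[negatives] == positive)
    fnr = np.mean(y_pred[positives] != positive)
    return fpr, fnr


# Toy labels and predictions for illustration only.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = privileged, 0 = unprivileged

print(p_rule(y_pred, group))
print(fpr_fnr(y_true, y_pred))
```

A p-rule closer to 1 means the two groups receive the favorable prediction at more similar rates.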


After getting the accuracy and fairness results, we apply our reweighting algorithm to train a new logistic regression classifier and validate the classifier on the same test set.

In [5]:
dataset_orig_train.instance_weights = reweight_df(dataset_orig_train)
scale_transf = StandardScaler()
X_train = scale_transf.fit_transform(dataset_transf_cat_train.features)
y_train = dataset_transf_cat_train.labels.ravel()
X_test = scale_transf.fit_transform(dataset_transf_cat_test.features)
lmod = LogisticRegression()
lmod.fit(
    X_train,
    y_train,
    sample_weight=dataset_orig_train.instance_weights)
y_pred = lmod.predict(X_test)
print('With reweight')
get_evaluation(dataset_orig_vt,y_pred,privileged_groups,unprivileged_groups,0,1,1)

With reweight
Accuracy
0.648989898989899
p-rule
0.8382674916706331
FPR for unpriv group
0.34460887949260044
FNR for unpriv group
0.37344398340248963
FPR for priv group
0.27005347593582885
FNR for priv group
0.4392156862745098
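
The pattern used in the two cells above, passing per-observation weights to `LogisticRegression.fit` through `sample_weight`, can be sketched on synthetic data. The data and weights below are made up, and `reweight_df` is not reproduced. Note also that the cells in this notebook call `fit_transform` on the test features, whereas this sketch follows the more common convention of fitting the scaler on the training features only and reusing it on the test features; switching conventions may change the numbers reported here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data for illustration only.
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_test = rng.normal(size=(50, 3))

# Made-up weights standing in for the output of a reweighting step.
weights = rng.uniform(0.5, 2.0, size=200)

# Fit the scaler on the training features, then reuse it for the test features.
scaler = StandardScaler().fit(X_train)

clf = LogisticRegression()
clf.fit(scaler.transform(X_train), y_train, sample_weight=weights)
y_pred = clf.predict(scaler.transform(X_test))
print(y_pred.shape)
```

Observations with larger weights contribute more to the loss, so the fitted decision boundary shifts toward classifying them correctly.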


Comparing the two results, reweighting raises the p-rule from about 0.77 to 0.84 at the cost of a very small decrease in accuracy (about 0.13 percentage points). <br>
The remaining cells repeat the experiment with MAR missing values. <br>
The function below processes the data and creates missing values in the dataset. Here the mechanism is MAR: the probability that a value of priors_count is missing depends on race and two_year_recid but not on priors_count itself.<br>

In [6]:
def load_preproc_data_compas(protected_attributes=None):
    def custom_preprocessing(df):
        """The custom pre-processing function is adapted from
            https://github.com/fair-preprocessing/nips2017/blob/master/compas/code/Generate_Compas_Data.ipynb
        """

        df = df[['age',
                 'c_charge_degree',
                 'race',
                 'age_cat',
                 'score_text',
                 'sex',
                 'priors_count',
                 'days_b_screening_arrest',
                 'decile_score',
                 'is_recid',
                 'two_year_recid',
                 'length_of_stay']]

        # Indices of data samples to keep
        ix = df['days_b_screening_arrest'] <= 30
        ix = (df['days_b_screening_arrest'] >= -30) & ix
        ix = (df['is_recid'] != -1) & ix
        ix = (df['c_charge_degree'] != "O") & ix
        ix = (df['score_text'] != 'N/A') & ix
        df = df.loc[ix, :]
        # Restrict races to African-American and Caucasian
        dfcut = df.loc[~df['race'].isin(
            ['Native American', 'Hispanic', 'Asian', 'Other']), :]

        # Restrict the features to use
        dfcutQ = dfcut[['sex',
                        'race',
                        'age_cat',
                        'c_charge_degree',
                        'score_text',
                        'priors_count',
                        'is_recid',
                        'two_year_recid',
                        'length_of_stay']].copy()

        # Quantize priors count between 0, 1-3, and >3
        def quantizePrior(x):
            if x == 0:
                return '0'
            elif x == 1:
                return '1 to 3'
            elif x == 2:
                return 'More than 3'
            elif x == 'missing':
                return 'missing'
        # Quantize length of stay

        def quantizeLOS(x):
            if x == 0:
                return '<week'
            if x == 1:
                return '<3months'
            else:
                return '>3 months'

        # Recode age category
        def adjustAge(x):
            if x == 0:
                return '25 to 45'
            elif x == 1:
                return 'Greater than 45'
            elif x == 2:
                return 'Less than 25'

        def quantizeScore(x):
            if x == 1:
                return 'MediumHigh'
            else:
                return 'Low'

        def group_race(x):
            if x == "Caucasian":
                return 1.0
            else:
                return 0.0
        
        np.random.seed(10)
        df1 = dfcutQ[['priors_count', 'c_charge_degree', 'race',
                  'age_cat', 'score_text', 'two_year_recid']]
        # mis_prob holds the probability that priors_count is masked for each observation
        # (initialized as float so that fractional probabilities can be assigned below)
        dfcutQ['mis_prob'] = 0.0
        # Here the probability of a missing priors_count depends only on race and
        # two_year_recid, so the mechanism is MAR: the missingness does not depend
        # on the feature priors_count itself.
        # To change the distribution of missing values, change the probabilities below.
        for index, row in dfcutQ.iterrows():
            if row['race'] == 'African-American' and row['two_year_recid']==0:
                dfcutQ.loc[index, 'mis_prob'] = 0.2
            elif row['race'] == 'African-American':
                dfcutQ.loc[index, 'mis_prob'] = 0.15
            else:
                dfcutQ.loc[index, 'mis_prob'] = 0.05
        new_label = []
        for index, row in dfcutQ.iterrows():
            if np.random.binomial(1, float(row['mis_prob']), 1)[0] == 1:
                new_label.append('missing')
            else:
                new_label.append(row['priors_count'])
        dfcutQ['priors_count'] = new_label
        print('Total number of missing values')
        print(len(dfcutQ.loc[dfcutQ['priors_count'] == 'missing', :].index))
        print('Total number of observations')
        print(len(dfcutQ.index))

        dfcutQ['priors_count'] = dfcutQ['priors_count'].apply(
            lambda x: quantizePrior(x))
        dfcutQ['length_of_stay'] = dfcutQ['length_of_stay'].apply(
            lambda x: quantizeLOS(x))
        dfcutQ['score_text'] = dfcutQ['score_text'].apply(
            lambda x: quantizeScore(x))
        dfcutQ['age_cat'] = dfcutQ['age_cat'].apply(lambda x: adjustAge(x))
        # Recode sex and race
        dfcutQ['sex'] = dfcutQ['sex'].replace({'Female': 1.0, 'Male': 0.0})
        dfcutQ['race'] = dfcutQ['race'].apply(lambda x: group_race(x))

        features = ['two_year_recid', 'race',
                    'age_cat', 'priors_count', 'c_charge_degree', 'score_text']

        # Pass value to df
        df = dfcutQ[features]

        return df

    XD_features = [
        'age_cat',
        'c_charge_degree',
        'priors_count',
        'race',
        'score_text']
    D_features = [
        'race'] if protected_attributes is None else protected_attributes
    Y_features = ['two_year_recid']
    X_features = list(set(XD_features) - set(D_features))
    categorical_features = [
        'age_cat',
        'priors_count',
        'c_charge_degree',
        'score_text']

    # privileged classes
    all_privileged_classes = {"sex": [1.0],
                              "race": [1.0]}

    # protected attribute maps
    all_protected_attribute_maps = {
        "sex": {
            0.0: 'Male', 1.0: 'Female'}, "race": {
            1.0: 'Caucasian', 0.0: 'Not Caucasian'}}

    return CompasDataset(
        label_name=Y_features[0],
        favorable_classes=[0],
        protected_attribute_names=D_features,
        privileged_classes=[all_privileged_classes[x] for x in D_features],
        instance_weights_name=None,
        categorical_features=categorical_features,
        features_to_keep=X_features + Y_features + D_features,
        na_values=[],
        metadata={'label_maps': [{1.0: 'Did recid.', 0.0: 'No recid.'}],
                  'protected_attribute_maps': [all_protected_attribute_maps[x]
                                               for x in D_features]},
        custom_preprocessing=custom_preprocessing)
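
The difference between the two mechanisms in this notebook can be checked mechanically: a rule is MNAR in the sense used here if, for some fixed race and two_year_recid, the missingness probability still changes with priors_count. The sketch below restates the two probability rules as plain functions (the function names are made up for illustration) and tests that property.

```python
from itertools import product

# The dataset is restricted to two races, so "Caucasian" stands for "not African-American".
RACES = ["African-American", "Caucasian"]


def mnar_prob(race, recid, priors):
    """Missingness probabilities of the first (MNAR) preprocessing function."""
    if race == "African-American" and recid == 0 and priors == 0:
        return 0.5
    if race != "African-American" and recid == 1 and priors == 2:
        return 0.3
    if race == "African-American":
        return 0.2
    return 0.05


def mar_prob(race, recid, priors):
    """Missingness probabilities of the second (MAR) preprocessing function."""
    if race == "African-American" and recid == 0:
        return 0.2
    if race == "African-American":
        return 0.15
    return 0.05


def depends_on_priors(rule):
    """True if the probability varies with priors_count for some fixed (race, recid)."""
    for race, recid in product(RACES, [0, 1]):
        probs = {rule(race, recid, priors) for priors in [0, 1, 2]}
        if len(probs) > 1:
            return True
    return False


print(depends_on_priors(mnar_prob))  # True
print(depends_on_priors(mar_prob))   # False
```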

As above, we load the data and run the fairness-fixing algorithm proposed by Calmon et al. \[1\].

In [7]:
privileged_groups = [{'race': 1}]
unprivileged_groups = [{'race': 0}]
dataset_orig = load_preproc_data_compas(['race'])

optim_options = {
    "distortion_fun": get_distortion_compas,
    "epsilon": 0.05,
    "clist": [0.99, 1.99, 2.99],
    "dlist": [.1, 0.05, 0]
}

dataset_orig_train, dataset_orig_vt = dataset_orig.split(
    [0.7], shuffle=True)

OP = OptimPreproc(OptTools, optim_options,
                  unprivileged_groups=unprivileged_groups,
                  privileged_groups=privileged_groups)

OP = OP.fit(dataset_orig_train)

dataset_transf_cat_test = OP.transform(dataset_orig_vt, transform_Y=True)
dataset_transf_cat_test = dataset_orig_vt.align_datasets(
    dataset_transf_cat_test)

dataset_transf_cat_train = OP.transform(
    dataset_orig_train, transform_Y=True)
dataset_transf_cat_train = dataset_orig_train.align_datasets(
    dataset_transf_cat_train)

Total number of missing values
623
Total number of observations
5278
Optimized Preprocessing: Objective converged to 0.102867


As in the MNAR case, we first train a logistic regression classifier without reweighting, then train another with reweighting, and validate both on the same test set.

In [8]:
scale_transf = StandardScaler()
X_train = scale_transf.fit_transform(dataset_transf_cat_train.features)
y_train = dataset_transf_cat_train.labels.ravel()
X_test = scale_transf.fit_transform(dataset_transf_cat_test.features)
lmod = LogisticRegression()
lmod.fit(X_train, y_train)
y_pred = lmod.predict(X_test)
print('Without reweight')
get_evaluation(dataset_orig_vt,y_pred,privileged_groups,unprivileged_groups,0,1,0)

Without reweight
Accuracy
0.6546717171717171
p-rule
0.7910284406324818
FPR for unpriv group
0.3568464730290456
FNR for unpriv group
0.35306553911205074
FPR for priv group
0.4549019607843138
FNR for priv group
0.2459893048128342


In [9]:
dataset_orig_train.instance_weights = reweight_df(dataset_orig_train)
scale_transf = StandardScaler()
X_train = scale_transf.fit_transform(dataset_transf_cat_train.features)
y_train = dataset_transf_cat_train.labels.ravel()
X_test = scale_transf.fit_transform(dataset_transf_cat_test.features)
lmod = LogisticRegression()
lmod.fit(
    X_train,
    y_train,
    sample_weight=dataset_orig_train.instance_weights)
y_pred = lmod.predict(X_test)
print('With reweight')
get_evaluation(dataset_orig_vt,y_pred,privileged_groups,unprivileged_groups,0,1,0)

With reweight
Accuracy
0.6534090909090909
p-rule
0.8284982088729678
FPR for unpriv group
0.3568464730290456
FNR for unpriv group
0.35306553911205074
FPR for priv group
0.42352941176470593
FNR for priv group
0.2727272727272727


As in the MNAR case, our reweighting algorithm improves the p-rule (from about 0.79 to 0.83) with a very small decrease in accuracy (about 0.13 percentage points). <br>
# Reference
[1] Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy and Kush R. Varshney. Optimized Pre-Processing for Discrimination Prevention. <br>
31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, December 2017.