# Inferring a Protected Attribute Using a Linked Feature

The problem we address is that of inferring protected features, typically demographic data such as gender, for datasets where that information is absent. Our approach assumes that the dataset contains a "linked" feature whose values give us probabilistic information about the protected feature values. For example, assume that we have data about customer purchases and the stores at which the purchases were made. It is reasonable to assume that at certain stores women are more likely to buy items (the store may sell items of more interest to women), while at other stores men are more likely to buy items (the store sells items of more interest to men).

Let's now examine the transactions for a single customer $X$. A priori we do not know whether the customer is male or female, so in the absence of any other data we take the two probabilities to be equal, i.e.

$$P(X=Male) = P(X=Female) = 0.5$$

Let $S_i$ be the store at which transaction $i$ was made, let $P(S|X=Male)$ be the probability that a purchase by a man is made at store $S$, and let $P(S|X=Female)$ be the corresponding probability for a woman. The probability that the customer is male given their $n$ transactions can be calculated using Bayes' theorem (assuming that purchases are conditionally independent given the customer's gender) to give

$$
\begin{split}
P(X=Male| S_1...S_n) & = \frac{P(X=Male) P(S_1...S_n|X=Male)}{K}\\
                     & = \frac{P(X=Male) \prod_{i=1}^n P(S_i|X=Male)}{K}
\end{split}
$$

and

$$
\begin{split}
P(X=Female| S_1...S_n) & = \frac{P(X=Female) P(S_1...S_n|X=Female)}{K}\\
                     & = \frac{P(X=Female) \prod_{i=1}^n P(S_i|X=Female)}{K}
\end{split}
$$

where $K = P(S_1...S_n)$ is a common normalizing constant, chosen so that the two posterior probabilities sum to one.
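
As a rough illustration of the two formulas above, the following sketch computes $P(X=Male|S_1...S_n)$ for a short list of purchases. The store names `A` and `B` and their likelihoods are made-up values for illustration only, and the function `posterior_male` is a hypothetical helper, not part of the toolkit. Log-probabilities are summed instead of multiplying the raw probabilities so that long transaction lists do not underflow.

```python
import math

def posterior_male(stores, p_store_given_male, p_store_given_female, prior_male=0.5):
    """Return P(X=Male | S_1..S_n), assuming purchases are conditionally independent."""
    # Log of the numerators: log prior + sum of log likelihoods
    log_male = math.log(prior_male) + sum(math.log(p_store_given_male[s]) for s in stores)
    log_female = math.log(1.0 - prior_male) + sum(math.log(p_store_given_female[s]) for s in stores)
    # Subtract the larger log-score before exponentiating for numerical stability
    shift = max(log_male, log_female)
    w_male = math.exp(log_male - shift)
    w_female = math.exp(log_female - shift)
    # Dividing by the sum plays the role of the normalizing constant K
    return w_male / (w_male + w_female)

# Hypothetical likelihoods P(S | gender) for two stores
p_male = {'A': 0.7, 'B': 0.3}
p_female = {'A': 0.3, 'B': 0.7}

# Two purchases at store A and one at store B give a posterior of about 0.7
print(posterior_male(['A', 'A', 'B'], p_male, p_female))
```

With no transactions at all the function simply returns the prior, which matches the a priori assumption stated earlier.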

In [1]:
from etiq_core import *
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

Thanks for trying out the ETIQ.ai toolkit!

Visit our getting started documentation at https://docs.etiq.ai/

Visit our Slack channel at https://etiqcore.slack.com/ for support or feedback.



# Generating Synthetic Data

To test our inference pipeline, we generate synthetic transaction data using the following functions.

In [2]:
from collections import Counter

suspect_stores = ['MCC3', 'MCC5', 'MCC7', 'MCC11']

# A utility function to sample from a categorical distribution given the categories and the probability of each
# category
def prob_categorical(cats, p):
    return(np.random.choice(cats, 1, p=p)[0])

# This function generates a dataframe consisting of Customers (identified by a unique ID) and their genders
# The probability of a random customer being a Male is equal to the probability of the same customer being female
def generate_customers(num_customers: int = 100, random_seed: int = 3):
    np.random.seed(random_seed)
    customers = pd.DataFrame()
    customers['id'] = ['Cust' + str(i) for i in range(1,num_customers+1)]
    customers['gender'] = customers.apply(lambda row: 'Male' if np.random.uniform() >= 0.5 else 'Female', axis=1)
    return customers

# This function is used to determine whether a transaction should be flagged
def flag_rule(row):
    # Flag transactions with an amount greater than 60 made at one of the four suspect stores
    if row['MCC'] in suspect_stores:
        return (1 if row['amount'] > 60 else 0)
    return 0

# This function generates a dataframe of synthetic transactions
def generate_transactions(num_transactions: int = 1000, customers: pd.DataFrame = None, linked_feature_probability: pd.DataFrame = None, random_seed: int = 4):
    np.random.seed(random_seed)
    transactions = pd.DataFrame()
    if (customers is None) or (linked_feature_probability is None):
        return transactions
    customers_dict = {row['id']: row['gender'] for _,row in customers.iterrows()}
    male_prob = list(linked_feature_probability['Male']/np.sum(linked_feature_probability['Male']))
    female_prob = list(linked_feature_probability['Female']/np.sum(linked_feature_probability['Female']))
    cats = list(linked_feature_probability['ID'])
    mcc_rule = lambda row: prob_categorical(cats, male_prob) if row['gender'] == 'Male' else prob_categorical(cats, female_prob)
    amount_rule = lambda row: np.random.uniform(20.0,100)
    transactions['customerID'] = np.random.choice(customers['id'], num_transactions)
    transactions['gender'] = transactions.apply(lambda row: customers_dict[row['customerID']], axis=1) 
    transactions['MCC'] = transactions.apply(mcc_rule, axis=1)
    transactions['amount'] = transactions.apply(amount_rule, axis=1)
    transactions['flag'] = transactions.apply(flag_rule, axis=1)
    return transactions



We generate 1000 customers.

In [3]:
customers = generate_customers(1000, random_seed=13)
customers.head()

| | id | gender |
|---|---|---|
| 0 | Cust1 | Male |
| 1 | Cust2 | Female |
| 2 | Cust3 | Male |
| 3 | Cust4 | Male |
| 4 | Cust5 | Male |


In [4]:
# Get the number of Male and Female customers in the list of 1000 customers
Counter(customers['gender'])

Counter({'Male': 484, 'Female': 516})

## Generate Transaction Data

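Create the gender probabilities associated with each linked feature value for a set of 11 stores. We define a "strong" table, in which several stores are heavily skewed towards one gender, and a "weak" table, in which no store deviates from an even split by more than 0.05.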

In [5]:
mcc_strong = pd.DataFrame()
mcc_strong['ID'] = ['MCC1', 'MCC2', 'MCC3', 'MCC4', 'MCC5', 'MCC6', 'MCC7', 'MCC8', 'MCC9', 'MCC10', 'MCC11']
mcc_strong['Female'] = [0.5, 0.3, 0.7, 0.5, 0.75, 0.25, 0.5, 0.52, 0.42, 0.9, 0.1]
mcc_strong['Male'] = [0.5, 0.7, 0.3, 0.5, 0.25, 0.75, 0.5, 0.48, 0.58, 0.1, 0.9]

mcc_weak = pd.DataFrame()
mcc_weak['ID'] = ['MCC1', 'MCC2', 'MCC3', 'MCC4', 'MCC5', 'MCC6', 'MCC7', 'MCC8', 'MCC9', 'MCC10', 'MCC11']
mcc_weak['Female'] = [0.5, 0.45, 0.55, 0.5, 0.55, 0.45, 0.5, 0.55, 0.45, 0.55, 0.45]
mcc_weak['Male'] = [0.5, 0.55, 0.45, 0.5, 0.45, 0.55, 0.5, 0.45, 0.55, 0.45, 0.55]
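
Note that `generate_transactions` divides each gender column by its sum before sampling, so the values in these tables act as relative weights across stores. The sketch below reproduces that normalisation for the "strong" table in plain Python; the helper `normalise` and the dictionary names are illustrative only.

```python
def normalise(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

stores = ['MCC1', 'MCC2', 'MCC3', 'MCC4', 'MCC5', 'MCC6', 'MCC7', 'MCC8', 'MCC9', 'MCC10', 'MCC11']
female_weights = [0.5, 0.3, 0.7, 0.5, 0.75, 0.25, 0.5, 0.52, 0.42, 0.9, 0.1]
male_weights = [0.5, 0.7, 0.3, 0.5, 0.25, 0.75, 0.5, 0.48, 0.58, 0.1, 0.9]

# Per-gender store distributions, as used when sampling the MCC of a transaction
p_store_given_female = dict(zip(stores, normalise(female_weights)))
p_store_given_male = dict(zip(stores, normalise(male_weights)))

# How much more likely a purchase at MCC10 is for a female customer than for a male one
ratio = p_store_given_female['MCC10'] / p_store_given_male['MCC10']
print(round(ratio, 2))
```

Under the "strong" table a single purchase at `MCC10` is roughly nine times more likely to come from a female customer, which is why a handful of transactions is usually enough to separate the two groups.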

Create 40,000 transactions for each table using the probabilities above.

In [6]:
full_data_strong = generate_transactions(40000, customers, mcc_strong, random_seed=3)
# Drop the actual gender from the transactions
transactions_strong = full_data_strong.drop(['gender'], axis=1)
transactions_strong.head()

| | customerID | MCC | amount | flag |
|---|---|---|---|---|
| 0 | Cust875 | MCC3 | 28.065022 | 0 |
| 1 | Cust665 | MCC3 | 58.534485 | 0 |
| 2 | Cust250 | MCC4 | 56.476419 | 0 |
| 3 | Cust644 | MCC2 | 83.464443 | 0 |
| 4 | Cust953 | MCC5 | 94.138779 | 1 |


In [7]:
full_data_weak = generate_transactions(40000, customers, mcc_weak, random_seed=5)
# Drop the actual gender from the transactions
transactions_weak = full_data_weak.drop(['gender'], axis=1)
transactions_weak.head()

| | customerID | MCC | amount | flag |
|---|---|---|---|---|
| 0 | Cust868 | MCC9 | 84.300496 | 0 |
| 1 | Cust207 | MCC3 | 26.826899 | 0 |
| 2 | Cust702 | MCC3 | 43.868908 | 0 |
| 3 | Cust999 | MCC3 | 53.864379 | 0 |
| 4 | Cust119 | MCC3 | 93.994402 | 1 |


In [8]:
dict_mcc_strong = {row['ID']: (row['Female'], row['Male']) for _,row in mcc_strong.iterrows()}
dict_mcc_weak = {row['ID']: (row['Female'], row['Male']) for _,row in mcc_weak.iterrows()}
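
To make the role of such a lookup concrete, here is a minimal sketch of how a `store -> (female, male)` dictionary could be combined with per-customer transactions to pick the more likely gender. This is an assumption about the general technique, not a description of what `InferProtectedDataPipeline` does internally; the function `infer_gender_per_customer` and the toy data are hypothetical, and the tuple entries are treated directly as likelihood weights with equal priors.

```python
import math
from collections import defaultdict

def infer_gender_per_customer(transactions, lookup):
    """transactions: iterable of (customer_id, store_id) pairs.
    lookup: store_id -> (female_weight, male_weight)."""
    # Accumulate [female, male] log-scores per customer; equal priors cancel out
    log_scores = defaultdict(lambda: [0.0, 0.0])
    for customer_id, store_id in transactions:
        female_w, male_w = lookup[store_id]
        log_scores[customer_id][0] += math.log(female_w)
        log_scores[customer_id][1] += math.log(male_w)
    # Choose the gender with the larger score (ties fall to 'Female')
    return {cid: ('Male' if male > female else 'Female')
            for cid, (female, male) in log_scores.items()}

# Toy lookup and transactions, for illustration only
toy_lookup = {'MCC1': (0.5, 0.5), 'MCC10': (0.9, 0.1), 'MCC11': (0.1, 0.9)}
toy_transactions = [('Cust1', 'MCC11'), ('Cust1', 'MCC1'),
                    ('Cust2', 'MCC10'), ('Cust2', 'MCC10'), ('Cust2', 'MCC11')]

print(infer_gender_per_customer(toy_transactions, toy_lookup))
# {'Cust1': 'Male', 'Cust2': 'Female'}
```

Because the decision is made per customer, every transaction belonging to the same customer receives the same inferred label.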

# Set up inferred data pipelines

In [9]:
# Specify the categorical and continuous features
cat_vars = ['customerID','MCC', 'flag']
cont_vars = ['amount']

transforms = [Dropna, EncodeLabels] 
# Note that we don't have the protected feature so the BiasParams protected field should be set to None
debias_param = BiasParams(protected=None, privileged=1, unprivileged=2, 
                              positive_outcome_label='0', negative_outcome_label='1')

dl_strong = DatasetLoader(data=transactions_strong, label='flag', transforms=transforms, bias_params=debias_param,
                   train_valid_test_splits=[0.8, 0.1, 0.1], cat_col=cat_vars,
                   cont_col=cont_vars, names_col = transactions_strong.columns.values)

dl_weak = DatasetLoader(data=transactions_weak, label='flag', transforms=transforms, bias_params=debias_param,
                   train_valid_test_splits=[0.8, 0.1, 0.1], cat_col=cat_vars,
                   cont_col=cont_vars, names_col = transactions_weak.columns.values)

# Model
xgb_strong = DefaultXGBoostClassifier()
xgb_weak = DefaultXGBoostClassifier()

# "Strong" inferred data pipeline
metrics_initial = [accuracy,  equal_opportunity, demographic_parity]
pipeline_infered_strong = InferProtectedDataPipeline(dataset_loader=dl_strong, model=xgb_strong, 
                                                      metrics=metrics_initial, infer_feature='gender',
                                                      linked_feature='MCC', data_key_column='customerID',
                                                      feature_prob_lookup=dict_mcc_strong,
                                                      privileged_class=1)
pipeline_infered_strong.run()

# "Weak" inferred data pipeline
pipeline_infered_weak = InferProtectedDataPipeline(dataset_loader=dl_weak, model=xgb_weak, 
                                                      metrics=metrics_initial, infer_feature='gender',
                                                      linked_feature='MCC', data_key_column='customerID',
                                                      feature_prob_lookup=dict_mcc_weak,
                                                      privileged_class=1)
pipeline_infered_weak.run()

INFO:etiq_core.pipeline.InferProtectedPipeline0349:Starting pipeline
INFO:etiq_core.pipeline.InferProtectedPipeline0349:Infering protected feature "gender" using feature "MCC"
INFO:etiq_core.pipeline.InferProtectedPipeline0349:Fitting model
INFO:etiq_core.pipeline.InferProtectedPipeline0349:Computed metrics for the initial dataset
INFO:etiq_core.pipeline.InferProtectedPipeline0349:Completed pipeline
INFO:etiq_core.pipeline.InferProtectedPipeline0365:Starting pipeline
INFO:etiq_core.pipeline.InferProtectedPipeline0365:Infering protected feature "gender" using feature "MCC"
INFO:etiq_core.pipeline.InferProtectedPipeline0365:Fitting model
INFO:etiq_core.pipeline.InferProtectedPipeline0365:Computed metrics for the initial dataset
INFO:etiq_core.pipeline.InferProtectedPipeline0365:Completed pipeline


In [10]:
pipeline_infered_strong.get_protected_metrics()

{'InferProtectedPipeline0349': [{'accuracy': ('privileged',
    1.0,
    'unprivileged',
    1.0)},
  {'equal_opportunity': ('privileged', 1.0, 'unprivileged', 1.0)},
  {'demographic_parity': ('privileged',
    0.7865279841505696,
    'unprivileged',
    0.8244197780020182)}]}

In [11]:
pipeline_infered_weak.get_protected_metrics()

{'InferProtectedPipeline0365': [{'accuracy': ('privileged',
    1.0,
    'unprivileged',
    1.0)},
  {'equal_opportunity': ('privileged', 1.0, 'unprivileged', 1.0)},
  {'demographic_parity': ('privileged',
    0.7896440129449838,
    'unprivileged',
    0.8275299238302503)}]}

## How well did we do at inferring the protected characteristic?

In [12]:
# Load the full dataset
customers_dict = {row['id']: row['gender'] for _,row in customers.iterrows()}

In [13]:
# Get all the customer IDs from the training set
customer_idx = np.where(pipeline_infered_strong.get_dataset().get_dataset_column_names()== 'customerID')[0][0]
c = pipeline_infered_strong.get_dataset().x_train[:,customer_idx].astype(int)
train_customers_strong  = pipeline_infered_strong.store.encoder['customerID'].inverse_transform(c)

customer_idx = np.where(pipeline_infered_weak.get_dataset().get_dataset_column_names()== 'customerID')[0][0]
c = pipeline_infered_weak.get_dataset().x_train[:,customer_idx].astype(int)
train_customers_weak  = pipeline_infered_weak.store.encoder['customerID'].inverse_transform(c)

In [14]:
# Get the gender of the customer for each of the training set transactions
actual_train_gender_strong  = [customers_dict[acustid] for acustid in train_customers_strong]
actual_train_gender_strong_encoded = [1 if x=='Male' else 0 for x in actual_train_gender_strong]

# Get the gender of the customer for each of the training set transactions
actual_train_gender_weak  = [customers_dict[acustid] for acustid in train_customers_weak]
actual_train_gender_weak_encoded = [1 if x=='Male' else 0 for x in actual_train_gender_weak]

In [15]:
# Get the inferred gender of the customer for each of the training set transactions
infered_train_gender_strong =  pipeline_infered_strong.get_dataset().protected_train.astype(int)

# Get the inferred gender of the customer for each of the training set transactions
infered_train_gender_weak =  pipeline_infered_weak.get_dataset().protected_train.astype(int)

In [16]:
# Get the confusion matrix of the gender of the customer in each of the training set transactions
confusion_matrix(actual_train_gender_strong_encoded , infered_train_gender_strong)

array([[16534,    66],
       [    0, 15399]])

In [17]:
# Get the confusion matrix of the gender of the customer in each of the training set transactions
confusion_matrix(actual_train_gender_weak_encoded , infered_train_gender_weak)

array([[11712,  4808],
       [ 5200, 10279]])

In [18]:
actual_train_customers_dict_strong = {a: customers_dict[a] for a in train_customers_strong}
inferred_train_customers_strong_d = {a: ('Male' if infered_train_gender_strong[idx]==1 else 'Female') for idx,a in enumerate(train_customers_strong)}

actual_train_customers_dict_weak = {a: customers_dict[a] for a in train_customers_weak}
inferred_train_customers_weak_d = {a: ('Male' if infered_train_gender_weak[idx]==1 else 'Female') for idx,a in enumerate(train_customers_weak)}

In [19]:
actual_train_customers_list = list(actual_train_customers_dict_strong.values())
inferred_train_customers_list = list(inferred_train_customers_strong_d.values())
confusion_matrix(actual_train_customers_list, inferred_train_customers_list)

array([[514,   2],
       [  0, 484]])

In summary, where there is a strong gender disparity in shopping habits, our technique correctly infers the gender of 998 out of 1000 customers.

In [20]:
actual_train_customers_list = list(actual_train_customers_dict_weak.values())
inferred_train_customers_list = list(inferred_train_customers_weak_d.values())
confusion_matrix(actual_train_customers_list, inferred_train_customers_list)


array([[367, 149],
       [167, 317]])

On the other hand, where there is only a weak gender disparity in shopping habits, our technique correctly infers the gender of only 684 of our 1000 customers.
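
The customer counts quoted above can be read directly off the two customer-level confusion matrices: the correct inferences sit on the diagonal. The short sketch below does that arithmetic; `accuracy_from_confusion` is a small illustrative helper, and the matrices are copied from the outputs above.

```python
def accuracy_from_confusion(matrix):
    """Fraction of correct predictions: diagonal total divided by overall total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Customer-level confusion matrices (rows: actual, columns: inferred)
strong_customers = [[514, 2], [0, 484]]
weak_customers = [[367, 149], [167, 317]]

print(accuracy_from_confusion(strong_customers))  # 0.998
print(accuracy_from_confusion(weak_customers))    # 0.684
```

The same helper can be applied to the transaction-level matrices to get the share of training transactions that carry a correctly inferred label.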

# Running a Debias pipeline over an inferred data pipeline

In [21]:
# the DebiasPipeline aims to identify sources of bias by applying analyses formalized in the Identify pipelines
# the Identify pipeline is looking for 3 sources of bias (limited features, poor sampling and proxies)

identify_pipeline = IdentifyBiasSources(nr_groups=20, # nr of segments based on using unsupervised learning to group similar rows
                                        train_model_segment=True,
                                        group_def=['unsupervised'],
                                        fit_metrics=[accuracy, equal_opportunity])
    
# the DebiasPipeline aims to mitigate sources of bias by applying different types of repair algorithms
# the library offers implementations of repair algorithms described in the academic fairness literature
repair_pipeline = RepairResamplePipeline(steps=[ResampleUnbiasedSegmentsStep(ratio_resample=1)], random_seed=4)

debias_pipeline = DebiasPipeline(data_pipeline=pipeline_infered_strong, 
                                 model=xgb_strong,
                                 metrics=metrics_initial,
                                 identify_pipeline=identify_pipeline,
                                 repair_pipeline=repair_pipeline)
debias_pipeline.run()

INFO:etiq_core.pipeline.DebiasPipeline0296:Starting pipeline
INFO:etiq_core.pipeline.DebiasPipeline0296:Start Phase IdentifyPipeline0575
INFO:etiq_core.pipeline.IdentifyPipeline0575:Starting pipeline
INFO:etiq_core.pipeline.IdentifyPipeline0575:Completed pipeline
INFO:etiq_core.pipeline.DebiasPipeline0296:Completed Phase IdentifyPipeline0575
INFO:etiq_core.pipeline.DebiasPipeline0296:Start Phase RepairPipeline0743
INFO:etiq_core.pipeline.RepairPipeline0743:Starting pipeline
INFO:etiq_core.pipeline.RepairPipeline0743:Completed pipeline
INFO:etiq_core.pipeline.DebiasPipeline0296:Completed Phase RepairPipeline0743
INFO:etiq_core.pipeline.DebiasPipeline0296:Refitting model
INFO:etiq_core.pipeline.DebiasPipeline0296:Computed metrics for the repaired dataset
INFO:etiq_core.pipeline.DebiasPipeline0296:Compare pipeline predictions
INFO:etiq_core.pipeline.DebiasPipeline0296:Completed pipeline


In [22]:
debias_pipeline.get_protected_metrics()

{'InferProtectedPipeline0349': [{'accuracy': ('privileged',
    1.0,
    'unprivileged',
    1.0)},
  {'equal_opportunity': ('privileged', 1.0, 'unprivileged', 1.0)},
  {'demographic_parity': ('privileged',
    0.7865279841505696,
    'unprivileged',
    0.8244197780020182)}],
 'DebiasPipeline0296': [{'accuracy': ('privileged', 1.0, 'unprivileged', 1.0)},
  {'equal_opportunity': ('privileged', 1.0, 'unprivileged', 1.0)},
  {'demographic_parity': ('privileged',
    0.8244197780020182,
    'unprivileged',
    0.7865279841505696)}]}

In [23]:
debias_pipeline.get_issues_summary()

| | issue | features | segments |
|---|---|---|---|
| 0 | correlation_issue | MCC | [1, 2, 3, 4, 6, 7, 10, 11, 14, 15, 16, 17, 18,... |
| 1 | low_priv_sample | | [0] |
| 0 | missing_sample | | [5] |


## Compare Against the Non-Inferred Debias Pipeline

In [24]:
full_cat_vars = ['customerID','MCC', 'flag', 'gender']
debias_param = BiasParams(protected='gender', privileged='Male', unprivileged='Female', 
                              positive_outcome_label='0', negative_outcome_label='1')

dl_full_strong = DatasetLoader(data=full_data_strong, label='flag', transforms=transforms, bias_params=debias_param,
                          train_valid_test_splits=[0.8, 0.1, 0.1], cat_col=full_cat_vars,
                          cont_col=cont_vars, names_col = full_data_strong.columns.values)
# Model
xgb_full_strong = DefaultXGBoostClassifier()
pipeline_full_strong = DataPipeline(dataset_loader=dl_full_strong, model=xgb_full_strong, 
                                    metrics=metrics_initial)
pipeline_full_strong.run()

INFO:etiq_core.pipeline.DataPipeline0493:Starting pipeline
INFO:etiq_core.pipeline.DataPipeline0493:Fitting model
INFO:etiq_core.pipeline.DataPipeline0493:Computed metrics for the initial dataset
INFO:etiq_core.pipeline.DataPipeline0493:Completed pipeline


We now run the debias pipeline on the full data.

In [25]:
identify_pipeline_full = IdentifyBiasSources(nr_groups=20, # nr of segments based on using unsupervised learning to group similar rows
                                        train_model_segment=True,
                                        group_def=['unsupervised'],
                                        fit_metrics=[accuracy, equal_opportunity])
    
# the DebiasPipeline aims to mitigate sources of bias by applying different types of repair algorithms
# the library offers implementations of repair algorithms described in the academic fairness literature
repair_pipeline_full = RepairResamplePipeline(steps=[ResampleUnbiasedSegmentsStep(ratio_resample=1)], random_seed=4)

debias_pipeline_full = DebiasPipeline(data_pipeline=pipeline_full_strong, 
                                 model=xgb_full_strong,
                                 metrics=metrics_initial,
                                 identify_pipeline=identify_pipeline_full,
                                 repair_pipeline=repair_pipeline_full)
debias_pipeline_full.run()

INFO:etiq_core.pipeline.DebiasPipeline0993:Starting pipeline
INFO:etiq_core.pipeline.DebiasPipeline0993:Start Phase IdentifyPipeline0849
INFO:etiq_core.pipeline.IdentifyPipeline0849:Starting pipeline
INFO:etiq_core.pipeline.IdentifyPipeline0849:Completed pipeline
INFO:etiq_core.pipeline.DebiasPipeline0993:Completed Phase IdentifyPipeline0849
INFO:etiq_core.pipeline.DebiasPipeline0993:Start Phase RepairPipeline0769
INFO:etiq_core.pipeline.RepairPipeline0769:Starting pipeline
INFO:etiq_core.pipeline.RepairPipeline0769:Completed pipeline
INFO:etiq_core.pipeline.DebiasPipeline0993:Completed Phase RepairPipeline0769
INFO:etiq_core.pipeline.DebiasPipeline0993:Refitting model
INFO:etiq_core.pipeline.DebiasPipeline0993:Computed metrics for the repaired dataset
INFO:etiq_core.pipeline.DebiasPipeline0993:Compare pipeline predictions
INFO:etiq_core.pipeline.DebiasPipeline0993:Completed pipeline


In [32]:
debias_pipeline_full.get_protected_metrics()

{'DataPipeline0493': [{'accuracy': ('privileged', 1.0, 'unprivileged', 1.0)},
  {'equal_opportunity': ('privileged', 1.0, 'unprivileged', 1.0)},
  {'demographic_parity': ('privileged',
    0.8237082066869301,
    'unprivileged',
    0.7873704982733103)}],
 'DebiasPipeline0993': [{'accuracy': ('privileged', 1.0, 'unprivileged', 1.0)},
  {'equal_opportunity': ('privileged', 1.0, 'unprivileged', 1.0)},
  {'demographic_parity': ('privileged',
    0.8237082066869301,
    'unprivileged',
    0.7873704982733103)}]}

In [33]:
debias_pipeline.get_protected_metrics()

{'InferProtectedPipeline0349': [{'accuracy': ('privileged',
    1.0,
    'unprivileged',
    1.0)},
  {'equal_opportunity': ('privileged', 1.0, 'unprivileged', 1.0)},
  {'demographic_parity': ('privileged',
    0.7865279841505696,
    'unprivileged',
    0.8244197780020182)}],
 'DebiasPipeline0296': [{'accuracy': ('privileged', 1.0, 'unprivileged', 1.0)},
  {'equal_opportunity': ('privileged', 1.0, 'unprivileged', 1.0)},
  {'demographic_parity': ('privileged',
    0.8244197780020182,
    'unprivileged',
    0.7865279841505696)}]}
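
To compare the two runs more easily, the sketch below computes the demographic parity gap (privileged minus unprivileged) from the tuples printed above. The helper `parity_gap` is hypothetical and assumes the `('privileged', value, 'unprivileged', value)` layout seen in these outputs; the numbers are copied from the results above.

```python
def parity_gap(metric):
    """Return privileged minus unprivileged for a
    ('privileged', value, 'unprivileged', value) tuple."""
    values = dict(zip(metric[0::2], metric[1::2]))
    return values['privileged'] - values['unprivileged']

# Demographic parity values copied from the pipeline outputs above
reported = {
    'inferred, before repair': ('privileged', 0.7865279841505696, 'unprivileged', 0.8244197780020182),
    'inferred, after repair': ('privileged', 0.8244197780020182, 'unprivileged', 0.7865279841505696),
    'actual, before repair': ('privileged', 0.8237082066869301, 'unprivileged', 0.7873704982733103),
    'actual, after repair': ('privileged', 0.8237082066869301, 'unprivileged', 0.7873704982733103),
}

for name, metric in reported.items():
    print(f'{name}: {parity_gap(metric):+.4f}')
```

On these numbers the size of the gap is the same before and after the repair step in both runs; in the inferred run only its sign changes.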