<a href="https://colab.research.google.com/github/IndraniMandal/New-Revisions/blob/main/Class_imbalance_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Class Imbalance bias analysis

In [None]:
# Requirements
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity='all'
import pandas as pd
import numpy as np
import os
import io
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report



# Data 
We use the Adult dataset to analyze bias that already exists in the data.
Adult is a benchmark dataset widely used to establish baselines in data analysis and modeling.

The [data](https://www.kaggle.com/datasets/uciml/adult-census-income) was extracted by Barry Becker from the 1994 Census database. The prediction task is to determine whether a person makes over 50K a year.

In [None]:
url = 'https://raw.githubusercontent.com/surbhir08/Data/main/adult.csv'
data = pd.read_csv(url)
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


# Questions we are trying to address through this analysis


*   Does an imbalance in the data lead to biased outcomes? For example, if the number of records for males exceeds the number of records for females, the model has a large number of data points to learn about males but only a few to learn about females. This can lead to incomplete learning about females and may result in biased outcomes.



In [None]:
# Checking for null values (note: this dataset marks missing entries with '?', which isnull() does not count)
data.isnull().sum().sort_values()

age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
dtype: int64
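
The null check reports zero for every column, yet the preview of the raw data shows `?` in `workclass` and `occupation`. A minimal sketch of how such placeholders could be counted, using a small hand-built frame copied from the preview rather than the full dataset (the name `sample` is only for illustration):

```python
import pandas as pd

# Toy frame copied from the first rows shown by data.head() above
sample = pd.DataFrame({
    "workclass": ["Private", "Private", "Local-gov", "Private", "?"],
    "occupation": ["Machine-op-inspct", "Farming-fishing", "Protective-serv",
                   "Machine-op-inspct", "?"],
    "gender": ["Male", "Male", "Male", "Male", "Female"],
})

# isnull() sees no missing values because '?' is an ordinary string
null_counts = sample.isnull().sum()

# Counting the '?' placeholder per column reveals the hidden gaps
placeholder_counts = (sample == "?").sum()
print(placeholder_counts)
```

Applied to the full data frame, the same comparison would show how many rows carry the placeholder before the encoder turns `?` into just another category code.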

In the code below, the columns are grouped by type. For example, workclass takes values such as Private and Local-gov; these are categories rather than numerical quantities like age.


In [None]:
feature_cat = ['workclass','education','occupation', 'relationship','native-country'] # categorical features
feature_num = ['capital-gain', 'capital-loss', 'hours-per-week','educational-num'] # numerical features
feature_p_att = ['age', 'marital-status', 'race', 'gender'] # protected attributes: features, qualities, traits or characteristics that define a person
target = 'income' # target variable to be predicted

features = feature_cat + feature_p_att + [target] # all columns passed to the encoder below (note: this also includes the numeric column age)

# This function encodes each listed column into ordinal (integer) values
def categorical_feature_encoder(data,features):
    '''
    Takes a data frame and a list of categorical features and returns a numerical encoding of those features.
    Note: the input data frame is modified in place.

    Parameters:
    -----------
    data : pandas dataframe
    features : list of categorical features

    Returns
    -------
    data : dataframe with encoded features
    enc : dict mapping each feature name to its fitted OrdinalEncoder
    '''
    enc = {}
    
    for f in features:
        encoder = OrdinalEncoder()
        data[f] = encoder.fit_transform(data[[f]]).astype(int)
        enc[f] = encoder
    return data, enc

adult_data, enc = categorical_feature_encoder(data,features)

# Compare this data frame with the original one above to verify the encoding: every categorical value is now an ordinal code for further analysis.
adult_data.head()


Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,8,4,226802,1,7,4,7,3,2,1,0,0,40,39,0
1,21,4,89814,11,9,2,5,0,4,1,0,0,50,39,0
2,11,2,336951,7,12,2,11,0,4,1,0,0,40,39,1
3,27,4,160323,15,10,2,7,0,2,1,7688,0,40,39,1
4,1,0,103497,15,10,4,0,3,4,0,0,0,30,39,0
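
To read the encoded table it helps to know which integer stands for which label. The sketch below shows how a fitted `OrdinalEncoder` exposes that mapping through `categories_` and reverses it with `inverse_transform`. It uses a toy column (`toy` is an illustrative name) rather than the notebook's `enc` dictionary.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column using the gender values shown in the raw data above
toy = pd.DataFrame({"gender": ["Male", "Male", "Female", "Male", "Female"]})

encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy[["gender"]]).astype(int)

# categories_ holds one array per encoded column; position = ordinal code
mapping = {label: code for code, label in enumerate(encoder.categories_[0])}
print(mapping)

# inverse_transform converts codes back into the original labels
decoded = encoder.inverse_transform(codes)
```

Categories are sorted alphabetically, which is why Female maps to 0 and Male to 1. In the notebook, `enc['gender'].categories_[0]` should list the labels in the same way.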


In [None]:
adult_data = adult_data[['age', 'workclass', 'education','educational-num',
       'marital-status', 'occupation', 'relationship', 'race', 'gender',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income']]

adult_data.head()

Unnamed: 0,age,workclass,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,8,4,1,7,4,7,3,2,1,0,0,40,39,0
1,21,4,11,9,2,5,0,4,1,0,0,50,39,0
2,11,2,7,12,2,11,0,4,1,0,0,40,39,1
3,27,4,15,10,2,7,0,2,1,7688,0,40,39,1
4,1,0,15,10,4,0,3,4,0,0,0,30,39,0


In [None]:
adult_data['gender'].value_counts() # Counting values to understand class imbalance: there are roughly half as many female (0) records as male (1) records.

1    32650
0    16192
Name: gender, dtype: int64

# Training model using adult data

Here we train the model on the full Adult data. We do not filter the training data on a specific attribute value, unlike the class imbalance experiment later in this notebook.

We split the data into training, validation, and testing sets using the train_test_split function from scikit-learn twice. 

We create a logistic regression model using the LogisticRegression class and train it on the training data using the fit method. We make predictions on the validation data using the predict method and calculate the accuracy of the model on the validation data using the accuracy_score function.

Finally, we test the final model on the testing data by making predictions using the predict method and calculating the accuracy of the model on the testing data using the accuracy_score function.

## Brief description of Train-Valid-Test
Train-Valid-Test split is a technique to evaluate the performance of your machine learning model.


### Train Dataset
Set of data the model learns from, that is, the data used to fit the parameters of the machine learning model.

### Valid Dataset
Set of data used to provide an unbiased evaluation of a model fitted on the training dataset while tuning model hyperparameters.

### Test Dataset
Set of data used to provide an unbiased evaluation of a final model fitted on the training dataset.


In [None]:

# Let's say we want to split the data 80:10:10 into train:valid:test datasets
train_size = 0.8

X = adult_data.drop(columns = ['income']).copy()
y = adult_data['income']

# In the first step we split the data into the training set and the remaining data
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=train_size)

# Now, since we want the valid and test sets to be equal in size (10% each of the overall data),
# we set test_size=0.5 (that is, 50% of the remaining data)

test_size = 0.5
X_val_, X_test_, y_val_, y_test_ = train_test_split(X_rem, y_rem, test_size=test_size)

print('X_train',X_train.shape)
print('y_train',y_train.shape)
print('X_valid',X_val_.shape)
print('y_valid',y_val_.shape)
print('X_test',X_test_.shape)
print('y_test',y_test_.shape)



X_train (39073, 13)
y_train (39073,)
X_valid (4884, 13)
y_valid (4884,)
X_test (4885, 13)
y_test (4885,)
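
The split above is not seeded, so the shapes are reproducible but the rows in each set change from run to run. A hedged sketch of a reproducible variant uses the `random_state` and `stratify` parameters of `train_test_split`. It runs on synthetic data (`X_demo` and `y_demo` are stand-ins, not the Adult data) with about 24% positives, similar to the income target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and an imbalanced binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 240 + [0] * 760)  # 24% positives

# First split: 80% train, 20% remaining; stratify keeps the class ratio
X_tr, X_rem, y_tr, y_rem = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=42)

# Second split: halve the remainder into validation and test sets
X_va, X_te, y_va, y_te = train_test_split(
    X_rem, y_rem, test_size=0.5, stratify=y_rem, random_state=42)

print(len(y_tr), len(y_va), len(y_te))
print(y_tr.mean(), y_va.mean(), y_te.mean())
```

Stratifying keeps the share of positives nearly identical across the three sets, which matters when one class is rare. Using it in the notebook would change the reported numbers.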


In [None]:
# Create a logistic regression model and train it on the training data
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Make predictions on the validation data
y_val_pred = lr.predict(X_val_)

# Calculate the accuracy of the model on the validation data
val_accuracy = accuracy_score(y_val_, y_val_pred)
print('Validation Accuracy:', val_accuracy)

# Test the final model on the testing data
y_test_pred = lr.predict(X_test_)

# Calculate the accuracy of the model on the testing data
test_accuracy = accuracy_score(y_test_, y_test_pred)
print('Test Accuracy:', test_accuracy)

# Print classification report
print('Classification Report:')
print(classification_report(y_test_, y_test_pred))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Validation Accuracy: 0.800982800982801
Test Accuracy: 0.8036847492323439
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.94      0.88      3722
           1       0.66      0.36      0.47      1163

    accuracy                           0.80      4885
   macro avg       0.74      0.65      0.67      4885
weighted avg       0.79      0.80      0.78      4885
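
The convergence warning in the output suggests either raising `max_iter` or scaling the data. A minimal sketch of both remedies combined in a scikit-learn pipeline follows. It runs on synthetic data in which one feature is inflated to mimic a wide-range column such as capital-gain. The numbers it prints are not comparable to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one feature is blown up to mimic a wide-range column
X_demo, y_demo = make_classification(n_samples=2000, n_features=8, random_state=0)
X_demo[:, 0] *= 10_000

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Standardize the features, then fit logistic regression with a larger iteration budget
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

n_iter = clf.named_steps["logisticregression"].n_iter_[0]
test_acc = clf.score(X_te, y_te)
print("iterations used:", n_iter, "test accuracy:", round(test_acc, 3))
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no information from the test split leaks into the preprocessing.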



## Model Performance Interpretation
**Precision, recall, and f1-score** are metrics used to evaluate the performance of classification models.

Precision measures how many of the samples predicted as positive are actually positive. It is defined as the number of true positives divided by the number of true positives plus false positives. A high precision indicates that the model makes few false positive predictions.

Recall measures how many of the actual positive samples are correctly identified as positive by the model. It is defined as the number of true positives divided by the number of true positives plus false negatives. A high recall indicates that the model correctly identifies a high proportion of positive samples.

F1-score is the harmonic mean of precision and recall. It provides a single score that balances both precision and recall. It is defined as 2 times the product of precision and recall divided by the sum of precision and recall.
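
As a quick check of the F1 definition, the sketch below recomputes the f1-scores from the rounded precision and recall values in the classification report above (the helper name `f1_from` is illustrative):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values taken from the class-1 row of the report above
f1_class1 = f1_from(0.66, 0.36)
# Rounded values taken from the class-0 row of the report above
f1_class0 = f1_from(0.83, 0.94)

print(round(f1_class1, 2), round(f1_class0, 2))
```

Both results round to the values printed in the report (0.47 and 0.88). The harmonic mean pulls the class-1 score towards its weaker component, the recall of 0.36.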

In general, a high precision means few false positives among the predicted positives, a high recall means few false negatives among the actual positives, and a high f1-score indicates a good balance between precision and recall.

It's important to consider both precision and recall, especially in imbalanced datasets, where one class has many more samples than the other. In these cases, a high accuracy score may not be enough to evaluate the model's performance, since a model that always predicts the majority class would achieve high accuracy but may perform poorly on the minority class.
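
The point about the majority class can be made concrete with the class sizes from the report above. The sketch below scores a trivial rule that always predicts class 0. It reconstructs the label counts only, not the actual test rows.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class sizes taken from the "support" column of the report above
n_majority, n_minority = 3722, 1163
y_true = np.array([0] * n_majority + [1] * n_minority)

# A trivial "model" that always predicts the majority class (0)
y_always_zero = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_zero)
minority_recall = recall_score(y_true, y_always_zero)  # recall for class 1

print(round(baseline_accuracy, 3), minority_recall)
```

The trivial rule already reaches about 76% accuracy while finding none of the class-1 samples, so the logistic regression's 80% accuracy is a smaller gain than it first appears.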

# Class Imbalance

Training using gender = 1 (male)

First, we filter the data by gender: the male samples (gender = 1) are used for training, and the female samples (gender = 0) are used for validation and testing. This can help us identify any biases in the model that may be due to gender imbalance in the data.

Next, we split each subset using the train_test_split function from scikit-learn, following the same 80:10:10 ratio as before.

We create a logistic regression model using the LogisticRegression class and train it on the male training data using the fit method. We then make predictions on the female validation and test data using the predict method.

In [None]:
# Here we train the model on male data, validate and test it on female data, and analyze the performance of the model.

# We keep the 80:10:10 train:valid:test ratio, but the training data is filtered to male (gender == 1) only
train_size = 0.8

# Here we are only considering male data (['gender'] == 1)
X = adult_data[adult_data['gender'] == 1].drop(['income', 'gender'], axis=1)
y = adult_data[adult_data['gender'] == 1]['income']

# Split the male data into the training set and the remaining data (the remainder is not used)
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=train_size)

# We do not split the male data further, because validation and testing use female data.

# Here we are filtering female data
X1 = adult_data[adult_data['gender'] == 0].drop(['income', 'gender'], axis=1)
y1 = adult_data[adult_data['gender'] == 0]['income']

# Split the female data 80:20 first, so that the same split ratio as above is followed
X1_train, X1_rem, y1_train, y1_rem = train_test_split(X1, y1, train_size=train_size)

# Now, since we want the valid and test sets to be equal in size (10% each of the female data),
# we set test_size=0.5 (that is, 50% of the remaining data)
X1_val, X1_test, y1_val, y1_test = train_test_split(X1_rem, y1_rem, test_size=0.5)


# Create a logistic regression model and train it on the training data (male data)
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Make predictions on the validation data (female data)
y1_val_pred = lr.predict(X1_val)

# Calculate the accuracy of the model on the validation data
val_accuracy = accuracy_score(y1_val, y1_val_pred)
print('Validation Accuracy:', val_accuracy)

# Test the final model on the testing data (female data)
y1_test_pred = lr.predict(X1_test)

# Calculate the accuracy of the model on the testing data
test_accuracy = accuracy_score(y1_test, y1_test_pred)
print('Test Accuracy:', test_accuracy)

# Print classification report
print('Classification Report:')
print(classification_report(y1_test, y1_test_pred))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Validation Accuracy: 0.8579369981470043
Test Accuracy: 0.8703703703703703
Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.95      0.93      1443
           1       0.35      0.21      0.26       177

    accuracy                           0.87      1620
   macro avg       0.63      0.58      0.59      1620
weighted avg       0.85      0.87      0.86      1620



# Interpretation

Overall, although this model, trained only on male data, reaches a higher accuracy on the female test set (0.87) than the previous model (0.80), it performs poorly on the minority class (income >50K, label 1), with low precision (0.35), recall (0.21), and f1-score (0.26). This suggests that the model is biased towards the majority class and may not be suitable for applications where accurately identifying the minority class is important.
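
One way to make such gaps visible is to report metrics separately per group instead of once for the whole test set. The sketch below does this on a small hypothetical set of predictions. The values are invented for illustration and use the notebook's coding (0 = female, 1 = male).

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical toy predictions; gender uses the notebook's coding (0 = female, 1 = male)
results = pd.DataFrame({
    "gender": [1, 1, 1, 1, 0, 0, 0, 0],
    "y_true": [1, 1, 0, 0, 1, 1, 0, 0],
    "y_pred": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Compute accuracy and positive-class recall separately for each group
rows = []
for group_value, part in results.groupby("gender"):
    rows.append({
        "gender": group_value,
        "accuracy": accuracy_score(part["y_true"], part["y_pred"]),
        "recall_pos": recall_score(part["y_true"], part["y_pred"]),
    })
per_group = pd.DataFrame(rows).set_index("gender")
print(per_group)
```

In this toy case both groups have the same accuracy, yet the positive-class recall differs, which is the kind of difference a single overall score hides.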



# In the next part we'll see how model performance differs when trained on biased data.

In [None]:
# Checking the counts of each unique value.
target_counts = adult_data['income'].value_counts()
target_counts

0    37155
1    11687
Name: income, dtype: int64

Here we'll check for any gender bias that exists in the data. We already know that there is a class imbalance, as the number of records for males is greater than the number of records for females.

We have a few fairness metrics available to assess existing biases in data. Here we'll focus on one of them: Disparate Impact (DI).

Disparate impact is a concept that refers to a situation where a particular policy or practice, although seemingly neutral, has a disproportionately negative effect on a certain group of people based on their protected characteristic(s) such as race, gender, or age. 

For example, a company's policy of requiring job candidates to have a certain level of education may seem neutral on its face, but it could disproportionately affect certain racial or ethnic groups who historically have had less access to quality education. In such cases, the policy could be considered to have a disparate impact on those groups.

Disparate impact is typically measured using statistical methods to determine whether there is a statistically significant difference in outcomes for different groups. If a policy or practice is found to have a disparate impact, it may be challenged as discriminatory under various conditions.

## Here is an example calculation of DI:

Assume a company has 100 job openings and receives 1,000 applications, 500 from men and 500 from women. The company hires 80 men and 20 women.

The selection rate for men is $80/500 = 0.16$, or 16%.

The selection rate for women is $20/500 = 0.04$, or 4%.

The ratio of the selection rate for women to the selection rate for men is $0.04/0.16 = 0.25$.

**DI = 0.25**

The DI is 0.25, which is less than 0.8, indicating that the company's hiring practice has a disparate impact on women. The acceptable range for DI is 0.8 to 1.25, and the ideal value is 1.
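
The calculation above can be sketched as a small function. The name `disparate_impact` and its parameters are illustrative and do not come from any library.

```python
def disparate_impact(unprivileged_selected, unprivileged_total,
                     privileged_selected, privileged_total):
    """Ratio of the unprivileged selection rate to the privileged selection rate."""
    unprivileged_rate = unprivileged_selected / unprivileged_total
    privileged_rate = privileged_selected / privileged_total
    return unprivileged_rate / privileged_rate

# Hiring example from the text: 20 of 500 women and 80 of 500 men are hired
di = disparate_impact(20, 500, 80, 500)

# Range used in this notebook: 0.8 <= DI <= 1.25
within_range = 0.8 <= di <= 1.25
print(round(di, 2), within_range)
```

Because DI is a ratio of rates, it does not depend on how many people are in each group, only on the share of each group that receives the favorable outcome.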

 
**Note**: DI alone is not enough to assess bias; other statistical methods and qualitative analysis may also be needed to fully assess whether a practice has a disparate impact.
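
As one example of such a statistical method, a chi-square test of independence can check whether the difference in selection rates in the hiring example is statistically significant. This is a sketch using `scipy.stats.chi2_contingency`. It is only one of several possible tests and is not part of the original analysis.

```python
from scipy.stats import chi2_contingency

# Contingency table for the hiring example: rows = group, columns = (hired, not hired)
table = [
    [80, 420],  # men: 80 hired out of 500
    [20, 480],  # women: 20 hired out of 500
]

# Chi-square test of independence between group and hiring outcome
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}, dof = {dof}")
```

A very small p-value indicates that a gap this large would be unlikely if hiring were independent of gender. It says nothing about the cause of the gap, which is where qualitative analysis comes in.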

In [None]:
# Separate data into x and y for training and testing
from sklearn.model_selection import train_test_split
adult_df = adult_data.copy() # defensive code just to have original data intact
x = adult_df.drop(['income'], axis = 1)
y = adult_df['income'].astype('int')

# Shape of both datasets
print(x.shape, y.shape)

(48842, 13) (48842,)


In [None]:
# Creating test and train splits
# We will follow an 80-20 split pattern for our training and test data, respectively

x_train,x_test,y_train,y_test = train_test_split(x, y, test_size=0.2, random_state = 0)

Calculating **disparate impact** on data without training any model.

In [None]:
actual_test = x_test.copy()
actual_test['income_actual'] = y_test
actual_test.shape

(9769, 14)

In [None]:
# Privileged group: Males (1)
# Unprivileged group: Females (0)
male_df = actual_test[actual_test['gender'] == 1]
num_of_priviliged = male_df.shape[0]
female_df = actual_test[actual_test['gender'] == 0]
num_of_unpriviliged = female_df.shape[0]

In [None]:
unpriviliged_outcomes = female_df[female_df['income_actual'] == 1].shape[0]
unpriviliged_ratio = unpriviliged_outcomes/num_of_unpriviliged
unpriviliged_ratio

0.11175395858708892

In [None]:
priviliged_outcomes = male_df[male_df['income_actual'] == 1].shape[0]
priviliged_ratio = priviliged_outcomes/num_of_priviliged
priviliged_ratio

0.3056283731688512

In [None]:
# Calculating disparate impact
disparate_impact = unpriviliged_ratio / priviliged_ratio
print("Disparate Impact on raw data: " + str(disparate_impact))

Disparate Impact on raw data: 0.36565308851527323


Disparate Impact on the raw/original data = 0.36565308851527323, which is well below the acceptable range of 0.8 to 1.25. This indicates that the data is biased against the unprivileged group (females in this case).
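
The same DI calculation is repeated several times in this notebook, so it could be wrapped in a helper that works directly on a data frame. The sketch below is one possible version. The function name and its default arguments are assumptions for illustration, and it runs on a toy frame rather than `actual_test`.

```python
import pandas as pd

def disparate_impact_from_frame(df, group_col, outcome_col,
                                privileged=1, unprivileged=0, favorable=1):
    """DI = P(favorable | unprivileged) / P(favorable | privileged)."""
    is_favorable = df[outcome_col] == favorable
    unprivileged_rate = is_favorable[df[group_col] == unprivileged].mean()
    privileged_rate = is_favorable[df[group_col] == privileged].mean()
    return unprivileged_rate / privileged_rate

# Hypothetical toy frame using the notebook's coding (gender: 0 = female, 1 = male)
toy = pd.DataFrame({
    "gender":        [1, 1, 1, 1, 0, 0, 0, 0],
    "income_actual": [1, 1, 0, 0, 1, 0, 0, 0],
})
di_toy = disparate_impact_from_frame(toy, "gender", "income_actual")
print(di_toy)
```

With such a helper, a call like `disparate_impact_from_frame(actual_test, 'gender', 'income_actual')` would be expected to reproduce the value computed step by step above.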

# Training a model on the original dataset to check DI after training a Logistic Regression model.

In [None]:
from sklearn.linear_model import LogisticRegression

# Liblinear is a solver that is very fast for small datasets, like ours
model = LogisticRegression(solver='liblinear', class_weight='balanced')

In [None]:
# using x_train, y_train from above split
model.fit(x_train, y_train)

In [None]:
# Let's see how well it predicted for a few samples
y_pred = pd.Series(model.predict(x_test))
y_test = y_test.reset_index(drop=True)
z = pd.concat([y_test, y_pred], axis=1)
z.columns = ['True', 'Prediction'] # naming y_test, y_pred columns
z.head()
# All five predictions in this sample match the true labels

Unnamed: 0,True,Prediction
0,0,0
1,1,1
2,1,1
3,0,0
4,0,0


In [None]:

from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))

Accuracy: 0.7552461869178012
Precision: 0.49420529801324503
Recall: 0.7624521072796935


# Calculating disparate impact on predicted values by model trained on original dataset

In [None]:
# Add the predictions as a new column to a copy of x_test so that we can calculate the fairness metric DI.
y_pred = model.predict(x_test)
original_output = x_test.copy() # copy so that x_test itself keeps only the feature columns
original_output['income_predicted'] = y_pred
original_output.head()

Unnamed: 0,age,workclass,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income_predicted
38113,19,4,9,13,4,10,1,4,1,0,0,40,39,0
39214,40,5,14,15,2,10,0,4,1,0,0,36,39,1
44248,32,4,12,14,0,12,1,4,1,4787,0,45,39,1
10283,37,6,4,3,2,5,0,4,1,0,0,55,39,0
26724,0,6,0,6,4,8,3,4,0,0,0,24,39,0


In [None]:
# Privileged group: Males (1)
# Unprivileged group: Females (0)
male_df = original_output[original_output['gender'] == 1]
num_of_priviliged = male_df.shape[0]
female_df = original_output[original_output['gender'] == 0]
num_of_unpriviliged = female_df.shape[0]

In [None]:
unpriviliged_outcomes = female_df[female_df['income_predicted'] == 1].shape[0]
unpriviliged_ratio = unpriviliged_outcomes/num_of_unpriviliged
unpriviliged_ratio

0.14403166869671133

In [None]:
priviliged_outcomes = male_df[male_df['income_predicted'] == 1].shape[0]
priviliged_ratio = priviliged_outcomes/num_of_priviliged
priviliged_ratio

0.4858905165767155

In [None]:
# Calculating disparate impact
disparate_impact = unpriviliged_ratio / priviliged_ratio
print("Disparate Impact, data trained using LR: " + str(disparate_impact))

Disparate Impact, data trained using LR: 0.2964282359562593


We saw that DI was low for the raw data (0.366). When we trained an LR model on the data, the model probably amplified the existing bias: DI for its predictions dropped to 0.2964282359562593, which is again outside the acceptable range. This suggests that the data is biased against the unprivileged group (females in this case) and that modeling biased data can amplify the existing bias.

# Applying the Disparate Impact Remover, an algorithm provided by the IBM AIF360 toolkit, to mitigate bias.

We'll install the packages shown below so that the algorithm works: aif360 itself and BlackBoxAuditing, which the Disparate Impact Remover relies on (otherwise it might throw an error).

In [None]:
!pip install aif360

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting aif360
  Downloading aif360-0.5.0-py3-none-any.whl (214 kB)
Installing collected packages: aif360
Successfully installed aif360-0.5.0


In [None]:
!pip install BlackBoxAuditing

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting BlackBoxAuditing
  Downloading BlackBoxAuditing-0.1.54.tar.gz (2.6 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: BlackBoxAuditing
  Building wheel for BlackBoxAuditing (setup.py) ... done
  Created wheel for BlackBoxAuditing: filename=BlackBoxAuditing-0.1.54-py2.py3-none-any.whl size=1394768 sha256=f2be209cb908427dcc801b4b5580c190f0b0e0413fafa7919d1373adda7b68c1
  Stored in directory: /root/.cache/pip/wheels/8f/3c/f8/2ad8792a15548dfb008ec5738566ea9e5aa8999311732473fa
Successfully built BlackBoxAuditing
Installing collected packages: BlackBoxAuditing
Successfully installed BlackBoxAuditing-0.1.54


In [None]:
from aif360.datasets import BinaryLabelDataset
from aif360.algorithms.preprocessing import DisparateImpactRemover

# The AIF360 DisparateImpactRemover works with a specific data format, BinaryLabelDataset, so the code below converts our data into that format
binaryLabelDataset = BinaryLabelDataset(
    favorable_label=1,
    unfavorable_label=0,
    df=adult_df,
    label_names=['income'],
    protected_attribute_names=['gender'])
#print(binaryLabelDataset)


In [None]:
di = DisparateImpactRemover(repair_level = 1.0) # DisparateImpactRemover() with a repair_level = 1.0
dataset_transf_train = di.fit_transform(binaryLabelDataset) # transforming the data 
transformed = dataset_transf_train.convert_to_dataframe()[0] # converting transformed data into a dataframe for further processing
transformed.head()

Unnamed: 0,age,workclass,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,8.0,4.0,1.0,7.0,4.0,7.0,3.0,2.0,1.0,0.0,0.0,35.0,39.0,0.0
1,21.0,4.0,11.0,9.0,2.0,5.0,0.0,4.0,1.0,0.0,0.0,43.0,39.0,0.0
2,11.0,2.0,7.0,12.0,2.0,11.0,0.0,4.0,1.0,0.0,0.0,35.0,39.0,1.0
3,27.0,4.0,15.0,10.0,2.0,7.0,0.0,2.0,1.0,7688.0,0.0,35.0,39.0,1.0
4,1.0,0.0,15.0,10.0,4.0,0.0,3.0,4.0,0.0,0.0,0.0,30.0,39.0,0.0
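
To see what the repair actually altered, the encoded data can be compared with the transformed data column by column. The sketch below does this for three rows and four columns copied from the two previews above (`before` and `after` are illustrative names), not for the full data frames.

```python
import pandas as pd

# Rows copied from the encoded data (before) and the repaired data (after) shown above
cols = ["age", "gender", "capital-gain", "hours-per-week"]
before = pd.DataFrame([[8, 1, 0, 40], [21, 1, 0, 50], [1, 0, 0, 30]], columns=cols)
after = pd.DataFrame([[8.0, 1.0, 0.0, 35.0], [21.0, 1.0, 0.0, 43.0], [1.0, 0.0, 0.0, 30.0]],
                     columns=cols)

# Count, per column, how many entries differ after the repair
changed_counts = (before != after).sum()
changed_columns = changed_counts[changed_counts > 0].index.tolist()
print(changed_columns)
```

In these rows only hours-per-week changes, and only for the male records. The gender column and the label are left untouched by the transformation.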


**Train a model using the dataset that underwent the pre-processing**

In [None]:
x_trans = transformed.drop(['income'], axis = 1)
y_trans = transformed['income']

# Liblinear is a solver that is effective for relatively smaller datasets.
model = LogisticRegression(solver='liblinear', class_weight='balanced')

# Splitting into test and training
# We will follow an 80-20 split pattern for our training and test data
x_trans_train,x_trans_test,y_trans_train,y_trans_test = train_test_split(x_trans, y_trans, test_size=0.2, random_state = 0)

In [None]:
model.fit(x_trans_train, y_trans_train)

In [None]:
# See how well it predicted with a couple values
y_trans_pred = pd.Series(model.predict(x_trans_test))
y_trans_test = y_trans_test.reset_index(drop=True)
z = pd.concat([y_trans_test, y_trans_pred], axis=1)
z.columns = ['True', 'Prediction']
z.head() # to check a few samples


Unnamed: 0,True,Prediction
0,0.0,0.0
1,1.0,1.0
2,1.0,1.0
3,0.0,0.0
4,0.0,0.0


In [None]:
# Evaluate against the labels from the transformed split (y_trans_test)
print("Accuracy:", metrics.accuracy_score(y_trans_test, y_trans_pred))
print("Precision:", metrics.precision_score(y_trans_test, y_trans_pred))
print("Recall:", metrics.recall_score(y_trans_test, y_trans_pred))

Accuracy: 0.7281195618794145
Precision: 0.45986928104575164
Recall: 0.7488292890591741


# Calculating disparate impact on predicted values by model trained on transformed dataset




In [None]:
# Add the predictions as a new column to a copy of x_trans_test so that we can calculate the fairness metric DI.
y_trans_pred = model.predict(x_trans_test)
transformed_output = x_trans_test.copy() # copy so that x_trans_test itself keeps only the feature columns
transformed_output['income_predicted'] = y_trans_pred
transformed_output.head()

Unnamed: 0,age,workclass,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income_predicted
38113,19.0,4.0,9.0,13.0,4.0,10.0,1.0,4.0,1.0,0.0,0.0,35.0,39.0,0.0
39214,40.0,5.0,14.0,15.0,2.0,10.0,0.0,4.0,1.0,0.0,0.0,31.0,39.0,1.0
44248,32.0,4.0,12.0,14.0,0.0,12.0,1.0,4.0,1.0,4687.0,0.0,39.0,39.0,1.0
10283,37.0,6.0,4.0,3.0,2.0,5.0,0.0,4.0,1.0,0.0,0.0,47.0,39.0,0.0
26724,0.0,6.0,0.0,6.0,4.0,7.0,3.0,4.0,0.0,0.0,0.0,24.0,39.0,0.0


In [None]:
# Privileged group: Males (1)
# Unprivileged group: Females (0)
male_df = transformed_output[transformed_output['gender'] == 1]
num_of_priviliged = male_df.shape[0]
female_df = transformed_output[transformed_output['gender'] == 0]
num_of_unpriviliged = female_df.shape[0]

In [None]:
unpriviliged_outcomes = female_df[female_df['income_predicted'] == 1].shape[0]
unpriviliged_ratio = unpriviliged_outcomes/num_of_unpriviliged
unpriviliged_ratio

0.2028014616321559

In [None]:
priviliged_outcomes = male_df[male_df['income_predicted'] == 1].shape[0]
priviliged_ratio = priviliged_outcomes/num_of_priviliged
priviliged_ratio

0.487124132613724

In [None]:

# Calculating disparate impact
disparate_impact = unpriviliged_ratio / priviliged_ratio
print("Disparate Impact, on data after transforming using disparate impact remover : " + str(disparate_impact))

Disparate Impact, on data after transforming using disparate impact remover : 0.41632398818756916


# Interpretation
We saw that DI was low for the raw data (0.366). When we trained an LR model on the data, the model probably amplified the existing bias: DI for its predictions dropped to 0.2964282359562593, which is again outside the acceptable range. This suggests that the data is biased against the unprivileged group (females in this case) and that modeling biased data can amplify the existing bias.

The Disparate Impact Remover mitigated some of the bias in the model's predictions: DI rose from 0.296 for the Logistic Regression model trained on the original data to 0.416. However, it could not bring DI into the acceptable range, and accuracy dropped slightly (from 0.755 to 0.728).
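
The three DI values reported in this notebook can be placed side by side for comparison. The sketch below hard-codes the rounded values from the outputs above and flags whether each one falls inside the 0.8 to 1.25 range.

```python
import pandas as pd

# DI values reported in this notebook (rounded to four decimals)
summary = pd.DataFrame({
    "setting": ["raw labels (test split)",
                "LR predictions, original data",
                "LR predictions, repaired data"],
    "disparate_impact": [0.3657, 0.2964, 0.4163],
})

# Flag which settings fall inside the 0.8 to 1.25 range used above
summary["within_range"] = summary["disparate_impact"].between(0.8, 1.25)
print(summary)
```

None of the three settings reaches the range, although the repaired data gives the highest DI of the three.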