## Does removing the `sex` column result in an unbiased ML model?
"*If I remove a sensitive feature, say - `sex`, from my ML model before training, that ought to solve the problem of gender prejudice.*"

It turns out that this is not the case.

Gender-based prejudice does not result simply from the presence of a `sex` column in our data; it can be caused by other features as well. This is because of *multicollinearity*: other columns may also be correlated with `sex`, so they can act as proxies for it even after the column itself is removed.
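
A quick way to see how a proxy column can carry the same information as `sex` is to build a toy example. The sketch below uses synthetic data (the column names `is_husband` and `unrelated`, and the way they are generated, are assumptions made purely for illustration and are not part of this notebook's pipeline) and measures how strongly each column correlates with `sex_Male`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical toy data: `is_husband` can only be 1 when `sex_Male` is 1,
# so it leaks information about sex. `unrelated` is pure noise.
sex_male = rng.integers(0, 2, size=n)
is_husband = sex_male * rng.integers(0, 2, size=n)
unrelated = rng.normal(size=n)

toy = pd.DataFrame(
    {"sex_Male": sex_male, "is_husband": is_husband, "unrelated": unrelated}
)

# Pearson correlation of every other column with the sensitive column.
proxy_strength = toy.corr()["sex_Male"].drop("sex_Male")
print(proxy_strength)
```

In this toy setup `is_husband` is clearly positively correlated with `sex_Male` while `unrelated` stays near zero, so a model that never sees `sex_Male` can still pick up much of its signal through `is_husband`.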

The same goes for `race`, `ethnicity`, and other sensitive features.

For this reason, it is actually important that the sensitive features be ***included*** in our ML model, so that FairAI can work to reduce the disparity in the model's predictions.

This notebook is an attempt to validate this, and to further reinforce the idea that ***bias-mitigation algorithms are an absolute necessity***: prejudice cannot be solved by simply removing the sensitive feature from the training data.

In [1]:
# import relevant dependencies
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from fairai.utils.metrics import disparate_impact

In [2]:
# fetch the raw Adult census data (OpenML data_id=1590) via sklearn.datasets
raw_data = fetch_openml(data_id=1590, as_frame=True)

# preview raw-data
raw_data.frame

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.0 | Private | 226802.0 | 11th | 7.0 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 1 | 38.0 | Private | 89814.0 | HS-grad | 9.0 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0.0 | 0.0 | 50.0 | United-States | <=50K |
| 2 | 28.0 | Local-gov | 336951.0 | Assoc-acdm | 12.0 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0.0 | 0.0 | 40.0 | United-States | >50K |
| 3 | 44.0 | Private | 160323.0 | Some-college | 10.0 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688.0 | 0.0 | 40.0 | United-States | >50K |
| 4 | 18.0 | NaN | 103497.0 | Some-college | 10.0 | Never-married | NaN | Own-child | White | Female | 0.0 | 0.0 | 30.0 | United-States | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48837 | 27.0 | Private | 257302.0 | Assoc-acdm | 12.0 | Married-civ-spouse | Tech-support | Wife | White | Female | 0.0 | 0.0 | 38.0 | United-States | <=50K |
| 48838 | 40.0 | Private | 154374.0 | HS-grad | 9.0 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0.0 | 0.0 | 40.0 | United-States | >50K |
| 48839 | 58.0 | Private | 151910.0 | HS-grad | 9.0 | Widowed | Adm-clerical | Unmarried | White | Female | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 48840 | 22.0 | Private | 201490.0 | HS-grad | 9.0 | Never-married | Adm-clerical | Own-child | White | Male | 0.0 | 0.0 | 20.0 | United-States | <=50K |


In [3]:
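# Data pre-processing
# One-hot encode categorical columns (e.g. `sex` becomes `sex_Female` and `sex_Male`)
X_raw = pd.get_dummies(raw_data.data)
# Scale every feature to the [0, 1] range
X = pd.DataFrame(MinMaxScaler().fit_transform(X_raw), columns=X_raw.columns)
# Binary target: 1 if income is ">50K", else 0
y = 1 * (raw_data.target == ">50K")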
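
To see where the `sex_Male` and `sex_Female` column names used in the later cells come from, here is a small sketch of what `pd.get_dummies` does to a categorical column. The three-row frame `toy` is made up for illustration only.

```python
import pandas as pd

# Hypothetical miniature frame with one numeric and one categorical column.
toy = pd.DataFrame({"age": [25, 38, 18], "sex": ["Male", "Male", "Female"]})

# Numeric columns pass through unchanged; each category of `sex`
# becomes its own indicator column named `sex_<category>`.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

Because the two indicator columns are complementary, dropping only one of them would still leave the model with full knowledge of `sex`, which is why both have to be dropped together when training without the sensitive feature.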

In [4]:
# Step 1: Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Step 2: Model Training
LR = LogisticRegression(solver="liblinear", random_state=42)
LR.fit(X_train, y_train)

# Step 3: Prediction
y_test_pred = LR.predict(X_test)

In [5]:
# Disparate impact of the ground-truth test labels
disparate_impact(X_test, y_test, 'sex_Male')

0.3544886573259138
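
For reference, disparate impact is commonly defined as the ratio of the positive-outcome rate of the unprivileged group to that of the privileged group, so a value of 1.0 means parity. The sketch below implements that common definition. It is an assumption that `fairai.utils.metrics.disparate_impact` computes the same quantity, and the helper name `disparate_impact_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd


def disparate_impact_sketch(X, y, privileged_col):
    """Positive-outcome rate of the unprivileged group divided by that of
    the privileged group (rows where `privileged_col` equals 1)."""
    y = np.asarray(y)
    privileged = np.asarray(X[privileged_col]) == 1
    rate_privileged = y[privileged].mean()
    rate_unprivileged = y[~privileged].mean()
    return rate_unprivileged / rate_privileged


# Toy data: 2 of 4 males and 1 of 4 females receive the positive outcome.
X_toy = pd.DataFrame({"sex_Male": [1, 1, 1, 1, 0, 0, 0, 0]})
y_toy = np.array([1, 1, 0, 0, 1, 0, 0, 0])

di_toy = disparate_impact_sketch(X_toy, y_toy, "sex_Male")
print(di_toy)  # 0.25 / 0.5 = 0.5
```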

In [6]:
# Disparate impact of the predictions from the model trained WITH sex features
disparate_impact(X_test, y_test_pred, 'sex_Male')

0.2690576350331758

In [7]:
# Step 1: Train/test split (same split as before)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Step 2: Model Training WITHOUT sex features
LR = LogisticRegression(solver="liblinear", random_state=42)
LR.fit(X_train.drop(['sex_Male', 'sex_Female'], axis=1), y_train)

# Step 3: Prediction
y_test_pred = LR.predict(X_test.drop(['sex_Male', 'sex_Female'], axis=1))

In [8]:
# Disparate impact of the ground-truth test labels (unchanged, same split)
disparate_impact(X_test, y_test, 'sex_Male')

0.3544886573259138

In [9]:
# Disparate impact of the predictions from the model trained WITHOUT sex features
disparate_impact(X_test, y_test_pred, 'sex_Male')

0.2773051333493612
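
To put the numbers above side by side, the short sketch below computes how much the disparate impact of the predictions moved once the `sex` columns were dropped. The values are copied from the cell outputs in this notebook. Treating 1.0 as parity is an assumption that holds if the metric is the usual ratio of positive-outcome rates.

```python
# Values copied from the notebook outputs above.
di_labels = 0.3544886573259138       # ground-truth test labels
di_with_sex = 0.2690576350331758     # predictions, model trained with sex features
di_without_sex = 0.2773051333493612  # predictions, model trained without sex features

# How much did dropping the sex columns move the metric?
change = di_without_sex - di_with_sex

# Assumption: 1.0 means parity, so this is the gap that still remains.
remaining_gap = 1.0 - di_without_sex

print(f"change from dropping sex: {change:.4f}")
print(f"remaining gap to parity:  {remaining_gap:.4f}")
```

The change is under 0.01, and the predictions from both models show a lower disparate impact than the labels themselves, which is consistent with the claim that dropping the sensitive column alone does not remove the disparity.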