# Demonstrating Bias and Fairness Intervention in Data Science

This notebook uses an artificial dataset to illustrate how bias can arise in a data science project and how an intervention (reweighting) can be used to work toward fairer model predictions.

In [35]:
# Import Required Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## 1. Artificial Dataset with Strong Bias

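In this dataset, the approval rule itself depends on gender: male applicants are approved when income exceeds 40,000 and age exceeds 30, whereas female applicants need income above 60,000 and age above 40. Males therefore have a much higher chance of loan approval than females, and this will be reflected in the model's predictions before any fairness intervention.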

In [36]:
# Generate strongly biased artificial dataset
np.random.seed(0)
n_samples = 400
age = np.random.randint(20, 65, n_samples)
gender = np.random.choice(['Male', 'Female'], n_samples)
income = np.random.normal(48000, 12000, n_samples).astype(int)
# Strong bias: males have much higher approval rate than females
loan_approved = (
    ((income > 40000) & (age > 30) & (gender == 'Male')) |
    ((income > 60000) & (age > 40) & (gender == 'Female'))
).astype(int)
data = pd.DataFrame({
    'Age': age,
    'Gender': gender,
    'Income': income,
    'Loan_Approved': loan_approved
})
data.head()

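|   | Age | Gender | Income | Loan_Approved |
|---|-----|--------|--------|---------------|
| 0 | 64 | Male | 38868 | 0 |
| 1 | 20 | Male | 46974 | 0 |
| 2 | 23 | Female | 38520 | 0 |
| 3 | 23 | Male | 42349 | 0 |
| 4 | 59 | Female | 35603 | 0 |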
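Before training anything, it can help to measure the bias in the labels themselves. The sketch below is not part of the original notebook; the helper name `approval_rate_by_group` and the toy values are hypothetical, and the column names are assumed to match the `data` frame built above.

```python
import pandas as pd

def approval_rate_by_group(df, group_col="Gender", label_col="Loan_Approved"):
    """Return the share of positive labels within each group."""
    return df.groupby(group_col)[label_col].mean()

# Tiny hand-made frame (hypothetical values) to show the shape of the result.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Loan_Approved": [1, 0, 0, 0],
})
rates = approval_rate_by_group(toy)
print(rates)
```

Calling `approval_rate_by_group(data)` on the notebook's dataset would give the label-level approval rate for each gender, which is the disparity the model later learns to reproduce.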


## 2. Detecting Bias in Model Predictions

Train a logistic regression model on age, income, and one-hot-encoded gender, then compare the predicted approval rates for males and females on the test set. You should see a large disparity before any intervention.

In [37]:
# Train a baseline model and compare prediction rates by gender
X = data[['Age', 'Gender', 'Income']]
# One-hot encode Gender; drop_first=True leaves a single Gender_Male indicator
X = pd.get_dummies(X, drop_first=True)
y = data['Loan_Approved']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
# Collect test-set labels and predictions, restoring the original Gender column
results = X_test.copy()
results['Actual'] = y_test.values
results['Predicted'] = preds
results['Gender'] = data.loc[X_test.index, 'Gender'].values

# Compare positive prediction rates by gender
groups = results.groupby('Gender')
for gender, group in groups:
    rate = (group['Predicted'] == 1).mean()
    print(f'Predicted approval rate for {gender}: {rate:.2f}')

Predicted approval rate for Female: 0.02
Predicted approval rate for Male: 0.46
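The per-group rates printed above can be condensed into two common demographic parity summaries: the difference between the highest and lowest group rate, and their ratio. The following is a minimal sketch, not the notebook's own code; the function name `parity_summary` and the toy inputs are hypothetical.

```python
import pandas as pd

def parity_summary(predicted, group):
    """Return per-group positive rates, their max-min difference, and min/max ratio."""
    frame = pd.DataFrame({"predicted": list(predicted), "group": list(group)})
    rates = frame.groupby("group")["predicted"].mean()
    difference = rates.max() - rates.min()
    # The ratio is undefined when no group receives any positive prediction.
    ratio = rates.min() / rates.max() if rates.max() > 0 else float("nan")
    return rates, difference, ratio

# Hypothetical predictions: 2 of 4 males and 1 of 4 females approved.
toy_pred = [1, 1, 0, 0, 1, 0, 0, 0]
toy_group = ["Male"] * 4 + ["Female"] * 4
rates, difference, ratio = parity_summary(toy_pred, toy_group)
print(rates)
print(f"difference={difference:.2f}, ratio={ratio:.2f}")
```

Applied to the rates reported above (0.02 for females, 0.46 for males), the difference is 0.44 and the ratio is roughly 0.04. In the notebook, `parity_summary(results['Predicted'], results['Gender'])` would compute the same quantities directly.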


Note: fitting this model produced a scikit-learn convergence warning (`STOP: TOTAL NO. of ITERATIONS REACHED LIMIT`). The solver hit its iteration limit (max_iter=100), so the coefficients may not be fully optimized. scikit-learn suggests increasing the number of iterations, scaling the data (https://scikit-learn.org/stable/modules/preprocessing.html), or trying an alternative solver (https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression).

## 3. Fairness Intervention: Reweighting

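We apply a simple reweighting technique, class-balanced sample weights computed from the training labels, and compare the approval rates before and after the intervention. Note that these weights balance approved and rejected examples rather than gender groups, so they do not target demographic parity directly. In the output below, the female approval rate rises from 0.02 to 0.11, but the male rate also rises from 0.46 to 0.67, so the absolute gap grows from 0.44 to 0.56 even though the female-to-male ratio improves.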

In [38]:
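# Fairness metric: demographic parity (compare predicted approval rates across genders)
# Mitigation: simple reweighting with class-balanced sample weights.
# Note: class_weight='balanced' weights examples by label frequency, not by gender.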
from sklearn.utils import compute_sample_weight
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)
model_balanced = LogisticRegression()
model_balanced.fit(X_train, y_train, sample_weight=sample_weights)
preds_balanced = model_balanced.predict(X_test)
results['Predicted_Balanced'] = preds_balanced
for gender, group in results.groupby('Gender'):
    rate = (group['Predicted_Balanced'] == 1).mean()
    print(f'Balanced model approval rate for {gender}: {rate:.2f}')

Balanced model approval rate for Female: 0.11
Balanced model approval rate for Male: 0.67


Note: fitting the reweighted model produced the same scikit-learn convergence warning (`STOP: TOTAL NO. of ITERATIONS REACHED LIMIT`, max_iter=100), so these rates also come from a fit that stopped at the iteration limit. The same remedies apply: more iterations, feature scaling (https://scikit-learn.org/stable/modules/preprocessing.html), or an alternative solver (https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression).
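Because class-balanced weights ignore gender, a group-aware alternative may be worth comparing. The sketch below follows the reweighing idea of Kamiran and Calders, where each example gets the weight P(group) · P(label) / P(group, label), so that group and label look statistically independent in the weighted data. It is a swapped-in technique rather than the notebook's method, and the function name `reweighing_weights` and the toy inputs are hypothetical.

```python
import pandas as pd

def reweighing_weights(group, label):
    """Weight each row by P(group) * P(label) / P(group, label)."""
    frame = pd.DataFrame({"group": list(group), "label": list(label)})
    n = len(frame)
    p_group = frame["group"].value_counts() / n
    p_label = frame["label"].value_counts() / n
    p_joint = frame.groupby(["group", "label"]).size() / n
    weights = [
        p_group.loc[g] * p_label.loc[y] / p_joint.loc[(g, y)]
        for g, y in zip(frame["group"], frame["label"])
    ]
    return pd.Series(weights, index=frame.index)

# Hypothetical labels: 3 of 4 males approved, 1 of 4 females approved.
toy_group = ["Male"] * 4 + ["Female"] * 4
toy_label = [1, 1, 1, 0, 1, 0, 0, 0]
w = reweighing_weights(toy_group, toy_label)

# After weighting, the weighted approval rate is the same in both groups.
toy = pd.DataFrame({"group": toy_group, "label": toy_label, "w": w})
weighted_rate = toy.groupby("group").apply(
    lambda g: (g["w"] * g["label"]).sum() / g["w"].sum()
)
print(weighted_rate)
```

In the notebook, weights computed from the training rows' gender and `y_train` could be passed as `sample_weight` to `LogisticRegression.fit` in place of the class-balanced weights. Since the model also receives gender as a feature, equal weighted label rates do not guarantee equal predicted rates, so the per-gender comparison should be repeated after refitting.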
