### Bias & Fairness in Data: Bias Mitigation Techniques
**Question**: Using the Adult Income dataset, apply the reweighing technique to balance
class weights across a sensitive attribute (e.g., gender).
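
For reference, reweighing as it is usually described (commonly attributed to Kamiran and Calders) gives each (sensitive attribute, label) cell the weight P(s) * P(y) / P(s, y), i.e. the expected cell frequency under independence divided by the observed one. The sketch below illustrates that formula on a small made-up table; the helper name `reweighing_weights` and the toy data are assumptions for illustration only, not part of the Adult dataset or of any library.

```python
import pandas as pd

def reweighing_weights(frame, sensitive, label):
    """Hypothetical helper: one weight per row, P(s) * P(y) / P(s, y)."""
    n = len(frame)
    p_s = frame.groupby(sensitive)[label].transform('size') / n
    p_y = frame.groupby(label)[sensitive].transform('size') / n
    p_sy = frame.groupby([sensitive, label])[label].transform('size') / n
    return p_s * p_y / p_sy

# Toy data (made up): 4 women with 1 positive label, 6 men with 4 positive labels
toy = pd.DataFrame({
    'sex':    ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M'],
    'income': [0, 0, 0, 1, 0, 0, 1, 1, 1, 1],
})
toy['weight'] = reweighing_weights(toy, 'sex', 'income')

# Weighted positive rate per group; after reweighing both groups share the same rate
weighted_pos = (toy['weight'] * toy['income']).groupby(toy['sex']).sum()
weighted_all = toy['weight'].groupby(toy['sex']).sum()
weighted_rate = weighted_pos / weighted_all
print(weighted_rate)
```

With these weights the weighted positive rate is the same in both groups, while the total weight of each group and of each label stays equal to its original row count.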

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Step 1: Load Adult Income dataset (from UCI repository)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'
]

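# Note: with skipinitialspace=True the missing-value marker is read as '?', so
# na_values=' ?' matches nothing and '?' entries stay in the data as ordinary strings.
# Pass na_values='?' instead if those rows should become NaN.
df = pd.read_csv(url, header=None, names=column_names, na_values=' ?', skipinitialspace=True)

# Drop rows with missing values for simplicity
# (with the settings above no row is dropped; the output below covers all 32,561 rows)
df.dropna(inplace=True)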

# Step 2: Encode sensitive attribute and label
# Sensitive attribute: 'sex' (Male=1, Female=0)
df['sex_encoded'] = LabelEncoder().fit_transform(df['sex'])
# Label: 'income' (<=50K=0, >50K=1)
df['income_encoded'] = (df['income'] == '>50K').astype(int)

# Step 3: Calculate counts for reweighing
# Groups: sensitive attribute (sex) and label (income)
group_counts = df.groupby(['sex_encoded', 'income_encoded']).size().reset_index(name='count')
total = len(df)

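# Calculate a weight for each (sex, income) group as the inverse of its frequency,
# so that all groups carry the same total weight.
# Formula: weight = total_samples / (number_of_groups * count_of_group)
# Note: this differs from the classic reweighing weight P(sex) * P(income) / P(sex, income),
# which keeps the marginal distributions and only removes the sex-income dependence.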
num_groups = group_counts.shape[0]

group_counts['weight'] = total / (num_groups * group_counts['count'])

# Step 4: Assign weights to each instance in original df
# Vectorized: look up each row's group size, then apply the same formula as above
group_size = df.groupby(['sex_encoded', 'income_encoded'])['income_encoded'].transform('size')
df['weight'] = total / (num_groups * group_size)

# Step 5: Show original vs reweighed class distribution by gender
print("\nOriginal distribution:")
print(pd.crosstab(df['sex'], df['income']))

print("\nAverage weights assigned per group:")
print(group_counts[['sex_encoded', 'income_encoded', 'weight']])

# To use weights in model training, pass 'weight' as sample_weight parameter in classifiers

# Optional: Check total weight per group after reweighing
# (by construction each group sums to total / num_groups)
weighted_counts = df.groupby(['sex_encoded', 'income_encoded'])['weight'].sum().reset_index()
print("\nWeighted counts (sum of weights) per group:")
print(weighted_counts)


Original distribution:
income  <=50K  >50K
sex                
Female   9592  1179
Male    15128  6662

Average weights assigned per group:
   sex_encoded  income_encoded    weight
0            0               0  0.848650
1            0               1  6.904368
2            1               0  0.538092
3            1               1  1.221893

Weighted counts (sum of weights) per group:
   sex_encoded  income_encoded   weight
0            0               0  8140.25
1            0               1  8140.25
2            1               0  8140.25
3            1               1  8140.25
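
The comment above about passing the weights as `sample_weight` can be sketched as follows. This is a minimal illustration on synthetic data, not the Adult dataset: the `demo` frame, the `hours` feature, and the label-generating rule are all assumptions made so that the block runs without a download. `LogisticRegression.fit` accepts a `sample_weight` argument, as do many other scikit-learn estimators.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the notebook's df (assumption: not real Adult data)
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    'sex_encoded': rng.integers(0, 2, size=n),
    'hours': rng.normal(40, 10, size=n),
})
logits = 0.05 * (demo['hours'] - 40) + 1.0 * demo['sex_encoded'] - 0.8
demo['income_encoded'] = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Same inverse-frequency weights as in the notebook cell above
cells = demo.groupby(['sex_encoded', 'income_encoded'])
cell_size = cells['hours'].transform('size')
demo['weight'] = len(demo) / (cells.ngroups * cell_size)

# Pass the per-row weights to the classifier at fit time
X = demo[['hours']]
y = demo['income_encoded']
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y, sample_weight=demo['weight'])
print(clf.coef_, clf.intercept_)
```

The weights only affect the training loss; any fairness effect should still be checked on held-out data, for example by comparing positive prediction rates per group.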
