Goal: Predict the probability, by gender, of getting hired by a company

Evaluation Criteria: Prioritise fairness: the predicted probability of getting hired should be unbiased with respect to gender

Description: I will train a simple logistic regression model to predict the probability of
having income >50K using the Adult dataset, dropping gender from the inputs to
encourage demographic parity. I will preprocess numeric features with scaling
and categorical features with one-hot encoding, split the data into an 80/20
stratified train/test set, and train the model on the training portion. To
evaluate fairness, I will compute the ROC AUC separately for male and female
test subsets and report their average (macro group AUC) as our metric. This
encourages both accuracy and equal performance across genders.
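
As a minimal sketch of the metric described above, the snippet below computes a per-group ROC AUC and its unweighted mean on a tiny hand-made example. The helper names `macro_group_auc` and `mean_score_gap` and the toy arrays are illustrative assumptions, not part of the notebook. The second helper reports the gap in mean predicted score between groups, because dropping gender from the inputs does not by itself guarantee demographic parity.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def macro_group_auc(y_true, scores, groups):
    """Return (per-group AUC dict, unweighted mean of those AUCs).

    Hypothetical helper for illustration; groups containing a single
    class are skipped because ROC AUC is undefined for them.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    groups = np.asarray(groups)
    per_group = {}
    for g in np.unique(groups):
        mask = groups == g
        if len(np.unique(y_true[mask])) < 2:
            continue
        per_group[g] = roc_auc_score(y_true[mask], scores[mask])
    if not per_group:
        raise ValueError("no group contains both classes")
    return per_group, float(np.mean(list(per_group.values())))


def mean_score_gap(scores, groups):
    """Largest difference in mean predicted score between any two groups."""
    scores = np.asarray(scores)
    groups = np.asarray(groups)
    means = [scores[groups == g].mean() for g in np.unique(groups)]
    return float(max(means) - min(means))


# Toy data, made up for this example only.
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
scores = [0.1, 0.9, 0.4, 0.6, 0.7, 0.9, 0.3, 0.2]
groups = ["Male"] * 4 + ["Female"] * 4

per_group, macro = macro_group_auc(y_true, scores, groups)
gap = mean_score_gap(scores, groups)
print(per_group, macro, gap)
```

Reporting the per-group values alongside their mean keeps a large difference between groups visible, since an average alone can hide it.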

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline

# Load data
df = pd.read_csv("./input/adult_reconstruction.csv")

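# Prepare target and sensitive attribute.
# Assumption: "income" may be stored as a number in this CSV, in which case
# .str fails; threshold numeric values at 50,000, otherwise compare the label.
if pd.api.types.is_numeric_dtype(df["income"]):
    df["target"] = (df["income"] > 50000).astype(int)
else:
    df["target"] = (df["income"].astype(str).str.strip() == ">50K").astype(int)
sensitive = df["gender"].astype(str).str.strip()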

# Features: drop target, gender, income
X = df.drop(["income", "gender", "target"], axis=1)
y = df["target"]

# Identify numeric and categorical columns
num_cols = ["age", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
cat_cols = [c for c in X.columns if c not in num_cols]

# Split data
X_train, X_test, y_train, y_test, sens_train, sens_test = train_test_split(
    X, y, sensitive, test_size=0.2, stratify=y, random_state=42
)

# Build preprocessing + model pipeline
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)
model = Pipeline(
    [
        ("prep", preprocessor),
        ("clf", LogisticRegression(solver="liblinear", random_state=42)),
    ]
)

# Train
model.fit(X_train, y_train)

# Predict probabilities
proba = model.predict_proba(X_test)[:, 1]

# Compute AUC per gender
auc_vals = []
for grp in ["Male", "Female"]:
    # Convert to a positional NumPy mask so it indexes the proba array safely
    mask = (sens_test == grp).to_numpy()
    if mask.sum() > 0:
        auc_vals.append(roc_auc_score(y_test[mask], proba[mask]))
macro_auc = sum(auc_vals) / len(auc_vals)

# Print evaluation
print(f"Macro group AUC: {macro_auc:.4f}")

AttributeError: Can only use .str accessor with string values!
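
The message above does not name the column, but it means `.str` was called on a column that pandas did not load as strings. A minimal sketch for checking this follows, assuming a small stand-in DataFrame. The values are made up, `to_binary_target` is a hypothetical helper, and the 50,000 threshold is an assumption about how a numeric income column would map to the >50K label.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Stand-in frame; in the notebook this would be the loaded CSV.
df = pd.DataFrame(
    {"income": [32000, 61000, 50000], "gender": [" Male", "Female ", "Male"]}
)

# Inspect dtypes first: .str only works on columns holding strings.
print(df.dtypes)


def to_binary_target(income, threshold=50000):
    """Map an income column to 0/1, whether it holds numbers or labels.

    Assumption: numeric values are compared with `threshold`;
    string values are labels such as ">50K".
    """
    if is_numeric_dtype(income):
        return (income > threshold).astype(int)
    return (income.astype(str).str.strip() == ">50K").astype(int)


target_numeric = to_binary_target(df["income"])
target_labels = to_binary_target(pd.Series([" >50K", "<=50K"]))

# Casting to str first makes .str.strip() safe even for mixed-type columns.
gender_clean = df["gender"].astype(str).str.strip()
```

Checking `df.dtypes` right after loading shows which of `income` and `gender` needs this handling before the rest of the cell runs.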