Goal: Predict the probability of getting hired by a company, based on education

Evaluation Criteria: Prioritise accuracy in estimating the probability of getting hired by a company, based on education

Description: We load the adult hiring dataset, one-hot encode the categorical
features, and split the data into training and validation sets. We fit a logistic
regression to predict the probability of having income >50K as a proxy for
"getting hired." We evaluate the probabilistic predictions using log loss on the
hold-out validation set. This simple model provides a baseline for accuracy in
estimating hiring probability from education and other demographics.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

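# Load and prepare data
df = pd.read_csv("./input/adult_reconstruction.csv")
# Binarize the target: 1 where income equals the string ">50K", else 0.
# If no row holds exactly that string, every label becomes 0.
df["income"] = (df["income"] == ">50K").astype(int)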

# Features and target
X = df.drop("income", axis=1)
y = df["income"]

# One-hot encode categorical variables
X = pd.get_dummies(X, drop_first=True)

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict probabilities and evaluate
y_pred_proba = model.predict_proba(X_val)[:, 1]
loss = log_loss(y_val, y_pred_proba)
print(f"Log Loss: {loss:.4f}")

ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0
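
The error indicates that the target contained only class 0 when the model was fit, which means the comparison against ">50K" matched no rows. One plausible cause, not confirmed from this output, is that the income column in this file is numeric or uses different label formatting (extra whitespace, a trailing period). The sketch below shows one way to binarize the target under either assumption and to fail early if only one class results. The helper name binarize_income, the 50,000 threshold, and the toy values are all hypothetical and are not taken from the dataset.

```python
import pandas as pd


def binarize_income(income: pd.Series, threshold: float = 50_000) -> pd.Series:
    """Return 1 where income is above the threshold (or labelled '>50K'), else 0."""
    if pd.api.types.is_numeric_dtype(income):
        # Assumption: a numeric column holds raw income amounts.
        labels = income > threshold
    else:
        # Assumption: a string column uses '>50K' / '<=50K' style labels,
        # possibly with stray whitespace or a trailing period.
        cleaned = income.astype(str).str.strip().str.rstrip(".")
        labels = cleaned == ">50K"
    labels = labels.astype(int)
    if labels.nunique() < 2:
        # Fail early, before the solver does, and show what the raw values look like.
        raise ValueError(
            f"Target has a single class; sample raw values: {income.unique()[:5]}"
        )
    return labels


# Hypothetical toy columns, one per assumed storage format.
numeric_income = pd.Series([12_000, 75_000, 49_999, 120_000])
string_income = pd.Series([" >50K", "<=50K", ">50K.", " <=50K"])

print(binarize_income(numeric_income).tolist())
print(binarize_income(string_income).tolist())
```

Before applying anything like this to the real file, inspecting df["income"].dtype and df["income"].unique() would show which assumption, if either, holds.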