
# Adult Census Income Prediction — ML Case Study

**Objective:** Predict whether an individual's income exceeds $50K/year (binary classification) using the UCI Adult Census dataset.

**Deliverable:** A single, self-contained notebook with:
- Clean, well-commented code
- Clear rationale for each decision
- Inline visualizations for EDA and model evaluation
- Two supervised models: **Logistic Regression** and **Naive Bayes** (no XGBoost)
- Overfitting controls and fair evaluation on a true holdout set

**Author:** Tanmay



## 1. Business Problem (Plain English)

We want to predict whether a person is a **high earner** (>$50K) using demographic and work-related inputs (age, education, occupation, hours per week, etc.).

**Why businesses care:**
- **Recruiting:** prioritize candidates for senior roles.
- **Marketing:** segment and target premium products/services.
- **Policy/Analytics:** quantify income disparities across demographics.

**Success criteria:** Build a model with strong **precision/recall** on the >$50K class, good **ROC-AUC**, and clear explainability so stakeholders understand the drivers.



## 2. Assumptions & Guardrails

- We treat the UCI `adult.data` as **train** and `adult.test` as a **true external test** set (as published).
- We'll avoid data leakage by fitting transformers **only on train**.
- We'll use simple, robust preprocessing: missing value handling, one-hot encoding for categoricals, and appropriate scaling depending on model.
- We will not use black-box gradient boosting here; requested focus is **Logistic Regression** and **Naive Bayes**.
- We care about **calibrated probabilities** (for business thresholds) and **interpretability** (feature influence).
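
The leakage guardrail above can be illustrated with a minimal sketch. The arrays are made-up numbers rather than Adult data; the only point is that the scaler's statistics come from the training rows and are then reused unchanged on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for one numeric column (hypothetical values, not from the dataset)
X_train = np.array([[20.0], [30.0], [40.0], [50.0]])
X_test = np.array([[35.0], [60.0]])

scaler = StandardScaler()
scaler.fit(X_train)                        # statistics learned from train only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuses the train mean and std

print("mean learned from train:", scaler.mean_[0])
print("scaled test rows:", X_test_scaled.ravel())
```

Wrapping the transformers in a scikit-learn `Pipeline` gives the same guarantee inside cross-validation, because each fold refits the preprocessing on its own training portion.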


In [None]:

# Imports & Configuration

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt

# Modeling & Evaluation
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import (classification_report, confusion_matrix, roc_auc_score,
                             roc_curve, precision_recall_curve, accuracy_score)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# display() is a notebook built-in; importing it keeps the code runnable as a script too
from IPython.display import display

pd.set_option("display.max_columns", 120)
pd.set_option("display.width", 160)



## 3. Data Access & Loading (UCI)

We load the official Adult dataset directly from UCI:
- Train: `adult.data`
- Test: `adult.test`

We also standardize column names and convert "?" to missing values.
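
As a minimal sketch of the loading step, the snippet below parses a small in-memory sample instead of the real files. The two rows and the four-column subset are invented for illustration. One detail it shows: with `skipinitialspace=True`, pandas strips the space after each comma before comparing against `na_values`, so the marker to match is a bare `"?"`.

```python
import io
import pandas as pd

# Two made-up rows in the same ", "-separated layout as adult.data
# (column subset and values are illustrative only)
sample = "39, State-gov, Bachelors, <=50K\n54, ?, HS-grad, >50K\n"
cols = ["age", "workclass", "education", "income"]

df = pd.read_csv(
    io.StringIO(sample),
    header=None,
    names=cols,
    na_values="?",          # matches the token after the leading space is skipped
    skipinitialspace=True,
)

print(df)
print("missing workclass values:", int(df["workclass"].isna().sum()))
```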


In [None]:

# UCI URLs
URL_TRAIN = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
URL_TEST  = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"

# Schema from UCI (fixed order)
COLS = [
    "age","workclass","fnlwgt","education","education_num","marital_status",
    "occupation","relationship","race","sex","capital_gain","capital_loss",
    "hours_per_week","native_country","income"
]

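# Fields are separated by ", ". skipinitialspace=True strips the leading space,
# so missing values arrive as a bare "?" (na_values=" ?" would never match).
train_raw = pd.read_csv(URL_TRAIN, header=None, names=COLS, na_values="?", skipinitialspace=True)
# adult.test starts with a non-data line ("|1x3 Cross validator"), so skip it
test_raw  = pd.read_csv(URL_TEST,  header=None, names=COLS, na_values="?", skipinitialspace=True, skiprows=1)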

print(train_raw.shape, test_raw.shape)
train_raw.head()



## 4. Attribute Information (from UCI page)

**Features** (abridged summary):
- **age** (int)
- **workclass** (Private, Self-emp, Gov, etc.)
- **fnlwgt** (final sampling weight; often not predictive)
- **education** (Bachelors, HS-grad, etc.)
- **education_num** (ordinal code for `education`, 1-16; not literal years of schooling)
- **marital_status** (Married, Never-married, etc.)
- **occupation** (Tech-support, Craft-repair, etc.)
- **relationship** (Wife, Husband, Not-in-family, etc.)
- **race**, **sex**
- **capital_gain**, **capital_loss**
- **hours_per_week**
- **native_country**
- **income** (target: `<=50K` or `>50K`)

We'll treat `income` as the binary target (1 for `>50K`, 0 for `<=50K`). We'll also consider whether **fnlwgt** helps or hurts; many practitioners drop it.



## 5. Data Cleaning

Steps:
1. Strip any trailing periods from test labels (they come as `<=50K.` / `>50K.`).
2. Harmonize the target to 0/1.
3. Handle missing values (`?` loaded as `NaN`). We'll **drop every row that contains a missing value** to keep things simple and reproducible; in this dataset they occur only in `workclass`, `occupation`, and `native_country`.
4. Optional: Drop `fnlwgt` (often noisy for prediction). We'll keep it first, then check importance and optionally drop it.
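
Steps 1 and 2 can be sketched on a handful of label strings. The series below is hand-written to cover both spellings (with and without the trailing period) plus stray whitespace; it is not read from the files.

```python
import pandas as pd

# Labels as they appear in adult.data (no period) and adult.test (trailing period)
raw = pd.Series([">50K", "<=50K", ">50K.", "<=50K.", " >50K "])

# Trim whitespace, drop the period, then map to the 0/1 target
clean = raw.str.strip().str.replace(".", "", regex=False)
target = clean.map({">50K": 1, "<=50K": 0})

print(target.tolist())
```

Any label that does not match one of the two keys maps to `NaN`, which makes unexpected spellings easy to spot with `target.isna().sum()`.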


In [None]:

def clean_adult(df):
    out = df.copy()
    # normalize income strings (remove trailing periods, trim)
    out["income"] = out["income"].astype(str).str.strip().str.replace(".", "", regex=False)
    out["income"] = out["income"].map({">50K": 1, "<=50K": 0})
    # Drop every row with a missing value in any column (simple and reproducible).
    # In this dataset "?" only occurs in workclass, occupation and native_country.
    before = out.shape[0]
    out = out.dropna()
    after = out.shape[0]
    print(f"Dropped {before-after} rows with missing values.")
    return out

train = clean_adult(train_raw)
test  = clean_adult(test_raw)

print(train["income"].value_counts(normalize=True))
print(test["income"].value_counts(normalize=True))

train.head()



## 6. Exploratory Data Analysis (EDA)

We examine class balance and key numeric distributions.  
*All charts use Matplotlib; one chart per figure, with default styles (no custom colors).*


In [None]:

# Helper to plot a simple bar chart for class balance
def plot_class_balance(series, title):
    counts = series.value_counts().sort_index()
    plt.figure()
    counts.plot(kind="bar")
    plt.title(title)
    plt.xlabel("Income (0=<=50K, 1=>50K)")
    plt.ylabel("Count")
    plt.show()

plot_class_balance(train["income"], "Training Set: Class Balance")
plot_class_balance(test["income"], "Test Set: Class Balance")


In [None]:

# Numeric distributions
numeric_cols = ["age", "education_num", "hours_per_week", "capital_gain", "capital_loss"]

for col in numeric_cols:
    plt.figure()
    train[col].hist(bins=40)
    plt.title(f"Distribution: {col}")
    plt.xlabel(col)
    plt.ylabel("Frequency")
    plt.show()


In [None]:

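# Relationship with target (boxplots)
for col in ["age", "hours_per_week"]:
    # Pass an explicit Axes: DataFrame.boxplot(by=...) otherwise opens its own
    # figure and leaves the one created by plt.figure() empty.
    fig, ax = plt.subplots()
    train.boxplot(column=col, by="income", ax=ax)
    fig.suptitle("")
    ax.set_title(f"{col} vs Income")
    ax.set_xlabel("Income (0=<=50K, 1=>50K)")
    ax.set_ylabel(col)
    plt.show()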



## 7. Feature Engineering & Preprocessing

We split features into **numeric** and **categorical** columns and build two pipelines:

- **Logistic Regression**  
  - Numeric: `StandardScaler`
  - Categorical: `OneHotEncoder(handle_unknown="ignore")`  
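  - `class_weight="balanced"` to offset the class skew. This pushes predicted probabilities for the >50K class upward, so they need recalibration before being read as true probabilities.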

- **Naive Bayes (MultinomialNB)**  
  - `MultinomialNB` requires **non-negative** features, so standardized (zero-centred) inputs cannot be used.  
  - Numeric: `MinMaxScaler` to [0, 1]  
  - Categorical: `OneHotEncoder`  
  - We'll tune `alpha` via grid search.

We will fit on **train** and evaluate on the **published test** set.
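
The reason for the two different scalers can be shown with a small sketch. The feature values and labels are hypothetical; the snippet only demonstrates that `MultinomialNB` rejects the negative values produced by standardization and accepts min-max scaled input.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numeric feature and labels (hypothetical values)
X = np.array([[18.0], [25.0], [40.0], [62.0]])
y = np.array([0, 0, 1, 1])

# Standardized values are centred on zero, so some of them are negative
X_std = StandardScaler().fit_transform(X)
try:
    MultinomialNB().fit(X_std, y)
    std_ok = True
except ValueError as err:
    std_ok = False
    print("StandardScaler input rejected:", err)

# Min-max scaling maps the training range onto [0, 1]
X_mm = MinMaxScaler().fit_transform(X)
nb = MultinomialNB().fit(X_mm, y)
print("MinMaxScaler range:", X_mm.min(), "to", X_mm.max())
```

Test rows that fall outside the training range end up outside [0, 1] after scaling; `MinMaxScaler(clip=True)` is one way to bound them, at the cost of flattening extreme values.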


In [None]:

# Identify columns
target_col = "income"
feature_cols = [c for c in train.columns if c != target_col]

numeric_features = train[feature_cols].select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_features = [c for c in feature_cols if c not in numeric_features]

numeric_features, categorical_features


In [None]:

# Column transformers
ct_logit = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), categorical_features)
])

ct_nb = ColumnTransformer([
    ("num", MinMaxScaler(), numeric_features),  # non-negative
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), categorical_features)
])

# Pipelines
pipe_logit = Pipeline([
    ("prep", ct_logit),
    ("clf", LogisticRegression(max_iter=500, class_weight="balanced", n_jobs=None))
])

pipe_nb = Pipeline([
    ("prep", ct_nb),
    ("clf", MultinomialNB())
])



## 8. Model Training (with Cross-Validation)

We use **GridSearchCV** with stratified 5-fold CV.  
- **Metric**: ROC-AUC (threshold-independent, so unlike accuracy it is not inflated by always predicting the majority class)
- We keep grids tight to avoid overfitting with excessive search.
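
What stratification buys us can be seen in a small sketch: every validation fold keeps the overall share of positives. The labels below are synthetic (25% positives chosen for round numbers), not the real class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with 25% positives (illustrative only)
y = np.array([1] * 25 + [0] * 75)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rates = [y[val_idx].mean() for _, val_idx in cv.split(X, y)]

print("overall positive rate:", y.mean())
print("per-fold positive rates:", fold_rates)
```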


In [None]:

# Hyperparameter grids
param_grid_logit = {
    "clf__C": [0.25, 1.0, 4.0],
    "clf__penalty": ["l2"],
    "clf__solver": ["lbfgs"]  # default solver; supports the l2 penalty used here
}

param_grid_nb = {
    "clf__alpha": [0.1, 0.5, 1.0]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

gs_logit = GridSearchCV(pipe_logit, param_grid_logit, cv=cv, scoring="roc_auc", n_jobs=-1, refit=True)
gs_nb    = GridSearchCV(pipe_nb,    param_grid_nb,   cv=cv, scoring="roc_auc", n_jobs=-1, refit=True)

X_train = train[feature_cols]
y_train = train[target_col]

gs_logit.fit(X_train, y_train)
gs_nb.fit(X_train, y_train)

print("Best Logistic Regression params:", gs_logit.best_params_, "CV AUC:", round(gs_logit.best_score_, 4))
print("Best Naive Bayes params:", gs_nb.best_params_, "CV AUC:", round(gs_nb.best_score_, 4))



## 9. Evaluation on the External Test Set

We report:
- Accuracy, Precision, Recall, F1 (on >50K class)
- ROC-AUC
- Confusion Matrix
- ROC and Precision-Recall curves (inline plots)
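
As a worked example of how the headline metrics relate to the confusion matrix, the sketch below computes them by hand for eight hypothetical predictions. The labels are invented; they are not model output from this notebook.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions for eight people (1 = >50K)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of those predicted >50K, how many are correct
recall = tp / (tp + fn)      # of those actually >50K, how many are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

Accuracy comes out higher than precision and recall here because the majority class dominates the count, which is why the >50K metrics are reported separately.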


In [None]:

def evaluate_model(name, model, X_test, y_test):
    y_pred = model.predict(X_test)
    # Use predict_proba if available; else decision_function; else fallback to predicted labels for AUC (less ideal)
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_test)[:, 1]
    elif hasattr(model, "decision_function"):
        y_proba = model.decision_function(X_test)
        # Scale to [0,1] if needed
        y_proba = (y_proba - y_proba.min()) / (y_proba.max() - y_proba.min() + 1e-12)
    else:
        y_proba = y_pred.astype(float)
    
    print(f"\n=== {name} ===")
    print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
    print("ROC-AUC:", round(roc_auc_score(y_test, y_proba), 4))
    print("\nClassification Report:\n", classification_report(y_test, y_pred, digits=4))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    
    # ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    plt.figure()
    plt.plot(fpr, tpr)
    plt.plot([0,1],[0,1], linestyle="--")
    plt.title(f"ROC Curve — {name}")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.show()
    
    # Precision-Recall curve
    precision, recall, _ = precision_recall_curve(y_test, y_proba)
    plt.figure()
    plt.plot(recall, precision)
    plt.title(f"Precision-Recall Curve — {name}")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.show()

X_test = test[feature_cols]
y_test = test[target_col]

best_logit = gs_logit.best_estimator_
best_nb    = gs_nb.best_estimator_

evaluate_model("Logistic Regression", best_logit, X_test, y_test)
evaluate_model("Naive Bayes (Multinomial)", best_nb, X_test, y_test)



## 10. Model Interpretability (Logistic Regression)

We extract the top positive/negative coefficients to understand which features push predictions toward `>50K` or `<=50K`.
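
Coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to explain to stakeholders. The sketch below uses hypothetical coefficient values, not the fitted ones. For a standardized numeric feature the ratio applies per one standard deviation; for a one-hot column it compares the category being present versus absent, with other inputs held fixed.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients, not fitted values from this notebook
coefs = pd.Series({"age": 0.40, "hours_per_week": 0.35, "sex_Female": -0.80})

# exp(coef) is the multiplicative change in the odds of >50K
odds_ratio = np.exp(coefs)

print(odds_ratio.round(3))
```

A ratio above 1 raises the odds of `>50K` and a ratio below 1 lowers them.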


In [None]:

def get_feature_names_from_ct(ct, input_cols):
    # Works for our specific ColumnTransformer definition
    output_names = []
    for name, transformer, cols in ct.transformers_:
        if transformer == "drop":
            continue
        if hasattr(transformer, "get_feature_names_out"):
            try:
                names = transformer.get_feature_names_out(cols)
            except TypeError:
                names = transformer.get_feature_names_out()
            output_names.extend(names)
        else:
            # passthrough or scaler without names
            if isinstance(cols, list):
                output_names.extend(cols)
            else:
                output_names.extend(input_cols if cols == "remainder" else [cols])
    return output_names

# Extract coefficients from the best logistic model
logit_clf = best_logit.named_steps["clf"]
ct = best_logit.named_steps["prep"]

feature_names = get_feature_names_from_ct(ct, feature_cols)

coef = logit_clf.coef_.ravel()
coef_df = pd.DataFrame({"feature": feature_names, "coef": coef}).sort_values("coef", ascending=False)

top_k = 15
top_pos = coef_df.head(top_k)
top_neg = coef_df.tail(top_k).iloc[::-1]

display(top_pos)
display(top_neg)

# Plot top positives
plt.figure()
plt.barh(top_pos["feature"][::-1], top_pos["coef"][::-1])
plt.title("Top Positive Coefficients (Logistic Regression)")
plt.xlabel("Coefficient")
plt.ylabel("Feature")
plt.show()

# Plot top negatives
plt.figure()
plt.barh(top_neg["feature"][::-1], top_neg["coef"][::-1])
plt.title("Top Negative Coefficients (Logistic Regression)")
plt.xlabel("Coefficient")
plt.ylabel("Feature")
plt.show()



## 11. Discussion: Which Model Is Best and Why?

We compare both models on **external test**:
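- If **Logistic Regression** shows higher ROC-AUC and balanced precision/recall, it wins for **interpretability**. Its probabilities are a reasonable base for **calibration**, but `class_weight="balanced"` inflates them for the >50K class, so they should be recalibrated before they drive business thresholds.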
- If **Multinomial Naive Bayes** is close or better, it's attractive for **speed** and **simplicity**, but may be less calibrated on numeric-heavy features.
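
To make the calibration point concrete, here is a hedged sketch on synthetic data (a `make_classification` stand-in with about 25% positives, not the Adult data). It compares the mean predicted probability of a class-weighted logistic regression with the same model wrapped in `CalibratedClassifierCV` using Platt scaling.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 25% positives (a stand-in, not the Adult data)
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Balanced class weights push predicted probabilities for the minority class upward
raw = LogisticRegression(max_iter=500, class_weight="balanced").fit(X_tr, y_tr)

# Platt scaling (method="sigmoid") refits the probability scale on held-out folds
calibrated = CalibratedClassifierCV(
    LogisticRegression(max_iter=500, class_weight="balanced"),
    method="sigmoid",
    cv=5,
).fit(X_tr, y_tr)

p_raw = raw.predict_proba(X_te)[:, 1]
p_cal = calibrated.predict_proba(X_te)[:, 1]

print("observed positive rate:     ", round(y_te.mean(), 3))
print("mean raw probability:       ", round(p_raw.mean(), 3))
print("mean calibrated probability:", round(p_cal.mean(), 3))
```

A mean predicted probability close to the observed positive rate is a necessary (not sufficient) sign of calibration; a reliability curve gives the fuller picture.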

**Bias/Variance & Overfitting Controls**
- Used a true holdout (published test file).
- Simple preprocessing (no leakage). 
- Modest hyperparameter search with cross-validation (5-fold).
- Regularized Logistic Regression.

**Practical Recommendation**
- Choose the model with **higher ROC-AUC** and operationally desirable **precision/recall** on the positive class.
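- For deployment, set a business threshold (probability cutoff) aligned with costs (false positives vs false negatives). Pick the operating point from a PR curve computed on validation data (for example, the cross-validation folds), not on the test set, so the reported test metrics stay unbiased.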
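
One way to turn the PR curve into a cutoff is sketched below: choose the lowest threshold that still meets a required precision, which keeps recall as high as possible. The labels, scores, and the 0.75 precision target are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical validation labels and predicted probabilities
y_val = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
p_val = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.70, 0.80, 0.90])

precision, recall, thresholds = precision_recall_curve(y_val, p_val)

# precision/recall have one more entry than thresholds; drop the final point
target_precision = 0.75  # assumed business requirement
meets_target = precision[:-1] >= target_precision
if not meets_target.any():
    raise ValueError("no threshold reaches the target precision")

# thresholds are sorted ascending, so the first hit keeps recall as high as possible
idx = int(np.argmax(meets_target))
chosen = thresholds[idx]

print(f"threshold={chosen:.2f} precision={precision[idx]:.2f} recall={recall[idx]:.2f}")
```

At prediction time the cutoff would be applied as `model.predict_proba(X)[:, 1] >= chosen` instead of the default 0.5.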



## 12. Next Steps

- Try **calibration** (Platt/Isotonic) and a **cost-sensitive threshold** tuned to business goals.
- Explore **feature interactions** and drop **fnlwgt** if it hurts generalization.
- Investigate **fairness metrics** (e.g., across `sex` or `race`) before real-world use.
- Add **learning curve** diagnostics and **error analysis** on top contributing segments.
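
A first pass at the fairness check could compare recall and selection rate across groups. The sketch below uses eight invented rows; the column names `y_true` and `y_pred` are placeholders for the test labels and model predictions.

```python
import pandas as pd

# Hypothetical predictions for eight test rows (values are illustrative only)
df = pd.DataFrame({
    "sex":    ["Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female"],
    "y_true": [1, 1, 0, 0, 1, 1, 0, 0],
    "y_pred": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Recall per group: share of actual >50K rows that the model flags
positives = df[df["y_true"] == 1]
recall_by_group = positives.groupby("sex")["y_pred"].mean()

# Selection rate per group: share of all rows predicted >50K
selection_rate = df.groupby("sex")["y_pred"].mean()

print("recall by group:\n", recall_by_group)
print("selection rate by group:\n", selection_rate)
```

Large gaps between groups on either measure are a signal to investigate further before the model informs any decision about people.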



## 13. Reproducibility

- Random states fixed where relevant.
- All transformations encapsulated in scikit-learn **Pipelines**.
- Clear separation of **train** (adult.data) vs **test** (adult.test).
