MLP using PyTorch on the Adult Income Dataset

In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn.functional as F
from torch import nn
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    confusion_matrix
)

In [2]:
def load_adult_dataset(csv_path):
    column_names = [
        "age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week",
        "native-country", "income"
    ]

    df = pd.read_csv(csv_path, names=column_names, na_values=" ?")
    df = df.dropna()

    # Convert target to binary
    df["income"] = (df["income"].str.strip() == ">50K").astype(int)

    return df

In [8]:
def preprocess_features(df):
    """
    Splits features into numerical and categorical,
    scales numerical features, one-hot encodes categorical features.
    """
    target = df["income"].values

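    # Exclude the target first, then select by dtype kind so the split does not
    # depend on the platform's default integer width.
    features = df.drop(columns=["income"])
    numerical_cols = features.select_dtypes(include=["number"]).columns
    categorical_cols = features.select_dtypes(include=["object"]).columns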

    # Scale numerical features
    scaler = StandardScaler()
    X_num = scaler.fit_transform(df[numerical_cols])

    # One-hot encode categorical features
    encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
    X_cat = encoder.fit_transform(df[categorical_cols])

    # Combine features
    X = np.hstack([X_num, X_cat]).astype(np.float32)

    return X, target
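
`preprocess_features` fits the scaler and the encoder on the full dataset, before the train/test split performed in the main cell. A common alternative is to estimate the statistics on the training rows only and reuse them for the test rows, so that no test information reaches the model. The sketch below illustrates this for standardization with plain NumPy; the helper names `fit_standardizer` and `apply_standardizer` and the toy arrays are hypothetical and not part of the notebook.

```python
import numpy as np


def fit_standardizer(X_train):
    """Estimate per-column mean and standard deviation from training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)   # population standard deviation (ddof=0)
    std[std == 0.0] = 1.0       # guard against constant columns
    return mean, std


def apply_standardizer(X, mean, std):
    """Standardize any split using statistics estimated on the training split."""
    return (X - mean) / std


# Toy data (hypothetical): two numerical columns, e.g. age and hours-per-week
X_train = np.array([[20.0, 40.0], [40.0, 50.0], [60.0, 30.0]])
X_test = np.array([[40.0, 40.0]])

mean, std = fit_standardizer(X_train)
X_train_scaled = apply_standardizer(X_train, mean, std)
X_test_scaled = apply_standardizer(X_test, mean, std)
```

With scikit-learn the same pattern would be `scaler.fit(X_train)` followed by `scaler.transform(X_test)`, and likewise for the encoder.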

In [9]:
def build_mlp(input_dim):
    model = nn.Sequential(
        nn.Linear(input_dim, 128),
        nn.ReLU(),
        nn.Linear(128, 64),
        nn.ReLU(),
        nn.Linear(64, 1)   # single logit
    )
    return model
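
As a quick size check, the number of trainable parameters of this architecture follows from the layer shapes: each `nn.Linear(in, out)` holds `in * out` weights plus `out` biases. The sketch below does the arithmetic in plain Python. The function name is hypothetical, and `input_dim=104` is only an illustrative value, since the real width depends on how many one-hot columns the encoder produces.

```python
def count_mlp_parameters(input_dim, hidden_sizes=(128, 64), output_dim=1):
    """Count weights and biases of a fully connected network, layer by layer."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total


# Illustrative input width (an assumption, not taken from the notebook)
n_params = count_mlp_parameters(input_dim=104)
print(n_params)
```

For an actual model, `sum(p.numel() for p in model.parameters())` should give the same figure.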

In [10]:
def train_model(model, X_train, y_train, epochs=20, batch_size=256, lr=1e-3):
    """
    Trains the MLP using Adam optimizer and BCE With Logits Loss.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()  # loss = -[y*log(σ(z)) + (1 - y)*log(1 - σ(z))], z = logit

    num_samples = X_train.shape[0]

    for epoch in range(epochs):
        permutation = torch.randperm(num_samples) #shuffles sample indices every epoch

        model.train()

        for i in range(0, num_samples, batch_size): #mini batch training
            indices = permutation[i:i + batch_size]

            batch_X = X_train[indices]
            batch_y = y_train[indices]

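            # (B, 1) -> (B,); squeeze(1) keeps a final batch of size 1 as shape (1,),
            # whereas a bare squeeze() would collapse it to a scalar and mismatch batch_y
            logits = model(batch_X).squeeze(1)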
            loss = loss_fn(logits, batch_y)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

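        # Note: this is the loss of the last mini-batch only, not the epoch mean,
        # so the printed values fluctuate from epoch to epoch.
        print(f"Epoch {epoch+1}/{epochs} - Loss: {loss.item():.4f}")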
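
The loss formula in the comment above can be checked by hand. The sketch below is a plain-Python illustration of the per-sample binary cross-entropy on a logit `z`, together with the algebraically equivalent form `max(z, 0) - z*y + log(1 + exp(-|z|))`, which avoids overflow for large `|z|`. It is meant as a worked formula under these assumptions, not as a description of PyTorch's internals; all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_naive(z, y):
    """Direct transcription: -[y*log(s(z)) + (1 - y)*log(1 - s(z))]."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bce_with_logits(z, y):
    """Equivalent form that stays finite for large |z|."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def mean_bce_with_logits(logits, targets):
    """Average the per-sample loss over a mini-batch."""
    losses = [bce_with_logits(z, y) for z, y in zip(logits, targets)]
    return sum(losses) / len(losses)


# Toy mini-batch (made-up values)
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]
batch_loss = mean_bce_with_logits(logits, targets)
```

For moderate logits the two per-sample functions agree; for a logit such as `1000.0` the naive version fails because `1 - sigmoid(z)` rounds to zero, while the second form still returns a finite value.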

In [11]:
def evaluate_model(model, X_test, y_test):
    """
    Evaluates the trained model and reports metrics.
    """
    model.eval()

    with torch.no_grad():
        logits = model(X_test).squeeze(1)  # (N, 1) -> (N,)
        probabilities = torch.sigmoid(logits)  # logits -> probabilities -> predictions
        predictions = (probabilities >= 0.5).int().numpy()

    y_true = y_test.numpy()

    accuracy = accuracy_score(y_true, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, predictions, average="binary"
    )
    cm = confusion_matrix(y_true, predictions)

    return accuracy, precision, recall, f1, cm

In [None]:
if __name__ == "__main__":

    df = load_adult_dataset("/home/naman/Cryptonite-RTP-NamanGoel/Task-3/adult/adult_data.csv")

    # Preprocess
    X, y = preprocess_features(df)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # Convert to PyTorch tensors: nn.Module layers operate on tensors, and autograd
    # cannot track gradients through NumPy arrays
    X_train_tensor = torch.tensor(X_train)
    y_train_tensor = torch.tensor(y_train, dtype=torch.float32)

    X_test_tensor = torch.tensor(X_test)
    y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

    model = build_mlp(X_train_tensor.shape[1])

    train_model(
        model,
        X_train_tensor,
        y_train_tensor,
        epochs=20,
        batch_size=256,
        lr=1e-3
    )

    # Evaluate
    acc, prec, rec, f1, cm = evaluate_model(
        model, X_test_tensor, y_test_tensor
    )

    print("\nEvaluation Results:")
    print("Accuracy :", acc)
    print("Precision:", prec)
    print("Recall   :", rec)
    print("F1-score :", f1)
    print("Confusion Matrix:\n", cm)


Epoch 1/100 - Loss: 0.4021
Epoch 2/100 - Loss: 0.3881
Epoch 3/100 - Loss: 0.3956
Epoch 4/100 - Loss: 0.2964
Epoch 5/100 - Loss: 0.4277
Epoch 6/100 - Loss: 0.3074
Epoch 7/100 - Loss: 0.2229
Epoch 8/100 - Loss: 0.3772
Epoch 9/100 - Loss: 0.2784
Epoch 10/100 - Loss: 0.2756
Epoch 11/100 - Loss: 0.4148
Epoch 12/100 - Loss: 0.3598
Epoch 13/100 - Loss: 0.2731
Epoch 14/100 - Loss: 0.2223
Epoch 15/100 - Loss: 0.2970
Epoch 16/100 - Loss: 0.3030
Epoch 17/100 - Loss: 0.2962
Epoch 18/100 - Loss: 0.3139
Epoch 19/100 - Loss: 0.3888
Epoch 20/100 - Loss: 0.3855
Epoch 21/100 - Loss: 0.2716
Epoch 22/100 - Loss: 0.2353
Epoch 23/100 - Loss: 0.2786
Epoch 24/100 - Loss: 0.3867
Epoch 25/100 - Loss: 0.3041
Epoch 26/100 - Loss: 0.2688
Epoch 27/100 - Loss: 0.1636
Epoch 28/100 - Loss: 0.3043
Epoch 29/100 - Loss: 0.3035
Epoch 30/100 - Loss: 0.2656
Epoch 31/100 - Loss: 0.2454
Epoch 32/100 - Loss: 0.2031
Epoch 33/100 - Loss: 0.2564
Epoch 34/100 - Loss: 0.2406
Epoch 35/100 - Loss: 0.2391
Epoch 36/100 - Loss: 0.2205
...

Experiment A: epochs=20, batch_size=256, lr=1e-3  
Epoch 1/20 - Loss: 0.3117  
Epoch 2/20 - Loss: 0.2787  
Epoch 3/20 - Loss: 0.4108  
Epoch 4/20 - Loss: 0.2465  
Epoch 5/20 - Loss: 0.3254  
Epoch 6/20 - Loss: 0.4575  
Epoch 7/20 - Loss: 0.3554  
Epoch 8/20 - Loss: 0.2626  
Epoch 9/20 - Loss: 0.2518  
Epoch 10/20 - Loss: 0.2981  
Epoch 11/20 - Loss: 0.3389  
Epoch 12/20 - Loss: 0.3012  
Epoch 13/20 - Loss: 0.4842  
Epoch 14/20 - Loss: 0.3310  
Epoch 15/20 - Loss: 0.3438  
Epoch 16/20 - Loss: 0.2139  
Epoch 17/20 - Loss: 0.2541  
Epoch 18/20 - Loss: 0.2405  
Epoch 19/20 - Loss: 0.3080  
Epoch 20/20 - Loss: 0.2683  
  
Evaluation Results:  
Accuracy : 0.8513177523620089  
Precision: 0.7325134511913912  
Recall   : 0.6344873501997337  
F1-score : 0.6799857295754549  
Confusion Matrix:  
 [[4183  348]  
 [ 549  953]]
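
The reported metrics can be re-derived from the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]` for labels 0 and 1. The following plain-Python sketch does this for the matrix above; the function name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Recompute accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts taken from the 20-epoch confusion matrix above
acc, prec, rec, f1 = metrics_from_confusion(tn=4183, fp=348, fn=549, tp=953)
print(f"{acc:.4f} {prec:.4f} {rec:.4f} {f1:.4f}")  # prints 0.8513 0.7325 0.6345 0.6800
```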

Experiment A (20 epochs, batch size 256) is the better baseline model: it achieves a higher F1-score (0.680 vs. 0.654) as well as higher accuracy and precision. Experiment B (100 epochs, batch size 512) gains recall at the cost of precision. Because both the epoch count and the batch size changed between the two runs, the difference cannot be attributed to either setting alone.  
Experiment B: epochs=100, batch_size=512, lr=1e-3  
Epoch 1/100 - Loss: 0.4021  
Epoch 2/100 - Loss: 0.3881   
Epoch 3/100 - Loss: 0.3956   
Epoch 4/100 - Loss: 0.2964  
...  
Epoch 95/100 - Loss: 0.1991   
Epoch 96/100 - Loss: 0.2377   
Epoch 97/100 - Loss: 0.1747   
Epoch 98/100 - Loss: 0.1643   
Epoch 99/100 - Loss: 0.3716   
Epoch 100/100 - Loss: 0.2401   
  
Evaluation Results:  
Accuracy : 0.825957235206365  
Precision: 0.6475195822454308  
Recall   : 0.6604527296937417  
F1-score : 0.6539222148978246  
Confusion Matrix:  
 [[3991  540]  
 [ 510  992]]

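Thus, Experiment B correctly identified more individuals in the >50K class (992 vs. 953 true positives), but it also produced more false positives (540 vs. 348).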

Methods  
The Adult Income dataset was used to formulate a binary classification task: predicting whether an individual earns more than $50K annually. Rows with missing values were dropped. The dataset contains both numerical and categorical features; numerical variables were standardized and categorical variables were one-hot encoded, with both transformations fit on the full dataset before an 80/20 stratified train/test split. A multi-layer perceptron (MLP) was implemented in PyTorch with two hidden layers (128 and 64 units), ReLU activations, and a single output logit. The model was trained with the Adam optimizer (learning rate 1e-3) and binary cross-entropy loss with logits, using shuffled mini-batches. Because the classes are imbalanced, performance on the test set was evaluated with precision, recall, and F1-score for the >50K class and a confusion matrix, in addition to accuracy.  
  
Results  
With 20 epochs and a batch size of 256, the model achieved an accuracy of 85.1%, a precision of 73.3%, a recall of 63.4%, and an F1-score of 0.68 for the high-income class. Increasing the training duration to 100 epochs and the batch size to 512 raised recall to 66.0% but lowered precision to 64.8% and accuracy to 82.6%, giving a lower F1-score of 0.65. In these two runs, the additional true positives (992 vs. 953) came with a larger increase in false positives (540 vs. 348), which illustrates the precision-recall trade-off under class imbalance (about 25% of the test examples belong to the >50K class). Since both hyperparameters changed together and each configuration was run once, the comparison does not isolate the effect of either setting; on the evidence available, the shorter training configuration is the stronger baseline.
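
One way to examine the precision-recall trade-off within a single trained model, rather than across training configurations, is to vary the decision threshold that `evaluate_model` fixes at 0.5. The sketch below uses made-up probabilities and labels purely for illustration; none of the numbers come from the experiments above, and the function name is hypothetical.

```python
def precision_recall_at(probabilities, labels, threshold):
    """Precision and recall of the positive class at a given decision threshold."""
    preds = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == 1)
    fp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == 0)
    fn = sum(1 for yhat, y in zip(preds, labels) if yhat == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up model outputs and ground-truth labels (illustrative only)
probs = [0.95, 0.80, 0.65, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(probs, labels, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, lowering the threshold raises recall and lowers precision, and raising it does the opposite, which is the same direction of trade-off seen between the two experiments.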