## Income Classification Model Training

This notebook trains and evaluates several classifiers on the cleaned income dataset. We compare class-weighted baseline models, tune an XGBoost estimator with grid search, and persist the tuned model for downstream prediction notebooks.


In [6]:
# Import the necessary libraries
import pandas as pd

# Load the cleaned dataset
data = pd.read_csv('../data/cleaned.csv')

# Features (X) and target (y)
X = data.drop(columns=['income'])  # Assuming 'income' is the target column
y = data['income']  # The target column


### Load Features and Target

We read the preprocessed dataset and separate the binary `income` label (`y`) from the predictor columns (`X`) used for model training.
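Later cells count classes with `y_train == 0` and `y_train == 1`, so it may help to confirm the label coding right after loading. The following is a minimal sketch, not part of the original notebook: `check_binary_target` is a hypothetical helper and the toy series stands in for `data['income']`.

```python
import pandas as pd


def check_binary_target(y: pd.Series) -> pd.Series:
    """Return class proportions after confirming the label is coded 0/1.

    Hypothetical helper for illustration only.
    """
    labels = set(y.unique().tolist())
    if labels != {0, 1}:
        raise ValueError(f"Expected labels {{0, 1}}, found {sorted(labels)}")
    # normalize=True turns counts into fractions of the whole column
    return y.value_counts(normalize=True).sort_index()


# Toy label column standing in for data['income'] (values are made up).
toy_y = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="income")
proportions = check_binary_target(toy_y)
print(proportions)
```

A strongly skewed proportion here is what motivates the class weighting used in the benchmark.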


### Baseline Model Benchmark

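We compare class-weighted versions of Logistic Regression, Decision Tree, Random Forest, and XGBoost using an 80/20 train-test split. The scikit-learn models use `class_weight='balanced'`, XGBoost uses `scale_pos_weight`, and we report precision, recall, and F1 alongside accuracy because the target is imbalanced.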
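To make the two weighting schemes concrete, here is a minimal sketch on made-up labels. It assumes 0/1 coding with 1 as the positive class; `imbalance_weights` is a hypothetical helper, and the per-class formula is the one scikit-learn documents for `class_weight='balanced'`.

```python
from collections import Counter


def imbalance_weights(labels):
    """Compute 'balanced' class weights and the negative/positive ratio.

    Hypothetical helper for illustration only.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # class_weight='balanced': n_samples / (n_classes * count_of_class)
    balanced = {cls: n_samples / (n_classes * n) for cls, n in counts.items()}
    # scale_pos_weight as the benchmark cell computes it: negatives / positives
    scale_pos_weight = counts[0] / counts[1]
    return balanced, scale_pos_weight


# Made-up training labels: 750 negatives, 250 positives.
toy_labels = [0] * 750 + [1] * 250
balanced, spw = imbalance_weights(toy_labels)
print(balanced)
print(spw)
```

Both schemes upweight the minority class, so errors on positive examples cost more during training.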


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Split the data into train and test sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Calculate scale_pos_weight (negatives / positives) for XGBoost to handle class imbalance
neg_count = (y_train == 0).sum()
pos_count = (y_train == 1).sum()
scale_pos_weight = neg_count / pos_count

# Initialize the models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight='balanced'),
    "Decision Tree": DecisionTreeClassifier(random_state=42, class_weight='balanced'),
    "Random Forest": RandomForestClassifier(random_state=42, class_weight='balanced'),
    "XGBoost": xgb.XGBClassifier(eval_metric='logloss', scale_pos_weight=scale_pos_weight)
}

# Function to train models and evaluate performance with verbose output
def train_and_evaluate(models, X_train, X_test, y_train, y_test):
    results = {}

    for model_name, model in models.items():
        print(f"\nTraining {model_name}...")  # Verbose output for model training
        
        # Train the model
        model.fit(X_train, y_train)
        
        print(f"Model {model_name} training completed.")  # Verbose output after training
        
        # Predict on the test set
        y_pred = model.predict(X_test)
        
        # Calculate performance metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)

        # Save the results
        results[model_name] = {
            "Accuracy": accuracy,
            "Precision": precision,
            "Recall": recall,
            "F1 Score": f1
        }

        print(f"{model_name} performance:")  # Verbose output with performance metrics
        print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")
        
    return results

# Evaluate the models
results = train_and_evaluate(models, X_train, X_test, y_train, y_test)

# Display results in a DataFrame for better visualization
results_df = pd.DataFrame(results).T
print("\nModel Comparison:")
print(results_df)


Training Logistic Regression...


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Model Logistic Regression training completed.
Logistic Regression performance:
Accuracy: 0.7907, Precision: 0.5571, Recall: 0.8314, F1 Score: 0.6672

Training Decision Tree...
Model Decision Tree training completed.
Decision Tree performance:
Accuracy: 0.7860, Precision: 0.5703, Recall: 0.6162, F1 Score: 0.5924

Training Random Forest...
Model Random Forest training completed.
Random Forest performance:
Accuracy: 0.8137, Precision: 0.6476, Recall: 0.5741, F1 Score: 0.6086

Training XGBoost...
Model XGBoost training completed.
XGBoost performance:
Accuracy: 0.8300, Precision: 0.6179, Recall: 0.8543, F1 Score: 0.7171

Model Comparison:
                     Accuracy  Precision    Recall  F1 Score
Logistic Regression  0.790709   0.557125  0.831370  0.667164
Decision Tree        0.786016   0.570281  0.616243  0.592372
Random Forest        0.813702   0.647552  0.574086  0.608610
XGBoost              0.829970   0.617937  0.854309  0.717148


#### Benchmark Results

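The comparison table shows how each algorithm handles the class imbalance. XGBoost has the highest accuracy (0.830), recall (0.854), and F1 score (0.717), while Random Forest has the highest precision (0.648) but the lowest recall (0.574). On that basis we tune XGBoost further.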
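The F1 values in the comparison table can be cross-checked from the reported precision and recall, since F1 is their harmonic mean. A small sketch, with the figures copied from the table above:

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the model comparison table.
reported = {
    "Logistic Regression": (0.557125, 0.831370, 0.667164),
    "Decision Tree": (0.570281, 0.616243, 0.592372),
    "Random Forest": (0.647552, 0.574086, 0.608610),
    "XGBoost": (0.617937, 0.854309, 0.717148),
}

for name, (precision, recall, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_from(precision, recall):.4f}, reported = {f1:.4f}")

# Largest disagreement between recomputed and reported F1.
max_gap = max(abs(f1_from(p, r) - f1) for p, r, f1 in reported.values())
best_by_f1 = max(reported, key=lambda name: reported[name][2])
print(best_by_f1)
```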


### Hyperparameter Tuning

We run an exhaustive grid search over seven XGBoost hyperparameters, scoring each combination by accuracy under 3-fold cross-validation on the training set. The held-out test set is used only to evaluate the selected model.
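The cost of an exhaustive search grows multiplicatively with each hyperparameter, so it can be worth sizing the grid before launching it. A minimal sketch, reusing the grid values and the 3-fold setting from the search cell:

```python
from math import prod

# Same grid as the search cell in this notebook.
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'gamma': [0, 0.1, 0.3],
}

cv_folds = 3
# Each candidate is one combination of values, one value per hyperparameter.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds
print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

The totals match the `Fitting 3 folds for each of 972 candidates` line that `GridSearchCV` prints when the search starts.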


In [8]:
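from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Base estimator; the grid below supplies the hyperparameters to tune
xgb_model = xgb.XGBClassifier(eval_metric='logloss')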

# Define the hyperparameter grid
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'gamma': [0, 0.1, 0.3],
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, 
                           cv=3, scoring='accuracy', verbose=2, n_jobs=-1)

# Perform grid search to find the best parameters
grid_search.fit(X_train, y_train)

# Get the best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Print the best parameters
print(f"Best Parameters: {best_params}")

# Evaluate the best model on the test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Best Model Accuracy: {accuracy:.4f}")

import joblib
joblib.dump(best_model, "../models/best_xgboost_model.pkl")
print("Saved tuned XGBoost model to ../models/best_xgboost_model.pkl")

Fitting 3 folds for each of 972 candidates, totalling 2916 fits
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.01, max_depth=3, min_child_weight=1, n_estimators=50, subsample=1.0; total time=   0.2s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.01, max_depth=3, min_child_weight=1, n_estimators=50, subsample=0.8; total time=   0.2s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.01, max_depth=3, min_child_weight=1, n_estimators=50, subsample=1.0; total time=   0.2s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.01, max_depth=3, min_child_weight=1, n_estimators=50, subsample=1.0; total time=   0.2s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.01, max_depth=3, min_child_weight=1, n_estimators=50, subsample=0.8; total time=   0.2s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.01, max_depth=3, min_child_weight=1, n_estimators=50, subsample=0.8; total time=   0.3s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.01, max_depth=3, ... (output truncated)

#### Best Parameter Snapshot

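Grid search selects the combination of tree depth, learning rate, regularization, and sampling ratios with the highest mean cross-validated accuracy. The cell prints that combination from `grid_search.best_params_`, and `grid_search.best_estimator_` is the model refit on the full training set with those values.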
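One way to see how the selected combination compares with its runners-up is to tabulate `cv_results_`. The sketch below is illustrative only: it swaps in a `DecisionTreeClassifier`, a much smaller grid, and synthetic data so it runs in seconds, and none of these choices come from the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced data standing in for the income dataset.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=42
)

# Stand-in estimator and grid (assumptions, not the notebook's XGBoost search).
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

# One row per candidate, best-ranked first.
cv_table = (
    pd.DataFrame(search.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(cv_table.head())
```

Comparing `mean_test_score` against `std_test_score` for the top rows shows whether the winner is clearly ahead or within fold-to-fold noise.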


### Persist the Best Model

The tuning cell above serializes the tuned XGBoost estimator with `joblib.dump` to `../models/best_xgboost_model.pkl` so that evaluation and inference notebooks can reuse it.


### Reload and Validate Saved Model

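We reload the persisted model and recompute accuracy, precision, recall, and F1 on the test set to confirm that the serialized version behaves like the in-memory one. Compared with the class-weighted XGBoost baseline, the tuned model gains accuracy (0.8628 vs. 0.8300) and precision (0.7714 vs. 0.6179) but loses recall (0.6485 vs. 0.8543). This is consistent with tuning for accuracy without `scale_pos_weight`.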
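A parity check can also compare predictions directly rather than summary metrics. The following is a minimal sketch on synthetic data with a stand-in `LogisticRegression` model and a temporary file; the model, data, and file name are assumptions for illustration, not the notebook's artifacts.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and a stand-in model (not the tuned XGBoost estimator).
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Round-trip the fitted model through joblib in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "toy_model.pkl"
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# Parity means identical predictions on the same inputs.
same_predictions = np.array_equal(model.predict(X_toy), reloaded.predict(X_toy))
print(same_predictions)
```

In the notebook itself, the same comparison would use `best_model`, `loaded_model`, and `X_test`.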


In [9]:
import joblib
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the saved model
loaded_model = joblib.load('../models/best_xgboost_model.pkl')

# Predict on the test set
y_pred_loaded = loaded_model.predict(X_test)

# Evaluate performance

accuracy_loaded = accuracy_score(y_test, y_pred_loaded)
precision_loaded = precision_score(y_test, y_pred_loaded)
recall_loaded = recall_score(y_test, y_pred_loaded)
f1_loaded = f1_score(y_test, y_pred_loaded)

print(f"Loaded Model Accuracy: {accuracy_loaded:.4f}")
print(f"Loaded Model Precision: {precision_loaded:.4f}")
print(f"Loaded Model Recall: {recall_loaded:.4f}")
print(f"Loaded Model F1 Score: {f1_loaded:.4f}")

Loaded Model Accuracy: 0.8628
Loaded Model Precision: 0.7714
Loaded Model Recall: 0.6485
Loaded Model F1 Score: 0.7046
