# 2.0 - Boosting Ensemble Analysis

## Objective

This notebook compares four boosting algorithms (`AdaBoost`, `XGBoost`, `LightGBM`, and `CatBoost`) against the `RandomForest` baseline to see whether any of them outperforms it.

All models in this notebook are tree-based. Trees split on feature thresholds and do not require one-hot inputs, so we will use `Label Encoding` for categorical features.
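
The `label_encode_features` helper lives in `src/` and its implementation is not shown in this notebook. As a rough sketch of the idea, assuming it maps every non-numeric column to integer codes and leaves numeric columns untouched, something like the following would work. The function name `label_encode_sketch` and the toy values are made up for illustration.

```python
import pandas as pd


def label_encode_sketch(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for label_encode_features: map each
    non-numeric column to integer codes, leave numeric columns alone."""
    X_encoded = X.copy()
    for col in X_encoded.select_dtypes(exclude=["number"]).columns:
        # Codes follow the sorted order of the distinct values (0, 1, 2, ...).
        X_encoded[col] = X_encoded[col].astype("category").cat.codes
    return X_encoded


# Toy frame for illustration only, not the real dataset.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
})
encoded = label_encode_sketch(toy)
print(encoded)
```

The codes here follow the alphabetical order of the categories, so they identify a category but carry no further meaning.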

In [None]:
import pandas as pd
import sys
import os

# Add src to path to allow imports
sys.path.append(os.path.join(os.path.abspath(''), '..', 'src'))

from data.make_dataset import load_data
from features.build_features import split_features_target, label_encode_features, split_data
from models.train_model import train_and_evaluate, save_model

## 1. Load and Prepare Data

Load the cleaned data and prepare it for tree-based models using `Label Encoding`.

In [None]:
df = load_data('../data/processed/adult_cleaned.csv')
X, y = split_features_target(df)

X_le = label_encode_features(X)
X_train, X_test, y_train, y_test = split_data(X_le, y)

print("Data prepared for tree-based models.")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

## 2. Train and Evaluate Ensemble Models

We will iterate through our list of boosting models (and `RandomForest` as a baseline), train each one, and store the results.

In [None]:
models_to_compare = ["RandomForest", "AdaBoost", "XGBoost", "LightGBM", "CatBoost"]
results = {}
trained_models = {}

for model_name in models_to_compare:
    print(f"--- Training {model_name} ---")
    model, metrics = train_and_evaluate(X_train, y_train, X_test, y_test, model_name)
    results[model_name] = metrics
    trained_models[model_name] = model
    print(f"ROC-AUC: {metrics['ROC-AUC']:.4f}\n")

# Convert results to a DataFrame for display.
# Columns are model names and rows are metric names.
results_df = pd.DataFrame(results).round(4)
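
`train_and_evaluate` is a project helper whose body is not part of this notebook. The sketch below shows one plausible shape for it, assuming it looks the model up by name, fits it, and returns the fitted model together with a dict keyed by metric name. Only two scikit-learn models are included, and the registry `MODEL_FACTORIES`, the hyperparameters, and the synthetic data are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical name -> constructor registry; the real helper may differ.
MODEL_FACTORIES = {
    "RandomForest": lambda: RandomForestClassifier(n_estimators=50, random_state=42),
    "AdaBoost": lambda: AdaBoostClassifier(n_estimators=50, random_state=42),
}


def train_and_evaluate_sketch(X_train, y_train, X_test, y_test, model_name):
    """Fit the named model and score it on the held-out split."""
    model = MODEL_FACTORIES[model_name]()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # ROC-AUC needs scores, so use the positive-class probability.
    y_score = model.predict_proba(X_test)[:, 1]
    metrics = {
        "F1-Score": f1_score(y_test, y_pred),
        "ROC-AUC": roc_auc_score(y_test, y_score),
    }
    return model, metrics


# Synthetic binary data stands in for the real features.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
demo_model, demo_metrics = train_and_evaluate_sketch(X_tr, y_tr, X_te, y_te, "AdaBoost")
print(demo_metrics)
```

Keeping the constructors behind a name lookup is one way to let the loop above stay a plain list of strings.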

## 3. Compare Results and Conclude

Now we can display the performance metrics for all the trained models in a single table and visualize the key metrics.

In [None]:
import matplotlib.pyplot as plt

print("Comparison of Ensemble Models:")
print(results_df)

# Plot F1-Score and ROC-AUC for comparison
results_df.T[['F1-Score', 'ROC-AUC']].plot(
    kind='bar',
    figsize=(12, 6),
    title="Comparison of F1-Score and ROC-AUC"
)
plt.ylabel("Score")
plt.ylim(0, 1)
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
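plt.tight_layout()  # keep the rotated tick labels inside the saved figure
os.makedirs('../reports/figures', exist_ok=True)  # savefig fails if the folder is missing
plt.savefig('../reports/figures/2.0_ensemble_comparison.png')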
plt.show()
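
Because `results` is a dict of dicts keyed by model name, `pd.DataFrame(results)` puts models in the columns and metrics in the rows, which is why the plotting cell transposes the table first. The sketch below illustrates that orientation and a per-metric lookup. The model names and scores are placeholders, not results from this notebook.

```python
import pandas as pd

# Placeholder scores for illustration only; not measured results.
placeholder_results = {
    "ModelA": {"F1-Score": 0.60, "ROC-AUC": 0.80},
    "ModelB": {"F1-Score": 0.70, "ROC-AUC": 0.90},
}
table = pd.DataFrame(placeholder_results)

# Rows are metrics, columns are models.
print(table)

# Selecting the metric row and calling idxmax returns the column (model) label.
best_name = table.loc["ROC-AUC"].idxmax()
best_score = table.loc["ROC-AUC", best_name]
print(best_name, best_score)
```

If two models tie on the metric, `idxmax` returns the first label it encounters.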

### Save Best Performing Model

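Finally, we select the model with the highest `ROC-AUC` score from our comparison and save it to the `models/` directory. This is the best model found in this comparison. Because the winner is chosen on the same test split used for evaluation, its reported score is likely a slightly optimistic estimate.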

In [None]:
# Define the save path relative to the notebook's location
models_dir = '../models/'

# Find the best model name from the results (rows are metrics, columns are models)
best_model_name = results_df.loc['ROC-AUC'].idxmax()
best_model = trained_models[best_model_name]

print(f"Best performing model is: {best_model_name} with ROC-AUC of {results_df.loc['ROC-AUC', best_model_name]:.4f}")

# Make sure the target folder exists, then construct the full path and save the model
os.makedirs(models_dir, exist_ok=True)
save_path = os.path.join(models_dir, 'best_performing_model.joblib')
save_model(best_model, save_path)
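
`save_model` is defined in `src/models/train_model.py` and is not shown here. Given the `.joblib` extension, it plausibly wraps `joblib.dump`, but that is an assumption. The sketch below shows a save and reload round trip under that assumption, using a hypothetical `save_model_sketch` and a trivial estimator in place of the real model.

```python
import os
import tempfile

import joblib
from sklearn.dummy import DummyClassifier


def save_model_sketch(model, path):
    """Hypothetical stand-in for save_model: ensure the folder exists, then dump."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    joblib.dump(model, path)


# A trivial fitted estimator stands in for the best model.
toy_model = DummyClassifier(strategy="most_frequent").fit([[0], [1], [2]], [0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "models", "best_performing_model.joblib")
    save_model_sketch(toy_model, demo_path)
    # Reload the file to confirm the saved estimator is usable.
    reloaded = joblib.load(demo_path)
    reloaded_prediction = reloaded.predict([[5]]).tolist()

print(reloaded_prediction)
```

Reloading the file right after saving is a cheap check that the artifact on disk can actually be used for prediction.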