## LECTURE 8. CLASS EXERCISE: Boosting Algorithm Training Time Showdown

**Objective:**  
Compare the training times of four gradient boosting implementations: scikit-learn's `GradientBoostingClassifier` and `HistGradientBoostingClassifier`, XGBoost's `XGBClassifier`, and LightGBM's `LGBMClassifier`.


**Dataset Choice:**  

We will use the **Covertype** dataset, which contains **581,012 samples** and **54 features**. It is large enough to show a noticeable difference in training times between algorithms, yet small enough (especially when subsampled) to run within a class session.

The Covertype dataset contains forest cover type labels for 30×30 m plots of land in Colorado. The task is to use the 54 cartographic and environmental features (elevation, slope, soil type, etc.) to predict `Cover_Type`, the primary tree species on each plot, which takes one of seven classes (e.g., Spruce/Fir, Lodgepole Pine, Ponderosa Pine).


**Step 1: Set Up the Environment**  
Install `xgboost` and `lightgbm` in case they are not already available in your environment (for example, Google Colab).


In [18]:
# Install external libraries in Google Colab
!pip install xgboost lightgbm

import time
import pandas as pd
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.1.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


**Step 2: Load and Preprocess the Dataset**  
The Covertype dataset needs to be fetched and prepared. Tree-based boosting models split on feature thresholds, so they are largely insensitive to feature scaling; we standardise the features anyway so that all four models receive identical inputs.



In [10]:
# Load the Covertype dataset

#import ssl
#ssl._create_default_https_context = ssl._create_unverified_context

print("Loading data...")
covertype = fetch_covtype(shuffle=True, random_state=42)
X, y = covertype.data, covertype.target

# --- LABEL INDEXING ---
# Shift the labels from [1, ..., 7] to [0, ..., 6]; XGBoost expects class labels starting at 0.
y = y - 1
# -----------------------------------

# Use a smaller subset for faster execution in class, if needed.
# Keep a stratified 10% sample of the full dataset; the remaining 90%
# (returned in the "test" slots of train_test_split) is discarded.
X, _, y, _ = train_test_split(X, y, train_size=0.1, random_state=42, stratify=y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Data loaded: {X_train_scaled.shape[0]} training samples, {X_train_scaled.shape[1]} features.")


Loading data...
Data loaded: 46480 training samples, 54 features.


**Step 3: Define Models and a Timing Function**  
Define the classifiers with consistent parameters where possible, and create a function to time the training process cleanly.


In [19]:
# Initialise the models with basic, comparable parameters
# We limit n_estimators to a reasonable number to keep runtime manageable for the class.


**Step 4: Run the Comparison**  
Execute the training for all four models and collect their training times and accuracies.
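
A minimal sketch of such a comparison loop follows. It assumes a dictionary mapping names to unfitted scikit-learn-style estimators; the function name `run_comparison` is hypothetical, while `timing_results` and `accuracy_results` match the names used in Step 5. Accuracy is taken from each estimator's `.score()` method, which returns mean accuracy for scikit-learn-style classifiers.

```python
import time


def run_comparison(models, X_train, y_train, X_test, y_test):
    """Fit each model once and return (timing_results, accuracy_results).

    Both results are dicts keyed by the model names in `models`.
    """
    timing_results, accuracy_results = {}, {}
    for name, model in models.items():
        # Time only the call to fit(), not the evaluation
        start = time.perf_counter()
        model.fit(X_train, y_train)
        timing_results[name] = time.perf_counter() - start

        # Mean accuracy on the held-out data
        accuracy_results[name] = model.score(X_test, y_test)

        print(f"{name}: {timing_results[name]:.2f} s, "
              f"accuracy {accuracy_results[name]:.4f}")
    return timing_results, accuracy_results
```

In this notebook it could be called as `timing_results, accuracy_results = run_comparison(models, X_train_scaled, y_train, X_test_scaled, y_test)`. A single run per model is noisy; repeating the fit a few times and averaging would give steadier timings at the cost of a longer session.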


**Step 5: Analyse and Visualise Results**  


In [None]:
# Create a summary DataFrame from your own results in Step 4.
# Assumes timing_results and accuracy_results are dicts keyed by algorithm name.
summary_df = pd.DataFrame({
    'Algorithm': list(timing_results.keys()),
    'Training Time (s)': list(timing_results.values()),
    'Accuracy': list(accuracy_results.values())
}).sort_values(by='Training Time (s)')

print("\n--- Summary Comparison ---")
print(summary_df.round(4))

# Optional: Visualisation
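
As one option for the visualisation, the sketch below draws a horizontal bar chart of training time per algorithm with matplotlib. The function name `plot_training_times` is hypothetical, and the column names are assumed to match those of `summary_df` above.

```python
import matplotlib.pyplot as plt


def plot_training_times(summary_df):
    """Draw a horizontal bar chart of training time per algorithm and return the axes."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.barh(summary_df["Algorithm"], summary_df["Training Time (s)"])
    ax.set_xlabel("Training Time (s)")
    ax.set_title("Training time by boosting implementation")
    ax.invert_yaxis()  # put the first row of summary_df at the top
    fig.tight_layout()
    return ax
```

Calling `plot_training_times(summary_df)` followed by `plt.show()` displays the chart. If one model is much slower than the rest, `ax.set_xscale("log")` on the returned axes can make the faster models easier to compare.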