
### Lab: Regression Task Using Random Forest, XGBoost, and LightGBM

#### **Objective**:
In this lab, you will learn how to apply three powerful ensemble learning algorithms—**Random Forest**, **XGBoost**, and **LightGBM**—to solve a regression problem. You will explore how to train and evaluate these models on a sample dataset, understand their strengths, and compare their performances.

#### **Prerequisites**:
- Familiarity with Python and common ML libraries (`pandas`, `scikit-learn`).
- Basic understanding of regression metrics such as Mean Squared Error (MSE) and R-squared.


#### **Libraries to Install**:
Make sure the following libraries are installed before running the notebook.

```bash
# Install the required libraries (inside a notebook, use the %pip cell below instead)
pip install pandas scikit-learn xgboost lightgbm
```

In [2]:
%pip install pandas scikit-learn xgboost lightgbm


Note: you may need to restart the kernel to use updated packages.


### **1. Importing Required Libraries**

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import lightgbm as lgb

### **2. Dataset: California Housing Prices**

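For this lab, we will use the **California Housing Prices** dataset, which is available from the `scikit-learn` datasets module. Each row describes one California district (a census block group) with features such as median income, median house age, and average number of rooms. The target, `MedHouseVal`, is the district's median house value in units of $100,000.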

#### **Step 2.1: Load the Dataset**

In [4]:
from sklearn.datasets import fetch_california_housing

# Load the dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="MedHouseVal")

# Display the first few rows
X.head()

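|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |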


#### **Step 2.2: Split the Data**

We will split the dataset into training and testing sets.

In [5]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training data: {X_train.shape}, Testing data: {X_test.shape}")

Training data: (16512, 8), Testing data: (4128, 8)


### **3. Model 1: Random Forest Regression**

#### **Step 3.1: Train the Random Forest Regressor**

We’ll start with the **Random Forest Regressor**, which is an ensemble learning method that builds multiple decision trees and averages their predictions.

In [6]:
# Initialize the RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = rf_model.predict(X_test)
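
To make the "averages their predictions" step concrete, here is a minimal sketch. It fits a small forest on synthetic data (an assumption made to keep the example fast; this is not the California Housing data) and checks that the forest's prediction equals the mean of its individual trees' predictions, which scikit-learn exposes through `estimators_`. The `_demo` names are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: used only to keep the example fast)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted decision tree is available in forest.estimators_
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's prediction is the average over its trees
print(np.allclose(manual_average, forest.predict(X_demo)))
```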

#### **Step 3.2: Evaluate the Random Forest Regressor**

We will evaluate the performance of the Random Forest model using **Mean Squared Error (MSE)** and **R-squared (R²)**.

In [7]:
# Evaluate the model
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"Random Forest MSE: {mse_rf:.4f}")
print(f"Random Forest R²: {r2_rf:.4f}")

Random Forest MSE: 0.2554
Random Forest R²: 0.8051
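
As a reminder of what these two numbers mean, the sketch below computes MSE, RMSE, and R² by hand with NumPy. The four target and prediction values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Toy values (hypothetical, for illustration only)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 8.0])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)                     # mean of the squared errors
rmse = np.sqrt(mse)                               # same units as the target
ss_res = np.sum(residuals ** 2)                   # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
r2 = 1.0 - ss_res / ss_tot                        # share of variation explained

print(f"MSE={mse:.4f}, RMSE={rmse:.4f}, R²={r2:.4f}")
# prints: MSE=0.5625, RMSE=0.7500, R²=0.8475
```

`mean_squared_error` and `r2_score` from scikit-learn should return the same values for these inputs.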


### **4. Model 2: XGBoost Regression**

#### **Step 4.1: Train the XGBoost Regressor**

Next, we will train the **XGBoost** model, which uses gradient boosting: decision trees are added one at a time, each fitted to reduce the errors of the trees before it.

In [8]:
# Initialize the XGBoost regressor
xgb_model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
xgb_model.fit(X_train, y_train)

# Predict on the test set
y_pred_xgb = xgb_model.predict(X_test)

#### **Step 4.2: Evaluate the XGBoost Regressor**
We will now evaluate the performance of the XGBoost model.

In [9]:
# Evaluate the model
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f"XGBoost MSE: {mse_xgb:.4f}")
print(f"XGBoost R²: {r2_xgb:.4f}")

XGBoost MSE: 0.2273
XGBoost R²: 0.8266


### **5. Model 3: LightGBM Regression**

#### **Step 5.1: Train the LightGBM Regressor**

Now, we will use **LightGBM**, another gradient boosting algorithm known for its speed and efficiency.

In [10]:
# Initialize the LightGBM regressor
lgb_model = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
lgb_model.fit(X_train, y_train)

# Predict on the test set
y_pred_lgb = lgb_model.predict(X_test)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001691 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 16512, number of used features: 8
[LightGBM] [Info] Start training from score 2.071947


#### **Step 5.2: Evaluate the LightGBM Regressor**

Finally, evaluate the performance of the LightGBM model.

In [11]:
# Evaluate the model
mse_lgb = mean_squared_error(y_test, y_pred_lgb)
r2_lgb = r2_score(y_test, y_pred_lgb)

print(f"LightGBM MSE: {mse_lgb:.4f}")
print(f"LightGBM R²: {r2_lgb:.4f}")

LightGBM MSE: 0.2148
LightGBM R²: 0.8360


### **6. Comparing the Models**
We will now compare the performance of the three models using **MSE** and **R²**.

In [12]:
# Print comparison of the three models
print("Model Comparison:")
print(f"Random Forest MSE: {mse_rf:.4f}, R²: {r2_rf:.4f}")
print(f"XGBoost MSE: {mse_xgb:.4f}, R²: {r2_xgb:.4f}")
print(f"LightGBM MSE: {mse_lgb:.4f}, R²: {r2_lgb:.4f}")

Model Comparison:
Random Forest MSE: 0.2554, R²: 0.8051
XGBoost MSE: 0.2273, R²: 0.8266
LightGBM MSE: 0.2148, R²: 0.8360
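
For a side-by-side view, the recorded metrics can be collected into a small table. This sketch copies the MSE and R² values from the output above rather than recomputing them, and adds RMSE as the square root of MSE so the error is in the target's own units; the `results` name is hypothetical.

```python
import math

import pandas as pd

# MSE and R² copied from the recorded comparison output
results = pd.DataFrame(
    {
        "model": ["Random Forest", "XGBoost", "LightGBM"],
        "mse": [0.2554, 0.2273, 0.2148],
        "r2": [0.8051, 0.8266, 0.8360],
    }
)

# RMSE is in the target's units (hundreds of thousands of dollars)
results["rmse"] = results["mse"].apply(math.sqrt)

# Best model (lowest MSE) first
results = results.sort_values("mse").reset_index(drop=True)
print(results.round(4))
```

Because the target is expressed in units of $100,000, an RMSE of about 0.46 corresponds to a typical error of roughly $46,000.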


### **7. Hyperparameter Tuning (if time is left)**

If time allows, you can improve model performance by tuning hyperparameters. Here’s an example that uses `GridSearchCV` to tune **Random Forest**. Because no `scoring` argument is passed, candidates are ranked by the estimator's default score, which is R² for regressors.

In [13]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=0)

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best parameters found: ", grid_search.best_params_)

Best parameters found:  {'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 100}
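
A grid search fits one model per parameter combination per fold, so its cost grows multiplicatively with every value added to the grid. The sketch below counts the fits for the grid used above with scikit-learn's `ParameterGrid`; `param_grid_demo` is a hypothetical copy of that grid, and the extra fit accounts for `GridSearchCV` refitting the best candidate on the full training set (its default `refit=True` behaviour).

```python
from sklearn.model_selection import ParameterGrid

# Same values as the Random Forest grid above
param_grid_demo = {
    "n_estimators": [50, 100],
    "max_depth": [10, 20, 30],
    "min_samples_split": [2, 5, 10],
}

n_candidates = len(ParameterGrid(param_grid_demo))  # 2 * 3 * 3 combinations
cv_folds = 3
n_fits = n_candidates * cv_folds + 1                # +1 for the final refit

print(f"{n_candidates} candidates, {n_fits} model fits in total")
# prints: 18 candidates, 55 model fits in total
```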


### **8. Individual Work**

#### **Exercises**:
1. Experiment with the hyperparameters for **XGBoost** and **LightGBM**. Use `GridSearchCV` or `RandomizedSearchCV` to find optimal configurations.
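2. Try running the models on a different regression dataset (e.g., the Diabetes dataset from `sklearn.datasets.load_diabetes`, or any dataset of your choice). Note that the Boston Housing dataset was removed from scikit-learn in version 1.2.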
3. Analyze the training time of each model using the `time` library.

In [None]:
# Hyperparameter space for XGBoost
xgb_param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.8, 1],
    "colsample_bytree": [0.8, 1]
}

# Hyperparameter space for LightGBM
lgb_param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.2],
    "num_leaves": [31, 50, 70],
    "min_child_samples": [20, 30, 50]
}

# Define the Random Forest hyperparameter search space
rf_param_grid = {
    "n_estimators": [50, 100, 200],
    # None (no depth limit) is included because the recorded best parameters report max_depth=None
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"]
}

In [33]:
# Initialize models
xgb_model = xgb.XGBRegressor(random_state=42)
lgb_model = lgb.LGBMRegressor(random_state=42, verbose=-1)  # verbose=-1 silences LightGBM's info logs
rf_model = RandomForestRegressor(random_state=42)

In [34]:
from time import time
from sklearn.model_selection import RandomizedSearchCV

# Use GridSearchCV for XGBoost
print("Tuning XGBoost...")
start_time_xgb = time()
xgb_grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=xgb_param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
    verbose=0,
    n_jobs=-1
)
xgb_grid_search.fit(X_train, y_train)
xgb_best_model = xgb_grid_search.best_estimator_
end_time_xgb = time()
xgb_training_time = end_time_xgb - start_time_xgb


# Use RandomizedSearchCV for LightGBM
print("Tuning LightGBM...")
start_time_lgb = time()
lgb_random_search = RandomizedSearchCV(
    estimator=lgb_model,
    param_distributions=lgb_param_grid,
    scoring="neg_mean_squared_error",
    n_iter=50,
    cv=3,
    verbose=0,
    random_state=42,
    n_jobs=-1
)
lgb_random_search.fit(X_train, y_train)
lgb_best_model = lgb_random_search.best_estimator_
end_time_lgb = time()
lgb_training_time = end_time_lgb - start_time_lgb


# Use GridSearchCV for RandomForest
# (the variable is named rf_random_search, but this is an exhaustive grid search)
print("Tuning RandomForest...")
start_time_rf = time()
rf_random_search = GridSearchCV(
    estimator=rf_model,
    param_grid=rf_param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
    verbose=0,
    n_jobs=-1
)
rf_random_search.fit(X_train, y_train)
rf_best_model = rf_random_search.best_estimator_
end_time_rf = time()
rf_training_time = end_time_rf - start_time_rf


Tuning XGBoost...
Tuning LightGBM...
Tuning RandomForest...
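
With `scoring="neg_mean_squared_error"`, scikit-learn reports the negated MSE so that higher is always better. The sketch below shows how to turn `best_score_` back into an RMSE and how to read per-candidate fit times from `cv_results_`. It uses a single decision tree on synthetic data to stay fast; both are assumptions for illustration, not the searches run above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: used only to keep the example fast)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated *negative* MSE of the best candidate
cv_rmse = np.sqrt(-search.best_score_)
print(f"Best params: {search.best_params_}, RMSE from mean CV MSE: {cv_rmse:.2f}")

# cv_results_ records the average time of one fit for every candidate
for params, fit_time in zip(search.cv_results_["params"], search.cv_results_["mean_fit_time"]):
    print(params, f"{fit_time:.4f} s per fit")
```

The cross-validated score is computed on the training folds only, so it is a separate estimate from the test-set RMSE reported later.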



In [35]:
# Evaluate all three tuned models on the test set
xgb_pred = xgb_best_model.predict(X_test)
lgb_pred = lgb_best_model.predict(X_test)
rf_pred = rf_best_model.predict(X_test)

# Compute metrics
xgb_rmse = np.sqrt(mean_squared_error(y_test, xgb_pred))
xgb_r2 = r2_score(y_test, xgb_pred)

lgb_rmse = np.sqrt(mean_squared_error(y_test, lgb_pred))
lgb_r2 = r2_score(y_test, lgb_pred)

rf_rmse = np.sqrt(mean_squared_error(y_test, rf_pred))
rf_r2 = r2_score(y_test, rf_pred)

In [36]:
# Print results
print("\n--- XGBoost Results ---")
print(f"Best Parameters: {xgb_grid_search.best_params_}")
print(f"RMSE: {xgb_rmse:.4f}")
print(f"R2 Score: {xgb_r2:.4f}")
print(f"Training Time: {xgb_training_time:.2f} seconds")

print("\n--- LightGBM Results ---")
print(f"Best Parameters: {lgb_random_search.best_params_}")
print(f"RMSE: {lgb_rmse:.4f}")
print(f"R2 Score: {lgb_r2:.4f}")
print(f"Training Time: {lgb_training_time:.2f} seconds")

print("\n--- RandomForest Results ---")
print(f"Best Parameters: {rf_random_search.best_params_}")
print(f"RMSE: {rf_rmse:.4f}")
print(f"R2 Score: {rf_r2:.4f}")
print(f"Training Time: {rf_training_time:.2f} seconds")

# Compare training times and metrics
print("\n--- Comparison ---")
print(f"XGBoost Training Time: {xgb_training_time:.2f} seconds")
print(f"LightGBM Training Time: {lgb_training_time:.2f} seconds")
print(f"RandomForest Training Time: {rf_training_time:.2f} seconds")
print(f"XGBoost RMSE: {xgb_rmse:.4f} | LightGBM RMSE: {lgb_rmse:.4f} | RandomForest RMSE: {rf_rmse:.4f}")
print(f"XGBoost R2: {xgb_r2:.4f} | LightGBM R2: {lgb_r2:.4f} | RandomForest R2: {rf_r2:.4f}")


--- XGBoost Results ---
Best Parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 300, 'subsample': 1}
RMSE: 0.4425
R2 Score: 0.8506
Training Time: 101.06 seconds

--- LightGBM Results ---
Best Parameters: {'num_leaves': 31, 'n_estimators': 300, 'min_child_samples': 30, 'max_depth': 7, 'learning_rate': 0.1}
RMSE: 0.4413
R2 Score: 0.8514
Training Time: 23.46 seconds

--- RandomForest Results ---
Best Parameters: {'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
RMSE: 0.4941
R2 Score: 0.8137
Training Time: 948.44 seconds

--- Comparison ---
XGBoost Training Time: 101.06 seconds
LightGBM Training Time: 23.46 seconds
RandomForest Training Time: 948.44 seconds
XGBoost RMSE: 0.4425 | LightGBM RMSE: 0.4413 | RandomForest RMSE: 0.4941
XGBoost R2: 0.8506 | LightGBM R2: 0.8514 | RandomForest R2: 0.8137


#### **Research Points**:
- How do the three models differ in terms of training time and performance?
- Why might LightGBM or XGBoost outperform Random Forest on certain datasets?
- How do bagging methods like Random Forest and boosting methods like XGBoost and LightGBM handle overfitting?

##### Model Comparison: Training Time and Performance

Training Time:
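* The timings above measure the whole hyperparameter search, not a single fit, and the searches are not the same size: XGBoost and Random Forest each went through an exhaustive grid, while LightGBM's randomized search was capped at `n_iter=50` candidates. With that caveat, LightGBM finished fastest (23.46 s, against 101.06 s for XGBoost and 948.44 s for Random Forest). Its histogram-based learning contributes to this: continuous features are grouped into discrete bins, so far fewer split thresholds have to be evaluated.
* Random Forest was by far the slowest. Its trees are independent and can be built in parallel, but scikit-learn grows each one to full depth unless `max_depth` is set and searches for exact split thresholds rather than histogram bins, which makes each fit expensive on 16,512 training rows.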

Performance (RMSE & R²):
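* LightGBM achieves the lowest RMSE and the highest R² (0.4413 and 0.8514), but its margin over XGBoost (0.4425 and 0.8506) is small enough that a different train/test split or random seed could reverse the order.
* Random Forest trails both boosting models (RMSE 0.4941, R² 0.8137), so on this dataset and with these search spaces the boosted models fit better.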


##### Why LightGBM and XGBoost Outperform Random Forest

**Boosting vs Bagging:**

Random Forest (Bagging):
* Builds trees independently by sampling data with replacement (bootstrap aggregation).
* Averaging many trees reduces variance, but it does not reduce bias: the forest is only as expressive as its individual trees.
* May underperform on datasets with complex relationships, because the trees never learn from each other's errors.
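
Bootstrap aggregation can be sketched by hand with plain decision trees. The example below compares one fully grown tree against the average of 50 trees, each trained on a bootstrap sample. It uses a synthetic noisy sine curve, which is an assumption for illustration. This is plain bagging; Random Forest additionally samples a subset of features at each split.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Noisy sine curve (synthetic data, assumed for illustration)
X_train_demo = rng.uniform(0, 6, size=(300, 1))
y_train_demo = np.sin(X_train_demo[:, 0]) + rng.normal(scale=0.3, size=300)
X_test_demo = rng.uniform(0, 6, size=(200, 1))
y_test_demo = np.sin(X_test_demo[:, 0]) + rng.normal(scale=0.3, size=200)

# One fully grown tree: low bias, high variance
single_tree = DecisionTreeRegressor(random_state=0).fit(X_train_demo, y_train_demo)
mse_single = mean_squared_error(y_test_demo, single_tree.predict(X_test_demo))

# Bagging: train each tree on a bootstrap sample, then average the predictions
n_trees = 50
tree_preds = []
for i in range(n_trees):
    idx = rng.integers(0, len(X_train_demo), size=len(X_train_demo))  # sample with replacement
    tree = DecisionTreeRegressor(random_state=i).fit(X_train_demo[idx], y_train_demo[idx])
    tree_preds.append(tree.predict(X_test_demo))
bagged_pred = np.mean(tree_preds, axis=0)
mse_bagged = mean_squared_error(y_test_demo, bagged_pred)

print(f"Single tree MSE:  {mse_single:.4f}")
print(f"Bagged trees MSE: {mse_bagged:.4f}")
```

The averaged prediction is smoother than any single tree's, which is the variance reduction that bagging relies on.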

XGBoost and LightGBM (Boosting):
* Build trees sequentially, each focusing on correcting the errors of the previous tree.
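* Mainly reduce bias, since each round corrects what the current ensemble still gets wrong; variance is kept in check by a small learning rate, subsampling, and regularization.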
* More effective at capturing complex patterns in data compared to Random Forest.
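
The sequential idea can be sketched in a few lines for squared-error loss: start from a constant prediction, then repeatedly fit a shallow tree to the current residuals and add a shrunken version of its output. This is a bare-bones illustration on synthetic data, not how XGBoost or LightGBM are implemented; both add regularization, second-order information, and many optimizations on top of it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Noisy sine curve (synthetic data, assumed for illustration)
X_demo = rng.uniform(0, 6, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_demo, y_demo.mean())
train_mse = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y_demo - prediction  # negative gradient of 1/2 * squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)  # shrunken correction
    train_mse.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE before boosting: {train_mse[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {train_mse[-1]:.4f}")
```

Training error can only go down with this update, which is why boosting needs a held-out set, a small learning rate, or early stopping to guard against overfitting.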

**Efficiency of Boosting Algorithms:**

XGBoost:
* Implements regularization (L1 and L2) to prevent overfitting.
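* Fits each new tree to the first- and second-order gradients of the loss, so samples with large errors drive the next tree; unlike AdaBoost, it does not reweight misclassified samples.
* Uses sparsity-aware split finding, which handles missing values and sparse, high-dimensional features efficiently.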
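
To see what L2 regularization does to a single leaf, the sketch below applies the leaf-weight formula from the XGBoost formulation, w = -G / (H + λ), where G and H are the sums of the gradients and Hessians of the samples in the leaf. It is a simplified illustration: the L1 term and the per-leaf penalty are left out, and the targets and predictions are hypothetical.

```python
def leaf_weight(gradients, hessians, reg_lambda):
    """Leaf value -G / (H + lambda) for L2-regularized boosting."""
    G = sum(gradients)
    H = sum(hessians)
    return -G / (H + reg_lambda)


# Squared-error loss 1/2 * (y - pred)^2: gradient = pred - y, Hessian = 1
y_leaf = [3.0, 4.0, 5.0]   # hypothetical targets that fall into one leaf
pred = [2.0, 2.0, 2.0]     # current ensemble prediction for those samples
grads = [p - y for p, y in zip(pred, y_leaf)]   # [-1.0, -2.0, -3.0]
hess = [1.0] * len(y_leaf)

w_unreg = leaf_weight(grads, hess, reg_lambda=0.0)   # mean residual
w_reg = leaf_weight(grads, hess, reg_lambda=1.0)     # shrunk toward zero

print(w_unreg, w_reg)
# prints: 2.0 1.5
```

With λ = 0 the leaf predicts the mean residual (2.0); with λ = 1 the correction shrinks to 1.5, so each tree moves the prediction more cautiously. In the scikit-learn wrapper, λ corresponds to the `reg_lambda` parameter of `XGBRegressor`.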

LightGBM:
* Optimized for speed with histogram-based learning.
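* Often handles large datasets and high-dimensional features more efficiently than XGBoost's exact split search, although XGBoost also offers a histogram-based method (`tree_method="hist"`).
* Grows trees leaf-wise rather than level-wise: at each step it splits the leaf with the largest loss reduction, which tends to reach a lower loss for the same number of leaves but can overfit small datasets unless `num_leaves` or `max_depth` is limited.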
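
The speed-up from histogram-based learning comes from reducing the number of split thresholds to test. The sketch below bins one synthetic continuous feature into 255 quantile bins with NumPy and compares the number of candidate thresholds before and after binning. It illustrates the idea only; the bin count and the quantile binning are assumptions, not LightGBM's actual binning code.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)   # one continuous feature (synthetic)

# Exact split search: one candidate threshold between each pair of distinct values
exact_candidates = np.unique(x).size - 1

# Histogram approach: map each value to one of n_bins quantile bins
n_bins = 255
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])   # interior bin edges
bin_index = np.searchsorted(edges, x, side="right")           # integers in [0, n_bins - 1]

# Only the boundaries between occupied bins remain as candidates
hist_candidates = np.unique(bin_index).size - 1

print(f"Exact candidates: {exact_candidates}")
print(f"Histogram candidates: {hist_candidates}")
```

After binning, the split search works on small integer bin indices, so the cost per feature depends on the number of bins rather than on the number of distinct values.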

This notebook provides a hands-on approach to comparing **Random Forest**, **XGBoost**, and **LightGBM** for a regression task. It introduces the core concepts and provides insights into their performance, with opportunities for deeper exploration.