
### Lab: Regression Task Using Random Forest, XGBoost, and LightGBM

#### **Objective**:
In this lab, you will apply three ensemble learning algorithms, **Random Forest**, **XGBoost**, and **LightGBM**, to a regression problem. You will train and evaluate each model on a sample dataset, examine their strengths, and compare their performance.

#### **Prerequisites**:
- Familiarity with Python and common ML libraries (`pandas`, `scikit-learn`).
- Basic understanding of regression metrics such as Mean Squared Error (MSE) and R-squared.


#### **Libraries to Install**:
Make sure you have the following libraries installed before running the notebook.

```bash
# Install the required libraries.
# The leading "!" runs a shell command from a notebook cell; drop it in a terminal.
!pip install pandas scikit-learn xgboost lightgbm
```

### **1. Importing Required Libraries**

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import lightgbm as lgb

### **2. Dataset: California Housing Prices**

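For this lab, we will use the **California Housing Prices** dataset, which is available from the `scikit-learn` datasets module. Each row describes one California district with features such as median income, median house age, and average number of rooms. The target, `MedHouseVal`, is the median house value of the district, expressed in units of $100,000.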

#### **Step 2.1: Load the Dataset**

In [None]:
from sklearn.datasets import fetch_california_housing

# Load the dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="MedHouseVal")

# Display the first few rows
X.head()

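|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |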


#### **Step 2.2: Split the Data**

We will split the dataset into training and testing sets.

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training data: {X_train.shape}, Testing data: {X_test.shape}")

Training data: (16512, 8), Testing data: (4128, 8)


### **3. Model 1: Random Forest Regression**

#### **Step 3.1: Train the Random Forest Regressor**

We’ll start with the **Random Forest Regressor**, an ensemble learning method that trains many decision trees on bootstrap samples of the training data and averages their predictions.

In [None]:
# Initialize the RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = rf_model.predict(X_test)
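
To make the "many trees, averaged" idea concrete, here is a minimal sketch of bagging written by hand. It is an illustration only, not how `RandomForestRegressor` is implemented internally. The synthetic data from `make_regression` and the names `X_demo`, `y_demo`, `trees`, and `bagged_pred` are assumptions introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for the housing data so the sketch runs on its own
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeRegressor(random_state=int(rng.integers(1_000_000)))
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# The ensemble prediction is the mean of the individual tree predictions
all_preds = np.stack([t.predict(X_demo) for t in trees])
bagged_pred = all_preds.mean(axis=0)
print(all_preds.shape, bagged_pred.shape)
```

Averaging over trees trained on different resamples is what reduces the variance of a single deep tree.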

#### **Step 3.2: Evaluate the Random Forest Regressor**

We will evaluate the performance of the Random Forest model using **Mean Squared Error (MSE)** and **R-squared (R²)**.

In [None]:
# Evaluate the model
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"Random Forest MSE: {mse_rf:.4f}")
print(f"Random Forest R²: {r2_rf:.4f}")

Random Forest MSE: 0.2554
Random Forest R²: 0.8051
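
For reference, both metrics are simple to compute by hand. The sketch below uses a tiny made-up example (the arrays `y_true_demo` and `y_pred_demo` and the helper names `mse` and `r2` are hypothetical, not part of the lab) and follows the usual definitions: MSE is the mean of squared errors, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the target mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical toy values, chosen so the arithmetic is easy to check
y_true_demo = [3.0, 1.0, 2.0, 4.0]
y_pred_demo = [2.5, 1.0, 2.0, 4.5]

print(mse(y_true_demo, y_pred_demo))  # 0.125
print(r2(y_true_demo, y_pred_demo))   # 0.9
```

Because the housing target is in units of $100,000, the MSE values in this lab are in squared units of $100,000.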


### **4. Model 2: XGBoost Regression**

#### **Step 4.1: Train the XGBoost Regressor**

Next, we will train the **XGBoost** model, which uses gradient boosting: decision trees are added one at a time, and each new tree is fit to correct the errors of the trees built before it.

In [None]:
# Initialize the XGBoost regressor
xgb_model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
xgb_model.fit(X_train, y_train)

# Predict on the test set
y_pred_xgb = xgb_model.predict(X_test)
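
The core boosting loop can be sketched in a few lines for the squared-error loss. This is a simplified illustration and not XGBoost's implementation, which adds regularisation, second-order gradient information, and many engineering optimisations. The synthetic data and the names `X_demo`, `y_demo`, `prediction`, and `mse_history` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for the housing data so the sketch runs on its own
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1
n_rounds = 100

# Start from a constant prediction: the mean of the target
prediction = np.full(len(y_demo), y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared error, the residuals are the negative gradient (up to a constant)
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residuals)
    # Take a small step in the direction suggested by the new tree
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE: {mse_history[0]:.1f} -> {mse_history[-1]:.1f}")
```

The `learning_rate` here plays the same role as the `learning_rate=0.1` argument passed to the regressor: smaller steps need more trees but tend to generalise better.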

#### **Step 4.2: Evaluate the XGBoost Regressor**
We will now evaluate the performance of the XGBoost model.

In [None]:
# Evaluate the model
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f"XGBoost MSE: {mse_xgb:.4f}")
print(f"XGBoost R²: {r2_xgb:.4f}")

XGBoost MSE: 0.2273
XGBoost R²: 0.8266


### **5. Model 3: LightGBM Regression**

#### **Step 5.1: Train the LightGBM Regressor**

Now, we will use **LightGBM**, another gradient boosting algorithm known for its speed and efficiency.

In [None]:
# Initialize the LightGBM regressor
lgb_model = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
lgb_model.fit(X_train, y_train)

# Predict on the test set
y_pred_lgb = lgb_model.predict(X_test)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002399 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 16512, number of used features: 8
[LightGBM] [Info] Start training from score 2.071947


#### **Step 5.2: Evaluate the LightGBM Regressor**

Finally, evaluate the performance of the LightGBM model.

In [None]:
# Evaluate the model
mse_lgb = mean_squared_error(y_test, y_pred_lgb)
r2_lgb = r2_score(y_test, y_pred_lgb)

print(f"LightGBM MSE: {mse_lgb:.4f}")
print(f"LightGBM R²: {r2_lgb:.4f}")

LightGBM MSE: 0.2148
LightGBM R²: 0.8360


### **6. Comparing the Models**
We will now compare the performance of the three models using **MSE** and **R²**.

In [None]:
# Print comparison of the three models
print("Model Comparison:")
print(f"Random Forest MSE: {mse_rf:.4f}, R²: {r2_rf:.4f}")
print(f"XGBoost MSE: {mse_xgb:.4f}, R²: {r2_xgb:.4f}")
print(f"LightGBM MSE: {mse_lgb:.4f}, R²: {r2_lgb:.4f}")

Model Comparison:
Random Forest MSE: 0.2554, R²: 0.8051
XGBoost MSE: 0.2273, R²: 0.8266
LightGBM MSE: 0.2148, R²: 0.8360
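
Since the target is expressed in units of $100,000, taking the square root of each MSE gives an error in the same units as the house values, which is often easier to interpret. The sketch below reuses the MSE values printed above; the dictionary `mse_by_model` and the variable `best_model` are names introduced for this example.

```python
import math

# MSE values copied from the comparison output above
mse_by_model = {"Random Forest": 0.2554, "XGBoost": 0.2273, "LightGBM": 0.2148}

# Sort from lowest to highest error and report RMSE in target units and dollars
for name, mse_value in sorted(mse_by_model.items(), key=lambda kv: kv[1]):
    rmse = math.sqrt(mse_value)  # in units of $100,000
    print(f"{name:<14} RMSE: {rmse:.4f}  (~${rmse * 100_000:,.0f})")

best_model = min(mse_by_model, key=mse_by_model.get)
print("Lowest MSE:", best_model)
```

Keep in mind that these numbers come from a single train/test split with mostly default settings, so small differences between models should be read with caution.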


### **7. Hyperparameter Tuning (if time permits)**

Advanced users can improve model performance by tuning hyperparameters. The example below uses `GridSearchCV` to tune the hyperparameters of **Random Forest**.

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

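# Initialize GridSearchCV.
# Parallelism is moved into the forest (n_jobs=-1 on the estimator), which uses threads,
# while the search itself runs sequentially, so no tasks are pickled for worker processes.
grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42, n_jobs=-1),
    param_grid=param_grid,
    cv=3,
    verbose=2,
)

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best parameters found: ", grid_search.best_params_)

Fitting 3 folds for each of 36 candidates, totalling 108 fits


**Note:** The original run of this cell passed `n_jobs=-1` to `GridSearchCV` itself and stopped with `PicklingError: Could not pickle the task to send it to the workers.` The cell above avoids process-based parallelism in the search; it has not been re-run here, so no best parameters are reported.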
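
A grid search tries every combination, which grows quickly with the number of parameters. `RandomizedSearchCV` instead samples a fixed number of configurations from distributions. The sketch below is one possible setup under stated assumptions: it uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in so that it runs without extra dependencies, and the parameter ranges are arbitrary illustrations, not recommendations.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data stands in for the housing data so the sketch runs on its own
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Distributions to sample from (illustrative ranges only)
param_distributions = {
    "n_estimators": randint(50, 151),      # integers in [50, 150]
    "max_depth": randint(2, 6),            # integers in [2, 5]
    "learning_rate": uniform(0.01, 0.29),  # floats in [0.01, 0.30]
    "subsample": uniform(0.6, 0.4),        # floats in [0.6, 1.0]
}

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                              # number of sampled configurations
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV MSE:", -search.best_score_)
```

`xgb.XGBRegressor` and `lgb.LGBMRegressor` follow the scikit-learn estimator interface and accept parameters with these same names, so they should be usable as the `estimator` here with `X_train` and `y_train` in place of the synthetic data.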

### **8. Individual Work**

#### **Exercises**:
1. Experiment with the hyperparameters for **XGBoost** and **LightGBM**. Use `GridSearchCV` or `RandomizedSearchCV` to find optimal configurations.
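2. Try running the models on a different regression dataset (e.g., the Diabetes dataset via `sklearn.datasets.load_diabetes`, or any dataset of your choice). Note that the Boston Housing dataset was removed from scikit-learn in version 1.2.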
3. Analyze the training time of each model using the `time` library.
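
As a possible starting point for exercise 3, the sketch below wraps `fit` in a small timing helper. The helper `timed_fit`, the synthetic data, and the use of `GradientBoostingRegressor` as a stand-in for the boosting libraries are all assumptions made so the example runs on its own; swap in `xgb_model`, `lgb_model`, `X_train`, and `y_train` for the real comparison.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

def timed_fit(model, X, y):
    """Fit `model` in place and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

# Synthetic data stands in for the housing data so the sketch runs on its own
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting (stand-in)": GradientBoostingRegressor(n_estimators=50, random_state=42),
}

timings = {name: timed_fit(model, X_demo, y_demo) for name, model in models.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```

A single measurement is noisy, so repeating each fit a few times and reporting the median gives a fairer comparison.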

#### **Research Points**:
- How do the three models differ in terms of training time and performance?
- Why might LightGBM or XGBoost outperform Random Forest on certain datasets?
- How do bagging methods like Random Forest and boosting methods like XGBoost and LightGBM each handle overfitting?

This notebook provides a hands-on approach to comparing **Random Forest**, **XGBoost**, and **LightGBM** for a regression task. It introduces the core concepts and provides insights into their performance, with opportunities for deeper exploration.