### Lab: Regression Task Using Random Forest, XGBoost, and LightGBM

#### **Objective**:
In this lab, you will learn how to apply three powerful ensemble learning algorithms—**Random Forest**, **XGBoost**, and **LightGBM**—to solve a regression problem. You will explore how to train and evaluate these models on a sample dataset, understand their strengths, and compare their performances.

#### **Prerequisites**:
- Familiarity with Python and common ML libraries (`pandas`, `scikit-learn`).
- Basic understanding of regression metrics such as Mean Squared Error (MSE) and R-squared.
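
As a quick refresher on the two metrics, the sketch below computes MSE and R² by hand with NumPy. The four target values and predictions are made up for illustration and are not taken from the housing data.

```python
import numpy as np

# Toy values chosen for illustration only (not from the lab dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MSE: {mse_manual:.4f}, R²: {r2_manual:.4f}")
```

Lower MSE is better, while R² approaches 1 as the predictions explain more of the variance in the target.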


#### **Libraries to Install**:
Make sure the following libraries are installed before running the notebook. The leading `!` runs the command from a notebook cell; omit it when installing from a terminal.

```bash
# Install the required libraries
!pip install pandas scikit-learn xgboost lightgbm
```

### **1. Importing Required Libraries**

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import lightgbm as lgb
import time

### **2. Dataset: California Housing Prices**

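For this lab, we will use the **California Housing Prices** dataset, which is available from `sklearn.datasets`. Each row describes one district with eight numeric features, such as median income (`MedInc`), median house age (`HouseAge`), and average rooms per household (`AveRooms`). The target is the median house value (`MedHouseVal`), expressed in units of $100,000.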

#### **Step 2.1: Load the Dataset**

In [2]:
from sklearn.datasets import fetch_california_housing

# Load the dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="MedHouseVal")

# Display the first few rows
X.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


#### **Step 2.2: Split the Data**

We will split the dataset into training and testing sets.

In [3]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training data: {X_train.shape}, Testing data: {X_test.shape}")

Training data: (16512, 8), Testing data: (4128, 8)


### **2.3. Model 0: Decision Tree**

In [4]:
from sklearn.tree import DecisionTreeRegressor
dt_model = DecisionTreeRegressor(random_state=42)

start_time = time.time()
dt_model.fit(X_train, y_train)
end_time = time.time()
training_time_dt = end_time - start_time

y_pred_dt = dt_model.predict(X_test)
mse_dt = mean_squared_error(y_test, y_pred_dt)
r2_dt = r2_score(y_test, y_pred_dt)

mse_dt_train = mean_squared_error(y_train, dt_model.predict(X_train))
r2_dt_train = r2_score(y_train, dt_model.predict(X_train))

print(f"Decision Tree MSE: {mse_dt:.4f}")
print(f"Decision Tree R²: {r2_dt:.4f}")
print(f"Decision Tree Training Time: {training_time_dt:.4f} seconds")
print(f"Decision Tree Training MSE: {mse_dt_train:.4f}")
print(f"Decision Tree Training R²: {r2_dt_train:.4f}")


Decision Tree MSE: 0.4997
Decision Tree R²: 0.6187
Decision Tree Training Time: 0.1162 seconds
Decision Tree Training MSE: 0.0000
Decision Tree Training R²: 1.0000


### **3. Model 1: Random Forest Regression**

#### **Step 3.1: Train the Random Forest Regressor**

We’ll start with the **Random Forest Regressor**, which is an ensemble learning method that builds multiple decision trees and averages their predictions.

In [5]:
# Initialize the RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
start_time = time.time()
rf_model.fit(X_train, y_train)
end_time = time.time()
training_time_rf = end_time - start_time

# Predict on the test set
y_pred_rf = rf_model.predict(X_test)


#### **Step 3.2: Evaluate the Random Forest Regressor**

We will evaluate the performance of the Random Forest model using **Mean Squared Error (MSE)** and **R-squared (R²)**.

In [6]:
# Evaluate the model
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

mse_rf_train = mean_squared_error(y_train, rf_model.predict(X_train))
r2_rf_train = r2_score(y_train, rf_model.predict(X_train))

print(f"Random Forest MSE: {mse_rf:.4f}")
print(f"Random Forest R²: {r2_rf:.4f}")
print(f"Random Forest Training Time: {training_time_rf:.4f} seconds")

print(f"Random Forest Training MSE: {mse_rf_train:.4f}")
print(f"Random Forest Training R²: {r2_rf_train:.4f}")


Random Forest MSE: 0.2557
Random Forest R²: 0.8049
Random Forest Training Time: 6.3041 seconds
Random Forest Training MSE: 0.0354
Random Forest Training R²: 0.9735
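
To make the "averages their predictions" idea concrete, the sketch below checks that a fitted forest's output equals the mean of its individual trees' outputs. It uses a small synthetic dataset and the hypothetical names `X_demo`, `y_demo`, and `forest_demo` so that it runs on its own without touching the lab variables.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic dataset (an assumption for this demo; not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest_demo = RandomForestRegressor(n_estimators=20, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Predictions of each individual tree, shape (n_trees, n_samples)
per_tree = np.stack([tree.predict(X_demo) for tree in forest_demo.estimators_])

# Average over the tree axis and compare with the forest's own prediction
manual_avg = per_tree.mean(axis=0)
matches = np.allclose(manual_avg, forest_demo.predict(X_demo))
print("Forest prediction equals mean of tree predictions:", matches)
```

Averaging many trees that were each fitted on a different bootstrap sample is what reduces variance relative to the single decision tree trained earlier.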


### **4. Model 2: XGBoost Regression**

#### **Step 4.1: Train the XGBoost Regressor**

Next, we will train the **XGBoost** model, which builds decision trees sequentially with gradient boosting: each new tree is fitted to correct the errors of the ensemble built so far.

In [7]:
# Initialize the XGBoost regressor
xgb_model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
start_time = time.time()
xgb_model.fit(X_train, y_train)
end_time = time.time()
training_time_xgb = end_time - start_time

# Predict on the test set
y_pred_xgb = xgb_model.predict(X_test)

#### **Step 4.2: Evaluate the XGBoost Regressor**
We will now evaluate the performance of the XGBoost model.

In [8]:
# Evaluate the model
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

mse_xgb_train = mean_squared_error(y_train, xgb_model.predict(X_train))
r2_xgb_train = r2_score(y_train, xgb_model.predict(X_train))

print(f"XGBoost MSE: {mse_xgb:.4f}")
print(f"XGBoost R²: {r2_xgb:.4f}")
print(f"XGBoost Training Time: {training_time_xgb:.4f} seconds")
print(f"XGBoost Training MSE: {mse_xgb_train:.4f}")
print(f"XGBoost Training R²: {r2_xgb_train:.4f}")


XGBoost MSE: 0.2273
XGBoost R²: 0.8266
XGBoost Training Time: 0.1864 seconds
XGBoost Training MSE: 0.1361
XGBoost Training R²: 0.8982


### **5. Model 3: LightGBM Regression**

#### **Step 5.1: Train the LightGBM Regressor**

Now, we will use **LightGBM**, another gradient boosting algorithm known for its speed and efficiency.

In [9]:
# Initialize the LightGBM regressor

lgb_model = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
start_time = time.time()
lgb_model.fit(X_train, y_train)
end_time = time.time()
training_time_lgb = end_time - start_time

# Predict on the test set
y_pred_lgb = lgb_model.predict(X_test)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000500 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 16512, number of used features: 8
[LightGBM] [Info] Start training from score 2.071947


#### **Step 5.2: Evaluate the LightGBM Regressor**

Finally, evaluate the performance of the LightGBM model.

In [10]:
# Evaluate the model
mse_lgb = mean_squared_error(y_test, y_pred_lgb)
r2_lgb = r2_score(y_test, y_pred_lgb)

mse_lgb_train = mean_squared_error(y_train, lgb_model.predict(X_train))
r2_lgb_train = r2_score(y_train, lgb_model.predict(X_train))

print(f"LightGBM MSE: {mse_lgb:.4f}")
print(f"LightGBM R²: {r2_lgb:.4f}")
print(f"LightGBM Training Time: {training_time_lgb:.4f} seconds")
print(f"LightGBM Training MSE: {mse_lgb_train:.4f}")
print(f"LightGBM Training R²: {r2_lgb_train:.4f}")


LightGBM MSE: 0.2148
LightGBM R²: 0.8360
LightGBM Training Time: 0.4720 seconds
LightGBM Training MSE: 0.1562
LightGBM Training R²: 0.8831
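
The training and test scores printed so far can be compared directly to see how much each model overfits. The sketch below copies the R² values from the outputs above and computes the train-minus-test gap; the dictionary names are introduced here for illustration.

```python
# Train/test R² values copied from the outputs printed earlier in this notebook
reported_r2 = {
    "Decision Tree": (1.0000, 0.6187),
    "Random Forest": (0.9735, 0.8049),
    "XGBoost": (0.8982, 0.8266),
    "LightGBM": (0.8831, 0.8360),
}

# Gap between training and test R²; a larger gap suggests more overfitting
gaps = {name: train - test for name, (train, test) in reported_r2.items()}

# Print models from largest to smallest gap
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    train_r2, test_r2 = reported_r2[name]
    print(f"{name:<14} train R²={train_r2:.4f}  test R²={test_r2:.4f}  gap={gap:.4f}")
```

The single decision tree fits the training set perfectly but generalizes worst, while the two boosting models show the smallest gaps in this run.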


### **6. Comparing the Models**
We will now compare the performance of the three models using **MSE** and **R²**.

In [11]:
# Print comparison of the three models
print("Model Comparison:")
print(f"Random Forest MSE: {mse_rf:.4f}, R²: {r2_rf:.4f}")
print(f"XGBoost MSE: {mse_xgb:.4f}, R²: {r2_xgb:.4f}")
print(f"LightGBM MSE: {mse_lgb:.4f}, R²: {r2_lgb:.4f}")

Model Comparison:
Random Forest MSE: 0.2557, R²: 0.8049
XGBoost MSE: 0.2273, R²: 0.8266
LightGBM MSE: 0.2148, R²: 0.8360
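
A small table can make the comparison easier to read. The sketch below puts the test metrics printed above into a pandas DataFrame and adds RMSE, which is in the same units as the target (units of $100,000 for this dataset). The DataFrame name `comparison` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Test-set metrics copied from the comparison output above
comparison = pd.DataFrame(
    {
        "model": ["Random Forest", "XGBoost", "LightGBM"],
        "mse": [0.2557, 0.2273, 0.2148],
        "r2": [0.8049, 0.8266, 0.8360],
    }
)

# RMSE is the square root of MSE, so it is expressed in target units
comparison["rmse"] = np.sqrt(comparison["mse"])

# Sort so the model with the lowest error comes first
comparison = comparison.sort_values("mse").reset_index(drop=True)
print(comparison.to_string(index=False))
```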


In [12]:
print("Training time comparison:")
print(f"Random Forest Training Time: {training_time_rf:.4f} seconds")
print(f"XGBoost Training Time: {training_time_xgb:.4f} seconds")
print(f"LightGBM Training Time: {training_time_lgb:.4f} seconds")


Training time comparison:
Random Forest Training Time: 6.3041 seconds
XGBoost Training Time: 0.1864 seconds
LightGBM Training Time: 0.4720 seconds
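
Single-run timings like these are noisy and depend on the machine and on how many threads each library uses. A minimal sketch of a more stable measurement is shown below, using `time.perf_counter` and several repeats. The helper `time_fit` is hypothetical, and the workload passed to it is a stand-in so the block runs on its own.

```python
import statistics
import time


def time_fit(fit_fn, repeats=3):
    """Call fit_fn() `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit_fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload; in the notebook one could pass, for example,
# `lambda: rf_model.fit(X_train, y_train)` instead.
durations = time_fit(lambda: sum(i * i for i in range(100_000)), repeats=3)
print(f"median: {statistics.median(durations):.4f}s, min: {min(durations):.4f}s")
```

Reporting the median over a few repeats reduces the influence of one-off slowdowns such as other processes competing for the CPU.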


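### **7. Hyperparameter Tuning (If Time Permits)**

If you want to go further, you can often improve model performance by tuning hyperparameters. Here is an example of using `GridSearchCV` to tune **Random Forest**. Since no `scoring` argument is passed, candidates are ranked by the estimator's default score, which is R² for regressors.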

In [13]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best parameters found: ", grid_search.best_params_)

Fitting 3 folds for each of 36 candidates, totalling 108 fits
[CV] END max_depth=None, min_samples_split=2, n_estimators=50; total time=   3.8s
[CV] END max_depth=None, min_samples_split=2, n_estimators=50; total time=   3.8s
[CV] END max_depth=None, min_samples_split=5, n_estimators=50; total time=   3.6s
[CV] END max_depth=None, min_samples_split=2, n_estimators=50; total time=   3.9s
[CV] END max_depth=None, min_samples_split=5, n_estimators=50; total time=   3.7s
[CV] END max_depth=None, min_samples_split=5, n_estimators=50; total time=   3.7s
[CV] END max_depth=None, min_samples_split=2, n_estimators=100; total time=   6.7s
[CV] END max_depth=None, min_samples_split=2, n_estimators=100; total time=   7.0s
[CV] END max_depth=None, min_samples_split=2, n_estimators=100; total time=   7.0s
[CV] END max_depth=None, min_samples_split=10, n_estimators=50; total time=   3.1s
[CV] END max_depth=None, min_samples_split=5, n_estimators=100; total time=   6.1s
[CV] END max_depth=None, min_sa
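
Beyond `best_params_`, a fitted search object also exposes the cross-validated score of every candidate. The sketch below shows one way to inspect it. It uses a much smaller grid and a synthetic dataset (both assumptions, chosen so the block runs quickly on its own), and the names `demo_search` and `results` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the search finishes in a few seconds
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [3, None]},
    scoring="neg_mean_squared_error",  # explicit metric; higher (closer to 0) is better
    cv=3,
)
demo_search.fit(X_demo, y_demo)

print("Best parameters:", demo_search.best_params_)
print(f"Best cross-validated MSE: {-demo_search.best_score_:.2f}")

# One row per candidate, ordered by cross-validated rank
results = pd.DataFrame(demo_search.cv_results_)
cols = ["param_n_estimators", "param_max_depth", "mean_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```

Looking at the full table, rather than only the winner, shows whether the best candidate is clearly ahead or whether several settings perform about the same.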

### **8. Individual Work**

#### **Exercises**:
1. Experiment with the hyperparameters for **XGBoost** and **LightGBM**. Use `GridSearchCV` or `RandomizedSearchCV` to find optimal configurations.
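2. Try running the models on a different regression dataset, e.g. the Diabetes dataset (`sklearn.datasets.load_diabetes`) or any dataset of your choice. Note that the Boston Housing loader (`load_boston`) was removed in scikit-learn 1.2.
3. Analyze the training time of each model using the `time` module.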
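
As a possible starting point for exercise 2, the sketch below trains a Random Forest on the Diabetes dataset that ships with scikit-learn, so no download is needed. The variable names with the `alt` suffix or prefix are introduced here to avoid overwriting the lab's `X`, `y`, and model variables.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load features and target as pandas objects
X_alt, y_alt = load_diabetes(return_X_y=True, as_frame=True)

# Same split settings as in the main lab
X_alt_train, X_alt_test, y_alt_train, y_alt_test = train_test_split(
    X_alt, y_alt, test_size=0.2, random_state=42
)

alt_model = RandomForestRegressor(n_estimators=100, random_state=42)
alt_model.fit(X_alt_train, y_alt_train)
alt_pred = alt_model.predict(X_alt_test)

print(f"Dataset shape: {X_alt.shape}")
print(f"MSE: {mean_squared_error(y_alt_test, alt_pred):.2f}")
print(f"R²: {r2_score(y_alt_test, alt_pred):.4f}")
```

The same pattern applies to the XGBoost and LightGBM regressors; only the model constructor changes.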

In [14]:
print("XGBoost")
params_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
}

# Initialize GridSearchCV
grid_search_xgb = GridSearchCV(estimator=xgb_model, param_grid=params_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the model
grid_search_xgb.fit(X_train, y_train)

# Print the best parameters
print("Best parameters found: ", grid_search_xgb.best_params_)


XGBoost
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] END ...............learning_rate=0.01, n_estimators=100; total time=   0.2s
[CV] END ................learning_rate=0.1, n_estimators=100; total time=   0.2s
[CV] END ...............learning_rate=0.01, n_estimators=100; total time=   0.3s
[CV] END ...............learning_rate=0.01, n_estimators=100; total time=   0.3s
[CV] END ................learning_rate=0.1, n_estimators=100; total time=   0.2s
[CV] END ................learning_rate=0.1, n_estimators=100; total time=   0.2s
[CV] END ...............learning_rate=0.01, n_estimators=200; total time=   0.4s
[CV] END ...............learning_rate=0.01, n_estimators=200; total time=   0.4s
[CV] END ...............learning_rate=0.01, n_estimators=200; total time=   0.5s
[CV] END ................learning_rate=0.1, n_estimators=200; total time=   0.3s
[CV] END ................learning_rate=0.2, n_estimators=100; total time=   0.2s
[CV] END ................learning_rate=0.

In [16]:
print("LightGBM")
params_grid = {
    'n_estimators': [100, 200, 300 ],
    'learning_rate': [0.01, 0.1, 0.2],
}

# Initialize GridSearchCV
grid_search_lgb = GridSearchCV(estimator=lgb_model, param_grid=params_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the model
grid_search_lgb.fit(X_train, y_train)

# Print the best parameters
print("Best parameters found: ", grid_search_lgb.best_params_)


LightGBM
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001112 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1837
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001150 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1837
[LightGBM] [Info] Number of data points in the train set: 11008, number of used features: 8
[LightGBM] [Info] Number of data points in the train set: 11008, number of used features: 8
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001750 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1837
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001627 seconds.
You can set `force_col_wise=true` to remove the overhea

In [18]:
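# Compare the three models on the California Housing dataset
import time

import pandas as pd
import xgboost as xgb
import lightgbm as lgb
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score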

# Load the dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target, name="MedHouseVal")

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the models
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
xgb_model = xgb.XGBRegressor(n_estimators=300, learning_rate=0.1, random_state=42)
lgb_model = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.1, random_state=42)

models = [rf_model, xgb_model, lgb_model]
model_names = ['Random Forest', 'XGBoost', 'LightGBM']

for model, name in zip(models, model_names):
    # Train the model and measure time
    start_time = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start_time
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    print(f"{name}:")
    print(f"Training Time: {train_time:.4f} seconds")
    print(f"MSE: {mse:.4f}")
    print(f"R²: {r2:.4f}")
    print()




Random Forest:
Training Time: 6.3003 seconds
MSE: 0.2557
R²: 0.8049

XGBoost:
Training Time: 0.4447 seconds
MSE: 0.2085
R²: 0.8409

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000273 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 16512, number of used features: 8
[LightGBM] [Info] Start training from score 2.071947
LightGBM:
Training Time: 1.1075 seconds
MSE: 0.1939
R²: 0.8520



#### **Research Points**:
- How do the three models differ in terms of training time and performance?
- Why might LightGBM or XGBoost outperform Random Forest on certain datasets?
- How do the ensemble methods like Random Forest and boosting methods like XGBoost and LightGBM handle overfitting?

1. How do the three models differ in terms of training time and performance?

Training time (single run each):
- Random Forest: 6.3003 seconds
- XGBoost: 0.4447 seconds
- LightGBM: 1.1075 seconds

Test performance (MSE and R²):
- Random Forest: MSE 0.2557, R² 0.8049
- XGBoost: MSE 0.2085, R² 0.8409
- LightGBM: MSE 0.1939, R² 0.8520

The models differ significantly in training time: XGBoost is the fastest, LightGBM follows, and Random Forest takes roughly 14 times longer than XGBoost. In terms of performance, LightGBM slightly outperforms XGBoost, which in turn outperforms Random Forest. The R² differences are modest, but the ranking matches the earlier run with 100 estimators. Keep in mind that the boosting models here use 300 estimators while Random Forest uses 100, and that each time comes from a single run, so the comparison is indicative rather than exact.

2. Why might LightGBM or XGBoost outperform Random Forest on certain datasets?

LightGBM and XGBoost might outperform Random Forest for several reasons:

- Gradient Boosting: Both LightGBM and XGBoost use gradient boosting, which builds trees sequentially to correct errors from previous trees. This can lead to better performance on complex datasets.
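- Feature use: Because each new tree is fitted to the remaining error, its splits concentrate on the features that still explain that error. Both libraries also report feature importance (for example by gain or by split count), although this mainly helps interpretation rather than accuracy.
- Regularization: Both include built-in regularization, such as L1/L2 penalties on leaf weights and a learning rate that shrinks each tree's contribution, which can help prevent overfitting.
- Handling of categorical variables: LightGBM, in particular, can split on categorical features natively without one-hot encoding. This does not matter for California Housing, whose features are all numeric.
- Optimization: XGBoost uses second-order (gradient and Hessian) information when fitting each tree, and LightGBM uses histogram-based split finding, which speeds up training. Random Forest trees, by contrast, are grown independently and do not optimize a shared objective.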
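
The sequential error-correcting idea can be sketched in a few lines. The block below is a bare-bones gradient boosting loop for squared error built from scikit-learn decision trees. It is a teaching sketch under simplifying assumptions (toy 1-D data, no regularization, no subsampling), not how XGBoost or LightGBM are actually implemented.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (an assumption for this demo)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared loss, the negative gradient is the residual
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)  # the new tree targets the current errors
    prediction += learning_rate * tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

Each round shrinks the training error a little. This is also why boosting can overfit if run for too many rounds, whereas adding more trees to a Random Forest only stabilizes the average.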

3. How do the ensemble methods like Random Forest and boosting methods like XGBoost and LightGBM handle overfitting?

These ensemble methods handle overfitting in different ways:

Random Forest:
- Uses bagging (bootstrap aggregating) to create diverse trees
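- Each tree is trained on a bootstrap sample of the rows and can consider a random subset of features at each split (`max_features`; scikit-learn's `RandomForestRegressor` considers all features by default)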
- Aggregates predictions from multiple trees, reducing variance

XGBoost and LightGBM:
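- Use boosting to sequentially improve weak learners, with a learning rate that shrinks each tree's contribution
- Employ regularization techniques (L1, L2) to penalize complex models
- Can subsample rows and features for each tree to introduce randomness
- Support early stopping on a validation set, which halts training once the validation score stops improving
- LightGBM grows trees leaf-wise rather than level-wise, which often lowers the loss faster but can overfit small datasets, so `num_leaves` and `max_depth` should be limited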
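
Subsampling and early stopping can be illustrated with scikit-learn's own `GradientBoostingRegressor`, used here as a stand-in because the early-stopping arguments of XGBoost and LightGBM differ between library versions. The synthetic dataset and the specific parameter values are assumptions chosen for a quick demo.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data so the block runs on its own
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)

gbr = GradientBoostingRegressor(
    n_estimators=500,         # upper bound on the number of boosting rounds
    learning_rate=0.1,        # shrinks each tree's contribution
    subsample=0.8,            # each tree sees a random 80% of the rows
    max_features=0.8,         # each split considers a random 80% of the features
    validation_fraction=0.2,  # held-out share used to monitor the score
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gbr.fit(X_demo, y_demo)

print(f"Boosting rounds actually used: {gbr.n_estimators_} of {gbr.n_estimators}")
```

With early stopping enabled, `n_estimators` acts as a ceiling rather than a fixed count, which removes one hyperparameter from the tuning grid.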

All three methods benefit from ensemble learning, which generally helps to reduce overfitting by combining multiple models. However, the boosting methods (XGBoost and LightGBM) often require more careful tuning of hyperparameters to prevent overfitting, while Random Forest is generally more robust out-of-the-box.

This notebook provides a hands-on approach to comparing **Random Forest**, **XGBoost**, and **LightGBM** for a regression task. It introduces the core concepts and provides insights into their performance, with opportunities for deeper exploration.