## 1. Random Forest

### 1.1 Motivation

A single decision tree is easy to interpret and can model non-linear patterns, but it is often **high-variance**: small changes in the training data may lead to very different trees and unstable predictions.  
**Random Forest** addresses this by building **many** decision trees and combining their outputs, which typically improves generalization performance.

---

### 1.2 Core Idea (Ensemble of Trees)

Random Forest is an **ensemble method** based on the idea of **bagging (bootstrap aggregating)**:

1. Train many decision trees on **different bootstrap samples** of the training set.
    
2. At each split, each tree considers only a **random subset of features**.
    
3. Combine all trees:
    
    - **Classification:** majority vote
        
    - **Regression:** average prediction
        

This introduces randomness in both **data** and **features**, which reduces correlation among trees and reduces variance.
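The three steps above can be sketched in a few lines of NumPy. This is only an illustrative toy, not the `rice_ml` implementation: it uses depth-1 regression trees ("stumps") as a stand-in for full trees, and the helper names `fit_stump`, `predict_stump`, `fit_forest`, and `predict_forest` are hypothetical.

```python
import numpy as np

def fit_stump(X, y, feature_ids):
    """Fit a depth-1 regression tree, searching only the given feature columns."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for j in feature_ids:
        # Every unique value except the largest is a candidate threshold,
        # so both sides of the split are always non-empty.
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            l_mean, r_mean = y[left].mean(), y[~left].mean()
            sse = ((y[left] - l_mean) ** 2).sum() + ((y[~left] - r_mean) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t, l_mean, r_mean)
    if best is None:  # all candidate features were constant: predict the mean
        mean = y.mean()
        return (0, np.inf, mean, mean)
    return best[1:]

def predict_stump(stump, X):
    j, t, l_mean, r_mean = stump
    return np.where(X[:, j] <= t, l_mean, r_mean)

def fit_forest(X, y, n_estimators=25, max_features=1, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    forest = []
    for _ in range(n_estimators):
        rows = rng.integers(0, n, size=n)                         # step 1: bootstrap sample
        feats = rng.choice(d, size=max_features, replace=False)   # step 2: random feature subset
        forest.append(fit_stump(X[rows], y[rows], feats))
    return forest

def predict_forest(forest, X):
    # Step 3 (regression): average the predictions of all trees.
    return np.mean([predict_stump(s, X) for s in forest], axis=0)

# Synthetic data (made up for this sketch): only feature 0 carries signal.
rng = np.random.default_rng(1)
X_demo = rng.random((200, 3))
y_demo = 3.0 * (X_demo[:, 0] > 0.5) + rng.normal(0.0, 0.1, size=200)

forest = fit_forest(X_demo, y_demo, n_estimators=25, max_features=2, seed=0)
pred_demo = predict_forest(forest, X_demo)
print("forest MSE:", np.mean((y_demo - pred_demo) ** 2), "variance of y:", np.var(y_demo))
```

Because a stump has only one split, drawing the feature subset once per tree coincides with drawing it once per split; a full tree would redraw the subset at every node.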

---

### 1.3 Bootstrapping (Sampling with Replacement)

Given a training set of size $n$, a bootstrap sample is created by randomly drawing $n$ points **with replacement**.  
As a result:

- some samples appear multiple times
    
- some samples may not appear at all
    

Each tree is trained on its own bootstrap sample, so different trees see different training sets.
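A minimal sketch of drawing one bootstrap sample with NumPy is shown below. The training-set size and seed are arbitrary values chosen for illustration. It also checks the standard result that the probability of a given row never being drawn is $(1 - 1/n)^n$, which approaches $1/e \approx 0.368$ for large $n$.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for repeatability
n = 10_000                      # hypothetical training-set size

# Draw n row indices with replacement: this is one bootstrap sample.
boot_idx = rng.integers(0, n, size=n)

# Rows drawn at least once are "in-bag"; rows never drawn are "out-of-bag" (OOB).
in_bag = np.unique(boot_idx)
oob_fraction = 1.0 - in_bag.size / n

# Probability that a given row is never drawn in n draws.
expected_oob = (1.0 - 1.0 / n) ** n

print(f"observed OOB fraction: {oob_fraction:.3f}, expected: {expected_oob:.3f}")
```

So each tree typically sees only about 63% of the distinct training rows, which is one source of diversity among trees.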

---

### 1.4 Feature Subsampling (Random Subspace)

When splitting a node, instead of searching over all $d$ features, Random Forest only considers $m$ randomly selected features:

$$\large m = \begin{cases} \sqrt{d}, & \text{(common for classification)} \\ \log_2(d), & \text{(another common choice)} \\ d, & \text{(no feature randomness)} \end{cases}$$

This feature randomness further decorrelates trees, making the ensemble more robust.
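As a rough illustration, the choice of $m$ and the per-split feature draw could look like the following. The function name `n_candidate_features`, the rule strings, and the decision to round down and clamp to the range 1 to $d$ are assumptions of this sketch, not a description of any particular library.

```python
import math
import numpy as np

def n_candidate_features(d: int, rule: str = "sqrt") -> int:
    """Number of features m to consider at a split, given d total features.

    Rounding down and clamping to [1, d] are assumptions of this sketch.
    """
    if rule == "sqrt":
        m = math.isqrt(d)
    elif rule == "log2":
        m = int(math.log2(d))
    elif rule == "all":
        m = d
    else:
        raise ValueError(f"unknown rule: {rule!r}")
    return max(1, min(m, d))

rng = np.random.default_rng(0)
d = 9                                  # e.g. a dataset with nine features
m = n_candidate_features(d, "sqrt")    # isqrt(9) = 3

# Distinct column indices that one split is allowed to search over.
candidates = rng.choice(d, size=m, replace=False)
print(m, sorted(candidates.tolist()))
```

A fresh subset is drawn at every split, so different nodes of the same tree can look at different features.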

---

### 1.5 Prediction Rules

Assume we have $\large B$ trees $\large \{T_1, T_2, \dots, T_B\}$.

#### Regression

$$\large \hat{y}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$$

#### Classification

$$\large \hat{y}(x) = \text{mode}\{T_1(x), T_2(x), \dots, T_B(x)\}$$
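Both rules amount to a one-line aggregation over the per-tree outputs. The values below are made up purely to illustrate the arithmetic for a single input $x$.

```python
from collections import Counter
from statistics import fmean

# Hypothetical outputs of B = 4 regression trees for one input x.
tree_outputs_reg = [210.0, 190.0, 230.0, 170.0]
y_hat_reg = fmean(tree_outputs_reg)      # (210 + 190 + 230 + 170) / 4 = 200.0

# Hypothetical outputs of B = 5 classification trees for one input x.
tree_outputs_clf = ["high", "low", "high", "high", "low"]
# most_common(1) returns the label with the largest vote count;
# with Counter, ties go to the label that was seen first.
y_hat_clf = Counter(tree_outputs_clf).most_common(1)[0][0]   # "high" (3 votes vs. 2)

print(y_hat_reg, y_hat_clf)
```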

---

### 1.6 Why It Works (Bias–Variance Intuition)

- A **deep decision tree** typically has **low bias** but **high variance**.
    
- Random Forest reduces variance by averaging many noisy, high-variance models.
    

As long as individual trees are not perfectly correlated, averaging them yields a more stable predictor.
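One standard way to make this precise uses a simplifying assumption: at a fixed input $x$, every tree's prediction has the same variance $\sigma^2$, and every pair of trees has the same correlation $\rho$. The symbols $\sigma^2$ and $\rho$ are introduced here for illustration only. Under that assumption, the variance of the averaged prediction is:

$$\large \mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$

The second term shrinks as $B$ grows, while the first term is a floor set by the correlation between trees. This is why lowering $\rho$ through bootstrap sampling and feature subsampling helps in addition to simply adding more trees.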

---

### 1.7 Key Hyperparameters

- `n_estimators`: number of trees in the forest  
    (more trees → more stable, but slower)
    
- `max_depth`: maximum depth of each tree  
    (controls overfitting)
    
- `min_samples_split`: minimum samples required to split a node  
    (regularization)
    
- `max_features`: number of features considered at each split  
    (controls randomness and correlation)
    
- `bootstrap`: whether to use bootstrap sampling

## 2. Data Loading and Preprocessing

We use the Seoul Bike Sharing Demand dataset, which contains hourly records of bike rental counts along with weather and temporal features.
The task is a regression problem, where the goal is to predict the number of rented bikes in each hour.

- Each row corresponds to one hour.

- Target variable: `Rented Bike Count`.

- Features include temperature, humidity, wind speed, and other weather-related variables.

This dataset exhibits strong non-linear relationships, making it suitable for tree-based and ensemble methods.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv(
    "../../data/SeoulBikeData.csv",
    encoding="latin1"
)
df.head()

df.shape


(8760, 14)

For this example, we select a subset of numeric features to keep the focus on the learning algorithms rather than extensive feature engineering.

In [2]:
features = [
    "Hour",
    "Temperature(°C)",
    "Humidity(%)",
    "Wind speed (m/s)",
    "Visibility (10m)",
    "Dew point temperature(°C)",
    "Solar Radiation (MJ/m2)",
    "Rainfall(mm)",
    "Snowfall (cm)",
]

target = "Rented Bike Count"


These features are:

- continuous or ordinal,

- directly usable by tree-based models,

- known to influence bike rental demand.

For simplicity, we remove rows with missing values in the selected columns.

In [3]:
df_model = df[features + [target]].dropna()

X = df_model[features].values
y = df_model[target].values

X.shape, y.shape


((8760, 9), (8760,))

Tree-based models do not require feature scaling,
so no normalization or standardization is applied.

We split the dataset into training and testing sets to evaluate model generalization performance.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape


((7008, 9), (1752, 9))

- 80% of the data is used for training

- 20% is held out for testing

- A fixed random seed ensures reproducibility

We use **Mean Squared Error (MSE)** as the evaluation metric:

$$\large \text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$

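Because errors are squared, MSE penalizes large prediction errors more heavily than small ones. It is commonly used for regression tasks and is expressed in squared units of the target (here, squared bike counts).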
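For concreteness, the formula can be written directly in NumPy. The helper name `mse` and the toy values are chosen for illustration; the notebook itself uses `sklearn.metrics.mean_squared_error` later on.

```python
import numpy as np

def mse(y_true, y_pred) -> float:
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Toy values: the errors are -10, 10, -20, 0.
y_true_demo = [100, 200, 300, 400]
y_pred_demo = [110, 190, 320, 400]

print(mse(y_true_demo, y_pred_demo))  # (100 + 100 + 400 + 0) / 4 = 150.0
```

Note how the single error of 20 contributes as much as four errors of 10 would.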

## 3. Decision Tree Baseline

Decision trees are flexible and capable of modeling non-linear relationships, but they often suffer from high variance, which makes them a natural point of comparison for ensemble methods.

In [5]:
from rice_ml.decision_trees import DecisionTreeRegressor

# Initialize Decision Tree Regressor
dt = DecisionTreeRegressor(
    max_depth=6,
    min_samples_split=10,
)


- `max_depth` limits tree complexity and helps control overfitting

- `min_samples_split` prevents splits based on very few samples

These hyperparameters are chosen to provide a reasonable balance between bias and variance.

In [6]:
dt.fit(X_train, y_train)


<rice_ml.decision_trees.DecisionTreeRegressor at 0x1dbd0719850>

The model is trained using the training portion of the dataset.

We evaluate the model on the held-out test set using Mean Squared Error (MSE).

In [7]:
from sklearn.metrics import mean_squared_error

# Predict on test data
y_pred_dt = dt.predict(X_test)

# Compute MSE
dt_mse = mean_squared_error(y_test, y_pred_dt)
dt_mse


137778.64701923067

The MSE serves as a quantitative measure of prediction error.

Decision trees can fit complex patterns by recursively partitioning the feature space.
However, because the model structure depends heavily on the training data, a single tree is often sensitive to noise and may not generalize well to unseen samples.

This behavior motivates the use of Random Forest, which reduces variance by aggregating multiple decision trees trained on randomized data subsets.

## 4. Random Forest Regressor

Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions to reduce variance and improve generalization performance.

We train a Random Forest Regressor using our own implementation and compare its behavior with the single decision tree baseline.

In [8]:
from rice_ml.random_forest import RandomForestRegressor

# Initialize Random Forest Regressor
rf = RandomForestRegressor(
    n_estimators=100,
    max_depth=8,
    min_samples_split=10,
    max_features="sqrt",
    random_state=42,
)


Hyperparameter explanation:

- `n_estimators`: number of trees in the forest

- `max_depth`: maximum depth of each tree

- `min_samples_split`: minimum samples required to split a node

- `max_features`: number of features randomly selected at each split

- `random_state`: ensures reproducibility

These settings introduce randomness while controlling model complexity.

In [9]:
rf.fit(X_train, y_train)


<rice_ml.random_forest.RandomForestRegressor at 0x1dbd0719d00>

Each tree is trained on a bootstrap sample of the training data, and feature randomness is applied at each split.

We evaluate the Random Forest model using the same test set and evaluation metric as the baseline.

In [10]:
# Predict on test data
y_pred_rf = rf.predict(X_test)

# Compute MSE
rf_mse = mean_squared_error(y_test, y_pred_rf)
rf_mse


111629.14420404252

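Both models are scored on the same test split with the same metric, so the numbers are directly comparable. Note, however, that the forest's trees use `max_depth=8` while the baseline tree uses `max_depth=6`, so the comparison is not purely single tree versus ensemble.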

Compared to a single decision tree, Random Forest produces smoother predictions and is less sensitive to noise in the training data.
By averaging multiple de-correlated trees, the ensemble reduces variance while maintaining low bias.

This behavior is particularly beneficial for datasets with complex, non-linear relationships, such as bike rental demand influenced by weather conditions.

## 5. Model Comparison and Key Takeaways

We compare the Decision Tree Regressor and the Random Forest Regressor using the same training–testing split and evaluation metric.

The table below summarizes test-set performance in terms of **Mean Squared Error (MSE)**.

In [11]:
import pandas as pd

results = pd.DataFrame({
    "Model": ["Decision Tree", "Random Forest"],
    "Test MSE": [dt_mse, rf_mse]
})

results


|     | Model         | Test MSE      |
| --- | ------------- | ------------- |
| 0   | Decision Tree | 137778.647019 |
| 1   | Random Forest | 111629.144204 |


The Random Forest achieves a **lower test MSE** than the single decision tree (about 111,629 vs. 137,779), indicating improved generalization performance on this split.
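To put the gap in more interpretable terms, the short calculation below converts the two reported MSE values into a relative reduction and into root mean squared error (RMSE), which is in the same units as the target. The variable names are chosen for this sketch; the numbers are copied from the outputs above.

```python
import math

dt_mse_reported = 137778.64701923067   # single decision tree (test MSE from above)
rf_mse_reported = 111629.14420404252   # random forest (test MSE from above)

# Fraction by which the forest lowers the tree's test MSE.
relative_reduction = (dt_mse_reported - rf_mse_reported) / dt_mse_reported

# RMSE is the square root of MSE, i.e. a typical error in bikes per hour.
dt_rmse = math.sqrt(dt_mse_reported)
rf_rmse = math.sqrt(rf_mse_reported)

print(f"MSE reduction: {relative_reduction:.1%}")           # roughly 19%
print(f"RMSE: {dt_rmse:.1f} -> {rf_rmse:.1f} bikes/hour")   # roughly 371 -> 334
```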

The performance difference can be explained by the **bias–variance trade-off**:

- The decision tree has **low bias but high variance**, making it sensitive to fluctuations in the training data.

- The random forest reduces variance by averaging predictions from many de-correlated trees trained on different bootstrap samples.

As a result, Random Forest typically provides more stable and reliable predictions on unseen data.

| Aspect              | Decision Tree | Random Forest     |
| ------------------- | ------------- | ----------------- |
| Model complexity    | Single tree   | Ensemble of trees |
| Variance            | High          | Lower             |
| Robustness to noise | Low           | High              |
| Interpretability    | High          | Lower             |
| Generalization      | Moderate      | Strong            |

This comparison highlights the trade-off between interpretability and predictive performance.

- Random Forest achieves a noticeably lower test MSE than the single decision tree on this dataset (a reduction of roughly 19%).
    
- The improvement comes from **bootstrap aggregation** and **feature randomness**, which reduce model variance.
    
- Tree-based ensemble methods work well with minimal preprocessing and can capture complex non-linear relationships.
    
- Random Forest serves as a strong baseline for many real-world regression tasks.

This example demonstrates how ensemble learning can enhance model performance by combining many individually unstable, high-variance trees.  
Compared to a single decision tree, Random Forest provides more robust predictions and better generalization, making it a powerful tool for practical machine learning applications.