# Supervised Learning → Random Forest Regression

As with other regression models in this project,
the first sections focus on data preparation
and are intentionally repeated.

This ensures that each notebook can be read
and used independently, without external references.

1. Project setup and common pipeline
2. Dataset loading
3. Train-test split
4. Feature scaling (pipeline consistency)

----------------------------------

5. What is this model? (Intuition)
6. Model training
7. Model behavior and key hyperparameters
8. Predictions
9. Model evaluation
10. When to use it and when not to
11. Model persistence
12. Mathematical formulation (deep dive)

-----------------------------------------------------

## How this notebook should be read

This notebook is designed to be read **top to bottom**.

Before every code cell, you will find a short explanation describing:
- what we are about to do
- why this step is necessary
- how it fits into the overall process

The goal is not just to run the code,
but to understand what is happening at each step
and be able to adapt it to your own data.

-----------------------------------------------------

## What is Random Forest Regression?

Random Forest Regression is a powerful and flexible model
that works very differently from both Linear Regression and KNN.

Instead of learning a single global rule
or relying on nearby data points,
Random Forest builds **many decision trees**
and combines their predictions.

Each tree:
- looks at the data in a slightly different way
- makes its own prediction

The final prediction is obtained by:
- averaging the predictions of all trees

By combining many decision trees,
Random Forest can capture complex, non-linear patterns
that a single linear model or a single tree tends to miss.

-----------------------------------------------------

## Why we start with intuition

Random Forest can look complex at first,
but the core idea is simple.

Instead of trusting a single model,
Random Forest asks the same question many times,
each time with a slightly different perspective.

Each tree may make mistakes,
but when many trees agree,
the final prediction becomes more reliable.

Understanding this idea of **many imperfect models working together**
is key to understanding how Random Forest works.

-----------------------------------------------------

## What you should expect from the results

Before using Random Forest Regression,
it is important to set expectations.

With Random Forest Regression, you should expect:
- strong performance on complex and non-linear data
- robust predictions even in the presence of noise
- less sensitivity to individual outliers

Compared to simpler models:
- Random Forest often outperforms Linear Regression when relationships are non-linear
- it tends to be more stable than KNN on large datasets
- it requires less manual feature engineering

However, this power comes at a cost:
- the model is harder to interpret
- training and prediction are more computationally expensive


# ____________________________________
## 1. Project setup and common pipeline

In this section we set up the common pipeline
used across all regression models in this project.

This part of the notebook does not depend on the model itself
and is intentionally kept consistent to:
- ensure fair comparison between models
- reduce implementation errors
- focus on understanding model behavior


In [11]:
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score
)

from pathlib import Path
import joblib


### Note on feature scaling for Random Forest

Random Forest does **not rely on distances**
or linear combinations of features.

For this reason, feature scaling is **not required**
for the model to work correctly.

However, we keep feature scaling in the pipeline to:
- maintain consistency across all regression notebooks
- simplify comparisons between models
- avoid changing preprocessing steps when switching models



# ____________________________________
## 2. Dataset loading

In this section we load the dataset that will be used
to train and evaluate the Random Forest Regression model.

We use the same dataset as in the other regression notebooks
to allow direct comparison between different models.


In [12]:
# Load the dataset

data = fetch_california_housing(as_frame=True)

X = data.data
y = data.target


### Inputs and target

- `X` contains the input features
- `y` contains the continuous target variable

At this stage:
- no modeling is performed
- we are only defining what the model will learn from


# ____________________________________
## 3. Train-test split

In this section we split the dataset into
training and test sets.

This allows us to evaluate how well the model
generalizes to unseen data.


In [13]:
# Split data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)


### Why this step is important

A model should not be evaluated on the same data
it was trained on.

By separating the data:
- the training set is used to build the model
- the test set is used only for evaluation

This ensures that performance metrics
reflect real-world behavior.


### Consistency across models

We use the same split configuration
as in the other regression notebooks.

This guarantees that performance differences
are due to the model itself
and not to different data splits.


# ____________________________________
## 4. Feature scaling (pipeline consistency)

In this section we apply feature scaling
to the input features.

Although Random Forest does not rely on distances
or linear combinations of features,
we keep feature scaling in the pipeline
to maintain consistency across models.


### Is feature scaling required for Random Forest?

Strictly speaking:
- Random Forest **does not require** feature scaling

Decision trees:
- split features based on thresholds
- are invariant to feature scale

However, scaling is still applied here
for consistency and comparability.



### Why we keep the same pipeline across models

In practice, once a problem is identified as a regression task,
it is common to try multiple models
and compare their performance on the same data.

For this reason:
- the dataset remains identical
- the train-test split remains identical
- the preprocessing pipeline remains identical

Keeping the same pipeline allows us to:
- compare models fairly
- attribute performance differences to the model itself
- switch models without changing data preparation steps

Even if a model does not strictly require scaling,
including it in the pipeline:
- keeps preprocessing identical across notebooks
- avoids conditional logic in the code
- makes the project easier to maintain and extend


### Important rule: fit only on training data

As with other preprocessing steps:
- the scaler is fitted on training data only
- the same scaler is applied to test data

This prevents data leakage
and ensures a fair evaluation.


In [14]:
# Feature scaling (kept for pipeline consistency)

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


### What we have after this step

- scaled training data
- scaled test data
- a complete and consistent preprocessing pipeline

At this point, the data is ready
to be used by the Random Forest model.
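
One way to make the "fit on training data only" rule hard to break is to wrap the scaler and the model in a scikit-learn pipeline. The sketch below is an illustration, not this project's setup: it assumes a small synthetic dataset from `make_regression` instead of the California housing data, and the names `pipe`, `X_demo`, `y_demo`, `X_tr`, and `X_te` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on whatever data is passed to fit(),
# then reuses those statistics when predicting or scoring
pipe = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=42),
)

pipe.fit(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)
print(f"Test R^2: {test_r2:.3f}")
```

Because `fit` only ever sees the training rows, the test set cannot leak into the scaling statistics.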


# ____________________________________
## 5. What is this model? (Random Forest Regression)

Before training the model, it is important to understand
what Random Forest Regression is trying to do conceptually.

Random Forest Regression is an **ensemble model**,
meaning it combines the predictions of many simpler models
to produce a final result.


### The core idea

Instead of relying on a single model,
Random Forest builds **many decision trees**.

Each tree:
- is trained on a bootstrap sample of the training data (rows drawn with replacement)
- can be restricted to a random subset of features at each split (controlled by `max_features`)
- makes its own prediction

The final prediction is obtained
by averaging the predictions of all trees.
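
The averaging step can be checked directly. The following sketch assumes a small synthetic dataset and hypothetical names (`forest`, `per_tree`, `manual_average`); it relies on the fitted forest exposing its individual trees through the `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=4, noise=5.0, random_state=0
)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Predictions of every individual tree, shape (n_trees, n_samples)
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])

# Averaging over the tree axis should reproduce the forest's own prediction
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_demo[:5])

print(np.allclose(manual_average, forest_prediction))
```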


### Why using many trees helps

Individual decision trees:
- are easy to understand
- but tend to overfit the data

Random Forest reduces this problem by:
- training many trees
- making them as different as possible
- combining their predictions

Errors made by individual trees
tend to cancel out when averaged.
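
To see this effect in practice, one can compare a single fully grown tree with a forest on the same split. This is a minimal sketch under stated assumptions: it uses the synthetic `make_friedman1` benchmark rather than the notebook's data, and the names `single_tree`, `forest`, `tree_r2`, and `forest_r2` are hypothetical.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy non-linear synthetic data (assumption: stand-in for a real dataset)
X_demo, y_demo = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# One unconstrained tree versus an average of 100 such trees
single_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

tree_r2 = single_tree.score(X_te, y_te)
forest_r2 = forest.score(X_te, y_te)

print(f"Single tree test R^2: {tree_r2:.3f}")
print(f"Forest test R^2:      {forest_r2:.3f}")
```

On noisy data like this, the single tree typically fits the training set almost perfectly but scores noticeably lower on the test set than the forest.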


### Key takeaway

Random Forest Regression does not try to find a single rule.

Instead, it asks:
"What would many different decision trees predict?"

By combining these answers,
it produces robust and flexible predictions
on complex regression problems.


# ____________________________________
## 6. Model training (Random Forest Regression)

In this section we train the Random Forest Regression model.

Unlike KNN, Random Forest performs real training:
it builds multiple decision trees and learns patterns from the data.


### Important hyperparameters (introduced, not tuned)

At this stage we focus on understanding the model,
not on optimizing it.

We start with a simple configuration and default values.
Hyperparameter tuning can be explored later.



In [15]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the Random Forest Regressor
rf_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

# Train the model
rf_model.fit(X_train_scaled, y_train)


| Parameter | Value |
|---|---|
| `n_estimators` | 100 |
| `criterion` | 'squared_error' |
| `max_depth` | None |
| `min_samples_split` | 2 |
| `min_samples_leaf` | 1 |
| `min_weight_fraction_leaf` | 0.0 |
| `max_features` | 1.0 |
| `max_leaf_nodes` | None |
| `min_impurity_decrease` | 0.0 |
| `bootstrap` | True |


### What these parameters mean

- `n_estimators=100`  
  Number of decision trees in the forest.  
  More trees usually improve stability, but increase computation.

- `random_state=42`  
  Fixes the random seed used when building the trees, so results are reproducible.

- `n_jobs=-1`  
  Uses all available CPU cores to speed up training.


### What we have after training

After this step:
- multiple decision trees have been trained
- each tree has learned different patterns
- the forest is ready to make predictions

Unlike KNN:
- training is computationally heavier
- prediction is relatively fast


# ____________________________________
## 7. Model behavior and key hyperparameters (Random Forest Regression)

In this section we describe how Random Forest behaves
and which hyperparameters most strongly influence its predictions.

Random Forest does not produce simple, interpretable parameters,
but its behavior can still be understood at a high level.


### Number of trees (`n_estimators`)

The number of trees controls:
- model stability
- variance reduction

General behavior:
- few trees → higher variance, less stable predictions
- many trees → more stable, smoother predictions

Beyond a certain point,
adding more trees provides diminishing returns.
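
A quick way to observe the diminishing returns is to refit the forest with increasing numbers of trees and watch the test score. The sketch below assumes synthetic `make_friedman1` data and a hypothetical `scores` dictionary; the exact numbers will differ on other datasets.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Test R^2 for forests of increasing size
scores = {}
for n_trees in [1, 10, 50, 200]:
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    model.fit(X_tr, y_tr)
    scores[n_trees] = model.score(X_te, y_te)
    print(f"n_estimators={n_trees:>3}: test R^2 = {scores[n_trees]:.3f}")
```

The jump from 1 to 10 trees is usually large, while the gain from 50 to 200 tends to be small compared with the extra training time.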


### Tree depth and complexity

Each tree in the forest can grow deep and complex.

Key parameters that control tree complexity include:
- maximum tree depth
- minimum number of samples per split
- minimum number of samples per leaf

More complex trees:
- capture fine details
- risk overfitting

Simpler trees:
- generalize better
- may underfit
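
The trade-off between simple and complex trees can be made visible by comparing training and test scores for different depths. This is a sketch on synthetic data with hypothetical names (`results`, `X_tr`, `X_te`), not a tuning recipe; `max_depth=None` lets each tree grow without a depth limit.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Store (train R^2, test R^2) for each depth setting
results = {}
for depth in [2, 6, None]:
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    results[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(
        f"max_depth={depth}: "
        f"train R^2={results[depth][0]:.3f}, test R^2={results[depth][1]:.3f}"
    )
```

A large gap between training and test scores points to overfitting, while low scores on both suggest the trees are too shallow.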


### Key takeaway

Random Forest behavior is controlled by:
- how many trees are used
- how complex each tree is
- how much randomness is introduced (bootstrap sampling and `max_features`)

Understanding these elements helps explain
why Random Forest often performs well
without extensive tuning.


# ____________________________________
## 8. Predictions (Random Forest Regression)

In this section we use the trained Random Forest model
to generate predictions on unseen data.

This step shows how the model behaves in practice
after the training phase is complete.


In [16]:
# Generate predictions on the test set

y_pred_rf = rf_model.predict(X_test_scaled)


### What we obtained

- `y_pred_rf` contains the predicted target values
- each prediction is an average over all trees, which makes predictions stable
- individual outliers have limited influence, but predictions cannot go beyond the range of target values seen during training


# ____________________________________
## 9. Model evaluation (Random Forest Regression)

In this section we evaluate the performance of the Random Forest Regression model
by comparing its predictions with the true target values.

Using the same evaluation metrics across models
allows direct and fair comparison.


In [17]:
# Compute evaluation metrics for Random Forest Regression

mae = mean_absolute_error(y_test, y_pred_rf)
mse = mean_squared_error(y_test, y_pred_rf)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_rf)

mae, mse, rmse, r2


(0.3274252027374033,
 0.255169737347244,
 np.float64(0.5051432839771741),
 0.8052747336256919)

### Metrics interpretation

- **MAE (Mean Absolute Error)**  
  Average absolute difference between predictions and true values.

- **MSE (Mean Squared Error)**  
  Penalizes large errors more heavily.

- **RMSE (Root Mean Squared Error)**  
  Expresses prediction error in the same units as the target variable
  and is often the most informative metric for model comparison.

- **R² score**  
  Indicates how much variance in the target variable
  is explained by the model.
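
To make the definitions concrete, the four metrics can be computed by hand on a tiny example and compared with scikit-learn's functions. The arrays `y_true` and `y_hat` are made-up illustration values, and the `_manual` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values, chosen only to keep the arithmetic easy to follow
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_hat  # [0.5, 0.0, -1.5, -1.0]

mae_manual = np.mean(np.abs(errors))   # (0.5 + 0 + 1.5 + 1) / 4 = 0.75
mse_manual = np.mean(errors ** 2)      # (0.25 + 0 + 2.25 + 1) / 4 = 0.875
rmse_manual = np.sqrt(mse_manual)      # square root of the MSE
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The manual values should agree with scikit-learn's implementations
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```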


### What to expect from Random Forest results

With Random Forest Regression, you will often observe:
- lower RMSE compared to simpler models
- better handling of non-linear patterns
- improved robustness to noise and outliers

However:
- improvements come at the cost of interpretability
- training time and memory usage are higher


# ____________________________________
## 10. When to use it and when not to (Random Forest Regression)

Random Forest Regression is often a strong default choice,
but it is not always the best solution.

Understanding when to use it
helps avoid unnecessary complexity and inefficiency.


### When Random Forest Regression is a good choice

Random Forest Regression works well when:

- The relationship between features and target is non-linear
- The data contains complex interactions between features
- The dataset is of small to medium size
- Robust performance is more important than interpretability
- You want strong results without extensive feature engineering

In many real-world problems,
Random Forest provides a strong balance
between accuracy and reliability.


### When Random Forest Regression is NOT a good choice

Random Forest Regression may not be ideal when:

- Interpretability is a strict requirement
- The dataset is extremely large
- Memory usage is a concern
- Very fast prediction time is required
- A simple linear relationship already explains the data well

In these cases,
simpler or more specialized models may be preferable.


### Typical warning signs

You should be cautious if:

- Training time becomes very long
- The model uses excessive memory
- Performance gains over simpler models are marginal
- Model behavior is difficult to explain to stakeholders

These signals suggest that Random Forest
may not be the most efficient choice.


# ____________________________________
## 11. Model persistence (Random Forest Regression)

In this section we save the trained Random Forest model
and the preprocessing steps used during training.

Saving the model allows us to reuse it
without retraining and ensures reproducibility.


### Why saving the model is important

Training a Random Forest can be computationally expensive,
especially when many trees are used.

Once the model has been trained and evaluated,
it is common practice to save it and reuse it later:
- in another notebook
- in an application
- in a production environment


### Important rule: save the scaler together with the model

Even though Random Forest does not require feature scaling,
we still save the scaler.

This ensures that:
- the same preprocessing pipeline is applied
- models can be swapped without changing the input pipeline
- results remain consistent across experiments


In [18]:
# Define model directory
model_dir = Path("models/supervised_learning/regression/random_forest_regression")

# Create directory if it does not exist
model_dir.mkdir(parents=True, exist_ok=True)

# Save model and scaler
joblib.dump(rf_model, model_dir / "random_forest_regression_model.joblib")
joblib.dump(scaler, model_dir / "scaler.joblib")


['models\\supervised_learning\\regression\\random_forest_regression\\scaler.joblib']

### Loading the model later (conceptual example)

To reuse the model:
- load the scaler
- scale new input data
- load the Random Forest model
- generate predictions

This guarantees consistency
with the original training pipeline.
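
The steps above can be sketched as a save-and-load round trip. To stay self-contained, this example trains a small model on synthetic data and writes to a temporary directory; the file names `model.joblib` and `scaler.joblib` and all `_demo` names are hypothetical. With this notebook's artifacts, the same `joblib.load` calls would point at the files saved under `model_dir`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Train a small stand-in model (assumption: synthetic data, not the notebook's)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=4, noise=5.0, random_state=0
)
scaler_demo = StandardScaler().fit(X_demo)
model_demo = RandomForestRegressor(n_estimators=20, random_state=0)
model_demo.fit(scaler_demo.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    joblib.dump(model_demo, demo_dir / "model.joblib")
    joblib.dump(scaler_demo, demo_dir / "scaler.joblib")

    # 1. load the scaler, 2. load the model
    loaded_scaler = joblib.load(demo_dir / "scaler.joblib")
    loaded_model = joblib.load(demo_dir / "model.joblib")

# 3. scale new input data with the loaded scaler, 4. generate predictions
X_new = X_demo[:3]
new_predictions = loaded_model.predict(loaded_scaler.transform(X_new))
print(new_predictions)
```

Files saved with `joblib` should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.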


# ____________________________________
## 12. Mathematical formulation (deep dive)

This section provides a deeper explanation of Random Forest Regression
from a mathematical and algorithmic perspective.

The goal is to understand the principles behind the model,
not to derive every formula in detail.


### From decision trees to Random Forest

Random Forest Regression is built on top of **decision trees**.

A single decision tree:
- recursively splits the feature space
- creates regions where predictions are constant
- predicts the average target value in each region

While individual trees are simple,
they tend to overfit the training data.
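
The "constant prediction per region" behavior is easiest to see with a tree that makes a single split. The sketch below assumes a one-dimensional synthetic step function with noise; the names `stump`, `threshold`, `left_mean`, and `right_mean` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# One feature, target jumps from about 1 to about 3 at x = 5 (synthetic assumption)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=100)).reshape(-1, 1)
y = np.where(x.ravel() < 5, 1.0, 3.0) + rng.normal(0, 0.2, size=100)

# A depth-1 tree splits the feature space into exactly two regions
stump = DecisionTreeRegressor(max_depth=1).fit(x, y)
threshold = stump.tree_.threshold[0]  # split point chosen at the root node

# Average target value on each side of the split
left_mean = y[x.ravel() <= threshold].mean()
right_mean = y[x.ravel() > threshold].mean()

predictions = stump.predict(x)
print(f"Split at x = {threshold:.2f}")
print(f"Distinct predicted values: {np.unique(predictions)}")
print(f"Region means: {left_mean:.3f}, {right_mean:.3f}")
```

The tree only ever outputs the two region means, which is why deeper trees are needed to follow finer structure in the data.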


### Introducing randomness

Random Forest reduces overfitting
by introducing randomness in two main ways:

1. **Bootstrap sampling**  
   Each tree is trained on a random subset of the training data
   sampled with replacement.

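2. **Feature subsampling**  
   At each split, only a random subset of features
   is considered. In scikit-learn this is controlled by `max_features`;
   with the `RandomForestRegressor` default of `1.0`, every feature is considered
   at every split, so the model trained in this notebook
   relies on bootstrap sampling alone.

These two sources of randomness
make individual trees less correlated.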
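
Both mechanisms can be imitated with a few lines of NumPy. This is a conceptual sketch, not scikit-learn's internal implementation; the sizes `n_samples`, `n_features`, and `max_features` are hypothetical illustration values.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# Bootstrap sampling: draw n_samples row indices with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Some rows appear several times, others not at all
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print(f"Distinct rows in the bootstrap sample: {unique_fraction:.1%}")

# Feature subsampling: candidate features for one split, drawn without replacement
n_features = 8
max_features = 3  # hypothetical setting
candidate_features = rng.choice(n_features, size=max_features, replace=False)
print(f"Features considered at this split: {np.sort(candidate_features)}")
```

For large samples, a bootstrap sample contains roughly 1 - 1/e, or about 63%, of the distinct rows, so every tree sees a noticeably different view of the training data.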


### Ensemble prediction

Let each tree produce a prediction ŷᵢ(x)
for an input x.

The Random Forest prediction is computed as:

- the average of all tree predictions

This averaging process:
- reduces variance
- stabilizes predictions
- improves generalization
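
Written as a formula, with $B$ denoting the number of trees (the `n_estimators` setting; the symbol $B$ is introduced here for illustration), the forest prediction is:

$$
\hat{y}_{\text{RF}}(x) = \frac{1}{B} \sum_{i=1}^{B} \hat{y}_i(x)
$$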


### Bias–variance perspective

Random Forest primarily addresses **variance**.

- Individual trees have low bias but high variance
- Averaging many trees keeps bias low
- Variance is reduced through aggregation

This explains why Random Forest often performs well
without heavy tuning.
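
A standard way to quantify the variance reduction, under the simplifying assumption that all $B$ tree predictions have the same variance $\sigma^2$ and the same pairwise correlation $\rho$ (symbols introduced here for illustration), is:

$$
\operatorname{Var}\!\left(\frac{1}{B} \sum_{i=1}^{B} \hat{y}_i(x)\right)
= \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term shrinks as more trees are added, while the first term remains. This is why adding trees gives diminishing returns, and why making trees less correlated (a lower $\rho$) matters as much as adding more of them.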


### Why scaling does not affect the math

Decision trees split data based on feature thresholds.

Because these splits depend on order, not magnitude:
- feature scaling does not change split decisions
- the mathematical behavior of the model remains unchanged

This explains why Random Forest does not require scaling,
even though we include it for pipeline consistency.
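
This claim can be checked empirically by training the same forest on raw and on standardized features. The sketch assumes synthetic data with deliberately mismatched feature scales; the names `rf_raw`, `rf_scaled`, and `scaler_demo` are hypothetical. Any difference that appears should be at the level of floating-point effects.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption), with features stretched to very different scales
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 0.1, 1000.0])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scaler_demo = StandardScaler().fit(X_tr)

# Same model settings and seed, once on raw and once on scaled features
rf_raw = RandomForestRegressor(n_estimators=50, random_state=0)
rf_raw.fit(X_tr, y_tr)
rf_scaled = RandomForestRegressor(n_estimators=50, random_state=0)
rf_scaled.fit(scaler_demo.transform(X_tr), y_tr)

pred_raw = rf_raw.predict(X_te)
pred_scaled = rf_scaled.predict(scaler_demo.transform(X_te))

max_diff = np.max(np.abs(pred_raw - pred_scaled))
print(f"Largest prediction difference: {max_diff:.6f}")
```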


### Final takeaway

Random Forest Regression replaces a single decision tree
with an ensemble of many deliberately varied trees.

By combining:
- randomness
- averaging
- decorrelated decision trees

it achieves strong performance on complex, non-linear problems,
while remaining conceptually simple and robust.
