# Supervised Learning → Random Forest Regression

As with other regression models in this project,
the first sections focus on data preparation
and are intentionally repeated.

This ensures that each notebook can be read
and used independently, without external references.

1. Project setup and common pipeline
2. Dataset loading
3. Train-test split
4. Feature scaling (pipeline consistency)

----------------------------------

5. What is this model? (Intuition)
6. Model training
7. Model behavior and key hyperparameters
8. Predictions
9. Model evaluation
10. When to use it and when not to
11. Model persistence
12. Mathematical formulation (deep dive)

-----------------------------------------------------

## How this notebook should be read

This notebook is designed to be read **top to bottom**.

Before every code cell, you will find a short explanation describing:
- what we are about to do
- why this step is necessary
- how it fits into the overall process

The goal is not just to run the code,
but to understand what is happening at each step
and be able to adapt it to your own data.

-----------------------------------------------------

## What is Random Forest Regression?

Random Forest Regression is a powerful and flexible model
that works very differently from both Linear Regression and KNN.

Instead of learning a single global rule
or relying on nearby data points,
Random Forest builds **many decision trees**
and combines their predictions.

Each tree:
- looks at the data in a slightly different way
- makes its own prediction

The final prediction is obtained by
averaging the predictions of all trees.

By combining many simple models,
Random Forest is able to capture complex and non-linear patterns
that simpler models cannot.
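
The averaging step described above can be checked directly.
The following is a minimal sketch, not part of the project pipeline:
it uses a small synthetic dataset from `make_regression`
(an assumption made only to keep the example fast and self-contained)
and compares a manual average over the individual trees
with the prediction returned by the forest.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic dataset (illustrative only, not the notebook's data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Predictions of every individual tree, shape (n_trees, n_samples)
per_tree = np.array([tree.predict(X_demo) for tree in forest.estimators_])

# Averaging over the trees should reproduce the forest's own prediction
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_demo)

averages_match = np.allclose(manual_average, forest_prediction)
print("Manual average matches forest.predict:", averages_match)
```

The fitted trees are available through the `estimators_` attribute,
so each one can be inspected or queried on its own.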

-----------------------------------------------------

## Why we start with intuition

Random Forest can look complex at first,
but the core idea is simple.

Instead of trusting a single model,
Random Forest asks the same question many times,
each time with a slightly different perspective.

Each tree may make mistakes,
but when many trees agree,
the final prediction becomes more reliable.

Understanding this idea of **many imperfect models working together**,
whose individual errors partly cancel out when averaged,
is key to understanding how Random Forest works.
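
To make this intuition concrete, here is a small simulation.
It is a simplified sketch, not a Random Forest:
each "tree" is modeled as the true value plus independent random noise,
which is an assumption made purely for illustration.
Real trees are trained on overlapping data, so their errors are correlated
and the improvement is usually smaller than in this idealized case.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 3.0   # the quantity every "tree" tries to predict
n_trees = 100      # number of noisy predictors being averaged
n_repeats = 2000   # how many times the experiment is repeated

# Each row is one experiment; each column is one noisy predictor.
# Assumption: errors are independent with standard deviation 1.0.
noisy_predictions = true_value + rng.normal(0.0, 1.0, size=(n_repeats, n_trees))

# Spread of the error for a single predictor vs. the average of all of them
single_error = np.std(noisy_predictions[:, 0] - true_value)
averaged_error = np.std(noisy_predictions.mean(axis=1) - true_value)

print(f"Typical error of one predictor: {single_error:.3f}")
print(f"Typical error of the average:   {averaged_error:.3f}")
```

The averaged prediction is noticeably more stable
than any single predictor,
which is the effect Random Forest relies on.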

-----------------------------------------------------

## What you should expect from the results

Before using Random Forest Regression,
it is important to set expectations.

With Random Forest Regression, you should expect:
- strong performance on complex and non-linear data
- robust predictions even in the presence of noise
- less sensitivity to individual outliers

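Compared to simpler models:
- Random Forest often outperforms Linear Regression when the relationships in the data are non-linear
- it tends to be more stable than KNN on large datasets, because it does not depend on distances between points
- it requires less manual feature engineering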

However, this power comes at a cost:
- the model is harder to interpret
- training and prediction are more computationally expensive


# ____________________________________
## 1. Project setup and common pipeline

In this section we set up the common pipeline
used across all regression models in this project.

This part of the notebook does not depend on the model itself
and is intentionally kept consistent to:
- ensure fair comparison between models
- reduce implementation errors
- focus on understanding model behavior


In [3]:
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score
)

import joblib


### Note on feature scaling for Random Forest

Random Forest does **not rely on distances**
or linear combinations of features.

For this reason, feature scaling is **not required**
for the model to work correctly.

However, we keep feature scaling in the pipeline to:
- maintain consistency across all regression notebooks
- simplify comparisons between models
- avoid changing preprocessing steps when switching models



# ____________________________________
## 2. Dataset loading

In this section we load the dataset that will be used
to train and evaluate the Random Forest Regression model.

We use the same dataset as in the other regression notebooks
to allow direct comparison between different models.


In [4]:
# Load the dataset

data = fetch_california_housing(as_frame=True)

X = data.data
y = data.target


### Inputs and target

- `X` contains the input features
- `y` contains the continuous target variable

At this stage:
- no modeling is performed
- we are only defining what the model will learn from


# ____________________________________
## 3. Train-test split

In this section we split the dataset into
training and test sets.

This allows us to evaluate how well the model
generalizes to unseen data.


In [5]:
# Split data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)


### Why this step is important

A model should not be evaluated on the same data
it was trained on.

By separating the data:
- the training set is used to build the model
- the test set is used only for evaluation

This way, performance metrics give a more realistic estimate
of how the model behaves on data it has never seen,
rather than how well it memorized the training set.


### Consistency across models

We use the same split configuration
(`test_size=0.2`, `random_state=42`)
as in the other regression notebooks.

This guarantees that performance differences
are due to the model itself
and not to different data splits.
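
The role of `random_state` can be verified with a quick check.
This is a minimal sketch on hypothetical stand-in data
(100 rows identified by their index, not the housing dataset):
two separate calls with the same configuration
should select exactly the same rows for the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows identified by their index
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)

# Two independent calls with the same configuration
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# train_test_split returns X_train, X_test, y_train, y_test in that order
X_test_a, X_test_b = split_a[1], split_b[1]
same_test_rows = np.array_equal(X_test_a, X_test_b)

print("Test set size:", len(X_test_a))
print("Same rows in both test sets:", same_test_rows)
```

Changing or omitting `random_state` would produce a different split,
and model comparisons across notebooks would no longer be like for like.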


# ____________________________________
## 4. Feature scaling (pipeline consistency)

In this section we apply feature scaling
to the input features.

Although Random Forest does not rely on distances
or linear combinations of features,
we keep feature scaling in the pipeline
to maintain consistency across models.


### Is feature scaling required for Random Forest?

Strictly speaking:
- Random Forest **does not require** feature scaling

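Decision trees:
- split each feature by comparing its values to a threshold
- depend only on the ordering of values within a feature, which standardization preserves
- are therefore unaffected by feature scale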

However, scaling is still applied here
for consistency and comparability.
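
The claim that scaling does not matter for this model
can be tested with a small experiment.
The sketch below uses synthetic data with deliberately mismatched feature scales
(the dataset and the scale factors are assumptions for illustration)
and trains two forests with the same seed,
one on raw features and one on standardized features.
The predictions are expected to agree up to floating-point effects.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales (illustrative only)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=4, noise=5.0, random_state=0
)
X_demo = X_demo * np.array([1.0, 1000.0, 10.0, 50.0])

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Same hyperparameters and seed, so the only difference is the scaling
forest_raw = RandomForestRegressor(n_estimators=30, random_state=0)
forest_raw.fit(X_demo, y_demo)

forest_scaled = RandomForestRegressor(n_estimators=30, random_state=0)
forest_scaled.fit(X_demo_scaled, y_demo)

pred_raw = forest_raw.predict(X_demo)
pred_scaled = forest_scaled.predict(X_demo_scaled)

mean_abs_diff = np.mean(np.abs(pred_raw - pred_scaled))
print(f"Mean absolute difference between predictions: {mean_abs_diff:.6f}")
print(f"Standard deviation of the target:             {np.std(y_demo):.2f}")
```

A difference that is negligible compared to the spread of the target
indicates that scaling neither helps nor harms the model.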



### Why we keep the same pipeline across models

In practice, once a problem is identified as a regression task,
it is common to try multiple models
and compare their performance on the same data.

For this reason:
- the dataset remains identical
- the train-test split remains identical
- the preprocessing pipeline remains identical

Keeping the same pipeline allows us to:
- compare models fairly
- attribute performance differences to the model itself
- switch models without changing data preparation steps

Even if a model does not strictly require scaling,
including it in the pipeline ensures consistency
and simplifies model comparison.


### Why we still apply scaling

We apply feature scaling to:
- keep the preprocessing pipeline identical
- simplify switching between models
- avoid conditional logic in the code

This makes the project easier to maintain
and easier to extend.


### Important rule: fit only on training data

As with other preprocessing steps:
- the scaler is fitted on training data only
- the same scaler is applied to test data

This prevents data leakage
and ensures a fair evaluation.


In [6]:
# Feature scaling (kept for pipeline consistency)

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


### What we have after this step

- scaled training data
- scaled test data
- a complete and consistent preprocessing pipeline

At this point, the data is ready
to be used by the Random Forest model.
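
A quick sanity check helps confirm that the scaler behaved as intended.
The sketch below uses hypothetical stand-in data
(random values, with `_demo` names chosen so they do not overwrite the notebook's variables).
Because the scaler is fitted on the training set only,
the training features end up with mean 0 and standard deviation 1,
while the test features are only approximately centered.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 500 rows, 3 features on different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(
    loc=[10.0, 200.0, -5.0], scale=[2.0, 50.0, 0.5], size=(500, 3)
)

X_train_demo, X_test_demo = train_test_split(
    X_demo, test_size=0.2, random_state=42
)

scaler_demo = StandardScaler()
X_train_demo_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean and std
X_test_demo_scaled = scaler_demo.transform(X_test_demo)        # reuses them

train_means = X_train_demo_scaled.mean(axis=0)
train_stds = X_train_demo_scaled.std(axis=0)
test_means = X_test_demo_scaled.mean(axis=0)

print("Train means (should be ~0):", np.round(train_means, 3))
print("Train stds  (should be ~1):", np.round(train_stds, 3))
print("Test means  (close to 0, not exactly):", np.round(test_means, 3))
```

Small deviations in the test statistics are expected and harmless:
they show that no information from the test set
was used when fitting the scaler.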


# ____________________________________
## 5. What is this model? (Random Forest Regression)

Before training the model, it is important to understand
what Random Forest Regression is trying to do conceptually.

Random Forest Regression is an **ensemble model**,
meaning it combines the predictions of many simpler models
to produce a final result.
