# Supervised Learning → Random Forest (Classification)

This notebook is part of the **ML-Methods** project.

As with the other classification notebooks,
the first sections focus on data preparation
and are intentionally repeated.

This ensures consistency across models
and allows fair comparison of results.

1. Project setup and common pipeline
2. Dataset loading
3. Train-test split
4. Feature scaling (why we do it)

----------------------------------

5. What is this model? (Intuition)
6. Model training
7. Model behavior and key parameters
8. Predictions
9. Model evaluation
10. When to use it and when not to
11. Model persistence
12. Mathematical formulation (deep dive)
13. Final summary – Code only

-----------------------------------------------------

## How this notebook should be read

This notebook is designed to be read **top to bottom**.

Before every code cell, you will find a short explanation describing:
- what we are about to do
- why this step is necessary
- how it fits into the overall process

The goal is not just to run the code,
but to understand what is happening at each step
and be able to adapt it to your own data.

-----------------------------------------------------

## What is Random Forest Classification?

Random Forest Classification is an **ensemble classification model**
that combines the predictions of many decision trees.

Instead of relying on a single classifier,
Random Forest trains each tree on a random bootstrap sample of the data,
considers a random subset of features at each split,
and aggregates the trees' decisions
to produce a final prediction.
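
The aggregation step described above can be sketched in a few lines. In scikit-learn, `RandomForestClassifier` combines its trees by averaging their predicted class probabilities and then picking the most probable class. The sketch below reproduces that by hand; the small forest size (`n_estimators=25`), the `random_state`, and the choice of the first five samples are illustrative assumptions, not settings taken from this notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Small forest so per-tree outputs are easy to inspect (illustrative values)
forest = RandomForestClassifier(n_estimators=25, random_state=42)
forest.fit(X, y)

sample = X[:5]

# Class probabilities from each fitted tree:
# shape (n_trees, n_samples, n_classes)
per_tree_proba = np.stack(
    [tree.predict_proba(sample) for tree in forest.estimators_]
)

# Average over the tree axis to get the ensemble probabilities
manual_proba = per_tree_proba.mean(axis=0)
forest_proba = forest.predict_proba(sample)

# The predicted class is the one with the highest averaged probability
manual_pred = forest.classes_[manual_proba.argmax(axis=1)]

print("probabilities match:", np.allclose(manual_proba, forest_proba))
print("manual prediction:  ", manual_pred)
print("forest prediction:  ", forest.predict(sample))
```

Averaging probabilities is a "soft" vote: a tree that is very sure about a class contributes more than a tree that is undecided, which differs slightly from counting one hard vote per tree.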

-----------------------------------------------------

## Why we start with intuition

Random Forest may look complex,
but its core idea is simple.

Each decision tree makes its own prediction.
Individual trees make mistakes,
but because each tree is trained on a different random sample,
their mistakes tend to differ,
and many trees together tend to agree
on the correct class.

Understanding this idea of
**many imperfect classifiers working together**
is key to understanding Random Forest.
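
To make this intuition concrete, the following sketch compares the test accuracy of each individual tree with the accuracy of the whole forest. The split settings (`test_size=0.2`, `stratify=y`), `n_estimators=100`, and `random_state=42` are assumptions chosen for illustration and may differ from the values used later in this notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set (split parameters are illustrative assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Accuracy of every single tree on the test set.
# Sub-estimators predict encoded class indices; these coincide with
# the labels here because the dataset's classes are 0 and 1.
tree_accs = np.array(
    [np.mean(tree.predict(X_test) == y_test) for tree in forest.estimators_]
)

# Accuracy of the aggregated forest on the same test set
forest_acc = forest.score(X_test, y_test)

print(f"single trees: min={tree_accs.min():.3f}, "
      f"mean={tree_accs.mean():.3f}, max={tree_accs.max():.3f}")
print(f"forest:       {forest_acc:.3f}")
```

On most runs the forest should score at or above the average single tree, although the exact numbers depend on the split and the random seed.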

-----------------------------------------------------

## What you should expect from the results

With Random Forest Classification, you should expect:
- strong performance on non-linear problems
- robustness to noise and outliers
- stable predictions without heavy tuning

However:
- interpretability is lower than for linear models
- training time and memory usage are higher



_________________________________________________

## 1. Project setup and common pipeline

In this section we set up the common pipeline
used across classification models in this project.

Although Random Forest does not require feature scaling
(trees split on feature thresholds, which rescaling does not change),
we keep scaling for pipeline consistency
and fair model comparison.

```python
# Common imports used across all classification models

import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
    ConfusionMatrixDisplay
)

from pathlib import Path
import joblib
import matplotlib.pyplot as plt
```
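
The earlier note that Random Forest does not require feature scaling can be checked with a small experiment: train one forest on raw features and one on standardized features, using the same seed, and measure how often their test predictions agree. This is a minimal sketch; the split settings, `n_estimators=100`, and `random_state=42` are assumptions for illustration. Agreement is expected to be very high, though tiny floating-point differences could in principle change an occasional split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split parameters are illustrative assumptions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then transform both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Same hyperparameters and seed for both forests
rf_raw = RandomForestClassifier(n_estimators=100, random_state=42)
rf_raw.fit(X_train, y_train)

rf_scaled = RandomForestClassifier(n_estimators=100, random_state=42)
rf_scaled.fit(X_train_scaled, y_train)

pred_raw = rf_raw.predict(X_test)
pred_scaled = rf_scaled.predict(X_test_scaled)

# Fraction of test samples on which both forests predict the same class
agreement = np.mean(pred_raw == pred_scaled)
print(f"prediction agreement (raw vs. scaled): {agreement:.3f}")
```

Standardization is a monotonic transformation of each feature, so it preserves the ordering of values that tree splits rely on. That is why scaling here serves pipeline consistency rather than model quality.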
