# Random Forest
* Excellent when:
	* You need high accuracy without heavy tuning
	* Data is tabular, mixed-type, nonlinear
	* You want a strong baseline before trying boosting models

* Use cases:

	* Financial risk modeling and credit scoring
	* Fraud detection / AML / anomaly detection
	* E-commerce: CTR prediction, ranking shortlist scoring
	* Booking.com: semantic match scoring, candidate generation models
	* Healthcare: interpretable yet powerful predictors
	* Insurance pricing and underwriting models


## 1. Definition
Random Forest is a supervised ensemble learning algorithm that builds a "forest" of many decorrelated decision trees during training and then merges their outputs to improve prediction accuracy and stability. It addresses the overfitting and high variance of single decision trees by injecting randomness into both the data sample used for each tree and the features considered at each split.

## 2. Core Idea
We need Random Forest because single decision trees are "high variance" models. If you change the training data slightly, a single decision tree might change its structure completely, because it memorizes noise. However, by training many trees on slightly different data and averaging them, the random errors largely cancel out, leaving the true signal.
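
The averaging argument can be checked with a small simulation. This is a minimal sketch, not a forest: each "tree" is stood in for by the true signal plus independent Gaussian noise, and `true_signal`, `noise_std`, and `ensemble_std` are names made up for the illustration. Real trees are correlated, so the gain in practice is smaller than what independent noise suggests.

```python
import numpy as np

rng = np.random.default_rng(0)

true_signal = 1.0    # the quantity every noisy estimator is trying to recover
noise_std = 0.5      # spread of a single estimator
n_trials = 10_000    # repeated experiments used to measure the spread


def ensemble_std(n_estimators):
    """Standard deviation of the average of n independent noisy estimators."""
    # Each row is one experiment, each column one independent estimator.
    estimates = true_signal + rng.normal(0.0, noise_std, size=(n_trials, n_estimators))
    return estimates.mean(axis=1).std()


for n in (1, 10, 100):
    print(f"{n:>3} estimators: std of the average = {ensemble_std(n):.3f}")
```

Under the independence assumption the spread shrinks roughly like `noise_std / sqrt(n)`, which is the "errors cancel out" effect in numbers.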

## 3. Mechanism
The workflow relies on two specific randomization techniques: Bagging and Feature Randomness.

* **Bootstrapping (Bagging)**: For each tree, we create a new training set by sampling N examples from the original training set with replacement, so some data points appear multiple times and others not at all.

* **Feature Selection**: When splitting a node, the tree must choose the best split from a random subset of features (e.g. $\sqrt{\text{total features}}$), not from all features. This keeps the trees diverse and decorrelated.

* **Aggregation**:
    * Classification: Majority Vote (hard voting).
    * Regression: Averaging all tree outputs
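
The three steps above can be sketched with NumPy alone. This is an illustration under assumptions rather than a full forest: no trees are actually trained, the helper names `bootstrap_indices` and `random_feature_subset` are hypothetical, and the tree votes are made-up labels.

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples, n_features = 10, 9
X = rng.normal(size=(n_samples, n_features))  # toy data, the values do not matter here


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample)."""
    return rng.integers(0, n, size=n)


def random_feature_subset(p, rng):
    """Pick floor(sqrt(p)) distinct candidate features for one split."""
    m = max(1, int(np.sqrt(p)))
    return rng.choice(p, size=m, replace=False)


rows = bootstrap_indices(n_samples, rng)
cols = random_feature_subset(n_features, rng)

print("bootstrap rows:", np.sort(rows))  # repeated indices are expected
print("rows left out of this sample:", np.setdiff1d(np.arange(n_samples), rows))
print("candidate features for this split:", np.sort(cols))

# Aggregation for classification: majority vote over the trees' predicted labels.
tree_votes = np.array([1, 0, 1, 1, 0])  # hypothetical labels from 5 trees
majority = np.bincount(tree_votes).argmax()
print("majority vote:", majority)  # 1
```

A real implementation would call `random_feature_subset` again at every split of every tree, which is what keeps the trees from all picking the same dominant feature.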

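**Soft voting** averages the class probabilities predicted by the individual trees instead of counting hard votes. Each tree $t$ estimates the probability from the leaf the sample lands in, $P_t(\text{class}) = \frac{\text{count of class in leaf}}{\text{total samples in leaf}}$, and the forest predicts the class with the highest average $\frac{1}{T}\sum_{t=1}^{T} P_t(\text{class})$ over its $T$ trees.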
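
Hard and soft voting can disagree when a minority of trees is very confident. The sketch below uses hypothetical leaf probabilities from three trees for a single sample to show the difference.

```python
import numpy as np

# Hypothetical class-probability outputs of 3 trees for one sample.
# Columns are P(class 0) and P(class 1); each row comes from one tree's leaf.
tree_probs = np.array([
    [0.55, 0.45],
    [0.60, 0.40],
    [0.10, 0.90],
])

# Hard voting: each tree casts one vote for its most likely class.
hard_votes = tree_probs.argmax(axis=1)  # [0, 0, 1]
hard_prediction = np.bincount(hard_votes, minlength=2).argmax()

# Soft voting: average the probabilities first, then take the argmax.
mean_probs = tree_probs.mean(axis=0)  # about [0.417, 0.583]
soft_prediction = mean_probs.argmax()

print("hard vote:", hard_prediction)  # 0
print("soft vote:", soft_prediction)  # 1
```

Two lukewarm votes for class 0 win the hard vote, while the one confident tree tips the averaged probability toward class 1.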



## 4. Mathematical Details / Training
The math focuses on Variance Reduction.
1. Out-of-bag (OOB) Error: estimates generalization without a validation set. Since roughly 1/3 of the data (about 36.8% for large N) is left out of each bootstrap sample, each point can be scored by the trees that never saw it during training, which acts as a built-in validation set.
2. Feature Subspace:
    * Classification: $m = \sqrt{p}$ features
    * Regression: $m = \frac{p}{3}$ features

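3. Algorithmic Principle: If the correlation between trees is low, the variance of the ensemble decreases as the number of trees increases. For $B$ identically distributed trees with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$. The second term vanishes as $B$ grows, while the first sets a floor that only decorrelation can lower.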

	* Lower correlation → better variance reduction
	* Trees explore different parts of the feature space
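
The out-of-bag idea can be checked in two steps: first simulate how much data a bootstrap sample leaves out, then read the OOB accuracy that scikit-learn reports. This is a sketch using scikit-learn's `RandomForestClassifier` on its bundled breast cancer dataset; the hyperparameter values are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Fraction of rows that a bootstrap sample of size N never draws: (1 - 1/N)^N -> 1/e
rng = np.random.default_rng(0)
n = 10_000
drawn = rng.integers(0, n, size=n)
oob_fraction = 1 - np.unique(drawn).size / n
print(f"simulated OOB fraction: {oob_fraction:.3f} (1/e = {np.exp(-1):.3f})")

# OOB accuracy: each row is scored only by the trees that did not train on it.
X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # m = sqrt(p) candidate features per split
    oob_score=True,       # needs bootstrap=True, which is the default
    random_state=0,
)
forest.fit(X, y)
print(f"OOB accuracy: {forest.oob_score_:.3f}")
```

No rows were held out before fitting, yet `oob_score_` is still an estimate on data each contributing tree has not seen.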


## 5. Pros and Cons
* Pros
    * Robust to outliers, noise and missing values
    * Much less overfitting than a single Decision Tree
    * Handles nonlinearities and interactions automatically
    * No scaling needed: It handles unscaled data well, and categorical features too (some libraries, such as scikit-learn, need them encoded as numbers first).
    * Feature importance: Automatically provides a feature importance score (measuring how much a feature reduces impurity on average across all trees).
* Cons:
    * Less interpretable: You lose the clear interpretability of a single tree.
    * Slow inference: To get a prediction, the computer must run the input through 100+ trees.
    * Model size: storing 500 deep trees can consume significant RAM space.
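    * Extrapolation: In regression, Random Forest cannot predict values outside the range of the training targets, because each prediction is an average of training values stored in the leaves. It is a "lookup" algorithm, not a trend-following one.
    * Not ideal for very large datasets: training time and memory grow with both the number of rows and the number of trees.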
    * Sparse data: It performs poorly on very high-dimensional, sparse data (like text / TFIDF) compared to linear models.
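
The extrapolation limitation listed above is easy to reproduce. The sketch below fits scikit-learn's `RandomForestRegressor` to a noiseless linear trend (synthetic data chosen for the illustration) and then queries it far outside the training range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on a clean linear trend y = 2x for x in [0, 10].
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Query far outside the training range; the true trend would give 40 and 100.
X_far = np.array([[20.0], [50.0]])
predictions = forest.predict(X_far)

print("max training target:", y_train.max())  # 20.0
print("predictions at x=20 and x=50:", predictions)
```

Both queries fall into the right-most leaf of every tree, so they receive the same prediction, and that prediction cannot exceed the largest target seen in training.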

## 6. Production Considerations
* Parallel Training: Trees are built independently of each other, so training (and prediction) can be spread across CPU cores.
* Latency vs. Accuracy: In real-time apps, you might reduce the number of trees or limit max_depth to trade a small amount of accuracy for much faster inference.
* Deployment: Models can be compiled into lower-level code (C++, Java) or converted to formats like ONNX to speed up the if-else logic traversals in production.
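
The latency trade-off can be measured directly. The following is a rough sketch with scikit-learn, where `n_jobs=-1` asks for all available cores; the two configurations (500 unrestricted trees versus 50 trees capped at depth 6) are arbitrary values picked for the comparison, and timings will vary by machine.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for n_estimators, max_depth in [(500, None), (50, 6)]:
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        n_jobs=-1,  # trees are independent, so fit and predict can use all cores
        random_state=0,
    )
    model.fit(X_train, y_train)

    # Time one pass over the test set as a crude proxy for inference latency.
    start = time.perf_counter()
    accuracy = model.score(X_test, y_test)
    elapsed_ms = (time.perf_counter() - start) * 1000

    results[(n_estimators, max_depth)] = (accuracy, elapsed_ms)
    print(f"trees={n_estimators:>3} max_depth={max_depth}: "
          f"accuracy={accuracy:.3f}, scoring time={elapsed_ms:.1f} ms")
```

On a dataset this small the timing differences are tiny, so a production decision should be based on measurements against realistic batch sizes and hardware.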


## 7. Other variants?
* Isolation Forest (iForest): This is an Unsupervised variant used for Anomaly Detection, not classification. It assumes that anomalies are few and different. Therefore, they are easier to isolate than normal points. 
    * If a data point ends up in a leaf node very quickly (short path length), it is likely an anomaly.
    * If a data point requires many splits to isolate (deep path length), it is likely normal data.
* Balanced Random Forest: This handles Imbalanced Data. Standard Random Forest is biased toward the majority class. If you bootstrap normally, some trees might not even see a fraud case.
    * It down-samples the majority class for every tree.

```python
import os
import sys

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Make the project root importable so the local implementation can be loaded
root = os.path.abspath("..")
sys.path.append(root)

from src.random_forest import RandomForest

# Load the breast cancer dataset and hold out 20% of it for testing
data = datasets.load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

# n_features is set to the full column count here
clf = RandomForest(n_trees=5, max_depth=10, n_features=int(X.shape[1]))
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

# Calculate accuracy
acc = np.sum(y_pred == y_test) / len(y_test)
print(f"Random Forest Accuracy: {acc:.4f}")
```

```
Random Forest Accuracy: 0.9474
```
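
For comparison, the same split can be run through scikit-learn's `RandomForestClassifier`. This is a reference sketch, not the `src.random_forest` implementation used above; `n_estimators=100` and `max_features="sqrt"` are choices made for the illustration. It also prints the impurity-based feature importances mentioned in the Pros list.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

sk_clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
sk_clf.fit(X_train, y_train)
sk_accuracy = sk_clf.score(X_test, y_test)
print(f"scikit-learn accuracy: {sk_accuracy:.4f}")

# Impurity-based importances, averaged over all trees; they sum to 1.
importances = sk_clf.feature_importances_
top = np.argsort(importances)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<25} {importances[i]:.3f}")
```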


## Case 1: Financial Risk Modeling

## Case 2: Insurance Pricing and Underwriting