# Random Forest

A Random Forest is an ensemble of decision trees used for classification or regression. By aggregating many decision trees, each built from a different bootstrap sample of the training set, Random Forests aim to reduce the overfitting that can occur with a single decision tree. This typically improves predictive performance, at the cost of some interpretability.

## Steps Involved in Random Forest

### Data Bagging

Random Forests use bagging (bootstrap aggregating) to construct multiple decision trees. Bagging draws samples from the training data with replacement and builds one model on each sampled training set. When a bootstrap sample has the same size as the original dataset, it typically contains about 63% (roughly 2/3) of the unique samples; the rest of its entries are duplicates.
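The "roughly two thirds" figure can be checked empirically. The sketch below is a minimal illustration, assuming each bootstrap sample has the same size as the dataset; the helper name `bootstrap_indices`, the seed, and the dataset size are illustrative choices rather than part of any library.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices uniformly with replacement from range(n_samples)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(0)   # fixed seed so the run is repeatable
n_samples = 10_000       # illustrative dataset size
indices = bootstrap_indices(n_samples, rng)

# Fraction of distinct original samples that made it into the bootstrap sample.
unique_fraction = len(set(indices)) / n_samples

# Probability that a given sample is drawn at least once; tends to 1 - 1/e.
expected_fraction = 1 - (1 - 1 / n_samples) ** n_samples

print(f"unique fraction in this draw: {unique_fraction:.3f}")
print(f"expected fraction:            {expected_fraction:.3f}")
```

The samples that never appear in `indices` are the ones left out of this bootstrap sample.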

### Feature Bagging

In addition to sampling data points, Random Forests also perform feature bagging. Instead of considering all features at every split of a decision tree, only a random subset of features is considered.

- For classification problems with $n$ features, typically $\sqrt{n}$ features are used at each split.
- For regression problems, the recommended number of features is $n/3$.

Because each split sees a different random subset of features, the trees are less correlated with one another, which enhances the diversity of the forest.
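As a rough illustration of these rules of thumb, the following sketch computes the subset size for each task and draws one random feature subset. The function names and the rounding (floor, with a minimum of one feature) are assumptions made for this example; libraries may round differently.

```python
import math
import random

def features_per_split(n_features, task):
    """Rule-of-thumb subset size: sqrt(n) for classification, n/3 for regression."""
    if task == "classification":
        k = math.isqrt(n_features)   # floor of the square root
    elif task == "regression":
        k = n_features // 3          # floor of n/3
    else:
        raise ValueError(f"unknown task: {task!r}")
    return max(1, k)                 # always consider at least one feature

def sample_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices uniformly at random."""
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
k_cls = features_per_split(16, "classification")
k_reg = features_per_split(30, "regression")
subset = sample_feature_subset(16, k_cls, rng)
print(k_cls, k_reg, subset)
```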

### Cross-Validation in Random Forests

Random Forests come with a built-in validation mechanism that is a by-product of bagging. For each bootstrap sample, about 1/3 of the training data is left out; these are known as out-of-bag (OOB) samples. Because a tree never sees its own OOB samples during training, they can be used much like a held-out validation set to estimate the model's performance and to compare settings such as the number of features considered per split.

The OOB error estimate is the mean prediction error on each training sample $j$, using only the trees that did not have sample $j$ in their bootstrap sample.

### Final Prediction

After constructing multiple decision trees, each trained on a different subset of the data and features, the final prediction for a new data point is made by aggregating the predictions from all the trees. For classification, the majority vote is taken, while for regression, the average value is used.
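The aggregation step is simple enough to sketch directly. The function names below are hypothetical, and the tie-breaking rule (the first label seen wins) is an assumption of this example, not something the algorithm prescribes.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_predictions):
    """Majority vote over per-tree class labels (ties: first label seen wins)."""
    return Counter(tree_predictions).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Average of per-tree numeric predictions."""
    return mean(tree_predictions)

# Predictions of three hypothetical trees for one new data point.
label = aggregate_classification(["cat", "dog", "cat"])
value = aggregate_regression([2.0, 3.0, 4.0])
print(label, value)
```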

### Illustrations

<center><img src="fig/RandomForestAlgorithm.jpg"/></center>

## Mathematical Explanation

### Decision Trees

A decision tree is a flowchart-like structure where each internal node represents a test on a feature (or attribute), each branch represents an outcome of that test, and each leaf node represents a prediction.

1. **Splitting Criterion**: Trees are grown by recursively splitting nodes into child nodes, using measures like Gini impurity or entropy for classification and mean squared error (MSE) for regression.
2. **Stopping Criteria**: Tree growth stops when a predefined criterion is met, such as maximum depth or minimum samples per leaf node.
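For reference, the standard definitions of these splitting measures can be written as follows. The notation is an assumption introduced here: $S$ is the set of samples at a node, $p_c$ is the fraction of samples in $S$ that belong to class $c$, $\bar{y}_S$ is the mean target value in $S$, and $S_L$, $S_R$ are the two child nodes produced by a candidate split.

$$
\text{Gini}(S) = 1 - \sum_{c} p_c^2, \qquad
\text{Entropy}(S) = -\sum_{c} p_c \log_2 p_c, \qquad
\text{MSE}(S) = \frac{1}{|S|} \sum_{i \in S} \left(y_i - \bar{y}_S\right)^2
$$

With $I$ standing for any one of these measures, a split is typically chosen to maximize the impurity decrease

$$
\Delta I = I(S) - \frac{|S_L|}{|S|}\, I(S_L) - \frac{|S_R|}{|S|}\, I(S_R).
$$

For example, a node with four samples of each of two classes has $\text{Gini}(S) = 1 - (0.5^2 + 0.5^2) = 0.5$; a split that separates the classes perfectly gives two pure children with Gini 0, so $\Delta I = 0.5$.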

### Bagging

Given a dataset $D$ with $N$ samples:
1. Create $m$ bootstrap samples $D_1, D_2, \dots, D_m$, each by drawing $N$ samples with replacement from $D$.
2. Train a decision tree $T_i$ on each $D_i$.

### Feature Bagging

At each split in the tree:
1. Randomly select $k$ features from the total $n$ features.
2. Choose the best feature from these $k$ features to split the node.
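The two steps above can be sketched as follows. This is a minimal illustration, assuming numeric features, binary threshold splits, and Gini impurity as the criterion; `gini` and `best_split` are hypothetical helper names, and the toy data is made up so that only feature 2 separates the classes.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels, k, rng):
    """Search k randomly chosen features for the threshold with the lowest
    weighted Gini impurity. Returns ((score, feature, threshold), candidates);
    the first element is None if no candidate feature can be split."""
    n_features = len(rows[0])
    candidates = rng.sample(range(n_features), k)    # step 1: pick k features
    best = None
    for f in candidates:                             # step 2: best among them
        # Dropping the largest value guarantees a non-empty right child.
        for t in sorted({row[f] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best, candidates

rows = [
    [1, 5, 0.1, 7],
    [2, 6, 0.2, 3],
    [1, 5, 0.8, 7],
    [2, 6, 0.9, 3],
]
labels = [0, 0, 1, 1]

best, candidates = best_split(rows, labels, k=2, rng=random.Random(0))
print("candidate features:", candidates)
print("best (score, feature, threshold):", best)
```

If feature 2 is not among the sampled candidates, the split has to settle for a less informative feature. That is the intended effect: different splits are forced to explore different features.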

### Out-of-Bag Error

For each sample $i$ in the training set:
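1. Predict sample $i$ using only the trees whose bootstrap samples do not contain $i$, and aggregate those predictions (majority vote or average) into a single OOB prediction.
2. Compare each OOB prediction with the true target and average the resulting errors over all samples to get the OOB error estimate.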
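The procedure can be sketched end to end in plain Python. To keep the example short, a one-dimensional threshold stump stands in for a full decision tree; `fit_stump`, `oob_error`, and the toy dataset are assumptions of this sketch, not a reference implementation.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Stand-in base learner: 1-D threshold stump with the fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: left_label if x <= t else right_label

def oob_error(xs, ys, n_trees, rng):
    """Misclassification rate using, for each sample, only the trees that
    did not see that sample in their bootstrap sample."""
    n = len(xs)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        in_bag = set(bag)
        model = fit_stump([xs[i] for i in bag], [ys[i] for i in bag])
        for j in range(n):
            if j not in in_bag:                      # sample j is out-of-bag
                votes[j][model(xs[j])] += 1
    scored = [j for j in range(n) if votes[j]]       # OOB at least once
    wrong = sum(votes[j].most_common(1)[0][0] != ys[j] for j in scored)
    return wrong / len(scored)

xs = [i / 40 for i in range(40)]
ys = [0 if x < 0.5 else 1 for x in xs]
print(f"OOB error estimate: {oob_error(xs, ys, n_trees=50, rng=random.Random(0)):.3f}")
```

With enough trees, every sample is out-of-bag for some of them, so each sample contributes to the estimate.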

### Calculating Proximities to Handle Missing Values

Proximity in the context of Random Forest refers to the similarity between pairs of samples based on their behavior within the decision trees of the forest. It is used for purposes such as handling missing values, identifying outliers, and computing prototypes. The following subsections describe how proximity is calculated and why it matters.

#### Definition of Proximity

Two samples are considered proximate in a given tree if they follow the same decision path, that is, if they end up in the same terminal node after traversing the tree. Each tree in which this happens increases the proximity of the pair.

#### Proximity Calculation

1. **Proximity Count**: For each unique pair of training samples, count how frequently the two samples end in the same terminal node.
   
2. **Proximity Matrix**: For a dataset with $N$ input cases, the proximity matrix is of size $N \times N$. Each cell holds the proximity value between two samples. Within a single decision tree, this value is either 0 or 1, indicating whether the pair ends up in the same terminal node of that tree.

3. **Accumulating Across Trees**: The per-tree values are summed over all trees in the Random Forest, so the final proximity reflects the collective behavior of the trees rather than a single tree.

4. **Normalization**: The summed counts are divided by the number of trees in the forest, which is equivalent to averaging the per-tree values. Each proximity then lies between 0 and 1, and every sample has proximity 1 with itself.
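A minimal sketch of this calculation is shown below. It assumes the terminal node reached by every sample in every tree is already known; the input layout `leaf_ids[t][i]` (tree `t`, sample `i`) and the function name `proximity_matrix` are choices made for this example.

```python
def proximity_matrix(leaf_ids):
    """Fraction of trees in which each pair of samples shares a terminal node.

    leaf_ids[t][i] is the terminal node reached by sample i in tree t.
    """
    n_trees = len(leaf_ids)
    n_samples = len(leaf_ids[0])
    counts = [[0] * n_samples for _ in range(n_samples)]
    for tree_leaves in leaf_ids:
        for i in range(n_samples):
            for j in range(n_samples):
                if tree_leaves[i] == tree_leaves[j]:
                    counts[i][j] += 1            # per-tree value is 0 or 1
    # Normalize the summed counts by the number of trees.
    return [[c / n_trees for c in row] for row in counts]

# Terminal nodes of four samples in three hypothetical trees.
leaf_ids = [
    [0, 0, 1, 1],
    [0, 1, 1, 2],
    [3, 3, 3, 4],
]
prox = proximity_matrix(leaf_ids)
for row in prox:
    print([round(p, 2) for p in row])
```

In scikit-learn, the `apply` method of a fitted forest returns leaf indices with shape `(n_samples, n_estimators)`, so its output would need to be transposed to match the layout assumed here.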

#### Significance of Proximities

- **Distance Measure**: Proximity is a similarity, so one minus the proximity can be used as a distance measure between pairs of inputs, playing a role similar to Mahalanobis or Euclidean distance. This measure helps in clustering similar samples together and identifying outliers.
  
- **Handling Missing Values**: Proximities are useful for imputing missing data. One method starts from a rough fill, for example the median (or most frequent value) among the cases of the same class as the case with missing data, and then refines it with a proximity-weighted average of the observed values, so that the most similar cases contribute the most.

- **Outlier Detection**: A sample whose proximities to the other samples, typically those of its own class, are all small is rarely grouped with anything else by the forest and may be an outlier.

- **Prototype Computation**: Proximities aid in computing prototypes by identifying samples that are representative of the overall dataset.

In short, proximity is a versatile by-product of a trained Random Forest that supports tasks ranging from handling missing data to outlier detection and prototype computation.
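As one possible way to turn proximities into an approximation for a missing numeric value, the sketch below takes a proximity-weighted mean of the observed values. The function name `impute_numeric`, the use of `None` to mark missing entries, and the example numbers are all assumptions of this illustration.

```python
def impute_numeric(values, prox_row):
    """Proximity-weighted mean of the observed values of one feature.

    values:   feature values for all samples, with None marking missing entries
    prox_row: proximities between the sample being imputed and every sample
    """
    pairs = [(v, w) for v, w in zip(values, prox_row) if v is not None and w > 0]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None                      # no similar sample has an observed value
    return sum(v * w for v, w in pairs) / total_weight

# Sample 1 is missing this feature; its proximities to samples 0..3 are given.
values = [10.0, None, 20.0, 40.0]
prox_row = [0.5, 1.0, 0.5, 0.0]
filled = impute_numeric(values, prox_row)
print(filled)
```

Sample 3 has proximity 0 to the incomplete sample, so its value of 40.0 does not influence the result.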

## Hyperparameters

Key hyperparameters in Random Forest include:

- **n_estimators**: Number of trees in the forest.
- **max_features**: Maximum number of features to consider at each split.
- **max_depth**: Maximum depth of each tree.
- **min_samples_split**: Minimum number of samples required to split an internal node.
- **min_samples_leaf**: Minimum number of samples required to be at a leaf node.
- **bootstrap**: Whether bootstrap samples are used when building trees.
- **oob_score**: Whether to use out-of-bag samples to estimate the generalization accuracy.
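These names match the parameters of scikit-learn's `RandomForestClassifier`. Assuming that library is the one intended, a minimal sketch of setting them looks like this; the synthetic dataset and the specific values are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only so the example is self-contained.
X, y = make_classification(
    n_samples=300, n_features=16, n_informative=5, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200,        # number of trees
    max_features="sqrt",     # features considered at each split
    max_depth=None,          # grow trees until other stopping rules apply
    min_samples_split=2,     # minimum samples needed to split a node
    min_samples_leaf=1,      # minimum samples required in a leaf
    bootstrap=True,          # train each tree on a bootstrap sample
    oob_score=True,          # compute the out-of-bag accuracy estimate
    random_state=0,          # not in the list above; fixes the randomness
)
forest.fit(X, y)

print(f"OOB accuracy estimate: {forest.oob_score_:.3f}")
```

Setting `oob_score=True` requires `bootstrap=True`, because out-of-bag samples exist only when each tree is trained on a bootstrap sample.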

## Important Features of Random Forest

Random Forest is a powerful ensemble learning method that combines multiple decision trees to enhance the model's performance and robustness. Here are some of the most important features that make Random Forest a popular choice in machine learning:

#### 1. Diversity

Random Forests ensure diversity by not considering all attributes/variables/features when growing a tree. Instead, each split in each tree is chosen from a random subset of features, and each tree is trained on its own bootstrap sample. As a result, every tree differs from the others.

- **Feature Bagging**: By randomly selecting a subset of features at each split, the algorithm keeps a few dominant features from driving every tree, ensuring that the trees are diverse and capture different aspects of the data.
- **Tree Variety**: This variability among trees helps in capturing a broader spectrum of patterns and relationships within the data, making the overall model more robust.

#### 2. Robustness to the Curse of Dimensionality

The curse of dimensionality refers to the problems that arise when analyzing data with a large number of features. Random Forests are relatively robust to this issue, though not completely immune, because each split considers only a subset of features.

- **Reduced Feature Space**: By considering only a random subset of features at each split, Random Forests effectively reduce the feature space, making the learning process more efficient and reducing the risk of overfitting.
- **Scalability**: This feature allows Random Forests to scale well to datasets with a large number of features, maintaining performance without being overwhelmed by the dimensionality.

#### 3. Parallelization

One of the key advantages of Random Forests is that each tree is created independently from different data and attributes, allowing for parallel processing.

- **Independent Tree Construction**: Since the trees are built independently, the process can be parallelized, significantly speeding up the training process.
- **Full CPU Utilization**: Modern multi-core machines can train many trees at the same time, which makes Random Forests faster to train than inherently sequential ensemble methods such as boosting, where each model depends on the previous one.

#### 4. Train-Test Split

Random Forests have an inherent mechanism for estimating model performance without setting aside a separate validation split.

- **Out-of-Bag (OOB) Samples**: During the construction of each tree, about 37% of the data (roughly one third) is left out of the bootstrap sample (OOB samples). These OOB samples can be used to evaluate the model's performance, providing an internal estimate of test error.
- **Automatic Validation**: This simplifies the workflow because it often removes the need for a separate validation set, so all available data can be used for both training and validation. A final held-out test set is still advisable when hyperparameters have been tuned against the OOB estimate.

#### 5. Stability

Stability in Random Forests arises from the aggregation of predictions from multiple trees.

- **Majority Voting/Averaging**: For classification problems, the final prediction is based on the majority vote from all the trees, while for regression problems, it is based on the average prediction. This aggregation reduces the variance of the model, leading to more stable and reliable predictions.
- **Reduced Sensitivity**: The model is less sensitive to the variability in the data because it relies on the collective decision of multiple trees rather than a single tree, making it more robust to outliers and noise.
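The variance reduction can be made precise with a standard identity. Assume, for illustration, that the predictions of the $m$ trees are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$ (both symbols are introduced here, not defined elsewhere in this document). The variance of their average is then

$$
\operatorname{Var}\!\left(\frac{1}{m}\sum_{i=1}^{m} T_i(x)\right)
= \frac{1}{m^2}\left[m\sigma^2 + m(m-1)\rho\sigma^2\right]
= \rho\,\sigma^2 + \frac{1-\rho}{m}\,\sigma^2 .
$$

The second term shrinks as more trees are added, while the first term depends only on how correlated the trees are. Under these assumptions, this is why bagging alone has diminishing returns and why feature bagging, which lowers $\rho$, improves stability further.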

## Conclusion

Random Forests are powerful ensemble models that combine multiple decision trees to improve performance and reduce overfitting in both classification and regression. They leverage bagging and feature bagging to ensure diversity among the trees, and out-of-bag samples provide a built-in estimate of generalization error. Hyperparameter tuning and feature importance analysis help in optimizing the model and interpreting its results.