# Random Forests

Random forests improve on bagged trees through a small change that decorrelates the trees. As in bagging, we build many decision trees on bootstrapped training samples. However, each time a split is considered while growing a tree, only a random subset of m predictors, drawn from the full set of p predictors, is eligible as split candidates [James et al., 2023].

The math behind Random Forests involves a few key concepts that contribute to its effectiveness in enhancing decision trees. Let's break down the main components [Hastie et al., 2013, James et al., 2023]:

1. **Bootstrap Sampling:** In Random Forests, multiple decision trees are created, each trained on a different resampled version of the training data. These samples are obtained through bootstrap sampling: given a dataset with 'n' observations, 'n' observations are drawn at random with replacement. Some observations therefore appear several times in a given sample, while others (about 37% on average for large 'n') do not appear at all. This process generates diverse training sets for building different trees.

2. **Random Feature Selection:** At each split point of a decision tree within a Random Forest, instead of considering all available features (predictors), a random subset of features is selected as candidates for the split. This introduces randomness and diversity among the trees. The number of features in the subset, denoted as 'm', is typically smaller than the total number of predictors 'p'. This process helps decorrelate the trees, reducing the chance of them making similar errors and leading to more accurate predictions.

3. **Voting or Averaging:** Once all the trees are constructed, their predictions are combined to make a final prediction. For regression tasks, the predictions from individual trees are usually averaged to obtain the final prediction. For classification tasks, a majority vote among the predictions is often taken to determine the class label. This ensemble approach helps improve the overall accuracy and stability of the model.
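As a rough illustration of the bootstrap step, the sketch below draws one bootstrap sample of indices with NumPy and measures how many distinct observations it contains. The sample size `n` and the seed are arbitrary choices for this illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # hypothetical number of training observations

# One bootstrap sample: n indices drawn uniformly with replacement
boot_idx = rng.integers(0, n, size=n)

# Fraction of distinct observations that ended up in the sample
in_bag_fraction = np.unique(boot_idx).size / n

print(f"In-bag fraction:           {in_bag_fraction:.3f}")
print(f"Large-n limit (1 - 1/e):   {1 - np.exp(-1):.3f}")
```

The observations left out of a given sample are the "out-of-bag" observations for the tree trained on it.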

Mathematically, the process of Random Forests involves creating 'B' decision trees, each constructed using a different bootstrap sample and a random subset of 'm' features at each split point. The final prediction for a new observation is obtained by averaging (for regression) or majority voting (for classification) the predictions from all the trees:

For regression:
\begin{equation}
\hat{f}_{rf}(x) = \frac{1}{B}\sum_{b = 1}^{B} \hat{f}^{b}(x)
\end{equation}

<center>
<img src="https://raw.githubusercontent.com/HatefDastour/hatefdastour.github.io/master/_notes/Introduction_to_Digital_Engineering/_images/RFR_Fig.jpg" alt="picture" width="700">
<br>
<b>Figure</b>: A visual of Random Forests Algorithm for regression.
</center>

For classification:
\begin{equation}
\hat{C}_{rf}(x) = \text{majority vote}\left(\hat{C}^{1}(x), \hat{C}^{2}(x), \ldots, \hat{C}^{B}(x)\right)
\end{equation}

<center>
<img src="https://raw.githubusercontent.com/HatefDastour/hatefdastour.github.io/master/_notes/Introduction_to_Digital_Engineering/_images/RFC_Fig.jpg" alt="picture" width="700">
<br>
<b>Figure</b>: A visual of Random Forests Algorithm for classification.
</center>

Here, $\hat{f}^{b}(x)$ represents the prediction of the 'b'-th tree for observation 'x', and $\hat{C}^{b}(x)$ represents the class predicted by the 'b'-th tree for observation 'x'.
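The two formulas can be checked by hand on a fitted forest. The following is a minimal sketch on synthetic data (the datasets and settings are assumptions made for illustration): it averages the individual tree predictions for regression and takes a hard majority vote for classification. Note that scikit-learn's `RandomForestClassifier` actually averages the trees' class probabilities rather than counting votes, so the hard vote below may occasionally differ from `clf.predict`.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Regression: the forest prediction is the mean of the B tree predictions
X_r, y_r = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
reg = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_r, y_r)
tree_preds = np.stack([tree.predict(X_r) for tree in reg.estimators_])  # shape (B, n)
manual_mean = tree_preds.mean(axis=0)
print("Manual average matches reg.predict:", np.allclose(manual_mean, reg.predict(X_r)))

# Classification: hard majority vote over the B tree predictions
X_c, y_c = make_classification(n_samples=200, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_c, y_c)
# Sub-trees predict encoded class indices (as floats), so cast to int
votes = np.stack([tree.predict(X_c) for tree in clf.estimators_]).astype(int)
n_classes = len(clf.classes_)
majority_idx = np.array([np.bincount(col, minlength=n_classes).argmax() for col in votes.T])
majority = clf.classes_[majority_idx]

agreement = np.mean(majority == clf.predict(X_c))
print(f"Agreement between hard vote and clf.predict: {agreement:.3f}")
```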

The random forest algorithm's combination of bootstrap sampling and random feature selection helps create a diverse ensemble of trees that work together to provide more accurate and stable predictions, reducing the likelihood of overfitting and improving the model's generalization ability.



---

**Random Forest algorithm**

1. **Data Preparation:**
   - $N$ = Number of samples in the dataset
   - $M$ = Number of features in each sample
   - $x_i$ = Input features for the $i$th sample
   - $y_i$ = Output label for regression task (real value)
   - $C_i$ = Output class for classification task (categorical value)

2. **Bootstrapping:**
   - Randomly select $N$ samples with replacement to create multiple bags (bootstrap samples).
   - In scikit-learn's API, the parameter controlling this is `bootstrap=True`.

3. **Growing Individual Trees:**
   - Each individual tree $t$ is trained on one of the bootstrap samples.
   - At each node of tree $t$, consider a random subset of features of size $m$ for splitting.
   - Stop growing the tree based on stopping criteria like maximum depth or minimum samples per leaf.
   - In scikit-learn, you can control maximum depth with `max_depth` and minimum samples per leaf with `min_samples_leaf`.

4. **Voting or Averaging:**
   - For classification: Let $k$ be the number of classes. Each tree $t$ predicts one of the $k$ classes for an input $x$. The final prediction is the class predicted by the most trees.
   - For regression: Each tree $t$ predicts a real value for an input $x$. The final prediction is the average of all trees' predictions.
   - In scikit-learn, you can set `n_estimators` to determine the number of trees.

5. **Out-of-Bag (OOB) Error:**
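   - For each sample $i$, only the trees whose bootstrap sample does not contain $i$ are used to predict it. Comparing these out-of-bag predictions with the true outputs gives the OOB error, an estimate of the generalization error that needs no separate validation set.
   - In scikit-learn, setting `oob_score=True` (which requires `bootstrap=True`) stores the OOB score (accuracy for classification, $R^2$ for regression) in the `oob_score_` attribute after fitting.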

6. **Randomness and Diversity:**
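   - For feature subset selection, $m$ is typically set to $\sqrt{M}$ for classification and $\frac{M}{3}$ for regression. In scikit-learn, $m$ is controlled by `max_features`, whose default is `"sqrt"` for the classifier and all features (`1.0`) for the regressor.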
   - This randomness encourages different trees to focus on different subsets of features, leading to diversity.

7. **Hyperparameters:**
   - `n_estimators`: Number of trees in the forest.
   - `max_depth`: Maximum depth of each tree.
   - `min_samples_split`: Minimum number of samples required to split an internal node.
   - `min_samples_leaf`: Minimum number of samples required to be at a leaf node.
   - `max_features`: Number of features to consider for the best split at each node.
   - `bootstrap`: Whether bootstrap samples should be used.
   - `oob_score`: Whether to calculate out-of-bag score.

---
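To make the out-of-bag step more concrete, here is a small sketch that fits a forest with `oob_score=True` and compares the OOB estimate with the score on a held-out test set. The synthetic dataset and all parameter values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=400, n_features=8, n_informative=5,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# oob_score=True requires bootstrap=True (the default)
forest = RandomForestRegressor(n_estimators=200, bootstrap=True,
                               oob_score=True, random_state=0)
forest.fit(X_tr, y_tr)

oob_r2 = forest.oob_score_           # R^2 computed from out-of-bag predictions
test_r2 = forest.score(X_te, y_te)   # R^2 on the held-out test set

print(f"OOB R^2:  {oob_r2:.3f}")
print(f"Test R^2: {test_r2:.3f}")
```

The two numbers are usually close, which is why the OOB score is often used as an inexpensive stand-in for a validation set.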

The Random Forests algorithm combines the predictions from multiple decision trees, each constructed on a different bootstrap sample and a subset of features. The diversity introduced by these mechanisms helps to reduce overfitting and improve the generalization performance of the ensemble model. Additionally, Random Forests provide insights into feature importance, which can be used for feature selection and understanding the underlying relationships in the data [Breiman, 2001, James et al., 2023].

Keep in mind that the algorithm can be further customized and optimized with various hyperparameters and techniques, such as adjusting the number of trees, tuning the size of the feature subset, and handling missing values and categorical variables. Implementation details may vary depending on the programming language or library you're using.

Here's how these concepts relate to scikit-learn's API:

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Creating a Random Forest Classifier
# (max_features="auto" was removed in scikit-learn 1.3; "sqrt" is the classifier default)
clf = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2,
                             min_samples_leaf=1, max_features="sqrt", bootstrap=True,
                             oob_score=True)

# Creating a Random Forest Regressor
# (max_features=1.0 considers all features at each split, which is the regressor default)
reg = RandomForestRegressor(n_estimators=100, max_depth=None, min_samples_split=2,
                            min_samples_leaf=1, max_features=1.0, bootstrap=True,
                            oob_score=True)
```

You can replace the hyperparameter values above with your desired settings. The scikit-learn API makes it convenient to configure the Random Forest algorithm for your specific task and data.

## Example: Auto MPG dataset

<font color='Blue'><b>Example</b></font>. Consider the Auto MPG dataset retrieved from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/dataset/9/auto+mpg).

In [None]:
# Download the zip file using wget
!wget -N "http://archive.ics.uci.edu/static/public/9/auto+mpg.zip"

# Unzip the downloaded zip file
!unzip -o auto+mpg.zip auto-mpg.data

# Remove the downloaded zip file after extraction
!rm -r auto+mpg.zip

In [None]:
import pandas as pd
# You can download the dataset from: http://archive.ics.uci.edu/static/public/9/auto+mpg.zip

# Define column names based on the dataset's description
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model_Year', 'Origin', 'Car_Name']

# Read the dataset with column names, treating '?' as missing values, and remove rows with missing values
# (sep=r'\s+' replaces delim_whitespace=True, which is deprecated in recent pandas versions)
auto_mpg_df = pd.read_csv('auto-mpg.data', names=column_names,
                          na_values='?', sep=r'\s+').dropna().reset_index(drop=True)

# Remove the 'Car_Name' column from the DataFrame
auto_mpg_df = auto_mpg_df.drop(columns=['Car_Name'])

# Display the resulting DataFrame
display(auto_mpg_df)

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score

# Extract the features (X) and target variable (y)
X = auto_mpg_df.drop(columns=['MPG'])
y = np.log(auto_mpg_df.MPG.values)  # Take the natural logarithm of the MPG values

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
set_size_df = pd.DataFrame({'Size': [len(X_train), len(X_test)]}, index = ['Train', 'Test'])
display(set_size_df.T)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
import matplotlib.pyplot as plt
plt.style.use('https://raw.githubusercontent.com/HatefDastour/ENGG_680/main/Files/mystyle.mplstyle')

# Create a figure and subplots
fig, ax = plt.subplots(1, 2, figsize=(9.5, 4.5))

# Loop through different numbers of trees
for i, n_estimators in enumerate([3, 10]):
    # Create a Random Forest Regressor
    reg = RandomForestRegressor(n_estimators = n_estimators, random_state= 0)
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)
    
    # Create scatter plot and a diagonal reference line
    scatter = ax[i].scatter(y_pred, y_test, label='log(MPG)', facecolors='SkyBlue', edgecolors='MidnightBlue', alpha=0.8)
    line = ax[i].plot([0, 1], [0, 1], '--k', lw = 2, transform=ax[i].transAxes)
    
    _title = f'n_estimators = {n_estimators}'
    # Set title and labels for the current subplot
    title = ax[i].set(title = _title, xlabel='Predicted Values', ylabel='Actual Values')
    
    # Calculate and display Mean Squared Error (MSE) with background color
    mse = metrics.mean_squared_error(y_test, y_pred)
    text = ax[i].text(0.68, 0.05, f'MSE = {mse:.3f}',
                      transform=ax[i].transAxes, fontsize=12, weight='bold',
                      bbox=dict(facecolor='Whitesmoke', alpha=0.7))  # Add background color

    # Set equal aspect ratio for the subplots
    ax[i].axis('equal')

# Adjust layout and display the plots
plt.tight_layout()

## Feature Importances

In the context of a random forest model, the `feature_importances_` attribute serves as an essential metric for gauging the significance of individual features in facilitating accurate predictions. This attribute offers valuable insights into the influential role that each feature plays in shaping the model's predictions [Pedregosa et al., 2011, scikit-learn Developers, 2023].

### Calculation of `feature_importances_`:

The determination of feature importance in a random forest involves assessing how much each feature contributes to the reduction of impurity, commonly measured using metrics such as Gini impurity or Mean Squared Error, within the individual decision trees constituting the forest. The process unfolds as follows [Pedregosa et al., 2011, scikit-learn Developers, 2023]:

1. **Tree Level Calculation:** Within each decision tree of the random forest, every internal node splits on one feature, chosen because it yields the largest impurity reduction among the candidate features at that node. Impurity is measured with Gini impurity or entropy for classification and with Mean Squared Error for regression.

2. **Feature Contribution:** For each feature, the impurity reductions of all nodes that split on it are summed, each weighted by the fraction of training samples reaching that node. A larger total implies a higher importance of the feature within that specific tree.

3. **Averaging Across Trees:** After constructing all individual trees, the importance of each feature is averaged across the entire forest. This results in one importance score per feature, indicating its collective contribution to the model's predictions.

4. **Normalization:** scikit-learn normalizes the importance scores so that they sum to 1; multiplying by 100 expresses them as percentages. This normalization aids in interpreting the relative importance of each feature.

5. **Interpretation:** A higher importance score denotes that a feature exerts a more substantial influence on the model's predictions. Conversely, features with lower importance scores contribute less to the model's predictive capabilities.

### Interpretation of `feature_importances_`:

The values within the `feature_importances_` array are non-negative and, in scikit-learn, sum to 1 (they are all 0 only in the degenerate case where no tree contains a split). These values are relative and indicate which features have a more pronounced impact on the model's predictions. Higher importance values signify a larger contribution to the impurity reduction achieved by the forest.

By scrutinizing `feature_importances_`, one can pinpoint key features steering the model's performance, concentrate on pertinent variables, and potentially engage in feature selection to enhance model efficiency.

In essence, `feature_importances_` in a random forest model quantifies the contribution of each feature to the reduction of impurity across individual trees, providing a valuable tool for comprehending feature relevance and model behavior.
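The averaging and normalization steps can be reproduced by hand from the individual trees. The sketch below uses a synthetic dataset (an assumption for illustration) and relies only on the `estimators_` and `feature_importances_` attributes of a fitted forest.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 6 features, of which 3 are informative (illustrative only)
X_demo, y_demo = make_regression(n_samples=300, n_features=6, n_informative=3,
                                 noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Per-tree impurity-based importances, shape (n_trees, n_features)
per_tree = np.stack([tree.feature_importances_ for tree in forest.estimators_])

# Average across trees, then normalize so the scores sum to 1
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

print("Forest importances:", np.round(forest.feature_importances_, 3))
print("Manual importances:", np.round(manual, 3))
print("Sum of importances:", forest.feature_importances_.sum())
```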

In [None]:
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import matplotlib.pyplot as plt

# Instantiate a RandomForestRegressor with 100 estimators and a random state of 0
reg = RandomForestRegressor(n_estimators=100, random_state=0)

# Fit the RandomForestRegressor model on the training data
reg.fit(X_train, y_train)

# Extract feature importances
feature_importances = reg.feature_importances_

# Create a DataFrame to store feature importances
Importance = pd.DataFrame({'Importance': reg.feature_importances_ * 100}, index=X.columns)

# Apply a background gradient to the DataFrame and round importance values to 3 decimal places
styled_importance = Importance.style.background_gradient(cmap='PuBu', subset=['Importance']).format({'Importance': '{:.3f}'})

# Display the styled DataFrame
display(styled_importance)
print('\n')

# Create a bar plot to visualize feature importances
fig, ax = plt.subplots(1, 1, figsize=(6, 4))
bars = ax.bar(Importance.index, Importance.Importance, color='#e7d2f3', edgecolor='#611589', hatch="///", lw=2)

# Set plot labels and title
ax.set_xlabel('Feature', fontsize=12, weight='bold', color='midnightblue')
ax.set_ylabel('Importance (%)', fontsize=12, weight='bold', color='midnightblue')
ax.set_title('Feature Importance in Random Forest Model', fontsize=16, weight='bold', color='darkslategray')

# Set y-axis limits and adjust tick parameters
ax.set_ylim([0, 40])
ax.tick_params(axis='x', rotation=45, labelsize=12, color='dimgray')
ax.tick_params(axis='y', labelsize=12, color='dimgray')

# Customize plot aesthetics
ax.spines[['top', 'right']].set_visible(False)
ax.spines[['bottom', 'left']].set_color('dimgray')
ax.grid(axis='x')

# Ensure a tight layout for better visualization
plt.tight_layout()

<font color='Blue'><b>Example</b></font>: In this code example, a Random Forest Classifier is fitted with several values of `max_depth` to illustrate the resulting decision boundaries on synthetic data.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.inspection import DecisionBoundaryDisplay
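try:
  import sklearnex
except ImportError:
  !pip install scikit-learn-intelex
  import sklearnex
from IPython.display import clear_output
clear_output()
sklearnex.patch_sklearn()
# Re-import after patching so that the patched estimator is the one used below
from sklearn.ensemble import RandomForestClassifier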

# Define colors and colormap for the plot
colors = ["#f44336", "#2986cc", "#065535", '#ffe599']

# Define a list of color names for the colormap
_cmap = ListedColormap(colors)

# Generate synthetic data using make_blobs
n_samples = 1000
n_features = 2
centers = 4
cluster_std = 1.0
X, y = make_blobs(n_samples=n_samples, n_features=n_features, centers=centers, random_state=0, cluster_std=cluster_std)

# Plot decision boundaries
fig, axes = plt.subplots(2, 2, figsize=(9.5, 9))
axes = axes.ravel()

for ax, m, alph in zip(axes, [5, 25, 100, None], 'abcd'):
    # Create a RandomForestClassifier with the specified max_depth (fixed seed for reproducibility)
    rfc = RandomForestClassifier(max_depth=m, random_state=0)
    
    # Fit the classifier to the data
    rfc.fit(X, y)
    
    # Plot the decision boundary using DecisionBoundaryDisplay
    DecisionBoundaryDisplay.from_estimator(rfc, X, cmap=_cmap, ax=ax,
                                           response_method="predict",
                                           plot_method="pcolormesh",
                                           xlabel='Feature 1', ylabel='Feature 2',
                                           shading="auto",
                                           alpha=0.3,)
    
    # Scatter plot for data points
    for num in np.unique(y):
        ax.scatter(X[:, 0][y == num], X[:, 1][y == num], c=colors[num],
                   s=40, edgecolors='k', marker='o', label=str(num))
    
    # Set title and remove grid lines
    ax.set_title(f'({alph}) max_depth = {m}', weight='bold')
    
    # Turn off the grid (axis limits are left at their defaults)
    _ = ax.grid(False)

# Adjust layout for better presentation
plt.tight_layout()
sklearnex.unpatch_sklearn()


---

<font color='Red'><b>Note:</b></font>


`sklearnex` is the import name of the Intel Extension for Scikit-learn, distributed on PyPI as `scikit-learn-intelex`. Rather than adding new algorithms, it provides accelerated implementations of a number of existing scikit-learn estimators while keeping the same API.

The function `sklearnex.patch_sklearn()` replaces the supported scikit-learn estimators with these accelerated versions, and `sklearnex.unpatch_sklearn()` reverts the change.

The project documentation recommends calling `patch_sklearn()` before importing the estimators you want to accelerate, so that the patched classes are the ones picked up. Code written against the standard scikit-learn API otherwise stays unchanged. Additional information can be found by referring to the documentation available [here](https://github.com/intel/scikit-learn-intelex).

---

The resulting visualization is a grid of subplots, each depicting a different scenario based on the chosen maximum depth value. The arrangement of these subplots allows for a clear comparison of how the complexity of decision boundaries changes with the depth of the decision trees. Through this example, one gains insight into the flexibility and versatility of Random Forests in handling complex classification tasks and capturing intricate decision boundaries.
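To complement the visual comparison with numbers, the sketch below fits forests of different maximum depth on the same kind of synthetic blobs and reports train and test accuracy. The train/test split, the extra `max_depth=1` setting, and the seeds are assumptions added for illustration; they are not part of the example above.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same kind of synthetic data as in the plot above
X_b, y_b = make_blobs(n_samples=1000, n_features=2, centers=4,
                      cluster_std=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_b, y_b, test_size=0.25,
                                          stratify=y_b, random_state=0)

scores = {}
for depth in [1, 5, 25, None]:
    model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for this depth
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"max_depth = {str(depth):>4}: "
          f"train acc = {scores[depth][0]:.3f}, test acc = {scores[depth][1]:.3f}")
```

A widening gap between train and test accuracy as depth grows would indicate that the additional boundary complexity is fitting noise rather than structure.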

In [None]:
import numpy as np
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from pprint import pprint
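try:
  import sklearnex
except ImportError:
  !pip install scikit-learn-intelex
  import sklearnex
sklearnex.patch_sklearn()
# Re-import after patching so that the patched estimator is the one used below
from sklearn.ensemble import RandomForestClassifier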

# Set a random seed for reproducibility
rng = np.random.RandomState(0)

# Create a RandomForestClassifier with default parameters
rfc = RandomForestClassifier(random_state=rng)

# Define the hyperparameter search space using param_dist
# (X comes from make_blobs above and has only 2 features, so max_features cannot exceed 2)
param_dist = {'n_estimators': [10, 20, 25, 50],
              "max_depth": [3, 5, 8, None],
              "max_features": [1, 2],
              "min_samples_split": [2, 5, 7, 11],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy", "log_loss"]
              }

# Initialize HalvingRandomSearchCV with the estimator and parameter distributions
rsh = HalvingRandomSearchCV(estimator=rfc,
                            param_distributions=param_dist,
                            factor=2, random_state=rng)

# Fit the search object to your data
_ = rsh.fit(X, y)

# Get the best hyperparameters found by the search
best_params_ = rsh.best_params_
pprint(best_params_)
sklearnex.unpatch_sklearn()

The core of this example is `HalvingRandomSearchCV`, which applies successive halving to randomly sampled hyperparameter combinations. All candidates are first evaluated with a small amount of resources (by default, a small number of training samples). Only the best-scoring fraction (here one half, since `factor=2`) advances to the next iteration, where each surviving candidate receives more resources. The process repeats until only a few candidates remain, so most of the computational budget is spent on promising configurations. After fitting the search object to the data (`X` and `y`), the code prints the best hyperparameters found. Note that `HalvingRandomSearchCV` is still experimental in scikit-learn, which is why `enable_halving_search_cv` must be imported first.
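The halving schedule can be inspected after fitting through the `n_candidates_` and `n_resources_` attributes. The following is a small sketch on a synthetic dataset; the data, the reduced search space, and the forest size are assumptions chosen to keep the run short.

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import HalvingRandomSearchCV

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=5, random_state=0)

search = HalvingRandomSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    param_distributions={"max_depth": [3, 5, 8, None],
                         "min_samples_split": [2, 5, 7, 11],
                         "criterion": ["gini", "entropy"]},
    factor=2,          # keep roughly half of the candidates at each iteration
    random_state=0,
)
search.fit(X_demo, y_demo)

# One row per iteration: how many candidates were evaluated, on how many samples
for it, (n_cand, n_res) in enumerate(zip(search.n_candidates_, search.n_resources_)):
    print(f"iteration {it}: {n_cand:>3} candidates, {n_res:>4} samples each")

print("Best parameters:", search.best_params_)
```

The printout shows the number of candidates shrinking while the number of samples per candidate grows, which is the trade-off described above.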

In [None]:
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
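try:
  import sklearnex
except ImportError:
  !pip install scikit-learn-intelex
  import sklearnex
sklearnex.patch_sklearn()
# Re-import after patching so that the patched estimator is the one used below
from sklearn.ensemble import RandomForestClassifier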

def print_bold(txt, c = 31):
    print(f"\033[1;{c}m" + txt + "\033[0m")

def _Line(n = 80):
    print(n * '_')

# Create a RandomForestClassifier instance
rfc = RandomForestClassifier(random_state=rng, **best_params_)

# Initialize StratifiedKFold cross-validator
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
# With 5 folds, each split uses 80% of the data for training and 20% for testing

# Lists to store train and test scores for each fold
train_acc_scores, test_acc_scores, train_f1_scores, test_f1_scores = [], [], [], []
train_class_proportions, test_class_proportions = [], []

# Perform Cross-Validation
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), 1):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    rfc.fit(X_train, y_train)

    # Calculate class proportions for train and test sets
    # (the loop variable is named c so it is not confused with the classifier rfc)
    train_class_proportions.append([np.mean(y_train == c) for c in np.unique(y)])
    test_class_proportions.append([np.mean(y_test == c) for c in np.unique(y)])

    # train
    y_train_pred = rfc.predict(X_train)
    train_acc_scores.append(metrics.accuracy_score(y_train, y_train_pred))
    train_f1_scores.append(metrics.f1_score(y_train, y_train_pred, average = 'weighted'))

    # test
    y_test_pred = rfc.predict(X_test)
    test_acc_scores.append(metrics.accuracy_score(y_test, y_test_pred))
    test_f1_scores.append(metrics.f1_score(y_test, y_test_pred, average = 'weighted'))

_Line()
#  Print the Train and Test Scores for each fold
for fold in range(n_splits):
    print_bold(f'Fold {fold + 1}:')
    print(f"\tTrain Class Proportions: {train_class_proportions[fold]}*{len(y_train)}")
    print(f"\tTest Class Proportions: {test_class_proportions[fold]}*{len(y_test)}")
    print(f"\tTrain Accuracy Score = {train_acc_scores[fold]:.4f}, Test Accuracy Score = {test_acc_scores[fold]:.4f}")
    print(f"\tTrain F1 Score (weighted) = {train_f1_scores[fold]:.4f}, Test F1 Score (weighted)= {test_f1_scores[fold]:.4f}")

_Line()
print_bold('Accuracy Score:')
print(f"\tMean Train Accuracy Score: {np.mean(train_acc_scores):.4f} ± {np.std(train_acc_scores):.4f}")
print(f"\tMean Test Accuracy Score: {np.mean(test_acc_scores):.4f} ± {np.std(test_acc_scores):.4f}")
print_bold('F1 Score:')
print(f"\tMean Train F1 Score (weighted): {np.mean(train_f1_scores):.4f} ± {np.std(train_f1_scores):.4f}")
print(f"\tMean Test F1 Score (weighted): {np.mean(test_f1_scores):.4f} ± {np.std(test_f1_scores):.4f}")
_Line()
sklearnex.unpatch_sklearn()
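The manual cross-validation loop above can also be expressed more compactly with scikit-learn's `cross_validate`. The sketch below is one possible condensed version under stated assumptions: it regenerates the blob data and uses a plain `RandomForestClassifier` with fixed settings instead of the tuned `best_params_`, so the numbers will not match the output above exactly.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Same kind of synthetic data as in the earlier cells
X_b, y_b = make_blobs(n_samples=1000, n_features=2, centers=4,
                      cluster_std=1.0, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_out = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_b, y_b,
    cv=skf,
    scoring=["accuracy", "f1_weighted"],
    return_train_score=True,   # also report scores on the training folds
)

# Mean and standard deviation across the 5 folds
for key in ["train_accuracy", "test_accuracy", "train_f1_weighted", "test_f1_weighted"]:
    print(f"{key:>18}: {np.mean(cv_out[key]):.4f} ± {np.std(cv_out[key]):.4f}")
```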