# Random Forest and Extra Trees as examples of ensemble methods

## Introduction 

Both algorithms are so-called ensemble methods: techniques that combine the predictions of multiple machine learning models to produce a single, usually more accurate, result. The idea is that a group of models (an ensemble) working together often outperforms any individual model.

Random Forests and Extra Trees both use multiple **decision trees** as their base models. In other words, decision trees are the building blocks of these algorithms.

How do they work?
- Instead of building one decision tree, these algorithms build a **forest** of trees and aggregate the results.
- Making Predictions:
  - **Classification Tasks:** Each tree in the ensemble predicts a class label. The final prediction is made by **majority voting**: the class that gets the most votes from all the trees is chosen.
  - **Regression Tasks:** Each tree predicts a numerical value. The final prediction is the **average** of all the tree predictions.
- **Aggregating** the predictions of many trees in this way helps to reduce overfitting and improves generalization to unseen data.
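
The two aggregation rules above can be sketched in a few lines of plain Python. The per-tree predictions are made up for illustration, and `majority_vote` and `average_prediction` are hypothetical helper names, not part of any library.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the label predicted by the most trees (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def average_prediction(values):
    """Return the mean of the per-tree numeric predictions."""
    return mean(values)


# Hypothetical predictions from five trees for a single wine sample
tree_labels = [5, 6, 5, 7, 5]            # classification: one class label per tree
tree_values = [5.2, 5.8, 5.1, 6.4, 5.0]  # regression: one number per tree

forest_label = majority_vote(tree_labels)
forest_value = average_prediction(tree_values)
print(forest_label, round(forest_value, 2))
```

Note that scikit-learn's forest classifiers combine trees by averaging their predicted class probabilities rather than counting hard votes, so this sketch illustrates the idea rather than the library's exact procedure.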

For a better understanding of the algorithms, I recommend watching the following videos from the Normalized Nerd YouTube channel:
- [Decision Trees](https://youtu.be/ZVR2Way4nwQ)
- [Classification Random Forest](https://youtu.be/v6VJ2RO66Ag)
- [Regression Random Forest](https://youtu.be/UhY5vPfQIrA)

Here are some key screenshots from the videos:


![Decision Tree Classification](../images/decision-tree-classification.png)

![Decision Tree Split Selection](../images/decision-tree-classification-split-selection.png)

![Decision Tree Regression Split Selection](../images/decision-tree-regression-split-selection.png)

![Decision Tree Regression](../images/decision-tree-regression-aggregation.png)

![Random Forest Classification](../images/random-forest-classification.png)

Here are some key differences between Random Forest and Extra Trees:

## Random Forests:

**Data Sampling:** Random Forests use a technique called *bootstrap aggregation*, or *bagging*. Each tree in the forest is trained on a bootstrap sample: a random sample of the original data drawn with replacement, typically of the same size as the original dataset. As a result, some data points appear multiple times in a tree's sample, and some do not appear at all.
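
A minimal sketch of drawing one bootstrap sample, assuming a hypothetical training set of 1,000 rows. `bootstrap_indices` is an illustrative helper, not a library function. With sampling with replacement, roughly 63% of the rows (about 1 - 1/e) are expected to appear at least once; the rest are "out of bag" for that tree.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_samples = 1000         # hypothetical training-set size

indices = bootstrap_indices(n_samples, rng)
in_bag = set(indices)                         # rows this tree sees at least once
out_of_bag = set(range(n_samples)) - in_bag   # rows this tree never sees

unique_fraction = len(in_bag) / n_samples
print(f"unique rows in the bootstrap sample: {unique_fraction:.1%}")
```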

**Feature Selection and Splitting:**
- At each node (decision point) in a tree, a random subset of features is selected.
- The algorithm then looks for the best possible split among these features by evaluating all possible thresholds (e.g., for "age," it might consider "age > 30," "age > 35," etc.).
- The split that best separates the data based on a criterion (like Gini impurity or information gain) is chosen.
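
The exhaustive threshold search can be illustrated for a single feature. This is a simplified sketch, not scikit-learn's implementation: it uses Gini impurity, tries the midpoint between each pair of consecutive distinct values, and the toy `alcohol` / `is_good` data is invented for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(values, labels, threshold):
    """Size-weighted impurity of the two children produced by `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Evaluate every midpoint between consecutive distinct values; keep the purest split."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda t: weighted_gini(values, labels, t))


# Hypothetical toy data: alcohol content and a binary "good wine" label
alcohol = [9.0, 9.5, 10.0, 11.0, 12.0, 12.5]
is_good = [0, 0, 0, 1, 1, 1]

threshold = best_threshold(alcohol, is_good)
print(threshold, weighted_gini(alcohol, is_good, threshold))  # 10.5 0.0
```

In a Random Forest this search would be repeated for each feature in the random subset drawn at the node, and the best (feature, threshold) pair overall would be kept.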

## Extremely Randomized Trees (Extra Trees):

**Data Sampling:** By default, Extra Trees use the *entire* original dataset to train each tree, without any bootstrapping. So, every tree sees all the data points.

**Feature Selection and Splitting:**
- At each node, a random subset of features is still selected, just like in Random Forests.
- **However, the key difference is in how splits are decided:**
  - For each of these randomly chosen features, a split point (threshold) is selected **randomly**, not based on the best possible split.
  - Among these randomly generated splits, the one that provides the best separation (according to the same criteria used in Random Forests) is chosen.
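
The randomized split selection described above can be sketched as follows. This is a simplified illustration under stated assumptions, not scikit-learn's code: thresholds are drawn uniformly between a feature's minimum and maximum at the node, Gini impurity is the criterion, and the feature values and the helper name `extra_trees_split` are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(column, labels, threshold):
    """Size-weighted Gini impurity of the children of `value <= threshold`."""
    left = [y for x, y in zip(column, labels) if x <= threshold]
    right = [y for x, y in zip(column, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def extra_trees_split(features, labels, n_candidates, rng):
    """Draw one random threshold per sampled feature and keep the best of them."""
    sampled = rng.sample(sorted(features), n_candidates)  # random feature subset
    best = None
    for name in sampled:
        column = features[name]
        threshold = rng.uniform(min(column), max(column))  # random, not optimised
        score = split_impurity(column, labels, threshold)
        if best is None or score < best[2]:
            best = (name, threshold, score)
    return best


# Hypothetical toy node with three features and a binary label
features = {
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "pH": [3.2, 3.4, 3.3, 3.1, 3.5, 3.3],
    "sulphates": [0.5, 0.6, 0.55, 0.7, 0.8, 0.75],
}
labels = [0, 0, 0, 1, 1, 1]

name, threshold, score = extra_trees_split(features, labels, n_candidates=2, rng=random.Random(0))
print(name, round(threshold, 3), round(score, 3))
```

Because no threshold search is performed, each node is cheaper to build than in a Random Forest, at the cost of individually weaker splits.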


In [None]:
# imports
import kagglehub
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc
from sklearn.utils.class_weight import compute_class_weight
from sklearn.preprocessing import label_binarize
from sklearn import tree
from imblearn.over_sampling import SMOTE

## Hands-on example using the Wine Quality dataset

In this notebook, I will show you how to use Random Forest and Extra Trees for classification tasks using the `scikit-learn` library. We will use the [Wine Quality dataset](https://www.kaggle.com/datasets/yasserh/wine-quality-dataset) for this purpose.

In [None]:
# Download the latest version of the dataset
path = kagglehub.dataset_download("yasserh/wine-quality-dataset")

print("Path to dataset files:", path)

### Information about the dataset from Kaggle:

**Description:**

This dataset is related to red variants of the Portuguese "Vinho Verde" wine. The dataset describes the amount of various chemicals present in wine and their effect on its quality. The dataset can be viewed as a classification or a regression task. The classes are ordered and **not balanced (e.g. there are many more normal wines than excellent or poor ones)**.

A simple yet challenging project: predicting the quality of wine.
The complexity arises because the dataset has few samples and is highly imbalanced.

**This data frame contains the following columns:**

Input variables (based on physicochemical tests):\
1 - fixed acidity\
2 - volatile acidity\
3 - citric acid\
4 - residual sugar\
5 - chlorides\
6 - free sulfur dioxide\
7 - total sulfur dioxide\
8 - density\
9 - pH\
10 - sulphates\
11 - alcohol

Output variable (based on sensory data):\
12 - quality (score between 0 and 10)



### EDA and Data Preprocessing

Let's start by loading the data and performing some exploratory data analysis (EDA) and data preprocessing steps.

In [None]:
!ls -l $path

In [None]:
# Load the dataset and display the first few rows
wine_data_path = path + "/WineQT.csv"
wine_df = pd.read_csv(wine_data_path)
wine_df.head()

In [None]:
# Display summary statistics for the numeric columns
wine_df.describe()

In [None]:
wine_df.info()

In [None]:
# Check the distribution of the 'quality' variable
wine_df.quality.hist(bins=6)
wine_df.quality.value_counts().sort_index()

In [None]:
# Separate features and target variable
X = wine_df.drop(['quality', 'Id'], axis=1)
y = wine_df['quality']

# Stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [None]:
# Training and test set sizes
pd.concat([y_train.value_counts(), y_test.value_counts()], axis=1, keys=['Train', 'Test']).sort_index()

### Initialize, fit and predict using Random Forest and Extra Trees

Next, we will initialize, fit, and predict using Random Forest and Extra Trees classifiers. We will compare the performance of both algorithms.

In [None]:
# Initialize, fit and predict using the classifiers
rf_classifier = RandomForestClassifier(random_state=42)
et_classifier = ExtraTreesClassifier(random_state=42)

rf_classifier.fit(X_train, y_train)
et_classifier.fit(X_train, y_train)

y_pred_rf = rf_classifier.predict(X_test)
y_pred_et = et_classifier.predict(X_test)

In [None]:
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))

In [None]:
cm_rf = confusion_matrix(y_test, y_pred_rf, labels=rf_classifier.classes_)
disp_rf = ConfusionMatrixDisplay(confusion_matrix=cm_rf, display_labels=rf_classifier.classes_)
disp_rf.plot(cmap='Blues')
plt.title("Random Forest Confusion Matrix")

In [None]:
print("Extra Trees Classification Report:")
print(classification_report(y_test, y_pred_et))

In [None]:
cm_et = confusion_matrix(y_test, y_pred_et, labels=et_classifier.classes_)
disp_et = ConfusionMatrixDisplay(confusion_matrix=cm_et, display_labels=et_classifier.classes_)
disp_et.plot(cmap='Blues')
plt.title("Extra Trees Confusion Matrix")

In [None]:
importances_rf = rf_classifier.feature_importances_
indices_rf = np.argsort(importances_rf)[::-1]
features = X.columns

plt.figure(figsize=(10, 6))
plt.title("Random Forest Feature Importances")
sns.barplot(x=importances_rf[indices_rf], y=features[indices_rf], hue=features[indices_rf], legend=False)

In [None]:
classes = sorted(y.unique())
y_test_binarized = label_binarize(y_test, classes=classes)
n_classes = y_test_binarized.shape[1]
y_score_rf = rf_classifier.predict_proba(X_test)
y_score_et = et_classifier.predict_proba(X_test)

In [None]:
# Initialize dictionaries to store ROC curves and AUC scores
fpr_rf = dict()
tpr_rf = dict()
roc_auc_rf = dict()

fpr_et = dict()
tpr_et = dict()
roc_auc_et = dict()

for i in range(n_classes):
    fpr_rf[i], tpr_rf[i], _ = roc_curve(y_test_binarized[:, i], y_score_rf[:, i])
    roc_auc_rf[i] = auc(fpr_rf[i], tpr_rf[i])
    
    fpr_et[i], tpr_et[i], _ = roc_curve(y_test_binarized[:, i], y_score_et[:, i])
    roc_auc_et[i] = auc(fpr_et[i], tpr_et[i])


In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(21, 9))

for i in range(n_classes):
    ax1.plot(fpr_rf[i], tpr_rf[i], lw=2,
             label='Class {0} (area = {1:0.2f})'.format(classes[i], roc_auc_rf[i]))
    ax2.plot(fpr_et[i], tpr_et[i], lw=2,
             label='Class {0} (area = {1:0.2f})'.format(classes[i], roc_auc_et[i]))

ax1.set_title('Random Forest ROC Curves')
ax2.set_title('Extra Trees ROC Curves')

ax1.plot([0, 1], [0, 1], 'k--', lw=2)
ax2.plot([0, 1], [0, 1], 'k--', lw=2)
ax1.legend(loc='lower right')
ax2.legend(loc='lower right')

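# Apply limits and axis labels to both subplots (plt.* calls would only affect the last one)
for ax in (ax1, ax2):
    ax.set_xlim([-0.01, 1.01])
    ax.set_ylim([-0.01, 1.01])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')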


In [None]:
# Extract a single tree
estimator_rf = rf_classifier.estimators_[0]

plt.figure(figsize=(20, 10))
tree.plot_tree(estimator_rf,
               feature_names=features,
               class_names=[str(c) for c in classes],
               filled=True,
               rounded=True)
plt.title("Decision Tree from Random Forest")

## Try some balancing techniques to improve the model performance

Since the dataset is imbalanced and the results of the raw models aren't impressive, we will try some balancing techniques to improve the model performance. First, we will use the `SMOTE` technique to oversample the minority classes. Then, we will try changing the class weights in the models.
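
To give an intuition for what SMOTE does, here is a minimal sketch of its core interpolation step: a synthetic sample is placed at a random position on the line segment between a minority-class point and one of its nearest minority-class neighbours. The helper `smote_like_sample` and the toy points are hypothetical, and the neighbours are assumed to have been found already; the `imblearn` implementation used below handles the neighbour search and the per-class sample counts.

```python
import random


def smote_like_sample(point, neighbours, rng):
    """Interpolate between `point` and one randomly chosen neighbour."""
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [p + gap * (q - p) for p, q in zip(point, neighbour)]


rng = random.Random(0)

# Hypothetical minority-class point and its (already computed) nearest neighbours
point = [1.0, 2.0]
neighbours = [[3.0, 2.0], [1.0, 6.0]]

synthetic = smote_like_sample(point, neighbours, rng)
print(synthetic)
```

Since the synthetic points lie between existing minority samples, they add no genuinely new information; this is one reason to apply oversampling to the training split only, as done here.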

In [None]:
Counter(y_train)

In [None]:
# Desired number of samples per class
sampling_strategy = {
    3: 30,  
    4: 60,  
    5: Counter(y_train)[5],  
    6: Counter(y_train)[6],  
    7: 200, 
    8: 30
}

# Initialize and apply SMOTE to training data
smote = SMOTE(sampling_strategy=sampling_strategy, random_state=42, k_neighbors=4)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Resampled training set class distribution
Counter(y_resampled)

In [None]:
# Initialize, fit and predict using new classifiers
rf_classifier_smote = RandomForestClassifier(random_state=42)
et_classifier_smote = ExtraTreesClassifier(random_state=42)

rf_classifier_smote.fit(X_resampled, y_resampled)
et_classifier_smote.fit(X_resampled, y_resampled)

y_pred_rf_smote = rf_classifier_smote.predict(X_test)
y_pred_et_smote = et_classifier_smote.predict(X_test)

In [None]:
print("Random Forest Classification Report (After SMOTE):")
print(classification_report(y_test, y_pred_rf_smote))

In [None]:
cm_rf_smote = confusion_matrix(y_test, y_pred_rf_smote, labels=rf_classifier_smote.classes_)
disp_rf_smote = ConfusionMatrixDisplay(confusion_matrix=cm_rf_smote, display_labels=rf_classifier_smote.classes_)
disp_rf_smote.plot(cmap='Blues')
plt.title("Random Forest Confusion Matrix (After SMOTE)")


In [None]:
print("Extra Trees Classification Report (After SMOTE):")
print(classification_report(y_test, y_pred_et_smote))

In [None]:
cm_et_smote = confusion_matrix(y_test, y_pred_et_smote, labels=et_classifier_smote.classes_)
disp_et_smote = ConfusionMatrixDisplay(confusion_matrix=cm_et_smote, display_labels=et_classifier_smote.classes_)
disp_et_smote.plot(cmap='Blues')
plt.title("Extra Trees Confusion Matrix (After SMOTE)")

In [None]:
# Compute class weights
classes = np.unique(y_train)
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=classes,
    y=y_train
)
class_weight_dict = dict(zip(classes, class_weights))

class_weight_dict

In [None]:
# Initialize, fit and predict using classifiers with class weights
rf_classifier_weighted = RandomForestClassifier(
    class_weight=class_weight_dict,
    random_state=42
)
et_classifier_weighted = ExtraTreesClassifier(
    class_weight=class_weight_dict,
    random_state=42
)

rf_classifier_weighted.fit(X_train, y_train)
et_classifier_weighted.fit(X_train, y_train)

y_pred_rf_weighted = rf_classifier_weighted.predict(X_test)
y_pred_et_weighted = et_classifier_weighted.predict(X_test)


In [None]:
print("Random Forest Classification Report (With Class Weights):")
print(classification_report(y_test, y_pred_rf_weighted))

In [None]:
cm_rf_weighted = confusion_matrix(y_test, y_pred_rf_weighted, labels=rf_classifier_weighted.classes_)
disp_rf_weighted = ConfusionMatrixDisplay(confusion_matrix=cm_rf_weighted, display_labels=rf_classifier_weighted.classes_)
disp_rf_weighted.plot(cmap='Blues')
plt.title("Random Forest Confusion Matrix (With Class Weights)")


In [None]:
print("Extra Trees Classification Report (With Class Weights):")
print(classification_report(y_test, y_pred_et_weighted))

In [None]:
cm_et_weighted = confusion_matrix(y_test, y_pred_et_weighted, labels=et_classifier_weighted.classes_)
disp_et_weighted = ConfusionMatrixDisplay(confusion_matrix=cm_et_weighted, display_labels=et_classifier_weighted.classes_)
disp_et_weighted.plot(cmap='Blues')
plt.title("Extra Trees Confusion Matrix (With Class Weights)")