In [None]:
# -*- coding: utf-8 -*-
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Bagging algorithms
- Random Forest
- Extra-Trees (Extremely Randomized Trees)
- Bagging Meta-Estimator
- Comparative analysis of the three methods

# Random Forest

A Random Forest is an ensemble learning method predominantly used for classification and regression. It builds multiple decision trees during training and outputs the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random Forests correct for decision trees' habit of overfitting to their training set.

Here's how the Random Forest algorithm typically works:
- **Step 1: Create Bootstrap Samples -** Random Forest starts by creating multiple bootstrap samples from the original dataset. A bootstrap sample is a sample taken with replacement.
- **Step 2: Build Decision Trees -** For each bootstrap sample, it grows a decision tree. When splitting a node during the construction of the tree, the chosen split is no longer the best among all features. Instead, the split that is best among a random subset of features is chosen. This adds a layer of randomness to the model, hence the name.
- **Step 3: Node Splitting -** A common heuristic for a classification problem with p features is to consider about √p (rounded down) features at each split; for regression, p/3 is the classic recommendation (scikit-learn's `RandomForestRegressor` considers all features by default). Restricting the candidate features de-correlates the individual trees in the forest.
- **Step 4: Final Prediction -** Predictions for new data points are made by averaging the predictions of all the individual trees for regression or by majority vote for classification.
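
The four steps above can be sketched by hand on a small synthetic dataset. This is only an illustrative sketch, not how `RandomForestClassifier` is implemented internally: the synthetic data from `make_classification`, the choice of 25 trees, and all variable names (`demo_trees`, `forest_pred`, and so on) are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=16, n_informative=6, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_train = Xd_train.shape[0]
demo_trees = []
unique_fractions = []

for i in range(25):
    # Step 1: draw a bootstrap sample (same size as the training set, with replacement)
    idx = rng.integers(0, n_train, size=n_train)
    unique_fractions.append(np.unique(idx).size / n_train)
    # Steps 2-3: grow a tree that considers only sqrt(p) random features at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(Xd_train[idx], yd_train[idx])
    demo_trees.append(tree)

# Step 4: majority vote over the individual tree predictions
all_votes = np.stack([t.predict(Xd_test) for t in demo_trees]).astype(int)
forest_pred = np.array([np.bincount(col, minlength=2).argmax() for col in all_votes.T])
demo_forest_acc = (forest_pred == yd_test).mean()

print(f"Average share of distinct rows per bootstrap sample: {np.mean(unique_fractions):.3f}")
print(f"Hand-rolled forest accuracy: {demo_forest_acc:.3f}")
```

Each bootstrap sample contains only part of the distinct training rows, which is one source of diversity among the trees; the per-split feature subset is the other.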

Here are some reasons why Random Forest is powerful:
- **Reduces Overfitting:** It handles overfitting by averaging or taking the majority vote of the predictions of individual trees which may have overfitted the data.
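- **Handles Missing Values:** Some implementations can handle missing values directly; in this notebook, rows with missing values are simply dropped during preprocessing.
- **Feature Importance Estimates:** It estimates how much each variable contributes to the classification, which can guide feature selection.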
- **Flexibility:** It can perform both classification and regression tasks.
- **Easy to Use:** It has few hyperparameters to tune and can often work well with the default settings.

### Practical Example of Using Random Forest:

In this practical session, we explore the Random Forest algorithm. Our objectives are to:
1. Understand the basics of the Random Forest algorithm and its strengths.
2. Apply Random Forest to a real-world dataset and evaluate its performance.
3. Contrast the performance of Random Forest with that of a single decision tree to demonstrate the advantages of ensemble learning.
4. Visualize the impact of Random Forest on feature importance and potentially decision boundaries.

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns  # seaborn is used for enhanced visual representation of data
import time
import numpy as np

# Load the dataset from OpenML
data = fetch_openml(name='KDDCup99', version=1, as_frame=True, parser='auto')
df = data.frame

# Preprocessing: encoding categorical variables, handling missing values
df.replace('?', pd.NA, inplace=True)
df.dropna(inplace=True)
label_encoders = {}
for column in df.select_dtypes(include=['category', 'object']).columns:
    label_encoders[column] = LabelEncoder()
    df[column] = label_encoders[column].fit_transform(df[column])

# Prepare data for training and testing
X = df.drop('label', axis=1)  # Feature matrix
y = df['label'].astype(int)   # Target variable

# Split data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize a Decision Tree Classifier for a simple baseline comparison
single_tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
single_tree_clf.fit(X_train, y_train)
y_tree_pred = single_tree_clf.predict(X_test)
tree_accuracy = accuracy_score(y_test, y_tree_pred) * 100

# Initialize the Random Forest Classifier with default settings for quick comparison
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest Classifier and measure the training time
start_time = time.time()
rf_clf.fit(X_train, y_train)
rf_train_time = time.time() - start_time
print(f"Random Forest Training Time: {rf_train_time:.3f} seconds")

# Predict with the Random Forest Classifier and measure the prediction time
start_time = time.time()
y_rf_pred = rf_clf.predict(X_test)
rf_predict_time = time.time() - start_time
print(f"Random Forest Prediction Time: {rf_predict_time:.3f} seconds")

# Accuracy evaluation for the Random Forest Classifier
rf_accuracy = accuracy_score(y_test, y_rf_pred)*100

# Print accuracy scores for both the simple decision tree and the random forest
print(f"Single Decision Tree Accuracy: {tree_accuracy:.2f}%")
print(f"Random Forest Model Accuracy: {rf_accuracy:.2f}%")

# Generate and print classification reports
tree_report = classification_report(y_test, y_tree_pred, zero_division=1)
rf_report = classification_report(y_test, y_rf_pred, zero_division=1)
print("Single Decision Tree Classification Report:")
print(tree_report)
print("Random Forest Model Classification Report:")
print(rf_report)

# Visualization 1: Comparing Model Accuracies
# This bar chart makes it easier to compare the accuracy of the two models side by side.
plt.figure(figsize=(5, 3))
plt.barh(['Single Decision Tree', 'Random Forest'], [tree_accuracy, rf_accuracy], color=['orange', 'forestgreen'])
plt.xlabel('Accuracy (%)')
plt.title('Model Accuracy Comparison')
plt.xlim(0, 100)  # Set the x-axis limits from 0 to 100 for percentages
plt.xticks(np.arange(0, 101, 10))  # Set the ticks to be in increments of 10%
plt.show()

# Visualization 2: Confusion Matrix Heatmap
# The confusion matrix is an important metric to understand the performance of classification models. It shows the number of correct and incorrect predictions made by the model, divided by each class.
conf_matrix = confusion_matrix(y_test, y_rf_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix for Random Forest')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# After fitting the RandomForest model, extract the feature importances
importances = rf_clf.feature_importances_
indices = np.argsort(importances)[::-1]

# Visualization 3: Plotting the feature importances provides insight into which features the model finds most important
plt.figure(figsize=(10, 4))
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices],
        color="r", align="center")
plt.xticks(range(X_train.shape[1]), X_train.columns[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.show()

# Conclusion and interpretation of results
print("\nConclusion and Interpretation:")
print("Comparing the two models, the Random Forest Classifier typically outperforms a single decision tree, benefiting from the ensemble approach.")
print("The confusion matrix is particularly useful for visualizing the performance of the classifier across different classes, helping to identify where the classifier is making mistakes.\n")

# Extra-Trees (Extremely Randomized Trees)

Extra-Trees is an ensemble learning method similar to Random Forest, but it introduces more randomness into the construction of the individual trees. While Random Forest uses bootstrapping and searches for the best split among a random subset of features at each node, Extra-Trees by default trains each tree on the entire original sample and draws the split point for each candidate feature at random instead of optimizing it.

Here’s how the Extra-Trees algorithm typically works:
- **Step 1: Original Sample -** Unlike Random Forest, which creates bootstrap samples, Extra-Trees uses the entire original dataset to build each tree. This means that each tree in the Extra-Trees ensemble uses the full dataset rather than a bootstrap sample.
- **Step 2: Random Splits -** At every node, a random split point is drawn for each candidate feature, and the best of these random splits is used, rather than searching for the optimal threshold as in Random Forest. This increases the diversity among the trees at the cost of a slight increase in bias.
- **Step 3: Building the Trees -** Extra-Trees builds multiple decision trees with the aforementioned method of random splits. No pruning is typically done, meaning the trees are grown to their maximum depth.
- **Step 4: Averaging/Majority Voting -** Similar to Random Forest, for regression problems, the final prediction is the average of the predictions of all the individual trees. For classification, it is the majority vote.
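
The random-split rule from Step 2 can be sketched for a single node with NumPy alone. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the helper names `gini`, `weighted_gini`, and `extra_trees_split` are hypothetical, Gini impurity is assumed as the split criterion, and the toy data is generated on the spot.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array (0.0 for an empty or pure node)."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p ** 2)

def weighted_gini(x_col, labels, threshold):
    """Size-weighted impurity of the two children produced by a threshold."""
    left = labels[x_col <= threshold]
    right = labels[x_col > threshold]
    return (left.size * gini(left) + right.size * gini(right)) / labels.size

def extra_trees_split(X_node, y_node, n_candidates, rng):
    """Draw one random threshold per candidate feature and keep the best of them."""
    features = rng.choice(X_node.shape[1], size=n_candidates, replace=False)
    best = None
    for f in features:
        lo, hi = X_node[:, f].min(), X_node[:, f].max()
        if lo == hi:
            continue  # constant feature, cannot be split
        threshold = rng.uniform(lo, hi)  # random cut-point, no search for the optimum
        score = weighted_gini(X_node[:, f], y_node, threshold)
        if best is None or score < best[2]:
            best = (int(f), float(threshold), float(score))
    return best

# Toy node: 200 samples, 4 features, label determined by the sign of feature 2
split_rng = np.random.default_rng(0)
X_node = split_rng.normal(size=(200, 4))
y_node = (X_node[:, 2] > 0).astype(int)

chosen = extra_trees_split(X_node, y_node, n_candidates=2, rng=split_rng)
print("feature, threshold, weighted Gini:", chosen)
print("parent Gini:", gini(y_node))
```

Because no threshold search is performed, each node costs only one impurity evaluation per candidate feature, which is where the training-speed advantage comes from.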

Reasons why Extra-Trees can be powerful:
- **Increased Randomness:** By randomizing the cut-points and using the whole dataset, Extra-Trees can create a more diverse ensemble, which can increase the accuracy for some datasets.
- **Computational Efficiency:** Since it selects splits randomly, it can be faster to train compared to models that need to find the optimal splits.
- **Reduction of Variance:** Like Random Forest, it helps in reducing variance by averaging the results, which can help with overfitted decision trees.
- **No Need for Bootstrap Samples:** This can lead to a more efficient use of the data and sometimes better performance because the full dataset is used.
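
In scikit-learn, the difference in sampling shows up in the default value of the `bootstrap` parameter of the two estimators. The short check below reads those defaults; the variable names are arbitrary, and the defaults could differ in a library version other than the one you have installed.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Read the default hyperparameters without fitting anything
rf_defaults = RandomForestClassifier().get_params()
et_defaults = ExtraTreesClassifier().get_params()

print("RandomForestClassifier bootstrap default:", rf_defaults["bootstrap"])
print("ExtraTreesClassifier bootstrap default:", et_defaults["bootstrap"])
```

Passing `bootstrap=True` to `ExtraTreesClassifier` turns bootstrap sampling back on if that behaviour is wanted.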

### Practical Example of Using Extra-Trees:
In practice, we can apply Extra-Trees to the same dataset used for the Random Forest to compare their performances later. By evaluating metrics like accuracy and feature importance, we can see how Extra-Trees performs in terms of prediction and understanding which features it deems most significant.

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns  # seaborn is used for enhanced visual representation of data
import time
import numpy as np

# Load the dataset from OpenML
data = fetch_openml(name='KDDCup99', version=1, as_frame=True, parser='auto')
df = data.frame

# Preprocessing: encoding categorical variables, handling missing values
df.replace('?', pd.NA, inplace=True)
df.dropna(inplace=True)
label_encoders = {}
for column in df.select_dtypes(include=['category', 'object']).columns:
    label_encoders[column] = LabelEncoder()
    df[column] = label_encoders[column].fit_transform(df[column])

# Prepare data for training and testing
X = df.drop('label', axis=1)  # Feature matrix
y = df['label'].astype(int)   # Target variable

# Split data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize a Decision Tree Classifier for a simple baseline comparison
single_tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
single_tree_clf.fit(X_train, y_train)
y_tree_pred = single_tree_clf.predict(X_test)
tree_accuracy = accuracy_score(y_test, y_tree_pred) * 100

# Initialize the Extra Trees Classifier with default settings for quick comparison
et_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)

# Train the Extra Trees Classifier and measure the training time
start_time = time.time()
et_clf.fit(X_train, y_train)
et_train_time = time.time() - start_time
print(f"Extra Trees Training Time: {et_train_time:.3f} seconds")

# Predict with the Extra Trees Classifier and measure the prediction time
start_time = time.time()
y_et_pred = et_clf.predict(X_test)
et_predict_time = time.time() - start_time
print(f"Extra Trees Prediction Time: {et_predict_time:.3f} seconds")

# Accuracy evaluation for the Extra Trees Classifier
et_accuracy = accuracy_score(y_test, y_et_pred) * 100

# Print accuracy scores for both the simple decision tree and the extra trees
print(f"Single Decision Tree Accuracy: {tree_accuracy:.2f}%")
print(f"Extra Trees Model Accuracy: {et_accuracy:.2f}%")

# Generate and print classification reports
tree_report = classification_report(y_test, y_tree_pred, zero_division=1)
et_report = classification_report(y_test, y_et_pred, zero_division=1)
print("Single Decision Tree Classification Report:")
print(tree_report)
print("Extra Trees Model Classification Report:")
print(et_report)

# Visualization 1: Comparing Model Accuracies
plt.figure(figsize=(5, 3))
plt.barh(['Single Decision Tree', 'Extra Trees'], [tree_accuracy, et_accuracy], color=['orange', 'forestgreen'])
plt.xlabel('Accuracy (%)')
plt.title('Model Accuracy Comparison')
plt.xlim(0, 100)  # Set the x-axis limits from 0 to 100 for percentages
plt.xticks(np.arange(0, 101, 10))  # Set the ticks to be in increments of 10%
plt.show()

# Visualization 2: Confusion Matrix Heatmap
conf_matrix = confusion_matrix(y_test, y_et_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix for Extra Trees')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# After fitting the Extra Trees model, extract the feature importances
importances = et_clf.feature_importances_
indices = np.argsort(importances)[::-1]

# Visualization 3: Plotting the feature importances provides insight into which features the model finds most important
plt.figure(figsize=(10, 3))
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices],
        color="r", align="center")
plt.xticks(range(X_train.shape[1]), X_train.columns[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.show()

# Conclusion and interpretation of results
print("\nConclusion and Interpretation:")
print("Comparing the two models, the Extra Trees Classifier typically outperforms a single decision tree; how it compares with Random Forest depends on the dataset and parameter settings.")
print("The confusion matrix and classification reports for both models provide insight into their performance across different classes.\n")

# Bagging Meta-Estimator

A Bagging Meta-Estimator, available in scikit-learn as `BaggingClassifier` for classification and `BaggingRegressor` for regression, applies bagging (Bootstrap Aggregating) to an arbitrary base estimator. It fits copies of the base estimator on random samples drawn from the original dataset and combines their predictions into a final prediction, by voting for classification or by averaging for regression. Training each copy on a different random sample reduces the variance of the combined model.

Here's how the Bagging algorithm typically works:
- **Step 1: Create Bootstrap Samples -** Bagging begins by creating multiple bootstrap samples from the original dataset. A bootstrap sample is a random sample of the dataset with replacement.
- **Step 2: Train Base Estimators -** For each bootstrap sample, a base estimator (like a decision tree) is trained. Each instance of the model learns from a subset of the data.
- **Step 3: Parallel Training -** Unlike Random Forest, which introduces randomness by selecting a subset of features at each split, Bagging by default uses all features for each model. Because the models are independent, they can be trained in parallel.
- **Step 4: Aggregation of Predictions -** Predictions for new data points are made by averaging the predictions (for regression) or by majority vote or averaging probabilities (for classification) from all individual base estimators.

Here are some reasons why Bagging is powerful:
- **Reduces Overfitting:** By averaging the results of individual estimators, it reduces the chance of overfitting.
- **Flexibility:** It can be used with many different types of predictive models, not just decision trees.
- **Parallelizable:** Each model can be trained in parallel since they are independent of one another.
- **Variance Reduction:** It is effective in reducing the variance of a prediction model, especially if the base estimator is a high-variance, low-bias machine learning algorithm.
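
To illustrate the flexibility point, the sketch below bags a k-nearest-neighbours classifier instead of a decision tree. The synthetic dataset, the choice of `KNeighborsClassifier` with 5 neighbours, the 10 estimators, and the variable names are all assumptions made for the example; the base estimator is passed positionally so the call does not depend on the keyword name used by a particular scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration only)
Xb, yb = make_classification(n_samples=600, n_features=10, random_state=0)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    Xb, yb, test_size=0.3, random_state=0)

# Bag 10 k-NN models, each fitted on its own bootstrap sample of the training data
knn_bag = BaggingClassifier(
    KNeighborsClassifier(n_neighbors=5), n_estimators=10, random_state=0)
knn_bag.fit(Xb_train, yb_train)

knn_bag_pred = knn_bag.predict(Xb_test)
knn_bag_acc = (knn_bag_pred == yb_test).mean()
print(f"Bagged k-NN accuracy on synthetic data: {knn_bag_acc:.3f}")
print("Fitted base estimators:", len(knn_bag.estimators_))
```

Since the fitted models are independent, setting the `n_jobs` parameter of `BaggingClassifier` lets scikit-learn train them in parallel.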

### Practical Example of Using Bagging meta-estimator

In this practical session, we explore the Bagging meta-estimator. Our objectives are to:
- Understand the basics of the Bagging algorithm and its strengths.
- Apply Bagging to a real-world dataset and evaluate its performance.
- Compare the performance of the Bagging Meta-Estimator with that of a single base estimator.
- Visualize the stability of Bagging in terms of prediction accuracy.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns
import time

# Load the dataset from OpenML
data = fetch_openml(name='KDDCup99', version=1, as_frame=True, parser='auto')
df = data.frame

# Preprocessing: encoding categorical variables, handling missing values
# Replace missing values with NaN and then drop any rows with missing data.
df.replace('?', pd.NA, inplace=True)
df.dropna(inplace=True)

# Initialize a dictionary to keep track of label encoders for each categorical column
label_encoders = {}
for column in df.select_dtypes(include=['category', 'object']).columns:
    label_encoders[column] = LabelEncoder()
    df[column] = label_encoders[column].fit_transform(df[column])

# Prepare data for training and testing by splitting into features and target
X = df.drop('label', axis=1)  # Feature matrix
y = df['label'].astype(int)   # Target variable

# Split data into training (70%) and testing (30%) sets with a fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize a single Decision Tree Classifier for a baseline comparison
single_tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
single_tree_clf.fit(X_train, y_train)
y_tree_pred = single_tree_clf.predict(X_test)
tree_accuracy = accuracy_score(y_test, y_tree_pred) * 100

# Initialize the Bagging Meta-Estimator with 100 Decision Trees
bag_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)

# Train the Bagging Meta-Estimator and measure the training time
start_time = time.time()
bag_clf.fit(X_train, y_train)
bag_train_time = time.time() - start_time
print(f"Bagging Meta-Estimator Training Time: {bag_train_time:.3f} seconds")

# Predict with the Bagging Meta-Estimator and measure the prediction time
start_time = time.time()
y_bag_pred = bag_clf.predict(X_test)
bag_predict_time = time.time() - start_time
print(f"Bagging Meta-Estimator Prediction Time: {bag_predict_time:.3f} seconds")

# Evaluate the accuracy of the Bagging Meta-Estimator
bag_accuracy = accuracy_score(y_test, y_bag_pred) *100

# Print accuracy scores for both the baseline decision tree and the bagging meta-estimator
print(f"Single Decision Tree Accuracy: {tree_accuracy:.2f}%")
print(f"Bagging Meta-Estimator Model Accuracy: {bag_accuracy:.2f}%")

# Generate and print classification reports for both models
tree_report = classification_report(y_test, y_tree_pred, zero_division=1)
bag_report = classification_report(y_test, y_bag_pred, zero_division=1)
print("Single Decision Tree Classification Report:")
print(tree_report)
print("Bagging Meta-Estimator Model Classification Report:")
print(bag_report)

# Visualization 1: Comparing Model Accuracies with a bar chart
plt.figure(figsize=(5, 3))
plt.barh(['Single Decision Tree', 'Bagging Meta-Estimator'], [tree_accuracy, bag_accuracy], color=['orange', 'royalblue'])
plt.xlabel('Accuracy (%)')
plt.title('Model Accuracy Comparison')
plt.xlim(0, 100)  # Set the x-axis limits from 0 to 100 for percentages
plt.xticks(np.arange(0, 101, 10))  # Set the ticks to be in increments of 10%
plt.show()

# Visualization 2: Confusion Matrix Heatmap for the Bagging Meta-Estimator
# This gives us a visual understanding of the true positives, false positives, true negatives, and false negatives.
conf_matrix_bag = confusion_matrix(y_test, y_bag_pred)
plt.figure(figsize=(5, 4))
sns.heatmap(conf_matrix_bag, annot=True, fmt='d', cmap='Purples')
plt.title('Confusion Matrix for Bagging Meta-Estimator')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# Conclusion and interpretation of results
# This summarizes the findings from the comparison between a single decision tree and the bagging meta-estimator.
print("\nConclusion and Interpretation:")
print("The Bagging Meta-Estimator benefits from the ensemble approach, often outperforming a single decision tree.")
print("By comparing training and prediction times as well as accuracy, one can understand the trade-offs involved in using a bagging ensemble.\n")

# Comparative Analysis

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Collect the metrics measured in the cells above (those cells must be run first)
results = {
    'Model': ['Random Forest', 'Extra-Trees', 'Bagging Meta-Estimator'],
    'Accuracy': [rf_accuracy, et_accuracy, bag_accuracy],
    'Training Time': [rf_train_time, et_train_time, bag_train_time],
    'Prediction Time': [rf_predict_time, et_predict_time, bag_predict_time],
}

results_df = pd.DataFrame(results)
print(results_df)

# Visualizing the accuracy of each model
plt.figure(figsize=(5, 3))
plt.bar(results_df['Model'], results_df['Accuracy'], color=['blue', 'green', 'red'])
plt.xlabel('Model')
plt.ylabel('Accuracy (%)')
plt.title('Comparison of Model Accuracies')
plt.yticks(np.arange(0, 101, 10))  # Set the ticks to be in increments of 10%
plt.ylim(0, 100)  # Accuracy is expressed as a percentage (0-100)
plt.show()

# Visualizing the training time of each model
plt.figure(figsize=(5, 3))
plt.bar(results_df['Model'], results_df['Training Time'], color=['blue', 'green', 'red'])
plt.xlabel('Model')
plt.ylabel('Training Time (seconds)')
plt.title('Comparison of Model Training Times')
plt.show()

# Visualizing the prediction time of each model
plt.figure(figsize=(5, 3))
plt.bar(results_df['Model'], results_df['Prediction Time'], color=['blue', 'green', 'red'])
plt.xlabel('Model')
plt.ylabel('Prediction Time (seconds)')
plt.title('Comparison of Model Prediction Times')
plt.show()

## Final notes

When comparing Random Forest, Extra-Trees, and Bagging Meta-Estimator, consider factors beyond accuracy such as training and prediction times, and other performance characteristics.

**Random Forest** is an ensemble method known for handling large datasets and providing high accuracy. Its trees are independent, so training can be parallelized, although it is typically slower to train than a single decision tree. It is robust, handles complex data patterns well, and provides feature importance metrics that aid model interpretation.

**Extra-Trees** are similar to Random Forest but introduce more randomness in building trees. This leads to faster training times as the search for optimal thresholds is not required. Extra-Trees may yield better generalization and are less sensitive to data noise but might not always be more accurate than Random Forest.

Generally, **Random Forest** may be preferred when accuracy is the priority; for imbalanced datasets, options such as scikit-learn's `class_weight` parameter usually still need to be set explicitly. **Extra-Trees** might be a better choice for faster training when some loss in precision is acceptable. **Bagging Meta-Estimator** is effective when the base model is prone to overfitting and the goal is to decrease its variance.