# **Ensemble Learning**

1 Can we use Bagging for regression problems
- Yes. Bagging (Bootstrap Aggregating) is a general ensemble technique that reduces variance by training multiple models on different bootstrap samples of the data. It works for both:

  - Classification (predictions are combined by majority vote)

  - Regression (predictions are combined by averaging)

2 What is the difference between multiple model training and single model training
- Single Model Training:
Single model training builds one machine learning model on the training data.
That model learns all the patterns and makes predictions on its own.

- Multiple Model Training:
Multiple model training builds several models (on different samples, different feature subsets, or with different algorithms) and combines their predictions.

3 Explain the concept of feature randomness in Random Forest?
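- Feature randomness in a Random Forest means that each decision tree considers only a random subset of the features when splitting a node, rather than all of them. This makes the trees less correlated with one another, so combining them reduces variance more effectively.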
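
As a rough illustration, scikit-learn exposes this through the `max_features` parameter. The sketch below assumes the Breast Cancer dataset (30 features) and a small forest of 50 trees; both choices are only for demonstration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]

# max_features="sqrt": each split considers a random subset of about sqrt(n_features) features
rf_sqrt = RandomForestClassifier(
    n_estimators=50, max_features="sqrt", random_state=42
).fit(X, y)

# max_features=None: every split considers all features (no feature randomness)
rf_all = RandomForestClassifier(
    n_estimators=50, max_features=None, random_state=42
).fit(X, y)

# max_features_ on a fitted tree is the resolved number of candidate features per split
print("Total features:", n_features)
print("Features per split (sqrt):", rf_sqrt.estimators_[0].max_features_)
print("Features per split (None):", rf_all.estimators_[0].max_features_)
```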

4  What is OOB (Out-of-Bag) Score?
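- OOB (Out-of-Bag) Score is a built-in validation score used in Bagging and Random Forest models.
For each training sample, predictions are aggregated only from the trees whose bootstrap sample did not include it (on average about 37% of the training data is left out of each bootstrap sample), so the score estimates generalization without a separate validation set.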

5 How can you measure the importance of features in a Random Forest model
- Gini Importance (MDI): Based on total impurity reduction contributed by each feature across all trees.

- Permutation Importance (MDA): Measures drop in model accuracy when a feature is randomly permuted.
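
The two measures can be compared side by side. This is a minimal sketch assuming the Breast Cancer dataset and scikit-learn's `permutation_importance` helper from `sklearn.inspection`; the value `n_repeats=5` is an arbitrary choice for speed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Gini importance (MDI): computed from the training data while the trees are built
mdi = rf.feature_importances_

# Permutation importance (MDA): drop in test score when one feature column is shuffled
perm = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=42)

# Indices of the five highest-ranked features under each measure
top_mdi = mdi.argsort()[::-1][:5]
top_perm = perm.importances_mean.argsort()[::-1][:5]

print("Top 5 by MDI:        ", list(data.feature_names[top_mdi]))
print("Top 5 by permutation:", list(data.feature_names[top_perm]))
```

The two rankings do not have to agree: MDI is measured on the training data, while permutation importance here is measured on the held-out test set.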

6  Explain the working principle of a Bagging Classifier
- A Bagging Classifier (Bootstrap Aggregating) improves model performance by training multiple versions of the same model on different random samples of the data and combining their predictions.
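
The principle can be sketched by hand, without `BaggingClassifier`. The example below is only an illustration under assumed settings (Iris data, 25 decision trees); it is not how scikit-learn implements bagging internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_models = 25  # assumed ensemble size for this sketch
all_preds = []

for _ in range(n_models):
    # Step 1: draw a bootstrap sample (same size as the training set, with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train one base model on that sample
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

all_preds = np.array(all_preds)  # shape: (n_models, n_test_samples)

# Step 3: combine predictions by majority vote for each test sample
ensemble_pred = np.array(
    [np.bincount(votes, minlength=3).argmax() for votes in all_preds.T]
)

manual_bagging_acc = (ensemble_pred == y_test).mean()
print("Manual bagging accuracy:", manual_bagging_acc)
```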

7  How do you evaluate a Bagging Classifier’s performance
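- A Bagging Classifier is evaluated with the same metrics as any other classifier, such as accuracy, precision, recall, F1-score, and ROC-AUC.
These are measured on data the ensemble has not seen during training: a held-out test set, cross-validation folds, or the out-of-bag samples.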

8 How does a Bagging Regressor work
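- A Bagging Regressor applies Bagging (Bootstrap Aggregating) to regression tasks: it trains several base regressors on different bootstrap samples of the training data and averages their predictions, which reduces variance and usually improves prediction accuracy.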

9 What is the main advantage of ensemble techniques
- The main advantage of ensemble techniques is that they combine multiple models to achieve higher accuracy, better stability, and improved generalization compared to individual models.

10 What is the main challenge of ensemble methods
- The main challenge of ensemble methods is their increased complexity, which makes them computationally expensive, harder to interpret, and more difficult to train and deploy compared to single models.

11 Explain the key idea behind ensemble technique
- The key idea behind ensemble techniques is to combine multiple models to produce a stronger, more accurate, and more stable final prediction than any single model could achieve.

12 What is a Random Forest Classifier
- A Random Forest Classifier is an ensemble learning algorithm that builds multiple decision trees and combines their predictions using majority voting to perform classification tasks.

13 What are the main types of ensemble techniques
- The main types of ensemble techniques are Bagging (to reduce variance), Boosting (to reduce bias), and Stacking (to combine diverse models using a meta-learner).

14 What is ensemble learning in machine learning
- Ensemble learning is a technique in machine learning where multiple models are combined to make a final prediction that is more accurate, stable, and reliable than any single model.

15 When should we avoid using ensemble methods
- Avoid ensemble methods when interpretability is required, when computational resources are limited, when the dataset is very small, or when a simple model already performs well.

16 How does Bagging help in reducing overfitting
- Bagging (Bootstrap Aggregating) reduces overfitting by training multiple models on different random samples of the data and averaging their predictions, which smooths out noise and reduces variance.
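
The variance-reduction effect of averaging can be illustrated with a small simulation. The sketch assumes an idealized case in which 50 model predictions are independent, each equal to a true value of 3.0 plus unit-variance noise; real bagged models are correlated, so the reduction in practice is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_models = 10_000, 50  # assumed sizes for the simulation

# Each row is one trial; each column is one model's noisy prediction of the value 3.0
noisy_preds = 3.0 + rng.normal(0.0, 1.0, size=(n_trials, n_models))

# Variance of a single model's prediction across trials
single_var = noisy_preds[:, 0].var()

# Variance of the averaged prediction across trials
averaged_var = noisy_preds.mean(axis=1).var()

print(f"Variance of one model:        {single_var:.4f}")
print(f"Variance of 50-model average: {averaged_var:.4f}")
```

For independent predictions the variance of the average is the single-model variance divided by the number of models, which is why the second figure is far smaller than the first.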

17 Why is Random Forest better than a single Decision Tree
- A Random Forest is better because it combines many decision trees to create a more accurate, stable, and generalizable model, while a single decision tree is prone to overfitting and instability.

18 What is the role of bootstrap sampling in Bagging
- Bootstrap sampling is the core mechanism that creates diversity among the models in Bagging.
It involves drawing random samples with replacement from the training data to build multiple different datasets for training.
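
A quick way to see what a bootstrap sample looks like is to draw one over row indices. The sketch assumes a training set of 10,000 rows; the sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000  # assumed training-set size

# Draw n_samples row indices with replacement: some rows repeat, others never appear
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Fraction of distinct rows that made it into the sample, and the fraction left out
in_bag_fraction = np.unique(bootstrap_idx).size / n_samples
oob_fraction = 1.0 - in_bag_fraction

print(f"In-bag fraction:     {in_bag_fraction:.3f}")
print(f"Out-of-bag fraction: {oob_fraction:.3f}")
```

For large datasets the in-bag fraction approaches 1 - 1/e (about 63%), so each model sees a different subset of roughly two thirds of the rows.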

19 What are some real-world applications of ensemble techniques
- Ensemble methods are widely used because they improve accuracy, reduce errors, and provide stable results. Common applications include fraud detection, credit scoring, medical diagnosis, spam filtering, and recommendation systems.

20 What is the difference between Bagging and Boosting
- Bagging (Bootstrap Aggregating): an ensemble technique where multiple models (usually of the same type) are trained independently on different bootstrapped samples of the training data, and their predictions are averaged (regression) or voted (classification) to reduce variance and prevent overfitting.

- Boosting: an ensemble technique where multiple weak models are trained sequentially, and each new model focuses on correcting the errors made by the previous ones. Boosting reduces bias and creates a strong final model by giving more weight to difficult samples.
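
The contrast can be tried out with scikit-learn. This sketch uses `AdaBoostClassifier` as one representative boosting algorithm (an assumption; other boosting methods exist) and the Breast Cancer dataset, with both ensembles left at their default base estimators.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Bagging: fully grown trees trained independently on bootstrap samples, then voted
bagging = BaggingClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# Boosting: shallow trees trained one after another, each reweighting the hard samples
boosting = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

bagging_acc = bagging.score(X_test, y_test)
boosting_acc = boosting.score(X_test, y_test)

print("Bagging accuracy: ", bagging_acc)
print("Boosting accuracy:", boosting_acc)
```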

Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load sample dataset
data = load_iris()
X = data.data
y = data.target

# Split into train & test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Bagging Classifier with Decision Trees as base estimator
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)

# Train the model
bagging_model.fit(X_train, y_train)

# Make predictions
y_pred = bagging_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)


Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error (MSE)

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load sample regression dataset (downloaded and cached on first use)
data = fetch_california_housing()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Bagging Regressor using Decision Trees
bagging_regressor = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)

# Train the model
bagging_regressor.fit(X_train, y_train)

# Predict
y_pred = bagging_regressor.predict(X_test)

# Calculate MSE
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)


Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Random Forest model
rf = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_

# Put into a DataFrame for clean output
feature_importance_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Print Feature Importance
print(feature_importance_df)


Train a Random Forest Regressor and compare its performance with a single Decision Tree

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load dataset (downloaded and cached on first use)
data = fetch_california_housing()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -----------------------------
# 1. Train a Single Decision Tree
# -----------------------------
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_mse = mean_squared_error(y_test, dt_pred)

# -----------------------------
# 2. Train a Random Forest Regressor
# -----------------------------
rf = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

# -----------------------------
# Print results
# -----------------------------
print("Decision Tree MSE:", dt_mse)
print("Random Forest MSE:", rf_mse)


Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split (not required for OOB but used for reference)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create Random Forest with OOB enabled
rf = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,         # Enable Out-of-Bag evaluation
    bootstrap=True,         # Required for OOB
    random_state=42
)

# Train model
rf.fit(X_train, y_train)

# Print OOB Score
print("OOB Score:", rf.oob_score_)


Train a Bagging Classifier using SVM as a base estimator and print accuracy

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Train–test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create Bagging Classifier with SVM base estimator
bagging_svm = BaggingClassifier(
    estimator=SVC(),
    n_estimators=20,
    random_state=42
)

# Train the model
bagging_svm.fit(X_train, y_train)

# Predictions
y_pred = bagging_svm.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Bagging Classifier Accuracy (SVM Base Estimator):", accuracy)


Train a Random Forest Classifier with different numbers of trees and compare accuracy

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train–test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# List of different number of trees
n_trees = [10, 50, 100, 200, 300]

# Compare accuracy
print("Number of Trees  |  Accuracy")
print("-----------------------------")

for n in n_trees:
    rf = RandomForestClassifier(
        n_estimators=n,
        random_state=42
    )
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"{n:<17} |  {acc:.4f}")


Train a Bagging Classifier using Logistic Regression as a base estimator and print AUC score

In [18]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Load binary classification dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train–test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create Bagging Classifier with Logistic Regression
bagging_logreg = BaggingClassifier(
    estimator=LogisticRegression(max_iter=1000), # Increased max_iter to address convergence warning
    n_estimators=20,
    random_state=42
)

# Train the model
bagging_logreg.fit(X_train, y_train)

# Predict probabilities
y_prob = bagging_logreg.predict_proba(X_test)[:, 1]

# AUC Score
auc = roc_auc_score(y_test, y_prob)
print("Bagging Classifier AUC Score (Logistic Regression):", auc)


Bagging Classifier AUC Score (Logistic Regression): 0.9980347199475925




Train a Random Forest Regressor and analyze feature importance scores

In [None]:
from sklearn.datasets import fetch_california_housing # Changed load_boston to fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

# Load dataset
data = fetch_california_housing() # Changed load_boston to fetch_california_housing
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Random Forest Regressor
rf = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_

# Convert to DataFrame for clear display
feature_importance_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Print Feature Importance
print(feature_importance_df)


Train an ensemble model using both Bagging and Random Forest and compare accuracy.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ------------------------------
# 1. Bagging Classifier
# ------------------------------
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(), # Changed base_estimator to estimator
    n_estimators=50,
    random_state=42
)

bagging_model.fit(X_train, y_train)
y_pred_bag = bagging_model.predict(X_test)
bagging_acc = accuracy_score(y_test, y_pred_bag)

# ------------------------------
# 2. Random Forest Classifier
# ------------------------------
rf_model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)

rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
rf_acc = accuracy_score(y_test, y_pred_rf)

# ------------------------------
# Print results
# ------------------------------
print("Bagging Classifier Accuracy:", bagging_acc)
print("Random Forest Accuracy:", rf_acc)

Train a Random Forest Classifier and tune hyperparameters using GridSearchCV

In [17]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define parameter grid for tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Initialize Random Forest model
rf = RandomForestClassifier(random_state=42)

# GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,           # Use all CPU cores
    scoring='accuracy'
)

# Fit GridSearch
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Hyperparameters:")
print(grid_search.best_params_)

# Use best model for prediction
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy after Hyperparameter Tuning:", accuracy)


Best Hyperparameters:
{'bootstrap': True, 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Accuracy after Hyperparameter Tuning: 0.9649122807017544


Train a Bagging Regressor with different numbers of base estimators and compare performance

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load dataset (downloaded and cached on first use)
data = fetch_california_housing()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Different number of estimators to test
n_estimators_list = [10, 20, 50, 100, 200]

print("Estimators  |  MSE")
print("-----------------------")

for n in n_estimators_list:
    model = BaggingRegressor(
        estimator=DecisionTreeRegressor(),
        n_estimators=n,
        random_state=42
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)

    print(f"{n:<11} |  {mse:.4f}")


Train a Random Forest Classifier and analyze misclassified samples

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predictions
y_pred = rf.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Find misclassified samples
misclassified_indices = (y_test != y_pred)

# Create DataFrame for analysis
misclassified_df = pd.DataFrame(
    X_test[misclassified_indices],
    columns=feature_names
)
misclassified_df["Actual"] = y_test[misclassified_indices]
misclassified_df["Predicted"] = y_pred[misclassified_indices]

print("\nMisclassified Samples:")
print(misclassified_df)


Train a Bagging Classifier and compare its performance with a single Decision Tree Classifier

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Train–test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ----------------------------------
# 1. Single Decision Tree Classifier
# ----------------------------------
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_acc = accuracy_score(y_test, dt_pred)

# ----------------------------------
# 2. Bagging Classifier (with Decision Trees)
# ----------------------------------
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=30,
    random_state=42
)

bagging.fit(X_train, y_train)
bag_pred = bagging.predict(X_test)
bag_acc = accuracy_score(y_test, bag_pred)

# ----------------------------------
# Print Comparison
# ----------------------------------
print("Single Decision Tree Accuracy:", dt_acc)
print("Bagging Classifier Accuracy:", bag_acc)


Train a Random Forest Classifier and visualize the confusion matrix

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Train–test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predictions
y_pred = rf.predict(X_test)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

# Plot Confusion Matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=data.target_names)
disp.plot(cmap="Blues")
plt.title("Random Forest Classifier - Confusion Matrix")
plt.show()

# Print accuracy
accuracy = rf.score(X_test, y_test)
print("Accuracy:", accuracy)


Train a Stacking Classifier using Decision Trees, SVM, and Logistic Regression, and compare accuracy

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Base models
dt = DecisionTreeClassifier(random_state=42)
svm = SVC(probability=True, random_state=42)
lr = LogisticRegression(max_iter=1000)

# Stacking Classifier
estimators = [
    ("decision_tree", dt),
    ("svm", svm)
]

stack_model = StackingClassifier(
    estimators=estimators,
    final_estimator=lr
)

# Train models
dt.fit(X_train, y_train)
svm.fit(X_train, y_train)
stack_model.fit(X_train, y_train)

# Predictions
dt_pred = dt.predict(X_test)
svm_pred = svm.predict(X_test)
stack_pred = stack_model.predict(X_test)

# Accuracy comparison
print("Decision Tree Accuracy:", accuracy_score(y_test, dt_pred))
print("SVM Accuracy:", accuracy_score(y_test, svm_pred))
print("Stacking Classifier Accuracy:", accuracy_score(y_test, stack_pred))


Train a Random Forest Classifier and print the top 5 most important features

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X, y)

# Get feature importances
importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
})

# Sort and print top 5
top_5 = feature_importance_df.sort_values(by="Importance", ascending=False).head(5)

print("Top 5 Most Important Features:")
print(top_5)


Train a Bagging Classifier and evaluate performance using Precision, Recall, and F1-score

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import train_test_split

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Bagging Classifier with Decision Tree as base estimator
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)

# Train model
bagging.fit(X_train, y_train)

# Predictions
y_pred = bagging.predict(X_test)

# Evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print("Bagging Classifier Performance:")
print("--------------------------------")
print(f"Accuracy  : {accuracy:.4f}")
print(f"Precision : {precision:.4f}")
print(f"Recall    : {recall:.4f}")
print(f"F1-Score  : {f1:.4f}")


Train a Random Forest Classifier and analyze the effect of max_depth on accuracy

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Different max_depth values to test
depth_values = [None, 2, 4, 6, 8, 10, 15, 20]

print("Effect of max_depth on Random Forest Accuracy")
print("---------------------------------------------")

for depth in depth_values:
    # Create Random Forest with given max_depth
    rf = RandomForestClassifier(
        n_estimators=200,
        max_depth=depth,
        random_state=42
    )

    # Train model
    rf.fit(X_train, y_train)

    # Predictions
    y_pred = rf.predict(X_test)

    # Accuracy
    acc = accuracy_score(y_test, y_pred)

    print(f"max_depth = {depth} --> Accuracy = {acc:.4f}")


Train a Bagging Regressor using different base estimators (DecisionTree and KNeighbors) and compare performance

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Load a regression dataset
data = load_diabetes()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -------------------------------
# 1. Bagging with Decision Tree
# -------------------------------
bag_dt = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)

bag_dt.fit(X_train, y_train)
y_pred_dt = bag_dt.predict(X_test)
mse_dt = mean_squared_error(y_test, y_pred_dt)

# -------------------------------
# 2. Bagging with KNN
# -------------------------------
bag_knn = BaggingRegressor(
    estimator=KNeighborsRegressor(),
    n_estimators=50,
    random_state=42
)

bag_knn.fit(X_train, y_train)
y_pred_knn = bag_knn.predict(X_test)
mse_knn = mean_squared_error(y_test, y_pred_knn)

# -------------------------------
# Compare Performance
# -------------------------------
print("Bagging Regressor Performance Comparison")
print("---------------------------------------")
print(f"Decision Tree + Bagging  | MSE = {mse_dt:.4f}")
print(f"KNN + Bagging            | MSE = {mse_knn:.4f}")


Train a Random Forest Classifier and evaluate its performance using ROC-AUC Score

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Random Forest Classifier
rf = RandomForestClassifier(
    n_estimators=200,
    random_state=42
)

# Train model
rf.fit(X_train, y_train)

# Predict probabilities for ROC-AUC
y_prob = rf.predict_proba(X_test)[:, 1]

# Compute ROC-AUC Score
roc_auc = roc_auc_score(y_test, y_prob)

print("Random Forest Classifier Performance:")
print("-------------------------------------")
print(f"ROC-AUC Score : {roc_auc:.4f}")


Train a Bagging Classifier and evaluate its performance using cross-validation

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Bagging Classifier with Decision Tree base estimator
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)

# Perform 5-Fold Cross-Validation
cv_scores = cross_val_score(bagging, X, y, cv=5, scoring='accuracy')

# Print results
print("Bagging Classifier Cross-Validation Performance")
print("----------------------------------------------")
print(f"Cross-Validation Scores : {cv_scores}")
print(f"Mean Accuracy           : {cv_scores.mean():.4f}")
print(f"Standard Deviation      : {cv_scores.std():.4f}")


Train a Random Forest Classifier and plot the Precision-Recall curve

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc

# 1. Generate Synthetic Data
# Create a binary classification dataset for demonstration
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=0,
                           n_classes=2, n_clusters_per_class=1, random_state=42)

# 2. Split Data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Train Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)

# 4. Calculate Probabilities (for the positive class, index 1)
# Precision-Recall curve requires probability scores
y_scores = rf_model.predict_proba(X_test)[:, 1]

# 5. Calculate Precision-Recall Curve Points and AUC
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
pr_auc = auc(recall, precision)

# 6. Plot the Curve
plt.figure(figsize=(8, 6))
# Plot the curve with the calculated PR-AUC value in the label
plt.plot(recall, precision, label=f'Random Forest (PR-AUC = {pr_auc:.4f})', color='darkorange')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve for Random Forest Classifier')
plt.legend(loc="lower left")
plt.grid(True)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.savefig('precision_recall_curve.png')

Train a Stacking Classifier with Random Forest and Logistic Regression and compare accuracy

In [None]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Generate Synthetic Data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=0,
                           n_classes=2, n_clusters_per_class=1, random_state=42)

# 2. Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Instantiate Base Estimators
rf = RandomForestClassifier(n_estimators=100, random_state=42)
lr = LogisticRegression(random_state=42, solver='liblinear')

# 4. Instantiate Stacking Classifier
estimators = [
    ('rf', rf),
    ('lr', lr)
]
# The final_estimator combines the predictions of the base estimators.
stacking_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(solver='liblinear'), cv=5)

# 5. Train and Evaluate all models
models = {
    'Random Forest': rf,
    'Logistic Regression': lr,
    'Stacking Classifier': stacking_clf
}

accuracy_results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_results[name] = accuracy

# Output the comparison
print("--- Accuracy Comparison ---")
for name, accuracy in accuracy_results.items():
    print(f"{name}: {accuracy:.4f}")

Train a Bagging Regressor with different levels of bootstrap samples and compare performance

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

# 1. Generate Synthetic Data
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Define Estimator Levels
n_estimators_list = [5, 10, 20, 50, 100, 200]
results = []
r2_scores = []

# 3. Train and Evaluate
for n in n_estimators_list:
    bagging_reg = BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=42),
        n_estimators=n,
        random_state=42,
        n_jobs=-1
    )

    bagging_reg.fit(X_train, y_train)
    y_pred = bagging_reg.predict(X_test)

    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)

    results.append({
        'n_estimators': n,
        'R2 Score': r2,
        'Mean Squared Error (MSE)': mse
    })
    r2_scores.append(r2)

# 4. Plot R2 Score vs. n_estimators
plt.figure(figsize=(10, 6))
plt.plot(n_estimators_list, r2_scores, marker='o', linestyle='-', color='indigo')
plt.title('Bagging Regressor $R^2$ Score vs. Number of Estimators')
plt.xlabel('Number of Bootstrap Samples (n_estimators)')
plt.ylabel('$R^2$ Score (Goodness of Fit)')
plt.xticks(n_estimators_list)
plt.grid(True, linestyle='--', alpha=0.7)
plt.savefig('bagging_regressor_performance_plot.png')