# Ensemble Learning | **Assignment**

## Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.


**Answer**: Ensemble Learning is a machine learning technique in which multiple models, often called base learners, are combined to produce a final model that is typically more accurate and robust than any individual model.

The main idea is that while each base model may be weak or err in different ways, combining their predictions through averaging, voting, or more complex methods can reduce overall error. Averaging independently trained models mainly lowers variance, while sequential methods such as boosting mainly lower bias. In essence, collective decisions from diverse models often outperform any single model.
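To make the "collective decision" argument concrete, the sketch below computes the accuracy of a majority vote under an idealized assumption: every base learner is correct independently with the same probability `p`. Real base learners are correlated, so actual gains are smaller. The function name `majority_vote_accuracy` and the value `p = 0.6` are illustrative choices, not part of the assignment.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes every voter is correct with the same probability and that
    their errors are independent (an idealization).
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Odd ensemble sizes avoid ties in a binary vote.
for n in (1, 3, 5, 25, 101):
    print(f"{n:>3} models -> majority-vote accuracy {majority_vote_accuracy(n, 0.6):.4f}")
```

Under these assumptions the accuracy rises with the number of voters, which is the intuition behind voting and averaging ensembles.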

Ensemble strategies commonly include:

* Bagging or Bootstrap Aggregating
* Boosting
* Stacking or Stacked Generalization

## Question 2: What is the difference between Bagging and Boosting?


**Answer**: Bagging and Boosting differ in how the base learners are trained and how their predictions are combined.

1. *Bagging (Bootstrap Aggregating)*

   * Trains multiple independent base learners in parallel on different random subsets of the data, created via bootstrap sampling (sampling with replacement).
   
   * Aims to reduce variance and help prevent overfitting, especially effective for high-variance models like decision trees.
   
   * Aggregates predictions through majority voting (classification) or averaging (regression).
   
   * Example: Random Forest, which adds extra randomness by considering only a random subset of features at each split, further lowering the correlation among trees.


2. *Boosting*

   * Builds models sequentially—each new learner focuses on correcting errors made by previous ones.

   * Targets reducing bias by combining many weak learners into a strong predictive model.

   * Predictions are combined as a weighted sum or weighted vote; in AdaBoost, for example, more accurate learners receive higher weights.

   * Examples include AdaBoost, Gradient Boosting, XGBoost, and CatBoost.
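The "focus on previous errors" step in boosting can be sketched with one round of the classic discrete AdaBoost update. This is a hand-written illustration on five made-up labels in {-1, +1}, not the internals of any particular library; the arrays `y_true` and `y_pred` are hypothetical.

```python
import numpy as np

# Hypothetical labels and the predictions of one weak learner.
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])  # the learner gets index 1 wrong

# Start with uniform sample weights.
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error of the weak learner.
error = weights[y_true != y_pred].sum()

# Learner weight: the lower the error, the larger its say in the final vote.
alpha = 0.5 * np.log((1 - error) / error)

# Increase weights of misclassified samples, decrease the rest, renormalize.
new_weights = weights * np.exp(-alpha * y_true * y_pred)
new_weights /= new_weights.sum()

print("weighted error:", error)
print("learner weight alpha:", round(float(alpha), 4))
print("updated sample weights:", new_weights.round(3))
```

After the update, the single misclassified sample carries half of the total weight, so the next learner is pushed to get it right. Bagging has no analogous step: its learners never see each other's errors.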

## Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?


**Answer**: **Bootstrap sampling** is a statistical resampling technique in which you repeatedly draw samples with replacement from an original dataset. Each drawn sample (called a bootstrap sample) has the same size as the original dataset, but due to replacement, some observations may appear multiple times while others may be omitted.
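A minimal sketch of drawing one bootstrap sample with NumPy is shown below. The ten-element toy dataset and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy dataset: ten observations identified by the values 0..9.
data = np.arange(10)

# Draw len(data) indices with replacement, then index into the data.
indices = rng.integers(0, len(data), size=len(data))
bootstrap_sample = data[indices]

# Observations that were never drawn are left out of this sample.
omitted = np.setdiff1d(data, bootstrap_sample)

print("bootstrap sample:", bootstrap_sample)
print("omitted observations:", omitted)
```

The sample has the same length as the original data, but some values repeat and others are missing, which is exactly what makes each bootstrap sample different.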

*Role of Bootstrap Sampling in Bagging & Random Forest*

1. Creating Diverse Training Subsets:
Bagging (Bootstrap Aggregating) uses bootstrap sampling to generate multiple distinct subsets of the training data. Each subset trains its own independent base learner.

2. Reducing Variance & Overfitting:
High-variance models (like deep decision trees) are prone to overfitting. When many such models are trained on different bootstrap samples and aggregated (by averaging or majority vote), the overall variance drops and overfitting is mitigated.

3. Usage in Random Forests:
Random Forest applies bootstrap sampling to train each decision tree on its own sample of the data. Combined with random feature selection at each split, this decorrelates the trees and further reduces variance, usually at the cost of only a small increase in bias.

## Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?



**Answer**: **Out-of-Bag (OOB) samples** are the data points not included in the bootstrap sample used to train a particular base learner (like a decision tree) in methods such as Bagging or Random Forest. When sampling with replacement, each bootstrap sample contains on average only about 63.2% of the distinct observations in the original dataset, leaving roughly 36.8% as OOB for that tree.
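The 63% / 37% split follows from a short calculation: the chance that a given observation is missed by all n draws is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The sketch below evaluates this for a few dataset sizes; the helper name `expected_in_bag_fraction` is made up for illustration.

```python
import math


def expected_in_bag_fraction(n: int) -> float:
    """Expected fraction of distinct observations in a bootstrap sample of size n."""
    # Probability that one observation is missed by all n draws: (1 - 1/n)^n
    return 1 - (1 - 1 / n) ** n


for n in (10, 100, 1_000, 100_000):
    in_bag = expected_in_bag_fraction(n)
    print(f"n={n:>7}: in-bag {in_bag:.4f}, out-of-bag {1 - in_bag:.4f}")

print("limit as n grows:", round(1 - math.exp(-1), 4))
```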

*How Are OOB Samples Used for Evaluation?*

1. Training and OOB Division:
Each tree is trained on its unique bootstrap sample, and the omitted samples become that tree’s OOB samples.

2. Prediction on OOB Samples:
After training, each tree predicts the outcomes for its own OOB samples. Since these samples were never used in training that tree, they serve as unbiased “unseen” data.

3. Aggregation Across Trees:
Each data point in the dataset is likely OOB for several trees. Its OOB prediction uses only those trees: for classification, their predictions are combined via majority vote; for regression, they are averaged.

4. Compute OOB Score / Error:
The aggregated OOB predictions are compared with actual values to compute:
   * OOB Score (e.g., accuracy for classification)
   * OOB Error (e.g., misclassification rate or MSE)
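In scikit-learn these steps are available through the `oob_score=True` option of the bagging-style estimators. The sketch below uses the Breast Cancer dataset purely as a convenient example; the number of trees and the seed are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each sample using only
# the trees whose bootstrap sample did not contain it.
rf = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,
    oob_score=True,
    random_state=0,
)
rf.fit(X, y)

oob_accuracy = rf.oob_score_   # accuracy for classifiers
oob_error = 1 - oob_accuracy   # misclassification rate

print("OOB accuracy:", round(oob_accuracy, 4))
print("OOB error:", round(oob_error, 4))

# Per-sample class probabilities aggregated over OOB trees only.
print("OOB decision function shape:", rf.oob_decision_function_.shape)
```

Because every sample is scored only by trees that never saw it, the OOB score behaves like a built-in validation estimate and needs no separate hold-out set.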

## Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.


**Answer**:
1. Single Decision Tree -

   * Calculation Method: Feature importance is computed from how much each feature reduces impurity (such as Gini impurity or entropy) across all splits that use it. The score for a feature is its total impurity reduction, weighted by the number of samples reaching each split and then normalized so that all scores sum to 1.

   * Characteristics

     * Easy to interpret—feature contributions can be traced through the tree.

     * However, because a single tree can overfit, its importance scores can be unstable: a small change in the training data may produce a different tree and noticeably different scores.

2. Random Forest -
   * How It's Derived:
      * Importance is first calculated for each feature in every tree (via impurity reduction methods like Gini importance).

      * These values are then averaged across all trees in the forest.

      * Finally, the averaged scores are normalized to sum to 1.

   * Why It’s More Reliable:

      * Each tree is trained on different subsets of data and features, so averaging importance scores yields more stable and robust estimates. A feature overly favored in one tree is less likely to dominate overall.

   * Limitations & Caveats

     * Importance scores from impurity-based methods may be biased toward continuous or high-cardinality features.

     * Correlated features may split credit unevenly, making individual importance scores less clear.
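The contrast between the two can be sketched by fitting a single tree and a forest on the same data and comparing their `feature_importances_`. The dataset, seeds, and the helper `top_features` are illustrative assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)


def top_features(importances, k=3):
    """Return the k (name, score) pairs with the highest importance."""
    order = np.argsort(importances)[::-1][:k]
    return [(str(data.feature_names[i]), round(float(importances[i]), 3)) for i in order]


print("Single tree top 3:", top_features(tree.feature_importances_))
print("Random forest top 3:", top_features(forest.feature_importances_))

# A single tree assigns zero importance to every feature it never splits on;
# averaging over many randomized trees spreads credit more widely.
tree_unused = int(np.sum(tree.feature_importances_ == 0))
forest_unused = int(np.sum(forest.feature_importances_ == 0))
print("Features with zero importance (tree / forest):", tree_unused, "/", forest_unused)
```

Both vectors sum to 1, but the tree typically concentrates its importance on a few features and ignores the rest, while the forest yields a smoother ranking.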

## Question 6: Write a Python program to:

* Load the Breast Cancer dataset using `sklearn.datasets.load_breast_cancer()`
* Train a Random Forest Classifier
* Print the top 5 most important features based on feature importance scores

```python
# Import libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load the data
data = load_breast_cancer()

# Split the data into X and y
X, y = data.data, data.target
feature_names = data.feature_names

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_

feature_importances = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort so that the most important features come first
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

print("Top 5 Most Important Features:")
print(feature_importances.head(5))
```

Output:

```text
Top 5 Most Important Features:
                 Feature  Importance
22       worst perimeter    0.141142
27  worst concave points    0.125184
23            worst area    0.115155
20          worst radius    0.089507
7    mean concave points    0.081823
```


## Question 7: Write a Python program to:

* Train a Bagging Classifier using Decision Trees on the Iris dataset
* Evaluate its accuracy and compare with a single Decision Tree


```python
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
dt_acc = accuracy_score(y_test, y_pred_dt)

# Train a Bagging Classifier using Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)
bagging_acc = accuracy_score(y_test, y_pred_bagging)

# Print results
print("Accuracy of Single Decision Tree:", dt_acc)
print("Accuracy of Bagging Classifier (with Decision Trees):", bagging_acc)
```

Output:

```text
Accuracy of Single Decision Tree: 0.9333333333333333
Accuracy of Bagging Classifier (with Decision Trees): 0.9666666666666667
```


## Question 8: Write a Python program to:

* Train a Random Forest Classifier
* Tune hyperparameters `max_depth` and `n_estimators` using GridSearchCV
* Print the best parameters and final accuracy


```python
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define Random Forest model
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 15]
}

# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1
)

# Fit the model
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Evaluate on test set
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Final Test Accuracy:", accuracy)
```

Output:

```text
Best Parameters: {'max_depth': None, 'n_estimators': 200}
Final Test Accuracy: 0.956140350877193
```


## Question 9: Write a Python program to:

* Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
* Compare their Mean Squared Errors (MSE)

```python
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Load California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Bagging Regressor with Decision Trees
# Note: 50 estimators here vs. 100 in the Random Forest below,
# so the comparison is not strictly like-for-like.
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)
mse_bagging = mean_squared_error(y_test, y_pred_bagging)

# Train Random Forest Regressor
# Note: with the default max_features, every split considers all features,
# so this forest behaves much like bagged trees.
rf = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print results
print("Mean Squared Error (Bagging Regressor):", mse_bagging)
print("Mean Squared Error (Random Forest Regressor):", mse_rf)
```

Output:

```text
Mean Squared Error (Bagging Regressor): 0.2572988359842641
Mean Squared Error (Random Forest Regressor): 0.2553684927247781
```


## Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:

* Choose between Bagging or Boosting
* Handle overfitting
* Select base models
* Evaluate performance using cross-validation
* Justify how ensemble learning improves decision-making in this real-world context.

(Include your Python code and output in the code box below.)


Step-by-step:

1. Bagging vs Boosting

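   * Prefer Bagging (e.g., Random Forest) when the base model overfits, since it reduces variance.

   * Prefer Boosting (e.g., Gradient Boosting) when the base model underfits, since it reduces bias by focusing on hard cases. In practice, compare both with cross-validation.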

2. Handle Overfitting

   * Limit max_depth, use min_samples_leaf, early stopping (Boosting).

   * Monitor the gap between training and validation scores during cross-validation; a large gap signals overfitting.

3. Select Base Models

   * Bagging → full decision trees.

   * Boosting → shallow trees (e.g., depth 1 to 3; depth-1 trees are called stumps).

4. Evaluate with Cross-Validation

   * Use StratifiedKFold for imbalanced default prediction.

   * Metrics: ROC-AUC and Average Precision (PR curve).

5. Why Ensembles Help

   * Combine weak models → better accuracy, lower variance/bias.

   * Improves credit risk prediction → fewer missed defaults, more robust loan approvals.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Simulated imbalanced loan default dataset
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Stratified folds keep the default rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42, class_weight="balanced")
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3, random_state=42)

print("RF CV ROC-AUC:", cross_val_score(rf, X_train, y_train, cv=cv, scoring="roc_auc").mean())
print("GB CV ROC-AUC:", cross_val_score(gb, X_train, y_train, cv=cv, scoring="roc_auc").mean())
```

Output:

```text
RF CV ROC-AUC: 0.9247606808169475
GB CV ROC-AUC: 0.9221359208568722
```
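The overfitting checks from step 2 and the second metric from step 4 can be sketched as follows. This is one possible setup on a smaller simulated dataset, not a tuned model: the early-stopping settings (`validation_fraction=0.1`, `n_iter_no_change=10`) and the upper limit of 500 trees are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Simulated imbalanced data (about 10% positives), smaller than above to run quickly.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Early stopping: hold out 10% of each training set and stop adding trees
# once the held-out score has not improved for 10 consecutive iterations.
gb = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=3,
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=42,
)

# return_train_score=True exposes the train/validation gap for each metric.
results = cross_validate(
    gb, X, y, cv=cv,
    scoring=["roc_auc", "average_precision"],
    return_train_score=True,
)

for metric in ("roc_auc", "average_precision"):
    train_mean = results[f"train_{metric}"].mean()
    valid_mean = results[f"test_{metric}"].mean()
    print(f"{metric}: train={train_mean:.3f} validation={valid_mean:.3f} gap={train_mean - valid_mean:.3f}")

# Number of trees actually kept after early stopping on the full data.
gb.fit(X, y)
print("Trees used after early stopping:", gb.n_estimators_)
```

A large train/validation gap would suggest tightening `max_depth` or lowering the learning rate; Average Precision is reported alongside ROC-AUC because it is more sensitive to performance on the rare default class.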
