Assignment Code: DA-AG-014
# Ensemble Learning | Assignment

**Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.**

**Ans-**
 - Ensemble Learning is a machine learning technique where multiple models (often called base learners, or “weak learners” in boosting) are combined to solve a problem and achieve better performance than any single model.

  **Key Idea Behind Ensemble Learning**

 - “Wisdom of the Crowd” principle → Just like a group of people can make a better decision together than any individual, a group of models can produce more accurate predictions than a single model.

 - The core intuition is that different models make different errors, so when their predictions are combined, the errors of one model tend to be offset by the correct predictions of the others.
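
The "wisdom of the crowd" intuition can be checked with a small calculation. The sketch below assumes a binary task and models whose errors are fully independent, which real models rarely are, so treat the numbers as an upper bound on the benefit. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(p, n_models):
    """Probability that a strict majority of n independent models is correct.

    Assumes every model is right with probability p and that errors are
    independent (an idealization).
    """
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

# Each individual model is only 60% accurate.
for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the vote of 25 models is noticeably more accurate than any one of them. When the models make correlated errors the gain shrinks, which is why ensembles try to keep their members diverse.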

**Question 2: What is the difference between Bagging and Boosting?**

**Ans-**

 - Key Differences: Bagging vs Boosting

| Aspect              | **Bagging**                       | **Boosting**                                |
| ------------------- | --------------------------------- | ------------------------------------------- |
| **Training Style**  | Parallel (independent models)     | Sequential (each model depends on previous) |
| **Goal**            | Reduce **variance**               | Reduce **bias** (and variance)              |
| **Data Sampling**   | Random subsets (with replacement) | Reweighted data (focus on earlier mistakes) |
| **Model Weighting** | Equal weight to each model        | Higher weight to better models              |
| **Overfitting**     | Handles overfitting well          | Can overfit if not tuned carefully          |
| **Examples**        | Random Forest                     | AdaBoost, Gradient Boosting, XGBoost        |
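
To make the "focus on mistakes" and "model weighting" rows concrete, here is a minimal sketch of one reweighting round in the style of discrete AdaBoost. It assumes labels in {-1, +1}, sample weights that sum to 1, and a weighted error strictly between 0 and 1. The function name `adaboost_round` and the toy predictions are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost reweighting step for labels in {-1, +1}.

    Assumes the weights sum to 1 and 0 < weighted error < 1.
    """
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - error) / error)  # vote weight of this learner
    # Correct points (t * p = +1) shrink, misclassified points (t * p = -1) grow.
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]   # hypothetical learner misclassifies the last point
weights = [1 / 5] * 5

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

The misclassified point ends up carrying half of the total weight, so the next learner is pushed to get it right. Bagging has no such step: every model sees an independent bootstrap sample and gets an equal vote.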


**Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

**Ans-**

**Bootstrap Sampling**

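 - Bootstrap sampling draws random samples from a dataset with replacement, so a data point can appear more than once in a sample while others are left out. Each sample is typically the same size as the original dataset.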

**Role in Bagging (e.g., Random Forest)**

   Bagging (Bootstrap Aggregating) relies heavily on bootstrap sampling. Here’s how it works:

**1. Create Bootstrapped Datasets**

 - From the original dataset, generate multiple subsets using bootstrap sampling.

 - Each subset is slightly different from the others.

**2. Train Multiple Models**

 - Train a base learner (like a decision tree) on each bootstrapped dataset independently.

**3. Aggregate Predictions**

 - For classification → majority voting.

 - For regression → average of predictions.

**4. Effect**

 - Reduces variance by averaging many diverse models.

 - Improves stability and accuracy compared to a single model.
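
The steps above can be sketched in plain Python. This is an illustration under simplifying assumptions, not how Random Forest is implemented: the base learner is a one-dimensional threshold rule (`fit_stump`, a made-up helper) instead of a full decision tree, and the dataset is a toy one. For regression, the vote in `bagged_predict` would be replaced by an average.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a 1-D threshold rule: predict 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum(int(x > threshold) != label for x, label in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagged_predict(thresholds, x):
    """Step 3: aggregate the individual predictions by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(x, int(x > 5)) for x in range(11)]   # toy data: label is 1 when x > 5

thresholds = []
for _ in range(25):
    sample = rng.choices(data, k=len(data))   # Step 1: bootstrap sample
    thresholds.append(fit_stump(sample))      # Step 2: train independently

print(bagged_predict(thresholds, 2), bagged_predict(thresholds, 8))
```

Each stump sees a slightly different sample and so learns a slightly different threshold. The vote smooths out those differences, which is the variance reduction described in step 4.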

**Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**

**Ans-**

**Out-of-Bag (OOB) Samples**

 - When we use bootstrap sampling (sampling with replacement) to create subsets for training, not all data points are chosen.

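 - When each bootstrap sample has the same size as the dataset, about 63.2% (1 − 1/e) of the distinct data points end up in it on average; the rest of the sample consists of repeats.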

 - The remaining ~37% of data points are not selected for that sample → these are called Out-of-Bag (OOB) samples.
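
The 63% / 37% split follows from the chance that a given point is missed in all n draws, (1 − 1/n)^n, which approaches 1/e for large n. The short sketch below computes that value and compares it with one simulated bootstrap sample. It assumes the sample size equals the dataset size, and both function names are made up for the example.

```python
import math
import random

def expected_unique_fraction(n):
    """Expected share of distinct points in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n

def simulated_unique_fraction(n, seed=0):
    """Draw one bootstrap sample of indices and measure the distinct share."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

n = 10_000
print(round(expected_unique_fraction(n), 4))
print(round(1 - math.exp(-1), 4))            # limiting value as n grows
print(round(simulated_unique_fraction(n), 4))
```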

 **OOB Score**

 - Idea: Since each tree in a Random Forest is trained on a bootstrapped sample, the OOB samples for that tree can be used as test data.

 - For each data point:

   - Use only the trees where that data point was OOB to predict its label.

   - Compare prediction with the true label.

 - The OOB score is simply the average accuracy (or error) across all such predictions.
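
The procedure can be sketched without any library. As a simplification, the sketch uses a one-dimensional threshold rule as the base learner (`fit_stump`, a hypothetical helper) and a toy dataset; the point is only to show which models are allowed to vote on which data points.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a 1-D threshold rule: predict 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum(int(x > threshold) != label for x, label in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def oob_accuracy(data, n_estimators=50, seed=0):
    """Score each point using only the models that never saw it in training."""
    rng = random.Random(seed)
    n = len(data)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_estimators):
        in_bag = [rng.randrange(n) for _ in range(n)]
        threshold = fit_stump([data[i] for i in in_bag])
        for i in set(range(n)) - set(in_bag):      # OOB points for this model
            x, _ = data[i]
            votes[i][int(x > threshold)] += 1
    # Skip points that happened to be in every bag (they have no OOB votes).
    scored = [(votes[i].most_common(1)[0][0], data[i][1])
              for i in range(n) if votes[i]]
    return sum(pred == label for pred, label in scored) / len(scored)

data = [(x, int(x >= 20)) for x in range(40)]   # toy data
print(round(oob_accuracy(data), 3))
```

In scikit-learn the same idea is available by passing `oob_score=True` to `RandomForestClassifier` and reading the fitted model's `oob_score_` attribute, so no separate validation split is needed for a first estimate of generalization.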

**Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**


**Ans-**

**Key Differences**

| Aspect               | **Decision Tree**                                         | **Random Forest**                              |
| -------------------- | --------------------------------------------------------- | ---------------------------------------------- |
| **Basis**            | Reduction in impurity from splits                         | Average reduction in impurity across all trees |
| **Stability**        | Unstable, can vary with small data changes                | Stable, more robust due to averaging           |
| **Bias**             | May overemphasize features with many levels (categorical) | Averaging steadies the scores, but impurity-based importance still favors high-cardinality features |
| **Interpretability** | Easier to see directly from the tree structure            | Requires aggregate importance scores           |
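
The "Basis" and "Stability" rows can be checked directly. The sketch below assumes scikit-learn's impurity-based `feature_importances_` and reuses the Breast Cancer dataset; it recomputes the forest's scores as the normalized average of the per-tree scores and reports how much each score varies from tree to tree. The variable names are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(data.data, data.target)

# One row of impurity-based importances per tree in the forest.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Forest importance as the normalized mean over trees.
manual_average = per_tree.mean(axis=0)
manual_average = manual_average / manual_average.sum()

# Tree-to-tree spread: how unstable a single tree's score is.
spread = per_tree.std(axis=0)

top = np.argsort(forest.feature_importances_)[::-1][:3]
for i in top:
    print(f"{data.feature_names[i]:<22} "
          f"forest={forest.feature_importances_[i]:.3f} "
          f"tree-to-tree std={spread[i]:.3f}")
```

A large spread relative to the average means that any single tree could rank that feature very differently, which is the instability the table refers to.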


**Question 6: Write a Python program to:
● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.**


In [1]:
#Ans-

# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Load Breast Cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance scores
importances = rf.feature_importances_

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    "Feature": data.feature_names,
    "Importance": importances
})

# Sort by importance and get top 5
top5_features = feature_importance_df.sort_values(by="Importance", ascending=False).head(5)

# Print results
print("Top 5 Most Important Features:")
print(top5_features)


Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


**Question 7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree**


In [4]:
#ans-

# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# -------------------------------
# Single Decision Tree
# -------------------------------
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, y_pred_dt)

# -------------------------------
# Bagging Classifier with Decision Trees
# -------------------------------
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # in sklearn >=1.2 use "estimator" instead of "base_estimator"
    n_estimators=50,   # number of trees
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, y_pred_bag)

# -------------------------------
# Display Results
# -------------------------------
print("="*40)
print(" Accuracy Comparison on Iris Dataset ")
print("="*40)
print(f"Single Decision Tree Accuracy : {dt_accuracy:.4f}")
print(f"Bagging Classifier Accuracy   : {bagging_accuracy:.4f}")
print("="*40)


 Accuracy Comparison on Iris Dataset 
Single Decision Tree Accuracy : 1.0000
Bagging Classifier Accuracy   : 1.0000


**Question 8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy**

In [5]:
#Ans-

# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset (Iris for demonstration)
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    "n_estimators": [50, 100, 150],   # number of trees
    "max_depth": [None, 3, 5, 7]      # depth of trees
}

# GridSearchCV setup
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
    n_jobs=-1            # use all CPU cores
)

# Train GridSearchCV
grid_search.fit(X_train, y_train)

# Get best parameters
best_params = grid_search.best_params_

# Evaluate best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

# -------------------------------
# Print Results
# -------------------------------
print("="*50)
print(" Best Hyperparameters for Random Forest ")
print("="*50)
print(best_params)
print(f"\nFinal Accuracy on Test Set: {final_accuracy:.4f}")
print("="*50)


 Best Hyperparameters for Random Forest 
{'max_depth': None, 'n_estimators': 100}

Final Accuracy on Test Set: 1.0000


**Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)**

In [6]:
#Ans-

# Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# -------------------------------
# Bagging Regressor (with Decision Tree)
# -------------------------------
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
bagging_reg.fit(X_train, y_train)
y_pred_bag = bagging_reg.predict(X_test)
mse_bagging = mean_squared_error(y_test, y_pred_bag)

# -------------------------------
# Random Forest Regressor
# -------------------------------
rf_reg = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# -------------------------------
# Print Results
# -------------------------------
print("="*50)
print(" Comparison of Regressors on California Housing ")
print("="*50)
print(f"Bagging Regressor MSE       : {mse_bagging:.4f}")
print(f"Random Forest Regressor MSE : {mse_rf:.4f}")
print("="*50)


 Comparison of Regressors on California Housing 
Bagging Regressor MSE       : 0.2579
Random Forest Regressor MSE : 0.2565


**Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.**

You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
 - Choose between Bagging or Boosting
 - Handle overfitting
 - Select base models
 - Evaluate performance using cross-validation
 - Justify how ensemble learning improves decision-making in this real-world context.

**Ans-**

**Step-by-Step Approach for Loan Default Prediction with Ensemble Learning**

**1. Choosing Between Bagging and Boosting**

 - Bagging (e.g., Random Forest): Best when the base model has high variance and tends to overfit (like deep, unpruned decision trees). It reduces variance by averaging many models.

 - Boosting (e.g., XGBoost, LightGBM, AdaBoost): Best when a single model underfits and we need higher accuracy. It mainly reduces bias by training models sequentially, each one focusing on the cases the previous ones misclassified or under-predicted.

**2. Handling Overfitting**

 - Use cross-validation (k-fold CV) to ensure model generalization.

 - Apply regularization techniques available in boosting (e.g., learning_rate, max_depth, subsample, colsample_bytree).

 - Perform feature selection/importance analysis to remove noisy or redundant features.

 - Use early stopping in boosting to avoid overtraining.
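
Early stopping can be sketched independently of any boosting library. The function below assumes a list of validation losses recorded after each boosting round (the list and the name `early_stopping_round` are hypothetical) and returns the round to keep once the loss has failed to improve for `patience` consecutive rounds.

```python
def early_stopping_round(validation_losses, patience):
    """Return the index of the best round seen before patience ran out."""
    best_round, best_loss = 0, float("inf")
    for round_index, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_round, best_loss = round_index, loss
        elif round_index - best_round >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_round

# Hypothetical validation losses from successive boosting rounds.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.55]
print(early_stopping_round(losses, patience=2))
```

With `patience=2` training stops after round 4 and keeps round 2, even though a later round would have scored lower. That is the usual trade-off: a small patience saves time and limits overfitting but can stop too soon, so the value is itself worth tuning.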

**3. Selecting Base Models**

 - For Bagging: Decision Trees are common base learners (Random Forest is essentially bagged trees).

 - For Boosting: Shallow Decision Trees (depth 3–6) are used as weak learners because each one has low variance, and boosting corrects their bias over successive rounds.

 - Could also experiment with logistic regression or SVM as base learners in ensemble stacking.

**4. Evaluating Performance with Cross-Validation**

 - Use Stratified k-fold Cross-Validation to preserve class distribution (important due to imbalanced loan default data).

 - Metrics:

   - ROC-AUC (better than accuracy for imbalanced data).

   - Precision-Recall AUC (to capture ability to identify defaults).

   - F1-score for balance between precision & recall.

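 - Perform hyperparameter tuning (GridSearchCV or RandomizedSearchCV) inside the cross-validation loop, so that the data used to report the final score is never used to pick hyperparameters.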
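
The sketch below shows why stratification and the choice of metric matter on imbalanced data. It assumes a made-up dataset with a 10% default rate and uses simplified helpers (`stratified_folds`, `precision_recall_f1`) written for this example. In practice scikit-learn's `StratifiedKFold`, which can also shuffle, would be the usual choice.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign indices to k folds, dealing each class out round-robin.

    Simplified: no shuffling, so the original ordering is kept.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (default) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

labels = [1] * 10 + [0] * 90          # hypothetical data: 10% defaults
folds = stratified_folds(labels, k=5)
defaults_per_fold = [sum(labels[i] for i in fold) for fold in folds]

# A useless baseline that predicts "no default" for every customer.
always_repay = [0] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_repay)) / len(labels)
precision, recall, f1 = precision_recall_f1(labels, always_repay)

print(defaults_per_fold, accuracy, recall, f1)
```

Every fold keeps the same share of defaults, and the baseline that never flags a default still scores 90% accuracy while catching none of them. That gap is the reason to report recall, F1, or the AUC metrics rather than accuracy alone.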

**5. Justification: Why Ensemble Learning Improves Decision-Making**

 - Higher Accuracy & Robustness: Combining multiple models reduces risk of poor predictions from a single model.

 - Bias-Variance Trade-off: Bagging reduces variance; Boosting reduces bias. Both improve generalization.

 - Better Risk Assessment: In loan defaults, even small improvements in predictive accuracy can translate into substantial savings from avoided bad loans.

