Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.

Answer: Ensemble learning is a technique in which multiple models (called base learners, or “weak learners” in the boosting literature) are combined into a single model that is usually more accurate and more stable than any of its members.

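Key idea: Instead of relying on a single model, ensemble methods aggregate the predictions of several models. Because the individual models make partly different errors, combining them cancels some of those errors out, which lowers variance and/or bias and improves generalization.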
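The error-cancelling effect can be illustrated with a small simulation. This is a minimal sketch, not a real ensemble: it assumes 25 hypothetical classifiers that are each right 60% of the time and, importantly, that their errors are independent, which real models only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_models, p_correct = 10_000, 25, 0.6

# correct[i, j] is True when hypothetical model j classifies sample i correctly.
# Assumption: the models err independently of each other.
correct = rng.random((n_samples, n_models)) < p_correct

single_acc = correct[:, 0].mean()
# A majority vote is right when more than half of the models are right.
ensemble_acc = (correct.sum(axis=1) > n_models / 2).mean()

print(f"single model accuracy : {single_acc:.3f}")
print(f"majority-vote accuracy: {ensemble_acc:.3f}")
```

The vote should come out well above the accuracy of any single model. The gain shrinks as the models' errors become more correlated, which is why ensemble methods work to make their members diverse.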

Examples of ensemble methods:

Bagging (e.g., Random Forest) → reduces variance by training models on bootstrap samples (random subsets drawn with replacement) and averaging their predictions.

Boosting (e.g., XGBoost, AdaBoost) → reduces bias by sequentially improving weak models.

Stacking → combines different models through a meta-model.

Question 2: What is the difference between Bagging and Boosting?

Answer: Bagging vs Boosting:

**Bagging (Bootstrap Aggregating):**

Trains multiple models independently (so they can be fitted in parallel), each on a bootstrap sample of the data.

Focuses on reducing variance.

Example: Random Forest.

**Boosting:**

Trains models sequentially, where each new model focuses on correcting errors of the previous one.

Focuses mainly on reducing bias, and can also reduce variance.

Examples: AdaBoost, XGBoost.

Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

Answer: **Bootstrap Sampling:**
It is a statistical method where we create new datasets by randomly sampling (with replacement) from the original dataset, so some samples may appear multiple times while others may be left out.

Role in Bagging (e.g., Random Forest):

Each model (tree) is trained on a different bootstrap sample of the data.

This creates diversity among the models, reducing overfitting and variance.

Final prediction is made by averaging (regression) or voting (classification).
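Drawing a single bootstrap sample takes only a few lines of NumPy. This sketch uses row indices in place of a real dataset; the size `n = 1_000` and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000
indices = np.arange(n)

# One bootstrap sample: n draws with replacement from the n original rows
bootstrap = rng.choice(indices, size=n, replace=True)

in_bag = np.unique(bootstrap)                # rows that appear at least once
out_of_bag = np.setdiff1d(indices, in_bag)   # rows never drawn

print("distinct rows in the sample:", in_bag.size)
print("rows left out              :", out_of_bag.size)
print("left-out fraction          :", out_of_bag.size / n)
```

The left-out fraction should land near 1/e ≈ 0.368, because each row is missed by a single draw with probability 1 − 1/n and therefore by all n draws with probability (1 − 1/n)^n.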

Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

Answer: **Out-of-Bag (OOB) Samples:**
In bootstrap sampling (used in Bagging/Random Forest), each new dataset is created by sampling with replacement, so on average about 36.8% (roughly one-third) of the original data points are left out of any given bootstrap sample. These unused data points are that model's OOB samples.

OOB Score:

OOB samples act like a built-in validation set.

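For each training sample, predictions are collected only from the trees whose bootstrap sample did not contain it, so the sample is never scored by a tree that saw it during training.

These predictions are aggregated (by voting or averaging) and compared with the true targets. The resulting accuracy (or R² for regression) is the OOB score, an estimate of generalization performance that needs no separate validation split.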
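In scikit-learn the OOB score is available by passing `oob_score=True` to the forest. The sketch below compares it with accuracy on a held-out test set; the dataset, split, and number of trees are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rf = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,   # score each training sample with trees that never saw it
    random_state=42,
)
rf.fit(X_train, y_train)

oob_acc = rf.oob_score_            # estimated from the training data alone
test_acc = rf.score(X_test, y_test)

print(f"OOB accuracy : {oob_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```

The two numbers are usually close, which is what makes the OOB score a convenient substitute when data is too scarce to hold out a validation set.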

Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest

Answer: Feature Importance: Decision Tree vs. Random Forest

Single Decision Tree:

Feature importance is based on how much a feature reduces impurity (e.g., Gini, entropy, variance) at its splits.

Importance can be biased toward features that allow more splits or have many categories.

Since it’s only one tree, results may be unstable (high variance).

Random Forest:

Aggregates feature importance across many trees, making it more reliable and stable.

Averaging mainly reduces the variance of the importance estimates; impurity-based scores can still favor high-cardinality features.

Two common ways:

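Mean Decrease in Impurity (MDI): average impurity reduction attributed to a feature across all trees, computed from the training data.

Mean Decrease in Accuracy (MDA, also called permutation importance): measures the drop in accuracy when a feature's values are randomly shuffled, ideally on held-out data.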
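Both measures can be computed side by side in scikit-learn: `feature_importances_` gives the impurity-based scores, and `sklearn.inspection.permutation_importance` gives the shuffle-based ones. This is a minimal sketch; the dataset, split, and `n_repeats=10` are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42, stratify=data.target
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Impurity-based importance: derived from the training data, sums to 1
mdi = rf.feature_importances_

# Permutation importance: mean drop in test accuracy when a column is shuffled
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
mda = perm.importances_mean

for name, scores in [("MDI", mdi), ("Permutation", mda)]:
    top = scores.argsort()[::-1][:5]
    print(f"{name} top 5:", [str(data.feature_names[i]) for i in top])
```

The two rankings need not agree. When several features are strongly correlated, shuffling one of them costs little accuracy because the others carry the same information, so permutation scores can look small even for features the trees split on often.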

Question 6: Write a Python program to:
● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores



In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance
importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
})

# Sort by importance and print top 5
top5 = feature_importance_df.sort_values(by='Importance', ascending=False).head(5)
print("Top 5 Most Important Features:\n", top5)


Top 5 Most Important Features:
                  Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


Question 7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree

In [3]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
acc_dt = accuracy_score(y_test, y_pred_dt)

# Bagging Classifier with Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base learner (named base_estimator before scikit-learn 1.2)
    n_estimators=50,      # number of trees
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
acc_bag = accuracy_score(y_test, y_pred_bag)

# Print results
print("Accuracy of Single Decision Tree:", acc_dt)
print("Accuracy of Bagging Classifier :", acc_bag)

Accuracy of Single Decision Tree: 1.0
Accuracy of Bagging Classifier : 1.0


Question 8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy

In [4]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define Random Forest
rf = RandomForestClassifier(random_state=42)

# Hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20]
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)

# Best parameters and accuracy
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)
final_acc = accuracy_score(y_test, y_pred)

print("Best Parameters:", best_params)
print("Final Accuracy on Test Set:", final_acc)


Best Parameters: {'max_depth': None, 'n_estimators': 200}
Final Accuracy on Test Set: 0.9707602339181286


Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
● Compare their Mean Squared Errors (MSE)

In [6]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Bagging Regressor with Decision Trees
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print comparison
print("Mean Squared Error (Bagging Regressor):", mse_bag)
print("Mean Squared Error (Random Forest Regressor):", mse_rf)

Mean Squared Error (Bagging Regressor): 0.25787382250585034
Mean Squared Error (Random Forest Regressor): 0.25650512920799395


Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world context.

Answer:

**Step 1: Choose between Bagging or Boosting**

Bagging (e.g., Random Forest) → Best when the main problem is high variance (unstable predictions). It reduces overfitting by averaging many independently trained models.

Boosting (e.g., XGBoost, LightGBM) → Best when the main problem is high bias (the model underfits). It builds models sequentially, with each new model focusing on the cases the previous ones got wrong.

For loan default prediction (imbalanced classes and complex patterns), Boosting is often preferred because it tends to capture non-linear interactions and rare patterns well. It can also be more sensitive to noisy labels, so a Random Forest remains a useful baseline to compare against.

**Step 2: Handle Overfitting**

Use techniques like:

Regularization in boosting (shrinkage/learning rate, max_depth).

Early stopping (stop training when validation performance stops improving).

Cross-validation to tune hyperparameters.

Feature selection/engineering to avoid noise.
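Shrinkage, shallow trees, and early stopping can be combined in one model. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost/LightGBM, and a synthetic imbalanced dataset in place of real loan data; both choices, along with every hyperparameter value shown, are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data (assumption): about 10% positives ("defaults")
X, y = make_classification(
    n_samples=3000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound; early stopping may use fewer
    learning_rate=0.05,       # shrinkage: smaller steps per tree
    max_depth=3,              # shallow trees limit model complexity
    subsample=0.8,            # row subsampling adds further regularization
    validation_fraction=0.2,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gb.fit(X_train, y_train)

print("Trees actually fitted:", gb.n_estimators_)
print("Test accuracy:", round(gb.score(X_test, y_test), 3))
```

`n_estimators_` reports how many boosting rounds were kept. If it equals the upper bound, early stopping never triggered and a larger `n_estimators` may be worth trying.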

**Step 3: Select Base Models**

Base models depend on data complexity:

Decision Trees (most common in ensemble methods).

Logistic Regression / Linear Models (if data is highly linear).

Neural Nets / Gradient Boosted Trees (as heterogeneous base models in a stacking ensemble, for very large datasets with non-linear patterns).

In practice, Decision Trees are the usual choice of base learner for bagging and boosting on tabular financial data.

**Step 4: Evaluate Performance using Cross-Validation**

Apply Stratified k-Fold Cross-Validation (to maintain class balance since defaults are rare).

Metrics to use:

AUC-ROC (to evaluate ability to separate defaulters vs non-defaulters).

Precision, Recall, F1 (important because catching defaulters is more critical than overall accuracy).

Confusion Matrix to understand business trade-offs (false positives vs false negatives).
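The evaluation plan above can be sketched with `StratifiedKFold` and `cross_validate`. The data is again a synthetic imbalanced stand-in for loan records, and the Random Forest with `class_weight="balanced"` is just one possible model; both are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import (
    StratifiedKFold, cross_val_predict, cross_validate,
)

# Synthetic stand-in for loan data (assumption): about 10% positives ("defaults")
X, y = make_classification(
    n_samples=3000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

# Stratification keeps the default rate roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)

metrics = ["roc_auc", "precision", "recall", "f1"]
results = cross_validate(model, X, y, cv=cv, scoring=metrics)
for metric in metrics:
    scores = results[f"test_{metric}"]
    print(f"{metric:9s}: {scores.mean():.3f} +/- {scores.std():.3f}")

# Out-of-fold predictions give one confusion matrix over the whole dataset
y_pred = cross_val_predict(model, X, y, cv=cv)
cm = confusion_matrix(y, y_pred)
print(cm)  # rows: true class, columns: predicted class
```

In the confusion matrix, the bottom-left cell counts defaulters predicted as safe (false negatives) and the top-right cell counts safe customers flagged as defaulters (false positives). The relative cost of those two cells is a business decision that should guide the choice of classification threshold.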

**Step 5: Justify How Ensemble Learning Improves Decision-Making**

Reduces risk: By combining multiple models, predictions are more robust and less sensitive to noise.

Improves accuracy: Boosting captures complex, non-linear relations in customer behavior.

Better generalization: Bagging reduces overfitting by averaging across models.

Business impact: More reliable default predictions → lower financial losses, better credit risk management, and improved trust in the institution’s lending system.