1. What is Ensemble Learning in machine learning? Explain the key idea
behind it.

-> Ensemble Learning is a technique in machine learning where multiple models are combined to improve accuracy and robustness.

Key idea: Different models make different, partly uncorrelated errors, so combining their predictions (by voting or averaging) cancels out many individual mistakes.
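The key idea can be checked with a little arithmetic. The sketch below assumes the models err independently, which is an idealization (real models are correlated, so the gain is smaller), and the helper name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a majority of n_models independent classifiers,
    each correct with probability p, votes for the right answer."""
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    majority = n_models // 2 + 1
    # Sum the binomial probabilities of k correct votes, for every k that forms a majority
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(n, 0.7), 3))
```

Under this assumption, three models that are each 70% accurate already reach 78.4% as a majority vote, and the figure keeps rising as more models are added.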

Main types:

Bagging: Train models independently on bootstrap samples (random subsets drawn with replacement) → reduces variance (e.g., Random Forest)

Boosting: Train models sequentially, each focusing on the mistakes of the previous ones → reduces bias (e.g., XGBoost)

Stacking: Combine different models using a meta-model trained on their predictions → leverages the strengths of each

Pros: Usually more accurate and more robust than a single model; bagging in particular reduces overfitting.
Cons: Slower to train and predict, and harder to interpret.

2. What is the difference between Bagging and Boosting?

-> Bagging: Builds multiple models in parallel on bootstrap samples (random subsets drawn with replacement) and combines them to reduce variance (e.g., Random Forest).

Boosting: Builds models sequentially, each focusing on the errors of the previous ones to reduce bias (e.g., XGBoost).

Key difference: Bagging focuses on stability, Boosting focuses on correcting mistakes.

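Bagging treats all models equally when combining predictions (majority vote or simple average).

Boosting typically weights models when combining them: AdaBoost gives more say to more accurate learners, while gradient boosting scales each new model's contribution by a learning rate.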
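The difference in how predictions are combined can be sketched in a few lines. This is only an illustration of equal versus weighted voting (closest to AdaBoost-style weighting), not a full bagging or boosting implementation; the function names, labels, and weights are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Boosting-style combination: each vote is scaled by its model's weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


def equal_vote(predictions):
    """Bagging-style combination: every model's vote counts the same."""
    return weighted_vote(predictions, [1.0] * len(predictions))


preds = ["default", "repay", "repay"]  # outputs of three hypothetical models
weights = [0.9, 0.3, 0.3]              # hypothetical per-model weights

print(equal_vote(preds))               # repay   (2 votes against 1)
print(weighted_vote(preds, weights))   # default (0.9 against 0.6)
```

With equal votes the two weaker models win; once the first model is trusted more, its single vote outweighs them.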

3. What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?

-> Bootstrap Sampling:

It is a method of creating random subsets of the training data by sampling with replacement.

Some data points may appear multiple times in a subset, while others may be left out.
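A minimal sketch of drawing one bootstrap sample with the standard library, to show the duplicates and the left-out rows. The helper name `bootstrap_indices` and the sample size are illustrative assumptions.

```python
import random


def bootstrap_indices(n_samples, seed=None):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_samples)]


n = 1000
sample = bootstrap_indices(n, seed=0)
in_bag = set(sample)                # rows drawn at least once
left_out = set(range(n)) - in_bag   # rows never drawn

print(f"unique rows drawn: {len(in_bag) / n:.1%}")
print(f"rows left out:     {len(left_out) / n:.1%}")
```

For large n the expected share of unique rows approaches 1 - 1/e, about 63%, so roughly 37% of the rows are left out of any given sample.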

Role in Bagging:

Each decision tree in a Random Forest is trained on a different bootstrap sample, making the trees diverse.

At each split, Random Forest also considers only a random subset of the features, which decorrelates the trees further.

Combining predictions from these diverse trees via voting (classification) or averaging (regression) reduces variance, reduces overfitting, and improves accuracy.

4. What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?

-> Out-of-Bag Samples:

In bootstrap sampling, each model is trained on a random subset of data.

The data points not included in a model’s training subset (about 37% of the data on average) are called its OOB samples.

OOB Score:

Used to evaluate ensemble models without a separate test set.

Each OOB sample is predicted by the trees that did not see it during training, and these predictions are compared to the true labels.

The average accuracy or error over all OOB samples gives the OOB score, a reasonable estimate of performance on unseen data that comes at no extra training cost.
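In scikit-learn the OOB score can be requested when the forest is built. A minimal sketch, assuming the Breast Cancer dataset and 200 trees purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each row using only the trees whose
# bootstrap sample did not contain that row.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print(f"OOB accuracy: {rf.oob_score_:.4f}")
```

No train/test split is made here: the forest is fitted on all rows, and `oob_score_` is computed from the rows each tree happened not to see.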

5. Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.


-> Single Decision Tree:

Feature importance is based on how much each feature reduces impurity (like Gini or entropy) in that one tree.

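Can be biased toward features with many distinct values (high-cardinality categorical or continuous features), and the ranking can change a lot with small changes in the data.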

Random Forest:

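Feature importance is averaged across all trees, making it more stable than the ranking from any single tree.

Averaging reduces the effect of any single tree’s errors, although impurity-based importance can still favor high-cardinality features; permutation importance is a common cross-check.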
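The comparison can be made concrete by fitting both models on the same data. This is a sketch on the Breast Cancer dataset (an assumption for illustration); the per-tree spread is computed from `forest.estimators_` to show how much individual trees disagree.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importance of every feature in every individual tree of the forest
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
spread = per_tree.std(axis=0)

# Top 5 features according to the forest, with the single tree's score alongside
top = np.argsort(forest.feature_importances_)[::-1][:5]
for i in top:
    print(
        f"{data.feature_names[i]:<22} "
        f"tree={tree.feature_importances_[i]:.3f} "
        f"forest={forest.feature_importances_[i]:.3f} (+/- {spread[i]:.3f})"
    )
```

A single tree usually concentrates its importance on the few features it happened to split on, while the forest spreads it over correlated alternatives.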

6. Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train a Random Forest on the full dataset
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance scores
importances = rf.feature_importances_

# Pair each feature name with its score
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Rank features from most to least important
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print("Top 5 Most Important Features:")
print(feature_importance_df.head(5))

Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


7. Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Baseline: a single decision tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)

# Bagging: 50 decision trees, each trained on a bootstrap sample
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42,
    bootstrap=True
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
accuracy_bag = accuracy_score(y_test, y_pred_bag)

print(f"Accuracy of Single Decision Tree: {accuracy_dt:.4f}")
print(f"Accuracy of Bagging Classifier: {accuracy_bag:.4f}")

Accuracy of Single Decision Tree: 1.0000
Accuracy of Bagging Classifier: 1.0000
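
Both models score 1.0000 on this split, which leaves the comparison inconclusive: the test set holds only 45 samples (30% of 150). A hedged alternative is to compare the two models with cross-validation over the whole dataset; the 10-fold setting below is an assumption, not part of the original task.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Single Decision Tree": DecisionTreeClassifier(random_state=42),
    "Bagging Classifier": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=42
    ),
}

cv_scores = {}
for name, model in models.items():
    # For classifiers, an integer cv uses stratified folds
    cv_scores[name] = cross_val_score(model, X, y, cv=10)
    print(f"{name}: {cv_scores[name].mean():.4f} +/- {cv_scores[name].std():.4f}")
```

The mean and spread over ten folds give a steadier picture than one small test set, although on a dataset as easy as Iris the two models may still be close.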


8. Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy

In [6]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

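# Hold out 30% of the data for the final test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(random_state=42)

# Candidate values: 3 x 4 = 12 combinations
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 3, 5, 7]
}

# 5-fold cross-validated grid search on the training set only, using all CPU cores
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)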

# Get the best parameters
best_params = grid_search.best_params_

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", best_params)
print(f"Test Set Accuracy: {accuracy:.4f}")

Best Parameters: {'max_depth': None, 'n_estimators': 100}
Test Set Accuracy: 1.0000


10. You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context.

1. Choosing Between Bagging or Boosting

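Bagging (Random Forest) → reduces variance; a stable baseline that needs little tuning when individual trees overfit.

Boosting (XGBoost/LightGBM) → reduces bias; captures complex patterns but is more sensitive to noisy labels and needs more careful tuning.

Approach: start with a Random Forest baseline, then try boosting, and keep whichever scores better under cross-validation.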

2. Handling Overfitting

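Limit tree depth and set a minimum number of samples per leaf.

Use regularization (e.g., a lower learning rate, row subsampling) and early stopping for boosting.

Use cross-validation to monitor the gap between training and validation scores.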
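
These controls can be sketched with scikit-learn's `GradientBoostingClassifier`, which stands in here for XGBoost/LightGBM so that the example needs no extra dependencies. The synthetic dataset (about 10% positives as a stand-in for defaults) and every hyperparameter value are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data: roughly 10% positives ("defaults")
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound; early stopping may use fewer
    learning_rate=0.05,       # shrinkage acts as regularization
    max_depth=3,              # shallow trees
    min_samples_leaf=20,      # no tiny leaves
    subsample=0.8,            # each tree sees 80% of the rows
    validation_fraction=0.1,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X_train, y_train)

print("trees actually fitted:", gb.n_estimators_)
print(f"train accuracy: {gb.score(X_train, y_train):.3f}")
print(f"test accuracy:  {gb.score(X_test, y_test):.3f}")
```

A large gap between the train and test figures would suggest tightening the limits further.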

3. Selecting Base Models

Use decision trees for flexibility and non-linear patterns.

Optionally stack the tree ensembles, using logistic regression as the meta-model.
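
A minimal stacking sketch with scikit-learn's `StackingClassifier`, assuming a Random Forest and a gradient-boosted model as base learners and logistic regression as the meta-model. The synthetic data and the particular base models are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data: roughly 10% positives ("defaults")
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-model
    cv=5,  # the meta-model is trained on out-of-fold base predictions
)
stack.fit(X_train, y_train)

print(f"stacked test accuracy: {stack.score(X_test, y_test):.3f}")
```

Training the meta-model on out-of-fold predictions keeps it from simply trusting whichever base model memorized the training set best.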

4. Evaluating Performance

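Use stratified k-fold cross-validation so that every fold keeps the same default rate.

Metrics: Precision, Recall, and ROC-AUC; accuracy alone can mislead because defaults are usually the minority class.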
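
The evaluation step might look like the sketch below, which scores a bagging and a boosting candidate on the same stratified folds. The data is synthetic (about 10% positives), and `GradientBoostingClassifier` again stands in for XGBoost/LightGBM; both are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: roughly 10% positives ("defaults")
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)

# Stratified folds keep the share of positives the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "roc_auc"]

candidates = {
    "bagging (Random Forest)": RandomForestClassifier(n_estimators=100, random_state=0),
    "boosting (Gradient Boosting)": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in candidates.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the five folds
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

Comparing recall and ROC-AUC side by side shows how many actual defaults each model catches, which accuracy on its own would hide.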

5. Justification for Ensemble Learning

Combines multiple models → the errors of individual models partly cancel out.

Provides more reliable loan default risk predictions than a single model.

Supports better lending decisions and risk management.