Question 1: What is Ensemble Learning in machine learning? Explain the key idea
behind it.

Answer: Ensemble learning is a machine learning technique in which multiple models are trained and combined to solve the same problem, instead of relying on a single model. The final prediction is made by aggregating the predictions of all the models, for example by voting or averaging.

Key Idea => A group of weak or moderately accurate models can be combined to form a strong model. Each model makes different errors, so when the models are combined these errors tend to cancel out, provided the errors are not strongly correlated. This usually leads to better accuracy than any single model achieves. Common ensemble techniques are bagging, boosting, and stacking.
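
The error-cancelling effect can be checked with a little arithmetic. The sketch below assumes the idealized case of fully independent binary classifiers that are each correct with the same probability p (real models are correlated, so actual gains are smaller). The function name `majority_vote_accuracy` is only illustrative.

```python
from math import comb


def majority_vote_accuracy(p: float, n_models: int) -> float:
    """Probability that a majority of n_models independent classifiers,
    each correct with probability p, votes for the right class."""
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    needed = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Accuracy of the vote grows with the number of models when p > 0.5
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

With p = 0.6, a single model is right 60% of the time, while a majority vote over 5 such models is right about 68.3% of the time, and the figure keeps rising as more independent models are added.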

Question 2: What is the difference between Bagging and Boosting?

Answer:

Bagging - trains multiple independent models on different random (bootstrap) samples of the dataset and combines their predictions.
Models are trained in parallel and each model has equal weight in the final prediction. Bagging focuses on reducing variance and works best with high-variance models such as deep decision trees.

Boosting - trains models sequentially, where each new model focuses more on the errors made by the previous models.
Models are trained one after another. Misclassified points get higher weight in the next round, and the final prediction is a weighted combination of the models. Boosting focuses on reducing bias and can convert weak learners into strong learners.
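
To make the "misclassified points get higher weight" step concrete, here is a minimal sketch of one reweighting round in the style of AdaBoost for binary classification. It is an illustration under stated assumptions, not a full boosting implementation: the helper name `adaboost_reweight` is hypothetical, and the weighted error is assumed to lie strictly between 0 and 0.5.

```python
import numpy as np


def adaboost_reweight(weights, misclassified):
    """One AdaBoost-style weight update (binary classification).

    weights:       current sample weights
    misclassified: boolean mask, True where the current model was wrong
    """
    weights = np.asarray(weights, dtype=float)
    misclassified = np.asarray(misclassified, dtype=bool)

    # Weighted error rate of the current model (assumed 0 < error < 0.5)
    error = weights[misclassified].sum() / weights.sum()
    # Vote strength of the current model: lower error gives a larger alpha
    alpha = 0.5 * np.log((1 - error) / error)

    # Increase weights of mistakes, decrease weights of correct points
    new_weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
    return new_weights / new_weights.sum(), alpha


w = np.full(5, 0.2)                        # start with uniform weights
miss = [False, False, True, False, False]  # the model got one point wrong
new_w, alpha = adaboost_reweight(w, miss)
print(new_w, alpha)
```

In this example the single misclassified point ends up carrying half of the total weight (0.5), while each correctly classified point drops to 0.125, so the next model is pushed to get that point right.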

Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

Answer:

Bootstrap sampling is a statistical resampling technique where multiple new datasets are created from the original dataset by sampling with replacement.

Each bootstrap sample has the same size as the original dataset. Some data points may appear multiple times, while others may not appear at all.
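
A minimal NumPy sketch of drawing one bootstrap sample is shown below. The toy array, the seed, and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
data = np.arange(10)            # toy "dataset" of 10 points

# Draw len(data) indices uniformly at random, with replacement
idx = rng.integers(0, len(data), size=len(data))
bootstrap_sample = data[idx]

# Points that were never drawn into this sample
never_drawn = np.setdiff1d(data, bootstrap_sample)

print("bootstrap sample:", bootstrap_sample)
print("never drawn:     ", never_drawn)
```

The sample has the same length as the original data, and every original point is either in the sample (possibly several times) or in the never-drawn set.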

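Role of Bootstrap sampling in Bagging & Random Forest:

Multiple bootstrap samples are generated from the original dataset, and a decision tree is trained on each one. The trees are trained independently and can be trained in parallel. Because each tree sees a slightly different dataset, the trees are less correlated with each other, which reduces the variance of the ensemble. The final prediction is made by majority voting (classification) or averaging (regression).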

Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?

Answer:

In bagging methods like Random Forest, each model is trained on a bootstrap sample.

OOB samples are the datapoints from the original dataset that are not selected in a particular bootstrap sample.

On average, about 63% of the distinct data points appear in a given bootstrap sample, so the remaining ~37% are OOB for that tree.
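
The 63% figure follows from the sampling scheme: a given point is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A quick numerical check (the function name is just for illustration):

```python
import math


def expected_in_bag_fraction(n: int) -> float:
    """Expected fraction of distinct points that land in a bootstrap
    sample of size n drawn from n points."""
    return 1 - (1 - 1 / n) ** n


for n in (10, 100, 10_000):
    print(n, round(expected_in_bag_fraction(n), 4))

print("limit 1 - 1/e:", round(1 - math.exp(-1), 4))
```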

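To evaluate ensemble models ---> Each tree is trained on its bootstrap sample. For every data point, predictions are collected only from the trees for which that point was OOB. The aggregated prediction is compared with the true label, and the resulting accuracy (or R² for regression) is the OOB score. It acts as a built-in validation estimate, so no separate hold-out set is needed.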
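
In scikit-learn this procedure is available through the `oob_score=True` option of the forest estimators. The sketch below uses the Breast Cancer dataset purely as a convenient example; the number of trees and the seed are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each sample using only the
# trees that did not see it during training (bootstrap=True is the default)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

print("OOB accuracy:", rf.oob_score_)
```

Note that the model is fitted on the full dataset here, yet `oob_score_` is still an out-of-sample style estimate because each prediction comes only from trees that never trained on that point.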

Question 5: Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.

Answer:

1) Feature Importance in a Single Decision Tree -
Based on impurity reduction (Gini/Entropy): a feature is important if it produces a large impurity decrease at the splits where it is used.

a. depends heavily on one tree structure, b. highly sensitive to data variations, c. prone to overfitting

2) Feature Importance in a Random Forest -
Importance is averaged across many trees. Each tree is trained on a different bootstrap sample and considers a random subset of features at each split.

a. more stable and reliable, b. less sensitive to noise, c. reduces the variance that comes from relying on an individual tree.
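
One way to see the stability difference is to refit both models on two different bootstrap resamples of the same data and compare the top-ranked features. This is only a sketch: the dataset, the seeds, and the helper `top_features` are illustrative choices, and the exact rankings depend on them.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names


def top_features(model, k=3):
    """Names of the k features with the highest impurity-based importance."""
    order = np.argsort(model.feature_importances_)[::-1][:k]
    return [str(names[i]) for i in order]


for seed in (0, 1):
    # Resample the rows to mimic a slightly different training set
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=len(X))

    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[idx], y[idx])

    print(f"resample {seed} | tree:  ", top_features(tree))
    print(f"resample {seed} | forest:", top_features(forest))
```

A single tree typically concentrates most of its importance on the few features it happened to split on, and that set can shift between resamples, whereas the forest spreads importance across correlated features and tends to produce a more consistent ranking.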


Question 6: Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.
(Include your Python code and output in the code box below.)
Answer:


In [5]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train the Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Pair each feature with its impurity-based importance score
importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort from most to least important
feature_importance_df = feature_importance_df.sort_values(
    by='Importance',
    ascending=False
)

print('Top 5 Most Important Features:')
print(feature_importance_df.head(5))

Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


Question 7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree
(Include your Python code and output in the code box below.)
Answer:

In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: a single decision tree
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Bagging: 100 decision trees, each trained on a bootstrap sample
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=42
)
bagging.fit(X_train, y_train)
bagging_pred = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_pred)

print(f'Single Decision Tree Accuracy: {accuracy}')
print(f'Bagging Classifier Accuracy: {bagging_accuracy}')


Single Decision Tree Accuracy: 1.0
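Bagging Classifier Accuracy: 1.0

Both models classify all 30 test samples correctly, so this split does not separate them. Iris is a small, easy dataset; the benefit of bagging shows up on noisier problems where a single tree has high variance.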


Question 8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy
(Include your Python code and output in the code box below.)

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Random Forest model
rf = RandomForestClassifier(random_state=42)

# Hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20]
}
# GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

# Evaluate the best model (refit on the full training set) on the test set
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

# Results
print("Best Parameters:", grid_search.best_params_)
print("Final Test Accuracy:", final_accuracy)

Best Parameters: {'max_depth': None, 'n_estimators': 200}
Final Test Accuracy: 0.9649122807017544


Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)
(Include your Python code and output in the code box below.)

In [4]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

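# Load dataset: load_diabetes is used as an offline stand-in for California Housing
# (fetch_california_housing needs a download), so the MSE values below refer to the diabetes data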
data = load_diabetes()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Bagging Regressor
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=42
)
bagging.fit(X_train, y_train)
bagging_pred = bagging.predict(X_test)
bagging_mse = mean_squared_error(y_test, bagging_pred)

# Random Forest Regressor
rf = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

print("Bagging Regressor MSE:", bagging_mse)
print("Random Forest Regressor MSE:", rf_mse)


Bagging Regressor MSE: 2970.863235955056
Random Forest Regressor MSE: 2952.0105887640448


Question 10: You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context.

Answer:

1️⃣ Choosing Between Bagging and Boosting

Analyze the data and the problem first.

Loan default data is usually:

Noisy

High-dimensional

Imbalanced

Decision logic:

If the base model overfits (high variance) → choose Bagging

If the base model underfits (high bias) → choose Boosting

Final choice in this case:

Start with Bagging (Random Forest) for a stable baseline

Move to Boosting (Gradient Boosting / XGBoost) if higher accuracy is needed

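✔ Reason: Random Forest is robust to noise and needs little tuning, while Boosting can capture more complex patterns in customer behavior at the cost of more careful tuning.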
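
The decision logic in this step can be sketched as a small rule of thumb that compares the base model's training score with its cross-validated score. The function `suggest_ensemble` and both thresholds are hypothetical and would need to be set per project; this is a heuristic, not a standard API.

```python
def suggest_ensemble(train_score, cv_score, gap_tol=0.05, low_score=0.75):
    """Rule-of-thumb ensemble choice from a base model's scores.

    gap_tol and low_score are illustrative thresholds (assumptions).
    """
    gap = train_score - cv_score
    if gap > gap_tol:
        # Large train/validation gap: high variance (overfitting)
        return "bagging"
    if train_score < low_score:
        # Poor fit even on the training data: high bias (underfitting)
        return "boosting"
    return "either"


print(suggest_ensemble(train_score=1.00, cv_score=0.85))  # deep tree that overfits
print(suggest_ensemble(train_score=0.70, cv_score=0.69))  # shallow stump that underfits
```

The first call returns "bagging" and the second returns "boosting", matching the two rules listed above.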

2️⃣ Handling Overfitting

Techniques used:

Bagging:

Bootstrap sampling and averaging over many trees reduce variance

Boosting:

Limit max_depth

Use a small learning_rate (shrinkage)

General strategies:

Early stopping

Regularization

Feature selection

Pruning trees

✔ Result: Model generalizes better to unseen customers.
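
As a sketch of how these controls look in scikit-learn, the example below trains a `GradientBoostingClassifier` with shallow trees, a small learning rate, row subsampling, and early stopping. The data is synthetic (generated with `make_classification` to be imbalanced, roughly 90/10), and every hyperparameter value is an illustrative assumption rather than a recommendation for real loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data: about 10% positive ("default") class
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound; early stopping may use fewer
    learning_rate=0.05,       # shrinkage: smaller steps per tree
    max_depth=3,              # shallow trees limit model complexity
    subsample=0.8,            # each tree sees 80% of the rows
    validation_fraction=0.1,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X_train, y_train)

print("trees actually fitted:", gb.n_estimators_)
print("test accuracy:", gb.score(X_test, y_test))
```

The attribute `n_estimators_` reports how many trees were kept once early stopping triggered, which gives a quick check on whether the 500-tree budget was needed.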

3️⃣ Selecting Base Models

Criteria:

Should be simple and interpretable

Should have high variance for Bagging (e.g. deep trees) or high bias for Boosting (e.g. shallow trees), so that ensembling has something to correct

Common choices:

Decision Trees (depth-controlled)

Logistic Regression (baseline comparison)

✔ Decision Trees are preferred because:

Handle non-linearity

Capture feature interactions

Work well with ensemble methods.

4️⃣ Evaluating Performance Using Cross-Validation

Why cross-validation?

Prevents biased evaluation

Ensures stability across different data splits

Approach:

Use Stratified K-Fold CV (important for imbalanced loan data)

Evaluate using:

ROC-AUC

Precision & Recall

F1-score

✔ Cross-validation checks that the model performs consistently across different splits of the customer data.
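
A minimal sketch of this evaluation with scikit-learn follows. It again uses synthetic imbalanced data in place of real loan records, and the model settings (including `class_weight="balanced"`) are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: about 10% positive ("default") class
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=0,
)

# Stratification keeps the default rate roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)

metrics = ["roc_auc", "precision", "recall", "f1"]
scores = cross_validate(model, X, y, cv=cv, scoring=metrics)

# Report mean and spread across the 5 folds for each metric
for metric in metrics:
    vals = scores[f"test_{metric}"]
    print(f"{metric}: {vals.mean():.3f} +/- {vals.std():.3f}")
```

Reporting the spread alongside the mean matters here: a model with a good average but a large fold-to-fold variation is a warning sign for deployment on new customers.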

5️⃣ Justifying the Final Ensemble Choice

Final justification:

Ensemble methods:

Reduce variance (Bagging) and bias (Boosting)

Handle complex financial patterns

Improve predictive power

Boosting often performs best for loan default prediction due to its focus on hard-to-classify customers

Business justification:

Fewer bad loans approved

Better risk control

Regulatory-friendly explainability (via feature importance)