Question 1: What is Ensemble Learning in machine learning? Explain the key idea
behind it.

->Ensemble Learning is a machine learning technique where multiple models (called base learners or weak learners) are combined to build a stronger and more accurate model.

Key Idea:

The main idea behind ensemble learning is that a group of models, even individually weak ones, can perform better together than any one of them alone, provided their errors are not strongly correlated. Each model makes its own predictions, and their outputs are combined (by averaging, voting, or weighting) to produce the final result.

This helps to:

Reduce error (bagging mainly lowers variance, boosting mainly lowers bias)

Improve accuracy

Increase model stability

Example:

Suppose three classifiers predict whether an email is spam:

Model 1 → Spam

Model 2 → Not Spam

Model 3 → Spam

By majority voting, the final decision is Spam. This combined prediction is usually more reliable than that of any single model.
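
The vote in this example can be written out in a few lines. This is only a minimal sketch: the helper name `majority_vote` is made up for illustration, and ties are broken by first occurrence, which is an arbitrary choice.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of models.

    Ties go to the label that appears first (an arbitrary choice for this sketch).
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


votes = ["Spam", "Not Spam", "Spam"]  # Model 1, Model 2, Model 3
final_decision = majority_vote(votes)
print(final_decision)  # Spam
```

With two of the three models voting Spam, the combined decision is Spam even though Model 2 disagrees.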

Question 2: What is the difference between Bagging and Boosting?

->Meaning:

Bagging stands for Bootstrap Aggregating: multiple models are trained independently (so they can run in parallel) on bootstrap samples, i.e. random samples of the training data drawn with replacement.

Boosting means training models sequentially, where each model learns from the mistakes of the previous one.

Objective:

Bagging mainly reduces variance.

Boosting mainly reduces bias (and can reduce variance as well).

Model Training:

In Bagging, all models are trained independently.

In Boosting, models are trained one after another.

Data Sampling:

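Bagging trains each model on a random sample drawn with replacement.

Boosting focuses each new model on the samples the previous models got wrong, either by increasing their weights (AdaBoost) or by fitting the remaining errors (Gradient Boosting).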

Combining Results:

Bagging: Uses voting (for classification) or averaging (for regression).

Boosting: Uses a weighted vote or weighted sum, with weights based on each model's performance.

Examples:

Bagging: Random Forest

Boosting: AdaBoost, Gradient Boosting, XGBoost
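
To make the "more weight to wrongly predicted samples" point concrete, here is a rough sketch of a single AdaBoost-style reweighting step for binary classification. The function name `adaboost_reweight` and the toy numbers are assumptions for illustration; this is not scikit-learn's implementation, and other boosting methods such as Gradient Boosting work differently.

```python
import math


def adaboost_reweight(weights, is_wrong):
    """One AdaBoost-style weight update (illustrative sketch).

    weights  -- current sample weights, summing to 1
    is_wrong -- True where the current weak learner misclassified the sample
    Returns (alpha, new_weights).
    """
    # Weighted error of the current weak learner (assumed to lie in (0, 0.5)).
    error = sum(w for w, wrong in zip(weights, is_wrong) if wrong)
    # alpha is the learner's say in the final weighted vote.
    alpha = 0.5 * math.log((1 - error) / error)
    # Raise the weights of mistakes, lower the weights of correct predictions.
    raw = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, is_wrong)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


weights = [0.1] * 10                     # ten samples, equal weights
is_wrong = [True, True] + [False] * 8    # the learner got the first two wrong
alpha, new_weights = adaboost_reweight(weights, is_wrong)
print(round(alpha, 4))
print([round(w, 4) for w in new_weights])
```

After the update, the two misclassified samples carry half of the total weight between them, so the next model in the sequence is pushed to get them right.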

Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

->Bootstrap sampling is a technique where multiple random samples are drawn with replacement from the original training dataset. Each sample is of the same size as the original dataset, but because of replacement, some data points may appear multiple times, while others may be left out.

Role in Bagging (e.g., Random Forest):

Creates diversity: Each model (like a decision tree) is trained on a different bootstrap sample, which makes them see different subsets of the data.

Reduces variance: Since the models are trained on varied samples, their combined (averaged or voted) predictions become more stable and less prone to overfitting.

OOB estimation: The data points not included in a particular sample (called out-of-bag samples) are used to estimate model performance without needing a separate validation set.
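
A small sketch of bootstrap sampling using only the standard library. The dataset size `n = 1000` and the helper name `bootstrap_indices` are assumptions for illustration; the point is that the sample has the same size as the original data, contains repeats, and leaves some points out.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices from range(n) with replacement."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(0)   # fixed seed so the run is repeatable
n = 1000                 # hypothetical training-set size
sample = bootstrap_indices(n, rng)

in_bag = set(sample)                   # points that appear at least once
out_of_bag = set(range(n)) - in_bag    # points that were never drawn

unique_fraction = len(in_bag) / n
expected_unique = 1 - (1 - 1 / n) ** n   # approaches 1 - 1/e, about 0.632

print(len(sample), len(in_bag), len(out_of_bag))
print(round(unique_fraction, 3), round(expected_unique, 3))
```

The observed fraction of distinct points should land close to the theoretical value, and the remaining points form the out-of-bag set for that model.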

Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?

->Out-of-Bag (OOB) samples are the data points that are not included in a particular bootstrap sample during training in Bagging methods like Random Forest.
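Because bootstrap sampling is done with replacement, each bootstrap sample leaves out about 36.8% of the data points on average (a point is missed in all n draws with probability (1 - 1/n)^n, which approaches 1/e). These left-out points are the OOB samples for that model.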

OOB Score – Evaluation:

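For each model (tree), its OOB samples act as a test set, since that tree never saw them during training.

Each training sample is predicted using only the trees for which it was out-of-bag, and those predictions are combined (voted or averaged).

The accuracy of these combined predictions is called the OOB score (for regressors, scikit-learn reports R² instead).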
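
The OOB procedure can be sketched from scratch as below. To keep it self-contained, the base learner is a toy 1-nearest-neighbour rule on a single feature rather than a decision tree, and the data are made up; `nearest_label` and `oob_score` are hypothetical helper names, not library functions. In scikit-learn, the equivalent is passing `oob_score=True` to `RandomForestClassifier` and reading `oob_score_` after fitting.

```python
import random
from collections import Counter, defaultdict


def nearest_label(x, train):
    """Toy base learner: label of the closest training point (1-NN, one feature)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def oob_score(data, n_models, seed=0):
    """Estimate accuracy using only out-of-bag votes.

    data -- list of (feature, label) pairs
    """
    rng = random.Random(seed)
    n = len(data)
    votes = defaultdict(list)   # sample index -> predictions from models that never saw it
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        train = [data[i] for i in idx]
        for i in set(range(n)) - set(idx):           # this model's OOB points
            votes[i].append(nearest_label(data[i][0], train))
    # Majority vote per sample, scored only where at least one OOB vote exists.
    correct = sum(
        Counter(preds).most_common(1)[0][0] == data[i][1]
        for i, preds in votes.items()
    )
    return correct / len(votes)


# Hypothetical, well-separated toy data: class 0 near 0, class 1 near 100.
data = [(float(v), 0) for v in range(10)] + [(100.0 + v, 1) for v in range(10)]
score = oob_score(data, n_models=25)
print(score)
```

No separate validation set is used here: every prediction that counts toward the score comes from a model that did not train on that point.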

Question 5: Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.

Decision Tree:

Calculated from how much each feature reduces impurity (like Gini or entropy) in that single tree.

Importance depends heavily on the specific training data used.

Can be biased toward features with more levels or higher cardinality.

Easier to interpret, as it’s just one tree.

Random Forest:

Calculated by averaging the feature importance scores from all trees in the forest.

Importance is more stable and reliable due to averaging across many trees.

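Depends less on any one sample, since the trees are trained on different bootstrap samples and consider random subsets of features at each split. Impurity-based scores can still favor high-cardinality features.

Harder to interpret directly, but gives a more reliable overall ranking.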
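
The comparison can be checked on the Breast Cancer dataset with scikit-learn. This is a sketch under a few assumptions: both models are fit on the full dataset with `random_state=0`, and the spread across trees is measured with a plain standard deviation over `forest.estimators_`. Exact numbers depend on the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importance of every feature in each individual tree of the forest.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
spread = per_tree.std(axis=0)   # how much the trees disagree on each feature

# Features a model never split on get an importance of exactly 0.
unused_by_tree = int((tree.feature_importances_ == 0).sum())
unused_by_forest = int((forest.feature_importances_ == 0).sum())

# Show the forest's top 5 features next to the single tree's scores.
top = np.argsort(forest.feature_importances_)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<22} tree={tree.feature_importances_[i]:.3f} "
          f"forest={forest.feature_importances_[i]:.3f} +/- {spread[i]:.3f}")
print("features with zero importance:",
      unused_by_tree, "(tree) vs", unused_by_forest, "(forest)")
```

A single tree typically concentrates its importance on a few features and ignores many others, while the forest spreads importance over correlated features and exposes how much the individual trees disagree.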

In [6]:
'''Question 6: Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.
'''

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the dataset
data = load_breast_cancer()
x = data.data
y = data.target

# Hold out 20% of the data; the importances are computed from the training split only
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

# No random_state is set here, so the importances vary slightly between runs
model = RandomForestClassifier()
model.fit(x_train, y_train)

# Pair each feature name with its impurity-based importance score
importance = model.feature_importances_
df = pd.DataFrame({'features': data.feature_names,
                   'importances': importance})

# Sort in descending order and show the top 5
df = df.sort_values(by='importances', ascending=False)
print("the top 5 features are : ", df.head(5))

the top 5 features are :                  features  importances
23            worst area     0.140143
20          worst radius     0.130077
27  worst concave points     0.123622
7    mean concave points     0.108659
22       worst perimeter     0.101750


In [9]:
'''Question 7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree
'''
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
data = load_iris()
x = data.data
y = data.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

# Bagging ensemble of 100 decision trees
# (the `estimator` argument was named `base_estimator` before scikit-learn 1.2)
model1 = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=1)

# Single decision tree as the baseline
model2 = DecisionTreeClassifier()

model1.fit(x_train, y_train)
model2.fit(x_train, y_train)

y_pred_bagging = model1.predict(x_test)
y_pred_tree = model2.predict(x_test)

# The test set has only 30 samples, so both models can easily reach the same accuracy
print("the accuracy of the bagging classifier is : ", accuracy_score(y_test, y_pred_bagging))
print("the accuracy of the tree classifier is : ", accuracy_score(y_test, y_pred_tree))


the accuracy of the bagging classifier is :  0.9666666666666667
the accuracy of the tree classifier is :  0.9666666666666667


In [12]:
'''Question 8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy
'''
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Load the dataset
data = load_iris()
x = data.data
y = data.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

clf = RandomForestClassifier()

# Candidate values for the two hyperparameters being tuned
param_grid = {'max_depth': [1, 2, 3, 4],
              'n_estimators': [100, 200, 300]}

# 5-fold cross-validated grid search over all 12 combinations
model = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, verbose=0)
model.fit(x_train, y_train)
print("the best parameters are : ", model.best_params_)

# best_estimator_ is refit on the whole training set with the best parameters
y_pred = model.best_estimator_.predict(x_test)
print("the final accuracy is : ", accuracy_score(y_test, y_pred))

the best parameters are :  {'max_depth': 2, 'n_estimators': 100}
the final accuracy is :  0.9666666666666667


In [13]:
'''Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)
'''

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

bagging = BaggingRegressor(n_estimators=100, random_state=42)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)
mse_bagging = mean_squared_error(y_test, y_pred_bagging)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print("Mean Squared Error (Bagging Regressor):", mse_bagging)
print("Mean Squared Error (Random Forest Regressor):", mse_rf)


Mean Squared Error (Bagging Regressor): 0.25592438609899626
Mean Squared Error (Random Forest Regressor): 0.2553684927247781


Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context.

->1. Choose between Bagging or Boosting

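Use Bagging (e.g., Random Forest) if the base model shows high variance (overfits easily).

Use Boosting (e.g., XGBoost or Gradient Boosting) if the base model shows high bias (underfits).

Try both and select the one with the better cross-validated score. Loan defaults are usually rare, so prefer AUC-ROC or F1-Score over plain accuracy for this comparison.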

2. Handle Overfitting

Limit tree depth (max_depth).

Use cross-validation for tuning.

Apply regularization (e.g., a smaller learning_rate in boosting, a larger min_samples_split in the trees).

Use early stopping (in boosting).

3. Select Base Models

Use decision trees as the base model for both bagging and boosting.

Logistic Regression or SVM can also be tested as base models in a stacking ensemble.

4. Evaluate Performance (Cross-Validation)

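Use Stratified K-Fold Cross-Validation so that every fold keeps the original ratio of defaults to non-defaults.

Measure AUC-ROC, F1-Score, precision, and recall alongside Accuracy, since accuracy alone is misleading when defaults are rare.

Choose the model with the best mean CV score.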

5. How Ensemble Improves Decision-Making

Combines multiple models → reduces error.

Gives more stable and accurate loan default predictions.

Helps identify high-risk customers better → fewer bad loans and losses.

Increases confidence in credit decisions. Ensembling does not guarantee fairness by itself, so fairness should still be checked separately.
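
Steps 1 and 4 can be sketched with scikit-learn as follows. No real loan data is available here, so the example uses a synthetic dataset from `make_classification` with roughly 10% positive ("default") cases; the class ratio, model settings, and the choice of AUC-ROC as the selection metric are all assumptions for illustration, not a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for loan data: about 10% "default" cases (an assumption).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# One bagging-style and one boosting-style candidate, both with limited tree depth.
candidates = {
    "bagging (random forest)": RandomForestClassifier(
        n_estimators=50, max_depth=6, random_state=0),
    "boosting (gradient boosting)": GradientBoostingClassifier(
        n_estimators=50, learning_rate=0.1, max_depth=3, random_state=0),
}

# Stratified folds keep the default ratio roughly constant in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: AUC-ROC = {scores.mean():.3f} +/- {scores.std():.3f}")

# Pick the candidate with the best mean cross-validated score.
best = max(results, key=lambda name: results[name][0])
print("selected:", best)
```

On real data the same loop would run on the preprocessed customer features, and the fold-to-fold standard deviation is worth reading alongside the mean before committing to one model.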