**Ensemble Learning | Assignment**

Question 1: What is Ensemble Learning in machine learning? Explain the key idea
behind it.

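Ans - Ensemble learning is a technique in which several models (base learners, often individually weak) are trained and their predictions are combined, for example by voting or averaging, to produce a single, better prediction.

**Key idea:** Different models make different mistakes. When the models are reasonably accurate and diverse, their individual errors partly cancel out when combined, so the ensemble usually generalizes better than any single model.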
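
The error-cancelling effect can be illustrated with a small simulation. This is only a sketch under an idealized assumption: every model is right with the same probability (`p_correct`, a made-up value) and the models err independently, which real models never fully do.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models = 25       # hypothetical ensemble size (odd, so votes cannot tie)
n_samples = 10_000  # number of simulated predictions
p_correct = 0.6     # assumed accuracy of each individual model

# correct[i, j] is True when model i classifies sample j correctly.
# Independence between models is an idealizing assumption.
correct = rng.random((n_models, n_samples)) < p_correct

single_acc = correct[0].mean()
# The majority vote is right when more than half of the models are right.
ensemble_acc = (correct.sum(axis=0) > n_models / 2).mean()

print(f"single model accuracy:  {single_acc:.3f}")
print(f"majority-vote accuracy: {ensemble_acc:.3f}")
```

Under these assumptions the majority vote is noticeably more accurate than any one model. With correlated models the gain shrinks, which is why diversity matters.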


Question 2: What is the difference between Bagging and Boosting?

Ans - Bagging: Trains models independently (so they can run in parallel) on bootstrap samples of the training data and averages or votes over their predictions, mainly to reduce variance (e.g., Random Forest).

Boosting: Trains models sequentially, where each new model focuses on correcting the errors of the models before it, mainly to reduce bias (e.g., AdaBoost, XGBoost).
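
As a rough side-by-side, the sketch below fits one bagging ensemble and one boosting ensemble on the Breast Cancer dataset. The dataset, the tree depths, and `n_estimators=50` are illustrative choices, not part of the question.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: fully grown (low-bias, high-variance) trees, each fit independently
# on its own bootstrap sample; their predictions are aggregated.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)

# Boosting: decision stumps (high-bias, low-variance) fit one after another,
# each round giving more weight to the samples misclassified so far.
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=42)

scores = {}
for name, model in [("Bagging", bagging), ("AdaBoost", boosting)]:
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {scores[name]:.3f}")
```

The base estimator is passed positionally because its keyword name differs between scikit-learn versions.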

Question 3: What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?

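Ans - Bootstrap sampling means drawing data points at random with replacement from the training set, usually as many points as the set contains, so each sample repeats some points and leaves others out.
Role in Bagging: It ensures each model is trained on slightly different data, which makes the models diverse and reduces variance (and therefore overfitting) once their predictions are aggregated. Random Forest adds further diversity by considering only a random subset of features at each split.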
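
A minimal NumPy sketch of one bootstrap draw is shown here. The dataset size `n` is arbitrary, and row indices stand in for real data rows.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000                 # hypothetical number of training rows
indices = np.arange(n)

# One bootstrap sample: n draws with replacement from the n original rows.
sample = rng.choice(indices, size=n, replace=True)

# Fraction of the original rows that appear at least once in the sample.
unique_fraction = np.unique(sample).size / n
print(f"fraction of distinct rows in the sample: {unique_fraction:.3f}")
print(f"theoretical value 1 - 1/e:               {1 - np.exp(-1):.3f}")
```

For large `n`, the chance that a given row is never drawn is (1 - 1/n)^n, which tends to 1/e, so a bootstrap sample contains roughly 63% of the distinct rows.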

Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?

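Ans - OOB samples: The training points that were left out of the bootstrap sample for a given model. On average, about 37% of the training set is out-of-bag for each model.

OOB score: Each training point is predicted using only the models that did not see it during training, and these aggregated predictions are scored against the true labels. This provides a built-in validation estimate without setting aside a separate validation set.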
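
In scikit-learn this can be requested with the `oob_score=True` option. The sketch below uses the Breast Cancer dataset and 200 trees purely as illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True makes the forest score each training row using only
# the trees whose bootstrap sample did not contain that row.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X, y)

# For classifiers, oob_score_ is the accuracy of these out-of-bag predictions.
print(f"OOB accuracy: {forest.oob_score_:.3f}")
```

Enough trees are needed so that every row is out-of-bag for at least a few of them. With very small forests scikit-learn warns that some rows have no OOB prediction.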

Question 5: Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.

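Ans - Decision Tree: Feature importance is the total impurity reduction (e.g., Gini or entropy) achieved by the splits on each feature, weighted by the number of samples reaching those splits. Because it comes from a single tree, it can change noticeably with small changes in the data.

Random Forest: The same impurity-based importance is averaged over many trees, each built on a different bootstrap sample and random feature subsets, which gives more stable and reliable rankings. It can still favor features with many distinct values, so it is worth cross-checking, for example with permutation importance.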
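
The difference can be seen directly in scikit-learn. The sketch below is only illustrative (dataset, seed, and forest size are arbitrary choices): it compares how many features receive non-zero importance and shows how much each importance varies across the trees of the forest.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# A single tree assigns importance only to the features it happened to split on.
tree_nonzero = int(np.count_nonzero(tree.feature_importances_))
forest_nonzero = int(np.count_nonzero(forest.feature_importances_))
print(f"features with non-zero importance: tree={tree_nonzero}, forest={forest_nonzero}")

# The forest's importance is the average of the per-tree importances;
# the standard deviation shows how much individual trees disagree.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
spread = per_tree.std(axis=0)

top = np.argsort(forest.feature_importances_)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<25} mean={forest.feature_importances_[i]:.3f} std={spread[i]:.3f}")
```

A large standard deviation relative to the mean is a hint that any single tree's ranking should not be trusted on its own.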

Question 6: Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.


In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Train Random Forest
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Feature importance
importances = model.feature_importances_
feat_imp = pd.Series(importances, index=feature_names).sort_values(ascending=False)

# Print top 5 features
print("Top 5 important features:")
print(feat_imp.head(5))


Top 5 important features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


Question 7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree

In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_acc = accuracy_score(y_test, dt_pred)

# Bagging Classifier with Decision Trees
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
bag_pred = bagging.predict(X_test)
bag_acc = accuracy_score(y_test, bag_pred)

print("Decision Tree Accuracy:", dt_acc)
print("Bagging Classifier Accuracy:", bag_acc)


Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0
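
Because Iris is small and easy, a single 70/30 split can give both models the same score. A hedged alternative is to compare them with repeated splits. The sketch below uses 10-fold stratified cross-validation, which is an illustrative choice and not something the question asks for.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stratified folds keep the three classes balanced in every split.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42),
}

cv_results = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=cv)
    cv_results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

The mean and spread over folds give a fairer picture than one test split, although on a dataset this simple the two models may still score very similarly.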


Question 8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy


In [3]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Random Forest with GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 3, 5, 7]
}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid,
                    cv=5,
                    scoring='accuracy')
grid.fit(X_train, y_train)

# Best model
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Parameters:", grid.best_params_)
print("Final Accuracy:", accuracy_score(y_test, y_pred))


Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 1.0


Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)


In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagging Regressor with Decision Trees
bagging = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
bag_pred = bagging.predict(X_test)
bag_mse = mean_squared_error(y_test, bag_pred)

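# Random Forest Regressor (by default it considers all features at each split,
# so with the same number of trees it behaves much like the bagged trees above)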
rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

print("Bagging Regressor MSE:", bag_mse)
print("Random Forest Regressor MSE:", rf_mse)


Bagging Regressor MSE: 0.25787382250585034
Random Forest Regressor MSE: 0.25772464361712627
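
The two errors are almost identical because scikit-learn's `RandomForestRegressor` considers every feature at each split by default, which makes it very close to bagged decision trees. The sketch below shows how `max_features` changes that. It uses synthetic data from `make_friedman1` as a stand-in, so the numbers it prints say nothing about California Housing.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data used only for illustration.
X, y = make_friedman1(n_samples=2000, n_features=10, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

mse_by_max_features = {}
# 1.0 = all features (bagging-like); 0.5 and "sqrt" decorrelate the trees.
for max_features in (1.0, 0.5, "sqrt"):
    rf = RandomForestRegressor(n_estimators=50, max_features=max_features, random_state=42)
    rf.fit(X_train, y_train)
    mse_by_max_features[max_features] = mean_squared_error(y_test, rf.predict(X_test))
    print(f"max_features={max_features}: MSE = {mse_by_max_features[max_features]:.3f}")
```

Which setting works best depends on the data, so `max_features` is usually tuned rather than fixed in advance.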


Question 10: You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:

● Choose between Bagging or Boosting

● Handle overfitting

● Select base models

● Evaluate performance using cross-validation

● Justify how ensemble learning improves decision-making in this real-world
context.


Ans -

1. **Choose Bagging or Boosting:**

   * Start with **Bagging** (e.g., Random Forest) as a robust baseline: it reduces variance and needs little tuning.
   * Use **Boosting** (e.g., XGBoost, LightGBM) if the baseline underfits, i.e., the goal is to reduce bias and capture complex patterns.

2. **Handle Overfitting:**

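   * Use regularization (max\_depth for trees, learning\_rate in boosting).
   * Limit tree size, tune n\_estimators, and apply early stopping in boosting.
   * Use cross-validation to detect overfitting: a large gap between training and validation scores shows the model is memorizing the data.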

3. **Select Base Models:**

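   * Decision Trees (common choice): deep trees for Bagging, shallow trees for Boosting.
   * Also try Logistic Regression as a simple, interpretable baseline, or Gradient Boosted Trees as a stronger learner, depending on data patterns.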

4. **Evaluate Performance (Cross-Validation):**

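   * Use **k-fold cross-validation** (stratified, so every fold keeps the same default rate) for robust performance estimation.
   * Evaluate with metrics like AUC-ROC and Precision-Recall rather than plain accuracy, because defaults are usually a small minority class.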

5. **Justification of Ensemble Learning:**

   * Combines the strengths of multiple models: Bagging mainly reduces variance, Boosting mainly reduces bias.
   * Provides more reliable predictions, lowering the risk of approving likely defaulters or rejecting good borrowers.
   * Improves decision-making by giving financial institutions a **more accurate and risk-aware loan approval system**. Fairness still has to be checked separately, since an ensemble does not guarantee it.
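
The steps above can be sketched end to end as follows. Everything specific here is an assumption made for illustration: the data is synthetic (`make_classification` with about 10% positives standing in for defaults), and the hyperparameter values are placeholders rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: roughly 10% of rows are "defaults".
X, y = make_classification(n_samples=3000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=42)

models = {
    # Bagging-style model: limited depth and class weighting for the imbalance.
    "Random Forest": RandomForestClassifier(
        n_estimators=200, max_depth=8, class_weight="balanced", random_state=42),
    # Boosting model: small learning rate, shallow trees, and early stopping
    # on an internal validation split to limit overfitting.
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=300, learning_rate=0.05, max_depth=3,
        n_iter_no_change=10, validation_fraction=0.1, random_state=42),
}

# Stratified folds keep the default rate the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv,
                            scoring=["roc_auc", "average_precision"])
    results[name] = (scores["test_roc_auc"].mean(),
                     scores["test_average_precision"].mean())
    print(f"{name}: ROC AUC = {results[name][0]:.3f}, "
          f"PR AUC = {results[name][1]:.3f}")
```

On real loan data the same loop would run on the institution's own features, and the model with the better cross-validated scores would then be checked on a held-out test set before deployment.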
