# Ensemble Learning

Q1. What is Ensemble Learning in machine learning? Explain the key idea behind it.

Ans- Ensemble learning is a machine learning technique in which multiple models (called base learners; in boosting these are typically weak learners) are trained and their predictions are combined to produce a model that usually performs better than any individual one.


Key Idea Behind Ensemble Learning
- A group of diverse models, when combined intelligently, often performs better than any single model.

Reduces Variance

- Some models (e.g., Decision Trees) are sensitive to data changes

- Combining many such models stabilizes predictions

Reduces Bias

- Weak models can be combined to form a strong model

Improves Accuracy & Robustness

- Errors made by one model may be corrected by others
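
As a rough illustration of the key idea, the simulation below assumes 25 hypothetical classifiers that are each correct 60% of the time and that err independently of one another. Independence is an idealization that real models only approximate, and the model count and accuracy are arbitrary values chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)
n_models, n_points, p_correct = 25, 10_000, 0.6  # illustrative assumptions

# correct[i, j] is True when hypothetical model i gets point j right.
# Each entry is drawn independently, which is the idealizing assumption.
correct = rng.random((n_models, n_points)) < p_correct

single_accuracy = correct[0].mean()

# A majority vote is right whenever more than half of the models are right.
ensemble_accuracy = (correct.sum(axis=0) > n_models / 2).mean()

print(f"Single model accuracy:  {single_accuracy:.3f}")
print(f"Majority vote accuracy: {ensemble_accuracy:.3f}")
```

The more correlated the models' errors are, the smaller this gain becomes, which is why diversity among base learners matters.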

Q2. What is the difference between Bagging and Boosting?

Ans-

| Aspect             | **Bagging (Bootstrap Aggregating)**                               | **Boosting**                                                       |
| ------------------ | ----------------------------------------------------------------- | ------------------------------------------------------------------ |
| Basic idea         | Builds **independent models** on different random subsets of data | Builds models **sequentially**, each correcting previous errors    |
| Data sampling      | Uses **bootstrap sampling** (with replacement)                    | Focuses on previous errors: AdaBoost **reweights misclassified samples**, gradient boosting fits the **residual errors** |
| Model dependency   | Models are **independent** of each other                          | Models are **dependent** on previous models                        |
| Goal               | Reduce **variance**                                               | Reduce **bias** (and variance in some cases)                       |
| Handling errors    | All samples treated **equally**                                   | Misclassified samples get **higher importance**                    |
| Overfitting        | Helps prevent overfitting                                         | Can **overfit** if too many iterations                             |
| Noise sensitivity  | Less sensitive to noise                                           | More sensitive to noisy data and outliers                          |
| Parallel training  | Yes (models can be trained in parallel)                           | No (models trained one after another)                              |
| Typical base model | High-variance models (e.g., Decision Trees)                       | Weak learners (e.g., Decision Stumps)                              |
| Example algorithms | Random Forest                                                     | AdaBoost, Gradient Boosting, XGBoost                               |
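
The contrast in the table can be tried out with a minimal sketch like the one below. The dataset, split, and estimator counts are assumptions chosen for illustration, and the resulting scores say nothing general about which method is better.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Bagging: fully grown trees, each fitted independently on a bootstrap sample
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: AdaBoost's default base learner is a depth-1 tree (a decision stump);
# the stumps are fitted one after another on reweighted data
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

bagging_acc = bagging.fit(X_train, y_train).score(X_test, y_test)
boosting_acc = boosting.fit(X_train, y_train).score(X_test, y_test)

print(f"Bagging accuracy:  {bagging_acc:.3f}")
print(f"AdaBoost accuracy: {boosting_acc:.3f}")
```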


Q3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

Ans- Bootstrap sampling is a resampling technique where multiple new training datasets are created by randomly sampling from the original dataset with replacement.

Each bootstrap sample has the same size as the original dataset, but some records may appear multiple times, while others may not appear at all.
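
A small NumPy sketch of this sampling scheme is shown below. The dataset is just a range of integers standing in for row indices, and the size of 10,000 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
row_ids = np.arange(n)  # stand-in for the rows of a training set

# Draw n rows with replacement: same size as the original, duplicates allowed
sample = rng.choice(row_ids, size=n, replace=True)

unique_fraction = np.unique(sample).size / n
left_out_fraction = 1 - unique_fraction

print(f"Rows appearing at least once: {unique_fraction:.3f}")
print(f"Rows never drawn:             {left_out_fraction:.3f}")
print(f"Theoretical limit 1/e:        {np.exp(-1):.3f}")
```

For large n, the fraction of rows never drawn approaches 1/e, since each row is missed with probability (1 - 1/n)^n.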

Role of Bootstrap Sampling in Bagging

Creates Diversity Among Models

- Each tree is trained on a different bootstrap dataset

- Different data → different trees → model diversity

- Diversity is essential for an effective ensemble

Reduces Variance

- Individual decision trees are high-variance models

- Averaging predictions from many trees trained on bootstrap samples stabilizes results

- Helps prevent overfitting

Enables Out-of-Bag (OOB) Error Estimation

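- Data points not selected in a bootstrap sample (about 37% on average, since the chance that a point is never drawn is (1 − 1/n)^n ≈ 1/e ≈ 0.368) are used as validation data for that tree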

- OOB error provides a built-in accuracy estimate without a separate test set

Foundation of Random Forest

In Random Forest:

- Bootstrap sampling selects different training data for each tree

- Random feature selection further increases diversity

- Together, they make Random Forest robust and accurate
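
One way to see the resulting diversity is to look at which feature each tree chooses for its very first split. The sketch below assumes the Breast Cancer dataset and 100 trees purely for illustration; `bootstrap` and `max_features` are set explicitly to show the two sources of randomness.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

rf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,       # each tree is fitted on its own bootstrap sample
    max_features="sqrt",  # each split considers a random subset of features
    random_state=0,
).fit(data.data, data.target)

# Node 0 of every fitted tree is its root; tree_.feature holds the split feature index
root_features = {data.feature_names[est.tree_.feature[0]] for est in rf.estimators_}

print(f"{len(root_features)} different root features across {len(rf.estimators_)} trees")
```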

Q4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

Ans- Out-of-Bag (OOB) samples are the data points that are not selected in a bootstrap sample when training a model in Bagging-based ensemble methods such as Random Forest.

For each tree in a Random Forest:

- A bootstrap sample is drawn from the dataset

- Some observations are not selected at all

- These unselected observations are the OOB samples for that tree

Step-by-Step Process:

1) For a given data point, collect predictions only from trees where this point was OOB

2) Combine these predictions:

- Majority vote (classification)

- Average (regression)

3) Compare the combined prediction with the true label

4) Repeat for all data points

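5) Compute overall accuracy (or error) across all data points. This is the OOB score, which estimates how the ensemble generalizes without needing a separate validation set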
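
The steps above can be sketched by hand with plain decision trees. This is a simplified illustration under assumed settings (50 trees, Breast Cancer data, majority voting); scikit-learn computes the same kind of estimate when `oob_score=True` is passed to `RandomForestClassifier`, although its internal aggregation may differ in detail.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
n_samples, n_classes, n_trees = len(y), 2, 50
rng = np.random.default_rng(0)

# votes[i, c] counts how many trees predicted class c for point i while i was OOB
votes = np.zeros((n_samples, n_classes), dtype=int)

for t in range(n_trees):
    in_bag = rng.integers(0, n_samples, size=n_samples)  # bootstrap sample
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[in_bag] = False                              # points this tree never saw

    tree = DecisionTreeClassifier(random_state=t).fit(X[in_bag], y[in_bag])
    oob_rows = np.flatnonzero(oob_mask)
    votes[oob_rows, tree.predict(X[oob_rows])] += 1       # step 1: OOB predictions only

has_vote = votes.sum(axis=1) > 0       # points that were OOB for at least one tree
oob_pred = votes.argmax(axis=1)        # step 2: majority vote
oob_accuracy = np.mean(oob_pred[has_vote] == y[has_vote])  # steps 3 to 5

print(f"Manual OOB accuracy: {oob_accuracy:.3f}")
```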


Q5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

Ans-

| Aspect                       | **Single Decision Tree**                                                                   | **Random Forest**                                                 |
| ---------------------------- | ------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- |
| How importance is calculated | Based on **total reduction in impurity** (Gini/Entropy/MSE) from splits using that feature | **Averaged impurity reduction** of a feature across **all trees** |
| Stability                    | **Unstable** (changes with small data variations)                                          | **More stable and reliable**                                      |
| Bias                         | Can be **biased toward high-cardinality features** (many candidate split points)           | Bias is **reduced but not eliminated**                            |
| Overfitting effect           | High risk of overfitting → misleading importance                                           | Lower overfitting → more trustworthy importance                   |
| Use of data                  | Uses **entire dataset once**                                                               | Uses **multiple bootstrap samples**                               |
| Feature interaction capture  | Limited (single structure)                                                                 | Better capture of **feature interactions**                        |
| Sensitivity to noise         | Highly sensitive                                                                           | More robust to noisy features                                     |
| Generalization               | Poorer generalization                                                                      | Better generalization                                             |
| Example output               | One feature may dominate strongly                                                          | Importance is **distributed** across features                     |
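
The comparison can be explored with a short sketch such as the following. The dataset and hyperparameters are assumptions for illustration, and the exact rankings will vary with the random seed.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

# Impurity-based importances from one tree and from a forest of 100 trees
tree_imp = DecisionTreeClassifier(random_state=0).fit(X, y).feature_importances_
forest_imp = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y).feature_importances_

for name, imp in [("Single tree", tree_imp), ("Random forest", forest_imp)]:
    top = np.argsort(imp)[::-1][:3]  # indices of the three largest importances
    print(f"{name}: {int((imp > 0).sum())} features with non-zero importance")
    for i in top:
        print(f"  {data.feature_names[i]}: {imp[i]:.3f}")
```

Both importance vectors are normalized to sum to 1, so the values are comparable as shares of the total impurity reduction.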


Q6. Write a Python program to:
● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.
(Include your Python code and output in the code box below.)



# Ans Q6.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd


data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names


rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)


importances = rf.feature_importances_


feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Print top 5 most important features
print(feature_importance_df.head())


                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


Q7. Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree

#Ans Q7.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score


data = load_iris()
X = data.data
y = data.target


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)


dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_predictions = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_predictions)

# Train a Bagging Classifier using Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=42
)
bagging.fit(X_train, y_train)
bagging_predictions = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_predictions)

# Print accuracies
print("Decision Tree Accuracy:", dt_accuracy)
print("Bagging Classifier Accuracy:", bagging_accuracy)


Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0


Q8. Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy

#Ans Q8.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score


data = load_breast_cancer()
X = data.data
y = data.target


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Initialize Random Forest classifier
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10]
}


grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

# Train the model
grid_search.fit(X_train, y_train)

# Get the best model
best_rf = grid_search.best_estimator_


y_pred = best_rf.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

# Print best parameters and final accuracy
print("Best Parameters:", grid_search.best_params_)
print("Final Accuracy:", final_accuracy)


Best Parameters: {'max_depth': 5, 'n_estimators': 150}
Final Accuracy: 0.9707602339181286


Q9. Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
● Compare their Mean Squared Errors (MSE)

#Ans Q9.


from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error


data = fetch_california_housing()
X = data.data
y = data.target


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


bagging_regressor = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=42
)

bagging_regressor.fit(X_train, y_train)
y_pred_bagging = bagging_regressor.predict(X_test)


mse_bagging = mean_squared_error(y_test, y_pred_bagging)


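# Note: RandomForestRegressor considers all features at each split by default
# (max_features=1.0), so with default settings it behaves much like bagged trees,
# which is why the two MSE values come out so close.
random_forest_regressor = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)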

random_forest_regressor.fit(X_train, y_train)
y_pred_rf = random_forest_regressor.predict(X_test)

# Calculate MSE for Random Forest Regressor
mse_rf = mean_squared_error(y_test, y_pred_rf)


print("Mean Squared Error (Bagging Regressor):", mse_bagging)
print("Mean Squared Error (Random Forest Regressor):", mse_rf)


Mean Squared Error (Bagging Regressor): 0.25592438609899626
Mean Squared Error (Random Forest Regressor): 0.2553684927247781


Q10. You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
- Choose between Bagging or Boosting
- Handle overfitting
- Select base models
- Evaluate performance using cross-validation
- Justify how ensemble learning improves decision-making in this real-world context

Ans-

1. Choosing Between Bagging and Boosting

Understanding the Problem

- Loan default data is usually imbalanced, noisy, and non-linear

- False negatives (missing a defaulter) are costly

Decision Strategy

- Bagging reduces variance and is robust to noise

- Boosting focuses on hard-to-classify samples and reduces bias

Choice

- Start with Bagging (Random Forest) for stability

- Use Boosting (Gradient Boosting / AdaBoost) to improve performance

Random Forest is often a sensible first choice in financial settings because its predictions are stable and its feature importances are straightforward to report.

2. Handling Overfitting

Overfitting can lead to risky credit decisions.

Techniques Used

- Limit tree depth (max_depth)

- Set minimum samples per leaf

- Use ensemble averaging (Bagging)

- Use cross-validation

- Use early stopping (Boosting)
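
Several of these controls can be combined in one model. The sketch below uses scikit-learn's `GradientBoostingClassifier` on synthetic imbalanced data that merely stands in for loan records; all hyperparameter values are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for loan data (roughly 15% "defaults")
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.85, 0.15], random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of boosting rounds
    learning_rate=0.1,
    max_depth=3,              # shallow trees limit model complexity
    min_samples_leaf=20,      # avoid leaves fitted to a handful of customers
    validation_fraction=0.2,  # held out internally to monitor the loss
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
).fit(X, y)

print("Boosting rounds actually used:", gb.n_estimators_)
```

If early stopping triggers, `n_estimators_` is smaller than the upper bound of 500.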

3. Selecting Base Models

Criteria

- Handle non-linear patterns

- Work well with mixed data types

- Easy to interpret

Selected Base Models

- Decision Trees (for Bagging & Boosting)

- Logistic Regression (benchmark & explainability)

Decision Trees are chosen because:

- They capture interactions in demographic & transaction data

- Ensembles of trees reduce individual model weaknesses
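
A benchmark comparison along these lines might look like the sketch below. It again assumes synthetic data in place of real loan records, and the scaling step and hyperparameters are choices made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for loan data (roughly 15% "defaults")
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    weights=[0.85, 0.15], random_state=42
)

# Interpretable linear benchmark; scaling helps the solver converge
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# Tree ensemble that can capture non-linear interactions
forest = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42)

# For classifiers, cv=5 uses stratified folds by default
baseline_auc = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc").mean()
forest_auc = cross_val_score(forest, X, y, cv=5, scoring="roc_auc").mean()

print(f"Logistic Regression ROC-AUC: {baseline_auc:.3f}")
print(f"Random Forest ROC-AUC:       {forest_auc:.3f}")
```

If the ensemble does not clearly beat the benchmark, the simpler and more explainable model may be the better choice.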

4. Performance Evaluation using Cross-Validation

Why Cross-Validation?

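- Reduces the risk of an overly optimistic estimate from a single lucky train/test split

- Shows how much performance varies across folds, which indicates model stability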

Method

- Stratified K-Fold Cross-Validation

- Maintains default / non-default ratio

Metrics Used

- ROC-AUC → Ranking default risk

- Accuracy → Overall correctness (can be misleading on imbalanced data, so it should be read alongside recall on the default class)
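
The sketch below shows why the choice of metric matters on imbalanced data. It compares a trivial model that always predicts the majority class with a Random Forest under stratified 5-fold cross-validation; the synthetic dataset and all settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data (roughly 15% "defaults", labelled 1)
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    weights=[0.85, 0.15], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "roc_auc", "recall"]

models = {
    "Always predict non-default": DummyClassifier(strategy="most_frequent"),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the five folds
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, {m: round(float(v), 3) for m, v in results[name].items()})
```

The trivial model reaches high accuracy while catching no defaulters at all, which ROC-AUC and recall expose immediately.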

5. Why Ensemble Learning Improves Decision-Making

Business Justification

- Combines multiple models → less chance that one model's errors drive a credit decision

- More stable predictions for high-value loans

- Handles noisy transactional data

- Improves approval quality

- Supports regulatory explainability (feature importance, SHAP)

#Ans Q10.


from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier


X, y = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    weights=[0.85, 0.15],   # Imbalanced dataset
    random_state=42
)


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=6,
    random_state=42
)

rf_auc = cross_val_score(
    rf_model, X, y, cv=cv, scoring="roc_auc"
)


gb_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

gb_auc = cross_val_score(
    gb_model, X, y, cv=cv, scoring="roc_auc"
)

# Print Results
print("Random Forest Mean ROC-AUC:", rf_auc.mean())
print("Gradient Boosting Mean ROC-AUC:", gb_auc.mean())


Random Forest Mean ROC-AUC: 0.9400632464462253
Gradient Boosting Mean ROC-AUC: 0.9509973193896281
