Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.

Ans:-
Ensemble Learning is a machine learning paradigm where multiple models (called base learners) are combined to solve the same problem, with the goal of achieving better performance, stability, and generalization than any single model alone.

Key Idea Behind Ensemble Learning

- “Many weak learners together can form a strong learner.”

Instead of relying on one model:

- Train several different models

- Combine their predictions intelligently

- Reduce individual model errors

This works because different models tend to make different mistakes, and combining them helps cancel out errors.
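The error-cancelling effect can be illustrated with a small simulation. This is only a sketch under an idealized assumption that is not part of the answer above: every base learner errs independently and has the same accuracy. The names `n_models` and `p_correct` and the value 0.6 are illustrative. Real models are correlated, so real gains are smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models = 25      # illustrative number of base learners (odd, so no ties)
n_points = 10_000  # number of simulated test points
p_correct = 0.6    # assumed accuracy of every individual learner

# correct[i, j] is True when learner i classifies point j correctly.
# Independence between learners is the key (idealized) assumption.
correct = rng.random((n_models, n_points)) < p_correct

single_accuracy = correct[0].mean()
# A majority vote is right whenever more than half of the learners are right.
ensemble_accuracy = (correct.sum(axis=0) > n_models / 2).mean()

print(f"Single learner accuracy: {single_accuracy:.3f}")
print(f"Majority-vote accuracy:  {ensemble_accuracy:.3f}")
```

Under these assumptions the majority vote is noticeably more accurate than any single learner, even though each learner is only slightly better than chance.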

Why Ensemble Learning Works

Ensembles improve performance by reducing the main sources of prediction error:
- Bias (underfitting)
- Variance (overfitting)
- Sensitivity to noise in the training data


Question 2: What is the difference between Bagging and Boosting?

Ans:- Difference Between Bagging and Boosting

Both Bagging and Boosting are ensemble learning techniques, but they differ fundamentally in how models are trained and how errors are handled.

1. Bagging (Bootstrap Aggregating)

Key Idea

- Reduce variance by training models independently on different random subsets of data.

How It Works

1. Create multiple bootstrap samples (sampling with replacement).
2. Train a base model on each sample independently.
3. Combine predictions by:

- Majority voting (classification)
- Averaging (regression)
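The three steps above can be sketched by hand. This is a minimal illustration, not how scikit-learn implements `BaggingClassifier` internally; the dataset, the number of models (`n_models = 25`), and the train/test split are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a binary majority vote cannot tie
n_train = len(X_train)

# Steps 1 and 2: draw a bootstrap sample and fit one tree on it, independently.
trees = []
for _ in range(n_models):
    idx = rng.integers(0, n_train, size=n_train)  # sampling with replacement
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: combine by majority vote (labels are 0/1, so the mean vote suffices).
votes = np.stack([tree.predict(X_test) for tree in trees])  # (n_models, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_accuracy = (bagged_pred == y_test).mean()

print(f"Bagged accuracy: {bagged_accuracy:.3f}")
```

Because every tree is fitted independently of the others, the loop could be run in parallel without changing the result.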

Characteristics

- Models are trained in parallel
- Each model has equal weight
- Does not focus on difficult samples

Best Suited For

- High-variance models
- Models prone to overfitting

2. Boosting

Key Idea

- Reduce bias by training models sequentially, focusing on previous errors.

How It Works

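1. Train a weak learner on the dataset.
2. Increase the weights of misclassified samples (AdaBoost), or compute the residual errors of the current ensemble (Gradient Boosting).
3. Train the next model with emphasis on these difficult samples.
4. Combine the models using a weighted vote or weighted sum.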
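The sequential procedure can be sketched with a hand-written discrete AdaBoost loop. This is an illustrative sketch, not the scikit-learn implementation; the dataset, the use of depth-1 trees as weak learners, and `n_rounds = 20` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y01 = load_breast_cancer(return_X_y=True)
y = np.where(y01 == 1, 1, -1)  # AdaBoost is easiest to write with labels in {-1, +1}

n_rounds = 20
weights = np.full(len(y), 1 / len(y))  # start with uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Fit a weak learner (depth-1 tree) on the currently weighted data.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error rate and the resulting vote weight of this learner.
    err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)

    # Increase the weights of misclassified samples, decrease the rest.
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote of all weak learners.
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
ensemble_acc = (np.sign(scores) == y).mean()
first_acc = (stumps[0].predict(X) == y).mean()

print(f"First stump training accuracy: {first_acc:.3f}")
print(f"Boosted training accuracy:     {ensemble_acc:.3f}")
```

Unlike bagging, each round depends on the weights produced by the previous round, so the loop cannot be parallelized across models.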

Characteristics

- Models are trained sequentially
- Each model has a different weight
- Strongly focuses on hard-to-classify points

Best Suited For

- High-bias models
- Complex patterns

Examples

- AdaBoost
- Gradient Boosting
- XGBoost / LightGBM



Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

Ans:-
Bootstrap sampling is a resampling technique where multiple datasets are created by randomly sampling from the original dataset with replacement.

Each bootstrap sample has the same size as the original dataset, but may contain duplicate records and leave out some original samples.

How Bootstrap Sampling Works

Given a dataset of size N:

- Randomly sample N points with replacement
- Some observations appear multiple times
- Some observations are not selected at all

On average, for large N:

- ~63.2% of the original observations appear at least once in a given bootstrap sample
- ~36.8% of the original observations are left out (the out-of-bag samples)
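These percentages follow from a short calculation: the probability that a given observation is never picked in N draws is (1 - 1/N)^N, which approaches 1/e ≈ 0.368 as N grows. The sketch below checks this numerically; the dataset size `n` and the number of repetitions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000        # illustrative dataset size
n_repeats = 200   # number of bootstrap samples to average over

unique_fractions = []
for _ in range(n_repeats):
    idx = rng.integers(0, n, size=n)  # N draws with replacement
    unique_fractions.append(np.unique(idx).size / n)

in_bag = np.mean(unique_fractions)
print(f"Average fraction of distinct points in a sample: {in_bag:.3f}")
print(f"Average out-of-bag fraction:                     {1 - in_bag:.3f}")
print(f"Theoretical limit 1 - 1/e:                       {1 - np.exp(-1):.3f}")
```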

Role of Bootstrap Sampling in Bagging
1. Creates Diversity Among Models

- Each model sees a different version of the data
- Leads to less correlated models

Why is this important?
- An ensemble improves on its members only if they make different errors.

2. Reduces Variance

- Individual models (e.g., decision trees) have high variance
- Averaging predictions from multiple bootstrap-trained models stabilizes results

3. Enables Parallel Training

- Bootstrap samples are independent
- Models can be trained in parallel

Bootstrap Sampling in Random Forest

Random Forest applies two levels of randomness:

1. Data Randomness (Bootstrap Sampling)

- Each decision tree is trained on a bootstrap sample

2. Feature Randomness

- At each split, a random subset of features is considered

This double randomness makes the trees less correlated, so averaging them reduces variance more than bagging alone.
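One rough way to look at the effect of feature randomness is to compare how strongly the individual trees' predictions agree with and without it. The sketch below is an illustration under assumptions not in the answer above: the dataset, the forest size, and the helper `mean_tree_correlation` are all hypothetical choices, and the exact numbers depend on the data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

def mean_tree_correlation(max_features):
    """Fit a forest and return the mean pairwise correlation of its trees' predictions."""
    forest = RandomForestClassifier(
        n_estimators=30,
        bootstrap=True,             # data randomness
        max_features=max_features,  # feature randomness (None = all features)
        random_state=0,
    ).fit(X_train, y_train)
    # One row of predicted labels per tree.
    preds = np.stack([tree.predict(X_test) for tree in forest.estimators_])
    corr = np.corrcoef(preds)
    upper = corr[np.triu_indices_from(corr, k=1)]  # count each pair once
    return upper.mean()

corr_bagging_only = mean_tree_correlation(None)
corr_random_forest = mean_tree_correlation("sqrt")

print(f"Bootstrap only (all features per split): {corr_bagging_only:.3f}")
print(f"Bootstrap + random feature subsets:      {corr_random_forest:.3f}")
```

With `max_features=None` the forest reduces to plain bagging of trees, which makes it a convenient baseline for the comparison.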

Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

Ans:- Out-of-Bag (OOB) samples arise naturally in Bagging-based ensemble methods such as Random Forest because of bootstrap sampling.

What Are Out-of-Bag (OOB) Samples?

- In bootstrap sampling, each model is trained on a dataset created by sampling with replacement from the original dataset.

As a result:

- Some samples appear multiple times
- Some samples are not selected at all

The samples not used to train a particular model are called Out-of-Bag (OOB) samples for that model.

Key Fact

- On average, ~36.8% of the original data is OOB for each model.

How OOB Score Is Used to Evaluate Ensemble Models
Step-by-Step Process

For each data point:

1. Identify all trees for which this data point was OOB.
2. Make predictions using only those trees.
3. Aggregate the predictions:

- Majority vote (classification)
- Mean (regression)

4. Compare the aggregated prediction with the true label.

OOB Score

- The OOB score is the overall accuracy (or R² for regression) computed using only OOB predictions.
- It serves as an internal validation metric.

Why OOB Score Is Useful
1. No Need for a Separate Validation Set

- Saves data (important when data is limited)
- Especially valuable in biomedical and small-sample problems

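2. Nearly Unbiased Performance Estimate

- Each sample is evaluated only by models that never saw it during training
- Similar in spirit to cross-validation, though it can be slightly pessimistic because each prediction uses only a subset of the trees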

3. Computationally Efficient

- Comes “for free” during training
- No extra resampling required


Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

Ans:- Feature importance explains which input features most influence a model’s predictions.

Both Decision Trees and Random Forests provide feature importance, but they differ significantly in stability, reliability, and interpretation.

1. Feature Importance in a Single Decision Tree
How It Is Computed

- Based on reduction in impurity (Gini, Entropy, or MSE)

- Features used near the root usually get higher importance

- Importance is calculated from one tree only

Characteristics

- Highly sensitive to data variations: importances can change drastically with small changes in the data

- Reflects any overfitting of the tree itself

- Biased toward features that offer many possible split points:

- Continuous variables

- High-cardinality categorical variables

2. Feature Importance in Random Forest
How It Is Computed

- Average of impurity reduction across many trees

- Each tree sees:

- A different bootstrap sample

- A random subset of features at each split

Characteristics

- Much more stable than a single tree

- Less sensitive to noise

- Reflects feature relevance across many trees and data subsamples, not one particular tree structure

- Still biased toward high-cardinality features (for impurity-based importance)
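The stability difference can be probed empirically by refitting both models on resampled versions of the data and measuring how much each feature's importance moves. This is a rough sketch under assumptions: the dataset, the number of resamples (`n_resamples = 10`), and the use of the mean standard deviation as a spread measure are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n_resamples = 10  # illustrative number of perturbed datasets

tree_imps, forest_imps = [], []
for _ in range(n_resamples):
    idx = rng.integers(0, len(y), size=len(y))  # perturb the data by resampling
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[idx], y[idx])
    tree_imps.append(tree.feature_importances_)
    forest_imps.append(forest.feature_importances_)

# Standard deviation of each feature's importance across resamples,
# averaged over features: lower means more stable importances.
tree_spread = np.std(tree_imps, axis=0).mean()
forest_spread = np.std(forest_imps, axis=0).mean()

print(f"Single tree   mean std of importances: {tree_spread:.4f}")
print(f"Random forest mean std of importances: {forest_spread:.4f}")
```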



Question 6: Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.


In [2]:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# 1. Load the Breast Cancer dataset
bcs = load_breast_cancer()
X = bcs.data
y = bcs.target
feature_names = bcs.feature_names

print("Dataset loaded successfully.")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")

# 2. Train a Random Forest Classifier
# Using a random state for reproducibility
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X, y)

print("\nRandom Forest Classifier trained successfully.")

# 3. Print the top 5 most important features based on feature importance scores.
# Get feature importances from the trained model
importances = rf_classifier.feature_importances_

# Create a pandas Series for easier sorting and display
feature_importances = pd.Series(importances, index=feature_names)

# Sort the features by importance in descending order and get the top 5
top_5_features = feature_importances.nlargest(5)

print("\nTop 5 most important features:")
print(top_5_features)

Dataset loaded successfully.
Number of samples: 569
Number of features: 30

Random Forest Classifier trained successfully.

Top 5 most important features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


In [None]:
# Question 7: Write a Python program to:
# Train a Bagging Classifier using Decision Trees on the Iris dataset
# Evaluate its accuracy and compare with a single Decision Tree




In [4]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

print("Iris dataset loaded successfully.")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("\nData split into training and testing sets.")

# 2. Train a single Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

# Make predictions and evaluate accuracy for single Decision Tree
y_pred_dt = dt_classifier.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)

print("\nSingle Decision Tree Classifier trained successfully.")
print(f"Accuracy of single Decision Tree: {accuracy_dt:.4f}")

# 3. Train a Bagging Classifier using Decision Trees
# Base estimator is a Decision Tree. We'll use 10 base estimators (n_estimators=10)
bag_classifier = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=10,
    random_state=42,
    bootstrap=True,  # Sample with replacement (this is the default)
    n_jobs=-1  # Use all available CPU cores
)
bag_classifier.fit(X_train, y_train)

# Make predictions and evaluate accuracy for Bagging Classifier
y_pred_bag = bag_classifier.predict(X_test)
accuracy_bag = accuracy_score(y_test, y_pred_bag)

print("\nBagging Classifier trained successfully.")
print(f"Accuracy of Bagging Classifier: {accuracy_bag:.4f}")

# 4. Compare the accuracies
print("\n--- Comparison ---")
if accuracy_bag > accuracy_dt:
    print(f"The Bagging Classifier (Accuracy: {accuracy_bag:.4f}) performed better than the single Decision Tree (Accuracy: {accuracy_dt:.4f}).")
elif accuracy_bag < accuracy_dt:
    print(f"The single Decision Tree (Accuracy: {accuracy_dt:.4f}) performed better than the Bagging Classifier (Accuracy: {accuracy_bag:.4f}).")
else:
    print(f"Both models performed equally well (Accuracy: {accuracy_bag:.4f}).")


Iris dataset loaded successfully.
Number of samples: 150
Number of features: 4

Data split into training and testing sets.

Single Decision Tree Classifier trained successfully.
Accuracy of single Decision Tree: 1.0000

Bagging Classifier trained successfully.
Accuracy of Bagging Classifier: 1.0000

--- Comparison ---
Both models performed equally well (Accuracy: 1.0000).


In [None]:
# Question 8: Write a Python program to:
# Train a Random Forest Classifier
# Tune hyperparameters max_depth and n_estimators using GridSearchCV
# Print the best parameters and final accuracy


In [6]:
# Import necessary libraries for Random Forest and GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

print("Assuming Iris dataset (X_train, X_test, y_train, y_test) is already loaded from previous steps.")

# 1. Initialize a Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# 2. Define the hyperparameter grid for GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10, 15]
}

# 3. Set up GridSearchCV
# cv=5 means 5-fold cross-validation
# verbose=1 shows some progress messages
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1, scoring='accuracy')

# Fit GridSearchCV to the training data
print("\nStarting GridSearchCV to find the best hyperparameters...")
grid_search.fit(X_train, y_train)

print("GridSearchCV completed.")

# 4. Print the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"\nBest Hyperparameters: {best_params}")
print(f"Best Cross-validation Accuracy: {best_score:.4f}")

# 5. Evaluate the model with the best parameters on the test set
best_rf_model = grid_search.best_estimator_
y_pred_tuned = best_rf_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred_tuned)

print(f"Final Accuracy on Test Set with Best Parameters: {final_accuracy:.4f}")

Assuming Iris dataset (X_train, X_test, y_train, y_test) is already loaded from previous steps.

Starting GridSearchCV to find the best hyperparameters...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
GridSearchCV completed.

Best Hyperparameters: {'max_depth': None, 'n_estimators': 100}
Best Cross-validation Accuracy: 0.9429
Final Accuracy on Test Set with Best Parameters: 1.0000


In [None]:
# Question 9: Write a Python program to:
# Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
# Compare their Mean Squared Errors (MSE)


In [10]:
# Import Required Libraries
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

# Load the California Housing Dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Train–Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a Bagging Regressor
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

bagging_reg.fit(X_train, y_train)

# Predictions
y_pred_bagging = bagging_reg.predict(X_test)

# MSE
mse_bagging = mean_squared_error(y_test, y_pred_bagging)
print("Bagging Regressor MSE:", mse_bagging)

# Train a Random Forest Regressor

rf_reg = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

rf_reg.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_reg.predict(X_test)

# MSE
mse_rf = mean_squared_error(y_test, y_pred_rf)
print("Random Forest Regressor MSE:", mse_rf)

# Compare the Results
print("\nMSE Comparison:")
print(f"Bagging Regressor MSE: {mse_bagging:.4f}")
print(f"Random Forest Regressor MSE: {mse_rf:.4f}")



Bagging Regressor MSE: 0.25592438609899626
Random Forest Regressor MSE: 0.2553684927247781

MSE Comparison:
Bagging Regressor MSE: 0.2559
Random Forest Regressor MSE: 0.2554


Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.

Explain your step-by-step approach to:
- Choose between Bagging or Boosting
- Handle overfitting
- Select base models
- Evaluate performance using cross-validation
- Justify how ensemble learning improves decision-making in this real-world context.

Ans:-
Step-by-Step Ensemble Strategy for Loan Default Prediction

Loan default prediction is a high-stakes, imbalanced, and noisy real-world problem. Ensemble learning can improve accuracy and stability, which supports the risk control that lending decisions require.

1. Choosing Between Bagging and Boosting

Step 1: Understand Data Characteristics

- Mixed features: demographics + transaction behavior
- Non-linear relationships
- Class imbalance (few defaults)
- Risk of overfitting

Step 2: Match the Method to the Dominant Error

- Bagging (Random Forest) when the base model overfits and variance dominates
- Boosting (Gradient Boosting) when the base model underfits and bias dominates
- Train both and keep the one with the better cross-validated ROC-AUC

2. Handling Overfitting

Techniques Used

A. Model-Level Controls

- Random Forest: limit tree depth and set a minimum number of samples per leaf
- Boosting: use a small learning rate, early stopping, and row subsampling

B. Data-Level Controls

- Feature scaling only where required (tree-based models do not need it)
- Handle class imbalance with class weights or SMOTE (if justified)

C. Validation Controls

- Stratified cross-validation
- Out-of-Bag (OOB) error for bagging
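
The boosting controls listed above (learning rate, early stopping, subsampling) map onto parameters of scikit-learn's `GradientBoostingClassifier`. The sketch below is only an illustration: the synthetic data from `make_classification` stands in for real loan data, and all parameter values are assumptions, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic, imbalanced stand-in for loan data (about 18% positives).
X, y = make_classification(
    n_samples=3000, n_features=10, n_informative=5,
    weights=[0.82, 0.18], random_state=0,
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.05,       # smaller steps: slower, more regularized fit
    max_depth=3,              # shallow trees as weak learners
    subsample=0.8,            # each round sees a random 80% of the rows
    validation_fraction=0.2,  # held out internally to monitor the loss
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
).fit(X, y)

print(f"Boosting rounds actually used: {gb.n_estimators_} of {gb.n_estimators}")
```

When early stopping triggers, `n_estimators_` is smaller than the requested `n_estimators`, which caps model complexity without a manual search over the number of rounds.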

3. Selecting Base Models

Why Tree-Based Models?

- Handle non-linearity
- Work well with mixed data types
- No strict assumptions

4. Evaluating Performance Using Cross-Validation
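Why Accuracy Is Not Enough

- Defaults are the minority class, so a model that predicts "no default" for every customer already scores high accuracy. We therefore focus on risk-aware metrics such as ROC-AUC, precision, and recall for the default class, estimated with stratified k-fold cross-validation.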
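
Several metrics can be collected in one stratified cross-validation run with `cross_validate`. This is a minimal sketch under assumptions: the synthetic dataset stands in for real loan data, and the model settings and the choice of metrics are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for loan data (about 18% positives).
X, y = make_classification(
    n_samples=3000, n_features=10, n_informative=5,
    weights=[0.82, 0.18], random_state=0,
)

model = RandomForestClassifier(
    n_estimators=200, min_samples_leaf=10,
    class_weight="balanced", random_state=0, n_jobs=-1,
)

# Stratified folds keep the default rate roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "roc_auc", "average_precision", "recall"]
results = cross_validate(model, X, y, cv=cv, scoring=scoring)

for metric in scoring:
    scores = results[f"test_{metric}"]
    print(f"{metric:>18}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread across folds next to the mean shows how much the estimate depends on the particular split.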


In [9]:
# 1. Import Required Libraries
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    roc_auc_score, classification_report, confusion_matrix
)

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Load / Simulate Loan Default Dataset

np.random.seed(42)

# Simulated dataset
n_samples = 5000
X = pd.DataFrame({
    "age": np.random.randint(21, 65, n_samples),
    "income": np.random.normal(50000, 15000, n_samples),
    "loan_amount": np.random.normal(20000, 8000, n_samples),
    "credit_score": np.random.normal(650, 70, n_samples),
    "transaction_count": np.random.poisson(30, n_samples),
    "late_payments": np.random.poisson(2, n_samples)
})

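# Target: 1 = Default, 0 = No Default (imbalanced)
# Note: y is drawn independently of the features above, so no model can
# beat chance here; ROC-AUC values near 0.5 are the expected outcome.
y = np.random.binomial(1, 0.18, n_samples)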

# Train–Test Split (Stratified)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    stratify=y,
    random_state=42
)

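# Baseline Model (Single Decision Tree)
# Note: scaling does not change a tree's splits; the scaler only matters
# if a scale-sensitive classifier is swapped into this pipeline.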

dt_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", DecisionTreeClassifier(random_state=42))
])

dt_pipeline.fit(X_train, y_train)
y_pred_dt = dt_pipeline.predict(X_test)

print("Single Decision Tree Performance:")
print(classification_report(y_test, y_pred_dt))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_dt))

# Bagging Approach – Random Forest

rf_model = RandomForestClassifier(
    n_estimators=300,
    max_depth=8,
    min_samples_leaf=30,
    class_weight="balanced",
    oob_score=True,
    random_state=42
)

rf_model.fit(X_train, y_train)

rf_pred = rf_model.predict_proba(X_test)[:, 1]

print("Random Forest ROC-AUC:",
      roc_auc_score(y_test, rf_pred))
print("OOB Score:", rf_model.oob_score_)

# Boosting Approach – Gradient Boosting

gb_model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
    random_state=42
)

gb_model.fit(X_train, y_train)

gb_pred = gb_model.predict_proba(X_test)[:, 1]

print("Gradient Boosting ROC-AUC:",
      roc_auc_score(y_test, gb_pred))

# Cross-Validation Evaluation (Risk-Aware)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rf_cv_auc = cross_val_score(
    rf_model, X_train, y_train,
    scoring="roc_auc",
    cv=cv
)

gb_cv_auc = cross_val_score(
    gb_model, X_train, y_train,
    scoring="roc_auc",
    cv=cv
)

print("RF CV ROC-AUC:", rf_cv_auc.mean())
print("GB CV ROC-AUC:", gb_cv_auc.mean())

# Final Model Evaluation (Confusion Matrix)

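# 0.5 is the default cut-off; with imbalanced classes it can flag very few
# defaults, so in practice it is tuned to the cost of missed defaults.
threshold = 0.5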
final_preds = (gb_pred > threshold).astype(int)

print("Confusion Matrix:\n",
      confusion_matrix(y_test, final_preds))

print("\nClassification Report:\n",
      classification_report(y_test, final_preds))

# Feature Importance (Explainability)

importances = pd.Series(
    rf_model.feature_importances_,
    index=X.columns
).sort_values(ascending=False)

print("Feature Importance (Random Forest):")
print(importances)



Single Decision Tree Performance:
              precision    recall  f1-score   support

           0       0.81      0.79      0.80      1018
           1       0.17      0.19      0.18       232

    accuracy                           0.68      1250
   macro avg       0.49      0.49      0.49      1250
weighted avg       0.69      0.68      0.69      1250

Confusion Matrix:
[[807 211]
 [188  44]]
Random Forest ROC-AUC: 0.5000169365219158
OOB Score: 0.6674666666666667
Gradient Boosting ROC-AUC: 0.49721817627532006
RF CV ROC-AUC: 0.4995066190982561
GB CV ROC-AUC: 0.4934798197463907
Confusion Matrix:
 [[1017    1]
 [ 232    0]]

Classification Report:
               precision    recall  f1-score   support

           0       0.81      1.00      0.90      1018
           1       0.00      0.00      0.00       232

    accuracy                           0.81      1250
   macro avg       0.41      0.50      0.45      1250
weighted avg       0.66      0.81      0.73      1250

Feature Impor