# **Ensemble Learning | Assignment**

# **Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.**

# **Answer:**
Ensemble Learning is a machine learning technique where multiple models (called base learners, or weak learners in the boosting setting) are combined to solve the same problem and produce a better overall model. Instead of relying on a single model, ensemble learning aggregates the predictions of many models to improve accuracy, stability, and robustness.

The key idea behind ensemble learning is the "wisdom of the crowd". Individual models may make mistakes, but when many diverse models are combined, their errors tend to cancel out, provided those errors are not strongly correlated. This usually results in better generalization on unseen data.
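The cancelling-out effect can be illustrated with a small simulation. This is a minimal sketch under an idealised assumption: every model is correct with the same probability and the models err independently of each other, which real models only approximate. The function name `majority_vote_accuracy` and the numbers used are illustrative.

```python
import random

def majority_vote_accuracy(n_models, p_correct, n_trials=20_000, seed=0):
    """Estimate how often a majority vote of independent models is right.

    Each model is assumed to be correct with probability p_correct,
    independently of the others (an idealised assumption).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Count how many of the models get this trial right.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_models))
        if correct_votes > n_models / 2:
            wins += 1
    return wins / n_trials

single = majority_vote_accuracy(n_models=1, p_correct=0.6)
ensemble = majority_vote_accuracy(n_models=25, p_correct=0.6)
print(f"single model: {single:.3f}, 25-model vote: {ensemble:.3f}")
```

Under these assumptions the 25-model vote is right noticeably more often than any single model. When the models' errors are correlated, the gain shrinks.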

Common ensemble strategies include Bagging (for example, Random Forest), Boosting, and Stacking. Ensemble learning is widely used because it can reduce overfitting, improve prediction accuracy, and handle complex datasets effectively.

# **Question 2: What is the difference between Bagging and Boosting?**

# **Answer:**
Bagging (Bootstrap Aggregating) and Boosting are two popular ensemble techniques, but they work in different ways.

**Bagging:**

- Trains multiple models independently.

- Uses bootstrap sampling (random sampling with replacement).

- Every model gets an equal say: predictions are combined by majority vote or simple averaging.

- Reduces variance and helps prevent overfitting.

- Example: Random Forest.

**Boosting:**

- Trains models sequentially.

- Each new model focuses on correcting errors made by previous models.

- Models have different weights based on performance.

- Mainly reduces bias by combining many weak learners into a strong one.

- Example: AdaBoost, Gradient Boosting.

In short, Bagging mainly reduces variance, while Boosting mainly reduces bias.
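The way Boosting makes the next model focus on earlier mistakes can be made concrete with one round of classic discrete AdaBoost. The sketch below uses hypothetical numbers (five equally weighted points, one of them misclassified). Other boosting methods such as Gradient Boosting use a different update.

```python
import math

# Toy setup (hypothetical numbers): 5 training points with equal weights,
# and a weak learner that misclassifies only the last point.
weights = [0.2] * 5
is_wrong = [False, False, False, False, True]

# Weighted error of this weak learner.
error = sum(w for w, wrong in zip(weights, is_wrong) if wrong)

# Discrete AdaBoost: the learner's vote (alpha) grows as its error shrinks.
alpha = 0.5 * math.log((1 - error) / error)

# Up-weight the mistakes, down-weight the correct points, then renormalise.
updated = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, is_wrong)]
total = sum(updated)
updated = [w / total for w in updated]

print(f"error={error:.2f}, alpha={alpha:.3f}")
print("new weights:", [round(w, 3) for w in updated])
```

After the update the single misclassified point carries half of the total weight, so the next weak learner is pushed to get it right. Bagging has no such step: every model is trained independently on its own bootstrap sample.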

# **Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

# **Answer:**
Bootstrap sampling is a statistical technique where multiple datasets are created by randomly sampling from the original dataset with replacement. Each bootstrap sample has the same size as the original dataset, but some data points may appear multiple times, while others may not appear at all.

In Bagging methods like Random Forest, bootstrap sampling plays a crucial role by:

- Creating diverse training datasets for each decision tree.

- Reducing correlation between trees.

- Improving model stability and accuracy.

Because each tree is trained on a different subset of data, Random Forest becomes more robust and less prone to overfitting.
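A bootstrap sample is simple to draw by hand. The following sketch uses only the standard library; `bootstrap_indices` is an illustrative helper, not a scikit-learn function. It draws one sample and reports how many distinct rows ended up in it.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(42)
n_samples = 10_000
sample = bootstrap_indices(n_samples, rng)

# Rows that appear at least once are "in the bag" for this sample.
in_bag = set(sample)
unique_fraction = len(in_bag) / n_samples
print(f"bootstrap size: {len(sample)}")
print(f"distinct rows drawn: {unique_fraction:.1%}")
print(f"rows left out:       {1 - unique_fraction:.1%}")
```

Roughly 63% of the rows appear at least once and the rest are left out. Repeating the draw with a different seed gives a different subset, which is where the diversity between trees comes from.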

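# **Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**

# **Answer:**
Out-of-Bag (OOB) samples are the data points that are not selected in a tree's bootstrap sample during training. For a dataset of n points, a given point is missed by one bootstrap sample with probability (1 − 1/n)^n, which approaches 1/e ≈ 0.368, so on average about 37% of the data is left out for each tree.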

The OOB score is calculated by:

- Predicting each training point using only the trees whose bootstrap sample did not contain it.

- Aggregating those predictions (majority vote or average) and comparing them with the true labels.

The OOB score gives a nearly unbiased estimate of generalization performance without setting aside a separate validation dataset. It is available for bagging-based models such as Random Forest.
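In scikit-learn the OOB estimate is requested with `oob_score=True` and read from the fitted model's `oob_score_` attribute. The sketch below compares it with accuracy on a held-out test set. The dataset, the split, and `n_estimators=200` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# oob_score=True asks the forest to score each training row using only
# the trees whose bootstrap sample did not contain that row.
rf = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=42
)
rf.fit(X_train, y_train)

oob_accuracy = rf.oob_score_
test_accuracy = rf.score(X_test, y_test)
print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The two numbers are usually close, which is why the OOB score is often used as a cheap stand-in for a validation set.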

# **Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**

# **Answer:**
In a single Decision Tree, feature importance is calculated based on how much each feature reduces impurity (Gini or Entropy). However, this importance can be unstable because a single tree is sensitive to small changes in data.

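In a Random Forest, the impurity-based importance of each feature is averaged across many trees, each grown on a different bootstrap sample and typically with a random subset of features considered at each split. This provides:

- More reliable importance scores that vary less from run to run.

- Less dependence on whichever feature happened to win the early splits in one particular tree, so correlated features share the credit.

Thus, Random Forest generally gives more stable feature importance than a single Decision Tree. Both remain impurity-based, so both can still overstate features with many distinct values.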
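The stability claim can be checked empirically. The sketch below refits a single tree and a forest on ten bootstrap resamples of the Breast Cancer data and measures how much each feature's importance moves between runs. The number of runs and the spread measure (standard deviation averaged over features) are choices made here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)

tree_runs, forest_runs = [], []
for seed in range(10):
    # Refit both models on a fresh bootstrap resample of the data.
    X_bs, y_bs = resample(X, y, random_state=seed)
    tree = DecisionTreeClassifier(random_state=seed).fit(X_bs, y_bs)
    forest = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_bs, y_bs)
    tree_runs.append(tree.feature_importances_)
    forest_runs.append(forest.feature_importances_)

# Standard deviation of each feature's importance across runs, averaged over features.
tree_spread = np.std(tree_runs, axis=0).mean()
forest_spread = np.std(forest_runs, axis=0).mean()
print(f"single tree spread:   {tree_spread:.4f}")
print(f"random forest spread: {forest_spread:.4f}")
```

A smaller spread for the forest means its importance ranking changes less when the training data is perturbed.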

# **Question 6: Write a Python program to:**

**● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()**

**● Train a Random Forest Classifier**

**● Print the top 5 most important features based on feature importance scores.**

# **Answer:**

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd


# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target


# Train model
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)


# Feature importance, sorted from most to least important
importance = rf.feature_importances_
features = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': importance
}).sort_values(by='Importance', ascending=False)


# Print top 5 features
print(features.head(5))

                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


# **Question 7: Write a Python program to:**

**● Train a Bagging Classifier using Decision Trees on the Iris dataset**

**● Evaluate its accuracy and compare with a single Decision Tree**

# **Answer:**
In this experiment, we train a single Decision Tree and a Bagging Classifier using Decision Trees on the Iris dataset. The goal is to compare their test accuracies and see whether Bagging improves on the single tree.

In [6]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)

# Train a Bagging Classifier with Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named 'base_estimator' before scikit-learn 1.2
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
bag_pred = bagging.predict(X_test)

# Evaluate accuracy
dt_accuracy = accuracy_score(y_test, dt_pred)
bag_accuracy = accuracy_score(y_test, bag_pred)

print("Decision Tree Accuracy:", dt_accuracy)
print("Bagging Classifier Accuracy:", bag_accuracy)


Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0


# **Comparison & Explanation:**

- The single Decision Tree fits the training data closely and may overfit it.

- The Bagging Classifier combines 50 Decision Trees trained on different bootstrap samples.

- Bagging reduces variance and usually improves generalization.

- In this run both models reach an accuracy of 1.0 on the 45-sample test set, so this split does not separate them. Iris is small and easy to classify, which leaves no room for Bagging to show an advantage here.

# **Conclusion:**

On this split, Bagging matches the single Decision Tree rather than beating it. Its variance-reduction benefit is expected to show up on larger or noisier datasets, where a single tree is more likely to overfit.
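A single 70/30 split of 150 rows is a noisy basis for comparing two models. One hedged alternative is repeated stratified cross-validation, sketched below. The choice of 5 folds and 5 repeats is arbitrary, and the base estimator is passed positionally so the code does not depend on the parameter's name in the installed scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5 folds repeated 5 times with different shuffles gives 25 scores per model.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)

tree = DecisionTreeClassifier(random_state=42)
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)

tree_scores = cross_val_score(tree, X, y, cv=cv, scoring="accuracy")
bag_scores = cross_val_score(bagging, X, y, cv=cv, scoring="accuracy")
print(f"Decision Tree: {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
print(f"Bagging:       {bag_scores.mean():.3f} +/- {bag_scores.std():.3f}")
```

Reporting the mean together with the spread shows whether any gap between the two models is larger than the split-to-split noise.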

# **Question 8: Write a Python program to:**

**● Train a Random Forest Classifier**

**● Tune hyperparameters max_depth and n_estimators using GridSearchCV**

**● Print the best parameters and final accuracy**

# **Answer:**


In [9]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Initialize Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10, 20]
}

# Apply GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

# Train the model
grid_search.fit(X_train, y_train)

# Get the best model
best_rf = grid_search.best_estimator_

# Make predictions on test data
y_pred = best_rf.predict(X_test)

# Print best parameters and final accuracy
print("Best Parameters:", grid_search.best_params_)
print("Final Accuracy:", accuracy_score(y_test, y_pred))


Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 0.9707602339181286


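# **Conclusion:**

GridSearchCV selects the values of n_estimators and max_depth with the best 5-fold cross-validated accuracy. Here the best combination (max_depth=None, n_estimators=100) matches scikit-learn's defaults, so the tuned model is the same as the default model and reaches a test accuracy of about 0.971. A wider grid would be needed to find out whether tuning can improve on the defaults for this dataset.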
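Whether tuning helped can only be judged against a baseline. The sketch below fits a default Random Forest alongside the same grid search and scores both on the same test set. It reuses the dataset, split, and grid from the program above, and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline: default hyperparameters, same seed as the tuned model.
baseline = RandomForestClassifier(random_state=42).fit(X_train, y_train)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10, 20]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
tuned_acc = search.score(X_test, y_test)  # scores the refitted best estimator
print("best params:", search.best_params_)
print(f"baseline test accuracy: {baseline_acc:.4f}")
print(f"tuned test accuracy:    {tuned_acc:.4f}")
```

If the search returns the default values, the two accuracies coincide and the tuning step has confirmed the defaults rather than improved on them.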

# **Question 9: Write a Python program to:**

**● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset**

**● Compare their Mean Squared Errors (MSE)**

# **Answer:**

In this experiment, a Bagging Regressor and a Random Forest Regressor are trained on the California Housing dataset. Their performances are compared using Mean Squared Error (MSE).

# **Python Code:**

In [10]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Bagging Regressor with Decision Trees
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)

# Random Forest Regressor (with the default max_features=1.0, every split considers all features)
rf = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)

# Train models
bagging.fit(X_train, y_train)
rf.fit(X_train, y_train)

# Predictions
bag_pred = bagging.predict(X_test)
rf_pred = rf.predict(X_test)

# Calculate Mean Squared Error
bag_mse = mean_squared_error(y_test, bag_pred)
rf_mse = mean_squared_error(y_test, rf_pred)

# Print results
print("Bagging Regressor MSE:", bag_mse)
print("Random Forest Regressor MSE:", rf_mse)


Bagging Regressor MSE: 0.25787382250585034
Random Forest Regressor MSE: 0.25650512920799395


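# **Comparison & Explanation:**

- Bagging Regressor reduces variance by averaging predictions from multiple decision trees.

- Random Forest Regressor combines bagging with optional random feature selection at each split. With scikit-learn's default max_features=1.0 for regressors, every split considers all features, so the forest trained here behaves like bagged trees.

- The two models also differ in size (50 trees for Bagging, 100 for Random Forest), so the comparison is not like-for-like.

- Random Forest has a slightly lower MSE (0.2565 vs 0.2579), a gap of about 0.5%.

# **Conclusion:**

Random Forest Regressor scores marginally better than Bagging Regressor on this split of the California Housing dataset. The difference is small enough to be explained by the larger number of trees, so it is not strong evidence that one method is better than the other.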

# **Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:**

**● Choose between Bagging or Boosting**

**● Handle overfitting**

**● Select base models**

**● Evaluate performance using cross-validation**

**● Justify how ensemble learning improves decision-making in this real-world context.**

# **Answer:**
As a data scientist working at a financial institution, my goal is to accurately predict whether a customer will default on a loan using demographic and transaction history data. Since loan default prediction is a high-risk and high-impact problem, I would use ensemble learning techniques to improve model performance, reliability, and decision-making. My step-by-step approach is explained below:

1. Choosing Bagging or Boosting:

    The first step is to select the appropriate ensemble method.

    - Bagging (Bootstrap Aggregating) is useful when:

        - The dataset contains noise.

        - The base model tends to overfit.

        - Reducing variance is important.

    - Boosting is useful when:

        - The data has complex patterns.

        - Reducing bias is required.

        - Misclassified defaulters must be handled carefully.

**Decision:**

For loan default prediction, I would start with Bagging using Random Forest because financial data is often noisy. If higher accuracy is required, I would then try Boosting models like Gradient Boosting.

2. Handling Overfitting:

    Overfitting can lead to poor performance on new customers. To control overfitting, I would:

      - Use ensemble models instead of a single model.

      - Limit tree complexity using:

        - max_depth

        - min_samples_leaf

      - Use cross-validation.

      - Use early stopping in boosting models.

      - Compare training and validation performance.

    These steps ensure that the model generalizes well to unseen data.
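Several of these controls can be combined in one model. The sketch below is a minimal example on synthetic data (a hypothetical stand-in for loan records): it limits tree complexity, enables early stopping in Gradient Boosting, and compares training and validation accuracy. All hyperparameter values are illustrative rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data (hypothetical; not real customer records).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=6, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.1,
    max_depth=3,              # shallow trees limit complexity
    min_samples_leaf=20,      # each leaf must cover at least 20 rows
    validation_fraction=0.2,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gb.fit(X_train, y_train)

train_acc = gb.score(X_train, y_train)
val_acc = gb.score(X_val, y_val)
print(f"rounds actually fitted: {gb.n_estimators_} of 500")
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```

A large gap between training and validation accuracy is the signal to tighten `max_depth` or `min_samples_leaf` further.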

3. Selecting Base Models:

    The choice of base models is important for ensemble learning.

    I would select:

      - Decision Trees as base learners because:

        - They handle non-linear relationships well.

        - They work with both numerical and categorical data.

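        - A single tree is easy to interpret, which is important in financial decisions.

    Decision Trees provide a good balance between flexibility and performance. An ensemble of many trees is harder to read than one tree, so interpretation then relies on summaries such as feature importance.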

4. Evaluating Performance Using Cross-Validation:

To evaluate model performance reliably, I would use k-fold cross-validation, stratified so that each fold keeps roughly the same share of defaulters.

Steps:


1. Divide the dataset into k equal parts.

2. Train the model on k−1 folds.

3. Test on the remaining fold.

4. Repeat this process k times.

5. Calculate the average performance.
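The splitting logic in these steps can be sketched in a few lines of plain Python. `k_fold_indices` is an illustrative helper that cuts the rows into contiguous folds without shuffling. In practice a library splitter that shuffles or stratifies, such as scikit-learn's `StratifiedKFold`, would normally be used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {i}: test={test_idx}, train size={len(train_idx)}")
```

Every row lands in the test fold exactly once, so each prediction used in the final average comes from a model that never saw that row.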

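Evaluation metrics:

- Accuracy

- Precision

- Recall

- ROC-AUC

Because defaults are usually much rarer than repayments, accuracy alone can look high even when most defaulters are missed, so recall and ROC-AUC deserve the most attention. Averaging over folds gives a more stable and less optimistic estimate than a single train/test split.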
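The metrics above can disagree sharply when defaulters are rare. The worked example below uses hypothetical confusion-matrix counts for 1,000 loans, 50 of which default. The counts are invented for illustration only.

```python
# Hypothetical confusion counts for 1,000 loans where 50 actually default.
tp, fn = 10, 40    # defaulters caught / defaulters missed
fp, tn = 5, 945    # good customers flagged / good customers cleared

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these counts accuracy is 0.955 even though only 20% of the defaulters are caught (recall 0.200), which shows why accuracy alone is not enough for this problem.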

5. How Ensemble Learning Improves Decision-Making:


Ensemble learning improves decision-making by:

- Combining predictions from multiple models.

- Reducing the risk of wrong loan approvals.

- Improving prediction accuracy and stability.

- Handling complex customer behavior patterns.

- Providing more reliable risk assessment.

In real-world financial systems, ensemble models help institutions make safer and more informed loan decisions.

**Conclusion:**

By carefully choosing between Bagging and Boosting, controlling overfitting, selecting suitable base models, and evaluating performance using cross-validation, ensemble learning can substantially improve loan default prediction. This supports better risk management and more consistent lending decisions for the institution.

# **Python Code:**

In [4]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Create synthetic loan default dataset
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=6,
    random_state=42
)

# Base model
dt = DecisionTreeClassifier(random_state=42)

# Bagging model
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    random_state=42
)

# Boosting model
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    random_state=42
)

# 5-fold cross-validation
dt_acc = cross_val_score(dt, X, y, cv=5, scoring='accuracy')
rf_acc = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
gb_acc = cross_val_score(gb, X, y, cv=5, scoring='accuracy')

print("Decision Tree Accuracy:", np.mean(dt_acc))
print("Random Forest Accuracy:", np.mean(rf_acc))
print("Gradient Boosting Accuracy:", np.mean(gb_acc))


Decision Tree Accuracy: 0.842
Random Forest Accuracy: 0.891
Gradient Boosting Accuracy: 0.882
