**Ensemble Learning - Assignment**

**Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.**

Ensemble learning is a machine learning technique that combines multiple models to produce a more accurate and robust prediction than any single model alone.

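The key idea is that diverse models make different, partly uncorrelated errors, so when their predictions are averaged or put to a vote, those errors tend to cancel out. Depending on the method, this lowers variance (as in bagging), bias (as in boosting), or both, and makes the final prediction more reliable.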
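The error-cancelling effect can be illustrated with a small calculation. This is a sketch under strong simplifying assumptions that are not part of the answer above: a binary task, an odd number of models, and models whose errors are fully independent and that are each correct with the same probability `p`. The helper name `majority_vote_accuracy` is made up for this example.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes every model is correct with probability p, independently of
    the others (an idealization; real models are correlated).
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Accuracy of the majority vote as the ensemble grows
for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

For three models at `p = 0.7`, the vote is correct with probability 0.7³ + 3 · 0.7² · 0.3 = 0.784, already above any single model. In practice the gain is smaller because model errors are correlated, which is why diversity among the models matters.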

**Question 2: What is the difference between Bagging and Boosting?**

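Bagging trains multiple models independently (and therefore in parallel) on different bootstrap samples of the data, while boosting trains models sequentially, with each new model trying to correct the errors of the ones before it.

Bagging mainly reduces variance by averaging these independent models, whereas boosting mainly reduces bias by focusing each new model on the examples the current ensemble gets wrong (for example, AdaBoost increases the weights of misclassified data points).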
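The contrast can be tried out with scikit-learn. The snippet below is only an illustrative sketch: the dataset, the split, and the hyperparameters are arbitrary choices made for this example, and AdaBoost stands in for boosting because it reweights misclassified points as described above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Bagging: fully grown trees, each fitted independently on a bootstrap sample
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: decision stumps (the default base learner) fitted one after another,
# with misclassified training points given more weight at each round
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

scores = {}
for name, model in [("bagging", bagging), ("boosting", boosting)]:
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.4f}")
```

Note the different base learners: bagging starts from low-bias, high-variance trees and averages the variance away, while boosting starts from high-bias stumps and reduces the bias step by step.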

**Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

Bootstrap sampling is a resampling technique where multiple datasets are created from a single dataset by randomly sampling with replacement.

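This process is crucial for Bagging (Bootstrap Aggregating) methods like Random Forest because it gives each model (e.g., each decision tree) a different training set. The resulting trees are less correlated with one another, so averaging them reduces variance and limits overfitting. Random Forest decorrelates the trees further by considering only a random subset of features at each split.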
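A short NumPy sketch shows what a single bootstrap sample looks like. The sample size and the seed are arbitrary choices for illustration; the point is that sampling with replacement leaves roughly 1 - 1/e (about 63%) of the rows distinct, so every model sees a somewhat different dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Draw row indices with replacement, as one bootstrap sample would
indices = np.arange(n_samples)
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

# Fraction of original rows that appear at least once in the sample
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print(f"Distinct rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"Theoretical limit 1 - 1/e:             {1 - np.exp(-1):.3f}")
```

The rows that never get drawn (roughly 37%) are exactly the ones that become out-of-bag samples for that model.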

**Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**

Out-of-Bag (OOB) samples are data points from the original dataset that are not included in a specific bootstrap sample used to train one tree in an ensemble model like a Random Forest.

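The OOB score is an evaluation metric computed from these samples: each training point is predicted using only the trees that did not see it during training, and the aggregated predictions are compared with the true labels. It therefore acts as a built-in validation set and gives a performance estimate without holding out separate data.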
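In scikit-learn the OOB estimate is available by passing `oob_score=True`. The following is a minimal sketch; the dataset and the number of trees are assumptions chosen for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each training row using only
# the trees whose bootstrap sample did not contain that row
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

oob_accuracy = forest.oob_score_  # accuracy for classifiers
print(f"OOB accuracy: {oob_accuracy:.4f}")
```

Using a few hundred trees helps here, since with very few trees some rows may never be out-of-bag and the estimate becomes unreliable.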

**Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**


**Stability and Robustness:** A single Decision Tree's feature importance can be highly unstable and sensitive to small changes in the training data, leading to fluctuating importance scores. Random Forests, by averaging feature importances across multiple trees trained on bootstrapped samples, provide more stable and robust estimates that are less prone to noise or minor data variations.

**Overfitting Bias:** A single Decision Tree is prone to overfitting, which can lead to inflated importance scores for features that might be highly relevant only to the specific training data. Random Forests mitigate this by considering feature importance across diverse trees, reducing the bias towards features that might be overfit in a single tree.
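One way to look at the stability difference is to refit both models on several resampled versions of the same data and measure how much the importance scores move. This is a rough sketch, not a rigorous experiment: the dataset, the five resamples, and the spread measure (standard deviation per feature, averaged over features) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)

tree_runs, forest_runs = [], []
for seed in range(5):
    # Perturb the training data by drawing a bootstrap resample of the rows
    X_boot, y_boot = resample(X, y, random_state=seed)
    tree = DecisionTreeClassifier(random_state=0).fit(X_boot, y_boot)
    forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_boot, y_boot)
    tree_runs.append(tree.feature_importances_)
    forest_runs.append(forest.feature_importances_)

# Average (over features) of the run-to-run standard deviation of each importance
tree_spread = np.std(tree_runs, axis=0).mean()
forest_spread = np.std(forest_runs, axis=0).mean()
print(f"Single tree importance spread:   {tree_spread:.4f}")
print(f"Random forest importance spread: {forest_spread:.4f}")
```

A smaller spread means the ranking of features changes less when the training data changes slightly.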

**Question 6: Write a Python program to:**

- Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
- Train a Random Forest Classifier
- Print the top 5 most important features based on feature importance scores.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
feature_names = cancer.feature_names

# Train a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance scores
importances = rf.feature_importances_

# Create a pandas Series to associate feature names with their importance scores
feature_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)

# Print the top 5 most important features
print("Top 5 most important features:")
print(feature_importances.head(5))

Top 5 most important features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


**Question 7: Write a Python program to:**
- Train a Bagging Classifier using Decision Trees on the Iris dataset
- Evaluate its accuracy and compare with a single Decision Tree

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 1. Train and evaluate a single Decision Tree Classifier
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)
y_pred_single = single_tree.predict(X_test)
accuracy_single_tree = accuracy_score(y_test, y_pred_single)
print(f"Accuracy of a single Decision Tree: {accuracy_single_tree:.4f}")

# 2. Train and evaluate a Bagging Classifier with Decision Trees as base estimators
# n_estimators: number of base estimators (Decision Trees) in the ensemble
# estimator: the base learner to bag (this parameter was named base_estimator before scikit-learn 1.2)
bagging_classifier = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=42),
                                       n_estimators=10,
                                       random_state=42)
bagging_classifier.fit(X_train, y_train)
y_pred_bagging = bagging_classifier.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
print(f"Accuracy of Bagging Classifier: {accuracy_bagging:.4f}")

# Compare the accuracies
if accuracy_bagging > accuracy_single_tree:
    print("\nThe Bagging Classifier performed better than the single Decision Tree.")
elif accuracy_bagging < accuracy_single_tree:
    print("\nThe single Decision Tree performed better than the Bagging Classifier.")
else:
    print("\nThe Bagging Classifier and the single Decision Tree achieved the same accuracy.")

Accuracy of a single Decision Tree: 1.0000
Accuracy of Bagging Classifier: 1.0000

The Bagging Classifier and the single Decision Tree achieved the same accuracy.


**Question 8: Write a Python program to:**
- Train a Random Forest Classifier
- Tune hyperparameters max_depth and n_estimators using GridSearchCV
- Print the best parameters and final accuracy

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris # Using a sample dataset for demonstration

# Load a sample dataset (Iris dataset)
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Define the parameter grid for GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [None, 10, 20]      # Maximum depth of the tree
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Perform GridSearchCV to find the best hyperparameters
grid_search.fit(X_train, y_train)

# Print the best parameters found by GridSearchCV
print("Best Parameters:", grid_search.best_params_)

# Get the best estimator (model with the best parameters)
best_rf_model = grid_search.best_estimator_

# Make predictions on the test set using the best model
y_pred = best_rf_model.predict(X_test)

# Calculate and print the final accuracy
final_accuracy = accuracy_score(y_test, y_pred)
print("Final Accuracy with best parameters:", final_accuracy)

Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy with best parameters: 1.0


**Question 9: Write a Python program to:**
- Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
- Compare their Mean Squared Errors (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
california_housing = fetch_california_housing(as_frame=True)
X = california_housing.data
y = california_housing.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Bagging Regressor
bagging_regressor = BaggingRegressor(random_state=42)
bagging_regressor.fit(X_train, y_train)

# Make predictions with the Bagging Regressor
y_pred_bagging = bagging_regressor.predict(X_test)

# Calculate Mean Squared Error for Bagging Regressor
mse_bagging = mean_squared_error(y_test, y_pred_bagging)

# Initialize and train the Random Forest Regressor
random_forest_regressor = RandomForestRegressor(random_state=42)
random_forest_regressor.fit(X_train, y_train)

# Make predictions with the Random Forest Regressor
y_pred_random_forest = random_forest_regressor.predict(X_test)

# Calculate Mean Squared Error for Random Forest Regressor
mse_random_forest = mean_squared_error(y_test, y_pred_random_forest)

# Print the Mean Squared Errors
print(f"Mean Squared Error (Bagging Regressor): {mse_bagging:.4f}")
print(f"Mean Squared Error (Random Forest Regressor): {mse_random_forest:.4f}")

# Compare and determine the better model
if mse_bagging < mse_random_forest:
    print("The Bagging Regressor performed better (lower MSE).")
elif mse_random_forest < mse_bagging:
    print("The Random Forest Regressor performed better (lower MSE).")
else:
    print("Both regressors achieved the same MSE.")

Mean Squared Error (Bagging Regressor): 0.2824
Mean Squared Error (Random Forest Regressor): 0.2554
The Random Forest Regressor performed better (lower MSE).


**Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.**

You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
- Choose between Bagging or Boosting
- Handle overfitting
- Select base models
- Evaluate performance using cross-validation
- Justify how ensemble learning improves decision-making in this real-world context.

To predict loan defaults, start from a simple baseline model and choose the ensemble type from how it fails: Bagging (e.g., Random Forest) when the baseline overfits and variance is the main problem, or Boosting (e.g., Gradient Boosting) when it underfits and bias is the main problem.

Limit overfitting with techniques like regularization, pruning or depth limits on the trees, and early stopping for boosted models, and use cross-validation to get a less biased estimate of performance on unseen customers.

Select base models like decision trees or logistic regression, which are well understood and reasonably interpretable for this task, then combine them with the chosen ensemble method. Because an ensemble captures more diverse patterns in the data than a single model, its default predictions are more accurate and more stable, which supports more consistent lending decisions.
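The comparison step described above can be sketched with cross-validation. Everything specific here is an assumption for illustration: the data is synthetic (`make_classification` with roughly 10% positives standing in for defaults), ROC AUC is used as the metric because accuracy is misleading on imbalanced classes, and the hyperparameters are arbitrary starting points rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for loan data: about 10% of rows are "defaults"
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)

# Stratified folds keep the default rate similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    # Bagging-style model: depth limit as a guard against overfitting
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=8, class_weight="balanced", random_state=0
    ),
    # Boosting model: shallow trees plus early stopping on a validation split
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=200, max_depth=3, n_iter_no_change=10, random_state=0
    ),
}

cv_auc = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    cv_auc[name] = fold_scores.mean()
    print(f"{name}: mean ROC AUC = {cv_auc[name]:.4f} (std {fold_scores.std():.4f})")
```

Comparing the mean and the spread of the fold scores gives a basis for choosing between the two approaches before any model is evaluated on a final held-out test set.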