# Ensemble Learning | Assignment

1. What is Ensemble Learning in machine learning? Explain the key idea
behind it.
   - Ensemble learning is a technique in which multiple models, often called "learners" or "base models," are combined to solve the same problem, with the goal of outperforming any single model. The key idea is that aggregating the predictions of several diverse models yields more accurate and robust predictions, because different models capture different patterns in the data and make different errors, and errors that are not shared tend to cancel out when the predictions are combined. The main combination strategies are bagging, which mainly reduces variance; boosting, which mainly reduces bias; and stacking, which trains a meta-model on the base models' predictions. In each case the aim is better generalization on unseen data.
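
The error-cancelling argument can be illustrated with a small calculation. This is only a sketch under an idealized assumption that is not in the answer above: the models vote independently and each is correct with the same probability. Real base models are correlated, so actual gains are smaller. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes every model is right with the same probability and that their
    errors are independent (an idealization).
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Accuracy of the vote as the ensemble grows, with each model 60% accurate.
for n in (1, 5, 25, 101):
    print(f"{n:>3} models: {majority_vote_accuracy(n, 0.6):.4f}")
```

With five such models the vote is right about 68% of the time, and the figure keeps rising as more independent models are added.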

2. What is the difference between Bagging and Boosting?
   - Bagging and Boosting are both ensemble learning techniques, but they differ in how they build and combine models. Bagging (short for Bootstrap Aggregating) trains its models independently, and therefore potentially in parallel, each on a bootstrap sample of the training data (a random sample drawn with replacement). The outputs are combined by majority voting (for classification) or averaging (for regression), which reduces variance and helps prevent overfitting. Boosting, in contrast, builds models sequentially, so each new model depends on the previous ones and focuses on correcting their errors. In AdaBoost this is done by increasing the weights of misclassified training instances; in gradient boosting, each new model is fit to the residual errors (gradients of the loss) of the current ensemble. Boosting primarily reduces bias and can also reduce variance. It is often more accurate than Bagging, but it is more sensitive to noisy labels and more prone to overfitting if the number of rounds, learning rate, and tree depth are not tuned.
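
As a rough side-by-side, the sketch below fits one bagging ensemble and one boosting ensemble with scikit-learn and scores both with 5-fold cross-validation. The dataset (Breast Cancer), the choice of AdaBoost as the boosting method, and the ensemble size of 50 are assumptions made for illustration; the scores from one dataset should not be read as a general ranking.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: independent decision trees (the default base model), each fit on
# its own bootstrap sample; predictions are combined by voting.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: depth-1 trees (the AdaBoost default) fit one after another, with
# misclassified instances given more weight in the next round.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

bagging_scores = cross_val_score(bagging, X, y, cv=5)
boosting_scores = cross_val_score(boosting, X, y, cv=5)

print(f"Bagging  mean CV accuracy: {bagging_scores.mean():.4f}")
print(f"Boosting mean CV accuracy: {boosting_scores.mean():.4f}")
```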

3. What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?
   - Bootstrap sampling is a statistical technique in which random samples are drawn from a dataset with replacement. Each bootstrap sample is usually the same size as the original dataset, and because the draws are made with replacement, some data points appear multiple times while others do not appear at all. In Bagging methods like Random Forest, each individual model (decision tree) is trained on its own bootstrap sample. Because every tree sees a slightly different version of the training data, the trees learn different patterns and make different errors. Random Forest adds further diversity by considering only a random subset of features at each split. This diversity is key to Bagging's effectiveness: when the predictions are combined, by majority voting for classification or averaging for regression, the ensemble has lower variance, generalizes better, and is more robust to overfitting than a single model trained on the full dataset.
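
A bootstrap sample can be drawn in a few lines of NumPy. The sketch below assumes a toy dataset of 10,000 row indices (a size chosen only for illustration) and checks how many distinct rows end up in one sample; the expected share is about 63%, since each row is missed with probability (1 - 1/n)^n, which approaches 1/e.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # size of the hypothetical training set

# Draw n row indices with replacement: this is one bootstrap sample.
indices = rng.integers(0, n, size=n)

# Rows that were drawn at least once, and how often each one was drawn.
unique_rows, counts = np.unique(indices, return_counts=True)
unique_fraction = unique_rows.size / n

print(f"Distinct rows in the sample: {unique_fraction:.1%}")
print(f"Most-repeated row appears {counts.max()} times")
```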

4. What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?
   - Out-of-Bag (OOB) samples are the data points that are not included in a particular bootstrap sample during the training of ensemble models like Random Forests. Because bootstrap sampling is done with replacement, on average about 36.8% (roughly one-third) of the original dataset is left out of each sample. These left-out instances are the OOB samples for the model trained on that sample. The OOB score is an internal validation estimate: each training instance is predicted using only the models for which it was out-of-bag, and these aggregated predictions are compared with the true labels to estimate the performance of the ensemble. This gives an efficient estimate of generalization performance without a separate validation set or cross-validation, which is especially useful when data is limited.
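
In scikit-learn the OOB estimate is available by passing `oob_score=True` to the forest and reading `oob_score_` after fitting. The sketch below assumes the Breast Cancer dataset and 200 trees, both picked only for illustration, and also computes the expected left-out share for a dataset of that size.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each training row using only the
# trees whose bootstrap sample did not contain that row.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print(f"OOB accuracy estimate: {forest.oob_score_:.4f}")

# Chance that a given row is missed by one bootstrap sample of size n.
n = len(X)
left_out = (1 - 1 / n) ** n
print(f"Expected OOB share per tree: {left_out:.3f}")
```

Enough trees are needed for every row to be out-of-bag at least a few times; with very small forests the estimate becomes unreliable.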

5. Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.
   - Feature importance in a single Decision Tree is based on how much each feature reduces impurity (such as Gini impurity or entropy) across the splits where it is used, weighted by the number of samples reaching those splits. The more a feature is used for splits that yield large impurity reductions, the higher its importance score. However, this analysis can be unstable, especially if the tree is deep or overfitted, because small changes in the data can significantly change which features are selected for splitting; among correlated features, one may receive nearly all the credit while the others receive almost none. A Random Forest, an ensemble of many decision trees trained on different bootstrap samples with a random subset of features considered at each split, provides a more reliable measure. Averaging each feature's importance across all trees reduces the variance of the estimate and spreads credit among correlated features. Averaging does not remove the known bias of impurity-based importance toward high-cardinality features, so permutation importance is a useful cross-check. Overall, the forest gives a more stable view of which features matter across the dataset.
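
The stability difference can be checked empirically. The sketch below is one possible experiment, not a standard procedure: it refits a single tree and a 50-tree forest on five bootstrap resamples of the Breast Cancer data and compares how much each feature's importance moves between refits. The dataset, the number of resamples, and the spread measure (mean per-feature standard deviation) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

tree_runs, forest_runs = [], []
for seed in range(5):
    # Perturb the training data by resampling rows with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    X_boot, y_boot = X[idx], y[idx]

    tree = DecisionTreeClassifier(random_state=seed).fit(X_boot, y_boot)
    forest = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X_boot, y_boot)

    tree_runs.append(tree.feature_importances_)
    forest_runs.append(forest.feature_importances_)

# Standard deviation of each feature's importance across refits, averaged
# over features: a larger value means less stable importances.
tree_spread = np.std(tree_runs, axis=0).mean()
forest_spread = np.std(forest_runs, axis=0).mean()

print(f"Single tree   spread: {tree_spread:.4f}")
print(f"Random forest spread: {forest_spread:.4f}")
```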

6. Write a Python program to:
    
    ● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()

    ● Train a Random Forest Classifier
    
    ● Print the top 5 most important features based on feature importance scores.


from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train a Random Forest Classifier
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Get feature importance scores
importances = model.feature_importances_

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort features by importance in descending order
top_features = feature_importance_df.sort_values(by='Importance', ascending=False).head(5)

# Print the top 5 most important features
print("Top 5 Most Important Features:")
print(top_features.to_string(index=False))


Top 5 Most Important Features:
             Feature  Importance
          worst area    0.139357
worst concave points    0.132225
 mean concave points    0.107046
        worst radius    0.082848
     worst perimeter    0.080850


7. Write a Python program to:

    ● Train a Bagging Classifier using Decision Trees on the Iris dataset

    ● Evaluate its accuracy and compare with a single Decision Tree

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a single Decision Tree
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
tree_preds = tree.predict(X_test)
tree_accuracy = accuracy_score(y_test, tree_preds)

# Train a Bagging Classifier using Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named `base_estimator` before scikit-learn 1.2
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
bagging_preds = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_preds)

# Print results
print(f"Accuracy of Single Decision Tree: {tree_accuracy:.4f}")
print(f"Accuracy of Bagging Classifier:   {bagging_accuracy:.4f}")


8. Write a Python program to:

    ● Train a Random Forest Classifier

    ● Tune hyperparameters max_depth and n_estimators using GridSearchCV

    ● Print the best parameters and final accuracy


from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define the model
rf = RandomForestClassifier(random_state=42)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [2, 4, 6, None]
}

# Setup GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit the model
grid_search.fit(X_train, y_train)

# Get best model
best_model = grid_search.best_estimator_

# Make predictions and evaluate
y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", grid_search.best_params_)
print(f"Final Accuracy on Test Set: {final_accuracy:.4f}")


Best Parameters: {'max_depth': 2, 'n_estimators': 10}
Final Accuracy on Test Set: 1.0000


9. Write a Python program to:

    ● Train a Bagging Regressor and a Random Forest Regressor on the California
    Housing dataset
    
    ● Compare their Mean Squared Errors (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

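# Initialize regressors
# Note: BaggingRegressor defaults to 10 trees and RandomForestRegressor to 100,
# so this comparison uses different ensemble sizes.
bagging = BaggingRegressor(random_state=42)
random_forest = RandomForestRegressor(random_state=42)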

# Train regressors
bagging.fit(X_train, y_train)
random_forest.fit(X_train, y_train)

# Predict on test set
y_pred_bagging = bagging.predict(X_test)
y_pred_rf = random_forest.predict(X_test)

# Calculate MSE
mse_bagging = mean_squared_error(y_test, y_pred_bagging)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print results
print(f"Bagging Regressor MSE: {mse_bagging:.4f}")
print(f"Random Forest Regressor MSE: {mse_rf:.4f}")


Bagging Regressor MSE: 0.2824
Random Forest Regressor MSE: 0.2554


10. You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:

    ● Choose between Bagging or Boosting

    ● Handle overfitting

    ● Select base models

    ● Evaluate performance using cross-validation

    ● Justify how ensemble learning improves decision-making in this real-world context.

   - To predict loan default, I would first choose between Bagging and Boosting based on the characteristics of the data and the dominant source of error. Tabular financial data often contains complex, nonlinear patterns that simple models underfit, so I would lean towards Boosting methods like XGBoost or LightGBM, which sequentially correct earlier errors and often yield higher accuracy. If the labels were very noisy or the model showed high variance, a Bagging method such as Random Forest would be the safer baseline. To handle overfitting, I would limit tree depth, use a low learning rate, apply early stopping, and subsample rows or features to add randomness. For base models, decision trees are the usual choice because they capture nonlinear relationships and interactions among demographic and transaction features. To evaluate performance and tune hyperparameters, I would use stratified k-fold cross-validation, which preserves the default rate in every fold, and report metrics such as ROC-AUC and F1-score, which are more informative than accuracy when defaults are rare. Ensemble learning improves decision-making here by combining many models to reduce variance and bias, leading to more accurate and stable predictions of default risk. This helps the institution assess credit risk, minimize loan losses, and make informed lending decisions, balancing profitability with risk management.
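
The workflow described above can be sketched end to end. Everything concrete here is an assumption made for illustration: the data is synthetic (`make_classification` with roughly 10% positives standing in for defaults), scikit-learn's `GradientBoostingClassifier` stands in for XGBoost or LightGBM, and the hyperparameter values are placeholders rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for loan data (about 10% "defaults").
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Boosted shallow trees with the overfitting controls named above:
# limited depth, low learning rate, row subsampling, and early stopping.
model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
    n_iter_no_change=10,      # stop when the internal validation score stalls
    validation_fraction=0.1,  # share of each training fold held out for that check
    random_state=0,
)

# Stratified folds keep the default rate the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(model, X, y, cv=cv, scoring=["roc_auc", "f1"])

mean_auc = results["test_roc_auc"].mean()
mean_f1 = results["test_f1"].mean()
print(f"ROC-AUC: {mean_auc:.4f}")
print(f"F1:      {mean_f1:.4f}")
```

On real loan data the same structure would apply, with the classification threshold chosen from the relative cost of a missed default versus a rejected good customer.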