# Ensemble Learning

# 1. What is Ensemble Learning in machine learning? Explain the key idea behind it.

  -> Ensemble Learning is a machine learning technique that combines the predictions of multiple models to achieve better overall performance than a single model.

   Key Idea: Combine several weak learners to form a strong learner. Common approaches:

   * Bagging (Bootstrap Aggregation) – reduces variance
   * Boosting – reduces bias
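
A small calculation can make the "weak learners into a strong learner" idea concrete. The sketch below assumes an idealized setting that is not stated in the answer above: every model is correct with the same probability and the models make their errors independently. The helper name `majority_vote_accuracy` is hypothetical. Real models are correlated, so actual gains are usually smaller.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes a binary task, identical accuracy p_correct for every model,
    and independent errors (an idealization).
    """
    # Smallest number of correct votes that forms a strict majority
    k_min = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Each model is only slightly better than chance (60% accurate)
for n in (1, 5, 25, 101):
    print(f"{n:>3} models -> majority accuracy {majority_vote_accuracy(n, 0.6):.4f}")
```

Under these assumptions the majority vote becomes more accurate as models are added, which is the intuition behind aggregating many imperfect learners.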

# 2. What is the difference between Bagging and Boosting?

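  -> Bagging: Builds multiple models independently (and therefore in parallel) on bootstrap samples and aggregates their predictions, mainly to reduce variance.

   Boosting: Builds models sequentially, each one correcting the errors of the previous ones, mainly to reduce bias (it can reduce variance as well).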

# 3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

  -> Bootstrap sampling is a technique where multiple datasets are created from the original dataset by randomly sampling with replacement. This means some data points may appear multiple times, while others may be left out in each sample.
  
  Role in Bagging (e.g., Random Forest):

   * Each individual model (such as each decision tree) is trained on its own bootstrap sample of the data.

   * This introduces diversity among the models, which helps reduce variance and improves overall model stability and accuracy.

   * The data points not selected in a bootstrap sample are called Out-of-Bag (OOB) samples and can be used to estimate model performance without a separate validation set.
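
A minimal NumPy sketch of drawing one bootstrap sample and finding its out-of-bag rows is shown below. The sample size and seed are arbitrary choices for illustration. For a large dataset, roughly 63% of the distinct rows are expected to appear in a bootstrap sample (the limit is 1 - 1/e), leaving about 37% out-of-bag.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000
indices = np.arange(n_samples)

# One bootstrap sample: draw n_samples row indices *with replacement*
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn are the out-of-bag (OOB) rows for this sample
oob_idx = np.setdiff1d(indices, bootstrap_idx)

unique_fraction = np.unique(bootstrap_idx).size / n_samples
oob_fraction = oob_idx.size / n_samples

print(f"Distinct rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"Out-of-bag rows:                       {oob_fraction:.3f}")
```

In Bagging, this draw is repeated once per model, so every model sees a different mix of repeated and missing rows.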


# 4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

  -> Out-of-Bag (OOB) Samples:

   OOB samples are the data points not included in the bootstrap sample used to train a particular model in Bagging (such as one tree in a Random Forest).

  OOB Score:

   * After training, each data point is predicted using only the models that did not see it during training, i.e., the models for which it is an OOB sample.

   * The OOB score is the accuracy (or error) of these aggregated predictions over the training points that were out-of-bag for at least one model.

   * It provides an internal validation metric without needing a separate validation set, helping evaluate the ensemble model’s performance.
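
In scikit-learn, the OOB estimate can be requested with `oob_score=True` and read from the fitted model's `oob_score_` attribute. The sketch below uses the Breast Cancer dataset and 200 trees as illustrative assumptions, and holds out a test set only to show how the OOB estimate compares with a conventional one.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# bootstrap=True is required for OOB scoring (it is also the default)
rf = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,
    oob_score=True,
    random_state=42,
)
rf.fit(X_train, y_train)

# Accuracy computed from predictions made only by trees that did not
# train on the corresponding row
oob_accuracy = rf.oob_score_

# Conventional held-out accuracy, for comparison
test_accuracy = rf.score(X_test, y_test)

print(f"OOB accuracy:  {oob_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

The per-row OOB class probabilities are also available in `rf.oob_decision_function_`, which can be useful for computing metrics other than accuracy.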

# 5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

  -> Single Decision Tree:

     * Feature importance is calculated from how much a feature reduces impurity (such as Gini or entropy) across the splits where it is used.
     * It can be unstable, as small changes in the data may drastically change the tree structure and the importance values.

  Random Forest:

     * Feature importance is averaged over all trees in the forest, making it more stable and reliable.
     * It captures feature contributions better because each tree sees a different bootstrap sample and considers a random subset of features at each split.
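
One way to probe the stability claim is to refit each model on several bootstrap resamples of the same data and measure how much the importance values move. The sketch below is one possible check, not a standard procedure: the helper `importances_on_resample`, the five resamples, and the "average standard deviation" summary are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)


def importances_on_resample(model, seed):
    """Fit `model` on one bootstrap resample of (X, y) and return its importances."""
    X_boot, y_boot = resample(X, y, random_state=seed)
    return model.fit(X_boot, y_boot).feature_importances_


seeds = range(5)

# One row of importances per resample, one column per feature
tree_imps = np.array(
    [importances_on_resample(DecisionTreeClassifier(random_state=0), s) for s in seeds]
)
forest_imps = np.array(
    [
        importances_on_resample(
            RandomForestClassifier(n_estimators=100, random_state=0), s
        )
        for s in seeds
    ]
)

# How much each feature's importance varies across resamples, averaged over features
tree_spread = tree_imps.std(axis=0).mean()
forest_spread = forest_imps.std(axis=0).mean()

print(f"Average spread, single tree:   {tree_spread:.4f}")
print(f"Average spread, random forest: {forest_spread:.4f}")
```

A smaller spread for the forest would be consistent with the point above; the exact numbers depend on the data and the seeds.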

In [1]:
# 6. Write a Python program to:
# ● Load the Breast Cancer dataset using
#   sklearn.datasets.load_breast_cancer()
# ● Train a Random Forest Classifier
# ● Print the top 5 most important features based on feature importance scores.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

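# Load the dataset and keep the feature names for reporting
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train a Random Forest on the full dataset
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Impurity-based importances, one value per feature (they sum to 1)
importances = rf.feature_importances_

# Pair each feature name with its importance score
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort by importance and keep the five highest-ranked features
top5_features = feature_importance_df.sort_values(by='Importance', ascending=False).head(5)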

print("Top 5 Important Features:")
print(top5_features)


Top 5 Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


In [3]:
# 7. Write a Python program to:
# ● Train a Bagging Classifier using Decision Trees on the Iris dataset
# ● Evaluate its accuracy and compare with a single Decision Tree
# (Include your Python code and output in the code box below.)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)

# Train a Bagging Classifier using Decision Trees
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)
bag.fit(X_train, y_train)
y_pred_bag = bag.predict(X_test)
accuracy_bag = accuracy_score(y_test, y_pred_bag)

# Print accuracies
print(f"Accuracy of single Decision Tree: {accuracy_dt:.4f}")
print(f"Accuracy of Bagging Classifier: {accuracy_bag:.4f}")




Accuracy of single Decision Tree: 1.0000
Accuracy of Bagging Classifier: 1.0000


In [4]:
# 8. Write a Python program to:
# ● Train a Random Forest Classifier
# ● Tune hyperparameters max_depth and n_estimators using GridSearchCV
# ● Print the best parameters and final accuracy
# (Include your Python code and output in the code box below.)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 2, 4, 6]
}

# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_

# Evaluate final model on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", best_params)
print(f"Final Accuracy: {accuracy:.4f}")


Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 1.0000


In [6]:
# 9. Write a Python program to:
# ● Train a Bagging Regressor and a Random Forest Regressor on the California
# Housing dataset
# ● Compare their Mean Squared Errors (MSE)
# (Include your Python code and output in the code box below.)
# Imports for data loading, the two ensemble regressors, and evaluation
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagging Regressor
bag_reg = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=100, random_state=42)
bag_reg.fit(X_train, y_train)
y_pred_bag = bag_reg.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

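# Random Forest Regressor (with scikit-learn's default max_features, the regressor
# considers all features at each split, so results close to bagged trees are expected)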
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print(f"Mean Squared Error of Bagging Regressor: {mse_bag:.4f}")
print(f"Mean Squared Error of Random Forest Regressor: {mse_rf:.4f}")



Mean Squared Error of Bagging Regressor: 0.2568
Mean Squared Error of Random Forest Regressor: 0.2565


# 10. You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context

  -> 1. Choose between Bagging or Boosting:

     * Bagging (e.g., Random Forest) is suitable when the base models overfit (high variance); averaging many of them stabilizes predictions.

     * Boosting (e.g., AdaBoost, Gradient Boosting, XGBoost, LightGBM) is suitable when simple models underfit (high bias) and we want to improve weak learners sequentially.

     * I would start with Boosting, as predicting loan default often benefits from reducing bias and focusing on misclassified “high-risk” cases.

  2. Handle Overfitting:
     * Use cross-validation to monitor performance.
     * Limit model complexity with max_depth, min_samples_split, and regularization parameters.
     * Apply early stopping in boosting methods.

  3. Select Base Models:

     * For Bagging, fully grown Decision Trees are typical: they are low-bias, high-variance learners, so they benefit most from averaging, and they suit structured/tabular data.

     * For Boosting, shallow Decision Trees are preferred to avoid overfitting while allowing sequential learning.

  4. Evaluate Performance Using Cross-Validation:

     * Use stratified k-fold cross-validation (e.g., 5 or 10 folds) to estimate the model’s performance robustly while keeping the default rate similar in every fold.

     * Track metrics relevant to loan default, such as precision, recall, F1-score, and AUC-ROC; accuracy alone can be misleading because the dataset is likely to be imbalanced.


  5. Justify How Ensemble Learning Improves Decision-Making:

    * Ensemble methods combine multiple models to reduce variance, bias, or both, resulting in more accurate and stable predictions.

    * In financial institutions, this means better identification of high-risk borrowers, reducing default rates and financial loss.

    * Ensembles help mitigate the risk of relying on a single model that might underperform on certain customer segments, making decision-making more robust and reliable.
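
The evaluation steps above can be sketched as follows. No loan dataset is given here, so the example uses `make_classification` to generate a synthetic, imbalanced stand-in (roughly 10% positives playing the role of defaults). The model settings (shallow trees, a small learning rate, subsampling, and early stopping via `n_iter_no_change`) are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: about 10% of rows are "defaults" (class 1)
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=5,
    weights=[0.9, 0.1],
    random_state=42,
)

# Boosting with overfitting controls: shallow trees, shrinkage, row
# subsampling, and early stopping on an internal validation split
model = GradientBoostingClassifier(
    n_estimators=300,         # upper bound; early stopping may use fewer
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=42,
)

# Stratified folds keep the default rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["roc_auc", "precision", "recall", "f1"]
results = cross_validate(model, X, y, cv=cv, scoring=scoring)

for metric in scoring:
    scores = results[f"test_{metric}"]
    print(f"{metric:>9}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On real loan data, the same loop would typically sit inside a pipeline that also handles preprocessing, so that every step is refit within each fold.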