# Ensemble Learning | Assignment

1. **What is Ensemble Learning in machine learning? Explain the key idea behind it.**
     - Ensemble learning is a machine learning technique that combines the predictions of multiple models to produce a final result that is typically more accurate and robust than that of any single member model. The key idea is that a group of weak learners, when combined properly, can outperform an individual strong learner, provided their errors are not strongly correlated.
In other words, ensemble methods exploit the diversity among models (by averaging, voting, or stacking their outputs) to reduce errors caused by bias, variance, or noise in the data.

     - Common ensemble techniques include:

        - **Bagging (Bootstrap Aggregating)**: Trains multiple models on bootstrap samples of the data (random samples drawn with replacement) and combines their results (e.g., Random Forest).

        - **Boosting**: Sequentially builds models that focus on correcting the errors of previous ones (e.g., AdaBoost, Gradient Boosting).

        - **Stacking**: Combines predictions from multiple models using another model (meta-learner) to make the final decision.
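
The benefit of combining weak learners can be illustrated with a small calculation. The sketch below assumes an idealized setting that real models never fully satisfy: binary classification, every model correct with the same probability, and fully independent errors. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is correct."""
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


# Each individual model is right only 60% of the time (odd counts avoid ties).
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the vote becomes more accurate as models are added. In practice, models trained on the same data make correlated errors, so the gain is smaller, which is why ensemble methods work to keep their members diverse.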

2.  **What is the difference between Bagging and Boosting?**

     - Both methods combine many base models, but they differ in how those models are trained and combined.
        - **Bagging (Bootstrap Aggregating)**: Several models (such as Decision Trees) are trained independently, each on a different bootstrap sample of the training data. The final prediction is the average (for regression) or the majority vote (for classification) of all models. Bagging mainly reduces variance, which helps against overfitting.
        Example: Random Forest.  
        - **Boosting**: Models are trained sequentially, and each new model focuses on correcting the errors made by the previous ones. AdaBoost does this by assigning higher weights to misclassified data points, while Gradient Boosting fits each new model to the residual errors of the current ensemble. Boosting mainly reduces bias and improves accuracy, but it can overfit noisy data if run for too many rounds.
        Example: AdaBoost, Gradient Boosting, XGBoost.
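
The reweighting idea behind Boosting can be sketched in a few lines. The example below is one round of the classic discrete AdaBoost update for binary classification, written from scratch; it is not how scikit-learn or XGBoost implement boosting internally, and the helper name `adaboost_reweight` is hypothetical. It assumes the weak learner's weighted error is strictly between 0 and 1.

```python
import math


def adaboost_reweight(weights, is_correct):
    """One round of the discrete AdaBoost weight update (binary case)."""
    # Weighted error of the current weak learner (assumed to be in (0, 1)).
    err = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # Vote strength of this learner: larger when its error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Up-weight mistakes, down-weight correct points, then renormalise.
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


# Five equally weighted samples; the weak learner gets only the last one wrong.
weights = [0.2] * 5
is_correct = [True, True, True, True, False]
alpha, new_weights = adaboost_reweight(weights, is_correct)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update, the single misclassified point carries half of the total weight, so the next weak learner is pushed to get it right. Bagging has no such step: every model is trained independently of the others.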

3.  **What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**
      - Bootstrap sampling is a statistical technique used to create multiple random samples from the original dataset with replacement. This means some data points may appear multiple times in a sample, while others may not appear at all.

           In Bagging methods like Random Forest, bootstrap sampling plays a key role by ensuring that each base model (e.g., each decision tree) is trained on a slightly different version of the data. This diversity among the models helps to:

           - Reduce overfitting by preventing all models from seeing the same data.
           - Increase model stability and accuracy when their predictions are combined through averaging or majority voting.
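
Sampling with replacement is easy to try out with the standard library alone. This is a minimal sketch, with a list of integers standing in for training rows; the helper `bootstrap_sample` is a name chosen here, not a library function.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [data[rng.randrange(len(data))] for _ in data]


rng = random.Random(42)
data = list(range(10_000))  # stand-in for 10,000 training rows
sample = bootstrap_sample(data, rng)

unique_fraction = len(set(sample)) / len(data)
print(f"sample size: {len(sample)}")
print(f"fraction of distinct original points in the sample: {unique_fraction:.3f}")
print(f"fraction left out of the sample: {1 - unique_fraction:.3f}")
```

For large datasets the left-out fraction approaches 1/e (about 0.368), because each point is missed by a single draw with probability 1 - 1/n and by all n draws with probability (1 - 1/n)^n.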

4.  **What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**
      - Out-of-Bag (OOB) samples are the data points that are not included in a particular bootstrap sample during the training of ensemble models like Random Forest. Because each tree is trained on a bootstrap sample (drawn with replacement and the same size as the training set), roughly 37% of the original data, a little over one-third, is left out as OOB samples for that tree. The OOB score is a performance estimate built from these held-out points: each training sample is predicted using only the trees that did not see it during training, and the aggregated predictions are compared with the true labels to compute an overall accuracy or error rate. It therefore acts like a built-in validation set and does not require separate hold-out data.
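
In scikit-learn, the OOB estimate is available through the `oob_score` parameter and the fitted `oob_score_` attribute of `RandomForestClassifier`. The sketch below uses the Breast Cancer dataset and 200 trees purely as illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# bootstrap=True is the default; oob_score=True asks the forest to score
# each training sample using only the trees that did not train on it.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X, y)

print("OOB accuracy estimate:", round(forest.oob_score_, 3))
```

Note that the model is fit on the full dataset here: no train/test split is needed to get the estimate, although a final held-out test set is still a sensible check.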

5.  **Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**
      - In a single Decision Tree, feature importance is calculated based on how much each feature reduces impurity (like Gini impurity or entropy) across all its splits. Features used in higher-level splits or that produce greater impurity reduction are considered more important. However, since a single tree can be sensitive to small changes in the data, its feature importance may not be very reliable. In contrast, a Random Forest calculates feature importance by averaging the importance scores of each feature across all trees in the ensemble. Because Random Forest combines many trees trained on different subsets of data and features, its importance values are more stable, robust, and generalizable compared to a single Decision Tree.
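
The difference can be inspected directly via `feature_importances_`, which both estimators expose. The following is a rough sketch on the Breast Cancer dataset (the dataset, seeds, and tree count are arbitrary choices); it counts how many features each model never uses and shows how much a feature's importance varies from tree to tree inside the forest.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

tree_imp = tree.feature_importances_
forest_imp = forest.feature_importances_

# Features with exactly zero importance were never used for a split.
print("features unused by the single tree:", int(np.sum(tree_imp == 0)))
print("features unused by the forest:", int(np.sum(forest_imp == 0)))

# Spread of each feature's importance across the individual trees of the forest.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
top = np.argsort(forest_imp)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<22} mean={forest_imp[i]:.3f} std={per_tree[:, i].std():.3f}")
```

A single tree typically assigns zero importance to many features simply because it never split on them, and the per-tree standard deviations show how much any one tree's ranking can differ from the averaged one.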

6. Write a Python program to:
     - Load the Breast Cancer dataset using **sklearn.datasets.load_breast_cancer()**
     -  Train a Random Forest Classifier
     -  Print the top 5 most important features based on feature importance scores.

```python
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Classifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Get feature importances
feature_importances = pd.Series(model.feature_importances_, index=X.columns)

# Print the top 5 most important features
top_features = feature_importances.sort_values(ascending=False).head(5)
print("Top 5 Most Important Features:")
print(top_features)
```


```text
Top 5 Most Important Features:
worst area              0.153892
worst concave points    0.144663
mean concave points     0.106210
worst radius            0.077987
mean concavity          0.068001
dtype: float64
```


7.  Write a Python program to:
     - Train a Bagging Classifier using **Decision Trees on the Iris dataset**
     - Evaluate its accuracy and compare with a single Decision Tree

```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a single Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

# Train a Bagging Classifier using Decision Trees as base estimators
bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bag_model.fit(X_train, y_train)
bag_pred = bag_model.predict(X_test)
bag_accuracy = accuracy_score(y_test, bag_pred)

# Compare accuracies
print("Accuracy of Single Decision Tree:", round(dt_accuracy, 3))
print("Accuracy of Bagging Classifier:", round(bag_accuracy, 3))
```

```text
Accuracy of Single Decision Tree: 1.0
Accuracy of Bagging Classifier: 1.0
```


8.  Write a Python program to:
     - Train a Random Forest Classifier
     - **Tune hyperparameters max_depth and n_estimators using GridSearchCV**
     - Print the best parameters and final accuracy

```python
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Define the hyperparameter grid for tuning
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5, 8, None]
}

# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, scoring='accuracy', n_jobs=-1)

# Fit the model on training data
grid_search.fit(X_train, y_train)

# Get the best parameters and accuracy
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate on test data
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters found by GridSearchCV:", best_params)
print("Final Model Accuracy on Test Data:", round(accuracy, 3))
```


```text
Best Parameters found by GridSearchCV: {'max_depth': 8, 'n_estimators': 150}
Final Model Accuracy on Test Data: 0.965
```


9.  Write a Python program to:
     - Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
     - Compare their Mean Squared Errors (MSE)

```python
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Bagging Regressor using Decision Trees as base estimators
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=42
)
bagging_reg.fit(X_train, y_train)

# Train a Random Forest Regressor
rf_reg = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf_reg.fit(X_train, y_train)

# Make predictions
y_pred_bag = bagging_reg.predict(X_test)
y_pred_rf = rf_reg.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse_bag = mean_squared_error(y_test, y_pred_bag)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print the results
print("Mean Squared Error (Bagging Regressor):", round(mse_bag, 4))
print("Mean Squared Error (Random Forest Regressor):", round(mse_rf, 4))

# Compare performance
if mse_rf < mse_bag:
    print("\n✅ Random Forest performs better (lower MSE).")
else:
    print("\n✅ Bagging Regressor performs better (lower MSE).")
```

```text
Mean Squared Error (Bagging Regressor): 0.2559
Mean Squared Error (Random Forest Regressor): 0.2554

✅ Random Forest performs better (lower MSE).
```


10.  You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:
   - Choose between Bagging or Boosting
   - Handle overfitting
   - Select base models
   - Evaluate performance using cross-validation
   - Justify how ensemble learning improves decision-making in this real-world context.

**ANSWER** - Loan Default Prediction using Ensemble Learning

  As a data scientist in a financial institution, my goal is to predict loan default accurately using customer demographic and transaction history data. To improve performance and reliability, I would apply ensemble learning techniques.

  1. **Choosing between Bagging and Boosting:** Bagging methods like Random Forest reduce variance and are effective when base models tend to overfit. Boosting methods like XGBoost or LightGBM reduce bias by focusing on difficult samples. For this task, I would prefer Boosting because it often performs strongly on complex tabular data and can capture subtle patterns in customer behavior, but I would confirm the choice by comparing both approaches under cross-validation.

  2. **Handling Overfitting:** I would use early stopping, regularization (e.g., a small learning rate, a limited max_depth, row subsampling), and feature selection, with hyperparameters tuned through cross-validation. Class imbalance, which is typical for default data, is a separate problem from overfitting; I would address it with class weights or resampling.

  3. **Selecting Base Models:** The base models could include Decision Trees for simplicity or Logistic Regression for interpretability. For ensembles, I would use Random Forest (Bagging) and XGBoost/LightGBM (Boosting) as they are robust and widely used in financial modeling.

  4. **Evaluating Performance:** I would use Stratified K-Fold Cross-Validation so that every fold keeps the same proportion of defaulters. The main evaluation metrics would be AUC-ROC, Precision, Recall, and F1-score rather than plain accuracy, which is misleading when defaulters are a small minority. These metrics help assess the model’s ability to distinguish between defaulters and non-defaulters effectively.

  5. **Justification of Ensemble Learning:** Ensemble methods combine multiple weak models to form a stronger one, improving prediction accuracy and stability. Bagging mainly reduces variance and Boosting mainly reduces bias, which leads to more reliable credit risk assessment. In a real-world context, this can mean fewer false approvals, reduced financial losses, and better decision-making for loan approvals.
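
The evaluation step described above might look roughly like the sketch below. Several things here are assumptions made for illustration: the data is synthetic (`make_classification` with about 10% positives standing in for defaults), scikit-learn's `GradientBoostingClassifier` is used in place of XGBoost/LightGBM so that no extra package is required, and all hyperparameter values are placeholders rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical stand-in for loan data: roughly 10% positives ("defaults").
X, y = make_classification(n_samples=2000, n_features=20, n_informative=6,
                           weights=[0.9, 0.1], random_state=42)

models = {
    # Bagging-style ensemble; class_weight compensates for the imbalance.
    "Random Forest (bagging)": RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=42),
    # Boosting with a small learning rate, shallow trees, row subsampling,
    # and early stopping on an internal 10% validation split.
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=300, learning_rate=0.05, max_depth=3, subsample=0.8,
        n_iter_no_change=10, validation_fraction=0.1, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["roc_auc", "precision", "recall", "f1"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

On real loan data, the same loop would be run on the institution's own features, and the choice between the two models would also weigh recall on defaulters against the cost of wrongly rejecting good customers.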