# Ensemble Learning Assignment

---
## 1. What is Ensemble Learning in machine learning? Explain the key idea behind it.

**Ensemble learning** is a technique in machine learning where **multiple models (often called base learners, or “weak learners” in the boosting literature) are combined** to produce a **stronger, more accurate, and more robust predictive model**.

Instead of relying on a single model, ensemble methods aggregate the predictions of several models to reduce errors, variance, and bias.

### **Key Idea Behind Ensemble Learning**

The central idea is that:

* **“A group of weak models can come together to form a strong model.”**
* Each model may capture different aspects or patterns in the data.
* By combining them, their individual mistakes can cancel out, leading to improved overall performance.

This works similarly to the idea of **“wisdom of the crowd”**: multiple opinions, when aggregated, are often more reliable than a single opinion.

### **Why it works**

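1. **Reduces Variance** → Averaging the predictions of many models trained on different samples (e.g., Bagging) smooths out the overfitting of individual models.
2. **Reduces Bias** → Fitting models sequentially to the errors of their predecessors (e.g., Boosting) helps capture complex patterns that a single weak model misses.
3. **Improves Generalization** → The combined model is usually more robust to noise and performs better on unseen data.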
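
The "mistakes cancel out" argument can be made concrete with a small calculation. The sketch below assumes an idealized setting that real ensembles only approximate: every model is correct with the same probability `p`, and the models err independently. The helper name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models voters is correct.

    Assumes n_models is odd (no ties) and that each model is correct with
    probability p, independently of the others.
    """
    k_min = n_models // 2 + 1  # fewest correct votes that still win
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Each individual model is right only 60% of the time
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the vote becomes more accurate as models are added. In practice the models' errors are correlated, so the gain is smaller, which is why ensemble methods work to make their members diverse.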

### **Common Ensemble Methods**

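* **Bagging (Bootstrap Aggregating):** Trains multiple models independently on bootstrap samples of the data (drawn with replacement) and averages or votes on their predictions (e.g., Random Forest).
* **Boosting:** Trains models sequentially, with each new model focusing on the errors of the previous ones, either by reweighting misclassified instances (AdaBoost) or by fitting the remaining residual errors (Gradient Boosting, XGBoost).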
* **Stacking:** Combines different models and uses another “meta-model” to make final predictions.
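
Bagging and boosting are demonstrated elsewhere in this assignment, so here is a minimal sketch of stacking only. The choice of base models (a Random Forest and a scaled k-nearest-neighbours pipeline), the logistic-regression meta-model, and the Breast Cancer dataset are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Two deliberately different base models
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
]

# The meta-model is trained on cross-validated predictions of the base models
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
stack_acc = stack.score(X_test, y_test)
print("Stacking test accuracy:", stack_acc)
```

The `cv=5` argument matters: the meta-model sees only out-of-fold predictions, so it does not simply learn to trust whichever base model memorized the training data.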

---
## 2. What is the difference between Bagging and Boosting?

| **Aspect**             | **Bagging (Bootstrap Aggregating)**                                                        | **Boosting**                                                                                               |
| ---------------------- | ------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------- |
| **Goal**               | Reduce **variance** (overfitting)                                                          | Reduce **bias** (underfitting)                                                                             |
| **How it works**       | Trains models **independently** on random subsets of data (with replacement)               | Trains models **sequentially**, where each new model focuses on correcting the errors of the previous ones |
| **Data Sampling**      | Uses **bootstrap samples** (random sampling with replacement)                              | Usually uses the **entire dataset**, either reweighting misclassified samples (AdaBoost) or fitting the remaining errors (gradient boosting) |
| **Model Combination**  | Combines predictions by **majority voting (classification)** or **averaging (regression)** | Combines predictions by **weighted voting/weighted sum**                                                   |
| **Bias & Variance**    | Lowers **variance**, bias remains about the same                                           | Lowers **bias**, variance may slightly increase                                                            |
| **Example Algorithms** | Random Forest                                                                              | AdaBoost, Gradient Boosting, XGBoost, LightGBM                                                             |
| **Parallelization**    | Models can be trained in **parallel** (since they are independent)                         | Models must be trained **sequentially** (each depends on previous)                                         |


### **Quick Intuition**

* **Bagging:** “Many independent models vote → reduces overfitting.”
* **Boosting:** “Each model learns from its predecessors’ mistakes → reduces underfitting.”
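
The contrast can be tried side by side. This is a rough sketch on a synthetic dataset (an assumption, generated with `make_classification`), using scikit-learn's defaults: `BaggingClassifier` bags full decision trees, while `AdaBoostClassifier` boosts depth-1 stumps.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used only so the example is self-contained
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Bagging: independent, fully grown trees on bootstrap samples (parallelizable)
bagging = BaggingClassifier(n_estimators=50, random_state=0)
# Boosting: shallow stumps fitted one after another on reweighted data
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

bagging_scores = cross_val_score(bagging, X, y, cv=5)
boosting_scores = cross_val_score(boosting, X, y, cv=5)

print("Bagging  mean CV accuracy:", bagging_scores.mean())
print("Boosting mean CV accuracy:", boosting_scores.mean())
```

Which of the two scores higher depends on the data and the hyperparameters, so the numbers from one synthetic dataset should not be read as a general ranking.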

---
## 3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

**Bootstrap sampling** is a statistical resampling technique where we create new datasets by **randomly selecting samples *with replacement*** from the original dataset.

* If the original dataset has **N samples**, each bootstrap sample also has **N samples**.
* Since selection is **with replacement**, some data points may appear multiple times, while others may not appear at all in a given bootstrap sample.
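
A bootstrap sample is easy to draw by sampling row indices. The sketch below uses NumPy on a hypothetical dataset of 1000 rows, represented only by its indices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
data_idx = np.arange(n)  # stand-in for the rows of the original dataset

# Draw N indices with replacement: one bootstrap sample
boot_idx = rng.integers(0, n, size=n)

# Rows that were drawn at least once, and how often each was drawn
unique_idx, counts = np.unique(boot_idx, return_counts=True)
# Rows that were never drawn
left_out_idx = np.setdiff1d(data_idx, boot_idx)

print("Sample size:          ", len(boot_idx))
print("Distinct rows drawn:  ", len(unique_idx))
print("Rows left out:        ", len(left_out_idx))
print("Max times a row drawn:", counts.max())
```

The sample has the same size as the original data, yet some rows repeat and others are missing, which is what makes each model in a bagged ensemble see a slightly different dataset.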

### **Role in Bagging (e.g., Random Forest)**

1. **Diversity in Models**

   * Each model (e.g., each decision tree in a Random Forest) is trained on a different bootstrap sample.
   * This ensures the models are not identical and capture different patterns.

2. **Reduces Overfitting (Variance Reduction)**

   * By averaging the predictions of multiple diverse models, Bagging smooths out noise and reduces variance.

3. **Out-of-Bag (OOB) Error Estimation**

   * Since \~37% of the data is typically left out of each bootstrap sample, these “out-of-bag” samples can be used to estimate model accuracy without needing a separate validation set.

### **Example (Random Forest)**

* Suppose you have a dataset of 1000 rows.
* You draw 1000 samples *with replacement* → this is one bootstrap sample.
* Train **Tree 1** on this bootstrap sample.
* Repeat the process to generate bootstrap samples for **Tree 2, Tree 3, …**
* Final prediction = **majority voting (classification)** or **averaging (regression)** across all trees.
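
The steps above can be sketched by hand. Note the assumptions: this is plain bagging of decision trees on the Iris dataset, with 25 trees chosen arbitrarily. A real Random Forest additionally considers a random subset of features at each split, which this sketch does not reproduce.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rng = np.random.default_rng(42)
n_trees = 25
trees = []
for _ in range(n_trees):
    # One bootstrap sample: N row indices drawn with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    trees.append(tree)

# One row of votes per tree: shape (n_trees, n_test_samples)
votes = np.array([t.predict(X_test) for t in trees])
# Majority vote for each test sample (Iris has 3 classes)
y_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

manual_bagging_acc = (y_pred == y_test).mean()
print("Manual bagging accuracy:", manual_bagging_acc)
```

For regression, the majority vote would be replaced by `votes.mean(axis=0)`.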

---
## 4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

### **Out-of-Bag (OOB) Samples**

* In **bootstrap sampling**, each new dataset is created by sampling **with replacement** from the original dataset.
* On average, about **63% of the distinct data points** are included in a bootstrap sample (for reasonably large datasets), while the remaining **\~37% are left out**.
* These **left-out data points** for a given bootstrap sample are called **Out-of-Bag (OOB) samples**.
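
The \~37% figure follows from a short calculation: a given row is missed by one draw with probability `1 - 1/N`, so it is missed by all `N` draws with probability `(1 - 1/N)^N`, which approaches `1/e ≈ 0.368` as `N` grows. The function name below is made up for this illustration.

```python
import math


def expected_oob_fraction(n: int) -> float:
    """Chance that one given row is never drawn in n draws with replacement."""
    return (1 - 1 / n) ** n


for n in (10, 100, 1000, 100_000):
    print(n, round(expected_oob_fraction(n), 4))

print("limit 1/e:", round(math.exp(-1), 4))
```

Even for small datasets the fraction is already close to the limit, so "about 37% out-of-bag" is a safe rule of thumb.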

### **How OOB Score is Used**

1. **Training**

   * Each base model (e.g., a decision tree in Random Forest) is trained on its bootstrap sample.

2. **Evaluation with OOB samples**

   * For each base model, its OOB samples (the \~37% not used to train it) act as a **validation set**.
   * Each data point is predicted using only the models for which it was OOB. With enough models, nearly every point is OOB for some of them, so the entire ensemble can be evaluated on data it was not trained on.

3. **OOB Score**

   * The **OOB score** is the accuracy (for classification) or R² / error metric (for regression) of these aggregated OOB predictions over the training data.
   * It provides a reasonable, approximately unbiased estimate of model performance **without needing a separate validation dataset**. It is often slightly conservative, because each prediction uses only a subset of the models.
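
In scikit-learn, the OOB score is available by passing `oob_score=True` to a Random Forest. The sketch below compares it with accuracy on a held-out test set; the dataset and the number of trees are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# oob_score=True asks the forest to score each training row using only
# the trees whose bootstrap sample did not contain that row
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

oob_acc = rf.oob_score_             # accuracy of the aggregated OOB predictions
test_acc = rf.score(X_test, y_test)  # accuracy on data no tree has seen

print("OOB accuracy :", oob_acc)
print("Test accuracy:", test_acc)
```

The two numbers are usually close, which is what makes the OOB score a convenient check when data is too scarce to hold out a validation set.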

### **Advantages of OOB Score**

* Saves data → No need to split into training/validation sets.
* Gives a built-in performance check during model training.
* Especially useful in Random Forests for quick evaluation.

---
## 5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

| **Aspect**               | **Decision Tree**                                                                                                                          | **Random Forest**                                                                                                                     |
| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
| **How it is calculated** | Based on how much each feature **reduces impurity** (e.g., Gini Index, Entropy for classification; Variance for regression) at its splits. | Computed by **averaging the feature importance scores** from all the decision trees in the forest.                                    |
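| **Bias**                 | Impurity-based importance can be **biased toward high-cardinality features** (categorical variables with many levels, or numeric features with many distinct values, may appear more important). | The same bias **persists** for impurity-based importance; averaging across trees trained on different bootstrap samples and random feature subsets steadies the scores but does not remove it. Permutation importance is a common alternative. |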
| **Stability**            | Can be **unstable** → small changes in data may lead to very different importance values.                                                  | More **stable and reliable** due to aggregation across multiple trees.                                                                |
| **Interpretability**     | Easy to understand and explain because it comes from a single tree’s structure.                                                            | Harder to interpret directly, but more trustworthy for generalization.                                                                |
| **Overfitting**          | Prone to overfitting, so feature importance may not generalize well.                                                                       | Less prone to overfitting, so feature importance is more representative.                                                              |

### **Key Idea**

* **Decision Tree** → Feature importance = “How much this feature helped reduce impurity in this one tree.”
* **Random Forest** → Feature importance = “How much this feature consistently helped reduce impurity across many trees trained on different samples and subsets.”
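
The difference can be inspected directly through the `feature_importances_` attribute that both estimators expose. This sketch, run on the Breast Cancer dataset (an assumption for illustration), also measures how much each feature's importance varies from tree to tree inside the forest.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

tree_imp = tree.feature_importances_
forest_imp = forest.feature_importances_

# Importance of every feature in every individual tree of the forest
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
forest_std = per_tree.std(axis=0)  # tree-to-tree spread per feature

print("Zero-importance features (single tree):", int((tree_imp == 0).sum()))
print("Zero-importance features (forest)     :", int((forest_imp == 0).sum()))
print("Largest tree-to-tree std in forest    :", forest_std.max())
```

A single tree typically assigns zero importance to every feature it never splits on, even when that feature is informative but correlated with one it did use. The forest spreads importance over more features, and the tree-to-tree spread shows how unstable any one tree's ranking is.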

---
## 6. Write a Python program to:
> * #### Load the Breast Cancer dataset using `sklearn.datasets.load_breast_cancer()`
> * #### Train a Random Forest Classifier
> * #### Print the top 5 most important features based on feature importance scores.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Get feature importance scores
importances = model.feature_importances_

# Create a DataFrame for feature importance
feat_importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
})

# Sort by importance and print top 5
top5 = feat_importances.sort_values(by="Importance", ascending=False).head(5)
print("Top 5 Important Features:")
print(top5)
```

```text
Top 5 Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850
```


---
## 7. Write a Python program to:
> * #### Train a Bagging Classifier using Decision Trees on the Iris dataset
> * #### Evaluate its accuracy and compare with a single Decision Tree

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a single Decision Tree
dtree = DecisionTreeClassifier(random_state=42)
dtree.fit(X_train, y_train)
y_pred_tree = dtree.predict(X_test)
tree_acc = accuracy_score(y_test, y_pred_tree)

# Train a Bagging Classifier with Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,        # number of trees
    random_state=42,
    n_jobs=-1
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
bagging_acc = accuracy_score(y_test, y_pred_bag)

# Print results
print("Accuracy of Single Decision Tree:", tree_acc)
print("Accuracy of Bagging Classifier :", bagging_acc)
```

```text
Accuracy of Single Decision Tree: 1.0
Accuracy of Bagging Classifier : 1.0
```
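
Both models reach 1.0 on this particular split, so a single split cannot separate them. A comparison that depends less on one split is to score both with k-fold cross-validation. The sketch below assumes 10 folds and relies on `BaggingClassifier`'s default base estimator, which is a decision tree.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validated accuracy for each model
tree_cv = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=10)
bag_cv = cross_val_score(
    BaggingClassifier(n_estimators=50, random_state=42), X, y, cv=10
)

print(f"Single tree: {tree_cv.mean():.3f} +/- {tree_cv.std():.3f}")
print(f"Bagging    : {bag_cv.mean():.3f} +/- {bag_cv.std():.3f}")
```

Iris is a small, easy dataset, so the gap between the two models may remain small even under cross-validation.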


---
## 8. Write a Python program to:
> * #### Train a Random Forest Classifier
> * #### Tune hyperparameters `max_depth` and `n_estimators` using GridSearchCV
> * #### Print the best parameters and final accuracy

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define Random Forest model
rf = RandomForestClassifier(random_state=42)

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 150],  # number of trees
    'max_depth': [None, 5, 10]       # maximum depth of trees
}

# Grid Search with Cross-Validation
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,               # 5-fold cross-validation
    n_jobs=-1,
    scoring='accuracy'
)

grid_search.fit(X_train, y_train)

# Best parameters and the model refitted with them on the full training set
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate on test set
y_pred = best_model.predict(X_test)
final_acc = accuracy_score(y_test, y_pred)

print("Best Parameters:", best_params)
print("Final Accuracy on Test Set:", final_acc)
```

```text
Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy on Test Set: 1.0
```


---
## 9. Write a Python program to:
> * #### Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
> * #### Compare their Mean Squared Errors (MSE)

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Bagging Regressor (with Decision Trees as base estimator)
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
bagging_reg.fit(X_train, y_train)
y_pred_bagging = bagging_reg.predict(X_test)

# Train Random Forest Regressor
rf_reg = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)

# Calculate Mean Squared Errors
mse_bagging = mean_squared_error(y_test, y_pred_bagging)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print results
print("Mean Squared Error (Bagging Regressor):", mse_bagging)
print("Mean Squared Error (Random Forest Regressor):", mse_rf)
```

```text
Mean Squared Error (Bagging Regressor): 0.2578738225058504
Mean Squared Error (Random Forest Regressor): 0.25650512920799395
```

---
## 10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:
> * #### Choose between Bagging or Boosting
> * #### Handle overfitting
> * #### Select base models
> * #### Evaluate performance using cross-validation
> * #### Justify how ensemble learning improves decision-making in this real-world context.

### **1. Choose between Bagging or Boosting**

* **Bagging** (e.g., Random Forest) is useful when the model has **high variance** (like Decision Trees), as it reduces overfitting by averaging predictions.
* **Boosting** (e.g., XGBoost, AdaBoost, LightGBM) is better when the patterns are **complex and simpler models underfit**, as it sequentially corrects errors and reduces bias.

👉 For loan default prediction, where patterns are complex and misclassification is costly, **Boosting** is often preferred because it tends to achieve higher accuracy on tabular data and focuses on difficult-to-predict cases. A Random Forest remains a useful baseline to compare against.

### **2. Handle Overfitting**

* Use **regularization parameters** in boosting methods (e.g., `learning_rate`, `max_depth`, `min_child_weight` in XGBoost).
* Apply **early stopping** during training to prevent models from memorizing noise.
* Use **cross-validation** to tune hyperparameters and avoid overfitting to training data.
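
As a rough illustration of these controls, the sketch below uses scikit-learn's `GradientBoostingClassifier` rather than XGBoost, so the parameter names differ from the XGBoost ones listed above. The data is synthetic (an assumption standing in for loan records, with about 10% positives), and all hyperparameter values are arbitrary starting points.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data: roughly 10% "defaults"
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

gbm = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.05,       # shrinkage: smaller steps, more regularization
    max_depth=3,              # shallow trees as weak learners
    subsample=0.8,            # each round sees a random 80% of the rows
    validation_fraction=0.2,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gbm.fit(X_train, y_train)

rounds_used = gbm.n_estimators_  # rounds actually fitted before stopping
print("Boosting rounds used:", rounds_used)
print("Test accuracy       :", gbm.score(X_test, y_test))
```

With early stopping enabled, `n_estimators` acts as a ceiling rather than a fixed count, so it can be set generously.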

### **3. Select Base Models**

* For **Bagging**: Base model = **Decision Trees** (prone to variance, but bagging stabilizes them).
* For **Boosting**: Base model = **Weak learners (shallow Decision Trees)**, typically depth 3–5.

👉 In practice: Start with **Decision Trees** as base learners, then try Gradient Boosted Trees (XGBoost/LightGBM).

### **4. Evaluate Performance using Cross-Validation**

* Use **Stratified k-Fold Cross-Validation**, because loan default is typically an **imbalanced classification problem** (far fewer defaults than non-defaults) and stratification keeps the class ratio similar in every fold.
* Evaluate metrics:

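  * **Accuracy** (overall correctness; can be misleading when defaults are rare, so it should not be used alone)
  * **Precision & Recall** (recall matters most here because false negatives, i.e., missed defaults, are costly; precision limits the number of safe applicants wrongly flagged)
  * **ROC-AUC** (to capture ranking quality of predicted probabilities).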
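
One way to collect all of these metrics in a single pass is `cross_validate` with a `StratifiedKFold` splitter. The sketch below makes several assumptions: synthetic data with about 10% positives stands in for loan records, a Random Forest stands in for whichever ensemble is chosen, and `class_weight="balanced"` is one possible response to the imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: roughly 10% "defaults"
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)

# Stratified folds keep the default rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)

metrics = ["accuracy", "precision", "recall", "roc_auc"]
cv_results = cross_validate(model, X, y, cv=cv, scoring=metrics)

for metric in metrics:
    scores = cv_results[f"test_{metric}"]
    print(f"{metric:>9}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reading the metrics together is the point: on imbalanced data a model can show high accuracy while its recall on the minority class stays low.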

### **5. Justify how Ensemble Learning improves Decision-Making**

* **Higher accuracy**: Boosting reduces bias and improves predictive power.
* **More reliable risk prediction**: Helps financial institutions **identify risky borrowers more accurately**.
* **Balanced decision-making**: Reduces errors in both directions, avoiding loans to high-risk customers while not unfairly rejecting safe ones.
* **Business impact**: Minimizes financial losses, improves credit risk models, and builds trust with regulators and customers.