
# **Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.**

**Answer:**

Ensemble Learning is a machine learning technique where multiple individual models (called base learners or weak learners) are combined to form a stronger predictive model.  
The main idea is that a group of weak models can work together to produce better performance and generalization than any single model alone.

In other words, instead of relying on one model’s predictions, ensemble learning aggregates the predictions of several models (through averaging, voting, or weighting) to reduce variance, reduce bias, or both.

**Key Idea:**
- Each individual model may make some errors, but their combination helps cancel out individual mistakes.
- It leverages diversity among models — models should make different types of errors.
- Common ensemble techniques include **Bagging, Boosting, and Stacking**.

**Benefits:**
- Improves prediction accuracy.
- Reduces overfitting.
- Increases model robustness and generalization.
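
To illustrate how combining models can cancel out individual mistakes, here is a minimal simulation. It rests on an idealized assumption that real ensembles rarely meet: every model is correct with the same probability (0.6, an illustrative value) and errs independently of the others. The function name `majority_vote_accuracy` is made up for this sketch.

```python
import random

def majority_vote_accuracy(n_models, p_correct, n_trials=20000, seed=0):
    """Estimate how often a majority of independent models is correct."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Each model is independently correct with probability p_correct
        correct_votes = sum(rng.random() < p_correct for _ in range(n_models))
        if correct_votes > n_models / 2:
            wins += 1
    return wins / n_trials

single = majority_vote_accuracy(n_models=1, p_correct=0.6)
ensemble = majority_vote_accuracy(n_models=25, p_correct=0.6)
print(f"single model:  {single:.3f}")
print(f"25-model vote: {ensemble:.3f}")
```

Under these assumptions the vote is correct noticeably more often than any single model. Real models make correlated errors, so the gain is usually smaller, which is why diversity among the base learners matters.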

---

# **Question 2: What is the difference between Bagging and Boosting?**

**Answer:**

Both Bagging and Boosting are ensemble learning methods, but they differ in how they train and combine models.

| Feature | **Bagging (Bootstrap Aggregating)** | **Boosting** |
|----------|------------------------------------|---------------|
| **Goal** | Reduce variance | Reduce bias |
| **Training** | Models are trained **independently** in parallel | Models are trained **sequentially**, where each model focuses on correcting errors of the previous one |
| **Data Sampling** | Uses random sampling **with replacement** (bootstrap samples) | Each new model is fit to data reweighted by the previous models’ errors (AdaBoost) or to their residual errors (Gradient Boosting) |
| **Combination** | Aggregates results by majority vote (classification) or averaging (regression) | Combines results using weighted voting or summation based on model performance |
| **Example Algorithms** | Random Forest | AdaBoost, Gradient Boosting, XGBoost |
| **Overfitting** | Less prone to overfitting | Can overfit if too many weak learners are added |

**In short:**  
- Bagging reduces variance by averaging over multiple models.  
- Boosting reduces bias by focusing on difficult examples and sequentially improving weak learners.
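
The contrast can be sketched with scikit-learn. This is an illustrative setup, not a benchmark: the dataset (Breast Cancer), the number of estimators, and the tree depths are assumptions chosen to keep the example small. The base estimator is passed positionally so the snippet does not depend on the keyword name used by a particular scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: fully grown (high-variance) trees, each fit independently
# on its own bootstrap sample
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: depth-1 stumps (high-bias) fit one after another,
# each paying more attention to previously misclassified samples
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)

scores = {}
for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {scores[name]:.3f}")
```

Which method scores higher depends on the data; the point of the sketch is the different base learners and training order, not the ranking.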

---

# **Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

**Answer:**

**Bootstrap Sampling** is a statistical method that involves sampling data points **with replacement** from the original dataset to create multiple subsets (bootstrap samples) of the same size as the original data.

Each subset may contain duplicate instances because sampling is done with replacement.

**Role in Bagging:**
- In Bagging (like Random Forest), each model is trained on a different bootstrap sample.
- This creates diversity among models since each one sees a slightly different version of the data.
- When predictions from all models are combined (through averaging or voting), the overall model variance is reduced.

**In Random Forest:**
- Each decision tree is trained on its own bootstrap sample.
- At each split, a tree also considers only a random subset of the features. Together, these two sources of randomness decorrelate the trees, which leads to better ensemble performance.

**Example:**  
If the dataset has 1000 records, each tree in the Random Forest might train on a random sample of 1000 records (with replacement).
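
A minimal sketch of that example using only the standard library. The list of integers is a stand-in for record indices, and the seed is arbitrary.

```python
import random

rng = random.Random(42)
n = 1000
dataset = list(range(n))  # stand-in for the indices of 1000 records

# Draw n records with replacement: one bootstrap sample
bootstrap = rng.choices(dataset, k=n)

unique_in_bag = set(bootstrap)
print("sample size:", len(bootstrap))
print("distinct records drawn:", len(unique_in_bag))
print("records never drawn:", n - len(unique_in_bag))
```

The sample has the same size as the original data, but because some records are drawn more than once, a sizeable share of the records never appears in it.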

---

# **Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**

**Answer:**

**Out-of-Bag (OOB) samples** are the data points **not included** in a particular bootstrap sample during Bagging.

Since each bootstrap sample is created with replacement, roughly **36.8% of the data** remains unused for each model — these are the OOB samples.
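
The 36.8% figure follows from a short calculation. With $n$ records and $n$ draws with replacement, a given record is missed by a single draw with probability $1 - \frac{1}{n}$, so the probability that it is never drawn is:

$$
P(\text{record is OOB}) = \left(1 - \frac{1}{n}\right)^{n} \;\longrightarrow\; e^{-1} \approx 0.368 \quad \text{as } n \to \infty
$$

For $n = 1000$ the value is already very close to this limit.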

**OOB Score:**
- OOB samples act like a validation set for each model.
- Each training point is predicted using only the models that did not see it during training, and these aggregated predictions are scored against the true labels.
- This gives a generally reliable estimate of model accuracy without needing a separate validation set.

**Advantages:**
- Efficient — no need to reserve additional validation data.
- Provides an internal estimate of generalization error.
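
A minimal sketch comparing the OOB estimate with a held-out test score. The dataset (Breast Cancer), the split, and `n_estimators=200` are illustrative assumptions; `oob_score=True` and the fitted attribute `oob_score_` are part of scikit-learn's `RandomForestClassifier`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# oob_score=True makes the forest score each training point using
# only the trees whose bootstrap sample did not contain it
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X_train, y_train)

oob_acc = rf.oob_score_               # internal estimate, no extra data needed
test_acc = rf.score(X_test, y_test)   # conventional held-out estimate
print(f"OOB accuracy:  {oob_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```

The two numbers are usually close, which is what makes the OOB score a convenient substitute when data is too scarce to hold out a validation set.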

**In Random Forest:**
```python
# Enable OOB scoring; after fit(), the estimate is available as oob_score_
RandomForestClassifier(oob_score=True)
```

---

# **Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**

**Answer:**

**Feature Importance** measures how much each feature contributes to a model’s predictions.  
It helps us understand which variables are most influential in decision-making.

---

### **Comparison Table**

| Aspect | **Decision Tree** | **Random Forest** |
|--------|-------------------|-------------------|
| **Computation Method** | Based on reduction in impurity (Gini or Entropy) from each split within the single tree. | Computed as the *average* of feature importance scores across all trees in the forest. |
| **Bias** | Can be biased toward features with many categories or continuous values. | Averaging smooths out tree-to-tree noise, but impurity-based scores remain biased toward high-cardinality features; permutation importance is a common cross-check. |
| **Stability** | Sensitive to small data changes — importance values may fluctuate. | More stable because results are aggregated from multiple trees. |
| **Interpretability** | Easy to interpret due to single model structure. | Harder to visualize, but provides more reliable importance values. |
| **Performance** | May overfit and rely too heavily on dominant features. | Handles noise and irrelevant features better due to ensemble averaging. |
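
The comparison can be made concrete with a short sketch. The dataset and seeds are illustrative assumptions. It checks two things from the table: a single tree gives non-zero importance only to the features it happens to split on, and the forest's score is the average of its trees' impurity-based importances.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Features that are never used in a split get an importance of exactly zero
tree_nonzero = int(np.count_nonzero(tree.feature_importances_))
forest_nonzero = int(np.count_nonzero(forest.feature_importances_))
print("features with non-zero importance (single tree):", tree_nonzero)
print("features with non-zero importance (forest):", forest_nonzero)

# Recompute the forest's importances as the mean over its individual trees
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
manual_mean = per_tree.mean(axis=0)
manual_mean = manual_mean / manual_mean.sum()
print("matches forest.feature_importances_:",
      np.allclose(manual_mean, forest.feature_importances_))
```

Refitting the single tree with a different `random_state` or a slightly different sample can change its ranking noticeably, whereas the forest's averaged ranking tends to move much less.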

---



# **Question 6:**
### Write a Python program to:
- Load the Breast Cancer dataset using `sklearn.datasets.load_breast_cancer()`
- Train a Random Forest Classifier
- Print the top 5 most important features based on feature importance scores.


```python
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Initialize and train Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importances
importances = rf.feature_importances_

# Create a DataFrame for better readability
feature_importance_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Print the top 5 important features
print("Top 5 Most Important Features:")
print(feature_importance_df.head(5))
```

```
Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850
```


# **Question 7:**
### Write a Python program to:
- Train a Bagging Classifier using Decision Trees on the Iris dataset
- Evaluate its accuracy and compare with a single Decision Tree


```python
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Single Decision Tree
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred_tree = tree.predict(X_test)
tree_acc = accuracy_score(y_test, y_pred_tree)

# Bagging Classifier with Decision Tree as base estimator
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
bag_acc = accuracy_score(y_test, y_pred_bag)

print("Accuracy of Single Decision Tree:", tree_acc)
print("Accuracy of Bagging Classifier:", bag_acc)
```

```
Accuracy of Single Decision Tree: 1.0
Accuracy of Bagging Classifier: 1.0
```


# **Question 8:**
### Write a Python program to:
- Train a Random Forest Classifier
- Tune hyperparameters `max_depth` and `n_estimators` using `GridSearchCV`
- Print the best parameters and final accuracy

```python
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Define Random Forest model
rf = RandomForestClassifier(random_state=42)

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5, 7, None]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Get best model and evaluate
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print("Final Accuracy:", acc)
```

```
Best Parameters: {'max_depth': 3, 'n_estimators': 150}
Final Accuracy: 1.0
```


# **Question 9:**
### Write a Python program to:
- Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
- Compare their Mean Squared Errors (MSE)


```python
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.3, random_state=42
)

# Bagging Regressor
bag_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)
bag_reg.fit(X_train, y_train)
y_pred_bag = bag_reg.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

# Random Forest Regressor
# Note: this uses 100 trees versus 50 above, so the comparison is not strictly like-for-like
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print("Mean Squared Error (Bagging Regressor):", mse_bag)
print("Mean Squared Error (Random Forest Regressor):", mse_rf)
```

```
Mean Squared Error (Bagging Regressor): 0.25787382250585034
Mean Squared Error (Random Forest Regressor): 0.25650512920799395
```


# **Question 10: Real-world Ensemble Learning – Loan Default Prediction**

**Answer:**

**Scenario:**  
You are a data scientist at a financial institution tasked with predicting **loan default** using customer demographics and transaction data.  
You plan to use **Ensemble Learning** to improve prediction accuracy and decision reliability.

---

### **Step 1: Choose between Bagging or Boosting**

| Case | Recommended Method | Reason |
|------|--------------------|--------|
| High variance (model overfits easily) | **Bagging (e.g., Random Forest)** | Reduces variance via averaging multiple independent models. |
| High bias (model underfits or misses patterns) | **Boosting (e.g., AdaBoost, XGBoost, Gradient Boosting)** | Sequentially focuses on difficult samples and reduces bias. |

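In loan default prediction, **Boosting** is often a strong choice because it captures **complex, non-linear patterns** in tabular financial data. Defaults are usually the minority class, so pair it with class weighting or resampling rather than relying on the algorithm alone to handle the imbalance.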

---

### **Step 2: Handle Overfitting**

- Use **cross-validation** for tuning parameters.  
- Apply **early stopping** (in Boosting).  
- Use **regularization parameters** (L1/L2 penalties).  
- Limit **max_depth**, and in Boosting also cap **n_estimators** (or lower the learning rate), to prevent overly complex models.  
- Drop highly correlated or irrelevant features.
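
Early stopping and depth limits can be sketched with scikit-learn's `GradientBoostingClassifier`. The data below is synthetic, generated with `make_classification` as a stand-in for loan records with roughly 10% defaults, and every hyperparameter value is an illustrative assumption rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for loan data: about 10% positives ("defaults")
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.1,
    max_depth=3,              # shallow trees keep each learner weak
    validation_fraction=0.2,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gb.fit(X, y)

# n_estimators_ reports how many rounds were actually fitted
print("boosting rounds used:", gb.n_estimators_)
```

If the validation score stops improving, training ends before the 500-round limit, which caps model complexity without hand-tuning the number of estimators.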

---

### **Step 3: Select Base Models**

- For **Bagging** → Decision Tree as base estimator (Random Forest).  
- For **Boosting** → Shallow trees or weak learners as base models.  
- Optionally test Logistic Regression or SVM as base models for diversity.

---

### **Step 4: Evaluate Performance using Cross-Validation**

- Use **K-Fold Cross-Validation** (e.g., k = 5).  
- Evaluate with metrics suited for classification:  
  - Accuracy (can be misleading when defaults are rare)  
  - Precision & Recall  
  - F1-Score  
  - ROC-AUC (useful for imbalanced datasets)

- Compare the ensemble's scores with those of a single-model baseline.
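
The evaluation step can be sketched as follows. The data is again synthetic (`make_classification` with about 10% positives standing in for defaults), and the model settings are illustrative assumptions. `StratifiedKFold` is used so that each fold keeps the same class ratio.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for loan data: about 10% positives ("defaults")
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0
)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]

summary = {}
for name, model in models.items():
    # cross_validate returns one array of fold scores per metric
    results = cross_validate(model, X, y, cv=cv, scoring=metrics)
    summary[name] = {m: results[f"test_{m}"].mean() for m in metrics}
    print(name, {m: round(float(v), 3) for m, v in summary[name].items()})
```

Reading the metrics side by side shows whether the ensemble's gain comes from better ranking (ROC-AUC), fewer missed defaults (recall), or fewer false alarms (precision), which accuracy alone would hide.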

---

### **Step 5: Justify How Ensemble Learning Improves Decision-Making**

- **Combines multiple weak models** to improve robustness.  
- **Reduces overfitting** — predictions generalize better to unseen customers.  
- **Improves recall and precision**, reducing false loan approvals or rejections.  
- **Provides feature importance** — helps identify top risk factors like Credit Score, Debt-to-Income Ratio, and Payment History.  
- Enables **data-driven and explainable decisions** for credit risk management.

---
