
# Ensemble Learning Assignment (DA-AG-014)
### Detailed Solutions with Original Questions
---


## Question 1
**What is Ensemble Learning in machine learning? Explain the key idea behind it.**

**Answer:**

Ensemble learning is a machine learning technique that combines the predictions of multiple models (often weak learners) to improve predictive performance. The key idea is that individual models make different errors: some suffer from high variance, others from high bias. When those errors are not strongly correlated, combining the models lets them partly cancel out, which typically yields better accuracy, stability, and generalization than any single model.

- **Why use ensembles?**
  - Reduce variance (bagging)
  - Reduce bias (boosting)
  - Improve robustness (stacking)

**Example:** Random Forest combines many Decision Trees using bagging plus a random subset of features at each split, which usually generalizes better than a single tree.
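
To make the "errors cancel out" idea concrete, here is a small sketch that computes how often a majority vote is correct. It assumes an idealized setting that the answer above does not claim holds in practice: every model is correct with the same probability `p`, and the models err independently. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models is correct.

    Assumes each model is right with probability p, independently of the
    others (an idealization), and that n_models is odd so ties cannot occur.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

# Each individual model is only 60% accurate.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the vote becomes more reliable as models are added. Real models trained on the same data are correlated, so the gain in practice is smaller, which is why ensembles work to make their members diverse.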

## Question 2
**What is the difference between Bagging and Boosting?**

**Answer:**

| Aspect | Bagging | Boosting |
|--------|---------|----------|
| Training | Models trained independently (can run in parallel) | Models trained sequentially, each one depending on the previous |
| Focus | Reduce variance | Reduce bias |
| Sampling | Bootstrap sampling (random with replacement) | Reweights samples or fits residuals to focus on previous errors |
| Model weights | Equal contribution (averaging or majority vote) | Higher weight to better learners (e.g., AdaBoost) |
| Examples | Random Forest | AdaBoost, Gradient Boosting, XGBoost |

**Summary:** Bagging stabilizes predictions by averaging many independently trained models, while Boosting builds a strong model by sequentially correcting the errors of weak ones.
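
The "focus on errors" and "higher weight to better learners" rows can be illustrated with a single reweighting step of discrete AdaBoost for binary classification. This is a minimal sketch, not a full boosting implementation: the helper `adaboost_round` and the five-sample toy data are invented for the example, and it assumes the weak learner's weighted error is strictly between 0 and 1.

```python
import math

def adaboost_round(weights, correct):
    """One reweighting step of discrete AdaBoost (binary classification).

    weights: current sample weights, summing to 1
    correct: list of bools, True where the weak learner predicted correctly
    Returns (alpha, new_weights). Assumes 0 < weighted error < 1.
    """
    # Weighted error of the weak learner on the current distribution
    err = sum(w for w, c in zip(weights, correct) if not c)
    # Vote weight of this learner: larger when its error is smaller
    alpha = 0.5 * math.log((1 - err) / err)
    # Shrink weights of correctly classified samples, grow the others
    raw = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    total = sum(raw)
    return alpha, [r / total for r in raw]

# Toy example: five equally weighted samples, the last one misclassified
weights = [0.2] * 5
correct = [True, True, True, True, False]
alpha, new_weights = adaboost_round(weights, correct)
print("alpha:", round(alpha, 3))
print("new weights:", [round(w, 3) for w in new_weights])
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. Bagging has no such step: every model sees an independent bootstrap sample.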

## Question 3
**What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

**Answer:**

Bootstrap sampling creates new training datasets by sampling with replacement from the original dataset. Each bootstrap sample usually has the same size as the original dataset, so some rows appear more than once while others are left out.

**Role in Bagging (e.g., Random Forest):**
- Each tree is trained on a different bootstrap sample.
- Introduces diversity among models.
- Reduces overfitting and variance.

Thus, bootstrap sampling makes it very unlikely that two trees see the same data, so the trees differ from one another.
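
A minimal sketch of bootstrap sampling using only the standard library is shown here. The ten-row `data` list is a stand-in for row indices of a real training set, and `bootstrap_sample` is a hypothetical helper name; libraries such as scikit-learn do this internally.

```python
import random

rng = random.Random(0)      # fixed seed so the sketch is repeatable
data = list(range(10))      # stand-in for 10 training row indices

def bootstrap_sample(rows, rng):
    """Draw len(rows) items uniformly at random, with replacement."""
    return rng.choices(rows, k=len(rows))

# Three bootstrap samples, as three trees in a bagged ensemble would receive
samples = [bootstrap_sample(data, rng) for _ in range(3)]
for s in samples:
    print(sorted(s), "unique rows:", len(set(s)))
```

Each printed sample has ten entries, but typically fewer than ten unique rows; the missing rows are what the next question calls out-of-bag samples.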

## Question 4
**What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**

**Answer:**

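- **OOB samples:** Data points that were not selected in a given tree's bootstrap sample (on average about 37% of the data, roughly 1/e, for each tree).
- **OOB score:** Each data point is predicted only by the trees for which it was OOB. Aggregating these predictions gives a nearly unbiased performance estimate without needing a separate validation set. In scikit-learn this is enabled with `oob_score=True` and read from the `oob_score_` attribute.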
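
The ~37% figure follows from a short calculation: a given row is missed by one draw with probability 1 - 1/n, and by all n draws with probability (1 - 1/n)^n, which approaches 1/e as n grows. The sketch below checks this numerically; `oob_fraction` is a name chosen for the illustration.

```python
import math

def oob_fraction(n: int) -> float:
    """Probability that a given row is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n

for n in (10, 100, 10_000):
    print(n, round(oob_fraction(n), 4))
print("limit 1/e:", round(math.exp(-1), 4))
```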

## Question 5
**Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**

**Answer:**

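- **Decision Tree:** Feature importance is based on the total impurity reduction (Gini/Entropy) from splits on each feature. It can be unstable, since a small change in the data may produce a different tree, and it is biased toward features with many distinct values.
- **Random Forest:** Averages impurity-based importance across many trees → more stable and reliable than a single tree's estimate. It still inherits the bias toward high-cardinality features, so permutation importance is a useful cross-check.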
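
The "impurity reduction" behind these scores can be computed by hand for a single split. The sketch below uses Gini impurity on a toy set of eight labels; the helper names are made up for the example. As far as the general method goes, a tree's importance for a feature sums such decreases (weighted by the share of samples reaching each node) over all splits on that feature, and a forest averages the result across trees.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Weighted drop in Gini impurity from splitting parent into left and right."""
    n = len(parent)
    return (
        gini(parent)
        - (len(left) / n) * gini(left)
        - (len(right) / n) * gini(right)
    )

parent = [0, 0, 0, 0, 1, 1, 1, 1]
# A split that separates the classes perfectly
perfect = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])
# A split that leaves both children as mixed as the parent
useless = impurity_decrease(parent, [0, 0, 1, 1], [0, 0, 1, 1])
print("perfect split:", perfect)
print("useless split:", useless)
```

A feature that produces splits like the first one accumulates a high importance score, while a feature that only produces splits like the second contributes nothing.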

## Question 6
**Write a Python program to:**
- Load the Breast Cancer dataset using `sklearn.datasets.load_breast_cancer()`
- Train a Random Forest Classifier
- Print the top 5 most important features based on feature importance scores.

In [None]:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Feature importance
importances = pd.Series(rf.feature_importances_, index=feature_names)
top5 = importances.sort_values(ascending=False).head(5)
print("Top 5 Important Features:")
print(top5)


## Question 7
**Write a Python program to:**
- Train a Bagging Classifier using Decision Trees on the Iris dataset
- Evaluate its accuracy and compare with a single Decision Tree

In [None]:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))

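# Bagging Classifier
# The base model is passed positionally: the keyword was renamed from
# `base_estimator` to `estimator` in newer scikit-learn releases.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)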
bag.fit(X_train, y_train)
bag_acc = accuracy_score(y_test, bag.predict(X_test))

print("Decision Tree Accuracy:", dt_acc)
print("Bagging Classifier Accuracy:", bag_acc)


## Question 8
**Write a Python program to:**
- Train a Random Forest Classifier
- Tune hyperparameters `max_depth` and `n_estimators` using GridSearchCV
- Print the best parameters and final accuracy

In [1]:

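from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset explicitly so this cell runs on its own
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Hyperparameter values to search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10]
}
rf = RandomForestClassifier(random_state=42)
# 5-fold cross-validated search over all 9 combinations, using all CPU cores
grid = GridSearchCV(rf, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

# best_estimator_ is refit on the whole training split with the best parameters
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Params:", grid.best_params_)
print("Final Accuracy:", accuracy_score(y_test, y_pred))
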

## Question 9
**Write a Python program to:**
- Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
- Compare their Mean Squared Errors (MSE)

In [None]:

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

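# Bagging Regressor
# The base model is passed positionally: the keyword was renamed from
# `base_estimator` to `estimator` in newer scikit-learn releases.
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=42)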
bag.fit(X_train, y_train)
bag_mse = mean_squared_error(y_test, bag.predict(X_test))

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_mse = mean_squared_error(y_test, rf.predict(X_test))

print("Bagging Regressor MSE:", bag_mse)
print("Random Forest Regressor MSE:", rf_mse)


## Question 10
**Case Study:**
You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.

**You decide to use ensemble techniques to increase model performance.**

**Step-by-step Approach:**

1. **Choose between Bagging or Boosting:**
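   - Loan default prediction is typically an imbalanced classification problem, because defaults are rare compared with repaid loans.
   - Boosting (XGBoost, LightGBM) is a reasonable choice because it reduces bias and supports class weighting for imbalanced data (e.g., `scale_pos_weight`).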

2. **Handle overfitting:**
   - Use cross-validation and early stopping.
   - Apply regularization (e.g., XGBoost's `reg_lambda` for L2 and `reg_alpha` for L1).
   - Limit the depth of the trees (`max_depth`).

3. **Select base models:**
   - Decision Trees as weak learners.
   - Combine them with Boosting algorithms.

4. **Evaluate performance:**
   - Use Stratified k-Fold cross-validation.
   - Metrics: AUC-ROC, the precision-recall curve (PR-AUC), and F1-score, which are more informative than accuracy on imbalanced datasets.

5. **Justification:**
   - Ensemble learning captures nonlinear patterns in financial data.
   - Boosting can improve recall on the minority class (loan defaults).
   - Reduces business risk by minimizing false negatives (risky loans that get approved).
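
To show why step 4 prefers precision, recall, and F1 over plain accuracy, here is a small sketch with hypothetical confusion-matrix counts. The portfolio of 1,000 loans with 50 defaults and both sets of counts are invented for illustration; they are not results from any real model. "Positive" means "default".

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical baseline: predict "no default" for every one of 1,000 loans
always_repay = classification_metrics(tp=0, fp=0, fn=50, tn=950)
# Hypothetical model: catches 35 of the 50 defaults, with 40 false alarms
model = classification_metrics(tp=35, fp=40, fn=15, tn=910)

for name, metrics in (("always_repay", always_repay), ("model", model)):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

With these made-up numbers the trivial baseline has the higher accuracy yet catches no defaults, while the model trades a little accuracy for 70% recall. That is the sense in which false negatives, not overall accuracy, drive the business risk.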
