**Q1. What is Ensemble Learning in machine learning? Explain the key idea behind it.**
- Ensemble Learning in machine learning refers to a technique that combines predictions from multiple models (often called base learners or weak learners) to produce a more accurate and robust overall model.

Key Idea Behind Ensemble Learning:
The collective wisdom of multiple models is often better than the prediction of any single model.

Instead of relying on one model that may have limitations or biases, ensemble methods harness the strengths of several models to improve prediction accuracy, reduce overfitting, and increase generalization.
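
The "collective wisdom" effect can be illustrated with a minimal sketch. It assumes a binary problem in which every model is correct independently with the same probability `p`, a simplification that real ensembles only approximate because their errors are correlated. The helper `majority_vote_accuracy` is hypothetical and computes the probability that a majority of `n` such models is right:

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent models, each correct
    with probability p, vote for the right class (n should be odd)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# Accuracy of the vote as the ensemble grows, for models that are 60% accurate
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the vote becomes more accurate as `n` grows, provided `p` is above 0.5. With correlated models the gain is smaller, which is why ensemble methods work to make their base models diverse.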

**Q2. What is the difference between Bagging and Boosting?**
- The key difference between Bagging and Boosting lies in how they build and combine multiple models, and what they focus on improving.

| Feature                   | **Bagging**                                                        | **Boosting**                                                                           |
| ------------------------- | ------------------------------------------------------------------ | -------------------------------------------------------------------------------------- |
| **Full Name**             | Bootstrap Aggregating                                              | Boosting                                                                               |
| **Goal**                  | Reduce **variance** (prevent overfitting)                          | Reduce **bias** (improve accuracy)                                                     |
| **Model Training**        | Models are trained **independently and in parallel**               | Models are trained **sequentially**, each one learning from the errors of the previous |
| **Data Sampling**         | Uses **random subsets** of the data (with replacement)             | Trains on the **full dataset**, either reweighting misclassified instances (AdaBoost) or fitting the errors left by the previous models (gradient boosting) |
| **Model Weighting**       | All models usually have **equal weight** in the final output       | Models are **weighted** based on performance (better models have more influence)       |
| **Combining Predictions** | **Majority voting** (classification) or **averaging** (regression) | **Weighted voting** or **weighted sum** of predictions                                 |
| **Overfitting Risk**      | Less likely (due to randomization and averaging)                   | More likely, because later models can start fitting noise; controlled with regularization and early stopping |
| **Examples**              | Random Forest, Bagged Decision Trees                               | AdaBoost, Gradient Boosting, XGBoost, LightGBM                                         |
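
The "more weight to misclassified instances" idea can be sketched as one round of the discrete AdaBoost update. This is an illustration under stated assumptions, not the implementation of any particular library: labels are assumed to be in {-1, +1}, and the function name `adaboost_reweight` is hypothetical. Gradient boosting works differently (it fits the errors of the current ensemble rather than reweighting samples).

```python
import numpy as np

def adaboost_reweight(weights, y_true, y_pred):
    """One round of the discrete AdaBoost update for labels in {-1, +1}.

    Returns the renormalized sample weights and the model's vote weight alpha.
    """
    miss = y_true != y_pred
    # Weighted error rate, clipped to avoid log(0) or division by zero
    err = np.clip(weights[miss].sum() / weights.sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)
    # Misclassified points (y_true * y_pred == -1) are scaled up, the rest down
    new_weights = weights * np.exp(-alpha * y_true * y_pred)
    return new_weights / new_weights.sum(), alpha

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, -1])   # the last sample is misclassified
weights = np.full(5, 0.2)               # boosting starts from uniform weights

new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
print(new_weights)  # approximately 0.125 for the four correct points, 0.5 for the miss
print(alpha)        # vote weight of this model, ln(2) here
```

After one round the single misclassified point carries half of the total weight, so the next model is pushed to get it right. The same `alpha` is the weight the model receives in the final weighted vote.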


**Q3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**
- Bootstrap sampling is a statistical technique used to generate multiple random datasets from a single original dataset by sampling with replacement.
- We take the original dataset of size n.
- We randomly select data points with replacement to create a new dataset of the same size (n).
- Some samples may appear multiple times, others not at all.
- Each such dataset is called a bootstrap sample.
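
A minimal sketch of drawing one bootstrap sample with NumPy. The array `data` is a stand-in for the rows of a real training set, and the size `n = 10_000` is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
data = np.arange(n)                    # stand-in for n training rows

# Draw n row indices with replacement: this is one bootstrap sample
idx = rng.integers(0, n, size=n)
bootstrap_sample = data[idx]

# Some rows repeat and others never appear
unique_fraction = np.unique(idx).size / n
print(f"Distinct rows in the sample: {unique_fraction:.3f}")
print(f"Rows left out:               {1 - unique_fraction:.3f}")
```

The sample has the same size as the original data, but only a fraction of the original rows appear in it.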

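Bagging stands for Bootstrap Aggregating, and bootstrap sampling is its first step. In a Random Forest the process is:

- Bootstrap Sampling: For each decision tree in the forest, draw a different bootstrap sample from the original training data.
- Model Training: Train each tree independently on its own bootstrap sample. Random Forest also considers only a random subset of features at each split, which further decorrelates the trees.
- Aggregation:
  - Classification: Use majority voting across all trees.
  - Regression: Use the average of all tree predictions.

Because each tree sees a slightly different dataset, the trees make different errors, and aggregating them reduces variance.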
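
The three steps (bootstrap sampling, independent training, aggregation) can be sketched by hand. This is a simplified illustration, not how `RandomForestClassifier` is implemented: it omits the per-split feature subsampling, and the choices of the Iris dataset and `n_trees = 25` are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rng = np.random.default_rng(42)
n_trees = 25
all_preds = []
for _ in range(n_trees):
    # 1. Bootstrap sampling: draw len(X_train) rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # 2. Model training: fit one tree on this sample only
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

# 3. Aggregation: majority vote over the trees for each test row
all_preds = np.array(all_preds)        # shape (n_trees, n_test)
votes = np.array([np.bincount(col, minlength=3).argmax() for col in all_preds.T])
accuracy = (votes == y_test).mean()
print("Bagged accuracy:", round(accuracy, 4))
```

For regression the last step would take `all_preds.mean(axis=0)` instead of a vote.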

**Q4. What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?**
- In Bagging methods like Random Forest, each model (e.g., a decision tree) is trained on a bootstrap sample: a random sample drawn with replacement from the original dataset.
  - Because sampling is with replacement, each bootstrap sample contains only about 63% of the distinct original data points (some of them repeated).
  - The remaining ~37% of the data points are not included in that bootstrap sample. These are called the Out-of-Bag (OOB) samples for that model.
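
The 63% / 37% split follows from a short calculation. For a dataset of size $n$ sampled $n$ times with replacement, each draw misses a given point with probability $1 - 1/n$, so:

$$
P(\text{point } i \text{ is never drawn}) = \left(1 - \frac{1}{n}\right)^{n}, \qquad \lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^{n} = e^{-1} \approx 0.368
$$

In expectation, each bootstrap sample therefore contains about $1 - e^{-1} \approx 0.632$ of the distinct points, and the rest are out-of-bag.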

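The OOB score is computed by predicting each training point using only the models that did not see it during training, then scoring those predictions against the true labels (in scikit-learn, accuracy for classifiers and R² for regressors).

The OOB score is useful in several ways:
- Acts as a built-in validation estimate for bagging methods, similar in spirit to cross-validation.
- Saves time and computation, since no separate validation set needs to be held out.
- Gives a reliable estimate of model performance, especially on large datasets and with enough trees that every point is out-of-bag for several of them.
- Helps in hyperparameter tuning (e.g., choosing the number of trees).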
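
In scikit-learn, the OOB estimate is available through the `oob_score=True` option of `RandomForestClassifier`. A minimal sketch on the Breast Cancer dataset (the dataset choice and `n_estimators=200` are assumptions for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each row using only
# the trees whose bootstrap sample did not contain that row
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X, y)

print("OOB accuracy:", round(forest.oob_score_, 4))
```

No train/test split is needed here: every row is scored by trees that never trained on it.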

**Q5. Compare feature importance analysis in a single Decision Tree vs. a
Random Forest?**
- Here's a direct comparison of how feature importance is calculated and interpreted in a single Decision Tree vs. a Random Forest:

1. Single Decision Tree – Feature Importance

How it's calculated:
- Based on how much each feature reduces impurity (Gini impurity or entropy for classification, variance for regression) when it is used to split nodes.
- The importance of a feature is the sum of the impurity decreases over all nodes that split on it, weighted by the fraction of samples reaching each node and normalized so that the importances sum to 1.

Pros:
- Fast and easy to compute.
- Interpretability: You can trace decisions through the tree to understand why a feature matters.

Cons:
- High variance: Results depend heavily on the particular training data.
- May overemphasize features chosen near the root, where a split affects the most samples.
- Sensitive to noise and overfitting.

2. Random Forest – Feature Importance

How it's calculated:
- Similar to a single tree, but the importance is averaged across all trees in the forest.
- For each tree, track the reduction in impurity caused by each feature.
- Average these per-tree importances over all trees to get the final feature importance scores.

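Pros:
- More stable and reliable than a single tree (less sensitive to data noise).
- Better at capturing true signal by averaging over many trees.
- Predictions stay robust when features are correlated or redundant, although the importance is then shared among the correlated features, so each can look less important than it is.

Cons:
- Less interpretable than a single tree.
- Impurity-based importance is biased toward high-cardinality features.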
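
The difference in stability can be inspected directly. The sketch below assumes the Breast Cancer dataset and default hyperparameters. It compares a single tree's importances with the forest's, and uses the spread of the per-tree importances inside the forest as a rough indicator of how much a single tree's ranking can vary:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=42).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Per-tree importances inside the forest: shape (n_trees, n_features)
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])

# A single tree typically leaves many features at exactly zero importance
print("Features unused by the single tree:", int((tree.feature_importances_ == 0).sum()))
print("Features unused by the forest:     ", int((forest.feature_importances_ == 0).sum()))

# Mean and spread across trees for the forest's top 3 features
top = np.argsort(forest.feature_importances_)[::-1][:3]
for i in top:
    print(f"{data.feature_names[i]:<22} mean={per_tree[:, i].mean():.3f} std={per_tree[:, i].std():.3f}")
```

The forest's `feature_importances_` is the average of the per-tree rows, which is why it is smoother than any one tree's ranking.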

**Q6. Write a Python program to:**

● **Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()**

● **Train a Random Forest Classifier**

● **Print the top 5 most important features based on feature importance scores.**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the dataset into a DataFrame so feature names stay attached
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train the Random Forest
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Impurity-based importances, one per feature (they sum to 1)
importances = model.feature_importances_

feature_importance_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': importances
})

# Sort in descending order and keep the top 5
top_5_features = feature_importance_df.sort_values(by='Importance', ascending=False).head(5)
print("Top 5 Most Important Features:")
print(top_5_features)
```

```
Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850
```


**Q7. Write a Python program to:**

● **Train a Bagging Classifier using Decision Trees on the Iris dataset**

● **Evaluate its accuracy and compare with a single Decision Tree**

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Baseline: a single decision tree
dtree = DecisionTreeClassifier(random_state=42)
dtree.fit(X_train, y_train)
dtree_preds = dtree.predict(X_test)
dtree_accuracy = accuracy_score(y_test, dtree_preds)

# Bagging: 50 trees, each trained on its own bootstrap sample
# (scikit-learn versions before 1.2 use base_estimator= instead of estimator=)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)

bagging.fit(X_train, y_train)
bagging_preds = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_preds)

print("Single Decision Tree Accuracy:", round(dtree_accuracy, 4))
print("Bagging Classifier Accuracy:  ", round(bagging_accuracy, 4))
```

```
Single Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy:   1.0
```


**Q8. Write a Python program to:**

● **Train a Random Forest Classifier**

● **Tune hyperparameters max_depth and n_estimators using GridSearchCV**

● **Print the best parameters and final accuracy**

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

data = load_iris()
X, y = data.data, data.target

# Keep a test set that the grid search never sees
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(random_state=42)

# 3 x 4 = 12 hyperparameter combinations
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 3, 5, 10]
}

# 5-fold cross-validation on the training set for every combination
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training set with the best parameters
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Best Parameters:", grid_search.best_params_)
print("Final Accuracy on Test Set:", round(accuracy, 4))
```

```
Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy on Test Set: 1.0
```


**Q9. Write a Python program to:**

● **Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset**

● **Compare their Mean Squared Errors (MSE)**

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# The dataset is downloaded on first use
data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Bagging: 50 regression trees, each on its own bootstrap sample
bagging_model = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
bagging_model.fit(X_train, y_train)
bagging_preds = bagging_model.predict(X_test)
bagging_mse = mean_squared_error(y_test, bagging_preds)

# Random Forest with the same number of trees
rf_model = RandomForestRegressor(
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_preds)

print("Bagging Regressor MSE:       ", round(bagging_mse, 4))
print("Random Forest Regressor MSE: ", round(rf_mse, 4))
```

```
Bagging Regressor MSE:        0.2579
Random Forest Regressor MSE:  0.2577
```
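
The two MSEs are nearly identical because `RandomForestRegressor` defaults to `max_features=1.0` (all features considered at every split), which makes it behave almost like bagged decision trees. The sketch below shows how per-split feature subsampling changes the model. It uses the bundled Diabetes dataset as a stand-in so that it runs offline. The dataset and the value `max_features=0.5` are assumptions for illustration, not tuned choices:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

results = {}
for max_features in (1.0, 0.5):
    # max_features=1.0 uses every feature at each split (bagging-like);
    # 0.5 considers a random half of the features at each split
    rf = RandomForestRegressor(n_estimators=100, max_features=max_features, random_state=42)
    rf.fit(X_train, y_train)
    results[max_features] = mean_squared_error(y_test, rf.predict(X_test))
    print(f"max_features={max_features}: MSE={results[max_features]:.1f}")
```

Whether the subsampled forest scores better depends on the data, so `max_features` is worth including in a hyperparameter search rather than fixing in advance.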


**Q10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.**

**You decide to use ensemble techniques to increase model performance.**

**Explain your step-by-step approach to:**

● **Choose between Bagging or Boosting**

● **Handle overfitting**

● **Select base models**

● **Evaluate performance using cross-validation**

● **Justify how ensemble learning improves decision-making in this real-world context.**

Step-by-Step Approach:

1. Choose Between Bagging or Boosting

Key considerations:
- Bagging (e.g., Random Forest):
  - Reduces variance
  - Best for unstable, high-variance models (e.g., deep decision trees)
  - Works well if a single model is overfitting or if the data is noisy
- Boosting (e.g., XGBoost, LightGBM):
  - Reduces bias
  - Builds a strong learner by correcting the mistakes of previous models
  - Works better when simpler models are underfitting

For loan default on tabular data, a reasonable plan is to train a Random Forest as a baseline and adopt a boosting model only if it scores better under the same cross-validation scheme.
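
One way to make this choice empirically is to score one model of each kind on the same folds. The sketch below is illustrative only: `make_classification` generates a synthetic stand-in for the loan data (about 10% positives), and scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost or LightGBM so the example needs no extra packages.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for loan data: 20 features, about 10% positives ("defaults")
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "Bagging (Random Forest)": RandomForestClassifier(n_estimators=200, random_state=42),
    "Boosting (Gradient Boosting)": GradientBoostingClassifier(random_state=42),
}

# Mean AUC-ROC of each candidate on identical folds
auc = {}
for name, model in candidates.items():
    auc[name] = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    print(f"{name}: mean AUC-ROC = {auc[name]:.3f}")
```

The ranking on synthetic data says nothing about the real portfolio. The point is the procedure: same data, same folds, same metric.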

2. Handle Overfitting

Boosting models are powerful but prone to overfitting, especially on noisy or high-dimensional data. Techniques:
- Early stopping: stop training when the validation error stops improving.
- Regularization:
  - max_depth (limits tree depth)
  - learning_rate (shrinks each tree's contribution)
  - subsample (uses a fraction of the rows per tree)
  - colsample_bytree (uses a fraction of the features per tree)
- Cross-validation to detect overfitting early.
- Feature selection / dimensionality reduction (e.g., removing irrelevant or highly correlated features).
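
These controls can be sketched with scikit-learn's `GradientBoostingClassifier`, used here as a stand-in for XGBoost so the example has no extra dependencies. Its parameter names differ slightly: `max_features` is the closest analogue of `colsample_bytree`, and `n_iter_no_change` enables early stopping. The data is synthetic and all values are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (about 10% positives)
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = GradientBoostingClassifier(
    n_estimators=500,         # upper bound; early stopping may use fewer
    learning_rate=0.05,       # smaller contribution per tree
    max_depth=3,              # shallow trees
    subsample=0.8,            # fraction of rows used for each tree
    max_features=0.8,         # fraction of features considered at each split
    validation_fraction=0.2,  # held out internally to monitor the loss
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
model.fit(X_train, y_train)

print("Trees actually fitted:", model.n_estimators_)
print("Test accuracy:", round(model.score(X_test, y_test), 4))
```

`n_estimators_` reports how many trees were kept, which shows whether early stopping ended training before the 500-tree limit.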

3. Select Base Models

- Bagging: Use DecisionTreeClassifier as the base estimator (high variance, so it benefits most from bagging).
- Boosting:
  - Base learners are typically shallow trees (e.g., max_depth=3) to reduce overfitting.
  - Libraries such as XGBoost, LightGBM, and CatBoost build these trees internally, so there is no need to specify the base model manually.

4. Evaluate Performance Using Cross-Validation

Process:
- Use Stratified K-Fold Cross-Validation so that every fold keeps the same default rate as the full dataset (important because loan defaults are usually the minority class).
- Evaluation metrics:
  - AUC-ROC (discrimination between default and no-default)
  - Precision-Recall (especially useful with imbalanced data)
  - F1-score, Confusion Matrix
- Optionally, use cost-based evaluation, since a false negative (a missed default) is usually costlier than a false positive.

Implementation:

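```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in (about 10% defaults); replace with the real loan features X and labels y
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42)

# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted
model = XGBClassifier(eval_metric='logloss', random_state=42)
# Stratified folds keep the default rate similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print("Average AUC-ROC:", scores.mean())
```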
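
The other metrics listed above, and a cost-based view, can be obtained from the same folds. The sketch below makes several assumptions: the data is synthetic, scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, and the 10:1 cost ratio between a missed default and a wrongly rejected loan is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_validate

# Synthetic stand-in for the loan data (about 10% positives)
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42)
model = GradientBoostingClassifier(random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Several metrics computed on the same folds
results = cross_validate(model, X, y, cv=cv, scoring=["roc_auc", "average_precision", "f1"])
for metric in ("roc_auc", "average_precision", "f1"):
    print(f"{metric}: {results['test_' + metric].mean():.3f}")

# Out-of-fold predictions give one confusion matrix over the whole dataset
y_pred = cross_val_predict(model, X, y, cv=cv)
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()

# Hypothetical costs: a missed default is assumed 10x as costly as a false alarm
COST_FN, COST_FP = 10, 1
total_cost = COST_FN * fn + COST_FP * fp
print("Confusion matrix (tn, fp, fn, tp):", tn, fp, fn, tp)
print("Total cost under the assumed weights:", total_cost)
```

With real cost figures from the business, the same total can be used to compare models or to pick a decision threshold.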


5. Justify Ensemble Learning in the Real-World Context
Why Ensemble Learning Helps in Loan Default Prediction:

| Challenge               | How Ensemble Helps                                                           |
| ----------------------- | ---------------------------------------------------------------------------- |
| High-dimensional data   | Learns complex interactions through many trees                               |
| Imbalanced classes      | Boosting can focus more on hard-to-classify (minority) samples               |
| Noise in features       | Bagging averages out instability caused by noisy samples                     |
| Interpretability needed | Feature importance from models like Random Forest or XGBoost aids compliance |
| Cost of wrong decisions | Boosting iteratively corrects earlier errors; class weights or a tuned decision threshold can then target the costlier error type |


Impact:
By improving predictive accuracy, reducing false negatives, and giving more reliable risk assessments, ensemble models support better credit decisioning, fraud detection, and risk mitigation, all of which are essential in a financial institution.