Q1: What is Ensemble Learning in machine learning? Explain the key idea behind it.
- Ensemble Learning is a machine learning technique in which multiple models (called base learners, or weak learners when each one is only slightly better than chance) are combined to build a stronger and more accurate model.
Instead of relying on a single model, ensemble methods aggregate the predictions of many models to reduce error, improve stability, and generalize better.

- Key idea:

  “Many diverse weak models combined can perform better than one strong model, provided their errors are not strongly correlated.”

- Why it works:

    Reduces variance (Bagging)

    Reduces bias (Boosting)

    Handles complex patterns better

    More robust to noise and overfitting, especially with Bagging

- Common ensemble methods:

    Bagging (Random Forest)

    Boosting (AdaBoost, Gradient Boosting, XGBoost)

    Voting & Stacking
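The key idea can be checked with a little arithmetic. The sketch below is an idealized illustration, not a description of any library: it assumes every model is correct with the same probability `p` and that the models make independent errors, which real ensembles only approximate. The function name `majority_vote_accuracy` is made up for this example.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models is correct.

    Assumes each model is right with probability p and that errors are
    independent. Use an odd n_models so that ties cannot occur.
    """
    needed = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Each model alone is only 60% accurate; the vote improves as models are added
for n in (1, 5, 25, 101):
    print(f"{n:>3} models -> ensemble accuracy {majority_vote_accuracy(n, 0.6):.3f}")
```

With correlated errors the gain shrinks, which is why Bagging and Random Forest work to make their models as different from each other as possible.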
Q2: What is the difference between Bagging and Boosting?

| Aspect      | Bagging                      | Boosting                         |
| ----------- | ---------------------------- | -------------------------------- |
| Goal        | Reduce variance              | Reduce bias                      |
| Training    | Models trained independently | Models trained sequentially      |
| Data        | Random bootstrap samples     | Focuses on misclassified samples |
| Overfitting | Tames high-variance models   | Can overfit noisy data           |
| Example     | Random Forest                | AdaBoost, Gradient Boosting      |

- “Bagging improves stability, while Boosting improves accuracy by learning from mistakes.”
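To make "learning from mistakes" concrete, here is a minimal sketch of one reweighting round in classic binary AdaBoost (labels in {-1, +1}). The helper name `adaboost_reweight` and the toy predictions are hypothetical, and library implementations differ in details.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return (alpha, new_weights).

    Assumes the weights sum to 1 and the weighted error is strictly
    between 0 and 1.
    """
    # Weighted error of the current weak learner
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Vote strength of this learner: larger when its error is smaller
    alpha = 0.5 * math.log((1 - err) / err)
    # Up-weight mistakes (t * p == -1), down-weight correct rows (t * p == +1)
    raw = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, -1, -1, -1, +1]  # toy weak learner: one mistake, at index 1
weights = [0.2] * 5            # start from uniform weights

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print("alpha:", round(alpha, 3))
print("new weights:", [round(w, 3) for w in new_weights])
```

After the update the single misclassified row carries half of the total weight, so the next learner is pushed to get it right. Bagging has no such step: every model sees an independent bootstrap sample.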


Q3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
- Bootstrap sampling is a technique where random samples are drawn with replacement from the original dataset, usually with the same number of rows as the original.

- Role in Bagging / Random Forest:

    Each tree is trained on a different bootstrap sample

    Some rows appear multiple times, some not at all

    This creates diversity among models, reducing correlation

    Leads to lower variance and better generalization
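The points above can be seen on a toy dataset. This is a small sketch using only the standard library; `rows` stands in for the row indices of a real dataset, and the helper `bootstrap_sample` is a made-up name for illustration.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows WITH replacement."""
    return rng.choices(rows, k=len(rows))


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for the row indices of a dataset

for tree in range(3):
    sample = bootstrap_sample(rows, rng)
    never_drawn = sorted(set(rows) - set(sample))
    print(f"tree {tree}: sample={sorted(sample)} never drawn={never_drawn}")
```

Each "tree" gets a sample of the same size as the data, with some rows repeated and some missing, so no two trees are trained on exactly the same rows.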

Q4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?
- Out-of-Bag (OOB) samples are data points not selected in a bootstrap sample.

- Key points:

    About 36.8% (≈ 1/e) of the data is OOB for each tree

    Each row is scored only by the trees that did not see it, so OOB rows act as validation data

    OOB score estimates model performance without a separate test set

- Why useful:

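    Saves data: no separate hold-out set is needed

    Faster evaluation: no extra cross-validation training runs

    Less risk of an over-optimistic estimate than scoring on the training data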
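As a rough illustration, scikit-learn's Random Forest can report the OOB score directly when `oob_score=True` is passed. The dataset and the choice of 200 trees are assumptions made for this sketch only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores every row using only the trees whose bootstrap
# sample did not contain that row (needs bootstrap=True, the default)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print("OOB accuracy:", rf.oob_score_)

# Expected OOB fraction per tree: (1 - 1/n)^n, which approaches 1/e ~ 0.368
n = len(X)
print("Expected OOB fraction:", (1 - 1 / n) ** n)
```

No train/test split is made here: the OOB accuracy is computed from the same `fit` call that trains the forest.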
Q5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

| Aspect             | Decision Tree     | Random Forest            |
| ------------------ | ----------------- | ------------------------ |
| Stability          | Unstable          | Stable                   |
| Variance           | High              | Lower                    |
| Overfitting        | High              | Low                      |
| Feature importance | Based on one tree | Averaged over many trees |
| Reliability        | Low               | High                     |

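Random Forest provides more reliable and robust feature importance, because the scores are averaged over many trees trained on different bootstrap samples and feature subsets.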
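A quick way to see the difference is to fit both models on the same data and put their importances side by side. This is a sketch under assumed settings (Breast Cancer data, default hyperparameters, fixed seeds), not a benchmark.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Both are impurity-based importances, normalized to sum to 1
tree_imp = tree.feature_importances_
forest_imp = forest.feature_importances_

# Top 5 features according to the forest, with the single tree's score alongside
for i in forest_imp.argsort()[::-1][:5]:
    print(f"{data.feature_names[i]:<25} tree={tree_imp[i]:.3f} forest={forest_imp[i]:.3f}")

# A single tree only scores the features it happened to split on
print("zero-importance features:",
      (tree_imp == 0).sum(), "(tree) vs", (forest_imp == 0).sum(), "(forest)")
```

A single tree typically concentrates its importance on the few features it split on and gives zero to the rest, while the forest spreads credit across correlated features.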





In [2]:
#Q6: Write a Python program to:
#● Load the Breast Cancer dataset using
#    sklearn.datasets.load_breast_cancer()
#● Train a Random Forest Classifier
#● Print the top 5 most important features based on feature importance scores.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load data
data = load_breast_cancer()
X = data.data
y = data.target

# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Feature importance
importance = pd.Series(rf.feature_importances_, index=data.feature_names)
top_5 = importance.sort_values(ascending=False).head(5)

print("Top 5 Important Features:\n", top_5)



Top 5 Important Features:
 worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


In [8]:
#Q7: Write a Python program to:
#● Train a Bagging Classifier using Decision Trees on the Iris dataset
#● Evaluate its accuracy and compare with a single Decision Tree

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Single Tree (fixed seed so the result is reproducible)
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))

# Bagging with Decision Trees as the base estimator (this is also the default)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42)
bag.fit(X_train, y_train)
bag_acc = accuracy_score(y_test, bag.predict(X_test))

print("Decision Tree Accuracy:", dt_acc)
print("Bagging Accuracy:", bag_acc)



Decision Tree Accuracy: 1.0
Bagging Accuracy: 1.0


In [6]:
#Q8: Write a Python program to:
#● Train a Random Forest Classifier
#● Tune hyperparameters max_depth and n_estimators using GridSearchCV
#● Print the best parameters and final accuracy

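from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Load the data explicitly so this cell does not rely on X, y from an earlier cell
# (assumption: the Iris dataset, as in Q7)
X, y = load_iris(return_X_y=True)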

params = {
    'n_estimators': [50, 100],
    'max_depth': [None, 5, 10]
}

rf = RandomForestClassifier(random_state=42)

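# 5-fold cross-validated grid search over all parameter combinations
grid = GridSearchCV(rf, params, cv=5)
grid.fit(X, y)

# best_score_ is the mean cross-validated accuracy of the best combination
print("Best Parameters:", grid.best_params_)
print("Best Accuracy:", grid.best_score_)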


Best Parameters: {'max_depth': None, 'n_estimators': 50}
Best Accuracy: 0.9666666666666668


In [7]:
#Q9: Write a Python program to:
#● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
#● Compare their Mean Squared Errors (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
# No random_state is set for the split or the models, so the MSE values vary slightly between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

bag = BaggingRegressor(n_estimators=100)
rf = RandomForestRegressor(n_estimators=100)

bag.fit(X_train, y_train)
rf.fit(X_train, y_train)

print("Bagging MSE:", mean_squared_error(y_test, bag.predict(X_test)))
print("Random Forest MSE:", mean_squared_error(y_test, rf.predict(X_test)))


Bagging MSE: 0.24783658522634328
Random Forest MSE: 0.24856295146428647


Q10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.

You decide to use ensemble techniques to increase model performance.

Explain your step-by-step approach to:

● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world context.

- Step-by-Step Approach
1. Choose Bagging or Boosting

    - Start with Random Forest (Bagging) for stability when single trees overfit

    - Use Boosting if bias is high, i.e. the model underfits

2. Handle Overfitting

    - Cross-validation

    - Limit tree depth

    - Increase number of estimators for Bagging; for Boosting, prefer a lower learning rate and early stopping

3. Select Base Models

    - Decision Trees (interpretable)

    - Logistic Regression (baseline)

4. Evaluate Performance

    - Stratified k-fold cross-validation (keeps the default rate similar in every fold)

    - ROC-AUC

    - Precision-Recall

    - Confusion Matrix

5. Business Justification

    - Better default detection

    - Reduced financial risk

    - Fairer credit decisions

    - Regulatory-friendly explainability via feature importance scores
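The evaluation step could be sketched as follows. No real loan data is used: `make_classification` generates a synthetic stand-in with roughly 10% positives ("defaults"), and every hyperparameter value is an illustrative assumption rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for loan data (class 1 = default)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

models = {
    # Bagging-style: depth limit and class weighting to handle imbalance
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=8, class_weight="balanced", random_state=42
    ),
    # Boosting-style: shallow trees and a modest learning rate
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42
    ),
}

# Stratified folds keep the default rate similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    results[name] = scores
    print(f"{name}: ROC-AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the mean and spread of ROC-AUC across folds gives a basis for choosing between Bagging and Boosting before any threshold or business rule is applied.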



