### Ensemble Learning

1. What is Ensemble Learning in machine learning? Explain the key idea behind it.
- Ensemble learning combines multiple models to make a single prediction. The core idea is that diverse models whose errors are not strongly correlated will, when aggregated, typically make smaller errors than any one of them alone. Ensembles achieve this by averaging or voting over independently trained models (bagging), by training models sequentially so that each corrects its predecessors' mistakes (boosting), or by learning how to combine different models' outputs (stacking). The result is better generalization through reduced variance and/or bias: higher accuracy, robustness to noise, and more reliable performance across datasets.
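
The benefit of aggregation can be illustrated with a small calculation. This is a minimal sketch under an idealized assumption that is not part of the answer above: every base classifier is correct with the same probability and makes errors independently of the others. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes each model is right with probability p_correct, independently
    of the others (an idealization; real models have correlated errors).
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


for n in (1, 3, 11, 101):
    acc = majority_vote_accuracy(n, 0.6)
    print(f"{n:>3} models -> majority-vote accuracy {acc:.3f}")
```

With a per-model accuracy above 0.5, the majority-vote accuracy climbs as models are added. Correlated errors, which are the norm in practice, shrink this gain, which is why bagging and random feature subsets try to decorrelate the base models.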

2. What is the difference between Bagging and Boosting?
- Bagging trains many base learners independently on different bootstrap samples and aggregates their predictions to reduce variance and combat overfitting; Random Forest is a classic example. Boosting trains base learners sequentially, each focusing on the errors of the ensemble built so far, and combines them in a weighted sum to reduce bias. AdaBoost does this by reweighting misclassified examples, while Gradient Boosting fits each new learner to the residual errors (gradients of the loss). In short: bagging = parallel, variance reduction via averaging; boosting = sequential, bias reduction via reweighting or residual fitting.
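
The contrast can be seen side by side in scikit-learn. The following is a rough sketch on synthetic data from `make_classification` (an assumption made for illustration, not a dataset used elsewhere in this notebook); it relies on the library defaults for the base estimators, which are a fully grown decision tree for `BaggingClassifier` and a depth-1 tree for `AdaBoostClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Bagging: independent, fully grown trees, each fit on its own bootstrap sample.
bagging = BaggingClassifier(n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)

# Boosting: depth-1 trees fit one after another, each on reweighted examples.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
boosting.fit(X_train, y_train)

bagging_acc = bagging.score(X_test, y_test)
boosting_acc = boosting.score(X_test, y_test)
print(f"Bagging accuracy:  {bagging_acc:.4f}")
print(f"AdaBoost accuracy: {boosting_acc:.4f}")
```

Which of the two scores higher depends on the data; the point of the sketch is the structural difference, not a ranking.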

3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
- Bootstrap sampling draws training examples with replacement from the original dataset to create multiple “bootstrap” datasets, each the same size as the original. In bagging methods like Random Forest, each tree is trained on its own bootstrap sample, so the trees see different data and make different errors. Aggregating their predictions reduces variance and overfitting, while the out-of-bag (OOB) samples (those not selected for a given tree) provide an internal, approximately unbiased estimate of generalization performance without a separate validation set.
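
A single bootstrap draw can be sketched with the standard library alone. The helper name `bootstrap_sample` and the toy dataset size of 10 rows are assumptions made for this illustration.

```python
import random


def bootstrap_sample(n_rows: int, rng: random.Random):
    """Return (in_bag, oob) row indices for one bootstrap draw of size n_rows."""
    # Draw n_rows indices uniformly *with replacement*: duplicates are expected.
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Rows that were never drawn are out-of-bag for this draw.
    oob = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, oob


rng = random.Random(42)
in_bag, oob = bootstrap_sample(10, rng)
print("in-bag indices:    ", in_bag)
print("out-of-bag indices:", oob)
```

Each tree in a bagged ensemble would be trained on the rows listed in `in_bag` and could be evaluated on the rows listed in `oob`.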

4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?
- Out-of-Bag (OOB) samples are the training instances not included in a given bootstrap sample when building a particular tree in a bagging ensemble (about 36.8% on average per tree). For each data point, predictions from only the trees where that point was OOB are aggregated to produce an OOB prediction; the OOB score is the resulting performance metric. This provides an internal, approximately unbiased estimate of generalization performance without a separate validation set and helps guide model selection and hyperparameter tuning.
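
The 36.8% figure follows from the sampling scheme itself. A short derivation, assuming each bootstrap sample consists of $n$ draws made uniformly with replacement from $n$ training points:

```latex
\begin{align*}
P(\text{point } i \text{ is missed by one draw}) &= 1 - \frac{1}{n} \\
P(\text{point } i \text{ is OOB for a given tree}) &= \left(1 - \frac{1}{n}\right)^{n} \\
\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^{n} &= e^{-1} \approx 0.368
\end{align*}
```

For large $n$, each tree therefore leaves out roughly 36.8% of the training points, and each point is OOB for roughly 36.8% of the trees, so a forest with a few hundred trees gives every point many OOB votes.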

5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.
- A single decision tree’s importance reflects each feature’s total impurity reduction across its own splits, which can be unstable and biased toward high-cardinality or noisy features, especially if the tree overfits the training data. In a Random Forest, importances are averaged over many trees, each trained on a bootstrap sample with a random feature subset considered at each split, yielding more stable, generalizable scores. The same impurity-based bias can persist, however, unless you use an alternative measure such as permutation importance computed on held-out data. Practically, forests reduce variance in importance estimates compared with one tree, and pairing them with permutation importance helps counter biases and overfitting artifacts for more reliable attribution.
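
The two kinds of importance can be compared directly. This is a minimal sketch using scikit-learn's `permutation_importance` on the Breast Cancer dataset; the split, the number of trees, and `n_repeats=10` are illustrative choices, not tuned values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Impurity-based importances come from the training data only.
impurity_top = rf.feature_importances_.argsort()[::-1][:5]

# Permutation importances: drop in test score when one feature is shuffled.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
perm_top = perm.importances_mean.argsort()[::-1][:5]

print("Top 5 by impurity importance:")
for i in impurity_top:
    print(f"  {data.feature_names[i]}: {rf.feature_importances_[i]:.4f}")

print("Top 5 by permutation importance (held-out data):")
for i in perm_top:
    print(f"  {data.feature_names[i]}: {perm.importances_mean[i]:.4f}")
```

When features are strongly correlated, as several are in this dataset, permutation importances can be small for every member of a correlated group, so the two rankings need not agree.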

In [1]:
""" 6. Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores."""

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np

data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(
    n_estimators=300,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

importances = rf.feature_importances_
top_idx = np.argsort(importances)[::-1][:5]

print("Top 5 features by importance:")
for i in top_idx:
    print(f"{feature_names[i]}: {importances[i]:.4f}")

Top 5 features by importance:
worst perimeter: 0.1373
worst area: 0.1373
worst concave points: 0.1155
mean concave points: 0.0918
worst radius: 0.0841


In [2]:
""" 7. Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree """

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    max_samples=1.0,
    bootstrap=True,
    n_jobs=-1,
    random_state=42
)
bag.fit(X_train, y_train)
bag_acc = accuracy_score(y_test, bag.predict(X_test))

print(f"Accuracy - Decision Tree: {dt_acc:.4f}")
print(f"Accuracy - Bagging (100 trees): {bag_acc:.4f}")

Accuracy - Decision Tree: 0.9333
Accuracy - Bagging (100 trees): 0.9667


In [3]:
""" 8. Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy """

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(random_state=42, n_jobs=-1)
param_grid = {
    "n_estimators": [50, 100, 200, 300],
    "max_depth": [None, 5, 10, 20]
}

grid = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
    refit=True
)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)

print("Best parameters:", grid.best_params_)
print(f"CV Best Score (mean accuracy): {grid.best_score_:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")

Best parameters: {'max_depth': None, 'n_estimators': 200}
CV Best Score (mean accuracy): 0.9604
Test Accuracy: 0.9561


In [4]:
""" 9. Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE) """

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

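# Bagged, fully grown regression trees, each fit on its own bootstrap sample.
bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(random_state=42),
    n_estimators=100,
    bootstrap=True,
    n_jobs=-1,
    random_state=42
)
# Note: with scikit-learn's default max_features=1.0, RandomForestRegressor
# considers every feature at each split, so it behaves much like the bagged
# trees above; the two models differ mainly in ensemble size (300 vs. 100).
rf = RandomForestRegressor(
    n_estimators=300,
    n_jobs=-1,
    random_state=42
)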

bag.fit(X_train, y_train)
rf.fit(X_train, y_train)

bag_mse = mean_squared_error(y_test, bag.predict(X_test))
rf_mse = mean_squared_error(y_test, rf.predict(X_test))

print(f"MSE - BaggingRegressor: {bag_mse:.4f}")
print(f"MSE - RandomForestRegressor: {rf_mse:.4f}")

MSE - BaggingRegressor: 0.2559
MSE - RandomForestRegressor: 0.2534


10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world context.

- Choose method: start with bagging (Random Forest) if variance/overfitting is the main issue; choose boosting (XGBoost/LightGBM/CatBoost) when simpler models underfit (high bias) and complex feature interactions matter. Because defaults are rare and costly to miss, prioritize recall on the default class; run quick baselines for both approaches and select by cross-validated PR-AUC with business-weighted costs.
- Handle overfitting: use out-of-fold tuning of depth, min_samples_leaf, subsampling (rows/cols), early stopping (for boosting), and monotonic constraints if needed; apply regularization (L1/L2 for boosted trees), and keep feature engineering within pipelines to avoid leakage.
- Base models: for bagging use deep, low-bias decision trees, since averaging is what removes their high variance; for boosting use shallow gradient-boosted decision trees with appropriate categorical handling (e.g., CatBoost) and robust missing-value treatment; consider a simple logistic regression as a calibrated benchmark for stability.
- Evaluation via CV: stratified K-fold with grouped/time-aware splits if temporal leakage risk exists; optimize PR-AUC/recall at fixed precision, report confusion matrix at the chosen threshold, and validate calibration (Brier score, calibration curve) plus segment-wise fairness (age, geography, product).
- Business impact: improved risk discrimination reduces missed defaults (lower credit losses) while controlling false positives (customer experience and growth), enables risk-based pricing/limits, prioritizes collections, and supports explainable decisions (feature importances/SHAP) for compliance and auditability.
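
The selection step above can be sketched as follows. Everything about the data here is an assumption: `make_classification` produces a synthetic, imbalanced stand-in (about 10% positives) for real loan records, and the hyperparameters are illustrative rather than tuned. PR-AUC is approximated by scikit-learn's `average_precision` scorer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for loan data: class 1 ("default") is the rare class.
X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=6,
    weights=[0.9, 0.1], random_state=42,
)

candidates = {
    # Bagging-style candidate; min_samples_leaf limits overfitting.
    "random_forest": RandomForestClassifier(
        n_estimators=100, min_samples_leaf=5,
        class_weight="balanced", random_state=42,
    ),
    # Boosting candidate: shallow trees, row subsampling, modest learning rate.
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=100, max_depth=3, learning_rate=0.1,
        subsample=0.8, random_state=42,
    ),
}

# Stratified folds keep the default rate similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
    results[name] = scores.mean()
    print(f"{name}: PR-AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```

On real loan data the plain stratified split would need to be replaced by a grouped or time-ordered split wherever the same customer or a later time period could leak into the training folds.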