# Ensemble Learning | Assignment

## Assignment Code: DA-AG-014

### Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.
- Ensemble learning is a technique in which multiple learning models (called base models or weak learners) are combined to produce a single, stronger model. The core idea is that a group of reasonably accurate models whose errors are not perfectly correlated will usually perform better and be more robust than any single model alone, similar to the "wisdom of the crowd."
- Key ideas:
  - Combine multiple models to reduce variance, reduce bias, or both, and so improve predictions.
  - Models can be of the same type (e.g., many decision trees) or of different types (e.g., SVM + Logistic Regression + Decision Tree).
  - Aggregation methods: voting (classification), averaging (regression), weighted voting, or stacking (a meta-learner trained on the base models' predictions).
  - Depending on the approach, an ensemble mainly reduces variance and overfitting (bagging) or mainly reduces bias (boosting).
- Example: Random Forest combines many decision trees trained on bootstrapped samples and averages/majority-votes their outputs.
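
The aggregation methods listed above can be sketched in a few lines of plain Python. This is only an illustration: the function names (`majority_vote`, `weighted_vote`, `average`) and the toy predictions are made up for this example and do not come from any library.

```python
from collections import Counter
from statistics import mean


def majority_vote(predictions):
    """Return the label predicted by the largest number of base models."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


def average(predictions):
    """Aggregate numeric (regression) predictions by their mean."""
    return mean(predictions)


# Hypothetical predictions from three base models for one sample
class_votes = ["default", "no default", "default"]

print(majority_vote(class_votes))                    # default
print(weighted_vote(class_votes, [0.2, 0.7, 0.1]))   # no default
print(average([2.0, 3.0, 4.0]))                      # 3.0
```

Note how weighting can overturn the plain majority when one model is trusted much more than the others.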



### Question 2: What is the difference between Bagging and Boosting?
- Bagging (Bootstrap Aggregating):
  - Purpose: Reduce variance and avoid overfitting.
  - How: Train multiple base learners independently on different bootstrap samples (random sampling with replacement) of the training data, then aggregate (majority vote or average).
  - Base learners: typically deep trees (unpruned).
  - Example: Random Forest (bagging + random feature selection).

- Boosting:
  - Purpose: Reduce bias (and sometimes variance) by sequentially training weak learners, where each new learner focuses on the examples the previous ones mispredicted.
  - How: Train learners sequentially; each learner tries to correct the mistakes of the ensemble so far; predictions are combined via a weighted sum or vote.
  - Algorithms: AdaBoost, Gradient Boosting Machines (GBM), XGBoost, LightGBM, CatBoost.
  - Boosting often gives higher accuracy than bagging but can be more prone to overfitting, especially with noisy labels, so it needs regularization (e.g., a small learning rate, shallow trees, early stopping).

| Aspect   |                Bagging | Boosting                           |
| -------- | ---------------------: | ---------------------------------- |
| Training | Parallel (independent) | Sequential                         |
| Samples  |     Bootstrap (random) | Weighted focusing on hard examples |
| Reduces  |               Variance | Bias (and sometimes variance)      |
| Example  |          Random Forest | AdaBoost, XGBoost                  |
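
As a rough illustration of the comparison in the table, the sketch below cross-validates one bagging ensemble and one boosting ensemble on the same data. The dataset, the hyperparameters, and the choice of `GradientBoostingClassifier` to represent boosting are assumptions made for this example; the scores depend on all of them and should not be read as a general ranking.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: deep trees trained independently on bootstrap samples
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: shallow trees trained sequentially, each correcting the previous ones
boosting = GradientBoostingClassifier(
    n_estimators=50, max_depth=2, learning_rate=0.1, random_state=0
)

scores = {}
for name, model in [("bagging", bagging), ("boosting", boosting)]:
    # Mean accuracy over 5 cross-validation folds
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {scores[name]:.4f}")
```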


### Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
- Bootstrap sampling: sampling technique where we draw samples with replacement from the original dataset to create many different training sets (each the same size as the original). Some original instances may repeat in a bootstrap sample; some may be left out.
- Role in Bagging / Random Forest:
  - Creates diverse training subsets so that base models (trees) see different data, which increases model variety and reduces the correlation between base learners.
  - Because base models are trained on different samples, averaging their predictions reduces variance and improves generalization.
  - In Random Forest, bootstrap sampling is combined with extra randomness (random feature selection at each split), which further increases diversity among trees and improves ensemble performance.
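
A minimal sketch of bootstrap sampling using only the standard library. The helper name `bootstrap_sample` and the toy dataset are invented for this illustration; the printed comparison shows that the share of instances left out of one bootstrap sample is close to (1 - 1/n)^n.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement; also return the items never drawn."""
    n = len(data)
    indices = [rng.randrange(n) for _ in range(n)]
    chosen = set(indices)
    in_bag = [data[i] for i in indices]
    out_of_bag = [data[i] for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(0)
data = list(range(10_000))  # stand-in for 10,000 training instances

in_bag, out_of_bag = bootstrap_sample(data, rng)

oob_fraction = len(out_of_bag) / len(data)
expected = (1 - 1 / len(data)) ** len(data)

print(f"bootstrap sample size:  {len(in_bag)}")
print(f"observed left-out share: {oob_fraction:.3f}")
print(f"expected (1 - 1/n)^n:    {expected:.3f}")
```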

### Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?
- Out-of-Bag (OOB) samples: Each bootstrap sample is expected to leave out about 37% of the original training instances (the chance that a given instance is never drawn in n draws with replacement is (1 - 1/n)^n ≈ 1/e ≈ 0.368). These left-out instances are the OOB samples for that particular base model.
- OOB score usage:
  - Each sample is predicted using only the base learners that did not see it during training (i.e., those for which it was OOB). Aggregating predictions across those learners gives an OOB prediction for that sample.
  - The OOB score, computed by comparing these predictions with the true labels, approximates cross-validation performance without a separate hold-out set, which is useful for Random Forests and other bagging methods.
  - It provides an internal validation estimate and can be used for model selection or early stopping in some implementations.
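
The sketch below shows one way to obtain an OOB score with scikit-learn and to compare it with a 5-fold cross-validation estimate. The dataset and the number of trees are assumptions chosen for illustration; the two estimates are usually close but are not expected to match exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each sample using only
# the trees whose bootstrap sample did not contain it
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
oob_accuracy = rf.oob_score_

# Reference estimate from ordinary 5-fold cross-validation
cv_accuracy = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=5
).mean()

print(f"OOB accuracy:       {oob_accuracy:.4f}")
print(f"5-fold CV accuracy: {cv_accuracy:.4f}")
```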

### Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.
- Single Decision Tree:
  - Feature importance often measured by the total reduction of impurity (e.g., Gini or entropy) contributed by splits using that feature.
  - Can be unstable — small data changes can change the tree structure and importances.
  - Single tree may overemphasize features that fit training idiosyncrasies.

- Random Forest:
  - Feature importance is averaged over many trees, which makes it more stable and reliable.
  - Two common measures:
    - Mean decrease in impurity (MDI): average impurity reduction across trees when splitting on the feature. It is computed from the training data and tends to favor high-cardinality features.
    - Permutation importance (mean decrease in accuracy): the drop in performance when a feature's values are shuffled. It is model-agnostic and, when computed on held-out data, usually gives a more trustworthy view.
  - Averaging over trees grown on different bootstrap samples and feature subsets dampens the overfitting of any single tree.

- Summary: Random Forest produces more robust and less noisy feature importance estimates than a single decision tree.
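
The two importance measures can be computed side by side as sketched below. The dataset, the train/test split, and the helper `top_k` are assumptions made for this illustration; the two rankings often differ, particularly when features are strongly correlated.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0, stratify=data.target
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# MDI: impurity-based importances, derived from the training data
mdi = rf.feature_importances_

# Permutation importance: mean accuracy drop on held-out data after shuffling a feature
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)


def top_k(scores, k=5):
    """Return the k highest-scoring (feature name, score) pairs."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(str(data.feature_names[i]), float(scores[i])) for i in order]


print("Top 5 by MDI:")
for name, score in top_k(mdi):
    print(f"  {name:<25} {score:.4f}")

print("Top 5 by permutation importance:")
for name, score in top_k(perm.importances_mean):
    print(f"  {name:<25} {score:.4f}")
```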


### Question 6: Write a Python program to:
● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()

● Train a Random Forest Classifier

● Print the top 5 most important features based on feature importance scores.

### Question 7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset

● Evaluate its accuracy and compare with a single Decision Tree

### Question 8: Write a Python program to:
● Train a Random Forest Classifier

● Tune hyperparameters max_depth and n_estimators using GridSearchCV

● Print the best parameters and final accuracy

### Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset

● Compare their Mean Squared Errors (MSE)

### Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:
● Choose between Bagging or Boosting

● Handle overfitting

● Select base models

● Evaluate performance using cross-validation

● Justify how ensemble learning improves decision-making in this real-world context.

- Scenario: Predict loan default using customer demographic and transaction history. You plan to use ensemble methods to improve performance.

- Step-by-step approach:
  1. Understand the data and problem
      - Target: default (binary), so this is a classification problem.
      - Features: demographics (age, income, employment), transactions (balances, payments), credit history, behavioral signals.
      - Check class balance (defaults may be the minority class).
  2. Preprocessing & feature engineering
      - Data cleaning: missing values, outliers.
      - Feature engineering: calculate credit utilization ratio, transaction frequency, trend features (last 3 months), aggregation features.
      - Encode categorical variables (one-hot, or target encoding if there are many categories).
      - Scale numeric features if using distance-based learners.
  3. Choose between Bagging or Boosting
      - If model variance is high (overfitting) and the base learner is unstable (e.g., deep trees), Bagging / Random Forest can help reduce variance.
      - If bias is high (underfitting) and you need high predictive accuracy, Boosting (XGBoost / LightGBM / CatBoost) is often better, especially with tabular structured data.
      - For loan default, boosting methods (LightGBM / XGBoost / CatBoost) are often chosen in practice for high accuracy, but RandomForest is a strong baseline and more tolerant of noise.
  4. Handle class imbalance
      - Techniques:
        - Resampling: SMOTE (synthetic oversampling) or ADASYN, applied only to the training folds so that validation data stays untouched.
        - Use class_weight='balanced' in models that support it.
        - Use evaluation metrics robust to imbalance: AUC-ROC, Precision-Recall AUC, F1, recall at specific thresholds.
      - Prefer combining methods (e.g., boosting + class weighting) if needed.
  5. Select base models
      - For Boosting: LightGBM / XGBoost / CatBoost.
      - For Bagging: RandomForest or Bagging with a DecisionTree base.
      - Also try stacking: combine multiple diverse models (e.g., LogisticRegression, RandomForest, XGBoost) and train a meta-learner (often a simple model like Logistic Regression) on their predictions.
  6. Hyperparameter tuning
      - Use GridSearchCV / RandomizedSearchCV or specialized libraries (Optuna) with cross-validation (e.g., stratified K-fold).
      - For time-sensitive data, use TimeSeriesSplit or another time-aware validation scheme.
  7. Evaluate with cross-validation
      - Use Stratified K-Fold CV to preserve the class ratio.
      - Track metrics: ROC-AUC, Precision-Recall AUC, F1, and the confusion matrix at the chosen threshold.
      - Use the OOB score for RandomForest as an additional check.
  8. Address overfitting
      - Regularize models (learning rate, max_depth, min_child_weight).
      - Use early stopping for boosting (monitor validation AUC).
      - Use a simpler model as a baseline and apply feature selection to remove noisy features.
  9. Model interpretability & fairness
      - Use SHAP or LIME for local/global explainability, which is important in finance to justify decisions.
      - Check for bias across demographic groups (fairness audits).
  10. Deployment & monitoring
      - Deploy the model as a service; implement prediction pipelines.
      - Monitor performance drift (population shift, concept drift) and retrain periodically.
      - Keep a human in the loop: manual review for high-impact decisions.
- Why ensemble learning improves decision-making (real-world):
   - Higher predictive accuracy: Boosting often yields state-of-the-art performance on tabular data.
   - Robustness: Bagging reduces variance; ensemble is less sensitive to noise/outliers.
   - Better generalization: Combining multiple learners reduces the chance of a single-model failure.
   - Explainability options: Techniques like SHAP work on ensembles, allowing feature-level explanations necessary in regulated domains like finance.
   - Risk control: Better predictions help reduce credit risk and losses; accurate ranking of applicants helps prioritize manual review.
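
The evaluation steps above (stratified cross-validation, imbalance-aware metrics, class weighting) can be sketched as follows. Everything here is an assumption for illustration: the synthetic data from `make_classification` merely stands in for real loan records with roughly 10% defaults, scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost / LightGBM / CatBoost to avoid extra dependencies, and the hyperparameters are untuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for loan data (class 1 = default, about 10%)
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)

# Stratified folds keep the default rate roughly constant in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "random_forest": RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=0
    ),
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
    ),
}

results = {}
for name, model in models.items():
    cv_out = cross_validate(
        model, X, y, cv=cv, scoring=["roc_auc", "average_precision"]
    )
    results[name] = {
        "roc_auc": cv_out["test_roc_auc"].mean(),
        "pr_auc": cv_out["test_average_precision"].mean(),
    }
    print(f"{name}: ROC-AUC = {results[name]['roc_auc']:.4f}, "
          f"PR-AUC = {results[name]['pr_auc']:.4f}")
```

On real data, the same loop would wrap a full preprocessing pipeline so that encoding, scaling, and any resampling are fitted inside each training fold only.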

In [None]:
# Solution 06

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Feature importances
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Top 5 features:")
print(importances.head(5))

Top 5 features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


In [None]:
# Solution 07

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Single decision tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_acc = accuracy_score(y_test, dt_pred)

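# Bagging with decision trees
# The base tree is passed positionally because the keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2 (old name removed in 1.4)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)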
bag.fit(X_train, y_train)
bag_pred = bag.predict(X_test)
bag_acc = accuracy_score(y_test, bag_pred)

print(f"Decision Tree accuracy: {dt_acc:.4f}")
print(f"Bagging (DecisionTree base) accuracy: {bag_acc:.4f}")

In [8]:
# Solution 08

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 3, 5, 10]
}

rf = RandomForestClassifier(random_state=42)
grid = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best CV score:", grid.best_score_)

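# Evaluate the best configuration on a hold-out split.
# Caveat: grid.fit(X, y) above already used every sample to pick the hyperparameters,
# so this accuracy is optimistic; splitting before the search would avoid that.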
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
best_rf = grid.best_estimator_
best_rf.fit(X_train, y_train)
print("Final holdout accuracy:", accuracy_score(y_test, best_rf.predict(X_test)))

Best parameters: {'max_depth': None, 'n_estimators': 50}
Best CV score: 0.9666666666666668
Final holdout accuracy: 0.9666666666666667


In [9]:
# Solution 09

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

cal = fetch_california_housing()
X = cal.data
y = cal.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

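# Bagging with DecisionTreeRegressor
# The base tree is passed positionally because the keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2 (old name removed in 1.4)
bag_reg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=42)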
bag_reg.fit(X_train, y_train)
pred_bag = bag_reg.predict(X_test)
mse_bag = mean_squared_error(y_test, pred_bag)

# RandomForestRegressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, pred_rf)

print(f"BaggingRegressor MSE: {mse_bag:.4f}")
print(f"RandomForestRegressor MSE: {mse_rf:.4f}")