# Ensemble Learning

**Q1.** What is Ensemble Learning in machine learning? Explain the key idea
behind it.

- Ensemble Learning in machine learning is a technique where multiple models (often called "weak learners") are combined to create a stronger, more accurate, and more robust predictive model.
Instead of relying on a single model, ensemble learning leverages the strengths of many models to reduce errors and improve generalization.

- Key Idea Behind Ensemble Learning

 The central idea is:
“A group of weak models, when combined properly, can perform better than any of its individual members.”

 This works because:

1. Error Reduction – Different models tend to make different mistakes; combining them cancels out the errors that are not shared across models.

2. Variance Reduction – Some models (like decision trees) are highly sensitive to small data changes. Using an ensemble averages out these fluctuations.

3. Bias Reduction – Sequential methods such as boosting fit each new model to the errors of the current ensemble, which can reduce the bias of the individual learners.
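The error-reduction argument can be made concrete with a small calculation. The sketch below works under idealized assumptions that are not part of the answer above: every model is a binary classifier, each is correct with the same probability `p`, and their errors are independent. The function name `majority_vote_accuracy` is illustrative only.

```python
from math import comb


def majority_vote_accuracy(p: float, n_models: int) -> float:
    """Probability that a strict majority of n_models independent binary
    classifiers, each correct with probability p, gives the right answer.

    Assumes n_models is odd so that ties cannot occur.
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    needed = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Accuracy of the vote for a growing number of weak (60% accurate) models.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 4))
```

Under these assumptions the accuracy of the vote grows with the number of models whenever `p > 0.5`. Real models trained on the same data have correlated errors, so the gain in practice is smaller than this calculation suggests.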

**Q2.** What is the difference between Bagging and Boosting?

 - Bagging (Bootstrap Aggregating)

1. Trains models independently and in parallel.

2. Uses bootstrap sampling (random sampling with replacement).

3. Combines predictions using majority voting (classification) or averaging (regression).

4. Focuses on reducing variance (good for overfitting models).

5. Works best with high variance, low bias models like decision trees.

6. Less prone to overfitting than a single high-variance model.

7. Example: Random Forest.

- Boosting

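1. Trains models sequentially; each new model learns from the errors of the previous ones.

2. Typically uses the entire dataset, but emphasizes the samples the previous models got wrong: AdaBoost raises the weights of misclassified samples, while gradient boosting fits each new model to the remaining residual errors.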

3. Combines predictions using a weighted sum of models.

4. Focuses on reducing bias (good for underfitting models).

5. Works best with weak learners such as decision stumps (depth-1 trees) or other shallow trees.

6. More prone to overfitting if not tuned properly.

7. Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM.
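The reweighting that distinguishes Boosting from Bagging can be sketched in a few lines. The example below is a minimal version of one round of the classic binary AdaBoost update, given only which samples the current weak learner got wrong. The function name `adaboost_round` and the toy numbers are assumptions for illustration; gradient-boosting libraries such as XGBoost and LightGBM use a different, residual-based update.

```python
import math


def adaboost_round(weights, misclassified):
    """One AdaBoost-style reweighting step.

    weights       -- current sample weights (sum to 1)
    misclassified -- booleans, True where the current weak learner was wrong
    Returns (alpha, new_weights): the learner's vote weight and the
    renormalized sample weights for the next round.
    """
    error = sum(w for w, miss in zip(weights, misclassified) if miss)
    if not 0 < error < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1 - error) / error)  # larger for more accurate learners
    # Up-weight the mistakes, down-weight the correct predictions.
    raw = [w * math.exp(alpha if miss else -alpha)
           for w, miss in zip(weights, misclassified)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


weights = [0.1] * 10                         # ten samples, uniform start
misclassified = [True, True] + [False] * 8   # the learner got two samples wrong
alpha, new_weights = adaboost_round(weights, misclassified)
print("alpha:", round(alpha, 3))
print("new weights:", [round(w, 4) for w in new_weights])
```

After the update the two misclassified samples together carry half of the total weight, so the next learner cannot do well without handling them. The returned `alpha` is the weight this learner receives in the final weighted sum.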

**Q3.** What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?

- Bootstrap sampling is a statistical method where we create new datasets by randomly selecting data points **with replacement** from the original dataset. This means the same data point can appear multiple times in one sample, while some points may be left out. In Bagging methods like Random Forest, bootstrap sampling ensures that each model (tree) is trained on a slightly different dataset, creating diversity among models, reducing variance, and improving overall prediction accuracy.

- Role in Bagging (e.g., Random Forest)

1. Bootstrap Sampling Creates Diversity

 - Each model (tree) is trained on a different bootstrap sample of the data.

 - This ensures that not all models see the same data, reducing correlation among them.

2. Reduces Variance

 - Since trees are trained on slightly different data, their predictions vary.

 - Combining them (by averaging/voting) cancels out individual errors → lowers variance.

3. Improves Generalization

 - Models are less likely to overfit to the training data, since each tree only sees part of it.

 - The "wisdom of crowds" effect makes the final prediction more robust.

4. Out-of-Bag (OOB) Estimation

 - The ~37% of data not included in a bootstrap sample (called Out-of-Bag data) can be used as a built-in validation set to estimate accuracy without a separate test set.
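The roughly 63% / 37% split can be checked empirically. The sketch below draws one bootstrap sample of indices with Python's standard `random` module and counts how many distinct points end up in the bag; the helper name `bootstrap_indices`, the seed, and the dataset size are arbitrary choices for illustration. The expected out-of-bag share is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 for large n.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(42)   # fixed seed so the run is repeatable
n_samples = 10_000
in_bag = bootstrap_indices(n_samples, rng)

unique_in_bag = set(in_bag)                          # points seen at least once
out_of_bag = set(range(n_samples)) - unique_in_bag   # points never drawn

in_bag_fraction = len(unique_in_bag) / n_samples
oob_fraction = len(out_of_bag) / n_samples
print(f"distinct points in the bootstrap sample: {in_bag_fraction:.3f}")
print(f"out-of-bag points: {oob_fraction:.3f}")
```

Each tree in a bagged ensemble would be trained on the rows in `in_bag` (repeats included), while the rows in `out_of_bag` remain unseen by that tree.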

**Q4.** What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?

 - Out-of-Bag (OOB) Samples

  - In bootstrap sampling, each model (e.g., decision tree in Random Forest) is trained on a random sample of the data with replacement.

  - On average, about 63% of the distinct data points in the original dataset appear in the bootstrap sample used to train a tree (the sample has the same size as the dataset, but contains repeats).

  - The remaining ~37% of data that is not chosen for that tree is called the Out-of-Bag (OOB) samples.

 - How OOB Score is Used:

  - After a tree is trained, it can be tested on its OOB samples (since those data points were not used in training).

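  - For each sample, the predictions of only those trees for which it was OOB are aggregated (by voting or averaging) and compared with the true label.

  - The resulting accuracy is called the OOB Score. It acts as a built-in validation estimate, so no separate validation set is needed.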
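In scikit-learn, which the programs later in this document use, the OOB estimate is available by passing `oob_score=True` to the forest and reading the `oob_score_` attribute after fitting. The sketch below is one way to do that; the dataset and hyperparameters are arbitrary choices for illustration, not part of the original answer.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score every sample using only the
# trees whose bootstrap sample did not contain that sample.
forest = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,
    random_state=42,
)
forest.fit(X, y)   # no separate validation split is needed here

oob_accuracy = forest.oob_score_   # for a classifier this is an accuracy
print(f"OOB accuracy estimate: {oob_accuracy:.3f}")
```

Because every sample is scored only by trees that never saw it, the OOB accuracy is usually close to what cross-validation would report, at no extra training cost.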

**Q5.** Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.  

- Feature Importance in a Single Decision Tree

1. How it’s calculated:

  - At each split, the tree chooses the feature and threshold that give the largest impurity reduction (e.g., decrease in Gini impurity or entropy for classification, variance for regression).

  - Feature importance = sum of the impurity reductions from all splits on that feature, each weighted by the share of samples reaching the split, normalized so that all importances sum to 1.

2. Characteristics:

  - Importance depends heavily on how the tree is structured.

  - A feature used near the top of the tree usually appears more important.

  - Can be unstable: small changes in data can change the tree structure → importance scores shift.

- Feature Importance in a Random Forest

1. How it’s calculated:

  - Each tree in the forest computes its own feature importance (same method as above).

  - The final importance = average of all trees’ importances.

2. Characteristics:

  - More stable and reliable than a single tree because results are averaged across many trees.

  - Less biased toward features that dominate a single tree.

  - Provides a better global view of which features consistently matter for prediction.   

- Comparison (Tree vs. Forest)

1. Decision Tree:

  - Importance = impurity reduction in that single tree.

  - Can be unstable, biased by tree depth/structure.

2. Random Forest:

  - Importance = averaged impurity reductions across many trees.

  - More robust, less sensitive to noise, more reliable measure.  
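The "impurity reduction" that both methods build on can be computed by hand for a single split. The sketch below uses Gini impurity on a toy set of labels; the function names `gini` and `impurity_decrease` and the example labels are illustrative assumptions. A tree's importance for a feature accumulates these decreases over all splits that use the feature, and a forest averages the resulting per-tree importances.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Impurity reduction of one split: parent impurity minus the
    size-weighted impurity of the two child nodes."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A split that only partly separates the classes ...
weak_split = impurity_decrease(parent, [0, 0, 0, 1], [0, 1, 1, 1])
# ... versus one that separates them completely.
perfect_split = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])

print("weak split decrease:   ", weak_split)
print("perfect split decrease:", perfect_split)
```

The feature behind the perfect split earns a larger decrease, and therefore more importance. In a single tree that credit depends on which split happened to be chosen, which is why averaging over many trees gives a steadier ranking.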

In [1]:
#Q6. Write a Python program to:
# ● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
# ● Train a Random Forest Classifier
# ● Print the top 5 most important features based on feature importance scores.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

importances = clf.feature_importances_

feat_importances = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

feat_importances = feat_importances.sort_values(by='Importance', ascending=False)

print("Top 5 Most Important Features:")
print(feat_importances.head(5))



Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


In [4]:
# Q7. Write a Python program to:
# ● Train a Bagging Classifier using Decision Trees on the Iris dataset
# ● Evaluate its accuracy and compare with a single Decision Tree

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
import sklearn

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_acc = accuracy_score(y_test, dt_pred)

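# The `base_estimator` argument was renamed to `estimator` in scikit-learn 1.2.
# Compare the version numerically: as strings, "1.10" would sort below "1.2".
if tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (1, 2):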
    bagging = BaggingClassifier(
        estimator=DecisionTreeClassifier(random_state=42),
        n_estimators=50,
        random_state=42
    )
else:
    bagging = BaggingClassifier(
        base_estimator=DecisionTreeClassifier(random_state=42),
        n_estimators=50,
        random_state=42
    )

bagging.fit(X_train, y_train)
bagging_pred = bagging.predict(X_test)
bagging_acc = accuracy_score(y_test, bagging_pred)

print("Accuracy of Single Decision Tree:", dt_acc)
print("Accuracy of Bagging Classifier with Decision Trees:", bagging_acc)


Accuracy of Single Decision Tree: 1.0
Accuracy of Bagging Classifier with Decision Trees: 1.0


In [5]:
#Q8. Write a Python program to:
# ● Train a Random Forest Classifier
# ● Tune hyperparameters max_depth and n_estimators using GridSearchCV
# ● Print the best parameters and final accuracy

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rf = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20]
}

grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

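# With the default refit=True, best_estimator_ has already been retrained on
# the whole training set using the best parameter combination.
best_rf = grid_search.best_estimator_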

y_pred = best_rf.predict(X_test)

acc = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print("Final Test Accuracy:", acc)


Best Parameters: {'max_depth': None, 'n_estimators': 200}
Final Test Accuracy: 0.9707602339181286


In [6]:
#Q9. Write a Python program to:
# ● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
# ● Compare their Mean Squared Errors (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

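# Note: the `estimator` argument requires scikit-learn >= 1.2
# (older versions call it `base_estimator`).
bagging_reg = BaggingRegressor(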
    estimator=DecisionTreeRegressor(random_state=42),
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
bagging_reg.fit(X_train, y_train)
bagging_pred = bagging_reg.predict(X_test)
bagging_mse = mean_squared_error(y_test, bagging_pred)

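# By default RandomForestRegressor considers all features at every split, so
# it behaves much like the bagged trees above; pass e.g. max_features="sqrt"
# to add per-split feature subsampling.
rf_reg = RandomForestRegressor(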
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(X_train, y_train)
rf_pred = rf_reg.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

print("Mean Squared Error - Bagging Regressor:", bagging_mse)
print("Mean Squared Error - Random Forest Regressor:", rf_mse)


Mean Squared Error - Bagging Regressor: 0.25787382250585034
Mean Squared Error - Random Forest Regressor: 0.25772464361712627


**Q10.** You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.

You decide to use ensemble techniques to increase model performance.

Explain your step-by-step approach to:

● Choose between Bagging or Boosting

● Handle overfitting

● Select base models

● Evaluate performance using cross-validation

● Justify how ensemble learning improves decision-making in this real-world
context.


 - If I were working on predicting loan defaults, I’d approach it like this:

**Step 1: Choosing Bagging vs Boosting**
Since missing a defaulter is very costly for a bank, I’d lean towards Boosting methods (like XGBoost or LightGBM). Boosting focuses more on the difficult-to-predict customers and reduces bias, which is important here. Bagging is great for reducing variance, but in this case, catching those hard-to-detect defaults is more critical.

**Step 2: Handling Overfitting**
Boosting models can overfit if they grow too complex. To control this, I’d tune parameters like the learning rate, maximum tree depth, and number of estimators. I’d also use techniques like early stopping and feature selection to make sure the model doesn’t just memorize the training data.
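A minimal sketch of these controls, under stated assumptions: it uses scikit-learn's `GradientBoostingClassifier` rather than XGBoost or LightGBM, and a synthetic imbalanced dataset from `make_classification` as a stand-in for real loan data. All parameter values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for loan data (about 10% "defaults").
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of boosting rounds
    learning_rate=0.05,       # small steps: each tree contributes less
    max_depth=3,              # shallow trees keep each learner weak
    subsample=0.8,            # fit each tree on a random 80% of the rows
    validation_fraction=0.2,  # held out from the training data for early stopping
    n_iter_no_change=10,      # stop once 10 rounds bring no validation improvement
    random_state=42,
)
model.fit(X_train, y_train)

rounds_used = model.n_estimators_   # rounds actually fitted before stopping
test_accuracy = model.score(X_test, y_test)
print("boosting rounds actually fitted:", rounds_used)
print("test accuracy:", round(test_accuracy, 3))
```

If `rounds_used` is well below the `n_estimators` ceiling, early stopping ended training once the held-out score stopped improving. This is the behaviour that keeps the model from memorizing the training data.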

**Step 3: Selecting Base Models**
For both Bagging and Boosting, the usual base model is a decision tree. For Boosting, I’d use shallow decision trees (a depth-1 tree is called a stump) because they’re weak learners that work well when combined; for Bagging, I’d use deeper trees, since averaging reduces their variance. If I explore Stacking, I’d consider mixing logistic regression (for linear patterns) with tree-based models (for nonlinear patterns).

**Step 4: Evaluating Performance**
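Since defaults are usually rare compared to non-defaults, I’d use stratified k-fold cross-validation so that every fold keeps the same default rate as the full dataset. Accuracy alone isn’t enough here (a model that predicts “no default” for everyone would still score high), so I’d look at AUC-ROC, the precision-recall curve, and F1-score to properly judge performance.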
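The evaluation step could look roughly like the sketch below. It assumes scikit-learn, a synthetic imbalanced dataset in place of the real loan data, and `GradientBoostingClassifier` with default settings as the model under test; `average_precision` is used here as a one-number summary of the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for loan data (about 10% "defaults").
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Stratification keeps the share of positives nearly identical in every fold.
fold_positive_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]

scores = cross_validate(
    GradientBoostingClassifier(random_state=42),
    X, y, cv=cv,
    scoring=["roc_auc", "average_precision", "f1"],
)

print("positive rate per fold:", [round(r, 3) for r in fold_positive_rates])
for metric in ("roc_auc", "average_precision", "f1"):
    values = scores[f"test_{metric}"]
    print(f"{metric}: mean={values.mean():.3f} std={values.std():.3f}")
```

Reporting the spread across folds alongside the mean shows how stable the model is, which matters when a lending decision depends on it.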

**Step 5: Why Ensemble Learning Helps**
Using ensembles gives more reliable predictions than a single model. Boosting in particular makes the model focus on tough cases and improves detection of defaulters. Bagging mainly reduces variance and boosting mainly reduces bias, so a well-tuned ensemble gives the bank a model that generalizes better. In real terms, this leads to fewer risky loans slipping through, smarter lending decisions, and reduced financial loss.
