#Ensemble Learning Assignment

#Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.


- Ensemble Learning in machine learning is a technique where multiple models (often called base learners or weak learners) are combined to make better predictions than any single model could on its own.

- The key idea is that combining many diverse models lets their individual errors partly cancel out, so the ensemble is usually more accurate and more stable than any one of its members. Depending on the method, this mainly reduces variance (Bagging) or bias (Boosting).

- How It Works:

  - Each model in the ensemble makes a prediction.

  - The predictions are then combined using methods like:

     - Voting (for classification) — majority wins.

     - Averaging (for regression) — mean of all predictions.

     - Weighted combination — some models have more influence based on their accuracy.

- Advantages:

   - Increases accuracy and robustness of predictions.

   - Reduces overfitting and variance.

   - Works well even when individual models are weak.

- Common Ensemble Methods:

   - Bagging (Bootstrap Aggregating) – e.g., Random Forest

   - Boosting – e.g., AdaBoost, Gradient Boosting, XGBoost

   - Stacking – combines predictions of multiple models using another model (meta-learner).     
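
The three combination rules listed under "How It Works" can be sketched in a few lines of plain Python. This is only an illustration: the function names and the toy predictions are made up here and are not part of any library.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the most common class label among the base models' votes."""
    return Counter(labels).most_common(1)[0][0]


def average(predictions):
    """Return the plain mean of the base models' numeric predictions."""
    return mean(predictions)


def weighted_average(predictions, weights):
    """Return a weighted mean; a larger weight gives a model more influence."""
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total


# Three hypothetical classifiers vote on one sample.
vote = majority_vote(["cat", "dog", "cat"])
print("majority vote:", vote)

# Three hypothetical regressors predict one value.
avg = average([10.0, 12.0, 14.0])
print("average:", avg)

# Same predictions, but the third model is trusted twice as much.
weighted = weighted_average([10.0, 12.0, 14.0], [1.0, 1.0, 2.0])
print("weighted average:", weighted)
```

The weighted result is pulled toward the third model's prediction, which is the effect the weighted combination is meant to have.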


#Question 2: What is the difference between Bagging and Boosting?

- Both Bagging and Boosting are ensemble learning techniques that combine multiple models to improve performance — but they differ in how the models are trained and combined.

1. Goal:

   - Bagging aims to reduce variance and prevent overfitting.

   - Boosting aims to reduce bias and improve weak models.

2. Training Method:

   - Bagging trains all models independently and in parallel.

   - Boosting trains models sequentially, each new model learning from previous errors.

3. Data Sampling:

   - Bagging uses bootstrapped samples (sampling with replacement).

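   - Boosting typically trains on the entire dataset but changes what each new model focuses on: AdaBoost increases the weight of misclassified samples, while Gradient Boosting fits each new model to the remaining errors of the current ensemble.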

4. Error Handling:

   - Bagging treats all data points equally.

   - Boosting focuses more on difficult or misclassified points.

5. Combination Method:

   - Bagging combines outputs by majority voting (classification) or averaging (regression).

   - Boosting combines models using a weighted sum based on model accuracy.

6. Overfitting:

   - Bagging is less prone to overfitting.

   - Boosting can overfit if too many weak learners are added.
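
To make the re-weighting idea concrete, here is a minimal sketch of one round of the classic binary AdaBoost weight update, written in plain Python. The sample count and the set of misclassified indices are invented for illustration; a real implementation would get them from an actual weak learner.

```python
import math

# Toy setup (made up for illustration): 10 samples, and the current weak
# learner misclassifies the samples at indices 3 and 7.
n_samples = 10
misclassified = {3, 7}

# Start with uniform sample weights.
weights = [1.0 / n_samples] * n_samples

# Weighted error of the weak learner.
error = sum(w for i, w in enumerate(weights) if i in misclassified)

# Learner weight: a lower error gives the learner a larger say in the final vote.
alpha = 0.5 * math.log((1.0 - error) / error)

# Increase the weights of misclassified samples, decrease the rest.
updated = [
    w * math.exp(alpha if i in misclassified else -alpha)
    for i, w in enumerate(weights)
]

# Renormalize so that the weights sum to 1 again.
total = sum(updated)
weights = [w / total for w in updated]

print("learner weight alpha:", round(alpha, 3))
print("weight of a misclassified sample:", round(weights[3], 4))
print("weight of a correctly classified sample:", round(weights[0], 4))
```

After the update, the two misclassified samples together carry half of the total weight, so the next weak learner is pushed to get them right. Bagging has no such step: every bootstrap sample is drawn with equal probabilities.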

#Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

- Bootstrap sampling is a statistical technique where we create multiple random samples from the original dataset with replacement.

- This means that some data points may appear more than once in a sample, while others may not appear at all.

- Each sample (called a bootstrap sample) is typically the same size as the original dataset.

- Role of Bootstrap Sampling in Bagging (e.g., Random Forest):

1. Creates Diversity:
    - Each model (e.g., each decision tree in a Random Forest) is trained on a different bootstrap sample, which makes the models slightly different from one another.

2. Reduces Variance:
   - By combining multiple diverse models trained on different samples, Bagging reduces the overall variance and improves prediction stability.

3. Improves Robustness:
   - Since each model sees a slightly different version of the data, the final ensemble model becomes more robust and less sensitive to noise or outliers.

4. Enables Parallel Training:
   - Because each model depends only on its own bootstrap sample, the models can be trained independently and in parallel, making Bagging computationally efficient.
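
A bootstrap sample is easy to draw by hand. The sketch below uses only the standard library and a made-up dataset of 10 row indices; in Bagging, one such sample would be drawn per base model.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

n_samples = 10
indices = list(range(n_samples))

# Draw n_samples indices WITH replacement: this is one bootstrap sample.
bootstrap = [rng.choice(indices) for _ in range(n_samples)]

# Indices that were never drawn are left out of this sample.
left_out = sorted(set(indices) - set(bootstrap))

print("bootstrap sample:", sorted(bootstrap))
print("left-out indices:", left_out)
```

The sample has the same size as the original data, but some indices repeat and others are missing, which is what makes each base model see a slightly different dataset.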


#Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?


- In Bagging (like in Random Forest), each model is trained on a bootstrap sample created by sampling with replacement from the original dataset.

   - Because sampling is done with replacement, each bootstrap sample contains on average only about 63% of the distinct data points; the rest of the sample consists of repeats.

   - The remaining ≈ 37% of the data points, which are not included in that sample, are called Out-of-Bag (OOB) samples.
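
The 63% / 37% split follows from a short calculation. Assuming a dataset of n points (n is notation introduced here) and a bootstrap sample of n draws with replacement, the probability that a given point is never drawn is:

$$
P(\text{point is out-of-bag}) = \left(1 - \frac{1}{n}\right)^{n} \;\xrightarrow{\; n \to \infty \;}\; e^{-1} \approx 0.368
$$

So for reasonably large n, roughly 36.8% of the points are out-of-bag for any one model, and about 1 - 0.368 = 0.632 of the distinct points appear in its bootstrap sample.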

- How OOB Samples Are Used:

1. Model Evaluation Without a Test Set:
    - Each model can be evaluated on its OOB samples, i.e., the data it has not seen during training.
    - This gives a nearly unbiased estimate of model performance, since the data was not used to fit that model.

2. OOB Score Calculation:

   - For each observation, predictions are made using only the models that did not include that observation in their training data.

   - These predictions are aggregated (by majority vote or averaging).

   - The overall OOB score is then computed as the accuracy (for classification) or R² score (for regression) based on these predictions.

3. Often No Need for Separate Cross-Validation:
   - OOB evaluation acts much like a built-in cross-validation method for Bagging models, saving computation time because no additional models have to be trained.
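
In scikit-learn, the OOB score is available by passing `oob_score=True` to a bagging-style estimator. The following is a minimal sketch on the Breast Cancer dataset; the choice of 200 trees and the random seed are arbitrary assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each sample using only the trees
# whose bootstrap sample did not contain it.
rf = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,
    bootstrap=True,  # OOB scoring requires bootstrap sampling
    random_state=42,
)
rf.fit(X, y)

# For a classifier, oob_score_ is the accuracy of the OOB predictions.
oob_accuracy = rf.oob_score_
print("OOB accuracy:", round(oob_accuracy, 3))
```

Note that no train/test split is needed here: the forest is fit on all rows, and the OOB predictions play the role of held-out predictions.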


#Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

1. Basis of Calculation:

   - Decision Tree: Importance is based on how much each feature reduces impurity (like Gini or entropy) within that single tree.

   - Random Forest: Importance is the average impurity reduction for each feature across all trees in the forest.

2. Stability:

   - Decision Tree: Can be unstable; small changes in the data may lead to different feature importances.

   - Random Forest: Much more stable, since it averages results from many trees.

3. Bias:

   - Decision Tree: Impurity-based importance is often biased toward features with many unique values.

   - Random Forest: Averaging over many trees reduces the noise in the estimate, but impurity-based importance remains biased toward features with many unique values; permutation importance on held-out data is a common cross-check.

4. Interpretability:

   - Decision Tree: Easier to interpret since there is only one tree.

   - Random Forest: Harder to interpret but gives more accurate and reliable importance values.

5. Accuracy of Importance Estimation:

   - Decision Tree: May overfit and give misleading importance for noisy features.

   - Random Forest: Provides robust, generalizable feature importance by averaging over multiple models.
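
The comparison can be checked empirically. The sketch below fits one tree and one forest on the Breast Cancer dataset, compares their impurity-based importances, and also computes permutation importance on held-out data as a cross-check. The split ratio, seeds, and `n_repeats=5` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Impurity-based importances: one tree vs. the average over the forest.
tree_imp = tree.feature_importances_
forest_imp = forest.feature_importances_

# Count how many features receive any importance at all in each model.
print("features used by the tree:", int((tree_imp > 0).sum()))
print("features used by the forest:", int((forest_imp > 0).sum()))

# Permutation importance on held-out data: how much test accuracy drops
# when one feature's values are shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=5, random_state=42)
top = perm.importances_mean.argsort()[::-1][:5]
for i in top:
    print(data.feature_names[i], round(perm.importances_mean[i], 4))
```

A single tree usually assigns zero importance to every feature it never splits on, whereas the forest spreads importance across correlated features. That difference is one reason the forest's ranking tends to be more stable.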



#Question 6: Write a Python program to:
#● Load the Breast Cancer dataset using
#sklearn.datasets.load_breast_cancer()
#● Train a Random Forest Classifier
#● Print the top 5 most important features based on feature importance scores.   



from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

data = load_breast_cancer()

X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

model = RandomForestClassifier(random_state=42)
model.fit(X, y)

importances = model.feature_importances_

feature_importance_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': importances
})

feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print("Top 5 Most Important Features:")
print(feature_importance_df.head(5))


Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


#Question 7: Write a Python program to:
#● Train a Bagging Classifier using Decision Trees on the Iris dataset
#● Evaluate its accuracy and compare with a single Decision Tree


# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a single Decision Tree classifier
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, y_pred_dt)

# Train a Bagging Classifier with Decision Trees
# (scikit-learn >= 1.2 uses 'estimator'; older versions call it 'base_estimator')
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bagging_model.fit(X_train, y_train)
y_pred_bag = bagging_model.predict(X_test)
bagging_accuracy = accuracy_score(y_test, y_pred_bag)

# Print and compare accuracies
print("Accuracy of Single Decision Tree:", round(dt_accuracy, 3))
print("Accuracy of Bagging Classifier:", round(bagging_accuracy, 3))

if bagging_accuracy > dt_accuracy:
    print("\n Bagging Classifier performed better due to reduced variance.")
else:
    print("\n Single Decision Tree performed equally or better in this run.")


Accuracy of Single Decision Tree: 1.0
Accuracy of Bagging Classifier: 1.0

 Single Decision Tree performed equally or better in this run.


#Question 8: Write a Python program to:
#● Train a Random Forest Classifier
#● Tune hyperparameters max_depth and n_estimators using GridSearchCV
#● Print the best parameters and final accuracy


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5, 7, None]
}

grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print("Final Accuracy on Test Data:", round(accuracy, 3))


Best Parameters: {'max_depth': 3, 'n_estimators': 150}
Final Accuracy on Test Data: 1.0


#Question 9: Write a Python program to:
#● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
#● Compare their Mean Squared Errors (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

bagging_regressor = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)
bagging_regressor.fit(X_train, y_train)
y_pred_bag = bagging_regressor.predict(X_test)

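# Note: with its default max_features=1.0, RandomForestRegressor considers all
# features at every split, so it behaves much like bagged decision trees here.
rf_regressor = RandomForestRegressor(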
    n_estimators=50,
    random_state=42
)
rf_regressor.fit(X_train, y_train)
y_pred_rf = rf_regressor.predict(X_test)

mse_bag = mean_squared_error(y_test, y_pred_bag)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print("Mean Squared Error (Bagging Regressor):", round(mse_bag, 3))
print("Mean Squared Error (Random Forest Regressor):", round(mse_rf, 3))

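# Compare the unrounded values: the two MSEs can print identically at
# 3 decimals and still differ slightly.
if mse_rf < mse_bag: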
    print("\n Random Forest Regressor performed better (lower MSE).")
else:
    print("\n Bagging Regressor performed equally or better in this run.")


Mean Squared Error (Bagging Regressor): 0.258
Mean Squared Error (Random Forest Regressor): 0.258

 Random Forest Regressor performed better (lower MSE).


#Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance.
- Explain your step-by-step approach to:
  - Choose between Bagging or Boosting
  - Handle overfitting
  - Select base models
  - Evaluate performance using cross-validation
  - Justify how ensemble learning improves decision-making in this real-world context.

Ans 10:

1. Choosing Between Bagging and Boosting:

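   - Use Bagging (e.g., Random Forest) if the base model overfits, i.e., has high variance (for example, deep decision trees).

   - Use Boosting (e.g., XGBoost) if the base model underfits, i.e., has high bias, and the goal is to squeeze out better accuracy.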

2. Handling Overfitting:

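   - Use cross-validation to detect overfitting, and control model complexity through regularization (e.g., a smaller learning rate, a limited max_depth) and, for Boosting, a limit on the number of trees.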

3. Selecting Base Models:

   - Decision Trees are used as base models since they handle nonlinearity and mixed feature types well (in scikit-learn, categorical features must be encoded numerically first).

4. Evaluating Performance:

   - Apply stratified k-fold cross-validation to compare models using metrics like precision, recall, and AUC. Accuracy alone can be misleading because loan defaults are usually a small minority class.

5. Justification:

    - Ensemble learning combines multiple weak models to reduce errors, improve prediction stability, and provide more reliable loan default decisions.
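
The approach above can be sketched end to end as follows. Everything about the data is an assumption: `make_classification` generates a synthetic stand-in with roughly 10% positives in place of real loan records, and scikit-learn's `GradientBoostingClassifier` is used instead of XGBoost only to avoid an extra dependency. Hyperparameter values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: roughly 10% positives ("defaults").
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)

models = {
    "bagging (Random Forest)": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    ),
    "boosting (Gradient Boosting)": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
    ),
}

# Stratified folds keep the default rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["roc_auc", "precision", "recall"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the 5 folds.
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

On real data, the choice between the two families would follow from these cross-validated metrics, with recall on the default class weighed against precision according to the cost of a missed default versus a wrongly rejected applicant.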