
# **Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.**


# **Answer 1:**
# Ensemble Learning Overview
- Definition : Ensemble Learning is a machine learning technique in which multiple models (often called "base learners", or "weak learners" when each one is only slightly better than chance) are combined into a single model (the "ensemble") that typically performs better than any individual member.
- Key Idea : The main idea behind ensemble methods is that by combining the predictions of several models, you can often get better performance than using a single model. This leverages the strengths of different models or different configurations of the same model type.

Common Ensemble Techniques
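- Bagging : Bootstrap aggregating (as in Random Forests) trains models on bootstrap samples and combines their predictions, which mainly reduces variance.
- Boosting : Sequentially adds models that correct the errors of previous models (as in AdaBoost and Gradient Boosting), which mainly reduces bias.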
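
The key idea can be illustrated with a small calculation. The sketch below assumes an idealized setting that real ensembles only approximate: a binary task, an odd number of models, and models whose errors are fully independent. The function name `majority_vote_accuracy` is hypothetical and used only for illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct."""
    needed = n_models // 2 + 1  # votes required for a strict majority
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Each model alone is right 60% of the time (assumed value).
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the majority vote improves as models are added. In practice models trained on the same data are correlated, so the gain is smaller, which is why bagging and boosting both try to make the individual models differ.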

# **Question 2: What is the difference between Bagging and Boosting?**

# **Answer 2:**
# Bagging vs. Boosting
- Bagging :
    - Trains models independently (so they can run in parallel) on different bootstrap samples of the data.
    - Reduces variance.
    - Example: Random Forest.

- Boosting :
    - Trains models sequentially, focusing on previous errors.
    - Reduces bias.
    - Examples: AdaBoost, Gradient Boosting.
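
To make "focusing on previous errors" concrete, here is a minimal sketch of one reweighting round in classic binary AdaBoost. The five training points and the pattern of mistakes are hypothetical toy values, not taken from any dataset in this notebook.

```python
import math

# Hypothetical toy setup: 5 training points with equal initial weights,
# and a weak learner that misclassifies only the point at index 2.
weights = [0.2] * 5
is_correct = [True, True, False, True, True]

# Weighted error of the weak learner (sum of weights of its mistakes).
error = sum(w for w, ok in zip(weights, is_correct) if not ok)

# Vote strength of this learner (classic binary AdaBoost formula).
alpha = 0.5 * math.log((1 - error) / error)

# Up-weight the mistakes, down-weight the correct points, then renormalize.
raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
total = sum(raw)
new_weights = [w / total for w in raw]

print("alpha:", round(alpha, 4))
print("new weights:", [round(w, 4) for w in new_weights])
```

After the update the misclassified point carries half of the total weight, so the next learner in the sequence is pushed to get it right. Bagging has no such step: every model sees an independently drawn sample.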

# **Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

# **Answer 3:**
# Bootstrap Sampling Overview
- Definition : Bootstrap sampling is a technique where you create new training sets by sampling with replacement from the original dataset. Each bootstrap sample is typically the same size as the original dataset.
- Result : Some data points are repeated in a sample, while others are left out (out-of-bag data). For a large dataset, each bootstrap sample contains about 63.2% of the distinct original points, leaving roughly 36.8% out-of-bag.
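
A minimal sketch of drawing one bootstrap sample, using only the standard library. The dataset is represented by row indices, and the size `n = 10_000` and the seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 10_000
indices = list(range(n))  # stand-in for the rows of a dataset

# Draw n row indices with replacement: this is one bootstrap sample.
bootstrap = rng.choices(indices, k=n)

# Rows drawn at least once are "in-bag"; the rest are out-of-bag (OOB).
in_bag = set(bootstrap)
oob = [i for i in indices if i not in in_bag]

unique_fraction = len(in_bag) / n
oob_fraction = len(oob) / n
print(f"distinct rows in sample: {unique_fraction:.3f}")
print(f"out-of-bag rows:         {oob_fraction:.3f}")
```

The out-of-bag fraction should land near 1/e (about 0.368), because each row is missed by a single draw with probability 1 - 1/n and there are n draws.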

Role in Bagging (like Random Forest)
- In Bagging : Each tree in a Random Forest is trained on a bootstrap sample of the data.
- Benefits :
    - Reduces overfitting by averaging predictions across trees trained on different samples.
    - Allows estimation of out-of-bag error for validation.


# **Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**

# **Answer 4:**
# Out-of-Bag (OOB) Samples
- Definition : In Bagging methods like Random Forest, for each tree, OOB samples are the data points that were not included in the bootstrap sample used to train that tree.
- Usage : OOB samples can be used to estimate the model's performance without needing a separate validation set.

OOB Score for Evaluation
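- OOB Score : For each data point, a prediction is made using only the trees for which that point was OOB. These predictions are aggregated (majority vote for classification, averaging for regression) and compared with the true targets. The resulting metric (accuracy for classifiers and R² for regressors in scikit-learn) is the OOB score.
- Usefulness : Provides an approximately unbiased estimate of generalization performance using only the training data, because each point is scored by trees that never saw it during training.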
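
A short sketch of how the OOB score can be obtained in scikit-learn, reusing the Breast Cancer dataset from later in this notebook. The choice of 200 trees is an assumption made so that every training point is very likely to be out-of-bag for at least one tree.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each training point using only
# the trees whose bootstrap sample did not contain that point.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

oob_accuracy = rf.oob_score_  # accuracy computed from OOB predictions
print(f"OOB accuracy: {oob_accuracy:.4f}")

# Per-sample class probabilities estimated from OOB trees only.
print("OOB decision function shape:", rf.oob_decision_function_.shape)
```

No train/test split is needed here, which is useful when data is scarce. A held-out test set is still a sensible final check before reporting results.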

# **Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**

# **Answer 5:**
# Feature Importance Analysis: Decision Tree vs. Random Forest
- Single Decision Tree :
    - Feature importance can be calculated based on how much each feature contributes to reducing impurity (like Gini impurity).
    - Can be less reliable due to overfitting or variance in a single tree.

- Random Forest :
    - Feature importance is often calculated by averaging the importance across all trees.
    - More robust and reliable due to averaging over many trees, reducing variance.

Key Difference
- Stability : Random Forest feature importances are generally more stable and trustworthy because they're based on an ensemble of trees.
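
The averaging described above can be checked directly. This sketch assumes scikit-learn's impurity-based importances and the Breast Cancer dataset; it collects the importance vector of every tree in a forest, then compares their mean with the forest's own `feature_importances_`. The spread across trees gives a rough sense of how much a single tree's ranking can vary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(data.data, data.target)

# One importance vector per tree: shape (n_trees, n_features).
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])

mean_importance = per_tree.mean(axis=0)  # the forest-level average
std_importance = per_tree.std(axis=0)    # tree-to-tree variability

# Show the top 5 features with their variability across trees.
top = np.argsort(mean_importance)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<22} "
          f"mean={mean_importance[i]:.4f}  std={std_importance[i]:.4f}")

# The manual average should match what the forest reports.
print("matches rf.feature_importances_:",
      np.allclose(mean_importance, rf.feature_importances_))
```

A large standard deviation relative to the mean indicates that individual trees disagree about a feature, which is the instability a single Decision Tree is exposed to.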

# **Question 6: Write a Python program to:**
# **● Load the Breast Cancer dataset using**
# **sklearn.datasets.load_breast_cancer()**
# **● Train a Random Forest Classifier**
# **● Print the top 5 most important features based on feature importance scores.**

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Create and train the Random Forest Classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X, y)

# Get feature importance scores
importances = clf.feature_importances_

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print top 5 features
print("Top 5 Most Important Features:")
print(feature_importance_df.head(5))

Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


# **Question 7: Write a Python program to:**
# **● Train a Bagging Classifier using Decision Trees on the Iris dataset**
# **● Evaluate its accuracy and compare with a single Decision Tree**

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import sklearn
import sys

# Print versions for debugging
print(f"Python version: {sys.version}")
print(f"Scikit-learn version: {sklearn.__version__}")

# 1. Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
acc_dt = accuracy_score(y_test, y_pred_dt)

# 4. Train a Bagging Classifier with Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # `estimator` replaces the older `base_estimator` parameter (scikit-learn >= 1.2)
    n_estimators=100,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
acc_bag = accuracy_score(y_test, y_pred_bag)

# 5. Print results
print("Accuracy of Single Decision Tree: {:.4f}".format(acc_dt))
print("Accuracy of Bagging Classifier : {:.4f}".format(acc_bag))

Python version: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
Scikit-learn version: 1.6.1
Accuracy of Single Decision Tree: 0.9333
Accuracy of Bagging Classifier : 0.9333


# **Question 8: Write a Python program to:**
# **● Train a Random Forest Classifier**
# **● Tune hyperparameters max_depth and n_estimators using GridSearchCV**
# **● Print the best parameters and final accuracy**

In [None]:
# Importing libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define Random Forest model
rf = RandomForestClassifier(random_state=42)

# Define parameter grid for tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20]
}

# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, scoring='accuracy', n_jobs=-1)

grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Evaluate on test data
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Final Accuracy on Test Set:", accuracy)

Best Parameters: {'max_depth': None, 'n_estimators': 200}
Final Accuracy on Test Set: 0.9649122807017544


# **Question 9: Write a Python program to:**
# **● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset**
# **● Compare their Mean Squared Errors (MSE)**

In [None]:
# Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging Regressor (the default base estimator is a decision tree regressor)
bagging = BaggingRegressor(n_estimators=100, random_state=42)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Calculate MSE
mse_bagging = mean_squared_error(y_test, y_pred_bag)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print results
print("Mean Squared Error (Bagging Regressor):", mse_bagging)
print("Mean Squared Error (Random Forest Regressor):", mse_rf)

# Compare
if mse_bagging < mse_rf:
    print("Bagging Regressor performed better.")
else:
    print("Random Forest Regressor performed better.")

Mean Squared Error (Bagging Regressor): 0.25592438609899626
Mean Squared Error (Random Forest Regressor): 0.2553684927247781
Random Forest Regressor performed better.


# **Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.**
# **You decide to use ensemble techniques to increase model performance.**
# **Explain your step-by-step approach to:**
# **● Choose between Bagging or Boosting**
# **● Handle overfitting**
# **● Select base models**
# **● Evaluate performance using cross-validation**
# **● Justify how ensemble learning improves decision-making in this real-world context.**


# **Answer 10:**
Step-by-Step Approach for Predicting Loan Default Using Ensemble Techniques
● Choose between Bagging or Boosting
- For predicting loan default, *Boosting* might be more suitable because it sequentially corrects errors, focusing on harder-to-classify instances (like customers on the edge of defaulting). Boosting techniques like Gradient Boosting or XGBoost are effective in handling complex patterns in financial data and reducing bias.

● Handle Overfitting
- Regularization : Use the regularization parameters available in boosting algorithms like XGBoost (gamma, lambda, alpha) to prevent overfitting.
- Early Stopping : Implement early stopping based on a validation set to stop training when performance stops improving.
- Cross-Validation : Use cross-validation to assess how the model generalizes to unseen data.

● Select Base Models
- For Boosting: Decision trees are commonly used as base learners because they are weak learners that Boosting can improve upon.
- Choose shallow trees (for example, max_depth between 3 and 5) to keep individual trees weak while letting Boosting build a strong ensemble.

● Evaluate Performance using Cross-Validation
- Use k-fold cross-validation to get a robust estimate of model performance on unseen data.
- Evaluate using metrics relevant for loan default prediction, such as AUC-ROC, precision, recall, or F1-score, which account for class imbalance (since defaults are typically rare).
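
The steps above can be sketched end to end. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification` at roughly 10% positives as a stand-in for real loan records), scikit-learn's `GradientBoostingClassifier` is used in place of XGBoost, and all hyperparameter values are placeholders rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for loan data (about 10% "default" labels).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

model = GradientBoostingClassifier(
    n_estimators=200,         # upper bound; early stopping may use fewer
    learning_rate=0.1,
    max_depth=3,              # shallow trees as weak base learners
    subsample=0.8,            # row subsampling adds regularization
    validation_fraction=0.1,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop when the hold-out score stalls
    random_state=42,
)

# Stratified folds keep the default rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("AUC-ROC per fold:", [round(s, 3) for s in auc_scores])
print("Mean AUC-ROC:", round(auc_scores.mean(), 3))
```

AUC-ROC is used as the scoring metric because plain accuracy can look high on imbalanced data even when few defaults are caught. With real data, the same structure applies after preprocessing the demographic and transaction features.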

● Justify How Ensemble Learning Improves Decision-Making
- Improved Accuracy : Ensemble methods like Boosting often lead to higher accuracy by combining multiple models.
- Handling Complex Relationships : Ensemble techniques are effective in capturing complex interactions between customer demographics and transaction history.
- Better Risk Assessment : In loan default prediction, improved accuracy means better risk assessment, leading to smarter lending decisions, reduced losses, and optimized portfolio management.

Summary
In predicting loan default using ensemble techniques:
- Boosting is chosen for its ability to reduce bias.
- Overfitting is handled via regularization and early stopping.
- Weak learners like shallow decision trees are used.
- Cross-validation evaluates performance robustly.
- Ensemble learning improves decision-making by enhancing prediction accuracy for better risk management.