Question 1: What is Ensemble Learning in machine learning? Explain the key idea
behind it.

Ensemble learning is a machine learning technique that improves predictive performance by combining the outputs of multiple individual models, often called "base learners" (or "weak learners" when each is only slightly better than chance), into a single, more robust "strong learner". The key idea is that a group of diverse models, whose errors are at least partly uncorrelated, can produce more accurate, stable, and generalizable predictions than any single model operating alone.

Key Idea: Diversity and Aggregation
Diversity of Models:
The individual models in an ensemble should differ from each other, whether through different algorithms, different training data subsets, different feature subsets, or different hyperparameters and random seeds.
Aggregation:
The predictions from these individual models are then combined using a specific strategy, such as:
Voting: For classification tasks, models can vote on the most likely class, and the majority vote wins.
Averaging or Weighting: For regression or continuous output, predictions can be averaged, or models can be assigned weights based on their performance.
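Sequential Training (Boosting): Strictly a way of building the ensemble rather than of combining it. Models are trained one after another, with each new model focusing on correcting the mistakes made by its predecessors, and their outputs are then combined, typically as a weighted vote or sum.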
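
The voting and averaging strategies listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the prediction arrays, the weights, and the helper name majority_vote are all made up for the example.

```python
import numpy as np

# Hypothetical class predictions from three classifiers for five samples
# (rows = models, columns = samples); the values are made up for illustration.
class_preds = np.array([
    [0, 1, 1, 0, 2],
    [0, 1, 0, 0, 2],
    [1, 1, 1, 0, 1],
])

def majority_vote(preds):
    """Return the most frequent label per sample (ties go to the smallest label)."""
    n_classes = preds.max() + 1
    # Count the votes each class receives for every sample (one column at a time).
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return counts.argmax(axis=0)

voted = majority_vote(class_preds)
print("Majority vote:", voted)  # [0 1 1 0 2]

# Hypothetical regression predictions from three models for two samples.
reg_preds = np.array([
    [2.0, 3.0],
    [4.0, 3.0],
    [3.0, 6.0],
])

# Simple averaging: every model counts equally.
simple_avg = reg_preds.mean(axis=0)

# Weighted averaging: assumed weights, e.g. derived from validation performance.
weights = np.array([0.5, 0.25, 0.25])
weighted_avg = np.average(reg_preds, axis=0, weights=weights)

print("Simple average:", simple_avg)      # [3. 4.]
print("Weighted average:", weighted_avg)  # [2.75 3.75]
```

In practice, scikit-learn's VotingClassifier and VotingRegressor wrap the same ideas around fitted estimators.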

Question 2: What is the difference between Bagging and Boosting?

Bagging (Bootstrap Aggregating) trains multiple models independently, and therefore potentially in parallel, each on a bootstrap sample of the training data (a random sample drawn with replacement). Aggregating their predictions with equal weight mainly reduces variance, which helps against overfitting. Boosting, conversely, trains models sequentially, with each new model concentrating on the errors made by the previous ones; this mainly reduces bias and improves accuracy. In short, Bagging combines independent, equally weighted models and suits high-variance learners such as deep decision trees, while Boosting builds dependent models, typically weights them by performance, and suits high-bias learners such as shallow trees.
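
As a rough illustration of the contrast, the sketch below pairs each technique with the kind of base learner it usually helps: bagging with deep (high-variance) trees and boosting with depth-1 stumps (high-bias). The synthetic dataset and all hyperparameters are assumptions chosen for the demo, and the exact scores will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    # High-variance learner and its bagged counterpart (trees trained independently).
    "single deep tree": DecisionTreeClassifier(random_state=0),
    "bagged deep trees": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0
    ),
    # High-bias learner and its boosted counterpart (stumps trained sequentially).
    # AdaBoost's default base learner is a depth-1 decision tree.
    "single stump": DecisionTreeClassifier(max_depth=1, random_state=0),
    "boosted stumps": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name:>18}: {score:.3f}")
```

The base tree is passed to BaggingClassifier positionally so the snippet does not depend on whether the installed scikit-learn names that argument estimator or base_estimator.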

Question 3: What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?

Bootstrap sampling is a resampling technique where multiple subsets (bootstrap samples) are created from an original dataset by randomly selecting data points with replacement. This means that a single data point can appear multiple times within a bootstrap sample, while other data points from the original dataset may not be included at all in a particular sample. Each bootstrap sample is typically the same size as the original dataset.
In Bagging (Bootstrap Aggregating) methods, such as Random Forest, bootstrap sampling plays a crucial role in promoting diversity among the individual models (e.g., decision trees in a Random Forest).
Role in Bagging methods like Random Forest:
Creating Diverse Training Sets:
Each individual tree in a Random Forest is trained on a different bootstrap sample. This ensures that each tree learns from a slightly different perspective of the data, as some samples are repeated and others are omitted in each bootstrap sample. This diversity helps reduce the correlation between individual trees.
Reducing Variance and Preventing Overfitting:
By training multiple trees on diverse subsets of the data and then aggregating their predictions (e.g., through majority voting for classification or averaging for regression), Bagging methods effectively reduce the variance of the overall model. This helps to prevent overfitting, as the ensemble model is less sensitive to the specific noise or outliers present in any single training set.
Enhancing Model Stability:
The aggregation of predictions from multiple models, each trained on a bootstrapped sample, leads to a more stable and robust final prediction compared to a single model trained on the entire dataset. This stability is crucial for generalization to unseen data.
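
The bootstrap sampling step described above can be sketched with NumPy alone. The dataset here is just a range of indices, an assumption made to keep the example small. For a large dataset of size n, the expected share of distinct original points in one bootstrap sample is 1 - (1 - 1/n)^n, which approaches 1 - 1/e, or about 63.2%; the remaining ~36.8% are left out of that sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000
indices = np.arange(n_samples)  # stand-in for the rows of a dataset

# Draw one bootstrap sample: same size as the original, sampled with replacement.
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

# Fraction of original rows that appear at least once in the bootstrap sample.
unique_fraction = np.unique(bootstrap_idx).size / n_samples
# Fraction of original rows that were never drawn (left out of this sample).
left_out_fraction = 1.0 - unique_fraction

print(f"Rows included at least once: {unique_fraction:.3f}")
print(f"Rows left out:               {left_out_fraction:.3f}")
print(f"Theoretical limit 1 - 1/e:   {1 - np.exp(-1):.3f}")
```

In a bagging ensemble this draw is repeated once per base model, so each model sees a different mix of repeated and omitted rows.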

Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?

Out-of-Bag (OOB) samples are the training points that were not drawn into the bootstrap sample used to train a particular tree in an ensemble such as a Random Forest. Because that tree never saw them, they act as a built-in validation set for it, so generalization can be estimated without setting aside a separate validation dataset. To compute the OOB score, each training point is predicted using only the trees for which it was out-of-bag, and these aggregated predictions are compared with the true labels (in scikit-learn, accuracy for classifiers and R² for regressors). The result approximates performance on unseen data, though it can be slightly pessimistic because each point is predicted by only a subset of the trees.
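
A minimal sketch of the OOB score in scikit-learn follows, assuming the Breast Cancer dataset and an arbitrary choice of 200 trees. Setting oob_score=True makes the fitted forest expose oob_score_, which can be compared with accuracy on a held-out test set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# oob_score=True asks the forest to score each training point using only
# the trees whose bootstrap sample did not contain that point.
rf = RandomForestClassifier(
    n_estimators=200, bootstrap=True, oob_score=True, random_state=42
)
rf.fit(X_train, y_train)

oob_accuracy = rf.oob_score_              # OOB estimate (accuracy for classifiers)
test_accuracy = rf.score(X_test, y_test)  # accuracy on the held-out test set

print(f"OOB accuracy:  {oob_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

The two numbers are usually close, which is why the OOB score is a convenient check when data is too scarce to hold out a validation set.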

Question 5: Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.

Feature importance analysis in a single Decision Tree and a Random Forest both aim to identify the most influential features in a dataset, but they differ in their stability, reliability, and interpretation.
Single Decision Tree:
Calculation:
Feature importance in a single Decision Tree is typically calculated based on the reduction in impurity (e.g., Gini impurity or entropy) achieved by splitting on a particular feature. The higher the impurity reduction, the more important the feature is considered.
Stability and Reliability:
Feature importance in a single Decision Tree can be unstable and less reliable due to its susceptibility to small changes in the data, leading to different tree structures and potentially varying feature importance rankings. It can also overemphasize features that appear early in the tree, even if other features are also highly predictive.
Interpretation:
The importance scores are directly linked to the splits made within that specific tree, offering a clear, albeit potentially biased, view of feature influence within that single model.
Random Forest:
Calculation:
Random Forest calculates feature importance by averaging the impurity reduction across all individual Decision Trees in the forest. This ensemble approach provides a more robust and stable measure of importance.
Stability and Reliability:
Random Forests provide more stable and reliable feature importance scores because they aggregate information from multiple trees, reducing the impact of individual tree variations and overfitting. This helps in identifying genuinely important features.
Interpretation:
Random Forest importance scores give a more reliable global view, but because they are averaged over many trees they no longer correspond to a single, inspectable set of splits, so they are harder to trace back to individual decisions. Highly correlated features also tend to share importance, so each may appear less important individually than it really is, and impurity-based importances are biased toward features with many distinct values.
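
One way to probe the stability claim is to refit both models on several bootstrap resamples of the same data and see how much the importance scores move. The sketch below does this on the Breast Cancer dataset; the helper name importances_over_resamples, the number of rounds, and the spread measure are assumptions for illustration, and the outcome depends on the data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

data = load_breast_cancer()
X, y = data.data, data.target

def importances_over_resamples(make_model, n_rounds=5):
    """Fit a fresh model on several bootstrap resamples; return one row of importances per fit."""
    rows = []
    for seed in range(n_rounds):
        # Resample the rows with replacement (same size as the original data).
        X_rs, y_rs = resample(X, y, random_state=seed)
        model = make_model().fit(X_rs, y_rs)
        rows.append(model.feature_importances_)
    return np.array(rows)

tree_imp = importances_over_resamples(
    lambda: DecisionTreeClassifier(random_state=0)
)
forest_imp = importances_over_resamples(
    lambda: RandomForestClassifier(n_estimators=100, random_state=0)
)

# Spread of each feature's importance across resamples, averaged over features.
tree_spread = tree_imp.std(axis=0).mean()
forest_spread = forest_imp.std(axis=0).mean()

print(f"Mean std of importances, single tree:   {tree_spread:.4f}")
print(f"Mean std of importances, random forest: {forest_spread:.4f}")
```

A smaller spread means the ranking changes less when the training data changes; sklearn.inspection.permutation_importance is a common alternative when impurity-based scores are suspected of bias.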



In [1]:
'''Question 6: Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.'''

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Train Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Get feature importances
importances = clf.feature_importances_

# Create a DataFrame for better readability
feature_importances = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort by importance and select top 5
top_features = feature_importances.sort_values(by="Importance", ascending=False).head(5)

# Print results
print("Top 5 most important features:")
print(top_features.to_string(index=False))


Top 5 most important features:
             Feature  Importance
          worst area    0.139357
worst concave points    0.132225
 mean concave points    0.107046
        worst radius    0.082848
     worst perimeter    0.080850


In [3]:
'''Question 7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree'''

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
import sklearn

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, y_pred_dt)

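# Bagging Classifier (handle sklearn version compatibility).
# The `estimator` argument replaced `base_estimator` in scikit-learn 1.2.
# Compare version numbers numerically: as strings, "1.10" would sort below "1.2".
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])
if (major, minor) >= (1, 2):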
    bagging = BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        n_estimators=50,
        random_state=42
    )
else:
    bagging = BaggingClassifier(
        base_estimator=DecisionTreeClassifier(),
        n_estimators=50,
        random_state=42
    )

bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, y_pred_bag)

# Print results
print("scikit-learn version:", sklearn.__version__)
print("Accuracy of Single Decision Tree:", dt_accuracy)
print("Accuracy of Bagging Classifier:", bagging_accuracy)


scikit-learn version: 1.6.1
Accuracy of Single Decision Tree: 1.0
Accuracy of Bagging Classifier: 1.0


In [4]:
'''Question 8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy'''

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset (Iris for demo)
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 3, 5, 7]
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_

# Best model
best_model = grid_search.best_estimator_

# Evaluate on test set
y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", best_params)
print("Final Accuracy on Test Set:", final_accuracy)


Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy on Test Set: 1.0


In [5]:
'''Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)'''

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Bagging Regressor (base = Decision Tree)
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
bagging_reg.fit(X_train, y_train)
y_pred_bag = bagging_reg.predict(X_test)
mse_bagging = mean_squared_error(y_test, y_pred_bag)

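# Random Forest Regressor
# Note: by default RandomForestRegressor considers all features at every split,
# so it behaves much like bagged decision trees and the two MSEs should be close.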
rf_reg = RandomForestRegressor(
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Results
print("Mean Squared Error - Bagging Regressor:", mse_bagging)
print("Mean Squared Error - Random Forest Regressor:", mse_rf)


Mean Squared Error - Bagging Regressor: 0.25787382250585034
Mean Squared Error - Random Forest Regressor: 0.25772464361712627


In [6]:
'''Question 10: You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context.
'''
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, classification_report

# -------------------------------
# 1. Simulate Loan Default Dataset
# -------------------------------
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           n_redundant=5, n_clusters_per_class=2,
                           weights=[0.9, 0.1], # imbalanced: 10% defaults
                           random_state=42)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# -------------------------------
# 2. Define Models
# -------------------------------
# Bagging: Random Forest
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=6,            # limit depth to reduce overfitting
    max_features="sqrt",    # feature sampling (regularization)
    random_state=42,
    n_jobs=-1
)

# Boosting: XGBoost
xgb = XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.1,
    subsample=0.8,          # regularization
    colsample_bytree=0.8,   # feature sampling
    eval_metric="logloss",
    use_label_encoder=False,  # ignored by recent XGBoost releases (it only triggers an "are not used" warning); safe to delete
    random_state=42,
    n_jobs=-1
)

# -------------------------------
# 3. Cross-Validation Evaluation
# -------------------------------
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rf_cv_auc = cross_val_score(rf, X_train, y_train, cv=cv, scoring="roc_auc").mean()
xgb_cv_auc = cross_val_score(xgb, X_train, y_train, cv=cv, scoring="roc_auc").mean()

print("Cross-Validation ROC-AUC Scores:")
print(f"Random Forest (Bagging): {rf_cv_auc:.4f}")
print(f"XGBoost (Boosting): {xgb_cv_auc:.4f}")

# -------------------------------
# 4. Final Evaluation on Test Set
# -------------------------------
rf.fit(X_train, y_train)
xgb.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
y_pred_xgb = xgb.predict(X_test)

rf_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
xgb_auc = roc_auc_score(y_test, xgb.predict_proba(X_test)[:, 1])

print("\nTest Set ROC-AUC:")
print(f"Random Forest: {rf_auc:.4f}")
print(f"XGBoost: {xgb_auc:.4f}")

print("\nClassification Report (XGBoost):")
print(classification_report(y_test, y_pred_xgb))


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


Cross-Validation ROC-AUC Scores:
Random Forest (Bagging): 0.9309
XGBoost (Boosting): 0.9566


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



Test Set ROC-AUC:
Random Forest: 0.9131
XGBoost: 0.9557

Classification Report (XGBoost):
              precision    recall  f1-score   support

           0       0.96      0.99      0.98      1344
           1       0.93      0.65      0.76       156

    accuracy                           0.96      1500
   macro avg       0.94      0.82      0.87      1500
weighted avg       0.96      0.96      0.95      1500

