In [1]:
#1. What is Ensemble Learning in machine learning? Explain the key idea behind it.
#>>Ensemble learning is a technique that combines multiple individual models (often called weak learners or base models) to build a more accurate and robust predictive model.
#The key idea:
# ● Train multiple models on the same (or slightly different) data.
# ● Combine their predictions using a specific strategy (e.g., voting, averaging, or stacking).
# ● Because the individual models' errors partly cancel out, the combined output is usually more accurate than most of the individual models.
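The benefit of voting can be worked out by hand under simplifying assumptions. The sketch below assumes a binary task, an odd number of models that are each correct with the same probability, and fully independent errors (real models are correlated, so actual gains are smaller). The helper name `majority_vote_accuracy` and the value 0.7 are illustrative choices, not part of the original answer.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent models is correct,
    when each model is correct with probability p (binomial tail sum)."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Accuracy of the ensemble as more (hypothetical, independent) models vote
for n in (1, 3, 11, 51):
    print(f"{n:>2} models -> majority-vote accuracy {majority_vote_accuracy(n, 0.7):.4f}")
```

With three such models the majority is right with probability 0.7^3 + 3 * 0.7^2 * 0.3 = 0.784, already above the 0.7 of any single model.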

In [2]:
#2. What is the difference between Bagging and Boosting?
#Bagging (Bootstrap Aggregating) and Boosting are both ensemble learning techniques, but they differ in how they build and combine models.
#In Bagging, multiple models (like decision trees) are trained independently on different random subsets of the training data (sampled with replacement). Their predictions are then combined by averaging or voting, which reduces variance and limits overfitting.
#In contrast, Boosting builds models sequentially, where each new model focuses on correcting the errors made by the previous ones. The models are combined in a weighted manner, giving more importance to those with better performance. Boosting mainly reduces bias and often improves accuracy, but it tends to be more sensitive to noisy data than Bagging.
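To make the contrast concrete, the following sketch trains one ensemble of each kind on a synthetic dataset. The dataset, the `_bb` variable names, and the choice of `AdaBoostClassifier` as the boosting method are assumptions made for illustration; which method scores higher depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X_bb, y_bb = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=0
)
X_train_bb, X_test_bb, y_train_bb, y_test_bb = train_test_split(
    X_bb, y_bb, test_size=0.3, random_state=0
)

# Bagging: full-depth trees trained independently on bootstrap samples, combined by voting
bagging_bb = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: shallow trees (depth-1 stumps by default) trained one after another,
# each on data reweighted toward the previous models' mistakes
boosting_bb = AdaBoostClassifier(n_estimators=50, random_state=0)

scores_bb = {}
for name, model in [("bagging", bagging_bb), ("boosting", boosting_bb)]:
    model.fit(X_train_bb, y_train_bb)
    scores_bb[name] = model.score(X_test_bb, y_test_bb)
    print(f"{name:>8} test accuracy: {scores_bb[name]:.4f}")
```

Note the different base learners: bagging starts from high-variance deep trees and averages the variance away, while boosting starts from high-bias stumps and reduces the bias step by step.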

In [3]:
#3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
#>>Bootstrap sampling is a technique where multiple random samples are drawn with replacement from the original dataset, meaning some data points may appear more than once while others may be left out.
#In Bagging methods like Random Forest, bootstrap sampling allows each model (e.g., each decision tree) to train on a slightly different subset of the data. This introduces diversity among the models, reducing the chance that all of them make the same errors. When their predictions are combined (by averaging or voting), this diversity helps to reduce variance and improve the overall accuracy and stability of the ensemble model.
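A bootstrap sample is easy to simulate directly. This minimal sketch uses NumPy on a hypothetical dataset of 10,000 row indices (the size and seed are arbitrary) and measures how many distinct rows end up in one sample.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000
indices = np.arange(n_samples)  # stand-in for the row indices of a dataset

# One bootstrap sample: same size as the original, drawn with replacement
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

# Some rows repeat, others never appear
unique_fraction = np.unique(bootstrap_idx).size / n_samples
left_out_fraction = 1.0 - unique_fraction

print(f"Distinct rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"Rows left out of the sample:           {left_out_fraction:.3f}")
```

For large n the expected share of distinct rows approaches 1 - 1/e (about 63.2%), so each tree in a bagged ensemble sees roughly two-thirds of the rows, some of them more than once.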

In [4]:
#4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?
#>>Out-of-Bag (OOB) samples are the data points not included in a particular bootstrap sample when training models in ensemble methods like Random Forest. Since bootstrap sampling is done with replacement, about 37% of the original data (roughly one-third) is typically left out of each tree's sample. These are the OOB samples for that tree.
#The OOB score is a way to evaluate the model’s performance without using a separate validation or test set. After training, each data point is predicted using only the trees that did not see it during training, and these aggregated predictions are compared to the true values. For classification the OOB score is the proportion of correctly predicted instances (accuracy); for regression, scikit-learn reports the R² score. The result is a convenient estimate of the model’s generalization performance, although it is not strictly unbiased.
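In scikit-learn the OOB estimate is available by passing `oob_score=True` to the forest. The sketch below compares it with a held-out test score on the Breast Cancer dataset; the split, the number of trees, and the `_oob` variable names are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_oob, y_oob = load_breast_cancer(return_X_y=True)
X_train_oob, X_test_oob, y_train_oob, y_test_oob = train_test_split(
    X_oob, y_oob, test_size=0.3, random_state=42
)

# oob_score=True scores every training row using only the trees that never saw it
forest_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest_oob.fit(X_train_oob, y_train_oob)

oob_accuracy = forest_oob.oob_score_  # accuracy for classifiers
test_accuracy = forest_oob.score(X_test_oob, y_test_oob)

print(f"OOB accuracy:  {oob_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

The two numbers are usually close, which is why the OOB score is often used as a cheap stand-in for a validation set when data is scarce.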

In [5]:
#5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest
#>>In a single Decision Tree, feature importance is calculated from how much each feature reduces impurity (such as Gini impurity or entropy) across all of the tree’s splits, weighted by the number of samples reaching each split. Features used near the root or in many informative splits tend to have higher importance. However, because a single tree is sensitive to small changes in the data, its feature importances can be unstable and may reflect quirks of the training set.
#In contrast, a Random Forest averages the importance scores of each feature across all trees in the forest. Because each tree sees a different bootstrap sample and considers a random subset of features at each split, this aggregation makes the estimates more stable and less dependent on any one tree’s splits. Impurity-based importances can still favor high-cardinality features, so the forest's ranking is more robust than a single tree’s but is not a guarantee of which features truly drive the predictions.
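The difference can be inspected directly. This sketch fits one tree and one forest on the Breast Cancer dataset and looks at how many features each assigns non-zero importance, plus how much the forest's individual trees disagree. The `_fi` names and the seed are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data_fi = load_breast_cancer()
X_fi, y_fi = data_fi.data, data_fi.target

tree_fi = DecisionTreeClassifier(random_state=0).fit(X_fi, y_fi)
forest_fi = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fi, y_fi)

# Importances of every individual tree inside the forest: shape (n_trees, n_features)
per_tree_fi = np.array([t.feature_importances_ for t in forest_fi.estimators_])
spread_fi = per_tree_fi.std(axis=0)  # disagreement between trees, per feature

# A single tree only credits the features it happened to split on
n_used_tree = int((tree_fi.feature_importances_ > 0).sum())
n_used_forest = int((forest_fi.feature_importances_ > 0).sum())
print(f"Features with non-zero importance: tree={n_used_tree}, forest={n_used_forest}")

# Top 5 forest features with the tree-to-tree standard deviation
for i in np.argsort(forest_fi.feature_importances_)[::-1][:5]:
    print(f"{data_fi.feature_names[i]:<22} "
          f"forest={forest_fi.feature_importances_[i]:.3f} "
          f"(std across trees {spread_fi[i]:.3f}), "
          f"single tree={tree_fi.feature_importances_[i]:.3f}")
```

A large standard deviation across trees is a sign that any single tree's ranking for that feature should not be trusted on its own.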

In [6]:
#6. Write a Python program to:
# ● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
# ● Train a Random Forest Classifier
# ● Print the top 5 most important features based on feature importance scores.
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train a Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Get feature importances
feature_importances = pd.Series(model.feature_importances_, index=X.columns)

# Sort and print the top 5 features
top_features = feature_importances.sort_values(ascending=False).head(5)
print("Top 5 Most Important Features:\n")
print(top_features)

Top 5 Most Important Features:

worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


In [8]:
#7. Write a Python program to:
# ● Train a Bagging Classifier using Decision Trees on the Iris dataset
# ● Evaluate its accuracy and compare with a single Decision Tree
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, y_pred_dt)

# Train a Bagging Classifier using Decision Trees as base estimators
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named `base_estimator` before scikit-learn 1.2
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, y_pred_bag)

# Print the comparison
print("Accuracy Comparison:")
print(f"Single Decision Tree Accuracy: {dt_accuracy:.4f}")
print(f"Bagging Classifier Accuracy:   {bagging_accuracy:.4f}")

Accuracy Comparison:
Single Decision Tree Accuracy: 1.0000
Bagging Classifier Accuracy:   1.0000


In [9]:
#8. Write a Python program to:
# ● Train a Random Forest Classifier
# ● Tune hyperparameters max_depth and n_estimators using GridSearchCV
# ● Print the best parameters and final accuracy
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the model
rf = RandomForestClassifier(random_state=42)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5, 7, None]
}

# Use GridSearchCV to find the best combination
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get the best parameters and best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate on the test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Hyperparameters:", best_params)
print(f"Final Test Accuracy: {accuracy:.4f}")

Best Hyperparameters: {'max_depth': 5, 'n_estimators': 150}
Final Test Accuracy: 0.9708


In [10]:
#9. Write a Python program to:
# ● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
# ● Compare their Mean Squared Errors (MSE)
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Bagging Regressor with Decision Trees as base estimators
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
bagging_reg.fit(X_train, y_train)
y_pred_bag = bagging_reg.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

# Train a Random Forest Regressor
rf_reg = RandomForestRegressor(
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print the comparison
print("Mean Squared Error (MSE) Comparison:")
print(f"Bagging Regressor MSE:       {mse_bag:.4f}")
print(f"Random Forest Regressor MSE: {mse_rf:.4f}")

Mean Squared Error (MSE) Comparison:
Bagging Regressor MSE:       0.2579
Random Forest Regressor MSE: 0.2577


In [13]:
#10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:
# ● Choose between Bagging or Boosting
# ● Handle overfitting
# ● Select base models
# ● Evaluate performance using cross-validation
# ● Justify how ensemble learning improves decision-making in this real-world context.
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc, f1_score
from imblearn.over_sampling import SMOTE

# -------------------------
# 1. Load or create dataset
# -------------------------
# Example synthetic dataset (replace with your real customer data)
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, 
                           n_redundant=5, n_clusters_per_class=2,
                           weights=[0.9, 0.1], flip_y=0.01, random_state=42)

# -------------------------
# 2. Train-test split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    stratify=y, random_state=42)

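# -------------------------
# 3. Handle imbalance
# -------------------------
# SMOTE is fit on the training split only, so the test set stays untouched.
# Caveat: step 5 cross-validates on this already-resampled data, so synthetic
# points interpolated from a validation fold's rows can land in the training
# folds. Those CV scores are therefore optimistic; trust the test-set scores more.
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)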

# -------------------------
# 4. Define models
# -------------------------
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42)

# -------------------------
# 5. Cross-validation
# -------------------------
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

def evaluate_model(model, X, y):
    roc_scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
    f1_scores = cross_val_score(model, X, y, cv=skf, scoring='f1')
    print(f"{model.__class__.__name__} - ROC-AUC: {roc_scores.mean():.4f} ± {roc_scores.std():.4f}")
    print(f"{model.__class__.__name__} - F1-score: {f1_scores.mean():.4f} ± {f1_scores.std():.4f}\n")

print("Cross-Validation Results:")
evaluate_model(rf_model, X_train_res, y_train_res)
evaluate_model(gb_model, X_train_res, y_train_res)

# -------------------------
# 6. Train final model and evaluate on test set
# -------------------------
rf_model.fit(X_train_res, y_train_res)
gb_model.fit(X_train_res, y_train_res)

for model in [rf_model, gb_model]:
    y_pred_proba = model.predict_proba(X_test)[:,1]
    y_pred = model.predict(X_test)
    
    roc = roc_auc_score(y_test, y_pred_proba)
    f1 = f1_score(y_test, y_pred)
    
    precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
    pr_auc = auc(recall, precision)
    
    print(f"{model.__class__.__name__} Test Set:")
    print(f"ROC-AUC: {roc:.4f}, F1: {f1:.4f}, PR-AUC: {pr_auc:.4f}\n")

Cross-Validation Results:
RandomForestClassifier - ROC-AUC: 0.9957 ± 0.0004
RandomForestClassifier - F1-score: 0.9697 ± 0.0021

GradientBoostingClassifier - ROC-AUC: 0.9845 ± 0.0011
GradientBoostingClassifier - F1-score: 0.9414 ± 0.0052

RandomForestClassifier Test Set:
ROC-AUC: 0.9419, F1: 0.7902, PR-AUC: 0.8435

GradientBoostingClassifier Test Set:
ROC-AUC: 0.9183, F1: 0.6751, PR-AUC: 0.7936

