**Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.**

**Answer:** Ensemble learning is a machine learning technique in which multiple models (often called base learners or weak learners) are trained and their predictions combined to solve a single problem. Rather than relying on one model, an ensemble aggregates the predictions of many, which typically improves accuracy and robustness.

The key idea behind ensemble learning is the principle of the “wisdom of the crowd”. Just as a group of people with diverse perspectives can collectively make better decisions than any one person, combining multiple models reduces the impact of individual errors and improves generalization, provided the models do not all make the same mistakes.

* If one model makes a mistake, others may correct it.
* Different models may capture different aspects of the data.
* The combined prediction is usually more stable and less prone to overfitting than that of any individual model.
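The effect of voting can be illustrated with a small simulation. This is a minimal sketch under a strong assumption: every model is correct with the same probability and the models' errors are fully independent, which real models rarely are. The function name and the chosen numbers (25 models, 60% individual accuracy) are illustrative only.

```python
import random

def majority_vote_accuracy(n_models, p_correct, n_trials=20000, seed=0):
    """Estimate how often a majority of independent models is correct."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Assumption: each model is independently correct with probability p_correct.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_models))
        if correct_votes > n_models / 2:
            wins += 1
    return wins / n_trials

single = majority_vote_accuracy(n_models=1, p_correct=0.6)
ensemble = majority_vote_accuracy(n_models=25, p_correct=0.6)
print(f"single model:           {single:.3f}")
print(f"25-model majority vote: {ensemble:.3f}")
```

When the models' errors are correlated, the gain from voting shrinks, which is why ensemble methods deliberately inject diversity among their base learners.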

**Question 2: What is the difference between Bagging and Boosting?**

**Answer:** Bagging and Boosting are both ensemble learning techniques, but they differ in how they build and combine models:

Bagging (Bootstrap Aggregating):

* Models are trained independently on different random subsets of the training data (sampled with replacement).
* The final prediction is made by averaging (for regression) or majority voting (for classification).
* Goal: Reduce variance and prevent overfitting.
* Example: Random Forest.

Boosting:

* Models are trained sequentially, where each new model focuses on correcting the errors made by the previous models.
* The final prediction is a weighted combination of all models.
* Goal: Reduce bias and improve accuracy.
* Examples: AdaBoost, Gradient Boosting, XGBoost.



**Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

**Answer:** Bootstrap sampling is a statistical technique in which multiple random samples are drawn from the original dataset with replacement.

* Each bootstrap sample has the same size as the original dataset, but because sampling is done with replacement, some data points may appear multiple times, while others may not appear at all.

Role in Bagging (e.g., Random Forest):

* In Bagging, each base learner (e.g., decision tree) is trained on a different bootstrap sample.
* This introduces diversity among the models because each one sees a slightly different version of the training data.
* When their predictions are combined (via averaging or majority vote), the overall model becomes more robust, stable, and less prone to overfitting compared to a single model trained on the full dataset.
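A single bootstrap draw can be sketched with the standard library alone. The ten-row toy dataset and the seed are illustrative assumptions; the point is only that sampling with replacement produces repeats and omissions.

```python
import random
from collections import Counter

rng = random.Random(42)
data = list(range(10))  # toy dataset: 10 row indices

# Draw a bootstrap sample: same size as the data, sampled with replacement.
sample = rng.choices(data, k=len(data))
counts = Counter(sample)

repeated = sorted(i for i, c in counts.items() if c > 1)
left_out = sorted(set(data) - set(sample))

print("bootstrap sample:    ", sample)
print("drawn more than once:", repeated)
print("never drawn:         ", left_out)
```

In Bagging, each base learner would be trained on the rows selected by its own `sample`, so no two learners see exactly the same training set.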



**Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**

**Answer:**

When Bagging methods such as Random Forest use bootstrap sampling, each bootstrap sample is drawn with replacement from the training set. On average, about 63% of the distinct training data points appear in a given bootstrap sample, leaving the remaining 37% or so unused by the model trained on that sample.

* These unused data points are called **Out-of-Bag (OOB) samples**.
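The 63% / 37% split follows from the sampling procedure: with n draws with replacement from n points, a given point is missed with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The sketch below checks this numerically; the function name and the sample sizes are illustrative choices.

```python
import math
import random

def in_bag_fraction(n):
    # Probability that a given point is drawn at least once
    # in n draws with replacement from n points.
    return 1 - (1 - 1 / n) ** n

for n in (10, 100, 1000, 10000):
    print(n, round(in_bag_fraction(n), 4))

limit = 1 - math.exp(-1)  # limiting in-bag fraction as n grows

# Empirical check on one simulated bootstrap sample.
rng = random.Random(0)
n = 10000
drawn = {rng.randrange(n) for _ in range(n)}
empirical = len(drawn) / n
print("limit:", round(limit, 4), "empirical:", round(empirical, 4))
```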

**Role of OOB Samples:**

* OOB samples act like a **built-in validation set**.
* For each model in the ensemble, its OOB samples can be used to test its performance since those samples were not seen during training.

**OOB Score:**

* The **OOB score** is the accuracy (or another performance metric) obtained by predicting each training point using only the models whose bootstrap samples did not contain it, then scoring those aggregated predictions across all points.
* It provides an **approximately unbiased estimate of model performance** without needing a separate validation or test dataset.
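In scikit-learn, a Random Forest can report this estimate directly. A minimal sketch, assuming the Breast Cancer dataset and an arbitrary choice of 200 trees:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each training point using only
# the trees that did not see it (requires bootstrap=True, the default).
forest = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=42
)
forest.fit(X, y)

# For classifiers, oob_score_ is the accuracy of the OOB predictions.
print(f"OOB accuracy: {forest.oob_score_:.3f}")
```

No train/test split is needed here: the forest is fitted on all of `X`, and the held-out role is played by each tree's own out-of-bag points.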

**Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**

**Answer:** Feature importance tells us how much each feature contributes to the predictive power of the model.

*In a Single Decision Tree:*

* Feature importance is calculated based on how much a feature reduces impurity (e.g., Gini impurity, entropy, or variance) across all the nodes where it is used for splitting.
* The importance score is biased toward high-cardinality categorical features and continuous variables, since they offer more candidate split points.
* Because it depends on only one tree, it may give unstable results (high variance) if the dataset is small or noisy.

*In a Random Forest:*

* Feature importance is computed by averaging impurity reduction across all trees in the forest.
* This aggregation makes the importance scores more stable and reliable than those from a single tree, although impurity-based scores can still favor high-cardinality features.
* Randomness in feature selection (feature bagging) further reduces bias toward dominant features.
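The impurity reduction behind these scores can be worked through by hand. The sketch below computes the Gini decrease for a single split on a toy set of labels; the helper names and the label lists are made up for illustration. scikit-learn's `feature_importances_` additionally weights each node by the fraction of samples reaching it and normalizes the totals to sum to 1.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini(left) + len(right) / n * gini(right)
    )
    return gini(parent) - weighted_children

parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A weak split: both children are still mixed.
weak = impurity_decrease(parent, [0, 0, 0, 1], [0, 1, 1, 1])

# A perfect split: each child contains a single class.
perfect = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])

print("weak split decrease:   ", weak)     # 0.125
print("perfect split decrease:", perfect)  # 0.5
```

A feature's importance in one tree is the sum of such decreases over every node that splits on it; a Random Forest averages that quantity over all of its trees.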

In [1]:
'''
Question 6: Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.
(Include your Python code and output in the code box below.)

Answer:
'''

# Import required libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier


# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# 2. Train a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# 3. Get feature importance scores
importances = clf.feature_importances_

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
})

# Sort features by importance (descending order)
feature_importance_df = feature_importance_df.sort_values(by="Importance", ascending=False)

# 4. Print Top 5 Most Important Features
print("Top 5 Important Features:")
print(feature_importance_df.head(5))


Top 5 Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


In [3]:
'''
Question 7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree
(Include your Python code and output in the code box below.)

Answer:
'''
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# 1. Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Train a single Decision Tree
dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)
y_pred_dt = dt_clf.predict(X_test)
dt_accuracy = accuracy_score(y_test, y_pred_dt)

# 4. Train a Bagging Classifier with Decision Trees
bag_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # "estimator" needs scikit-learn >= 1.2 (formerly "base_estimator")
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
bag_clf.fit(X_train, y_train)
y_pred_bag = bag_clf.predict(X_test)
bag_accuracy = accuracy_score(y_test, y_pred_bag)

# 5. Print accuracy comparison
print("Accuracy of Single Decision Tree:", dt_accuracy)
print("Accuracy of Bagging Classifier:", bag_accuracy)


Accuracy of Single Decision Tree: 1.0
Accuracy of Bagging Classifier: 1.0


In [4]:
'''
Question 8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy
(Include your Python code and output in the code box below.)

Answer:
'''
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Define Random Forest model
rf = RandomForestClassifier(random_state=42)

# 4. Define parameter grid for tuning
param_grid = {
    "n_estimators": [50, 100, 150],   # number of trees
    "max_depth": [None, 5, 10, 15]    # depth of trees
}

# 5. Perform GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,                # 5-fold cross-validation
    n_jobs=-1,
    scoring="accuracy"
)
grid_search.fit(X_train, y_train)

# 6. Get best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# 7. Evaluate final accuracy on test set
y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", best_params)
print("Final Accuracy on Test Set:", final_accuracy)


Best Parameters: {'max_depth': 5, 'n_estimators': 150}
Final Accuracy on Test Set: 0.9707602339181286


In [6]:
'''
Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)
(Include your Python code and output in the code box below.)

Answer:
'''
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# 1. Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Train a Bagging Regressor (with Decision Trees as base estimator)
bag_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # "estimator" needs scikit-learn >= 1.2 (formerly "base_estimator")
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
bag_reg.fit(X_train, y_train)
y_pred_bag = bag_reg.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

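# 4. Train a Random Forest Regressor
#    Note: with its default max_features, RandomForestRegressor considers all
#    features at each split, so it behaves much like bagged decision trees here.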
rf_reg = RandomForestRegressor(
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# 5. Print comparison
print("Mean Squared Error (Bagging Regressor):", mse_bag)
print("Mean Squared Error (Random Forest Regressor):", mse_rf)


Mean Squared Error (Bagging Regressor): 0.25787382250585034
Mean Squared Error (Random Forest Regressor): 0.25772464361712627


**Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance.**

Explain your step-by-step approach to:
* Choose between Bagging or Boosting
* Handle overfitting
* Select base models
* Evaluate performance using cross-validation
* Justify how ensemble learning improves decision-making in this real-world context.

(Include your Python code and output in the code box below.)

**Answer:**  
*Step 1: Choose between Bagging or Boosting*

* Bagging (e.g., Random Forest) is useful when the main problem is high variance (unstable models like Decision Trees).
* Boosting (e.g., XGBoost, AdaBoost, Gradient Boosting) is useful when the main problem is high bias (weak learners underfitting).

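For loan default prediction (imbalanced classes, complex relationships between features), Boosting is often preferred because each new model concentrates on the mistakes of the previous ones, which usually yields stronger predictive performance. Because the classes are imbalanced, that performance should be judged with metrics such as recall and AUC rather than accuracy alone.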

*Step 2: Handle Overfitting*

* Use cross-validation to tune hyperparameters (max_depth, n_estimators, learning_rate).
* Apply regularization techniques (e.g., shrinkage/learning rate in Boosting, limiting depth of trees).
* Use early stopping (for Gradient Boosting / XGBoost).
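Early stopping and shrinkage can be sketched with scikit-learn's `GradientBoostingClassifier`. The synthetic data and every parameter value below are assumptions chosen for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for loan data (illustrative only).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    weights=[0.7, 0.3], random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.05,       # shrinkage: smaller steps per round
    max_depth=3,              # shallow trees limit model complexity
    validation_fraction=0.2,  # share of training data held out internally
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gb.fit(X, y)

# n_estimators_ is the number of rounds kept after early stopping.
print("Boosting rounds actually used:", gb.n_estimators_)
```

If `n_estimators_` comes out below the upper bound, training stopped because the internal validation score had stopped improving.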

*Step 3: Select Base Models*

* Typically Decision Trees are chosen as base models because they are simple and work well in ensembles.
* For bagging: DecisionTreeClassifier -> Random Forest.
* For boosting: shallow Decision Trees (depth-1 stumps in AdaBoost, slightly deeper trees in Gradient Boosting / XGBoost).

*Step 4: Evaluate Performance with Cross-Validation*
* Use StratifiedKFold cross-validation, which preserves the class ratio in every fold (important because the dataset is imbalanced).
* Evaluate using metrics beyond accuracy (e.g., Precision, Recall, F1-score, AUC).
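A minimal sketch of this evaluation, assuming a synthetic imbalanced dataset and a default `GradientBoostingClassifier`; the metric names are scikit-learn's built-in scorer strings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: roughly 30% positives (illustrative only).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    weights=[0.7, 0.3], random_state=42
)

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["precision", "recall", "f1", "roc_auc"]

results = cross_validate(
    GradientBoostingClassifier(random_state=42),
    X, y, cv=cv, scoring=metrics
)

# cross_validate returns one array of per-fold scores per metric.
for name in metrics:
    scores = results[f"test_{name}"]
    print(f"{name:>9}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread across folds alongside the mean shows how much the estimate depends on the particular split.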

*Step 5: Justification in Real-World Context*

* Ensemble models reduce errors by combining multiple weak learners.
* In loan default prediction, wrong decisions can be costly.
* Ensemble methods improve robustness, generalization, and decision confidence, helping the financial institution minimize risk while maximizing approval rates.

    
    

In [8]:
# Import libraries
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.ensemble import GradientBoostingClassifier

# 1. Simulate dataset (binary classification: loan default 0/1)
#    weights=[0.7, 0.3] makes class 1 (default) the minority class
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=10,
    n_redundant=5, n_classes=2, weights=[0.7, 0.3], random_state=42
)

# Split data (stratify=y preserves the class ratio in both splits)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 2. Define the base model (Gradient Boosting, as chosen in Step 1)
gb = GradientBoostingClassifier(random_state=42)

# 3. Hyperparameter tuning for Gradient Boosting (Boosting)
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_search = GridSearchCV(
    estimator=gb, param_grid=param_grid, cv=cv,
    scoring="roc_auc", n_jobs=-1
)
grid_search.fit(X_train, y_train)

# Best model
best_model = grid_search.best_estimator_

# 4. Evaluate model
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print("Best Parameters (Boosting):", grid_search.best_params_)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba))


Best Parameters (Boosting): {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200}

Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.98      0.96      1047
           1       0.95      0.87      0.91       453

    accuracy                           0.95      1500
   macro avg       0.95      0.92      0.94      1500
weighted avg       0.95      0.95      0.95      1500

ROC-AUC Score: 0.972744580858587
