# Assignment

Q.1. What is Ensemble Learning in machine learning? Explain the key idea
behind it.

Answer ->

Ensemble Learning in machine learning is a technique where multiple models (called "base learners" or "weak learners") are trained and combined to solve the same problem, with the goal of achieving better performance than any single model alone.

**Key Idea Behind Ensemble Learning :**

The main idea is that a group of individually weak models, when combined, can match or outperform a single strong model, similar to the saying “wisdom of the crowd.”

Different models tend to make different errors. When each model is at least somewhat better than random guessing and their errors are not strongly correlated, combining their predictions lets many of these errors cancel out, resulting in higher accuracy, better generalization, and improved robustness.
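
To make the "errors cancel out" argument concrete, here is a minimal sketch. It assumes an idealized setting that real ensembles only approximate: every model is correct with the same probability `p`, and the models err independently. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes each voter is right with probability p, independently of the others.
    """
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )

# Accuracy of the vote as the (odd-sized) ensemble grows, with p = 0.6
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the accuracy of the vote rises with the number of models. With correlated errors the gain is smaller, which is why ensemble methods work to make their base models differ from one another.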

Q.2. What is the difference between Bagging and Boosting?

Answer ->

**Bagging :**
- Training: models are built independently, so they can be trained in parallel
- Goal: reduce variance
- Data sampling: each model is trained on a random subset of the data drawn with replacement
- Combination method: simple averaging (regression) or majority voting (classification)
- Common algorithms: Random Forest, Bagged Trees

**Boosting :**
- Training: models are built sequentially, each learning from the previous ones’ errors
- Goal: reduce bias
- Data sampling: typically uses the entire dataset, either increasing the weights of misclassified samples (AdaBoost) or fitting each new model to the remaining errors (Gradient Boosting)
- Combination method: weighted combination of the models
- Common algorithms: AdaBoost, Gradient Boosting, XGBoost
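
The "adjusting weights of samples" step is the main mechanical difference from Bagging. The sketch below shows one round of an AdaBoost-style weight update for labels in {-1, +1}. It is an illustration of the idea rather than any library's implementation, and the helper name `adaboost_reweight` is made up for this example.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost-style round for labels in {-1, +1}.

    Returns the model's vote weight (alpha) and the renormalised sample weights.
    """
    total = sum(weights)
    # Weighted error rate of the current model
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    alpha = 0.5 * math.log((1 - err) / err)  # assumes 0 < err < 1
    # Misclassified samples (t * p == -1) are scaled up, correct ones scaled down
    new = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(new)
    return alpha, [w / norm for w in new]

weights = [0.2] * 5
y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, +1, -1, -1, -1]  # only the last sample is misclassified
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update, the single misclassified sample carries half of the total weight, so the next model in the sequence is pushed to get it right. Bagging has no such feedback loop: each model gets its own random resample and never sees how the others performed.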


Q.3. What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?

Answer ->

**Bootstrap sampling :**

Bootstrap sampling is a statistical technique that involves randomly selecting samples from a dataset with replacement to create multiple new datasets (called bootstrap samples).

Each bootstrap sample is of the same size as the original dataset.

Because sampling is done with replacement, some data points may appear multiple times, while others may not appear at all in a given sample.

Example:
If you have a dataset with 10 records (numbered 1 to 10), a bootstrap sample might look like:

[2, 5, 1, 7, 5, 9, 3, 1, 8, 4]

Here, some elements (5 and 1) appear twice, and others (6 and 10) do not appear at all.
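
A sample like this can be drawn with Python's standard library. The sketch below is illustrative: the helper name `bootstrap_sample` and the seed are arbitrary choices, and the exact sample printed depends on the seed.

```python
import random

def bootstrap_sample(records, rng):
    """Draw len(records) items uniformly at random, with replacement."""
    return rng.choices(records, k=len(records))

rng = random.Random(42)  # fixed seed so the sketch is repeatable
records = list(range(1, 11))  # records numbered 1 to 10
sample = bootstrap_sample(records, rng)
left_out = sorted(set(records) - set(sample))

print("sample:  ", sample)
print("left out:", left_out)
```

The sample always has the same length as the original dataset, while `left_out` lists the records that were never drawn.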

**Role of Bootstrap Sampling in Bagging :**

Bagging (Bootstrap Aggregating) uses bootstrap sampling as its foundation.

Process:

1. From the original dataset, multiple bootstrap samples are generated.

2. A separate model (e.g., decision tree) is trained on each bootstrap sample independently.

3. The predictions of all models are combined (by averaging for regression or majority voting for classification).

This aggregation helps reduce the variance of the final model.
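
The three steps above can be sketched without any library. This is a toy illustration, not how scikit-learn implements Bagging: the base model is a one-dimensional threshold classifier (a "stump") instead of a full decision tree, and the names `fit_stump`, `bagging_fit`, and `bagging_predict` are made up for this example.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a 1-D threshold classifier on (x, label) pairs with labels 0/1."""
    xs = sorted({x for x, _ in points})
    # Candidate thresholds: midpoints between neighbouring distinct x values
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or xs
    best = None
    for t in candidates:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((right if x > t else left) != y for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: right if x > t else left

def bagging_fit(points, n_models, rng):
    models = []
    for _ in range(n_models):
        sample = rng.choices(points, k=len(points))  # step 1: bootstrap sample
        models.append(fit_stump(sample))             # step 2: train independently
    return models

def bagging_predict(models, x):
    votes = Counter(model(x) for model in models)    # step 3: majority vote
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(x / 10, int(x >= 10)) for x in range(20)]  # label is 1 when x >= 1.0
models = bagging_fit(data, n_models=25, rng=rng)
print(bagging_predict(models, 0.2), bagging_predict(models, 1.8))
```

Each stump picks a slightly different threshold because it sees a different bootstrap sample. The vote averages over those differences, which is where the variance reduction comes from.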

Q.4. What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?

Answer ->

**Out-of-Bag :**

When performing bootstrap sampling (sampling with replacement) in ensemble methods like Bagging or Random Forest, not all data points from the original dataset are selected in each bootstrap sample.

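On average, each bootstrap sample contains about 63% of the distinct data points in the original dataset (the expected fraction is 1 − (1 − 1/n)^n, which approaches 1 − 1/e ≈ 63.2% for large n).

The remaining ~37% of the data, which are not selected in that particular sample, are called Out-of-Bag (OOB) samples.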

In simple terms:

OOB samples are the data points left out during the creation of a bootstrap sample.
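
The size of the OOB set can be checked with a short simulation. The sketch below compares the simulated OOB fraction with the closed-form value (1 − 1/n)^n, the probability that a given point is missed by all n draws. The function names are made up for this illustration.

```python
import random

def oob_fraction(n, rng):
    """Fraction of n points that a single bootstrap sample of size n leaves out."""
    in_bag = {rng.randrange(n) for _ in range(n)}
    return 1 - len(in_bag) / n

def expected_oob_fraction(n):
    """Probability that one specific point is never picked in n draws."""
    return (1 - 1 / n) ** n

rng = random.Random(0)
n = 1000
simulated = sum(oob_fraction(n, rng) for _ in range(200)) / 200
print(round(simulated, 3), round(expected_oob_fraction(n), 3))
```

Both numbers land close to 1/e ≈ 0.368, so roughly 37% of the data is out-of-bag for any one base model.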

**Role of OOB Samples in Model Evaluation**

The OOB samples serve as a built-in validation set for evaluating the model’s performance without needing a separate test set.

How it works (in Random Forest or Bagging):

1. For each tree (or base model), the OOB samples (those not used for training that tree) are passed through the model to get predictions.

2. For every observation in the dataset, you can collect predictions from all the trees for which that observation was OOB.

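3. Compare the aggregated OOB predictions to the true labels to estimate the OOB error or OOB score. In scikit-learn, setting `oob_score=True` on a Random Forest or Bagging estimator makes this estimate available as the `oob_score_` attribute after fitting.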
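
The three steps can be sketched in plain Python. This is a toy illustration of the bookkeeping, not scikit-learn's implementation: the base model is a 1-nearest-neighbour rule on one-dimensional data, and the names `nearest_label` and `oob_accuracy` are made up for this example.

```python
import random
from collections import Counter

def nearest_label(train, x):
    """1-nearest-neighbour prediction on 1-D (x, label) pairs."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def oob_accuracy(data, n_models, rng):
    n = len(data)
    votes = [Counter() for _ in range(n)]  # OOB votes collected per observation
    for _ in range(n_models):
        in_bag = [rng.randrange(n) for _ in range(n)]
        train = [data[i] for i in in_bag]
        # Step 1: each model predicts only the points it did not train on
        for i in set(range(n)) - set(in_bag):
            votes[i][nearest_label(train, data[i][0])] += 1
    # Step 2: aggregate the OOB votes for every point that was OOB at least once
    scored = [
        (vote.most_common(1)[0][0], data[i][1])
        for i, vote in enumerate(votes)
        if vote
    ]
    # Step 3: compare the aggregated predictions with the true labels
    return sum(pred == true for pred, true in scored) / len(scored)

rng = random.Random(0)
data = [(x / 10, int(x >= 10)) for x in range(20)]  # label is 1 when x >= 1.0
score = oob_accuracy(data, n_models=50, rng=rng)
print(round(score, 3))
```

No point is ever scored by a model that saw it during training, which is what lets the OOB score behave like a validation estimate.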

Q.5. Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.

Answer ->

**Decision Tree :**
- Based on impurity reduction in one tree
- Unstable: small changes in the training data can change the tree and its importance ranking
- Biased toward features with many levels or possible split points
- Easier to interpret (simple structure)
- Reflects importance in one specific tree

**Random Forest :**
- Average of impurity reductions across all trees
- Stable: averaging across many trees smooths out the quirks of any single tree
- Feature randomness can lessen this bias, although impurity-based importance can still favor high-cardinality features
- Harder to interpret (aggregated result)
- Reflects overall importance across the ensemble
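
"Impurity reduction" can be made concrete with a small calculation. The sketch below computes the Gini impurity decrease of a single split and then averages some per-tree scores the way a forest would. It is a simplified illustration: scikit-learn additionally weights each split by the share of samples reaching the node and normalizes the importances, and the per-tree scores here are hypothetical numbers.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted_children

parent = [0, 0, 0, 1, 1, 1]
clean_split = impurity_decrease(parent, [0, 0, 0], [1, 1, 1])  # separates classes
noisy_split = impurity_decrease(parent, [0, 0, 1], [0, 1, 1])  # barely helps
print(round(clean_split, 3), round(noisy_split, 3))

# Forest-style importance: average one feature's score over the trees.
per_tree_scores = [0.50, 0.42, 0.47]  # hypothetical scores from three trees
forest_importance = sum(per_tree_scores) / len(per_tree_scores)
print(round(forest_importance, 3))
```

A single tree reports only its own score, so one unlucky split can distort the ranking. The forest average is less sensitive to any one tree.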

Q.6. Write a Python program to:

● Load the Breast Cancer dataset using

    sklearn.datasets.load_breast_cancer()

● Train a Random Forest Classifier

● Print the top 5 most important features based on feature importance scores.

In [1]:
# Answer ->>

# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Step 1: Load the Breast Cancer dataset
data = load_breast_cancer()

# Convert to DataFrame for easier handling
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Step 2: Train a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Step 3: Get feature importance scores
importances = rf.feature_importances_

# Create a DataFrame of feature names and their importance scores
feature_importance_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': importances
})

# Step 4: Sort features by importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Step 5: Print the top 5 most important features
print("Top 5 Most Important Features:")
print(feature_importance_df.head(5))


Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


Q.7. Write a Python program to:

● Train a Bagging Classifier using Decision Trees on the Iris dataset

● Evaluate its accuracy and compare with a single Decision Tree

In [None]:
# Answer ->>

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: Train a single Decision Tree classifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, y_pred_dt)

# Step 4: Train a Bagging Classifier using Decision Trees as base estimators
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named `base_estimator` before scikit-learn 1.2
    n_estimators=50,       # number of trees
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, y_pred_bag)

# Step 5: Print and compare the accuracies
print("Accuracy of Single Decision Tree:", round(dt_accuracy, 3))
print("Accuracy of Bagging Classifier (with Decision Trees):", round(bagging_accuracy, 3))


Q.8. Write a Python program to:

● Train a Random Forest Classifier

● Tune hyperparameters max_depth and n_estimators using GridSearchCV

● Print the best parameters and final accuracy


In [6]:
# Answer ->>

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Step 1: Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Step 2: Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 3: Define the base Random Forest model
rf = RandomForestClassifier(random_state=42)

# Step 4: Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5, 7, None]
}

# Step 5: Use GridSearchCV to find the best combination of parameters
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,                # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1            # Use all available cores for faster computation
)

# Step 6: Fit GridSearchCV on training data
grid_search.fit(X_train, y_train)

# Step 7: Get the best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Step 8: Evaluate the final model on the test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Step 9: Print results
print("Best Parameters Found:", best_params)
print("Final Model Test Accuracy:", round(accuracy, 3))


Best Parameters Found: {'max_depth': 3, 'n_estimators': 150}
Final Model Test Accuracy: 1.0


Q.9. Write a Python program to:

● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset

● Compare their Mean Squared Errors (MSE)

In [None]:
# Answer ->>

# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Step 1: Load California Housing dataset
california = fetch_california_housing()
X, y = california.data, california.target

# Step 2: Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 3: Train a Bagging Regressor (with Decision Trees)
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # named `base_estimator` before scikit-learn 1.2
    n_estimators=100,  # same number of trees as the Random Forest, for a like-for-like comparison
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

# Step 4: Train a Random Forest Regressor
rf = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Step 5: Print and compare MSEs
print("Mean Squared Error (Bagging Regressor):", round(mse_bag, 3))
print("Mean Squared Error (Random Forest Regressor):", round(mse_rf, 3))


Q.10. You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.

You decide to use ensemble techniques to increase model performance.

Explain your step-by-step approach to:

● Choose between Bagging or Boosting

● Handle overfitting

● Select base models

● Evaluate performance using cross-validation

● Justify how ensemble learning improves decision-making in this real-world
context.

Answer ->

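1. Data Preparation : Clean the data, impute missing values, encode categorical features, and scale numeric features if needed (tree-based ensembles generally do not require scaling). Check the class balance, since loan defaults are typically the minority class.

2. Model Choice : Choose Bagging (Random Forest) when a single model overfits (high variance), and Boosting (XGBoost/Gradient Boosting) when a single model underfits (high bias). In practice, train both and keep the one with the better cross-validated score.

3. Base Learner Selection : Shallow Decision Trees for boosting; full (deep) trees for bagging.

4. Hyperparameter Tuning & Overfitting Control : Tune with cross-validation. For bagging, limit max depth or raise the minimum samples per leaf; for boosting, use a small learning rate, limit max depth and the number of trees, and apply regularization or early stopping.

5. Model Evaluation : Use stratified k-fold CV so that every fold keeps the same default rate, and report ROC-AUC, Precision/Recall, and F1-score rather than accuracy alone, which can be misleading when defaults are rare.

6. Deployment & Monitoring : Use ensemble predictions to guide lending decisions; monitor model drift over time. Because an ensemble combines many models, its predictions tend to be more stable and to generalize better than a single model's, which supports more consistent lending decisions.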
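
To illustrate why step 5 asks for stratified folds, here is a minimal sketch of stratified fold assignment in plain Python. It is a conceptual illustration, not how scikit-learn's `StratifiedKFold` is implemented. The function name `stratified_folds` and the 10% default rate are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, rng):
    """Return k lists of indices, each with roughly the overall class mix."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets its share
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

rng = random.Random(0)
labels = [1] * 10 + [0] * 90  # hypothetical data: 10% defaults
folds = stratified_folds(labels, k=5, rng=rng)
for fold in folds:
    print(len(fold), sum(labels[i] for i in fold))  # fold size, number of defaults
```

Each fold is held out once as the validation set while the model is trained on the remaining folds. With plain random splitting on data this imbalanced, a fold could end up with very few defaults, which would make metrics such as Recall or ROC-AUC unreliable for that fold.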