# Ensemble Learning | Assignment

---

## Assignment Code: DA-AG-014

## Arghadeep Misra
### arghadeepmisra@gmail.com

---

**Question 1:** What is Ensemble Learning in machine learning? Explain the key idea behind it.

Answer

Ensemble learning is a method in machine learning where we combine more than one model to make predictions. Instead of depending on a single model, we use a group of models together. This group is called an “ensemble.”

The key idea is that many weak learners (models that are only slightly better than random guessing on their own) can be combined into a strong learner. Each model makes some mistakes, but as long as the models do not all make the same mistakes, combining their outputs cancels out part of the error, so the overall prediction becomes more accurate and stable.

Example: Think of a classroom. If one student answers a tough question, that student might be wrong. But if the whole class votes, the majority answer is more likely to be correct, provided most students are right more often than wrong and do not simply copy each other. In the same way, an ensemble usually has lower error and is less prone to overfitting than a single model.
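The classroom vote can be checked with a small calculation. The sketch below assumes every model is correct with the same probability and that the models err independently, which real models only approximate; the function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes each voter is right with probability p_correct, independently
    of the others (an idealisation of real models).
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Each voter is only 60% accurate, yet the majority improves as voters are added.
for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

If the voters were perfectly correlated (all copying one answer), the majority would be no better than a single voter, which is why diversity among the models matters.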

---

**Question 2:** What is the difference between Bagging and Boosting?

Answer

Bagging and Boosting are two types of ensemble methods, but they work in different ways.

Bagging (Bootstrap Aggregating):
It trains many models on random samples of the dataset (taken with replacement). Each model is trained independently. At the end, predictions are combined, usually by majority vote (for classification) or averaging (for regression).
Example: Random Forest is a bagging method that also picks a random subset of features at each split.

Boosting:
It trains models one after another. Each new model tries to fix the mistakes made by the previous ones. In AdaBoost, misclassified samples get more weight so the next model focuses on them; in gradient boosting methods, each new model is fit to the remaining errors (residuals) of the current ensemble. At the end, the models are combined in a weighted manner.
Examples: AdaBoost, XGBoost.

Key difference: Bagging reduces variance by averaging many independent models. Boosting reduces bias by learning from mistakes step by step.
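To make "learning from mistakes" concrete, here is a minimal sketch of one round of the classic (binary, discrete) AdaBoost weight update. It is an illustration only, not what any particular library does internally; the helper name `adaboost_reweight` and the toy inputs are assumptions.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One round of discrete AdaBoost: return (alpha, new_weights).

    weights       -- current sample weights
    misclassified -- True where the current weak learner was wrong
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, m in zip(weights, misclassified) if m) / sum(weights)
    if not 0 < err < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    # Vote strength of this learner in the final ensemble.
    alpha = 0.5 * math.log((1 - err) / err)
    # Increase weights of mistakes, decrease weights of correct points.
    raw = [w * math.exp(alpha if m else -alpha)
           for w, m in zip(weights, misclassified)]
    total = sum(raw)
    return alpha, [w / total for w in raw]  # renormalise to sum to 1


weights = [0.2] * 5                                  # five equally weighted points
misclassified = [False, False, True, False, False]   # one mistake
alpha, new_weights = adaboost_reweight(weights, misclassified)
print("alpha:", round(alpha, 3))
print("new weights:", [round(w, 3) for w in new_weights])
```

After the update the single misclassified point carries half of the total weight, so the next weak learner is pushed to get it right. Bagging has no such step: every model is trained independently on its own bootstrap sample.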

---

**Question 3:** What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

Answer

Bootstrap sampling means taking random samples from the dataset with replacement.
“With replacement” means the same data point can appear more than once in the sample.

In Bagging, like in Random Forest, each model (tree) is trained on a different bootstrap sample of the data. Since every model sees a slightly different dataset, they all learn differently. Later, their results are combined.

The role of bootstrap sampling is to bring diversity among the models. If all models trained on the exact same data, they would give very similar outputs. By using bootstrap, we get models that make different errors, and when we combine them, the overall error reduces.
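A small sketch of bootstrap sampling using only the standard library. The ten-point dataset and the helper name `bootstrap_sample` are made up for illustration.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly, with replacement."""
    return rng.choices(data, k=len(data))


rng = random.Random(42)   # fixed seed so the run is repeatable
data = list(range(10))    # toy dataset: ten data points labelled 0..9

# One bootstrap sample per (hypothetical) tree.
samples = [bootstrap_sample(data, rng) for _ in range(3)]
for i, s in enumerate(samples, start=1):
    print(f"tree {i}: {sorted(s)} (unique points: {len(set(s))})")
```

Each sample has the same size as the original data but typically contains repeats and misses some points, which is what makes the trees differ from one another.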

---

**Question 4:** What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

Answer

When we do bootstrap sampling, some data points don’t get picked in the sample. These leftover points are called Out-of-Bag (OOB) samples.

In Bagging methods like Random Forest, each tree can be tested on its own OOB samples, that is, on data it never saw during training. To score the whole ensemble, each data point is predicted using only the trees for which that point was out-of-bag.

The OOB score is the accuracy (or error) of these predictions over the training set. It works like a built-in cross validation, so we don’t always need to keep a separate validation set.

Example: A bootstrap sample of the same size as the dataset leaves out about 37% of the points on average (roughly 1/e). Those points are OOB for that tree, and the tree’s predictions on them contribute to the OOB score.
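How many points end up out-of-bag can be estimated directly. The sketch below assumes the bootstrap sample has the same size as the dataset (the usual default); the helper name `oob_indices` is hypothetical.

```python
import random


def oob_indices(n_samples, rng):
    """Indices never drawn in one bootstrap sample of size n_samples."""
    in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
    return [i for i in range(n_samples) if i not in in_bag]


n = 10_000
rng = random.Random(0)
oob = oob_indices(n, rng)

empirical = len(oob) / n          # share left out in this one draw
theoretical = (1 - 1 / n) ** n    # chance a given point is never picked

print("empirical OOB fraction:  ", round(empirical, 3))
print("theoretical OOB fraction:", round(theoretical, 3))
```

The theoretical value approaches 1/e (about 0.368) as the dataset grows, so each tree has roughly a third of the data available as a free test set.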

---

**Question 5:** Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

Answer

In a single Decision Tree, feature importance is based on how much each feature reduces impurity (like Gini or entropy) when it splits the data. If a feature is used near the top of the tree and reduces impurity a lot, it gets high importance.
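The impurity reduction of one split can be computed by hand. This is a minimal sketch for a single node using Gini impurity; the label lists are toy data, and a full tree would additionally weight each node by the share of samples reaching it and sum the reductions per feature.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A mediocre split: each child is still mixed.
print(impurity_decrease(parent, [0, 0, 0, 1], [0, 1, 1, 1]))  # 0.125

# A perfect split: each child is pure.
print(impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1]))  # 0.5
```

The feature behind the second split would earn four times the importance of the feature behind the first one at this node.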

But a single tree can be unstable. A small change in data may change the structure of the tree, and so the feature importance can also change a lot.

In a Random Forest, feature importance is calculated by averaging across many trees. Each tree gives its own importance scores, and then we average them. This makes the importance values more stable than those from any single tree.

In summary:

Single tree - importance depends on one particular tree structure, so it can be unstable.

Random forest - importance is averaged across many trees, so it is more consistent. It is still impurity-based, so it can favor features with many distinct values.
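The averaging step can be sketched in a few lines. The per-tree scores below are invented numbers for three hypothetical trees and three features, used only to show how averaging smooths out disagreement between trees.

```python
from statistics import mean, pstdev

# Hypothetical importance scores: one row per tree, one column per feature.
# Each row sums to 1, as normalised importances do.
per_tree = [
    [0.70, 0.20, 0.10],
    [0.40, 0.45, 0.15],
    [0.55, 0.30, 0.15],
]


def average_importances(rows):
    """Forest-level importance: mean of each feature's score over all trees."""
    return [mean(column) for column in zip(*rows)]


def importance_spread(rows):
    """Standard deviation per feature, showing how much the trees disagree."""
    return [pstdev(column) for column in zip(*rows)]


forest_importance = average_importances(per_tree)
spread = importance_spread(per_tree)

for name, avg, sd in zip(["f0", "f1", "f2"], forest_importance, spread):
    print(f"{name}: mean={avg:.3f}  spread={sd:.3f}")
```

In this toy data the second tree ranks `f1` above `f0`, but the averaged scores still put `f0` first, which is the stabilising effect described above.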

---

**Question 6:** Write a Python program to:
● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.
(Include your Python code and output in the code box below.)

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

feat_imp = pd.DataFrame({'Feature': X.columns, 'Importance': model.feature_importances_})
feat_imp = feat_imp.sort_values(by='Importance', ascending=False)

print(feat_imp.head(5))


                 Feature  Importance
23            worst area    0.153892
27  worst concave points    0.144663
7    mean concave points    0.106210
20          worst radius    0.077987
6         mean concavity    0.068001


---

**Question 7:** Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree
(Include your Python code and output in the code box below.)

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred_tree = tree.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_tree))

bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
print("Bagging Accuracy:", accuracy_score(y_test, y_pred_bag))


Decision Tree Accuracy: 1.0
Bagging Accuracy: 1.0


---

**Question 8:** Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy
(Include your Python code and output in the code box below.)

In [3]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    'max_depth': [3, 5, 7, None],
    'n_estimators': [50, 100, 150]
}

grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
grid.fit(X_train, y_train)

print("Best Params:", grid.best_params_)

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))


Best Params: {'max_depth': 7, 'n_estimators': 50}
Accuracy: 0.9649122807017544


---

**Question 9:** Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
● Compare their Mean Squared Errors (MSE)
(Include your Python code and output in the code box below.)

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

bag_model = BaggingRegressor(random_state=42)
bag_model.fit(X_train, y_train)
y_pred_bag = bag_model.predict(X_test)
print("Bagging MSE:", mean_squared_error(y_test, y_pred_bag))

rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
print("Random Forest MSE:", mean_squared_error(y_test, y_pred_rf))


Bagging MSE: 0.2824242776841025
Random Forest MSE: 0.2553684927247781


---

**Question 10:** You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance.

Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making

Answer

Step 1: Choose between Bagging or Boosting

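Loan default is a complex problem with many interacting patterns.

Boosting is a reasonable first choice because it focuses on hard-to-predict cases and often gives higher accuracy on tabular data like this. Bagging (for example Random Forest) is the safer option if the data is noisy or the boosted model overfits, so it is worth comparing both with cross validation.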

Step 2: Handle Overfitting

Use parameters like max_depth to keep trees small, and a small learning rate for boosting.

Use cross validation and early stopping (for boosting methods) to stop adding trees once validation performance stops improving.
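Early stopping can be sketched as a simple rule on the validation loss recorded after each boosting round. The loss values and the `patience` parameter below are hypothetical; real libraries expose their own options for this.

```python
def early_stopping_round(val_losses, patience=3):
    """Return the 1-based round with the best validation loss.

    Scanning stops once the loss has failed to improve for
    `patience` consecutive rounds.
    """
    best_loss = float("inf")
    best_round = 0
    rounds_without_improvement = 0
    for round_number, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_round = round_number
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                break  # validation loss has stalled, stop adding trees
    return best_round


# Made-up validation losses: improving at first, then overfitting.
losses = [0.60, 0.48, 0.41, 0.39, 0.40, 0.41, 0.43, 0.47]
print("keep trees up to round:", early_stopping_round(losses))  # 4
```

Rounds after the fourth only fit the training data more closely while the validation loss rises, so keeping them would mean overfitting.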

Step 3: Select Base Models

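Decision Trees are the base models because they are simple, fast, and handle mixed demographic and transaction features without scaling. Shallow trees suit Boosting, while deeper trees suit Bagging.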

Step 4: Evaluate with Cross Validation

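Use k-fold cross validation (stratified, so every fold keeps the same share of defaults) to check that the model performs well across different data splits. Because defaults are usually the minority class, look at metrics such as ROC-AUC, precision and recall rather than accuracy alone.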
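The way cross validation splits the data can be sketched without any library. One assumption beyond the text: loan default data is usually imbalanced, so this sketch assigns folds per class (stratification) to keep the default rate the same in every fold. The labels are invented, the helper name `stratified_folds` is hypothetical, and the sketch does not shuffle, which a real split normally would.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Split sample indices into k folds, preserving class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's samples out to the folds in turn.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy labels: 16 repaid loans (0) and 4 defaults (1), a 20% default rate.
labels = [0] * 16 + [1] * 4
folds = stratified_folds(labels, k=4)

for i, test_fold in enumerate(folds, start=1):
    defaults = sum(labels[j] for j in test_fold)
    print(f"fold {i}: {len(test_fold)} samples, {defaults} default(s)")
```

Each fold is used once as the test set while the model is trained on the other folds, and the scores are averaged, as `cross_val_score` does for the model being evaluated.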

Step 5: Justify

Ensemble methods improve decision-making because they combine many weak learners, which reduces variance (Bagging) or bias (Boosting) and so lowers the overall error. This gives more reliable loan default predictions than relying on a single model.

In [5]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

# sample dataset (breast cancer) is used as placeholder
# because financial loan dataset is not available in sklearn
# process will be the same for loan default data
data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingClassifier(random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=5)

print("Cross-Validation Accuracy:", scores.mean())


Cross-Validation Accuracy: 0.9516483516483516


---
