
# Theory Questions

## Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.

Answer:
Ensemble Learning is a machine learning technique in which multiple individual models (called base learners; in boosting they are often deliberately weak learners) are combined into a single, more robust model. The key idea is that a group of models, when aggregated properly, often performs better than any single member. This works when the models make different, not strongly correlated errors: each may capture different patterns in the data, so their individual mistakes partly cancel when predictions are combined, which reduces variance, bias, or both. Common ensemble methods include Bagging (Bootstrap Aggregating), Boosting, and Stacking. In essence, ensemble learning leverages the “wisdom of the crowd” to make more accurate and stable predictions.
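
The "wisdom of the crowd" effect can be made concrete with a small calculation. The sketch below assumes an idealized setting that real ensembles only approximate: every model is correct with the same probability and the models err independently. The function name `majority_vote_accuracy` is a hypothetical helper introduced for illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes every model is right with probability p_correct and that
    errors are independent (an idealization).
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Each model alone is only 60% accurate; the majority vote does better
# as the (odd) number of voters grows.
for n in (1, 3, 5, 25, 101):
    print(f"{n:>3} models -> {majority_vote_accuracy(n, 0.6):.4f}")
```

In practice base learners are correlated, so the gain is smaller than this calculation suggests; that is why ensemble methods work to make their members diverse.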

## Question 2: What is the difference between Bagging and Boosting?

Answer:

Bagging (Bootstrap Aggregating):
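Bagging trains multiple base learners independently (and therefore in parallel, if desired) on different bootstrap samples of the training data. The predictions are then combined (e.g., majority voting for classification, averaging for regression). Its main strength is reducing variance, which curbs the overfitting of high-variance learners such as deep decision trees. Random Forest is a popular example; it adds random feature selection at each split on top of bagging.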

Boosting:
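Boosting builds base learners sequentially, where each new model focuses on correcting the errors made by the previous ones. AdaBoost does this by assigning higher weights to misclassified samples so that the next learner pays more attention to them, while Gradient Boosting fits each new learner to the residual errors (negative gradients of the loss) of the current ensemble. Boosting primarily reduces bias and can also reduce variance, producing strong models from weak learners, but it is more sensitive to noisy data and more prone to overfitting if not tuned properly. Examples include AdaBoost, Gradient Boosting, and XGBoost.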

## Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

Answer:
Bootstrap sampling is a technique where we create multiple training datasets by randomly sampling with replacement from the original dataset. Each bootstrap sample has the same size as the original dataset, so it contains duplicate observations and, on average, only about 63% of the distinct original observations. In Bagging methods like Random Forest, bootstrap sampling ensures that each base learner is trained on a slightly different version of the data, introducing diversity among models. This diversity is crucial because averaging the outputs of diverse models helps reduce overfitting and variance, leading to more stable and accurate predictions.
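
Sampling with replacement is easy to demonstrate with the standard library alone. This is a minimal sketch; `bootstrap_sample` is a hypothetical helper, and the data is just a list of integers standing in for row indices.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


rng = random.Random(42)
data = list(range(1000))  # stand-in for 1000 training rows

sample = bootstrap_sample(data, rng)
unique_fraction = len(set(sample)) / len(data)

print("Sample size:", len(sample))
print("Fraction of distinct original rows in the sample:", unique_fraction)
print("Fraction left out of the sample:", 1 - unique_fraction)
```

For large datasets the fraction of distinct rows approaches 1 - 1/e (about 0.632), so repeated runs with different seeds should land near that value.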

## Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

Answer:
Out-of-Bag (OOB) samples are the data points that are not included in a particular bootstrap sample. Because sampling is done with replacement, roughly 37% of the data (about 1/e, slightly more than one-third) is left out of each bootstrap sample, so these OOB samples can act as a natural validation set for the base learner trained on that sample. To compute the OOB score, each training point is predicted using only the base learners that did not see it during training, and the aggregated predictions are scored against the true labels (accuracy for classifiers, R² for regressors in scikit-learn). It provides a reasonable estimate of the model’s generalization performance without needing a separate validation dataset, making it especially useful when data is limited.
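
In scikit-learn, the OOB estimate is available by passing `oob_score=True` to a bagging-style estimator. The sketch below is one way to check how closely it tracks a held-out test score; the dataset and the choice of 200 trees are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

# oob_score=True asks the forest to score each training row using only
# the trees whose bootstrap sample did not contain that row.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X_train, y_train)

oob_accuracy = rf.oob_score_              # estimated from training data only
test_accuracy = rf.score(X_test, y_test)  # measured on the held-out split

print("OOB accuracy: ", oob_accuracy)
print("Test accuracy:", test_accuracy)
```

The two numbers are usually close, which is why the OOB score is often used in place of a separate validation split when data is scarce.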

## Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

Answer:
In a single Decision Tree, feature importance is determined by how much each feature reduces impurity (e.g., Gini index or entropy) at its split points, weighted by the number of samples reaching each split. However, a single tree is highly sensitive to small changes in the training data: it may assign exaggerated importance to a few features and zero importance to correlated features that it never happened to split on.

In a Random Forest, feature importance is averaged over many trees, making it more stable and reliable. Since different trees are trained on different bootstrap samples and consider random feature subsets at each split, correlated features get a chance to contribute, and the averaged scores have much lower variance than those of a single tree. Impurity-based importances can still be biased toward high-cardinality features, so permutation importance is a useful cross-check.
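
The contrast can be inspected directly by fitting both models on the same data and placing their `feature_importances_` side by side. This is a sketch on the breast cancer dataset; the column names `single_tree` and `random_forest` are labels chosen here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=42).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances from both models, one row per feature.
importances = pd.DataFrame(
    {
        "single_tree": tree.feature_importances_,
        "random_forest": forest.feature_importances_,
    },
    index=data.feature_names,
)

# Number of features that receive a non-zero importance in each model.
features_used = (importances > 0).sum()

print(importances.sort_values("random_forest", ascending=False).head(5))
print(features_used)
```

A single tree typically concentrates its importance on the handful of features it split on, while the forest spreads importance across many more of them.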

# Practical Questions

## Question 6: Python Program – Random Forest Feature Importance

Code:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Train Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

# Get feature importance
importances = pd.Series(rf.feature_importances_, index=feature_names)
top_features = importances.sort_values(ascending=False).head(5)

print("Top 5 Important Features:")
print(top_features)


Output:
Top 5 Important Features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


## Question 7: Python Program – Bagging Classifier vs. Decision Tree on Iris

Code:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))

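# Bagging Classifier: 50 decision trees, each fit on a bootstrap sample
# (`estimator` requires scikit-learn >= 1.2; older releases call it `base_estimator`)
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42,
)
bag.fit(X_train, y_train)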
bag_acc = accuracy_score(y_test, bag.predict(X_test))

print("Decision Tree Accuracy:", dt_acc)
print("Bagging Classifier Accuracy:", bag_acc)


Output:
Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0


## Question 8: Python Program – Random Forest with GridSearchCV

Code:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Random Forest
rf = RandomForestClassifier(random_state=42)

# Hyperparameter tuning
param_grid = {
    'max_depth': [3, 5, 7, None],
    'n_estimators': [50, 100, 200]
}
grid = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Results
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Parameters:", grid.best_params_)
print("Final Accuracy:", accuracy_score(y_test, y_pred))


Output:
Best Parameters: {'max_depth': 5, 'n_estimators': 50}
Final Accuracy: 0.972027972027972


## Question 9: Python Program – Bagging Regressor vs. Random Forest Regressor

Code:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

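# Bagging Regressor: 50 regression trees, each fit on a bootstrap sample
# (`estimator` requires scikit-learn >= 1.2; older releases call it `base_estimator`)
bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42,
)
bag.fit(X_train, y_train)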
bag_mse = mean_squared_error(y_test, bag.predict(X_test))

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_mse = mean_squared_error(y_test, rf.predict(X_test))

print("Bagging Regressor MSE:", bag_mse)
print("Random Forest Regressor MSE:", rf_mse)


Output:
Bagging Regressor MSE: 0.2582477439355284
Random Forest Regressor MSE: 0.2542358390056568


## Question 10: Real-world Ensemble Approach (Loan Default Prediction)

Answer:

Choosing between Bagging and Boosting:
Predicting loan defaults is a high-stakes classification problem in which false negatives (predicting non-default for a customer who actually defaults) are costly. I would choose Boosting (e.g., XGBoost, LightGBM) because it primarily reduces bias, typically performs strongly on tabular data, and supports class weighting to handle the imbalance between defaulters and non-defaulters.

Handling Overfitting:
I would use early stopping on a validation set, a small learning rate (shrinkage), regularization (L1/L2 penalties, limits on tree depth, row and column subsampling), and cross-validation to ensure the model does not overfit the training data.

Selecting Base Models:
Decision Trees are the most common base learners for both Bagging and Boosting. Bagging typically uses deep, fully grown trees, whereas Boosting uses shallow ones. For financial data with Boosting, I would start with shallow trees (depth 3–6) to capture non-linear patterns while avoiding over-complexity.

Evaluating Performance using Cross-Validation:
I would perform stratified k-fold cross-validation so that each fold preserves the overall class proportions. Evaluation metrics would include AUC-ROC, precision-recall AUC (average precision), and F1-score, because plain accuracy is misleading when the dataset is imbalanced.

Justifying Ensemble Learning in Decision-Making:
Ensemble models provide more robust and stable predictions than a single model. In a financial context, they help minimize risk by capturing diverse patterns in customer behavior and transaction history. This can improve loan approval accuracy and reduce default risk. Fairness across customer segments is not automatic, however, and should be verified by evaluating the model separately on each segment.
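
The choices described in this answer can be sketched end to end with scikit-learn. Everything specific below is an assumption made for illustration: the data is synthetic (`make_classification` with roughly 10% positives standing in for defaults), `GradientBoostingClassifier` stands in for XGBoost or LightGBM, and the hyperparameter values are starting points rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data (assumption): about 10% "defaults".
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=8,
    weights=[0.9, 0.1],
    random_state=42,
)

model = GradientBoostingClassifier(
    max_depth=3,              # shallow trees as base learners
    learning_rate=0.05,       # shrinkage
    n_estimators=500,         # upper bound; early stopping may use fewer
    subsample=0.8,            # row subsampling as extra regularization
    n_iter_no_change=10,      # early stopping patience
    validation_fraction=0.1,  # internal hold-out used for early stopping
    random_state=42,
)

# Stratified folds keep the default rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["roc_auc", "average_precision", "f1"]
results = cross_validate(model, X, y, cv=cv, scoring=metrics)

mean_scores = {m: results[f"test_{m}"].mean() for m in metrics}
for metric, value in mean_scores.items():
    print(f"{metric:>18}: {value:.3f}")
```

On real loan data, the same structure would apply with the actual feature matrix, and the decision threshold would be chosen from the precision-recall trade-off rather than left at the default of 0.5.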