In [None]:
# 1. What is Ensemble Learning in machine learning? Explain the key idea behind it.
"""
Ensemble Learning is a technique that combines the predictions of multiple machine learning models to improve overall accuracy and robustness.
The key idea is that models which make different errors can partly cancel out each other's mistakes, so their combined output often performs better than any single model.

Types:
    -Bagging
    -Boosting
    -Stacking
    -Voting
"""

In [None]:
# 2. What is the difference between Bagging and Boosting?
"""
Bagging trains multiple models independently and in parallel on different random subsets of the data created through sampling
with replacement. Its main goal is to reduce variance and prevent overfitting. In contrast, Boosting trains models sequentially,
where each new model focuses on correcting the errors made by the previous ones. Boosting aims to reduce bias and improve 
overall accuracy. Common examples of Bagging include Random Forest, while AdaBoost, Gradient Boosting, and XGBoost are popular 
Boosting methods.
"""

In [None]:
# 3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
"""
Bootstrap sampling is a statistical technique where multiple random samples are drawn from the original dataset with 
replacement. This means the same data point can appear more than once in a sample, while some points might not appear at all.

In Bagging methods like Random Forest, bootstrap sampling is used to create different training subsets for each individual 
model (e.g., each decision tree). This ensures that every model learns from a slightly different version of the data, 
introducing diversity among the models. The combined predictions from these diverse models help to reduce variance, prevent 
overfitting, and improve overall model stability and accuracy.
"""

In [None]:
# 4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?
"""
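Out-of-Bag (OOB) samples are the data points that are not included in a particular bootstrap sample when training a model in 
Bagging methods like Random Forest. Since each tree in the ensemble is trained on a bootstrap sample (drawn with replacement), 
roughly one-third of the data (about 36.8%) is left out and serves as OOB samples for that tree.

The OOB score is computed by predicting each data point using only the trees that did not see it during training, aggregating 
those predictions, and comparing them with the true labels. Because every prediction comes from trees that never saw the point, 
the OOB score acts as a built-in validation estimate of generalization performance, without a separate hold-out set or 
cross-validation.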
"""

In [None]:
# 5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.
"""
In a single Decision Tree, feature importance is determined by how much each feature contributes to reducing impurity
(such as Gini impurity or entropy) during the splits. The higher the reduction in impurity caused by a feature across all its 
splits, the more important that feature is considered. However, since a single tree can be sensitive to small changes in data, 
its feature importance scores may not be very stable or reliable.

In contrast, a Random Forest calculates feature importance by averaging the importance scores of each feature across all the 
trees in the ensemble. This aggregation makes the feature importance values more robust, stable, and reliable, as it reduces 
the effect of randomness and overfitting that might occur in a single tree.
"""

In [1]:
# 6. Write a Python program to:
# ● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
# ● Train a Random Forest Classifier
# ● Print the top 5 most important features based on feature importance scores.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

importances = rf.feature_importances_

feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print("Top 5 Most Important Features:")
print(feature_importance_df.head(5))


Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


In [2]:
# 7. Write a Python program to:
# ● Train a Bagging Classifier using Decision Trees on the Iris dataset
# ● Evaluate its accuracy and compare with a single Decision Tree

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
dt_acc = accuracy_score(y_test, y_pred_dt)

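# The base tree is passed positionally because the keyword was renamed
# from `base_estimator` to `estimator` in scikit-learn 1.2.
bag = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=50,
                        random_state=42)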
bag.fit(X_train, y_train)
y_pred_bag = bag.predict(X_test)
bag_acc = accuracy_score(y_test, y_pred_bag)

print("Decision Tree Accuracy:", dt_acc)
print("Bagging Classifier Accuracy:", bag_acc)


Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0


In [3]:
# 8. Write a Python program to:
# ● Train a Random Forest Classifier
# ● Tune hyperparameters max_depth and n_estimators using GridSearchCV
# ● Print the best parameters and final accuracy

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10, 15]
}

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print("Final Accuracy:", accuracy)


Best Parameters: {'max_depth': 5, 'n_estimators': 150}
Final Accuracy: 0.9707602339181286


In [4]:
# 9. Write a Python program to:
# ● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
# ● Compare their Mean Squared Errors (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

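# The base tree is passed positionally because the keyword was renamed
# from `base_estimator` to `estimator` in scikit-learn 1.2.
bagging = BaggingRegressor(DecisionTreeRegressor(),
                           n_estimators=50,
                           random_state=42)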
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print("Bagging Regressor MSE:", mse_bag)
print("Random Forest Regressor MSE:", mse_rf)


Bagging Regressor MSE: 0.2579153056796594
Random Forest Regressor MSE: 0.25638991335459355


In [None]:
# 10. You are working as a data scientist at a financial institution to predict loan
# default. You have access to customer demographic and transaction history data.
# You decide to use ensemble techniques to increase model performance.
# Explain your step-by-step approach to:
# ● Choose between Bagging or Boosting
# ● Handle overfitting
# ● Select base models
# ● Evaluate performance using cross-validation
# ● Justify how ensemble learning improves decision-making in this real-world
# context.
"""
1. Choose between Bagging or Boosting:
    -Use Bagging (e.g., Random Forest) if base models have high variance and you want to reduce overfitting.
    -Use Boosting (e.g., XGBoost, LightGBM) if base models have high bias and you want to improve accuracy by correcting errors sequentially.

2. Handle Overfitting:
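    -Use cross-validation, early stopping, and regularization (e.g., depth limits, a small learning rate).
    -Apply feature selection and pruning to limit overfitting; handle class imbalance (defaults are usually rare) with class weights or resampling applied only to the training folds.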

3. Select Base Models:
    -Common base models: Decision Trees (shallow trees or stumps as weak learners for Boosting, deeper trees for Bagging) or Logistic Regression.
    -Use diverse models in stacking for better generalization.

4. Evaluate Performance (Cross-Validation):
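    -Use Stratified K-Fold CV so that every fold preserves the default/non-default class ratio.
    -Metrics: ROC-AUC, Precision-Recall AUC, F1-score, and probability calibration; plain accuracy can be misleading when defaults are rare.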

5. Ensemble Learning Justification:
    -Combines multiple models to reduce variance and bias, giving more accurate, stable, and reliable predictions.
    -In loan default prediction, this improves risk assessment, reduces losses, and supports better credit decisions.
"""