Q1. What is Ensemble Learning in machine learning? Explain the key idea behind it.
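-->Ensemble Learning is a technique in machine learning where multiple models (called base learners; in boosting they are often weak learners) are trained and their predictions are combined into a single prediction.
Key idea: Instead of relying on a single model, we combine the outputs of several models so that their individual errors partly cancel out. This lowers variance and/or bias, which usually improves accuracy and makes the predictions more robust.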
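
The key idea can be illustrated with a small simulation. This is only a sketch under an idealized assumption that the answer above does not state: 11 classifiers that are each correct 70% of the time and whose errors are independent. Real base learners are correlated, so the gain in practice is smaller. The function name and all numbers are illustrative.

```python
import random


def simulate_majority_vote(n_models=11, p_correct=0.7, n_trials=10_000, seed=0):
    """Compare one classifier with a majority vote over n_models classifiers.

    Assumption: each classifier is right with probability p_correct,
    independently of the others.
    """
    rng = random.Random(seed)
    single_hits = 0
    ensemble_hits = 0
    for _ in range(n_trials):
        # True means "this model predicted the correct label" on this trial.
        votes = [rng.random() < p_correct for _ in range(n_models)]
        single_hits += votes[0]                      # first model on its own
        ensemble_hits += sum(votes) > n_models / 2   # strict majority is correct
    return single_hits / n_trials, ensemble_hits / n_trials


single_acc, ensemble_acc = simulate_majority_vote()
print(f"Single model accuracy:  {single_acc:.3f}")
print(f"Majority vote accuracy: {ensemble_acc:.3f}")
```

Under these assumptions the majority vote is correct noticeably more often than any single model, which is the intuition behind voting and averaging in ensembles.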

---
Q2. Difference between Bagging and Boosting
-->Bagging (Bootstrap Aggregating) and Boosting are both ensemble learning techniques, but they work in different ways.

Bagging: Trains multiple models independently (and therefore in parallel) on different bootstrap samples of the training data, then combines their predictions by voting (classification) or averaging (regression). This mainly reduces variance and limits overfitting, which is why it is used to stabilize unstable models such as decision trees.

Boosting: Trains models sequentially, with each new model focusing on the errors made by the previous ones. This mainly reduces bias and can improve accuracy, but it is more prone to overfitting if not properly regularized. The aim is to turn many weak learners into one strong learner.
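
The contrast can be sketched with scikit-learn. The example below is an illustration, not a benchmark: the dataset (breast cancer), the number of estimators, and the use of `GradientBoostingClassifier` as the boosting method are all choices made here for demonstration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: 50 fully grown trees, each fit independently on its own bootstrap sample.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)

# Boosting: 100 depth-1 trees (stumps) fit one after another; each new tree
# is trained to correct the errors of the ensemble built so far.
boosting = GradientBoostingClassifier(
    n_estimators=100, max_depth=1, learning_rate=0.1, random_state=42
)

# 5-fold cross-validated accuracy for each ensemble.
bagging_acc = cross_val_score(bagging, X, y, cv=5).mean()
boosting_acc = cross_val_score(boosting, X, y, cv=5).mean()

print(f"Bagging (deep trees) CV accuracy: {bagging_acc:.4f}")
print(f"Boosting (stumps) CV accuracy:    {boosting_acc:.4f}")
```

Note the different base learners: bagging starts from high-variance deep trees and averages them, while boosting starts from high-bias stumps and adds them up.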

---
Q3. Bootstrap sampling and its role in Bagging

-->Bootstrap sampling is the process of creating multiple datasets by sampling with replacement from the original dataset, so some rows appear more than once in a sample and others not at all.
Role in Bagging: It ensures each model gets slightly different training data, which makes the models diverse; combining diverse models reduces variance and overfitting.
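
A minimal sketch of drawing one bootstrap sample with NumPy. The array of row indices is a stand-in for a real training set, and the sample size and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 10_000
data = np.arange(n_samples)  # stand-in for the rows of a training set

# One bootstrap sample: n_samples draws WITH replacement from the original rows.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
bootstrap_sample = data[bootstrap_idx]

# Because of replacement, some rows repeat and others never appear.
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print(f"Sample size: {bootstrap_sample.size}")
print(f"Fraction of distinct original rows in the sample: {unique_fraction:.3f}")
```

For large datasets roughly 63% of the original rows end up in each bootstrap sample, so every model in a bagged ensemble sees a different subset.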

---

Q4. Out-of-Bag (OOB) samples and OOB score
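-->OOB samples: Data points not included in the bootstrap sample for a given model. On average, about 37% of the training points are left out of each bootstrap sample.

OOB score: Estimates model accuracy by predicting each training point using only the models that did not see it during training, so no separate validation set is needed.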
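
In scikit-learn the OOB score is available by passing `oob_score=True` to a forest and reading `oob_score_` after fitting. The sketch below compares it with accuracy on a held-out test set; the dataset, split ratio, and number of trees are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# oob_score=True makes the forest score each training point using only
# the trees whose bootstrap sample did not contain that point.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X_train, y_train)

oob_acc = rf.oob_score_               # estimated from training data alone
test_acc = rf.score(X_test, y_test)   # measured on the held-out split

print(f"OOB accuracy:  {oob_acc:.4f}")
print(f"Test accuracy: {test_acc:.4f}")
```

The two numbers are usually close, which is why the OOB score is often used as a cheap substitute for a validation set.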

---
Q5. Feature importance in Decision Tree vs Random Forest

-->Decision Tree: Importance is based on how much each feature reduces impurity across all splits in the tree.

Random Forest: Importance is averaged across all trees, giving a more stable measure that depends less on the quirks of any single tree.
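
The difference can be checked directly in scikit-learn. This sketch fits one tree and one forest on the breast cancer dataset (an illustrative choice), then recomputes the forest's importances by averaging the per-tree values by hand.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=42).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Average the impurity-based importances of the individual trees,
# then renormalize so the values sum to 1.
manual_avg = np.mean([t.feature_importances_ for t in forest.estimators_], axis=0)
manual_avg /= manual_avg.sum()

# A single tree assigns zero importance to every feature it never splits on.
tree_used = int(np.count_nonzero(tree.feature_importances_))
forest_used = int(np.count_nonzero(forest.feature_importances_))

print("Manual average matches forest:", np.allclose(manual_avg, forest.feature_importances_))
print(f"Features with non-zero importance: tree={tree_used}, forest={forest_used}")
```

A single tree concentrates its importance on the few features it happens to split on, while the forest spreads importance over more features because each tree sees different data and feature subsets.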

---
Q10. Step-by-step approach – Loan default prediction with ensemble learning

-->Choose between Bagging and Boosting:

Start with Boosting (e.g., XGBoost) if the dataset is complex and has many features; Boosting reduces bias more effectively than Bagging.

Handle overfitting:

Use cross-validation, tune the learning rate, and limit the maximum tree depth.

Select base models:

Decision Trees for interpretability; Gradient Boosted Trees for performance.

Evaluate performance:

Use Stratified K-Fold cross-validation and metrics such as ROC-AUC; defaults are usually the minority class, so plain accuracy can be misleading.

Justification:

An ensemble combines multiple weak learners to capture patterns that a single model misses, which improves loan default prediction accuracy.
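
The steps above can be sketched end to end. Everything here is an assumption made for illustration: the data is synthetic (`make_classification` with about 10% positives standing in for defaults), scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost to keep the example dependency-free, and the hyperparameter values are placeholders rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in for loan data: class 1 ("default") is about 10% of rows.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)

# Boosting with overfitting controls: small learning rate, shallow trees,
# and row subsampling for each tree.
model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.05, max_depth=3,
    subsample=0.8, random_state=42,
)

# Stratified folds keep the default rate roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("ROC-AUC per fold:", auc_scores.round(4))
print(f"Mean ROC-AUC: {auc_scores.mean():.4f}")
```

On real loan data the same skeleton applies after preprocessing; the learning rate and depth would then be tuned, for example with `GridSearchCV` as in Q8.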

In [1]:
#Q6. Python Program – Random Forest top 5 important features
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Train Random Forest
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Get top 5 features
importances = model.feature_importances_
feature_names = data.feature_names
top_indices = importances.argsort()[::-1][:5]

for i in top_indices:
    print(f"{feature_names[i]}: {importances[i]:.4f}")


worst area: 0.1394
worst concave points: 0.1322
mean concave points: 0.1070
worst radius: 0.0828
worst perimeter: 0.0808


In [7]:
#Q7. Python Program – Bagging Classifier vs Decision Tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))

# Bagging Classifier (`estimator` replaces the older `base_estimator` parameter, scikit-learn >= 1.2)
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
bag_acc = accuracy_score(y_test, bagging.predict(X_test))

print(f"Decision Tree Accuracy: {dt_acc:.4f}")
print(f"Bagging Accuracy: {bag_acc:.4f}")


Decision Tree Accuracy: 1.0000
Bagging Accuracy: 1.0000


In [4]:
#Q8. Python Program – Random Forest with GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Grid Search
params = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10]
}
rf = RandomForestClassifier(random_state=42)
grid = GridSearchCV(rf, params, cv=5)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Accuracy:", grid.score(X_test, y_test))


Best Parameters: {'max_depth': None, 'n_estimators': 100}
Accuracy: 1.0


In [5]:
#Q9. Python Program – Bagging Regressor vs Random Forest Regressor
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging Regressor
bag_reg = BaggingRegressor(n_estimators=50, random_state=42)
bag_reg.fit(X_train, y_train)
bag_mse = mean_squared_error(y_test, bag_reg.predict(X_test))

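# Random Forest Regressor
# Note: by default RandomForestRegressor considers all features at every split,
# so with default settings it behaves much like the bagged decision trees above.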
rf_reg = RandomForestRegressor(n_estimators=50, random_state=42)
rf_reg.fit(X_train, y_train)
rf_mse = mean_squared_error(y_test, rf_reg.predict(X_test))

print(f"Bagging MSE: {bag_mse:.4f}")
print(f"Random Forest MSE: {rf_mse:.4f}")


Bagging MSE: 0.2573
Random Forest MSE: 0.2573
