# Questions

**Question 1:  What is Ensemble Learning in machine learning? Explain the key idea behind it.**
  - It is a technique where multiple models are trained and combined to solve the same problem.
  - The key idea is that a group of models working together can often perform better than any individual model on its own.
  - The predictions of the individual models are combined (for example, by voting or averaging) to produce a more accurate final prediction.
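
The combination step described above can be sketched with a simple majority vote. This is a minimal illustration, not a full ensemble: the three "models" are hypothetical hard-coded prediction arrays, and `majority_vote` is a helper name introduced here for the example.

```python
import numpy as np

def majority_vote(predictions):
    """Combine integer class labels from several models by majority vote.

    predictions: array-like of shape (n_models, n_samples).
    """
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # votes[c, i] = number of models that predicted class c for sample i.
    votes = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    return votes.argmax(axis=0)

# Hypothetical predictions: each model is wrong on a different sample.
y_true = np.array([0, 1, 1, 0, 1])
model_preds = np.array([
    [0, 1, 1, 0, 0],  # model A: wrong on the last sample
    [0, 0, 1, 0, 1],  # model B: wrong on the second sample
    [1, 1, 1, 0, 1],  # model C: wrong on the first sample
])

combined = majority_vote(model_preds)
individual_acc = (model_preds == y_true).mean(axis=1)
combined_acc = (combined == y_true).mean()

print("Individual accuracies:", individual_acc)
print("Combined accuracy:", combined_acc)
```

Because the three models make their mistakes on different samples, the vote outnumbers each error. The benefit shrinks when the models tend to make the same mistakes.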

**Question 2: What is the difference between Bagging and Boosting?**
  - Bagging:-
    - It creates different training datasets by random sampling with replacement.
    - All models are trained independently, so they can be trained in parallel.
    - It mainly reduces the variance of the model.
  - Boosting:-
    - Each new model is trained with more focus on the samples that the previous models misclassified.
    - Models are trained sequentially, one after another.
    - It mainly reduces bias and can also reduce variance.

**Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**
  - It is a statistical resampling technique used to create multiple new datasets from the original training data.
  - It randomly selects N samples with replacement from a dataset of N samples.
  - As a result, some samples appear multiple times in a bootstrap sample and others do not appear at all.
  - In Bagging, each bootstrap sample is used to train a different model, which makes the models differ from one another.
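
A single bootstrap draw can be sketched with NumPy. For large N, a bootstrap sample is expected to contain roughly 63.2% of the distinct original samples (1 - 1/e); the dataset size of 1000 and the seed below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples = 1000
indices = np.arange(n_samples)

# One bootstrap sample: draw N indices with replacement.
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

# Distinct original samples that made it into the bootstrap sample.
unique_idx = np.unique(bootstrap_idx)
unique_fraction = unique_idx.size / n_samples

# Samples that were never drawn.
left_out_idx = np.setdiff1d(indices, bootstrap_idx)

print(f"Fraction of distinct samples drawn: {unique_fraction:.3f}")
print(f"Samples left out: {left_out_idx.size}")
```

In Bagging, this draw is repeated once per model, so every model sees a slightly different version of the training data.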

**Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**
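  - Out-of-Bag (OOB) samples are the training samples that are not included in a given bootstrap sample.
  - The OOB score is computed by predicting each training sample using only the models that did not see it during training. This evaluates ensemble models like Random Forest without needing a separate validation or test set.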
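
In scikit-learn, the OOB estimate is available by passing `oob_score=True` to the forest. The sketch below assumes the Breast Cancer dataset and 100 trees, both chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each training sample using only
# the trees whose bootstrap sample did not contain it.
oob_model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=2)
oob_model.fit(X, y)

print(f"OOB accuracy estimate: {oob_model.oob_score_:.4f}")
```

The model is fit on the full dataset, and no rows are held out explicitly. With very few trees, some samples may never be out-of-bag, so the estimate is more reliable with a reasonably large `n_estimators`.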

**Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**
  -

In [1]:
#Question 6: Write a Python program to:
#● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
#● Train a Random Forest Classifier
#● Print the top 5 most important features based on feature importance scores.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

model = RandomForestClassifier(n_estimators=100, random_state=2)
model.fit(X, y)

importances = pd.DataFrame({'Feature': X.columns, 'Importance': model.feature_importances_})
print(importances.sort_values(by='Importance', ascending=False).head(5))

                 Feature  Importance
20          worst radius    0.154410
23            worst area    0.128149
27  worst concave points    0.118846
7    mean concave points    0.097132
22       worst perimeter    0.078753


In [3]:
#Question 7: Write a Python program to:
#● Train a Bagging Classifier using Decision Trees on the Iris dataset
#● Evaluate its accuracy and compare with a single Decision Tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

tree = DecisionTreeClassifier(random_state=2)
tree.fit(X_train, y_train)
tree_pred = tree.predict(X_test)
tree_acc = accuracy_score(y_test, tree_pred)

bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=2
)
bag_model.fit(X_train, y_train)
bag_pred = bag_model.predict(X_test)
bag_acc = accuracy_score(y_test, bag_pred)

print(f"Decision Tree Accuracy: {tree_acc:.4f}")
print(f"Bagging Classifier Accuracy: {bag_acc:.4f}")


Decision Tree Accuracy: 0.9556
Bagging Classifier Accuracy: 0.9778


In [4]:
#Question 8: Write a Python program to:
#● Train a Random Forest Classifier
#● Tune hyperparameters max_depth and n_estimators using GridSearchCV
#● Print the best parameters and final accuracy
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score  # needed below; otherwise relies on an earlier cell

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

rf = RandomForestClassifier(random_state=2)

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10, 15]
}

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print(f"Final Accuracy: {accuracy:.4f}")


Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 0.9778


In [5]:
#Question 9: Write a Python program to:
#● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
#● Compare their Mean Squared Errors (MSE)
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

bag_model = BaggingRegressor(random_state=2)
bag_model.fit(X_train, y_train)
bag_pred = bag_model.predict(X_test)
bag_mse = mean_squared_error(y_test, bag_pred)

rf_model = RandomForestRegressor(random_state=2)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

print(f"Bagging Regressor MSE: {bag_mse:.4f}")
print(f"Random Forest Regressor MSE: {rf_mse:.4f}")


Bagging Regressor MSE: 0.2979
Random Forest Regressor MSE: 0.2635


**Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance.**
  - Choose between Bagging or Boosting:-
    - Bagging reduces variance by averaging predictions from multiple independent models.
    - Bagging is suitable if the dataset is large and diverse.
    - Boosting reduces bias and often improves accuracy by sequentially correcting errors from previous models.
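
One hedged way to make this choice empirically is to cross-validate one method of each kind on the same data. The sketch below uses a synthetic, imbalanced dataset from `make_classification` as a stand-in for loan data (about 10% positives, playing the role of defaults). The dataset, the two model choices, and ROC AUC as the metric are all assumptions for illustration, not the institution's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for loan data: roughly 10% of rows are the "default" class.
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=8,
    weights=[0.9, 0.1],
    random_state=2,
)

candidates = {
    "Bagging (Random Forest)": RandomForestClassifier(n_estimators=100, random_state=2),
    "Boosting (Gradient Boosting)": GradientBoostingClassifier(n_estimators=100, random_state=2),
}

auc_scores = {}
for name, model in candidates.items():
    # ROC AUC is less misleading than accuracy when one class is rare.
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    auc_scores[name] = scores.mean()
    print(f"{name}: mean ROC AUC = {scores.mean():.4f}")
```

Which method scores higher depends on the data, so the result on synthetic data says nothing about real loan records. The point is only the comparison procedure.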