# Theoretical Questions

1. Can we use Bagging for regression problems?
- Yes, Bagging works for both classification and regression.
- It builds multiple regression models on bootstrap samples.
- Final prediction is the average of all model outputs.
- Helps reduce variance and improves stability.

2. What is the difference between multiple model training and single model training?
- Single model training relies on one model, so that model's variance and overfitting risk carry straight into the predictions.
- Multiple model training fits many models and combines their predictions for better stability.
- Ensemble reduces errors compared to relying on one weak/unstable model.
- More computationally expensive than training one model.

3. Explain the concept of feature randomness in Random Forest
- At each split, only a random subset of features is considered.
- Helps reduce correlation between trees.
- Increases model diversity and improves generalization.
- Prevents a few strong features from dominating all trees.
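
One way to see this effect is to count how many different features end up at the root split across the trees of a forest. The sketch below is illustrative only: the dataset, the tree count, and the names `root_features`, `forest_sqrt`, and `forest_all` are assumptions made for the example. It compares `max_features="sqrt"` (a random subset per split) with `max_features=None` (every feature available at every split).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)


def root_features(forest):
    """Return the set of feature indices used at the root split of each tree."""
    return {int(tree.tree_.feature[0]) for tree in forest.estimators_}


# Random subset of features per split (feature randomness switched on).
forest_sqrt = RandomForestClassifier(
    n_estimators=50, max_features="sqrt", random_state=0
).fit(X, y)

# All features considered at every split (bootstrap sampling only).
forest_all = RandomForestClassifier(
    n_estimators=50, max_features=None, random_state=0
).fit(X, y)

distinct_sqrt = len(root_features(forest_sqrt))
distinct_all = len(root_features(forest_all))
print("Distinct root features with max_features='sqrt':", distinct_sqrt)
print("Distinct root features with max_features=None:", distinct_all)
```

A larger first count suggests that the random subsets spread the top splits over more features, which is what keeps a few strong features from shaping every tree.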

4. What is OOB (Out-of-Bag) Score?
- Evaluation metric using samples not included in bootstrap training sets.
- Acts like built-in cross-validation for Random Forest/Bagging.
- Measures model accuracy using unused samples per tree.
- Saves time by avoiding separate validation sets.

5. How can you measure the importance of features in a Random Forest model?
- Calculate decrease in impurity (Gini/Entropy/MSE) when a feature splits data.
- Higher impurity reduction → more important feature.
- Can also use permutation importance by shuffling feature values.
- Importance scores reflect contribution across all trees.
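
Both kinds of importance can be computed side by side. The following is a minimal sketch, assuming the Breast Cancer dataset and a small forest; the variable names (`mdi`, `perm`) and the choice of `n_repeats=5` are illustrative. It uses `feature_importances_` for the impurity-based scores and `sklearn.inspection.permutation_importance` for the shuffling-based scores.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0, stratify=data.target
)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Impurity-based importance: computed from the training data, sums to 1.
mdi = rf.feature_importances_

# Permutation importance: drop in test score when one feature is shuffled.
perm = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=0)

# Show the five features with the largest mean permutation importance.
top = perm.importances_mean.argsort()[::-1][:5]
for i in top:
    print(
        f"{data.feature_names[i]}: "
        f"impurity={mdi[i]:.4f}, permutation={perm.importances_mean[i]:.4f}"
    )
```

The two rankings need not agree: impurity-based scores come from the training data, while permutation scores here are measured on held-out data.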

6. Explain the working principle of a Bagging Classifier
- Creates multiple bootstrap samples from the training data.
- Trains a separate classifier on each sample.
- Aggregates predictions using majority voting (or by averaging predicted class probabilities).
- Reduces variance and improves stability.
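
The steps above can be sketched by hand with NumPy and a plain decision tree. This is an illustrative version under stated assumptions (Iris data, 25 trees, hard majority voting, hypothetical variable names), not how any particular library implements it.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

rng = np.random.default_rng(0)
n_models = 25
n_classes = len(np.unique(y))
all_votes = []

for _ in range(n_models):
    # Step 1: draw a bootstrap sample (same size, with replacement).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train a separate classifier on that sample.
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    all_votes.append(tree.predict(X_test))

# Step 3: majority vote across models for each test sample.
votes = np.stack(all_votes)  # shape: (n_models, n_test_samples)
y_pred = np.array(
    [np.bincount(column, minlength=n_classes).argmax() for column in votes.T]
)

accuracy = (y_pred == y_test).mean()
print("Manual bagging accuracy:", accuracy)
```

scikit-learn's `BaggingClassifier` averages predicted class probabilities when the base estimator provides `predict_proba` and falls back to voting otherwise, so its output can differ slightly from this hard-voting sketch.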

7. How do you evaluate a Bagging Classifier’s performance?
- Use accuracy, precision, recall, F1-score on test data.
- Check OOB score if available.
- Use cross-validation for robust evaluation.
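
A minimal sketch that combines the three checks, assuming the Breast Cancer dataset and a 50-tree ensemble (both illustrative choices). The base estimator is passed positionally so the example does not depend on the keyword name, which differs between scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=50,
    oob_score=True,  # keep the out-of-bag estimate computed during fit
    random_state=0,
)
bag.fit(X_train, y_train)

# 1. Precision, recall and F1 per class, plus accuracy, on the held-out test set.
print(classification_report(y_test, bag.predict(X_test)))

# 2. Out-of-bag accuracy estimated from the training data alone.
print("OOB score:", bag.oob_score_)

# 3. 5-fold cross-validated F1 on the training data.
cv_scores = cross_val_score(bag, X_train, y_train, cv=5, scoring="f1")
print("CV F1: %.4f +/- %.4f" % (cv_scores.mean(), cv_scores.std()))
```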

8. How does a Bagging Regressor work?
- Trains multiple regression models on different bootstrap samples.
- Each model makes a prediction.
- Final output is the mean of all predictions.
- Reduces prediction variance.

9. What is the main advantage of ensemble techniques?
- Improves accuracy and robustness.
- Reduces variance, bias, or both depending on technique.
- More stable predictions than single models.
- Works well with weak learners.

10. What is the main challenge of ensemble methods?
- Higher computational cost and memory usage.
- Harder to interpret compared to single models.
- Training many models increases complexity.
- Risk of overfitting if not tuned properly.

11. Explain the key idea behind ensemble techniques
- Combining multiple models improves performance.
- Each model contributes differently to the final prediction.
- Diversity among models leads to error reduction.
- A group of weak models can form a strong learner.

12. What is a Random Forest Classifier?
- An ensemble of many decision trees using Bagging + feature randomness.
- Trees vote to make the final class prediction.
- Reduces overfitting compared to a single tree.
- Works well for large and high-dimensional datasets.

13. What are the main types of ensemble techniques?
- Bagging, Boosting, Stacking, and Voting/Blending methods.
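
Voting and Stacking can be sketched with scikit-learn's `VotingClassifier` and `StackingClassifier`. The example below is only an illustration: the three base models, the Iris dataset, and the logistic-regression meta-learner are assumptions chosen to keep it short.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Three different model families used as base learners.
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("logreg", LogisticRegression(max_iter=1000)),
]

# Voting: each base model casts one vote per sample.
voting = VotingClassifier(estimators=base_models, voting="hard")
voting.fit(X_train, y_train)

# Stacking: a meta-learner is trained on the base models' cross-validated outputs.
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stacking.fit(X_train, y_train)

acc_voting = voting.score(X_test, y_test)
acc_stacking = stacking.score(X_test, y_test)
print("Voting accuracy:", acc_voting)
print("Stacking accuracy:", acc_stacking)
```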

14. What is ensemble learning in machine learning?
- Technique that combines predictions of multiple models.
- Helps improve accuracy, stability, and generalization.
- Works by averaging, voting, or meta-learning.
- Reduces weaknesses of individual models.

15. When should we avoid using ensemble methods?
- When interpretability is crucial.
- When the dataset is very small.
- When computational resources are limited.
- When a simple model already performs well.

16. How does Bagging help in reducing overfitting?
- Uses multiple versions of the dataset (bootstrap samples).
- Each model is fit to a slightly different sample, so each picks up different quirks of the data.
- Averaging predictions reduces variance.
- Weakens the impact of noisy samples.
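
The variance argument can be checked with a small simulation. The sketch below makes a strong simplifying assumption: each model's prediction is the true value plus independent Gaussian noise. Bagged models are trained on overlapping samples and are therefore correlated, so the real reduction is smaller than the ideal factor of 1/n shown here. All numbers and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 10.0
noise_std = 2.0      # each model's prediction error has variance 4.0
n_models = 25
n_trials = 10_000

# Each row is one trial; each column is one model's noisy prediction.
predictions = true_value + rng.normal(0.0, noise_std, size=(n_trials, n_models))

# Variance of a single model's prediction across trials.
single_var = predictions[:, 0].var()

# Variance of the ensemble average across trials.
avg_var = predictions.mean(axis=1).var()

print("Variance of one model:", round(single_var, 3))
print("Variance of the average of", n_models, "models:", round(avg_var, 3))
print("Ideal value (noise_std**2 / n_models):", noise_std**2 / n_models)
```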

17. Why is Random Forest better than a single Decision Tree?
- Reduces variance dramatically → less overfitting.
- Uses feature randomness, which improves generalization.
- More robust to noise and outliers.
- Produces higher accuracy in most cases.

18. What is the role of bootstrap sampling in Bagging?
- Creates diverse training sets by sampling with replacement.
- Ensures each model sees a different subset of data.
- Increases model diversity → reduces prediction variance.
- Allows OOB evaluation using unused samples.
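
A short NumPy sketch of one bootstrap draw, showing which samples land in the bag and which are left out. The sample size and seed are arbitrary assumptions. For large n, the expected out-of-bag share is (1 - 1/n)^n, which approaches 1/e (about 0.368).

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 1000

# Bootstrap sample: draw n_samples indices with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Samples drawn at least once are "in the bag".
in_bag = np.unique(bootstrap_idx)

# Everything that was never drawn is "out of bag" for this model.
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[in_bag] = False

oob_fraction = oob_mask.mean()
print("Unique in-bag samples:", len(in_bag))
print("Out-of-bag samples:", int(oob_mask.sum()))
print("Out-of-bag fraction:", oob_fraction)
print("Theoretical value (1 - 1/n)**n:", (1 - 1 / n_samples) ** n_samples)
```

Each model in the ensemble gets its own draw, so each has a different out-of-bag set that can be used to score it.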

19. What are some real-world applications of ensemble techniques?
- Fraud detection.
- Customer churn prediction.
- Medical diagnosis and risk prediction.
- Recommendation systems and ranking tasks.

20. What is the difference between Bagging and Boosting?
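- Bagging mainly reduces variance; Boosting mainly reduces bias.
- Bagging trains models independently (so they can run in parallel); Boosting trains them sequentially, each one correcting its predecessors.
- Bagging uses bootstrap samples; Boosting adjusts sample weights based on earlier errors.
- Boosting often reaches higher accuracy but is more sensitive to noisy data, so it is more prone to overfitting.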
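
The contrast can be sketched with two scikit-learn ensembles on the same split. The setup is an assumption made for illustration: Bagging over fully grown trees (low bias, high variance) and AdaBoost over depth-1 stumps (high bias, low variance), both with 50 estimators on the Breast Cancer dataset. Base estimators are passed positionally to avoid depending on the keyword name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Bagging: independent deep trees, each trained on its own bootstrap sample.
bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=50,
    random_state=0,
)

# Boosting: shallow stumps trained one after another on reweighted samples.
boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    random_state=0,
)

acc_bagging = bagging.fit(X_train, y_train).score(X_test, y_test)
acc_boosting = boosting.fit(X_train, y_train).score(X_test, y_test)

print("Bagging accuracy:", acc_bagging)
print("AdaBoost accuracy:", acc_boosting)
```

Which of the two scores higher depends on the dataset and tuning; a single split like this one is not enough to rank the methods.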

In [None]:
# 21. Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

base_dt = DecisionTreeClassifier(random_state=42)
bag_clf = BaggingClassifier(
    estimator=base_dt,
    n_estimators=100,
    max_samples=0.8,
    bootstrap=True,
    random_state=42
)

bag_clf.fit(X_train, y_train)

y_pred = bag_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Bagging Classifier Accuracy:", acc)


In [None]:
# 22. Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error (MSE)

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

data = load_diabetes()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

base_dt_reg = DecisionTreeRegressor(random_state=42)
bag_reg = BaggingRegressor(
    estimator=base_dt_reg,
    n_estimators=100,
    max_samples=0.8,
    bootstrap=True,
    random_state=42
)

bag_reg.fit(X_train, y_train)

y_pred = bag_reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Bagging Regressor MSE:", mse)


In [None]:
# 23. Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

rf_clf.fit(X_train, y_train)

importances = rf_clf.feature_importances_

print("Feature Importances:")
for name, imp in sorted(zip(feature_names, importances), key=lambda x: x[1], reverse=True):
    print(f"{name}: {imp:.4f}")


In [None]:
# 24. Train a Random Forest Regressor and compare its performance with a single Decision Tree

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

data = load_diabetes()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt_reg = DecisionTreeRegressor(random_state=42)
dt_reg.fit(X_train, y_train)
y_pred_dt = dt_reg.predict(X_test)
mse_dt = mean_squared_error(y_test, y_pred_dt)

rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print("Decision Tree Regressor MSE:", mse_dt)
print("Random Forest Regressor MSE:", mse_rf)


In [None]:
# 25. Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

rf_clf_oob = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,
    bootstrap=True,
    random_state=42
)

rf_clf_oob.fit(X, y)

print("OOB Score:", rf_clf_oob.oob_score_)


In [None]:
# 26. Train a Bagging Classifier using SVM as a base estimator and print accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

base_svm = SVC(kernel='rbf', gamma='scale', random_state=42)

bag_svm = BaggingClassifier(
    estimator=base_svm,
    n_estimators=50,
    max_samples=0.8,
    bootstrap=True,
    random_state=42
)

bag_svm.fit(X_train, y_train)
y_pred = bag_svm.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print("Bagging (SVM) Accuracy:", acc)


In [None]:
# 27. Train a Random Forest Classifier with different numbers of trees and compare accuracy

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

n_estimators_list = [10, 50, 100, 200]

for n in n_estimators_list:
    rf_clf = RandomForestClassifier(
        n_estimators=n,
        random_state=42
    )
    rf_clf.fit(X_train, y_train)
    y_pred = rf_clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"n_estimators = {n}, Accuracy = {acc:.4f}")


In [None]:
# 28. Train a Bagging Classifier using Logistic Regression as a base estimator and print AUC score

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
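from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features first: lbfgs may fail to converge within max_iter on the raw, unscaled features
base_log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=500, solver='lbfgs')
)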

bag_log_reg = BaggingClassifier(
    estimator=base_log_reg,
    n_estimators=50,
    max_samples=0.8,
    bootstrap=True,
    random_state=42
)

bag_log_reg.fit(X_train, y_train)

y_proba = bag_log_reg.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_proba)
print("Bagging (Logistic Regression) AUC:", auc)


In [None]:
# 29. Train a Random Forest Regressor and analyze feature importance scores

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes()
X, y = data.data, data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf_reg = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)

rf_reg.fit(X_train, y_train)

importances = rf_reg.feature_importances_

print("Random Forest Regressor Feature Importances:")
for name, imp in sorted(zip(feature_names, importances), key=lambda x: x[1], reverse=True):
    print(f"{name}: {imp:.4f}")


In [None]:
# 30. Train an ensemble model using both Bagging and Random Forest and compare accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

base_dt = DecisionTreeClassifier(random_state=42)

bag_clf = BaggingClassifier(
    estimator=base_dt,
    n_estimators=100,
    max_samples=0.8,
    bootstrap=True,
    random_state=42
)
bag_clf.fit(X_train, y_train)
y_pred_bag = bag_clf.predict(X_test)
acc_bag = accuracy_score(y_test, y_pred_bag)

rf_clf = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
acc_rf = accuracy_score(y_test, y_pred_rf)

print("Bagging Classifier Accuracy:", acc_bag)
print("Random Forest Classifier Accuracy:", acc_rf)
