1. Can we use Bagging for regression problems?

- Yes. Bagging can be applied to both classification and regression.
For regression, the predictions of the individual models are averaged instead of being combined by majority vote.
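
The two aggregation rules can be illustrated with a minimal NumPy sketch. The prediction arrays and the `majority_vote` helper are made up for illustration and are not part of any library.

```python
import numpy as np

# Hypothetical predictions from three base models (rows) for a few samples (columns).
class_preds = np.array([
    [0, 1, 1, 2],  # model A
    [0, 1, 2, 2],  # model B
    [1, 1, 2, 2],  # model C
])
reg_preds = np.array([
    [2.0, 3.5, 10.0],  # model A
    [2.5, 3.0, 11.0],  # model B
    [3.0, 4.0, 12.0],  # model C
])

def majority_vote(preds):
    """Return the most frequent label per sample (ties go to the smallest label)."""
    n_classes = int(preds.max()) + 1
    # counts[c, j] = number of models that predicted class c for sample j
    counts = np.stack([(preds == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)

voted = majority_vote(class_preds)   # classification: majority vote per sample
averaged = reg_preds.mean(axis=0)    # regression: mean prediction per sample

print("Voted labels:", voted)
print("Averaged predictions:", averaged)
```

Note that scikit-learn's `BaggingClassifier` averages predicted class probabilities when the base estimator provides `predict_proba` and only falls back to voting otherwise, so hard voting as shown here is a simplification.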

2. What is the difference between multiple model training and single model training?

- Single model training → one model learns patterns (e.g., one Decision Tree).

- Multiple model training (ensemble) → several models are trained, and their results are combined → usually more accurate and robust.

3. Explain the concept of feature randomness in Random Forest.

- Random Forest introduces randomness in two ways:

- Bootstrapping rows (sampling with replacement) for each tree.

- Considering only a random subset of features at each split instead of all features.

- This decorrelates the trees and improves generalization.
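
As a rough illustration, both sources of randomness map onto constructor parameters of scikit-learn's `RandomForestClassifier`. The dataset and the parameter values below are arbitrary choices for the sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)  # 30 features

rf = RandomForestClassifier(
    n_estimators=25,
    bootstrap=True,        # row randomness: each tree is fitted on a bootstrap sample
    max_features="sqrt",   # feature randomness: candidate features drawn at each split
    random_state=0,
)
rf.fit(X, y)

# Each fitted tree records how many features it considers per split.
first_tree = rf.estimators_[0]
print("Features in the dataset:", X.shape[1])
print("Features considered per split:", first_tree.max_features_)
```

With 30 features and `max_features="sqrt"`, each split looks at only 5 randomly chosen features, which is what keeps the trees from all splitting on the same dominant feature.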

4. What is OOB (Out-of-Bag) Score?

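- In Bagging/Random Forest, each bootstrap sample leaves out about 37% (≈ 1/e) of the training rows on average.

- For a given tree, these left-out rows are its Out-of-Bag (OOB) samples.

- Each row is predicted using only the trees that did not see it during training, and the resulting score estimates model performance without needing a separate validation set.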
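
The ~37% figure can be checked with a small simulation. This is a sketch with an arbitrary sample size and seed: a row is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# One bootstrap sample: draw n_samples row indices with replacement.
sample_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that never appear in the bootstrap sample are out-of-bag.
oob_fraction = 1 - np.unique(sample_idx).size / n_samples

# Probability that a given row is missed by all n_samples draws.
theoretical = (1 - 1 / n_samples) ** n_samples

print(f"Simulated OOB fraction:   {oob_fraction:.4f}")
print(f"Theoretical OOB fraction: {theoretical:.4f}")
```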

5. How can you measure the importance of features in a Random Forest model?

- Mean Decrease in Impurity (Gini importance) → measures how much a feature reduces impurity across all trees.

- Permutation importance → measures performance drop when the feature is randomly shuffled.
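
Both measures can be computed side by side. The sketch below assumes the Breast Cancer dataset and arbitrary settings (`n_estimators=50`, `n_repeats=5`); `permutation_importance` comes from `sklearn.inspection`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# Mean Decrease in Impurity: computed from the training data during fitting.
mdi = rf.feature_importances_

# Permutation importance: drop in test accuracy when one feature is shuffled.
perm = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=42)

# Show the five features with the largest permutation importance.
top = perm.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<25} MDI={mdi[i]:.3f}  "
          f"permutation={perm.importances_mean[i]:.3f}")
```

Computing permutation importance on held-out data, as here, reflects what the model relies on for generalization, whereas impurity-based importance is derived from the training data only.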

6. Explain the working principle of a Bagging Classifier.

- Draw bootstrap samples from dataset.

- Train a base model (like Decision Tree) on each sample.

- Aggregate results: majority vote (classification) / average (regression).
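
The three steps above can be sketched by hand, without `BaggingClassifier`. This is a simplified illustration on the Iris dataset with an arbitrary number of trees, not a reproduction of scikit-learn's internals.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rng = np.random.default_rng(42)
n_estimators = 25
models = []
for _ in range(n_estimators):
    # Step 1: draw a bootstrap sample (row indices with replacement).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train one base model on that sample.
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    models.append(tree)

# Step 3: aggregate by majority vote over the trees, one test sample at a time.
all_preds = np.array([m.predict(X_test) for m in models])  # (n_estimators, n_test)
votes = np.array([np.bincount(col, minlength=3).argmax() for col in all_preds.T])

manual_accuracy = (votes == y_test).mean()
print("Manual bagging accuracy:", manual_accuracy)
```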

7. How do you evaluate a Bagging Classifier’s performance?

- Using accuracy, precision, recall, F1-score (for classification).

- Using MSE, RMSE, R² (for regression).

- Optionally with OOB score.

8. How does a Bagging Regressor work?

- Same principle as the Bagging Classifier, but predictions are averaged instead of taking a majority vote.

9. What is the main advantage of ensemble techniques?

- Improved accuracy.

- Reduced variance (stability).

- Better generalization than a single model.

10. What is the main challenge of ensemble methods?

- Computational cost (training multiple models).

- Complexity (harder to interpret).

- Risk of overfitting if not properly tuned (mainly with boosting or overly complex base learners).

11. Explain the key idea behind ensemble techniques.

- “Wisdom of the crowd” → combining multiple weak learners (diverse models) often produces a stronger overall model.

12. What is a Random Forest Classifier?

- An ensemble of Decision Trees trained with bagging + feature randomness.
Final output = majority vote of all trees.

13. What are the main types of ensemble techniques?

- Bagging (e.g., Random Forest).

- Boosting (e.g., AdaBoost, XGBoost, CatBoost, LightGBM).

- Stacking (meta-learning).
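
As a rough side-by-side sketch, the three families can be tried with scikit-learn on one dataset. The choice of base learners, the Breast Cancer dataset, and all hyperparameters are arbitrary assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    # Bagging: independent trees on bootstrap samples, predictions combined.
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=25, random_state=42
    ),
    # Boosting: trees added sequentially, each correcting earlier errors.
    "Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    # Stacking: a meta-model learns how to combine the base models' outputs.
    "Stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
            ("boost", GradientBoostingClassifier(n_estimators=25, random_state=42)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # test-set accuracy
    print(f"{name}: {scores[name]:.4f}")
```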

14. What is ensemble learning in machine learning?

- The process of combining multiple models (weak or strong learners) to achieve better predictive performance.

15. When should we avoid using ensemble methods?

- When interpretability is critical (ensembles are “black-box”).

- When computational resources are limited.

- When a single strong model already performs well.

16. How does Bagging help in reducing overfitting?

- By training models on different bootstrap samples, Bagging reduces variance and prevents a single model from memorizing noise.
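
The variance reduction can be made concrete with a standard result, stated here under simplifying assumptions: the $B$ base models $T_1, \dots, T_B$ are identically distributed, each with variance $\sigma^2$ and pairwise correlation $\rho$ at a given input $x$.

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term shrinks as more models are added, while the first term remains. Lowering the correlation $\rho$ between models, which is what feature randomness in Random Forest aims at, reduces that remaining term.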

17. Why is Random Forest better than a single Decision Tree?

- Decision Trees are prone to overfitting.

- Random Forest averages results across many trees → reduces variance, improves stability & accuracy.

18. What is the role of bootstrap sampling in Bagging?

- Bootstrap sampling ensures diversity among models by providing different subsets of data to each base learner.

19. What are some real-world applications of ensemble techniques?

- Fraud detection (Boosting).

- Medical diagnosis (Random Forest).

- Credit scoring (Bagging/Boosting).

- Recommendation systems.

- Text classification / spam filtering.

Q21) Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy.

In [6]:
!pip install scikit-learn



In [8]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagging Classifier
# `estimator` replaces the deprecated `base_estimator` argument (scikit-learn >= 1.2)
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)

# Predictions
y_pred = bagging.predict(X_test)
print("Bagging Classifier Accuracy:", accuracy_score(y_test, y_pred))


Bagging Classifier Accuracy: 1.0


Q22) Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error

In [10]:
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

# Load dataset
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagging Regressor
bagging_reg = BaggingRegressor(estimator=DecisionTreeRegressor(),
                               n_estimators=50, random_state=42)
bagging_reg.fit(X_train, y_train)

# Predictions
y_pred = bagging_reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

MSE: 2987.0073593984966


Q23) Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores.

In [11]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Feature Importance
importance = pd.DataFrame({'Feature': data.feature_names,
                           'Importance': rf.feature_importances_}).sort_values(by='Importance', ascending=False)

print(importance.head(10))


                 Feature  Importance
7    mean concave points    0.141934
27  worst concave points    0.127136
23            worst area    0.118217
6         mean concavity    0.080557
20          worst radius    0.077975
22       worst perimeter    0.074292
2         mean perimeter    0.060092
3              mean area    0.053810
26       worst concavity    0.041080
0            mean radius    0.032312


Q24) Train a Random Forest Regressor and compare its performance with a single Decision Tree.

In [14]:
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
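
# Note: X_train/y_train are not redefined in this cell. They carry over from an
# earlier cell (the Breast Cancer split in Q23 is the last one shown), in which
# case both regressors are fitted to the 0/1 class label, not a continuous target.

# Random Forest Regressor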
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)

# Decision Tree Regressor
dt_reg = DecisionTreeRegressor(random_state=42)
dt_reg.fit(X_train, y_train)
y_pred_dt = dt_reg.predict(X_test)

print("Random Forest R²:", r2_score(y_test, y_pred_rf))
print("Decision Tree R²:", r2_score(y_test, y_pred_dt))


Random Forest R²: 0.8531093915343916
Decision Tree R²: 0.7486772486772486


Q25) Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier.

In [15]:
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
print("OOB Score:", rf_oob.oob_score_)


OOB Score: 0.9547738693467337


Q26) Train a Bagging Classifier using SVM as a base estimator and print accuracy.

In [17]:
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer dataset and create a train/test split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)


# Bagging with SVM
bag_svm = BaggingClassifier(estimator=SVC(kernel="linear"),
                            n_estimators=10, random_state=42)
bag_svm.fit(X_train, y_train)

y_pred = bag_svm.predict(X_test)
print("Bagging Classifier with SVM Accuracy:", accuracy_score(y_test, y_pred))

Bagging Classifier with SVM Accuracy: 0.9590643274853801


Q27) Train a Random Forest Classifier with different numbers of trees and compare accuracy.

In [18]:
for n in [10, 50, 100, 200]:
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)
    acc = accuracy_score(y_test, rf.predict(X_test))
    print(f"Trees: {n}, Accuracy: {acc:.4f}")


Trees: 10, Accuracy: 0.9649
Trees: 50, Accuracy: 0.9708
Trees: 100, Accuracy: 0.9708
Trees: 200, Accuracy: 0.9708


Q28) Train a Bagging Classifier using Logistic Regression as a base estimator and print AUC score.

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

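# X_train/X_test are not redefined here; this cell relies on a split from an
# earlier cell (the Breast Cancer split in Q26 is the last one shown).
bag_log = BaggingClassifier(estimator=LogisticRegression(max_iter=1000),
                            n_estimators=20, random_state=42)
bag_log.fit(X_train, y_train)

# Probability of the positive class (column 1) is what roc_auc_score expects
y_pred_proba = bag_log.predict_proba(X_test)[:, 1]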
print("Bagging + Logistic Regression AUC:", roc_auc_score(y_test, y_pred_proba))