*1.  Can we use Bagging for regression problems?*

Yes, Bagging (Bootstrap Aggregating) can definitely be used for regression problems.


Bagging involves training multiple models (usually the same type, like decision trees) on different bootstrapped subsets of the training data.

Each model makes its own prediction.

For regression, the final prediction is typically the average of the individual model predictions.
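
The averaging step can be checked directly. The sketch below is illustrative only: it assumes synthetic data from `make_regression` and scikit-learn's `BaggingRegressor` with default settings, and it compares the ensemble's prediction with a hand-computed mean of the individual trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration; any regression dataset works)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The base model is passed as the first positional argument
bagging = BaggingRegressor(DecisionTreeRegressor(), n_estimators=10, random_state=0)
bagging.fit(X, y)

# Average the individual trees' predictions by hand
per_tree = np.array([tree.predict(X[:5]) for tree in bagging.estimators_])
manual_average = per_tree.mean(axis=0)

print("Manual average:     ", manual_average)
print("Ensemble prediction:", bagging.predict(X[:5]))
```

With the default settings (every tree sees all features), the two printed arrays should agree up to floating-point error.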

*2.  What is the difference between multiple model training and single model training?*

The difference between multiple model training and single model training lies in how many models are used to learn from data and make predictions. Here's a clear breakdown:

🔹 Single Model Training
Definition:
You train one model on the training data and use it for prediction.

Example:
Training a single decision tree, linear regression model, or neural network.

Characteristics:

Simple and easy to implement.

Faster training and inference time.

Limited to the strengths and weaknesses of that one model.

Can overfit or underfit depending on the model's complexity.

Use case:

When the data is clean and the problem is well understood.

When interpretability or speed is important.

🔹 Multiple Model Training (Ensemble Learning)
Definition:
You train several models and combine their predictions to make a final decision.

Types:

Bagging (e.g., Random Forest): Reduces variance.

Boosting (e.g., XGBoost, AdaBoost): Reduces bias.

Stacking: Combines multiple different models.

Characteristics:

Often more robust and accurate than any one of its member models, though the gain is not guaranteed.

Reduces overfitting by aggregating diverse predictions.

Increases computational cost (more models to train and maintain).

Use case:

When high accuracy is critical.

When the data is noisy or complex.

In competitions (like Kaggle) and real-world deployments.

*3.  Explain the concept of feature randomness in Random Forest?*

Feature randomness is a key concept that makes Random Forests more powerful than basic bagging with decision trees.

In a Random Forest:

At each split of a decision tree, the algorithm does NOT consider all features.

Instead, it randomly selects a subset of features and finds the best split only among those.

This is called feature randomness, or feature bagging.

If all trees consider all features at every split, they might:

Choose the same dominant features over and over.

End up making similar trees despite different training samples (bootstrapping).

Result in a high correlation among trees, which limits the benefit of averaging.

By introducing feature randomness:

Each tree is forced to consider different patterns.

The ensemble becomes more diverse, which improves generalization.
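
One way to see this effect is to measure how correlated the individual trees' predictions are. The following is a rough sketch, assuming synthetic data and scikit-learn's `max_features` parameter as the control for feature randomness; the helper `mean_tree_correlation` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: 20 features, only 5 of them informative (illustrative assumption)
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def mean_tree_correlation(max_features):
    """Average pairwise correlation between the trees' test-set predictions."""
    forest = RandomForestRegressor(n_estimators=30, max_features=max_features,
                                   random_state=0)
    forest.fit(X_train, y_train)
    preds = np.array([tree.predict(X_test) for tree in forest.estimators_])
    corr = np.corrcoef(preds)                          # 30 x 30 matrix
    off_diagonal = corr[~np.eye(len(corr), dtype=bool)]
    return off_diagonal.mean()

corr_all = mean_tree_correlation(None)     # all features at every split (plain bagging)
corr_subset = mean_tree_correlation(0.3)   # a random 30% of features at every split

print(f"Mean tree correlation, all features:   {corr_all:.3f}")
print(f"Mean tree correlation, feature subset: {corr_subset:.3f}")
```

A lower correlation for the feature-subset forest would indicate more diverse trees; the exact numbers depend on the data and the random seed.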

*4.  What is OOB (Out-of-Bag) Score?*

The OOB (Out-of-Bag) score is an internal validation method used in ensemble methods like Random Forests, which rely on bootstrap sampling.

The OOB score is an estimate of the model's performance calculated using only the out-of-bag samples:

For each training instance, collect predictions from the subset of trees that did not see that instance during training.

Aggregate those predictions (e.g., majority vote for classification or average for regression).

Compare them to the true labels.

Compute accuracy (classification) or R² score (regression) using these OOB predictions.
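
The steps above can be reproduced by hand. This is a minimal sketch, assuming a `BaggingClassifier` over decision trees on synthetic data; it relies on the fitted model's `estimators_samples_` attribute, which lists the row indices each tree was trained on. Note that scikit-learn sums class probabilities rather than counting hard votes when the base model offers `predict_proba`; for fully grown trees the two usually coincide.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                          oob_score=True, random_state=0)
model.fit(X, y)

n_samples = X.shape[0]
proba_sum = np.zeros((n_samples, len(model.classes_)))

for tree, in_bag in zip(model.estimators_, model.estimators_samples_):
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[in_bag] = False                        # rows this tree never saw
    proba_sum[oob_mask] += tree.predict_proba(X[oob_mask])

# Aggregate the out-of-bag predictions and compare with the true labels
oob_pred = model.classes_[proba_sum.argmax(axis=1)]
manual_oob = np.mean(oob_pred == y)

print(f"Manual OOB accuracy: {manual_oob:.4f}")
print(f"model.oob_score_:    {model.oob_score_:.4f}")
```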

*5.  How can you measure the importance of features in a Random Forest model?*

Random Forests naturally provide a way to estimate feature importance, which tells you how valuable each feature is in making predictions.

1. Gini Importance (Mean Decrease in Impurity)
 What is it?
Based on how much each feature reduces impurity (e.g., Gini index or variance) across all trees.

Features used in high-impact splits near the root of trees contribute more.

 In scikit-learn:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

model = RandomForestClassifier()
model.fit(X_train, y_train)

# Get importance scores
importances = model.feature_importances_

# View as a DataFrame
feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)
print(feature_importance_df)
 Pros:
Fast and built-in.

Useful for quick insights.

 Cons:
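Can be biased toward high-cardinality features (those with many distinct values). It is also computed from the training data, so it can overstate features the trees have overfit to.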

 2. Permutation Importance
 What is it?
Measures the drop in model performance when a feature's values are randomly shuffled.

If shuffling a feature significantly harms the model, that feature is important.

 In scikit-learn:
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
perm_df = pd.DataFrame({
    'Feature': X_test.columns,
    'Importance': result.importances_mean
}).sort_values(by='Importance', ascending=False)
print(perm_df)
Pros:
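Model-agnostic, and when computed on held-out data it avoids the high-cardinality bias of impurity-based importance.

Measures the effect on actual predictive performance. Be careful with strongly correlated features: shuffling one of them may barely hurt the model because the others carry the same information, so each can look less important than it is.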

 Cons:
Slower, especially on large datasets.

 3. SHAP Values (SHapley Additive exPlanations)

Uses game theory to explain each feature’s contribution to every single prediction.

Provides local and global interpretability.

⚙️ Example:
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
 Pros:
Very detailed: every prediction gets a per-feature attribution grounded in Shapley values.

Useful for individual predictions and overall model behavior.

 Cons:
More complex to implement.

Slower than basic importance methods.

*6.  Explain the working principle of a Bagging Classifier?*

Bagging stands for Bootstrap Aggregating, and a Bagging Classifier is an ensemble method that improves the stability and accuracy of machine learning algorithms — especially unstable ones like decision trees.

Working of a Bagging Classifier:
1. Bootstrap Sampling (Random Subsets)
From the original training data of size N, B new datasets (called bootstrap samples) are created by random sampling with replacement.

Each subset is typically the same size N, but it may have duplicates and omit some original samples.

2. Train Multiple Base Models
A base classifier (like a Decision Tree, SVM, etc.) is trained on each bootstrap sample independently.

This results in multiple trained models (usually the same type, but trained on different data).

3. Voting for Final Prediction
When predicting, each model gives its own output.

For classification:

Majority Voting is used — the class that gets the most votes is the final prediction.

For regression (Bagging Regressor), the predictions are averaged.
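
The three steps can be written out by hand in a few lines. This is a from-scratch sketch for illustration, not how scikit-learn implements `BaggingClassifier` internally; it assumes the Iris dataset, integer class labels starting at 0, and 25 trees.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

rng = np.random.default_rng(42)
n_estimators = 25          # B in the text
n_train = len(X_train)     # N in the text

# Steps 1 and 2: draw a bootstrap sample and fit one tree on it, B times
trees = []
for _ in range(n_estimators):
    idx = rng.integers(0, n_train, size=n_train)    # sampling with replacement
    tree = DecisionTreeClassifier(random_state=0)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Step 3: collect every tree's vote and keep the most common class per row
votes = np.array([tree.predict(X_test) for tree in trees])   # shape (B, n_test)
n_classes = len(np.unique(y_train))
y_pred = np.array([np.bincount(col, minlength=n_classes).argmax()
                   for col in votes.T])

accuracy = np.mean(y_pred == y_test)
print(f"Hand-rolled bagging accuracy: {accuracy:.4f}")
```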

*7.  How do you evaluate a Bagging Classifier’s performance?*

Evaluating a Bagging Classifier is similar to evaluating any supervised classification model. You assess how well it generalizes to unseen data using metrics based on its predictions.

1. Train-Test Split
Use a separate test set to evaluate how well the model performs on unseen data.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
⚙️ 2. Fit the Bagging Classifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# The keyword is `estimator` in scikit-learn >= 1.2 (it was `base_estimator` before)
model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)
model.fit(X_train, y_train)
📊 3. Prediction and Evaluation Metrics
✅ Accuracy
The percentage of correctly classified instances.

from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
🏷️ Classification Report
Includes precision, recall, F1-score for each class.

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
📉 Confusion Matrix
Gives a detailed view of correct vs incorrect predictions.

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
📈 ROC AUC Score (for binary classifiers)
Useful for evaluating probabilistic predictions and how well the classifier separates classes.

from sklearn.metrics import roc_auc_score

y_proba = model.predict_proba(X_test)[:, 1]
print("ROC AUC Score:", roc_auc_score(y_test, y_proba))
🔁 4. Cross-Validation (Recommended)
To get a more reliable performance estimate, especially on small datasets.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print("Cross-validated accuracy:", scores.mean())
🔍 5. OOB Score (Out-of-Bag Score)
If you're using a BaggingClassifier with bootstrap=True, you can set oob_score=True to get a built-in performance estimate.

model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # `base_estimator` before scikit-learn 1.2
    n_estimators=100,
    oob_score=True,
    random_state=42
)
model.fit(X_train, y_train)
print("OOB Score:", model.oob_score_)


*8.  How does a Bagging Regressor work?*

A Bagging Regressor is an ensemble method that improves the performance and stability of regression models by combining predictions from multiple models trained on random subsets of the data.

It follows the same Bagging (Bootstrap Aggregating) principle used in classification but adapted for regression tasks.

Step-by-Step Working of Bagging Regressor
1. Bootstrap Sampling
Create multiple random subsets of the training data using sampling with replacement.

Each subset is the same size as the original dataset (typically) but includes duplicates and omits some data points.

2. Train Base Regressors
Train a separate base regressor (commonly a DecisionTreeRegressor) on each bootstrapped sample.

You end up with multiple models, each slightly different due to the different data they were trained on.

3. Aggregate Predictions
For a new input, each regressor makes a prediction.

The final prediction is the average of all the individual predictions.

*9.  What is the main advantage of ensemble techniques?*

The main advantage of ensemble techniques is improved predictive performance. By combining multiple models, ensembles reduce bias, variance, or both, which usually yields more accurate and more robust predictions than a single model on its own.

*10.  What is the main challenge of ensemble methods?*

The main challenge of ensemble methods is:

 Increased complexity and reduced interpretability — combining multiple models makes ensembles harder to understand, debug, and maintain compared to single, simple models.

 Key Challenges of Ensemble Methods

| Challenge                        | Explanation                                                             |
| -------------------------------- | ----------------------------------------------------------------------- |
| **Lack of Interpretability**     | Hard to explain why the ensemble made a certain prediction              |
| **Computational Cost**           | Training and predicting with multiple models takes more time and memory |
| **Overfitting Risk in Boosting** | Boosting can sometimes overfit noisy data if not regularized            |
| **Tuning Complexity**            | More hyperparameters (e.g., number of estimators, learning rate)        |
| **Model Management**             | More difficult to deploy and maintain in production                     |
| **Data Requirements**            | Some ensemble techniques may need more data to perform well             |


*11.  Explain the key idea behind ensemble techniques?*

Key Idea Behind Ensemble Techniques
The fundamental idea of ensemble techniques is:

Combine multiple models to create a stronger, more accurate, and more robust model than any single one.

Different models make different errors. Some might get certain examples wrong, while others get them right.

By aggregating their predictions (e.g., voting or averaging), the ensemble reduces the overall error.

This diversity among models helps in cancelling out individual mistakes.
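
A small simulation illustrates the cancelling effect. It is an idealised sketch: it assumes every model's error is independent, zero-mean, and has unit variance, which real models trained on the same data never fully satisfy. Under that assumption, averaging M models divides the error variance by M.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models = 25
n_trials = 100_000

# Each column holds one model's prediction error across many trials
errors = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_models))

single_model_var = errors[:, 0].var()        # variance of one model's error
ensemble_var = errors.mean(axis=1).var()     # variance of the averaged error

print(f"Single model error variance: {single_model_var:.4f}")
print(f"Ensemble error variance:     {ensemble_var:.4f}")
print(f"Expected under independence: {1 / n_models:.4f}")
```

When the models' errors are correlated, the reduction is smaller, which is why techniques that increase diversity among models matter.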

*12.  What is a Random Forest Classifier?*

A Random Forest Classifier is an ensemble learning method used for classification tasks. It builds a forest of decision trees and combines their predictions to improve accuracy and control overfitting.

How It Works:

Bootstrap Sampling:
Multiple decision trees are trained on different random subsets of the training data (with replacement).

Feature Randomness:
At each split in a tree, only a random subset of features is considered to decide the best split. This introduces diversity among trees.

Aggregation (Voting):
For classification, each tree makes a prediction, and the class with the majority vote across all trees is chosen as the final output.

*13.  What are the main types of ensemble techniques?*

Ensemble techniques combine multiple models to improve performance. The three main types are:

1. Bagging (Bootstrap Aggregating)
How it works:
Train multiple base models independently on different random subsets of the training data (created via bootstrap sampling).

Goal:
Reduce variance and avoid overfitting.

Examples:

Random Forest

Bagging Classifier/Regressor

2. Boosting
How it works:
Train models sequentially, each new model focusing on correcting errors made by previous models.

Goal:
Reduce bias and improve overall accuracy by focusing on difficult cases.

Examples:

AdaBoost

Gradient Boosting Machines (GBM)

XGBoost

LightGBM

CatBoost

3. Stacking (Stacked Generalization)
How it works:
Train multiple different base models (can be heterogeneous), then train a meta-model to combine their predictions.

Goal:
Leverage strengths of diverse models to improve predictions.

Examples:

Use logistic regression or another model as a meta-learner on top of base learners.
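
Stacking can be sketched with scikit-learn's `StackingClassifier`. The choice of base learners below (a random forest and a scaled SVM) and the Breast Cancer dataset are assumptions made for illustration; logistic regression serves as the meta-learner, as in the example above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Heterogeneous base learners
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("svc", make_pipeline(StandardScaler(), SVC(random_state=42))),
]

# The meta-learner is trained on cross-validated predictions of the base learners
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"Stacking accuracy: {stack_accuracy:.4f}")
```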

*14.  What is ensemble learning in machine learning?*

Ensemble Learning is a technique where multiple models (learners) are combined to solve a problem and improve overall performance.

Instead of relying on a single model, ensemble learning aggregates predictions from several models to produce a better, more accurate, and robust result.

Working :

Train multiple models (can be the same type or different).

Combine their predictions via methods like:

Voting (for classification)

Averaging (for regression)

More complex strategies like stacking.
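
As a small illustration of voting, the sketch below combines three different model types with scikit-learn's `VotingClassifier`. The particular models and the Iris dataset are assumptions chosen to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# voting="hard" picks the class predicted by the most models;
# voting="soft" would average predicted class probabilities instead
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=42)),
    ],
    voting="hard",
)
voter.fit(X_train, y_train)

voting_accuracy = voter.score(X_test, y_test)
print(f"Voting ensemble accuracy: {voting_accuracy:.4f}")
```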

*15.  When should we avoid using ensemble methods?*

While ensemble methods are powerful, there are cases when using them might not be the best choice:

1. When Interpretability is Crucial
Ensembles (especially boosting or stacking) produce complex models.

If you need clear, explainable decisions (e.g., medical diagnosis, finance), simple models like logistic regression or single decision trees may be better.

2. Limited Computational Resources
Ensembles require more time and memory for training and prediction.

On low-resource devices or with strict latency requirements, simpler models might be preferable.

3. Small Datasets
If your dataset is very small, ensembles might overfit or not provide significant benefit.

Sometimes a simple model trained carefully can perform just as well.

4. When You Need Quick Prototyping
Ensembles add complexity.

For fast iteration or proof-of-concept, start with simple models.

5. When Base Models Already Perform Well
If a single model achieves high accuracy and robustness, ensembles might not add much value but will increase complexity.

6. High Maintenance Costs
Ensembles can be hard to deploy, debug, and maintain in production.

If ease of maintenance is a priority, avoid complex ensembles.

*16.  How does Bagging help in reducing overfitting?*

Bagging (Bootstrap Aggregating) reduces overfitting mainly by reducing the variance of high-variance models like decision trees.

Imagine several weather forecasts from different meteorologists (models). Each may overreact to local noise (like a sudden gust of wind), but averaging their predictions gives a more reliable forecast.

Explanation:
High Variance Models Overfit Easily:
Models like decision trees tend to fit noise in training data, which causes them to perform poorly on unseen data.

Bootstrap Sampling Creates Diverse Training Sets:
Bagging trains each model on a different random subset (with replacement) of the original data. This means each base model learns a slightly different pattern.

Multiple Models Average Out Noise:
Since each model makes different errors, averaging their predictions smooths out the noise and reduces fluctuations.

Result: Lower Variance Without Increasing Bias Much
By averaging many “overfit” models, bagging stabilizes the overall prediction and prevents any single model’s noise from dominating.
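
The variance reduction can be measured empirically. The sketch below is an illustrative experiment under assumed conditions (a noisy sine curve, 20 independently drawn training sets, 30 trees per ensemble): it records how much the prediction at fixed test points changes from one training set to the next, for a single tree and for bagged trees.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_test = np.linspace(0, 6, 50).reshape(-1, 1)   # fixed evaluation points

def sample_training_set(n=100):
    """Draw a fresh noisy sample from y = sin(x) + noise."""
    X = rng.uniform(0, 6, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)
    return X, y

tree_preds, bag_preds = [], []
for seed in range(20):
    X_train, y_train = sample_training_set()
    tree = DecisionTreeRegressor(random_state=seed).fit(X_train, y_train)
    bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=30,
                           random_state=seed).fit(X_train, y_train)
    tree_preds.append(tree.predict(X_test))
    bag_preds.append(bag.predict(X_test))

# Variance across training sets at each test point, averaged over the points
tree_variance = np.var(tree_preds, axis=0).mean()
bag_variance = np.var(bag_preds, axis=0).mean()

print(f"Single tree prediction variance:  {tree_variance:.4f}")
print(f"Bagged trees prediction variance: {bag_variance:.4f}")
```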

| Effect                 | How Bagging Helps                                                                  |
| ---------------------- | ---------------------------------------------------------------------------------- |
| Overfitting (variance) | Reduces by averaging multiple overfitting models trained on different data subsets |
| Bias                   | Remains roughly the same                                                           |
| Stability              | Increases, leading to more robust predictions                                      |


*17.  Why is Random Forest better than a single Decision Tree?*

Here’s why a Random Forest is generally better than a single Decision Tree:

| Aspect                | Single Decision Tree                    | Random Forest                                                        |
| --------------------- | --------------------------------------- | -------------------------------------------------------------------- |
| **Overfitting**       | Prone to overfitting the training data  | Reduces overfitting by averaging many trees                          |
| **Variance**          | High variance (sensitive to data noise) | Lower variance through ensemble averaging                            |
| **Bias**              | Can have low bias but high variance     | Slightly higher bias but much lower variance                         |
| **Robustness**        | Sensitive to small changes in data      | More stable and robust to noise and outliers                         |
| **Accuracy**          | Usually less accurate on unseen data    | Generally more accurate and generalizes better                       |
| **Feature Selection** | Uses all features at splits             | Considers random subsets of features at splits, increasing diversity |
| **Interpretability**  | Easier to interpret                     | Harder to interpret due to many trees                                |

*18.  What is the role of bootstrap sampling in Bagging?*

Role of Bootstrap Sampling in Bagging
Bootstrap sampling is the core technique that makes Bagging (Bootstrap Aggregating) work effectively. Here’s its role:


It’s a method of random sampling with replacement.

From the original dataset of size N, you create a new dataset by randomly picking N samples, allowing duplicates.

Each bootstrap sample is slightly different from the original data.

Why Bootstrap Sampling is Important in Bagging:
Creates Diverse Training Sets:
Each base model in Bagging is trained on a different bootstrap sample, so models see slightly different data.

Introduces Model Diversity:
This variation in training data causes each model to learn different patterns and make different errors.

Reduces Correlation Between Models:
Lower correlation among models is key to effective averaging—if models make uncorrelated errors, averaging reduces overall error.

Enables Out-of-Bag (OOB) Estimation:
Since about 37% of the original samples (roughly 1/e, a little over one third) are left out of each bootstrap sample on average, these “left out” samples can be used as validation data to estimate performance without a separate test set.
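
The left-out fraction mentioned above can be verified with a quick simulation. This sketch assumes a dataset of 10,000 rows and simply counts how many rows never appear in a bootstrap sample; for large N the expected share approaches (1 - 1/N)^N, which tends to 1/e.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000        # dataset size N (illustrative assumption)
n_repeats = 20    # number of bootstrap samples to average over

oob_fractions = []
for _ in range(n_repeats):
    sample = rng.integers(0, n, size=n)        # N draws with replacement
    n_unique = len(np.unique(sample))          # distinct rows that were drawn
    oob_fractions.append(1 - n_unique / n)     # share of rows never drawn

mean_oob_fraction = float(np.mean(oob_fractions))
print(f"Average out-of-bag fraction: {mean_oob_fraction:.3f}")
print(f"Theoretical limit (1/e):     {np.exp(-1):.3f}")
```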

*19.  What are some real-world applications of ensemble techniques?*

Ensemble techniques are widely used across industries because of their strong performance and robustness. Here are some real-world applications where ensemble methods shine:

Real-World Applications of Ensemble Techniques

| Application Area                      | Example Use Cases                                       | Ensemble Technique Often Used         |
| ------------------------------------- | ------------------------------------------------------- | ------------------------------------- |
| **Finance**                           | Fraud detection, credit scoring, stock price prediction | Random Forest, Gradient Boosting      |
| **Healthcare**                        | Disease diagnosis, medical image analysis               | Bagging, Boosting, Stacking           |
| **E-commerce**                        | Product recommendation, customer churn prediction       | Gradient Boosting (XGBoost, LightGBM) |
| **Marketing**                         | Customer segmentation, campaign response prediction     | Random Forest, AdaBoost               |
| **Natural Language Processing (NLP)** | Sentiment analysis, spam detection                      | Voting classifiers, Stacking          |
| **Computer Vision**                   | Object detection, facial recognition                    | Ensemble of CNNs, Boosting techniques |
| **Weather Forecasting**               | Predicting temperature, rainfall                        | Bagging, Random Forest                |
| **Cybersecurity**                     | Intrusion detection, malware classification             | Random Forest, Boosting               |
| **Sports Analytics**                  | Player performance prediction, game outcome forecasting | Stacking, Boosting                    |
| **Manufacturing**                     | Fault detection, predictive maintenance                 | Random Forest, Gradient Boosting      |

*20.  What is the difference between Bagging and Boosting?*

Here’s a clear comparison between Bagging and Boosting:

| Aspect                      | Bagging (Bootstrap Aggregating)                                                 | Boosting                                                                      |
| --------------------------- | ------------------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| **Goal**                    | Reduce variance by averaging many independent models                            | Reduce bias by sequentially improving weak learners                           |
| **Model Training**          | Models trained **independently and in parallel** on different bootstrap samples | Models trained **sequentially**, each focusing on errors of previous models   |
| **Data Sampling**           | Random sampling **with replacement** to create diverse datasets                 | Uses the entire dataset but **adjusts weights** to focus on hard examples     |
| **Error Correction**        | No focus on correcting previous errors; each model votes equally                | Later models focus on mistakes made by earlier ones, boosting their influence |
| **Aggregation Method**      | Simple majority voting (classification) or averaging (regression)               | Weighted sum of model predictions based on their accuracy                     |
| **Model Complexity**        | Each model can be complex (e.g., deep trees) but combined to reduce variance    | Usually uses weak learners (e.g., shallow trees) to reduce bias gradually     |
| **Common Algorithms**       | Random Forest, Bagging Classifier/Regressor                                     | AdaBoost, Gradient Boosting, XGBoost, LightGBM                                |
| **Susceptibility to Noise** | Less sensitive due to averaging over many models                                | More sensitive since boosting focuses on hard (possibly noisy) samples        |
| **Parallelization**         | Easy to parallelize since models are independent                                | Harder to parallelize due to sequential training                              |


In [None]:
# 21.  Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy?
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create base estimator - Decision Tree
base_estimator = DecisionTreeClassifier(random_state=42)

# Create Bagging Classifier using the Decision Tree as base estimator
# (the keyword is `estimator` in scikit-learn >= 1.2; it was `base_estimator` before)
bagging_clf = BaggingClassifier(estimator=base_estimator, n_estimators=50, random_state=42)

# Train the model
bagging_clf.fit(X_train, y_train)

# Predict on test data
y_pred = bagging_clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging Classifier Accuracy: {accuracy:.4f}")


In [None]:
# 22.  Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error (MSE).

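from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

# Load dataset (load_boston was removed in scikit-learn 1.2, so the
# California Housing data used elsewhere in this notebook stands in for it)
housing = fetch_california_housing()
X, y = housing.data, housing.target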

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create base estimator - Decision Tree Regressor
base_estimator = DecisionTreeRegressor(random_state=42)

# Create Bagging Regressor using the Decision Tree as base estimator
# (the keyword is `estimator` in scikit-learn >= 1.2; it was `base_estimator` before)
bagging_reg = BaggingRegressor(estimator=base_estimator, n_estimators=50, random_state=42)

# Train the model
bagging_reg.fit(X_train, y_train)

# Predict on test data
y_pred = bagging_reg.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Bagging Regressor Mean Squared Error: {mse:.4f}")


In [None]:
# 23.  Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf.fit(X_train, y_train)

# Get feature importance scores
importances = rf.feature_importances_

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print(feature_importance_df)


In [None]:
# 24.  Train a Random Forest Regressor and compare its performance with a single Decision Tree

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize single Decision Tree Regressor
dt_reg = DecisionTreeRegressor(random_state=42)
dt_reg.fit(X_train, y_train)
y_pred_dt = dt_reg.predict(X_test)
mse_dt = mean_squared_error(y_test, y_pred_dt)

# Initialize Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print(f"Decision Tree Regressor MSE: {mse_dt:.4f}")
print(f"Random Forest Regressor MSE: {mse_rf:.4f}")


In [None]:
# 25.  Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into train and test sets (optional, OOB uses training data only)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Random Forest Classifier with OOB enabled
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)

# Train on training data
rf.fit(X_train, y_train)

# Access OOB score
print(f"OOB Score: {rf.oob_score_:.4f}")


In [None]:
# 26.  Train a Bagging Classifier using SVM as a base estimator and print accuracy


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create base estimator - Support Vector Classifier
base_svc = SVC(probability=True, random_state=42)

# Create Bagging Classifier with SVM as base estimator
# (the keyword is `estimator` in scikit-learn >= 1.2; it was `base_estimator` before)
bagging_svm = BaggingClassifier(estimator=base_svc, n_estimators=20, random_state=42)

# Train the model
bagging_svm.fit(X_train, y_train)

# Predict on test data
y_pred = bagging_svm.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging Classifier with SVM Accuracy: {accuracy:.4f}")


In [None]:
# 27.  Train a Random Forest Classifier with different numbers of trees and compare accuracy

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Different number of trees to try
n_trees_list = [1, 5, 10, 50, 100, 200]

print("Number of Trees | Accuracy")
print("----------------|---------")

for n_trees in n_trees_list:
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{n_trees:15} | {accuracy:.4f}")


In [None]:
# 28.  Train a Bagging Classifier using Logistic Regression as a base estimator and print AUC score

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Base estimator: Logistic Regression (liblinear is a reasonable solver for small datasets)
base_lr = LogisticRegression(solver='liblinear', random_state=42)

# Bagging Classifier with Logistic Regression
# (the keyword is `estimator` in scikit-learn >= 1.2; it was `base_estimator` before)
bagging_clf = BaggingClassifier(estimator=base_lr, n_estimators=50, random_state=42)

# Train the model
bagging_clf.fit(X_train, y_train)

# Predict probabilities for the positive class
y_probs = bagging_clf.predict_proba(X_test)[:, 1]

# Calculate AUC score
auc_score = roc_auc_score(y_test, y_probs)
print(f"Bagging Classifier with Logistic Regression AUC Score: {auc_score:.4f}")


In [None]:
# 29. Train a Random Forest Regressor and analyze feature importance scores.

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target
feature_names = data.feature_names

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_reg.fit(X_train, y_train)

# Get feature importance scores
importances = rf_reg.feature_importances_

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print(feature_importance_df)

# Optional: Plot feature importances
plt.figure(figsize=(10,6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.gca().invert_yaxis()
plt.title('Feature Importance in Random Forest Regressor')
plt.xlabel('Importance Score')
plt.show()


In [None]:
# 30.  Train an ensemble model using both Bagging and Random Forest and compare accuracy.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagging Classifier with Decision Tree base estimator
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    random_state=42
)
bagging_clf.fit(X_train, y_train)
y_pred_bagging = bagging_clf.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)

# Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

print(f"Bagging Classifier Accuracy: {accuracy_bagging:.4f}")
print(f"Random Forest Classifier Accuracy: {accuracy_rf:.4f}")


In [None]:
# 31. Train a Random Forest Classifier and tune hyperparameters using GridSearchCV

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid to search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]  # 'auto' was removed in scikit-learn 1.3
}

# Setup GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2)

# Fit GridSearch to training data
grid_search.fit(X_train, y_train)

# Best hyperparameters
print("Best hyperparameters:", grid_search.best_params_)

# Evaluate the best estimator on test set
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test set accuracy with best RF: {accuracy:.4f}")
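
The grid above is searched exhaustively, so its cost is the product of the option counts multiplied by the number of folds. A minimal sketch of that arithmetic using only the standard library; the `grid` literal is an assumption that mirrors the shape of `param_grid` (five hyperparameters with three options each) and is not read from the fitted search object:

```python
from itertools import product

# Illustrative grid with the same shape as the one searched above
grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
}
cv_folds = 5

# Each combination of one value per hyperparameter is one candidate model
candidates = list(product(*grid.values()))
n_candidates = len(candidates)

# Each candidate is fitted once per fold (the final refit on all training data is extra)
total_fits = n_candidates * cv_folds

print(f"Candidates: {n_candidates}, cross-validation fits: {total_fits}")
```

If that many fits is too slow, shrinking the grid or sampling it with `RandomizedSearchCV` are common alternatives.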


In [None]:
# 32.  Train a Bagging Regressor with different numbers of base estimators and compare performance

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Different numbers of base estimators to try
n_estimators_list = [1, 5, 10, 50, 100]

print("Number of Estimators | Mean Squared Error")
print("---------------------|--------------------")

for n_estimators in n_estimators_list:
    bagging_reg = BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=42),
        n_estimators=n_estimators,
        random_state=42
    )
    bagging_reg.fit(X_train, y_train)
    y_pred = bagging_reg.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"{n_estimators:21} | {mse:.4f}")


In [None]:
# 33.  Train a Random Forest Classifier and analyze misclassified samples

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict test data
y_pred = rf.predict(X_test)

# Identify misclassified samples
misclassified_indices = np.where(y_pred != y_test)[0]

print(f"Number of misclassified samples: {len(misclassified_indices)}")

# Create a DataFrame for misclassified samples for inspection
misclassified_data = pd.DataFrame(X_test[misclassified_indices], columns=feature_names)
misclassified_data['True Label'] = y_test[misclassified_indices]
misclassified_data['Predicted Label'] = y_pred[misclassified_indices]

print("\nMisclassified samples:")
print(misclassified_data)


In [None]:
# 34.  Train a Bagging Classifier and compare its performance with a single Decision Tree Classifier

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train single Decision Tree Classifier
dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)
y_pred_dt = dt_clf.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)

# Train Bagging Classifier with Decision Trees
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    random_state=42
)
bagging_clf.fit(X_train, y_train)
y_pred_bagging = bagging_clf.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)

print(f"Single Decision Tree Accuracy: {accuracy_dt:.4f}")
print(f"Bagging Classifier Accuracy: {accuracy_bagging:.4f}")


In [None]:
# 35.  Train a Random Forest Classifier and visualize the confusion matrix

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict on test set
y_pred = rf.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize confusion matrix with seaborn heatmap
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=data.target_names, yticklabels=data.target_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix - Random Forest Classifier')
plt.show()


In [None]:
# 36.  Train a Stacking Classifier using Decision Trees, SVM, and Logistic Regression, and compare accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Base estimators
estimators = [
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
    ('lr', LogisticRegression(max_iter=1000, random_state=42))  # default solver handles the 3-class iris target
]

# Stacking Classifier with Logistic Regression as final estimator
stacking_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(random_state=42)
)

# Train stacking classifier
stacking_clf.fit(X_train, y_train)
y_pred_stack = stacking_clf.predict(X_test)
accuracy_stack = accuracy_score(y_test, y_pred_stack)

# Train individual models for comparison
accuracies = {}
for name, model in estimators:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracies[name] = accuracy_score(y_test, y_pred)

print("Individual Model Accuracies:")
for name, acc in accuracies.items():
    print(f"{name}: {acc:.4f}")

print(f"\nStacking Classifier Accuracy: {accuracy_stack:.4f}")


In [None]:
# 37.  Train a Random Forest Classifier and print the top 5 most important features

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_

# Create a DataFrame with feature names and their importance scores
feat_imp_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort by importance descending and get top 5
top5_features = feat_imp_df.sort_values(by='Importance', ascending=False).head(5)

print("Top 5 Most Important Features:")
print(top5_features)


In [None]:
# 38.  Train a Bagging Classifier and evaluate performance using Precision, Recall, and F1-score

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Bagging Classifier with Decision Trees
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    random_state=42
)

# Train model
bagging_clf.fit(X_train, y_train)

# Predict on test set
y_pred = bagging_clf.predict(X_test)

# Evaluate Precision, Recall, and F1-score
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
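
The three scores printed above all derive from counts of true positives, false positives, and false negatives. A minimal sketch of those formulas in plain Python; the label lists are hypothetical values chosen for illustration, not output from the classifier above:

```python
# Hypothetical labels for illustration only (1 = positive class)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 0, 1, 0, 1]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted positive, truly positive
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted positive, truly negative
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted negative, truly positive

precision_manual = tp / (tp + fp)  # share of positive predictions that are correct
recall_manual = tp / (tp + fn)     # share of actual positives that were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"Precision: {precision_manual:.4f}")
print(f"Recall:    {recall_manual:.4f}")
print(f"F1-score:  {f1_manual:.4f}")
```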


In [None]:
# 39.  Train a Random Forest Classifier and analyze the effect of max_depth on accuracy

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

max_depth_values = [1, 2, 4, 6, 8, 10, 15, 20, None]
accuracies = []

for depth in max_depth_values:
    rf = RandomForestClassifier(max_depth=depth, n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
    print(f"max_depth={str(depth):>4} --> Accuracy: {acc:.4f}")

# Plotting the effect of max_depth on accuracy
plt.figure(figsize=(8,5))
depth_labels = [str(d) for d in max_depth_values]
plt.plot(depth_labels, accuracies, marker='o')
plt.title('Effect of max_depth on Random Forest Accuracy')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()


In [None]:
# 40. Train a Bagging Regressor using different base estimators (DecisionTree and KNeighbors) and compare performance.

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Bagging Regressor with Decision Tree base estimator
bagging_tree = BaggingRegressor(
    estimator=DecisionTreeRegressor(random_state=42),
    n_estimators=50,
    random_state=42
)
bagging_tree.fit(X_train, y_train)
y_pred_tree = bagging_tree.predict(X_test)
mse_tree = mean_squared_error(y_test, y_pred_tree)

# Initialize Bagging Regressor with K-Neighbors base estimator
bagging_knn = BaggingRegressor(
    estimator=KNeighborsRegressor(),
    n_estimators=50,
    random_state=42
)
bagging_knn.fit(X_train, y_train)
y_pred_knn = bagging_knn.predict(X_test)
mse_knn = mean_squared_error(y_test, y_pred_knn)

print(f"Bagging with Decision Tree Regressor MSE: {mse_tree:.4f}")
print(f"Bagging with K-Neighbors Regressor MSE: {mse_knn:.4f}")


In [None]:
# 41.  Train a Random Forest Classifier and evaluate its performance using ROC-AUC Score

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict probabilities for positive class
y_proba = rf.predict_proba(X_test)[:, 1]

# Calculate ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")
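
One way to read the ROC-AUC score is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise interpretation in plain Python; the labels and scores are hypothetical values for illustration, not predictions from the forest above:

```python
# Hypothetical labels and predicted scores for illustration only
labels_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

pos_scores = [s for s, y in zip(scores_demo, labels_demo) if y == 1]
neg_scores = [s for s, y in zip(scores_demo, labels_demo) if y == 0]

# Count positive/negative pairs that are ranked correctly; a tie counts as half
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p in pos_scores
    for n in neg_scores
)
auc_manual = wins / (len(pos_scores) * len(neg_scores))

print(f"Pairwise AUC: {auc_manual:.4f}")
```

Here three of the four positive/negative pairs are ordered correctly, so the score is 0.75.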


In [None]:
# 42.  Train a Bagging Classifier and evaluate its performance using cross-validation

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Initialize Bagging Classifier with Decision Trees
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    random_state=42
)

# Perform 5-fold cross-validation
cv_scores = cross_val_score(bagging_clf, X, y, cv=5, scoring='accuracy')

print(f"Cross-validation accuracy scores: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.4f}")
print(f"Standard deviation: {cv_scores.std():.4f}")


In [None]:
# 43.  Train a Random Forest Classifier and plot the Precision-Recall curve

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict probabilities for positive class
y_scores = rf.predict_proba(X_test)[:, 1]

# Compute precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)

# Compute average precision score
avg_precision = average_precision_score(y_test, y_scores)

# Plot Precision-Recall curve
plt.figure(figsize=(8,6))
plt.plot(recall, precision, label=f'Average Precision = {avg_precision:.4f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve - Random Forest Classifier')
plt.legend(loc='best')
plt.grid(True)
plt.show()


In [None]:
# 44.  Train a Stacking Classifier with Random Forest and Logistic Regression and compare accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Base estimators
estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('lr', LogisticRegression(max_iter=1000, random_state=42))
]

# Stacking Classifier with Logistic Regression as final estimator
stacking_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(random_state=42)
)

# Train stacking classifier
stacking_clf.fit(X_train, y_train)
y_pred_stack = stacking_clf.predict(X_test)
accuracy_stack = accuracy_score(y_test, y_pred_stack)

# Train individual models for comparison
accuracies = {}
for name, model in estimators:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracies[name] = accuracy_score(y_test, y_pred)

print("Individual Model Accuracies:")
for name, acc in accuracies.items():
    print(f"{name}: {acc:.4f}")

print(f"\nStacking Classifier Accuracy: {accuracy_stack:.4f}")


In [None]:
# 45.  Train a Bagging Regressor with different levels of bootstrap samples and compare performance.

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Different levels of bootstrap samples to try (as fractions of the training set)
max_samples_values = [0.3, 0.5, 0.7, 1.0]

mse_scores = []

for max_samples in max_samples_values:
    bagging_reg = BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=42),
        n_estimators=50,
        max_samples=max_samples,
        bootstrap=True,
        random_state=42
    )
    bagging_reg.fit(X_train, y_train)
    y_pred = bagging_reg.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_samples={max_samples:.1f} --> MSE: {mse:.4f}")

# Plot MSE vs max_samples
plt.figure(figsize=(8,5))
plt.plot([str(ms) for ms in max_samples_values], mse_scores, marker='o')
plt.title("Effect of max_samples on Bagging Regressor Performance")
plt.xlabel("max_samples (fraction of training data)")
plt.ylabel("Mean Squared Error")
plt.grid(True)
plt.show()
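
Because `bootstrap=True` draws rows with replacement, each base estimator sees fewer distinct rows than `max_samples` alone suggests. A minimal simulation of that effect using only the standard library; `n_rows` and the helper `unique_fraction` are hypothetical names introduced for illustration and are not part of scikit-learn:

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
n_rows = 10_000          # hypothetical training-set size


def unique_fraction(max_samples):
    """Fraction of distinct rows seen in one draw of about max_samples * n_rows rows."""
    n_draws = round(max_samples * n_rows)
    drawn = {rng.randrange(n_rows) for _ in range(n_draws)}  # sampling with replacement
    return len(drawn) / n_rows


fractions = {ms: unique_fraction(ms) for ms in [0.3, 0.5, 0.7, 1.0]}
for ms, frac in fractions.items():
    print(f"max_samples={ms:.1f} --> distinct rows seen: {frac:.3f}")
```

Even at `max_samples=1.0`, roughly 63% of the distinct rows end up in a single bootstrap sample, which is part of why the individual trees differ from one another.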
