**Theoretical**




1 Can we use Bagging for regression problems?
-Yes, Bagging (Bootstrap Aggregating) can absolutely be used for regression problems—and it's actually quite effective in many cases!



2 What is the difference between multiple model training and single model training?
-Great question! The difference between **multiple model training** and **single model training** is foundational to understanding how ensemble methods like **bagging, boosting**, or **stacking** work.

---

### 🧠 **Single Model Training**
This is the traditional approach:

- You train **one model** (e.g., a decision tree, linear regression, neural network) on the entire training dataset.
- This model is solely responsible for making predictions.

#### ✅ Pros:
- Simpler to implement and understand.
- Faster to train and evaluate.
- Requires fewer resources.

#### ❌ Cons:
- If the model is high variance (like decision trees), it might overfit.
- If it's too simple, it might underfit.
- Less robust to noise or data quirks.

---

### 🤖 **Multiple Model Training (Ensemble Learning)**
Here, you train **multiple models** (often of the same type) and combine their predictions.

**Two popular forms:**
1. **Bagging** (e.g., Random Forest): Models trained on different random subsets of data.
2. **Boosting** (e.g., XGBoost, AdaBoost): Models trained sequentially, each correcting the previous.

#### ✅ Pros:
- **Reduces variance and/or bias.**
- More accurate and robust.
- Handles complex data better.

#### ❌ Cons:
- More computationally expensive.
- Harder to interpret.
- Slower to train and predict.

---

### 🔍 Quick Comparison Table

| Feature                  | Single Model             | Multiple Model (Ensemble)     |
|--------------------------|--------------------------|-------------------------------|
| Number of Models         | 1                        | Many                          |
| Prediction               | From one model           | Combined from many models     |
| Complexity               | Simpler                  | More complex                  |
| Accuracy (often)         | Lower                    | Higher                        |
| Overfitting Risk         | Higher (for complex models) | Lower (especially in bagging) |
| Interpretability         | Easier                   | Harder                        |

---




3 Explain the concept of feature randomness in Random Forest
-Normally, when a decision tree splits a node, it looks through all available features and chooses the best one to split on.

But in a Random Forest, when building each tree:

At each split, the algorithm considers only a random subset of the features, not all of them.

From that subset, it picks the best feature to split the node.

🔧 Example:
Suppose you have a dataset with 10 features, and you're growing a tree in a Random Forest.

At each split, instead of checking all 10 features, the algorithm might randomly select, say, 3 features, and only evaluate those.

This random selection is repeated at every split in the tree.
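
A minimal sketch of that per-split sampling, assuming NumPy and the 10-feature / 3-candidate numbers from the example. `candidate_features` is a hypothetical helper written for illustration, not part of any library. In scikit-learn, the subset size is controlled by the `max_features` parameter of `RandomForestClassifier`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 10      # total features in the dataset (from the example above)
max_features = 3     # features considered at each split (illustrative choice)

def candidate_features(rng, n_features, max_features):
    """Hypothetical helper: draw the feature indices one split may consider."""
    # Sample without replacement so a split never sees the same feature twice.
    return np.sort(rng.choice(n_features, size=max_features, replace=False))

# A fresh subset is drawn at every split, so different splits see different features.
for split in range(4):
    subset = candidate_features(rng, n_features, max_features)
    print(f"split {split}: candidate features {subset.tolist()}")
```

Because every tree is forced to work with different candidates, the trees end up less similar to each other, which is what makes averaging them worthwhile.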





4 What is OOB (Out-of-Bag) Score?
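-In Bagging (Bootstrap Aggregating), each base model (like a decision tree) is trained on a bootstrap sample: a random sample drawn with replacement from the training data.

Because of this sampling method:

Some training instances are selected multiple times, and

Some are not selected at all for a particular tree. These left-out instances are that tree's out-of-bag (OOB) samples, and on average they make up about 37% of the training data.

The OOB score is computed by predicting each training instance using only the trees that did not see it during training, then scoring those predictions (accuracy for classification, R² for regression). It acts as a built-in validation estimate that needs no separate hold-out set.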
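
A quick numerical check of how many rows one bootstrap sample leaves out. This is a sketch assuming NumPy and an arbitrary dataset size of 1000 rows; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 1000

# One bootstrap sample: n_samples draws with replacement from the row indices.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows never drawn are "out-of-bag" for the tree trained on this sample.
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[bootstrap_idx] = True
oob_fraction = 1.0 - in_bag.mean()

# Theory: P(a row is never drawn) = (1 - 1/n)^n, which approaches 1/e (about 0.368).
expected = (1 - 1 / n_samples) ** n_samples
print(f"observed OOB fraction: {oob_fraction:.3f}")
print(f"expected OOB fraction: {expected:.3f}")
```

These out-of-bag rows are the ones each tree can be scored on, since the tree never saw them during training.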


5  How can you measure the importance of features in a Random Forest model?
-Awesome question! Measuring **feature importance** in a **Random Forest** helps us understand which features are contributing the most to the model’s predictions. There are a couple of common ways to do this:

---

## 🌳 1. **Mean Decrease in Impurity (MDI)** – a.k.a. Gini Importance

This is the **default method** in libraries like `scikit-learn`.

### 🔍 How it works:
- Every time a feature is used to split a node in any of the trees, the **impurity (e.g., Gini or MSE)** is reduced.
- The reduction in impurity is **accumulated** for each feature across all trees.
- The average total reduction per feature is then normalized → **feature importance score**.

### 🧠 Think of it as:
> "How much does this feature help reduce uncertainty (impurity) when splitting the data?"

### ✅ Pros:
- Fast and built-in
- Works well for many cases

### ❌ Cons:
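- Biased toward high-cardinality features (continuous columns or categoricals with many levels)
- Computed from training data, so it can overstate features the trees have overfit to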

---

## 🎲 2. **Permutation Importance** (a.k.a. Mean Decrease in Accuracy)

This is a **model-agnostic** method, often considered more reliable.

### 🔍 How it works:
1. Evaluate the model's performance (e.g., accuracy or R²) on a validation or OOB set.
2. **Randomly shuffle** the values of one feature across the dataset.
3. Re-evaluate the model performance.
4. The **drop in performance** tells you how important that feature was.

### ✅ Pros:
- Less biased
- Can work with any model, not just Random Forest

### ❌ Cons:
- Slower (the model is re-evaluated once per feature per repeat; no retraining is needed)

---

## 🔧 Code Example (Scikit-learn: MDI)

```python
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import matplotlib.pyplot as plt

model = RandomForestClassifier()
model.fit(X_train, y_train)

importances = model.feature_importances_
feature_names = X_train.columns

# Plot
pd.Series(importances, index=feature_names).sort_values().plot(kind='barh')
plt.title("Feature Importances")
plt.show()
```

---

## 🔧 Code Example (Scikit-learn: Permutation Importance)

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42)
importances = pd.Series(result.importances_mean, index=X_val.columns)
importances.sort_values().plot(kind="barh")
plt.title("Permutation Feature Importances")
plt.show()
```

---

## 🧪 Summary

| Method                  | Description                                       | Bias?           | Speed |
|-------------------------|---------------------------------------------------|------------------|--------|
| MDI (Gini Importance)   | Total impurity reduction from splits              | Biased (slightly)| Fast   |
| Permutation Importance  | Measures drop in accuracy after shuffling a feature | Less biased     | Slower |

---




6 Explain the working principle of a Bagging Classifier
-🔧 Working Principle: Step-by-Step
Let’s say you’re training a Bagging Classifier on a labeled dataset.

1. Bootstrap Sampling
Create multiple random subsets of your training data.

Each subset is created by sampling with replacement from the original dataset.

Each subset is the same size as the original dataset (but with repeated instances).

Example: If you have 100 training samples, each model gets a random 100-sample subset (some duplicates, some left out).

2. Train Base Learners
Train a separate model (usually of the same type, like decision trees) on each bootstrap sample.

These models are called base estimators.

3. Aggregate Predictions
During prediction, each base model makes a prediction on the new (unseen) data.

For classification: Use majority voting to decide the final class label.

For regression (in Bagging Regressor): Take the average of all outputs.

4. (Optional) Out-of-Bag (OOB) Evaluation
You can use the data not included in each bootstrap sample (about 37% of the rows, on average) to estimate model performance without a separate validation set. This is called the OOB score.
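
The steps above can be sketched by hand. This is an illustrative implementation, not how scikit-learn's `BaggingClassifier` is written internally; it assumes a synthetic binary dataset from `make_classification` and an odd number of trees so the vote cannot tie.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_estimators = 25  # odd, so a binary majority vote never ties
trees = []

for _ in range(n_estimators):
    # Step 1: bootstrap sample, same size as the training set, drawn with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train one base learner on that sample.
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Step 3: collect every tree's prediction and take a majority vote per test row.
all_preds = np.stack([tree.predict(X_test) for tree in trees])  # (n_estimators, n_test)
votes_for_class_1 = all_preds.sum(axis=0)                       # labels are 0/1 here
y_pred = (votes_for_class_1 > n_estimators / 2).astype(int)

accuracy = (y_pred == y_test).mean()
print(f"manual bagging accuracy: {accuracy:.3f}")
```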





7 How do you evaluate a Bagging Classifier’s performance?
-Great question! Evaluating a **Bagging Classifier’s performance** is similar to how you'd evaluate any classification model, but there are also some **special tools unique to bagging** that can give you deeper insight—like the **OOB (Out-of-Bag) score**.

Let’s break it down 👇

---

## ✅ **1. Standard Evaluation Metrics**
Use these on your **test set** or through **cross-validation**:

### 📊 Classification Metrics:
- **Accuracy** – overall correct predictions
- **Precision / Recall / F1-score** – good for imbalanced classes
- **Confusion Matrix** – shows breakdown of TP, FP, FN, TN
- **ROC AUC Score** – good for binary classifiers, especially with imbalance
- **Log Loss** – evaluates predicted probabilities

### Example (scikit-learn):
```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = bagging.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
```

---

## 🧪 **2. Out-of-Bag (OOB) Score** – Bagging Bonus!

The OOB score is a **built-in validation technique** available with Bagging.

### 🔍 How it works:
- Each base model is trained on a bootstrap sample.
- About **37% of the data (roughly 1/3) is left out** of each sample.
- For each data point, you **aggregate predictions** from only the models that **didn't train on it**.
- This gives a reasonable internal accuracy estimate, with no need for a separate validation set.

### How to Use:
```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named `base_estimator` before scikit-learn 1.2
    n_estimators=100,
    oob_score=True,
    random_state=42
)

model.fit(X_train, y_train)
print("OOB Score:", model.oob_score_)  # This is like validation accuracy
```

---

## 🔁 **3. Cross-Validation (Optional)**

You can also use **k-fold cross-validation** for more reliable performance estimates:

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_train, y_train, cv=5)
print("Cross-validation Accuracy: ", scores.mean())
```

---

## 📌 Summary Table

| Evaluation Method   | Purpose                                       | Notes                         |
|---------------------|-----------------------------------------------|-------------------------------|
| Accuracy / F1       | Basic classification performance              | Use test set or CV            |
| ROC AUC / Log Loss  | For probabilistic output & imbalance          | Use `predict_proba()`         |
| Confusion Matrix    | Visual error breakdown                        | Helpful for interpretation    |
| OOB Score           | Internal validation (unique to Bagging)       | No separate validation set needed |
| Cross-Validation    | More robust estimate over multiple splits     | Slower but reliable           |

---




8 How does a Bagging Regressor work?
-1. 🧺 Bootstrap Sampling
Create multiple bootstrap samples from the original training data.

Each sample is created by random sampling with replacement.

Each sample is typically the same size as the original dataset.

2. 🌳 Train Base Regressors
Train a separate regressor (e.g., DecisionTreeRegressor) on each bootstrap sample.

All base models are trained independently.

3. 📈 Make Predictions
For a new (unseen) data point:

Each model makes its own prediction.

The final output is the average of all individual predictions.

🎯 Prediction = Mean of base regressor outputs

4. 🧪 (Optional) Out-of-Bag (OOB) Evaluation
For each data point, you can evaluate its prediction using only the models that did not train on it.

The OOB R² score can serve as an internal validation metric.
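
A short sketch of these steps with scikit-learn's `BaggingRegressor`, assuming a synthetic dataset from `make_regression`. It checks that the ensemble output is the mean of the individual trees' outputs and reads the OOB R² score.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# The base regressor is passed positionally because its keyword name differs
# between scikit-learn releases (`estimator` in recent ones, `base_estimator` before).
bagging = BaggingRegressor(
    DecisionTreeRegressor(),
    n_estimators=50,
    oob_score=True,   # score each row using only the trees that did not train on it
    random_state=0,
)
bagging.fit(X, y)

# The ensemble prediction is the plain mean of the individual trees' predictions.
# (This holds here because the default max_features=1.0 lets every tree see all columns.)
tree_preds = np.stack([tree.predict(X[:5]) for tree in bagging.estimators_])
manual_mean = tree_preds.mean(axis=0)

print("ensemble prediction: ", bagging.predict(X[:5]).round(2))
print("mean of tree outputs:", manual_mean.round(2))
print(f"OOB R^2: {bagging.oob_score_:.3f}")
```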





9 What is the main advantage of ensemble techniques?
-Improved performance (accuracy, robustness, and generalization) by reducing bias, variance, or both.




10 What is the main challenge of ensemble methods?
-📌 Top Challenges of Ensemble Methods:

| Challenge | Description |
|-----------|-------------|
| 🧠 Interpretability | Harder to explain predictions, especially with many base learners (e.g., Random Forest vs. a single decision tree). |
| ⚙️ Training Time | Multiple models = more compute time and memory usage. |
| 🚧 Model Size & Deployment | Ensembles can be large and slow to predict in real time. |
| 🔍 Hyperparameter Tuning | Many parameters to set (number of estimators, learning rate, max depth, etc.). |
| ⚖️ Overfitting (especially Boosting) | If not tuned well, ensembles can still overfit, especially on noisy datasets. |
| 🔁 Data Leakage Risk (if misused) | Complex workflows make it easier to accidentally leak test info into training. |




11 Explain the key idea behind ensemble techniques
-✅ Combine multiple individual models ("weak learners") to create a single, stronger predictive model.

This strategy improves performance by reducing errors like bias, variance, or both.
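
The variance part of that claim can be illustrated with a small simulation. This is a sketch under a strong assumption: each "model" is the true value plus independent noise. Real models trained on overlapping data are correlated, so the reduction in practice is smaller than shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 5.0
n_models = 25
n_trials = 10_000

# Each "model" is simulated as the true value plus independent unit-variance noise.
predictions = true_value + rng.normal(0.0, 1.0, size=(n_trials, n_models))

single_model_var = predictions[:, 0].var()
ensemble_var = predictions.mean(axis=1).var()

# With independent errors, averaging n models divides the variance by about n.
print(f"variance of one model:        {single_model_var:.3f}")
print(f"variance of 25-model average: {ensemble_var:.3f}")
```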



12 What is a Random Forest Classifier?
-A Random Forest Classifier is an ensemble learning algorithm used for classification tasks. It builds a “forest” of decision trees and combines their outputs to make more accurate, stable, and robust predictions than a single tree would.





13 What are the main types of ensemble techniques?
-Great question! There are **three main types** of ensemble techniques, each with its own strategy for combining models to improve performance:

---

## 🔧 1. **Bagging (Bootstrap Aggregating)**

### 📌 Key Idea:
> Train multiple models **in parallel** on different **random subsets** of the data, then **combine their outputs** (e.g., by voting or averaging).

### 🔥 Famous Example:
- **Random Forest** (uses bagging with decision trees)

### ✅ Strengths:
- Reduces **variance**
- Helps prevent **overfitting**
- Works well with **high-variance models** (e.g., decision trees)

---

## 🎯 2. **Boosting**

### 📌 Key Idea:
> Train models **sequentially**, where each new model **focuses on correcting the errors** made by the previous ones.

### 🔥 Famous Examples:
- **AdaBoost**
- **Gradient Boosting**
- **XGBoost**, **LightGBM**, **CatBoost** (high-performance versions)

### ✅ Strengths:
- Reduces **bias**
- Builds a strong model from many **weak learners**
- Great for **structured/tabular data**
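
The sequential idea can be sketched by hand for regression. This is a simplified gradient-boosting loop with squared error, assuming a toy sine-wave dataset and depth-1 trees as weak learners. It is an illustration only, not the full AdaBoost or XGBoost algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())   # start from a constant model
errors = [np.mean((y - prediction) ** 2)]

for _ in range(30):
    residuals = y - prediction                                    # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)  # weak learner fitted to the errors
    prediction += learning_rate * stump.predict(X)                # small corrective step
    errors.append(np.mean((y - prediction) ** 2))

print(f"training MSE before boosting: {errors[0]:.4f}")
print(f"training MSE after 30 stumps: {errors[-1]:.4f}")
```

Each stump is too simple to fit the curve on its own, but because every new stump targets the remaining residuals, the training error shrinks step by step.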

---

## 🧠 3. **Stacking (Stacked Generalization)**

### 📌 Key Idea:
> Combine **different types of models** (e.g., decision trees, SVMs, logistic regression) and use a **meta-model** to learn how to best combine their predictions.

### 🔥 How it works:
- Base models make predictions.
- A **meta-model** (like logistic regression) is trained on those predictions to make the final decision.

### ✅ Strengths:
- Very **flexible**
- Can combine the **strengths of multiple algorithms**
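
A minimal stacking sketch using scikit-learn's `StackingClassifier`, assuming a synthetic dataset and an arbitrary choice of two base models. The specific models and settings are illustrative, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # the meta-model
    cv=5,  # the meta-model is trained on out-of-fold base predictions to limit leakage
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"stacking accuracy: {stack_accuracy:.3f}")
```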

---

## 🧪 Summary Table

| Ensemble Type | Combines Models How?         | Goal              | Example                   |
|----------------|------------------------------|-------------------|---------------------------|
| **Bagging**     | Parallel voting/averaging    | ↓ Variance        | Random Forest             |
| **Boosting**    | Sequential error correction  | ↓ Bias            | AdaBoost, XGBoost         |
| **Stacking**    | Meta-model on base outputs   | ↑ Flexibility     | Any combo of models       |

---




14 What is ensemble learning in machine learning?
-Great foundational question! Let's break it down nice and clear:

---

## 🧠 What is **Ensemble Learning** in Machine Learning?

> **Ensemble learning** is a technique where **multiple models (learners)** are trained and combined to **solve the same problem** and produce **better results** than any single model alone.

It’s like **“wisdom of the crowd”** — many models working together are often smarter and more accurate than just one.

---

## 🎯 Why Use Ensemble Learning?

- ✅ To **increase accuracy**
- ✅ To **reduce overfitting or underfitting**
- ✅ To make models more **stable and robust**
- ✅ To **combine strengths** of different algorithms

---

## 🧰 Common Types of Ensemble Learning

| Type       | Description                                      | Goal         |
|------------|--------------------------------------------------|--------------|
| **Bagging**| Train models in **parallel** on random data subsets | Reduce **variance** |
| **Boosting**| Train models **sequentially**, focusing on errors | Reduce **bias** |
| **Stacking**| Combine **different model types** with a meta-model | Improve **flexibility & performance** |

---

## 🔍 Simple Analogy:
> Imagine you're trying to guess the number of candies in a jar.  
> One person might be off — but if 50 people guess and you take the average, you're probably close.  
That’s **ensemble learning** — combining multiple perspectives (models) to get a better answer.

---

## 🔥 Real-World Examples

- **Random Forest** = Bagging with decision trees
- **XGBoost / LightGBM** = Powerful boosting algorithms
- **Voting Classifier** = Simple ensemble using majority voting

---

## 🧪 In Code (Tiny Example with Scikit-learn)

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Combine three different models
ensemble = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('dt', DecisionTreeClassifier()),
        ('svm', SVC(probability=True))
    ],
    voting='soft'
)
```

---




15 When should we avoid using ensemble methods
-Awesome question — because while **ensemble methods** are powerful, they’re **not always the right choice**.

Here’s when you might want to **avoid** using them:

---

## 🚫 **1. When Interpretability is Crucial**

Ensemble models like Random Forest or XGBoost are often **black boxes**.  
If you're working in fields where you need **clear, human-understandable logic** (like medicine, finance, or law):

- Use simpler models like **logistic regression** or **decision trees**.
- Ensembles may not provide the transparency needed.

---

## 🐌 **2. When Speed (Training or Prediction) is a Priority**

Ensembles:
- Take **longer to train** (especially boosting methods).
- Are **slower at predicting**, since they aggregate outputs from many models.

If your task needs **real-time prediction**, you might prefer a **faster, lightweight model**.

---

## 🧪 **3. When You Have Very Little Data**

Ensemble models usually perform best with **moderate to large datasets**.  
If your dataset is **small**, you risk:

- **Overfitting** (especially with boosting)
- Poor generalization

Sometimes, a **single well-regularized model** will outperform an ensemble on small data.

---

## 🛠️ **4. When a Single Model Already Performs Well**

If a simple model like a logistic regression or single tree gives you great results:

- Adding an ensemble may only give **marginal improvement**.
- Not worth the **added complexity and computation**.

---

## 📉 **5. When the Problem is Already Low-Variance or Low-Bias**

Ensemble methods are meant to **correct bias or variance**.  
If your model already has:

- **Low variance** → Bagging won’t help much.
- **Low bias** → Boosting might just overfit.

---

## 💬 TL;DR — Avoid Ensembles When:

| Situation                             | Better Option                        |
|--------------------------------------|--------------------------------------|
| Need **explainable** decisions       | Use interpretable models             |
| Need **fast** predictions or training| Use simpler algorithms               |
| **Small** dataset                    | Use regularized single models        |
| Already **high performance**         | May not need complexity of ensembles |
| Risk of **overfitting** is high      | Avoid aggressive boosting            |

---




16 How does Bagging help in reducing overfitting?
-Great question — and super important in understanding why **bagging** works so well!

---

## 🎯 Quick Answer:

> **Bagging reduces overfitting by lowering the variance** of high-variance models (like decision trees) through **random sampling** and **model averaging**.

---

## 📉 The Overfitting Problem:

A single **decision tree**, for example, tends to:
- Fit the training data **very closely**
- Capture **noise** along with signal
- Perform poorly on new, unseen data (i.e., **overfit**)

---

## 🌳 How Bagging Helps:

### ✅ 1. **Bootstrap Sampling**
- Bagging creates **multiple different versions** of the training dataset using **random sampling with replacement**.
- Each model sees **different data**, so they **learn different patterns**.

### ✅ 2. **Independent Training**
- Each base model (usually a high-variance learner like a tree) is trained **independently** on its own data.
- As a result, their errors are only **partly correlated**, so they tend not to repeat the same mistakes.

### ✅ 3. **Averaging Predictions**
- For regression: take the **average** of predictions.
- For classification: take a **majority vote**.
- This averaging process **smooths out the noise** and cancels out extreme predictions.
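
To see the effect on a concrete case, the sketch below compares one fully grown tree with a bagged ensemble of the same trees. It assumes a synthetic dataset with 10% label noise (`flip_y=0.1`) so the single tree has something to overfit. The exact numbers depend on the data and the scikit-learn version, so none are quoted here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise so that a fully grown tree has something to overfit.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The base estimator is passed positionally because its keyword name differs
# between scikit-learn releases (`estimator` vs. the older `base_estimator`).
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
bagged.fit(X_train, y_train)

tree_train, tree_test = tree.score(X_train, y_train), tree.score(X_test, y_test)
bag_train, bag_test = bagged.score(X_train, y_train), bagged.score(X_test, y_test)

print(f"single tree  - train: {tree_train:.3f}, test: {tree_test:.3f}")
print(f"bagged trees - train: {bag_train:.3f}, test: {bag_test:.3f}")
```

The gap between train and test accuracy is the thing to compare: a large gap for the single tree is the overfitting that averaging is meant to shrink.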

---

## 📊 Analogy: Crowd Wisdom

Imagine asking 100 people to guess the weight of a watermelon:
- One person might be very wrong.
- But the **average of all guesses** is likely close to the true value.

Bagging does the same: reduces **extreme variance** by blending predictions.

---

## 🔍 Example: Decision Tree vs Bagged Trees

| Feature                | Single Tree       | Bagging (Random Forest) |
|------------------------|-------------------|--------------------------|
| Accuracy (train)       | 🔥 Very high       | ✅ High                  |
| Accuracy (test)        | 😬 Low (overfit)   | ✅ Higher                |
| Sensitivity to noise   | ❌ High            | ✅ Low                   |

---

## 💡 TL;DR

> **Bagging = Diversify + Average**  
> It makes high-variance models more stable by averaging out their noise — and that’s how it helps reduce **overfitting**.

---




17  Why is Random Forest better than a single Decision Tree?
-Great question — and one of the most common ones in machine learning! Let's break it down clearly:

---

## 🌲 Why is **Random Forest** Better Than a Single **Decision Tree**?

### ✅ **Short Answer:**
> A **Random Forest** is usually better because it reduces **overfitting**, increases **accuracy**, and improves **generalization** by combining the predictions of **many diverse decision trees**.

---

## 🧠 Deep Dive: Key Differences

| Feature | 🧱 **Single Decision Tree** | 🌳 **Random Forest** |
|--------|----------------------------|----------------------|
| **Accuracy** | May overfit the training data | More accurate and stable |
| **Overfitting** | High (especially deep trees) | Much lower |
| **Bias** | Low | Slightly higher |
| **Variance** | High | Lower (because of averaging) |
| **Robustness** | Sensitive to small data changes | Robust and consistent |
| **Interpretability** | Easy to interpret | Harder to interpret |
| **Performance on unseen data** | Often poor | Typically much better |

---

## 🔧 How Random Forest Improves Performance:

### 1. **Bagging (Bootstrap Aggregating)**
- Trains each tree on a **random subset** of the data.
- Reduces **variance** and **overfitting**.

### 2. **Feature Randomness**
- Each tree considers only a **random subset of features** at each split.
- Increases **diversity** between trees → better overall predictions.

### 3. **Aggregation**
- Final prediction is made via **majority vote** (classification) or **averaging** (regression).
- Helps **smooth out errors** of individual trees.

---

## 📉 Visual Analogy:

Imagine a single decision tree as one very opinionated person.  
A random forest is like a **committee** of diverse people — each gives their opinion, and the final decision is a **group vote**, which is more balanced.

---

## ✅ Summary:

> Random Forest = **Many diverse, high-variance trees** + **Smart combination**  
→ Results in a **strong**, **robust**, and **accurate** predictor.

---





18 What is the role of bootstrap sampling in Bagging?
-Great one! Bootstrap sampling is **at the heart of Bagging**, so let’s break it down simply:

---

## 🎯 **Role of Bootstrap Sampling in Bagging**

> **Bootstrap sampling** allows Bagging to train **multiple diverse models** on slightly **different datasets**, even though they all come from the **same original data**.

This helps Bagging reduce **variance** and avoid **overfitting**.

---

## 🧺 What is Bootstrap Sampling?

- It's a technique where we **randomly sample** from the dataset **with replacement**.
- Each sample is the **same size as the original dataset**, but some points may appear multiple times, and others may be left out.

📌 So each model gets a **slightly different version** of the training data.

---

## 🧠 Why Is This Important?

### ✅ 1. **Introduces Diversity**
- Each base model (like a decision tree) learns from **different data**, so they make **different errors**.
- Diversity among models is key to ensemble success.

### ✅ 2. **Reduces Variance**
- Individual trees are high-variance learners (they change a lot with small data changes).
- By training them on different datasets and **averaging their outputs**, we smooth out their predictions.

### ✅ 3. **Supports OOB (Out-of-Bag) Evaluation**
- Since some data points are left out in each bootstrap sample, they can be used to **evaluate model performance** without needing a separate validation set.

---

## 🔍 Real-Life Analogy:

Imagine you ask 5 different friends to solve a puzzle, but each one is given a **slightly shuffled version** of the clues.

- Individually, their answers might vary.
- But if you take the **average or majority vote**, you're likely to get the correct solution.
- That's what bootstrap sampling enables.

---

## 🧪 Summary

> 🔁 **Bootstrap Sampling** in Bagging helps:
- Create **diverse models**
- Reduce **overfitting**
- Increase **generalization**
- Enable **out-of-bag evaluation**

---





19 What are some real-world applications of ensemble techniques?
-Awesome question — ensemble techniques are everywhere in the real world because they **boost accuracy, reliability, and robustness**. Let’s check out some top applications across different industries:

---

## 🌍 Real-World Applications of Ensemble Techniques

---

### 🏦 **1. Finance**
- **Credit Scoring** – Predict loan defaults using models like Random Forest or XGBoost.
- **Fraud Detection** – Ensemble models catch subtle patterns in transaction data.
- **Stock Price Prediction** – Boosting models help forecast trends by combining signals.

---

### 🏥 **2. Healthcare**
- **Disease Diagnosis** – Stacked ensembles combine several diagnostic models and can be more accurate than any single one of them.
- **Medical Image Classification** – Detect tumors or anomalies in X-rays or MRIs using ensembles of CNNs.
- **Risk Prediction** – Predict patient outcomes like diabetes or heart failure using Random Forests or Gradient Boosting.

---

### 🛍️ **3. E-commerce & Retail**
- **Recommendation Systems** – Stacking different algorithms to give better product recommendations.
- **Customer Churn Prediction** – Use ensemble models to identify users likely to leave.
- **Price Optimization** – Forecast demand and dynamically adjust prices with boosted regressors.

---

### 🤖 **4. Autonomous Vehicles**
- **Object Detection** – Ensemble methods combine multiple vision models to improve detection accuracy.
- **Path Planning** – Use voting-based ensembles to ensure safe navigation choices.

---

### 🧠 **5. Natural Language Processing (NLP)**
- **Sentiment Analysis** – Combine different models (like RNNs + transformers) for better sentiment detection.
- **Spam Filtering** – Bagging and boosting help detect spam emails more accurately.

---

### 🎮 **6. Gaming & AI**
- **Game Bot Decision Making** – Use ensembles to create smarter, adaptive agents.
- **Cheat Detection** – Spot suspicious behavior using ensemble classifiers.

---

### 🚨 **7. Cybersecurity**
- **Intrusion Detection** – Ensemble models monitor network traffic and detect anomalies.
- **Malware Classification** – Boosted trees can classify malware based on behavioral signatures.

---

## 🔧 Summary

| Industry      | Use Case Example                          | Ensemble Type Used        |
|---------------|--------------------------------------------|---------------------------|
| Finance       | Fraud detection                            | Random Forest, XGBoost    |
| Healthcare    | Disease prediction                         | Stacking, Gradient Boosting|
| Retail        | Product recommendations                    | Voting, Stacking          |
| Autonomous Cars| Object detection, planning                | Bagging + Deep Ensembles  |
| NLP           | Sentiment/spam filtering                   | Stacking, Boosting        |
| Cybersecurity | Anomaly detection                          | Bagging, Isolation Forest |

---





20 What is the difference between Bagging and Boosting?
-🆚 Bagging vs Boosting

| Feature | 🧺 Bagging | 🚀 Boosting |
|---------|-----------|------------|
| Full Name | Bootstrap Aggregating | N/A |
| Main Goal | Reduce variance | Primarily reduce bias (can also reduce variance) |
| Model Training | Models trained independently, in parallel | Models trained sequentially, one after another |
| Data Sampling | Uses bootstrap samples (with replacement) | Each model focuses on the mistakes of the previous ones (reweighted samples or residuals) |
| Weighting | All models are equal in voting/averaging | Models are weighted (e.g., by accuracy in AdaBoost, by a learning rate in Gradient Boosting) |
| Overfitting | Helps prevent overfitting | Can overfit if not properly regularized |
| Best For | High-variance models (e.g., deep decision trees) | Weak learners needing bias correction |
| Popular Algorithms | Random Forest | AdaBoost, Gradient Boosting, XGBoost |
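
A side-by-side sketch of the two approaches in scikit-learn, assuming a synthetic dataset and 50 estimators for each model. Which one scores higher depends on the data, so no result is quoted here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=15, random_state=0)

# Bagging: fully grown, high-variance trees trained independently on bootstrap samples.
# (Base estimator passed positionally; its keyword name changed across scikit-learn releases.)
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: AdaBoost's default base learner is a depth-1 tree, and the trees are
# trained one after another on reweighted data.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

scores = {}
for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean 5-fold accuracy = {scores[name]:.3f}")
```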



**Practical**



21 Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy
-import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

# Generate a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Bagging Classifier with Decision Trees as base estimators
# n_estimators specifies the number of trees in the ensemble.
# random_state for reproducibility.
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named `base_estimator` before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)

# Train the Bagging Classifier
bagging_clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = bagging_clf.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print(f"Bagging Classifier Accuracy: {accuracy:.4f}")

# Example of how to change the base estimator parameters.
# For example, to limit the maximum depth of the decision trees:

bagging_clf_depth_limited = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=5),  # named `base_estimator` before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)

bagging_clf_depth_limited.fit(X_train, y_train)
y_pred_depth_limited = bagging_clf_depth_limited.predict(X_test)
accuracy_depth_limited = accuracy_score(y_test, y_pred_depth_limited)

print(f"Bagging Classifier Accuracy (max_depth=5): {accuracy_depth_limited:.4f}")



22 Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error (MSE)
-import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression

# Generate a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Bagging Regressor with Decision Trees as base estimators
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # named `base_estimator` before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)

# Train the Bagging Regressor
bagging_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = bagging_reg.predict(X_test)

# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Print the MSE
print(f"Bagging Regressor Mean Squared Error (MSE): {mse:.4f}")

#Example of changing base estimator parameters, such as max_depth.
bagging_reg_depth_limited = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=5),  # named `base_estimator` before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)

bagging_reg_depth_limited.fit(X_train, y_train)
y_pred_depth_limited = bagging_reg_depth_limited.predict(X_test)
mse_depth_limited = mean_squared_error(y_test, y_pred_depth_limited)

print(f"Bagging Regressor Mean Squared Error (MSE) (max_depth=5): {mse_depth_limited:.4f}")
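
MSE is expressed in squared target units, which makes it hard to read on its own. As a minimal sketch, assuming the same synthetic dataset as above, the following also reports RMSE and R^2. The variable names and the choice of 50 estimators are illustrative and not part of the original exercise.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The base estimator is passed positionally so the sketch does not depend on
# the keyword name, which differs between scikit-learn releases.
model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)            # same units as the target
r2 = r2_score(y_test, y_pred)  # 1.0 would be a perfect fit

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R^2:  {r2:.4f}")
```

RMSE can be compared directly with the spread of `y`, while R^2 gives a scale-free view of how much variance the ensemble explains.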



23 Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores
-import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
feature_names = cancer.feature_names

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest Classifier
rf_classifier.fit(X_train, y_train)

# Get feature importance scores
feature_importance = rf_classifier.feature_importances_

# Create a DataFrame to display feature importance
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importance})

# Sort the DataFrame by importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print the feature importance scores
print("Feature Importance Scores:")
print(feature_importance_df)

# Example of accessing a specific feature's importance
print("\nImportance of 'mean radius':")
print(feature_importance_df[feature_importance_df['Feature'] == 'mean radius']['Importance'].values[0])
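
The `feature_importances_` attribute is impurity-based and computed from the training data, so it is often worth cross-checking against held-out data. A possible sketch using `sklearn.inspection.permutation_importance` is shown below; `n_repeats=10` and the column labels are arbitrary choices made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature on the held-out set and record the drop in accuracy.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

comparison = pd.DataFrame({
    'Feature': cancer.feature_names,
    'Impurity importance': rf.feature_importances_,
    'Permutation importance': perm.importances_mean,
}).sort_values(by='Permutation importance', ascending=False)

print(comparison.head(10).to_string(index=False))
```

When several features are strongly correlated, as they are in this dataset, permutation scores can look small for each of them individually, so the two rankings need not agree.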


24 Train a Random Forest Regressor and compare its performance with a single Decision Tree
-import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Generate a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Create a Decision Tree Regressor
dt_regressor = DecisionTreeRegressor(random_state=42)

# Train the Random Forest Regressor
rf_regressor.fit(X_train, y_train)

# Train the Decision Tree Regressor
dt_regressor.fit(X_train, y_train)

# Make predictions on the test set
rf_y_pred = rf_regressor.predict(X_test)
dt_y_pred = dt_regressor.predict(X_test)

# Calculate the Mean Squared Error (MSE) for both models
rf_mse = mean_squared_error(y_test, rf_y_pred)
dt_mse = mean_squared_error(y_test, dt_y_pred)

# Print the MSE for both models
print(f"Random Forest Regressor MSE: {rf_mse:.4f}")
print(f"Decision Tree Regressor MSE: {dt_mse:.4f}")

# Compare the performance
if rf_mse < dt_mse:
    print("\nRandom Forest Regressor performs better than Decision Tree Regressor.")
else:
    print("\nDecision Tree Regressor performs better than Random Forest Regressor (or they perform equally).")

# Example of how to change the Random Forest parameters:
rf_regressor_modified = RandomForestRegressor(n_estimators = 50, max_depth = 5, random_state = 42)
rf_regressor_modified.fit(X_train, y_train)
rf_y_pred_modified = rf_regressor_modified.predict(X_test)
rf_mse_modified = mean_squared_error(y_test, rf_y_pred_modified)
print(f"Random Forest Regressor MSE (n_estimators=50, max_depth=5): {rf_mse_modified:.4f}")




25 Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier
-import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Generate a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Create a Random Forest Classifier with oob_score=True
rf_classifier = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)

# Train the Random Forest Classifier
rf_classifier.fit(X, y)

# Get the Out-of-Bag (OOB) score
oob_score = rf_classifier.oob_score_

# Print the OOB score
print(f"Out-of-Bag (OOB) Score: {oob_score:.4f}")

#Example of accessing the oob_decision_function_
oob_decision_function = rf_classifier.oob_decision_function_

print("\nOOB Decision Function shape:", oob_decision_function.shape)
print("Example oob_decision_function for the first 5 samples:\n", oob_decision_function[:5])
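
The OOB score is meant to approximate performance on unseen data without a separate validation set. A small sketch, under the assumption that we hold out 30% of the same synthetic data, compares the two estimates side by side (the variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X_train, y_train)

# OOB accuracy: each training sample is scored only by trees that did not see it.
oob_accuracy = rf.oob_score_
# Held-out accuracy: the usual estimate from data the forest never touched.
test_accuracy = accuracy_score(y_test, rf.predict(X_test))

print(f"OOB accuracy:  {oob_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

The two numbers are usually close but not identical, since each OOB prediction uses only a subset of the trees.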



26 Train a Bagging Classifier using SVM as a base estimator and print accuracy
-import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

# Generate a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Bagging Classifier with SVM as the base estimator
bagging_svm = BaggingClassifier(
    estimator=SVC(probability=True),  # probability=True enables predict_proba; keyword was base_estimator before scikit-learn 1.2
    n_estimators=10,  # Reduced estimators due to SVM's computational cost
    random_state=42
)

# Train the Bagging Classifier
bagging_svm.fit(X_train, y_train)

# Make predictions on the test set
y_pred = bagging_svm.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print(f"Bagging SVM Classifier Accuracy: {accuracy:.4f}")

#Example of changing base estimator parameters.
bagging_svm_modified = BaggingClassifier(
    estimator=SVC(probability=True, C=0.5, kernel='linear'),
    n_estimators=10,
    random_state=42
)

bagging_svm_modified.fit(X_train, y_train)
y_pred_modified = bagging_svm_modified.predict(X_test)
accuracy_modified = accuracy_score(y_test, y_pred_modified)

print(f"Bagging SVM Classifier Accuracy (C=0.5, kernel='linear'): {accuracy_modified:.4f}")


27 Train a Random Forest Classifier with different numbers of trees and compare accuracy
-import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the range of number of trees to test
n_estimators_range = [10, 50, 100, 200, 300, 400, 500]

# Store the accuracies for each number of trees
accuracies = []

# Train and evaluate the Random Forest Classifier for each number of trees
for n_estimators in n_estimators_range:
    rf_classifier = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    rf_classifier.fit(X_train, y_train)
    y_pred = rf_classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"Number of Trees: {n_estimators}, Accuracy: {accuracy:.4f}")

# Plot the accuracy vs. number of trees
plt.plot(n_estimators_range, accuracies, marker='o')
plt.title("Random Forest Accuracy vs. Number of Trees")
plt.xlabel("Number of Trees (n_estimators)")
plt.ylabel("Accuracy")
plt.grid(True)
plt.show()

# Find the best number of trees and corresponding accuracy
best_n_estimators = n_estimators_range[np.argmax(accuracies)]
best_accuracy = max(accuracies)

print(f"\nBest Number of Trees: {best_n_estimators}, Best Accuracy: {best_accuracy:.4f}")

#Example of changing other Random Forest parameters.
rf_classifier_modified = RandomForestClassifier(n_estimators = 100, max_depth = 5, random_state = 42)
rf_classifier_modified.fit(X_train, y_train)
y_pred_modified = rf_classifier_modified.predict(X_test)
accuracy_modified = accuracy_score(y_test, y_pred_modified)

print(f"Random Forest Classifier Accuracy (n_estimators=100, max_depth=5): {accuracy_modified:.4f}")
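
With fewer than 200 test samples, a single train/test split can move accuracy noticeably from one tree count to the next by chance alone. As a hedged alternative, the sketch below scores a few tree counts with 5-fold cross-validation; the shortened list of tree counts and the `cv_summary` dictionary are choices made here to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

cv_summary = {}
for n_estimators in [10, 50, 100]:
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    # Five accuracy values, one per fold.
    scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
    cv_summary[n_estimators] = (scores.mean(), scores.std())
    print(f"Number of Trees: {n_estimators}, "
          f"CV Accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```

If the standard deviations overlap, the difference between two settings is probably not meaningful.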



28 Train a Bagging Classifier using Logistic Regression as a base estimator and print AUC score
-import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Generate a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Bagging Classifier with Logistic Regression as the base estimator
bagging_lr = BaggingClassifier(
    estimator=LogisticRegression(solver='liblinear'),  # liblinear suits small datasets; keyword was base_estimator before scikit-learn 1.2
    n_estimators=10,
    random_state=42
)

# Train the Bagging Classifier
bagging_lr.fit(X_train, y_train)

# Make probability predictions on the test set
y_prob = bagging_lr.predict_proba(X_test)[:, 1]  # Get probabilities for the positive class

# Calculate the AUC score
auc = roc_auc_score(y_test, y_prob)

# Print the AUC score
print(f"Bagging Logistic Regression AUC: {auc:.4f}")

# Example of changing Logistic Regression parameters.
bagging_lr_modified = BaggingClassifier(
    estimator=LogisticRegression(solver='liblinear', C=0.5),
    n_estimators=10,
    random_state=42
)

bagging_lr_modified.fit(X_train, y_train)
y_prob_modified = bagging_lr_modified.predict_proba(X_test)[:, 1]
auc_modified = roc_auc_score(y_test, y_prob_modified)

print(f"Bagging Logistic Regression AUC (C=0.5): {auc_modified:.4f}")



29 Train a Random Forest Regressor and analyze feature importance scores
-import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt

# Generate a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the Random Forest Regressor
rf_regressor.fit(X_train, y_train)

# Get feature importance scores
feature_importance = rf_regressor.feature_importances_

# Create a DataFrame to display feature importance
feature_importance_df = pd.DataFrame({'Feature': range(X.shape[1]), 'Importance': feature_importance})

# Sort the DataFrame by importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print the feature importance scores
print("Feature Importance Scores:")
print(feature_importance_df)

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.bar(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.xlabel("Feature Index")
plt.ylabel("Importance Score")
plt.title("Random Forest Regressor Feature Importance")
plt.xticks(feature_importance_df['Feature']) #set ticks to feature numbers
plt.show()

# Example of accessing a specific feature's importance
print(f"\nImportance score of feature index 0: {feature_importance_df[feature_importance_df['Feature'] == 0]['Importance'].values[0]}")

#Example of changing RF parameters and analyzing feature importance.
rf_regressor_modified = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42)
rf_regressor_modified.fit(X_train, y_train)
feature_importance_modified = rf_regressor_modified.feature_importances_
feature_importance_df_modified = pd.DataFrame({'Feature': range(X.shape[1]), 'Importance': feature_importance_modified})
feature_importance_df_modified = feature_importance_df_modified.sort_values(by='Importance', ascending=False)

print("\nFeature Importance Scores (Modified RF):")
print(feature_importance_df_modified)



30 Train an ensemble model using both Bagging and Random Forest and compare accuracy
-import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create individual models
# The keyword is `estimator` from scikit-learn 1.2 onward (previously `base_estimator`)
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train individual models
bagging_clf.fit(X_train, y_train)
rf_clf.fit(X_train, y_train)

# Make predictions with individual models
bagging_y_pred = bagging_clf.predict(X_test)
rf_y_pred = rf_clf.predict(X_test)

# Calculate accuracy for individual models
bagging_accuracy = accuracy_score(y_test, bagging_y_pred)
rf_accuracy = accuracy_score(y_test, rf_y_pred)

print(f"Bagging Classifier Accuracy: {bagging_accuracy:.4f}")
print(f"Random Forest Classifier Accuracy: {rf_accuracy:.4f}")

# Create a Voting Classifier (ensemble of Bagging and Random Forest)
ensemble_clf = VotingClassifier(
    estimators=[('bagging', bagging_clf), ('random_forest', rf_clf)],
    voting='hard'  # 'hard' voting uses predicted class labels
)

# Train the ensemble model
ensemble_clf.fit(X_train, y_train)

# Make predictions with the ensemble model
ensemble_y_pred = ensemble_clf.predict(X_test)

# Calculate accuracy for the ensemble model
ensemble_accuracy = accuracy_score(y_test, ensemble_y_pred)

print(f"Ensemble Classifier Accuracy: {ensemble_accuracy:.4f}")

#Compare accuracies
if ensemble_accuracy > max(bagging_accuracy, rf_accuracy):
    print("\nThe Ensemble model performed better than the individual models.")
else:
    print("\nThe Ensemble model did not outperform the best individual model.")

#Example of changing the voting method to soft, which requires predict_proba.
ensemble_clf_soft = VotingClassifier(
    estimators=[('bagging', bagging_clf), ('random_forest', rf_clf)],
    voting='soft' #soft voting uses predicted probabilities
)
ensemble_clf_soft.fit(X_train, y_train)
ensemble_y_pred_soft = ensemble_clf_soft.predict(X_test)
ensemble_accuracy_soft = accuracy_score(y_test, ensemble_y_pred_soft)
print(f"Ensemble Classifier Accuracy (soft voting): {ensemble_accuracy_soft:.4f}")


31 Train a Random Forest Classifier and tune hyperparameters using GridSearchCV
-import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Define the hyperparameters to tune
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=3, scoring='accuracy', verbose=2, n_jobs=-1)

# Perform the grid search
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print(f"Best Hyperparameters: {best_params}")

# Get the best model
best_rf_classifier = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_rf_classifier.predict(X_test)

# Calculate the accuracy of the best model
accuracy = accuracy_score(y_test, y_pred)
print(f"Best Model Accuracy: {accuracy:.4f}")

#Example of accessing the grid search results.
print("\nGrid Search Results:")
print(grid_search.cv_results_)
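
Printing `cv_results_` directly produces a large dictionary of arrays that is hard to scan. One way to make it readable, sketched here with a deliberately smaller grid so it runs quickly, is to load it into a pandas DataFrame and sort by rank. The reduced `param_grid` and the selected columns are assumptions made for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# A reduced grid (4 combinations) purely to keep the example fast.
param_grid = {'n_estimators': [25, 50], 'max_depth': [None, 5]}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
    scoring='accuracy',
)
grid_search.fit(X_train, y_train)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(grid_search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = results[columns].sort_values('rank_test_score')

print(summary.to_string(index=False))
```

The `std_test_score` column helps judge whether the top-ranked combination is clearly better or only marginally ahead.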



32 Train a Bagging Regressor with different numbers of base estimators and compare performance
-import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Generate a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the range of number of estimators to test
n_estimators_range = [10, 50, 100, 200, 300, 400, 500]

# Store the MSE for each number of estimators
mses = []

# Train and evaluate the Bagging Regressor for each number of estimators
for n_estimators in n_estimators_range:
    bagging_regressor = BaggingRegressor(
        estimator=DecisionTreeRegressor(),  # keyword was base_estimator before scikit-learn 1.2
        n_estimators=n_estimators,
        random_state=42
    )
    bagging_regressor.fit(X_train, y_train)
    y_pred = bagging_regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mses.append(mse)
    print(f"Number of Estimators: {n_estimators}, MSE: {mse:.4f}")

# Plot the MSE vs. number of estimators
plt.plot(n_estimators_range, mses, marker='o')
plt.title("Bagging Regressor MSE vs. Number of Estimators")
plt.xlabel("Number of Estimators (n_estimators)")
plt.ylabel("Mean Squared Error (MSE)")
plt.grid(True)
plt.show()

# Find the best number of estimators (lowest MSE)
best_n_estimators = n_estimators_range[np.argmin(mses)]
best_mse = min(mses)

print(f"\nBest Number of Estimators: {best_n_estimators}, Best MSE: {best_mse:.4f}")

#Example of changing base estimator parameters.
bagging_regressor_modified = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=5),
    n_estimators=100,
    random_state=42
)
bagging_regressor_modified.fit(X_train, y_train)
y_pred_modified = bagging_regressor_modified.predict(X_test)
mse_modified = mean_squared_error(y_test, y_pred_modified)
print(f"Bagging Regressor MSE (max_depth=5): {mse_modified:.4f}")



33 Train a Random Forest Classifier and analyze misclassified samples
-import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
feature_names = cancer.feature_names

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest Classifier
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Analyze misclassified samples
misclassified_indices = np.where(y_pred != y_test)[0]
misclassified_samples = X_test[misclassified_indices]
misclassified_true_labels = y_test[misclassified_indices]
misclassified_predicted_labels = y_pred[misclassified_indices]

# Print misclassified samples and their true/predicted labels
print("Misclassified Samples:")
for i in range(len(misclassified_indices)):
    print(f"Sample Index: {misclassified_indices[i]}")
    print(f"True Label: {misclassified_true_labels[i]}, Predicted Label: {misclassified_predicted_labels[i]}")
    print(f"Features: {misclassified_samples[i]}")
    print("-" * 20)

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=cancer.target_names, yticklabels=cancer.target_names)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

# Example of accessing the features of a single misclassified sample.
if len(misclassified_indices) > 0:
    first_misclassified_sample_features = misclassified_samples[0]
    print(f"\nFeatures of the first misclassified sample:\n{first_misclassified_sample_features}")
else:
    print("\nNo misclassified samples.")
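
Beyond listing the misclassified rows, it can help to see how confident the forest was when it got them wrong. The following sketch, which retrains the same model for self-containment, uses `predict_proba` for that purpose; the `confidence` array is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
proba = rf.predict_proba(X_test)  # columns follow rf.classes_, here [0, 1]

misclassified = np.where(y_pred != y_test)[0]
# Probability the forest assigned to the (wrong) class it predicted.
confidence = proba[misclassified, y_pred[misclassified]]

for idx, conf in zip(misclassified, confidence):
    true_name = cancer.target_names[y_test[idx]]
    pred_name = cancer.target_names[y_pred[idx]]
    print(f"Test index {idx}: true={true_name}, predicted={pred_name}, confidence={conf:.2f}")
```

Errors with confidence near 0.5 are borderline cases, whereas confident errors are better candidates for a closer look at the features or labels.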



34 Train a Bagging Classifier and compare its performance with a single Decision Tree Classifier
-import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a single Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Create a Bagging Classifier with Decision Trees as base estimators
bagging_classifier = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),  # keyword was base_estimator before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)

# Train the Decision Tree Classifier
dt_classifier.fit(X_train, y_train)

# Train the Bagging Classifier
bagging_classifier.fit(X_train, y_train)

# Make predictions on the test set
dt_y_pred = dt_classifier.predict(X_test)
bagging_y_pred = bagging_classifier.predict(X_test)

# Calculate the accuracy of both models
dt_accuracy = accuracy_score(y_test, dt_y_pred)
bagging_accuracy = accuracy_score(y_test, bagging_y_pred)

# Print the accuracies
print(f"Decision Tree Classifier Accuracy: {dt_accuracy:.4f}")
print(f"Bagging Classifier Accuracy: {bagging_accuracy:.4f}")

# Compare the performance
if bagging_accuracy > dt_accuracy:
    print("\nBagging Classifier performs better than Decision Tree Classifier.")
elif bagging_accuracy < dt_accuracy:
    print("\nDecision Tree Classifier performs better than Bagging Classifier.")
else:
    print("\nBoth classifiers have the same accuracy.")

#Example of changing the base estimator parameters.
bagging_classifier_modified = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=5, random_state=42),
    n_estimators=100,
    random_state=42
)
bagging_classifier_modified.fit(X_train, y_train)
bagging_y_pred_modified = bagging_classifier_modified.predict(X_test)
bagging_accuracy_modified = accuracy_score(y_test, bagging_y_pred_modified)
print(f"Bagging Classifier Accuracy (max_depth=5): {bagging_accuracy_modified:.4f}")



35 Train a Random Forest Classifier and visualize the confusion matrix
-import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
class_names = cancer.target_names

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train a Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix for Random Forest Classifier')
plt.show()

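# Example of accessing the confusion matrix values.
# In load_breast_cancer, class 0 is malignant and class 1 is benign, so "positive" here means benign.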
print("\nConfusion Matrix Values:")
print(f"True Negatives (TN): {cm[0, 0]}")
print(f"False Positives (FP): {cm[0, 1]}")
print(f"False Negatives (FN): {cm[1, 0]}")
print(f"True Positives (TP): {cm[1, 1]}")
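
For a binary problem the four cells can also be unpacked in one step with `ravel()`, and precision and recall then follow directly from the counts. This is a minimal sketch that retrains the same forest so it runs on its own; passing `labels=[0, 1]` is an explicit choice made here to pin the row and column order.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Rows are true labels, columns are predicted labels, so ravel() yields
# TN, FP, FN, TP with class 1 treated as the positive class.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision from counts: {precision:.4f} (sklearn: {precision_score(y_test, y_pred):.4f})")
print(f"Recall from counts:    {recall:.4f} (sklearn: {recall_score(y_test, y_pred):.4f})")
```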



36 Train a Stacking Classifier using Decision Trees, SVM, and Logistic Regression, and compare accuracy
-import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Generate a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the base estimators
estimators = [
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),  # probability=True for predict_proba
    ('lr', LogisticRegression(solver='liblinear', random_state=42))
]

# Define the final estimator (meta-learner)
final_estimator = LogisticRegression(solver='liblinear', random_state=42)

# Create the Stacking Classifier
stacking_clf = StackingClassifier(estimators=estimators, final_estimator=final_estimator)

# Train the Stacking Classifier
stacking_clf.fit(X_train, y_train)

# Make predictions on the test set
stacking_y_pred = stacking_clf.predict(X_test)

# Calculate the accuracy of the Stacking Classifier
stacking_accuracy = accuracy_score(y_test, stacking_y_pred)

print(f"Stacking Classifier Accuracy: {stacking_accuracy:.4f}")

# Train and evaluate individual models for comparison
dt_clf = DecisionTreeClassifier(random_state=42)
svm_clf = SVC(random_state=42)
lr_clf = LogisticRegression(solver='liblinear', random_state=42)

dt_clf.fit(X_train, y_train)
svm_clf.fit(X_train, y_train)
lr_clf.fit(X_train, y_train)

dt_y_pred = dt_clf.predict(X_test)
svm_y_pred = svm_clf.predict(X_test)
lr_y_pred = lr_clf.predict(X_test)

dt_accuracy = accuracy_score(y_test, dt_y_pred)
svm_accuracy = accuracy_score(y_test, svm_y_pred)
lr_accuracy = accuracy_score(y_test, lr_y_pred)

print(f"Decision Tree Accuracy: {dt_accuracy:.4f}")
print(f"SVM Accuracy: {svm_accuracy:.4f}")
print(f"Logistic Regression Accuracy: {lr_accuracy:.4f}")

# Compare the performance
if stacking_accuracy > max(dt_accuracy, svm_accuracy, lr_accuracy):
    print("\nStacking Classifier performs better than individual models.")
else:
    print("\nStacking Classifier did not outperform the best individual model.")

#Example of changing final estimator.
final_estimator_modified = DecisionTreeClassifier(random_state = 42)
stacking_clf_modified = StackingClassifier(estimators=estimators, final_estimator=final_estimator_modified)
stacking_clf_modified.fit(X_train, y_train)
stacking_y_pred_modified = stacking_clf_modified.predict(X_test)
stacking_accuracy_modified = accuracy_score(y_test, stacking_y_pred_modified)
print(f"Stacking Classifier Accuracy (final estimator = Decision Tree): {stacking_accuracy_modified:.4f}")
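
When the meta-learner is a logistic regression, its coefficients give a rough indication of how much it relies on each base model. The sketch below assumes the same three base estimators and a binary target, in which case each base model contributes a single column to the meta-features; `meta_weights` is a name introduced for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

estimators = [
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
    ('lr', LogisticRegression(solver='liblinear', random_state=42)),
]
stacking_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(solver='liblinear', random_state=42),
    cv=5,  # meta-features come from out-of-fold predictions
)
stacking_clf.fit(X_train, y_train)

# One coefficient per base model for a binary target.
meta_weights = dict(zip(stacking_clf.named_estimators_,
                        stacking_clf.final_estimator_.coef_[0]))

for name, weight in meta_weights.items():
    print(f"{name}: {weight:.4f}")
```

These weights are only comparable because all three inputs are probabilities on the same 0 to 1 scale; they should be read as a rough guide rather than a strict ranking.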



37 Train a Random Forest Classifier and print the top 5 most important features
-import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
feature_names = cancer.feature_names

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest Classifier
rf_classifier.fit(X_train, y_train)

# Get feature importance scores
feature_importance = rf_classifier.feature_importances_

# Create a DataFrame to display feature importance
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importance})

# Sort the DataFrame by importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print the top 5 most important features
print("Top 5 Most Important Features:")
print(feature_importance_df.head(5))


38 Train a Bagging Classifier and evaluate performance using Precision, Recall, and F1-score
-import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

# Generate a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Bagging Classifier with Decision Trees as base estimators
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # keyword was base_estimator before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)

# Train the Bagging Classifier
bagging_clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = bagging_clf.predict(X_test)

# Calculate Precision, Recall, and F1-score
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the evaluation metrics
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

#Example of changing averaging method for multiclass problems.
#Consider a multiclass problem:

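# n_informative must satisfy n_classes * n_clusters_per_class <= 2 ** n_informative;
# the default (2) is too small for 3 classes and raises a ValueError, so it is set to 4 here.
X_multi, y_multi = make_classification(n_samples=1000, n_features=20, n_informative=4, n_classes=3, random_state=42)
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X_multi, y_multi, test_size=0.3, random_state=42)
bagging_clf_multi = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)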
bagging_clf_multi.fit(X_train_multi, y_train_multi)
y_pred_multi = bagging_clf_multi.predict(X_test_multi)
precision_macro = precision_score(y_test_multi, y_pred_multi, average='macro')
recall_macro = recall_score(y_test_multi, y_pred_multi, average='macro')
f1_macro = f1_score(y_test_multi, y_pred_multi, average='macro')

print(f"\nMulticlass Precision (macro): {precision_macro:.4f}")
print(f"Multiclass Recall (macro): {recall_macro:.4f}")
print(f"Multiclass F1-score (macro): {f1_macro:.4f}")
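
Instead of calling each metric separately, `classification_report` prints per-class precision, recall, and F1 together with macro and weighted averages. A brief sketch on the binary dataset follows; the use of 50 estimators and the `report` dictionary are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report, precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The base estimator is passed positionally so the sketch does not depend on
# the keyword name, which differs between scikit-learn releases.
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Human-readable table of per-class metrics.
print(classification_report(y_test, y_pred, digits=4))

# The same numbers as a nested dict, keyed by class label as a string.
report = classification_report(y_test, y_pred, output_dict=True)
print(f"Precision for class 1: {report['1']['precision']:.4f}")
```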


39 Train a Random Forest Classifier and analyze the effect of max_depth on accuracy
-import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the range of max_depth values to test
max_depth_range = [None, 5, 10, 15, 20, 30, 40, 50]

# Store the accuracies for each max_depth
accuracies = []

# Train and evaluate the Random Forest Classifier for each max_depth
for max_depth in max_depth_range:
    rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=max_depth, random_state=42)
    rf_classifier.fit(X_train, y_train)
    y_pred = rf_classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"Max Depth: {max_depth}, Accuracy: {accuracy:.4f}")

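# Plot the accuracy vs. max_depth
# Depths are converted to strings so that None gets its own tick instead of being dropped
plt.plot([str(d) for d in max_depth_range], accuracies, marker='o')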
plt.title("Random Forest Accuracy vs. Max Depth")
plt.xlabel("Max Depth")
plt.ylabel("Accuracy")
plt.grid(True)
plt.show()

# Find the best max_depth and corresponding accuracy
best_max_depth = max_depth_range[np.argmax(accuracies)]
best_accuracy = max(accuracies)

print(f"\nBest Max Depth: {best_max_depth}, Best Accuracy: {best_accuracy:.4f}")

#Example of changing n_estimators while keeping max_depth constant.
rf_classifier_modified = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42)
rf_classifier_modified.fit(X_train, y_train)
y_pred_modified = rf_classifier_modified.predict(X_test)
accuracy_modified = accuracy_score(y_test, y_pred_modified)
print(f"Random Forest Classifier Accuracy (n_estimators=50, max_depth=10): {accuracy_modified:.4f}")
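Accuracy often stops changing once `max_depth` exceeds the depth the trees would reach on their own. The sketch below is one way to check this, assuming the same Breast Cancer split as above; `rf_unlimited` and `depths` are names introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same data and split as in the example above
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a forest with no depth limit
rf_unlimited = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)
rf_unlimited.fit(X_train, y_train)

# Each fitted tree reports the depth it actually reached
depths = [tree.get_depth() for tree in rf_unlimited.estimators_]
print(f"Tree depths with max_depth=None: min={min(depths)}, max={max(depths)}")
```

Any `max_depth` value at or above the largest reported depth should produce the same forest as `max_depth=None` for a fixed `random_state`.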



40 Train a Bagging Regressor using different base estimators (DecisionTree and KNeighbors) and compare performance
-import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Generate a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create Bagging Regressor with Decision Tree as base estimator
bagging_dt = BaggingRegressor(
    estimator=DecisionTreeRegressor(random_state=42),  # named base_estimator before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)

# Create Bagging Regressor with KNeighbors Regressor as base estimator
bagging_knn = BaggingRegressor(
    estimator=KNeighborsRegressor(),  # named base_estimator before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)

# Train the Bagging Regressors
bagging_dt.fit(X_train, y_train)
bagging_knn.fit(X_train, y_train)

# Make predictions on the test set
dt_y_pred = bagging_dt.predict(X_test)
knn_y_pred = bagging_knn.predict(X_test)

# Calculate the Mean Squared Error (MSE) for both models
dt_mse = mean_squared_error(y_test, dt_y_pred)
knn_mse = mean_squared_error(y_test, knn_y_pred)

# Print the MSE for both models
print(f"Bagging Regressor (Decision Tree) MSE: {dt_mse:.4f}")
print(f"Bagging Regressor (KNeighbors) MSE: {knn_mse:.4f}")

# Compare the performance
if dt_mse < knn_mse:
    print("\nBagging Regressor (Decision Tree) performs better.")
elif dt_mse > knn_mse:
    print("\nBagging Regressor (KNeighbors) performs better.")
else:
    print("\nBoth models have the same performance.")

#Example of changing the KNeighbors Regressor parameters within the Bagging regressor.
bagging_knn_modified = BaggingRegressor(
    estimator=KNeighborsRegressor(n_neighbors=5, weights='distance'),  # named base_estimator before scikit-learn 1.2
    n_estimators = 100,
    random_state = 42
)

bagging_knn_modified.fit(X_train, y_train)
knn_y_pred_modified = bagging_knn_modified.predict(X_test)
knn_mse_modified = mean_squared_error(y_test, knn_y_pred_modified)
print(f"Bagging Regressor (KNeighbors, n_neighbors=5, weights='distance') MSE: {knn_mse_modified:.4f}")



41 Train a Random Forest Classifier and evaluate its performance using ROC-AUC Score
-import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest Classifier
rf_classifier.fit(X_train, y_train)

# Make probability predictions on the test set
y_prob = rf_classifier.predict_proba(X_test)[:, 1]  # Get probabilities for the positive class

# Calculate the ROC-AUC score
roc_auc = roc_auc_score(y_test, y_prob)

# Print the ROC-AUC score
print(f"Random Forest Classifier ROC-AUC: {roc_auc:.4f}")

# Example of changing the number of estimators and recalculating the ROC-AUC.
rf_classifier_modified = RandomForestClassifier(n_estimators = 50, random_state = 42)
rf_classifier_modified.fit(X_train, y_train)
y_prob_modified = rf_classifier_modified.predict_proba(X_test)[:,1]
roc_auc_modified = roc_auc_score(y_test, y_prob_modified)

print(f"Random Forest Classifier ROC-AUC (n_estimators=50): {roc_auc_modified:.4f}")
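The ROC-AUC score summarizes a curve of true positive rate against false positive rate over all thresholds. As a sketch, assuming the same dataset and split as above, the curve can be computed explicitly with `roc_curve` and integrated with `auc`; the names `fpr`, `tpr`, and `auc_from_curve` are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Same data and split as in the example above
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
y_prob = rf_classifier.predict_proba(X_test)[:, 1]

# False positive rate and true positive rate at each decision threshold
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

# Area under the curve by the trapezoidal rule
auc_from_curve = auc(fpr, tpr)
print(f"AUC from roc_curve: {auc_from_curve:.4f}")
print(f"roc_auc_score:      {roc_auc_score(y_test, y_prob):.4f}")
```

The `fpr` and `tpr` arrays can be passed to `plt.plot` to draw the curve in the same way as the other plots in these notes.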



42 Train a Bagging Classifier and evaluate its performance using cross-validation.
-import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Generate a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Create a Bagging Classifier
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),  # named base_estimator before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)

# Perform cross-validation using KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # 5-fold cross-validation

# Calculate cross-validation scores (accuracy)
cv_scores = cross_val_score(bagging_clf, X, y, cv=kf, scoring='accuracy')

# Print the cross-validation scores
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Mean Cross-Validation Accuracy: {np.mean(cv_scores):.4f}")
print(f"Standard Deviation of Cross-Validation Accuracy: {np.std(cv_scores):.4f}")

# Example of changing the scoring method.
cv_scores_f1 = cross_val_score(bagging_clf, X, y, cv=kf, scoring='f1')

print(f"\nCross-Validation F1-scores: {cv_scores_f1}")
print(f"Mean Cross-Validation F1-score: {np.mean(cv_scores_f1):.4f}")
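Instead of calling `cross_val_score` once per metric, several metrics can be collected in a single pass. The sketch below assumes the same synthetic dataset as above and uses `cross_validate` with `StratifiedKFold`, which keeps the class ratio similar in every fold. The value `n_estimators=25` is an arbitrary smaller setting chosen only to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# The base estimator is passed positionally, so the call does not depend on
# whether the keyword is `estimator` or the older `base_estimator`
bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=25,
    random_state=42,
)

# Stratified folds preserve the class proportions in each split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One cross-validation run that records both metrics
cv_results = cross_validate(bagging_clf, X, y, cv=skf, scoring=["accuracy", "f1"])

for metric in ("accuracy", "f1"):
    scores = cv_results[f"test_{metric}"]
    print(f"{metric}: mean={np.mean(scores):.4f}, std={np.std(scores):.4f}")
```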


43 Train a Random Forest Classifier and plot the Precision-Recall curve
-import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, auc

# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest Classifier
rf_classifier.fit(X_train, y_train)

# Make probability predictions on the test set
y_prob = rf_classifier.predict_proba(X_test)[:, 1]  # Get probabilities for the positive class

# Calculate Precision-Recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

# Calculate the area under the Precision-Recall curve (AUC-PR)
auc_pr = auc(recall, precision)

# Plot the Precision-Recall curve
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, label=f'AUC-PR = {auc_pr:.4f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='lower left')
plt.grid(True)
plt.show()

#Example of changing RF parameters and replotting.
rf_classifier_modified = RandomForestClassifier(n_estimators = 50, max_depth = 5, random_state = 42)
rf_classifier_modified.fit(X_train, y_train)
y_prob_modified = rf_classifier_modified.predict_proba(X_test)[:, 1]
precision_modified, recall_modified, thresholds_modified = precision_recall_curve(y_test, y_prob_modified)
auc_pr_modified = auc(recall_modified, precision_modified)

plt.figure(figsize=(8, 6))
plt.plot(recall_modified, precision_modified, label=f'Modified AUC-PR = {auc_pr_modified:.4f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve (Modified RF)')
plt.legend(loc='lower left')
plt.grid(True)
plt.show()
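The area computed with `auc(recall, precision)` uses trapezoidal interpolation between points on the curve, which can be optimistic. A commonly used alternative summary is `average_precision_score`, which sums precision weighted by the step in recall. The sketch below, assuming the same dataset and split as above, prints both values side by side.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Same data and split as in the example above
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
y_prob = rf_classifier.predict_proba(X_test)[:, 1]

# Trapezoidal area under the precision-recall curve
precision, recall, _ = precision_recall_curve(y_test, y_prob)
auc_pr = auc(recall, precision)

# Step-wise summary of the same curve
avg_precision = average_precision_score(y_test, y_prob)

print(f"AUC-PR (trapezoidal): {auc_pr:.4f}")
print(f"Average precision:    {avg_precision:.4f}")
```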



44 Train a Stacking Classifier with Random Forest and Logistic Regression and compare accuracy
-import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
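from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression  # LogisticRegression lives in sklearn.linear_model, not sklearn.ensemble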
from sklearn.metrics import accuracy_score

# Generate a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the base estimators
estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('lr', LogisticRegression(solver='liblinear', random_state=42))
]

# Define the final estimator (meta-learner)
final_estimator = LogisticRegression(solver='liblinear', random_state=42)

# Create the Stacking Classifier
stacking_clf = StackingClassifier(estimators=estimators, final_estimator=final_estimator)

# Train the Stacking Classifier
stacking_clf.fit(X_train, y_train)

# Make predictions on the test set
stacking_y_pred = stacking_clf.predict(X_test)

# Calculate the accuracy of the Stacking Classifier
stacking_accuracy = accuracy_score(y_test, stacking_y_pred)

print(f"Stacking Classifier Accuracy: {stacking_accuracy:.4f}")

# Train and evaluate individual models for comparison
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
lr_clf = LogisticRegression(solver='liblinear', random_state=42)

rf_clf.fit(X_train, y_train)
lr_clf.fit(X_train, y_train)

rf_y_pred = rf_clf.predict(X_test)
lr_y_pred = lr_clf.predict(X_test)

rf_accuracy = accuracy_score(y_test, rf_y_pred)
lr_accuracy = accuracy_score(y_test, lr_y_pred)

print(f"Random Forest Accuracy: {rf_accuracy:.4f}")
print(f"Logistic Regression Accuracy: {lr_accuracy:.4f}")

# Compare the performance
if stacking_accuracy > max(rf_accuracy, lr_accuracy):
    print("\nStacking Classifier performs better than individual models.")
else:
    print("\nStacking Classifier did not outperform the best individual model.")

#Example of changing the final estimator.
final_estimator_modified = RandomForestClassifier(n_estimators=50, random_state=42)
stacking_clf_modified = StackingClassifier(estimators=estimators, final_estimator=final_estimator_modified)
stacking_clf_modified.fit(X_train, y_train)
stacking_y_pred_modified = stacking_clf_modified.predict(X_test)
stacking_accuracy_modified = accuracy_score(y_test, stacking_y_pred_modified)
print(f"Stacking Classifier Accuracy (final estimator = Random Forest): {stacking_accuracy_modified:.4f}")
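It can help to look at what the meta-learner actually receives. In the sketch below, which assumes the same synthetic dataset and split as above, `transform` returns the base models' predictions that are fed to the final estimator. For a binary problem each base model that uses `predict_proba` contributes one column. The settings `n_estimators=50` and `max_iter=1000` are arbitrary choices for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("lr", LogisticRegression(max_iter=1000)),
]

# cv=5 means the meta-learner is trained on out-of-fold predictions of the base models
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

# One column per base model: its predicted probability of the positive class
meta_features = stack.transform(X_test)
print(f"Meta-feature matrix shape: {meta_features.shape}")

# Weights the meta-learner assigns to each base model's prediction
for (name, _), weight in zip(estimators, stack.final_estimator_.coef_[0]):
    print(f"Meta-learner weight for {name}: {weight:.4f}")
```

A much larger weight on one base model suggests the stack is mostly relying on that model, which is one possible reason stacking does not beat the best individual model.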



45 Train a Bagging Regressor with different levels of bootstrap samples and compare performance.
-import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Generate a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the range of bootstrap sample ratios to test
# A float max_samples is a fraction of the training set and must lie in (0, 1]
bootstrap_ratios = [0.3, 0.5, 0.7, 0.9, 1.0]  # 1.0 is the default

# Store the MSE for each bootstrap ratio
mses = []

# Train and evaluate the Bagging Regressor for each bootstrap ratio
for ratio in bootstrap_ratios:
    bagging_regressor = BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=42),  # named base_estimator before scikit-learn 1.2
        n_estimators=100,
        max_samples=ratio,  # Control bootstrap sample size
        random_state=42
    )
    bagging_regressor.fit(X_train, y_train)
    y_pred = bagging_regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mses.append(mse)
    print(f"Bootstrap Ratio: {ratio}, MSE: {mse:.4f}")

# Plot the MSE vs. bootstrap ratio
plt.plot(bootstrap_ratios, mses, marker='o')
plt.title("Bagging Regressor MSE vs. Bootstrap Sample Ratio")
plt.xlabel("Bootstrap Sample Ratio (max_samples)")
plt.ylabel("Mean Squared Error (MSE)")
plt.grid(True)
plt.show()

# Find the best bootstrap ratio (lowest MSE)
best_ratio = bootstrap_ratios[np.argmin(mses)]
best_mse = min(mses)

print(f"\nBest Bootstrap Ratio: {best_ratio}, Best MSE: {best_mse:.4f}")

#Example of changing base estimator parameters.
bagging_regressor_modified = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=5, random_state=42),  # named base_estimator before scikit-learn 1.2
    n_estimators=100,
    max_samples=1.0,
    random_state=42
)
bagging_regressor_modified.fit(X_train, y_train)
y_pred_modified = bagging_regressor_modified.predict(X_test)
mse_modified = mean_squared_error(y_test, y_pred_modified)
print(f"Bagging Regressor MSE (max_depth=5): {mse_modified:.4f}")
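Because each bootstrap sample leaves out some training rows, those out-of-bag rows can act as a built-in validation set. The sketch below, assuming the same synthetic regression data as above, enables `oob_score=True` and compares the out-of-bag estimate with the test score. For a regressor, `oob_score_` is an R² value, not an MSE. The value `n_estimators=50` is an arbitrary choice to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The base estimator is passed positionally, so the call does not depend on
# whether the keyword is `estimator` or the older `base_estimator`
bagging_oob = BaggingRegressor(
    DecisionTreeRegressor(random_state=42),
    n_estimators=50,
    oob_score=True,  # requires bootstrap=True, which is the default
    random_state=42,
)
bagging_oob.fit(X_train, y_train)

# R^2 estimated from rows each tree did not see during training
print(f"Out-of-bag R^2: {bagging_oob.oob_score_:.4f}")
# R^2 on the held-out test set, for comparison
print(f"Test R^2:       {bagging_oob.score(X_test, y_test):.4f}")
```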