

## 🧠 Ensemble Learning – Quick Notes

### 🔍 Definition:

Combining **multiple models** to improve prediction accuracy compared to a single model.

---

### ✅ Why Use It?

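* Reduces prediction **error** by averaging out the mistakes of individual models
* Often improves **accuracy** over the best single model
* Helps with **overfitting** (bagging reduces variance) and **underfitting** (boosting reduces bias)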

---

### 🎯 Main Types:

1. **Bagging (Bootstrap Aggregating)**

   * Trains models **independently** on bootstrap samples (random subsets drawn with replacement)
   * Example: **Random Forest**

2. **Boosting**

   * Trains models **sequentially**; each new model focuses on the errors of the previous ones
   * Examples: **AdaBoost, XGBoost**

3. **Voting**

   * Combines different models' predictions by **majority vote** or **average**
   * Example: **Voting Classifier**

---

### 📦 How it Works:

* Same data → Multiple models → Combine results

---

### 🧪 Real-life Examples:

* Spam detection
* Credit scoring
* Stock price prediction

---

### 💡 Key Point:

> Ensemble = **Team of models working together** for better results

---


In [6]:
# 1. Import libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

print("Step 1: All libraries imported successfully.")
# 2. Load the Iris dataset
X, y = load_iris(return_X_y=True)

print("Step 2: Loaded the Iris dataset.")
print("Features (X) shape:", X.shape)
print("Target (y) shape:", y.shape)
# Print the first few rows of the features and targets
print("First 5 rows of features (X):\n", X[:5])
print("First 5 rows of target (y):\n", y[:5])

Step 1: All libraries imported successfully.
Step 2: Loaded the Iris dataset.
Features (X) shape: (150, 4)
Target (y) shape: (150,)
First 5 rows of features (X):
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
First 5 rows of target (y):
 [0 0 0 0 0]


In [4]:
# 3. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Step 3: Data split into training and testing sets.")
print("Training set size:", X_train.shape[0])
print("Testing set size:", X_test.shape[0])


Step 3: Data split into training and testing sets.
Training set size: 105
Testing set size: 45


In [7]:
# 4. Define base models
model1 = LogisticRegression(max_iter=200)
model2 = DecisionTreeClassifier()
model3 = SVC(probability=True)  # Important: probability=True needed for soft voting

print("Step 4: Defined 3 base models:")
print("- Logistic Regression")
print("- Decision Tree")
print("- Support Vector Machine (SVM)")


Step 4: Defined 3 base models:
- Logistic Regression
- Decision Tree
- Support Vector Machine (SVM)


In [8]:
# 5. Create the VotingClassifier ensemble
ensemble = VotingClassifier(estimators=[
    ('lr', model1),   # 'lr' is just a name label
    ('dt', model2),
    ('svc', model3)
], voting='soft')  # Use 'soft' for probability averaging

print("Step 5: Created VotingClassifier with soft voting.")


Step 5: Created VotingClassifier with soft voting.


In [9]:
# 6. Train the ensemble model
ensemble.fit(X_train, y_train)

print("Step 6: Trained the VotingClassifier on training data.")
# 7. Predict using the ensemble model
y_pred = ensemble.predict(X_test)

print("Step 7: Made predictions on the test set.")
print("Predicted labels:", y_pred)
# 8. Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Step 8: Calculated accuracy of the ensemble model.")
print("Accuracy:", accuracy)


Step 6: Trained the VotingClassifier on training data.
Step 7: Made predictions on the test set.
Predicted labels: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
 0 0 0 2 1 1 0 0]
Step 8: Calculated accuracy of the ensemble model.
Accuracy: 1.0




## 🔶 1. **Max Voting (Majority Voting)**

**Use case**: Mostly for **classification** problems.

### 🧠 Idea:

Each model votes for a class (like “yes” or “no”), and the class that gets the most votes is the final result.

### 🧾 Example:

Suppose you have 3 models, and they predict:

* Model A: **Yes**
* Model B: **No**
* Model C: **Yes**

Now, count the votes:

* "Yes" → 2 votes
* "No" → 1 vote

✅ **Final prediction** = **Yes**
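
The vote count above can be sketched in a few lines of Python. This is a minimal illustration, not a library API; the helper name `majority_vote` is made up for this note.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label that appears most often in `votes`."""
    # most_common(1) returns [(label, count)] for the top label;
    # on a tie, the label that was seen first wins.
    return Counter(votes).most_common(1)[0][0]

votes = ["Yes", "No", "Yes"]  # predictions from Models A, B, and C
print(majority_vote(votes))   # Yes
```

With an even number of models, ties are possible, so it is worth deciding up front how to break them.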

### ➕ Pros:

* Simple and intuitive.
* Works well if individual models are diverse and moderately accurate.

---

## 🔶 2. **Average Voting (Averaging)**

**Use case**: Mostly for **regression** problems (predicting numbers). It also works for classification, provided you average the predicted class **probabilities** (soft voting) rather than the class labels themselves.

### 🧠 Idea:

Take the average of the predictions from all models.

### 🧾 Example (Regression):

Model predictions:

* Model A: 4.2
* Model B: 5.0
* Model C: 4.8

Average = (4.2 + 5.0 + 4.8) / 3 = **4.67**

✅ **Final prediction** = **4.67**
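
The same arithmetic as a quick check in NumPy (the array name `predictions` is just for illustration):

```python
import numpy as np

predictions = np.array([4.2, 5.0, 4.8])  # Models A, B, and C
average = predictions.mean()             # simple unweighted mean
print(f"{average:.2f}")                  # 4.67
```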

### 🧾 Example (Classification with probabilities):

Class "Yes" probabilities:

* Model A: 0.60
* Model B: 0.80
* Model C: 0.70

Average probability for “Yes” = (0.60 + 0.80 + 0.70)/3 = **0.70**

✅ Final class = “Yes” (if we use a threshold like 0.5)
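
A small sketch of the same idea with full probability vectors. The "No" column is an assumption here (taken as 1 minus the "Yes" probability), and picking the class with the highest average probability is equivalent to the 0.5 threshold when there are only two classes.

```python
import numpy as np

classes = np.array(["No", "Yes"])

# One row per model: [P(No), P(Yes)].
# P(Yes) comes from the example above; P(No) is assumed to be 1 - P(Yes).
probas = np.array([
    [0.40, 0.60],  # Model A
    [0.20, 0.80],  # Model B
    [0.30, 0.70],  # Model C
])

mean_proba = probas.mean(axis=0)             # average probability per class
final_class = classes[mean_proba.argmax()]   # most probable class on average

print(f"P(Yes) = {mean_proba[1]:.2f}, final class = {final_class}")
```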

### ➕ Pros:

* Smooths out extreme predictions.
* Works well when models are reasonably calibrated.

---

## 🔶 3. **Weighted Voting (or Weighted Averaging)**

**Use case**: Both **classification and regression**.

### 🧠 Idea:

Same as max or average voting, but the models are not treated equally: models that perform better get more **weight**.

### 🧾 Example (Weighted Average - Regression):

Model predictions:

* Model A: 4.2 (weight = 0.2)
* Model B: 5.0 (weight = 0.5)
* Model C: 4.8 (weight = 0.3)

Final prediction = (4.2×0.2 + 5.0×0.5 + 4.8×0.3) = 0.84 + 2.5 + 1.44 = **4.78**

(The weights sum to 1, so no further division is needed.)

✅ Final prediction = **4.78**

### 🧾 Example (Weighted Soft Voting - Classification):

Class "Yes" probabilities and weights:

* Model A: 0.60 (weight = 0.2)
* Model B: 0.80 (weight = 0.5)
* Model C: 0.70 (weight = 0.3)

Weighted probability = (0.60×0.2 + 0.80×0.5 + 0.70×0.3) = 0.12 + 0.40 + 0.21 = **0.73**

✅ Final class = "Yes"
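
Both weighted examples can be reproduced with `np.average`, which divides by the sum of the weights, so it also works when the weights do not add up to 1. The variable names and the 0.5 threshold below are illustrative assumptions.

```python
import numpy as np

weights = np.array([0.2, 0.5, 0.3])  # Models A, B, and C

# Regression: weighted average of the predicted values
regression_preds = np.array([4.2, 5.0, 4.8])
weighted_value = np.average(regression_preds, weights=weights)

# Classification: weighted average of the "Yes" probabilities
yes_probas = np.array([0.60, 0.80, 0.70])
weighted_yes = np.average(yes_probas, weights=weights)
final_class = "Yes" if weighted_yes >= 0.5 else "No"  # assumed 0.5 threshold

print(f"{weighted_value:.2f} {weighted_yes:.2f} {final_class}")  # 4.78 0.73 Yes
```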

### ➕ Pros:

* Gives more power to better models.
* More flexible and often more accurate than equal-weight methods.

---

## 🔚 Summary

| Method              | Use case                   | Idea                                   |
| ------------------- | -------------------------- | -------------------------------------- |
| **Max Voting**      | Classification             | Pick the class that most models choose |
| **Average Voting**  | Regression / Probabilities | Take the average of all predictions    |
| **Weighted Voting** | Both                       | Give more weight to better models      |
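
For classification, the three rows of the table map roughly onto options of scikit-learn's `VotingClassifier`: `voting="hard"` for max voting, `voting="soft"` for averaging probabilities, and the `weights` argument for weighted voting. The sketch below assumes the Iris data and the same three base models used elsewhere in these notes; the weights are arbitrary illustrative values, and `make_estimators` is a hypothetical helper.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

def make_estimators():
    # Fresh, unfitted base models for each ensemble
    return [
        ("dt", DecisionTreeClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),  # needed for soft voting
        ("knn", KNeighborsClassifier()),
    ]

ensembles = {
    "max (hard) voting": VotingClassifier(make_estimators(), voting="hard"),
    "average (soft) voting": VotingClassifier(make_estimators(), voting="soft"),
    "weighted soft voting": VotingClassifier(
        make_estimators(), voting="soft", weights=[0.5, 0.3, 0.2]
    ),
}

scores = {}
for name, clf in ensembles.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # accuracy on the test split
    print(f"{name}: {scores[name]:.3f}")
```

On a dataset as easy as Iris the three variants will likely score almost the same; the differences matter more when the base models disagree often.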



In [20]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from collections import Counter
import pandas as pd

# -----------------------------
# Step 1: Load and Inspect the Data
# -----------------------------

# Load the Iris dataset
data = load_iris()

# Convert to DataFrame for easy inspection
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("### Iris Dataset (First 5 Rows):")
print(df.head())  # Prints the first 5 rows of the dataset

# ------------------------------
# Step 2: Prepare the Data
# ------------------------------

X = data.data  # Feature matrix
y = data.target  # Target vector (labels)

# Split data into training and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("\n### Shape of Train and Test Data:")
print(f"Training set shape: {X_train.shape}, Test set shape: {X_test.shape}")

# ------------------------------
# Step 3: Initialize Models
# ------------------------------

# Initialize three different models
model1 = DecisionTreeClassifier(random_state=42)
model2 = SVC(probability=True, random_state=42)
model3 = KNeighborsClassifier()

print("\n### Models Initialized:")
print("1. Decision Tree Classifier")
print("2. Support Vector Machine (SVC)")
print("3. k-Nearest Neighbors (KNN)")

# ------------------------------
# Step 4: Train Models
# ------------------------------

# Train the models on the training data
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)
model3.fit(X_train, y_train)

print("\n### Models Trained on Training Data")

# ------------------------------
# Step 5: Make Predictions
# ------------------------------

# Predict with each model on the test data
pred1 = model1.predict(X_test)
pred2 = model2.predict(X_test)
pred3 = model3.predict(X_test)

print("\n### Predictions from each model on Test Data:")
print("Model 1 (Decision Tree) Predictions:", pred1)
print("Model 2 (SVC) Predictions:", pred2)
print("Model 3 (KNN) Predictions:", pred3)

# ------------------------------
# Step 6: Max Voting (Majority Voting)
# ------------------------------

print("\n### Applying Max Voting (Majority Voting):")

max_voting_preds = []
for p1, p2, p3 in zip(pred1, pred2, pred3):
    votes = [p1, p2, p3]
    vote_counts = Counter(votes)
    final_pred = vote_counts.most_common(1)[0][0]
    max_voting_preds.append(final_pred)

print("Max Voting Final Predictions:", max_voting_preds)

# ------------------------------
# Step 7: Average Voting (Soft Voting)
# ------------------------------

print("\n### Applying Average Voting:")

# Class labels are categories, not quantities, so averaging the labels
# themselves can produce a class that no model predicted (votes of
# 0, 0, 2 average to 0.67 and round to 1). Average the predicted class
# probabilities instead and pick the most probable class.
probas = np.array([m.predict_proba(X_test) for m in (model1, model2, model3)])
avg_proba = probas.mean(axis=0)  # shape: (n_samples, n_classes)
avg_voting_preds = model1.classes_[avg_proba.argmax(axis=1)]

print("Average Voting Final Predictions:", avg_voting_preds)

# ------------------------------
# Step 8: Weighted Voting
# ------------------------------

print("\n### Applying Weighted Voting:")

# Define weights for each model (Model 1 has the highest weight)
weights = [0.5, 0.3, 0.2]

# Weighted average of the class probabilities, then pick the top class
weighted_proba = np.average(probas, axis=0, weights=weights)
weighted_preds = model1.classes_[weighted_proba.argmax(axis=1)]

print("Weighted Voting Final Predictions:", weighted_preds)

# ------------------------------
# Step 9: Evaluate Model Performance
# ------------------------------

print("\n### Evaluating the Accuracy of Each Voting Method:")

# Calculate accuracy for each voting method
accuracy_max = accuracy_score(y_test, max_voting_preds)
accuracy_avg = accuracy_score(y_test, avg_voting_preds)
accuracy_weighted = accuracy_score(y_test, weighted_preds)

print(f"Accuracy of Max Voting: {accuracy_max}")
print(f"Accuracy of Average Voting: {accuracy_avg}")
print(f"Accuracy of Weighted Voting: {accuracy_weighted}")

# ------------------------------
# Final Notes
# ------------------------------
print("\n### Summary of Results:")
print("Max Voting uses majority class voting for each sample.")
print("Average Voting averages the predicted class probabilities and picks the most probable class.")
print("Weighted Voting does the same, but gives a higher weight to the models judged more reliable.")


### Iris Dataset (First 5 Rows):
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  

### Shape of Train and Test Data:
Training set shape: (105, 4), Test set shape: (45, 4)

### Models Initialized:
1. Decision Tree Classifier
2. Support Vector Machine (SVC)
3. k-Nearest Neighbors (KNN)

### Models Trained on Training Data

### Predictions from each model on Test Data:
Model 1 (Decision Tree) Predictions: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
 0 0 0 2 1 1 0 0]


---

# 🧾 Bagging (Bootstrap Aggregating) – Complete Interview Notes

---

## ✅ 1. Definition

**Bagging (Bootstrap Aggregating)** is an ensemble machine learning technique used to improve the **accuracy** and **stability** of models by combining predictions from **multiple learners** trained on random subsets of the training data.

---

## ✅ 2. Purpose

* Reduce **variance** of a model (especially decision trees).
* Prevent **overfitting**.
* Improve **generalization** on unseen data.

---

## ✅ 3. Intuition (Simple Explanation)

Instead of training one model on the whole dataset:

* Train several models on **different random samples** (with replacement).
* Let them **vote** (for classification) or **average** (for regression).
* Final prediction = combined result of all models.

🎯 Think of asking multiple people the same question and taking the **majority vote** – more reliable than trusting just one person.

---

## ✅ 4. How Bagging Works (Step-by-Step)

1. **Bootstrap Sampling**:

   * Randomly create `n` new training sets from the original dataset, each usually the same size as the original.
   * Sampling is done **with replacement**, so some data points repeat within a sample and others are left out of it.

2. **Model Training**:

   * Train a **separate base learner** (e.g., decision tree) on each bootstrapped dataset.

3. **Prediction**:

   * **Classification**: Use **majority voting** from all models.
   * **Regression**: Take the **average** of predictions.
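
The three steps above can be sketched by hand with NumPy and decision trees. This is a minimal illustration under assumed settings (Iris data, 10 trees, fixed seeds), not how `BaggingClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rng = np.random.default_rng(42)
n_models = 10                 # illustrative choice
n_samples = len(X_train)
n_classes = len(np.unique(y))

# Steps 1 and 2: draw a bootstrap sample, then fit one tree on it
trees = []
for _ in range(n_models):
    idx = rng.integers(0, n_samples, size=n_samples)  # indices drawn with replacement
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: majority vote across the trees for each test sample
all_preds = np.array([t.predict(X_test) for t in trees])  # (n_models, n_test)
bagged_pred = np.array(
    [np.bincount(col, minlength=n_classes).argmax() for col in all_preds.T]
)

accuracy = (bagged_pred == y_test).mean()
print(f"Manual bagging accuracy: {accuracy:.3f}")
```

Because each tree sees a different resample, the trees differ even though they share the same settings; that diversity is what the vote averages over.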

---

## ✅ 5. Diagram (Conceptual View)

```
Original Data
     ↓
Bootstrap Samples (with replacement)
     ↓          ↓          ↓
 Model 1    Model 2    Model 3   ... Model n
     ↓          ↓          ↓
   Prediction1 Prediction2 Prediction3 ...
     ↓
Final Output (Majority Vote / Average)
```

---

## ✅ 6. Example in Python (Classification)

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

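# Bagging with Decision Tree
# The base learner is passed positionally because the keyword was renamed
# from `base_estimator` to `estimator` in scikit-learn 1.2.
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=42)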
model.fit(X_train, y_train)

# Prediction
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

---

## ✅ 7. When to Use Bagging

* When a **model overfits** the training data (high variance).
* When working with **unstable models** (e.g., decision trees).
* When you want **better performance** through model averaging.

---

## ✅ 8. Common Base Learners

* **Decision Tree** – most common (used in Random Forest)
* KNN
* Naive Bayes (less common due to low variance)

---

## ✅ 9. Real-Life Example: Random Forest

* **Random Forest** = Bagging + Random feature selection.
* It’s a collection of decision trees trained using bagging + a twist: at each split, it chooses a **random subset of features**.
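
A short sketch of this with scikit-learn's `RandomForestClassifier`, assuming the same Iris split as the bagging example. The parameter values are illustrative; `bootstrap=True` is the bagging part and `max_features` controls the random feature subset tried at each split.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,        # each tree is trained on a bootstrap sample
    max_features="sqrt",   # random subset of features considered at each split
    random_state=42,
)
forest.fit(X_train, y_train)

rf_accuracy = forest.score(X_test, y_test)
print(f"Random Forest accuracy: {rf_accuracy:.3f}")
```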

---

## ✅ 10. Pros and Cons

### ✅ Advantages:

* Reduces **variance** → better generalization.
* **Simple** to implement.
* Can run models **in parallel**.
* Usually improves **accuracy** without noticeably increasing bias.

### ❌ Disadvantages:

* Doesn't reduce **bias** (only variance).
* Less interpretable (many models).
* Computationally more expensive.

---

## ✅ 11. Bagging vs Boosting (Comparison)

| Feature        | Bagging                              | Boosting                             |
| -------------- | ------------------------------------ | ------------------------------------ |
| Goal           | Reduce variance                      | Reduce bias (and variance)           |
| Sampling       | Random sampling **with replacement** | Sequential training with reweighting |
| Model Training | In parallel                          | Sequential (dependent on previous)   |
| Overfitting    | Less prone to overfitting            | More prone (if not regularized)      |
| Common Example | Random Forest                        | AdaBoost, Gradient Boosting          |
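
To see the two approaches side by side, here is a rough comparison sketch using 5-fold cross-validation on Iris. The dataset, the number of estimators, and the seeds are assumptions for illustration; on such a small, easy dataset the scores will likely be close, so this shows the API difference more than a performance gap.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Bagging: full-depth trees fitted independently on bootstrap samples
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)

# Boosting: AdaBoost fits shallow trees one after another, giving more
# weight to the samples the earlier trees misclassified
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)

bagging_scores = cross_val_score(bagging, X, y, cv=5)
boosting_scores = cross_val_score(boosting, X, y, cv=5)

print(f"Bagging:  {bagging_scores.mean():.3f}")
print(f"Boosting: {boosting_scores.mean():.3f}")
```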

---

## ✅ 12. Summary (Interview Quick Talk)

* **Bagging** = **Bootstrap + Aggregation**.
* Train models on **random subsets** of data and **combine** predictions.
* Best used with **high-variance models** like decision trees.
* Popular example: **Random Forest**.
* Main goal: **reduce variance**, **improve stability**, **avoid overfitting**.

---

