**Bagging & Boosting KNN & Stacking**

### **Question 1: What is the fundamental idea behind ensemble techniques?**
### **How does bagging differ from boosting in terms of approach and objective?**

#### **Fundamental Idea Behind Ensemble Techniques**
The fundamental idea of **ensemble techniques** is to combine multiple individual models to create a more powerful and accurate final model.  
By aggregating the predictions of several base learners (often weak learners), ensemble methods help to **reduce errors**, **improve accuracy**, and **increase model robustness**.

Ensemble techniques mainly aim to:
- **Reduce variance** (e.g., Bagging)
- **Reduce bias** (e.g., Boosting)
- **Improve overall model generalization**

---

#### **Bagging (Bootstrap Aggregating)**

**Approach:**
- Bagging trains multiple models **independently** on different random subsets of the training data using **bootstrapping** (sampling with replacement).
- Final predictions are made by **averaging** (for regression) or **majority voting** (for classification).

**Objective:**
- To **reduce variance** and prevent overfitting by averaging out the errors of multiple models.

**Example:**  
Random Forest (an ensemble of Decision Trees using bagging).
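
The bagging procedure above can be sketched by hand. This is a minimal illustration, not a reference implementation: the synthetic dataset, the number of models (25) and all variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumed for illustration)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd number, so a binary majority vote cannot tie
all_preds = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

votes = np.array(all_preds)                           # shape: (n_models, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)  # majority vote for labels 0/1

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
single_acc = (single_tree.predict(X_test) == y_test).mean()
bagged_acc = (bagged_pred == y_test).mean()
print("Single tree accuracy :", single_acc)
print("Bagged trees accuracy:", bagged_acc)
```

Each tree sees a different resample of the training rows, and only the aggregated vote is used as the final prediction.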

---

#### **Boosting**

**Approach:**
- Boosting builds models **sequentially**, where each new model tries to **correct the errors** made by the previous ones.
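- In AdaBoost-style boosting, misclassified samples receive **higher weights** so that subsequent models focus more on difficult cases; gradient boosting instead fits each new model to the remaining errors (residuals).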

**Objective:**
- To **reduce bias** and convert weak learners into a strong one by iteratively improving performance.
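
The sequential nature of boosting can be observed with scikit-learn's `AdaBoostClassifier`. The sketch below assumes a synthetic dataset and 100 boosting rounds; it compares one decision stump with the boosted ensemble at several stages.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed for illustration)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A single weak learner: a depth-1 tree (decision stump)
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)

# scikit-learn's AdaBoost uses a depth-1 decision tree as its default base learner
booster = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# staged_score yields the test accuracy after each boosting round
staged_acc = list(booster.staged_score(X_test, y_test))

print("Single stump accuracy:", stump.score(X_test, y_test))
checkpoints = [n for n in (1, 10, 50, 100) if n <= len(staged_acc)]
for n in checkpoints:
    print(f"Boosted accuracy after {n} rounds: {staged_acc[n - 1]:.3f}")
```

Because each round depends on the errors of the previous ones, the learners cannot be trained independently as in bagging.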


#### **Key Differences Between Bagging and Boosting**

| **Aspect** | **Bagging** | **Boosting** |
|-------------|-------------|--------------|
| **Training Method** | Parallel (independent) | Sequential (dependent) |
| **Main Goal** | Reduce variance | Reduce bias |
| **Data Sampling** | Bootstrap sampling (with replacement) | Weighted sampling (focus on errors) |
| **Model Weighting** | Equal weight | Weighted by performance |
| **Overfitting Risk** | Low | High (if not regularized) |
| **Examples** | Random Forest | AdaBoost, XGBoost, Gradient Boosting |

### **Question 2: Explain how the Random Forest Classifier reduces overfitting compared to a single decision tree. Mention the role of two key hyperparameters in this process.**

---

#### **How Random Forest Reduces Overfitting**

A **single Decision Tree** tends to overfit because it learns every pattern — even noise — from the training data.  
The **Random Forest Classifier** overcomes this by building an **ensemble of many Decision Trees**, each trained on a random portion of the data and features.  
The idea is that **multiple uncorrelated models combined together** will generalize better than any individual one.

**Key Techniques Used:**
1. **Bootstrap Sampling (Bagging):**  
   - Each tree is trained on a random sample of the dataset (with replacement).  
   - This ensures trees are diverse and not all dependent on the same samples.

2. **Random Feature Selection:**  
   - At every node split, a random subset of features is considered.  
   - This prevents dominant features from biasing all trees and improves generalization.

3. **Aggregation (Voting/Averaging):**  
   - Predictions from all trees are combined using **majority voting** (for classification) or **averaging** (for regression).  
   - This ensemble averaging smooths out noise and lowers variance.

---

#### **Role of Two Key Hyperparameters**

1. ### **`n_estimators` (Number of Trees)**
   - Controls how many Decision Trees are built in the forest.
   - **Higher values** → more stable and accurate predictions (reduces variance).
   - However, increasing it too much can increase training time.
   - **Typical range:** 100–500 for most datasets.

   ✅ *Effect on Overfitting:*  
   A higher number of trees reduces the chance that the model overfits, since averaging many diverse trees smooths noisy predictions.

---

2. ### **`max_features` (Number of Features Considered per Split)**
   - Controls how many features the model considers when splitting a node.
   - Smaller values add **more randomness**, reducing correlation among trees.
   - Common defaults:
     - Classification → `max_features='sqrt'`
     - Regression → `max_features=1.0` (all features; older scikit-learn versions exposed this as `'auto'`, which has since been removed)

   ✅ *Effect on Overfitting:*  
   Using fewer features per split reduces overfitting by forcing trees to explore different parts of the feature space instead of all using the same dominant features.
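
As a rough illustration of both hyperparameters, the following sketch compares a fully grown decision tree with a Random Forest on noisy synthetic data. The dataset, the label-noise level (`flip_y=0.1`) and the chosen values are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (assumed for illustration)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A single unpruned tree memorizes the training set, including the noise
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# n_estimators: number of trees to average; max_features: features tried per split
forest = RandomForestClassifier(n_estimators=200, max_features='sqrt',
                                random_state=0).fit(X_train, y_train)

for name, model in [("Decision Tree", tree), ("Random Forest", forest)]:
    print(f"{name}: train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")
```

A large gap between training and test accuracy is the usual symptom of overfitting, so comparing the two gaps shows how much the ensemble helps on a given dataset.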

---

#### **Summary Table**

| **Aspect** | **Decision Tree** | **Random Forest** |
|-------------|------------------|-------------------|
| **Training Data** | Full dataset | Bootstrapped subsets |
| **Feature Selection** | All features | Random subset per node |
| **Model Correlation** | High | Low |
| **Overfitting Tendency** | High | Low |
| **Key Hyperparameters** | – | `n_estimators`, `max_features` |

### **Question 3: What is Stacking in Ensemble Learning?**
---

#### **Definition**
**Stacking (Stacked Generalization)** is an **ensemble learning technique** that combines the predictions of multiple different models (base learners) using another model (called a **meta-learner** or **blender**) to produce the final prediction.

Instead of simple averaging (as in Bagging) or sequential correction (as in Boosting), Stacking **learns how to best combine the outputs** of various models through another machine learning algorithm.


#### **How Stacking Differs from Bagging and Boosting**

| **Aspect** | **Bagging** | **Boosting** | **Stacking** |
|-------------|--------------|--------------|---------------|
| **Model Training** | Parallel | Sequential | Parallel (base) + one meta model |
| **Goal** | Reduce variance | Reduce bias | Combine diverse models |
| **Base Learners** | Usually same type (e.g., Decision Trees) | Usually same type (e.g., weak learners) | Different models (e.g., KNN, SVM, RF, LR) |
| **Combining Method** | Averaging or Voting | Weighted combination (by performance) | Meta-learner learns how to combine predictions |
| **Complexity** | Moderate | High | High (multi-level model) |
| **Overfitting** | Less likely | Can overfit if not regularized | Can overfit if too complex |

---

#### **Example Use Case**

**Use Case:** Predicting whether a customer will default on a loan.

- **Base Models (Level 0):**
  - Logistic Regression → captures linear relationships  
  - Random Forest → captures non-linear interactions  
  - KNN → works well on local patterns

- **Meta Model (Level 1):**
  - Logistic Regression → takes predictions from the above models as inputs and learns the optimal combination
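
A minimal sketch of this setup with scikit-learn's `StackingClassifier` is shown below. No loan dataset is provided here, so an imbalanced synthetic dataset stands in for it; the sample size, class weights and model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for loan data: about 20% positive ("default") class (assumed)
X, y = make_classification(n_samples=800, n_features=12, n_informative=6,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Level 0: diverse base models (scaling is bundled with the models that need it)
base_models = [
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
]

# Level 1: the meta-learner is trained on cross-validated base-model predictions
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(), cv=5)
stack.fit(X_train, y_train)
stack_acc = stack.score(X_test, y_test)
print("Stacking accuracy:", stack_acc)
```

Using cross-validated predictions (`cv=5`) as meta-features keeps the meta-learner from simply trusting whichever base model overfits the training data most.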
---


### **Question 4: What is the OOB Score in Random Forest, and why is it useful?**
### **How does it help in model evaluation without a separate validation set?**

---

#### **Definition**
The **OOB Score (Out-of-Bag Score)** is an internal performance evaluation method used in the **Random Forest algorithm**.  
When Random Forest builds each tree, it uses **bootstrap sampling**—that is, sampling the training data **with replacement**.  
This means some samples are not used to train a particular tree; these unused samples are known as **Out-of-Bag (OOB) samples**.

The OOB Score is computed by predicting these OOB samples using the trees that **did not see** them during training.  
It provides an approximately unbiased estimate of the model’s generalization performance, similar to cross-validation, but **without needing a separate validation set**.

---

#### **How the OOB Score Works**
1. Each Decision Tree is trained on a bootstrap sample that contains, on average, about **63%** of the distinct training samples.  
   The remaining **~37%** of samples are left out of that tree (its OOB samples).

2. Once all trees are trained, each OOB sample is passed only through the trees that did **not** include it in training.

3. The final prediction for each OOB sample is obtained by **aggregating (voting or averaging)** predictions from those trees.

4. The accuracy of these predictions is the **OOB Score**.
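
The steps above can be sketched as follows. The first part checks the ~63% figure empirically for one bootstrap sample; the second part asks scikit-learn to compute the OOB score via `oob_score=True`. The synthetic dataset and the forest size are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Part 1: fraction of distinct rows that land in one bootstrap sample of size n
n = 1000
rng = np.random.default_rng(0)
bootstrap_idx = rng.integers(0, n, size=n)          # sampling with replacement
in_bag_fraction = len(np.unique(bootstrap_idx)) / n
print("In-bag fraction:", in_bag_fraction)          # close to 1 - 1/e ≈ 0.632

# Part 2: OOB score from a Random Forest (synthetic data, assumed)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

print("OOB score     :", forest.oob_score_)         # estimated from training data only
print("Test accuracy :", forest.score(X_test, y_test))
```

The held-out test set is used here only to compare against the OOB estimate; the OOB score itself never touches it.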

---

#### **Why the OOB Score is Useful**

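- ✅ **No Separate Validation Set Needed:** every training sample is evaluated only by trees that did not see it during training.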


### **Question 5: Compare AdaBoost and Gradient Boosting**

---

#### **Overview**
Both **AdaBoost (Adaptive Boosting)** and **Gradient Boosting** are **boosting algorithms**, which combine multiple weak learners (usually shallow decision trees) sequentially to form a strong predictive model.  
However, they differ in **how they handle errors**, **adjust weights**, and **optimize performance**.

---

### **1️⃣ How They Handle Errors from Weak Learners**

| **Aspect** | **AdaBoost** | **Gradient Boosting** |
|-------------|--------------|------------------------|
| **Error Handling** | Focuses on **misclassified samples** by assigning them higher weights in the next iteration. | Focuses on the **residuals**, i.e. the negative gradient of the loss function (for squared-error loss, the difference between actual and predicted values). |
| **Learning Mechanism** | Reweights samples: Misclassified → higher weight, Correct → lower weight. | Fits the next tree to the **residual errors** of the previous model. |
| **Goal** | Correct previous classification mistakes by emphasizing hard examples. | Reduce overall prediction error by optimizing the gradient of the loss function. |

---

### **2️⃣ Weight Adjustment Mechanism**

| **Aspect** | **AdaBoost** | **Gradient Boosting** |
|-------------|--------------|------------------------|
| **Weight Update** | Each sample has a weight. After each iteration, misclassified samples get higher weights so that the next learner focuses more on them. | Each tree is trained on the **residuals** (gradients) instead of reweighting samples. The next model tries to correct the remaining errors. |
| **Model Weight (α)** | Each weak learner gets a weight based on its accuracy — better learners get higher importance. | Each new tree is scaled by a **learning rate (shrinkage parameter)** to control contribution and prevent overfitting. |
| **Error Emphasis** | Explicit — directly increases weights for misclassified points. | Implicit — errors are reduced by following the gradient of the loss function. |
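
Both mechanisms from the table can be sketched in a few lines. The first part performs one round of the classic discrete AdaBoost weight update on five hand-made labels; the second part runs gradient boosting by hand for squared-error loss, fitting each tree to the current residuals. The toy data, learning rate and number of rounds are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# --- AdaBoost: one round of sample reweighting (labels in {-1, +1}) ---
y_true = np.array([1, 1, -1, -1, 1])
y_hat = np.array([1, -1, -1, -1, 1])        # the weak learner misclassifies sample 1
w = np.full(5, 1 / 5)                       # start with uniform sample weights
err = w[y_true != y_hat].sum()              # weighted error of this learner
alpha = 0.5 * np.log((1 - err) / err)       # learner weight: larger when err is small
w = w * np.exp(-alpha * y_true * y_hat)     # raise weight if wrong, lower if correct
w /= w.sum()                                # renormalize to sum to 1
print("Updated sample weights:", w)

# --- Gradient boosting: fit each tree to the residuals (squared-error loss) ---
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1
n_rounds = 50
prediction = np.full_like(y, y.mean())      # initial model: a constant
mse_history = [np.mean((y - prediction) ** 2)]
for _ in range(n_rounds):
    residuals = y - prediction              # negative gradient of 0.5 * (y - F)^2
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # shrunken additive update
    mse_history.append(np.mean((y - prediction) ** 2))

print("Training MSE before boosting:", mse_history[0])
print("Training MSE after boosting :", mse_history[-1])
```

In the AdaBoost part the data never changes, only the weights do; in the gradient boosting part the weights never change, only the target each tree is fitted to.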

---

### **3️⃣ Typical Use Cases**

| **Aspect** | **AdaBoost** | **Gradient Boosting** |
|-------------|--------------|------------------------|
| **Best For** | Clean, less noisy datasets where errors can be corrected iteratively. | Complex


### **Question 6: Why does CatBoost perform well on categorical features without requiring extensive preprocessing?**
---

#### **Overview**
**CatBoost** (short for *Categorical Boosting*) is a gradient boosting algorithm developed by **Yandex**, designed specifically to handle **categorical data** efficiently.  
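Most gradient boosting workflows encode categorical columns manually before training, for example with **One-Hot Encoding** or **Label Encoding** (XGBoost and LightGBM also offer their own categorical handling in recent versions). CatBoost can **natively process categorical variables**, which reduces preprocessing effort and often works well on datasets with many categorical features.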

---

### **Why CatBoost Performs Well on Categorical Features**

CatBoost automatically converts categorical data into numerical representations using **statistical techniques** that capture relationships between category values and target labels.  
This allows it to **retain useful information** without inflating the feature space (as in one-hot encoding).
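
To make the idea of a target-based encoding concrete, here is a simplified sketch of an *ordered target statistic*: each row's category is encoded using only the target values of rows that come before it in a random permutation, smoothed with a prior. This is an illustration of the concept under stated assumptions, not CatBoost's actual implementation; the function name, the smoothing weight `a` and the toy data are hypothetical.

```python
import numpy as np

def ordered_target_stat(categories, targets, prior, a=1.0, seed=0):
    """Encode each row from the targets of *earlier* rows with the same category."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))   # random processing order
    sums, counts = {}, {}                      # running target sum / count per category
    encoded = np.empty(len(categories), dtype=float)
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # Smoothed mean of previously seen targets; equals the prior when n == 0
        encoded[i] = (s + a * prior) / (n + a)
        # Only now add the current row's target, so it never encodes itself
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded

# Toy data (hypothetical): a city column and a binary target
cities = ["A", "B", "A", "A", "B", "C"]
defaulted = [1, 0, 1, 0, 0, 1]
prior = float(np.mean(defaulted))

encoded = ordered_target_stat(cities, defaulted, prior)
for city, value in zip(cities, encoded):
    print(city, round(value, 3))
```

Because a row's own label is never part of its encoding, this scheme limits the target leakage that a plain per-category mean would introduce, and it produces one numeric column instead of one column per category.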

---

### **How CatBoost Handles Categorical Variables**

CatBoost introduces two innovative techniques to process categorical data effectively:

### **Example in Code**

```python
# Example: training a CatBoost classifier
# Note: Iris contains only numeric features. For data with categorical columns,
# pass their names or indices via the `cat_features` argument.
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load sample dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train CatBoost
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=4, verbose=0)
model.fit(X_train, y_train)

# Evaluate performance
print("CatBoost Accuracy:", model.score(X_test, y_test))
```


### **Question 7: KNN Classifier Assignment – Wine Dataset Analysis with Optimization**
---

#### **Objective**
In this task, we’ll:
1. Train a **K-Nearest Neighbors (KNN)** classifier on the Wine dataset.
2. Evaluate its performance **before and after scaling**.
3. Use **GridSearchCV** to find the best hyperparameters (K value and distance metric).

---

### **Step 1: Import Libraries and Load Dataset**

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd

# Load Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split into train and test (70-30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print("Training Samples:", X_train.shape[0])
print("Testing Samples:", X_test.shape[0])
```

### **Step 2: Train KNN (K=5) Without Scaling**

```python
# Initialize KNN with default K=5
knn_default = KNeighborsClassifier(n_neighbors=5)
knn_default.fit(X_train, y_train)

# Predictions
y_pred_default = knn_default.predict(X_test)

# Evaluation
print("Accuracy (Without Scaling):", accuracy_score(y_test, y_pred_default))
print("\nClassification Report (Without Scaling):\n", classification_report(y_test, y_pred_default))
```

### **Step 3: Apply StandardScaler and Retrain KNN**
Scaling is crucial for KNN because it uses distance-based metrics.
We’ll scale the features and observe the change in model performance.

```python
# Apply Standard Scaler (fit on the training split only, then reuse for the test split)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN again
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)

# Predictions
y_pred_scaled = knn_scaled.predict(X_test_scaled)

# Evaluation
print("Accuracy (With Scaling):", accuracy_score(y_test, y_pred_scaled))
print("\nClassification Report (With Scaling):\n", classification_report(y_test, y_pred_scaled))
```

### **Step 4: Optimize KNN using GridSearchCV**
We’ll search for the best:
- **Number of Neighbors (K)**: 1 to 20
- **Distance Metric**: Euclidean (`'minkowski', p=2`) and Manhattan (`'minkowski', p=1`)

```python
# Define parameter grid
param_grid = {
    'n_neighbors': list(range(1, 21)),
    'p': [1, 2]  # 1 = Manhattan, 2 = Euclidean
}

# Initialize GridSearchCV
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train_scaled, y_train)

# Best parameters and score
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)
```

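
One possible refinement, shown here only as a sketch, is to place the scaler inside a `Pipeline` so that it is re-fitted on the training part of every cross-validation fold rather than once on the whole training set. The step names `scaler` and `knn` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Scaling and KNN are tuned and evaluated as one unit
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
pipe_param_grid = {
    "knn__n_neighbors": list(range(1, 21)),
    "knn__p": [1, 2],  # 1 = Manhattan, 2 = Euclidean
}

grid_pipe = GridSearchCV(pipe, pipe_param_grid, cv=5, scoring="accuracy")
grid_pipe.fit(X_train, y_train)  # raw features: the pipeline handles scaling

pipe_test_acc = grid_pipe.score(X_test, y_test)
print("Best Parameters:", grid_pipe.best_params_)
print("Best Cross-Validation Accuracy:", grid_pipe.best_score_)
print("Test Accuracy:", pipe_test_acc)
```

With this layout the validation folds never influence the scaling statistics, so the cross-validation score is a slightly more honest estimate.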
### **Step 5: Train Optimized KNN Model and Evaluate**
We’ll use the best parameters found by GridSearchCV to train our final model.

```python
# Optimized KNN: best_estimator_ is already refitted on the full training set
# by GridSearchCV (refit=True is the default), so no extra fit call is needed
best_knn = grid.best_estimator_
y_pred_best = best_knn.predict(X_test_scaled)

# Evaluation
print("Optimized Model Accuracy:", accuracy_score(y_test, y_pred_best))
print("\nClassification Report (Optimized KNN):\n", classification_report(y_test, y_pred_best))
```


### **Question 8: PCA + KNN with Variance Analysis and Visualization**
---

#### **Objective**
Perform **Dimensionality Reduction** using **PCA (Principal Component Analysis)** on the **Breast Cancer dataset**,
and evaluate how PCA affects the performance of a **K-Nearest Neighbors (KNN)** classifier.

### **Step 1: Import Libraries and Load Dataset**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Load Dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print("Training samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])
```

### **Step 2: Scale the Data**
Scaling ensures all features contribute equally to PCA and distance-based algorithms like KNN.

```python
# Apply StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

### **Step 3: Apply PCA and Plot Scree Plot (Explained Variance Ratio)**
The scree plot helps visualize how much variance each principal component explains.
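
```python
# Apply PCA
pca = PCA()
pca.fit(X_train_scaled)

# Number components from 1 so the x-axis matches the component count
explained = pca.explained_variance_ratio_
components = np.arange(1, len(explained) + 1)

# Scree plot: per-component variance (bars) with the cumulative curve overlaid
plt.figure(figsize=(8, 5))
plt.bar(components, explained, alpha=0.5, label='Per component')
plt.plot(components, np.cumsum(explained), marker='o', label='Cumulative')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot - Explained Variance by Components')
plt.legend()
plt.grid(True)
plt.show()

# Print total components needed for 95% variance
total_variance_95 = np.argmax(np.cumsum(explained) >= 0.95) + 1
print(f"Number of components explaining 95% variance: {total_variance_95}")
```
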
### **Step 4: Transform Data Retaining 95% Variance**

```python
# PCA with 95% variance retained
pca_95 = PCA(n_components=0.95)
X_train_pca = pca_95.fit_transform(X_train_scaled)
X_test_pca = pca_95.transform(X_test_scaled)

print("Original feature shape:", X_train.shape)
print("Reduced feature shape:", X_train_pca.shape)
```

### **Step 5: Train KNN on Original and PCA-Transformed Data**
We’ll compare model performance before and after applying PCA.

```python
# KNN on original scaled data
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)

# KNN on PCA-transformed data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)

# Compare accuracies
acc_original = accuracy_score(y_test, y_pred_original)
acc_pca = accuracy_score(y_test, y_pred_pca)

print("KNN Accuracy (Original Data):", acc_original)
print("KNN Accuracy (After PCA):", acc_pca)
```


### **Question 9: KNN Regressor with Distance Metrics and K-Value Analysis**

**Task:**
1. Generate a synthetic regression dataset (`sklearn.datasets.make_regression(n_samples=500, n_features=10)`).
2. Train a KNN regressor with:
   - Euclidean distance (K=5)
   - Manhattan distance (K=5)
   - Compare Mean Squared Error (MSE) for both.
3. Test K=1, 5, 10, 20, 50 and plot K vs. MSE to analyze the bias-variance tradeoff.

**Answer:**
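
No worked answer is given for this question, so the following is only one possible sketch of the task. The `noise=10.0` setting, the `random_state` values, the 70-30 split and the standardization step are assumptions that the task statement does not specify.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# 1. Synthetic regression data (noise level and seed are assumed)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize so that every feature contributes equally to the distances
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

def knn_mse(k, metric):
    """Fit a KNN regressor and return its test-set MSE."""
    model = KNeighborsRegressor(n_neighbors=k, metric=metric)
    model.fit(X_train_s, y_train)
    return mean_squared_error(y_test, model.predict(X_test_s))

# 2. Euclidean vs. Manhattan distance at K=5
mse_by_metric = {m: knn_mse(5, m) for m in ("euclidean", "manhattan")}
for metric, mse in mse_by_metric.items():
    print(f"K=5, {metric}: MSE = {mse:.2f}")

# 3. Effect of K on test error
k_values = [1, 5, 10, 20, 50]
mse_by_k = [knn_mse(k, "euclidean") for k in k_values]

plt.figure(figsize=(8, 5))
plt.plot(k_values, mse_by_k, marker='o')
plt.xlabel('K (number of neighbors)')
plt.ylabel('Test MSE')
plt.title('KNN Regressor: K vs. MSE')
plt.grid(True)
plt.show()
```

When reading the plot, a very small K gives a low-bias, high-variance model that follows individual noisy neighbors, while a very large K averages over many distant points and tends toward high bias. The K with the lowest test MSE marks the balance point for this particular dataset.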

### **Question 10: KNN with KD-Tree / Ball Tree, Imputation, and Real-World Data**

**Task:**
1. Load the Pima Indians Diabetes dataset (contains missing values).
2. Use KNN Imputation (`sklearn.impute.KNNImputer`) to fill missing values.
3. Train KNN using:
   - Brute-force method
   - KD-Tree
   - Ball Tree
4. Compare their training time and accuracy.
5. Plot the decision boundary for the best-performing method (use the 2 most important features).

---

#### **Objective**
We will work with the **Pima Indians Diabetes dataset** to:
1. Handle missing data using **KNN Imputer**
2. Train and evaluate **KNN classifiers** using three different algorithms:
   - **Brute-force**
   - **KD-Tree**
   - **Ball Tree**
3. Compare their performance and training times
4. Visualize decision boundaries for the best-performing model.

### **Step 1: Import Libraries and Load Dataset**

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import time

# Load dataset
url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/refs/heads/main/diabetes.csv"
df = pd.read_csv(url)
df.head()
```

### **Step 2: Handle Missing Values Using KNN Imputer**
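Some features (like BMI, Glucose, Insulin) contain zero values, which are physiologically unrealistic, so we treat them as missing.

```python
# Replace zeros with NaN for imputation
cols_with_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[cols_with_missing] = df[cols_with_missing].replace(0, np.nan)

# Apply KNN Imputer to the feature columns only, so the target ('Outcome')
# is not used when measuring distances between rows
feature_cols = df.columns.drop('Outcome')
imputer = KNNImputer(n_neighbors=5)
df_imputed = df.copy()
df_imputed[feature_cols] = imputer.fit_transform(df[feature_cols])

# Note: the imputer is fitted on the full dataset here for simplicity;
# fitting it on the training split only would avoid leaking test-set information.

# Verify no missing values remain
df_imputed.isnull().sum()
```
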
### **Step 3: Train-Test Split and Feature Scaling**

```python
X = df_imputed.drop('Outcome', axis=1)
y = df_imputed['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Scale features for KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

### **Step 4: Train KNN Using Different Algorithms (Brute-force, KD-Tree, Ball Tree)**
We’ll compare training time and accuracy for each method.

```python
# Define algorithms to test
algorithms = ['brute', 'kd_tree', 'ball_tree']
results = []

for algo in algorithms:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo)

    # Time fitting (index construction) and prediction (neighbor queries) separately
    start = time.perf_counter()
    knn.fit(X_train_scaled, y_train)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = knn.predict(X_test_scaled)
    predict_time = time.perf_counter() - start

    acc = accuracy_score(y_test, y_pred)
    results.append({'Algorithm': algo, 'Accuracy': acc,
                    'Fit Time (s)': fit_time, 'Predict Time (s)': predict_time})
    print(f"{algo.upper()} - Accuracy: {acc:.4f}, Fit: {fit_time:.4f} s, Predict: {predict_time:.4f} s")

# Summary table
results_df = pd.DataFrame(results)
results_df
```

### **Step 5: Determine the Best Performing Algorithm**
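
```python
# idxmax returns the first row on ties, so equal accuracies select the first algorithm listed
best_algo = results_df.loc[results_df['Accuracy'].idxmax()]
print("Best Algorithm:\n", best_algo)
```
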
### **Step 6: Visualize Decision Boundary for the Best Algorithm**
We’ll use only the two most important features for visualization.

```python
# Select top 2 features based on domain importance
features = ['Glucose', 'BMI']
X2 = df_imputed[features]
y2 = df_imputed['Outcome']

# Train-test split
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.3, random_state=42, stratify=y2)

# Scale with a separate scaler so the 8-feature scaler fitted earlier is not overwritten
scaler2 = StandardScaler()
X_train2_scaled = scaler2.fit_transform(X_train2)
X_test2_scaled = scaler2.transform(X_test2)

# Best-performing KNN (using best algorithm found earlier)
knn_best = KNeighborsClassifier(n_neighbors=5, algorithm=best_algo['Algorithm'])
knn_best.fit(X_train2_scaled, y_train2)

# Create meshgrid for plotting
x_min, x_max = X_train2_scaled[:, 0].min() - 1, X_train2_scaled[:, 0].max() + 1
y_min, y_max = X_train2_scaled[:, 1].min() - 1, X_train2_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))

# Predict the class for every grid point and reshape back to the grid
Z = knn_best.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(8,6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
plt.scatter(X_train2_scaled[:, 0], X_train2_scaled[:, 1], c=y_train2, cmap='coolwarm', edgecolors='k', alpha=0.7)
plt.title(f"Decision Boundary using {best_algo['Algorithm'].upper()} (K=5)")
plt.xlabel("Glucose (Standardized)")
plt.ylabel("BMI (Standardized)")
plt.show()
```

### **Step 7: Summary of Results**

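| **Algorithm** | **Accuracy** | **Speed** | **Remarks** |
|----------------|--------------|-----------|--------------|
| Brute-force | Same as the others (exact search) | No index to build; query cost grows with the number of training samples | Computes the distance to every training point |
| KD-Tree | Same as the others (exact search) | Fast queries in low dimensions | Becomes less effective as dimensionality grows |
| Ball Tree | Same as the others (exact search) | Fast queries; index is costlier to build than a KD-Tree | Often holds up better than KD-Tree in higher dimensions |

✅ **Conclusion:**
- All three algorithms perform an exact neighbor search, so their accuracy is expected to be essentially identical (apart from tie-breaking); they differ in speed. The measured values in `results_df` are the actual comparison.
- On a small dataset such as this one (768 rows, 8 features), timing differences are typically small; tree-based indexes pay off mainly on larger datasets with few dimensions.
- **KNN Imputer** filled the missing values without dropping any rows.
- Feature scaling affects KNN accuracy, while the choice of neighbor-search algorithm affects only runtime.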

