# Bagging & Boosting KNN & Stacking Assignment


**Instructions:** Carefully read each question. Use Google Docs, Microsoft Word, or a similar tool to create a document where you type out each question along with its answer. Save the document as a PDF, and then upload it to the LMS. Please do not zip or archive the files before uploading them. Each question carries 20 marks.


**Question 1:** What is the fundamental idea behind ensemble techniques? How does bagging differ from boosting in terms of approach and objective?

**Answer:** The fundamental idea behind ensemble techniques is to combine multiple models (often called base learners, or “weak learners” in boosting) into a stronger overall model that typically generalizes better than any single one of them. The goal is to reduce errors by leveraging the diversity among models: because the models make different mistakes, each one compensates for the others’ weaknesses when their predictions are combined.
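
As a rough illustration of why combining models helps, the sketch below computes the accuracy of a majority vote over several classifiers. It assumes, purely for illustration, that every model is correct with the same probability and that their errors are fully independent, which real models rarely satisfy; the function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumption: each model is right with probability p_correct,
    independently of all the others (an idealized case).
    """
    needed = n_models // 2 + 1  # votes required for a strict majority
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Each individual model is only slightly better than a coin flip.
for n in (1, 5, 25, 101):
    print(f"{n:>3} models -> majority-vote accuracy {majority_vote_accuracy(n, 0.6):.3f}")
```

Under these assumptions the vote becomes more accurate as models are added. With correlated errors the gain is smaller, which is why bagging and Random Forest deliberately inject randomness to decorrelate their models.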

**Bagging (Bootstrap Aggregating):** Builds multiple models independently using different random subsets of data (created by sampling with replacement).

Aims to reduce variance by averaging multiple models to make predictions more stable.

**Boosting:** Builds models sequentially, where each new model focuses on correcting the errors made by the previous ones.

Aims to reduce bias by giving more weight to misclassified samples and refining the model iteratively.
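
The contrast can be sketched with scikit-learn. This is a minimal example under stated assumptions: the Breast Cancer dataset, 50 estimators, and the choice of base trees are illustrative picks, not part of the assignment.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: fully grown (high-variance) trees, each trained independently
# on its own bootstrap sample; predictions are combined by voting.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: decision stumps (high-bias) trained one after another,
# each round re-weighting the samples the previous stumps got wrong.
boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0
)

scores = {}
for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean 5-fold CV accuracy = {scores[name]:.3f}")
```

Note the different base learners: bagging starts from models that overfit and averages away their variance, while boosting starts from models that underfit and reduces their bias step by step.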

**Question 2:** Explain how the Random Forest Classifier reduces overfitting compared to a single decision tree. Mention the role of two key hyperparameters in this process.


**Answer:** A Random Forest Classifier reduces overfitting by combining the predictions of many decision trees rather than relying on a single one.

A single decision tree tends to overfit because it learns all patterns — even noise — from the training data, making it too specific and less generalizable. Random Forest overcomes this by introducing randomness in both data and features, ensuring that each tree learns slightly different patterns. When their predictions are averaged (or voted), the overall model becomes more stable and less prone to overfitting.

**Two Key Hyperparameters and Their Roles**

1. n_estimators (number of trees):

- More trees mean better averaging and smoother predictions, which reduces variance and overfitting.

- However, too many trees can increase training time without much gain after a point.

2. max_features (number of features considered at each split):

- Controls randomness and correlation among trees.

- A smaller value increases diversity (since trees see different features), which helps reduce overfitting.

- A larger value makes trees more similar (more correlated), so averaging removes less variance and the forest can overfit more.
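
A small sketch of the effect, assuming the Wine dataset and the hyperparameter values shown (both chosen only for illustration): the gap between training and test accuracy gives a rough indication of overfitting.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# A single unpruned tree versus a forest of 200 trees that each
# consider only sqrt(n_features) candidate features per split.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", random_state=0
).fit(X_train, y_train)

report = {}
for name, model in [("single tree", tree), ("random forest", forest)]:
    report[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    train_acc, test_acc = report[name]
    print(f"{name}: train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc - test_acc:.3f}")
```

Varying `n_estimators` and `max_features` in this sketch is a quick way to see how each one changes the train/test gap on a given dataset.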

**Question 3:** What is Stacking in ensemble learning? How does it differ from traditional bagging/boosting methods? Provide a simple example use case.


**Answer:** Stacking (Stacked Generalization) is an ensemble learning technique where multiple different models (called base learners) are trained on the same dataset, and their outputs are combined using another model (called a meta-learner or blender) to make the final prediction.

The main idea is to let the meta-learner discover how best to combine the strengths of the base models to improve accuracy.

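**Stacking:** Combines different types of models (e.g., Logistic Regression, Decision Tree, SVM).

Base models are trained in parallel, and their predictions (ideally out-of-fold predictions from cross-validation, so the meta-model is not trained on leaked labels) are used to train a meta-model.

**Bagging:** Uses the same type of model repeatedly (e.g., many Decision Trees).

Models are trained independently on random data subsets, and their predictions are combined by simple averaging or voting.

**Boosting:** Also uses the same base model but improves it sequentially.

Models are trained sequentially, each focusing on correcting errors from the previous one.

**Example use case:** Predicting customer churn by training a Logistic Regression, a Decision Tree, and an SVM on the same customer data, then letting a Logistic Regression meta-learner combine their predictions into the final churn prediction.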
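
A minimal stacking sketch with scikit-learn's `StackingClassifier`, assuming the Breast Cancer dataset as a stand-in for a real problem; the step names and hyperparameters are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Three different model families act as base learners.
base_learners = [
    ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
]

# The meta-learner is trained on out-of-fold predictions (cv=5)
# produced by the base learners.
stack = StackingClassifier(
    estimators=base_learners, final_estimator=LogisticRegression(), cv=5
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"Stacked model test accuracy: {stack_accuracy:.3f}")
```

The `cv` argument is what keeps the meta-learner honest: it only ever sees base-model predictions for rows those base models were not fitted on.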

**Question 4:** What is the OOB Score in Random Forest, and why is it useful? How does it help in model evaluation without a separate validation set?

**Answer:** The OOB (Out-of-Bag) Score in a Random Forest is an internal validation score that estimates the model’s performance without needing a separate validation set.


When building a Random Forest:

- Each tree is trained on a bootstrapped sample — a random sample (with replacement) from the training data.

- On average, about 63.2% of the distinct data points end up in a given tree's bootstrap sample, and the remaining 36.8% (roughly 1/e) are not included in it. These unused data points are called that tree's Out-of-Bag (OOB) samples.
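
The 63%/37% split follows from sampling n rows with replacement: a given row is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The sketch below checks this both exactly and by simulation; the function names are hypothetical helpers.

```python
import math
import random


def oob_fraction_exact(n: int) -> float:
    """Probability that a given row never appears in a bootstrap sample of size n."""
    return (1 - 1 / n) ** n


def oob_fraction_simulated(n: int, seed: int = 0) -> float:
    """Draw one bootstrap sample of size n and measure the share of rows left out."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}  # distinct row indices sampled
    return 1 - len(drawn) / n


print(f"limit 1/e           : {math.exp(-1):.4f}")
for n in (10, 100, 10_000):
    print(f"n={n:>6}  exact={oob_fraction_exact(n):.4f}  simulated={oob_fraction_simulated(n):.4f}")
```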

After all trees are built:

- Each observation is predicted only by the trees where it was OOB (not used in training).

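- The OOB Score is then calculated by comparing these aggregated OOB predictions with the actual labels (accuracy, for a classifier). Because every prediction comes from trees that never saw that observation, the score acts like a built-in validation estimate, which is especially useful when the dataset is too small to set aside a separate validation set.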
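
In scikit-learn the OOB estimate is available through the `oob_score=True` option of `RandomForestClassifier`. The sketch below, which assumes the Wine dataset and 300 trees purely for illustration, compares it with accuracy on a held-out test set.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# oob_score=True asks the forest to score each training row using only
# the trees whose bootstrap sample did not contain that row.
forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_
test_accuracy = forest.score(X_test, y_test)
print(f"OOB accuracy : {oob_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The two numbers are usually close, which is what makes the OOB score a reasonable substitute for a validation set, although they need not match exactly.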

**Question 5:** Compare AdaBoost and Gradient Boosting in terms of:

● How they handle errors from weak learners

● Weight adjustment mechanism

● Typical use cases

**Answer:** 1. How They Handle Errors from Weak Learners

- **AdaBoost:**
Focuses on misclassified samples. After each round, it increases the weights of the incorrectly predicted data points so the next weak learner pays more attention to those errors.

- **Gradient Boosting:**
Focuses on residual errors (the difference between actual and predicted values). Each new model is trained to predict these residuals, gradually minimizing the overall loss function.
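
The residual-fitting idea behind Gradient Boosting can be sketched by hand for squared-error loss, where the negative gradient is exactly the residual. This is a simplified illustration on assumed synthetic data (a noisy sine curve), not how a production library implements it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumed toy data: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(100):
    residuals = y - prediction  # negative gradient of 0.5 * (y - f)^2
    weak_learner = DecisionTreeRegressor(max_depth=2, random_state=0)
    weak_learner.fit(X, residuals)  # each new tree predicts the current errors
    prediction += learning_rate * weak_learner.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after 100 rounds: {mse_history[-1]:.4f}")
```

No sample weights appear anywhere in this loop; the errors are passed on to the next learner through the regression targets instead.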

2. Weight Adjustment Mechanism

- **AdaBoost:**

  - Assigns weights to each training sample.

  - After each iteration:

    - Increases weights for misclassified samples.

    - Decreases weights for correctly classified samples.

  - The final prediction is a weighted sum of all weak learners based on their accuracy.

- **Gradient Boosting:**

  - Does not assign weights to samples directly.

  - Instead, it fits the next model on the negative gradient of the loss function (which represents errors).

  - Combines models by adding their predictions with a learning rate to control step size.
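
One round of the AdaBoost re-weighting can be sketched as follows. This assumes the classic discrete AdaBoost update for binary labels and a weak learner whose weighted error is strictly between 0 and 1; the function name `adaboost_round` is hypothetical.

```python
import math


def adaboost_round(weights, is_correct):
    """One discrete-AdaBoost update; returns (alpha, normalized new weights).

    weights    -- current sample weights
    is_correct -- True where the weak learner classified the sample correctly
    """
    error = sum(w for w, ok in zip(weights, is_correct) if not ok) / sum(weights)
    alpha = 0.5 * math.log((1 - error) / error)  # the learner's say in the final vote
    # Shrink weights of correct samples, grow weights of misclassified ones.
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5  # five samples, equal starting weights
is_correct = [True, True, True, True, False]  # the learner misses only the last one
alpha, new_weights = adaboost_round(weights, is_correct)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
# prints: 0.693 [0.125, 0.125, 0.125, 0.125, 0.5]
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner cannot ignore it.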

3. Typical Use Cases

- **AdaBoost:**

  - Works best with simple weak learners like decision stumps (one-level trees).

  - Used for classification problems, especially when the data is clean and not too noisy.

  - Example: Spam detection, face recognition.

- **Gradient Boosting:**

  - More flexible and powerful; can optimize various loss functions (classification, regression, ranking).

  - Commonly used for complex datasets and regression or classification tasks.

  - Example: Credit risk modeling, customer churn prediction, and many Kaggle competition solutions.

**Question 6:** Why does CatBoost perform well on categorical features without requiring extensive preprocessing? Briefly explain its handling of categorical variables.


**Answer:** CatBoost performs well on categorical features because it’s specifically designed to handle them natively, without requiring manual preprocessing like one-hot or label encoding.

**How CatBoost Handles Categorical Variables**

1. Uses Target Statistics (Target Encoding with Randomness):

- Instead of assigning arbitrary numbers, CatBoost replaces each categorical value with a statistic derived from the target variable (like the average target value for that category).

- To avoid overfitting, it uses ordered target encoding: the rows are placed in a random order, and the encoding for each row is based only on the rows that come before it in that order, never on the row itself or on later ones.

- This ensures that no data leakage occurs.

2. Combinations of Categorical Features:

- CatBoost also creates combinations of categorical features (like City + ProductType) to capture interactions automatically.

3. Efficient Encoding During Training:

- All these transformations are done dynamically during training, so you don’t need to manually encode or scale the data.
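
The ordered target statistic can be illustrated with a simplified sketch. It assumes the rows are already in their random order and uses a smoothed running mean with an assumed prior and prior weight; it does not reproduce CatBoost's actual implementation, and the function name is hypothetical.

```python
def ordered_target_statistics(categories, targets, prior, prior_weight=1.0):
    """Encode each row using only the targets of earlier rows in the same category.

    encoding = (sum of earlier targets + prior_weight * prior)
               / (count of earlier rows + prior_weight)
    """
    sums, counts, encoded = {}, {}, []
    for category, target in zip(categories, targets):
        seen_sum = sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        encoded.append((seen_sum + prior_weight * prior) / (seen_count + prior_weight))
        # Only now is the current row's target added to the running totals,
        # so a row's own label never leaks into its own encoding.
        sums[category] = seen_sum + target
        counts[category] = seen_count + 1
    return encoded


cities = ["Paris", "Rome", "Paris", "Paris", "Rome"]  # hypothetical feature
bought = [1, 0, 0, 1, 1]                              # hypothetical binary target
encoded = ordered_target_statistics(cities, bought, prior=0.5)
print(encoded)
# prints: [0.5, 0.5, 0.75, 0.5, 0.25]
```

The first row of each category receives only the prior, and later rows receive values that depend on what was seen before them, so the same category can be encoded differently on different rows.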

**Question 7: KNN Classifier Assignment: Wine Dataset Analysis with Optimization**

**Task:**

1. Load the Wine dataset (sklearn.datasets.load_wine()).

2. Split data into 70% train and 30% test.

3. Train a KNN classifier (default K=5) without scaling and evaluate using:

a. Accuracy

b. Precision, Recall, F1-Score (print classification report)

4. Apply StandardScaler, retrain KNN, and compare metrics.

5. Use GridSearchCV to find the best K (test K=1 to 20) and distance metric (Euclidean, Manhattan).

6. Train the optimized KNN and compare results with the unscaled/scaled versions.


In [None]:
# Step 1: Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

# Step 2: Load the Wine dataset
data = load_wine()
X = data.data
y = data.target

# Step 3: Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Step 4: Train KNN (default K=5) without scaling
knn_default = KNeighborsClassifier(n_neighbors=5)
knn_default.fit(X_train, y_train)
y_pred_default = knn_default.predict(X_test)

print("=== KNN without Scaling ===")
print("Accuracy:", accuracy_score(y_test, y_pred_default))
print("Classification Report:\n", classification_report(y_test, y_pred_default))

# Step 5: Apply StandardScaler and retrain KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)

print("\n=== KNN with StandardScaler ===")
print("Accuracy:", accuracy_score(y_test, y_pred_scaled))
print("Classification Report:\n", classification_report(y_test, y_pred_scaled))

# Step 6: Use GridSearchCV to find best K (1–20) and distance metric
param_grid = {
    'n_neighbors': range(1, 21),
    'metric': ['euclidean', 'manhattan']
}

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

print("\nBest Parameters from GridSearchCV:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)

# Step 7: Train optimized KNN model
best_knn = grid_search.best_estimator_
best_knn.fit(X_train_scaled, y_train)
y_pred_best = best_knn.predict(X_test_scaled)

print("\n=== Optimized KNN (with Scaling) ===")
print("Accuracy:", accuracy_score(y_test, y_pred_best))
print("Classification Report:\n", classification_report(y_test, y_pred_best))

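# Optional comparison summary
results = pd.DataFrame({
    'Model': ['KNN (unscaled)', 'KNN (scaled)', 'KNN (optimized, scaled)'],
    'Accuracy': [
        accuracy_score(y_test, y_pred_default),
        accuracy_score(y_test, y_pred_scaled),
        accuracy_score(y_test, y_pred_best),
    ],
})
print("\n=== Comparison Summary ===")
print(results)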


**Question 8: PCA + KNN with Variance Analysis and Visualization**

**Task:**

1. Load the Breast Cancer dataset (sklearn.datasets.load_breast_cancer()).

2. Apply PCA and plot the scree plot (explained variance ratio).

3. Retain 95% variance and transform the dataset.

4. Train KNN on the original data and PCA-transformed data, then compare accuracy.

5. Visualize the first two principal components using a scatter plot (color by class).


In [None]:
# Step 1: Import libraries
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Step 2: Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
print("Dataset Shape:", X.shape)

# Step 3: Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 4: Apply PCA and plot Scree Plot (Explained Variance Ratio)
pca = PCA()
pca.fit(X_scaled)
explained_variance = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(8, 5))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o')
plt.title("Scree Plot - Cumulative Explained Variance")
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance Ratio")
plt.grid(True)
plt.show()
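
The cell above stops at the scree plot. As a hedged sketch of the remaining task steps (retaining 95% of the variance, comparing KNN accuracy, and plotting the first two components), and not the author's original code, one possible continuation is shown below. The 70/30 split, K=5, and the use of pipelines are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# KNN on all standardized features.
knn_original = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_original.fit(X_train, y_train)
acc_original = knn_original.score(X_test, y_test)

# PCA(n_components=0.95) keeps the fewest components that explain
# at least 95% of the variance in the training data.
knn_pca = make_pipeline(
    StandardScaler(), PCA(n_components=0.95), KNeighborsClassifier(n_neighbors=5)
)
knn_pca.fit(X_train, y_train)
acc_pca = knn_pca.score(X_test, y_test)
n_kept = knn_pca.named_steps["pca"].n_components_

print(f"KNN accuracy (original features): {acc_original:.4f}")
print(f"KNN accuracy (PCA, {n_kept} components): {acc_pca:.4f}")

# Scatter plot of the first two principal components, colored by class.
X_2d = make_pipeline(StandardScaler(), PCA(n_components=2)).fit_transform(X)
plt.figure(figsize=(8, 6))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="coolwarm", edgecolors="k", alpha=0.7)
plt.title("Breast Cancer Data: First Two Principal Components")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.show()
```

Putting the scaler and PCA inside a pipeline means both are fitted on the training split only, so the test accuracy is not influenced by test-set statistics.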


**Question 9: KNN Regressor with Distance Metrics and K-Value Analysis**

**Task:**

1. Generate a synthetic regression dataset (sklearn.datasets.make_regression(n_samples=500, n_features=10)).

2. Train a KNN regressor with:

a. Euclidean distance (K=5)

b. Manhattan distance (K=5)

c. Compare Mean Squared Error (MSE) for both.

3. Test K=1, 5, 10, 20, 50 and plot K vs. MSE to analyze bias-variance tradeoff.


In [None]:
# Step 1: Import necessary libraries
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import numpy as np

# Step 2: Generate a synthetic regression dataset
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)

# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3a: Train KNN Regressor (K=5, Euclidean distance)
knn_euclidean = KNeighborsRegressor(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
mse_euclidean = mean_squared_error(y_test, y_pred_euclidean)

# Step 3b: Train KNN Regressor (K=5, Manhattan distance)
knn_manhattan = KNeighborsRegressor(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
mse_manhattan = mean_squared_error(y_test, y_pred_manhattan)

# Compare MSE for both distance metrics
print("=== Distance Metric Comparison (K=5) ===")
print(f"Euclidean Distance MSE: {mse_euclidean:.4f}")
print(f"Manhattan Distance MSE: {mse_manhattan:.4f}")

# Step 4: Test K values and analyze bias-variance tradeoff
k_values = [1, 5, 10, 20, 50]
mse_scores = []

for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k, metric='euclidean')
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)

# Step 5: Plot K vs MSE
plt.figure(figsize=(8, 5))
plt.plot(k_values, mse_scores, marker='o', linestyle='-', color='b')
plt.title("KNN Regressor: K vs Mean Squared Error")
plt.xlabel("Number of Neighbors (K)")
plt.ylabel("Mean Squared Error (MSE)")
plt.grid(True)
plt.show()


**Question 10: KNN with KD-Tree/Ball Tree, Imputation, and Real-World Data**

**Task:**

1. Load the Pima Indians Diabetes dataset (contains missing values).

2. Use KNN Imputation (sklearn.impute.KNNImputer) to fill missing values.

3. Train KNN using:

a. Brute-force method

b. KD-Tree

c. Ball Tree

4. Compare their training time and accuracy.

5. Plot the decision boundary for the best-performing method (use 2 most important features).

Dataset: Pima Indians Diabetes



In [None]:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.decomposition import PCA

# Step 2: Load the Pima Indians Diabetes dataset
# If you have the file locally: df = pd.read_csv("pima-indians-diabetes.csv")
# Otherwise, you can use a known open-source link:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
           'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(url, names=columns)

print("Dataset shape:", df.shape)
print(df.head())

# Step 3: Replace zero values (invalid) with NaN for imputation
cols_with_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[cols_with_missing] = df[cols_with_missing].replace(0, np.nan)

print("\nMissing values before imputation:")
print(df.isnull().sum())

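# Step 4: Split dataset first, so the imputer never sees test rows or the target
X = df.drop('Outcome', axis=1)
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Step 5: Apply KNN Imputer (fitted on the training features only)
imputer = KNNImputer(n_neighbors=5)
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X.columns, index=X_train.index)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X.columns, index=X_test.index)

print("\nMissing values after imputation:")
print(X_train.isnull().sum() + X_test.isnull().sum())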

# Step 6: Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 7: Train KNN using different algorithms and compare
methods = ['brute', 'kd_tree', 'ball_tree']
results = []

for method in methods:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=method)

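    # fit() is where KD-Tree and Ball Tree build their index; brute force only stores the data
    start = time.perf_counter()
    knn.fit(X_train_scaled, y_train)
    end = time.perf_counter()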

    y_pred = knn.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    train_time = end - start

    results.append({'Algorithm': method, 'Accuracy': acc, 'Training Time (s)': train_time})
    print(f"\n=== {method.upper()} METHOD ===")
    print(f"Accuracy: {acc:.4f}")
    print(f"Training Time: {train_time:.4f} seconds")

# Step 8: Compare results in a table
results_df = pd.DataFrame(results)
print("\n=== Comparison Summary ===")
print(results_df)

# Step 9: Choose the best-performing method
best_method = results_df.sort_values(by='Accuracy', ascending=False).iloc[0]['Algorithm']
print(f"\nBest-performing method: {best_method.upper()}")

# Step 10: Visualize decision boundary using top 2 PCA components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_best = KNeighborsClassifier(n_neighbors=5, algorithm=best_method)
knn_best.fit(X_pca, y_train)

# Create a meshgrid for plotting
x_min, x_max = X_pca[:, 0].min() - 1, X_pca[:, 0].max() + 1
y_min, y_max = X_pca[:, 1].min() - 1, X_pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05),
                     np.arange(y_min, y_max, 0.05))

Z = knn_best.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot decision boundary
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.coolwarm)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_train, edgecolors='k', cmap=plt.cm.coolwarm)
plt.title(f"Decision Boundary using {best_method.upper()} (Top 2 PCA Features)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.show()
