# Answers

## Question 1
**What is the fundamental idea behind ensemble techniques? How does bagging differ from boosting in terms of approach and objective?**

**Answer (summary):**

- **Fundamental idea:** Combine multiple models (base learners/weak learners) to form a stronger predictor. Because individual models make partly different errors, aggregating them (voting, averaging, or a meta-learner) cancels some of those errors.
- **Bagging (Bootstrap Aggregating):**
  - Train many models in parallel on bootstrap samples.
  - Aggregate predictions by majority vote (classification) or average (regression).
  - Objective: reduce variance and overfitting (useful for high-variance models like deep decision trees).
- **Boosting:**
  - Train models sequentially; each model focuses on mistakes of the previous ones.
  - Aggregate with weighted sum/vote; later learners correct earlier errors.
  - Objective: reduce bias (turn many weak learners into a strong learner).
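
The error-cancelling effect of aggregation can be illustrated with a small calculation. This is a sketch under an assumption the answer above does not make: the base learners are treated as independent binary classifiers that are each correct with the same probability `p`. Real base learners are correlated, so the gain in practice is smaller. The function name `majority_vote_accuracy` is illustrative.

```python
from math import comb


def majority_vote_accuracy(n_learners: int, p: float) -> float:
    """Probability that a strict majority of n independent learners is correct.

    Assumes binary classification, an odd number of learners (no ties),
    and independent errors; real base learners are correlated.
    """
    if n_learners % 2 == 0:
        raise ValueError("use an odd number of learners to avoid ties")
    need = n_learners // 2 + 1
    return sum(
        comb(n_learners, k) * p**k * (1 - p) ** (n_learners - k)
        for k in range(need, n_learners + 1)
    )


# Voted accuracy for ensembles of increasing size, each learner 60% accurate.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

With `p` above 0.5 the voted accuracy rises as learners are added; with `p` below 0.5 it falls, which is why base learners need to be better than chance.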

## Question 2
**Explain how the Random Forest Classifier reduces overfitting compared to a single decision tree. Mention the role of two key hyperparameters in this process.**

**Answer (summary):**

- **Mechanism:** Build an ensemble of decision trees, each trained on a different bootstrap sample and restricted to a random subset of features at each split, then average or vote their predictions. Averaging many decorrelated trees smooths the prediction and reduces variance.
- **Key hyperparameters:**
  - `n_estimators` (number of trees): more trees -> lower variance (up to diminishing returns).
  - `max_features` (features considered per split): smaller value -> more diverse trees -> reduce correlation -> better generalization.
- **Other controls:** `max_depth`, `min_samples_leaf`, `min_samples_split` also prevent overly complex trees.
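
The roles of the two hyperparameters can be made concrete with the standard variance formula for the mean of `B` identically distributed variables with equal pairwise correlation `rho`: `rho * s2 + (1 - rho) * s2 / B`. Treating trees this way is an idealization, and the function name `ensemble_variance` is illustrative, so read this as a sketch rather than a description of what scikit-learn computes.

```python
def ensemble_variance(tree_variance: float, correlation: float, n_trees: int) -> float:
    """Variance of the mean of n_trees predictions that share the same
    variance and the same pairwise correlation."""
    return correlation * tree_variance + (1 - correlation) * tree_variance / n_trees


# More trees (n_estimators) shrink only the second term...
for b in (1, 10, 100, 1000):
    print(f"rho=0.6, trees={b:>4}: {ensemble_variance(1.0, 0.6, b):.4f}")

# ...while lower correlation between trees (for example from a smaller
# max_features) lowers the floor that remains as the number of trees grows.
for rho in (0.6, 0.3, 0.1):
    print(f"rho={rho}, trees=1000: {ensemble_variance(1.0, rho, 1000):.4f}")
```

The trade-off is that a very small `max_features` can make each individual tree weaker, so the value is usually tuned rather than minimized.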

## Question 3
**What is Stacking in ensemble learning? How does it differ from traditional bagging/boosting methods? Provide a simple example use case.**

**Answer (summary):**

- **Stacking:** Train multiple base models (level-0) and then train a meta-model (level-1) on base models’ predictions. The meta-model learns how to combine base outputs.
- **Difference:** Bagging averages/votes; boosting weights learners sequentially; stacking trains a second-level model to combine predictions (requires out-of-fold predictions to avoid leakage).
- **Example:** Base: RandomForest, SVM, LogisticRegression → Meta: LogisticRegression trained on their predicted probabilities to improve final predictions.
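
The example use case above could look roughly like this with scikit-learn's `StackingClassifier`. The choice of the Wine dataset, the 70/30 split, and the hyperparameter values are assumptions made for illustration; `cv=5` is what makes the meta-model train on out-of-fold predictions.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Level-0 models; SVM and logistic regression are scaled inside their own pipelines.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
]

# Level-1 model: logistic regression on the base models' predicted probabilities.
# cv=5 means those probabilities are out-of-fold, which avoids leakage.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
print(f"Stacking test accuracy: {stack_accuracy:.4f}")
```

Whether stacking beats the best single base model depends on the data; it tends to help most when the base models make different kinds of errors.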

## Question 4
**What is the OOB Score in Random Forest, and why is it useful? How does it help in model evaluation without a separate validation set?**

**Answer (summary):**

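- **OOB (Out-Of-Bag):** Each tree is trained on a bootstrap sample, which leaves out roughly 1/3 (about 36.8%) of the training samples; these are that tree's OOB samples. Each training sample is predicted using only the trees that did not see it, and the aggregated predictions are scored against the true labels to estimate generalization performance.
- **Why useful:** Provides an internal validation estimate without holding out a separate validation set, so all data can be used for training. It is obtained as a by-product of fitting the forest, which makes it a fast approximation to cross-validation.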
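
A minimal sketch of reading the OOB estimate in scikit-learn follows. The Breast Cancer dataset, the split, and `n_estimators=300` are assumptions chosen for illustration; the held-out test set is kept only to show how close the two numbers usually are.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# oob_score=True requires bootstrap=True (the default for random forests).
forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=42)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_              # estimated from training data only
test_accuracy = forest.score(X_test, y_test)  # held-out check, for comparison
print(f"OOB accuracy:  {oob_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

With few trees some samples may never be out-of-bag, so the estimate is more reliable once the forest has at least a few hundred trees.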

## Question 5
**Compare AdaBoost and Gradient Boosting in terms of:**
- How they handle errors from weak learners
- Weight adjustment mechanism
- Typical use cases

**Answer (summary):**

- **AdaBoost:**
  - Emphasizes misclassified samples by increasing their sample weights.
  - Learners get weights based on performance; next learner trained on re-weighted data.
  - Typically used with very shallow learners such as decision stumps; sensitive to noisy labels and outliers, because their weights keep growing.
- **Gradient Boosting:**
  - Fits new learners to residuals (negative gradients) of the loss function.
  - Uses learning rate (shrinkage) and regularization; more flexible (works with any differentiable loss).
  - Modern variants (XGBoost/LightGBM/CatBoost) excel on tabular data and large datasets.
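
The two update rules can be compared in a single round each. This is a simplified sketch, not library code: the AdaBoost part uses the classic binary formulation with `alpha = 0.5 * ln((1 - err) / err)`, the gradient boosting part uses squared loss with a one-split stump, and all function names and the toy data are made up for illustration.

```python
import math


def adaboost_round(weights, misclassified):
    """One round of classic binary AdaBoost: returns (alpha, new_weights).

    Requires a weighted error strictly between 0 and 1.
    """
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1 - err) / err)  # weight of this learner's vote
    scaled = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]


def gradient_boost_round(x, y, current, threshold, learning_rate):
    """One squared-loss boosting step with a one-split stump fit to residuals."""
    residuals = [yi - fi for yi, fi in zip(y, current)]  # negative gradient
    left = [r for xi, r in zip(x, residuals) if xi <= threshold]
    right = [r for xi, r in zip(x, residuals) if xi > threshold]
    left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
    return [
        fi + learning_rate * (left_mean if xi <= threshold else right_mean)
        for xi, fi in zip(x, current)
    ]


def mse(y, pred):
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)


# AdaBoost: five equally weighted samples, the last one misclassified.
alpha, new_w = adaboost_round([0.2] * 5, [False, False, False, False, True])
print("alpha:", round(alpha, 4), "weights:", [round(w, 4) for w in new_w])

# Gradient boosting: start from the mean prediction, then fit the residuals.
x, y = [1, 2, 3, 4], [1.0, 1.2, 3.0, 3.4]
f0 = [sum(y) / len(y)] * len(y)
f1 = gradient_boost_round(x, y, f0, threshold=2.5, learning_rate=0.5)
print("MSE before:", round(mse(y, f0), 4), "after:", round(mse(y, f1), 4))
```

AdaBoost changes which samples the next learner pays attention to, while gradient boosting changes the target the next learner is asked to predict.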

## Question 6
**Why does CatBoost perform well on categorical features without requiring extensive preprocessing? Briefly explain its handling of categorical variables.**

**Answer (summary):**

- CatBoost uses **ordered target statistics** (target encoding without leakage) and permutation-driven encodings to transform categorical features safely.
- It also builds combinations of categorical features and uses symmetric (oblivious) trees, which together reduce overfitting.
- Result: Minimal manual preprocessing is required. Pass the categorical columns to CatBoost through the `cat_features` parameter and it handles the encoding internally.
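
The idea behind ordered target statistics can be sketched in plain Python. This is a simplified illustration, not CatBoost's implementation: a single fixed row order stands in for the random permutations CatBoost draws, and the function name and the `prior` and `prior_weight` parameters are hypothetical.

```python
def ordered_target_statistic(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row using only the targets of rows that came before it."""
    sums, counts, encoded = {}, {}, []
    for cat, target in zip(categories, targets):
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        encoded.append((seen_sum + prior_weight * prior) / (seen_count + prior_weight))
        # Update the running statistics only after encoding the current row,
        # so a row's own target never leaks into its own feature value.
        sums[cat] = seen_sum + target
        counts[cat] = seen_count + 1
    return encoded


cats = ["a", "b", "a", "a", "b"]
y = [1, 0, 0, 1, 1]
print(ordered_target_statistic(cats, y))
```

The first occurrence of every category falls back to the prior, and later occurrences move toward the category's running mean target.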

## Question 7 — KNN Classifier Assignment: Wine Dataset Analysis with Optimization
**Task summary:** Load wine dataset, split (70/30), train KNN (k=5) without scaling and with StandardScaler, evaluate metrics, run GridSearchCV for k=1..20 and metric in {euclidean, manhattan}, compare optimized KNN.

Run the code cells below in order.

```python
# Q7: KNN on the Wine dataset (70/30 split): unscaled vs. scaled vs. GridSearchCV-tuned
# Run this cell in Colab.

from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load data
data = load_wine()
X, y = data.data, data.target
target_names = data.target_names

# 70/30 stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# 1) KNN without scaling (k=5)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc_unscaled = accuracy_score(y_test, y_pred)
print("KNN WITHOUT scaling, accuracy: {:.4f}".format(acc_unscaled))
print("Classification report (unscaled):\n", classification_report(y_test, y_pred, target_names=target_names))

# 2) Scale (StandardScaler fit on the training set only) and retrain
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)
print("\nKNN WITH StandardScaler, accuracy: {:.4f}".format(acc_scaled))
print("Classification report (scaled):\n", classification_report(y_test, y_pred_scaled, target_names=target_names))

# 3) GridSearchCV to find the best K and metric.
# The scaler sits inside the pipeline so it is re-fit on each CV training fold,
# which keeps validation folds out of the scaling statistics.
pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
param_grid = {
    'knn__n_neighbors': list(range(1, 21)),
    'knn__metric': ['euclidean', 'manhattan']
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
print("\nGridSearchCV best params:", grid.best_params_)
print("GridSearchCV best CV score: {:.4f}".format(grid.best_score_))

# 4) Evaluate the optimized KNN (already refit on the full training set, refit=True by default)
best_knn = grid.best_estimator_
y_pred_best = best_knn.predict(X_test)
acc_best = accuracy_score(y_test, y_pred_best)
print("\nOptimized KNN, test accuracy: {:.4f}".format(acc_best))
print("Classification report (optimized):\n", classification_report(y_test, y_pred_best, target_names=target_names))

# Summary
print("Summary of Accuracies:\n Unscaled: {:.4f}\n Scaled: {:.4f}\n Optimized: {:.4f}".format(acc_unscaled, acc_scaled, acc_best))
```

## Question 8 — PCA + KNN with Variance Analysis and Visualization
**Task summary:** Load Breast Cancer dataset, apply StandardScaler, make scree plot (explained variance ratio), keep 95% variance, train KNN on original and PCA data, compare accuracies, and scatter plot first two PCs colored by class.

```python
# Q8: PCA + KNN on the Breast Cancer dataset: scree plot, retain 95% variance,
# compare accuracy, scatter plot of the first two PCs.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
target_names = data.target_names

# Split first so the scaler and PCA are fit on training data only (avoids leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Full PCA (all components) for the scree plot
pca_full = PCA()
pca_full.fit(X_train_scaled)
explained_ratio = pca_full.explained_variance_ratio_

# Scree plot
plt.figure(figsize=(9, 5))
plt.plot(np.arange(1, len(explained_ratio) + 1), explained_ratio, marker='o')
plt.title("Scree Plot: Explained Variance Ratio")
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.grid(True)
plt.show()

# Retain 95% variance
pca95 = PCA(n_components=0.95)
X_train_pca = pca95.fit_transform(X_train_scaled)
X_test_pca = pca95.transform(X_test_scaled)
print("Original features:", X.shape[1])
print("Reduced features (retain 95% variance):", X_train_pca.shape[1])

# KNN on original scaled data
knn_orig = KNeighborsClassifier(n_neighbors=5)
knn_orig.fit(X_train_scaled, y_train)
acc_orig = accuracy_score(y_test, knn_orig.predict(X_test_scaled))
print("\nKNN accuracy on original scaled data: {:.4f}".format(acc_orig))

# KNN on PCA-reduced data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))
print("KNN accuracy on PCA-reduced (95% var) data: {:.4f}".format(acc_pca))

# Visualization: all samples projected onto the first two principal components
X_pca_2d = pca_full.transform(scaler.transform(X))[:, :2]
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=y, cmap='coolwarm', s=50, alpha=0.7, edgecolor='k')
plt.title("Breast Cancer: First Two Principal Components")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend(handles=scatter.legend_elements()[0], labels=list(target_names))
plt.grid(True)
plt.show()
```

## Question 9 — KNN Regressor with Distance Metrics and K-Value Analysis
**Task summary:** Create synthetic regression data (500 samples, 10 features), compare KNN regressor with Euclidean and Manhattan (K=5), compute MSEs, test K={1,5,10,20,50} and plot K vs MSE.

```python
# Q9: KNN Regressor: distance metrics and K analysis (MSE vs K)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic regression data
X, y = make_regression(n_samples=500, n_features=10, noise=15.0, random_state=42)

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# K=5 Euclidean
knn_euc = KNeighborsRegressor(n_neighbors=5, metric='euclidean')
knn_euc.fit(X_train_s, y_train)
mse_euc = mean_squared_error(y_test, knn_euc.predict(X_test_s))

# K=5 Manhattan
knn_man = KNeighborsRegressor(n_neighbors=5, metric='manhattan')
knn_man.fit(X_train_s, y_train)
mse_man = mean_squared_error(y_test, knn_man.predict(X_test_s))

print("MSE Euclidean (K=5): {:.4f}".format(mse_euc))
print("MSE Manhattan (K=5): {:.4f}".format(mse_man))

# K vs MSE analysis (default metric: Minkowski with p=2, i.e. Euclidean)
K_values = [1, 5, 10, 20, 50]
MSE_scores = []
for k in K_values:
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    MSE_scores.append(mean_squared_error(y_test, y_pred))

# Plot K vs MSE
plt.figure(figsize=(8,5))
plt.plot(K_values, MSE_scores, marker='o')
plt.title("K vs MSE (KNN Regressor)")
plt.xlabel("K")
plt.ylabel("Mean Squared Error")
plt.grid(True)
plt.show()

# Print table
for k, mse in zip(K_values, MSE_scores):
    print("K = {:>2} -> MSE = {:.4f}".format(k, mse))
```

## Question 10 — KNN with KD-Tree/Ball Tree, Imputation, and Real-World Data (Pima Indians Diabetes)
**Task summary:** Load Pima Indians Diabetes data (missing values are encoded as zeros), perform KNNImputer, scale features, train KNN with three algorithms (brute, kd_tree, ball_tree), compare training/prediction time and accuracy, and plot decision boundary using two most important features.

**Note:** This cell downloads the dataset from a raw CSV hosted on GitHub. If the link is blocked, upload `pima-indians-diabetes.csv` manually into Colab.

```python
# Q10: KNNImputer + brute / kd_tree / ball_tree comparison on the Pima Indians Diabetes dataset
# The code tries to download the dataset. If that fails, upload the CSV to Colab manually.

import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Try to download the dataset (raw CSV on GitHub) unless a local copy exists
csv_path = 'pima-indians-diabetes.csv'
if not os.path.exists(csv_path):
    url = "https://raw.githubusercontent.com/selva86/datasets/master/PimaIndiansDiabetes.csv"
    try:
        pd.read_csv(url).to_csv(csv_path, index=False)
        print("Downloaded dataset from:", url)
    except Exception as e:
        print("Automatic download failed. Please upload 'pima-indians-diabetes.csv' to Colab. Error:", e)
        raise SystemExit("Upload dataset and re-run.")

# Read dataset
df = pd.read_csv(csv_path)
print("Dataset shape:", df.shape)
print(df.head())

# Normalize the target column name to 'Outcome'.
# Known column names from the Selva86 dataset:
# 'pregnant','glucose','pressure','triceps','insulin','mass','pedigree','age','diabetes'
if 'Outcome' not in df.columns:
    if 'diabetes' in df.columns:
        df = df.rename(columns={'diabetes': 'Outcome'})
    elif 'Diabetes' in df.columns:
        df = df.rename(columns={'Diabetes': 'Outcome'})
    else:
        # fall back to treating the last column as the outcome
        df.columns = list(df.columns[:-1]) + ['Outcome']

# Some copies of the dataset store the outcome as text ('pos'/'neg'); map it to 1/0
if not pd.api.types.is_numeric_dtype(df['Outcome']):
    df['Outcome'] = df['Outcome'].str.strip().str.lower().map({'pos': 1, 'neg': 0, '1': 1, '0': 0})

# Zeros in these columns are physiologically impossible and stand for missing values
cols_with_zero = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
                  'glucose', 'pressure', 'triceps', 'insulin', 'mass']  # include alternative names
zero_cols = [col for col in df.columns if col in cols_with_zero]
df[zero_cols] = df[zero_cols].replace(0, np.nan)
print("\nMissing value counts:\n", df.isna().sum())

# Use all non-Outcome columns as features
X_raw = df.drop('Outcome', axis=1)
feature_names = list(X_raw.columns)
y = df['Outcome'].astype(int).values

# Split before imputing and scaling so both are fit on training data only
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.3, random_state=42, stratify=y
)

# KNN Imputer
imputer = KNNImputer(n_neighbors=5)
X_train_imp = imputer.fit_transform(X_train_raw)
X_test_imp = imputer.transform(X_test_raw)

# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_imp)
X_test = scaler.transform(X_test_imp)

# Compare algorithms. All three perform exact neighbor search, so accuracy
# should match (up to distance ties); the difference is in speed.
methods = ['brute', 'kd_tree', 'ball_tree']
results = {}
for method in methods:
    model = KNeighborsClassifier(n_neighbors=5, algorithm=method)
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    t1 = time.perf_counter()
    preds = model.predict(X_test)
    t2 = time.perf_counter()
    results[method] = {
        'train_time_sec': t1 - t0,
        'predict_time_sec': t2 - t1,
        'accuracy': accuracy_score(y_test, preds),
        'confusion_matrix': confusion_matrix(y_test, preds)
    }

# Print results
for k, v in results.items():
    print(f"\nMethod: {k}")
    print(" Train time (s):", v['train_time_sec'])
    print(" Predict time (s):", v['predict_time_sec'])
    print(" Accuracy:", v['accuracy'])
    print(" Confusion matrix:\n", v['confusion_matrix'])

# Select best method by accuracy (tie-breaker: fastest predict time)
best_method = max(results.items(), key=lambda x: (x[1]['accuracy'], -x[1]['predict_time_sec']))[0]
print("\nBest method selected:", best_method)

# Pick the two most important features according to a decision tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
top2 = np.argsort(dt.feature_importances_)[-2:]
print("Top 2 features for visualization:", [feature_names[i] for i in top2])

# Prepare 2D train/test using the top 2 features
X2_train = X_train[:, top2]
X2_test = X_test[:, top2]

# Retrain best KNN on 2D features (for visualization only)
knn_best2 = KNeighborsClassifier(n_neighbors=5, algorithm=best_method)
knn_best2.fit(X2_train, y_train)

# Create meshgrid over the (scaled) feature plane
x_min, x_max = X2_train[:, 0].min() - 1, X2_train[:, 0].max() + 1
y_min, y_max = X2_train[:, 1].min() - 1, X2_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))
grid = np.c_[xx.ravel(), yy.ravel()]
Z = knn_best2.predict(grid).reshape(xx.shape)

# Plot decision boundary with the test points on top
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
plt.scatter(X2_test[:, 0], X2_test[:, 1], c=y_test, cmap='coolwarm', edgecolor='k', s=50)
plt.title(f"KNN decision boundary (best: {best_method}), top 2 features")
plt.xlabel(f"{feature_names[top2[0]]} (standardized)")
plt.ylabel(f"{feature_names[top2[1]]} (standardized)")
plt.show()

# End of Q10
```