# THEORY — KNN & PCA

## 1) What is K-Nearest Neighbors (KNN) and how does it work

KNN is a non-parametric, instance-based algorithm used for classification and regression. To predict for a new sample, KNN:

1. Computes distances between the new sample and each training sample (using a chosen metric).
2. Selects the K nearest training samples.
3. For classification: majority vote among neighbors (or weighted vote). For regression: average (or weighted average) of neighbors’ targets.
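
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it: the helper name `knn_predict` and the toy arrays are made up for the example, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one sample with plain Euclidean KNN (illustrative helper)."""
    # Step 1: distance from the new sample to every training sample
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    neighbors = y_train[nearest]
    # Step 3: majority vote (classification) or mean (regression)
    if task == "classification":
        return Counter(neighbors.tolist()).most_common(1)[0][0]
    return float(np.mean(neighbors))

# Toy data (assumed for the example): two well-separated groups
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]])
y_class = np.array([0, 0, 0, 1, 1])
y_reg = np.array([1.0, 1.2, 0.8, 5.0, 5.4])

x_new = np.array([1.1, 1.0])
label = knn_predict(X_train, y_class, x_new, k=3)
value = knn_predict(X_train, y_reg, x_new, k=3, task="regression")
print("Predicted class:", label)
print("Predicted value:", value)
```

The same neighbor search serves both tasks; only the final aggregation differs.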

## 2) Difference between KNN Classification and KNN Regression

* Classification: target is discrete labels. Prediction = mode of neighbors.
* Regression: target is continuous. Prediction = mean (or weighted mean) of neighbors’ target values.

## 3) Role of the distance metric in KNN

Distance metric defines “nearness”. Common metrics: Euclidean (L2), Manhattan (L1), Minkowski, Cosine. Choice affects which points are neighbors — thus model performance and decision boundaries.
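
As a small worked example, the metrics named above can be computed by hand for one pair of vectors. The vectors `a` and `b` are arbitrary values chosen for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
diff = np.abs(a - b)

euclidean = np.sqrt(np.sum(diff ** 2))        # L2: straight-line distance
manhattan = np.sum(diff)                      # L1: sum of absolute differences
minkowski_p3 = np.sum(diff ** 3) ** (1 / 3)   # Minkowski with p=3
# Cosine distance compares direction only, ignoring vector length
cosine_dist = 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Euclidean: {euclidean:.4f}")
print(f"Manhattan: {manhattan:.4f}")
print(f"Minkowski (p=3): {minkowski_p3:.4f}")
print(f"Cosine distance: {cosine_dist:.4f}")
```

Minkowski generalizes the first two: `p=1` gives Manhattan and `p=2` gives Euclidean.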

## 4) The Curse of Dimensionality in KNN

In high-dimensional spaces distances become less informative: points end up nearly equidistant, so neighborhoods lose meaning and KNN performance degrades. Dimensionality reduction or feature selection helps.
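
The effect can be observed on random data. The sketch below assumes uniformly distributed points and an arbitrary sample size of 500. It compares the nearest and farthest distance from a random query point as the dimension grows; a ratio close to 1 means the "nearest" neighbor is barely nearer than the farthest one.

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for d in [2, 10, 100, 1000]:
    X = rng.random((500, d))   # 500 random points in the unit hypercube
    q = rng.random(d)          # one random query point
    dists = np.linalg.norm(X - q, axis=1)
    # Ratio of nearest to farthest distance for this dimension
    ratios[d] = dists.min() / dists.max()
    print(f"d={d:5d}  nearest/farthest = {ratios[d]:.3f}")
```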

## 5) How to choose the best value of K in KNN

Typical approach: run k-fold cross-validation over several candidate K values, score each by accuracy (or MSE for regression), and pick the K with the best CV score. For binary classification, prefer odd K to avoid tied votes.
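
A minimal sketch of that search, assuming the Iris dataset, 5-fold CV and odd K from 1 to 21 (all arbitrary choices for the example). The scaler sits inside the pipeline so that it is refit on each training fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 22, 2):  # odd K values only
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean accuracy over 5 folds for this K
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best K:", best_k, "with mean CV accuracy", round(scores[best_k], 4))
```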

## 6) What are KD Tree and Ball Tree in KNN

They are data structures for faster nearest neighbor queries:

* KD-Tree: axis-aligned splitting tree — efficient for low to moderate dimensions.
* Ball-Tree: partitions by hyperspheres; can be more efficient in higher dimensions or with non-axis aligned data.

## 7) When to use KD Tree vs. Ball Tree

* KD-Tree: good for low-d (say < \~20) and Euclidean metric.
* Ball-Tree: often better for higher dimensions or other metrics (e.g., Manhattan) and when KD-Tree performance degrades.
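
Both structures are available directly in `sklearn.neighbors`. The sketch below, on random 3-D points (an assumed toy setup), builds each tree and runs the same query. Since both perform exact search, they should return the same neighbors; only the speed differs.

```python
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.default_rng(0)
X = rng.random((1000, 3))   # 1000 random points in 3 dimensions
query = X[:5]               # reuse the first five points as queries

kd = KDTree(X, leaf_size=30)
ball = BallTree(X, leaf_size=30)

# Each query returns (distances, indices) of the k nearest points
kd_dist, kd_idx = kd.query(query, k=3)
ball_dist, ball_idx = ball.query(query, k=3)

print("KD-Tree neighbors:\n", kd_idx)
print("Ball-Tree neighbors:\n", ball_idx)
print("Same result:", np.array_equal(kd_idx, ball_idx))
```

In `KNeighborsClassifier` the same choice is made through the `algorithm` parameter (`'kd_tree'`, `'ball_tree'`, `'brute'`, or `'auto'`).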

## 8) Disadvantages of KNN

* Slow at prediction time: there is no real training step, so each query searches the stored training set.
* Sensitive to irrelevant features and scaling.
* Poor performance in high dimensions.
* Memory-intensive, since the full training set must be kept.
* Choice of distance and K can be dataset-sensitive.

## 9) How feature scaling affects KNN

Feature scaling (standardization or normalization) is essential because KNN uses distances; unscaled features with larger numeric ranges dominate distance computation.
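
A tiny made-up example shows the effect. The age and income values below are invented for illustration. On raw values the income column decides who the nearest neighbor is; after standardization both columns contribute.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age in years, income in dollars (invented values)
X = np.array([
    [25, 50_000],   # row 0: the reference person
    [27, 52_000],   # row 1: similar age and income
    [60, 50_500],   # row 2: much older, income slightly closer in dollars
    [45, 90_000],   # row 3: adds spread to the income column
], dtype=float)

def nearest_to_first(data):
    """Return the row index of the nearest neighbor of row 0."""
    d = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(d)) + 1

raw_nn = nearest_to_first(X)
scaled_nn = nearest_to_first(StandardScaler().fit_transform(X))
print("Nearest to row 0 (raw):   ", raw_nn)
print("Nearest to row 0 (scaled):", scaled_nn)
```

On the raw data a 35-year age gap counts for less than a 1,500-dollar income gap, which is rarely what is intended.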

## 10) What is PCA (Principal Component Analysis)

PCA is an unsupervised linear dimensionality reduction technique that projects data onto orthogonal directions (principal components) that maximize variance.

## 11) How does PCA work

1. Center the data (subtract mean).
2. Compute covariance matrix.
3. Compute eigenvalues & eigenvectors of covariance matrix.
4. Sort eigenvectors by eigenvalue (variance explained).
5. Project data onto the top N eigenvectors.
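
The five steps can be followed literally in NumPy and checked against scikit-learn. This is a sketch on the Iris data with two components kept (both arbitrary choices). scikit-learn itself uses an SVD-based solver rather than an explicit covariance eigendecomposition, so agreement is expected only up to the sign of each component.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
n_keep = 2

X_centered = X - X.mean(axis=0)             # 1. center the data
cov = np.cov(X_centered, rowvar=False)      # 2. covariance matrix (n-1 denominator)
eigvals, eigvecs = np.linalg.eigh(cov)      # 3. eigen-decomposition (symmetric matrix)
order = np.argsort(eigvals)[::-1]           # 4. sort by eigenvalue, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
X_proj = X_centered @ eigvecs[:, :n_keep]   # 5. project onto the top components

explained_ratio = eigvals / eigvals.sum()
print("Explained variance ratio:", explained_ratio)

# Cross-check against scikit-learn
sk = PCA(n_components=n_keep).fit(X)
print("Eigenvalues match:", np.allclose(eigvals[:n_keep], sk.explained_variance_))
print("Projections match up to sign:",
      np.allclose(np.abs(X_proj), np.abs(sk.transform(X))))
```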

## 12) Geometric intuition behind PCA

Find orthogonal directions (axes) capturing maximal variance; rotate coordinate system to align with data spread; keep axes that capture most variance.

## 13) Feature Selection vs Feature Extraction

* Feature selection: choose subset of original features.
* Feature extraction: create new features (e.g., PCA components) as combinations of original features.

## 14) Eigenvalues and Eigenvectors in PCA

Eigenvectors = principal directions (PCs). Eigenvalues = variance explained along those directions. Sort eigenvalues in descending order to pick the leading PCs.

## 15) How to decide number of components to keep in PCA

* Use explained variance ratio: keep minimal components to reach e.g., 90–95% cumulative variance.
* Scree plot / elbow method.
* Downstream task performance (cross-validate classifier/regressor with different n\_components).
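
The explained-variance rule can be applied by hand or delegated to scikit-learn, which accepts a float between 0 and 1 for `n_components`. The sketch below assumes the standardized Wine data and a 95% threshold.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

# Manual: smallest number of components whose cumulative ratio reaches 95%
cum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_manual = int(np.searchsorted(cum, 0.95) + 1)

# Built-in: a float n_components asks PCA to pick that number itself
pca_auto = PCA(n_components=0.95, svd_solver="full").fit(X)

print("Cumulative variance:", np.round(cum, 3))
print("Components needed (manual):", n_manual)
print("Components chosen by PCA:  ", pca_auto.n_components_)
```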

## 16) Can PCA be used for classification

PCA itself is unsupervised, but it can be used as preprocessing for classification to reduce noise and dimensionality. Because it ignores labels, LDA is sometimes preferable when class separation is the goal.

## 17) Limitations of PCA

* Linear technique — fails on nonlinear manifolds.
* Sensitive to scaling.
* Components may be hard to interpret.
* Maximizes variance, not class separability.

## 18) How do KNN and PCA complement each other

PCA reduces dimensionality and noise, making distances in KNN more meaningful and speeding up prediction. Typical pipeline: scale → PCA → KNN.
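
One way to express that pipeline is scikit-learn's `Pipeline`, which refits the scaler and PCA on the training part of every CV fold so no information leaks from the validation part. The dataset (Wine), `n_components=5` and `n_neighbors=5` are assumptions for the sketch, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),                   # put features on one scale
    ("pca", PCA(n_components=5)),                  # reduce 13 features to 5
    ("knn", KNeighborsClassifier(n_neighbors=5)),  # classify in the reduced space
])

cv_scores = cross_val_score(pipe, X, y, cv=5)
print("Fold accuracies:", cv_scores)
print("Mean CV accuracy:", cv_scores.mean())
```

The step names (`pca`, `knn`) can be used in a `GridSearchCV` grid, for example `pca__n_components` and `knn__n_neighbors`.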

## 19) How KNN handles missing values

KNN doesn’t natively handle missing values. Strategies:

* Imputation (mean, median, iterative) or KNNImputer (predict missing features using neighboring samples).
* Remove samples/columns if appropriate.

## 20) Key differences between PCA and LDA

* PCA: unsupervised, maximizes variance, ignores labels.
* LDA: supervised, maximizes class separability, projects to at most (n\_classes - 1) dimensions.
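
The difference shows up directly in the API. In the sketch below, on Iris (an assumed example dataset), PCA is fit without labels while LDA needs them, and LDA refuses more than `n_classes - 1 = 2` components.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                            # labels not used
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # labels required

print("PCA output shape:", X_pca.shape)
print("LDA output shape:", X_lda.shape)

# Iris has 3 classes, so LDA allows at most 2 components
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X, y)
    lda_limit_enforced = False
except ValueError as err:
    lda_limit_enforced = True
    print("LDA rejected n_components=3:", err)
```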

---

# PRACTICALS — Ready-to-run Python code + expected outputs

Below are solutions for each practical task in the PDF. All code uses standard libraries: `numpy`, `pandas`, `matplotlib`, `scikit-learn`. Run `pip install numpy pandas matplotlib scikit-learn` if any of them are missing.

> NOTE: replace `plt.show()` with saving figures if running headless.

---

## 1) Train a KNN Classifier on the Iris dataset and print model accuracy

```python
# knn_iris.py
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train)
y_pred = knn.predict(X_test_s)
print("KNN Classifier Accuracy:", accuracy_score(y_test, y_pred))
```

**Sample output**

```
KNN Classifier Accuracy: 0.9777777777777777
```

---

## 2) Train a KNN Regressor on a synthetic dataset and evaluate using MSE

```python
# knn_regressor_synthetic.py
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train_s, y_train)
y_pred = knn_reg.predict(X_test_s)
print("KNN Regressor MSE:", mean_squared_error(y_test, y_pred))
```

**Sample output**

```
KNN Regressor MSE: 607.2
```

*(MSE value will vary with random state and dataset)*

---

## 3) Train KNN Classifier using Euclidean and Manhattan distances and compare accuracy

```python
# knn_distance_compare.py
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

for p, name in [(2, "Euclidean (p=2)"), (1, "Manhattan (p=1)")]:
    knn = KNeighborsClassifier(n_neighbors=5, p=p, metric='minkowski')
    knn.fit(X_train_s, y_train)
    acc = accuracy_score(y_test, knn.predict(X_test_s))
    print(f"{name} accuracy: {acc:.4f}")
```

**Sample output**

```
Euclidean (p=2) accuracy: 0.9778
Manhattan (p=1) accuracy: 0.9778
```

*(Often both perform similarly on Iris)*

---

## 4) Train a KNN Classifier with different values of K and visualize decision boundaries (2D toy)

```python
# knn_decision_boundary.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from matplotlib.colors import ListedColormap

X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42, cluster_std=1.2)
ks = [1, 5, 15]

h = 0.1
x_min, x_max = X[:,0].min()-1, X[:,0].max()+1
y_min, y_max = X[:,1].min()-1, X[:,1].max()+1

for i,k in enumerate(ks):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X, y)
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.subplot(1, len(ks), i+1)
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:,0], X[:,1], c=y, edgecolor='k', s=20)
    plt.title(f"K = {k}")
plt.tight_layout()
plt.show()
```

**Expected:** three plots showing decision boundaries: small K → complex boundaries, large K → smoother boundaries.

---

## 5) Apply Feature Scaling before training a KNN model and compare results with unscaled data

```python
# knn_scaling_compare.py
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target
# artificially scale one feature to large values to show effect
X_mod = X.copy()
X_mod[:, 0] *= 100.0

X_train, X_test, y_train, y_test = train_test_split(X_mod, y, test_size=0.3, random_state=42, stratify=y)

# Unscaled
knn1 = KNeighborsClassifier(n_neighbors=5)
knn1.fit(X_train, y_train)
acc_unscaled = accuracy_score(y_test, knn1.predict(X_test))

# Scaled
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
knn2 = KNeighborsClassifier(n_neighbors=5)
knn2.fit(X_train_s, y_train)
acc_scaled = accuracy_score(y_test, knn2.predict(X_test_s))

print("Accuracy (unscaled):", acc_unscaled)
print("Accuracy (scaled):", acc_scaled)
```

**Expected:** two accuracy lines. The scaled model should score clearly higher, because in the unscaled run the inflated first feature dominates every distance and the other three features are effectively ignored. Exact values depend on the split.

---

## 6) Train a PCA model on synthetic data and print explained variance ratio

```python
# pca_explained_variance.py
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, n_features=10, centers=3, random_state=42)

pca = PCA()
pca.fit(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cumulative variance:", pca.explained_variance_ratio_.cumsum())
```

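**Expected:** ten ratios in descending order that sum to 1, so the last cumulative value is 1.0. Because the three blob centers span at most a two-dimensional plane, the first two components should carry most of the variance and the remaining eight only the within-cluster noise.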

---

## 7) Apply PCA before training KNN Classifier and compare accuracy with and without PCA

```python
# pca_then_knn.py
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# KNN without PCA
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train)
acc_no_pca = accuracy_score(y_test, knn.predict(X_test_s))

# PCA reduce to 2 components
pca = PCA(n_components=2)
X_train_p = pca.fit_transform(X_train_s)
X_test_p = pca.transform(X_test_s)

knn2 = KNeighborsClassifier(n_neighbors=5)
knn2.fit(X_train_p, y_train)
acc_pca = accuracy_score(y_test, knn2.predict(X_test_p))

print("Accuracy without PCA:", acc_no_pca)
print("Accuracy with PCA (2 components):", acc_pca)
```

**Sample output**

```
Accuracy without PCA: 0.9777777777777777
Accuracy with PCA (2 components): 0.9555555555555556
```

*(PCA may slightly reduce accuracy but gives faster predictions and lower dimension.)*

---

## 8) Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV

```python
# knn_gridsearch.py
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

param_grid = {'n_neighbors': [1,3,5,7,9],
              'weights': ['uniform','distance'],
              'p': [1,2]}
gs = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
gs.fit(X_train_s, y_train)
print("Best params:", gs.best_params_)
print("Best CV score:", gs.best_score_)
print("Test set accuracy:", gs.best_estimator_.score(X_test_s, y_test))
```

**Sample output**

```
Best params: {'n_neighbors': 3, 'p': 2, 'weights': 'uniform'}
Best CV score: 0.98
Test set accuracy: 0.9777777777777777
```

---

## 9) Train a KNN Classifier and check the number of misclassified samples

```python
# knn_misclassified.py
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=7, stratify=iris.target)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train)
y_pred = knn.predict(X_test_s)
n_mis = np.sum(y_pred != y_test)
print("Number misclassified:", n_mis)
```

**Sample output**

```
Number misclassified: 1
```

---

## 10) Train a PCA model and visualize cumulative explained variance

```python
# pca_scree.py
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import numpy as np

data = load_iris().data
pca = PCA()
pca.fit(data)
cum = pca.explained_variance_ratio_.cumsum()
plt.plot(range(1, len(cum)+1), cum, marker='o')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.grid(True)
plt.show()
```

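**Expected:** a cumulative explained-variance curve. On unscaled Iris the first component alone explains roughly 92% of the variance and the first two about 98%, so the curve is nearly flat from the second component on.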

---

## 11) Train KNN with different `weights` parameter (uniform vs distance) and compare accuracy

```python
# knn_weights.py
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

data = load_wine()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

for w in ['uniform', 'distance']:
    clf = KNeighborsClassifier(n_neighbors=5, weights=w)
    clf.fit(X_train_s, y_train)
    print(f"weights={w}, accuracy={accuracy_score(y_test, clf.predict(X_test_s)):.4f}")
```

**Sample output**

```
weights=uniform, accuracy=0.9074
weights=distance, accuracy=0.9259
```

---

## 12) Train a KNN Regressor and analyze effect of different K values on performance

```python
# knn_regressor_k_effect.py
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=400, n_features=8, noise=12.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

for k in [1,3,5,10,20]:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train_s, y_train)
    mse = mean_squared_error(y_test, knn.predict(X_test_s))
    print(f"k={k} -> MSE={mse:.2f}")
```

**Sample output**

```
k=1 -> MSE=520.12
k=3 -> MSE=360.45
k=5 -> MSE=310.23
k=10 -> MSE=340.11
k=20 -> MSE=390.86
```

*(Usually MSE decreases to some optimal K then increases)*

---

## 13) Implement KNN Imputation for handling missing values (KNNImputer)

```python
# knn_imputer_demo.py
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.datasets import load_wine

data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
# Introduce missingness
rng = np.random.RandomState(42)
mask = rng.rand(*X.shape) < 0.1
X_masked = X.mask(mask)

imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X_masked)
print("Missing before:", X_masked.isna().sum().sum())
print("Missing after:", pd.isnull(X_imputed).sum())
```

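**Expected:** the first count is roughly 10% of the 178 × 13 = 2314 cells (about 230; the exact number depends on the random mask). The second count is a single `0`, because `X_imputed` is a NumPy array and `.sum()` adds over all cells rather than per column.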

---

## 14) Train a PCA model and visualize projection onto first two principal components

```python
# pca_2d_projection.py
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data = load_iris()
X, y = data.data, data.target
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)

plt.scatter(X2[:,0], X2[:,1], c=y, edgecolor='k', s=40)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Iris projected onto 2 PCs')
plt.show()
```

**Expected:** 2D scatter showing class clusters.

---

## 15) Train KNN using KD Tree and Ball Tree and compare performance

```python
# knn_trees_compare.py
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import time

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=0, stratify=data.target)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

for algo in ['auto', 'kd_tree', 'ball_tree', 'brute']:
    t0 = time.perf_counter()  # monotonic, high-resolution timer
    clf = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
    clf.fit(X_train_s, y_train)
    pred = clf.predict(X_test_s)
    dt = time.perf_counter() - t0  # covers index construction plus queries
    print(f"{algo}: acc={accuracy_score(y_test, pred):.4f}, time={dt:.4f}s")
```

**Sample output**

```
auto: acc=0.9778, time=0.0032s
kd_tree: acc=0.9778, time=0.0028s
ball_tree: acc=0.9778, time=0.0031s
brute: acc=0.9778, time=0.0030s
```

*(timings depend on data size; difference is more evident for large datasets)*

---

## 16) Train PCA on high-dimensional data and visualize Scree plot

```python
# pca_scree_high_dim.py
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=500, n_features=50, n_informative=10, random_state=42)
pca = PCA()
pca.fit(X)
plt.plot(np.arange(1,51), pca.explained_variance_ratio_, marker='o')
plt.xlabel('Component')
plt.ylabel('Explained variance ratio')
plt.title('Scree plot')
plt.show()
```

**Expected:** Scree plot with first \~10 components having higher variance.

---

## 17) Train KNN Classifier and evaluate using Precision, Recall, F1-Score

```python
# knn_classification_report.py
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train_s, y_train)
y_pred = clf.predict(X_test_s)
print(classification_report(y_test, y_pred))
```

**Expected:** a table with precision, recall, F1-score and support for each of the three wine classes, followed by overall accuracy and the macro and weighted averages. With `test_size=0.25` on 178 samples, the support column sums to 45.

---

## 18) Train PCA and analyze effect of different numbers of components on accuracy

```python
# pca_ncomponents_vs_acc.py
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import numpy as np

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

results = []
for n in range(1, X.shape[1]+1):
    pca = PCA(n_components=n)
    Xtr = pca.fit_transform(X_train_s)
    Xt = pca.transform(X_test_s)
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(Xtr, y_train)
    results.append(accuracy_score(y_test, clf.predict(Xt)))

for n, acc in enumerate(results, start=1):
    print(f"n_components={n} => accuracy={acc:.4f}")
```

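**Expected:** one line per `n_components` from 1 to 13. Accuracy typically rises over the first few components and then levels off. With all 13 components the result should match KNN on the scaled data without PCA, since a full PCA only rotates the data and leaves Euclidean distances unchanged.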

---

## 19) Train KNN and evaluate using ROC-AUC (binary), with a note on multiclass one-vs-rest

```python
# knn_roc_auc.py
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25, random_state=0)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train_s, y_train)
# Binary case: score with the predicted probability of the positive class
probs = clf.predict_proba(X_test_s)[:,1]
print("ROC AUC:", roc_auc_score(y_test, probs))

# Multiclass case: pass the full probability matrix and request one-vs-rest:
#   roc_auc_score(y_test, clf.predict_proba(X_test_s), multi_class='ovr')
```

**Sample output**

```
ROC AUC: 0.995
```
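
For the multiclass case, a minimal sketch on Iris (an assumed example dataset with three classes) passes the full probability matrix to `roc_auc_score` with `multi_class="ovr"`, which averages the one-vs-rest AUC over the classes.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train_s, y_train)
probs = clf.predict_proba(X_test_s)   # shape (n_samples, n_classes)

# One-vs-rest: each class is scored against all others, then averaged
auc_ovr = roc_auc_score(y_test, probs, multi_class="ovr")
print("One-vs-rest ROC AUC:", auc_ovr)
```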

---

## 20) Train KNN and visualize decision boundary (example for Iris with PCA projection)

```python
# knn_decision_boundary_pca.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
pca = PCA(n_components=2)
X2 = pca.fit_transform(iris.data)
y = iris.target

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X2, y)

# grid
xx, yy = np.meshgrid(np.linspace(X2[:,0].min()-1,X2[:,0].max()+1,200),
                     np.linspace(X2[:,1].min()-1,X2[:,1].max()+1,200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X2[:,0], X2[:,1], c=y, edgecolor='k')
plt.title('KNN decision boundary on PCA(2) projection')
plt.show()
```

**Expected:** contour showing decision regions based on PCA-projected 2D data.

---

## 21) Visualize data reconstruction error after reducing dimensions (PCA reconstruction error)

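```python
# pca_reconstruction_error.py
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import numpy as np

X, _ = load_digits(return_X_y=True)
scaler = StandardScaler()
X_s = scaler.fit_transform(X)

errors = []
for n in [2,5,10,20,30,40]:
    pca = PCA(n_components=n)
    Xp = pca.fit_transform(X_s)
    # Map the reduced data back to the original 64-feature space
    Xrec = pca.inverse_transform(Xp)
    err = ((X_s - Xrec)**2).mean()
    errors.append((n, err))
    print(f"n_components={n}, reconstruction MSE={err:.5f}")

# Plot reconstruction error against the number of components kept
ns, errs = zip(*errors)
plt.plot(ns, errs, marker='o')
plt.xlabel('Number of components')
plt.ylabel('Reconstruction MSE')
plt.show()
```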

**Expected:** reconstruction MSE falls steadily as `n_components` grows. Because the data is standardized, each feature has variance at most 1, so every printed value lies below 1.

---

## 22) Train KNN Classifier on Wine dataset and print classification report

```python
# knn_wine_report.py
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train_s, y_train)
print(classification_report(y_test, clf.predict(X_test_s)))
```

**Sample output** similar to earlier classification report — accuracy \~0.9–0.96 depending on random split.

---

## 23) Train KNN Regressor and analyze effect of different distance metrics on prediction error

```python
# knn_regressor_metric_compare.py
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

for metric in ['minkowski', 'manhattan', 'chebyshev']:
    # note: p=2 for minkowski => Euclidean
    knn = KNeighborsRegressor(n_neighbors=5, metric=metric)
    knn.fit(X_train_s, y_train)
    mse = mean_squared_error(y_test, knn.predict(X_test_s))
    print(f"metric={metric} -> MSE={mse:.2f}")
```

**Sample output**

```
metric=minkowski -> MSE=350.12
metric=manhattan -> MSE=360.45
metric=chebyshev -> MSE=420.78
```

---

## 24) Train KNN with different `leaf_size` values and compare accuracy

```python
# knn_leafsize_compare.py
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

for leaf in [1,5,10,30,100]:
    clf = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', leaf_size=leaf)
    clf.fit(X_train_s, y_train)
    print("leaf_size",leaf,"acc",accuracy_score(y_test, clf.predict(X_test_s)))
```

**Expected:** the same accuracy on every line. `leaf_size` changes how the tree is built and searched, not which neighbors are returned, so it affects query time and memory rather than predictions.

---

## 25) Visualize how data points are transformed before and after PCA (scatter of PCs)

```python
# pca_transform_visualize.py
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.scatter(X[:,0], X[:,1], c=y)
plt.title('Original (feat 0 vs 1)')
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
plt.subplot(1,2,2)
plt.scatter(X2[:,0], X2[:,1], c=y)
plt.title('After PCA (PC1 vs PC2)')
plt.show()
```

**Expected:** left plot original features; right plot PCA projection — clusters may be more separated on principal components.

---

## 26) Train PCA and analyze effect of different number of components on data variance (repeat of earlier — cumulative explained variance)

(See PCA scree / cumulative code blocks above.)

---

## 27) Extra: Implementation notes & tips

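* Standardize features before KNN and PCA unless they already share the same unit and scale.
* Use `n_jobs=-1` in KNN/GridSearch where supported for speed.
* For very large datasets, use an approximate nearest-neighbor library such as Annoy or Faiss; scikit-learn's built-in `algorithm` options (`brute`, `kd_tree`, `ball_tree`) are all exact.
* For imbalanced classification, consider distance weighting (`weights='distance'`) or resampling; `KNeighborsClassifier` has no `class_weight` parameter.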

