### KNN & PCA

1. What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?
- K-Nearest Neighbors (KNN) is a non-parametric, instance-based supervised learning method that predicts by looking at the K closest training samples to a query point using a distance metric (commonly Euclidean) and aggregating their targets. In classification, KNN assigns the class by majority (technically plurality) vote among the K neighbors, effectively using the neighbors’ mode as the predicted label. In regression, it predicts a continuous value by averaging (or taking the median of) the neighbors’ target values, making the output the neighbors’ mean/median. K controls the bias–variance trade-off: small K can be noisy but sensitive to local structure, while larger K smooths predictions and reduces sensitivity to outliers; the algorithm remains “lazy,” doing most computation at query time by storing the dataset and searching neighbors.
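The voting and averaging rules described above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements it; the function name `knn_predict` and the toy arrays are assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point using Euclidean distance to every training point."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]          # indices of the k closest samples
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Plurality vote: the most common label among the k neighbors.
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values.
    return float(np.mean(neighbor_targets))

# Hypothetical toy data: two well-separated clusters.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

query = np.array([0.5, 0.5])
label = knn_predict(X_train, y_class, query, k=3, task="classification")
value = knn_predict(X_train, y_reg, query, k=3, task="regression")
print("predicted class:", label)   # the three nearest points all carry label 0
print("predicted value:", value)   # mean of 1.0, 2.0, 3.0 = 2.0
```

There is no training step beyond storing `X_train` and `y_train`; all the work happens inside `knn_predict`, which is what "lazy" means here.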

2. What is the Curse of Dimensionality and how does it affect KNN
performance?
- The Curse of Dimensionality refers to phenomena that arise as feature count grows, where data becomes sparse, distances concentrate, and the amount of data needed grows exponentially, degrading learning and search efficiency. For KNN, this means distance metrics lose discriminative power (nearest and farthest neighbors become similarly distant), making neighbor selection noisy, increasing variance/overfitting, and demanding much more data and careful preprocessing (scaling, feature selection, dimensionality reduction) to maintain performance.
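Distance concentration can be observed directly with random data. The sketch below measures the relative contrast, (d_max − d_min) / d_min, between one random query and 1,000 uniformly random points; the helper name `relative_contrast` and the chosen dimensions are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """(d_max - d_min) / d_min for distances from one random query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# As the dimension grows, the farthest point is barely farther than the nearest one.
contrasts = {dim: relative_contrast(1000, dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrasts.items():
    print(f"dims={dim:5d}  relative contrast={c:.3f}")
```

A contrast close to zero means the "nearest" neighbors are nearly as far away as everything else, which is why KNN's neighbor choice becomes noisy in high dimensions.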

3. What is Principal Component Analysis (PCA)? How is it different from
feature selection?
- PCA is a dimensionality reduction technique that linearly transforms correlated features into a smaller set of orthogonal principal components that capture the maximum variance. The components are typically computed via eigendecomposition of the covariance matrix or, equivalently, SVD of the centered data matrix, and are used for compression, de-noising, and visualization. Unlike feature selection, which keeps a subset of the original features, PCA performs feature extraction by creating new rotated features (linear combinations of all originals), so components may be less interpretable but remove multicollinearity and often retain most of the variance with fewer dimensions.
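The contrast between selection and extraction can be made concrete with a small NumPy sketch. The toy data and the variance-based selection rule are assumptions chosen for illustration; real feature selection would usually use a task-specific criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: three features, the first two strongly correlated.
base = rng.normal(size=200)
X = np.column_stack([base,
                     base + 0.1 * rng.normal(size=200),
                     rng.normal(size=200)])
X_centered = X - X.mean(axis=0)

# Feature selection: keep the two highest-variance columns (original features survive as-is).
keep = np.argsort(X_centered.var(axis=0))[-2:]
X_selected = X_centered[:, keep]

# Feature extraction (PCA via SVD of the centered data): new features are linear combinations.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                  # each row holds weights over ALL original features
X_pca = X_centered @ components.T

print("selected column indices:", sorted(keep.tolist()))
print("PCA weights per component:\n", components.round(2))
print("correlation between the two PCA features:", np.corrcoef(X_pca.T)[0, 1].round(6))
```

The selected columns can still be correlated with each other, whereas the PCA features are uncorrelated by construction.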

4. What are eigenvalues and eigenvectors in PCA, and why are they
important?
- In PCA, the eigenvectors of the data covariance matrix A are the orthogonal directions (principal axes) along which variance is maximized, and each eigenvalue is the amount of variance captured along its eigenvector; both are obtained by solving $$A v = \lambda v$$. Sorting the eigenvectors by decreasing eigenvalue ranks the principal components by explained variance, so keeping only the top few reduces dimensionality while preserving as much variance as possible.
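The relationship between the covariance matrix, its eigenpairs, and explained variance can be checked numerically. The following is a minimal sketch on synthetic 2-D data (the mixing matrix is an arbitrary assumption) that verifies the eigen-equation and shows that the variance of the data projected onto an eigenvector equals its eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical 2-D data, stretched so that one direction dominates.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X_centered = X - X.mean(axis=0)

cov = np.cov(X_centered, rowvar=False)            # covariance matrix A
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: symmetric input, ascending eigenvalues

# Reorder so the largest eigenvalue (first principal component) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]             # columns are eigenvectors

explained_variance_ratio = eigenvalues / eigenvalues.sum()

# Check A v = lambda v for the leading eigenpair.
v1, lam1 = eigenvectors[:, 0], eigenvalues[0]
residual = np.linalg.norm(cov @ v1 - lam1 * v1)

# Variance of the data projected onto v1 equals lambda_1.
projected_var = np.var(X_centered @ v1, ddof=1)

print("explained variance ratio:", explained_variance_ratio.round(4))
print("||A v - lambda v|| =", residual)
print("projected variance vs eigenvalue:", projected_var, lam1)
```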

5. How do KNN and PCA complement each other when applied in a single
pipeline?
- PCA and KNN work well together because PCA produces a lower-dimensional set of uncorrelated components that preserve most of the variance. Discarding low-variance directions tends to remove noise and lessen distance “concentration,” so KNN’s distance-based neighbor search can become more discriminative and is cheaper to compute. PCA also removes multicollinearity, but it does not scale features itself: because it is variance-based, features should be standardized first so that no single large-scale feature dominates the components. With fewer dimensions KNN is less prone to overfitting and inference is faster, although an accuracy gain is not guaranteed because PCA ignores the class labels. The typical pipeline is: standardize → PCA (retain components for, say, 95% variance) → KNN.
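The standardize → PCA → KNN pipeline can be sketched with scikit-learn as follows. The Wine dataset, the 95% threshold, and `n_neighbors=5` are illustrative choices, not tuned values; passing a float to `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),                       # PCA is variance-based, so scale first
    ("pca", PCA(n_components=0.95, svd_solver="full")),  # keep ~95% of the variance
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)   # scaler and PCA are fit on the training split only

n_kept = pipe.named_steps["pca"].n_components_
test_accuracy = pipe.score(X_test, y_test)
print("components kept:", n_kept)
print(f"test accuracy: {test_accuracy:.4f}")
```

Wrapping all three steps in one `Pipeline` ensures that the scaler and PCA never see test data during fitting.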

In [1]:
# 6. Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.
# KNN on Wine dataset: with vs without scaling

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import pandas as pd
import sklearn

print("scikit-learn version:", sklearn.__version__)

data = load_wine(as_frame=True)
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

knn_no_scale = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn_no_scale.fit(X_train, y_train)
y_pred_no = knn_no_scale.predict(X_test)
acc_no = accuracy_score(y_test, y_pred_no)

pipe_scaled = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2))
])
pipe_scaled.fit(X_train, y_train)
y_pred_scaled = pipe_scaled.predict(X_test)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"Accuracy without scaling: {acc_no:.4f}")
print(f"Accuracy with scaling   : {acc_scaled:.4f}")

results = pd.DataFrame({
    "Setting": ["Without Scaling", "With Scaling"],
    "Test Accuracy": [acc_no, acc_scaled]
})
print("\nResults:\n", results)


scikit-learn version: 1.6.1
Accuracy without scaling: 0.7778
Accuracy with scaling   : 0.9333

Results:
            Setting  Test Accuracy
0  Without Scaling       0.777778
1     With Scaling       0.933333
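The gap between the two settings is plausibly explained by the very different feature scales in the Wine data: unscaled Euclidean distance is dominated by whichever feature has the largest numeric spread. The sketch below simply prints the per-feature standard deviations to make that visible; it is a diagnostic aid, not part of the original exercise.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
stds = wine.data.std(axis=0)
order = np.argsort(stds)[::-1]        # features from largest to smallest spread

for idx in order:
    print(f"{wine.feature_names[idx]:<30s} std = {stds[idx]:10.3f}")

largest_name = wine.feature_names[order[0]]
scale_ratio = stds[order[0]] / stds[order[-1]]
print(f"\nlargest-spread feature: {largest_name}")
print(f"ratio of largest to smallest std: {scale_ratio:.1f}")
```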


In [2]:
# 7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd

wine = load_wine(as_frame=True)
X = wine.data

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

pca = PCA(n_components=None, random_state=42)
pca.fit(X_std)

evr = pca.explained_variance_ratio_
cum_evr = np.cumsum(evr)

print("Explained variance ratio per component:")
for i, (r, c) in enumerate(zip(evr, cum_evr), start=1):
    print(f"PC{i:02d}: {r:.4f}  |  Cumulative: {c:.4f}")

df = pd.DataFrame({
    "PC": [f"PC{i}" for i in range(1, len(evr)+1)],
    "ExplainedVarianceRatio": evr,
    "CumulativeEVR": cum_evr
})
print("\nTable:\n", df)

Explained variance ratio per component:
PC01: 0.3620  |  Cumulative: 0.3620
PC02: 0.1921  |  Cumulative: 0.5541
PC03: 0.1112  |  Cumulative: 0.6653
PC04: 0.0707  |  Cumulative: 0.7360
PC05: 0.0656  |  Cumulative: 0.8016
PC06: 0.0494  |  Cumulative: 0.8510
PC07: 0.0424  |  Cumulative: 0.8934
PC08: 0.0268  |  Cumulative: 0.9202
PC09: 0.0222  |  Cumulative: 0.9424
PC10: 0.0193  |  Cumulative: 0.9617
PC11: 0.0174  |  Cumulative: 0.9791
PC12: 0.0130  |  Cumulative: 0.9920
PC13: 0.0080  |  Cumulative: 1.0000

Table:
       PC  ExplainedVarianceRatio  CumulativeEVR
0    PC1                0.361988       0.361988
1    PC2                0.192075       0.554063
2    PC3                0.111236       0.665300
3    PC4                0.070690       0.735990
4    PC5                0.065633       0.801623
5    PC6                0.049358       0.850981
6    PC7                0.042387       0.893368
7    PC8                0.026807       0.920175
8    PC9                0.022222       0.942397
9  

In [3]:
# 8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.
# KNN on original vs PCA(2) features - Wine dataset

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import pandas as pd
import sklearn

print("scikit-learn version:", sklearn.__version__)

wine = load_wine(as_frame=True)
X = wine.data
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

pipe_base = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])
pipe_base.fit(X_train, y_train)
y_pred_base = pipe_base.predict(X_test)
acc_base = accuracy_score(y_test, y_pred_base)

pipe_pca2 = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2, random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])
pipe_pca2.fit(X_train, y_train)
y_pred_pca2 = pipe_pca2.predict(X_test)
acc_pca2 = accuracy_score(y_test, y_pred_pca2)

print(f"Accuracy on standardized original features: {acc_base:.4f}")
print(f"Accuracy on PCA(2) features              : {acc_pca2:.4f}")

results = pd.DataFrame({
    "Setting": ["Std + KNN (original 13)", "Std + PCA(2) + KNN"],
    "Test Accuracy": [acc_base, acc_pca2]
})
print("\nResults:\n", results)

scikit-learn version: 1.6.1
Accuracy on standardized original features: 0.9333
Accuracy on PCA(2) features              : 0.9333

Results:
                    Setting  Test Accuracy
0  Std + KNN (original 13)       0.933333
1       Std + PCA(2) + KNN       0.933333
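A single train/test split gives only one data point per setting. As a hedged follow-up, the sketch below sweeps the number of retained components and scores each setting with 5-fold stratified cross-validation; the list of component counts and the fold settings are arbitrary choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

mean_accuracy = {}
for n in (1, 2, 3, 5, 8, 13):
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=n, random_state=42)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ])
    # Each fold refits the scaler and PCA on its own training portion.
    mean_accuracy[n] = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy").mean()
    print(f"n_components={n:2d}  mean CV accuracy={mean_accuracy[n]:.4f}")
```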


In [4]:
# 9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.
# KNN with Euclidean vs Manhattan distance on scaled Wine dataset

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import pandas as pd
import sklearn

print("scikit-learn version:", sklearn.__version__)

wine = load_wine(as_frame=True)
X = wine.data
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

pipe_euclid = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2))
])

pipe_manhat = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=1))
])

pipe_euclid.fit(X_train, y_train)
y_pred_e = pipe_euclid.predict(X_test)
acc_e = accuracy_score(y_test, y_pred_e)

pipe_manhat.fit(X_train, y_train)
y_pred_m = pipe_manhat.predict(X_test)
acc_m = accuracy_score(y_test, y_pred_m)

print(f"Accuracy (Euclidean, p=2): {acc_e:.4f}")
print(f"Accuracy (Manhattan, p=1): {acc_m:.4f}")

results = pd.DataFrame({
    "Distance": ["Euclidean (p=2)", "Manhattan (p=1)"],
    "Test Accuracy": [acc_e, acc_m]
})
print("\nResults:\n", results)

scikit-learn version: 1.6.1
Accuracy (Euclidean, p=2): 0.9333
Accuracy (Manhattan, p=1): 0.9778

Results:
           Distance  Test Accuracy
0  Euclidean (p=2)       0.933333
1  Manhattan (p=1)       0.977778
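With a 25% split of the 178 Wine samples, the test set holds 45 samples, so the difference between 0.9333 and 0.9778 corresponds to two test samples. To see whether one metric is consistently better, a repeated cross-validation comparison is a reasonable next step; the sketch below is one way to do it, with the repeat count chosen arbitrarily.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)

summary = {}
for name, p in (("euclidean", 2), ("manhattan", 1)):
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=p)),
    ])
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name:<10s} mean={scores.mean():.4f}  std={scores.std():.4f}")
```

If the two means differ by less than their standard deviations, the single-split gap should not be read as evidence that one metric is better.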


10. You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and the small number of samples, traditional models overfit. Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data
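- Use PCA to reduce dimensionality: Standardize gene expression (e.g., log-transform counts, normalize, then z-score) and run PCA to project tens of thousands of correlated genes onto a compact set of orthogonal components that capture the dominant variation and discard low-variance noise. Batch effects, which are common in high-throughput assays, can dominate the leading components rather than being filtered out, so check the top PCs against batch labels and correct for batch where needed. Fit the scaler and PCA on training folds only to avoid leaking information from held-out samples.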
- Decide how many components to keep: Inspect the explained variance via a scree plot and choose the “elbow,” or retain components until a cumulative variance threshold (typically 90–95%) is reached; validate this choice with nested cross-validation to ensure downstream performance stabilizes and is not sensitive to the exact component count.  
- Use KNN after reduction: Train KNN on the retained PCs, with distances computed in the reduced space, where neighbor search is more discriminative and less affected by the curse of dimensionality; tune K and the distance metric via cross-validation to balance bias and variance.
- Evaluate the model: Use stratified nested cross-validation, with the inner loop tuning hyperparameters and the outer loop estimating generalization. Report accuracy alongside per-class precision, recall, and F1 plus macro-averages, since cancer subtypes are often imbalanced. Add confusion matrices and calibration checks, use repeated splits or external validation where available to confirm stability, and run a sensitivity analysis over the number of PCs.
- Justify the pipeline: PCA+KNN directly addresses p≫n overfitting by compressing correlated gene signals into data-driven “metagenes” that retain most of the variance while reducing model variance and computational load. The pipeline is simple to audit and reproduce: component loadings can be traced back to the contributing genes, and each prediction can be explained by pointing to the nearest training patients. It also integrates standard QC steps (normalization, batch assessment), principled model selection, and rigorous validation suitable for clinical research workflows.
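The workflow above can be sketched end to end with scikit-learn. No real expression data is used here: `make_classification` generates a synthetic stand-in with far more features than samples, and the grid values, fold counts, and macro-F1 scoring are assumptions for illustration rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for expression data: 80 "patients", 2000 "genes" (p >> n).
X, y = make_classification(
    n_samples=80, n_features=2000, n_informative=20, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(random_state=0)),
    ("knn", KNeighborsClassifier()),
])

# Component counts must stay below the smallest inner training fold size.
param_grid = {
    "pca__n_components": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
    "knn__p": [1, 2],                 # 1 = Manhattan, 2 = Euclidean
}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # estimation

# Inner loop: pick hyperparameters. Outer loop: score the whole tuning procedure.
search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="f1_macro")
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="f1_macro")

print("outer-fold macro-F1:", outer_scores.round(3))
print(f"mean={outer_scores.mean():.3f}  std={outer_scores.std():.3f}")
```

Because scaling and PCA live inside the pipeline, they are refit within every inner and outer fold, so the outer-fold score is not contaminated by held-out samples.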