# KNN & PCA
1. What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

Answer:
K-Nearest Neighbors (KNN) is a non-parametric, instance-based supervised learning algorithm. It stores the training examples and, for a new query point, finds the k closest training examples (neighbors) according to a distance metric (commonly Euclidean).

 - Classification: The predicted class for the query is the majority class among the k neighbors, optionally with votes weighted by inverse distance.

 - Regression: The predicted value is the mean (or distance-weighted mean) of the target values of the k neighbors.

Key points:

 - No explicit training phase (lazy learner): fitting only stores the data, or builds a search index such as a KD-tree.

 - Choice of k controls the bias–variance trade-off: small k → low bias, high variance; large k → higher bias, lower variance.

 - The distance metric (Euclidean, Manhattan, Minkowski) and feature scaling (e.g., StandardScaler) strongly influence performance, because features with large numeric ranges otherwise dominate the distance.

 - Complexity: a brute-force query costs O(n_features × n_train). KD-trees and ball trees speed this up in low dimensions, but their advantage fades as dimensionality grows.
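
The neighbor search and the two prediction rules described above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements it; the function name `knn_predict` and the toy arrays are made up for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Brute-force KNN for a single query point (illustrative only)."""
    # Euclidean distance from the query to every stored training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    if task == "classification":
        # Majority vote among the k neighbors
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]
    # Regression: unweighted mean of the neighbors' target values
    return y_train[nearest].mean()

# Toy data (hypothetical): two well-separated clusters in 2-D
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])

query = np.array([1.1, 1.0])
pred_class = knn_predict(X_toy, y_class, query, k=3)
pred_value = knn_predict(X_toy, y_reg, query, k=3, task="regression")
print("class:", pred_class, "value:", pred_value)
```

Each call scans every training point, which is where the O(n_features × n_train) per-query cost comes from.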

2. What is the Curse of Dimensionality and how does it affect KNN performance?

Answer:
The curse of dimensionality refers to several problems that arise when the number of features (dimensions) grows large:

 - Distances become less informative: in high dimensions, distances between points concentrate (nearest and farthest distances become similar).

 - Sparsity: data become sparse, requiring exponentially more samples to densely cover the space.

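 - Higher overfitting risk: irrelevant dimensions add noise to every distance, and with few samples per dimension a model can fit spurious patterns.

For KNN: because KNN relies entirely on distance, once distances stop discriminating between points the nearest neighbors are no longer meaningfully "near," and performance degrades. Feature scaling, dimensionality reduction (PCA, feature selection), and distance metrics that tend to hold up better in high dimensions (Manhattan is often suggested over Euclidean) all help.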
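
Distance concentration is easy to observe on synthetic data. The sketch below assumes points drawn uniformly from the unit hypercube and reports the relative contrast, (farthest − nearest) / nearest, from one random query point; the function name and the chosen dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=1000):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:5d}  relative contrast={c:.3f}")
```

The contrast should shrink sharply as the dimension grows, which is the sense in which the nearest and farthest neighbors become hard to tell apart.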

3. What is Principal Component Analysis (PCA)? How is it different from feature selection?

Answer:
PCA is an unsupervised linear dimensionality reduction technique that finds orthogonal directions (principal components) of maximum variance in the data and projects the data onto the top components. Steps: center the data (and usually standardize it), compute the covariance matrix, compute its eigenvectors and eigenvalues, sort them by eigenvalue in descending order, keep the top components, and project the centered data onto them.

Difference from feature selection:

 - PCA produces new features (linear combinations of the original features); this is feature extraction.

 - Feature selection chooses a subset of the original features (keeps the original variables).

 - PCA reduces dimensionality by capturing variance, at the cost of interpretability; selection keeps interpretable original features.
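
The PCA steps listed above can be sketched directly with NumPy. This is a minimal illustration on synthetic data; the function name `pca_eig` and the mixing matrix are assumptions for the example, and library implementations such as scikit-learn's typically rely on an SVD rather than an explicit covariance matrix.

```python
import numpy as np

def pca_eig(X, n_components):
    """PCA via eigendecomposition of the covariance matrix (illustrative)."""
    X_centered = X - X.mean(axis=0)                # 1. center the data
    cov = np.cov(X_centered, rowvar=False)         # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # 3. eigenpairs (symmetric matrix)
    order = np.argsort(eigvals)[::-1]              # 4. sort by eigenvalue, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]         # 5. keep the top components
    X_proj = X_centered @ components               # 6. project onto them
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return X_proj, components, explained_ratio

rng = np.random.default_rng(0)
# Synthetic correlated data (hypothetical mixing matrix)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.0, 0.2]])
X_demo = rng.normal(size=(200, 3)) @ mixing

X_proj, components, ratio = pca_eig(X_demo, n_components=2)
print("projected shape:", X_proj.shape)
print("explained variance ratio:", ratio)
```

Note that the projected columns are new variables (mixtures of all three inputs), which is what separates this from selecting a subset of the original columns.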

4. What are eigenvalues and eigenvectors in PCA, and why are they important?

Answer:
In PCA, the eigenvectors of the covariance matrix are the principal component directions (unit vectors): the first points along the direction of maximal variance, and each later one captures the most remaining variance while staying orthogonal to the earlier ones. Each eigenvalue equals the variance of the data along its eigenvector, so sorting eigenvalues in descending order ranks the components by importance. They are important because they let us:

 - Rank components by explained variance.

 - Choose how many components to keep (e.g., retain the components that collectively explain 90% of the variance).

 - Project high-dimensional data into a lower-dimensional subspace that preserves most variance.
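
As a sketch of the "retain 90% of the variance" rule, the snippet below picks the component count on the standardized Wine data in two ways: by reading the cumulative explained variance, and by passing a float to `PCA(n_components=...)`. The 0.90 threshold and the variable names are illustrative choices, and standardizing the full dataset is done here only for description, not for model evaluation.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_wine = StandardScaler().fit_transform(load_wine().data)

# Option 1: read the count off the cumulative explained variance
cum_var = np.cumsum(PCA().fit(X_wine).explained_variance_ratio_)
k_manual = int(np.searchsorted(cum_var, 0.90) + 1)

# Option 2: a float in (0, 1) asks scikit-learn to keep enough components
# to reach that fraction of variance (documented for svd_solver="full")
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X_wine)

print("components needed (manual):", k_manual)
print("components kept by PCA    :", pca_90.n_components_)
```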

5. How do KNN and PCA complement each other when applied in a single pipeline?

Answer:
PCA reduces dimensionality and noise, making distance metrics more meaningful and reducing computational cost for KNN. Pipeline: scale → PCA (reduce dims) → KNN. Benefits:

 - Mitigates curse of dimensionality.

 - Faster KNN predictions (fewer dimensions).

 - Often improves generalization by removing noisy / redundant features.

Caveat: PCA is linear and unsupervised, so it may discard low-variance directions that are nonetheless relevant for classification. Consider supervised dimensionality reduction (e.g., Linear Discriminant Analysis) if needed.
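
A minimal sketch of the scale → PCA → KNN pipeline, assuming the Wine dataset and 5-fold cross-validation. The choice of 5 components and 5 neighbors is arbitrary and only for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_w, y_w = load_wine(return_X_y=True)

pipe_sketch = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=5)),                  # arbitrary illustrative value
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Each fold refits the scaler and PCA on its own training part,
# so no information from the held-out part reaches preprocessing
cv_scores = cross_val_score(pipe_sketch, X_w, y_w, cv=5)
print("Mean CV accuracy:", round(cv_scores.mean(), 4))
```

Wrapping the three steps in a `Pipeline` keeps preprocessing inside the cross-validation loop, which is the main practical reason to combine them this way.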

6. Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

Answer:


In [1]:


from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data
data = load_wine()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Without scaling
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
acc_no_scale = accuracy_score(y_test, knn_no_scale.predict(X_test))

# With scaling
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_s, y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(X_test_s))

print("Accuracy without scaling :", round(acc_no_scale,4))
print("Accuracy with scaling    :", round(acc_scaled,4))


Accuracy without scaling : 0.8056
Accuracy with scaling    : 0.9722
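
A likely reason for the gap is that the raw Wine features differ widely in spread, so the features with the largest numeric range dominate Euclidean distance. The sketch below simply lists the features with the largest standard deviation; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
stds = wine.data.std(axis=0)

# Features sorted by standard deviation, largest first
order = np.argsort(stds)[::-1]
for i in order[:3]:
    print(f"{wine.feature_names[i]:<30s} std = {stds[i]:.2f}")

largest_feature = wine.feature_names[order[0]]
print("Ratio of largest to smallest std:", round(stds.max() / stds.min(), 1))
```

After standardization every feature has unit variance, so each contributes on a comparable scale to the distance.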


7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

Answer:

In [2]:

from sklearn.decomposition import PCA

# Standardize the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA
pca = PCA()
pca.fit(X_scaled)

# Print explained variance ratio
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i} : {ratio:.4f}")

print("\nCumulative variance explained:")
print(pca.explained_variance_ratio_.cumsum())


PC1 : 0.3620
PC2 : 0.1921
PC3 : 0.1112
PC4 : 0.0707
PC5 : 0.0656
PC6 : 0.0494
PC7 : 0.0424
PC8 : 0.0268
PC9 : 0.0222
PC10 : 0.0193
PC11 : 0.0174
PC12 : 0.0130
PC13 : 0.0080

Cumulative variance explained:
[0.36198848 0.55406338 0.66529969 0.73598999 0.80162293 0.85098116
 0.89336795 0.92017544 0.94239698 0.96169717 0.97906553 0.99204785
 1.        ]


8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

Answer:

In [3]:


from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

# Reuse X_scaled from the previous cell.
# Caveat: X_scaled was standardized on the full dataset before this split, so
# test-set statistics leak into preprocessing; fit the scaler on the training
# split only when an unbiased estimate matters.
X_train_s, X_test_s, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# KNN on full scaled data
knn_full = KNeighborsClassifier(n_neighbors=5)
knn_full.fit(X_train_s, y_train)
acc_full = accuracy_score(y_test, knn_full.predict(X_test_s))

# PCA with 2 components (also fit on the full dataset here, so the same caveat applies)
pca2 = PCA(n_components=2, random_state=42)
X_pca2 = pca2.fit_transform(X_scaled)
# Same random_state and stratification, so this split matches the one above
X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(
    X_pca2, y, test_size=0.2, random_state=42, stratify=y
)
knn_pca2 = KNeighborsClassifier(n_neighbors=5)
knn_pca2.fit(X_train_p, y_train_p)
acc_pca2 = accuracy_score(y_test_p, knn_pca2.predict(X_test_p))

print("Accuracy (full scaled features):", round(acc_full,4))
print("Accuracy (PCA 2 components)   :", round(acc_pca2,4))


Accuracy (full scaled features): 0.9722
Accuracy (PCA 2 components)   : 0.8889
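
For comparison, here is a sketch of the same experiment with the scaler and PCA fit on the training split only, using `make_pipeline`. The variable names are illustrative, and the resulting accuracies are not quoted here because they may differ from the figures above.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

# Scaler (and PCA) are fit inside each pipeline, on the training split only
full_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pca2_pipe = make_pipeline(StandardScaler(), PCA(n_components=2),
                          KNeighborsClassifier(n_neighbors=5))

acc_full_split = full_pipe.fit(X_tr, y_tr).score(X_te, y_te)
acc_pca2_split = pca2_pipe.fit(X_tr, y_tr).score(X_te, y_te)

print("Full features, train-only preprocessing:", round(acc_full_split, 4))
print("PCA (2 comps), train-only preprocessing:", round(acc_pca2_split, 4))
```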


9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

Answer:

In [4]:


# Use scaled data
X_train_s, X_test_s, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# Euclidean distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_s, y_train)
acc_euclidean = accuracy_score(y_test, knn_euclidean.predict(X_test_s))

# Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_s, y_train)
acc_manhattan = accuracy_score(y_test, knn_manhattan.predict(X_test_s))

print("Euclidean accuracy :", round(acc_euclidean,4))
print("Manhattan accuracy :", round(acc_manhattan,4))


Euclidean accuracy : 0.9722
Manhattan accuracy : 1.0
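
With 36 test samples, the difference between 0.9722 and 1.0 is a single prediction (35/36 vs. 36/36), so one split is weak evidence that Manhattan is better. A sketch of a more stable comparison using stratified 5-fold cross-validation follows; the fold count and seed are arbitrary choices.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_m, y_m = load_wine(return_X_y=True)
cv_m = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_means = {}
for metric in ("euclidean", "manhattan"):
    # Scaling sits inside the pipeline so it is refit on each training fold
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5, metric=metric),
    )
    fold_scores = cross_val_score(model, X_m, y_m, cv=cv_m)
    cv_means[metric] = fold_scores.mean()
    print(f"{metric:<10s} mean={fold_scores.mean():.4f}  std={fold_scores.std():.4f}")
```

If the two means fall within about one standard deviation of each other, the metrics are effectively tied on this dataset.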


10. You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit. Explain how you would:

● Use PCA to reduce dimensionality

● Decide how many components to keep

● Use KNN for classification post-dimensionality reduction

● Evaluate the model

● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

Answer:

In [5]:


from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.metrics import balanced_accuracy_score, classification_report
import numpy as np

# Simulate high-dimensional gene data
X_hd, y_hd = make_classification(
    n_samples=100, n_features=5000, n_informative=50, n_redundant=50,
    n_classes=3, random_state=42
)

X_train_hd, X_test_hd, y_train_hd, y_test_hd = train_test_split(
    X_hd, y_hd, test_size=0.2, random_state=42, stratify=y_hd
)

# Pipeline: StandardScaler -> PCA -> KNN
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('knn', KNeighborsClassifier())
])

# Parameter grid
# Keep pca__n_components below the per-fold training size: with 80 training
# samples and 5-fold CV, PCA is fit on 64 samples per fold, so
# n_components=100 cannot be fit and is left out of the grid.
param_grid = {
    'pca__n_components': [10, 20, 50],
    'knn__n_neighbors': [3, 5, 7],
    'knn__metric': ['euclidean', 'manhattan']
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(pipe, param_grid=param_grid, cv=cv,
                    scoring='balanced_accuracy', n_jobs=-1)
grid.fit(X_train_hd, y_train_hd)

print("Best parameters:", grid.best_params_)
print("Best cross-val balanced accuracy:", round(grid.best_score_,4))

# Evaluate on test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test_hd)

print("\nClassification Report on Test Data:\n")
# zero_division=0 reports 0.00 for classes that are never predicted, without a warning
print(classification_report(y_test_hd, y_pred, zero_division=0))


Best parameters: {'knn__metric': 'manhattan', 'knn__n_neighbors': 3, 'pca__n_components': 20}
Best cross-val balanced accuracy: 0.4489

Classification Report on Test Data:

              precision    recall  f1-score   support

           0       0.37      1.00      0.54         7
           1       1.00      0.17      0.29         6
           2       0.00      0.00      0.00         7

    accuracy                           0.40        20
   macro avg       0.46      0.39      0.27        20
weighted avg       0.43      0.40      0.27        20
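
PCA cannot return more components than min(n_samples, n_features) of the data it is fit on, so the largest valid `pca__n_components` in a grid search depends on the per-fold training size rather than on the full dataset. A small sketch of how that cap could be computed, assuming 80 training samples, 5000 features, and the same 5-fold stratified split; the placeholder arrays are hypothetical and only stand in for the real data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumed sizes: 80 training samples, 5000 features, 3 roughly balanced classes
n_train, n_features = 80, 5000
y_placeholder = np.arange(n_train) % 3       # hypothetical labels, used only for stratification
X_placeholder = np.zeros((n_train, 1))       # split() only needs the number of rows

cv_sketch = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_train_sizes = [
    len(train_idx) for train_idx, _ in cv_sketch.split(X_placeholder, y_placeholder)
]

# Upper bound on the number of components PCA can be asked for in any fold
max_components = min(min(fold_train_sizes), n_features)
print("Largest valid pca__n_components:", max_components)
```

Any grid value above this cap makes the PCA step fail during fitting, so candidate values are best chosen at or below it.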



