# KNN AND PCA


Question 1. What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

Answer: K-Nearest Neighbors (KNN) is a supervised, instance-based (lazy learning) algorithm used for both classification and regression tasks. It does not fit an explicit model during training; instead, it stores the training data and makes each prediction from distances between the query point and the stored points.

How KNN Works:

-  Choose a value of K (number of neighbors).

-  Compute the distance between the test point and all training points.

-  Select the K closest neighbors.

-  Make prediction based on neighbors.
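
The four steps above can be sketched as a brute-force implementation in NumPy. This is a minimal illustration, not how scikit-learn implements KNN: the function name `knn_predict`, its `task` parameter, and the toy arrays are made up for this example. Classification takes a majority vote among the neighbors, and regression averages their target values.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point using brute-force Euclidean KNN."""
    # Step 2: distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    # Step 4: majority vote (classification) or mean (regression)
    if task == "classification":
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))

# Toy data, invented for illustration: two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [6.0, 6.0], [7.0, 7.0], [6.5, 8.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 1.2, 0.8, 5.0, 5.5, 6.0])

predicted_class = knn_predict(X_toy, y_class, np.array([1.2, 1.1]), k=3)
predicted_value = knn_predict(X_toy, y_value, np.array([6.4, 7.0]), k=3,
                              task="regression")
print(predicted_class)  # 0: all three nearest neighbors belong to class 0
print(predicted_value)  # 5.5: mean of 5.0, 5.5 and 6.0
```

Step 1 (choosing K) is the `k` argument; in practice it is usually tuned with cross-validation.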

KNN for Classification

-  Uses majority voting.

-  The class most common among the K neighbors becomes the predicted class.

$$\hat{y} = \operatorname{mode}(y_1, y_2, \ldots, y_K)$$

KNN for Regression

-  Uses mean (or weighted mean) of neighbor values.

$$\hat{y} = \frac{1}{K} \sum_{i=1}^{K} y_i$$

Common Distance Metrics

-  Euclidean Distance

-  Manhattan Distance

-  Minkowski Distance
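
The three metrics are closely related: the Minkowski distance of order p is $\left(\sum_j |a_j - b_j|^p\right)^{1/p}$, which gives the Manhattan distance for p = 1 and the Euclidean distance for p = 2. A short sketch, with a helper name `minkowski` chosen for this example:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

manhattan = minkowski(a, b, p=1)  # |3| + |4| = 7
euclidean = minkowski(a, b, p=2)  # sqrt(9 + 16) = 5
print(manhattan, euclidean)
```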



Question 2. What is the Curse of Dimensionality and how does it affect KNN
performance?


Answer: The Curse of Dimensionality refers to problems that arise when working in high-dimensional feature spaces.

Effects on KNN:

1. Distance Concentration

   Distances between points become similar, making neighbor distinction difficult.

2. Sparsity of Data

   Data becomes sparse, requiring exponentially more samples.

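3. Computational Cost Increases

   Each prediction computes a distance to every stored training point across all features, so cost grows with the number of dimensions.

4. Overfitting Risk

   With many irrelevant or noisy features, the nearest neighbors can be determined by noise rather than by the signal.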

Because KNN relies purely on distance, its performance degrades significantly in high-dimensional space.
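
Distance concentration can be observed with a small simulation. The sketch below assumes uniformly random points, which is an idealized setting, and uses a helper `relative_contrast` defined only for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance; the value should shrink as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(max - min) / min distance from a random query to random uniform points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrasts.items():
    print(f"{dim:>5} dims: relative contrast = {c:.2f}")
```

When the contrast is close to zero, the "nearest" neighbor is barely closer than any other point, which is why neighbor-based predictions become unreliable.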


Question 3. What is Principal Component Analysis (PCA)? How is it different from feature selection?

Answer:

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms original features into new orthogonal variables called principal components.

These components:

-  Are linear combinations of original features

-  Are ordered so that each one captures the maximum remaining variance

-  Are uncorrelated with one another

| PCA                                       | Feature Selection           |
| ----------------------------------------- | --------------------------- |
| Creates new features                      | Selects existing features   |
| Unsupervised                              | Can be supervised           |
| Reduces dimensionality via transformation | Removes irrelevant features |
| May reduce interpretability               | Maintains interpretability  |
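
The contrast in the table can be seen by applying both approaches to the Wine data. This is an illustrative sketch: `SelectKBest` with `f_classif` is just one of several feature selection methods and is chosen here as an assumption, not something prescribed by the comparison above.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)

# Feature selection: keeps 2 of the original columns (supervised, uses the labels)
selector = SelectKBest(score_func=f_classif, k=2).fit(X_std, wine.target)
kept = [name for name, keep in zip(wine.feature_names, selector.get_support())
        if keep]
print("Selected original features:", kept)

# PCA: builds 2 new columns, each a weighted mix of all 13 originals (ignores labels)
pca_demo = PCA(n_components=2).fit(X_std)
print("PCA loadings shape:", pca_demo.components_.shape)
```

The selected features keep their original names and units, whereas each row of `components_` holds 13 weights, one per original feature, which is why principal components are harder to interpret.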




Question 4. What are eigenvalues and eigenvectors in PCA, and why are they important?


Answer:


In PCA:


-  Eigenvectors → Directions of maximum variance.

-  Eigenvalues → Magnitude of variance in those directions.

If Σ is the covariance matrix:

$$\Sigma v = \lambda v$$

Where:


-  v = eigenvector

-  λ = eigenvalue

Importance:

-  Eigenvectors define principal components.

-  Eigenvalues determine how much variance each component explains.

-  Larger eigenvalue → More important component.
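
The relationship Σv = λv can be checked numerically. The sketch below uses synthetic two-feature data invented for this example and computes the eigendecomposition directly with NumPy. Library implementations of PCA typically use an SVD instead, but the resulting components and variances correspond.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data with two correlated columns (made up for illustration)
x = rng.normal(size=200)
data = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=200)])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)  # covariance matrix Σ

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

v1, lam1 = eigenvectors[:, 0], eigenvalues[0]
print("Σv = λv holds:", np.allclose(cov @ v1, lam1 * v1))

# Each eigenvalue's share of the total is the explained variance ratio
explained_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", explained_ratio)

# Projecting onto the eigenvectors gives the principal component scores
scores = centered @ eigenvectors
```

The variance of each column of `scores` equals the corresponding eigenvalue, which is the sense in which eigenvalues measure the variance along each component.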




Question 5. How do KNN and PCA complement each other when applied in a single
pipeline?


Answer: KNN suffers on high-dimensional data because distances become less informative (see Question 2).

PCA helps by:

-  Reducing dimensionality

-  Discarding low-variance directions, which are often dominated by noise

-  Replacing correlated features with uncorrelated components

-  Improving computational efficiency

Pipeline Flow:

-  Scale data

-  Apply PCA

-  Train KNN

-  Evaluate

This can improve:

-  Accuracy

-  Speed

-  Generalization


Question 6. Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.

Answer:

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy without scaling:", accuracy_score(y_test, y_pred))

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
print("Accuracy with scaling:", accuracy_score(y_test, y_pred_scaled))


Accuracy without scaling: 0.7407407407407407
Accuracy with scaling: 0.9629629629629629
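
One way to see why scaling matters so much here is to compare the spread of the raw features. The sketch below only inspects standard deviations; on this dataset the `proline` column has by far the largest spread, so it dominates unscaled Euclidean distances.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
stds = wine.data.std(axis=0)

# Feature names with the largest and smallest raw standard deviation
largest = wine.feature_names[int(np.argmax(stds))]
smallest = wine.feature_names[int(np.argmin(stds))]
spread_ratio = stds.max() / stds.min()

print(f"largest spread:  {largest} (std={stds.max():.1f})")
print(f"smallest spread: {smallest} (std={stds.min():.3f})")
print(f"ratio between them: {spread_ratio:.0f}x")
```

After `StandardScaler`, every feature has unit variance, so each one contributes comparably to the distance.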


Question 7. Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.

Answer:

In [2]:
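from sklearn.decomposition import PCA

# Standardize all features first; PCA is sensitive to feature scale.
# A fresh scaler is used so the one fitted on the training split is left untouched.
X_scaled = StandardScaler().fit_transform(X)

# Keep all 13 components to see the full variance breakdown
pca = PCA()
pca.fit(X_scaled)

print("Explained Variance Ratio:")
print(pca.explained_variance_ratio_)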


Explained Variance Ratio:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
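
The cumulative sum of these ratios shows how many components are needed to reach a given share of the variance. The sketch below reuses the values printed above; the helper name `components_for` is chosen for this example.

```python
import numpy as np

# Explained variance ratios copied from the output above
ratios = np.array([0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294,
                   0.04935823, 0.04238679, 0.02680749, 0.02222153, 0.01930019,
                   0.01736836, 0.01298233, 0.00795215])
cumulative = np.cumsum(ratios)

def components_for(threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)

print("Components for 90%:", components_for(0.90))  # 8
print("Components for 95%:", components_for(0.95))  # 10
```

The first two components together explain about 55% of the variance, which is worth keeping in mind when only two components are retained.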


Question 8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.

Answer:

In [3]:
pca_2 = PCA(n_components=2)
X_pca = pca_2.fit_transform(X_scaled)

X_train_pca, X_test_pca, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.3, random_state=42)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)

print("Accuracy with PCA (2 components):", accuracy_score(y_test, y_pred_pca))


Accuracy with PCA (2 components): 0.9814814814814815


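Comparison


-  Original scaled (13 features): ~0.96

-  PCA (2 components): ~0.98

-  On this split, two principal components match or slightly exceed the full scaled feature set while cutting dimensionality from 13 to 2. Because the scaler and PCA were fitted on the full dataset before splitting, this estimate may be slightly optimistic.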
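
In the cell above, the scaler and PCA are fitted on all rows before the train/test split. A variant that fits them on the training rows only can be sketched with `make_pipeline`; this is an alternative setup rather than the code that produced the output above, so its accuracy may differ slightly.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Scaler and PCA are fitted on the training rows only, then applied to the test rows
model = make_pipeline(StandardScaler(), PCA(n_components=2),
                      KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print("Held-out accuracy (PCA fitted on train only):", test_accuracy)
```

Wrapping the steps in one pipeline also follows the scale, PCA, KNN, evaluate flow in a single object that can be passed to cross-validation utilities.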

Question 9. Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

Answer:

In [4]:
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
print("Euclidean Accuracy:", accuracy_score(y_test, knn_euclidean.predict(X_test_scaled)))

knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
print("Manhattan Accuracy:", accuracy_score(y_test, knn_manhattan.predict(X_test_scaled)))


Euclidean Accuracy: 0.9629629629629629
Manhattan Accuracy: 0.9629629629629629


Conclusion

-  Both metrics reach the same accuracy (~0.96) on this split, so neither is preferable for the scaled Wine dataset here.

Question 10. You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.

Explain how you would:

● Use PCA to reduce dimensionality

● Decide how many components to keep

● Use KNN for classification post-dimensionality reduction

● Evaluate the model

● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data


Answer:  Solution Strategy

1️⃣ Use PCA

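-  Standardize the expression values (fit the scaler on training folds only)

-  Apply PCA

-  Collapse correlated genes into a few uncorrelated components and discard low-variance, noise-dominated directions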

2️⃣ Decide Number of Components

-  Scree Plot

-  Keep components explaining 90–95% of the variance

-  Confirm the choice with cross-validation

3️⃣ Apply KNN

-  Use scaled PCA-transformed data

-  Tune K using GridSearchCV

4️⃣ Evaluate Model

-  Cross-validation

-  Accuracy

-  Precision

-  Recall

-  ROC-AUC

-  Confusion matrix

5️⃣ Justification to Stakeholders

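This pipeline is robust because it:

-  Reduces overfitting by shrinking the feature space

-  Handles multicollinearity among genes

-  Stays easy to explain at the prediction level (a patient is classified like the most similar known cases), although individual components are harder to interpret than single genes

-  Suits small-sample, high-feature biomedical data

-  Is computationally efficient

-  Builds on PCA, a standard preprocessing step in gene expression analysis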

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('knn', KNeighborsClassifier())
])

param_grid = {
    'knn__n_neighbors': [3,5,7,9]
}

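# 5-fold CV; the scaler and PCA are refitted inside each training fold, so nothing leaks.
# The Wine data (X, y) stands in here for the gene expression matrix.
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X, y)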

print("Best Parameters:", grid.best_params_)
print("Best Accuracy:", grid.best_score_)


Best Parameters: {'knn__n_neighbors': 9}
Best Accuracy: 0.9609523809523809
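
Beyond the single cross-validated accuracy above, the other metrics listed in step 4 can be computed from out-of-fold predictions. The sketch below again uses the Wine data as a stand-in for a gene expression matrix; the choice of `n_neighbors=5` and of a shuffled `StratifiedKFold` are assumptions made for this example.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
model = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                      KNeighborsClassifier(n_neighbors=5))

# Each sample is predicted by a model that never saw it during fitting
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=cv)

cm = confusion_matrix(y, y_pred)
print(cm)
# Per-class precision, recall and F1
print(classification_report(y, y_pred))
```

Per-class recall is usually the figure to highlight for stakeholders in a diagnostic setting, because it shows how often each cancer type would be missed.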


Final Conclusion


-  KNN is simple but sensitive to feature scaling and dimensionality.

-  PCA can improve KNN's speed and accuracy on high-dimensional data.

-  Combining scaling, PCA, and KNN in one pipeline gives a compact and efficient classifier.

-  This approach is well suited to small-sample, high-dimensional data such as gene expression datasets.