# KNN & PCA | Assignment


1.  What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?
- K-Nearest Neighbors (KNN) is a **supervised machine learning algorithm** that makes predictions by looking at the **“K closest” data points** (neighbors) to a new input.

It is called **“lazy learning”** because it **does not build a model** during training. It simply **stores the training data** and uses it during prediction time.

-  How KNN Works (Step-by-Step)

For a new data point (test point), KNN does this:

1. **Choose a value of K** (example: K = 3, 5, 7)
2. **Calculate distance** between the new point and all training points
   (common distance: **Euclidean distance**)
3. **Pick the K nearest points**
4. Use those neighbors to make the prediction:

   * **Majority vote → Classification**
   * **Average value → Regression**
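
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `knn_predict` and the toy data are made up for this example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one point using the k nearest training points (Euclidean)."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Step 2: distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]

    # Step 4: majority vote (classification) or mean (regression)
    if task == "classification":
        return Counter(neighbor_targets).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))


# Toy data (hypothetical, for illustration only)
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]]
labels = ["A", "A", "A", "B", "B"]
prices = [100.0, 110.0, 105.0, 300.0, 310.0]

pred_class = knn_predict(X_toy, labels, [1.1, 1.0], k=3)
pred_value = knn_predict(X_toy, prices, [1.1, 1.0], k=3, task="regression")
print(pred_class, pred_value)
```

Note that there is no separate training function: all the work happens at prediction time, which is why KNN is called a lazy learner.
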
- Distance Measures Used in KNN

Most common:
 1) Euclidean Distance

\[
d = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}
\]

Other distances:

* **Manhattan distance**
* **Minkowski distance**
* **Cosine distance** (1 − cosine similarity; often used in text problems)
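
To make these measures concrete, here is a small sketch that computes them by hand. The helper names `minkowski` and `cosine_distance` are chosen for this example only. Manhattan and Euclidean distance are treated as the p = 1 and p = 2 cases of Minkowski distance.

```python
import numpy as np


def minkowski(a, b, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))


def cosine_distance(a, b):
    """1 - cosine similarity; depends on the angle, not on vector length."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


p1, p2 = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski(p1, p2, p=1)  # |3| + |4|
euclidean = minkowski(p1, p2, p=2)  # sqrt(3^2 + 4^2)
cos_dist = cosine_distance([1.0, 0.0], [0.0, 1.0])  # orthogonal vectors

print(manhattan, euclidean, cos_dist)
```
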

# KNN for Classification (Example)

### Goal: Predict a class (like Yes/No, Spam/Not Spam)

### How prediction is made:

* Take **K nearest neighbors**
* Check their labels
* The **most common class wins** (majority voting)

### Example:

If K = 5 neighbors have labels:
✅ {A, A, B, A, B}

Then prediction = **A** (because A appears 3 times)
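
The same vote can be reproduced with Python's `collections.Counter`. This is only a sketch of the voting step, using the five labels from the example:

```python
from collections import Counter

neighbor_labels = ["A", "A", "B", "A", "B"]  # labels of the 5 nearest neighbors
votes = Counter(neighbor_labels)

# most_common(1) returns a list holding the (label, count) pair with the top count
prediction = votes.most_common(1)[0][0]
print(dict(votes), "->", prediction)
```

Using an odd K in a two-class problem avoids tied votes.
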



#  KNN for Regression (Example)

### Goal: Predict a number (like price, salary, marks)

### How prediction is made:

* Take **K nearest neighbors**
* Take the **average of their values**

### Example:

If K = 3 neighbors have values:
{200, 220, 210}

Prediction:
\[
\frac{200+220+210}{3} = 210
\]
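
As a quick check, the same average can be obtained from scikit-learn's `KNeighborsRegressor`. The one-dimensional toy dataset below is invented for this sketch so that the three nearest neighbors carry the values from the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: the three points near x=2 have targets 200, 220, 210
X_toy = np.array([[1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([200.0, 220.0, 210.0, 500.0])

reg = KNeighborsRegressor(n_neighbors=3)  # default: unweighted mean of neighbors
reg.fit(X_toy, y_toy)

pred = reg.predict([[2.0]])[0]
print(pred)
```

The far-away point at x = 10 is not among the three nearest neighbors, so it has no influence on the prediction.
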

- Choosing the Right K Value

* **Small K (like 1 or 3):**

  * Very sensitive to noise
  * Can overfit

* **Large K (like 15 or 25):**

  * More stable
  * Can underfit

Usually, we test multiple K values using **cross-validation**.
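
One possible way to run that search is scikit-learn's `GridSearchCV`. The sketch below assumes the Wine dataset and an arbitrary list of candidate K values. Scaling sits inside the pipeline so it is refit on each training fold.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate K values (an arbitrary choice for this illustration)
candidate_ks = [1, 3, 5, 7, 9, 11, 15, 21]
search = GridSearchCV(pipe, {"knn__n_neighbors": candidate_ks}, cv=5)
search.fit(X, y)

best_k = search.best_params_["knn__n_neighbors"]
print("Best K:", best_k, "| mean CV accuracy:", round(search.best_score_, 4))
```
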
 - Advantages of KNN

✔ Simple and easy to understand
✔ Works well for small datasets
✔ Almost no training time (the data is only stored)

-  Disadvantages of KNN

* Slow prediction on large datasets (it computes the distance to every training point)
* Sensitive to irrelevant features
* Needs **feature scaling**, because features with large ranges otherwise dominate the distance





2.  What is the Curse of Dimensionality and how does it affect KNN
performance?
- The **Curse of Dimensionality** means:
 When the **number of features (dimensions)** becomes very large, many machine learning algorithms (especially distance-based ones like **KNN**) start performing **poorly**.



## Why is it called a “curse”?

Because in high dimensions:

### 1) **Data becomes sparse**

Even if you have thousands of rows, in a high-dimensional space the points are **far apart** and the space is mostly empty.

So KNN struggles to find “truly close” neighbors.



## How it affects KNN performance

KNN depends completely on **distance** (Euclidean/Manhattan etc.).
In high dimensions:

### 1) **Distances become less meaningful**

The distances from a point to its nearest and to its farthest neighbor become almost the same.

So KNN can’t clearly decide which points are “nearest”.

 Example idea:

* In 2D, nearest points are clearly close.
* In 100D, almost all points look similarly far.
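
A small simulation can illustrate this effect. The sketch below draws uniformly random points (an assumption made for simplicity) and measures the relative gap between the farthest and nearest point from a random query. The helper name `relative_contrast` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())


contrast = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrast.items():
    print(f"{d:>4} dims: relative contrast = {c:.3f}")
```

As the number of dimensions grows, the contrast shrinks toward zero, meaning the "nearest" neighbor is barely closer than the farthest one.
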



### 2) **More noise features reduce accuracy**

If many features are irrelevant, KNN includes them in distance calculation, making wrong neighbors appear “close”.

This reduces classification/regression quality.


### 3) **Prediction becomes slower**

KNN must compute distance from the test point to **all training points**.
More features = more calculations = slower prediction.



### 4) **Needs much more data**

To cover the space properly, high dimensions need **huge data**.
Otherwise KNN overfits or becomes unstable.


## Result on KNN

**Accuracy decreases**
 **Prediction time increases**
 **Neighbors become unreliable**


##  How to fix / reduce the problem

✔ **Feature scaling** (must for KNN)
✔ **Feature selection** (remove useless features)
✔ **Dimensionality reduction** (PCA, t-SNE for visualization)
✔ **Use smaller number of important features**
✔ Try other models (Decision Trees, Random Forest, etc.)



3. What is Principal Component Analysis (PCA)? How is it different from
feature selection?
- ## What is PCA (Principal Component Analysis)?

**PCA** is a **dimensionality reduction technique** that converts your original features into a **new set of features** called **principal components**.

These principal components:

* are **combinations of original features**
* are **uncorrelated** with each other (a weaker property than full statistical independence)
* capture the **maximum variance (information)** in the data

 Goal: **Reduce features but keep most of the important information.**


## How PCA works

PCA finds new directions (axes) in the data such that:

1. **PC1 (1st Principal Component)** captures the **most variance**
2. **PC2** captures the **2nd most variance** (and is perpendicular to PC1)
3. And so on…

Then we keep only the **top components** (like 2 or 3 instead of 20 features).

  Example

Suppose you have **10 features**.

After PCA, you might reduce them to **3 principal components** while still keeping, say, **95% of the total variance**.

So instead of:
 (X1, X2, X3, … X10)

You use:
 (PC1, PC2, PC3)



#  PCA vs Feature Selection (Main Difference)

| Feature Selection                            | PCA                                             |
| -------------------------------------------- | ----------------------------------------------- |
| Selects the **best original features**       | Creates **new features (principal components)** |
| Keeps features like X1, X4, X7               | Makes new features like PC1 = 0.5X1 + 0.3X2 + … |
| More **interpretable**                       | Less interpretable (hard to explain)            |
| Can improve model performance                | Reduces noise + helps with multicollinearity    |
| Works well when some features are irrelevant | Works well when features are correlated         |
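
The first two rows of the table can be seen directly in code. This sketch assumes the Wine dataset and uses `SelectKBest` with the ANOVA F-score as one of many possible feature-selection methods. Keeping 3 features or components is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target

# Feature selection: keeps 3 of the original columns unchanged
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_scaled, y)
kept = selector.get_support(indices=True)
print("Kept original features:", [data.feature_names[i] for i in kept])

# PCA: builds 3 new columns, each a weighted mix of all 13 original features
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)
print("PC1 weights on the 13 features:", np.round(pca.components_[0], 2))
```

The selected columns are identical to columns of the input, whereas each principal component has a weight for every original feature, which is why it is harder to explain.
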



##  Key Point

###  Feature Selection:

✔ Removes unnecessary features
✔ Keeps real features
✔ Easy to explain

###  PCA:

✔ Compresses features into fewer components
✔ Useful when features are correlated
✔ Helps reduce curse of dimensionality



## When to use PCA?

Use PCA when:

* You have **many features**
* Features are **highly correlated**
* You want faster models and less overfitting
* You want to reduce dimensionality for algorithms like **KNN, SVM**



4.  What are eigenvalues and eigenvectors in PCA, and why are they
important?
- In PCA, **eigenvectors and eigenvalues** come from the **covariance matrix** (or correlation matrix) of the dataset, and they tell PCA:

 **Which directions to project the data on**
 **How much information (variance) each direction contains**



##  Eigenvectors in PCA (What they mean)

**Eigenvectors** represent the **principal components (new axes/directions)**.

 Think of each eigenvector as a direction in the feature space; the first one points where the data varies the most.

So in PCA:

* **Eigenvector 1 → PC1 direction**
* **Eigenvector 2 → PC2 direction**
* etc.

 PCA projects the data onto these eigenvectors.


##  Eigenvalues in PCA (What they mean)

**Eigenvalues** tell **how much variance (information)** is captured along the corresponding eigenvector.

 Bigger eigenvalue = more important principal component.

Example:

* Eigenvalue for PC1 = 5.2 (high variance)
* Eigenvalue for PC2 = 1.1
* Eigenvalue for PC3 = 0.2 (very low variance)

So PC1 is the most useful.
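
To see where these quantities come from, here is a minimal NumPy sketch of PCA through the eigendecomposition of the covariance matrix. The two-feature synthetic dataset is invented for illustration. Library implementations such as scikit-learn typically use SVD instead, which gives equivalent components.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic 2-D data with correlated features (illustrative only)
x1 = rng.normal(size=300)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=300)
X = np.column_stack([x1, x2])

# 1) center the data, 2) covariance matrix, 3) eigendecomposition
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh is for symmetric matrices

# eigh returns ascending order; PCA wants the largest variance first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # each column is one PC direction

# Project onto the eigenvectors; variance along each PC equals its eigenvalue
X_projected = X_centered @ eigenvectors
projected_var = X_projected.var(axis=0, ddof=1)

print("Eigenvalues:       ", eigenvalues)
print("Variance along PCs:", projected_var)
```
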


##  Why are eigenvalues & eigenvectors important in PCA?

### 1) They decide the **new feature directions**

Eigenvectors define the **principal components** (PCs).

### 2) They decide which components to keep

Eigenvalues help you choose how many PCs to keep.

Common method:
 Keep the components with **highest eigenvalues**


##  Explained Variance Ratio

PCA uses eigenvalues to calculate:

\[
\text{Explained Variance Ratio}_i = \frac{\lambda_i}{\sum_j \lambda_j}
\]

Where:

* \( \lambda_i \) = eigenvalue of component i

This tells what % of information each PC contains.
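
As a worked example, the formula can be applied to the eigenvalues 5.2, 1.1 and 0.2 from the earlier example, assuming for illustration that these three are the only components:

```python
import numpy as np

# Eigenvalues from the example; assumed here to be the complete set
eigenvalues = np.array([5.2, 1.1, 0.2])

explained_variance_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_variance_ratio)

for i, (r, c) in enumerate(zip(explained_variance_ratio, cumulative), start=1):
    print(f"PC{i}: ratio = {r:.3f}, cumulative = {c:.3f}")
```

Under that assumption PC1 alone accounts for 5.2 / 6.5 = 80% of the variance.
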




5. How do KNN and PCA complement each other when applied in a single
pipeline?
- KNN and PCA work really well together because **PCA makes the data easier for KNN to handle**.

###  Why they complement each other

KNN is a **distance-based algorithm**, so its performance depends heavily on:

* meaningful distances
* fewer noisy/irrelevant features
* faster distance calculations

PCA helps with exactly these points.


##  How PCA helps KNN in one pipeline

### 1) Reduces the Curse of Dimensionality

In high dimensions, distances become less reliable, so KNN struggles.

 PCA reduces dimensions → distances become more meaningful → KNN becomes more accurate.


### 2) Removes noise and redundant features

If features are highly correlated or contain noise, KNN can pick wrong neighbors.

 PCA combines correlated features into fewer strong components → cleaner input for KNN.



### 3) Makes KNN faster

KNN calculates distance from each test point to all training points.

 Fewer dimensions = fewer computations = faster prediction.


### 4) Can reduce overfitting

Too many features can make KNN sensitive to small variations.

PCA keeps only important variance → smoother decision boundaries.


##  Typical PCA + KNN Pipeline

**Correct order:**

1. **Scale the data** (important!)
2. Apply **PCA**
3. Apply **KNN**

Example pipeline:
`StandardScaler → PCA → KNN`



6. Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load Wine dataset
X, y = load_wine(return_X_y=True)

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 1) KNN WITHOUT feature scaling
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
acc_no_scale = knn_no_scale.score(X_test, y_test)

# 2) KNN WITH feature scaling (StandardScaler)
knn_scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])
knn_scaled.fit(X_train, y_train)
acc_scaled = knn_scaled.score(X_test, y_test)

print("Accuracy WITHOUT scaling:", acc_no_scale)
print("Accuracy WITH scaling   :", acc_scaled)


Accuracy WITHOUT scaling: 0.8055555555555556
Accuracy WITH scaling   : 0.9722222222222222


7. Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.

In [2]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load Wine dataset
X, y = load_wine(return_X_y=True)

# Scale features before PCA
X_scaled = StandardScaler().fit_transform(X)

# Train PCA model
pca = PCA()
pca.fit(X_scaled)

# Explained variance ratio
evr = pca.explained_variance_ratio_

print("Explained Variance Ratio of each Principal Component:\n")
for i, ratio in enumerate(evr, start=1):
    print(f"PC{i}: {ratio:.6f}")


Explained Variance Ratio of each Principal Component:

PC1: 0.361988
PC2: 0.192075
PC3: 0.111236
PC4: 0.070690
PC5: 0.065633
PC6: 0.049358
PC7: 0.042387
PC8: 0.026807
PC9: 0.022222
PC10: 0.019300
PC11: 0.017368
PC12: 0.012982
PC13: 0.007952


8.  Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.

In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load Wine dataset
X, y = load_wine(return_X_y=True)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 1) KNN on ORIGINAL dataset (with scaling)
knn_original = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])
knn_original.fit(X_train, y_train)
acc_original = knn_original.score(X_test, y_test)

# 2) KNN on PCA-transformed dataset (Top 2 components)
knn_pca2 = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])
knn_pca2.fit(X_train, y_train)
acc_pca2 = knn_pca2.score(X_test, y_test)

print("Accuracy on ORIGINAL dataset (scaled):", acc_original)
print("Accuracy on PCA dataset (Top 2 PCs)   :", acc_pca2)


Accuracy on ORIGINAL dataset (scaled): 0.9722222222222222
Accuracy on PCA dataset (Top 2 PCs)   : 0.9166666666666666


9. Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

In [5]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

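# Split into training and testing sets (80/20 split).
# Note: this split is not stratified, unlike the earlier cells, so the
# accuracy of the scaled Euclidean model differs from question 6.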
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the data (KNN is sensitive to feature scales)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare different distance metrics
metrics = ['euclidean', 'manhattan']

print("KNN Classification Results (k=5):")
print("-" * 35)

for metric in metrics:
    # Initialize and train the KNN classifier
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train_scaled, y_train)

    # Predict and calculate accuracy
    y_pred = knn.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)

    print(f"Metric: {metric:10} | Accuracy: {accuracy:.4f}")


KNN Classification Results (k=5):
-----------------------------------
Metric: euclidean  | Accuracy: 0.9444
Metric: manhattan  | Accuracy: 0.9444


10. You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data



Answer -  In high-dimensional biomedical contexts like gene expression analysis, where features (thousands of genes) vastly outnumber samples (patients), a pipeline combining Principal Component Analysis (PCA) and K-Nearest Neighbors (KNN) is a standard robust solution to combat the "curse of dimensionality" and prevent overfitting.
1. PCA for Dimensionality Reduction
PCA transforms the high-dimensional gene space into a set of uncorrelated linear combinations called Principal Components (PCs). In biomedical data, this filters out technical noise and redundant gene correlations, concentrating the most significant biological variations into the first few components.
2. Deciding Components to Keep
Determining the number of components (\(k\)) is critical for balancing information retention against noise reduction:
* Cumulative Explained Variance: A common threshold is to retain enough components to explain 95% of the total variance.
* Scree Plot (Elbow Method): Plot eigenvalues in descending order and identify the "elbow" point where adding more components yields diminishing returns in variance captured.
* Kaiser Criterion: Retain only components with an eigenvalue greater than 1.0; on standardized data these explain more variance than any single original feature.
3. KNN Classification Post-Reduction
Once reduced, the data is projected into a lower-dimensional space (e.g., from 10,000 genes to 50 PCs). KNN then classifies new patients by measuring the Euclidean distance to their \(K\) nearest neighbors in this "cleaner" PC space. This mitigates the distance-metric breakdown that occurs in high-dimensional spaces.
4. Evaluation Strategy
* Cross-Validation: Use k-fold cross-validation (e.g., 5-fold or 10-fold) to check that the model generalizes despite the small sample size. Fit the scaler and PCA inside each fold so that no information leaks from the held-out data.
* Clinical Metrics: Beyond accuracy, prioritize Sensitivity (identifying true cancer cases) and Specificity (avoiding false alarms), which are vital in biomedical decisions.
* Confusion Matrix: Use this to visualize specific misclassifications between cancer subtypes.
5. Stakeholder Justification
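Robustness against Noise: PCA does not drop individual genes; it down-weights low-variance directions, which often correspond to technical noise, so the model focuses on the dominant patterns of variation across patients. Because high variance is not guaranteed to be biologically relevant, the retained components should still be validated.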
Prevention of Overfitting: By reducing the feature count, we avoid the model "memorizing" specific training samples, making it more reliable for future patients.
Computational Efficiency: PCA-KNN runs significantly faster than deep learning or full-feature models, which is crucial for clinical deployment.
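
The component-selection rules from step 2 can be sketched numerically. Since no gene expression data is available here, the example reuses the explained variance ratios printed for the scaled Wine data in question 7. The Kaiser step approximates each eigenvalue as ratio × number of features, which assumes standardized inputs.

```python
import numpy as np

# Explained variance ratios printed for the scaled Wine data in question 7
evr = np.array([
    0.361988, 0.192075, 0.111236, 0.070690, 0.065633, 0.049358, 0.042387,
    0.026807, 0.022222, 0.019300, 0.017368, 0.012982, 0.007952,
])

# Rule 1: smallest number of components whose cumulative variance reaches 95%
cumulative = np.cumsum(evr)
n_for_95 = int(np.argmax(cumulative >= 0.95)) + 1

# Rule 2 (Kaiser): eigenvalue > 1. For standardized features the eigenvalues
# sum to roughly the number of features, so eigenvalue ~= ratio * n_features
# (an approximation that ignores the small n / (n - 1) factor).
approx_eigenvalues = evr * len(evr)
n_kaiser = int((approx_eigenvalues > 1.0).sum())

print("Components for 95% variance:", n_for_95)
print("Components by Kaiser rule  :", n_kaiser)
```

The two rules can disagree substantially, so in practice the choice is usually confirmed by cross-validated accuracy of the downstream KNN model.
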

In [6]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# 1. Load and Scale Data
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. PCA - Decide components (Explain 95% variance)
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print(f"Original features: {X.shape[1]}")
print(f"Reduced features (PCs): {pca.n_components_}")
print(f"Total variance explained: {sum(pca.explained_variance_ratio_):.4f}")

# 3. Train KNN on PCs
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca, y_train)

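# 4. Evaluate
# Note: the scaler and PCA above were fit on the whole training set, so the
# cross-validation folds are not fully independent and the CV score may be
# slightly optimistic. Wrapping scaler, PCA, and KNN in a Pipeline avoids this.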
cv_scores = cross_val_score(knn, X_train_pca, y_train, cv=5)
test_accuracy = knn.score(X_test_pca, y_test)

print("-" * 30)
print(f"Cross-Validation Mean Accuracy: {cv_scores.mean():.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

Original features: 30
Reduced features (PCs): 10
Total variance explained: 0.9511
------------------------------
Cross-Validation Mean Accuracy: 0.9538
Test Accuracy: 0.9561
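
The cell above fits the scaler and PCA once before cross-validating. A leakage-free variant that also reports sensitivity and specificity could look like the sketch below. It uses the same breast cancer dataset as a stand-in for gene expression data, and the fold count and random seed are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 0 = malignant, 1 = benign

# Scaler and PCA are refit inside every fold, so held-out folds stay unseen
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_pred = cross_val_predict(pipe, X, y, cv=cv)

# Rows = true class, columns = predicted class, in the order [malignant, benign]
cm = confusion_matrix(y, y_pred, labels=[0, 1])
sensitivity = cm[0, 0] / cm[0].sum()  # malignant cases correctly flagged
specificity = cm[1, 1] / cm[1].sum()  # benign cases correctly cleared

print(cm)
print(f"Sensitivity: {sensitivity:.4f} | Specificity: {specificity:.4f}")
```

Here "positive" is taken to mean malignant, which matches the clinical reading of sensitivity as the share of cancer cases that are detected.
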
