# **KNN & PCA Assignment**
---


## Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

### Answer: K-Nearest Neighbors (KNN) is a simple, intuitive supervised machine learning algorithm based on similarity. It is a lazy learner: instead of building an explicit model during training, it stores the entire training dataset and defers all computation to prediction time. When a new data point arrives, the algorithm computes the distance between that point and every training point using a distance measure such as Euclidean distance, then selects the K closest points, known as neighbors. In classification, KNN predicts the class by majority vote among the neighbors' labels; in regression, it predicts a continuous value by taking the average (or distance-weighted average) of the neighbors' target values. KNN therefore uses the same mechanism for both tasks and differs only in the final step: voting for classification and averaging for regression.
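The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN: the function name `knn_predict`, its `task` argument, and the toy data are all made up for this example, and ties in the vote are not handled specially.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point with plain Euclidean-distance KNN."""
    # Distance from the query to every stored training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors' class labels
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: unweighted mean of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy data (invented for illustration): two well-separated clusters in 2-D
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])                    # class labels
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 48.0])  # continuous targets

query = np.array([1.1, 1.0])
predicted_class = knn_predict(X_train, y_class, query, k=3, task="classification")
predicted_value = knn_predict(X_train, y_value, query, k=3, task="regression")
print("Predicted class:", predicted_class)
print("Predicted value:", predicted_value)
```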
---

## Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

### Answer: The **curse of dimensionality** refers to the problems that arise when the number of features (dimensions) in a dataset becomes very large. As dimensionality increases, data points become sparse in the feature space and distances between points lose their discriminating power. This directly affects K-Nearest Neighbors (KNN), which relies on distance calculations to find similar points. In many dimensions, the distances from a query point to its nearest and farthest neighbors become almost the same, so KNN struggles to identify truly “nearest” neighbors. As a result, KNN becomes less accurate, more sensitive to noise from irrelevant features, and more computationally expensive, since each distance calculation involves more features. High-dimensional data can therefore significantly degrade KNN’s performance unless dimensionality reduction or feature selection is applied.
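The loss of contrast between nearest and farthest neighbors can be observed on synthetic data. The sketch below assumes points drawn uniformly at random from the unit hypercube, which is only one convenient setting; the helper `relative_contrast` is a name introduced here for illustration. It reports (farthest − nearest) / nearest, which tends to shrink as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

# Same number of points, increasing dimensionality
contrasts = {d: relative_contrast(1000, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dimensions: relative contrast = {c:.3f}")
```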
---

## Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

### Answer: Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique used to reduce the number of features in a dataset while preserving as much variance (information) as possible. It works by transforming the original correlated features into a new set of uncorrelated variables called principal components, which are ordered such that the first few components capture the maximum variance in the data. PCA changes the original feature space by creating new features that are linear combinations of the original ones, and it does not consider the target variable while performing this transformation.

### PCA is different from feature selection because feature selection does not create new features; instead, it selects a subset of the most important original features based on certain criteria such as correlation, statistical tests, or model-based importance. While PCA focuses on reducing dimensionality by transforming features, feature selection focuses on retaining interpretability by keeping the original features intact.
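The contrast between the two approaches can be made concrete on the Wine dataset. In this sketch, `SelectKBest` with `f_classif` stands in for feature selection in general; it is one of several possible criteria and was chosen here only for brevity. PCA produces two new columns built from all 13 features, while the selector returns two of the original columns unchanged.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# PCA: unsupervised, builds 2 new features as linear combinations of all 13
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("PCA loadings shape (components x original features):", pca.components_.shape)

# Feature selection: supervised in this case (uses y), keeps 2 original columns
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, wine.target)
kept = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", kept)
```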
---

## Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?

### Answer: In Principal Component Analysis (PCA), **eigenvectors and eigenvalues** are derived from the **covariance matrix** of the dataset and play a central role in dimensionality reduction. **Eigenvectors** give the directions (the principal axes) along which the data varies the most, while **eigenvalues** give the amount of variance captured along each of those directions. A larger eigenvalue means that its eigenvector explains more of the variance in the data. PCA ranks eigenvectors in descending order of their eigenvalues and projects the data onto the top ones to form a lower-dimensional representation. Because the top eigenvectors retain the most variance for a given number of dimensions, eigenvalues and eigenvectors are what identify the most informative directions in the data.
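The link between the covariance matrix and PCA can be checked numerically. The sketch below performs the eigendecomposition by hand with NumPy on the standardized Wine data and compares the result with scikit-learn's `PCA`. scikit-learn itself may use an SVD-based solver rather than this explicit route, so the code illustrates the mathematics, not the library's internals.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_scaled, rowvar=False)

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]      # re-sort in descending order
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]      # columns are the principal axes

# Share of total variance carried by each direction
explained_ratio = eigenvalues / eigenvalues.sum()

# Project the (already centered) data onto the top-2 eigenvectors
X_projected = X_scaled @ eigenvectors[:, :2]

pca = PCA().fit(X_scaled)
print("Ratios match sklearn:", np.allclose(explained_ratio, pca.explained_variance_ratio_))
```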
---

## Question 5: How do KNN and PCA complement each other when applied in a single pipeline?

### Answer: **How KNN and PCA complement each other in a single pipeline:**

* PCA reduces the number of features by transforming high-dimensional data into a lower-dimensional space.
* By reducing dimensions, PCA helps overcome the **curse of dimensionality**, which negatively affects KNN.
* KNN relies on distance calculations, and PCA can make these distances more meaningful by discarding low-variance directions, which often carry noise, and by decorrelating redundant features.
* PCA improves computational efficiency: KNN still compares a query against every training point, but each distance is computed over fewer dimensions.
* Applying PCA before KNN can improve the **accuracy and generalization** of the KNN model, although the gain depends on the dataset.
* PCA helps KNN perform better on datasets with many correlated features.
* Together, PCA (as preprocessing) and KNN (as a classifier/regressor) form an efficient and effective machine learning pipeline.
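
One way to combine the two steps is scikit-learn's `Pipeline`, sketched below on the Wine dataset. The step names (`"scale"`, `"pca"`, `"knn"`) and the choice of 2 components and 5 neighbors are illustrative assumptions. Fitting the pipeline on the training split ensures that the scaler and PCA learn their parameters from training data only.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scaling -> PCA -> KNN, applied in order at both fit and predict time
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipeline.fit(X_train, y_train)

test_accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
```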
---

## Dataset:
Use the Wine dataset from `sklearn.datasets.load_wine()`.

## Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.
(Include your Python code and output in the code box below.)

In [1]:
```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# ----------- KNN WITHOUT FEATURE SCALING -----------
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)

y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

print("Accuracy without feature scaling:", accuracy_no_scaling)

# ----------- KNN WITH FEATURE SCALING -----------
# Fit the scaler on the training split only, then apply it to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)

y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

print("Accuracy with feature scaling:", accuracy_scaled)
```

```text
Accuracy without feature scaling: 0.7222222222222222
Accuracy with feature scaling: 0.9444444444444444
```


## Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.
(Include your Python code and output in the code box below.)

In [2]:
```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load Wine dataset
wine = load_wine()
X = wine.data

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (all 13 components are kept when n_components is not set)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Print explained variance ratio
explained_variance = pca.explained_variance_ratio_

for i, var in enumerate(explained_variance, start=1):
    print(f"Principal Component {i}: {var:.4f}")
```

```text
Principal Component 1: 0.3620
Principal Component 2: 0.1921
Principal Component 3: 0.1112
Principal Component 4: 0.0707
Principal Component 5: 0.0656
Principal Component 6: 0.0494
Principal Component 7: 0.0424
Principal Component 8: 0.0268
Principal Component 9: 0.0222
Principal Component 10: 0.0193
Principal Component 11: 0.0174
Principal Component 12: 0.0130
Principal Component 13: 0.0080
```


## Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.
(Include your Python code and output in the code box below.)

In [3]:
```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# ----------- KNN on ORIGINAL DATA (with scaling) -----------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)

accuracy_original = accuracy_score(y_test, y_pred_original)

print("Accuracy on original dataset:", accuracy_original)

# ----------- KNN on PCA-TRANSFORMED DATA (Top 2 components) -----------
# Fit PCA on the scaled training split only, then project the test split
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)

accuracy_pca = accuracy_score(y_test, y_pred_pca)

print("Accuracy on PCA-transformed dataset:", accuracy_pca)
```

```text
Accuracy on original dataset: 0.9444444444444444
Accuracy on PCA-transformed dataset: 0.9444444444444444
```


## Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.
(Include your Python code and output in the code box below.)

In [4]:
```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# 2. Split into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Scale the data (crucial for distance-based KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Train and evaluate KNN with Euclidean distance (equivalent to Minkowski with p=2)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euc = knn_euclidean.predict(X_test_scaled)
acc_euc = accuracy_score(y_test, y_pred_euc)

# 5. Train and evaluate KNN with Manhattan distance (equivalent to Minkowski with p=1)
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_man = knn_manhattan.predict(X_test_scaled)
acc_man = accuracy_score(y_test, y_pred_man)

# Output results
print(f"KNN Accuracy (Euclidean): {acc_euc:.4f}")
print(f"KNN Accuracy (Manhattan): {acc_man:.4f}")
```

```text
KNN Accuracy (Euclidean): 0.9444
KNN Accuracy (Manhattan): 0.9444
```


## Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models overfit.
Explain how you would:

* Use PCA to reduce dimensionality
* Decide how many components to keep
* Use KNN for classification post-dimensionality reduction
* Evaluate the model
* Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

(Include your Python code and output in the code box below.)


### Answer: 1. Using PCA to Reduce Dimensionality

In gene expression data:

* Features (genes) ≫ Samples (patients)
* Many genes are correlated and noisy
* High dimensionality causes **overfitting** and poor generalization

**PCA (Principal Component Analysis)**:

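* Transforms the original features into a smaller set of **orthogonal components**
* Keeps the directions of maximum variance, which are assumed to carry most of the biological signal
* Discards low-variance directions, which reduces noise and redundancy among correlated genes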

**Steps**:

1. Standardize features (important for PCA)
2. Apply PCA
3. Project data into lower-dimensional space

### 2. Deciding How Many Components to Keep

We use **Explained Variance Ratio**:

* Choose the minimum number of components that explain **90–95% of the variance**
* This balances information retention and dimensionality reduction

Methods:

* Cumulative explained variance plot
* `n_components=0.95` in PCA (automatic selection)
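
The selection rule above can be sketched as follows. The breast cancer dataset is used here only as a stand-in for gene expression data, and the 95% threshold is an assumed choice. The manual count from the cumulative sum is compared with what `PCA(n_components=0.95)` selects; in a real workflow the fit would be done on the training split only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit PCA with all components and accumulate the explained variance ratios
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.95  # assumed target; 0.90 is another common choice
# First index where the running total reaches the threshold, converted to a count
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(f"Components needed for {threshold:.0%} of the variance: {n_keep}")

# Passing a float in (0, 1) asks PCA to pick the number of components itself
pca_auto = PCA(n_components=threshold).fit(X_scaled)
print("Components chosen by PCA(n_components=0.95):", pca_auto.n_components_)
```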

### 3. Using KNN After PCA

Why **KNN + PCA** works well here:

* KNN suffers in high dimensions (curse of dimensionality)
* PCA creates compact, informative feature space
* Distance calculations become meaningful

Pipeline:

* PCA → KNN
* Use small `k` (e.g., 3–7) to avoid over-smoothing
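
Rather than fixing `k` by hand, it can be tuned with cross-validation inside the pipeline. The sketch below is one possible setup: the candidate values of `k`, the 5-fold stratified split, and the breast cancer dataset (a proxy for gene expression data) are all assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier()),
])

# Odd values of k avoid tied votes in a two-class problem
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scaling and PCA are refit inside every fold, so no fold sees its own test data
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best k:", search.best_params_["knn__n_neighbors"])
print(f"Mean cross-validated accuracy: {search.best_score_:.4f}")
```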

### 4. Model Evaluation

We evaluate using:

* **Accuracy**
* **Confusion Matrix**
* **Classification Report**
* **Cross-validation (optional but recommended)**
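
With few samples, a single train-test split can give an unstable estimate, so the metrics above are often computed from cross-validated predictions. The following is a sketch under assumed settings (5 stratified folds, `k=5`, the breast cancer dataset as a proxy): each sample is predicted by a model that never saw it during training, and the confusion matrix is built from those out-of-fold predictions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaler and PCA are refit on the training part of every fold
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    KNeighborsClassifier(n_neighbors=5),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_pred_cv = cross_val_predict(model, X, y, cv=cv)  # out-of-fold predictions

cm = confusion_matrix(y, y_pred_cv)  # rows: true class, columns: predicted class
print("Confusion matrix:\n", cm)
print(f"Cross-validated accuracy: {accuracy_score(y, y_pred_cv):.4f}")
```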

### 5. Justifying This Pipeline to Stakeholders (Biomedical Context)

**Why this is robust for real-world biomedical data:**

* Reduces overfitting with small sample sizes
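* Summarizes dominant patterns of co-varying genes in a few components, although individual components are harder to interpret than single genes
* Makes distance-based KNN more reliable by working in a compact feature space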
* Computationally efficient and reproducible
* Commonly used in genomics and bioinformatics research

> *“This pipeline balances biological signal preservation with statistical robustness, making it suitable for high-dimensional, low-sample biomedical datasets.”*


In [5]:
```python
# Import libraries
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load dataset (proxy for gene expression data)
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardization (fit on the training split only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# PCA (retain 95% variance)
pca = PCA(n_components=0.95, random_state=42)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print("Original features:", X.shape[1])
print("Reduced features after PCA:", X_train_pca.shape[1])

# KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca, y_train)

# Predictions
y_pred = knn.predict(X_test_pca)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
```

```text
Original features: 30
Reduced features after PCA: 10

Accuracy: 0.956140350877193

Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.93      0.94        43
           1       0.96      0.97      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114
```

