Assignment Code: DA-AG-016
# KNN & PCA | Assignment

**Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?**

Ans:-

**What is K-Nearest Neighbors (KNN)?**

   - K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression tasks.
   - It is a non-parametric algorithm (it does not assume any particular data distribution) and an instance-based, or lazy, learner: it stores the training data and defers all computation to prediction time.

 **How KNN Works in Classification and Regression**

  1. **Choose K** (the number of neighbors, e.g., K = 3 or 5).

  2. **Calculate the distance** between the new data point and all training points (commonly Euclidean distance, but Manhattan, Minkowski, or cosine distance can also be used).

  3. **Find the K nearest neighbors** (training points with the smallest distances).

  4. **Make prediction:**

   - **Classification** → assign the class that is most frequent among the K neighbors (majority voting).

   - **Regression** → take the average (or weighted average, where closer neighbors have more weight) of the K neighbors’ values.
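
The four steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN: the function name `knn_predict`, its `task` parameter, and the toy data are all made up for this example, and it assumes Euclidean distance with unweighted neighbors.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one point x_new using its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_values = [y_train[i] for i in nearest]
    # Step 4: majority vote (classification) or plain mean (regression)
    if task == "classification":
        return Counter(neighbor_values).most_common(1)[0][0]
    return float(np.mean(neighbor_values))

# Tiny hypothetical dataset: two well-separated clusters in 2-D
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
labels = ["A", "A", "A", "B", "B", "B"]          # class labels
targets = [10.0, 12.0, 11.0, 50.0, 52.0, 48.0]   # numeric targets

pred_class = knn_predict(X_toy, labels, [1.1, 1.0], k=3, task="classification")
pred_value = knn_predict(X_toy, targets, [1.1, 1.0], k=3, task="regression")
print(pred_class, pred_value)
```

The same neighbor search serves both tasks; only the final aggregation step differs.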

**Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?**

Ans:-

 **The Curse of Dimensionality**

   - The curse of dimensionality refers to the various problems that arise when working with high-dimensional data (data with a large number of features).

**How it Affects KNN Performance**

**1. Distance Becomes Less Informative**

  - In high dimensions, the distances from a query point to its nearest and farthest neighbors tend to become almost the same.

  - This makes it hard for KNN to distinguish between "close" and "far" neighbors.

 **2. Increased Computational Cost**

  - More dimensions → each distance calculation touches more features → slower predictions, since KNN compares the query against every training point.

 **3. Risk of Overfitting**

  - High-dimensional data is sparse, so KNN may overfit to noise instead of finding meaningful neighbors.

 **4. Feature Irrelevance**

  - If many features are irrelevant, they can dilute the impact of useful features when computing distance.
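
The first effect (distances becoming less informative) can be illustrated with a small simulation. The sketch below assumes points drawn uniformly at random from the unit hypercube, which is a simplification of real data; the helper `relative_contrast` is a hypothetical name introduced here. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

# The contrast shrinks as the number of dimensions grows
contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dimensions: relative contrast = {c:.2f}")
```

When the contrast approaches zero, the "nearest" neighbors are barely nearer than any other point, so a vote among them carries little information.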

**Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?**

Ans:-

**Principal Component Analysis (PCA)**

  - Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms high-dimensional data into a smaller set of new features (called principal components) while retaining as much variance (information) as possible.

  **How PCA is Different from Feature Selection**


| Aspect               | **PCA (Dimensionality Reduction)**                                                             | **Feature Selection**                                                          |
| -------------------- | ---------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
| **Definition**       | Creates new features (principal components) that are linear combinations of original features. | Selects a subset of the original features, removing irrelevant/redundant ones. |
| **Approach**         | **Transforms** features into a new space.                                                      | **Keeps** the most important original features.                                |
| **Interpretability** | Components are harder to interpret (since they are mixes of original features).                | Easy to interpret (you know exactly which features are used).                  |
| **Goal**             | Reduce dimensionality while preserving maximum variance.                                       | Reduce dimensionality by discarding irrelevant or redundant features.          |
| **Type**             | **Feature extraction** technique.                                                              | **Feature selection** technique.                                               |
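
The contrast in the table can be seen directly on the Wine dataset. This is a hedged sketch: `SelectKBest` with the ANOVA F-score (`f_classif`) stands in for feature selection in general, and keeping three features or components is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature extraction: each component is a weighted mix of all 13 original features
pca = PCA(n_components=3).fit(X_scaled)
X_pca = pca.transform(X_scaled)
print("PCA output shape:", X_pca.shape)
print("Weights of PC1 on the original features:", pca.components_[0].round(2))

# Feature selection: keep 3 of the original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, y)
kept = [name for name, keep in zip(wine.feature_names, selector.get_support()) if keep]
X_selected = selector.transform(X_scaled)
print("Selected original features:", kept)
```

Both outputs have three columns, but only the selected features keep their original names and meaning.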


**Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?**

Ans:-

**Eigenvalues and Eigenvectors (in PCA context)**

**1. Eigenvectors**

  - Eigenvectors of the data's covariance matrix: directions (axes) in the feature space, ordered so that the first one is the direction along which the data varies the most.

  - In PCA, these eigenvectors are the principal components (new axes).

  - They define the orientation of the new feature space.

**2. Eigenvalues**

  - Numbers that tell us how much variance (information) is captured by each eigenvector (principal component).

  - A larger eigenvalue means that direction (eigenvector) explains more variance in the data.

  **Why They Are Important in PCA**

  Imagine shining a flashlight on a 3D object to create a shadow on the wall (a 2D projection):

  - The **eigenvectors** set the orientation of the wall, i.e., the axes onto which the data is projected.

  - The **eigenvalues** tell us how much of the object’s shape is preserved in each projection.
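
To make the link concrete, the eigenvalues and eigenvectors can be computed by hand from the covariance matrix and compared with scikit-learn's `PCA`. This is a sketch on the standardized Wine data; `np.linalg.eigh` is used because a covariance matrix is symmetric.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_std, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder largest first
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # columns are the principal directions

# Share of the total variance captured by each direction
explained_ratio = eigenvalues / eigenvalues.sum()
print("Top 3 explained variance ratios:", explained_ratio[:3].round(4))

# Cross-check against scikit-learn's PCA
pca = PCA().fit(X_std)
print("Matches sklearn:", np.allclose(explained_ratio, pca.explained_variance_ratio_))
```

Sorting by eigenvalue is what lets PCA decide which directions to keep: the components with the largest eigenvalues retain the most variance.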

**Question 5: How do KNN and PCA complement each other when applied in a single pipeline?**


Ans:-

**How PCA Helps KNN**

 PCA can be applied before KNN to improve performance:

 **1. Dimensionality Reduction**

 - PCA reduces the number of features while keeping most of the variance.

 - This makes distance calculations in KNN more reliable.

 **2. Noise Removal**

 - Dropping components with very small eigenvalues removes low-variance directions, which are often (though not always) noise.

 - Cleaner data → KNN neighbors become more meaningful.

**3. Faster Computation**

 - Fewer dimensions = cheaper distance calculations, since each distance is computed over fewer features.

 - KNN becomes more scalable.

**4. Better Generalization**

 - By removing irrelevant/weak components, PCA reduces overfitting risk in KNN.
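
One way to combine the two is scikit-learn's `Pipeline`, which chains scaling, PCA, and KNN into a single estimator. The sketch below uses the Wine dataset; the step names and the choice of 5 components are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scaling -> PCA -> KNN, chained so every step is fit on the training data only
pca_knn = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),          # 5 is an arbitrary illustrative choice
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pca_knn.fit(X_train, y_train)
test_accuracy = pca_knn.score(X_test, y_test)

print("Features seen by KNN:", pca_knn.named_steps["pca"].n_components_)
print("Test accuracy:", test_accuracy)
```

Calling `score` on the pipeline applies the already-fitted scaler and PCA to the test set before KNN predicts, so the test data never influences preprocessing.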

In [None]:
# Dataset: sklearn.datasets.load_wine()
from sklearn.datasets import load_wine

**Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.**


In [2]:
# Ans:- KNN on Wine dataset with and without scaling

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# ---------------- Without Scaling ----------------
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_no_scaling = knn.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# ---------------- With Scaling ----------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# Print results
print("KNN Accuracy without Scaling:", acc_no_scaling)
print("KNN Accuracy with Scaling   :", acc_scaled)


KNN Accuracy without Scaling: 0.7222222222222222
KNN Accuracy with Scaling   : 0.9444444444444444


**Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.**


In [3]:
# Ans:- PCA on Wine dataset - Explained Variance Ratio

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Step 1: Standardize features (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Apply PCA (keep all components for analysis)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Step 3: Print explained variance ratio
print("Explained Variance Ratio of each Principal Component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f}")


Explained Variance Ratio of each Principal Component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080


**Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.**

In [4]:
# Ans:- KNN on PCA-transformed Wine dataset vs Original dataset

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Split into train-test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# ---------------- Original Dataset with Scaling ----------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
acc_original = accuracy_score(y_test, y_pred_original)

# ---------------- PCA-transformed Dataset (Top 2 PCs) ----------------
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

# Print results
print("KNN Accuracy on Original Dataset:", acc_original)
print("KNN Accuracy on PCA (2 components):", acc_pca)


KNN Accuracy on Original Dataset: 0.9444444444444444
KNN Accuracy on PCA (2 components): 0.9444444444444444


**Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.
(Include your Python code and output in the code box below.)**

In [5]:
# Ans:- KNN with different distance metrics on Wine dataset

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ---------------- KNN with Euclidean Distance ----------------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# ---------------- KNN with Manhattan Distance ----------------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Print results
print("KNN Accuracy with Euclidean Distance:", acc_euclidean)
print("KNN Accuracy with Manhattan Distance:", acc_manhattan)


KNN Accuracy with Euclidean Distance: 0.9444444444444444
KNN Accuracy with Manhattan Distance: 0.9814814814814815


**Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.**

**Due to the large number of features and a small number of samples, traditional models overfit.**

Explain how you would:

  - Use PCA to reduce dimensionality
  - Decide how many components to keep
  - Use KNN for classification post-dimensionality reduction
  - Evaluate the model
  - Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data


Ans:-

**Step 1: Use PCA to Reduce Dimensionality**

  - Gene expression datasets often have thousands of features (genes) but few samples (patients).

 - PCA helps by projecting the data into fewer principal components, capturing most of the variance while removing noise.

 **Step 2: Decide How Many Components to Keep**

 - Use the explained variance ratio (scree plot or cumulative variance).

 - Choose the smallest number of components that explains, say, 90–95% of the variance.

 - This balances information retention vs. dimensionality reduction.
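
The cumulative-variance rule can be sketched as follows, again with the Wine dataset standing in for gene expression data. The 95% threshold is an assumed target for illustration; the variable names are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)
pca_full = PCA().fit(X_scaled)

# Running total of the variance explained by the first 1, 2, 3, ... components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.95  # illustrative target; 0.90 is equally defensible
# Index of the first running total that reaches the threshold, converted to a count
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(f"Components needed for {threshold:.0%} variance: {n_keep}")
```

Passing a float such as `PCA(n_components=0.95)` asks scikit-learn to apply the same kind of rule internally.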

 **Step 3: Apply KNN Post-PCA**

 - Run KNN on PCA-transformed data.

 - Tune K (neighbors) using cross-validation.

 - Since PCA discards low-variance directions (often noise), distances between samples tend to be more meaningful.

 **Step 4: Evaluate the Model**

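 - Use Stratified k-Fold Cross-Validation (important because the dataset is small and the classes may be imbalanced), fitting the scaler and PCA inside each training fold so the validation fold does not leak into preprocessing.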

 - Metrics: Accuracy, F1-score, Confusion Matrix (since cancer types may be imbalanced).
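
These metrics can be computed from out-of-fold predictions. The sketch below assumes the Wine dataset as a stand-in and uses `cross_val_predict`, so each sample is scored by a model that did not see it during training; macro-averaged F1 is one reasonable choice when classes are imbalanced, not the only one.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
model = make_pipeline(
    StandardScaler(), PCA(n_components=0.95), KNeighborsClassifier(n_neighbors=5)
)

# Every sample is predicted exactly once, by a model trained on the other folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_pred = cross_val_predict(model, X, y, cv=cv)

accuracy = accuracy_score(y, y_pred)
macro_f1 = f1_score(y, y_pred, average="macro")   # weights every class equally
matrix = confusion_matrix(y, y_pred)               # rows = true class, columns = predicted
print("Accuracy:", round(accuracy, 3))
print("Macro F1:", round(macro_f1, 3))
print(matrix)
```

The confusion matrix shows which classes are mistaken for which, something a single accuracy figure hides.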

 **Step 5: Justification to Stakeholders**

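 - PCA reduces the risk of overfitting by compressing many correlated, noisy features into a few components.

 - KNN is simple, interpretable, and distance-based, which makes sense for gene similarity: a prediction can be explained by pointing to the most similar known patients.

 - With preprocessing fit inside cross-validation, the reported performance is a realistic estimate for unseen patients, which matters in biomedical research where sample sizes are small.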

 - Can identify patterns in gene expression without requiring complex black-box models.

**Python Code (Example with Wine dataset as proxy for gene expression)**

In [6]:
# PCA + KNN pipeline for high-dimensional biomedical data (example with Wine dataset)

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Load dataset (using Wine dataset as an example proxy)
wine = load_wine()
X, y = wine.data, wine.target

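# Standardize features
# Note: fitting the scaler and PCA on all samples before cross-validation lets
# the validation folds influence preprocessing; wrapping both steps in a
# Pipeline avoids this leakage.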
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (retain 95% variance)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print("Original features:", X.shape[1])
print("Reduced features (PCA):", X_pca.shape[1])

# Train KNN classifier on PCA-transformed data
knn = KNeighborsClassifier(n_neighbors=5)

# Evaluate with Stratified 5-Fold Cross-Validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(knn, X_pca, y, cv=cv, scoring='accuracy')

print("\nCross-validation accuracies:", scores)
print("Mean Accuracy:", np.mean(scores))


Original features: 13
Reduced features (PCA): 10

Cross-validation accuracies: [0.97222222 0.97222222 0.97222222 0.94285714 0.97142857]
Mean Accuracy: 0.9661904761904762
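
A variant that tunes K and the number of components together, with all preprocessing refit inside each fold, might look like the sketch below. The search grid is a hypothetical example rather than a recommendation, and the step names are chosen for this illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical search grid; the values are illustrative only
param_grid = {
    "pca__n_components": [2, 5, 10],
    "knn__n_neighbors": [3, 5, 7],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)   # the scaler and PCA are refit on each training fold

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```

Because the best score is selected on the same folds used for tuning, it is somewhat optimistic; a held-out test set or nested cross-validation would give a stricter estimate.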
