# KNN & PCA


## **Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?**

**Answer:**

K-Nearest Neighbors (KNN) is a **supervised machine learning algorithm** that can be used for both classification and regression tasks.

* **How it works (Classification):**

  1. Given a test data point, KNN calculates the distance (e.g., Euclidean) between the test point and all training points.
  2. It selects the **k nearest neighbors** based on the smallest distances.
  3. The majority class among these neighbors is assigned as the predicted class.

* **How it works (Regression):**

  1. The same distance calculation is done.
  2. The average (or weighted average) of the target values of the k nearest neighbors is taken as the predicted value.
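
The steps above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how a library implements KNN; the function name `knn_predict`, its `task` parameter, and the toy data are hypothetical and chosen only for this example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point with brute-force KNN (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Step 1: distance from the query to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Step 3a: majority vote (on a tie, the smallest label wins)
        labels, counts = np.unique(neighbor_targets, return_counts=True)
        return labels[np.argmax(counts)]
    # Step 3b: unweighted mean of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy data (made up for illustration)
X_toy = [[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]]
y_class = [0, 0, 1, 1]
y_value = [10.0, 12.0, 30.0, 34.0]

pred_class = knn_predict(X_toy, y_class, [1.2, 1.4], k=3)
pred_value = knn_predict(X_toy, y_value, [1.2, 1.4], k=3, task="regression")
print(pred_class, pred_value)
```

With `k=3` the query's neighbors are the first three toy points, so the vote is two to one for class 0 and the regression output is the mean of 10, 12, and 30.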

**Key Points:**

* Simple and non-parametric.
* Sensitive to feature scaling.
* Performance depends on the choice of **k** and distance metric.



## **Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?**

**Answer:**

* The **Curse of Dimensionality** refers to the problems that arise when data has **many features (a high-dimensional space)**.
* As the number of dimensions increases:

  * Distances between points become less informative.
  * The nearest and farthest neighbors of a point end up at nearly the same distance, which reduces KNN accuracy.
  * The amount of data needed to cover the space adequately grows rapidly.

**Impact on KNN:**

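* KNN relies on distance measures. In high dimensions, the relative difference between the nearest and farthest distances becomes very small, so the "nearest" neighbors are barely closer than any other point.
* This tends to degrade classification or regression results unless the dimensionality is reduced first, for example with PCA or feature selection.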
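
The shrinking gap between near and far points can be observed on random data. The sketch below assumes points drawn uniformly from the unit hypercube; the helper `relative_contrast` and the chosen dimensions are illustrative, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))   # uniform samples in [0, 1)^n_dims
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

contrasts = {dims: relative_contrast(dims) for dims in (2, 10, 100, 1000)}
for dims, contrast in contrasts.items():
    print(f"{dims:>5} dims: relative contrast = {contrast:.2f}")
```

A large contrast means the nearest point is much closer than the farthest one; as the dimension grows the value falls toward zero, which is the situation in which a nearest-neighbor vote carries little information.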



## **Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?**

**Answer:**

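* **PCA** is an **unsupervised dimensionality reduction technique** that transforms data into a new set of orthogonal (uncorrelated) features called **principal components**.
* The first principal component captures the direction of **maximum variance** in the data; each subsequent component captures the maximum remaining variance while staying orthogonal to the earlier ones.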
* **Difference from Feature Selection:**

  * PCA creates **new features** (linear combinations of original features).
  * Feature selection chooses a **subset of original features** without transformation.
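
The contrast can be made concrete on the Wine dataset. In this sketch, `SelectKBest` with `f_classif` stands in for feature selection; that choice (a supervised, univariate selector) and `k=2` are assumptions made for illustration, and other selectors would serve equally well.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)

# Feature extraction: each new column is a weighted mix of all 13 originals
pca = PCA(n_components=2).fit(X_std)
print("PCA loadings shape:", pca.components_.shape)  # one weight per original feature

# Feature selection: keep 2 of the original columns, values unchanged
selector = SelectKBest(score_func=f_classif, k=2).fit(X_std, wine.target)
selected_names = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", selected_names)
```

Each row of `pca.components_` holds 13 weights, one per original feature, while the selector simply returns the names of two existing columns.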



## **Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?**

**Answer:**

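* **Eigenvectors:** Eigenvectors of the data's covariance matrix. They give the directions (principal axes) of the new coordinate system, with the first one pointing along the direction of maximum variance.
* **Eigenvalues:** The amount of variance along each corresponding eigenvector.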
* **Importance:**

  * Eigenvectors define the principal components.
  * Eigenvalues tell how much variance is captured by each component, helping to decide **how many components to retain**.
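
A small NumPy sketch of how these quantities come out of the covariance matrix. The synthetic three-feature data and all variable names are assumptions made for illustration; library PCA implementations typically use an SVD instead, which gives the same components up to sign.

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy data: 200 samples, 3 features, the first two strongly correlated
latent = rng.normal(size=(200, 1))
X_toy = np.hstack([latent + 0.1 * rng.normal(size=(200, 1)),
                   2 * latent + 0.1 * rng.normal(size=(200, 1)),
                   rng.normal(size=(200, 1))])

X_centered = X_toy - X_toy.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)           # 3x3 covariance matrix

# eigh is intended for symmetric matrices; eigenvalues come back ascending
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]            # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()
X_projected = X_centered @ eigenvectors[:, :2]   # keep the top 2 components
print("Explained variance ratio:", np.round(explained_ratio, 3))
```

The variance of each projected column equals the matching eigenvalue, which is why the normalized eigenvalues can be read as the share of variance each component explains.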





## **Question 5: How do KNN and PCA complement each other when applied in a single pipeline?**

**Answer:**

* PCA reduces **dimensionality** and can remove **noise/redundancy**, which often improves KNN performance on high-dimensional datasets.
* KNN benefits from PCA because:

  * Fewer dimensions make distances more meaningful.
  * Prediction becomes faster, since each distance is computed over fewer features.
  * The risk of overfitting drops in small datasets with many features.
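
One way to combine the two is scikit-learn's `Pipeline`, sketched here on the Wine dataset. The step names, `n_components=2`, `n_neighbors=5`, and the stratified split are choices made for this example. Because the scaler and PCA sit inside the pipeline, they are fitted on the training rows only and merely applied to the test rows.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_wine, y_wine, test_size=0.2, random_state=42, stratify=y_wine
)

pipe = Pipeline([
    ("scale", StandardScaler()),                   # put features on a common scale
    ("pca", PCA(n_components=2)),                  # project onto 2 components
    ("knn", KNeighborsClassifier(n_neighbors=5)),  # classify in the reduced space
])
pipe.fit(X_tr, y_tr)                 # every step is fitted on training data only
test_accuracy = pipe.score(X_te, y_te)
print("Pipeline test accuracy:", test_accuracy)
```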

In [1]:
from sklearn.datasets import load_wine
import pandas as pd

wine = load_wine()
X = wine.data
y = wine.target
df = pd.DataFrame(X, columns=wine.feature_names)
df['target'] = y
df.head()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


## **Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy.**


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred)
print("Accuracy without scaling:", acc_no_scaling)

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_with_scaling = accuracy_score(y_test, y_pred_scaled)
print("Accuracy with scaling:", acc_with_scaling)

Accuracy without scaling: 0.7222222222222222
Accuracy with scaling: 0.9444444444444444
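
The gap between the two scores is consistent with how differently the features are scaled: `proline` takes values in the hundreds or thousands, while `hue` stays near 1, so unscaled Euclidean distances are driven almost entirely by `proline`. A quick check of the per-feature standard deviations (a sketch; variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
stds = wine.data.std(axis=0)              # spread of each raw feature
order = np.argsort(stds)[::-1]            # widest spread first
for i in order[:3]:
    print(f"{wine.feature_names[i]:<30} std = {stds[i]:.2f}")
largest_scale_feature = wine.feature_names[order[0]]
```

After `StandardScaler`, every feature has unit variance and contributes comparably to the distance.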


## **Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio.**


In [3]:
from sklearn.decomposition import PCA

# Scale before PCA
X_scaled = StandardScaler().fit_transform(X)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Explained variance ratio
explained_variance = pca.explained_variance_ratio_
for i, var in enumerate(explained_variance):
    print(f"PC{i+1}: {var:.4f}")

PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080
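
The cumulative sum of these ratios shows how many components are needed to reach a target share of the variance. The sketch below reuses the rounded values printed above rather than refitting PCA; the 95% threshold is an example choice.

```python
import numpy as np

# Ratios copied from the printed output (rounded to 4 decimals)
ratios = np.array([0.3620, 0.1921, 0.1112, 0.0707, 0.0656, 0.0494, 0.0424,
                   0.0268, 0.0222, 0.0193, 0.0174, 0.0130, 0.0080])
cumulative = np.cumsum(ratios)
# Smallest number of components whose cumulative ratio reaches the threshold
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("Variance captured by PC1 + PC2:", round(float(cumulative[1]), 4))  # 0.5541
print("Components needed for 95%:", n_for_95)                             # 10
```

The first two components together explain about 55% of the variance, and ten of the thirteen are needed to pass 95%.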



## **Question 8: Train KNN on PCA-transformed dataset (top 2 components) and compare accuracy.**

In [5]:
pca_2 = PCA(n_components=2)
X_pca2 = pca_2.fit_transform(X_scaled)

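# Train-test split
# Note: X_scaled and pca_2 were fitted on all rows before this split, so the
# test rows influenced the transform and the accuracy below may be optimistic.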
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca2, y, test_size=0.2, random_state=42)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train_pca)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test_pca, y_pred_pca)
print("Accuracy with top 2 PCA components:", accuracy_pca)

Accuracy with top 2 PCA components: 1.0


## **Question 9: Train KNN with different distance metrics (euclidean, manhattan) on scaled data.**


In [6]:
metrics = ['euclidean', 'manhattan']
for metric in metrics:
    knn_metric = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn_metric.fit(X_train_scaled, y_train)
    y_pred_metric = knn_metric.predict(X_test_scaled)
    print(f"Accuracy with {metric} distance:", accuracy_score(y_test, y_pred_metric))


Accuracy with euclidean distance: 0.9444444444444444
Accuracy with manhattan distance: 0.9444444444444444
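
For reference, the two metrics differ only in how per-feature differences are combined. A minimal NumPy sketch with two made-up points:

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line (L2) distance
manhattan = np.sum(np.abs(a - b))           # sum of per-axis differences (L1)
print(euclidean, manhattan)  # 5.0 7.0
```

The two metrics can rank neighbors differently, but on this particular split they produced the same accuracy.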


## **Question 10: High-dimensional gene expression dataset pipeline explanation**

**Answer:**

1. **Use PCA to reduce dimensionality:**

   * Scale features and apply PCA.
   * Reduce noise and remove redundant features.

2. **Decide number of components:**

   * Keep components explaining ~95% of variance (`explained_variance_ratio_`).

3. **Use KNN post-dimensionality reduction:**

   * Train KNN on PCA-transformed data for classification.

4. **Evaluate model:**

   * Use train-test split, accuracy, confusion matrix, or cross-validation.

5. **Justify pipeline:**

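   * PCA can reduce the risk of overfitting when features far outnumber samples.
   * KNN performs better when distances are meaningful, which is more likely in the reduced space.
   * The pipeline is simple to apply to biomedical datasets, though principal components mix many genes and are harder to interpret than individual features.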

In [7]:
from sklearn.datasets import make_classification
X_gene, y_gene = make_classification(n_samples=100, n_features=500, n_informative=50, n_classes=3, random_state=42)

# Scale and PCA
X_gene_scaled = StandardScaler().fit_transform(X_gene)
pca_gene = PCA(n_components=0.95)  # retain 95% variance
X_gene_pca = pca_gene.fit_transform(X_gene_scaled)
print("Original shape:", X_gene.shape, "Reduced shape:", X_gene_pca.shape)

# Train KNN
X_train_gene, X_test_gene, y_train_gene, y_test_gene = train_test_split(X_gene_pca, y_gene, test_size=0.2, random_state=42)
knn_gene = KNeighborsClassifier(n_neighbors=5)
knn_gene.fit(X_train_gene, y_train_gene)
y_pred_gene = knn_gene.predict(X_test_gene)
print("Accuracy on gene dataset:", accuracy_score(y_test_gene, y_pred_gene))

Original shape: (100, 500) Reduced shape: (100, 87)
Accuracy on gene dataset: 0.25
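
An accuracy of 0.25 is below the roughly 1/3 expected from guessing among three balanced classes, and it rests on only 20 test samples, so a single split says little here. The scaler and PCA above were also fitted on all 100 rows before splitting. A sketch of a steadier estimate, assuming 5-fold stratified cross-validation with the preprocessing refitted inside each fold (the fold count and seeds are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic stand-in for gene expression data as in the cell above
X_syn, y_syn = make_classification(n_samples=100, n_features=500, n_informative=50,
                                   n_classes=3, random_state=42)

# Scaling and PCA are refitted on the training part of every fold
model = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                      KNeighborsClassifier(n_neighbors=5))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_syn, y_syn, cv=cv, scoring="accuracy")
print("Fold accuracies:", scores.round(2))
print("Mean accuracy:", scores.mean().round(3))
```

If the mean stays near chance level, that would suggest the 95% variance rule keeps too many uninformative directions for KNN on this data, and fewer components or a different `n_neighbors` might be worth trying.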
