Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

Answer:

Definition:
K-Nearest Neighbors (KNN) is a non-parametric, instance-based machine learning algorithm that makes predictions based on the similarity (distance) between data points.

Working Principle:

Given a new input, the algorithm finds the k closest training samples using a distance metric (e.g., Euclidean, Manhattan).

The prediction is based on these nearest neighbors.

In Classification:

Each of the k nearest neighbors votes for a class label.

The majority class is assigned to the new input.

Example: If k=5 and 3 neighbors belong to class A, 2 to class B → Prediction = Class A.
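
The voting rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `knn_classify` and the toy points are made up for the example, and Euclidean distance is assumed.

```python
from collections import Counter
import math

def knn_classify(train_points, train_labels, query, k=5):
    """Return the majority label among the k training points closest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [(math.dist(p, query), label)
                 for p, label in zip(train_points, train_labels)]
    # Keep the k smallest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Each neighbor casts one vote for its label.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up): the 5 nearest neighbors of the origin are 3 x "A" and 2 x "B".
points = [(0.0, 0.1), (0.1, 0.0), (0.1, 0.1), (0.3, 0.3), (0.4, 0.2), (5.0, 5.0)]
labels = ["A", "A", "A", "B", "B", "B"]
prediction = knn_classify(points, labels, query=(0.0, 0.0), k=5)
print(prediction)
```

With k=5 the far-away point at (5.0, 5.0) is excluded, so the vote is 3 to 2 in favor of class A.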

In Regression:

The prediction is the average (or weighted average) of the values of the k nearest neighbors.

Example: If neighbors have values [10, 12, 14], prediction = 12.
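
The same idea for regression, as a small sketch. The helper `knn_regress`, the inverse-distance weighting scheme, and the one-dimensional toy data are assumptions chosen for illustration; other weighting schemes are possible.

```python
import math

def knn_regress(train_points, train_values, query, k=3, weighted=False):
    """Average (optionally inverse-distance weighted) of the k nearest target values."""
    # Sort (distance, value) pairs and keep the k closest.
    nearest = sorted((math.dist(p, query), v)
                     for p, v in zip(train_points, train_values))[:k]
    if not weighted:
        return sum(v for _, v in nearest) / k
    # Closer neighbors get larger weights; the small constant avoids division by zero.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

# Toy data (made up): the three nearest neighbors of x=1.5 have values 10, 12 and 14.
xs = [(1.0,), (2.0,), (3.0,), (10.0,)]
ys = [10.0, 12.0, 14.0, 100.0]
plain = knn_regress(xs, ys, query=(1.5,), k=3)
weighted = knn_regress(xs, ys, query=(1.5,), k=3, weighted=True)
print(plain, weighted)
```

The plain average is 12, matching the example above. The weighted version lands below 12 because the two closest neighbors (values 10 and 12) count for more than the farther one (value 14).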

Advantages:

Simple and easy to implement.

Works well with smaller datasets.

Disadvantages:

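Computationally expensive at prediction time for large datasets, because each query is compared against the stored training samples.

Sensitive to irrelevant/noisy features and to feature scale, since every feature contributes to the distance.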

Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

Answer:

Definition:
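The Curse of Dimensionality refers to the problems that arise as the number of features (dimensions) increases: the volume of the feature space grows so quickly that a fixed amount of data becomes sparse, and the performance of many algorithms deteriorates.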

Why it happens:

In high dimensions, data points become sparse.

Distances between points become less meaningful because the nearest and farthest neighbors of a point end up at nearly the same distance.
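
A quick way to see this effect is to measure how much farther the farthest point is than the nearest one as the dimension grows. The sketch below assumes uniformly random points in the unit cube; the helper name `relative_contrast` and the sample sizes are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the number of dimensions grows.
contrast = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, c in contrast.items():
    print(f"{d:>4} dimensions: relative contrast = {c:.3f}")
```

In low dimensions the nearest point is many times closer than the farthest one; in high dimensions the ratio approaches zero, which is what makes "nearest" neighbors less informative.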

Effect on KNN:

KNN relies heavily on distance measures.

In high dimensions, distances become less discriminative, leading to poor classification/regression accuracy.

Finding neighbors also costs more, since each distance computation scales with the number of features.

Solutions:

Apply dimensionality reduction (e.g., PCA).

Use feature selection to remove irrelevant features.

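Scale/normalize data so that no single feature dominates the distance (this does not reduce dimensionality, but it keeps features comparable).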

Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

Answer:

Definition of PCA:

PCA is a statistical technique for dimensionality reduction.

It transforms original correlated features into a smaller set of uncorrelated principal components.

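The first component captures the maximum variance in the data; each subsequent component captures the maximum remaining variance while staying orthogonal to the previous ones.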

Steps of PCA:

Standardize data.

Compute covariance matrix.

Calculate eigenvalues & eigenvectors.

Select top components (based on variance).

Project data onto new components.
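
The five steps can be followed by hand with NumPy. This is a sketch on synthetic data (the array `X_demo` is made up for the example); in practice `sklearn.decomposition.PCA` does the same job using an SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 3 features; feature 1 is strongly correlated with feature 0.
x0 = rng.normal(size=200)
X_demo = np.column_stack([x0,
                          2 * x0 + rng.normal(scale=0.1, size=200),
                          rng.normal(size=200)])

# Step 1: standardize each feature to zero mean and unit variance.
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Step 2: covariance matrix of the standardized features (3 x 3).
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvalues and eigenvectors (eigh is meant for symmetric matrices).
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: sort by decreasing eigenvalue and keep the top 2 components.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
W = eigvecs[:, :2]

# Step 5: project the data onto the selected components.
X_proj = X_std @ W
print("Explained variance ratio:", eigvals / eigvals.sum())
print("Projected shape:", X_proj.shape)
```

Because two of the three features are nearly redundant, the first component should account for most of the variance here.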

PCA vs Feature Selection:

Feature Selection: Selects a subset of the original features (no transformation).

PCA: Creates new features (linear combinations of originals).

Example:

Feature selection: Choose “Age” and “Income” from a dataset.

PCA: Combine Age and Income into a new principal component, i.e. a weighted sum of the two (standardized) features.
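
The difference can also be seen in code. The sketch below uses the Wine dataset from scikit-learn rather than an Age/Income table, and `SelectKBest` with `f_classif` is just one of several possible selection methods, picked here as an assumption.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_std = StandardScaler().fit_transform(data.data)

# Feature selection: keep 2 of the 13 original columns, unchanged.
selector = SelectKBest(score_func=f_classif, k=2).fit(X_std, data.target)
kept = selector.get_support(indices=True)
X_selected = selector.transform(X_std)
print("Selected features:", [data.feature_names[i] for i in kept])
print("Columns are copied as-is:", np.array_equal(X_selected, X_std[:, kept]))

# PCA: build 2 new columns, each a weighted sum of all 13 original columns.
pca_demo = PCA(n_components=2).fit(X_std)
X_components = pca_demo.transform(X_std)
print("Weights of the first component:", pca_demo.components_[0].round(2))
```

The selected columns keep their original names and meaning, while each principal component mixes all 13 features through the weights in `components_`.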

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?

Answer:

Eigenvalues:

Represent the amount of variance captured by each principal component.

Larger eigenvalue → More variance explained.

Eigenvectors:

Define the direction of the new feature axes (principal components).

Each eigenvector corresponds to a principal component.

Importance in PCA:

Eigenvalues: Help decide how many components to retain.

Eigenvectors: Provide the new feature space for projecting data.

Example:
If the first eigenvalue accounts for 70% of the total variance → keep that component, as it carries most of the information.
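
To connect eigenvalues to "variance explained", the following sketch computes them directly from the covariance matrix of the standardized Wine data and compares the result with scikit-learn's `explained_variance_ratio_`. Variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)
cov = np.cov(X_std, rowvar=False)

# eigh returns eigenvalues in ascending order, so reverse to get the largest first.
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Defining property of an eigenvector: cov @ v equals eigenvalue * v.
v = eigvecs[:, 0]
print("cov @ v == lambda * v:", np.allclose(cov @ v, eigvals[0] * v))

# Share of total variance carried by each component.
ratio = eigvals / eigvals.sum()
pca_check = PCA().fit(X_std)
print("Matches sklearn:", np.allclose(ratio, pca_check.explained_variance_ratio_))
```

Each eigenvalue divided by the sum of all eigenvalues is the fraction of variance explained by the corresponding component.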

Question 5: How do KNN and PCA complement each other when applied in a single pipeline?

Answer:

Problem:

KNN suffers from the Curse of Dimensionality in high-dimensional datasets.

Solution with PCA:

PCA reduces dimensions, can filter out some noise (by dropping low-variance directions), and retains as much variance as possible.

This makes distance calculations in KNN more meaningful.

Pipeline Flow:

Step 1: Apply PCA → reduce features.

Step 2: Train KNN on reduced dataset.

Advantages:

Faster computation (fewer features).

Often better accuracy.

Lower risk of overfitting.

Example:

High-dimensional gene dataset → Apply PCA → Use KNN for classification

Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

Answer:

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Without Scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc_without = accuracy_score(y_test, y_pred)

# With Scaling (fit the scaler on the training split only, then reuse it on the test split)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn.fit(X_train_scaled, y_train)
y_pred_scaled = knn.predict(X_test_scaled)
acc_with = accuracy_score(y_test, y_pred_scaled)

print("Accuracy without scaling:", acc_without)
print("Accuracy with scaling:", acc_with)


Accuracy without scaling: 0.7407407407407407
Accuracy with scaling: 0.9629629629629629
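
A likely reason for the gap is that the Wine features live on very different scales, so the largest ones dominate the unscaled distance. The sketch below simply prints the per-feature standard deviation; variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine

wine_demo = load_wine()
stds = wine_demo.data.std(axis=0)

# List features from largest to smallest spread.
for idx in np.argsort(stds)[::-1]:
    print(f"{wine_demo.feature_names[idx]:<30} std = {stds[idx]:.2f}")

largest = wine_demo.feature_names[int(np.argmax(stds))]
print("Feature with the largest spread:", largest)
```

Without scaling, a feature with a spread in the hundreds outweighs features whose spread is below one, regardless of how informative each feature is.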


Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

In [2]:
from sklearn.decomposition import PCA

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
pca.fit(X_scaled)

print("Explained Variance Ratio:", pca.explained_variance_ratio_)


Explained Variance Ratio: [0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
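
These ratios can be turned into a cumulative curve to decide how many components to keep. The sketch below reuses the numbers printed above; the 90% and 95% thresholds are common rules of thumb, not fixed requirements.

```python
import numpy as np

# Explained variance ratios copied from the output above.
ratios = np.array([0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294,
                   0.04935823, 0.04238679, 0.02680749, 0.02222153, 0.01930019,
                   0.01736836, 0.01298233, 0.00795215])

cumulative = np.cumsum(ratios)

def components_needed(threshold):
    """Smallest number of components whose cumulative ratio reaches the threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)

n90 = components_needed(0.90)
n95 = components_needed(0.95)
print("Cumulative variance:", cumulative.round(3))
print("Components for 90%:", n90)
print("Components for 95%:", n95)
```

For this dataset the first two components cover about 55% of the variance, 8 components are needed to reach 90%, and 10 to reach 95%.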


Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

In [3]:
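# Note: X_scaled was standardized on the full dataset in the previous cell, and PCA is
# also fit on all samples before splitting, so the test rows influence both transforms.
# The accuracy reported below is for this setup and may be slightly optimistic.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Same test_size and random_state as before, so the split contains the same rows
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y, test_size=0.3, random_state=42)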

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca, y_train_pca)
y_pred_pca = knn.predict(X_test_pca)
acc_pca = accuracy_score(y_test_pca, y_pred_pca)

print("Accuracy with original scaled data:", acc_with)
print("Accuracy with PCA (2 components):", acc_pca)


Accuracy with original scaled data: 0.9629629629629629
Accuracy with PCA (2 components): 0.9814814814814815
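
In the cell above, the scaler and PCA see the whole dataset before the split. A variant that fits both on the training split only is sketched below using scikit-learn's `make_pipeline`. Variable names are chosen to avoid overwriting the ones used earlier, and the resulting accuracy may differ slightly from the figure above.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_w, y_w = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_w, y_w, test_size=0.3, random_state=42)

# Scaling and PCA are fitted inside the pipeline, on the training rows only.
pca_knn = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        KNeighborsClassifier(n_neighbors=5))
pca_knn.fit(X_tr, y_tr)

# At prediction time the fitted scaler and PCA are applied to the test rows.
acc_pipeline = accuracy_score(y_te, pca_knn.predict(X_te))
print("Accuracy with PCA (2 components), fitted on the training split only:", acc_pipeline)
```

Wrapping the steps in one pipeline also makes the model easy to pass to cross-validation utilities later.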


Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

In [4]:
# Euclidean distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
acc_euclidean = accuracy_score(y_test, knn_euclidean.predict(X_test_scaled))

# Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
acc_manhattan = accuracy_score(y_test, knn_manhattan.predict(X_test_scaled))

print("Euclidean Accuracy:", acc_euclidean)
print("Manhattan Accuracy:", acc_manhattan)


Euclidean Accuracy: 0.9629629629629629
Manhattan Accuracy: 0.9629629629629629
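
The two metrics are both special cases of the Minkowski distance (p=2 for Euclidean, p=1 for Manhattan). The plain-Python sketch below shows how they differ on a single pair of points; the helper `minkowski` is written for illustration and is not the scikit-learn implementation.

```python
def minkowski(a, b, p):
    """Minkowski distance between two equal-length points for a given order p."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
d_euclidean = minkowski(a, b, p=2)  # straight-line distance
d_manhattan = minkowski(a, b, p=1)  # sum of absolute coordinate differences
print("Euclidean:", d_euclidean)
print("Manhattan:", d_manhattan)
```

Here the Euclidean distance is 5 and the Manhattan distance is 7. Equal accuracy in the comparison above does not imply the two metrics always pick the same neighbors; on this test split they simply led to the same number of correct predictions.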


Question 10: Gene Expression Dataset (High-dimensional data & small samples)

Answer:

Use PCA to reduce dimensionality:

Gene expression data may have thousands of features but only a few samples.

PCA compresses the features into a smaller set of components that capture as much of the variance as possible.

Decide number of components:

Plot cumulative explained variance.

Select the smallest number of components covering roughly 90–95% of the variance (a common rule of thumb).

Use KNN after PCA:

Apply KNN on reduced dataset.

Distances tend to be more informative because the data is denser in the lower-dimensional space.

Evaluate model:

Use cross-validation for reliable results.

Evaluate with accuracy, precision, recall, F1-score.
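
The steps above can be sketched end to end. No real gene expression data is used here: `make_classification` generates a synthetic stand-in with many features and few samples, and the sizes, the 95% variance threshold, and k=5 are all assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: 100 samples, 2000 features.
X_genes, y_genes = make_classification(n_samples=100, n_features=2000,
                                       n_informative=20, random_state=0)

# A float n_components keeps just enough components to reach 95% of the variance.
gene_pipeline = make_pipeline(StandardScaler(),
                              PCA(n_components=0.95, svd_solver="full"),
                              KNeighborsClassifier(n_neighbors=5))

# Scaling and PCA are refitted inside every fold, so validation rows never leak in.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
scores = cross_validate(gene_pipeline, X_genes, y_genes, cv=cv, scoring=metrics)

mean_scores = {m: scores[f"test_{m}"].mean() for m in metrics}
for name, value in mean_scores.items():
    print(f"{name}: {value:.3f}")
```

Scores on this synthetic data say nothing about a real dataset; the point is the structure, with preprocessing kept inside the cross-validated pipeline.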

Justification to stakeholders:

Reduces overfitting risk.

Handles high-dimensional biomedical data efficiently.

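Improves computational efficiency; the first two or three components can also be plotted as a visual summary, although individual components are harder to interpret than single genes.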

A robust real-world pipeline:

Raw Data → Scaling → PCA → KNN → Evaluation
