Question 1:  What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?


Answer:  K-Nearest Neighbors (KNN) is a supervised, non-parametric, instance-based (lazy learning) algorithm used for both classification and regression tasks. It makes predictions based on the similarity between a new data point and the existing labeled data.

How KNN Works (Common Steps)

Choose K – the number of nearest neighbors to consider.

Compute distance between the new data point and all training points
(commonly Euclidean distance, but Manhattan, Minkowski, or cosine distance may also be used).

Select the K closest data points.

Aggregate their outputs to make a prediction.

KNN does not build an explicit model during training; all computation happens at prediction time.
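
The four steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a replacement for scikit-learn's implementation: the function name `knn_predict`, its `task` parameter, and the toy data are all made up for this example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for a single point by following the four KNN steps."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Step 2: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Step 3: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]

    # Step 4: aggregate the neighbors' outputs (majority vote or mean).
    if task == "classification":
        return Counter(neighbor_targets).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))


# Toy data (hypothetical): two well-separated clusters.
X_train = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]]
y_class = ["A", "A", "A", "B", "B", "B"]
y_value = [10.0, 12.0, 14.0, 50.0, 52.0, 54.0]

print(knn_predict(X_train, y_class, [1.8, 1.6], k=3))                     # A
print(knn_predict(X_train, y_value, [1.8, 1.6], k=3, task="regression"))  # 12.0
```

Note that nothing is learned ahead of time: every call recomputes distances to all stored training points, which is why the cost sits at prediction time.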

KNN for Classification

In classification, KNN assigns a class label to a new data point based on the majority class among its K nearest neighbors.

Example

If K = 5 and the nearest neighbors have labels:

Class A → 3 points

Class B → 2 points

The new data point is classified as Class A.

Key Points

Output: Discrete class

Decision rule: Majority voting

Can use distance-weighted voting, where closer neighbors have more influence.
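
To see how distance-weighted voting can change the outcome, the sketch below reuses the 3-vs-2 label split from the example and attaches hypothetical distances (they are not from the source) in which the two Class B neighbors are much closer. Inverse-distance weights are assumed here; other weighting schemes are possible.

```python
from collections import Counter, defaultdict

# Labels of the K = 5 nearest neighbors from the example above.
labels = ["A", "A", "A", "B", "B"]
# Hypothetical distances: the two B neighbors are much closer than the A neighbors.
distances = [2.0, 2.5, 3.0, 0.5, 0.8]

# Plain majority voting: every neighbor counts once.
majority = Counter(labels).most_common(1)[0][0]

# Distance-weighted voting: each neighbor votes with weight 1 / distance.
weighted_votes = defaultdict(float)
for label, d in zip(labels, distances):
    weighted_votes[label] += 1.0 / d
weighted = max(weighted_votes, key=weighted_votes.get)

print(majority)  # A
print(weighted)  # B
```

In scikit-learn, the same idea is available through `KNeighborsClassifier(weights="distance")`.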

KNN for Regression

In regression, KNN predicts a continuous value by averaging (or weighting) the values of the K nearest neighbors.

Example

If K = 4 and the target values of neighbors are:

10, 12, 14, 16

Predicted value = (10 + 12 + 14 + 16) / 4 = 13

Key Points

Output: Continuous value

Decision rule: Mean or weighted mean

Distance-weighted averages often improve accuracy.
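
The two decision rules can be compared on the target values from the example. The distances below are hypothetical (the source gives none), and inverse-distance weighting is assumed for the weighted mean.

```python
# Target values of the K = 4 neighbors from the example above.
targets = [10.0, 12.0, 14.0, 16.0]
# Hypothetical distances from the new point to each neighbor.
distances = [1.0, 2.0, 3.0, 4.0]

# Plain mean: every neighbor contributes equally.
mean_pred = sum(targets) / len(targets)

# Weighted mean: closer neighbors (smaller distance) get larger weights.
weights = [1.0 / d for d in distances]
weighted_pred = sum(w * t for w, t in zip(weights, targets)) / sum(weights)

print(mean_pred)                # 13.0
print(round(weighted_pred, 2))  # 11.84
```

The weighted prediction is pulled toward 10 because the nearest neighbor carries the largest weight.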

Choice of K

Small K → Low bias, high variance (sensitive to noise)

Large K → High bias, low variance (smoother predictions)

Optimal K is usually chosen using cross-validation.
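
One possible way to choose K by cross-validation is sketched below, using the Wine dataset that the later questions also use. The candidate list of K values and the 5-fold setting are arbitrary choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Candidate values of K (an arbitrary grid chosen for this sketch).
candidate_ks = [1, 3, 5, 7, 9, 11, 15, 21]
mean_scores = {}
for k in candidate_ks:
    # Scaling lives inside the pipeline, so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print(f"K={k:2d}  mean CV accuracy={score:.4f}")
print("Best K:", best_k)
```

`GridSearchCV` would do the same search with less code; the explicit loop is used here only to keep each step visible.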

Advantages of KNN

Simple and intuitive

No explicit training phase (fitting only stores the training data)

Works well with small datasets

Can handle multi-class classification naturally

Limitations of KNN

Computationally expensive at prediction time

Sensitive to feature scaling (normalization is essential)

Prediction speed and memory use degrade with large datasets, because every training point is stored and compared

Sensitive to irrelevant features and noise

Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

Answer: Curse of Dimensionality

The Curse of Dimensionality refers to a set of problems that arise when working with high-dimensional data (i.e., data with a large number of features). As the number of dimensions increases, the feature space grows exponentially, causing data points to become increasingly sparse. This sparsity makes it difficult for distance-based algorithms like K-Nearest Neighbors (KNN) to work effectively.

How It Affects KNN Performance

KNN relies entirely on distance calculations to identify “nearest” neighbors. In high-dimensional spaces, this assumption breaks down in several ways:

Distance Concentration

In high dimensions, the distances from a query point to its nearest and farthest neighbors become almost the same.

As a result, nearest neighbors are no longer meaningfully close.

This reduces KNN’s ability to distinguish between relevant and irrelevant neighbors.

Reduced Similarity Meaning

With many features, points tend to appear equally distant from each other.

Distance metrics (like Euclidean distance) lose their discriminative power.

KNN predictions can degrade toward random guessing.

Increased Data Requirement

To maintain meaningful neighborhood relationships, the dataset size must grow exponentially with the number of dimensions.

In practice, this is often infeasible, leading to poor generalization.

Higher Computational Cost

More dimensions make each distance calculation more expensive, since every feature contributes a term.

This increases prediction time and memory usage, making KNN inefficient for large, high-dimensional datasets.

Noise and Irrelevant Features

High-dimensional data often contains irrelevant or noisy features.

These features distort distance calculations, causing incorrect neighbor selection.
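
The distance concentration effect described above can be observed with a small simulation. This sketch assumes points drawn uniformly at random from the unit hypercube, which is a simplification of real data; `relative_contrast` is a helper name invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=1000):
    """Return (farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


# As the dimension grows, the nearest and farthest points look increasingly alike.
for d in [2, 10, 100, 1000]:
    print(f"dims={d:5d}  relative contrast={relative_contrast(d):.3f}")
```

A relative contrast near zero means the "nearest" neighbor is barely closer than the farthest one, which is the situation in which KNN loses its discriminative power.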

Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

Answer:  Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique used to transform a high-dimensional dataset into a lower-dimensional space while preserving as much variance (information) as possible.

Instead of selecting existing features, PCA creates new features, called principal components, which are linear combinations of the original variables.

How PCA Works (Conceptually)

Standardize the data (mean = 0, variance = 1).

Compute the covariance matrix to understand feature relationships.

Calculate eigenvectors and eigenvalues:

Eigenvectors → directions of maximum variance (principal components)

Eigenvalues → amount of variance captured by each component

Sort components by descending eigenvalues.

Select top K components that explain most of the variance.

Project data onto these components to obtain reduced dimensions.
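
The conceptual steps above can be written out directly with NumPy. This is a teaching sketch only (scikit-learn's `PCA` uses an SVD-based implementation instead); the function name `pca_fit_transform` and the synthetic data are assumptions made for this example.

```python
import numpy as np


def pca_fit_transform(X, n_components):
    """Follow the conceptual PCA steps; assumes no feature has zero variance."""
    X = np.asarray(X, dtype=float)

    # 1. Standardize each feature to mean 0 and variance 1.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Covariance matrix of the features (columns).
    cov = np.cov(X_std, rowvar=False)

    # 3. Eigen-decomposition; eigh is meant for symmetric matrices.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Sort components by descending eigenvalue (eigh returns ascending order).
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # 5. Keep the top components, and 6. project the data onto them.
    components = eigenvectors[:, :n_components]
    X_reduced = X_std @ components

    explained_ratio = eigenvalues / eigenvalues.sum()
    return X_reduced, explained_ratio


# Synthetic data (hypothetical): the second feature is nearly a copy of the first.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

X_reduced, ratio = pca_fit_transform(X, n_components=2)
print(X_reduced.shape)  # (200, 2)
print(ratio.round(3))
```

Because two of the three features are almost identical, most of the variance ends up in the first component, which illustrates how PCA absorbs redundancy.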

Key Characteristics of PCA

Unsupervised (does not use target variable)

Reduces dimensionality while minimizing information loss

Removes multicollinearity

Produces orthogonal (uncorrelated) components

Components are often not directly interpretable

Feature Selection

Feature selection is the process of choosing a subset of the original features without transforming them.

The goal is to retain the most relevant features and remove redundant or irrelevant ones.

Types of Feature Selection

Filter methods (correlation, chi-square, ANOVA)

Wrapper methods (forward selection, backward elimination)

Embedded methods (Lasso, decision tree feature importance)
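
The difference between the two approaches can be made concrete on the Wine dataset. The sketch below uses `SelectKBest` with the ANOVA F-test as one example of a filter method; keeping exactly 2 features or components is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X, y = data.data, data.target
X_scaled = StandardScaler().fit_transform(X)

# Feature selection: keeps 2 of the original columns unchanged (uses the labels y).
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, y)
kept = selector.get_support(indices=True)
X_selected = selector.transform(X_scaled)
print("Kept original features:", [data.feature_names[i] for i in kept])

# PCA: builds 2 new columns, each a weighted mix of all 13 features (ignores y).
pca = PCA(n_components=2).fit(X_scaled)
X_pca = pca.transform(X_scaled)
print("Features with nonzero weight in PC1:",
      np.count_nonzero(pca.components_[0]), "of", X.shape[1])
```

The selected columns are still named, measurable quantities, whereas each principal component is a blend of every feature, which is why PCA components are harder to interpret.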

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?

Answer: Eigenvalues and Eigenvectors in PCA

In Principal Component Analysis (PCA), eigenvalues and eigenvectors are the mathematical foundation that determine how the data is transformed and how much information is retained after dimensionality reduction.

Eigenvectors in PCA

An eigenvector of the covariance matrix represents a direction in the feature space. The eigenvector with the largest eigenvalue is the direction along which the data varies the most.

Role in PCA

Each eigenvector corresponds to a principal component.

It defines the new axis onto which the original data is projected.

Eigenvectors are orthogonal (perpendicular) to each other, ensuring no redundancy.

They are linear combinations of original features.

Intuition

Think of eigenvectors as the directions of maximum spread in the data cloud.

Eigenvalues in PCA

An eigenvalue indicates the amount of variance captured along its corresponding eigenvector.

Role in PCA

Larger eigenvalue → more information (variance) captured.

Smaller eigenvalue → less useful component.

Eigenvalues help rank principal components.

Intuition

Eigenvalues tell us how important each direction (eigenvector) is.

Mathematical Context (Simplified)

PCA computes eigenvectors and eigenvalues of the covariance matrix (or correlation matrix).

If Σv = λv:

v = eigenvector (principal component direction)

λ = eigenvalue (variance along that direction)
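
The relation Σv = λv can be checked numerically. The 2×2 matrix below is an illustrative stand-in for a covariance matrix, not one computed from real data.

```python
import numpy as np

# A small symmetric matrix standing in for a covariance matrix (illustrative values).
sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(sigma)
print(eigenvalues)  # [1. 3.]

# Each column of `eigenvectors` satisfies sigma @ v == lambda * v.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(sigma @ v, lam * v)

# The eigenvectors of a symmetric matrix are orthogonal.
print(np.isclose(eigenvectors[:, 0] @ eigenvectors[:, 1], 0.0))  # True
```

Here the direction with eigenvalue 3 would be the first principal component and the direction with eigenvalue 1 the second.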

Why Eigenvalues and Eigenvectors Are Important in PCA

Define New Feature Space

Eigenvectors determine the axes of the reduced-dimensional space.

Dimensionality Reduction

Eigenvalues help decide how many principal components to keep.

Variance Preservation

Selecting components with the largest eigenvalues preserves maximum information.

Noise Reduction

Components with small eigenvalues often represent noise and can be discarded.

Explained Variance Ratio

Eigenvalues are used to compute the percentage of variance explained by each component.
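
As a small worked example of the explained variance ratio, each eigenvalue is divided by the sum of all eigenvalues. The eigenvalues below are hypothetical and chosen so the arithmetic is easy to follow.

```python
# Hypothetical eigenvalues, already sorted in descending order.
eigenvalues = [4.0, 2.0, 1.5, 0.5]

# Explained variance ratio of component i = eigenvalue_i / sum of all eigenvalues.
total = sum(eigenvalues)
ratios = [lam / total for lam in eigenvalues]

print(ratios)  # [0.5, 0.25, 0.1875, 0.0625]
```

In this example the first two components together explain 75% of the variance.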

Question 5: How do KNN and PCA complement each other when applied in a single pipeline?


Answer:  How KNN and PCA Complement Each Other in a Single Pipeline

KNN and PCA are often combined in a single machine learning pipeline because PCA directly addresses the key weaknesses of KNN. Together, they can improve accuracy, efficiency, and robustness, especially for high-dimensional data.

Core Idea

PCA reduces dimensionality and removes redundancy by projecting data into a lower-dimensional, uncorrelated feature space.

KNN then performs distance-based prediction more effectively in this reduced space.

In short: PCA prepares the feature space; KNN performs the prediction.

Why PCA Improves KNN Performance
1. Mitigates the Curse of Dimensionality

KNN relies on distance metrics, which degrade in high dimensions.

PCA reduces the number of features while preserving variance.

Distances become more meaningful, improving neighbor selection.

2. Removes Multicollinearity

Highly correlated features distort distance calculations.

PCA transforms correlated features into orthogonal components.

KNN benefits because redundant, correlated features no longer count multiple times in the distance.

3. Improves Computational Efficiency

KNN has no training phase but is expensive at prediction time.

Fewer dimensions → faster distance calculations and lower memory usage.

4. Reduces Noise

Components with small eigenvalues often represent noise.

Removing them leads to cleaner neighborhoods and better predictions.

Typical KNN + PCA Pipeline

Standardize features

Apply PCA (retain components explaining, e.g., 95% variance)

Fit KNN on the transformed data

Predict using reduced-dimension distances

Practical Example
Without PCA

100 features

Sparse data

Nearest neighbors poorly defined

High prediction latency

With PCA

Reduced to 15 principal components

Compact, dense feature space

More reliable neighbors

Faster and often more accurate predictions (accuracy can drop if too few components are kept)

When This Combination Is Most Effective

High-dimensional datasets

Correlated or noisy features

Distance-based models (KNN, K-Means)

Image, text embeddings, gene expression data

Dataset: Use the Wine Dataset from sklearn.datasets.load_wine().

Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases. (Include your Python code and output in the code box below.)

Answer:  KNN is a distance-based algorithm.
The Wine dataset contains features with very different scales (e.g., alcohol vs proline).
Without scaling, features with larger numeric ranges dominate distance calculations, leading to poor performance.

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Load dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# KNN without scaling
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# KNN with scaling (pipeline)
knn_scaled_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])
knn_scaled_pipeline.fit(X_train, y_train)
y_pred_scaled = knn_scaled_pipeline.predict(X_test)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# Print results
print(f"Accuracy without scaling: {acc_no_scaling:.4f}")
print(f"Accuracy with scaling:    {acc_scaled:.4f}")


Accuracy without scaling: 0.8056
Accuracy with scaling:    0.9722


In [2]:
# Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component. (Include your Python code)

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load dataset
X, y = load_wine(return_X_y=True)

# Standardize features (PCA is sensitive to feature scale)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train PCA (keep all components)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Print explained variance ratio
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"Principal Component {i}: {ratio:.4f}")


Principal Component 1: 0.3620
Principal Component 2: 0.1921
Principal Component 3: 0.1112
Principal Component 4: 0.0707
Principal Component 5: 0.0656
Principal Component 6: 0.0494
Principal Component 7: 0.0424
Principal Component 8: 0.0268
Principal Component 9: 0.0222
Principal Component 10: 0.0193
Principal Component 11: 0.0174
Principal Component 12: 0.0130
Principal Component 13: 0.0080


In [3]:
# Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset. (Include your Python code)

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# -------------------------------
# KNN on ORIGINAL (SCALED) DATA
# -------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)

acc_original = accuracy_score(y_test, y_pred_original)

# -------------------------------
# PCA (Top 2 Components)
# -------------------------------
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# KNN on PCA-transformed data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)

acc_pca = accuracy_score(y_test, y_pred_pca)

# -------------------------------
# Results
# -------------------------------
print(f"Accuracy with original features: {acc_original:.4f}")
print(f"Accuracy with PCA (2 components): {acc_pca:.4f}")


Accuracy with original features: 0.9722
Accuracy with PCA (2 components): 0.9167


In [4]:
# Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results. (Include your Python code)

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# -------------------------------
# KNN with Euclidean distance
# -------------------------------
knn_euclidean = KNeighborsClassifier(
    n_neighbors=5,
    metric='euclidean'
)
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# -------------------------------
# KNN with Manhattan distance
# -------------------------------
knn_manhattan = KNeighborsClassifier(
    n_neighbors=5,
    metric='manhattan'
)
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

# -------------------------------
# Results
# -------------------------------
print(f"Accuracy with Euclidean distance: {acc_euclidean:.4f}")
print(f"Accuracy with Manhattan distance: {acc_manhattan:.4f}")




Accuracy with Euclidean distance: 0.9722
Accuracy with Manhattan distance: 1.0000


Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit. Explain how you would:

● Use PCA to reduce dimensionality

● Decide how many components to keep

● Use KNN for classification post-dimensionality reduction

● Evaluate the model

● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data (Include your Python code)

Answer:  Gene expression datasets typically have thousands of genes (features) but very few patient samples. This setting is highly prone to overfitting, poor generalization, and unstable distance calculations.

A PCA → KNN pipeline directly addresses these challenges.

1. Using PCA to Reduce Dimensionality
Why PCA is appropriate

Gene expression features are highly correlated

PCA:

Projects data into orthogonal components

Preserves maximum biological variance

Removes noise and redundancy

Mitigates the curse of dimensionality

How it is applied

Standardize gene expression values

Apply PCA on the training set only

Transform both train and test data using learned components

2. Deciding How Many Components to Keep

We retain components based on explained variance, not an arbitrary number.

Common biomedical practice

Keep components explaining 90–95% cumulative variance

This balances:

Information retention

Noise reduction

Model stability

Decision rule

Choose the smallest number of components such that cumulative explained variance ≥ 95%

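This helps limit overfitting while retaining most of the variance. Because PCA is unsupervised, the retained variance is not guaranteed to be disease-related, so the threshold should be checked against cross-validated performance.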
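
The decision rule can be applied by hand to the explained variance ratios printed for the Wine dataset in Question 7. The helper name `components_for` is invented for this sketch; on a real gene expression matrix the ratios would come from PCA fitted on the training data only.

```python
from itertools import accumulate

# Explained variance ratios as printed for the Wine dataset in Question 7.
ratios = [0.3620, 0.1921, 0.1112, 0.0707, 0.0656, 0.0494, 0.0424,
          0.0268, 0.0222, 0.0193, 0.0174, 0.0130, 0.0080]


def components_for(threshold, ratios):
    """Smallest number of leading components whose cumulative ratio >= threshold."""
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return k
    return len(ratios)


print(components_for(0.90, ratios))  # 8
print(components_for(0.95, ratios))  # 10
```

Passing a float such as `PCA(n_components=0.95)` asks scikit-learn to make this choice automatically.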

3. Using KNN After PCA
Why KNN?

Non-parametric (no assumptions about gene distributions)

Works well in low-dimensional, denoised spaces

Naturally captures patient similarity patterns

Why after PCA?

PCA ensures:

Meaningful distance calculations

Reduced computational cost

Improved neighborhood quality

KNN is trained on PCA-transformed features, not raw gene counts.

4. Model Evaluation Strategy

Because biomedical datasets are small, evaluation must be rigorous.

Metrics used

Accuracy (overall performance)

Confusion matrix (class-wise errors)

Cross-validation (stability across splits)

Validation principles

PCA fitted only on training folds (prevents data leakage)

Stratified splits to preserve cancer class balance
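
A sketch of how these validation principles might look in code is given below. The Wine dataset stands in for a gene expression matrix, and the fold count, random seed, and K = 5 are arbitrary choices. Wrapping the scaler and PCA in a `Pipeline` means both are refit on each training fold, so the held-out fold never influences them.

```python
from sklearn.datasets import load_wine  # stand-in for a gene expression matrix
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Per-fold accuracy shows how stable the model is across splits.
scores = cross_val_score(pipeline, X, y, cv=cv)

# Out-of-fold predictions give a confusion matrix covering every sample once.
y_oof = cross_val_predict(pipeline, X, y, cv=cv)

print(f"Per-fold accuracy: {scores.round(4)}")
print(confusion_matrix(y, y_oof))
```

With very few patients per class, the number of folds may need to be reduced so that each fold still contains every class.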

5. Justifying This Pipeline to Stakeholders
Why this is robust for real-world biomedical data

Scientific validity

PCA captures dominant biological variation

Reduces noise from irrelevant or low-expression genes

Statistical reliability

Reduces the risk of overfitting in small-sample, high-feature settings

Improves generalization to unseen patients

Operational feasibility

Faster inference

Lower memory footprint

Easy to retrain when new samples arrive

Clinical interpretability

Variance-based dimensionality reduction is widely accepted

Model decisions are based on patient similarity in latent biological space

This pipeline aligns with best practices in bioinformatics and translational medicine.

In [5]:
from sklearn.datasets import load_wine  # placeholder for gene expression data
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix

# Load example dataset (replace with gene expression matrix)
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# PCA + KNN pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),  # retain 95% variance
    ('knn', KNeighborsClassifier(n_neighbors=5, metric='euclidean'))
])

# Train model
pipeline.fit(X_train, y_train)

# Predictions
y_pred = pipeline.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# 5-fold cross-validation on the full dataset (the pipeline is refit on each training fold)
cv_scores = cross_val_score(pipeline, X, y, cv=5)

print(f"Test Accuracy: {accuracy:.4f}")
print("Confusion Matrix:\n", conf_matrix)
print(f"Cross-Validation Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")


Test Accuracy: 0.9722
Confusion Matrix:
 [[12  0  0]
 [ 0 13  1]
 [ 0  0 10]]
Cross-Validation Accuracy: 0.9495 ± 0.0329
