Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?
-- K-Nearest Neighbors (KNN) is a simple, non-parametric, supervised learning algorithm used for both classification and regression problems.

How KNN Works (Basic Idea):

KNN makes predictions based on the similarity (distance) between the input data point and its K nearest neighbors in the training dataset.

"K" is the number of nearest neighbors considered.

The similarity is usually measured by Euclidean distance, though other metrics like Manhattan or Minkowski distances can also be used.

KNN for Classification:

Input: A new, unlabeled data point.

Process:

Calculate the distance between the new point and all training points.

Identify the K closest neighbors.

Count how often each class appears among those neighbors.

Output: Assign the majority class to the new data point.

Example:
If K = 5 and among the 5 nearest neighbors:

3 are class A

2 are class B
Then the prediction is Class A.
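
The classification steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `knn_classify` and the toy 2D data are assumptions made up for this example, and Euclidean distance is used via `math.dist`.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Step 1: distance from the query to every training point (Euclidean).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: count the labels and return the most frequent one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters in 2D.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.0), (3.2, 2.9), (2.8, 3.1)]
train_labels = ["A", "A", "A", "B", "B", "B"]

# With k=5 the neighbors are 3 of class A and 2 of class B, so A wins the vote.
prediction = knn_classify(train_points, train_labels, query=(1.5, 1.5), k=5)
print(prediction)  # A
```

An odd K is a common choice for two-class problems because it rules out tied votes.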

KNN for Regression:

Input: A new data point without a target value.

Process:

Calculate the distance to all training points.

Find the K nearest neighbors.

Take the average (or weighted average) of their target values.

Output: Return the mean (or weighted mean) as the predicted value.

Example:
If the nearest 3 neighbors have values [5.2, 6.1, 5.7], the predicted value would be:

$$\frac{5.2 + 6.1 + 5.7}{3} = \frac{17.0}{3} \approx 5.67$$
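
A small sketch of the regression variant, assuming Euclidean distance and, for the weighted case, inverse-distance weights (one common weighting scheme among several). The function name `knn_regress` and the 1D data are hypothetical; the three targets nearest to the query are the values from the example.

```python
import math

def knn_regress(train_points, train_targets, query, k, weighted=False):
    """Predict a numeric target as the (optionally distance-weighted) mean of the k nearest targets."""
    # Sort (distance, target) pairs so the closest neighbors come first.
    distances = sorted(
        (math.dist(point, query), target)
        for point, target in zip(train_points, train_targets)
    )
    nearest = distances[:k]
    if not weighted:
        return sum(target for _, target in nearest) / k
    # Inverse-distance weights; the small constant avoids division by zero.
    weights = [1.0 / (dist + 1e-9) for dist, _ in nearest]
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)

# Hypothetical 1D data; the three points nearest to 2.0 have targets 5.2, 6.1, 5.7.
train_points = [(1.0,), (2.0,), (3.0,), (8.0,), (9.0,)]
train_targets = [5.2, 6.1, 5.7, 12.0, 13.5]

mean_prediction = knn_regress(train_points, train_targets, query=(2.0,), k=3)
weighted_prediction = knn_regress(train_points, train_targets, query=(2.0,), k=3, weighted=True)
print(round(mean_prediction, 2))  # 5.67
```

In the weighted call the query coincides with a training point, so that point's target (6.1) dominates the prediction.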

Key Considerations:

Choice of K:

Too small → model becomes sensitive to noise (overfitting).

Too large → model may lose important local patterns (underfitting).

Feature scaling: KNN is distance-based, so it is sensitive to the scale of features. Normalization or standardization is usually needed.

Lazy learning: KNN does no learning during training; it simply stores the training data and does all computation during prediction.

Computational cost: Prediction can be slow on large datasets because a brute-force search must compute the distance from the query to every training point.
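
To make the feature-scaling point concrete, here is a sketch showing that the nearest neighbor of a point can change once features are standardized. The rows (age, income) and the helper names `standardize` and `nearest_index` are invented for illustration.

```python
import math
import statistics

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]

def nearest_index(points, query_index):
    """Index of the point closest to points[query_index], excluding itself."""
    query = points[query_index]
    candidates = [i for i in range(len(points)) if i != query_index]
    return min(candidates, key=lambda i: math.dist(points[i], query))

# Hypothetical rows: (age in years, income in dollars).
rows = [
    (25, 50_000),  # index 0: the query
    (26, 58_000),  # index 1: similar age, somewhat different income
    (60, 51_000),  # index 2: very different age, similar income
    (40, 90_000),
    (35, 20_000),
]

raw_neighbor = nearest_index(rows, 0)                  # income dominates the distance
scaled_neighbor = nearest_index(standardize(rows), 0)  # both features contribute
print(raw_neighbor, scaled_neighbor)  # 2 1
```

On the raw data the income column, measured in tens of thousands, swamps the age column, so the 60-year-old is treated as the closest match. After standardization the 26-year-old becomes the nearest neighbor.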

| Feature       | KNN for Classification       | KNN for Regression         |
| ------------- | ---------------------------- | -------------------------- |
| Output        | Most common class (majority) | Average of neighbor values |
| Decision Rule | Voting                       | Averaging                  |
| Use Case      | Label prediction             | Value prediction           |


Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?
-- What is the Curse of Dimensionality?

The Curse of Dimensionality refers to various problems that arise when analyzing and organizing data in high-dimensional spaces (i.e., when the number of features or dimensions is very large).

As dimensions increase:

Data becomes sparse.

Distance metrics (like Euclidean distance) become less meaningful.

The volume of the space increases exponentially, so data points become more spread out.

All points tend to look equally far apart, making it hard to distinguish neighbors.

 How It Affects KNN Performance:

Since KNN relies on distance calculations, the curse of dimensionality causes major issues for its accuracy and efficiency in high-dimensional data:

 1. Distance Becomes Less Informative

In high dimensions, the difference between the nearest and farthest neighbors shrinks.

This reduces the contrast between close and far points.

As a result, KNN may struggle to find meaningful "nearest" neighbors.

Example: In 100 dimensions, the distances from a query point to its nearest and farthest neighbors tend toward similar values, making it difficult for KNN to distinguish which neighbors are truly close.
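
The shrinking contrast can be observed with a short simulation. This sketch assumes points drawn uniformly from the unit hypercube and measures (farthest - nearest) / nearest for one random query; the function name `relative_contrast`, the sample size, and the seed are arbitrary choices for illustration.

```python
import math
import random

def relative_contrast(n_dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

# The contrast is large in low dimensions and falls toward zero as dimensions grow.
contrast_by_dim = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, contrast in contrast_by_dim.items():
    print(f"{d:>4} dimensions: relative contrast = {contrast:.2f}")
```

A contrast near zero means the "nearest" neighbor is barely closer than the farthest one, which is the situation in which KNN's votes stop carrying much information.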

 2. Increased Sparsity

The dataset becomes sparser as dimensionality increases, even if the number of samples remains the same.

Sparse data makes it difficult to form reliable neighborhoods for prediction.

 3. Higher Computational Cost

Each distance computation takes time proportional to the number of dimensions.

Since a brute-force KNN computes the distance to every training point, prediction time grows with both the number of features and the number of training samples.

 4. Risk of Overfitting

High-dimensional data often contains irrelevant or noisy features.

KNN treats all features equally unless feature selection or weighting is applied.

This can cause KNN to overfit on noise, leading to poor generalization.

Curse of Dimensionality: High-dimensional data weakens the effectiveness of distance-based methods like KNN.

It causes distance distortion, data sparsity, computational inefficiency, and overfitting.

To improve KNN performance in high dimensions, reduce dimensionality and focus on relevant features.

Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?
--  What is Principal Component Analysis (PCA)?

PCA is a dimensionality reduction technique used to reduce the number of features in a dataset while retaining as much variance (information) as possible.

It does this by transforming the original features into a new set of uncorrelated features called principal components.

 How PCA Works:

Standardize the data (mean = 0, variance = 1).

Compute the covariance matrix of the features.

Compute eigenvalues and eigenvectors of the covariance matrix.

Select the top K eigenvectors (those with the largest eigenvalues).

Project the data onto these top eigenvectors (principal components).

Each principal component is a linear combination of the original features, and they are ordered by the amount of variance they capture.
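
The five steps can be followed one by one in NumPy. This is a sketch for illustration: the function name `pca_from_scratch` and the synthetic height/weight/age data are assumptions, and `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca_from_scratch(X, n_components):
    """Project X onto its top principal components via the covariance eigendecomposition."""
    # 1. Standardize each feature to mean 0 and variance 1.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the features (columns are variables).
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigendecomposition; eigh returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort descending and keep the top n_components eigenvectors.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    components = eigenvectors[:, :n_components]
    # 5. Project the standardized data onto the selected directions.
    return X_std @ components, eigenvalues

# Hypothetical data: three features, the second strongly correlated with the first.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)
weight = 0.9 * height + rng.normal(0, 5, size=200)
age = rng.normal(40, 12, size=200)
X = np.column_stack([height, weight, age])

X_projected, eigenvalues = pca_from_scratch(X, n_components=2)
print(X_projected.shape)  # (200, 2)
# Off-diagonal entries are ~0: the new features are uncorrelated.
print(np.round(np.cov(X_projected, rowvar=False), 3))
```

Library implementations such as scikit-learn's `PCA` typically use a singular value decomposition instead, which gives the same components up to sign.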

 Example:

Suppose you have 3 features: height, weight, age.

PCA might create:

PC1: 0.6×height + 0.7×weight + 0.2×age

PC2: -0.5×height + 0.1×weight + 0.9×age

You can keep just the top 1 or 2 principal components instead of all 3 original features.

 PCA Is Used To:

Reduce dimensionality

Remove multicollinearity

Speed up training

Visualize high-dimensional data (e.g., 2D plot of PC1 vs PC2)

| Aspect               | PCA (Feature Extraction)                    | Feature Selection                     |
| -------------------- | ------------------------------------------- | ------------------------------------- |
| **What it does**     | Creates new features (principal components) | Chooses a subset of existing features |
| **Output Features**  | Transformed (linear combinations)           | Original, unmodified features         |
| **Purpose**          | Maximize variance in fewer dimensions       | Keep only relevant features           |
| **Interpretability** | Low (components are abstract combinations)  | High (original features are retained) |
| **Example**          | PC1 = 0.6×X1 + 0.8×X2                       | Keep only X1 and X3, drop X2          |


Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?
-- In PCA, eigenvalues and eigenvectors come from linear algebra, specifically from the covariance matrix of the data.

They are core components used to:

Identify principal components (directions of maximum variance).

Determine how much variance each principal component captures.

 What is an eigenvector?

An eigenvector of a matrix is a nonzero vector whose direction is unchanged by the corresponding linear transformation; the transformation only stretches or compresses it.

In the context of PCA:

An eigenvector represents a principal component direction — a new axis in the feature space.

It defines the orientation of the principal component.

 What is an eigenvalue?

An eigenvalue is the scalar factor by which its eigenvector is stretched or compressed. For a covariance matrix, it equals the variance of the data along the corresponding eigenvector.

In PCA:

Larger eigenvalue → more variance captured by that component.

Eigenvalues help you decide how many components to keep.

 PCA Step Involving Eigenvalues & Eigenvectors:

Compute the covariance matrix of the standardized data.

Compute eigenvectors and eigenvalues of this matrix.

Sort eigenvectors by their eigenvalues in descending order.

Select the top k eigenvectors → these form the principal components.

Project the data onto these eigenvectors.
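
The two roles of eigenvalues and eigenvectors can be checked numerically. The sketch below uses made-up 2D data that is centered but, for brevity, not standardized. It verifies the defining property (covariance matrix times eigenvector equals eigenvalue times eigenvector) and that each eigenvalue is the variance along its eigenvector.

```python
import numpy as np

# Hypothetical 2D data with correlated features, centered at zero.
rng = np.random.default_rng(1)
x1 = rng.normal(0, 3, size=500)
x2 = 0.5 * x1 + rng.normal(0, 1, size=500)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)

cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort eigenpairs by eigenvalue, largest first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Defining property: cov @ v equals eigenvalue * v for each eigenpair.
v1, lambda1 = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(cov @ v1, lambda1 * v1))  # True

# The eigenvalue equals the variance of the data projected onto its eigenvector.
projected_variance = np.var(X @ v1, ddof=1)
print(np.isclose(projected_variance, lambda1))  # True

# Share of total variance captured by each component.
explained_ratio = eigenvalues / eigenvalues.sum()
print(np.round(explained_ratio, 3))
```

The last array is the quantity scikit-learn reports as `explained_variance_ratio_`.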

Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?
-- PCA and KNN work well together because PCA helps address key weaknesses of KNN — especially in high-dimensional spaces — making KNN more accurate, faster, and less prone to overfitting.

| PCA's Role in Pipeline                | Benefit for KNN                                                                                                                                  |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
|  **Reduces Dimensionality**         | KNN struggles in high dimensions due to the **curse of dimensionality**. PCA reduces the number of features while preserving important patterns. |
|  **Removes Noise / Redundancy**     | KNN treats all features equally. PCA filters out **noisy or irrelevant combinations**, helping KNN focus on more meaningful distances.           |
|  **Improves Speed**                  | KNN is computationally expensive during prediction. Reducing dimensionality via PCA lowers the **cost of each distance calculation**; the number of distance calculations per query stays the same. |
|  **Transforms Correlated Features** | KNN doesn’t handle multicollinearity well. PCA creates **uncorrelated components**, making the distance metric more reliable.                    |
|  **Better Visualization**           | You can visualize the data (e.g., in 2D or 3D after PCA), which helps interpret KNN’s behavior or spot misclassifications.                       |

Real-World Scenario:

Suppose you have a dataset with 100 features.

Many of them are noisy, redundant, or correlated.

PCA reduces it to, say, 10 informative components.

KNN then operates in a cleaner, lower-dimensional space, improving performance.

 Caution:

PCA is unsupervised — it does not consider class labels when reducing dimensions.

So, it's possible that some important discriminative information may be lost.

Use cross-validation to find the right number of PCA components before applying KNN.
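
One way to follow the cross-validation advice is to put scaling, PCA, and KNN in a single scikit-learn `Pipeline` and search over the number of components. This is a sketch on the Wine dataset; the candidate values in `param_grid` are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scaling and PCA live inside the pipeline, so each CV fold refits them
# on its own training part and nothing leaks from the validation part.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical search grid; the candidate values are illustrative only.
param_grid = {
    "pca__n_components": [2, 5, 8, 13],
    "knn__n_neighbors": [3, 5, 7],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Cross-validated accuracy: {search.best_score_:.4f}")
print(f"Held-out test accuracy: {search.score(X_test, y_test):.4f}")
```

Because the labels are used only to score each candidate, PCA itself stays unsupervised; the search simply picks the component count that works best for the downstream classifier.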


In [1]:
# Dataset:
# Use the Wine Dataset from sklearn.datasets.load_wine().
# Question 6: Train a KNN Classifier on the Wine dataset with and without feature
# scaling. Compare model accuracy in both cases.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the wine dataset
data = load_wine()
X, y = data.data, data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# ---- 1. KNN without feature scaling ----
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# ---- 2. KNN with feature scaling ----
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_with_scaling = KNeighborsClassifier(n_neighbors=5)
knn_with_scaling.fit(X_train_scaled, y_train)
y_pred_with_scaling = knn_with_scaling.predict(X_test_scaled)
accuracy_with_scaling = accuracy_score(y_test, y_pred_with_scaling)

print(f"Accuracy without scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy with scaling: {accuracy_with_scaling:.4f}")


Accuracy without scaling: 0.7222
Accuracy with scaling: 0.9444


In [2]:
#Question 7: Train a PCA model on the Wine dataset and print the explained variance
# ratio of each principal component.

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

# Load the wine dataset
data = load_wine()
X = data.data

# Standardize the features (important for PCA!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Print explained variance ratio
explained_variance = pca.explained_variance_ratio_

# Display each component's explained variance
for i, variance in enumerate(explained_variance):
    print(f"Principal Component {i+1}: {variance:.4f}")



Principal Component 1: 0.3620
Principal Component 2: 0.1921
Principal Component 3: 0.1112
Principal Component 4: 0.0707
Principal Component 5: 0.0656
Principal Component 6: 0.0494
Principal Component 7: 0.0424
Principal Component 8: 0.0268
Principal Component 9: 0.0222
Principal Component 10: 0.0193
Principal Component 11: 0.0174
Principal Component 12: 0.0130
Principal Component 13: 0.0080
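
A common follow-up is to accumulate these ratios and see how many components are needed to reach a target share of the variance. The sketch below reuses the values printed above; the thresholds (80%, 90%, 95%) are conventional choices assumed for illustration, and `components_for` is a hypothetical helper.

```python
from itertools import accumulate

# Explained variance ratios copied from the printed output above.
explained_variance = [
    0.3620, 0.1921, 0.1112, 0.0707, 0.0656, 0.0494, 0.0424,
    0.0268, 0.0222, 0.0193, 0.0174, 0.0130, 0.0080,
]

def components_for(threshold, ratios):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    for count, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return count
    return len(ratios)

for threshold in (0.80, 0.90, 0.95):
    needed = components_for(threshold, explained_variance)
    print(f"{threshold:.0%} of variance: {needed} components")
# 80% of variance: 5 components
# 90% of variance: 8 components
# 95% of variance: 10 components
```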


In [3]:
#Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
# components). Compare the accuracy with the original dataset.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


In [4]:
# Question 9: Train a KNN Classifier with different distance metrics (euclidean,
# manhattan) on the scaled Wine dataset and compare the results.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_wine()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Feature scaling (important for distance-based models)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ---- KNN with Euclidean distance (default) ----
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# ---- KNN with Manhattan distance ----
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Print results
print(f"Accuracy with Euclidean distance: {acc_euclidean:.4f}")
print(f"Accuracy with Manhattan distance: {acc_manhattan:.4f}")



Accuracy with Euclidean distance: 0.9444
Accuracy with Manhattan distance: 0.9815


In [5]:
# Question 10: You are working with a high-dimensional gene expression dataset to
# classify patients with different types of cancer.
# Due to the large number of features and a small number of samples, traditional models
# overfit.
# Explain how you would:
# ● Use PCA to reduce dimensionality
# ● Decide how many components to keep
# ● Use KNN for classification post-dimensionality reduction
# ● Evaluate the model
# ● Justify this pipeline to your stakeholders as a robust solution for real-world
#   biomedical data

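from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X is assumed to be the feature matrix left in memory by an earlier cell.
# Standardize data first (very important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)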

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
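
The remaining steps of the question (choosing the number of components, classifying with KNN, and evaluating) could be sketched as follows. No gene expression data is available here, so `make_classification` generates a synthetic stand-in with few samples and many features. The sizes, the 95% variance threshold, and `n_neighbors=5` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features.
# These sizes are assumptions chosen for illustration, not real data.
X_genes, y_genes = make_classification(
    n_samples=120, n_features=2000, n_informative=30,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# A float n_components keeps as many components as needed to explain
# that fraction of the variance (here 95%), refitted inside every fold.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    KNeighborsClassifier(n_neighbors=5),
)

# Stratified folds keep class proportions similar across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_genes, y_genes, cv=cv, scoring="accuracy")
print(f"Accuracy per fold: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Keeping the scaler and PCA inside the pipeline means they are fitted only on each fold's training part, so the cross-validated score is an honest estimate. That property is the main argument for the pipeline when presenting it to stakeholders. With imbalanced cancer types, a metric such as macro-averaged F1 (`scoring="f1_macro"`) may be more informative than accuracy.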

