Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?
Answer: K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression tasks. It is a non-parametric, instance-based ("lazy") learning algorithm: instead of learning explicit parameters during training, it stores the training dataset and makes each prediction directly from the stored points.

How KNN Works (General Process):

Store the training data.

When a new data point (query) needs to be predicted:

Measure the distance between the query point and all training points (commonly using Euclidean distance).

Identify the ‘k’ closest data points (neighbors).

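Use these neighbors to make a prediction: for classification, predict the majority class among the k neighbors; for regression, predict the average (optionally distance-weighted) of the neighbors' target values.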
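
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN; the function name `knn_predict`, its `task` argument, and the toy data are all made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3, task="classification"):
    """Predict for one query point using its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors' labels
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' target values
    return float(np.mean(neighbor_targets))


# Toy data (hypothetical): two small clusters in 2-D
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 48.0])

query = np.array([1.1, 1.0])
predicted_class = knn_predict(X_train, y_class, query, k=3)
predicted_value = knn_predict(X_train, y_value, query, k=3, task="regression")
print(predicted_class, predicted_value)
```

The same neighbor search serves both tasks; only the final aggregation step (vote versus mean) differs.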

Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?
Answer: The Curse of Dimensionality refers to various issues and challenges that arise when analyzing or organizing data in high-dimensional spaces (i.e., when the number of features or dimensions increases).

As dimensions increase:

The volume of the space increases exponentially.

Data points become sparse.

Distances between points become less meaningful or less distinguishable.

Algorithms that rely on distance measures (like KNN) often perform poorly.

How It Affects KNN Performance:

K-Nearest Neighbors (KNN) relies heavily on distance metrics to find the nearest neighbors. In high-dimensional spaces, these distances become unreliable due to:

1. Distance Convergence:

In high dimensions, the gap between the distance to the nearest point and the distance to the farthest point becomes small relative to the distances themselves.

This makes it difficult for KNN to distinguish between close and far points.

As a result, KNN may include irrelevant or distant neighbors in its predictions, reducing accuracy.

2. Increased Noise and Irrelevance:

High-dimensional data often includes many irrelevant or redundant features.

These irrelevant features can distort distance calculations, leading to misleading neighbor selection.

3. Sparsity of Data:

The data becomes sparse, meaning each data point is far from every other point.

KNN requires a dense neighborhood to find good neighbors, but in high dimensions such neighborhoods often don’t exist.

4. Computational Complexity:

As the number of dimensions increases, so does the computational cost of calculating distances for all training samples.

This makes KNN increasingly slow and inefficient.
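
The distance-convergence effect can be observed on random data. The sketch below assumes points drawn uniformly from the unit hypercube (an arbitrary choice for illustration) and measures the relative contrast, (farthest - nearest) / nearest, from one random query point. The helper name `relative_contrast` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:>4}  relative contrast={c:.3f}")
```

As the number of dimensions grows, the contrast shrinks toward zero, which is why the "nearest" neighbors stop being meaningfully nearer than the rest.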

🛠️ How to Mitigate the Curse of Dimensionality in KNN:

Feature Selection:

Choose only the most relevant features using techniques like correlation analysis, recursive feature elimination (RFE), or LASSO.

Dimensionality Reduction:

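Use methods like PCA (Principal Component Analysis) to reduce the number of dimensions while retaining important information. t-SNE also reduces dimensionality, but it is mainly suited to visualization: scikit-learn's implementation cannot transform unseen data, so it is rarely used as a preprocessing step for KNN.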

Distance Metric Tuning:

Consider alternate metrics like Manhattan distance or Mahalanobis distance which might perform better in some high-dimensional cases.

Normalization/Standardization:

Scale all features so that no single feature dominates the distance calculation.
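
Several of these mitigations can be combined in one scikit-learn pipeline. The sketch below is one possible combination on the Wine dataset: standardization, univariate feature selection with `SelectKBest`, and KNN with Manhattan distance. Keeping 5 features and using `f_classif` as the scoring function are arbitrary assumptions for illustration, not tuned choices.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipeline = make_pipeline(
    StandardScaler(),                         # put all features on one scale
    SelectKBest(score_func=f_classif, k=5),   # keep 5 features (assumed value)
    KNeighborsClassifier(n_neighbors=5, metric="manhattan"),
)

# Each step is refit inside every fold, so the selection sees only training data
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.4f}")
```

In practice the number of selected features and the metric would be chosen by cross-validation rather than fixed up front.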

Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?
Answer: Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features in a dataset while retaining as much variability (information) as possible.

It transforms the original features into a new set of uncorrelated variables called principal components, which are ordered by the amount of variance they capture from the data.

| Aspect               | PCA (Dimensionality Reduction)         | Feature Selection                             |
| -------------------- | -------------------------------------- | --------------------------------------------- |
| **What it does**     | Transforms features into new ones      | Selects a subset of original features         |
| **Output features**  | New, uncorrelated features (PCs)       | Original features only                        |
| **Interpretability** | Low – PCs are combinations of features | High – original features are preserved        |
| **Goal**             | Maximize variance retained             | Retain most informative or relevant features  |
| **Technique Type**   | **Feature extraction**                 | **Feature selection**                         |
| **Example methods**  | PCA, t-SNE, LDA                        | Filter (e.g., Chi-squared), Wrapper, Embedded |
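
The contrast in the table can be made concrete on the Wine dataset. This is a small sketch: keeping 3 components or 3 features is an arbitrary choice, and `SelectKBest` with `f_classif` stands in for "a filter method" purely as an example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)

# Feature extraction: each component is a weighted mix of all 13 features
pca = PCA(n_components=3).fit(X_scaled)
X_pca = pca.transform(X_scaled)
print("PCA loadings shape:", pca.components_.shape)

# The extracted features are uncorrelated with each other
corr = np.corrcoef(X_pca, rowvar=False)
off_diagonal = corr[~np.eye(3, dtype=bool)]
print("Largest off-diagonal correlation:", np.abs(off_diagonal).max())

# Feature selection: keep 3 of the original, still-named columns
selector = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, data.target)
selected_names = [
    name for name, keep in zip(data.feature_names, selector.get_support()) if keep
]
print("Selected original features:", selected_names)
```

The PCA output has no feature names of its own, whereas the selector's output can be read directly in terms of the original measurements.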


Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

Answer: What are Eigenvalues and Eigenvectors?

In the context of Principal Component Analysis (PCA):

An eigenvector represents a direction or axis in the feature space.

An eigenvalue represents the amount of variance (or information) captured in that direction.

Think of eigenvectors as directions of maximum variance in your data, and eigenvalues as how important those directions are.

Why Are They Important in PCA?
🔹 1. Finding Principal Components:

PCA computes the covariance matrix of the data.

Then it finds the eigenvectors and eigenvalues of this matrix.

Each eigenvector becomes a principal component direction.

Each eigenvalue tells you how much variance that component captures.

🔹 2. Ordering Importance:

PCA sorts components based on eigenvalues (from largest to smallest).

The top components (with largest eigenvalues) are kept for dimensionality reduction.

🔹 3. Dimensionality Reduction:

By keeping only the top k eigenvectors (those with the highest eigenvalues) and projecting the data onto them, you reduce the number of features while preserving most of the variance.
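
The three steps above can be reproduced by hand with NumPy and compared against scikit-learn. This is a sketch on the standardized Wine dataset; `k = 2` is an arbitrary choice, and scikit-learn's `PCA` uses an SVD internally rather than this explicit eigendecomposition, so the comparison is on the resulting variances only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# 1. Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_scaled, rowvar=False)

# 2. Eigendecomposition; eigh is meant for symmetric matrices and
#    returns eigenvalues in ascending order, so sort them descending
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of total variance captured by each direction
explained_ratio = eigenvalues / eigenvalues.sum()

# 3. Project onto the top-k eigenvectors (k is an assumed value)
k = 2
X_reduced = X_scaled @ eigenvectors[:, :k]

# Cross-check against scikit-learn's PCA
pca = PCA().fit(X_scaled)
print(np.round(explained_ratio[:3], 4))
print(np.allclose(explained_ratio, pca.explained_variance_ratio_))
```

The projected columns can differ from `pca.transform` output by a sign flip per component, since an eigenvector's direction is only defined up to sign.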

Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?

Answer: Using Principal Component Analysis (PCA) before applying K-Nearest Neighbors (KNN) can improve model performance and speed, especially on high-dimensional datasets, although the gain is not guaranteed: discarding too many components can also remove useful signal. The two methods complement each other by solving different challenges:

| Component | Role in Pipeline                                             |
| --------- | ------------------------------------------------------------ |
| PCA       | Reduces dimensions, denoises data                            |
| KNN       | Predicts based on nearest neighbors using improved distances |
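
A single pipeline along these lines might look like the sketch below, using the Wine dataset. The step names and `n_components=5` are assumptions chosen for illustration; in practice the number of components would be tuned.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pca_knn = Pipeline([
    ("scale", StandardScaler()),              # PCA and KNN are both scale-sensitive
    ("pca", PCA(n_components=5)),             # assumed number of components
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# The scaler and PCA are refit on the training part of each fold
scores = cross_val_score(pca_knn, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.4f}")
```

Wrapping the steps in a `Pipeline` keeps the scaling and projection learned from training data only, which matters when the model is evaluated by cross-validation.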


Dataset:

Use the Wine Dataset from sklearn.datasets.load_wine().

Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.

(Include your Python code and output in the code box below.)

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X, y = data.data, data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# --- KNN without feature scaling ---
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# --- KNN with feature scaling ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# --- Print results ---
print(f"Accuracy without scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy with scaling:    {accuracy_scaled:.4f}")


Accuracy without scaling: 0.8056
Accuracy with scaling:    0.9722


Question 7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.

(Include your Python code and output in the code box below.)

In [2]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Wine dataset
data = load_wine()
X = data.data

# Standardize the data before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
pca.fit(X_scaled)

# Print explained variance ratio
explained_variance = pca.explained_variance_ratio_

# Display each principal component's contribution
for i, var_ratio in enumerate(explained_variance):
    print(f"Principal Component {i+1}: {var_ratio:.4f}")


Principal Component 1: 0.3620
Principal Component 2: 0.1921
Principal Component 3: 0.1112
Principal Component 4: 0.0707
Principal Component 5: 0.0656
Principal Component 6: 0.0494
Principal Component 7: 0.0424
Principal Component 8: 0.0268
Principal Component 9: 0.0222
Principal Component 10: 0.0193
Principal Component 11: 0.0174
Principal Component 12: 0.0130
Principal Component 13: 0.0080
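
A common follow-up is to accumulate these ratios and pick the smallest number of components that reaches a target share of variance. The sketch below assumes a 95% threshold, which is a conventional but arbitrary cutoff.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)
pca = PCA().fit(X_scaled)

# Running total of explained variance across components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95  # assumed cutoff
n_components = int(np.searchsorted(cumulative, threshold) + 1)

for i, total in enumerate(cumulative, start=1):
    print(f"First {i:>2} components: {total:.4f}")
print(f"Components needed for {threshold:.0%} variance: {n_components}")
```

Passing a float such as `PCA(n_components=0.95)` asks scikit-learn to make the same kind of selection automatically.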


Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.
(Include your Python code and output in the code box below.)
Answer:

In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X, y = data.data, data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# ---- KNN on original scaled data ----
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
accuracy_original = accuracy_score(y_test, y_pred_original)

# ---- PCA Transformation (top 2 components) ----
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# ---- KNN on PCA-reduced data ----
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

# ---- Print results ----
print(f"Accuracy on original dataset (scaled): {accuracy_original:.4f}")
print(f"Accuracy on PCA-reduced dataset (2 components): {accuracy_pca:.4f}")


Accuracy on original dataset (scaled): 0.9722
Accuracy on PCA-reduced dataset (2 components): 0.9167


Question 9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.
(Include your Python code and output in the code box below.)


In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X, y = data.data, data.target

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the dataset
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ---- KNN with Euclidean distance (default) ----
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# ---- KNN with Manhattan distance ----
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

# ---- Print Results ----
print(f"Accuracy with Euclidean distance: {accuracy_euclidean:.4f}")
print(f"Accuracy with Manhattan distance: {accuracy_manhattan:.4f}")


Accuracy with Euclidean distance: 0.9722
Accuracy with Manhattan distance: 1.0000


Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data
(Include your Python code and output in the code box below.)


In [5]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report

# Simulate high-dimensional gene expression data
X, y = make_classification(n_samples=100, n_features=1000, n_informative=50,
                           n_redundant=950, n_classes=3, random_state=42)

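# Standardize the data
# (Note: the scaler and PCA are fit on all samples before cross-validation,
# so the CV folds are not fully independent of the preprocessing.)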
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (retain 95% of variance)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(f"Original features: {X.shape[1]}")
print(f"PCA-reduced features: {X_pca.shape[1]}")

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)

# Evaluate using Stratified K-Fold Cross-Validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(knn, X_pca, y, cv=cv, scoring='accuracy')

# Print cross-validation results
print(f"\nCross-validated Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")

# Refit on all data and report training-set performance (optional).
# These numbers are optimistic and are not an estimate of generalization.
knn.fit(X_pca, y)
y_pred = knn.predict(X_pca)
print("\nClassification Report on Full Data:")
print(classification_report(y, y_pred))


Original features: 1000
PCA-reduced features: 38

Cross-validated Accuracy: 0.5600 ± 0.0970

Classification Report on Full Data:
              precision    recall  f1-score   support

           0       0.85      0.65      0.73        34
           1       0.70      0.91      0.79        33
           2       0.81      0.76      0.78        33

    accuracy                           0.77       100
   macro avg       0.78      0.77      0.77       100
weighted avg       0.78      0.77      0.77       100
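
One way to make the cross-validated estimate stricter is to place the scaler and PCA inside a `Pipeline`, so that they are refit on the training portion of each fold. The sketch below reuses the same simulated data and settings as the cell above; the step names are assumptions, and no output is shown because the scores will differ from the run above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same simulated "gene expression" data as in the cell above
X, y = make_classification(n_samples=100, n_features=1000, n_informative=50,
                           n_redundant=950, n_classes=3, random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),   # keep 95% of the variance in each fold
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Preprocessing is learned from the training folds only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
print(f"Cross-validated Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
```

With so few samples per fold, the spread across folds is as informative as the mean, so both are worth reporting to stakeholders.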

