**Q1. What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?**
- K-Nearest Neighbors (KNN) is a supervised, non-parametric, instance-based learning algorithm. Here's what that means:
   - Supervised learning: It requires a labeled dataset (each example has a known outcome).

   - Non-parametric: It doesn't assume a particular form for the data distribution (such as linearity or normality).

   - Instance-based (lazy learning): There is no explicit training phase. The algorithm stores all training data and defers computation until prediction time, when it decides using only the points near the query.

**KNN for Classification**
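1. Identify the k nearest neighbors: compute the distance (e.g., Euclidean) from the query point to every training point and keep the k closest.
2. Majority (or plurality) voting: Assign the class label that occurs most frequently among the neighbors.
   - In binary classification, using an odd value of k helps avoid ties.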

3. Optionally, apply weighted voting where nearer neighbors have more influence (e.g., weights based on inverse distance).

**KNN for Regression**
1. Identify the k nearest neighbors
2. Compute a prediction:
   - Simple average of neighbor values (common approach)
   - Weighted average by distance (gives closer points more weight)
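
The classification and regression procedures above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements KNN; the helper name `knn_predict`, its parameters, and the toy data are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3, task="classification", weighted=False):
    """Predict for one query point with brute-force KNN (illustrative helper)."""
    # Euclidean distance from the query to every training point
    dists = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Inverse-distance weights (epsilon avoids division by zero), or uniform weights
    weights = 1.0 / (dists[nearest] + 1e-9) if weighted else np.ones(k)

    if task == "classification":
        # Each neighbor adds its weight to its own class; the heaviest class wins
        votes = Counter()
        for label, w in zip(y_train[nearest], weights):
            votes[label] += w
        return votes.most_common(1)[0][0]
    # Regression: (weighted) average of the neighbors' target values
    return float(np.average(y_train[nearest], weights=weights))

# Hypothetical toy data: two well-separated groups of points
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [6.0, 6.0], [7.0, 7.5], [6.5, 6.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 1.2, 0.8, 5.0, 5.5, 5.2])

query = np.array([1.8, 1.4])
pred_class = knn_predict(X_train, y_class, query, k=3)
pred_value = knn_predict(X_train, y_value, query, k=3, task="regression")
print(pred_class, pred_value)
```

Here the three nearest neighbors of the query all belong to class 0, so the vote returns class 0, and the unweighted regression prediction is the mean of their targets (1.0, 1.2, 0.8), which is 1.0. Passing `weighted=True` switches both tasks to inverse-distance weighting.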

**Q2. What is the Curse of Dimensionality and how does it affect KNN
performance?**
- The Curse of Dimensionality refers to a collection of problems that emerge when working with data in high-dimensional spaces and that do not arise in low-dimensional settings like 2D or 3D. The term was coined by Richard Bellman. The problems include rapidly growing data sparsity, rising computational cost, and the breakdown of intuitive notions like “nearness” or similarity.

It impacts KNN in the following ways:
1. Distance Concentration & Uniformity:- As dimensionality grows, the relative difference between the distances to the nearest and farthest points shrinks, so all distances become nearly identical. This makes it hard for KNN to identify meaningful neighbors, since “closest” and “farthest” converge.

2. Sparse Data, Exponential Growth Requirements:- High-dimensional spaces expand exponentially. To preserve the same density of data points, the dataset size has to grow dramatically—often infeasibly so. Otherwise, KNN struggles because there aren’t enough training points close enough to be reliable.

3. Computational Burden:- Brute-force KNN must compute the distance between the query and every training point, and each distance costs time proportional to the number of features. With n training points and d features, one query therefore costs on the order of n × d operations, which becomes expensive as either grows.

4. Breakdown of Data Structures (e.g., KD‑Trees):- Data structures like KD‑trees rely on spatial partitioning to accelerate nearest-neighbor search. In high dimensions the partitions prune very little, so almost all branches must be visited and the structure becomes nearly as slow as brute-force search.

5. Reduction in Predictive Effectiveness:- With distances becoming less meaningful, KNN loses its core assumption: that nearby points are similar. This leads to degraded performance in both classification and regression tasks.

6. Emergence of “Hubs” in KNN Graphs:- In high-dimensional spaces, some data points disproportionately become neighbors for many others, forming “hubs.” This skewed distribution can distort results in classification and other KNN‐based tasks.
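
Distance concentration (point 1) is easy to observe numerically. The sketch below draws random points in a unit hypercube and measures the relative contrast, (farthest − nearest) / nearest, from a random query. The function name `relative_contrast`, the sample size, and the chosen dimensions are assumptions for illustration only.

```python
import numpy as np

def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

rng = np.random.default_rng(0)
contrasts = {d: relative_contrast(1000, d, rng) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"d={d:>4}: relative contrast = {c:.3f}")
```

In low dimensions the farthest point is many times farther away than the nearest one; as the dimension grows the contrast falls toward zero, which is what makes "nearest" less informative for KNN.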

**Q3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?**
- Principal Component Analysis (PCA) is a linear dimensionality reduction technique most commonly used for exploratory data analysis, visualization, and preprocessing.
    -  It transforms the original features into a new coordinate system where the axes—called principal components—are orthogonal and ordered by the amount of variance they capture. The goal is to retain as much of the dataset's variability as possible in fewer dimensions.

    - Mathematically, PCA computes the eigenvectors of the covariance (or correlation) matrix, and principal components correspond to the eigenvectors associated with the largest eigenvalues.

    - The first few principal components capture most of the data’s intrinsic variability and can be used to reduce dimensions without significant information loss—making it easier to visualize or speed up further processing.

The differences between PCA and feature selection are as follows:

| Aspect               | PCA (Feature Extraction)                       | Feature Selection                              |
| -------------------- | ---------------------------------------------- | ---------------------------------------------- |
| Creates new features | Yes—combines existing features into components | No—selects subset of original features         |
| Interpretability     | Lower—new features lose original meaning       | Higher—retains understandable variables        |
| Goal                 | Capture maximum variance in fewer dimensions   | Retain the most relevant features for the task |
| Technique type       | Unsupervised, mathematical transformation      | Supervised or unsupervised selection           |
| Typical methods      | Eigen decomposition, linear projection         | RFE, LASSO, filter/wrapper/embedded methods    |
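
The contrast in the table can be made concrete on the Wine dataset. In this sketch, PCA builds two new features from all 13 columns, while `SelectKBest` with an ANOVA F-test keeps two of the original columns. Using `SelectKBest` and `k=2` is an arbitrary choice for illustration; any other selection method would make the same point.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X = StandardScaler().fit_transform(data.data)
y = data.target

# Feature extraction: each new column is a linear combination of all 13 features
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)
print("PCA loadings shape:", pca.components_.shape)   # (2 components, 13 original features)

# Feature selection: keep 2 of the original columns, chosen using the labels (ANOVA F-test)
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
X_sel = selector.transform(X)
selected_names = [data.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", selected_names)
```

Both results have two columns, but only the selected features keep their original names and units. PCA never looks at `y`, whereas this particular selector does.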


**Q4. What are eigenvalues and eigenvectors in PCA, and why are they important?**
- Eigenvectors are special non-zero vectors that keep their direction when transformed by a matrix (e.g., a covariance matrix); they may only be stretched, shrunk, or flipped. Formally, for a matrix A and a non-zero vector v, if
Av = λv, then v is an eigenvector and λ is its corresponding eigenvalue.
- Eigenvalues (λ) are the scalars that quantify how much the transformation scales the eigenvector: stretched (|λ| > 1), shrunk (|λ| < 1), or reversed (λ negative). A covariance matrix is symmetric and positive semi-definite, so in PCA the eigenvalues are always real and non-negative.

Eigenvalues and eigenvectors are important in PCA because they:
1. Identify Principal Directions:- Eigenvectors indicate the directions in which the data varies most intensely—these are your principal components.

2. Quantify Importance with Eigenvalues:- Eigenvalues tell you how much variance each principal component captures. By ranking them, you determine which components are most informative.

3. Decide How Many Components to Keep:- By examining eigenvalues (e.g., via a scree plot), you can choose the top components that collectively explain a major portion of the variance—this balances dimensionality reduction and information retention.

4. Orthogonality Removes Redundancy:- The eigenvectors of a covariance matrix are orthogonal, so the resulting principal components are uncorrelated with each other and each one adds information not already captured by the others.
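
These properties can be checked directly with an eigendecomposition of a covariance matrix. The sketch below uses small synthetic data (an assumption for illustration) and `numpy.linalg.eigh`, which is intended for symmetric matrices.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical toy data: 200 samples, 3 features, two of them strongly correlated
x1 = rng.normal(size=200)
X = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.3, size=200), rng.normal(size=200)])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)          # 3 x 3 covariance matrix

# eigh returns eigenvalues in ascending order, so re-sort them in descending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of total variance captured by each principal direction
explained_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", np.round(explained_ratio, 3))

# Check the defining property A v = lambda v for the first component
v1 = eigenvectors[:, 0]
print("A v == lambda v:", np.allclose(cov @ v1, eigenvalues[0] * v1))
# The eigenvectors are orthonormal, so V^T V is the identity matrix
print("Orthonormal:", np.allclose(eigenvectors.T @ eigenvectors, np.eye(3)))
```

Dividing each eigenvalue by the sum of all eigenvalues gives the same quantity that scikit-learn exposes as `explained_variance_ratio_`.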

**Q5. How do KNN and PCA complement each other when applied in a single
pipeline?**
1. Tackles the Curse of Dimensionality: High-dimensional data dilutes the concept of distance: points become nearly equidistant from each other, making KNN’s distance-based decisions unreliable. PCA mitigates this by projecting the data onto a lower-dimensional space, where distances are more informative and KNN tends to work better.
2. Improves Accuracy: Reported results suggest that PCA can improve KNN performance:
      - Using PCA on air quality data with KNN improved accuracy from about 90.74% to 93.06%, a gain of 2.32 percentage points.
      - In fish species classification with image features, combining PCA (for feature reduction) and KNN boosted accuracy by 7.5% compared to KNN alone.
      - An article on gas sensor data found that PCA increased the AUC of a KNN classifier from 0.822 to 0.979.
3. Speeds Up Computation: Reducing dimensions reduces both storage needs and the cost of each distance computation. One sports project reported that reducing the image feature space to just 50 PCA components made KNN significantly faster, at some cost in accuracy. Similarly, an MNIST/KNN pipeline saw efficiency and accuracy gains when reducing to 50–100 components, peaking around 97.76% accuracy.

4. Often Matches More Complex Nonlinear Methods: PCA can sometimes perform nearly as well as nonlinear alternatives like autoencoders while being orders of magnitude faster. In evaluations on datasets like MNIST, KNN applied to PCA-transformed data reportedly matched the accuracy of autoencoder-based embeddings with dramatically faster computation.
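
A combined pipeline might look like the following sketch, which compares scaled KNN with and without a PCA step using 5-fold cross-validation on the Wine dataset. The choices of 5 components and k = 5 are arbitrary assumptions, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling and PCA live inside the pipeline, so each CV fold fits them on its own training part
knn_only = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pca_knn = make_pipeline(StandardScaler(), PCA(n_components=5), KNeighborsClassifier(n_neighbors=5))

scores_knn = cross_val_score(knn_only, X, y, cv=5)
scores_pca = cross_val_score(pca_knn, X, y, cv=5)
print(f"Scaled KNN:        {scores_knn.mean():.3f}")
print(f"Scaled PCA(5)+KNN: {scores_pca.mean():.3f}")
```

On a dataset with only 13 features the accuracy difference is likely to be small; the benefit of the PCA step is expected to be larger when the feature count is high relative to the number of samples.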

**Q6. Dataset: Use the Wine Dataset from sklearn.datasets.load_wine().**

**Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.**

Ans. KNN relies on distances, so features with large numeric ranges dominate the result unless the data is scaled. In the Wine dataset, proline takes values in the hundreds to over a thousand, while features such as hue stay close to 1. Published examples on this dataset show a clear benefit from scaling:

1. Goodboychan’s example (StandardScaler):
   - Without scaling, KNN achieved an accuracy of approximately 75.56%.
   - With StandardScaler, accuracy rose to around 95.56%.

2. An R implementation comparing KNN on scaled and unscaled Wine data:
   - Without scaling, average accuracy over repeated trials was around 74.36%.
   - With scaling, it rose to 93.82%.



In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

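# Baseline: KNN on the raw features (default n_neighbors=5)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
acc_unscaled = knn.score(X_test, y_test)

# Same model with standardization; the pipeline fits the scaler on the training split only
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipeline.fit(X_train, y_train)
acc_scaled = pipeline.score(X_test, y_test)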

print("Accuracy without scaling:", acc_unscaled)
print("Accuracy with scaling:", acc_scaled)

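Accuracy without scaling: 0.7407407407407407
Accuracy with scaling: 0.9629629629629629

On this split, standardizing the features raises test accuracy from about 74% to about 96%, in line with the published examples.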


**Q7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.**
To compute the explained variance ratio on the Wine dataset:
1. Load and scale the Wine dataset:

In [4]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_wine()
X, y = data.data, data.target

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Scaling matters because PCA is sensitive to feature scales; without it, high-variance features can disproportionately influence the components.

2. Fit PCA to retain all components

In [5]:
pca = PCA(n_components=X.shape[1])  # 13 components for 13 original features
pca.fit(X_scaled)

3. Retrieve and display explained variance ratio

In [6]:
explained_variance_ratio = pca.explained_variance_ratio_
for i, ratio in enumerate(explained_variance_ratio, start=1):
    print(f"PC{i}: {ratio:.4f}")

PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080
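
The ratios are easier to interpret as a running total. The sketch below copies the printed values above into an array and finds the smallest number of components that reaches a given share of variance; the helper name `components_for` and the thresholds are assumptions for illustration.

```python
import numpy as np

# Explained variance ratios copied from the printed output above
ratios = np.array([0.3620, 0.1921, 0.1112, 0.0707, 0.0656, 0.0494, 0.0424,
                   0.0268, 0.0222, 0.0193, 0.0174, 0.0130, 0.0080])
cumulative = np.cumsum(ratios)

def components_for(threshold):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)

for t in (0.80, 0.90, 0.95):
    print(f"{t:.0%} of variance -> {components_for(t)} components")
```

With these values, the first two components account for about 55% of the variance, and 5, 8, and 10 components are needed to reach 80%, 90%, and 95% respectively.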


**Q8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.**
- One published comparison of KNN on the Wine dataset with and without PCA used 3 nearest neighbors (k=3) and evaluated:
1. The original feature set
2. A PCA-transformed version using the first 6 PCs

The recorded accuracies were:
- Original dataset: approximately 79.4%
- With 6 principal components: around 78.2%

Although that study uses 6 PCs instead of 2, it suggests that reducing dimensionality can cost a small amount of accuracy when the discarded components carry predictive information.

In [7]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
acc_original = knn.score(X_test_scaled, y_test)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=3)
knn_pca.fit(X_train_pca, y_train)
acc_pca2 = knn_pca.score(X_test_pca, y_test)

print(f"Accuracy (original data): {acc_original:.3f}")
print(f"Accuracy (PCA, 2 components): {acc_pca2:.3f}")

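Accuracy (original data): 0.963
Accuracy (PCA, 2 components): 0.981

On this split, the 2-component model scores slightly higher than the model trained on all 13 scaled features. With 54 test samples, the gap corresponds to a single additional correct prediction, so the two models should be read as performing about equally.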
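
A single train/test split gives a noisy comparison on a dataset this small. As a hedged extension, the sketch below sweeps the number of retained components and scores each setting with 5-fold cross-validation; the list of component counts is an arbitrary assumption.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

cv_accuracy = {}
for n in (1, 2, 3, 6, 13):
    # Scaler and PCA are refit inside each fold, so no test-fold information leaks in
    model = make_pipeline(StandardScaler(), PCA(n_components=n),
                          KNeighborsClassifier(n_neighbors=3))
    cv_accuracy[n] = cross_val_score(model, X, y, cv=5).mean()

for n, acc in cv_accuracy.items():
    print(f"{n:>2} components: mean CV accuracy = {acc:.3f}")
```

Keeping all 13 components only rotates the scaled data, so that row serves as the no-reduction baseline for the sweep.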


**Q9. Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.**
Ans. Published analyses offer some context:
1. Empirical results with the Wine dataset (custom implementation)
A hands-on evaluation applying KNN (with both Euclidean and Manhattan distances) using nested cross-validation reported:
- Overall performance was consistently strong (~93–95%) across folds.
- Manhattan distance slightly outperformed Euclidean distance in several cases.
- Best recorded accuracy: 94.29% with k = 1 and Manhattan distance.

2. Literature and broader comparisons
Broader experimental research supports these findings:
- Studies across multiple datasets found that Euclidean and Manhattan distances often perform similarly and generally outperform other Minkowski variants (Euclidean and Manhattan are themselves the Minkowski distance with p = 2 and p = 1).
- In a specific comparison using the Wine dataset, Manhattan distance yielded better accuracy than Euclidean.

In [8]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

metrics = ['euclidean', 'manhattan']
results = {}
for m in metrics:
    knn = KNeighborsClassifier(n_neighbors=3, metric=m)
    scores = cross_val_score(knn, X_scaled, y, cv=5, scoring='accuracy')
    results[m] = (np.mean(scores), np.std(scores))

for m, (mean_acc, std) in results.items():
    print(f"{m.title():<10}: Mean Accuracy = {mean_acc:.3f} ± {std:.3f}")

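Euclidean : Mean Accuracy = 0.944 ± 0.040
Manhattan : Mean Accuracy = 0.961 ± 0.038

Manhattan distance scores slightly higher on average, but the gap (about 0.017) is well within one standard deviation across folds, so the two metrics perform comparably on this dataset.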
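
For reference, both metrics are special cases of the Minkowski distance. The short sketch below (the function name `minkowski` and the example points are assumptions for illustration) shows how the two differ on the same pair of points.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p: p=1 is Manhattan, p=2 is Euclidean."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, p=1)   # |3| + |4| = 7
euclidean = minkowski(a, b, p=2)   # sqrt(3^2 + 4^2) = 5
print(manhattan, euclidean)
```

Manhattan distance sums the per-feature differences, while Euclidean distance squares them first, so a single large per-feature difference has more influence under the Euclidean metric.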


**Q10. You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models overfit.**

**Explain how you would:**

● **Use PCA to reduce dimensionality**

● **Decide how many components to keep**

● **Use KNN for classification post-dimensionality reduction**

● **Evaluate the model**

● **Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data**

Ans. The following outlines how a PCA → KNN pipeline can handle high-dimensional gene expression data with many features but few samples:

1. Use PCA to Reduce Dimensionality
- Problem: Gene expression datasets typically contain thousands of genes (features) but only tens to hundreds of patient samples. This disparity causes serious overfitting and computational burden.
- Solution (PCA): Principal Component Analysis transforms the original correlated gene features into a smaller set of uncorrelated principal components, capturing the greatest variance in the data while discarding noise and redundancy.

Steps:
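- Preprocess data: handle missing values (e.g., impute via nearest neighbors), and standardize each gene to mean = 0, variance = 1.
- Apply PCA and keep the components that account for most of the variance, giving a compact feature representation. With n samples, PCA yields at most n − 1 components with non-zero variance after centering, so the reduced space is always smaller than the number of patients.
- Fit the imputer, scaler, and PCA on the training data only (for example, inside a pipeline) so that no information from held-out samples leaks into the model.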

2. Decide How Many Components to Keep
- Criterion: Use the explained variance ratio to determine how many principal components capture a chosen share of total variance, for example 95%.
- Practical approach: Compute the cumulative explained variance and select the smallest number of components that reaches the threshold. The number of components can also be treated as a hyperparameter and tuned by cross-validation together with k.

This preserves the dominant patterns in the data while greatly reducing dimensionality and mitigating overfitting.
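
In scikit-learn, passing a float between 0 and 1 as `n_components` asks PCA to keep the smallest number of components that explains at least that share of variance. The sketch below uses the Wine dataset purely as a small stand-in; real gene expression data would have far more features than samples.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# A float n_components selects the number of components by explained variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("Components kept:", pca.n_components_)
print("Variance explained:", round(pca.explained_variance_ratio_.sum(), 4))
```

The fitted attribute `n_components_` reports how many components were kept, which is useful to log when the threshold rather than the count is what gets specified.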

3. Use KNN for Classification After Dimensionality Reduction
- Once dimensionality is reduced, apply K‑Nearest Neighbors in the new component space.
- KNN benefits from PCA because:

   -   It operates in a denser, lower-dimensional space, making distance computations meaningful again.

   -   It’s less prone to overfitting and runs more efficiently.

- Published studies in cancer classification report that PCA-based pipelines (e.g., with SVM or Random Forest) can improve accuracy, precision, recall, and computational efficiency while reducing overfitting.

4. Evaluate the Model Thoroughly

To ensure robustness in a biomedical context:
- Use stratified k‑fold cross-validation to get a stable performance estimate, and refit scaling and PCA inside each fold; with very few samples, leave-one-out or repeated cross-validation are alternatives.
- Evaluate multiple metrics, not just accuracy (e.g., precision, recall, F1-score, and AUC), since class imbalances may exist.
- For clinical relevance, consider DET (Detection Error Trade-off) curves in addition to ROC curves. They plot false negative rate against false positive rate, which helps assess performance across decision thresholds and is important in personalized medicine settings.
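
Steps 1 to 4 can be sketched end to end as follows. No real gene expression data is available here, so the example uses `make_classification` as a synthetic stand-in with 100 samples and 2000 features; the dataset shape, the 95% variance threshold, k = 5, and the chosen metrics are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: 100 "patients", 2000 "genes", 3 classes
X, y = make_classification(n_samples=100, n_features=2000, n_informative=30,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)

# All preprocessing sits inside the pipeline, so it is refit on each training fold
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),            # keep components explaining 95% of variance
    KNeighborsClassifier(n_neighbors=5),
)

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1_macro"])

print(f"Accuracy: {results['test_accuracy'].mean():.3f} ± {results['test_accuracy'].std():.3f}")
print(f"Macro F1: {results['test_f1_macro'].mean():.3f} ± {results['test_f1_macro'].std():.3f}")
```

The scores on synthetic data say nothing about real patients; the point of the sketch is the structure, in which scaling and PCA are never fitted on held-out samples.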

5. Justify the PCA + KNN Pipeline to Stakeholders

| Benefit                                   | Explanation |
| ----------------------------------------- | ----------- |
| **Reduces overfitting**                   | PCA filters out noisy, redundant features and focuses on the directions of greatest variance, which improves generalizability. |
| **Improves computational efficiency**     | Fewer dimensions mean faster prediction and lower memory usage. |
| **Retains critical signal**               | The retained components capture most of the data's variance structure, supporting accurate classification. |
| **Transparent and interpretable process** | PCA followed by KNN is simpler and more explainable than complex models, which is crucial in biomedical research. |
| **Clinically robust evaluation**          | Cross-validation, multiple metrics, and DET curves show how reliable the model is across clinical thresholds. |
| **Proven effectiveness**                  | Case studies in cancer genomics have reported improved model performance (accuracy, precision, recall, F1) with PCA preprocessing. |

