#### **Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?**

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm that is most often used for classification but also works for regression. It finds the "k" closest data points (neighbors) to a given input and makes a prediction based on the majority class (for classification) or the average value (for regression). Because KNN makes no assumptions about the underlying data distribution, it is a non-parametric, instance-based learning method.
KNN is also called a lazy learner because it does not build a model from the training set up front. Instead, it stores the entire dataset and performs the computation only at prediction time.

**Working of KNN algorithm**

The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it predicts the label or value of a new data point from the labels or values of its K nearest neighbors in the training dataset.

**Step 1: Selecting the optimal value of K**
- K is the number of nearest neighbors considered when making a prediction. It is usually chosen by comparing validation accuracy across several candidate values.

**Step 2: Calculating distance**
- Similarity between the target point and the training points is measured with a distance metric, most commonly Euclidean distance. The distance from the target point to every point in the training dataset is calculated.

**Step 3: Finding Nearest Neighbors**
- The k data points with the smallest distances to the target point are its nearest neighbors.

**Step 4: Voting for Classification or Averaging for Regression (Making the Prediction)**

- When you want to classify a data point into a category like spam or not spam, the KNN algorithm looks at the K closest points in the dataset. These closest points are called neighbors. The algorithm then checks which category each neighbor belongs to and picks the one that appears most often. This is called majority voting.

- In regression, the algorithm still looks for the K closest points, but instead of voting for a class it takes the average of the values of those K neighbors. This average is the predicted value for the new point.

**KNN for Classification**

- Suppose we want to predict whether a fruit is an apple or an orange.

- We look at the K nearest fruits.

- If most neighbors are apples, we classify the new fruit as an apple.

- Example: If K=5 and among the 5 neighbors → 3 are apples, 2 are oranges → prediction = apple.

**KNN for Regression**

- Suppose we want to predict the price of a house.

- We look at the K nearest houses (based on features like size, location, etc.).

- The prediction is usually the average price of those K houses.

- Example: If K=3 and house prices of neighbors = [50L, 55L, 60L] → prediction = (50+55+60)/3 = 55L.
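
The steps and the two examples above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the fruit and house feature values are hypothetical, chosen only so that the neighbors match the examples (3 apples vs 2 oranges, and prices 50, 55, 60), and the features are left unscaled for brevity.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to `query` (Euclidean)."""
    distances = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, query, k):
    """Classification: majority vote among the k nearest labels."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k):
    """Regression: mean of the k nearest target values."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / k

# Hypothetical fruit data: (weight in grams, colour score 0 = green .. 10 = orange)
fruit_X = [(150, 2), (160, 3), (155, 1), (140, 8), (145, 9), (300, 9), (310, 8)]
fruit_y = ["apple", "apple", "apple", "orange", "orange", "orange", "orange"]
fruit_label = knn_classify(fruit_X, fruit_y, query=(150, 4), k=5)

# Hypothetical house data: (size in sq ft, km to city centre), prices in lakhs
house_X = [(1000, 5), (1100, 6), (1200, 4), (2500, 2), (3000, 1)]
house_y = [50, 55, 60, 120, 150]
house_price = knn_regress(house_X, house_y, query=(1100, 5), k=3)

print(fruit_label)   # 3 of the 5 nearest fruits are apples
print(house_price)   # mean of 50, 55 and 60
```

In practice the features would be scaled first, since here the weight and size columns dominate the distance.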

**Advantages of KNN**
- Simple to use: Easy to understand and implement.
- No training step: The algorithm simply stores the data and uses it at prediction time.
- Few parameters: Only the number of neighbors (k) and a distance metric need to be set.
- Versatile: Works for both classification and regression problems.


---

#### **Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?**

**What is the Curse of Dimensionality?**
- The Curse of Dimensionality refers to various phenomena that arise when dealing with high-dimensional data.
- As the number of features or dimensions increases, the volume of the feature space grows exponentially, leading to sparsity in the data distribution.
- This sparsity can result in several challenges such as increased computational complexity, overfitting, and deteriorating performance of certain algorithms.

**How Does Dimensionality Affect KNN Performance?**

The effect of dimensionality on KNN (K-Nearest Neighbors) is a well-known issue in machine learning. It shows up in several ways:

- **Increased Sparsity:** As the number of dimensions increases, the volume of the space grows exponentially. Consequently, the available data becomes sparser, meaning that data points are spread farther apart from each other. This sparsity can lead to difficulties in finding meaningful nearest neighbors, as there may be fewer neighboring points within a given distance.

- **Equal Distances:** In high-dimensional spaces, distance becomes less informative. As the number of dimensions increases, the distances from a point to its nearest and farthest neighbors tend to converge, so all points look almost equally far away. This happens because a distance sums contributions from many dimensions, and the influence of any single dimension shrinks as their number grows.

- **Degraded Performance:** KNN relies on the assumption that nearby points in the feature space are likely to have similar labels. However, in high-dimensional spaces, this assumption may no longer hold true due to the increased sparsity and equalization of distances. As a result, KNN may struggle to accurately classify data points, leading to degraded performance.

- **Increased Computational Complexity:** With higher dimensionality, the computational cost of KNN increases significantly. The algorithm needs to compute distances in a high-dimensional space, which involves more calculations. This can make the KNN algorithm slower and less efficient, especially when dealing with large datasets.

The curse of dimensionality makes distances far less informative in high dimensions, which hurts KNN performance. That is why KNN works best on low- to moderate-dimensional data, often after feature selection or dimensionality reduction.
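
The shrinking gap between near and far neighbors can be observed with a small experiment. The sketch below assumes uniformly random points in the unit hypercube (an assumption made for illustration, real data will behave differently in detail) and measures the relative contrast, `(farthest - nearest) / nearest`, from one random query point.

```python
import math
import random

def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)  # fixed seed so the run is repeatable
contrasts = {d: relative_contrast(500, d, rng) for d in (2, 10, 100, 1000)}

for d, c in contrasts.items():
    print(f"dims={d:>4}  relative contrast={c:.2f}")
```

A contrast close to zero means the nearest and farthest points are almost equally far away, which is exactly the situation in which a nearest-neighbor vote carries little information.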

---

#### **Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?**

Principal Component Analysis (PCA) is a dimensionality reduction technique used in data analysis and machine learning. It reduces the number of features in a dataset while keeping the most important information. PCA transforms the original features into new features that are uncorrelated with each other, and the first few of them retain most of the variance found in the original data.

In short, PCA transforms high-dimensional data into a lower-dimensional space by finding the directions (principal components) that capture the maximum variance in the data.

**How Principal Component Analysis Works**

PCA uses linear algebra to transform data into new features called principal components. It finds them by computing the eigenvectors (directions) and eigenvalues (variance along each direction) of the data's covariance matrix. PCA then selects the components with the largest eigenvalues and projects the data onto them to simplify the dataset.
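
The procedure described above can be written out with NumPy as a rough sketch. The data here is synthetic (three features, two of them strongly correlated) and is only an assumption for illustration; standardization is skipped for brevity, and a library implementation such as scikit-learn's `PCA` would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 3 features, the first two strongly correlated
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# 1. Centre the data (PCA works on deviations from the mean)
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition (eigh is intended for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort eigenpairs by eigenvalue, largest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Keep the top 2 components and project the data onto them
n_components = 2
components = eigenvectors[:, :n_components]
X_reduced = X_centered @ components

explained_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", np.round(explained_ratio, 4))
print("Reduced shape:", X_reduced.shape)
```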

**PCA vs Feature Selection**

Principal Component Analysis (PCA) and feature selection are both dimensionality reduction techniques but they work differently. PCA is a feature extraction method that transforms the original correlated features into a new set of uncorrelated features called principal components, which are linear combinations of the original features and are ordered by how much variance they capture. This makes PCA powerful for removing redundancy, but the new features are harder to interpret. In contrast, feature selection is a feature elimination method that simply selects the most relevant original features and discards the less useful or redundant ones. Unlike PCA, feature selection keeps the features in their original form, which makes them easier to understand and directly usable in analysis.
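
The difference can be seen side by side on the Wine dataset. In this sketch, `SelectKBest` with the ANOVA F-score (`f_classif`) stands in for feature selection; it is just one possible selection criterion, chosen here as an assumption for the example.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# Feature extraction: each new column is a weighted mix of all 13 original features
pca = PCA(n_components=2).fit(X_scaled)
X_pca = pca.transform(X_scaled)

# Feature selection: keep 2 of the original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, wine.target)
X_selected = selector.transform(X_scaled)
kept_idx = selector.get_support(indices=True)
kept_names = [wine.feature_names[i] for i in kept_idx]

print("PCA weights shape:", pca.components_.shape)  # one row of 13 weights per component
print("Selected original features:", kept_names)
```

The PCA output has no direct feature names, only weights over all inputs, while the selected columns can still be read by their original names.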

---

#### **Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?**

In Principal Component Analysis (PCA), eigenvalues are scalar values that represent the amount of variance captured by each principal component. They are derived from the covariance matrix of the dataset and are used to determine the significance of each component.

Eigenvectors are the vectors indicating the direction of the axes along which the data varies the most. Each eigenvector has a corresponding eigenvalue, quantifying the amount of variance captured along its direction.


**Why They're Important**

Eigenvalues and eigenvectors are crucial for dimensionality reduction and data visualization in PCA.

- **Eigenvectors Determine Direction:** The eigenvectors define the directions of maximum variance in the data. The eigenvector with the highest eigenvalue is the first principal component, representing the most significant dimension of the data.

- **Eigenvalues Determine Importance:** The magnitude of an eigenvalue indicates the amount of variance captured by its corresponding eigenvector. Larger eigenvalues mean more variance is captured, while smaller ones often correspond to noise or less significant directions in the data. This allows you to select only the most important components (those with the largest eigenvalues) and discard the rest, reducing the dimensionality of the dataset without losing much of the important information.
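
A tiny numeric check makes these roles concrete. The 2x2 covariance matrix below is hypothetical (two positively correlated features with equal variance); the sketch verifies the defining property `cov @ v == lambda * v` and converts the eigenvalues into variance shares.

```python
import numpy as np

# Hypothetical covariance matrix of two positively correlated features
cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigenvalues)[::-1]            # reorder: largest first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Defining property of an eigenpair: cov @ v equals lambda * v
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(cov @ v, lam * v)

variance_share = eigenvalues / eigenvalues.sum()
print("Eigenvalues:", eigenvalues)              # 3 and 1 for this matrix
print("First direction:", eigenvectors[:, 0])   # along the diagonal, sign is arbitrary
print("Variance share:", variance_share)        # 0.75 and 0.25
```

Here the first principal component lies along the diagonal where both features rise together, and it accounts for three quarters of the total variance.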

---

#### **Question 5: How do KNN and PCA complement each other when applied in a single pipeline?**

Principal Component Analysis (PCA) and K-Nearest Neighbors (KNN) are often used together in a machine learning pipeline to improve model performance. PCA is a dimensionality reduction technique, and KNN is a classification or regression algorithm. When combined, PCA is typically used as a preprocessing step before applying KNN. This combination is a classic example of how a data transformation step can solve key problems for a downstream model: PCA addresses KNN's major weaknesses, namely its sensitivity to the curse of dimensionality and its computational cost on high-dimensional data.

**KNN and PCA complement each other very well when applied in a single pipeline**

- KNN relies on distance between data points for classification or regression. However, in high-dimensional datasets, distances become less meaningful (curse of dimensionality), noisy features disturb distance calculations, and computation becomes heavy.

- PCA reduces dimensionality by projecting data onto a smaller set of principal components that capture the most important variance. This removes redundant/noisy features, makes distance measures more reliable, and speeds up computations.

- Together, PCA can improve KNN’s accuracy (by focusing on informative directions), efficiency (by shrinking the feature space), and ease of visualization (the data can be plotted in 2D/3D), while KNN provides a simple yet powerful classifier or regressor on the transformed space.

In short: PCA prepares the data by denoising and compressing it, and KNN benefits by making predictions that are often better, faster, and easier to visualize.
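
One way to wire the two together is scikit-learn's `make_pipeline`, sketched below on the Wine dataset. The choices of 2 components, `k=5`, and 5-fold cross-validation are assumptions for illustration, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling -> PCA -> KNN, fitted as one estimator
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

# Each fold refits the scaler and PCA on its own training part only
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.4f}")
```

Keeping the preprocessing inside the pipeline means the scaler and PCA never see the held-out fold, which avoids leaking test information into the model.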

---

#### **Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.**

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load Wine dataset
X, y = load_wine(return_X_y=True)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# KNN WITHOUT scaling
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
y_pred_no_scale = knn_no_scale.predict(X_test)
accuracy_no_scale = accuracy_score(y_test, y_pred_no_scale)

# KNN WITH scaling (scaler is fitted on the training split only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_with_scale = KNeighborsClassifier(n_neighbors=5)
knn_with_scale.fit(X_train_scaled, y_train)
y_pred_with_scale = knn_with_scale.predict(X_test_scaled)
accuracy_with_scale = accuracy_score(y_test, y_pred_with_scale)

# Results
print(f"Accuracy without scaling: {accuracy_no_scale:.4f}")
print(f"Accuracy with scaling: {accuracy_with_scale:.4f}")
print(f"Improvement: {(accuracy_with_scale - accuracy_no_scale)*100:.2f}%")
```

```text
Accuracy without scaling: 0.7407
Accuracy with scaling: 0.9630
Improvement: 22.22%
```


#### **Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.** 

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 1. Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# 2. Standardize features (important before PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply PCA (keep all components)
pca = PCA(n_components=X.shape[1])   # same number of components as features
X_pca = pca.fit_transform(X_scaled)

# 4. Print explained variance ratio
print("Explained variance ratio of each principal component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f} ({ratio*100:.2f}%)")
```

```text
Explained variance ratio of each principal component:
PC1: 0.3620 (36.20%)
PC2: 0.1921 (19.21%)
PC3: 0.1112 (11.12%)
PC4: 0.0707 (7.07%)
PC5: 0.0656 (6.56%)
PC6: 0.0494 (4.94%)
PC7: 0.0424 (4.24%)
PC8: 0.0268 (2.68%)
PC9: 0.0222 (2.22%)
PC10: 0.0193 (1.93%)
PC11: 0.0174 (1.74%)
PC12: 0.0130 (1.30%)
PC13: 0.0080 (0.80%)
```


#### **Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.**

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load Wine dataset
X, y = load_wine(return_X_y=True)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN on ORIGINAL scaled dataset (all 13 features)
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
accuracy_original = accuracy_score(y_test, y_pred_original)

# Apply PCA (retain top 2 components), fitted on the training split only
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# KNN on PCA-transformed dataset (2 components)
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

# Results
print("KNN Classification Results:")
print(f"Original dataset (13 features): {accuracy_original:.4f} ({accuracy_original*100:.2f}%)")
print(f"PCA dataset (2 components):     {accuracy_pca:.4f} ({accuracy_pca*100:.2f}%)")
print(f"Accuracy difference:            {accuracy_original - accuracy_pca:.4f} ({(accuracy_original - accuracy_pca)*100:.2f}%)")

# Show variance explained by top 2 components
variance_explained = sum(pca.explained_variance_ratio_)
print(f"\nVariance explained by top 2 PCs: {variance_explained:.4f} ({variance_explained*100:.2f}%)")

if accuracy_pca < accuracy_original:
    print("✗ PCA reduced accuracy (information loss)")
else:
    print("✓ PCA maintained/improved accuracy")
```

```text
KNN Classification Results:
Original dataset (13 features): 0.9630 (96.30%)
PCA dataset (2 components):     0.9815 (98.15%)
Accuracy difference:            -0.0185 (-1.85%)

Variance explained by top 2 PCs: 0.5496 (54.96%)
✓ PCA maintained/improved accuracy
```


#### **Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.**

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load Wine dataset
X, y = load_wine(return_X_y=True)

# Split the data. stratify=y keeps class proportions equal in both splits;
# Questions 6 and 8 split without it, so accuracies are not directly comparable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN with Euclidean distance (equivalent to the default Minkowski metric with p=2)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# KNN with Manhattan distance (p is only used by the Minkowski metric, so it is omitted)
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Results
print("KNN with Different Distance Metrics:")
print(f"Euclidean distance: {accuracy_euclidean:.4f} ({accuracy_euclidean*100:.2f}%)")
print(f"Manhattan distance: {accuracy_manhattan:.4f} ({accuracy_manhattan*100:.2f}%)")

difference = accuracy_euclidean - accuracy_manhattan
print(f"Accuracy difference: {difference:.4f} ({difference*100:.2f}%)")

if accuracy_euclidean > accuracy_manhattan:
    print("✓ Euclidean distance performs better")
elif accuracy_manhattan > accuracy_euclidean:
    print("✓ Manhattan distance performs better")
else:
    print("= Both distances perform equally")

print("\nDistance Metric Explanation:")
print("• Euclidean: √(Σ(xi - yi)²) - straight-line distance")
print("• Manhattan: Σ|xi - yi| - city block distance")
```

```text
KNN with Different Distance Metrics:
Euclidean distance: 0.9444 (94.44%)
Manhattan distance: 0.9815 (98.15%)
Accuracy difference: -0.0370 (-3.70%)
✓ Manhattan distance performs better

Distance Metric Explanation:
• Euclidean: √(Σ(xi - yi)²) - straight-line distance
• Manhattan: Σ|xi - yi| - city block distance
```


#### **Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.** 
**Due to the large number of features and a small number of samples, traditional models overfit.** 

Explain how you would: 
- Use PCA to reduce dimensionality 
- Decide how many components to keep 
- Use KNN for classification post-dimensionality reduction 
- Evaluate the model 
- Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data 


## **Using PCA and KNN for Cancer Classification**

---

### **1. PCA for Dimensionality Reduction**

* Gene expression datasets usually contain **thousands of features (genes)** but only a **small number of patients (samples)**.
* Training a model directly on this raw dataset would cause **overfitting**, because the model could memorize noise instead of learning real patterns.
* **Principal Component Analysis (PCA)** solves this by transforming the original features into a smaller set of **principal components**.
* These components are **linear combinations of the original genes** and are ordered by how much variation (information) they capture.
* By projecting the dataset into this lower-dimensional space, we keep the most meaningful structure while discarding noise and redundancy.

---

### **2. Deciding How Many Components to Keep**

* PCA can produce at most min(number of samples, number of features) components, so with few patients there are far fewer components than genes, and not all of them are equally useful.
* To decide how many to keep, we look at the **explained variance ratio**: how much of the dataset’s variability is captured by each component.
* We calculate the **cumulative explained variance curve**, which shows how variance adds up as more components are included.
* A common rule is to keep enough components to explain **90–95% of the variance**, which balances **information retention** and **dimensionality reduction**.
* Example: If the first 20 components capture 95% of the variance in a 500-gene dataset, we can reduce from 500 genes to 20 features without losing much signal.
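
The cumulative-variance rule can be sketched in plain Python. For concreteness this reuses the explained variance ratios printed for the Wine dataset in Question 7 rather than real gene expression data; `components_needed` is a hypothetical helper written for this example.

```python
from itertools import accumulate

# Explained variance ratios printed for the Wine dataset in Question 7
ratios = [0.3620, 0.1921, 0.1112, 0.0707, 0.0656, 0.0494, 0.0424,
          0.0268, 0.0222, 0.0193, 0.0174, 0.0130, 0.0080]

# Running total: variance explained by the first 1, 2, 3, ... components
cumulative = list(accumulate(ratios))

def components_needed(cumulative, threshold):
    """Smallest number of leading components whose cumulative ratio reaches `threshold`."""
    for count, total in enumerate(cumulative, start=1):
        if total >= threshold:
            return count
    return len(cumulative)

for threshold in (0.90, 0.95):
    n = components_needed(cumulative, threshold)
    print(f"{threshold:.0%} variance -> {n} components")
# 90% variance -> 8 components
# 95% variance -> 10 components
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` (for example `PCA(n_components=0.95)`), which applies the same kind of threshold when the model is fitted.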

---

### **3. KNN for Classification**

* After dimensionality reduction, we train a **K-Nearest Neighbors (KNN)** classifier on the new dataset.
* KNN works by finding the **closest patients (neighbors)** in the reduced feature space and assigning a cancer type based on the **majority vote**.
* This approach makes sense here because PCA has made distances more reliable by removing noisy or irrelevant features.
* We use **cross-validation** to tune the value of `k` (the number of neighbors). Too small `k` → sensitive to noise; too large `k` → overly smooth predictions.
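
Tuning `k` by cross-validation might look like the sketch below. The data is a synthetic stand-in generated with `make_classification` (60 "patients", 500 "genes", 3 classes), and the 10 PCA components and the candidate values of `k` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features
X, y = make_classification(
    n_samples=60, n_features=500, n_informative=10,
    n_classes=3, random_state=0,
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("knn", KNeighborsClassifier()),
])

# Try several values of k; scaler and PCA are refitted inside every fold
search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [1, 3, 5, 7, 9]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

print("Best k:", search.best_params_["knn__n_neighbors"])
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

The number of PCA components could be added to `param_grid` in the same way (as `pca__n_components`) so that both choices are tuned together.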

---

### **4. Evaluating the Model**

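* To measure performance, we hold out a portion of the data as a **test set** that the model never sees during training. The scaler and PCA are fitted on the training portion only and then applied to the test set, so no information leaks from the test data.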
* We calculate metrics:

  * **Accuracy**: overall correct predictions.
  * **Precision**: how many predicted patients of a certain cancer type are truly that type.
  * **Recall (Sensitivity)**: how many patients of a given cancer type were correctly identified.
  * **F1-score**: balance between precision and recall, especially useful when cancer types are imbalanced.
* These metrics together ensure we aren’t just doing well on the most common cancer but are reliable across all classes.
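
To show how these metrics are derived from raw counts, here is a small plain-Python sketch. The labels are hypothetical test-set results for three cancer types A, B and C; in practice scikit-learn's `classification_report` produces the same kind of per-class summary.

```python
def per_class_metrics(y_true, y_pred):
    """Precision, recall and F1 for each class, computed from raw counts."""
    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)  # correctly predicted as cls
        fp = sum(t != cls and p == cls for t, p in pairs)  # wrongly predicted as cls
        fn = sum(t == cls and p != cls for t, p in pairs)  # cls cases that were missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Hypothetical true and predicted cancer types for 10 test patients
y_true = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"]
y_pred = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "A"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)

print(f"Accuracy: {accuracy:.2f}")  # 7 of 10 correct
for cls, m in metrics.items():
    print(cls, {name: round(value, 2) for name, value in m.items()})
```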

---

### **5. Justification to Stakeholders**

* **Challenge addressed**: Biomedical datasets usually have far more genes than patients, which leads to overfitting. PCA reduces dimensionality, helping the model generalize better.
* **Biological signal retention**: By keeping the top principal components, we focus on the strongest patterns in the gene expression data, which often correspond to real biological differences between cancer types.
* **Efficiency**: Fewer features mean faster computation and easier storage, which is practical in real-world pipelines.
* **Transparency**: KNN is a simple, non-parametric model. Its predictions can be explained directly (e.g., “this patient is classified based on their similarity to these neighbors”).
* **Robustness**: PCA + KNN together make a defensible, efficient, and interpretable solution that is well suited to **small-sample, high-dimensional biomedical data**, a very common scenario in cancer genomics research.
