# KNN & PCA


1. What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

--> K-Nearest Neighbors (KNN) is a supervised, non-parametric, and instance-based machine learning algorithm used for both classification and regression problems.
It is called a lazy learning algorithm because it does not build an explicit model during training; instead, it stores the training data and performs computation only at prediction time.

**How KNN Works:** The core idea of KNN is that similar data points exist close to each other in the feature space.

**General Steps**

- Choose the number of neighbors K.

- Calculate the distance between the new data point and all training points
(commonly Euclidean distance).

- Select the K nearest neighbors.

- Aggregate the neighbors’ outputs to make a prediction.

**Distance Measures:**
Commonly used distance metrics include:

- Euclidean distance

- Manhattan distance

- Minkowski distance

- Cosine distance (defined as 1 - cosine similarity)
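
As a rough illustration of the metrics listed above, the following sketch computes each one with NumPy for two made-up vectors. The vectors `a` and `b` and the helper `minkowski` are assumptions chosen for the example. Cosine similarity is turned into a distance as 1 - similarity.

```python
import numpy as np

# Two illustrative points in a 3-dimensional feature space (made-up values)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

diff = a - b  # [-3, -4, 0]

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum(diff ** 2))

# Manhattan distance: sum of absolute differences
manhattan = np.sum(np.abs(diff))


def minkowski(u, v, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)


# Cosine similarity measures the angle between vectors, not a distance,
# so it is converted to a distance as 1 - similarity.
cosine_similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_distance = 1.0 - cosine_similarity

print("Euclidean:", euclidean)        # 5.0
print("Manhattan:", manhattan)        # 7.0
print("Minkowski (p=3):", minkowski(a, b, 3))
print("Cosine distance:", cosine_distance)
```

Minkowski distance generalizes the first two: with `p=1` it reduces to Manhattan distance and with `p=2` to Euclidean distance.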

**KNN for Classification**

In classification problems:

- The algorithm finds the K nearest neighbors.

- The class label is assigned using majority voting among the neighbors.

Example:

If K = 5 and among the nearest neighbors:

- 3 belong to class “Yes”

- 2 belong to class “No”

Then the predicted class is “Yes”.

**KNN for Regression**

In regression problems:

- The algorithm finds the K nearest neighbors.

- The output value is the average (or distance-weighted average) of the neighbors’ target values.

Example:

If K = 3 and the neighbors’ values are:

50, 60, 70

The predicted value is the mean of the three neighbor values:

$
\frac{50 + 60 + 70}{3} = 60
$
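
The two prediction rules can be sketched from scratch in a few lines. This is a minimal illustration, not how scikit-learn implements KNN; the function names and the toy 1-D dataset are assumptions chosen so that the query reproduces both worked examples (3 "Yes" vs 2 "No" for K = 5, and the values 50, 60, 70 for K = 3).

```python
from collections import Counter

import numpy as np


def k_nearest_indices(X_train, x_new, k):
    """Return the indices of the k training points closest to x_new (Euclidean)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    return np.argsort(distances)[:k]


def knn_classify(X_train, y_train, x_new, k):
    """Classification: majority vote among the k nearest neighbors."""
    idx = k_nearest_indices(X_train, x_new, k)
    votes = Counter(y_train[i] for i in idx)
    return votes.most_common(1)[0][0]


def knn_regress(X_train, y_train, x_new, k):
    """Regression: unweighted mean of the k nearest neighbors' targets."""
    idx = k_nearest_indices(X_train, x_new, k)
    return float(np.mean(y_train[idx]))


# Toy 1-D training data (hypothetical values for illustration)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [10.0], [11.0]])
labels = np.array(["Yes", "Yes", "No", "Yes", "No", "No", "No"])
values = np.array([50.0, 60.0, 70.0, 80.0, 90.0, 200.0, 210.0])

query = np.array([2.0])

# The 5 nearest points are x = 1, 2, 3, 4, 5: three "Yes" and two "No"
print(knn_classify(X, labels, query, k=5))  # Yes

# The 3 nearest points are x = 1, 2, 3 with targets 50, 60, 70
print(knn_regress(X, values, query, k=3))   # 60.0
```

Note that all the work happens inside the prediction functions; nothing is learned ahead of time, which is what "lazy learning" refers to.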


**Advantages of KNN:**

- Simple and easy to understand

- No explicit training phase (the training data is simply stored)

- Works well with small datasets

- Can model complex decision boundaries


**Conclusion** : K-Nearest Neighbors (KNN) is a simple yet powerful algorithm that makes predictions based on the similarity between data points. In classification, it uses majority voting, while in regression, it predicts values by averaging neighbors’ outputs. Despite its simplicity, proper choice of K and distance metrics is crucial for optimal performance.

---


2. What is the Curse of Dimensionality and how does it affect KNN
performance?

--> The Curse of Dimensionality refers to the set of problems that arise when the number of features (dimensions) increases in a dataset. As dimensionality grows, the data space becomes increasingly sparse, making it difficult for distance-based algorithms to find meaningful patterns.

**Why the Curse of Dimensionality Occurs:**
- In high-dimensional space, data points become far apart from each other.

- The volume of the feature space grows exponentially with dimensions.

- A much larger amount of data is required to maintain the same data density.

**Effect of Curse of Dimensionality on KNN:** KNN relies entirely on distance calculations, which are strongly affected by high dimensionality.

1. Distance Becomes Less Meaningful: In high dimensions, the distances to the nearest and farthest neighbors become very similar. This reduces the ability of KNN to distinguish between close and distant points.

2. Loss of Neighborhood Concept: KNN assumes nearby points are similar. In high dimensions, all points appear almost equally distant, which breaks this assumption.

3. Increased Computational Cost: Each distance calculation scales with the number of features, so prediction becomes slower as dimensions are added.

4. Higher Risk of Overfitting: Irrelevant features add noise to every distance. KNN may then pick neighbors based on those features, which reduces generalization.
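
The first two effects can be observed numerically. The sketch below is an illustration under assumptions not taken from the text: points are drawn uniformly at random from the unit hypercube, and "relative contrast" is defined here as (farthest - nearest) / nearest distance from a random query point. As the number of dimensions grows, this contrast shrinks, meaning the nearest and farthest points look increasingly alike.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_points, n_dims):
    """(d_max - d_min) / d_min for Euclidean distances from one random query."""
    points = rng.random((n_points, n_dims))  # uniform in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


# Same number of points, increasing dimensionality
contrasts = {d: relative_contrast(1000, d) for d in (2, 10, 100, 1000)}

for d, c in contrasts.items():
    print(f"dims={d:5d}  relative contrast={c:.3f}")
```

The exact numbers depend on the random seed and the data distribution, but the downward trend with dimensionality is the point of the demonstration.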

**Impact on KNN Performance:**

- Decreased accuracy

- Poor generalization

- Slower predictions

- Increased sensitivity to noise

**How to Mitigate the Curse of Dimensionality in KNN:**

- Feature selection (remove irrelevant features)

- Dimensionality reduction (PCA, LDA)

- Feature scaling and normalization (so that no single feature dominates the distance)

- Use distance-weighted KNN

- Increase dataset size (if possible)

**Conclusion** : The Curse of Dimensionality significantly degrades KNN performance by making distance measures unreliable in high-dimensional spaces. Since KNN depends on the notion of proximity, increased dimensionality leads to reduced accuracy, higher computational cost, and poor generalization. Applying dimensionality reduction and feature selection techniques is essential for effective KNN performance.

---


3.  What is Principal Component Analysis (PCA)? How is it different from
feature selection?

--> Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique used to transform a high-dimensional dataset into a lower-dimensional space while preserving as much variance (information) as possible.

PCA creates new variables called principal components, which are linear combinations of the original features and are mutually uncorrelated.

**How PCA Works:**

1. Standardize the data.

2. Compute the covariance matrix.

3. Calculate eigenvalues and eigenvectors.

4. Sort eigenvectors by decreasing eigenvalues.

5. Select the top k principal components.

6. Project the original data onto these components.

The first principal component captures the maximum variance, the second captures the next highest variance, and so on.
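
The six steps can be sketched directly with NumPy. This is a minimal illustration assuming a small synthetic dataset; the function name `pca_eig` and the data are made up for the example, and library implementations such as scikit-learn's `PCA` typically use an SVD rather than an explicit covariance matrix.

```python
import numpy as np


def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance matrix of standardized X."""
    # 1. Standardize each feature to zero mean and unit variance
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Covariance matrix of the standardized features
    cov = np.cov(Z, rowvar=False)

    # 3. Eigenvalues and eigenvectors (eigh is meant for symmetric matrices)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # 4. Sort by decreasing eigenvalue (eigh returns ascending order)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 5. Keep the top k eigenvectors as principal components
    components = eigvecs[:, :k]

    # 6. Project the standardized data onto those components
    scores = Z @ components

    explained_ratio = eigvals / eigvals.sum()
    return scores, components, explained_ratio


# Synthetic data: the second feature is almost a multiple of the first
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([
    x1,
    2 * x1 + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

scores, components, ratio = pca_eig(X, k=2)
print("Explained variance ratio:", ratio)
print("Projected shape:", scores.shape)
```

Because two of the three features are nearly collinear, most of the variance should concentrate in the first two components, and the projected columns are uncorrelated with each other.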

**Key Characteristics of PCA:**

- Unsupervised technique

- Reduces dimensionality

- Removes multicollinearity

- Produces new transformed features

- Commonly used for visualization and noise reduction

**Feature Selection** : Feature selection is the process of selecting a subset of original features that are most relevant to the prediction task, without transforming them. It keeps the original meaning of features intact.

**Types of Feature Selection:**

- Filter methods (correlation, chi-square)

- Wrapper methods (forward selection, backward elimination)

- Embedded methods (LASSO, tree-based feature importance)

**Difference Between PCA and Feature Selection :**

| Aspect | PCA | Feature Selection |
|------|-----|------------------|
| Approach | Feature extraction | Feature selection |
| Type | Unsupervised | Usually supervised |
| Output Features | New transformed features | Original features |
| Interpretability | Low | High |
| Uses Target Variable | No | Yes (often) |
| Multicollinearity | Removes it | May retain it |
| Example | PC1, PC2 | Age, Income |



**Conclusion** : PCA reduces dimensionality by creating new uncorrelated features that capture maximum variance, whereas feature selection reduces dimensionality by choosing the most relevant original features. PCA focuses on data representation, while feature selection focuses on feature relevance to the target variable.

---


4.  What are eigenvalues and eigenvectors in PCA, and why are they
important?

--> Eigenvectors are **direction vectors** that define the new axes (principal
components) in PCA. They indicate the **directions of maximum variance** in the
data. In PCA, eigenvectors are obtained from the **covariance matrix** of the
dataset.

Eigenvalues are **scalar values** that correspond to each eigenvector. They
represent the **amount of variance** captured along the direction of their
associated eigenvector.

- Large eigenvalue → More variance explained
- Small eigenvalue → Less variance explained


**Role of Eigenvalues and Eigenvectors in PCA:**

1. **Direction of Data Spread**
   - Eigenvectors define the directions along which data varies the most.

2. **Variance Measurement**
   - Eigenvalues quantify how much information (variance) is present in each
     direction.

3. **Principal Component Selection**
   - Eigenvectors with the **largest eigenvalues** are selected as principal
     components.

4. **Dimensionality Reduction**
   - By keeping only top eigenvectors, PCA reduces dimensions while preserving
     maximum information.

**Why are They Important in PCA?:**

- Help identify the most important patterns in data
- Reduce noise and redundancy
- Remove multicollinearity
- Improve computational efficiency
- Enable data visualization in lower dimensions

**Example:**
If a dataset has:
- Eigenvalue₁ = 5.2
- Eigenvalue₂ = 1.1
- Eigenvalue₃ = 0.2

Then the first principal component (PC1) captures the most variance
(5.2 / 6.5 = 80% of the total) and is the most important.


**Conclusion:**
In PCA, **eigenvectors define the directions of new feature axes**, while
**eigenvalues indicate the importance of those directions**. Together, they
allow PCA to reduce dimensionality while retaining the most meaningful
information from the dataset.


---


5. How do KNN and PCA complement each other when applied in a single
pipeline?

--> K-Nearest Neighbors (KNN) and Principal Component Analysis (PCA) are often used
together because PCA helps overcome the limitations of KNN in high-dimensional
data. While KNN relies on distance calculations, PCA reduces dimensionality and
removes redundancy, making distance measures more meaningful.



**Role of PCA in the Pipeline:**
- Reduces the number of features by projecting data onto principal components.
- Decorrelates the features and discards low-variance directions, which often carry redundancy or noise.
- Mitigates the **curse of dimensionality**.
- Preserves maximum variance with fewer dimensions.



**Role of KNN in the Pipeline:**
- Performs classification or regression using distance-based similarity.
- Benefits from cleaner, lower-dimensional feature space.
- Often predicts faster, and sometimes more accurately, after PCA.


**How They Complement Each Other:**

1. **Improved Distance Measurement**
   - PCA makes the distances used by KNN more meaningful by removing
     redundancy and low-variance noise.

2. **Reduced Computational Cost**
   - Fewer dimensions mean faster distance calculations in KNN.

3. **Better Generalization**
   - Discarding low-variance directions can reduce overfitting in KNN.

4. **Improved Accuracy**
   - KNN tends to perform better when redundant or noisy dimensions are
     removed. PCA is unsupervised, so this is not guaranteed: a low-variance
     direction can still be predictive.



**Typical PCA + KNN Pipeline:**
1. Standardize the dataset.
2. Apply PCA to reduce dimensionality.
3. Train KNN on transformed data.
4. Evaluate performance.
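
The four steps above map naturally onto a scikit-learn `Pipeline`. The sketch below is one possible arrangement, assuming the Wine dataset as a stand-in and arbitrary choices of 2 components and K = 5. Wrapping the steps in a pipeline means the scaler and PCA are fitted on the training split only and then reused on the test split.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption: any tabular classification data would do)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("scale", StandardScaler()),                   # 1. standardize
    ("pca", PCA(n_components=2)),                  # 2. reduce dimensionality
    ("knn", KNeighborsClassifier(n_neighbors=5)),  # 3. train KNN
])

# fit() learns the scaling and the PCA projection from the training data only
pipeline.fit(X_train, y_train)

# 4. evaluate on held-out data; the fitted transforms are applied unchanged
test_accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

The number of components and neighbors are hyperparameters; in practice they would be tuned, for example with cross-validation.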


**Example Use Cases:**
- Image recognition
- Text classification
- Bioinformatics datasets
- High-dimensional sensor data



**Conclusion:**
PCA and KNN complement each other by combining dimensionality reduction with
distance-based learning. PCA makes KNN more efficient and accurate by addressing
the curse of dimensionality, while KNN leverages the reduced feature space to
make reliable predictions.

---


In [1]:
# 6. Train a KNN Classifier on the Wine dataset with and without feature
# scaling. Compare model accuracy in both cases.

# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# KNN WITHOUT Feature Scaling

knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)

y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

print("Accuracy WITHOUT feature scaling:", accuracy_no_scaling)


# KNN WITH Feature Scaling

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)

y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

print("Accuracy WITH feature scaling:", accuracy_scaled)


Accuracy WITHOUT feature scaling: 0.7222222222222222
Accuracy WITH feature scaling: 0.9444444444444444


In [2]:
# 7. Train a PCA model on the Wine dataset and print the explained variance
# ratio of each principal component.

# Import required libraries
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Wine dataset
data = load_wine()
X = data.data

# Standardize the features (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train PCA model (keeping all components)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Print explained variance ratio for each principal component
print("Explained Variance Ratio of each Principal Component:")
for i, var in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: {var:.4f}")


Explained Variance Ratio of each Principal Component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080


In [3]:
# 8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2
# components). Compare the accuracy with the original dataset.

# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# KNN on Original Dataset
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN classifier
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred_original = knn_original.predict(X_test_scaled)
accuracy_original = accuracy_score(y_test, y_pred_original)


# KNN on PCA-transformed Dataset (Top 2 components)
# Apply PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Train KNN on PCA data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)

# Predict and evaluate
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)


# Print Results
print("Accuracy using Original Dataset:", accuracy_original)
print("Accuracy using PCA-transformed Dataset (2 components):", accuracy_pca)


Accuracy using Original Dataset: 0.9629629629629629
Accuracy using PCA-transformed Dataset (2 components): 0.9814814814814815


In [4]:
# 9. Train a KNN Classifier with different distance metrics (euclidean,
# manhattan) on the scaled Wine dataset and compare the results.

# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Wine dataset
data = load_wine()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# KNN with Euclidean Distance
knn_euclidean = KNeighborsClassifier(
    n_neighbors=5,
    metric='euclidean'
)
knn_euclidean.fit(X_train_scaled, y_train)

y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)


# KNN with Manhattan Distance
knn_manhattan = KNeighborsClassifier(
    n_neighbors=5,
    metric='manhattan'
)
knn_manhattan.fit(X_train_scaled, y_train)

y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)


# Print Results
print("Accuracy with Euclidean Distance:", accuracy_euclidean)
print("Accuracy with Manhattan Distance:", accuracy_manhattan)


Accuracy with Euclidean Distance: 0.9629629629629629
Accuracy with Manhattan Distance: 0.9629629629629629


10.  You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:

● Use PCA to reduce dimensionality

● Decide how many components to keep

● Use KNN for classification post-dimensionality reduction

● Evaluate the model

● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

--> Gene expression datasets typically have **thousands of features (genes)** but **very few samples**.  
- This causes **overfitting** in traditional machine learning models.  
- Many features are **correlated** or irrelevant for classification.

To address this, we can use a **pipeline combining PCA for dimensionality reduction and KNN for classification**.


**Step 1: Use PCA to Reduce Dimensionality:**
- Apply **Principal Component Analysis (PCA)** to transform the original high-dimensional data into a smaller set of **uncorrelated principal components**.  
- PCA captures the directions of **maximum variance**, reducing noise and redundancy.  
- This lowers the risk of overfitting while preserving the most important information.



**Step 2: Decide How Many Components to Keep:**
- **Explained Variance Ratio:**  
  Calculate the **cumulative explained variance** for each principal component.  
- **Rule of Thumb:**  
  Retain the top components that explain **95–99% of the total variance**.  
- This reduces dimensionality while keeping most of the variance. With far fewer samples than features, the number of components is also capped by the number of samples.
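
The cumulative explained variance rule can be sketched as follows. The Wine dataset is used here only as a small stand-in for gene expression data, and the 95% threshold is the rule of thumb mentioned above rather than a fixed requirement.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption: a real project would load the expression matrix)
X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance ratios
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.95
# Index of the first cumulative value >= threshold, plus 1 to get a count
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(f"Components needed for {threshold:.0%} variance: {n_components}")

# scikit-learn can apply the same rule directly when n_components is a
# float between 0 and 1
pca_95 = PCA(n_components=threshold, svd_solver="full").fit(X_scaled)
print("Components chosen by PCA(n_components=0.95):", pca_95.n_components_)
```

A variance threshold is only a heuristic. Treating the number of components as a hyperparameter and tuning it by cross-validated classification performance is a common alternative.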


**Step 3: Use KNN for Classification Post-PCA:**
- Standardize the features **before** PCA, since PCA is sensitive to feature scale; the resulting component scores are then passed to KNN.  
- Train a **K-Nearest Neighbors (KNN) classifier** using the reduced-dimensional data.  
- KNN is chosen because it is **non-parametric**, simple, and effective in moderate dimensions.



**Step 4: Evaluate the Model:**
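- Split the data into **training and test sets** or use **cross-validation**. Fit the scaler and PCA on the training portion only, so that no test information leaks into the model.  
- Use metrics suitable for imbalanced or biomedical data:
  - **Accuracy**: Overall correct predictions (can be misleading when classes are imbalanced)  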
  - **Precision, Recall, F1-score**: Important when misclassification is costly  
  - **ROC-AUC**: Evaluates separability between classes  
- Optionally, plot a **confusion matrix** to visualize misclassifications.



**Step 5: Justify the Pipeline to Stakeholders:**
- **Reduced Overfitting:** PCA lowers feature dimensionality, improving generalization.  
- **Interpretability:** KNN predictions are easy to explain (based on similarity to known patients).  
- **Computational Efficiency:** Reduced dimensions speed up distance calculations in KNN.  
- **Data-Driven Decisions:** Preserves biologically relevant variation while filtering noise.  
- **Robustness:** The pipeline is designed for datasets with far more features than samples, which is common in biomedical data; cross-validated results should back up this claim.



**Summary :**

1. **PCA** reduces thousands of genes to a manageable number of principal components.  
2. **Number of components** is chosen based on **explained variance** (95–99%).  
3. **KNN** classifies patients in the reduced space, minimizing overfitting.  
4. **Evaluation metrics** ensure robust assessment.  
5. The pipeline is **stakeholder-friendly**, balancing accuracy, interpretability, and computational efficiency.




In [5]:

### Optional Python Skeleton for Implementation
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
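from sklearn.metrics import classification_report

# X = gene expression data, y = cancer type labels.
# Both are placeholders here and must be defined before this cell runs.
# The report printed below has 36 test samples, which matches a 20% split
# of the 178-sample Wine data from the earlier cells, so it appears to come
# from that dataset rather than from gene expression data.

# 1. Scale features
# Note: fitting the scaler and PCA on all of X before the split lets
# test-set statistics leak into training. For a real evaluation, split
# first and fit both steps on the training data only.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Apply PCA
pca = PCA(n_components=0.95)  # retain 95% variance
X_pca = pca.fit_transform(X_scaled)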

# 3. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.2, random_state=42
)

# 4. Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# 5. Predict and Evaluate
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      1.00      0.97        14
           1       1.00      0.86      0.92        14
           2       0.89      1.00      0.94         8

    accuracy                           0.94        36
   macro avg       0.94      0.95      0.94        36
weighted avg       0.95      0.94      0.94        36
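
For a dataset with many more features than samples, a single train/test split gives a noisy estimate, and fitting the scaler or PCA before splitting leaks information. The sketch below shows one way to avoid both, under stated assumptions: `make_classification` generates synthetic data as a stand-in for a gene expression matrix (90 samples, 1000 features), and the settings of 95% variance and K = 5 are illustrative rather than tuned. All preprocessing lives inside a `Pipeline`, so it is refitted on the training portion of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features
X, y = make_classification(
    n_samples=90,
    n_features=1000,
    n_informative=20,
    n_classes=3,
    random_state=0,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95, svd_solver="full")),  # keep 95% variance
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# The whole pipeline is refitted inside each fold, so the scaler and PCA
# never see the held-out samples
results = cross_validate(pipeline, X, y, cv=cv, scoring=["accuracy", "f1_macro"])

print(f"Accuracy: {results['test_accuracy'].mean():.3f}")
print(f"Macro F1: {results['test_f1_macro'].mean():.3f}")
```

Scores on synthetic data say nothing about real biomedical performance; the point is the evaluation structure, which carries over unchanged once `X` and `y` hold real data.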

