# KNN & PCA | Assignment

Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

  - K-Nearest Neighbors (KNN) is a non-parametric, instance-based machine learning algorithm that can be used for both classification and regression tasks. It makes predictions based on the similarity between a new data point and the points in the training dataset.

How KNN Works:
1. Store the training data: KNN is a lazy learner, meaning it does not build an explicit model during training. It simply stores the training data.

2. Compute distances to neighbors: When making a prediction for a new input, KNN calculates the distance between the new point and every training point. Common distance metrics are Euclidean, Manhattan, and Minkowski distance.

3. Select the K nearest neighbors: Choose the K closest points from the training data based on the distance metric.

4. Predict the output:

    1. Classification:

        I. Count the classes of the K neighbors.
        II. Assign the class that occurs most frequently (majority voting).
        III. Example: If 3 neighbors are "A" and 2 are "B", predict "A".

    2. Regression:

        I. Take the average (or weighted average) of the K neighbors’ target values.
        II. Example: If the neighbors’ values are 10, 12, and 15, predict (10 + 12 + 15) / 3 ≈ 12.33.
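
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN: the function name `knn_predict` and the toy data are made up for this example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for a single point x_new from its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    # Step 4: majority vote for classification, mean for regression
    if task == "classification":
        return Counter(neighbor_targets).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))


# Toy 1-D data (made up for illustration)
X = [[1.0], [1.2], [1.4], [3.0], [3.2]]
labels = ["A", "A", "A", "B", "B"]
values = [10, 12, 15, 40, 42]

# Classification: 3 neighbors are "A" and 2 are "B", so the vote goes to "A"
print(knn_predict(X, labels, [2.0], k=5))  # A
# Regression: the 3 nearest targets are 10, 12, 15
print(round(knn_predict(X, values, [1.1], k=3, task="regression"), 2))  # 12.33
```

The two calls reproduce the classification and regression examples given in the list above.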

Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?

  - The Curse of Dimensionality refers to the problems that arise when working with high-dimensional data—i.e., datasets with a very large number of features. It particularly affects algorithms like K-Nearest Neighbors (KNN) that rely on distance metrics.

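  Effects on KNN Performance:
1.   Distance metrics lose significance: as the number of dimensions grows, the distances from a point to its nearest and farthest neighbors become nearly equal, so "nearest" carries little information.
2.   Increased overfitting: with many features and comparatively few samples the data becomes sparse, so the chosen neighbors are often determined by noisy or irrelevant features.
3.   Higher computational cost: every prediction computes a distance to each training point, and each distance costs more as features are added.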
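
The first effect can be observed with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube; the helper name `relative_contrast` and the sample sizes are arbitrary choices for illustration. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dimensions: relative contrast = {c:.2f}")
```

The contrast should shrink sharply as the dimension grows, which is why the nearest neighbors become less distinguishable from the rest of the data.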



Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?

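  - Principal Component Analysis (PCA) is a dimensionality reduction technique used in machine learning and statistics. Its goal is to reduce the number of features in a dataset while retaining as much variance (information) as possible.

How PCA differs from feature selection
*   Feature selection keeps a subset of the original features and discards the rest, so the retained columns keep their original meaning.
*   PCA is a feature extraction method: each principal component is a new feature built as a linear combination of all the original features. This usually preserves more variance per dimension but makes the resulting features harder to interpret.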

How PCA Works
1. **Standardize the data**
*   Scale features so that each has mean 0 and standard deviation 1 (important if features have different scales).

2. **Compute covariance matrix**
*   Measures how features vary together.

3. **Calculate eigenvectors and eigenvalues**
*   Eigenvectors (principal components) define new axes in the feature space.
*   Eigenvalues indicate how much variance is captured by each component.

4. **Select top principal components**
*   Keep the first k components that capture most of the variance.

5. **Transform the original data**
*   Project data onto these components, reducing dimensions while retaining essential patterns.
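
The PCA steps listed above can be written out directly in NumPy. This is a minimal sketch for illustration: the function name `pca_from_scratch` and the synthetic data are assumptions, and library implementations such as scikit-learn typically use an SVD rather than an explicit covariance matrix.

```python
import numpy as np


def pca_from_scratch(X, k):
    """Return (projected data, top-k components, all eigenvalues sorted descending)."""
    X = np.asarray(X, dtype=float)
    # Standardize each feature to mean 0 and standard deviation 1
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)
    # Eigen-decomposition (eigh is intended for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]  # largest variance first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Keep the first k components (one eigenvector per column)
    components = eigenvectors[:, :k]
    # Project the standardized data onto those components
    return X_std @ components, components, eigenvalues


# Synthetic data: 200 samples, 4 features, two of them strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X_demo = np.column_stack([base, base[:, 0] + 0.1 * rng.normal(size=200)])

X_proj, components, eigenvalues = pca_from_scratch(X_demo, k=2)
print("Projected shape:", X_proj.shape)
print("Explained variance ratio:", np.round(eigenvalues / eigenvalues.sum(), 3))
```

Because two of the four synthetic features are nearly identical, the first component should absorb a disproportionate share of the variance.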

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

  - In Principal Component Analysis (PCA), eigenvalues and eigenvectors are fundamental because they define the principal components, which determine how the data is transformed and reduced in dimensions.

1. Eigenvectors:
*   An eigenvector of the covariance matrix is a direction that the matrix only stretches or shrinks, without rotating it.
*   In PCA, eigenvectors of the covariance matrix represent the axes of the new feature space (principal components).
*   They are mutually orthogonal unit vectors: the first points in the direction of maximum variance, and each subsequent one captures the most remaining variance.

*   Intuition: If your dataset is like a cloud of points, the eigenvectors are the directions along which the cloud stretches, and the first is the direction in which it stretches the most.

2. Eigenvalues:
*   An eigenvalue corresponds to an eigenvector and indicates how much variance is captured along that direction.
*   Larger eigenvalues → more variance captured → more “information” along that component.
*   PCA ranks principal components by eigenvalues to decide which components to keep.

*   Intuition: Think of an eigenvalue as the “importance” of the corresponding eigenvector: a higher value means that direction explains more of the data’s spread.

  * Why They Are Important in PCA

1. Determine principal components:

*   Eigenvectors define the directions of the new axes (components).
*   Eigenvalues rank these axes by importance.

2. Dimensionality reduction:
*   By keeping the eigenvectors with the largest eigenvalues, you retain most of the variance while reducing the number of features.


3. Data transformation:
*   Projecting data onto eigenvectors (principal components) creates uncorrelated features that summarize the original data efficiently.
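
These properties can be checked numerically on the Wine data. The sketch below assumes scikit-learn is installed. It verifies the defining eigenvector equation, then compares the eigenvalue shares with the `explained_variance_ratio_` attribute of scikit-learn's `PCA`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Eigen-decomposition of the covariance matrix, sorted by decreasing eigenvalue
cov = np.cov(X_scaled, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Defining property: cov @ v equals eigenvalue * v for an eigenvector v
v1, lambda1 = eigenvectors[:, 0], eigenvalues[0]
eigen_equation_holds = np.allclose(cov @ v1, lambda1 * v1)
print("cov @ v1 == lambda1 * v1:", eigen_equation_holds)

# Each eigenvalue's share of the total is that component's explained variance ratio
ratios = eigenvalues / eigenvalues.sum()
pca = PCA().fit(X_scaled)
ratios_match = np.allclose(ratios, pca.explained_variance_ratio_)
print("Eigenvalue shares match sklearn:", ratios_match)
```

The signs of individual eigenvectors are arbitrary, so two implementations may return components that point in opposite directions while describing the same axes.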

Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?

  - Complementarity

1. PCA reduces dimensionality
*   The Wine dataset has 13 numeric features.
*   PCA projects them onto fewer components that retain most of the variance, removing redundant or noisy information.

2. KNN is distance-based
*   KNN relies on distances between points.
*   Fewer, more informative dimensions from PCA make these distances more reliable, mitigating the curse of dimensionality.

3. Efficiency and accuracy
*   Reduced dimensions → faster KNN distance computations.
*   Less redundant and noisy input → better neighbor selection, which often (though not always) improves classification accuracy.






In [1]:
# Python example: KNN + PCA on the Wine dataset

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load the 13 numeric features and the 3-class target
data = load_wine()
X = data.data
y = data.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale -> project onto 5 principal components -> classify with 5 nearest neighbors
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=5)),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

# Every step is fit on the training data only
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"KNN with PCA Accuracy: {accuracy:.4f}")

KNN with PCA Accuracy: 0.9444


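  - Explanation:

*   StandardScaler: Standardizes features before PCA.
*   PCA(n_components=5): Keeps the first 5 principal components, which capture roughly 80% of the variance of the standardized Wine features (see the Question 7 output).
*   KNeighborsClassifier: Runs KNN in the reduced PCA space.
*   Pipeline: Fits each step on the training data only and applies the same fitted transformations to the test data, so no test-set information leaks into preprocessing.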

Dataset:
Use the Wine Dataset from sklearn.datasets.load_wine().

Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.

In [2]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

data = load_wine()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: KNN on the raw, unscaled features
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)
print(f"KNN Accuracy without scaling: {accuracy_no_scaling:.4f}")

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Same KNN settings on the standardized features
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"KNN Accuracy with scaling: {accuracy_scaled:.4f}")

KNN Accuracy without scaling: 0.7222
KNN Accuracy with scaling: 0.9444
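
A likely reason for the gap is that Euclidean distance on raw data is dominated by whichever feature has the largest numeric spread. The sketch below, assuming scikit-learn's bundled Wine data, lists the per-feature standard deviations so the imbalance is visible; the variable names are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_wine

data = load_wine()
stds = data.data.std(axis=0)

# Features sorted from largest to smallest standard deviation
for idx in np.argsort(stds)[::-1]:
    print(f"{data.feature_names[idx]:<30} std = {stds[idx]:.2f}")

largest = data.feature_names[int(np.argmax(stds))]
print("Feature with the largest spread:", largest)
```

If one feature's spread is orders of magnitude larger than the others, unscaled KNN effectively ranks neighbors by that feature alone, which standardization corrects.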


Question 7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.

In [3]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_wine()
X = data.data

# PCA is scale-sensitive, so standardize the features first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# With no n_components argument, PCA keeps all 13 components
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Share of the total variance explained by each component
explained_variance = pca.explained_variance_ratio_
for i, ratio in enumerate(explained_variance, start=1):
    print(f"Principal Component {i}: {ratio:.4f}")

# Running total, useful for choosing how many components to keep
cumulative_variance = explained_variance.cumsum()
for i, cum_var in enumerate(cumulative_variance, start=1):
    print(f"Cumulative variance after PC{i}: {cum_var:.4f}")

Principal Component 1: 0.3620
Principal Component 2: 0.1921
Principal Component 3: 0.1112
Principal Component 4: 0.0707
Principal Component 5: 0.0656
Principal Component 6: 0.0494
Principal Component 7: 0.0424
Principal Component 8: 0.0268
Principal Component 9: 0.0222
Principal Component 10: 0.0193
Principal Component 11: 0.0174
Principal Component 12: 0.0130
Principal Component 13: 0.0080
Cumulative variance after PC1: 0.3620
Cumulative variance after PC2: 0.5541
Cumulative variance after PC3: 0.6653
Cumulative variance after PC4: 0.7360
Cumulative variance after PC5: 0.8016
Cumulative variance after PC6: 0.8510
Cumulative variance after PC7: 0.8934
Cumulative variance after PC8: 0.9202
Cumulative variance after PC9: 0.9424
Cumulative variance after PC10: 0.9617
Cumulative variance after PC11: 0.9791
Cumulative variance after PC12: 0.9920
Cumulative variance after PC13: 1.0000
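
Instead of reading the threshold off the cumulative list by hand, scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components` and keeps just enough components to reach that share of variance. A short sketch, assuming a 95% target (the threshold is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# A float in (0, 1) asks PCA to keep enough components to reach that variance share
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)

print("Components kept:", pca_95.n_components_)
print(f"Variance retained: {pca_95.explained_variance_ratio_.sum():.4f}")
```

Given the cumulative values printed above (0.9424 after PC9 and 0.9617 after PC10), this should select 10 components.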


Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.

In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

data = load_wine()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize using statistics from the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Baseline: KNN on all 13 scaled features
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
accuracy_original = accuracy_score(y_test, y_pred_original)
print(f"KNN Accuracy on Original Data: {accuracy_original:.4f}")

# Fit PCA on the training split, then project both splits onto the top 2 components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Same KNN settings in the 2-dimensional PCA space
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)
print(f"KNN Accuracy on PCA-transformed Data (2 components): {accuracy_pca:.4f}")

KNN Accuracy on Original Data: 0.9444
KNN Accuracy on PCA-transformed Data (2 components): 1.0000
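
The test split here holds only 36 samples, so the two accuracies above differ by just two predictions. A cross-validated comparison gives a steadier picture. The sketch below is one way to do that, assuming 5 stratified folds; the dictionary keys and variable names are chosen for this example.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "all 13 scaled features": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "2 principal components": make_pipeline(
        StandardScaler(), PCA(n_components=2), KNeighborsClassifier(n_neighbors=5)),
}

cv_results = {}
for name, model in models.items():
    # The scaler and PCA are refit inside every fold, so no test fold leaks into them
    cv_results[name] = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: {cv_results[name].mean():.4f} ± {cv_results[name].std():.4f}")
```

If the two means fall within each other's spread, the perfect score on the single split is better read as a favorable split than as evidence that two components outperform all thirteen features.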


Question 9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

In [5]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

data = load_wine()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize using statistics from the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

metrics = ['euclidean', 'manhattan']

# Train and evaluate one KNN model per distance metric
for metric in metrics:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"KNN Accuracy with {metric} distance: {accuracy:.4f}")


KNN Accuracy with euclidean distance: 0.9444
KNN Accuracy with manhattan distance: 0.9444
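
Both metrics are special cases of the Minkowski distance, which may help explain why they often rank neighbors similarly on standardized data. A minimal sketch, with a hypothetical helper named `minkowski` and two made-up points:

```python
import numpy as np


def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan and p=2 gives Euclidean."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))


a, b = [0.0, 0.0], [3.0, 4.0]
print("Manhattan:", minkowski(a, b, p=1))  # 7.0
print("Euclidean:", minkowski(a, b, p=2))  # 5.0
```

Equal accuracy does not imply that the two models chose the same neighbors or made identical predictions, only that they made the same number of errors on this test split.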


Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data

  - The following explanation and Python example apply PCA followed by KNN to a high-dimensional gene expression dataset, with attention to overfitting and evaluation.

1. Pipeline Explanation

a. Use PCA to reduce dimensionality
*   High-dimensional data (e.g., thousands of genes) can cause overfitting, especially with few samples.
*   After the features are standardized, PCA transforms the data into principal components that capture most of the variance, compressing information and discarding low-variance directions that are often noise.

b. Decide how many components to keep
*   Compute the cumulative explained variance.
*   Choose the minimum number of components that explains a high percentage (e.g., 90–95%) of the total variance.

c. Use KNN for classification
*   After PCA, KNN classifies patients based on distances in the reduced space, which lessens the impact of the curse of dimensionality.

d. Evaluate the model
*   Use metrics suitable for multi-class classification: accuracy, F1-score, and the confusion matrix.
*   Use K-fold cross-validation, which gives a more stable estimate than a single split on small datasets.

e. Justification to stakeholders
*   Reducing noise and dimensionality lowers the risk of overfitting.
*   PCA retains the dominant patterns of variation across genes, although high variance does not guarantee relevance to the cancer type.
*   KNN is simple, interpretable, and non-parametric.
*   The pipeline is validated with cross-validation, which supports its use on real-world biomedical datasets.


In [6]:
# 2. Python code (simulated high-dimensional data)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix

# Simulate 100 samples with 1000 features (50 informative) and 3 classes
X, y = make_classification(
    n_samples=100,
    n_features=1000,
    n_informative=50,
    n_classes=3,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize using statistics from the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit a full PCA to inspect the cumulative explained variance
pca = PCA()
pca.fit(X_train_scaled)

# Smallest number of components whose cumulative variance reaches 90%
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative_variance >= 0.90)) + 1
print(f"Number of components to retain 90% variance: {n_components}")

# Refit PCA with the chosen number of components and project both splits
pca = PCA(n_components=n_components)
X_train_reduced = pca.fit_transform(X_train_scaled)
X_test_reduced = pca.transform(X_test_scaled)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_reduced, y_train)
y_pred = knn.predict(X_test_reduced)

accuracy = knn.score(X_test_reduced, y_test)
print(f"KNN Accuracy on PCA-reduced data: {accuracy:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Cross-validate KNN on the training data already reduced by PCA
cv_scores = cross_val_score(knn, X_train_reduced, y_train, cv=5)
print(f"5-Fold CV Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

Number of components to retain 90% variance: 67
KNN Accuracy on PCA-reduced data: 0.3500

Classification Report:
              precision    recall  f1-score   support

           0       0.33      0.33      0.33         6
           1       0.36      0.71      0.48         7
           2       0.00      0.00      0.00         7

    accuracy                           0.35        20
   macro avg       0.23      0.35      0.27        20
weighted avg       0.23      0.35      0.27        20

Confusion Matrix:
[[2 4 0]
 [2 5 0]
 [2 5 0]]
5-Fold CV Accuracy: 0.3375 ± 0.1458
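
The accuracy above (0.35) is close to the 1/3 chance level for three classes, so on this simulated data the 90% variance rule did not yield a useful model. The cell also fits the scaler and PCA on the whole training split before cross-validating, which can make the cross-validation estimate slightly optimistic. One possible refinement, sketched below under stated assumptions, puts every step inside a `Pipeline` and lets `GridSearchCV` choose the number of components and neighbors by cross-validated macro-F1. The grid values are hypothetical and chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same simulated data as in the cell above
X, y = make_classification(n_samples=100, n_features=1000, n_informative=50,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical search grid; the values are illustrative, not tuned recommendations
param_grid = {"pca__n_components": [5, 10, 20], "knn__n_neighbors": [3, 5, 7]}

search = GridSearchCV(
    pipe, param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1_macro",
)
search.fit(X_train, y_train)  # scaler and PCA are refit within each training fold

print("Best parameters:", search.best_params_)
print(f"Best CV macro-F1: {search.best_score_:.4f}")
# GridSearchCV.score uses the same scorer, so this is macro-F1 on the held-out split
print(f"Held-out test macro-F1: {search.score(X_test, y_test):.4f}")
```

Whether this improves on the result above depends on the data. With real gene expression data, a supervised feature filter or a different classifier may be needed if the high-variance directions found by PCA do not separate the classes.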
