# Assignment - KNN & PCA

1. What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?
  - K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm that makes predictions based on the similarity (distance) between data points.

      - Classification: KNN assigns a class to a new data point by looking at the majority class among its k nearest neighbors.

      - Regression: KNN predicts a continuous value by taking the average (or weighted average) of the target values of its k nearest neighbors.

  The performance depends heavily on:

      - Choice of k (small k → overfitting, large k → underfitting)

      - Choice of distance metric (Euclidean, Manhattan, etc.)

      - Feature scaling, since KNN is distance-based.
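The two prediction rules above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not how scikit-learn implements KNN: the helper `knn_predict` and the toy arrays are invented for the example, Euclidean distance is assumed, and neighbors are unweighted.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Hypothetical helper: predict for one query point from its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors' labels
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: unweighted mean of the neighbors' target values
    return float(neighbor_targets.mean())

# Toy data (made up): two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([1.0, 1.2, 0.8, 5.0, 5.5, 5.2])

query = np.array([8.2, 8.7])
label = knn_predict(X_toy, y_class, query, k=3)
value = knn_predict(X_toy, y_reg, query, k=3, task="regression")
print("Predicted class:", label)
print("Predicted value:", round(value, 3))
```

The query sits inside the second cluster, so all three neighbors carry label 1 and the regression output is the mean of their three targets.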

2. What is the Curse of Dimensionality and how does it affect KNN
performance?
  - The Curse of Dimensionality refers to the fact that as the number of features (dimensions) increases:

      - Data points become sparser.

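      - Distances between points become less meaningful (the nearest and farthest points end up almost equally far away).

      - The number of samples needed to cover the feature space grows exponentially, and every distance computation becomes more expensive.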

  - In KNN, this causes distance metrics to lose discriminative power, making it harder to identify true nearest neighbors, thus degrading model accuracy.
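The loss of discriminative power can be observed directly on random data. The sketch below is an illustration under assumptions: points are drawn uniformly from the unit hypercube, and "relative contrast" (the gap between the farthest and nearest distance, divided by the nearest) is the measure chosen here; the function name and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

contrasts = {n_dims: relative_contrast(n_dims) for n_dims in (2, 10, 100, 1000)}
for n_dims, contrast in contrasts.items():
    print(f"{n_dims:>5} dims: relative contrast = {contrast:.2f}")
```

As the dimension grows, the contrast shrinks toward zero, which means the "nearest" neighbor is barely closer than any other point.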

3. What is Principal Component Analysis (PCA)? How is it different from
feature selection?
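  - PCA is a dimensionality reduction technique that transforms the original features into a new set of uncorrelated features (principal components). Each component is a linear combination of the original features, and the components are ordered so that each one captures as much of the remaining variance as possible.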

      - PCA is feature extraction (creates new features).

      - Feature selection simply chooses a subset of existing features without transforming them.
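The contrast between extraction and selection can be made concrete on the Wine dataset used later in this assignment. This is a sketch, with `SelectKBest` and the ANOVA F-score (`f_classif`) chosen here only as one example of a feature selection method; the choice of two features is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature selection: keeps 2 of the 13 original columns unchanged
selector = SelectKBest(score_func=f_classif, k=2).fit(X_std, y)
X_selected = selector.transform(X_std)
kept = selector.get_support(indices=True)
print("Selected original features:", [wine.feature_names[i] for i in kept])

# Feature extraction: builds 2 new columns, each mixing all 13 originals
pca_demo = PCA(n_components=2).fit(X_std)
X_extracted = pca_demo.transform(X_std)
print("Loadings matrix shape (components x original features):", pca_demo.components_.shape)
```

The selected columns are exact copies of original features and keep their names, whereas each PCA column has a loading on every original feature and no direct physical meaning.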

4. What are eigenvalues and eigenvectors in PCA, and why are they
important?
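  - Eigenvectors: The eigenvectors of the data's covariance matrix give the directions of the principal components, ordered from the direction of greatest variance downward.

  - Eigenvalues: Each eigenvalue is the amount of variance captured along its eigenvector.

  They are important because:

      - The largest eigenvalues indicate the directions that capture the most variance.

      - Each eigenvalue's share of the total (the explained variance ratio) helps decide how many components to retain.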
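To connect these definitions to the library output, the eigenvalues can be computed by hand and compared with scikit-learn. This is a sketch assuming standardized Wine data; the variable names are illustrative, and `numpy.linalg.eigh` is used because a covariance matrix is symmetric.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_std, rowvar=False)

# eigh returns eigenvalues in ascending order, so sort them descending
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Each eigenvalue's share of the total variance
manual_ratio = eigenvalues / eigenvalues.sum()
sklearn_ratio = PCA().fit(X_std).explained_variance_ratio_

print("Manual ratios match scikit-learn:", np.allclose(manual_ratio, sklearn_ratio))
```

The eigenvectors may differ from `PCA.components_` by a sign flip, which does not change the directions they describe.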

5. How do KNN and PCA complement each other when applied in a single
pipeline?
  - PCA reduces dimensionality, removing noise and redundancy.

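  - This makes distance computations in KNN cheaper and more meaningful, mitigating the curse of dimensionality.

  - Together, PCA + KNN often give faster and more robust models. Accuracy usually holds or improves, but this should be checked on held-out data, because PCA ignores the class labels and can discard low-variance directions that are still discriminative.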
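One way to combine the two steps is a scikit-learn pipeline, which refits the scaler and PCA inside every cross-validation fold so that held-out rows never influence the projection. The sketch below assumes the Wine dataset, two components, and `k=5`; these settings are illustrative, not tuned.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# Scale -> project -> classify, treated as a single estimator
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

# 5-fold cross-validated accuracy; each fold fits all three steps on its training part only
scores = cross_val_score(pca_knn, X_wine, y_wine, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```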




In [3]:
# 6. Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy without scaling:", accuracy_score(y_test, y_pred))

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn.fit(X_train_scaled, y_train)
y_pred_scaled = knn.predict(X_test_scaled)
print("Accuracy with scaling:", accuracy_score(y_test, y_pred_scaled))


Accuracy without scaling: 0.7407407407407407
Accuracy with scaling: 0.9629629629629629


In [4]:
# 7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

from sklearn.decomposition import PCA

pca = PCA()
pca.fit(StandardScaler().fit_transform(X))

print("Explained variance ratio:", pca.explained_variance_ratio_)


Explained variance ratio: [0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
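These ratios are easier to act on once accumulated. The sketch below reuses the values printed above and counts how many components are needed to reach a given share of the variance; the helper `components_needed` and the thresholds are illustrative assumptions.

```python
import numpy as np

# Explained variance ratios copied from the output above
ratios = np.array([
    0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294, 0.04935823,
    0.04238679, 0.02680749, 0.02222153, 0.01930019, 0.01736836, 0.01298233,
    0.00795215,
])
cumulative = np.cumsum(ratios)

def components_needed(threshold):
    """Smallest number of components whose cumulative ratio reaches the threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)

for threshold in (0.80, 0.90, 0.95):
    print(f"{threshold:.0%} of variance: {components_needed(threshold)} components")
```

With these values, the first two components capture about 55% of the variance, and ten components are needed to reach 95%.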


In [5]:
# 8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

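# Note: the scaler and PCA are fit on the full dataset before splitting, so the
# test rows influence the projection; fitting both on the training split only
# would avoid this leakage.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(StandardScaler().fit_transform(X))

# Same test_size and random_state as before, so y_train and y_test are unchanged
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)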

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca, y_train)
y_pred_pca = knn.predict(X_test_pca)
print("Accuracy with PCA (2 components):", accuracy_score(y_test, y_pred_pca))


Accuracy with PCA (2 components): 0.9814814814814815


In [6]:
# 9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

# Reuses X_train_scaled and X_test_scaled from the scaling cell (question 6)
# Euclidean distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_eu = knn_euclidean.predict(X_test_scaled)
print("Accuracy with Euclidean:", accuracy_score(y_test, y_pred_eu))

# Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
print("Accuracy with Manhattan:", accuracy_score(y_test, y_pred_manhattan))


Accuracy with Euclidean: 0.9629629629629629
Accuracy with Manhattan: 0.9629629629629629


10. You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.

      Due to the large number of features and a small number of samples, traditional models overfit.
  
  Explain how you would:

      ● Use PCA to reduce dimensionality

      ● Decide how many components to keep
    
      ● Use KNN for classification post-dimensionality reduction
  
      ● Evaluate the model

      ● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

      (Include your Python code and output in the code box below.)
  - Explanation for Stakeholders:

      - PCA compresses thousands of correlated gene features into a smaller set of uncorrelated components, discarding much of the noise and redundancy.

      - Keeping enough components to explain 95% of the variance is a common rule of thumb that retains most of the signal; the threshold can be tuned with cross-validation if needed.

      - KNN is a simple, easy-to-explain model that tends to work better after PCA, since distances are more meaningful in the reduced space.

      - This pipeline reduces the risk of overfitting and the computational cost, which suits biomedical data with few samples and many features. The scaler and PCA are fit on the training data only, so the test accuracy is an honest estimate.

In [2]:
# PCA + KNN on High-Dimensional Biomedical Data (Example with Cancer Dataset)

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load dataset (a stand-in for a gene expression dataset; it has 569 samples and only 30 features)
data = load_breast_cancer()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features (important for PCA & KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# PCA: keep 95% variance (decides components automatically)
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print("Number of components retained:", pca.n_components_)

# Train KNN on PCA-transformed data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca, y_train)
y_pred = knn.predict(X_test_pca)

# Evaluate model
print("Accuracy with PCA + KNN:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Number of components retained: 10
Accuracy with PCA + KNN: 0.9649122807017544

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.94      0.95        63
           1       0.96      0.98      0.97       108

    accuracy                           0.96       171
   macro avg       0.97      0.96      0.96       171
weighted avg       0.96      0.96      0.96       171
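A single train-test split can be noisy when samples are scarce, so the evaluation and the choice of component count can also be handled with cross-validation. The sketch below is one possible setup, not the only one: the grid values, the step names, and the use of `GridSearchCV` with a shuffled `StratifiedKFold` are assumptions made for illustration, again on the breast cancer stand-in dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# All preprocessing lives inside the pipeline, so it is refit on each training fold
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid: variance threshold for PCA and neighborhood size for KNN
param_grid = {
    "pca__n_components": [0.90, 0.95, 0.99],
    "knn__n_neighbors": [3, 5, 7],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X_bc, y_bc)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

The best cross-validated score is itself optimistically biased because the grid was chosen on the same folds; a separate held-out set or nested cross-validation gives a cleaner final estimate.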

