# KNN & PCA | Assignment

**Question 1:** What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

**Ans.** K-Nearest Neighbors (KNN) is a simple, non-parametric, and instance-based supervised learning algorithm used for both classification and regression tasks. It works by finding the ‘k’ closest data points (neighbors) to a given input and making predictions based on the majority label (for classification) or the average value (for regression) of those neighbors. In classification, KNN assigns the class most common among its k nearest neighbors to the input data point. In regression, KNN calculates the mean (or sometimes the median) of the k nearest neighbors’ output values and assigns this as the predicted value. The distance between data points is usually measured using metrics such as Euclidean distance. KNN does not learn a model in the training phase; instead, it stores the entire training dataset, and predictions are made during the testing phase. The choice of ‘k’ and the distance metric significantly impact the performance of the algorithm.
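The mechanics described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a replacement for a library implementation: the helper name `knn_predict` and the toy data are made up for this example, and ties are broken arbitrarily.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point using its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors' labels
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' target values
    return float(np.mean(neighbor_targets))


# Toy data (made up for illustration): two clusters on a line
X_toy = [[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]]
labels = ["low", "low", "low", "high", "high", "high"]
values = [10.0, 12.0, 8.0, 50.0, 52.0, 48.0]

predicted_label = knn_predict(X_toy, labels, [1.1], k=3, task="classification")
predicted_value = knn_predict(X_toy, values, [1.1], k=3, task="regression")
print(predicted_label, predicted_value)  # low 10.0
```

Note that there is no fitting step: the function only needs the stored training data at prediction time, which is what "instance-based" means here.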

**Question 2:** What is the Curse of Dimensionality and how does it affect KNN
performance?

**Ans.** The Curse of Dimensionality refers to the various problems and challenges that arise when analyzing and organizing data in high-dimensional spaces. As the number of dimensions (features) increases, the volume of the space increases exponentially, and data points become sparse and spread out. This sparsity makes it difficult for machine learning algorithms, especially distance-based models like K-Nearest Neighbors (KNN), to find meaningful patterns.

KNN relies heavily on the concept of proximity or similarity between data points, typically using distance metrics such as Euclidean distance. In low-dimensional spaces, the difference in distances between the nearest and farthest neighbors is usually significant, making it easier to distinguish between relevant and irrelevant neighbors. However, in high-dimensional spaces, the distances between data points tend to become similar, making it difficult to identify truly "nearest" neighbors. This leads to poor model performance, as the algorithm may include noisy, irrelevant, or distant points in its neighborhood calculations.
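The distance-concentration effect can be observed with a small simulation. The sketch below assumes points drawn uniformly at random from the unit hypercube (an illustrative choice, not something taken from a real dataset) and measures how much farther the farthest point is than the nearest one, relative to the nearest distance. The helper name `relative_contrast` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for n_dims, contrast in contrasts.items():
    print(f"{n_dims:>4} dimensions: relative contrast = {contrast:.2f}")
```

As the number of dimensions grows, the printed contrast should shrink toward zero, meaning the "nearest" neighbor is barely closer than the farthest point.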

Moreover, high-dimensional data often includes many irrelevant or redundant features, which can further distort the distance metrics and negatively impact the accuracy of KNN. More dimensions also raise the computational cost: the stored training set takes more memory, and every prediction requires distance calculations across more features.

To mitigate the effects of the Curse of Dimensionality in KNN, it helps to apply dimensionality reduction techniques such as Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA), or feature selection methods. These techniques reduce the number of dimensions by keeping only the most informative components or features, which improves the reliability of distance measurements and can enhance the overall performance of the KNN algorithm.

**Question 3:** What is Principal Component Analysis (PCA)? How is it different from feature selection?

**Ans.** Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features in a dataset while retaining as much of the original variability (information) as possible. It transforms the original correlated features into a new set of uncorrelated variables called principal components. These principal components are ordered so that the first captures the maximum variance in the data, the second captures the next highest variance, and so on. By keeping only the top few principal components, PCA reduces the dimensionality of the data, helping to simplify models, reduce noise, and speed up computation.

How PCA Differs from Feature Selection

* Feature Extraction vs. Feature Selection: PCA is a feature extraction method that creates new features by combining the original ones into principal components. Feature selection, on the other hand, involves selecting a subset of the existing features without creating new ones.
* Interpretability: The principal components created by PCA are linear combinations of the original features and are usually not easy to interpret. Feature selection retains the original features, making the results more interpretable.
* Objective: PCA focuses on capturing the maximum variance in the data, regardless of the relevance to the target variable. Feature selection aims to select features that are most relevant and useful for predicting the target.
* Methodology: PCA uses mathematical transformations based on eigenvalues and eigenvectors. Feature selection uses statistical tests, correlation analysis, or model-based importance measures to pick relevant features.
* Use Cases: PCA is useful when dimensionality reduction is needed but interpretability is less critical. Feature selection is preferred when preserving the meaning of features is important for understanding and explaining the model.

For example, if you have 50 features and apply PCA to reduce them to 10 principal components, those 10 components are new features made from combinations of the original 50. In contrast, feature selection would simply keep 10 of the original 50 features based on their importance.
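The same contrast can be shown on a smaller scale with the Wine dataset (13 features reduced to 3). This is a sketch under assumptions: `SelectKBest` with the ANOVA F-test (`f_classif`) is just one of many possible feature selection methods, chosen here for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target

# Feature extraction: 3 new columns, each a weighted mix of all 13 features
pca = PCA(n_components=3).fit(X_scaled)
X_extracted = pca.transform(X_scaled)
print("PCA loadings shape:", pca.components_.shape)

# Feature selection: 3 of the original columns, chosen using the labels
selector = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, y)
X_selected = selector.transform(X_scaled)
kept_names = [data.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", kept_names)
```

Both outputs have three columns, but only the selected ones still carry the original feature names. PCA never looks at `y`, whereas this selector ranks features by their relationship to the target.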

**Question 4:** What are eigenvalues and eigenvectors in PCA, and why are they
important?

**Ans.** Eigenvalues and eigenvectors are fundamental concepts in PCA that help identify the directions in which the data varies the most and quantify the importance of those directions.

What Are Eigenvalues and Eigenvectors?

* Eigenvectors are vectors that define directions in the feature space along which data variation is measured. In PCA, each eigenvector represents a principal component — a new axis that the data can be projected onto.
* Eigenvalues are scalars associated with each eigenvector that indicate the amount of variance (or information) captured along that direction. A higher eigenvalue means the corresponding eigenvector captures more of the data’s variance.

Why Are They Important in PCA?

* Dimensionality Reduction: PCA uses eigenvectors to find new axes (principal components) that maximize data variance. These axes are the directions where the data spreads out the most.
* Variance Quantification: Eigenvalues tell us how much variance each principal component explains. This helps in deciding how many principal components to keep: components with larger eigenvalues are usually retained because they capture the most variance.
* Data Transformation: By projecting the original data onto the eigenvectors, PCA transforms the data into a new coordinate system where features are uncorrelated, making it easier to analyze and model.
* Noise Reduction: Components with small eigenvalues often represent noise or less informative parts of the data. Removing these helps in cleaning the dataset and improving model performance.

In summary, eigenvectors determine the directions of maximum variance in the data, and eigenvalues quantify how significant each direction is. Together, they enable PCA to effectively reduce dimensionality while preserving essential information.
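To make the link concrete, the sketch below computes the eigenvalues and eigenvectors of the covariance matrix by hand on the standardized Wine data and compares the result with scikit-learn's `PCA`. It assumes the covariance-matrix formulation of PCA; library implementations typically use an SVD internally, so individual components may differ in sign.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of total variance carried by each direction
explained_ratio = eigenvalues / eigenvalues.sum()

# Projecting onto the top 2 eigenvectors gives the 2-D PCA representation
X_projected = X_scaled @ eigenvectors[:, :2]

pca = PCA(n_components=2).fit(X_scaled)
print("Manual ratios: ", np.round(explained_ratio[:2], 4))
print("sklearn ratios:", np.round(pca.explained_variance_ratio_, 4))
```

The two printed lines should agree, which shows that the explained variance ratio is simply each eigenvalue divided by the sum of all eigenvalues.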

**Question 5:** How do KNN and PCA complement each other when applied in a single
pipeline?

**Ans.** KNN and PCA complement each other well when combined in a single machine learning pipeline, especially when dealing with high-dimensional data.

Why Combine PCA and KNN

* Dimensionality Reduction: PCA reduces the number of features by transforming the data into a lower-dimensional space while preserving most of the variance. This helps simplify the data and reduces noise, which improves KNN’s effectiveness.
* Mitigating the Curse of Dimensionality: KNN relies on distance calculations, which become less meaningful as the number of dimensions increases. By applying PCA first, the feature space is compressed, making distance metrics more reliable and helping KNN find truly nearest neighbors.
* Speeding Up Computation: High-dimensional data increases the computational cost of KNN because distances must be calculated across many features. PCA reduces the feature count, which lowers computational complexity and speeds up predictions.
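* Improving Accuracy: By discarding low-variance components, which often carry noise, PCA lets KNN focus on the most informative directions. This can lead to more accurate classification or regression results, although a gain is not guaranteed because low-variance directions are not always irrelevant to the target.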
* Simplifying Data Visualization: PCA’s reduced dimensions make it easier to visualize data and understand how KNN classifies or predicts, aiding in model interpretation.

In summary, PCA serves as a preprocessing step that enhances KNN’s performance by reducing dimensionality, improving distance calculations, and speeding up computation, making the combined pipeline more efficient and effective.

The following Python example applies PCA for dimensionality reduction followed by K-Nearest Neighbors (KNN) classification on the Wine dataset from scikit-learn.

    from sklearn.datasets import load_wine
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import accuracy_score

    # Load the Wine dataset
    data = load_wine()
    X, y = data.data, data.target

    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

    # Standardize features (important before PCA and KNN)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Apply PCA to reduce dimensionality (e.g., to 2 components for visualization or 5 for balance)
    pca = PCA(n_components=5)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    # Initialize KNN classifier (k=3 as an example)
    knn = KNeighborsClassifier(n_neighbors=3)

    # Train the KNN model on PCA-transformed data
    knn.fit(X_train_pca, y_train)

    # Predict on test data
    y_pred = knn.predict(X_test_pca)

    # Evaluate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"KNN classification accuracy after PCA: {accuracy:.4f}")

    # Optional: Explained variance ratio by PCA components
    print("Explained variance ratio by PCA components:", pca.explained_variance_ratio_)

In [1]:
'''Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.'''

from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X, y = data.data, data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# ---- Without Feature Scaling ----
knn_no_scaling = KNeighborsClassifier(n_neighbors=3)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# ---- With Feature Scaling ----
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_with_scaling = KNeighborsClassifier(n_neighbors=3)
knn_with_scaling.fit(X_train_scaled, y_train)
y_pred_with_scaling = knn_with_scaling.predict(X_test_scaled)
accuracy_with_scaling = accuracy_score(y_test, y_pred_with_scaling)

# Print the results
print(f"Accuracy without feature scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy with feature scaling: {accuracy_with_scaling:.4f}")

Accuracy without feature scaling: 0.6852
Accuracy with feature scaling: 0.9444


In [2]:
'''Question 7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.'''

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the Wine dataset
data = load_wine()
X = data.data

# Standardize the features before applying PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train PCA model (keep all components)
pca = PCA(n_components=X.shape[1])
pca.fit(X_scaled)

# Print explained variance ratio for each principal component
explained_variance = pca.explained_variance_ratio_
for i, variance_ratio in enumerate(explained_variance, start=1):
    print(f"Principal Component {i}: {variance_ratio:.4f}")

Principal Component 1: 0.3620
Principal Component 2: 0.1921
Principal Component 3: 0.1112
Principal Component 4: 0.0707
Principal Component 5: 0.0656
Principal Component 6: 0.0494
Principal Component 7: 0.0424
Principal Component 8: 0.0268
Principal Component 9: 0.0222
Principal Component 10: 0.0193
Principal Component 11: 0.0174
Principal Component 12: 0.0130
Principal Component 13: 0.0080


In [3]:
'''Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.'''

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Standardize the features (important for both PCA and KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ----- KNN on original data -----
knn_original = KNeighborsClassifier(n_neighbors=3)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
accuracy_original = accuracy_score(y_test, y_pred_original)

# ----- PCA transformation (top 2 components) -----
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# ----- KNN on PCA-transformed data -----
knn_pca = KNeighborsClassifier(n_neighbors=3)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

# Print accuracies
print(f"Accuracy with original features: {accuracy_original:.4f}")
print(f"Accuracy with top 2 PCA components: {accuracy_pca:.4f}")

Accuracy with original features: 0.9444
Accuracy with top 2 PCA components: 0.9444


In [5]:
'''Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.'''

from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN with Euclidean distance (equivalent to the default Minkowski metric with p=2)
knn_euclidean = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# KNN with Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=3, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Print the results
print(f"Accuracy with Euclidean distance: {accuracy_euclidean:.4f}")
print(f"Accuracy with Manhattan distance: {accuracy_manhattan:.4f}")

Accuracy with Euclidean distance: 0.9444
Accuracy with Manhattan distance: 0.9630


**Question 10:** You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:

* Use PCA to reduce dimensionality
* Decide how many components to keep
* Use KNN for classification post-dimensionality reduction
* Evaluate the model
* Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

(Include your Python code and output.)
**Ans.**

✅ Step-by-Step Solution:

1. Use PCA to Reduce Dimensionality

High-dimensional data can contain redundant or noisy features. PCA transforms the data into a smaller set of uncorrelated components that retain the majority of the variance (information).

2. Decide How Many Components to Keep

We'll examine the explained variance ratio and choose the smallest number of components that together explain at least 95% of the total variance. This balances dimensionality reduction and information retention.
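As a shortcut for the 95% rule, scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components` and then keeps just enough components to explain that fraction of the variance. The sketch below uses a synthetic dataset as a stand-in for gene expression data (an assumption for illustration, not real measurements).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a gene expression matrix (not real data)
X, y = make_classification(n_samples=100, n_features=1000, n_informative=50,
                           n_classes=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA for the smallest number of components whose
# cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print("Components kept:", pca.n_components_)
print("Variance retained:", round(pca.explained_variance_ratio_.sum(), 4))
```

Here PCA is fit on all samples only to show the mechanism. When estimating model performance, the same `PCA(n_components=0.95)` step would be fit on the training portion of each split.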

3. Use KNN for Classification

Once PCA reduces dimensionality, KNN is used to classify cancer types based on the reduced data. KNN is simple and effective when dimensionality is controlled.

4. Evaluate the Model

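We’ll use accuracy score along with cross-validation to get a reliable estimate of performance, since sample size is small. To avoid leaking information from held-out samples, the scaler and PCA should be fit only on the training portion of each split.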

5. Justify the Pipeline

This PCA + KNN pipeline:

* Reduces overfitting by eliminating redundant dimensions

* Improves generalization by retaining key signals in data

* Simplifies the model and makes it computationally efficient

* Relies on well-established techniques that are widely applied to high-dimensional omics data


      from sklearn.datasets import make_classification
      from sklearn.decomposition import PCA
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.model_selection import train_test_split, cross_val_score
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import StandardScaler
      import matplotlib.pyplot as plt
      import numpy as np

      # Simulate a gene expression-like dataset: 100 samples, 1000 features
      X, y = make_classification(n_samples=100, n_features=1000, n_informative=50, n_classes=3, random_state=42)

      # Split first so that scaling and PCA are never fit on the test samples
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

      # Standardize using statistics from the training set only
      scaler = StandardScaler()
      X_train_scaled = scaler.fit_transform(X_train)

      # Fit PCA with all components on the training set to check explained variance
      pca_full = PCA()
      pca_full.fit(X_train_scaled)

      # Plot cumulative explained variance
      cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
      plt.figure(figsize=(8, 5))
      plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
      plt.xlabel('Number of Principal Components')
      plt.ylabel('Cumulative Explained Variance')
      plt.title('Explained Variance vs. Number of Components')
      plt.grid(True)
      plt.axhline(y=0.95, color='r', linestyle='--', label='95% Variance')
      plt.legend()
      plt.show()

      # Smallest number of components whose cumulative variance reaches 95%
      n_components_95 = int(np.argmax(cumulative_variance >= 0.95)) + 1
      print(f"Number of components to retain 95% variance: {n_components_95}")

      # Chain scaling, PCA, and KNN so each step is fit on training data only.
      # n_components=0.95 lets PCA choose the component count for 95% variance on
      # whatever data it is fit on, which matters because each CV fold has fewer samples.
      model = Pipeline([
          ("scaler", StandardScaler()),
          ("pca", PCA(n_components=0.95, svd_solver="full")),
          ("knn", KNeighborsClassifier(n_neighbors=3)),
      ])
      model.fit(X_train, y_train)
      accuracy = model.score(X_test, y_test)
      print(f"Test Accuracy (PCA + KNN): {accuracy:.4f}")

      # Cross-validation for robustness; the whole pipeline is refit inside each fold
      cv_scores = cross_val_score(model, X, y, cv=5)
      print(f"Cross-Validation Accuracy (mean ± std): {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")