# KNN & PCA

# 1. What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?
 - **K-Nearest Neighbors (KNN)** is a non-parametric, instance-based ("lazy") learning algorithm: it does not fit an explicit model during training but simply stores the training samples. To predict for a new point, it computes the distance (commonly Euclidean) from that point to every stored sample, picks the *k* closest ones, and combines their targets. In **classification**, the prediction is the majority class among the k neighbors, optionally weighted by inverse distance so that closer neighbors count more. In **regression**, the prediction is the mean (or distance-weighted mean) of the neighbors' target values. The choice of k controls the bias-variance trade-off: a small k follows the training data closely and is sensitive to noise, while a large k gives smoother predictions that may underfit.
Because everything depends on distances, KNN needs features on comparable scales and degrades when there are many irrelevant or noisy dimensions, which is why it is often combined with feature scaling and with dimensionality reduction such as PCA.
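
As a minimal illustration of how KNN itself produces a prediction, the sketch below implements the neighbor search by hand with NumPy and uses it for both a classification and a regression target. The function name `knn_predict` and the toy data are made up for this example; it is a teaching sketch, not a replacement for scikit-learn's estimators.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point from its k nearest training samples."""
    # Euclidean distance from the query to every training sample
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy data (invented for illustration): two well-separated groups
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 48.0])

query = np.array([1.1, 1.0])
label = knn_predict(X_train, y_class, query, k=3, task="classification")
value = knn_predict(X_train, y_value, query, k=3, task="regression")
print("Predicted class:", label)
print("Predicted value:", value)
```

The only difference between the two tasks is the last step: a vote for classification, an average for regression.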


# 2. What is the Curse of Dimensionality and how does it affect KNN performance?
 - The **Curse of Dimensionality** refers to a set of problems that appear when data has a very large number of features (dimensions). As the number of dimensions grows, the volume of the feature space grows exponentially, so a fixed number of data points becomes extremely sparse and patterns become harder for algorithms to detect. For **K-Nearest Neighbors (KNN)**, this is a serious issue because KNN relies entirely on distance calculations to find the closest neighbors. In high dimensions, the distances from a query point to its nearest and farthest neighbors become almost the same, so the model cannot clearly identify which points are truly “close.” As a result, KNN performs poorly: the neighbors it picks are largely arbitrary, irrelevant or noisy features contribute to every distance and can swamp the informative ones, each distance computation becomes slower, and predictions become unreliable. This is why **PCA** or other dimensionality reduction techniques are often applied before KNN, so the algorithm can work in a smaller, denser, and more meaningful feature space.
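
The loss of contrast between near and far points can be observed directly. The sketch below draws uniform random points (an assumption made purely for illustration) and measures the relative gap between the farthest and nearest distance from one query point as the number of dimensions grows. The helper `relative_contrast` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one query to random points."""
    points = rng.random((n_points, n_dims))   # uniform points in the unit cube
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

contrasts = {dims: relative_contrast(dims) for dims in (2, 10, 100, 1000)}
for dims, c in contrasts.items():
    print(f"{dims:>5} dims: relative contrast = {c:.3f}")
```

The contrast should shrink sharply as dimensions are added, which is the effect that makes "nearest" neighbors hard to distinguish from any other point.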


# 3. What is Principal Component Analysis (PCA)? How is it different from feature selection?
 - **Principal Component Analysis (PCA)** is a mathematical technique for reducing the dimensionality of large datasets. It transforms the original, possibly correlated features into a new set of uncorrelated variables called *principal components*. Each component is a linear combination of the original features, and the components are ordered by the variance they capture: the first captures the most, the second the next most, and so on. Keeping only the leading components compresses the data into fewer dimensions while retaining most of its variance, which helps simplify complex datasets, improve computational efficiency, and reduce overfitting, especially in high-dimensional fields like gene expression or image processing. Note that PCA is unsupervised: it ranks directions by variance, not by how useful they are for predicting a label, so discarded low-variance directions are usually, but not always, noise. This approach is fundamentally different from **feature selection**, which chooses a subset of the original features based on importance, relevance, or statistical criteria without creating new variables. Feature selection keeps the original columns and their meaning intact, while PCA moves the data into a new coordinate system in which the features are replaced by synthetic components. PCA is therefore a form of **feature extraction**, while feature selection filters or ranks existing features. The right choice depends on whether you want to *simplify the data structure* or *preserve original interpretability*.
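
The contrast between the two approaches is easy to see on the Wine dataset. The sketch below keeps 3 original columns with `SelectKBest` (one possible selection method, chosen here only as an example) and separately builds 3 principal components, each of which mixes all 13 columns.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target

# Feature selection: keep 3 of the original columns (supervised, uses y)
selector = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, y)
selected_names = [data.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", selected_names)

# Feature extraction: build 3 new axes from all 13 columns (unsupervised, ignores y)
pca = PCA(n_components=3).fit(X_scaled)
print("PCA loadings shape:", pca.components_.shape)  # one row of 13 weights per component
```

The selected features still carry their chemical names, whereas each principal component is only describable through its 13 loadings.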


# 4. What are eigenvalues and eigenvectors in PCA, and why are they important?
 - In PCA, **eigenvalues and eigenvectors** come from the covariance matrix of the (centered) dataset and define the new feature space. Each **eigenvector** is a direction in the original feature space, and its **eigenvalue** is the variance of the data along that direction. The eigenvector with the largest eigenvalue is the direction along which the data varies the most; each later one is the direction of greatest remaining variance that is orthogonal to all earlier ones. In simple terms, eigenvectors give the *axes* of the new transformed space (the principal components), and eigenvalues tell you the *importance* or *strength* of each axis. PCA sorts the eigenvalues from largest to smallest so that the components with the highest variance come first, and dividing each eigenvalue by the sum of all eigenvalues gives that component's explained variance ratio. Components with very small eigenvalues capture little variance and are usually discarded during dimensionality reduction. Eigenvalues and eigenvectors are therefore essential: they determine which directions of variation are kept and which are treated as noise, allowing PCA to reduce dimensionality while preserving as much variance as possible.
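
To connect the description to actual numbers, the sketch below computes the eigen-decomposition of the covariance matrix by hand with NumPy and compares it with scikit-learn's `PCA` on the standardized Wine data. It is a minimal illustration; scikit-learn itself normally uses an SVD-based solver rather than this explicit route.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized (hence centered) features: shape (13, 13)
cov = np.cov(X_scaled, rowvar=False)

# eigh is intended for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # columns are the principal directions

# Fraction of total variance carried by each direction
explained_ratio = eigenvalues / eigenvalues.sum()

# Project the data onto the top 2 directions
X_projected = X_scaled @ eigenvectors[:, :2]

pca = PCA().fit(X_scaled)
print("Eigenvalues match PCA variances:", np.allclose(eigenvalues, pca.explained_variance_))
print("Ratios match:", np.allclose(explained_ratio, pca.explained_variance_ratio_))
```

The projected coordinates can differ from `pca.transform` by a sign flip per component, since an eigenvector and its negative describe the same axis.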


# 5. How do KNN and PCA complement each other when applied in a single pipeline?
 - Using **PCA and KNN together in a single pipeline** is effective because PCA turns a high-dimensional, noisy dataset into a smaller feature space in which KNN's distance calculations are more meaningful. In very large feature spaces, KNN suffers from the Curse of Dimensionality: distances lose their meaning, noise dominates, and the algorithm fails to identify true nearest neighbors, leading to poor accuracy and unstable predictions. PCA addresses this by keeping only the directions of highest variance, merging correlated (redundant) features, and compressing the data into a few principal components while discarding low-variance directions, which are often noise. When KNN is applied to these components, distance-based decisions tend to be clearer and more reliable because the transformed space is compact and less sparse. The combination usually reduces overfitting and speeds up computation, especially for biomedical, gene expression, image, and other high-dimensional applications where the raw data is too large for KNN alone to handle effectively. Two caveats apply: PCA is unsupervised, so a low-variance direction that happens to separate the classes can be lost, and both the scaler and PCA must be fitted on the training data only to avoid leaking test information.
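
One way to combine the steps is scikit-learn's `make_pipeline`, which refits the scaler and PCA on each training fold during cross-validation. The sketch below uses the Wine dataset with 5 components and k = 5; both values are arbitrary choices for illustration, not tuned settings.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> reduce -> classify, treated as one estimator
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),                 # illustrative choice
    KNeighborsClassifier(n_neighbors=5), # illustrative choice
)

# With an integer cv and a classifier, scikit-learn uses stratified folds
scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))
```

Wrapping the steps in one estimator keeps preprocessing inside each fold, so the reported accuracy is not inflated by information from held-out samples.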


# 6. Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.
 - When training a KNN classifier on the Wine dataset, feature scaling is extremely important because KNN is a distance-based algorithm: it measures the closeness of samples using (by default) Euclidean distance, and a feature with a much larger numeric range than the others dominates that distance. Without scaling, a feature such as “proline,” whose values range from the hundreds to over a thousand, overpowers measurements such as “hue” or “nonflavanoid phenols,” whose values are around 1 or below, so KNN picks the wrong nearest neighbors and accuracy suffers. After applying StandardScaler, which rescales each feature to mean 0 and standard deviation 1, every feature contributes comparably to the distance. In the run below, with a single 80/20 split and k = 5, accuracy rises from about 0.72 without scaling to about 0.94 with scaling. The exact numbers depend on the split, but the pattern is typical: for distance-based models like KNN, normalization or standardization is almost always a necessary preprocessing step.

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
data = load_wine()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# -----------------------------
# 1. KNN WITHOUT SCALING
# -----------------------------
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
y_pred_no_scale = knn_no_scale.predict(X_test)
acc_no_scale = accuracy_score(y_test, y_pred_no_scale)

# -----------------------------
# 2. KNN WITH SCALING
# -----------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

acc_no_scale, acc_scaled


(0.7222222222222222, 0.9444444444444444)
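
Because the comparison above rests on a single split with 36 test samples, a cross-validated check is a useful complement. The sketch below repeats the scaled versus unscaled comparison with 5-fold cross-validation; the fold count is an arbitrary choice made for this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

unscaled = KNeighborsClassifier(n_neighbors=5)
# The pipeline refits the scaler on each training fold, so no test data leaks in
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

unscaled_scores = cross_val_score(unscaled, X, y, cv=5)
scaled_scores = cross_val_score(scaled, X, y, cv=5)

print(f"Without scaling: {unscaled_scores.mean():.3f} +/- {unscaled_scores.std():.3f}")
print(f"With scaling:    {scaled_scores.mean():.3f} +/- {scaled_scores.std():.3f}")
```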

# 7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.
 - When we train a PCA model on the Wine dataset, the first step is to standardize the features, because PCA is highly sensitive to differences in scale: without standardization, features with larger numerical ranges would dominate the variance. After scaling the data and fitting PCA, we obtain the *explained variance ratio* of each principal component, which is the fraction of the dataset's total variance captured by that component. In the output below, PC1 explains about 36.2% of the variance, PC2 about 19.2%, and PC3 about 11.1%, so the first three components together account for roughly 66.5%, while the remaining ten components contribute progressively smaller amounts. Printing the explained variance ratio helps decide how many components are worth keeping: large ratios mark important components, while very small ratios mark components that add little information. This guides the choice of the number of principal components, so that the Wine dataset is simplified while most of its original structure is preserved.


In [2]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load dataset
data = load_wine()
X = data.data

# Step 1: Scale the data (very important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Apply PCA
pca = PCA()
pca.fit(X_scaled)

# Step 3: Print explained variance ratio
print("Explained Variance Ratio of Each Principal Component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f}")


Explained Variance Ratio of Each Principal Component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080
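
Summing the ratios printed above gives the cumulative explained variance, which is what the usual 90–95% rule is applied to. The sketch below computes the running total and the smallest number of components that reaches 95%; the 0.95 threshold is the conventional value mentioned in this document, not a requirement.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)
pca = PCA().fit(X_scaled)

# Running total of the explained variance ratios
cumulative = np.cumsum(pca.explained_variance_ratio_)

# First position where the running total reaches the threshold, as a count
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1

for i, total in enumerate(cumulative, start=1):
    print(f"PC1..PC{i}: {total:.4f}")
print("Components needed for 95% of the variance:", n_components_95)
```

Passing a float such as `PCA(n_components=0.95)` asks scikit-learn to make the same choice automatically.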


# 8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.
 - Training a KNN classifier on the PCA-transformed Wine dataset shows how dimensionality reduction affects classification performance. The original dataset contains 13 chemical features, and KNN trained on all 13 scaled features reaches an accuracy of about 0.94 on this test split. After applying PCA and keeping only the top 2 principal components, the dataset becomes much simpler and easy to visualize, but the two components retain only a little over half of the total variance (roughly 55%, judging by the explained variance ratios printed in question 7), so information from the remaining components is lost. One might therefore expect accuracy to drop slightly. In this run, however, KNN on the 2-component data scores 1.0, which is higher than on the full feature set. A likely explanation is that the three wine classes separate well along the first two components and that discarding the other directions removes variation that does not help KNN's distance calculations. The test set holds only 36 samples, so the gap between 0.94 and 1.0 amounts to two samples and should not be over-interpreted; a different split or cross-validation could change the ordering. The comparison shows that PCA can reduce dimensionality and computational cost while maintaining, and here even improving, accuracy, although keeping too few components can hurt performance on other datasets.

In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

# Load dataset
data = load_wine()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ------------------------------
# 1. KNN on ORIGINAL dataset (scaled)
# ------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
acc_original = accuracy_score(y_test, y_pred_original)

# ------------------------------
# 2. PCA (Top 2 Components)
# ------------------------------
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# KNN on PCA-transformed data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

acc_original, acc_pca


(0.9444444444444444, 1.0)
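
Since one split with 36 test samples cannot settle how many components are enough, a sweep over several component counts with cross-validation gives a steadier picture. The sketch below is one way to set that up; the list of component counts and the 5 folds are arbitrary choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

results = {}
for n in (1, 2, 3, 5, 8, 13):
    # Scaler and PCA are refitted inside every training fold
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=n),
        KNeighborsClassifier(n_neighbors=5),
    )
    results[n] = cross_val_score(model, X, y, cv=5).mean()

for n, score in results.items():
    print(f"{n:>2} components: mean CV accuracy = {score:.3f}")
```

With 13 components PCA only rotates the data, so that row serves as the no-reduction baseline.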

# 9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results
 - Training a KNN classifier with different distance metrics on the scaled Wine dataset helps us understand how the choice of distance affects model behavior. After scaling the features, which is crucial because KNN is sensitive to feature magnitude, we evaluate two common metrics: Euclidean distance, which measures straight-line distance (the square root of the summed squared differences), and Manhattan distance, which sums the absolute differences along each dimension. Because Euclidean distance squares the differences, a large gap in a single feature weighs more heavily than under Manhattan distance, which makes Manhattan somewhat more robust to outliers in individual features. In the run below, both metrics give exactly the same accuracy (about 0.94), so on this split the choice of metric makes no difference. That is common on well-scaled, low-noise data, where both metrics tend to pick largely the same neighbors. KNN performance depends on the number of neighbors, the scaling, and the distance metric together, so the metric is best treated as a hyperparameter and chosen by cross-validation rather than assumed in advance.

In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_wine()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ------------------------------
# KNN with EUCLIDEAN distance
# ------------------------------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# ------------------------------
# KNN with MANHATTAN distance
# ------------------------------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

acc_euclidean, acc_manhattan


(0.9444444444444444, 0.9444444444444444)
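
Both metrics are special cases of the Minkowski distance, with p = 1 for Manhattan and p = 2 for Euclidean. The sketch below first computes the two distances by hand for a pair of made-up vectors, then checks that scikit-learn's `metric="minkowski"` with the matching `p` yields the same predictions as the named metrics on the same split used above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hand computation on two invented vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])
manhattan = np.sum(np.abs(a - b))          # |3| + |2| + |0|
euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(9 + 4 + 0)
print("Manhattan:", manhattan, "Euclidean:", euclidean)

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

def predict(**kwargs):
    """Fit a 5-neighbor KNN with the given metric settings and predict the test set."""
    model = KNeighborsClassifier(n_neighbors=5, **kwargs)
    return model.fit(X_train_s, y_train).predict(X_test_s)

same_p1 = np.array_equal(predict(metric="manhattan"), predict(metric="minkowski", p=1))
same_p2 = np.array_equal(predict(metric="euclidean"), predict(metric="minkowski", p=2))
print("p=1 matches manhattan:", same_p1)
print("p=2 matches euclidean:", same_p2)
```

Treating `p` as a tunable parameter is a compact way to search over this family of metrics.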

# 10. You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit.
 - When working with a high-dimensional gene expression dataset where thousands of gene features are measured for only a small number of patients, PCA followed by KNN is a sensible strategy for reducing overfitting. First, the features are standardized and PCA reduces dimensionality by transforming the gene expression values into a smaller number of principal components that capture the major variation in the data while collapsing redundant correlations between genes. These components are the directions of highest variance, so the model works with the dominant expression patterns instead of thousands of noisy individual measurements. To decide how many components to keep, we examine the **explained variance ratio**, keeping enough components to preserve 90–95% of the total variance or choosing the “elbow point” beyond which additional components add very little; the number of components can also be tuned by cross-validation. After dimensionality reduction, KNN is applied to the PCA-transformed data, because distance calculations are more meaningful in low-dimensional spaces and true nearest neighbors can be identified more reliably, which mitigates the curse of dimensionality. The model is evaluated using stratified cross-validation or a stratified train-test split so that every cancer type is represented in each split, with the scaler and PCA fitted on the training portion only. Metrics such as accuracy, precision, recall, F1-score, the confusion matrix, and ROC-AUC show how well the model distinguishes between cancer types, especially rare or overlapping classes.
To stakeholders, this pipeline can be justified as a robust approach because PCA reduces noise and the risk of overfitting and improves computational efficiency, while KNN is a simple, transparent, non-parametric classifier whose predictions can be explained by pointing to the most similar known patients. One trade-off should be stated openly: principal components are mixtures of many genes, so they are harder to interpret biologically than individual genes, although the loadings can be inspected to see which genes drive each component. Together, PCA and KNN form a practical and defensible approach for cancer classification when sample sizes are small and feature dimensionality is extremely large.


In [5]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

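# -----------------------------------------
# SIMULATE HIGH-DIMENSIONAL GENE DATA
# 200 samples, 5000 gene features
# 3 cancer classes
# NOTE: features and labels are independent random draws, so the data
# contains no real signal. Expect accuracy near chance (about 1/3) and a
# flat explained-variance profile; this cell only shows the mechanics.
# -----------------------------------------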
np.random.seed(42)
X = np.random.randn(200, 5000)
y = np.random.randint(0, 3, 200)

# -----------------------------------------
# TRAIN-TEST SPLIT
# -----------------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# -----------------------------------------
# STEP 1: SCALE THE DATA
# -----------------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# -----------------------------------------
# STEP 2: APPLY PCA
# Keep 95% of variance
# -----------------------------------------
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print("Original Features:", X_train.shape[1])
print("Reduced PCA Components:", X_train_pca.shape[1])
print("\nExplained Variance Ratio per Component:")
print(pca.explained_variance_ratio_)

# -----------------------------------------
# STEP 3: TRAIN KNN ON PCA DATA
# -----------------------------------------
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca, y_train)
y_pred = knn.predict(X_test_pca)

# -----------------------------------------
# STEP 4: MODEL EVALUATION
# -----------------------------------------
acc = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("\nKNN Accuracy on PCA-Reduced Data:", acc)
print("\nClassification Report:")
print(report)


Original Features: 5000
Reduced PCA Components: 148

Explained Variance Ratio per Component:
[0.0086717  0.00860015 0.00850544 0.00843837 0.00838695 0.00832217
 0.00827075 0.00823571 0.00821172 0.00815228 0.00811725 0.00807998
 0.00802264 0.0080063  0.00795173 0.00793225 0.00789203 0.00789023
 0.007841   0.00780562 0.00772048 0.00771038 0.00768478 0.00766525
 0.00759574 0.00754256 0.00752543 0.00750989 0.00744178 0.00741286
 0.00739275 0.00738119 0.00737312 0.00733091 0.00729675 0.00726057
 0.00723583 0.00720254 0.00719395 0.00715743 0.0071394  0.00711375
 0.00710166 0.00707673 0.00704405 0.00702253 0.00701168 0.00696451
 0.00692683 0.00689635 0.00689183 0.00684666 0.00681497 0.00679819
 0.00679141 0.00678773 0.00672554 0.00669749 0.00668476 0.00666098
 0.0066495  0.00662032 0.00660584 0.00658937 0.00658221 0.00653434
 0.0065089  0.00648761 0.00644633 0.00643221 0.00642579 0.0064037
 0.00637168 0.00635159 0.00633496 0.00630661 0.00628043 0.00626312
 0.00624764 0.00624312 0.00621382 0.0

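
The simulated data above consists of random features with random labels, so it cannot show whether the pipeline actually learns anything. As a hedged alternative, the sketch below generates synthetic data with some class signal using `make_classification` and evaluates the same scale, PCA, KNN pipeline with stratified cross-validation. All sizes and parameters (120 samples, 2000 features, 30 informative features, `class_sep=2.0`) are assumptions chosen for a quick demonstration and do not represent real gene expression data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for expression data (assumption): 120 "patients",
# 2000 "genes", of which 30 carry class signal, 3 classes.
X, y = make_classification(
    n_samples=120, n_features=2000, n_informative=30, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# Scaler and PCA are refitted on each training fold, so nothing leaks
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),              # keep 95% of the training-fold variance
    KNeighborsClassifier(n_neighbors=5),
)

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(pipeline, X, y, cv=cv, scoring=["accuracy", "f1_macro"])

print("Accuracy per fold:", scores["test_accuracy"].round(3))
print("Macro F1 per fold:", scores["test_f1_macro"].round(3))
```

Macro F1 averages the per-class scores with equal weight, which makes weak performance on a rare class visible even when overall accuracy looks acceptable.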
