# KNN & PCA Assignment 

Q 1. What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

->  K-Nearest Neighbors (KNN) is a simple but powerful supervised learning algorithm that makes predictions based on the similarity between data points. Instead of building a mathematical model during training, KNN stores the entire dataset and makes decisions only when a new sample is given. This is why it is called a lazy learner.

In classification, KNN finds the “k” closest neighbors and assigns the class that appears most frequently among them. It assumes that points located closer to one another are more likely to belong to the same category.

In regression, KNN predicts continuous values by averaging the output values of the k nearest neighbors. Because the algorithm depends entirely on distances between points, feature scaling is very important: without it, features with large numeric ranges dominate the distance. Overall, KNN works by comparing distances and making predictions based on majority voting or average values.
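
The voting and averaging steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN; the function name `knn_predict` and the toy data are made up for the example, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one sample from its k nearest training points (Euclidean)."""
    # Distance from the new sample to every stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy data (made up for illustration): two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 48.0])

query = np.array([1.1, 1.0])
predicted_class = knn_predict(X_toy, y_class, query, k=3)
predicted_value = knn_predict(X_toy, y_value, query, k=3, task="regression")
print(predicted_class, predicted_value)
```

The query lies inside the first group, so its three nearest neighbors all have class 0, and the regression estimate is the mean of their targets (10, 12, and 11), which is 11.0. There is no training step: all the work happens at prediction time.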

Q 2. What is the Curse of Dimensionality and how does it affect KNN performance?

->  The Curse of Dimensionality refers to the problems that arise when the number of features in a dataset becomes very large. In high-dimensional space, data points become sparse and distances between them become less informative: the distance to a point's nearest neighbor approaches the distance to its farthest one, so "nearest" stops meaning "similar".

Since KNN depends directly on distance calculations, its chosen neighbors become close to arbitrary and its predictions degrade. Each query also costs more to compute, and far more training data is needed to cover the feature space densely enough.

This is why dimensionality reduction techniques like PCA are often applied before using KNN, especially when datasets contain a large number of features.
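
The loss of contrast between near and far points can be observed directly. The sketch below uses uniformly random points, which is an assumption chosen for simplicity and not a model of any real dataset; the helper `relative_contrast` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_features, n_points=500):
    """(farthest - nearest) / nearest distance from one query to random points."""
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# Same number of points, increasing number of features
contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrasts.items():
    print(f"{dim:>5} features: relative contrast = {c:.2f}")
```

As the number of features grows, the relative contrast shrinks toward zero, meaning the nearest and farthest points are almost equally far from the query.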

Q 3. What is Principal Component Analysis (PCA)? How is it different from feature selection?

->  Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms the original high-dimensional data into a smaller number of new variables called principal components. These components capture as much information (variance) as possible while reducing redundancy among features.

The key difference between PCA and feature selection is that PCA creates new features by combining existing ones, whereas feature selection simply chooses the most important existing features without altering them.

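PCA is useful for simplifying complex datasets, and it can improve model performance by discarding low-variance directions that are often noise. Keeping only two or three components also makes high-dimensional data easy to plot. The trade-off is interpretability: each component mixes all the original features, whereas a selected feature keeps its original meaning.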
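
The contrast between the two approaches shows up in what each one returns. The sketch below uses scikit-learn's `SelectKBest` with the `f_classif` score as one possible feature selector; that choice, and keeping 3 columns, are assumptions made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)

# Feature selection: keeps 3 of the original columns unchanged
selector = SelectKBest(score_func=f_classif, k=3).fit(X_std, wine.target)
kept_names = [wine.feature_names[i] for i in selector.get_support(indices=True)]
X_selected = selector.transform(X_std)

# PCA: builds 3 new columns, each a weighted mix of all 13 originals
pca_demo = PCA(n_components=3).fit(X_std)
X_projected = pca_demo.transform(X_std)

print("Selected original features:", kept_names)
print("PCA loadings shape (components x original features):", pca_demo.components_.shape)
```

The selector returns named columns that already existed, while each row of `pca_demo.components_` holds 13 weights, one per original feature.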

Q 4. What are eigenvalues and eigenvectors in PCA, and why are they important? 

->  In PCA, eigenvalues and eigenvectors come from the covariance matrix of the dataset. Eigenvectors represent the directions in which the data varies the most, while eigenvalues represent how much variance lies along each direction.

Eigenvectors determine the orientation of the principal components, and eigenvalues determine their importance. Components with high eigenvalues carry more information about the dataset.

PCA ranks components by their eigenvalues, which helps us decide how many to keep: each eigenvalue divided by the sum of all eigenvalues is the share of total variance that component explains. Without this decomposition, PCA would have no principled way to order the directions or to decide which ones to drop.
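
The relationship between the covariance matrix, its eigenvectors, and the explained variance can be worked through by hand with NumPy. The two correlated features below are synthetic and made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic 2-D data: the second feature is strongly correlated with the first
x1 = rng.normal(size=300)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=300)
data_2d = np.column_stack([x1, x2])

centered = data_2d - data_2d.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of total variance carried by each principal direction
explained_ratio = eigenvalues / eigenvalues.sum()
# Coordinates of the data along the principal directions
projected = centered @ eigenvectors

print("Eigenvalues:", eigenvalues)
print("Explained variance ratio:", explained_ratio)
```

The variance of each column of `projected` equals the corresponding eigenvalue, which is why eigenvalues can be read as "variance along that direction".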

Q 5. How do KNN and PCA complement each other when applied in a single pipeline?

->  KNN works best when the features are few and on comparable scales, while PCA reduces the number of features by keeping only the components that explain the most variance. When PCA is applied before KNN (after standardizing the features), it discards low-variance directions that are often noise, merges redundant correlated features, and makes distances more meaningful.

This can improve KNN’s accuracy, reduce overfitting, and speed up computation because each distance is computed over fewer dimensions. Using PCA followed by KNN therefore creates an efficient pipeline, especially for datasets with many correlated features.
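
One way to express this combination is a scikit-learn pipeline, sketched below on the Wine data. The values `n_components=5` and `n_neighbors=5` are illustrative choices and not tuned.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# Scale -> reduce -> classify, bundled as a single estimator
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    KNeighborsClassifier(n_neighbors=5),
)

# Each fold refits the scaler and PCA on its own training part only
fold_scores = cross_val_score(pca_knn, X_wine, y_wine, cv=5)
print("Mean CV accuracy:", fold_scores.mean())
```

Bundling the steps means the preprocessing never sees the held-out fold, so the score reflects how the whole chain behaves on new data.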

In [2]:
# Dataset: 
# Use the Wine Dataset from sklearn.datasets.load_wine().

Q 6. Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

-> Python Code

In [8]:
# Python Code
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
data = load_wine()
X = data.data
y = data.target

# Without Scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
acc_no_scale = accuracy_score(y_test, knn_no_scale.predict(X_test))

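# With Scaling
# Note: the scaler is fit on all of X before the split, so test-set statistics
# leak into preprocessing; fitting it on the training split only is the stricter setup.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)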
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
knn_scale = KNeighborsClassifier(n_neighbors=5)
knn_scale.fit(X_train_s, y_train_s)
acc_scale = accuracy_score(y_test_s, knn_scale.predict(X_test_s))

print("Accuracy Without Scaling:", acc_no_scale)
print("Accuracy With Scaling:", acc_scale)

Accuracy Without Scaling: 0.7222222222222222
Accuracy With Scaling: 0.9444444444444444
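
The scaled run above fits `StandardScaler` on the full dataset before splitting. A stricter variant, sketched here with the same split settings, fits the scaler on the training rows only by placing it inside a pipeline. Variable names such as `X_tr` and `acc_scaled_no_leak` are new to this sketch, and no result is quoted because it was not part of the original run.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# The pipeline fits the scaler on X_tr only, then reuses those statistics on X_te
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)

# Baseline on the raw features, same split
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

acc_raw = raw_knn.score(X_te, y_te)
acc_scaled_no_leak = scaled_knn.score(X_te, y_te)
print("Without scaling:", acc_raw)
print("With scaling (scaler fit on training split only):", acc_scaled_no_leak)
```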


Q 7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

-> Python Code

In [12]:
# Python Code
from sklearn.decomposition import PCA

# PCA Model (fit on the raw, unscaled features, so large-range features dominate)
pca = PCA()
pca.fit(X)

print("Explained Variance Ratio:")
print(pca.explained_variance_ratio_)

Explained Variance Ratio:
[9.98091230e-01 1.73591562e-03 9.49589576e-05 5.02173562e-05
 1.23636847e-05 8.46213034e-06 2.80681456e-06 1.52308053e-06
 1.12783044e-06 7.21415811e-07 3.78060267e-07 2.12013755e-07
 8.25392788e-08]
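
The first component above takes about 99.8% of the variance largely because PCA was fit on unscaled features, so the features with the largest numeric ranges dominate. A sketch of the same analysis on standardized features follows; it is an additional check and not part of the original run, so no output is quoted.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_raw, _ = load_wine(return_X_y=True)

# Per-feature variances differ by orders of magnitude on the raw data
feature_variances = X_raw.var(axis=0)
print("Largest / smallest feature variance:", feature_variances.max() / feature_variances.min())

# Standardize first so every feature contributes on the same scale
pca_std = PCA().fit(StandardScaler().fit_transform(X_raw))
cumulative = np.cumsum(pca_std.explained_variance_ratio_)

print("Explained variance ratio (standardized):")
print(pca_std.explained_variance_ratio_)
print("Cumulative:", cumulative)
```

After standardization the variance is spread over several components, which gives a more useful basis for deciding how many to keep.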


Q 8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

-> Python Code

In [14]:
# Python Code
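# PCA with 2 components (fit on the raw, unscaled X)
pca_2 = PCA(n_components=2)
X_pca = pca_2.fit_transform(X)

# Train-test split (same random_state as before, so the same rows land in the test set)
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# KNN
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train_pca)
acc_pca = accuracy_score(y_test_pca, knn_pca.predict(X_test_pca))

# acc_scale comes from the scaled 13-feature model, so this compares
# unscaled PCA features against scaled original features
print("Accuracy with PCA (2 components):", acc_pca)
print("Original Accuracy (Scaled):", acc_scale)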

Accuracy with PCA (2 components): 0.7222222222222222
Original Accuracy (Scaled): 0.9444444444444444
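
The PCA accuracy above matches the unscaled KNN result from Q6, which suggests the two unscaled components mostly reproduce the large-range features. For a like-for-like comparison against the scaled model, the sketch below standardizes before projecting to 2 components on the same split. It is an extra experiment with new variable names, so no result is quoted.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_w, y_w = load_wine(return_X_y=True)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_w, y_w, test_size=0.2, random_state=42)

# Standardize, then project to 2 components; both steps are fit on X_tr2 only
scaled_pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)
scaled_pca_knn.fit(X_tr2, y_tr2)

acc_scaled_pca = scaled_pca_knn.score(X_te2, y_te2)
print("Accuracy with scaling + PCA (2 components):", acc_scaled_pca)
```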


Q 9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

->  Python Code

In [15]:
# Python Code
# Euclidean
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_s, y_train_s)
acc_eu = accuracy_score(y_test_s, knn_euclidean.predict(X_test_s))

# Manhattan
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_s, y_train_s)
acc_man = accuracy_score(y_test_s, knn_manhattan.predict(X_test_s))

print("Euclidean Accuracy:", acc_eu)
print("Manhattan Accuracy:", acc_man)

Euclidean Accuracy: 0.9444444444444444
Manhattan Accuracy: 0.9444444444444444


Q 10. You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.

Due to the large number of features and a small number of samples, traditional models overfit.

Explain how you would: 

● Use PCA to reduce dimensionality 

● Decide how many components to keep 

● Use KNN for classification post-dimensionality reduction

● Evaluate the model 

● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

->  In high-dimensional gene expression datasets, the number of features (genes) is extremely large, often in the thousands, while the number of samples (patients) is very small. Most traditional machine learning models overfit in this setting: they memorize noise instead of learning true patterns. The first step I would take is to standardize the features and apply Principal Component Analysis (PCA). PCA reduces dimensionality by converting thousands of gene features into a smaller set of principal components that capture most of the variance in the data. With n samples, PCA can produce at most n - 1 informative components, so the reduction is substantial. To avoid leakage, the scaler and PCA are fit on the training data only and then applied to the test data.

To decide how many components to keep, I would plot the cumulative explained variance ratio and choose the smallest number of components that preserves around 90–95% of the total variance. Low-variance directions, which are often dominated by measurement noise, are discarded, while the dominant patterns of variation are retained. Because PCA is unsupervised, high variance does not guarantee relevance to cancer type, so I would also treat the number of components as a hyperparameter and confirm the choice with cross-validation. This step makes the dataset smaller and tends to make the downstream classifier more stable.
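
The variance-threshold rule can be sketched as follows. No gene expression data is available here, so `make_classification` generates a synthetic stand-in; the 60 samples, 2000 features, and 95% threshold are made-up values for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features
X_genes, y_genes = make_classification(
    n_samples=60, n_features=2000, n_informative=20, n_classes=3, random_state=0
)
X_genes_std = StandardScaler().fit_transform(X_genes)

full_pca = PCA().fit(X_genes_std)
cumulative_var = np.cumsum(full_pca.explained_variance_ratio_)
# Smallest number of components whose cumulative variance reaches 95%
n_keep = int(np.searchsorted(cumulative_var, 0.95) + 1)

# Passing a float in (0, 1) asks scikit-learn to make the same choice internally
auto_pca = PCA(n_components=0.95).fit(X_genes_std)
print("Components for 95% variance:", n_keep, "| chosen by PCA(0.95):", auto_pca.n_components_)
```

Whatever the threshold, the number of components cannot exceed the number of samples minus one, so the 2000 features collapse to a few dozen at most.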

After dimensionality reduction, I would apply the K-Nearest Neighbors (KNN) algorithm for classification. KNN works better in the lower-dimensional space because distances become more meaningful once redundant, correlated features are collapsed into a few components, and low-variance gene expression noise has less influence on them. I would tune the value of ‘k’ with cross-validation rather than fixing it in advance.

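For evaluating the model, I would use accuracy, a confusion matrix, and stratified k-fold cross-validation to check that the model performs consistently across folds. Cross-validation is especially important in biomedical datasets where sample sizes are very small, since a single train-test split gives a noisy estimate. Scaling, PCA, and KNN should be wrapped in one pipeline so that every fold refits the preprocessing on its own training portion. This helps confirm that the model is generalizing and not overfitting to specific patients.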
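
A sketch of this evaluation uses nested cross-validation, where an inner loop tunes k and an outer loop measures accuracy on held-out samples. The data is again a synthetic stand-in from `make_classification`; the sample sizes, fold counts, and candidate k values are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small, wide biomedical dataset
X_syn, y_syn = make_classification(
    n_samples=90, n_features=1000, n_informative=15, n_classes=3, random_state=1
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier()),
])

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop tunes k; outer loop scores samples never seen during tuning
search = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7]}, cv=inner_cv)
outer_predictions = cross_val_predict(search, X_syn, y_syn, cv=outer_cv)

cm = confusion_matrix(y_syn, outer_predictions)
outer_accuracy = cm.trace() / cm.sum()
print("Nested CV accuracy:", outer_accuracy)
print(cm)
```

Because every sample is predicted exactly once by a model that never trained on it, the confusion matrix covers the full dataset without reusing any patient for both tuning and scoring.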

This PCA + KNN pipeline is well suited to real-world biomedical data because it reduces complexity, suppresses noise, and gives KNN a space in which distances are informative. It is also easy to explain to stakeholders: plotting the first two or three components shows how patient profiles group, and each KNN prediction can be traced back to the specific similar patients that produced it. Together, they offer a robust and practical approach for cancer classification problems where feature count is extremely high and accuracy is critical, provided the reported performance comes from cross-validation on held-out patients.