1] What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?
- K-Nearest Neighbors (KNN) is a simple yet powerful supervised machine learning algorithm that can be used for both classification and regression tasks. It is based on the idea that data points with similar features are likely to have similar outcomes. Instead of building an explicit model during training, KNN is a lazy learner, meaning it simply stores the training data and makes predictions at query time by comparing the input with stored examples.

In classification, KNN works by finding the k closest data points (neighbors) to a new input using a distance metric such as Euclidean distance. The algorithm then assigns the class label that is most common among these neighbors. For example, if k = 5 and among the five nearest neighbors, three belong to class A and two belong to class B, the new point will be classified as class A.
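
The voting rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a replacement for a library implementation; the function name `knn_classify` and the toy data are made up for this example.

```python
from collections import Counter

import numpy as np


def knn_classify(X_train, y_train, x_query, k=5):
    """Predict the label of x_query by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): the five points closest to the query are 3x "A" and 2x "B"
X_train = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.8, 1.1],   # class A
    [1.5, 1.5], [1.6, 1.4],               # class B, nearby
    [8.0, 8.0], [9.0, 9.0],               # class B, far away
])
y_train = np.array(["A", "A", "A", "B", "B", "B", "B"])

prediction = knn_classify(X_train, y_train, np.array([1.1, 1.1]), k=5)
print(prediction)  # A
```

Note that there is no training step: all the work happens at query time, which is what "lazy learner" means in practice.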

In regression, the same principle is applied, but instead of voting for the most frequent class, KNN predicts the value by averaging (or sometimes taking the weighted average of) the target values of the k nearest neighbors. For instance, if you want to predict house prices based on location and size, KNN would find the nearest houses in the dataset and take their average price as the prediction.
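
A rough sketch of the regression case, covering both the plain and the distance-weighted average. The helper `knn_regress` and the house data are hypothetical, and only a single feature (size) is used to keep the example small.

```python
import numpy as np


def knn_regress(X_train, y_train, x_query, k=3, weighted=False):
    """Predict a numeric target as the (optionally distance-weighted) mean of k neighbors."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    if not weighted:
        return float(np.mean(y_train[nearest]))
    # Inverse-distance weights: closer neighbors count more.
    # The small constant avoids division by zero for exact matches.
    weights = 1.0 / (distances[nearest] + 1e-9)
    return float(np.average(y_train[nearest], weights=weights))


# Hypothetical houses: size in square meters -> price in thousands
sizes = np.array([[50.0], [60.0], [70.0], [120.0], [200.0]])
prices = np.array([150.0, 180.0, 210.0, 400.0, 650.0])

query = np.array([62.0])
plain_pred = knn_regress(sizes, prices, query, k=3)
weighted_pred = knn_regress(sizes, prices, query, k=3, weighted=True)

print(plain_pred)     # 180.0 (mean of the 50, 60 and 70 square meter houses)
print(weighted_pred)  # slightly above 180, pulled toward the closest house
```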

The effectiveness of KNN depends heavily on the choice of k and the distance metric. A small k can make the model sensitive to noise, while a very large k may oversmooth the decision boundary. Feature scaling (e.g., normalization) is also crucial because KNN relies on distance calculations. Despite its simplicity, KNN works well for problems where decision boundaries are irregular and nonlinear.
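
One common way to pick k is to compare cross-validated accuracy over a few candidate values. The sketch below assumes scikit-learn and uses its bundled Wine dataset; the candidate list is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

cv_scores = {}
for k in [1, 3, 5, 7, 9, 11, 15]:  # candidate values chosen arbitrarily
    # Scaling sits inside the pipeline so it is re-fit on each training fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
for k, score in cv_scores.items():
    print(f"k={k:2d}  mean CV accuracy={score:.3f}")
print("Best k:", best_k)
```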

2] What is the Curse of Dimensionality and how does it affect KNN performance?
- The Curse of Dimensionality refers to the set of problems that arise when data is represented in a very high-dimensional space. As the number of features (dimensions) increases, the data becomes sparse, distances between points lose their meaning, and models that rely on distance metrics—like KNN—struggle to perform well.

For KNN specifically, this curse affects performance in several ways. Since KNN relies on measuring the distance between points to find the nearest neighbors, high-dimensional data makes all points appear almost equally distant from each other. This weakens the distinction between “close” and “far” neighbors, which reduces the effectiveness of the algorithm in identifying truly similar examples. Additionally, because the feature space grows exponentially with dimensions, much more training data is required to adequately represent the space. Without enough data, the model risks becoming inaccurate or overfitting.
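
The "almost equally distant" effect can be observed with random data. The sketch below draws uniform random points (an assumption made purely for illustration) and measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrasts = {d: relative_contrast(d) for d in [2, 10, 100, 1000]}
for d, c in contrasts.items():
    print(f"{d:5d} dims  relative contrast = {c:.2f}")
```

As the number of dimensions grows, the contrast shrinks toward zero, so the "nearest" neighbor is barely nearer than any other point.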

In practice, this means that KNN may work well in low-dimensional problems but struggles when the dataset has hundreds or thousands of features, such as in text or image data. To mitigate this, techniques like feature selection (keeping only the most informative features) or dimensionality reduction methods such as PCA (Principal Component Analysis) are often applied before using KNN. This helps concentrate the data into a space where distance metrics are more meaningful, improving both accuracy and efficiency.

3] What is Principal Component Analysis (PCA)? How is it different from feature selection?
- Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a dataset with many features into a smaller set of new features called principal components, while retaining as much of the original variance (information) as possible. These principal components are linear combinations of the original features and are ordered in such a way that the first component captures the maximum variance, the second captures the next highest variance orthogonal to the first, and so on. By keeping only the top few components, PCA reduces the dimensionality of the dataset, which helps improve computational efficiency and often enhances model performance by removing noise and redundancy.

The key difference between PCA and feature selection lies in how they reduce dimensionality. Feature selection chooses a subset of the original features based on some criterion, such as correlation with the target, mutual information, or importance scores from a model. The selected features remain interpretable because they come directly from the dataset. PCA, on the other hand, creates new features (principal components) that are combinations of the originals. While these new components are powerful in capturing variance, they are not directly interpretable in terms of the original dataset’s features.
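
The contrast can be made concrete on the Wine dataset. This sketch assumes scikit-learn; `SelectKBest` with the ANOVA F-score is just one possible selection criterion, and keeping 3 features or components is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature selection: keep 3 of the original columns, ranked by ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, y)
selected_names = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", selected_names)

# PCA: build 3 new columns, each a weighted mix of all 13 original features
pca = PCA(n_components=3).fit(X_scaled)
print("Loadings shape (components x original features):", pca.components_.shape)
print("Nonzero loadings in first component:", int(np.count_nonzero(pca.components_[0])))
```

The selected features keep their original names, while each principal component has a loading on every original feature and therefore no single-feature meaning.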

4] What are eigenvalues and eigenvectors in PCA, and why are they important?
- In the context of PCA, eigenvalues and eigenvectors come from the covariance matrix of the dataset and play a central role in identifying the new feature space.

An eigenvector of the covariance matrix gives the direction of one principal component, that is, an axis in feature space along which the data varies. These axes are mutually orthogonal, and the data projected onto them is uncorrelated. In PCA, the eigenvector with the largest eigenvalue points in the direction of maximum variance, the second points in the direction of the next highest variance orthogonal to the first, and so on.

An eigenvalue, on the other hand, represents the magnitude of variance captured by its corresponding eigenvector. The larger the eigenvalue, the more variance that principal component explains. For instance, if the first eigenvalue is much larger than the rest, the first principal component explains the majority of the dataset’s variability.

They are important in PCA because they tell us both where the important patterns in the data lie (eigenvectors) and how significant those patterns are (eigenvalues). By selecting the top eigenvectors associated with the largest eigenvalues, PCA reduces dimensionality while retaining as much variance as possible for the chosen number of components. This balance of variance preservation and dimensionality reduction is what makes PCA effective.
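
To make the link explicit, the sketch below computes the eigenvalues and eigenvectors of the covariance matrix directly with NumPy and checks them against scikit-learn's `PCA`. It uses the Wine dataset as an example; variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_scaled, rowvar=False)

# eigh is intended for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]      # re-sort from largest to smallest
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]      # each column is one principal direction

# Share of total variance captured by each direction
explained_ratio = eigenvalues / eigenvalues.sum()
print("Top 3 eigenvalues:", eigenvalues[:3])
print("Explained variance ratio (top 3):", explained_ratio[:3])

# Cross-check against scikit-learn's PCA
pca = PCA().fit(X_scaled)
print("Matches sklearn:", np.allclose(explained_ratio, pca.explained_variance_ratio_))
```

The signs of individual eigenvectors may differ between the two computations, since an eigenvector and its negation describe the same axis.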

5] How do KNN and PCA complement each other when applied in a single pipeline?
- KNN and PCA complement each other well because they address each other’s weaknesses. KNN is a simple, non-parametric algorithm that relies heavily on distance calculations to classify or predict, but its performance degrades badly in high-dimensional spaces due to the curse of dimensionality—where distances between points become less meaningful and noise dominates. PCA, on the other hand, is a dimensionality reduction technique that compresses data into fewer, more informative components by removing redundancy and focusing on directions of maximum variance.

When combined in a single pipeline, PCA reduces the number of features, making distance comparisons in KNN more meaningful and computationally efficient. By projecting the data into a lower-dimensional space, PCA helps KNN avoid overfitting, reduces noise, and speeds up neighbor searches. For example, in image recognition, raw pixel data may have thousands of dimensions, making KNN impractical. Applying PCA first condenses the image data into a smaller number of components that still capture the essential patterns, allowing KNN to work effectively.
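
A small-scale version of the image example can be sketched with scikit-learn's bundled digits dataset (8x8 images, 64 pixel features). Keeping 20 components and using k = 5 are assumptions for illustration, not tuned values.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)  # 8x8 images flattened to 64 pixel features

pca_knn = Pipeline([
    ("scale", StandardScaler()),                 # put all pixels on a common scale
    ("pca", PCA(n_components=20)),               # compress 64 features to 20 components
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Each fold re-fits the scaler and PCA on its own training portion
scores = cross_val_score(pca_knn, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```

Wrapping the steps in a `Pipeline` keeps the preprocessing and the classifier together, so the same fitted transforms are applied at prediction time.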

In [1]:
# 6] Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# --- KNN without scaling ---
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# --- KNN with scaling ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# Print results
print("Accuracy without Scaling:", acc_no_scaling)
print("Accuracy with Scaling:", acc_scaled)


Accuracy without Scaling: 0.7222222222222222
Accuracy with Scaling: 0.9444444444444444


In [2]:
# 7] Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load Wine dataset
wine = load_wine()
X = wine.data

# Scale features before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (keep all components)
pca = PCA()
pca.fit(X_scaled)

# Print explained variance ratio
print("Explained Variance Ratio of each Principal Component:")
print(pca.explained_variance_ratio_)


Explained Variance Ratio of each Principal Component:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
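
The ratios are easier to act on as a running total. The sketch below copies the values printed above and counts how many components are needed to reach a given share of the variance; the helper name `components_needed` is made up for this example.

```python
import numpy as np

# Explained variance ratios copied from the output above
ratios = np.array([
    0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294, 0.04935823,
    0.04238679, 0.02680749, 0.02222153, 0.01930019, 0.01736836, 0.01298233,
    0.00795215,
])

cumulative = np.cumsum(ratios)


def components_needed(threshold):
    """Smallest number of components whose cumulative variance reaches the threshold."""
    return int(np.argmax(cumulative >= threshold)) + 1


print(np.round(cumulative, 3))
print("Components for 90%:", components_needed(0.90))  # 8
print("Components for 95%:", components_needed(0.95))  # 10
```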


In [3]:
# 8] Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

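# Scale features (important for KNN and PCA)
# Note: the scaler is fit on the full dataset before splitting, so test-set statistics
# leak into preprocessing; for a stricter estimate, split first and fit on X_train only.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)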

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

# --- KNN on Original Dataset (all 13 scaled features) ---
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train, y_train)
y_pred_original = knn_original.predict(X_test)
acc_original = accuracy_score(y_test, y_pred_original)

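# --- PCA Transformation (retain top 2 components) ---
# Note: PCA is fit on all samples here (train and test); fitting it on the training
# split only would avoid leaking test data into the components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)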

# Split PCA data
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(
    X_pca, y, test_size=0.3, random_state=42, stratify=y
)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train_pca)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test_pca, y_pred_pca)

# Print Results
print("Accuracy with Original Dataset:", acc_original)
print("Accuracy with PCA (2 components):", acc_pca)


Accuracy with Original Dataset: 0.9444444444444444
Accuracy with PCA (2 components): 0.9629629629629629
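
A variant of the comparison that splits the raw data first and fits the scaler and PCA on the training split only. This is a sketch using scikit-learn's `make_pipeline`; the resulting accuracies may differ slightly from the figures above, so none are quoted here.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Split the raw data first so the test set stays unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# fit() learns scaling statistics and principal axes from X_train only;
# score() applies those same fitted transforms to X_test
knn_all = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pca2 = make_pipeline(
    StandardScaler(), PCA(n_components=2), KNeighborsClassifier(n_neighbors=5)
)

acc_all = knn_all.fit(X_train, y_train).score(X_test, y_test)
acc_pca2 = knn_pca2.fit(X_train, y_train).score(X_test, y_test)

print("Accuracy with all scaled features:", acc_all)
print("Accuracy with PCA (2 components):", acc_pca2)
```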


In [4]:
# 9] Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Scale features (fit on the full dataset for simplicity; fit on the training split only for a stricter evaluation)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

# --- KNN with Euclidean distance (default: p=2) ---
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# --- KNN with Manhattan distance (p=1) ---
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Print results
print("Accuracy with Euclidean Distance:", acc_euclidean)
print("Accuracy with Manhattan Distance:", acc_manhattan)


Accuracy with Euclidean Distance: 0.9444444444444444
Accuracy with Manhattan Distance: 0.9814814814814815
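
Both metrics compared above are special cases of the Minkowski distance, which is why the code switches between them by changing only `p`. A small worked example with two made-up points:

```python
import numpy as np


def minkowski(u, v, p):
    """Minkowski distance: (sum_i |u_i - v_i|^p)^(1/p)."""
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))


a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2) = 5.0
manhattan = minkowski(a, b, p=1)  # |3| + |4| = 7.0

print("Euclidean (p=2):", euclidean)
print("Manhattan (p=1):", manhattan)
```

Manhattan distance sums the per-feature differences without squaring them, so a single large difference in one feature influences it less than it influences Euclidean distance.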


10] You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data


- We start with a gene-expression dataset that has thousands of features but only a small number of patients. This usually causes overfitting because the model sees too much noise compared to real signal. To fix this, we use PCA to compress the data into a smaller set of “summary features” (principal components) that capture most of the meaningful variation while ignoring redundant noise.

To decide how many components to keep, we’d look at two things: how much variance they explain and how well they help the model perform in cross-validation. Usually, keeping enough components to explain about 90–95% of the variance works well, but we’d also test performance across different numbers of components to be safe.
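
Both criteria can be tested together by treating the number of components as a tuning parameter. The sketch below uses synthetic data from `make_classification` as a stand-in for gene expression (an assumption; no real dataset is involved), and the candidate grid is arbitrary. Float values such as 0.90 ask scikit-learn's `PCA` to keep enough components to explain that share of the variance.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 100 "patients", 2000 "genes", 3 classes
X, y = make_classification(
    n_samples=100, n_features=2000, n_informative=20, n_classes=3, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(svd_solver="full")),   # the full solver supports variance-fraction targets
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Integers are fixed component counts; floats are variance fractions to retain
param_grid = {"pca__n_components": [5, 10, 20, 0.90, 0.95]}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1_macro",
)
search.fit(X, y)

print("Best n_components:", search.best_params_["pca__n_components"])
print("Best mean CV macro-F1:", round(search.best_score_, 3))
```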

After reducing dimensionality, we apply KNN. Normally, KNN struggles in very high dimensions, but with PCA it works better because distances between patients become more meaningful. KNN is also intuitive to explain: it predicts a patient’s cancer type by looking at the most similar past cases.

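For evaluation, we would use stratified cross-validation to get a fair estimate of accuracy, fitting the scaler and PCA inside each training fold so that no information from the held-out patients leaks into the components. We would also report metrics like F1-score and ROC-AUC, which are more informative than plain accuracy when classes are imbalanced. If possible, we’d also test on an independent dataset to check that the model generalizes.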
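
An evaluation along these lines could be sketched with scikit-learn's `cross_validate`, which reports several metrics per fold. The data is again synthetic and deliberately imbalanced (an assumption for illustration), and 10 components with k = 5 are placeholder settings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for a gene expression dataset
X, y = make_classification(
    n_samples=120, n_features=2000, n_informative=20, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0,
)

# Preprocessing lives inside the pipeline, so it is re-fit on each training fold
model = make_pipeline(
    StandardScaler(), PCA(n_components=10), KNeighborsClassifier(n_neighbors=5)
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "f1_macro", "roc_auc_ovr"]
results = cross_validate(model, X, y, cv=cv, scoring=metrics)

for name in metrics:
    fold_scores = results[f"test_{name}"]
    print(f"{name:12s} mean={fold_scores.mean():.3f}  std={fold_scores.std():.3f}")
```

Reporting the spread across folds alongside the mean gives a sense of how stable the estimate is when the number of patients is small.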

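For stakeholders, this pipeline is a good fit because it reduces overfitting and directly links each prediction to “similar past patients,” which is easy to explain in a biomedical setting. The trade-off is that principal components are combinations of many genes, so they are less interpretable than individual gene features. The end result is intended to be a more reliable, practical tool, but any claim about fewer missed diagnoses or unnecessary tests would need to be backed by validation on independent data.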