Question 1 – What is KNN and how does it work in classification and regression?
K-Nearest Neighbors (KNN) is a supervised learning algorithm that predicts the output for a new data point from the labels or values of the k closest points in the training set, using a distance metric such as Euclidean distance. In classification, KNN assigns the class that is most frequent among the k neighbors (majority vote), while in regression it predicts a numeric value by averaging the target values of the k neighbors, optionally using distance-weighted averaging. KNN is a lazy learner: there is no explicit training phase; instead, it stores the data and performs distance calculations at prediction time, which makes inference potentially expensive on large datasets.
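
The mechanics described above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name knn_predict and the toy arrays are made up for the example, and it uses plain (unweighted) voting and averaging.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point from its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors' labels
        labels, counts = np.unique(neighbor_targets, return_counts=True)
        return labels[np.argmax(counts)]
    # Regression: unweighted mean of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy data (made up for illustration): two well-separated clusters in 2D
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])

query = np.array([1.1, 1.0])
pred_class = knn_predict(X_toy, y_class, query, k=3, task="classification")
pred_value = knn_predict(X_toy, y_value, query, k=3, task="regression")
print("Predicted class:", pred_class)   # 0
print("Predicted value:", pred_value)   # 11.0
```

Note that all the work happens inside the prediction call; "fitting" would amount to nothing more than storing X_toy and the targets.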

Question 2 – Curse of Dimensionality and its effect on KNN
The curse of dimensionality refers to various phenomena that arise when data is embedded in a high-dimensional space, where distances between points become less informative and data becomes sparse. For KNN, as the number of features grows, all points tend to appear similarly distant, so nearest neighbors are less meaningful, which can increase test error and make the model very sensitive to noise unless the dataset is extremely large or features are carefully selected/engineered. This also increases computation cost, because distance calculations must be done in many dimensions for every query point.​
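
One way to see distances becoming less informative is to measure how much the farthest point differs from the nearest one as the number of dimensions grows. The sketch below uses uniformly random points, which is an assumption made only for illustration; the helper name relative_contrast is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """(max - min) / min distance from one random query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Same number of points, increasing dimensionality
contrasts = {d: relative_contrast(1000, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:5d}  relative contrast={c:.3f}")
```

A small relative contrast means the nearest and farthest neighbors are almost equally far away, which is the situation in which a KNN vote carries little information.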

Question 3 – What is PCA? How is it different from feature selection?
Principal Component Analysis (PCA) is a linear dimensionality reduction technique that transforms the original correlated features into a new set of uncorrelated variables called principal components, which are ordered such that the first few retain most of the variance in the data. PCA creates new features as linear combinations of the original ones (feature extraction), whereas traditional feature selection keeps a subset of the original features without altering them. As a result, PCA may improve model performance and reduce overfitting at the cost of interpretability, because principal components do not correspond directly to original input variables.​

Question 4 – Eigenvalues, eigenvectors in PCA and their importance
In PCA, eigenvectors of the data covariance matrix give the directions (axes) in feature space along which the data varies the most, and these directions are the principal components. The corresponding eigenvalues quantify how much variance is captured along each eigenvector; larger eigenvalues indicate components that explain more of the total variance. By sorting eigenvectors by decreasing eigenvalues and keeping only the top components, PCA reduces dimensionality while preserving as much variance (information) as possible.​
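
The eigendecomposition view can be checked directly with NumPy. The following is a sketch on synthetic, strongly correlated 2D data (the data and variable names are made up for the example); it is not how scikit-learn computes PCA internally, which relies on SVD.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data: the second feature is strongly correlated with the first
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=500)
X_demo = np.column_stack([x1, x2])

# Center the data, then form the sample covariance matrix
X_centered = X_demo - X_demo.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of total variance captured by each principal direction
explained_ratio = eigvals / eigvals.sum()

# Project onto the first principal component (the top eigenvector)
X_projected = X_centered @ eigvecs[:, :1]

print("Eigenvalues:", eigvals)
print("Explained variance ratio:", explained_ratio)
print("Variance of projection:", X_projected.var(ddof=1))
```

The variance of the projected data equals the largest eigenvalue, which is why sorting by eigenvalue ranks the components by how much variance they retain.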

Question 5 – How KNN and PCA complement each other in a pipeline
PCA reduces dimensionality and correlations among features, which can mitigate the curse of dimensionality and noise, making distance calculations used by KNN more meaningful and stable. Using PCA before KNN often leads to faster predictions and better generalization because distances are computed in a lower-dimensional space that concentrates the most informative variance. This combined pipeline is especially helpful for high-dimensional datasets where raw KNN would otherwise overfit or suffer from poor distance discrimination.​
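
A compact way to combine the two steps is a scikit-learn pipeline, sketched below on the Wine data. The choice of 5 components and k=5 is an assumption made for illustration, not a tuned setting.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# Scaling and PCA are refit inside each training fold, so no information
# from the held-out fold leaks into the transformation steps.
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),                  # assumed value, for illustration only
    KNeighborsClassifier(n_neighbors=5),
)

scores = cross_val_score(pca_knn, X_wine, y_wine, cv=5)
print("Fold accuracies:", scores)
print("Mean CV accuracy:", scores.mean())
```

Wrapping the steps in one estimator also means a single call to predict applies the same scaling and projection to new data before the neighbor search.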


In [1]:
# Question 6 – KNN on Wine dataset with and without feature scaling
# Using the Wine dataset from sklearn.datasets.load_wine, a KNN classifier with
# k=5 was trained with and without feature scaling (standardization) on a
# train–test split with 70% training and 30% testing. Without scaling, features
# on different numerical ranges distort Euclidean distances, leading to lower
# accuracy; in this experiment, test accuracy was about 0.74 (74.07%). After
# applying StandardScaler, the KNN test accuracy improved substantially to about
# 0.96 (96.30%), showing that scaling is crucial when using distance-based
# models like KNN.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Without scaling
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
y_pred_no_scale = knn_no_scale.predict(X_test)
acc_no_scale = accuracy_score(y_test, y_pred_no_scale)

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
# Predict on the scaled test set, since the model was fit on scaled features
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

print("Accuracy without scaling:", acc_no_scale)
print("Accuracy with scaling:", acc_scaled)


Accuracy without scaling: 0.7407407407407407
Accuracy with scaling: 0.9629629629629629
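
To see why scaling matters for a distance-based model, the sketch below uses four hypothetical samples with two features on very different ranges (the numbers are made up, not taken from the Wine data) and checks which sample is nearest to the first one before and after standardization.

```python
import numpy as np

# Hypothetical samples: feature 0 is in the thousands, feature 1 in single digits
X_raw = np.array([[1000.0, 1.0],
                  [1010.0, 9.0],
                  [1100.0, 1.2],
                  [2000.0, 5.0]])

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

# Raw distances from sample 0: the large-range feature dominates
d01_raw = euclidean(X_raw[0], X_raw[1])
d02_raw = euclidean(X_raw[0], X_raw[2])

# Standardize each column to zero mean and unit variance,
# which is the transformation StandardScaler applies
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
d01_std = euclidean(X_std[0], X_std[1])
d02_std = euclidean(X_std[0], X_std[2])

print("raw:    d(0,1) =", round(d01_raw, 3), " d(0,2) =", round(d02_raw, 3))
print("scaled: d(0,1) =", round(d01_std, 3), " d(0,2) =", round(d02_std, 3))
```

On the raw values sample 1 looks closest to sample 0 even though it differs greatly in feature 1; after standardization sample 2 becomes the nearest neighbor because both features contribute on a comparable scale.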


Question 7 – PCA on Wine dataset and explained variance ratio
On the scaled Wine features, fitting PCA() and examining explained_variance_ratio_ shows what fraction of the total variance each component explains. In one run on a 70/30 split, the first few ratios were approximately:

PC1: 0.3619

PC2: 0.1876

PC3: 0.1166

PC4: 0.0758

PC5: 0.0704

and the remaining components explain progressively smaller portions of variance. This indicates that a small number of leading components capture a large proportion of the information in the dataset: the first five together account for about 81% of the total variance.



In [2]:
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_train_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)


Explained variance ratio: [0.36196226 0.18763862 0.11656548 0.07578973 0.07043753 0.04552517
 0.03584257 0.02646315 0.02174942 0.01958347 0.01762321 0.01323825
 0.00758114]
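
The printed ratios can be turned into a cumulative curve to decide how many components to keep. The sketch below copies the values from the output above; the helper name components_needed is hypothetical.

```python
import numpy as np

# Explained variance ratios copied from the printed output
ratios = np.array([0.36196226, 0.18763862, 0.11656548, 0.07578973,
                   0.07043753, 0.04552517, 0.03584257, 0.02646315,
                   0.02174942, 0.01958347, 0.01762321, 0.01323825,
                   0.00758114])

cumulative = np.cumsum(ratios)

def components_needed(threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    # searchsorted returns the index of the first cumulative value >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

n_90 = components_needed(0.90)
n_95 = components_needed(0.95)
print("Components for 90% variance:", n_90)   # 8
print("Components for 95% variance:", n_95)   # 10
```

Passing a float such as PCA(n_components=0.95) asks scikit-learn to make the same kind of selection automatically.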


Question 8 – KNN on PCA-transformed Wine data (top 2 components)
Applying PCA with n_components=2 to the scaled Wine features reduces them to two principal components, and a KNN classifier with k=5 trained on this transformed space still yields high accuracy. In the same experiment setup, the KNN classifier on the 2D PCA space achieved about 0.98 (98.15%) test accuracy, which is comparable to or slightly higher than the accuracy on the full scaled feature set. This shows that most of the discriminative information for classification can be captured in just two components for this dataset, which improves efficiency and makes the data easy to visualize.

In [3]:
pca_2 = PCA(n_components=2)
X_train_pca2 = pca_2.fit_transform(X_train_scaled)
X_test_pca2 = pca_2.transform(X_test_scaled)

knn_pca2 = KNeighborsClassifier(n_neighbors=5)
knn_pca2.fit(X_train_pca2, y_train)
y_pred_pca2 = knn_pca2.predict(X_test_pca2)
acc_pca2 = accuracy_score(y_test, y_pred_pca2)

print("Accuracy with 2 PCA components:", acc_pca2)


Accuracy with 2 PCA components: 0.9814814814814815


Question 9 – KNN with different distance metrics on scaled Wine data
On the scaled Wine dataset, KNN with Euclidean and Manhattan distances can be compared by training separate models with metric='euclidean' and metric='manhattan' while keeping k=5. In this experiment, both metrics achieved the same test accuracy of about 0.96 (96.30%), suggesting that on this dataset and split, the choice between Euclidean and Manhattan distance does not materially change performance. However, on other datasets or with different feature distributions, Manhattan distance can be more robust to outliers and better suited to axis-aligned structure, while Euclidean is more common for continuous, isotropic data.

In [4]:
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

print("Euclidean accuracy:", acc_euclidean)
print("Manhattan accuracy:", acc_manhattan)


Euclidean accuracy: 0.9629629629629629
Manhattan accuracy: 0.9629629629629629
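
For reference, the two metrics compared above differ only in how coordinate differences are aggregated. The sketch below defines both by hand (the function names and points are made up for the example) and shows how a deviation concentrated in one coordinate affects each.

```python
import numpy as np

def euclidean_distance(a, b):
    # L2 norm: square root of the sum of squared coordinate differences
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan_distance(a, b):
    # L1 norm: sum of absolute coordinate differences
    return float(np.sum(np.abs(a - b)))

origin = np.array([0.0, 0.0])
point = np.array([3.0, 4.0])
print(euclidean_distance(origin, point))   # 5.0
print(manhattan_distance(origin, point))   # 7.0

# Same total deviation, spread evenly versus concentrated in one coordinate
zero = np.zeros(4)
spread = np.array([1.0, 1.0, 1.0, 1.0])
spike = np.array([4.0, 0.0, 0.0, 0.0])
print(manhattan_distance(zero, spread), manhattan_distance(zero, spike))  # 4.0 4.0
print(euclidean_distance(zero, spread), euclidean_distance(zero, spike))  # 2.0 4.0
```

Manhattan distance treats the two cases identically, while Euclidean distance doubles for the concentrated deviation, which is one reason a single extreme feature value influences Euclidean neighbors more strongly.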


Question 10 – PCA + KNN pipeline for high-dimensional gene expression data
For a high-dimensional gene expression dataset (many genes, few patients), a PCA + KNN pipeline helps reduce overfitting and stabilize distance-based classification.​

Use PCA to reduce dimensionality:
Standardize all gene expression features, then apply PCA to transform them into principal components that capture most of the variance, thereby denoising and compressing the feature space. This mitigates the curse of dimensionality and reduces correlations between genes, making distances more meaningful for KNN.​

Decide how many components to keep:
Inspect the cumulative explained variance ratio and choose the smallest number of components that together explain a high proportion of variance, for example 90–95%, or tune this number via cross-validation on validation folds. This balances information retention with model simplicity and helps avoid overfitting on small sample sizes common in biomedical datasets.​

Use KNN for classification after dimensionality reduction:
In the PCA-transformed space, train a KNN classifier (e.g., start with k between 3 and 15) using an appropriate distance metric such as Euclidean, and tune k and the metric via cross-validation to find the configuration that yields the best validation performance. Because PCA has removed redundant and noisy directions, KNN can better exploit local similarity among patients with similar cancer types.

Evaluate the model:
Use stratified k-fold cross-validation to estimate generalization performance, reporting metrics such as accuracy, F1-score, and confusion matrix, and possibly ROC-AUC per class if doing one-vs-rest analysis. Given the high stakes in biomedical applications, also test robustness through repeated cross-validation and, if possible, external validation on a separate cohort.

Justification to stakeholders:
This pipeline is robust because PCA reduces noise and dimensionality in inherently high-dimensional omics data, while KNN is a simple, transparent algorithm that bases predictions on similarity to known patients rather than a complex black-box model. The combination leverages strong denoising and compression from PCA with an interpretable, distance-based classifier, which is easier to explain to clinicians and regulatory bodies and can generalize better with limited labeled samples.​

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report

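# X, y still hold the Wine data loaded in the first cell; it stands in here
# for a gene expression matrix (rows = patients, columns = genes).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Standardize using statistics from the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# A float n_components keeps the fewest components that reach that variance share
pca = PCA(n_components=0.95)  # keep 95% variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Tune k and the distance metric with 5-fold cross-validation on macro F1
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "metric": ["euclidean", "manhattan"]
}
knn = KNeighborsClassifier()
grid = GridSearchCV(knn, param_grid, cv=5, scoring="f1_macro")
grid.fit(X_train_pca, y_train)

# Evaluate the best configuration on the held-out test split
best_knn = grid.best_estimator_
y_pred = best_knn.predict(X_test_pca)
print(classification_report(y_test, y_pred))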


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00      1.00      1.00        14
           2       1.00      1.00      1.00        10

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36
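
The cell above fits the scaler and PCA once on the whole training split before the grid search, so every cross-validation fold shares the same transformation. A possible variant, sketched below, wraps all steps in a Pipeline so they are refit per fold and the number of components can be tuned alongside k. The data here is a synthetic stand-in generated with make_classification (few samples, many features), and the grid values are assumptions chosen for illustration, not recommendations for real gene expression data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: 120 samples, 1000 features
X_syn, y_syn = make_classification(
    n_samples=120, n_features=1000, n_informative=20,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

# All preprocessing lives inside the pipeline, so it is refit on each CV fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(random_state=0)),
    ("knn", KNeighborsClassifier()),
])

# Step-name prefixes route each parameter to the right pipeline step
param_grid_pipe = {
    "pca__n_components": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid_pipe = GridSearchCV(pipe, param_grid_pipe, cv=cv, scoring="f1_macro")
grid_pipe.fit(X_tr, y_tr)

test_f1 = f1_score(y_te, grid_pipe.predict(X_te), average="macro")
print("Best parameters:", grid_pipe.best_params_)
print("Held-out macro F1:", test_f1)
```

Because the held-out split is never seen during the search, its macro F1 gives a less optimistic estimate than the cross-validation score of the selected configuration.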

