# Machine Learning – KNN and PCA


## Q1: What is K-Nearest Neighbors (KNN)?

KNN is a supervised learning algorithm that predicts the output for a new point from the K closest training points in feature space: the majority class among them for classification, or the average of their target values for regression.
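
As a rough illustration of the voting step, here is a minimal from-scratch sketch of KNN classification. The function name `knn_predict` and the toy data are hypothetical and not part of the notebook; Euclidean distance is assumed.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point, paired with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance and keep the labels of the k closest points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    nearest_labels = [label for _, label in nearest]
    # The most frequent label among the neighbors is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters in 2D.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

prediction = knn_predict(train_points, train_labels, query=(1.1, 1.0), k=3)
print(prediction)
```

For regression, the same neighbor search would apply, with the vote replaced by the mean of the neighbors' target values.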

## Q2: Curse of Dimensionality

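As the number of dimensions increases, the data become sparse and the distances from a point to its nearest and farthest neighbors grow increasingly similar, so "nearest" carries less information. This reduces KNN's effectiveness and raises the cost of every distance computation.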
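
The effect can be seen in a small simulation. The sketch below assumes uniformly random points in the unit hypercube, and the helper `distance_contrast` is a hypothetical name. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import math
import random


def distance_contrast(n_dims, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    distances = [math.dist(point, query) for point in points]
    return (max(distances) - min(distances)) / min(distances)


contrast_low = distance_contrast(n_dims=2)
contrast_high = distance_contrast(n_dims=1000)
print(f"2 dims: {contrast_low:.2f}, 1000 dims: {contrast_high:.2f}")
```

In 2 dimensions the farthest point is many times farther away than the nearest one. In 1000 dimensions the two distances differ by only a small fraction, which is why neighbor rankings become unreliable.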

## Q3: What is PCA?

PCA is a dimensionality reduction technique that linearly transforms the original features into a smaller set of uncorrelated components, ordered by the variance they capture, so that the first few components retain most of the variance.
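
A short sketch of both properties (uncorrelated components, most variance retained) using scikit-learn's `PCA`. The synthetic data is an assumption made for illustration: three features, the third being almost a linear combination of the first two.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): the third feature is nearly a + b,
# so two components should capture almost all of the variance.
a = rng.normal(size=500)
b = rng.normal(size=500)
X = np.column_stack([a, b, a + b + 0.05 * rng.normal(size=500)])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Share of the total variance kept by the two components.
retained = pca.explained_variance_ratio_.sum()
# Correlation between the two component scores (should be numerically zero).
component_corr = np.corrcoef(X_reduced, rowvar=False)[0, 1]

print(X_reduced.shape)
print(f"variance retained: {retained:.4f}")
print(f"correlation between components: {component_corr:.1e}")
```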

## Q4: Eigenvalues and Eigenvectors in PCA

The eigenvectors of the data's covariance matrix define the principal component directions, while the corresponding eigenvalues give the amount of variance captured along each direction.
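
To make the link concrete, the following sketch computes PCA by hand with NumPy through an eigen-decomposition of the covariance matrix. The 2D data set (stretched along one axis, then rotated by 30 degrees) is an assumption chosen so the expected direction is known in advance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D data (assumed for illustration): wide along x, narrow along y,
# then rotated by 30 degrees.
theta = np.deg2rad(30)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
X = (rng.normal(size=(1000, 2)) * [3.0, 0.5]) @ rotation.T

# 1. Center the data and compute the covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# 2. Eigen-decompose. eigh handles symmetric matrices and returns eigenvalues
#    in ascending order, so reorder them from largest to smallest.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 3. Each eigenvalue's share of the total is that component's explained variance ratio.
explained_ratio = eigenvalues / eigenvalues.sum()

# 4. Projecting onto the eigenvectors gives the principal component scores.
scores = X_centered @ eigenvectors

print("first direction:", eigenvectors[:, 0])
print("explained variance ratio:", explained_ratio)
```

The first eigenvector should point along the 30-degree axis (up to sign), and the sample variance of each column of `scores` equals the matching eigenvalue.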

## Q5: How KNN and PCA work together

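PCA reduces dimensionality and can filter out noise, which often improves KNN accuracy and lowers the cost of its distance computations. Both methods are sensitive to feature scale, so the features are typically standardized first.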

## Q6: KNN with and without feature scaling

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# X_train, X_test, y_train, and y_test are assumed to come from an earlier
# train/test split that is not shown here.

# Without scaling: features with large ranges dominate the distance
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_raw.fit(X_train, y_train)
print("Accuracy without scaling:", accuracy_score(y_test, knn_raw.predict(X_test)))

# With scaling: fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
print("Accuracy with scaling:", accuracy_score(y_test, knn_scaled.predict(X_test_scaled)))

Accuracy without scaling: 0.7777777777777778
Accuracy with scaling: 0.9333333333333333


## Q7: Explained variance ratio of each principal component

In [None]:
from sklearn.decomposition import PCA

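# Keep all components to see how the variance is spread across them
pca = PCA()
pca.fit(X_train_scaled)

# Fraction of the total variance explained by each principal component
for i, var in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: {var}")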

PC1: 0.3574992453956305
PC2: 0.19209250901858554
PC3: 0.10845609661045434
PC4: 0.07418330228706682
PC5: 0.06935667393547071
PC6: 0.05203091811029154
PC7: 0.043914797205546266
PC8: 0.025005533310710207
PC9: 0.02202075119142149
PC10: 0.019160301810529766
PC11: 0.0165172341635893
PC12: 0.012489177891226515
PC13: 0.007273459069476921
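
The ratios printed above can be accumulated to see how many components are needed for a given variance target. This is a small arithmetic sketch using only the standard library; the values are copied from the output, and the 95% threshold is one common choice rather than a fixed rule.

```python
from itertools import accumulate

# Explained-variance ratios copied from the output above (PC1 to PC13).
ratios = [
    0.3574992453956305, 0.19209250901858554, 0.10845609661045434,
    0.07418330228706682, 0.06935667393547071, 0.05203091811029154,
    0.043914797205546266, 0.025005533310710207, 0.02202075119142149,
    0.019160301810529766, 0.0165172341635893, 0.012489177891226515,
    0.007273459069476921,
]

# Running total of variance explained by the first 1, 2, ..., 13 components.
cumulative = list(accumulate(ratios))

# Smallest number of components whose cumulative share reaches 95%.
n_components_95 = next(i + 1 for i, total in enumerate(cumulative) if total >= 0.95)

print(f"First 2 components: {cumulative[1]:.3f}")
print(f"Components needed for 95%: {n_components_95}")
```

The first two components account for about 55% of the variance, while reaching 95% takes 10 of the 13 components.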


## Q8: KNN on the first two principal components

In [None]:
# Fit PCA on the scaled training data only, then project both sets onto 2 components
pca_2 = PCA(n_components=2)
X_train_pca = pca_2.fit_transform(X_train_scaled)
X_test_pca = pca_2.transform(X_test_scaled)

# Train and evaluate KNN in the 2-dimensional PCA space
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)

print("Accuracy with PCA (2 components):", accuracy_score(y_test, knn_pca.predict(X_test_pca)))

Accuracy with PCA (2 components): 0.9333333333333333


## Q9: KNN with Euclidean vs. Manhattan distance

In [None]:
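# Same K for both models (5, the scikit-learn default); only the distance metric differs
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric="manhattan")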

knn_euclidean.fit(X_train_scaled, y_train)
knn_manhattan.fit(X_train_scaled, y_train)

print("Euclidean accuracy:", accuracy_score(y_test, knn_euclidean.predict(X_test_scaled)))
print("Manhattan accuracy:", accuracy_score(y_test, knn_manhattan.predict(X_test_scaled)))

Euclidean accuracy: 0.9333333333333333
Manhattan accuracy: 0.9777777777777777


## Q10: High-dimensional biomedical data solution

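For high-dimensional biomedical data, standardize the features, then apply PCA and keep the components that together explain most of the variance (e.g., 95%). KNN is then trained on the reduced space. Performance is estimated with cross-validation, using accuracy, precision, recall, and ROC-AUC. Fitting the scaler and PCA inside each training fold keeps information from the held-out fold out of the model. This pipeline can reduce overfitting, lowers computational cost, and suits real-world biomedical datasets, where features often far outnumber samples.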
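
One possible way to wire this up in scikit-learn is sketched below. The data set is a synthetic stand-in generated with `make_classification` (an assumption, not real biomedical data), and the fold count and `n_neighbors=5` are illustrative choices. Wrapping the steps in a pipeline means the scaler and PCA are refit on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a biomedical data set (assumed): few samples, many features.
X, y = make_classification(
    n_samples=200, n_features=500, n_informative=20, n_redundant=50,
    class_sep=2.0, random_state=0,
)

# Scale -> PCA keeping 95% of the variance -> KNN, refit within each fold.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    KNeighborsClassifier(n_neighbors=5),
)

metrics = ["accuracy", "precision", "recall", "roc_auc"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(pipeline, X, y, cv=cv, scoring=metrics)

# Mean and spread of each metric across the folds.
for metric in metrics:
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

The numbers from synthetic data say nothing about real biomedical performance; the point is the structure, in which no preprocessing step ever sees the held-out fold.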