## **KNN & PCA | Assignment**

**Question 1:** What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?
Answer:
- K-Nearest Neighbors (KNN) is a simple, intuitive machine learning algorithm based on similarity. It is a lazy learner: it does not build a model during training. Instead, it stores the training data and does the work at prediction time, when a new data point arrives.

- In classification, KNN looks at the K nearest data points to the new sample and assigns the class that appears most frequently among those neighbors. For example, if most nearby points belong to class A, the new point is classified as class A.

- In regression, KNN predicts a numerical value by taking the average of the target values of the K nearest neighbors. The idea is that similar data points usually have similar output values.
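Both prediction rules can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return the indices of the k training points closest to query (Euclidean)."""
    distances = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest labels."""
    neighbors = k_nearest(train_X, query, k)
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: mean of the k nearest target values."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / k

# Toy data, invented for illustration only
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0)]
labels = ["A", "A", "A", "B", "B"]
values = [10.0, 12.0, 14.0, 50.0, 54.0]

predicted_class = knn_classify(points, labels, (1.2, 1.4), k=3)
predicted_value = knn_regress(points, values, (1.2, 1.4), k=3)
print(predicted_class, predicted_value)
```

The three nearest points to the query all carry label "A", so the vote is unanimous, and the regression prediction is the mean of their three target values.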

**Question 2:** What is the Curse of Dimensionality and how does it affect KNN performance?

Answer:
- The Curse of Dimensionality refers to problems that occur when working with data having a large number of features. As dimensions increase, the distance between data points becomes less meaningful.

- Since KNN depends heavily on distance calculations, high-dimensional data makes it difficult to correctly identify nearest neighbors. All points start to look equally distant, which reduces the accuracy of KNN and increases computational cost.

- As a result, KNN tends to perform poorly on high-dimensional datasets unless the number of dimensions is first reduced, for example with PCA.
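The "all points look equally distant" effect can be observed with a small simulation. The sketch below assumes uniformly random points in a unit hypercube, which is a simplification of real data, and `relative_contrast` is a helper name chosen for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, n_dims))   # uniform samples in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(data - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast_by_dim = {d: relative_contrast(1000, d) for d in (2, 10, 100, 1000)}
for d, c in contrast_by_dim.items():
    print(f"{d:>4} dims: relative contrast = {c:.2f}")
```

As the number of dimensions grows, the contrast shrinks toward zero, which means the "nearest" neighbor is barely nearer than any other point.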

**Question 3:** What is Principal Component Analysis (PCA)? How is it different from feature selection?

Answer:
- Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features while preserving most of the data's information. It does this by transforming original features into new variables called principal components.

- Feature selection chooses a subset of existing features, whereas PCA creates new features by combining original ones. PCA focuses on maximizing variance, while feature selection focuses on choosing the most relevant features directly.
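The difference shows up directly in scikit-learn. In the sketch below, `SelectKBest` with `f_classif` stands in for feature selection; it is only one of many possible selection methods and is an assumption of this example, as is the choice of the Wine dataset and of keeping 2 features.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target

# Feature selection: keeps 2 of the 13 original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, y)
kept = [name for name, keep in zip(data.feature_names, selector.get_support()) if keep]
print("Selected original features:", kept)

# PCA: builds 2 new columns, each a weighted combination of all 13 originals
pca = PCA(n_components=2).fit(X_scaled)
print("Weights of PC1 over the 13 features:", pca.components_[0].round(2))
```

The selected features keep their original names and meaning, while each principal component has a weight for every original feature and no direct physical interpretation.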

**Question 4**: What are eigenvalues and eigenvectors in PCA, and why are they important?

Answer:
- In PCA, eigenvectors represent the directions in which data varies the most, and eigenvalues represent how much variance exists along those directions.

- Eigenvectors decide the direction of principal components, while eigenvalues help decide which components are important. Components with higher eigenvalues carry more information and are selected first in PCA.
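To make the link concrete, the sketch below computes the eigenvalues and eigenvectors of the covariance matrix with NumPy and compares them with what scikit-learn's `PCA` reports. The Wine dataset is used here only as a convenient example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]      # sort from largest to smallest
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]      # columns are the principal directions

# Share of total variance carried by each component
explained_ratio = eigenvalues / eigenvalues.sum()
print("Variance share of the first 3 components:", explained_ratio[:3].round(3))

# scikit-learn exposes the same quantities after fitting
pca = PCA().fit(X)
print("Matches sklearn:", np.allclose(explained_ratio, pca.explained_variance_ratio_))
```

Each column of `eigenvectors` corresponds to a row of `pca.components_`, possibly with the sign flipped, since an eigenvector's direction is only defined up to sign.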

**Question 5:** How do KNN and PCA complement each other when applied in a single pipeline?

Answer:
- KNN works best when distances between data points are meaningful. PCA helps by reducing noise and removing less important dimensions, making distance calculations more reliable.

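- Using PCA before KNN can improve accuracy, reduce overfitting, and speed up computation, because KNN then compares points in fewer, less noisy dimensions. The gain is not guaranteed: if too few components are kept, useful information is discarded. Together, they form an efficient pipeline when the number of components is chosen with care.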
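One way to combine the two steps is a scikit-learn pipeline, sketched below on the Wine dataset. The choice of 5 components and 5 neighbors is arbitrary and only for illustration; in practice both would be tuned.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Baseline: scaling followed by KNN on all 13 features
knn_only = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Scaling, then PCA, then KNN on the reduced features
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),               # illustrative value, not tuned
    KNeighborsClassifier(n_neighbors=5),
)

# Each fold refits the scaler and PCA on its own training part only
scores_knn = cross_val_score(knn_only, X, y, cv=5)
scores_pca = cross_val_score(pca_knn, X, y, cv=5)
print("KNN only  mean accuracy:", scores_knn.mean().round(3))
print("PCA + KNN mean accuracy:", scores_pca.mean().round(3))
```

Wrapping the steps in a pipeline keeps the scaler and PCA from seeing validation data, so the cross-validated scores are a fairer comparison than fitting the transforms on the full dataset first.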

**Question 6:** Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

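Answer:
- Feature scaling substantially improves KNN on this dataset: test accuracy rises from about 0.72 without scaling to about 0.94 with `StandardScaler`. KNN is distance-based, so features with large numeric ranges dominate the distance calculation unless all features are brought to a comparable scale.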

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy without scaling:", accuracy_score(y_test, y_pred))

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn.fit(X_train_scaled, y_train)
y_pred_scaled = knn.predict(X_test_scaled)
print("Accuracy with scaling:", accuracy_score(y_test, y_pred_scaled))


Accuracy without scaling: 0.7222222222222222
Accuracy with scaling: 0.9444444444444444
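
A likely reason for the gap between the two accuracies above is that the Wine features live on very different scales. The short check below prints the standard deviation of each raw feature; the variable names are chosen for this example.

```python
from sklearn.datasets import load_wine

wine = load_wine()
stds = wine.data.std(axis=0)

# List features from the largest to the smallest spread
for name, s in sorted(zip(wine.feature_names, stds), key=lambda pair: -pair[1]):
    print(f"{name:<30} std = {s:.3f}")

# How many times larger the widest feature is than the narrowest
scale_ratio = stds.max() / stds.min()
print("Largest / smallest std:", round(scale_ratio, 1))
```

Without scaling, the feature with the largest spread contributes almost all of the Euclidean distance, so the other features have little influence on which neighbors are chosen.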


**Question 8:** Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

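Answer:
- On this split, KNN on the top 2 principal components reaches an accuracy of 1.0, compared with about 0.94 for KNN on all 13 scaled features. Two components therefore retain enough information to separate the classes here.
- The score is likely optimistic: the scaler and PCA are fit on the full dataset before the train/test split, so information from the test samples leaks into the transformation, and the test set contains only 36 samples.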

In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.2, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print(accuracy_score(y_test, y_pred))


1.0


**Question 9:** Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

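Answer:
- Both metrics perform equally well on this split: Euclidean and Manhattan distance each reach an accuracy of about 0.94. With only 36 test samples, this experiment does not show a difference between the two.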

In [5]:
knn_euclidean = KNeighborsClassifier(metric='euclidean')
knn_manhattan = KNeighborsClassifier(metric='manhattan')

knn_euclidean.fit(X_train_scaled, y_train)
knn_manhattan.fit(X_train_scaled, y_train)

print("Euclidean Accuracy:", accuracy_score(y_test, knn_euclidean.predict(X_test_scaled)))
print("Manhattan Accuracy:", accuracy_score(y_test, knn_manhattan.predict(X_test_scaled)))


Euclidean Accuracy: 0.9444444444444444
Manhattan Accuracy: 0.9444444444444444
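
For reference, the two metrics compared above differ only in how the per-feature differences are aggregated. A minimal sketch on two made-up points:

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of absolute per-axis differences
print(euclidean, manhattan)
```

Euclidean distance squares the differences, so it is more sensitive to one large gap in a single feature, while Manhattan distance weights all gaps linearly.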


**Question 10:** Explain a PCA + KNN pipeline for high-dimensional gene expression data.

Answer:
- For high-dimensional gene expression data, PCA is first applied to reduce the number of features while retaining maximum variance. The number of components is chosen based on explained variance (for example, 95%).

- After dimensionality reduction, KNN is used for classification since the reduced dataset improves distance calculations and reduces overfitting.

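- The model is evaluated with cross-validation, reporting accuracy, precision, and recall. Scaling and PCA are fit inside each training fold so that no information from the validation samples leaks into the transformation. The pipeline can be justified to stakeholders on the grounds that it reduces noise and computational cost and that its performance estimates come from held-out data, which matters for real-world biomedical applications.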
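
The steps above can be sketched as a scikit-learn pipeline. No gene expression data is part of this assignment, so the example uses `make_classification` to generate a synthetic stand-in with few samples and many features; the sample count, feature count, and number of neighbors are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: 120 samples, 2000 features (assumption)
X, y = make_classification(
    n_samples=120, n_features=2000, n_informative=30, n_redundant=0, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # A float n_components keeps just enough components to explain 95% of the variance
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Stratified folds keep the class balance similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "precision_macro", "recall_macro"]
results = cross_validate(pipeline, X, y, cv=cv, scoring=metrics)

for metric in metrics:
    print(f"{metric}: {results['test_' + metric].mean():.3f}")
```

Because the scaler and PCA are steps of the pipeline, `cross_validate` refits them on the training part of each fold, which keeps the reported metrics free of leakage from the validation samples.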