**KNN & PCA Assignment**

**ANS 1-** K-Nearest Neighbors (KNN) is a supervised, instance-based learning algorithm. It does not fit an explicit model during training; instead, it stores the training set and, at prediction time, compares each new observation with the stored points using a distance metric.

In classification, KNN looks at the K nearest data points to a new observation and assigns the class that appears most frequently among them.

In regression, it predicts the average (or sometimes weighted average) of the target values of the K nearest neighbors.

The key idea behind KNN is that similar data points tend to have similar outputs.
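The voting and averaging steps described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to `query` (Euclidean)."""
    distances = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: unweighted mean of the k nearest neighbors' targets."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / k

# Toy data (hypothetical): two well-separated clusters in 2-D
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]
values = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]

print(knn_classify(points, labels, (1.8, 1.6)))  # query sits inside cluster A
print(knn_regress(points, values, (6.4, 6.4)))   # mean of the three cluster-B targets
```

Note that there is no training step at all: the "model" is just the stored `points`, which is why prediction cost grows with the size of the dataset.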

**ANS2 -** The Curse of Dimensionality refers to problems that arise when the number of features becomes very large.

As dimensions increase:

- Distances between data points become less meaningful

- All points start appearing equally far away

- KNN struggles to identify true “nearest” neighbors

Since KNN relies entirely on distance calculations, high-dimensional data can significantly reduce its accuracy, and each prediction becomes more expensive because every distance is computed over more features.
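The "all points look equally far away" effect can be observed with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and uses a hypothetical helper, `relative_contrast`, which measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from one random query point
    to n_points random points, all drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

for d in (2, 10, 100, 1000):
    print(f"{d:>5} dims: relative contrast = {relative_contrast(500, d):.3f}")
```

As the number of dimensions grows, the printed contrast should shrink toward zero, meaning the nearest and farthest neighbors become hard to tell apart by distance alone.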

**ANS3-** Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms original features into a smaller set of new features called principal components.

Key differences from feature selection:

- PCA creates new features, each a linear combination of the existing ones

- Feature selection keeps a subset of the original features and discards the others

- PCA maximizes captured variance without looking at the target, while feature selection chooses variables judged relevant (often with respect to the target)
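A short comparison on the Wine dataset makes the distinction concrete. This is a sketch under assumptions: `SelectKBest` with the ANOVA F-score (`f_classif`) is just one possible feature selection method, and keeping 3 features/components is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Feature selection: keeps 3 of the 13 original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_scaled, y)
kept = selector.get_support(indices=True)
print("Selected original column indices:", kept)
print("Columns copied as-is:", np.allclose(X_selected, X_scaled[:, kept]))

# PCA: builds 3 new columns, each a weighted mix of all 13 originals
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)
print("PCA weight matrix shape:", pca.components_.shape)  # (3, 13)
print("Weights of first component:", np.round(pca.components_[0], 2))
```

The selector needs `y` to score features, whereas `PCA.fit_transform` never sees the labels.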

**ANS4 -**

- Eigenvectors of the data's covariance matrix give the directions (principal components) along which the data varies the most.

- Eigenvalues give how much variance is captured along each eigenvector.

They are important because PCA ranks components using eigenvalues and keeps the ones that explain the most variance, helping reduce dimensionality while preserving information.
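The ranking step can be reproduced by hand with NumPy. The following is a minimal sketch on synthetic 2-D data (made up for illustration), assuming PCA is computed through an eigendecomposition of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (hypothetical): the second feature is correlated with the first
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=500)
X = np.column_stack([x1, x2])

# Center the data and compute the sample covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so reorder from largest to smallest
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()
print("Eigenvalues:", eigenvalues)
print("First principal direction:", eigenvectors[:, 0])
print("Explained variance ratio:", explained_ratio)

# Projecting onto the top eigenvector gives a 1-D representation whose
# variance equals the largest eigenvalue
X_reduced = X_centered @ eigenvectors[:, :1]
print("Variance of projection:", X_reduced.var(ddof=1))
```

Dividing each eigenvalue by their sum gives the same kind of quantity that scikit-learn reports as `explained_variance_ratio_`.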

**ANS5-** KNN suffers in high-dimensional spaces, while PCA reduces dimensionality.

By applying PCA before KNN:

- Noise and redundant features are reduced

- Distance calculations become more meaningful

- Efficiency improves because distances are computed over fewer dimensions, and accuracy often improves as well

This combination is especially effective for datasets with many correlated features.

In [1]:
#ANS6 - KNN with and without Feature Scaling (Wine Dataset)
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load data
X, y = load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy without scaling:", accuracy_score(y_test, y_pred))

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn.fit(X_train_scaled, y_train)
y_pred_scaled = knn.predict(X_test_scaled)
print("Accuracy with scaling:", accuracy_score(y_test, y_pred_scaled))


Accuracy without scaling: 0.7407407407407407
Accuracy with scaling: 0.9629629629629629


In [2]:
#ANS7- PCA Explained Variance Ratio
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()
pca.fit(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)


Explained variance ratio: [0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
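
The ratios printed above can be accumulated to see how many leading components are needed to reach a given share of the variance. This sketch copies the values from the output; the helper name `components_needed` is made up for this example.

```python
import numpy as np

# Explained variance ratios copied from the output above
ratios = np.array([
    0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294, 0.04935823,
    0.04238679, 0.02680749, 0.02222153, 0.01930019, 0.01736836, 0.01298233,
    0.00795215,
])

cumulative = np.cumsum(ratios)
print("Cumulative explained variance:", np.round(cumulative, 4))

def components_needed(threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)

for t in (0.90, 0.95):
    print(f"{t:.0%} of variance: {components_needed(t)} components")
```

For these values, 8 components reach 90% of the variance and 10 components reach 95%, out of 13 original features.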


In [3]:
#ANS8-KNN on PCA-Reduced Data (Top 2 Components)

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

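# X_scaled comes from the ANS7 cell and y from the ANS6 cell.
# Note: the scaler and PCA are fit on the full dataset before the split,
# so test rows influence the transformation. Fitting them on the training
# split only (for example inside a Pipeline) would avoid this leakage.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)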

X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.3, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("Accuracy with PCA (2 components):", accuracy_score(y_test, y_pred))


Accuracy with PCA (2 components): 0.9814814814814815


In [4]:
#ANS9- KNN with Different Distance Metrics

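from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# n_neighbors is left at its default of 5 for both models
knn_euclidean = KNeighborsClassifier(metric='euclidean')
knn_manhattan = KNeighborsClassifier(metric='manhattan')

# X_train_scaled / X_test_scaled come from the ANS6 cell; y_train / y_test
# match them because every split uses the same random_state and row order
knn_euclidean.fit(X_train_scaled, y_train)
knn_manhattan.fit(X_train_scaled, y_train)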

print("Euclidean accuracy:",
      accuracy_score(y_test, knn_euclidean.predict(X_test_scaled)))

print("Manhattan accuracy:",
      accuracy_score(y_test, knn_manhattan.predict(X_test_scaled)))


Euclidean accuracy: 0.9629629629629629
Manhattan accuracy: 0.9629629629629629


**ANS10-** Gene expression datasets have thousands of features but very few samples, making overfitting a major concern.

- **Step 1: PCA**

  - Standardize the data

  - Apply PCA to compress correlated genes and suppress noise

  - Retain the components explaining ~90–95% of the variance

- **Step 2: Choosing Components**

  - Use a cumulative explained variance plot

  - Balance information retention and simplicity

- **Step 3: KNN Classification**

  - Train KNN on the PCA-transformed data

  - Reduced dimensionality improves neighbor reliability

- **Step 4: Evaluation**

  - Cross-validation, with scaling and PCA refit inside each fold so no test information leaks into the transformation

  - Accuracy, precision, recall

  - Confusion matrix

- **Business Justification**

  - Reduces overfitting

  - Improves generalization

  - Makes the model more stable and, through a small number of components, easier to inspect for biomedical use

In [24]:
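from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale -> PCA (keep enough components for 95% of the variance) -> KNN
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])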
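
A pipeline like this could be evaluated along the lines of Step 4. The sketch below is only an illustration: no real gene expression data is available here, so it assumes a synthetic stand-in generated with `make_classification` (100 samples, 2000 features), and the pipeline is repeated so the block runs on its own. The variable names `X_genes` and `y_genes` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT real gene expression data): few samples, many features
X_genes, y_genes = make_classification(
    n_samples=100, n_features=2000, n_informative=20,
    n_classes=2, random_state=42,
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Stratified folds keep the class balance similar in every split;
# the scaler and PCA are refit on the training part of each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_genes, y_genes, cv=cv, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())

# Out-of-fold predictions give one confusion matrix over all samples
y_hat = cross_val_predict(pipeline, X_genes, y_genes, cv=cv)
print(confusion_matrix(y_genes, y_hat))
```

Passing the whole pipeline to the cross-validation helpers, rather than pre-transformed data, is what keeps the held-out fold from influencing the scaling and PCA steps.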
