# KNN & PCA Assignment

1. What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?
 - K-Nearest Neighbors (KNN) is a supervised, non-parametric, instance-based learning algorithm.
    It works by finding the K closest data points to a new input based on a distance metric such as Euclidean distance.
    
    In classification, the class is decided by majority voting among the K nearest neighbors.
    In regression, the output is the average of the target values of the K nearest neighbors.
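    KNN is a lazy learner: it builds no explicit model, stores the entire training dataset, and defers all computation to prediction time.
    Its performance depends heavily on the choice of K, the distance metric, and whether the features are on comparable scales.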
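
The voting and averaging steps described above can be sketched from scratch with NumPy. This is a minimal illustration, not how scikit-learn implements KNN; the function name `knn_predict` and the toy arrays are made up for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for a single point x_new from its k nearest training points."""
    # Euclidean distance from x_new to every stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values
    return float(np.mean(neighbor_targets))


# Toy data (hypothetical): two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 48.0])

label = knn_predict(X_train, y_class, np.array([1.1, 1.0]), k=3)
value = knn_predict(X_train, y_reg, np.array([5.0, 5.0]), k=3, task="regression")
print("Predicted class:", label)
print("Predicted value:", value)
```

Note that there is no training step: all the work happens at prediction time, which is why the full training set has to be kept in memory.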

2. What is the Curse of Dimensionality and how does it affect KNN performance?
 - The Curse of Dimensionality refers to problems that arise when data has many features (dimensions). As the number of dimensions increases, data points become sparse, and the distances between them become less meaningful because the nearest and farthest points end up almost equally far away.
    
    In KNN, this makes it difficult to identify true nearest neighbors.
    The algorithm requires more data to maintain performance in high dimensions.
    It also increases computation time and memory usage.
    Dimensionality reduction techniques like PCA are commonly used to mitigate this issue.
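
One way to see distances losing meaning is to measure how much farther the farthest point is than the nearest one as the dimension grows. The sketch below uses uniformly random points, which is an assumption chosen for illustration; real datasets with structure behave less extremely. The helper name `relative_contrast` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(max - min) / min distance from a random query to random uniform points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.3f}")
```

A contrast close to zero means the "nearest" neighbor is barely nearer than any other point, so the neighbors KNN picks carry little information.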

3. What is Principal Component Analysis (PCA)? How is it different from
feature selection?
 - Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique. It transforms the original features into a smaller set of new, orthogonal features called principal components. Each component is a linear combination of the original features, and the components are ordered so that each captures as much of the remaining variance as possible.
    
    Feature selection chooses a subset of the original features, while PCA creates new features. PCA reduces interpretability but can reduce noise and often improves performance when features are correlated. Feature selection preserves the meaning of the original features but may miss patterns spread across several of them.
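
The contrast can be made concrete on the Wine dataset. In this sketch, `SelectKBest` with `f_classif` stands in for feature selection in general (many other selectors exist), and keeping 3 features or components is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)

# Feature selection: keep 3 of the original 13 columns (uses the labels)
selector = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, data.target)
selected_names = [data.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", selected_names)

# PCA: build 3 new columns, each a weighted mix of all 13 originals (ignores the labels)
pca = PCA(n_components=3).fit(X_scaled)
print("Loadings shape (components x original features):", pca.components_.shape)
```

The selected columns keep their original names and units, whereas each PCA component is described only by its 13 loadings.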

4. What are eigenvalues and eigenvectors in PCA, and why are they
important?
 - Eigenvectors of the data's covariance matrix give the directions of the principal axes; the eigenvector with the largest eigenvalue points along the direction of maximum variance.
    
    Eigenvalues indicate the amount of variance captured along each eigenvector.
    
    In PCA, the eigenvectors define the principal components.
    Eigenvalues rank the components by importance.
    Components with the largest eigenvalues are retained, while the rest are discarded.
    This allows dimensionality reduction with minimal loss of variance.
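
The link between eigenvalues and explained variance can be checked by hand. The sketch below computes the eigendecomposition of the covariance matrix with NumPy and compares the normalized eigenvalues with scikit-learn's `explained_variance_ratio_`; it assumes standardized Wine features, which is a choice made for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Each eigenvalue's share of the total is that component's explained variance ratio
manual_ratio = eigenvalues / eigenvalues.sum()
sklearn_ratio = PCA().fit(X).explained_variance_ratio_
print("Top 3 ratios from eigenvalues:", np.round(manual_ratio[:3], 3))
print("Matches sklearn PCA:", np.allclose(manual_ratio, sklearn_ratio))
```

The columns of `eigenvectors` correspond to the rows of `pca.components_`, up to a possible sign flip per component.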

5. How do KNN and PCA complement each other when applied in a single
pipeline?
 - PCA reduces the dimensionality of the data, removing noise and redundant (correlated) features.
    This often improves KNN performance by making distance calculations more meaningful.
    Fewer dimensions also reduce computational cost and the risk of overfitting.
    KNN benefits from the cleaner, more compact feature space.
    Together, PCA and KNN can handle high-dimensional datasets effectively.
    Both steps are sensitive to feature scale, so the data is usually standardized first.
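
A single pipeline of this kind might look like the sketch below, which chains scaling, PCA, and KNN with scikit-learn's `make_pipeline`. The choice of 2 components, K = 5, and a stratified 80/20 split are assumptions made for the example.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling and PCA are fit on the training split only, then reused on the test split
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print("Test accuracy (scale -> PCA -> KNN):", test_accuracy)
```

Wrapping the steps in one object means the test data never influences the scaler or the PCA projection.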



In [None]:
'''
6. Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.
'''
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Accuracy without scaling:", accuracy_score(y_test, knn.predict(X_test)))

# With scaling
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

knn.fit(X_train_s, y_train)
print("Accuracy with scaling:", accuracy_score(y_test, knn.predict(X_test_s)))


Accuracy without scaling: 0.7222222222222222
Accuracy with scaling: 0.9444444444444444


In [None]:
'''
7. Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.

'''

from sklearn.decomposition import PCA
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

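# PCA is fit on the raw, unscaled features here, so features with large
# numeric ranges (such as proline) dominate the first component
pca = PCA()
pca.fit(X)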

print("Explained Variance Ratio:")
print(pca.explained_variance_ratio_)

Explained Variance Ratio:
[9.98091230e-01 1.73591562e-03 9.49589576e-05 5.02173562e-05
 1.23636847e-05 8.46213034e-06 2.80681456e-06 1.52308053e-06
 1.12783044e-06 7.21415811e-07 3.78060267e-07 2.12013755e-07
 8.25392788e-08]


In [None]:
'''
8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.

'''


from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)

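# Note: the scaler and PCA are fit on the full dataset before splitting, so
# test-set statistics leak into preprocessing; fitting them on the training
# split only would give a stricter estimate
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Keep only the top 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)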

X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

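# Baseline for comparison: KNN on all 13 scaled features scored 0.944 on the same split (question 6)
print("Accuracy with PCA:", accuracy_score(y_test, knn.predict(X_test)))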

Accuracy with PCA: 1.0


In [None]:
'''
9. Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

'''

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

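# The scaler is fit on the full dataset before the split; fitting it on
# X_train only would avoid leaking test-set statistics
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)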

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
print("Accuracy with Euclidean Distance:", knn_euclidean.score(X_test, y_test))

knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
print("Accuracy with Manhattan Distance:", knn_manhattan.score(X_test, y_test))


Accuracy with Euclidean Distance: 0.9444444444444444
Accuracy with Manhattan Distance: 0.9444444444444444


10. You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.

Explain how you would:
1. Use PCA to reduce dimensionality
2. Decide how many components to keep
3. Use KNN for classification post-dimensionality reduction
4. Evaluate the model
5. Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data


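- In high-dimensional biomedical data, models often overfit because there are many features and few samples.
    
    We first standardize the features and apply PCA to reduce dimensionality while preserving as much variance as possible.
    
    The number of components is chosen from the cumulative explained variance (e.g., enough components to reach 95%), optionally confirmed by cross-validated performance.
    
    KNN is then trained on the reduced feature space, where distances are more meaningful and generalization tends to improve.
    
    The model is evaluated with stratified cross-validation, fitting the scaler and PCA inside each fold to avoid leakage, and reported with accuracy and F1-score.
    
    This pipeline reduces noise, improves stability, and is computationally efficient.
    It offers a robust and scalable approach for real-world medical applications, with the trade-off that principal components are harder to interpret than individual genes.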
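
The steps above can be sketched end to end. No gene expression data is available here, so the example uses `make_classification` to generate a synthetic stand-in with many features and few samples; the sample count, feature count, number of classes, K = 5, and the 95% variance threshold are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: 100 samples, 2000 features
X, y = make_classification(
    n_samples=100, n_features=2000, n_informative=20,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # A float n_components keeps enough components to explain 95% of the variance
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Each fold refits the scaler and PCA on its own training part only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1_macro")
print("Macro F1 per fold:", scores.round(3))
print("Mean macro F1:", scores.mean().round(3))
```

Passing the whole pipeline to `cross_val_score` is the design choice that matters: preprocessing is repeated inside every fold, so the reported score is not inflated by information from the held-out samples.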
