Q1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?  
- K-Nearest Neighbors (KNN) is a supervised, non-parametric, instance-based machine learning algorithm. It stores the training data and, for each new sample, finds the K closest training points under a distance metric such as Euclidean distance.

- Classification:
The predicted class is the majority vote among the K nearest neighbors.

- Regression:
The predicted value is the average (mean) of the K nearest neighbors’ target values.
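
To make the two prediction rules concrete, here is a minimal from-scratch sketch. The helper name `knn_predict`, the toy data, and the choice of Euclidean distance are assumptions made for illustration; in practice `KNeighborsClassifier` and `KNeighborsRegressor` from scikit-learn do this work.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for a single sample (hypothetical helper, Euclidean distance assumed)."""
    X_train = np.asarray(X_train, dtype=float)
    # Distance from the new sample to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: mean of the k neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy data: two well-separated clusters
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]]
labels = ["red", "red", "red", "blue", "blue"]
values = [10.0, 12.0, 11.0, 50.0, 52.0]

print(knn_predict(X_toy, labels, [1.1, 1.0], k=3))                      # red
print(knn_predict(X_toy, values, [1.1, 1.0], k=3, task="regression"))  # 11.0
```

The same neighbor search serves both tasks; only the final aggregation step (vote versus mean) differs.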


Q2: What is the Curse of Dimensionality and how does it affect KNN performance?   
- The Curse of Dimensionality means that as the number of features increases:
    - Distances between points become less meaningful
    - All points appear almost equally distant
    - KNN accuracy degrades and computation cost increases
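
The "almost equally distant" effect can be observed with a small experiment. This is only a sketch on uniformly random points (an assumption; real data behaves less uniformly), and `relative_contrast` is a hypothetical helper that compares the farthest and nearest neighbor of a query point.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_features, n_points=500):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# The contrast shrinks as dimensionality grows: the nearest and farthest
# points end up at nearly the same distance from the query.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

When the contrast approaches zero, the "nearest" neighbors are barely nearer than any other point, which is why KNN votes become unreliable.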

Q3: What is Principal Component Analysis (PCA)? How is it different from feature selection?
- Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms features into new orthogonal components that capture maximum variance.
    | Feature Selection         | PCA                      |
    | ------------------------- | ------------------------ |
    | Selects existing features | Creates new features     |
    | Uses subset               | Uses linear combinations |
    | Interpretable             | Less interpretable       |
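
The contrast in the table can be seen on the Wine dataset. The sketch below uses `SelectKBest` with `f_classif` as one possible feature selection method (an assumption; the text does not name a specific selector) and compares it with a two-component PCA.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_std = StandardScaler().fit_transform(data.data)

# Feature selection: keeps 2 of the original 13 columns unchanged
selector = SelectKBest(f_classif, k=2).fit(X_std, data.target)
kept = [data.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", kept)

# PCA: each component is a weighted combination of all 13 features
pca = PCA(n_components=2).fit(X_std)
print("PC1 weights over all 13 features:", np.round(pca.components_[0], 2))
```

The selected features keep their original names and units, while each principal component mixes every feature, which is why PCA output is harder to interpret.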



Q4: What are eigenvalues and eigenvectors in PCA, and why are they important?

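- Eigenvectors: The eigenvectors of the data's covariance matrix define the directions of the principal components. The first one points along the direction of maximum variance.

- Eigenvalues: The amount of variance captured along each eigenvector.

- A higher eigenvalue means a more important component, so components are ranked by eigenvalue and the lowest-ranked ones can be dropped.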
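
The link between eigenvalues and explained variance can be checked numerically. The sketch below fits on the full standardized Wine data for brevity (an assumption; a real workflow would fit on the training split only) and compares a manual eigendecomposition with scikit-learn's `PCA`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_std, rowvar=False)

# eigh is appropriate for symmetric matrices; it returns ascending eigenvalues
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # sort descending
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Each eigenvalue's share of the total is that component's explained variance ratio
manual_ratio = eigenvalues / eigenvalues.sum()

pca = PCA().fit(X_std)
print(np.allclose(manual_ratio, pca.explained_variance_ratio_))
```

The columns of `eigenvectors` correspond to the rows of `pca.components_`, possibly with flipped signs, since an eigenvector's direction is only defined up to sign.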
    

Q5: How do KNN and PCA complement each other when applied in a single pipeline?
KNN and PCA work well together because PCA prepares the data, and KNN then makes better-informed decisions on that prepared data.

- PCA (Principal Component Analysis) reduces the number of features in the Wine dataset by transforming the original chemical measurements into a smaller set of uncorrelated principal components that retain most of the variance.

- This mitigates the curse of dimensionality, reduces noise, and removes the redundancy between correlated features.

- KNN, being a distance-based algorithm, benefits directly from this because distances become more meaningful in a lower-dimensional space.

- As a result, KNN becomes faster, less prone to overfitting, and often more accurate.
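
One way to combine the two steps is a scikit-learn pipeline, sketched below. The use of `make_pipeline`, two components, and K=5 are illustrative assumptions rather than the only reasonable configuration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale -> PCA -> KNN; every step is fitted on the training data only
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)
pipe.fit(X_train, y_train)

print("Pipeline test accuracy:", pipe.score(X_test, y_test))
```

Wrapping the steps in a pipeline keeps the scaler and PCA from seeing the test data during fitting, so the reported accuracy is not inflated by leakage.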

In [1]:
#Q6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load data
X, y = load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
pred1 = knn.predict(X_test)
print("Accuracy without scaling:", accuracy_score(y_test, pred1))

# With scaling
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

knn.fit(X_train_sc, y_train)
pred2 = knn.predict(X_test_sc)
print("Accuracy with scaling:", accuracy_score(y_test, pred2))


Accuracy without scaling: 0.7222222222222222
Accuracy with scaling: 0.9444444444444444


In [2]:
#Q7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_train_sc)

print("Explained Variance Ratio:")
print(pca.explained_variance_ratio_)


Explained Variance Ratio:
[0.35900066 0.18691934 0.11606557 0.07371716 0.0665386  0.04854582
 0.04195042 0.02683922 0.0234746  0.01889734 0.01715943 0.01262928
 0.00826257]


In [3]:
#Q8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.
pca2 = PCA(n_components=2)
X_train_pca = pca2.fit_transform(X_train_sc)
X_test_pca = pca2.transform(X_test_sc)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
pred_pca = knn_pca.predict(X_test_pca)

print("Accuracy with PCA + KNN:", accuracy_score(y_test, pred_pca))


Accuracy with PCA + KNN: 1.0


In [4]:
#Q9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.
for metric in ['euclidean', 'manhattan']:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train_sc, y_train)
    preds = knn.predict(X_test_sc)
    print(f"Accuracy ({metric}):", accuracy_score(y_test, preds))


Accuracy (euclidean): 0.9444444444444444
Accuracy (manhattan): 0.9444444444444444


Q10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models overfit.

Explain how you would:

● Use PCA to reduce dimensionality

● Decide how many components to keep

● Use KNN for classification post-dimensionality reduction

● Evaluate the model

● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data


- Step-by-Step Pipeline
    1. PCA for Dimensionality Reduction
        - Compress thousands of gene features into a few components, which reduces noise
        - Fewer input dimensions lower the risk of overfitting
    2. Choosing Number of Components
        - Keep enough components to explain 90–95% of the variance
        - Confirm with a scree plot or cumulative variance curve
    3. Apply KNN
        - Works well in the reduced feature space
        - Simple and interpretable
    4. Evaluation Metrics
        - Accuracy
        - ROC-AUC
        - F1-score (for imbalanced data)
    5. Business / Stakeholder Justification
        - “This pipeline reduces noise, limits overfitting, improves generalization, and supports reliable predictions in sensitive biomedical applications.”
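
The steps above can be sketched end to end as follows. No gene expression data is available here, so the example uses a synthetic stand-in from `make_classification` (an assumption), and the 95% variance threshold, K=5, and 5-fold stratified cross-validation are illustrative choices. Cross-validation is used because a single train/test split is unstable when there are few samples.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features
X, y = make_classification(
    n_samples=120, n_features=2000, n_informative=20,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

pipe = make_pipeline(
    StandardScaler(),
    # A float n_components keeps enough components to explain 95% of the variance
    PCA(n_components=0.95, svd_solver="full"),
    KNeighborsClassifier(n_neighbors=5),
)

# Scaling and PCA are refitted inside every fold, so no test-fold
# information leaks into the dimensionality reduction step
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "f1_macro", "roc_auc_ovr"]
scores = cross_validate(pipe, X, y, cv=cv, scoring=metrics)

for name in metrics:
    print(name, round(scores[f"test_{name}"].mean(), 3))
```

Reporting the mean and spread of each metric across folds gives stakeholders a more honest picture than a single accuracy figure.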
