# KNN & PCA

**Q1.** What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

- K-Nearest Neighbors (KNN) is one of the simplest and most intuitive supervised machine learning algorithms, used for both classification and regression. It belongs to the family of instance-based (lazy) learning algorithms: it does not explicitly learn a model during training. Instead, it stores all the training data and defers the computation until a prediction is requested.

 - KNN in Classification:

1. Choose a value of K (number of neighbors).

2. Calculate the distance between the query point and all training data points.

3. Select the K nearest neighbors based on smallest distances.

4. Count the class labels of these K neighbors.

5. Assign the query point to the majority class among the neighbors.

   - (Optional: use distance-weighted voting → closer neighbors have more influence).

- KNN in Regression:

1. Choose a value of K.

2. Calculate the distance between the query point and all training data points.

3. Select the K nearest neighbors based on smallest distances.

4. Take the average (or weighted average) of the neighbors’ target values.

5. Assign this average as the predicted value for the query point.
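
The steps listed for both tasks can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN: the function name `knn_predict`, the `task` argument, and the toy data are all made up for this example, and it uses plain Euclidean distance with unweighted voting and averaging.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3, task="classification"):
    """Predict a label (classification) or a value (regression) for one query point."""
    X_train = np.asarray(X_train, dtype=float)
    query = np.asarray(query, dtype=float)
    # Step 2: Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - query, axis=1)
    # Step 3: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "classification":
        # Steps 4-5: majority vote among the neighbors' labels.
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: unweighted mean of the neighbors' target values.
    return float(np.mean(neighbor_targets))

# Hypothetical toy data: two well-separated clusters.
X_train = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]]
y_class = ["A", "A", "A", "B", "B", "B"]
y_value = [10.0, 12.0, 11.0, 50.0, 52.0, 51.0]

label = knn_predict(X_train, y_class, [1.8, 1.6], k=3, task="classification")
value = knn_predict(X_train, y_value, [1.8, 1.6], k=3, task="regression")
print(label, value)  # the three nearest points are all in cluster "A"
```

The only difference between the two tasks is the final aggregation step: a vote for classification, a mean for regression.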

**Q2.**  What is the Curse of Dimensionality and how does it affect KNN
performance?

- The curse of dimensionality refers to the problems that arise when analyzing data in very high-dimensional spaces. As the number of features (dimensions) grows, data points become sparse and distance becomes a less useful measure of similarity. KNN relies on distances to find the nearest neighbors, so it is hit directly: in high dimensions, the relative difference between the distance to the nearest neighbor and the distance to the farthest one shrinks toward zero. As a result, KNN struggles to identify truly similar points, and its predictions become less accurate and more expensive to compute.

- How It Affects KNN Performance:

1. Distances lose significance:

   - In high dimensions, the distance between the closest and farthest neighbors tends to become almost the same.

   - This makes it hard for KNN to identify truly “nearest” neighbors.

2. Increased computation:

   - Each distance calculation costs time proportional to the number of features, and KNN computes one distance per training point for every query, so prediction becomes computationally heavy.

3. Overfitting risk:

   - With too many features, KNN may fit to noise rather than useful patterns.

   - Unless features are reduced or selected carefully, performance degrades.

4. Need for feature scaling & selection:

   - Features with larger numeric ranges can dominate the distance calculation, leading to biased predictions unless the features are scaled.

   - KNN becomes very sensitive to irrelevant or redundant features in high-dimensional space.
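
The claim that nearest and farthest distances converge can be checked with a small simulation. The sketch below assumes uniformly random points in the unit hypercube, which is a simplification of real data; the helper name `relative_contrast` and the sample sizes are chosen only for illustration.

```python
import numpy as np

def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

rng = np.random.default_rng(0)
contrasts = {d: relative_contrast(1000, d, rng) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:>4}  relative contrast={c:.3f}")
```

A large relative contrast means the nearest neighbor is much closer than the farthest point. As the number of dimensions grows the value shrinks, which is the sense in which "nearest" loses meaning.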

**Q3.** What is Principal Component Analysis (PCA)? How is it different from
feature selection?

- Principal Component Analysis (PCA) is a dimensionality reduction technique used to simplify large datasets by transforming the original features into a new set of uncorrelated features called principal components. These components are ordered in such a way that the first few retain most of the variation (information) present in the original data. Unlike feature selection, PCA does not discard features but instead creates new ones that are linear combinations of the original variables, making the dataset smaller while still capturing its essential patterns.

- Difference Between PCA and Feature Selection:

1. Nature of Features

   - PCA: Creates new features (principal components).

   - Feature Selection: Keeps original features.

2. Method

   - PCA: Transformation-based (projects data into a new space).

   - Feature Selection: Selection-based (chooses best subset of features).

3. Interpretability

   - PCA: Components are often less interpretable.

   - Feature Selection: Features are easy to interpret.

4. Goal

   - PCA: Preserve maximum variance in fewer dimensions.

   - Feature Selection: Keep only most relevant features for prediction.

5. Output

   - PCA: Produces a new dataset with reduced dimensions.

   - Feature Selection: Produces a smaller version of the original dataset.
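
The contrast can be made concrete on the Wine dataset. This is an illustrative sketch: `SelectKBest` with the ANOVA F-score (`f_classif`) is just one of many possible feature selection methods, and keeping 3 features or components is an arbitrary choice.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature selection: keeps 3 of the original 13 columns, names intact.
selector = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, y)
selected_names = [wine.feature_names[i] for i in selector.get_support(indices=True)]
X_selected = selector.transform(X_scaled)

# PCA: builds 3 new columns, each a weighted mix of all 13 original features.
pca = PCA(n_components=3).fit(X_scaled)
X_pca = pca.transform(X_scaled)

print("Selected original features:", selected_names)
print("PCA loadings shape (components x original features):", pca.components_.shape)
```

Both outputs have three columns, but the selected columns still carry their original names and units, whereas each PCA column is defined by a row of 13 loadings in `pca.components_`.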

**Q4.**   What are eigenvalues and eigenvectors in PCA, and why are they
important?

- Eigenvectors

   - In PCA, the eigenvectors of the data's covariance matrix represent the directions (axes) in the feature space along which the data varies the most.

   - Each eigenvector corresponds to a principal component.

   - They define the “new axes” after PCA transformation.

- Eigenvalues

   - Eigenvalues tell us the amount of variance captured by each eigenvector (principal component).

   - A higher eigenvalue means that component captures more information (variance) from the data.

- Why They Are Important in PCA

1. Identify Principal Components

   - Eigenvectors determine the orientation of the new feature space (principal components).

2. Measure Variance (Importance)

   - Eigenvalues tell us how much variance each principal component explains.

   - Example: If the first eigenvalue is much larger than the others, the first principal component explains most of the variance in the data.

3. Dimensionality Reduction

   - By sorting eigenvalues in descending order and keeping only the top ones, PCA reduces the dataset to fewer dimensions while still preserving most of the information.
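
To make the roles of eigenvalues and eigenvectors concrete, here is a minimal NumPy sketch of PCA via eigendecomposition of the covariance matrix. The toy data is invented for illustration, and library implementations such as scikit-learn typically use SVD instead, so treat this as a conceptual walkthrough.

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy data: 200 samples, 3 features; the second is strongly correlated with the first.
x1 = rng.normal(size=200)
X = np.column_stack([x1, 2.0 * x1 + rng.normal(scale=0.1, size=200), rng.normal(size=200)])

# Center the data, then form the covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]   # re-sort in descending order
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # columns are the principal directions

# Each eigenvalue's share of the total is that component's explained variance ratio.
explained_variance_ratio = eigenvalues / eigenvalues.sum()

# Dimensionality reduction: project onto the top 2 eigenvectors.
X_reduced = X_centered @ eigenvectors[:, :2]
print(explained_variance_ratio)
```

The variance of each projected column equals the corresponding eigenvalue, which is why sorting by eigenvalue ranks the components by how much information they keep.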

**Q5.** How do KNN and PCA complement each other when applied in a single
pipeline?

- How KNN and PCA Complement Each Other

1. PCA reduces dimensionality

   - KNN struggles in high-dimensional spaces because of the curse of dimensionality (distances lose meaning).

   - PCA projects data onto fewer dimensions (principal components) while preserving most variance.

   - This makes distance calculations in KNN more reliable.

2. Improved Efficiency

   - KNN is computationally expensive since it compares distances to all training points.

   - With PCA, each distance is computed over fewer dimensions, so every comparison is cheaper and KNN predictions are faster.

3. Noise Reduction

   - PCA drops the low-variance components, which often (though not always) correspond to noise.

   - This helps KNN focus on the most informative directions, which can improve accuracy.

4. Better Generalization

   - By reducing overfitting caused by irrelevant features, PCA helps KNN generalize better on unseen data.
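
One way to combine the two steps is a scikit-learn `Pipeline`, sketched below on the Wine dataset. The choices of 2 components, K = 5, and 5-fold cross-validation are illustrative assumptions rather than tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> reduce -> classify. Each step is refit on the training part of every fold.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

scores = cross_val_score(pipeline, X, y, cv=5)
print("Mean 5-fold accuracy:", scores.mean())
```

Wrapping the steps in one estimator means the scaler and PCA are never fitted on data the classifier is later evaluated on.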
   

In [1]:
# Q6. Train a KNN Classifier on the Wine dataset with and without feature
# scaling. Compare model accuracy in both cases.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 1. KNN WITHOUT SCALING
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
acc_no_scaling = knn_no_scaling.score(X_test, y_test)

# 2. KNN WITH SCALING
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_with_scaling = KNeighborsClassifier(n_neighbors=5)
knn_with_scaling.fit(X_train_scaled, y_train)
acc_with_scaling = knn_with_scaling.score(X_test_scaled, y_test)

print("Accuracy without scaling:", acc_no_scaling)
print("Accuracy with scaling   :", acc_with_scaling)


Accuracy without scaling: 0.7407407407407407
Accuracy with scaling   : 0.9629629629629629


In [2]:
# Q7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

wine = load_wine()
X = wine.data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
pca.fit(X_scaled)

print("Explained variance ratio of each principal component:")
for i, var in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {var:.4f}")


Explained variance ratio of each principal component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080


In [3]:
# Q8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
acc_original = knn_original.score(X_test_scaled, y_test)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
acc_pca = knn_pca.score(X_test_pca, y_test)

print("Accuracy on Original Scaled Data:", acc_original)
print("Accuracy on PCA (2 components)  :", acc_pca)


Accuracy on Original Scaled Data: 0.9629629629629629
Accuracy on PCA (2 components)  : 0.9814814814814815


In [4]:
# Q9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Minkowski distance with p=2 is the Euclidean distance.
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_euclidean.fit(X_train_scaled, y_train)
acc_euclidean = knn_euclidean.score(X_test_scaled, y_test)

# Minkowski distance with p=1 is the Manhattan distance.
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)
knn_manhattan.fit(X_train_scaled, y_train)
acc_manhattan = knn_manhattan.score(X_test_scaled, y_test)

print("KNN Accuracy with Euclidean distance:", acc_euclidean)
print("KNN Accuracy with Manhattan distance:", acc_manhattan)


KNN Accuracy with Euclidean distance: 0.9629629629629629
KNN Accuracy with Manhattan distance: 0.9629629629629629


**Q10.** You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models overfit.

Explain how you would:

- Use PCA to reduce dimensionality
- Decide how many components to keep
- Use KNN for classification post-dimensionality reduction
- Evaluate the model
- Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data


 - We are dealing with a gene expression dataset that has thousands of features (genes) but only a small number of patient samples. This imbalance makes traditional models prone to overfitting, since they try to learn from too many features with too little data. To address this, I would build a pipeline using PCA + KNN, explained below:


#### 1. **Using PCA to Reduce Dimensionality**

* Gene expression data is typically very high-dimensional, and many genes may carry redundant or noisy information.
* I would apply Principal Component Analysis (PCA) to transform the data into a smaller number of new features (principal components).
* These components capture the dominant patterns (variance) in the dataset, while the discarded low-variance directions often carry mostly noise.
* This step makes the dataset more compact and less prone to overfitting.


#### 2. **Deciding How Many Components to Keep**

* After applying PCA, I would check the explained variance ratio of each component.
* I’d then plot the cumulative explained variance against the number of components (alongside a scree plot of the per-component variance) to see how much total variance is retained as components are added.
* A common practice is to keep enough components to explain around 90–95% of the variance, balancing information retention and dimensionality reduction.
* Because PCA ignores the class labels, a variance cutoff does not guarantee that every biologically relevant signal is kept, so I would confirm the final number of components with cross-validation.
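
The variance-threshold rule can be sketched as follows. No gene expression data is available here, so the example uses `make_classification` to generate a synthetic stand-in with many more features than samples; the sizes (100 samples, 2000 features) and the 95% threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a gene expression matrix: few samples, many features.
X, y = make_classification(n_samples=100, n_features=2000, n_informative=20,
                           n_classes=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components, then accumulate the explained variance ratios.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%.
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("Components needed for 95% variance:", n_components_95)
```

scikit-learn also accepts a fraction directly, as in `PCA(n_components=0.95)`, which performs this selection internally.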


#### 3. **Using KNN for Classification**

* Once the data is reduced using PCA, I would train a K-Nearest Neighbors (KNN) classifier on the transformed dataset.
* KNN is a simple, non-parametric algorithm that works well when combined with PCA because distances in the reduced space become more meaningful.
* I would tune the hyperparameter K (number of neighbors) using cross-validation to find the best value for classification performance.
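
Tuning K with cross-validation might look like the sketch below. It again relies on synthetic data from `make_classification` as a stand-in for real gene expression measurements, and the candidate values of K, the 95% variance threshold, and the step names (`scale`, `pca`, `knn`) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 120 samples, 2000 features, 3 classes.
X, y = make_classification(n_samples=120, n_features=2000, n_informative=20,
                           n_classes=3, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),   # keep components explaining 95% of variance
    ("knn", KNeighborsClassifier()),
])

# Search over odd values of K; the "knn__" prefix targets the pipeline step.
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best K:", search.best_params_["knn__n_neighbors"])
print("Best cross-validated accuracy:", search.best_score_)
```

Because the search wraps the whole pipeline, scaling and PCA are refit for every fold and every candidate K.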


#### 4. **Evaluating the Model**

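* To evaluate the model, I would use **stratified cross-validation**, which preserves class proportions in every fold (important in biomedical data with imbalanced classes). Scaling and PCA would be fitted inside each training fold, so that no information from held-out patients leaks into the transformation.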
* Metrics I would focus on include:

  * Accuracy (overall performance).
  * Precision, Recall, and F1-score (recall matters most in medical diagnosis, because it measures how many true cases are missed as false negatives).
  * Confusion Matrix (to see misclassification patterns between cancer types).
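
The evaluation described above could be sketched as follows, using out-of-fold predictions so that every sample is scored by a model that never saw it. The data is synthetic (`make_classification`), and K = 5 and the 95% variance threshold are placeholder assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: 120 samples, 2000 features, 3 classes.
X, y = make_classification(n_samples=120, n_features=2000, n_informative=20,
                           n_classes=3, random_state=0)

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    KNeighborsClassifier(n_neighbors=5),
)

# Stratified folds keep class proportions; each prediction comes from a held-out fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=cv)

cm = confusion_matrix(y, y_pred)
report = classification_report(y, y_pred)
print(cm)
print(report)
```

The confusion matrix rows are true classes and the columns are predicted classes, so off-diagonal cells show which cancer types are being confused with each other.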


#### 5. **Justifying the Pipeline to Stakeholders**

* Why PCA? It reduces thousands of noisy gene features into a smaller, informative set, preventing overfitting and speeding up training.
* Why KNN? It is interpretable, requires no assumptions about data distribution, and works well in the PCA-reduced space.
* Why is this robust? This pipeline reduces complexity, improves generalization to unseen patients, and provides reproducible results. In biomedical research, simpler and transparent models are often more trusted than black-box models.
* Real-world relevance: Many genomic studies use PCA for dimensionality reduction, and pairing it with a straightforward classifier like KNN balances accuracy with interpretability, which is critical for medical decision-making.

- **In summary**:
The pipeline **PCA → KNN → Evaluation** provides a practical and reliable way to classify cancer patients using gene expression data, while controlling overfitting and ensuring the results are both accurate and interpretable for real-world biomedical use.
