# **KNN & PCA | Assignment**

# **Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?**

# **Answer:**

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for classification and regression tasks. It is a non-parametric and lazy learning algorithm, which means it does not build an explicit model during training. Instead, it stores all the training data and makes predictions when a new data point is given.

**How KNN Works:**
-


1. Select the value of K (number of nearest neighbors).

2. Calculate the distance between the test data point and all training data points using a distance metric such as Euclidean distance.

3. Identify the K closest data points to the test point.

4. Use these neighbors to make the final prediction.

**KNN in Classification:**
-

In classification problems, KNN assigns the class that is most frequent among the K nearest neighbors. This process is known as majority voting.
For example, if 4 out of 5 nearest neighbors belong to Class A, the new data point is classified as Class A.

**KNN in Regression:**
-

In regression problems, KNN predicts a continuous value by calculating the average (or weighted average) of the target values of the K nearest neighbors.
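
The steps above can be sketched as a small brute-force implementation. This is only an illustration under simple assumptions (Euclidean distance, unweighted neighbors); the function name `knn_predict` and the toy data are made up for this example and are not part of any library.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point with brute-force KNN (illustrative only)."""
    # Step 2: Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Step 3: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    # Step 4: majority vote for classification, mean for regression
    if task == "classification":
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))


# Toy data (made up for illustration): two clusters in 2-D
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]])
y_class = np.array([0, 0, 0, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0])

query = np.array([1.1, 1.0])
predicted_class = knn_predict(X_toy, y_class, query, k=3)
predicted_value = knn_predict(X_toy, y_value, query, k=3, task="regression")
print(predicted_class, predicted_value)
```

The same neighbor search serves both tasks; only the final aggregation step differs.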

**Advantages of KNN:**
-

- Simple and easy to understand

- No explicit training phase (fitting only stores the data)

- Works well with small datasets

**Limitations of KNN:**
-

- Computationally expensive for large datasets

- Sensitive to feature scaling

- Performance degrades in high-dimensional data

In summary, KNN is a powerful and intuitive algorithm that makes predictions based on similarity between data points and is widely used for both classification and regression tasks.

# **Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?**

# **Answer:**

The Curse of Dimensionality refers to the problems that occur when the number of features (dimensions) in a dataset becomes very large. As dimensions increase, the data points become more sparse, and it becomes difficult for machine learning algorithms to find meaningful patterns.

**Effect on KNN Performance:**
-

KNN relies on distance calculations to find nearest neighbors. In high-dimensional spaces:

- Distances between pairs of points become nearly equal, so the nearest and farthest points are hard to tell apart.

- The concept of “nearest” neighbor loses its meaning.

- KNN may select incorrect neighbors.

- Model accuracy decreases.

- Computation time increases significantly.

**Why This Happens:**
-

- With more dimensions, the volume of the data space increases exponentially.

- Available data becomes insufficient to cover the space properly.

- Noise and irrelevant features dominate distance calculations.
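
The loss of contrast between near and far points can be observed with a short simulation. This is a rough sketch on uniformly random data (an assumption chosen for simplicity, not a real dataset); `relative_contrast` is a helper defined here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_points, n_dims):
    """(farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrast_by_dim = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, contrast in contrast_by_dim.items():
    print(f"{d:>5} dims: relative contrast = {contrast:.2f}")
```

As the number of dimensions grows, the relative contrast shrinks toward zero, which is why the "nearest" neighbor carries less and less information.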

**Impact on KNN:**
-

- Poor classification or regression results

- Increased overfitting

- Slower prediction time

**How to Handle the Curse of Dimensionality:**
-

- Apply feature scaling so that no single feature dominates the distance

- Use dimensionality reduction techniques like PCA

- Remove irrelevant or redundant features

In conclusion, the Curse of Dimensionality negatively affects KNN by making distance-based learning unreliable, especially in high-dimensional datasets.

# **Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?**

# **Answer:**

Principal Component Analysis (PCA) is an unsupervised machine learning technique used for dimensionality reduction. It transforms the original features into a new set of features called principal components. These components are uncorrelated and arranged in such a way that the first few components capture the maximum variance present in the data.

**PCA helps in:**
-

- Reducing the number of features

- Removing noise and redundancy

- Improving model performance and computational efficiency

**How PCA Works (Brief):**
-

- Data is standardized

- Covariance matrix is computed

- Eigenvalues and eigenvectors are calculated

- Eigenvectors are sorted by decreasing eigenvalue, and the data is projected onto the top few; these are the principal components
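
The four steps can be traced in plain NumPy. The sketch below uses synthetic data (an assumption for illustration) in which one feature is deliberately correlated with another, so the first component should absorb most of the variance.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# where the second feature is strongly correlated with the first
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.3, size=200)
x3 = rng.normal(size=200)
X_demo = np.column_stack([x1, x2, x3])

# 1. Standardize each feature to zero mean and unit variance
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# 2. Covariance matrix of the standardized features
cov_matrix = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition (eigh is meant for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# 4. Sort by decreasing eigenvalue and keep the top 2 directions
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
X_projected = X_std @ eigenvectors[:, :2]

explained_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", explained_ratio.round(3))
print("Projected shape:", X_projected.shape)
```

In practice `sklearn.decomposition.PCA` performs the equivalent computation through a singular value decomposition, which is numerically more stable than forming the covariance matrix explicitly.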

**Difference Between PCA and Feature Selection:**
-

| PCA                                  | Feature Selection                        |
| ------------------------------------ | ---------------------------------------- |
| Creates new transformed features     | Selects a subset of original features    |
| Unsupervised method                  | Can be supervised or unsupervised        |
| Features lose original meaning       | Original meaning of features is retained |
| Reduces correlation between features | Does not remove correlation              |

**Conclusion:**
-

PCA reduces dimensionality by transforming data, while feature selection reduces dimensionality by choosing important features. Both methods aim to improve model performance but work in different ways.
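
The contrast can be made concrete on the Wine dataset. In this sketch, `SelectKBest` with the ANOVA F-score (`f_classif`) is just one possible feature selection method, chosen here as an assumption; keeping three features or components is likewise arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# PCA: builds 3 new features, each a linear combination of all 13 originals
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Feature selection: keeps 3 of the original columns, chosen with the labels
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_scaled, wine.target)
kept_names = [wine.feature_names[i] for i in selector.get_support(indices=True)]

print("PCA output shape:", X_pca.shape)
print("Weights per PCA component:", pca.components_.shape)
print("Selected original features:", kept_names)
```

Both outputs have three columns, but only the selected columns can still be read as named measurements; each PCA column mixes all thirteen inputs.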

# **Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?**

# **Answer:**

In Principal Component Analysis (PCA), eigenvalues and eigenvectors are mathematical concepts used to identify the most important patterns in the data.

**Eigenvectors:**
-

- Eigenvectors of the covariance matrix give the directions along which the data varies; the eigenvector with the largest eigenvalue is the direction of greatest variance.

- In PCA, each eigenvector becomes a principal component.

- They define the new axes onto which the original data is projected.

**Eigenvalues:**
-

- Eigenvalues represent the amount of variance captured by their corresponding eigenvectors.

- A higher eigenvalue means the principal component contains more information from the data.

**Why They Are Important in PCA:**
-

- Eigenvectors decide the direction of principal components.

- Eigenvalues help determine the importance of each principal component.

- PCA sorts components in decreasing order of eigenvalues.

- Components with higher eigenvalues are selected to reduce dimensionality while preserving maximum information.
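
One way to check these statements is to compare a manual eigen-decomposition with what scikit-learn reports. The sketch below assumes the standardized Wine data; the tolerance in the last comparison is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Eigen-decomposition of the covariance matrix, sorted by decreasing eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_scaled, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

pca = PCA().fit(X_scaled)

# Each eigenvalue is the variance along its eigenvector
variances_match = np.allclose(eigenvalues, pca.explained_variance_)
# Eigenvalue share of the total equals the explained variance ratio
ratios_match = np.allclose(
    eigenvalues / eigenvalues.sum(), pca.explained_variance_ratio_
)
# Eigenvectors match sklearn's components up to a sign flip
directions_match = np.allclose(
    np.abs(eigenvectors.T), np.abs(pca.components_), atol=1e-6
)
print(variances_match, ratios_match, directions_match)
```

The sign of an eigenvector is arbitrary, which is why the directions are compared by absolute value.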

**Conclusion:**
-

Eigenvalues and eigenvectors are the backbone of PCA. They help identify the most informative components, enabling effective dimensionality reduction without losing significant data information.

# **Question 5: How do KNN and PCA complement each other when applied in a single pipeline?**

# **Answer:**

KNN and PCA complement each other very effectively when used together in a single machine learning pipeline, especially for high-dimensional datasets.

**Role of PCA:**
-

- PCA reduces the number of features by transforming them into fewer principal components.

- It removes noise, redundancy, and correlation among features.

- PCA helps in handling the curse of dimensionality.

- It makes the dataset more compact and meaningful.

**Role of KNN:**
-

- KNN is a distance-based algorithm.

- Its performance highly depends on meaningful distance calculations.

- Works better with fewer, well-structured features.

**Why PCA + KNN Works Well Together:**
-

- PCA reduces dimensionality → improves distance measurement.

- KNN becomes faster due to fewer features.

- Can reduce overfitting and improve generalization by discarding low-variance, noisy directions.

- Often improves overall model accuracy and stability, although the gain depends on the dataset.

**Pipeline Flow:**
-

- Feature scaling (important for KNN)

- Apply PCA for dimensionality reduction

- Train KNN on reduced data

- Evaluate model performance
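
This flow maps directly onto a scikit-learn `Pipeline`. The following is a minimal sketch on the Wine dataset; `n_components=5` and `n_neighbors=5` are illustrative choices, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# n_components=5 and n_neighbors=5 are illustrative choices, not tuned values
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# fit() learns the scaler and PCA on the training split only;
# score() reuses those fitted transforms on the test split
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print("Test accuracy:", round(test_accuracy, 3))
```

Wrapping the steps in one object keeps the scaler and PCA from ever seeing test data, which is easy to get wrong when the steps are applied by hand.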

**Conclusion:**
-

PCA prepares the data by reducing complexity, and KNN efficiently performs classification or regression on the transformed data. Together, they form a robust and efficient pipeline for real-world machine learning problems.

# **Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.**

# **Answer:**

In this experiment, we analyze the impact of feature scaling on the performance of the K-Nearest Neighbors (KNN) algorithm using the Wine dataset provided by sklearn.datasets.load_wine().

Since KNN is a distance-based algorithm, the scale of features plays a crucial role in determining the nearest neighbors. Features with larger numerical ranges can dominate the distance calculation and negatively affect model performance. Therefore, this comparison helps us understand why feature scaling is essential for KNN.

The dataset is first split into training and testing sets. Then, the KNN classifier is trained in two different scenarios:

1. Without feature scaling

2. With feature scaling using StandardScaler

Finally, the classification accuracy of both models is compared.

**Python Code:**
-

In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Wine dataset
X, y = load_wine(return_X_y=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -----------------------------
# KNN without feature scaling
# -----------------------------
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

accuracy_without_scaling = accuracy_score(y_test, y_pred)
print("Accuracy without feature scaling:", accuracy_without_scaling)

# -----------------------------
# KNN with feature scaling
# -----------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn.fit(X_train_scaled, y_train)
y_pred_scaled = knn.predict(X_test_scaled)

accuracy_with_scaling = accuracy_score(y_test, y_pred_scaled)
print("Accuracy with feature scaling:", accuracy_with_scaling)


Accuracy without feature scaling: 0.7222222222222222
Accuracy with feature scaling: 0.9444444444444444


**Comparison and Explanation:**
-

- Without feature scaling, features with larger numeric ranges dominate the distance calculation, leading to lower accuracy.

- After applying StandardScaler, all features are on the same scale, making distance computation meaningful.

- As a result, the KNN classifier achieves significantly higher accuracy after scaling.
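
The size of the scale mismatch can be inspected directly. This short sketch compares per-feature standard deviations of the Wine data before and after `StandardScaler`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = wine.data

raw_std = X.std(axis=0)
scaled_std = StandardScaler().fit_transform(X).std(axis=0)

# Feature with the largest spread before scaling
widest = wine.feature_names[int(np.argmax(raw_std))]
print("Largest raw standard deviation:", widest, round(float(raw_std.max()), 1))
print("Smallest raw standard deviation:", round(float(raw_std.min()), 3))
print("Standard deviations after scaling:", np.unique(scaled_std.round(6)))
```

Before scaling, the widest feature varies by orders of magnitude more than the narrowest, so it effectively decides which neighbors are "closest"; after scaling every feature contributes on the same footing.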

**Conclusion:**
-

Feature scaling has a major impact on KNN performance. On the Wine dataset, scaling raises test accuracy from about 0.72 to about 0.94, demonstrating that feature normalization is essential when using distance-based algorithms like KNN.

# **Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.**

# **Answer:**

In this question, we apply Principal Component Analysis (PCA) to the Wine dataset in order to understand how much variance is captured by each principal component. PCA is used to reduce dimensionality by transforming the original features into a smaller set of uncorrelated components while preserving as much information as possible.

Before applying PCA, it is important to scale the data, because PCA is sensitive to the scale of features. After scaling, PCA is trained on the dataset, and the explained variance ratio of each principal component is printed. The explained variance ratio tells us how much information (variance) each component retains from the original data.

In [5]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Wine dataset
X, y = load_wine(return_X_y=True)

# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Print explained variance ratio
print("Explained Variance Ratio of each Principal Component:")
print(pca.explained_variance_ratio_)


Explained Variance Ratio of each Principal Component:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]


**Explanation:**
-

- The first principal component captures the highest variance in the dataset.

- Each subsequent component captures less variance than the previous one.

- The first two components together explain about 55% of the total variance, and the first five about 80%.

- This information helps in deciding how many components should be retained for dimensionality reduction.
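
A cumulative sum makes that decision easier to read off. The sketch below repeats the PCA fit and reports how many components are needed to reach a 95% threshold; the threshold itself is a common convention rather than a fixed rule.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
pca = PCA().fit(StandardScaler().fit_transform(X))

cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative ratio reaches 95%
n_for_95 = int(np.argmax(cumulative >= 0.95)) + 1

print("Cumulative explained variance:", cumulative.round(3))
print("Components needed for 95% variance:", n_for_95)
```

With the ratios printed above, nine components reach roughly 94% and the tenth pushes the total past 95%.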

**Conclusion:**
-

The explained variance ratio shows that most of the important information in the Wine dataset is captured by the first few principal components. Therefore, PCA can effectively reduce the number of features while retaining most of the original data information, making it useful for improving model efficiency and performance.

# **Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.**

# **Answer:**

In this question, we combine PCA and KNN to observe how dimensionality reduction affects classification performance. After applying PCA, the original 13-feature Wine dataset is reduced to only the top two principal components, which together capture roughly 55% of the variance in the scaled data (see the explained variance ratios in Question 7).

The idea behind this experiment is to check whether KNN can still perform well when trained on a lower-dimensional representation of the dataset. Reducing dimensions helps in faster computation and better visualization, but it may also cause some loss of information. Therefore, the accuracy obtained from the PCA-transformed dataset is compared with the accuracy from the original (scaled) dataset.

In [11]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Wine dataset
X, y = load_wine(return_X_y=True)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply PCA (2 components)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Train KNN on PCA data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca, y_train)

# Prediction and accuracy
y_pred_pca = knn.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

print("Accuracy using PCA (2 components):", accuracy_pca)


Accuracy using PCA (2 components): 1.0


**Comparison with Original Dataset:**
-

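- Accuracy with original scaled dataset (all 13 features, from Question 6): ~0.94

- Accuracy with PCA (2 components): 1.0

On this 36-sample test split, the two-component model does not lose accuracy; it scores slightly higher than the full scaled model. The gap corresponds to only two test samples, so it should not be over-interpreted, but it shows that the model performs well using only two features instead of all 13 original features.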

**Discussion:**
-

Using PCA significantly reduces the dimensionality of the dataset, which makes the KNN model faster and more efficient. Dimensionality reduction discards some information, so accuracy can drop in general. On this split it did not, plausibly because the discarded components contribute little to separating the three wine classes.
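
Because a single 36-sample test split is noisy, a cross-validated comparison gives a steadier picture. The following sketch assumes 5-fold stratified cross-validation and the same K = 5; both are illustrative choices.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "scaled, all 13 features": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "scaled + PCA (2 components)": make_pipeline(
        StandardScaler(), PCA(n_components=2), KNeighborsClassifier(n_neighbors=5)
    ),
}

mean_scores = {}
for name, model in models.items():
    # The pipeline refits the scaler (and PCA) on each training fold
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the two means fall within each other's spread, the difference seen on one split is probably not meaningful.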

**Conclusion:**
-

This experiment shows that PCA can effectively reduce data dimensionality while maintaining good classification performance. Training KNN on PCA-transformed data provides a good trade-off between accuracy and computational efficiency, especially for high-dimensional datasets.

# **Question 9: Train a KNN Classifier with different distance metrics (Euclidean, Manhattan) on the scaled Wine dataset and compare the results.**

# **Answer:**

In this question, we analyze the effect of using different distance metrics in the K-Nearest Neighbors (KNN) algorithm. Since KNN is a distance-based classifier, the choice of distance metric plays an important role in determining how the similarity between data points is measured.

To ensure fair distance comparison, the Wine dataset is first scaled using StandardScaler. Then, two KNN models are trained:

1. One using Euclidean distance

2. Another using Manhattan distance

Finally, the classification accuracy of both models is compared to understand which distance metric performs better on the Wine dataset.

In [10]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Wine dataset
X, y = load_wine(return_X_y=True)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# -----------------------------
# KNN with Euclidean distance
# -----------------------------
knn_euclidean = KNeighborsClassifier(
    n_neighbors=5,
    metric='euclidean'
)
knn_euclidean.fit(X_train_scaled, y_train)

y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

print("Accuracy using Euclidean distance:", accuracy_euclidean)

# -----------------------------
# KNN with Manhattan distance
# -----------------------------
knn_manhattan = KNeighborsClassifier(
    n_neighbors=5,
    metric='manhattan'
)
knn_manhattan.fit(X_train_scaled, y_train)

y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

print("Accuracy using Manhattan distance:", accuracy_manhattan)


Accuracy using Euclidean distance: 0.9444444444444444
Accuracy using Manhattan distance: 0.9444444444444444


**Discussion:**
-

- The Euclidean distance metric measures straight-line distance; because it squares the differences, a large gap in any single feature has a strong effect.

- The Manhattan distance sums the absolute differences along each axis, so a single extreme feature value influences it less.

- On the Wine dataset, both metrics reach the same test accuracy (about 0.944) on this split.
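
The two metrics can be computed by hand for a pair of points. The points below are made up for illustration.

```python
import numpy as np

# Two made-up points in 2-D
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

euclidean = float(np.sqrt(np.sum((a - b) ** 2)))  # straight-line distance
manhattan = float(np.sum(np.abs(a - b)))          # sum of per-axis differences

print("Euclidean:", euclidean)  # 5.0
print("Manhattan:", manhattan)  # 7.0
```

In `KNeighborsClassifier`, `metric='euclidean'` and `metric='manhattan'` correspond to the Minkowski distance with `p=2` and `p=1` respectively.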

**Conclusion:**
-

The choice of distance metric can affect the performance of a KNN classifier in general, but for the scaled Wine dataset with K = 5 both Euclidean and Manhattan distance give the same test accuracy (about 0.944). Either metric is therefore suitable for this dataset; Euclidean distance, the scikit-learn default, is a reasonable choice.

# **Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.**

Due to the large number of features and a small number of samples, traditional models overfit.

Explain how you would:

- Use PCA to reduce dimensionality

- Decide how many components to keep

- Use KNN for classification post-dimensionality reduction

- Evaluate the model

- Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

# **Answer:**

In gene expression datasets, the number of features (genes) is usually very large, while the number of patient samples is very small. This imbalance often leads to overfitting, where traditional machine learning models perform well on training data but poorly on unseen data. To handle this problem effectively, a combination of PCA and KNN can be used as a robust and practical solution.

# **1. Using PCA to Reduce Dimensionality**


The first step is to apply Principal Component Analysis (PCA) after scaling the data. PCA transforms the original gene features into a smaller set of principal components that capture the maximum variance in the data.

Benefits of using PCA:

- Reduces thousands of gene features to a manageable number

- Removes noise and redundant information

- Helps in controlling overfitting

- Makes the dataset more suitable for distance-based algorithms like KNN

# **2. Deciding How Many Components to Keep**

The number of principal components is decided based on:

- Explained variance ratio

- Retaining components that explain 90–95% of total variance

- Scree plot analysis (elbow method)

This ensures that most of the biological information is preserved while significantly reducing dimensionality.

# **3. Using KNN for Classification After PCA**

Once dimensionality is reduced:

- The transformed dataset is used to train a KNN classifier

- An appropriate value of K is selected using cross-validation

- Distance metrics such as Euclidean distance are applied

Using KNN after PCA improves performance because:

- Distance calculations become more meaningful

- The model becomes faster and more stable

- Overfitting risk is reduced
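
Selecting K by cross-validation could look like the sketch below. No gene expression data is available here, so `make_classification` generates a synthetic stand-in with few samples and many features; the dataset sizes, the candidate values of K, and the variable names `X_genes` and `y_cancer` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features
X_genes, y_cancer = make_classification(
    n_samples=100, n_features=2000, n_informative=20,
    n_classes=3, random_state=0,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier()),
])

# Candidate values are illustrative, not recommendations
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(
    pipeline, param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_genes, y_cancer)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

Putting the scaler and PCA inside the searched pipeline means they are refit on each training fold, so the reported score is not inflated by information from the held-out fold.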

# **4. Model Evaluation**

The performance of the model is evaluated using:

- Accuracy

- Precision, Recall, and F1-score

- Confusion Matrix

- Cross-validation (with scaling and PCA refit inside each fold) to estimate generalization

These evaluation metrics help assess how well the model distinguishes between different cancer types.
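
With very few samples, a single held-out split leaves only a handful of test patients, so out-of-fold predictions are one way to use every sample for evaluation. The sketch below again relies on a synthetic stand-in dataset; its sizes and the fixed K = 5 are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small, wide gene expression dataset
X_genes, y_cancer = make_classification(
    n_samples=100, n_features=2000, n_informative=20,
    n_classes=3, random_state=0,
)

model = make_pipeline(
    StandardScaler(), PCA(n_components=0.95), KNeighborsClassifier(n_neighbors=5)
)

# Every sample is predicted by a model that never saw it during fitting
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X_genes, y_cancer, cv=cv)

cm = confusion_matrix(y_cancer, y_pred)
print("Confusion matrix (rows = true class, columns = predicted):")
print(cm)
print(classification_report(y_cancer, y_pred))
```

The confusion matrix shows which cancer types are mistaken for one another, which is usually more informative for clinicians than a single accuracy figure.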

# **5. Justification to Stakeholders**

This PCA + KNN pipeline can be justified to stakeholders as follows:

- PCA reduces data complexity and noise

- Prevents overfitting in small-sample biomedical datasets

- KNN provides interpretable, similarity-based classification

- The approach is computationally efficient

- PCA and KNN are both well-established, well-understood techniques

- Cross-validated results give a realistic estimate of how the model will perform on unseen patient data

Thus, this pipeline offers a balanced, reliable, and scientifically sound solution for cancer classification using gene expression data.


In [9]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# The Wine dataset stands in for gene expression data
# to demonstrate the pipeline end to end
X, y = load_wine(return_X_y=True)

# -----------------------------
# Step 1: Split the dataset
# -----------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -----------------------------
# Step 2: Feature Scaling
# -----------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# -----------------------------
# Step 3: Apply PCA
# Retain 95% variance
# -----------------------------
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print("Number of components selected by PCA:", pca.n_components_)

# -----------------------------
# Step 4: Train KNN Classifier
# -----------------------------
knn = KNeighborsClassifier(
    n_neighbors=5,
    metric='euclidean'
)

knn.fit(X_train_pca, y_train)

# -----------------------------
# Step 5: Model Prediction
# -----------------------------
y_pred = knn.predict(X_test_pca)

# -----------------------------
# Step 6: Model Evaluation
# -----------------------------
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Number of components selected by PCA: 10
Model Accuracy: 0.9444444444444444

Classification Report:
              precision    recall  f1-score   support

           0       0.93      1.00      0.97        14
           1       1.00      0.86      0.92        14
           2       0.89      1.00      0.94         8

    accuracy                           0.94        36
   macro avg       0.94      0.95      0.94        36
weighted avg       0.95      0.94      0.94        36



# **Conclusion:**

By combining PCA and KNN, we can effectively handle high-dimensional gene expression data, reduce overfitting, and achieve reliable classification results. This pipeline is well-suited for real-world biomedical applications where accuracy, interpretability, and robustness are essential.