# KNN & PCA | Assignment
***

Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?
***
Ans:- K-Nearest Neighbors (KNN) is a simple, non-parametric, supervised machine learning algorithm that can be used for both classification and regression tasks.

**How it works:**

The core idea behind KNN is to predict the class (for classification) or the value (for regression) of a new data point from its 'k' nearest neighbors in the training dataset: a majority vote among the neighbors for classification, and an average of their values for regression.

**In Classification:**

1. **Distance Calculation:** For a new data point, calculate the distance between this point and every point in the training dataset. Common distance metrics include Euclidean distance, Manhattan distance, and Minkowski distance.
2. **Neighbor Selection:** Select the 'k' data points from the training dataset that are closest to the new data point based on the calculated distances.
3. **Class Prediction:** Determine the class of the new data point by taking a majority vote among the 'k' selected neighbors. The class that appears most frequently among the neighbors is assigned to the new data point.

**In Regression:**

1. **Distance Calculation:** Similar to classification, calculate the distance between the new data point and every point in the training dataset.
2. **Neighbor Selection:** Select the 'k' data points from the training dataset that are closest to the new data point.
3. **Value Prediction:** Predict the value of the new data point by taking the average (or weighted average) of the values of the 'k' selected neighbors.
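
The steps above can be sketched in plain Python. This is a minimal illustration rather than how scikit-learn implements KNN: the helper names (`euclidean`, `k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and the search is brute force.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(X_train, y_train, query, k):
    """Return the targets of the k training points closest to `query`."""
    order = sorted(range(len(X_train)), key=lambda i: euclidean(X_train[i], query))
    return [y_train[i] for i in order[:k]]


def knn_classify(X_train, y_train, query, k=3):
    """Classification: majority vote among the k nearest labels."""
    return Counter(k_nearest(X_train, y_train, query, k)).most_common(1)[0][0]


def knn_regress(X_train, y_train, query, k=3):
    """Regression: unweighted mean of the k nearest target values."""
    neighbors = k_nearest(X_train, y_train, query, k)
    return sum(neighbors) / len(neighbors)


# Toy data (hypothetical): two well-separated groups in 2-D.
X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [10.0, 12.0, 11.0, 50.0, 52.0, 48.0]

predicted_label = knn_classify(X, labels, (1.1, 1.0), k=3)
predicted_value = knn_regress(X, values, (1.1, 1.0), k=3)
print(predicted_label, predicted_value)
```

The only difference between the two tasks is the last step: the same `k_nearest` lookup feeds either a vote or a mean.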

Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?
***
Ans:- The Curse of Dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (i.e., spaces with a large number of features or attributes). As the number of dimensions increases, the volume of the space grows exponentially, and the available data becomes sparse. This sparsity means that data points in a high-dimensional space tend to be very far apart from each other, making it difficult to find meaningful relationships or patterns.

**How the Curse of Dimensionality affects KNN performance:**

The Curse of Dimensionality significantly impacts KNN, primarily due to the way it relies on distance calculations. Here's why:

1. **Increased Sparsity:** In high-dimensional spaces, data points become extremely sparse. This means that even the nearest neighbors of a data point can be quite far away. When calculating distances, most pairs of points will have distances that are very similar to each other. This makes it difficult to distinguish between true nearest neighbors and points that are only slightly further away, leading to less reliable neighbor identification.

2. **Distance Metrics Become Less Meaningful:** As the number of dimensions increases, pairwise distances concentrate around their mean: the gap between the closest and the furthest point shrinks relative to the average distance. This distance concentration, an instance of "concentration of measure", makes distance-based methods like KNN less effective at identifying true nearest neighbors.

3. **Increased Computational Cost:** With brute-force search, each prediction computes the distance from the new point to every training point, so the cost per query grows in proportion to both the number of training points and the number of dimensions. Tree-based indexes such as KD-trees and ball trees, which speed up neighbor search in low dimensions, also lose most of their advantage as dimensionality grows.

4. **Overfitting:** With high-dimensional data, there is a higher risk of overfitting in KNN. Since the data is sparse, the algorithm might find spurious relationships based on random chance rather than true patterns. This can lead to a model that performs well on the training data but generalizes poorly to new, unseen data.
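
The distance-concentration effect can be observed with a small experiment. The sketch below assumes uniformly random points in the unit hypercube (an illustrative choice, not real data) and reports the relative contrast, (max - min) / min, of the distances from one query point; `relative_contrast` is a helper written for this example.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """(max - min) / min of distances from one random query to random points."""
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
contrasts = {d: relative_contrast(200, d, rng) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:>4}  relative contrast={c:.3f}")
```

As the dimension grows, the contrast should shrink toward zero, meaning the "nearest" neighbor is barely nearer than the furthest point.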

Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?
***
Ans:- Principal Component Analysis (PCA) is a dimensionality reduction technique used to simplify complex datasets while retaining as much of the original variance as possible. It works by transforming the data into a new set of variables called principal components, which are linear combinations of the original features. The component directions are mutually orthogonal, the resulting component scores are uncorrelated, and the components are ordered by the amount of variance they explain in the data, with the first principal component explaining the most variance.

**How PCA is different from feature selection:**

While both PCA and feature selection are dimensionality reduction techniques, they differ in their approach:

* **PCA (Feature Extraction):** PCA creates new, transformed features (principal components) that are linear combinations of the original features. It doesn't discard any original features but rather combines them to create a smaller set of uncorrelated features that capture the most variance. This is a form of **feature extraction**.
* **Feature Selection:** Feature selection methods, on the other hand, select a subset of the original features based on certain criteria (e.g., correlation with the target variable, statistical significance). They discard the less relevant or redundant features while keeping the most important ones. This is a form of **feature elimination**.
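
The contrast can be made concrete with NumPy. In this sketch the data is synthetic, and variance is used as the selection criterion only to keep the example short (the criteria mentioned above, such as correlation with the target, would work the same way). Feature selection returns untouched original columns, while PCA returns new columns that mix all of them.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (an assumption for illustration): 100 samples, 4 features
# with deliberately different spreads.
X = rng.normal(size=(100, 4)) * np.array([5.0, 0.5, 3.0, 0.1])

# Feature selection: keep the 2 original columns with the largest variance.
keep = np.sort(np.argsort(X.var(axis=0))[-2:])
X_selected = X[:, keep]

# Feature extraction (PCA via SVD): project centered data onto the top 2 directions.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                  # each row weights all 4 original features
X_pca = X_centered @ components.T

print("selected columns:", keep)
print("PCA loadings:\n", components.round(3))
```

Both results have two columns, but only `X_selected` can be read directly in terms of the original features.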

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?
***
Ans:- In the context of Principal Component Analysis (PCA), eigenvalues and eigenvectors are crucial for understanding the variance and direction of the data in different dimensions.

**Eigenvectors:**

*   Eigenvectors are special vectors that stay on the same line when a linear transformation (like multiplying by a matrix) is applied to them: they are only scaled by a factor, the eigenvalue. For a matrix `A`, an eigenvector `v` and its eigenvalue `λ` satisfy `A v = λ v`.
*   In PCA, the eigenvectors of the covariance matrix represent the **principal components**. They point in the directions of maximum variance in the data.
*   Each eigenvector corresponds to a principal component, and the principal components are orthogonal (perpendicular) to each other, meaning they are uncorrelated.

**Eigenvalues:**

*   Eigenvalues are the scalar values that correspond to each eigenvector.
*   In PCA, the eigenvalues of the covariance matrix represent the **magnitude of the variance** along the direction of the corresponding eigenvector (principal component).
*   A larger eigenvalue indicates that there is more variance in the data along the direction of its corresponding eigenvector.
*   Eigenvalues are typically sorted in descending order, with the largest eigenvalue corresponding to the first principal component (the direction of most variance), the second largest to the second principal component, and so on.

**Why are they important in PCA?**

Eigenvalues and eigenvectors are important in PCA for several reasons:

1.  **Identifying Principal Components:** Eigenvectors define the directions of the principal components, which are the new axes in the transformed data space. These principal components capture the most important information (variance) in the data.
2.  **Determining the Amount of Variance:** Eigenvalues quantify the amount of variance explained by each principal component. By examining the eigenvalues, we can determine how much information is captured by each component and decide how many principal components to retain for dimensionality reduction.
3.  **Dimensionality Reduction:** By selecting the eigenvectors with the largest eigenvalues, we can keep the principal components that explain the most variance and discard those with smaller eigenvalues. This allows us to reduce the dimensionality of the data while minimizing the loss of important information.
4.  **Data Transformation:** The eigenvectors are used as the basis for transforming the original data into the new principal component space. This transformation projects the data onto the directions of maximum variance.
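
The relationship between the covariance matrix, its eigenpairs, and the projected data can be checked numerically. The following is a sketch on synthetic, correlated 2-D data (an assumption for illustration); it uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic correlated 2-D data (an assumption for illustration).
x = rng.normal(size=500)
X = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=500)])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh returns eigenvalues in ascending order, so re-sort: largest variance first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()
scores = X_centered @ eigenvectors   # data expressed in principal-component space

print("eigenvalues:", eigenvalues.round(4))
print("explained variance ratio:", explained_ratio.round(4))
print("variance of each score column:", scores.var(axis=0, ddof=1).round(4))
```

The variance of each projected column matches the corresponding eigenvalue, which is why eigenvalues can be read as "variance explained" and used to rank components.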

Question 5: How do KNN and PCA complement each other when applied in a single pipeline?
***
Ans:- K-Nearest Neighbors (KNN) and Principal Component Analysis (PCA) can be effectively combined in a single pipeline to address some of the limitations of KNN, particularly when dealing with high-dimensional data. Here's how they complement each other:

1.  **Addressing the Curse of Dimensionality:** As discussed earlier, KNN performance can degrade significantly in high-dimensional spaces due to the curse of dimensionality. PCA can be used as a pre-processing step to reduce the dimensionality of the data before applying KNN. By transforming the data into a lower-dimensional space defined by the principal components, PCA can mitigate the issues of sparsity and distance concentration, which often improves KNN performance.

2.  **Noise Reduction:** PCA can help in noise reduction. The principal components that capture the most variance often represent the underlying structure of the data, while components with smaller variances may correspond to noise. By keeping only the top principal components, we can effectively filter out some of the noise in the data, which can lead to a more robust KNN model.

3.  **Improved Computational Efficiency:** Calculating distances in high-dimensional spaces is computationally expensive for KNN. By reducing the dimensionality of the data using PCA, the distance calculations become faster and more efficient, making the KNN algorithm more scalable for larger datasets.

4.  **Highlighting Important Features:** PCA creates new features (principal components) that are linear combinations of the original features, ordered by the amount of variance they explain. Feeding the first few components to KNN gives it a small set of uncorrelated features that retain most of the variance in the data and are less redundant than the original high-dimensional features. Because PCA is unsupervised, high variance does not guarantee class separation, so the number of components is best chosen by validation.
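
One way to combine the two steps is a scikit-learn pipeline. The sketch below is a minimal example on the Wine dataset; the choices of 2 components, `n_neighbors=5`, and 5-fold cross-validation are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling and PCA are refit inside each training fold, so no information
# from the held-out fold leaks into the transformation.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.4f}")
```

Wrapping the steps in one estimator means the same object can be cross-validated, tuned, and later applied to new samples without repeating the preprocessing by hand.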

Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

Dataset:
Use the Wine Dataset from sklearn.datasets.load_wine().
***

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load the dataset
wine = load_wine()
X = wine.data
y = wine.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate KNN without scaling
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
y_pred_unscaled = knn_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)

# Standardize the features: fit the scaler on the training split only,
# then apply the same transformation to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train and evaluate KNN with scaling
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# Compare results
print(f"Accuracy of the unscaled KNN model: {accuracy_unscaled:.4f}")
print(f"Accuracy of the scaled KNN model: {accuracy_scaled:.4f}")

if accuracy_scaled > accuracy_unscaled:
    print("The scaled KNN model performed better.")
elif accuracy_scaled < accuracy_unscaled:
    print("The unscaled KNN model performed better.")
else:
    print("Both models performed equally well.")
```

Output:

```text
Accuracy of the unscaled KNN model: 0.7222
Accuracy of the scaled KNN model: 0.9444
The scaled KNN model performed better.
```


Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.
***

```python
from sklearn.decomposition import PCA

# Fit PCA with all components on the scaled training data from Question 6
pca = PCA()
X_pca = pca.fit_transform(X_train_scaled)

# Print the explained variance ratio of each component
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained variance ratio of each principal component:")
for i, ratio in enumerate(explained_variance_ratio):
    print(f"Principal Component {i+1}: {ratio:.4f}")

# Print the cumulative explained variance
cumulative_explained_variance = explained_variance_ratio.cumsum()
print("\nCumulative explained variance:")
for i, cumulative_ratio in enumerate(cumulative_explained_variance):
    print(f"Principal Component {i+1}: {cumulative_ratio:.4f}")
```

Output:

```text
Explained variance ratio of each principal component:
Principal Component 1: 0.3590
Principal Component 2: 0.1869
Principal Component 3: 0.1161
Principal Component 4: 0.0737
Principal Component 5: 0.0665
Principal Component 6: 0.0485
Principal Component 7: 0.0420
Principal Component 8: 0.0268
Principal Component 9: 0.0235
Principal Component 10: 0.0189
Principal Component 11: 0.0172
Principal Component 12: 0.0126
Principal Component 13: 0.0083

Cumulative explained variance:
Principal Component 1: 0.3590
Principal Component 2: 0.5459
Principal Component 3: 0.6620
Principal Component 4: 0.7357
Principal Component 5: 0.8022
Principal Component 6: 0.8508
Principal Component 7: 0.8927
Principal Component 8: 0.9196
Principal Component 9: 0.9431
Principal Component 10: 0.9619
Principal Component 11: 0.9791
Principal Component 12: 0.9917
Principal Component 13: 1.0000
```


Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.
***

```python
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Apply PCA, retaining the top 2 components (fit on the training split only)
pca_2_components = PCA(n_components=2)
X_train_pca = pca_2_components.fit_transform(X_train_scaled)
X_test_pca = pca_2_components.transform(X_test_scaled)

# Train and evaluate KNN on the PCA-transformed data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

# Compare with the unscaled and scaled models from Question 6
print(f"Accuracy of the unscaled KNN model: {accuracy_unscaled:.4f}")
print(f"Accuracy of the scaled KNN model: {accuracy_scaled:.4f}")
print(f"Accuracy of the KNN model on PCA-transformed data (2 components): {accuracy_pca:.4f}")

if accuracy_pca > accuracy_scaled and accuracy_pca > accuracy_unscaled:
    print("The KNN model on PCA-transformed data performed best.")
elif accuracy_scaled > accuracy_pca and accuracy_scaled > accuracy_unscaled:
    print("The scaled KNN model performed best.")
elif accuracy_unscaled > accuracy_pca and accuracy_unscaled > accuracy_scaled:
    print("The unscaled KNN model performed best.")
elif accuracy_pca == accuracy_scaled and accuracy_pca > accuracy_unscaled:
    print("The scaled KNN model and KNN on PCA-transformed data performed equally well and better than the unscaled model.")
elif accuracy_pca == accuracy_unscaled and accuracy_pca > accuracy_scaled:
    print("The unscaled KNN model and KNN on PCA-transformed data performed equally well and better than the scaled model.")
elif accuracy_scaled == accuracy_unscaled and accuracy_scaled > accuracy_pca:
    print("The unscaled and scaled KNN models performed equally well and better than the KNN on PCA-transformed data.")
else:
    print("All models performed equally well or there was no clear best model.")
```

Output:

```text
Accuracy of the unscaled KNN model: 0.7222
Accuracy of the scaled KNN model: 0.9444
Accuracy of the KNN model on PCA-transformed data (2 components): 1.0000
The KNN model on PCA-transformed data performed best.
```


Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.
***

```python
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Train and evaluate KNN with Euclidean distance on the scaled data
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# Train and evaluate KNN with Manhattan distance on the scaled data
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Compare results
print("Comparison of KNN models with different distance metrics on scaled data:")
print(f"Euclidean distance accuracy: {accuracy_euclidean:.4f}")
print(f"Manhattan distance accuracy: {accuracy_manhattan:.4f}")

if accuracy_euclidean > accuracy_manhattan:
    print("The scaled KNN model with Euclidean distance performed better.")
elif accuracy_euclidean < accuracy_manhattan:
    print("The scaled KNN model with Manhattan distance performed better.")
else:
    print("Both scaled KNN models with different distance metrics performed equally well.")
```

Output:

```text
Comparison of KNN models with different distance metrics on scaled data:
Euclidean distance accuracy: 0.9444
Manhattan distance accuracy: 0.9444
Both scaled KNN models with different distance metrics performed equally well.
```


Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.

Due to the large number of features and a small number of samples, traditional models overfit.

Explain how you would:
* Use PCA to reduce dimensionality
* Decide how many components to keep
* Use KNN for classification post-dimensionality reduction
* Evaluate the model
* Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data
***

**1. Use PCA to reduce dimensionality:**

*   **Data Preparation:** Start by loading and preprocessing your gene expression data. This will likely involve handling missing values, normalizing the data (e.g., using techniques like log transformation or variance stabilization), and potentially standardizing the features (scaling them to have zero mean and unit variance). Standardization is crucial for PCA as it is sensitive to the scale of the features.
*   **Apply PCA:** Instantiate a PCA object from a library like scikit-learn. You would then fit the PCA model to your *training* data. It's important to fit PCA only on the training data to avoid data leakage from the test set.
*   **Transform the Data:** After fitting PCA, transform both your training and testing datasets using the fitted PCA model. This will project the original high-dimensional data onto the lower-dimensional space defined by the principal components.

**2. Decide how many components to keep:**

This is a critical step in PCA and there are several methods to decide the optimal number of components to retain:

*   **Explained Variance Ratio:** Train a PCA model without specifying the number of components (i.e., `n_components=None` in scikit-learn). This will compute all possible principal components and their explained variance ratio. The explained variance ratio of a principal component indicates the proportion of the total variance in the data that is captured by that component. You can then examine the cumulative explained variance. A common approach is to select the number of components that explain a certain percentage of the total variance (e.g., 95% or 99%).
*   **Scree Plot:** A scree plot is a visualization that plots the eigenvalues (or explained variance) against the number of principal components. You look for an "elbow point" in the plot, which indicates where the rate of decrease in explained variance slows down. Components before the elbow are typically considered more important.
*   **Cross-validation:** You can use cross-validation to evaluate the performance of your downstream model (KNN in this case) with different numbers of principal components. Select the number of components that yields the best performance on the validation set.
*   **Domain Knowledge:** In some cases, domain expertise can guide the selection of the number of components. For example, if you know that certain biological pathways contribute significantly to the cancer types, you might look for principal components that capture the variance related to those pathways.
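
The explained-variance rule can be sketched with the ratios printed in the Question 7 output (Wine dataset, 13 components). `components_for` is a helper written for this example, and the thresholds are illustrative.

```python
from itertools import accumulate

# Explained variance ratios as printed in the Question 7 output.
ratios = [0.3590, 0.1869, 0.1161, 0.0737, 0.0665, 0.0485, 0.0420,
          0.0268, 0.0235, 0.0189, 0.0172, 0.0126, 0.0083]


def components_for(threshold, ratios):
    """Smallest number of leading components whose cumulative ratio >= threshold."""
    for n, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return n
    return len(ratios)


for threshold in (0.80, 0.90, 0.95):
    print(f"{threshold:.0%} of variance -> {components_for(threshold, ratios)} components")
```

In scikit-learn, passing a float between 0 and 1 as `n_components` (for example `PCA(n_components=0.95)`) applies the same kind of variance threshold during fitting.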

**3. Use KNN for classification post-dimensionality reduction:**

*   **Instantiate KNN:** Create a KNeighborsClassifier object from scikit-learn.
*   **Train KNN:** Train the KNN classifier on the PCA-transformed training data (the reduced-dimensional representation).
*   **Choose 'k' and Distance Metric:** Select an appropriate value for 'k' (the number of neighbors) and a distance metric (e.g., Euclidean or Manhattan distance). These choices can be made through experimentation or cross-validation.

**4. Evaluate the model:**

*   **Prediction:** Make predictions on the PCA-transformed testing data using the trained KNN model.
*   **Evaluation Metrics:** Evaluate the performance of the model using appropriate classification metrics. For cancer classification, important metrics might include:
    *   **Accuracy:** Overall proportion of correctly classified samples.
    *   **Precision:** Ability of the model to correctly identify positive cases (e.g., a specific cancer type) among all samples predicted as positive.
    *   **Recall (Sensitivity):** Ability of the model to find all positive cases.
    *   **F1-score:** Harmonic mean of precision and recall, providing a balance between the two.
    *   **AUC-ROC curve:** Measures the ability of the classifier to distinguish between classes.
*   **Cross-validation:** Use cross-validation during the training phase to get a more reliable estimate of the model's performance and to tune hyperparameters like 'k' and the number of principal components.
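
Steps 1 to 4 can be tied together as one cross-validated pipeline. The sketch below is only an outline under stated assumptions: `make_classification` stands in for real gene expression data, the parameter grid and fold counts are arbitrary, and macro-averaged F1 is one reasonable scoring choice among several. It uses nested cross-validation so that tuning and performance estimation never share held-out samples.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data (an assumption, not real data):
# few samples, many features, only a handful of them informative.
X, y = make_classification(
    n_samples=120, n_features=2000, n_informative=20, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(random_state=0)),
    ("knn", KNeighborsClassifier()),
])

# Inner loop: tune the number of components, k, and the distance metric.
param_grid = {
    "pca__n_components": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
    "knn__metric": ["euclidean", "manhattan"],
}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="f1_macro")

# Outer loop: estimate how the whole tuned procedure generalizes.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="f1_macro")
print(f"nested CV macro-F1: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

Because the scaler and PCA live inside the pipeline, they are refit on each training fold, which matters most when samples are scarce and a single leaked statistic can inflate the score.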

**5. Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data:**

When presenting this approach to stakeholders, emphasize the following points:

*   **Addressing High Dimensionality:** Explain that gene expression data is inherently high-dimensional, leading to challenges like overfitting in traditional models, especially with limited sample sizes. PCA is a widely accepted technique to handle this by reducing the number of features while preserving the most important information.
*   **Reducing Noise:** Highlight that PCA can help in filtering out noise in the data, which is common in biological measurements. By focusing on the principal components that capture the most variance, you are essentially retaining the underlying biological signals while discarding less informative noise.
*   **Improved Model Robustness:** Explain that by reducing dimensionality and potentially noise, the KNN model trained on PCA-transformed data is less prone to overfitting the training data and is likely to generalize better to new, unseen patient samples. This leads to a more robust and reliable model for real-world application.
*   **Computational Efficiency:** Mention that reducing the dimensionality makes the training and prediction processes of KNN faster and more computationally efficient, which can be important for large datasets.
*   **Interpretability (with caveats):** While the principal components themselves may not have direct biological interpretations in terms of individual genes, you can explain that they represent the main axes of variation in gene expression patterns across the different cancer types. Further analysis can sometimes link principal components back to biological pathways or processes.
*   **Validation:** Emphasize that the model's performance has been rigorously evaluated using appropriate metrics and potentially cross-validation, providing confidence in its ability to accurately classify cancer types.