Que 1. What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?
- K-Nearest Neighbors (KNN) is a simple supervised learning algorithm that predicts a new data point from its 'K' closest training points, where "closeness" is defined by a distance metric such as Euclidean distance. It is a "lazy", non-parametric learner: it builds no explicit model and simply stores the training data, deferring all computation to prediction time.

- In classification, it assigns the majority class among the K neighbors. In regression, it predicts the mean of the neighbors' target values. In both cases, closer neighbors can optionally be given more weight.
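
The two prediction rules above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements KNN; the function name `knn_predict` and the toy data are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point with a brute-force k-nearest-neighbour search."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the k neighbours
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: unweighted mean of the neighbours' target values
    return float(np.mean(neighbor_targets))

# Toy data (made up for illustration): two well-separated groups in 2-D
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])

query = np.array([1.1, 1.0])
predicted_class = knn_predict(X_train, y_class, query, k=3)
predicted_value = knn_predict(X_train, y_value, query, k=3, task="regression")
print(predicted_class, predicted_value)  # 0 11.0
```

The same neighbour search serves both tasks; only the final aggregation step (vote versus mean) differs.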


Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?
- As the number of dimensions (features) increases, the volume of the feature space grows exponentially, so a fixed number of data points becomes very sparse. In high dimensions, the gap between the nearest and farthest points from a query shrinks relative to the distances themselves: all points look almost equidistant, so distance metrics discriminate poorly and truly representative neighbors are hard to find.

- **How it Affects KNN Performance :**
    - **Loss of Locality:** The concept of "nearest" becomes unreliable because points that seem close in many dimensions might actually be far apart in a meaningful sense.
    
    - **Increased Data Requirement:** To effectively sample the sparse high-dimensional space and find reliable neighbors, KNN needs exponentially more data, often becoming computationally infeasible.
    - **Overfitting & Noise:** With many irrelevant features, it's harder to distinguish signal from noise, leading KNN to focus on irrelevant dimensions, causing poor generalization and overfitting.
    - **Computational Burden:** Calculating distances from each query to all training points becomes more expensive as dimensions and data size grow, increasing processing time.
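
The "almost equidistant" effect can be observed with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube (an arbitrary choice for illustration) and measures the relative contrast, (farthest - nearest) / nearest, for a single random query.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def relative_contrast(n_points, n_dims):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))   # uniform samples in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

dims = [2, 10, 100, 1000]
contrasts = {d: relative_contrast(1000, d) for d in dims}
for d, c in contrasts.items():
    print(f"dims={d:5d}  relative contrast={c:.3f}")
```

The contrast should shrink as the dimension grows, which is the sense in which "nearest" loses meaning for KNN.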

Que 3. What is Principal Component Analysis (PCA)? How is it different from feature selection?
- Principal Component Analysis (PCA) is a feature extraction method. It transforms correlated original features into a smaller set of uncorrelated "artificial" features called Principal Components (PCs), each a linear combination of the originals, ordered so that the first components capture the most variance. Feature Selection, by contrast, keeps a subset of the original features, typically chosen by their predictive power for a target, without creating new ones. In short, PCA reduces dimensions by creating new, combined features (less interpretable) and does not use the target, whereas Feature Selection keeps the best existing features (more interpretable).
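
A side-by-side sketch on the Wine dataset may make the contrast concrete. `SelectKBest` with `f_classif` stands in for feature selection here; it is one of several possible selectors, and keeping 3 features or components is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target

# Feature selection: keep 3 of the original 13 columns, scored against the target
selector = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, y)
selected_names = [data.feature_names[i] for i in selector.get_support(indices=True)]
X_selected = selector.transform(X_scaled)

# Feature extraction: build 3 new columns, each a weighted mix of all 13 originals
pca = PCA(n_components=3).fit(X_scaled)
X_pca = pca.transform(X_scaled)

print("Selected original features:", selected_names)
print("PCA component weights shape:", pca.components_.shape)  # (3, 13)
```

The selected columns are unchanged copies of original features, while every PCA column depends on all 13 inputs through the rows of `pca.components_`.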

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?
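- **Eigenvectors:** These are the eigenvectors of the data's covariance matrix. Each one defines a new axis (a principal component direction): the first points along the direction of greatest variance, and each subsequent one captures the most remaining variance while staying orthogonal to the earlier ones.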

- **Eigenvalues:** These are scalar values associated with each eigenvector, representing the magnitude of variance along that eigenvector's direction. A larger eigenvalue means more variance, thus more information, is captured.
- **Why They Are Important in PCA:**

    - **Dimensionality Reduction:** By sorting eigenvectors by their eigenvalues, we find the most important directions (principal components) that capture the bulk of the data's variability. We can then discard components with small eigenvalues, reducing dimensions while minimizing information loss.
    
    - **Data Transformation:** They transform the data into a new, lower-dimensional space where features are uncorrelated (orthogonal) and ordered by importance.
    - **Identifying Key Patterns:** They reveal the dominant directions of variation in the data. Treating high-variance directions as signal and low-variance ones as noise is a working assumption that usually, but not always, holds.
    - **Data Compression & Visualization:** They enable efficient data compression and make high-dimensional data easier to visualize and process for machine learning models.
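
The roles of eigenvalues and eigenvectors can be checked directly with NumPy. The sketch below uses synthetic two-feature data (generated only for illustration) and the covariance-matrix route; library implementations of PCA typically use an SVD instead, which gives equivalent components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data (illustrative only): the second feature mostly follows the first
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=500)
X = np.column_stack([x1, x2])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)          # 2 x 2 covariance matrix

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]           # reorder to descending variance
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]           # columns are the principal axes

explained_variance_ratio = eigenvalues / eigenvalues.sum()
X_projected = X_centered @ eigenvectors         # coordinates along the principal axes

print("Eigenvalues:", eigenvalues)
print("Explained variance ratio:", explained_variance_ratio)
print("Covariance after projection:\n", np.cov(X_projected, rowvar=False).round(6))
```

After projection the off-diagonal covariance is zero up to rounding, and the variance along each new axis equals the corresponding eigenvalue.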

Question 5: How do KNN and PCA complement each other when applied in a single pipeline?
- When applied in a single pipeline, PCA and KNN complement each other: PCA (Principal Component Analysis) reduces the dimensionality of the data, which can improve the speed, and often the accuracy, of the distance-based KNN (K-Nearest Neighbors) algorithm.

- **Here is how they complement each other:**

    - **Mitigating the "Curse of Dimensionality":** KNN relies on distance metrics (Euclidean, etc.) to find the closest data points. In high-dimensional spaces, distances become less meaningful. PCA reduces the feature space while retaining maximum variance, which helps KNN find more relevant neighbors.
    
    - **Improving Computational Efficiency:** KNN is a lazy learner: training only stores the data, but each prediction computes the distance between the test point and (in a brute-force search) all training points. By reducing the number of dimensions, PCA makes each distance calculation cheaper and so speeds up inference.
    - **Noise Reduction and Improved Accuracy:** Original data often contains noisy or redundant features that can mislead KNN. Dropping low-variance components can remove some of this noise and often improves classification accuracy, although the size of the gain depends on the dataset, and PCA can also discard low-variance directions that happen to be discriminative.
    - **Eliminating Multicollinearity:** KNN does not assume that features are independent, but strongly correlated features effectively count the same information several times in the distance. PCA transforms correlated features into a smaller set of uncorrelated components, allowing for more stable, reliable distance calculations.
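
One way to combine the two steps is scikit-learn's `make_pipeline`, which refits the scaler and PCA on the training part of each cross-validation fold. The sketch below is illustrative only: keeping 2 components and using 5 folds are assumptions, not tuned choices.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling and PCA are refit on the training part of every fold,
# so nothing from the held-out fold leaks into the transform.
knn_only = make_pipeline(StandardScaler(), KNeighborsClassifier())
pca_knn = make_pipeline(StandardScaler(), PCA(n_components=2), KNeighborsClassifier())

scores_knn = cross_val_score(knn_only, X, y, cv=5)
scores_pca_knn = cross_val_score(pca_knn, X, y, cv=5)

print(f"Scaled KNN        : {scores_knn.mean():.3f}")
print(f"Scaled PCA(2)+KNN : {scores_pca_knn.mean():.3f}")
```

Comparing mean scores over several folds gives a steadier picture than a single train/test split, where one or two samples can flip the ranking.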


In [4]:
# Que 6. Train a KNN Classifier on the Wine dataset with and without feature scaling.
# Compare model accuracy in both cases.

from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the data and hold out 30% for testing
data = load_wine()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Baseline: KNN (default n_neighbors=5) on the raw, unscaled features
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f'Accuracy Without Scaling : {accuracy_score(y_test, y_pred)}')

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Same classifier settings, now on standardized features
clf_scaled = KNeighborsClassifier()
clf_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = clf_scaled.predict(X_test_scaled)
print(f'Accuracy With Scaling : {accuracy_score(y_test, y_pred_scaled)}')

Accuracy Without Scaling : 0.7407407407407407
Accuracy With Scaling : 0.9629629629629629
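
A likely reason for the gap is that Euclidean distance sums squared differences over all features, so a feature measured on a large numeric scale dominates the result. The sketch below only prints the per-feature standard deviation of the Wine data to show how uneven the scales are; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine

data = load_wine()
feature_std = data.data.std(axis=0)

# List features from largest to smallest spread
order = np.argsort(feature_std)[::-1]
for i in order:
    print(f"{data.feature_names[i]:<30} std = {feature_std[i]:.2f}")

largest_feature = data.feature_names[order[0]]
```

Without scaling, neighbours are chosen almost entirely by the feature at the top of this list; after standardization every feature has unit variance and contributes on a comparable footing.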


In [10]:
# Que 7. Train a PCA model on the Wine dataset and
# print the explained variance ratio of each principal component.

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

data = load_wine()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

pca.explained_variance_ratio_

array([0.36196226, 0.18763862, 0.11656548, 0.07578973, 0.07043753,
       0.04552517, 0.03584257, 0.02646315, 0.02174942, 0.01958347,
       0.01762321, 0.01323825, 0.00758114])
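
To read these ratios as a guide for choosing the number of components, a cumulative sum helps. The sketch below reuses the values printed above; the helper `components_needed` and the 90% / 95% thresholds are illustrative choices.

```python
import numpy as np

# Explained variance ratios copied from the output above
ratios = np.array([0.36196226, 0.18763862, 0.11656548, 0.07578973, 0.07043753,
                   0.04552517, 0.03584257, 0.02646315, 0.02174942, 0.01958347,
                   0.01762321, 0.01323825, 0.00758114])

cumulative = np.cumsum(ratios)

def components_needed(threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)

n_90 = components_needed(0.90)
n_95 = components_needed(0.95)
print(np.round(cumulative, 3))
print(f"Components for 90% variance: {n_90}")  # 8
print(f"Components for 95% variance: {n_95}")  # 10
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, in which case it keeps as many components as needed to reach that share of the variance.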

In [12]:
# Que 8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components).
# Compare the accuracy with the original dataset.

from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_wine()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

clf = KNeighborsClassifier()
clf.fit(X_train_pca, y_train)
y_pred_pca = clf.predict(X_test_pca)

accuracy_pca = accuracy_score(y_test, y_pred_pca)
print(f'Accuracy with KNN on 2 PCA Components: {accuracy_pca}')

clf_scaled = KNeighborsClassifier()
clf_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = clf_scaled.predict(X_test_scaled)
print(f'Accuracy with KNN on Scaled Original Dataset : {accuracy_score(y_test, y_pred_scaled)}')

Accuracy with KNN on 2 PCA Components: 0.9814814814814815
Accuracy with KNN on Scaled Original Dataset : 0.9629629629629629


In [15]:
# Que 9. Train a KNN Classifier with different distance metrics (euclidean, manhattan)
# on the scaled Wine dataset and compare the results.

from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_wine()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf_euclidean = KNeighborsClassifier(metric='euclidean')
clf_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = clf_euclidean.predict(X_test_scaled)
print(f'Accuracy With KNN (Euclidean Distance) : {accuracy_score(y_test, y_pred_euclidean)}')

clf_manhattan = KNeighborsClassifier(metric='manhattan')
clf_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = clf_manhattan.predict(X_test_scaled)
print(f'Accuracy With KNN (Manhattan Distance) : {accuracy_score(y_test, y_pred_manhattan)}')

Accuracy With KNN (Euclidean Distance) : 0.9629629629629629
Accuracy With KNN (Manhattan Distance) : 0.9629629629629629
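
For reference, the two metrics differ only in how coordinate differences are aggregated. The helper functions below are a minimal sketch written for illustration, not scikit-learn's internal implementation.

```python
import numpy as np

def euclidean(a, b):
    """Straight-line (L2) distance."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (L1)."""
    return float(np.sum(np.abs(a - b)))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
d_euclidean = euclidean(a, b)
d_manhattan = manhattan(a, b)
print(d_euclidean, d_manhattan)  # 5.0 7.0
```

Equal accuracy on one test split means the two metrics made the same number of errors there; it does not by itself show that they select the same neighbours or would tie on other data.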


Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.  
Due to the large number of features and a small number of samples, traditional models overfit.  
Explain how you would:
1. Use PCA to reduce dimensionality
2. Decide how many components to keep
3. Use KNN for classification post-dimensionality reduction
4. Evaluate the model
5. Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

**Answer**
- To address the challenge of high-dimensional gene expression data causing overfitting in cancer classification, I would implement a Principal Component Analysis (PCA) and K-Nearest Neighbors (KNN) pipeline. This combination reduces noise and dimensionality while retaining most of the variance, which supports more robust classification.

- **1. PCA for Dimensionality Reduction :**
    - **Preprocessing:** First, standardize the gene expression data (Z-score scaling), fitting the scaler on the training samples only, so that genes with higher absolute expression levels do not dominate the principal components.

    - **Implementation:** Apply PCA to transform the thousands of correlated gene features into a new coordinate system of orthogonal, uncorrelated components (principal components - PCs), which are linear combinations of the original genes.
    - **Effect:** The algorithm identifies directions (PCs) that capture the maximum variance in the data.

- **2. Deciding Number of Components to Keep :**
    - **Cumulative Variance Method:** I would calculate the cumulative explained variance ratio and select the smallest number of components that explain a high threshold of the variance, typically 90%–95%.

    - **Scree Plot (Elbow Method):** I would plot the variance explained by each component and look for the "elbow" point—where the marginal gain in explained variance drops significantly—to select the optimal number of PCs.

- **3. KNN for Classification Post-Reduction :**
    - **Data Transformation:** Fit PCA on the standardized training data only, then project both the training and testing data onto the chosen principal components.

    - **KNN Application:** Feed these reduced components into the KNN algorithm. KNN computes the distance (e.g., Euclidean distance) between a new patient's sample and the labeled training samples.
    - **Classification:** Assign the cancer type based on the majority vote of the \(K\) closest neighbors.

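- **4. Evaluating the Model :** Given the small sample size, I would use Nested Cross-Validation (e.g., stratified 5-fold or Leave-One-Out for the outer loop), with scaling and PCA refit inside every training fold. The inner loop tunes the hyperparameters (the number of PCA components and \(K\) in KNN) and the outer loop estimates performance, which prevents leakage from tuning into the reported score.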

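    - **Metrics:** I would evaluate the model using Accuracy, Precision, Recall, and F1-Score, macro-averaged across cancer types so that rare subtypes are not hidden by common ones.
    - **Confusion Matrix:** Specifically check which cancer subtypes are confused with one another (false positives/negatives per subtype) to ensure the model is robust.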

- **5. Justification to Stakeholders :**
    - **Mitigates Overfitting:** By reducing thousands of genes to a few high-variance components, the model has far fewer dimensions in which to fit noise and is more likely to capture structural biological patterns.

    - **Handles High-Dimensionality:** PCA mitigates the "curse of dimensionality," which makes it well suited to microarray data where genes >> patient samples.
    - **Computational Efficiency:** KNN, which can be slow on large feature sets, becomes faster and often more accurate on a lower-dimensional PCA-reduced dataset.
    - **Interpretability:** PCA yields "metagenes" (linear combinations of genes that contribute most to the variance). Their loadings can be inspected to suggest which underlying biological processes drive the classification, although individual components are harder to interpret than single genes.
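
The five steps above can be sketched end to end with scikit-learn. Everything specific here is an assumption for illustration: `make_classification` generates a synthetic stand-in for gene expression data (100 samples, 2000 features), and the grid values and fold counts are placeholders rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: 100 samples, 2000 features (assumed sizes)
X, y = make_classification(n_samples=100, n_features=2000, n_informative=20,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values are illustrative, not recommendations
param_grid = {
    "pca__n_components": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop picks hyperparameters; outer loop scores the whole tuning procedure
search = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="f1_macro")
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="f1_macro")

print("Outer-fold macro F1:", outer_scores.round(3))
print(f"Mean: {outer_scores.mean():.3f}")
```

Because the scaler and PCA live inside the pipeline, they are refit on each training fold, so the outer scores estimate how the full procedure, tuning included, would behave on unseen patients.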