Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

Answer: K-Nearest Neighbors (KNN) is a simple, non-parametric, supervised learning algorithm used for both classification and regression. It stores the training data and, for a new data point, finds the 'k' closest training points and uses their labels or target values to make a prediction. For classification, it assigns the majority class of these neighbors to the new point; for regression, it predicts the average of their target values.

How KNN Works:

Choose 'k': Select the number of neighbors (k) to consider.

Calculate Distances: Compute the distance between the new data point and all points in the training dataset.

Identify K-Nearest Neighbors: Find the 'k' training data points that are closest to the new point.

Make a Prediction:

Classification: Assign the class that is most frequent among the 'k' nearest neighbors.

Regression: Predict a continuous value by averaging the target values of the 'k' nearest neighbors.
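
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `knn_predict`, the `task` argument, and the toy data are assumptions made up for this example, and Euclidean distance is assumed as the metric.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Hypothetical helper: predict for one point from its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 2: Euclidean distance from the new point to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    # Step 4: majority vote for classification, mean for regression.
    if task == "classification":
        return Counter(neighbor_targets).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))

# Toy data (made up): two well-separated clusters.
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]]
labels = ["a", "a", "a", "b", "b"]
values = [10.0, 12.0, 11.0, 50.0, 52.0]

predicted_class = knn_predict(X_toy, labels, [1.1, 1.0], k=3)
predicted_value = knn_predict(X_toy, values, [1.1, 1.0], k=3, task="regression")
print(predicted_class, predicted_value)
```

The same neighbor search serves both tasks; only the final aggregation step differs.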

Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?

Answer: The Curse of Dimensionality describes how the volume of the feature space grows exponentially with the number of dimensions, so a fixed number of data points becomes sparse and it gets harder to find meaningful patterns or distances. This hurts KNN (K-Nearest Neighbors) because the "nearby" data points that its distance-based predictions rely on end up far away from the query point and are therefore potentially misleading.

How the Curse of Dimensionality Affects KNN:

1. Data Sparsity:
As the number of dimensions increases, the data points spread out, making the feature space very sparse.

2. Meaningless Distances:
In a high-dimensional space, the concept of "nearest neighbors" becomes less meaningful because the difference in distance between the closest and furthest neighbors can be small relative to the total distance, making distance-based comparisons less reliable.

3. Increased Need for Data:
To adequately cover the exponentially larger feature space and maintain similar data density, the amount of training data required grows exponentially with the number of dimensions.

4. Overfitting:
With too little data to represent the vast, sparse space, KNN models are more prone to overfitting.

5. Computational Cost:
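Searching for nearest neighbors in a high-dimensional space is computationally more intensive: every distance calculation involves more features, and tree-based indexes such as KD-trees lose most of their advantage over brute-force search.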

6. Noisy Features:
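In high-dimensional data, the presence of noisy or irrelevant features can have a more significant impact, because every feature contributes to the distance and the irrelevant ones can drown out the informative ones.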
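
The "meaningless distances" effect can be observed numerically. The sketch below is an illustration under stated assumptions: points are drawn uniformly at random from the unit hypercube, and the helper name `relative_contrast` is made up for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def relative_contrast(n_dims, n_points=500, seed=0):
    """Hypothetical helper: (max distance - min distance) / min distance."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform in the unit hypercube
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# The contrast shrinks as dimensions are added, so "nearest" and
# "farthest" neighbors become harder to tell apart.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(n_dims), 3))
```

Real datasets are rarely uniform, so the effect is usually weaker in practice, but the trend is the reason dimensionality reduction is often applied before KNN.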










Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

Answer: Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms original features into new, uncorrelated components that capture the maximum variance in the data.
In contrast, feature selection is a process of selecting a subset of the most relevant original features, discarding the rest, to simplify a model and improve its performance. The key difference is that PCA creates new, composite features, while feature selection retains and selects from the existing ones.

**Principal Component Analysis (PCA)**

What it is:
PCA transforms a set of high-dimensional, correlated variables into a smaller set of new, uncorrelated variables called principal components.

How it works:
It finds directions (principal components) in the data that capture the most variance, essentially a new set of axes that better represent the data's information.

Purpose:
To reduce the dimensionality of complex datasets, which simplifies data processing, improves visualization, and can help reduce noise in the data.

Outcome:
Produces new features (principal components) that are linear combinations of the original features.

Interpretation:
The new principal components can be difficult to interpret because they are mixes of the original features.


**Feature Selection**
What it is:
The process of identifying and selecting a subset of the most useful original input variables for a model.

How it works:
It involves methods to rank and select features based on their relevance to the target variable: filter methods (statistical scores computed per feature), wrapper methods (such as recursive feature elimination), and embedded methods (such as L1 regularization).

Purpose:
To remove irrelevant or redundant features, improve model explainability, reduce overfitting, and enhance model accuracy.

Outcome:
Retains a smaller set of the original features.

Interpretation:
The resulting features are original and thus more interpretable, making it easier to understand which factors influence the model's predictions.
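
The contrast between the two approaches shows up directly in their outputs. The sketch below is one possible illustration on the Wine dataset: it assumes `SelectKBest` with the ANOVA F-score (`f_classif`) as the feature selection method and keeps three features or components, both choices made only for this example.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# PCA: three new features, each a linear combination of all 13 originals.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Feature selection: three of the original columns, kept unchanged.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_scaled, wine.target)
kept_idx = selector.get_support(indices=True)
kept_names = [wine.feature_names[i] for i in kept_idx]

print("PCA output shape:       ", X_pca.shape)
print("PCA loadings shape:     ", pca.components_.shape)  # (components, original features)
print("Selected output shape:  ", X_selected.shape)
print("Selected feature names: ", kept_names)
```

The selected columns keep their original names, while each PCA component has a weight for every original feature, which is why components are harder to interpret.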

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

Answer: In PCA, the eigenvectors of the data's covariance matrix identify the directions of maximum variance in the data (the principal components), while the corresponding eigenvalues quantify the amount of variance along those directions. They are important because they enable dimensionality reduction: components are ranked by eigenvalue, and keeping only the top-ranked ones retains most of the information while simplifying complex datasets and making analysis more efficient.

**Why are they Important in PCA?**

1. Dimensionality Reduction:
PCA uses eigenvalues and eigenvectors to find the most important directions (principal components) in a dataset: the eigenvectors with the largest eigenvalues are kept and the rest are dropped.

2. Feature Extraction:
Eigenvectors serve as the new, lower-dimensional basis for the data. This new feature set is orthogonal and decorrelated, simplifying data representation and making it easier for downstream tasks.

3. Identifying Data Patterns:
By finding the directions of maximum variance, eigenvalues and eigenvectors reveal underlying patterns and structures within the data.

4. Informing Model Selection:
The eigenvalues help in deciding how many principal components to retain. Components with small eigenvalues might contribute little to the overall variance and can be discarded, leading to a more parsimonious model.
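
The link between eigenvalues and explained variance can be checked by hand. The sketch below is a minimal illustration, assuming the standardized Wine data as input: it eigen-decomposes the sample covariance matrix with NumPy and compares the result with what scikit-learn's `PCA` reports.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Sample covariance matrix of the features (columns are variables).
cov = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]   # sort from largest to smallest
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # columns are the principal directions

# Share of total variance carried by each direction.
explained_ratio = eigenvalues / eigenvalues.sum()

pca = PCA().fit(X_scaled)
matches_variance = np.allclose(eigenvalues, pca.explained_variance_)
matches_ratio = np.allclose(explained_ratio, pca.explained_variance_ratio_)
print(matches_variance, matches_ratio)
```

The eigenvectors agree with `pca.components_` only up to sign, since an eigenvector and its negation describe the same direction.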

Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?

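Answer: PCA complements KNN in a pipeline by first reducing data dimensionality and noise, which mitigates the "curse of dimensionality" and the computational burden of KNN. KNN then measures distances in a compact space made of the highest-variance components. Because PCA is unsupervised, those components are not guaranteed to be the most discriminative ones, so the number of components should be validated against model performance.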

**How PCA enhances KNN**

1. Dimensionality Reduction:
KNN's effectiveness is hindered by the "curse of dimensionality," where performance degrades as the number of features increases. PCA transforms the data into a lower-dimensional space using principal components, effectively reducing noise and retaining the most significant variance.

2. Improved Computational Efficiency:
By reducing the number of features, PCA significantly decreases the computational cost and time required for KNN to calculate distances between data points and find neighbors.

3. Noise Reduction:
PCA can filter out noisy features by focusing on the principal components that capture the most significant variation in the data, leading to cleaner and more effective input for KNN.

4. Feature Correlation Elimination:
PCA decorrelates features, so the principal components used by KNN are orthogonal. Without this step, groups of strongly correlated features are effectively counted several times in the distance calculation; working in the decorrelated space gives a more stable basis for comparing points.

5. Enhanced Pattern Recognition:
By compressing the data into its most essential components, PCA highlights underlying patterns and structures, making it easier for KNN to correctly classify data points based on their similarity in the transformed space.
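
A single pipeline of this kind might look like the sketch below. It is an illustration under stated assumptions: the Wine dataset, two principal components, `n_neighbors=5`, and 5-fold cross-validation are all choices made for this example rather than recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> project onto 2 principal components -> classify with KNN.
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

# Each fold refits the scaler and PCA on its own training part only,
# so no information from the held-out part leaks into preprocessing.
scores = cross_val_score(pca_knn, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```

Wrapping the steps in one pipeline object is what keeps the scaler and PCA from being fitted on test data during cross-validation.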

In [1]:
# Question 6: Train a KNN Classifier on the Wine dataset with and without feature
# scaling. Compare model accuracy in both cases.


In [2]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score



In [3]:
wine = load_wine()
X = wine.data
y = wine.target

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [5]:
knn1 = KNeighborsClassifier(n_neighbors=5)
knn1.fit(X_train, y_train)
y_pred1 = knn1.predict(X_test)
acc1 = accuracy_score(y_test, y_pred1)


In [6]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [7]:

knn2 = KNeighborsClassifier(n_neighbors=5)
knn2.fit(X_train_scaled, y_train)
y_pred2 = knn2.predict(X_test_scaled)
acc2 = accuracy_score(y_test, y_pred2)

In [8]:
print("Accuracy without Scaling: ", acc1)
print("Accuracy with Scaling   : ", acc2)

Accuracy without Scaling:  0.7407407407407407
Accuracy with Scaling   :  0.9629629629629629
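
The gap between the two accuracies is consistent with the very different scales of the Wine features. The sketch below, an illustration only, prints the three features with the largest standard deviation; the variable name `largest` is made up for this example.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
stds = wine.data.std(axis=0)

# Features with the largest spread dominate unscaled Euclidean distances.
ranked = sorted(zip(wine.feature_names, stds), key=lambda pair: pair[1], reverse=True)
for name, std in ranked[:3]:
    print(f"{name:>20s}  std = {std:.2f}")

largest = wine.feature_names[int(np.argmax(stds))]
```

Without scaling, the feature with the largest spread contributes most of each distance, so the remaining features have little influence on which neighbors are chosen.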


In [9]:
# Question 7: Train a PCA model on the Wine dataset and print the explained variance
# ratio of each principal component.
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [10]:
wine = load_wine()
X = wine.data
y = wine.target

In [11]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


In [12]:
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

In [13]:
print("Explained Variance Ratio of each component:")
print(pca.explained_variance_ratio_)

Explained Variance Ratio of each component:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
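
One common way to read these ratios is through their cumulative sum. The sketch below copies the printed values and finds how many components are needed to reach a variance threshold; the 95% threshold is an assumption chosen for illustration, not a rule.

```python
import numpy as np

# Explained variance ratios copied from the printed output above.
ratios = np.array([
    0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294, 0.04935823,
    0.04238679, 0.02680749, 0.02222153, 0.01930019, 0.01736836, 0.01298233,
    0.00795215,
])

cumulative = np.cumsum(ratios)
threshold = 0.95  # assumed target share of total variance

# Index of the first cumulative value >= threshold, converted to a count.
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print("Cumulative variance:", np.round(cumulative, 3))
print("Components needed for", threshold, ":", n_keep)
```

scikit-learn can apply the same rule directly: passing a float between 0 and 1 as `n_components` (for example `PCA(n_components=0.95)`) keeps the smallest number of components whose cumulative explained variance exceeds that fraction.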


In [14]:
# Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
# components). Compare the accuracy with the original dataset.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [15]:
wine = load_wine()
X = wine.data
y = wine.target

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [17]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [18]:
knn_orig = KNeighborsClassifier(n_neighbors=5)
knn_orig.fit(X_train_scaled, y_train)
y_pred_orig = knn_orig.predict(X_test_scaled)
acc_orig = accuracy_score(y_test, y_pred_orig)

In [19]:
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

In [20]:
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

In [21]:
print("Accuracy on Original Scaled Data: ", acc_orig)
print("Accuracy on PCA (2 components):   ", acc_pca)

Accuracy on Original Scaled Data:  0.9629629629629629
Accuracy on PCA (2 components):    0.9814814814814815


In [22]:
# Question 9: Train a KNN Classifier with different distance metrics (euclidean,
# manhattan) on the scaled Wine dataset and compare the results.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


In [23]:
wine = load_wine()
X = wine.data
y = wine.target

In [24]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

In [25]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [26]:
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

In [27]:
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

In [28]:
print("Accuracy with Euclidean distance: ", acc_euclidean)
print("Accuracy with Manhattan distance: ", acc_manhattan)

Accuracy with Euclidean distance:  0.9629629629629629
Accuracy with Manhattan distance:  0.9629629629629629
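
For reference, the two metrics differ only in how per-feature differences are combined. The sketch below computes both by hand on a made-up pair of points.

```python
import numpy as np

# Two made-up points, chosen so the arithmetic is easy to follow.
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Euclidean: square root of the sum of squared differences (straight line).
euclidean = float(np.sqrt(np.sum((a - b) ** 2)))

# Manhattan: sum of absolute differences (axis-aligned path).
manhattan = float(np.sum(np.abs(a - b)))

print("Euclidean:", euclidean)
print("Manhattan:", manhattan)
```

Both are special cases of the Minkowski distance (p=2 and p=1). They can rank neighbors differently in general, so identical accuracies on one test split do not mean the two metrics always pick the same neighbors.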


In [29]:
# Question 10: You are working with a high-dimensional gene expression dataset to
# classify patients with different types of cancer.
# Due to the large number of features and a small number of samples, traditional models
# overfit.
# Explain how you would:
# ● Use PCA to reduce dimensionality
# ● Decide how many components to keep
# ● Use KNN for classification post-dimensionality reduction
# ● Evaluate the model
# ● Justify this pipeline to your stakeholders as a robust solution for real-world
# biomedical data
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.metrics import make_scorer, f1_score, balanced_accuracy_score


In [30]:
pipe = Pipeline(steps=[
    ("scaler", StandardScaler(with_mean=True, with_std=True)),
    ("pca", PCA(svd_solver="full", whiten=False)),
    ("knn", KNeighborsClassifier())
])

In [31]:
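param_grid = {
    # n_components must not exceed min(n_samples, n_features) of each training
    # fold; larger values fail to fit and are scored as NaN by GridSearchCV.
    "pca__n_components": [5, 10, 15, 20, 30, 50, 100],
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__metric": ["euclidean", "manhattan"],
    # optionally: "knn__weights": ["uniform", "distance"]
}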

In [32]:
scorer = make_scorer(f1_score, average="macro")

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring=scorer, n_jobs=-1)

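# X and y are not redefined in this cell; as written, it reuses whatever arrays
# are still in scope from the earlier cells (the Wine data), not gene expression data.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
scores = cross_val_score(grid, X, y, cv=outer_cv, scoring=scorer, n_jobs=-1)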

print("Nested CV Macro-F1: %.3f ± %.3f" % (scores.mean(), scores.std()))

Nested CV Macro-F1: 0.962 ± 0.014
