Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?


In [None]:
'''
K-Nearest Neighbors (KNN) is a simple, intuitive algorithm that makes predictions by looking at the closest data points around a given sample.
In classification, it finds the K nearest neighbors and assigns the class that appears most often among them (majority voting).
In regression, it takes the average (or weighted average) of their numerical values.
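KNN is a lazy learner: it does not build a model beforehand, it just stores the training data and makes decisions based on proximity at prediction time. This makes it easy to understand, but prediction is slow on large datasets because each query must be compared against many stored points.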
'''
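
The two prediction rules described above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation; the function name `knn_predict` and the toy arrays are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point using its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors' labels
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: unweighted mean of the neighbors' values
    return float(np.mean(neighbor_targets))

# Toy data, invented for illustration: two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 48.0])

query = np.array([1.1, 1.0])
pred_class = knn_predict(X_toy, y_class, query, k=3)
pred_value = knn_predict(X_toy, y_value, query, k=3, task="regression")
print(pred_class, pred_value)  # prints: 0 11.0
```

The query sits next to the first three points, so the vote returns class 0 and the regression output is the mean of 10, 12 and 11.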

Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?


In [None]:
'''
The Curse of Dimensionality refers to the problems that arise when data has too many features (dimensions).
As dimensions increase, data points become more spread out, and the notion of “closeness” or “distance” becomes less meaningful.
For KNN, this is a big issue — since it relies on distance to find neighbors, high-dimensional data can make all points seem equally far apart.
This leads to poor neighbor selection, lower accuracy, and slower computation.
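To handle it, the number of dimensions is usually reduced before applying KNN, for example with feature selection or PCA. Feature scaling does not reduce dimensionality, but it is still needed so that no single feature dominates the distance.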
'''
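
A small experiment can make the "all points seem equally far apart" effect visible. The sketch below assumes uniformly random points in a unit hypercube, which is a simplification and not a model of any real dataset; the helper `relative_contrast` is a hypothetical name for the gap between the farthest and nearest point, relative to the nearest.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(max - min) / min distance from one random query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# The contrast shrinks as dimensions grow: nearest and farthest look alike
contrasts = {dims: relative_contrast(dims) for dims in (2, 10, 100, 1000)}
for dims, c in contrasts.items():
    print(f"{dims:>5} dims: relative contrast = {c:.2f}")
```

When the contrast approaches zero, the "nearest" neighbors are barely nearer than any other point, which is why KNN's neighbor selection degrades.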

Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?

In [None]:
'''
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a large set of features into a smaller set of new features called principal components.
These components capture most of the variance (information) in the data while reducing noise and redundancy.
The key idea: it creates new features (linear combinations of the original ones) rather than just picking some of them.
Difference from feature selection:
PCA - Transforms features into new combinations to reduce dimensions.
Feature selection - Chooses the most important original features without changing them.
'''
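
The contrast can be seen side by side on the Wine data used later in this notebook. This is a sketch only: `SelectKBest` with `f_classif` is one possible feature selection method, chosen here as an assumption for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target

# Feature selection: keeps 2 of the original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, y)
kept = [data.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", kept)

# PCA: builds 2 new columns, each a weighted mix of all 13 originals
pca = PCA(n_components=2).fit(X_scaled)
print("PCA loadings shape:", pca.components_.shape)  # (2, 13)
```

Each row of `pca.components_` holds the weights of all 13 original features in one principal component, whereas the selector simply returns two existing feature names.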


Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

In [None]:
'''
In PCA, eigenvalues and eigenvectors come from the covariance matrix of the data and play a key role in identifying patterns.
Eigenvectors show the direction of the new feature axes (principal components).
Eigenvalues tell us how much variance or information each eigenvector captures.
The larger the eigenvalue, the more important that component is.
In simple terms — eigenvectors define where the data varies the most, and eigenvalues tell how much it varies there.
PCA keeps components with the highest eigenvalues to represent most of the data efficiently.
'''
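
To make the link concrete, the eigen-decomposition of the covariance matrix can be computed by hand and compared with scikit-learn's PCA. The sketch below uses the Wine data from later in this notebook and assumes standardized features; it is an illustration, not how scikit-learn computes PCA internally.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of total variance captured along each eigenvector
manual_ratio = eigenvalues / eigenvalues.sum()

pca = PCA().fit(X_scaled)
print(np.allclose(manual_ratio, pca.explained_variance_ratio_))
```

The eigenvectors match the rows of `pca.components_` up to sign, since an eigenvector and its negation describe the same direction.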

Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?

In [None]:
'''
KNN and PCA work well together because PCA helps KNN perform better on complex, high-dimensional data.
PCA reduces the number of features while keeping the most important patterns, which helps remove noise and redundant information.
This often makes KNN faster and can improve accuracy: KNN relies on distance, and distances tend to be more meaningful in fewer dimensions.
'''
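
One way to combine the two steps is a scikit-learn pipeline, sketched below on the Wine data used later in this notebook. The choice of 2 components and 5 neighbors mirrors the later cells and is otherwise arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale -> reduce -> classify; each step is fitted on the training split only
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)
pipe.fit(X_train, y_train)
pipeline_accuracy = pipe.score(X_test, y_test)
print("Pipeline accuracy:", pipeline_accuracy)
```

Wrapping the steps in one object means `pipe.score` applies the same scaling and projection to the test split that were learned from the training split.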


Dataset:
Use the Wine Dataset from sklearn.datasets.load_wine()

Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.


In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


In [2]:
data = load_wine()
X, y = data.data, data.target

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
knn = KNeighborsClassifier(n_neighbors=5)

In [5]:
knn.fit(X_train, y_train)

In [6]:
y_pred = knn.predict(X_test)

In [7]:
acc_without_scaling = accuracy_score(y_test, y_pred)
print("Accuracy without scaling:", acc_without_scaling)

Accuracy without scaling: 0.7222222222222222


In [8]:
scaler = StandardScaler()

In [9]:
# Fit the scaler on the training split only, so no test statistics leak in
X_train_scaled = scaler.fit_transform(X_train)

In [10]:
X_test_scaled = scaler.transform(X_test)

In [11]:
knn_scaled = KNeighborsClassifier(n_neighbors=5)

In [12]:
knn_scaled.fit(X_train_scaled, y_train)

In [13]:
y_pred_scaled = knn_scaled.predict(X_test_scaled)

In [14]:
acc_with_scaling = accuracy_score(y_test, y_pred_scaled)
print("Accuracy with scaling:", acc_with_scaling)

Accuracy with scaling: 0.9444444444444444


Question 7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.

In [15]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [16]:
data = load_wine()
X = data.data

In [17]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [18]:
# n_components is left unset, so all 13 components are kept
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

In [19]:
print("Explained variance ratio of each principal component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f}")

Explained variance ratio of each principal component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080
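
The ratios above are easier to act on as a running total. The sketch below recomputes the same PCA and finds how many components are needed to reach a threshold; the 95% value is an assumed example, not a requirement of the question.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)
pca = PCA().fit(X_scaled)

# Running total of the explained variance ratios
cumulative = np.cumsum(pca.explained_variance_ratio_)

# First component count whose cumulative share reaches the assumed 95% threshold
n_for_95 = int(np.argmax(cumulative >= 0.95)) + 1
print("Components needed for 95% variance:", n_for_95)
```

Adding up the printed ratios gives about 0.94 after nine components and about 0.96 after ten, so ten components are needed to pass 95%.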


Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.


In [20]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


In [21]:
data = load_wine()
X, y = data.data, data.target

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [23]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [24]:
knn_original = KNeighborsClassifier(n_neighbors=5)

In [25]:
knn_original.fit(X_train_scaled, y_train)

In [26]:
y_pred_original = knn_original.predict(X_test_scaled)

In [27]:
acc_original = accuracy_score(y_test, y_pred_original)
print("Accuracy on original dataset:", acc_original)


Accuracy on original dataset: 0.9444444444444444


In [28]:
pca = PCA(n_components=2)

In [29]:
# Fit PCA on the scaled training split only; the test split reuses these components
X_train_pca = pca.fit_transform(X_train_scaled)

In [30]:
X_test_pca = pca.transform(X_test_scaled)

In [31]:
knn_pca = KNeighborsClassifier(n_neighbors=5)

In [32]:
knn_pca.fit(X_train_pca, y_train)

In [33]:
y_pred_pca = knn_pca.predict(X_test_pca)

In [34]:
acc_pca = accuracy_score(y_test, y_pred_pca)
print("Accuracy on PCA-transformed dataset (2 PCs):", acc_pca)

Accuracy on PCA-transformed dataset (2 PCs): 1.0


Question 9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

In [35]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [36]:
data = load_wine()
X, y = data.data, data.target

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [38]:
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euc = knn_euclidean.predict(X_test_scaled)
acc_euc = accuracy_score(y_test, y_pred_euc)
print("Accuracy with Euclidean distance:", acc_euc)


Accuracy with Euclidean distance: 0.9444444444444444


In [39]:
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_man = knn_manhattan.predict(X_test_scaled)
acc_man = accuracy_score(y_test, y_pred_man)
print("Accuracy with Manhattan distance:", acc_man)

Accuracy with Manhattan distance: 0.9444444444444444
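
Both metrics give the same accuracy on this single 20% test split, which is too small to separate them. A cross-validated comparison, sketched below, averages over several splits; the 5-fold setup and the `random_state` value are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_means = {}
for metric in ("euclidean", "manhattan"):
    # Scaling sits inside the pipeline, so each fold is scaled on its own training part
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5, metric=metric),
    )
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_means[metric] = scores.mean()
    print(f"{metric}: mean accuracy = {scores.mean():.4f} (std {scores.std():.4f})")
```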


Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data


In [None]:
'''
Step 1: Reduce Dimensionality with PCA
        High-dimensional gene expression data can overwhelm models. Use PCA to transform the original features into a smaller
        set of principal components that capture most of the variance while removing noise and redundancy. This makes the dataset more manageable for KNN.
Step 2: Decide How Many Components to Keep
        Look at the explained variance ratio of each principal component.
        Retain enough components to cover, say, 90–95% of total variance as a starting point, then confirm the choice with cross-validation. High variance is not the same as predictive signal, so the threshold should be treated as a tunable setting.
Step 3: Use KNN for Classification
        Train KNN on the PCA-transformed data.
        PCA centers the data but does not rescale it, so standardize the features before PCA; otherwise high-variance genes dominate both the components and the distances.
        KNN’s simplicity is advantageous for small sample sizes, as it has few hyperparameters to tune. Choose K by cross-validation, since a very small K can still overfit.
Step 4: Evaluate the Model
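        Use cross-validation (e.g., stratified K-fold) to reliably estimate performance on small datasets.
        Fit the scaler and PCA inside each training fold so that no information leaks from the held-out samples.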
        Evaluate using accuracy, precision, recall, F1-score, or ROC-AUC depending on the problem.
Step 5: Justify to Stakeholders
        This pipeline balances simplicity and robustness: PCA reduces dimensionality and noise, while KNN provides an interpretable, distance-based classification.
        It’s suitable for real-world biomedical data, where samples are limited but features are numerous, and overfitting must be avoided.
Summary: PCA + KNN is a robust, interpretable, and efficient solution for high-dimensional datasets like gene expression,
         giving reliable predictions without overcomplicating the model.
'''
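
The steps above can be sketched end to end as follows. No gene expression data is available in this notebook, so `make_classification` generates a synthetic stand-in with few samples and many features; the sample counts, the 95% variance threshold, and K=5 are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: 100 samples, 2000 features (assumption)
X, y = make_classification(
    n_samples=100, n_features=2000, n_informative=20,
    n_classes=3, random_state=0,
)

model = make_pipeline(
    StandardScaler(),                           # PCA does not rescale features itself
    PCA(n_components=0.95, svd_solver="full"),  # keep components covering 95% of variance
    KNeighborsClassifier(n_neighbors=5),
)

# The whole pipeline is refitted in every fold, so scaling and PCA never see held-out samples
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1_macro"])
print("Mean accuracy:", results["test_accuracy"].mean())
print("Mean macro F1:", results["test_f1_macro"].mean())
```

On real data, the number of components and K could be tuned together with `GridSearchCV` over the pipeline parameters `pca__n_components` and `kneighborsclassifier__n_neighbors`, ideally with an outer cross-validation loop kept separate for the final performance estimate.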