Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

Ans : K-Nearest Neighbors (KNN) is a simple, non-parametric, "lazy learning" algorithm. It is called "lazy" because it doesn't learn a model or mathematical formula during training; instead, it simply memorizes the entire training dataset.

How KNN Works: When you want to predict the outcome for a new data point, the algorithm calculates the distance (usually Euclidean) between that new point and every point in the stored dataset. It then identifies the $K$ closest points (neighbors).

- In Classification: The algorithm uses a majority vote. It looks at the classes of the $K$ nearest neighbors. If the majority of neighbors are "Class A," the new point is assigned to "Class A."
- In Regression: The algorithm calculates the average (mean). It looks at the values of the $K$ nearest neighbors and takes their average to predict the value for the new point.
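
The two prediction rules above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how scikit-learn implements KNN: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and ties in the vote are not handled specially.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return the indices of the k training points closest to `query` (Euclidean)."""
    distances = [(math.dist(x, query), i) for i, x in enumerate(train_X)]
    distances.sort()
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: mean of the target values of the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / k

# Toy data, invented for illustration: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
labels = ["A", "A", "A", "B", "B", "B"]
values = [10.0, 12.0, 11.0, 50.0, 52.0, 54.0]

predicted_class = knn_classify(points, labels, (2.0, 2.0), k=3)
predicted_value = knn_regress(points, values, (2.0, 2.0), k=3)
print(predicted_class, predicted_value)  # prints: A 11.0
```

Note that "training" here is nothing more than keeping `points`, `labels`, and `values` around; all the work happens at prediction time.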


Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

Ans : The Curse of Dimensionality refers to the various problems that arise when analyzing data in high-dimensional spaces (data with many features/columns).

Effect on KNN Performance: KNN is particularly sensitive to this "curse" because it relies entirely on calculating distances between points.

1. Loss of "Distance" Meaning: As the number of dimensions increases, all data points tend to become equidistant (equally far away) from each other. The concept of "nearest" neighbor becomes meaningless because the difference in distance between the nearest and farthest point becomes negligible.


2. Sparsity: In high dimensions, data becomes very sparse (spread out). To maintain a reliable density of data to find good neighbors, the amount of data needed grows exponentially, which is rarely available.

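3. Computational Cost: The cost of each distance calculation grows with the number of dimensions, and brute-force KNN computes a distance to every stored training point for each prediction. With hundreds or thousands of dimensions, this becomes expensive and slow.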
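
The loss of distance meaning described in point 1 can be observed with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube (an assumption made for illustration, not a property of any real dataset) and measures the relative contrast, (farthest - nearest) / nearest, from a random query point. The function name `relative_contrast` is made up for this example.

```python
import math
import random

def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)  # fixed seed so the run is repeatable
contrast_by_dim = {d: relative_contrast(500, d, rng) for d in (2, 10, 100, 1000)}

for d, c in contrast_by_dim.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

In low dimensions the farthest point is many times farther than the nearest one; as the number of dimensions grows, the contrast shrinks toward zero, so the "nearest" neighbor is barely nearer than any other point.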


Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

Ans : Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique. Its goal is to reduce the number of features in a dataset while retaining as much of the original information (variance) as possible.

Difference from Feature Selection:

- Feature Selection: This involves selecting a subset of the original features and discarding the rest. The features you keep are unchanged. (e.g., Keeping "Age" and "Salary" but deleting "Height").

- PCA: This involves transforming the original features into a new set of features called Principal Components. Each component is a linear combination of the original features. (e.g., Creating "Component 1" which is a weighted mix of Age, Salary, and Height). You lose the original feature names, but you keep most of their information in compressed form.
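
The contrast can be made concrete with NumPy. This is a minimal sketch on synthetic, standardized data: the three columns are only hypothetical stand-ins for Age, Salary, and Height, and PCA is computed here through an SVD of the centered data rather than through scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: columns stand in for (age, salary, height), values are made up.
X = rng.normal(size=(100, 3))
X[:, 1] += 0.8 * X[:, 0]  # make "salary" correlated with "age"

# Feature selection: keep columns 0 and 1 exactly as they are, drop column 2.
X_selected = X[:, [0, 1]]

# PCA: center the data, then project onto the top-2 right singular vectors.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]               # shape (2, 3): one row of weights per component
X_pca = X_centered @ components.T  # shape (100, 2): new, mixed features

print("Selected columns are original columns:", np.array_equal(X_selected, X[:, :2]))
print("PCA weights on (age, salary, height) per component:")
print(components.round(2))
```

Both results have two columns, but the selected columns are still readable as "age" and "salary", while each PCA column is a weighted blend of all three inputs.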


Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?

Ans : Eigenvalues and eigenvectors are mathematical concepts used to calculate the Principal Components (PCs) from the data's covariance matrix.

- Eigenvectors (The Direction): These represent the directions of the new axes (Principal Components) along which the data is spread out. The eigenvector with the largest eigenvalue points in the direction of the highest variance.

- Eigenvalues (The Magnitude): These numbers represent the amount of variance carried by each eigenvector. A high eigenvalue means that specific Principal Component contains a lot of important information about the data.

Importance: They allow us to rank the components. We keep the eigenvectors with the highest eigenvalues (most variance) and discard the ones with low eigenvalues (often mostly noise), thereby reducing dimensionality.
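
The ranking step can be sketched directly with NumPy's eigendecomposition. The data below is synthetic and the variable names are chosen for this example; scikit-learn's `PCA` uses an SVD internally, so treat this as an illustration of the idea rather than of the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic 2-D data, stretched so that one direction has much more variance.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X_centered = X - X.mean(axis=0)

# Covariance matrix of the features (columns).
cov = np.cov(X_centered, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Rank components: largest eigenvalue (most variance) first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column i is the i-th principal direction

explained_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", explained_ratio.round(4))

# Reduce to one dimension by projecting onto the top eigenvector.
X_reduced = X_centered @ eigenvectors[:, :1]
print("Reduced shape:", X_reduced.shape)
```

The variance of the projected data equals the eigenvalue of the direction it was projected onto, which is why eigenvalues can be read as "variance carried by each component".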

Question 5: How do KNN and PCA complement each other when applied in a single pipeline?

Ans : KNN and PCA are often used together because PCA solves the major weaknesses of KNN.

1. Mitigating the Curse of Dimensionality: KNN struggles with high-dimensional data (see Q2). PCA reduces the data to a few meaningful components (e.g., from 100 features to 10), making the distance calculations in KNN more reliable.

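2. Noise Reduction: KNN is sensitive to noisy features and outliers, because every feature contributes to the distance. PCA can reduce noise by discarding the components with low variance (low eigenvalues), on the assumption that those directions carry mostly noise, leaving a cleaner signal for the KNN algorithm to classify.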

3. Speed: KNN is slow at prediction time because it calculates the distance from each new point to every stored training point. Running KNN on PCA-reduced data is faster because each distance involves fewer dimensions.

In [1]:
# Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# 1. Load Data
wine = load_wine()
X, y = wine.data, wine.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 2. Train WITHOUT Scaling
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_raw.fit(X_train, y_train)
acc_raw = accuracy_score(y_test, knn_raw.predict(X_test))

# 3. Train WITH Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(X_test_scaled))

# 4. Print Results
print("--- KNN Accuracy Comparison ---")
print(f"Accuracy without Scaling: {acc_raw:.4f}")
print(f"Accuracy with Scaling:    {acc_scaled:.4f}")

--- KNN Accuracy Comparison ---
Accuracy without Scaling: 0.7407
Accuracy with Scaling:    0.9630


In [2]:
# Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

from sklearn.decomposition import PCA
import numpy as np

# Note: We continue using 'X_train_scaled' from the previous answer
# because PCA is sensitive to feature scale: unscaled features with
# large ranges would dominate the principal components.

# 1. Initialize PCA (Keep all components for analysis)
pca_full = PCA(n_components=None)
pca_full.fit(X_train_scaled)

# 2. Get Variance Ratios
variance_ratios = pca_full.explained_variance_ratio_

print("--- PCA Explained Variance Ratio ---")
for i, ratio in enumerate(variance_ratios):
    print(f"Principal Component {i+1}: {ratio:.4f} ({ratio*100:.2f}%)")

print(f"\nTotal Variance Explained by Top 2: {np.sum(variance_ratios[:2])*100:.2f}%")

--- PCA Explained Variance Ratio ---
Principal Component 1: 0.3620 (36.20%)
Principal Component 2: 0.1876 (18.76%)
Principal Component 3: 0.1166 (11.66%)
Principal Component 4: 0.0758 (7.58%)
Principal Component 5: 0.0704 (7.04%)
Principal Component 6: 0.0455 (4.55%)
Principal Component 7: 0.0358 (3.58%)
Principal Component 8: 0.0265 (2.65%)
Principal Component 9: 0.0217 (2.17%)
Principal Component 10: 0.0196 (1.96%)
Principal Component 11: 0.0176 (1.76%)
Principal Component 12: 0.0132 (1.32%)
Principal Component 13: 0.0076 (0.76%)

Total Variance Explained by Top 2: 54.96%


In [3]:
# Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

# 1. Transform Data using PCA (Keep only Top 2 components)
pca_2 = PCA(n_components=2)
X_train_pca = pca_2.fit_transform(X_train_scaled)
X_test_pca = pca_2.transform(X_test_scaled)

# 2. Train KNN on PCA Data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))

# 3. Comparison
print("--- Accuracy Comparison: Original vs PCA ---")
print(f"Original Scaled Data (13 Features): {acc_scaled:.4f}")
print(f"PCA Reduced Data (2 Features):      {acc_pca:.4f}")

--- Accuracy Comparison: Original vs PCA ---
Original Scaled Data (13 Features): 0.9630
PCA Reduced Data (2 Features):      0.9815


In [4]:
# Question 9: Train a KNN Classifier with different distance metrics (Euclidean, Manhattan) on the scaled Wine dataset and compare the results.

# 1. Train with Euclidean Distance (p=2)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
acc_euclidean = accuracy_score(y_test, knn_euclidean.predict(X_test_scaled))

# 2. Train with Manhattan Distance (p=1)
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
acc_manhattan = accuracy_score(y_test, knn_manhattan.predict(X_test_scaled))

# 3. Print Results
print("--- KNN Distance Metric Comparison ---")
print(f"Euclidean Distance Accuracy: {acc_euclidean:.4f}")
print(f"Manhattan Distance Accuracy: {acc_manhattan:.4f}")

--- KNN Distance Metric Comparison ---
Euclidean Distance Accuracy: 0.9630
Manhattan Distance Accuracy: 0.9630


In [2]:
''' Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit. '''


from sklearn.datasets import load_wine
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# 0. Load Data (the Wine dataset is used here as a stand-in for the
#    gene expression data, so that X and y are available)
wine = load_wine()
X = wine.data
y = wine.target

# 1. Define the Pipeline
# Step 1: Scale the data (StandardScaler) - Essential for PCA & KNN
# Step 2: Apply PCA (keep 95% variance)
# Step 3: Classifier (KNN)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

# 2. Evaluate using Cross-Validation
# We use 5-Fold Cross-Validation to simulate a robust evaluation
cv_scores = cross_val_score(pipeline, X, y, cv=5)

print("--- Biomedical Pipeline Strategy ---")
print("Strategy: PCA (95% variance) -> KNN")
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Mean Accuracy: {cv_scores.mean():.4f}")

--- Biomedical Pipeline Strategy ---
Strategy: PCA (95% variance) -> KNN
Cross-Validation Scores: [0.91666667 0.94444444 0.97222222 1.         0.91428571]
Mean Accuracy: 0.9495
