# Assignment

Q.1. What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

Answer ->

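K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression tasks.
It is a non-parametric, instance-based (lazy) algorithm: it stores the training data and makes predictions directly from it instead of learning a fixed set of model parameters.
To predict for a new point, it computes the distance from that point to every training point and picks the K closest ones (the nearest neighbors).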

**KNN for Classification:**

- Each of the K nearest neighbors votes for its class.

- The majority class among these neighbors becomes the predicted class.

**KNN for Regression:**

- The algorithm takes the average (or weighted average) of the target values of the K nearest neighbors.
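
The two prediction rules above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN: the helper names (`k_nearest_indices`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def k_nearest_indices(X_train, x_query, k):
    """Return indices of the k training points closest to x_query (Euclidean)."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    return np.argsort(distances)[:k]


def knn_classify(X_train, y_train, x_query, k=3):
    """Predict a class label by majority vote among the k nearest neighbors."""
    idx = k_nearest_indices(X_train, x_query, k)
    return Counter(y_train[idx].tolist()).most_common(1)[0][0]


def knn_regress(X_train, y_train, x_query, k=3):
    """Predict a numeric target as the mean of the k nearest neighbors' targets."""
    idx = k_nearest_indices(X_train, x_query, k)
    return float(np.mean(y_train[idx]))


# Toy data (made up for illustration): two well-separated clusters.
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])                     # class labels
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])   # numeric targets

query = np.array([1.1, 1.0])
predicted_class = knn_classify(X_toy, y_class, query, k=3)
predicted_value = knn_regress(X_toy, y_value, query, k=3)
print(predicted_class, predicted_value)
```

The query point sits inside the first cluster, so its three nearest neighbors all carry class 0 and the regression prediction is the mean of 10, 12 and 11, which is 11.0.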

Q.2. What is the Curse of Dimensionality and how does it affect KNN
performance?

Answer ->

The Curse of Dimensionality refers to the problems that arise when working with high-dimensional data, that is, data with many features (variables).

As the number of dimensions increases:

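- The data becomes sparse: a fixed number of samples covers less and less of the feature space.

- Distances between points become less meaningful, because the nearest and the farthest points end up at nearly the same distance.

- The cost of computing and storing distances grows with every added feature.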

**How It Affects KNN Performance**

KNN relies entirely on a distance metric (such as Euclidean distance).
When the number of dimensions increases:

- Nearest neighbors aren’t truly “close” anymore.
- The model has fewer genuinely similar neighbors to base its predictions on.
- Noise dominates because similar points are rare.
- Distance computation becomes slower and more memory-heavy.
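
The loss of contrast between near and far neighbors can be checked numerically. The sketch below is an illustration under assumed settings (1000 points drawn uniformly from the unit hypercube, one random query point, a hypothetical helper `relative_contrast`); it measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np


def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


rng = np.random.default_rng(0)
contrast_by_dim = {d: relative_contrast(1000, d, rng) for d in (2, 10, 100, 1000)}
for d, contrast in contrast_by_dim.items():
    print(f"dims={d:>4}  relative contrast={contrast:.3f}")
```

In low dimensions the farthest point is many times farther away than the nearest one. As the dimension grows the ratio shrinks toward zero, which is why "nearest" carries less information for KNN.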

Q.3. What is Principal Component Analysis (PCA)? How is it different from
feature selection?

Answer ->

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms a large set of correlated features into a smaller set of uncorrelated features called principal components, while preserving as much variance (information) as possible.

How it differs from feature selection:

- PCA: creates new features as linear combinations of the original ones (reduces dimensionality by feature extraction).

- Feature selection: keeps a subset of the original features unchanged (reduces dimensionality by elimination), so the retained features keep their original meaning.
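
To make the contrast concrete, the sketch below reduces the Wine data to two columns in both ways. `SelectKBest` with the ANOVA F-score (`f_classif`) is used here only as one example of a feature selection method; the choice of selector and of two output columns is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature extraction: each new column is a weighted mix of all 13 features.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("PCA loadings shape:", pca.components_.shape)

# Feature selection: keep 2 of the original columns, unchanged.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, y)
kept = selector.get_support(indices=True)
print("Selected features:", [wine.feature_names[i] for i in kept])
```

Each row of `pca.components_` holds one weight per original feature, whereas the selected columns are exact copies of two original (scaled) columns.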

Q.4. What are eigenvalues and eigenvectors in PCA, and why are they
important?

Answer ->

In PCA, an eigenvector of the data's covariance matrix is a direction that the matrix only stretches or compresses, without rotating it. These directions are the axes along which the data varies.

The corresponding eigenvalue is the scaling factor, and it equals the variance (information) captured along that eigenvector’s direction.

Why they are important:

1. Determine principal components: The eigenvectors define the new axes (principal components) of the reduced feature space.

2. Measure importance (variance explained): Eigenvalues tell how much variance (information) each principal component retains.

3. Dimensionality reduction: By choosing the top k eigenvectors (those with the largest eigenvalues), we can represent most of the information with fewer dimensions.

4. Noise reduction: Small eigenvalues correspond to components with little variance (often noise), which can be discarded.
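
The roles listed above can be traced by doing PCA by hand with an eigendecomposition of the covariance matrix. This is a minimal sketch on synthetic data (three made-up features, two of them strongly correlated), not a replacement for `sklearn.decomposition.PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (made up for illustration): 3 features, two of them correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)          # 3x3 covariance matrix

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]           # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()
X_projected = X_centered @ eigenvectors[:, :2]  # keep the top 2 components

print("Eigenvalues:", eigenvalues)
print("Explained variance ratio:", explained_ratio)
```

The variance of each projected column equals the matching eigenvalue, which is the sense in which eigenvalues measure how much information a component retains.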

Q.5. How do KNN and PCA complement each other when applied in a single
pipeline?

Answer ->

1. Apply PCA:

- Process: Reduce the dimensionality of the data by transforming it into a smaller number of principal components.
- Benefits: Removes feature redundancy, discards low-variance noise, and makes distance calculations cheaper.

2. Apply KNN:

- Process: Perform classification or regression in the lower-dimensional PCA space.
- Benefits: Distances are more meaningful in fewer dimensions, which can improve both speed and accuracy and reduce overfitting.
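
One way to chain the two steps is scikit-learn's `Pipeline`. The sketch below assumes the Wine data, two principal components and k=5; these values are arbitrary choices for illustration, not tuned settings.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pca_knn = Pipeline([
    ("scale", StandardScaler()),                    # put features on one scale
    ("pca", PCA(n_components=2)),                   # 2 components: an arbitrary choice
    ("knn", KNeighborsClassifier(n_neighbors=5)),   # k=5: also arbitrary
])

# The scaler and PCA are refit on the training part of every fold,
# so no information from the held-out part leaks into the transform.
scores = cross_val_score(pca_knn, X, y, cv=5)
print(f"Mean 5-fold accuracy: {scores.mean():.4f}")
```

Wrapping the steps in one estimator keeps preprocessing and classification together, so the same object can be cross-validated, tuned, and applied to new data.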

Dataset:
Use the Wine Dataset from sklearn.datasets.load_wine().

Q.6. Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.


Answer ->>

# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# 2. Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ----------------------------
# Case 1: Without Feature Scaling
# ----------------------------
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_no_scale = knn.predict(X_test)
acc_no_scale = accuracy_score(y_test, y_pred_no_scale)

# ----------------------------
# Case 2: With Feature Scaling
# ----------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# ----------------------------
# 3. Compare Results
# ----------------------------
print(f"Accuracy without Feature Scaling: {acc_no_scale:.4f}")
print(f"Accuracy with Feature Scaling:    {acc_scaled:.4f}")


Accuracy without Feature Scaling: 0.7407
Accuracy with Feature Scaling:    0.9630

Scaling raises test accuracy from 0.7407 to 0.9630. Without it, features with large numeric ranges (such as proline) dominate the distance, so the other features barely influence which neighbors are chosen.


Q.7. Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.

Answer ->>

# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd

# 1. Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# 2. Standardize the features (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# 4. Print explained variance ratio
explained_variance = pca.explained_variance_ratio_

# Display in a nice table
df_variance = pd.DataFrame({
    'Principal Component': [f'PC{i+1}' for i in range(len(explained_variance))],
    'Explained Variance Ratio': explained_variance
})

print(df_variance)
print("\nTotal Variance Explained:", explained_variance.sum())


   Principal Component  Explained Variance Ratio
0                  PC1                  0.361988
1                  PC2                  0.192075
2                  PC3                  0.111236
3                  PC4                  0.070690
4                  PC5                  0.065633
5                  PC6                  0.049358
6                  PC7                  0.042387
7                  PC8                  0.026807
8                  PC9                  0.022222
9                 PC10                  0.019300
10                PC11                  0.017368
11                PC12                  0.012982
12                PC13                  0.007952

Total Variance Explained: 1.0
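
The ratios above can be accumulated to see how many components are needed to reach a given share of the variance. The 90% and 95% thresholds in this sketch are common rules of thumb, assumed here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)
ratios = PCA().fit(X_scaled).explained_variance_ratio_

cumulative = np.cumsum(ratios)
# Index of the first component whose running total reaches the threshold, plus one.
n_for_90 = int(np.argmax(cumulative >= 0.90)) + 1
n_for_95 = int(np.argmax(cumulative >= 0.95)) + 1

print("Cumulative variance:", np.round(cumulative, 4))
print("Components for 90%:", n_for_90)
print("Components for 95%:", n_for_95)
```

Adding up the ratios in the table gives the same answer: the running total passes 90% at PC8 and 95% at PC10.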


Q.8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.

Answer ->>

# Import libraries
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# 2. Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# 3. Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ------------------------------------------------------
# Case 1: KNN on Original Scaled Data
# ------------------------------------------------------
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred_original = knn.predict(X_test_scaled)
acc_original = accuracy_score(y_test, y_pred_original)

# ------------------------------------------------------
# Case 2: PCA Transformation (retain top 2 components)
# ------------------------------------------------------
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

# ------------------------------------------------------
# 4. Compare Results
# ------------------------------------------------------
print(f"Accuracy on Original (Scaled) Dataset: {acc_original:.4f}")
print(f"Accuracy on PCA-Reduced Dataset (2 Components): {acc_pca:.4f}")


Accuracy on Original (Scaled) Dataset: 0.9444
Accuracy on PCA-Reduced Dataset (2 Components): 0.9444

On this split, two principal components match the accuracy of all 13 scaled features (0.9444 in both cases), so the dimensionality drops from 13 to 2 with no loss in test accuracy here.


Q.9. Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

Answer ->>

# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# 2. Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Scale features (important for KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# -----------------------------------------------
# Case 1: KNN with Euclidean distance (default)
# -----------------------------------------------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# -----------------------------------------------
# Case 2: KNN with Manhattan distance
# -----------------------------------------------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

# -----------------------------------------------
# 4. Compare Results
# -----------------------------------------------
print(f"Accuracy (Euclidean distance): {acc_euclidean:.4f}")
print(f"Accuracy (Manhattan distance): {acc_manhattan:.4f}")


Accuracy (Euclidean distance): 0.9444
Accuracy (Manhattan distance): 0.9815

On this split, Manhattan distance scores slightly higher than Euclidean distance (0.9815 vs 0.9444). With 54 test samples that gap is two predictions, so it should not be read as a general ranking of the two metrics.


Q.10. You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.

Explain how you would:

● Use PCA to reduce dimensionality

● Decide how many components to keep

● Use KNN for classification post-dimensionality reduction

● Evaluate the model

● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data


Answer ->>

1. Use PCA for Dimensionality Reduction

    - Standardize the data, then apply PCA, fitting both on the training data only.

    - PCA compresses correlated gene features into fewer uncorrelated components, which reduces redundancy and discards low-variance directions that are often noise.

2. Decide Number of Components

    - Use cumulative explained variance (e.g., 90–95%) as a starting point, then pick the final number of components by cross-validation performance.

3. Train KNN on PCA-Reduced Data

    - Use top PCs as input features.

    - Tune k, metric (Euclidean/Manhattan), and weights with cross-validation.

4. Evaluate the Model

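    - Use stratified cross-validation to get a reliable performance estimate from few samples, fitting the scaler and PCA inside each fold to prevent data leakage.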

    - Use accuracy, F1-score, AUC (or AUPRC for imbalanced data).

    - Optionally validate on independent test data.

5. Justify to Stakeholders

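    - PCA reduces noise and overfitting risk by shrinking thousands of features to a few components.

    - KNN is simple, transparent, and data-driven: each prediction can be traced back to the most similar training samples.

    - The pipeline is likely to generalize better than a model trained on all raw features, is easy to explain, and suits real-world biomedical data with small sample sizes.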
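
The steps above could be wired together roughly as follows. This is a sketch under stated assumptions: `make_classification` generates a synthetic stand-in for gene expression data (120 samples, 500 features, 3 classes), and the search ranges in `param_grid` are hypothetical values that would need adapting to a real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features.
X, y = make_classification(
    n_samples=120, n_features=500, n_informative=30,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(random_state=0)),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical search ranges; real ones depend on the dataset.
param_grid = {
    "pca__n_components": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
    "knn__metric": ["euclidean", "manhattan"],
    "knn__weights": ["uniform", "distance"],
}

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1_macro")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated macro F1: {search.best_score_:.3f}")
```

Because the best score is chosen among many candidates, it tends to be optimistic. A held-out test set or nested cross-validation gives a fairer final estimate.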