Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?
K-Nearest Neighbors (KNN)

ANSWER 1 :- K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression problems.
It is a lazy learning algorithm because it does not learn a model during training; instead, it stores the training data and makes predictions when required.

How KNN Works

Choose the number of neighbors K.

Calculate the distance between the new data point and all training data points
(commonly using Euclidean distance).

Select the K nearest data points.

Make a prediction based on those neighbors.
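
The four steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a replacement for a library implementation; the function name `knn_predict`, its `task` parameter, and the toy data are made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Hypothetical helper: predict one point with plain KNN."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Step 3: indices of the K smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]

    # Step 4: majority vote (classification) or mean (regression)
    if task == "classification":
        return Counter(neighbor_targets).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))


# Toy data (assumed for illustration): three points near (1, 1), two near (8, 8)
X_train = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [8.0, 8.0], [9.0, 8.5]]
labels = ["A", "A", "A", "B", "B"]
values = [10.0, 12.0, 14.0, 50.0, 55.0]

# Step 1: choose K
predicted_class = knn_predict(X_train, labels, [1.2, 1.4], k=3)
predicted_value = knn_predict(X_train, values, [1.2, 1.4], k=3, task="regression")
print(predicted_class, predicted_value)  # A 12.0
```

Note that nothing is learned before `knn_predict` is called; all the work happens at prediction time, which is why KNN is called a lazy learner.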

KNN for Classification

The class is decided by majority voting among the K nearest neighbors.

The class with the highest frequency is assigned to the new data point.

Example:
If K = 5 and among 5 neighbors:

3 belong to Class A

2 belong to Class B

➡️ The new data point is classified as Class A.

KNN for Regression

The prediction is the average (mean) of the values of the K nearest neighbors.

Example:
If K = 3 and target values are:
10, 12, 14

➡️ Predicted value = (10 + 12 + 14) / 3 = 12
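
The two worked examples (3 votes for Class A against 2 for Class B, and the mean of 10, 12, 14) can be reproduced with scikit-learn. The one-dimensional training points below are assumed for illustration; only the labels and target values come from the examples.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Classification: with K = 5 all five points vote (3 x "A", 2 x "B")
X_cls = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y_cls = ["A", "A", "A", "B", "B"]
classifier = KNeighborsClassifier(n_neighbors=5).fit(X_cls, y_cls)
predicted_class = classifier.predict([[3.0]])[0]

# Regression: the 3 nearest neighbors of x = 2 have targets 10, 12, 14;
# the far-away point (target 100) is ignored
X_reg = [[1.0], [2.0], [3.0], [20.0]]
y_reg = [10.0, 12.0, 14.0, 100.0]
regressor = KNeighborsRegressor(n_neighbors=3).fit(X_reg, y_reg)
predicted_value = regressor.predict([[2.0]])[0]

print(predicted_class)  # A
print(predicted_value)  # 12.0
```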

Advantages of KNN

Simple and easy to understand

No explicit training phase (the model only stores the training data)

Works well with small datasets

Disadvantages of KNN

Slow for large datasets

Sensitive to noise and outliers

Requires feature scaling

Choosing the right K is important

Conclusion

KNN is an intuitive algorithm that makes predictions based on the similarity between data points and can be effectively used for both classification and regression tasks.


---



Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?
Curse of Dimensionality

ANSWER 2 :- The Curse of Dimensionality refers to the problems that occur when the number of features (dimensions) in a dataset becomes very large.
As dimensions increase, the data points become sparse and distance measures lose their meaning.

How It Affects KNN Performance

KNN relies heavily on distance calculations to find nearest neighbors. When dimensions increase:

Distances Become Less Meaningful

The distances from a point to its nearest and farthest neighbors become almost the same.

KNN cannot clearly identify “nearest” neighbors.

Reduced Accuracy

Neighbors may not be truly similar.

Classification and regression predictions become unreliable.

Increased Computation Time

KNN must calculate distance across many dimensions.

This makes prediction slower.

More Data Required

High-dimensional data needs much more training data to maintain performance.

Otherwise, the model overfits or performs poorly.

Example

In 2D space, nearby points are easy to identify.

In 100D space, data points are far apart and scattered.
➡️ KNN struggles to find meaningful neighbors.
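
A small simulation can make this visible. The sketch below is an illustration under assumed settings (uniform random points, 1000 samples, a fixed seed); `relative_contrast` is a hypothetical helper that measures how much farther the farthest point is than the nearest one.

```python
import numpy as np


def relative_contrast(n_dims, n_points=1000, seed=0):
    """Hypothetical helper: (farthest - nearest) / nearest distance from a random query."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform points in the unit cube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrast_2d = relative_contrast(2)
contrast_100d = relative_contrast(100)

print("Relative contrast in 2D:  ", contrast_2d)
print("Relative contrast in 100D:", contrast_100d)
```

A large value means the nearest neighbor is much closer than the farthest point. As the number of dimensions grows the value shrinks, so "nearest" carries less information.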

How to Reduce the Curse of Dimensionality in KNN

Feature selection (remove irrelevant features)

Dimensionality reduction (PCA)

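Feature scaling (does not reduce dimensions, but stops large-valued features from dominating the distance)

Tune K carefully (for example, with cross-validation)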

Conclusion

The Curse of Dimensionality negatively affects KNN by making distance calculations unreliable, increasing computation time, and reducing prediction accuracy, especially in high-dimensional datasets.

---

Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?
Principal Component Analysis (PCA)

ANSWER 3 :- Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique used to reduce the number of features in a dataset while preserving as much variance (information) as possible.

PCA transforms the original features into a new set of uncorrelated variables called principal components, which are ordered by the amount of variance they explain.

How PCA Works (Brief Steps)

Standardize the data

Compute the covariance matrix

Find eigenvalues and eigenvectors

Select top principal components

Transform data into lower dimensions
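
The five steps above can be followed directly in NumPy. This is a minimal sketch for understanding, not how a library necessarily implements PCA; the function name `pca_from_scratch` and the random demo data are assumptions made for this example.

```python
import numpy as np


def pca_from_scratch(X, n_components):
    """Hypothetical helper that follows the five PCA steps."""
    X = np.asarray(X, dtype=float)

    # 1. Standardize the data (zero mean, unit variance per feature)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Compute the covariance matrix (features are columns)
    cov = np.cov(X_std, rowvar=False)

    # 3. Find eigenvalues and eigenvectors (eigh is meant for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Select the top components: sort by eigenvalue, largest first
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    components = eigenvectors[:, order][:, :n_components]

    # 5. Transform the data into the lower-dimensional space
    X_reduced = X_std @ components
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return X_reduced, explained_ratio


# Demo data (assumed): two strongly correlated features plus one independent feature
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

X_reduced, explained_ratio = pca_from_scratch(X_demo, n_components=2)
print(X_reduced.shape)   # (200, 2)
print(explained_ratio)
```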

Difference Between PCA and Feature Selection

Aspect	PCA	Feature Selection
Definition	Creates new features (principal components)	Selects a subset of existing features
Type	Feature extraction	Feature subset selection
Nature	Unsupervised	Can be supervised or unsupervised
Interpretability	Low (components are combinations)	High (original features kept)
Correlation	Removes correlation	May still have correlation
Data Transformation	Yes	No

Example

Feature Selection: Choosing age, salary, experience from many features.

PCA: Creating new features like PC1, PC2 that combine all original features.
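
The difference can also be seen in code. The sketch below uses the Wine dataset (the same one used in the coding questions) instead of age, salary, and experience, and `SelectKBest` with `f_classif` is just one of several possible feature selection methods, chosen here as an assumption.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)

# Feature selection: keep 3 of the 13 original columns (supervised, uses the labels)
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_scaled, data.target)
kept_names = [
    name for name, keep in zip(data.feature_names, selector.get_support()) if keep
]

# PCA: build 3 new columns, each a combination of all 13 original features
pca = PCA(n_components=3)
X_components = pca.fit_transform(X_scaled)

print("Selected original features:", kept_names)
print("PCA output shape:", X_components.shape)         # (178, 3)
print("PCA weights per component:", pca.components_.shape)  # (3, 13)
```

Both results have 3 columns, but the selected columns still have their original names, while PC1, PC2, and PC3 are mixtures of every feature.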

When to Use What

Use PCA when:

Dataset has many correlated features

You want faster models and reduced dimensionality

Use Feature Selection when:

Feature interpretability is important

You want to keep original variables

Conclusion

PCA reduces dimensionality by creating new features, while feature selection reduces dimensionality by choosing important existing features. Both aim to improve model performance but work differently.

---

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?
Eigenvalues and Eigenvectors in PCA

ANSWER 4 :- In Principal Component Analysis (PCA), eigenvalues and eigenvectors are mathematical concepts used to identify the principal components of the data.

They are calculated from the covariance matrix of the dataset.

Eigenvectors

Eigenvectors represent the directions (axes) along which the data varies the most.

Each eigenvector becomes a principal component.

They define the new coordinate system for the transformed data.

➡️ In simple words:
Eigenvectors tell us which direction to look at the data.

Eigenvalues

Eigenvalues indicate the amount of variance (information) captured along their corresponding eigenvectors.

A larger eigenvalue means a more important principal component.

➡️ In simple words:
Eigenvalues tell us how important that direction is.

Why Are They Important in PCA?

Feature Importance

Components with higher eigenvalues carry more information.

Dimensionality Reduction

PCA keeps the eigenvectors with the largest eigenvalues and discards the rest.

Noise Reduction

Small eigenvalues often represent noise and can be removed.

Data Compression

Reduces dimensions while preserving maximum variance.

Example

If PCA produces:

Eigenvalue 5 → PC1

Eigenvalue 1 → PC2

➡️ PC1 is more important and explains more variance than PC2.
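
To put numbers on this, the sketch below uses a small covariance matrix that was picked (as an assumption for illustration) because its eigenvalues are exactly 5 and 1.

```python
import numpy as np

# Assumed 2x2 covariance matrix with eigenvalues 5 and 1
cov = np.array([[3.0, 2.0],
                [2.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eigh(cov)  # returned in ascending order

# Sort so that the largest eigenvalue (PC1) comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of total variance explained by each component
explained_ratio = eigenvalues / eigenvalues.sum()
pc1 = eigenvectors[:, 0]

print(eigenvalues)      # [5. 1.]
print(explained_ratio)  # about [0.833 0.167]
```

So PC1 explains 5 / (5 + 1), about 83% of the variance, and PC2 the remaining 17%. The direction `pc1` also satisfies the defining property `cov @ pc1 == 5 * pc1`.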

Conclusion

Eigenvectors define the directions of maximum variance, and eigenvalues measure how much variance exists in those directions. Together, they help PCA reduce dimensions while retaining the most important information.

---

Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?

ANSWER 5 :- How KNN and PCA Complement Each Other in a Single Pipeline

KNN and PCA are often used together because PCA can improve the efficiency, and often the accuracy, of KNN.

Role of PCA

PCA reduces the number of features (dimensions) in the dataset.

It combines redundant, correlated features into a smaller set of uncorrelated components.

It helps overcome the Curse of Dimensionality.

Role of KNN

KNN uses distance calculations to find nearest neighbors.

It performs better when data has fewer, meaningful dimensions.

How They Work Together

Apply PCA first to reduce dimensionality.

Use the transformed data as input to KNN.

KNN now computes distances in a lower-dimensional space.

Benefits of Using PCA Before KNN

Improved Accuracy

Distances become more meaningful.

Faster Computation

Fewer features → faster distance calculations.

Reduced Noise

PCA discards low-variance components, which often carry mostly noise.

Better Scalability

Works well with high-dimensional data.

Example Pipeline
Standardization → PCA → KNN
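
A possible scikit-learn version of this pipeline is sketched below on the Wine dataset. The 95% variance threshold, K = 5, and 5-fold cross-validation are assumed settings for illustration, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Standardization -> PCA -> KNN, chained so every step is fitted on training folds only
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),  # keep enough components for 95% variance
    KNeighborsClassifier(n_neighbors=5),
)

scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Putting the scaler and PCA inside the pipeline matters: if they were fitted on the full dataset first, information from the validation folds would leak into training.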

Conclusion

PCA enhances KNN by reducing dimensionality and noise, making distance calculations more reliable. Together, they form an efficient and accurate machine learning pipeline.

---

In [3]:
# Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

#Objective

# To compare the performance of a KNN classifier on the Wine dataset with and without feature scaling.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# -----------------------------
# KNN WITHOUT feature scaling
# -----------------------------
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)

y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# -----------------------------
# KNN WITH feature scaling
# -----------------------------
knn_with_scaling = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])

knn_with_scaling.fit(X_train, y_train)
y_pred_scaling = knn_with_scaling.predict(X_test)
accuracy_scaling = accuracy_score(y_test, y_pred_scaling)

print("Accuracy without scaling:", accuracy_no_scaling)
print("Accuracy with scaling:", accuracy_scaling)

'''Comparison & Explanation

Without Scaling:

Accuracy ≈ 78%

Features with larger values dominate distance calculations.

KNN performance is negatively affected.

With Scaling:

Accuracy ≈ 93%

All features contribute equally to distance computation.

KNN performs significantly better.

Conclusion

Feature scaling greatly improves KNN performance because KNN is a distance-based algorithm. Applying StandardScaler before KNN is essential, especially for datasets like Wine where features have different scales.'''

Accuracy without scaling: 0.7777777777777778
Accuracy with scaling: 0.9333333333333333



In [4]:
# Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Wine dataset
X, y = load_wine(return_X_y=True)

# Standardize features (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA (keep all components)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Print explained variance ratio
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"Principal Component {i}: {ratio:.4f}")



Principal Component 1: 0.3620
Principal Component 2: 0.1921
Principal Component 3: 0.1112
Principal Component 4: 0.0707
Principal Component 5: 0.0656
Principal Component 6: 0.0494
Principal Component 7: 0.0424
Principal Component 8: 0.0268
Principal Component 9: 0.0222
Principal Component 10: 0.0193
Principal Component 11: 0.0174
Principal Component 12: 0.0130
Principal Component 13: 0.0080
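
The printed ratios can be turned into a cumulative sum to see how many components are needed for a chosen share of the variance. The sketch below copies the values from the output above; the 95% threshold is an assumed, commonly used cut-off.

```python
import numpy as np

# Explained variance ratios copied from the printed output above
ratios = np.array([
    0.3620, 0.1921, 0.1112, 0.0707, 0.0656, 0.0494, 0.0424,
    0.0268, 0.0222, 0.0193, 0.0174, 0.0130, 0.0080,
])

cumulative = np.cumsum(ratios)

# First position where the cumulative variance reaches 95% (assumed threshold)
n_components_95 = int(np.argmax(cumulative >= 0.95) + 1)

print("Variance kept by top 2 components:", round(cumulative[1], 4))  # 0.5541
print("Components needed for 95% variance:", n_components_95)         # 10
```

So the first two components keep about 55% of the variance, and 10 of the 13 components are needed to reach 95%.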


In [5]:
# Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# -------------------------------------------------
# KNN on ORIGINAL data (with scaling)
# -------------------------------------------------
knn_original = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])

knn_original.fit(X_train, y_train)
y_pred_original = knn_original.predict(X_test)
accuracy_original = accuracy_score(y_test, y_pred_original)

# -------------------------------------------------
# KNN on PCA-reduced data (top 2 components)
# -------------------------------------------------
knn_pca = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])

knn_pca.fit(X_train, y_train)
y_pred_pca = knn_pca.predict(X_test)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

print("Accuracy on original dataset:", accuracy_original)
print("Accuracy on PCA (2 components):", accuracy_pca)


Accuracy on original dataset: 0.9333333333333333
Accuracy on PCA (2 components): 0.9333333333333333


In [6]:
# Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# -----------------------------
# KNN with Euclidean distance
# -----------------------------
knn_euclidean = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(
        n_neighbors=5,
        metric="euclidean"
    ))
])

knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# -----------------------------
# KNN with Manhattan distance
# -----------------------------
knn_manhattan = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(
        n_neighbors=5,
        metric="manhattan"
    ))
])

knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

print("Accuracy (Euclidean):", accuracy_euclidean)
print("Accuracy (Manhattan):", accuracy_manhattan)


Accuracy (Euclidean): 0.9333333333333333
Accuracy (Manhattan): 0.9777777777777777
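
For reference, the two metrics differ only in how the per-feature differences are combined. The two vectors below are assumed values chosen to give round numbers.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean: square root of the sum of squared differences -> sqrt(9 + 16 + 0)
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan: sum of absolute differences -> 3 + 4 + 0
manhattan = np.sum(np.abs(a - b))

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```

Because Euclidean distance squares each difference, one large difference in a single feature weighs more than it does under Manhattan distance. This can change which points count as the nearest neighbors, which is one possible reason for the different accuracies above.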


In [7]:
# Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer

'''Due to the large number of features and a small number of samples, traditional models
overfit.

Explain how you would:

● Use PCA to reduce dimensionality

● Decide how many components to keep

● Use KNN for classification post-dimensionality reduction

● Evaluate the model

● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data'''

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Simulate high-dimensional gene expression data
X, y = make_classification(
    n_samples=120,
    n_features=5000,
    n_informative=50,
    n_classes=3,
    random_state=42
)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 1. Linear SVM on raw data (overfitting)
svm_raw = SVC(kernel="linear")
svm_raw.fit(X_train, y_train)

train_acc_raw = accuracy_score(y_train, svm_raw.predict(X_train))
test_acc_raw = accuracy_score(y_test, svm_raw.predict(X_test))

# 2. PCA + Linear SVM
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

svm_pca = SVC(kernel="linear")
svm_pca.fit(X_train_pca, y_train)

train_acc_pca = accuracy_score(y_train, svm_pca.predict(X_train_pca))
test_acc_pca = accuracy_score(y_test, svm_pca.predict(X_test_pca))

print("Raw SVM - Train Accuracy:", train_acc_raw)
print("Raw SVM - Test Accuracy:", test_acc_raw)
print("PCA + SVM - Train Accuracy:", train_acc_pca)
print("PCA + SVM - Test Accuracy:", test_acc_pca)



Raw SVM - Train Accuracy: 1.0
Raw SVM - Test Accuracy: 0.5
PCA + SVM - Train Accuracy: 1.0
PCA + SVM - Test Accuracy: 0.4722222222222222
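
The cell above uses a linear SVM, while the question asks for KNN after PCA. A possible PCA + KNN version is sketched below on the same simulated data. The variance thresholds, the K values, and the use of `GridSearchCV` with stratified 5-fold cross-validation are assumptions for illustration, and no result is claimed here: on random synthetic data like this the accuracy may well stay low, as the SVM output above suggests.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same simulated data as the cell above (a stand-in for real gene expression data)
X, y = make_classification(
    n_samples=120, n_features=5000, n_informative=50, n_classes=3, random_state=42
)

# Stratified split keeps the class proportions equal in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scaling and PCA live inside the pipeline so they are re-fitted on each training fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(svd_solver="full")),
    ("knn", KNeighborsClassifier()),
])

# Candidate settings (assumed): share of variance to keep and number of neighbors
param_grid = {
    "pca__n_components": [0.80, 0.90],
    "knn__n_neighbors": [3, 5, 7],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

# Final check on data that was never used for tuning
test_accuracy = accuracy_score(y_test, search.predict(X_test))

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
print("Held-out test accuracy:", test_accuracy)
```

For stakeholders, the main points are that the number of components is chosen by cross-validation rather than by hand, every preprocessing step is fitted on training data only, and the final accuracy is reported on a held-out test set.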


**THANK YOU**