
# **Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?**

# **Answer 1:**
# K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is a simple, non-parametric, and lazy learning algorithm used for both classification and regression tasks.

How KNN Works
1. Distance Calculation: KNN calculates the distance between the input data point and every point in the training dataset. Common distance metrics include Euclidean, Manhattan, and Minkowski distance.
2. Finding Nearest Neighbors: It identifies the K closest data points (neighbors) to the input point based on the calculated distances.
3. Prediction:
    - Classification: The class of the input data point is determined by a majority vote of its K neighbors.
    - Regression: The prediction is the average of the target values of the K nearest neighbors.
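
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN; the function name `knn_predict` and the toy data are assumptions made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point using its k nearest training points."""
    # Step 1: Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    # Step 3: majority vote (classification) or mean (regression)
    if task == "classification":
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))

# Toy data (made up for illustration): two clusters in 2-D
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])

query = np.array([1.8, 1.6])
predicted_class = knn_predict(X_train, y_class, query, k=3)
predicted_value = knn_predict(X_train, y_value, query, k=3, task="regression")
print(predicted_class, predicted_value)
```

The query sits inside the first cluster, so its three nearest neighbors all carry class 0 and the regression output is the mean of their targets (10, 12, 11).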

Characteristics of KNN
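- Lazy Learning: KNN doesn't build an explicit model during training; it stores the training data and defers computation until prediction time.
- Non-Parametric: KNN makes no assumptions about the underlying data distribution.
- Sensitive to K, Distance Metric, and Feature Scale: A small K follows noise in the training data, while a large K smooths over class boundaries. Because predictions depend on distances, features with large numeric ranges dominate unless the data is scaled.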

Use Cases
- Classification: KNN is used in tasks like handwriting recognition or pattern recognition.
- Regression: KNN can predict continuous outcomes by averaging the target values of nearby points.

# **Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?**

# **Answer 2:**
# The Curse of Dimensionality
The curse of dimensionality refers to the problems that arise when dealing with data in high-dimensional spaces. As the number of features (dimensions) increases, data becomes increasingly sparse, and distances between points become less meaningful.

How It Affects KNN Performance
1. Distance Becomes Less Meaningful: In high dimensions, the distances from a point to its nearest and farthest neighbors tend to become similar, making it harder for KNN to find meaningful nearest neighbors.
2. Increased Computational Cost: Each distance computation scales with the number of features, and a brute-force query compares the input against every training point.
3. Data Sparsity: A fixed number of samples covers a high-dimensional space thinly, so even the nearest neighbors may be far away and KNN generalizes poorly.
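
The first effect can be observed on random data. The sketch below is an illustration under assumptions (uniform random points, a helper named `relative_contrast`): it measures how far apart the nearest and farthest neighbors are, relative to the nearest one, as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """(max - min) / min of the distances from one query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

contrasts = {d: relative_contrast(1000, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:>4}  relative contrast={c:.2f}")
```

A contrast close to zero means the farthest point is barely farther than the nearest one, which is the situation in which "nearest neighbor" stops carrying much information.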

Mitigating the Curse of Dimensionality for KNN
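- Dimensionality Reduction: Techniques like PCA can reduce dimensions while preserving most of the variance. t-SNE also reduces dimensions, but it is mainly a visualization tool and in its standard form cannot project unseen points, so it is a poor fit as a preprocessing step for KNN.
- Feature Selection: Selecting relevant features reduces dimensionality and can improve KNN performance.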

# **Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?**

# **Answer 3:**
# Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while retaining as much variance (information) as possible.

How PCA Works
1. Computes Covariance Matrix: After mean-centering the features (and usually standardizing them), PCA calculates the covariance matrix of the data to capture how features vary together.
2. Finds Eigenvectors and Eigenvalues: It computes the eigenvectors (principal components) and eigenvalues of the covariance matrix. Each eigenvalue is the variance captured along its principal component.
3. Projects Data: The data is projected onto the top principal components, those with the largest eigenvalues, to reduce dimensions.
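
These steps map directly onto NumPy calls. The following is a minimal sketch on synthetic data; the function name `pca_eig` and the data are assumptions for the example, and library implementations typically use an SVD instead of an explicit covariance matrix.

```python
import numpy as np

def pca_eig(X, n_components):
    """PCA via eigendecomposition of the covariance matrix."""
    # Center each feature so the covariance is computed about the mean
    X_centered = X - X.mean(axis=0)
    # Step 1: covariance matrix (features x features)
    cov = np.cov(X_centered, rowvar=False)
    # Step 2: eigh is meant for symmetric matrices; eigenvalues come back ascending
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Step 3: project onto the top components
    components = eigenvectors[:, :n_components]
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return X_centered @ components, explained_ratio

rng = np.random.default_rng(42)
# Synthetic data: 3 features, the third is nearly a copy of the first
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0], base[:, 1],
                     base[:, 0] + 0.05 * rng.normal(size=200)])

X_reduced, ratio = pca_eig(X, n_components=2)
print(X_reduced.shape, ratio.round(3))
```

Because the third feature is almost redundant, two components retain nearly all of the variance of the three original features.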

Difference from Feature Selection
- Feature Selection: Selects a subset of the original features based on criteria like relevance or importance. The kept features retain their original meaning.
- PCA: Creates new features (principal components) that are linear combinations of the original features, aiming to capture maximum variance.
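
The contrast is easy to see side by side. In this sketch the data is synthetic and the selection criterion (highest variance) is just one assumed example of a feature selection rule; PCA is computed through an SVD of the centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 4 features with deliberately unequal spreads
X = rng.normal(size=(100, 4)) * np.array([5.0, 1.0, 3.0, 0.5])

# Feature selection: keep the 2 original columns with the highest variance
variances = X.var(axis=0)
selected_idx = np.sort(np.argsort(variances)[-2:])
X_selected = X[:, selected_idx]

# PCA: build 2 new columns, each a weighted sum of all 4 centered originals
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
weights = Vt[:2]                  # shape (2, 4): one row of weights per component
X_pca = X_centered @ weights.T

print("selected columns:", selected_idx)
print("PCA weights per component:\n", weights.round(2))
```

`X_selected` contains untouched columns of `X`, while every column of `X_pca` mixes all four inputs according to the printed weights.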

Use Cases for PCA
- Dimensionality Reduction: PCA reduces dimensions while retaining most of the variance in the data.
- Noise Reduction: Dropping low-variance components can remove noise, provided the signal lies mostly in the high-variance directions.

# **Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?**


# **Answer 4:**
# Eigenvalues and Eigenvectors in PCA
In Principal Component Analysis (PCA), eigenvalues and eigenvectors are key concepts derived from the covariance matrix of the data.

Importance of Eigenvalues and Eigenvectors
- Eigenvectors (Principal Components): Eigenvectors of the covariance matrix are mutually orthogonal directions in feature space. The first points along the direction of maximum variance, and each later one captures the most remaining variance. These are the principal components onto which the data is projected for dimensionality reduction.
- Eigenvalues: Each eigenvalue is the variance of the data along its corresponding eigenvector (principal component). Larger eigenvalues mean more variance is captured by that component.

Why They Are Important
- Dimensionality Reduction: By sorting eigenvalues in descending order, you can choose the top principal components that capture most of the data's variance, reducing dimensions effectively.
- Explained Variance: Dividing each eigenvalue by the sum of all eigenvalues gives the fraction of the data's variability explained by that principal component.
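
A small worked example makes the definitions concrete. The 2x2 matrix below is an illustrative stand-in for a covariance matrix, not taken from any dataset in this notebook.

```python
import numpy as np

# A small symmetric matrix standing in for a covariance matrix (illustrative values)
cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(cov)   # returned in ascending order
# Sort descending so the first column is the first principal component
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

for lam, v in zip(eigenvalues, eigenvectors.T):
    # Defining property of an eigenpair: cov @ v equals lam * v
    assert np.allclose(cov @ v, lam * v)

explained = eigenvalues / eigenvalues.sum()
print(eigenvalues, explained)
```

The eigenvalues are 3 and 1, so the first component explains 75% of the total variance and the second the remaining 25%.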

# **Question 5: How do KNN and PCA complement each other when applied in a single pipeline?**

# **Answer 5:**
# KNN and PCA in a Single Pipeline
KNN (K-Nearest Neighbors) and PCA (Principal Component Analysis) can complement each other in a machine learning pipeline.

How They Complement Each Other
- PCA for Dimensionality Reduction: Applying PCA before KNN reduces the dimensionality of the data, addressing the curse of dimensionality that affects KNN in high dimensions.
- KNN for Classification/Regression Post-PCA: After PCA reduces dimensions, KNN can perform classification or regression in the lower-dimensional space more effectively.

Benefits of Combining PCA and KNN
- Improved KNN Performance: PCA helps by reducing dimensions, making distance calculations in KNN more meaningful.
- Reduced Computational Cost: Lower dimensions from PCA reduce computational costs for KNN.

Typical Pipeline
1. Standardize the features, then apply PCA to reduce data dimensions.
2. Use KNN for classification or regression in the reduced space.
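
One way to wire these steps together is scikit-learn's `make_pipeline`. The sketch below uses the Wine dataset from the later questions; the choice of 2 components and K=5 is an assumption for illustration, not a tuned setting.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Each step is fit on the training split only; the test split is only transformed
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Wrapping the steps in one object means the scaler and PCA never see the test rows during fitting, and the same object can be passed to cross-validation utilities.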

# **Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.**

In [None]:
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# ---------------- Without Scaling ----------------
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred)

# ---------------- With Scaling ----------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_with_scaling = accuracy_score(y_test, y_pred_scaled)

# Results
print("Accuracy without scaling:", accuracy_no_scaling)
print("Accuracy with scaling:", accuracy_with_scaling)

Accuracy without scaling: 0.7222222222222222
Accuracy with scaling: 0.9444444444444444
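
A likely reason for the gap is that the Wine features live on very different numeric scales, so unscaled Euclidean distance is dominated by the features with the largest spread. The short check below lists the three features with the largest standard deviation; it is an added diagnostic, not part of the original comparison.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
stds = wine.data.std(axis=0)

# Features sorted by standard deviation, largest first
order = np.argsort(stds)[::-1]
for i in order[:3]:
    print(f"{wine.feature_names[i]:<12} std = {stds[i]:.2f}")

largest = wine.feature_names[order[0]]
```

After `StandardScaler`, every feature has unit variance, so each one contributes comparably to the distance.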


# **Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.**

In [None]:
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Standardize features before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Print explained variance ratio
print("Explained variance ratio of each principal component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f}")

Explained variance ratio of each principal component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080
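
The ratios printed above can be accumulated to see how many components are needed to reach a given share of the variance. This is a small follow-up sketch; the helper name `components_for` and the thresholds are assumptions chosen for illustration.

```python
import numpy as np

# Explained variance ratios copied from the output above
ratios = np.array([0.3620, 0.1921, 0.1112, 0.0707, 0.0656, 0.0494, 0.0424,
                   0.0268, 0.0222, 0.0193, 0.0174, 0.0130, 0.0080])
cumulative = np.cumsum(ratios)

def components_for(threshold):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)

for t in (0.80, 0.90, 0.95):
    print(f"{t:.0%} variance -> {components_for(t)} components")
```

On these numbers, 5 components reach 80%, 8 reach 90%, and 10 reach 95%. The first two components together account for about 55% of the variance.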


# **Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.**

In [None]:
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

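# Standardize features
# Note: the scaler is fit on all rows before the split, so test rows
# contribute to the mean and variance used for scaling.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)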

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

# ---------------- Original Dataset ----------------
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy_original = accuracy_score(y_test, y_pred)

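# ---------------- PCA with 2 Components ----------------
# Note: PCA is fit on all rows (train and test) before the split,
# so the test accuracy below may be slightly optimistic.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)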
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(
    X_pca, y, test_size=0.3, random_state=42, stratify=y
)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train_pca)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test_pca, y_pred_pca)

# Results
print("KNN Accuracy on Original Dataset:", accuracy_original)
print("KNN Accuracy on PCA (2 components):", accuracy_pca)

KNN Accuracy on Original Dataset: 0.9444444444444444
KNN Accuracy on PCA (2 components): 0.9629629629629629
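
With a 30% test split of 178 samples, the test set has 54 rows, so the gap between 0.944 and 0.963 corresponds to a single test sample. A cross-validated comparison gives a steadier picture. The sketch below is one possible setup (5 stratified folds is an assumption), with the scaler and PCA placed inside a pipeline so they are refit on each training fold.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

pipelines = {
    "all 13 features": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "PCA (2 components)": make_pipeline(
        StandardScaler(), PCA(n_components=2), KNeighborsClassifier(n_neighbors=5)
    ),
}

cv_scores = {}
for name, pipe in pipelines.items():
    # Scaler and PCA are refit inside every training fold, so no test rows leak in
    cv_scores[name] = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean={cv_scores[name].mean():.3f} std={cv_scores[name].std():.3f}")
```

Comparing the fold-to-fold standard deviation with the difference in means indicates whether the two variants are meaningfully different on this dataset.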


# **Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.**

In [None]:
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

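# Scale features
# Note: the scaler is fit on all rows before the split, so test rows
# contribute to the mean and variance used for scaling.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)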

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

# ---------------- Euclidean Distance ----------------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# ---------------- Manhattan Distance ----------------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Results
print("KNN Accuracy with Euclidean distance:", accuracy_euclidean)
print("KNN Accuracy with Manhattan distance:", accuracy_manhattan)

KNN Accuracy with Euclidean distance: 0.9444444444444444
KNN Accuracy with Manhattan distance: 0.9814814814814815
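
On a 54-sample test set, 0.944 versus 0.981 is a difference of two predictions, so the ranking of the two metrics may not hold on other splits. For reference, both metrics are special cases of the Minkowski distance (p=1 for Manhattan, p=2 for Euclidean). The helper `minkowski` below is a hand-written illustration with made-up vectors, not the scikit-learn internals.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

manhattan = minkowski(a, b, p=1)   # |3| + |4| + |0| = 7
euclidean = minkowski(a, b, p=2)   # sqrt(9 + 16 + 0) = 5
print(manhattan, euclidean)
```

Manhattan distance sums the per-feature differences, while Euclidean distance squares them first, so a single large difference weighs more heavily under the Euclidean metric.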


# **Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.**
# **Due to the large number of features and a small number of samples, traditional models overfit.**

# **Explain how you would:**
# **● Use PCA to reduce dimensionality**

# **● Decide how many components to keep**
# **● Use KNN for classification post-dimensionality reduction**
# **● Evaluate the model**
# **● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data**


# **Answer 10:**

# Approach for Classifying Cancer Types Using PCA and KNN
● Use PCA to Reduce Dimensionality
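- Standardize the gene expression features, then apply PCA to reduce dimensions. PCA transforms the data into principal components capturing maximum variance. Fit the scaler and PCA on the training data only, so no information from held-out samples leaks into the transform.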

● Decide How Many Components to Keep
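- Use a scree plot or choose the components explaining a cumulative variance threshold (e.g., 90-95%). With few samples, the number of usable components is capped at roughly the number of training samples, so the count can also be tuned by cross-validation. This balances dimensionality reduction with retaining sufficient information.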

● Use KNN for Classification Post-Dimensionality Reduction
- After reducing dimensions with PCA, apply KNN to classify cancer types in the lower-dimensional space, where distances are more informative and cheaper to compute. Choose K by cross-validation.

● Evaluate the Model
- Use stratified cross-validation to assess model performance. Report metrics such as accuracy, macro F1-score, or AUC-ROC; the latter two are more informative than accuracy when cancer types are imbalanced.

● Justify Pipeline to Stakeholders
- Handling High Dimensionality: PCA reduces dimensions, addressing the overfitting that traditional models show with many features and few samples.
- Robust for Biomedical Data: PCA + KNN suits high-dimensional genomic data where feature selection is challenging. Cross-validation gives an estimate of how well the model generalizes to new patients.
- Interpretability and Performance: This pipeline balances complexity reduction with retaining enough variance for accurate cancer type classification. KNN predictions can be explained by pointing to the most similar known patients.

Summary
Using PCA for dimensionality reduction followed by KNN classification is a robust approach for high-dimensional gene expression data with few samples. Cross-validation, with all preprocessing fit inside each fold, keeps the performance estimate reliable.
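
The whole approach can be sketched end to end. No gene expression data is part of this notebook, so the example below uses `make_classification` as a synthetic stand-in (120 samples, 2000 features, 3 classes); the dataset shape, the parameter grid, and the macro F1 scoring are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: many features, few samples
X, y = make_classification(
    n_samples=120, n_features=2000, n_informative=30, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(random_state=0)),
    ("knn", KNeighborsClassifier()),
])

# Component counts must stay below the training-fold size (96 of 120 rows here)
param_grid = {
    "pca__n_components": [5, 10, 20, 40],
    "knn__n_neighbors": [3, 5, 7],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1_macro")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated macro F1: {search.best_score_:.3f}")
```

Because the best score is also the one used to pick the parameters, it is slightly optimistic. A held-out test set or nested cross-validation would give a cleaner final estimate to report to stakeholders.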