# 📘 KNN & PCA Assignment



## 🧠 Theoretical Questions

**Q1. What is K-Nearest Neighbors (KNN) and how does it work?**

KNN is a supervised learning algorithm used for classification and regression. It predicts the output for a data point by identifying the 'k' closest data points (neighbors) from the training set and taking a majority vote (for classification) or averaging (for regression). It uses a distance metric (like Euclidean) to find neighbors.
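The description above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it; the helper `knn_predict` and the toy arrays are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point from its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: mean of the neighbors' targets
    return float(neighbor_targets.mean())

# Hypothetical toy data: two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [6.0, 6.0], [7.0, 7.5], [6.5, 6.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 1.2, 0.8, 5.0, 5.5, 5.2])

query = np.array([1.8, 1.4])
label = knn_predict(X_toy, y_class, query, k=3)
value = knn_predict(X_toy, y_value, query, k=3, task="regression")
print("Predicted class:", label)
print("Predicted value:", value)
```

The same neighbor search serves both tasks; only the final aggregation step (vote vs. mean) differs.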

**Q2. What is the difference between KNN Classification and KNN Regression?**

KNN Classification predicts a discrete class label using the majority class among neighbors, while KNN Regression predicts a continuous value by averaging the outputs of neighbors.

**Q3. What is the role of the distance metric in KNN?**

The distance metric (like Euclidean, Manhattan) determines how closeness between points is calculated, affecting which data points are considered neighbors.
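A small sketch of why the metric matters: with the hypothetical points below, Euclidean and Manhattan distance disagree about which point is closer to the query.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line (L2) distance
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    # Sum of absolute coordinate differences (L1)
    return float(np.sum(np.abs(a - b)))

query = np.array([0.0, 0.0])
p = np.array([3.0, 3.0])  # moderately far along both axes
q = np.array([0.0, 5.0])  # far along one axis only

for name, dist in [("euclidean", euclidean), ("manhattan", manhattan)]:
    d_p, d_q = dist(query, p), dist(query, q)
    closer = "p" if d_p < d_q else "q"
    print(f"{name}: d(query, p)={d_p:.3f}, d(query, q)={d_q:.3f}, closer point: {closer}")
```

Because the ranking of neighbors changes with the metric, the predicted label can change too.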

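**Q4. What is the Curse of Dimensionality in KNN?**

As the number of features (dimensions) grows, the data become sparse and pairwise distances concentrate: the nearest and farthest points end up almost equally far from a query. The "nearest" neighbors are then barely closer than any other point, making it harder for KNN to find relevant neighbors.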
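The effect can be observed on random data. The sketch below assumes uniformly distributed points in the unit hypercube and measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)
contrast = {}
for d in [2, 10, 100, 1000]:
    points = rng.random((500, d))  # 500 uniform points in the unit hypercube
    query = rng.random(d)
    dist = np.linalg.norm(points - query, axis=1)
    # Relative contrast: gap between farthest and nearest, scaled by the nearest
    contrast[d] = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:4d}  relative contrast = {contrast[d]:.2f}")
```

The contrast shrinks as `d` grows, which is the sense in which distance becomes less informative.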

**Q5. How can we choose the best value of K in KNN?**

The best value of K can be found using cross-validation: try a range of K values and choose the one with the highest mean validation accuracy.
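A minimal sketch of that search on Iris, assuming 5-fold cross-validation and odd K values from 1 to 21 (both choices are illustrative). Scaling sits inside the pipeline so it is refit on each training fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

cv_scores = {}
for k in range(1, 22, 2):  # odd values of K
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean accuracy over 5 folds
    cv_scores[k] = cross_val_score(pipe, X_iris, y_iris, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("Best K:", best_k, "CV accuracy:", round(cv_scores[best_k], 3))
```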

**Q6. What are KD Tree and Ball Tree in KNN?**

KD Tree and Ball Tree are data structures that speed up the search for nearest neighbors, especially useful for large datasets.

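**Q7. When should you use KD Tree vs. Ball Tree?**

KD Tree is efficient for low-dimensional data (roughly fewer than 20 features). Ball Tree is slower to build but usually copes better as dimensionality grows; in very high dimensions both can degrade toward brute-force search.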
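Both structures are available directly in `sklearn.neighbors`. The sketch below builds each on random 3D points (an assumption made for illustration) and checks that they return the same neighbor distances, since both perform exact search and differ only in speed.

```python
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.default_rng(0)
points = rng.random((1000, 3))  # low-dimensional data, where KD Tree does well
queries = points[:5]

kd = KDTree(points, leaf_size=30)
ball = BallTree(points, leaf_size=30)

# Each query returns (distances, indices) of the 3 nearest points
kd_dist, kd_idx = kd.query(queries, k=3)
ball_dist, ball_idx = ball.query(queries, k=3)

print("Same neighbor distances from both trees:", np.allclose(kd_dist, ball_dist))
```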

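**Q8. What are the disadvantages of KNN?**

- Slow prediction on large datasets, since every query is compared against the stored training points
- Sensitive to irrelevant features
- Sensitive to feature scales, so features usually need scaling first
- High memory usage, since the whole training set is stored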

**Q9. How does feature scaling affect KNN?**

KNN relies on distance, so features with larger ranges can dominate unless features are scaled (standardized or normalized).

**Q10. What is PCA (Principal Component Analysis)?**

PCA is a dimensionality reduction technique that transforms the features into a new set of orthogonal (uncorrelated) components, ordered so that the first few capture most of the variance in the data.

**Q11. How does PCA work?**

PCA finds the directions (principal components) that maximize the variance in data and projects the data onto these directions.
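One way to make this concrete is to compute PCA by hand from the covariance matrix and compare it with scikit-learn. This is a sketch on Iris; scikit-learn itself uses an SVD-based solver, so component signs may differ even though the directions agree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris, _ = load_iris(return_X_y=True)
X_centered = X_iris - X_iris.mean(axis=0)  # PCA works on mean-centered data

# Eigen-decomposition of the covariance matrix
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the two directions of greatest variance
X_projected = X_centered @ eigvecs[:, :2]
manual_ratio = eigvals / eigvals.sum()

sk_pca = PCA(n_components=2).fit(X_iris)
print("Manual explained variance ratio: ", manual_ratio[:2])
print("sklearn explained variance ratio:", sk_pca.explained_variance_ratio_)
```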

**Q12. What is the geometric intuition behind PCA?**

PCA rotates the coordinate system to align axes with the directions of greatest data variance, reducing dimensionality while preserving as much information as possible.

**Q13. What is the difference between Feature Selection and Feature Extraction?**

- Feature Selection chooses a subset of existing features.
- Feature Extraction creates new features from the original ones (e.g., PCA).

**Q14. What are Eigenvalues and Eigenvectors in PCA?**

The eigenvectors of the data's covariance matrix give the directions of the principal components; the corresponding eigenvalues give the variance captured along each direction.

**Q15. How do you decide the number of components to keep in PCA?**

Keep the smallest number of components whose cumulative explained variance ratio crosses a chosen threshold (e.g., 90%), or look for the elbow in a scree plot.
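The threshold rule can be sketched as follows, assuming the Wine dataset and a 90% target (both illustrative). Passing a float between 0 and 1 as `n_components` asks scikit-learn to pick the count itself, which should agree with the manual calculation.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_wine, _ = load_wine(return_X_y=True)
X_wine_scaled = StandardScaler().fit_transform(X_wine)

# Manual rule: first component count whose cumulative ratio reaches 90%
full_pca = PCA().fit(X_wine_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
n_keep = int(np.argmax(cumulative >= 0.90)) + 1

# Same rule delegated to scikit-learn via a fractional n_components
auto_pca = PCA(n_components=0.90).fit(X_wine_scaled)

print("Components needed (manual): ", n_keep)
print("Components chosen (sklearn):", auto_pca.n_components_)
```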

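**Q16. Can PCA be used for classification?**

Not directly: PCA is an unsupervised transformation, not a classifier. It is commonly used as a preprocessing step that reduces dimensionality before a classifier is trained, which improves efficiency and sometimes accuracy.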

**Q17. What are the limitations of PCA?**

- Assumes linearity
- Reduces interpretability
- Sensitive to feature scaling
- May discard useful information

**Q18. How do KNN and PCA complement each other?**

PCA reduces dimensionality and can filter out noise by discarding low-variance directions. This makes KNN faster and sometimes more accurate, because distances are computed in a smaller, less noisy space.

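**Q19. How does KNN handle missing values in a dataset?**

Standard KNN cannot compute distances when values are missing, so the gaps must be filled first. KNN itself can do the filling: each missing value is imputed from the nearest neighbors' values for that feature, using the mean for numeric features or the majority value for categorical ones.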

**Q20. What are the key differences between PCA and Linear Discriminant Analysis (LDA)?**

- PCA maximizes variance without considering class labels.
- LDA maximizes class separation using label information.
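The difference shows up directly in the API: PCA is fit on the features alone, while LDA also needs the labels. A minimal sketch on Iris, projecting to two dimensions with each:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_iris, y_iris = load_iris(return_X_y=True)

# PCA: unsupervised, only sees X
X_pca_2d = PCA(n_components=2).fit_transform(X_iris)

# LDA: supervised, uses y to find directions that separate the classes.
# It can produce at most (n_classes - 1) components, i.e. 2 for Iris.
X_lda_2d = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_iris, y_iris)

print("PCA projection shape:", X_pca_2d.shape)
print("LDA projection shape:", X_lda_2d.shape)
```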

## 💻 Practical Questions

**Q21. Train a KNN Classifier on the Iris dataset and print model accuracy**

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

**Q22. Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE)**

In [None]:
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = KNeighborsRegressor()
model.fit(X_train, y_train)
print("MSE:", mean_squared_error(y_test, model.predict(X_test)))

**Q23. Train a KNN Classifier using different distance metrics (Euclidean and Manhattan) and compare accuracy**

In [None]:
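# Reload Iris so the classifier is not fit on the regression data from Q22
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
for metric in ['euclidean', 'manhattan']:
    model = KNeighborsClassifier(metric=metric)
    model.fit(X_train, y_train)
    print(f"{metric.title()} Accuracy:", model.score(X_test, y_test))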

**Q24. Train a KNN Classifier with different values of K and visualize decision boundaries**

In [None]:
# This task is visual and would typically involve using matplotlib
print("Use matplotlib and meshgrid to visualize decision boundaries for different k values.")
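The placeholder above can be filled in along these lines. This is a sketch that assumes the first two Iris features (so the boundary can be drawn in 2D) and three illustrative K values; the grid resolution is arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_2d = X_iris[:, :2]  # keep two features so the plane can be plotted

# Grid of points covering the feature range, with a small margin
xx, yy = np.meshgrid(
    np.linspace(X_2d[:, 0].min() - 0.5, X_2d[:, 0].max() + 0.5, 200),
    np.linspace(X_2d[:, 1].min() - 0.5, X_2d[:, 1].max() + 0.5, 200),
)
grid_points = np.c_[xx.ravel(), yy.ravel()]

k_values = [1, 5, 15]
fig, axes = plt.subplots(1, len(k_values), figsize=(15, 4))
for ax, k in zip(axes, k_values):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_2d, y_iris)
    # Predict a class for every grid point and shade the regions
    Z = clf.predict(grid_points).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y_iris, edgecolor="k", s=20)
    ax.set_title(f"k = {k}")
plt.show()
```

Small K tends to give jagged regions that follow individual points, while larger K smooths the boundary.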

**Q25. Apply Feature Scaling before training a KNN model and compare results with unscaled data**

In [None]:
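from sklearn.preprocessing import StandardScaler

# Reload Iris so X, y are classification data, and fix the split for a fair comparison
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Fit the scaler on the training split only to avoid leaking test statistics
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)
X_scaled = scaler.transform(X)  # full scaled matrix, reused by the PCA cells below
for name, tr, te in [("Unscaled", X_train, X_test), ("Scaled", X_train_s, X_test_s)]:
    model = KNeighborsClassifier().fit(tr, y_train)
    print(f"{name} Accuracy:", model.score(te, y_test))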

**Q26. Train a PCA model on synthetic data and print the explained variance ratio for each component**

In [None]:
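from sklearn.decomposition import PCA
from sklearn.datasets import make_classification

# Synthetic data with 10 features, so there are several components to inspect
X_syn, _ = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=42)
pca = PCA()
X_syn_pca = pca.fit_transform(X_syn)
print("Explained Variance Ratio:", pca.explained_variance_ratio_)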

**Q27. Apply PCA before training a KNN Classifier and compare accuracy with and without PCA**

In [None]:
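# X_scaled and y come from the scaling cell in Q25
X_tr, X_te, y_train, y_test = train_test_split(X_scaled, y, random_state=42)
baseline = KNeighborsClassifier().fit(X_tr, y_train)
print("KNN Accuracy (no PCA):", baseline.score(X_te, y_test))
pca_2d = PCA(n_components=2).fit(X_tr)  # fit PCA on the training split only
X_train, X_test = pca_2d.transform(X_tr), pca_2d.transform(X_te)
model = KNeighborsClassifier().fit(X_train, y_train)
print("PCA+KNN Accuracy:", model.score(X_test, y_test))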

**Q28. Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV**

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': [3, 5, 7]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid)
grid.fit(X_train, y_train)
print("Best K:", grid.best_params_)

**Q29. Train a KNN Classifier and check the number of misclassified samples**

In [None]:
y_pred = model.predict(X_test)
misclassified = (y_pred != y_test).sum()
print("Misclassified samples:", misclassified)

**Q30. Train a PCA model and visualize the cumulative explained variance**

In [None]:
import numpy as np
import matplotlib.pyplot as plt

cumsum = np.cumsum(pca.explained_variance_ratio_)
# Start the x-axis at 1 so it matches the "Number of Components" label
plt.plot(range(1, len(cumsum) + 1), cumsum)
plt.title("Cumulative Explained Variance")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Variance")
plt.grid(True)
plt.show()

**Q31. Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare accuracy**

In [None]:
for w in ['uniform', 'distance']:
    model = KNeighborsClassifier(weights=w)
    model.fit(X_train, y_train)
    print(f"Weight={w}, Accuracy:", model.score(X_test, y_test))

**Q32. Train a KNN Regressor and analyze the effect of different K values on performance**

In [None]:
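# Regenerate regression data; X_train/y_train currently hold classification data
X_reg, y_reg = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, random_state=42)
for k in [1, 3, 5, 10]:
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(Xr_train, yr_train)
    print(f"K={k}, MSE:", mean_squared_error(yr_test, model.predict(Xr_test)))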

**Q33. Implement KNN Imputation for handling missing values in a dataset**

In [None]:
from sklearn.impute import KNNImputer
import numpy as np

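rng = np.random.default_rng(0)
X_missing = X.copy()
# Blank out about 10% of individual entries rather than whole rows,
# so each affected sample still has observed features to find neighbors with
mask = rng.random(X_missing.shape) < 0.1
X_missing[mask] = np.nan
imputer = KNNImputer(n_neighbors=5)
X_filled = imputer.fit_transform(X_missing)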

**Q34. Train a PCA model and visualize the data projection onto the first two principal components**

In [None]:
X_pca = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.title("PCA Projection")
plt.show()

**Q35. Train a KNN Classifier using the KD Tree and Ball Tree algorithms and compare performance**

In [None]:
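import time
for algorithm in ['kd_tree', 'ball_tree']:
    model = KNeighborsClassifier(algorithm=algorithm)
    start = time.perf_counter()  # both are exact searches, so compare speed as well
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    elapsed = time.perf_counter() - start
    print(f"{algorithm} Accuracy: {acc:.3f}, fit+score time: {elapsed:.4f}s")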

**Q36. Train a PCA model on a high-dimensional dataset and visualize the Scree plot**

In [None]:
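from sklearn.datasets import load_digits
X_digits, _ = load_digits(return_X_y=True)  # 64 pixel features per image
pca_digits = PCA().fit(X_digits)
components = range(1, len(pca_digits.explained_variance_) + 1)
plt.plot(components, pca_digits.explained_variance_, marker='o')
plt.title("Scree Plot")
plt.xlabel("Component")
plt.ylabel("Variance (eigenvalue)")
plt.show()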

**Q37. Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score**

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

y_pred = model.predict(X_test)
print("Precision:", precision_score(y_test, y_pred, average='macro'))
print("Recall:", recall_score(y_test, y_pred, average='macro'))
print("F1 Score:", f1_score(y_test, y_pred, average='macro'))

**Q38. Train a PCA model and analyze the effect of different numbers of components on accuracy**

In [None]:
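for n in [1, 2, 3]:
    X_pca = PCA(n_components=n).fit_transform(X_scaled)
    # Fixed random_state so every n is scored on the same train/test split
    X_train, X_test, y_train, y_test = train_test_split(X_pca, y, random_state=42)
    model = KNeighborsClassifier().fit(X_train, y_train)
    print(f"n={n}, Accuracy:", model.score(X_test, y_test))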

**Q39. Train a KNN Classifier with different leaf_size values and compare accuracy**

In [None]:
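# leaf_size only affects tree-based search (speed and memory), so accuracy should not change
for leaf_size in [10, 30, 50]:
    model = KNeighborsClassifier(algorithm='kd_tree', leaf_size=leaf_size)
    model.fit(X_train, y_train)
    print(f"Leaf Size={leaf_size}, Accuracy:", model.score(X_test, y_test))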

**Q40. Train a PCA model and visualize how data points are transformed before and after PCA**

In [None]:
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title("Original Data")
plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.title("After PCA")
plt.show()

**Q41. Train a KNN Classifier on a real-world dataset (Wine dataset) and print classification report**

In [None]:
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

**Q42. Train a KNN Regressor and analyze the effect of different distance metrics on prediction error**

In [None]:
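from sklearn.metrics import mean_absolute_error
# Use regression data; X_train/y_train currently hold the Wine class labels
X_reg, y_reg = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, random_state=42)
for metric in ['euclidean', 'manhattan']:
    model = KNeighborsRegressor(metric=metric)
    model.fit(Xr_train, yr_train)
    y_pred = model.predict(Xr_test)
    print(f"{metric.title()} MAE:", mean_absolute_error(yr_test, y_pred))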

**Q43. Train a KNN Classifier and evaluate using ROC-AUC score**

In [None]:
from sklearn.metrics import roc_auc_score

# Keep the integer class labels: roc_auc_score handles multiclass one-vs-rest itself.
# Fitting on binarized labels would make predict_proba return one array per class.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
y_score = model.predict_proba(X_test)  # shape (n_samples, n_classes)
print("ROC-AUC Score:", roc_auc_score(y_test, y_score, multi_class='ovr'))

**Q44. Train a PCA model and visualize the variance captured by each principal component**

In [None]:
pca = PCA()
pca.fit(X_scaled)
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, marker='o')
plt.title("Variance per Principal Component")
plt.xlabel("Component")
plt.ylabel("Variance Ratio")
plt.grid(True)
plt.show()

**Q45. Train a KNN Classifier and perform feature selection before training**

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_new, y)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
print("Accuracy with Feature Selection:", model.score(X_test, y_test))

**Q46. Train a PCA model and visualize the data reconstruction error after reducing dimensions**

In [None]:
# Fit once and reuse the same PCA object for projection and reconstruction
pca_2d = PCA(n_components=2)
X_pca = pca_2d.fit_transform(X_scaled)
X_reconstructed = pca_2d.inverse_transform(X_pca)
reconstruction_error = ((X_scaled - X_reconstructed) ** 2).mean()
print("Reconstruction Error:", reconstruction_error)
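The cell above prints a single number. To actually visualize the error, one option is to plot it against the number of components kept. The sketch below assumes standardized Iris data and mean squared error as the error measure.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris(return_X_y=True)[0])

component_counts = list(range(1, X_std.shape[1] + 1))
errors = []
for n in component_counts:
    p = PCA(n_components=n).fit(X_std)
    # Project down to n components, then map back to the original space
    X_back = p.inverse_transform(p.transform(X_std))
    errors.append(float(np.mean((X_std - X_back) ** 2)))

plt.plot(component_counts, errors, marker="o")
plt.title("Reconstruction Error vs Components")
plt.xlabel("Number of Components")
plt.ylabel("Mean Squared Reconstruction Error")
plt.grid(True)
plt.show()
```

The error falls as components are added and reaches roughly zero once all components are kept.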

**Q47. Train a KNN Classifier and visualize the decision boundary**

In [None]:
# Decision boundaries need matplotlib for 2D data
print("Use matplotlib to visualize decision boundary for 2D KNN classification")

**Q48. Train a PCA model and analyze the effect of different numbers of components on data variance**

In [None]:
# Fit once: the cumulative sum gives the variance retained by the first n components
pca = PCA().fit(X)
explained = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(explained) + 1), explained, marker='o')
plt.title("Variance vs Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.grid(True)
plt.show()