# Theoretical

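1. What is K-Nearest Neighbors (KNN) and how does it work?
- KNN is a supervised learning algorithm used for both classification and regression. It predicts the output for a new data point from the 'K' closest points in the training set, where closeness is measured by a distance metric such as Euclidean distance. It is a lazy learner: it fits no explicit model, stores the training data, and defers all computation to prediction time.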
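
The idea can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation; the function name `knn_predict` and the toy data are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([5.1, 5.0]), k=3)
print(pred_a, pred_b)
```

Each query is labeled by the cluster it sits in, because all three of its nearest neighbors come from that cluster.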

2. What is the difference between KNN Classification and KNN Regression?
- KNN Classification: Assigns the most common class among the K nearest neighbors (majority vote).

- KNN Regression: Predicts the average (or distance-weighted average) of the target values of the K nearest neighbors.
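
A small sketch of the difference, assuming a one-feature toy dataset invented for illustration: the neighbor search is identical, and only the aggregation step changes.

```python
import numpy as np

# Hypothetical one-feature training set with both a class label and a numeric target
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_class = np.array([0, 0, 1, 1, 1])
y_value = np.array([1.5, 2.5, 3.5, 10.5, 11.5])

def k_nearest_indices(X, x_query, k):
    """Return the indices of the k training points closest to x_query."""
    distances = np.abs(X[:, 0] - x_query)
    return np.argsort(distances)[:k]

idx = k_nearest_indices(X_train, 2.2, k=3)

# Classification: majority vote over the neighbors' labels
class_pred = int(np.bincount(y_class[idx]).argmax())
# Regression: mean of the neighbors' target values
value_pred = float(y_value[idx].mean())
print(class_pred, value_pred)
```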

3. What is the role of the distance metric in KNN?
- The distance metric (e.g., Euclidean, Manhattan) determines how “closeness” is measured. Because it decides which training points count as neighbors, a different metric can select different neighbors and change the prediction.
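
As a rough illustration, the sketch below computes both metrics by hand and shows a case where they disagree about which point is nearest. The points are made up for the example.

```python
import numpy as np

def euclidean(a, b):
    """Straight-line (L2) distance."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return float(np.sum(np.abs(a - b)))

query = np.array([0.0, 0.0])
points = np.array([[3.0, 3.0],   # close on the diagonal
                   [0.0, 4.5]])  # close along one axis

nearest_euclidean = min(range(len(points)), key=lambda i: euclidean(query, points[i]))
nearest_manhattan = min(range(len(points)), key=lambda i: manhattan(query, points[i]))
print(nearest_euclidean, nearest_manhattan)
```

Under Euclidean distance the diagonal point is nearer; under Manhattan distance the axis-aligned point is.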

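4. What is the Curse of Dimensionality in KNN?
- As the number of dimensions increases, data becomes sparse and the distances to the nearest and farthest points become nearly equal. The "nearest" neighbors are then barely closer than any other point, so distance measures become less meaningful and KNN performance degrades.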
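
One way to see this effect is to compare the nearest and farthest distances from a query point as the dimension grows. The sketch below uses uniformly random points, which is an assumption made purely for illustration; the helper name is hypothetical.

```python
import numpy as np

def nearest_to_farthest_ratio(n_dims, n_points=500, seed=0):
    """Ratio of the smallest to the largest distance from a random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return d.min() / d.max()

ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for n_dims, ratio in ratios.items():
    print(f"dims={n_dims:5d}  nearest/farthest={ratio:.3f}")
```

A ratio near 1 means the nearest neighbor is almost as far away as the farthest point.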

5. How can we choose the best value of K in KNN?
- Use cross-validation to evaluate a range of K values and pick the one that minimizes error on the validation folds. A very small K tends to overfit to noise, while a very large K oversmooths the decision boundary.
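
A minimal sketch of that search using scikit-learn's `cross_val_score` on the Iris dataset. The candidate range of odd K values and the 5-fold setting are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for each candidate K
cv_scores = {}
for k in range(1, 16, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("Best K:", best_k, "CV accuracy:", round(cv_scores[best_k], 3))
```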

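6. What are KD Tree and Ball Tree in KNN?
- These are tree-based data structures that speed up nearest neighbor searches by avoiding a comparison with every training point:

- KD Tree: Partitions space with axis-aligned splits; efficient for low-dimensional data.

- Ball Tree: Partitions data into nested hyperspheres; often holds up better than a KD Tree on higher-dimensional or unevenly distributed data.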

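7. When should you use KD Tree vs. Ball Tree?
- Use KD Tree when the number of dimensions is small (as a rule of thumb, fewer than about 20).

- Use Ball Tree for higher-dimensional or non-uniform data. In very high dimensions, both can end up no faster than brute-force search.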
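
Both structures are available directly as `KDTree` and `BallTree` in `sklearn.neighbors`. The sketch below, on random data generated for illustration, shows that both return the same exact neighbor distances; the practical difference is build and query speed, not results.

```python
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.default_rng(0)
X = rng.random((200, 3))        # 200 random points in 3 dimensions
queries = rng.random((5, 3))    # 5 query points

kd = KDTree(X, leaf_size=30)
ball = BallTree(X, leaf_size=30)

# Each query returns (distances, indices) of the 3 nearest neighbors
dist_kd, ind_kd = kd.query(queries, k=3)
dist_ball, ind_ball = ball.query(queries, k=3)

print(np.allclose(dist_kd, dist_ball))
```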

8. What are the disadvantages of KNN?
- Slow at prediction time, because each query is compared against the stored training data

- Sensitive to outliers and irrelevant features

- Requires feature scaling

- Performs poorly on high-dimensional data

9. How does feature scaling affect KNN?
- Since KNN is distance-based, features with large numeric ranges can dominate the distance computation. Standardization or normalization is usually needed whenever features are measured in different units or ranges.
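
A small sketch of the effect, using hypothetical age and income values: before scaling, income differences swamp age differences, so the nearest neighbor changes once both features are standardized.

```python
import numpy as np

# Hypothetical data: columns are [age in years, income in dollars]
X = np.array([[25.0, 50000.0],
              [60.0, 50500.0],
              [40.0, 80000.0],
              [35.0, 20000.0]])
query = np.array([26.0, 51000.0])

def nearest_index(points, q):
    """Index of the point with the smallest Euclidean distance to q."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))

# Raw features: income dominates the distance
nearest_raw = nearest_index(X, query)

# Standardized features: zero mean and unit variance per column
mean, std = X.mean(axis=0), X.std(axis=0)
nearest_scaled = nearest_index((X - mean) / std, (query - mean) / std)

print(nearest_raw, nearest_scaled)
```

On the raw data the 60-year-old with a slightly closer income is chosen; after scaling, the 25-year-old with a similar age and income is the nearest neighbor.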

10. What is PCA (Principal Component Analysis)?
- PCA is an unsupervised dimensionality reduction technique that transforms data into new, uncorrelated variables (principal components), ordered so that the first components retain as much of the variance as possible.

11. How does PCA work?
- Standardize the data

- Compute the covariance matrix

- Find its eigenvalues and eigenvectors, and sort them by decreasing eigenvalue

- Project the data onto the top components
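
The steps above can be sketched directly in NumPy. This is an illustrative version on synthetic correlated data (generated here as an assumption), not how scikit-learn implements PCA internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two features driven by a shared factor z, plus one independent feature
z = rng.normal(size=(200, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(200, 1)),
               2 * z + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

# 1. Standardize each feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition (eigh suits symmetric matrices), sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top components
n_components = 2
X_proj = X_std @ eigvecs[:, :n_components]

explained_ratio = eigvals / eigvals.sum()
print("Explained variance ratio:", np.round(explained_ratio, 3))
```

The variance of each projected column equals the corresponding eigenvalue, which is why eigenvalues are read as "variance explained".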

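12. What is the geometric intuition behind PCA?
- PCA rotates the coordinate system so that the first new axis (principal component) points along the direction of greatest variance in the data, the second along the direction of greatest remaining variance orthogonal to the first, and so on.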

13. What is the difference between Feature Selection and Feature Extraction?
- Feature Selection: Selects a subset of the original features

- Feature Extraction (e.g., PCA): Creates new features from the original ones

14. What are Eigenvalues and Eigenvectors in PCA?
- Eigenvalues: Indicate the amount of variance explained by each component

- Eigenvectors: Directions of the new axes (principal components), computed from the covariance matrix

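15. How do you decide the number of components to keep in PCA?
- Use the cumulative explained variance ratio or a scree plot. A common rule of thumb is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold such as 95%.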
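
A minimal sketch of the threshold approach with scikit-learn, using standardized Iris data. The 95% cutoff is an assumed value, not a fixed rule. Passing a float between 0 and 1 as `n_components` asks `PCA` to pick the count itself.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
threshold = 0.95  # assumed cutoff for cumulative explained variance

# Option 1: fit all components and find where the cumulative ratio crosses the threshold
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

# Option 2: let scikit-learn choose the number of components
pca_auto = PCA(n_components=threshold).fit(X)

print("Manual:", n_keep, "Automatic:", pca_auto.n_components_)
```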

16. Can PCA be used for classification?
- Indirectly, yes. PCA is not a classifier itself, but it can reduce dimensionality before a classification model such as KNN or SVM is trained.

17. What are the limitations of PCA?
- Assumes linear relationships

- Sensitive to scaling

- Components may lack interpretability

18. How do KNN and PCA complement each other?
- PCA reduces dimensions and noise, which can improve KNN accuracy and speed.

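19. How does KNN handle missing values in a dataset?
- KNN cannot compute ordinary distances for rows with missing entries, so missing values are usually filled in first. One option is KNN imputation, which replaces each missing value with the average of that feature among the nearest neighbors.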

20. What are the key differences between PCA and Linear Discriminant Analysis (LDA)?
- PCA: Unsupervised, focuses on variance

- LDA: Supervised, focuses on maximizing class separation

21. Train a KNN Classifier on the Iris dataset and print model accuracy.

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Load Iris and hold out 30% of the samples for testing
    X_iris, y_iris = load_iris(return_X_y=True)
    X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
        X_iris, y_iris, test_size=0.3, random_state=42)

    # The *_iris names and knn_iris are reused by later examples
    knn_iris = KNeighborsClassifier(n_neighbors=3)
    knn_iris.fit(X_train_iris, y_train_iris)
    print("Accuracy:", accuracy_score(y_test_iris, knn_iris.predict(X_test_iris)))

22. Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE).

    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_squared_error

    # Synthetic one-feature regression problem; the *_syn names are reused by later examples
    X_syn, y_syn = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
    X_train_syn, X_test_syn, y_train_syn, y_test_syn = train_test_split(
        X_syn, y_syn, test_size=0.2, random_state=42)

    reg = KNeighborsRegressor(n_neighbors=3)
    reg.fit(X_train_syn, y_train_syn)
    preds = reg.predict(X_test_syn)
    print("MSE:", mean_squared_error(y_test_syn, preds))

23. Train a KNN Classifier using different distance metrics (Euclidean and Manhattan) and compare accuracy.

    from sklearn.metrics import accuracy_score

    # Euclidean distance
    knn_euclidean = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
    knn_euclidean.fit(X_train_iris, y_train_iris)
    acc_euclidean = accuracy_score(y_test_iris, knn_euclidean.predict(X_test_iris))

    # Manhattan distance
    knn_manhattan = KNeighborsClassifier(n_neighbors=3, metric='manhattan')
    knn_manhattan.fit(X_train_iris, y_train_iris)
    acc_manhattan = accuracy_score(y_test_iris, knn_manhattan.predict(X_test_iris))

    print("Euclidean Accuracy:", acc_euclidean)
    print("Manhattan Accuracy:", acc_manhattan)

24. Train a KNN Classifier with different values of K and visualize decision boundaries.

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap
    from sklearn.datasets import make_classification

    # Generate a 2D classification dataset
    X_2d, y_2d = make_classification(n_samples=100, n_features=2, n_redundant=0, n_informative=2, random_state=42)
    X_train2d, X_test2d, y_train2d, y_test2d = train_test_split(X_2d, y_2d, test_size=0.3, random_state=42)

    # Plot decision boundaries for different K
    for k in [1, 3, 5, 7]:
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(X_train2d, y_train2d)

        # Predict the class at every point of a dense grid covering the data
        h = .02
        x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
        y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        # Shade the predicted regions and overlay the data points
        plt.figure()
        plt.contourf(xx, yy, Z, cmap=ListedColormap(['#FFAAAA', '#AAFFAA']))
        plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_2d, edgecolors='k', cmap=ListedColormap(['#FF0000', '#00FF00']))
        plt.title(f"K = {k}")
        plt.show()

25. Apply Feature Scaling before training a KNN model and compare results with unscaled data.

    from sklearn.preprocessing import StandardScaler

    # Without scaling
    model_unscaled = KNeighborsClassifier(n_neighbors=3)
    model_unscaled.fit(X_train_iris, y_train_iris)
    acc_unscaled = model_unscaled.score(X_test_iris, y_test_iris)

    # With scaling: fit the scaler on the training set only, then apply it to both sets
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_iris)
    X_test_scaled = scaler.transform(X_test_iris)

    model_scaled = KNeighborsClassifier(n_neighbors=3)
    model_scaled.fit(X_train_scaled, y_train_iris)
    acc_scaled = model_scaled.score(X_test_scaled, y_test_iris)

    print("Accuracy without scaling:", acc_unscaled)
    print("Accuracy with scaling:", acc_scaled)

26. Train a PCA model on synthetic data and print the explained variance ratio for each component.

    from sklearn.decomposition import PCA

    pca = PCA()
    pca.fit(X_iris)  # or use synthetic data if preferred
    print("Explained Variance Ratio:")
    print(pca.explained_variance_ratio_)

27. Apply PCA before training a KNN Classifier and compare accuracy with and without PCA.

    # PCA applied: fit on the training set, then transform both sets
    pca = PCA(n_components=2)
    X_train_pca = pca.fit_transform(X_train_iris)
    X_test_pca = pca.transform(X_test_iris)

    knn_pca = KNeighborsClassifier(n_neighbors=3)
    knn_pca.fit(X_train_pca, y_train_iris)
    acc_with_pca = knn_pca.score(X_test_pca, y_test_iris)

    # Original data
    knn_orig = KNeighborsClassifier(n_neighbors=3)
    knn_orig.fit(X_train_iris, y_train_iris)
    acc_without_pca = knn_orig.score(X_test_iris, y_test_iris)

    print("Accuracy without PCA:", acc_without_pca)
    print("Accuracy with PCA:", acc_with_pca)

28. Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV.

    from sklearn.model_selection import GridSearchCV

    # Try K = 1..19 with 5-fold cross-validation
    param_grid = {'n_neighbors': range(1, 20)}
    grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
    grid_search.fit(X_train_iris, y_train_iris)

    print("Best Parameters:", grid_search.best_params_)
    # best_score_ is the mean cross-validated accuracy of the best setting
    print("Best Accuracy:", grid_search.best_score_)

29. Train a KNN Classifier and check the number of misclassified samples.

    y_pred = knn_iris.predict(X_test_iris)
    # Count the test samples whose predicted label differs from the true label
    misclassified = (y_pred != y_test_iris).sum()
    print("Number of misclassified samples:", misclassified)

30. Train a PCA model and visualize the cumulative explained variance.

    pca = PCA().fit(X_iris)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    plt.figure(figsize=(8, 4))
    # Start the x-axis at 1 so it matches the number of components
    plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance')
    plt.title('PCA - Cumulative Explained Variance')
    plt.grid(True)
    plt.show()

31. Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare accuracy.

    # Uniform weights: every neighbor's vote counts equally
    knn_uniform = KNeighborsClassifier(weights='uniform')
    knn_uniform.fit(X_train_iris, y_train_iris)
    acc_uniform = knn_uniform.score(X_test_iris, y_test_iris)

    # Distance-based weights: closer neighbors count more
    knn_distance = KNeighborsClassifier(weights='distance')
    knn_distance.fit(X_train_iris, y_train_iris)
    acc_distance = knn_distance.score(X_test_iris, y_test_iris)

    print("Accuracy with uniform weights:", acc_uniform)
    print("Accuracy with distance weights:", acc_distance)

32. Train a KNN Regressor and analyze the effect of different K values on performance.

    # Refit the regressor for each K and report the test MSE
    for k in [1, 3, 5, 7, 9]:
        reg = KNeighborsRegressor(n_neighbors=k)
        reg.fit(X_train_syn, y_train_syn)
        pred = reg.predict(X_test_syn)
        mse = mean_squared_error(y_test_syn, pred)
        print(f"K={k}, MSE={mse:.2f}")

33. Implement KNN Imputation for handling missing values in a dataset.

    from sklearn.impute import KNNImputer
    import numpy as np

    # Introduce some missing values in iris
    X_iris_missing = X_iris.copy()
    X_iris_missing[0, 0] = np.nan
    X_iris_missing[10, 2] = np.nan

    # Fill each missing value using the 3 nearest rows
    imputer = KNNImputer(n_neighbors=3)
    X_iris_imputed = imputer.fit_transform(X_iris_missing)

    print("Before Imputation:", X_iris_missing[:12])
    print("After Imputation:", X_iris_imputed[:12])

34. Train a PCA model and visualize the data projection onto the first two principal components.

    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_iris)

    # Scatter the projected points, colored by class
    plt.figure(figsize=(8, 5))
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_iris, cmap='viridis', edgecolor='k')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.title('Projection onto First Two Principal Components')
    plt.colorbar()
    plt.show()

35. Train a KNN Classifier using the KD Tree and Ball Tree algorithms and compare performance.

    # Both algorithms perform an exact search, so accuracy should match;
    # they differ in build and query speed.

    # KD Tree
    knn_kd = KNeighborsClassifier(algorithm='kd_tree')
    knn_kd.fit(X_train_iris, y_train_iris)
    acc_kd = knn_kd.score(X_test_iris, y_test_iris)

    # Ball Tree
    knn_ball = KNeighborsClassifier(algorithm='ball_tree')
    knn_ball.fit(X_train_iris, y_train_iris)
    acc_ball = knn_ball.score(X_test_iris, y_test_iris)

    print("Accuracy using KD Tree:", acc_kd)
    print("Accuracy using Ball Tree:", acc_ball)

36. Train a PCA model on a high-dimensional dataset and visualize the Scree plot.

    # Create high-dimensional data
    from sklearn.datasets import make_classification
    X_hd, _ = make_classification(n_samples=100, n_features=20, random_state=42)

    pca_hd = PCA()
    pca_hd.fit(X_hd)

    # Scree plot: explained variance ratio per component
    plt.figure(figsize=(8, 5))
    plt.plot(range(1, len(pca_hd.explained_variance_ratio_) + 1), pca_hd.explained_variance_ratio_, marker='o')
    plt.xlabel('Principal Component')
    plt.ylabel('Explained Variance Ratio')
    plt.title('Scree Plot for High-Dimensional Data')
    plt.grid(True)
    plt.show()

37. Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score.

    from sklearn.metrics import precision_score, recall_score, f1_score

    y_pred_knn = knn_iris.predict(X_test_iris)

    # Macro averaging weights the three classes equally
    precision = precision_score(y_test_iris, y_pred_knn, average='macro')
    recall = recall_score(y_test_iris, y_pred_knn, average='macro')
    f1 = f1_score(y_test_iris, y_pred_knn, average='macro')

    print("Precision:", precision)
    print("Recall:", recall)
    print("F1 Score:", f1)

38. Train a PCA model and analyze the effect of different numbers of components on accuracy.

    for n in range(1, 5):
        # Reduce to n components, fitting PCA on the training set only
        pca = PCA(n_components=n)
        X_train_pca = pca.fit_transform(X_train_iris)
        X_test_pca = pca.transform(X_test_iris)

        knn = KNeighborsClassifier(n_neighbors=3)
        knn.fit(X_train_pca, y_train_iris)
        acc = knn.score(X_test_pca, y_test_iris)

        print(f"PCA components: {n}, Accuracy: {acc:.2f}")

39. Train a KNN Classifier with different leaf_size values and compare accuracy.

    # leaf_size affects the speed and memory of the tree-based search, not which neighbors are found
    for leaf in [10, 20, 30, 40, 50]:
        knn = KNeighborsClassifier(n_neighbors=3, leaf_size=leaf)
        knn.fit(X_train_iris, y_train_iris)
        acc = knn.score(X_test_iris, y_test_iris)
        print(f"Leaf Size: {leaf}, Accuracy: {acc:.2f}")

40. Train a PCA model and visualize how data points are transformed before and after PCA.

    # Before PCA
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.scatter(X_iris[:, 0], X_iris[:, 1], c=y_iris, cmap='viridis', edgecolor='k')
    plt.title('Original Data (First 2 Features)')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')

    # After PCA
    X_pca = PCA(n_components=2).fit_transform(X_iris)
    plt.subplot(1, 2, 2)
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_iris, cmap='viridis', edgecolor='k')
    plt.title('Data After PCA (2 Components)')
    plt.xlabel('PC1')
    plt.ylabel('PC2')

    plt.tight_layout()
    plt.show()

41. Train a KNN Classifier on a real-world dataset (Wine dataset) and print classification report.

    from sklearn.datasets import load_wine
    from sklearn.metrics import classification_report

    # Load wine dataset
    wine = load_wine()
    X_wine, y_wine = wine.data, wine.target
    X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X_wine, y_wine, test_size=0_
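
The listing above breaks off partway through the train/test split. A minimal sketch of the full task might look like the following. The values `test_size=0.3`, `random_state=42` and `n_neighbors=3` are assumptions, since the original values are not recoverable, and `knn_wine` is a name introduced here.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_wine, y_wine = load_wine(return_X_y=True)

# test_size and random_state are assumed values, not taken from the source
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(
    X_wine, y_wine, test_size=0.3, random_state=42)

knn_wine = KNeighborsClassifier(n_neighbors=3)
knn_wine.fit(X_train_w, y_train_w)

# Per-class precision, recall and F1 on the held-out samples
report = classification_report(y_test_w, knn_wine.predict(X_test_w))
print(report)
```

The Wine features have very different ranges, so standardizing them first (for example with `StandardScaler`) would likely change these numbers.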

42. Train a KNN Regressor and analyze the effect of different distance metrics on prediction error.

    # Refit the regressor for each metric and report the test MSE
    for metric in ['euclidean', 'manhattan']:
        knn_reg = KNeighborsRegressor(n_neighbors=3, metric=metric)
        knn_reg.fit(X_train_syn, y_train_syn)
        y_pred = knn_reg.predict(X_test_syn)
        mse = mean_squared_error(y_test_syn, y_pred)
        print(f"Distance Metric: {metric}, MSE: {mse:.2f}")

43. Train a KNN Classifier and evaluate using ROC-AUC score.

    from sklearn.preprocessing import label_binarize
    from sklearn.metrics import roc_auc_score

    # Binarize labels for multiclass ROC AUC (one column per class)
    y_test_bin = label_binarize(y_test_iris, classes=[0, 1, 2])
    y_pred_prob = knn_iris.predict_proba(X_test_iris)

    roc_auc = roc_auc_score(y_test_bin, y_pred_prob, average='macro', multi_class='ovr')
    print("ROC-AUC Score:", roc_auc)

44. Train a PCA model and visualize the variance captured by each principal component.

    pca = PCA().fit(X_iris)
    # One bar per component, numbered from 1
    plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_)
    plt.xlabel('Principal Component')
    plt.ylabel('Explained Variance Ratio')
    plt.title('Variance Captured by Each PCA Component')
    plt.show()

45. Train a KNN Classifier and perform feature selection before training.

    from sklearn.feature_selection import SelectKBest, f_classif

    # Select top 2 features by ANOVA F-score
    selector = SelectKBest(score_func=f_classif, k=2)
    X_selected = selector.fit_transform(X_iris, y_iris)
    X_train_sel, X_test_sel, y_train_sel, y_test_sel = train_test_split(X_selected, y_iris, test_size=0.3, random_state=42)

    knn_fs = KNeighborsClassifier(n_neighbors=3)
    knn_fs.fit(X_train_sel, y_train_sel)
    acc_fs = knn_fs.score(X_test_sel, y_test_sel)
    print("Accuracy after Feature Selection:", acc_fs)

46. Train a PCA model and visualize the data reconstruction error after reducing dimensions.

    # Reduce to 2 components
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X_iris)

    # Reconstruct original data
    X_reconstructed = pca.inverse_transfor
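
The listing above is cut off before the error is computed. One way to finish the idea, sketched here under the assumption that reconstruction uses `PCA.inverse_transform` and that error is measured as mean squared error (the source does not say which measure it used):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data

# Reconstruction MSE for each possible number of components
errors = {}
for n in range(1, X_iris.shape[1] + 1):
    pca = PCA(n_components=n)
    X_reduced = pca.fit_transform(X_iris)
    # Map the reduced data back into the original feature space
    X_reconstructed = pca.inverse_transform(X_reduced)
    errors[n] = float(np.mean((X_iris - X_reconstructed) ** 2))
    print(f"Components: {n}, reconstruction MSE: {errors[n]:.4f}")
```

The error shrinks as components are added and is essentially zero when all four are kept. The `errors` values can be passed to `plt.plot` to visualize the trend.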

47. Train a KNN Classifier and visualize the decision boundary.

    # 2D dataset
    from matplotlib.colors import ListedColormap
    X_vis, y_vis = make_classification(n_samples=100, n_features=2, n_redundant=0, random_state=42)
    X_train_v, X_test_v, y_train_v, y_test_v = train_test_split(X_vis, y_vis, test_size=0.3, random_state=42)

    knn_vis = KNeighborsClassifier(n_neighbors=3)
    knn_vis.fit(X_train_v, y_train_v)

    # Decision boundary: predict the class at every point of a dense grid
    h = .02
    x_min, x_max = X_vis[:, 0].min() - 1, X_vis[:, 0].max() + 1
    y_min, y_max = X_vis[:, 1].min() - 1, X_vis[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = knn_vis.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, cmap=ListedColormap(['#FFAAAA', '#AAFFAA']))
    plt.scatter(X_vis[:, 0], X_vis[:, 1], c=y_vis, edgecolor='k', cmap=ListedColormap(['#FF0000', '#00FF00']))
    plt.title("KNN Decision Boundary")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.show()

48. Train a PCA model and analyze the effect of different numbers of components on data variance.

    # Try various component counts and track explained variance
    components = [1, 2, 3, 4]
    for n in components:
        pca = PCA(n_components=n)
        pca.fit(X_iris)
        total_var = np.sum(pca.explained_variance_ratio_)
        print(f"Components: {n}, Total Explained Variance: {total_var:.2f}")