---
### THEORY QUESTIONS:
---

<div style="font-family: Verdana; font-size: 18px; line-height: 1.6;">


###  K-Nearest Neighbors (KNN)

**1. What is K-Nearest Neighbors (KNN) and how does it work?**
KNN is a supervised learning algorithm used for classification and regression. It predicts the output based on the ‘k’ closest data points in the training set using a distance metric (like Euclidean distance).
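
The prediction step described above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements it; the function name `knn_classify` and the toy points are assumptions made up for this example.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    # Pair each training point's Euclidean distance to the query with its label
    dist_label = [(math.dist(p, query), label)
                  for p, label in zip(train_points, train_labels)]
    # Keep the k pairs with the smallest distances
    nearest = sorted(dist_label, key=lambda pair: pair[0])[:k]
    # Return the most frequent label among those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters labelled "A" and "B"
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
toy_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(toy_points, toy_labels, (1.1, 1.0)))  # prints A
print(knn_classify(toy_points, toy_labels, (5.1, 5.0)))  # prints B
```

There is no training phase beyond storing the data; all the work happens at prediction time, which is why KNN is often called a lazy learner.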

**2. What is the difference between KNN Classification and KNN Regression?**

* Classification: Predicts a class label based on the majority vote of k neighbors.
* Regression: Predicts a continuous value by averaging the values of k neighbors.
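
For regression, only the aggregation step changes: the neighbors' target values are averaged instead of counted. A small sketch, assuming a hypothetical helper `knn_regress` and made-up one-dimensional data:

```python
import math

def knn_regress(train_points, train_values, query, k=3):
    """Average the target values of the k training points closest to `query`."""
    # Sort (point, value) pairs by Euclidean distance to the query and keep k
    nearest = sorted(zip(train_points, train_values),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    # The prediction is the plain (unweighted) mean of the neighbors' values
    return sum(value for _, value in nearest) / len(nearest)

# Toy data (hypothetical): one feature, one continuous target
reg_points = [(1.0,), (2.0,), (3.0,), (10.0,)]
reg_values = [1.5, 2.5, 3.5, 20.0]

# The three nearest points to 2.2 are 2.0, 3.0 and 1.0, so the far-away
# point at 10.0 does not influence the prediction
print(knn_regress(reg_points, reg_values, (2.2,), k=3))  # prints 2.5
```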

**3. What is the role of the distance metric in KNN?**
The distance metric (e.g., Euclidean, Manhattan) determines how closeness is measured between points, directly affecting the accuracy of predictions.
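
To see that the choice of metric can change which point counts as nearest, consider this small sketch. The two candidate points are invented for illustration, and the helper functions are written out by hand rather than taken from a library.

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of the absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))

query = (0.0, 0.0)
candidates = {"A": (3.0, 3.0), "B": (0.0, 5.0)}  # hypothetical points

nearest_euclidean = min(candidates, key=lambda name: euclidean(candidates[name], query))
nearest_manhattan = min(candidates, key=lambda name: manhattan(candidates[name], query))

# A is about 4.24 away under Euclidean distance but 6 away under Manhattan,
# while B is 5 away under both, so the nearest neighbor flips
print("Nearest (Euclidean):", nearest_euclidean)  # prints A
print("Nearest (Manhattan):", nearest_manhattan)  # prints B
```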

**4. What is the Curse of Dimensionality in KNN?**
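In high-dimensional spaces, pairwise distances concentrate: the nearest and farthest points end up almost equally far from a query, so the "nearest" neighbors are no longer meaningfully closer than the rest. The data also becomes sparse, so far more samples are needed to cover the space. Both effects degrade KNN's performance.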
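
The effect can be observed numerically. The sketch below assumes uniformly random points in a unit hypercube, which is a simplification of real data; it measures the ratio between the nearest and the farthest distance from a random query as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}

for dim in [2, 10, 100, 1000]:
    points = rng.random((500, dim))  # 500 points, uniform in the unit hypercube
    query = rng.random(dim)          # one random query point
    distances = np.linalg.norm(points - query, axis=1)
    # A ratio close to 1 means the nearest point is barely closer than the farthest
    ratios[dim] = distances.min() / distances.max()
    print(f"dim={dim:4d}  nearest/farthest distance ratio = {ratios[dim]:.3f}")
```

The ratio should climb toward 1 as the dimension increases, which is the situation in which a neighbor vote carries little information.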

**5. How can we choose the best value of K in KNN?**
Use cross-validation: evaluate a range of k values and pick the one with the best average validation score. Small k tends to overfit to noise, while large k oversmooths the decision boundary.

**6. What are KD Tree and Ball Tree in KNN?**
They are data structures that speed up nearest neighbor search by organizing the training data in tree formats:

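* KD Tree: Recursively splits the data along coordinate axes. Fast for low-dimensional data.
* Ball Tree: Partitions the data into nested hyperspheres. Usually holds up better than a KD Tree as dimensionality grows.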

**7. When should you use KD Tree vs. Ball Tree?**

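* Use KD Tree for data with roughly 20 or fewer dimensions (a rule of thumb, not a hard limit).
* Use Ball Tree for higher dimensions or non-uniform data. In very high dimensions both trees lose their advantage and brute-force search can be just as fast.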

**8. What are the disadvantages of KNN?**

* Slow prediction time for large datasets
* Sensitive to irrelevant features and feature scaling
* Poor performance in high dimensions

**9. How does feature scaling affect KNN?**
Since KNN relies on distances, unscaled features can dominate the distance metric. Always apply standardization or normalization.

---

###  Principal Component Analysis (PCA)

**10. What is PCA (Principal Component Analysis)?**
PCA is an unsupervised dimensionality reduction technique that transforms data to a new coordinate system to reduce the number of features while retaining most variance.

**11. How does PCA work?**

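1. Standardize the data (center each feature and scale it to unit variance)
2. Compute the covariance matrix
3. Calculate the eigenvectors and eigenvalues of the covariance matrix
4. Sort the eigenvectors by decreasing eigenvalue and project the data onto the top k (the principal components)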
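
The steps above can be sketched directly in NumPy. This is an illustrative version on made-up data (the array `X_demo` is an assumption); scikit-learn's `PCA` uses an SVD internally rather than an explicit covariance eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: the third feature is almost a copy of the first,
# so two components should capture nearly all of the variance
base = rng.normal(size=200)
X_demo = np.column_stack([base,
                          rng.normal(size=200),
                          base + 0.1 * rng.normal(size=200)])

# 1. Standardize: zero mean and unit variance per feature
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# 2. Covariance matrix of the features (columns)
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition; eigh is meant for symmetric matrices and
#    returns eigenvalues in ascending order, so reverse them
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top k eigenvectors
k = 2
X_proj = X_std @ eigvecs[:, :k]

explained_ratio = eigvals / eigvals.sum()
print("Explained variance ratio:", np.round(explained_ratio, 3))
print("Projected shape:", X_proj.shape)
```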

**12. What is the geometric intuition behind PCA?**
PCA rotates the axes to align with the directions of maximum variance, projecting data into a space that captures the most information with fewer dimensions.

**13. What are Eigenvalues and Eigenvectors in PCA?**

* Eigenvectors: Directions of maximum variance (principal components).
* Eigenvalues: Amount of variance captured by each eigenvector.

**14. What is the difference between Feature Selection and Feature Extraction?**

* Feature Selection: Selects a subset of existing features.
* Feature Extraction: Creates new features (e.g., PCA) by transforming the original ones.

**15. How do you decide the number of components to keep in PCA?**
Use the explained variance ratio to choose enough components that retain 95-99% of the original variance.
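
In scikit-learn this rule can be applied by passing a fraction as `n_components`. The sketch below uses the Wine dataset and a 95% threshold purely as an example; the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_wine, _ = load_wine(return_X_y=True)
X_wine_std = StandardScaler().fit_transform(X_wine)

# A float between 0 and 1 asks PCA to keep just enough components
# for the cumulative explained variance to reach that fraction
pca_95 = PCA(n_components=0.95)
X_wine_reduced = pca_95.fit_transform(X_wine_std)

retained = np.cumsum(pca_95.explained_variance_ratio_)[-1]
print("Components kept:", pca_95.n_components_)
print("Variance retained:", round(float(retained), 3))
```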

**16. Can PCA be used for classification?**
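Not on its own: PCA is an unsupervised transformation, not a classifier. It is commonly used as a preprocessing step that reduces dimensionality before a classification algorithm, which can improve speed and sometimes accuracy.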

**17. What are the limitations of PCA?**

* Linear technique – can’t capture non-linear relationships
* Components are not always interpretable
* Sensitive to data scaling

**18. How do KNN and PCA complement each other?**
PCA can reduce dimensionality before applying KNN, which improves KNN's performance and mitigates the curse of dimensionality.

**19. How does KNN handle missing values in a dataset?**
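Standard KNN cannot compute distances when feature values are missing, so it does not handle them directly. Missing values must be imputed (for example with the feature mean, or with a KNN-based imputer) before applying KNN.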

**20. What are the key differences between PCA and Linear Discriminant Analysis (LDA)?**

* PCA: Unsupervised, maximizes variance.
* LDA: Supervised, maximizes class separation.
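
The difference shows up in the API: PCA is fitted on the features alone, while LDA also needs the labels. A minimal sketch on Iris, with variable names chosen for this example:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_iris, y_iris = load_iris(return_X_y=True)

# PCA is unsupervised: it never sees y
X_pca_2d = PCA(n_components=2).fit_transform(X_iris)

# LDA is supervised: it uses y and can produce at most n_classes - 1 components
# (Iris has 3 classes, so 2 is the maximum here)
X_lda_2d = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_iris, y_iris)

print("PCA projection shape:", X_pca_2d.shape)
print("LDA projection shape:", X_lda_2d.shape)
```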




---
### PRACTICAL QUESTIONS:
---

###  1. Train a KNN Classifier on the Iris dataset and print model accuracy

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
```

---

###  2. Train a KNN Regressor on a synthetic dataset and evaluate using MSE

```python
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

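# Separate variable names keep the Iris split from question 1 intact,
# because the next question trains a classifier on X_train / y_train
X_reg, y_reg = make_regression(n_samples=200, n_features=1, noise=10, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(Xr_train, yr_train)
yr_pred = reg.predict(Xr_test)

print("MSE:", mean_squared_error(yr_test, yr_pred))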
```

---

###  3. Train a KNN Classifier using Euclidean and Manhattan distance metrics and compare accuracy

```python
for metric in ['euclidean', 'manhattan']:
    model = KNeighborsClassifier(n_neighbors=3, metric=metric)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{metric.capitalize()} Accuracy:", accuracy_score(y_test, y_pred))
```

---

### 4. Train a KNN Classifier with different values of K and visualize decision boundaries

```python
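import numpy as np  # needed for meshgrid / c_ below
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from matplotlib.colors import ListedColormap

# random_state makes the synthetic dataset reproducible
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1, random_state=42)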
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

cmap_light = ListedColormap(['#FFAAAA', '#AAAAFF'])
cmap_bold = ['red', 'blue']

for k in [1, 3, 5]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)

    h = .02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.figure()
    plt.contourf(xx, yy, Z, cmap=cmap_light)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=ListedColormap(cmap_bold), edgecolor='k', s=20)
    plt.title(f"K = {k}")
    plt.show()
```

---

### 5. Apply Feature Scaling before training a KNN model and compare results with unscaled data

```python
from sklearn.preprocessing import StandardScaler

# Without scaling
model_unscaled = KNeighborsClassifier(n_neighbors=3)
model_unscaled.fit(X_train, y_train)
print("Unscaled Accuracy:", accuracy_score(y_test, model_unscaled.predict(X_test)))

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = KNeighborsClassifier(n_neighbors=3)
model_scaled.fit(X_train_scaled, y_train)
print("Scaled Accuracy:", accuracy_score(y_test, model_scaled.predict(X_test_scaled)))
```

---

### 6. Train a PCA model on synthetic data and print the explained variance ratio

```python
from sklearn.decomposition import PCA

X, _ = make_regression(n_samples=100, n_features=5, noise=0.1)
pca = PCA()
pca.fit(X)

print("Explained Variance Ratio:", pca.explained_variance_ratio_)
```

---

### 7. Apply PCA before training a KNN Classifier and compare accuracy with and without PCA

```python
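from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline

# Wine has 13 features, so 2 components is a real reduction (the previous data had only 2 features)
Xw, yw = load_wine(return_X_y=True)
Xw_train, Xw_test, yw_train, yw_test = train_test_split(Xw, yw, random_state=42)

# Baseline: scaling + KNN
knn_only = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)).fit(Xw_train, yw_train)
print("Accuracy without PCA:", knn_only.score(Xw_test, yw_test))

# Scaling + PCA + KNN
knn_pca = make_pipeline(StandardScaler(), PCA(n_components=2), KNeighborsClassifier(n_neighbors=3)).fit(Xw_train, yw_train)
print("Accuracy with PCA:", knn_pca.score(Xw_test, yw_test))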
```

---

### 8. Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV

```python
from sklearn.model_selection import GridSearchCV

params = {'n_neighbors': list(range(1, 21))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid=params, cv=5)
grid.fit(X_train_scaled, y_train)

print("Best K:", grid.best_params_)
print("Best Accuracy:", grid.best_score_)
```

---

### 9. Train a KNN Classifier and check the number of misclassified samples

```python
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

misclassified = (y_pred != y_test).sum()
print("Misclassified Samples:", misclassified)
```

---

### 10. Train a PCA model and visualize the cumulative explained variance

```python
import numpy as np
import matplotlib.pyplot as plt

pca = PCA().fit(X_train_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)

plt.figure()
plt.plot(range(1, len(cum_var)+1), cum_var, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA Cumulative Explained Variance')
plt.grid(True)
plt.show()
```

---

### 11. Train a KNN Classifier using different values of the `weights` parameter (`uniform` vs. `distance`) and compare accuracy

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Uniform weights
knn_uniform = KNeighborsClassifier(weights='uniform')
knn_uniform.fit(X_train, y_train)
y_pred_uniform = knn_uniform.predict(X_test)

# Distance weights
knn_distance = KNeighborsClassifier(weights='distance')
knn_distance.fit(X_train, y_train)
y_pred_distance = knn_distance.predict(X_test)

# Compare accuracy
print("Uniform Weights Accuracy:", accuracy_score(y_test, y_pred_uniform))
print("Distance Weights Accuracy:", accuracy_score(y_test, y_pred_distance))
```

---

###  12. Train a KNN Regressor and analyze the effect of different K values on performance

```python
from sklearn.datasets import fetch_california_housing
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Distinct names keep the Iris split from question 11 available to later classifier cells
X_house, y_house = fetch_california_housing(return_X_y=True)
Xh_train, Xh_test, yh_train, yh_test = train_test_split(X_house, y_house, random_state=42)

errors = []
k_values = range(1, 21)

for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(Xh_train, yh_train)
    yh_pred = knn.predict(Xh_test)
    mse = mean_squared_error(yh_test, yh_pred)
    errors.append(mse)

plt.plot(k_values, errors, marker='o')
plt.xlabel("K Value")
plt.ylabel("Mean Squared Error")
plt.title("KNN Regressor Performance")
plt.grid()
plt.show()
```

---

### 13. Implement KNN Imputation for handling missing values in a dataset

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Add some missing values
rng = np.random.RandomState(42)
X_missing = X.copy()
X_missing[rng.randint(0, X.shape[0], 10), rng.randint(0, X.shape[1], 10)] = np.nan

# Impute using KNN
imputer = KNNImputer(n_neighbors=3)
X_imputed = imputer.fit_transform(X_missing)

print("Missing values before:", np.isnan(X_missing).sum())
print("Missing values after:", np.isnan(X_imputed).sum())
```

---

### 14. Train a PCA model and visualize the data projection onto the first two principal components

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA Projection (2 Components)")
plt.grid()
plt.show()
```

---

### 15. Train a KNN Classifier using the `kd_tree` and `ball_tree` algorithms and compare performance

```python
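import time

# Both trees perform an exact search, so accuracy should match; speed is what differs
for algorithm in ['kd_tree', 'ball_tree']:
    knn = KNeighborsClassifier(algorithm=algorithm)
    start = time.perf_counter()
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    elapsed = time.perf_counter() - start
    print(f"{algorithm} Accuracy: {accuracy_score(y_test, y_pred):.3f} (fit + predict: {elapsed:.4f}s)")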
```

---

### 16. Train a PCA model on a high-dimensional dataset and visualize the Scree plot

```python
from sklearn.datasets import make_classification

# A separate name keeps the Iris X and y from question 14 intact for later cells
X_hd, _ = make_classification(n_samples=200, n_features=30, random_state=42)

pca = PCA()
pca.fit(X_hd)
explained_variance = pca.explained_variance_ratio_

plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o')
plt.title("Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.grid()
plt.show()
```

---

### 17. Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score

```python
from sklearn.metrics import classification_report

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(classification_report(y_test, y_pred))
```

---

### 18. Train a PCA model and analyze the effect of different numbers of components on accuracy

```python
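# Reload Iris so that X and y are guaranteed to match
X, y = load_iris(return_X_y=True)
# Split once, then fit PCA on the training split only to avoid leaking test data
X_train_raw, X_test_raw, y_train_pca, y_test_pca = train_test_split(X, y, random_state=42)

accuracies = []
component_range = range(1, X.shape[1] + 1)

for n in component_range:
    pca = PCA(n_components=n)
    X_train_pca = pca.fit_transform(X_train_raw)
    X_test_pca = pca.transform(X_test_raw)

    knn = KNeighborsClassifier()
    knn.fit(X_train_pca, y_train_pca)
    y_pred_pca = knn.predict(X_test_pca)
    acc = accuracy_score(y_test_pca, y_pred_pca)
    accuracies.append(acc)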

plt.plot(component_range, accuracies, marker='o')
plt.xlabel("Number of PCA Components")
plt.ylabel("KNN Accuracy")
plt.title("Effect of PCA Components on Accuracy")
plt.grid()
plt.show()
```

---

### 19. Train a KNN Classifier with different `leaf_size` values and compare accuracy

```python
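# leaf_size only applies to tree-based search; it affects speed and memory,
# not which neighbors are found, so accuracy is expected to stay the same
leaf_sizes = range(10, 51, 10)
for leaf in leaf_sizes:
    knn = KNeighborsClassifier(algorithm='kd_tree', leaf_size=leaf)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print(f"Leaf size {leaf}: Accuracy = {accuracy_score(y_test, y_pred)}")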
```

---

### 20. Train a PCA model and visualize how data points are transformed before and after PCA

```python
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

# Before PCA
ax[0].scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')
ax[0].set_title("Original Data")
ax[0].set_xlabel("Feature 1")
ax[0].set_ylabel("Feature 2")

# After PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
ax[1].set_title("After PCA (2 Components)")
ax[1].set_xlabel("PC1")
ax[1].set_ylabel("PC2")

plt.tight_layout()
plt.show()
```


---

### 21. Train a KNN Classifier on a real-world dataset (Wine dataset) and print classification report

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(classification_report(y_test, y_pred))
```

---

###  22. Train a KNN Regressor and analyze the effect of different distance metrics on prediction error

```python
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

metrics = ['euclidean', 'manhattan', 'chebyshev']
errors = []

for metric in metrics:
    knn = KNeighborsRegressor(metric=metric)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    errors.append(mse)
    print(f"{metric.capitalize()} MSE:", mse)
```

---

### 23. Train a KNN Classifier and evaluate using ROC-AUC score

```python
from sklearn.metrics import roc_auc_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit on the integer class labels so predict_proba returns a single
# (n_samples, n_classes) array rather than one array per binarized column
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_prob = knn.predict_proba(X_test)

# For multi-class OvR, roc_auc_score takes integer labels plus class probabilities
roc_auc = roc_auc_score(y_test, y_pred_prob, multi_class='ovr')
print("ROC AUC Score (multi-class OVR):", roc_auc)
```

---

### 24. Train a PCA model and visualize the variance captured by each principal component

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

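from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

# Standardize first: Wine features have very different scales, and on raw data
# the feature with the largest scale would dominate the first component
pca = PCA()
pca.fit(StandardScaler().fit_transform(X))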

plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, marker='o')
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.title("Variance Captured by PCA Components")
plt.grid()
plt.show()
```

---

### 25. Train a KNN Classifier and perform feature selection before training

```python
from sklearn.feature_selection import SelectKBest, f_classif

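# Split first so the feature selector never sees the test labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Select the top 5 features using the training split only
selector = SelectKBest(score_func=f_classif, k=5)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

knn = KNeighborsClassifier()
knn.fit(X_train_sel, y_train)
y_pred = knn.predict(X_test_sel)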

print(classification_report(y_test, y_pred))
```

---

### 26. Train a PCA model and visualize the data reconstruction error after reducing dimensions

```python
from sklearn.metrics import mean_squared_error

X, _ = load_wine(return_X_y=True)
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
X_reconstructed = pca.inverse_transform(X_reduced)

reconstruction_error = mean_squared_error(X, X_reconstructed)
print("Reconstruction Error (MSE):", reconstruction_error)
```

---

### 27. Train a KNN Classifier and visualize the decision boundary

```python
import numpy as np

# Use only 2 features for visualization
X_vis = X[:, :2]
y_vis = y

X_train, X_test, y_train, y_test = train_test_split(X_vis, y_vis, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Create meshgrid
x_min, x_max = X_vis[:, 0].min() - 1, X_vis[:, 0].max() + 1
y_min, y_max = X_vis[:, 1].min() - 1, X_vis[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, cmap='Pastel2')
plt.scatter(X_vis[:, 0], X_vis[:, 1], c=y_vis, cmap='Dark2', edgecolor='k')
plt.title("KNN Decision Boundary (2 Features)")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```

---

### 28. Train a PCA model and analyze the effect of different numbers of components on data variance

```python
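from sklearn.preprocessing import StandardScaler

# Standardize first so no single large-scale feature dominates the variance
pca = PCA()
pca.fit(StandardScaler().fit_transform(X))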

cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Effect of PCA Components on Variance")
plt.grid()
plt.show()
```
