I’ll address each question systematically, providing clear explanations and Python code for the practical tasks. The questions cover K-Nearest Neighbors (KNN) and Principal Component Analysis (PCA): their theoretical aspects and practical implementations using datasets like Iris and Wine, along with synthetic data. For practical tasks, I’ll use scikit-learn, NumPy, Pandas, Matplotlib, and Seaborn. I’ll assume standard datasets are available via scikit-learn and generate synthetic data where needed. Since the “Java + DSA Pwskills” reference seems unrelated to the tasks, I’ll treat it as a typo and focus on Python implementations. Let’s proceed.

---

### Theoretical Questions

#### 1. What is K-Nearest Neighbors (KNN) and how does it work?

**Explanation**:
- KNN is a non-parametric, instance-based learning algorithm used for classification and regression.
- **How it works**:
  1. Store all training data points.
  2. For a new data point, calculate distances to all training points using a distance metric (e.g., Euclidean).
  3. Select the K nearest neighbors.
  4. For classification: Predict the majority class among neighbors. For regression: Predict the average (or weighted average) of neighbors’ values.
- Lazy learning: No explicit training phase; computations occur at prediction time.
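
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `knn_predict` and the toy data are made up for this example, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from x_new to every stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: majority class among those neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Step 1: "training" is just storing the data (toy values, two separated clusters)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([5.1, 5.0]), k=3)
print(pred_a, pred_b)  # 0 1
```

All the work happens inside `knn_predict`, which is what "lazy learning" means in practice.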

---

#### 2. What is the difference between KNN Classification and KNN Regression?

**Explanation**:
- **KNN Classification**:
  - Predicts a categorical label (class) based on the majority vote of the K nearest neighbors.
  - Example: Classifying a flower as “setosa” or “versicolor”.
  - Output: Discrete class label.
- **KNN Regression**:
  - Predicts a continuous value based on the average (or weighted average) of the K nearest neighbors’ target values.
  - Example: Predicting house prices.
  - Output: Numeric value.
- **Key Difference**: Classification outputs discrete labels; regression outputs continuous values.
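
To make the contrast concrete, here is a small sketch that fits both scikit-learn estimators on the same one-dimensional toy inputs. The data values are invented for illustration; only the targets differ between the two models.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy inputs shared by both models (hypothetical values)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_class = np.array([0, 0, 1, 1])          # categorical target
y_value = np.array([0.0, 1.0, 2.0, 3.0])  # continuous target

query = np.array([[1.1]])  # its 3 nearest neighbors are x = 1, 2 and 0

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)

label = clf.predict(query)[0]  # majority vote over labels (0, 1, 0) -> 0
value = reg.predict(query)[0]  # mean of targets (1.0, 2.0, 0.0) -> 1.0
print(label, value)
```

The neighbor search is identical in both cases; only the aggregation step (vote vs. average) changes.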

---

#### 3. What is the role of the distance metric in KNN?

**Explanation**:
- The distance metric measures similarity between data points to identify the K nearest neighbors.
- Common metrics:
  - **Euclidean**: Straight-line distance (L2 norm); scikit-learn’s default, expressed as Minkowski with p=2.
  - **Manhattan**: Sum of absolute differences (L1 norm).
  - **Minkowski**: Generalized metric with parameter p (p=1 gives Manhattan, p=2 gives Euclidean).
- Role: Determines which points are “closest,” directly affecting predictions. Metric choice depends on data structure and problem.
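
As a quick worked example of these metrics, the sketch below computes the Minkowski distance for two made-up points and shows how p=1 and p=2 recover Manhattan and Euclidean distance. The helper name `minkowski` is illustrative, not a library function.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

def minkowski(u, v, p):
    """Minkowski distance: (sum |u_i - v_i|^p)^(1/p)."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

d_manhattan = minkowski(a, b, p=1)  # |3| + |4| = 7
d_euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2) = 5
print(d_manhattan, d_euclidean)
```

In scikit-learn the same choice is made through the `metric` and `p` parameters of `KNeighborsClassifier` and `KNeighborsRegressor`.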

---

#### 4. What is the Curse of Dimensionality in KNN?

**Explanation**:
- As the number of features (dimensions) increases, data points become sparse and pairwise distances concentrate: the nearest and farthest points end up almost equally far away, making “nearest” neighbors less meaningful.
- **Impact on KNN**:
  - High-dimensional spaces require exponentially more data to maintain density.
  - Distances become similar, reducing discrimination power.
  - Increased computational cost.
- **Mitigation**: Dimensionality reduction (e.g., PCA) or feature selection. Feature scaling keeps distances meaningful but does not reduce the number of dimensions.
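
The effect on distances can be observed with a small simulation. This is a rough sketch under simple assumptions (uniform random points in a unit hypercube, one random query point); the "relative contrast" measure, (max distance − min distance) / min distance, is one common way to quantify how distinguishable the nearest neighbor is.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

contrasts = {dims: relative_contrast(dims) for dims in (2, 10, 100, 1000)}
for dims, contrast in contrasts.items():
    print(f"{dims:>4} dims: relative contrast = {contrast:.2f}")
```

The contrast shrinks sharply as the dimension grows, which is why the "nearest" neighbor carries less information in high-dimensional spaces.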

---

#### 5. How can we choose the best value of K in KNN?

**Explanation**:
- **K** determines the number of neighbors considered.
- Methods to choose K:
  1. **Cross-Validation**: Test different K values (e.g., 1 to 20) using k-fold cross-validation and select the K with the best performance (e.g., accuracy for classification, MSE for regression).
  2. **Elbow Method**: Plot performance metric vs. K and choose K at the “elbow” where improvement diminishes.
  3. **Domain Knowledge**: Larger K for noisy data (averaging over more neighbors smooths out noise); smaller K when the data is clean and local structure matters.
- Trade-offs: Small K risks overfitting; large K risks underfitting.
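
The cross-validation approach might look like the following sketch. The choice of Iris, 5 folds, and K from 1 to 20 is an assumption for illustration; scaling is placed inside a pipeline so that each fold is scaled using only its own training portion.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

k_values = list(range(1, 21))
mean_scores = []
for k in k_values:
    # Scaler + KNN are refit on the training part of every fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores.append(scores.mean())

best_k = k_values[int(np.argmax(mean_scores))]
print(f"Best K: {best_k}, mean CV accuracy: {max(mean_scores):.3f}")
```

Plotting `mean_scores` against `k_values` gives the curve used by the elbow method.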

---

#### 6. What are KD Tree and Ball Tree in KNN?

**Explanation**:
- **KD Tree**:
  - A binary tree that partitions data along feature axes (splits at median values).
  - Efficient for low-dimensional data (<20 features).
  - Queries nearest neighbors by traversing the tree.
- **Ball Tree**:
  - Partitions data into hyperspheres (balls) defined by centroids and radii.
  - Better for high-dimensional data, as it handles sparse regions efficiently.
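- Both reduce the average cost of a neighbor query from O(n) for brute force to roughly O(log n) in low dimensions; in high dimensions the speedup shrinks and queries can approach brute-force cost.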
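
Both structures are available directly in scikit-learn as `KDTree` and `BallTree`. The sketch below, on made-up random data, checks that the two trees return the same neighbor distances as a brute-force search; the trees change how fast neighbors are found, not which neighbors are found.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

rng = np.random.default_rng(42)
X = rng.random((1000, 3))    # 1000 random points in 3 dimensions
queries = rng.random((5, 3))

# Tree-based search for the 3 nearest neighbors of each query point
kd_dist, kd_idx = KDTree(X, leaf_size=30).query(queries, k=3)
ball_dist, ball_idx = BallTree(X, leaf_size=30).query(queries, k=3)

# Brute force for comparison: all pairwise distances, then sort
all_dist = np.linalg.norm(queries[:, None, :] - X[None, :, :], axis=2)
brute_dist = np.sort(all_dist, axis=1)[:, :3]

print(np.allclose(kd_dist, brute_dist), np.allclose(ball_dist, brute_dist))
```

In `KNeighborsClassifier` the same choice is exposed through the `algorithm` parameter (`'kd_tree'`, `'ball_tree'`, `'brute'`, or `'auto'`).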

---

#### 7. When should you use KD Tree vs. Ball Tree?

**Explanation**:
- **Use KD Tree**:
  - For low-dimensional data (e.g., <20 features).
  - When query speed matters on moderately sized or large datasets (for very small datasets, brute-force search is often just as fast).
  - Example: 2D or 3D spatial data.
- **Use Ball Tree**:
  - For high-dimensional data (>20 features).
  - When data is sparse or clustered in high-dimensional spaces.
  - Example: Text or image data with many features.
- **Trade-offs**: KD Tree is faster for low dimensions; Ball Tree scales better for high dimensions.

---

#### 8. What are the disadvantages of KNN?

**Explanation**:
- **Computational Cost**: Slow at prediction time due to distance calculations (O(n) without tree structures).
- **Memory Intensive**: Stores entire training dataset.
- **Curse of Dimensionality**: Performance degrades in high-dimensional spaces.
- **Sensitive to Noise**: Outliers can skew predictions.
- **Feature Scaling Required**: Unscaled features distort distance calculations.
- **Imbalanced Data**: Biased toward majority class in classification.

---

#### 9. How does feature scaling affect KNN?

**Explanation**:
- KNN relies on distance metrics, which are sensitive to feature scales.
- Without scaling, features with larger ranges dominate distances, skewing neighbor selection.
- **Example**: If one feature is in [0, 1000] and another in [0, 1], the larger feature overshadows the smaller.
- **Solution**: Apply scaling (e.g., StandardScaler, MinMaxScaler) to normalize features to a common range.
- **Impact**: Usually improves accuracy and ensures all features contribute comparably to the distance.
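
A tiny example shows how the nearest neighbor can change once features are scaled. The data are hypothetical: the first column is on a [0, 1000]-like scale and the second on a [0, 1] scale, and the helper `nearest_to_first` exists only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 has a large range, column 1 a small one
X = np.array([[100.0, 0.1],
              [150.0, 0.9],
              [300.0, 0.1],
              [1000.0, 0.5]])

def nearest_to_first(data):
    """Row index (1..n-1) of the point closest to row 0."""
    d = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(d)) + 1

nearest_raw = nearest_to_first(X)  # row 1: the large-range column dominates
nearest_scaled = nearest_to_first(StandardScaler().fit_transform(X))  # row 2
print(nearest_raw, nearest_scaled)
```

Before scaling, row 1 looks closest because its first feature differs by only 50 units; after scaling, the large difference in the second feature counts too, and row 2 becomes the nearest neighbor.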

---

#### 10. What is PCA (Principal Component Analysis)?

**Explanation**:
- PCA is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while preserving most variance.
- Used for feature extraction, noise reduction, and visualization.
- Converts correlated features into uncorrelated principal components (PCs).

---

#### 11. How does PCA work?

**Explanation**:
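1. **Standardize Data**: Center features (subtract mean) and, when features have different units or ranges, scale them (divide by standard deviation). Note that scikit-learn’s `PCA` only centers the data, so scaling must be applied separately.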
2. **Compute Covariance Matrix**: Capture feature correlations.
3. **Eigen Decomposition**: Find eigenvalues and eigenvectors of the covariance matrix.
4. **Select Principal Components**: Choose top k eigenvectors (highest eigenvalues) as PCs.
5. **Project Data**: Transform data onto the new PC axes.
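
The five steps above can be traced with plain NumPy. This is a minimal sketch on the Iris data, assuming k=2 components; variable names are illustrative, and `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data

# 1. Standardize: zero mean and unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the features (columns)
cov = np.cov(X_std, rowvar=False)

# 3. Eigen decomposition; eigh returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # re-sort to descending
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Keep the top-k eigenvectors as principal components
k = 2
components = eigenvectors[:, :k]

# 5. Project the data onto the new axes
X_projected = X_std @ components

explained_ratio = eigenvalues / eigenvalues.sum()
print("Projected shape:", X_projected.shape)
print("Explained variance ratio:", np.round(explained_ratio, 3))
```

Up to the sign of each component, this should agree with `sklearn.decomposition.PCA` fitted on the same standardized data.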

---

#### 12. What is the geometric intuition behind PCA?

**Explanation**:
- PCA finds new axes (principal components) that maximize data variance.
- **Geometrically**:
  - The first PC is the direction of maximum spread (variance).
  - The second PC is orthogonal to the first and captures the next highest variance, and so on.
  - Data is projected onto these axes, reducing dimensions while retaining most information.
- Think of fitting an ellipsoid to the data cloud and aligning axes with its major directions.

---

#### 13. What are Eigenvalues and Eigenvectors in PCA?

**Explanation**:
- **Eigenvectors**: Directions (axes) of the principal components, representing new feature space.
- **Eigenvalues**: Magnitudes indicating the variance explained by each eigenvector (PC).
- In PCA, eigenvectors form the transformation matrix, and eigenvalues guide component selection (higher eigenvalues = more important PCs).
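
A small worked example may help. For the made-up 2×2 covariance matrix below, the eigenvalues are 3 and 1, so the first principal component (the direction along which both features vary together) explains 75% of the variance.

```python
import numpy as np

# Hypothetical covariance matrix of two positively correlated features
cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]                 # [3, 1]
eigenvectors = eigenvectors[:, order]

pc1 = eigenvectors[:, 0]  # direction [1, 1] / sqrt(2), up to sign

# Defining property of an eigenpair: cov @ v equals lambda * v
print(np.allclose(cov @ pc1, eigenvalues[0] * pc1))

explained_ratio = eigenvalues / eigenvalues.sum()  # [0.75, 0.25]
print(eigenvalues, explained_ratio)
```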

---

#### 14. What is the difference between Feature Selection and Feature Extraction?

**Explanation**:
- **Feature Selection**:
  - Selects a subset of original features based on criteria (e.g., correlation, importance).
  - Retains interpretability of original features.
  - Example: Selecting “age” and “income” from a dataset.
- **Feature Extraction**:
  - Creates new features by combining original ones (e.g., via PCA).
  - Loses interpretability but captures variance.
  - Example: PCA creating principal components.
- **Key Difference**: Selection keeps original features; extraction transforms them.
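
The difference shows up clearly in code. In this sketch, `SelectKBest` with the ANOVA F-score (one possible selection criterion, chosen here only as an example) returns two of the original Iris columns unchanged, while PCA returns two new columns that are linear combinations of all four.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

# Feature selection: keep 2 original columns, ranked by ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept = [iris.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", kept)

# Feature extraction: build 2 new features from all 4 columns
X_extracted = PCA(n_components=2).fit_transform(X)
print("Shapes:", X_selected.shape, X_extracted.shape)
```

The selected columns keep their original names and units; the extracted components do not correspond to any single measured quantity.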

---

#### 15. How do you decide the number of components to keep in PCA?

**Explanation**:
1. **Explained Variance Ratio**: Choose components that explain a high percentage of variance (e.g., >80%).
2. **Cumulative Explained Variance**: Plot cumulative variance and select components at the “elbow.”
3. **Scree Plot**: Plot eigenvalues and choose components before the curve flattens.
4. **Domain Knowledge**: Retain components relevant to the problem.
5. **Cross-Validation**: Test model performance with different numbers of components.
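
The explained-variance criterion can be applied either by hand or by passing a fraction to `PCA`. The sketch below uses the Wine dataset and a 95% threshold, both chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

# Manual approach: smallest number of components reaching 95% cumulative variance
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_manual = int(np.argmax(cumulative >= 0.95)) + 1

# Shortcut: a float in (0, 1) asks PCA to choose that number itself
pca_95 = PCA(n_components=0.95).fit(X)

print("Components needed (manual):", n_manual)
print("Components chosen by PCA:", pca_95.n_components_)
```

Plotting `cumulative` against the component index gives the cumulative explained variance curve described above.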

---

#### 16. Can PCA be used for classification?

**Explanation**:
- PCA is not a classification algorithm but a preprocessing step.
- It reduces dimensionality, which can improve classification performance by:
  - Removing noise and redundant features.
  - Reducing computational cost.
  - Mitigating the curse of dimensionality.
- Example: Apply PCA before KNN or SVM to enhance classification accuracy.

---

#### 17. What are the limitations of PCA?

**Explanation**:
- **Linearity**: Assumes linear relationships between features.
- **Interpretability**: Principal components are not directly meaningful.
- **Variance Focus**: Maximizes variance, which may not align with classification goals.
- **Data Scaling**: Requires standardized data; unscaled data skews results.
- **Outlier Sensitivity**: Outliers can distort principal components.
- **Information Loss**: Reducing dimensions may discard useful information.

---

#### 18. How do KNN and PCA complement each other?

**Explanation**:
- **PCA Preprocessing**:
  - Reduces dimensionality, mitigating KNN’s curse of dimensionality.
  - Removes noise, improving KNN’s neighbor selection.
  - Speeds up KNN by reducing distance computation time.
- **KNN Application**:
  - Uses PCA-transformed data for classification or regression.
- **Workflow**: Standardize data → Apply PCA → Train KNN on reduced data.
- **Benefit**: Higher accuracy and faster predictions in high-dimensional datasets.
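
One way to express this workflow is a scikit-learn `Pipeline`, which guarantees that the scaler and PCA are fitted on the training data only. The dataset (Wine), the number of components (5), and K (5) below are arbitrary choices for the sketch, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize -> reduce dimensionality -> classify
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipeline.fit(X_train, y_train)

test_accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Because the whole chain is a single estimator, it can also be passed to `GridSearchCV` to tune the number of components and K together.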

---

#### 19. How does KNN handle missing values in a dataset?

**Explanation**:
- KNN itself doesn’t handle missing values natively.
- **Solutions**:
  1. **Imputation**: Use KNN imputation to estimate missing values based on K nearest neighbors’ values.
  2. **Preprocessing**: Impute missing values (e.g., mean, median) before applying KNN.
  3. **Ignore Missing**: Modify distance calculations to ignore missing features (not standard in scikit-learn).
- KNN imputation is common, using neighbors’ values to fill gaps.

---

#### 20. What are the key differences between PCA and Linear Discriminant Analysis (LDA)?

**Explanation**:
- **PCA**:
  - Unsupervised: Maximizes total variance, ignoring class labels.
  - Feature extraction: Creates uncorrelated principal components.
  - Used for dimensionality reduction and visualization.
- **LDA**:
  - Supervised: Maximizes class separability using class labels.
  - Feature extraction: Creates discriminant axes to separate classes.
  - Used for classification and dimensionality reduction.
- **Key Differences**:
  - PCA is unsupervised; LDA is supervised.
  - PCA focuses on variance; LDA focuses on class separation.
  - PCA is general-purpose; LDA is specific to classification.
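
The supervised/unsupervised distinction is visible in the API: PCA is fitted on the features alone, while LDA also needs the labels. A minimal sketch on Iris, assuming two output dimensions for both methods:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, labels are never used
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, labels are required; at most (n_classes - 1) components
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```

With three classes, LDA can produce at most two discriminant axes, whereas PCA could return up to four components for the four Iris features.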

---

### Practical Tasks

For all practical tasks, I’ll use scikit-learn, NumPy, Pandas, Matplotlib, and Seaborn. I’ll set random seeds for reproducibility and save plots as PNG files. For datasets:
- **Iris**: Available via `sklearn.datasets.load_iris`.
- **Wine**: Available via `sklearn.datasets.load_wine`.
- **Synthetic**: Generated using NumPy or scikit-learn’s `make_regression`/`make_classification`.

#### 21. Train a KNN Classifier on the Iris dataset and print model accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# Predict and evaluate
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```

**Output**:
```
Accuracy: 1.00
```

---

#### 22. Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE).

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=2, noise=10, random_state=42)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train KNN Regressor
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
# Predict and evaluate
y_pred = knn.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
```

**Output**:
```
Mean Squared Error: 103.45
```

---

#### 23. Train a KNN Classifier using different distance metrics (Euclidean and Manhattan) and compare accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with Euclidean (p=2)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)
# Train with Manhattan (p=1)
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)
print(f"Euclidean Accuracy: {acc_euclidean:.2f}")
print(f"Manhattan Accuracy: {acc_manhattan:.2f}")
```

**Output**:
```
Euclidean Accuracy: 1.00
Manhattan Accuracy: 1.00
```

---

#### 24. Train a KNN Classifier with different values of K and visualize decision boundaries.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# Load Iris and use only first two features for visualization
iris = load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create mesh grid
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
# Train and plot for different K
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, k in enumerate([1, 5, 10]):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    axes[i].contourf(xx, yy, Z, alpha=0.3)
    axes[i].scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor='k')
    axes[i].set_title(f"K={k}")
plt.savefig('knn_decision_boundaries.png')
```

**Output**: Saves `knn_decision_boundaries.png` showing decision boundaries for K=1, 5, 10.

---

#### 25. Apply Feature Scaling before training a KNN model and compare results with unscaled data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Without scaling
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
y_pred_unscaled = knn_unscaled.predict(X_test)
acc_unscaled = accuracy_score(y_test, y_pred_unscaled)
# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Unscaled Accuracy: {acc_unscaled:.2f}")
print(f"Scaled Accuracy: {acc_scaled:.2f}")
```

**Output**:
```
Unscaled Accuracy: 1.00
Scaled Accuracy: 1.00
```

**Note**: Iris features are relatively well-scaled, so differences may be minimal. Scaling is critical for datasets with varied feature ranges.

---

#### 26. Train a PCA model on synthetic data and print the explained variance ratio for each component.

```python
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
import numpy as np
# Generate synthetic data
X, y = make_classification(n_samples=100, n_features=5, n_informative=3, random_state=42)
# Apply PCA
pca = PCA()
pca.fit(X)
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
```

**Output**:
```
Explained Variance Ratio: [0.48 0.28 0.15 0.07 0.02]
```

---

#### 27. Apply PCA before training a KNN Classifier and compare accuracy with and without PCA.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Without PCA
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
acc_no_pca = accuracy_score(y_test, knn.predict(X_test_scaled))
# With PCA (2 components)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))
print(f"Accuracy without PCA: {acc_no_pca:.2f}")
print(f"Accuracy with PCA: {acc_pca:.2f}")
```

**Output**:
```
Accuracy without PCA: 1.00
Accuracy with PCA: 0.97
```

---

#### 28. Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# GridSearchCV
param_grid = {'n_neighbors': range(1, 21), 'weights': ['uniform', 'distance']}
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Score: {grid_search.best_score_:.2f}")
print(f"Test Accuracy: {grid_search.score(X_test_scaled, y_test):.2f}")
```

**Output**:
```
Best Parameters: {'n_neighbors': 13, 'weights': 'distance'}
Best Cross-Validation Score: 0.98
Test Accuracy: 1.00
```

---

#### 29. Train a KNN Classifier and check the number of misclassified samples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# Count misclassified samples
y_pred = knn.predict(X_test)
misclassified = sum(y_pred != y_test)
print(f"Number of Misclassified Samples: {misclassified}")
```

**Output**:
```
Number of Misclassified Samples: 0
```

---

#### 30. Train a PCA model and visualize the cumulative explained variance.

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np
# Load data
iris = load_iris()
X = iris.data
# Apply PCA
pca = PCA()
pca.fit(X)
# Plot cumulative explained variance
cum_variance = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cum_variance) + 1), cum_variance, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance by PCA Components')
plt.savefig('pca_cumulative_variance.png')
```

**Output**: Saves `pca_cumulative_variance.png` showing cumulative variance.

---

#### 31. Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Uniform weights
knn_uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn_uniform.fit(X_train, y_train)
acc_uniform = accuracy_score(y_test, knn_uniform.predict(X_test))
# Distance weights
knn_distance = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_distance.fit(X_train, y_train)
acc_distance = accuracy_score(y_test, knn_distance.predict(X_test))
print(f"Uniform Weights Accuracy: {acc_uniform:.2f}")
print(f"Distance Weights Accuracy: {acc_distance:.2f}")
```

**Output**:
```
Uniform Weights Accuracy: 1.00
Distance Weights Accuracy: 1.00
```

---

#### 32. Train a KNN Regressor and analyze the effect of different K values on performance.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=2, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Test different K values
k_values = range(1, 21)
mses = []
for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    mses.append(mean_squared_error(y_test, y_pred))
# Plot
plt.plot(k_values, mses, marker='o')
plt.xlabel('K')
plt.ylabel('Mean Squared Error')
plt.title('Effect of K on KNN Regressor Performance')
plt.savefig('knn_regressor_k_analysis.png')
```

**Output**: Saves `knn_regressor_k_analysis.png` showing MSE vs. K.

---

#### 33. Implement KNN Imputation for handling missing values in a dataset.

```python
from sklearn.datasets import load_iris
from sklearn.impute import KNNImputer
import numpy as np
# Load data
iris = load_iris()
X = iris.data
# Introduce missing values
np.random.seed(42)
mask = np.random.choice([True, False], size=X.shape, p=[0.1, 0.9])
X_with_missing = X.copy()
X_with_missing[mask] = np.nan
# Apply KNN imputation
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X_with_missing)
print("Original Data (first 5 rows):\n", X[:5])
print("Imputed Data (first 5 rows):\n", X_imputed[:5])
```

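**Output** (partial; the imputed rows shown are illustrative, since any entry that was masked is replaced by an average over its neighbors and may differ slightly from the original value):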
```
Original Data (first 5 rows):
 [[5.1 3.5 1.4 0.2]
  [4.9 3.  1.4 0.2]
  [4.7 3.2 1.3 0.2]
  [4.6 3.1 1.5 0.2]
  [5.  3.6 1.4 0.2]]
Imputed Data (first 5 rows):
 [[5.1 3.5 1.4 0.2]
  [4.9 3.  1.4 0.2]
  [4.7 3.2 1.3 0.2]
  [4.6 3.1 1.5 0.2]
  [5.  3.6 1.4 0.2]]
```

---

#### 34. Train a PCA model and visualize the data projection onto the first two principal components.

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Plot
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Projection of Iris Data')
plt.savefig('pca_projection.png')
```

**Output**: Saves `pca_projection.png` showing data in 2D PCA space.

---

#### 35. Train a KNN Classifier using the KD Tree and Ball Tree algorithms and compare performance.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import time
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# KD Tree (timing covers both fitting and prediction)
start = time.perf_counter()
knn_kd = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn_kd.fit(X_train, y_train)
acc_kd = accuracy_score(y_test, knn_kd.predict(X_test))
time_kd = time.perf_counter() - start
# Ball Tree (timing covers both fitting and prediction)
start = time.perf_counter()
knn_ball = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')
knn_ball.fit(X_train, y_train)
acc_ball = accuracy_score(y_test, knn_ball.predict(X_test))
time_ball = time.perf_counter() - start
print(f"KD Tree Accuracy: {acc_kd:.2f}, Time: {time_kd:.4f}s")
print(f"Ball Tree Accuracy: {acc_ball:.2f}, Time: {time_ball:.4f}s")
```

**Output** (times vary):
```
KD Tree Accuracy: 1.00, Time: 0.0020s
Ball Tree Accuracy: 1.00, Time: 0.0025s
```

---

#### 36. Train a PCA model on a high-dimensional dataset and visualize the Scree plot.

```python
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
# Generate high-dimensional data
X, y = make_classification(n_samples=100, n_features=20, n_informative=10, random_state=42)
# Apply PCA
pca = PCA()
pca.fit(X)
# Plot Scree plot
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, marker='o')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot for PCA')
plt.savefig('pca_scree_plot.png')
```

**Output**: Saves `pca_scree_plot.png` showing the explained variance ratio of each principal component.

---



#### 37. Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
# Calculate metrics
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")
```

**Output**:
```
Precision: 1.00
Recall: 1.00
F1-Score: 1.00
```

---

#### 38. Train a PCA model and analyze the effect of different numbers of components on accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Test different numbers of components
components = range(1, 5)
accuracies = []
for n in components:
    pca = PCA(n_components=n)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train_pca, y_train)
    accuracies.append(accuracy_score(y_test, knn.predict(X_test_pca)))
# Plot
plt.plot(components, accuracies, marker='o')
plt.xlabel('Number of PCA Components')
plt.ylabel('Accuracy')
plt.title('Effect of PCA Components on KNN Accuracy')
plt.savefig('pca_components_accuracy.png')
```

**Output**: Saves `pca_components_accuracy.png` showing accuracy vs. components.

---

#### 39. Train a KNN Classifier with different leaf_size values and compare accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Test different leaf sizes
leaf_sizes = [10, 30, 50]
for leaf_size in leaf_sizes:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', leaf_size=leaf_size)
    knn.fit(X_train, y_train)
    acc = accuracy_score(y_test, knn.predict(X_test))
    print(f"Leaf Size {leaf_size} Accuracy: {acc:.2f}")
```

**Output**:
```
Leaf Size 10 Accuracy: 1.00
Leaf Size 30 Accuracy: 1.00
Leaf Size 50 Accuracy: 1.00
```

**Note**: `leaf_size` affects tree construction time, query time, and memory use, not accuracy. The tree search is exact, so predictions are the same for every leaf size.

---

#### 40. Train a PCA model and visualize how data points are transformed before and after PCA.

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
# Load data
iris = load_iris()
X, y = iris.data[:, :2], iris.target  # Use first two features for visualization
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')
ax1.set_title('Original Data')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax2.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
ax2.set_title('PCA Transformed Data')
ax2.set_xlabel('Principal Component 1')
ax2.set_ylabel('Principal Component 2')
plt.savefig('pca_transformation.png')
```

**Output**: Saves `pca_transformation.png` showing original vs. PCA-transformed data.

---

#### 41. Train a KNN Classifier on a real-world dataset (Wine dataset) and print classification report.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
# Load data
wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
# Print classification report
print(classification_report(y_test, knn.predict(X_test_scaled)))
```

**Output**:
```
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        14
           1       0.93      0.93      0.93        14
           2       1.00      1.00      1.00         8

    accuracy                           0.94        36
   macro avg       0.95      0.95      0.95        36
weighted avg       0.94      0.94      0.94        36
```

---

#### 42. Train a KNN Regressor and analyze the effect of different distance metrics on prediction error.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=2, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Test distance metrics
metrics = ['euclidean', 'manhattan']
mses = []
for metric in metrics:
    knn = KNeighborsRegressor(n_neighbors=5, metric=metric)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    mses.append(mean_squared_error(y_test, y_pred))
for metric, mse in zip(metrics, mses):
    print(f"{metric.capitalize()} MSE: {mse:.2f}")
```

**Output**:
```
Euclidean MSE: 103.45
Manhattan MSE: 115.72
```
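
A single 80/20 split on 100 samples can make one metric look better by chance. As a hedged sketch, the comparison could be repeated with 5-fold cross-validation on the same synthetic data; adding `chebyshev` and using `cv=5` are choices made for this illustration, not part of the original task.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Same synthetic data as above
X, y = make_regression(n_samples=100, n_features=2, noise=10, random_state=42)

cv_mse = {}
for metric in ['euclidean', 'manhattan', 'chebyshev']:
    knn = KNeighborsRegressor(n_neighbors=5, metric=metric)
    # scikit-learn reports negated MSE so that higher is better; flip the sign back
    scores = cross_val_score(knn, X, y, cv=5, scoring='neg_mean_squared_error')
    cv_mse[metric] = -scores.mean()

for metric, mse in cv_mse.items():
    print(f"{metric.capitalize()} CV MSE: {mse:.2f}")
```

If the ranking of metrics changes between folds or differs from the single-split result, the difference between them is probably within noise for this dataset.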

---

#### 43. Train a KNN Classifier and evaluate using ROC-AUC score.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
# Load data (use binary classification for simplicity)
iris = load_iris()
X, y = iris.data, (iris.target != 0).astype(int)  # Binary: setosa (0) vs. non-setosa (1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
# Calculate ROC-AUC
y_score = knn.predict_proba(X_test_scaled)[:, 1]
roc_auc = roc_auc_score(y_test, y_score)
print(f"ROC-AUC Score: {roc_auc:.2f}")
```

**Output**:
```
ROC-AUC Score: 1.00
```
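
A perfect score is expected here because setosa is easy to separate from the other two species. For the full three-class problem, `roc_auc_score` accepts the class-probability matrix directly with `multi_class='ovr'`. The sketch below assumes a stratified split and a scaling pipeline, both of which are choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# Keep all three Iris classes this time
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling and KNN in one pipeline so the scaler is fit on training data only
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# One probability column per class; each row sums to 1
y_proba = model.predict_proba(X_test)

# One-vs-rest AUC, averaged over the three classes with equal weight
roc_auc_ovr = roc_auc_score(y_test, y_proba, multi_class='ovr', average='macro')
print(f"Macro-averaged one-vs-rest ROC-AUC: {roc_auc_ovr:.2f}")
```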

---

#### 44. Train a PCA model and visualize the variance captured by each principal component.

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
# Load data
iris = load_iris()
X = iris.data
# Apply PCA
pca = PCA()
pca.fit(X)
# Plot variance per component
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Variance Captured by PCA Components')
plt.savefig('pca_variance_components.png')
```

**Output**: Saves `pca_variance_components.png` showing variance per component.

---

#### 45. Train a KNN Classifier and perform feature selection before training.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Feature selection
selector = SelectKBest(score_func=f_classif, k=2)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_selected, y_train)
acc = accuracy_score(y_test, knn.predict(X_test_selected))
print(f"Accuracy with Feature Selection: {acc:.2f}")
```

**Output**:
```
Accuracy with Feature Selection: 0.97
```
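
To see which two features `SelectKBest` actually kept, the fitted selector exposes its scores and a boolean mask. This is a small sketch on the same split as above; the name `selected_features` is introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Fit the selector on the training split only
selector = SelectKBest(score_func=f_classif, k=2).fit(X_train, y_train)

# ANOVA F-score per feature; higher means stronger class separation
for name, score in zip(iris.feature_names, selector.scores_):
    print(f"{name}: F = {score:.1f}")

# Boolean mask marking the k features that were kept
mask = selector.get_support()
selected_features = [name for name, keep in zip(iris.feature_names, mask) if keep]
print("Selected features:", selected_features)
```

On Iris, the two petal measurements have much higher F-scores than the sepal measurements, so they are the ones retained.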

---

#### 46. Train a PCA model and visualize the data reconstruction error after reducing dimensions.

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
# Load data
iris = load_iris()
X = iris.data
# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Calculate reconstruction error
components = range(1, 5)
errors = []
for n in components:
    pca = PCA(n_components=n)
    X_pca = pca.fit_transform(X_scaled)
    X_reconstructed = pca.inverse_transform(X_pca)
    error = np.mean((X_scaled - X_reconstructed) ** 2)
    errors.append(error)
# Plot
plt.plot(components, errors, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Mean Reconstruction Error')
plt.title('PCA Reconstruction Error')
plt.savefig('pca_reconstruction_error.png')
```

**Output**: Saves `pca_reconstruction_error.png` showing error vs. components.
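
On standardized data, the mean squared reconstruction error should equal the fraction of variance that the retained components do not explain. The sketch below checks that relationship numerically; it assumes the same scaled Iris data as above, and `unexplained` is a name introduced for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Fraction of variance left unexplained after keeping 1, 2, 3, 4 components
pca_full = PCA().fit(X_scaled)
unexplained = 1 - np.cumsum(pca_full.explained_variance_ratio_)

# Mean squared reconstruction error for the same component counts
errors = []
for n in range(1, 5):
    pca = PCA(n_components=n)
    X_reconstructed = pca.inverse_transform(pca.fit_transform(X_scaled))
    errors.append(np.mean((X_scaled - X_reconstructed) ** 2))

for n, (err, unexp) in enumerate(zip(errors, unexplained), start=1):
    print(f"{n} components: error = {err:.4f}, unexplained variance = {unexp:.4f}")
```

The match relies on each standardized feature having unit variance, so the same identity would not hold in this exact form for unscaled data.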

---

#### 47. Train a KNN Classifier and visualize the decision boundary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
# Load Iris (use first two features)
iris = load_iris()
X, y = iris.data[:, :2], iris.target
# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
# Create mesh grid
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')
plt.xlabel(iris.feature_names[0])  # sepal length (cm)
plt.ylabel(iris.feature_names[1])  # sepal width (cm)
plt.title('KNN Decision Boundary')
plt.savefig('knn_decision_boundary.png')
```

**Output**: Saves `knn_decision_boundary.png` showing decision boundary.

---

#### 48. Train a PCA model and analyze the effect of different numbers of components on data variance.

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np
# Load data
iris = load_iris()
X = iris.data
# Apply PCA
pca = PCA()
pca.fit(X)
# Plot cumulative variance
cum_variance = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cum_variance) + 1), cum_variance, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Effect of PCA Components on Variance')
plt.savefig('pca_variance_components_analysis.png')
```
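
A common follow-up to the cumulative variance curve is picking the smallest number of components that reaches a target. The sketch below assumes a 95% threshold, which is an arbitrary choice for illustration, and compares a manual lookup with scikit-learn's built-in behavior when `n_components` is a float between 0 and 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
threshold = 0.95  # assumed target for this example

# Manual lookup on the cumulative explained variance
pca = PCA().fit(X)
cum_variance = np.cumsum(pca.explained_variance_ratio_)
# argmax returns the first index where the condition is True; +1 turns it into a count
n_needed = int(np.argmax(cum_variance >= threshold)) + 1

# Passing a float in (0, 1) lets PCA choose the component count itself
pca_auto = PCA(n_components=threshold).fit(X)

print(f"Components needed (manual): {n_needed}")
print(f"Components chosen by PCA:   {pca_auto.n_components_}")
```

For unscaled Iris, the first component alone explains a bit over 92% of the variance, so two components are needed to pass the 95% mark.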

