# Module72 KNN Assignment2

Q1. What is the main difference between the Euclidean distance metric and the Manhattan distance metric in KNN? How might this difference affect the performance of a KNN classifier or regressor?

A1. **Euclidean Distance:** Measures the straight-line (direct) distance between two points. Because each feature difference is squared before summing, large differences dominate the result, which makes it sensitive to outliers.

**Manhattan Distance:** Measures the grid-like (step-wise) distance between two points. It focuses on the sum of absolute differences and is less sensitive to outliers.

# Impact on KNN Performance:

**Euclidean Distance** is better suited for smooth, continuous data where relationships between features are geometric.

**Manhattan Distance** often works better with high-dimensional data or datasets with sparse features, because feature differences are summed without squaring, so no single large difference dominates the distance.

Choosing the wrong distance metric can lead to suboptimal predictions, especially if the metric does not align with the data’s structure.
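To make the difference concrete, here is a minimal sketch using NumPy. The helper names `euclidean` and `manhattan` and the sample points are illustrative assumptions, not part of any library. It compares two points that have the same total absolute difference from the origin, once spread across both features and once concentrated in a single feature.

```python
import numpy as np

def euclidean(a, b):
    # Square root of the sum of squared feature differences
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    # Sum of absolute feature differences
    return float(np.sum(np.abs(a - b)))

origin = np.array([0.0, 0.0])
spread = np.array([3.0, 4.0])        # difference spread over both features
concentrated = np.array([0.0, 7.0])  # same total difference in one feature

print(euclidean(origin, spread), manhattan(origin, spread))              # 5.0 7.0
print(euclidean(origin, concentrated), manhattan(origin, concentrated))  # 7.0 7.0
```

Manhattan distance treats both points as equally far away, while Euclidean distance ranks the point with one large feature difference as farther. This is why a single outlying feature value can change which neighbors are selected under Euclidean distance.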

Q2. How do you choose the optimal value of k for a KNN classifier or regressor? What techniques can be used to determine the optimal k value?

A2.

# Challenges of k Selection:

1.) **Small k (e.g., 1):** Sensitive to noise, leading to overfitting.

2.) **Large k:** Tends to oversmooth the decision boundary, leading to underfitting.

# Techniques to Choose Optimal k:

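1.) **Cross-Validation:** Split the data into several training and validation folds, evaluate the model's performance for various k values, and pick the k with the best average validation score. (The number of folds is unrelated to the k in KNN.)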

2.) **Elbow Method:** Plot the error rate (or validation accuracy) against different k values and choose the smallest k beyond which the error stops improving noticeably (the elbow point).

3.) **Grid Search:** Use automated hyperparameter tuning tools like GridSearchCV to search for the best k.

4.) **Domain Knowledge:** Consider the problem context to decide whether a higher or lower k makes sense.
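The cross-validation approach above can be sketched as follows. This is one possible setup, assuming scikit-learn is available. The Iris dataset, the odd k values from 1 to 29, and 5 folds are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Odd k values avoid some voting ties; the range itself is an assumption
k_values = list(range(1, 31, 2))
mean_scores = {}
for k in k_values:
    # Scaling sits inside the pipeline so it is refit on each training fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
print(best_k, round(mean_scores[best_k], 3))
```

Plotting `mean_scores` against `k_values` gives the curve used by the elbow method.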

Q3. How does the choice of distance metric affect the performance of a KNN classifier or regressor? In what situations might you choose one distance metric over the other?

A3.

# Effect on Performance:

Different distance metrics emphasize different aspects of the data.

Metrics like Euclidean distance work well when the data is normalized and relationships are geometric.

Manhattan distance is often preferable for high-dimensional data. Like Euclidean distance, it is affected by feature scale, so features on different scales should still be normalized first.


# Situations:

1.) **Euclidean Distance:**

Use for continuous, smooth data.

Sensitive to feature scaling and outliers, so normalization is critical.

2.) **Manhattan Distance:**

Use for sparse or high-dimensional datasets.

Less affected by outliers in individual features, since differences are not squared.

3.) **Other Metrics:**

**Minkowski Distance:** Generalization of both Euclidean and Manhattan, controlled by a power parameter p (p = 1 gives Manhattan, p = 2 gives Euclidean).

**Cosine Similarity:** Use when angles or relative proportions matter more than magnitude.
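As a rough illustration of these other metrics, the sketch below implements Minkowski distance and cosine distance (one minus cosine similarity) with NumPy. The function names and sample vectors are hypothetical choices made for this example.

```python
import numpy as np

def minkowski(a, b, p):
    # p = 1 reduces to Manhattan distance, p = 2 to Euclidean distance
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

def cosine_distance(a, b):
    # 1 - cosine similarity; depends only on the angle between the vectors
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

print(minkowski(a, b, p=1))    # Manhattan distance
print(minkowski(a, b, p=2))    # Euclidean distance
print(cosine_distance(a, b))   # close to zero: the vectors point the same way
```

The two vectors are far apart under both Minkowski settings but nearly identical under cosine distance, which is the behavior wanted when only relative proportions matter.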


Q4. What are some common hyperparameters in KNN classifiers and regressors, and how do they affect the performance of the model? How might you go about tuning these hyperparameters to improve model performance?

A4.

# Common Hyperparameters:

1.) **k (Number of Neighbors):**

Determines how many neighbors influence the prediction.

Low k: Risk of overfitting.

High k: Risk of underfitting.


2.) **Distance Metric:**

Determines how neighbors are identified.

Common choices: Euclidean, Manhattan, Minkowski.


3.) **Weighting Scheme:**

**Uniform:** All neighbors contribute equally.

**Distance-based:** Closer neighbors have higher influence.

# Tuning Hyperparameters:

**Grid Search/Random Search:** Explore combinations of k, distance metrics, and weighting schemes.

**Cross-Validation:** Use k-fold cross-validation to evaluate performance across different hyperparameters.

**Automated Tuning:** Tools like GridSearchCV or Bayesian optimization.
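A minimal sketch of tuning all three hyperparameters together with `GridSearchCV` is shown below. The Wine dataset, the candidate values in `param_grid`, and the step names `scale` and `knn` are assumptions chosen for illustration. With scikit-learn's default Minkowski metric, `p=1` corresponds to Manhattan distance and `p=2` to Euclidean distance.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),      # refit on each training fold
    ("knn", KNeighborsClassifier()),
])

# Candidate values are illustrative, not recommendations
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline means each cross-validation fold is scaled using only its own training portion, which avoids leaking validation data into preprocessing.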

Q5. How does the size of the training set affect the performance of a KNN classifier or regressor? What techniques can be used to optimize the size of the training set?

A5.

# Impact of Training Set Size:

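Larger training sets generally improve KNN performance, because denser sampling means the nearest neighbors are closer to the query point and better represent the local data distribution.

Very large datasets increase memory use and prediction time, since each query must be compared against the stored training points. In high-dimensional spaces, the amount of data needed to keep neighbors close grows rapidly, so adding more samples helps less.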


# Techniques to Optimize Training Set Size:

1.) **Feature Selection:** Remove irrelevant features to reduce dimensionality.

2.) **Sampling:** Use techniques like stratified sampling or undersampling to create representative subsets.

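3.) **Dimensionality Reduction:** Apply PCA to reduce the number of features while retaining most of the meaningful information. t-SNE is mainly a visualization technique and is not designed to project new query points, so it is a poor fit as a KNN preprocessing step.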

4.) **Faster Neighbor Search:** Use indexing structures like KD-Trees or Ball Trees for faster exact neighbor searches, or Approximate Nearest Neighbors (ANN) methods when a small loss in accuracy is acceptable.
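In scikit-learn, the search structure is selected with the `algorithm` parameter. The sketch below, which uses randomly generated data as a stand-in for a real dataset, checks that KD-Tree and Ball Tree searches return the same neighbor distances as brute force. They change how fast the neighbors are found, not which neighbors are found.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 3))  # synthetic training points (assumption)
X_query = rng.normal(size=(5, 3))     # synthetic query points (assumption)

neighbor_distances = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    index = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X_train)
    distances, indices = index.kneighbors(X_query)
    neighbor_distances[algorithm] = distances

# Compare each tree-based search against the brute-force result
for algorithm in ("kd_tree", "ball_tree"):
    same = np.allclose(neighbor_distances["brute"],
                       neighbor_distances[algorithm], atol=1e-6)
    print(algorithm, same)
```

Tree-based indexes tend to help most in low-dimensional data. As dimensionality grows, their advantage over brute force usually shrinks.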


Q6. What are some potential drawbacks of using KNN as a classifier or regressor? How might you overcome these drawbacks to improve the performance of the model?

A6.

# Drawbacks:

1.) **Computational Complexity:**

KNN requires storing the entire training dataset and computing distances for each prediction, which can be computationally expensive.

**Solution:** Use Approximate Nearest Neighbors (ANN) algorithms or indexing structures like KD-Trees.

2.) **Sensitive to Irrelevant Features:**

Irrelevant or noisy features can distort distance calculations.

**Solution:** Perform feature selection or dimensionality reduction.

3.) **Feature Scaling:**

Distance metrics can be dominated by features with larger scales.

**Solution:** Apply normalization or standardization.

4.) **Curse of Dimensionality:**

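In high-dimensional data, the distances to the nearest and farthest points become similar, so neighbors appear almost equidistant and the notion of a "nearest" neighbor loses meaning.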

**Solution:** Use dimensionality reduction techniques like PCA or LDA.


5.) **Imbalanced Data:**

KNN struggles when classes are imbalanced, as neighbors from the majority class dominate.

**Solution:** Use weighted KNN or balance the dataset with oversampling/undersampling.

6.) **Overfitting to Noise:**

Small values of k make KNN sensitive to noise.

**Solution:** Choose an optimal k value using cross-validation.

7.) **Storage Requirement:**

Storing the entire dataset is memory-intensive.

**Solution:** Reduce the dataset size using clustering or sampling techniques.
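The feature-scaling drawback listed above is easy to check empirically. The following sketch, assuming scikit-learn and its bundled Wine dataset (whose features have very different ranges), compares the same KNN classifier with and without standardization. The choice of `n_neighbors=5` and 5 folds is arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Same classifier, with and without standardizing the features first
unscaled = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

unscaled_score = cross_val_score(unscaled, X, y, cv=5).mean()
scaled_score = cross_val_score(scaled, X, y, cv=5).mean()
print(round(unscaled_score, 3), round(scaled_score, 3))
```

On this dataset the standardized pipeline should score noticeably higher, because without scaling the distances are dominated by the features with the largest numeric ranges.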