# KNN Assignment 2

### Q1. What is the main difference between the Euclidean distance metric and the Manhattan distance metric in KNN? How might this difference affect the performance of a KNN classifier or regressor?


**Euclidean Distance:**

| Aspect                  | Description                                          |
|-------------------------|------------------------------------------------------|
| Formula                 | `sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2 + ...)`     |
| Sensitivity             | Squares each difference, so large gaps and outliers dominate; sensitive to feature scale. |
| Anisotropy              | Suitable for isotropic (similar in all directions) relationships between features. |

**Manhattan Distance:**

| Aspect                  | Description                                          |
|-------------------------|------------------------------------------------------|
| Formula                 | `abs(x1-x2) + abs(y1-y2) + abs(z1-z2) + ...`         |
| Sensitivity             | Less sensitive to outliers, since differences are not squared; still sensitive to feature scale. |
| Anisotropy              | Suitable for anisotropic (differing in various directions) relationships between features. |

**How might this difference affect the performance of a KNN classifier or regressor?**

![download.jpg](attachment:8ccf951b-20e4-468a-b4c0-c5a3e49a6539.jpg)
 
**Euclidean Distance**:

- Imagine it's like measuring the straight-line distance between two points. It cares about both how far apart they are and in which direction they are separated.

- It can be sensitive to differences in scale (like comparing meters to millimeters) and can be affected by outliers (unusual data points).

- Use it when your data's relationships between features are similar in all directions, like measuring distances in a balanced way.

**Manhattan Distance**:

- Think of it as measuring the distance by adding up the steps you need to take when moving in a grid-like city. It only considers how many steps you take in different directions.

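- It's less sensitive to outliers because it adds up the absolute differences along each axis instead of squaring them, so one unusually large difference does not dominate the total. It is still affected by differences in scale, so features should be rescaled first.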

- Use it when your data's relationships between features are different in various directions, like navigating a city with lots of right-angle turns.

In KNN, choosing the right one depends on your data. If your features are on comparable scales and relate to the target in similar ways, Euclidean is a reasonable default. If the data has outliers, many features, or features that behave differently along each axis, Manhattan is often more robust. Neither metric is guaranteed to win, so it's best to try both with cross-validation and keep the one that works better for your specific problem.
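
To make the two formulas concrete, here is a minimal sketch using only the Python standard library. The points are hypothetical values chosen for illustration; the second example shows how a single large coordinate gap takes up a bigger share of the Euclidean distance than of the Manhattan distance.

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block (L1) distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical points chosen for illustration only.
p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0

# One coordinate differs much more than the others (an outlier-like gap).
a, b = (0.0, 0.0, 0.0), (1.0, 1.0, 10.0)
share_l2 = 10.0 ** 2 / euclidean(a, b) ** 2  # fraction of the squared L2 distance
share_l1 = 10.0 / manhattan(a, b)            # fraction of the L1 distance
print(round(share_l2, 2), round(share_l1, 2))  # 0.98 0.83
```

The large gap accounts for about 98% of the squared Euclidean distance but about 83% of the Manhattan distance, which is why Manhattan is usually described as the more outlier-tolerant of the two.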

### Q2. How do you choose the optimal value of k for a KNN classifier or regressor? What techniques can be used to determine the optimal k value?

1. **Odd vs. Even K**:
   - For a binary classification problem (two classes), it's better to choose an odd K value to prevent ties when voting for the class, making predictions less ambiguous.

2. **Cross-Validation**:
   - A reliable way to find the right K is k-fold cross-validation: split your data into several folds, train on all but one, validate on the held-out fold, and repeat for each candidate K. Pick the K with the best average validation performance. A small K tends to overfit and a large K tends to underfit, and validation exposes both.

3. **Domain Knowledge**:
   - If you know your problem well, consider your domain knowledge when selecting K. It can provide valuable insights into what neighborhood size is most meaningful for your specific application.

4. **Rule of Thumb**:
   - As a starting point, you can use K = sqrt(N), where N is the total number of data points in your dataset. It's a reasonable initial choice for K.
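
The selection procedure above can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the toy dataset, the helper names `knn_predict` and `cv_accuracy`, and the candidate list are all assumptions made for the example. In practice a library such as scikit-learn would normally handle the cross-validation.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def cv_accuracy(X, y, k, n_folds=5):
    """Mean k-NN accuracy over n_folds folds; point i belongs to fold i % n_folds."""
    scores = []
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        test = [i for i in range(len(X)) if i % n_folds == fold]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        hits = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / n_folds

# Toy data (an assumption for illustration): two overlapping Gaussian blobs.
rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
X += [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(60)]
y = [0] * 60 + [1] * 60

# Odd candidates avoid ties in this two-class problem; sqrt(120) is about 11.
candidates = [1, 3, 5, 7, 9, 11]
cv_scores = {k: cv_accuracy(X, y, k) for k in candidates}
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_k)
```

The candidate list combines two of the heuristics above: it contains only odd values and it stops near `sqrt(N)`. Cross-validation then picks among them.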

### Q3. How does the choice of distance metric affect the performance of a KNN classifier or regressor? In what situations might you choose one distance metric over the other?


- **Euclidean Distance** is like measuring distance in a straight line. It's great when everything is balanced and you care about both how far apart things are and in what direction they're separated. It's sensitive to differences in size and can be affected by unusual data.

- **Manhattan Distance** is like measuring distance by counting steps in a city with streets running north-south and east-west. It adds up the differences along each feature separately, so a single very large difference counts for less than it does under Euclidean distance. That makes it less bothered by unusual data, and it often holds up better when there are many features.

- **Choosing the Right One**: Use Euclidean when features are on comparable scales and a straight-line notion of closeness makes sense. Use Manhattan when the data has outliers, many features, or features that should each contribute independently. Sometimes, it's good to try both and see what works best for your specific problem.

- If you have a lot of data with many features, you might want to think about reducing the number of features, so you don't get bogged down by the calculations. This is especially important when dealing with high-dimensional data.

In a nutshell, the choice between Euclidean and Manhattan distance depends on the nature of your data. One might work better than the other depending on how balanced your data is and whether you care more about size or direction. Experiment to see which one gives you better results for your problem.
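
A small sketch shows that the metric can change which point counts as the nearest neighbor, and therefore the prediction. Both metrics are special cases of the Minkowski distance (p = 1 for Manhattan, p = 2 for Euclidean). The query and the two named neighbors are hypothetical values picked so that the ranking flips.

```python
def minkowski(a, b, p):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

query = (0.0, 0.0)
# Hypothetical neighbors: "diagonal" differs a little on both features,
# "axis" differs a lot on one feature only.
neighbors = {"diagonal": (3.0, 3.0), "axis": (0.0, 5.0)}

for p in (1, 2):
    nearest = min(neighbors, key=lambda name: minkowski(query, neighbors[name], p))
    print(p, nearest)
# 1 axis
# 2 diagonal
```

Under Manhattan distance the "axis" point is closer (5 versus 6), while under Euclidean distance the "diagonal" point is closer (about 4.24 versus 5).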

### Q4. What are some common hyperparameters in KNN classifiers and regressors, and how do they affect the performance of the model? How might you go about tuning these hyperparameters to improve model performance?


| Hyperparameter          | Effect on Performance                               | Tuning Strategy                                        |
|------------------------|----------------------------------------------------|--------------------------------------------------------|
| K (Number of Neighbors) | Smaller K: sensitive to noise, potential overfitting. Larger K: oversmoothing, potential underfitting. | Use cross-validation or grid search to experiment with different K values and choose the one that minimizes validation error. |
| Distance Metric        | Determines how similarity is calculated (e.g., Euclidean, Manhattan); should match the data's characteristics. | Experiment with different distance metrics and keep the one that validates best, taking feature scale and outliers into account. |
| Weighting Scheme       | Uniform: all neighbors have equal weight. Distance-based: closer neighbors have more influence. | Experiment with different weighting schemes to see which one yields better results. Distance-based weighting can be beneficial when closer neighbors are more informative. |
| Algorithm for Finding Neighbors | Different algorithms (e.g., brute force, Ball Tree, KD Tree) affect computational efficiency, not the predictions of an exact search. | Depending on dataset size and dimensionality, test different neighbor search algorithms and keep the fastest one. |
| Dimensionality Reduction | A preprocessing choice rather than a KNN setting: reduces the number of features and addresses the "curse of dimensionality". | Experiment with dimensionality reduction techniques, like PCA, and tune the number of retained components together with K. |
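
The tuning strategy in the table can be sketched as a small grid search. This is a hedged, standard-library-only illustration: the toy data, the helper names, and the grid values are assumptions, and leave-one-out accuracy stands in for whatever validation scheme you prefer. The neighbor search here is brute force, since the search algorithm only affects speed.

```python
import itertools
import random
from collections import defaultdict

def minkowski(a, b, p):
    """p=1 is Manhattan distance, p=2 is Euclidean distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train_X, train_y, query, k, p, weights):
    """Weighted vote among the k nearest neighbors."""
    dists = sorted((minkowski(x, query, p), label) for x, label in zip(train_X, train_y))
    votes = defaultdict(float)
    for dist, label in dists[:k]:
        # "uniform": every neighbor counts once; "distance": closer neighbors count more.
        votes[label] += 1.0 if weights == "uniform" else 1.0 / (dist + 1e-9)
    return max(votes, key=votes.get)

def loo_accuracy(X, y, k, p, weights):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k, p, weights) == y[i]
    return hits / len(X)

# Toy data (an assumption for illustration): two overlapping Gaussian blobs.
rng = random.Random(1)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
X += [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(40)]
y = [0] * 40 + [1] * 40

# Hypothetical grid: every combination of K, metric (p), and weighting scheme.
grid = {"k": [1, 3, 5, 7], "p": [1, 2], "weights": ["uniform", "distance"]}
results = {
    combo: loo_accuracy(X, y, *combo)
    for combo in itertools.product(grid["k"], grid["p"], grid["weights"])
}
best_combo = max(results, key=results.get)
print(best_combo, results[best_combo])
```

Searching the hyperparameters jointly matters because they interact: the best K under uniform weighting is not necessarily the best K under distance weighting.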


### Q5. How does the size of the training set affect the performance of a KNN classifier or regressor? What techniques can be used to optimize the size of the training set?



| Training Set Size     | Small Training Set                                                                                              | Large Training Set                                                                                     |
|-----------------------|------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
| Effect on Performance | - The model may not capture underlying patterns effectively. - Sensitivity to noise and high variance, leading to overfitting. | - More data to learn from, capturing true underlying patterns. - Reduced risk of overfitting and better overall performance. |
| Issues                | - Overfitting - Poor generalization to unseen data - Unreliable model predictions                                       | - Slower predictions, since every query is compared against more stored points - Higher memory use - Data management challenges - More outliers and noisy points to handle |
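
One way to see the effect of training set size is a learning curve: train on growing subsets and score each on the same held-out test set. The sketch below uses toy Gaussian data and a fixed K of 5, both assumptions made for illustration; on real data the shape of the curve will differ.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def make_blobs(n_per_class, rng):
    """Toy two-class data: Gaussian blobs centered at (0, 0) and (2, 2), shuffled."""
    points = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(n_per_class)]
    points += [((rng.gauss(2, 1), rng.gauss(2, 1)), 1) for _ in range(n_per_class)]
    rng.shuffle(points)
    return [x for x, _ in points], [label for _, label in points]

rng = random.Random(42)
train_X, train_y = make_blobs(200, rng)  # 400 training points in total
test_X, test_y = make_blobs(100, rng)    # 200 held-out test points

# Accuracy on the same test set when training on the first n points only.
learning_curve = {}
for n in (10, 25, 50, 100, 200, 400):
    hits = sum(
        knn_predict(train_X[:n], train_y[:n], x) == label
        for x, label in zip(test_X, test_y)
    )
    learning_curve[n] = hits / len(test_X)
print(learning_curve)
```

If the curve has flattened by the largest size, collecting more data is unlikely to help much, and a smaller subset may give nearly the same accuracy with faster predictions.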



Optimizing the size of the training set in a K-Nearest Neighbors (KNN) model involves strategies to make the most of the available data. Here are techniques that can be used to optimize the training set size:

1. **Data Collection**:
   - Collect more data: If possible, gather additional data from various sources to increase the size of your training set.

2. **Data Sampling**:
   - Bootstrapping: Resample your existing data with replacement to create alternative training sets. This adds no new information, but it helps estimate how stable the model is.
   - Random sampling: Randomly select a subset of your data for training, creating a smaller but diverse training set.

3. **Feature Selection**:
   - Carefully choose the most informative features to include in your training set. Removing irrelevant or redundant features can effectively reduce the dimensionality of your data.

4. **Dimensionality Reduction**:
   - Use dimensionality reduction techniques like Principal Component Analysis (PCA) to reduce the number of features while preserving important information.

5. **Data Augmentation**:
   - Create synthetic data by applying domain knowledge or data augmentation techniques. For instance, in image classification, you can rotate, crop, or flip existing images to generate more training examples.

6. **Cross-Validation**:
   - Implement cross-validation techniques (e.g., k-fold cross-validation) to make more efficient use of your existing data. Cross-validation divides your data into training and validation sets multiple times, helping you assess your model's performance more accurately.

7. **Active Learning**:
   - Use active learning strategies to intelligently select which data points to label and include in the training set, focusing on the most informative examples.

8. **Incremental Learning**:
   - Implement incremental or online learning techniques, where you continuously update your model as new data becomes available. This can be useful for adapting to changing data distributions.

9. **Weighted Sampling**:
   - If certain subsets of your data are more representative or valuable, use weighted sampling to increase the prominence of those subsets in your training data.

10. **Domain Knowledge**:
    - Leverage domain expertise to curate a more focused training set that captures the most important aspects of your problem.

11. **Synthetic Data Generation**:
    - For imbalanced datasets, generate synthetic data for minority classes to balance the training set.

12. **Data Imputation**:
    - Use data imputation techniques to fill in missing values and make the most of incomplete data.
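
Two of the sampling ideas above, bootstrapping and balancing an imbalanced dataset, can be sketched with the standard library. The dataset is hypothetical, and the balancing step shown is plain random oversampling (repeating minority rows), not a synthetic-data method that creates new points.

```python
import random
from collections import Counter

rng = random.Random(0)

# Hypothetical imbalanced dataset: 90 majority-class rows, 10 minority-class rows.
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "majority") for _ in range(90)]
data += [((rng.gauss(3, 1), rng.gauss(3, 1)), "minority") for _ in range(10)]

# Bootstrapping: draw len(data) rows with replacement.
bootstrap_sample = rng.choices(data, k=len(data))

# Random oversampling: repeat minority rows until the classes are the same size.
majority = [row for row in data if row[1] == "majority"]
minority = [row for row in data if row[1] == "minority"]
extra = rng.choices(minority, k=len(majority) - len(minority))
balanced = majority + minority + extra

print(Counter(label for _, label in balanced))
# Counter({'majority': 90, 'minority': 90})
```

Resampling should be applied to the training portion only. Duplicated rows that leak into a validation set make KNN look better than it is, because each duplicate finds its own copy as a nearest neighbor.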


### Q6. What are some potential drawbacks of using KNN as a classifier or regressor? How might you overcome these drawbacks to improve the performance of the model?



| Drawback                         | How to Overcome                               |
|----------------------------------|-----------------------------------------------|
| Computational Intensity          | Use efficient data structures (e.g., KD-trees or Ball trees) for faster neighbor search. |
| Curse of Dimensionality          | Apply dimensionality reduction techniques like PCA to reduce the number of features. |
| Sensitivity to Scale and Outliers | Standardize or normalize features so they are comparable; for outliers, use robust scaling, remove clear anomalies, or use a larger K. |
| Sensitivity to K and Distance Metric | Experiment with different values of K and distance metrics using cross-validation or grid search. |
| Ineffective for Imbalanced Data  | Use resampling techniques (e.g., oversampling, undersampling, or synthetic data generation) to address imbalanced datasets. |
| Memory Usage                     | KNN stores the entire training set; reduce it by sampling or prototype selection, reduce dimensionality, or use approximate nearest-neighbor search for large datasets. |
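
The scaling fix from the table can be sketched as follows. The rows are hypothetical (height in meters, weight in grams), and `standardize` and `nearest_index` are helper names introduced for this example. Standardization handles scale differences; it does not by itself remove the influence of outliers.

```python
import math
import statistics

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def nearest_index(points, query_index):
    """Index of the point closest (Euclidean) to points[query_index], excluding itself."""
    others = [i for i in range(len(points)) if i != query_index]
    return min(others, key=lambda i: math.dist(points[i], points[query_index]))

# Hypothetical rows: (height in meters, weight in grams).
rows = [
    (1.60, 55000.0),  # query row
    (1.90, 56000.0),  # very different height, similar weight
    (1.61, 58000.0),  # almost the same height, somewhat different weight
    (1.85, 70000.0),
]

print(nearest_index(rows, 0))               # 1
print(nearest_index(standardize(rows), 0))  # 2
```

On the raw data the weight column dominates purely because of its units, so the 30 cm height difference is ignored. After standardization both features contribute, and the nearest neighbor changes.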

