## Q1. What is the main difference between the Euclidean distance metric and the Manhattan distance metric in KNN? How might this difference affect the performance of a KNN classifier or regressor?


A good distance metric can significantly improve the performance of classification, clustering, and information retrieval. This answer looks at two distance metrics commonly used in machine learning models, Euclidean and Manhattan distance, and at how the choice between them affects KNN.


<b>Euclidean Distance Metric:</b>

Euclidean Distance represents the shortest distance between two points.

The “Euclidean Distance” between two objects is the distance you would expect in “flat” or “Euclidean” space; it’s named after Euclid, who worked out the rules of geometry on a flat surface.

The Euclidean distance is often the “default” metric in, for example, K-nearest neighbors (classification) or K-means (clustering) for finding the “k closest points” to a particular sample point. “Closeness” is measured by the difference along the scale of each variable, and those per-variable differences are combined into a single distance.

It is only one of the many available options for measuring the distance between two vectors or data objects. However, many classification algorithms use it either to train the classifier or to decide the class membership of a test observation, and clustering algorithms (e.g., K-means or K-medoids) use it to assign data objects to clusters.

Mathematically, it is calculated using Pythagoras’ theorem: the square of the total distance between two objects is the sum of the squares of the distances along each perpendicular coordinate, so $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$.


<b>Manhattan Distance Metric:</b>

Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates across all dimensions. In two dimensions, that is the absolute difference between the x-coordinates plus the absolute difference between the y-coordinates.

The Manhattan distance metric is also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, Minkowski’s L1 distance, taxi-cab metric, or city block distance.
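The two definitions above can be sketched in a few lines of plain Python. The function names and the sample points are illustrative assumptions, not taken from any particular library.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """City-block (L1) distance: sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (1, 2), (4, 6)
print(euclidean_distance(p, q))  # 5.0 (a 3-4-5 right triangle)
print(manhattan_distance(p, q))  # 7 (3 across plus 4 up)
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean distance, because it follows the axes instead of cutting diagonally.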


<b>How Distance Affects Performance:</b>

* The Manhattan distance is often a reasonable choice for data that has discrete or encoded categorical features. Because it sums absolute differences instead of squaring them, a single large difference in one feature does not dominate the distance as much as it does with the Euclidean distance, which makes it somewhat more robust to outliers. It is also often reported to hold up better on high-dimensional data, although it does not escape the curse of dimensionality. However, it depends on the orientation and scale of the features: rotating the axes changes the distances, and it assumes that all units are comparable.

* The Euclidean distance is suitable for data that has continuous, numerical features with similar scales and ranges. Because it squares each difference, it gives more weight to larger differences, which makes it more sensitive to outliers and noise than the Manhattan distance. It is also affected by the curse of dimensionality: as the number of features increases, the distances between pairs of points become more similar to one another and therefore less meaningful.

## Q2. How do you choose the optimal value of k for a KNN classifier or regressor? What techniques can be used to determine the optimal k value?


To find the optimal K value for your data, you’ll typically run the KNN algorithm multiple times with varying K values and evaluate each one on held-out data, for example with cross-validation. Choose the K with the best validation score; if the score stays stable across a range of K values, any value in that range is a suitable choice.

When selecting K, consider that the number of features and the number of classes both influence the model’s performance. More features or more classes often, though not always, call for larger K values to capture meaningful patterns in the data.

For example:

* Higher K values: Increasing K generally stabilizes predictions and improves resilience to outliers, although a K that is too large smooths over real class boundaries. A practical approach is to increase K incrementally until your chosen evaluation metric, such as the F-measure, meets an acceptable threshold.
* K = 1: This makes predictions highly sensitive to noise and outliers, as each prediction relies on a single, possibly unreliable, neighbor.

When choosing the K value, it’s important to consider the distribution of samples across classes:

* Increasing K: Averaging across more neighbors reduces the impact of any single data point, which leads to more stable predictions and prevents the model from being too influenced by outliers. If one class has significantly more samples than the others, however, a large K lets the majority class outvote the minority class, so K should stay small relative to the size of the smallest class.
* Decreasing K: A lower K makes the model more sensitive to the closest, potentially most relevant neighbors. This helps when certain classes have few samples, but it also makes predictions noisier.

Here are some examples of varying the value of K for a specific dataset:

<br>

![image.png](attachment:image.png)

<br>

Source: https://neptune.ai/blog/knn-algorithm
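As a minimal sketch of the "try several K values and compare" procedure, the snippet below scores each candidate K with leave-one-out cross-validation. The helper names and the toy data set (two clusters plus one deliberately mislabeled point) are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training samples closest to `query`."""
    neighbors = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each sample from all the others."""
    hits = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, features, k) == label:
            hits += 1
    return hits / len(data)

# Toy data: an "a" cluster, a "b" cluster, and one noisy "b" label inside the "a" cluster.
data = [
    ((0, 0), "a"), ((4, 1), "a"), ((1, 5), "a"), ((6, 7), "a"),
    ((2, 2), "b"),  # mislabeled point
    ((20, 20), "b"), ((24, 21), "b"), ((21, 25), "b"), ((26, 27), "b"),
]

scores = {k: loo_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)  # first K with the highest score
print(scores)
print(best_k)  # 3
```

On this toy set K = 1 scores lowest because the single mislabeled point becomes the nearest neighbor of several correct points, while K = 3 and K = 5 outvote it.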

## Q3. How does the choice of distance metric affect the performance of a KNN classifier or regressor? In what situations might you choose one distance metric over the other?


* <b>The Manhattan distance</b> is often a reasonable choice for data that has discrete or encoded categorical features. Because it sums absolute differences instead of squaring them, one large difference in a single feature does not dominate the distance as much as it does with the Euclidean distance. It is also often reported to hold up better on high-dimensional data, although it does not escape the curse of dimensionality. However, it depends on the orientation and scale of the features: rotating the axes changes the distances, and it assumes that all units are comparable.

* <b>The Euclidean distance</b> is suitable for data that has continuous, numerical features with similar scales and ranges. Because it squares each difference, it gives more weight to larger differences, which makes it more sensitive to outliers and noise. It is also affected by the curse of dimensionality: as the number of features increases, the distances between pairs of points become more similar to one another and therefore less meaningful.

* <b>The Minkowski distance</b> generalizes both: its parameter p sets the order of the norm, with p = 1 giving the Manhattan distance and p = 2 giving the Euclidean distance. Treating p as a tunable hyperparameter lets you control how strongly large per-feature differences are emphasized. However, values of p other than 1 and 2 are harder to interpret, and the best p differs between data sets and problems.

* <b>The cosine similarity</b> is suitable for data that has sparse and high-dimensional features, such as text or image data, as it measures similarity based on the direction and not the magnitude of the vectors. Because it normalizes each vector by its length, it can compare samples whose overall magnitudes differ widely. However, it discards magnitude even when magnitude is informative, and it treats all features as equally important.

Source: https://www.linkedin.com/advice/3/what-most-effective-distance-metrics-optimizing-xndwc
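To make the role of the Minkowski parameter p and the magnitude-invariance of cosine similarity concrete, here is a small sketch in plain Python. The function names are illustrative assumptions, and cosine is written as a distance (1 minus the similarity) so that smaller always means closer.

```python
import math

def minkowski_distance(a, b, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def cosine_distance(a, b):
    """1 - cosine similarity; depends on direction only, not on vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

u, v = (1, 2), (4, 6)
print(minkowski_distance(u, v, 1))  # 7.0, the Manhattan distance
print(minkowski_distance(u, v, 2))  # 5.0, the Euclidean distance

# Same direction, ten times the length: cosine distance is (numerically) zero,
# while the Euclidean distance between the two vectors is large.
print(cosine_distance((1, 2), (10, 20)))
print(minkowski_distance((1, 2), (10, 20), 2))
```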

## Q4. What are some common hyperparameters in KNN classifiers and regressors, and how do they affect the performance of the model? How might you go about tuning these hyperparameters to improve model performance?


1. feature_dim:

    The number of features in the input data.

    Required

    Valid values: positive integer.

2. k:

    The number of nearest neighbors.

    Required

    Valid values: positive integer

3. predictor_type:

    The type of inference to use on the data labels.

    Required

    Valid values: classifier for classification or regressor for regression.

4. sample_size:

    The number of data points to be sampled from the training data set.

    Required

    Valid values: positive integer

5. dimension_reduction_target:

    The target dimension to reduce to.

    Required when you specify the dimension_reduction_type parameter.

    Valid values: positive integer greater than 0 and less than feature_dim.

6. dimension_reduction_type:

    The type of dimension reduction method.

    Optional

    Valid values: sign for random projection or fjlt for the fast Johnson-Lindenstrauss transform.

    Default value: No dimension reduction

7. faiss_index_ivf_nlists:

    The number of centroids to construct in the index when index_type is faiss.IVFFlat or faiss.IVFPQ.

    Optional

    Valid values: positive integer

    Default value: auto, which resolves to sqrt(sample_size).

8. faiss_index_pq_m:

    The number of vector sub-components to construct in the index when index_type is set to faiss.IVFPQ.

    The Facebook AI Similarity Search (FAISS) library requires that the value of faiss_index_pq_m is a divisor of the data dimension. If faiss_index_pq_m is not a divisor of the data dimension, the algorithm increases the data dimension to the smallest integer divisible by faiss_index_pq_m. If no dimension reduction is applied, the algorithm adds a padding of zeros. If dimension reduction is applied, the algorithm increases the value of the dimension_reduction_target hyperparameter.

    Optional

    Valid values: One of the following positive integers: 1, 2, 3, 4, 8, 12, 16, 20, 24, 28, 32, 40, 48, 56, 64, 96

9. index_metric:

    The metric to measure the distance between points when finding nearest neighbors. When training with index_type set to faiss.IVFPQ, the INNER_PRODUCT distance and COSINE similarity are not supported.

    Optional

    Valid values: L2 for Euclidean-distance, INNER_PRODUCT for inner-product distance, COSINE for cosine similarity.

    Default value: L2

10. index_type:

    The type of index.

    Optional

    Valid values: faiss.Flat, faiss.IVFFlat, faiss.IVFPQ.

    Default value: faiss.Flat

11. mini_batch_size:

    The number of observations per mini-batch for the data iterator.

    Optional

    Valid values: positive integer

    Default value: 5000




Source: https://docs.aws.amazon.com/sagemaker/latest/dg/kNN_hyperparameters.html
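The padding rule described under faiss_index_pq_m is simple arithmetic, sketched below. `padded_dimension` is a hypothetical helper written for this illustration; it is not part of SageMaker or FAISS.

```python
def padded_dimension(dim, pq_m):
    """Smallest integer >= dim that is divisible by pq_m."""
    return -(-dim // pq_m) * pq_m  # ceiling division, then scale back up

print(padded_dimension(100, 8))  # 104: four zero-padded components are added
print(padded_dimension(96, 8))   # 96: already divisible, nothing is added
```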

<br>

<b>Things to keep in mind when performing tuning:</b>

1.    Understand the parameters: The main hyperparameter to tune in k-nearest neighbors is k, the number of neighbors to consider. Other parameters include distance metrics, weights, and algorithm types.

2.    Select a distance metric: Choose the right distance metric to measure the similarity between the data points. Common distance metrics include Euclidean, Manhattan, and cosine distance.

3.    Select an appropriate value for k: Selecting a value for k is crucial in k-nearest neighbors. A larger value of k provides a smoother decision boundary but may underfit by averaging over points from other classes. A smaller value of k may lead to overfitting.

4.    Choose an algorithm type: k-nearest neighbors search has two main algorithm types: brute-force and tree-based. The brute-force algorithm computes the distance from each query point to every training point, while tree-based algorithms (such as KD-trees and ball trees) partition the dataset so that most points can be skipped during a search.

5.    Cross-validation: Cross-validation is a technique used to validate the performance of the model. It involves splitting the dataset into several folds, training on all but one fold, evaluating on the held-out fold, and averaging the scores over all folds.

6.    Grid search: Grid search is a hyperparameter tuning technique that involves testing a range of values for each hyperparameter to find the best combination of values.

7.    Random search: Random search is another hyperparameter tuning technique that randomly selects a combination of hyperparameter values to test.

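8.    Bias-variance tradeoff: k-nearest neighbors with a small k is prone to overfitting because of the high variance in the model. KNN has no learned weights, so L1 and L2 regularization do not apply; variance is reduced instead by increasing k, using distance-weighted voting, or removing noisy features and training points.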

9.    Data preprocessing: Data preprocessing plays a crucial role in k-nearest neighbors. Scaling the data using techniques such as normalization and standardization can improve the model's performance. Outlier removal and feature selection can also help improve the model's performance.
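Several of the points above (scaling, choice of k, weighting, distance metric, cross-validation, grid search) can be combined in one short sketch. It assumes scikit-learn is installed and uses its bundled Iris data purely as a stand-in data set; the grid values are arbitrary examples, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so each fold is scaled using only its training part.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],  # Minkowski order: 1 = Manhattan, 2 = Euclidean
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

For larger grids, `RandomizedSearchCV` from the same module samples a fixed number of combinations instead of trying all of them.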



## Q5. How does the size of the training set affect the performance of a KNN classifier or regressor? What techniques can be used to optimize the size of the training set?


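* A larger training set usually improves KNN's accuracy, because each query is more likely to have close, representative neighbors. The cost comes at prediction time: KNN stores every training point and, with a brute-force search, compares each query against all of them, so memory use and prediction time grow with the size of the dataset. It is nonetheless commonly used for simple recommendation systems, pattern recognition, data mining, financial market predictions, intrusion detection, and more.

* Pruning, quantization, and knowledge distillation are usually described for neural networks, but the first two have direct KNN counterparts. Pruning corresponds to instance selection: removing redundant or noisy training points (for example with condensed or edited nearest-neighbor rules) or subsampling the training set, as the sample_size hyperparameter in Q4 does. Quantization corresponds to compressing the stored vectors, as the faiss.IVFPQ index type in Q4 does. Dimensionality reduction and tree-based or approximate indexes lower the cost per query without removing any points.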
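One way to shrink a KNN training set is instance selection. The sketch below follows the idea of the condensed nearest-neighbor rule: keep a sample only if the samples kept so far would misclassify it with 1-NN. It is a simplified illustration under that assumption, with made-up function names and toy data, not a tuned implementation.

```python
import math

def nearest_label(store, point):
    """Label of the stored sample closest to `point` (1-NN, Euclidean)."""
    return min(store, key=lambda s: math.dist(s[0], point))[1]

def condense(samples):
    """Keep only the samples that the current store misclassifies."""
    store = [samples[0]]
    changed = True
    while changed:  # repeat passes until a full pass adds nothing
        changed = False
        for sample in samples:
            if sample in store:
                continue
            features, label = sample
            if nearest_label(store, features) != label:
                store.append(sample)
                changed = True
    return store

# Toy data: two well-separated clusters of four points each.
samples = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((1, 1), "a"),
    ((10, 10), "b"), ((10, 11), "b"), ((11, 10), "b"), ((11, 11), "b"),
]

store = condense(samples)
print(len(samples), len(store))  # 8 2
```

The reduced set still classifies every original sample correctly with 1-NN, but it only covers the decision boundary, so it suits well-separated classes better than noisy ones.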

## Q6. What are some potential drawbacks of using KNN as a classifier or regressor? How might you overcome these drawbacks to improve the performance of the model?

<b>KNN limitations:</b>

KNN is straightforward: it has no training phase beyond storing the labeled examples, so new examples, and even new categories, can be added without retraining. However, the same reliance on stored examples limits KNN in predicting rare occurrences, such as new diseases, because it lacks the historical data needed to estimate their prevalence in a general population.

KNN can reach good accuracy on a test set, but it is slow and costly in terms of time and memory. It retains the entire training dataset in memory for making predictions, which can be impractical with large datasets. Additionally, Euclidean distance, KNN’s typical metric, is very sensitive to magnitude: unless the features are scaled, those with large magnitudes weigh more than those with small ones.

Given these factors, KNN is less effective for datasets with high dimensionality due to its computational and memory demands.
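The sensitivity to feature magnitude is easy to demonstrate. In the sketch below, the data (age in years, income in dollars) and the helper names are hypothetical; standardization here means rescaling each column to zero mean and unit standard deviation.

```python
import math
from statistics import mean, pstdev

def standardize(rows):
    """Rescale every column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def nearest_index(rows, query_index):
    """Index of the row closest to rows[query_index], excluding itself."""
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], rows[query_index]))

# Hypothetical people: (age in years, income in dollars)
people = [(25, 50_000), (60, 50_500), (27, 58_000), (45, 90_000)]

print(nearest_index(people, 0))               # 1: income dominates, the 35-year age gap barely counts
print(nearest_index(standardize(people), 0))  # 2: after scaling, the person of similar age is closest
```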

<br>

<b>How to improve the performance of the KNN algorithm</b>:


1. Feature selection:

   One way to improve the KNN algorithm is to select the most relevant features for the classification task. This can reduce the dimensionality of the data, speed up the computation, and avoid the curse of dimensionality. Feature selection can be done using various methods, such as filter, wrapper, or embedded approaches. For example, you can use correlation analysis, information gain, or chi-square test to rank the features according to their relevance and select the top ones.

2. Distance metric:

   Another way to improve the KNN algorithm is to choose the appropriate distance metric for measuring the similarity between the data points. The distance metric should reflect the nature and scale of the data, as well as the distribution of the classes. For example, you can use Euclidean distance, Manhattan distance, Minkowski distance, or cosine similarity, depending on the data type and structure. You can also use weighted distance, where each feature has a different weight according to its importance.

3. Number of neighbors:

   A third way to improve the KNN algorithm is to optimize the number of neighbors, or k, that are used to classify a new data point. The choice of k affects the accuracy and complexity of the algorithm. If k is too small, the algorithm may be too sensitive to noise and outliers. If k is too large, the algorithm may be too general and lose the local information. You can use cross-validation, grid search, or other methods to find the optimal value of k for your data set.

4. Data preprocessing:

   A fourth way to improve the KNN algorithm is to preprocess the data before applying the algorithm. Data preprocessing can include cleaning, scaling, normalizing, transforming, or encoding the data to make it more suitable for the KNN algorithm. For example, you can remove missing values, outliers, or duplicates, scale or normalize the numerical features to the same range, transform the skewed features to a normal distribution, or encode the categorical features to numerical values.

5. Algorithm modification:

   A fifth way to improve the KNN algorithm is to modify the algorithm itself to make it more efficient or effective. For example, you can use a different voting scheme, such as weighted voting, where each neighbor has a different weight according to its distance or similarity. You can also use a different data structure, such as a tree, a graph, or a hash table, to store and retrieve the neighbors faster. You can also use a different algorithm, such as K-means, K-medoids, or K-modes, to cluster the data before applying KNN.

6. Evaluation metric:

    A sixth way to improve the KNN algorithm is to use the appropriate evaluation metric to measure the performance of the algorithm. The evaluation metric should reflect the goal and the challenge of the classification task. For example, you can use accuracy, precision, recall, F1-score, ROC curve, or AUC, depending on the data size, class balance, and error cost. You can also use multiple metrics to compare and contrast the results of different KNN models.
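The weighted voting scheme mentioned under "Algorithm modification" can be sketched as follows. Weighting each neighbor by 1 / distance is one common choice, assumed here for illustration; the function name and toy data are made up.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, query, k, eps=1e-9):
    """Vote among the k nearest samples, each weighted by 1 / distance."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = defaultdict(float)
    for features, label in nearest:
        # eps avoids division by zero when the query coincides with a sample
        votes[label] += 1.0 / (math.dist(features, query) + eps)
    return max(votes, key=votes.get)

train = [
    ((0.0, 0.0), "a"),
    ((3.0, 0.0), "b"), ((0.0, 3.0), "b"),
]
query = (0.5, 0.5)

# A plain majority vote with k=3 would say "b" (two of the three neighbors);
# weighting by 1 / distance lets the much closer "a" sample win.
print(weighted_knn_predict(train, query, k=3))  # a
```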