## Q1. What is the KNN algorithm?


K-Nearest Neighbors (KNN) is a simple way to classify things by looking at what’s nearby. Imagine a streaming service wants to predict whether a new user is likely to cancel their subscription (churn) based on their age. It checks the ages of its existing users and whether they churned or stayed. If most of the “K” existing users closest in age to the new user canceled their subscriptions, KNN predicts that the new user will churn too. The key idea is that users with similar ages tend to behave similarly, and KNN uses this closeness to make decisions.
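
The churn example can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the ages, labels, and the function name `knn_predict` are all made up for the example.

```python
from collections import Counter

def knn_predict(train_ages, train_labels, new_age, k=3):
    """Predict a label for new_age by majority vote among the k closest ages."""
    # Rank the existing users by how close their age is to the new user's age.
    order = sorted(range(len(train_ages)), key=lambda i: abs(train_ages[i] - new_age))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # The most frequent label among the k neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical existing users: age and whether they churned or stayed.
ages = [22, 24, 25, 41, 45, 47, 52]
churned = ["churn", "churn", "stay", "stay", "stay", "stay", "churn"]

young_user = knn_predict(ages, churned, new_age=23, k=3)
older_user = knn_predict(ages, churned, new_age=44, k=3)
print(young_user, older_user)
```

For the 23-year-old, the three closest users are aged 22, 24 and 25, two of whom churned, so the vote is "churn". For the 44-year-old, all three closest users stayed.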

## Q2. How do you choose the value of K in KNN?


The choice of k depends largely on the input data: data with more outliers or noise will likely perform better with higher values of k, although a k that is too large can underfit. An odd value of k is generally recommended to avoid ties in binary classification, and cross-validation can help you choose the optimal k for your dataset.
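
One way to pick k by cross-validation is sketched below, using leave-one-out accuracy on a tiny one-dimensional dataset. The data (including the deliberately noisy label at x = 3.0) and the helper names are assumptions made for illustration.

```python
from collections import Counter

def knn_predict(points, labels, query, k):
    """Majority vote among the k points closest to the query (1-D feature)."""
    order = sorted(range(len(points)), key=lambda i: abs(points[i] - query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(points, labels, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i in range(len(points)):
        rest_x = points[:i] + points[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest_x, rest_y, points[i], k) == labels[i]
    return hits / len(points)

# Hypothetical data: two groups, with one noisy label at x = 3.0.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 11.0, 12.0, 13.0, 14.0, 15.0]
ys = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]

# Score only odd candidates for k, then keep the best-scoring one.
scores = {k: loo_accuracy(xs, ys, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With k = 1 the noisy point misleads one of its neighbours, so the score is lower than with k = 3 or k = 5. When several values tie, `max` keeps the first (smallest) one.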

## Q3. What is the difference between KNN classifier and KNN regressor?


* In classification tasks, the user seeks to predict a category, which is usually represented as an integer label, but represents a category of "things". For instance, you could try to classify pictures between "cat" and "dog" and use label 0 for "cat" and 1 for "dog".

     The KNN algorithm for classification will look at the k nearest neighbours of the input you are trying to make a prediction on. It will then output the most frequent label among those k examples.

* In regression tasks, the user wants to output a numerical value (usually continuous). For instance, the goal may be to estimate the price of a house, or to rate how good a movie is.

    In this case, the KNN algorithm collects the values associated with the k examples closest to the one you want to make a prediction on and aggregates them into a single value. Usually, you would choose the average of the k neighbours' values, but you could choose the median or a weighted average (or anything else that makes sense for the task at hand).

    For a recommendation problem, you could use either, but regression makes more sense for predicting some kind of "matching percentage" between the user and the item you want to recommend to them.
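
The two variants share the neighbour search and differ only in how the neighbours' targets are combined. The sketch below shows this with hypothetical house data; the function names and numbers are illustrative assumptions.

```python
from collections import Counter
from statistics import mean

def k_nearest(train_x, query, k):
    """Indices of the k training values closest to the query (1-D feature)."""
    return sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]

def knn_classify(train_x, train_y, query, k):
    # Classification: output the most frequent label among the neighbours.
    idx = k_nearest(train_x, query, k)
    return Counter(train_y[i] for i in idx).most_common(1)[0][0]

def knn_regress(train_x, train_y, query, k):
    # Regression: output the average of the neighbours' numeric targets.
    idx = k_nearest(train_x, query, k)
    return mean(train_y[i] for i in idx)

# Hypothetical data: size in square metres, property type, price in thousands.
sizes = [50, 60, 70, 120, 130, 140]
kinds = ["flat", "flat", "flat", "house", "house", "house"]
prices = [150, 180, 210, 360, 390, 420]

label = knn_classify(sizes, kinds, 65, k=3)
price = knn_regress(sizes, prices, 65, k=3)
print(label, price)
```

Swapping `mean` for `statistics.median` or a distance-weighted average changes the aggregation without touching the neighbour search.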

## Q4. How do you measure the performance of KNN?


Once a metric is chosen (for example, accuracy for classification or mean squared error for regression), there are several ways to evaluate the performance of the KNN algorithm, each with its own benefits and limitations. Two common approaches are a training/testing split and cross-validation. A training/testing split is quick and easy, but the estimate depends on which samples happen to land in the test set, so it can be overly optimistic or pessimistic. Cross-validation averages over several splits and provides a more reliable estimate of performance, but it can be computationally expensive. The choice between them depends on the specific problem and the available resources.
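
The difference between the two approaches can be illustrated with a small sketch. The data, the interleaved fold assignment, and the helper names are assumptions for the example. The single split is deliberately a bad one (sorted data, no shuffling) to show how much one split can mislead.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def accuracy(train_x, train_y, test_x, test_y, k):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(test_x, test_y))
    return hits / len(test_x)

def k_fold_accuracy(xs, ys, k, n_folds):
    """Average accuracy over n_folds; each fold serves once as the test set."""
    scores = []
    for f in range(n_folds):
        test_idx = set(range(f, len(xs), n_folds))  # interleaved fold assignment
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        test_x = [xs[i] for i in sorted(test_idx)]
        test_y = [ys[i] for i in sorted(test_idx)]
        scores.append(accuracy(train_x, train_y, test_x, test_y, k))
    return sum(scores) / n_folds

# Hypothetical data, sorted by feature value, with two well-separated classes.
xs = [1, 2, 3, 4, 5, 6, 11, 12, 13, 14, 15, 16]
ys = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# A single unshuffled split: train on the first half, test on the second half.
ordered_split = accuracy(xs[:6], ys[:6], xs[6:], ys[6:], k=3)
# 4-fold cross-validation over the same data.
cv = k_fold_accuracy(xs, ys, k=3, n_folds=4)
print(ordered_split, cv)
```

Here the unshuffled split never shows the model an example of class 1 during training, so it scores zero, while every cross-validation fold contains both classes and the averaged score reflects how separable the data really is.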

## Q5. What is the curse of dimensionality in KNN?


The curse of dimensionality refers to the phenomenon that, in high-dimensional spaces, the distances from a query point to its nearest and farthest points become almost equal. As a result, nearest-neighbor calculations can no longer discriminate between candidate points. Many indexing methods have been proposed to cope with the curse of dimensionality (for example, the Hybrid Tree, the R*-tree, and the iDistance method), but in high-dimensional spaces they usually end up behaving like a sequential scan over the database in terms of accessed pages for queries such as k-Nearest Neighbors; beyond some dimensionality, a kNN query requires visiting all the data pages.
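
The concentration of distances can be observed directly with random data. The sketch below assumes points drawn uniformly from the unit hypercube; the point count, seed, and function name are arbitrary choices for the illustration.

```python
import math
import random

def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of the nearest to the farthest Euclidean distance from a random query."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    # A ratio close to 1 means the nearest and farthest points are almost equally far.
    return min(dists) / max(dists)

ratios = {dim: nearest_to_farthest_ratio(dim) for dim in (2, 10, 100, 1000)}
for dim, ratio in ratios.items():
    print(dim, round(ratio, 3))
```

In two dimensions the nearest point is far closer than the farthest one, so the ratio is small. As the number of dimensions grows, the ratio moves towards 1 and "nearest" carries less and less information.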

## Q6. How do you handle missing values in KNN?


When working with data, encountering missing values is inevitable. They can arise for various reasons, such as human error or data corruption, and handling them is crucial for data integrity and quality. Because KNN relies on distances between feature vectors, missing values are usually filled in (imputed) before the algorithm is applied. Different techniques exist to fill in these missing values, including univariate and multivariate imputation. K-Nearest Neighbors (KNN) imputation is a widely used multivariate technique: a missing value is filled in using the values of that feature in the k most similar rows.


Let’s briefly discuss the two main types of imputation:

* Univariate Imputation: This technique considers only one column (or feature) at a time. If we have a missing value in a column, univariate imputation fills it based solely on the values in that column. Common univariate techniques include filling missing values with the mean, median, or mode of the column.

* Multivariate Imputation: In multivariate methods, multiple columns are used to predict missing values. This means that the algorithm takes advantage of other features in the dataset, which often leads to more accurate imputations.
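
A simplified sketch of KNN imputation follows. It assumes missing entries are marked with `None`, measures distance only on the columns two rows share, and averages the k nearest donors. This is an illustrative simplification with made-up data, not the exact procedure of any particular library.

```python
import math

def distance_on_shared(a, b):
    """Euclidean distance using only the columns present in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))

def knn_impute(rows, k=2):
    """Return a copy of rows with each None replaced by the mean of that
    column over the k nearest rows in which the column is observed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows that have a value in column j.
            donors = [r for m, r in enumerate(rows) if m != i and r[j] is not None]
            donors.sort(key=lambda r: distance_on_shared(row, r))
            nearest = donors[:k]
            if nearest:  # leave the gap in place if no donor exists
                filled[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return filled

# Hypothetical rows: [age, income in thousands]; one income is missing.
data = [
    [25, 50.0],
    [27, 54.0],
    [26, None],
    [45, 100.0],
    [47, 110.0],
]
imputed = knn_impute(data, k=2)
print(imputed[2])
```

The missing income is filled from the two rows closest in age (25 and 27), giving 52.0. Univariate mean imputation would instead use all four observed incomes and give 78.5, which ignores the fact that this person resembles the younger group.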

## Q7. Compare and contrast the performance of the KNN classifier and regressor. Which one is better for which type of problem?


KNN, or k-Nearest Neighbors, is an algorithm used for both classification and regression problems. It is a non-parametric method that uses an existing dataset of labeled records to make predictions for new items by finding the most similar existing records. KNN is particularly useful when the data does not fit a linear model, because it can produce a nonlinear decision boundary.

The major difference between KNN and other classification algorithms such as SVM (Support Vector Machines), Random Forests, and Logistic Regression is that KNN does not learn through explicit training, while the others do. Instead, it stores all of its training examples in memory and uses them directly when predicting for new points. This makes training essentially free, but prediction is comparatively slow and memory-hungry because every query is compared against the stored points. With too many features (variables), performance can also degrade due to overfitting and the curse of dimensionality.

As for regression tasks, such as predicting a continuous variable like house prices from neighborhood characteristics, KNN can predict more accurately than a standard linear regression model when the features are not linearly related to the target, because it does not assume an underlying functional form for the data it is fitting. For the same reason, it can generalize better in such cases and is less likely to produce spurious results caused by a mis-specified model. In short, neither variant is better in general: the classifier is the right choice when the target is a category, and the regressor is the right choice when the target is a numeric value.

## Q8. What are the strengths and weaknesses of the KNN algorithm for classification and regression tasks, and how can these be addressed?



<b>Advantages and disadvantages of the KNN Classification algorithm</b>:

Just like any machine learning algorithm, k-NN classification has its strengths and weaknesses. Depending on the project and application, it may or may not be the right choice.

<b>Advantages</b>:

   - Easy to implement: Given the algorithm’s simplicity and accuracy, it is one of the first classifiers that a new data scientist will learn.

   - Adapts easily: As new training samples are added, the algorithm adjusts to account for any new data since all training data is stored into memory.

   - Few hyperparameters: KNN only requires a k value and a distance metric, which is low when compared to other machine learning algorithms.

<b>Disadvantages</b>:

   - Does not scale well: Since KNN is a lazy algorithm, it takes up more memory and data storage compared to other classifiers. This can be costly from both a time and money perspective. More memory and storage will drive up business expenses and more data can take longer to compute. While different data structures, such as Ball-Tree, have been created to address the computational inefficiencies, a different classifier may be ideal depending on the business problem.

   - Curse of dimensionality: The KNN algorithm tends to fall victim to the curse of dimensionality, which means that it doesn’t perform well with high-dimensional data inputs. This is sometimes also referred to as the peaking phenomenon, where after the algorithm attains the optimal number of features, additional features increase the number of classification errors, especially when the sample size is small.

   - Prone to overfitting: Due to the “curse of dimensionality”, KNN is also more prone to overfitting. While feature selection and dimensionality reduction techniques are leveraged to prevent this from occurring, the value of k can also impact the model’s behavior. Lower values of k can overfit the data, whereas higher values of k tend to “smooth out” the prediction values since it is averaging the values over a greater area, or neighborhood. However, if the value of k is too high, then it can underfit the data.



<b>Advantages of KNN Regression</b>:

The K-nearest neighbors algorithm has various advantages as discussed below.

   - Simple to understand and implement: KNN regression is one of the simplest machine learning algorithms. It is easy to understand and implement, making it accessible to both practitioners and researchers.

   - No assumptions about data distribution: Unlike many other regression algorithms, KNN regression does not make any assumptions about the distribution of the data. This makes it suitable for a wide range of datasets, including those with complex or non-linear relationships between features and target variables.

   - Can handle noisy data: With a sufficiently large k, KNN regression averages over several neighbours, which dampens the effect of outliers and extreme values in the data.

   - Versatile: The same nearest-neighbour approach can be used for both regression and classification problems, so it can handle both continuous and categorical target variables.

   - Can be used for online learning: Because there is no training step, KNN regression can be updated incrementally as new data becomes available. This can give up-to-date results in real time.

   - Can work well with small datasets: Unlike some other algorithms that require a large amount of data to work well, KNN regression can still produce good results with small datasets, as long as the data is representative of the problem space.


<b>Disadvantages of KNN Regression</b>:

Apart from its advantages, the KNN regression algorithm also has many disadvantages. Some of them are discussed below.

   - Computational cost: When making a prediction, the KNN algorithm needs to compute the distance between each new data point and all the points in the existing dataset. Computational cost therefore keeps increasing as the data size increases, so it might not be efficient for use cases with large datasets.

   - High memory usage: KNN regression requires a lot of memory to store the entire training dataset, which can be a problem for very large datasets.

   - Hyperparameter sensitivity: The performance of KNN regression is highly dependent on the choice of the hyperparameter K, which determines the number of nearest neighbors used to make the prediction. Choosing the wrong value of K can result in overfitting or underfitting.

   - Sensitivity to irrelevant features: KNN regression is sensitive to irrelevant features, as they can have a large impact on the distance metric used to identify the nearest neighbors. This can result in poor performance if the features are not carefully pre-processed.

   - Non-parametric nature: Unlike parametric regression algorithms, KNN regression does not produce an explicit fitted model, such as a set of coefficients. Every prediction is computed from the distances to the stored training data. This can make it more difficult to interpret the results and understand the relationships between the features and target variables.

   - Not suitable for large datasets with many features: KNN regression can become computationally infeasible for datasets with a large number of features and data points. In these cases, parametric algorithms such as multiple regression or polynomial regression may be more practical.

## Q9. What is the difference between Euclidean distance and Manhattan distance in KNN?


A good distance metric can significantly improve the performance of classification, clustering, and information retrieval. Euclidean and Manhattan distance are two of the most common distance metrics used in machine learning models such as KNN.

![image.png](attachment:image.png)

<b>Euclidean Distance Metric:</b>

Euclidean Distance represents the shortest distance between two points.

The “Euclidean Distance” between two objects is the distance you would expect in “flat” or “Euclidean” space; it’s named after Euclid, who worked out the rules of geometry on a flat surface.

Euclidean distance is often the “default” distance used in, e.g., K-nearest neighbors (classification) or K-means (clustering) to find the “k closest points” to a particular sample point. The “closeness” is defined by the difference (“distance”) along the scale of each variable, which is converted to a similarity measure.

It is only one of the many available options for measuring the distance between two vectors or data objects. However, many classification algorithms use it either to train the classifier or to decide the class membership of a test observation, and clustering algorithms (e.g., K-means, K-medoids) use it to assign data objects to clusters.

Mathematically, it’s calculated using Pythagoras’ theorem. The square of the total distance between two objects is the sum of the squares of the distances along each perpendicular co-ordinate.


<b>Manhattan Distance Metric:</b>

Manhattan Distance is the sum of absolute differences between points across all the dimensions.

Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Put simply, in two dimensions it is the absolute difference between the x-coordinates plus the absolute difference between the y-coordinates.

The Manhattan distance is also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, Minkowski’s L1 distance, taxi-cab metric, or city block distance.
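
The two definitions translate directly into code. The sketch below is a plain-Python illustration; the function names and the example points are chosen for this example only.

```python
import math

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences (Pythagoras)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (0, 0), (3, 4)
d_euclid = euclidean(p, q)     # straight-line distance: 5.0
d_manhattan = manhattan(p, q)  # 3 steps along x plus 4 steps along y: 7
print(d_euclid, d_manhattan)
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean distance, since the straight line is the shortest path and the city-block route has to follow the axes.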




## Q10. What is the role of feature scaling in KNN?

All distance-based algorithms, such as KNN, are affected by the scale of the variables. Suppose your data has an age variable, which gives the age of a person in years, and an income variable, which gives the monthly income of the person in rupees:

![image.png](attachment:image.png)

Here the age of the person ranges from 25 to 40, whereas the income variable ranges from 50,000 to 110,000. Let’s now try to find the similarity between observations 1 and 2. The most common way is to calculate the Euclidean distance; the smaller this distance, the closer the points are and hence the more similar they are to each other. To recall, Euclidean distance is given by:

![image-2.png](attachment:image-2.png)

Here,

n = number of variables

p1,p2,p3,… = features of first point

q1,q2,q3,… = features of second point

The Euclidean distance between observation 1 and 2 will be given as:

Euclidean Distance = [(100000–80000)^2 + (30–25)^2]^(1/2)

which comes out to around 20000.000625. Note that the high magnitude of income dominates the distance between the two points: the age difference contributes almost nothing. This will impact the performance of all distance-based models, as they give higher weight to variables with higher magnitude (income in this case).

We do not want our algorithm to be affected by the magnitude of these variables; it should not be biased towards variables with higher magnitude. To overcome this problem, we can bring all the variables to the same scale. One of the most common techniques for doing so is standardization (z-score normalization), where we calculate the mean and standard deviation of the variable. Then, for each observation, we subtract the mean and divide by the standard deviation of that variable:

![image-3.png](attachment:image-3.png)
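
The effect of scaling on the distance can be checked with a short sketch. Observations 1 and 2 use the figures from the worked distance above. The other two rows are hypothetical values chosen only to fit the stated ranges, and the use of the population standard deviation (`pstdev`) is likewise an assumption.

```python
import math
from statistics import mean, pstdev

def standardize(column):
    """Z-score: subtract the column mean, then divide by its standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

# Rows 1 and 2 come from the worked example; rows 3 and 4 are hypothetical.
ages = [30, 25, 40, 35]
incomes = [100_000, 80_000, 110_000, 50_000]

# Distance between observations 1 and 2 on the raw scale: income dominates.
raw = math.dist((incomes[0], ages[0]), (incomes[1], ages[1]))

# Distance between the same observations after standardizing each column.
z_age, z_income = standardize(ages), standardize(incomes)
scaled = math.dist((z_income[0], z_age[0]), (z_income[1], z_age[1]))

print(round(raw, 6), round(scaled, 3))
```

After standardization, the age and income differences between the two observations are of similar size, so both variables influence which neighbours count as "nearest".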