## What is K-Nearest Neighbors?

K-Nearest Neighbors (K-NN) is a simple, intuitive supervised machine learning algorithm used for both classification and regression tasks. It predicts the output for a new data point from the K training examples that lie closest to it.

## How does K-NN work?

### Training Phase:

In the training phase, K-NN does no training in the traditional sense: it learns no parameters and is often called a "lazy learner". Instead, it stores the entire training dataset in memory. This dataset consists of labeled examples, where each example has a set of features and a corresponding class label (for classification) or a target value (for regression).
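
As a minimal sketch of what "storing the dataset" means, the class below has a `fit` method that only keeps copies of the examples. The class name `KNNMemory` and its attributes are hypothetical, chosen for illustration.

```python
class KNNMemory:
    """Hypothetical helper: 'training' for K-NN is just storing the data."""

    def fit(self, features, targets):
        if len(features) != len(targets):
            raise ValueError("features and targets must have the same length")
        # No parameters are learned; keep copies of the labeled examples.
        self.features = [tuple(row) for row in features]
        self.targets = list(targets)
        return self


# Two made-up labeled examples with two features each.
model = KNNMemory().fit([[1.0, 2.0], [3.0, 4.0]], ["a", "b"])
print(len(model.features))  # number of stored examples
```

All the real work is deferred to prediction time, which is why memory use grows with the size of the training set.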


### Prediction Phase:

When you want to make a prediction for a new, unseen data point, K-NN follows these steps:

a. Calculate Distance:

* Calculate the distance between the new data point and every data point in the training dataset. The most common distance metrics used are Euclidean distance, Manhattan distance, or Minkowski distance. The choice of distance metric depends on the problem and data.

b. Select K Neighbors:

* Choose the K data points with the shortest distances to the new data point. These K data points are referred to as the "K nearest neighbors."

c. Classification (for classification tasks):


*   For a classification task, K-NN counts the number of data points in each class among the K nearest neighbors.
*   The new data point is classified into the class that appears most frequently among the K neighbors. This is known as "majority voting."

d. Regression (for regression tasks):

* For a regression task, K-NN calculates the average (or weighted average) of the target values of the K nearest neighbors.
* The new data point is assigned this average as its predicted target value.
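
Steps a through d can be sketched in plain Python as follows. This is an illustrative sketch, not an optimized implementation: it assumes Euclidean distance, unweighted voting and averaging, and made-up toy data. The function names are hypothetical.

```python
import math
from collections import Counter


def k_nearest(train_X, train_y, query, k):
    """Return the targets of the k training points closest to `query`."""
    # Step a: Euclidean distance from the query to every training point.
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Step b: keep the k smallest distances.
    distances.sort(key=lambda pair: pair[0])
    return [y for _, y in distances[:k]]


def knn_classify(train_X, train_y, query, k=3):
    # Step c: majority vote among the k nearest labels.
    # Ties go to the label seen first among the neighbors.
    votes = Counter(k_nearest(train_X, train_y, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    # Step d: unweighted average of the k nearest target values.
    neighbors = k_nearest(train_X, train_y, query, k)
    return sum(neighbors) / len(neighbors)


# Toy data (made up for illustration).
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = ["red", "red", "red", "blue", "blue", "blue"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_classify(X, labels, (2.0, 2.0), k=3))  # red
print(knn_regress(X, values, (6.0, 7.0), k=3))   # 11.0
```

Sorting all distances costs O(n log n) per query, which is acceptable for a sketch; libraries typically use tree or index structures to speed up the neighbor search.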


### Choosing K:

One crucial hyperparameter in K-NN is the value of K. The choice of K can significantly impact the algorithm's performance. A smaller K value makes the algorithm more sensitive to noise in the data, potentially leading to overfitting, while a larger K value can make the decision boundary smoother but may result in underfitting.
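
The effect of K can be seen on a small, made-up 1-D dataset with one mislabeled point. In this sketch (hypothetical data and function name), K = 1 copies the noisy label, K = 3 outvotes it, and K equal to the dataset size always returns the overall majority class.

```python
from collections import Counter

# Hypothetical 1-D toy data: class "A" on the left, "B" on the right,
# with one mislabeled (noisy) point at x = 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = ["A", "A", "B", "A", "A", "B", "B", "B", "B"]


def predict(query, k):
    # Sort training points by absolute distance and vote among the k closest.
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Query right next to the noisy point.
results = {k: predict(3.1, k) for k in (1, 3, 9)}
print(results)  # {1: 'B', 3: 'A', 9: 'B'}
```

With K = 9 the prediction ignores the query's location entirely, which is the underfitting extreme described above.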

### Distance Metric:

The choice of distance metric is another important consideration. Different distance metrics can rank the same neighbors differently and therefore lead to different predictions, so selecting an appropriate metric is essential. Common choices include Euclidean distance, Manhattan distance, and Minkowski distance, which generalizes the other two.
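
The three metrics named in this document can be sketched in a few lines. The function names are illustrative; the Minkowski distance of order `p` reduces to Manhattan distance for `p = 1` and Euclidean distance for `p = 2`.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minkowski(a, b, p):
    """Minkowski distance of order p (p >= 1) between equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
print(manhattan(a, b))     # 7.0
print(euclidean(a, b))     # 5.0
print(minkowski(a, b, 1))  # 7.0
print(minkowski(a, b, 2))  # 5.0
```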

### Normalization:

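It's often recommended to normalize or standardize the features in your dataset before using K-NN. Because distances are computed directly from feature values, a feature measured on a large scale (for example, income in dollars) can dominate one measured on a small scale (for example, age in years). Scaling ensures that features with different scales do not disproportionately influence the distance calculations.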
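
The sketch below illustrates the scale problem on made-up data with two features, age and income. It assumes z-score standardization (subtract the mean, divide by the population standard deviation), with statistics computed on the training rows only; the helper names are hypothetical.

```python
import math
from statistics import mean, pstdev

# Hypothetical rows: (age in years, income in dollars).
train = [(25.0, 50000.0), (60.0, 50500.0), (27.0, 52000.0)]
query = (26.0, 50400.0)


def nearest_index(points, q):
    """Index of the point with the smallest Euclidean distance to q."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], q))


# Column-wise mean and standard deviation from the training data only.
columns = list(zip(*train))
means = [mean(col) for col in columns]
stds = [pstdev(col) for col in columns]


def standardize(row):
    return tuple((v - m) / s for v, m, s in zip(row, means, stds))


raw_nearest = nearest_index(train, query)
scaled_nearest = nearest_index([standardize(r) for r in train], standardize(query))
print(raw_nearest, scaled_nearest)  # 1 0
```

On the raw values the 60-year-old (index 1) is the nearest neighbor because income differences swamp the age difference. After standardization the 25-year-old (index 0) is nearest. The query is scaled with the training statistics, not its own.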

## When to use K-NN?

K-Nearest Neighbors (K-NN) is a versatile algorithm that can be useful in certain situations. Here are some scenarios in which you might consider using K-NN:

* Small to Moderate-Sized Datasets: K-NN can work well when you have a relatively small to moderately sized dataset. It has no training cost beyond storing the data, but every prediction compares the new point against the stored examples, so it remains computationally feasible only while the dataset is of this size.

* No Assumptions About Data Distribution: K-NN doesn't make strong assumptions about the underlying data distribution. If you're unsure about the data distribution and whether it's linear or non-linear, K-NN can be a good choice as it can adapt to various data shapes.

* Binary Classification: K-NN can perform well in binary classification tasks when you have two classes to distinguish between. It's relatively simple to implement and can be a good starting point for such problems.

* Multi-Class Classification: K-NN can also be applied to multi-class classification problems, where you need to classify data points into more than two classes. However, the choice of K and distance metric becomes more critical in these cases.

* Regression: K-NN can be used for regression tasks when you need to predict a continuous target variable. It calculates the average (or weighted average) of the target values of the K nearest neighbors to make predictions.

* Localized Patterns: K-NN is particularly suitable for problems where the decision boundaries are highly localized. If the decision boundary varies in different parts of the feature space, K-NN can capture these localized patterns effectively.

* Anomaly Detection: K-NN can be used for anomaly detection. Unusual data points may have few neighbors that are similar to them, so they can be detected as outliers.

* Feature Engineering: K-NN has no built-in measure of feature importance, but it can still help you assess features indirectly. By checking how predictions change when you add, remove, or rescale a feature, you can gain insight into which features most influence the choice of nearest neighbors.

* Quick Prototyping: K-NN is easy to implement and can serve as a quick prototype for testing the feasibility of a machine learning solution on your data. It can be a starting point before exploring more complex algorithms.
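
To illustrate the anomaly detection scenario from the list above, one simple approach (an assumption here, one of several possible scoring rules) is to score each point by the distance to its k-th nearest neighbor. Points with few similar neighbors get large scores. The data and function name are made up.

```python
import math


def kth_neighbor_distance(points, index, k):
    """Distance from points[index] to its k-th nearest other point."""
    distances = sorted(
        math.dist(points[index], other)
        for j, other in enumerate(points)
        if j != index
    )
    return distances[k - 1]


# Four points in a tight cluster and one far away (hypothetical data).
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
scores = [kth_neighbor_distance(points, i, k=2) for i in range(len(points))]

# The point with the largest score is the most isolated one.
outlier = max(range(len(points)), key=lambda i: scores[i])
print(outlier)  # 4
```

In practice a threshold on the score has to be chosen, for example from a percentile of the scores on known-normal data.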

## When not to use K-NN?

* Large Datasets: K-NN can be computationally expensive, especially with large datasets, as it requires calculating distances between the new data point and all data points in the training set for each prediction.

* High-Dimensional Data: K-NN is less effective in high-dimensional spaces. Because of the curse of dimensionality, distances between points become nearly uniform, so the "nearest" neighbors are not much closer than any other point, and each distance calculation also becomes more expensive as the number of features grows.

* Imbalanced Datasets: If your dataset is heavily imbalanced (i.e., one class significantly outnumbers the others), K-NN can be biased towards the majority class, because majority-class points are more likely to appear among the K neighbors. You might need to consider techniques like resampling or distance-weighted voting, where closer neighbors count for more.

* Optimal K Selection: Choosing the right value of K can be challenging. It often requires experimentation and cross-validation to determine the best K for your specific problem.
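
As a sketch of how cross-validation can guide the choice of K, the code below uses leave-one-out cross-validation on a made-up 1-D dataset. The data, the candidate values of K, and the function names are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical 1-D data: two clusters, each containing one mislabeled point.
xs = [1.0, 1.8, 2.5, 3.1, 3.9, 4.4, 7.0, 7.6, 8.5, 9.1, 9.9, 10.4]
ys = ["A", "A", "A", "B", "A", "A", "B", "B", "A", "B", "B", "B"]


def predict(train_x, train_y, query, k):
    # Majority vote among the k training points closest to the query.
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(k):
    """Leave-one-out accuracy: predict each point from all the others."""
    correct = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        correct += predict(rest_x, rest_y, xs[i], k) == ys[i]
    return correct / len(xs)


# Odd values of K avoid tied votes with two classes.
accuracies = {k: loo_accuracy(k) for k in (1, 3, 5)}
best_k = max(accuracies, key=accuracies.get)  # first K with the top score
print(accuracies, best_k)
```

On this toy data K = 1 scores lower than K = 3 because the mislabeled points mislead their immediate neighbors. On real data the same loop is usually run with k-fold splits rather than leave-one-out to keep the cost manageable.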

In [None]:
from IPython.display import Image
Image(url='https://miro.medium.com/v2/resize:fit:828/1*n9v1xsBi0bek98rqBnWGEg.gif')