# 2.1.2.5. K-NEAREST NEIGHBOURS
## INTRODUCTION
K-Nearest Neighbours (KNN) is a simple and widely used supervised machine learning algorithm, most often applied to classification problems (it can also be used for regression). It is based on the idea that similar data points tend to belong to the same class or category.

## STEPS
1. Given a new data point that needs to be classified, calculate the distance between it and every data point in the training set. The distance can be measured with different metrics, such as Euclidean or Manhattan distance, or the more general Minkowski distance, which includes both as special cases.
2. Sort the distances in ascending order and select the K nearest neighbours, where K is a predefined parameter that represents the number of neighbours to consider.
3. Assign the new data point to the most frequent class among its K nearest neighbours. If there is a tie, break it with another criterion, such as choosing the class of the closest neighbour or using a distance-weighted voting scheme.
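
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `minkowski_distance` and `knn_predict`, the toy data, and the tie-breaking rule (the tied class with the nearest member wins) are all assumptions made for this example.

```python
from collections import Counter

import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(a - b) ** p, axis=-1) ** (1.0 / p)


def knn_predict(X_train, y_train, x_new, k=3, p=2):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 1: distance from the new point to every training point.
    distances = minkowski_distance(X_train, x_new, p=p)
    # Step 2: indices of the k smallest distances, nearest first.
    nearest = np.argsort(distances, kind="stable")[:k]
    # Step 3: majority vote among the k neighbours.
    votes = Counter(y_train[i] for i in nearest)
    top = max(votes.values())
    tied = {label for label, count in votes.items() if count == top}
    # Tie-break (assumed rule): the tied class whose member is nearest wins.
    for i in nearest:
        if y_train[i] in tied:
            return y_train[i]


# Toy data, made up for illustration: two small clusters.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 7.0], [6.5, 5.5]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(X_train, y_train, np.array([1.2, 1.4]), k=3))
print(knn_predict(X_train, y_train, np.array([6.4, 6.1]), k=3))
```

The first query sits inside the "a" cluster and the second inside the "b" cluster, so each is assigned the label of its surrounding points.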

## ADVANTAGES
- It is easy to implement and understand.
- It does not make any assumptions about the distribution or structure of the data.
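- It can handle both numerical and categorical features, provided a suitable distance measure or encoding is used (for example, Hamming distance or one-hot encoding for categorical features).
- It has no explicit training phase, so it can adapt to changes in the data: new points are simply added to the stored training set and become candidate neighbours.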

## DISADVANTAGES
- It can be computationally expensive and slow at prediction time, especially with large and high-dimensional data sets, because each new point must be compared against the stored training data.
- It can be sensitive to noise, outliers, and irrelevant features, which can affect the distance calculation and the classification accuracy.
- It can suffer from the curse of dimensionality: as the number of features increases, the distances between pairs of data points become increasingly similar, so the notion of a "nearest" neighbour becomes less meaningful.
- It requires a good choice of K and distance metric, which can vary depending on the problem and the data.
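
The curse of dimensionality mentioned above can be observed directly on random data. The sketch below is an illustration under assumed settings (uniform random points, 500 points per run, a fixed seed); the helper `relative_contrast` is a hypothetical name. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def relative_contrast(n_features, n_points=500):
    """(farthest - nearest) / nearest distance from one query to random points."""
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} features: relative contrast = {c:.2f}")
```

With few features the nearest point is many times closer than the farthest one. With many features the contrast shrinks towards zero, which is why nearest-neighbour votes carry less information in high-dimensional spaces.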

## K-VALUE DEPENDENCIES
- **The size and density of the data set**. A larger data set may require a larger K to avoid overfitting, while a smaller or sparser data set may require a smaller K to avoid underfitting.
- **The complexity and variability of the classes**. Classes with complex, irregular boundaries may require a smaller K so that local structure is preserved, while classes that overlap heavily or contain noisy labels may benefit from a larger K, which smooths the decision boundary and reduces confusion between classes.
- **The trade-off between bias and variance**. A smaller K may lead to low bias but high variance, meaning that it can capture the local patterns well but may be unstable and sensitive to noise. A larger K may lead to high bias but low variance, meaning that it can smooth out the noise but may miss some important details.
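
The bias-variance trade-off can be made visible by comparing training and test accuracy across several values of K. The sketch below assumes the Iris data set and an arbitrary 70/30 split; the candidate K values are chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

results = {}
for k in (1, 5, 15, 45, 105):  # 105 is the size of the training set
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this K.
    results[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"k={k:>3}  train={results[k][0]:.3f}  test={results[k][1]:.3f}")
```

At K = 1 the model effectively memorises the training set (low bias, high variance). When K equals the size of the training set, every prediction collapses to the most common training class (high bias, low variance).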

## K-VALUE OPTIMALITY
One way to find a good value of K is cross-validation. The data is split into several folds; for each candidate K, the model is fitted on all but one fold and evaluated on the held-out fold, and this is repeated so that every fold serves once as the validation set. The value of K with the lowest average validation error can then be chosen as the best one.
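
A minimal sketch of this search using scikit-learn's `cross_val_score` is shown below. The Iris data set, the 5-fold setting, and the range of odd candidate values are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

candidate_ks = range(1, 31, 2)  # odd values only; the range is an arbitrary choice
mean_scores = {}
for k in candidate_ks:
    # 5-fold cross-validation: five accuracy scores, one per held-out fold.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores[k] = scores.mean()

# Highest mean accuracy is equivalent to lowest mean validation error.
best_k = max(mean_scores, key=mean_scores.get)
print(f"best k = {best_k}, mean CV accuracy = {mean_scores[best_k]:.3f}")
```

Several neighbouring values of K often score almost identically, so the selected value is better read as "a reasonable choice" than as a single true optimum.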

## CONCLUSION
In conclusion, KNN is a simple yet powerful machine learning algorithm for classification problems that relies on proximity and similarity. It has some advantages such as being easy to implement and flexible to different types of data, but also some disadvantages such as being computationally expensive and sensitive to noise and dimensionality. Choosing the best value of K is crucial for achieving good results with KNN and can be done using cross-validation or other methods.

## HANDS-ON: KNN

### 1. IMPORTS

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
```

### 2. DATASET

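```python
# Load the Iris data set into a DataFrame; 'species' holds the integer class labels (0, 1, 2)
iris = load_iris()
data = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
data['species'] = iris['target']
```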

### 3. DATASET PREPROCESSING

```python
# Separate the features from the target, then hold out 30% of the rows for testing
X = data.drop('species', axis=1)
y = data['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```

### 4. KNN CLASSIFIER

```python
# Fit a classifier that votes among the 3 nearest neighbours
k = 3
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
```
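
The classifier above relies on scikit-learn's defaults (uniform votes and Euclidean distance). The weighted voting scheme and the alternative distance metrics mentioned in the steps section are available through the `weights` and `metric` parameters. The sketch below is a self-contained variation; the choice of Manhattan distance and the variable name `knn_weighted` are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Closer neighbours get larger votes; distances are measured with the Manhattan metric.
knn_weighted = KNeighborsClassifier(n_neighbors=3, weights="distance", metric="manhattan")
knn_weighted.fit(X_train, y_train)

# kneighbors returns the distances to, and the training-set indices of, the k nearest neighbours.
distances, indices = knn_weighted.kneighbors(X_test[:1])
print("distances:", distances)
print("neighbour labels:", y_train[indices[0]])
print("prediction:", knn_weighted.predict(X_test[:1]))
```

Inspecting the neighbours of a single test point in this way shows exactly which training samples drive a given prediction.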

### 5. PREDICTION AND EVALUATION

```python
# Make predictions on the testing data
y_pred = knn.predict(X_test)

# Measure the performance of the model
print('Accuracy:', accuracy_score(y_test, y_pred))
```

```
Accuracy: 1.0
```

