# KNN intuition

KNN (k-nearest neighbors) is one of the simplest machine learning algorithms. It can be used for both regression and classification. When we give it a point to predict, it simply finds the k nearest training points (geometrically) and lets them decide democratically: a majority vote for classification, or an average of their values for regression. Let's take a look.

In [94]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
from ipywidgets import interact
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline

In [97]:
data = {'x' : [1, 2, 3, 6, 7, 9], 'y': [2, 3, 1, 5, 7, 6], 'class': ['k', 'k', 'k', 'r', 'r', 'r']}
data = pd.DataFrame(data)
data

Unnamed: 0,x,y,class
0,1,2,k
1,2,3,k
2,3,1,k
3,6,5,r
4,7,7,r
5,9,6,r


In [98]:
sns.set()

In [99]:
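def make_plot(x, y):
    """Plot both classes and mark the query point (x, y) with a green star."""
    fig, ax = plt.subplots(figsize = (13, 8))

    # Training points, drawn in the color named by their class label ('k' = black, 'r' = red)
    ax.scatter(data[data['class'] == 'k']['x'], data[data['class'] == 'k']['y'], color = 'k')
    ax.scatter(data[data['class'] == 'r']['x'], data[data['class'] == 'r']['y'], color = 'r')
    # Query point whose class we want to predict
    ax.plot([x], [y], color = 'g', marker = '*', markersize = 12)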

In [100]:
interact(make_plot, x = (1, 8, 0.1), y = (2, 7, 0.1));

interactive(children=(FloatSlider(value=4.0, description='x', max=8.0, min=1.0), FloatSlider(value=4.0, descri…

# Geometry matters


When we use KNN, one thing that changes the results is that the algorithm is based on **distance**, so the definition of distance itself matters.


## Minkowski distance


Minkowski distance is a generalized distance metric. Here generalized means that we can vary the parameter $p$ in the formula below to calculate the distance between two data points in different ways.


$$d(x, y) = \left(\sum |x_i - y_i|^p \right)^{\frac{1}{p}}$$
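As a minimal sketch of the formula above, the function below computes the Minkowski distance with NumPy. The name `minkowski_distance` and the example points are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two points of equal dimension."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sum the p-th powers of the absolute coordinate differences, then take the p-th root
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Hypothetical example points
a, b = [1, 2], [4, 6]
print(minkowski_distance(a, b, p=1))  # 7.0
print(minkowski_distance(a, b, p=2))  # 5.0
```

The same two points are 7 units apart for p = 1 and 5 units apart for p = 2, so the choice of p changes the measured distance.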


## Euclidean Distance

Euclidean distance is one of the most widely used distance metrics. It is obtained from the Minkowski distance formula by setting $p$ to 2, which gives:


$$d(x, y) = \sqrt{\sum (x_i - y_i)^2}$$

The Euclidean distance formula gives the straight-line distance between two data points in a plane.


<img src = https://miro.medium.com/max/173/0*zoty_Iv6Im-PBDvw.png>


## Manhattan Distance


We use Manhattan distance when we need the distance between two data points along a grid-like path. As with Euclidean distance, we obtain it from the Minkowski distance formula, this time by setting $p$ to 1.

The distance d is the sum of the absolute differences between the Cartesian coordinates of the two points:

$$d(x, y) = \sum |x_i - y_i|$$


<img src = https://miro.medium.com/max/200/0*WH9xVZc-T9IsfH6a.png>
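To see why the definition of distance matters for KNN, here is a small sketch in which the Euclidean and Manhattan metrics disagree about which point is nearest. The query and the two candidate points are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical query point and two candidate neighbors
query = np.array([0.0, 0.0])
points = np.array([[3.0, 3.0],   # point A (index 0)
                   [0.0, 5.0]])  # point B (index 1)

# Euclidean distance (p = 2) and Manhattan distance (p = 1) from the query to each point
euclidean = np.sqrt(((points - query) ** 2).sum(axis=1))
manhattan = np.abs(points - query).sum(axis=1)

print(euclidean)  # A is about 4.24, B is 5.0
print(manhattan)  # A is 6.0, B is 5.0
print(euclidean.argmin(), manhattan.argmin())  # 0 1
```

Under Euclidean distance the nearest neighbor is A, while under Manhattan distance it is B, so a 1-nearest-neighbor prediction could change with the metric alone.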


## Hamming Distance


In information theory, the Hamming distance between two equal-length pieces of data is the number of positions at which the corresponding symbols differ. It is often used in error detection and correction and for comparing strings or other sequences of data.

The Hamming distance is a very practical metric for comparing strings. Computing it means walking through the two strings position by position and counting the places where the symbols differ. For example, take the text string “hello world” and contrast it with another text string, “herra poald.” There are five positions along the corresponding strings where the letters are different.

<img src = https://www.researchgate.net/profile/Fredrick-Ishengoma/publication/264978395/figure/fig1/AS:295895569584128@1447558409105/Example-of-Hamming-Distance.png>
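The count in the “hello world” example can be checked with a few lines of Python. This is a minimal sketch; the function name `hamming_distance` is an assumption made for illustration.

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance is only defined for sequences of equal length")
    # Compare the sequences position by position and count the mismatches
    return sum(a != b for a, b in zip(s, t))

print(hamming_distance("hello world", "herra poald"))  # 5
```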

## Exercise
Implement a KNN model for classification.

In [125]:
new_point = np.array([5,7])

In [128]:
new_point = np.array([4, 4])

In [133]:
new_points = {'x' : [1, 6, 4, 5], 'y' : [4, 3, 6, 1]}
new_points = pd.DataFrame(new_points).values
new_points

array([[1, 4],
       [6, 3],
       [4, 6],
       [5, 1]], dtype=int64)
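One possible solution to the exercise is sketched below. The notebook does not show its own implementation, so this is an assumption about what a class with the interface `Knn_classifier(X, y, k)` and a `predict` method could look like; the extra `p` parameter (Minkowski order) is also an addition made here. It is tried on the toy data and the four new points defined above.

```python
from collections import Counter
import numpy as np

class Knn_classifier:
    """Hypothetical brute-force KNN classifier using Minkowski distance of order p."""

    def __init__(self, X, y, k=5, p=2):
        # KNN has no real training step: it only stores the labelled points
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        self.k = k
        self.p = p

    def predict(self, points):
        points = np.atleast_2d(np.asarray(points, dtype=float))
        preds = []
        for point in points:
            # Distance from this point to every stored training point
            dists = (np.abs(self.X - point) ** self.p).sum(axis=1) ** (1.0 / self.p)
            # Indices of the k closest training points
            nearest = np.argsort(dists)[: self.k]
            # Majority vote among the labels of those neighbors
            votes = Counter(self.y[nearest].tolist())
            preds.append(votes.most_common(1)[0][0])
        return np.array(preds)

# Toy data from the first section and the new points defined above
X_toy = np.array([[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [9, 6]])
y_toy = np.array(['k', 'k', 'k', 'r', 'r', 'r'])
toy_points = np.array([[1, 4], [6, 3], [4, 6], [5, 1]])

toy_cls = Knn_classifier(X_toy, y_toy, k=3)
toy_pred = toy_cls.predict(toy_points)
print(toy_pred)  # ['k' 'k' 'r' 'k']
```

With k = 3 the point (6, 3) is classified as 'k' even though its single closest neighbor is red, because two of its three nearest neighbors are black. This sketch computes every distance on each call, which is fine for small data sets but slow for large ones.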

# Iris Data

In [137]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"


df = pd.read_csv(url, names=['lng sepalo','anch sepalo','lng petalo','anch petalo','especie'])

df.head()

Unnamed: 0,lng sepalo,anch sepalo,lng petalo,anch petalo,especie
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [148]:
X = df.iloc[:,0:4].values
X

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [149]:
y = df.iloc[:,4].values
y

array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versic

In [150]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [151]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((112, 4), (38, 4), (112,), (38,))

In [158]:
my_cls = Knn_classifier(X_train, y_train, k = 5)

In [159]:
pred = my_cls.predict(X_test)

In [160]:
(pred == y_test).mean()

0.9473684210526315

# Sklearn



In [163]:
cls = KNeighborsClassifier()

In [164]:
cls.fit(X_train, y_train)

KNeighborsClassifier()

In [165]:
pred = cls.predict(X_test)

In [166]:
(pred == y_test).mean()

0.9473684210526315
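`KNeighborsClassifier` also exposes the ideas from the distance sections: `n_neighbors` sets k (default 5) and `p` sets the order of the Minkowski metric (default 2, i.e. Euclidean). The sketch below compares Manhattan and Euclidean distance on Iris. It loads the data set with `load_iris` and fixes `random_state=0` so that it is self-contained; both are assumptions made here, so the accuracies need not match the ones printed above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Self-contained copy of the Iris data (assumption: same data as the UCI file used above)
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=0)

scores = {}
for p in (1, 2):  # p=1: Manhattan distance, p=2: Euclidean distance
    model = KNeighborsClassifier(n_neighbors=5, p=p)
    model.fit(X_tr, y_tr)
    # score() returns the mean accuracy on the held-out points
    scores[p] = model.score(X_te, y_te)

print(scores)
```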