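## KNN
The KNN (K-nearest neighbors) algorithm assumes that similar things exist in close proximity. It is a supervised machine learning algorithm that can be used to solve both classification and regression problems. For classification, the class of a new data point is decided by **majority voting** among its K nearest neighbors; for regression, the prediction is the mean of their labels.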

### Steps of the KNN Algorithm:
- 1. Load the data.
- 2. Initialize K to your chosen number of neighbors.
- 3. For each example in the data:
      - 3.1 Calculate the Euclidean distance between the new data point (the query) and the current example.
      - 3.2 Add the distance and the index of the example to an ordered collection.
- 4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances.
- 5. Pick the first K entries from the sorted collection.
- 6. Get the labels of the selected K entries.
- 7. If regression, return the mean of the K labels.
- 8. If classification, return the mode of the K labels.
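
The steps above can be sketched in plain Python. This is a minimal illustration, not an optimized implementation: the function names `euclidean_distance` and `knn_predict`, the `mode` argument, and the toy points are all assumptions made for this example.

```python
import math
from collections import Counter
from statistics import mean


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(data, labels, query, k, mode="classification"):
    """Predict the label of `query` from its k nearest examples in `data`."""
    # Steps 3.1-3.2: distance from the query to every example, with its index.
    distances_and_indices = [
        (euclidean_distance(example, query), index)
        for index, example in enumerate(data)
    ]
    # Step 4: sort ascending by distance (ties fall back to the index).
    distances_and_indices.sort()
    # Step 5: keep the first k entries.
    nearest = distances_and_indices[:k]
    # Step 6: look up the labels of those k entries.
    k_labels = [labels[index] for _, index in nearest]
    # Step 7: regression returns the mean of the k labels.
    if mode == "regression":
        return mean(k_labels)
    # Step 8: classification returns the mode of the k labels.
    return Counter(k_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
classes = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_predict(points, classes, (2.0, 2.0), k=3))                     # a
print(knn_predict(points, targets, (2.0, 2.0), k=3, mode="regression"))  # 2.0
```

Sorting every distance costs O(n log n) per query, which is fine for small data; library implementations such as scikit-learn's can use tree-based indexes instead.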

## How to choose the value of K?
- 1. As we decrease the value of K to 1, our predictions become less stable. For example, imagine K=1 and a query point surrounded by several reds and one green, where the green is the single nearest neighbor.

Reasonably, we would think the query point is most likely red, but because K=1, KNN incorrectly predicts that the query point is green.

- 2. Conversely, as we increase the value of K, our predictions become more stable due to majority voting / averaging, and thus more likely to be accurate (up to a certain point). Beyond that point, the number of errors starts to increase, which tells us we have pushed the value of K too far.

- 3. In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) among labels, we usually make K an odd number to have a tiebreaker. Note that an odd K rules out ties only when there are exactly two classes.
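
The effect of K described in this section can be reproduced on a small made-up example. The sketch below assumes a hypothetical layout: one green point right next to the query, four reds around it, and a cluster of six greens far away. The helper `majority_label` is an illustrative name, not part of any library.

```python
import math
from collections import Counter


def majority_label(points, labels, query, k):
    """Return the most common label among the k points closest to `query`."""
    order = sorted(
        range(len(points)),
        key=lambda i: math.hypot(points[i][0] - query[0], points[i][1] - query[1]),
    )
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]


# Hypothetical data: the query sits at the origin.
points = [
    (0.1, 0.0),                                        # single nearest neighbor (green)
    (0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5),  # surrounding reds
    (5.0, 5.0), (5.0, 6.0), (6.0, 5.0),                # distant greens
    (6.0, 6.0), (5.5, 5.5), (6.0, 7.0),
]
labels = ["green"] + ["red"] * 4 + ["green"] * 6
query = (0.0, 0.0)

predictions = {k: majority_label(points, labels, query, k) for k in (1, 3, 5, 11)}
for k, label in predictions.items():
    print(f"K={k:2d} -> {label}")
# K= 1 -> green  (decided by one noisy neighbor)
# K= 3 -> red
# K= 5 -> red
# K=11 -> green  (K covers the whole dataset, so the overall majority wins)
```

With K=1 the lone green neighbor decides the outcome, moderate values of K recover the local red majority, and a K as large as the dataset simply returns the globally most common class.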

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
from sklearn.datasets import load_iris
iris = load_iris()

In [3]:
df1 = pd.DataFrame(iris.data, columns = iris.feature_names)
df2 = pd.DataFrame(iris.target, columns = ['target'])
df = pd.concat((df1, df2), axis=1)
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [4]:
# Features: every column except the last; labels: the last column ('target').
x = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [5]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.20, random_state=10, shuffle=True)

In [7]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)

In [8]:
knn.fit(x_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [11]:
y_pred = knn.predict(x_test)

In [12]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.9666666666666667

In [13]:
np.where(y_test != y_pred)

(array([6], dtype=int64),)

In [14]:
print(knn.score(x_test,y_test))
knn.score(x_train,y_train)

0.9666666666666667


0.975

In [16]:
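# Inspect the misclassified test point (index 6):
# d holds the distances to its 5 nearest training points, t their indices in x_train.
d, t = knn.kneighbors(x_test[6].reshape(1,-1), n_neighbors=5, return_distance=True)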

In [17]:
t

array([[ 45, 114,   4,  39,  88]], dtype=int64)

In [21]:
d

array([[0.36055513, 0.36055513, 0.41231056, 0.42426407, 0.43588989]])

In [18]:
# Labels of the 5 nearest training points (t has one row per query point).
for i in t:
    print(y_train[i])

[2 2 2 1 2]


In [19]:
knn.predict(x_test[6].reshape(1,-1))

array([2])

In [20]:
y_test[6]

1
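
In the run above, test point 6 has true label 1, but four of its five nearest training neighbors carry label 2, so the majority vote returns 2. One way to check whether a different K would do better overall is to compare several values with cross-validation on the training split. The sketch below is an illustration under stated assumptions: the candidate range (odd values from 1 to 21) and the 5-fold setting are arbitrary choices, not taken from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Recreate the same split as in the cells above.
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=10, shuffle=True
)

# Assumed candidate values: odd K from 1 to 21.
candidate_ks = list(range(1, 22, 2))
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy, computed on the training data only.
    scores = cross_val_score(model, x_train, y_train, cv=5)
    mean_scores[k] = scores.mean()

for k, score in mean_scores.items():
    print(f"K={k:2d}  mean CV accuracy={score:.3f}")

# The K with the highest mean cross-validated accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print("best K:", best_k)
```

Keeping the test split out of this loop means `x_test` can still give an unbiased accuracy estimate for whichever K is selected.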