# K Nearest Neighbour and Sklearn

K Nearest Neighbours (KNN) is an algorithm based on the idea that data points with similar attributes tend to fall into similar categories. When you plot data points on a 2-D plot, points with similar attributes lie close together, so they are classified as belonging to the same category.

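To classify a point, KNN looks at the k closest labelled points and takes a majority vote among their categories. In effect, this draws a series of **decision boundaries** that divide the plot into regions, one region per category.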

![knn](imgs/knn-1.png)
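As a rough sketch of the idea (not sklearn's actual implementation, which can use tree-based search structures), the majority vote can be written in a few lines of NumPy. The function name `knn_predict` and the toy data are made up for illustration, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # the most common label among those neighbours wins
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# toy data (made up for illustration): two clusters in a 2-D plot
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict(X_train, y_train, np.array([1.1, 1.0]))  # near the first cluster
label_b = knn_predict(X_train, y_train, np.array([5.1, 5.0]))  # near the second cluster
print(label_a, label_b)  # 0 1
```

Each new point takes the label of the cluster it sits in, which is what the decision boundaries in the plot express.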

All machine learning models in sklearn are implemented as Python classes. These classes serve two purposes. They:

- implement the algorithms for learning and predicting
- store the information learned from the data.

Training a model is called **fitting** the model to the data. In sklearn we do this with the `.fit()` method.

To predict the label of a new data point we use sklearn's `.predict()` method.
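The two roles of the class (holding the algorithm and storing what was learned) can be seen on a tiny example. This is a minimal sketch: the one-feature dataset is made up, and `classes_` is used here simply as one example of an attribute that sklearn sets during fitting.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy data, made up for illustration: one feature, two classes
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
fitted_before = hasattr(knn, "classes_")   # nothing learned yet

knn.fit(X, y)                              # learn from the data
fitted_after = hasattr(knn, "classes_")    # the object now stores what it learned
learned_classes = knn.classes_.tolist()    # labels seen during fitting

prediction = knn.predict(np.array([[0.15]]))  # one new observation, one feature
print(fitted_before, fitted_after, learned_classes, prediction.tolist())
# False True [0, 1] [0]
```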

### Overview of using Sklearn to fit a KNN Classifier

In [None]:
import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets

# load the data
iris = datasets.load_iris()
X = iris.data # features, 2-D numpy array
y = iris.target # target, 1-D numpy array

# create the classifier
knn = KNeighborsClassifier(n_neighbors=6)

# fit the classifier to the training set
knn.fit(X, y)
# .fit() returns the fitted classifier; the notebook output was:
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#                      metric_params=None, n_jobs=1, n_neighbors=6, weights='uniform')

**NOTE**  

Sklearn requires that:

- the features are homogeneous, continuous values, e.g. all floats, as opposed to categorical data, e.g. male/female
- the features and labels/target are numpy arrays or pandas dataframes of the correct shape
- there are no missing values in the dataset
- each row is a sample or observation, and each column is a variable or feature
- the target is a single column with the same number of observations as the feature data

It is common practice to name the feature array `X` and the response variable `y`.
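The requirements above can be checked before fitting. The following is a minimal sketch on the iris data; the variable names (`target_matches`, `is_numeric`, `has_missing`) are made up for illustration, and these checks are not something sklearn asks you to run yourself.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# rows are observations, columns are features
n_samples, n_features = X.shape

# the target is a single column with one label per observation
target_matches = y.ndim == 1 and y.shape[0] == n_samples

# features are numeric and contain no missing values
is_numeric = np.issubdtype(X.dtype, np.number)
has_missing = bool(np.isnan(X).any())

print(X.shape, y.shape, target_matches, is_numeric, has_missing)
# (150, 4) (150,) True True False
```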

`.fit()` returns the classifier itself, now fitted to our data. We can then pass it some unlabelled data.

We call the `.predict()` method on the classifier and pass it the unlabelled data. Sklearn requires that the unlabelled data be an array, with the features in columns and the observations in rows.

For every observation in our unlabelled dataset, the classifier returns a prediction. Thus an unlabelled dataset of shape (3, 4), i.e. 3 observations and 4 features, returns a 1-D array with 3 predictions, e.g. [1, 1, 0].

In [None]:
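# illustrative measurements (made up): 3 observations x 4 features
X_new = np.array([[5.0, 3.4, 1.5, 0.2],
                  [6.1, 2.9, 4.6, 1.4],
                  [6.9, 3.1, 5.6, 2.2]])

# predicting on unlabelled data
y_prediction = knn.predict(X_new)  # 1-D array with 3 predictions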

The same steps, using the house votes data as an example:

In [None]:
import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsClassifier

df = 