## K-Nearest Neighbors
This notebook covers what K-Nearest Neighbors (KNN) is, how it works, and how to use it in Python. Along the way, it also introduces pipelines and shows how to use them.

### What is K-Nearest Neighbors

K-nearest neighbors is a model that uses the "K" most similar observations in order to make a prediction.

In [None]:
from IPython.display import Video

# Couldn't identify the source of this video. 
Video("images/KNN-Classification.mp4")

Here is roughly how K-Nearest Neighbors works:
1. The user specifies a value for K. In the example above, K=5 neighbors are chosen around the black point.
2. Search for the K observations in the training data that are nearest to the measurements of the unknown sample.
    * Euclidean distance is often used as the distance metric.
3. Use the most popular target value among the K nearest neighbors as the predicted target value. In the example above, of the 5 nearest neighbors of the black point, 2 are brown and 3 are green. Since the majority are green, the black point is assigned the green label.
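
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not how scikit-learn implements KNN internally; the function name `knn_predict` and the toy data are made up for this example, with K=5 chosen to mirror the 2-brown, 3-green vote described above.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x_new = np.asarray(x_new, dtype=float)

    # Step 2: Euclidean distance from x_new to every training observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]

    # Step 3: most common label among the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two small clusters of three points each
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])
y_toy = np.array(["brown", "brown", "brown", "green", "green", "green"])

# Step 1: choose K. The 5 nearest neighbors are 3 green and 2 brown points.
print(knn_predict(X_toy, y_toy, [2.9, 3.0], k=5))
```

Sorting all distances is the simplest approach but scales poorly with the number of training observations, which is one reason library implementations can use tree-based search structures instead.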

<b>Advantages of KNN</b>

It is easier to understand and explain than many other machine learning algorithms.

It can be used for classification or regression.

<b>Disadvantages of KNN</b>

It must store all of the training data.

Its prediction phase can be slow when n (the number of training observations) is large.

It typically performs worse than other supervised learning methods.
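
As noted in the advantages, KNN is not limited to classification. The sketch below uses scikit-learn's `KNeighborsRegressor` on made-up data (the arrays `X_reg` and `y_reg` are assumptions for illustration only) to show that, with the default uniform weights, the prediction is the average target of the K nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data (hypothetical): y is exactly 2 * x
X_reg = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_reg = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# With uniform weights (the default), the prediction is the mean target
# of the n_neighbors closest training observations
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_reg, y_reg)

# The 3 nearest neighbors of x = 3 are x = 2, 3, 4 with targets 4, 6, 8
pred = reg.predict(np.array([[3.0]]))
print(pred)
```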

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap

# For scaling data
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from sklearn import metrics

### Load Data
The Iris dataset is one of the datasets that ships with scikit-learn, so it does not require downloading any file from an external website. The code below loads the Iris dataset.

In [None]:
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.head()

### Arrange Data into Features Matrix and Target Vector

For demonstration purposes, we are going to use only two features: sepal length and sepal width.

In [None]:
X = df.loc[:, ['sepal length (cm)', 'sepal width (cm)']]
#X = df.loc[:, df.columns != 'target']

In [None]:
X.shape

In [None]:
y = df.loc[:, 'target'].values

In [None]:
y.shape

### Split the data into training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 0,
                                                    test_size = .2)

### KNN in `scikit-learn`

<b>Step 1:</b> Import the model you want to use

In scikit-learn, all machine learning models are implemented as Python classes.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

<b>Step 2:</b> Make an instance of the Model

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)

In [None]:
print(knn)

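<b>Step 3:</b> Train the model on the data, storing the information learned from the data. The model learns the relationship between the features and the labels. For KNN, fitting mainly stores the training data so that neighbors can be looked up at prediction time.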

In [None]:
knn.fit(X_train, y_train)

<b>Step 4:</b> Predict the labels of new data

This step uses the information the model learned during training.

In [None]:
predictions = knn.predict(X_test)

In [None]:
predictions

In [None]:
# calculate classification accuracy
score = knn.score(X_test, y_test)

In [None]:
score
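
For a classifier, `score` returns accuracy: the fraction of test observations whose predicted label matches the true label. The sketch below recomputes that number by hand as a sanity check. It rebuilds the same two-feature setup so it can run on its own; the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te`, and `model` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Rebuild the two-feature setup so this example runs on its own
iris = load_iris()
X_demo = iris.data[:, :2]  # sepal length and sepal width
y_demo = iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          random_state=0,
                                          test_size=.2)

model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# Accuracy is the share of test predictions that match the true labels
manual_accuracy = np.mean(model.predict(X_te) == y_te)

# Both numbers should agree
print(manual_accuracy, model.score(X_te, y_te))
```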

### Visualizing Data

In [None]:
cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])
h = .02  # step size in the mesh


# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max].
x_min = X_train['sepal length (cm)'].min() - 1
x_max = X_train['sepal length (cm)'].max() + 1
y_min = X_train['sepal width (cm)'].min() - 1
y_max = X_train['sepal width (cm)'].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Wrap the grid in a DataFrame with the training column names so that
# predict() sees the same feature names the model was fitted with
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=X_train.columns)
Z = knn.predict(grid)

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot also the training points
plt.scatter(X_train.loc[:, 'sepal length (cm)'].values,
            X_train.loc[:, 'sepal width (cm)'].values,
            c=y_train,
            cmap=cmap_bold,
            edgecolor='k',
            s=20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = 5)")

In [None]:
xx.shape

### Tuning k
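When k is low, KNN is considered a low bias, high variance model: predictions follow the training data closely, so the decision boundary is jagged and sensitive to noise.

When k is high, KNN is considered a high bias, low variance model: predictions are a vote over many neighbors, so the decision boundary is smoother but can miss local structure.

In the video, as K is increased, the borders between the classification regions become smoother.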

In [None]:
# Source not clear for this video
# Maybe machinelearningknowledge?
Video("images/KNNlowtoHigh.mp4")
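
One common way to pick k is to compare cross-validated accuracy over a handful of candidate values. The sketch below is one possible approach rather than a prescribed recipe: the candidate list `k_values`, the 5-fold setting, and the variable names are assumptions made for illustration. It rebuilds the two-feature split so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Rebuild the two-feature setup so this example runs on its own
iris = load_iris()
X_demo = iris.data[:, :2]  # sepal length and sepal width
y_demo = iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          random_state=0,
                                          test_size=.2)

k_values = [1, 5, 15, 30, 50]  # hypothetical candidates
cv_accuracy = {}
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation on the training set only;
    # the test set stays untouched for the final evaluation
    scores = cross_val_score(model, X_tr, y_tr, cv=5)
    cv_accuracy[k] = scores.mean()

for k, acc in cv_accuracy.items():
    print(f"k = {k:2d}: mean CV accuracy = {acc:.3f}")
```

Very small and very large values of k often both score worse than something in between, which is the bias-variance trade-off described above showing up in the numbers.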

In [None]:
# Code that generated the images for the video
"""
for num_neighbors in range(1, 51):

    # Make an instance of the Model
    knn = KNeighborsClassifier(n_neighbors=num_neighbors)

    # Train the model on the data
    knn.fit(X_train, y_train)

    cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
    cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])
    h = .005  # step size in the mesh


    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X_train.loc[:, 'sepal length (cm)'].values.min() - 1, X_train.loc[:, 'sepal length (cm)'].values.max() + 1
    y_min, y_max = X_train.loc[:, 'sepal width (cm)'].values.min() - 1, X_train.loc[:, 'sepal width (cm)'].values.max() + 1

    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(figsize = (7,7))
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X_train.loc[:, 'sepal length (cm)'].values,
                X_train.loc[:, 'sepal width (cm)'].values,
                c=y_train,
                cmap=cmap_bold,
                edgecolor='k',
                s=40)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(fontsize = 15)
    plt.yticks(fontsize = 15)
    plt.title("3-Class classification k = " + str(num_neighbors), fontsize = 15)
    plt.savefig('imagesanimation/' + 'initial' + str(num_neighbors).zfill(4) + '.png', dpi = 50)
    plt.close()  # close the figure so that 50 open figures do not accumulate
"""

In [None]:
# ignore
#!ffmpeg -framerate 1 -i 'initial%04d.png' -c:v libx264 -r 30 -pix_fmt yuv420p initial_002.mp4

## Benefits of Pipelines
Pipelines are a simple way to keep your data processing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

* Cleaner code: You don’t need to keep track of your training data at each step of processing, which can get messy.
* Fewer bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
* More options for model testing: A pipeline behaves like a single estimator, so it can be passed directly to tools such as cross-validation and grid search.
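
As a rough illustration of the last point, the sketch below passes a pipeline to `GridSearchCV` as if it were a single model. The parameter grid and variable names are assumptions chosen for this example. Because the scaler lives inside the pipeline, it is refit on the training portion of each cross-validation fold, so no information from the held-out fold leaks into preprocessing.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target,
                                          random_state=0,
                                          test_size=.2)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name, so the
# KNN parameter is addressed as 'kneighborsclassifier__n_neighbors'
param_grid = {"kneighborsclassifier__n_neighbors": [1, 5, 15]}  # hypothetical grid

# The whole pipeline (scaling + KNN) is cross-validated as one estimator
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))
```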


### Arrange Data into Features Matrix and Target Vector

In [None]:
X = df.loc[:, df.columns != 'target']
y = df.loc[:, 'target'].values

### Split the data into training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 0,
                                                    test_size = .2)

### KNN in `scikit-learn`

In [None]:
# Standardize the features, reduce them to 2 dimensions with PCA, then fit KNN
std_clf = make_pipeline(StandardScaler(),
                        PCA(n_components=2, random_state=0),
                        KNeighborsClassifier(n_neighbors=5))

In [None]:
std_clf.fit(X_train, y_train)
pred_test_std = std_clf.predict(X_test)

In [None]:
print('\nPrediction accuracy for the standardized test dataset with PCA')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test_std)))

In [None]:
# Extract the fitted PCA and scaler steps from the pipeline
pca_std = std_clf.named_steps['pca']
scaler = std_clf.named_steps['standardscaler']

# Scale X_train, then project it onto the two principal components for visualization
X_train_std_transformed = pca_std.transform(scaler.transform(X_train))

# Visualize the standardized training data after PCA
for l, c, m in zip(range(0, 3), ('blue', 'red', 'green'), ('^', 's', 'o')):
    plt.scatter(X_train_std_transformed[y_train == l, 0],
                X_train_std_transformed[y_train == l, 1],
                color=c,
                label='class %s' % l,
                alpha=0.5,
                marker=m
                )

plt.title('Standardized training dataset after PCA')
plt.xlabel('1st principal component')
plt.ylabel('2nd principal component')
plt.legend(loc='upper right')
plt.grid()

plt.tight_layout()