# K-Nearest Neighbor

- Supervised Algorithm
- Asks: Who is around me?
- Scaling recommended

Pros:
1. Simple
1. Robust to noise
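1. Lazy learner: there is no separate training step, and calculations are performed at prediction time
1. Can be used in quickly changing conditions, since new observations can be added without retraining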

Cons:
1. "Correct" value for k is a bit ambiguous
1. High computational cost: the entire training set needs to be held in memory and searched for every prediction
1. Highly sensitive to the "curse of dimensionality" - fields need to be carefully curated
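
To illustrate why scaling is recommended, here is a minimal sketch in plain Python. The data (height in metres, income in dollars) and the helper names `euclidean`, `standardize`, and `nearest_to_first` are made up for illustration and are not part of this notebook. It shows how a feature with a large numeric range can dominate the distance calculation, and how z-score standardization can change which point counts as the nearest neighbor.

```python
import math
from statistics import mean, pstdev

# Hypothetical observations: (height in metres, income in dollars)
points = {
    "a": (1.60, 50_000),
    "b": (1.95, 50_600),
    "c": (1.61, 50_800),
}

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def standardize(rows):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas))
            for row in rows]

names = list(points)
raw = [points[n] for n in names]
scaled = standardize(raw)

def nearest_to_first(rows):
    """Name of the point closest to the first row (point 'a')."""
    dists = {names[i]: euclidean(rows[0], rows[i]) for i in range(1, len(rows))}
    return min(dists, key=dists.get)

raw_neighbor = nearest_to_first(raw)        # income dominates the distance
scaled_neighbor = nearest_to_first(scaled)  # both features contribute

print("nearest to 'a' on raw data:   ", raw_neighbor)
print("nearest to 'a' on scaled data:", scaled_neighbor)
```

On the raw values, `b` is nearest to `a` because the income gap swamps the height gap. After standardization, `c` (almost the same height as `a`) becomes the nearest neighbor.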

## Imports

In [None]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

from pydataset import data

## Acquire

In [None]:
# read Iris data from pydataset
df = data('iris')

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.hist()

## Prepare

In [None]:
# convert column names to lowercase, replace '.' in column names with '_'
df.columns = [col.lower().replace('.', '_') for col in df]

df.head()

### Splitting

In [None]:
def train_validate_test_split(df, target, seed=123):
    '''
    This function takes in a dataframe, the name of the target variable
    (for stratification purposes), and an integer for setting a seed,
    and splits the data into train, validate and test.
    Test is 20% of the original dataset, validate is .30*.80= 24% of the 
    original dataset, and train is .70*.80= 56% of the original dataset. 
    The function returns, in this order, train, validate and test dataframes. 
    '''
    train_validate, test = train_test_split(df, test_size=0.2, 
                                            random_state=seed, 
                                            stratify=df[target])
    train, validate = train_test_split(train_validate, test_size=0.3, 
                                       random_state=seed,
                                       stratify=train_validate[target])
    return train, validate, test

In [None]:
# split into train, validate, test
train, validate, test = train_validate_test_split(df, target='species', seed=123)

# create X & y versions of train, validate and test, where y is a Series with just the target variable and X holds all the features.

X_train = train.drop(columns=['species'])
y_train = train.species

X_validate = validate.drop(columns=['species'])
y_validate = validate.species

X_test = test.drop(columns=['species'])
y_test = test.species

## Modeling

#### Create KNN Object

In [None]:
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')

#### Fit the model

In [None]:
knn.fit(X_train, y_train)

#### Make predictions

In [None]:
y_pred = knn.predict(X_train)

#### Estimate Probability

In [None]:
y_pred_proba = knn.predict_proba(X_train)

## Evaluation

#### Compute Accuracy

In [None]:
print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))

#### Confusion Matrix

In [None]:
print(confusion_matrix(y_train, y_pred))

#### Classification Report

In [None]:
print(classification_report(y_train, y_pred))

## Replicating the KNN Algorithm With our Neural Network

In [None]:
# Three labeled observations (made-up data)
samples = pd.DataFrame({'a': [5.7, 5.5, 6.3], 
                        'b': [2.6, 3.5, 2.8], 
                        'c': [3.5, 1.3, 5.1], 
                        'd': [1.0, 0.2, 1.5], 
                        'target': ['versicolor', 'setosa', 'virginica']
                       })


samples

Now we train our brain's neural network by staring at the data and trying to develop some insight.

Excellent, now that our personal algorithm has been fit to the data, let's look at unseen data:

In [None]:
new_obs = pd.DataFrame([[6.3, 2.8, 5.1, 1.4], 
                       [6.25, 2.77, 5.09, 1.35], 
                       [5.5, 3.5, 1.29, 0.3]], 
                        columns = ['a', 'b', 'c', 'd'])

new_obs

Time to label. Which prediction should we make for each of these new observations?

KNN computes the Euclidean distance from each new observation to every labeled sample and keeps the K samples with the shortest distances. It then asks which species is most common among those K neighbors (i.e., what is the mode of their labels) and predicts that species.
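
The distance-then-vote procedure described above can be sketched in plain Python. This is an illustrative sketch only, not how `KNeighborsClassifier` is implemented internally. The function name `knn_predict` is hypothetical, and the values are copied from the `samples` and `new_obs` dataframes above. Because there is only one labeled sample per species, the sketch uses k=1.

```python
import math
from collections import Counter

# Labeled samples, copied from the `samples` dataframe: (features, label)
samples = [
    ((5.7, 2.6, 3.5, 1.0), "versicolor"),
    ((5.5, 3.5, 1.3, 0.2), "setosa"),
    ((6.3, 2.8, 5.1, 1.5), "virginica"),
]

# Unlabeled observations, copied from the `new_obs` dataframe
new_obs = [
    (6.3, 2.8, 5.1, 1.4),
    (6.25, 2.77, 5.09, 1.35),
    (5.5, 3.5, 1.29, 0.3),
]

def knn_predict(train, point, k=1):
    """Predict a label by majority vote among the k nearest labeled samples."""
    # Sort the labeled samples by Euclidean distance to the new point
    by_distance = sorted(train, key=lambda row: math.dist(row[0], point))
    # Keep the labels of the k closest samples
    k_labels = [label for _, label in by_distance[:k]]
    # The mode of those labels is the prediction
    return Counter(k_labels).most_common(1)[0][0]

predictions = [knn_predict(samples, obs, k=1) for obs in new_obs]
print(predictions)  # ['virginica', 'virginica', 'setosa']
```

With a larger labeled set, raising `k` lets several neighbors vote. How ties are broken is a design choice that this sketch leaves to the ordering of `Counter.most_common`.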

## Validation

Compute the accuracy of the model when run on the validate dataset.

In [None]:
print('Accuracy of KNN classifier on validate set: {:.2f}'
     .format(knn.score(X_validate, y_validate)))

## Visualizing the Model

Let's look at how accuracy changes across different values of k.

In [None]:
# fit a model for each k and record its accuracy on validate
# (test stays untouched until a final model has been chosen)
k_range = range(1, 20)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_validate, y_validate))
plt.figure()
plt.xlabel('k')
plt.ylabel('accuracy')
plt.scatter(k_range, scores)
plt.xticks([0, 5, 10, 15, 20])
plt.show()

## Exercise Time