### Code for help in building a KNN classifier
#### Dr. Bruns

This code walks through the ideas you need to build a k-nearest-neighbors (KNN) classifier.

In [1]:
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix
from scipy.stats import mode, zscore
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

Read the Iris data set.
Each row gives measurements for a single Iris flower.
I use the species associated with each row as the index of the DataFrame.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/grbruns/cst383/master/iris.csv', index_col=4)

Apply zscore normalization to every column.

In [3]:
df = df.apply(zscore)
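
As a minimal sketch of what `zscore` computes for each column, the toy example below (made-up numbers, not Iris data) subtracts the mean and divides by the population standard deviation, which is SciPy's default (`ddof=0`).

```python
import numpy as np
from scipy.stats import zscore

# Toy column of made-up values (not taken from the Iris data).
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Manual z-score: subtract the mean, divide by the population std (ddof=0).
manual = (x - x.mean()) / x.std()

# scipy.stats.zscore uses ddof=0 by default, so the two results should agree.
print(np.allclose(manual, zscore(x)))
```

After normalization each column has mean 0 and standard deviation 1, so no single measurement dominates the distance calculation.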

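To keep things simple, use 20 randomly selected rows of the data.
Note that np.random.choice() samples with replacement by default, so the same row may be picked more than once.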

In [4]:
np.random.seed(0)
m = 20
df_small = df.iloc[np.random.choice(df.shape[0], m), :]
df_small.head()

species,sepal_length,sepal_width,petal_length,petal_width
setosa,-1.506521,0.328414,-1.340227,-1.315444
virginica,2.249683,1.709595,1.672157,1.317199
versicolor,-0.052506,-0.82257,0.194384,-0.262387
virginica,0.553333,-0.362176,1.046945,0.790671
setosa,-1.143017,0.098217,-1.283389,-1.447076
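
If distinct rows are preferred, one option is to pass `replace=False` when sampling. The sketch below is not the cell used above: it uses NumPy's `default_rng` generator and a hypothetical population size of 150, so its indices will differ from the ones in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
n_rows, m = 150, 20              # hypothetical population and sample sizes

with_repl = rng.choice(n_rows, m)                    # duplicates are possible
without_repl = rng.choice(n_rows, m, replace=False)  # all indices are distinct

print(len(set(without_repl.tolist())) == m)
```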


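From this small data set, create a training set and a test set. Passing an integer as test_size holds out that many rows, so 5 rows go to the test set and 15 to the training set.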


In [5]:
X = df_small.values
y = df_small.index.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=5, random_state=42)

Create a distance matrix.
There is one row for every row of X_test.
There is one column for every row of X_train.
Entry (i, j) is the Euclidean distance between row i of X_test and row j of X_train.

In [6]:
dm = distance_matrix(X_test, X_train)
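
To make the layout concrete, here is a small sketch with made-up points (the arrays `A` and `B` are illustrative, not the Iris data). It checks the shape returned by `distance_matrix` and compares the entries with a manual Euclidean distance computed by broadcasting.

```python
import numpy as np
from scipy.spatial import distance_matrix

A = np.array([[0.0, 0.0], [3.0, 4.0]])               # 2 "test" points
B = np.array([[0.0, 0.0], [6.0, 8.0], [3.0, 0.0]])   # 3 "training" points

dm_toy = distance_matrix(A, B)
print(dm_toy.shape)   # (2, 3): one row per point in A, one column per point in B

# Same distances via broadcasting: pairwise differences, squared, summed, rooted.
manual = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))
print(np.allclose(dm_toy, manual))
```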

If we look at the first row of dm, we can see the
distances from the first row of the test data to
all of the rows of the training data.

In [7]:
row = dm[0,:]
row

array([1.32828601, 3.42843578, 3.85411762, 3.70474792, 0.67807863,
       3.67957157, 2.6273335 , 2.47623259, 0.45352282, 0.45352282,
       0.76304261, 3.67957157, 4.54009611, 2.47623259, 1.29203689])

The smallest value in this array corresponds to the row of the training data that is most similar to the first row of the test data.   What is the row number for the training example that is most similar to the first row of the test data?  We can find out with argsort().

In [8]:
np.argsort(row)

array([ 8,  9,  4, 10, 14,  0,  7, 13,  6,  1,  5, 11,  3,  2, 12],
      dtype=int64)
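
A small sketch of how `argsort()` behaves, using a made-up array `d`: it returns the positions that would sort the array, so indexing with the result yields the values in ascending order, and the first position agrees with `np.argmin` when the minimum is unique.

```python
import numpy as np

d = np.array([1.3, 3.4, 0.7, 2.6])   # made-up distances
order = np.argsort(d)

print(order)                      # [2 0 3 1]
print(d[order])                   # [0.7 1.3 2.6 3.4]
print(order[0] == np.argmin(d))   # True
```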

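You can confirm that the smallest value in the array row (0.45352282) is at index 8; index 9 holds the same distance, so the two training rows are tied. This means that the first row of the test data is most similar to row 8 (and equally to row 9) of the training data. Let's look at the values for row 8 to confirm they're similar.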

In [9]:
print(X_test[0])
print(X_train[8])

[-1.50652052  0.32841405 -1.34022653 -1.3154443 ]
[-1.14301691  0.09821729 -1.2833891  -1.44707648]


The species of the rows of the training data that are closest to the first row of the test data can be found like this:

In [10]:
y_train[np.argsort(row)]

array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica'],
      dtype=object)

In other words, the six training rows closest to the first row of the test data are all of species setosa.

Here are the species associated with the 5 rows of the training data that are most similar to the first row of the test data: 

In [11]:
y_train[np.argsort(row)][:5]

array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa'], dtype=object)

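In this example all five are setosa. In general, the predicted class is the species that appears most often among the k nearest neighbors. Older versions of scipy.stats.mode() accept string labels, but recent SciPy releases reject non-numeric input, so the cell below counts the labels with np.unique() instead.

In [12]:
closest_species = y_train[np.argsort(row)][:5]
# count how often each species occurs among the 5 nearest neighbors
labels, counts = np.unique(closest_species, return_counts=True)
# the most frequent species is the prediction
labels[np.argmax(counts)]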

'setosa'

Is this prediction correct?   Let's check the species for the first row of the test data.

In [13]:
y_test[0]

'setosa'

In this example I showed how to find the predicted class for the 
first row of the test data.  The same idea can be used to get 
predictions for all rows of the test data set.
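
One way the idea might be extended to every test row is sketched below. The function name `knn_predict`, its parameters, and the toy data are assumptions made for illustration. Ties between classes are broken by whichever label sorts first, which is an arbitrary choice.

```python
import numpy as np
from scipy.spatial import distance_matrix

def knn_predict(X_train, y_train, X_test, k=5):
    """Hypothetical helper: majority vote among the k nearest training rows."""
    dm = distance_matrix(X_test, X_train)
    # For every test row, the indices of its k nearest training rows.
    nearest = np.argsort(dm, axis=1)[:, :k]
    preds = []
    for idx in nearest:
        # Count the labels among the neighbors and keep the most frequent one.
        labels, counts = np.unique(y_train[idx], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Made-up toy data: two well-separated clusters labeled 'a' and 'b'.
X_tr = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_tr = np.array(['a', 'a', 'a', 'b', 'b', 'b'])
X_te = np.array([[0.05, 0.05], [5.0, 5.1]])

print(knn_predict(X_tr, y_tr, X_te, k=3))   # ['a' 'b']
```

With the variables defined earlier, calling `knn_predict(X_train, y_train, X_test, k=5)` and comparing the result to `y_test` element by element would give the fraction of correct predictions.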