# Digit Recognition Using K Nearest Neighbors

We've briefly gone over using K Nearest Neighbors in Python.

Let's go more in depth this time, as well as introduce basic cross validation. Cross validation will give us a more reliable estimate of how the model scores on data it has not seen, as well as help us to automatically tune the model (i.e. select the best `k` value).

In [None]:
import numpy as np
import pandas as pd
import sklearn as sk

from sklearn import datasets
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#What kind of object is digits?
digits = datasets.load_digits()

In [None]:
print(digits.DESCR)  # The description of the dataset
print(digits.images.shape)  # The formatted array: a NumPy array of 1797 x 8 x 8 (1797 images of 8x8 pixels)
print(digits.data.shape)  # The raw data: a NumPy array of size 1797 x 64. Imagine each 8x8 image's rows stacked into a single vector

print(digits.target.shape)
print(digits.target_names)

Let's take a look at a sample image

In [None]:
digits.images[0] #If you squint your eyes hard enough, the non zero values look roughly like a "0"

In [None]:
digits.data[0] #Same as above, but each row has been stacked into a single vector
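
To convince ourselves that `digits.data` really is just `digits.images` with each image's rows stacked end to end, we can flatten the images ourselves and compare. This is a small sketch; the names `flattened` and `same` are only for illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten each 8x8 image row by row into a 64-element vector
flattened = digits.images.reshape(len(digits.images), -1)

# True if the flattened images match digits.data element for element
same = np.array_equal(flattened, digits.data)
print(flattened.shape, same)
```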

In [None]:
#Let's visualize what these images look like
#plt.imshow( digits.images[0])
plt.imshow( digits.images[0], cmap=plt.cm.gray_r) #Color map, or cmap provides a mapping between the values, and the represented colors

In [None]:
# Let's look at a few of these
plt.figure(figsize=(10, 6))  # Create one figure up front so all the subplots share it
for index in range(5):
    image = digits.images[index]
    label = digits.target[index]

    plt.subplot(2, 3, index + 1)
    plt.imshow(image, cmap=plt.cm.gray_r)
    plt.title("Image label: %i" % label)

Okay, let's see if we can do a decent job of classifying digits. We will treat each row of `digits.data` as one sample.

* `X = digits.data`, where the columns are features (i.e. pixel values) and the rows are samples
* `y = digits.target`, where each element is the true label

In [None]:
# scikit-learn (imported as sklearn) is a widely used package for machine learning in Python.
# It has a host of machine learning tools/models, and various datasets

# http://scikit-learn.org/stable/ 
from sklearn import neighbors

In [None]:
n_samples = digits.images.shape[0]
X         = digits.data
y         = digits.target 

We've just created our entire dataset. Ideally, before building a model, we would like to split this dataset into a training set and a testing set.

```
X = [ 0 1 4 6 8 8 ... 9 ]
    [ 1 5 7 2 0 0 ... 3 ]
    [ 4 3 5 9 8 0 ... 0 ]
    [ 3 5 0 0 1 1 ... 7 ]
    [ 4 4 9 3 9 4 ... 1 ]  
    [ 0 4 2 4 1 1 ... 1 ]
    
split to...

X = [ 0 1 4 6 8 8 ... 9 ]
    [ 1 5 7 2 0 0 ... 3 ]  Training Data Set
    [ 4 3 5 9 8 0 ... 0 ]
    [ 3 5 0 0 1 1 ... 7 ]
    ---------------------
    [ 4 4 9 3 9 4 ... 1 ]  Testing Data Set
    [ 0 4 2 4 1 1 ... 1 ]
    
```




```
Y = [ 1 ]
    [ 0 ]
    [ 9 ]
    [ 7 ]
    [ 3 ]
    [ 4 ]

split to...

Y = [ 1 ]
    [ 0 ]
    [ 9 ]  Training Labels
    [ 7 ]
    -----
    [ 3 ]  Testing Labels
    [ 4 ]

```

Why can't we just select `rows 0 to #rows x 0.8` as training data, and `#rows x 0.8 to #rows` as testing data?
**The original data may already have been ordered when it was collected.** For example, what if the person gathering the images first grabbed all the 0 images, then the 1 images, then the 2 images, and so on? Splitting that dataset by position would train the model almost entirely on the low digits and test it on digits it has never seen, so neither set would be a random sample.
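
To make the problem concrete, here is a small sketch with made-up labels that are sorted by class (100 samples of each digit). The label array and variable names are hypothetical and not taken from the digits dataset.

```python
import numpy as np

# Hypothetical labels sorted by class: 100 samples each of the digits 0..9
y_sorted = np.repeat(np.arange(10), 100)
split = int(len(y_sorted) * 0.8)

# Naive split by position: first 80% for training, last 20% for testing
naive_train, naive_test = y_sorted[:split], y_sorted[split:]
print(np.unique(naive_train))  # classes present in the training labels
print(np.unique(naive_test))   # classes present in the testing labels

# Shuffling the order first mixes the classes across both sets
rng = np.random.default_rng(0)
shuffled = y_sorted[rng.permutation(len(y_sorted))]
shuf_train, shuf_test = shuffled[:split], shuffled[split:]
print(np.unique(shuf_train))
```

With the naive split, the training labels and the testing labels have no classes in common, so a model would be tested only on digits it never saw.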

In [None]:
# Luckily, there is a function in "model_selection" (formerly "cross_validation") called "train_test_split" that shuffles and splits the data for us
from sklearn.model_selection import train_test_split

In [None]:
# This is an example of tuple unpacking. 
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8)  # Use 80% of the data for training
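
As a quick sanity check, we can look at the shapes that come back from the split. This sketch imports `train_test_split` from `sklearn.model_selection`, where it lives in current scikit-learn releases. The `random_state=0` argument is an addition for illustration only: it fixes the shuffle so the same split comes back on every run.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X, y = digits.data, digits.target

# random_state fixes the shuffle so the split is the same on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

# Every sample ends up in exactly one of the two sets
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```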

Now that we've split up our data, we can 'fit' or 'create' the model using our training data, and score on the testing data

In [None]:
knn = neighbors.KNeighborsClassifier(n_neighbors=10)

In [None]:
# The "fit" function will be available in nearly all the models that you can import from sklearn
# In this case, fit just remembers the data and labels that you have passed in.
knn.fit(X_train,y_train)

`KNeighborsClassifier` takes several parameters. The ones to note are `weights`, `n_neighbors`, and `metric`. We set `n_neighbors=10` above; for now, let's keep the defaults for the others (the default metric amounts to Euclidean distance).
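
For reference, here is a sketch of how those parameters could be changed. The particular choices (`weights="distance"` and `metric="manhattan"`) are just examples of valid settings, not recommendations for this dataset.

```python
from sklearn import neighbors

# Default settings: every neighbor gets an equal vote, Minkowski metric with p=2 (Euclidean)
knn_default = neighbors.KNeighborsClassifier(n_neighbors=10)

# Alternative: closer neighbors get a larger vote, and distance is Manhattan
knn_weighted = neighbors.KNeighborsClassifier(
    n_neighbors=10, weights="distance", metric="manhattan"
)

# get_params() returns the current settings as a dict
default_params = knn_default.get_params()
weighted_params = knn_weighted.get_params()
print(default_params["weights"], default_params["metric"], default_params["p"])
print(weighted_params["weights"], weighted_params["metric"])
```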

Now that we've fit the model, let's see how well it predicts the samples that it trained on.

In [None]:
# In predict, you pass in a matrix of samples you want to predict. 
# For each row in X_train it will return a prediction of the label
y_pred = knn.predict(X_train)
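
Since `fit` only remembers the data, all of the real work happens in `predict`. The idea can be sketched in plain NumPy as below. This is a simplified illustration on a made-up toy dataset, not how scikit-learn implements it internally; `knn_predict`, `X_toy`, and `y_toy` are hypothetical names.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Label each row of X_new by majority vote among its k nearest training rows."""
    predictions = []
    for x in X_new:
        # Euclidean distance from x to every training sample
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # Indices of the k closest training samples
        nearest = np.argsort(distances)[:k]
        # Count the labels among those neighbors (in this sketch, ties go to the smallest label)
        votes = np.bincount(y_train[nearest])
        predictions.append(np.argmax(votes))
    return np.array(predictions)

# Tiny made-up dataset: two clusters labeled 0 and 1
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_pred = knn_predict(X_toy, y_toy, np.array([[0.5, 0.5], [5.5, 5.5]]), k=3)
print(toy_pred)  # [0 1]
```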

Let's see how well it predicted on the training data. Different models use different default scoring methods, but for a classifier like KNN the score is accuracy: #Correct/#Samples (or # Rows correctly predicted / # Total Rows).

In [None]:
np.sum(y_pred == y_train) / np.float64(len(y_train))

scikit-learn also has a `score` function built in. It handles the prediction and the score calculation in a single call.

In [None]:
# score() predicts on the samples passed in and returns the accuracy against the given labels
knn.score(X_train,y_train)

In [None]:
# Let's score on the test data. Pretty good!
knn.score(X_test,y_test)

What kind of images did the model not do well on?

* We can predict using the model on X_test, and save the predictions to y_pred.
* Then, we will check the indexes where y_pred != y_test.
* Finally, for a few of these, we can plot out the image

In [None]:
y_pred = knn.predict(X_test)

In [None]:
# We index with [0] at the end because np.where returns a tuple with one array per dimension
inc_idx = np.where(y_pred != y_test)[0]

In [None]:
for plotidx, idx in enumerate(inc_idx):
    plt.subplot(1,len(inc_idx),plotidx+1)
    plt.imshow(X_test[idx].reshape( (8,8)) ,cmap=plt.cm.gray_r )
    plt.xlabel("Pred: %i True: %i" %( y_pred[idx], y_test[idx]))
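
Another way to see which digits get confused with each other is a confusion matrix. The sketch below rebuilds the model from scratch so that it runs on its own; `random_state=0` is an assumption added for repeatability, so the counts will differ from the split used above.

```python
from sklearn import datasets, neighbors
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.8, random_state=0
)

knn = neighbors.KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Rows are the true digits, columns are the predicted digits;
# off-diagonal entries count the misclassifications
cm = confusion_matrix(y_test, y_pred, labels=list(range(10)))
print(cm)
```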

# Cross Validation

The incorrect classifications seem somewhat reasonable. But what if we just got lucky with how the data was split? Because `train_test_split` shuffles randomly, rerunning the cells above gives a different score each time.

To the whiteboard for cross validation...
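
For anyone reading without the whiteboard, the basic idea is: split the data into several folds, hold one fold out for testing while training on the rest, repeat so that every fold is held out once, and average the scores. Here is a rough sketch of that loop written by hand. Using `StratifiedKFold` (which keeps the class proportions similar in each fold) is one reasonable choice for this illustration; plain `KFold` would work too.

```python
import numpy as np
from sklearn import datasets, neighbors
from sklearn.model_selection import StratifiedKFold

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Split the data into 5 folds, keeping class proportions similar in each fold
folds = StratifiedKFold(n_splits=5)

fold_scores = []
for train_idx, test_idx in folds.split(X, y):
    # Train on four folds, then score on the one that was held out
    model = neighbors.KNeighborsClassifier(n_neighbors=10)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print(fold_scores)           # one accuracy per held-out fold
print(np.mean(fold_scores))  # the cross validated estimate
```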



In [None]:
# The sklearn model_selection module (formerly cross_validation) also includes a nifty function called cross_val_score
from sklearn.model_selection import cross_val_score

In [None]:
# cross_val_score takes in mainly the original model object, the entire dataset X, and the entire labelset y
# the argument cv indicates how many folds or partitions to build models for
# Return value: Score on the "test set" within each partition
cv_score = cross_val_score(knn, X, y, cv=5)

In [None]:
cv_score.mean()

# Exercise

Let's see if cross validation can give us a cleaner one step solution for picking the best `k` value.

Similar to last time, vary `k` from 1 to 30. With each `k`, create a new KNeighborsClassifier where n_neighbors = k. Then, use cross_val_score() with cv=5, and average the scores of the 5 test partitions. Append this average onto a list.

Once you have this list, use pyplot.plot (or plt.plot if you have pyplot aliased), to plot out the trend.

Which `k` do you get the best results for?

(Hint: Use np.argmax)

(Hint 2: If you start with an empty list, remember to add 1 to the argmax output, because index 0 of the list corresponds to `k` = 1)
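
To illustrate the second hint without giving away the exercise, here is a tiny example with made-up scores. The numbers in `mean_scores` are hypothetical and are not real results from the digits data.

```python
import numpy as np

# Hypothetical mean scores for k = 1, 2, 3, 4 (made-up numbers)
mean_scores = [0.90, 0.95, 0.93, 0.91]

best_index = np.argmax(mean_scores)  # position in the list, counting from 0
best_k = best_index + 1              # k started at 1, so shift by one
print(best_index, best_k)  # 1 2
```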