# Project: Recognizing digits with k-NN

In [3]:
import numpy as np
import random

import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
from plotly.subplots import make_subplots

rows = 5
cols = 10
n = rows * cols  # number of sample digits to display, one per subplot

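# get images of digits from the MNIST dataset
# (the first 16 bytes are the file header: magic number, image count, rows, columns)
with open("mnist-images-ubyte", 'rb') as f:
    pics = np.frombuffer(f.read(), dtype=np.uint8, offset=16).astype(int).reshape(-1, 28**2)


# get corresponding labels (the first 8 bytes are the header: magic number, label count)
with open("mnist-labels-ubyte", 'rb') as f:
    labels = np.frombuffer(f.read(), dtype=np.uint8, offset=8).astype(int)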

s = random.sample(range(pics.shape[0]), n)
sample_pics = pics[s]
sample_labels = labels[s]

fig = make_subplots(rows=rows, cols=cols, subplot_titles=[str(l) for l in sample_labels])
fig.update_layout(width=800, height=500, paper_bgcolor='rgb(255,255,255)', plot_bgcolor='rgb(255,255,255)')

fig.update_xaxes(visible=False)
fig.update_yaxes(visible=False, scaleanchor="x")

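for i in range(n):
    # reverse the row order so the digit is drawn upright (the heatmap y-axis points up)
    hmap = go.Heatmap(z=sample_pics[i].reshape(28,28)[::-1], colorscale='gray_r', showscale=False)
    fig.add_trace(hmap, row=i//cols + 1, col=i%cols + 1)

# show each subplot title (the digit's label) in red
for annotation in fig['layout']['annotations']:
    annotation['font'] = dict(size=15, color='#ff0000')
fig.write_image("digits.png")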

The [MNIST database](http://yann.lecun.com/exdb/mnist/) is a collection of images of handwritten digits from 0 through 9, each with a numerical label; its training set contains 60,000 images. Here is a sample of images included in the database:

![MNIST digits](digits.png)

## Objectives

Download MNIST files for this project:

The first file contains images of digits and the second the corresponding labels. The format of both files is described on the [MNIST database website](http://yann.lecun.com/exdb/mnist/). 
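As a rough illustration of that format, the sketch below parses the raw bytes of an image file and a label file. It assumes the standard IDX layout: a header of big-endian 32-bit integers (magic number 2051, image count, rows, columns for images; magic number 2049 and item count for labels) followed by one unsigned byte per pixel or label. The function names `parse_idx_images` and `parse_idx_labels` are hypothetical, chosen only for this example.

```python
import struct

import numpy as np


def parse_idx_images(raw: bytes) -> np.ndarray:
    """Return an array of shape (count, rows * cols), one flattened image per row."""
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError(f"unexpected magic number for an image file: {magic}")
    pixels = np.frombuffer(raw, dtype=np.uint8, offset=16)
    return pixels.reshape(count, rows * cols)


def parse_idx_labels(raw: bytes) -> np.ndarray:
    """Return a 1-D array with one label per image."""
    # Header: two big-endian unsigned 32-bit integers.
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 2049:
        raise ValueError(f"unexpected magic number for a label file: {magic}")
    return np.frombuffer(raw, dtype=np.uint8, offset=8)[:count]
```

Reading the dimensions from the header, rather than hard-coding 28 by 28, makes a corrupted or mismatched file fail loudly instead of producing a silently misaligned array.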

Investigate how useful the k-Nearest Neighbors algorithm is for classification of images in the MNIST database, and describe your results. 
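One possible starting point is a brute-force k-NN classifier written directly in NumPy. The sketch below is only an illustration under some assumptions: images are rows of a 2-D array, distances are Euclidean, labels are integers from 0 through 9, and ties in the vote go to the smallest label. The names `knn_predict` and `accuracy` are hypothetical.

```python
import numpy as np


def knn_predict(train_x, train_y, test_x, k=3, n_classes=10):
    """Predict a label for each row of test_x by majority vote of its k nearest rows in train_x."""
    train_x = np.asarray(train_x, dtype=np.float64)
    test_x = np.asarray(test_x, dtype=np.float64)
    train_y = np.asarray(train_y, dtype=np.int64)

    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    # The result has shape (number of test images, number of training images).
    d2 = ((test_x ** 2).sum(axis=1)[:, None]
          - 2.0 * test_x @ train_x.T
          + (train_x ** 2).sum(axis=1)[None, :])

    # Indices of the k smallest distances in each row (in no particular order).
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
    votes = train_y[nearest]

    # Majority vote; argmax returns the smallest label when several are tied.
    return np.array([np.bincount(v, minlength=n_classes).argmax() for v in votes])


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(predicted) == np.asarray(actual)))
```

To study how accuracy depends on the training set size and on `k`, one could call `knn_predict` on slices of the image array of increasing length and compare the resulting `accuracy` values on a held-out set of images. The distance matrix grows with the product of the two set sizes, so large test sets may need to be processed in batches.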

Here are some questions which you may consider:

- How does the classification accuracy depend on the size of the training set and on the number of neighbors? 
- What are some different ways of measuring distances between images, and how do they affect classification? 
- What should be done if an image we want to classify has an equal number of neighbors with two different labels? How does this affect the accuracy of classification? 
- Analyze examples of images that have not been classified correctly. What went wrong with them? 
- Anything else you find interesting.
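The questions about distance measures and tied votes can be explored with a small experiment harness. The sketch below is one option among many, not a prescribed solution: it offers three common metrics (Euclidean, Manhattan, Chebyshev) and breaks a tied vote in favor of the tied label whose neighbors have the smallest total distance to the image. The function names and the tie-breaking rule are assumptions made for this example.

```python
import numpy as np


def distances(train_x, x, metric="euclidean"):
    """Distance from a single image x to every row of train_x."""
    diff = np.asarray(train_x, dtype=np.float64) - np.asarray(x, dtype=np.float64)
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=1))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=1)
    if metric == "chebyshev":
        return np.abs(diff).max(axis=1)
    raise ValueError(f"unknown metric: {metric}")


def classify_with_tie_break(train_x, train_y, x, k=4, metric="euclidean"):
    """Return (label, was_tied) for a single image x."""
    d = distances(train_x, x, metric)
    order = np.argsort(d, kind="stable")[:k]  # indices of the k nearest images
    labels = np.asarray(train_y, dtype=np.int64)[order]
    counts = np.bincount(labels)
    tied = np.flatnonzero(counts == counts.max())  # labels sharing the top vote count
    if len(tied) == 1:
        return int(tied[0]), False
    # Assumed rule: among tied labels, prefer the one whose neighbors are closest in total.
    totals = [d[order][labels == t].sum() for t in tied]
    return int(tied[int(np.argmin(totals))]), True
```

Returning the `was_tied` flag makes it possible to count how often ties occur for a given `k` and to compare accuracy on tied and untied images separately. Other tie-breaking rules, such as reducing `k` by one or picking a tied label at random, can be swapped in for comparison.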