# MNIST Dataset
The MNIST (Modified National Institute of Standards and Technology) database is a collection of handwritten digits consisting of 60,000 training images and 10,000 test images. It is a subset of the larger NIST database and is widely used for training and testing pattern recognition and machine learning methods.

The images are available to download from the MNIST [website](http://yann.lecun.com/exdb/mnist/). The files are gzip-compressed (.gz) and need to be decompressed before our program can read them. Thankfully, Python's standard library already has a [gzip module](https://docs.python.org/3/library/gzip.html), so we can decompress the files with just a few lines of code.

### Reading the files

In [1]:
import gzip                                                          

with gzip.open('data/t10k-images-idx3-ubyte.gz', 'rb') as f:        # test set images 
    file_content = f.read()                                          


Now that we have read the data from the gzip file and stored it in a variable, let's have Python inspect the header bytes and verify their values against the documentation.

In [None]:
type(file_content)

### What the bytes mean
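We can see that the file contents are stored as a bytes object. According to the MNIST documentation, an image file begins with a 16-byte header made up of four 32-bit integers. The first four bytes hold the "magic number", which tells us whether the file contains images or labels (2051 for an image file, 2049 for a label file). The next four bytes contain the number of images in the file. The third and fourth groups of four bytes contain the number of rows and the number of columns of pixels in each image.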

To verify these values we have to convert each group of bytes to an integer. The class method [int.from_bytes](https://docs.python.org/3/library/stdtypes.html) returns the integer represented by an array of bytes. We will be passing two parameters to this method, bytes and byteorder, where byteorder must be "little" or "big". The byteorder parameter refers to [Endianness](https://en.wikipedia.org/wiki/Endianness), which is the order in which the bytes of a multi-byte value are stored. Big-endian means the most significant byte comes first, while little-endian means the least significant byte comes first. The MNIST files store their integers in big-endian order, so we use byteorder='big'.
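
To make the difference concrete, here is a small sketch that interprets the same four bytes with both byte orders. The variable names are only illustrative, and the bytes are written out by hand rather than read from the file.

```python
# The same four bytes, interpreted with each byte order.
raw = bytes([0x00, 0x00, 0x08, 0x03])

as_big = int.from_bytes(raw, byteorder='big')        # most significant byte first
as_little = int.from_bytes(raw, byteorder='little')  # least significant byte first

print(as_big)     # 2051
print(as_little)  # 50855936
```

Only the big-endian reading gives 2051, the magic number of an image file.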

In [None]:
int.from_bytes(file_content[0:4], byteorder='big')   # Bytes 0-3: the magic number

In [None]:
int.from_bytes(file_content[4:8], byteorder='big')   # Bytes 4-7: how many images are in the file

In [None]:
int.from_bytes(file_content[8:12], byteorder='big')  # Bytes 8-11: the number of rows of pixels per image

In [None]:
int.from_bytes(file_content[12:16], byteorder='big') # Bytes 12-15: the number of columns of pixels per image
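
As an alternative to four separate calls, the whole header can be unpacked in one step with the standard library's struct module. This is only a sketch: `parse_image_header` is a hypothetical helper name, and the header below is built by hand so the example runs without the downloaded file.

```python
import struct

IMAGE_MAGIC = 2051  # magic number of an MNIST image file


def parse_image_header(data):
    """Return (count, rows, cols) from the 16-byte header of an image file."""
    # '>IIII' means four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack('>IIII', data[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError(f'unexpected magic number: {magic}')
    return count, rows, cols


# Synthetic header for illustration only (not read from the real file).
fake_header = struct.pack('>IIII', 2051, 10000, 28, 28)
print(parse_image_header(fake_header))  # (10000, 28, 28)
```

With the real data, the call would be `parse_image_header(file_content)`.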

Now that we've seen how to inspect a gzip file and validated the header, we can use this information to reconstruct images from the raw bytes.

### Reading a single image

In [None]:
l = file_content[16:800]  # first image: skip the 16-byte header, then take 28 * 28 = 784 bytes

In [None]:
type(l)

In [None]:
import numpy as np

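# View the first image's 784 pixel bytes as a 28x28 uint8 array, then invert it
# with ~ (equivalent to 255 - pixel) so the digit is drawn dark on a light background.
image = ~np.frombuffer(file_content, dtype=np.uint8, count=784, offset=16).reshape(28, 28)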

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.imshow(image, cmap='gray')
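
The same slicing generalises to any image in the file: image number `index` starts at byte 16 + index * 784. The sketch below wraps that arithmetic in a hypothetical helper, `read_image`, and runs it on synthetic bytes so it works without the downloaded file.

```python
import numpy as np

HEADER_SIZE = 16  # an image file starts with four 32-bit integers


def read_image(data, index, rows=28, cols=28):
    """Return image number `index` (0-based) as a rows x cols uint8 array."""
    size = rows * cols
    offset = HEADER_SIZE + index * size
    pixels = np.frombuffer(data, dtype=np.uint8, count=size, offset=offset)
    return pixels.reshape(rows, cols)


# Synthetic data for illustration: a blank header, one all-0 image, one all-255 image.
fake_file = bytes(HEADER_SIZE) + bytes(784) + bytes([255]) * 784

first = read_image(fake_file, 0)
second = read_image(fake_file, 1)
print(first.shape, first.max(), second.min())
```

With the real data, `read_image(file_content, 0)` would return the first image without the colour inversion applied above.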

In [None]:
with gzip.open('data/t10k-labels-idx1-ubyte.gz', 'rb') as f:
    labels = f.read()

In [None]:
int.from_bytes(labels[8:9], byteorder="big")  # First label: a label file has an 8-byte header (magic number, label count)
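
Because each label is a single byte, all of them can be read at once by skipping the 8-byte header. This is a minimal sketch on synthetic bytes. The name `fake_labels` and its contents are made up for illustration and are not taken from the real file.

```python
import struct

# Synthetic label file: magic number 2049, a count of 3, then one byte per label.
fake_labels = struct.pack('>II', 2049, 3) + bytes([7, 2, 1])

# The header is two big-endian unsigned 32-bit integers.
magic, n_labels = struct.unpack('>II', fake_labels[:8])

# Indexing or iterating a bytes object yields integers, so no conversion is needed.
label_values = list(fake_labels[8:8 + n_labels])

print(magic, n_labels, label_values)  # 2049 3 [7, 2, 1]
```

With the real data, `fake_labels` would be replaced by the `labels` variable read above.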