# Data Loader for MNIST Database

This tutorial shows you how to download the MNIST digit database and process it to make it ready for machine learning algorithms.

## Topics to be covered

1. Downloading the dataset.
2. Processing the raw data into an easier data structure (numpy ndarray).
3. Saving the images.
4. Saving the dataset as a pickle file.


## Downloading the Dataset

The dataset can be downloaded in a browser from the following links:

* [Training Images](http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz)
* [Training Labels](http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz)
* [Testing Images](http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz)
* [Testing Labels](http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz)

Alternatively, the code segment below can be used to download the files to a directory of your choice.

In [None]:
#IMPORTS
import os,urllib.request


# PROVIDE YOUR DOWNLOAD DIRECTORY HERE
datapath = '../../Data/MNISTData/'  

# CREATING DOWNLOAD DIRECTORY
if not os.path.exists(datapath):
    os.makedirs(datapath)

# URLS TO DOWNLOAD FROM
urls = ['http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
       'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
       'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
       'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz']

for url in urls:
    filename = url.split('/')[-1]   # GET FILENAME
    
    if os.path.exists(datapath+filename):
        print(filename, ' already exists')  # CHECK IF FILE EXISTS
    else:
        print('Downloading ',filename)
        urllib.request.urlretrieve (url, datapath+filename) # DOWNLOAD FILE
     
print('All files are available')

## Extracting the downloaded files

The downloaded files are gzip archives and need to be extracted. They can be extracted manually with a GUI tool, or with the code segment below.

In [None]:
import os,gzip,shutil

# PROVIDE YOUR DOWNLOAD DIRECTORY HERE
datapath = '../../Data/MNISTData/'  

# LISTING ALL ARCHIVES IN THE DIRECTORY
files = os.listdir(datapath)
for file in files:
    if file.endswith('gz'):
        print('Extracting ',file)
        with gzip.open(datapath+file, 'rb') as f_in:
            with open(datapath+file.split('.')[0], 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
print('Extraction Complete')

# OPTIONAL: REMOVE THE ARCHIVES (ONLY THE .gz FILES, NOT OTHER FILES IN THE DIRECTORY)
for file in files:
    if file.endswith('gz'):
        print('Removing ',file)
        os.remove(datapath+file)
print ('All archives removed')


## Process the Files 
All the images and labels of the MNIST dataset are encoded in these four files. We need to parse the files to extract the images and labels before we can work with them.

### File descriptions
Four files are provided:

* Test Images : t10k-images-idx3-ubyte
* Test Labels :  t10k-labels-idx1-ubyte
* Train Images : train-images-idx3-ubyte
* Train Labels :  train-labels-idx1-ubyte

The IDX file format is a simple format for vectors and multidimensional matrices of various numerical types.

#### The basic format for labels
  
|Offset | Type               | Value           |   Description                   |
|-------|--------------------|-----------------|---------------------------------|
|0000   |4 byte integer      |0x00000801(2049) |magic number (MSB first)         |
|0004   |4 byte integer      |10000 or 60000   |number of items (test or train)  |
|0008   |unsigned byte       |??               |label                            |
|0009   |unsigned byte       |??               |label                            |
|...    |...                 |...              |...                              |
|xxxx   |unsigned byte       |??               |label                            |


#### The basic format for images

|Offset | Type               | Value           |   Description                   |
|-------|--------------------|-----------------|---------------------------------|
|0000   |4 byte integer      |0x00000803(2051) |magic number (MSB first)         |
|0004   |4 byte integer      |10000 or 60000   |number of images (test or train) |
|0008   |4 byte integer      |28               |number of rows                   |
|0012   |4 byte integer      |28               |number of columns                |
|0016   |unsigned byte       |??               |pixel intensity (0-255)          |
|0017   |unsigned byte       |??               |pixel intensity (0-255)          |
|...    |...                 |...              |...                              |
|xxxx   |unsigned byte       |??               |pixel intensity (0-255)          |
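
As a hedged illustration of the two layouts above, the sketch below builds a tiny synthetic IDX image buffer in memory and reads its header with the standard `struct` module. The helper name `read_idx_header` and the 2-image, 3x4 buffer are made up for this example and are not part of the MNIST files. It relies on the low byte of the magic number giving the number of dimensions (1 for labels, 3 for images).

```python
import struct

def read_idx_header(buf):
    """Return (magic, dims) for an IDX buffer; hypothetical helper for illustration."""
    # Bytes 0-3: the magic number, stored MSB first (big-endian).
    (magic,) = struct.unpack('>I', buf[:4])
    # The low byte of the magic number is the number of dimensions.
    ndim = magic & 0xFF
    # Each dimension size is a 4-byte big-endian integer after the magic number.
    dims = struct.unpack('>' + 'I' * ndim, buf[4:4 + 4 * ndim])
    return magic, dims

# Synthetic image buffer: magic 2051, 2 images of 3 rows x 4 columns, all-zero pixels.
fake_images = struct.pack('>IIII', 2051, 2, 3, 4) + bytes(2 * 3 * 4)
# Synthetic label buffer: magic 2049, 5 labels, all zero.
fake_labels = struct.pack('>II', 2049, 5) + bytes(5)

print(read_idx_header(fake_images))
print(read_idx_header(fake_labels))
```

The pixel or label data starts right after the header, at offset `4 + 4 * ndim`, which gives 16 for images and 8 for labels, matching the tables.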


### Converting the ubyte files to numpy arrays for easy processing
The following code converts the ubyte files into four numpy n-dimensional arrays and stores them in a dictionary called `data_dict`, which has four key-value pairs.

| Key           |  Type        |Shape         |
|---------------|--------------|--------------|
|*train_images* |numpy ndarray |[60000,28,28] |
|*train_labels* |numpy ndarray |[60000]       |
|*test_images*  |numpy ndarray |[10000,28,28] |
|*test_labels*  |numpy ndarray |[10000]       |


In [None]:
import os,codecs,numpy

# PROVIDE YOUR DIRECTORY WITH THE EXTRACTED FILES HERE
datapath = '../../Data/MNISTData/'

files = os.listdir(datapath)

def get_int(b):   # CONVERTS 4 BYTES (MSB FIRST) TO AN INT
    return int(codecs.encode(b, 'hex'), 16)

data_dict = {}
for file in files:
    if file.endswith('ubyte'):  # FOR ALL 'ubyte' FILES
        print('Reading ',file)
        with open (datapath+file,'rb') as f:
            data = f.read()
            magic = get_int(data[:4])   # 0-3: THE MAGIC NUMBER TELLS WHETHER THE FILE HOLDS IMAGES OR LABELS
            length = get_int(data[4:8])  # 4-7: LENGTH OF THE ARRAY  (DIMENSION 0)
            if (magic == 2051):
                category = 'images'
                num_rows = get_int(data[8:12])  # NUMBER OF ROWS  (DIMENSION 1)
                num_cols = get_int(data[12:16])  # NUMBER OF COLUMNS  (DIMENSION 2)
                parsed = numpy.frombuffer(data,dtype = numpy.uint8, offset = 16)  # READ THE PIXEL VALUES AS INTEGERS
                parsed = parsed.reshape(length,num_rows,num_cols)  # RESHAPE THE ARRAY AS [NO_OF_SAMPLES x HEIGHT x WIDTH]
            elif(magic == 2049):
                category = 'labels'
                parsed = numpy.frombuffer(data, dtype=numpy.uint8, offset=8) # READ THE LABEL VALUES AS INTEGERS
                parsed = parsed.reshape(length)  # RESHAPE THE ARRAY AS [NO_OF_SAMPLES]
            else:
                continue  # UNKNOWN MAGIC NUMBER: NOT AN MNIST IMAGE OR LABEL FILE, SKIP IT
            if (length==10000):
                subset = 'test'
            elif (length==60000):
                subset = 'train'
            else:
                continue  # UNEXPECTED NUMBER OF ITEMS, SKIP IT
            data_dict[subset+'_'+category] = parsed  # SAVE THE NUMPY ARRAY TO A CORRESPONDING KEY
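
The `get_int` helper converts four bytes to an integer by hex-encoding them and parsing the result in base 16. As a small sanity check, the sketch below compares that approach with the built-in `int.from_bytes` using big-endian byte order on a few hand-made byte strings. The sample values are chosen for illustration: they correspond to the two magic numbers and the two item counts.

```python
import codecs

def get_int(b):
    # Same approach as the tutorial's helper: hex-encode the bytes, then parse base 16.
    return int(codecs.encode(b, 'hex'), 16)

# Hand-made 4-byte headers: magic numbers 2049 and 2051, item counts 60000 and 10000.
samples = [b'\x00\x00\x08\x01', b'\x00\x00\x08\x03',
           b'\x00\x00\xea\x60', b'\x00\x00\x27\x10']

results = []
for raw in samples:
    via_hex = get_int(raw)
    via_builtin = int.from_bytes(raw, byteorder='big')  # MSB first
    results.append((via_hex, via_builtin))
    print(raw.hex(), '->', via_hex, via_builtin)
```

Both conversions agree on these inputs, so either could be used to read the header fields.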

### Saving images from the dataset

This code segment saves each image in the numpy arrays as a PNG file in a class-specific directory.

In [None]:
import os
from skimage.io import imsave
datapath = '../../Data/MNISTData/' # PATH WHERE IMAGES WILL BE SAVED

sets = ['train','test']

for subset in sets:   # FOR TRAIN AND TEST SET (NAMED subset TO AVOID SHADOWING THE BUILT-IN set)
    images = data_dict[subset+'_images']   # IMAGES
    labels = data_dict[subset+'_labels']   # LABELS
    no_of_samples = images.shape[0]     # NUMBER OF SAMPLES
    for indx in range (no_of_samples):  # FOR EVERY SAMPLE
        print(subset, indx)
        image = images[indx]            # GET IMAGE
        label = labels[indx]            # GET LABEL
        outdir = datapath+subset+'/'+str(label)+'/'   # TRAIN/TEST DIRECTORY AND CLASS SPECIFIC SUBDIRECTORY
        if not os.path.exists(outdir):    # IF THE DIRECTORY DOES NOT EXIST THEN
            os.makedirs (outdir)          # CREATE IT
        filenumber = len(os.listdir(outdir))  # NUMBER OF FILES IN THE DIRECTORY FOR NAMING THE FILE
        imsave(outdir+'%05d.png'%(filenumber),image)  # SAVE THE IMAGE WITH PROPER NAME
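
The loop above names each file after the number of files already present in its class directory. As a rough sketch of the resulting layout, the hypothetical helper `plan_filenames` below computes the same kind of names with a per-label counter instead of listing the directory. It writes nothing to disk, and it matches the loop's naming only under the assumption that the class directories start out empty. The root path `out/` and the three labels are made up for the example.

```python
from collections import defaultdict

def plan_filenames(labels, root, subset):
    """Return one output path per label; hypothetical helper, not from the tutorial."""
    counts = defaultdict(int)          # images seen so far, per class label
    paths = []
    for label in labels:
        # Same '%05d.png' pattern as the saving loop, numbered within each class.
        paths.append('%s%s/%s/%05d.png' % (root, subset, label, counts[label]))
        counts[label] += 1
    return paths

demo_paths = plan_filenames([5, 0, 5], 'out/', 'train')
for p in demo_paths:
    print(p)
```

Keeping a counter avoids one directory listing per image, which may matter with 70000 files, but that is a design choice rather than a requirement.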
        

### Saving the dictionary using pickle

Python data structures can be saved directly, as they are, using the pickle package.
The code segment below shows how to save `data_dict` using `pickle.dump` and load it back later using `pickle.load`.

In [88]:
import pickle

datapath = '../../Data/MNISTData/'

# DUMPING THE DICTIONARY INTO A PICKLE 
with open(datapath+'MNISTData.pkl', 'wb') as fp :
    pickle.dump(data_dict, fp)

# LOADING THE DICTIONARY FROM A PICKLE
with open(datapath+'MNISTData.pkl', 'rb') as fp :
    new_dict = pickle.load(fp)
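
To confirm that a dump and load round trip preserves the data, a check along the following lines may help. This is a minimal sketch that uses a small stand-in dictionary of plain lists and a temporary directory instead of the real `data_dict` and `datapath`; the names `stand_in` and `demo.pkl` are assumptions made for the example.

```python
import os
import pickle
import tempfile

# Stand-in for data_dict: plain nested lists instead of numpy arrays (assumption for this sketch).
stand_in = {'test_images': [[[0, 255], [128, 64]]], 'test_labels': [7]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'demo.pkl')
    # Dump the dictionary to a pickle file.
    with open(path, 'wb') as fp:
        pickle.dump(stand_in, fp)
    # Load it back into a new object.
    with open(path, 'rb') as fp:
        restored = pickle.load(fp)

# The restored object is a separate copy with equal contents.
print(restored == stand_in, restored is stand_in)
```

With the real `data_dict`, whose values are numpy arrays, compare each key with `numpy.array_equal` rather than using `==` on the dictionaries.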
