## DSVM Tutorial

This tutorial was created to showcase some of the features of the Ubuntu DSVM. It shows many steps of the data science process using the CIFAR-10 dataset. CIFAR-10 is a popular dataset for image classification, collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. It contains 60,000 images of 10 different types of objects (truck, automobile, cat, etc.).

This tutorial is divided into three parts:

1. Load data. This notebook downloads the CIFAR-10 dataset and converts it to the format expected by CNTK. The conversion is parallelized with Spark, which is included on the DSVM for single-node tasks.
2. Train a model. This notebook trains a basic deep learning model to classify images as one of the CIFAR-10 categories (truck, cat, etc.).
3. Deploy a model. This notebook shows you how to create a REST API with your model using Microsoft ML Server.

This tutorial was originally created for Microsoft's internal machine learning and data science conference (MLADS), but you can also run it on an Ubuntu DSVM of your own outside of the conference.

### Part 1: Load data

This part shows how to prepare image datasets for use with deep learning algorithms in CNTK. The CIFAR-10 dataset is not included in the CNTK distribution, but it can be easily downloaded and converted to a CNTK-supported format.

In [1]:
from IPython.display import Image as ShowImage
ShowImage(url="https://cntk.ai/jup/201/cifar-10.png", width=500, height=500)

In [2]:
import os
import tarfile
import shutil
try: 
    from urllib.request import urlretrieve 
except ImportError: 
    from urllib import urlretrieve

def downloadData(src):
    print ('Downloading ' + src)
    fname, h = urlretrieve(src, './delete.me')
    try:
        with tarfile.open(fname) as tar:
            tar.extractall()
        print ('Done.')
    finally:
        os.remove(fname)
    
# Paths for saving the text files
data_dir = './data/CIFAR-10/'

if not os.path.exists(data_dir):
    os.makedirs(data_dir)

try:
    os.chdir(data_dir)   
    
    # use the dataset that was already downloaded by the setup script
    # downloadData('http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz')
    
    shutil.copyfile('/data/cifar-10-python.tar.gz', 'cifar-10-python.tar.gz')
    with tarfile.open('cifar-10-python.tar.gz') as tar:
        tar.extractall()
            
finally:
    os.chdir("../..")

## Process data with Spark

The downloaded CIFAR-10 dataset contains six files, each with 10,000 images. Here we use five files for training and one for testing. 
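Each row of `data['data']` in a batch file holds one image as 3,072 bytes in channel-first (CHW) order: 1,024 red values, then 1,024 green, then 1,024 blue. The following is a minimal sketch of that layout, using a synthetic row rather than a real CIFAR-10 image; the variable names are illustrative only.

```python
import numpy as np

img_size = 32  # CIFAR-10 images are 32x32 RGB

# Synthetic stand-in for one row of data['data'] (3,072 uint8 values).
row = (np.arange(3 * img_size * img_size) % 256).astype(np.uint8)

# Channel-first (CHW): row[0:1024] is red, row[1024:2048] green, row[2048:3072] blue.
chw = row.reshape((3, img_size, img_size))

# Height-width-channel (HWC) is the layout most image libraries expect.
hwc = chw.transpose(1, 2, 0)

print(chw.shape, hwc.shape)
```

If Pillow is available, `Image.fromarray(hwc)` should build the same picture as a per-pixel loop over the CHW array, usually much faster.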

The images can be processed in serial with a *for* loop, but that is slow even for CIFAR-10. Here we show how the standalone Spark instance can be used to process each file in parallel to take advantage of more cores on the VM.

CNTK requires us to also provide a *mean* image, where each pixel value is the mean over all images in the training set, so images can be normalized during training. We compute the mean image by computing the *summed* image for each parallel batch, then adding the per-batch sums together and dividing by the total number of training images (50,000).
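To see why per-batch sums are enough, here is a small sketch that compares the mean built from partial sums with the mean computed directly. It uses random stand-in data (5 batches of 100 tiny 4x4 images) instead of CIFAR-10, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the five training batches, each in (N, C, H, W) layout.
batches = [rng.integers(0, 256, size=(100, 3, 4, 4)) for _ in range(5)]

# Each parallel worker would return only the summed image for its batch.
partial_sums = [b.sum(axis=0, dtype=np.float64) for b in batches]

# The driver adds the partial sums and divides by the total image count.
total_images = sum(b.shape[0] for b in batches)
mean_from_sums = np.sum(partial_sums, axis=0) / total_images

# Reference: the mean computed over all images at once.
direct_mean = np.concatenate(batches).mean(axis=0)

print(np.allclose(mean_from_sums, direct_mean))
```

Returning one summed image per batch keeps the data sent back to the driver small, since only a single image-sized array per worker has to be collected.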

In [3]:
import numpy as np
import pickle as cp
import os
from PIL import Image

# CIFAR Image data
imgSize = 32
numFeature = imgSize * imgSize * 3

foldername = 'train'

if not os.path.exists(foldername):
    os.makedirs(foldername)

In [4]:
# processes a batch of images in a single file
def processBatch(ifile):
    mapFileArray = []
    # running sum of all images in this batch, in CHW format
    dataMean = np.zeros((3, imgSize, imgSize))

    batchPath = os.path.join('./data/CIFAR-10/cifar-10-batches-py', 'data_batch_' + str(ifile))
    with open(batchPath, 'rb') as f:
        data = cp.load(f, encoding='latin1')

    for i in range(10000):
        # number images consecutively across batches: 00000.png to 49999.png
        filename = os.path.join(os.path.abspath(foldername), ('%05d.png' % (i + (ifile - 1) * 10000)))
        image_data = data['data'][i, :]
        label = data['labels'][i]
        # pad by 4 pixels on each side and add the pixels to the running sum
        saveImage(filename, image_data, label, 4, mean=dataMean)

        # add to mapFileArray
        mapFileArray.append("%s\t%d\n" % (filename, label))

    return (mapFileArray, dataMean)

# saves a single image to a file
def saveImage(filename, data, label, pad, **key_parms):    
    # data in CIFAR-10 dataset is in CHW format.
    pixData = data.reshape((3, imgSize, imgSize))
    if ('mean' in key_parms):
        key_parms['mean'] += pixData

    if pad > 0:
        pixData = np.pad(pixData, ((0, 0), (pad, pad), (pad, pad)), mode='constant', constant_values=128) 

    img = Image.new('RGB', (imgSize + 2 * pad, imgSize + 2 * pad))
    pixels = img.load()
    for x in range(img.size[0]):
        for y in range(img.size[1]):
            pixels[x, y] = (pixData[0][y][x], pixData[1][y][x], pixData[2][y][x])
    img.save(filename)
    
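# these imports are needed if the notebook kernel has not already provided them
from pyspark import SparkContext
from pyspark.sql import SparkSession

SparkContext._ensure_initialized()
spark = SparkSession.builder.getOrCreate()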
    
## do the parallel processing 

# the training batches are numbered 1 through 5
batch_file_names = list(range(1, 6))
    
print('Starting Spark job')
image_rdd = spark.sparkContext.parallelize(batch_file_names).map(processBatch).collect()
print('Spark job finished')

## train_map.txt needs one line per image in the training set, so write that here

with open('train_map.txt', 'w') as mapFile:
    for row in image_rdd:
        mapFile.writelines(row[0])

Starting Spark job
Spark job finished


In [5]:
# Some code to save the mean image

import xml.etree.ElementTree as et  # cElementTree was removed in Python 3.9
import xml.dom.minidom

def saveMean(fname, data):
    root = et.Element('opencv_storage')
    et.SubElement(root, 'Channel').text = '3'
    et.SubElement(root, 'Row').text = str(imgSize)
    et.SubElement(root, 'Col').text = str(imgSize)
    meanImg = et.SubElement(root, 'MeanImg', type_id='opencv-matrix')
    et.SubElement(meanImg, 'rows').text = '1'
    et.SubElement(meanImg, 'cols').text = str(imgSize * imgSize * 3)
    et.SubElement(meanImg, 'dt').text = 'f'
    et.SubElement(meanImg, 'data').text = ' '.join(['%e' % n for n in np.reshape(data, (imgSize * imgSize * 3))])

    tree = et.ElementTree(root)
    tree.write(fname)
    x = xml.dom.minidom.parse(fname)
    with open(fname, 'w') as f:
        f.write(x.toprettyxml(indent = '  '))
        
dataMean = np.zeros((3, imgSize, imgSize)) # mean is in CHW format.
for row in image_rdd:
    dataMean += row[1]
dataMean = dataMean / (50 * 1000)
saveMean('CIFAR-10_mean.xml', dataMean)

## Test Set

The download contains a single test file holding all 10,000 test images. We could parallelize over the images within this file, but we skip that here for simplicity.
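If you did want to parallelize within the single test file, one option is to split the image indices into contiguous chunks and hand each chunk to a worker. The sketch below only shows the splitting step; the chunk count of 8 is an arbitrary assumption, and the per-chunk processing function is left out.

```python
import numpy as np

num_images = 10000
num_chunks = 8  # hypothetical; pick something close to the number of cores

# Split the index range 0..9999 into nearly equal contiguous chunks.
chunks = np.array_split(np.arange(num_images), num_chunks)

# Represent each chunk as a (start, stop) pair, suitable for range(start, stop).
index_ranges = [(int(c[0]), int(c[-1]) + 1) for c in chunks]

print(index_ranges[0], index_ranges[-1])
```

Each `(start, stop)` pair could then be passed through `parallelize(...).map(...)` in the same way the training batch numbers are, with a function that converts only the images in that range.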

In [6]:
print ('Converting test data to png images...')
    
foldername = 'test'
    
if not os.path.exists(foldername):
    os.makedirs(foldername)
    
mapItems = []

with open(os.path.join('./data/CIFAR-10/cifar-10-batches-py', 'test_batch'), 'rb') as f:
    data = cp.load(f, encoding='latin1')
    for i in range(10000):
        fname = os.path.join(os.path.abspath(foldername), ('%05d.png' % i))
        # test images are saved without padding
        saveImage(fname, data['data'][i, :], data['labels'][i], 0)
        mapItems.append("%s\t%d\n" % (fname, data['labels'][i]))

# write the map file once, after all images have been converted
with open('test_map.txt', 'w') as mapFile:
    mapFile.writelines(mapItems)
                
print ('Done.')

Converting test data to png images...
Done.
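As an optional sanity check, the map files can be read back to confirm that they contain one tab-separated `path<TAB>label` line per image. The sketch below defines a hypothetical helper, `read_map_file`, and runs it on a small temporary file with made-up paths rather than on the real `train_map.txt` or `test_map.txt`.

```python
import os
import tempfile
from collections import Counter

def read_map_file(path):
    """Parse a map file into a list of (image_path, label) tuples."""
    entries = []
    with open(path) as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            # Split on the last tab so that paths containing tabs still parse.
            image_path, label = line.rsplit('\t', 1)
            entries.append((image_path, int(label)))
    return entries

# Hypothetical three-line map file, written in the same format as above.
with tempfile.TemporaryDirectory() as tmp:
    map_path = os.path.join(tmp, 'example_map.txt')
    with open(map_path, 'w') as f:
        f.writelines([
            "/tmp/test/00000.png\t3\n",
            "/tmp/test/00001.png\t8\n",
            "/tmp/test/00002.png\t3\n",
        ])
    entries = read_map_file(map_path)

label_counts = Counter(label for _, label in entries)
print(len(entries), dict(label_counts))
```

On the real files, you would expect 50,000 entries in `train_map.txt` and 10,000 in `test_map.txt`, with labels in the range 0 to 9.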
