# Image Classification on MNIST

In this notebook we demonstrate how to build a neural network to classify images. The network is built with the Keras-style API available in Analytics Zoo and trained with a BigDL optimizer.

Data: 
* The dataset is MNIST, a collection of images of handwritten digits from 0 to 9. The training set contains 60000 images and the test set contains 10000 images. We will try to predict the digit in each image.

## Initialization

* import necessary libraries

In [None]:
# Copyright 2018 Analytics Zoo Authors.
# http://www.apache.org/licenses/LICENSE-2.0

import warnings
warnings.filterwarnings("ignore", message = "numpy.dtype size changed")

import numpy as np  # used explicitly in get_data_rdd
import tensorflow as tf
from zoo import init_nncontext
from zoo.pipeline.api.net import TFOptimizer, TFDataset, TFPredictor
from bigdl.optim.optimizer import *
import sys
from tensorflow.keras.models import Model
from tensorflow.keras.layers import *

from bigdl.dataset import mnist
from bigdl.dataset.transformer import *

* Initialize the NN context. This returns a SparkContext configured for optimized BigDL performance.

In [None]:
sc = init_nncontext()

## Data Preparation

* Create function to import and format the images.

In [None]:
# get data and pre-process it into an RDD of [normalized image, label] pairs
def get_data_rdd(dataset):
    (images_data, labels_data) = mnist.read_data_sets("/tmp/mnist", dataset)
    # image_rdd = sc.parallelize(images_data[:data_num])
    image_rdd = sc.parallelize(images_data)
    # labels_rdd = sc.parallelize(labels_data[:data_num])
    labels_rdd = sc.parallelize(labels_data)
    rdd = image_rdd.zip(labels_rdd) \
        .map(lambda rec_tuple: [normalizer(rec_tuple[0], mnist.TRAIN_MEAN, mnist.TRAIN_STD),
                                np.array(rec_tuple[1])])
    return rdd
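
The `normalizer` call in `get_data_rdd` standardizes pixel values using a mean and a standard deviation. The following is a minimal pure-Python sketch of that arithmetic, assuming it computes `(x - mean) / std`. The helper name `standardize` and the toy pixel values are hypothetical, and the statistics are computed from the toy data rather than taken from `mnist.TRAIN_MEAN` and `mnist.TRAIN_STD`.

```python
import math


def standardize(pixels, mean, std):
    """Return (x - mean) / std for every pixel value."""
    return [(x - mean) / std for x in pixels]


# Hypothetical toy "image": four grayscale pixel values in [0, 255].
toy_pixels = [0.0, 0.0, 255.0, 255.0]

# Statistics computed from the toy data itself (population standard deviation).
toy_mean = sum(toy_pixels) / len(toy_pixels)
toy_std = math.sqrt(sum((x - toy_mean) ** 2 for x in toy_pixels) / len(toy_pixels))

normalized = standardize(toy_pixels, toy_mean, toy_std)
print(normalized)
```

After standardization the values are centered on zero with unit spread, which usually makes gradient-based training better behaved than feeding raw 0 to 255 intensities.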

* Download and read the MNIST data

In [None]:
training_rdd = get_data_rdd("train")
testing_rdd = get_data_rdd("test")

* Wrap the RDDs in TFDataset objects. The declared shape of the input features must match the shape of the images.

In [None]:
# Batch size can be adjusted to improve the model, but must be a multiple of the total number of cores in the Spark environment
batch_size = 512

# Images are 28 by 28 pixels with one color channel (grayscale); labels have no dimensions because they are scalar integers
dataset = TFDataset.from_rdd(training_rdd,
                             names=["features", "labels"],
                             shapes=[[28, 28, 1], []], 
                             types=[tf.float32, tf.int32],
                             batch_size=batch_size,
                             val_rdd=testing_rdd
                             )
# Prediction dataset built from the testing data
pred_dataset = TFDataset.from_rdd(testing_rdd,
                                  names=["features", "labels"],
                                  shapes=[[28, 28, 1], []], 
                                  types=[tf.float32, tf.int32],
                                  batch_per_thread=1)
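
The comment on `batch_size` notes that it must be a multiple of the total number of cores. One way to satisfy that is to round a desired batch size up to the nearest valid value, sketched below. The helper `round_up_to_multiple` and the core counts used here are hypothetical; the real core count depends on how the Spark environment is configured.

```python
def round_up_to_multiple(batch_size, total_cores):
    """Return the smallest multiple of total_cores that is >= batch_size."""
    if batch_size <= 0 or total_cores <= 0:
        raise ValueError("batch_size and total_cores must be positive")
    # Ceiling division done with integers only, then scaled back up.
    return -(-batch_size // total_cores) * total_cores


# Hypothetical cluster sizes, for illustration only.
print(round_up_to_multiple(512, 12))  # 516
print(round_up_to_multiple(512, 8))   # 512
```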

## Build Model

* Create the model structure, with inputs being the shape of the images and outputs being the number of image classes.

In [None]:
data = Input(shape=[28, 28, 1]) # Must match the dimensions of the images specified in the TFDataset
# x = Convolution2D(6, 5, 5, activation='tanh', name='conv1_5x5')(data)
# x = MaxPooling2D()(x)
x = Flatten()(data)
x = Dense(64, activation='relu')(x) # The number of dense units can be adjusted to attempt to improve the model.
x = Dense(64, activation='relu')(x)
# The number of dense units in the predictions (output) layer should match the number of classes
predictions = Dense(10, activation='softmax')(x)
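
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes each dense layer has one weight per input-unit pair plus one bias per unit; the helper `dense_params` is hypothetical and not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


flattened = 28 * 28 * 1      # size of the Flatten() output for a 28x28x1 image
layer_sizes = [64, 64, 10]   # the three Dense layers defined in the model

total = 0
n_in = flattened
for units in layer_sizes:
    total += dense_params(n_in, units)
    n_in = units  # the output of this layer feeds the next one

print(total)  # 55050
```

Most of the parameters (50240 of 55050) sit in the first dense layer, so changing its width has the largest effect on model size.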

* Initialize the model and the optimizer.

In [None]:
model = Model(inputs=data, outputs=predictions)

model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

optimizer = TFOptimizer.from_keras(keras_model=model, dataset=dataset)
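
The model is compiled with the `sparse_categorical_crossentropy` loss, which takes integer labels rather than one-hot vectors. A minimal pure-Python sketch of what that loss measures for a single example follows. The function names and toy scores are hypothetical; this illustrates the formula and is not the Keras implementation.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cce(probs, label):
    """Negative log of the probability assigned to the true class index."""
    return -math.log(probs[label])


probs = softmax([2.0, 1.0, 0.1])      # toy scores for a 3-class problem
loss_if_class_0 = sparse_cce(probs, 0)  # true class has the highest score
loss_if_class_2 = sparse_cce(probs, 2)  # true class has the lowest score
print(loss_if_class_0, loss_if_class_2)
```

The loss is small when the model puts high probability on the true class and grows as that probability shrinks. For a model that guesses uniformly over the 10 digit classes, the loss is `log(10)`, roughly 2.3.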

* Set location for model training summary output.
* Train the model.

In [None]:
optimizer.set_train_summary(TrainSummary("/tmp/mnist_log", "mnist"))
optimizer.set_val_summary(ValidationSummary("/tmp/mnist_log", "mnist"))
# kick off training
max_epoch = 5
optimizer.optimize(end_trigger=MaxEpoch(max_epoch))
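
As a rough guide to how long training runs, the number of iterations can be estimated from the dataset size, batch size, and epoch count used above. This sketch assumes one iteration consumes one batch and that an epoch ends once every training record has been seen at least once; the exact accounting inside the optimizer may differ.

```python
import math

train_size = 60000   # MNIST training images
batch_size = 512     # as set when creating the TFDataset
max_epoch = 5        # as passed to MaxEpoch

# The last batch of an epoch may be only partly "new", hence the ceiling.
iterations_per_epoch = math.ceil(train_size / batch_size)
total_iterations = iterations_per_epoch * max_epoch

print(iterations_per_epoch, total_iterations)  # 118 590
```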

* Create a TFPredictor object based on the trained model

In [None]:
model_predictor = TFPredictor.from_keras(model, pred_dataset)

* Predict class probabilities for the testing data (pred_dataset)

In [None]:
testing_rdd_preds = model_predictor.predict()
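
Because the output layer uses softmax, each prediction is expected to be a vector of 10 class probabilities rather than a single digit. A minimal sketch of turning such vectors into class labels and measuring accuracy follows. Plain lists stand in for the prediction results here, and the helpers `argmax` and `accuracy` and the toy values are hypothetical.

```python
def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(prob_vectors, labels):
    """Fraction of examples whose most probable class equals the label."""
    predicted = [argmax(p) for p in prob_vectors]
    correct = sum(1 for p, y in zip(predicted, labels) if p == y)
    return correct / len(labels)


# Toy 3-class example: the first two predictions are right, the third is wrong.
toy_probs = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]]
toy_labels = [1, 0, 1]
print(accuracy(toy_probs, toy_labels))
```

If `testing_rdd_preds` is an RDD, the same per-record argmax could be applied with a `map` over it before collecting a sample of results to inspect.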