# Evaluation Process
We can evaluate the model against the test dataset either after training has finished or in parallel with the training process. In this workshop we take the latter approach. In the former, evaluation runs over the checkpoints that have already been saved, whereas the latter evaluates checkpoints as the training process generates them. Let's go through the evaluation process.

Again, we import `tensorflow`, `mnist`, `lenet`, and `load_batch`.

In [None]:
import tensorflow as tf

from datasets import mnist
from model import lenet, load_batch

As in the training code, we define short aliases for some module paths and specify the flags.

In [None]:
slim = tf.contrib.slim
metrics = tf.contrib.metrics

flags = tf.app.flags
flags.DEFINE_string('data_dir', './data/',
                    'Directory with the MNIST data.')
flags.DEFINE_integer('batch_size', 5, 'Batch size.')
flags.DEFINE_integer('eval_interval_secs', 60,
                     'Number of seconds between evaluations.')
flags.DEFINE_integer('num_evals', 100, 'Number of batches to evaluate.')
flags.DEFINE_string('log_dir', './log/eval/',
                    'Directory where to log evaluation data.')
flags.DEFINE_string('checkpoint_dir', './log/train/',
                    'Directory with the model checkpoint data.')
FLAGS = flags.FLAGS

Load the dataset using `mnist.get_split`. Note that we load the test dataset here, because the model must be evaluated on data that is separate from the training dataset. Otherwise the accuracy would be unrealistically high, i.e. at or very close to 1. To test the quality of the recognition in real-world conditions, we must use digits that the system has NOT seen during training; a model could learn all the training digits by heart and still fail to recognize an "8" that I just wrote. The MNIST dataset contains 10,000 test digits.

In [None]:
dataset = mnist.get_split('...', FLAGS.data_dir)

images, labels = load_batch(
    dataset,
    FLAGS.batch_size,
    is_training=False)
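
As a rough check on how much of the test set one evaluation pass covers, the arithmetic below assumes that each of the `num_evals` steps consumes one batch of `batch_size` images, and uses the 10,000 test digits mentioned above. The helper name `evals_for_full_pass` is hypothetical and not part of the workshop code.

```python
import math

TEST_SET_SIZE = 10000  # number of MNIST test digits
BATCH_SIZE = 5         # default of the `batch_size` flag
NUM_EVALS = 100        # default of the `num_evals` flag

# Images seen in one evaluation pass with the default flags.
images_per_pass = BATCH_SIZE * NUM_EVALS
coverage = images_per_pass / TEST_SET_SIZE


def evals_for_full_pass(num_examples, batch_size):
    """Number of batches needed to see every example at least once."""
    return math.ceil(num_examples / batch_size)


full_pass_evals = evals_for_full_pass(TEST_SET_SIZE, BATCH_SIZE)
print(images_per_pass, coverage, full_pass_evals)
```

With the defaults, one pass sees 500 images, or 5% of the test set. Under the same assumption, setting `num_evals` to 2000 (or raising `batch_size`) would cover all 10,000 digits, provided the batches do not repeat examples.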

Get the model prediction from the LeNet network.

In [None]:
predictions = ...

Convert the per-class prediction scores into a single class prediction for each image: the class with the highest probability.

In [None]:
predictions = tf.to_int64(tf.argmax(predictions, 1))
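
To see what the argmax step does without running the graph, here is a minimal NumPy sketch. The score values are made up for illustration, and NumPy stands in for TensorFlow only to keep the example self-contained.

```python
import numpy as np

# Hypothetical scores for a batch of 3 images over the 10 digit classes.
scores = np.zeros((3, 10))
scores[0, 7] = 0.9
scores[1, 2] = 0.8
scores[2, 0] = 0.6

# Axis 1 is the class axis, so argmax picks the most likely digit per image.
predicted_classes = np.argmax(scores, axis=1).astype(np.int64)
print(predicted_classes)
```

Each row of scores collapses to one integer label, which can then be compared directly with the ground-truth labels.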

The accuracy is simply the proportion of correctly recognized digits, computed on the test set. You will see the value go up if the training goes well. `streaming_accuracy` calculates how often the predictions match the labels.

In [None]:
metrics_to_values, metrics_to_updates = metrics.aggregate_metric_map({
    'mse': metrics.streaming_mean_squared_error(predictions, labels),
    'accuracy': metrics.streaming_accuracy(predictions, labels),
})
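
Streaming metrics come as a pair: a value to read and an update operation to run once per batch. The class below is a plain-Python sketch of that idea for accuracy. `StreamingAccuracy` is a hypothetical name used for illustration, not the library implementation.

```python
class StreamingAccuracy:
    """Running accuracy accumulated over batches (illustrative only)."""

    def __init__(self):
        self.correct = 0  # correctly predicted examples so far
        self.total = 0    # examples seen so far

    def update(self, predictions, labels):
        """Add one batch and return the accuracy so far."""
        self.correct += sum(int(p == l) for p, l in zip(predictions, labels))
        self.total += len(labels)
        return self.value()

    def value(self):
        """Accuracy over all batches seen so far (0.0 before any update)."""
        return self.correct / self.total if self.total else 0.0


acc = StreamingAccuracy()
acc.update([7, 2, 1, 0, 4], [7, 2, 1, 0, 9])  # 4 of 5 correct
acc.update([1, 4, 9, 5, 9], [1, 4, 9, 6, 9])  # 4 of 5 correct
print(acc.value())
```

Keeping running counts rather than averaging per-batch accuracies means the result is the same regardless of how the examples are split into batches.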

Write the metric values as summaries so that they can be plotted later. We will plot how the accuracy of the model evolves as training proceeds.

In [None]:
# Register one scalar summary per metric; .items() works on Python 2 and 3.
for metric_name, metric_value in metrics_to_values.items():
    tf.summary.scalar(metric_name, metric_value)

With the pieces above in place, we are ready to launch the model evaluation. The function `slim.evaluation.evaluation_loop` evaluates the checkpoints in `checkpoint_dir` in a loop, waiting `eval_interval_secs` between evaluations. Recall that we set the interval to 60 seconds.

In [None]:
...(
    '',
    FLAGS.checkpoint_dir,
    FLAGS.log_dir,
    num_evals=FLAGS.num_evals,
    eval_op=list(metrics_to_updates.values()),
    eval_interval_secs=FLAGS.eval_interval_secs)
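
The idea behind evaluating in parallel with training can be sketched as a simple polling loop. This is a simplified illustration and not how the library implements it: the function `evaluation_loop_sketch` and its callable arguments are hypothetical, and the checkpoint names and scores are simulated.

```python
import time


def evaluation_loop_sketch(latest_checkpoint, evaluate, interval_secs,
                           max_rounds, sleep=time.sleep):
    """Poll for checkpoints and evaluate each new one (illustrative only).

    `latest_checkpoint` returns the newest checkpoint path or None and
    `evaluate` scores one checkpoint; both are hypothetical callables.
    """
    results = []
    last_seen = None
    for _ in range(max_rounds):
        checkpoint = latest_checkpoint()
        # Skip the round if training has not produced a new checkpoint yet.
        if checkpoint is not None and checkpoint != last_seen:
            results.append((checkpoint, evaluate(checkpoint)))
            last_seen = checkpoint
        sleep(interval_secs)
    return results


# Simulated training run: the newest checkpoint seen on each poll.
polls = iter([None, 'model.ckpt-100', 'model.ckpt-100', 'model.ckpt-200'])
results = evaluation_loop_sketch(
    latest_checkpoint=lambda: next(polls),
    evaluate=lambda ckpt: int(ckpt.split('-')[1]) / 1000.0,  # fake score
    interval_secs=0,
    max_rounds=4,
    sleep=lambda secs: None)  # no real waiting in this demo
print(results)
```

A loop like this only looks at the newest checkpoint on each poll, so intermediate checkpoints can be skipped if training saves more often than the evaluator polls.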