# Adding validation to models trained with a minibatch source
In this notebook we'll demonstrate how to add a validation metric to the training process when you're using minibatch sources. First we'll set up the data source and the model. We then train the model with a loss and a metric. Finally, the training session uses the metric to validate the model on a test set at the end of the notebook.

## Setting up the data source
The data for the model is stored in CTF (CNTK Text Format) files.
We've split the data into a training set and a test set, and created a utility function that
turns a file into a minibatch source.
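
For orientation, each line of a CTF file holds one sample, and every field is introduced by `|` followed by its name. The sketch below parses such a line in plain Python. The sample values are made up for illustration, and `parse_ctf_line` is a hypothetical helper, not part of CNTK; the field names are assumed to match the `features` and `labels` streams used in this notebook.

```python
def parse_ctf_line(line):
    """Parse one CTF line into a dict of field name -> list of floats."""
    fields = {}
    # Fields are separated by '|'; any text before the first '|' is skipped.
    for chunk in line.strip().split('|')[1:]:
        name, *values = chunk.split()
        fields[name] = [float(v) for v in values]
    return fields

# Hypothetical sample: four features and a three-element label vector.
sample = parse_ctf_line('|features 5.1 3.5 1.4 0.2 |labels 1 0 0')
print(sample)
```

The real parsing is done by `CTFDeserializer`; this sketch only shows the shape of the data that the stream definitions describe.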

In [1]:
from cntk.io import StreamDef, StreamDefs, MinibatchSource, CTFDeserializer, INFINITELY_REPEAT

def create_datasource(filename, limit=INFINITELY_REPEAT):
    # Each sample has a 3-element label vector (one entry per species)
    # and 4 numeric features.
    labels_stream = StreamDef(field='labels', shape=3, is_sparse=False)
    features_stream = StreamDef(field='features', shape=4, is_sparse=False)

    deserializer = CTFDeserializer(filename, StreamDefs(labels=labels_stream, features=features_stream))

    # max_sweeps caps how many times the source passes over the file.
    minibatch_source = MinibatchSource(deserializer, randomize=True, max_sweeps=limit)

    return minibatch_source

training_source = create_datasource('iris_train.ctf')
test_source = create_datasource('iris_test.ctf', limit=1)

With the helper function we can now create multiple minibatch sources: one for training and one for testing.
The training data source can be iterated over an unlimited number of times, which is required to run multiple epochs of training. The test data source is limited to a single sweep, because we only want to pass the test data through the model once at the end of the training session.
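
The difference between an unlimited and a limited source can be sketched with a plain Python generator. This is a conceptual illustration only: `iterate_sweeps` is a hypothetical helper, it does no randomization, and it is not how `MinibatchSource` is implemented.

```python
from itertools import count, islice

def iterate_sweeps(samples, max_sweeps=None):
    """Yield samples sweep by sweep; max_sweeps=None repeats without end."""
    sweeps = count() if max_sweeps is None else range(max_sweeps)
    for _ in sweeps:
        yield from samples

data = ['a', 'b', 'c']  # stand-in for the rows of a data file

# A limited source stops after one pass over the data.
one_sweep = list(iterate_sweeps(data, max_sweeps=1))

# An unlimited source keeps going, so the consumer decides when to stop.
first_seven = list(islice(iterate_sweeps(data), 7))

print(one_sweep)
print(first_seven)
```

With an unlimited source, something else has to end the loop. In this notebook that role is played by the `max_samples` setting of the training session.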

## Create the model
The model we're using is a classification model that is capable of classifying iris flowers of three different species. The model has four input neurons and three output neurons corresponding to the number of features in the dataset and the number of species it can classify. It features a single hidden layer of 4 neurons as well.
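
To get a feel for how small this network is, the following sketch counts its trainable parameters. It assumes that each dense layer consists of a weight matrix plus one bias per output neuron; `dense_parameter_count` is a hypothetical helper written for this illustration.

```python
def dense_parameter_count(n_inputs, n_outputs):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_outputs + n_outputs

hidden = dense_parameter_count(4, 4)   # 4 features -> 4 hidden neurons
output = dense_parameter_count(4, 3)   # 4 hidden neurons -> 3 species
total = hidden + output
print(hidden, output, total)
```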

In [2]:
from cntk import input_variable
from cntk.layers import Dense, Sequential
from cntk.ops import log_softmax, sigmoid

model = Sequential([
    Dense(4, activation=sigmoid),
    Dense(3, activation=log_softmax)
])

features = input_variable(4)
labels = input_variable(3)

z = model(features)

## Training the model
We're going to train the model using a cross-entropy loss and validate it using the F-measure metric that we've seen before in chapter 4 of the book. We're using the SGD learner with a learning rate of 0.1 to optimize the weights.
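
As a reminder of what the metric expresses, here is the textbook F-beta score computed from prediction counts. This is a minimal sketch of the general definition, not a reproduction of how `fmeasure` is implemented in CNTK; the counts are made up, and the function does not guard against division by zero.

```python
def f_beta(true_positives, false_positives, false_negatives, beta=1.0):
    """Textbook F-beta score computed from hard prediction counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 8 correct hits, 2 false alarms, 4 misses.
score = f_beta(true_positives=8, false_positives=2, false_negatives=4)
print(round(score, 4))
```

With `beta=1` precision and recall are weighted equally, which matches the `beta=1` argument used in this notebook.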

In [3]:
from cntk.losses import cross_entropy_with_softmax, fmeasure
from cntk.learners import sgd 

loss = cross_entropy_with_softmax(z, labels)
metric = fmeasure(z, labels, beta=1)
learner = sgd(z.parameters, 0.1)

To evaluate the model against the test set we need to create a test configuration.
This configuration tells the training session how to run a test pass at the end of the training session.

The test configuration needs a minibatch source that allows only a limited number of sweeps. Without that limit the test pass would never run out of data, and the training session would run forever.

In [4]:
from cntk.train import TestConfig

test_config = TestConfig(test_source)

To run the training logic we'll use the `training_session` function from CNTK. It takes the trainer, the training minibatch source, the minibatch size, a mapping from model inputs to data streams, and the maximum number of samples to train on. The additional keyword argument `test_config` tells the session how to run a test at the end of the session.

Once the session is configured, we call `train` on it to start the training process.
When training completes, the session automatically uses the test configuration to validate the model performance for us.

In [5]:
from cntk.logging import ProgressPrinter
from cntk.train import Trainer, training_session

minibatch_size = 16
samples_per_epoch = 150
num_epochs = 30
max_samples = samples_per_epoch * num_epochs

input_map = {
    features: training_source.streams.features,
    labels: training_source.streams.labels
}

progress_writer = ProgressPrinter(0)
trainer = Trainer(z, (loss, metric), learner, progress_writer)

session = training_session(trainer, 
                           mb_source=training_source,
                           mb_size=minibatch_size, 
                           model_inputs_to_streams=input_map, 
                           max_samples=max_samples,
                           test_config=test_config)

session.train()

 average      since    average      since      examples
    loss       last     metric       last              
 ------------------------------------------------------
Learning rate per minibatch: 0.1
     1.24       1.24      0.275      0.275            16
      1.5       1.63     0.0628    -0.0434            48
      1.3       1.16      0.114      0.153           112
     1.18       1.08      0.121      0.127           240
     1.06      0.954       0.16      0.196           496
    0.949      0.837      0.242      0.321          1008
    0.815      0.684      0.372      0.501          2032
    0.667      0.519      0.528      0.682          4080
Finished Evaluation [1]: Minibatch[1-1]: metric = 75.58% * 30;
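
The numbers in the `examples` column follow from the settings in the training cell. The sketch below redoes that arithmetic in plain Python. The doubling schedule is inferred from the printed output (a progress line after minibatch 1, 3, 7, 15, and so on), so treat it as an assumption about the logging behaviour rather than a specification.

```python
import math

minibatch_size = 16
samples_per_epoch = 150
num_epochs = 30
max_samples = samples_per_epoch * num_epochs

# Number of minibatches needed to cover the sample budget.
num_minibatches = math.ceil(max_samples / minibatch_size)

# Cumulative sample counts after minibatch 1, 3, 7, 15, ...
# (the gap between progress lines doubles each time).
report_points = []
k = 1
while minibatch_size * (2 ** k - 1) <= max_samples:
    report_points.append(minibatch_size * (2 ** k - 1))
    k += 1

print(max_samples, num_minibatches)
print(report_points)
```

The computed report points match the `examples` column above, and the last one lies below the budget of `max_samples`, which is why no further progress line appears before the evaluation result.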
