# Training with a manual minibatch loop
In this notebook we'll retrain our flower classification model using a manual minibatch loop.
The model is the same as before: 4 input features and a one-hot (binary) encoded label as output.

We're going to pretend that the dataset is too big to fit in memory, so we'll load it in chunks. Since our data is stored as CSV, we can still use pandas, but with a different configuration. The LabelBinarizer that we used in previous samples is no longer an option, because it can't be trained in chunks. Instead we'll use a different technique to encode the labels.
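To make the chunked loading concrete, here is a minimal sketch of how pandas behaves when `chunksize` is set. The CSV content is a small in-memory stand-in invented for illustration, not the real iris.csv file.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for a large CSV file without a header row.
csv_text = "\n".join(
    ["5.1,3.5,1.4,0.2,Iris-setosa"] * 5
    + ["6.4,3.2,4.5,1.5,Iris-versicolor"] * 5
)

column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading everything into a single DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), names=column_names, chunksize=4)

# Only one chunk is held in memory at a time while iterating.
chunk_sizes = [len(chunk) for chunk in reader]
print(chunk_sizes)  # 10 rows in chunks of 4 -> [4, 4, 2]
```

Note that the last chunk can be smaller than the requested chunk size, so the training code should not assume a fixed minibatch size.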

## The model
The model is a classification neural network with 4 inputs and 3 outputs. The 4 inputs correspond to the 4 features in our dataset. The 3 outputs are a binary (one-hot) encoding of the 3 flower species that we can classify.

The loss function for the model is a categorical cross entropy function because we're dealing with a multi-class classification problem. The learner is a standard SGD (Stochastic Gradient Descent) algorithm.
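As a rough illustration of what the loss and the learner do, the sketch below computes a categorical cross entropy for one sample in plain NumPy and applies a single gradient descent step directly to the logits. The helper names and the example numbers are assumptions made for this sketch; CNTK computes all of this internally on the model parameters.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(logits, one_hot_targets):
    # Negative log-probability that the model assigns to the true class.
    probs = softmax(logits)
    return -np.sum(one_hot_targets * np.log(probs), axis=-1)

logits = np.array([[2.0, 1.0, 0.1]])   # hypothetical raw model outputs
target = np.array([[1.0, 0.0, 0.0]])   # true class is index 0

loss_value = categorical_cross_entropy(logits, target)
print(loss_value)  # approximately [0.417]

# One gradient descent step on the logits themselves:
# the gradient of this loss with respect to the logits is (probs - target).
learning_rate = 0.1
gradient = softmax(logits) - target
new_logits = logits - learning_rate * gradient
new_loss = categorical_cross_entropy(new_logits, target)
print(new_loss < loss_value)  # the step lowers the loss
```

In the real model the same kind of update is applied to the weights of the Dense layers, using gradients computed over one minibatch at a time.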

In [24]:
from cntk import input_variable
from cntk.layers import Dense, Sequential
from cntk.ops import log_softmax, sigmoid
from cntk.learners import sgd

model = Sequential([
    Dense(4, activation=sigmoid),
    Dense(3, activation=log_softmax)
])

features = input_variable(4)
labels = input_variable(3)

z = model(features)

The training criterion for the model combines a loss and a metric. We define a criterion_factory function that returns both, and decorate it with cntk.Function so that it becomes a CNTK function object.

In [22]:
import cntk
from cntk.losses import cross_entropy_with_softmax, fmeasure

@cntk.Function
def criterion_factory(outputs, targets):
    loss = cross_entropy_with_softmax(outputs, targets)
    metric = fmeasure(outputs, targets, beta=1)
    
    return loss, metric
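
For readers unfamiliar with the metric, the following sketch shows the standard F-beta definition computed from raw counts. This is the textbook formula and not necessarily how CNTK's fmeasure is implemented; the function name and the counts are made up for illustration.

```python
def f_beta(true_positives, false_positives, false_negatives, beta=1.0):
    # Precision: fraction of predicted positives that are correct.
    precision = true_positives / (true_positives + false_positives)
    # Recall: fraction of actual positives that were found.
    recall = true_positives / (true_positives + false_negatives)
    # Weighted harmonic mean; beta=1 weighs precision and recall equally.
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

# Hypothetical counts: 8 correct hits, 2 false alarms, 4 misses.
f1 = f_beta(true_positives=8, false_positives=2, false_negatives=4, beta=1.0)
print(round(f1, 4))  # 0.7273
```

With beta=1, as used in the criterion above, the result is the familiar F1 score.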

In [25]:
loss = criterion_factory(z, labels)
learner = sgd(z.parameters, 0.1)

## Encoding the labels
Our dataset still contains text labels, so we need to encode them to a binary representation. Sadly, sklearn requires us to load the whole dataset into memory if we want to train a LabelBinarizer for this purpose, as we did in previous samples. So instead of using a LabelBinarizer we create a manual mapping between the labels and their encoded values.

In [10]:
label_mapping = {
    'Iris-setosa': 0,
    'Iris-versicolor': 1,
    'Iris-virginica': 2
}
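
The mapping only gives an integer index per species. A minimal NumPy sketch of how those indices can be turned into one-hot rows follows; the sample list of species is invented for illustration.

```python
import numpy as np

label_mapping = {
    'Iris-setosa': 0,
    'Iris-versicolor': 1,
    'Iris-virginica': 2
}

# Hypothetical labels as they might appear in one chunk of the CSV file.
species = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

# Translate each text label to its class index.
indices = np.array([label_mapping[name] for name in species])

# Start with all zeros, then set a single 1 per row at the class index.
one_hot = np.zeros((len(indices), len(label_mapping)), dtype=np.float32)
one_hot[np.arange(len(indices)), indices] = 1.0

print(one_hot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
```

Because the mapping is fixed up front, this works on any chunk independently of the rest of the dataset.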

## Training the model
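This next section trains the model using a manual minibatch loop. The inner loop implements a single epoch: it reads the CSV file in chunks of 16 rows and feeds each chunk to the trainer as one minibatch.
The outer for-loop wraps that code to run 30 epochs, creating a new chunked reader for every epoch.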
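
A chunked reader is consumed as you iterate over it, so every epoch needs a fresh reader. The sketch below shows that pattern in isolation, using a tiny in-memory CSV and a hypothetical make_reader helper in place of the real file and the trainer call.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the dataset.
csv_text = "1.0,a\n2.0,b\n3.0,c\n"

def make_reader():
    # Re-open the source and return a new chunk iterator.
    return pd.read_csv(io.StringIO(csv_text), names=['value', 'label'], chunksize=2)

rows_seen = 0
for epoch in range(3):
    # A new reader per epoch ensures that every epoch sees all rows.
    for chunk in make_reader():
        rows_seen += len(chunk)  # the real loop would train on the chunk here

print(rows_seen)  # 3 epochs * 3 rows -> 9
```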

In [26]:
import pandas as pd
import numpy as np
from cntk.logging import ProgressPrinter
from cntk.train import Trainer

progress_writer = ProgressPrinter(0)
trainer = Trainer(z, loss, learner, progress_writer)

for _ in range(0,30):
    input_data = pd.read_csv('iris.csv', 
        names=['sepal_length', 'sepal_width','petal_length','petal_width', 'species'], 
        index_col=False, chunksize=16)

    for df_batch in input_data:
        feature_values = df_batch.iloc[:,:4].values
        feature_values = feature_values.astype(np.float32)

        label_values = df_batch.iloc[:,-1]

        label_values = label_values.map(lambda x: label_mapping[x])
        label_values = label_values.values

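        # Create the one-hot labels as float32, the default dtype of CNTK input variables.
        encoded_labels = np.zeros((label_values.shape[0], 3), dtype=np.float32)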
        encoded_labels[np.arange(label_values.shape[0]), label_values] = 1.

        trainer.train_minibatch({features: feature_values, labels: encoded_labels})

 average      since    average      since      examples
    loss       last     metric       last              
 ------------------------------------------------------
Learning rate per minibatch: 0.1
     1.45       1.45     -0.189     -0.189            16
     1.24       1.13    -0.0382     0.0371            48
     1.13       1.04      0.141      0.276           112
     1.21        1.3     0.0382    -0.0599           230
      1.2       1.18      0.037     0.0358           466
     1.17       1.13     0.0524     0.0674           948
     1.08       0.99      0.129      0.204          1912
    0.865      0.654      0.343      0.557          3830


In [30]:
from cntk import Evaluator

evaluator = Evaluator(loss.outputs[1], [progress_writer])
input_data = pd.read_csv('iris.csv', 
        names=['sepal_length', 'sepal_width','petal_length','petal_width', 'species'], 
        index_col=False, chunksize=16)

for df_batch in input_data:
    feature_values = df_batch.iloc[:,:4].values
    feature_values = feature_values.astype(np.float32)

    label_values = df_batch.iloc[:,-1]
    
    label_values = label_values.map(lambda x: label_mapping[x])
    label_values = label_values.values
    
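    # Create the one-hot labels as float32, the default dtype of CNTK input variables.
    encoded_labels = np.zeros((label_values.shape[0], 3), dtype=np.float32)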
    encoded_labels[np.arange(label_values.shape[0]), label_values] = 1.
    
    evaluator.test_minibatch({ features: feature_values, labels: encoded_labels})
    
evaluator.summarize_test_progress()

Finished Evaluation [1]: Minibatch[1-11]: metric = 65.71% * 166;
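
The training and evaluation cells repeat the same batch preparation steps. One possible way to share them is a small generator, sketched below under the assumption that the CSV layout matches the one used in this notebook. The iter_minibatches name and the three sample rows are made up for illustration.

```python
import io
import numpy as np
import pandas as pd

label_mapping = {
    'Iris-setosa': 0,
    'Iris-versicolor': 1,
    'Iris-virginica': 2
}

column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

def iter_minibatches(source, chunksize=16):
    """Yield (features, one_hot_labels) as float32 arrays, one pair per CSV chunk."""
    reader = pd.read_csv(source, names=column_names, index_col=False, chunksize=chunksize)
    for df_batch in reader:
        # The first four columns are the numeric features.
        feature_values = df_batch.iloc[:, :4].values.astype(np.float32)
        # The last column holds the species name; map it to a class index.
        label_indices = df_batch.iloc[:, -1].map(label_mapping).values
        # One-hot encode the class indices.
        encoded = np.zeros((len(label_indices), len(label_mapping)), dtype=np.float32)
        encoded[np.arange(len(label_indices)), label_indices] = 1.0
        yield feature_values, encoded

# Hypothetical three-row sample standing in for iris.csv.
csv_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "6.4,3.2,4.5,1.5,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

batches = list(iter_minibatches(io.StringIO(csv_text), chunksize=2))
print([(x.shape, y.shape) for x, y in batches])
# [((2, 4), (2, 3)), ((1, 4), (1, 3))]
```

Both loops could then iterate over iter_minibatches('iris.csv') and pass each pair to train_minibatch or test_minibatch.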


