# Train and visualize a model in TensorFlow - Part 3: TensorFlow's API

We showed in [Part 2](https://github.com/PLN-FaMAF/tensorflowTutorial2018/blob/master/tensorflow_tutorial_2.ipynb) how to train a neural network model with scikit-learn's `MLPClassifier`. We also experienced how painfully slow the process is.

TensorFlow's API has a steeper learning curve than Scikit-Learn's, but it offers much better performance. In the first versions of TensorFlow, the only way to build a neural network was to write out the math from scratch. That is the approach we follow here, so that we have a point of comparison for the next tutorial.

In [1]:
import numpy as np
import tensorflow as tf

from sklearn.metrics import accuracy_score, classification_report

## Preliminaries

First we need two helper functions that we will use during training and evaluation: `one_hot_encoding` and `next_batch`.

The `one_hot_encoding` function takes an array of labels in scalar format (that is, integers from zero to the number of classes minus one) and converts each label into a one-hot vector: a vector whose size equals the number of unique labels, with a one in a single dimension (the label's position when sorted) and zeros in all the others. It is needed because the cost function we use later compares the network's output against a target vector with one entry per class.
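As a quick illustration of the encoding described above, here is a minimal sketch with made-up labels (`toy_labels` and `toy_num_classes` are illustrative names, not part of the tutorial's data). It uses `np.eye` instead of the flat-index trick in `one_hot_encoding`; for integer labels in range, both approaches should give the same matrix.

```python
import numpy as np

# Hypothetical toy labels: four examples drawn from three classes.
toy_labels = np.array([0, 2, 1, 2])
toy_num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix with the labels encodes them all at once.
toy_one_hot = np.eye(toy_num_classes)[toy_labels]

print(toy_one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```

Taking the `argmax` of each row recovers the original scalar label, which is the same operation the network uses later to turn its output into a prediction.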

The `next_batch` function is needed in order to traverse the dataset. It takes a dataset, a batch size and an offset, and returns `batch_size` elements of the dataset starting at the offset. This is needed because processing the whole dataset in a single pass through the network would take too much memory, so we feed it a limited number of elements at a time.
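The offset-based traversal can be sketched in a few lines of NumPy. This is only an illustration on a made-up array (`toy_data` and `toy_batch_size` are hypothetical names), without the shuffling that `next_batch` adds for training data.

```python
import numpy as np

# Hypothetical toy dataset: 10 examples with 2 features each.
toy_data = np.arange(20).reshape(10, 2)
toy_batch_size = 4

offset = 0
batch_sizes = []
while offset < toy_data.shape[0]:
    # Slicing past the end of an array is safe in NumPy:
    # the final batch is simply shorter.
    batch = toy_data[offset:offset + toy_batch_size]
    batch_sizes.append(batch.shape[0])
    offset += toy_batch_size

print(batch_sizes)  # [4, 4, 2]
```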

We also load the data in this step so it's ready to train the neural network in the following steps.

In [2]:
def one_hot_encoding(labels, num_classes):
    """Convert class labels from scalars to one-hot vectors."""
    num_labels = labels.shape[0]
    index_offset = np.arange(num_labels) * num_classes
    one_hot = np.zeros((num_labels, num_classes))
    one_hot.flat[index_offset + labels.ravel()] = 1
    return one_hot

def next_batch(data, target, offset, 
               batch_size, train_data=True):
    """Takes the next batch from data and target
    given the offset used. Returns the offset for the
    next iteration, followed by the data batch and the
    target batch. If the data is for training, it
    shuffles the content."""
    start = offset
    end = offset + batch_size
    num_examples = data.shape[0]
    
    if start == 0 and train_data:
        perm = np.random.permutation(num_examples)
        data = data[perm]
        target = target[perm]
        
    if end > num_examples and train_data:
        perm = np.random.permutation(num_examples)
        data = data[perm]
        target = target[perm]
        start = 0
        end = batch_size
    elif end > num_examples:
        end = num_examples

    return end, data[start:end], target[start:end]

# Loading the data

newsgroup = np.load('./resources/newsgroup.npz')
train_data = newsgroup['train_data']
train_target = newsgroup['train_target']
test_data = newsgroup['test_data']
test_target = newsgroup['test_target']
labels = newsgroup['labels']

## Defining the Architecture

Before training the neural network, we need to create it. Scikit-Learn defines the algorithm internally, based on the parameters given to the model's class. Here, instead, we need to define the operations of the network layer by layer. This is what makes TensorFlow much more difficult, but it also makes it much more flexible when we need to design novel and more complex neural networks (something plainly impossible in Scikit-Learn).

The operations in TensorFlow are symbolic: we first describe what needs to be computed (in this case, the operations of a multilayer perceptron), and only later run that description with real data. You can think of TensorFlow as a DSL for doing math.
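To make the idea of symbolic operations concrete, here is a toy sketch in plain Python. It is not how TensorFlow is implemented; the `Node` class and the `placeholder`, `add` and `mul` helpers are hypothetical names invented for this illustration. The point is only that building the expression and evaluating it are two separate steps.

```python
# Toy deferred-evaluation sketch (NOT TensorFlow's implementation).
class Node:
    """A symbolic operation: it stores what to compute, not the result."""

    def __init__(self, fn, *parents):
        self.fn = fn
        self.parents = parents

    def run(self, feed):
        # A fed value overrides the node, like a placeholder being fed.
        if self in feed:
            return feed[self]
        return self.fn(*(p.run(feed) for p in self.parents))


def placeholder():
    def missing():
        raise ValueError("placeholder was not fed a value")
    return Node(missing)


def add(a, b):
    return Node(lambda u, v: u + v, a, b)


def mul(a, b):
    return Node(lambda u, v: u * v, a, b)


# Building the graph computes nothing yet.
a = placeholder()
b = placeholder()
c = add(mul(a, b), a)  # represents a * b + a

# Only running the graph with concrete inputs produces a number.
result = c.run({a: 3.0, b: 4.0})
print(result)  # 15.0
```

The dictionary passed to `run` plays a role similar to the `feed_dict` argument used later in this tutorial: it maps each placeholder to the concrete value it should take for that run.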

Because TensorFlow is much faster than Scikit-Learn, we can afford a larger architecture: two hidden layers instead of the single one we used with Scikit-Learn.
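Before writing the network with TensorFlow operations, it may help to see the same math on concrete arrays. The following NumPy sketch runs one forward pass of a two-hidden-layer perceptron on made-up data. All sizes and names (`toy_x`, `W1`, `toy_cost`, and so on) are illustrative assumptions, and it draws weights from a plain normal distribution rather than a truncated one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, far smaller than the ones used in this tutorial.
n_examples, n_inputs, n_h1, n_h2, n_classes = 4, 6, 5, 3, 2

toy_x = rng.normal(size=(n_examples, n_inputs))
toy_y = np.eye(n_classes)[rng.integers(0, n_classes, size=n_examples)]

def relu(z):
    return np.maximum(z, 0.0)

# Weights scaled by 1 / sqrt(fan_in); biases start at zero.
W1 = rng.normal(scale=1.0 / np.sqrt(n_inputs), size=(n_inputs, n_h1))
b1 = np.zeros(n_h1)
W2 = rng.normal(scale=1.0 / np.sqrt(n_h1), size=(n_h1, n_h2))
b2 = np.zeros(n_h2)
Wo = rng.normal(scale=1.0 / np.sqrt(n_h2), size=(n_h2, n_classes))
bo = np.zeros(n_classes)

# Forward pass: two ReLU hidden layers, then a linear output layer.
h1_out = relu(toy_x @ W1 + b1)
h2_out = relu(h1_out @ W2 + b2)
toy_logits = h2_out @ Wo + bo

# Softmax cross-entropy, computed with the log-sum-exp trick for stability.
shifted = toy_logits - toy_logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
toy_cost = -(toy_y * log_probs).sum(axis=1).mean()

# Predicted class: index of the largest logit in each row.
toy_pred = toy_logits.argmax(axis=1)

print(toy_logits.shape, toy_cost, toy_pred)
```

The main difference in the TensorFlow version is that none of these lines would compute anything when executed: they would only describe the computation, which is then run inside a session.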

In [3]:
# Number of training instances given
# to the network on each training step
batch_size = 100

# Size of the input layer
input_size = train_data.shape[1]

# Number of classes (size of the output layer)
num_classes = labels.shape[0]

# Size of the hidden layers
hidden_layer_1 = 5000
hidden_layer_2 = 2000

# Define the placeholders. These are needed
# to feed data to the network: in this case
# we have a placeholder for the data (x) and for
# the target (y). Remember that, as the operations
# are symbolic, we can't just feed the neural network
# the raw dataset.
x = tf.placeholder(tf.float32, [None, input_size])
y = tf.placeholder(tf.float32, [None, num_classes])

# We define a scope (important to keep named structure)
# and define the operations in the first hidden layer.
# What it basically does is take the input layer and
# apply a matrix multiplication with a non-linearity
# (the `relu` function).
with tf.name_scope('hidden_layer_1'):
    W_h1 = tf.Variable(
        tf.truncated_normal(
            [input_size, hidden_layer_1],
            stddev=1.0 / np.sqrt(input_size)
        ),
        name='W_h1'
    )
    b_h1 = tf.Variable(
        tf.zeros([hidden_layer_1]),
        name='b_h1'
    )
    h1 = tf.nn.relu(tf.matmul(x, W_h1) + b_h1)

# Same as before, we define the operations
# for the second hidden layer
with tf.name_scope('hidden_layer_2'):
    W_h2 = tf.Variable(
        tf.truncated_normal(
            [hidden_layer_1, hidden_layer_2],
            stddev=1.0 / np.sqrt(hidden_layer_1)
        ),
        name='W_h2'
    )
    b_h2 = tf.Variable(
        tf.zeros([hidden_layer_2]),
        name='b_h2'
    )
    h2 = tf.nn.relu(tf.matmul(h1, W_h2) + b_h2)

# The last layer (output) is similar to the hidden
# layers, but in this case we don't apply the
# non-linearity, because the cost function below
# applies the softmax to these raw values itself
with tf.name_scope('output_layer'):
    W_o = tf.Variable(
        tf.truncated_normal(
            [hidden_layer_2, num_classes],
            stddev=1.0 / np.sqrt(hidden_layer_2)
        ),
        name='W_o'
    )
    b_o = tf.Variable(
        tf.zeros([num_classes]),
        name='b_o'
    )
    logits = tf.matmul(h2, W_o) + b_o

# We define the cost function as the mean of the 
# softmax cross-entropy given the labels (or target) 
# y and the result of the output layer in the
# previous step
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
        labels=y, logits=logits
    )
)

# Finally, we calculate the prediction of the
# neural network using the argmax of the logits
y_hat = tf.argmax(logits, 1)

# We define the train step that we will keep calling
# on each batch to fit the neural network. This uses
# an optimization algorithm (Adam in this case) to minimize
# the cost function defined earlier.
train_step = tf.train.AdamOptimizer(0.01)\
    .minimize(cost)

## Training

Once the architecture is defined, we train the neural network with the dataset. Unlike in Scikit-Learn, it is not as easy as calling a `fit` method that does all the work in the background. Here we need to keep feeding the model batches of data, so that it fits each batch by adjusting the weights of the network. The process is repeated many times (the number of repetitions is a hyperparameter) until we decide to stop, generally because the model has converged and is no longer improving. Note that each iteration of the loop below processes a single batch, so what the code calls an "epoch" is really one training step.
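The feed-a-batch, adjust-the-weights cycle can be sketched without TensorFlow. The example below trains a much simpler model (softmax regression with plain gradient descent, not the Adam optimizer used in this tutorial) on a made-up two-class problem. All names and hyperparameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 200 points in 2-D, two well-separated classes.
n_per_class = 100
toy_x = np.vstack([
    rng.normal(loc=-2.0, size=(n_per_class, 2)),
    rng.normal(loc=2.0, size=(n_per_class, 2)),
])
toy_labels = np.repeat([0, 1], n_per_class)
toy_y = np.eye(2)[toy_labels]

W = np.zeros((2, 2))
b = np.zeros(2)
learning_rate = 0.1
toy_batch_size = 20

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_cross_entropy(probs, targets):
    return -np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1))

initial_loss = mean_cross_entropy(softmax(toy_x @ W + b), toy_y)

for step in range(200):
    # Draw one mini-batch at random.
    idx = rng.choice(toy_x.shape[0], size=toy_batch_size, replace=False)
    xb, yb = toy_x[idx], toy_y[idx]

    # Forward pass, then gradient of the mean cross-entropy.
    probs = softmax(xb @ W + b)
    grad_logits = (probs - yb) / toy_batch_size
    W -= learning_rate * (xb.T @ grad_logits)
    b -= learning_rate * grad_logits.sum(axis=0)

final_loss = mean_cross_entropy(softmax(toy_x @ W + b), toy_y)
print(initial_loss, final_loss)
```

In the TensorFlow version, the two update lines are replaced by a single call that runs the train step, since the optimizer derives the gradients from the symbolic cost.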

In [4]:
# First we need to define a session, which is
# TensorFlow's way to execute a piece of code.
# Then we initialize the variables given by the
# neural network code we designed previously.
# As this is a Jupyter notebook, the session is
# interactive. I recommend reading more about this
# in TensorFlow's documentation.
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

# The offset is set to zero for the first time
offset = 0

# We train the network for 2000 training steps
# (one batch per step)
for epoch in range(1, 2001):
    # For each step we obtain the batch of data
    # needed to fit the network
    offset, batch_data, batch_target =\
        next_batch(train_data, train_target,
                   offset, batch_size)
    # We run the train step operation (defined before)
    # and return the loss value, printing it every 100 steps
    _, loss = sess.run(
        [train_step, cost],
        feed_dict={
            x: batch_data,
            y: one_hot_encoding(batch_target, num_classes)
        })
    if epoch % 100 == 0:
        print("Loss for epoch %02d: %.3f" % (epoch, loss))

Loss for epoch 100: 0.654
Loss for epoch 200: 0.189
Loss for epoch 300: 0.110
Loss for epoch 400: 0.407
Loss for epoch 500: 0.017
Loss for epoch 600: 0.002
Loss for epoch 700: 0.193
Loss for epoch 800: 0.062
Loss for epoch 900: 0.009
Loss for epoch 1000: 0.209
Loss for epoch 1100: 0.211
Loss for epoch 1200: 0.004
Loss for epoch 1300: 0.373
Loss for epoch 1400: 0.078
Loss for epoch 1500: 0.003
Loss for epoch 1600: 0.045
Loss for epoch 1700: 0.000
Loss for epoch 1800: 0.000
Loss for epoch 1900: 0.065
Loss for epoch 2000: 0.001


## Evaluation

The evaluation step, similarly to the training step, works batch by batch, but it traverses the whole dataset exactly once (while for training we traverse it many times). We store the predictions for each batch of data and later compare them against the true labels.
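The batch-by-batch evaluation can be sketched on a made-up example. Here `toy_predict` is a hypothetical stand-in for running the network's prediction operation, and the data is invented for illustration. The sketch shows how the number of batches is obtained by ceiling division and how per-batch predictions are stitched back together.

```python
import numpy as np

# Hypothetical toy setup: 10 test examples whose "model" simply
# predicts class 1 when the single feature is positive.
toy_test_data = np.array([[-2.], [-1.], [1.], [2.], [3.],
                          [-3.], [4.], [-4.], [5.], [-5.]])
toy_test_target = np.array([0, 0, 1, 1, 1, 0, 1, 1, 1, 0])

def toy_predict(batch):
    return (batch[:, 0] > 0).astype(int)

toy_batch_size = 4
num_examples = toy_test_data.shape[0]
# Ceiling division gives the exact number of batches, with no empty batch
# when num_examples is a multiple of toy_batch_size.
num_batches = -(-num_examples // toy_batch_size)

toy_predictions = []
for i in range(num_batches):
    start = i * toy_batch_size
    batch = toy_test_data[start:start + toy_batch_size]
    toy_predictions.append(toy_predict(batch))

# Concatenate per-batch predictions back into one array, in dataset order.
toy_predictions = np.concatenate(toy_predictions)
toy_accuracy = np.mean(toy_predictions == toy_test_target)
print(num_batches, toy_accuracy)  # 3 0.9
```

Because the test data is never shuffled, the concatenated predictions line up with the targets position by position, which is what makes the final comparison valid.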

In [5]:
# First we set the initial offset to zero.
# The number of test examples is needed to calculate
# the number of batches to traverse.
# We also keep a list with the predictions of each batch of data
offset = 0
test_examples = test_data.shape[0]
predictions = []

# For each batch in the dataset we run the prediction
# operation (y_hat) given the data.
for _ in range(int(np.ceil(test_examples / batch_size))):
    offset, batch_data, _ = next_batch(
        test_data, test_target, offset, batch_size, False)
    predictions.append(sess.run(y_hat, feed_dict={x: batch_data}))

# Finally, concatenate the predictions and check the performance
predictions = np.concatenate(predictions)
accuracy = accuracy_score(test_target, predictions)

print("Accuracy: %.2f\n" % accuracy)

print("Classification Report\n=====================")
print(classification_report(test_target, predictions))

Accuracy: 0.86

Classification Report
             precision    recall  f1-score   support

          0       0.92      0.97      0.95       160
          1       0.74      0.83      0.78       195
          2       0.87      0.79      0.83       197
          3       0.77      0.75      0.76       196
          4       0.69      0.84      0.76       192
          5       0.96      0.76      0.85       196
          6       0.84      0.71      0.77       194
          7       0.86      0.93      0.89       198
          8       0.99      0.88      0.94       199
          9       0.74      0.99      0.85       199
         10       0.95      0.96      0.96       200
         11       0.99      0.89      0.94       198
         12       0.92      0.76      0.83       196
         13       0.83      0.92      0.87       198
         14       0.88      0.92      0.90       197
         15       0.93      0.92      0.92       200
         16       0.98      0.78      0.87       182
       