### AlexNet

In the last tutorial we learned how to build a simple convolutional neural network using the <code>tf.keras</code> API. In this tutorial we look at a more complicated CNN, AlexNet, which won the 2012 ImageNet challenge and incorporated several innovative techniques for training a neural network. The original AlexNet was split across 2 GPUs because of the memory constraints of the time. Here we build a simpler version that runs on a single GPU today, using the <code>tf.estimator</code> API, which is high-level yet more flexible. The architecture we are going to build is as follows:

<img src="./files/AlexNet.png">

In [1]:
import tensorflow as tf
print(tf.__version__)

1.12.0


### Implementing the <code>model_fn</code>

Estimator-based models can easily be distributed across a multi-server environment, and can run on CPUs, GPUs or TPUs without recoding the model. At the same time, the API is high-level, meaning that you don't need to work with low-level TensorFlow APIs such as <code>tf.placeholder</code> and <code>tf.Variable</code>.

When we talk about using <code>tf.estimator</code>, what we're really doing is building a <code>tf.estimator.Estimator</code> instance. This <code>Estimator</code> has methods that let us easily train, evaluate and export the model, or make predictions with it. The model itself is wrapped in a model function, or <code>model_fn</code>, which is supplied to the constructor of the <code>Estimator</code>. The model architecture is therefore built inside the <code>model_fn</code>.

The <code>model_fn</code> has a few key arguments:

<code>features</code>: the first item returned from <code>input_fn</code>, which contains the features of our training instances. <code>input_fn</code> is how we stream input data into the model, and we will talk about it later in this tutorial.

<code>labels</code>: the second item returned from <code>input_fn</code>, containing the labels of our training instances. Both <code>features</code> and <code>labels</code> are either <code>tf.Tensor</code> instances or dictionaries mapping string feature/label names to <code>Tensor</code>.

<code>mode</code>: <code>tf.estimator.ModeKeys.TRAIN</code>, <code>tf.estimator.ModeKeys.EVAL</code>, or <code>tf.estimator.ModeKeys.PREDICT</code>. This is important because some steps are done differently in different modes. For example, dropout is only performed during training. 
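
To make the role of <code>mode</code> concrete, here is a minimal pure-Python sketch of mode-dependent dropout. It is an illustration only, not TensorFlow code: the constants <code>TRAIN</code>, <code>EVAL</code> and <code>PREDICT</code> are stand-ins for the <code>tf.estimator.ModeKeys</code> values, and the helpers <code>dropout</code> and <code>forward</code> are hypothetical names. The sketch assumes the "inverted dropout" convention, where surviving values are scaled by <code>1 / (1 - rate)</code> during training so that nothing needs rescaling at evaluation time.

```python
import random

# Stand-ins for the tf.estimator.ModeKeys values (assumption for illustration).
TRAIN, EVAL, PREDICT = "train", "eval", "infer"


def dropout(values, rate, training, rng):
    """Inverted dropout on a list of activations.

    During training, each value is zeroed with probability `rate` and the
    survivors are scaled by 1 / (1 - rate). Outside training the input is
    returned unchanged.
    """
    if not training:
        return list(values)
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() >= rate else 0.0 for v in values]


def forward(values, mode, rng):
    """Apply dropout only when the mode is TRAIN, as a model_fn would."""
    return dropout(values, rate=0.5, training=(mode == TRAIN), rng=rng)


if __name__ == "__main__":
    rng = random.Random(0)
    activations = [1.0, 2.0, 3.0, 4.0]
    for mode in (TRAIN, EVAL, PREDICT):
        print(mode, forward(activations, mode, rng))
```

In <code>EVAL</code> and <code>PREDICT</code> the activations pass through untouched, while in <code>TRAIN</code> each one is either zeroed or doubled (since <code>rate=0.5</code>).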

What is returned from <code>model_fn</code> should be a <code>tf.estimator.EstimatorSpec</code> instance. This tells the <code>Estimator</code> what loss function to optimize, what metrics to calculate, what algorithm to use for training, and how to generate predictions. Which arguments are required depends on the <code>ModeKeys</code> value, which is itself passed through the required <code>mode</code> argument. <code>TRAIN</code> requires <code>loss</code>, specifying the loss function, and <code>train_op</code>, specifying the training operation. <code>EVAL</code> only requires <code>loss</code>. <code>PREDICT</code> requires <code>predictions</code>. Apart from these, there are optional arguments such as <code>eval_metric_ops</code>, which specifies additional metrics to calculate. These will become clearer when we actually build the neural network.

Now we come to the meat of implementing the model function. The flow is like this: we take training data from the <code>features</code> argument and pass it through the layers of the neural network. Each layer takes the output of the previous layer and produces output that is fed into the next layer. After defining the model architecture, we define what predictions to make (classes or probabilities), how to calculate the loss, what optimizer to use and what metrics to evaluate. Then we return a different <code>tf.estimator.EstimatorSpec</code> instance for each value of the <code>mode</code> argument.

Here I'll make use of various types of functions available in TensorFlow. Layers are usually available in <code>tf.layers</code>, activation functions in <code>tf.nn</code>, loss functions in <code>tf.losses</code>, optimization algorithms in <code>tf.train</code>, and performance metrics in <code>tf.metrics</code>. Let's look at how this model function is constructed.

In [3]:
def model_fn(features, labels, mode):
    """Model function for AlexNet"""
    # 1. Input layer
    # The shape, by convention, should be [batch_size, height, width, channels].
    # Since batch size is not known yet, setting a size of -1 tells Tensorflow to
    # infer the size of that dimension when the data is actually available.
    input_layer = tf.reshape(features["x"], [-1, 224, 224, 3])
    
    # 2. Convolutional layer 1
    # Takes input_layer as input. The other arguments are almost the same as what
    # we have seen in tf.keras. AlexNet popularized the ReLU activation: deep
    # neural networks with ReLU train much faster than those with tanh as the
    # activation function.
    conv1 = tf.layers.conv2d(input_layer, filters=96, kernel_size=(11,11), strides=(4,4),
                             padding="valid", activation=tf.nn.relu)
    
    # 3. Local response normalization (LRN) 1
    # This is a new technique introduced in AlexNet. This normalization is done for
    # each position across different feature maps. More details after this function
    # definition.
    lrn1 = tf.nn.lrn(conv1, depth_radius=5, bias=2, alpha=1e-4, beta=0.75)
    
    # 4. Max pooling layer 1
    # AlexNet used overlapping pooling with kernel size 3 by 3, and a stride of 2. 
    pool1 = tf.layers.max_pooling2d(lrn1, pool_size=(3,3), strides=(2,2), padding="valid")
    
    # 5. Convolutional layer 2
    # Here we use "same" padding and a stride of 1 to keep the dimension the same.
    conv2 = tf.layers.conv2d(pool1, filters=256, kernel_size=(5,5), strides=(1,1), 
                             padding="same", activation=tf.nn.relu)
    
    # 6. LRN 2
    lrn2 = tf.nn.lrn(conv2, depth_radius=5, bias=2, alpha=1e-4, beta=0.75)
    
    # 7. Max pooling layer 2
    pool2 = tf.layers.max_pooling2d(lrn2, pool_size=(3,3), strides=(2,2), padding="valid")
    
    # 8. Convolutional layer 3
    conv3 = tf.layers.conv2d(pool2, filters=384, kernel_size=(3,3), strides=(1,1), 
                             padding="same", activation=tf.nn.relu)
    
    # 9. Convolutional layer 4
    conv4 = tf.layers.conv2d(conv3, filters=384, kernel_size=(3,3), strides=(1,1), 
                             padding="same", activation=tf.nn.relu)
    
    # 10. Convolutional layer 5
    conv5 = tf.layers.conv2d(conv4, filters=256, kernel_size=(3,3), strides=(1,1), 
                             padding="same", activation=tf.nn.relu)
    
    # 11. Max pooling layer 3
    # With the 224x224 input used here, the output of this layer is
    # [batch_size, 5, 5, 256] (a 227x227 input would give 6x6).
    pool3 = tf.layers.max_pooling2d(conv5, pool_size=(3,3), strides=(2,2), padding="valid")
    
    # 12. Flatten
    # Remember that before piping the data into a dense layer we need to flatten all
    # dimensions except the first one which is batch_size.
    flatten = tf.layers.flatten(pool3)
    
    # 13. Dense layer 1
    # Similar to tf.keras, we need to specify the number of neurons and the activation
    # function.
    dense1 = tf.layers.dense(flatten, units=4096, activation=tf.nn.relu)
    
    # 14. Dropout 1
    # AlexNet also used the then-new dropout technique, which sets the output of each
    # neuron to zero with a certain probability in every training step. This way, no
    # neuron can rely on the presence of another neuron, which reduces overfitting of
    # the model. In TensorFlow, dropout is implemented as a layer placed after the output
    # of another layer, and we need to specify the probability of dropping a neuron.
    # We only want dropout to happen during training, so we also need to tell the layer
    # whether the mode is tf.estimator.ModeKeys.TRAIN.
    dropout1 = tf.layers.dropout(dense1, rate=0.5, training=(mode==tf.estimator.ModeKeys.TRAIN))
    
    # 15. Dense layer 2
    dense2 = tf.layers.dense(dropout1, units=4096, activation=tf.nn.relu)
    
    # 16. Dropout 2
    dropout2 = tf.layers.dropout(dense2, rate=0.5, training=(mode==tf.estimator.ModeKeys.TRAIN))
    
    # 17. Output layer
    # In the output layer, instead of calculating the probability distribution rendered by
    # the softmax function, we keep the logits without applying the softmax function. This is
    # because the logits will be easier to feed into the function for calculating the loss.
    # When we need the probabilities we can apply softmax to these logits. We'll look at the
    # code in later steps.
    logits = tf.layers.dense(dropout2, units=1000)
    
    # Calculating prediction
    # For prediction, we want to know two things: first, the most likely category that
    # the image belongs to, and second, the probabilities with which the image belongs
    # to each category. We create a dictionary with the names of these outputs as keys
    # and the corresponding tensors as values.
    predictions = {
        "classes": tf.argmax(input=logits, axis=1), # axis 0 is the batch dimension; axis 1 indexes the classes
        "probabilities": tf.nn.softmax(logits=logits, name="softmax_tensor")} # A name is created so that we can log this tensor.
    
    # If we are in PREDICT mode, we just return an EstimatorSpec with the predictions.
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
    
    # If we are not in PREDICT mode, we need to calculate the loss function. Since we're doing
    # multiclass classification, we will still use the sparse categorical crossentropy function.
    # However, in tf.losses there is only the sparse_softmax_cross_entropy function where you
    # need to supply the logits instead of the probabilities. That's why we didn't apply the
    # softmax function in the output layer.
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    
    # If we are in TRAIN mode, we also need to specify the optimizer to use and the training
    # operations. 
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
        # The global_step variable is used by TensorFlow to track the number of
        # training steps that have been processed. Call tf.train.get_global_step()
        # to obtain it.
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
    
    # If we are in EVAL mode, we also want to know the accuracy on the validation set. Therefore
    # we calculate it and add it as an eval_metric_ops.
    if mode == tf.estimator.ModeKeys.EVAL:
        accuracy = tf.metrics.accuracy(labels=labels, predictions=predictions["classes"])
        eval_metric_ops = {"accuracy": accuracy}
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)
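
It can help to check the spatial size of each feature map by hand before training. The sketch below is plain Python, not TensorFlow code, and traces the height/width through the convolution and pooling layers configured in the model function. The helper names <code>out_size</code> and <code>trace</code> and the <code>LAYERS</code> table are hypothetical, introduced only for this illustration. It assumes TensorFlow's padding rules: <code>"valid"</code> gives <code>ceil((size - kernel + 1) / stride)</code> and <code>"same"</code> gives <code>ceil(size / stride)</code>.

```python
import math


def out_size(size, kernel, stride, padding):
    """Spatial output size of a conv/pool layer under TensorFlow's padding rules."""
    if padding == "valid":
        return math.ceil((size - kernel + 1) / stride)
    if padding == "same":
        return math.ceil(size / stride)
    raise ValueError("unknown padding: " + padding)


# (name, kernel size, stride, padding), mirroring the layers in model_fn.
LAYERS = [
    ("conv1", 11, 4, "valid"),
    ("pool1", 3, 2, "valid"),
    ("conv2", 5, 1, "same"),
    ("pool2", 3, 2, "valid"),
    ("conv3", 3, 1, "same"),
    ("conv4", 3, 1, "same"),
    ("conv5", 3, 1, "same"),
    ("pool3", 3, 2, "valid"),
]


def trace(size):
    """Return a dict mapping each layer name to its output height/width."""
    sizes = {}
    for name, kernel, stride, padding in LAYERS:
        size = out_size(size, kernel, stride, padding)
        sizes[name] = size
    return sizes


if __name__ == "__main__":
    for input_size in (224, 227):
        print(input_size, trace(input_size))
```

Under these rules a 224x224 input shrinks to 54, 26, 12 and finally 5 after <code>pool3</code>, so the flattened vector has 5 * 5 * 256 = 6400 entries. A 227x227 input yields 55, 27, 13 and 6 instead, which gives the 6 * 6 * 256 = 9216 entries usually quoted for AlexNet.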

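The reason for keeping logits in the output layer can be illustrated with a small pure-Python sketch. This is not the TensorFlow implementation: the helper names <code>softmax</code> and <code>sparse_softmax_cross_entropy</code> below are local functions written for this example, and they handle a single example rather than a batch (the <code>tf.losses</code> function additionally reduces over the batch). The point is that the loss <code>-log(softmax(logits)[label])</code> can be computed directly from the logits with the log-sum-exp trick, which stays finite even when the logits are large.

```python
import math


def softmax(logits):
    """Probabilities from logits, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_softmax_cross_entropy(label, logits):
    """Cross-entropy for one example, given an integer class label and raw logits.

    Equivalent to -log(softmax(logits)[label]), but computed as
    log_sum_exp(logits) - logits[label] so no probability is ever formed.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]
    print(softmax(logits))
    print(sparse_softmax_cross_entropy(0, logits))
    # Large logits are handled without overflow.
    print(sparse_softmax_cross_entropy(1, [1000.0, 0.0]))
```

The label is "sparse" in the sense that it is a single integer class index rather than a one-hot vector, which matches the integer labels expected by the loss used in the model function.
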
### Implementing the <code>input_fn</code>

So now we have built the full neural network architecture and defined what operations to perform when training, evaluating or predicting with the model. However, we have not yet dealt with how to stream our input data into the model. When building a <code>tf.estimator</code> model, the input data is processed by an input function, or <code>input_fn</code>. This function should return a <code>tf.data.Dataset</code> of batches of <code>(features, labels)</code>, where <code>features</code> is a dictionary mapping feature names to <code>tf.Tensor</code> instances containing batches of features, and <code>labels</code> is a <code>tf.Tensor</code> containing batches of labels. Let's look at how to implement this with code.

But before that, we need to preprocess images so that they are compatible with the input of our neural network.