And now for an alternate implementation -- AlexNet in Gluon.

In [1]:
import numpy as np
import mxnet as mx
from mxnet import nd, autograd, gluon
from time import time


In [2]:
num_outputs = 10 # 10 output digits
batch_size = 128 # mini batch
epochs = 10 # total training loops
learning_rate = 0.01 # amount we update parameters

The CIFAR-10 images are available as a built-in dataset. This is a little different from Keras: you pass a transform function that does the normalization, and each item contains both the input image and its output label. The advantage of this structure shows with very large data sets, where the transformation is done on demand, one batch at a time, so your model can start training sooner.

The pixel data comes as integers, so we'll need to normalize it to the 0-1 range.

OK -- here is an odd one. Keras convolutions take images in (x, y, channels) order. Gluon convolutions expect channels first: (channels, x, y). So we need to reorder axes 0, 1, 2 of the source image as 2, 0, 1 with transpose.
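
To make the axis shuffle concrete, here is a minimal sketch in plain Python. Nested lists stand in for the image array, and `to_channels_first` is a made-up helper name for illustration, not part of MXNet. It does the same two things as the transform in this notebook: put the channel axis first and scale the pixels to 0-1.

```python
# Nested lists stand in for an image array here (an assumption for illustration).
def to_channels_first(image):
    """Reorder an (x, y, channels) image to (channels, x, y) and scale to 0-1."""
    rows = len(image)
    cols = len(image[0])
    channels = len(image[0][0])
    return [[[image[r][c][ch] / 255.0 for c in range(cols)]
             for r in range(rows)]
            for ch in range(channels)]

# A tiny 2x2 "image" with 3 channel values per pixel.
tiny = [[[255, 0, 0], [0, 255, 0]],
        [[0, 0, 255], [255, 255, 255]]]
chw = to_channels_first(tiny)
print(len(chw), len(chw[0]), len(chw[0][0]))  # 3 2 2
```

After the reorder, `chw[0]` is the whole first channel as a 2x2 grid, which is the layout the convolution layers expect.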



In [3]:
def transform(data, label):
    data = mx.nd.transpose(data, (2,0,1))
    data = data.astype(np.float32) / 255.0
    return data, label
train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.CIFAR10(train=True, transform=transform),
                                      batch_size, shuffle=True)
test_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.CIFAR10(train=False, transform=transform),
                                     batch_size, shuffle=False)

We'll do the same kind of definition with Gluon as with Keras -- adding layers inside of loops. These parameters will control the layer definitions.

One thing to notice with MXNet -- the input size of each layer can be computed for us automatically.

In [4]:
kernels = [11, 5, 3, 3, 3]
filters = [96, 192, 384, 384, 256]
pooling = [3, 3, 0, 0, 3]
strides = [2, 2, 0, 0, 2]
dense_units = [4096, 4096]

Note that you don't have a final softmax *layer*. MXNet handles the softmax inside the loss function, so the output is a straight linear mapping -- no activation function.
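
To see why the last layer can stay linear, here is a rough sketch of a softmax cross entropy computed straight from raw outputs (logits), in plain Python. The function name is made up for illustration, and this is the textbook formula, not MXNet's actual implementation.

```python
import math

def softmax_cross_entropy_from_logits(logits, label):
    """Cross entropy of softmax(logits) against an integer class label."""
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    # -log(softmax(logits)[label]) simplifies to log_sum - logits[label].
    return log_sum - logits[label]

# Ten equal logits means a uniform guess over ten classes: loss is ln(10).
uniform_loss = softmax_cross_entropy_from_logits([0.0] * 10, 3)
print(uniform_loss)
```

Folding the softmax into the loss lets the log and the exponential be combined, which avoids overflow on large logits.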


In [6]:
alexnet = gluon.nn.HybridSequential()
with alexnet.name_scope():
    for kernel, num_filters, pool, stride in zip(kernels, filters, pooling, strides):
        # padding of kernel // 2 keeps the spatial size unchanged for odd kernels
        alexnet.add(gluon.nn.Conv2D(channels=num_filters,
                                    kernel_size=kernel,
                                    padding=(kernel//2),
                                    activation='relu'))
        if pool:
            alexnet.add(gluon.nn.MaxPool2D(pool_size=pool, 
                                           strides=stride))
    alexnet.add(gluon.nn.Flatten())
    for units in dense_units:
        alexnet.add(gluon.nn.Dense(units, activation='relu'))
    alexnet.add(gluon.nn.Dense(num_outputs))

Gluon requires a context, either `cpu` or `gpu`. You can change this to `cpu` if needed.

In [7]:
ctx = mx.gpu()

Gluon requires that parameters be explicitly initialized. Here we are using the Xavier initializer, which is a sensible default.

You must initialize before you can set up a trainer.
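
For intuition about what Xavier initialization does, here is a rough sketch of the uniform Xavier (Glorot) scheme in plain Python. It assumes the standard bound sqrt(6 / (fan_in + fan_out)); `glorot_uniform` is a made-up name, and MXNet's own initializer has options that this sketch ignores.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng=random):
    """Sample a fan_out x fan_in weight matrix from U(-bound, bound)."""
    # The bound shrinks as the layer gets wider, keeping activations in scale.
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    weights = [[rng.uniform(-bound, bound) for _ in range(fan_in)]
               for _ in range(fan_out)]
    return weights, bound

# Hypothetical small dense layer: 256 inputs, 10 outputs, seeded for repeatability.
weights, bound = glorot_uniform(fan_in=256, fan_out=10, rng=random.Random(0))
print(len(weights), len(weights[0]), bound)
```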

In [8]:
alexnet.collect_params().initialize(mx.init.Xavier(), ctx=ctx)

Now let's take a look at the resulting network. We need to feed in a sample batch to infer the network size.

In [9]:
for i, (d, l) in enumerate(train_data):
    print(alexnet.summary(d.as_in_context(ctx)))
    break

--------------------------------------------------------------------------------
        Layer (type)                                Output Shape         Param #
               Input                            (128, 3, 32, 32)               0
        Activation-1   <Symbol hybridsequential0_conv0_relu_fwd>               0
        Activation-2                           (128, 96, 32, 32)               0
            Conv2D-3                           (128, 96, 32, 32)           34944
         MaxPool2D-4                           (128, 96, 15, 15)               0
        Activation-5   <Symbol hybridsequential0_conv1_relu_fwd>               0
        Activation-6                          (128, 192, 15, 15)               0
            Conv2D-7                          (128, 192, 15, 15)          460992
         MaxPool2D-8                            (128, 192, 7, 7)               0
        Activation-9   <Symbol hybridsequential0_conv2_relu_fwd>               0
       Activation-10        
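
The shapes and parameter counts in the summary can be reproduced by hand. The sketch below is plain Python arithmetic over the same layer lists. It assumes stride-1 convolutions and pooling without padding that rounds down, which matches the 32, 15, and 7 sizes printed above; `trace_shapes` is a made-up helper name.

```python
kernels = [11, 5, 3, 3, 3]
filters = [96, 192, 384, 384, 256]
pooling = [3, 3, 0, 0, 3]
strides = [2, 2, 0, 0, 2]

def trace_shapes(size=32, in_channels=3):
    """Return (layer, channels, size, params) rows for the conv stack."""
    rows = []
    for kernel, out_channels, pool, stride in zip(kernels, filters, pooling, strides):
        # Stride-1 convolution with padding kernel // 2 keeps the size for odd kernels.
        size = size + 2 * (kernel // 2) - kernel + 1
        # One kernel x kernel x in_channels filter per output channel, plus a bias each.
        params = kernel * kernel * in_channels * out_channels + out_channels
        rows.append(("conv", out_channels, size, params))
        if pool:
            # Unpadded pooling, rounding down (assumed convention).
            size = (size - pool) // stride + 1
            rows.append(("pool", out_channels, size, 0))
        in_channels = out_channels
    return rows

rows = trace_shapes()
for row in rows:
    print(row)

# Number of features the Flatten layer would hand to the first Dense layer.
flat_features = rows[-1][1] * rows[-1][2] ** 2
print(flat_features)
```

This is the bookkeeping that size inference does for us: each layer's input size falls out of the layer before it.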

If your network doesn't change shape, you can `hybridize` it, which makes Gluon run in a precompiled mode much like Keras.

In [10]:
alexnet.hybridize()

And as always, learning is done with an optimizer and a loss function. Here we train a classifier with categorical cross entropy.

In [11]:
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(alexnet.collect_params(), 'sgd', {'learning_rate': learning_rate})
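
As a rough picture of what the `'sgd'` trainer does on each step, here is a plain Python sketch of a vanilla SGD update. It assumes the gradients are summed over the batch and then divided by the batch size passed to the step, and it ignores momentum and weight decay; `sgd_step` is a made-up name.

```python
def sgd_step(weights, summed_grads, learning_rate, batch_size):
    """Plain SGD update using gradients summed over the batch."""
    # Dividing by the batch size turns the summed gradient into an average.
    scale = learning_rate / batch_size
    return [w - scale * g for w, g in zip(weights, summed_grads)]

updated = sgd_step([1.0, -2.0], [128.0, 64.0], learning_rate=0.01, batch_size=128)
print(updated)
```

Because of that division, the effective step size stays comparable when you change the batch size.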

Here is where Gluon isn't as convenient as Keras -- you have to write your own accuracy evaluation loop.

In [12]:
def evaluate_accuracy(data_iterator, net):
    acc = mx.metric.Accuracy()
    for i, (data, label) in enumerate(data_iterator):
        data = data.as_in_context(ctx)
        label = label.as_in_context(ctx)
        output = net(data)
        predictions = nd.argmax(output, axis=1)
        acc.update(preds=predictions, labels=label)
    return acc.get()[1]
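
Stripped of the framework, the accuracy computation is just an argmax per row and a comparison with the label. A minimal plain Python sketch, with lists standing in for the network output (the `accuracy` helper is made up for illustration):

```python
def accuracy(outputs, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    correct = 0
    for scores, label in zip(outputs, labels):
        # argmax: index of the largest score in this row
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)

scores = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9]]
result = accuracy(scores, [1, 0, 1])
print(result)
```

No softmax is needed here, since the largest raw output is also the largest probability.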

And the training loop. Again, this isn't as declarative as Keras. It does, however, give you options for more control on complex models.

In [13]:
smoothing_constant = .01
moving_loss = 0.0

for e in range(epochs):
    start = time()
    for i, (d, l) in enumerate(train_data):
        data = d.as_in_context(ctx)
        label = l.as_in_context(ctx)
        with autograd.record():
            output = alexnet(data)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(data.shape[0])

        #  Keep a moving average of the losses
        curr_loss = nd.mean(loss).asscalar()
        moving_loss = (curr_loss if ((i == 0) and (e == 0))
                       else (1 - smoothing_constant) * moving_loss + (smoothing_constant) * curr_loss)
    elapsed = time() - start

    test_accuracy = evaluate_accuracy(test_data, alexnet)
    train_accuracy = evaluate_accuracy(train_data, alexnet)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s, Time %f" % (e, moving_loss, train_accuracy, test_accuracy, elapsed))

Epoch 0. Loss: 2.185494102490571, Train_acc 0.25844, Test_acc 0.2624, Time 18.001178
Epoch 1. Loss: 1.9511077665971492, Train_acc 0.29912, Test_acc 0.298, Time 17.876366
Epoch 2. Loss: 1.8041707043952524, Train_acc 0.3426, Test_acc 0.3507, Time 16.271401
Epoch 3. Loss: 1.6796391537803523, Train_acc 0.38888, Test_acc 0.3875, Time 16.748858
Epoch 4. Loss: 1.5860475260924667, Train_acc 0.41488, Test_acc 0.4174, Time 17.383099
Epoch 5. Loss: 1.4999414477380506, Train_acc 0.45064, Test_acc 0.4543, Time 17.039613
Epoch 6. Loss: 1.4375727949410486, Train_acc 0.47632, Test_acc 0.4715, Time 17.369838
Epoch 7. Loss: 1.3887894382178314, Train_acc 0.52838, Test_acc 0.5201, Time 17.375701
Epoch 8. Loss: 1.3344149353245434, Train_acc 0.51844, Test_acc 0.5071, Time 18.395933
Epoch 9. Loss: 1.2801594779672918, Train_acc 0.5269, Test_acc 0.5148, Time 17.811769
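
The loss reported in each epoch line is an exponential moving average of the per-batch losses, not the loss of the last batch. Here is the same update rule as in the training loop, pulled out into a small plain Python function (`smooth_losses` is a made-up name) so it can be tried on a few numbers:

```python
def smooth_losses(losses, smoothing_constant=0.01):
    """Exponential moving average, seeded with the first loss."""
    moving = None
    history = []
    for loss in losses:
        if moving is None:
            moving = loss  # the first batch initializes the average
        else:
            moving = (1 - smoothing_constant) * moving + smoothing_constant * loss
        history.append(moving)
    return history

history = smooth_losses([2.0, 1.0, 1.0], smoothing_constant=0.5)
print(history)  # [2.0, 1.5, 1.25]
```

With a constant of 0.01, each new batch contributes only 1% to the average, so the reported loss lags behind the current batch loss and reflects roughly the last hundred batches.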


Not shockingly different than the Keras implementation, but in my opinion, less convenient for this type of model. The hand-rolled training loops can be a benefit when you need logic inside your model. You can mix in rules and computation -- `if` statements and the like -- for very sophisticated production models.

Little things, like the lack of a built-in progress bar, are not a big deal, but they are a missing bit of usability compared to Keras. Overall -- an interesting set of tradeoffs: you get size inference, which cures the problems you'll most commonly run into when building models with Keras, but you write much more code to run the actual training.