Here is a multilayer perceptron in all its glory, written in raw MXNet. Presented this way, it is hard to avoid the math entirely, though you can certainly treat it as a cookbook recipe. The main reason to show the raw version is to demonstrate how useful a higher-level framework like Gluon is.



In [1]:
# This code is the raw math: it shows how to build a neural network model by hand
import mxnet as mx
import numpy as np
from mxnet import nd, autograd, gluon



Here we control the size of our network.

In [2]:
num_inputs = 28 * 28 # MNIST image size
num_hidden = 256 # units in each hidden layer
num_outputs = 10 # 10 output digits
batch_size = 64 # mini batch
epochs = 10 # total training loops
learning_rate = 0.001 # amount we update parameters

Let's get started with our friends, the MNIST digits: load the data and apply the traditional normalization to the range [0, 1].

In [3]:


def transform(data, label):
    return data.astype(np.float32)/255, label.astype(np.float32)
train_data = gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True, transform=transform),
                                      batch_size, shuffle=True)
test_data = gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transform=transform),
                                     batch_size, shuffle=False)
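
As a quick sanity check of what `transform` does to pixel values, here is a small NumPy-only sketch. The sample pixel values are made up for illustration, and it assumes the usual 8-bit intensity range of 0 to 255.

```python
import numpy as np

# Hypothetical pixel intensities, assuming the usual 8-bit range 0..255.
pixels = np.array([0, 51, 255], dtype=np.uint8)

# Same arithmetic as `transform`: cast to float32, then divide by 255.
scaled = pixels.astype(np.float32) / 255

print(scaled.dtype, scaled.min(), scaled.max())
```

The darkest pixel maps to 0, the brightest to 1, and everything else lands in between.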

In [4]:
# build the raw layers by hand
W1 = nd.random_normal(shape=(num_inputs, num_hidden)) #input to hidden
b1 = nd.random_normal(shape=num_hidden)


W2 = nd.random_normal(shape=(num_hidden, num_hidden)) #hidden to hidden
b2 = nd.random_normal(shape=num_hidden)


W3 = nd.random_normal(shape=(num_hidden, num_outputs)) #hidden to output
b3 = nd.random_normal(shape=num_outputs)

params = [W1, b1, W2, b2, W3, b3] # every trainable array, from input to output
for param in params: # allocate a gradient buffer for each parameter
    param.attach_grad()

In [5]:
W1.max() # largest value in the freshly initialised W1


[4.3125005]
<NDArray 1 @cpu(0)>

And just like the images, we want to bring our values down to a comparable scale. Dividing each array by its maximum makes the largest value exactly 1 (negative values stay negative). We'll use an in-place update with the `[:]` slice, which says: overwrite every element of the existing array with the values from the right-hand side.

This is one of those things that, if you forget to do it, chances are you won't get a model that works. It comes down to the learning rate: if you have millions of numbers that range over, say, [0, 1000], but every update is scaled by a learning rate of 0.001, it's just plain going to take *forever*!

In [6]:
for param in params:
    param[:] = param / param.max()
W1.max()


[1.]
<NDArray 1 @cpu(0)>
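
To see why the `[:]` slice matters, here is a minimal NumPy-only sketch of the difference between rebinding a loop variable and writing into the existing array. The `weights` list and its values are hypothetical stand-ins for `params`.

```python
import numpy as np

# Hypothetical stand-ins for the parameter arrays.
weights = [np.array([2.0, -4.0, 8.0]), np.array([0.5, 1.0])]

# Rebinding: `w` now names a brand-new array; the arrays in the list are untouched.
for w in weights:
    w = w / w.max()
assert weights[0].max() == 8.0

# In-place: the slice assignment overwrites the contents of the existing array.
for w in weights:
    w[:] = w / w.max()

print(weights[0].tolist())  # [0.25, -0.5, 1.0]
```

Note that the largest value becomes 1 while the negative entry stays negative.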

We need an activation function to apply to each layer's output; we'll use `relu`. As math goes, this one is pretty simple: it returns 0 for any input below zero and passes positive values through unchanged.

In [7]:
def relu(X):
    return nd.maximum(X, nd.zeros_like(X))
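
A quick way to convince yourself of what `relu` does is to run it on a handful of numbers. This sketch uses NumPy instead of MXNet so it runs anywhere; the input values are arbitrary.

```python
import numpy as np

def relu(x):
    # Element-wise max(x, 0): negatives become 0, positives pass through.
    return np.maximum(x, 0.0)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x).tolist())  # [0.0, 0.0, 0.0, 1.5, 3.0]
```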

Now here is our network, as a function.

In [8]:

def net(X):
    # layer 1
    h1_linear = nd.dot(X, W1) + b1
    h1 = relu(h1_linear)

    # layer 2
    h2_linear = nd.dot(h1, W2) + b2
    h2 = relu(h2_linear)

    # output layer -- softmax will be computed as a loss in the training loop
    yhat_linear = nd.dot(h2, W3) + b3
    return yhat_linear
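
It can help to follow the array shapes through the network. The sketch below repeats the same three matrix multiplications in NumPy with random stand-in data (not real MNIST images), purely to show how a mini-batch of 64 flattened images becomes 64 rows of 10 scores.

```python
import numpy as np

rng = np.random.default_rng(0)
num_inputs, num_hidden, num_outputs, batch_size = 784, 256, 10, 64

# Random stand-ins for a flattened mini-batch and the six parameter arrays.
X = rng.random((batch_size, num_inputs))
W1 = rng.normal(size=(num_inputs, num_hidden))
b1 = rng.normal(size=num_hidden)
W2 = rng.normal(size=(num_hidden, num_hidden))
b2 = rng.normal(size=num_hidden)
W3 = rng.normal(size=(num_hidden, num_outputs))
b3 = rng.normal(size=num_outputs)

h1 = np.maximum(X @ W1 + b1, 0)    # (64, 784) @ (784, 256) -> (64, 256)
h2 = np.maximum(h1 @ W2 + b2, 0)   # (64, 256) @ (256, 256) -> (64, 256)
yhat_linear = h2 @ W3 + b3         # (64, 256) @ (256, 10)  -> (64, 10)

print(h1.shape, h2.shape, yhat_linear.shape)  # (64, 256) (64, 256) (64, 10)
```

Each bias vector is broadcast across the 64 rows, so every image in the batch gets the same offsets.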

Loss function and trainer! With raw MXNet, we're left to write our own stochastic gradient descent. The good news is that we already learned that it really is just subtraction and multiplication!

And our own loss function. Whoo, that's math-y. Just remember: `y` holds the actual values from real data, and `yhat` holds the values the model predicts.

In [9]:
def softmax_cross_entropy(yhat_linear, y):
    return - nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)

def SGD(params, lr):
    for param in params:
        param[:] = param - lr * param.grad
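
To make the loss less abstract, here is a worked example in NumPy on two made-up samples with three classes. The `log_softmax` helper is written out by hand for this sketch (it is not the MXNet function), and the logits and labels are invented.

```python
import numpy as np

def log_softmax(z):
    # Subtract the row max first for numerical stability.
    shifted = z - z.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def softmax_cross_entropy(yhat_linear, y_one_hot):
    # One loss value per row (sample).
    return -(y_one_hot * log_softmax(yhat_linear)).sum(axis=1)

logits = np.array([[2.0, 0.0, 0.0],    # leans towards class 0
                   [0.0, 0.0, 0.0]])   # no preference at all
labels = np.array([0, 1])
one_hot = np.eye(3)[labels]            # [[1, 0, 0], [0, 1, 0]]

loss = softmax_cross_entropy(logits, one_hot)
print(loss)
```

The first sample is scored correctly and gets the smaller loss. The second sample spreads its probability evenly over three classes, so its loss is log(3), roughly 1.1.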

And an accuracy metric. Loss is interesting to the model and the math, but understanding accuracy as a percentage makes more sense to people!

In [10]:
def evaluate_accuracy(data_iterator, net):
    numerator = 0.
    denominator = 0.
    for data, label in data_iterator:
        data = data.reshape((-1, 784))  # flatten each 28x28 image to 784 values
        output = net(data)
        predictions = nd.argmax(output, axis=1)
        numerator += nd.sum(predictions == label)
        denominator += data.shape[0]
    return (numerator / denominator).asscalar()
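
The core of the accuracy calculation is just "pick the highest score in each row and compare it with the label". A small NumPy sketch with invented scores and labels:

```python
import numpy as np

# Made-up scores for 4 samples over 3 classes, plus their true labels.
output = np.array([[0.1, 2.0, -1.0],
                   [3.0, 0.2, 0.1],
                   [0.0, 0.1, 0.5],
                   [1.0, 0.9, 0.8]])
labels = np.array([1, 0, 2, 2])

predictions = output.argmax(axis=1)        # index of the highest score per row
accuracy = (predictions == labels).mean()  # fraction of correct predictions

print(predictions.tolist(), accuracy)  # [1, 0, 2, 0] 0.75
```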

Ten epochs, with a very small learning rate. The code here does basically all the work itself, with one very important exception: MXNet computes the gradients for us, via `autograd.record`. That's the real observation here: MXNet isn't a deep learning library so much as it is a symbolic math library with support for computing gradients built in.

Same basic learning loop we have previously discussed: for each mini-batch, run the network while recording the gradients and loss, then update the parameters with a learning function (the optimizer), and repeat.

In [11]:

for e in range(epochs):
    for i, (data, label) in enumerate(train_data):
        data = data.reshape((-1, 784))
        label_one_hot = nd.one_hot(label, 10)
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label_one_hot)
        loss.backward()
        SGD(params, learning_rate)

    # once per epoch: measure accuracy on the held-out and training sets
    test_accuracy = evaluate_accuracy(test_data, net)
    train_accuracy = evaluate_accuracy(train_data, net)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" %
          (e, nd.sum(loss).asscalar(), train_accuracy, test_accuracy), 
          flush=True)

Epoch 0. Loss: 2.7073212, Train_acc 0.93986666, Test_acc 0.9323
Epoch 1. Loss: 1.6286682, Train_acc 0.9636833, Test_acc 0.949
Epoch 2. Loss: 1.0781705, Train_acc 0.9727833, Test_acc 0.955
Epoch 3. Loss: 3.8051982, Train_acc 0.97728336, Test_acc 0.955
Epoch 4. Loss: 3.9060752, Train_acc 0.9845, Test_acc 0.9606
Epoch 5. Loss: 0.26946187, Train_acc 0.9859333, Test_acc 0.9602
Epoch 6. Loss: 0.41502324, Train_acc 0.9892, Test_acc 0.963
Epoch 7. Loss: 0.665453, Train_acc 0.99155, Test_acc 0.9634
Epoch 8. Loss: 0.78512937, Train_acc 0.9943, Test_acc 0.9653
Epoch 9. Loss: 0.34645534, Train_acc 0.99516666, Test_acc 0.9647
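
The loop above follows a fixed rhythm: forward pass, loss, gradient, update. To see that rhythm without MXNet, here is a toy sketch that fits a three-weight linear model in NumPy. Everything in it is an assumption made for illustration: the synthetic data, the mean-squared-error loss, and above all the hand-derived gradient, which stands in for what `autograd` computes automatically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free data generated from known weights.
X = rng.normal(size=(256, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
lr, batch_size, epochs = 0.1, 32, 20

for e in range(epochs):
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        err = xb @ w - yb                 # forward pass and residual
        loss = (err ** 2).mean()          # mean squared error for this batch
        grad = 2 * xb.T @ err / len(xb)   # hand-derived gradient of the loss
        w[:] = w - lr * grad              # same update rule as SGD above

print(np.round(w, 3))  # should end up close to true_w
```

The only line that changes in spirit when moving to a real network is the gradient: writing it by hand stops being practical, which is exactly the job `autograd` takes over.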


OK -- that was fun, in a math-y way. Now let's try it developer style!