# Lecture 4: Neural Networks
This lecture will introduce multi-layer neural networks and work through several examples using them.

In [None]:
# import standard mxnet packages
import mxnet as mx
from mxnet import nd, autograd
from mxnet import gluon
import numpy as np

# import matplotlib for plotting
import matplotlib.pyplot as plt
%matplotlib inline

Before we start, there were a few questions about how everything we've learned so far fits into the grand scheme. Specifically, people were having trouble relating KNN and perceptron to other algorithms we've discussed. The question of "why are we talking about these things?" shows a misunderstanding of what machine learning is trying to do.

All machine learning can be structured in the following way:

![mlchart](http://jwfromm.com/GIX513/images/ML_chart.png)

Basically, we have some set of input features and a corresponding output or label. A function maps a set of inputs to an output, so there must be some function that maps our input features to the output/label. The goal of machine learning is to learn what that function is.

In easy cases, the function may be quite simple, like a line. However, for harder problems, the function to be learned can be very complex. The diversity in difficulty of problems is why the field of machine learning has so many different algorithms.

Each of the techniques we've discussed, and all the techniques and algorithms we will discuss, are just different tools for learning F(x).

![toolschart](http://jwfromm.com/GIX513/images/Tools_chart.png)

We've started out with some of the simplest tools, KNN and perceptron. To be perfectly clear, these will work well for some problems and are completely viable tools that show up in industry. In fact, it's much better to use these simple tools when they work than more advanced tools, since the simple tools are cheaper.

However, for some problems, you may need a more sophisticated method for learning F(x). Today, we will start covering the most famous of the more sophisticated tools.

Always remember that each different algorithm we learn about is one of many tools. The most important part in understanding machine learning is figuring out which tool to use for a problem. The goal of this course is to fill your toolbox with enough options that you can make a good choice.

## Neural Networks

The perceptron that we discussed last time is a neural network! Specifically, it is a single neuron.

![perceptron](https://cdn-images-1.medium.com/max/1600/1*n6sJ4yZQzwKL9wnF5wnVNg.png)

It may seem that improving the perceptron would be as simple as adding more neurons. Let's try sticking two perceptrons together and see what we get!

![2perceptrons](http://jwfromm.com/GIX513/images/2_perceptrons.png)

Let's try writing the variables in terms of the inputs.

$y_1 = w_1 \cdot x + b_1$

$y_2 = w_2 \cdot y_1 + b_2$

Let's now substitute the first equation into the second

$y_2 = w_2 \cdot (w_1 \cdot x + b_1) + b_2$

Now, let's simplify a little

$y_2 = w_2 \cdot w_1 \cdot x + (w_2 \cdot b_1 + b_2)$

Interesting, what if we define some new variables?

$w_3 = w_2 \cdot w_1$

$b_3 = w_2 \cdot b_1 + b_2$

Let's substitute these new variables back into our equation for $y_2$

$y_2 = w_3 \cdot x + b_3$

Hmm, this looks a lot like the equation for a single perceptron. In fact, it is identical to it.

![combo](http://jwfromm.com/GIX513/images/combo_perceptrons.png)

This means that our combination of two perceptrons is exactly the same as some other single perceptron! What's the point of putting two together if another perceptron does just as well?

The answer is that there isn't a point. Because each neuron is purely linear, chaining several of them together gives us nothing a single neuron couldn't already do. We need to do something to make each neuron non-linear, so that the combination of many neurons __will__ give us something extra.
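
To make the collapse concrete, here is a quick numeric check of the derivation above. The weight and bias values are made up purely for illustration, and the helper names `two_neurons` and `one_neuron` are hypothetical.

```python
# Two stacked linear neurons (no activation), using made-up example values.
w1, b1 = 2.0, 1.0    # first neuron (hypothetical values)
w2, b2 = -3.0, 0.5   # second neuron (hypothetical values)

def two_neurons(x):
    """Feed x through neuron 1, then feed the result through neuron 2."""
    y1 = w1 * x + b1
    return w2 * y1 + b2

# The equivalent single neuron, using the definitions of w3 and b3 from above.
w3 = w2 * w1
b3 = w2 * b1 + b2

def one_neuron(x):
    return w3 * x + b3

# Compare the two on a handful of inputs.
inputs = [-2.0, -0.5, 0.0, 1.0, 3.0]
collapsed_matches = all(abs(two_neurons(x) - one_neuron(x)) < 1e-12 for x in inputs)
print(w3, b3, collapsed_matches)
```

Whatever values you pick for the four parameters, the two functions should agree on every input.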

![nonlin](https://i.stack.imgur.com/ibYr3.png)

The solution is to add what's called an activation to the output of the neuron. An activation applies some __non-linear__ operation to the neuron's output before it is fed to the next neuron.

There are actually quite a few options for activations, so let's take a look at a few

In [None]:
# ReLU Activation
def relu(x):
    return np.maximum(x, 0)

x = np.arange(-1, 1, .1)
plt.plot(x, relu(x))

In [None]:
# sigmoid activation
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.arange(-5, 5, .1)
plt.plot(x, sigmoid(x))

In [None]:
# Tanh Activation
def tanh(x):
    return (np.exp(x) - np.exp(-x))/(np.exp(x) + np.exp(-x))

x = np.arange(-5, 5, .1)
plt.plot(x, tanh(x))

In [None]:
# softrelu activation
def softrelu(x):
    return np.log(1+np.exp(x))

x = np.arange(-5, 5, .1)
plt.plot(x, softrelu(x))
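
The four activations plotted above are closely related, and a couple of those relationships are easy to check numerically. The sketch below redefines `sigmoid` and `softrelu` so that it runs on its own. The variable names `max_tanh_gap` and `gap` are only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softrelu(x):
    return np.log(1 + np.exp(x))

x = np.arange(-5, 5, .1)

# tanh is just a stretched and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
tanh_from_sigmoid = 2 * sigmoid(2 * x) - 1
max_tanh_gap = np.max(np.abs(tanh_from_sigmoid - np.tanh(x)))

# softrelu is a smooth version of relu: it sits slightly above relu everywhere,
# and the two are furthest apart near x = 0, where softrelu(0) = log(2)
gap = softrelu(x) - np.maximum(x, 0)

print(max_tanh_gap, gap.min(), gap.max())
```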

Let's now take another look at the two neuron case, but add activations.

![combo](http://jwfromm.com/GIX513/images/relu.png)

Again, let's try to express the various outputs algebraically

$y_1 = \text{relu}(w_1 \cdot x + b_1)$

$y_2 = \text{relu}(w_2 \cdot y_1 + b_2)$

Now, just as before, let's try to substitute $y_1$ into the equation for $y_2$

$y_2 = \text{relu}(w_2 \cdot (\text{relu}(w_1 \cdot x + b_1)) + b_2)$

Hmm, because relu is non-linear, I can't really simplify this any further. In general, there's no way I could pick a $w_3$ and $b_3$ for a single new neuron that would give me the same output for every input. Thus, having two neurons actually gives us something a single neuron can't!
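
Here is a small sketch of why the two relu neurons can't be collapsed into a straight line. The parameter values are made up for illustration, and `relu_scalar` and `two_relu_neurons` are hypothetical helper names. The idea: fit the only straight line that passes through two outputs on the left side, then see whether it still matches on the right side.

```python
def relu_scalar(z):
    return max(z, 0.0)

# Made-up example parameters (hypothetical values).
w1, b1 = 1.0, 0.0
w2, b2 = 1.0, 1.0

def two_relu_neurons(x):
    y1 = relu_scalar(w1 * x + b1)
    return relu_scalar(w2 * y1 + b2)

# Fit the unique straight line through two points on the left side ...
x_a, x_b = -2.0, -1.0
slope = (two_relu_neurons(x_b) - two_relu_neurons(x_a)) / (x_b - x_a)
intercept = two_relu_neurons(x_a) - slope * x_a

# ... and compare that line with the real network on the right side.
x_c = 2.0
line_prediction = slope * x_c + intercept
actual = two_relu_neurons(x_c)
print(line_prediction, actual)
```

With these values the network is flat for negative inputs and rises for positive ones, so the line fitted on the left misses the output on the right. A single relu neuron can't copy this shape either: its flat part is always exactly 0, while this network's flat part sits at 1.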

That's really big news: now we can combine tons of neurons and represent much more complicated functions!

To be blunt, the more neurons you add to a network, the more complex the functions it can represent. However, it also becomes more difficult to train, so don't go too crazy.

![neural](https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/300px-Colored_neural_network.svg.png)

Now, let's talk about the steps involved in training a neural network.

![flowchart](http://jwfromm.com/GIX513/images/flowchart.png)

Let's also define a few terms that you'll be seeing a lot.

* __Transform__: Some sort of function applied to the input data, for example resizing images to a fixed dimension
* __Minibatch__: A subset of data that is looked at together. Used when there is too much input data to process all at once.
* __DataLoader__: An mxnet utility that helps feed data to a neural network
* __Dense__: A fully connected layer of neurons, where every neuron sees every input
* __Trainer__: The object that applies the learning algorithm (here SGD) to update the parameters
* __Loss__: How far off our guess was from the right label
* __Predictions__: The output guess of a network
* __Backwards__: Compute the gradient of the loss with respect to every parameter of the network
* __Epoch__: Iterating through all the inputs once
* __Step__: Process a single minibatch and update parameters once
* __Learning Rate__: How big a step the network takes on each update. Too large and it might not learn well; too small and you'll get tired of waiting.
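
To see how minibatch, step, and epoch relate, here is a little arithmetic using the MNIST settings used later in this lecture (60000 training images, batches of 64, 3 epochs). This sketch assumes the final, smaller batch of each epoch is kept rather than dropped.

```python
import math

num_examples = 60000   # size of the training set
batch_size = 64        # images per minibatch
epochs = 3             # full passes over the training set

# one step per minibatch, so round up to count the final partial batch
steps_per_epoch = math.ceil(num_examples / batch_size)
# the final batch holds whatever is left over
last_batch_size = num_examples - (steps_per_epoch - 1) * batch_size
# total number of parameter updates over the whole training run
total_steps = epochs * steps_per_epoch

print(steps_per_epoch, last_batch_size, total_steps)
```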

In [None]:
# set the device we should use for computing, we'll just use our cpu for now
data_ctx = mx.cpu()
model_ctx = mx.cpu()

In [None]:
# load MNIST, a very simple image classification dataset

# set the size of the training set
num_examples = 60000
# set batch size : how many images should I process at a time?
batch_size = 64
# set the number of pixels per image (28 x 28)
num_inputs = 784
# set the number of possible outputs (0 through 9)
num_outputs = 10
# define a function that scales the image pixels down between 0 and 1
def transform(data, label):
    return data.astype(np.float32)/255, label.astype(np.float32)
# load the training data
train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True, transform=transform),
                                      batch_size, shuffle=True)
# load the test data
test_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transform=transform),
                              batch_size, shuffle=False)

In [None]:
# let's sample 5 random data points from the test set
sample_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transform=transform),
                              1, shuffle=True)
for i, (data, label) in enumerate(sample_data):
    data = data.reshape(data.shape[1:-1])
    plt.imshow(data.asnumpy())
    plt.show()
    if i == 4:  # stop after the fifth image
        break

In [None]:
# define a minimal neural network
net = gluon.nn.Dense(num_outputs)
# initialize the parameters of the network randomly
net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=model_ctx)

![simplenet](http://jwfromm.com/GIX513/images/simplenet.png)
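
Under the hood, a single `Dense` layer with no activation computes one weighted sum per output, which is ten perceptrons side by side. The numpy sketch below mimics that computation on fake data. The random weights, the fake `images` array, and the batch of 4 are all assumptions for illustration, not what gluon actually initializes.

```python
import numpy as np

num_inputs, num_outputs, batch = 784, 10, 4

rng = np.random.default_rng(0)
# one row of weights per output neuron, one bias per output neuron
W = rng.normal(scale=0.01, size=(num_outputs, num_inputs))
b = np.zeros(num_outputs)

# fake flattened images with pixel values in [0, 1)
images = rng.random((batch, num_inputs))

# each image gets 10 scores, one per digit
scores = images @ W.T + b
# the predicted digit is the index of the largest score
predictions = np.argmax(scores, axis=1)

print(scores.shape, predictions.shape)
```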

In [None]:
# define the loss function, for virtually all classification problems we will use softmax cross-entropy loss.
# this is a combination of the softmax function, which squishes the outputs into probabilities between 0 and 1
# that sum to 1, and cross-entropy loss, which measures how close we are to the correct label.
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
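
To get a feel for what this loss does, here is a minimal numpy sketch of softmax followed by cross-entropy for a single example. It is an illustration of the idea rather than gluon's actual implementation, and the three raw scores are made up.

```python
import numpy as np

def softmax(scores):
    # subtracting the max doesn't change the result but avoids overflow in exp
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def cross_entropy(probs, label):
    # the loss is small when the correct class gets a high probability
    return -np.log(probs[label])

scores = np.array([2.0, 1.0, 0.1])   # hypothetical raw outputs for 3 classes
probs = softmax(scores)

loss_if_right = cross_entropy(probs, 0)  # correct label is the top-scoring class
loss_if_wrong = cross_entropy(probs, 2)  # correct label is the lowest-scoring class

print(probs, loss_if_right, loss_if_wrong)
```

The loss is lower when the network puts most of its probability on the correct label, which is exactly what we want training to encourage.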

In [None]:
# define an optimizer, let's use basic stochastic gradient descent
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
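
Stochastic gradient descent boils down to one rule: nudge each parameter a little in the direction that lowers the loss, scaled by the learning rate. Here is a toy sketch with a single weight and a squared-error loss. The data and starting value are made up, and this is plain Python rather than what the gluon trainer does internally.

```python
learning_rate = 0.1

# toy (input, target) pairs generated from target = 3 * input
data = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)]

w = 0.0  # start from an arbitrary guess
for epoch in range(50):
    for xi, yi in data:
        prediction = w * xi
        # derivative of (w * xi - yi)^2 with respect to w
        grad = 2 * (prediction - yi) * xi
        # the SGD update: step against the gradient
        w = w - learning_rate * grad

print(w)
```

After enough passes over the data, `w` should settle very close to 3, the value that generated the targets.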

In [None]:
# define a function to see how good our model is
def evaluate_accuracy(data_iterator, net):
    # keep track of the accuracy across our dataset
    acc = mx.metric.Accuracy()
    # iterate through all the data
    for i, (data, label) in enumerate(data_iterator):
        # move the data and label to the proper device
        data = data.as_in_context(model_ctx).reshape((-1, 784))
        label = label.as_in_context(model_ctx)
        # run the data through the network
        output = net(data)
        # check what our guess is
        predictions = nd.argmax(output, axis=1)
        # compute accuracy and update our running tally
        acc.update(preds=predictions, labels=label)
    # return the accuracy
    return acc.get()[1]

In [None]:
# let's go ahead and try running our test data through our network. We haven't trained anything yet, so we expect
# low accuracy (roughly 10%, about the same as guessing one of the 10 digits at random).
evaluate_accuracy(test_data, net)
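
Accuracy itself is a simple calculation: take the index of the largest output as the guess, then count how often the guess matches the label. Here is a tiny pure-Python sketch with made-up outputs for four examples and three classes.

```python
# made-up network outputs: one row per example, one score per class
outputs = [
    [0.1, 2.3, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 4.0],
    [2.0, 1.9, -0.5],
]
labels = [1, 0, 2, 1]

# the guess for each example is the index of its largest score (argmax)
predictions = [row.index(max(row)) for row in outputs]
# count the matches and divide by the number of examples
correct = sum(p == t for p, t in zip(predictions, labels))
accuracy = correct / len(labels)

print(predictions, accuracy)
```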

In [None]:
# Now let's get to the good part, training the network!

def train(net, trainer, epochs=3):
    # iterate through the epochs
    loss_history = []
    for e in range(epochs):
        # we're going to sum up the loss over the whole epoch
        cumulative_loss = 0
        # iterate through all the training data
        for i, (data, label) in enumerate(train_data):
            # make sure the data is on the right device, flatten the images
            data = data.as_in_context(model_ctx).reshape((-1,784))
            label = label.as_in_context(model_ctx)
            # compute the output and loss while keeping track of gradients
            with autograd.record():
                output = net(data)
                loss = softmax_cross_entropy(output, label)
            # calculate all the derivatives with respect to the loss
            loss.backward()
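            # update weights based on the derivative, normalizing by the actual size of this batch
            # (the last batch of an epoch can be smaller than batch_size)
            trainer.step(data.shape[0])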
            # update our loss for this epoch
            cumulative_loss += nd.sum(loss).asscalar()

        print("Epoch %s. Loss: %s" % (e, cumulative_loss/num_examples))
        loss_history.append(cumulative_loss/num_examples)
    return np.arange(epochs), loss_history

In [None]:
x, y = train(net, trainer)
plt.plot(x,y)

In [None]:
# now let's check the accuracies again
test_accuracy = evaluate_accuracy(test_data, net)
train_accuracy = evaluate_accuracy(train_data, net)
print(test_accuracy)
print(train_accuracy)

In [None]:
def model_predict(net,data):
    output = net(data.as_in_context(model_ctx))
    return nd.argmax(output, axis=1)

# let's sample 10 random data points from the test set
sample_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transform=transform),
                              10, shuffle=True)
for i, (data, label) in enumerate(sample_data):
    data = data.as_in_context(model_ctx)
    print(data.shape)
    im = nd.transpose(data,(1,0,2,3))
    im = nd.reshape(im,(28,10*28,1))
    imtiles = nd.tile(im, (1,1,3))
    plt.imshow(imtiles.asnumpy())
    plt.show()
    pred=model_predict(net,data.reshape((-1,784)))
    print('model predictions are:', pred)
    break


Neat! We just trained our first neural network. And what's even better, it got pretty high accuracy! This is especially impressive since our network is literally as simple as it could be while still working at all. Let's see if we can make a spicier net by adding some extra layers.

In [None]:
# use the gluon sequential class to build up a multilayer neural network
spicynet = gluon.nn.HybridSequential()
with spicynet.name_scope():
    spicynet.add(gluon.nn.Dense(128, activation='relu'))
    spicynet.add(gluon.nn.Dense(64, activation='relu'))
    spicynet.add(gluon.nn.Dense(num_outputs))
spicynet.hybridize()
spicynet.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=model_ctx)
spicy_trainer = gluon.Trainer(spicynet.collect_params(), 'sgd', {'learning_rate': .1})

![spicynet](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/image/mlp_mnist.png)

And that's it! Now we have a three-layer network that is much more sophisticated than before. We can train and test it exactly as we did before.
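
One way to see how much bigger this network is: count its trainable parameters. Each dense layer has one weight per (input, output) pair plus one bias per output. The helper `dense_params` below is a hypothetical name for illustration, and the layer sizes are the ones used above with 784 input pixels.

```python
def dense_params(n_in, n_out):
    """Weights plus biases for one fully connected layer."""
    return n_in * n_out + n_out

# the single-layer network: 784 inputs straight to 10 outputs
simple_params = dense_params(784, 10)

# the three-layer network: 784 -> 128 -> 64 -> 10
layer_sizes = [784, 128, 64, 10]
spicy_params = sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

print(simple_params, spicy_params)
```

The three-layer network has roughly 14 times as many parameters as the single layer, which is part of why it can learn a more complicated function.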

In [None]:
x, y = train(spicynet, spicy_trainer, epochs=3)
plt.plot(x, y)
evaluate_accuracy(test_data, spicynet)