# Neural Networks

Since our aim is to build machine learning models (whether statistical or neural network) to understand IoT data, let's begin by building some simple models in Python. 

In this lecture we look at implementing neural networks in Python. For the most part we will be using PyTorch, a high-level library for implementing complex and sophisticated deep networks. 

But we will start small and code a learning law ourselves. We begin by implementing a simple Single Layer Perceptron (SLP) to learn the Lorry/Van classification problem from the lecture. The table below shows the data we have. We use -1 to mean "Lorry" and 1 to mean "Van":

|Mass    |Length     |Class   |
|:------:|:---------:|:------:|
|10      |6          |-1      |
|20      |5          |-1      |
|5       |4          |1       |
|2       |5          |1       |
|3       |6          |-1      |
|10      |7          |-1      |
|5       |9          |-1      |
|2       |5          |1       |
|2.5     |5          |1       |
|20      |5          |-1      |

Let's begin by importing NumPy and defining a function that creates our dataset. Each input example contains the mass and length of the vehicle, and the labels are -1 for lorry (truck) and 1 for van. The `make_dataset` function returns the inputs as a 10x1x2 array (ten 1x2 row vectors, one per vehicle) and the labels as a 10-element vector.


In [1]:
import numpy as np

# Create our dataset
def make_dataset():
    train_data = np.array([[[10, 6]], [[20, 5]], [[5, 4]], 
                           [[2, 5]], [[3, 6]], [[10, 7]], 
                           [[5, 9]], [[2, 5]], [[2.5, 5]], 
                           [[20, 5]]])
    train_labels = np.array([-1, -1, 1, 1, -1, -1, -1, 1, 1, -1])
    
    return (train_data, train_labels)
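
As a quick sanity check, the sketch below rebuilds the same arrays and prints their shapes. It repeats `make_dataset` only so that the snippet runs on its own. Note that each example is stored as a 1x2 row, so iterating over the data yields one row vector at a time.

```python
import numpy as np

def make_dataset():
    # Same data as above: each example is a 1x2 row [mass, length]
    train_data = np.array([[[10, 6]], [[20, 5]], [[5, 4]],
                           [[2, 5]], [[3, 6]], [[10, 7]],
                           [[5, 9]], [[2, 5]], [[2.5, 5]],
                           [[20, 5]]])
    train_labels = np.array([-1, -1, 1, 1, -1, -1, -1, 1, 1, -1])
    return train_data, train_labels

train_data, train_labels = make_dataset()
print(train_data.shape)     # (10, 1, 2)
print(train_labels.shape)   # (10,)
print(train_data[0].shape)  # (1, 2)
```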


Let's now initialize the SLP. We store the SLP as a dictionary with the following layout:

slp = {

"inputs":<1x3 input vector>,

"weights":<3x1 weights>,

"output":<1x1 output>

}

Although we have only 2 inputs, our input is defined as 1x3 because we need to include the bias. There is only one output, so the weights form a 3x1 matrix of random numbers:

In [2]:
# Initialize the SLP:
# We store our SLP as a dictionary. There are 3 inputs since we have Mass,
# Length, and a bias which is always 1.0. There are 3 weights to connect
# the 3 inputs to the output, and a single output

def init_slp(slp):
    slp['inputs'] = np.array([0.0, 0.0, 1.0])
    slp['weights'] = np.random.randn(3, 1)
    slp['output'] = np.array(0)
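
To see why a constant third input of 1.0 stands in for the bias, here is a small sketch. The weight values are hypothetical and chosen only to make the arithmetic easy to follow; the point is that the matrix product gives the same result as writing the bias term out explicitly.

```python
import numpy as np

# Hypothetical weights, chosen only for illustration
w = np.array([[0.5], [-1.0], [2.0]])  # 3x1: mass weight, length weight, bias weight
x = np.array([10.0, 6.0, 1.0])        # mass, length, constant bias input

with_bias_input = np.matmul(x, w)[0]        # weighted sum via the matrix product
explicit = 0.5 * 10.0 + (-1.0) * 6.0 + 2.0  # the same sum with the bias written out

print(with_bias_input, explicit)  # both are 1.0
```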
    

Now we come to the meat of the SLP: The feedforward and learn functions. The feedforward function is defined as:

$$
f(in, w) = \tanh\left(\sum_{i=0}^{n-1}in_i \times w_{i,0}\right)
$$

Since we have defined our input as a $1\times3$ matrix and the weights as a $3\times1$ matrix, the feedforward is simply a matrix multiply followed by tanh. We use a tanh transfer function because its output lies between -1 and 1, matching our class labels. The `learn` function adjusts each weight by the error (target minus output) multiplied by the corresponding input, with a parameter *alpha* controlling the speed of learning. It returns the absolute error, which we will later use to compute the mean absolute error (MAE) across all samples:

In [3]:
# Compute the feedforward
def feed_forward(slp, inputs):
    # Take dot-product of the inputs and the weights
    slp["inputs"][0:2] = inputs
    slp["output"] = np.tanh(np.matmul(slp["inputs"], slp["weights"]))
    return slp["output"]

def learn(slp, alpha, inputs, target):
    feed_forward(slp, inputs)
    
    # Find error
    E = target - slp['output']
    slp["weights"] = np.add(slp["weights"], (alpha * E[0] 
                                          * slp['inputs'].reshape(3,1)))
    return abs(E[0])
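
To make the update rule concrete, the sketch below performs a single learning step by hand on one van example. The starting weights of zero are an assumption made only so the numbers are easy to check; the real SLP starts from random weights.

```python
import numpy as np

alpha = 0.1
x = np.array([2.0, 5.0, 1.0])           # mass, length, bias input
w = np.zeros((3, 1))                    # hypothetical starting weights
target = 1.0                            # a van

y = np.tanh(np.matmul(x, w))            # output before the update, shape (1,)
E = target - y                          # error, shape (1,)
w = w + alpha * E[0] * x.reshape(3, 1)  # same update rule as learn()

y_after = np.tanh(np.matmul(x, w))      # output after the update
print(y[0], y_after[0])
```

With zero weights the first output is 0, the error is 1, and the weights become 0.2, 0.5 and 0.1. The second output is therefore tanh(3.0), which is much closer to the target of 1.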

Finally we can create our SLP and train it. We make 601 passes over the data (iterations 0 to 600) and print the MAE every 50 iterations.

In [4]:
slp = {}
init_slp(slp)
feed_forward(slp, np.array([[20.0, 5.0]]))

(train_in, train_out) = make_dataset()
for i in range(601):
    ctr = 0
    E = 0
    for j, data in enumerate(train_in):
        ctr = ctr + 1
        E = E + learn(slp, 0.1, data, train_out[j])
    
    if (i % 50) == 0:
        print("Iteration %d, Average Absolute Error: %3.2f" % (i, E / ctr))



Iteration 0, Average Absolute Error: 0.99
Iteration 50, Average Absolute Error: 0.46
Iteration 100, Average Absolute Error: 0.44
Iteration 150, Average Absolute Error: 0.43
Iteration 200, Average Absolute Error: 0.43
Iteration 250, Average Absolute Error: 0.42
Iteration 300, Average Absolute Error: 0.42
Iteration 350, Average Absolute Error: 0.42
Iteration 400, Average Absolute Error: 0.41
Iteration 450, Average Absolute Error: 0.35
Iteration 500, Average Absolute Error: 0.01
Iteration 550, Average Absolute Error: 0.01
Iteration 600, Average Absolute Error: 0.01


We can see that the MAE settles at a decent value of 0.01 in this run; the exact numbers will vary because the initial weights are random. Now we can try three sample inputs:

|Mass    |Length     |
|:------:|:---------:|
|12      |7          |
|3       |5          |
|15      |12         |



In [5]:
# Use a plain array; np.matrix is no longer recommended by NumPy
test_inputs = np.array([[12, 7], [3, 5], [15, 12]])

print("Mass\tLength\tClass")
print("-----\t------\t-----")

for x in test_inputs:
    y = feed_forward(slp, x)
    # A non-positive output means class -1 (truck), otherwise class 1 (van)
    veh_type = 'truck' if y[0] <= 0.0 else 'van'
    print("%3.1f\t%3.1f\t%s" % (x[0], x[1], veh_type))
    

Mass	Length	Class
-----	------	-----
12.0	7.0	truck
3.0	5.0	van
15.0	12.0	truck


Since we didn't set aside any of the data for testing (there are only 10 examples), we don't have a "gold standard" against which to evaluate this SLP. That's alright, since our main aim was to see how to implement the learning law. In any case, the outputs here seem consistent with the training data (a large mass or length suggests a truck; otherwise it's a van).
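
For reference, setting data aside could look something like the sketch below, which shuffles the ten sample indices and holds back two of them. The 80/20 split and the fixed seed are assumptions for illustration; with so few examples a split like this is of limited practical use.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the split is repeatable
n_samples = 10
order = rng.permutation(n_samples)  # shuffled indices 0..9

n_test = 2                          # hypothetical: hold out 20% for testing
test_idx = order[:n_test]
train_idx = order[n_test:]

print(len(train_idx), len(test_idx))  # 8 2
```

The index arrays can then be used to select rows, for example `train_in[train_idx]` and `train_out[train_idx]`.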

## PyTorch Models

In this course we will use the PyTorch library to implement our neural networks. PyTorch is a convenient high-level library that is built on top of the Torch project, which is in turn a very large and complex library for machine learning.

Full documentation on PyTorch can be found at [PyTorch docs](https://pytorch.org/docs/stable/index.html).

Importantly, Pytorch is also definitely much more convenient to use than NumPy. ;)

Let's begin by building a simple Multi-Layer Perceptron to recognize handwritten digits from the famous MNIST dataset.

The MNIST dataset consists of 28x28 grayscale images of handwritten digits:

![MNIST set](./Images/mnist.jpg)

Our job then is to build a classifier that takes a 28x28 image and classifies it as one of the 10 digits.

We will first need to install torchvision, a Python package with common datasets and models for computer vision tasks: `pip install torchvision`

## Imports

We begin by importing:

    - torch: The core PyTorch package.
    - torch.nn: The basic building blocks for our models.
    - torch.nn.functional: Functions such as relu that we apply between layers.
    - torch.optim: The optimisers available in PyTorch (such as SGD).
    - torchvision: The Python package with our MNIST dataset.
    

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

## Designing and Building the Model

Now let's begin building our model. The weights of the neural network are called "parameters", and these are set by an optimization algorithm during training. However, we ourselves need to decide on the "hyperparameters", which describe the design of the NN:

    - The size and shape of the input
    - Encoding for the input
    - # of hidden layers
    - Size of each hidden layer
    - Transfer functions
    - Size of the output
    
Some of these are easy to determine. Since our inputs are 28x28 images and it's easier to work with a one-dimensional vector, we will flatten each image into a single 784-element input. Hence the input layer will consist of 784 input nodes. We will scale all inputs to between 0 and 1, which generally helps training. There are 10 digits, so the output layer has 10 nodes, one per digit.
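
The flattening and scaling can be sketched in plain NumPy as follows. The image here is made up (it is not a real MNIST digit) and simply holds 8-bit pixel values so that the effect of each step is visible.

```python
import numpy as np

# A made-up 28x28 "image" with 8-bit pixel values (0 to 255)
image = (np.arange(28 * 28) % 256).astype(np.uint8).reshape(28, 28)

flat = image.reshape(-1)                  # 28x28 -> 784-element vector
scaled = flat.astype(np.float32) / 255.0  # pixel values now lie in [0, 1]

print(flat.shape)                  # (784,)
print(scaled.min(), scaled.max())  # 0.0 1.0
```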

For the rest we apply a combination of two well respected design techniques called "intuition" and "guesswork" and produce the following design:

    - # of input nodes: 784
    - Encoding: Scale between 0 and 1
    - # of hidden layers: 2
    - Size of hidden layer 1: 1024 nodes
    - Transfer function: ReLU (see below)
    - Size of hidden layer 2: 256 nodes
    - Transfer function: ReLU
    - Size of output: 10 nodes (one per digit)
    
The ReLU, Sigmoid (similar to Softmax) and other transfer functions are shown below. We saw these in the lecture:

![Transfer Functions](./Images/transfer.png)
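
As a reminder of how these functions behave, here is a minimal sketch in plain Python. The helper names `relu` and `sigmoid` are defined here for illustration and are not the PyTorch versions.

```python
import math

def relu(x):
    # Passes positive values through unchanged and clips negatives to zero
    return max(0.0, x)

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), round(sigmoid(x), 4), round(math.tanh(x), 4))
```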

We also add a "dropout" layer which randomly drops a percentage of the nodes for training, to reduce overfitting. We will look at this in more detail in a later lecture.
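
Until then, the idea can be sketched in plain NumPy as below. The `dropout` function here is a hypothetical helper, not the PyTorch layer. It follows the common "inverted dropout" scheme, which is also how `nn.Dropout` behaves in training mode: the surviving values are scaled up by 1 / (1 - p) so that the average activation stays the same.

```python
import numpy as np

def dropout(x, p, rng):
    # Keep each element with probability 1 - p, zero it otherwise
    keep_mask = rng.random(x.shape) >= p
    # Scale survivors by 1 / (1 - p) so the expected value is unchanged
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones(10000)
dropped = dropout(activations, 0.3, rng)

print((dropped == 0).mean())  # fraction zeroed, close to 0.3
print(dropped.mean())         # close to 1.0
```

At evaluation time no nodes are dropped, which is why the later code calls `model.eval()` before measuring accuracy.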

Let's build our network. We first create a class `ModelNN` which inherits from the `nn.Module` class, and add the layers in the constructor. We also define the function `forward`, which specifies how the layers are connected.


In [6]:
class ModelNN(nn.Module):
    def __init__(self):
        super(ModelNN, self).__init__()
        # First hidden layer
        self.l1 = nn.Linear(784, 1024)
        # Randomly drop 30% of this layer for training
        self.dropout1 = nn.Dropout(0.3)
        # Add the second hidden layer. 
        self.l2 = nn.Linear(1024, 256)
        # As before we randomly drop 30% of the nodes for training.
        self.dropout2 = nn.Dropout(0.3)
        # Output layer: one node per digit class (0 to 9)
        self.l3 = nn.Linear(256, 10)
    def forward(self, x):
        # Here we define the forward function, we accept the input data 
        # and return the output data. We use modules defined in the constructor as
        # well as any arbitrary operators.
        # We can see this defines the connections between the layers of our model.
        x = self.l1(x)
        x = F.relu(x)
        x = self.dropout1(x)
        x = self.l2(x)
        x = F.relu(x)
        x = self.dropout2(x)
        output = self.l3(x)
        return output
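
To get a feel for the size of this network, the sketch below counts the trainable parameters of a stack of fully connected layers. The helper `count_parameters` is hypothetical, and the layer sizes assume a 10-class output layer. Each linear layer contributes one weight per input/output pair plus one bias per output node.

```python
def count_parameters(layer_sizes):
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights plus one bias per output node
    return total

# Assumed layer sizes: 784 inputs, two hidden layers, 10 output classes
print(count_parameters([784, 1024, 256, 10]))  # 1068810
```

Dropout layers add no parameters, so they do not appear in the count.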

## Loading the Dataset

This is literally it! We've built the network! Now let's bring in the MNIST dataset. We will load it with a `batch_size` of 60, normalise the data (the pixel values end up between -1 and 1), and flatten each 1x28x28 image into a 784-element vector.

In [7]:
batch_size = 60

# Load the data, normalise it, and reshape it. 
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,)),
                                transforms.Lambda(lambda x: torch.flatten(x))])

training_set = datasets.MNIST('../data', train=True, download=True, transform=transform)
test_set = datasets.MNIST('../data', train=False, transform=transform)
train_loader = torch.utils.data.DataLoader(training_set, batch_size = batch_size)
test_loader = torch.utils.data.DataLoader(test_set, batch_size = batch_size)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
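
The arithmetic behind the `transform` pipeline can be sketched in plain Python. `ToTensor` rescales 8-bit pixel values to the range 0 to 1, and `Normalize((0.5,), (0.5,))` then subtracts the mean 0.5 and divides by the standard deviation 0.5. The two helper functions below are hypothetical stand-ins that mimic this per-pixel calculation.

```python
def to_unit_range(pixel):
    # What ToTensor does to an 8-bit pixel value: 0..255 -> 0.0..1.0
    return pixel / 255.0

def normalize(value, mean=0.5, std=0.5):
    # What Normalize does to each value: subtract the mean, divide by the std
    return (value - mean) / std

for pixel in (0, 255):
    print(pixel, normalize(to_unit_range(pixel)))  # 0 -> -1.0, 255 -> 1.0
```

So the values that reach the network lie between -1 and 1 rather than between 0 and 1.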






## Start Training

Now that we have built the network, and loaded and properly encoded the data, let's train the model. We will train for 10 epochs in batches of 60 samples. Batches are useful for controlling memory usage, especially in memory-limited environments such as GPUs.
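
As a small worked calculation, the number of weight updates follows directly from the batch size. The sketch below assumes the standard 60,000-image MNIST training set.

```python
import math

n_train = 60000   # size of the MNIST training set
batch_size = 60
num_epochs = 10

# The last batch may be smaller, so round up
steps_per_epoch = math.ceil(n_train / batch_size)

print(steps_per_epoch)               # 1000
print(steps_per_epoch * num_epochs)  # 10000 weight updates in total
```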

In [8]:
# Model, Loss, and Optimizer
model = ModelNN()

# We use a "cross entropy" loss function which is more sophisticated 
# than the simple mean-squared loss function in the lecture and well 
# suited for classification problems.
criterion = nn.CrossEntropyLoss()

# Create a Stochastic Gradient Descent optimizer with a learn rate of 0.01
# and a momentum of 0.9, which smooths the updates and usually speeds up training
optimizer = optim.SGD(model.parameters(), lr = 0.01, momentum = 0.9)

# Train the Model
num_epochs = 10
for epoch in range(num_epochs):
    for i, (data, labels) in enumerate(train_loader):
        # Forward pass
        outputs = model(data)
        loss = criterion(outputs, labels)
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (i+1) % 500 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_loader)}], Loss: {loss.item():.4f}')

print("Training complete!")

# Switch to evaluation mode (disables dropout) and measure training accuracy
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in train_loader:
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print(f'Accuracy of the model on the 60000 training images: {100 * correct / total} %')

Epoch [1/10], Step [500/1000], Loss: 0.5517
Epoch [1/10], Step [1000/1000], Loss: 0.1218
Epoch [2/10], Step [500/1000], Loss: 0.2186
Epoch [2/10], Step [1000/1000], Loss: 0.0603
Epoch [3/10], Step [500/1000], Loss: 0.0886
Epoch [3/10], Step [1000/1000], Loss: 0.0933
Epoch [4/10], Step [500/1000], Loss: 0.1601
Epoch [4/10], Step [1000/1000], Loss: 0.0375
Epoch [5/10], Step [500/1000], Loss: 0.0819
Epoch [5/10], Step [1000/1000], Loss: 0.1009
Epoch [6/10], Step [500/1000], Loss: 0.1247
Epoch [6/10], Step [1000/1000], Loss: 0.0968
Epoch [7/10], Step [500/1000], Loss: 0.0861
Epoch [7/10], Step [1000/1000], Loss: 0.0347
Epoch [8/10], Step [500/1000], Loss: 0.0252
Epoch [8/10], Step [1000/1000], Loss: 0.0148
Epoch [9/10], Step [500/1000], Loss: 0.1010
Epoch [9/10], Step [1000/1000], Loss: 0.0779
Epoch [10/10], Step [500/1000], Loss: 0.0585
Epoch [10/10], Step [1000/1000], Loss: 0.0552
Training complete!
Accuracy of the model on the 60000 training images: 98.55833333333334 %
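
The loss values printed above come from the cross-entropy loss. For a single sample it can be sketched in plain Python as below. The function `cross_entropy` is a hypothetical helper and the scores are made up; `nn.CrossEntropyLoss` applies the same softmax-then-negative-log calculation and, by default, averages it over the batch.

```python
import math

def cross_entropy(logits, target):
    # Softmax with the usual max-subtraction for numerical stability
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    prob_target = exps[target] / sum(exps)
    # The loss is the negative log of the probability given to the true class
    return -math.log(prob_target)

confident_right = cross_entropy([5.0, 0.0, 0.0], target=0)
confident_wrong = cross_entropy([5.0, 0.0, 0.0], target=1)
print(round(confident_right, 4), round(confident_wrong, 4))
```

The loss is small when the network gives the correct class a high score and large when it is confidently wrong.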


## Evaluation

Finally, once training is over, we evaluate the network on the 10,000 test images, which it has not seen during training:


In [10]:
# Test the Model
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print(f'Accuracy of the model on the 10000 test images: {100 * correct / total} %')

Accuracy of the model on the 10000 test images: 97.77 %
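
The accuracy calculation itself is simple: the predicted class is the index of the largest output, and accuracy is the percentage of predictions that match the labels. The sketch below shows this in plain NumPy with made-up scores for three images and four classes.

```python
import numpy as np

# Hypothetical scores for 3 images and 4 classes (one row per image)
outputs = np.array([[0.1, 2.0, -1.0, 0.3],
                    [1.5, 0.2,  0.1, 0.0],
                    [0.0, 0.1,  0.2, 3.0]])
labels = np.array([1, 2, 3])

predicted = outputs.argmax(axis=1)  # index of the largest score per row
correct = int((predicted == labels).sum())
accuracy = 100 * correct / len(labels)

print(predicted, correct, accuracy)
```

Here two of the three predictions match their labels, so the accuracy is about 66.7%.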
