# Cooking a simple neural network library - revised

## Ingredients

- `numpy`
- [a loss function](#Loss-function)
- [some layers](#Layers)
- [a neural net](#Neural-network)
- [an optimizer](#Optimizer)
- [a batch data provider](#Batch-generator)
- [a training routine](#Training)
- \+ [application exercises](#Application-exercise)

Hopefully by the end of this tutorial you will have an understanding of the building blocks needed for training (deep) neural networks. 

## Foreword

We will rely purely on `numpy` for this tutorial. Make sure to import it here.

In [None]:
import numpy as np

#### Object Oriented Python

Object-oriented Python, a.k.a. _classes_, will be used in this tutorial.  
For those not familiar with Python classes, know that you will only be required to write code **within** existing class **methods**, and **not** to write any class from scratch.  

If you want to know more about Python classes, here is a step by step [tutorial](https://aboucaud.github.io/slides/2016/python-classes).

---

## Loss function

A loss function measures how good our predictions are, compared to the expected values. It is the cost function that needs to be minimised. The loss function must be differentiable, as its gradient is needed to adjust the parameters of the network through backpropagation.

Below is a generic loss class. It implements 
- `loss()` : **the loss** computed from the expected label and the predicted one,
- `grad()` : **the gradient of the loss**, needed for the backpropagation.

### Exercise #1 - mean square error loss

***5 min*** - *Implement the `MeanSquareError` class*

In this exercise, `predicted` and `actual` are vectors (1-d numpy arrays).  
You must implement both the loss and its derivative using these two vectors.

For info, the mean square error loss function is defined here without the $1/N$ normalisation, i.e. as a sum of squared errors,

$${\rm loss_{MSE}}(y_{true}, y_{pred}) = \sum \left(y_{pred} - y_{true}\right) ^ 2$$
and its gradient
$$\nabla {\rm loss_{MSE}}(y_{true}, y_{pred}) = 2 \cdot (y_{pred} - y_{true})$$
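
The gradient formula above can be checked numerically. The sketch below is not part of the exercise; it compares the analytical gradient with central finite differences on made-up values. The function names (`mse_loss`, `mse_grad`, `numerical_grad`) and the sample vectors are assumptions chosen for illustration.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    # Sum of squared errors, as in the formula above
    return np.sum((y_pred - y_true) ** 2)

def mse_grad(y_pred, y_true):
    # Analytical gradient with respect to y_pred
    return 2 * (y_pred - y_true)

def numerical_grad(y_pred, y_true, eps=1e-6):
    # Central finite differences, one component at a time
    grad = np.zeros_like(y_pred)
    for i in range(y_pred.size):
        step = np.zeros_like(y_pred)
        step[i] = eps
        loss_plus = mse_loss(y_pred + step, y_true)
        loss_minus = mse_loss(y_pred - step, y_true)
        grad[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# Illustrative values only
y_true_demo = np.array([0.0, 1.0, 1.0, 0.0])
y_pred_demo = np.array([0.2, 0.7, 0.9, 0.4])

analytical = mse_grad(y_pred_demo, y_true_demo)
numerical = numerical_grad(y_pred_demo, y_true_demo)
print(analytical)
print(np.allclose(analytical, numerical, atol=1e-6))
```

The same finite-difference check can be reused later to validate the `backward` methods of the layers.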

In [None]:
class MeanSquareError:
    def loss(self, y_predicted, y_true):
        return np.sum((y_predicted - y_true) ** 2)
    
    def grad(self, y_predicted, y_true):
        return 2 * (y_predicted - y_true)

## Layers


### Neuron

A neuron $\mathscr{N}$ has multiple input values (vector $\mathbf{x}$ of size $m$) and a single output $z$. Each neuron is characterised by its weights (vector $\mathbf{w}$ of size $m$) and a constant bias $b$ to perform the linear operation

$$\begin{aligned} 
\mathscr{N}_{\mathbf{w}, b}(\mathbf{x}) &= \sum_i w_i.x_i + b \\
                                   &= \begin{bmatrix} w_{0} & \cdots & w_{m}\end{bmatrix} \begin{bmatrix} x_0 \\ \vdots \\ x_m \end{bmatrix} + b\\
                                   &= \mathbf{w}^T\mathbf{x} + b \\
                                   &= z
\end{aligned}$$
where $m$ is the input size and $z$ is the single output value.


### Linear layer

A linear layer $\mathscr{L}$ is a set of neurons, and can therefore be represented by a matrix of weights $\mathbf{W}$ and a vector of constants $\mathbf{b}$.  
For a layer of $n$ neurons, the matrix $\mathbf{W}$ is therefore $(m,n)$ and the vector $\mathbf{b}$ is of size $n$.  

The operation realized by the layer on the input vector $\mathbf{x}$ of size $m$ is

$$\begin{aligned} 
\mathscr{L}_{\mathbf{W}, \mathbf{b}}(\mathbf{x}) 
% \begin{bmatrix} y_0 \\ \vdots \\ y_n \end{bmatrix}
    &= \begin{bmatrix} \sum_i W_{i, 0}.x_i + b_0 \\ \vdots \\  \sum_i W_{i, n}.x_i + b_n \end{bmatrix} \\
    &= \begin{bmatrix} W_{0,0} & \cdots & W_{m,0} \\ \vdots & & \vdots \\ W_{0,n} & \cdots & W_{m,n} \\\end{bmatrix} . \begin{bmatrix} x_0 \\ \vdots \\ x_m \end{bmatrix} + \begin{bmatrix} b_0 \\ \vdots \\ b_n \end{bmatrix} \\
    &= \mathbf{W}^T\mathbf{x} + \mathbf{b} \\
    &= \mathbf{z}
\end{aligned}$$

which is a matrix multiplication and an addition, that produces an output vector $\mathbf{z}$ of size $n$.
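
As a quick illustration of the formula above, the following sketch computes the output of a layer neuron by neuron and then with a single matrix product, for one input vector. The sizes and the random values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 2                        # input size, number of neurons
W_demo = rng.normal(size=(m, n))   # one column of weights per neuron
b_demo = rng.normal(size=n)        # one bias per neuron
x_demo = rng.normal(size=m)        # a single input vector

# Neuron by neuron: w^T x + b for each column of W
z_neurons = np.array([W_demo[:, j] @ x_demo + b_demo[j] for j in range(n)])
# Whole layer at once: W^T x + b
z_layer = W_demo.T @ x_demo + b_demo

print(z_layer.shape)
print(np.allclose(z_neurons, z_layer))
```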

### Activation layer

After the layer forward pass, there might be an activation layer whose role is to break the linearity of the network. The so-called activation layer is thus a non-linear function $f$ acting on the output $\mathbf{z}$ of the linear layer. 

$$\begin{aligned}
\mathbf{a} &= f(\mathbf{z}) \\
           &= f(\mathbf{W}^T\mathbf{x} + \mathbf{b})
\end{aligned}$$

The activation layer preserves the shape of its input.

### Backpropagation (computing of the layer gradients)

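For the backward pass, each layer receives from the layer that follows it the gradient of the loss with respect to its own output, and passes to the layer that precedes it the gradient with respect to its input.

The ***chain rule*** connects the loss $\mathscr{C}$ (for cost) to the weights and biases of layer $i$ and yields the following relations: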

$$\begin{aligned}
\dfrac{\partial \mathscr{C}}{\partial \mathbf{W}^{i}} 
    &= \dfrac{\partial \mathscr{C}}{\partial \mathbf{a}^{i}} \cdot \dfrac{\partial \mathbf{a}^{i}}{\partial \mathbf{z}^{i}} \cdot \dfrac{\partial \mathbf{z}^{i}}{\partial \mathbf{W}^{i}} \\
    &= \mathbf{\nabla}\mathscr{C}^{i} \cdot f'(\mathbf{z}^{i}) \cdot \mathbf{x}^{i} \\
    &= \mathbf{\nabla_W^i}
\end{aligned}$$

and 

$$\begin{aligned}
\dfrac{\partial \mathscr{C}}{\partial \mathbf{b}^{i}} 
    &= \dfrac{\partial \mathscr{C}}{\partial \mathbf{a}^{i}} \cdot \dfrac{\partial \mathbf{a}^{i}}{\partial \mathbf{z}^{i}} \cdot \dfrac{\partial \mathbf{z}^{i}}{\partial \mathbf{b}^{i}} \\
    &= \mathbf{\nabla}\mathscr{C}^{i} \cdot f'(\mathbf{z}^{i}) \\
    &= \mathbf{\nabla_b^i}
\end{aligned}$$

where $\mathbf{\nabla}\mathscr{C}^{i}$ is the gradient vector of the loss propagated at layer $i$.
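
To make the two relations concrete, here is a minimal sketch for a single layer followed by a `tanh` activation and the sum-of-squares loss, acting on one input vector. The choice of `tanh`, the sizes, the random values and all variable names are assumptions made for the example. For a single vector, the product with $\mathbf{x}^{i}$ becomes an outer product so that the result has the shape of $\mathbf{W}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 2
W_demo = rng.normal(size=(m, n))
b_demo = rng.normal(size=n)
x_demo = rng.normal(size=m)
y_target = rng.normal(size=n)

def cost(W, b):
    # C = sum((tanh(W^T x + b) - y)^2)
    a = np.tanh(W.T @ x_demo + b)
    return np.sum((a - y_target) ** 2)

# Analytical gradients from the chain rule
z = W_demo.T @ x_demo + b_demo
a = np.tanh(z)
grad_a = 2 * (a - y_target)              # dC/da
delta = grad_a * (1 - np.tanh(z) ** 2)   # dC/da * f'(z)
grad_W = np.outer(x_demo, delta)         # shape (m, n), same as W
grad_b = delta                           # shape (n,), same as b

# Numerical gradients by central finite differences
eps = 1e-6
num_W = np.zeros_like(W_demo)
for i in range(m):
    for j in range(n):
        dW = np.zeros_like(W_demo)
        dW[i, j] = eps
        num_W[i, j] = (cost(W_demo + dW, b_demo) - cost(W_demo - dW, b_demo)) / (2 * eps)
num_b = np.zeros_like(b_demo)
for j in range(n):
    db = np.zeros_like(b_demo)
    db[j] = eps
    num_b[j] = (cost(W_demo, b_demo + db) - cost(W_demo, b_demo - db)) / (2 * eps)

print(np.allclose(grad_W, num_W, atol=1e-5))
print(np.allclose(grad_b, num_b, atol=1e-5))
```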

---

In this tutorial, we define sequential neural nets, made of one or more layers.
Each layer passes its inputs forward and propagates gradients backward. 

For example, a neural net might look like `inputs -> Linear -> Tanh -> Linear -> output`

The base class for a layer has a dictionary to store parameters ($\mathbf{W}$, $\mathbf{b}$) and gradients ($\mathbf{\nabla_W}$, $\mathbf{\nabla_b}$) and implements a forward and a backward method.

### Exercise #2 - linear layer

***10 - 15 min*** - *Implement the `forward` and `backward` methods of the linear layer.*

The mathematical recap above is here to help you.  

Be aware that neural networks are generally trained in batches (see [batch generator part](#Batch-generator)), essentially in order to 
- save some forward / backward computing steps
- reduce the noise produced by extreme input vectors at the optimisation step.

We therefore introduce the concept of ***batch_size***, which is the number of inputs processed simultaneously.
For this reason, the input and output arrays of the layers are not actual vectors but **matrices** whose first dimension is the ***batch_size***.

Hints:
- matrix products can be written either with `np.dot(m1, m2)` or `m1 @ m2` with recent Python versions (3.5+)
- pay particular attention to the shapes of the input and output matrices for the matrix product
- $\mathbf{W}^T\mathbf{x}$ is written `x @ W`
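
The last hint can be verified on a small batch. The sketch below, with arbitrary sizes and random values chosen for illustration, shows that `X_batch @ W + b` gives the same result as applying $\mathbf{W}^T\mathbf{x} + \mathbf{b}$ to each row of the batch, the bias being broadcast over the rows.

```python
import numpy as np

rng = np.random.default_rng(2)
batch_size, input_size, output_size = 5, 3, 2
X_batch = rng.normal(size=(batch_size, input_size))
W_demo = rng.normal(size=(input_size, output_size))
b_demo = rng.normal(size=output_size)

# Batched forward pass: b is broadcast over the batch dimension
Z_batch = X_batch @ W_demo + b_demo
# Same computation, one input vector at a time
Z_rows = np.stack([W_demo.T @ x + b_demo for x in X_batch])

print(Z_batch.shape)
print(np.allclose(Z_batch, Z_rows))
```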

In [None]:
class LinearLayer:
    """
    Inputs are of size (batch_size, input_size)
    Outputs are of size (batch_size, output_size)
    """
    def __init__(self, input_size, output_size):
        self.params = {}
        self.grads = {}
        # Initialize the weights and bias with random values
        self.params["w"] = np.random.randn(input_size, output_size)
        self.params["b"] = np.random.randn(output_size)

    def forward(self, inputs):
        """
        inputs shape is (batch_size, input_size)
        W shape is (input_size, output_size)
        b shape is (output_size)
        """
        self.inputs = inputs
        W = self.params["w"]
        b = self.params["b"]
        # Compute here the feed forward pass
        return inputs @ W + b
        
    def backward(self, grad):
        """
        grad shape is (batch_size, output_size)
        return shape is (batch_size, input_size)
        gradW shape is the same as W shape
        gradb shape is the same as b shape
        """
        X = self.inputs
        W = self.params["w"]
        # Compute here the gradient parameters for the layer
        self.grads["w"] = X.T @ grad
        self.grads["b"] = np.sum(grad, axis=0)
        # Compute here the feed backward pass
        return grad @ W.T

### Activation layers

### Exercise #3 - tanh

***5 min*** - *Implement the hyperbolic tangent and sigmoid layers and their derivatives.*

Look for the definitions in the lecture.


In [None]:
class Tanh:
    def __init__(self):
        self.params = {}
        self.grads = {}

    def forward(self, inputs):
        self.inputs = inputs
        return np.tanh(inputs)
    
    def backward(self, gradients):
        f_prime = 1 - np.tanh(self.inputs) ** 2
        return f_prime * gradients
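
The exercise also mentions a sigmoid layer. A possible sketch, following the same interface as `Tanh`, is given below; the class name `Sigmoid` and the helper `sigmoid` are assumptions, and the derivative used is $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$. The last lines compare the backward pass with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    # Logistic function 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

class Sigmoid:
    """Hypothetical sigmoid activation layer, mirroring the Tanh interface."""
    def __init__(self):
        self.params = {}
        self.grads = {}

    def forward(self, inputs):
        self.inputs = inputs
        return sigmoid(inputs)

    def backward(self, gradients):
        s = sigmoid(self.inputs)
        f_prime = s * (1 - s)
        return f_prime * gradients

layer = Sigmoid()
z_demo = np.array([[-2.0, 0.0, 3.0]])      # shape (batch_size=1, 3)
out = layer.forward(z_demo)
back = layer.backward(np.ones_like(z_demo))

# Finite-difference estimate of the derivative, for comparison
eps = 1e-6
numerical = (sigmoid(z_demo + eps) - sigmoid(z_demo - eps)) / (2 * eps)
print(out.shape)
print(np.allclose(back, numerical, atol=1e-6))
```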

## Neural network

A neural net is a collection of layers. It takes care of sequentially calling the layers' `forward` and `backward` methods in the right order.

In addition, it exposes its list of `layers`, whose `params` and `grads` dictionaries will be used by the optimizer to update the values of the weights and biases of each layer.

In [None]:
class NeuralNet:
    def __init__(self):
        self.layers = []
        self.loss = None
        self.optimizer = None
    
    def add(self, layer):
        self.layers.append(layer)
        
    def compile(self, loss, optimizer):
        self.loss = loss
        self.optimizer = optimizer
        
    def predict(self, inputs):
        """
        The forward pass takes the layers in order
        """
        for layer in self.layers:
            inputs = layer.forward(inputs)
        return inputs

    def backprop(self, grad):
        """sequential gradient computation and backward pass to the next layer"""
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
        return grad

## Optimizer

The role of the optimizer is to adjust the network parameters (weights and biases of the linear layers here) based on the gradients computed during backpropagation.

The main attribute of an optimizer is the _learning rate_ (a.k.a. `lr`), which defines the size of the jump taken in the direction of the gradients. 

### Exercise #4 - Stochastic Gradient Descent

***5 min*** - write the optimizer step

Here we have a very basic implementation of a _Stochastic Gradient Descent_ (a.k.a. `SGD`). 

The step that needs to be written iterates over the neural network layers and updates the layers' parameters in the direction _opposite_ to the gradient.
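
To illustrate the update rule on its own, here is a minimal sketch on a toy problem that is not part of the tutorial: minimising $C(\mathbf{w}) = \sum (\mathbf{w} - \mathbf{t})^2$ for a fixed target $\mathbf{t}$. The dictionaries `toy_params` and `toy_grads` only mimic the `params` and `grads` attributes of a layer; their names and the values used are assumptions.

```python
import numpy as np

# Toy problem (assumption for illustration): minimise C(w) = sum((w - target)^2)
target = np.array([1.0, -2.0, 0.5])
toy_params = {"w": np.zeros(3)}
toy_grads = {}
lr = 0.1

for _ in range(100):
    # Gradient of the toy cost with respect to w
    toy_grads["w"] = 2 * (toy_params["w"] - target)
    for name, param in toy_params.items():
        # In-place update: modifies the array stored in the dictionary
        param -= lr * toy_grads[name]

print(toy_params["w"])
```

Note that `param -= ...` updates the numpy array in place, so the value stored in the dictionary changes. Writing `param = param - ...` instead would only rebind the local name and leave the stored parameters untouched.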

In [None]:
class SGD:
    def __init__(self, lr=0.01) -> None:
        self.lr = lr

    def step(self, layers):
        for layer in layers:
            for name, param in layer.params.items():
                grad = layer.grads[name]
                param -= self.lr * grad
                

In [None]:
# Add update method to the NeuralNet class to take a step
def update(self):
    self.optimizer.step(self.layers)               

NeuralNet.update = update

## Batch generator

It can be costly to compute the gradients and update the weights after every entry of the training dataset. In order to reduce this computational cost, the inputs of the network are traditionally fed in batches and the gradients are thus accumulated over those batches of data (summed here, since the loss defined above is a sum; averaged in many other implementations).

A batch size of 32 is a common default. A [study](https://arxiv.org/abs/1804.07612) reports that small batch sizes, up to about 32, tend to give the best training stability and generalisation performance.

During an epoch the network will iterate over the whole dataset. Adding some shuffling in the process ensures the batches are not fed exactly in the same order at each epoch.

In [None]:
class BatchIterator:
    def __init__(self, batch_size=32, shuffle=True):
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __call__(self, inputs, targets):
        starts = np.arange(0, len(inputs), self.batch_size)
        if self.shuffle:
            np.random.shuffle(starts)

        for start in starts:
            end = start + self.batch_size
            batch_inputs = inputs[start:end]
            batch_targets = targets[start:end]
            yield batch_inputs, batch_targets
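
The behaviour of the iterator can be inspected on a tiny dataset. The sketch below re-declares a compact function version of the same logic so that it runs on its own; the function name `iterate_batches`, the `seed` argument and the toy arrays are assumptions. It shows that the last batch may be smaller than `batch_size` and that every sample is visited once per pass.

```python
import numpy as np

def iterate_batches(inputs, targets, batch_size, shuffle=True, seed=0):
    # Same logic as BatchIterator: only the order of the batches is shuffled
    rng = np.random.default_rng(seed)
    starts = np.arange(0, len(inputs), batch_size)
    if shuffle:
        rng.shuffle(starts)
    for start in starts:
        end = start + batch_size
        yield inputs[start:end], targets[start:end]

toy_inputs = np.arange(10).reshape(10, 1)
toy_targets = toy_inputs * 2

batches = list(iterate_batches(toy_inputs, toy_targets, batch_size=4))
sizes = [len(batch_x) for batch_x, _ in batches]
seen = sorted(int(v) for batch_x, _ in batches for v in batch_x.ravel())

print(sizes)   # two batches of 4 and one of 2, in shuffled order
print(seen)    # every sample appears exactly once
```

Samples inside a batch keep their original order, since only the batch start indices are shuffled.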

## Training

The training routine uses all objects defined above and executes actions **in the right order** to train the neural network.

Since the dataset is usually small with respect to the number of free parameters of the neural net, going through the dataset multiple times during the training is a necessity. The number of passes over the dataset is the number of epochs, which has to be specified.

### Exercise #5 - build the training routine

***10 min*** - write the sequential steps needed for training at each epoch

_Hints_:
- feed forward
- compute the loss and the gradients
- feed backwards
- update the net

In [None]:
def train(self, inputs, targets, batch_size=32, epochs=2000):
    iterator = BatchIterator(batch_size=batch_size)
    for epoch in range(epochs):
        epoch_loss = 0.0
        for (batch_inputs, batch_targets) in iterator(inputs, targets):
            X = batch_inputs
            y_true = batch_targets
            # Compute the predictions of the current network
            y_predicted = self.predict(X)
            # Compute the loss
            epoch_loss += self.loss.loss(y_predicted, y_true)
            # Compute the gradient of the loss
            grad = self.loss.grad(y_predicted, y_true)
            # Backpropagate the gradients
            self.backprop(grad)
            # Update the network
            self.update()
            
        # Print status every 100 epochs
        if epoch % 100 == 0:
            print(epoch, epoch_loss)

In [None]:
# Add the function to the NeuralNet class
NeuralNet.fit = train

## Application exercise

Now that you have built your own neural network library, let's put it to use on a concrete problem.

### XOR

XOR is a canonical problem in ML, as there is no linear way to map the inputs to the output.

```
[0, 0] => 0  
[0, 1] => 1  
[1, 0] => 1  
[1, 1] => 0  
```

Because of the extremely small size of the dataset, we will **forget** about the prescriptions on _training, validation and test sets_ for this example, which **you shouldn't do in practice**. 

In [None]:
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
y = np.array([[0], [1], [1], [0]])

In [None]:
def print_xor_results(net, inputs, targets):
    predictions = net.predict(inputs)
    print('\nX => y => y_pred => round(y_pred)')
    for a, b, c in zip(inputs, targets, predictions):
        print(f'{a} => {b} => {c} => {c.round()}')

To help visualise the final decisions to which the network has converged, the decision contours can be drawn from it using a grid of parameters.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

def plot_decision_contours(network, bounds=(0, 1, 0, 1)):
    # Create an array of points to plot the decision regions
    x_min, x_max, y_min, y_max = bounds
    rows, cols = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    X_grid = np.c_[rows.ravel(), cols.ravel()]
    # Apply the decision function on the two vectors
    values = network.predict(X_grid)
    # Reshape the array to recover the squared shape
    values = values.reshape(rows.shape)
    
    plt.figure(figsize=(5, 5))
    # Plot decision region
    plt.pcolormesh(rows, cols, values > 0.5, 
                   cmap='Paired')
    plt.grid(False)
    # Plot decision boundaries
    plt.contour(rows, cols, values, 
                levels=[.25, .5, .75],
                colors=['k', 'k', 'k'], 
                linestyles=['--', '-', '--'])
    
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)

Here is an attempt at solving the XOR problem using a single linear layer:

In [None]:
# Initialize loss and optimizer
sgd = SGD(lr=0.05)
mse = MeanSquareError()

# Create empty neural network
net1 = NeuralNet()
# Add layers
net1.add(LinearLayer(input_size=2, output_size=1))
# Add loss and optimizer
net1.compile(loss=mse, optimizer=sgd)
# Train the model
net1.fit(X, y, batch_size=32, epochs=2000)

print_xor_results(net1, X, y)

A single linear layer does not work, as expected. XOR is a typical non-linear problem.
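
This failure does not come from the training itself. As a sketch independent of the library, the best possible linear fit (in the least-squares sense) can be computed directly with `np.linalg.lstsq`; the variable names below are assumptions. The optimal linear model predicts 0.5 for all four inputs, so no choice of weights can separate the two classes.

```python
import numpy as np

X_xor = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so that the last coefficient acts as the bias
A = np.c_[X_xor, np.ones(len(X_xor))]
coeffs, *_ = np.linalg.lstsq(A, y_xor, rcond=None)
best_linear_predictions = A @ coeffs

print(coeffs)                    # weights close to 0, bias close to 0.5
print(best_linear_predictions)   # close to 0.5 for every input
```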

Let's have a look at the decision contours of the optimised net.

In [None]:
plot_decision_contours(net1)

### Exercise #6 - solve XOR with a Neural Net

***5 min*** - Write a more advanced neural net (using additional linear and activation layers) until the predictions match the target values.

In [None]:
sgd = SGD(lr=0.01)
mse = MeanSquareError()

net2 = NeuralNet()
net2.add(LinearLayer(input_size=2, output_size=4))
net2.add(Tanh())
net2.add(LinearLayer(input_size=4, output_size=1))

net2.compile(loss=mse, optimizer=sgd)

net2.fit(X, y, batch_size=32, epochs=2000)

print_xor_results(net2, X, y)

In [None]:
plot_decision_contours(net2)

### Exercise #7 - write the same model using Keras

***10 min*** - Based on the `Keras` examples given in the [lecture](https://aboucaud.github.io/slides/2019/neural-networks-asterics), as well as the section on loss and optimizers, solve the XOR problem using `Keras` methods.

In [None]:
# Write down the keras model below
#---------------------------------
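# Start with the necessary imports
from keras.models import Sequential
from keras.layers import Dense
# The optimizer is selected by name ('sgd') below, so `keras.optimizers.SGD`
# is not imported: doing so would shadow the `SGD` class defined earlier.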

# Write the model architecture
model = Sequential()
model.add(Dense(4, input_dim=2, activation='tanh'))
model.add(Dense(1))

# Compile the model
model.compile(loss='mse', optimizer='sgd')

# Train the model (no validation_split required here)
model.fit(X, y, epochs=2000, verbose=0)
#---------------------------------

# Once trained, this will then predict the values (equivalent of our `NeuralNet.predict()`)
y_pred_keras = model.predict(X)

# And print the results
print_xor_results(model, X, y)

In [None]:
plot_decision_contours(model)

## Acknowledgements

The idea and the code for this tutorial have been for the most part inspired by the video "Deep Learning Madness" https://youtu.be/o64FV-ez6Gw by [Joel Grus](https://twitter.com/joelgrus)