# Cooking a simple neural network library

## Ingredients

- `numpy`
- [a loss function](#Loss-function)
- [some layers](#Layers)
- [a neural net](#Neural-network)
- [an optimizer](#Optimizer)
- [a batch data provider](#Batch-generator)
- [a training routine](#Training)
- \+ [application exercises](#Application-exercise)

Hopefully by the end of this tutorial you will have an understanding of the building blocks needed for training (deep) neural networks. 

## Foreword

We will rely purely on numpy for this tutorial. Make sure to import it here.

In [None]:
import numpy as np

#### Object Oriented Python

Object-oriented Python, a.k.a. _classes_, will be used intensively in this tutorial.  
For those not familiar with Python classes, you will only need to write some definitions and Python code **within the** class **methods**, and **not** actually **write any class**.  

If you want to know more about Python classes, here is a step by step [tutorial](https://aboucaud.github.io/slides/2016/python-classes).

#### Type hints

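This notebook uses a feature from Python 3.5+ called ***type hints*** or ***type annotations*** (see [PEP 484](https://www.python.org/dev/peps/pep-0484/); the variable annotations used below, such as `self.params: Dict[str, Tensor] = {}`, come from [PEP 0526](https://www.python.org/dev/peps/pep-0526/) and need Python 3.6+). Type hints act like optional static typing, since Python will still run if a type does not match, but they have two main advantages IMO: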
- make sure you understand what you're doing
- act like documentation for an external user

The types for the base Python objects (lists, dicts, iterables) can be found in the [`typing` library](https://docs.python.org/3/library/typing.html).
For instance, here are all the needed imports for this tutorial.

In [None]:
from typing import (Dict, Tuple, Callable, 
                    Sequence, Iterator, NamedTuple)

Any other Python object can serve as a type. We will use the `numpy.ndarray` in this tutorial to mock a tensor. We thus create a `Tensor` object to use as type hint throughout the code, and an object `Func` for a function that acts element-wise on a tensor and returns a tensor.

In [None]:
from numpy import ndarray as Tensor

Func = Callable[[Tensor], Tensor]
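
As a small illustration of how these aliases read in a signature, the sketch below annotates an element-wise function and passes it where a `Func` is expected. The names `relu` and `apply` are hypothetical and are not part of the tutorial.

```python
from typing import Callable

import numpy as np
from numpy import ndarray as Tensor

Func = Callable[[Tensor], Tensor]


def relu(x: Tensor) -> Tensor:
    # Hypothetical element-wise function: keeps positives, zeroes the rest
    return np.maximum(x, 0.0)


def apply(f: Func, x: Tensor) -> Tensor:
    # Any function taking a Tensor and returning a Tensor matches `Func`
    return f(x)


out = apply(relu, np.array([-1.0, 0.5, 2.0]))
print(out, out.shape)
```

The output has the same shape as the input, which is what "element-wise" means for a `Func`.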

#### Checking the type

Although type hints are optional in Python, they can still be used to check the consistency of the code. A module called [`mypy`](https://github.com/python/mypy) performs this check (it is not used in this tutorial). 

Check out the [doc](http://mypy-lang.org/) if you are interested.

---

## Loss function

A loss function measures how good our predictions are, compared to the expected values. It is the cost function that needs to be minimised. The loss function must be differentiable, as its gradient is needed to adjust the parameters of the network through backpropagation.

Below is a generic loss class. It implements 
- `loss()` : **the loss** computed from the expected label and the predicted one,
- `grad()` : **the gradient of the loss**, needed for the backpropagation.

In [None]:
class Loss:
    def loss(self, predicted: Tensor, actual: Tensor) -> float:
        raise NotImplementedError

    def grad(self, predicted: Tensor, actual: Tensor) -> Tensor:
        raise NotImplementedError

### Exercise #1 - mean square error loss

***5 min*** - *Implement the `MeanSquareError` class*

In this exercise, `predicted` and `actual` are vectors (1-d numpy arrays).  
You must implement both the loss and its derivative using these two vectors.

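For info, the mean square error loss function is defined here as the sum of the squared differences (the $1/N$ factor of a true mean is left out, which only rescales the loss and its gradient):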

$$MSE(y_{true}, y_{pred}) = \sum \left(y_{pred} - y_{true}\right) ^ 2$$



In [None]:
class MeanSquareError(Loss):
    def loss(self, predicted: Tensor, actual: Tensor) -> float:
        return ...
    
    def grad(self, predicted: Tensor, actual: Tensor) -> Tensor:
        return ...
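
One way to check an implementation of `grad()` against `loss()` is a finite-difference comparison. The sketch below is an assumption of how such a check could look: `numerical_grad` and the stand-in `AbsoluteError` loss are hypothetical helpers, not part of the tutorial, and the stand-in is deliberately a different loss from the one in the exercise.

```python
import numpy as np
from numpy import ndarray as Tensor


class Loss:
    def loss(self, predicted: Tensor, actual: Tensor) -> float:
        raise NotImplementedError

    def grad(self, predicted: Tensor, actual: Tensor) -> Tensor:
        raise NotImplementedError


class AbsoluteError(Loss):
    # Hypothetical stand-in loss: sum of absolute differences
    def loss(self, predicted: Tensor, actual: Tensor) -> float:
        return float(np.sum(np.abs(predicted - actual)))

    def grad(self, predicted: Tensor, actual: Tensor) -> Tensor:
        return np.sign(predicted - actual)


def numerical_grad(loss: Loss, predicted: Tensor, actual: Tensor,
                   eps: float = 1e-6) -> Tensor:
    # Central finite differences, one component of `predicted` at a time
    grad = np.zeros_like(predicted, dtype=float)
    for i in range(predicted.size):
        step = np.zeros_like(predicted, dtype=float)
        step.flat[i] = eps
        plus = loss.loss(predicted + step, actual)
        minus = loss.loss(predicted - step, actual)
        grad.flat[i] = (plus - minus) / (2 * eps)
    return grad


predicted = np.array([0.2, 1.5, -0.7])
actual = np.array([0.0, 1.0, 1.0])
loss_fn = AbsoluteError()
gradients_match = np.allclose(loss_fn.grad(predicted, actual),
                              numerical_grad(loss_fn, predicted, actual),
                              atol=1e-4)
print(gradients_match)
```

The same helper can be pointed at any other `Loss` subclass: if the analytical and numerical gradients disagree, either `loss()` or `grad()` is likely wrong.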

## Layers


### Neuron

A neuron $\mathscr{N}$ has multiple input values (vector $\mathbf{x}$ of size $m$) and a single output $z$. Each neuron is characterised by its weights (vector $\mathbf{w}$ of size $m$) and a constant bias $b$, and performs the linear operation

$$\begin{aligned} 
\mathscr{N}_{\mathbf{w}, b}(\mathbf{x}) &= \sum_i w_i \cdot x_i + b \\
                                   &= \begin{bmatrix} w_{0} & \cdots & w_{m-1}\end{bmatrix} \begin{bmatrix} x_0 \\ \vdots \\ x_{m-1} \end{bmatrix} + b\\
                                   &= \mathbf{w}^T\mathbf{x} + b \\
                                   &= z
\end{aligned}$$
where $m$ is the input size and $z$ is the single output value.
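
The operation of a single neuron can be sketched in numpy as follows. The values of `w`, `x` and `b` are made up for the illustration.

```python
import numpy as np

# Hypothetical values for a neuron with m = 3 inputs
w = np.array([0.5, -1.0, 2.0])   # weights, size m
x = np.array([1.0, 2.0, 3.0])    # inputs, size m
b = 0.1                          # bias

z_sum = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b   # explicit sum
z_dot = w @ x + b                                      # w^T x + b

print(z_sum, z_dot)
```

Both expressions give the same scalar, the second one being the vectorised form used in the rest of the tutorial.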


### Linear layer

A linear layer $\mathscr{L}$ is a set of neurons, and can therefore be represented by a matrix of weights $\mathbf{W}$ and a vector of constants $\mathbf{b}$.  
For a layer of $n$ neurons, the matrix $\mathbf{W}$ has shape $(m,n)$ and the vector $\mathbf{b}$ is of size $n$.  

The operation performed by the layer on the input vector $\mathbf{x}$ of size $m$ is

$$\begin{aligned} 
\mathscr{L}_{\mathbf{W}, \mathbf{b}}(\mathbf{x}) 
    &= \begin{bmatrix} \sum_i W_{i, 0} \cdot x_i + b_0 \\ \vdots \\  \sum_i W_{i, n-1} \cdot x_i + b_{n-1} \end{bmatrix} \\
    &= \begin{bmatrix} W_{0,0} & \cdots & W_{m-1,0} \\ \vdots & & \vdots \\ W_{0,n-1} & \cdots & W_{m-1,n-1} \end{bmatrix} \cdot \begin{bmatrix} x_0 \\ \vdots \\ x_{m-1} \end{bmatrix} + \begin{bmatrix} b_0 \\ \vdots \\ b_{n-1} \end{bmatrix} \\
    &= \mathbf{W}^T\mathbf{x} + \mathbf{b} \\
    &= \mathbf{z}
\end{aligned}$$

which is a matrix multiplication and an addition, that produces an output vector $\mathbf{z}$ of size $n$.
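
To make the shapes concrete, here is a minimal sketch for a single input vector, with made-up sizes `m = 3` and `n = 2` and random values. It compares the matrix form with a neuron-by-neuron computation.

```python
import numpy as np

m, n = 3, 2                       # input size, number of neurons
rng = np.random.default_rng(0)    # seeded for reproducibility
W = rng.standard_normal((m, n))   # one column of weights per neuron
b = rng.standard_normal(n)
x = rng.standard_normal(m)

z = W.T @ x + b                   # matrix form, shape (n,)

# Same result computed neuron by neuron
z_per_neuron = np.array([W[:, j] @ x + b[j] for j in range(n)])

print(z.shape, np.allclose(z, z_per_neuron))
```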

### Activation layer

After the forward pass of a linear layer, there may be an activation layer whose role is to break the linearity of the network. The so-called activation layer is thus a non-linear function $f$ acting on the output $\mathbf{z}$ of the linear layer. 

$$\begin{aligned}
\mathbf{a} &= f(\mathbf{z}) \\
           &= f(\mathbf{W}^T\mathbf{x} + \mathbf{b})
\end{aligned}$$

The activation layer preserves the shape of its input.
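
Why the linearity needs breaking can be seen on a small example: without an activation in between, two linear layers collapse into a single one. The sketch below uses made-up shapes and random values to check this numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((4, 2)), rng.standard_normal(2)
x = rng.standard_normal(3)

# Two linear layers applied one after the other
stacked = W2.T @ (W1.T @ x + b1) + b2

# The same map written as a single linear layer
W, b = W1 @ W2, W2.T @ b1 + b2
collapsed = W.T @ x + b

print(np.allclose(stacked, collapsed))
```

Stacking more linear layers therefore adds parameters but no expressive power, which is what the non-linear function $f$ is there to change.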

### Backpropagation (computing of the layer gradients)

For the backward pass, each layer receives from the following layer the gradient of the loss with respect to its own output, and passes on to the preceding layer the gradient with respect to its input.

The ***chain rule*** connects the loss $\mathscr{C}$ (for cost) to the weights and biases of layer $i$ and yields the following relations:

$$\begin{aligned}
\dfrac{\partial \mathscr{C}}{\partial \mathbf{W}^{i}} 
    &= \dfrac{\partial \mathscr{C}}{\partial \mathbf{a}^{i}} \cdot \dfrac{\partial \mathbf{a}^{i}}{\partial \mathbf{z}^{i}} \cdot \dfrac{\partial \mathbf{z}^{i}}{\partial \mathbf{W}^{i}} \\
    &= \mathbf{\nabla}\mathscr{C}^{i} \cdot f'(\mathbf{z}^{i}) \cdot \mathbf{x}^{i} \\
    &= \mathbf{\nabla_W^i}
\end{aligned}$$

and 

$$\begin{aligned}
\dfrac{\partial \mathscr{C}}{\partial \mathbf{b}^{i}} 
    &= \dfrac{\partial \mathscr{C}}{\partial \mathbf{a}^{i}} \cdot \dfrac{\partial \mathbf{a}^{i}}{\partial \mathbf{z}^{i}} \cdot \dfrac{\partial \mathbf{z}^{i}}{\partial \mathbf{b}^{i}} \\
    &= \mathbf{\nabla}\mathscr{C}^{i} \cdot f'(\mathbf{z}^{i}) \\
    &= \mathbf{\nabla_b^i}
\end{aligned}$$

where $\mathbf{\nabla}\mathscr{C}^{i}$ is the gradient vector of the loss propagated at layer $i$.
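
The `backward` method of a layer also has to return a gradient for the preceding layer. In the same loose notation, and as an assumption that follows from $\mathbf{z}^{i} = {\mathbf{W}^{i}}^T\mathbf{x}^{i} + \mathbf{b}^{i}$ rather than a formula stated above, the chain rule applied to the input $\mathbf{x}^{i}$ reads:

$$\begin{aligned}
\dfrac{\partial \mathscr{C}}{\partial \mathbf{x}^{i}} 
    &= \dfrac{\partial \mathscr{C}}{\partial \mathbf{a}^{i}} \cdot \dfrac{\partial \mathbf{a}^{i}}{\partial \mathbf{z}^{i}} \cdot \dfrac{\partial \mathbf{z}^{i}}{\partial \mathbf{x}^{i}} \\
    &= \mathbf{\nabla}\mathscr{C}^{i} \cdot f'(\mathbf{z}^{i}) \cdot \mathbf{W}^{i} \\
    &= \mathbf{\nabla}\mathscr{C}^{i-1}
\end{aligned}$$

Since $\mathbf{x}^{i}$ is the output $\mathbf{a}^{i-1}$ of the preceding layer, this quantity is the gradient that layer $i-1$ receives in turn.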

---

In this tutorial, we define sequential neural nets, made of one or more layers.
Each layer passes its inputs forward and propagates gradients backward.

For example, a neural net might look like `inputs -> Linear -> Tanh -> Linear -> output`

The base class for a layer has a dictionary to store parameters ($\mathbf{W}$, $\mathbf{b}$) and gradients ($\mathbf{\nabla_W}$, $\mathbf{\nabla_b}$) and implements a forward and a backward method.

In [None]:
class Layer:
    def __init__(self) -> None:
        self.params: Dict[str, Tensor] = {}
        self.grads: Dict[str, Tensor] = {}

    def forward(self, inputs: Tensor) -> Tensor:
        raise NotImplementedError

    def backward(self, grad: Tensor) -> Tensor:
        raise NotImplementedError
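
To illustrate the contract between `forward` and `backward`, here is a minimal sketch of a subclass. The `Scale` layer is hypothetical and has no trainable parameters, so `params` and `grads` stay empty.

```python
from typing import Dict

import numpy as np
from numpy import ndarray as Tensor


class Layer:
    def __init__(self) -> None:
        self.params: Dict[str, Tensor] = {}
        self.grads: Dict[str, Tensor] = {}

    def forward(self, inputs: Tensor) -> Tensor:
        raise NotImplementedError

    def backward(self, grad: Tensor) -> Tensor:
        raise NotImplementedError


class Scale(Layer):
    # Hypothetical layer with no trainable parameters: y = factor * x
    def __init__(self, factor: float) -> None:
        super().__init__()
        self.factor = factor

    def forward(self, inputs: Tensor) -> Tensor:
        return self.factor * inputs

    def backward(self, grad: Tensor) -> Tensor:
        # dy/dx = factor, so the incoming gradient is scaled the same way
        return self.factor * grad


layer = Scale(2.0)
out = layer.forward(np.array([[1.0, -3.0]]))
back = layer.backward(np.ones((1, 2)))
print(out, back)
```

`backward` takes a gradient shaped like the layer output and returns one shaped like the layer input, which is what lets layers be chained.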

### Exercise #2 - linear layer

***10 - 15 min*** - *Implement the `forward` and `backward` methods of the linear layer.*

The mathematical recap above is here to help you.  

Be aware that neural networks are generally trained in batches (see [batch generator part](#Batch-generator)), essentially in order to 
- save some forward / backward computing steps
- reduce the noise produced by extreme input vectors at the optimisation step.

We therefore introduce the concept of ***batch_size***, which is the number of inputs processed simultaneously.
For this reason, the input and output arrays of the layers are not actual vectors but **matrices** whose first dimension is the ***batch_size***.

Hints:
- matrix products can be written either with `np.dot(m1, m2)` or `m1 @ m2` with recent Python versions (3.5+)
- pay particular attention to the shape of the input and output matrices for the matrix product
- with the input vectors stored as the rows of `x`, $\mathbf{W}^T\mathbf{x}$ is written `x @ W`
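
The last hint can be checked on a small example. The sketch below uses made-up sizes and random values, and compares the batched product with the product computed one input vector at a time.

```python
import numpy as np

batch_size, input_size, output_size = 4, 3, 2
rng = np.random.default_rng(0)
x = rng.standard_normal((batch_size, input_size))   # one input vector per row
W = rng.standard_normal((input_size, output_size))

batched = x @ W                                      # (batch_size, output_size)
row_by_row = np.stack([W.T @ row for row in x])      # W^T x for each row

print(batched.shape, np.allclose(batched, row_by_row))
```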

In [None]:
class Linear(Layer):
    """
    Inputs are of size (batch_size, input_size)
    Outputs are of size (batch_size, output_size)
    """
    def __init__(self, input_size: int, output_size: int) -> None:
        # Inherit from base class Layer
        super().__init__()
        # Initialize the weights and bias with random values
        self.params["w"] = np.random.randn(input_size, output_size)
        self.params["b"] = np.random.randn(output_size)

    def forward(self, inputs: Tensor) -> Tensor:
        """
        inputs shape is (batch_size, input_size)
        """
        self.inputs = inputs
        # Compute here the feed forward pass
        return ... 
        

    def backward(self, grad: Tensor) -> Tensor:
        """
        grad shape is (batch_size, output_size)
        """
        # Compute here the gradient parameters for the layer
        self.grads["w"] = ...
        self.grads["b"] = ...  
        # Compute here the feed backward pass
        return ...             

### Activation layers

In [None]:
class Activation(Layer):
    """
    An activation layer just applies a function
    elementwise to its inputs
    """
    def __init__(self, f: Func, f_prime: Func) -> None:
        super().__init__()
        self.f = f
        self.f_prime = f_prime

    def forward(self, inputs: Tensor) -> Tensor:
        self.inputs = inputs
        return self.f(inputs)

    def backward(self, grad: Tensor) -> Tensor:
        """
        if y = f(x) and x = g(z)
        then dy/dz = f'(x) * g'(z)
        """
        return self.f_prime(self.inputs) * grad
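
As an illustration of how a function and its derivative plug into `Activation`, here is a sketch with a rectified linear unit. The names `relu`, `relu_prime` and `Relu` are hypothetical and are not among the layers asked for in the exercises. The two classes are copied from above to keep the block self-contained.

```python
from typing import Callable, Dict

import numpy as np
from numpy import ndarray as Tensor

Func = Callable[[Tensor], Tensor]


class Layer:
    def __init__(self) -> None:
        self.params: Dict[str, Tensor] = {}
        self.grads: Dict[str, Tensor] = {}


class Activation(Layer):
    def __init__(self, f: Func, f_prime: Func) -> None:
        super().__init__()
        self.f = f
        self.f_prime = f_prime

    def forward(self, inputs: Tensor) -> Tensor:
        self.inputs = inputs
        return self.f(inputs)

    def backward(self, grad: Tensor) -> Tensor:
        return self.f_prime(self.inputs) * grad


def relu(x: Tensor) -> Tensor:
    # Hypothetical activation: max(x, 0) element-wise
    return np.maximum(x, 0.0)


def relu_prime(x: Tensor) -> Tensor:
    # Derivative: 1 where x > 0, 0 elsewhere
    return (x > 0).astype(float)


class Relu(Activation):
    def __init__(self) -> None:
        super().__init__(relu, relu_prime)


layer = Relu()
inputs = np.array([[-1.0, 2.0], [3.0, -4.0]])
outputs = layer.forward(inputs)
grads = layer.backward(np.ones_like(inputs))
print(outputs, grads)
```

The gradient is let through only where the input was positive, and both arrays keep the shape of `inputs`.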

### Exercise #3 - tanh and sigmoid

***5 min*** - *Implement the hyperbolic tangent and sigmoid layers and their derivatives.*

Look for the definitions in the lecture.


In [None]:
def tanh(x: Tensor) -> Tensor:
    # Write here the tanh function
    return ...  

def tanh_prime(x: Tensor) -> Tensor:
    # Write here the derivative of the tanh
    return ...  

class Tanh(Activation):
    def __init__(self):
        super().__init__(tanh, tanh_prime)
        

def sigmoid(x: Tensor) -> Tensor:
    # Write here the sigmoid function
    return ...  

def sigmoid_prime(x: Tensor) -> Tensor:
    # Write here the derivative of the sigmoid
    return ...  

class Sigmoid(Activation):
    def __init__(self):
        super().__init__(sigmoid, sigmoid_prime)

## Neural network

A neural net is a collection of layers. It takes care of sequentially calling the layers' `forward` and `backward` methods in the right order.

In addition, it implements a getter method `params_and_grads` that will be used by the optimizer to update the values of the weights and bias of each layer.

In [None]:
class NeuralNet:
    def __init__(self, layers: Sequence[Layer]) -> None:
        self.layers = layers

    def forward(self, inputs: Tensor) -> Tensor:
        """
        The forward pass takes the layers in order
        """
        for layer in self.layers:
            inputs = layer.forward(inputs)
        return inputs

    def backward(self, grad: Tensor) -> Tensor:
        """
        The backward pass is the other way around
        """
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
        return grad

    def params_and_grads(self) -> Iterator[Tuple[Tensor, Tensor]]:
        for layer in self.layers:
            for name, param in layer.params.items():
                grad = layer.grads[name]
                yield param, grad
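
The order in which the layers are called can be made visible with a small sketch. `Recorder` is a hypothetical stand-in for a layer that passes data through unchanged and logs each call. The `NeuralNet` below is a trimmed copy of the class above.

```python
from typing import List, Sequence

import numpy as np
from numpy import ndarray as Tensor


class Recorder:
    # Hypothetical stand-in for a Layer: passes data through unchanged
    # and records the order in which its methods are called
    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log

    def forward(self, inputs: Tensor) -> Tensor:
        self.log.append(f"forward {self.name}")
        return inputs

    def backward(self, grad: Tensor) -> Tensor:
        self.log.append(f"backward {self.name}")
        return grad


class NeuralNet:
    # Trimmed copy of the class above (forward and backward only)
    def __init__(self, layers: Sequence[Recorder]) -> None:
        self.layers = layers

    def forward(self, inputs: Tensor) -> Tensor:
        for layer in self.layers:
            inputs = layer.forward(inputs)
        return inputs

    def backward(self, grad: Tensor) -> Tensor:
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
        return grad


log: List[str] = []
net = NeuralNet([Recorder("first", log), Recorder("second", log)])
net.forward(np.zeros((1, 2)))
net.backward(np.zeros((1, 2)))
print(log)
```

The log shows the forward calls in list order followed by the backward calls in reverse order.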

## Optimizer

The role of the optimizer is to adjust the network parameters (weights and biases of the linear layers here) based on the gradients computed during backpropagation.

The main attribute of an optimizer is the _learning rate_ (a.k.a. `lr`), which defines the size of the jump taken in the direction of the gradients. 
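
The effect of the learning rate can be seen on a one-parameter toy problem. The sketch below is hypothetical and independent of the classes in this tutorial: it minimises $(w - 3)^2$ by repeatedly jumping against the gradient, with three different values of `lr`.

```python
def descend(lr: float, steps: int = 20, start: float = 0.0) -> float:
    # Hypothetical 1-d problem: minimise (w - 3)^2, whose gradient is 2 (w - 3)
    w = start
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w = w - lr * grad   # jump against the gradient, scaled by lr
    return w


small = descend(lr=0.01)    # slow: still far from the minimum at w = 3
medium = descend(lr=0.1)    # gets close to the minimum
large = descend(lr=1.1)     # overshoots further at every step and diverges

print(small, medium, large)
```

A learning rate that is too small wastes steps, while one that is too large makes the jumps overshoot the minimum.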

In [None]:
class Optimizer:
    def step(self, net: NeuralNet) -> None:
        raise NotImplementedError

### Exercise #4 - Stochastic Gradient Descent

***5 min*** - write the optimizer step

Here we have a very basic implementation of a _Stochastic Gradient Descent_ (a.k.a. `SGD`). 

The step that needs to be written iterates over the neural network layers and updates the layers' parameters in the direction _opposite_ to the gradient.

In [None]:
class SGD(Optimizer):
    def __init__(self, lr: float = 0.01) -> None:
        self.lr = lr

    def step(self, net: NeuralNet) -> None:
        for param, grad in net.params_and_grads(): 
            # Write here the parameters update
            ...

## Batch generator

It can be costly to compute the gradients and update the weights after every entry of the training dataset. In order to minimize such computational cost, the inputs of the network are traditionally fed in batches and the gradients are thus averages over those batches of data.

A batch size of 32 is a common default. A [study](https://arxiv.org/abs/1804.07612) reports that small batches, in the range of 2 to 32, gave the best results in its experiments, which makes 32 a reasonable balance between computing efficiency and training stability.

During an epoch the network will iterate over the whole dataset. Adding some shuffling in the process ensures the batches are not fed exactly in the same order at each epoch.

In [None]:
Batch = NamedTuple("Batch", [("inputs", Tensor), ("targets", Tensor)])


class DataIterator:
    def __call__(self, inputs: Tensor, targets: Tensor) -> Iterator[Batch]:
        raise NotImplementedError

        
class BatchIterator(DataIterator):
    def __init__(self, batch_size: int = 32, shuffle: bool = True) -> None:
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __call__(self, inputs: Tensor, targets: Tensor) -> Iterator[Batch]:
        starts = np.arange(0, len(inputs), self.batch_size)
        if self.shuffle:
            np.random.shuffle(starts)

        for start in starts:
            end = start + self.batch_size
            batch_inputs = inputs[start:end]
            batch_targets = targets[start:end]
            yield Batch(batch_inputs, batch_targets)
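
Here is a short usage sketch of the iterator on a made-up dataset of 10 samples. The class is copied from above, without its base class, to keep the block self-contained.

```python
from typing import Iterator, NamedTuple

import numpy as np
from numpy import ndarray as Tensor

Batch = NamedTuple("Batch", [("inputs", Tensor), ("targets", Tensor)])


class BatchIterator:
    # Copy of the iterator above, without the DataIterator base class
    def __init__(self, batch_size: int = 32, shuffle: bool = True) -> None:
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __call__(self, inputs: Tensor, targets: Tensor) -> Iterator[Batch]:
        starts = np.arange(0, len(inputs), self.batch_size)
        if self.shuffle:
            np.random.shuffle(starts)

        for start in starts:
            end = start + self.batch_size
            yield Batch(inputs[start:end], targets[start:end])


# Made-up dataset: 10 samples with one feature each
inputs = np.arange(10).reshape(10, 1)
targets = inputs * 2

ordered = BatchIterator(batch_size=4, shuffle=False)
sizes = [len(batch.inputs) for batch in ordered(inputs, targets)]

shuffled = BatchIterator(batch_size=4, shuffle=True)
seen = np.concatenate([batch.inputs for batch in shuffled(inputs, targets)])

print(sizes, seen.ravel())
```

The last batch is smaller when the dataset size is not a multiple of `batch_size`. Note that shuffling changes the order of the batches, not which samples end up together in a batch.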

## Training

The training routine uses all objects defined above and executes actions **in the right order** to train the neural network.

Since the dataset is usually small compared with the number of free parameters of the neural net, the training needs to go through the dataset multiple times. Each full pass is called an epoch, and the number of epochs has to be specified.

### Exercise #5 - build the training routine

***10 min*** - write the sequential steps needed for training at each epoch

_Hints_:
- feed forward
- compute the loss and the gradients
- feed backwards
- update the net

In [None]:
def train(net: NeuralNet, inputs: Tensor, targets: Tensor,
          loss: Loss = MeanSquareError(), 
          optimizer: Optimizer = SGD(),
          iterator: DataIterator = BatchIterator(),
          num_epochs: int = 5000) -> None:
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        for batch in iterator(inputs, targets):
            # Write here the various steps (in order) needed 
            # for each batch
            ...
        # Print status every 50 epochs
        if epoch % 50 == 0:
            print(epoch, epoch_loss)

## Application exercise

Now that you have built your own neural network library, let's put it to use on a concrete problem.

### XOR

XOR is a canonical problem in ML, as there is no linear way to map the inputs to the output.

```
[0, 0] => 0  
[0, 1] => 1  
[1, 0] => 1  
[1, 1] => 0  
```

Because of the extremely small size of the dataset, we will **forget** about the prescriptions on _training, validation and test sets_ for this example, which **you shouldn't do in practice**. 

In [None]:
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
y = np.array([[0], [1], [1], [0]])

def print_xor_results(inputs: Tensor, targets: Tensor, predictions: Tensor) -> None:
    print('\nX => y => y_pred => round(y_pred)')
    for x, y, z in zip(inputs, targets, predictions):
        print(f'{x} => {y} => {z} => {z.round()}')
        
def train_xor(net: NeuralNet, inputs: Tensor, targets: Tensor, epochs: int = 2000) -> None:
    train(net, inputs, targets, num_epochs=epochs)
    predictions = net.forward(inputs)
    print_xor_results(inputs, targets, predictions)

Here is an attempt at solving the XOR problem using a single linear layer.

In [None]:
net1 = NeuralNet([
    Linear(input_size=2, output_size=1),
])

train_xor(net1, X, y)

This does not work, as expected, since XOR is a typical non-linear problem.
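
The limitation can be checked without any training. The sketch below is an independent illustration, not part of the library: it uses `np.linalg.lstsq` to find the linear map (weights plus bias) with the smallest squared error on the four XOR points.

```python
import numpy as np

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the last coefficient plays the role of the bias
A = np.hstack([X, np.ones((4, 1))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

best_linear_predictions = A @ coeffs
print(coeffs.round(3), best_linear_predictions.round(3))
```

The best linear fit has both weights at zero and a bias of 0.5, so it predicts 0.5 for all four inputs. No single linear layer can do better in terms of squared error.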

### Exercise #6 - solve XOR with a Neural Net

***5 min*** - Write a more advanced neural net (using additional linear and activation layers) until the predictions match the target values.

In [None]:
net2 = NeuralNet([
    # Add the layers here
    ...
])

train_xor(net2, X, y)

### Exercise #7 - write the same model using Keras

***10 min*** - Based on the `Keras` examples given in the lecture, as well as the section on loss and optimizers, solve the XOR problem using `Keras` methods.

In [None]:
# Write down the keras model below
#---------------------------------
# Start with the necessary imports

# Write the model architecture
model = ...

# Compile the model


# Train the model (no validation_split required here!)

#---------------------------------

# Once trained, this will then predict the values (equivalent of `.forward()`)
y_pred_keras = model.predict(X)

# And print the results
print_xor_results(X, y, y_pred_keras)

Up to you now.

## Acknowledgements

The idea and the code for this tutorial have been for the most part inspired by the video "Deep Learning Madness" https://youtu.be/o64FV-ez6Gw by [Joel Grus](https://twitter.com/joelgrus)