# TD 5

[Use PyTorch for all questions]

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import glob
import time
```


## Width vs Depth

Our goal here is to compare the performance of basic networks.
We will create both very wide and very deep networks, and see which ones do better.

We will try to fit a sequence of functions of increasing complexity, with both a wide and a deep network.
The first part concentrates on the minimal number of neurons needed to fit the function exactly with an optimal network (setting the weights and biases manually).
The second part then studies the training of the same kinds of networks to fit the same functions.

### Theory

Take the function of interest $f: \mathbb{R} \to \mathbb{R}$ to be piecewise linear, with 4 segments:
- $f(x) = 0$ on $\left] -\infty, 0 \right]$
- $f(x) = 2x$ on $\left] 0, \frac{1}{2} \right]$
- $f(x) = 2-2x$ on $\left] \frac{1}{2}, 1 \right]$
- $f(x) = 0$ on $\left] 1, \infty \right[$

Define $f : x \mapsto f(x)$ as a Python function, using NumPy.
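
A minimal sketch of one possible definition, mirroring the four segments above with nested `np.where` calls. The choice of `np.where` (rather than, say, a closed form) is only one option among several.

```python
import numpy as np

def f(x):
    """Piecewise-linear 'tent' function: 0 outside [0, 1], peak of 1 at x = 1/2."""
    x = np.asarray(x, dtype=float)
    return np.where(
        x <= 0.0, 0.0,                      # left flat segment
        np.where(
            x <= 0.5, 2.0 * x,              # rising segment
            np.where(x <= 1.0, 2.0 - 2.0 * x, 0.0),  # falling segment, then flat
        ),
    )
```

The function accepts scalars as well as arrays, which makes it convenient for plotting on a grid of points.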

Also let:
- $g(x, 1) = f(x)$
- $g(x, 2) = f \circ f(x)$
- $g(x, 3) = f \circ f \circ f(x)$
- $g(x, 4) = f \circ f \circ f \circ f(x)$
- etc.

Define $g : (x, l) \mapsto g(x, l)$ as a Python function for all $l \in \mathbb{N}^*$.
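
Since $g(x, l)$ is $f$ composed with itself $l$ times, a simple loop is enough. The sketch below redefines $f$ in a compact closed form, $\max(0, 1 - |2x - 1|)$, so that the block runs on its own; this closed form is an equivalent rewriting of the four segments, not something imposed by the exercise.

```python
import numpy as np

def f(x):
    """Tent function, written as max(0, 1 - |2x - 1|)."""
    x = np.asarray(x, dtype=float)
    return np.maximum(0.0, 1.0 - np.abs(2.0 * x - 1.0))

def g(x, level):
    """Apply f `level` times (level >= 1)."""
    y = np.asarray(x, dtype=float)
    for _ in range(level):
        y = f(y)
    return y
```

For instance, `g(0.25, 2)` evaluates $f(f(0.25)) = f(0.5) = 1$.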

Plot $f$ and $g$ on $\left] -0.2, 1.2 \right]$

Define a basic "rectangle" network class (the width is the same in all hidden layers);
leave the number of layers and the number of neurons per layer as parameters, and use the ReLU activation function.
The input and output are 1D, since we fit functions $\mathbb{R} \to \mathbb{R}$.

With 4 hidden layers and 5 neurons per layer, your network class should create a network as follows:

<img src="../images/rectangle_network.svg" alt="rectangle network diagram" style="width: 35em;"/>
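
One way such a class could look is sketched below. The class name `RectangleNet`, its argument names, and the use of `nn.Sequential` are choices made for this illustration only.

```python
import torch
import torch.nn as nn

class RectangleNet(nn.Module):
    """Fully connected ReLU network, 1D input and output, constant hidden width."""

    def __init__(self, n_hidden_layers: int, width: int):
        super().__init__()
        layers = []
        in_features = 1  # scalar input
        for _ in range(n_hidden_layers):
            layers.append(nn.Linear(in_features, width))
            layers.append(nn.ReLU())
            in_features = width
        layers.append(nn.Linear(in_features, 1))  # scalar output, no activation
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x is expected to have shape (batch_size, 1)
        return self.net(x)

model = RectangleNet(n_hidden_layers=4, width=5)
n_params = sum(p.numel() for p in model.parameters())
```

With 4 hidden layers of 5 neurons, this construction has 5 linear layers and 106 parameters in total.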

Implement $f$ with a (basic) rectangle network with 1 hidden layer of 3 neurons.
Set the weights yourself so that the network reproduces the function exactly.

*Hint:*
$f(x) = 2x_+ -4(x-\frac{1}{2})_+ +2(x-1)_+$
$\qquad \qquad$ (where $\alpha_+$ is $ReLU(\alpha)$)
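
The hint translates directly into weights: each hidden neuron computes one of the three ReLU terms, and the output layer combines them with coefficients $2, -4, 2$. The sketch below writes the network as a plain `nn.Sequential`, which is what a rectangle network with 1 hidden layer of 3 neurons amounts to; the variable names are illustrative.

```python
import torch
import torch.nn as nn

f_net = nn.Sequential(nn.Linear(1, 3), nn.ReLU(), nn.Linear(3, 1))

with torch.no_grad():
    # Hidden layer: ReLU(x), ReLU(x - 1/2), ReLU(x - 1)
    f_net[0].weight.copy_(torch.tensor([[1.0], [1.0], [1.0]]))
    f_net[0].bias.copy_(torch.tensor([0.0, -0.5, -1.0]))
    # Output layer: 2 * h1 - 4 * h2 + 2 * h3
    f_net[2].weight.copy_(torch.tensor([[2.0, -4.0, 2.0]]))
    f_net[2].bias.zero_()

x = torch.linspace(-0.2, 1.2, 141).unsqueeze(1)
y = f_net(x)
```

Comparing `y` with the NumPy definition of $f$ on the same grid is a quick way to check that the weights are right.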

Now, implement $g$ for `level = 4` by increasing the width (and keeping a single hidden layer).
Again, use a rectangle network and set the weights yourself.

*Hint: try to find the weights for `level = 2`, then `level = 3`, and deduce the pattern.*

How many neurons did you need in the hidden layer? How will that evolve when the level increases?

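We need $2^{\text{level}}+1$ neurons, one per breakpoint of $g$ (the breakpoints are at $k / 2^{\text{level}}$ for $k = 0, \dots, 2^{\text{level}}$), so the count grows exponentially with `level`.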
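
A possible construction is sketched below, assuming the pattern suggested by the hint: $g(\cdot, l)$ has slopes alternating between $+2^l$ and $-2^l$ on $[0, 1]$, so the output coefficients are $\pm 2 \cdot 2^l$ at interior breakpoints and $2^l$ at the two end breakpoints. The helper name `wide_g_network` is hypothetical.

```python
import torch
import torch.nn as nn

def wide_g_network(level):
    """Single-hidden-layer ReLU network reproducing g(., level)."""
    n = 2**level + 1                      # one neuron per breakpoint k / 2**level
    net = nn.Sequential(nn.Linear(1, n), nn.ReLU(), nn.Linear(n, 1))
    k = torch.arange(n, dtype=torch.float32)
    slope = float(2**level)
    signs = 1.0 - 2.0 * (k % 2)           # +1 for even k, -1 for odd k
    coeffs = 2.0 * slope * signs          # slope change at interior breakpoints
    coeffs[0] = slope                     # from slope 0 to +slope at x = 0
    coeffs[-1] = slope                    # from -slope back to 0 at x = 1
    with torch.no_grad():
        net[0].weight.fill_(1.0)
        net[0].bias.copy_(-k / (n - 1))   # hidden neuron k computes ReLU(x - k / 2**level)
        net[2].weight.copy_(coeffs.unsqueeze(0))
        net[2].bias.zero_()
    return net

wide_net = wide_g_network(4)              # 17 hidden neurons
```

For `level = 1` this reduces to the three-neuron network given by the hint for $f$.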

-----

Now, implement $g$ for `level = 4` by increasing the depth (and keeping 3 neurons per hidden layer).
Again, use a rectangle network and set the weights yourself.

*Hint: try to find the weights for `level = 2`, then `level = 3`, and deduce the pattern.*

How many neurons did you need in the hidden layers in total? How will that evolve when the level increases?

We need $3 \times \text{level}$ neurons ($3$ per hidden layer, with `level` hidden layers), so the count grows linearly with `level`.
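
A sketch of the deep construction, under the assumption that each hidden layer recomputes the three ReLU terms of the hint from the previous layer's output: the combination $2h_1 - 4h_2 + 2h_3$ is folded into the weights of the next layer. The helper name `deep_g_network` is hypothetical.

```python
import torch
import torch.nn as nn

def deep_g_network(level):
    """ReLU network with `level` hidden layers of 3 neurons reproducing g(., level)."""
    layers = [nn.Linear(1, 3), nn.ReLU()]
    for _ in range(level - 1):
        layers += [nn.Linear(3, 3), nn.ReLU()]
    layers.append(nn.Linear(3, 1))
    net = nn.Sequential(*layers)

    thresholds = torch.tensor([0.0, -0.5, -1.0])  # biases: ReLU(y), ReLU(y - 1/2), ReLU(y - 1)
    combine = torch.tensor([2.0, -4.0, 2.0])      # f(y) = 2 h1 - 4 h2 + 2 h3
    linears = [m for m in net if isinstance(m, nn.Linear)]
    with torch.no_grad():
        linears[0].weight.fill_(1.0)
        linears[0].bias.copy_(thresholds)
        for lin in linears[1:-1]:
            # Each row first rebuilds f(previous input), then shifts it by a threshold
            lin.weight.copy_(combine.repeat(3, 1))
            lin.bias.copy_(thresholds)
        linears[-1].weight.copy_(combine.unsqueeze(0))
        linears[-1].bias.zero_()
    return net

deep_net = deep_g_network(4)                      # 4 hidden layers, 12 hidden neurons
```

Every hidden layer uses the same weights, which is why the neuron count only grows by 3 per extra level.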

On a semilog-y plot, show the number of neurons used to replicate $g$ as a function of `level` (ranging from 1 to 15), for both the wide and the deep construction.

On a semilog-y plot, show the number of parameters used to replicate $g$ as a function of `level` (ranging from 1 to 15), for both the wide and the deep construction.
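
The counts can be written in closed form and plotted directly. The parameter formulas below are derived here by counting weights and biases layer by layer, for the single-hidden-layer and 3-neurons-per-layer constructions; they are worth double-checking against `sum(p.numel() for p in net.parameters())` on your own networks.

```python
import numpy as np
import matplotlib.pyplot as plt

levels = np.arange(1, 16)

# Hidden neurons
wide_neurons = 2**levels + 1
deep_neurons = 3 * levels

# Parameters (weights + biases)
# Wide: n input weights + n hidden biases + n output weights + 1 output bias
wide_params = 3 * wide_neurons + 1
# Deep: first layer 6, each extra hidden layer 9 + 3 = 12, output layer 4
deep_params = 6 + 12 * (levels - 1) + 4

fig, (ax_n, ax_p) = plt.subplots(1, 2, figsize=(10, 4))
ax_n.semilogy(levels, wide_neurons, "o-", label="wide")
ax_n.semilogy(levels, deep_neurons, "s-", label="deep")
ax_n.set_xlabel("level")
ax_n.set_ylabel("number of hidden neurons")
ax_n.legend()

ax_p.semilogy(levels, wide_params, "o-", label="wide")
ax_p.semilogy(levels, deep_params, "s-", label="deep")
ax_p.set_xlabel("level")
ax_p.set_ylabel("number of parameters")
ax_p.legend()
fig.tight_layout()
```

On a logarithmic vertical axis, the wide curves appear as straight lines (exponential growth) while the deep curves flatten out (linear growth).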

### Training

We will try to fit `g` with `level = 4` with both deep and wide networks, this time by training the networks instead of setting the weights manually.

First, train a wide network. Of course, it will need more neurons than the mathematically optimal solution.
Try with 5 times more neurons than in the optimal solution, and about $15000$ epochs.

*Convergence is not reached, but given a little more time, it should converge nicely towards the solution.*
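
A minimal training sketch for the wide network is given below. The optimizer (Adam), learning rate, MSE loss, full-batch updates, and the 1000-point grid on $[-0.2, 1.2]$ are all assumptions made for this illustration, and the number of epochs is kept small so that the block runs quickly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

def target_g(x, level=4):
    """g(., level) in PyTorch, using f(y) = max(0, 1 - |2y - 1|)."""
    y = x
    for _ in range(level):
        y = torch.clamp(1.0 - torch.abs(2.0 * y - 1.0), min=0.0)
    return y

def train(net, n_epochs, lr=1e-3, n_points=1000):
    """Full-batch training on a fixed grid; returns the loss history."""
    x = torch.linspace(-0.2, 1.2, n_points).unsqueeze(1)
    y = target_g(x)
    optimizer = optim.Adam(net.parameters(), lr=lr)
    losses = []
    for _ in range(n_epochs):
        optimizer.zero_grad()
        loss = F.mse_loss(net(x), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

torch.manual_seed(0)
width = 5 * (2**4 + 1)   # 5 times the 17 neurons of the exact solution
wide_net = nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, 1))
n_epochs = 200           # short demo run; the text suggests about 15000
losses = train(wide_net, n_epochs)
```

Plotting `losses` with `plt.semilogy` gives a quick view of how far the training is from convergence.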

Now, train a deep network. Again, give a little slack on the number of neurons compared to the optimal solution.
Try with 10 times more layers than necessary, 15 neurons per layer instead of 3, and about $5000$ epochs.

- *Notice that the training takes much more time than with the wide network, even though we run fewer epochs.*
- *Observe also that there is no sign of convergence; try to explain this.*
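
One way to start investigating the second point is to look at the network right after initialization: how much its output varies over the input range, and how large the gradients are in each layer after a single backward pass. The sketch below only gathers these numbers, assuming PyTorch's default initialization and an MSE loss; it is a diagnostic, not an explanation in itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
depth, width = 40, 15    # 10 times the 4 layers needed, 15 neurons instead of 3

layers = [nn.Linear(1, width), nn.ReLU()]
for _ in range(depth - 1):
    layers += [nn.Linear(width, width), nn.ReLU()]
layers.append(nn.Linear(width, 1))
deep_net = nn.Sequential(*layers)

# Target g(., 4) on a grid, using f(y) = max(0, 1 - |2y - 1|)
x = torch.linspace(-0.2, 1.2, 1000).unsqueeze(1)
y = x
for _ in range(4):
    y = torch.clamp(1.0 - torch.abs(2.0 * y - 1.0), min=0.0)

pred = deep_net(x)
loss = F.mse_loss(pred, y)
loss.backward()

# How much does the untrained output depend on x at all?
output_spread = (pred.max() - pred.min()).item()
# Gradient norm of each linear layer's weights, from input side to output side
grad_norms = [m.weight.grad.norm().item() for m in deep_net if isinstance(m, nn.Linear)]

print(f"output spread at initialization: {output_spread:.3e}")
print("first / last layer gradient norm:", grad_norms[0], grad_norms[-1])
```

Comparing the gradient norms of the first and last layers, and repeating the measurement for a shallower network, can help form a hypothesis about why the deep network is hard to train.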

**Conclusion:**

The number of neurons is a good measure of the size (in terms of memory) of your network.
For the same number of neurons, deep networks can capture more complexity, but are harder to train.
Conversely, wide networks capture less complexity, but are easier to train.

*Your goal as an AI engineer is to find the best architectures, so that your networks are both trainable and able to capture the complexity of the observed phenomenon.*