# Deep L-Layer Neural Network

In this section, the shallow neural network model from the previous sections is generalized to a deep network with $L$ layers. Technically, logistic regression is a very shallow (one-layer) neural network, whereas a network with many layers is considered deep. The larger $L$ is, the deeper the network is said to be.

## Notations

Below is a review of the earlier notation together with the new notation introduced for deep $L$-layer networks.

- $L$ refers to the number of layers.
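- $n^{[l]}$ refers to the number of units in layer $l$.
- $a^{[l]}$ denotes the activations (outputs) of layer $l$. $a^{[0]}$ denotes the inputs.
- $W^{[l]}$ represents the weights of layer $l$.
- $b^{[l]}$ represents the biases of layer $l$.
- $z^{[l]}$ denotes the linear output of layer $l$, before the activation function is applied.
- $g^{[l]}$ denotes the activation function of layer $l$.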

### The general formula for the forward pass

$$z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$$

$$a^{[l]} = g^{[l]}(z^{[l]})$$
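
The two formulas above can be applied layer by layer in a loop. The following is a minimal NumPy sketch, not a reference implementation: the function names (`relu`, `sigmoid`, `forward_pass`), the dictionary keys (`"W1"`, `"b1"`, ...), the layer sizes, and the choice of ReLU for hidden layers with a sigmoid output are all assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(X, parameters):
    """Apply z = W a_prev + b and a = g(z) for l = 1..L.

    `parameters` is a hypothetical dict with keys "W1", "b1", ..., "WL", "bL".
    """
    L = len(parameters) // 2          # one W and one b per layer
    a = X                             # a^{[0]} is the input
    for l in range(1, L + 1):
        W = parameters[f"W{l}"]
        b = parameters[f"b{l}"]
        z = W @ a + b                 # z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}
        # Assumed activations: ReLU in hidden layers, sigmoid in the output layer.
        a = sigmoid(z) if l == L else relu(z)
    return a

# Illustrative sizes: n^{[0]} = 3 inputs, then layers of 4, 2 and 1 units.
rng = np.random.default_rng(0)
layer_sizes = [3, 4, 2, 1]
parameters = {}
for l in range(1, len(layer_sizes)):
    parameters[f"W{l}"] = rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) * 0.01
    parameters[f"b{l}"] = np.zeros((layer_sizes[l], 1))

X = rng.standard_normal((3, 5))       # 5 examples stacked as columns
A_L = forward_pass(X, parameters)
print(A_L.shape)                      # (1, 5)
```

The explicit `for` loop over layers is intentional here: the computation of layer $l$ depends on the output of layer $l-1$, so the layers have to be processed in order.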

# L-layer network dimensions

For an $L$-layer network, the dimensions of the variables and parameters are as follows.

| Variable               | Dimension (single example) | Dimension (on $m$ training examples) | Notes                                                                                                                    |
| :--------------------: | :------------------------: | :----------------------------------: | :----------------------------------------------------------------------------------------------------------------------: |
| $$W^{[l]}$$            | $(n^{[l]}, n^{[l-1]})$     | $(n^{[l]}, n^{[l-1]})$               |                                                                                                                          |
| $$\partial W^{[l]}$$   | $(n^{[l]}, n^{[l-1]})$     | $(n^{[l]}, n^{[l-1]})$               |                                                                                                                          |
| $$b^{[l]}$$            | $(n^{[l]}, 1)$             | $(n^{[l]}, 1)$                       | In Python, the vector is added to the product $W^{[l]} a^{[l-1]}$, of shape $(n^{[l]}, m)$, via broadcasting             |
| $$\partial b^{[l]}$$   | $(n^{[l]}, 1)$             | $(n^{[l]}, 1)$                       |                                                                                                                          |
| $$z^{[l]}$$            | $(n^{[l]}, 1)$             | $(n^{[l]}, m)$                       |                                                                                                                          |
| $$\partial z^{[l]}$$   | $(n^{[l]}, 1)$             | $(n^{[l]}, m)$                       |                                                                                                                          |
| $$a^{[l]}$$            | $(n^{[l]}, 1)$             | $(n^{[l]}, m)$                       | $a^{[0]}$ refers to the inputs                                                                                           |
| $$\partial a^{[l]}$$   | $(n^{[l]}, 1)$             | $(n^{[l]}, m)$                       |                                                                                                                          |
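
A quick way to make use of the table is to assert the shapes while running the network. The sketch below assumes NumPy, a hypothetical helper `initialize_parameters`, illustrative layer sizes, and `tanh` as a stand-in activation; it checks only the forward-pass rows of the table (gradients have the same shape as the quantity they belong to).

```python
import numpy as np

def initialize_parameters(layer_sizes, seed=0):
    """Hypothetical helper: small random weights and zero biases for each layer."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        params[f"W{l}"] = rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_sizes[l], 1))
    return params

layer_sizes = [5, 4, 3, 1]   # n^{[0]}, n^{[1]}, n^{[2]}, n^{[3]} (illustrative)
m = 7                        # number of training examples
params = initialize_parameters(layer_sizes)

A = np.ones((layer_sizes[0], m))   # a^{[0]} stacked over m examples
shapes = {}
for l in range(1, len(layer_sizes)):
    W, b = params[f"W{l}"], params[f"b{l}"]
    Z = W @ A + b            # b of shape (n^{[l]}, 1) is broadcast across the m columns
    A = np.tanh(Z)           # any element-wise activation keeps the shape of Z

    # Shape checks taken from the table.
    assert W.shape == (layer_sizes[l], layer_sizes[l - 1])
    assert b.shape == (layer_sizes[l], 1)
    assert Z.shape == (layer_sizes[l], m)
    assert A.shape == (layer_sizes[l], m)
    shapes[l] = (W.shape, b.shape, Z.shape, A.shape)

print(shapes[1])   # ((4, 5), (4, 1), (4, 7), (4, 7))
```

Shape assertions of this kind are cheap and tend to catch mistakes such as a transposed weight matrix early.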


# Why deep representation?

Earlier layers of the network usually detect simple, general features, such as edges in an image. Deeper layers compute more complex functions, such as detecting parts of a face (eyes, noses, etc.).

One intuition for why deep representations work better than shallow ones comes from circuit theory. Some functions can be computed by a small multi-level network of logic gates but become very expensive when the network is shallow. For example, the XOR (parity) of $n$ inputs can be computed by a tree of XOR gates with only $\mathcal{O}(\log{n})$ layers. If the computation is restricted to a single layer, however, it needs an exponentially large number of units, because about $2^{n-1}$ units are needed to cover the possible configurations of the $n$ inputs.
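
The contrast can be checked numerically for a small $n$. The sketch below is only an illustration of the counting argument, under two assumptions: the deep circuit is a balanced tree of two-input XOR gates, and the shallow count is taken as one unit per input pattern with odd parity. The function name `parity_tree` is hypothetical.

```python
from itertools import product

def parity_tree(bits):
    """Compute the XOR of all bits with a balanced tree of two-input XOR gates.

    Returns (parity, depth, gate_count).
    """
    level = list(bits)
    depth = 0
    gates = 0
    while len(level) > 1:
        nxt = []
        # Pair up neighbours; each pair costs one XOR gate.
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] ^ level[i + 1])
            gates += 1
        # An unpaired last element is passed on to the next level unchanged.
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth, gates

n = 8
parity, depth, gates = parity_tree([1, 0, 1, 1, 0, 0, 1, 0])
print(parity, depth, gates)   # 0 3 7

# Shallow alternative (assumed construction): one unit per odd-parity input pattern.
odd_patterns = sum(1 for bits in product([0, 1], repeat=n) if sum(bits) % 2 == 1)
print(odd_patterns)           # 128, i.e. 2**(n - 1)
```

For $n = 8$ the tree needs 7 gates arranged in 3 levels, while the pattern-enumeration approach already needs 128 units, and that count doubles with every additional input.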

# Parameters vs Hyperparameters

The parameters of the neural network model are $W^{[l]}$ and $b^{[l]}$. However, there are many other values that need to be set, called hyperparameters. For instance:

- Learning rate $\alpha$.
- Number of iterations.
- Number of layers $L$ (and therefore the number of hidden layers).
- Number of hidden units $n^{[l]}$ in layer $l$.
- Choice of activation function.

These are the hyperparameters encountered so far. Others will appear later, such as the momentum term, mini-batch size, and regularization parameters.
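
The distinction can be made concrete in code: hyperparameters are chosen by the practitioner before training, while the parameters are the arrays the training procedure learns. The following sketch is illustrative only; the dictionary keys, the values, and the helper `count_parameters` are assumptions, not part of any library.

```python
# Hyperparameters: set by hand before training (values are illustrative).
hyperparameters = {
    "learning_rate": 0.01,          # alpha
    "num_iterations": 1000,
    "layer_sizes": [3, 4, 2, 1],    # n^{[0]}, ..., n^{[L]}; fixes L and every n^{[l]}
    "hidden_activation": "relu",
}

def count_parameters(layer_sizes):
    """Number of learned values: entries of every W^{[l]} plus entries of every b^{[l]}."""
    return sum(
        layer_sizes[l] * layer_sizes[l - 1] + layer_sizes[l]
        for l in range(1, len(layer_sizes))
    )

L = len(hyperparameters["layer_sizes"]) - 1     # the input layer is not counted
n_params = count_parameters(hyperparameters["layer_sizes"])
print(L, n_params)   # 3 29
```

Changing a hyperparameter such as `layer_sizes` changes how many parameters exist in the first place, which is one reason these choices have to be fixed before $W^{[l]}$ and $b^{[l]}$ can be learned.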

Applied deep learning is currently a very empirical process. The practitioner starts with an idea, writes the code, runs the experiment, then iterates on the code and the hyperparameters and experiments again, and so on until the desired results are reached.

---