# Deep Neural Network

## Deep L-layer Neural Network

Logistic regression can be viewed as a network with a single layer, the logistic output layer. A deep neural network, by contrast, stacks several hidden layers before the output layer.

### Notations

$$ n^{[l]}: \text{number of units in layer } l $$
$$ a^{[l]}: \text{activation of layer } l $$
$$ a^{[0]} = x: \text{input} $$
$$ a^{[1]} = g^{[1]}(z^{[1]}) $$
$$ a^{[L]} = \hat{y}: \text{output} $$
$$ W^{[l]}: \text{weight matrix of layer } l $$
$$ b^{[l]}: \text{bias vector of layer } l $$


## Forward Propagation in a Deep Network

$$ Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]} \tag{1}$$
$$ A^{[l]} = g^{[l]}(Z^{[l]}) \tag{2}$$
$$ \hat{Y} = A^{[L]} = g^{[L]}(Z^{[L]}) \tag{3}$$
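
Equations (1) to (3) can be sketched as a loop over layers. This is a minimal sketch, not a reference implementation: the function name `forward_propagation`, the `parameters` dictionary keyed by `"W1"`, `"b1"`, ..., and the choice of ReLU for hidden layers with a sigmoid output are all assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, parameters):
    """Apply equations (1)-(3) layer by layer; return A^[L] and every A^[l]."""
    L = len(parameters) // 2           # each layer contributes one W and one b
    A = X                              # A^[0] = X
    activations = [A]
    for l in range(1, L + 1):
        W = parameters[f"W{l}"]
        b = parameters[f"b{l}"]
        Z = W @ A + b                  # equation (1); b broadcasts over the m columns
        A = sigmoid(Z) if l == L else relu(Z)   # equation (2), assumed choice of g^[l]
        activations.append(A)
    return A, activations              # equation (3): the last A is Y_hat

# Hypothetical 2-3-1 network evaluated on m = 4 examples
rng = np.random.default_rng(0)
parameters = {
    "W1": rng.standard_normal((3, 2)) * 0.01, "b1": np.zeros((3, 1)),
    "W2": rng.standard_normal((1, 3)) * 0.01, "b2": np.zeros((1, 1)),
}
X = rng.standard_normal((2, 4))
Y_hat, activations = forward_propagation(X, parameters)
print(Y_hat.shape)  # (1, 4)
```

The explicit loop over layers is unavoidable here; only the loop over the $m$ examples is removed by vectorization.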

## Getting Matrix Dimensions Right

Consider a neural network with $L = 5$ layers and the following layer sizes:

$n^{[l]}$ | number of units in layer l
-|-
$n^{[0]}$ | $n_x$ = 2
$n^{[1]}$ | 3
$n^{[2]}$ | 5
$n^{[3]}$ | 4
$n^{[4]}$ | 2
$n^{[5]}$ | 1

For a single example ($m = 1$), the first layer computes $Z^{[1]} = W^{[1]}X + b^{[1]}$ with these shapes:

$Z^{[1]}$ | $W^{[1]}$ | $X$ | $b^{[1]}$
-|-|-|-
$(3,1)$ | $(3,2)$ | $(2,1)$ | $(3,1)$

**The most important thing is to get the dimensions of the matrices right.**

For $m$ training examples stacked as columns:

Quantity | Dimensions
-|-
$Dim(X)$, $Dim(A^{[0]})$ | $(n_x,m)$
$Dim(W^{[l]})$ | $(n^{[l]},n^{[l-1]})$
$Dim(b^{[l]})$ | $(n^{[l]},1)$
$Dim(dW^{[l]})$ | $(n^{[l]},n^{[l-1]})$
$Dim(db^{[l]})$ | $(n^{[l]},1)$
$Dim(Z^{[l]})$ | $(n^{[l]},m)$
$Dim(A^{[l]})$ | $(n^{[l]},m)$
$Dim(dZ^{[l]})$ | $(n^{[l]},m)$
$Dim(dA^{[l]})$ | $(n^{[l]},m)$
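
The shape rules can be checked mechanically. The sketch below builds the five-layer network from the layer-size table (2, 3, 5, 4, 2, 1) and asserts each shape during a forward pass. The batch size `m = 7`, the `tanh` activation, and the dictionary layout are assumptions chosen only for the check.

```python
import numpy as np

layer_dims = [2, 3, 5, 4, 2, 1]    # n^[0] ... n^[5] from the layer-size table
m = 7                              # assumed number of training examples

rng = np.random.default_rng(1)
parameters = {}
for l in range(1, len(layer_dims)):
    parameters[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
    parameters[f"b{l}"] = np.zeros((layer_dims[l], 1))

A = rng.standard_normal((layer_dims[0], m))    # A^[0] = X has shape (n_x, m)
for l in range(1, len(layer_dims)):
    W, b = parameters[f"W{l}"], parameters[f"b{l}"]
    assert W.shape == (layer_dims[l], layer_dims[l - 1])
    assert b.shape == (layer_dims[l], 1)
    Z = W @ A + b                  # b of shape (n^[l], 1) broadcasts to (n^[l], m)
    A = np.tanh(Z)                 # any elementwise activation keeps the shape
    assert Z.shape == A.shape == (layer_dims[l], m)

shapes = {name: p.shape for name, p in parameters.items()}
print(shapes["W1"], shapes["W5"], A.shape)  # (3, 2) (1, 2) (1, 7)
```

Assertions like these are cheap to leave in while debugging, since a shape mismatch is one of the most common bugs in a hand-written network.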

## Why Deep?

Image Recognition
- input layer (layer 0): pixels
- hidden layer 1: edges
- hidden layer 2: shapes
- hidden layer 3: objects
- hidden layer 4: faces
- output layer: person

Speech Recognition
- input layer (layer 0): sound waves
- hidden layer 1: low level sounds
- hidden layer 2: phonemes
- hidden layer 3: syllables
- hidden layer 4: words
- output layer: sentence

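Circuit Theory
> Some functions can be computed by a deep network of depth $O(\log n)$ with few units per layer, whereas a network with only one hidden layer needs exponentially many hidden units. The XOR (parity) of $n$ inputs is the standard example.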
- input layer (layer 0): transistors
- hidden layer 1: logic gates
- hidden layer 2: adders
- hidden layer 3: arithmetic logic units
- hidden layer 4: microprocessors
- output layer: computer
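
The depth argument can be made concrete with the parity example. The sketch below XORs $n$ bits with a balanced tree of two-input gates and reports the tree depth and gate count, then prints $2^{n-1}$, the order of magnitude usually quoted for the number of hidden units a single-hidden-layer network needs. The helper `xor_tree` is hypothetical and models a logic circuit, not a trained network.

```python
from functools import reduce
from operator import xor

def xor_tree(bits):
    """XOR the bits pairwise, level by level; return (result, depth, gates)."""
    level, depth, gates = list(bits), 0, 0
    while len(level) > 1:
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        gates += len(nxt)
        if len(level) % 2:             # an odd leftover bit passes through unchanged
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth, gates

bits = [1, 0, 1, 1, 0, 1, 0, 0]        # n = 8
result, depth, gates = xor_tree(bits)
assert result == reduce(xor, bits)     # agrees with a plain left-to-right XOR
print(result, depth, gates)            # 0 3 7
print(2 ** (len(bits) - 1))            # 128
```

For $n = 8$ the tree needs depth $\log_2 8 = 3$ and $n - 1 = 7$ gates, against roughly 128 units for the shallow alternative.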

## Building Blocks

+ LAYER $l$:
- $n^{[l]}$ = number of units in layer l
- $W^{[l]}$ = weight matrix of shape $(n^{[l]}, n^{[l-1]})$
- $b^{[l]}$ = bias vector of shape $(n^{[l]}, 1)$

+ FORWARD PROPAGATION:
- $a^{[l-1]}$ = input vector of shape $(n^{[l-1]}, 1)$
- $a^{[l]}$ = output vector of shape $(n^{[l]}, 1)$
- $z^{[l]}$ = linear transformation of $a^{[l-1]}$ by $W^{[l]}$ and $b^{[l]}$ of shape $(n^{[l]}, 1)$
- $g^{[l]}$ = activation function of layer l
- cache = tuple of $(z^{[l]}, W^{[l]}, b^{[l]}, a^{[l-1]})$

+ BACKWARD PROPAGATION:
- $da^{[l]}$ = input: derivative of the cost function with respect to $a^{[l]}$
- $da^{[l-1]}$ = output: derivative of the cost function with respect to $a^{[l-1]}$
- $dW^{[l]}$ = output: derivative of the cost function with respect to $W^{[l]}$
- $db^{[l]}$ = output: derivative of the cost function with respect to $b^{[l]}$
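
The forward block and its cache can be sketched for a single example as follows. The function name `layer_forward`, the ReLU activation passed in as `g`, and the numeric values are assumptions for illustration. The cache keeps the same tuple order as listed under FORWARD PROPAGATION so that a backward block could unpack it later.

```python
import numpy as np

def layer_forward(a_prev, W, b, g):
    """One forward block: return a^[l] and the cache kept for the backward pass."""
    z = W @ a_prev + b
    a = g(z)
    cache = (z, W, b, a_prev)          # (z^[l], W^[l], b^[l], a^[l-1])
    return a, cache

W = np.array([[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]])   # shape (3, 2)
b = np.zeros((3, 1))                                  # shape (3, 1)
a_prev = np.array([[1.0], [2.0]])                     # shape (2, 1)

a, cache = layer_forward(a_prev, W, b, lambda z: np.maximum(0.0, z))
print(a.ravel())  # [0. 2. 3.]
```

Storing $z^{[l]}$ and $a^{[l-1]}$ during the forward pass avoids recomputing them when the gradients are evaluated.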


## Forward and Backward Propagation

Forward propagation, input: $A^{[l-1]}$, output: $A^{[l]}$
$$ Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}$$
$$ A^{[l]} = g^{[l]}(Z^{[l]})$$

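Backward propagation, input: $dA^{[l]}$, outputs: $dA^{[l-1]}$, $dW^{[l]}$, $db^{[l]}$ (here $*$ is the elementwise product)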
$$ dZ^{[l]} = dA^{[l]} * g^{[l]'}(Z^{[l]})$$
$$ dW^{[l]} = \frac{1}{m} dZ^{[l]}A^{[l-1]T}$$
$$ db^{[l]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)}$$
$$ dA^{[l-1]} = W^{[l]T}dZ^{[l]}$$
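
The four backward equations can be sketched for one sigmoid layer and checked against a finite-difference estimate. Several things here are assumptions: the function names, the sigmoid activation, and a squared-error cost averaged over the $m$ examples, which is used only so that there is something to differentiate.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def layer_forward(A_prev, W, b):
    Z = W @ A_prev + b
    return sigmoid(Z), Z

def layer_backward(dA, Z, A_prev, W):
    m = A_prev.shape[1]
    s = sigmoid(Z)
    dZ = dA * s * (1.0 - s)                        # dA * g'(Z), elementwise
    dW = (dZ @ A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m     # sum over the m examples
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((2, 5))               # n^[l-1] = 2, m = 5
Y = rng.standard_normal((3, 5))
W = rng.standard_normal((3, 2))                    # n^[l] = 3
b = np.zeros((3, 1))

def cost(W_test):
    # Assumed cost: squared error per example, averaged over m
    A_test, _ = layer_forward(A_prev, W_test, b)
    return 0.5 * np.sum((A_test - Y) ** 2) / A_prev.shape[1]

A, Z = layer_forward(A_prev, W, b)
dA_prev, dW, db = layer_backward(A - Y, Z, A_prev, W)   # dA = A - Y for this cost

# Central finite difference on a single weight
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (cost(W_plus) - cost(W_minus)) / (2 * eps)
print(np.isclose(numeric, dW[0, 0], atol=1e-7))  # True
```

The factor $\frac{1}{m}$ appears in $dW^{[l]}$ and $db^{[l]}$ but not in $dA^{[l-1]}$, which matches a cost defined as the average of per-example losses.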

## Parameters and Hyperparameters

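Parameters: the variables that determine the network's predictions, namely $W^{[l]}$ and $b^{[l]}$. They are learned during training.

Hyperparameters: the settings that control the network's structure and how the parameters are learned, such as the learning rate, the number of iterations, the number of hidden layers $L$, the number of hidden units $n^{[l]}$, and the choice of activation function. They are set before training.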
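
The distinction can be sketched in code. In the example below the hyperparameters are fixed up front and the parameters are the arrays that the update loop changes. The dictionary keys, the values, and the stand-in gradients are all assumptions. The update shown is the plain gradient-descent step $W := W - \alpha \, dW$.

```python
import numpy as np

# Hyperparameters: chosen by hand before training (arbitrary values)
hyperparameters = {"learning_rate": 0.1, "num_iterations": 3, "layer_dims": [2, 3, 1]}

# Parameters: their shapes follow from the hyperparameters, their values change in training
dims = hyperparameters["layer_dims"]
parameters = {}
for l in range(1, len(dims)):
    parameters[f"W{l}"] = np.ones((dims[l], dims[l - 1]))
    parameters[f"b{l}"] = np.zeros((dims[l], 1))

# Stand-in gradients; in a real run these come from backward propagation
grads = {name: np.ones_like(value) for name, value in parameters.items()}

for _ in range(hyperparameters["num_iterations"]):
    for name in parameters:
        parameters[name] = parameters[name] - hyperparameters["learning_rate"] * grads[name]

print(round(float(parameters["W1"][0, 0]), 6))  # 0.7
```

Changing `learning_rate` or `layer_dims` means rerunning training from scratch, which is why hyperparameter choices are explored experimentally.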

**Applied deep learning is a very empirical process.**

Idea --> Code --> Experiment --> Idea --> ...
