# Deep Neural Networks
- For some functions, the number of hidden units a shallow network requires is impractically large
- Deep networks can produce many more linear regions than shallow networks for a given number of parameters
- From a practical standpoint, they can be used to describe a broader family of functions

## Composing Neural Networks
- Two shallow networks are composed so that the output of the first becomes the input of the second
- First
  - $h_{1} = a[\theta_{10} + \theta_{11}x]$ <br> $h_{2} = a[\theta_{20} + \theta_{21}x]$ <br> $h_{3} = a[\theta_{30} + \theta_{31}x]$ <br> $y=\phi_{0} + \phi_{1}h_1 + \phi_{2}h_2 + \phi_{3}h_3$
- Second
  - $h_{1}' = a[\theta_{10}' + \theta_{11}'y]$ <br> $h_{2}' = a[\theta_{20}' + \theta_{21}'y]$ <br> $h_{3}' = a[\theta_{30}' + \theta_{31}'y]$ <br> $y'=\phi_{0}' + \phi_{1}'h_1' + \phi_{2}'h_2' + \phi_{3}'h_3'$
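- The number of linear regions produced is greater than for a shallow network with $6$ hidden units (at most $7$)
- In the example, the composition has $9$ linear regions ($3 \times 3$): each of the $3$ regions of the first network contains a copy of the $3$ regions of the second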
- The first network folds the input space $x$ back onto itself so that multiple inputs generate the same output
- Then, the second network applies a function, which is replicated at all points that were folded on top of one another
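
The folding argument above can be checked numerically. The following is a minimal sketch, not the notes' own example: the parameter values are hypothetical, chosen so that the first network is a zigzag on $[0, 1]$, and the second network reuses the same values for its primed parameters. The helper `count_linear_regions` is also an assumption of this sketch and only works when every joint falls on a grid point.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def shallow(x, theta, phi0, phi):
    """Shallow network with one input and one output.

    theta holds one (bias, slope) row per hidden unit; phi0 and phi are the
    output bias and output weights.
    """
    x = np.asarray(x, dtype=float)
    h = relu(theta[:, 0] + theta[:, 1] * x[..., None])  # hidden units, shape (..., 3)
    return phi0 + h @ phi

# Hypothetical parameters (not from the notes): a zigzag that rises, falls and
# rises again on [0, 1], so three different inputs share each output value.
theta = np.array([[0.0, 1.0], [-1 / 3, 1.0], [-2 / 3, 1.0]])
phi0, phi = 0.0, np.array([3.0, -6.0, 6.0])

# For brevity the second network reuses the same values as its primed parameters.
theta_p, phi0_p, phi_p = theta.copy(), phi0, phi.copy()

def first_network(x):
    return shallow(x, theta, phi0, phi)

def composed(x):
    # The output y of the first network is the input of the second.
    return shallow(first_network(x), theta_p, phi0_p, phi_p)

def count_linear_regions(f, grid, tol=1e-6):
    """Count slope changes of f on a grid. Assumes every joint lies on a grid point."""
    slopes = np.diff(f(grid)) / np.diff(grid)
    return int(np.sum(np.abs(np.diff(slopes)) > tol)) + 1

# Spacing 1/900, so joints at multiples of 1/9 fall on grid points.
grid = np.linspace(0.0, 1.0, 901)
folded_inputs = np.array([0.1, 2 / 3 - 0.1, 2 / 3 + 0.1])

print(first_network(folded_inputs))               # three inputs folded onto one output
print(count_linear_regions(first_network, grid))  # regions of the first network on [0, 1]
print(count_linear_regions(composed, grid))       # regions of the composition on [0, 1]
```

With these particular values the three folded inputs should all map to $y = 0.3$, and the composition should show $3 \times 3 = 9$ regions on $[0, 1]$, more than the $7$ a shallow network with $6$ hidden units can reach.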

## Deep Neural Networks
- Case with 2 hidden layers and 3 hidden units per layer
  - The hidden units of the first layer are computed as usual: linear functions of the input followed by ReLU
  - Pre-activations of the second layer
    - Three linear functions of the first-layer hidden units
  - At the second hidden layer, another ReLU is applied to each pre-activation, which clips it and adds new joints
  - Final output is a linear combination of these hidden units
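
The four steps listed above can be written out as a short forward pass. This is a sketch under stated assumptions: the function name `two_layer_forward` and the randomly drawn parameter values are hypothetical and only serve to make the intermediate quantities visible.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def two_layer_forward(x, theta0, theta1, psi0, Psi, phi0, phi):
    """Forward pass for 1 input, two hidden layers of 3 units, 1 output."""
    x = np.atleast_1d(np.asarray(x, dtype=float))  # shape (N,)
    # First layer: linear functions of the input followed by ReLU.
    h = relu(theta0 + np.outer(x, theta1))         # shape (N, 3)
    # Second-layer pre-activations: three linear functions of h.
    f = psi0 + h @ Psi.T                           # shape (N, 3)
    # Second ReLU clips each pre-activation, which adds new joints.
    h_prime = relu(f)
    # Output: linear combination of the second-layer hidden units.
    y_prime = phi0 + h_prime @ phi                 # shape (N,)
    return h, f, h_prime, y_prime

# Hypothetical parameter values, drawn at random for illustration only.
rng = np.random.default_rng(0)
theta0, theta1 = rng.standard_normal(3), rng.standard_normal(3)
psi0, Psi = rng.standard_normal(3), rng.standard_normal((3, 3))
phi0, phi = rng.standard_normal(), rng.standard_normal(3)

x = np.linspace(-2.0, 2.0, 5)
h, f, h_prime, y_prime = two_layer_forward(x, theta0, theta1, psi0, Psi, phi0, phi)
print(f)        # pre-activations can be negative
print(h_prime)  # the same values with negatives clipped to zero
print(y_prime)
```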

## Hyperparameters
- Modern networks might have thousands of hidden units and hundreds of layers
- Width: number of hidden units in each layer
- Depth: number of hidden layers
- The number of layers is denoted $K$ and the number of hidden units in each layer $D_{1}, D_{2}, \ldots, D_{K}$

## Matrix Notation
- $\begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix} = a\left[\begin{bmatrix} \theta_{10} \\ \theta_{20} \\ \theta_{30} \end{bmatrix} + \begin{bmatrix} \theta_{11} \\ \theta_{21} \\ \theta_{31} \end{bmatrix}x\right]$
- $\begin{bmatrix} h_1' \\ h_2' \\ h_3' \end{bmatrix} = a\left[\begin{bmatrix} \psi_{10} \\ \psi_{20} \\ \psi_{30} \end{bmatrix} + \begin{bmatrix} \psi_{11} & \psi_{12} & \psi_{13} \\ \psi_{21} & \psi_{22} & \psi_{23} \\ \psi_{31} & \psi_{32} & \psi_{33} \end{bmatrix} \begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix}\right]$
- $y' = \phi_{0}' + \begin{bmatrix} \phi_1' & \phi_2' & \phi_3'\end{bmatrix} \begin{bmatrix} h_1' \\ h_2' \\ h_3' \end{bmatrix}$

- Or, more compactly:
  - $h = a[\theta_{0} + \theta x]$
  - $h' = a[\psi_{0} + \Psi h]$
  - $y' = \phi_{0}' + \phi'h'$
- Notation:
  - The vector of hidden units at layer $k$ is $h_k$
  - The vector of biases that contributes to hidden layer $k+1$ is $\beta_{k}$, and the weights (slopes) applied to the $k^{th}$ layer are $\Omega_{k}$
  - A general deep network can be written as
    - $h_{k} = a[\beta_{k-1} + \Omega_{k-1}h_{k-1}]$ (taking $h_{0} = x$)
- For the $k^{th}$ layer, with $D_{k}$ hidden units:
  - $\beta_{k-1}$ has size $D_{k}$
  - $\Omega_{k}$ has size $D_{k+1} \times D_{k}$
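
The general recursion and the sizes listed above can be sketched as a loop over layers. The helper names `init_params` and `forward`, the random initialization, and the treatment of the last bias and weight pair as a linear output layer (matching $y' = \phi_{0}' + \phi'h'$) are assumptions of this sketch, not something the notes prescribe.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def init_params(d_in, widths, d_out, rng):
    """Draw one bias vector and one weight matrix per layer.

    The matrix feeding a layer with D_next units from a layer with D_prev
    units has shape (D_next, D_prev); the bias has size D_next.
    """
    sizes = [d_in, *widths, d_out]
    betas = [rng.standard_normal(sizes[k + 1]) for k in range(len(sizes) - 1)]
    omegas = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(len(sizes) - 1)]
    return betas, omegas

def forward(x, betas, omegas):
    """Apply h_k = a[beta_{k-1} + Omega_{k-1} h_{k-1}] layer by layer."""
    h = np.asarray(x, dtype=float)  # h_0 = x
    for beta, omega in zip(betas[:-1], omegas[:-1]):
        h = relu(beta + omega @ h)
    # Assumed linear output layer: no activation on the final step.
    return betas[-1] + omegas[-1] @ h

# Hypothetical architecture: 1 input, hidden widths 4, 2 and 3, 1 output.
rng = np.random.default_rng(0)
betas, omegas = init_params(1, [4, 2, 3], 1, rng)

print([omega.shape for omega in omegas])  # one (D_next, D_prev) pair per layer
print(forward(np.array([0.5]), betas, omegas))
```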

## Shallow vs Deep
- Ability to approximate different functions
  - Both can approximate any continuous function arbitrarily closely given enough hidden units (universal approximation theorem)
- Number of linear regions
  - Shallow:
    - With one input, one output and $D$ hidden units, a shallow network can create up to $D+1$ linear regions and is defined by $3D+1$ parameters
      - $2D$ between input and hidden layer (one weight and one bias per unit), $D$ weights between hidden layer and output, and $1$ output bias
  - Deep
    - With one input, one output, $K$ layers and $D > 2$ hidden units per layer
      - up to $(D+1)^K$ linear regions
      - $3D + 1 + (K-1)D(D+1)$ parameters
        - $3D+1$ as in the shallow network
        - $(K-1)$ connections between consecutive hidden layers, each with $D(D+1)$ parameters (a $D \times D$ weight matrix plus $D$ biases)
  - Deep networks produce many more linear regions per parameter
- Depth Efficiency
  - Some functions can be approximated much more efficiently with deep networks
- Training/Generalization
  - Deep networks are usually easier to fit than shallow ones
  - Possibly because overparameterized deep models have a large family of roughly equivalent solutions that are easy to find
  - Deep networks also seem to generalize better to new data
- In practice, the best results are achieved using deep networks
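
The parameter and region counts quoted in the comparison above can be tabulated directly. This is a small sketch with hypothetical helper names; `deep_params_by_layer` recounts the deep network's parameters layer by layer as a cross-check on the closed-form expression.

```python
def shallow_params(d):
    """Parameters of a shallow network: 1 input, d hidden units, 1 output."""
    return 3 * d + 1

def shallow_max_regions(d):
    return d + 1

def deep_params(d, k):
    """Parameters of a deep network: 1 input, k hidden layers of d units, 1 output."""
    return 3 * d + 1 + (k - 1) * d * (d + 1)

def deep_max_regions(d, k):
    return (d + 1) ** k

def deep_params_by_layer(d, k):
    """Same count as deep_params, summed layer by layer."""
    first = d + d                      # input weights plus biases of layer 1
    between = (k - 1) * (d * d + d)    # d x d weights plus d biases per connection
    output = d + 1                     # output weights plus output bias
    return first + between + output

# Example sizes chosen so both networks have the same number of parameters.
d_deep, k_deep, d_shallow = 5, 3, 25
print(deep_params(d_deep, k_deep), deep_max_regions(d_deep, k_deep))
print(shallow_params(d_shallow), shallow_max_regions(d_shallow))
```

With $D = 5$ and $K = 3$ the deep network has $76$ parameters and up to $216$ linear regions, while a shallow network with the same $76$ parameters ($D = 25$) has at most $26$.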