from preamble import *
HTML('''<style>html, body{overflow-y: visible !important} .CodeMirror{min-width:100% !important;} .rise-enabled .CodeMirror, .rise-enabled .output_subarea{font-size:100%; line-height:1.0; overflow: visible;} .output_subarea pre{width:100%}</style>''') # For slides
#HTML('''<style>html, body{overflow-y: visible !important} .output_subarea{font-size:100%; line-height:1.0; overflow: visible;}</style>''') # For slides
InteractiveShell.ast_node_interactivity = "all"

### Artificial Neuron
![Neuron](images/neuron.png)



### Can be stacked
![Single layer mlp](images/single-layer-mlp.png)




### Multiple layers of stacked neurons
![Multilayer perceptron](images/mlp.png)


### Motivation
- Compositional features
    
![XOR visualization](images/xor-visualization.png)


![XOR visualization](images/xor-visualization3.png)

![Training single Neuron](images/training-single-neuron2.png)


![Single neuron nonlinear](images/single-neuron-nonlinear-features.png)

### Multilayer Perceptron (MLP) 
![Image of a Neural Network](images/mlp.png) 

- Directed acyclic graph
- Nodes are artificial neurons
- Edges are connections between them 
- Feedforward neural network
    - Neurons are organized in layers
    - No connections between neurons within a layer
    - All neurons in the same layer are of the same type

### Multilayer Perceptron (MLP) 
![Image of a Neural Network](images/mlp.png)

* Each layer creates a new representation of the input data:
* $\mathbf{h}^{(1)} = f^{(1)}(\mathbf{x})$
* $\mathbf{h}^{(2)} = f^{(2)}(\mathbf{h}^{(1)})$
* $y = f^{(3)}(\mathbf{h}^{(2)})$


* Overall, the MLP is a single function $f$ with parameters $\theta$
* $y=f(\mathbf{x};\theta)$

* Nested functions: $f^{(3)}(f^{(2)}(f^{(1)}(\mathbf{x})))$
    * First layer: $f^{(1)}$
    * Second layer: $f^{(2)}$
    * Third layer: $f^{(3)}$
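
The nested-function view can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes (2 inputs, two hidden layers of 3 units, 1 output), the ReLU activation, and the random weights are assumptions, and `make_layer` is a hypothetical helper, not part of any library.

```python
import numpy as np

def make_layer(W, b, activation):
    """Return a function h -> activation(W @ h + b) (hypothetical helper)."""
    return lambda h: activation(W @ h + b)

def relu(z):
    return np.maximum(z, 0.0)

def identity(z):
    return z

rng = np.random.default_rng(0)

# Assumed sizes: 2 inputs -> 3 hidden -> 3 hidden -> 1 output
f1 = make_layer(rng.normal(size=(3, 2)), np.zeros(3), relu)
f2 = make_layer(rng.normal(size=(3, 3)), np.zeros(3), relu)
f3 = make_layer(rng.normal(size=(1, 3)), np.zeros(1), identity)

def mlp(x):
    """The whole network is the composition f3(f2(f1(x)))."""
    return f3(f2(f1(x)))

x = np.array([1.0, 0.0])
h1 = f1(x)   # first representation of the input
h2 = f2(h1)  # second representation
y = f3(h2)   # output
```

Computing `h1`, `h2`, and `y` step by step gives the same result as calling `mlp(x)` directly; each intermediate value is one of the per-layer representations.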
    


MLP
===
* Linear models are limited in the relationships between the features $x$ that they can capture
* Non-linear models are difficult to train (non-convex optimization)
    * Local minima, no guarantee of convergence
* Alternative: a linear model on a non-linear transformation of $x$, $\phi(x)$
    * Choosing $\phi(x)$ is the challenge
    * If $\phi$ is a generic transformation (e.g., RBF), generalization suffers
    * A problem-specific $\phi$ requires significant human effort and does not scale well
* The strategy of deep learning is to learn $\phi$
* $y=f(x;\theta,w)=\phi(x;\theta)^\top w$
    * Gives up convexity
* MLP: a deterministic mapping from $x$ to $y$ without recurrence
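
To make the role of $\phi$ concrete, here is a small sketch on XOR. It fits a least-squares linear model twice: once on the raw inputs and once on a hand-crafted $\phi(x) = (1, x_1, x_2, x_1 x_2)$. The choice of the product feature is an assumption made for illustration; deep learning would learn such a feature instead of hand-crafting it.

```python
import numpy as np

# The four XOR points and their labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

def fit_predict(features, targets):
    """Least-squares fit of a linear model, then predict on the same points."""
    w, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return features @ w

ones = np.ones((len(X), 1))

# Plain linear model: features (1, x1, x2)
linear_features = np.hstack([ones, X])

# Linear model on a hand-crafted phi(x) = (1, x1, x2, x1 * x2)
phi_features = np.hstack([ones, X, X[:, :1] * X[:, 1:]])

pred_linear = fit_predict(linear_features, y)  # best linear fit predicts 0.5 everywhere
pred_phi = fit_predict(phi_features, y)        # reproduces the XOR labels
```

The model stays linear in $w$ in both cases; only the representation of $x$ changes.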


### MLP Capacity


- Multiple neurons can carve the input space into many regions
    - Can approximate arbitrary functions
    - Universal approximation theorem
        - However, the number of neurons needed can grow very quickly (the region count of a single layer is bounded by a sum of binomial coefficients)
        - The upper bound on the number of regions a multilayer network can separate grows exponentially with depth

## Number of layers (Depth) 

### Single hidden layer
- Universal approximation theorem (Hornik, 1991)
    - "a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough units"
- Scales poorly
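    - To represent a complex function, the model may need exponentially many neurons
    - A single hidden layer of $n_1$ units on an $n_0$-dimensional input splits the input space into at most $\sum_{j=0}^{n_0}{\dbinom{n_1}{j}}$ linear regions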
- However, since both the shallow network and the deep network can learn the same functions, we need to analyze the value of the depth in another way

- Networks with piecewise linear activations split the input space into piecewise linear regions. With the same number of neurons, a deep network can produce more regions than a shallow one, which lets it approximate more complex functions. After the first layer partitions the input space, each subsequent layer operates on those pieces, so the composition of layers can identify an exponential number of input regions. The deep hierarchy makes this possible because it applies the same computation across different regions of the input space.

![folding the input space](images/folding-space.png)

- The number of piecewise linear regions the input space can be split into grows exponentially with the number of layers, whereas for a shallow network it grows only polynomially with the number of neurons. This is one explanation for why deep networks often perform better than shallow ones.

- Finally, for a very wide shallow network there is no known learning algorithm that can reliably optimize its parameters for every type of function
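
The folding argument can be checked numerically in one dimension. The sketch below is an illustration under stated assumptions: each "layer" is a tent map built from two ReLU units (a construction chosen here, not taken from the slides), and linear pieces are counted by looking for slope sign changes on a dyadic grid. `shallow_region_bound` evaluates the single-layer bound $\sum_{j=0}^{n_0}\binom{n_1}{j}$.

```python
import math
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent(x):
    """One folding layer from two ReLU units: 2x on [0, 0.5], 2 - 2x on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_pieces(depth, grid_bits=12):
    """Count linear pieces of `depth` stacked tent layers on [0, 1].

    Assumes depth <= grid_bits so every breakpoint falls exactly on the grid.
    """
    x = np.linspace(0.0, 1.0, 2**grid_bits + 1)
    y = x
    for _ in range(depth):
        y = tent(y)
    slopes = np.sign(np.diff(y))
    # A new piece starts wherever the slope changes sign
    return int(np.sum(slopes[1:] != slopes[:-1])) + 1

def shallow_region_bound(n_inputs, n_units):
    """Upper bound on linear regions of one hidden layer of piecewise linear units."""
    return sum(math.comb(n_units, j) for j in range(n_inputs + 1))

deep_pieces = count_linear_pieces(depth=10)                     # 10 layers x 2 units = 20 units
shallow_bound = shallow_region_bound(n_inputs=1, n_units=20)    # 20 units in one layer
```

With 20 ReLU units in total, the stacked version produces 1024 linear pieces, while a single layer of 20 units on a one-dimensional input is bounded by 21.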


Architecture

- Input
- Hidden layers
- Output

Can solve XOR

![MLP XOR Start](images/mlp-xor-start.png)
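
As a minimal sketch of how a small MLP can represent XOR, the network below uses two ReLU hidden units and a linear output with hand-picked weights. The weights are an assumption for illustration and are not necessarily the ones shown in the figure.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked parameters (illustrative, not learned)
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])      # hidden-layer weights
c = np.array([0.0, -1.0])       # hidden-layer biases
w = np.array([1.0, -2.0])       # output weights
b = 0.0                         # output bias

def xor_mlp(x):
    h = np.maximum(W @ x + c, 0.0)  # hidden layer with ReLU
    return w @ h + b                # linear output unit

outputs = [float(xor_mlp(x)) for x in X]
```

The first hidden unit computes $x_1 + x_2$; the second fires only when both inputs are on, and its negative output weight cancels the sum for the input $(1, 1)$.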


## Gradient Descent
- Model: 
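    - $o_\mathbf{\theta}(\mathbf{x}) = \phi_1(\mathbf{w_1}^\top \phi_2( \mathbf{w_2}^\top\phi_3(\mathbf{w_3}^\top \mathbf{x})))$
    - $\theta = \mathbf{W} = \{\mathbf{w_1}, \mathbf{w_2}, \mathbf{w_3}\}$
- Loss function: 
    - $L(\mathbf{x}, y; \mathbf{W}) = \frac{1}{2n}\sum_{i=1}^{n}{(o_\theta(\mathbf{x}_i) - y_i)^2}$ 
    
- Gradient of $L$ wrt $\mathbf{W}$:
    
    - $\nabla_{\mathbf{W}} L = \frac{\partial}{\partial \mathbf{W}}{L(\mathbf{x}, y; \mathbf{W})}$ 

- Update with learning rate $\eta$:

    - $\mathbf{W} \leftarrow \mathbf{W} - \eta \nabla_{\mathbf{W}} L$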

- biases omitted for simplicity
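
The loop below sketches gradient descent on the squared-error loss above. To keep the gradient short, it assumes a single linear neuron $o = \mathbf{w}^\top \mathbf{x}$ rather than the three-layer model; the toy data, learning rate, and step count are also assumptions.

```python
import numpy as np

def loss(w, X, y):
    """L = 1/(2n) * sum_i (o_i - y_i)^2 for a linear neuron o = X @ w."""
    residual = X @ w - y
    return float(residual @ residual) / (2 * len(y))

def grad(w, X, y):
    """Gradient of the loss with respect to w."""
    return X.T @ (X @ w - y) / len(y)

def gradient_descent(w, X, y, lr=0.1, steps=50):
    history = [loss(w, X, y)]
    for _ in range(steps):
        w = w - lr * grad(w, X, y)   # step against the gradient
        history.append(loss(w, X, y))
    return w, history

# Toy data generated by y = 2 * x, so the optimum is w = 2
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0]

w_final, history = gradient_descent(np.zeros(1), X, y)
```

For the multilayer model the loop is the same; only `grad` changes, which is where backpropagation comes in.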

### Layered representation

![MLP consolidated](images/mlp-consolidated.png)



![MLP consolidated](images/mlp-consolidated2.png)

Computation Graph
- Vectorized form
![MLP Compute Graph](images/mlp-compute-graph.png)

![Backprop Node](images/backprop-node2.png)

![Backprop Node](images/backprop-node-jacobian.png)

$$J = {\partial(\mathbf{F}) \over \partial(\mathbf{W})} =  
\left[\matrix{{\partial f_1 \over \partial w_1} & {\partial f_1 \over \partial w_2} & {\partial f_1 \over \partial w_3} \cr 
{\partial f_2 \over \partial w_1} & {\partial f_2\over \partial w_2} & {\partial f_2 \over \partial w_3} \cr 
{\partial f_3 \over \partial w_1} & {\partial f_3 \over \partial w_2} & {\partial f_3 \over \partial w_3}}\right] $$
- Activation of neuron n:
    - $f_n$
- Parameters of neuron n:
    - $w_n$

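Neurons within a layer are not connected and do not share parameters, so $\partial f_i / \partial w_j = 0$ for $i \neq j$ and the Jacobian is diagonal:

$$J = {\partial(\mathbf{F}) \over \partial(\mathbf{W})} =  
\left[\matrix{{\partial f_1 \over \partial w_1} & 0 & 0 \cr 
0 & {\partial f_2\over \partial w_2} & 0 \cr 
0 & 0 & {\partial f_3 \over \partial w_3}}\right] $$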

![Mlp backprop compute graph](images/mlp-backprop-graph.png)
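
A minimal sketch of backpropagation through a three-layer network, checked against finite differences. Everything specific here is an assumption: `tanh` hidden activations, a linear output, layer sizes 2-3-3-1, random weights, one training example, and the hypothetical helper `numerical_grad`. Biases are omitted, as in the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = np.array([1.0, -0.5]), 0.3            # one hypothetical training example
W3 = rng.normal(scale=0.5, size=(3, 2))      # layer closest to the input
W2 = rng.normal(scale=0.5, size=(3, 3))
w1 = rng.normal(scale=0.5, size=3)           # output layer

def forward():
    a3 = np.tanh(W3 @ x)
    a2 = np.tanh(W2 @ a3)
    o = w1 @ a2                              # linear output unit
    return a3, a2, o

def loss():
    return 0.5 * (forward()[2] - y) ** 2

def backward():
    """Apply the chain rule from the output back to the input layer."""
    a3, a2, o = forward()
    d_o = o - y                              # dL/do
    g_w1 = d_o * a2
    # Elementwise tanh has a diagonal Jacobian, so it becomes an elementwise product
    d_z2 = d_o * w1 * (1.0 - a2**2)
    g_W2 = np.outer(d_z2, a3)
    d_z3 = (W2.T @ d_z2) * (1.0 - a3**2)
    g_W3 = np.outer(d_z3, x)
    return g_w1, g_W2, g_W3

def numerical_grad(param, eps=1e-6):
    """Central finite differences on loss(), perturbing `param` in place."""
    g = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        up = loss()
        param[idx] = old - eps
        down = loss()
        param[idx] = old
        g[idx] = (up - down) / (2 * eps)
    return g

g_w1, g_W2, g_W3 = backward()
```

Comparing each analytic gradient with `numerical_grad` of the same parameter is a common sanity check when writing backpropagation by hand.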


![MLP XOR Start](images/mlp-xor-start.png)




![MLP XOR End](images/mlp-xor-end.png)







(Show the examples with 3-4 neurons and xor)

(Show examples with two layers and 3 neurons)

(Discuss the capacity (single layer vs. deep networks))