from preamble import *
HTML('''<style>html, body{overflow-y: visible !important} .CodeMirror{min-width:100% !important;} .rise-enabled .CodeMirror, .rise-enabled .output_subarea{font-size:100%; line-height:1.0; overflow: visible;} .output_subarea pre{width:100%}</style>''') # For slides
#HTML('''<style>html, body{overflow-y: visible !important} .output_subarea{font-size:100%; line-height:1.0; overflow: visible;}</style>''') # For slides
InteractiveShell.ast_node_interactivity = "all"

## Agenda

- Introduction and Motivation
- Artificial Neuron
- Gradient Descent
- Backpropagation
- Perceptron
- Multilayered Perceptron
- MLP Classification
- **Model Design**
- Optimization

- Convolutional Neural Network
- Recurrent Neural Network

## Model Design
![Model Design](images/nn-model-design.png)
- Input format
- Output layer
- Loss function(s)
- Model Architecture
- Optimization parameters

## Output layer

### Regression

* Linear units -> Gaussian output distributions
    - Given a vector of feature activations $h$
    - $\hat{y}=W^Th+b$
    - $p(y|x)=\mathcal{N}(y;\hat{y},I)$
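Under this Gaussian assumption, maximizing the likelihood amounts to minimizing the squared error. A short worked step, where $d$ (the dimension of $y$) is a symbol introduced here for illustration:

$$-\log p(y|x) = \frac{1}{2}\lVert y-\hat{y}\rVert^2 + \frac{d}{2}\log(2\pi)$$

The second term does not depend on $W$ or $b$, so only the squared error term affects training.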


### Classification

* Sigmoid units -> Bernoulli output distributions
    - $p(y=1|x)$

* Softmax units -> Multinoulli output distributions
    * multiple neurons output the probability of each class
    * Normalized with the softmax function
    * $p(y=j|x)=\frac{e^{\mathbf{x}^\top w_j}} {\sum^{n}_{k=1}{e^{\mathbf{x}^\top w_k}}}$
    * strictly positive
    * sums to one
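The softmax normalization above can be sketched in NumPy as follows. This is a minimal illustration, not a library implementation: the function name `softmax` and the example logits are assumptions, and subtracting the maximum is a common numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax along the last axis."""
    # Subtracting the max does not change the result but avoids overflow in exp.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical scores x^T w_k for three classes.
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())
```

Every entry of `probs` is strictly positive and the entries sum to one, matching the two properties listed above.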


## Different activation functions

### Linear Activation function

$$g(z) = z$$
$$h = g(W^\top x + b)$$
![Linear Activation](images/linear.png)
- Usually used as a last layer activation for doing regression
- If all neurons are linear, the whole MLP collapses to a single linear map, which limits the functions it can represent
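A small numerical check of why stacking linear layers adds no expressive power. This is a sketch with randomly drawn, hypothetical weights; it uses the row-vector convention `x @ W + b` rather than $W^\top x + b$, which is the same computation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                      # 5 examples, 3 features

# Two stacked layers with the linear activation g(z) = z.
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)
two_layers = (x @ W1 + b1) @ W2 + b2

# The equivalent single linear layer.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))
```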

### Logistic Sigmoid

- $\phi(z) = \frac{1}{1 + e^{-z}}$
- $h = \phi(W^\top x + b)$

![Logit](images/sigmoid.png)

- Positive, bounded, strictly increasing

### Hyperbolic Tangent

- $\phi(z) = \tanh(z)$
- $h = \phi(W^\top x + b)$
![Hyperbolic Tangent](images/tanh.png)

- Outputs in $(-1, 1)$ (zero-centered), bounded, strictly increasing

### Rectified Linear Units
- $\phi(z) = \max\{0, z\}$
- $h = \phi(W^\top x + b)$
![Relu](images/relu.png)

* Bounded below by 0, no upper bound, monotonically increasing
* Not differentiable at 0
* Produces sparse activations
* Mitigates the vanishing gradient problem (the gradient is 1 for positive inputs)
* Tip: initialize biases to small positive values so that units start out active
* Variations: Leaky ReLU, PReLU, Maxout
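The activation functions in this section can be compared side by side with a few lines of NumPy. This is an illustrative sketch: the function names and the leaky ReLU slope of `0.01` are assumptions, not values taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # Small non-zero slope for negative inputs instead of a hard zero.
    return np.where(z > 0, z, slope * z)

z = np.linspace(-5.0, 5.0, 11)                   # -5, -4, ..., 5
activations = {
    "sigmoid": sigmoid(z),                       # values in (0, 1)
    "tanh": np.tanh(z),                          # values in (-1, 1)
    "relu": relu(z),                             # zero for z <= 0
    "leaky_relu": leaky_relu(z),
}
sparsity = np.mean(activations["relu"] == 0.0)   # fraction of inactive units
print(sparsity)
```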


## Depth and Width

- Capacity
- Compositional features


### Number of layers

#### Single hidden layer
- Universal approximation theorem (Hornik, 1991)

*"a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough units"*

- Capacity scales poorly
    - To learn a complex function the model needs exponentially many neurons

- Shallow and deep networks can represent the same functions
- Models with a sequence of layers:
    - Each layer can partition the original input space piecewise linearly
    - Each subsequent layer recognizes pieces of the original input
    - Apply the same computation across different regions

![folding the input space](images/folding-space.png)

- The number of linear segments grows:
    - exponentially with the number of layers
    - polynomially with the number of neurons
    
- Should we use very deep networks for any problem?


## Agenda

- Introduction and Motivation
- Artificial Neuron
- Gradient Descent
- Backpropagation
- Perceptron
- Multilayered Perceptron
- MLP Classification
- Model Design
- **Optimization**

- Convolutional Neural Network
- Recurrent Neural Network

## Gradient Descent

* $\theta \gets \theta - \alpha \nabla_\theta L(\mathbf{x}; \theta)$

- Overfitting
    - Early stopping, learning rate adaptation
    - Weight decay: L1 (lasso) / L2 (ridge) regularization

- Momentum
    * $v \gets \gamma v - \alpha \nabla_\theta L(\mathbf{x}; \theta)$
    * $\theta \gets \theta + v$

- Nesterov momentum
- AdaGrad
- AdaDelta
- Adam
- RMSProp
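A minimal sketch of gradient descent with and without momentum. The one-dimensional loss $L(\theta) = (\theta - 3)^2$ and the hyperparameter values are assumptions chosen for illustration. The code uses the convention $v \gets \gamma v - \alpha \nabla_\theta L$, $\theta \gets \theta + v$; setting `gamma=0` recovers the plain update.

```python
def grad(theta):
    # Gradient of the hypothetical loss L(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, alpha=0.1, gamma=0.0, steps=500):
    v = 0.0
    for _ in range(steps):
        v = gamma * v - alpha * grad(theta)   # velocity accumulates past gradients
        theta = theta + v                     # move along the velocity
    return theta

theta_plain = gradient_descent(0.0)               # gamma = 0: plain gradient descent
theta_momentum = gradient_descent(0.0, gamma=0.9)
print(theta_plain, theta_momentum)
```

Both runs should end close to the minimizer $\theta = 3$.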

## Stochastic Gradient Descent

- Learn in mini-batches
- Reduce the learning rate when the loss plateaus
    - Learning rate adaptation
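The two ideas above can be sketched on a toy linear regression problem. Everything here is an assumption for illustration: the synthetic data, the batch size of 32, and a simple plateau rule that halves the learning rate after `patience` epochs without improvement.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([1.5, -2.0, 0.5])              # hypothetical target weights
y = X @ true_w + 0.01 * rng.normal(size=256)

w = np.zeros(3)
lr, batch_size = 0.1, 32
best_loss, patience, wait = np.inf, 2, 0

for epoch in range(50):
    order = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        residual = X[idx] @ w - y[idx]
        grad = 2.0 * X[idx].T @ residual / len(idx)   # gradient of the batch MSE
        w -= lr * grad
    loss = np.mean((X @ w - y) ** 2)
    if loss < best_loss - 1e-6:
        best_loss, wait = loss, 0
    else:
        wait += 1
        if wait >= patience:                     # loss has plateaued
            lr, wait = lr * 0.5, 0

print(w, lr)
```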


![](images/sgd-methods.gif)

![](images/sgd-methods2.gif)

![](images/sgd-methods3.gif)

![](images/sgd-methods4.gif)

## Vanishing gradient

![Logit](images/sigmoid.png)



### Sigmoid activations 2 layers
![Logit](images/vanish-grad-02.png)

### Sigmoid activations 3 layers
![Logit](images/vanish-grad-03.png)

### Sigmoid activations 4 layers
![Logit](images/vanish-grad-04.png)

### ReLU activations
![Logit](images/vanish-grad-01.png)
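A rough numerical illustration of why depth hurts with sigmoid units. The sigmoid derivative $\phi'(z) = \phi(z)(1 - \phi(z))$ is at most 0.25, so the chain rule multiplies in a factor of at most 0.25 per layer. The sketch below is a simplification: it ignores the weight matrices and evaluates every layer at a fixed, hypothetical pre-activation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # maximal at z = 0, where it equals 0.25

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

# Best case for the sigmoid (z = 0) versus an active ReLU unit (z > 0).
sigmoid_factors = {n: sigmoid_grad(0.0) ** n for n in (2, 3, 4)}
relu_factors = {n: relu_grad(1.0) ** n for n in (2, 3, 4)}
print(sigmoid_factors)
print(relu_factors)
```

Even in the best case the sigmoid factor shrinks geometrically with depth, while the factor for active ReLU units stays at 1.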

## Regularization 
- L1/L2
    - Weights
    - Activations
- Sparsity

- https://keras.io/regularizers/
- https://www.tensorflow.org/api_guides/python/contrib.layers#Regularizers
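A minimal NumPy sketch of L1 and L2 weight penalties and how the L2 term enters a gradient step. The function names, the coefficient `lam`, and the example values are assumptions for illustration; this does not reproduce the Keras or TensorFlow APIs linked above.

```python
import numpy as np

def l1_penalty(w, lam):
    return lam * np.sum(np.abs(w))    # encourages sparse weights

def l2_penalty(w, lam):
    return lam * np.sum(w ** 2)       # shrinks weights towards zero

def l2_penalty_grad(w, lam):
    return 2.0 * lam * w

w = np.array([0.5, -1.0, 2.0])        # hypothetical weights
data_grad = np.zeros_like(w)          # stand-in for the gradient of the data loss
lr, lam = 0.1, 0.01

# One regularized step: with a zero data gradient the weights simply decay.
w_new = w - lr * (data_grad + l2_penalty_grad(w, lam))
print(l1_penalty(w, lam), l2_penalty(w, lam), w_new)
```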



## Initialization

- Depends on the activation function
    - ReLU, small positive weights
    
- (Glorot and Bengio, 2010)
- https://keras.io/initializers/
- https://www.tensorflow.org/api_guides/python/contrib.layers#Initializers
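A sketch of the uniform variant of Glorot initialization, which draws weights from $U(-l, l)$ with $l = \sqrt{6 / (n_{in} + n_{out})}$. The helper name `glorot_uniform` and the layer sizes are assumptions for illustration, not the library implementation.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    # The limit is chosen so that the variance is 2 / (fan_in + fan_out).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = glorot_uniform(256, 128, rng)     # hypothetical layer: 256 inputs, 128 units
print(W.shape, W.var())
```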
    

## Dropout
![Dropout](images/dropout.jpeg)
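A minimal sketch of dropout in its common "inverted" form, where the surviving activations are rescaled during training so that nothing changes at test time. The function signature and the drop rate of 0.5 are assumptions for illustration.

```python
import numpy as np

def dropout(h, rate, rng, training=True):
    """Zero each activation with probability `rate` and rescale the rest."""
    if not training or rate == 0.0:
        return h                               # identity at test time
    keep = 1.0 - rate
    mask = rng.random(h.shape) < keep          # True for units that survive
    return h * mask / keep                     # rescale to preserve the expected value

rng = np.random.default_rng(0)
h = np.ones((1000, 100))                       # hypothetical hidden activations
dropped = dropout(h, rate=0.5, rng=rng)
print(dropped.mean())
```

The mean activation stays close to its original value because the kept units are scaled by `1 / keep`.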