# Components

**Parameters**: the vector of weights learned

**Layers**: a fundamental architectural unit composed of neurons with activation functions

**Activation functions**: linear or nonlinear transformations applied to a unit's pre-activation, i.e. the weighted sum of its inputs plus the bias
- Sigmoid (fallen out of favor in hidden units)
- tanh
- hard tanh
- ReLU

**Loss functions:**
- Squared loss
- Logistic Loss
- Hinge loss
- Negative Log Likelihood

Loss functions fall into 3 categories:
- Classification
- Regression
- Reconstruction

**Optimization methods**:

**Hyperparameters**:




# Hidden layers
## How they work
1. accept a vector of inputs
2. affine transformation z = W.T\*x + b
3. element-wise nonlinear activation function g(z)
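
The three steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name `hidden_layer`, the layer sizes, and the choice of ReLU for `g` are assumptions made for the example.

```python
import numpy as np

def hidden_layer(x, W, b, g):
    """Compute h = g(W.T @ x + b) for a single input vector x."""
    z = W.T @ x + b      # step 2: affine transformation
    return g(z)          # step 3: element-wise nonlinearity

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # step 1: a vector of 3 inputs
W = rng.normal(size=(3, 4))      # 3 inputs feeding 4 hidden units
b = np.zeros(4)
h = hidden_layer(x, W, b, lambda z: np.maximum(0.0, z))
```

Here `W` is stored with one column per hidden unit so that `W.T @ x` matches the formula in step 2.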

## Activation functions
### ReLU
        h = g(W.T*x + b)
        where g(z) = max{0,z}
- popular choice for hidden layers
- not differentiable at z = 0, but software can return a one-sided derivative there
- initialize the bias b to a small positive value such as 0.1, so that units start out active
- cannot learn with gradient-based methods when activation is zero

### absolute value ReLU
- fixes $\alpha_i = -1$ in the generalized form $g(z, \alpha)_i = \max(0, z_i) + \alpha_i\min(0, z_i)$ to get g(z) = |z|

### leaky ReLU
        hi = g(z, alpha)i = max(0, zi) + alphai*min(0, zi)
- alphai is the slope applied when zi < 0
- fixes alpha to a small value like 0.01

### parametric ReLU (PReLU)
- treats alpha as a learnable parameter
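
The ReLU variants above differ only in how the negative-side slope alpha is chosen, which a short NumPy sketch can make concrete. The function name `generalized_relu` and the sample values are assumptions for illustration.

```python
import numpy as np

def generalized_relu(z, alpha):
    """g(z, alpha)_i = max(0, z_i) + alpha_i * min(0, z_i)."""
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
relu = generalized_relu(z, 0.0)         # plain ReLU: alpha = 0
abs_relu = generalized_relu(z, -1.0)    # absolute value ReLU: alpha = -1
leaky = generalized_relu(z, 0.01)       # leaky ReLU: small fixed alpha
```

For PReLU, `alpha` would be an array updated by gradient descent along with the weights rather than a constant.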

### Maxout units
- instead of applying g(z) element-wise, breaks z into groups of k values
- each maxout unit outputs the maximum element of the group
- piecewise linear function that responds to multiple directions in the input x space
- learns the activation function
- has k weight vectors per unit instead of one, so typically needs more regularization
- has redundancy that helps resist catastrophic forgetting
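
A maxout unit's grouping step can be sketched as follows, assuming the groups are consecutive runs of k values in z (the grouping layout and the name `maxout` are assumptions for the example).

```python
import numpy as np

def maxout(z, k):
    """Split z into consecutive groups of k values and return each group's maximum."""
    assert z.size % k == 0, "len(z) must be divisible by k"
    return z.reshape(-1, k).max(axis=1)

z = np.array([1.0, -3.0, 0.5, 2.0, 4.0, -1.0])
h = maxout(z, k=3)   # groups: [1, -3, 0.5] and [2, 4, -1]
```

Each entry of `z` comes from its own weight vector, which is why a maxout layer carries k times as many weights as it has outputs.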

### Sigmoid 
$$       g(z) = \sigma(z) $$
- less popular as hidden units since they saturate across most of their domain, which makes gradient-based learning difficult
- more common outside feed-forward networks, such as in recurrent networks, probabilistic models, and some autoencoders, which have requirements that rule out piecewise linear activation functions

### hyperbolic tangent

$$ g(z) = \tanh(z) = 2\sigma(2z)-1$$
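
The identity between tanh and the sigmoid can be checked numerically. This is only a sanity check on a handful of points; the helper name `sigmoid` is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
lhs = np.tanh(z)
rhs = 2.0 * sigmoid(2.0 * z) - 1.0   # rescaled, shifted sigmoid
```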

### Linear
- if all hidden units are linear, the whole network will be linear
- but these save parameters, and are sometimes used for that reason

### Softmax
- usually an output, sometimes a hidden unit
- represent a probability distribution over a discrete variable with k possible values
- as a hidden unit, usually only used in advanced architectures that learn to manipulate memory

### Radial basis function (RBF)
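$$ h_i = \exp{\left(-\frac{1}{\sigma_i^2}||W_{:,i}-x||^2\right)}$$
- $W_{:,i}$ is the i-th column of W, acting as a template that x is compared against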
        
### Softplus
$$ g(z) = \log(1+\exp(z)) $$
- smooth version of rectifier
- not as good as a rectifier, unintuitively
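
To see in what sense softplus is a smooth version of the rectifier, the two can be compared numerically. In this sketch `np.logaddexp(0, z)` is used as an overflow-safe way to compute log(1 + exp(z)); the function names are assumptions.

```python
import numpy as np

def softplus(z):
    # log(1 + exp(z)), written as logaddexp(0, z) to avoid overflow for large z
    return np.logaddexp(0.0, z)

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-20.0, 0.0, 20.0])
gap = softplus(z) - relu(z)   # largest at z = 0, where it equals log(2)
```

Away from zero the gap shrinks toward 0, so the two functions differ mainly around the kink.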


# Forward propagation

$$ Z = W^TX+b$$

We have X with dimensions $(n_\text{image}, m)$, where $n_\text{image}$ is the number of pixels per image and m is the number of training examples:
$$\begin{equation*}
\mathbf{X} =  \begin{vmatrix}
 &\mathbf{item1} & \mathbf{item2} & ... & \mathbf{m} \\
\mathbf{x_1}& | &  | & ... & |\\
\mathbf{x_2}& X^{(1)} &  X^{(2)} & ... &  X^{(m)}\\
\mathbf{x_3}& | &  | & ... & |
\end{vmatrix}
\end{equation*}$$


$W^T$ is of dimension $(n^l, n^{l-1})$: one row per node in layer $l$ and one column per input from layer $l-1$. Writing $w_{ij}$ for the weight from input $i$ to node $j$, the first hidden layer looks like:


$$\begin{equation*}
\mathbf{W^T} =  \begin{vmatrix}
 &\mathbf{x_1} & \mathbf{x_2} & \mathbf{x_3} \\
\mathbf{node1}& w_{11} & w_{21} & w_{31}\\
\mathbf{node2}& w_{12} & w_{22} & w_{32}\\
\vdots& \vdots&\vdots&\vdots\\
\mathbf{node\,n^1} & w_{1n^1} & w_{2n^1} & w_{3n^1}
\end{vmatrix}
\end{equation*}$$

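and for the subsequent hidden layers, where the inputs are the nodes of layer $l-1$ and $w_{ij}$ is the weight from node $i$ of layer $l-1$ to node $j$ of layer $l$, looks like:

$$\begin{equation*}
\mathbf{W^T} =  \begin{vmatrix}
 &\mathbf{node1^{[l-1]}} & \mathbf{node2^{[l-1]}} & \mathbf{node3^{[l-1]}} \\
\mathbf{node1^{[l]}}& w_{11} & w_{21} & w_{31}\\
\mathbf{node2^{[l]}}& w_{12} & w_{22} & w_{32}\\
\vdots& \vdots&\vdots&\vdots\\
\mathbf{node\,n^{l}} & w_{1n^l} & w_{2n^l} & w_{3n^l}
\end{vmatrix}
\end{equation*}$$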

And finally $b^{[l]}$ is a single column vector of dimensions $(n^l, 1)$, broadcast across the $m$ examples to dimensions $(n^l, m)$:
$$\begin{equation*}
\mathbf{b} =  \begin{vmatrix}
 &\mathbf{item1} & \mathbf{item2} & ...& \mathbf{m} \\
\mathbf{node1}& | &  | & ...& |\\
\mathbf{node2}& b &  b & ... & b\\
\mathbf{node3} & | &  | & ...& | 
\end{vmatrix}
\end{equation*}$$

So that gives Z of dimensions $(n^1, m)$

$$\begin{equation*}
\mathbf{Z^{[1]}} =  \begin{vmatrix}
 &\mathbf{item1} & \mathbf{item2} & ... & \mathbf{m} \\
\mathbf{node1}& | &  | & ...& |\\
\mathbf{node2}& Z^{[1](1)} &  Z^{[1](2)} & ...& Z^{[1](m)}\\
\mathbf{node3}& | &  | & ... & |
\end{vmatrix}
\end{equation*}$$

Now to compute A:

$$A^{[1]}=\sigma(Z^{[1]})$$

$A^{[l]}$ has the same dimensions as Z: $(n^l, m)$:
$$\begin{equation*}
\mathbf{A^{[1]}} =  \begin{vmatrix}
 &\mathbf{item1} & \mathbf{item2} & ... & \mathbf{m} \\
\mathbf{node1}& | &  | & ...& |\\
\mathbf{node2}& A^{[1](1)} &  A^{[1](2)} & ...& A^{[1](m)}\\
\mathbf{node3}& | &  | & ... & |
\end{vmatrix}
\end{equation*}$$
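
The shape bookkeeping above can be checked with a small NumPy sketch. The sizes (3 features, 4 hidden nodes, 5 examples) are hypothetical, and the bias is stored as one column that NumPy broadcasts across the examples.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_1, m = 3, 4, 5              # features, hidden nodes, examples (hypothetical sizes)

X = rng.normal(size=(n_x, m))      # one column per example
W1 = rng.normal(size=(n_x, n_1))   # so that W1.T has shape (n_1, n_x)
b1 = np.zeros((n_1, 1))            # one bias per node, broadcast across the m columns

Z1 = W1.T @ X + b1                 # shape (n_1, m)
A1 = 1.0 / (1.0 + np.exp(-Z1))     # sigmoid applied element-wise, same shape as Z1
```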

# Cost function

The cost function is calculated from the loss between $A^{[L]}$, the final output layer activation (a.k.a. $\widehat{y}$), and $y$, the target variable.  Note that the parameters of all layers 1 to $L$ enter the cost, but the sum runs over the $m$ training examples and is normalized by $m$:

$$ J(W^{[1]}, b^{[1]},..., W^{[L]}, b^{[L]}) = \frac{1}{m}\sum^m_{i=1}\mathscr{L}(\widehat{y}^{(i)}, y^{(i)})$$

Where $\mathscr{L}(\widehat{y}, y) = -\sum^{n_{labels}}_{j=1} y_j\log (a_j)$, with $a_j = \widehat{y}_j$ the output activation for label $j$
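
As a rough sketch, the cost can be computed in NumPy as the per-example cross-entropy averaged over the examples. The function name, the small `eps` guard against log(0), and the toy data are assumptions for illustration.

```python
import numpy as np

def cross_entropy_cost(A, Y, eps=1e-12):
    """A, Y have shape (n_labels, m); Y is one-hot, A holds predicted probabilities."""
    m = Y.shape[1]
    losses = -np.sum(Y * np.log(A + eps), axis=0)   # one loss per example
    return losses.sum() / m                          # average over the m examples

Y = np.array([[1, 0], [0, 1], [0, 0]], dtype=float)  # 3 labels, 2 examples
A = np.array([[0.7, 0.2], [0.2, 0.5], [0.1, 0.3]])
J = cross_entropy_cost(A, Y)
```

Only the predicted probability of the true label contributes to each example's loss, here 0.7 and 0.5.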

# Initialization
### For ReLU
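- set the variance of $W^{[l]}$ to $\frac{2}{n^{l-1}}$, where $n^{l-1}$ is the number of inputs to the layer (He initialization)
- can be done by multiplying a standard normal random variable by the square root of the variance:
$$ W^{[l]} = \text{np.random.randn(shape)}\times\sqrt{\frac{2}{n^{l-1}}}$$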

### For tanh
- use Xavier initialization (with 1 on top instead of 2):
$$\sqrt{\frac{1}{n^{l-1}}}$$
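
Both scalings can be sketched in one helper. The function name `init_weights`, the `activation` switch, and the layer sizes are assumptions, and the weight matrix is stored with shape (inputs, nodes) to match $Z = W^TX + b$.

```python
import numpy as np

def init_weights(n_prev, n_curr, activation="relu", rng=None):
    """Return W of shape (n_prev, n_curr) scaled for the given activation."""
    rng = np.random.default_rng() if rng is None else rng
    numerator = 2.0 if activation == "relu" else 1.0   # 2 for ReLU, 1 for tanh
    return rng.standard_normal((n_prev, n_curr)) * np.sqrt(numerator / n_prev)

W = init_weights(n_prev=500, n_curr=400, activation="relu",
                 rng=np.random.default_rng(0))
```

With 500 inputs, the sample variance of `W` should land close to 2/500 = 0.004.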

# Regularization


### L1
- sparsifies the weight matrix by driving some weights to exactly zero


### L2, aka "weight decay"
- reduces weights to small numbers where appropriate

Cost function is updated to be: 
$$ J = \frac{1}{m}\sum^m_{i=1}\mathscr{L}(\widehat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum^{L}_{l=1} \lVert W^{[l]}\rVert_F^2$$

Where $\lVert W^{[l]}\rVert_F^2 = \sum^{n^{[l-1]}}_{i=1} \sum^{n^{[l]}}_{j=1} \left(W_{ij}^{[l]}\right)^2$

Which gets used in backpropagation for the derivative w.r.t. W:

$$dW^{[l]} = dW^{[l]} + \frac{\lambda}{m}W^{[l]}$$
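
The penalty term and its contribution to the gradient can be sketched as two small helpers. The names `l2_penalty` and `l2_grad` and the toy weights are assumptions for illustration.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """(lambda / 2m) * sum of squared Frobenius norms over all layers."""
    return lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)

def l2_grad(dW, W, lam, m):
    """Add the derivative of the penalty, (lambda / m) * W, to the data gradient."""
    return dW + lam / m * W

W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])
penalty = l2_penalty([W1, W2], lam=0.1, m=10)       # 0.005 * 15.25
dW2 = l2_grad(np.zeros_like(W2), W2, lam=0.1, m=10)
```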

### dropout
#### Inverted dropout is the most common
- choose a "keep probability" and generate random numbers with the shape of the nodes for that layer
- comparing them to the keep probability gives a mask of True and False, the latter nullifying that node when you multiply by the A matrix (activations)
- in order to keep the expected scale of the output unchanged, you need to scale up the signal from non-dropped nodes
- so divide the activations by the "keep probability" to scale up

$$ g(Z)_{\text{dropout}} = \frac{\text{on/off node vector}}{\text{keep probability}}\times g(Z)$$
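
A minimal sketch of inverted dropout on a matrix of activations follows. The function name, the all-ones activations, and the keep probability of 0.8 are assumptions chosen so the rescaling is easy to see.

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob, then rescale the survivors."""
    mask = rng.random(A.shape) < keep_prob   # True = keep, False = drop
    return A * mask / keep_prob, mask

rng = np.random.default_rng(0)
A = np.ones((100, 200))
A_drop, mask = inverted_dropout(A, keep_prob=0.8, rng=rng)
```

Surviving activations become 1 / 0.8 = 1.25, so the mean of `A_drop` stays close to the original mean of 1. The mask is returned because the same mask is normally reused in the backward pass, and dropout is switched off at test time.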

### dropconn

### early stopping
- Watch training and dev set error together
- Training error will continue to decrease
- But at some point, dev set error will start to rise due to overfitting
- Choose snapshot at # iterations representing dev set error minimum
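
Picking the snapshot can be sketched as a scan over recorded dev-set errors. The error values are made up for illustration, and the `patience` rule (stop after a fixed number of checkpoints without improvement) is an assumption that is not part of the notes above.

```python
def early_stopping_index(dev_errors, patience=2):
    """Return the index of the best checkpoint, stopping once the dev error
    has failed to improve for `patience` consecutive checkpoints."""
    best, best_i = float("inf"), -1
    for i, err in enumerate(dev_errors):
        if err < best:
            best, best_i = err, i
        elif i - best_i >= patience:
            break
    return best_i

# Hypothetical dev-set errors, one per checkpoint
dev_errors = [0.40, 0.31, 0.25, 0.22, 0.23, 0.26, 0.30]
best_checkpoint = early_stopping_index(dev_errors)
```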

#### Pros: 
- computationally cheap: a single training run, rather than retraining for many values of lambda
#### Cons:
- this doesn't allow orthogonalization:
  - want to be able to control overfitting and underfitting separately
  - underfitting (optimizing cost function J) is addressed with the optimizer and hyperparameters such as learning rate and momentum
  - overfitting is addressed with regularization, augmentation, etc.
  - early stopping couples the two, because halting training affects both at once
- instead, try L2 regularization with different values of lambda

### image augmentation


# Back-propagation

Need to compute:

$$
\begin{aligned}
\frac{d\mathscr{L}(a,y)}{dA^{[L]}} &= -\frac{y}{a}+\frac{1-y}{1-a} \text{ for logistic regression},\\
\frac{d\mathscr{L}(a,y)}{dZ^{[L]}} &= \frac{d\mathscr{L}}{dA^{[L]}}\times \frac{dA^{[L]}}{dZ^{[L]}} = \left(-\frac{y}{a}+\frac{1-y}{1-a}\right)\times a(1-a) = a-y \text{ for logistic regression, and }\\
\frac{d\mathscr{L}(a,y)}{dW^{[L]}} &= X \frac{d\mathscr{L}}{dZ^{[L]}}, \quad \frac{d\mathscr{L}(a,y)}{db^{[L]}}=\frac{d\mathscr{L}(a,y)}{dZ^{[L]}}
\end{aligned}
$$

And this continues all the way back through the network.  For each layer $l$, once the gradients with respect to $W^{[l]}$ and $b^{[l]}$ are computed, the parameters are updated.
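
For logistic regression, the chain rule product $\frac{d\mathscr{L}}{dA}\times\frac{dA}{dZ}$ simplifies to $a - y$, using the sigmoid derivative $a(1-a)$. A quick numerical check on a single example is sketched below; the values of `z` and `y` are arbitrary and the helper names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 0.3, 1.0
a = sigmoid(z)
dL_da = -y / a + (1 - y) / (1 - a)
da_dz = a * (1 - a)                  # derivative of the sigmoid
analytic = dL_da * da_dz             # chain rule; simplifies to a - y
h = 1e-6
numeric = (loss(z + h, y) - loss(z - h, y)) / (2 * h)   # central finite difference
```

Comparing an analytic gradient against a finite difference like this is the basic idea behind gradient checking.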

Now to update with learning rate and L2 regularization:


$$ W^{[L]} := W^{[L]} - \alpha\left(\frac{d\mathscr{L}(a,y)}{dW^{[L]}} +  \frac{\lambda}{m}W^{[L]}\right)$$

Which rearranges to:

$$ W^{[L]} := \left(1-\frac{\alpha \lambda}{m}\right)W^{[L]} - \alpha\frac{d\mathscr{L}(a,y)}{dW^{[L]}}$$
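
The name "weight decay" comes from the rearranged form, in which the weights are shrunk by a factor slightly below 1 before the usual gradient step. The sketch below checks numerically that a gradient step with the L2 term and the weight-decay form agree; all numbers are hypothetical.

```python
import numpy as np

alpha, lam, m = 0.1, 0.5, 20                  # hypothetical values
W = np.array([[1.0, -2.0], [0.5, 3.0]])
dW = np.array([[0.2, 0.1], [-0.3, 0.4]])      # gradient without the L2 term

update_a = W - alpha * (dW + lam / m * W)           # gradient step including the L2 term
update_b = (1 - alpha * lam / m) * W - alpha * dW   # weight-decay form
```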

# Mini-batch
- split X and y into batches of equal sizes
- # iterations per epoch is equal to the number of mini batches
- a single pass through the training set is 1 epoch
#### Batch gradient descent
- when you set the mini-batch size to m, the entire training set
- each iteration takes too long on a large training set
#### Stochastic gradient descent
- when you set the mini-batch size to 1
- lose speedup from vectorization
#### Best mini-batch size
- if small training set (m<=2000), use batch gradient descent
- choose a power of 2
- mini-batch must fit in CPU/GPU memory
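
Splitting the data can be sketched as a generator over column indices. Shuffling before splitting is an assumption here, and when the batch size does not divide the number of examples the last batch is simply smaller.

```python
import numpy as np

def mini_batches(X, Y, batch_size, rng):
    """Yield shuffled (X_batch, Y_batch) pairs; examples are stored as columns."""
    m = X.shape[1]
    perm = rng.permutation(m)
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]
        yield X[:, idx], Y[:, idx]

rng = np.random.default_rng(0)
X = np.arange(3 * 10, dtype=float).reshape(3, 10)   # 3 features, 10 examples
Y = np.arange(10).reshape(1, 10)
batches = list(mini_batches(X, Y, batch_size=4, rng=rng))   # batch sizes 4, 4, 2
```

One pass over `batches` is one epoch, and each batch corresponds to one gradient descent iteration.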

# Improving Gradient descent

### Momentum
- with each iteration t, compute usual derivatives on current mini-batch
- compute the exponentially weighted average:
$$V_{dW} = \beta V_{dW} + (1-\beta) dW$$
- allows derivatives over several steps to be averaged out to take larger steps in a direction that was consistent over last few steps, and smaller steps where direction changed for each step
- think of dW as acceleration and $V_{dW}$ as velocity, with $\beta$ being friction

Therefore, updates look like:
$$ W = W - \alpha V_{dW}$$
**Try $\beta = 0.9$**, which roughly averages over the last 10 iterations (about $\frac{1}{1-\beta}$ steps)
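
The momentum update can be sketched as a single step function. The constant gradient is a made-up stand-in for a real mini-batch gradient, and the function name and learning rate are assumptions.

```python
import numpy as np

def momentum_step(W, dW, V, alpha=0.01, beta=0.9):
    """One update: V = beta*V + (1-beta)*dW, then W = W - alpha*V."""
    V = beta * V + (1 - beta) * dW
    return W - alpha * V, V

W, V = np.array([1.0, -1.0]), np.zeros(2)
dW = np.array([0.5, 0.5])          # hypothetical constant gradient
for _ in range(3):
    W, V = momentum_step(W, dW, V)
```

With a constant gradient, `V` after t steps equals `(1 - beta**t) * dW`, so the velocity builds up gradually toward the gradient itself.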


### RMSprop
- exponentially weighted average of the squares of the derivatives
$$S_{dW} = \beta S_{dW} + (1-\beta) dW^2$$
$$ W = W - \alpha \frac{dW}{\sqrt{S_{dW}}}$$
- add a small epsilon (e.g. $10^{-8}$) to the square root in the denominator to avoid numerical blow-ups
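
A minimal sketch of one RMSprop step, with epsilon added to the square root in the denominator. The function name, default values, and toy gradient are assumptions for illustration.

```python
import numpy as np

def rmsprop_step(W, dW, S, alpha=0.001, beta=0.9, eps=1e-8):
    """One update: S tracks the running average of dW**2 and rescales the step."""
    S = beta * S + (1 - beta) * dW ** 2
    W = W - alpha * dW / (np.sqrt(S) + eps)
    return W, S

W0 = np.array([1.0, 1.0])
dW = np.array([10.0, 0.1])         # one steep and one shallow direction
W1, S = rmsprop_step(W0, dW, np.zeros(2))
step = W0 - W1
```

Although the two gradient components differ by a factor of 100, the resulting steps are nearly identical, which is how RMSprop damps oscillation along steep directions.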

### Adam optimization algorithm
- puts momentum and RMSprop together
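
As a rough sketch, one Adam step keeps both a momentum average and an RMSprop average. The bias-correction terms and the default values (0.9, 0.999, 1e-8) follow the commonly published form of Adam and are not stated in the notes above; the function name and toy values are assumptions.

```python
import numpy as np

def adam_step(W, dW, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration count."""
    V = beta1 * V + (1 - beta1) * dW          # momentum term
    S = beta2 * S + (1 - beta2) * dW ** 2     # RMSprop term
    V_hat = V / (1 - beta1 ** t)              # bias correction for early iterations
    S_hat = S / (1 - beta2 ** t)
    W = W - alpha * V_hat / (np.sqrt(S_hat) + eps)
    return W, V, S

W = np.array([1.0, -2.0])
dW = np.array([0.3, -0.7])
W_new, V, S = adam_step(W, dW, np.zeros(2), np.zeros(2), t=1)
```

On the first iteration the corrected averages reduce to `dW` and `dW**2`, so each weight moves by about `alpha` in the direction opposite its gradient's sign.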

### Learning rate decay
- slowly reduce learning rate as training approaches convergence
- below, the initial learning rate $\alpha_0$, decay rate and epoch number are used:

$$ \alpha = \frac{\alpha_0}{1+r_{decay}\times n_{epoch}}$$

- or exponential decay of learning rate
- or step functions 
- or manual decay (watching model as it trains)
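
The inverse decay schedule can be sketched as a one-line function. The name `alpha0` for the initial learning rate that the decay scales, and the example values, are assumptions.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    """alpha = alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1 + decay_rate * epoch)

# Hypothetical schedule: alpha0 = 0.2, decay rate = 1.0, epochs 0 to 3
schedule = [decayed_learning_rate(0.2, 1.0, e) for e in range(4)]
```

With these values the learning rate halves after the first epoch and then falls off more slowly.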

# How to tune hyperparameters
#### most important
- alpha
- beta (momentum)
- mini-batch size
- # hidden units

#### second tier
- # layers
- learning rate decay

#### choose a sample randomly
- don't do a grid (too many points)
- try out a random set of points in hyperparameter space
- use a coarse to fine scheme (zoom into a region that was working well in last sample)
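
Random sampling with a coarse-to-fine pass can be sketched as below for two hyperparameters, the number of hidden units and the number of layers. The ranges, the sample count, and the "promising" region chosen for the fine pass are all hypothetical.

```python
import numpy as np

def sample_points(n, units_range, layers_range, rng):
    """Draw n random (hidden units, layers) pairs from an inclusive box."""
    units = rng.integers(units_range[0], units_range[1] + 1, size=n)
    layers = rng.integers(layers_range[0], layers_range[1] + 1, size=n)
    return list(zip(units.tolist(), layers.tolist()))

rng = np.random.default_rng(0)
coarse = sample_points(20, (50, 200), (1, 5), rng)
# Suppose the best coarse results clustered around 100-150 units and 2-3 layers:
fine = sample_points(20, (100, 150), (2, 3), rng)
```

Unlike a grid, 20 random points try up to 20 distinct values of each hyperparameter, which matters when only one of them strongly affects the result.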

#### choose the right scale
- 