# Hyperparameter Tuning
## Hyperparameters - list by importance
1. $\alpha$, learning rate
2. $\beta$, momentum term; number of hidden units; mini-batch size
3. number of layers; learning-rate decay
4. Adam parameters: $\beta_1$, momentum term; $\beta_2$, RMS term; $\epsilon$, numerical-stability term

## Hyperparameter search
Sample random points instead of using a grid, so that every trial tests a new value of each hyperparameter.

Then keep sampling randomly within the smaller region that performed best (coarse to fine).
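The coarse-to-fine idea can be sketched as follows. This is a minimal illustration, not a full tuning loop: `toy_score` is a hypothetical stand-in for training a model and measuring validation performance, and the search box, the 25 samples per pass, and the `half_width` of the fine region are all assumed values.

```python
import numpy as np

def sample_uniform(rng, low, high, n):
    """Draw n points uniformly from the box [low, high], one row per point."""
    low, high = np.asarray(low, dtype=float), np.asarray(high, dtype=float)
    return low + (high - low) * rng.random((n, low.size))

def toy_score(points):
    # Hypothetical validation score that peaks at (0.3, 0.7).
    # In practice this would be a full train-and-evaluate run.
    return -np.sum((points - np.array([0.3, 0.7])) ** 2, axis=1)

rng = np.random.default_rng(0)

# Coarse pass: random points over the whole search box.
coarse = sample_uniform(rng, [0.0, 0.0], [1.0, 1.0], 25)
best_coarse = coarse[np.argmax(toy_score(coarse))]

# Fine pass: random points in a smaller box centred on the best coarse point.
half_width = 0.1  # assumed size of the zoomed-in region
fine_low = np.clip(best_coarse - half_width, 0.0, 1.0)
fine_high = np.clip(best_coarse + half_width, 0.0, 1.0)
fine = sample_uniform(rng, fine_low, fine_high, 25)

# Keep the best point seen in either pass.
candidates = np.vstack([coarse, fine])
best_overall = candidates[np.argmax(toy_score(candidates))]
```

With random sampling, the 25 coarse trials try 25 distinct values of each hyperparameter, whereas a 5 x 5 grid would try only 5.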

### Appropriate Scale
For a range such as [0.0001, 1], sample uniformly on a log scale, so that each decade (0.0001 to 0.001, 0.001 to 0.01, and so on) is equally likely.

`r = -4 * np.random.rand()`, so $r \in [-4, 0]$

$\alpha = 10^r$, so $\alpha \in [10^{-4}, 1]$

In general, draw $r$ uniformly from $[a, b]$, where the boundaries $a$ and $b$ are the base-10 logarithms of the limits of the range
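A small sketch of this sampling scheme, assuming NumPy's `default_rng` generator and a hypothetical helper name `sample_log_uniform`. It also counts how many samples land in each decade, to show that the decades are covered about equally.

```python
import numpy as np

def sample_log_uniform(low, high, size, rng):
    """Sample values whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    a, b = np.log10(low), np.log10(high)
    r = rng.uniform(a, b, size=size)  # exponent r in [a, b]
    return 10.0 ** r

rng = np.random.default_rng(0)
alphas = sample_log_uniform(1e-4, 1.0, size=10_000, rng=rng)

# Fraction of samples falling in each decade; each should be close to 0.25.
edges = np.array([1e-4, 1e-3, 1e-2, 1e-1, 1.0])
counts, _ = np.histogram(alphas, bins=edges)
fractions = counts / alphas.size
```

Sampling $\alpha$ uniformly on a linear scale instead would place about 90% of the samples in $[0.1, 1]$.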

For $\beta$ in a range such as $0.9 \sim 0.999$, apply the log scale to $1-\beta$ instead

In this case, $r \in [-3, -1]$, $\beta = 1 - 10^r$, so $\beta \in [0.9, 0.999]$
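As a quick sketch of the $\beta$ case (the sample count of 5 is arbitrary), the snippet below also computes $1/(1-\beta)$, the approximate number of steps an exponentially weighted average covers. This is why the scale is taken on $1-\beta$: a small change in $\beta$ near 1 changes that window a lot.

```python
import numpy as np

rng = np.random.default_rng(0)

r = rng.uniform(-3.0, -1.0, size=5)  # exponent r in [-3, -1]
beta = 1.0 - 10.0 ** r               # beta in [0.9, 0.999]

# Approximate averaging window of an exponentially weighted average.
window = 1.0 / (1.0 - beta)          # between 10 and 1000 steps

# Sensitivity near 1: the same shift of 0.0005 in beta
# barely matters at 0.9 but doubles the window at 0.999.
window_low = (1.0 / (1.0 - 0.9), 1.0 / (1.0 - 0.9005))      # about 10 -> 10.05
window_high = (1.0 / (1.0 - 0.999), 1.0 / (1.0 - 0.9995))   # about 1000 -> 2000
```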

## Two approaches to tune a model
1. Panda: babysit a single model, gradually adjusting each hyperparameter as it trains; suited to limited computational resources
2. Caviar: train many models in parallel with different sets of hyperparameters and pick the one that performs best

## Batch Normalization - Normalizing Activations in a Network
To stabilize the training of later layers, keep the mean and standard deviation of their inputs fixed even when the distribution of the earlier activations changes

Just as we normalize the input $X$, normalize $Z^{[l]}$ to improve the learning of $W^{[l+1]}$ and $b^{[l+1]}$

Take $Z^{[l]} = [z^{(1)}, z^{(2)}, ..., z^{(m)}]$

$\mu = \frac{1}{m}\sum_{i=1}^{m} z^{(i)}$, where the sum runs over the $m$ examples and $\mu$ is a vector of shape $(n^{[l]}, 1)$

$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(z^{(i)} - \mu)^2$, computed element-wise

$z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$, which has mean = 0 and standard deviation = 1

$\widetilde{z}^{(i)} = \gamma z^{(i)}_{norm} + \beta$, which has mean = $\beta$ and standard deviation = $\gamma$, where $\gamma$ and $\beta$ are learnable parameters; if $\gamma = \sqrt{\sigma^2 + \epsilon}$ and $\beta = \mu$, then $\widetilde{z}^{(i)} = z^{(i)}$, i.e. the normalization is undone
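The formulas above can be sketched in NumPy as a forward pass for one layer. This is a minimal illustration under assumptions: the function name `batch_norm_forward`, the layer size of 4 units, the batch of 64 examples, and the value of `EPS` are all chosen for the example, and no backward pass is shown.

```python
import numpy as np

EPS = 1e-8  # assumed value for the epsilon term

def batch_norm_forward(Z, gamma, beta, eps=EPS):
    """Z has shape (n_l, m), one column per example; gamma and beta have shape (n_l, 1)."""
    mu = Z.mean(axis=1, keepdims=True)    # per-unit mean over the m examples
    var = Z.var(axis=1, keepdims=True)    # per-unit (1/m) * sum (z - mu)^2
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta       # rescale to std gamma, shift to mean beta
    return Z_tilde, Z_norm, mu, var

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=2.0, size=(4, 64))
gamma = np.full((4, 1), 1.5)
beta = np.full((4, 1), -0.5)

Z_tilde, Z_norm, mu, var = batch_norm_forward(Z, gamma, beta)

# Choosing gamma = sqrt(var + eps) and beta = mu recovers the original Z.
Z_back, _, _, _ = batch_norm_forward(Z, np.sqrt(var + EPS), mu)
```

Each row of `Z_norm` has mean about 0 and standard deviation about 1, each row of `Z_tilde` has mean $-0.5$ and standard deviation about $1.5$, and `Z_back` matches `Z` up to rounding.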

$\beta^{[l]}$ is a batch normalization parameter, unrelated to the optimization hyperparameters $\beta, \beta_1, \beta_2$

$\epsilon$ prevents division by zero when $\sigma^2 = 0$

$b^{[l]}$ has no effect under batch normalization: adding the same constant $b^{[l]}_i$ to every entry in a row of $Z^{[l]}$ is cancelled when the mean is subtracted, so the z-score doesn't change
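A short numerical check of this cancellation, assuming a hypothetical layer with 3 units, 5 inputs, and a batch of 32 random examples:

```python
import numpy as np

def normalize(Z, eps=1e-8):
    """Per-unit z-score over the batch; Z has shape (n_l, m)."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5))        # weights of the layer
A_prev = rng.normal(size=(5, 32))  # activations of the previous layer
b = rng.normal(size=(3, 1))        # bias, one constant per row of Z

without_bias = normalize(W @ A_prev)
with_bias = normalize(W @ A_prev + b)

# The bias shifts each row's mean by the same amount it shifts each entry,
# so it drops out of (Z - mu) and the two results agree up to rounding.
max_difference = np.max(np.abs(with_bias - without_bias))
```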

Batch norm also has a slight regularization effect, similar to dropout, since the mini-batch estimates of $\mu$ and $\sigma^2$ add noise to each unit

### Batch Norm at Test Time
At test time, examples may arrive one at a time instead of in mini-batches

In this case, use values of $\mu$ and $\sigma^2$ derived from training

For layer $l$, each mini-batch $X^{\{t\}}$ yields $\mu^{\{t\}[l]}$ and $\sigma^{2\{t\}[l]}$

Set $\mu$ to the exponentially weighted average of $\mu^{\{t\}[l]}$ and $\sigma^2$ to the exponentially weighted average of $\sigma^{2\{t\}[l]}$ across mini-batches, then use them to calculate $z^{(i)}_{norm}$ and $\widetilde{z}^{(i)}$
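One way to keep these running statistics for a single layer is sketched below. The class name `RunningBatchNormStats`, the averaging weight `momentum=0.9`, the initial values, and the simulated mini-batches are assumptions made for the illustration; it tracks the running variance and takes the square root when normalizing.

```python
import numpy as np

class RunningBatchNormStats:
    """Exponentially weighted averages of per-batch mean and variance for one layer."""

    def __init__(self, n_units, momentum=0.9):
        self.momentum = momentum            # assumed weight on the previous average
        self.mu = np.zeros((n_units, 1))    # assumed initial running mean
        self.var = np.ones((n_units, 1))    # assumed initial running variance

    def update(self, Z):
        """Call once per training mini-batch; Z has shape (n_units, m)."""
        batch_mu = Z.mean(axis=1, keepdims=True)
        batch_var = Z.var(axis=1, keepdims=True)
        self.mu = self.momentum * self.mu + (1 - self.momentum) * batch_mu
        self.var = self.momentum * self.var + (1 - self.momentum) * batch_var

    def normalize(self, z, gamma, beta, eps=1e-8):
        """Test time: works for a single example z of shape (n_units, 1)."""
        z_norm = (z - self.mu) / np.sqrt(self.var + eps)
        return gamma * z_norm + beta

rng = np.random.default_rng(0)
stats = RunningBatchNormStats(n_units=4)

# Simulated training: 200 mini-batches drawn with mean 3 and standard deviation 2.
for _ in range(200):
    stats.update(rng.normal(loc=3.0, scale=2.0, size=(4, 64)))

# Test time: one example at a time, normalized with the running statistics.
gamma = np.ones((4, 1))
beta = np.zeros((4, 1))
z_test = rng.normal(loc=3.0, scale=2.0, size=(4, 1))
z_tilde = stats.normalize(z_test, gamma, beta)
```

After the simulated training, `stats.mu` is close to 3 and `stats.var` is close to 4 for every unit.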

(Note: $\mu$ and $\sigma^2$ are tracked separately for each layer $l$, with one value per hidden unit; they are not constants shared across layers.)