# Practical Aspects of Deep Learning


## Setting up your Machine Learning Application

### Train / Dev / Test sets

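Applied ML is an iterative process (idea, code, experiment) over choices such as:\
number of layers\
number of hidden units\
learning rate\
activation functions

A well-chosen train/dev/test split makes each iteration faster.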

Data (traditional split for small datasets):\
train set: 60%\
hold-out cross-validation (development, "dev") set: 20%\
test set: 20%

For big data (1M+ examples):\
train set: 98%\
dev set: 1%\
test set: 1%
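
A minimal sketch of a shuffled 98/1/1 split for a large dataset. The function name `split_indices` and its parameters are illustrative assumptions, not part of the notes.

```python
import numpy as np

def split_indices(m, dev_frac, test_frac, seed=0):
    """Shuffle the indices 0..m-1 and split them into train/dev/test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    dev = perm[:n_dev]
    test = perm[n_dev:n_dev + n_test]
    train = perm[n_dev + n_test:]          # everything left goes to training
    return train, dev, test

# 1M examples: 98% train, 1% dev, 1% test
train_idx, dev_idx, test_idx = split_indices(1_000_000, dev_frac=0.01, test_frac=0.01)
```

Dev and test sets should come from the same distribution, which shuffling before splitting helps to ensure.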

### Bias / Variance

High bias (underfitting):\
train set error: 15%\
dev set error: 16%\
high error on both sets, with only a small gap between them

High variance (overfitting):\
train set error: 1%\
dev set error: 11%\
low error on train set, high error on dev set

High bias and high variance:\
train set error: 15%\
dev set error: 30%\
high error on train set, and much higher error on dev set

Low bias and low variance:\
train set error: 0.5%\
dev set error: 1%\
low error on both sets
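
The diagnosis above can be sketched as a small helper. The threshold `tol` and the assumption that the optimal (Bayes) error is about 0% are illustrative choices, not values from the notes.

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.05):
    """Return (high_bias, high_variance) for error rates given as fractions.

    bayes_err and tol are assumed values; tune them for the task at hand.
    """
    high_bias = (train_err - bayes_err) > tol      # poor fit on the training set
    high_variance = (dev_err - train_err) > tol    # large train-to-dev gap
    return high_bias, high_variance

overfit = diagnose(0.01, 0.11)    # train 1%, dev 11%
both = diagnose(0.15, 0.30)       # train 15%, dev 30%
neither = diagnose(0.005, 0.01)   # train 0.5%, dev 1%
```

If humans also make around 15% error on the task, a 15% training error would not indicate high bias, which is why `bayes_err` is a parameter.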

### Basic Recipe for Machine Learning

High bias (underfitting):
- bigger network
- train longer
- better optimization algorithm (e.g. Adam, RMSprop)
- NN architecture search

High variance (overfitting):
- more data
- data augmentation
- regularization (L2, dropout, early stopping)
- batch normalization (only a slight regularizing side effect)
- NN architecture search


## Regularizing your Neural Network

### Regularization

+ L2 regularization adds a penalty on the weights to the cost being minimized:
$$ \min_{W, b} J(W, b) $$
$$J(W,b)=\frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} ||W^{[l]}||_{F}^{2}$$
$$\text{Frobenius norm: }||W^{[l]}||_{F}^{2} = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (W_{ij}^{[l]})^{2} $$
$$ dW^{[l]} = \text{from backprop} + \frac{\lambda}{m}W^{[l]} $$
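
A minimal NumPy sketch of the penalty term and its gradient. Differentiating the $\frac{\lambda}{2m}||W^{[l]}||_F^2$ penalty gives $\frac{\lambda}{m}W^{[l]}$. The function names and the example matrices are illustrative assumptions.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambda / 2m) * sum of squared Frobenius norms of all weight matrices."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_grad(W, lambd, m):
    """Gradient of the penalty with respect to one weight matrix W."""
    return (lambd / m) * W

# Toy two-layer example (values chosen only for illustration)
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])
penalty = l2_penalty([W1, W2], lambd=0.7, m=10)
dW1_reg = l2_grad(W1, lambd=0.7, m=10)   # added to the dW1 from backprop
```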

### Why Regularization Helps

+ Intuition
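  + If $\lambda$ is too large, the weights are pushed toward zero, $z$ stays in the near-linear region of the activation (e.g. tanh), and the network behaves almost like a linear model (underfitting)
  + If $\lambda$ is well chosen, the network will be less prone to overfitting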
  + $\lambda$ is a hyperparameter that you can tune using a dev set


### Dropout Regularization

+ Dropout
  + At each iteration, shut down each neuron of a layer with probability $1 - keep\_prob$ or keep it with probability $keep\_prob$
  + At test time, don't do anything (don't apply dropout)
+ Implementation details (inverted dropout)
  + At training time, divide the kept activations by $keep\_prob$ so that their expected value is unchanged; no scaling is then needed at test time
  + Ex. $d^{[3]} = np.random.rand(a^{[3]}.shape[0], a^{[3]}.shape[1]) < keep\_prob$
  + $a^{[3]} = np.multiply(a^{[3]}, d^{[3]})$
  + $a^{[3]} /= keep\_prob$
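
A minimal sketch of inverted dropout for one layer, assuming the scaling is done at training time so that test time needs no change. The function name `dropout_forward` and the `training` flag are illustrative assumptions.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng, training=True):
    """Apply inverted dropout to activations a; return (a_out, mask)."""
    if not training:
        return a, None                     # test time: no dropout, no scaling
    d = rng.random(a.shape) < keep_prob    # True with probability keep_prob
    a = a * d                              # shut down the dropped units
    a = a / keep_prob                      # rescale to keep the expected value
    return a, d

rng = np.random.default_rng(0)
a3 = np.ones((50, 2000))
a3_train, d3 = dropout_forward(a3, keep_prob=0.8, rng=rng)
a3_test, _ = dropout_forward(a3, keep_prob=0.8, rng=rng, training=False)
```

The mask `d` is returned because the same mask has to be applied to the gradients in the backward pass.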

### Understanding Dropout

+ Intuition
  + Dropout randomly shuts down some neurons on each iteration
  + $d^{[3]}$ is a matrix of the same shape as $a^{[3]}$
  + Each entry of $d^{[3]}$ is 1 with probability $keep\_prob$ and 0 with probability $1 - keep\_prob$
  + Ex. $keep\_prob = 0.8 \rightarrow$ 80% chance that a neuron will be kept
  + $\rightarrow$ At training time, divide the kept activations by 0.8 to keep the same expected value for the activations; at test time, no scaling is needed


## Other Regularization Methods

### Data Augmentation
- Mirroring
- Random Cropping
- Rotation
- Shearing
- Local Warping
- Color Shifting
- ...
- Apply the methods above to the training set only, not to the dev/test sets
- The augmented data must still be reasonable
- The augmented data must be label preserving
- The augmented data must be cheap to generate
- The augmented data must not hurt performance of the model
- The augmented data must be generated by a computer
- The augmented data must not be too easy to classify
- The augmented data must not be too difficult to classify
- ...
- State-of-the-art object detection, speech recognition, and image recognition algorithms use data augmentation
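
A minimal NumPy sketch of two of the label-preserving transforms listed above (mirroring and random cropping). The function name `augment`, the 50% flip probability, and the image sizes are illustrative assumptions.

```python
import numpy as np

def augment(image, crop_size, rng):
    """Randomly mirror an H x W x C image, then take a random square crop."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                  # horizontal mirror
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)       # upper bound is exclusive
    left = rng.integers(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size, :]

rng = np.random.default_rng(0)
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
patch = augment(image, crop_size=28, rng=rng)      # training images only
```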

### Early Stopping  
- Stop training when the error on the dev set starts to increase (after some epoch)
- It is a regularization technique
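
A minimal sketch of the stopping rule, given the dev-set error recorded after each epoch. The `patience` parameter (how many epochs without improvement to tolerate) is an assumption added for illustration; the notes only say to stop once the dev error starts to increase.

```python
def early_stopping_epoch(dev_errors, patience=2):
    """Return the index of the epoch whose parameters should be kept."""
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(dev_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err      # new best: remember it
        elif epoch - best_epoch >= patience:
            break                                  # no improvement for too long
    return best_epoch

dev_errors = [0.30, 0.22, 0.18, 0.17, 0.19, 0.21, 0.16]
stop_at = early_stopping_epoch(dev_errors, patience=2)   # epoch 3 (error 0.17)
```

In this made-up trace the later 0.16 is never reached, which shows the trade-off: stopping early saves training time but can miss a later improvement.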

### Orthogonalization
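- Optimizing the cost: try everything to decrease $J(w,b)$
- Not overfitting: regularization
- Early stopping couples these two tasks, because halting gradient descent early also stops optimizing $J(w,b)$; L2 regularization keeps them separate, at the cost of searching over $\lambda$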

## Setting Up your Optimization Methods

### Normalizing Inputs

- Subtract mean:
$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$
$$x := x - \mu$$
- Normalize variance (element-wise, using the mean-subtracted $x$):
$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)})^2$$
$$x := \frac{x}{\sigma}$$
> Compute $\mu$ and $\sigma^2$ on the training set, and use those same values to normalize the dev and test sets.
> The normalized inputs have zero mean and unit variance for every feature, so all features are on a similar scale. This makes the contours of the cost function more symmetric, so it is easier to optimize.
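
A minimal sketch of input normalization, assuming data of shape (features, examples) and dividing by the standard deviation $\sigma$. The helper names and the small `eps` guard against division by zero are illustrative assumptions.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Compute per-feature mean and std from training data of shape (n_x, m)."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)
    return mu, sigma + eps                 # eps avoids dividing by zero

def normalize(X, mu, sigma):
    """Apply statistics that were computed on the training set."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
loc, scale = [[5.0], [-3.0]], [[10.0], [0.1]]    # two features on very different scales
X_train = rng.normal(loc=loc, scale=scale, size=(2, 1000))
X_test = rng.normal(loc=loc, scale=scale, size=(2, 200))

mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_test_n = normalize(X_test, mu, sigma)          # same mu and sigma as training
```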

### Vanishing / Exploding Gradients

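For a large NN with $L$ layers, linear activations, the same weight $w$ in every layer, and $b=0$:

$$\hat{y} = w^{L}a^{[0]}$$

If $w>1$, then $\hat{y}$ grows exponentially with $L$ (explodes). If $w<1$, then $\hat{y}$ shrinks exponentially with $L$ (vanishes). The gradients behave the same way, which makes training slow or unstable.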
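
A small numerical illustration of this effect, assuming linear activations and a weight matrix $W = w \cdot I$ in every layer. The function name and layer sizes are illustrative assumptions.

```python
import numpy as np

def forward_linear(w_scale, n_layers, n_units=4):
    """Propagate a vector of ones through n_layers linear layers W = w_scale * I."""
    a = np.ones((n_units, 1))
    W = w_scale * np.eye(n_units)
    for _ in range(n_layers):
        a = W @ a                          # linear activation, b = 0
    return float(np.abs(a).max())

exploding = forward_linear(1.5, n_layers=50)   # equals 1.5 ** 50, very large
vanishing = forward_linear(0.5, n_layers=50)   # equals 0.5 ** 50, nearly zero
```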

### Weight Initialization for Deep Networks

For a single neuron with $n$ inputs and $b=0$:

$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

If the $w_i$ and $x_i$ are large, then $z$ will be large, and a saturating activation function (sigmoid, tanh) will be in its flat region, which will slow down the learning.

So the larger $n$ is, the smaller we want each $w_i$ to be.

n=$n^{[l-1]}$\
Var(W) = 1/n\
$W^{[l]}$ = np.random.randn(*shape) * np.sqrt(1/n)

If ReLU is used, then $W^{[l]}$ = np.random.randn(*shape) * np.sqrt(2/n) (He initialization)

If tanh is used, then $W^{[l]}$ = np.random.randn(*shape) * np.sqrt(1/n) (Xavier initialization)

Xavier (Glorot) variant using both layer sizes: np.random.randn(*shape) * np.sqrt(2/(n[l-1]+n[l]))
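
A minimal sketch of variance-scaled initialization, using $\sqrt{2/n}$ for ReLU and $\sqrt{1/n}$ for tanh with $n = n^{[l-1]}$. The function name `init_weights` and the layer sizes are illustrative assumptions.

```python
import numpy as np

def init_weights(n_out, n_in, activation, rng):
    """Draw W of shape (n_out, n_in) with variance scaled by the fan-in n_in."""
    if activation == "relu":
        scale = np.sqrt(2.0 / n_in)        # Var(W) = 2 / n
    elif activation == "tanh":
        scale = np.sqrt(1.0 / n_in)        # Var(W) = 1 / n
    else:
        raise ValueError(f"unsupported activation: {activation}")
    return rng.standard_normal((n_out, n_in)) * scale

rng = np.random.default_rng(0)
W_relu = init_weights(500, 1000, "relu", rng)
W_tanh = init_weights(500, 1000, "tanh", rng)
```

The scale factor can also be treated as a hyperparameter to tune, although it usually matters less than the learning rate.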

### Numerical Approximation of Gradients

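$$ g(\theta) = f'(\theta) = \lim_{\epsilon \rightarrow 0}\frac{f( \theta + \epsilon) -f( \theta - \epsilon)}{2 \epsilon}$$
$$ g(\theta) \approx \frac{f( \theta + \epsilon) -f( \theta - \epsilon)}{2 \epsilon} \quad \text{for small } \epsilon$$

The two-sided difference has an approximation error of $O(\epsilon^2)$, compared with $O(\epsilon)$ for the one-sided difference $\frac{f(\theta + \epsilon) - f(\theta)}{\epsilon}$.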
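
A short numerical comparison of the two-sided difference with the one-sided difference $\frac{f(\theta + \epsilon) - f(\theta)}{\epsilon}$. The choice of $f(\theta) = \theta^3$, $\theta = 1$, and $\epsilon = 0.01$ is only for illustration.

```python
def two_sided(f, theta, eps):
    """Centered difference: error shrinks like eps ** 2."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

def one_sided(f, theta, eps):
    """Forward difference: error shrinks like eps."""
    return (f(theta + eps) - f(theta)) / eps

f = lambda t: t ** 3
theta, eps = 1.0, 0.01
exact = 3 * theta ** 2                              # f'(theta) = 3 * theta^2

err_two = abs(two_sided(f, theta, eps) - exact)     # about 0.0001
err_one = abs(one_sided(f, theta, eps) - exact)     # about 0.0301
```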

### Gradient Checking

Take $W^{[l]}$ and $b^{[l]}$ and reshape them into a big vector $\theta$.

$J(\theta) = J(W^{[1]}, b^{[1]}, ..., W^{[L]}, b^{[L]})$

Take $dW^{[l]}$ and $db^{[l]}$ and reshape them into a big vector $d\theta$.

Is $d\theta$ close to the gradient of $J(\theta)$?

For each $i$:
$$ d\theta_{approx}[i] = \frac{J(\theta_1, \theta_2, ..., \theta_i + \epsilon, ..., \theta_n) - J(\theta_1, \theta_2, ..., \theta_i - \epsilon, ..., \theta_n)}{2 \epsilon} \approx d\theta[i] = \frac{\partial J}{\partial \theta_i}$$
Check (with $\epsilon=10^{-7}$):
$$ \frac{||d\theta_{approx} - d\theta||_2}{||d\theta_{approx}||_2 + ||d\theta||_2}$$
A ratio of about $10^{-7}$ or smaller is great; about $10^{-5}$ deserves a closer look; about $10^{-3}$ or larger most likely means there is a bug in the backpropagation.
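
A minimal sketch of this check for a cost function of a flat parameter vector. The toy cost `J`, its analytic gradient `grad`, and the deliberately broken gradient are assumptions used only to show a passing and a failing check.

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Relative difference between dtheta and a two-sided numerical gradient."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps                     # nudge only component i
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(approx - dtheta)
    den = np.linalg.norm(approx) + np.linalg.norm(dtheta)
    return num / den

# Toy cost with a known gradient (illustrative only)
J = lambda th: 0.5 * np.sum(th ** 2) + np.sum(np.sin(th))
grad = lambda th: th + np.cos(th)

theta = np.array([0.3, -1.2, 2.0, 0.7])
good = grad_check(J, theta, grad(theta))                          # tiny ratio
bad = grad_check(J, theta, grad(theta) * np.array([1, 1, 2, 1]))  # simulated bug
```

When the ratio is large, comparing `approx` and `dtheta` component by component helps locate which $dW^{[l]}$ or $db^{[l]}$ is wrong.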

### Gradient Checking Implementation Notes

Don't use in training - only to debug.

If algorithm fails grad check, look at components to try to identify bug.

Remember regularization: if $J(\theta)$ includes a regularization term, $d\theta$ must include its gradient as well.

Doesn't work with dropout, because the randomly dropped units make $J$ hard to compute exactly. You can first turn off dropout (set $keep\_prob = 1.0$), run gradient check, and then turn on dropout again.

Run at random initialization; perhaps again after some training.

Be careful with mini-batches: evaluate $J(\theta)$ and compute $d\theta$ on the same fixed batch of examples, otherwise the two gradients are not comparable.