# Practical Aspects of Deep Learning


## Setting up your Machine Learning Application

### Train / Dev / Test sets

Iterative process:\
layers\
hidden units\
learning rate\
activation functions\
train/dev/test sets

Data (small datasets):\
train set: 60%\
hold-out cross-validation (dev) set: 20%\
test set: 20%\
(or 70% train / 30% test when there is no separate dev set)

For big data (1M+ examples):\
train set: 98%\
dev set: 1%\
test set: 1%
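
A minimal sketch of such a split, assuming the examples can be shuffled freely (no time ordering or grouping). The helper name `split_indices` and its parameters are hypothetical and only illustrate the 98/1/1 proportions above.

```python
import numpy as np

def split_indices(m, dev_frac, test_frac, seed=0):
    """Shuffle indices 0..m-1 and split them into train/dev/test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)
    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    dev = idx[:n_dev]
    test = idx[n_dev:n_dev + n_test]
    train = idx[n_dev + n_test:]      # everything left over is training data
    return train, dev, test

# Big-data setting from the notes: 98% / 1% / 1%
train, dev, test = split_indices(1_000_000, dev_frac=0.01, test_frac=0.01)
print(len(train), len(dev), len(test))
```

The index arrays can then be used to slice the feature and label arrays, so the data itself is never copied during the split.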

### Bias / Variance

High bias (underfitting):\
train set error: 15%\
dev set error: 14%\
high error on both sets, with little gap between them

High variance (overfitting):\
train set error: 1%\
dev set error: 11%\
low error on train set, high error on dev set

High bias and high variance:\
train set error: 15%\
dev set error: 30%\
high error on train set, much higher error on dev set

Low bias and low variance:\
train set error: 0.5%\
dev set error: 1%\
low error on both sets
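
The four cases above can be sketched as a small check on the two error rates. The function `diagnose`, the tolerance `tol`, and the assumed Bayes error of 0% are illustrative choices, not values from the notes; in practice the thresholds depend on the task.

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.05):
    """Return (high_bias, high_variance) from error rates given as fractions.

    bayes_err and tol are illustrative assumptions, not part of the notes.
    """
    high_bias = (train_err - bayes_err) > tol     # far from the best achievable error
    high_variance = (dev_err - train_err) > tol   # large train-to-dev gap
    return high_bias, high_variance

print(diagnose(0.15, 0.14))   # (True, False)  high bias
print(diagnose(0.01, 0.11))   # (False, True)  high variance
print(diagnose(0.15, 0.30))   # (True, True)   both
print(diagnose(0.005, 0.01))  # (False, False) neither
```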

### Basic Recipe for Machine Learning

High bias (underfitting):
- bigger network
- train longer
- better optimization algorithm (e.g. Adam, RMSprop)
- NN architecture search

High variance (overfitting):
- more data
- data augmentation
- regularization (L2, dropout)
- early stopping
- batch normalization (only a slight regularizing side effect)
- NN architecture search
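
As a rough sketch, the recipe can be written as a lookup from the diagnosis to a few of the remedies listed above. The function name `suggest_remedies` is hypothetical, and handling bias before variance is an assumption about ordering rather than something the notes state.

```python
def suggest_remedies(high_bias, high_variance):
    """Return some remedies from the notes for the given diagnosis (bias first)."""
    remedies = []
    if high_bias:
        remedies += ["bigger network", "train longer", "NN architecture search"]
    if high_variance:
        remedies += ["more data", "regularization", "NN architecture search"]
    # dict.fromkeys removes duplicates while preserving insertion order
    return list(dict.fromkeys(remedies))

print(suggest_remedies(high_bias=True, high_variance=True))
```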


## Regularizing your Neural Network

### Regularization

+ L2 regularization adds a penalty on the weights to the cost
$$ \min_{W, b} J(W, b) $$
$$J(W,b)=\frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} ||W^{[l]}||_{F}^{2}$$
$$\text{Frobenius norm: }||W^{[l]}||_{F}^{2} = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (W_{ij}^{[l]})^{2} $$
$$ dW^{[l]} = \text{(from backprop)} + \frac{\lambda}{m}W^{[l]} $$
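
A minimal NumPy sketch of the penalty term and its gradient, under the assumption that the weights are held in a plain list of matrices. The names `l2_penalty` and `l2_grad` are hypothetical. Differentiating $\frac{\lambda}{2m}||W^{[l]}||_F^2$ with respect to $W^{[l]}$ gives $\frac{\lambda}{m}W^{[l]}$, which the finite-difference check at the end is meant to confirm numerically.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambda / 2m) * sum of squared Frobenius norms of all weight matrices."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_grad(W, lambd, m):
    """Gradient of the penalty w.r.t. one weight matrix: (lambda / m) * W."""
    return (lambd / m) * W

lambd, m = 0.7, 10
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])

# Central finite difference on a single entry, W1[0, 1]
eps = 1e-6
W_plus, W_minus = W1.copy(), W1.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (l2_penalty([W_plus, W2], lambd, m)
           - l2_penalty([W_minus, W2], lambd, m)) / (2 * eps)

print(numeric, l2_grad(W1, lambd, m)[0, 1])  # the two values should agree
```

In a full backward pass this term would be added to the `dW` computed by backprop for each layer; biases are left out of the penalty here, matching the cost above.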

### Why Regularization Helps

+ Intuition
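  + If $\lambda$ is too large, the weights are pushed toward zero, each unit stays in the near-linear range of its activation (e.g. tanh), and the network behaves almost like a linear model (underfitting)
  + If $\lambda$ is well chosen, the network is less prone to overfitting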
  + $\lambda$ is a hyperparameter that you can tune using a dev set
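
One way to see the "more linear" effect, assuming a tanh activation (the notes do not name one): as the weights shrink, the pre-activation $z$ shrinks, and $\tanh(z)$ becomes nearly indistinguishable from $z$. The helper `linear_gap` is hypothetical and only measures that difference.

```python
import math

def linear_gap(z):
    """Relative difference between tanh(z) and the identity z."""
    return abs(math.tanh(z) - z) / abs(z)

x = 2.0  # a fixed input; only the weight magnitude changes
for w in (1.0, 0.1, 0.01):
    z = w * x
    print(f"w={w}  z={z}  tanh(z)={math.tanh(z):.6f}  relative gap={linear_gap(z):.6f}")
```

The gap shrinks quickly as `w` gets smaller, so a heavily regularized layer computes something close to a linear function of its input.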


### Dropout Regularization

+ Dropout
  + At each iteration, shut down each neuron of a layer with probability $1 - keep\_prob$ or keep it with probability $keep\_prob$
  + At test time, don't do anything (don't apply dropout)
+ Implementation details
  + Inverted dropout: during training, divide the kept activations by `keep_prob` so their expected value stays the same; no scaling is needed at test time
  + Ex. for layer 3, with `a3` and `d3` standing for $a^{[3]}$ and $d^{[3]}$: `d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob`
  + `a3 = np.multiply(a3, d3)`
  + `a3 /= keep_prob`
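
The steps above can be sketched as one function. This is a minimal illustration of inverted dropout, not a full layer implementation: the name `dropout_forward` and the `training` flag are hypothetical, and it uses NumPy's `Generator.random` in place of `np.random.rand` so the example is reproducible.

```python
import numpy as np

def dropout_forward(a, keep_prob, training, rng):
    """Inverted dropout on an activation matrix a (units x examples)."""
    if not training:
        return a, None                      # test time: no mask, no scaling
    d = rng.random(a.shape) < keep_prob     # boolean mask, True = keep the unit
    a = a * d                               # zero out the dropped units
    a = a / keep_prob                       # rescale so the expected value is unchanged
    return a, d

rng = np.random.default_rng(0)
a3 = np.ones((5, 4))
a3_train, d3 = dropout_forward(a3, keep_prob=0.8, training=True, rng=rng)
a3_test, _ = dropout_forward(a3, keep_prob=0.8, training=False, rng=rng)
print(a3_train)  # entries are either 0 or 1 / 0.8
print(a3_test)   # unchanged
```

The mask `d` is returned because the same mask would be needed in the backward pass for that iteration.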

### Understanding Dropout

+ Intuition
  + Dropout randomly shuts down some neurons on each iteration
  + $d^{[3]}$ is a matrix of the same shape as $a^{[3]}$
  + Each entry of $d^{[3]}$ is 1 with probability $keep\_prob$ and 0 with probability $1 - keep\_prob$
  + Ex. $keep\_prob = 0.8 \rightarrow$ 80% chance that a neuron will be kept
  + $\rightarrow$ During training, divide the kept activations by 0.8 to keep the same expected value; at test time, apply no dropout and no scaling
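
A quick empirical check of the expected-value argument, assuming constant activations of 3.0 purely for illustration. With `keep_prob = 0.8`, about 80% of the mask entries should be kept, masking alone should lower the mean to about 80% of its original value, and dividing by `keep_prob` should restore it.

```python
import numpy as np

rng = np.random.default_rng(1)
keep_prob = 0.8
a = np.full((1000, 1000), 3.0)        # constant activations with mean 3.0

d = rng.random(a.shape) < keep_prob   # 1 with probability keep_prob, else 0
kept_fraction = d.mean()
mean_masked = (a * d).mean()              # mask only, no rescaling
mean_inverted = (a * d / keep_prob).mean()  # mask plus rescaling

print(kept_fraction)   # close to 0.8
print(mean_masked)     # close to 2.4
print(mean_inverted)   # close to 3.0
```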
