# Deep Learning

## Initialisation of Weights
- The weights $W^{[l]}$ should be initialized randomly to break symmetry
- Do not initialize weights to be too large, or learning will be very slow
    - With sigmoid/tanh activation functions, the gradient at very large or very small input values is close to 0, so gradient descent makes little progress
    - Appropriately scaled initial weights also help prevent vanishing/exploding gradients
    - Recommended Initialization Methods:
        - **Xavier Initialization**: Scaling factor of `sqrt(1./layers_dims[l-1])` (commonly paired with tanh)
        - **He Initialization**: Scaling factor of `sqrt(2./layers_dims[l-1])` (commonly paired with ReLU)
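
A minimal sketch of both scaling factors in plain Python, assuming `layers_dims` lists the number of units in each layer (input layer first) and that weights are drawn from a standard normal distribution before scaling. The function name `initialize_parameters` and its `method` argument are illustrative and not part of these notes; real code would normally use NumPy arrays rather than nested lists.

```python
import math
import random


def initialize_parameters(layers_dims, method="he", seed=0):
    """Return randomly initialized weights and zero biases for each layer."""
    rng = random.Random(seed)
    numerator = 2.0 if method == "he" else 1.0  # He: 2/n_prev, Xavier: 1/n_prev
    parameters = {}
    for l in range(1, len(layers_dims)):
        n_prev, n_curr = layers_dims[l - 1], layers_dims[l]
        scale = math.sqrt(numerator / n_prev)
        # Random values break symmetry; the scale keeps them from growing too large.
        parameters[f"W{l}"] = [
            [rng.gauss(0.0, 1.0) * scale for _ in range(n_prev)]
            for _ in range(n_curr)
        ]
        parameters[f"b{l}"] = [0.0] * n_curr  # biases can safely start at zero
    return parameters


params = initialize_parameters([4, 3, 1], method="he")
```

Each `W{l}` has one row per unit in layer `l` and one column per unit in layer `l-1`.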

## Bias and Variance
Large difference in **Bayes error and train set error** $\rightarrow$ High Bias (Underfitting)   
Large difference in **train set and validation set error** $\rightarrow$ High Variance (Overfitting)  
Large difference in **validation set and test set error** $\rightarrow$ Data Mismatch (Distribution of data in validation set not same as test set)
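
The three comparisons above can be sketched as simple subtractions. The function name `diagnose` and the error values below are hypothetical, chosen only to illustrate the arithmetic.

```python
def diagnose(bayes_err, train_err, val_err, test_err):
    """Return the three error gaps, keyed by the problem each one indicates."""
    return {
        "bias": train_err - bayes_err,         # Bayes error vs train set error
        "variance": val_err - train_err,       # train set vs validation set error
        "data_mismatch": test_err - val_err,   # validation set vs test set error
    }


# Hypothetical error rates, for illustration only
gaps = diagnose(bayes_err=0.01, train_err=0.08, val_err=0.10, test_err=0.11)
largest = max(gaps, key=gaps.get)  # the gap to work on first
```

With these made-up numbers the bias gap dominates, so the high-bias fixes would be the first thing to try.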

Fixes for High Bias (Underfitting):
- Train a bigger neural network
- Train longer

Fixes for High Variance (Overfitting):
- Use more data
- Regularization
    - L1/L2 Regularization
        - L1 regularization produces sparse weights
        - L2 regularization is more commonly used
    - Dropout
        - Intuition: Can't rely on any one feature, so have to spread out weights
    - Data Augmentation
    - Weight Decay
    - Early Stopping
        - Issue: It couples two tasks that are better handled separately, since stopping early both limits how far the cost function is optimized and acts as a regularizer
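
A minimal sketch of dropout applied to one layer's activations, assuming the common "inverted dropout" variant, in which the surviving activations are divided by `keep_prob` so that their expected value is unchanged. The function name `dropout_forward` and the `keep_prob` parameter are illustrative.

```python
import random


def dropout_forward(activations, keep_prob=0.8, rng=None):
    """Randomly zero activations during training and rescale the survivors."""
    rng = rng or random.Random(0)
    # Each unit is kept independently with probability keep_prob.
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in activations]
    # Dividing by keep_prob keeps the expected activation unchanged.
    dropped = [a * m / keep_prob for a, m in zip(activations, mask)]
    return dropped, mask


a = [0.5, 1.0, 1.5, 2.0]
a_dropped, mask = dropout_forward(a, keep_prob=0.5)
```

Because any unit may be zeroed on a given pass, the next layer cannot depend on a single input. Dropout is applied only during training, not when making predictions.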

Fixes for Data Mismatch:
- Ensure data distribution of validation set is same as test set

### Activation Functions
- `sigmoid` function: ranges from 0 to 1; use for the output layer in binary classification
- `tanh` function: ranges from -1 to +1; almost always better than the sigmoid function for hidden layers because its outputs have a mean close to 0 (data is centered)
- `ReLU` function: `max(0, z)`, **recommended** as the default for hidden layers
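
The three functions can be sketched for a single scalar input `z` as follows. This is an illustrative plain-Python version; the branch in `sigmoid` is an added safeguard against overflow for very negative inputs.

```python
import math


def sigmoid(z):
    """Squash z into the range (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow when z is very negative
    return e / (1.0 + e)


def tanh(z):
    """Squash z into the range (-1, 1), centered on 0."""
    return math.tanh(z)


def relu(z):
    """Pass positive values through unchanged and clip negatives to 0."""
    return max(0.0, z)
```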

## Normalizing
### Normalizing Inputs
- Helps to **speed up training**, since the scale of the inputs affects how quickly gradient descent converges
    - The contours of the cost function become more symmetrical instead of elongated, so gradient descent can take a more direct path to the minimum

### Batch Normalization
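Normalizes the values of layer $l$ so that $W^{[l+1]}$ and $b^{[l+1]}$ train faster. In practice the pre-activation values $z^{[l]}$ are usually normalized rather than the activations $a^{[l]}$.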

Implementation:
 - Given some intermediate values in layer $l$ of the network: $z^{(1)}, \dots, z^{(m)}$
 - $\mu = \frac{1}{m}\sum_i z^{(i)}$
 - $\sigma^2 = \frac{1}{m}\sum_i (z^{(i)} - \mu)^2$
 - $z^{(i)}_{norm} = \frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$
 - $\tilde{z}^{(i)} = \gamma z^{(i)}_{norm} + \beta$
 
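The last step exists because you may not want all your hidden unit values to have mean 0 and variance 1 (e.g. with a sigmoid activation, that would keep the values in its near-linear region). $\gamma$ and $\beta$ are learnable parameters that let the network choose the mean and variance.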
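
The four formulas above can be sketched for the values of a single hidden unit across a mini-batch. The function name `batch_norm` and the default values of `gamma`, `beta` and `eps` are assumptions for illustration; in a real network $\gamma$ and $\beta$ are learned.

```python
import math


def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-8):
    """Normalize a list of values z, then rescale with gamma and shift with beta."""
    m = len(z)
    mu = sum(z) / m                                    # mean
    var = sum((z_i - mu) ** 2 for z_i in z) / m        # variance
    z_norm = [(z_i - mu) / math.sqrt(var + eps) for z_i in z]
    return [gamma * z_n + beta for z_n in z_norm]      # z tilde


z_tilde = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
```

The output has mean `beta` and standard deviation approximately `gamma`, which shows how the last step restores control over the distribution.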

### Adam Optimizer
The **Adam** optimizer combines the ideas behind **momentum** and **RMSprop**.
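
A minimal sketch of one Adam update for a single scalar parameter, showing where the momentum and RMSprop terms appear. The function name `adam_step` is illustrative, the hyperparameter defaults are the commonly used ones, and the toy objective $f(w) = w^2$ is an assumption chosen only to exercise the update.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new parameter and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSprop: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Toy example: minimize f(w) = w^2, whose gradient is 2w
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.01)
```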

# ML Strategy

Approach for any problem: Quickly produce a first model and see what errors it is making, then iterate on it.

## Setting Up Your Goal

Evaluate your models using a **Single Number Evaluation Metric**.

Human-level performance can be used as a proxy for the Bayes error $\rightarrow$ the lowest error attainable, so any gap between it and the train set error is avoidable bias

## Error Analysis

Manually look at a subset of the misclassified examples in the validation/dev set. Tag each example with the apparent reason for the error.

This provides a quick way to estimate which errors to focus on. It also gives a ceiling on how much the model could improve if a certain type of error were fixed.
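
The "maximum" improvement can be sketched as a tally: if a category accounts for a given fraction of the reviewed errors, fixing it completely removes at most that fraction of the overall error. The function name `improvement_ceiling`, the tags and the 10% error rate below are hypothetical.

```python
from collections import Counter


def improvement_ceiling(error_tags, dev_error):
    """Return the largest possible error reduction for each error category."""
    counts = Counter(error_tags)
    total = len(error_tags)
    return {tag: dev_error * n / total for tag, n in counts.items()}


# Hypothetical tags from reviewing 10 misclassified dev-set examples
tags = ["blurry"] * 5 + ["mislabelled"] * 2 + ["other"] * 3
ceilings = improvement_ceiling(tags, dev_error=0.10)
```

In this made-up case, half of the reviewed errors are blurry images, so fixing them could lower the dev error by at most half of its current value.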

## Transfer Learning

Transfer learning from A $\rightarrow$ B

Reasons for Use:
- Task A and B have the same input $x$
- Have a lot more data for Task A than Task B
- Low level features from A could be helpful for learning B

# TensorFlow

## Streaming the Data

Here you should take note of an important extra step that's been added to the batch training process: 

- `dataset = dataset.prefetch(8)` (where `dataset` is a `tf.data.Dataset`)

This prevents a bottleneck that can occur when reading from disk. `prefetch()` prepares upcoming elements in the background and keeps up to the specified number of them (here 8) ready for when they are needed. The overall pipeline creates a source dataset from your input data, applies a transformation to preprocess the data, then iterates over the dataset one batch at a time. This works because the iteration is streaming, so the data doesn't need to fit into memory.

`X_train = X_train.batch(minibatch_size, drop_remainder=True).prefetch(8)` # <<< extra step: `prefetch(8)`  
`Y_train = Y_train.batch(minibatch_size, drop_remainder=True).prefetch(8)` # keeps upcoming batches ready so training does not wait on loading