## Table of Contents

- [1 - Setting up the ML Problem](#1)
    - [1.1 - Train/Dev/Test Sets](#1-1)
    - [1.2 - Bias/Variance](#1-2)
    - [1.3 - Normalizing Inputs](#1-3)
    - [1.4 - Initialisation of Weights](#1-4)
    - [1.5 - Activation Functions](#1-5)
    - [1.6 - Optimisation Algorithms](#1-6)
    - [1.7 - Batch Normalisation](#1-7)
- [2 - ML Strategy](#2)
    - [2.1 - Setting up your Goal](#2-1)
    - [2.2 - Error Analysis](#2-2)
    - [2.3 - Transfer Learning](#2-3)
- [3 - Convolutional Neural Networks (CNN)](#3)
    - [3.1 - Intro to CNN](#3-1)
    - [3.2 - CNN Networks](#3-2)
        - [3.2.1 - Classic Networks](#3-2-1)
        - [3.2.2 - ResNets](#3-2-2)
        - [3.2.3 - Inception Network](#3-2-3)
        - [3.2.4 - MobileNets](#3-2-4)
    - [3.3 - Object Detection](#3-3)
        - [3.3.1 - YOLO (You Only Look Once)](#3-3-1)
        - [3.3.2 - Image Segmentation: U-Net](#3-3-2)
        - [3.3.3 - Face Recognition (FaceNet)](#3-3-3)
        - [3.3.4 - Neural Style Transfer](#3-3-4)
- [4 - Sequential Models - RNN](#4)
    - [4.1 - Intro to RNN](#4-1)
    - [4.2 - Word Embeddings](#4-2)
        - [4.2.1 - Word2Vec](#4-2-1)
        - [4.2.2 - Negative Sampling](#4-2-2)
        - [4.2.3 - GloVe Word Vectors](#4-2-3)
    - [4.3 - Sequence to Sequence Models](#4-3)
        - [4.3.1 - Conditional Language Models](#4-3-1)
    - [4.4 - Attention Models](#4-4)
        - [4.4.1 - Self Attention](#4-4-1)
        - [4.4.2 - Multi-Head Attention](#4-4-2)
        - [4.4.3 - Transformers](#4-4-3)
- [5 - Tensorflow](#5)

<a name='1'></a>
# 1 - Setting up the ML Problem

<a name='1-1'></a>
## 1.1 - Train/Dev/Test Sets
- The train set can come from a different distribution than the dev/test sets
- **Dev set should have same data distribution as the test set, and should be representative of actual use case data.**

<a name='1-2'></a>
## 1.2 - Bias/Variance
**High Bias:** large difference in Bayes error and train set error **(Underfitting)**
- Train a bigger neural network
- Train longer
- Adjust neural network architecture

**High Variance:** large difference in train set and validation set error **(Overfitting)**
- Use more data for training
- Regularization
    - L1/L2 Regularization
        - L1 regularization produces sparse weights, so it can be used for feature selection
        - L2 regularization is more commonly used
    - Dropout
        - Intuition: the network cannot rely on any single feature/node, so it has to spread out its weights
    - Data Augmentation
        - Increases the amount of training data and reduces overfitting
    - Weight Decay
    - Early Stopping
        - Issue: couples two tasks that are better tuned separately, since stopping early both halts optimization of the cost function and regularises
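
The dropout intuition above can be sketched in plain Python. This is a minimal illustration of *inverted* dropout on a list of activations, not a framework implementation; the function name `inverted_dropout` and the `keep_prob` value are assumptions made for the example.

```python
import random

def inverted_dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob, then rescale the rest."""
    dropped = []
    for a in activations:
        keep = rng.random() < keep_prob
        # Dividing by keep_prob keeps the expected value of each activation unchanged.
        dropped.append(a / keep_prob if keep else 0.0)
    return dropped

rng = random.Random(0)  # fixed seed so the run is repeatable
print(inverted_dropout([1.0, 2.0, 3.0, 4.0], keep_prob=0.8, rng=rng))
```

Dropout is only applied during training; at test time all units are kept and no rescaling is needed.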

<a name='1-3'></a>
## 1.3 - Normalizing Inputs
- Helps to **speed up training**: with all features on a similar scale, gradient descent steps point more directly toward the minimum
    - The contours of the cost function become more symmetrical instead of forming an elongated bowl, so a larger learning rate can be used
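
A minimal sketch of normalizing one input feature to zero mean and unit variance. The helper name `normalize_feature` is hypothetical, and the sketch assumes the feature is not constant (otherwise the standard deviation would be 0).

```python
def normalize_feature(values):
    """Return the normalized values together with the mean and standard deviation used."""
    m = len(values)
    mu = sum(values) / m
    var = sum((v - mu) ** 2 for v in values) / m
    sigma = var ** 0.5  # assumed non-zero, i.e. the feature is not constant
    normalized = [(v - mu) / sigma for v in values]
    return normalized, mu, sigma

x_train = [2.0, 4.0, 6.0, 8.0]
x_norm, mu, sigma = normalize_feature(x_train)
print(x_norm, mu, sigma)
```

The `mu` and `sigma` computed on the train set should be reused to normalize the dev and test sets, so that all three go through the same transformation.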

<a name='1-4'></a>
## 1.4 - Initialisation of Weights
- The weights $W^{[l]}$ should be **initialized randomly to break symmetry**
- **Do not initialize weights to be too large**, else learning will be very slow
    - With sigmoid/tanh activation functions, the gradient at very large or very small values of $z$ is close to 0, so gradient descent makes little progress
    - Appropriately scaled initial weights also help mitigate vanishing/exploding gradients
    - Recommended Initialization Methods:
        - **Xavier Initialization**: Scaling factor of `sqrt(1./layers_dims[l-1])`
        - **He Initialization**: Scaling factor of `sqrt(2./layers_dims[l-1])`
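
A rough sketch of the two scaling factors above, using only the standard library. The function name `init_layer` and its arguments are assumptions for the example; `n_prev` plays the role of `layers_dims[l-1]`.

```python
import math
import random

def init_layer(n_prev, n_curr, method="he", rng=None):
    """Random weights of shape (n_curr, n_prev) with He or Xavier scaling; zero biases."""
    rng = rng or random.Random(0)
    numerator = 2.0 if method == "he" else 1.0  # "xavier" uses 1 / n_prev
    scale = math.sqrt(numerator / n_prev)
    # Random values break symmetry; the scale keeps the variance of z roughly constant.
    W = [[rng.gauss(0.0, 1.0) * scale for _ in range(n_prev)] for _ in range(n_curr)]
    b = [0.0] * n_curr  # biases can safely start at zero
    return W, b

W1, b1 = init_layer(n_prev=4, n_curr=3, method="he")
print(len(W1), len(W1[0]), b1)
```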

<a name='1-5'></a>
## 1.5 - Activation Functions
- `sigmoid`: range from 0 to 1, use for binary classification output layer
- `tanh`: range from -1 to +1, almost always better than the sigmoid function due to mean 0 (data is centered)
- `ReLU`: `max(0, z)` **recommended**
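
For reference, the three activations can be written as scalar functions. This is only an illustrative sketch; the sigmoid is written in a numerically stable form to avoid overflow for very negative `z`.

```python
import math

def sigmoid(z):
    """Range (0, 1); branches on the sign of z to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tanh(z):
    """Range (-1, 1); outputs are centered around 0."""
    return math.tanh(z)

def relu(z):
    """max(0, z): gradient is 1 for positive z, 0 for negative z."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```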

<a name='1-6'></a>
## 1.6 - Optimisation Algorithms
- Gradient Descent with Momentum
- RMSprop
- **Adam**: combines both the concepts of using momentum as well as RMSprop
    - $\alpha$: need to tune
    - $\beta_1$: 0.9 $\rightarrow$ $dW$
    - $\beta_2$: 0.999 $\rightarrow$ $dW^2$
    - $\epsilon$: $10^{-8}$
- Learning rate decay
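
A minimal sketch of a single Adam update for one scalar parameter, using the default hyperparameters listed above. The function name `adam_step` and the toy cost $f(w) = w^2$ are assumptions made for illustration.

```python
import math

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad        # momentum: moving average of dW
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSprop: moving average of dW^2
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimise f(w) = w^2, whose gradient is 2w, starting from w = 1.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 51):
    w, m, v = adam_step(w, 2 * w, m, v, t, alpha=0.01)
print(w)
```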

<a name='1-7'></a>
## 1.7 - Batch Normalisation
Normalizes a layer's values (in practice the pre-activations $z^{[l]}$, as in the implementation below) so that $w^{[l+1]}$ and $b^{[l+1]}$ train faster, and reduces **internal covariate shift**

Implementation:
 - Given some intermediate values in layer $l$ of the network: $z^{(1)}, \dots, z^{(m)}$
 - $\mu = \frac{1}{m}\sum_i z^{(i)}$
 - $\sigma^2 = \frac{1}{m}\sum_i (z^{(i)} - \mu)^2$
 - $z^{(i)}_{norm} = \frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$
 - $\tilde{z}^{(i)} = \gamma z^{(i)}_{norm} + \beta$
 
The last step exists because you may not want every hidden unit to have mean 0 and variance 1 (e.g. when using a sigmoid function); $\gamma$ and $\beta$ are learnable parameters that let the network choose the mean and variance
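
The four implementation steps can be sketched for a single hidden unit across a mini-batch. The function name `batch_norm` is an assumption, and `gamma`/`beta` are passed in as plain numbers here, whereas in a real network they are learned.

```python
def batch_norm(z, gamma, beta, eps=1e-8):
    """Batch-normalize one unit's values z^(1..m) over a mini-batch, then scale and shift."""
    m = len(z)
    mu = sum(z) / m
    var = sum((zi - mu) ** 2 for zi in z) / m
    z_norm = [(zi - mu) / (var + eps) ** 0.5 for zi in z]
    # gamma and beta let the unit have a mean and variance other than 0 and 1.
    return [gamma * zn + beta for zn in z_norm]

print(batch_norm([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0))
```

This covers the training-time computation only; at test time the mini-batch statistics are typically replaced by running averages collected during training.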

<a name='2'></a>
# 2 - ML Strategy

<a name='2-1'></a>
## 2.1 - Setting up your Goal
- Single Number Evaluation Metric
    - Makes it easier to compare the performance of different algorithms
- Satisficing and Optimizing Metrics
- Train/Dev/Test Distributions
    - Choose dev and test sets that reflect the data you expect to get in the future and consider important to do well on
    - Set dev set size to be big enough to detect differences in models being tried out

<a name='2-2'></a>
## 2.2 - Error Analysis
**Quickly build a first model, then analyse the errors it is making and iterate on it**

Manually look at a subset of the misclassified examples in the validation/dev set. Label each example with the likely reason for the error.

This gives a quick estimate of which error categories to focus on, and an upper bound on how much the model could improve if a given category were fixed.

<a name='2-3'></a>
## 2.3 - Transfer Learning
**Transfer learning from A $\rightarrow$ B**

**Reasons for Use:**
- Task A and B have the same input $x$
- Have a lot more data for Task A than Task B
- Low level features from A could be helpful for learning B

<a name='3'></a>
# 3 - Convolutional Neural Networks (CNN)

<a name='3-1'></a>
## 3.1 - Intro to CNN
**Reasons for CNN**
- **Parameter sharing**: A feature detector that is useful in one part of the image is probably useful in another part of the image
- **Sparsity of Connections**: In each layer, each output value depends only on a small number of inputs

**Padding**: Extra zeros added to the border of image
- Valid: No padding, so the output is smaller than the input whenever $f > 1$
- Same: Pad the input so that the output size is the **same** as the input size

**Strides**
- The number of positions the filter moves between successive applications
    
If layer $l$ is a convolutional layer:
- $f^{[l]} = \text{filter size}$
- $p^{[l]} = \text{padding }$
- $s^{[l]} = \text{stride}$
- $n_C^{[l]} = \text{number of filters (output channels)}$
- $\text{Input} = n_H^{[l-1]} \times n_W^{[l-1]} \times n_C^{[l-1]}$
- $\text{Output} = n_H^{[l]} \times n_W^{[l]} \times n_C^{[l]}$

$$n_H^{[l]} = \Bigl\lfloor \frac{n_H^{[l-1]} + 2 p^{[l]} - f^{[l]}}{s^{[l]}} \Bigr\rfloor + 1$$
$$n_W^{[l]} = \Bigl\lfloor \frac{n_W^{[l-1]} + 2 p^{[l]} - f^{[l]}}{s^{[l]}} \Bigr\rfloor + 1$$
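
The output-size formula can be checked with a small helper. The function name `conv_output_size` is hypothetical; it applies to one spatial dimension (height or width) at a time.

```python
def conv_output_size(n_prev, f, p, s):
    """floor((n_prev + 2p - f) / s) + 1 for one spatial dimension."""
    return (n_prev + 2 * p - f) // s + 1

# Valid convolution: 6x6 input, 3x3 filter, no padding, stride 1
print(conv_output_size(6, f=3, p=0, s=1))
# Same convolution: padding p = (f - 1) / 2 keeps the size at stride 1
print(conv_output_size(6, f=3, p=1, s=1))
# Strided convolution: 7x7 input, 3x3 filter, stride 2
print(conv_output_size(7, f=3, p=0, s=2))
```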
    
**Types of Layers**
- Convolutional (CONV)
- Pooling (POOL)
- Fully Connected (FC)

<a name='3-2'></a>
## 3.2 - CNN Networks

<a name='3-2-1'></a>
### 3.2.1 - Classic Networks
- LeNet-5
- AlexNet
- VGG-16

<a name='3-2-2'></a>
### 3.2.2 - ResNets (Residual Networks)
ResNets have skip connections ("shortcuts") that add the activation $a^{[l]}$ to the input of the activation function two layers later, so that $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$

<font color = 'blue'>

**What you should remember**:

- Very deep "plain" networks don't work in practice because vanishing gradients make them hard to train.  
- Skip connections **help address the Vanishing Gradient problem**. They also make it easy for a ResNet block to learn an identity function. 
- There are two main types of blocks: The **identity block** and the **convolutional block**. 
- Very deep Residual Networks are built by stacking these blocks together.

<a name='3-2-3'></a>
### 3.2.3 - Inception Network

<a name='3-2-4'></a>
### 3.2.4 - MobileNets
Designed to provide fast and computationally efficient performance by using **depthwise separable convolutions**

<font color='blue'>

**What you should remember**:
    
* MobileNetV2's unique features are: 
  * Depthwise separable convolutions that provide lightweight feature filtering and creation
  * Input and output bottlenecks that preserve important information on either end of the block
* Depthwise separable convolutions deal with both spatial and depth (number of channels) dimensions
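
To see why depthwise separable convolutions are cheaper, the weight counts of the two approaches can be compared. This is a back-of-the-envelope sketch that ignores biases; the function names are assumptions for the example.

```python
def standard_conv_weights(f, n_c_in, n_c_out):
    """Each of the n_c_out filters has f * f * n_c_in weights."""
    return f * f * n_c_in * n_c_out

def depthwise_separable_weights(f, n_c_in, n_c_out):
    """Depthwise step: one f x f filter per input channel (spatial dimension).
    Pointwise step: n_c_out filters of size 1 x 1 x n_c_in (depth dimension)."""
    depthwise = f * f * n_c_in
    pointwise = n_c_in * n_c_out
    return depthwise + pointwise

print(standard_conv_weights(3, 3, 5), depthwise_separable_weights(3, 3, 5))
```

The saving grows with the number of output channels, since the $f \times f$ spatial filtering is no longer repeated for every output filter.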

<a name='3-3'></a>
## 3.3 - Object Detection

**Non-Max Suppression**: ensures that only one bounding box is drawn per object
- Discard all boxes with $p_c \leq$ a chosen score threshold (e.g. 0.6)
- While there are any remaining boxes
    - Pick the box with the highest $p_c$ and output it as a prediction
    - Discard any remaining box with IoU (Intersection over Union) $\geq$ a chosen IoU threshold (e.g. 0.5) with the box output in the previous step
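
The non-max suppression steps above can be sketched as follows, for a single class. Boxes are assumed to be `(x1, y1, x2, y2)` corner tuples, and the default thresholds are illustrative values, not fixed parts of the algorithm.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the (score, box) pairs kept after score filtering and NMS."""
    # Step 1: discard boxes whose score p_c is at or below the score threshold.
    candidates = [(s, b) for s, b in zip(scores, boxes) if s > score_threshold]
    candidates.sort(key=lambda sb: sb[0], reverse=True)
    kept = []
    while candidates:
        # Step 2: output the highest-scoring remaining box.
        best = candidates.pop(0)
        kept.append(best)
        # Step 3: discard boxes that overlap it too much.
        candidates = [c for c in candidates if iou(best[1], c[1]) < iou_threshold]
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
print(non_max_suppression(boxes, scores))
```

With multiple classes, the same procedure is usually run once per class.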
    
**Anchor boxes**: define anchor boxes for the shapes of objects to be detected, to address issues of overlapping objects

<a name='3-3-1'></a>
### 3.3.1 - YOLO (You Only Look Once) Algorithm

Fast object detection model that draws bounding boxes around the objects

<font color='blue'>

**What you should remember**:
    
- YOLO is a state-of-the-art object detection model that is fast and accurate
- It runs an input image through a CNN, which outputs a 19x19x5x85 dimensional volume. 
- The encoding can be seen as a grid where each of the 19x19 cells contains information about 5 boxes.
- You filter through all the boxes using non-max suppression. Specifically: 
    - Score thresholding on the probability of detecting a class to keep only accurate (high probability) boxes
    - Intersection over Union (IoU) thresholding to eliminate overlapping boxes

<a name='3-3-2'></a>
### 3.3.2 - Image Segmentation: U-Net

<font color='blue'>

**What you should remember**:

* **Semantic image segmentation predicts a label for every single pixel in an image**
* U-Net uses an equal number of convolutional blocks and transposed convolutions for downsampling and upsampling
* **Skip connections** are used to prevent border pixel information loss and overfitting in U-Net

<a name='3-3-3'></a>
### 3.3.3 - Face Recognition (FaceNet)

<font color='blue'>

**What you should remember**:
    
- Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
- Triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
- The same encoding can be used for verification and recognition. Measuring distances between two images' encodings allows you to determine whether they are pictures of the same person.
- Face recognition is a one-shot learning problem
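
A minimal sketch of the triplet loss for one (anchor, positive, negative) triple of encodings: $\max(\lVert f(A) - f(P) \rVert^2 - \lVert f(A) - f(N) \rVert^2 + \alpha, 0)$. The margin value `alpha=0.2` and the toy 2-dimensional encodings are assumptions for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two encodings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Zero once the negative is further from the anchor than the positive by at least alpha."""
    pos_dist = squared_distance(anchor, positive)
    neg_dist = squared_distance(anchor, negative)
    return max(pos_dist - neg_dist + alpha, 0.0)

anchor = [0.0, 0.0]
print(triplet_loss(anchor, positive=[0.0, 0.1], negative=[1.0, 0.0]))  # easy triplet
print(triplet_loss(anchor, positive=[0.0, 0.1], negative=[0.0, 0.2]))  # hard triplet
```

For verification, the same `squared_distance` between two encodings can be compared against a threshold to decide whether two images show the same person.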

<a name='3-3-4'></a>
### 3.3.4 - Neural Style Transfer

<font color = 'blue'>

**What you should remember**:
    
- The style of an image can be represented using the Gram matrix of a hidden layer's activations. 
- You get even better results by combining this representation from multiple different layers. 
- This is in contrast to the content representation, where usually using just a single hidden layer is sufficient.
- Minimizing the style cost will cause the image $G$ to follow the style of the image $S$. 
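
A small sketch of the Gram matrix used for the style representation. It assumes the hidden layer's activations have already been unrolled into a list of $n_C$ rows, each holding the $n_H \times n_W$ values of one channel; the function name `gram_matrix` is chosen for the example.

```python
def gram_matrix(A):
    """G = A A^T, where A has shape (n_C, n_H * n_W).
    G[i][j] measures how strongly channels i and j activate together."""
    return [[sum(x * y for x, y in zip(row_i, row_j)) for row_j in A] for row_i in A]

# Two channels, each with two unrolled spatial positions.
A = [[1, 2],
     [3, 4]]
print(gram_matrix(A))
```

The style cost for a layer then compares the Gram matrices of the style image $S$ and the generated image $G$.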


<a name='4'></a>
# 4 - Sequential Models - RNN

<a name='4-1'></a>
## 4.1 - Intro to RNN

<font color = 'blue'>

**What you should remember**:
    
- The recurrent neural network, or RNN, is essentially the repeated use of a single cell.
- A basic RNN reads inputs one at a time, and remembers information through the hidden layer activations (hidden states) that are passed from one time step to the next.
    - The time step dimension determines how many times to re-use the RNN cell
- Each cell takes two inputs at each time step:
    - The hidden state from the previous cell
    - The current time step's input data
- Each cell has two outputs at each time step:
    - A hidden state 
    - A prediction
    
## LSTM

<b>What you should remember</b>:
 
- An LSTM is similar to an RNN in that they both use hidden states to pass along information, but an LSTM also uses a cell state, which is like a long-term memory, to help deal with the issue of vanishing gradients
- An LSTM cell consists of a cell state (long-term memory), a hidden state (short-term memory), along with 3 gates that constantly update the relevancy of its inputs:
    - A <b>forget</b> gate, which decides which input units should be remembered and passed along. It's a tensor with values between 0 and 1. 
        - If a unit has a value close to 0, the LSTM will "forget" the stored state in the previous cell state.
        - If it has a value close to 1, the LSTM will mostly remember the corresponding value.
    - An <b>update</b> gate, again a tensor containing values between 0 and 1. It decides on what information to throw away, and what new information to add.
        - When a unit in the update gate is close to 1, the value of its candidate is passed on to the hidden state.
        - When a unit in the update gate is close to 0, it's prevented from being passed onto the hidden state.
    - And an <b>output</b> gate, which decides what gets sent as the output of the time step

## GRU
- Has fewer gates than an LSTM (an update gate and a reset gate) and no separate cell state

<a name='4-2'></a>
## 4.2 - Word Embeddings
Creates vectors to represent the "meaning" of each word, with words of similar meaning being closer together

Can learn word embeddings from large text corpus (1 - 100B words). **Useful for transfer learning.**

<font color = 'blue'>

**What you should remember**:
    
- If you have an NLP task where the **training set is small**, using word embeddings can help your algorithm significantly. 
- **Word embeddings allow your model to work on words in the test set that may not even appear in the training set.**
- Training sequence models in Keras (and in most other deep learning frameworks) requires a few important details:
    - To use mini-batches, the sequences need to be **padded** so that all the examples in a mini-batch have the **same length**. 
    - An `Embedding()` layer can be initialized with pretrained values. 
        - These values can be either fixed or trained further on your dataset. 
        - If however your labeled dataset is small, it's usually not worth trying to train a large pre-trained set of embeddings.   
    - `LSTM()` has a flag called `return_sequences` to decide whether to return every hidden state or only the last one. 
    - You can use `Dropout()` right after `LSTM()` to regularize your network.

<a name='4-2-1'></a>
### 4.2.1 - Word2Vec
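Trained using **Skip-grams**: given a context word, predict a target word chosen from a window around it

Issue: the softmax over the entire vocabulary is slow to compute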

<a name='4-2-2'></a>
### 4.2.2 - Negative Sampling


<a name='4-2-3'></a>
### 4.2.3 - GloVe Word Vectors

<a name='4-3'></a>
## 4.3 - Sequence to Sequence Models

<a name='4-3-1'></a>
### 4.3.1 - Conditional Language Models
Picks each next word according to its conditional probability given the input sentence and all the previously selected words.

**Beam Search** is used instead of greedy search, as beam search enables a wider search range, allowing for more diverse sentences to be constructed. Beam search stores the top $n$ sentences at every iteration, and loops on each to find the next best $n$ sentences. **Runs fast but not guaranteed to find exact maximum for argmax**.
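
The difference between greedy search and beam search can be seen on a toy model. Everything here is hypothetical: `TOY` is a hand-written table of conditional probabilities standing in for the decoder, and the sketch omits refinements such as length normalization.

```python
import math

def beam_search(next_log_probs, beam_width, max_len, eos="<eos>"):
    """Keep the beam_width most probable partial sentences at every step.
    next_log_probs(prefix) returns {token: log P(token | x, prefix)}."""
    beams = [((), 0.0)]  # (tokens so far, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished sentence, keep as is
                continue
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy conditional distribution: "a" looks best at first, but "b z" is better overall.
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}

def toy_model(prefix):
    probs = TOY.get(prefix, {"<eos>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}

print(beam_search(toy_model, beam_width=1, max_len=3))  # greedy search
print(beam_search(toy_model, beam_width=2, max_len=3))  # beam search
```

With a beam width of 1 the search commits to "a" and ends with probability 0.3, while a beam width of 2 keeps "b" alive long enough to find the sentence with probability 0.36.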

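**Error Analysis for Beam Search**: Compare the conditional probability of the correct sentence $P(y^*|x)$ with that of the outputted sentence $P(\hat{y}|x)$. If $P(y^*|x) > P(\hat{y}|x)$, then the **RNN is working correctly but beam search failed to find the better sentence, so the beam width $n$ should be increased**. If $P(y^*|x) \leq P(\hat{y}|x)$, then the RNN itself is at fault.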

<a name='4-4'></a>
## 4.4 - Attention Models

Solves the issue of long sequences by paying attention to the relevant parts in the sequence

<a name='4-4-1'></a>
### 4.4.1 - Self-Attention
$A(q, K, V)$ = attention-based vector representation of a word
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
- Query ($Q$) 
- Key ($K$)
- Value ($V$)
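
The scaled dot-product attention formula can be sketched with plain lists. Here `Q`, `K` and `V` are assumed to be lists of row vectors (one row per word), and the softmax is applied over the keys for each query; this is an illustration, not an optimized implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed one query row at a time."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return outputs

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```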

<a name='4-4-2'></a>
### 4.4.2 - Multi-Head Attention
Runs several self-attention heads in parallel, each with its own learned projections, so that each head can answer a different question about the sequence

<font color='blue'>
    <b>What you should remember</b>:

- The combination of self-attention and convolutional network layers allows for parallelization of training and *faster training*.
- Self-attention is calculated using the generated query Q, key K, and value V matrices.
- Adding positional encoding to word embeddings is an effective way of including sequence information in self-attention calculations. 
- Multi-head attention can help detect multiple features in your sentence.
- Masking stops the model from 'looking ahead' during training, or weighting zeroes too much when processing cropped sentences.

<a name='4-4-3'></a>
### 4.4.3 - Transformer
When training a Transformer network, all the data is fed into the model at once. This dramatically reduces training time but loses the ordering information of the data. Positional encoding restores it by encoding the positions of your inputs and passing them into the network, using these sine and cosine formulas:
$$
PE_{(pos, 2i)}= \sin\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
\tag{1}$$
<br>
$$
PE_{(pos, 2i+1)}= \cos\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
\tag{2}$$

* $d$ is the dimension of the word embedding and positional encoding
* $pos$ is the position of the word.
* $i$ refers to each of the different dimensions of the positional encoding.

Satisfies the following criteria:
- It should output a unique encoding for each time-step (word’s position in a sentence)
- Distance between any two time-steps should be consistent across sentences with different lengths.
- Our model should generalize to longer sentences without any extra effort. Its values should be bounded.
- It must be deterministic.
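
The sine and cosine formulas can be sketched as a function that returns the encoding vector for one position. The function name `positional_encoding` is an assumption; even indices $2i$ use the sine and odd indices $2i+1$ use the cosine, both with the same $i$.

```python
import math

def positional_encoding(pos, d):
    """Positional encoding vector of dimension d for the word at position pos."""
    pe = []
    for k in range(d):
        i = k // 2  # dimensions 2i and 2i+1 share the same frequency
        angle = pos / (10000 ** (2 * i / d))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

for pos in range(3):
    print(pos, positional_encoding(pos, d=4))
```

Because every entry is a sine or cosine, the values stay within $[-1, 1]$ regardless of sentence length, and the same position always produces the same vector.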



<a name='5'></a>
# 5 - Tensorflow
## Streaming the Data

Here you should take note of an important extra step that's been added to the batch training process: 

- `dataset = dataset.prefetch(8)` 

What this does is prevent a memory bottleneck that can occur when reading from disk. `prefetch()` sets aside some data and keeps it ready for when it's needed. It does this by creating a source dataset from your input data, applying a transformation to preprocess the data, then iterating over the dataset the specified number of elements at a time. This works because the iteration is streaming, so the data doesn't need to fit into the memory. 

`X_train = X_train.batch(minibatch_size, drop_remainder=True).prefetch(8)` # <<< extra step    
`Y_train = Y_train.batch(minibatch_size, drop_remainder=True).prefetch(8)` # loads memory faster 

You can set the number of elements to prefetch manually, or you can use `tf.data.experimental.AUTOTUNE` to choose the parameters automatically. Autotune prompts `tf.data` to tune that value dynamically at runtime, by tracking the time spent in each operation and feeding those times into an optimization algorithm. The optimization algorithm tries to find the best allocation of its CPU budget across all tunable operations. 

To increase diversity in the training set and help your model learn the data better, it's standard practice to augment the images by transforming them, i.e., randomly flipping and rotating them. Keras' Sequential API offers a straightforward method for these kinds of data augmentations, with built-in, customizable preprocessing layers. These layers are saved with the rest of your model and can be re-used later.  Ahh, so convenient! 

As always, you're invited to read the official docs, which you can find for data augmentation [here](https://www.tensorflow.org/tutorials/images/data_augmentation).

`import tensorflow as tf`  
`AUTOTUNE = tf.data.experimental.AUTOTUNE`  
`train_dataset = train_dataset.prefetch(buffer_size=AUTOTUNE)`