# Batch vs mini-batch vs stochastic gradient descent

In the previous sections, neural network models were trained on the entire training set of $m$ examples. This means that computing the cost function and updating the model parameters once requires a complete pass, called an *epoch*, over all $m$ training examples. This can be extremely time consuming in the deep learning regime, where training sets are huge.

This motivates running gradient descent on a subset of the $m$ training examples and updating the parameters after each subset. These subsets are referred to as mini-batches (or simply batches), and the resulting algorithm is called **mini-batch gradient descent**. It usually makes progress faster than gradient descent on all $m$ training examples, because the parameters are updated many times per epoch rather than once.

In the case where the batches are of size 1, the algorithm is called **stochastic gradient descent**.

In the case where the batch size is $m$, the algorithm is simply **batch gradient descent**.
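The three variants differ only in how the training set is partitioned. The sketch below is one possible way to do the split with NumPy; the helper name `make_mini_batches`, the toy arrays and the choice to shuffle before splitting are assumptions made for illustration, not something prescribed by the text above.

```python
import numpy as np

def make_mini_batches(X, y, batch_size, seed=0):
    """Shuffle the m examples and split them into mini-batches.

    X has shape (m, n_features) and y has shape (m,). The last
    mini-batch is smaller when batch_size does not divide m.
    """
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    order = rng.permutation(m)              # random order of the m examples
    X_shuffled, y_shuffled = X[order], y[order]
    return [
        (X_shuffled[start:start + batch_size], y_shuffled[start:start + batch_size])
        for start in range(0, m, batch_size)
    ]

# Toy data set with m = 10 examples (hypothetical values).
m = 10
X = np.arange(m * 2, dtype=float).reshape(m, 2)
y = np.arange(m, dtype=float)

n_stochastic = len(make_mini_batches(X, y, batch_size=1))  # stochastic: m updates per epoch
n_mini = len(make_mini_batches(X, y, batch_size=4))        # mini-batch: ceil(m / 4) updates per epoch
n_batch = len(make_mini_batches(X, y, batch_size=m))       # batch: a single update per epoch
```

With this split, the batch size alone decides how many parameter updates happen in one epoch.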

## The implementation

Below is a pseudo-code implementation of one epoch of gradient descent with $T$ batches:

```
repeat for t=1 to t=T:
  get the batch x_t
  calculate forward propagation with x_t
  compute cost for batch t
  apply back propagation
  update weights
```
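The pseudo-code above can be made concrete for a very simple model. The following is a minimal sketch, assuming a linear model trained on the mean squared error instead of a neural network, so that forward and back propagation each fit on one line. The function name `minibatch_gradient_descent`, the hyperparameter defaults and the synthetic data are all assumptions chosen for the example.

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size=32, learning_rate=0.1, epochs=50, seed=0):
    """Fit y ~ X @ w + b with mini-batch gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    costs = []
    for _ in range(epochs):
        order = rng.permutation(m)                    # reshuffle the examples every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            X_t, y_t = X[idx], y[idx]                 # get the batch
            error = X_t @ w + b - y_t                 # forward propagation
            costs.append(0.5 * np.mean(error ** 2))   # cost for this batch
            grad_w = X_t.T @ error / len(idx)         # back propagation
            grad_b = error.mean()
            w -= learning_rate * grad_w               # update weights
            b -= learning_rate * grad_b
    return w, b, costs

# Synthetic regression problem (hypothetical data, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(512, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.3
y = X @ true_w + true_b + 0.01 * rng.normal(size=512)

w, b, costs = minibatch_gradient_descent(X, y)
```

Calling the same function with `batch_size=1` gives stochastic gradient descent, and `batch_size=len(X)` gives batch gradient descent.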

When comparing batch and mini-batch gradient descent, the cost tends to decrease smoothly under batch gradient descent. Under mini-batch gradient descent, the cost still trends downward as the number of iterations increases, but the curve oscillates, because each mini-batch gives only a noisy estimate of the full cost and gradient. Generally, the closer the mini-batch size $t$ is to 1, the larger the oscillations. In the extreme case of stochastic gradient descent, the algorithm may not settle at the global optimum and instead keeps wandering around it because of the noisy steps it takes while descending. Consider the following image, taken from Andrew Ng's course, which illustrates this situation.

![batch vs minibatch gradient descent](images/batch_vs_minibatch_vs_stochastic.png)

The image shows the contours of the cost function together with the paths taken by gradient descent toward the global minimum. The oscillating magenta path represents stochastic gradient descent, while the smooth, almost straight blue path represents batch gradient descent. The green path represents mini-batch gradient descent with a reasonable batch size $t$. Even in this case, however, reaching the global minimum is not guaranteed.
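The link between batch size and oscillation can also be checked numerically. The sketch below measures how much the mini-batch gradient fluctuates around the full-batch gradient at a fixed parameter value, for a few batch sizes. The one-parameter linear model, the synthetic data and the helper names `batch_gradient` and `gradient_spread` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2048
X = rng.normal(size=(m, 1))
y = 3.0 * X[:, 0] + rng.normal(size=m)    # synthetic data, assumed for illustration
w = 0.0                                    # evaluate the gradient at a fixed parameter value

def batch_gradient(idx):
    """Gradient of 0.5 * mean((w * x - y)^2) with respect to w on the examples in idx."""
    x = X[idx, 0]
    return np.mean((w * x - y[idx]) * x)

full_gradient = batch_gradient(np.arange(m))

def gradient_spread(batch_size, trials=2000):
    """Standard deviation of the gradient over randomly drawn mini-batches."""
    estimates = [
        batch_gradient(rng.choice(m, size=batch_size, replace=False))
        for _ in range(trials)
    ]
    return np.std(estimates)

spread = {size: gradient_spread(size) for size in (1, 16, 256)}
for size, value in spread.items():
    print(f"batch size {size:>3}: gradient std = {value:.3f} (full gradient = {full_gradient:.3f})")
```

The spread shrinks as the batch size grows, which is consistent with the smoother path of batch gradient descent and the noisy path of stochastic gradient descent.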

## General guidelines on choosing the mini-batch size $t$

- If the training set is small, batch gradient descent is the best candidate.
- If the training set is large, typical mini-batch sizes lie in the range 64 to 512.
- Generally, setting $t$ to a power of 2 (for example 64, 128, 256 or 512) tends to run faster.
- Make sure a training mini-batch fits in GPU/CPU memory.

# Exponentially weighted average

The exponentially weighted average is a statistical tool for estimating the average of a variable from its history, for example estimating the average daily temperature from the temperatures of the previous days.

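With $\theta_{t}$ denoting the value observed at step $t$ and $V_{t}$ the average after step $t$, the general formula of the exponentially weighted average, sometimes called the (exponential) moving average, is: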

$$V_{t} = \beta V_{t-1} + (1-\beta)\theta_{t}$$
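The recursion translates directly into a short loop. This is a minimal sketch; the function name `exponentially_weighted_average`, the starting value $V_0 = 0$ (which the formula above does not specify) and the temperature values are assumptions made for the example.

```python
def exponentially_weighted_average(values, beta, v0=0.0):
    """Return [V_1, ..., V_n] with V_t = beta * V_{t-1} + (1 - beta) * theta_t.

    v0 is the assumed starting value V_0.
    """
    averages = []
    v = v0
    for theta in values:
        v = beta * v + (1.0 - beta) * theta
        averages.append(v)
    return averages

temperatures = [10.0, 20.0, 30.0]   # made-up daily temperatures
smoothed = exponentially_weighted_average(temperatures, beta=0.5)
# V_1 = 0.5 * 0    + 0.5 * 10 = 5.0
# V_2 = 0.5 * 5.0  + 0.5 * 20 = 12.5
# V_3 = 0.5 * 12.5 + 0.5 * 30 = 21.25
```

Only the previous average has to be kept in memory, which is why the update costs a single line per new value.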

A couple of notes on the parameter $\beta$:

- $\beta \in [0,1]$
- It controls how strongly the history affects the moving average. As $\beta$ increases, more weight is given to the previous history. In particular, the average effectively covers approximately the last $\frac{1}{1-\beta}$ values.
- For small values of $\beta$, the moving average becomes more susceptible to outliers.
- The graph for small values of $\beta$ is noisier and becomes smoother as $\beta$ increases.
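The $\frac{1}{1-\beta}$ rule of thumb can be seen by unrolling the recursion. Assuming a starting value $V_0$ (not specified above), repeated substitution gives:

$$V_{t} = (1-\beta)\sum_{k=0}^{t-1}\beta^{k}\,\theta_{t-k} + \beta^{t}V_{0}$$

So the value observed $k$ steps ago is weighted by $(1-\beta)\beta^{k}$, which decays exponentially with $k$. For $\beta$ close to 1, the weight has dropped to roughly a third of its initial value after $\frac{1}{1-\beta}$ steps:

$$\beta^{\frac{1}{1-\beta}} \approx \frac{1}{e} \approx 0.37$$

For example, $\beta = 0.9$ corresponds to about 10 values ($0.9^{10} \approx 0.35$), and $\beta = 0.98$ to about 50 values ($0.98^{50} \approx 0.36$).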