# 1. Batch Normalization
At this point we should know that normalizing our data for machine learning models is typically a good idea. Normalizing means transforming every input feature so that it has a mean of 0 and a variance of 1.

We accomplish this by subtracting the mean and dividing by the standard deviation of the data.

#### $$X_{normalized} = \frac{X - \mu}{\sigma}$$

Recall that we do this because inputs near 0 fall in the region where our nonlinear activation functions are the most active/dynamic, that is, where they change the most.
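As a minimal sketch of this standardization, assuming a small made-up feature matrix `X` with one row per sample and one column per feature (the values are purely illustrative), we could write:

```python
import numpy as np

# Hypothetical data: 4 samples, 2 features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

mu = X.mean(axis=0)       # per-feature mean
sigma = X.std(axis=0)     # per-feature standard deviation
X_normalized = (X - mu) / sigma

print(X_normalized.mean(axis=0))  # approximately 0 for each feature
print(X_normalized.std(axis=0))   # approximately 1 for each feature
```

Note that the statistics are computed per column (`axis=0`), so each feature is normalized independently of the others.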

So, how can we think of batch normalization? Well, instead of making normalization part of the preprocessing stage, we make it part of the neural network itself. 

<br>
<img src="images/old-norm.png">

<img src="images/new-norm.png">

However, what makes this useful is that we perform normalization at every layer! Recall that each layer of a neural network is like a little logistic regression. So rather than just normalizing the data once, we will normalize it before we do every little logistic regression.

---

<br>
# 2. Exponentially Smoothed Averages
We are going to take a minute to dig into something that may seem straightforward: how to calculate an average. The first thought we all have when asked to do this is: why not just add up all of the sample data points and then divide by the number of data points, resulting in the sample mean:

#### $$\bar{X}_N = \frac{1}{N}\sum_{n=1}^NX_n$$

But now let's suppose that you have a large amount of data, so much so that all of your X's cannot fit into memory at the same time. Is it still possible to calculate the sample mean? Yes it is! We can read in one data point at a time, and then delete each data point after we've looked at it. It is shown below that the current sample mean can actually be expressed in terms of the previous sample mean and the current data point.

#### $$\bar{X}_N =  \frac{1}{N}\sum_{n=1}^NX_n = \frac{1}{N}\Big((N-1)\bar{X}_{N-1} + X_N \Big) = (1 - \frac{1}{N})\bar{X}_{N-1}+\frac{1}{N}X_N$$

We can then express this using simpler symbols. We can call $Y$ our output, and we can use $t$ to represent the current time step:

#### $$Y_t = (1 - \frac{1}{t})Y_{t-1} + \frac{1}{t}X_t$$

Great, so we have solved our problem of how to calculate the sample mean when we can't fit all of the data into memory, but we can see that there is this $\frac{1}{t}$ term. This says that as $t$ grows larger, the current sample has less and less of an effect on the total mean. Clearly this makes sense, because as $t$ grows that means the total number of $X$'s we've seen has grown. We also decrease the influence of the previous $Y$ by $1 - \frac{1}{t}$. This means that each new $Y$ is just part of the old $Y$, plus part of the newest $X$. But in the end, it balances out to give us exactly the sample mean of $X$. 
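To make the recursion concrete, here is a small sketch, assuming the data arrives as a Python iterable (the helper name `running_mean` and the random data are illustrative only). It keeps a single number in memory and still ends up at the ordinary sample mean:

```python
import numpy as np

def running_mean(stream):
    """Return the sample mean of an iterable, seeing one value at a time."""
    y = 0.0
    for t, x in enumerate(stream, start=1):
        # Y_t = (1 - 1/t) * Y_{t-1} + (1/t) * X_t
        y = (1 - 1 / t) * y + (1 / t) * x
    return y

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=1000)  # made-up data
print(running_mean(data), data.mean())  # the two values agree up to rounding
```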

For convenience we can call this $\frac{1}{t}$ term $\alpha_t$. What if we did not want $\alpha_t$ to be $\frac{1}{t}$? What if we wanted each data point to carry the same weight at the time that we see it, so that $\alpha$ is a constant? Of course, $\alpha$ needs to be between 0 and 1, so that both weights, $\alpha$ and $1 - \alpha$, stay positive.

#### $$0 < \alpha_t = \alpha = \text{constant} < 1$$
#### $$Y_t = (1 - \alpha)Y_{t-1} + \alpha X_t$$

So what does this give us? 

### 2.1 The Exponentially-smoothed average
This gives us what is called the exponentially smoothed average! We can see why it is called exponential when we express it in terms of only $X$'s. 

#### $$Y_t = (1 - \alpha)^tY_0 + \alpha \sum_{\tau = 0}^{t - 1}(1- \alpha)^\tau X_{t- \tau}$$

If the equation above is not clear, the expansion below should clear up where everything is coming from and *why* this is called exponential. Let's say we are looking at $Y_{100}$:

#### $$Y_{100} = (1-\alpha)^{100}Y_0 + \alpha * X_{100} + \alpha * (1 - \alpha)^1*X_{99} + \alpha * (1 - \alpha)^2 * X_{98}+ ...$$

We can see the exponential term start to accumulate along the $(1 - \alpha)$! Now, does this still give us the mean, aka the expected value of $X$? Well, if we take the expected value of everything (and assume the previous output already has expected value $E[X]$), we can see that we arrive at the expected value of $X$:

#### $$E[Y_t] = (1 - \alpha)E[Y_{t-1}] + \alpha E[X_t] = (1-\alpha)E[X] + \alpha E[X] = E[X]$$

We do arrive at the expected value of $X$, so we can see that the math does check out! Of course, this is assuming that the distribution of $X$ does not change over time. Note that if you come from a signal processing background, you may recognize this as a **low-pass filter**. Another way to think about this is that you are saying *current values matter more*, and *past values matter less* in an exponentially decreasing way. So, if $X$ is not stationary (meaning its distribution changes over time), then this is actually a better way to estimate the mean (average) than weighting all data points equally over all time.
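The two claims above can be checked with a short sketch. Everything here is assumed for illustration: the value of `alpha`, the helper name `smooth`, the random data, and the made-up step signal. The first part confirms that the recursion and its expanded form agree; the second shows the smoothed average following a change in the mean that the equally weighted mean cannot follow.

```python
import numpy as np

alpha = 0.05  # assumed smoothing constant, chosen only for illustration

def smooth(X, alpha, Y0=0.0):
    """Apply Y_t = (1 - alpha) * Y_{t-1} + alpha * X_t and return the final Y."""
    Y = Y0
    for x in X:
        Y = (1 - alpha) * Y + alpha * x
    return Y

# Part 1: the recursion agrees with the expanded (closed) form.
rng = np.random.default_rng(1)
X = rng.normal(size=100)   # X[0] plays the role of X_1, X[99] of X_100
Y0 = 0.5
t = len(X)
closed = (1 - alpha) ** t * Y0 + alpha * sum(
    (1 - alpha) ** tau * X[t - 1 - tau] for tau in range(t)
)
recursive = smooth(X, alpha, Y0)
print(recursive, closed)

# Part 2: a signal whose mean jumps from 0 to 5 halfway through.
step = np.concatenate([np.zeros(500), np.full(500, 5.0)])
ema = smooth(step, alpha)   # weights recent values more heavily
plain_mean = step.mean()    # weights every value equally
print(ema, plain_mean)      # ema sits near 5, plain_mean at 2.5
```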


---

<br>
# 3. Batch Normalization - Theory
Recall, before we input any of our data into ML algorithms, we like to normalize the data first. Normalization means subtracting the mean and dividing by the standard deviation. This forces our data to have a mean of 0 and a variance of 1. We like to do this because it keeps our inputs in a specific range, and we know that our sigmoid and tanh functions are most active (least saturated) when the inputs are close to 0.

With **batch normalization**, instead of normalizing the data manually before putting it into the neural network, normalization occurs at every layer of the neural network. You can think of it as building normalization into the neural net vs. doing the calculation yourself before the data is input into the neural net. So, the normalization occurs at the neural network level, and not the data level.

<br>
## 3.1 How does it really work?
First and foremost, it should be known that the name batch normalization comes from the fact that we will be doing batch gradient descent (often called mini-batch gradient descent). In other words, during training we will be looking at a small batch of data and doing one gradient descent step on that, then looking at the next batch, and so on.

So, it is called batch normalization because the mean and standard deviation that we calculate are the sample mean and sample standard deviation of the batch. Keep in mind that this only applies during training, because it is only during training that we will have batches of data. 

```
X_B = next batch of data
mu_B = mean(X_B)
sigma_B = std(X_B)
Y_B = (X_B - mu_B) / sigma_B
```

You may still wonder, "when does batch normalization actually happen?" Well, batch normalization happens right before we pass the data through the activation function. So, before we had two steps:
1. Do the linear transformation
2. Then pass it through the activation function. 

<img src="images/regular.png">

With batch normalization, we have 3 steps:
1. Do the linear transformation
2. Perform batch normalization 
3. Then pass it through the activation function

<img src="images/batch-gradient.png">
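The three steps can be sketched for a single layer as follows. This is only an illustration under assumed shapes (a batch of 32 samples, 4 inputs, 3 hidden units) with `tanh` picked as the activation; the names `X_in`, `W`, `A`, and `Z` are not from the original.

```python
import numpy as np

rng = np.random.default_rng(0)
X_in = rng.normal(size=(32, 4))   # batch of 32 samples, 4 input features
W = rng.normal(size=(4, 3))       # weights for a layer with 3 hidden units

A = X_in @ W                                    # 1. linear transformation
A_norm = (A - A.mean(axis=0)) / A.std(axis=0)   # 2. batch normalization
Z = np.tanh(A_norm)                             # 3. activation function

print(Z.shape)
```

The mean and standard deviation are taken over the batch dimension (`axis=0`), so each hidden unit is normalized separately.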

<br>
### 3.1.1 Naming Convention
One important thing to remember is that we are going to call the input to batch normalization $X$, and the output $Y$. This may be confusing at first, since generally the input to the neural network is referred to as $X$ and the output is referred to as $Y$. So, for this lecture only, $X$ refers to the inputs of batch normalization and $Y$ refers to the outputs of batch normalization.

<br>
## 3.2 Missing Step
Now, so far this may seem very simple. You have a batch of activations, you calculate their mean and standard deviation, you standardize the activations, and then pass them through the activation function as normal. Note, we often add a small number to the denominator in order to prevent dividing by 0 if the variance of the batch is 0.

```
X_B = next batch of data (Note: X refers to activation here)
mu_B = mean(X_B)
sigma_B_squared = var(X_B)
Y_B = (X_B - mu_B)/ sqrt(sigma_B_squared + epsilon)
```
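A quick sketch of why the small constant matters, assuming `eps = 1e-5` (the text only says "a small number") and a made-up batch in which one feature is constant:

```python
import numpy as np

eps = 1e-5  # assumed value for the small constant

# Hypothetical batch in which the second feature is constant (variance 0).
X_B = np.array([[1.0, 7.0],
                [2.0, 7.0],
                [3.0, 7.0]])

mu_B = X_B.mean(axis=0)
sigma_B_squared = X_B.var(axis=0)
Y_B = (X_B - mu_B) / np.sqrt(sigma_B_squared + eps)

print(Y_B)  # the constant column becomes zeros rather than NaN
```

Without `eps`, the second column would compute 0 / 0.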

However, at this point we are missing one crucial and slightly counterintuitive step, which comes right before we pass through the activation function. Right after standardizing the data, we actually scale and shift it to something else, giving it a different mean and a different standard deviation. We call this second scale parameter $\gamma$, and the second location parameter $\beta$.

#### $$\hat{X}_B = \frac{X_B - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
#### $$Y = \gamma\hat{X}_B + \beta$$

This may seem very counterintuitive, because this entire concept was motivated with the idea that standardization is good. So, why are we now unstandardizing our data by multiplying by $\gamma$ and adding $\beta$? The reason is that standardization may not be good, but we do not know. So, **we let the neural network decide by using gradient descent**. In other words, we will optimize $\gamma$ and $\beta$ to do whatever is best. 
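Putting the standardization and the scale-and-shift together, a training-time forward pass might look like the sketch below. The function name `batch_norm_train`, the default `eps`, and the particular `gamma` and `beta` values are assumptions for illustration; in a real network these two parameters would be learned.

```python
import numpy as np

def batch_norm_train(X_B, gamma, beta, eps=1e-5):
    """Standardize a batch per feature, then scale by gamma and shift by beta."""
    mu_B = X_B.mean(axis=0)
    sigma_B_squared = X_B.var(axis=0)
    X_hat = (X_B - mu_B) / np.sqrt(sigma_B_squared + eps)
    return gamma * X_hat + beta

rng = np.random.default_rng(0)
X_B = rng.normal(loc=10.0, scale=4.0, size=(64, 3))  # made-up activations
gamma = np.array([1.0, 2.0, 0.5])   # illustrative values; normally learned
beta = np.array([0.0, -1.0, 3.0])

Y = batch_norm_train(X_B, gamma, beta)
print(Y.mean(axis=0))  # approximately beta
print(Y.std(axis=0))   # approximately gamma
```

So whatever the mean and spread of the incoming batch were, the output has a mean of roughly $\beta$ and a standard deviation of roughly $\gamma$.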

Note, these parameters will be updated using backpropagation, just like all of the other weights in the neural network. Luckily, now that we know Theano and TensorFlow, we don't have to worry about taking any derivatives, and we can just focus on how the neural network is built.

<br>
## 3.3 Letting the Neural Network learn what is best
So, let's suppose that standardization is good. Then, the neural network will learn that $\gamma$ should be close to 1, and $\beta$ should be close to 0. However, if standardization is bad and something else will lead to better results, then the neural network will learn a better $\gamma$ and a better $\beta$, aka they will be whatever minimizes the cost! In other words, the neural network is learning through gradient descent what the best scale and shift of the data should be. This is not necessarily 1 and 0; it is whatever minimizes our cost function.
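In practice the framework computes these gradients for us, but for intuition here is a sketch of a single gradient descent step on $\gamma$ and $\beta$. It assumes a toy squared-error loss against made-up targets `T` and treats `X_hat` as an already standardized batch; the loss, the learning rate, and all names are illustrative and not part of the original.

```python
import numpy as np

rng = np.random.default_rng(0)
X_hat = rng.normal(size=(8, 3))   # stands in for an already standardized batch
T = rng.normal(size=(8, 3))       # hypothetical targets for a toy loss
gamma = np.ones(3)
beta = np.zeros(3)

def loss(gamma, beta):
    """Toy squared-error loss on the batch norm output Y = gamma * X_hat + beta."""
    Y = gamma * X_hat + beta
    return 0.5 * np.sum((Y - T) ** 2)

# Analytic gradients via the chain rule.
dY = (gamma * X_hat + beta) - T        # d(loss)/dY
d_gamma = np.sum(dY * X_hat, axis=0)   # d(loss)/d(gamma), one entry per feature
d_beta = np.sum(dY, axis=0)            # d(loss)/d(beta), one entry per feature

# One plain gradient descent step.
learning_rate = 0.01
new_gamma = gamma - learning_rate * d_gamma
new_beta = beta - learning_rate * d_beta

print(loss(gamma, beta), loss(new_gamma, new_beta))
```

After the step, $\gamma$ and $\beta$ have moved away from 1 and 0 toward whatever values lower this particular loss.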

## 3.4 Another Problem
So, we now know how to train a neural network with batch normalization, but we still have another problem. Let's say it is now time to do prediction; we have 1 sample input, and we would like to calculate a prediction for it. Well, we can't do what we have been doing! With a batch of one sample, the batch mean is the sample itself, so subtracting it would just give us a vector of 0s. So, clearly this is not what we want to do during test time.
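A tiny sketch of the problem, using a made-up single sample:

```python
import numpy as np

x_single = np.array([[2.0, -1.0, 5.0]])  # a "batch" containing one sample
mu = x_single.mean(axis=0)               # equals the sample itself
centered = x_single - mu

print(centered)  # all zeros, whatever the input was
```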

What would be nice is if we kept track of the sample means and sample variances we saw during training, so that we could calculate an overall (global) mean and overall (global) variance and use those during test time. This is exactly what we do: we keep a running mean and a running variance, updated using exponential smoothing. This will look similar to the smoothing used in RMSprop and Adam, since it is an exponentially smoothed average. We will call our global mean $\mu$ and our global variance $\sigma^2$. In the update below, `decay` plays the role of $1 - \alpha$ from the exponentially smoothed average.

```
for each batch B:
    mu = decay*mu + (1 - decay)*mu_B
    sigma_squared = decay*sigma_squared + (1 - decay)*sigma_B_squared
```
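A runnable version of this update might look like the following sketch. The `decay` value of 0.9, the initial values, and the randomly generated batches are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
decay = 0.9                  # assumed value; the text does not fix one
mu = np.zeros(2)             # running (global) mean, one entry per feature
sigma_squared = np.ones(2)   # running (global) variance

for _ in range(200):
    # Hypothetical batch: 32 samples, 2 features, true mean 3 and variance 4.
    X_B = rng.normal(loc=3.0, scale=2.0, size=(32, 2))
    mu_B = X_B.mean(axis=0)
    sigma_B_squared = X_B.var(axis=0)
    mu = decay * mu + (1 - decay) * mu_B
    sigma_squared = decay * sigma_squared + (1 - decay) * sigma_B_squared

print(mu)             # ends up near the true mean of 3
print(sigma_squared)  # ends up near the true variance of 4
```

Only the two running vectors are stored, no matter how many batches we see.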

Now, theoretically you could just calculate the global mean and global variance to be the mean and variance of all of your training data:
#### $$\mu = mean(X_{train})$$
#### $$\sigma^2 = var(X_{train})$$

The problem with that is that it does not scale if your data is too large. At the same time, that may be a simpler way to think of $\mu$ and $\sigma^2$.

## 3.5 Test/Prediction 
During test time, based on our above discussion:

#### $$\hat{x}_{test} = \frac{x_{test} - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
#### $$y_{test} = \gamma \hat{x}_{test} + \beta$$

Where $\mu$ and $\sigma^2$ are the running mean and running variance that we calculated during training.
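The test-time equations translate directly into code. In this sketch the function name `batch_norm_test` and all of the numeric values are hypothetical stand-ins for what training would have produced.

```python
import numpy as np

def batch_norm_test(x_test, mu, sigma_squared, gamma, beta, eps=1e-5):
    """Normalize with the global statistics saved during training."""
    x_hat = (x_test - mu) / np.sqrt(sigma_squared + eps)
    return gamma * x_hat + beta

# Hypothetical values standing in for the results of training.
mu = np.array([3.0, -1.0])
sigma_squared = np.array([4.0, 0.25])
gamma = np.array([1.0, 2.0])
beta = np.array([0.0, 0.5])

x_test = np.array([5.0, -1.0])  # a single sample is fine at test time
y_test = batch_norm_test(x_test, mu, sigma_squared, gamma, beta)
print(y_test)
```

Because the statistics no longer come from the input itself, a single sample does not collapse to zeros.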

## 3.6 Implementation
Now, in order to implement these, you could do it manually. However, there are functions built into both Theano and TensorFlow that can help us. For TensorFlow that is:

**TensorFlow Implementation**<br>
```
tf.nn.batch_normalization
# or
tf.contrib.layers.batch_norm
```

And for Theano:
**Theano Implementation**<br>
```
from theano.tensor.nnet.bn import (
    batch_normalization_train,
    batch_normalization_test,
)
```

One thing to keep in mind is that these are all element-wise operations. So, this applies equally to scalars and vectors.

## 3.7 Theory
So, after all of this discussion on the mechanics of batch normalization, you may still be wondering, how does batch normalization actually help? Well, the authors of the paper that introduced batch normalization mention **internal covariate shift**. That sounds very complicated; however, covariate is really just another term for input features, so we can call that X. By shift, they mean that as we perform gradient descent, the distribution of X may change as we change the network. Aka, this is saying that the distribution of the input features to each layer can change during training.

What happens when the data shifts is that the weights then have to compensate, because what they expected the data to look like before is not what it looks like now. So, if your input data has changed, then all of the weights in the layers after it in the neural network will have to change as well. The authors claim that this increases training time, because it requires lowering the learning rate and careful weight initialization.

But, let's say we do batch normalization, and the data in each layer is normalized. Let's assume the ideal case where all of the data has precisely mean 0 and variance 1. Well, now the weights won't have to adapt to different looking data, because all of the data looks the same.

This has two effects that have been shown by the authors of the original paper. 
1. First, it allows us to increase the learning rate, leading to faster training. 
2. Secondly, it acts like a regularizer. Because the inputs no longer have to take on extreme values, neither will the weights. It has been demonstrated that sometimes by including batch normalization in the network, you can eliminate the need for dropout regularization, since BN is already doing something like regularization. 

One final note: Usually a neural network has a weight matrix and bias vector as parameters. However, we no longer need the bias because batch normalization already shifts the data: subtracting the batch mean cancels any constant bias, and $\beta$ takes over its role.
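We can check this with a small sketch, assuming a made-up layer with random weights `W` and a bias vector `b` (the helper name `standardize` is illustrative). Standardizing the linear output gives the same result with or without the bias:

```python
import numpy as np

def standardize(A, eps=1e-5):
    """Per-feature standardization over the batch dimension."""
    return (A - A.mean(axis=0)) / np.sqrt(A.var(axis=0) + eps)

rng = np.random.default_rng(0)
X_in = rng.normal(size=(16, 4))   # batch of 16 samples, 4 input features
W = rng.normal(size=(4, 3))
b = rng.normal(size=3)            # hypothetical bias vector

with_bias = standardize(X_in @ W + b)
without_bias = standardize(X_in @ W)
print(np.allclose(with_bias, without_bias))
```

Adding `b` shifts every row by the same amount, which shifts the batch mean by that amount too, so the subtraction removes it.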