### Introduction

**Batch normalization** is a popular method for speeding up the training of deep networks, and it also mitigates the vanishing and exploding gradient problems. In [this post](https://kentchun33333.github.io/), I first discuss some ideas, then take a quick look at the algorithms in the [paper](https://arxiv.org/pdf/1502.03167v3.pdf), and finally take a closer look at how TensorFlow implements it.




### 1. Concepts about Normalization

It is quite common to use normalization in neural networks, especially deep convnets. An activation function acts like a filter/amplifier on feature signals, while the various normalization methods act like a smoother/de-amplifier. The underlying concept is that **only differences matter when delivering information**. Think of information like electricity in a CPU: today's driving voltage is much lower than in past CPUs, yet the chip handles heavier workloads, computes faster, and uses less energy.

### 2. How to Choose a Normalization Method
**There are multiple ways to normalize our data**. The following methods are in use, several of which predate the batch normalization layer:
- per-image normalization
- per-image whitening
- per-batch normalization
- per-batch whitening
- local contrast normalization (LCN)
- local response normalization (LRN)
- batch normalization 
- layer normalization
- instance normalization 
- group normalization
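
The last four entries in the list differ mainly in which axes the mean and variance are computed over. The following is a minimal NumPy sketch of that difference, assuming an `(N, H, W, C)` activation tensor; the helper `normalize`, the tensor shape, and the choice of two groups are illustrative assumptions, and the learnable scale and shift are left out.

```python
import numpy as np


def normalize(x, axes, eps=1e-5):
    """Standardize x with the mean/variance computed over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 4, 6))  # N, H, W, C

batch_normed = normalize(x, axes=(0, 1, 2))     # one statistic per channel
layer_normed = normalize(x, axes=(1, 2, 3))     # one statistic per sample
instance_normed = normalize(x, axes=(1, 2))     # one per sample and channel

# Group normalization: split the 6 channels into 2 groups of 3 and
# compute one statistic per sample and group.
groups = 2
n, h, w, c = x.shape
x_grouped = x.reshape(n, h, w, groups, c // groups)
group_normed = normalize(x_grouped, axes=(1, 2, 4)).reshape(x.shape)
```

Every variant applies the same standardization; only the set of values that share a mean and variance changes.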

**Which normalization to add depends on the operations before and after it.**

For example, if you are going to apply a ReLU with a threshold of 0.5 (which is not common) after a normalization layer, you probably don't want the output of that normalization layer to lie in the range [0, 1], since the following ReLU layer would wipe out too much information.
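
To make the ReLU example concrete, here is a small sketch. It assumes min-max scaling to [0, 1] as the normalization and a thresholded ReLU that passes values above `theta` and zeroes everything else; the helper name `thresholded_relu` and the synthetic Gaussian input are illustrative assumptions.

```python
import numpy as np


def thresholded_relu(x, theta=0.5):
    """Pass values above theta unchanged and zero out the rest."""
    return np.where(x > theta, x, 0.0)


rng = np.random.default_rng(0)
x = rng.normal(size=10_000)

# Min-max scaling squeezes every value into [0, 1].
x01 = (x - x.min()) / (x.max() - x.min())

# Fraction of activations that the thresholded ReLU sets to zero.
zeroed_fraction = np.mean(thresholded_relu(x01) == 0.0)
print(f"zeroed: {zeroed_fraction:.1%}")
```

With the output confined to [0, 1], a large share of the activations falls below the threshold and is discarded, which is the loss of information described above.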

As another example, you may want to introduce a **depth-wise (channel-wise) normalization** after a conv layer, because the conv layer is itself a depth-wise operation: it produces one feature map per output channel.

For more information about normalization, check this [post](http://yeephycho.github.io/2016/08/03/Normalizations-in-neural-networks/).

### 3. Math Expression of Batch Normalization


Batch normalization behaves differently in its two phases.

- Training: compute the mini-batch mean and variance and use them to normalize the batch

- Inference: apply pre-computed statistics instead of those of the current batch
  - (These statistics are estimated during training with a moving average of the mini-batch statistics)
  - running_mean = momentum * running_mean + (1 - momentum) * sample_mean
  - running_var = momentum * running_var + (1 - momentum) * sample_var
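
The two moving-average updates can be sketched in a few lines of NumPy. This is an illustration only: the momentum of 0.9, the initial values, and the synthetic data (mean 5, standard deviation 2) are assumptions chosen for the example.

```python
import numpy as np

momentum = 0.9
running_mean = 0.0  # assumed initial values
running_var = 1.0

rng = np.random.default_rng(0)
for step in range(200):
    # One training mini-batch of a single activation.
    batch = rng.normal(loc=5.0, scale=2.0, size=64)
    sample_mean, sample_var = batch.mean(), batch.var()

    # Exponential moving averages of the mini-batch statistics.
    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

print(running_mean, running_var)
```

After enough steps the running values settle near the statistics of the data (a mean of about 5 and a variance of about 4 here), and these are what the inference phase uses.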

$ \begin{array}{l}
\text{Algorithm 1: Batch Normalization over a Mini-Batch} \\
\hline
\text{Input: Values of } x \text{ over a mini-batch: } \mathcal{B} = \{ x_{1 \ldots m} \} \\
\text{Parameters to be learned: } \gamma \text{ and } \beta \\
\text{Output: } \{ y_{i} = \text{BN}_{\gamma, \beta}(x_{i}) \} \\
\mu_{\mathcal{B}} \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_{i} \qquad \text{(mini-batch mean)} \\
\sigma^{2}_{\mathcal{B}} \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_{i} - \mu_{\mathcal{B}})^{2} \qquad \text{(mini-batch variance)} \\
\hat{x}_{i} \leftarrow \frac{x_{i} - \mu_{\mathcal{B}}}{\sqrt{\sigma^{2}_{\mathcal{B}} + \epsilon}} \qquad \text{(normalize; } \epsilon \text{ is a small constant that prevents division by zero)} \\
y_{i} \leftarrow \gamma \hat{x}_{i} + \beta \equiv \text{BN}_{\gamma, \beta}(x_{i}) \qquad \text{(scale and shift)} \\
\end{array} $
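
The four steps of Algorithm 1 translate almost line by line into NumPy. The sketch below assumes a 2-D input of shape `(m, features)` normalized per feature; the function name `batch_norm_train` and the example values for the scale and shift are illustrative assumptions.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for x of shape (m, features)."""
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance (1/m)
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    y = gamma * x_hat + beta                 # scale and shift
    return y, mu, var


rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=3.0, size=(32, 4))
gamma = np.full(4, 2.0)
beta = np.full(4, -1.0)

y, mu, var = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0))  # close to beta
print(y.std(axis=0))   # close to gamma
```

Whatever the scale of the input, each output feature ends up with a mean close to its shift parameter and a standard deviation close to its scale parameter.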


$ \begin{array}{l}
\text{Algorithm 2: Training a Batch-Normalized Network} \\
\hline
\text{Input: Network N with trainable parameters } \Theta \text{; subset of activations } \{ x^{(k)} \}^{K}_{k=1} \\
\text{Output: Batch-normalized network for inference, N}^{\text{inf}}_{\text{BN}} \\
\text{ - N}^{\text{tr}}_{\text{BN}} \leftarrow \text{N (training BN network)} \\
\text{ - } \textbf{for } k = 1 \ldots K \textbf{ do} : \\
\quad \text{ - Add transformation } y^{(k)} = \text{BN}_{\gamma^{(k)}, \beta^{(k)}}(x^{(k)}) \text{ to N}^{\text{tr}}_{\text{BN}} \text{ (Alg. 1)} \\
\quad \text{ - Modify each layer in N}^{\text{tr}}_{\text{BN}} \text{ with input } x^{(k)} \text{ to take } y^{(k)} \text{ instead} \\
\text{ - } \textbf{end for} \\
\\
\text{ - Train N}^{\text{tr}}_{\text{BN}} \text{ to optimize the parameters } \Theta \cup \{ \gamma^{(k)}, \beta^{(k)} \}^{K}_{k=1} \\
\end{array} $

$ \begin{array}{l}
\text{Algorithm 3: Inference with a Batch-Normalized Network} \\
\hline
\text{ - N}^{\text{inf}}_{\text{BN}} \leftarrow \text{N}^{\text{tr}}_{\text{BN}} \text{ (inference BN network with frozen parameters)} \\
\\
\text{ - } \textbf{for } k = 1 \ldots K \textbf{ do} : \\
\quad \text{ - // For clarity, } x \equiv x^{(k)}, \gamma \equiv \gamma^{(k)}, \beta \equiv \beta^{(k)} \text{, etc.} \\
\quad \text{ - Process multiple training mini-batches } \mathcal{B} \text{, each of size } m \text{, and average their means } \mu_{\mathcal{B}} \text{ and variances } \sigma^{2}_{\mathcal{B}} : \\
\quad \text{ - E}[x] \leftarrow \text{E}_{\mathcal{B}}[\mu_{\mathcal{B}}] \\
\quad \text{ - Var}[x] \leftarrow \frac{m}{m-1} \text{E}_{\mathcal{B}}[\sigma^{2}_{\mathcal{B}}] \\
\quad \text{ - In N}^{\text{inf}}_{\text{BN}} \text{, replace the transform } y = \text{BN}_{\gamma, \beta}(x) \text{ with } y = \frac{\gamma}{\sqrt{\text{Var}[x]+\epsilon}} \cdot x + \left( \beta - \frac{\gamma \, \text{E}[x]}{\sqrt{\text{Var}[x]+\epsilon}} \right) \\
\text{ - } \textbf{end for} \\
\end{array} $
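
The inference transform is a fixed per-feature scale and shift, which can be precomputed once. The sketch below follows the steps of the inference algorithm in NumPy under a few assumptions: the helper name `fold_batch_norm`, the 50 synthetic mini-batches, and the example values for the scale and shift parameters are all illustrative.

```python
import numpy as np


def fold_batch_norm(gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Collapse inference-time batch norm into y = scale * x + shift."""
    scale = gamma / np.sqrt(pop_var + eps)
    shift = beta - scale * pop_mean
    return scale, shift


rng = np.random.default_rng(0)
m = 32
batches = [rng.normal(loc=10.0, scale=3.0, size=(m, 4)) for _ in range(50)]

# Average the mini-batch statistics; m / (m - 1) makes the variance unbiased.
pop_mean = np.mean([b.mean(axis=0) for b in batches], axis=0)
pop_var = m / (m - 1) * np.mean([b.var(axis=0) for b in batches], axis=0)

gamma = np.full(4, 2.0)
beta = np.full(4, -1.0)
scale, shift = fold_batch_norm(gamma, beta, pop_mean, pop_var)

# The folded form matches normalizing with the population statistics.
x_test = rng.normal(loc=10.0, scale=3.0, size=(5, 4))
y_folded = scale * x_test + shift
y_direct = gamma * (x_test - pop_mean) / np.sqrt(pop_var + 1e-5) + beta
```

Because the scale and shift no longer depend on the current batch, the output for one example is the same whether it is evaluated alone or in a batch.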

### 4. TensorFlow Implementation

```python
import tensorflow as tf  # written against the TensorFlow 1.x graph API


def batch_norm(x, phase_train, scope='bn', affine=True):
    """
    Batch normalization on convolutional maps.
    from: https://stackoverflow.com/questions/33949786/how-could-i-use-batch-normalization-in-tensorflow
    Only modified to infer shape from input tensor x.
    Parameters
    ----------
    x
        Tensor, 4D BHWD input maps
    phase_train
        boolean tf.Variable, true indicates training phase
    scope
        string, variable scope
    affine
        whether to affine-transform outputs
    Returns
    ------
    normed
        batch-normalized maps
    """
    with tf.variable_scope(scope):
        shape = x.get_shape().as_list()

        beta = tf.Variable(tf.constant(0.0, shape=[shape[-1]]),
                           name='beta', trainable=True)
        gamma = tf.Variable(tf.constant(1.0, shape=[shape[-1]]),
                            name='gamma', trainable=affine)

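        # Per-channel mean and variance over the batch, height, and width axes.
        batch_mean, batch_var = tf.nn.moments(x, [0, 1, 2], name='moments')
        ema = tf.train.ExponentialMovingAverage(decay=0.9)

        def mean_var_with_update():
            """Update the moving averages, then return the batch statistics.

            Returns
            -------
            mean, var : Tensor, Tensor
                Batch mean and variance, evaluated only after the
                moving-average update op has run.
            """
            ema_apply_op = ema.apply([batch_mean, batch_var])
            with tf.control_dependencies([ema_apply_op]):
                return tf.identity(batch_mean), tf.identity(batch_var)

        # Training: use (and record) the batch statistics.
        # Inference: use the moving averages. ema.average() returns None until
        # ema.apply() has been called, so it is looked up inside the false
        # branch, which tf.cond builds after the true branch.
        mean, var = tf.cond(phase_train,
                            mean_var_with_update,
                            lambda: (ema.average(batch_mean),
                                     ema.average(batch_var)))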

        normed = tf.nn.batch_norm_with_global_normalization(
            x, mean, var, beta, gamma, 1e-3, affine)
    return normed
    
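# For reference, this is the body of tf.nn.batch_normalization in the
# TensorFlow source. `ops` and `math_ops` are TensorFlow-internal modules.
from tensorflow.python.framework import ops
from tensorflow.python.ops import math_ops

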
def batch_normalization(x,
                        mean,
                        variance,
                        offset,
                        scale,
                        variance_epsilon,
                        name=None):
  with ops.name_scope(name, "batchnorm", [x, mean, variance, scale, offset]):
    inv = math_ops.rsqrt(variance + variance_epsilon)
    if scale is not None:
      inv *= scale
    return x * math_ops.cast(inv, x.dtype) + math_ops.cast(
        offset - mean * inv if offset is not None else -mean * inv, x.dtype)
```

### References

- http://yeephycho.github.io/2016/08/03/Normalizations-in-neural-networks/

- https://github.com/pkmital/CADL/blob/master/session-4/libs/batch_norm.py

- http://cthorey.github.io./backpropagation/

- http://r2rt.com/implementing-batch-normalization-in-tensorflow.html

- https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html

- https://www.zhihu.com/question/38102762

- http://shuokay.com/2016/10/15/wavenet/

- http://lamda.nju.edu.cn/weixs/project/CNNTricks/CNNTricks.html

- http://stackoverflow.com/questions/33949786/how-could-i-use-batch-normalization-in-tensorflow

- https://github.com/leichaocn/normalization_of_neural_network/blob/master/batch_normalization_practice.ipynb