# Normalization in Neural Networks

In this notebook, I'll discuss the normalization methods commonly used in neural networks, along with their applications and implementations.

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers as L

# Batch Normalization

Batch Normalization was introduced in the paper [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167.pdf).

They define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training. This adversely affects training speed because the later layers have to adapt to the shifted distribution.

They propose that whitening the inputs to each layer would be a step towards fixed input distributions, which would remove the ill effects of internal covariate shift.

Whitening linearly transforms the inputs so that they have zero mean and unit variance and are uncorrelated.
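As a rough illustration of what whitening does, here is a minimal NumPy sketch (NumPy rather than TensorFlow, to keep it framework-independent). The synthetic data and variable names are assumptions made for the example; it whitens with an eigendecomposition of the covariance matrix, which is one common way to do it and not necessarily what the paper has in mind.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D data: the second feature depends on the first
a = rng.normal(size=(10_000, 1))
x = np.hstack([3.0 * a + 1.0, a + 0.5 * rng.normal(size=(10_000, 1))])

# Center, then rotate and rescale using the eigendecomposition of the covariance
centered = x - x.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
whitened = centered @ eigvecs / np.sqrt(eigvals)

# The whitened features have zero mean and (up to rounding) identity covariance
whitened_cov = np.cov(whitened, rowvar=False)
```

Full whitening needs the covariance matrix of every layer's inputs, which is why the paper settles for normalizing each feature independently, as described next.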

**The paper introduces Batch Normalization as follows:**

1. Normalize each feature independently to have zero mean and unit variance:
<center><h3>$$ \hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}} $$</h3></center>
where $ x = (x^{(1)}...x^{(d)}) $ is the d-dimensional input.
2. The mean and variance are estimated from the current mini-batch rather than computed across the whole dataset.
3. Normalizing each feature to zero mean and unit variance could affect what the layer can represent. To overcome this, each feature is scaled and shifted by two trained parameters.
<center><h3>$$ y^{(k)} = \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)} $$</h3></center>
where $ y^{(k)} $ is the output of the batch normalization layer.
4. An exponential moving average of mean and variance is calculated during the training phase and is then used during inference.
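The four steps above can be sketched in plain NumPy. This is only an illustration under assumed values: the synthetic data, the batch size of 64, and the momentum of 0.1 are choices made for the example, and `gamma` and `beta` are held fixed here although a real layer learns them.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=3.0, size=(6400, 4))  # 4 features

momentum, eps = 0.1, 1e-5
gamma, beta = np.ones(4), np.zeros(4)          # learnable in a real layer
running_mean, running_var = np.zeros(4), np.ones(4)

# Training: normalize with mini-batch statistics, update the moving averages
for batch in np.split(data, 100):              # 100 mini-batches of 64 samples
    mean, var = batch.mean(axis=0), batch.var(axis=0)
    y_train = gamma * (batch - mean) / np.sqrt(var + eps) + beta
    running_mean = (1 - momentum) * running_mean + momentum * mean
    running_var = (1 - momentum) * running_var + momentum * var

# Inference: normalize a single sample with the stored moving averages
sample = data[:1]
y_infer = gamma * (sample - running_mean) / np.sqrt(running_var + eps) + beta
```

After enough batches, `running_mean` and `running_var` drift towards the statistics of the data, which is what lets inference work on a single sample without any batch.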

In [2]:
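class BatchNormalization(L.Layer):
    def __init__(self, eps=1e-5, momentum=0.1, **kwargs):
        super().__init__(**kwargs)
        self.eps = eps
        self.momentum = momentum

    def build(self, input_shape):
        channels = input_shape[-1]
        # Learnable affine parameters: scale (gamma) starts at 1, shift (beta) at 0
        self.scale = self.add_weight(name="scale", shape=(channels,), initializer="ones")
        self.shift = self.add_weight(name="shift", shape=(channels,), initializer="zeros")
        # Moving averages of the batch statistics, used at inference time
        self.exp_mean = self.add_weight(
            name="exp_mean", shape=(channels,), initializer="zeros", trainable=False)
        self.exp_var = self.add_weight(
            name="exp_var", shape=(channels,), initializer="ones", trainable=False)

    def call(self, x, training=False):
        x_shape = tf.shape(x)
        # Flatten every axis except channels: (batch * H * W, channels)
        x = tf.reshape(x, (-1, x.shape[-1]))

        if training:
            # Per-channel statistics of the current mini-batch
            mean = tf.reduce_mean(x, 0)
            var = tf.reduce_mean(x ** 2, 0) - mean ** 2
            self.exp_mean.assign((1 - self.momentum) * self.exp_mean + self.momentum * mean)
            self.exp_var.assign((1 - self.momentum) * self.exp_var + self.momentum * var)
        else:
            # Inference: use the moving averages collected during training
            mean, var = self.exp_mean, self.exp_var

        x_norm = (x - mean) / tf.sqrt(var + self.eps)
        x_norm = self.scale * x_norm + self.shift
        return tf.reshape(x_norm, x_shape)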

In [3]:
x = tf.random.normal((32, 24, 24, 3))
assert BatchNormalization()(x).shape == x.shape



# Layer Normalization
Layer normalization, introduced in the paper [Layer Normalization](https://arxiv.org/pdf/1607.06450.pdf), is a simpler normalization method that is generally used for NLP tasks but works in a wider range of settings. It normalizes each sample over all of its features, so it does not depend on the batch.

<center><h3>$ LN(x) = \gamma \cdot \frac{x - E_{H, W, C}[x]}{\sqrt{Var_{H, W, C}[x] + \epsilon}} + \beta $</h3></center>

In [4]:
class LayerNormalization(L.Layer):
    def __init__(self, eps=1e-5, **kwargs):
        super().__init__(**kwargs)
        self.eps = eps

    def build(self, input_shape):
        normalized_shape = tuple(input_shape[1:])
        # Learnable elementwise affine parameters: gain starts at 1, bias at 0
        self.gain = self.add_weight(name="gain", shape=normalized_shape, initializer="ones")
        self.bias = self.add_weight(name="bias", shape=normalized_shape, initializer="zeros")

    def call(self, x):
        # Normalize over every axis except the batch axis
        dims = list(range(1, len(x.shape)))

        mean = tf.reduce_mean(x, dims, keepdims=True)
        mean_x2 = tf.reduce_mean(x ** 2, dims, keepdims=True)
        var = mean_x2 - mean ** 2

        x_norm = (x - mean) / tf.sqrt(var + self.eps)
        return self.gain * x_norm + self.bias

In [5]:
x = tf.random.normal((32, 24, 24, 3))
assert LayerNormalization()(x).shape == x.shape

# Instance Normalization
Instance normalization was introduced in the paper [Instance Normalization: The Missing Ingredient for Fast Stylization](https://arxiv.org/pdf/1607.08022.pdf) to improve style transfer.

<center><h3>$ IN(x) = \gamma \cdot \frac{x - E_{H, W}[x]}{\sqrt{Var_{H, W}[x] + \epsilon}} + \beta $</h3></center>

In [6]:
class InstanceNormalization(L.Layer):
    def __init__(self, eps=1e-5, **kwargs):
        super().__init__(**kwargs)
        self.eps = eps

    def build(self, input_shape):
        channels = input_shape[-1]
        # Learnable affine parameters: scale starts at 1, shift at 0
        self.scale = self.add_weight(name="scale", shape=(channels,), initializer="ones")
        self.shift = self.add_weight(name="shift", shape=(channels,), initializer="zeros")

    def call(self, x):
        x_shape = tf.shape(x)
        channels = x.shape[-1]
        # Flatten the spatial axes: (batch, H * W, channels)
        x = tf.reshape(x, [x_shape[0], -1, channels])

        # Statistics per sample and per channel
        mean = tf.reduce_mean(x, [1], keepdims=True)
        mean_x2 = tf.reduce_mean(x ** 2, [1], keepdims=True)
        var = mean_x2 - mean ** 2

        x_norm = (x - mean) / tf.sqrt(var + self.eps)
        x_norm = self.scale * x_norm + self.shift
        return tf.reshape(x_norm, x_shape)

In [7]:
x = tf.random.normal((32, 24, 24, 3))
assert InstanceNormalization()(x).shape == x.shape

# Group Normalization
Batch Normalization works well with sufficiently large batch sizes but poorly with small ones, because its statistics are estimated over the batch. Large models often cannot be trained with large batch sizes because of the limited memory capacity of the devices.

Group Normalization, introduced in the paper [Group Normalization](https://arxiv.org/pdf/1803.08494.pdf), normalizes a set of features together as a group. This is based on the observation that classical features such as SIFT and HOG are group-wise features. The paper proposes dividing feature channels into groups and then separately normalizing all channels within each group.
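To see why small batches hurt, the following NumPy sketch measures how much the per-batch mean of a single feature fluctuates for two batch sizes. The synthetic activations, the batch sizes 2 and 64, and the helper name `batch_mean_spread` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(loc=0.0, scale=1.0, size=(2 ** 14,))  # one feature

def batch_mean_spread(batch_size):
    """Standard deviation of the per-batch mean across all mini-batches."""
    batches = activations.reshape(-1, batch_size)
    return batches.mean(axis=1).std()

spread_small = batch_mean_spread(2)    # roughly 1 / sqrt(2)
spread_large = batch_mean_spread(64)   # roughly 1 / sqrt(64)
```

The noise in the estimated mean shrinks only with the square root of the batch size, so with very small batches each normalization step uses statistics that are far from the true ones.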

All normalization layers can be defined by the following computation.
<center><h3>$ \hat{x}_{i} = \frac{1}{\sigma_{i}}(x_{i} - \mu_{i}) $</h3></center>
where $ \mu_{i} $ and $ \sigma_{i} $ are the mean and standard deviation.
<center><h3>$ \mu_{i}  = \frac{1}{m}\sum_{k \in S_{i}} x_{k} $</h3></center>
<center><h3>$ \sigma_{i}  = \sqrt{\frac{1}{m}\sum_{k \in S_{i}} (x_{k} - \mu_{i})^{2} + \epsilon} $</h3></center>
$ S_{i} $ is the set of indexes across which the mean and standard deviation are calculated for index $ i $. $ m $ is the size of the set $ S_{i} $, which is the same for all $ i $.

The definition of $ S_{i} $ is different for Batch normalization, Layer normalization, and Instance normalization.


**Batch Normalization**
<center><h3> $ S_{i} = \{ k|k_{c} = i_{c} \} $ </h3></center>
The values that share the same feature channel are normalized together.
<br><br>

**Layer Normalization**
<center><h3> $ S_{i} = \{ k|k_{n} = i_{n} \} $ </h3></center>
The values from the same sample in the batch are normalized together.
<br><br>

**Instance Normalization**
<center><h3> $ S_{i} = \{ k|k_{n} = i_{n}, k_{c} = i_{c} \} $ </h3></center>
The values from the same sample and same feature channel are normalized together.

<br><br>

**Group Normalization**
<center><h3> $ S_{i} = \{ k|k_{n} = i_{n}, \lfloor \frac{k_{c}}{C/G} \rfloor = \lfloor \frac{i_{c}}{C/G} \rfloor \} $ </h3></center>
where $ G $ is the number of groups and $ C $ is the number of channels.

Group normalization normalizes values of the same sample and the same group of channels together.
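Since the four methods differ only in the set $ S_{i} $, they can be written as one standardization routine applied over different axes. The NumPy sketch below assumes a channels-last `(N, H, W, C)` tensor and leaves out the learnable scale and shift; the function names `normalize` and `group_normalize` are hypothetical helpers for this illustration.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Standardize x over the given axes (the set S_i), keeping its shape."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def group_normalize(x, groups, eps=1e-5):
    """Group normalization for an (N, H, W, C) tensor with C divisible by groups."""
    n, h, w, c = x.shape
    # Channel index k_c lands in group floor(k_c / (C / G))
    grouped = x.reshape(n, h, w, groups, c // groups)
    return normalize(grouped, axes=(1, 2, 4), eps=eps).reshape(x.shape)

x = np.random.default_rng(0).normal(size=(4, 6, 6, 8))  # (N, H, W, C)

batch_norm = normalize(x, axes=(0, 1, 2))     # same channel
layer_norm = normalize(x, axes=(1, 2, 3))     # same sample
instance_norm = normalize(x, axes=(1, 2))     # same sample and channel
group_norm = group_normalize(x, groups=2)     # same sample and channel group
```

With a single group, group normalization reduces to layer normalization, and with one group per channel it reduces to instance normalization.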

In [8]:
class GroupNormalization(L.Layer):
    def __init__(self, groups, channels, eps=1e-5, **kwargs):
        super().__init__(**kwargs)
        self.eps = eps
        self.groups = groups
        self.channels = channels

        # Learnable per-channel affine parameters: scale starts at 1, shift at 0
        self.scale = self.add_weight(name="scale", shape=(channels,), initializer="ones")
        self.shift = self.add_weight(name="shift", shape=(channels,), initializer="zeros")

    def call(self, x):
        x_shape = tf.shape(x)
        # Split channels into contiguous groups: (batch, H * W, groups, channels // groups)
        x = tf.reshape(x, [x_shape[0], -1, self.groups, self.channels // self.groups])

        # Statistics per sample and per group
        mean = tf.reduce_mean(x, [1, 3], keepdims=True)
        mean_x2 = tf.reduce_mean(x ** 2, [1, 3], keepdims=True)
        var = mean_x2 - mean ** 2

        x_norm = (x - mean) / tf.sqrt(var + self.eps)
        x_norm = tf.reshape(x_norm, [x_shape[0], -1, self.channels])
        x_norm = self.scale * x_norm + self.shift

        return tf.reshape(x_norm, x_shape)

In [9]:
x = tf.random.normal((32, 24, 24, 8))
assert GroupNormalization(2, 8)(x).shape == x.shape

# Weight Standardization
Batch Normalization doesn't work well when the batch size is too small, which happens when training large networks because of device memory limitations. The paper [Micro-Batch Training with Batch-Channel Normalization and Weight Standardization](https://arxiv.org/pdf/1903.10520.pdf) introduces Weight Standardization with Batch-Channel Normalization as a better alternative.

<center><h3> $ \hat{W}_{i, j} = \frac{W_{i, j} - \mu_{W_{i, \cdot}}}{\sigma_{W_{i, \cdot}}} $ </h3></center>
where
<center><h3> $ W \in  R^{O \times I} $ </h3></center>
<center><h3> $ \mu_{W_{i, \cdot}} = \frac{1}{I} \sum_{j=1}^{I} W_{i, j} $ </h3></center>
<center><h3> $ \sigma_{W_{i, \cdot}} = \sqrt{\frac{1}{I} \sum_{j=1}^{I} W_{i, j}^{2} - \mu_{W_{i, \cdot}}^{2} + \epsilon} $ </h3></center>
For a 2D convolution layer, $ O $ is the number of output channels and $ I $ is the number of input channels times the kernel size $ (I = C_{in} \times K_{H} \times K_{W}) $.

In [10]:
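def weight_standardization(weight, eps=1e-5):
    # Assumes the layout (c_out, c_in, *kernel_shape); Keras Conv2D kernels are
    # stored as (K_H, K_W, C_in, C_out) and would need to be transposed first
    c_out, c_in, *kernel_shape = weight.shape
    weight = tf.reshape(weight, (c_out, -1))

    # Statistics over all inputs of each output channel
    mean = tf.reduce_mean(weight, [1], keepdims=True)
    mean_x2 = tf.reduce_mean(weight ** 2, [1], keepdims=True)
    var = mean_x2 - mean ** 2

    weight = (weight - mean) / tf.sqrt(var + eps)

    return tf.reshape(weight, (c_out, c_in, *kernel_shape))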
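Keras `Conv2D` layers store their kernels as `(K_H, K_W, C_in, C_out)`, with the output channel last. The NumPy sketch below shows the same standardization for that layout by reducing over the first three axes. The function name `standardize_keras_kernel` and the random kernel are assumptions for the example, not part of the paper or of Keras.

```python
import numpy as np

def standardize_keras_kernel(kernel, eps=1e-5):
    """Standardize a (K_H, K_W, C_in, C_out) kernel per output channel."""
    # Each output channel sees I = C_in * K_H * K_W weights
    mean = kernel.mean(axis=(0, 1, 2), keepdims=True)
    var = kernel.var(axis=(0, 1, 2), keepdims=True)
    return (kernel - mean) / np.sqrt(var + eps)

kernel = np.random.default_rng(0).normal(loc=0.3, scale=2.0, size=(3, 3, 16, 32))
standardized = standardize_keras_kernel(kernel)
```

After standardization, the weights feeding each of the 32 output channels have zero mean and approximately unit variance, regardless of how the kernel was initialized.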

# Batch-Channel Normalization

Batch-Channel Normalization first performs a batch normalization and then a channel normalization.

Channel Normalization is similar to Group Normalization, but the affine transform is applied group-wise.

In [11]:
class ChannelNormalization(L.Layer):
    def __init__(self, groups, channels, eps=1e-5, **kwargs):
        super().__init__(**kwargs)
        self.eps = eps
        self.groups = groups
        self.channels = channels

        # Learnable per-group affine parameters: scale starts at 1, shift at 0
        self.scale = self.add_weight(name="scale", shape=(groups,), initializer="ones")
        self.shift = self.add_weight(name="shift", shape=(groups,), initializer="zeros")

    def call(self, x):
        x_shape = tf.shape(x)
        # Split channels into contiguous groups: (batch, H * W, groups, channels // groups)
        x = tf.reshape(x, [x_shape[0], -1, self.groups, self.channels // self.groups])

        # Statistics per sample and per group
        mean = tf.reduce_mean(x, [1, 3], keepdims=True)
        mean_x2 = tf.reduce_mean(x ** 2, [1, 3], keepdims=True)
        var = mean_x2 - mean ** 2

        x_norm = (x - mean) / tf.sqrt(var + self.eps)
        # Group-wise affine transform: one scale and one shift per group
        x_norm = tf.reshape(self.scale, (1, 1, -1, 1)) * x_norm + tf.reshape(self.shift, (1, 1, -1, 1))

        return tf.reshape(x_norm, x_shape)

In [12]:
x = tf.random.normal((32, 24, 24, 8))
assert ChannelNormalization(2, 8)(x).shape == x.shape
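The two stages can be chained as described above. The NumPy sketch below is a simplified illustration only: it uses plain mini-batch statistics for the batch stage, omits the learnable scale and shift of both stages, and does not attempt to reproduce how the paper estimates batch statistics for micro-batch training. The helper names `standardize` and `batch_channel_normalize` are hypothetical.

```python
import numpy as np

def standardize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the standard deviation over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_channel_normalize(x, groups, eps=1e-5):
    """Batch normalization followed by channel (group-wise) normalization."""
    n, h, w, c = x.shape
    # Step 1: batch normalization, statistics per channel over (N, H, W)
    x = standardize(x, axes=(0, 1, 2), eps=eps)
    # Step 2: channel normalization, statistics per sample and channel group
    grouped = x.reshape(n, h, w, groups, c // groups)
    return standardize(grouped, axes=(1, 2, 4), eps=eps).reshape(n, h, w, c)

x = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=(4, 6, 6, 8))
y = batch_channel_normalize(x, groups=2)
```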

# References
1. [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167.pdf)
2. [Layer Normalization](https://arxiv.org/pdf/1607.06450.pdf)
3. [Instance Normalization: The Missing Ingredient for Fast Stylization](https://arxiv.org/pdf/1607.08022.pdf)
4. [Group Normalization](https://arxiv.org/pdf/1803.08494.pdf)
5. [Micro-Batch Training with Batch-Channel Normalization and Weight Standardization](https://arxiv.org/pdf/1903.10520.pdf)