# Layer Normalization

## Layer Normalization

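I previously covered the principles of [Batch Normalization](https://github.com/MorvanLi/deepLearningTutorial/blob/main/batchNormalization.ipynb); this section gives a brief introduction to `Layer Normalization`. Layer Normalization (LN) was proposed for natural language processing, for models such as recurrent neural networks (RNNs). Why not use BN directly? In sequence models like RNNs the sequence length is not a fixed value (the unrolled depth of the network varies); sentences, for example, differ in length. This makes BN hard to apply, so the authors proposed Layer Normalization. (Note that in image processing BN is generally more effective than LN, but many people now apply NLP models to images, such as Vision Transformer, and those models still involve LN.)

The official PyTorch documentation for LayerNorm describes it with the following formula:

$$ y = \frac{x - E(x)}{\sqrt{Var(x)+\varepsilon}} \cdot \gamma + \beta $$

Judging from the formula alone, LN looks no different from BN: both subtract the mean $E(x)$ and divide by the standard deviation $\sqrt{Var(x)+\varepsilon}$, where $\varepsilon$ is a very small constant (default $10^{-5}$) that prevents a zero denominator. LN also has two trainable parameters, $\gamma$ and $\beta$. The difference is that BN normalizes each channel across a batch of data, whereas LN normalizes the specified dimensions of a single sample, independent of the batch (an example follows later). In addition, BN has to accumulate two running statistics during training, $moving_{mean}$ and $moving_{var}$, so BN keeps four quantities ($moving_{mean}$, $moving_{var}$, $\beta$ and $\gamma$), while LN accumulates nothing and has only $\beta$ and $\gamma$.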
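To make the formula concrete, here is a minimal sketch in plain Python that normalizes a single feature vector. The helper name `layer_norm_row` is made up for this illustration, and the input row is taken from the first row of the sample tensor printed in the PyTorch experiment further down. Because those printed values are rounded to four decimals, the result should agree with the corresponding `t1` row only to about three decimals.

```python
import math

def layer_norm_row(row, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature vector following the LayerNorm formula (illustrative helper)."""
    mean = sum(row) / len(row)
    # Biased (population) variance: divide by N, not N - 1
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) * gamma + beta for x in row]

row = [0.6935, 0.5683, 0.9867]  # rounded values copied from the printed sample tensor
normalized = layer_norm_row(row)
print(normalized)
```

With the default `gamma=1.0` and `beta=0.0` the output has approximately zero mean and unit variance; other values rescale and shift it.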

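PyTorch's LayerNorm class has a **normalized_shape** parameter that specifies the dimensions to normalize over. Note that the documentation says "the last certain number of dimensions", so the specified dimensions must start from the last one. For example, if the data has shape [4, 2, 3], normalized_shape can be [3] (normalize over the last dimension), [2, 3] (normalize over the last two dimensions), or [4, 2, 3] (normalize over all dimensions). It cannot be [2] or [4, 2]; with those values PyTorch raises an error when the layer is applied to the data (normalized_shape=[2], for example).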
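The rule can be checked with a short sketch. It assumes a random input of shape [4, 2, 3] and that the mismatch surfaces as a `RuntimeError` during the forward pass, which is how current PyTorch releases behave as far as I know; the exact message text is not reproduced here because it can vary between versions.

```python
import torch
import torch.nn as nn

x = torch.rand(4, 2, 3)

# Valid choices: each one matches the trailing dimensions of x.
valid_shapes = ([3], [2, 3], [4, 2, 3])
for shape in valid_shapes:
    out = nn.LayerNorm(shape)(x)
    # The output always has the same shape as the input.
    print(shape, "->", tuple(out.shape))

# Invalid choice: [2] is not the trailing part of (4, 2, 3).
# The layer itself can be constructed; the error appears when it is applied.
bad_norm = nn.LayerNorm([2])
try:
    bad_norm(x)
    raised = False
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)
```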

## PyTorch experiment

In [24]:
import torch
import torch.nn as nn

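def layer_norm_process(feature: torch.Tensor, beta=0., gamma=1., eps=1e-5):
    # torch.var_mean returns (variance, mean); unbiased=False selects the
    # biased (population) variance, which is what LayerNorm uses
    var_mean = torch.var_mean(feature, dim=-1, unbiased=False)
    # mean -------> shape [4, 2]
    mean = var_mean[1]
    # variance -------> shape [4, 2]
    var = var_mean[0]

    # layer norm process: mean[..., None] appends a trailing dimension,
    # giving shape [4, 2, 1] so that it broadcasts against the [4, 2, 3] input
    feature = (feature - mean[..., None]) / torch.sqrt(var[..., None] + eps)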
    feature = feature * gamma + beta

    return feature


def main():
    t = torch.rand(4, 2, 3)
    print(t)
    # normalize over the last dimension only
    norm = nn.LayerNorm(normalized_shape=t.shape[-1], eps=1e-5)
    # official LayerNorm
    t1 = norm(t)
    # our own LayerNorm implementation
    t2 = layer_norm_process(t, eps=1e-5)
    print("t1:\n", t1)
    print("t2:\n", t2)


if __name__ == '__main__':
    main()

tensor([[[0.6935, 0.5683, 0.9867],
         [0.4870, 0.6400, 0.5081]],

        [[0.7813, 0.3420, 0.9768],
         [0.6240, 0.1320, 0.0312]],

        [[0.9584, 0.8086, 0.1088],
         [0.7080, 0.6783, 0.6243]],

        [[0.9419, 0.8908, 0.9590],
         [0.3025, 0.1646, 0.2289]]])
t1:
 tensor([[[-0.3193, -1.0333,  1.3526],
         [-0.8564,  1.4012, -0.5448]],

        [[ 0.3062, -1.3487,  1.0425],
         [ 1.3961, -0.5034, -0.8927]],

        [[ 0.8997,  0.4951, -1.3947],
         [ 1.0856,  0.2341, -1.3198]],

        [[ 0.3883, -1.3644,  0.9761],
         [ 1.2493, -1.1945, -0.0549]]], grad_fn=<NativeLayerNormBackward>)
t2:
 tensor([[[-0.3193, -1.0333,  1.3526],
         [-0.8564,  1.4012, -0.5448]],

        [[ 0.3062, -1.3487,  1.0425],
         [ 1.3961, -0.5034, -0.8927]],

        [[ 0.8997,  0.4951, -1.3947],
         [ 1.0856,  0.2341, -1.3198]],

        [[ 0.3883, -1.3643,  0.9760],
         [ 1.2493, -1.1945, -0.0549]]])
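The earlier comparison of what BN and LN store can also be inspected directly. The sketch below lists the trainable parameters and the non-trainable buffers of both layers; `nn.BatchNorm1d(3)` and `nn.LayerNorm(3)` are arbitrary sizes chosen only for illustration. In PyTorch, `weight` plays the role of $\gamma$ and `bias` the role of $\beta$, and the running statistics of BN are registered as buffers rather than trainable parameters.

```python
import torch.nn as nn

bn = nn.BatchNorm1d(3)  # 3 channels, chosen only for illustration
ln = nn.LayerNorm(3)    # normalize over a last dimension of size 3

# Trainable parameters (updated by the optimizer)
bn_params = [name for name, _ in bn.named_parameters()]
ln_params = [name for name, _ in ln.named_parameters()]

# Buffers (state saved with the module but not trained by gradient descent)
bn_buffers = [name for name, _ in bn.named_buffers()]
ln_buffers = [name for name, _ in ln.named_buffers()]

print("BN parameters:", bn_params)
print("BN buffers:   ", bn_buffers)
print("LN parameters:", ln_params)
print("LN buffers:   ", ln_buffers)
```

BN lists `running_mean` and `running_var` among its buffers, while LN has no buffers at all, which matches the statement that LN does not accumulate statistics.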


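## Computing the mean along tensor dimensions

`torch.mean()` with no dimension argument computes the mean of the entire tensor, while `tensor.mean(dim)` computes the mean along the specified dimension. For example, `tensor.mean(0)` averages along the first dimension, **`and that dimension is then removed from the result`**.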

In [38]:
torch.manual_seed(0)
tensor = torch.rand(4, 3, 3)
print(f"Original data:\n {tensor}")
print("")
print(f'Value of tensor.mean(): \n{tensor.mean()}')
print("")
print(f'Value of tensor.mean(0): \n{tensor.mean(0)} \n First dimension removed: {(tensor.mean(0)).shape}')
print("")
print(f'Value of tensor.mean(1): \n{tensor.mean(1)} \n Second dimension removed: {(tensor.mean(1)).shape}')
print("")
print(f'Value of tensor.mean(2): \n{tensor.mean(2)} \n Third dimension removed: {(tensor.mean(2)).shape}')

Original data:
 tensor([[[0.4963, 0.7682, 0.0885],
         [0.1320, 0.3074, 0.6341],
         [0.4901, 0.8964, 0.4556]],

        [[0.6323, 0.3489, 0.4017],
         [0.0223, 0.1689, 0.2939],
         [0.5185, 0.6977, 0.8000]],

        [[0.1610, 0.2823, 0.6816],
         [0.9152, 0.3971, 0.8742],
         [0.4194, 0.5529, 0.9527]],

        [[0.0362, 0.1852, 0.3734],
         [0.3051, 0.9320, 0.1759],
         [0.2698, 0.1507, 0.0317]]])

Value of tensor.mean(): 
0.44025862216949463

Value of tensor.mean(0): 
tensor([[0.3314, 0.3962, 0.3863],
        [0.3437, 0.4513, 0.4945],
        [0.4245, 0.5744, 0.5600]]) 
 First dimension removed: torch.Size([3, 3])

Value of tensor.mean(1): 
tensor([[0.3728, 0.6574, 0.3927],
        [0.3911, 0.4051, 0.4985],
        [0.4985, 0.4108, 0.8362],
        [0.2037, 0.4226, 0.1937]]) 
 Second dimension removed: torch.Size([4, 3])

Value of tensor.mean(2): 
tensor([[0.4510, 0.3578, 0.6141],
        [0.4610, 0.1617, 0.6721],
        [0.3750, 0.7288, 0.6417],
        [0.1983, 0.4710, 0.1507]]) 
 Third dimension removed: torch.Size([4, 3])

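Why do we get these results?

The tensor contains four $3 \times 3$ matrices, and tensor.mean(0) averages these four matrices element by element. For the first row of the result:
$$ (0.4963+0.6323+0.1610+0.0362)/4 \approx 0.3314 $$
$$ (0.7682+0.3489+0.2823+0.1852)/4 \approx 0.3962 $$
$$ (0.0885+0.4017+0.6816+0.3734)/4 \approx 0.3863 $$

tensor.mean(1) amounts to taking the mean of each column inside every $3 \times 3$ matrix. For the first matrix:
$$(0.4963+0.1320+0.4901)/3 \approx 0.3728$$
$$(0.7682+0.3074+0.8964)/3 \approx 0.6574$$
$$(0.0885+0.6341+0.4556)/3 \approx 0.3927$$

tensor.mean(2) amounts to taking the mean of each row inside every $3 \times 3$ matrix. For the first matrix:
$$(0.4963+0.7682+0.0885)/3 \approx 0.4510$$
$$(0.1320+0.3074+0.6341)/3 \approx 0.3578$$
$$(0.4901+0.8964+0.4556)/3 \approx 0.6141$$

The right-hand sides are the values printed by PyTorch. Hand arithmetic on the displayed inputs can differ from them in the last digit, because the displayed inputs are themselves rounded to four decimals.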
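The hand computations can be cross-checked with a short sketch. It rebuilds the $3 \times 3$ matrices from the rounded values printed above instead of relying on the random seed, so the comparison with the printed means uses a loose tolerance. The `keepdim=True` line at the end is an addition for illustration: it keeps the reduced dimension with size 1 instead of removing it.

```python
import torch

# Values copied from the printed 4x3x3 tensor (rounded to four decimals).
t = torch.tensor([
    [[0.4963, 0.7682, 0.0885], [0.1320, 0.3074, 0.6341], [0.4901, 0.8964, 0.4556]],
    [[0.6323, 0.3489, 0.4017], [0.0223, 0.1689, 0.2939], [0.5185, 0.6977, 0.8000]],
    [[0.1610, 0.2823, 0.6816], [0.9152, 0.3971, 0.8742], [0.4194, 0.5529, 0.9527]],
    [[0.0362, 0.1852, 0.3734], [0.3051, 0.9320, 0.1759], [0.2698, 0.1507, 0.0317]],
])

# mean(0): average the four matrices element by element -> shape [3, 3]
hand_dim0 = (t[0] + t[1] + t[2] + t[3]) / 4
# mean(1): average the three rows of each matrix (column means) -> shape [4, 3]
hand_dim1 = (t[:, 0, :] + t[:, 1, :] + t[:, 2, :]) / 3
# mean(2): average the three columns of each matrix (row means) -> shape [4, 3]
hand_dim2 = (t[:, :, 0] + t[:, :, 1] + t[:, :, 2]) / 3

print(torch.allclose(hand_dim0, t.mean(0)))
print(torch.allclose(hand_dim1, t.mean(1)))
print(torch.allclose(hand_dim2, t.mean(2)))

# keepdim=True keeps the reduced dimension with size 1
print(t.mean(2, keepdim=True).shape)
```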

Here is another example:


In [44]:
torch.manual_seed(0)
tensor1 = torch.rand(4, 2, 3, 3)
print(f"Original data:\n {tensor1}")
print("")
print(f'Value of tensor.mean(): \n{tensor1.mean()}')
print("")
print(f'Value of tensor.mean(0): \n{tensor1.mean(0)} \n First dimension removed: {(tensor1.mean(0)).shape}')
print("")
print(f'Value of tensor.mean(1): \n{tensor1.mean(1)} \n Second dimension removed: {(tensor1.mean(1)).shape}')
print("")
print(f'Value of tensor.mean(2): \n{tensor1.mean(2)} \n Third dimension removed: {(tensor1.mean(2)).shape}')
print("")
print(f'Value of tensor.mean(3): \n{tensor1.mean(3)} \n Fourth dimension removed: {(tensor1.mean(3)).shape}')

Original data:
 tensor([[[[0.4963, 0.7682, 0.0885],
          [0.1320, 0.3074, 0.6341],
          [0.4901, 0.8964, 0.4556]],

         [[0.6323, 0.3489, 0.4017],
          [0.0223, 0.1689, 0.2939],
          [0.5185, 0.6977, 0.8000]]],


        [[[0.1610, 0.2823, 0.6816],
          [0.9152, 0.3971, 0.8742],
          [0.4194, 0.5529, 0.9527]],

         [[0.0362, 0.1852, 0.3734],
          [0.3051, 0.9320, 0.1759],
          [0.2698, 0.1507, 0.0317]]],


        [[[0.2081, 0.9298, 0.7231],
          [0.7423, 0.5263, 0.2437],
          [0.5846, 0.0332, 0.1387]],

         [[0.2422, 0.8155, 0.7932],
          [0.2783, 0.4820, 0.8198],
          [0.9971, 0.6984, 0.5675]]],


        [[[0.8352, 0.2056, 0.5932],
          [0.1123, 0.1535, 0.2417],
          [0.7262, 0.7011, 0.2038]],

         [[0.6511, 0.7745, 0.4369],
          [0.5191, 0.6159, 0.8102],
          [0.9801, 0.1147, 0.3168]]]])

Value of tensor.mean(): 
0.4814554750919342

Value of tensor.mean(0): 
tensor([[[0.4252, 0.5465, 0.5216],
         ...

tensor.mean(0) computes the mean along dimension 0 of the tensor:
$$(0.4963+0.1610+0.2081+0.8352)/4 \approx 0.4252$$

tensor.mean(1) computes the mean along dimension 1 of the tensor:
$$(0.4963+0.6323)/2=0.5643$$

tensor.mean(2) computes the mean along dimension 2 of the tensor:
$$(0.4963+0.1320+0.4901)/3=0.3728$$

tensor.mean(3) computes the mean along dimension 3 of the tensor:
$$(0.4963+0.7682+0.0885)/3=0.4510$$
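These per-dimension means connect back to the difference between BN and LN described at the start. The sketch below assumes the 4-D tensor is laid out as [batch, channel, height, width], which is an interpretation added here rather than something the example above states. Under that assumption, BN-style statistics reduce over the batch and spatial dimensions (one value per channel), while LN-style statistics reduce over each sample's own dimensions (one value per sample).

```python
import torch
import torch.nn.functional as F

x = torch.rand(4, 2, 3, 3)  # assumed layout: [batch, channel, height, width]

# BN-style statistics: one mean per channel, across the batch and spatial dims
bn_mean = x.mean(dim=(0, 2, 3))
# LN-style statistics: one mean per sample, across that sample's own dims
ln_mean = x.mean(dim=(1, 2, 3))

print(bn_mean.shape)  # torch.Size([2])
print(ln_mean.shape)  # torch.Size([4])

# Normalizing with the per-sample statistics should reproduce layer norm
# over the last three dimensions (no affine parameters).
ln_mean_keep = x.mean(dim=(1, 2, 3), keepdim=True)
ln_var_keep = x.var(dim=(1, 2, 3), unbiased=False, keepdim=True)
manual = (x - ln_mean_keep) / torch.sqrt(ln_var_keep + 1e-5)
reference = F.layer_norm(x, x.shape[1:], eps=1e-5)

print(torch.allclose(manual, reference, atol=1e-5))
```

Because the LN statistics have one entry per sample, they do not change when other samples are added to or removed from the batch, which is why LN is independent of the batch size.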