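In deep learning, EMA (Exponential Moving Average, also called the exponentially weighted moving average) is often used to average a model's parameters, with the aim of improving test metrics and making the model more robust. It estimates the local mean of a variable, so that each update of the variable depends on its values over a recent stretch of history.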
# What is EMA?
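A moving average can be viewed as the mean of a variable's values over a recent period. Compared with assigning the variable directly, the moving average traces a flatter, smoother curve with less jitter, and a single outlier does not make it swing much.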
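Suppose a parameter $\theta$ takes the values $[\theta_1,\theta_2,...,\theta_t]$ at successive points in training. Its moving average at step $t$ is $v_t=\beta v_{t-1}+(1-\beta)\theta_t$, where $\beta$ is the decay rate, which controls how quickly the average follows new values. In Course 2, Improving Deep Neural Networks, Andrew Ng explains that the moving average $v_t$ is roughly equal to the mean of $\theta$ over the last $\frac{1}{1-\beta}$ steps.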
![EMA](https://gitee.com/FawkesDoris/drawing-bed/raw/master/img/EMA.png)
Figure 1: EMA with different values of $\beta$ (weather data)
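The larger $\beta$ is, the more the moving average depends on the history of $\theta$. With $\beta=0.9$ it is roughly the mean of the last 10 values of $\theta$; with $\beta=0.99$, roughly the mean of the last 100.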
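
The "last $\frac{1}{1-\beta}$ values" rule of thumb can be made concrete by unrolling the recursion. The following is a sketch that assumes the average starts from $v_0 = 0$, which the text above does not state:

$$
v_t = (1-\beta)\sum_{i=0}^{t-1}\beta^{i}\,\theta_{t-i}
$$

The value from $i$ steps back carries weight $(1-\beta)\beta^{i}$, so the weights decay geometrically. After $n = \frac{1}{1-\beta}$ steps the weight has shrunk by a factor of $\beta^{n} \approx \frac{1}{e} \approx 0.37$ (for example $0.9^{10} \approx 0.35$ and $0.99^{100} \approx 0.37$), which is why older values contribute little and the average behaves roughly like a mean over the last $n$ values.
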
**Benefit of the moving average: it uses little memory. The mean can be estimated without storing the last 10 or 100 historical values.**
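
The memory argument can be illustrated with a small pure-Python sketch. It compares an EMA, which keeps a single number of state, with an explicit window mean, which has to buffer the last $\frac{1}{1-\beta}$ values. The function names, the test signal, and the starting value $v_0 = 0$ are assumptions made for this illustration.

```python
from collections import deque


def ema_stream(values, beta):
    """Yield the EMA after each value; the only state kept is one float."""
    v = 0.0  # assumed starting value v_0 = 0
    for theta in values:
        v = beta * v + (1.0 - beta) * theta
        yield v


def window_mean_stream(values, window):
    """Yield the mean of the last `window` values; must buffer all of them."""
    buf = deque(maxlen=window)
    for theta in values:
        buf.append(theta)
        yield sum(buf) / len(buf)


# Hypothetical signal: 50 steps at 10.0, then a jump to 20.0 for 50 steps.
values = [10.0] * 50 + [20.0] * 50
beta = 0.9

ema = list(ema_stream(values, beta))
win = list(window_mean_stream(values, window=round(1 / (1 - beta))))

# Both estimates settle near the current level of the signal.
print(f"EMA at the end:         {ema[-1]:.3f}")
print(f"Window mean at the end: {win[-1]:.3f}")
```

Because the EMA here starts from 0, its first few outputs are biased toward 0; the estimate only becomes representative after roughly $\frac{1}{1-\beta}$ steps.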

# TensorFlow implementation
TensorFlow provides [tf.train.ExponentialMovingAverage](https://tensorflow.google.cn/api_docs/python/tf/train/ExponentialMovingAverage) to implement the moving average.
Example usage when creating a training model:

In [None]:
import tensorflow as tf

# Create variables.
var0 = tf.Variable(...)
var1 = tf.Variable(...)
# ... use the variables to build a training model...
...
# Create an op that applies the optimizer.  This is what we usually
# would use as a training op.
opt_op = opt.minimize(my_loss, var_list=[var0, var1])

# Create an ExponentialMovingAverage object
ema = tf.train.ExponentialMovingAverage(decay=0.9999)

with tf.control_dependencies([opt_op]):
    # Create the shadow variables, and add ops to maintain moving averages
    # of var0 and var1. This also creates an op that will update the moving
    # averages after each training step.  This is what we will use in place
    # of the usual training op.
    training_op = ema.apply([var0, var1])

# ... train the model by running training_op ...
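
For intuition about what `ema.apply` maintains, the sketch below mimics the shadow update in plain Python. It follows the rule given in the `tf.train.ExponentialMovingAverage` documentation, `shadow -= (1 - decay) * (shadow - variable)`, with the shadow starting at the variable's initial value. The documentation also describes an optional `num_updates` argument under which the effective decay becomes `min(decay, (1 + num_updates) / (10 + num_updates))`. The `ShadowValue` class, its attribute names, and the way it counts updates internally are hypothetical and exist only for this illustration; this is not TensorFlow code.

```python
class ShadowValue:
    """Plain-Python stand-in for one shadow variable (illustrative only)."""

    def __init__(self, initial, decay, use_num_updates=False):
        self.shadow = float(initial)      # shadow starts at the variable's value
        self.decay = decay
        self.use_num_updates = use_num_updates
        self.num_updates = 0              # assumption: counted internally here

    def effective_decay(self):
        if not self.use_num_updates:
            return self.decay
        n = self.num_updates
        # Smaller decay early in training, so the average adapts faster.
        return min(self.decay, (1 + n) / (10 + n))

    def update(self, value):
        d = self.effective_decay()
        self.shadow -= (1.0 - d) * (self.shadow - value)
        self.num_updates += 1
        return self.shadow


# The tracked variable jumps from 0.0 to 1.0 and stays there for 100 updates.
fixed = ShadowValue(0.0, decay=0.9999)
warm = ShadowValue(0.0, decay=0.9999, use_num_updates=True)
for _ in range(100):
    fixed.update(1.0)
    warm.update(1.0)

print(f"fixed decay:   {fixed.shadow:.4f}")
print(f"dynamic decay: {warm.shadow:.4f}")
```

With a fixed decay of 0.9999 the shadow has barely moved after 100 updates, while the dynamic decay lets it catch up quickly. This is one reason a very high decay is usually paired with many training steps or with `num_updates`.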

# PyTorch implementation
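At the time of writing, PyTorch did not officially provide an EMA implementation (newer releases include `torch.optim.swa_utils.AveragedModel`, which can also maintain an EMA), but writing one is not complicated.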

In [None]:
from torch.optim import Adam

# EMA implementation: keeps a shadow copy of every trainable parameter
class EMA:
    def __init__(self, model, decay):
        self.model = model
        self.decay = decay
        self.shadow = {}
        self.backup = {}

    def register(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def update(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.shadow
                new_average = (1.0 - self.decay) * param.data + self.decay * self.shadow[name]
                self.shadow[name] = new_average.clone()

    def apply_shadow(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.shadow
                self.backup[name] = param.data
                param.data = self.shadow[name]

    def restore(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}

# EMA usage example
# Initialization
ema = EMA(model, 0.999)
ema.register()

# During training, update the shadow weights right after each parameter update
optimizer = Adam(...)
def train():
    optimizer.step()
    ema.update()

# Before eval, apply the shadow weights; after eval, restore the model's original parameters
def evaluate():
    ema.apply_shadow()
    # evaluate
    ema.restore()
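
To see why evaluating with the shadow weights can help, here is a toy sketch in pure Python (no PyTorch needed). It runs noisy SGD on a hypothetical one-dimensional loss $0.5(w - 3)^2$ and applies the same `optimizer.step()` then `ema.update()` ordering as above. The loss, learning rate, noise level, and decay are all assumptions chosen for illustration, not values from the original text.

```python
import random
import statistics

random.seed(0)

target = 3.0   # minimizer of the hypothetical loss 0.5 * (w - target) ** 2
lr = 0.1
decay = 0.99

w = 0.0        # raw weight, updated by noisy SGD
shadow = w     # EMA copy, initialized the same way as EMA.register()

raw_tail, ema_tail = [], []
for step in range(3000):
    grad = (w - target) + random.gauss(0.0, 1.0)   # gradient plus noise
    w -= lr * grad                                 # plays the role of optimizer.step()
    shadow = (1.0 - decay) * w + decay * shadow    # plays the role of ema.update()
    if step >= 2000:                               # record only after convergence
        raw_tail.append(w)
        ema_tail.append(shadow)

raw_var = statistics.pvariance(raw_tail)
ema_var = statistics.pvariance(ema_tail)
print(f"variance of raw weight: {raw_var:.5f}")
print(f"variance of EMA weight: {ema_var:.5f}")
```

Both weights hover around the minimizer, but the EMA copy fluctuates much less from step to step. That stability is the effect `apply_shadow()` exploits at evaluation time.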