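# Weight Decay
Overfitting can be mitigated by collecting more training data, but acquiring more data is often expensive.

Once we already have as much high-quality data as we can obtain, we can turn our attention to regularization techniques.

In [polynomial regression](./overfittingAndUnderfitting.ipynb), we mitigated overfitting by limiting the number of features. Simply discarding features, however, is too blunt an instrument.

When training parametric machine learning models, weight decay is one of the most widely used regularization techniques. It is also commonly called L2 regularization, and it measures the complexity of a function by its distance from zero.

A simple way to do this is to measure the complexity of a linear function by some norm of its weight vector. To keep the weight vector small, the most common approach is to add its norm as a penalty term to the loss minimization problem. The original training objective, minimizing the prediction loss on the training labels, becomes minimizing the sum of the prediction loss and the penalty term.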
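Written out with the names used in the code later in this notebook, the penalized objective and the resulting gradient step could be sketched as follows. The symbols are labels chosen here for illustration: $L$ for the prediction loss, $\lambda$ for the penalty weight (`lambd` in the code), and $\eta$ for the learning rate. The factor $\frac{1}{2}$ matches `l2_penalty`.

$$
\min_{\mathbf{w}, b} \; L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2
$$

$$
\mathbf{w} \leftarrow (1 - \eta\lambda)\,\mathbf{w} - \eta\, \nabla_{\mathbf{w}} L(\mathbf{w}, b)
$$

The factor $(1 - \eta\lambda)$ shrinks the weights toward zero at every step, which is where the name weight decay comes from.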

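**Why does this help prevent overfitting?**
1. In terms of model complexity: smaller weights w correspond, in some sense, to a less complex network that is less prone to fitting noise in the training data (an argument in the spirit of Occam's razor). This is also borne out in practice, where models trained with L2 regularization often generalize better than models trained without it.
2. In mathematical terms: when a model overfits, the coefficients of the fitted function tend to be very large, and the function value changes sharply within some very small intervals. This means the derivative of the function has a very large absolute value on those intervals. Because the inputs themselves can be large or small, a large derivative can only come from large coefficients. Regularization constrains the norm of the parameters so that they do not grow too large, which reduces overfitting to some extent.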
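The second argument can be illustrated with a small sketch. The data, the polynomial degree, and the helper `fit_polynomial` below are hypothetical choices made for this illustration and are not part of the notebook. The sketch fits a degree-9 polynomial to 10 noisy points, once without a penalty and once with an L2 penalty, and compares the norms of the coefficient vectors.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data: 10 noisy samples of a sine curve on [-1, 1]
x = torch.linspace(-1, 1, 10, dtype=torch.float64)
y = torch.sin(3 * x) + 0.1 * torch.randn(10, dtype=torch.float64)

# Degree-9 polynomial features: enough capacity to pass through all 10 points
A = torch.stack([x ** k for k in range(10)], dim=1)


def fit_polynomial(A, y, lam):
    """Minimize ||A w - y||^2 + lam * ||w||^2 via the normal equations."""
    eye = torch.eye(A.shape[1], dtype=A.dtype)
    return torch.linalg.solve(A.T @ A + lam * eye, A.T @ y)


w_plain = fit_polynomial(A, y, lam=0.0)   # no regularization
w_ridge = fit_polynomial(A, y, lam=1e-2)  # L2-regularized

print("coefficient norm without penalty:", w_plain.norm().item())
print("coefficient norm with penalty:   ", w_ridge.norm().item())
```

The unpenalized fit has to reproduce the noise exactly, which it can only do with larger coefficients. The penalized fit trades a little training error for a smaller coefficient norm.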


In [83]:
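%matplotlib inline
import torch as t
import torch.nn as nn
from pltutils import *
import math
# random and np are used below; import them explicitly
# instead of relying on the star import above
import random
import numpy as np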

# Generate data

In [84]:
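# ground truth: 200 inputs, every weight equal to 0.01, bias 0.05
true_w, true_b = t.ones(200) * 0.01, 0.05
# 20 training examples followed by 100 test examples
X = t.normal(0, 1, size=(20 + 100, 200))
# labels: linear model plus Gaussian noise with standard deviation 0.01
Y = t.zeros((120, 1))
for i in range(100 + 20):
    Y[i] = true_b + (X[i] * true_w).sum() + t.normal(0, 0.01, (1,))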


def data_iter(batch_size: int, features: t.Tensor, labels: t.Tensor):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = t.tensor(
            indices[i:min(i+batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]



# parameters

In [85]:
# initialize the parameters of the linear model
def init_params():
    w = t.normal(0, 1, size=(200, 1), requires_grad=True)
    b = t.zeros(1, requires_grad=True)
    return [w, b]


# L2 penalty on the weights: ||w||^2 / 2
def l2_penalty(w: t.Tensor):
    return t.sum(w.pow(2)) / 2
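The factor 1/2 in `l2_penalty` is there so that the gradient of the penalty is simply `w`. A term `lambd * l2_penalty(w)` in the loss then adds `lambd * w` to the weight gradient. A minimal check with autograd, using a small hand-picked weight vector chosen only for illustration, might look like this:

```python
import torch


def l2_penalty(w: torch.Tensor) -> torch.Tensor:
    # same definition as above: half the squared L2 norm
    return torch.sum(w.pow(2)) / 2


# illustrative weights, not taken from the notebook
w = torch.tensor([[1.0], [-2.0], [0.5]], requires_grad=True)
penalty = l2_penalty(w)
penalty.backward()

# (1 + 4 + 0.25) / 2
print(penalty.item())  # 2.625
# d/dw (||w||^2 / 2) = w, so the gradient equals the weights themselves
print(torch.allclose(w.grad, w.detach()))  # True
```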

# training function

In [86]:
# minibatch SGD update
def stochastic_gradient_descent(params: list, lr, batch_size):
    with t.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()


# training function; lambd is the weight of the L2 penalty
def train(lambd):
    w, b = init_params()
    for epoch in range(200):
        # first 20 examples for training, remaining 100 for evaluation
        train_iter = data_iter(1, X[:20], Y[:20])
        test_iter = data_iter(1, X[20:], Y[20:])
        total_train_loss = []
        total_eval_loss = []
        for x, y in train_iter:
            loss = t.pow(y - (t.mm(x, w) + b), 2).mean()
            # the reported training loss includes the penalty term
            loss += lambd * l2_penalty(w)
            loss.backward()
            # data_iter yields one example per batch, so dividing by
            # batch_size=5 gives an effective learning rate of 0.003 / 5
            stochastic_gradient_descent([w, b], lr=0.003, batch_size=5)
            total_train_loss.append(loss.item())
        # evaluation needs no gradients; the eval loss excludes the penalty
        with t.no_grad():
            for x, y in test_iter:
                loss = t.pow(y - (t.mm(x, w) + b), 2).mean()
                total_eval_loss.append(loss.item())

        print(
            f"epoch:{epoch} eval_loss = {np.mean(total_eval_loss)},train_loss = {np.mean(total_train_loss)}")

            
        


In [87]:
train(0)


epoch:0 eval_loss = 217.56593918681145,train_loss = 199.9007263660431
epoch:1 eval_loss = 210.50795641635546,train_loss = 116.10138123072684
epoch:2 eval_loss = 205.7486390982784,train_loss = 68.48893291950226
epoch:3 eval_loss = 202.67142190419136,train_loss = 41.09773392267525
epoch:4 eval_loss = 200.64495701316744,train_loss = 25.04750501215458
epoch:5 eval_loss = 199.2305720605579,train_loss = 15.467374025445315
epoch:6 eval_loss = 198.23894260890782,train_loss = 9.619874024682213
epoch:7 eval_loss = 197.5673576692492,train_loss = 6.050757286819862
epoch:8 eval_loss = 197.11165327522903,train_loss = 3.829193403525278
epoch:9 eval_loss = 196.7768082438223,train_loss = 2.4436982403043657
epoch:10 eval_loss = 196.55395402989816,train_loss = 1.5647295889768429
epoch:11 eval_loss = 196.4039196276781,train_loss = 1.0098844512889626
epoch:12 eval_loss = 196.29459929717305,train_loss = 0.6554936545901
epoch:13 eval_loss = 196.2241386250849,train_loss = 0.4273347499314696
epoch:14 eval_loss

In [88]:
train(2)


epoch:0 eval_loss = 147.09832449089737,train_loss = 328.35225677490234
epoch:1 eval_loss = 137.12287073327852,train_loss = 238.98553695678712
epoch:2 eval_loss = 129.2263666125573,train_loss = 189.62548828125
epoch:3 eval_loss = 122.4534746141918,train_loss = 160.8409622192383
epoch:4 eval_loss = 116.3970226067886,train_loss = 142.8178565979004
epoch:5 eval_loss = 110.8314888902381,train_loss = 130.44805145263672
epoch:6 eval_loss = 105.60465847454965,train_loss = 121.17971839904786
epoch:7 eval_loss = 100.65176296964229,train_loss = 113.73150634765625
epoch:8 eval_loss = 95.9527870351635,train_loss = 107.3895320892334
epoch:9 eval_loss = 91.47328376237303,train_loss = 101.76302070617676
epoch:10 eval_loss = 87.20976167194546,train_loss = 96.6377140045166
epoch:11 eval_loss = 83.14427143327892,train_loss = 91.89125556945801
epoch:12 eval_loss = 79.261882258486,train_loss = 87.44974212646484
epoch:13 eval_loss = 75.56377267713658,train_loss = 83.26573486328125
epoch:14 eval_loss = 72.03

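Because weight decay is so common in neural network optimization, deep learning frameworks integrate it into the optimization algorithm itself, so that it can be combined with any loss function. This integration also has a computational benefit: weight decay can be added to the algorithm without any additional computational overhead, because the weight decay part of the update depends only on the current value of each parameter, and the optimizer has to touch each parameter once anyway.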

```python
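import torch
from torch import nn
from d2l import torch as d2l

# Assumption: X and Y are the tensors generated above (first 20 rows for
# training, the rest for testing); the batch size of 5 is illustrative.
num_inputs, batch_size = X.shape[1], 5
train_iter = d2l.load_array((X[:20], Y[:20]), batch_size)
test_iter = d2l.load_array((X[20:], Y[20:]), batch_size, is_train=False)


def train_concise(wd):
    net = nn.Sequential(nn.Linear(num_inputs, 1))
    for param in net.parameters():
        param.data.normal_()
    loss = nn.MSELoss(reduction='none')
    num_epochs, lr = 100, 0.003
    # the bias is not decayed
    trainer = torch.optim.SGD([
        {"params": net[0].weight, 'weight_decay': wd},
        {"params": net[0].bias}], lr=lr)
    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                            xlim=[5, num_epochs], legend=['train', 'test'])
    for epoch in range(num_epochs):
        for X, y in train_iter:
            trainer.zero_grad()
            l = loss(net(X), y)
            l.mean().backward()
            trainer.step()
        # record the train and test loss every 5 epochs
        if (epoch + 1) % 5 == 0:
            animator.add(epoch + 1,
                         (d2l.evaluate_loss(net, train_iter, loss),
                          d2l.evaluate_loss(net, test_iter, loss)))
    print('L2 norm of w:', net[0].weight.norm().item())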
```
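For plain SGD, the `weight_decay` argument of `torch.optim.SGD` should give the same update as adding the penalty `wd * ||w||^2 / 2` to the loss by hand. The following sketch checks this for a single step. The toy tensors `X_toy` and `y_toy` and the values of `lr` and `wd` are hypothetical and chosen only for this comparison.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy problem: 8 examples, 4 features
X_toy = torch.randn(8, 4)
y_toy = torch.randn(8, 1)
lr, wd = 0.1, 0.5

# Two copies of the same initial weights
w_explicit = torch.randn(4, 1, requires_grad=True)
w_builtin = w_explicit.detach().clone().requires_grad_(True)

# (a) penalty added to the loss by hand, optimizer without weight decay
opt_a = torch.optim.SGD([w_explicit], lr=lr)
loss_a = ((X_toy @ w_explicit - y_toy) ** 2).mean() \
    + wd * w_explicit.pow(2).sum() / 2
loss_a.backward()
opt_a.step()

# (b) unpenalized loss, decay handled inside the optimizer
opt_b = torch.optim.SGD([w_builtin], lr=lr, weight_decay=wd)
loss_b = ((X_toy @ w_builtin - y_toy) ** 2).mean()
loss_b.backward()
opt_b.step()

# both routes should land on the same weights after one step
print(torch.allclose(w_explicit, w_builtin, atol=1e-6))
```

Note that this equivalence is specific to plain SGD. With adaptive optimizers such as Adam, an L2 term in the loss and a decoupled weight decay generally lead to different updates.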