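Besides weight decay, **dropout** is commonly used to combat overfitting.
Dropout comes in several variants; here the term refers specifically to inverted dropout.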

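When dropout is applied to a hidden layer, every hidden unit in that layer is dropped with some probability. With drop probability $p$, the unit $h_i$ is **zeroed out** with probability $p$, and with probability $1-p$ it is divided by $1-p$, i.e. **scaled up**.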

The drop probability is the hyperparameter of dropout.

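**Property:**
Dropout does not change the expected value of its input. Let the random variable $X_i$ take the values 0 and 1 with probabilities $p$ and $1-p$ respectively. With dropout, the new hidden unit $h_i'$ is computed as: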
$$ h_{i}^{\prime}=\frac{X_{i}}{1-p} h_{i} $$
Since $E(X_i) = 1-p$, we have
$$E\left(h_{i}^{\prime}\right)=\frac{E\left(X_{i}\right)}{1-p} h_{i}=h_{i}$$
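
The expectation argument can be checked numerically. The following is a minimal, library-free sketch rather than part of the original notebook: the helper name `inverted_dropout_scalar` and the values of `p`, `h`, and `n` are illustrative assumptions. It draws many independent dropout outcomes for a single activation and compares their average with the original value.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

p = 0.3        # drop probability (illustrative value)
h = 2.0        # one hidden activation h_i (illustrative value)
n = 200_000    # number of independent dropout draws

def inverted_dropout_scalar(h, p):
    """Return h / (1 - p) with probability 1 - p, and 0 with probability p."""
    keep = random.random() < 1 - p   # keep == True corresponds to X_i = 1
    return h / (1 - p) if keep else 0.0

samples = [inverted_dropout_scalar(h, p) for _ in range(n)]
empirical_mean = sum(samples) / n

print(f"h = {h}, empirical mean of h' = {empirical_mean:.4f}")
```

The empirical mean should land close to `h`, although any single draw is either 0 or `h / (1 - p)`.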


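**How it works:** With dropout applied to a hidden layer, if $h_i$ is dropped, then during backpropagation the gradients of the weights connected to that $h_i$ are all 0 for that forward pass. Because any hidden unit may be dropped, the output layer cannot rely too heavily on any single one of them. This acts as regularization during training and helps counter overfitting.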
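
The zero-gradient claim can be observed on a toy network. This is a sketch under stated assumptions, not the notebook's model: the layer sizes, the single input sample, and the hand-picked mask (hidden unit 1 dropped) are all chosen purely for illustration.

```python
import torch

torch.manual_seed(0)  # fixed seed so the run is repeatable

x = torch.randn(1, 4)                       # one input sample with 4 features
W1 = torch.randn(4, 3, requires_grad=True)  # input -> hidden (3 hidden units)
W2 = torch.randn(3, 2, requires_grad=True)  # hidden -> output (2 outputs)

p = 0.5
# Hand-picked mask for illustration: hidden unit 1 is dropped, units 0 and 2 are kept
mask = torch.tensor([[1.0, 0.0, 1.0]])

h = torch.matmul(x, W1).relu()
h_drop = mask * h / (1 - p)                 # inverted dropout with the fixed mask
out = torch.matmul(h_drop, W2)
out.sum().backward()

print(W1.grad[:, 1])  # gradients of the weights feeding into the dropped unit
print(W2.grad[1, :])  # gradients of the weights leaving the dropped unit
```

Both printed slices are zero. With a mini-batch and a freshly sampled mask per example, a unit is usually dropped for only some of the examples, so its accumulated gradient is generally nonzero.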

**Note:** When testing the model, dropout is generally not applied, so that the results are deterministic.

# Implementation

In [40]:
%matplotlib inline
import torch
import torch.nn as nn
import numpy as np
import sys
sys.path.append("..") 
import d2lzh_pytorch as d2l
print(sys.path)

['/Users/luowei/PycharmProjects/动手学深度学习pyTorch版本code', '/opt/anaconda3/lib/python38.zip', '/opt/anaconda3/lib/python3.8', '/opt/anaconda3/lib/python3.8/lib-dynload', '', '/opt/anaconda3/lib/python3.8/site-packages', '/opt/anaconda3/lib/python3.8/site-packages/aeosa', '/opt/anaconda3/lib/python3.8/site-packages/IPython/extensions', '/Users/luowei/.ipython', '..', '..', '..']


## Defining dropout

In [4]:
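def dropout(X,drop_prob):
    X = X.float()
    assert 0 <= drop_prob <= 1
    keep_prob = 1 - drop_prob
    # If every element is dropped, return zeros directly to avoid dividing by keep_prob = 0
    if keep_prob == 0:
        return torch.zeros_like(X)
    
    # Elements whose random draw is >= keep_prob are zeroed out
    mask = (torch.rand(X.shape) < keep_prob).float() 
    
    # Scale the surviving elements by 1 / keep_prob (inverted dropout)
    return mask * X / keep_prob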

In [7]:
X = torch.arange(16).view(2,8)
print(dropout(X,0))
print(dropout(X,0.2))

tensor([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11., 12., 13., 14., 15.]])
tensor([[ 0.0000,  1.2500,  2.5000,  3.7500,  0.0000,  6.2500,  7.5000,  8.7500],
        [10.0000, 11.2500, 12.5000, 13.7500, 15.0000, 16.2500, 17.5000,  0.0000]])


## Defining the model parameters
We define a multilayer perceptron with two hidden layers, each of which has 256 outputs.

In [21]:
num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256

W1 = torch.tensor(np.random.normal(0,0.01,size=(num_inputs,num_hiddens1)),dtype = torch.float,requires_grad=True)
b1 = torch.zeros(num_hiddens1,requires_grad=True)
W2 = torch.tensor(np.random.normal(0,0.01,size=(num_hiddens1,num_hiddens2)),dtype = torch.float,requires_grad=True)
b2 = torch.zeros(num_hiddens2,requires_grad=True)
W3 = torch.tensor(np.random.normal(0,0.01,size=(num_hiddens2,num_outputs)),dtype = torch.float,requires_grad=True)
b3 = torch.zeros(num_outputs,requires_grad=True)

params = [W1,b1,W2,b2,W3,b3]

## Defining the model

In [29]:
drop_prob1=0.2
drop_prob2=0.5

def net(X,is_training=True):
    X = X.view(-1,num_inputs)
    H1 = (torch.matmul(X,W1) + b1).relu()
    if is_training:
        H1 = dropout(H1,drop_prob1) # add a dropout layer after the first fully connected layer
    H2 = (torch.matmul(H1,W2) + b2).relu()
    if is_training:
        H2 = dropout(H2,drop_prob2)
    output = (torch.matmul(H2,W3) + b3)
    return output

In [27]:
# This function needs to be written back into d2lzh_pytorch, followed by a restart of the Jupyter kernel
def evaluate_accuracy(data_iter,net):
    acc_sum,n = 0.0,0 # number of correct predictions, total number of samples
    for X,y in data_iter:
        if isinstance(net,torch.nn.Module):
            net.eval() # evaluation mode, which turns dropout off
            acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
            net.train() # switch back to training mode
        else: # a custom model defined as a plain function
            if('is_training' in net.__code__.co_varnames):
                # the function takes an is_training argument, so set it to False
                acc_sum += (net(X,is_training=False).argmax(dim=1) == y).float().sum().item()
            else:
                acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
        n+=y.shape[0]
    return acc_sum / n

## Training and testing the model

In [30]:
num_epochs,lr,batch_size = 5,100.0,256
loss=torch.nn.CrossEntropyLoss()
train_iter,test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)
d2l.train_ch3(net,train_iter,test_iter,loss,num_epochs,batch_size,params,lr)

epoch 1, loss 0.0045, train acc 0.552, test acc 0.759
epoch 2, loss 0.0023, train acc 0.783, test acc 0.800
epoch 3, loss 0.0019, train acc 0.822, test acc 0.830
epoch 4, loss 0.0017, train acc 0.840, test acc 0.831
epoch 5, loss 0.0016, train acc 0.848, test acc 0.840


# Concise implementation

In [37]:
net = nn.Sequential(
    d2l.FlattenLayer(),
    nn.Linear(num_inputs,num_hiddens1),
    nn.ReLU(),
    nn.Dropout(drop_prob1),
    nn.Linear(num_hiddens1,num_hiddens2),
    nn.ReLU(),
    nn.Dropout(drop_prob2),
    nn.Linear(num_hiddens2,10)
)
for param in net.parameters():
    nn.init.normal_(param,mean=0,std=0.01)

In [41]:
optimizer = torch.optim.SGD(net.parameters(),lr = 0.5)
d2l.train_ch3(net,train_iter,test_iter,loss,num_epochs,batch_size,params=None,lr=None,optimizer=optimizer,)

epoch 1, loss 0.0015, train acc 0.860, test acc 0.851
epoch 2, loss 0.0015, train acc 0.861, test acc 0.856
epoch 3, loss 0.0014, train acc 0.867, test acc 0.852
epoch 4, loss 0.0014, train acc 0.870, test acc 0.853
epoch 5, loss 0.0013, train acc 0.873, test acc 0.837
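
The `nn.Dropout` layers used in the concise implementation are active only in training mode, which is why `evaluate_accuracy` calls `net.eval()` before computing predictions. The sketch below illustrates the difference on a standalone layer; the all-ones input tensor and `p=0.5` are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable

layer = nn.Dropout(p=0.5)
x = torch.ones(2, 8)

layer.train()          # training mode: elements are zeroed at random, survivors scaled by 1/(1-p)
y_train = layer(x)

layer.eval()           # evaluation mode: the layer passes its input through unchanged
y_eval = layer(x)

print(y_train)         # every entry is either 0 or 2
print(y_eval)          # identical to x
```

In training mode each entry is either 0 or 1/(1-p) = 2, while in evaluation mode the output equals the input, matching the note that dropout is not applied at test time.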
