### Dropout:
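    During forward propagation, dropout injects noise while each hidden layer is computed; on the surface, this looks like dropping out some neurons during training.
    At every training iteration, standard dropout zeroes out some of the nodes in the current layer before the next layer is computed,
        that is, each hidden unit of the layer is dropped with a certain probability, and this drop probability is a hyperparameter.
    
    Let the drop probability be p: each hidden activation h is set to zero with probability p, and with probability 1-p it is divided by 1-p (rescaled).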
![dropout](./img/3.9/dropout.png)
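
The rule described above can be written out as a formula. The symbol $h'$ for the activation after dropout is notation introduced here for illustration.

```latex
h' =
\begin{cases}
0 & \text{with probability } p, \\
\dfrac{h}{1-p} & \text{with probability } 1-p,
\end{cases}
\qquad
\mathbb{E}[h'] = p \cdot 0 + (1-p) \cdot \frac{h}{1-p} = h .
```

Dividing the kept activations by $1-p$ leaves the expected value unchanged, which is why no extra rescaling is needed when dropout is switched off at test time.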

    Summary:
        Dropout can be used to mitigate overfitting.
        Dropout is applied only while training the model (model.train()).
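
The training-only behaviour can be observed directly on a single `torch.nn.Dropout` layer. This is a minimal sketch; the names `layer`, `train_out` and `eval_out` and the fixed seed are assumptions made for the example.

```python
import torch

torch.manual_seed(0)  # fixed seed only so the sketch is repeatable

layer = torch.nn.Dropout(p=0.5)
x = torch.ones(2, 8)

layer.train()          # training mode: each element is zeroed or scaled by 1/(1-p)
train_out = layer(x)

layer.eval()           # evaluation mode: the layer passes its input through unchanged
eval_out = layer(x)

print(train_out)
print(eval_out)
```

In training mode every entry is either 0 or 2 (that is, 1 / (1 - 0.5)); in evaluation mode the output equals the input.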

### Implementing Dropout
    Drop each element of X with probability drop_prob

In [22]:
import numpy as np
import torch
import torchvision
import matplotlib.pyplot as plt
import sys

In [23]:
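def dropout(x, drop_prob):
    x = x.float()
    assert 0 <= drop_prob <= 1  # raises AssertionError when the condition is False
    keep_prob = 1 - drop_prob
    if drop_prob == 1:  # every element is dropped
        return torch.zeros_like(x)
    if drop_prob == 0:  # every element is kept
        return x

    # Keep each element with probability keep_prob (mask is 1 where kept, 0 where dropped)
    mask = (torch.rand(x.shape, device=x.device) < keep_prob).float()

    # Rescale the kept elements by 1 / keep_prob so the expected value is unchanged
    return mask * x / keep_prob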

In [24]:
x = torch.arange(16).view(2, 8)
x,dropout(x,0),dropout(x,0.5),dropout(x,1)

(tensor([[ 0,  1,  2,  3,  4,  5,  6,  7],
         [ 8,  9, 10, 11, 12, 13, 14, 15]]),
 tensor([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11., 12., 13., 14., 15.]]),
 tensor([[ 0.,  2.,  0.,  6.,  0.,  0., 12.,  0.],
         [16., 18.,  0.,  0.,  0.,  0., 28., 30.]]),
 tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0.]]))
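
The output for a drop probability of 0.5 shows kept entries doubled and the rest zeroed. A quick empirical check on a large tensor suggests that about half of the entries survive and that the mean stays close to its original value. This is a sketch: `dropout_sketch` restates the same rule only so that the block runs on its own, and the tensor size and seed are arbitrary choices.

```python
import torch

def dropout_sketch(x, drop_prob):
    # Same rule as the dropout function above, restated so this block is self-contained.
    x = x.float()
    assert 0 <= drop_prob <= 1
    if drop_prob == 1:
        return torch.zeros_like(x)
    keep_prob = 1 - drop_prob
    mask = (torch.rand(x.shape) < keep_prob).float()
    return mask * x / keep_prob

torch.manual_seed(0)  # arbitrary seed, for repeatability only
x = torch.ones(1000, 1000)
y = dropout_sketch(x, 0.5)

kept_fraction = (y != 0).float().mean().item()  # expected to be close to keep_prob = 0.5
mean_value = y.mean().item()                    # expected to be close to the input mean, 1.0
print(kept_fraction, mean_value)
```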

#### Defining the model parameters
    Using the Fashion-MNIST dataset, define a multilayer perceptron with two hidden layers, each with 256 outputs

In [25]:
num_inputs,num_hidden_1,num_hidden_2,num_outputs = 28 * 28,256,256,10
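
To get a feel for the size of this network, the number of learnable parameters can be worked out from the four sizes alone. This is a small arithmetic sketch; `layer_sizes` and `num_params` are names introduced here.

```python
num_inputs, num_hidden_1, num_hidden_2, num_outputs = 28 * 28, 256, 256, 10

# Each fully connected layer has in_features * out_features weights plus out_features biases.
layer_sizes = [
    (num_inputs, num_hidden_1),
    (num_hidden_1, num_hidden_2),
    (num_hidden_2, num_outputs),
]
num_params = sum(n_in * n_out + n_out for n_in, n_out in layer_sizes)
print(num_params)  # 269322
```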

#### Defining the model

In [37]:
drop_prop_1,drop_prop_2 = 0.2,0.5

class Net(torch.nn.Module):
    def __init__(self,num_inputs,num_hidden_1,num_hidden_2,num_outputs):
        super(Net,self).__init__()
        self.linear_1 = torch.nn.Linear(num_inputs,num_hidden_1)
        self.linear_2 = torch.nn.Linear(num_hidden_1,num_hidden_2)
        self.linear_3 = torch.nn.Linear(num_hidden_2,num_outputs)
        self.relu = torch.nn.ReLU()
        
    def forward(self,x,drop_prop_1,drop_prop_2,is_training=True):
        x = x.view(-1,num_inputs)
        H1 = self.relu(self.linear_1(x))
        
        # Apply dropout only while training the model
        if is_training:
            H1 = dropout(H1,drop_prop_1)
            
        H2 = self.relu(self.linear_2(H1))
        if is_training:
            H2 = dropout(H2,drop_prop_2)
        out = self.linear_3(H2)
        return out

In [38]:
net = Net(num_inputs,num_hidden_1,num_hidden_2,num_outputs)
net

Net(
  (linear_1): Linear(in_features=784, out_features=256, bias=True)
  (linear_2): Linear(in_features=256, out_features=256, bias=True)
  (linear_3): Linear(in_features=256, out_features=10, bias=True)
  (relu): ReLU()
)
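
An alternative to passing `is_training` by hand is to rely on the `training` attribute that every `torch.nn.Module` already carries and that `train()` / `eval()` toggle. The sketch below is one possible variant, not the notebook's model: the class name `NetWithModeFlag`, the constructor arguments `drop_prob_1` / `drop_prob_2`, and the use of `torch.nn.functional.dropout` in place of the hand-written function are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

class NetWithModeFlag(torch.nn.Module):  # hypothetical name, not from the notebook
    def __init__(self, num_inputs, num_hidden_1, num_hidden_2, num_outputs,
                 drop_prob_1=0.2, drop_prob_2=0.5):
        super().__init__()
        self.num_inputs = num_inputs
        self.drop_prob_1 = drop_prob_1
        self.drop_prob_2 = drop_prob_2
        self.linear_1 = torch.nn.Linear(num_inputs, num_hidden_1)
        self.linear_2 = torch.nn.Linear(num_hidden_1, num_hidden_2)
        self.linear_3 = torch.nn.Linear(num_hidden_2, num_outputs)

    def forward(self, x):
        x = x.view(-1, self.num_inputs)
        # F.dropout only drops elements when training=True;
        # self.training is switched by net.train() and net.eval().
        h1 = F.dropout(F.relu(self.linear_1(x)), self.drop_prob_1, training=self.training)
        h2 = F.dropout(F.relu(self.linear_2(h1)), self.drop_prob_2, training=self.training)
        return self.linear_3(h2)

sketch_net = NetWithModeFlag(28 * 28, 256, 256, 10)
batch = torch.rand(4, 1, 28, 28)  # synthetic batch shaped like Fashion-MNIST images

sketch_net.eval()  # dropout disabled, so repeated forward passes agree
out_a, out_b = sketch_net(batch), sketch_net(batch)
print(out_a.shape, torch.equal(out_a, out_b))
```

With this design the forward signature stays `net(x)`, and the mode is controlled from outside the model rather than through an extra argument.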

#### Training and testing

In [39]:
num_epochs,lr,batch_size = 10,0.5,256
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(),lr)
train_data = torchvision.datasets.FashionMNIST('./data/FashionMNIST',train=True,transform=torchvision.transforms.ToTensor())
test_data = torchvision.datasets.FashionMNIST('./data/FashionMNIST',train=False,transform=torchvision.transforms.ToTensor())

train_iter = torch.utils.data.DataLoader(train_data,batch_size,shuffle=True)
test_iter = torch.utils.data.DataLoader(test_data,batch_size,shuffle=False)

In [40]:
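def evaluate_acc(test_iter,net,is_training):
    test_acc,test_n = 0.0,0
    with torch.no_grad():  # gradients are not needed for evaluation
        for x,y in test_iter:
            # The drop probabilities are unused when is_training is False, so None is passed;
            # calling this with is_training=True would fail inside dropout()
            y_hat = net(x,None,None,is_training)

            test_acc += (y_hat.argmax(dim=1) == y).float().sum().item()
            test_n += y.shape[0]
    
    return test_acc / test_n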

In [41]:
def train(net,train_iter,test_iter,loss,num_epochs,optimizer,drop_prop_1,drop_prop_2):
    for epoch in range(num_epochs):
        train_acc,train_ls,train_n = 0.0,0.0,0
        for x,y in train_iter:
            y_hat = net(x,drop_prop_1,drop_prop_2,True)
            l = loss(y_hat,y)  # use the loss function passed in instead of shadowing it
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            
            train_acc += (y_hat.argmax(dim=1) == y).float().sum().item()
            train_ls += l.item()  # l is already the mean loss over the batch
            train_n += y.shape[0]
        
        test_acc = evaluate_acc(test_iter,net,is_training=False)
        # train_ls sums per-batch mean losses, so train_ls / train_n is roughly the per-sample loss divided by batch_size
        print('epoch: %d train loss: %.4f train acc: %.3f test acc: %.3f'%(epoch+1,train_ls / train_n
                                                                           ,train_acc / train_n,test_acc))

In [42]:
train(net,train_iter,test_iter,loss_fn,num_epochs,optimizer,drop_prop_1,drop_prop_2)

epoch: 1 train loss: 0.0034 train acc: 0.673 test acc: 0.754
epoch: 2 train loss: 0.0021 train acc: 0.802 test acc: 0.796
epoch: 3 train loss: 0.0018 train acc: 0.832 test acc: 0.837
epoch: 4 train loss: 0.0017 train acc: 0.843 test acc: 0.801
epoch: 5 train loss: 0.0016 train acc: 0.851 test acc: 0.843
epoch: 6 train loss: 0.0015 train acc: 0.862 test acc: 0.815
epoch: 7 train loss: 0.0014 train acc: 0.865 test acc: 0.844
epoch: 8 train loss: 0.0014 train acc: 0.869 test acc: 0.854
epoch: 9 train loss: 0.0013 train acc: 0.874 test acc: 0.856
epoch: 10 train loss: 0.0013 train acc: 0.876 test acc: 0.870
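
The printed train loss is small because `torch.nn.CrossEntropyLoss` returns the mean over the batch by default (`reduction='mean'`), while the loop sums these batch means and divides by the number of samples. The sketch below compares the two normalisations on synthetic logits and labels. All names in it are assumptions for illustration, and with real data the relation is only approximate because the last batch may be smaller.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, for repeatability only
loss_fn = torch.nn.CrossEntropyLoss()  # default reduction='mean'
batch_size, num_batches, num_classes = 256, 4, 10

sum_of_batch_means, weighted_sum, n = 0.0, 0.0, 0
for _ in range(num_batches):
    logits = torch.randn(batch_size, num_classes)          # synthetic stand-in for net(x)
    labels = torch.randint(0, num_classes, (batch_size,))  # synthetic labels
    batch_mean = loss_fn(logits, labels).item()
    sum_of_batch_means += batch_mean
    weighted_sum += batch_mean * batch_size
    n += batch_size

as_printed = sum_of_batch_means / n  # the quantity the training loop reports
per_sample = weighted_sum / n        # average loss per sample
print(as_printed, per_sample)
```

With equal-sized batches, `as_printed` equals `per_sample / batch_size`, so multiplying the reported value by the batch size gives an estimate of the average per-sample loss.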


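### Concise implementation:
    Add a Dropout layer after each hidden fully connected layer (following its activation) and specify the drop probability.
    During training, the Dropout layer randomly drops elements of the previous layer's output with the specified probability.
    During testing (that is, after model.eval()), the Dropout layer has no effect.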

In [43]:
net = torch.nn.Sequential(
    torch.nn.Linear(num_inputs,num_hidden_1),
    torch.nn.ReLU(),
    torch.nn.Dropout(drop_prop_1),
    torch.nn.Linear(num_hidden_1,num_hidden_2),
    torch.nn.ReLU(),
    torch.nn.Dropout(drop_prop_2),
    torch.nn.Linear(num_hidden_2,num_outputs)
)

In [44]:
net

Sequential(
  (0): Linear(in_features=784, out_features=256, bias=True)
  (1): ReLU()
  (2): Dropout(p=0.2, inplace=False)
  (3): Linear(in_features=256, out_features=256, bias=True)
  (4): ReLU()
  (5): Dropout(p=0.5, inplace=False)
  (6): Linear(in_features=256, out_features=10, bias=True)
)
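
Because this `Sequential` model starts with a `Linear` layer, the image batches have to be flattened before they are passed in. One possible variant, sketched below with the hypothetical name `flat_net`, puts a `torch.nn.Flatten` layer at the front so that image-shaped batches can be fed directly. The layer sizes and drop probabilities are copied from the values used above.

```python
import torch

flat_net = torch.nn.Sequential(  # hypothetical variant, not the notebook's net
    torch.nn.Flatten(),          # (N, 1, 28, 28) -> (N, 784)
    torch.nn.Linear(28 * 28, 256),
    torch.nn.ReLU(),
    torch.nn.Dropout(0.2),
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Dropout(0.5),
    torch.nn.Linear(256, 10),
)

images = torch.rand(4, 1, 28, 28)  # synthetic batch shaped like Fashion-MNIST images
flat_net.eval()                    # disable dropout for this shape check
with torch.no_grad():
    logits = flat_net(images)
print(logits.shape)  # torch.Size([4, 10])
```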

In [52]:
optimizer = torch.optim.SGD(net.parameters(),lr=0.5)

In [57]:
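def evaluate_acc(test_iter,net):
    test_acc,test_n = 0.0,0
    was_training = net.training
    net.eval()  # switch to evaluation mode so the Dropout layers are disabled
    with torch.no_grad():  # gradients are not needed for evaluation
        for x,y in test_iter:
            x = x.view(-1,num_inputs)
            y_hat = net(x)

            test_acc += (y_hat.argmax(dim=1) == y).float().sum().item()
            test_n += y.shape[0]
    net.train(was_training)  # restore the mode the model was in before evaluation
    
    return test_acc / test_n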

In [64]:
def train(net,train_iter,test_iter,loss,num_epochs,optimizer):
    for epoch in range(num_epochs):
        net.train()  # make sure the Dropout layers are active while training
        train_acc,train_ls,train_n = 0.0,0.0,0
        for x,y in train_iter:
            x = x.view(-1,num_inputs)
            y_hat = net(x)
            l = loss(y_hat,y)  # use the loss function passed in instead of shadowing it
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            
            train_acc += (y_hat.argmax(dim=1) == y).float().sum().item()
            train_ls += l.item()  # l is already the mean loss over the batch
            train_n += y.shape[0]
        
        test_acc = evaluate_acc(test_iter,net)
        # train_ls sums per-batch mean losses, so train_ls / train_n is roughly the per-sample loss divided by batch_size
        print('epoch: %d train loss: %.4f train acc: %.3f test acc: %.3f'%(epoch+1,train_ls / train_n
                                                                           ,train_acc / train_n,test_acc))

In [65]:
train(net,train_iter,test_iter,loss_fn,num_epochs,optimizer)

epoch: 1 train loss: 0.0011 train acc: 0.894 test acc: 0.811
epoch: 2 train loss: 0.0011 train acc: 0.897 test acc: 0.849
epoch: 3 train loss: 0.0010 train acc: 0.901 test acc: 0.873
epoch: 4 train loss: 0.0010 train acc: 0.902 test acc: 0.872
epoch: 5 train loss: 0.0010 train acc: 0.905 test acc: 0.882
epoch: 6 train loss: 0.0010 train acc: 0.908 test acc: 0.855
epoch: 7 train loss: 0.0009 train acc: 0.909 test acc: 0.866
epoch: 8 train loss: 0.0009 train acc: 0.913 test acc: 0.875
epoch: 9 train loss: 0.0009 train acc: 0.913 test acc: 0.861
epoch: 10 train loss: 0.0009 train acc: 0.915 test acc: 0.889
