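## 3.13.1 Method
- The core idea of dropout is to introduce a drop probability $p$ and randomly discard some hidden units.
- Each hidden unit $h_i$ is replaced by a new hidden unit $h_i^{'}$, where $\xi_i$ is a random variable that equals 0 with probability $p$ and 1 with probability $1-p$: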
    $$
        h_{i}^{'} = \frac{\xi_i}{1-p}h_i
    $$
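
The division by $1-p$ is what keeps the expected value of each hidden unit unchanged. A short derivation, assuming (as is standard for dropout) that $\xi_i$ is 0 with probability $p$ and 1 with probability $1-p$, so that $E(\xi_i) = 1-p$:

$$
    E(h_{i}^{'}) = \frac{E(\xi_i)}{1-p}h_i = \frac{1-p}{1-p}h_i = h_i
$$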

## 3.13.2 Implementation from Scratch

In [7]:
import sys
sys.path.append("../")
import d2lzh as d2l
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import loss as gloss, nn

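# Arguments: the input features X and the drop probability
def dropout(X, drop_prob):
    # Check that the drop probability is valid
    assert 0 <= drop_prob <= 1

    keep_prob = 1 - drop_prob

    # keep_prob == 0 means every hidden unit is dropped
    if keep_prob == 0:
        return X.zeros_like()

    # nd.random.uniform samples from the uniform distribution on [0, 1),
    # so each element is kept with probability keep_prob
    mask = nd.random.uniform(0, 1, X.shape) < keep_prob
    # Rescale the kept elements by 1 / keep_prob
    return mask * X / keep_prob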
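
The rescaling by `1 / keep_prob` can be checked empirically without MXNet. The following is a minimal sketch in plain Python, not the notebook's implementation: `dropout_list` is a hypothetical stand-in that mirrors the logic of `dropout` on a list of floats, and the seed and sample size are arbitrary choices for illustration.

```python
import random

def dropout_list(xs, drop_prob, rng):
    """Inverted dropout on a plain list of floats (hypothetical stand-in for the NDArray version)."""
    assert 0 <= drop_prob <= 1
    keep_prob = 1 - drop_prob
    if keep_prob == 0:
        return [0.0 for _ in xs]
    # rng.random() is uniform on [0, 1), so each element is kept with probability keep_prob
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in xs]

rng = random.Random(0)           # fixed seed, chosen arbitrarily
xs = [1.0] * 100_000             # constant input makes the mean easy to interpret
out = dropout_list(xs, 0.5, rng)

mean_out = sum(out) / len(out)                              # should stay close to 1.0
kept_fraction = sum(1 for v in out if v != 0.0) / len(out)  # should be close to 0.5
print(mean_out, kept_fraction)
```

About half of the elements are zeroed and the survivors are doubled, so the average of the output stays near the average of the input.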

In [8]:
X = nd.arange(16).reshape(2,8)
dropout(X, 0)


[[ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 8.  9. 10. 11. 12. 13. 14. 15.]]
<NDArray 2x8 @cpu(0)>

In [10]:
dropout(X, 0.5)


[[ 0.  0.  0.  0.  8.  0.  0.  0.]
 [16.  0.  0.  0.  0. 26. 28. 30.]]
<NDArray 2x8 @cpu(0)>

In [11]:
dropout(X, 1)


[[0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]
<NDArray 2x8 @cpu(0)>

### Defining Model Parameters

In [14]:
num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256

W1 = nd.random.normal(scale=0.01, shape=(num_inputs, num_hiddens1))
b1 = nd.zeros(num_hiddens1)
W2 = nd.random.normal(scale=0.01, shape=(num_hiddens1, num_hiddens2))
b2 = nd.zeros(num_hiddens2)
W3 = nd.random.normal(scale=0.01, shape=(num_hiddens2, num_outputs))
b3 = nd.zeros(num_outputs)

params = [W1, b1, W2, b2, W3, b3]
for param in params:
    param.attach_grad()

### Defining the Model

In [15]:
# Drop probabilities for the two hidden layers
drop_prob1, drop_prob2 = 0.2, 0.5

def net(X):
    # Flatten each image into a vector of length num_inputs
    X = X.reshape((-1, num_inputs))
    H1 = (nd.dot(X, W1) + b1).relu()

    # Apply dropout only when training the model
    if autograd.is_training():
        H1 = dropout(H1, drop_prob1)

    H2 = (nd.dot(H1, W2) + b2).relu()

    # Apply dropout only when training the model
    if autograd.is_training():
        H2 = dropout(H2, drop_prob2)

    return nd.dot(H2, W3) + b3
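
The `autograd.is_training()` check above is what switches dropout off at test time; in MXNet it returns `True` inside an `autograd.record()` scope by default. The same gating can be sketched in plain Python with an explicit flag. This is only an illustration: `forward`, `dropout_list`, and the `training` argument are hypothetical names, not part of MXNet or `d2lzh`.

```python
import random

def dropout_list(h, drop_prob, rng):
    """Inverted dropout on a list of floats (hypothetical helper)."""
    keep_prob = 1 - drop_prob
    if keep_prob == 0:
        return [0.0] * len(h)
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in h]

def forward(h, drop_prob, training, rng):
    """Apply dropout to the activations h only when training is True."""
    if training:
        return dropout_list(h, drop_prob, rng)
    # Evaluation mode: activations pass through unchanged
    return list(h)

rng = random.Random(42)  # arbitrary seed
h = [0.5, 1.0, 1.5, 2.0]

eval_out_1 = forward(h, 0.5, training=False, rng=rng)
eval_out_2 = forward(h, 0.5, training=False, rng=rng)
train_out = forward(h, 0.5, training=True, rng=rng)

print(eval_out_1 == eval_out_2)  # evaluation is deterministic
print(train_out)                 # each entry is either 0.0 or twice the input
```

In evaluation mode the output is deterministic and equal to the input, while in training mode each activation is either zeroed or scaled by `1 / keep_prob`.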

In [17]:
num_epochs, lr, batch_size = 5, 0.5, 256
loss = gloss.SoftmaxCrossEntropyLoss()
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, params, lr)

epoch 1, loss 0.4007, train acc 0.855, test acc 0.871
epoch 2, loss 0.3878, train acc 0.860, test acc 0.868
epoch 3, loss 0.3682, train acc 0.865, test acc 0.867
epoch 4, loss 0.3599, train acc 0.868, test acc 0.874
epoch 5, loss 0.3493, train acc 0.872, test acc 0.878


## 3.13.3 Concise Implementation

In [20]:
net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'),
        nn.Dropout(drop_prob1),
        nn.Dense(256, activation='relu'),
        nn.Dropout(drop_prob2),
        nn.Dense(10)
       )
net.initialize(init.Normal(sigma=0.01))

In [21]:
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None, None, trainer)

epoch 1, loss 1.1212, train acc 0.566, test acc 0.742
epoch 2, loss 0.5743, train acc 0.787, test acc 0.835
epoch 3, loss 0.4859, train acc 0.822, test acc 0.847
epoch 4, loss 0.4408, train acc 0.839, test acc 0.859
epoch 5, loss 0.4130, train acc 0.849, test acc 0.863


### Summary
- Dropout helps mitigate overfitting.
- Dropout is applied only during training; it is not used when evaluating the model.

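### Exercises
- What happens if the two drop-probability hyperparameters are swapped?
    - Judging from the results, the model performs better on the test set.
- Increase the number of epochs and compare the results with and without dropout.
- If the model is made more complex, does dropout counter overfitting more noticeably?
- Using the model in this section as an example, compare the effects of dropout and weight decay. What happens if both methods are used together?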

In [28]:
num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256

W1 = nd.random.normal(scale=0.01, shape=(num_inputs, num_hiddens1))
b1 = nd.zeros(num_hiddens1)
W2 = nd.random.normal(scale=0.01, shape=(num_hiddens1, num_hiddens2))
b2 = nd.zeros(num_hiddens2)
W3 = nd.random.normal(scale=0.01, shape=(num_hiddens2, num_outputs))
b3 = nd.zeros(num_outputs)

params = [W1, b1, W2, b2, W3, b3]
for param in params:
    param.attach_grad()
    
    
drop_prob1, drop_prob2 = 0.2, 0.5

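def net(X):
    # Flatten each image into a vector of length num_inputs
    X = X.reshape((-1, num_inputs))
    H1 = (nd.dot(X, W1) + b1).relu()

    # Dropout is commented out here to train without it for comparison
    #if autograd.is_training():
    #    H1 = dropout(H1, drop_prob1)

    H2 = (nd.dot(H1, W2) + b2).relu()

    # Dropout is commented out here to train without it for comparison
    #if autograd.is_training():
    #    H2 = dropout(H2, drop_prob2)

    return nd.dot(H2, W3) + b3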

num_epochs, lr, batch_size = 10, 0.5, 256
loss = gloss.SoftmaxCrossEntropyLoss()
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, params, lr)

epoch 1, loss 1.1313, train acc 0.570, test acc 0.769
epoch 2, loss 0.5564, train acc 0.790, test acc 0.837
epoch 3, loss 0.4624, train acc 0.829, test acc 0.847
epoch 4, loss 0.4565, train acc 0.834, test acc 0.852
epoch 5, loss 0.3935, train acc 0.854, test acc 0.864
epoch 6, loss 0.3755, train acc 0.862, test acc 0.868
epoch 7, loss 0.3529, train acc 0.870, test acc 0.871
epoch 8, loss 0.3380, train acc 0.873, test acc 0.874
epoch 9, loss 0.3248, train acc 0.880, test acc 0.877
epoch 10, loss 0.3158, train acc 0.882, test acc 0.878
