# 从零重新看待多层感知机

多层感知机的基本原理：

$$h_1 = \phi(W_1\boldsymbol{x} + b_1)$$

$$h_2 = \phi(W_2\boldsymbol{h_1} + b_2)$$

$$...$$

$$h_n = \phi(W_n\boldsymbol{h_{n-1}} + b_n)$$

$$\hat{y} = \mbox{softmax}(W_y \boldsymbol{h}_n + b_y)$$

## <font color="blue">**使用softmax和交叉熵混合的原因**

首先我们回顾softmax的公式：

$$\hat y_j = \frac{e^{z_j}}{\sum_{i=1}^{n} e^{z_i}}$$

其中$\hat y_j$是$cross\_entropy$函数的输入$yhat$的第$j-th$个元素，$z_j$是softmax函数中输入$y\_linear$的第$j-th$个元素。

有种情况是当我们的$z_j$很大时，$e^{z_j}$的结果会很大从而导致上溢，这时候分子会接近``inf``从而我们的结果$\hat y_j$就会不确定(``inf``, ``nan``, 0)，这时候就会产生数值不稳定性，因此我们先对每个$z_j$首先减去其最大值，这样就不会出现正无穷的情况，我们可以证明，<font color="red">**减去最大值并不影响$softmax$结果的输出</font>。**

进一步我们考虑到，在上面减去最大值后,$z_j$同样可能会趋近于负无穷，那么此时$e^{z_j}$就会趋近于0，此时同样产生数值的不稳定性，例如当$\hat y_j$趋近于0时，接下来的交叉熵步骤$log(\hat y_j)$就会趋近于负无穷，那么我们利用这样的负无穷的loss反向传播时，其结果将会非常可怕。

因此我们可以巧妙的将$softmax$和$cross\_entropy$的步骤结合起来，来避开先求导然后求对数的过程，这样可以避开这样的数值不稳定性的发生，公式推导如下：

$$\text{log}{(\hat y_j)} = \text{log}\left( \frac{e^{z_j}}{\sum_{i=1}^{n} e^{z_i}}\right) = \text{log}{(e^{z_j})}-\text{log}{\left( \sum_{i=1}^{n} e^{z_i} \right)} = z_j -\text{log}{\left( \sum_{i=1}^{n} e^{z_i} \right)}$$

因此我们正确的做法是，写出$softmax$函数只用于计算其概率，但我们不将$softmax$输出的概率传入我们的损失函数中，而是直接将$softmax$的输入$y
\_linear$传入损失函数中，这样可以使得数值更具稳定性。

**下面我们来实现本章的算法**

In [1]:
import mxnet as mx
import numpy as np

from mxnet import nd
from mxnet import autograd
from mxnet import gluon

import utils

In [2]:
ctx = mx.cpu()

In [3]:
batch_size = 128
train_data, test_data = utils.load_dataset(batch_size, data_type='mnist')

## 随机初始化参数

In [4]:
num_examples = 60000
num_inputs = 784
num_outputs = 10
num_hidden = 256
weight_scale = .01

W1 = nd.random.normal(shape=(num_inputs, num_hidden), scale=weight_scale)
b1 = nd.random.normal(shape=(num_hidden,))

W2 = nd.random.normal(shape=(num_hidden, num_hidden), scale=weight_scale)
b2 = nd.random.normal(shape=(num_hidden,))

W3 = nd.random.normal(shape=(num_hidden, num_outputs), scale=weight_scale)
b3 = nd.random.normal(shape=(num_outputs,))

params = [W1, b1, W2, b2, W3, b3]
for param in params:
    param.attach_grad()

In [5]:
for data, label in train_data:
    print(data.shape)
    break  

(128, 1, 28, 28)


## 定义$softmax$交叉熵损失函数和优化器

In [6]:
def softmax(y_linear):
    yexp = nd.exp(y_linear - nd.max(y_linear))
    return yexp / nd.nansum(yexp, axis=1, exclude=True)

def softmax_cross_entropy(yhat, y):
    return - nd.nansum(y * nd.log(softmax(yhat)), axis=1, exclude=True)

def SGD(params, lr):
    for param in params:
        param[:] = param - lr * param.grad

## 定义模型

In [7]:
def relu(y_linear):
    return nd.maximum(y_linear, nd.zeros_like(y_linear))
    
def net(X):
    X = X.reshape((-1, 784))
    # first layer
    h1 = nd.dot(X, W1) + b1
    h1_relu = relu(h1)
    
    #second layer 
    h2 = nd.dot(h1_relu, W2) + b2
    h2_relu = relu(h2)
    
    #output layer 
    output = nd.dot(h2_relu, W3) + b3
    return output

## 定义评估函数

In [None]:
utils.evaluate_accuracy_scratch(train_data, net, ctx)

0.10441667

## 训练

In [None]:
epochs = 10
learning_rate = 0.01

for epoch in range(epochs):
    cumulative_loss = .0
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(ctx)
        label = label.as_in_context(ctx)
        label_one_hot = nd.one_hot(label, 10)
        with autograd.record():
            output = net(data)   
            loss = softmax_cross_entropy(output, label_one_hot)
        loss.backward()
        SGD(params, learning_rate)
        cumulative_loss += nd.sum(loss).asscalar()
        
    train_acc = utils.evaluate_accuracy_scratch(train_data, net, ctx)
    test_acc = utils.evaluate_accuracy_scratch(test_data, net, ctx)
    
    print("Epoch %s, Train loss %s, Train acc %s, Test acc %s." 
          % (epoch, cumulative_loss / num_examples, train_acc, test_acc))

Epoch 0, Train loss 3.00007872289, Train acc 0.905467, Test acc 0.9012.
Epoch 1, Train loss 2.71165451838, Train acc 0.949433, Test acc 0.9423.
Epoch 2, Train loss 2.6723713829, Train acc 0.955033, Test acc 0.9503.
Epoch 3, Train loss 2.65166878611, Train acc 0.96935, Test acc 0.9573.
Epoch 4, Train loss 2.63810422694, Train acc 0.969033, Test acc 0.9571.
Epoch 5, Train loss 2.62684993515, Train acc 0.978883, Test acc 0.9644.
Epoch 6, Train loss 2.61908394419, Train acc 0.98335, Test acc 0.9708.
Epoch 7, Train loss 2.6138356458, Train acc 0.984983, Test acc 0.9706.
Epoch 8, Train loss 2.60796582514, Train acc 0.992133, Test acc 0.9751.
