# Chapter 5 Error Backpropagation

There are two ways to understand error backpropagation correctly:
- through mathematical expressions
- through computational graphs

## 5.1 Computational Graphs

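### 5.1.1 Solving Problems with Computational Graphs
- Propagation in the forward direction (left to right): forward propagation
- Propagation in the reverse direction (right to left): backward propagation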

### 5.1.2 Local Computation
### 5.1.3 Why Solve Problems with Computational Graphs

## 5.2 The Chain Rule

### 5.2.1 Backpropagation in a Computational Graph
### 5.2.2 What Is the Chain Rule
### 5.2.3 The Chain Rule and Computational Graphs

## 5.3 Backpropagation

### 5.3.1 Backpropagation through an Addition Node
z = x + y  
Upstream gradient: dL/dz  
-> passed to x: dL/dz * 1  
-> passed to y: dL/dz * 1  
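
The factor of 1 follows directly from the chain rule. Written out, with L standing for the final output of the graph as in the notes above:

$$
z = x + y \quad\Rightarrow\quad \frac{\partial z}{\partial x} = 1, \qquad \frac{\partial z}{\partial y} = 1
$$

$$
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial x} = \frac{\partial L}{\partial z}\cdot 1,
\qquad
\frac{\partial L}{\partial y} = \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial y} = \frac{\partial L}{\partial z}\cdot 1
$$

An addition node therefore passes the upstream gradient to both inputs unchanged.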

### 5.3.2 Backpropagation through a Multiplication Node
z = xy  
Upstream gradient: dL/dz  
-> passed to x: dL/dz * y  
-> passed to y: dL/dz * x  
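
As a quick sanity check, the local gradients of both node types can be compared against central finite differences. This is only a minimal sketch: the helper `numeric_partial`, the sample inputs, and the upstream gradient `dout = 2.0` are assumptions made for illustration and do not come from the original notes.

```python
def numeric_partial(f, args, i, h=1e-4):
    """Central-difference estimate of the partial derivative of f w.r.t. args[i]."""
    hi = list(args)
    lo = list(args)
    hi[i] += h
    lo[i] -= h
    return (f(*hi) - f(*lo)) / (2 * h)


def add(x, y):
    return x + y


def mul(x, y):
    return x * y


x, y = 3.0, 5.0   # arbitrary sample inputs (assumption)
dout = 2.0        # pretend upstream gradient dL/dz (assumption)

# Addition node: both inputs receive dL/dz * 1
add_dx = dout * numeric_partial(add, (x, y), 0)
add_dy = dout * numeric_partial(add, (x, y), 1)

# Multiplication node: each input receives dL/dz times the *other* input
mul_dx = dout * numeric_partial(mul, (x, y), 0)
mul_dy = dout * numeric_partial(mul, (x, y), 1)

print(add_dx, add_dy)   # both should be close to dout
print(mul_dx, mul_dy)   # should be close to dout * y and dout * x
```

The multiplication node "swaps" its inputs during the backward pass, which is why a layer implementing it has to remember x and y from the forward pass.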

## 5.4 Implementing Simple Layers

### 5.4.1 Implementing the Multiplication Layer
Every layer implementation shares two methods: forward() and backward().

In [1]:
class MulLayer:
    def __init__(self):
        self.x = None
        self.y = None
    def forward(self, x, y):
        self.x = x
        self.y = y
        out = x*y
        return out
    def backward(self, dout):
        dx = dout*self.y
        dy = dout*self.x
        return dx, dy

In [3]:
# example
apple = 100
apple_num = 2
tax = 1.1

# layer
mul_apple_layer = MulLayer()
mul_tax_layer = MulLayer()

# forward 
apple_price = mul_apple_layer.forward(apple, apple_num)
price = mul_tax_layer.forward(apple_price, tax)

print(price)

220.00000000000003


In [6]:
# backward
dprice = 1
dapple_price, dtax = mul_tax_layer.backward(dprice)
dapple, dapple_num = mul_apple_layer.backward(dapple_price)

print(dapple, dapple_num, dapple_price, dtax)

2.2 110.00000000000001 1.1 200


### 5.4.2 Implementing the Addition Layer

In [13]:
class AddLayer:
    def __init__(self):
        pass
    
    def forward(self, x, y):
        out = x + y
        return out
    
    def backward(self, dout):
        dx = dout*1
        dy = dout*1
        return dx, dy

In [15]:
apple = 100
apple_num = 2
orange = 150
orange_num = 3
tax = 1.1

# layer 
mul_apple_layer = MulLayer()
mul_orange_layer = MulLayer()
add_apple_orange_layer = AddLayer()
mul_tax_layer = MulLayer()

# forward
apple_price = mul_apple_layer.forward(apple, apple_num)

orange_price = mul_orange_layer.forward(orange, orange_num)

all_price = add_apple_orange_layer.forward(apple_price, orange_price)

price = mul_tax_layer.forward(all_price, tax)

print("Forward: \nApple_price: ",apple_price, "\nOrange_price: ", orange_price, "\nTotal price: ", all_price, "\nPrice: ", price )

Forward: 
Apple_price:  200 
Orange_price:  450 
Total price:  650 
Price:  715.0000000000001


In [17]:
# backward
dprice = 1
dall_price, dtax = mul_tax_layer.backward(dprice)
dapple_price, dorange_price = add_apple_orange_layer.backward(dall_price)
dorange, dorange_num = mul_orange_layer.backward(dorange_price)
dapple, dapple_num = mul_apple_layer.backward(dapple_price)

print("Backward:\ndapple_num: ", dapple_num, "\ndorange_price: ", dorange_price, "\ndall price: ", dall_price, "\ndprice: ", dprice)

Backward:
dapple_num:  110.00000000000001 
dorange_price:  1.1 
dall price:  1.1 
dprice:  1
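
The layer-by-layer backward pass can be cross-checked by treating the whole graph as one function and differentiating it numerically. The sketch below assumes the same apple/orange values as the cells above; the function names `total_price` and `numeric_grad` are hypothetical helpers introduced only for this check.

```python
def total_price(apple, apple_num, orange, orange_num, tax):
    """The whole computational graph written as one expression."""
    return (apple * apple_num + orange * orange_num) * tax


def numeric_grad(f, values, h=1e-4):
    """Central-difference gradient of f with respect to every input."""
    grads = []
    for i in range(len(values)):
        hi = list(values)
        lo = list(values)
        hi[i] += h
        lo[i] -= h
        grads.append((f(*hi) - f(*lo)) / (2 * h))
    return grads


values = [100.0, 2.0, 150.0, 3.0, 1.1]
dapple, dapple_num, dorange, dorange_num, dtax = numeric_grad(total_price, values)

# Rounded to hide floating-point noise from the finite differences
print(round(dapple, 4), round(dapple_num, 4))
print(round(dorange, 4), round(dorange_num, 4))
print(round(dtax, 4))
```

The values should agree, up to floating-point noise, with what the MulLayer and AddLayer backward calls return: about 2.2 and 110 for the apple inputs, 3.3 and 165 for the orange inputs, and 650 for the tax.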


## 5.5 Implementing Activation Function Layers
ReLU and Sigmoid

### 5.5.1 ReLU Layer

In [18]:
class ReLu:
#     Rectified Linear Unit
    def __init__(self):
        self.mask = None
    def forward(self, x):
        self.mask = (x<=0)
        out = x.copy()
        out[self.mask] = 0
        return out
    def backward(self, dout):
        dout[self.mask] = 0
        dx = dout
        return dx

In [20]:
# example
import numpy as np
x = np.array([[1.0, -0.5], [-2.0, 3.0]])
print(x)
mask = (x<=0)
print(mask)

[[ 1.  -0.5]
 [-2.   3. ]]
[[False  True]
 [ True False]]
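
To show how the mask is used in both directions, here is a small functional sketch run on the same `x`. It is a variant written for illustration, not the class above: the function names are assumptions, and `np.where` is used so that the upstream gradient `dout` is not modified in place (the `backward` method above overwrites entries of the array it receives).

```python
import numpy as np


def relu_forward(x):
    """Return the ReLU output and the mask of non-positive inputs."""
    mask = (x <= 0)
    out = np.where(mask, 0.0, x)
    return out, mask


def relu_backward(dout, mask):
    """Block the gradient where the input was <= 0; dout itself is left untouched."""
    return np.where(mask, 0.0, dout)


x = np.array([[1.0, -0.5], [-2.0, 3.0]])
out, mask = relu_forward(x)

dout = np.ones_like(x)          # pretend upstream gradient of all ones
dx = relu_backward(dout, mask)

print(out)
print(dx)
```

Positions where the mask is True output 0 in the forward pass and receive a gradient of 0 in the backward pass; everywhere else the values pass through unchanged.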


### 5.5.2 Sigmoid Layer

In [21]:
class Sigmoid:
    def __init__(self):
        self.out = None
    def forward(self, x):
        out = 1 / (1 + np.exp(-x))
        self.out = out 
        return out
    def backward(self, dout):
        dx = dout * (1.0 - self.out) * self.out
        return dx
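
The backward formula relies on the identity dy/dx = y(1 - y), which lets the layer reuse its cached output. The following sketch checks that identity against a central finite difference; the sample inputs and step size are arbitrary assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


x = np.array([-2.0, 0.0, 1.5])   # arbitrary sample inputs (assumption)
y = sigmoid(x)

# Analytic derivative expressed through the forward output only
analytic = y * (1.0 - y)

# Central finite-difference estimate of the same derivative
h = 1e-4
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)   # expected to be very small
```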

## 5.6 Implementing the Affine and Softmax Layers
### 5.6.1 Affine Layer
Matrix product: np.dot()

In [22]:
X = np.random.rand(2)
W = np.random.rand(2,3)
B = np.random.rand(3)

print(X.shape)
print(W.shape)
print(B.shape)

Y = np.dot(X, W) + B

(2,)
(2, 3)
(3,)


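### 5.6.2 Batch Version of the Affine Layer
In forward propagation, the bias is added to every sample (row) of X · W.
In backpropagation, the upstream gradients of all samples must therefore be summed into the corresponding bias elements.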

In [26]:
X_dot_W = np.array([[0,0,0], [10,10,10]])
B = np.array([1,2,3])
X_dot_W

array([[ 0,  0,  0],
       [10, 10, 10]])

In [27]:
X_dot_W + B

array([[ 1,  2,  3],
       [11, 12, 13]])

In [28]:
dY = np.array([[1,2,3], [4,5,6]])
print(dY)

dB = np.sum(dY, axis=0)
print(dB)

[[1 2 3]
 [4 5 6]]
[5 7 9]


In [29]:
class Affine:
    def __init__(self, w, b):
        self.w = w
        self.b = b
        self.x = None
        self.dw = None
        self.db = None
    def forward(self, x):
        self.x = x
        out = np.dot(x, self.w) + self.b
        return out
    def backward(self, dout):
        dx = np.dot(dout, self.w.T)
        self.dw = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)
        return dx
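
A useful way to remember the three backward formulas is that each gradient has the same shape as the quantity it belongs to. The sketch below checks the shapes and verifies `dw` numerically. The batch size, layer sizes, random seed, and the scalar test loss `sum(out * dout)` are assumptions chosen only to make the check self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))      # batch of 4 samples, 2 features each
w = rng.normal(size=(2, 3))
b = rng.normal(size=3)
dout = rng.normal(size=(4, 3))   # upstream gradient, same shape as the output

# Backward formulas of the Affine layer
dx = dout @ w.T                  # shape (4, 2), same as x
dw = x.T @ dout                  # shape (2, 3), same as w
db = dout.sum(axis=0)            # shape (3,),   same as b


def scalar_loss(w_):
    """Test loss whose gradient w.r.t. the layer output is exactly dout."""
    return np.sum((x @ w_ + b) * dout)


# Numerical gradient of the test loss with respect to w
h = 1e-5
dw_num = np.zeros_like(w)
for i in range(w.shape[0]):
    for j in range(w.shape[1]):
        w_hi = w.copy()
        w_lo = w.copy()
        w_hi[i, j] += h
        w_lo[i, j] -= h
        dw_num[i, j] = (scalar_loss(w_hi) - scalar_loss(w_lo)) / (2 * h)

print(dx.shape, dw.shape, db.shape)
print(np.max(np.abs(dw - dw_num)))
```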

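### 5.6.3 Softmax-with-Loss Layer

Processing in a neural network has two phases: inference and learning.  
Inference usually does not use the Softmax layer.  
When inference only has to give a single answer, only the highest score matters, so the Softmax layer is unnecessary. The learning phase, however, does need it.  

When the cross-entropy error is used as the loss function for softmax, the backward pass yields the clean result (y1-t1, y2-t2, y3-t3). The sum-of-squares error paired with an identity output layer gives a gradient of the same form.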

In [31]:
class SoftmaxWithLoss:
    def __init__(self):
        self.loss = None
        self.y = None
        self.t = None
    def forward(self, x, t):
        self.t = t 
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)
        return self.loss
    def backward(self, dout=1):
        batch_size = self.t.shape[0]
        dx = (self.y - self.t) / batch_size
        return dx
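
The class above calls `softmax` and `cross_entropy_error`, which are not defined in this notebook. The sketch below supplies minimal versions of both (assumed implementations for one-hot targets, written for this illustration) and checks numerically that the gradient of the loss with respect to the scores is (y - t) / batch_size. The sample scores and labels are made up.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)


def cross_entropy_error(y, t):
    """Mean cross-entropy for one-hot targets t; 1e-7 guards against log(0)."""
    batch_size = y.shape[0]
    return -np.sum(t * np.log(y + 1e-7)) / batch_size


x = np.array([[0.3, 2.9, 4.0], [1.0, 0.5, -1.0]])      # made-up scores
t = np.array([[0, 0, 1], [1, 0, 0]], dtype=float)      # made-up one-hot labels

# Analytic gradient, as in SoftmaxWithLoss.backward
y = softmax(x)
dx = (y - t) / x.shape[0]

# Numerical gradient of the loss with respect to every score
h = 1e-4
dx_num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    x_hi = x.copy()
    x_lo = x.copy()
    x_hi[idx] += h
    x_lo[idx] -= h
    dx_num[idx] = (cross_entropy_error(softmax(x_hi), t)
                   - cross_entropy_error(softmax(x_lo), t)) / (2 * h)

max_diff = np.max(np.abs(dx - dx_num))
print(max_diff)   # expected to be very small
```

Dividing by the batch size makes `dx` the gradient of the mean loss over the batch, which matches the averaging done inside `cross_entropy_error`.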

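## 5.7 Implementing Error Backpropagation

### 5.7.1 The Overall Picture of Neural Network Learning
Premise: a neural network has adjustable weights and biases, and the process of tuning them to fit the training data is called learning. Learning consists of the following four steps:
- Mini-batch: randomly select a subset of the training data
- Compute the gradient: compute the gradient of the loss function with respect to each weight parameter
- Update the parameters: move the weight parameters a small step against the gradient, the direction that reduces the loss
- Repeat: repeat the steps above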
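
The four steps can be sketched as a short training loop. To stay self-contained, this example trains a single softmax layer on synthetic two-class data rather than the two-layer network built later. The data, learning rate, batch size, and step count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, linearly separable two-class data (illustration only)
n = 200
x_all = rng.normal(size=(n, 2))
labels = (x_all[:, 0] + x_all[:, 1] > 0).astype(int)
t_all = np.eye(2)[labels]            # one-hot targets

w = np.zeros((2, 2))
b = np.zeros(2)
learning_rate = 0.5
batch_size = 32


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def loss_and_grads(x, t, w, b):
    """Mean cross-entropy loss and its gradients with respect to w and b."""
    y = softmax(x @ w + b)
    loss = -np.sum(t * np.log(y + 1e-7)) / x.shape[0]
    dz = (y - t) / x.shape[0]
    return loss, x.T @ dz, dz.sum(axis=0)


initial_loss, _, _ = loss_and_grads(x_all, t_all, w, b)

for step in range(200):                                        # step 4: repeat
    idx = rng.choice(n, batch_size, replace=False)             # step 1: mini-batch
    _, dw, db = loss_and_grads(x_all[idx], t_all[idx], w, b)   # step 2: gradient
    w -= learning_rate * dw                                    # step 3: update
    b -= learning_rate * db

final_loss, _, _ = loss_and_grads(x_all, t_all, w, b)
print(initial_loss, final_loss)      # the loss should drop noticeably
```

Note the minus sign in the update: the parameters move against the gradient so that the loss decreases.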

### 5.7.2 Implementing a Neural Network That Supports Error Backpropagation


In [32]:
import sys,  os
sys.path.append("./code/")
import numpy as np
from common.layers import *
from common.gradient import numerical_gradient
from collections import OrderedDict

class TwoLayerNet:
    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        # initialize the weight
        self.params = {}
        self.params['w1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['w2'] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)
        # generate the layers
        self.layers = OrderedDict()
        self.layers['Affine1'] = Affine(self.params['w1'], self.params['b1'])
        self.layers['Relu1'] = Relu()
        self.layers['Affine2'] = Affine(self.params['w2'], self.params['b2'])
        self.lastLayer = SoftmaxWithLoss()
        
    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)
        return x
    
    def loss(self, x, t):
        y = self.predict(x)
        return self.lastLayer.forward(y, t)
    
    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        if t.ndim != 1:
            t = np.argmax(t, axis=1)
        accuracy = np.sum(y == t) / float(x.shape[0])  # fraction correct, not a raw count
        return accuracy
    
    def numerical_gradient(self, x, t):
        loss_w = lambda w: self.loss(x, t)
        grads = {}
        grads['w1'] = numerical_gradient(loss_w, self.params['w1'])
        grads['b1'] = numerical_gradient(loss_w, self.params['b1'])
        grads['w2'] = numerical_gradient(loss_w, self.params['w2'])
        grads['b2'] = numerical_gradient(loss_w, self.params['b2'])
        return grads
    
    def gradient(self, x, t):
        # forward
        self.loss(x, t)
        
        # backward
        dout = 1
        dout = self.lastLayer.backward(dout)
        
        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)
        # settings
        grads = {}
        grads['w1'] = self.layers['Affine1'].dw
        grads['b1'] = self.layers['Affine1'].db
        grads['w2'] = self.layers['Affine2'].dw
        grads['b2'] = self.layers['Affine2'].db
        return grads

### 5.7.3 Gradient Check for Error Backpropagation

In [33]:
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

Using TensorFlow backend.


In [35]:
x_train = x_train.reshape([60000, 28*28])/255
x_test = x_test.reshape([10000, 28*28])/255

In [36]:
from keras.utils import to_categorical
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

In [39]:
sys.path.append("./code/ch05/")
from gradient_check.two_layer_net import TwoLayerNet
from two_layer_net import TwoLayerNet
network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

x_batch = x_train[:3]
y_batch = y_train[:3]

grad_numerical = network.numerical_gradient(x_batch, y_batch)
grad_backprop = network.gradient(x_batch, y_batch)

# calculate the average of absolute error of weights
for key in grad_numerical.keys():
    diff = np.average(np.abs(grad_backprop[key] - grad_numerical[key]))
    print(key + ": " + str(diff))

ModuleNotFoundError: No module named 'dataset.mnist'
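
The cell above depends on the layout of the book's code directory and did not run in this session. As a self-contained alternative, the sketch below performs the same kind of gradient check on a tiny two-layer network with random inputs. The layer sizes (4-5-3), the weight scale, the seed, and all function names are assumptions; this is not the `TwoLayerNet` class or the `numerical_gradient` helper from the book's code.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def loss(params, x, t):
    """Forward pass: Affine -> ReLU -> Affine -> softmax -> mean cross-entropy."""
    h = np.maximum(0, x @ params['w1'] + params['b1'])
    y = softmax(h @ params['w2'] + params['b2'])
    return -np.sum(t * np.log(y)) / x.shape[0]


def backprop_gradient(params, x, t):
    """Gradients of the loss obtained by backpropagation."""
    a1 = x @ params['w1'] + params['b1']
    h = np.maximum(0, a1)
    y = softmax(h @ params['w2'] + params['b2'])
    d2 = (y - t) / x.shape[0]                  # Softmax-with-Loss backward
    d1 = (d2 @ params['w2'].T) * (a1 > 0)      # Affine2 backward, then ReLU backward
    return {'w1': x.T @ d1, 'b1': d1.sum(axis=0),
            'w2': h.T @ d2, 'b2': d2.sum(axis=0)}


def numerical_gradient(f, arr, h=1e-4):
    """Central differences; arr is perturbed in place and restored afterwards."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(*arr.shape):
        original = arr[idx]
        arr[idx] = original + h
        f_hi = f()
        arr[idx] = original - h
        f_lo = f()
        arr[idx] = original
        grad[idx] = (f_hi - f_lo) / (2 * h)
    return grad


params = {'w1': 0.5 * rng.normal(size=(4, 5)), 'b1': np.zeros(5),
          'w2': 0.5 * rng.normal(size=(5, 3)), 'b2': np.zeros(3)}
x = rng.normal(size=(3, 4))
t = np.eye(3)[[0, 2, 1]]                       # one-hot labels for 3 samples

grad_backprop = backprop_gradient(params, x, t)
diffs = {}
for key in params:
    grad_num = numerical_gradient(lambda: loss(params, x, t), params[key])
    diffs[key] = np.average(np.abs(grad_backprop[key] - grad_num))
    print(key + ": " + str(diffs[key]))
```

If backpropagation is implemented correctly, the average absolute differences should be tiny. Numerical differentiation is far too slow for training, so its role is only to validate the backpropagation code.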

### 5.7.4 Learning with Error Backpropagation

In [40]:
# Source code: code/ch05/train_neuralnet.py
