# 神经网络

## 一、概念

神经网络是时下最热的人工智能话题，而神经网络的历史也由来已久，近年来的算力大爆发使人工智能和神经网络发现了彼此。

神经网络通过神经元进行组织，数据从上一层神经元流向下一层神经元直到输出神经元，损失函数衡量预测和输出之间的差距，再通过反向传播更新各层神经元的参数。

神经网络由如下元素构成：

1. 输入层：数据从输入层进入模型


2. 隐藏层：数据在隐藏层中进行交互和组合


3. 输出层：输出层输出预测结果


4. 激活函数：各个神经元的对上一层的输入进行非线性处理的函数


5. 损失函数：衡量预测结果和实际结果的差距


6. 优化器：即以何种方式更新参数

## 二、符号说明

- $X$：输入数据，$X\in R^{p\times n}$，p代表变量数，n代表样本数


- $\hat Y$：输出数据，$\hat Y\in R^{k\times n}$，n代表样本数，多分类时k代表分类数，二分类和回归时k为1


- $Y$：实际结果，$Y\in R^{k\times n}$，n代表样本数，多分类时k代表分类数，二分类和回归时k为1


- $p_i$：第i层的神经元数，$p_0=p$


- $W_i$：从第i-1层向第i层传播的矩阵，$W_i\in R^{p_{i}\times p_{i-1}}$，输入层为第0层时，$W_1\in R^{p_{1}\times p}$


- $\alpha(z)$：激活函数，每一层每一个神经元的激活函数都可以不同，此处统一用α


- $g(z)$：输出层的激活函数，通常和隐藏层的激活函数不同


- $b_i$：第i层的偏置项，$b_i \in R^{p_{i+1}}$


- $Z_i$：上一层激活函数的线性组合，$Z_i \in R^{p_i\times n}$


- $A_i$：线性组合的激活函数值，$A_i \in R^{p_i\times n}$


- \* ：逐元素相乘

## 三、Feed Forward 前向传播

### 1.从输入层到第一个隐藏层

首先是对输入数据的线性组合，由于偏执项是一个向量，对所有n个数据来说都相等。虽然此处维度按照线性代数并不能严格成立（因为$W_1X\in R^{p_1\times n}$，$b_1 \in R^{p_1\times 1}$），但是由于numpy中的广播（broadcast）机制存在，在编程中以下公式是成立的。如果非要按照数学定义上成立可以对$b_1$乘上一个$1\times n$的值全为1的向量。

$$
\begin{split}
        &Z_1 = W_1X + b_1 \ \ \in R^{p_1\times n}\\
    \Leftrightarrow & Z_1 = W_1X + b_11^{1\times n}
\end{split}
$$

然后是对第一层的各个神经元进行“激活”，对线性组合进行逐元素的函数计算

$$
A_1 = \alpha(Z_1)\ \ \in R^{p_1\times n}
$$

### 2.从第i-1层到第i层

与输入不同，此时是将上一层的激活函数值进行线性组合：

$$
Z_i = W_iA_{i-1} + b_i \ \ \in R^{p_i\times n}
$$

$$
A_i = \alpha(Z_i)\ \ \in R^{p_i\times n}
$$

### 3.从最后一个隐藏层到输出层

假设输入层是第0层，第1——m-1层是隐藏层，第m层是输出层。如果是二分类、回归等情况，则输出层只有一个神经元，若是多分类等情况则有多个神经元，将在后面介绍，暂时假定只有一个输出：

$$
Z_m = W_m A_{m-1} + b_m \ \ \in R^{k\times n}
$$

$$
\hat Y = A_m = g(Z_m)\ \ \in R^{k\times n}
$$

## 四、激活函数

激活函数有多种多样，本质上都是为了进行非线性组合，还有易于进行求导运算以便更新参数。此处简单介绍几种激活函数

### 1.sigmoid函数

Sigmoid函数已经在logistic回归中介绍过：

$$
sigmoid(z)=\frac{1}{1+e^{-z}}
$$

它是一种较早期的激活函数，现在多用于最后输出层的激活而不用在隐藏层中，这是因为当x远离原点时它的梯度会非常接近0，会造成非常著名的“梯度消失”的现象。

考虑sigmoid函数的导数：

$$
\frac{d}{dz} sigmoid(z)=\frac{e^{-z}}{(1+e^{-z})^2}
$$

当z=0时其梯度最大为0.25，当神经网络的层数变深时便是指数倍地降低，这便是“梯度消失”最直观和简洁的解释。

### 2.Relu（Rectified Linear Unit, 线性整流函数）

Relu也曾是红极一时的激活函数，因其简洁的函数形式和导数形式（x大于零导数为1，其他情况为0）使计算成本大大降低，但同时这也带来了神经元没有被激活的情况。这是因为当输入小于0时，输出和梯度都为0，导致神经元“死亡”。

$$
Relu(z) = max(0, z)
$$

$$
\frac{d}{dz}Relu(z) = \begin{cases}
1 & z>0 \\
0 & z\le 0
\end{cases}
$$

### 3.leaky Relu

leaky Relu是我最喜欢的激活函数，因为它兼具了Relu的优点，且当输入小于零时不会出现神经元死亡的情况，k通常的设置为0.1。

$$
leakyRelu(z, k) = max(kz, z)
$$

$$
\frac{d}{dz}leakyRelu(z) = \begin{cases}
1 & z>0 \\
k & z\le 0
\end{cases}
$$

### 4.softmax

softmax是专门用于多分类的输出层的激活函数，有两种等价形式，一种是针对K类有K个输出的线性相关的形式（即下式），另一个是针对K类有K-1个输出的线性无关的形式。

$$
softmax(z) = \begin{bmatrix}
    \frac{e^{z_1}}{\sum_{i=1}^ke^{z_i}}\\
    \frac{e^{z_2}}{\sum_{i=1}^ke^{z_i}}\\
    ...\\
    \frac{e^{z_j}}{\sum_{i=1}^ke^{z_i}}\\
    ...\\
    \frac{e^{z_k}}{\sum_{i=1}^ke^{z_i}}\\
\end{bmatrix} = \begin{bmatrix}
    \hat y_1\\
    \hat y_2\\
    ...\\
    \hat y_i\\
    ...\\
    \hat y_k\\
\end{bmatrix}
$$

它的针对单一分量的偏导数形式和sigmoid函数极为相似：

1. 当分量出现在分母和分子上上时，我们用$a$表示和第i个分量无关的其他分量和：

$$
\begin{split}
    \frac{d}{dz_i} \frac{e^{z_i}}{a+e^{z_i}}&= \frac{e^{z_i}(a+e^{z_i}) - e^{z_i}e^{z_i}}{(a+e^{z_i})^2} \\
        &= \frac{ae^{z_i}+a^2-a^2}{(a+e^{z_i})^2} \\
        &= \frac{a(e^{z_i}+a)-a^2}{(a+e^{z_i})^2} \\
        &= \frac{a}{a+e^{z_i}} - \left(\frac{a}{a+e^{z_i}}\right)^2 \\
        &= \frac{a}{a+e^{z_i}}\left(1-\frac{a}{a+e^{z_i}}\right) \\
        &= \left(1-\frac{e^{z_i}}{a+e^{z_i}}\right)\frac{e^{z_i}}{a+e^{z_i}}\\
        &= (1-\hat y_i)\hat y_i
\end{split}
$$

2. 而当分量只出现在分母上时，我们用b表示分子上的第j个分量，用a表示与第i、j个分量无关的其他分量的和：

$$
\begin{split}
    \frac{d}{dz_i} \frac{b}{a+b+e^{z_i}} &= \frac{-be^{z_i}}{(a+b+e^{z_i})^2}\\
        &= \frac{-b(a+b+e^{z_i})+ab+b^2}{(a+b+e^{z_i})^2}\\
        &= \frac{-b}{a+b+e^{z_i}}+\frac{b(a+b)}{(a+b+e^{z_i})^2} \\
        &= \frac{-b}{a+b+e^{z_i}}+\frac{b}{a+b+e^{z_i}}\left(1-\frac{e^{z_i}}{a+b+e^{z_i}}\right)\\
        &= -\hat y_j + \hat y_j(1-\hat y_i)
\end{split}
$$

按照矩阵的求导法则，$m\times 1$列向量对$n \times 1$列向量求导的结果应该是$mn \times 1$维向量，但是此时为了便于计算，我们将其改写成$m\times n$的矩阵（或者$n\times m$，看需求）则它的梯度为：

$$
\triangledown softmax(z)=\begin{bmatrix}
    \hat y_1(1-\hat y_1) & -\hat y_2 + \hat y_2(1-\hat y_1) & \dots & -\hat y_k + \hat y_k(1-\hat y_1)\\
    -\hat y_1 + \hat y_1(1-\hat y_2) & \hat y_2(1-\hat y_2) & \dots & -\hat y_k + \hat y_k(1-\hat y_2)\\
    \vdots & \vdots & \ddots & \vdots \\
    -\hat y_1 + \hat y_1(1-\hat y_k) & -\hat y_2 + \hat y_2(1-\hat y_k) & \dots & \hat y_k(1-\hat y_k)\\
\end{bmatrix}
$$

## 五、损失函数

二分类和回归的损失函数不再赘述，和logistic回归和多元线性回归类似，这里介绍多分类的损失函数。

多分类的损失函数和二分类相同，也是通过似然函数进行定义：假设随机变量Y一共有K个取值，第i个样本对第j个取值的概率估计值为：

$$
\begin{split}
P(y_i=j) = \hat y_{ij} \ \ j=1,2,...,k \\
\end{split}
$$

则对n个样本，其似然函数为：

$$
likelihood(Y, \hat Y)=\prod_{i=1}^n\prod_{j=1}^k \hat y_{ij}^{I(y_i=j)}
$$

对其求自然对数，除以样本数进行标准化取负数：

$$
loss(Y, \hat Y) = -\frac1n\sum_{i=1}^n \sum_{j=1}^k I(y_i=j)ln(\hat y_{ij})
$$

这就是最终的损失函数。

## 六、Backward propagation 反向传播

反向传播是神经网络更新参数最经典也是最有效、最具有广泛性的算法。

反向传播的基础仍然是梯度下降法。

### 1.输出层到最后一个隐藏层的反向传播

$$
\frac{\partial}{\partial y_{ij}}loss = -\frac1n\frac{I(y_i=j)}{\hat y_{ij}}
$$

由于输出是$\hat Y \in R^{k\times n}$向量，损失对输出层的梯度和输出保持一致的维度：

$$
\frac{\triangledown loss}{\triangledown \hat Y}=
-\frac1n\begin{bmatrix}
    \frac{I(y_1=1)}{\hat y_{11}} & \dots & \frac{I(y_1=k)}{\hat y_{1k}}\\
     & \ddots & \vdots \\
    \frac{I(y_n=1)}{\hat y_{n1}} & & \frac{I(y_n=k)}{\hat y_{nk}}
\end{bmatrix}^T \in R^{k\times n}
$$

对输出层的激活函数有$\hat Y_i = g(Z_i)=softmax(Z_i)\in R^{k\times 1}$，$Z_i\in R^{k\times 1}$

$$
\frac{\triangledown\hat Y}{\triangledown Z_i} = 
\begin{bmatrix}
    \hat y_{i1}(1-\hat y_{i1}) & -\hat y_{i2} + \hat y_{i2}(1-\hat y_{i1}) & \dots & -\hat y_{ik} + \hat y_{ik}(1-\hat y_{i1})\\
    -\hat y_{i1} + \hat y_{i1}(1-\hat y_{i2}) & \hat y_{i2}(1-\hat y_{i2}) & \dots & -\hat y_{ik} + \hat y_{ik}(1-\hat y_{i2})\\
    \vdots & \vdots & \ddots & \vdots \\
    -\hat y_{i1} + \hat y_{i1}(1-\hat y_{ik}) & -\hat y_{i2} + \hat y_{i2}(1-\hat y_{ik}) & \dots & \hat y_{ik}(1-\hat y_{ik})\\
\end{bmatrix}
$$

将这个梯度矩阵乘以$\frac{\triangledown loss}{\triangledown \hat Y}$与之对应的列，如果$y_i=j$的话，这一列将是：

$$
\begin{bmatrix}
    -\hat y_{i1}\\
    \dots \\
    -\hat y_{ij-1}\\
    1-\hat y_{ij}\\
    -\hat y_{ij+1}\\
    \dots \\
    -\hat y_{ik}\\
\end{bmatrix}
$$

结合损失对估计值的梯度前的系数$-\frac1n$于是恰巧有：

$$
\frac{\triangledown loss}{\triangledown Z_m} = \frac1n(\hat Y - Y)
$$

此时还未涉及到参数的更新，而$Z_m = W_mA_{m-1} + b_m$中$W_m\in R^{p_m=k\times p_{m-1}}$、$b_m\in R^{k\times 1}$均为参数。

$$
\begin{split}
    &\frac{\triangledown Z_m}{\triangledown W_m} = A_{m-1}\in R^{p_{m-1}\times n}\\
    \\
    &\frac{\triangledown Z_m}{\triangledown b_m} = 1^{1\times n} \in R^{1\times n}
\end{split}
$$

将其和之前的梯度结合起来：

$$
\begin{split}
    \frac{\triangledown loss}{\triangledown W_m} &= \frac{\triangledown loss}{\triangledown Z_m}\left(\frac{\triangledown Z_m}{\triangledown W_m}\right)^T =\frac{\triangledown loss}{\triangledown Z_m}A_{m-1}^T \in R^{p_m=k\times p_{m-1}}\\
    \frac{\triangledown loss}{\triangledown b_m} &= \frac{\triangledown loss}{\triangledown Z_m} 1^{n\times 1}\in R^{k\times 1}
\end{split}
$$

### 2.第i层到第i-1层的反向传播

从第i层到第i-1层的反向传播和从输出层到最后一个隐藏层的推导相似：

假设$\frac{\triangledown loss}{\triangledown Z_i}\in R^{p_i\times n}$已知，又$Z_i = W_i A_{i-1} + b_i$，

$$
\begin{split}
    \frac{\triangledown Z_{i}}{\triangledown A_{i-1}} &= W_{i}^T\in R^{p_{i-1}\times p_{i}}\\
    A_{i-1} &= \begin{bmatrix}
            \alpha (z_{11}) & \dots & \alpha (z_{1p_i})\\
             & \ddots & \vdots \\
            \alpha (z_{n1}) & &\alpha(z_{np_i}) 
          \end{bmatrix}\in R^{p_{i-1}\times n}\\
       \frac{\triangledown A_{i-1}}{\triangledown Z_{i-1}}&=\begin{bmatrix}
            \alpha '(z_{11}) & \dots & \alpha '(z_{1p_i})\\
             & \ddots & \vdots \\
            \alpha '(z_{n1}) & & \alpha '(z_{np_i}) 
       \end{bmatrix}\in R^{p_{i-1}\times n}
\end{split}
$$


其余部分和之前的相同

$$
\begin{split}
    &\frac{\triangledown Z_{i-1}}{\triangledown W_{i-1}} = A_{i-2}\in R^{p_{i-2}\times n}\\
    \\
    &\frac{\triangledown Z_{i-1}}{\triangledown b_{i-1}} = 1^{1\times n} \in R^{1\times n}
\end{split}
$$

于是有：

$$
\begin{split}
    \frac{\triangledown loss}{\triangledown Z_{i-1}} &= W_i^T\frac{\triangledown loss}{\triangledown Z_i}* \frac{\triangledown A_{i-1}}{\triangledown Z_{i-1}}\in R^{p_{i-1 \times n}} \\
    \\
    \frac{\triangledown loss}{\triangledown W_{i-1}} &= \frac{\triangledown loss}{\triangledown Z_{i-1}} A_{i-2}^T \in R^{p_{i-1}\times p_{i-2}} \\
    \\
    \frac{\triangledown loss}{\triangledown b_{i-1}} &= \frac{\triangledown loss}{\triangledown Z_{i-1}}1^{n\times 1}\in R^{p_{i-1}\times 1}
\end{split}
$$

### 3.从第一层到输入层

从第1层到输入层的反向传播和从第i层到第i-1层的推导相似，区别在于输入是固定的数据，而不再是激活函数值：

假设$\frac{\triangledown loss}{\triangledown Z_2}\in R^{p_2\times n}$已知，又$Z_2 = W_2 A_{1} + b_2$，$Z_1 = W_1 X + b_1$

$$
\begin{split}
    \frac{\triangledown loss}{\triangledown Z_{1}} &= W_2^T\frac{\triangledown loss}{\triangledown Z_2}* \frac{\triangledown A_{1}}{\triangledown Z_{1}}\in R^{p_{1 \times n}} \\
    \\
    \frac{\triangledown loss}{\triangledown W_{1}} &= \frac{\triangledown loss}{\triangledown Z_{1}}X^T \in R^{p_{1}\times p_{0}} \\
    \\
    \frac{\triangledown loss}{\triangledown b_{1}} &= \frac{\triangledown loss}{\triangledown Z_{1}}1^{n\times 1}\in R^{p_{1}\times 1}
\end{split}
$$

## 七、优化器

优化器是指优化得到参数的方法，优化器基本都是基于梯度下降方法。如果你在线性回归中不用正规方程求解参数，而是用梯度下降，你会发现随着梯度不断下降，**梯度不断减小**。而这还不是最麻烦的问题，由于线性回归是凸优化，用梯度下降总会收敛到最小值，而神经网络多是非凸问题，梯度下降很可能会困在局部极值**无法收敛**。而且通常神经网络需要很多的数据进行训练，如果每次都像传统的梯度下降那样把所有数据都传入模型，则**计算成本很大**。

这里先介绍**SGD**（Stochastic Gradient Descnet，随机梯度下降）优化器。

SGD不再把所有的数据都用来进行梯度下降，而是只用小批量（**mini batch**）数据进行梯度下降，常见的选择是从2的4次方（16）到2的10次方之间，选用2的整数次方是根据计算机比特的特点决定的，而之前推导中梯度进行标准化时除以样本数，此时需要除以一批量的样本数。

控制梯度下降停止的条件也有所改变，由于神经网络强大的非线性组合能力，训练到收敛会造成过拟合，于是神经网络中用到最多的是早停法，也即小批量进行训练时将全部样本循环数遍（**epoch**）后就立即停下，避免过拟合。

## 八、应用

这次采用的是minist手写数字数据集，从kaggle的入门赛下载下来的训练数据集，有兴趣的可以把自己训练好的型跑一下kaggle上的测试数据集提交一下看看分数。（排名就不必了看了...）

In [1]:
import pandas as pd
import numpy as np

train_data = pd.read_csv('data_set/minist.csv')
train_data.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [2]:
n = train_data.shape[0]
np.random.seed(2099)
index = np.random.permutation(n)
train_index = index[0: int(0.7*n)]
test_index = index[int(0.7*n): n]

test_data = train_data.iloc[test_index]
test_label = test_data['label']
del test_data['label']
test_data = np.array(test_data)
test_label = np.array(test_label).reshape([n-int(0.7*n), 1])

train_data = train_data.iloc[train_index]
train_label = train_data['label']
del train_data['label']
train_data = np.array(train_data)
train_label = np.array(train_label).reshape([int(0.7*n), 1])

In [3]:
def to_category(label, num_classes):
    n = label.shape[0]
    tmp = np.zeros([n, num_classes])
    j = 0
    for i in label:
        tmp[j, i]=1
        j += 1
    return tmp

In [4]:
def soft_max(z):
    """
    :param z: input, an p*n matrix
    :return: p*n matrix
    """
    e = np.exp(z)
    total = np.sum(e, axis=0, keepdims=True)
    weight = e / total
    return weight

In [5]:
def accuracy(y, y_hat):
    y = np.argmax(y, axis=0)
    y_hat = np.argmax(y_hat, axis=0)
    return sum(y == y_hat)/len(y)

In [6]:
def loss(y, y_hat):
    """
    :param y: the ture value p*n matrix
    :param y_hat: the predicted value
    :return: loss, it's quite computationally expensive
                I once simplify it as np.sum(y*y_hat)
    """
    n = y.shape[1]
    tmp = y_hat**y
    tmp = -np.log(tmp.prod(axis=0)).sum()/n
    return tmp

In [7]:
def leaky_relu(x, k=0.3):
    return (x > 0)*x + k*(x < 0)*x


def d_leaky_relu(x, k=0.3):
    return (x > 0) + k*(x < 0)


In [8]:
def init_w(b, a):
    w = np.random.randn(a * b)
    w = np.reshape(w, [b, a])
    return w


def init_b(b):
    b = np.zeros([b, 1])
    return b


In [9]:
def forward(x, parameter, cache):
    """
    :param x:input data p*n matrix
    :param parameter: a dict storing parameters
    :param cache: a dict storing computation result of each layer
    :return: the predicted value
    """
    cache['C1'] = np.dot(parameter['W1'], x) + parameter['b1']
    cache['A1'] = leaky_relu(cache['C1'])
    cache['C2'] = np.dot(parameter['W2'], cache['A1']) + parameter['b2']
    cache['A2'] = leaky_relu(cache['C2'])
    cache['C3'] = np.dot(parameter['W3'], cache['A2']) + parameter['b3']
    cache['A3'] = soft_max(cache['C3'])
    return cache

In [10]:
def back_propagation(x, y, parameter, cache, step):
    """
    X 784*n / Y 10*n
    dW1 W1 800*784, db1 b1 800*1, A1 C1, 800*n
    dW2 W2 400*800, db2 b2 400*1, A2 C2, 400*n
    dW3 W3 10*400, db3 b3 10*1, A3 C3, 10*n
    :param y: true value
    :param parameter: dictionary storing all parameters
    :param cache: dictionary storing all the computation in process
    :param step: learning rate
    :return: updated parameters
    """
    number = y.shape[1]
    cache['dC3'] = cache['A3'] - y  # 10*n
    cache['dW3'] = np.dot(cache['dC3'], cache['A2'].T)/number  # 10*400
    cache['db3'] = np.sum(cache['dC3'], axis=1, keepdims=True)/number  # 10*1
    parameter['W3'] = parameter['W3'] - step*cache['dW3']  # 10*400
    parameter['b3'] = parameter['b3'] - step*cache['db3']  # 10*1

    cache['dC2'] = np.dot(parameter['W3'].T, 
                          cache['dC3'])*d_leaky_relu(cache['C2'])  # 400*n
    cache['dW2'] = np.dot(cache['dC2'], cache['A1'].T)/number  # 400*800
    cache['db2'] = np.sum(cache['dC2'], axis=1, keepdims=True)/number  # 400*1
    parameter['W2'] = parameter['W2'] - step*cache['dW2']  # 400*800
    parameter['b2'] = parameter['b2'] - step*cache['db2']  # 400*1

    cache['dC1'] = np.dot(parameter['W2'].T, 
                          cache['dC2'])*d_leaky_relu(cache['C1'])  # 800*n
    cache['dW1'] = np.dot(cache['dC1'], x.T)/number  # 800*784
    cache['db1'] = np.sum(cache['dC1'], axis=1, keepdims=True)  # 800*1
    parameter['W1'] = parameter['W1'] - step*cache['dW1']  # 800*784
    parameter['b1'] = parameter['b1'] - step*cache['db1']  # 800*1
    return cache, parameter


In [11]:
def train(x, y, learning_rate=0.001, batch_size=128, epoch=5):
    """
    :param x: training data
    :param y: training label
    :param learning_rate: the length of a step
    :param batch_size: numbers of samples we train in a round
    :param epoch: rounds we train through training data
    :return: a trained set of parameters
    """
    parameter = dict()
    nx = x.shape[1]
    parameter['W1'] = init_w(800, 784)/100
    parameter['b1'] = init_b(800)
    parameter['W2'] = init_w(400, 800)/100
    parameter['b2'] = init_b(400)
    parameter['W3'] = init_w(10, 400)/100
    parameter['b3'] = init_b(10)

    index = np.array([], dtype='int')
    for i in range(0, nx, batch_size):
        index = np.append(index, i)
    index = np.append(index, nx)

    cache = dict()
    for i in range(0, epoch):
        for j in range(0, int(nx/batch_size)+1):
            one_batch_x = x[:, index[j]:index[j+1]]
            one_batch_y = y[:, index[j]:index[j+1]]
            cache = forward(one_batch_x, parameter, cache)
            prob = loss(one_batch_y, cache['A3'])
            acc = accuracy(one_batch_y, cache['A3'])
            print(str(i)+'--'+str(j)+'--'+str(index[j+1]))
            print('loss: '+str(prob))
            print('accuracy: '+str(acc))
            [cache, parameter] = back_propagation(one_batch_x, one_batch_y,
                                        parameter, cache, step=learning_rate)
    return cache, parameter

In [12]:
train_label = to_category(train_label, num_classes=10)
test_label = to_category(test_label, num_classes=10)

print(train_label.shape)
print(test_label.shape)

(29399, 10)
(12601, 10)


In [13]:
cache, parameter = train(x=train_data.T, y=train_label.T, epoch=5)

0--0--128
loss: 2.5144562796833654
accuracy: 0.0703125
0--1--256
loss: 2.217050238032897
accuracy: 0.1875
0--2--384
loss: 2.1658027472486903
accuracy: 0.234375
0--3--512
loss: 2.0168412770294104
accuracy: 0.296875
0--4--640
loss: 1.8946818751163836
accuracy: 0.40625
0--5--768
loss: 1.8244820246098055
accuracy: 0.421875
0--6--896
loss: 1.6413233158646316
accuracy: 0.515625
0--7--1024
loss: 1.6209324889136416
accuracy: 0.5546875
0--8--1152
loss: 1.5990883453285492
accuracy: 0.5390625
0--9--1280
loss: 1.385634792789608
accuracy: 0.6484375
0--10--1408
loss: 1.303833712288761
accuracy: 0.703125
0--11--1536
loss: 1.3310439935553542
accuracy: 0.6484375
0--12--1664
loss: 1.1805366474892716
accuracy: 0.7265625
0--13--1792
loss: 1.260705592177902
accuracy: 0.6640625
0--14--1920
loss: 1.1617022979770701
accuracy: 0.7109375
0--15--2048
loss: 1.1734547644587505
accuracy: 0.6875
0--16--2176
loss: 1.0308806471718963
accuracy: 0.734375
0--17--2304
loss: 1.0419973584182476
accuracy: 0.71875
0--18--2432

0--147--18944
loss: 0.4114326298069638
accuracy: 0.890625
0--148--19072
loss: 0.41031210578035787
accuracy: 0.890625
0--149--19200
loss: 0.46009855033299535
accuracy: 0.8671875
0--150--19328
loss: 0.3394504977432337
accuracy: 0.8984375
0--151--19456
loss: 0.3800270390012476
accuracy: 0.8984375
0--152--19584
loss: 0.3947237096393335
accuracy: 0.859375
0--153--19712
loss: 0.3841021629213581
accuracy: 0.90625
0--154--19840
loss: 0.3205243187648902
accuracy: 0.890625
0--155--19968
loss: 0.4992907784343232
accuracy: 0.859375
0--156--20096
loss: 0.4204976871912581
accuracy: 0.8828125
0--157--20224
loss: 0.47473556478157025
accuracy: 0.890625
0--158--20352
loss: 0.3297899245674878
accuracy: 0.9140625
0--159--20480
loss: 0.5765076515144314
accuracy: 0.8515625
0--160--20608
loss: 0.5689105012752617
accuracy: 0.859375
0--161--20736
loss: 0.40266185415624056
accuracy: 0.9296875
0--162--20864
loss: 0.4079488638150572
accuracy: 0.890625
0--163--20992
loss: 0.40792975673851656
accuracy: 0.875
0--164

1--60--7808
loss: 0.24343609902013907
accuracy: 0.9375
1--61--7936
loss: 0.4631960912363376
accuracy: 0.8828125
1--62--8064
loss: 0.30408090548921496
accuracy: 0.9140625
1--63--8192
loss: 0.3106892270517312
accuracy: 0.9375
1--64--8320
loss: 0.29241012925739635
accuracy: 0.90625
1--65--8448
loss: 0.21049285400482148
accuracy: 0.9375
1--66--8576
loss: 0.21814655111015233
accuracy: 0.953125
1--67--8704
loss: 0.22486617728784983
accuracy: 0.9375
1--68--8832
loss: 0.3442469318671014
accuracy: 0.890625
1--69--8960
loss: 0.31405442327854927
accuracy: 0.890625
1--70--9088
loss: 0.2512756437475721
accuracy: 0.9296875
1--71--9216
loss: 0.33934086320415857
accuracy: 0.8984375
1--72--9344
loss: 0.48646391609240675
accuracy: 0.8359375
1--73--9472
loss: 0.3472198928436054
accuracy: 0.8671875
1--74--9600
loss: 0.4607340927267417
accuracy: 0.90625
1--75--9728
loss: 0.3246488577845161
accuracy: 0.8984375
1--76--9856
loss: 0.35825968686244286
accuracy: 0.875
1--77--9984
loss: 0.3049351727502574
accurac

1--203--26112
loss: 0.2865569033984523
accuracy: 0.9296875
1--204--26240
loss: 0.21680094487446883
accuracy: 0.9296875
1--205--26368
loss: 0.40039280850551406
accuracy: 0.8828125
1--206--26496
loss: 0.2612819367944928
accuracy: 0.9296875
1--207--26624
loss: 0.2174146631443534
accuracy: 0.9140625
1--208--26752
loss: 0.34990426155293114
accuracy: 0.890625
1--209--26880
loss: 0.29917253759405366
accuracy: 0.9140625
1--210--27008
loss: 0.26558348395228615
accuracy: 0.921875
1--211--27136
loss: 0.24778017494325774
accuracy: 0.90625
1--212--27264
loss: 0.32988890825283457
accuracy: 0.921875
1--213--27392
loss: 0.34300200385577806
accuracy: 0.890625
1--214--27520
loss: 0.16635328808727912
accuracy: 0.9609375
1--215--27648
loss: 0.3487208999376741
accuracy: 0.8828125
1--216--27776
loss: 0.2690853334690969
accuracy: 0.9375
1--217--27904
loss: 0.2573507878034015
accuracy: 0.953125
1--218--28032
loss: 0.18163438002118937
accuracy: 0.953125
1--219--28160
loss: 0.2488713683519324
accuracy: 0.90625


2--117--15104
loss: 0.2865716975225735
accuracy: 0.9296875
2--118--15232
loss: 0.2842565280211326
accuracy: 0.921875
2--119--15360
loss: 0.15538020811969724
accuracy: 0.96875
2--120--15488
loss: 0.23394176019223806
accuracy: 0.9453125
2--121--15616
loss: 0.27684753801984474
accuracy: 0.9453125
2--122--15744
loss: 0.34217822744201
accuracy: 0.8828125
2--123--15872
loss: 0.2476873090890031
accuracy: 0.9375
2--124--16000
loss: 0.20952110997057602
accuracy: 0.9453125
2--125--16128
loss: 0.16079385670478685
accuracy: 0.9609375
2--126--16256
loss: 0.23832031661395198
accuracy: 0.9375
2--127--16384
loss: 0.1819166619802945
accuracy: 0.9453125
2--128--16512
loss: 0.2632972538770256
accuracy: 0.9140625
2--129--16640
loss: 0.2538021127031236
accuracy: 0.921875
2--130--16768
loss: 0.24752862182365795
accuracy: 0.9296875
2--131--16896
loss: 0.3242916687207346
accuracy: 0.8984375
2--132--17024
loss: 0.38590803523100325
accuracy: 0.8671875
2--133--17152
loss: 0.13423707209333413
accuracy: 0.96875
2-

3--28--3712
loss: 0.15476434958466345
accuracy: 0.9765625
3--29--3840
loss: 0.20963772021794147
accuracy: 0.921875
3--30--3968
loss: 0.19256686096944586
accuracy: 0.9375
3--31--4096
loss: 0.2935553955288488
accuracy: 0.9140625
3--32--4224
loss: 0.31115805380222594
accuracy: 0.890625
3--33--4352
loss: 0.22228210539054372
accuracy: 0.9453125
3--34--4480
loss: 0.27013937008115496
accuracy: 0.9296875
3--35--4608
loss: 0.2912872807869159
accuracy: 0.9140625
3--36--4736
loss: 0.26385767368979207
accuracy: 0.90625
3--37--4864
loss: 0.17417162311512116
accuracy: 0.9375
3--38--4992
loss: 0.2830441232612647
accuracy: 0.9296875
3--39--5120
loss: 0.2646953684743941
accuracy: 0.9296875
3--40--5248
loss: 0.3062922665270055
accuracy: 0.921875
3--41--5376
loss: 0.16164009720665318
accuracy: 0.953125
3--42--5504
loss: 0.2028410857948995
accuracy: 0.9453125
3--43--5632
loss: 0.16744005831067754
accuracy: 0.9609375
3--44--5760
loss: 0.23576960184583695
accuracy: 0.9453125
3--45--5888
loss: 0.464648431092

3--171--22016
loss: 0.20269827750146835
accuracy: 0.9453125
3--172--22144
loss: 0.2588965698420835
accuracy: 0.9296875
3--173--22272
loss: 0.21502885393132234
accuracy: 0.9375
3--174--22400
loss: 0.22419890984898766
accuracy: 0.9296875
3--175--22528
loss: 0.22314713745701986
accuracy: 0.9296875
3--176--22656
loss: 0.15580721978836176
accuracy: 0.9453125
3--177--22784
loss: 0.162110652372484
accuracy: 0.953125
3--178--22912
loss: 0.18279046493003864
accuracy: 0.9296875
3--179--23040
loss: 0.22865221778473802
accuracy: 0.9296875
3--180--23168
loss: 0.17347179831609896
accuracy: 0.9453125
3--181--23296
loss: 0.33143032935327177
accuracy: 0.9375
3--182--23424
loss: 0.2190145232734133
accuracy: 0.9375
3--183--23552
loss: 0.18728222620018842
accuracy: 0.9453125
3--184--23680
loss: 0.22847741936658794
accuracy: 0.9296875
3--185--23808
loss: 0.14818933356004754
accuracy: 0.953125
3--186--23936
loss: 0.28452760562922147
accuracy: 0.9140625
3--187--24064
loss: 0.38724581696704
accuracy: 0.90625


4--84--10880
loss: 0.16743564711697892
accuracy: 0.96875
4--85--11008
loss: 0.11638199466205286
accuracy: 0.984375
4--86--11136
loss: 0.16856729568199957
accuracy: 0.9375
4--87--11264
loss: 0.2794145724214695
accuracy: 0.8828125
4--88--11392
loss: 0.3054642257222573
accuracy: 0.9140625
4--89--11520
loss: 0.12453490668140726
accuracy: 0.9765625
4--90--11648
loss: 0.17515790705420367
accuracy: 0.9609375
4--91--11776
loss: 0.17861904539316117
accuracy: 0.9375
4--92--11904
loss: 0.31995014502137664
accuracy: 0.8984375
4--93--12032
loss: 0.2436488524129315
accuracy: 0.90625
4--94--12160
loss: 0.1703556926923637
accuracy: 0.9296875
4--95--12288
loss: 0.18831394732996187
accuracy: 0.9609375
4--96--12416
loss: 0.22949011741845593
accuracy: 0.953125
4--97--12544
loss: 0.27616845829241937
accuracy: 0.921875
4--98--12672
loss: 0.14695358523738591
accuracy: 0.9765625
4--99--12800
loss: 0.30826419021214185
accuracy: 0.9375
4--100--12928
loss: 0.18105837965519433
accuracy: 0.9375
4--101--13056
loss:

4--224--28800
loss: 0.3165181200996261
accuracy: 0.8984375
4--225--28928
loss: 0.18950658569351692
accuracy: 0.9453125
4--226--29056
loss: 0.19469315971257084
accuracy: 0.9453125
4--227--29184
loss: 0.3202240651737227
accuracy: 0.921875
4--228--29312
loss: 0.20897876309324015
accuracy: 0.9609375
4--229--29399
loss: 0.1699488396563522
accuracy: 0.9655172413793104


In [14]:
hat_label = forward(test_data.T, parameter, cache)
hat_label.keys()

dict_keys(['C1', 'A1', 'C2', 'A2', 'C3', 'A3', 'dC3', 'dW3', 'db3', 'dC2', 'dW2', 'db2', 'dC1', 'dW1', 'db1'])

In [15]:
hat_label = hat_label['A3']
hat_label.shape

(10, 12601)

In [16]:
loss(test_label.T, hat_label)

0.2211812376862138