# 神经网络

## 一、概念

神经网络是时下最热的人工智能话题，而神经网络的历史也由来已久，近年来的算力大爆发使人工智能和神经网络发现了彼此。

神经网络通过神经元进行组织，数据从上一层神经元流向下一层神经元直到输出神经元，损失函数衡量预测和输出之间的差距，再通过反向传播更新各层神经元的参数。

神经网络由如下元素构成：

1. 输入层：数据从输入层进入模型


2. 隐藏层：数据在隐藏层中进行交互和组合


3. 输出层：输出层输出预测结果


4. 激活函数：各个神经元的对上一层的输入进行非线性处理的函数


5. 损失函数：衡量预测结果和实际结果的差距


6. 优化器：即以何种方式更新参数

## 二、符号说明

- $X$：输入数据，$X\in R^{p\times n}$，p代表变量数，n代表样本数


- $\hat Y$：输出数据，$\hat Y\in R^{k\times n}$，n代表样本数，多分类时k代表分类数，二分类和回归时k为1


- $Y$：实际结果，$Y\in R^{k\times n}$，n代表样本数，多分类时k代表分类数，二分类和回归时k为1


- $p_i$：第i层的神经元数，$p_0=p$


- $W_i$：从第i-1层向第i层传播的矩阵，$W_i\in R^{p_{i}\times p_{i-1}}$，输入层为第0层时，$W_1\in R^{p_{1}\times p}$


- $\alpha(z)$：激活函数，每一层每一个神经元的激活函数都可以不同，此处统一用α


- $g(z)$：输出层的激活函数，通常和隐藏层的激活函数不同


- $b_i$：第i层的偏置项，$b_i \in R^{p_{i+1}}$


- $Z_i$：上一层激活函数的线性组合，$Z_i \in R^{p_i\times n}$


- $A_i$：线性组合的激活函数值，$A_i \in R^{p_i\times n}$


- \* ：逐元素相乘

## 三、Feed Forward 前向传播

### 1.从输入层到第一个隐藏层

首先是对输入数据的线性组合，由于偏执项是一个向量，对所有n个数据来说都相等。虽然此处维度按照线性代数并不能严格成立（因为$W_1X\in R^{p_1\times n}$，$b_1 \in R^{p_1\times 1}$），但是由于numpy中的广播（broadcast）机制存在，在编程中以下公式是成立的。如果非要按照数学定义上成立可以对$b_1$乘上一个$1\times n$的值全为1的向量。

$$
\begin{split}
        &Z_1 = W_1X + b_1 \ \ \in R^{p_1\times n}\\
    \Leftrightarrow & Z_1 = W_1X + b_11^{1\times n}
\end{split}
$$

然后是对第一层的各个神经元进行“激活”，对线性组合进行逐元素的函数计算

$$
A_1 = \alpha(Z_1)\ \ \in R^{p_1\times n}
$$

### 2.从第i-1层到第i层

与输入不同，此时是将上一层的激活函数值进行线性组合：

$$
Z_i = W_iA_{i-1} + b_i \ \ \in R^{p_i\times n}
$$

$$
A_i = \alpha(Z_i)\ \ \in R^{p_i\times n}
$$

### 3.从最后一个隐藏层到输出层

假设输入层是第0层，第1——m-1层是隐藏层，第m层是输出层。如果是二分类、回归等情况，则输出层只有一个神经元，若是多分类等情况则有多个神经元，将在后面介绍，暂时假定只有一个输出：

$$
Z_m = W_m A_{m-1} + b_m \ \ \in R^{k\times n}
$$

$$
\hat Y = A_m = g(Z_m)\ \ \in R^{k\times n}
$$

## 四、激活函数

激活函数有多种多样，本质上都是为了进行非线性组合，还有易于进行求导运算以便更新参数。此处简单介绍几种激活函数

### 1.sigmoid函数

Sigmoid函数已经在logistic回归中介绍过：

$$
sigmoid(z)=\frac{1}{1+1^{-z}}
$$

它是一种较早期的激活函数，现在多用于最后输出层的激活而不用在隐藏层中，这是因为当x远离原点时它的梯度会非常接近0，会造成非常著名的“梯度消失”的现象。

考虑sigmoid函数的导数：

$$
\frac{d}{dz} sigmoid(z)=\frac{e^{-z}}{(1+1^{-z})^2}
$$

当z=0时其梯度最大为0.25，当神经网络的层数变深时便是指数倍地降低，这便是“梯度消失”最直观和简洁的解释。

### 2.Relu（Rectified Linear Unit, 线性整流函数）

Relu也曾是红极一时的激活函数，因其简洁的函数形式和导数形式（x大于零导数为1，其他情况为0）使计算成本大大降低，但同时这也带来了神经元没有被激活的情况。这是因为当输入小于0时，输出和梯度都为0，导致神经元“死亡”。

$$
Relu(z) = max(0, z)
$$

$$
\frac{d}{dz}Relu(z) = \begin{cases}
1 & z>0 \\
0 & z\le 0
\end{cases}
$$

### 3.leaky Relu

leaky Relu是我最喜欢的激活函数，因为它兼具了Relu的优点，且当输入小于零时不会出现神经元死亡的情况，k通常的设置为0.1。

$$
leakyRelu(z, k) = max(kz, z)
$$

$$
\frac{d}{dz}leakyRelu(z) = \begin{cases}
1 & z>0 \\
k & z\le 0
\end{cases}
$$

### 4.softmax

softmax是专门用于多分类的输出层的激活函数，有两种等价形式，一种是针对K类有K个输出的线性相关的形式（即下式），另一个是针对K类有K-1个输出的线性无关的形式。

$$
softmax(z) = \begin{bmatrix}
    \frac{e^{z_1}}{\sum_{i=1}^ke^{z_i}}\\
    \frac{e^{z_2}}{\sum_{i=1}^ke^{z_i}}\\
    ...\\
    \frac{e^{z_j}}{\sum_{i=1}^ke^{z_i}}\\
    ...\\
    \frac{e^{z_k}}{\sum_{i=1}^ke^{z_i}}\\
\end{bmatrix} = \begin{bmatrix}
    \hat y_1\\
    \hat y_2\\
    ...\\
    \hat y_i\\
    ...\\
    \hat y_k\\
\end{bmatrix}
$$

它的针对单一分量的偏导数形式和sigmoid函数极为相似：

$$
\begin{split}
    \frac{\partial}{\partial z_i}softmax(z) &= \frac{d}{dz_i} \frac{e^{z_i}}{a+e^{z_i}}= \frac{ae^{z_i}}{(a+e^{z_i})^2} \\
        &= \frac{ae^{z_i}+a^2-a^2}{(a+e^{z_i})^2} \\
        &= \frac{a(e^{z_i}+a)-a^2}{(a+e^{z_i})^2} \\
        &= \frac{a}{a+e^{z_i}} - \left(\frac{a}{a+e^{z_i}}\right)^2 \\
        &= \frac{a}{a+e^{z_i}}\left(1-\frac{a}{a+e^{z_i}}\right) \\
        &= \left(1-\frac{e^{z_i}}{a+e^{z_i}}\right)\frac{e^{z_i}}{a+e^{z_i}}
\end{split}
$$

则它的梯度为：

$$
\triangledown softmax(z)=\begin{bmatrix}
    \hat y_1(1-\hat y_1)\\
    \hat y_2(1-\hat y_2)\\
    ...\\
    \hat y_i(1-\hat y_i)\\
    ...\\
    \hat y_k(1-\hat y_k)\\
\end{bmatrix}
$$

## 五、损失函数

二分类和回归的损失函数不再赘述，和logistic回归和多元线性回归类似，这里介绍多分类的损失函数。

多分类的损失函数和二分类相同，也是通过似然函数进行定义：假设随机变量Y一共有K个取值，第i个样本对第j个取值的概率估计值为：

$$
\begin{split}
P(y_i=j) = \hat y_{ij} \ \ j=1,2,...,k \\
\end{split}
$$

则对n个样本，其似然函数为：

$$
likelihood(Y, \hat Y)=\prod_{i=1}^n\prod_{j=1}^k \hat y_{ij}^{I(y_i=j)}
$$

对其求自然对数，除以样本数进行标准化取负数：

$$
loss(Y, \hat Y) = -\frac1n\sum_{i=1}^n \sum_{j=1}^k I(y_i=j)ln(\hat y_{ij})
$$

这就是最终的损失函数。

## 六、Backward propagation 反向传播

反向传播是神经网络更新参数最经典也是最有效、最具有广泛性的算法。

反向传播的基础仍然是梯度下降法。

### 1.输出层到第一个隐藏层的反向传播

$$
\frac{\partial}{\partial y_ij}loss = \frac1n\frac{I(y_i=j)}{\hat y_{ij}}
$$

由于输出是$\hat Y \in R^{k\times n}$向量，损失对输出层的梯度和输出保持一致的维度：

$$
\frac{\triangledown loss}{\triangledown \hat Y}=
\frac1n\begin{bmatrix}
    \frac{I(y_1=1)}{\hat y_{11}} & \dots & \frac{I(y_1=k)}{\hat y_{1k}}\\
     & \ddots & \vdots \\
    \frac{I(y_n=1)}{\hat y_{n1}} & & \frac{I(y_n=k)}{\hat y_{nk}}
\end{bmatrix}^T \in R^{k\times n}
$$

对输出层的激活函数有$\hat Y = g(Z_m)=softmax(Z_m)\in R^{k\times n}$，$Z_m\in R^{k\times n}$

$$
\frac{\triangledown\hat Y}{\triangledown Z_m} = 
\begin{bmatrix}
    \hat y_{11}(1-\hat y_{11}) & \dots & \hat y_{1k}(1-\hat y_{1k})\\
    & \ddots & \vdots \\
    \hat y_{n1}(1-\hat y_{n1}) &  & \hat y_{nk}(1-\hat y_{nk})
\end{bmatrix}^T \in R^{k\times n}
$$

此时还未涉及到参数的更新，而$Z_m = W_mA_{m-1} + b_m$中$W_m\in R^{p_m=k\times p_{m-1}}$、$b_m\in R^{k\times 1}$、$A_{m-1}\in R^{p_{m-1}\times n}$均为参数，其中前两个好理解，而激活函数值也需要更新是因为它是先前输入的函数，需要通过对激活函数更新使梯度传导到更靠前的隐藏层。

$$
\begin{split}
    &\frac{\triangledown Z_m}{\triangledown W_m} = A_{m-1}\in R^{p_{m-1}\times n}\\
    \\
    &\frac{\triangledown Z_m}{\triangledown A_{m-1}} = W_m^T\in R^{p_{m-1}\times k}\\
    \\
    &\frac{\triangledown Z_m}{\triangledown b_m} = 1^{1\times n} \in R^{1\times n}
\end{split}
$$

将其和之前的梯度结合起来：

$$
\begin{split}
    \frac{\triangledown loss}{\triangledown W_m} &= \left(\frac{\triangledown loss}{\triangledown \hat Y}*\frac{\triangledown\hat Y}{\triangledown Z_m}\right)\left(\frac{\triangledown Z_m}{\triangledown W_m}\right)^T \in R^{p_m=k\times p_{m-1}}\\
    \frac{\triangledown loss}{\triangledown A_{m-1}} &= \frac{\triangledown Z_m}{\triangledown A_{m-1}}\left(\frac{\triangledown loss}{\triangledown \hat Y}*\frac{\triangledown\hat Y}{\triangledown Z_m}\right)\in R^{p_{m-1}\times n}\\
    \frac{\triangledown loss}{\triangledown b_m} &= \left(\frac{\triangledown loss}{\triangledown \hat Y}*\frac{\triangledown\hat Y}{\triangledown Z_m}\right)1^{n\times 1}\in R^{k\times 1}
\end{split}
$$

### 2.第i层到第i-1层的反向传播

从第i层到第i-1层的反向传播和从输出层到最后一个隐藏层的推导相似：

假设$\frac{\triangledown loss}{\triangledown A_i}\in R^{p_i\times n}$已知，

$$
\begin{split}
    A_i &= \begin{bmatrix}
            \alpha (z_{11}) & \dots & \alpha (z_{1p_i})\\
             & \ddots & \vdots \\
            \alpha (z_{n1}) & &\alpha(z_{np_i}) 
          \end{bmatrix}\in R^{p_i\times n}\\
       \frac{\triangledown A_i}{\triangledown Z_i}&=\begin{bmatrix}
            \alpha '(z_{11}) & \dots & \alpha '(z_{1p_i})\\
             & \ddots & \vdots \\
            \alpha '(z_{n1}) & & \alpha '(z_{np_i}) 
       \end{bmatrix}\in R^{p_i\times n}
\end{split}
$$

其余部分和之前的相同

$$
\begin{split}
    &\frac{\triangledown Z_i}{\triangledown W_i} = A_{i-1}\in R^{p_{i-1}\times n}\\
    \\
    &\frac{\triangledown Z_i}{\triangledown A_{i-1}} = W_i^T\in R^{p_{i-1}\times i}\\
    \\
    &\frac{\triangledown Z_i}{\triangledown b_i} = 1^{1\times n} \in R^{1\times n}
\end{split}
$$

将其和之前的梯度结合起来：

$$
\begin{split}
    \frac{\triangledown loss}{\triangledown W_i} &= \left(\frac{\triangledown loss}{\triangledown A_i}*\frac{\triangledown A_i}{\triangledown Z_i}\right)\left(\frac{\triangledown Z_i}{\triangledown W_i}\right)^T \in R^{p_i\times p_{i-1}}\\
    \frac{\triangledown loss}{\triangledown A_{i-1}} &= \frac{\triangledown Z_i}{\triangledown A_{i-1}}\left(\frac{\triangledown loss}{\triangledown A_i}*\frac{\triangledown A_i}{\triangledown Z_m}\right)\in R^{p_{i-1}\times n}\\
    \frac{\triangledown loss}{\triangledown b_i} &= \left(\frac{\triangledown loss}{\triangledown A_i}*\frac{\triangledown A_i}{\triangledown Z_i}\right)1^{n\times 1}\in R^{p_i\times 1}\\
\end{split}
$$

### 3.从第一层到输入层

从第1层到输入层的反向传播和从第i层到第i-1层的推导相似，区别在于输入是固定的数据，而不再是激活函数值，也就不再需要对输入的数据$X$进行更新：

假设$\frac{\triangledown loss}{\triangledown A_1}\in R^{p_i\times n}$已知，

$$
\begin{split}
    A_1 &= \begin{bmatrix}
            \alpha (z_{11}) & \dots & \alpha (z_{1p_1})\\
             & \ddots & \vdots \\
            \alpha (z_{n1}) & &\alpha(z_{np_1}) 
          \end{bmatrix}\in R^{p_i\times n}\\
       \frac{\triangledown A_i}{\triangledown Z_i}&=\begin{bmatrix}
            \alpha '(z_{11}) & \dots & \alpha '(z_{1p_1})\\
             & \ddots & \vdots \\
            \alpha '(z_{n1}) & & \alpha '(z_{np_1})
       \end{bmatrix}\in R^{p_i\times n}
\end{split}
$$

$$
\begin{split}
    &\frac{\triangledown Z_1}{\triangledown W_1} = X \in R^{p\times n}\\
    &\frac{\triangledown Z_1}{\triangledown b_1} = 1^{1\times n} \in R^{1\times n}
\end{split}
$$

$$
\begin{split}
    \frac{\triangledown loss}{\triangledown W_1} &= \left(\frac{\triangledown loss}{\triangledown A_1}*\frac{\triangledown A_1}{\triangledown Z_i}\right)\left(\frac{\triangledown Z_i}{\triangledown W_i}\right)^T \in R^{p_1\times p}\\
    \frac{\triangledown loss}{\triangledown b_1} &= \left(\frac{\triangledown loss}{\triangledown A_1}*\frac{\triangledown A_1}{\triangledown Z_i}\right)1^{n\times 1}\in R^{p_1\times 1}\\
\end{split}
$$

## 七、优化器

优化器是指优化得到参数的方法，优化器基本都是基于梯度下降方法。如果你在线性回归中不用正规方程求解参数，而是用梯度下降，你会发现随着梯度不断下降，**梯度不断减小**。而这还不是最麻烦的问题，由于线性回归是凸优化，用梯度下降总会收敛到最小值，而神经网络多是非凸问题，梯度下降很可能会困在局部极值**无法收敛**。而且通常神经网络需要很多的数据进行训练，如果每次都像传统的梯度下降那样把所有数据都传入模型，则**计算成本很大**。

这里先介绍**SGD**（Stochastic Gradient Descnet，随机梯度下降）优化器。

SGD不再把所有的数据都用来进行梯度下降，而是只用小批量（**mini batch**）数据进行梯度下降，常见的选择是从2的4次方（16）到2的10次方之间，选用2的整数次方是根据计算机比特的特点决定的，而之前推导中梯度进行标准化时除以样本数，此时需要除以一批量的样本数。

控制梯度下降停止的条件也有所改变，由于神经网络强大的非线性组合能力，训练到收敛会造成过拟合，于是神经网络中用到最多的是早停法，也即小批量进行训练时将全部样本循环数遍（**epoch**）后就立即停下，避免过拟合。

## 八、应用

这次采用的是minist手写数字数据集，从kaggle的入门赛下载下来的训练数据集，有兴趣的可以把自己训练好的型跑一下kaggle上的测试数据集提交一下看看分数。（排名就不必了看了...）

In [21]:
import pandas as pd
import numpy as np

train_data = pd.read_csv('minist.csv')
train_data.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
n = train_data.shape[0]
np.random.seed(2099)
index = np.random.permutation(n)
train_index = index[0: int(0.7*n)]
test_index = index[int(0.7*n): n]

test_data = train_data.iloc[test_index]
test_label = test_data['label']
del test_data['label']
test_data = np.array(test_data)
test_label = np.array(test_label).reshape([n-int(0.7*n), 1])

train_data = train_data.iloc[train_index]
train_label = train_data['label']
del train_data['label']
train_data = np.array(train_data)
train_label = np.array(train_label).reshape([int(0.7*n), 1])

In [23]:
def to_category(label, num_classes):
    n = label.shape[0]
    tmp = np.zeros([n, num_classes])
    j = 0
    for i in label:
        tmp[j, i]=1
        j += 1
    return tmp

In [24]:
def soft_max(z):
    """
    :param z: input, an p*n matrix
    :return: p*n matrix
    """
    e = np.exp(z)
    total = np.sum(e, axis=0, keepdims=True)
    weight = e / total
    return weight

In [25]:
def accuracy(y, y_hat):
    y = np.argmax(y, axis=0)
    y_hat = np.argmax(y_hat, axis=0)
    return sum(y == y_hat)/len(y)

In [26]:
def likelihood(y, y_hat):
    """
    :param y: the ture value
    :param y_hat: the predicted value
    :return: minimizing loss is the same as maximizing likelihood function,
            so we spare np.log
    """
    n = y.shape[1]
    return -np.sum(y * np.log(y_hat)) / n

In [27]:
def leaky_relu(x, k=0.3):
    return (x > 0)*x + k*(x < 0)*x


def d_leaky_relu(x, k=0.3):
    return (x > 0) + k*(x < 0)


In [28]:
def init_w(b, a):
    w = np.random.randn(a * b)
    w = np.reshape(w, [b, a])
    return w


def init_b(b):
    b = np.zeros([b, 1])
    return b


In [29]:
def forward(x, parameter, cache):
    """
    :param x:input data p*n matrix
    :param parameter: a dict storing parameters
    :param cache: a dict storing computation result of each layer
    :return: the predicted value
    """
    cache['C1'] = np.dot(parameter['W1'], x) + parameter['b1']
    cache['A1'] = leaky_relu(cache['C1'])
    cache['C2'] = np.dot(parameter['W2'], cache['A1']) + parameter['b2']
    cache['A2'] = leaky_relu(cache['C2'])
    cache['C3'] = np.dot(parameter['W3'], cache['A2']) + parameter['b3']
    cache['A3'] = soft_max(cache['C3'])
    return cache

In [30]:
def back_propagation(x, y, parameter, cache, step):
    """
    X 784*n / Y 10*n
    dW1 W1 800*784, db1 b1 800*1, A1 C1, 800*n
    dW2 W2 400*800, db2 b2 400*1, A2 C2, 400*n
    dW3 W3 10*400, db3 b3 10*1, A3 C3, 10*n
    :param y: true value
    :param parameter: dictionary storing all parameters
    :param cache: dictionary storing all the computation in process
    :param step: learning rate
    :return: updated parameters
    """
    number = y.shape[1]
    cache['dC3'] = cache['A3'] - y  # 10*n
    cache['dW3'] = np.dot(cache['dC3'], cache['A2'].T)/number  # 10*400
    cache['db3'] = np.sum(cache['dC3'], axis=1, keepdims=True)/number  # 10*1
    parameter['W3'] = parameter['W3'] - step*cache['dW3']  # 10*400
    parameter['b3'] = parameter['b3'] - step*cache['db3']  # 10*1

    cache['dC2'] = np.dot(parameter['W3'].T, 
                          cache['dC3'])*d_leaky_relu(cache['C2'])  # 400*n
    cache['dW2'] = np.dot(cache['dC2'], cache['A1'].T)/number  # 400*800
    cache['db2'] = np.sum(cache['dC2'], axis=1, keepdims=True)/number  # 400*1
    parameter['W2'] = parameter['W2'] - step*cache['dW2']  # 400*800
    parameter['b2'] = parameter['b2'] - step*cache['db2']  # 400*1

    cache['dC1'] = np.dot(parameter['W2'].T, 
                          cache['dC2'])*d_leaky_relu(cache['C1'])  # 800*n
    cache['dW1'] = np.dot(cache['dC1'], x.T)/number  # 800*784
    cache['db1'] = np.sum(cache['dC1'], axis=1, keepdims=True)  # 800*1
    parameter['W1'] = parameter['W1'] - step*cache['dW1']  # 800*784
    parameter['b1'] = parameter['b1'] - step*cache['db1']  # 800*1
    return cache, parameter


In [31]:
def train(x, y, learning_rate=0.001, batch_size=128, epoch=5):
    """
    :param x: training data
    :param y: training label
    :param learning_rate: the length of a step
    :param batch_size: numbers of samples we train in a round
    :param epoch: rounds we train through training data
    :return: a trained set of parameters
    """
    parameter = dict()
    nx = x.shape[1]
    parameter['W1'] = init_w(800, 784)/100
    parameter['b1'] = init_b(800)
    parameter['W2'] = init_w(400, 800)/100
    parameter['b2'] = init_b(400)
    parameter['W3'] = init_w(10, 400)/100
    parameter['b3'] = init_b(10)

    index = np.array([], dtype='int')
    for i in range(0, nx, batch_size):
        index = np.append(index, i)
    index = np.append(index, nx)

    cache = dict()
    for i in range(0, epoch):
        for j in range(0, int(nx/batch_size)+1):
            one_batch_x = x[:, index[j]:index[j+1]]
            one_batch_y = y[:, index[j]:index[j+1]]
            cache = forward(one_batch_x, parameter, cache)
            prob = likelihood(one_batch_y, cache['A3'])
            acc = accuracy(one_batch_y, cache['A3'])
            print(str(i)+'--'+str(j)+'--'+str(index[j+1]))
            print('loss: '+str(prob))
            print('accuracy: '+str(acc))
            [cache, parameter] = back_propagation(one_batch_x, one_batch_y,
                                        parameter, cache, step=learning_rate)
    return cache, parameter

In [32]:
train_label = to_category(train_label, num_classes=10)
test_label = to_category(test_label, num_classes=10)

print(train_label.shape)
print(test_label.shape)

(29399, 10)
(12601, 10)


In [33]:
cache, parameter = train(x=train_data.T, y=train_label.T, epoch=5)

0--0--128
likelihood: 2.5144562796833654
accuracy: 0.0703125
0--1--256
likelihood: 2.2170502380328974
accuracy: 0.1875
0--2--384
likelihood: 2.1658027472486903
accuracy: 0.234375
0--3--512
likelihood: 2.0168412770294104
accuracy: 0.296875
0--4--640
likelihood: 1.8946818751163836
accuracy: 0.40625
0--5--768
likelihood: 1.8244820246098055
accuracy: 0.421875
0--6--896
likelihood: 1.6413233158646316
accuracy: 0.515625
0--7--1024
likelihood: 1.6209324889136416
accuracy: 0.5546875
0--8--1152
likelihood: 1.5990883453285492
accuracy: 0.5390625
0--9--1280
likelihood: 1.3856347927896082
accuracy: 0.6484375
0--10--1408
likelihood: 1.303833712288761
accuracy: 0.703125
0--11--1536
likelihood: 1.3310439935553542
accuracy: 0.6484375
0--12--1664
likelihood: 1.1805366474892718
accuracy: 0.7265625
0--13--1792
likelihood: 1.260705592177902
accuracy: 0.6640625
0--14--1920
likelihood: 1.1617022979770701
accuracy: 0.7109375
0--15--2048
likelihood: 1.1734547644587505
accuracy: 0.6875
0--16--2176
likelihood: 

0--131--16896
likelihood: 0.5487121150878685
accuracy: 0.84375
0--132--17024
likelihood: 0.48033160960419174
accuracy: 0.8671875
0--133--17152
likelihood: 0.2827652008103738
accuracy: 0.9375
0--134--17280
likelihood: 0.5280190354445662
accuracy: 0.84375
0--135--17408
likelihood: 0.4013633538688606
accuracy: 0.90625
0--136--17536
likelihood: 0.4341277717816206
accuracy: 0.90625
0--137--17664
likelihood: 0.36986659277996203
accuracy: 0.890625
0--138--17792
likelihood: 0.3639065606743461
accuracy: 0.90625
0--139--17920
likelihood: 0.37277113417056407
accuracy: 0.875
0--140--18048
likelihood: 0.3281691140953412
accuracy: 0.921875
0--141--18176
likelihood: 0.30284969809562756
accuracy: 0.9375
0--142--18304
likelihood: 0.37247722481492523
accuracy: 0.8984375
0--143--18432
likelihood: 0.4380878591385008
accuracy: 0.8828125
0--144--18560
likelihood: 0.37506361842601255
accuracy: 0.8984375
0--145--18688
likelihood: 0.390889982281352
accuracy: 0.90625
0--146--18816
likelihood: 0.3599942963969399

1--31--4096
likelihood: 0.37485050138090525
accuracy: 0.8984375
1--32--4224
likelihood: 0.40481754889162824
accuracy: 0.875
1--33--4352
likelihood: 0.3097313623292942
accuracy: 0.921875
1--34--4480
likelihood: 0.35920632785927703
accuracy: 0.8984375
1--35--4608
likelihood: 0.38687265900328316
accuracy: 0.8984375
1--36--4736
likelihood: 0.41814952218396384
accuracy: 0.8515625
1--37--4864
likelihood: 0.24292502593632326
accuracy: 0.921875
1--38--4992
likelihood: 0.3298984284439357
accuracy: 0.90625
1--39--5120
likelihood: 0.35398397172634577
accuracy: 0.8984375
1--40--5248
likelihood: 0.39377776996960023
accuracy: 0.8984375
1--41--5376
likelihood: 0.2149024317936732
accuracy: 0.9453125
1--42--5504
likelihood: 0.30713244667517786
accuracy: 0.9140625
1--43--5632
likelihood: 0.2937702218335918
accuracy: 0.8828125
1--44--5760
likelihood: 0.3134226331172354
accuracy: 0.921875
1--45--5888
likelihood: 0.5137701880222669
accuracy: 0.8828125
1--46--6016
likelihood: 0.38857735940757243
accuracy: 0

1--161--20736
likelihood: 0.3213300303503326
accuracy: 0.921875
1--162--20864
likelihood: 0.3323971150616603
accuracy: 0.921875
1--163--20992
likelihood: 0.309013077594275
accuracy: 0.8984375
1--164--21120
likelihood: 0.20684476378976666
accuracy: 0.9453125
1--165--21248
likelihood: 0.18287534562771526
accuracy: 0.9453125
1--166--21376
likelihood: 0.36682480166017384
accuracy: 0.8984375
1--167--21504
likelihood: 0.3733493286387235
accuracy: 0.8984375
1--168--21632
likelihood: 0.3130223202982156
accuracy: 0.9296875
1--169--21760
likelihood: 0.2600165076204857
accuracy: 0.9296875
1--170--21888
likelihood: 0.30039068884056314
accuracy: 0.90625
1--171--22016
likelihood: 0.266503238653585
accuracy: 0.9140625
1--172--22144
likelihood: 0.3711276663502726
accuracy: 0.8984375
1--173--22272
likelihood: 0.2760758573351142
accuracy: 0.90625
1--174--22400
likelihood: 0.28228296683718684
accuracy: 0.9296875
1--175--22528
likelihood: 0.2760608740880042
accuracy: 0.9296875
1--176--22656
likelihood: 0.

2--61--7936
likelihood: 0.4169072655482665
accuracy: 0.8984375
2--62--8064
likelihood: 0.2502321834652378
accuracy: 0.921875
2--63--8192
likelihood: 0.2769833967055245
accuracy: 0.9375
2--64--8320
likelihood: 0.2508444807950252
accuracy: 0.90625
2--65--8448
likelihood: 0.17624664860032632
accuracy: 0.9609375
2--66--8576
likelihood: 0.1695451298105187
accuracy: 0.9609375
2--67--8704
likelihood: 0.16605490863698502
accuracy: 0.96875
2--68--8832
likelihood: 0.28904051210435777
accuracy: 0.9140625
2--69--8960
likelihood: 0.2474821702686613
accuracy: 0.9140625
2--70--9088
likelihood: 0.20861731371885953
accuracy: 0.9375
2--71--9216
likelihood: 0.2825250780616856
accuracy: 0.90625
2--72--9344
likelihood: 0.404378257158427
accuracy: 0.859375
2--73--9472
likelihood: 0.28392976515527
accuracy: 0.890625
2--74--9600
likelihood: 0.3913316358280359
accuracy: 0.9140625
2--75--9728
likelihood: 0.2796643443951746
accuracy: 0.90625
2--76--9856
likelihood: 0.28306431578429436
accuracy: 0.9140625
2--77--

2--190--24448
likelihood: 0.26800486477491714
accuracy: 0.9140625
2--191--24576
likelihood: 0.33900175576320246
accuracy: 0.90625
2--192--24704
likelihood: 0.2130926247786701
accuracy: 0.9296875
2--193--24832
likelihood: 0.26249411182650456
accuracy: 0.890625
2--194--24960
likelihood: 0.2911193561797826
accuracy: 0.921875
2--195--25088
likelihood: 0.18439977161928411
accuracy: 0.953125
2--196--25216
likelihood: 0.330365452474791
accuracy: 0.90625
2--197--25344
likelihood: 0.334585838882031
accuracy: 0.8984375
2--198--25472
likelihood: 0.40809922062566883
accuracy: 0.90625
2--199--25600
likelihood: 0.2539642070465923
accuracy: 0.9296875
2--200--25728
likelihood: 0.23080439146098
accuracy: 0.9140625
2--201--25856
likelihood: 0.29950389239497976
accuracy: 0.890625
2--202--25984
likelihood: 0.16835276051997342
accuracy: 0.9609375
2--203--26112
likelihood: 0.25820327049490455
accuracy: 0.9296875
2--204--26240
likelihood: 0.18991310727076
accuracy: 0.9296875
2--205--26368
likelihood: 0.36439

3--90--11648
likelihood: 0.19741723433904124
accuracy: 0.9609375
3--91--11776
likelihood: 0.199206717368618
accuracy: 0.9375
3--92--11904
likelihood: 0.3392661709258577
accuracy: 0.890625
3--93--12032
likelihood: 0.2729901199704835
accuracy: 0.90625
3--94--12160
likelihood: 0.19162003928997212
accuracy: 0.9296875
3--95--12288
likelihood: 0.21215231893450287
accuracy: 0.953125
3--96--12416
likelihood: 0.25332895662748023
accuracy: 0.953125
3--97--12544
likelihood: 0.30036186772331486
accuracy: 0.8984375
3--98--12672
likelihood: 0.15554875667870727
accuracy: 0.96875
3--99--12800
likelihood: 0.3243784842657135
accuracy: 0.9375
3--100--12928
likelihood: 0.20262561838601317
accuracy: 0.9375
3--101--13056
likelihood: 0.21068553752962627
accuracy: 0.9453125
3--102--13184
likelihood: 0.13012513994354524
accuracy: 0.96875
3--103--13312
likelihood: 0.2770658745666306
accuracy: 0.9140625
3--104--13440
likelihood: 0.28400294694329287
accuracy: 0.90625
3--105--13568
likelihood: 0.31554719153324196


3--218--28032
likelihood: 0.1314429630482407
accuracy: 0.9609375
3--219--28160
likelihood: 0.20354789307960947
accuracy: 0.9296875
3--220--28288
likelihood: 0.34624666484090116
accuracy: 0.9140625
3--221--28416
likelihood: 0.24043570528279382
accuracy: 0.9453125
3--222--28544
likelihood: 0.23452090810490678
accuracy: 0.9375
3--223--28672
likelihood: 0.13009297025790123
accuracy: 0.96875
3--224--28800
likelihood: 0.3404381906934405
accuracy: 0.8828125
3--225--28928
likelihood: 0.21532846926624505
accuracy: 0.9296875
3--226--29056
likelihood: 0.2158970796009546
accuracy: 0.9296875
3--227--29184
likelihood: 0.33927296560833353
accuracy: 0.90625
3--228--29312
likelihood: 0.22650685669148862
accuracy: 0.9609375
3--229--29399
likelihood: 0.19164046936831952
accuracy: 0.9540229885057471
4--0--128
likelihood: 0.3473088051415014
accuracy: 0.90625
4--1--256
likelihood: 0.15428242324710867
accuracy: 0.9375
4--2--384
likelihood: 0.37500307217663864
accuracy: 0.90625
4--3--512
likelihood: 0.1815322

4--119--15360
likelihood: 0.12830747834828649
accuracy: 0.9609375
4--120--15488
likelihood: 0.18224703284315363
accuracy: 0.9453125
4--121--15616
likelihood: 0.23220465376461621
accuracy: 0.9609375
4--122--15744
likelihood: 0.2878933457801678
accuracy: 0.8984375
4--123--15872
likelihood: 0.20799687716837525
accuracy: 0.953125
4--124--16000
likelihood: 0.16005518551703848
accuracy: 0.96875
4--125--16128
likelihood: 0.12793902310279098
accuracy: 0.9609375
4--126--16256
likelihood: 0.19433623125283483
accuracy: 0.9453125
4--127--16384
likelihood: 0.13553740913317208
accuracy: 0.9453125
4--128--16512
likelihood: 0.21666334408277954
accuracy: 0.9453125
4--129--16640
likelihood: 0.19770677363120415
accuracy: 0.9453125
4--130--16768
likelihood: 0.1971772671028427
accuracy: 0.9609375
4--131--16896
likelihood: 0.24574046325169374
accuracy: 0.90625
4--132--17024
likelihood: 0.33301069539701794
accuracy: 0.890625
4--133--17152
likelihood: 0.09897151645755026
accuracy: 0.9765625
4--134--17280
like

In [34]:
hat_label = forward(test_data.T, parameter, cache)
hat_label.keys()

dict_keys(['C1', 'A1', 'C2', 'A2', 'C3', 'A3', 'dC3', 'dW3', 'db3', 'dC2', 'dW2', 'db2', 'dC1', 'dW1', 'db1'])

In [35]:
hat_label = hat_label['A3']
hat_label.shape

(10, 12601)

In [36]:
likelihood(test_label.T, hat_label)

0.2211812376862138