# Two-layer Neural Network
The network uses a sigmoid activation in the hidden layer and a softmax in the output layer. Assume the one-hot label vector is \\(y\\) and that the cost is cross entropy.

$$z_{1} = xW_{1} + b_{1}$$
$$h_{1} = \mathrm{sigmoid}(z_{1})$$
$$z_{2} = h_{1}W_{2} + b_{2}$$
$$\hat{y} = \mathrm{softmax}(z_{2})$$
$$CE(y, \hat{y}) = -\frac{1}{N} \sum_{i} y_i \log \hat{y}_i$$

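Here \\(N\\) is the number of examples, and the sum runs over every example and class. Applying the chain rule gives the following gradients: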

$$\frac{\partial CE(y, \hat{y})}{\partial z_{2}} = \frac{1}{N} (\hat{y} - y)$$
$$\frac{\partial z_{2}} {\partial h_{1}} = W_{2}^\top$$
$$\frac{\partial h_{1}} {\partial z_{1}} = h_{1}(1-h_{1})$$
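
Chaining these three factors gives the parameter gradients that the backward pass below computes. As a sketch, write \\(\delta_{2}\\) and \\(\delta_{1}\\) for the gradients of the cost with respect to \\(z_{2}\\) and \\(z_{1}\\) (shorthand introduced here, not part of the original notes), let \\(\odot\\) denote elementwise multiplication, and let \\(\mathbf{1}\\) be a column vector of \\(N\\) ones:

$$\delta_{2} = \frac{1}{N} (\hat{y} - y)$$
$$\frac{\partial CE}{\partial W_{2}} = h_{1}^\top \delta_{2}, \qquad \frac{\partial CE}{\partial b_{2}} = \mathbf{1}^\top \delta_{2}$$
$$\delta_{1} = (\delta_{2} W_{2}^\top) \odot h_{1} \odot (1-h_{1})$$
$$\frac{\partial CE}{\partial W_{1}} = x^\top \delta_{1}, \qquad \frac{\partial CE}{\partial b_{1}} = \mathbf{1}^\top \delta_{1}$$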

In [None]:
import numpy as np
import random

from q1_softmax import softmax
from q2_sigmoid import sigmoid, sigmoid_grad
from q2_gradcheck import gradcheck_naive
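
The helper modules `q1_softmax`, `q2_sigmoid` and `q2_gradcheck` are not shown in this notebook. If they are unavailable, stand-ins along the following lines should be enough for the cells below. This is a minimal sketch under assumptions: `softmax` is taken to act row-wise on a 2-D array, and `sigmoid_grad` is taken to receive the sigmoid output rather than its input. The real modules may differ.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax (assumed convention: one example per row)."""
    x = np.atleast_2d(x)
    # Subtract the row maximum so np.exp never overflows.
    shifted = x - np.max(x, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(s):
    """Derivative of the sigmoid, given s = sigmoid(x) (assumed convention)."""
    return s * (1.0 - s)
```

Subtracting the row maximum leaves the softmax unchanged mathematically and only guards against overflow for large inputs.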

In [None]:
N = 20
dimensions = [10, 5, 10]
X = np.random.randn(N, dimensions[0])   # each row will be a datum
labels = np.zeros((N, dimensions[2]))
for i in range(N):
    # random.randint is inclusive at both ends, hence the -1
    labels[i, random.randint(0, dimensions[2] - 1)] = 1
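
The loop above sets one random class per row. A vectorised way to build the same kind of one-hot matrix is sketched here. The names `class_ids` and `one_hot` are illustrative, and the seeded generator is an addition for reproducibility that the original does not use.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only so the example is repeatable
N, Dy = 20, 10

# Draw one class index per example; the upper bound is exclusive here.
class_ids = rng.integers(0, Dy, size=N)

# Write a single 1 into each row at the chosen class column.
one_hot = np.zeros((N, Dy))
one_hot[np.arange(N), class_ids] = 1
```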

In [None]:
X.shape, labels.shape

In [None]:
n_params = (dimensions[0] + 1) * dimensions[1] + (dimensions[1] + 1) * dimensions[2]
params = np.random.randn(n_params)

In [None]:
params.shape

In [None]:
# (Dx, H, Dy) = (10, 5, 10)
# (Dx + 1) * H + (H + 1) * Dy = 55 + 60 = 115
# params packs W1, b1, W2, b2, in that order, into one flat vector.

In [None]:
ofs = 0
Dx, H, Dy = (dimensions[0], dimensions[1], dimensions[2])

W1 = np.reshape(params[ofs:ofs + Dx * H], (Dx, H))
ofs += Dx * H
b1 = np.reshape(params[ofs:ofs + H], (1, H))
ofs += H
W2 = np.reshape(params[ofs:ofs + H * Dy], (H, Dy))
ofs += H * Dy
b2 = np.reshape(params[ofs:ofs + Dy], (1, Dy))

In [None]:
Dx, H, Dy

In [None]:
W1.shape, b1.shape, W2.shape, b2.shape

In [None]:
# The cells above unpacked the flat parameter vector into W1, b1, W2, b2.
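
The unpacking steps can be collected into one reusable function. The sketch below assumes the same packing order as above (W1, b1, W2, b2). The function name `unpack_params` and the `demo_` variables are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np

def unpack_params(params, Dx, H, Dy):
    """Split a flat vector into W1 (Dx, H), b1 (1, H), W2 (H, Dy), b2 (1, Dy)."""
    expected = (Dx + 1) * H + (H + 1) * Dy
    if params.size != expected:
        raise ValueError(f"expected {expected} parameters, got {params.size}")
    ofs = 0
    W1 = params[ofs:ofs + Dx * H].reshape(Dx, H)
    ofs += Dx * H
    b1 = params[ofs:ofs + H].reshape(1, H)
    ofs += H
    W2 = params[ofs:ofs + H * Dy].reshape(H, Dy)
    ofs += H * Dy
    b2 = params[ofs:ofs + Dy].reshape(1, Dy)
    return W1, b1, W2, b2

# Use 0..114 as parameters so each slice boundary is easy to read off.
demo_params = np.arange(115, dtype=float)
demo_W1, demo_b1, demo_W2, demo_b2 = unpack_params(demo_params, 10, 5, 10)
```

With these sizes, W1 takes entries 0 to 49, b1 takes 50 to 54, W2 takes 55 to 104, and b2 takes 105 to 114.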

In [None]:
z1 = np.dot(X, W1) + b1
z1.shape

In [None]:
h1 = sigmoid(z1)
h1.shape

In [None]:
z2 = np.dot(h1, W2) + b2
z2.shape

In [None]:
y_hat = softmax(z2)
y_hat.shape

In [None]:
log_y_hat = np.log(y_hat)
log_y_hat.shape

In [None]:
labels.shape

In [None]:
ce_y_y_hat = - np.sum(labels * log_y_hat) / N
ce_y_y_hat

In [None]:
# Forward pass complete: the cost is stored in ce_y_y_hat.
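
The forward-pass cells can be wrapped into a single function, which also makes a simple sanity check possible: with all-zero parameters the softmax output is uniform, so the cost should equal \\(\log D_y\\). The sketch below is self-contained and defines its own `sigmoid` and `softmax` as assumptions about what the imported helpers do. The function name `forward` and the `_demo` variables are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Row-wise softmax with the row maximum subtracted for stability.
    e = np.exp(x - np.max(x, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def forward(X, labels, W1, b1, W2, b2):
    """Return the mean cross-entropy cost plus the intermediate activations."""
    h1 = sigmoid(X @ W1 + b1)
    y_hat = softmax(h1 @ W2 + b2)
    cost = -np.sum(labels * np.log(y_hat)) / X.shape[0]
    return cost, h1, y_hat

# Small all-zero network: 4 examples, 3 inputs, 2 hidden units, 5 classes.
N, Dx, H, Dy = 4, 3, 2, 5
X_demo = np.ones((N, Dx))
labels_demo = np.eye(Dy)[:N]          # first 4 rows of the identity are one-hot
cost_demo, h1_demo, y_hat_demo = forward(
    X_demo, labels_demo,
    np.zeros((Dx, H)), np.zeros((1, H)),
    np.zeros((H, Dy)), np.zeros((1, Dy)),
)
# cost_demo should be close to np.log(5), since every class gets probability 0.2
```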

In [None]:
grad_z2 = (y_hat - labels) / N
grad_z2.shape

In [None]:
grad_b2 = np.sum(grad_z2, axis=0).reshape((1, -1))
grad_b2.shape

In [None]:
grad_W2 = np.dot(h1.T, grad_z2)
grad_W2.shape

In [None]:
grad_h1 = np.dot(grad_z2, W2.T)
grad_h1.shape

In [None]:
grad_z1 = grad_h1 * h1 * (1-h1)
grad_z1.shape

In [None]:
grad_b1 = np.sum(grad_z1, axis=0).reshape((1, -1))
grad_b1.shape

In [None]:
grad_W1 = np.dot(X.T, grad_z1)
grad_W1.shape

In [None]:
# Backward pass complete: the gradients are in grad_W1, grad_b1, grad_W2, grad_b2.
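
The notebook imports `gradcheck_naive` but never calls it, and its signature is not shown here. As a stand-alone alternative, the analytic gradients can be compared against central finite differences. The sketch below re-implements the forward and backward passes in one function and packs the gradients in the same order as `params`. The names `cost_and_grad`, `numerical_grad` and the `_demo` variables are hypothetical, and the seeded generator is an addition for repeatability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def cost_and_grad(params, X, labels, dims):
    """Cost and flat gradient for the two-layer network (order: W1, b1, W2, b2)."""
    Dx, H, Dy = dims
    N = X.shape[0]
    ofs = 0
    W1 = params[ofs:ofs + Dx * H].reshape(Dx, H); ofs += Dx * H
    b1 = params[ofs:ofs + H].reshape(1, H); ofs += H
    W2 = params[ofs:ofs + H * Dy].reshape(H, Dy); ofs += H * Dy
    b2 = params[ofs:ofs + Dy].reshape(1, Dy)

    # Forward pass
    h1 = sigmoid(X @ W1 + b1)
    y_hat = softmax(h1 @ W2 + b2)
    cost = -np.sum(labels * np.log(y_hat)) / N

    # Backward pass, mirroring the cells above
    grad_z2 = (y_hat - labels) / N
    grad_W2 = h1.T @ grad_z2
    grad_b2 = grad_z2.sum(axis=0)
    grad_z1 = (grad_z2 @ W2.T) * h1 * (1 - h1)
    grad_W1 = X.T @ grad_z1
    grad_b1 = grad_z1.sum(axis=0)

    grad = np.concatenate(
        [grad_W1.ravel(), grad_b1.ravel(), grad_W2.ravel(), grad_b2.ravel()]
    )
    return cost, grad

def numerical_grad(f, params, eps=1e-5):
    """Central-difference estimate of the gradient of scalar function f."""
    grad = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (f(params + step) - f(params - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
dims_demo = (10, 5, 10)
N_demo = 20
X_demo = rng.standard_normal((N_demo, dims_demo[0]))
labels_demo = np.zeros((N_demo, dims_demo[2]))
labels_demo[np.arange(N_demo), rng.integers(0, dims_demo[2], size=N_demo)] = 1
params_demo = rng.standard_normal(
    (dims_demo[0] + 1) * dims_demo[1] + (dims_demo[1] + 1) * dims_demo[2]
)

cost_demo, analytic = cost_and_grad(params_demo, X_demo, labels_demo, dims_demo)
numeric = numerical_grad(
    lambda p: cost_and_grad(p, X_demo, labels_demo, dims_demo)[0], params_demo
)
max_diff = np.max(np.abs(analytic - numeric))
print("largest analytic vs. numeric difference:", max_diff)
```

A small `max_diff` indicates that the analytic and numerical gradients agree. A large value usually points to a missing factor such as \\(1/N\\) or to a packing-order mismatch.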