# Lecture 22: Multi-Layer Neural Network I


----

### A supervised learning problem
We have access to labeled training samples $(\mathbf{x}^{(i)},y^{(i)})$. Neural networks give a way of defining a complex, highly nonlinear family of model functions (hypotheses) $h(\mathbf{x}; W, b)$, with parameters $W$ and $b$ that we fit to our data. Given enough parameters, such a nonlinear function can approximate very complicated relations between inputs and outputs.

#### References:

* [Simple MNIST numpy from scratch](https://www.kaggle.com/scaomath/simple-mnist-numpy-from-scratch)
* [Stanford Deep Learning Tutorial in Matlab](http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks/)
* [3Blue1Brown's video series on Deep Learning](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)
* [A visual proof that neural nets can compute any function](http://neuralnetworksanddeeplearning.com/chap4.html)
* [Backpropagation and chain rule](http://colah.github.io/posts/2015-08-Backprop/)
* [Chapter 6: Multilayer Feedforward Neural Network in *Deep Learning Book* by Goodfellow et al.](http://www.deeplearningbook.org/contents/mlp.html)

## A feedforward neural network

Below is one example of a feedforward neural network. The name comes from the fact that the connectivity graph has no directed loops or cycles.

<img src="https://faculty.sites.uci.edu/shuhaocao/files/2019/03/neural_net.png" alt="drawing" width="1000"/>

## How a single neuron works in the $l$-th layer

<img src="https://faculty.sites.uci.edu/shuhaocao/files/2019/03/neuron-1.png" alt="drawing" width="900"/>

* $\mathbf{w}$ and $b$: weights and bias
* $\mathbf{a} = (a_1, a_2, a_3)$: input (outputs/activations from the previous layer)

This "neuron" is a computational unit/node that takes an input $\mathbf{a}$, and outputs the 
model function $h^{\text{single} }(\mathbf{a}; \mathbf{w}, b)$ (aka activation):
$$ h^{\text{single} }(\mathbf{a}; \mathbf{w}, b) = f(\mathbf{w}^{\top} \mathbf{a} + b) = f\Big(\sum_{i=1}^3 w_i a_i +b\Big) 
$$
The function $f(\cdot)$ is called an "activation function"; common choices are $\tanh$, Sigmoid and ReLU:
$$
\text{ReLU} (x) = \max(0, x), \; \sigma(x) = \frac{1}{1 + e^{-x}}, \; \tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}
$$
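The single-neuron formula above can be sketched in NumPy as follows. This is only a minimal illustration: the helper name `neuron` and the numerical values of `a`, `w`, and `b` are made up for the example and do not come from the lecture.

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def neuron(a, w, b, f=np.tanh):
    # h_single(a; w, b) = f(w^T a + b)
    return f(np.dot(w, a) + b)

# Illustrative (made-up) inputs, weights and bias for a 3-input neuron
a = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.4])
b = 0.2

# The pre-activation w^T a + b equals 0.1 here; only f changes between the calls
out_relu = neuron(a, w, b, f=relu)
out_sigmoid = neuron(a, w, b, f=sigmoid)
out_tanh = neuron(a, w, b, f=np.tanh)
print(out_relu, out_sigmoid, out_tanh)
```

The three calls share the same weighted sum and differ only in the activation, which is the only nonlinear part of the neuron.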

## Putting neurons together

A neural network is put together by hooking together many of our simple "neurons", so that the output of one neuron can be the input of another. For example, here is a small neural network (a slice of a bigger network):

<img src="https://faculty.sites.uci.edu/shuhaocao/files/2019/04/neural_net_3l.png" alt="drawing" width="800"/>

In this figure, we have used circles to also denote the inputs to the network. The circles labeled "+1" are called bias units. Layer 1 is called the input layer, and Layer 3 is the output layer (which, in this example, has only one node). The middle layer, Layer 2, is called a hidden layer, because its values are not observed in the training set. 

The neural network in our example has 2 *input units* (not counting the bias unit), 3 *hidden units*, and *1 output unit*.

## Parameters and the forward pass

Our neural network has parameters $(W, b) := \big(W^{(1)},b^{(1)},W^{(2)},b^{(2)}\big)$.

* $W^{(l)} = \big(w^{(l)}_{ij}\big)$ denotes the weight matrix, where the entry $ij$ is associated with the connection between unit $j$ in layer $l$ and unit $i$ in layer $l+1$. Note the order of the indices: $j$ indexes the unit closer to the input, that is, the layer this matrix acts on.

* $b^{(l)}_i$ is the bias associated with unit $i$ in layer $l+1$. 

In our example above, we have $W^{(1)}\in \mathbb{R}^{3\times 2}$ and $W^{(2)}\in \mathbb{R}^{1\times 3}$. Note that bias units do not have inputs or connections going into them; for convenience we say that they always output the value `+1`. When we count the number of units in layer $l$, we do not count the bias unit.

## Activation and function compositions

We will write $a^{(l)}_i$ to denote the activation or output value of unit $i$ in layer $l$. For $l=1$, $a^{(1)}_i= x_i$ denotes the $i$-th input to this network. Given a fixed set of parameters $(W,b)$, and the input $\mathbf{x}$, the neural network above defines a model function $h(\mathbf{x}; W, b)$ made of layers of function compositions that outputs a real number. Specifically, the computation that this neural network represents is given by:

$$\begin{aligned}
a_1^{(2)} &= f\big(w_{11}^{(1)}x_1 + w_{12}^{(1)} x_2  + b_1^{(1)}\big)  \\
a_2^{(2)} &= f\big(w_{21}^{(1)}x_1 + w_{22}^{(1)} x_2 + b_2^{(1)}\big)  \\
a_3^{(2)} &= f\big(w_{31}^{(1)}x_1 + w_{32}^{(1)} x_2 + b_3^{(1)}\big)  \\
h(\mathbf{x}; W,b) &=a =  a_1^{(3)} =  f\big(w_{11}^{(2)} a_1^{(2)} + w_{12}^{(2)} a_2^{(2)} + w_{13}^{(2)} a_3^{(2)} + b_1^{(2)}\big) 
\end{aligned}
$$
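As a sanity check, the four equations above can be evaluated one unit at a time. The sketch below assumes $f=\tanh$ and uses randomly drawn parameters and a made-up input, purely for illustration. Note that the code uses 0-based indices, so `W1[i, j]` corresponds to $w^{(1)}_{i+1,\,j+1}$.

```python
import numpy as np

f = np.tanh  # activation function (an assumption for this sketch)

# Randomly drawn parameters for the 2-3-1 network (illustrative values only)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))  # W^(1): row i holds the weights going into hidden unit i
b1 = rng.normal(size=3)       # b^(1): one bias per hidden unit
W2 = rng.normal(size=(1, 3))  # W^(2): weights going into the single output unit
b2 = rng.normal(size=1)       # b^(2): bias of the output unit

x = np.array([0.5, -1.0])     # made-up input (x_1, x_2)

# Hidden activations a^(2)_i, computed one unit at a time
a2 = np.array([f(W1[i, 0] * x[0] + W1[i, 1] * x[1] + b1[i]) for i in range(3)])

# Output h(x; W, b) = a^(3)_1
h = f(W2[0, 0] * a2[0] + W2[0, 1] * a2[1] + W2[0, 2] * a2[2] + b2[0])
print(a2, h)
```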

----

## Compact notation: forward pass

If we allow the activation function $f(\cdot)$ to act on vectors in an element-wise fashion, $f([z_1,z_2,z_3])=[f(z_1),f(z_2),f(z_3)]$, then we can write the equations above more compactly as:
$$\begin{aligned}
\mathbf{z}^{(2)} &= W^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \\
\mathbf{a}^{(2)} &= f(\mathbf{z}^{(2)}) \\
\mathbf{z}^{(3)} &= W^{(2)} \mathbf{a}^{(2)} + \mathbf{b}^{(2)} \\
h(\mathbf{x}; W, b) &= \mathbf{a}^{(3)} = f(\mathbf{z}^{(3)})
\end{aligned}
$$
More generally, recalling that $\mathbf{a}^{(1)}=\mathbf{x}$ also denotes the values from the input layer, then given layer $l$'s activations $\mathbf{a}^{(l)}$, we can compute layer $(l+1)$'s activations $\mathbf{a}^{(l+1)}$ as:
$$
\begin{aligned}
\mathbf{z}^{(l+1)} &= W^{(l)} \mathbf{a}^{(l)} + \mathbf{b}^{(l)}   \\
\mathbf{a}^{(l+1)} &= f(\mathbf{z}^{(l+1)})
\end{aligned}
$$
By organizing the parameters in matrices and using matrix-vector operations, we can take advantage of fast linear algebra routines to quickly perform calculations in our network.
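The layer-by-layer recursion above translates into a short loop. The following is a minimal sketch, not the implementation used later in this lecture: the function name `forward`, the choice of ReLU, and the random initialization are assumptions made for illustration. It follows the column-vector convention of the formulas, so each `W` has shape `(s_{l+1}, s_l)`.

```python
import numpy as np

def relu(z):
    # element-wise ReLU
    return np.maximum(0.0, z)

def forward(x, weights, biases, f=relu):
    """Return the list of activations [a^(1), a^(2), ..., a^(n_l)] for input x."""
    activations = [x]                      # a^(1) = x
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b        # z^(l+1) = W^(l) a^(l) + b^(l)
        activations.append(f(z))           # a^(l+1) = f(z^(l+1))
    return activations

# The 2-3-1 network from the example, with made-up random weights and zero biases
layer_sizes = [2, 3, 1]
rng = np.random.default_rng(0)
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

activations = forward(np.array([0.5, -1.0]), weights, biases)
print([a.shape for a in activations])
```

Keeping all intermediate activations, rather than only the output, is a design choice that makes the same function reusable when gradients are computed by backpropagation.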

## Loss function for regression

Suppose we have a fixed and labeled training set $\{ (\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(N)}, y^{(N)}) \}$ of $N$ training examples. For a single training sample and its target value $(\mathbf{x}, y)$, we define the sample loss function with respect to this single example to be:
$$
J(W,b; \mathbf{x},y) = \frac{1}{2} \big| h(\mathbf{x}; W,b) - y \big|^2,
$$
or, if the label is a vector,
$$
J(W,b; \mathbf{x},y) = \frac{1}{2} \big\| h(\mathbf{x}; W,b) - y \big\|^2.
$$

The factor $1/2$ is included so that it cancels the factor $2$ that appears when we differentiate the square, leaving a derivative without extra constants. The overall loss function is then the mean of the sample losses plus a regularization term (aka a weight decay term), which tends to decrease the magnitude of the weights $w_{ij}^{(l)}$, but not the biases, and helps prevent overfitting:
$$
\begin{aligned}
J(W,b)
&=  \frac{1}{N} \sum_{i=1}^N J(W,b;\mathbf{x}^{(i)},y^{(i)}) 
                       + \frac{\epsilon}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( w^{(l)}_{ji} \right)^2
 \\
&= \frac{1}{N} \sum_{i=1}^N \left( \frac{1}{2} \left\| h(\mathbf{x}^{(i)}; W,b) - y^{(i)} \right\|^2 \right) 
+ \frac{\epsilon}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( w^{(l)}_{ji} \right)^2,
\end{aligned}
$$
where $n_l$ denotes the number of layers in the network, and $s_l$ denotes the number of nodes in layer $l$ (not counting the bias unit).
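The overall loss can be sketched in a few lines of NumPy. In this illustration the function name `total_loss`, the argument layout (predictions and targets stacked as rows), and the toy numbers are all assumptions, chosen so that the result can be checked by hand.

```python
import numpy as np

def total_loss(predictions, targets, weights, eps):
    """Mean of the sample losses plus the weight decay term.

    predictions, targets: arrays of shape (N, d), one row per sample
    weights: list of the weight matrices W^(l); biases are not regularized
    eps: weight decay coefficient (epsilon in the formula)
    """
    N = predictions.shape[0]
    # (1/N) * sum_i (1/2) * ||h(x_i) - y_i||^2
    data_term = 0.5 * np.sum((predictions - targets) ** 2) / N
    # (eps/2) * sum of all squared weight entries over all layers
    decay_term = 0.5 * eps * sum(np.sum(W ** 2) for W in weights)
    return data_term + decay_term

# Toy example: two samples with scalar outputs, two small weight matrices
predictions = np.array([[1.0], [2.0]])
targets = np.array([[0.0], [2.0]])
weights = [np.array([[1.0, 2.0]]), np.array([[3.0]])]

J = total_loss(predictions, targets, weights, eps=0.1)
# data term: 0.5 * (1 + 0) / 2 = 0.25; decay term: 0.05 * (1 + 4 + 9) = 0.7
print(J)
```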

## How a neural net works in action

We are going to perform forward passes for a trained neural net whose input layer has 784 input units (28x28 grayscale images), followed by 1 hidden layer with 256 hidden units (neurons) and an output layer with 10 units (one per class). The activation is ReLU.

----

#### Implementation Remark:
The cost function above is often used both for classification and for regression problems. For classification, we let $y=0$ or $1$ represent the two class labels (recall that the sigmoid activation function outputs values in $(0,1)$; if we were using a $\tanh$ activation function, we would instead use $-1$ and $+1$ to denote the labels). For regression problems, we first scale our outputs to ensure that they lie in the $[0,1]$ range or the $[-1,1]$ range. Most of the time, rescaling the inputs is helpful too.

In [None]:
import numpy as np

# first the implementation of a vectorized ReLU
def relu(x):
    # element-wise max(0, x)
    return np.maximum(x, 0)

In [None]:
# X is the input, of which the second dimension is 784 (one sample per row)
# if X is the training samples, X.shape = (60000, 784)
# if X is a single testing sample, we should make X's shape to be (1, 784)
# because samples are stored as rows, we compute X W instead of W x as in the formulas
# W is the weight, which is implemented as a list so that
# W[0].shape = (784, 256), W[0] maps the input layer to the hidden layer
# W[1].shape = (256, 10), W[1] maps the output from the hidden layer to the output layer (10 classes)
# b is the bias in each layer, it is also a list so that
# b[0].shape = (256,) and b[0] is the bias added when going from the input layer to the hidden layer
# b[1].shape = (10,) and b[1] is the bias added when going from the hidden layer to the output layer
def h(X,W,b):
    # layer 1 = input layer
    a1 = X
    # layer 1 (input layer) -> layer 2 (hidden layer)
    z2 = np.matmul(X, W[0]) + b[0]
    # layer 2 activation
    a2 = relu(z2)
    # layer 2 (hidden layer) -> layer 3 (output layer)
    z3 = np.matmul(a2, W[1]) + b[1]
    # output layer activation
    output = relu(z3)
    return output 

In [None]:
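# load the trained weights (open the file once and read both arrays)
# allow_pickle=True is needed if the lists of differently shaped arrays
# were saved as object arrays
params = np.load('weights.npz', allow_pickle=True)
W = params['weights']
b = params['bias']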

In [None]:
X_test = np.load('mnist_test.npz')['X']/255
y_test = np.load('mnist_test.npz')['y'].astype(int)

In [None]:
y_pred = np.argmax(h(X_test, W, b), axis=1)  # pick the biggest activated output unit's index as our prediction

In [None]:
print("accuracy is:", np.mean(y_pred == y_test))