# $$Neuron~~and~~Neural~~Networks$$

# **Definition**

## **Neuron**

### **Definition**

Airplanes mimic how birds fly, and the Shinkansen's nose mimics the kingfisher's sleek beak. Advancing technology by imitating nature in this way is called biomimicry. In machine learning, since our goal is to make machines that can think like humans, we mimic neurons, the building blocks of the brain, and that is how **Neural Networks** came about.

But first, let's find out what neurons are and how they can be applied to machine learning (click <a href="https://kite.com/wp-content/uploads/2019/05/image8.png">here</a> to see the structure of a neuron).
A neuron uses its **dendrites** to receive electrical pulses (called **spikes**), processes them in the cell body, and sends the output along its **axon**.

Similarly, an artificial neuron is basically a computational unit that takes input features $x_1, x_2, ..., x_n$ (the dendrites) as electrical inputs and channels them to an output (the axon). In this model, the input node $x_0$, and more generally $a^{(i)}_0$ in layer $i$, is sometimes called the **bias unit**; it is always equal to $1$.

### **Usage in Neural Networks**

In a neural network, neurons are nodes that each hold a number between $0$ and $1$ called the **activation**, sort of analogous to how neurons in the brain can be active or inactive. You can imagine them like <a href="https://www.3blue1brown.com/content/lessons/2017/neural-networks/activations.svg">this</a>.

## **Neural Networks**

### **Definition**

A neural network is composed of many layers of neurons, hence the name multilayer perceptron. <a href="https://victorzhou.com/media/nn-series/network.svg">Here</a> is an illustration of a simple 4-layer neural network. This is the foundation for many other deep learning models like **Convolutional** and **Recurrent Neural Networks**.

The first layer (layer $1$) is called the **input layer**; its outputs are passed to the nodes of the intermediate layers (layers $2, 3, ..., n - 1$), called the **hidden layers**. Finally, the last layer, called the **output layer**, outputs the hypothesis function. There are no cycles or loops in the network: information flows only forward, from the input layer, through the hidden layers, to the output layer. This is called **feedforward**.

We label these intermediate layer nodes $a^{(2)}_0, a^{(2)}_1, ..., a^{(2)}_n$ and call them **activation units**, where $a^{(2)}_0$ is also a **bias unit**.

Some notation explained:
$$a^{(i)}_j = \text{``activation" of unit $j$ in layer $i$}.$$ 
$$\theta^{(i)} = \text{matrix of weights controlling function mapping from layer $i$ to layer $i+1$}.$$
$$\theta^{(i)}_{jk} = \text{the weight on the connection from the $k^{th}$ unit in layer $i$ to the $j^{th}$ unit in layer $i+1$.}$$
From this, we have some special cases:

* $a^{(1)} = x$ is the input layer.

* $a^{(i)}_0 = 1$

* $\theta^{(i)}_{j0} = b^{(i)}_j$ ($b$ stands for **bias**)

### **Feedforward**

If we had only one hidden layer, it would look like this:
$$[x_0x_1x_2x_3] \rightarrow [a^{(2)}_1a^{(2)}_2a^{(2)}_3] \rightarrow f_\theta(x)$$
For each **activation** node, the value is obtained as follows:
$$a^{(2)}_1 = g(\theta^{(1)}_{10}x_0 + \theta^{(1)}_{11}x_1 + \theta^{(1)}_{12}x_2 + \theta^{(1)}_{13}x_3)$$
$$a^{(2)}_2 = g(\theta^{(1)}_{20}x_0 + \theta^{(1)}_{21}x_1 + \theta^{(1)}_{22}x_2 + \theta^{(1)}_{23}x_3)$$
$$a^{(2)}_3 = g(\theta^{(1)}_{30}x_0 + \theta^{(1)}_{31}x_1 + \theta^{(1)}_{32}x_2 + \theta^{(1)}_{33}x_3)$$
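The next layer (layer $3$) will be the output, computed from the hidden activations together with the bias unit $a^{(2)}_0 = 1$:
$$f_\theta(x) = a^{(3)}_1 = g(\theta^{(2)}_{10}a^{(2)}_0 + \theta^{(2)}_{11}a^{(2)}_1 + \theta^{(2)}_{12}a^{(2)}_2 + \theta^{(2)}_{13}a^{(2)}_3)$$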
To sum up:
$$a^{(i+1)}_j = g\left(\sum_{k=0}^{n}{\theta^{(i)}_{jk}a^{(i)}_k}\right)$$
$$a^{(i+1)}_j = g(\theta^{(i)}_{j0}a^{(i)}_0 + \theta^{(i)}_{j1}a^{(i)}_1 + \theta^{(i)}_{j2}a^{(i)}_2 + ... + \theta^{(i)}_{jn}a^{(i)}_n)$$
where $g$ is the **sigmoid** function, $g(z) = \frac{1}{1 + e^{-z}}$, and $a^{(1)} = x$.
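The unit-by-unit computation for a network with 3 inputs, 3 hidden units and 1 output can be sketched in plain Python. This is only an illustration: the helper name `layer_activations` and all weight values are made up for the example, and the output unit is assumed to take the hidden activations (plus a bias unit) as its inputs.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_activations(theta, inputs):
    """Compute the activations of one layer, unit by unit.

    theta  -- list of rows; row j holds the weights into unit j,
              with the bias weight in column 0
    inputs -- activations of the previous layer WITHOUT the bias unit
    """
    with_bias = [1.0] + list(inputs)  # prepend the bias unit (always 1)
    return [
        sigmoid(sum(w * a for w, a in zip(row, with_bias)))
        for row in theta
    ]


# Hypothetical weights, chosen only for illustration.
theta1 = [  # 3 hidden units x (3 inputs + 1 bias)
    [0.1, 0.2, -0.3, 0.4],
    [-0.5, 0.6, 0.7, -0.8],
    [0.9, -1.0, 1.1, 1.2],
]
theta2 = [[0.3, -0.6, 0.9, -1.2]]  # 1 output unit x (3 hidden + 1 bias)

x = [1.0, 0.0, 2.0]                   # x_1, x_2, x_3
a2 = layer_activations(theta1, x)     # hidden layer
f = layer_activations(theta2, a2)[0]  # output layer
print(a2, f)
```

Each row of a weight matrix produces exactly one activation, which is why `theta1` has three rows and `theta2` has one.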

We apply each row of the parameters to our inputs to obtain the value for one activation node. Our hypothesis output is the logistic function applied to the weighted sum of our activation nodes, where the weights come from yet another parameter matrix $\theta^{(2)}$ containing the weights for our second layer of nodes.

In this example, the $\theta^{(1)}$ that maps the first layer to the second layer is a $3 \times 4$ matrix: one row for each of the $3$ nodes in the second layer, and four columns for the $3$ input nodes plus $1$ bias node.

Generally speaking, to determine the dimensions of these matrices of weights, we apply this formula: **if the network has** $s_j$ **units in layer** $j$ **and** $s_{j+1}$ **units in layer** $j+1$**, then** $\theta^{(j)}$ **will be of dimensions** 
$$s_{j+1} \times (s_j + 1)$$ 
The $+1$ in $s_j + 1$ comes from the bias unit: each row of $\theta^{(j)}$ has one extra weight $\theta^{(j)}_{k0}$ that multiplies the bias unit $a^{(j)}_0 = 1$. In other words, the output nodes do not include the bias node while the inputs do.
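As a quick check of the dimension rule, the short sketch below lists the shape of every weight matrix for a given list of layer sizes. The function name `theta_shapes` is hypothetical, and the layer sizes are assumed to exclude bias units.

```python
def theta_shapes(layer_sizes):
    """Return the (rows, columns) shape of each weight matrix.

    layer_sizes -- number of units per layer, bias units excluded,
                   e.g. [3, 3, 1] for 3 inputs, 3 hidden units, 1 output
    """
    return [
        (s_next, s_curr + 1)  # +1 column for the bias unit of the source layer
        for s_curr, s_next in zip(layer_sizes, layer_sizes[1:])
    ]


print(theta_shapes([3, 3, 1]))  # [(3, 4), (1, 4)]
```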

### **Vectorized Implementation**

To implement this more efficiently, we can define a new variable $z^{(i)}_j$ that holds the argument of our activation function $g$. Our activation units then become: 
$$a^{(2)}_1 = g(z^{(2)}_1)$$
$$a^{(2)}_2 = g(z^{(2)}_2)$$
$$a^{(2)}_3 = g(z^{(2)}_3)$$
Generally:
$$a^{(i)}_j = g(z^{(i)}_j)$$

This means that, for layer $i = 2$ and node $j$, the variable $z$ stands for:
$$z^{(2)}_j = \theta^{(1)}_{j0}x_0 + \theta^{(1)}_{j1}x_1 + \theta^{(1)}_{j2}x_2 + ... + \theta^{(1)}_{jn}x_n$$

The vector $x$ can be represented as:
$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ ... \\ x_n \end{bmatrix}$$
but since the input $x$ can be regarded as the activation units of the first layer ($i = 1$), that is $x = a^{(1)}$, we can write the nodes of any layer in the network as:
$$a^{(i)} = \begin{bmatrix} a^{(i)}_0 \\ a^{(i)}_1 \\ a^{(i)}_2 \\ ... \\ a^{(i)}_n \end{bmatrix}$$

Writing $a^{(i)}$, $z^{(i+1)}$ and the weights in vector and matrix form, with $n$ the number of nodes in layer $i$ and $m$ the number of nodes in layer $i+1$ (bias units not counted), we obtain:
$$a^{(i)} = \begin{bmatrix} a^{(i)}_0 \\ a^{(i)}_1 \\ a^{(i)}_2 \\ ... \\ a^{(i)}_n \end{bmatrix}~~~~z^{(i+1)} = \begin{bmatrix} z^{(i+1)}_1 \\ z^{(i+1)}_2 \\ ... \\ z^{(i+1)}_m \end{bmatrix}~~~~\Theta^{(i)} = \begin{bmatrix} 
            \theta^{(i)}_{10} & \theta^{(i)}_{11} & \theta^{(i)}_{12} & ... & \theta^{(i)}_{1k} & ... & \theta^{(i)}_{1n} \\
            \theta^{(i)}_{20} & \theta^{(i)}_{21} & \theta^{(i)}_{22} & ... & \theta^{(i)}_{2k} & ... & \theta^{(i)}_{2n} \\
            & & & ... & \\
            \theta^{(i)}_{j0} & \theta^{(i)}_{j1} & \theta^{(i)}_{j2} & ... & \theta^{(i)}_{jk} & ... & \theta^{(i)}_{jn} \\
            & & & ... & \\
            \theta^{(i)}_{m0} & \theta^{(i)}_{m1} & \theta^{(i)}_{m2} & ... & \theta^{(i)}_{mk} & ... & \theta^{(i)}_{mn}
                            \end{bmatrix}$$
You can see that the $\Theta^{(i)}$ matrix has $m$ rows and $(n+1)$ columns ($m \times (n+1)$), where $n$ is the number of nodes in the current layer $(i)$ and $m$ is the number of nodes in the next layer $(i+1)$, which agrees with our formula from before. 

With these matrices, we can simplify the complex computation for $z^{(i)}$ as:
$$z^{(i)} = \Theta^{(i-1)}a^{(i-1)}$$

Here we multiply the matrix $\Theta^{(i-1)}$, with dimensions $m \times (n+1)$, by the vector $a^{(i-1)}$ (bias unit included), with dimensions $(n+1) \times 1$, where $n$ and $m$ now count the nodes in layers $i-1$ and $i$. This gives us the vector $z^{(i)}$ with dimensions $m \times 1$. We can then obtain the vector of activation nodes for layer $i$ by applying the activation function element-wise to $z^{(i)}$:
$$a^{(i)} = g(z^{(i)})$$
We can then add a bias unit (equal to $1$) to layer $i$ after we have computed $a^{(i)}$; this is the element $a^{(i)}_0 = 1$. To obtain our final hypothesis, let's compute another $z$ vector:
$$z^{(i+1)} = \Theta^{(i)}a^{(i)}$$
Since this last layer has only one node, $\Theta^{(i)}$ has a single row, with dimensions $1 \times (m + 1)$; multiplying it by the column vector $a^{(i)}$ gives a single number. Our final result is then:
$$f_\theta(x) = a^{(i+1)} = g(z^{(i+1)})$$
Notice that in this last step, between layer $i$ and layer $i+1$, we are doing exactly the same thing as we did in logistic regression. Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting and more complex non-linear hypotheses.
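The vectorized steps above can be sketched with NumPy as a short loop over the weight matrices. This is a minimal illustration rather than a reference implementation: the function name `forward`, the random weights and the 3-3-1 layout are all assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def forward(thetas, x):
    """Feed x forward through the network.

    thetas -- list of weight matrices; thetas[i] has shape
              (units in next layer, units in current layer + 1)
    x      -- 1-D array of input features, without the bias unit
    """
    a = np.asarray(x, dtype=float)
    for theta in thetas:
        a = np.concatenate(([1.0], a))  # add the bias unit a_0 = 1
        z = theta @ a                   # z = Theta a
        a = sigmoid(z)                  # a = g(z)
    return a


# Hypothetical random weights for a 3-3-1 network.
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(3, 4)), rng.normal(size=(1, 4))]
output = forward(thetas, [1.0, 0.0, 2.0])
print(output.shape, output)
```

The bias unit is prepended inside the loop, so each matrix sees an input with one more element than the number of units in its source layer, matching the $s_{j+1} \times (s_j + 1)$ shape rule.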

### **Examples**

#### **AND Logical Operator**

By applying a neural network, we can predict the result of $x_1$ AND $x_2$, which is true only when both $x_1$ and $x_2$ are $1$.

|$x_1$|$x_2$|$x_1$ AND $x_2$|
|---|---|---|
|$0$|$0$|$0$|
|$0$|$1$|$0$|
|$1$|$0$|$0$|
|$1$|$1$|$1$|

Since the problem is quite simple, a single layer of weights (no hidden layer) is enough.

The graph of our function will look like this:
$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \rightarrow \begin{bmatrix} g(z^{(2)}) \end{bmatrix} \rightarrow f_\theta(x)$$
where $x_0 = 1$ is our bias unit.

Since there is only one output node, we only need one row of $\theta$, which is:
$$\Theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \end{bmatrix}$$
So now our $z^{(2)} = \Theta^{(1)}x = \theta^{(1)}_{10}x_0 + \theta^{(1)}_{11}x_1 + \theta^{(1)}_{12}x_2$. Then our hypothesis becomes:
$$f_\theta(x) = g(\Theta^{(1)}x) = g(\theta^{(1)}_{10}x_0 + \theta^{(1)}_{11}x_1 + \theta^{(1)}_{12}x_2)$$

Now it is capable of predicting the result of the AND operator. For example:
* If $x_1 = 0, x_2 = 0 \Rightarrow g(-30) \approx 0$
* If $x_1 = 0, x_2 = 1 \Rightarrow g(-10) \approx 0$
* If $x_1 = 1, x_2 = 0 \Rightarrow g(-10) \approx 0$
* If $x_1 = 1, x_2 = 1 \Rightarrow g(10) \approx 1$

This matches the output of the logical AND operator.
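The four cases can be checked numerically with a few lines of Python. The sketch below uses the weights $-30, 20, 20$ from this example; the function name `and_gate` is just a label chosen for the illustration.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def and_gate(x1, x2, theta=(-30.0, 20.0, 20.0)):
    """Single sigmoid unit with the weights from the text; x_0 = 1 is the bias."""
    z = theta[0] * 1.0 + theta[1] * x1 + theta[2] * x2
    return sigmoid(z)


# Print the rounded prediction for every input combination.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(and_gate(x1, x2)))
```

Rounding the sigmoid output to the nearest integer turns the activation into a hard $0$ or $1$ prediction.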