# $$Neuron~and~Neural~Network$$

# Definition

## **Neuron**

Airplanes mimic how birds fly, and the Shinkansen's nose mimics the kingfisher's sleek beak; by mimicking nature, humans have advanced technology. This is called biomimicry. In machine learning, since our goal is to build machines that can think like humans, we mimic neurons, the building blocks of the brain, and that is how the **Neural Network** came about.

But first, let's find out what neurons are and how they can be applied to machine learning (click <a href="https://kite.com/wp-content/uploads/2019/05/image8.png">here</a> to see the structure of a neuron).
A neuron uses its **dendrites** to receive electrical pulses (called **spikes**), processes them in the cell body, and sends the output along its **axon**.

Similarly, in our model a neuron is a computational unit that takes input features $x_1, x_2, ..., x_n$ (the dendrites) as electrical inputs and channels them to an output (the axon). In this model, the $x_0$ input node is sometimes called the **bias unit**, and it is always equal to $1$. In neural networks, we use the same logistic function as in classification, $\frac{1}{1 + e^{-\theta^{T}x}}$, which is sometimes called the sigmoid (logistic) **activation** function. In this setting, the $\theta$ parameters are sometimes called **weights**.
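
As a rough illustration of the single-neuron model above, here is a minimal Python sketch. The function names `sigmoid` and `neuron_output`, as well as the sample weights and inputs, are made up for this example and are not part of the original notes.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(theta, features):
    """Output of one neuron for weights theta and input features x_1..x_n.

    theta has n + 1 entries; theta[0] multiplies the bias unit x_0 = 1.
    """
    x = [1.0] + list(features)                   # prepend the bias unit x_0 = 1
    z = sum(t * xi for t, xi in zip(theta, x))   # theta^T x
    return sigmoid(z)

# Hypothetical weights and inputs, chosen only for illustration.
theta = [-1.0, 2.0, 0.5]
print(neuron_output(theta, [0.5, 0.0]))  # z = -1 + 1 + 0 = 0, so this prints 0.5
```

The bias unit is added inside the function so that callers only pass the real features $x_1, ..., x_n$.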

## **Neural Network**

A neural network is composed of many layers of neurons. The first layer (layer $1$) is called the **input layer**; its outputs are fed into the nodes of the intermediate layers (layers $2, 3, ..., n - 1$), called the **hidden layers**. Finally, the last layer (layer $n$), called the **output layer**, outputs the hypothesis function.

We label these intermediate layer nodes (here, in layer $2$) $a^{(2)}_0, a^{(2)}_1, ..., a^{(2)}_n$ and call them **activation units**; $a^{(2)}_0$ is also a **bias unit**.

Some notation explained:
$$a^{(j)}_i = \text{``activation" of unit $i$ in layer $j$}$$ 
$$\theta^{(j)} = \text{matrix of weights controlling function mapping from layer $j$ to layer $j+1$}$$
$$\theta^{(j)}_{ik} = \text{the weight on the connection from the $k^{th}$ unit in layer $j$ to the $i^{th}$ activation unit in layer $j+1$}$$

If we had only one hidden layer, it would look like this:
$$[x_0x_1x_2x_3] \rightarrow [a^{(2)}_1a^{(2)}_2a^{(2)}_3] \rightarrow f_\theta(x)$$
For each **activation** node, the value is obtained as follows:
$$a^{(2)}_1 = g(\theta^{(1)}_{10}x_0 + \theta^{(1)}_{11}x_1 + \theta^{(1)}_{12}x_2 + \theta^{(1)}_{13}x_3)$$
$$a^{(2)}_2 = g(\theta^{(1)}_{20}x_0 + \theta^{(1)}_{21}x_1 + \theta^{(1)}_{22}x_2 + \theta^{(1)}_{23}x_3)$$
$$a^{(2)}_3 = g(\theta^{(1)}_{30}x_0 + \theta^{(1)}_{31}x_1 + \theta^{(1)}_{32}x_2 + \theta^{(1)}_{33}x_3)$$
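The next layer (layer $3$) will be the output, computed from the layer $2$ activations (with the bias unit $a^{(2)}_0 = 1$):
$$f_\theta(x) = a^{(3)}_1 = g(\theta^{(2)}_{10}a^{(2)}_0 + \theta^{(2)}_{11}a^{(2)}_1 + \theta^{(2)}_{12}a^{(2)}_2 + \theta^{(2)}_{13}a^{(2)}_3)$$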
To sum up:
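$$a^{(j+1)}_i = g\left(\sum_{k=0}^{n}{\theta^{(j)}_{ik}a^{(j)}_k}\right)$$
$$a^{(j+1)}_i = g(\theta^{(j)}_{i0}a^{(j)}_0 + \theta^{(j)}_{i1}a^{(j)}_1 + \theta^{(j)}_{i2}a^{(j)}_2 + ... + \theta^{(j)}_{in}a^{(j)}_n)$$
Here $g$ is the **sigmoid** function, $n$ is the number of non-bias units in layer $j$, and $a^{(1)}_k = x_k$ for the input layer.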
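
The layer-by-layer computation can be sketched in plain Python as follows. This is only a sketch under stated assumptions: the helper names `layer_forward` and `forward` and the weight values are hypothetical, and each weight matrix is stored as a list of rows.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) activation g."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(theta, activations):
    """Compute the activations of layer j+1 from those of layer j.

    theta is a list of rows; row i holds theta_i0, theta_i1, ..., theta_in.
    activations holds the non-bias units of layer j.
    """
    a = [1.0] + list(activations)   # add the bias unit a_0 = 1
    return [sigmoid(sum(w * ak for w, ak in zip(row, a))) for row in theta]

def forward(thetas, x):
    """Propagate the input x through every weight matrix in turn."""
    a = list(x)
    for theta in thetas:
        a = layer_forward(theta, a)
    return a

# Hypothetical weights for the 3-input, 3-hidden-unit, 1-output example.
theta1 = [[0.1, 0.2, -0.3, 0.4],
          [-0.5, 0.1, 0.2, 0.0],
          [0.3, -0.2, 0.1, 0.1]]   # 3 x 4: layer 1 -> layer 2
theta2 = [[0.2, -0.4, 0.6, 0.1]]   # 1 x 4: layer 2 -> layer 3

print(forward([theta1, theta2], [1.0, 2.0, 3.0]))  # one value between 0 and 1
```

Because `forward` simply loops over the list of matrices, the same sketch works for any number of hidden layers.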

We apply each row of the parameter matrix to our inputs to obtain the value of one activation node. Our hypothesis output is the logistic function applied to the weighted sum of our activation nodes, where the weights come from yet another parameter matrix, $\theta^{(2)}$, which contains the weights for our second layer of nodes.
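
The row-by-row description can also be written as a matrix product. The symbol $z^{(j+1)}$ for the weighted sums is introduced here only for convenience and does not appear elsewhere in these notes; $g$ is applied element-wise.

$$\begin{bmatrix} a^{(2)}_1 \\ a^{(2)}_2 \\ a^{(2)}_3 \end{bmatrix} = g\left(\begin{bmatrix} \theta^{(1)}_{10} & \theta^{(1)}_{11} & \theta^{(1)}_{12} & \theta^{(1)}_{13} \\ \theta^{(1)}_{20} & \theta^{(1)}_{21} & \theta^{(1)}_{22} & \theta^{(1)}_{23} \\ \theta^{(1)}_{30} & \theta^{(1)}_{31} & \theta^{(1)}_{32} & \theta^{(1)}_{33} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}\right)$$
$$z^{(j+1)} = \theta^{(j)}a^{(j)}, \qquad a^{(j+1)} = g(z^{(j+1)})$$

Before each step, the bias unit $a^{(j)}_0 = 1$ is prepended to the vector $a^{(j)}$.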

In this example, the matrix $\theta^{(1)}$ that maps the first layer to the second is a $3 \times 4$ matrix: it has one row for each of the $3$ nodes in the second layer, and each row has $4$ weights, corresponding to the $3$ input nodes and $1$ bias node. Generally speaking, to determine the dimensions of these weight matrices, we apply this rule: **if a network has** $s_j$ **units in layer** $j$ **and** $s_{j+1}$ **units in layer** $j+1$**, then** $\theta^{(j)}$ **will be of dimensions** $s_{j+1} \times (s_j + 1)$.
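
The dimension rule can be checked with a small Python helper. The name `weight_shapes` is hypothetical, and the layer sizes are assumed to exclude bias units.

```python
def weight_shapes(layer_sizes):
    """Return the (rows, cols) shape of each theta^(j).

    layer_sizes lists s_1, s_2, ..., the unit counts per layer,
    excluding bias units. theta^(j) has shape s_{j+1} x (s_j + 1).
    """
    return [(s_next, s_cur + 1)
            for s_cur, s_next in zip(layer_sizes, layer_sizes[1:])]

# The example network: 3 inputs, 3 hidden units, 1 output.
print(weight_shapes([3, 3, 1]))  # [(3, 4), (1, 4)]
```

For the example network this gives a $3 \times 4$ matrix for $\theta^{(1)}$ and a $1 \times 4$ matrix for $\theta^{(2)}$.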