<a href="https://colab.research.google.com/github/JardRily/Mathematical-Methods-Data-Sciences/blob/main/MAT%20494%20Data%20Science/3.%207%20Neural%20Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 3.7 Neural Networks

Artificial neural networks are collections of connected layers of units, or nodes, that loosely model the neurons in a brain. This section illustrates the use of differentiation for training artificial neural networks to minimize cost functions.

## 3.7.1 Mathematical Formulation

A neural network has inputs on the left side, where the set of inputs can be thought of as the set $\{x^{(n)}_1, x^{(n)}_2, \ldots, x^{(n)}_m\}$, and a forecast output $\hat{y}$ on the right, which is produced by an activation function $\sigma(z)$ chosen in advance. For a single node with two inputs $a_1$ and $a_2$, $\hat{y} = \sigma(z) = \sigma (w_1 a_1 + w_2 a_2 + b)$. In neural networks, the weights $w_i$ and the bias $b$ are found numerically so that the forecast output best fits the given data.

A general neural network may have any number of nodes and layers. The input units receive information in various forms and structures, the network combines it through an internal weighting system, and it attempts to learn from the information presented in order to produce an output. Specifically, it adjusts its weights according to a learning rule so that its output becomes increasingly similar to the target output. After a sufficient number of these adjustments, training can be terminated based on certain criteria. This is known as supervised learning.

Now we formulate mathematical notation for a neural network. We can label each column in a neural network as a layer. In a general neural network, the rightmost layer will be called layer $l$, with the layer to the left of it being layer $l-1$. The values fed into layer $l$ are determined by the outputs of layer $l-1$. Let
\begin{gather*}
z^{(l)}_{j'} = \sum_{j=1}^{J_{l-1}} w_{j,j'}^{(l)} a_j^{(l-1)} + b_{j'}^{(l)}
\end{gather*}
where $J_l$ denotes the number of nodes in layer $l$. For a given activation function $\sigma$, the values in the next layer are $a_{j'}^{(l)} = \sigma (z_{j'}^{(l)})$. In matrix form, $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$, where the matrix $W^{(l)}$ contains all the multiplicative parameters (the weights $w^{(l)}_{j,j'}$) and $b^{(l)}$ is the bias. The bias is just the constant in the linear transformation: $a^{(l)} = \sigma (z^{(l)}) = \sigma (W^{(l)} a^{(l-1)} + b^{(l)})$.
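
As a rough illustration of the matrix form, the following NumPy sketch evaluates one layer. The function name `layer_forward`, the choice of the logistic function as $\sigma$, and the numerical values are assumptions made for this example only. The matrix is stored so that row $j'$, column $j$ holds $w^{(l)}_{j,j'}$.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, used here as an example choice of sigma."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, b, a_prev, activation=sigmoid):
    """Compute z = W a_prev + b and a = activation(z) for one layer."""
    z = W @ a_prev + b
    return z, activation(z)

# Hypothetical layer: 3 nodes in layer l-1 feeding 2 nodes in layer l.
# Row j' of W holds the weights w_{j,j'} entering node j' of layer l.
W = np.array([[0.1, -0.2, 0.3],
              [0.4, 0.5, -0.6]])
b = np.array([0.0, 0.1])
a_prev = np.array([1.0, 2.0, 3.0])

z, a = layer_forward(W, b, a_prev)
print("z =", z)
print("a =", a)
```

Stacking several such calls, each taking the previous `a` as its `a_prev`, gives the forward pass through the whole network.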

## 3.7.2 Activation Functions

In neural networks, the activation function of a node determines the output of that node given an input or set of inputs, often for a specific purpose such as classification. In biological neural networks, the activation function may represent an electrical signal, that is, whether or not the neuron fires. We use $\sigma$ to represent the activation function. It will be the same for all nodes in a layer: $a^{(l)} = \sigma (z^{(l)}) = \sigma (W^{(l)} a^{(l-1)} + b^{(l)})$.

Here we discuss a number of activation functions.

### 3.7.2.1 Step Function

\begin{gather*}
\sigma (x)=
\begin{cases}
0 & x < 0 \\
1 & x \ge 0
\end{cases}
\end{gather*}
This is also called the Heaviside step function or the unit step function. It often represents a signal that switches on at a specified time and stays switched on indefinitely. The step function can be used for classification problems.

### 3.7.2.2 Rectified Linear Units (ReLU) Function

The positive linear or ReLU function is defined as $\sigma(x) = \max (0, x)$. It is one of the most commonly used activation functions. The signal either passes through untouched or dies completely. Compared with the sigmoid and similar activation functions, rectified linear units allow faster and more effective training of deep neural architectures on large and complex datasets.

### 3.7.2.3 Sigmoid

The sigmoid or logistic function is $\sigma(x) = \frac{1}{1+e^{-x}}$. The logistic function finds applications in a variety of fields, including biomathematics. The logistic sigmoid can be used in the output layer for predicting probabilities.
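
The three elementwise activation functions defined so far can be sketched in NumPy as follows. The function names and the sample input are illustrative assumptions, not part of the notes.

```python
import numpy as np

def step(x):
    """Heaviside step: 0 for x < 0, 1 for x >= 0."""
    return np.where(x >= 0, 1.0, 0.0)

def relu(x):
    """Rectified linear unit: max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sample input covering negative, zero, and positive values
x = np.array([-2.0, 0.0, 3.0])
print("step:   ", step(x))
print("relu:   ", relu(x))
print("sigmoid:", sigmoid(x))
```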

### 3.7.2.4 Softmax Function

The softmax function converts a vector of numbers (an array of $K$ values $z$) into a vector of probabilities, where the probability assigned to each entry is proportional to the exponential of that entry. It is thus a function that turns several numbers into quantities that can be interpreted as probabilities:
\begin{gather*}
\sigma(z)_k = \frac{e^{z_k}}{\sum^K_{i=1}e^{z_i}}, \quad k = 1, \ldots, K
\end{gather*}
It is often used in the final output layer of a neural network, especially for classification problems.
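
A minimal NumPy sketch of softmax is given below. Subtracting the maximum entry before exponentiating is a common numerical-stability trick that is assumed here rather than taken from the notes; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import numpy as np

def softmax(z):
    """Map a 1-D array of K scores to K probabilities that sum to one."""
    shifted = z - np.max(z)   # assumption: shift for numerical stability
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

# Hypothetical scores for K = 3 classes
z = np.array([1.0, 2.0, 3.0])
p = softmax(z)
print("probabilities:", p)
print("sum:", p.sum())
```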

## 3.7.3 Cost Function

In practice, we can use least squares for the cost function. Suppose we have a set of $N$ independent training samples with target outputs $y^{(n)}$ (from the training dataset) and corresponding forecast outputs $\hat{y}^{(n)}$, and let $k$ index the $k$-th of the $K$ output nodes. We define the cost function as
\begin{gather*}
J=\frac{1}{2}\sum^N_{n=1} \sum^K_{k=1} \left(\hat{y}^{(n)}_k-y^{(n)}_k\right)^2
\end{gather*}
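
The least-squares cost can be evaluated directly from arrays of forecasts and targets. In this sketch the arrays are assumed to have shape $(N, K)$, and the numbers are made up for illustration.

```python
import numpy as np

def quadratic_cost(y_hat, y):
    """J = 1/2 * sum over samples n and output nodes k of (y_hat - y)^2."""
    return 0.5 * np.sum((y_hat - y) ** 2)

# Hypothetical data: N = 2 samples, K = 2 output nodes
y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
y_hat = np.array([[0.8, 0.1],
                  [0.3, 0.6]])

J = quadratic_cost(y_hat, y)
print("J =", J)
```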

For classification problems with only one output, the commonly used cost function is similar to that of logistic regression. For a binary classification problem ($y^{(n)} \in \{0, 1\}$), the cost function is
\begin{gather*}
J=-\sum^N_{n=1} \left( y^{(n)} \ln\left(\hat{y}^{(n)}\right) + \left(1-y^{(n)}\right) \ln\left(1-\hat{y}^{(n)}\right) \right)
\end{gather*}
This is related to the cross entropy function.
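
A short sketch of the binary classification cost follows. Clipping the forecasts away from exactly 0 and 1 (the `eps` parameter) is an assumption added here to avoid taking the logarithm of zero; the sample values are hypothetical.

```python
import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """J = -sum_n [ y ln(y_hat) + (1 - y) ln(1 - y_hat) ]."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # assumption: guard against log(0)
    return -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Hypothetical labels and forecast probabilities for N = 3 samples
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])

J = binary_cross_entropy(y_hat, y)
print("J =", J)
```

Confident correct forecasts contribute little to $J$, while a confident wrong forecast contributes a large penalty.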

## 3.7.4 Backpropagation

Backpropagation is the essence of neural network training. It is the practice of fine-tuning the weights of a neural network based on the error rate obtained in the previous iteration. Proper tuning of the weights lowers the error rate, which makes the model more reliable. We want to minimize the cost function, $J$, with respect to the parameters, that is, the components of $W$ and $b$. To do that using gradient descent, we need the derivatives of $J$ with respect to each of those parameters. Here we focus on node $j'$ in layer $l$ and node $j$ in layer $l-1$:
\begin{gather*}
\frac{\partial J}{\partial w^{(l)}_{j,j'}}\;\text{and}\;\frac{\partial J}{\partial b_{j'}^{(l)}}
\end{gather*}

We introduce the quantity 
\begin{gather*}
\delta_{j'}^{(l)}= \frac{\partial J}{\partial z_{j'}^{(l)}}
\end{gather*}
From the chain rule we have:
\begin{gather*}
\delta_{j}^{(l-1)}=\frac{\partial J}{\partial z_{j}^{(l-1)}}=\sum_{j'} \frac{\partial J}{\partial z_{j'}^{(l)}} \frac{\partial z_{j'}^{(l)}}{\partial z_{j}^{(l-1)}}
\end{gather*}

Recall from the definition of $z$ that
\begin{gather*} 
z_{j'}^{(l)} = \sum_{i} w^{(l)}_{i,j'} a^{(l-1)}_{i} + b^{(l)}_{j'} = \sum_{i} w^{(l)}_{i,j'} \sigma (z^{(l-1)}_{i}) + b^{(l)}_{j'}
\end{gather*}

Differentiating this expression with respect to $z_{j}^{(l-1)}$ and substituting into the chain rule gives
\begin{gather*} 
\delta_{j}^{(l-1)} = \frac{d\sigma}{dz} \bigg|_{z_{j}^{(l-1)}} \sum_{j'} \frac{\partial J}{\partial z_{j'}^{(l)}} w_{j,j'}^{(l)} = \frac{d\sigma}{dz}\bigg|_{z_{j}^{(l-1)}} \sum_{j'} {\delta}_{j'}^{(l)} w_{j,j'}^{(l)}
\end{gather*}

As a result, we can find the $\delta$'s in a layer if we know the $\delta$'s in the layer to its right. In summary, we have
\begin{gather*}
\frac{\partial J}{\partial w_{j,j'}^{(l)}} = \frac{\partial J}{\partial z_{j'}^{(l)}} \frac{\partial z_{j'}^{(l)}}{\partial w_{j, j'}^{(l)}} = {\delta}_{j'}^{(l)} a_{j}^{(l-1)}
\end{gather*}

Now the derivatives of the cost function $J$ with respect to the $w$'s can be written in terms of the $\delta$'s, which in turn are backpropagated from the layer just to the right, one nearer the output. The derivative of the cost function with respect to the bias, $b$, is simply:
\begin{gather*}
\frac{\partial J}{\partial b_{j'}^{(l)}} = {\delta}_{j'}^{(l)}
\end{gather*}  
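
For implementation it can help to collect these componentwise results in matrix form. The following restatement assumes that $W^{(l)}$ is stored with entry $(j', j)$ equal to $w^{(l)}_{j,j'}$, so that $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$, and writes $\odot$ for the elementwise product:
\begin{gather*}
\delta^{(l-1)} = \sigma'\left(z^{(l-1)}\right) \odot \left(W^{(l)}\right)^{\top} \delta^{(l)}, \qquad
\frac{\partial J}{\partial W^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^{\top}, \qquad
\frac{\partial J}{\partial b^{(l)}} = \delta^{(l)}
\end{gather*}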

It is clear that the derivatives of $J$ depend on which activation function we use. If it is ReLU, then the derivative of the activation is either zero or one. If we use the logistic function, then we find that ${\sigma}' (z) = \sigma(z) (1 - \sigma(z))$.
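
One way to sanity-check these formulas is to compare a backpropagated derivative with a finite-difference estimate. The sketch below assumes a small network with one hidden layer, logistic activations, the quadratic cost, and a single training sample; the layer sizes, random seed, and data are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    """Forward pass through a hypothetical 2-3-1 network."""
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2

def cost(params, x, y):
    """Quadratic cost for a single sample."""
    return 0.5 * np.sum((forward(params, x)[3] - y) ** 2)

def backprop(params, x, y):
    """Gradients of the cost via the delta recursion."""
    W1, b1, W2, b2 = params
    z1, a1, z2, a2 = forward(params, x)
    delta2 = a2 * (1 - a2) * (a2 - y)          # output layer: sigma'(z) (y_hat - y)
    delta1 = a1 * (1 - a1) * (W2.T @ delta2)   # sigma'(z_j) * sum_j' delta_j' w_{j,j'}
    # dJ/dw_{j,j'} = delta_j' a_j and dJ/db_j' = delta_j'
    return [np.outer(delta1, x), delta1, np.outer(delta2, a1), delta2]

rng = np.random.default_rng(0)
params = [rng.normal(size=(3, 2)), rng.normal(size=3),
          rng.normal(size=(1, 3)), rng.normal(size=1)]
x = np.array([0.5, -1.0])
y = np.array([1.0])

grads = backprop(params, x, y)

# Central finite difference on one first-layer weight
h = 1e-5
plus = [p.copy() for p in params]
minus = [p.copy() for p in params]
plus[0][1, 0] += h
minus[0][1, 0] -= h
numeric = (cost(plus, x, y) - cost(minus, x, y)) / (2 * h)

print("backprop:", grads[0][1, 0])
print("numeric: ", numeric)
```

If the two printed numbers agree to several digits, the $\delta$ recursion has been implemented consistently with the forward pass.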

## 3.7.5 Backpropagation Algorithm

From the above analysis, we can derive the backpropagation algorithm as follows. First, we initialize the weights and biases, typically at random. Then we pick input data, feed the vector $x$ into the left side of the network, and calculate all the $z$'s and $a$'s layer by layer. Finally, we calculate the output $\hat{y}$. We can then compute the $\delta$'s from right to left and update the parameters by (stochastic) gradient descent, repeating the process until the desired accuracy is reached. For example, if using the quadratic cost function in one dimension, then at the output layer $L$
\begin{gather*}
{\delta}^{(L)} = \frac{d\sigma}{dz} \bigg|_{z^{(L)}} (\hat{y} - y)
\end{gather*}
Continue to the left, layer by layer:
\begin{gather*}
{\delta}_{j}^{(l-1)}=\frac{d\sigma}{dz} \bigg|_{z_{j}^{(l-1)}} \sum_{j'} {\delta}_{j'}^{(l)} w_{j, j'}^{(l)}
\end{gather*}

Then update the weights and biases using the following formulas, where $\beta$ is the learning rate:
\begin{gather*}
\text{New}\;w^{(l)}_{j,j'} = \text{Old}\; w^{(l)}_{j,j'} - \beta \frac{\partial J}{\partial w^{(l)}_{j,j'}} = \text{Old}\;w^{(l)}_{j,j'}-\beta {\delta}^{(l)}_{j'} a^{(l-1)}_{j}
\end{gather*}
and
\begin{gather*}
\text{New}\;b^{(l)}_{j'} = \text{Old}\; b^{(l)}_{j'} - \beta \frac{\partial J}{\partial b^{(l)}_{j'}} = \text{Old}\;b^{(l)}_{j'}-\beta {\delta}^{(l)}_{j'}
\end{gather*}
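
The steps of this section can be put together in a short training loop. This is a minimal sketch under several assumptions that are not in the notes: logistic activations in every layer, the quadratic cost, a toy dataset (the logical OR of two inputs), a 2-3-1 architecture, a fixed sample order instead of random shuffling, and a fixed number of epochs in place of an accuracy-based stopping rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(Ws, bs, x):
    """Return the z's and a's of every layer; a_list[0] is the input x."""
    a_list, z_list = [x], []
    for W, b in zip(Ws, bs):
        z = W @ a_list[-1] + b
        z_list.append(z)
        a_list.append(sigmoid(z))
    return z_list, a_list

def total_cost(Ws, bs, X, Y):
    """Quadratic cost summed over all samples."""
    return 0.5 * sum(np.sum((forward(Ws, bs, x)[1][-1] - y) ** 2)
                     for x, y in zip(X, Y))

def train(Ws, bs, X, Y, beta=0.5, epochs=2000):
    """Per-sample gradient descent with backpropagated deltas (updates in place)."""
    for _ in range(epochs):
        for x, y in zip(X, Y):
            _, a_list = forward(Ws, bs, x)
            # Output-layer delta for the quadratic cost: sigma'(z) (y_hat - y)
            delta = a_list[-1] * (1 - a_list[-1]) * (a_list[-1] - y)
            for l in range(len(Ws) - 1, -1, -1):
                grad_W = np.outer(delta, a_list[l])   # delta_j' * a_j
                grad_b = delta
                if l > 0:
                    # Propagate delta to the left using the old weights
                    delta = a_list[l] * (1 - a_list[l]) * (Ws[l].T @ delta)
                Ws[l] -= beta * grad_W
                bs[l] -= beta * grad_b
    return Ws, bs

# Hypothetical toy problem: learn the logical OR of two binary inputs
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0], [1.0], [1.0], [1.0]])

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
bs = [np.zeros(3), np.zeros(1)]

initial_cost = total_cost(Ws, bs, X, Y)
train(Ws, bs, X, Y)
final_cost = total_cost(Ws, bs, X, Y)
print("cost before training:", initial_cost)
print("cost after training: ", final_cost)
```

Here `beta` plays the role of $\beta$ in the update formulas. Each $\delta$ is propagated to the left before the weights of that layer are overwritten, so every gradient in a single pass is evaluated at the old parameter values.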