# 3.7 Neural Networks
<a target="_blank" href="https://colab.research.google.com/github/SaajanM/mat422-homework/blob/main/3.7%20Neural%20Networks/neuralnets.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

In [None]:
# Install numpy, scipy, and matplotlib in the current Jupyter kernel
import sys
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install scipy
!{sys.executable} -m pip install matplotlib

In [None]:
# Import numpy, matplotlib, and scipy
import numpy as np
import matplotlib
import matplotlib.pyplot as pyplt
from mpl_toolkits.mplot3d import Axes3D
import math
import matplotlib.pyplot as plt
from scipy import interpolate

$\newcommand\norm[1]{\left\lVert#1\right\rVert}$
$\newcommand\argmax{\text{arg}\,\text{max}}$
$\newcommand\argmin{\text{arg}\,\text{min}}$

## Section 3.7.0 Neural Networks

Loosely inspired by the human brain, artificial neural networks (or simply neural networks) are systems that transform an input vector into an output vector through a sequence of transformations (layers). Each layer applies an affine map followed by a usually nonlinear activation function.

## Section 3.7.1 Mathematical Formulation

We will mainly talk about classical feed-forward neural networks here, where the layers are applied one after another with no feedback connections or special layer types.

Here we will discuss how the values (activations) of layer $l$ are calculated using the activations of layer $l-1$ ($\mathbf{a}^{(l-1)}$), the biases of layer $l$ ($\mathbf{b}^{(l)}$), and the weights of layer $l$ ($W^{(l)}$).

Because the pre-activation values of layer $l$ ($\mathbf{z}^{(l)}$) are an affine function of the activations of layer $l-1$, we can write
$$
\mathbf{z}^{(l)} = W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}
$$

Finally, we apply some transform (the activation function $\sigma$) to the pre-activation values to get

$$
\mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)})
$$
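The two equations above can be sketched in NumPy as follows. This is a minimal illustration, not a full implementation: the helper name `layer_forward`, the choice of the sigmoid as $\sigma$, and the example numbers are all assumptions made here.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here only as an example activation."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, b, a_prev, activation=sigmoid):
    """Compute z = W a_prev + b and a = activation(z) for one layer."""
    z = W @ a_prev + b
    return z, activation(z)

# Hypothetical layer with 3 inputs and 2 outputs.
W = np.array([[0.5, -1.0, 0.0],
              [1.0,  1.0, 1.0]])
b = np.array([0.0, -3.0])
a_prev = np.array([2.0, 1.0, 0.0])

z, a = layer_forward(W, b, a_prev)
print(z)  # [0. 0.]
print(a)  # [0.5 0.5]
```

A full forward pass would simply call such a function once per layer, feeding each layer's output `a` into the next.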

## Section 3.7.2 Activation Functions

We just talked about activation functions but what even are they?

They are functions that transform the incoming pre-activation values. This is usually done to give the result some mathematical property (boundedness, nonlinearity, etc.) or to give the activations an intuitive interpretation (softmax: proportion of the total).

### Step Function

This activation function takes the value 0 for all inputs less than 0 and the value 1 for all other inputs. It can be used for classification tasks.
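A minimal NumPy sketch of this definition (the function name `step` is chosen here for illustration):

```python
import numpy as np

def step(x):
    """Return 0 where x < 0 and 1 elsewhere (including at x == 0)."""
    return np.where(np.asarray(x) < 0, 0.0, 1.0)

step_out = step([-2.0, 0.0, 3.0])
print(step_out)  # [0. 1. 1.]
```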

### Rectified Linear Unit (ReLU)

The ReLU is simply $\max(0,x)$. It either leaves a signal untouched or kills it. It is very popular because it performs well and is cheap to compute.
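As a sketch, the ReLU and its derivative could be written like this. The names `relu` and `relu_grad` are illustrative, and the value of the derivative at exactly $x = 0$ is a convention (0 is assumed here).

```python
import numpy as np

def relu(x):
    """Elementwise max(0, x)."""
    return np.maximum(0.0, np.asarray(x, dtype=float))

def relu_grad(x):
    """Derivative of ReLU: 1 where x > 0, else 0 (0 at x == 0 by convention)."""
    return (np.asarray(x) > 0).astype(float)

xs = np.array([-1.5, 0.0, 2.0])
print(relu(xs))       # [0. 0. 2.]
print(relu_grad(xs))  # [0. 0. 1.]
```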

### Sigmoid

This is the function $\frac{1}{1+e^{-x}}$. It maps any real input into the interval $(0,1)$, sees heavy use in STEM applications, and can be used as a final layer when the output should be read as a probability.
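A small sketch of the sigmoid and its derivative, which satisfies $\sigma'(x) = \sigma(x)(1-\sigma(x))$. The function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))       # 0.5
print(sigmoid_grad(0.0))  # 0.25
```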

### Softmax

This function converts the input vector $\mathbf{z}$ into a probability vector: each output $\frac{e^{z_i}}{\sum_k e^{z_k}}$ is positive and the outputs sum to 1. It is heavily used in classification tasks because it can represent a soft confidence score rather than a YES/NO answer.
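A minimal sketch of softmax, assuming the standard definition $e^{z_i} / \sum_k e^{z_k}$. Subtracting the maximum entry first is a common trick that leaves the result unchanged but avoids overflow in the exponential; the function name is illustrative.

```python
import numpy as np

def softmax(z):
    """Map a 1-D vector of scores to positive values that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)  # same result mathematically, safer numerically
    e = np.exp(shifted)
    return e / e.sum()

p = softmax([1.0, 2.0, 3.0])
print(p)        # approximately [0.090 0.245 0.665]
print(p.sum())  # sums to 1 up to rounding
```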

## Section 3.7.3 Cost Function

In supervised learning we need a cost function that tells us how far the outputs of our network are from the desired outputs. This is usually either the least-squares loss or the cross-entropy loss.

We denote this function by $J$.
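The two costs mentioned above could be sketched as follows for a single example. The factor of $\frac{1}{2}$ in the least-squares cost, the clipping constant `eps`, and the function names are assumptions made for this illustration.

```python
import numpy as np

def least_squares(y_pred, y_true):
    """J = 1/2 * ||y_pred - y_true||^2 (the 1/2 factor is a convention)."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return 0.5 * np.sum(diff ** 2)

def cross_entropy(p_pred, y_true, eps=1e-12):
    """J = -sum_i y_i * log(p_i) for a probability vector and a one-hot target."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0)  # avoid log(0)
    return -np.sum(np.asarray(y_true, dtype=float) * np.log(p))

print(least_squares([1.0, 2.0], [0.0, 0.0]))  # 2.5
ce = cross_entropy([0.25, 0.75], [0.0, 1.0])  # equals -log(0.75)
```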

## Section 3.7.4 Back Propagation

Computing the gradient of $J$ with respect to every parameter of a deep, nonlinear network one parameter at a time is wildly expensive, so modern systems employ a method known as backpropagation. It is called backpropagation because we propagate the error of the last layer backwards through the entire system in order to fine-tune the weights and biases. Essentially, we apply the chain rule layer by layer to calculate the overall network gradients, which gradient descent then uses to update the parameters.

Because we care about weights and biases, we are most concerned with the gradients with respect to the weight connecting node $j$ in layer $l-1$ to node $j'$ in layer $l$ (the entry of $W^{(l)}$ in row $j'$, column $j$) and the bias of node $j'$ in layer $l$:
$$
\frac{\partial J}{\partial w^{(l)}_{j,j'}},\hspace{3em}\frac{\partial J}{\partial b^{(l)}_{j'}}
$$

We introduce the quantity $\delta_{j'}^{(l)} = \frac{\partial J}{\partial z_{j'}^{(l)}}$.

It turns out that $\delta_{j'}^{(l)}$ depends on the deltas $\delta_{k}^{(l+1)}$ of the next layer. Additionally, the gradients above can be calculated from the delta values. Therefore, once we find the deltas of the last (rightmost) layer, we can backpropagate them through the full network and find the gradients for all the parameters.
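For reference, these relations can be written out explicitly. This is a sketch under two assumptions: $\sigma$ is applied elementwise (so it does not directly cover softmax), and $z^{(l)}_{j'} = \sum_j w^{(l)}_{j,j'} a^{(l-1)}_j + b^{(l)}_{j'}$. The index $k$ and the symbol $L$ for the last layer are introduced here for illustration.

For the last layer,
$$
\delta^{(L)}_{j'} = \frac{\partial J}{\partial a^{(L)}_{j'}}\,\sigma'\!\left(z^{(L)}_{j'}\right)
$$

For every earlier layer, the chain rule through $z^{(l+1)}_k$ gives
$$
\delta^{(l)}_{j'} = \sigma'\!\left(z^{(l)}_{j'}\right)\sum_{k} w^{(l+1)}_{j',k}\,\delta^{(l+1)}_{k}
$$

The parameter gradients then follow directly from the deltas:
$$
\frac{\partial J}{\partial w^{(l)}_{j,j'}} = a^{(l-1)}_{j}\,\delta^{(l)}_{j'},\hspace{3em}\frac{\partial J}{\partial b^{(l)}_{j'}} = \delta^{(l)}_{j'}
$$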

The activation functions are easy to handle: because they are very simple (especially in the case of ReLU), their derivatives are known in closed form and enter the calculation through the chain rule.
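The backward pass can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not a definitive implementation: every layer uses the sigmoid (whose derivative is $a(1-a)$, so only activations need to be stored), the cost is $\frac{1}{2}\lVert \mathbf{a} - \mathbf{y}\rVert^2$, each `W` has shape (outputs, inputs) to match $\mathbf{z} = W\mathbf{a} + \mathbf{b}$, and all function names and sizes are hypothetical. The last lines compare one computed gradient against a finite-difference estimate as a sanity check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    """Return the activations of every layer; acts[0] is the input x."""
    acts = [x]
    for W, b in params:
        acts.append(sigmoid(W @ acts[-1] + b))
    return acts

def cost(params, x, y):
    return 0.5 * np.sum((forward(params, x)[-1] - y) ** 2)

def backprop(params, x, y):
    """Return a list of (dJ/dW, dJ/db) pairs, one per layer."""
    acts = forward(params, x)
    # Delta of the last layer: dJ/da * sigma'(z), with sigma'(z) = a * (1 - a).
    delta = (acts[-1] - y) * acts[-1] * (1.0 - acts[-1])
    grads = [None] * len(params)
    for l in reversed(range(len(params))):
        # dJ/dW = outer(delta, previous activations), dJ/db = delta.
        grads[l] = (np.outer(delta, acts[l]), delta)
        if l > 0:
            # Propagate the delta one layer back through W and sigma'.
            delta = (params[l][0].T @ delta) * acts[l] * (1.0 - acts[l])
    return grads

rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 2)), rng.normal(size=3)),
          (rng.normal(size=(1, 3)), rng.normal(size=1))]
x, y = np.array([0.5, -1.0]), np.array([1.0])
grads = backprop(params, x, y)

# Central finite difference on a single weight of the first layer.
eps = 1e-6
W0 = params[0][0]
W0[1, 0] += eps
j_plus = cost(params, x, y)
W0[1, 0] -= 2 * eps
j_minus = cost(params, x, y)
W0[1, 0] += eps  # restore the original weight
numeric = (j_plus - j_minus) / (2 * eps)
print(abs(numeric - grads[0][0][1, 0]))  # should be very close to zero
```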

## Section 3.7.5 Back Propagation Algorithm

We first initialize the parameters of the network randomly. Then we pick an input vector at random and calculate our network's output. Next we calculate the gradients with backpropagation and update our parameters with a gradient descent step. We repeat this until the network reaches our desired accuracy.
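The loop described above might look like the following sketch. Everything specific here is an assumption chosen for illustration: the toy data (logical OR), the 2-3-1 sigmoid network, the least-squares cost, the learning rate, and a fixed number of steps in place of an accuracy-based stopping rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    """Return the activations of every layer; acts[0] is the input x."""
    acts = [x]
    for W, b in params:
        acts.append(sigmoid(W @ acts[-1] + b))
    return acts

def backprop(params, x, y):
    """Gradients of 1/2 * ||a - y||^2 for a sigmoid network."""
    acts = forward(params, x)
    delta = (acts[-1] - y) * acts[-1] * (1.0 - acts[-1])
    grads = [None] * len(params)
    for l in reversed(range(len(params))):
        grads[l] = (np.outer(delta, acts[l]), delta)
        if l > 0:
            delta = (params[l][0].T @ delta) * acts[l] * (1.0 - acts[l])
    return grads

def mean_cost(params, X, Y):
    return np.mean([0.5 * np.sum((forward(params, x)[-1] - y) ** 2)
                    for x, y in zip(X, Y)])

# Toy data set: logical OR of two inputs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0], [1.0], [1.0], [1.0]])

# 1. Initialize the parameters randomly.
rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 2)), rng.normal(size=3)),
          (rng.normal(size=(1, 3)), rng.normal(size=1))]

learning_rate = 0.5
cost_before = mean_cost(params, X, Y)
for _ in range(5000):
    # 2. Pick a training example at random.
    i = rng.integers(len(X))
    # 3. Compute the gradients for that example.
    grads = backprop(params, X[i], Y[i])
    # 4. Take one gradient descent step on every weight matrix and bias.
    params = [(W - learning_rate * dW, b - learning_rate * db)
              for (W, b), (dW, db) in zip(params, grads)]
cost_after = mean_cost(params, X, Y)
print(cost_before, cost_after)  # the cost should decrease
```

Because each update uses a single randomly chosen example, this is the stochastic variant of gradient descent; averaging the gradients over several examples per step is a common alternative.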