# What is a Neural Network?

Neural networks are the core idea behind deep learning. By adjusting a set of weights and biases to fit a dataset, a neural network can find patterns in many different types of data, which makes the approach extremely versatile. Neural networks can also be built to specification, with different types of networks and fine-tuning built for use cases such as image classification, time series analysis, and more.

## Parts of a neural network

Often when people try to explain a neural network, they show this image:

[conventional diagram of neural network]

However, this diagram makes little sense mathematically, and not much more from a programming perspective. We shall therefore begin with this diagram instead:

``` 
           f_1(X)      f_2(X)             f_N(X)
          +------+    +------+           +------+    
input     |      |    |      |           |      |  output  
--------> |      | -> |      | -> ... -> |      | --------> prediction 
          |      |    |      |           |      | 
          +------+    +------+           +------+    
```

The diagram shows that at its core, a neural network is just a composition of functions. These functions typically have the form 

$\overrightarrow{y} = f_i(\overrightarrow{x}; \theta_1, ..., \theta_n; h_1, ..., h_m)$

Where:

- $\overrightarrow{x}$: input vector of values
- $\overrightarrow{y}$: output vector of values
- $\theta_i$: trainable parameter (i.e., weight or bias)
- $h_i$: hyperparameter (non-trainable parameter which affects the behaviour of the neural network)
- $n \geq{0}, m \geq{0}$
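To make "a composition of functions" concrete, here is a minimal sketch in Python. The helper `compose` and the three stand-in layers `f_1`, `f_2`, `f_3` are hypothetical names chosen for illustration; real layers would also carry trainable parameters and hyperparameters.

```python
from functools import reduce

import numpy as np


def compose(*layers):
    """Return a function that applies each layer in order: f_N(...f_2(f_1(x)))."""
    def network(x):
        return reduce(lambda value, layer: layer(value), layers, x)
    return network


# Hypothetical stand-in layers (assumptions for illustration only).
def f_1(x):
    return 2.0 * x              # scale every component

def f_2(x):
    return x + 1.0              # shift every component

def f_3(x):
    return np.maximum(x, 0.0)   # clip negative values to zero


network = compose(f_1, f_2, f_3)
prediction = network(np.array([-2.0, 0.5, 3.0]))
print(prediction)
```

The output of each function becomes the input of the next, which is exactly what the arrows in the diagram represent.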

These are the bare essentials of a neural network. We are, however, still missing the parts that allow the neural network to get better at making predictions. We can refine the diagram a bit without loss of generality:

``` 
           f_1(X)      f_2(X)             f_N(X)
          +------+    +------+           +------+  output y 
input     |      | -> |      | -> ... -> |      | ---------->        comparison 
--------> |      |    |      |           |      |                vs. expected output
          |      | <- |      | <- ... <- |      | <----------         E(y, y')
          +------+    +------+           +------+   error O
           b_1(O)      b_2(O)             b_N(O) 
```

We have now added a few functions which give a better picture of how the neural network works. Once we use the forward functions $f_i$ to get a prediction (this process is known as feed-forward or forward propagation), we compare it against our expected output. This comparison gives us an error metric, computed by a function $E$. The error is then passed back through the neural network using the backward functions $b_i$ (this process is known as backpropagation) to update the trainable parameters.
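The forward pass, the comparison, and the backward pass can be sketched as a small training loop. Everything here is an assumption made for illustration: `ScaleLayer` is a hypothetical layer with a single trainable parameter, the error function is mean squared error, and the update rule is plain gradient descent with a hand-picked learning rate.

```python
import numpy as np


class ScaleLayer:
    """Hypothetical layer with one trainable parameter: f(x) = theta * x."""

    def __init__(self, theta):
        self.theta = theta

    def forward(self, x):
        self.x = x                                   # remember the input for the backward pass
        return self.theta * x

    def backward(self, grad_output, learning_rate):
        grad_theta = np.sum(grad_output * self.x)    # dE/dtheta
        grad_input = grad_output * self.theta        # dE/dx, handed to the previous layer
        self.theta -= learning_rate * grad_theta     # update the trainable parameter
        return grad_input


def error(y, expected):
    """E(y, y'): mean squared error."""
    return np.mean((y - expected) ** 2)


def error_grad(y, expected):
    """Gradient of E with respect to the output y."""
    return 2.0 * (y - expected) / y.size


layers = [ScaleLayer(0.5), ScaleLayer(0.5)]
x = np.array([1.0, 2.0, 3.0])
expected = 3.0 * x                                   # toy target: y' = 3x

losses = []
for step in range(200):
    y = x
    for layer in layers:                             # forward propagation
        y = layer.forward(y)
    losses.append(error(y, expected))
    grad = error_grad(y, expected)
    for layer in reversed(layers):                   # backpropagation
        grad = layer.backward(grad, learning_rate=0.01)

print(losses[0], losses[-1])
```

Each `backward` call plays the role of a $b_i$: it receives the error signal from the layer after it, updates its own parameter, and passes a signal on to the layer before it.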

We now have all the components of a neural network. But this is all very abstract. How do we actually define the functions $f_i$, $b_i$, and $E$? That is use-case-specific. In fact, we can build neural networks as we wish. We refer to each function $f_i$ as a layer of the neural network. Each layer has a corresponding backpropagation function $b_i$ which is defined based on the forward propagation function. With these definitions, we can now begin to describe different classes of neural network.



```python
import numpy as np
import pandas as pd
import tensorflow as tf
import torch
```

## ANN/FNN

### Introduction

The Artificial Neural Network (ANN), also known as the feed-forward neural network (FNN), is the simplest type of neural network. It is also the one pictured in a conventional diagram like this one:

[conventional diagram of neural network]

An ANN is composed of one or more layers, where each layer is composed of one or more nodes, or neurons. Each neuron has a set of weights and a bias, and produces an output, known as the neuron's activation, based on them. Layers have the following form:

$f_i(\overrightarrow{x})$ = [$\sigma(\overrightarrow{x} \cdot \overrightarrow{w_j} + b_j)$ for each node $j$ in layer $i$] $\in \mathbb{R}^m$

Such that:

- $\overrightarrow{x}$: input vector of values
- $\overrightarrow{w_j}$: vector of trainable weights for node $j$
- $b_j$: trainable bias parameter for node $j$
- $\sigma$: activation function which transforms the final value, altering the behaviour of the neural network
- $\overrightarrow{x}, \overrightarrow{w_j} \in \mathbb{R}^n$ where $n$ is the input size
- the output is a vector in $\mathbb{R}^m$, where $m$ is the number of nodes in layer $i$
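The layer formula maps directly onto a matrix-vector product. The sketch below assumes the weight vectors of the $m$ nodes are stacked as the rows of a matrix `W`, so that `W @ x` computes every dot product at once; the function name `dense_forward` and the example values are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def dense_forward(x, W, b, activation):
    """Compute [activation(x . w_j + b_j) for each node j].

    x: input vector of shape (n,)
    W: weight matrix of shape (m, n); row j holds the weight vector w_j
    b: bias vector of shape (m,)
    """
    return activation(W @ x + b)


x = np.array([1.0, 2.0, 3.0])           # n = 3 inputs
W = np.array([[0.1, 0.2, 0.3],
              [0.0, -1.0, 0.5]])        # m = 2 nodes
b = np.array([0.0, 0.5])

y = dense_forward(x, W, b, sigmoid)
print(y.shape)
```

The output has one entry per node, so a layer with $m$ nodes always produces a vector in $\mathbb{R}^m$ regardless of the input size $n$.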

The input to $f_i$ is the result of $f_{i-1}$, that is, the output of all nodes from the previous layer. The first layer, $f_1$, is referred to as the input layer, while the last layer, $f_{N}$, is the output layer. All other layers are referred to as hidden layers of the ANN.

The layers of the neural network are **densely** connected: each neuron in a layer receives input from **all** neurons in the layer before it, and propagates its output to **all** neurons in the layer after it. The following diagrams summarize the neuron and ANN architecture:

[diagram of node]

[detailed diagram of ANN]

### Specifics

So with that simple overview, we are ready to explain how an ANN actually works. This is the most important neural network to understand as all others are essentially variations on this one. 

We will walk through an example, and explain all the math behind it.

### Definition

We must first define our neural network. Consider the following structure:

```
input layer: 3 neurons
hidden layer 1: 5 neurons
hidden layer 2: 4 neurons
output layer: 4 neurons
```

We know the size of each layer, but is that enough to get started? No. We also need to pick an activation function $\sigma$ for each layer, chosen based on which one we think will perform well on our dataset. A full list of common activation functions can be found [here](https://www.v7labs.com/blog/neural-networks-activation-functions), but for ANNs we usually pick one or more of these:

- sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$, with outputs in $(0, 1)$; commonly used when the output is a probability, e.g., in classification problems (because we want the probability that an input belongs to a given class as output)
- TanH: $\sigma(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$, a rescaled version of sigmoid that is steeper around zero and outputs values in $(-1, 1)$.
- ReLU (Rectified Linear Unit): $\sigma(x) = \max(x, 0)$, commonly used for images where we don't want all neurons to be activated (have a high output) at the same time.
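The three activation functions above are short enough to write out directly with NumPy. This is a minimal sketch; the function names and sample input are chosen here for illustration.

```python
import numpy as np


def sigmoid(x):
    # Squashes any real value into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))


def tanh(x):
    # Squashes any real value into the open interval (-1, 1).
    return np.tanh(x)


def relu(x):
    # Passes positive values through unchanged and zeroes out the rest.
    return np.maximum(x, 0.0)


x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))
print(tanh(x))
print(relu(x))
```

All three are applied element-wise, so they work unchanged on a whole layer's pre-activation vector.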

Hidden layers typically all use the same activation function, while the output layer may use a different one to correctly translate the value. However, the correct activation function to use depends heavily on the type of data you are using. For example, if we are solving a classification problem on a set of images, we may use ReLU for the hidden layers, then sigmoid for the output layer as the output *must* be between 0 and 1. We select activation functions as follows for our example (note: the selections for this example are arbitrary):

```
input layer: 3 neurons, ReLU
hidden layer 1: 5 neurons, sigmoid
hidden layer 2: 4 neurons, sigmoid
output layer: 4 neurons, TanH
```

Are we done yet? No!

We also need to choose a loss or error function $E$ for our neural network. This function will calculate the difference between our output and the expected output, known as loss or cost (or error). This function will also affect how our neural network is trained. A full list of common loss functions can be found [here](https://www.v7labs.com/blog/neural-networks-activation-functions) but for ANNs we usually pick one of these:

- Mean Squared Error (MSE): $\frac{1}{n}\sum^{n}_{i = 1}(e_i-a_i)^2$, where $e_i$ is the expected output and $a_i$ is the actual output; punishes large differences between output and expected output
- Cross Entropy: $-\frac{1}{n}\sum^{n}_{i = 1}e_i\ln(a_i)$, used for classification to heavily punish confident incorrect answers
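As a rough sketch, both losses can be written in a few lines of NumPy. The cross entropy here uses the standard form $-\frac{1}{n}\sum e_i\ln(a_i)$, with `expected` as $e$ and `actual` as $a$. The function names, the clipping constant `eps`, and the sample vectors are assumptions added for illustration.

```python
import numpy as np


def mean_squared_error(expected, actual):
    return np.mean((expected - actual) ** 2)


def cross_entropy(expected, actual, eps=1e-12):
    # Clip to avoid ln(0); eps is an arbitrary small constant (assumption).
    actual = np.clip(actual, eps, 1.0)
    return -np.mean(expected * np.log(actual))


expected = np.array([0.0, 1.0, 0.0, 0.0])          # the true class is index 1
good_guess = np.array([0.05, 0.85, 0.05, 0.05])    # most probability on the true class
bad_guess = np.array([0.85, 0.05, 0.05, 0.05])     # most probability on a wrong class

for name, guess in [("good", good_guess), ("bad", bad_guess)]:
    print(name, mean_squared_error(expected, guess), cross_entropy(expected, guess))
```

Both losses are larger for the bad guess, but cross entropy grows much faster as the probability assigned to the true class approaches zero.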

The right loss function to use also depends heavily on the type of problem you are trying to solve. In this case, we will arbitrarily choose one for demonstration purposes.

```
input layer: 3 neurons, ReLU
hidden layer 1: 5 neurons, sigmoid
hidden layer 2: 4 neurons, sigmoid
output layer: 4 neurons, TanH
Loss: Mean Squared Error
```

Now, we have successfully defined a neural network!
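One way to hold this definition in code is a small, immutable description of the layers and the loss. The class names `LayerSpec` and `NetworkSpec` and the string labels are hypothetical; they only mirror the listing above and do not correspond to any library API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    name: str
    neurons: int
    activation: str


@dataclass(frozen=True)
class NetworkSpec:
    layers: tuple
    loss: str


spec = NetworkSpec(
    layers=(
        LayerSpec("input layer", 3, "relu"),
        LayerSpec("hidden layer 1", 5, "sigmoid"),
        LayerSpec("hidden layer 2", 4, "sigmoid"),
        LayerSpec("output layer", 4, "tanh"),
    ),
    loss="mean_squared_error",
)

for layer in spec.layers:
    print(f"{layer.name}: {layer.neurons} neurons, {layer.activation}")
print(f"Loss: {spec.loss}")
```

Keeping the definition separate from the parameter values makes it easy to initialize the same architecture several times with different starting weights.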

### Initialization

The second step is to initialize the neural network, that is, to set values for its weights and biases. These can be random or deterministic (e.g., 0). In our case, let's set our weights to be random, normally distributed numbers. The listing below shows the shape of each layer's parameters, with 0 as a placeholder for each value:

```
input layer: {w: [0, 0, 0], b: [0, 0, 0]}
hidden layer 1: {w: [[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]], b: [0, 0, 0, 0, 0]}
hidden layer 2: {w: [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], b: [0, 0, 0, 0]}
output layer: {w: [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], b: [0, 0, 0, 0]}
```
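A minimal sketch of this initialization with NumPy follows. The shapes are copied from the listing above; drawing weights from a standard normal distribution, starting the biases at zero, and fixing the random seed are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed (assumption) for reproducibility

# Weight shapes mirror the listing above: one row of weights per neuron.
weight_shapes = {
    "input layer": (3,),
    "hidden layer 1": (5, 3),
    "hidden layer 2": (4, 5),
    "output layer": (4, 4),
}

params = {}
for name, shape in weight_shapes.items():
    params[name] = {
        "w": rng.normal(loc=0.0, scale=1.0, size=shape),   # random normal weights
        "b": np.zeros(shape[0]),                           # one bias per neuron, started at 0
    }

for name, p in params.items():
    print(name, p["w"].shape, p["b"].shape)
```

Random weights break the symmetry between neurons in the same layer; if every weight started at the same value, every neuron in a layer would compute the same output.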

