
# What is a Neural Network?

Neural networks are the core idea behind deep learning. Because a neural network fits itself to a dataset by adjusting a set of weights and biases, it is extremely versatile and can find patterns in many different types of data. Neural networks can also be built to specification, with different types of networks and fine-tuning designed for use cases such as image classification, time series analysis, and more.

## Parts of a neural network

Often when people try to explain a neural network, they show this image:

[conventional diagram of neural network]

However, this diagram does not really make mathematical sense, nor does it make much sense from a programming perspective. Thus, we shall begin with this diagram:

```
           f_1(X)      f_2(X)             f_N(X)
          +------+    +------+           +------+    
input     |      |    |      |           |      |  output  
--------> |      | -> |      | -> ... -> |      | --------> prediction
          |      |    |      |           |      |
          +------+    +------+           +------+    
```

The diagram shows that at its core, a neural network is just a composition of functions. These functions typically have the form

$\overrightarrow{y} = f_i(\overrightarrow{x}; \theta_1, ..., \theta_n; h_1, ..., h_m)$

Where:

- $\overrightarrow{x}$: input vector of values
- $\overrightarrow{y}$: output vector of values
- $\theta_i$: trainable parameter (i.e., weight or bias)
- $h_i$: hyperparameter (non-trainable parameter which affects the behaviour of the neural network)
- $n \geq{0}, m \geq{0}$
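To make the "composition of functions" idea concrete, here is a minimal sketch in plain Python. The three functions `f_1`, `f_2`, and `f_3` are hypothetical stand-ins with no trainable parameters; they are not real layers, and only illustrate how the output of one function becomes the input of the next.

```python
from functools import reduce

def compose(*functions):
    """Return a function that applies `functions` from left to right."""
    return lambda x: reduce(lambda value, f: f(value), functions, x)

# Hypothetical stand-in "layers" (assumptions for illustration only)
f_1 = lambda x: [2 * v for v in x]   # doubles every value
f_2 = lambda x: [v + 1 for v in x]   # adds 1 to every value
f_3 = lambda x: sum(x)               # reduces the vector to a single number

network = compose(f_1, f_2, f_3)
prediction = network([1, 2, 3])      # [1, 2, 3] -> [2, 4, 6] -> [3, 5, 7] -> 15
print(prediction)  # 15
```

A real layer would also carry its trainable parameters $\theta_i$ and hyperparameters $h_i$; those are left out here to keep the chaining visible.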

These are the bare essentials of a neural network. We are, however, missing some other essential parts that allow the neural network to get better at making predictions. We can refine this diagram a bit without loss of generality:

```
           f_1(X)      f_2(X)             f_N(X)
          +------+    +------+           +------+  output y
input     |      | -> |      | -> ... -> |      | ---------->        comparison
--------> |      |    |      |           |      |                vs. expected output
          |      | <- |      | <- ... <- |      | <----------         E(y, y')
          +------+    +------+           +------+   error O
           b_1(O)      b_2(O)             b_N(O)
```

We have now added a few functions which give a better picture of how the neural network works. Once we use the forward functions $f_i$ to get a prediction (this process is known as feed-forward or forward propagation), we compare it against our expected output. We can then get an error metric using this comparison, denoted by a function $E$. This error is then passed back through the neural network using the backwards functions $b_i$ (this process is known as back propagation) to update the trainable parameters.

We now have all the components of a neural network. But this is all very abstract. How do we actually define the functions $f_i$, $b_i$, and $E$? That is use-case-specific. In fact, we can build neural networks as we wish. We refer to each function $f_i$ as a layer of the neural network. Each layer has a corresponding backpropagation function $b_i$ which is defined based on the forward propagation function. With these definitions, we can now begin to describe different classes of neural network.
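As a rough illustration of how a forward function and its backward function pair up, the sketch below uses a toy layer with a single trainable parameter. Everything in it is an assumption made for the sake of the example: the `ScaleLayer` class, the squared-error comparison, the gradient-descent update, and the `learning_rate` value are not taken from the sections above.

```python
class ScaleLayer:
    """Toy layer f(x) = theta * x with one trainable parameter (hypothetical)."""

    def __init__(self, theta):
        self.theta = theta

    def forward(self, x):
        self.x = x                       # remember the input; backward needs it
        return self.theta * x

    def backward(self, error, learning_rate=0.1):
        # `error` is the derivative of the loss with respect to this layer's output
        grad_theta = error * self.x      # how the loss changes with theta
        grad_input = error * self.theta  # error to pass to the previous layer
        self.theta -= learning_rate * grad_theta
        return grad_input

layers = [ScaleLayer(0.5), ScaleLayer(2.0)]

def predict(x):
    for layer in layers:                 # forward propagation: f_1, then f_2
        x = layer.forward(x)
    return x

def train_step(x, expected):
    y = predict(x)
    loss = (y - expected) ** 2           # E(y, y'): squared error (assumed)
    error = 2 * (y - expected)           # derivative of the loss w.r.t. y
    for layer in reversed(layers):       # back propagation: b_2, then b_1
        error = layer.backward(error)
    return loss

loss_before = train_step(1.0, 3.0)
loss_after = (predict(1.0) - 3.0) ** 2
print(loss_before, loss_after)           # the loss should shrink after one update
```

The point is only the shape of the loop: each layer stores what it needs during `forward`, and `backward` both updates the layer's own parameter and hands an error back to the layer before it.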

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import torch

## ANN/FNN

### Introduction

The Artificial Neural Network (ANN), also known as the feed-forward neural network (FNN), is the simplest type of neural network. It is also the one pictured in a conventional diagram like this one:

[conventional diagram of neural network]

An ANN is composed of one or more layers, where each layer is composed of one or more nodes, or neurons. Each neuron has a set of weights and a bias, and produces an output, known as the neuron's activation, based on them. Layers have the following form:

$f_i(\overrightarrow{x}) = [\sigma(\overrightarrow{x} \cdot \overrightarrow{w_j} + b_j) \text{ for each node } j \text{ in layer } i] \in \mathbb{R}^m$

Such that:

- $\overrightarrow{x}$: input vector of values
- $\overrightarrow{w_j}$: vector of trainable weights of node $j$
- $b_j$: trainable bias parameter of node $j$
- $\sigma$: activation function which transforms the final value, altering the behaviour of the neural network
- $\overrightarrow{x}, \overrightarrow{w_j} \in \mathbb{R}^n$ where $n$ is the input size
- the output is a vector in $\mathbb{R}^m$, where $m$ is the number of nodes in layer $i$
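The layer formula can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `dense_forward`, the example values, and the choice to store the weight vectors $\overrightarrow{w_j}$ as the columns of a matrix `W` are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_forward(x, W, b, activation):
    """Compute one layer: x has shape (n,), W has shape (n, m), b has shape (m,).

    Column j of W is the weight vector of node j, so x @ W computes the
    dot product of x with every node's weights at once.
    """
    return activation(x @ W + b)

x = np.array([1.0, 2.0])                 # n = 2 inputs
W = np.array([[0.5, -1.0, 0.0],
              [0.25, 0.5, 0.0]])         # m = 3 nodes
b = np.array([0.0, 0.0, 0.0])

y = dense_forward(x, W, b, sigmoid)
print(y.shape)  # (3,)
```

Here `x @ W` equals `[1.0, 0.0, 0.0]`, so the second and third nodes output `sigmoid(0) = 0.5`.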

The input to $f_i$ is the result of $f_{i-1}$, that is, the output of all nodes from the previous layer. The first layer, $f_1$, is referred to as the input layer, while the last layer, $f_{N}$, is the output layer. All other layers are referred to as hidden layers of the ANN.

The layers of the neural network are **densely** connected; that is, each neuron from a layer receives input from **all** neurons in the layer before it, and propagates its output to **all** neurons in the layer after it. The following diagrams summarize the neuron and ANN architecture:

[diagram of node]

[detailed diagram of ANN]

The weights and biases are the trainable parameters of the model. By tweaking these, we begin to "learn" our dataset and thus can create good predictions. Weights are multiplied with the inputs of a given neuron, and the bias is then added to the weighted sum. Finally, the activation function transforms the resulting value.

### Specifics

So with that simple overview, we are ready to explain how an ANN actually works. This is the most important neural network to understand as all others are essentially variations on this one.

We will walk through an example, and explain all the math behind it.

### Definition

We must first define our neural network. Consider the following structure:

```
input layer: 3 neurons
hidden layer 1: 5 neurons
hidden layer 2: 4 neurons
output layer: 4 neurons
```

We know the size of each layer, but is that enough to get started? No. We also need to pick an activation function $\sigma$ for each layer, and we choose based on which one we think could perform well on our dataset. A full list of common activation functions can be found [here](https://www.v7labs.com/blog/neural-networks-activation-functions), but for ANNs we usually pick one or more of these:

- sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$, commonly used when the output is a probability e.g., in classification problems (because we want the probability that an input belongs to a given class as output)
- TanH: $\sigma(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$, a rescaled version of sigmoid that is steeper around 0 and produces outputs between -1 and 1.
- ReLU (Rectified Linear Unit): $\sigma(x) = \max(x, 0)$, commonly used for images where we don't want all neurons to be activated (have a high output) at the same time.
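The three activation functions above translate directly into NumPy. This is a small sketch written from the formulas in the list; the function names and the sample input are our own choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Written out from the formula; np.tanh computes the same function
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    return np.maximum(x, 0)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))  # values strictly between 0 and 1
print(tanh(x))     # values strictly between -1 and 1
print(relu(x))     # negative inputs become 0, positive inputs pass through
```

In practice `np.tanh` is the safer choice, since the written-out formula can overflow for large inputs.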

Hidden layers typically all use the same activation function, while the output layer may use a different one to correctly translate the value. However, the correct activation function to use depends heavily on the type of data you are using. For example, if we are solving a classification problem on a set of images, we may use ReLU for the hidden layers, then sigmoid for the output layer as the output *must* be between 0 and 1. We select activation functions as follows for our example (note: the selections for this example are arbitrary):

```
input layer: 3 neurons, No activation
hidden layer 1: 5 neurons, sigmoid
hidden layer 2: 4 neurons, sigmoid
output layer: 4 neurons, TanH
```

Are we done yet? No!

We also need to choose a loss or error function $E$ for our neural network. This function will calculate the difference between our output and the expected output, known as loss or cost (or error). This function will also affect how our neural network is trained. A full list of common loss functions can be found [here](https://www.v7labs.com/blog/neural-networks-activation-functions) but for ANNs we usually pick one of these:

- Mean Squared Error (MSE): $\frac{1}{n}\sum^{n}_{i = 1}(e_i-a_i)^2$, where $e_i$ is the expected output and $a_i$ is the actual output at position $i$; punishes large differences between output and expected output
- Cross Entropy: $-\frac{1}{n}\sum^{n}_{i = 1}e_i\ln(a_i)$, used for classification to heavily punish incorrect answers

The right loss function to use also depends heavily on the type of problem you are trying to solve. In this case we will arbitrarily choose one for demonstration purposes.
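Both loss functions are short enough to sketch in NumPy. The sketch assumes that $e_i$ is the expected output and $a_i$ is the actual output, and it writes cross entropy in its usual form, the negative mean of $e_i\ln(a_i)$. The function names and sample values are hypothetical.

```python
import numpy as np

def mean_squared_error(expected, actual):
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.mean((expected - actual) ** 2)

def cross_entropy(expected, actual):
    # Assumes every entry of `actual` is strictly positive, so the log is defined
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return -np.mean(expected * np.log(actual))

# MSE: (0.25 + 0.25 + 0 + 1) / 4
print(mean_squared_error([1, 1, 1, 1], [0.5, 0.5, 1.0, 0.0]))  # 0.375

# Cross entropy with a one-hot expected output: only the true class contributes
print(cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
```

Note how cross entropy grows quickly as the predicted probability of the true class approaches 0, which is what "heavily punish" refers to.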

```
input layer: 3 neurons, No activation
hidden layer 1: 5 neurons, sigmoid
hidden layer 2: 4 neurons, sigmoid
output layer: 4 neurons, TanH
Loss: Mean Squared Error
```

Now, we have successfully defined a neural network!

### Initialization

The second step is to initialize the neural network, that is, set values for its weights and biases. These can be random, or deterministic (e.g. 0). In our case, let's set our weights and biases to be random, normally distributed numbers. We drew the numbers using NumPy, and the output is as follows:

```
weights:

Hidden Layer 1:
 array([[ 1.81104822,  0.24283927, -0.45110688,  0.28471475, -1.04286668],
        [-0.40889313,  1.26018711,  0.33772571,  1.95839657,  0.72655273],
        [-0.59916835, -1.09624082, -0.18070728,  0.61468552, -1.93318226]])

Hidden Layer 2:
 array([[ 1.38659083, -0.21726771,  1.27987134,  1.02156397],
        [ 1.17187974, -0.66710825,  0.09730845,  0.45786302],
        [ 0.14123728, -1.57444339,  0.03482889, -0.92917211],
        [-0.54022096, -0.05935614, -1.56170309, -1.68928938],
        [ 0.81313641,  0.93783643,  0.15429917,  0.86244518]])

Output Layer:
 array([[-0.04549228, -0.18057839,  1.12795073,  0.33941836],
        [-0.65069875,  1.04148898, -0.85574674, -0.71034189],
        [-3.08452403,  0.98993457,  1.42463053,  0.59976305],
        [-0.98725731,  0.6758527 , -0.64810604,  0.20081795]])

biases:

Hidden Layer 1:
 array([ 0.49401179,  0.20384343,  0.06500383,  0.72427753, -1.37471834])

Hidden Layer 2:
 array([-0.53479357, -0.69246683,  0.87273291, -0.73714577])

Output Layer:
 array([-1.35273762,  1.82817138,  1.03168826, -1.00112174])
```

The convention is that weights are assigned by input position then neuron. For example, for hidden layer 1, since we have 3 input positions and 5 neurons, the shape of the weight array is (3, 5).
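A draw like the one above could be produced along the following lines. This is only a sketch: the seed and the use of `numpy.random.default_rng` are assumptions, so the numbers it generates will not match the values listed above. What it does reproduce is the shape convention.

```python
import numpy as np

rng = np.random.default_rng(seed=0)      # the seed is an arbitrary assumption

layer_sizes = [3, 5, 4, 4]               # input, hidden 1, hidden 2, output

# One weight matrix per non-input layer, shaped (number of inputs, number of neurons)
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# One bias per neuron in each non-input layer
biases = [rng.normal(size=n_out) for n_out in layer_sizes[1:]]

for w, b in zip(weights, biases):
    print(w.shape, b.shape)
```

The printed shapes are `(3, 5) (5,)`, `(5, 4) (4,)`, and `(4, 4) (4,)`, matching hidden layer 1, hidden layer 2, and the output layer.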

Rounded to roughly three decimal places, this gives the following network:

```
input layer: {
    no weights or biases
}
hidden layer 1: {
    weights: [[1.811, 0.243, -0.451, 0.285, -1.043], [-0.409, 1.260, 0.338, 1.958, 0.727], [-0.599, -1.096, -0.181, 0.615, -1.933]],
    biases: [0.494, 0.204, 0.065, 0.724, -1.375]
}
hidden layer 2: {
    weights: [[1.386, -0.217,  1.280,  1.021], [1.172, -0.667, 0.097, 0.458], [0.141, -1.574, 0.034, -0.929], [-0.540, -0.059, -1.562, -1.689], [0.813, 0.938, 0.154, 0.862]],
    biases: [-0.535, -0.692, 0.873, -0.737]
}
output layer: {
    weights: [[-0.045, -0.180, 1.128, 0.339], [-0.651, 1.041, -0.856, -0.710], [-3.085, 0.990, 1.425, 0.600], [-0.987, 0.676, -0.648, 0.201]],
    biases: [-1.35, 1.828, 1.031, -1.001]
}
```

In order to establish the correctness of this example, we will also be replicating it using code. If our formulas are correct, we should see 2 things:

1. The results of the forward propagation in the code match our results.
2. The weights and biases in the code after 1 round of backpropagation match our own.

Below is the code needed for this example, implemented using TensorFlow and Keras.

In [None]:
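from keras.layers import Dense, Input

# Initial weights for each Dense layer, shaped (number of inputs, number of neurons)
weights = [[[1.811, 0.243, -0.451, 0.285, -1.043], [-0.409, 1.260, 0.338, 1.958, 0.727], [-0.599, -1.096, -0.181, 0.615, -1.933]],
           [[1.386, -0.217,  1.280,  1.021], [1.172, -0.667, 0.097, 0.458], [0.141, -1.574, 0.034, -0.929], [-0.540, -0.059, -1.562, -1.689], [0.813, 0.938, 0.154, 0.862]],
           [[-0.045, -0.180, 1.128, 0.339], [-0.651, 1.041, -0.856, -0.710], [-3.085, 0.990, 1.425, 0.600], [-0.987, 0.676, -0.648, 0.201]]]
# Initial biases for each Dense layer, one per neuron
biases = [[0.494, 0.204, 0.065, 0.724, -1.375],
          [-0.535, -0.692, 0.873, -0.737],
          [-1.35, 1.828, 1.031, -1.001]]

model = keras.Sequential([
    Input(shape=(3,)),  # 3 input values; the input layer has no weights or biases
    Dense(5, activation="sigmoid"),
    Dense(4, activation="sigmoid"),
    Dense(4, activation="tanh")
])

# model.layers contains only the three Dense layers, so indices line up with weights and biases
for i in range(len(model.layers)):
    # set_weights expects NumPy arrays, so convert the nested lists first
    model.layers[i].set_weights([np.array(weights[i]), np.array(biases[i])])

# Print the weights back to confirm they were set as intended
for i in range(len(model.layers)):
    print(model.layers[i].get_weights())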

### Training

We can now begin training our neural network. Training is split into 2 parts: forward propagation and back propagation. The training loop on a single input looks like this:

- a sample is fed into the network
- the sample is run through the layers (forward propagation) to produce a prediction
- the error against the expected output is calculated using the error function
- weights and biases are updated in reverse order (back propagation). Note that this may not happen after every sample, as discussed below.

This process is repeated for all samples, completing what is known as an epoch. Once we have completed the desired number of epochs, training ends. To define a training sequence, we specify a few parameters:

- number of epochs
- batch size (number of samples after which we run a single back propagation)
- optimizer (optional): aids the training process and can boost performance
- etc.

For our example, we will run 1 epoch, with a batch size of 1, and no optimizer (note that we only have 1 input). Assuming our data is preprocessed correctly (discussed in code examples) we can now begin the training loop.
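To see how the number of epochs and the batch size interact, the sketch below counts how many back propagation steps a training run performs. The helper names `iterate_batches` and `count_updates` are hypothetical, and the forward and backward passes themselves are left out on purpose.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive slices of `samples` of at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

def count_updates(samples, epochs, batch_size):
    """Count back propagation steps: one per batch, repeated for every epoch."""
    updates = 0
    for epoch in range(epochs):
        for batch in iterate_batches(samples, batch_size):
            # Forward propagation and error calculation for each sample
            # in `batch` would happen here.
            updates += 1  # a single back propagation for the whole batch
    return updates

# Our example: 1 sample, 1 epoch, batch size 1
print(count_updates([[0.1, 0.2, 0.3]], epochs=1, batch_size=1))  # 1

# 10 samples in batches of 4 give 3 batches per epoch (4, 4, and 2 samples)
print(count_updates(list(range(10)), epochs=3, batch_size=4))    # 9
```

A larger batch size therefore means fewer, less frequent updates per epoch.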

### Forward Propagation

We begin with our preprocessed input, which looks something like this:

```[0.1, 0.2, 0.3]```

We want our output to be:

```[1, 1, 1, 1]```

That is, our example neural network should output 1 in every position for this input. The particular target values are not important for the example to work.

Let us refresh ourselves on what the neural network looks like:

```
input layer: {
    input: [0, 0, 0]
    output: [0, 0, 0]
}
hidden layer 1: {
    input to each neuron: [0, 0, 0]
    output: [0, 0, 0, 0, 0]
    weights: [[1.811, 0.243, -0.451, 0.285, -1.043], [-0.409, 1.260, 0.338, 1.958, 0.727], [-0.599, -1.096, -0.181, 0.615, -1.933]],
    biases: [0.494, 0.204, 0.065, 0.724, -1.375]
}
hidden layer 2: {
    input to each neuron: [0, 0, 0, 0, 0]
    output: [0, 0, 0, 0]
    weights: [[1.386, -0.217,  1.280,  1.021], [1.172, -0.667, 0.097, 0.458], [0.141, -1.574, 0.034, -0.929], [-0.540, -0.059, -1.562, -1.689], [0.813, 0.938, 0.154, 0.862]],
    biases: [-0.535, -0.692, 0.873, -0.737]
}
output layer: {
    input to each neuron: [0, 0, 0, 0]
    output: [0, 0, 0, 0]
    weights: [[-0.045, -0.180, 1.128, 0.339], [-0.651, 1.041, -0.856, -0.710], [-3.085, 0.990, 1.425, 0.600], [-0.987, 0.676, -0.648, 0.201]],
    biases: [-1.35, 1.828, 1.031, -1.001]
}
```

We pass our input to the input layer. Note that for the *input layer only*, each neuron receives *exactly 1 value* from the input, read from left to right (that is, neuron 1 receives input[0], neuron 2 receives input[1], and neuron 3 receives input[2]). The input layer also does not do any computation, so it just spits out the input it received as its result. Our neural network looks like this after the input layer is run:

```
input layer: {
    input: [0.1, 0.2, 0.3]
    output: [0.1, 0.2, 0.3]
}
hidden layer 1: {
    input to each neuron: [0.1, 0.2, 0.3]
    output: [0, 0, 0, 0, 0]
    weights: [[1.811, 0.243, -0.451, 0.285, -1.043], [-0.409, 1.260, 0.338, 1.958, 0.727], [-0.599, -1.096, -0.181, 0.615, -1.933]],
    biases: [0.494, 0.204, 0.065, 0.724, -1.375]
}
hidden layer 2: {
    input to each neuron: [0, 0, 0, 0, 0]
    output: [0, 0, 0, 0]
    weights: [[1.386, -0.217,  1.280,  1.021], [1.172, -0.667, 0.097, 0.458], [0.141, -1.574, 0.034, -0.929], [-0.540, -0.059, -1.562, -1.689], [0.813, 0.938, 0.154, 0.862]],
    biases: [-0.535, -0.692, 0.873, -0.737]
}
output layer: {
    input to each neuron: [0, 0, 0, 0]
    output: [0, 0, 0, 0]
    weights: [[-0.045, -0.180, 1.128, 0.339], [-0.651, 1.041, -0.856, -0.710], [-3.085, 0.990, 1.425, 0.600], [-0.987, 0.676, -0.648, 0.201]],
    biases: [-1.35, 1.828, 1.031, -1.001]
}
```

Note that we also populate the input to each neuron for hidden layer 1, as this is equal to the output of the input layer. We can now begin the next step of forward propagation, which is hidden layer 1. Recall the formula for the forward propagation function of each neuron $j$ in hidden layer 1: $f_{j} = \sigma(\overrightarrow{x} \cdot \overrightarrow{w_j} + b_j)$, where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid activation we chose for this layer. We now need to construct $\overrightarrow{w_j}$. The shape of the list of weights in our neural network is (number of inputs, number of neurons). Thus, for the first neuron, its weights are the first item in each subarray. For layer 1, this is `[1.811, -0.409, -0.599]`.
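The construction of $\overrightarrow{w_j}$ can be checked with a few lines of NumPy. The sketch below uses the rounded weights and biases of hidden layer 1 and the sigmoid activation chosen for that layer in the definition above; the variable names are our own.

```python
import numpy as np

x = np.array([0.1, 0.2, 0.3])

# Hidden layer 1 weights, shaped (number of inputs, number of neurons)
W1 = np.array([[1.811, 0.243, -0.451, 0.285, -1.043],
               [-0.409, 1.260, 0.338, 1.958, 0.727],
               [-0.599, -1.096, -0.181, 0.615, -1.933]])
b1 = np.array([0.494, 0.204, 0.065, 0.724, -1.375])

# The first item of each subarray is the first column of W1
w_first = W1[:, 0]
print(w_first)               # [ 1.811 -0.409 -0.599]

# Weighted sum plus bias: 0.1811 - 0.0818 - 0.1797 + 0.494
z_first = x @ w_first + b1[0]
print(round(z_first, 4))     # 0.4136

# Sigmoid activation of the first neuron
a_first = 1.0 / (1.0 + np.exp(-z_first))
print(round(a_first, 3))     # 0.602
```

The same column slice, `W1[:, j]`, gives the weight vector for any other neuron $j$ in the layer.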