# Neural Network - Theory

## Contents:

### [1. General Knowledge about Neural Network](#What-is-a-Neural-Network?)
####    [- Types of Neural Network](#Types-of-Neural-Network)
####    [- Applications of neural networks](#Applications-of-neural-networks)
####    [- Components of Artificial Neural Network](#Components-of-Artificial-Neural-Network)
####    [- How Artificial Neural Network work?](#How-Artificial-Neural-Network-work?)
####    [--- Weights](#Weights)
####    [--- Activation Function](#Activation-Function)
####    [--- Backpropagation](#Backpropagation)
####    [--- Loss Function](#Loss-Function)
####    [--- Gradient Descent](#Gradient-Descent)
####    [- How do we know the number of layers and their types?](#How-do-we-know-the-number-of-layers-and-their-types?)

### [2. Restricted Boltzmann Machines (RBM)](#Restricted-Boltzmann-Machines-(RBM))
####    [--- RBM - Scores](#RBM---Scores)
####    [--- Formula of RBM Energy Function](#Formula-of-RBM-Energy-Function)
####    [--- RBM - Probabilities](#RBM---Probabilities)
####    [--- Contrastive Divergence](#Training---Contrastive-Divergence)
####    [--- Gibbs Sampling](#Gibbs-Sampling)
####    [- Difference between RBM & a normal Feed Forward Network](#Difference-between-RBM-&-a-normal-Feed-Forward-Network)

# What is a Neural Network?

Neural networks form the basis of deep learning, which is a subset of the machine learning field.
- They are inspired by the structure of the human brain.
- A real neuron in the human brain has the following components:
  - Dendrite: Input to a neuron
  - Cell body: Information processing happens here
  - Axon: Output of the neuron

# Types of Neural Network

### 1. Feedforward Neural Network
The simplest form of artificial neural network: data travels in only one direction (input -> output).
- Applications: Vision and speech recognition

### 2. Radial Basis Function Neural Network
This model classifies the data point based on its distance from a center point.
- Applications: Power Restoration Systems
  
### 3. Kohonen Self Organizing Neural Network
Input vectors of arbitrary dimension are mapped onto a discrete map made up of neurons.
- Applications: Used to recognize patterns in data, for example in medical analysis.

### 4. Recurrent Neural Network (RNN)
The hidden layer saves its output and feeds it back in as an input for future predictions.
- Applications: Text to speech conversion model

### 5. Convolution Neural Network (CNN)
The input features are processed in small local patches by sliding filters over the input. This allows the network to learn an image part by part.
- Applications: Used in signal and image processing

### 6. Modular Neural Network
It has a collection of different neural networks working together to get the output.

# Applications of neural networks
- Facial Recognition
- Handwriting Recognition
- Forecasting
    - Stock exchange prediction
- Music Composition
- Image Compression

# Components of Artificial Neural Network

1. Input layer: Receives input as an array of data
2. Output layer: Predicts final output
3. Hidden layers: A black box which performs most of the required computations
4. Neurons: Each node represents a neuron, similar to the neurons of the brain.
5. Channels: The connections between 2 neurons.
6. Weights: Each channel is assigned a numerical value (its weight). The input from one neuron is multiplied by this weight and passed to the connected neuron.
7. Bias: Each neuron that receives inputs from a previous layer (for example, a hidden-layer neuron receiving from the input layer) is associated with a numerical value that is added to its weighted sum of inputs. This value is called the bias.
8. Activation Function: A mathematical function that determines whether the threshold for a neuron to be activated has been crossed, and passes the result on to the next neurons.

## How to count layers?
There are disagreements on whether to count the input layer as a layer, but the general convention is not to count it (supported by the book "Neural Smithing").

A conventional notation summarizes the structure of a multilayer neural network by listing the number of nodes in each layer, separated by slashes.

Example: 2/8/1 means a 2-layer MLP with an input layer of 2 nodes, 1 hidden layer of 8 nodes, and an output layer of 1 node.
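
The notation can be unpacked mechanically. The sketch below is illustrative only: the helper name `describe_mlp` is hypothetical, and it assumes a fully connected network with one bias per non-input neuron.

```python
def describe_mlp(notation):
    """Summarise an MLP written as 'inputs/hidden.../outputs', e.g. '2/8/1'."""
    sizes = [int(part) for part in notation.split("/")]
    if len(sizes) < 2:
        raise ValueError("need at least an input layer and an output layer")
    return {
        "sizes": sizes,
        # Convention from the text: the input layer is not counted.
        "layers": len(sizes) - 1,
        "hidden_layers": len(sizes) - 2,
        # Assumption: fully connected, so one weight per pair of nodes
        # in consecutive layers.
        "weights": sum(a * b for a, b in zip(sizes, sizes[1:])),
        # Assumption: one bias per neuron outside the input layer.
        "biases": sum(sizes[1:]),
    }


summary = describe_mlp("2/8/1")
print(summary)
```

Under these assumptions a 2/8/1 network has 2 counted layers, 24 weights (2 x 8 + 8 x 1) and 9 biases.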

# How Artificial Neural Network work?
1. The input is fed as arrays of data into the input layer.
2. Random weights are assigned to each interconnection between the input and hidden layer.
3. The inputs are multiplied by the weights and a bias is added to form the transfer function.
$$z = \sum_{i=1}^n w_i x_i + b$$

Where:
- $z$: The result of the summation (input to the activation function).
- $w_i$: Weight associated with each input $x_i$.
- $x_i$: Individual inputs to the node.
- $b$: Bias term added to the summation.
- $\sum_{i=1}^n$: Summation symbol, summing over $i$ from 1 to $n$ (total number of inputs).

4. Weights are likewise assigned to the interconnections between the hidden layers.
5. Within each neuron, the output of the transfer function is fed into the activation function, and the result is passed as an input to the neurons of the subsequent layer.
6. At the end, the output layer produces the final prediction by applying a suitable activation function.
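
The steps above can be sketched as a forward pass through a 2/8/1 network. This is a minimal illustration, not a reference implementation: the helper names (`dense_layer`, `random_layer`), the weight range, the zero initial biases and the choice of sigmoid for every layer are all assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation):
    """weights[j][i] is the weight on the channel from input i to neuron j."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        # Transfer function: z = sum_i w_i * x_i + b
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


def random_layer(n_inputs, n_neurons, rng):
    """Random starting weights; biases start at zero (an arbitrary choice)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
               for _ in range(n_neurons)]
    return weights, [0.0] * n_neurons


rng = random.Random(0)
hidden_w, hidden_b = random_layer(2, 8, rng)   # input layer -> hidden layer
output_w, output_b = random_layer(8, 1, rng)   # hidden layer -> output layer

x = [0.5, -1.2]
hidden = dense_layer(x, hidden_w, hidden_b, sigmoid)
prediction = dense_layer(hidden, output_w, output_b, sigmoid)
print(prediction)
```

Because the weights are random and untrained, the prediction is meaningless at this point; training (covered in the following sections) is what adjusts the weights.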

# Weights
The higher the weight on a connection, the more strongly the input multiplied by it
influences the neuron. Weights can also be negative, in which case we can say that the
signal is inhibited by the negative weight.

Depending on the weights, the computation of the neuron will be different. By adjusting the weights of an artificial neuron we can obtain the output we want for specific inputs. 

With an ANN of hundreds or thousands of neurons, however, finding all the necessary weights by hand would be quite complicated. Instead, we can use algorithms that adjust the weights of the ANN in order to obtain the desired output from the network. This process of adjusting the weights is called learning or training.

# Activation Function
There are multiple types of activation function that could be used.

1. The Sigmoid Function, $\sigma(z) = \frac{1}{1 + e^{-z}}$, which is used when the model is predicting a probability.
<img src="sigmoid_function_plot.png" alt="Sigmoid Function" width="500">

2.  the Threshold Function, $
\phi(x) =
\begin{cases}
1, & \text{if } x \geq 0 \\
0, & \text{if } x < 0
\end{cases}
$ , which is used when the output depends on a threshold value.
<img src="threshold_function_plot.png" alt="Threshold Function" width="500">

3. The ReLU (Rectified Linear Unit) Function, which outputs $x$ if $x$ is positive and 0 otherwise.

$$\phi(x) = \max(0, x)$$

Where:
- $x$: Input value.
- $\phi(x)$: Output of the ReLU function.
<img src="relu_function_plot.png" alt="ReLU Function" width="500">

4. The Hyperbolic Tangent (tanh) Function, which is similar to the sigmoid function but has a range of (-1, 1).


$$\phi(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}$$

Where:
- $x$: Input value.
- $\phi(x)$: Output of the tanh function.
<img src="tanh_function_plot.png" alt="Tanh Function" width="500">
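
The four activation functions above translate directly into code. The sketch below uses only the standard library and follows the formulas as written in this section; it does not guard against overflow for very large negative inputs.

```python
import math


def sigmoid(z):
    """1 / (1 + e^-z): squashes any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def threshold(x):
    """1 if x >= 0, otherwise 0."""
    return 1.0 if x >= 0 else 0.0


def relu(x):
    """max(0, x)."""
    return max(0.0, x)


def tanh(x):
    """(1 - e^-2x) / (1 + e^-2x): squashes any input into (-1, 1)."""
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))


for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), threshold(x), relu(x), tanh(x))
```

In practice `math.tanh` gives the same values as the hand-written `tanh` and is the more robust choice.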

# Backpropagation

Backpropagation is the process of updating the weights of the network in order to reduce the error in prediction.\
The backpropagation algorithm uses supervised learning: we provide the algorithm with examples of the inputs and the outputs we want the network to compute, and a cost function measures the resulting error. The slope of that cost with respect to each weight is then propagated backwards from the output layer to decide how each weight should change.\
The predicted output is compared with the expected output, and the process is repeated over many iterations until the error is acceptably low. \
For practical reasons, ANNs implementing the basic backpropagation algorithm do not
have too many layers, since the time for training the network grows rapidly as layers are added. There are also refinements to the backpropagation algorithm which allow faster learning.
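
To make the idea concrete, here is a small sketch of backpropagation for a 2/2/1 network with sigmoid activations and a squared-error loss. The network size, the parameter values and the function names are all assumptions chosen for illustration. The last lines check one backpropagated gradient against a finite-difference estimate, which is a common way to sanity-check such code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Return (hidden activations, output) for a 2/2/1 network."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output


def loss(params, x, target):
    return (target - forward(params, x)[1]) ** 2


def backprop(params, x, target):
    """Gradients of the squared error with respect to every weight and bias."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden, output = forward(params, x)
    # Error signal at the output neuron (chain rule through loss and sigmoid).
    delta_out = -2.0 * (target - output) * output * (1.0 - output)
    grad_w_out = [delta_out * h for h in hidden]
    # Propagate the error signal backwards through the output weights.
    delta_hidden = [delta_out * w * h * (1.0 - h)
                    for w, h in zip(w_out, hidden)]
    grad_w_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return grad_w_hidden, delta_hidden, grad_w_out, delta_out


params = ([[0.1, -0.4], [0.7, 0.2]], [0.0, 0.1], [0.5, -0.3], 0.2)
x, target = [1.0, 0.5], 1.0
grads = backprop(params, x, target)

# Finite-difference check on the first hidden weight.
eps = 1e-6
bumped = ([[0.1 + eps, -0.4], [0.7, 0.2]], [0.0, 0.1], [0.5, -0.3], 0.2)
numeric = (loss(bumped, x, target) - loss(params, x, target)) / eps
print(grads[0][0][0], numeric)
```

The two printed numbers should agree to several decimal places; a weight update would then subtract a small multiple of each gradient from the corresponding weight.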

# Loss Function
A loss function is a measurement of error: it quantifies how far the predicted output is from the actual output.

Loss function formula:

$$(\text{actual output} - \text{predicted output})^2$$

## Error of the entire network
The error of the network will simply be the sum of the errors of all the neurons in the output layer:

$$\sum_{i=1}^n (\text{actual output}_i - \text{predicted output}_i)^2$$

## Cost Function
The cost function measures how far the output of the network is from the target values.

The formula for the cost function is:

$$ C(w, b) = \frac{1}{2n} \sum_x \left(target_x - activation_x\right)^2 $$

Where:
- $n$: Number of data points
- $x$: Index of the summation, which runs over the output neurons for each of the $n$ data points

The activation of the output neurons depends on the weights and biases of the network, so the cost function can be thought of as a function of the network's weights and biases.
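
The loss, network error and cost formulas can be sketched together as follows. The function names are hypothetical, and the sketch assumes the sum in the cost runs over every output neuron of every data point, with $n$ the number of data points.

```python
def squared_error(actual, predicted):
    """Loss for a single output neuron."""
    return (actual - predicted) ** 2


def network_error(actuals, predictions):
    """Sum of the squared errors over the output neurons for one data point."""
    return sum(squared_error(a, p) for a, p in zip(actuals, predictions))


def cost(targets, activations):
    """Quadratic cost: 1 / (2n) times the summed errors over n data points."""
    n = len(targets)
    total = sum(network_error(t, a) for t, a in zip(targets, activations))
    return total / (2 * n)


# Two data points, each with two output neurons (made-up values).
targets = [[1.0, 0.0], [0.0, 1.0]]
activations = [[0.5, 0.5], [0.0, 1.0]]
result = cost(targets, activations)
print(result)
```

Here the first data point contributes 0.25 + 0.25 and the second contributes 0, so the cost is 0.5 / (2 x 2) = 0.125.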

# Gradient Descent
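An iterative method of finding the minimum of a function, often pictured as walking downhill on the curve of error against weight.\
A random point on this curve is chosen and the slope at this point is calculated.
- A positive slope means the error rises as the weight increases, so the weight should be decreased.
- A negative slope means the error falls as the weight increases, so the weight should be increased.
- A zero slope means the weight is at a (local) minimum and needs no further adjustment.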
  
Our aim is to reach a point where the slope is zero.

The formula for gradient descent is:

$$\Delta w_{ji} = - \eta \frac{\partial E}{\partial w_{ji}}$$

- $\Delta w_{ji}$: The change in weight for the connection from neuron $j$ to neuron $i$.
- $\eta$: The learning rate.
- $\frac{\partial E}{\partial w_{ji}}$: The partial derivative of the error $E$ with respect to the weight $w_{ji}$, which represents the gradient.

This formula can be interpreted in the following way: the adjustment of each weight
$\Delta w_{ji}$ is the negative of a constant $\eta$ multiplied by the dependence of the
network's error on that weight, which is the derivative of $E$ with respect to $w_{ji}$.
The size of the adjustment depends on $\eta$ and on the contribution of the weight to the error. That is, if the weight contributes a lot to the error, the adjustment will be greater than if it contributes a smaller amount.
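
The update rule can be tried on a one-weight toy problem. In the sketch below the error curve $E(w) = (w - 3)^2$, the starting point, the learning rate and the step count are all made up for illustration; only the update `delta_w = -learning_rate * gradient(w)` comes from the formula above.

```python
def gradient_descent(gradient, w_start, learning_rate=0.1, steps=100):
    """Repeatedly apply delta_w = -learning_rate * dE/dw."""
    w = w_start
    for _ in range(steps):
        delta_w = -learning_rate * gradient(w)
        w += delta_w
    return w


def slope(w):
    """dE/dw for the hypothetical error curve E(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_final = gradient_descent(slope, w_start=-4.0)
print(w_final)
```

Starting left of the minimum, the slope is negative, so each step increases the weight; the steps shrink as the slope approaches zero near $w = 3$.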

# How do we know the number of layers and their types?
Generally, you need a network large enough to capture the structure of the problem.
Often the best network structure is found through a process of trial and error experimentation.\
Some research findings:
- A multi-layered neural network with 2 hidden layers is sufficient for creating classification regions of any desired shape (Lippmann, "An introduction to computing with neural nets", 1987)
- With one hidden layer that has a sufficiently large number of nodes, an MLP can approximate any function that we require (P.98, Deep Learning, 2016)
- Yet, it is hard to know what is sufficiently large. It is more efficient to learn it with 2 (or more) hidden layers. (P.38, Neural Smithing)

### 5 Approaches
#### 1) Experimentation
- The number of layers and the number of nodes per layer are hyperparameters that you must specify when configuring the model, and no one can tell in advance whether a given configuration will work well. A controlled experiment is needed.

#### 2) Intuition
- The intuition can come from experience with the domain, experience with modeling problems with neural networks, or some mixture of the two.
  
#### 3) Go for depth
- Deep neural networks appear to perform better (Goodfellow, Bengio, and Courville)
  
#### 4) Borrow Ideas
- Leverage findings reported in the literature.
- Find research papers that describe the use of MLPs on instances of prediction problems similar in some way to your problem. Note the configuration of the networks used in those papers and use them as a starting point for the configurations to test on your problem.
  
#### 5)  Search
- Design an automated search to test different network configurations.
- Some popular search strategies include:
    - Random: Try random configurations of layers and nodes per layer.
    - Grid: Try a systematic search across the number of layers and nodes per layer.
    - Heuristic: Try a directed search across configurations such as a genetic algorithm or Bayesian optimization.
    - Exhaustive: Try all combinations of layers and the number of nodes; it might be feasible for small networks and datasets.
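
The grid and random strategies can be sketched as below. Everything here is illustrative: the candidate values, the restriction that a grid configuration uses the same width in every hidden layer, and especially `toy_score`, which is a stand-in for what would really be "train the network and return its validation error".

```python
import itertools
import random

LAYER_CHOICES = [1, 2, 3]    # hypothetical numbers of hidden layers to try
NODE_CHOICES = [4, 8, 16]    # hypothetical nodes per hidden layer to try


def grid_configurations():
    """Every combination of layer count and (shared) layer width."""
    return [[nodes] * layers
            for layers, nodes in itertools.product(LAYER_CHOICES, NODE_CHOICES)]


def random_configurations(count, seed=0):
    """Random layer counts with an independently chosen width per layer."""
    rng = random.Random(seed)
    configs = []
    for _ in range(count):
        layers = rng.choice(LAYER_CHOICES)
        configs.append([rng.choice(NODE_CHOICES) for _ in range(layers)])
    return configs


def search(configurations, evaluate):
    """Return the configuration with the lowest score."""
    return min(configurations, key=evaluate)


def toy_score(config):
    """Placeholder objective, NOT a real model evaluation."""
    return abs(sum(config) - 24)


grid = grid_configurations()
best = search(grid, toy_score)
print(len(grid), best)
print(random_configurations(5))
```

Swapping `toy_score` for a function that trains and validates a model turns this into a usable, if slow, search loop.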

Some ideas to reduce or manage the computational burden include:
- Fit models on a smaller subset of the training dataset to speed up the search.
- Aggressively bound the size of the search space.
- Parallelize the search across multiple server instances (e.g. use Amazon EC2 service).

# Restricted Boltzmann Machines (RBM)

<img src="RBM.png" width="500">

Source: [1. Generative model that won the 2024 Physics Nobel Prize - Restricted Boltzmann Machines (RBM)](https://www.youtube.com/watch?v=Fkw0_aAtwIw&ab_channel=Serrano.Academy)

A Restricted Boltzmann Machine (RBM) has 2 layers: the hidden layer on top and the visible layer at the bottom.\
Each layer has some number of nodes, and each node has a score (its bias).\
Every node in the visible layer is connected to every node in the hidden layer; there are no connections within a layer.\
Each edge has a weight.


### RBM - Visible & Hidden Layer
- Visible layer: the one that we see in our dataset, and whose behaviour we have to explain.
- Hidden layer: the one we don't see, which is used to explain the behaviour of the dataset.


### RBM - Scores
Every scenario (combination of activated nodes) is considered.\
The score of a particular scenario is the sum of the scores of the nodes that are activated, plus the weights of the edges whose two end nodes are both activated.\
The score of every possible scenario is calculated.\
A higher score means that those nodes are more likely to be activated together, and a lower or negative score means that they are unlikely to be activated together.

### Formula of RBM Energy Function
$$E = - \sum_i b_i v_i - \sum_j a_j h_j - \sum_{i,j} W_{ij} v_i h_j$$

Where:
- $E$: Energy of the system.
- $b_i$: Bias for visible unit $v_i$.
- $a_j$: Bias for hidden unit $h_j$.
- $W_{ij}$: Weight between visible unit $v_i$ and hidden unit $h_j$.
- $v_i$: Visible unit.
- $h_j$: Hidden unit.

In words: the energy is the negative of the sum of three terms: each visible unit's bias times its state, each hidden unit's bias times its state, and each edge's weight times the states of the two units it connects.

Energy = -score
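
As a small illustration, the energy of one scenario can be computed directly from the formula. The function name `rbm_energy` and all the numbers below are made up; states are assumed to be binary (1 = activated, 0 = not activated).

```python
def rbm_energy(v, h, b, a, W):
    """Energy of one scenario; W[i][j] links visible unit i to hidden unit j."""
    visible_term = sum(b_i * v_i for b_i, v_i in zip(b, v))
    hidden_term = sum(a_j * h_j for a_j, h_j in zip(a, h))
    edge_term = sum(W[i][j] * v[i] * h[j]
                    for i in range(len(v)) for j in range(len(h)))
    return -(visible_term + hidden_term + edge_term)


b = [1.0, -1.0, 0.5]                          # visible biases
a = [0.5, -2.0]                               # hidden biases
W = [[2.0, -1.0], [-1.0, 1.0], [0.0, 3.0]]    # 3 visible x 2 hidden weights
v = [1, 0, 1]                                 # visible states
h = [1, 0]                                    # hidden states

energy = rbm_energy(v, h, b, a, W)
score = -energy
print(energy, score)
```

With these values the visible term is 1.5, the hidden term is 0.5 and the edge term is 2.0, so the score is 4.0 and the energy is -4.0.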

The joint probability distribution and partition function for the RBM are given as:

$$ p(v, h) = \frac{1}{Z} e^{-E(v, h)}$$

$$Z = \sum_{v, h} e^{-E(v, h)}$$

Where:
- $p(v, h)$: Joint probability of visible ($v$) and hidden ($h$) states.
- $Z$: Partition function (normalizing constant).
- $E(v, h)$: Energy associated with the visible and hidden states.
- $\sum_{v, h}$: Summation over all possible visible and hidden state configurations.


### RBM - Probabilities
Turn the scores into probabilities.\
First, exponentiate each score: $e^{\text{score}}$.\
Then, add all of these values to get their sum.\
Then, divide each value by the sum to normalize it, so that the probabilities add up to 1.\
The higher the score, the higher the probability.
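
The three steps map onto a few lines of code. The scenario names and scores below are invented purely for illustration.

```python
import math


def scores_to_probabilities(scores):
    """Exponentiate each score, then divide by the total so the values sum to 1."""
    exponentiated = {name: math.exp(s) for name, s in scores.items()}
    total = sum(exponentiated.values())
    return {name: value / total for name, value in exponentiated.items()}


# Hypothetical scores for four scenarios of two nodes A and B.
scores = {
    "A and B on": 2.0,
    "only A on": 0.0,
    "only B on": 0.0,
    "none on": -1.0,
}
probabilities = scores_to_probabilities(scores)
for name, p in probabilities.items():
    print(name, round(p, 3))
```

The total computed inside the function plays the role of the partition function $Z$, which is why this approach stops being practical once the number of scenarios becomes very large.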

### Training - Contrastive Divergence
1. Go through the 1st point of the dataset. Find the possible scenarios that agree with it, and increase their probabilities. Decrease the probabilities of the rest.
2. Continue with the remaining points of the dataset.

In terms of formulas:

Find:

$$\text{arg max}_W \prod_{v \in V} P(v)$$

- $\text{arg max}_W$: Find the value of $W$ that maximizes the expression.
- $\prod_{v \in V} P(v)$: Product over all $v$ in $V$.

Maximize:

$$\text{arg max}_W \mathbb{E}[\log P(v)]$$

- $\mathbb{E}$: Expectation operator.
- $\log P(v)$: Logarithm of the probability of $v$.

Derivative:

$$\frac{\partial}{\partial W} \log P(v_n)$$

- $\frac{\partial}{\partial W}$: Partial derivative with respect to $W$.
- $\log P(v_n)$: Logarithm of the probability of $v_n$.


$$\mathbb{E}\left[\frac{\partial}{\partial W} \bigl(- E(v, h)\bigr) \mid v = v_n \right] - \mathbb{E}\left[\frac{\partial}{\partial W} \bigl(- E(v, h)\bigr)\right]$$

1. Expectation Operator $\mathbb{E}$:
   - Represents the expectation of the terms inside the brackets.
2. Conditional Expectation $\mid v = v_n$:
   - Indicates the expectation is conditional on $v = v_n$.
3. Gradient $\frac{\partial}{\partial W}$:
   - Partial derivative with respect to $W$.
4. Energy Function $E(v, h)$:
   - Energy of the visible ($v$) and hidden ($h$) units.
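
As a worked step for a single weight $W_{ij}$: the only term of the energy function that contains $W_{ij}$ is $-W_{ij} v_i h_j$, so differentiating $-E(v, h)$ with respect to $W_{ij}$ leaves just the product of the two connected states:

$$\frac{\partial}{\partial W_{ij}} \bigl(- E(v, h)\bigr) = v_i h_j$$

Substituting this into the difference of expectations above gives, for that one weight:

$$\frac{\partial}{\partial W_{ij}} \log P(v_n) = \mathbb{E}\left[v_i h_j \mid v = v_n\right] - \mathbb{E}\left[v_i h_j\right]$$

The first term is computed with the visible layer fixed to the data point, and the second over the model's own distribution. The second term is the expensive one, which is why it is approximated by sampling.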
  
### Gibbs Sampling
For hidden and visible layers with many nodes, the number of possible scenarios is far too large to enumerate, so we need to sample.
1. Go through the 1st point of the dataset. Pick 1 of the possible scenarios that agree with it, and increase its probability.
2. Then pick 1 scenario at random from the rest and decrease its probability.
3. Proceed with the remaining points of the dataset.

### Updating Weights
Increase or decrease the scores of the nodes and the weights of the edges by the learning rate each time.

### Sampling
If both layers have a very large number of nodes, even picking a scenario at random is difficult.
#### How to pick a sample that agrees with our data point?
Use independent sampling and consider the probabilities.  
For the hidden node of interest, add its own score to the weights of the edges connecting it to the activated visible nodes, then apply the sigmoid to this sum. The result is the probability that this hidden node is activated.
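
A minimal sketch of this sampling step, assuming binary states and using the symbols from the energy function ($a_j$ for hidden biases, $W_{ij}$ for edge weights). The function names and all numbers are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_probabilities(v, a, W):
    """Probability that each hidden node is on: sigmoid(a_j + sum_i W[i][j] * v[i])."""
    return [sigmoid(a[j] + sum(W[i][j] * v[i] for i in range(len(v))))
            for j in range(len(a))]


def sample_hidden(v, a, W, rng):
    """Switch each hidden node on independently with its own probability."""
    return [1 if rng.random() < p else 0
            for p in hidden_probabilities(v, a, W)]


a = [0.0, -1.0]                               # hidden biases (scores)
W = [[2.0, -1.0], [-1.0, 1.0], [0.0, 3.0]]    # 3 visible x 2 hidden weights
v = [1, 0, 1]                                 # the data point (visible states)

probs = hidden_probabilities(v, a, W)
sample = sample_hidden(v, a, W, random.Random(0))
print(probs, sample)
```

For this data point the sums are 2.0 and 1.0, so the two hidden nodes are on with probabilities sigmoid(2.0) and sigmoid(1.0). Because the hidden nodes are not connected to each other, each one can be sampled independently.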

## Difference between RBM & a normal Feed Forward Network
1. An RBM sends signals both "forwards" and "backwards" during inference while a feedforward network only sends signals forwards during inference.
2. An RBM uses contrastive divergence for learning the weights and does not involve a loss function, whereas a feedforward network uses backpropagation and gradient descent for learning the weights, which requires a loss function.
3. An RBM is energy-based (its energy function plays the role that a loss function plays in a feedforward network) and follows a simplified version of the Boltzmann distribution (one that omits $k$ and $T$), so it is stochastic. A feedforward network is not energy-based and is deterministic.
4. An RBM is an unsupervised learning method, while a feedforward network is supervised learning with targets to predict.
5. Objective: RBM -> learns a probability function, ANN -> learns a complex function.
6. What it does: RBM -> estimates a probable group of variables (visible and latent), ANN -> predicts an output
7. Training algorithm: ANN -> backpropagation, RBM -> contrastive divergence (Similar to 2)
8. Basic principle: ANN -> decreases a cost function, RBM -> decreases an energy function (probability function) (Similar to 2 & 3)
9. Unit activation: ANN -> deterministic activation of units, RBM -> stochastic activation of units (Similar to 3)

## Introduction to CNN

The inputs expected are images.

# Sources

1. [Artificial Neural Networks for Beginners by Carlos Gershenson](https://arxiv.org/abs/cs/0308031)
2. [Neural Network Full Course | Neural Network Tutorial For Beginners | Neural Networks | Simplilearn](https://www.youtube.com/watch?v=ob1yS9g-Zcs&ab_channel=Simplilearn)
3. [Neural Networks in Python – A Complete Reference for Beginners](https://www.askpython.com/python/examples/neural-networks)
4. [How to Configure the Number of Layers and Nodes in a Neural Network](https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/)
5. [A Tutorial on Deep Neural Networks for Intelligent Systems](https://arxiv.org/pdf/1603.07249)
6. [Reducing the Dimensionality of Data with Neural Networks](https://www.cs.toronto.edu/~hinton/absps/science.pdf)