<a href="https://colab.research.google.com/github/MaralAminpour/IVM_supplementary_materials/blob/main/NN_Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Artificial Neural Networks

Neural Networks started off trying to mimic how the brain's neurons work, but they've since shifted towards more of an engineering approach to improve how we do machine learning. Even so, it's useful to have a quick look at the biological inspiration behind it all before we dive deeper.

# Biological Neuron Structure

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/Picture1.png" width = "400" >

- **Dendrites**: These are extensions of the neuron that act as the input channels. They receive chemical signals from other neurons and convert them into electrical signals.
- **Soma**: Also known as the cell body, this part of the neuron integrates the electrical signals received from dendrites to determine if the neuron will activate and send a signal along to other neurons.
- **Axon**: This is a long, slender projection that transmits the electrical signal (action potential) away from the neuron's soma toward other neurons.
- **Myelin Sheath**: Some axons are wrapped in this protective sheath, which helps speed up the transmission of the action potential over long distances.
- **Synapses**: These are the junctions where neurons communicate with each other, transferring information from one neuron to the next.
- **Chemical Synapses**: In this type, there's a small gap between neurons where the signal is transferred using chemical messengers called neurotransmitters, which are released from one neuron and bind to receptors on the next.
- **Neurotransmitters**: These chemicals can either excite the next neuron, prompting it to send a signal, or inhibit it from sending a signal.
- **Excitatory and Inhibitory Synapses**: These are the two types of chemical synapses. Excitatory synapses encourage the next neuron to send a signal, while inhibitory synapses discourage it from doing so.

Each component of the neuron plays a critical role in processing and transmitting information throughout the nervous system.


# Artificial neurons vs Biological neurons

The concept of artificial neural networks comes from the biological neurons found in animal brains, so the two share many similarities in structure and function.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/Artificial-Neural-Networks.webp" width = "600" >


**Structure:** The structure of artificial neural networks is inspired by biological neurons. A biological neuron has a cell body or soma **to process the impulses**, dendrites **to receive them**, and an axon that **transfers them to other neurons**. In an artificial neural network, the input nodes receive the input signals, the hidden layer nodes process these signals, and the output layer nodes compute the final output by processing the hidden layer’s results using **activation functions**.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/table1.png" width = "300" >

**Synapses:** In biological networks, synapses are the links between neurons that enable the **transmission of impulses from the axon of one neuron to the dendrites of the next**. In artificial neural networks, the counterpart of synapses is the **weights that join the nodes of one layer to the nodes of the next layer**. The strength of each link is determined by its weight value.

**Learning:** In biological neurons, incoming impulses are integrated in the cell body (soma), which helps to **process the impulses**. If the impulses are strong enough to reach the threshold, an action potential is produced and travels along the axon. Learning is made possible by synaptic plasticity, the ability of synapses to become stronger or weaker over time in response to changes in their activity.

In artificial neural networks, **backpropagation** is a technique used for learning, which **adjusts the weights between nodes according to the error or differences between predicted and actual outcomes.**

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/table2.png" width = "300" >

**Activation:** In biological neurons, activation corresponds to the neuron firing, which happens when the **impulses are strong enough to reach the threshold**. In artificial neural networks, a mathematical function known as an activation function maps a node's input to its output.

[Source](https://www.geeksforgeeks.org/artificial-neural-networks-and-its-applications/)

## Similarities between BNNs and ANNs

Biological Neural Networks (BNNs) and Artificial Neural Networks (ANNs) have distinct parts that correspond to each other, underpinning their conceptual similarities:

**Parts of Biological Neural Networks:**

- **Neurons**: The fundamental cells that process and transmit information through electrical and chemical signals.
- **Dendrites**: Receive signals from other neurons.
- **Soma (Cell Body)**: Integrates incoming signals to determine if the neuron will activate.
- **Axon**: Transmits the electrical signal to other neurons.
- **Synapses**: Junctions where neurons communicate, using neurotransmitters to send signals.
- **Myelin Sheath**: Insulation around some axons that speeds up signal transmission.
- **Neurotransmitters**: Chemicals that transmit signals across synapses.

**Parts of Artificial Neural Networks:**

- **Artificial Neurons (Nodes)**: Basic processing units that simulate biological neurons.
- **Inputs**: Analogous to dendrites, they receive data to be processed.
- **Weights**: Equivalent to the strength of synaptic connections, determining the influence of inputs.
- **Activation Function**: Serves a similar purpose as the soma, deciding the level of output signal based on input strength.
- **Outputs**: Correspond to the axon, transmitting the signal to the next layer or as a final output.
- **Layers**: Structured groupings of nodes; including input, hidden, and output layers.
- **Learning Algorithm (e.g., Backpropagation)**: Method for adjusting weights in the network, similar to how experiences rewire synaptic connections.

**Similarities:**

- **Signal Processing**: Both BNNs and ANNs process information through a network of interconnected units (neurons/nodes).
- **Adaptation**: Neurons in BNNs adapt through changes in synaptic strength, while ANNs adapt through changes in weights.
- **Integration and Activation**: Neurons integrate signals and fire based on a threshold; similarly, nodes calculate weighted sums and apply an activation function.
- **Transmission**: Just as axons transmit signals to other neurons, ANNs transmit processed data from one node to the next.
- **Learning**: Both networks learn from repeated exposure to stimuli (data), although the mechanisms differ (biological processes vs. computational algorithms).

The conceptual similarity is rooted in the inspiration ANNs take from BNNs, using an abstracted and simplified model to replicate the complex patterns of data processing and learning observed in biological systems.

# Biological motivation and connections

The basic computational unit of the brain is a neuron. Approximately 86 billion neurons can be found in the human nervous system, and they are connected by approximately $10^{14}$ to $10^{15}$ synapses. The diagram below shows a cartoon drawing of a biological neuron (left) and a common mathematical model (right). Each neuron receives input signals from its dendrites and produces output signals along its (single) axon. The axon eventually branches out and connects via synapses to dendrites of other neurons.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/Picture2.png" width = "800" >


In a computational model, a neuron is represented as a node that processes incoming signals and produces an output. Here's how it works: signals travel through pathways analogous to axons in the biological sense, where each signal is assigned a variable (like $x_0$). This signal is then scaled by a corresponding weight (such as $w_0$), so that the product $w_0 x_0$ is what reaches the next neuron. The weight $w_0$ represents the strength and type of the connection: it either amplifies the signal (excitatory, with a positive weight) or suppresses it (inhibitory, with a negative weight). This mimics the way biological synapses control the influence of one neuron on another.

The weights—these crucial elements of the neural network—are adjustable. They are 'learned' through repeated exposure to data during the training process. As the network processes more data, it adjusts the weights to improve its predictions, much like how our brains strengthen or weaken synaptic connections based on experiences.

Once the signals reach a neuron, they are collected by structures akin to dendrites and brought together in the neuron's main body, just like the summing junction in biological neurons. Here, all the incoming weighted signals are totaled. If this total surpasses a specific threshold—a predefined limit that determines whether the neuron should activate—the neuron outputs a signal. This output then travels down what would be considered the neuron's axon in a biological context.
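The summing-and-thresholding behavior described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `threshold_neuron`, the weights, and the threshold value are assumptions chosen for the example, not values from the course material.

```python
import numpy as np

def threshold_neuron(x, w, threshold):
    """Return 1 (fire) when the weighted sum of the inputs exceeds the threshold."""
    total = np.dot(w, x)  # sum of w_i * x_i over all inputs
    return 1 if total > threshold else 0

# Hypothetical weights: two excitatory inputs (positive weights)
# and one inhibitory input (negative weight).
w = np.array([0.6, 0.6, -1.0])
threshold = 1.0

# Only the excitatory inputs are active: weighted sum is 1.2, so the neuron fires.
fires = threshold_neuron(np.array([1, 1, 0]), w, threshold)

# The inhibitory input is also active: weighted sum drops to about 0.2, so it stays silent.
inhibited = threshold_neuron(np.array([1, 1, 1]), w, threshold)

print(fires, inhibited)
```

The negative weight plays the role of an inhibitory synapse: switching that input on is enough to keep the total below the threshold.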

One simplification in this computational model is that we don't consider the exact timing of each signal. Instead, we focus on how often the neuron fires, or its firing rate. This rate is assumed to carry the essential information, rather than the precise pattern of spikes. This approach is based on the 'rate code' theory of neural communication, which postulates that it's the number of spikes over time, not the exact timing of them, that's most important for conveying information.

**Coarse model**: It's crucial to acknowledge that the way we model neurons in artificial neural networks is quite simplified compared to their biological counterparts. Real neurons come in various types, each with unique characteristics and functions. In a living brain, dendrites are responsible for intricate nonlinear calculations, far beyond the simple signal processing we typically model in artificial networks. Also, synapses in nature are not merely single values representing strength; they're dynamic and complex, involving a multitude of factors that affect their behavior.

Furthermore, in certain biological systems, the precise timing of a neuron's firing—down to the exact millisecond—is critical, which challenges the assumption that the frequency of firing is all that matters (the rate code model). Given these complexities, and many others we simplify or overlook, it's common for neuroscientists to express a bit of frustration when direct comparisons are made between the functioning of neural networks in AI and the workings of the human brain.

# An artificial neural network (ANN)

An ANN is a computational model inspired by the way biological neural networks in the human brain process information. It is a key tool in machine learning and artificial intelligence, designed to simulate the way humans learn and recognize patterns.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/neuron_model.jpeg" width = "500" >

[Source](https://colab.research.google.com/github/MaralAminpour/IVM_supplementary_materials/blob/main/NN_Introduction.ipynb#scrollTo=IfO26Ajwd8PI&line=7&uniqifier=1)

Here’s a breakdown of its key components and how it functions:

- **Structure**: An ANN is composed of a large number of interconnected processing elements called neurons, which are organized in layers. There are typically three types of layers: an input layer, one or more hidden layers, and an output layer.

- **Neurons**: Each neuron in a network acts as a computational unit that receives inputs (either from raw data or from the outputs of other neurons), processes these inputs, and generates an output. This is typically done by summing the weighted inputs, adding a bias, and then passing the result through a nonlinear activation function.

- **Weights and Biases**: The 'synapses' of an ANN are represented by weights that adjust as learning proceeds. The strength and sign of these weights determine the influence one neuron has on another. Biases are additional parameters that shift the activation function to better fit the data.

- **Learning**: ANNs learn through a process called training, where the network is fed data, makes predictions, and then adjusts its weights and biases based on the error of its predictions. This is often done using algorithms such as gradient descent and backpropagation.

- **Activation Functions**: These functions determine whether a neuron should be activated or not, based on the weighted sum of the inputs. Common activation functions include the sigmoid, tanh, and ReLU (Rectified Linear Unit) functions.

- **Applications**: ANNs are used for a wide range of applications such as image and speech recognition, natural language processing, medical diagnosis, stock market prediction, and many forms of classification and regression analysis.

ANNs are powerful because, according to the universal approximation theorem, a network with a suitable nonlinear activation function and enough hidden neurons can approximate any continuous function on a closed, bounded input range to any desired accuracy. This flexibility allows ANNs to tackle complex problems that are difficult to solve using traditional programming techniques.
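The three activation functions named in the list above can be written directly with NumPy. This is a minimal sketch; the helper names `sigmoid` and `relu` and the sample inputs are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Passes positive values through unchanged and sets negative values to 0."""
    return np.maximum(0.0, z)

# Sample weighted sums (pre-activations) for three neurons.
z = np.array([-2.0, 0.0, 2.0])

sig_out = sigmoid(z)    # values between 0 and 1, with sigmoid(0) = 0.5
tanh_out = np.tanh(z)   # values between -1 and 1, with tanh(0) = 0
relu_out = relu(z)      # negative input clipped to 0

print(sig_out, tanh_out, relu_out)
```

All three are nonlinear, which is what lets stacked layers represent more than a single linear map.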

## How biases come into the picture

Introducing a bias term into an Artificial Neural Network (ANN) is a crucial step in the network's ability to accurately model complex patterns. The role of the bias is **similar to the intercept in a linear regression** equation: it allows the model to better fit the data **by providing an additional degree of freedom**.

Here's a more detailed explanation:

In the learning process of an ANN, we aim to adjust the weights of the connections between neurons so that for a given set of inputs, the network produces the desired output. The weights control how much influence one neuron has over another. Initially, these weights are often set to small random values, and the network doesn't perform well.

As the network learns, it incrementally adjusts the weights based on the difference between the predicted output and the actual output, a process that is repeated many times over the entire dataset. However, if we only adjusted the weights, each neuron's weighted sum could **only represent linear relationships that pass through the origin of the coordinate system** (where all input values are zero).

This limitation is where the bias comes in. By adding a bias term to the equation, we effectively **shift the activation function to the left or right,** which allows the network to represent **linear relationships that do not pass through the origin**. The bias term gives the neuron the flexibility to activate (or not) even when all input signals are zero.

The equation for the output of a single neuron with a bias term looks like this:

$$
\text{output} = \text{activation function}\left(\sum (\text{weights} \times \text{inputs}) + \text{bias}\right)
$$

With the bias term, **the network can better capture the patterns in the data, regardless of whether those patterns are centered around the origin or not**. This is especially important because in **real-world data**, it's highly unlikely that all the relationships we want to model will conveniently intersect the origin point.

In summary, the bias term is essential for providing the model with the **full range of motion needed to find the best possible fit to the data**, beyond what adjusting weights alone could achieve. It's a small but powerful tweak that significantly enhances the neural network's capability to model complex functions.
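A small experiment makes the role of the bias concrete. In the sketch below, a sigmoid neuron receives an all-zero input: without a bias its output is stuck at 0.5 no matter what the weights are, while a bias lets it move away from that value. The function name `neuron` and all numeric values are assumptions made for this example.

```python
import numpy as np

def neuron(x, w, b):
    """Single neuron: sigmoid of (weighted sum of inputs + bias)."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

x_zero = np.zeros(3)  # all input signals are zero

# Two very different (hypothetical) weight vectors.
w_a = np.array([0.5, -1.2, 2.0])
w_b = np.array([-3.0, 0.1, 0.7])

# Without a bias, the weighted sum is 0 for any weights, so the output is sigmoid(0) = 0.5.
no_bias_a = neuron(x_zero, w_a, b=0.0)
no_bias_b = neuron(x_zero, w_b, b=0.0)

# With a bias of -2, the same neuron outputs sigmoid(-2), roughly 0.12.
with_bias = neuron(x_zero, w_a, b=-2.0)

print(no_bias_a, no_bias_b, with_bias)
```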

# Modeling one neuron


# Single neuron as a linear classifier

The operational principle of a single neuron in a neural network is mathematically analogous to **linear classifiers** you may have encountered before. A neuron can exhibit a preference for certain input patterns, showing a high level of activation (close to 1) for favored patterns or a low level of activation (close to 0) for others. By coupling this neuron with a suitable **loss function**, we can mold it into a linear classifier.

**Binary Softmax classifier**: Take the Binary Softmax classifier as an instance. Here, the output of the neuron after applying the sigmoid function, denoted as $ \sigma(\sum_{i} w_i x_i + b) $, can be interpreted as the probability that a given class label $ y_i $ is 1, given the input features $ x_i $ and the learned parameters $ w $ (the weights) and $ b $ (the bias). The probability of the alternative class (where $ y_i $ is 0) is simply $ 1 - P(y_i=1 | x_i; w) $, ensuring that the total probability sums up to 1. The **cross-entropy loss function**, familiar from linear classification contexts, can then be applied to optimize this neuron, yielding a binary Softmax classifier, also recognized as **logistic regression**. The predictions hinge on whether the neuron's output is above or below the threshold of 0.5, given that the sigmoid function's output is confined within the range from 0 to 1.

**Binary SVM classifier**: Alternatively, if we choose to attach a **max-margin hinge loss** to the neuron's output, we steer the training process towards a binary Support Vector Machine (SVM) classifier. This type of classifier aims for the largest margin between the data points of different classes, which can be advantageous for generalization.

**Regularization interpretation**: From another angle, if we consider regularization, common to both SVM and Softmax classification, it can be likened to a process of **gradual forgetting within a biological framework**. Regularization penalizes large weights, effectively nudging them towards zero with each update during training, akin to how a biological system might gradually diminish the synaptic strengths that are not frequently used.

In essence, even a solitary neuron within a neural network has the potential to act as a binary classifier—whether it's a Softmax or an SVM—by drawing on these mathematical principles and interpretations.
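As a rough sketch of the idea above, the code below trains a single sigmoid neuron with the cross-entropy loss, which amounts to logistic regression. The toy dataset (logical OR), the learning rate, and the number of steps are assumptions picked for illustration. A `hinge` function is included only to show what would be swapped in for the SVM variant; it is evaluated but not used for training here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(p, y):
    """Mean binary cross-entropy between predicted probabilities p and labels y."""
    eps = 1e-12  # avoids log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def hinge(score, y):
    """Mean hinge loss on raw scores; labels are mapped from {0, 1} to {-1, +1}."""
    y_pm = 2 * y - 1
    return np.mean(np.maximum(0.0, 1.0 - y_pm * score))

# Toy, linearly separable data: logical OR of two binary inputs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])

w = np.zeros(2)
b = 0.0
lr = 0.5  # learning rate (assumed value)

for _ in range(5000):
    p = sigmoid(X @ w + b)            # neuron output = P(y = 1 | x)
    grad_w = X.T @ (p - y) / len(y)   # gradient of the cross-entropy w.r.t. w
    grad_b = np.mean(p - y)           # gradient of the cross-entropy w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

p = sigmoid(X @ w + b)
predictions = (p > 0.5).astype(int)   # decision threshold at 0.5
final_loss = cross_entropy(p, y)
hinge_value = hinge(X @ w + b, y)     # hinge loss of the same neuron's raw scores

print(predictions, final_loss, hinge_value)
```

The only difference between the two classifiers in this sketch is the loss attached to the neuron's output; the neuron itself is the same weighted sum plus bias.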

# How do Artificial Neural Networks learn?

Artificial Neural Networks (ANNs) learn through a process called training, during which they adjust their internal parameters to make better predictions or decisions based on input data. Here's a step-by-step breakdown of how this learning process typically works:

1. **Initialization**: Before learning begins, the weights (the parameters that determine the importance of input signals) and biases (parameters that allow the model to fit better with training data) in the network are usually initialized with small random values.

2. **Feedforward**: During the feedforward phase, input data is passed through the network. Each neuron in the network processes the input by computing a weighted sum of the inputs, adding a bias, and then applying an activation function to the result. The activation function's output determines the neuron's output signal, which then becomes the input for the next layer in the network.

3. **Loss Calculation**: The output of the network is compared to the desired output, and the difference between them is calculated using a loss function. The loss function measures the error of the network's predictions and provides a single value that the network aims to minimize through training.

4. **Backpropagation**: Backpropagation is used to calculate the gradient of the loss function with respect to each weight and bias in the network. This process involves applying the chain rule from calculus to compute the gradients systematically from the output layer back through to the input layer.

5. **Weight Update**: Once the gradients are computed, the weights and biases are updated, typically using an optimization algorithm like gradient descent. This involves nudging the weights and biases in the opposite direction of the gradient by a small amount, proportional to a learning rate parameter. The learning rate controls how big a step is taken during each update and is crucial for the convergence and performance of the network.

6. **Iterate**: Steps 2-5 are repeated for many iterations over the training dataset, with the network continuing to adjust its weights and biases to reduce the loss.

7. **Evaluation**: After the training is complete, the network's performance is evaluated on a separate dataset not seen during training, called the validation set, to ensure that the network generalizes well to new data.

8. **Fine-tuning**: Based on the network's performance on the validation set, further fine-tuning of the model may occur, which can involve adjusting the learning rate, trying different architectures, or using regularization techniques to prevent overfitting.

Through these iterative processes, ANNs learn the complex relationships within the data they are trained on, allowing them to make predictions or decisions when presented with new, unseen data.
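Steps 1 to 6 above can be sketched as a short NumPy training loop. This is a minimal illustration under several assumptions: a network with one hidden layer of four tanh units, a linear output, a halved mean squared error loss, plain gradient descent, and a toy XOR dataset. None of these choices come from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (XOR), used only for illustration.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# 1. Initialization: small random weights, zero biases.
W1 = rng.normal(scale=0.5, size=(2, 4))
b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1))
b2 = np.zeros(1)

lr = 0.05    # learning rate (assumed value)
losses = []

for step in range(2000):          # 6. Iterate
    # 2. Feedforward: weighted sum + bias, then activation.
    h = np.tanh(X @ W1 + b1)      # hidden layer output
    y_hat = h @ W2 + b2           # linear output layer

    # 3. Loss calculation: half of the mean squared error.
    err = y_hat - y
    losses.append(0.5 * np.mean(err ** 2))

    # 4. Backpropagation: chain rule from the output back to the input.
    d_yhat = err / len(X)         # dLoss/dy_hat
    dW2 = h.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    d_h = d_yhat @ W2.T
    d_z1 = d_h * (1.0 - h ** 2)   # derivative of tanh is 1 - tanh^2
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0)

    # 5. Weight update: step against the gradient.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(losses[0], losses[-1])
```

Evaluation and fine-tuning (steps 7 and 8) would additionally require data held out from training, which this toy example does not have.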


# Overview of a Neural Network’s Learning Process

The learning (training) process of a neural network is an iterative process in which the calculations are carried out forward and backward through each layer in the network until the loss function is minimized.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/learning_image1.webp" width = "700" >

The entire learning process can be divided into three main parts:


*   Forward propagation (Forward pass)
*   Calculation of the loss function
*   Backward propagation (Backward pass/Backpropagation)

We’ll begin with forward propagation.


## Forward propagation (Feed Forward Networks)

A feedforward network consists of an input layer, one or more hidden layers, and an output layer. The input layer receives the input into the neural network, and each input has a weight attached to it.

The weights associated with each input are numerical values. These weights are an indicator of the importance of the input in predicting the final output. For example, an input associated with a large weight will have a greater influence on the output than an input associated with a small weight.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/learning2.png" width = "700" >


When a neural network is first trained, we don’t yet know which weights to use for each input, so each input is assigned a random weight. Since the weights are randomly assigned, the neural network will likely make the wrong predictions at first.

When the neural network gives out the incorrect output, this leads to an output error. This error is the difference between the actual and predicted outputs. A cost function measures this error.

The cost function (J) indicates how accurately the model performs. It tells us how far-off our predicted output values are from our actual values. It is also known as the error. Because the cost function quantifies the error, we aim to minimize the cost function.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/minimum_loss.png" width = "400" >


What we want is to reduce the output error. Since the weights affect the error, we will need to readjust the weights. We have to adjust the weights such that we have a combination of weights that minimizes the cost function.
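To see how the cost depends on a weight, the sketch below evaluates a mean squared error cost for several candidate weights of a one-weight model `y_pred = w * x`. The data (generated by `y = 2 * x`), the choice of mean squared error, and the candidate weights are assumptions made for illustration.

```python
import numpy as np

# Hypothetical data that follows y = 2 * x exactly.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

def cost(w):
    """Cost J(w): mean squared difference between predicted and actual outputs."""
    y_pred = w * x
    return np.mean((y_pred - y) ** 2)

candidate_weights = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
costs = np.array([cost(w) for w in candidate_weights])

# The weight with the lowest cost is the one that best explains the data.
best_w = candidate_weights[np.argmin(costs)]

print(costs)
print(best_w)
```

The cost is zero at `w = 2` and grows on either side, which is the bowl shape that the minimization is trying to reach the bottom of.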


## This is where Backpropagation comes in…

Backpropagation allows us to readjust our weights to reduce output error. The error is propagated backward during backpropagation from the output to the input layer. This error is then used to calculate the gradient of the cost function with respect to each weight.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/back_propogation.png" width = "700" >

Essentially, backpropagation calculates the gradient of the cost function, and the weights are then adjusted in the direction of the negative gradient. The gradient tells us how we need to change the weights so that we can reduce the cost function.

Backpropagation uses the chain rule to calculate the gradient of the cost function. This involves calculating the partial derivative of the cost function with respect to each parameter. Each partial derivative is calculated by differentiating with respect to one weight while treating the others as constants. Collecting these partial derivatives gives us the gradient.

Since we have calculated the gradients, we will be able to adjust the weights.
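The chain rule computation can be illustrated on the smallest possible case: one sigmoid neuron with a squared error cost. The sketch below computes the gradient analytically, factor by factor, and compares it with a numerical estimate. The cost `0.5 * (a - y)**2`, the function names, and all numeric values are assumptions for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, x, y):
    """Squared error cost of a single sigmoid neuron."""
    a = sigmoid(np.dot(w, x) + b)
    return 0.5 * (a - y) ** 2

def backprop(w, b, x, y):
    """Gradient of the cost w.r.t. w and b, via the chain rule."""
    z = np.dot(w, x) + b
    a = sigmoid(z)
    dJ_da = a - y             # derivative of the cost w.r.t. the activation
    da_dz = a * (1.0 - a)     # derivative of the sigmoid
    dJ_dz = dJ_da * da_dz     # chain rule: dJ/dz = dJ/da * da/dz
    return dJ_dz * x, dJ_dz   # dz/dw = x and dz/db = 1

x = np.array([0.5, -1.0])
y = 1.0
w = np.array([0.3, 0.8])
b = 0.1

grad_w, grad_b = backprop(w, b, x, y)

# Numerical check: nudge one weight at a time, keeping the others constant.
eps = 1e-6
num_grad_w = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    num_grad_w[i] = (cost(w + step, b, x, y) - cost(w - step, b, x, y)) / (2 * eps)

print(grad_w, num_grad_w)
```

In a multi-layer network the same pattern repeats layer by layer, with each layer reusing the derivative passed back from the layer after it.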

## Gradient Descent

The weights are adjusted using a process called gradient descent.

Gradient descent is an optimization algorithm that is used to find the weights that minimize the cost function. Minimizing the cost function means getting to the minimum point of the cost function. So, gradient descent aims to find a weight corresponding to the cost function’s minimum point.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/cost_min.png" width = "400" >


To find this weight, we must navigate down the cost function until we find its minimum point.

But first, to navigate the cost function, we need two things: the direction in which to navigate and the size of the steps for navigating.

### The Direction

The direction for navigating the cost function is found using the gradient.

### The Gradient

To know in which direction to navigate, gradient descent uses backpropagation. More specifically, it uses the gradients calculated through backpropagation. These gradients are used for determining the direction to navigate to find the minimum point.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/gradient.png" width = "500" >

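Specifically, we move in the direction of the negative gradient. This is because the gradient points in the direction in which the cost increases most steeply, so its negative points downhill, toward the minimum point. For example, if the slope at the current weight is positive, decreasing the weight lowers the cost; if the slope is negative, increasing the weight lowers the cost.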

### The Step Size

The step size for navigating the cost function is determined using the learning rate.

### Learning Rate

The learning rate is a tuning parameter that determines the step size at each iteration of gradient descent. It determines the speed at which we move down the slope.

The step size plays an important part in ensuring a balance between optimization time and accuracy. The step size is measured by a parameter alpha (α). A small α means a small step size, and a large α means a large step size. If the step sizes are too large, we could miss the minimum point completely. This can yield inaccurate results. If the step size is too small, the optimization process could take too much time. This will lead to a waste of computational power.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/learning_rate.png" width = "600" >

The size of each step depends on both the learning rate and the gradient of the cost function. For a fixed learning rate, the larger the gradient, the steeper the slope and the larger the step, so the weights change faster. For a fixed gradient, a high learning rate results in a larger step, and a low learning rate results in a smaller step. If the gradient of the cost function is zero, the weights stop changing and the model stops learning.
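The effect of the learning rate can be tried out on a simple one-dimensional cost. The sketch below assumes the cost $J(w) = w^2$, whose gradient is $2w$ and whose minimum is at $w = 0$. The function name `gradient_descent`, the starting weight, and the three learning rates are illustrative choices.

```python
def gradient_descent(alpha, w0=5.0, steps=20):
    """Run gradient descent on J(w) = w**2 and return the final weight."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w          # gradient of w**2
        w = w - alpha * grad    # step against the gradient, scaled by alpha
    return w

w_small = gradient_descent(alpha=0.01)  # steps too small: still far from 0 after 20 steps
w_good = gradient_descent(alpha=0.4)    # reasonable steps: very close to 0
w_large = gradient_descent(alpha=1.1)   # steps too large: overshoots and moves away from 0

print(w_small, w_good, w_large)
```

With the largest learning rate, each update jumps past the minimum to a point even farther away on the other side, so the weight grows in magnitude instead of settling.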

### Descending the Cost Function

Navigating the cost function consists of adjusting the weights. The weights are adjusted using the following formula:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/update.png" width = "250" >

This is the formula for gradient descent. As we can see, to obtain the new weight, we use the gradient, the learning rate, and an initial weight.

Adjusting the weights consists of multiple iterations. We take a new step down for each iteration and calculate a new weight. Using the initial weight and the gradient and learning rate, we can determine the subsequent weights.
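Written out, the update described above takes the following standard form. The symbols $w_{\text{old}}$ and $w_{\text{new}}$ are notation assumed here, while $\alpha$ is the learning rate and $J$ the cost function, as used earlier:

$$
w_{\text{new}} = w_{\text{old}} - \alpha \frac{\partial J}{\partial w}
$$

As a worked example with made-up numbers: if $w_{\text{old}} = 2$, $\alpha = 0.1$, and the gradient at that point is $\frac{\partial J}{\partial w} = 4$, then

$$
w_{\text{new}} = 2 - 0.1 \times 4 = 1.6
$$

The gradient is positive, so the weight decreases, which moves it downhill on the cost function.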

Let’s consider a graphical example of this:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/gradient_descent.png" width = "600" >

From the graph of the cost function, we can see that:

1. To start descending the cost function, we first initialize a random weight.

2. Then, we take a step down and obtain a new weight using the gradient and learning rate. With the gradient, we can know which direction to navigate.

3. We can know the step size for navigating the cost function using the learning rate.

4. We are then able to obtain a new weight using the gradient descent formula.

5. We repeat this process until we reach the minimum point of the cost function.

6. Once we’ve reached the minimum point, we find the weights that correspond to the minimum of the cost function.




### Summarizing Gradient Descent

Gradient descent is an optimization algorithm used to find the weights that minimize the cost function. It descends the cost function toward its minimum point to find these weights. To do so, it needs the gradient and the learning rate. The gradient gives the direction toward the minimum point of the cost function. The learning rate determines the speed at which we move toward the minimum point. Upon reaching the minimum point, gradient descent has found the weights corresponding to it.

### Summarizing Backpropagation

Backpropagation is the algorithm for calculating the gradients of the cost function with respect to the weights. Backpropagation is used to improve the output of neural networks. It does this by propagating the error in a backward direction and calculating the gradient of the cost function for each weight. These gradients are used in the process of gradient descent.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/difference.png" width = "600" >

### Conclusion

To put it plainly, gradient descent is the process of using gradients to find the minimum value of the cost function, while backpropagation is calculating those gradients by moving in a backward direction in the neural network. Judging from this, it would be safe to say that gradient descent relies on backpropagation.

It would also be plausible to say that the neural network is trained using gradient descent and that backpropagation is only used to assist in the process of calculating the gradients.

Although gradient descent is often paired with backpropagation to reduce the error in neural networks, they each perform different functions.

### Key Takeaways:

- Gradient descent relies on backpropagation. Gradient descent uses gradients to help it find the minimum value of the cost function.

- Backpropagation calculates these gradients using the chain rule. Gradient descent is used to find a weight combination that minimizes the cost function. Backpropagation propagates the error backward and calculates the gradient for each weight.

- Gradient descent requires the learning rate and the gradient. The gradient helps find the direction to the minimum point of the cost function. The learning rate helps find the speed at which to navigate the cost function.

- Together, backpropagation and gradient descent improve the prediction accuracy of neural networks. Backpropagation propagates the error backward and calculates the gradient for each weight. This gradient is used in the process of gradient descent. Gradient descent involves adjusting the weights of the neural network. Adjusting the weights helps minimize the output error of the neural network.


## About backward propagation

In the first iteration, the predicted values are far from the ground truth values and the distance score will be high. This is because we initially assigned arbitrary values to the network’s parameters (weights and biases). Those values are not optimal. So, we need to update the values of these parameters in order to minimize the loss function. The process of updating network parameters is called parameter learning or optimization. It is carried out by an optimization algorithm (optimizer) that uses the gradients computed by backpropagation.

The objective of the optimization algorithm is to find the global minimum, where the loss function has its lowest value. However, it is a real challenge for an optimization algorithm to find the global minimum of a complex loss function while avoiding all the local minima. If the algorithm stops at a local minimum, we’ll not get the minimum value of the loss function, and our model may not perform well.

Here is a list of commonly used optimizers in neural network training.

- Gradient Descent
- Stochastic Gradient Descent (SGD)
- Adam
- Adagrad
- Adadelta
- Adamax
- Nadam
- Ftrl
- Root Mean Squared Propagation (RMSProp)

In the backward propagation, the partial derivatives (gradients) of the loss function with respect to the model parameters in each layer are calculated. This is done by applying the chain rule of calculus.

The derivative of the loss function is its slope, which tells us the direction in which to update (change) the values of the model parameters.

Neural network libraries such as Keras provide automatic differentiation. This means that, after you define the neural network architecture, the library automatically calculates all of the derivatives needed for backpropagation.

In the backward propagation, calculations are made from the output layer to the input layer (right to left) through the network.

## The batch size and epochs

We do not usually use all training samples (instances/rows) in one iteration during the neural network training. Instead, we specify the batch size which determines the number of training samples to be propagated (forward and backward) during training.

An epoch is one full pass over the entire training dataset.

For example, let’s say we have a dataset of 1000 training samples and we choose a batch size of 10 and 20 epochs. In this case, our dataset will be divided into 100 (1000/10) batches, each with 10 training samples.

According to this setting, the algorithm takes the first 10 training samples from the dataset and trains the model. Next, it takes the second 10 training samples and trains the model, and so on. Since there is a total of 100 batches, the model parameters will be updated 100 times in each epoch. In other words, one epoch involves 100 batches, or 100 parameter updates. Since the number of epochs is 20, the optimizer passes through the entire training dataset 20 times, giving a total of 2000 (100x20) iterations!
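The arithmetic in this example can be checked with a few lines of Python. The variable names are illustrative. Rounding up with `math.ceil` is an assumption that covers datasets whose size is not an exact multiple of the batch size, where the last batch is smaller.

```python
import math

n_samples = 1000   # training samples in the example above
batch_size = 10
epochs = 20

# One parameter update per batch; a final partial batch still counts as one
updates_per_epoch = math.ceil(n_samples / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, total_updates)  # 100 2000
```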

**Recommended resource:**

[All You Need to Know about Batch Size, Epochs and Training Steps in a Neural Network](https://medium.com/data-science-365/all-you-need-to-know-about-batch-size-epochs-and-training-steps-in-a-neural-network-f592e12cdb0a)


# Setting number of layers and their sizes

How do we decide on what architecture to use when faced with a practical problem? Should we use no hidden layers? One hidden layer? Two hidden layers? How large should each layer be? First, note that as we increase the size and number of layers in a Neural Network, the capacity of the network increases. That is, the space of representable functions grows since the neurons can collaborate to express many different functions. For example, suppose we had a binary classification problem in two dimensions. We could train three separate neural networks, each with one hidden layer of some size and obtain the following classifiers:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/layer_sizes.jpeg" width = "600" >

**caption:** *Larger Neural Networks can represent more complicated functions. The data are shown as circles colored by their class, and the decision regions by a trained neural network are shown underneath.*

In the diagram above, we can see that Neural Networks with more neurons can express more complicated functions. However, this is both a blessing (since we can learn to classify more complicated data) and a curse (since it is easier to overfit the training data). Overfitting occurs when a model with high capacity fits the noise in the data instead of the (assumed) underlying relationship. For example, the model with **20 hidden neurons** fits all the training data but **at the cost of segmenting the space** into many disjoint red and green decision regions. The model with **3 hidden neurons** only has the representational power to classify the data in broad strokes. It models the data as two blobs and interprets the few red points inside the green cluster as outliers (noise). In practice, this could lead to better generalization on the test set.

Based on our discussion above, **it seems that smaller neural networks should be preferred when the data is not very complex, since their limited capacity helps prevent overfitting**.

However, this is incorrect - there are many other preferred ways to prevent overfitting in Neural Networks that we will discuss later such as

- L2 regularization

- dropout

- input noise


**In practice, it is always better to use these methods to control overfitting instead of the number of neurons.**

The subtle reason behind this is that **smaller networks are harder to train with local methods such as Gradient Descent**: It’s clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss).

Conversely, bigger neural networks contain **significantly more local minima**, but these minima turn out to be much better in terms of their **actual loss**.

Since Neural Networks are **non-convex**, it is hard to study these properties mathematically, but some attempts to understand these objective functions have been made, e.g. in a recent paper **The Loss Surfaces of Multilayer Networks**. In practice, what you find is that if you train a small network the **final loss can display a good amount of variance** - in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima. On the other hand, if you train a large network you’ll start **to find many different solutions**, but the **variance** in the final achieved loss will be much **smaller**. In other words, all solutions are about equally as good, and rely less on the luck of random initialization.

To reiterate, the regularization strength is the preferred way to control the overfitting of a neural network. We can look at the results achieved by three different settings:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/reg_strengths.jpeg" width = "600" >

***caption***: The effects of regularization strength: Each neural network above has 20 hidden neurons, but changing the **regularization strength** makes its final decision regions smoother with a higher regularization.

The takeaway is that you should not be using smaller networks because you are afraid of overfitting. Instead, you should use as big of a neural network as your computational budget allows, and use other regularization techniques to control overfitting.

# Non-linear activation function

Activation functions are applied to the weighted sum of inputs called z (here the input can be raw data or the output of a previous layer) at every node in the hidden layer(s) and the output layer.

## Activation Function: Sigmoid Function

In the realm of artificial neural networks, each neuron—or rather, each node—calculates the dot product of its inputs and the corresponding weights. This process is akin to taking several pairs of numbers, multiplying each pair together, and then summing up all the products. This sum is then adjusted by adding a bias, a unique value for each neuron that helps to fine-tune the output.

Following this, the neuron applies an activation function, which in many introductory cases is the sigmoid function. The sigmoid function, expressed mathematically as $ \sigma(x) = \frac{1}{1 + e^{-x}} $, has a distinctive 'S' shaped curve. **It smoothly maps the input values, which can be any real number, to a range between 0 and 1.** This is useful because it **converts the dot product—a potentially large or small value—into something manageable** that tells the network how 'activated' or 'fired up' the neuron should be.
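The computation described above (dot product, plus bias, then sigmoid) can be sketched in NumPy. The input, weight and bias values below are made up purely for illustration.

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # hypothetical inputs
w = np.array([0.4, 0.3, -0.2])   # hypothetical weights
b = 0.1                          # hypothetical bias

z = np.dot(w, x) + b   # weighted sum of the inputs plus the bias
a = sigmoid(z)         # the neuron's activation

print(z, a)
```

Here the weighted sum is slightly negative, so the activation lands a little below 0.5.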

**The choice of the sigmoid function isn't arbitrary. It's historically popular because it closely resembles the way biological neurons seem to work: they either fire or they don't, with a gradual buildup as inputs increase.** However, it's not the only activation function used in neural networks. Towards the end of this section, we'll explore other activation functions that can be employed, each with its own mathematical characteristics and use cases, tailored to different aspects of learning and pattern recognition that the network aims to achieve.

# Activation functions in different layers in a neural network

A neural network typically consists of three types of layers: Input Layer, Hidden Layer(s) and Output Layer.

The input layer just holds the input data and no calculation is performed. Therefore, no activation function is used there.

We must use a **non-linear activation function inside hidden layers** in a neural network. This is because we need to **introduce non-linearity** to the network to **learn complex patterns**. Without non-linear activation functions, a neural network with many hidden layers would **become a giant linear regression** model that is useless for learning complex patterns from real-world data.
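The claim that stacked layers without a non-linearity collapse into a single linear model can be checked numerically. In this sketch the layer sizes and random data are arbitrary assumptions, and biases are left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))    # 4 features, 6 examples (arbitrary sizes)
W1 = rng.normal(size=(5, 4))   # first layer weights
W2 = rng.normal(size=(3, 5))   # second layer weights

# Two layers with no activation function in between
two_linear_layers = W2 @ (W1 @ X)

# A single layer whose weight matrix is the product W2 W1
one_linear_layer = (W2 @ W1) @ X

print(np.allclose(two_linear_layers, one_linear_layer))  # True

# Inserting a ReLU between the layers breaks this equivalence in general
with_relu = W2 @ np.maximum(0.0, W1 @ X)
print(np.allclose(with_relu, one_linear_layer))
```

However many linear layers are stacked, the product of their weight matrices is just another matrix, so depth adds nothing until a non-linear activation is placed between the layers.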

The **performance** of a neural network model will vary significantly depending on the **type of activation function** we use inside the **hidden layers**.

We must also use an activation function inside the **output layer** in a neural network. The choice of the activation function depends on the type of problem that we want to solve.

# Linear vs non-linear functions

Most of the activation functions are non-linear. However, we also use linear activation functions in neural networks. For example, **we use a linear activation function in the output layer of a neural network model that solves a regression problem**. Some activation functions are made up of **two or three linear components**. Those functions are also classified as non-linear functions.

It will be useful to distinguish between linear and non-linear functions. A linear function (called $f$) takes the input, $z$ and returns the output, $cz$ which is the multiplication of the input by the constant, $c$. Mathematically, this can be expressed as **$f(z) = cz$**. When $c=1$, the function returns the input as it is and no change is made to the input. **The graph of a linear function is a single straight line.**

Any function that is not linear can be classified as a non-linear function. **The graph of a non-linear function is not a single straight line. It can be a complex pattern or a combination of two or more linear components.**

# Different types of activation functions

We’ll discuss commonly-used activation functions in neural networks.

## 1. Sigmoid activation function


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/sigmond.webp" width = "400" >

**Key features:**

- This is also called the logistic function used in logistic regression models.

- The sigmoid function has an s-shaped graph.

- Clearly, this is a non-linear function.

- The sigmoid function converts its input into a probability value between 0 and 1.

- It converts large negative values towards 0 and large positive values towards 1.

- It returns 0.5 for the input 0. The value 0.5 is commonly used as the threshold that decides which of the two classes a given input belongs to.

**Usage:**

- In the early days, the sigmoid function was used as an activation function for the hidden layers in MLPs, CNNs and RNNs.

- However, the sigmoid function is still used in RNNs.

- Currently, we do not usually use the sigmoid function for the hidden layers in MLPs and CNNs. Instead, we use ReLU or Leaky ReLU there.

- The sigmoid function must be used in the output layer when we build a binary classifier in which the output is interpreted as a class label depending on the probability value of input returned by the function.


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/sigmond2.webp" width = "500" >

The sigmoid function is used when we build a multilabel classification model in which each mutually inclusive class has two outcomes. Do not confuse this with a multiclass classification model.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/sigmond3.webp" width = "600" >

**Drawbacks:**

- We do not usually use the sigmoid function in the hidden layers because of the following drawbacks.

- The sigmoid function has the vanishing gradient problem. This is also known as saturation of the gradients.

- The sigmoid function has slow convergence.

- Its outputs are not zero-centered. Therefore, it makes the optimization process harder.

- This function is computationally expensive as an e^z term is included.
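The saturation behind the vanishing gradient problem can be seen by evaluating the sigmoid's derivative, $\sigma'(z) = \sigma(z)(1-\sigma(z))$, at a few points. The input values are chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
grads = sigmoid_grad(z)
print(grads)  # largest at z = 0 (0.25), close to 0 for large |z|

# Backpropagating through 10 sigmoid layers multiplies 10 such factors,
# so even in the best case the gradient shrinks by at least 0.25 ** 10
print(0.25 ** 10)
```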


## Tanh activation function

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/tanh.webp" width = "400" >

**Key features:**

- The output of the tanh (hyperbolic tangent) function always ranges between -1 and +1.

- Like the sigmoid function, it has an s-shaped graph. This is also a non-linear function.

- One advantage of using the tanh function over the sigmoid function is that the tanh function is zero centered. This makes the optimization process much easier.

- The tanh function has a steeper gradient than the sigmoid function has.

**Usage:**

- Until recently, the tanh function was used as an activation function for the hidden layers in MLPs, CNNs and RNNs.

- However, the tanh function is still used in RNNs.

- Currently, we do not usually use the tanh function for the hidden layers in MLPs and CNNs. Instead, we use ReLU or Leaky ReLU there.

- We never use the tanh function in the output layer.

**Drawbacks:**

- We do not usually use the tanh function in the hidden layers because of the following drawback.

- The tanh function has the vanishing gradient problem.

- This function is computationally expensive as an e^z term is included.
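The relationship between tanh and the sigmoid, and the steeper gradient mentioned above, can be illustrated with a short sketch. It uses the standard identity $\tanh(z) = 2\sigma(2z) - 1$. The sample points are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)

# tanh is a rescaled and shifted sigmoid, centred on zero
tanh_via_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0
print(np.allclose(np.tanh(z), tanh_via_sigmoid))  # True

# Derivatives at z = 0: tanh is steeper than the sigmoid
tanh_grad_at_0 = 1.0 - np.tanh(0.0) ** 2                # 1.0
sigmoid_grad_at_0 = sigmoid(0.0) * (1.0 - sigmoid(0.0))  # 0.25
print(tanh_grad_at_0, sigmoid_grad_at_0)
```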




## Softmax activation function

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/softmax_activation.webp" width = "300" >

**Key features:**

*   This is also a non-linear activation function.

*   The softmax function calculates the probability value of an event (class) over K different events (classes). It calculates the probability values for each class. The sum of all probabilities is 1 meaning that all events (classes) are mutually exclusive.

**Usage:**

We must use the softmax function in the output layer of a multiclass classification problem.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/softmax2.webp" width = "600" >

We **never use the softmax function** in the **hidden layers**.
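A minimal NumPy sketch of the softmax function is given below. Subtracting the column maximum before exponentiating is a common numerical-stability trick that is not discussed in the text above. It does not change the result. The score values are made up, with one column per example.

```python
import numpy as np

def softmax(z):
    # z has shape (n_classes, n_examples); each column is one example
    shifted = z - np.max(z, axis=0, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / np.sum(e, axis=0, keepdims=True)

scores = np.array([[2.0, 1000.0],
                   [1.0, 1000.0],
                   [0.1,  999.0]])   # hypothetical class scores

probs = softmax(scores)
print(probs)
print(probs.sum(axis=0))  # each column sums to 1
```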


## ReLU activation function

**Key features:**

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/ReLu.webp" width = "400" >

- The ReLU (Rectified Linear Unit) activation function is a great alternative to both sigmoid and tanh activation functions.

- Inventing ReLU is one of the most important breakthroughs made in deep learning.

- This function does not have the vanishing gradient problem.

- This function is computationally inexpensive. Training with ReLU is also often reported to converge several times faster than with sigmoid or tanh functions (a factor of about 6 is commonly quoted).

- If the input value is 0 or greater than 0, the ReLU function outputs the input as it is. If the input is less than 0, the ReLU function outputs the value 0.

- The ReLU function is made up of two linear components. Because of that, the ReLU function is a piecewise linear function. In fact, the ReLU function is a non-linear function.

- The output of the ReLU function can range from 0 to positive infinity.

- The convergence is faster than sigmoid and tanh functions. This is because the ReLU function has a fixed derivative (slope) for one linear component and a zero derivative for the other linear component. Therefore, the learning process is much faster with the ReLU function.

- Calculations can be performed much faster with ReLU because no exponential terms are included in the function.

**Usage:**

- The ReLU function is the default activation function for hidden layers in modern MLP and CNN neural network models.

- We do not usually use the ReLU function in the hidden layers of RNN models. Instead, we use the sigmoid or tanh function there.

- We never use the ReLU function in the output layer.

**Drawbacks:**

- The main drawback of using the ReLU function is that it has a dying ReLU problem.

- The value of the positive side can go very high. That may lead to a computational issue during the training.
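The ReLU function and its derivative can be sketched as follows. The derivative is not defined at exactly 0. Setting it to 0 there is a convention assumed in this sketch. The zero gradient for negative inputs is what underlies the dying ReLU problem: a neuron whose input stays negative receives no updates.

```python
import numpy as np

def relu(z):
    # Passes non-negative values through and clips negatives to 0
    return np.maximum(0.0, z)

def relu_grad(z):
    # Slope 1 for positive inputs, 0 otherwise (0 at z = 0 by convention)
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]
```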

## Leaky ReLU activation function

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/leaky.webp" width = "400" >

**Key features:**

- The leaky ReLU activation function is a modified version of the default ReLU function.

- Like the ReLU activation function, this function does not have the vanishing gradient problem.

- If the input value is 0 or greater than 0, the leaky ReLU function outputs the input as it is, like the default ReLU function does. However, if the input is less than 0, the leaky ReLU function outputs a small negative value defined by $\alpha z$ (where $\alpha$ is a small constant value, usually 0.01, and $z$ is the input value).

- It does not have any linear component with zero derivatives (slopes). Therefore, it can avoid the dying ReLU problem.

- The learning process with leaky ReLU is faster than the default ReLU.

**Usage:**

The same usage of the ReLU function is also valid for the leaky ReLU function.
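For comparison with the default ReLU, here is a minimal sketch of leaky ReLU and its derivative. It uses $\alpha = 0.01$ as mentioned above. The function names and test inputs are illustrative.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Identity for z >= 0, small slope alpha for negative inputs
    return np.where(z >= 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # The slope is never exactly zero, which avoids "dead" neurons
    return np.where(z >= 0, 1.0, alpha)

z = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(z))       # [-0.02  0.    3.  ]
print(leaky_relu_grad(z))  # [0.01 1.   1.  ]
```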


## Binary step activation function

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/binary_step.webp" width = "400" >

**Key features:**

- This function is also known as the threshold activation function. We can set any value to the threshold and here we specify the value 0.

- If the input is greater than the threshold value, this function outputs the value 1. If the input is equal to the threshold value or less than it, this function outputs the value 0.

- This outputs a binary value, either 0 or 1.

- The binary step function is made up of two linear components. Because of that, this function is a piecewise linear function. In fact, the binary step function is a non-linear function.

- This function is not a smooth function.

**Usage:**

- In practice, we do not usually use this function in modern neural network models.

- However, we can use this function to explain theoretical concepts such as “firing a neuron”, “inner workings of a perceptron”. Therefore, the step function is theoretically important.

## Identity activation function

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/idendity.webp" width = "400" >

**Key features:**

- This is also known as the linear activation function.

- This is the only function that is considered as a linear function when we talk about activation functions.

- This function outputs the input value as it is. No changes are made to the input.

**Usage:**

This function is only used in the output layer of a neural network model that solves a regression problem.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/regression%20idendity.webp" width = "400" >




# Some commonly used non-linear activation functions

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/non_linear_activation_function.webp" width = "600" >

[Source: Introduction To Artificial Intelligence-Part 1](https://dilanbakr.medium.com/introduction-to-artificial-intelligence-part-1-db89f5e81a22)


## How to Choose the Right Activation Function for Neural Networks

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/activation_functions.webp" width = "600" >


[Source](https://towardsdatascience.com/how-to-choose-the-right-activation-function-for-neural-networks-3941ff0e6f9c)

## A Single Neuron Logistic Regression Classifier

To understand how neural networks work, it serves to consider a single neuron as a logistic regression classifier:


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/singleneuronclassifier.png" width = "700" >

Here the line $z =0$ defines a separating hyperplane, where the bias term $w_0$ has shifted this from the origin. All data points with $z >0$ are assigned to the positive class, and all data points with $z < 0$ are assigned to the negative class.

The vector $\mathbf{W}$ runs perpendicular to the line $z =0$ and defines the direction in which data classes are maximally separated when projected onto it.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/W_separating_hyperplane.png" width = "450" >

## Exercise 1: Implementing a single neuron classifier through logistic regression

### Model

Our predictions for a single input can be written:
$$ f= f(z) = \dfrac{1}{1+e^{-z}} $$

Where, for logistic regression, $f$ is the sigmoid function and:

$$z=w_0 + w_1x_1 + w_2x_2 +w_3x_3....+w_m x_m$$

Here $w_0$ is the bias term, $w_1,w_2....w_m$ are the weights, $m$ is the number of features and $\mathbf{x}$ is a single example (i.e. one column) from our training set $X \in \mathbb{R}^{m\times n}$.

### Implementation of the forward pass

We could calculate $f$ in one line of code, but it will come in handy when considering backpropagation later to consider the computation in stages, with each stage consisting of a simple module:

$$
\begin{align}
\mathbf{Z} &= \mathbf{W} \mathbf{X} \\
\mathbf{F}=f(\mathbf{Z}) &= \dfrac{1}{1+e^{-\mathbf{Z}}}
\end{align}
$$

Implemented using vectorisation.



### Task 1.1 Initialise $\mathbf{W}$

**To do** Create a matrix of zeros to initialise $\mathbf{W}$ (note initialisation by zero is ok for a single neuron).

- If $\mathbf{X}$ has shape $(m_{features} \times n_{examples})$, and we know that $\mathbf{Z}$ (and thus $\mathbf{F}$) should return _one_ scalar prediction _per example_, what shape should $\mathbf{W}$ be?

### Task 1.2 Estimate $\mathbf{Z}$:

**To do** Write a function $z(w,x)$ that uses vectorisation to linearly transform data matrix $\mathbf{X}$ using the weights matrix $\mathbf{W}$.

**Hint** implement $\mathbf{Z} = \mathbf{W} \mathbf{X}$; print out the shape - is it what you would expect?


### Task 1.3 Implement Sigmoid function f:

**To do** Now write a function to compute $f(\mathbf{Z})=\dfrac{1}{1+e^{-\mathbf{Z}}} $, our logistic regression function:

**Hint** don't forget to implement with numpy functions - to support vectorisation

### Task 1.4 Implement Cross Entropy Loss:

Accuracy is easy to interpret, but can't be optimised using gradient descent. We need a measure of our prediction quality that can be. A typical loss function used in classification problems is cross-entropy:

$$L(y_i,f(z_i)) = - y_i \ln(f(z_i)) - (1-y_i) \ln(1-f(z_i))$$

This may be implemented using vectorisation as:

$$L(\mathbf{Y},\mathbf{F}) = - \mathbf{Y} \ln(\mathbf{F} + \epsilon) - (1-\mathbf{Y}) \ln(1-\mathbf{F} + \epsilon)$$

This returns a vector of losses $(L_1,L_2....L_n)$ estimated for all training examples n. The $\epsilon$ is added for numerical stability. We require the total cost estimated as:

$$ J(\mathbf{W})= \frac{1}{n} \sum_i L_i(y_i,f(z_i)) $$

**To do Implement the Cross-Entropy loss and return the total cost**

**Hint** use numpy functions for vectorisation.
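For reference, one possible sketch of the vectorised cross-entropy cost is shown below. It is not necessarily the intended solution. The function name, the value of `eps` and the example predictions are assumptions made for illustration.

```python
import numpy as np

def cross_entropy_cost(Y, F, eps=1e-8):
    # Per-example losses; eps guards against log(0)
    losses = -Y * np.log(F + eps) - (1.0 - Y) * np.log(1.0 - F + eps)
    # Total cost J is the mean loss over all examples
    return np.mean(losses)

Y = np.array([[1.0, 0.0, 1.0, 0.0]])        # hypothetical labels
F_good = np.array([[0.9, 0.1, 0.8, 0.2]])   # confident, correct predictions
F_bad = np.array([[0.1, 0.9, 0.2, 0.8]])    # confident, wrong predictions

print(cross_entropy_cost(Y, F_good))  # small cost
print(cross_entropy_cost(Y, F_bad))   # much larger cost
```

Predictions that are confidently wrong are penalised far more heavily than predictions that are confidently right, which is the behaviour we want from a classification loss.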

## The Computation Graph

Now we have our functions for $\mathbf{L}$ and $\mathbf{Z}$, and initialised $\mathbf{W}$, we are finally in a position to compute a forward and backward pass. Computation graphs can help us do this by tracking the order of operations. The computation graph for logistic regression is:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/computation_graph_log_reg.png" width = "700" >

We can estimate the backwards pass using the chain rule:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/The_chain_rule.png" width = "300" >

Working backwards from the right side, this determines that to calculate the gradient of the loss with respect to the parameters we need:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/computation_backwards.png" width = "700" >


And don't forget that the full cost equates to the mean of the loss over all examples: $J=\frac{1}{n}\sum_i L_i$, so $\dfrac{dJ}{dW}=\frac{1}{n} \sum_i \dfrac{dL_i}{dW}$. All calculations should be vectorised.

### Task 1.5 Implement Forward Pass


We now have all the components of the forward pass for our logistic regression. Write a full forward pass that takes data, targets and a weight matrix and performs the forward pass, with vectorisation calculating the loss:

### Task 1.6 Implement backwards pass

We're now ready to try and adjust our parameters $\mathbf{W}$ in order to optimise our predictions. To do this we need to calculate the change in our loss function with respect to our parameters, $\dfrac{\partial L}{\partial \mathbf{W}}$.

Recalling our staged calculation of the logistic regression (in vectorised form):

$$
\mathbf{Z} = \mathbf{W} \mathbf{X} \\
\mathbf{F}= \dfrac{1}{1+e^{- \mathbf{Z}}} \\
\mathbf{L}  =  - \mathbf{Y} \ln(\mathbf{F}) - (1-\mathbf{Y}) \ln(1-\mathbf{F})
$$

We can write the vectorised gradients for each individual stage (see lecture slides and keats quiz):

$$
\dfrac{\partial L}{\partial f} = \dfrac{\mathbf{F} - \mathbf{Y}}{\mathbf{F}(1-\mathbf{F})}\\
\dfrac{\partial f}{\partial z} = \mathbf{F}(1-\mathbf{F}) \\
\dfrac{\partial z}{\partial w} = \mathbf{X}^T
$$

And compose through the chain rule:

$$
\dfrac{\partial L}{\partial w} = \dfrac{\partial L}{\partial f} \cdot \dfrac{\partial f}{\partial z} \cdot\dfrac{\partial z}{\partial w} \\
\dfrac{\partial L}{\partial w} = \dfrac{\mathbf{F} - \mathbf{Y}}{\mathbf{F}(1-\mathbf{F})} \cdot \mathbf{F}(1-\mathbf{F}) \cdot \mathbf{X}^T
$$

Which can be simplified by cancelling $ \mathbf{F}(1-\mathbf{F})$ terms in both the numerator and the denominator:

$$ \dfrac{\partial L}{\partial w} = (\mathbf{F} - \mathbf{Y}) \mathbf{X}^T $$
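A useful sanity check for a derived gradient is to compare it with a finite-difference estimate. The sketch below does this for the expression above, under the assumptions that $\mathbf{W}$ has shape $(1 \times m)$, that the cost is the mean loss (hence the extra factor $1/n$), and that the data are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, X, Y):
    # Mean cross-entropy loss over all examples
    F = sigmoid(W @ X)
    return np.mean(-Y * np.log(F) - (1.0 - Y) * np.log(1.0 - F))

rng = np.random.default_rng(0)
m, n = 3, 8                                         # features, examples (arbitrary)
X = rng.normal(size=(m, n))
Y = (rng.random(size=(1, n)) > 0.5).astype(float)   # random 0/1 labels
W = rng.normal(size=(1, m)) * 0.1

# Analytic gradient of the mean cost: (1/n) (F - Y) X^T
F = sigmoid(W @ X)
analytic = (F - Y) @ X.T / n

# Central finite-difference estimate, one weight at a time
numeric = np.zeros_like(W)
step = 1e-6
for j in range(m):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[0, j] += step
    W_minus[0, j] -= step
    numeric[0, j] = (cost(W_plus, X, Y) - cost(W_minus, X, Y)) / (2.0 * step)

print(np.max(np.abs(analytic - numeric)))  # should be very small
```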

Let's calculate the gradient of our loss, $\dfrac{\partial L}{\partial \mathbf{W}}$, for a **single** input, $\mathbf{x}$.

**To do** Fill in the calculations of the backward pass in the following code:

### Task 1.7 -  Putting it all together: the training loop

We now have everything we need to train a logistic regression classifier using backprop.

**To do** Fill out the training loop below referencing the code you have already written (or been given) in this notebook.

###  Task 1.8 Now Testing on left out set

**To do** test the performance of your logistic regression on your left out test set by running below code cell

## Exercise 2 - The multi-layer perceptron (MLP)

We now want to extend this model to create a single hidden layer neural network:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/MLP.png" width = "700" >

The forward pass through such a network may be written as

$$ \hat{y} = f_2 \left( \mathbf{W_2} f_1 \left(\mathbf{W_1}\mathbf{X}\right) \right) $$

where $f_1$ is a non-linear activation function for the hidden layer (we use ReLU),

$$ \text{ReLU}(x) = \text{max}(0,x)$$

$f_2$ is a non-linear activation function for the output layer (we use sigmoid for classification) and $\mathbf{W_1}$ and $\mathbf{W_2}$ are the weights matrices for each layer. The generic shapes of each matrix are demonstrated in the figure.

**In this toy example we ask you to instead create a network with 5 hidden neurons**


**Question** Given the shape of our input data, and the fact that we are still seeking the solution to a binary classification what are the number of input and output units for this problem (answer below)?


We now go about implementing our simple network from scratch with gradient descent based optimisation

### The forward pass

Once again, we can write the forward pass as a staged computation:

$$
\mathbf{Z}_1 = \mathbf{W}_1 \mathbf{X} \\
\mathbf{F}_1 = \text{max}(0,\mathbf{Z_1}) \\
\mathbf{Z}_2 = \mathbf{W}_2 \mathbf{F}_1 \\
\mathbf{F}_2 = \dfrac{1}{1+e^{- \mathbf{Z_2}}} \\
\mathbf{L}  =  - \mathbf{Y} \ln(\mathbf{F_2}) - (1-\mathbf{Y}) \ln(1-\mathbf{F_2})
$$
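The staged computation above can be sketched in NumPy as follows. The layer sizes, random data and weight scales are placeholder assumptions, and no bias terms are included, matching the equations.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n, n_hidden = 4, 6, 5                            # features, examples, hidden neurons
X = rng.normal(size=(m, n))
Y = (rng.random(size=(1, n)) > 0.5).astype(float)   # random 0/1 labels
W1 = rng.normal(size=(n_hidden, m)) * 0.1
W2 = rng.normal(size=(1, n_hidden)) * 0.1

Z1 = W1 @ X          # hidden pre-activations, shape (n_hidden, n)
F1 = relu(Z1)        # hidden activations
Z2 = W2 @ F1         # output pre-activations, shape (1, n)
F2 = sigmoid(Z2)     # predicted probabilities
L = -Y * np.log(F2) - (1.0 - Y) * np.log(1.0 - F2)  # per-example losses

print(Z1.shape, F2.shape, L.mean())
```

Printing the shape after every stage is a quick way to catch a transposed matrix before it causes problems in the backward pass.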

We give you the code for the ReLU:

Let's implement the forward pass.


### Task 2.1 Implement a forward pass of the MLP below:

Use the vectorised expressions detailed above.  What dimension must  $\mathbf{W_1}$ and $\mathbf{W_2}$ be?

### The backwards pass

The vectorised gradients of our MLP computation graph are, in reverse order, as follows:

$$\frac{\partial L}{\partial \mathbf{F}_2}=\frac{\mathbf{F}_2-\mathbf{Y}}{\mathbf{F}_2(1-\mathbf{F}_2)} \\
\frac{\partial \mathbf{F}_2}{\partial \mathbf{Z}_2}=\mathbf{F}_2(1-\mathbf{F}_2) \\
\frac{\partial \mathbf{Z}_2}{\partial \mathbf{W}_2}=\mathbf{F}_1^T \\
\frac{\partial \mathbf{Z}_2}{\partial \mathbf{F}_1}=\mathbf{W}^T_2\\
\frac{\partial \mathbf{F}_1}{\partial \mathbf{Z}_1}=1(\mathbf{Z}_1 >0)\\
\frac{\partial \mathbf{Z}_1}{\partial \mathbf{W}_1}=\mathbf{X}^T
$$


Combining these together using the chain rule we get (from lecture)

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/MLPbackprop.png" width = "700" >


**Task** implement the backward pass of the MLP in numpy code, and copy in the forward pass from above.

**Hint** carefully consider the order in which the stages are combined (covered in the lecture). Check the dimensions of the outputs are as expected
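Whatever ordering you settle on, a finite-difference check is a reliable way to confirm it. The sketch below shows one possible way of combining the stages, with each gradient compared against a numerical estimate. It assumes a mean cross-entropy cost, no bias terms and random placeholder data. The lecture figure remains the reference for the intended solution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, W2, X, Y):
    Z1 = W1 @ X
    F1 = np.maximum(0.0, Z1)
    F2 = sigmoid(W2 @ F1)
    J = np.mean(-Y * np.log(F2) - (1.0 - Y) * np.log(1.0 - F2))
    return Z1, F1, F2, J

def backward(W2, X, Y, Z1, F1, F2):
    n = X.shape[1]
    dZ2 = (F2 - Y) / n      # dJ/dZ2 after cancelling F2(1-F2), shape (1, n)
    dW2 = dZ2 @ F1.T        # shape (1, n_hidden)
    dF1 = W2.T @ dZ2        # shape (n_hidden, n)
    dZ1 = dF1 * (Z1 > 0)    # ReLU passes gradient only where Z1 > 0
    dW1 = dZ1 @ X.T         # shape (n_hidden, m)
    return dW1, dW2

def numerical_grad(f, W, step=1e-6):
    # Central differences, perturbing one entry of W at a time (in place)
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + step
        up = f()
        W[idx] = old - step
        down = f()
        W[idx] = old
        grad[idx] = (up - down) / (2.0 * step)
    return grad

rng = np.random.default_rng(0)
m, n, n_hidden = 4, 6, 5
X = rng.normal(size=(m, n))
Y = (rng.random(size=(1, n)) > 0.5).astype(float)
W1 = rng.normal(size=(n_hidden, m)) * 0.5
W2 = rng.normal(size=(1, n_hidden)) * 0.5

Z1, F1, F2, J = forward(W1, W2, X, Y)
dW1, dW2 = backward(W2, X, Y, Z1, F1, F2)

cost_only = lambda: forward(W1, W2, X, Y)[3]
num_dW1 = numerical_grad(cost_only, W1)
num_dW2 = numerical_grad(cost_only, W2)
print(np.max(np.abs(dW1 - num_dW1)), np.max(np.abs(dW2 - num_dW2)))
```

Each gradient has the same shape as the weight matrix it belongs to, which is a quick first check before comparing values.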

### Testing the performance of the MLP

**To do** test the performance of your MLP by running the code on your left out test set

## Exercise 3 (week 2) Adding Regularisation

One problem with neural networks is that they can involve the training of very high numbers of parameters (defined by the total number of elements in all our weights matrices). The more parameters we can choose from, the greater the chance of overfitting.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-8-Neural-Networks/imgs/overfitting.png" width = "700" >

There are several ways of controlling the capacity of neural networks to prevent overfitting. These include:

1. L1 and L2 regularisation – penalise the network through addition of a penalty term i.e.

$$ J =\frac{1}{n} (\sum_i L_i + \lambda <\textrm{penalty term}>)$$

2. Dropout – during training, keep each neuron active with probability $p$ and set its output to zero otherwise.

Dropout will be considered in more detail in lecture 4. Here, we will consider the inclusion of a penalty term. Of the two penalties, L2 is the more common. It uses a penalty of $\frac{\lambda}{2} \lVert\mathbf{W}\rVert^2$ (the factor of 1/2 makes the gradient $\lambda \mathbf{W}$ rather than $2 \lambda \mathbf{W}$). L2 regularisation encourages the network to learn diffuse weights (small weights spread across all units). L1, on the other hand, has a penalty of $\lambda \lVert\mathbf{W}\rVert_1$, which encourages sparse weights, where many individual weights are set to zero.

**To do** Let's try adding L2 regularisation to our MLP network. First, write a new loss function which computes an L2-regularised loss.
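One way this could look is sketched below. It follows the cost $J$ defined above with the penalty $\frac{\lambda}{2}\lVert\mathbf{W}\rVert^2$ summed over all weight matrices. The function name `l2_regularised_loss` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def l2_regularised_loss(L, weights, lam):
    """J = (1/n) * (sum_i L_i + (lam / 2) * sum of all squared weights)."""
    n = L.size
    penalty = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return (np.sum(L) + penalty) / n

# Toy per-sample losses and weight matrices
L = np.array([[1.0, 3.0]])
W1 = np.array([[1.0, 2.0]])
W2 = np.array([[2.0]])

print(l2_regularised_loss(L, [W1, W2], lam=0.0))  # 2.0 (plain mean loss)
print(l2_regularised_loss(L, [W1, W2], lam=0.5))  # 3.125
```

With this form of $J$, the backward pass gains an extra term of $\lambda \mathbf{W} / n$ in the gradient of each weight matrix.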



