# Forward and backward pass (MLP)

## Team members:

* Project Manager - Kairat Kabdushev 
* Technical writer - Aruzhan Omarova 
* Author of executable content - Arnur Nurov
* Designer of interactive plots - Zhanibek Saduakas
* Designer of quizzes - Kairat Kabdushev 

## Learning objectives

- The structure of a neural network
- Forward pass
- Backward pass
- Batch training

### The structure of a neural network

Let us briefly review the basic architecture of a neural network.


<img src="https://www.researchgate.net/publication/355373442/figure/fig1/AS:1080268202999810@1634567418319/Basic-structure-of-MLP-network-5.ppm" alt="Alt text" width="600">

More specifically, a neural network is a composition of **layers**:

$$
    F_{\boldsymbol \theta}(\boldsymbol x) = \big(f^{(L)}_{\boldsymbol \theta_L}  \circ \ldots \circ f^{(1)}_{\boldsymbol \theta_1}\big)(\boldsymbol x), \quad f^{(i)}_{\boldsymbol \theta_i}  \colon \mathbb R^{n_{i-1}} \to \mathbb R^{n_i}
$$

* $n_i$ is the size of $i$-th layer, $n_0 = n_{\mathrm{in}}$, $n_L = n_{\mathrm{out}}$
* $\boldsymbol \theta_i$ are parameters of $i$-th layer, $\boldsymbol \theta = (\boldsymbol \theta_1,\ldots, \boldsymbol \theta_L)$
* $\boldsymbol x_i = f^{(i)}_{\boldsymbol \theta_{i}}(\boldsymbol x_{i-1})$ is the representation of $i$-th layer, $\boldsymbol x_i \in \mathbb R^{n_i}$

* Input layer: $\boldsymbol x_0 = \boldsymbol x_{\mathrm{in}}$
* Output layer: $\boldsymbol {\widehat y} = \boldsymbol x_L = \boldsymbol x_{\mathrm{out}} = f^{(L)}_{\boldsymbol \theta_L}(\boldsymbol x_{L-1})$
* Hidden layers: $\boldsymbol x_1,\ldots, \boldsymbol x_{L-1}$ (also written $\boldsymbol h_1,\ldots, \boldsymbol h_{L-1}$)
* If $L=1$ there are no hidden layers
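The composition above can be sketched in a few lines of Python. This is only an illustration: the helper `compose` and the two toy layers `f1` and `f2` are hypothetical and stand in for whatever layers a real network uses.

```python
from functools import reduce

import numpy as np


def compose(*layers):
    """Return F(x) = f_L(...f_1(x)...), applying the layers in the given order."""
    return lambda x: reduce(lambda acc, f: f(acc), layers, x)


# Two toy layers (hypothetical): R^3 -> R^2 -> R^1
f1 = lambda x: np.tanh(x[:2] + x[2])   # n_0 = 3, n_1 = 2
f2 = lambda x: np.array([x.sum()])     # n_1 = 2, n_2 = 1

F = compose(f1, f2)

x0 = np.array([0.0, 1.0, 2.0])         # input layer x_0
y = F(x0)                              # output layer x_L with L = 2
```

Here `f1(x0)` plays the role of the single hidden representation $\boldsymbol x_1$.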

### Let's test what we have read

<img src="https://media.licdn.com/dms/image/C5112AQFOFj93r-blFg/article-cover_image-shrink_600_2000/0/1579777335853?e=2147483647&v=beta&t=02CL7iY48zpr9cQpGEyxXlsRGge-2KUArlRa8_4aJMM" alt="Alt text" width="600">

<span style="display:none" id="q_n_layers">W3sicXVlc3Rpb24iOiAiSG93IG1hbnkgaGlkZGVuIGxheWVycyBpbiB0aGUgTmV1cmFsIE5ldHdvcmsgYWJvdmU/IiwgInR5cGUiOiAibnVtZXJpYyIsICJhbnN3ZXJzIjogW3sidHlwZSI6ICJ2YWx1ZSIsICJ2YWx1ZSI6IDIsICJjb3JyZWN0IjogdHJ1ZSwgImZlZWRiYWNrIjogIkNvcnJlY3QhIFRoZXJlIGFyZSAyIGhpZGRlbiBsYXllcnMhIn0sIHsidHlwZSI6ICJkZWZhdWx0IiwgImZlZWRiYWNrIjogIk9vcHMuLi4gU29tZXRoaW5nIHdlbnQgd3JvbmchIn1dfV0=</span>

```python
from jupyterquiz import display_quiz
display_quiz("#q_n_layers")
```


### Forward pass

<img src="forward_pass.png" alt="Alt text" width="600">


The forward pass starts at the inputs and proceeds layer by layer until it reaches the outputs. Let's go through this flow, starting with the components involved in each step:

1) #### Input Data:

- The raw data fed into the neural network.

2) #### Weights ($\boldsymbol W_i$):

- $\boldsymbol W_i$ has shape $n_{i-1}\times n_i$. Represents the weights connecting layer $i-1$ to layer $i$.

3) #### Biases ($\boldsymbol b_i$):

- $\boldsymbol b_i$ has length $n_i$. Represents the biases for layer $i$.

4) #### Activation Functions:

- Introduce non-linearity to the network.
- Common functions include ReLU, Sigmoid, Tanh, etc.
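To make the shapes concrete, here is a minimal sketch that allocates weights and biases for a list of layer sizes and counts the learnable parameters. The function names `init_params` and `count_params`, the layer sizes, and the initialization scale are assumptions made for this illustration.

```python
import numpy as np


def init_params(sizes, seed=0):
    """Create (W_i, b_i) for each layer; W_i has shape (n_{i-1}, n_i), b_i has length n_i."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = rng.normal(scale=0.1, size=(n_in, n_out))  # small random weights (assumed scale)
        b = np.zeros(n_out)                            # biases start at zero
        params.append((W, b))
    return params


def count_params(params):
    """Total number of learnable scalars across all weights and biases."""
    return sum(W.size + b.size for W, b in params)


sizes = [4, 5, 3]                 # n_0 = 4 inputs, one hidden layer of 5, n_2 = 3 outputs
params = init_params(sizes)
n_params = count_params(params)   # (4*5 + 5) + (5*3 + 3) = 43
```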

### The forward pass steps in matrix form:

#### Compute the output of the first hidden layer:


<img src="forward_2.jpg" alt="Alt text" width="600">

$\boldsymbol W^{(1)}$ has shape $n_{\text{input}}\times n_{\text{hidden}_1}$.

For each subsequent hidden layer $i$:

<img src="forward_3.jpg" alt="Alt text" width="600">

$\boldsymbol W^{(i)}$ has shape $n_{\text{hidden}_{i-1}}\times n_{\text{hidden}_i}$.

#### Output Layer Calculation:

Compute the final output of the network using activations of the last hidden layer:

<img src="forward_4.jpg" alt="Alt text" width="600">

$\boldsymbol W^{(n)}$ has shape $n_{\text{hidden}_{n-1}}\times n_{\text{output}}$.
$\text{Output}_{\text{final}}$ represents the final output of the network.
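The matrix-form steps can be sketched in NumPy as follows. The sketch assumes the row-vector convention implied by the weight shapes above, that is, each hidden layer computes `activation(h @ W + b)`. It also assumes ReLU in the hidden layers and no activation on the output layer; both are choices made for this example and may differ from the figures.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def forward(x, params, activation=relu):
    """Forward pass through an MLP given as a list of (W, b) pairs."""
    h = x
    for W, b in params[:-1]:
        h = activation(h @ W + b)   # hidden layers: affine map followed by non-linearity
    W_out, b_out = params[-1]
    return h @ W_out + b_out        # output layer: affine map only (assumption)


rng = np.random.default_rng(0)
sizes = [4, 5, 3]                   # hypothetical layer sizes
params = [(rng.normal(size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(2, 4))         # two input samples stacked as rows
out = forward(x, params)            # shape (2, 3): one output row per sample
```

Because samples are stored as rows, the same function handles a single sample of shape `(4,)` and a whole batch of shape `(batch, 4)`.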


<img src="https://static.independent.co.uk/2023/05/09/15/sats.jpg" alt="Alt text" width="500">

<span style="display:none" id="q_forward_process">W3sicXVlc3Rpb24iOiAiSG93IG1hbnkgbGVhcm5hYmxlIHBhcmFtZXRlcnMgaW4gdGhlIE1MUCB3b3VsZCBiZSBpZiBMID0gMSwgd2l0aCBhbiBpbnB1dCBsYXllciBvZiAyIG5ldXJvbnMsIGEgaGlkZGVuIGxheWVyIG9mIDMgbmV1cm9ucywgYW5kIGFuIG91dHB1dCBsYXllciBvZiAxIG5ldXJvbj8iLCAidHlwZSI6ICJudW1lcmljIiwgImFuc3dlcnMiOiBbeyJ0eXBlIjogInZhbHVlIiwgInZhbHVlIjogMTEsICJjb3JyZWN0IjogdHJ1ZSwgImZlZWRiYWNrIjogIkNvcnJlY3QhIFRoZXJlIGFyZSAxMSBsZWFybmFibGUgcGFyYW1ldGVycyEifSwgeyJ0eXBlIjogImRlZmF1bHQiLCAiZmVlZGJhY2siOiAiT29wcy4uLiBTb21ldGhpbmcgd2VudCB3cm9uZywgdHJ5IGFnYWluISJ9XX1d</span>

```python
from jupyterquiz import display_quiz
display_quiz("#q_forward_process")
```


### Back propagation

<img src="backprop.png" alt="Alt text" width="600">

The backward pass propagates gradients of the loss backward through the network, from the output layer toward the input layer. These gradients are then used to update the weights and biases.

$\nabla_{\boldsymbol X_i}\mathcal L$: the gradient of the loss function with respect to the activations (hidden representations) at layer $i$. These gradients are computed successively, moving backward through the layers.

$\nabla_{\boldsymbol W_i} \mathcal L$: the gradient of the loss function with respect to the weights $\boldsymbol W_i$ connecting layer $i$ to layer $i+1$. It is used to update the weights so as to reduce the loss. Note that the indexing here is shifted by one relative to the forward pass section, where $\boldsymbol W_i$ connects layer $i-1$ to layer $i$.

$\nabla_{\boldsymbol B_i}\mathcal L$: the gradient of the loss function with respect to the biases $\boldsymbol B_i$ of layer $i+1$. It is used to update the biases during training.
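As a rough illustration, the three gradients can be written out for a single linear layer. The sketch assumes the layer computes $\boldsymbol X_{i+1} = \boldsymbol X_i \boldsymbol W_i + \boldsymbol B_i$ with samples stored as rows and leaves out the activation function. The function names and the toy loss are hypothetical.

```python
import numpy as np


def linear_forward(X, W, B):
    """X_{i+1} = X_i W_i + B_i (activation omitted in this sketch)."""
    return X @ W + B


def linear_backward(X, W, grad_out):
    """Given grad_out = dL/dX_{i+1}, return dL/dX_i, dL/dW_i and dL/dB_i."""
    grad_X = grad_out @ W.T         # passed on to the previous layer
    grad_W = X.T @ grad_out         # used to update the weights
    grad_B = grad_out.sum(axis=0)   # the bias is shared by all rows, so sum over samples
    return grad_X, grad_W, grad_B


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))         # 4 samples, 3 features
W = rng.normal(size=(3, 2))
B = rng.normal(size=2)

# Toy loss (assumption): L = 0.5 * sum(Y**2), hence dL/dY = Y
Y = linear_forward(X, W, B)
grad_X, grad_W, grad_B = linear_backward(X, W, grad_out=Y)
```

Each gradient has the same shape as the quantity it belongs to, which is a quick sanity check when implementing backpropagation by hand.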

<span style="display:none" id="q_backward_process">W3sicXVlc3Rpb24iOiAiV2hhdCBpcyB0aGUgcHJpbWFyeSBwdXJwb3NlIG9mIHRoZSBiYWNrd2FyZCBwYXNzIGluIGJhY2twcm9wYWdhdGlvbiwgYW5kIHdoYXQgaW5mb3JtYXRpb24gZG9lcyBpdCBjb21wdXRlIGZvciBlYWNoIGxheWVyIG9mIHRoZSBuZXVyYWwgbmV0d29yaz8iLCAidHlwZSI6ICJtdWx0aXBsZV9jaG9pY2UiLCAiYW5zd2VycyI6IFt7ImFuc3dlciI6ICJUaGUgYmFja3dhcmQgcGFzcyBpcyB1c2VkIHRvIGNvbXB1dGUgdGhlIGxvc3MgZnVuY3Rpb24gb25seS4iLCAiY29ycmVjdCI6IGZhbHNlLCAiZmVlZGJhY2siOiAiSW5jb3JyZWN0LiBXaGlsZSB0aGUgYmFja3dhcmQgcGFzcyBpcyBpbnZvbHZlZCBpbiBjYWxjdWxhdGluZyBncmFkaWVudHMgcmVsYXRlZCB0byB0aGUgbG9zcyBmdW5jdGlvbiwgaXRzIHNvbGUgcHVycG9zZSBpcyBub3QgbGltaXRlZCB0byBjb21wdXRpbmcgdGhlIGxvc3MgZnVuY3Rpb24gaXRzZWxmLiJ9LCB7ImFuc3dlciI6ICJUaGUgYmFja3dhcmQgcGFzcyBjb21wdXRlcyBncmFkaWVudHMgb2YgdGhlIGxvc3MgZnVuY3Rpb24gd2l0aCByZXNwZWN0IHRvIHRoZSBuZXR3b3JrJ3MgcGFyYW1ldGVycyAoYWN0aXZhdGlvbnMsIHdlaWdodHMsIGJpYXNlcykgdG8gYWRqdXN0IHRoZW0gZm9yIG1pbmltaXppbmcgdGhlIG92ZXJhbGwgZXJyb3IuIiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIkluY29ycmVjdC4gVGhlIHVwZGF0ZSBwcm9jZXNzIG9jY3VycyB1c2luZyB0aGVzZSBncmFkaWVudHMgaW4gY29uanVuY3Rpb24gd2l0aCBhbiBvcHRpbWl6YXRpb24gYWxnb3JpdGhtIChlLmcuLCBncmFkaWVudCBkZXNjZW50KSBkdXJpbmcgdGhlIHRyYWluaW5nIHBoYXNlLCBub3QgZGlyZWN0bHkgZHVyaW5nIHRoZSBiYWNrd2FyZCBwYXNzLiJ9LCB7ImFuc3dlciI6ICJUaGUgYmFja3dhcmQgcGFzcyBjb21wdXRlcyBncmFkaWVudHMgb2YgdGhlIGxvc3MgZnVuY3Rpb24gd2l0aCByZXNwZWN0IHRvIHRoZSBuZXR3b3JrJ3MgcGFyYW1ldGVycyAoYWN0aXZhdGlvbnMsIHdlaWdodHMsIGJpYXNlcykgdG8gYWRqdXN0IHRoZW0gZm9yIG1pbmltaXppbmcgdGhlIG92ZXJhbGwgZXJyb3IuIiwgImNvcnJlY3QiOiB0cnVlLCAiZmVlZGJhY2siOiAiQ29ycmVjdCEgVGhlIHByaW1hcnkgcHVycG9zZSBvZiB0aGUgYmFja3dhcmQgcGFzcyBpbiBiYWNrcHJvcGFnYXRpb24gaXMgdG8gY2FsY3VsYXRlIGdyYWRpZW50cyBvZiB0aGUgbG9zcyBmdW5jdGlvbiB3aXRoIHJlc3BlY3QgdG8gdGhlIG5ldHdvcmsncyBwYXJhbWV0ZXJzLiJ9LCB7ImFuc3dlciI6ICJUaGUgYmFja3dhcmQgcGFzcyBoZWxwcyBpbiB0aGUgaW5pdGlhbCBjb21wdXRhdGlvbiBvZiB0aGUgZm9yd2FyZCBwYXNzIGJ5IHByb3BhZ2F0aW5nIGlucHV0IGRhdGEgYmFja3dhcmQgdGhyb3VnaCB0aGUgbmV0d29yay4iLCAiY29ycmVjdCI6IGZhbHNlLCAiZmVlZGJhY2siOiAiS
W5jb3JyZWN0LiBUaGUgYmFja3dhcmQgcGFzcyBvY2N1cnMgYWZ0ZXIgdGhlIGZvcndhcmQgcGFzcyBhbmQgaW52b2x2ZXMgcHJvcGFnYXRpbmcgZ3JhZGllbnRzIGJhY2t3YXJkIHRocm91Z2ggdGhlIG5ldHdvcmsuICJ9XX1d</span>

```python
from jupyterquiz import display_quiz
display_quiz("#q_backward_process")
```


### Optimization techniques

The goal is to minimize the loss function with respect to parameters $\boldsymbol \theta$,

$$
\mathcal L = \mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y) \to \min\limits_{\boldsymbol \theta}
$$

where
$$
\boldsymbol \theta = (\boldsymbol W_1, \boldsymbol b_1, \boldsymbol W_2, \boldsymbol b_2, \ldots, \boldsymbol W_L, \boldsymbol b_L)
$$

Let's use the standard technique: gradient descent.

1. Start from some random parameters $\boldsymbol \theta_0$


2. Given a training sample $(\boldsymbol x, \boldsymbol y)$, do the **forward pass** and get the output $\boldsymbol {\widehat y} = F_{\boldsymbol \theta}(\boldsymbol x)$

3. Calculate the loss function
$\mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y)$ and its gradient
$$
\nabla_{\boldsymbol\theta}\mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y)
$$

4. Update the parameters, where $\eta > 0$ is the learning rate:

$$
    \boldsymbol \theta = \boldsymbol \theta - \eta \nabla_{\boldsymbol\theta}\mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y)
$$

5. Go to step 2 with the next training sample
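The five steps can be sketched on a toy problem. To keep the gradient short enough to write by hand, this example uses a linear model $\widehat y = \boldsymbol x \cdot \boldsymbol w + b$ with squared loss instead of a full MLP. The data, the learning rate and the number of passes are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): targets follow y = 2*x1 - 3*x2 + 1 without noise
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0]) + 1.0

# Step 1: start from random parameters theta_0 = (w, b)
w = rng.normal(size=2)
b = 0.0
eta = 0.05                          # learning rate (assumed value)

for epoch in range(20):             # several passes over the training set
    for x_i, y_i in zip(X, y):
        y_hat = x_i @ w + b         # step 2: forward pass
        err = y_hat - y_i           # step 3: loss = 0.5 * err**2, gradients are err * x_i and err
        w -= eta * err * x_i        # step 4: gradient descent update
        b -= eta * err
        # step 5: continue with the next training sample
```

After the loop, `w` and `b` should be close to the values used to generate the data.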

### Practical example

Suppose we have a neural network with two parameters $\theta_1 = 2$ and $\theta_2 = -3$, and that the gradient of the loss function with respect to these parameters is $\nabla_{\boldsymbol\theta}\mathcal L_{\boldsymbol\theta} = [4, -5]$. Perform one step of gradient descent with learning rate $\eta = 0.1$.

What are the updated values of $\theta_1$ and $\theta_2$ after one step of gradient descent?

Options:
- $\theta_1 = 1.6$, $\theta_2 = -3.5$

- $\theta_1 = 1.8$, $\theta_2 = -3.5$

- $\theta_1 = 1.6$, $\theta_2 = -2.5$

- $\theta_1 = 1.8$, $\theta_2 = -2.5$

Correct Answer:
$\theta_1 = 1.6$, $\theta_2 = -2.5$

Explanation:

Using the gradient descent formula, the updated values are calculated as follows:
- $\theta_1 = 2 - 0.1 \cdot 4 = 1.6$

- $\theta_2 = -3 - 0.1 \cdot (-5) = -3 + 0.5 = -2.5$

Thus, after one step of gradient descent, $\theta_1$ is updated to 1.6, and $\theta_2$ is updated to -2.5.

```python
import plotly.graph_objs as go

# Case 1 parameters and gradients
theta1_case1, theta2_case1 = 2, -3
gradient1_case1, gradient2_case1 = 4, -5

# Case 2 parameters and gradients
theta1_case2, theta2_case2 = 1, -2
gradient1_case2, gradient2_case2 = 3, -4

# Learning rate
learning_rate = 0.1

# Gradient descent updates for Case 1
updated_theta1_case1 = theta1_case1 - learning_rate * gradient1_case1
updated_theta2_case1 = theta2_case1 - learning_rate * gradient2_case1

# Gradient descent updates for Case 2
updated_theta1_case2 = theta1_case2 - learning_rate * gradient1_case2
updated_theta2_case2 = theta2_case2 - learning_rate * gradient2_case2

# Plotting the updates for both cases
fig = go.Figure()

# Add updates for Case 1
fig.add_trace(go.Scatter(x=[theta1_case1, updated_theta1_case1],
                         y=[theta2_case1, updated_theta2_case1],
                         mode='lines+markers',
                         name='Case 1',
                         line=dict(color='blue', width=2),
                         marker=dict(color='blue', size=10)))

# Add updates for Case 2
fig.add_trace(go.Scatter(x=[theta1_case2, updated_theta1_case2],
                         y=[theta2_case2, updated_theta2_case2],
                         mode='lines+markers',
                         name='Case 2',
                         line=dict(color='red', width=2),
                         marker=dict(color='red', size=10)))

# Customize layout
fig.update_layout(title='Parameter Updates after One Step of Gradient Descent',
                  xaxis=dict(title='Theta 1'),
                  yaxis=dict(title='Theta 2'),
                  showlegend=True)

# Display plot
fig.show()
```


Each segment starts at the initial values of $\theta_1$ and $\theta_2$ for the respective case. Case 1 is the example above; Case 2 starts from $\theta_1 = 1$, $\theta_2 = -2$ with gradient $[3, -4]$.
The segments show the update made to the parameters by one step of gradient descent, and the end point of each segment marks the new values of $\theta_1$ and $\theta_2$.
The direction of each segment is the direction of the update, which is opposite to the gradient, and its length is the magnitude of the update, $\eta \lVert \nabla_{\boldsymbol\theta}\mathcal L \rVert$.
