# Backpropagation

## Definition

In machine learning, backpropagation is a gradient computation method commonly used to train neural networks: it applies the chain rule to obtain the gradients from which parameter updates are computed. Backpropagation computes the gradient of a loss function with respect to the weights of the network for a single input–output example. It works one layer at a time, iterating backward from the last layer so that intermediate terms of the chain rule are not recomputed.

https://en.wikipedia.org/wiki/Backpropagation

## Initial parameters

In [50]:
import numpy as np

inputs = np.array([1, 0])
y_expected = 0

weights_hidden = np.array([
    [0.2, 0.4, 0.7, 0.5],
    [0.3, 0.5, 0.6, 0.9]
])
weights_output = np.array(
    [0.2, 0.4, 0.6, 0.8]
)

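bias_hidden = 1  # one scalar bias shared by all hidden neurons
bias_output = 1  # bias of the single output neuron

n_learning = 0.8  # learning rate (η in the update formulas)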

## FORWARD FEED

### Hidden layer pre-activation

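Weighted sum of inputs for each hidden neuron $i$, where $k$ runs over the $N$ inputs and $w_{hidden_{ki}}$ is the weight from input $k$ to hidden neuron $i$:

$$
Z_{hidden_i} = \sum_{k=1}^{N} x_k w_{hidden_{ki}} + b_{hidden} \tag{1}
$$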

In [51]:
Z_hidden = np.dot(inputs, weights_hidden) + bias_hidden
print(Z_hidden)

[1.2 1.4 1.7 1.5]
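As a cross-check on the vectorized call, the same weighted sum can be written with explicit loops. This is only an illustrative sketch: the names `Z_loop` and `Z_vectorized` are not part of the notebook, and it assumes that row `k` of `weights_hidden` holds the weights leaving input `k` and column `i` the weights entering hidden neuron `i`.

```python
import numpy as np

inputs = np.array([1, 0])
weights_hidden = np.array([
    [0.2, 0.4, 0.7, 0.5],
    [0.3, 0.5, 0.6, 0.9],
])
bias_hidden = 1

n_inputs, n_hidden = weights_hidden.shape

# Z_loop is a hypothetical name used only in this sketch.
Z_loop = np.zeros(n_hidden)
for i in range(n_hidden):        # hidden neuron index
    for k in range(n_inputs):    # input index
        Z_loop[i] += inputs[k] * weights_hidden[k, i]
    Z_loop[i] += bias_hidden     # the same scalar bias is added for every neuron

# The vectorized form used in the notebook should give the same values.
Z_vectorized = np.dot(inputs, weights_hidden) + bias_hidden
print(np.allclose(Z_loop, Z_vectorized))
```

Because the second input is 0, only the first row of `weights_hidden` contributes to the result for this example.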


### Hidden layer activation (sigmoid)

$$
y_{\text{hidden}} = \frac{1}{1 + e^{-Z_{\text{hidden}}}} \tag{2}
$$

In [52]:
sigmoid = lambda x: 1 / (1 + np.exp(-x))

y_hidden = sigmoid(Z_hidden)
print(y_hidden)

[0.76852478 0.80218389 0.84553473 0.81757448]


### Output layer pre-activation

$$
Z_{output} = \sum_{i=1}^{N} y_{hidden_i} w_{output_i} + b_{output} \tag{3}
$$

In [53]:
Z_output = np.sum(y_hidden * weights_output) + bias_output  # one output neuron
print(Z_output)

2.6359589340280305


### Output layer activation

$$
y_{\text{output}} = \frac{1}{1 + e^{-Z_{\text{output}}}} \tag{4}
$$

In [54]:
y_output = sigmoid(Z_output)
print(y_output)

0.9331402852352827


### Loss function calculation

Squared error:

$$
E = \frac{1}{2}(y_{\text{expected}} - y_{\text{output}})^2 \tag{5}
$$

In [55]:
E = pow(y_expected - y_output, 2) / 2
print(E)

0.4353753959644924
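The forward steps (1) to (5) can also be collected into two small functions, which makes it easier to re-evaluate the loss after parameters change. The sketch below is one possible arrangement, not the notebook's own structure; the function names `forward` and `squared_error` are assumptions introduced here.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(inputs, weights_hidden, bias_hidden, weights_output, bias_output):
    """Apply equations (1)-(4); return hidden and output activations."""
    y_hidden = sigmoid(np.dot(inputs, weights_hidden) + bias_hidden)
    y_output = sigmoid(np.dot(y_hidden, weights_output) + bias_output)
    return y_hidden, y_output

def squared_error(y_expected, y_output):
    """Equation (5)."""
    return 0.5 * (y_expected - y_output) ** 2

# Same initial parameters as in the notebook.
inputs = np.array([1, 0])
y_expected = 0
weights_hidden = np.array([
    [0.2, 0.4, 0.7, 0.5],
    [0.3, 0.5, 0.6, 0.9],
])
weights_output = np.array([0.2, 0.4, 0.6, 0.8])
bias_hidden = 1
bias_output = 1

y_hidden, y_output = forward(inputs, weights_hidden, bias_hidden,
                             weights_output, bias_output)
E = squared_error(y_expected, y_output)
print(y_output, E)
```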


## OUTPUT LAYER ADJUSTMENT

### Gradient of the loss function with respect to the hidden-output weights

$$
\frac{\partial E}{\partial w_{output_i}} = \frac{\partial E}{\partial y_{\text{output}}} \frac{\partial y_{\text{output}}}{\partial z_{\text{output}}} \frac{\partial z_{\text{output}}}{\partial w_{output_i}} \tag {6}
$$

<br>

$$\frac{\partial E}{\partial y_{\text{output}}} = -(y_{\text{expected}} - y_{\text{output}}) \tag{7}$$
$$\frac{\partial y_{\text{output}}}{\partial z_{\text{output}}} = y_{\text{output}}(1 - y_{\text{output}}) \tag{8}$$
$$\frac{\partial z_{\text{output}}}{\partial w_{output_i}} = y_{\text{hidden}_i} \tag{9}$$

In [56]:
dE_to_dY_output = -(y_expected - y_output)
dY_output_to_dZ_output = y_output * (1 - y_output)
dZ_output_to_dW_output = y_hidden

weights_output_gradient = dE_to_dY_output * dY_output_to_dZ_output * dZ_output_to_dW_output
print(weights_output_gradient)

[0.04474209 0.04670166 0.04922547 0.04759767]
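One way to gain confidence in an analytic gradient is to compare it with a numerical estimate. The sketch below perturbs each hidden-output weight in turn and uses central finite differences on the loss (5). The helper `loss_for` and the step size `eps` are assumptions made for this check and are not part of the notebook.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

inputs = np.array([1, 0])
y_expected = 0
weights_hidden = np.array([
    [0.2, 0.4, 0.7, 0.5],
    [0.3, 0.5, 0.6, 0.9],
])
weights_output = np.array([0.2, 0.4, 0.6, 0.8])
bias_hidden = 1
bias_output = 1

def loss_for(w_out):
    """Loss (5) as a function of the hidden-output weights only (hypothetical helper)."""
    y_hidden = sigmoid(np.dot(inputs, weights_hidden) + bias_hidden)
    y_output = sigmoid(np.dot(y_hidden, w_out) + bias_output)
    return 0.5 * (y_expected - y_output) ** 2

# Analytic gradient from equations (6)-(9).
y_hidden = sigmoid(np.dot(inputs, weights_hidden) + bias_hidden)
y_output = sigmoid(np.dot(y_hidden, weights_output) + bias_output)
analytic = -(y_expected - y_output) * y_output * (1 - y_output) * y_hidden

# Central finite differences; eps is an assumed step size.
eps = 1e-6
numeric = np.zeros_like(weights_output)
for i in range(weights_output.size):
    step = np.zeros_like(weights_output)
    step[i] = eps
    numeric[i] = (loss_for(weights_output + step)
                  - loss_for(weights_output - step)) / (2 * eps)

print(analytic)
print(np.allclose(analytic, numeric, atol=1e-8))
```

If the two gradients disagreed by much more than the tolerance, that would point to an error in one of the factors (7) to (9).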


### Gradient of the loss function with respect to the output bias

$$
\frac{\partial E}{\partial b_{output}} = \frac{\partial E}{\partial y_{\text{output}}} \frac{\partial y_{\text{output}}}{\partial z_{\text{output}}} \frac{\partial z_{\text{output}}}{\partial b_{output}} \tag{10}
$$

The first two derivatives of the previous equation are defined as **(7)** and **(8)**.

Because $z_{\text{output}}$ depends linearly on $b_{output}$ with coefficient 1 (see **(3)**):

$$\frac{\partial z_{\text{output}}}{\partial b_{output}} = 1 \tag{11}$$

In [57]:
dZ_output_to_dB_output = 1  # equation (11): the bias enters Z_output with coefficient 1
# Multiplying by 1 does not change the result; the factor is kept only to mirror equation (10)
bias_output_gradient = dE_to_dY_output * dY_output_to_dZ_output * dZ_output_to_dB_output
print(bias_output_gradient)

0.05821814957952363


### Hidden-output weights adjustment

$$\widehat{w}_i = w_i - \eta \frac{\partial E}{\partial w_i} \tag{12}$$

In [58]:
weights_output_adjusted = weights_output - n_learning * weights_output_gradient
print(weights_output_adjusted)

[0.16420633 0.36263867 0.56061963 0.76192186]


### Output bias adjustment

$$\widehat{b} = b - \eta \frac{\partial E}{\partial b} \tag{13}$$

In [59]:
bias_output_adjusted = bias_output - n_learning * bias_output_gradient
print(bias_output_adjusted)

0.9534254803363811


## HIDDEN LAYER ADJUSTMENT

### Gradient of the loss function with respect to the input-hidden weights

$$\frac{\partial E}{\partial w_{hidden_{ki}}} = \frac{\partial E}{\partial y_{hidden_i}} \frac{\partial y_{hidden_i}}{\partial z_{hidden_i}} \frac{\partial z_{hidden_i}}{\partial w_{hidden_{ki}}} \tag{14}$$

Here $w_{hidden_{ki}}$ is the weight from input $k$ to hidden neuron $i$.

<br>

$$\frac{\partial E}{\partial y_{hidden_i}} = \frac{\partial E}{\partial y_{\text{output}}} \cdot \frac{\partial y_{\text{output}}}{\partial z_{\text{output}}} \cdot \frac{\partial z_{\text{output}}}{\partial y_{hidden_i}} \tag{15}$$

$$\frac{\partial z_{\text{output}}}{\partial y_{\text{hidden}_i}} = w_{output_i} \tag{16}$$

The first two derivatives in **(15)** are defined as **(7)** and **(8)**.

$$\frac{\partial y_{hidden_i}}{\partial z_{hidden_i}} = y_{hidden_i} (1 - y_{hidden_i}) \tag{17}$$

$$\frac{\partial z_{hidden_i}}{\partial w_{hidden_{ki}}} = x_k \tag{18}$$

In [60]:
# Equation (16): Z_output depends on y_hidden through the hidden-output weights,
# so this derivative is weights_output (not the input-hidden weights).
dZ_output_to_dY_hidden = weights_output  # shape (4,)

dE_to_dY_hidden = dE_to_dY_output * dY_output_to_dZ_output * dZ_output_to_dY_hidden  # shape (4,)
dY_hidden_to_dZ_hidden = y_hidden * (1 - y_hidden)  # shape (4,)
dZ_hidden_to_dW_hidden = inputs  # shape (2,)

# dE/dW_hidden = dE/dY_hidden * dY_hidden/dZ_hidden * dZ_hidden/dW_hidden
# The inputs have shape (2,) and the other factors have shape (4,), so every input
# must be multiplied by every per-neuron term.
# np.outer(A, B) returns the matrix whose entry [k, i] is A[k] * B[i]; here its shape is (2, 4).
weights_hidden_gradient = np.outer(dZ_hidden_to_dW_hidden, dE_to_dY_hidden * dY_hidden_to_dZ_hidden)
print(weights_hidden_gradient)

[[0.00207134 0.00369534 0.00456217 0.00694642]
 [0.         0.         0.         0.        ]]
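The hidden-layer gradient can be checked numerically in the same spirit as any other gradient: nudge one input-hidden weight at a time and watch how the loss (5) changes. The sketch below is a sanity check under stated assumptions; `loss_for`, `delta_output`, `delta_hidden`, and `eps` are names introduced here and are not part of the notebook.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

inputs = np.array([1, 0])
y_expected = 0
weights_hidden = np.array([
    [0.2, 0.4, 0.7, 0.5],
    [0.3, 0.5, 0.6, 0.9],
])
weights_output = np.array([0.2, 0.4, 0.6, 0.8])
bias_hidden = 1
bias_output = 1

def loss_for(w_hidden):
    """Loss (5) as a function of the input-hidden weights only (hypothetical helper)."""
    y_hidden = sigmoid(np.dot(inputs, w_hidden) + bias_hidden)
    y_output = sigmoid(np.dot(y_hidden, weights_output) + bias_output)
    return 0.5 * (y_expected - y_output) ** 2

# Analytic gradient from equations (14)-(18).
y_hidden = sigmoid(np.dot(inputs, weights_hidden) + bias_hidden)
y_output = sigmoid(np.dot(y_hidden, weights_output) + bias_output)
delta_output = -(y_expected - y_output) * y_output * (1 - y_output)       # (7) * (8)
delta_hidden = delta_output * weights_output * y_hidden * (1 - y_hidden)  # (15) * (17)
analytic = np.outer(inputs, delta_hidden)                                 # entry [k, i] = x_k * delta_i

# Central finite differences over every entry; eps is an assumed step size.
eps = 1e-6
numeric = np.zeros_like(weights_hidden)
for k in range(weights_hidden.shape[0]):
    for i in range(weights_hidden.shape[1]):
        step = np.zeros_like(weights_hidden)
        step[k, i] = eps
        numeric[k, i] = (loss_for(weights_hidden + step)
                         - loss_for(weights_hidden - step)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-8))
```

The second row of the gradient is zero because the second input is 0: by (18), a weight attached to a zero input cannot change the loss for this example.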


### Gradient of the loss function with respect to the hidden bias

$$\frac{\partial E}{\partial b_{hidden}} = \frac{\partial E}{\partial y_{hidden_i}} \frac{\partial y_{hidden_i}}{\partial z_{hidden_i}} \frac{\partial z_{hidden_i}}{\partial b_{hidden}} \tag{19}$$

Because $z_{hidden_i}$ depends linearly on $b_{hidden}$ with coefficient 1 (see **(1)**):

$$\frac{\partial z_{hidden_i}}{\partial b_{hidden}} = 1 \tag{20}$$

In [61]:
dZ_hidden_to_dB_hidden = 1  # equation (20): the bias enters each Z_hidden with coefficient 1

# Multiplying by 1 does not change the result; the factor is kept only to mirror equation (19)
bias_hidden_gradient = dE_to_dY_hidden * dY_hidden_to_dZ_hidden * dZ_hidden_to_dB_hidden
print(bias_hidden_gradient)

[0.00207134 0.00369534 0.00456217 0.00694642]
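In this notebook `bias_hidden` is a single scalar added to every hidden neuron, so changing it moves all four pre-activations at once. Under that assumption, the derivative of the loss with respect to the scalar is the sum of the four per-neuron values printed above. The sketch below checks this with a central finite difference; `loss_for`, `per_neuron`, and `eps` are names introduced only for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

inputs = np.array([1, 0])
y_expected = 0
weights_hidden = np.array([
    [0.2, 0.4, 0.7, 0.5],
    [0.3, 0.5, 0.6, 0.9],
])
weights_output = np.array([0.2, 0.4, 0.6, 0.8])
bias_hidden = 1
bias_output = 1

def loss_for(b_hidden):
    """Loss (5) as a function of the shared hidden bias only (hypothetical helper)."""
    y_hidden = sigmoid(np.dot(inputs, weights_hidden) + b_hidden)
    y_output = sigmoid(np.dot(y_hidden, weights_output) + bias_output)
    return 0.5 * (y_expected - y_output) ** 2

# Per-neuron terms from equations (15), (17), and (20).
y_hidden = sigmoid(np.dot(inputs, weights_hidden) + bias_hidden)
y_output = sigmoid(np.dot(y_hidden, weights_output) + bias_output)
delta_output = -(y_expected - y_output) * y_output * (1 - y_output)
per_neuron = delta_output * weights_output * y_hidden * (1 - y_hidden)

# Gradient with respect to the shared scalar: sum over hidden neurons.
analytic_sum = per_neuron.sum()

# Central finite difference on the scalar bias; eps is an assumed step size.
eps = 1e-6
numeric = (loss_for(bias_hidden + eps) - loss_for(bias_hidden - eps)) / (2 * eps)

print(analytic_sum, numeric)
```

If each hidden neuron had its own bias instead, the four per-neuron values would be used directly, one per bias, with no summation.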


### Input-hidden weights adjustment

$$\widehat{w}_i = w_i - \eta \frac{\partial E}{\partial w_i}$$

The same formula as **(12)**.

In [62]:
weights_hidden_adjusted = weights_hidden - n_learning * weights_hidden_gradient
print(weights_hidden_adjusted)

[[0.19834293 0.39704373 0.69635026 0.49444286]
 [0.3        0.5        0.6        0.9       ]]


### Hidden bias adjustment

$$\widehat{b} = b - \eta \frac{\partial E}{\partial b}$$

The same formula as **(13)**.

The hidden bias gradient computed above has one term per hidden neuron (4 values in this case). Because `bias_hidden` is a single scalar shared by all hidden neurons, its gradient is the sum of these per-neuron terms.

In [63]:
bias_hidden_cumulative_gradient = bias_hidden_gradient.sum()  # sum over neurons: the scalar bias is shared by all hidden neurons
bias_hidden_adjusted = bias_hidden - n_learning * bias_hidden_cumulative_gradient
print(bias_hidden_adjusted)

0.9861797817737953
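The individual steps can be combined into a short training loop that repeats the forward pass, the backward pass, and the updates on the same example. The following is a minimal sketch rather than a full training procedure: `n_steps`, `losses`, `delta_output`, and `delta_hidden` are names assumed here, and the step count is arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

inputs = np.array([1, 0])
y_expected = 0
weights_hidden = np.array([
    [0.2, 0.4, 0.7, 0.5],
    [0.3, 0.5, 0.6, 0.9],
])
weights_output = np.array([0.2, 0.4, 0.6, 0.8])
bias_hidden = 1.0
bias_output = 1.0
n_learning = 0.8
n_steps = 20  # assumed value, chosen only for illustration

losses = []
for _ in range(n_steps):
    # Forward pass, equations (1)-(5).
    y_hidden = sigmoid(np.dot(inputs, weights_hidden) + bias_hidden)
    y_output = sigmoid(np.dot(y_hidden, weights_output) + bias_output)
    losses.append(0.5 * (y_expected - y_output) ** 2)

    # Backward pass: delta_output = (7) * (8), delta_hidden = (15) * (17).
    # delta_hidden uses weights_output from before this step's update.
    delta_output = -(y_expected - y_output) * y_output * (1 - y_output)
    delta_hidden = delta_output * weights_output * y_hidden * (1 - y_hidden)

    # Updates (12) and (13) for both layers.
    weights_output = weights_output - n_learning * delta_output * y_hidden
    bias_output = bias_output - n_learning * delta_output
    weights_hidden = weights_hidden - n_learning * np.outer(inputs, delta_hidden)
    bias_hidden = bias_hidden - n_learning * delta_hidden.sum()

print(losses[0], losses[-1])
```

Computing both deltas before any parameter is changed matters: the hidden-layer gradient must be evaluated with the same `weights_output` that produced the forward pass, as in the cells above.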
