#### Implementing a Backward Pass

In [1]:
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

def forward_pass(x, weights_input_to_hidden, weights_hidden_to_output):
    """
    Make a forward pass through the network
    """
    # Calculate the input to the hidden layer.
    hidden_layer_in = np.dot(x, weights_input_to_hidden)
    # Calculate the hidden layer output.
    hidden_layer_out = sigmoid(hidden_layer_in)

    # Calculate the input to the output layer.
    output_layer_in = np.dot(hidden_layer_out, weights_hidden_to_output)
    # Calculate the output of the network.
    output_layer_out = sigmoid(output_layer_in)

    return hidden_layer_out, output_layer_out

def backward_pass(x, target, learnrate, hidden_layer_out,
                  output_layer_out, weights_hidden_to_output):
    """
    Make a backward pass through the network
    """
    # Calculate output error
    # Error is the difference between target and actual output
    error = target - output_layer_out
    
    # Calculate error term for output layer
    # For sigmoid activation, error term = error * output * (1 - output)
    output_error_term = error * output_layer_out * (1 - output_layer_out)

    # Calculate error term for hidden layer
    # Error for hidden layer is output_error_term * weights_hidden_to_output
    # For sigmoid activation, hidden_error_term = hidden_error * hidden_output * (1 - hidden_output)
    hidden_error = np.dot(output_error_term, weights_hidden_to_output)
    hidden_error_term = hidden_error * hidden_layer_out * (1 - hidden_layer_out)
    
    # Calculate change in weights for hidden layer to output layer
    # delta_w_h_o = learnrate * output_error_term * hidden_layer_out
    delta_w_h_o = learnrate * output_error_term * hidden_layer_out
    
    # Calculate change in weights for input layer to hidden layer
    # Entry (i, j) is learnrate * x[i] * hidden_error_term[j];
    # np.outer builds the full (n_inputs, n_hidden) matrix in one call
    delta_w_i_h = learnrate * np.outer(x, hidden_error_term)
    
    return delta_w_h_o, delta_w_i_h

# Create data to run through the network
x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5
weights_input_to_hidden = np.array([
    [0.5, -0.6],
    [0.1, -0.2],
    [0.1, 0.7]
])
weights_hidden_to_output = np.array([0.1, -0.3])

# Forward pass
hidden_layer_out, output_layer_out = forward_pass(
    x, weights_input_to_hidden, weights_hidden_to_output
)

# Backward pass
delta_w_h_o, delta_w_i_h = backward_pass(
    x, target, learnrate, hidden_layer_out, output_layer_out,
    weights_hidden_to_output
)

print('Change in weights for hidden layer to output layer:')
print(delta_w_h_o)
print('Change in weights for input layer to hidden layer:')
print(delta_w_i_h)

Change in weights for hidden layer to output layer:
[0.00804047 0.00555918]
Change in weights for input layer to hidden layer:
[[ 1.77005547e-04 -5.11178506e-04]
 [ 3.54011093e-05 -1.02235701e-04]
 [-7.08022187e-05  2.04471402e-04]]


##### Neural Network Backpropagation: A Detailed Explanation

The backward pass in neural networks, also known as backpropagation, is the heart of how neural networks learn. Let me walk through this implementation step by step, explaining both the mathematical principles and their practical application in the code.

##### Overview of Backpropagation

Backpropagation is an algorithm that calculates gradients of the loss function with respect to the network's weights. These gradients guide weight updates to minimize prediction errors. The algorithm works by propagating the error backward through the network, using the chain rule from calculus to determine how each weight contributes to the error.

##### Step 1: Calculating the Output Error

```python
error = target - output_layer_out
```

This first step computes the raw error by subtracting the network's prediction from the target value. This simple difference measures how far off our prediction is from the desired output. In more complex networks with multiple outputs, this would be a vector of errors.

##### Step 2: Computing the Output Layer Error Term

```python
output_error_term = error * output_layer_out * (1 - output_layer_out)
```

This step calculates the "error term" for the output layer, which incorporates both the raw error and the derivative of the activation function. For the sigmoid function, the derivative is: σ(x) * (1 - σ(x)).

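The expression multiplies two factors (written as three terms in the code):
- `error`: How far off our prediction is
- `output_layer_out * (1 - output_layer_out)`: The derivative of the sigmoid activation, expressed through the node's own output

Assuming a squared-error loss, this error term is the negative gradient of the error with respect to the output node's weighted input. It says how strongly, and in which direction, that input should change to reduce the error, given how steep the sigmoid is at the current output.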
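The sigmoid derivative identity used in this step can be checked numerically. The following is a minimal sketch, not part of the original code: it compares `σ(x) * (1 - σ(x))` against a central finite difference at a few arbitrarily chosen points. The sample points and the step size `eps` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Points at which to compare the two derivative estimates (arbitrary choice).
xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

# Analytic derivative written in terms of the sigmoid's own output.
analytic = sigmoid(xs) * (1 - sigmoid(xs))

# Central finite difference as an independent numerical estimate.
eps = 1e-6
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)

# The two estimates should agree up to finite-difference noise.
max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

The derivative peaks at 0.25 when the input is 0, so the error term is always a damped version of the raw error.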

##### Step 3: Calculating the Hidden Layer Error

```python
hidden_error = np.dot(output_error_term, weights_hidden_to_output)
```

This step propagates the error backward to the hidden layer. We multiply the output error term by the weights connecting the hidden layer to the output layer. This determines how much each hidden node contributed to the output error, weighted by the connection strengths between layers.

Because `output_error_term` is a scalar here (the network has a single output) and `weights_hidden_to_output` is a vector, `np.dot` reduces to multiplying each weight by that scalar. Each hidden node therefore receives a share of the error proportional to its connection weight to the output.
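To make the scalar case concrete, here is a small sketch showing that `np.dot` with a scalar first argument is plain scalar multiplication. It also shows one way the same step might look with two output units. The numeric values, the two-output weight matrix `weights_2out`, and its `(n_hidden, n_output)` layout are assumptions for illustration and do not come from the original network.

```python
import numpy as np

# Illustrative values only; not taken from the run above.
output_error_term = 0.03                      # scalar: one output unit
weights_hidden_to_output = np.array([0.1, -0.3])

# With a scalar argument, np.dot is ordinary scalar multiplication.
hidden_error = np.dot(output_error_term, weights_hidden_to_output)
same_thing = output_error_term * weights_hidden_to_output
print(np.allclose(hidden_error, same_thing))  # True

# Hypothetical two-output variant: weights have shape (n_hidden, n_output),
# and each hidden unit sums the error terms of the outputs it feeds.
output_error_terms = np.array([0.03, -0.01])
weights_2out = np.array([[0.1, 0.4],
                         [-0.3, 0.2]])
hidden_error_2out = np.dot(weights_2out, output_error_terms)
print(hidden_error_2out.shape)  # (2,)
```

In the multi-output case the dot product does real work: it sums the contributions of every output unit a hidden node connects to.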

##### Step 4: Computing the Hidden Layer Error Term

```python
hidden_error_term = hidden_error * hidden_layer_out * (1 - hidden_layer_out)
```

Similar to Step 2, we calculate the error term for the hidden layer by multiplying:
- The propagated error from the output layer (`hidden_error`)
- The derivative of the activation function for the hidden layer outputs

This error term represents how much each hidden node's activation should change to reduce the overall network error.

##### Step 5: Calculating Weight Updates for Hidden-to-Output Connections

```python
delta_w_h_o = learnrate * output_error_term * hidden_layer_out
```

Now we calculate how much to adjust the weights between the hidden and output layers. Each weight adjustment is:
- Proportional to the learning rate (`learnrate`)
- Proportional to the output error term (`output_error_term`)
- Proportional to the activation of the hidden node (`hidden_layer_out`)

This encapsulates the principle that weights should be adjusted more when:
- The error is large
- The learning rate is high
- The input to that connection (the hidden node activation) is strong
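The three proportionalities can be seen directly with toy numbers. This is only a sketch: the helper name `hidden_to_output_update` and all values are made up for illustration, and a real sigmoid hidden unit would never output exactly 0.

```python
import numpy as np

def hidden_to_output_update(learnrate, output_error_term, hidden_layer_out):
    # Hypothetical helper wrapping the Step 5 formula.
    return learnrate * output_error_term * hidden_layer_out

# Toy activations: the second hidden unit is treated as fully inactive.
hidden_layer_out = np.array([0.9, 0.0])

base = hidden_to_output_update(0.5, 0.02, hidden_layer_out)
doubled_rate = hidden_to_output_update(1.0, 0.02, hidden_layer_out)
doubled_error = hidden_to_output_update(0.5, 0.04, hidden_layer_out)

print(base)                                  # update for each hidden-to-output weight
print(np.allclose(doubled_rate, 2 * base))   # True: scales with the learning rate
print(np.allclose(doubled_error, 2 * base))  # True: scales with the error term
```

The weight fed by the inactive unit gets no update at all, since that unit did not contribute to the output.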

##### Step 6: Calculating Weight Updates for Input-to-Hidden Connections

```python
delta_w_i_h = learnrate * np.outer(x, hidden_error_term)
```

Finally, we calculate the weight adjustments for connections between the input and hidden layers. The outer product creates a matrix where each element (i,j) represents how much to adjust the weight connecting input node i to hidden node j.

The adjustment for each weight is:
- Proportional to the learning rate
- Proportional to the corresponding hidden node's error term
- Proportional to the activation of the input node
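A short sketch of what `np.outer` produces may help here. The input vector is the one from the example, but the `hidden_error_term` values are made up for readability and are not the ones the network computes.

```python
import numpy as np

x = np.array([0.5, 0.1, -0.2])                 # input vector from the example
hidden_error_term = np.array([0.001, -0.002])  # made-up values for illustration

outer = np.outer(x, hidden_error_term)
print(outer.shape)  # (3, 2): one row per input, one column per hidden node

# The same matrix via broadcasting: column vector times row vector.
broadcast = x[:, None] * hidden_error_term[None, :]
print(np.allclose(outer, broadcast))  # True

# Entry (i, j) is x[i] * hidden_error_term[j].
i, j = 2, 1
print(np.isclose(outer[i, j], x[i] * hidden_error_term[j]))  # True
```

The result has the same shape as `weights_input_to_hidden`, which is what allows the update to be added to the weights elementwise.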

##### Mathematical Foundation

The entire backpropagation process is derived from the chain rule in calculus. For a weight w, we want to compute ∂E/∂w (how the error changes with respect to the weight). By applying the chain rule, we can decompose this into:

∂E/∂w = ∂E/∂o × ∂o/∂net × ∂net/∂w

Where:
- E is the error
- o is the output of a neuron
- net is the weighted sum input to a neuron
- w is the weight

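The algorithm calculates these partial derivatives layer by layer, moving backward through the network. Because the code defines the error as `target - output_layer_out`, its error terms are negative gradients (assuming a squared-error loss E = ½(target − o)²), which is why the weight changes are added to the weights rather than subtracted.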
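One way to gain confidence in the chain-rule derivation is a numerical gradient check. The sketch below assumes the loss is the half squared error `0.5 * (target - output) ** 2`, which the original code does not state explicitly. The `loss` function, the shortened weight names, and the step size `eps` are introduced here for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(x, target, w_ih, w_ho):
    # Assumed loss: half squared error of the single network output.
    hidden = sigmoid(np.dot(x, w_ih))
    output = sigmoid(np.dot(hidden, w_ho))
    return 0.5 * (target - output) ** 2

x = np.array([0.5, 0.1, -0.2])
target = 0.6
w_ih = np.array([[0.5, -0.6], [0.1, -0.2], [0.1, 0.7]])
w_ho = np.array([0.1, -0.3])

# Analytic negative gradients from the backward-pass formulas (no learning rate).
hidden = sigmoid(np.dot(x, w_ih))
output = sigmoid(np.dot(hidden, w_ho))
output_error_term = (target - output) * output * (1 - output)
hidden_error_term = output_error_term * w_ho * hidden * (1 - hidden)
neg_grad_ho = output_error_term * hidden
neg_grad_ih = np.outer(x, hidden_error_term)

# Numerical gradients by central differences, one weight at a time.
eps = 1e-6
num_grad_ho = np.zeros_like(w_ho)
for j in range(w_ho.size):
    step = np.zeros_like(w_ho)
    step[j] = eps
    num_grad_ho[j] = (loss(x, target, w_ih, w_ho + step)
                      - loss(x, target, w_ih, w_ho - step)) / (2 * eps)

num_grad_ih = np.zeros_like(w_ih)
for i in range(w_ih.shape[0]):
    for j in range(w_ih.shape[1]):
        step = np.zeros_like(w_ih)
        step[i, j] = eps
        num_grad_ih[i, j] = (loss(x, target, w_ih + step, w_ho)
                             - loss(x, target, w_ih - step, w_ho)) / (2 * eps)

# The backward-pass quantities should equal minus the numerical gradient.
print(np.allclose(neg_grad_ho, -num_grad_ho, atol=1e-8))
print(np.allclose(neg_grad_ih, -num_grad_ih, atol=1e-8))
```

If both checks print `True`, the analytic formulas agree with the numerical derivative of the assumed loss.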

##### Practical Significance

This implementation demonstrates several important principles:
1. **Local Computation**: Each node only needs information about its direct connections and error term
2. **Weight-Error Relationship**: Weights connecting to nodes with larger errors receive larger updates
3. **Activity-Dependent Learning**: Connections between more active nodes receive larger updates
4. **Supervised Learning**: The entire process depends on having a target value to calculate the initial error

The completed backward pass function returns the calculated weight adjustments (`delta_w_h_o` and `delta_w_i_h`), which would then be applied to update the network's weights:

```python
weights_hidden_to_output += delta_w_h_o
weights_input_to_hidden += delta_w_i_h
```

This process would be repeated over many examples, gradually improving the network's performance by minimizing prediction errors.

##### Understanding Weight Update Terms in Neural Network Backpropagation

In the neural network code you shared, `delta_w_h_o` and `delta_w_i_h` represent the calculated adjustments that should be made to the network's weights during learning. Let me explain what each of these terms means and how they function in the training process.

##### What is delta_w_h_o?

`delta_w_h_o` stands for "delta weights hidden to output." This variable represents the changes that should be applied to the weights connecting the hidden layer to the output layer.

In the code, it's calculated as:
```python
delta_w_h_o = learnrate * output_error_term * hidden_layer_out
```

Breaking this down:
- `learnrate` (0.5 in your example) controls how big the weight updates should be. A larger learning rate means larger steps in the weight space.
- `output_error_term` represents how much the output was wrong, scaled by the derivative of the activation function. This tells us which direction to move in the weight space.
- `hidden_layer_out` represents the activations of the hidden layer neurons. This ensures that connections from more active neurons receive proportionally larger updates.

The shape of `delta_w_h_o` matches the shape of `weights_hidden_to_output`. In your example, `hidden_layer_out` is a vector with 2 elements (because you have 2 hidden neurons), so `delta_w_h_o` will also be a vector with 2 elements.

Each element tells you how much to adjust the corresponding weight from a hidden neuron to the output neuron.

##### What is delta_w_i_h?

`delta_w_i_h` stands for "delta weights input to hidden." This variable represents the changes that should be applied to the weights connecting the input layer to the hidden layer.

In the code, it's calculated as:
```python
delta_w_i_h = learnrate * np.outer(x, hidden_error_term)
```

This calculation is more complex because it involves connections between multiple input neurons and multiple hidden neurons. Breaking it down:
- `learnrate` again controls the size of the updates.
- `x` is the input vector (with values [0.5, 0.1, -0.2] in your example).
- `hidden_error_term` represents how much each hidden neuron contributed to the output error, scaled by the derivative of its activation function.
- `np.outer()` creates a matrix where each element (i,j) is the product of the ith element of the first array and the jth element of the second array.

The shape of `delta_w_i_h` matches the shape of `weights_input_to_hidden`. In your example, this will be a 3×2 matrix because you have 3 input features and 2 hidden neurons.

Each element (i,j) in this matrix tells you how much to adjust the weight connecting input neuron i to hidden neuron j.

##### How These Deltas Are Used

After calculating these delta terms, the actual weight update would typically happen with:
```python
weights_hidden_to_output += delta_w_h_o
weights_input_to_hidden += delta_w_i_h
```

This is the core of how neural networks learn. By iteratively:
1. Making predictions (forward pass)
2. Calculating errors
3. Computing weight adjustments (backward pass)
4. Updating weights

The network gradually improves its predictions by adjusting its internal parameters.
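The four steps can be sketched as a small loop. This is an illustration under simplifying assumptions, not the original training code: the helper `train_step` is hypothetical, the loop reuses the single example from above instead of a dataset, and the choice of 1000 iterations is arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train_step(x, target, learnrate, w_ih, w_ho):
    """One forward pass, one backward pass, one in-place weight update."""
    # 1. Forward pass
    hidden_out = sigmoid(np.dot(x, w_ih))
    output = sigmoid(np.dot(hidden_out, w_ho))
    # 2. Error
    error = target - output
    # 3. Backward pass (same formulas as backward_pass above)
    output_error_term = error * output * (1 - output)
    hidden_error_term = output_error_term * w_ho * hidden_out * (1 - hidden_out)
    delta_w_h_o = learnrate * output_error_term * hidden_out
    delta_w_i_h = learnrate * np.outer(x, hidden_error_term)
    # 4. Update the weight arrays in place
    w_ho += delta_w_h_o
    w_ih += delta_w_i_h
    return error

x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5
w_ih = np.array([[0.5, -0.6], [0.1, -0.2], [0.1, 0.7]])
w_ho = np.array([0.1, -0.3])

# Repeat the step on the same example and record the error before each update.
errors = [train_step(x, target, learnrate, w_ih, w_ho) for _ in range(1000)]
print(abs(errors[0]), abs(errors[-1]))  # the error should shrink over the run
```

Fitting a single example this way only shows the mechanics; real training would cycle through many examples and monitor error on held-out data.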

##### Concrete Example

Suppose, purely for illustration, that a backward pass returned these round numbers (they are not the values the code above produces):
- `delta_w_h_o = [0.01, -0.02]`
- `delta_w_i_h = [[0.005, -0.008], [0.001, -0.0016], [-0.002, 0.0032]]`

This would mean:
- The weight connecting the first hidden neuron to the output should increase by 0.01
- The weight connecting the second hidden neuron to the output should decrease by 0.02
- The weights connecting inputs to hidden neurons should change according to the `delta_w_i_h` matrix

These small adjustments, repeated over many training examples, enable the network to gradually learn patterns in the data and make increasingly accurate predictions.

##### Analysis of the Calculated Weight Adjustments

When you ran the backpropagation code, it produced these weight adjustment values:

For hidden layer to output layer:
```
delta_w_h_o = [0.00804047, 0.00555918]
```

For input layer to hidden layer:
```
delta_w_i_h = [[ 1.77005547e-04, -5.11178506e-04]
               [ 3.54011093e-05, -1.02235701e-04]
               [-7.08022187e-05,  2.04471402e-04]]
```

Let me explain what these specific values tell us about the learning process in your network:

##### Hidden to Output Layer Adjustments

The `delta_w_h_o` values [0.00804047, 0.00555918] indicate that:

1. The weight connecting the first hidden neuron to the output neuron should increase by approximately 0.008.
2. The weight connecting the second hidden neuron to the output neuron should increase by approximately 0.0056.

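Both adjustments are positive because the prediction is below the target (the error is positive) and both hidden activations are positive, so raising either weight raises the output. Note that the second weight is negative (-0.3), so its positive adjustment moves it toward zero rather than strengthening the connection. The first adjustment is larger only because the first hidden neuron is more active: both weights share the same output error term, so the two deltas are in the same ratio as the two hidden activations.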
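The relative size of the two adjustments can be traced back to the hidden activations. The sketch below recomputes the hidden layer output from the inputs and weights given earlier and compares its ratio with the ratio of the printed deltas. The tolerance `rtol=1e-5` is an assumption chosen to allow for the rounding in the printed values.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([0.5, 0.1, -0.2])
weights_input_to_hidden = np.array([[0.5, -0.6], [0.1, -0.2], [0.1, 0.7]])

# Hidden activations for this input, as in forward_pass.
hidden_layer_out = sigmoid(np.dot(x, weights_input_to_hidden))

# Deltas copied from the printed output above.
delta_w_h_o = np.array([0.00804047, 0.00555918])

delta_ratio = delta_w_h_o[0] / delta_w_h_o[1]
activation_ratio = hidden_layer_out[0] / hidden_layer_out[1]

# Both deltas share learnrate * output_error_term, so the ratios should match.
print(delta_ratio, activation_ratio)
print(np.isclose(delta_ratio, activation_ratio, rtol=1e-5))
```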

##### Input to Hidden Layer Adjustments

The `delta_w_i_h` matrix shows smaller adjustments, on the order of 10^-4 or 10^-5:

```
[[ 1.77005547e-04, -5.11178506e-04]
 [ 3.54011093e-05, -1.02235701e-04]
 [-7.08022187e-05,  2.04471402e-04]]
```

Reading this matrix row by row:

1. For the first input neuron (with value 0.5):
   - Its connection to the first hidden neuron should increase by 0.000177
   - Its connection to the second hidden neuron should decrease by 0.000511

2. For the second input neuron (with value 0.1):
   - Its connection to the first hidden neuron should increase by 0.0000354
   - Its connection to the second hidden neuron should decrease by 0.000102

3. For the third input neuron (with value -0.2):
   - Its connection to the first hidden neuron should decrease by 0.0000708
   - Its connection to the second hidden neuron should increase by 0.000204

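The sign pattern follows directly from the outer product: each entry's sign is the sign of the input value times the sign of the corresponding hidden error term. The first hidden neuron's error term is positive and the second's is negative (they inherit the signs of the hidden-to-output weights 0.1 and -0.3), and the third input is negative, which flips both signs in the last row.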
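The signs in the matrix can be predicted without running the full backward pass. This sketch assumes the error is positive, which follows from the positive hidden-to-output adjustments, and uses the fact that the sigmoid derivative is always positive. The variable `error_sign` is introduced here only for illustration.

```python
import numpy as np

x = np.array([0.5, 0.1, -0.2])
weights_hidden_to_output = np.array([0.1, -0.3])

# Assumption: target > prediction for this example, so the error is positive.
error_sign = 1.0

# Deltas copied from the printed output above.
delta_w_i_h = np.array([
    [ 1.77005547e-04, -5.11178506e-04],
    [ 3.54011093e-05, -1.02235701e-04],
    [-7.08022187e-05,  2.04471402e-04],
])

# sign(delta[i, j]) = sign(error) * sign(x[i]) * sign(w_ho[j]),
# because the learning rate and sigmoid derivatives are positive.
predicted_signs = error_sign * np.outer(np.sign(x),
                                        np.sign(weights_hidden_to_output))
print(predicted_signs)
print(np.array_equal(np.sign(delta_w_i_h), predicted_signs))  # True
```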

##### Interpretation of the Values

These relatively small weight adjustments are typical in neural network training. A few observations:

1. The hidden-to-output weight adjustments are larger than the input-to-hidden adjustments. Each input-to-hidden adjustment carries extra factors that are all below 1 in magnitude here: a hidden-to-output weight, a sigmoid derivative (at most 0.25), and an input value. This shrinkage compounds with depth, which is the origin of the "vanishing gradient problem" in deeper networks.

2. The signs of the adjustments indicate the direction needed to reduce error. A positive adjustment increases a weight and a negative one decreases it; whether that strengthens or weakens the connection depends on the weight's current sign.

3. The magnitudes show which weights the current example moves the most. The largest adjustment is 0.00804047, for the first hidden-to-output weight.

If you were to apply these adjustments to the original weights:

```python
# Original weights
weights_hidden_to_output = np.array([0.1, -0.3])
weights_input_to_hidden = np.array([
    [0.5, -0.6],
    [0.1, -0.2],
    [0.1, 0.7]
])

# Updated weights
weights_hidden_to_output += delta_w_h_o  # [0.10804047, -0.29444082]
weights_input_to_hidden += delta_w_i_h   # Small adjustments to each weight
```

After many iterations of this process across multiple training examples, these small adjustments would accumulate and help the network converge toward weights that minimize prediction errors.

##### Explanation of the Backward Pass Implementation

1. **Calculate output error**: I computed the error as the difference between the target value and the actual output from our network.

2. **Calculate error term for output layer**: For a sigmoid activation function, the error term is calculated as error * output * (1 - output). This formula comes from the derivative of the sigmoid function combined with the chain rule from calculus.

3. **Calculate error term for hidden layer**: This is a two-step process:
   - First, I propagated the error backward from the output layer to the hidden layer by multiplying the output error term with the weights connecting the hidden layer to the output layer.
   - Then, I applied the sigmoid derivative to get the hidden layer error term.

4. **Calculate weight changes for hidden-to-output layer**: The weight changes are computed as the learning rate multiplied by the output error term and the hidden layer outputs.

5. **Calculate weight changes for input-to-hidden layer**: Here, I used the outer product of the input values and the hidden error terms, scaled by the learning rate. This gives us the appropriate weight update matrix with the same dimensions as the original weight matrix.

This implementation follows the standard backpropagation algorithm, which adjusts weights based on the calculated error to minimize the difference between predicted and actual outputs.