Cross-entropy is a loss (cost) function that measures how well a set of predicted probabilities P matches the true labels Y.

For binary classification (labels 0 or 1) the formula is

L(Y,P) = - Σ [ Y · log P + (1-Y) · log (1-P) ]

where
· Y is the true label (1 = positive, 0 = negative)
· P is the model's predicted probability that the label is 1
· The sum Σ runs over all samples.

Key points
· If the prediction is perfect (P ≈ 1 when Y = 1, or P ≈ 0 when Y = 0) the loss is near 0.
· Confident wrong predictions (e.g. P ≈ 1 when Y = 0) incur a very large loss because the active term, log(1-P), tends to -∞ as P → 1.
· Cross-entropy is convex in the weights for logistic regression, so gradient descent has no spurious local minima to get stuck in.
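
The two extremes in the key points can be checked numerically. The sketch below is only an illustration, assuming NumPy; the function name `binary_cross_entropy` and the sample labels and probabilities are made up for this example and are not taken from the notebook.

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Summed binary cross-entropy; assumes every p lies strictly between 0 and 1."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = [1, 0, 1]           # illustrative labels
good = [0.99, 0.01, 0.99]    # confident and correct
bad = [0.01, 0.99, 0.01]     # confident and wrong

loss_good = binary_cross_entropy(y_true, good)
loss_bad = binary_cross_entropy(y_true, bad)
print(loss_good, loss_bad)   # roughly 0.03 versus roughly 13.8
```

Each confident wrong prediction contributes -log(0.01) ≈ 4.6 to the sum, while each confident correct one contributes only about 0.01.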

The code you posted implements this exactly:

```python
return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))
```

The full function first converts Y and P to floats. The return line then takes element-wise logs, weights each log by the appropriate label term, sums over all samples, and negates the sum (so lower is better).

In [40]:
import numpy as np

def cross_entropy(Y, P):
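    # np.float_ was removed in NumPy 2.0; np.asarray with dtype=float does the same conversion
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))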
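
One practical caveat: if any entry of P is exactly 0 or 1, `np.log` returns `-inf`, and a product such as `0 * -inf` evaluates to `nan`. A common workaround is to clip the probabilities away from 0 and 1 first. The sketch below is one possible variant, not part of the original code; the name `cross_entropy_clipped` and the default `eps=1e-12` are assumptions chosen for illustration.

```python
import numpy as np

def cross_entropy_clipped(Y, P, eps=1e-12):
    """Binary cross-entropy with P clipped to [eps, 1 - eps] (eps is an assumed default)."""
    Y = np.asarray(Y, dtype=float)
    # Keep probabilities strictly inside (0, 1) so both logs stay finite
    P = np.clip(np.asarray(P, dtype=float), eps, 1 - eps)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))

# Predictions of exactly 1.0 and 0.0 would yield nan without clipping
loss = cross_entropy_clipped([1, 0], [1.0, 0.0])
print(np.isfinite(loss))
```

The clipping slightly biases the loss for extreme predictions, which is usually an acceptable trade for avoiding `nan` during training.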


```mermaid
flowchart LR
    subgraph "Input Layer"
        I0["Input 0 | (0.5)"]
        I1["Input 1 | (0.1)"]
        I2["Input 2 | (-0.2)"]
    end
    
    subgraph "Hidden Layer"
        H0["Hidden 0 | (σ = 0.560)"]
        H1["Hidden 1 | (σ = 0.387)"]
    end
    
    subgraph "Output Layer"
        O0["Output | (σ = 0.485) | Target = 0.6"]
    end
    
    I0 -->|0.5| H0
    I0 -->|-0.6| H1
    I1 -->|0.1| H0
    I1 -->|-0.2| H1
    I2 -->|0.1| H0
    I2 -->|0.7| H1
    
    H0 -->|0.1| O0
    H1 -->|-0.3| O0
    
    
    style I0 fill:#9AE4F5
    style I1 fill:#9AE4F5
    style I2 fill:#9AE4F5
    style H0 fill:#BCFB89
    style H1 fill:#BCFB89
    style O0 fill:#FA756A
```

This diagram shows:

1. The input layer with 3 nodes (blue)
2. The hidden layer with 2 nodes (green)
3. The output layer with 1 node (coral)

The connections between the input and hidden layers show the weights from your weight matrix:
- Input 0 connects to Hidden 0 with weight 0.5
- Input 0 connects to Hidden 1 with weight -0.6
- Input 1 connects to Hidden 0 with weight 0.1
- Input 1 connects to Hidden 1 with weight -0.2
- Input 2 connects to Hidden 0 with weight 0.1
- Input 2 connects to Hidden 1 with weight 0.7

The connections between the hidden layer and the output layer carry the weights from weights_hidden_output:
- Hidden 0 connects to Output with weight 0.1
- Hidden 1 connects to Output with weight -0.3

This network structure performs the matrix multiplication you described: when a 3-element input vector is multiplied by the 3×2 weight matrix, it produces a 2-element vector that serves as input to the hidden layer neurons.
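
The shape bookkeeping can be verified directly. This is a small sketch using the input vector and weight matrix from the notebook; the inline sigmoid expression stands in for the notebook's `sigmoid` helper so the snippet runs on its own.

```python
import numpy as np

x = np.array([0.5, 0.1, -0.2])                  # shape (3,)
weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1, 0.7]])    # shape (3, 2)

# (3,) dot (3, 2) -> (2,): one pre-activation value per hidden unit
hidden_layer_input = np.dot(x, weights_input_hidden)

# Sigmoid applied element-wise gives the hidden-layer activations
hidden_layer_output = 1 / (1 + np.exp(-hidden_layer_input))

print(hidden_layer_input.shape)   # (2,)
print(hidden_layer_input)         # approximately [0.24, -0.46]
print(hidden_layer_output)
```

Each column of the weight matrix holds the incoming weights of one hidden unit, so column 0 produces the input to Hidden 0 and column 1 the input to Hidden 1.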

In [None]:
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# x is a single training example: the feature vector fed into the network's input layer.
# Its three features, [0.5, 0.1, -0.2], match the three input nodes and the three rows
# of weights_input_hidden.
x = np.array([0.5, 0.1, -0.2])

# target is the desired (ground-truth) value for that training example.
target = 0.6

# learnrate is the learning rate, a hyperparameter that controls how much we adjust the weights in each iteration.
learnrate = 0.5

# weights_input_hidden is the weight matrix connecting the input layer to the hidden layer.
weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1, 0.7]])

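# weights_hidden_output holds the weights connecting the two hidden units to the single output unit.
weights_hidden_output = np.array([0.1, -0.3])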

## Forward pass
hidden_layer_input = np.dot(x, weights_input_hidden)
hidden_layer_output = sigmoid(hidden_layer_input)
output_layer_in = np.dot(hidden_layer_output, weights_hidden_output)
output = sigmoid(output_layer_in)

## Backwards pass

# Calculate output error
error = target - output      
print("Output Error:", error)

# Calculate error term for output layer
#
# output * (1 - output) is the derivative of the sigmoid activation:
# σ'(z) = σ(z)(1 - σ(z)).
# Back-propagation multiplies that derivative by the error coming from the loss.
# Assuming a squared-error loss L = ½(target - output)², which is the loss
# consistent with this code (the notebook does not state it explicitly):
#
# error = target - output                            # -∂L/∂output
# output_error_term = error * output * (1 - output)  # -∂L/∂z
#
# So output_error_term is not the derivative of the sigmoid by itself; it is the
# negative gradient of the loss with respect to the neuron's pre-activation input z.
# In other words, it is the product of:
#
# 1. the negative derivative of the loss w.r.t. the neuron's output (error), and
# 2. the derivative of the sigmoid w.r.t. its input (output * (1 - output)).
output_error_term = error * output * (1 - output)  # -∂L/∂z
print("Output Error Term:", output_error_term)

# Calculate error term for hidden layer: propagate the output error term back
# through the hidden-to-output weights, then scale by each hidden unit's
# sigmoid derivative.
hidden_error_term = np.dot(output_error_term, weights_hidden_output) * \
                    hidden_layer_output * (1 - hidden_layer_output)
print("Hidden Error Term:", hidden_error_term)


# Calculate change in weights for hidden layer to output layer
delta_w_h_o = learnrate * output_error_term * hidden_layer_output
print("Delta W H O:", delta_w_h_o)


# Calculate change in weights for input layer to hidden layer.
# x[:, None] reshapes x to (3, 1) so it broadcasts against the (2,)
# hidden error term, giving a (3, 2) matrix of weight changes.
delta_w_i_h = learnrate * hidden_error_term * x[:, None]
print('Change in weights for hidden layer to output layer:')
print(delta_w_h_o)
print('Change in weights for input layer to hidden layer:')
print(delta_w_i_h)

Well done!
Change in weights for hidden layer to output layer:
[0.00804047 0.00555918]
Change in weights for input layer to hidden layer:
[[ 1.77005547e-04 -5.11178506e-04]
 [ 3.54011093e-05 -1.02235701e-04]
 [-7.08022187e-05  2.04471402e-04]]
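
The notebook computes the weight changes but stops before applying them. The sketch below applies one update and re-runs the forward pass to check that the output error shrinks. It reuses the notebook's values; the `forward` helper and the `*_new`, `error_before`, and `error_after` names are introduced here for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(x, w_i_h, w_h_o):
    """Return (hidden activations, network output) for one input vector."""
    hidden = sigmoid(np.dot(x, w_i_h))
    return hidden, sigmoid(np.dot(hidden, w_h_o))

x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5
w_i_h = np.array([[0.5, -0.6], [0.1, -0.2], [0.1, 0.7]])
w_h_o = np.array([0.1, -0.3])

# Forward pass and error terms, as in the notebook cell
hidden, output = forward(x, w_i_h, w_h_o)
error_before = target - output
output_error_term = error_before * output * (1 - output)
hidden_error_term = output_error_term * w_h_o * hidden * (1 - hidden)

delta_w_h_o = learnrate * output_error_term * hidden
delta_w_i_h = learnrate * hidden_error_term * x[:, None]

# Apply the changes by adding them to the current weights
w_h_o_new = w_h_o + delta_w_h_o
w_i_h_new = w_i_h + delta_w_i_h

_, output_after = forward(x, w_i_h_new, w_h_o_new)
error_after = target - output_after
print("Error before:", error_before)
print("Error after: ", error_after)
```

The deltas are added rather than subtracted because the error is defined as `target - output`, so each delta already points in the direction that reduces the error.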
