# Classification Neural Net - NumPy

In this notebook I will develop a simple classification neural network from scratch using Python's NumPy instead of relying on libraries like PyTorch.

In [144]:
import numpy as np

First we will define the constants that will be used throughout the notebook.

In [145]:
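bias1 = 0  # placeholder; replaced by a zero array once the layer sizes are known
bias2 = 0  # placeholder; replaced by a zero array once the layer sizes are known
learning_rate = 0.1
truth = 1  # the expected value (the actual labeled value)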

Next, let's create some sample data to work with.
- `features_matrix`: shape `(n_samples, n_features)`
- `labels`: shape `(n_samples, 1)`


In [146]:
np.random.seed(42)
n_samples = 8
n_features = 3
features_matrix = np.random.rand(n_samples, n_features)
labels = np.array([[0], [1], [0], [1], [0], [1], [0], [1]])
weights1 = np.random.randn(n_features, 4)
weights2 = np.random.randn(4, 1)
bias1 = np.zeros((1, weights1.shape[1]))
bias2 = np.zeros((1, weights2.shape[1]))

print(f"feature matrix shape: {features_matrix.shape}")
print(features_matrix)
print(f"weights matrix shape: {weights1.shape}")
print(weights1)
print(f"labels matrix shape: {labels.shape}")
print(labels)

feature matrix shape: (8, 3)
[[0.37454012 0.95071431 0.73199394]
 [0.59865848 0.15601864 0.15599452]
 [0.05808361 0.86617615 0.60111501]
 [0.70807258 0.02058449 0.96990985]
 [0.83244264 0.21233911 0.18182497]
 [0.18340451 0.30424224 0.52475643]
 [0.43194502 0.29122914 0.61185289]
 [0.13949386 0.29214465 0.36636184]]
weights matrix shape: (3, 4)
[[ 1.46564877 -0.2257763   0.0675282  -1.42474819]
 [-0.54438272  0.11092259 -1.15099358  0.37569802]
 [-0.60063869 -0.29169375 -0.60170661  1.85227818]]
labels matrix shape: (8, 1)
[[0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]]


### Neural Network ***Forward Pass*** – 1 Hidden Layer

We will create functions for each part of the forward pass:
1. **Hidden layer linear transformation** – multiply the feature matrix by the weight matrix, add bias, and produce the pre-activation values for the hidden layer.
2. **Hidden layer activation (ReLU)** – introduce non-linearity so the network can learn complex patterns.
3. **Output layer linear transformation** – take the hidden layer activations, multiply by the output layer weights, add bias, and produce the output logits.
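4. **Output layer activation (Sigmoid)** – squash the logits into the range (0, 1) to get probabilities. In this notebook the sigmoid is folded into the loss, so it is never applied as a separate step in the forward pass.
5. **Loss (BCE from logits)** – measure how far the predicted probabilities are from the target labels, computed directly from the logits.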

---

#### 1. Hidden layer linear transformation
$$
Z1 = \text{features\_matrix} \cdot \text{weights\_matrix} + \text{bias}
$$

Where:
- `features_matrix` = input data `(n_samples, n_features)`
- `weights_matrix` = hidden layer weights `(n_features, n_hidden)`
- `bias` = hidden layer bias `(1, n_hidden)`
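
To make the shapes concrete, here is a minimal sketch of the same transformation on toy values. The numbers and the short names `X`, `W`, and `b` are assumptions chosen for illustration; they are not the notebook's data.

```python
import numpy as np

# Toy sizes for illustration only: 2 samples, 3 features, 4 hidden units.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # (n_samples, n_features) = (2, 3)
W = np.ones((3, 4))                     # (n_features, n_hidden) = (3, 4)
b = np.array([[0.0, 1.0, 2.0, 3.0]])    # (1, n_hidden) = (1, 4)

# The (1, 4) bias row is broadcast across both sample rows.
Z1 = X @ W + b

print(Z1.shape)  # (2, 4)
print(Z1)        # rows: [6, 7, 8, 9] and [15, 16, 17, 18]
```

Because the bias has a leading dimension of 1, NumPy adds the same bias row to every sample, so no explicit loop over samples is needed.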

---

#### 2. Hidden layer activation (ReLU)
$$
A1 = \max(0, Z1)
$$

Where:
- `A1` = hidden layer activation output `(n_samples, n_hidden)`

---

#### 3. Output layer linear transformation
$$
Z2 = A1 \cdot W2 + b2
$$

Where:
- `W2` = output layer weights `(n_hidden, 1)`
- `b2` = output layer bias `(1, 1)`

---

#### 5. Binary Cross-Entropy (BCE) Loss – from logits
$$
\text{loss} = \frac{1}{n_{\text{samples}}} \sum_{i=1}^{n_{\text{samples}}}
\left[ \max(z_i, 0) - z_i \cdot \text{labels}_i + \log\left( 1 + e^{-\lvert z_i \rvert} \right) \right]
$$

Where:
- `z` = logits `(n_samples, 1)` from the output layer transformation (before sigmoid)
- `labels` = true labels `(n_samples, 1)`
- `n_samples` = number of rows in `features_matrix`
- This formulation is **numerically stable** and does **not** require applying the sigmoid in the forward pass.
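
The stability claim can be checked with a small comparison. The sketch below assumes a hypothetical helper `bce_naive` that applies the sigmoid first and then takes logs (the textbook form), and compares it with the formula above on made-up logits.

```python
import numpy as np

def bce_from_logits(logits, labels):
    # Same expression as the formula above.
    return np.mean(
        np.maximum(logits, 0)
        - logits * labels
        + np.log1p(np.exp(-np.abs(logits)))
    )

def bce_naive(logits, labels):
    # Textbook form (hypothetical helper): sigmoid first, then logs.
    p = 1.0 / (1.0 + np.exp(-logits))
    return np.mean(-(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

labels = np.array([[0.0], [1.0], [0.0], [1.0]])

# Moderate logits: both forms agree.
moderate = np.array([[-1.5], [0.3], [2.0], [-0.7]])
stable_moderate = bce_from_logits(moderate, labels)
naive_moderate = bce_naive(moderate, labels)

# Extreme logits: the naive form hits log(0) and overflows in exp.
extreme = np.array([[-800.0], [800.0], [800.0], [-800.0]])
stable_extreme = bce_from_logits(extreme, labels)
with np.errstate(over="ignore", divide="ignore", invalid="ignore"):
    naive_extreme = bce_naive(extreme, labels)

print(stable_moderate, naive_moderate)
print(stable_extreme)  # 400.0
print(naive_extreme)   # nan
```

For moderate logits the two forms give the same value. For extreme logits only the from-logits form stays finite, since it never evaluates `log(0)` or `exp` of a large positive number.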

We will implement each formula above in order of how they're applied during the forward pass:

In [147]:
def hidden_layer_output_transformation(features_matrix, weights1, bias1):
    # Z1 = X @ W1 + b1; the (1, n_hidden) bias broadcasts across all samples
    return features_matrix @ weights1 + bias1

def hidden_ReLU_activation(hidden_layer_output):
    # A1 = max(0, Z1), applied element-wise
    return np.maximum(0, hidden_layer_output)

def output_layer_transformation(hidden_activation, weights2, bias2):
    # Z2 = A1 @ W2 + b2, the output logits (before sigmoid)
    return hidden_activation @ weights2 + bias2

def BCE_loss(logits, labels):
    # Numerically stable binary cross-entropy computed directly from logits
    return np.mean(
        np.maximum(logits, 0)
        - logits * labels
        + np.log1p(np.exp(-np.abs(logits)))
    )

print("Hidden layer transformation output: ")
hidden_layer_output = hidden_layer_output_transformation(features_matrix, weights1, bias1)
print(hidden_layer_output)

print("Hidden layer ReLU activation: ")
hidden_ReLU_activation_output = hidden_ReLU_activation(hidden_layer_output)
print(hidden_ReLU_activation_output)

print("Output layer transformation output logits: ")
output_layer_transformation_logits = output_layer_transformation(hidden_ReLU_activation_output, weights2, bias2)
print(output_layer_transformation_logits)

print("Final BCE loss:")
forward_pass_output = BCE_loss(output_layer_transformation_logits, labels)
print(forward_pass_output)


Hidden layer transformation output: 
[[-0.40827206 -0.19262465 -1.50941963  1.17941254]
 [ 0.69879287 -0.16335953 -0.23301305 -0.50537645]
 [-0.74745409 -0.09237689 -1.35473578  1.35609836]
 [ 0.44401448 -0.44049936 -0.55947892  0.79545129]
 [ 0.99526368 -0.21742982 -0.29759288 -0.76945534]
 [-0.21200664 -0.16072923 -0.65354531  0.82499286]
 [ 0.10703705 -0.24369272 -0.67419033  0.6273231 ]
 [-0.17464059 -0.10595443 -0.54727919  0.58961859]]
Hidden layer ReLU activation: 
[[0.         0.         0.         1.17941254]
 [0.69879287 0.         0.         0.        ]
 [0.         0.         0.         1.35609836]
 [0.44401448 0.         0.         0.79545129]
 [0.99526368 0.         0.         0.        ]
 [0.         0.         0.         0.82499286]
 [0.10703705 0.         0.         0.6273231 ]
 [0.         0.         0.         0.58961859]]
Output layer transformation output logits: 
[[-1.4398783 ]
 [-0.00943176]
 [-1.65558408]
 [-0.97711462]
 [-0.0134333 ]
 [-1.00718729]
 [-0.7673081

### Neural Network ***Backpropagation*** – 1 Hidden Layer (BCE from logits)

We will compute gradients for each parameter using the chain rule, then update the weights and biases.
1. **Gradient of loss w.r.t. output logits** – Figure out how much each output logit (before sigmoid) is pushing the loss up or down, so we know the direction to adjust them.
2. **Gradient w.r.t. output layer weights & bias** – See how much each connection from the hidden layer to the output contributed to the error, so we can strengthen or weaken them.
3. **Gradient w.r.t. hidden layer activations** – Work backwards to see how much the hidden neurons themselves are responsible for the error at the output.
4. **Gradient w.r.t. hidden layer weights & bias** – Determine how much each connection from the inputs to the hidden neurons needs to be adjusted to fix the error.
5. **Update weights & biases** – Apply the changes (scaled by the learning rate) so the network gets a bit better at predicting next time.

---

#### 1. Gradient of loss w.r.t. output layer logits
$$
dZ2 \;=\; \frac{\sigma(Z2) - \text{labels}}{n_{\text{samples}}}
$$

Where:
- `Z2` = output logits `(n_samples, 1)`
- `labels` = true labels `(n_samples, 1)`
- `σ(·)` = sigmoid applied element-wise
- `n_samples` = number of rows in `features_matrix`
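
One way to gain confidence in this expression is a finite-difference check: nudge each logit slightly and see how the loss changes. The sketch below uses made-up logits and labels, and the helper names `bce_from_logits` and `sigmoid` are assumptions for illustration.

```python
import numpy as np

def bce_from_logits(logits, labels):
    # Mean BCE computed directly from logits (same form as in the forward pass).
    return np.mean(
        np.maximum(logits, 0)
        - logits * labels
        + np.log1p(np.exp(-np.abs(logits)))
    )

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits and labels, for illustration only.
Z2 = np.array([[-1.2], [0.4], [2.5], [-0.3]])
labels = np.array([[0.0], [1.0], [0.0], [1.0]])
n_samples = Z2.shape[0]

# Analytic gradient from the formula above.
analytic = (sigmoid(Z2) - labels) / n_samples

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(Z2)
for i in range(n_samples):
    shift = np.zeros_like(Z2)
    shift[i, 0] = eps
    numeric[i, 0] = (
        bce_from_logits(Z2 + shift, labels) - bce_from_logits(Z2 - shift, labels)
    ) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

The two gradients should agree to within floating-point noise. The division by `n_samples` comes from the mean in the loss.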

---

#### 2. Output layer parameter gradients
$$
dW2 \;=\; A1^\top \cdot dZ2
\qquad\qquad
db2 \;=\; \sum_{i=1}^{n_{\text{samples}}} (dZ2)_i
$$

Where:
- `A1` = hidden layer ReLU output `(n_samples, n_hidden)`
- `W2` = output layer weights `(n_hidden, 1)`
- `b2` = output layer bias `(1, 1)`
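
The matrix product hides a sum over samples. As a sketch on made-up values, the block below computes `dW2` both as `A1.T @ dZ2` and by accumulating one sample at a time; `dW2_loop` is a hypothetical name used only for the comparison.

```python
import numpy as np

# Hypothetical values: 3 samples, 2 hidden units.
A1 = np.array([[0.0, 1.0],
               [2.0, 0.0],
               [1.0, 3.0]])               # (n_samples, n_hidden)
dZ2 = np.array([[0.1], [-0.2], [0.3]])    # (n_samples, 1)

# Vectorized form from the formulas above.
dW2 = A1.T @ dZ2                           # (n_hidden, 1)
db2 = np.sum(dZ2, axis=0, keepdims=True)   # (1, 1)

# Same dW2, accumulated one sample at a time:
# each sample adds (its hidden activations) * (its logit gradient).
dW2_loop = np.zeros((2, 1))
for a_row, dz_row in zip(A1, dZ2):
    dW2_loop += a_row.reshape(-1, 1) * dz_row

print(dW2)  # [[-0.1], [1.0]]
print(db2)  # [[0.2]]
```

A hidden unit whose activation is zero for a sample contributes nothing to its weight gradient for that sample, which is why some rows of the weight gradient can be exactly zero.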

---

#### 3. Backprop into hidden activations
$$
dA1 \;=\; dZ2 \cdot W2^\top
\qquad\qquad
dZ1 \;=\; dA1 \odot \mathbf{1}(Z1 > 0)
$$

Where:
- `W2.T` = transpose of `W2` `(1, n_hidden)`
- `⊙` denotes element-wise product
- `Z1` = hidden pre-activations `(n_samples, n_hidden)`
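
A small worked example may help show what the indicator does. The values below are made up, and `relu_mask` is a hypothetical name for the indicator term.

```python
import numpy as np

# Hypothetical values: 2 samples, 3 hidden units.
Z1 = np.array([[-0.5, 0.2, 1.0],
               [0.7, -1.1, 0.0]])        # hidden pre-activations
W2 = np.array([[0.5], [-1.0], [2.0]])    # (n_hidden, 1)
dZ2 = np.array([[0.1], [-0.2]])          # (n_samples, 1)

# Spread each sample's logit gradient back over the hidden units.
dA1 = dZ2 @ W2.T                          # (n_samples, n_hidden)

# ReLU passes gradient only where the pre-activation was positive.
relu_mask = (Z1 > 0).astype(float)
dZ1 = dA1 * relu_mask

print(dA1)  # [[0.05, -0.1, 0.2], [-0.1, 0.2, -0.4]]
print(dZ1)  # [[0.0, -0.1, 0.2], [-0.1, 0.0, 0.0]]
```

Entries where `Z1` is negative or exactly zero receive no gradient, which matches the strict inequality in the indicator.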

---

#### 4. Hidden layer parameter gradients
$$
dW1 \;=\; X^\top \cdot dZ1
\qquad\qquad
db1 \;=\; \sum_{i=1}^{n_{\text{samples}}} (dZ1)_i
$$

Where:
- `X` = input features `(n_samples, n_features)`
- `W1` = hidden layer weights `(n_features, n_hidden)`
- `b1` = hidden layer bias `(1, n_hidden)`

---

#### 5. Parameter updates (gradient descent)
$$
W1 \leftarrow W1 - \eta \cdot dW1
\qquad
b1 \leftarrow b1 - \eta \cdot db1
$$
$$
W2 \leftarrow W2 - \eta \cdot dW2
\qquad
b2 \leftarrow b2 - \eta \cdot db2
$$

Where:
- `η` = learning rate `(scalar)`


In [148]:
# 1) How much each logit is pushing the loss up or down
loss_gradient_wrt_logits = (
    1.0 / (1.0 + np.exp(-output_layer_transformation_logits)) - labels
) / n_samples
print("Loss gradient w.r.t logits (output layer pre-activation):\n", loss_gradient_wrt_logits)

# 2) How much each output layer weight and bias contributed to the error
output_layer_weight_gradients = hidden_ReLU_activation_output.T @ loss_gradient_wrt_logits
output_layer_bias_gradients = np.sum(loss_gradient_wrt_logits, axis=0, keepdims=True)
print("\nOutput layer weight gradients:\n", output_layer_weight_gradients)
print("\nOutput layer bias gradients:\n", output_layer_bias_gradients)

# 3) How much each hidden activation was responsible for the output error ──> STUB

# 4) How much each hidden layer weight and bias contributed to the error ──> STUB

# 5) Apply the changes to weights and biases (gradient descent) ──> STUB



Loss gradient w.r.t logits (output layer pre-activation):
 [[ 0.02394552]
 [-0.06279474]
 [ 0.02004446]
 [-0.09081691]
 [ 0.06208022]
 [-0.09155867]
 [ 0.03963271]
 [-0.08407126]]

Output layer weight gradients:
 [[-0.01817619]
 [ 0.        ]
 [ 0.        ]
 [-0.11705923]]

Output layer bias gradients:
 [[-0.18353867]]
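
Steps 3 to 5 are left as stubs in the cell above. The block below is one possible sketch of a full backward pass and a single gradient descent update, following the backpropagation formulas in this notebook. It rebuilds the notebook's setup so it runs on its own. The helpers `forward` and `bce_loss` and the short gradient names (`dZ2`, `dW2`, and so on) are assumptions for illustration, not the notebook's own code.

```python
import numpy as np

# Rebuild the notebook's setup so this block is self-contained.
np.random.seed(42)
n_samples, n_features, n_hidden = 8, 3, 4
learning_rate = 0.1
features_matrix = np.random.rand(n_samples, n_features)
labels = np.array([[0], [1], [0], [1], [0], [1], [0], [1]])
weights1 = np.random.randn(n_features, n_hidden)
weights2 = np.random.randn(n_hidden, 1)
bias1 = np.zeros((1, n_hidden))
bias2 = np.zeros((1, 1))

def forward(X, W1, b1, W2, b2):
    # Hypothetical helper: returns pre-activations, activations, and logits.
    Z1 = X @ W1 + b1
    A1 = np.maximum(0, Z1)
    Z2 = A1 @ W2 + b2
    return Z1, A1, Z2

def bce_loss(logits, y):
    # Numerically stable BCE from logits, as in the forward pass section.
    return np.mean(
        np.maximum(logits, 0) - logits * y + np.log1p(np.exp(-np.abs(logits)))
    )

Z1, A1, Z2 = forward(features_matrix, weights1, bias1, weights2, bias2)
loss_before = bce_loss(Z2, labels)

# 1) Gradient of the loss w.r.t. the output logits
dZ2 = (1.0 / (1.0 + np.exp(-Z2)) - labels) / n_samples

# 2) Output layer weight and bias gradients
dW2 = A1.T @ dZ2
db2 = np.sum(dZ2, axis=0, keepdims=True)

# 3) Backprop into the hidden activations, then through the ReLU
dA1 = dZ2 @ weights2.T
dZ1 = dA1 * (Z1 > 0)

# 4) Hidden layer weight and bias gradients
dW1 = features_matrix.T @ dZ1
db1 = np.sum(dZ1, axis=0, keepdims=True)

# 5) Gradient descent update
weights1 = weights1 - learning_rate * dW1
bias1 = bias1 - learning_rate * db1
weights2 = weights2 - learning_rate * dW2
bias2 = bias2 - learning_rate * db2

_, _, Z2_after = forward(features_matrix, weights1, bias1, weights2, bias2)
loss_after = bce_loss(Z2_after, labels)
print("loss before update:", loss_before)
print("loss after update: ", loss_after)
```

Note that `dA1` is computed with the pre-update `weights2`. All gradients should be taken from the same forward pass before any parameter is changed. Wrapping steps 1 to 5 in a loop would turn this single update into a basic training loop.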
