# Classification Neural Net - NumPy

In this notebook I will build a simple classification neural network from scratch using Python's NumPy, instead of relying on libraries like PyTorch.

In [321]:
import numpy as np

First we will define the hyperparameters that will be used throughout the notebook.

In [322]:
learning_rate = 0.05

Next, let's create some sample data to work with.
- features: shape `(n_samples, n_features)`
- labels: shape `(n_samples, 1)`


In [323]:
# features: [height (m), weight (kg), score (0-10)]
input_feature_matrix = np.array([
    [1.80, 80, 8],   # good
    [1.65, 70, 6],   # good
    [1.75, 95, 5],   # bad
    [1.60, 60, 4],   # bad
    [1.82, 77, 9],   # good
    [1.55, 50, 3],   # bad
    [1.78, 85, 7],   # good
    [1.62, 65, 2],   # bad
], dtype=float)

# normalize features (z-score)
X = input_feature_matrix
X = (X - X.mean(axis=0, keepdims=True)) / (X.std(axis=0, keepdims=True) + 1e-8)
input_feature_matrix = X

# labels: 1 = Good, 0 = Bad
target_labels = np.array([
    [1],
    [1],
    [0],
    [0],
    [1],
    [0],
    [1],
    [0]
], dtype=float)

# random initial weights and biases
np.random.seed(42)
hidden_layer_weights = np.random.randn(input_feature_matrix.shape[1], 4)
output_layer_weights = np.random.randn(4, 1)
hidden_layer_bias = np.zeros((1, hidden_layer_weights.shape[1]))
output_layer_bias = np.zeros((1, output_layer_weights.shape[1]))

num_samples = input_feature_matrix.shape[0]

print(f"feature matrix shape: {input_feature_matrix.shape}")
print(input_feature_matrix)
print(f"hidden weights matrix shape: {hidden_layer_weights.shape}")
print(hidden_layer_weights)
print(f"output weights matrix shape: {output_layer_weights.shape}")
print(output_layer_weights)
print(f"labels matrix shape: {target_labels.shape}")
print(target_labels)

feature matrix shape: (8, 3)
[[ 1.07448419  0.53602696  1.09108945]
 [-0.47898693 -0.20332057  0.21821789]
 [ 0.55666048  1.64504827 -0.21821789]
 [-0.99681063 -0.94266811 -0.65465367]
 [ 1.28161367  0.3142227   1.52752522]
 [-1.51463434 -1.68201564 -1.09108945]
 [ 0.86735471  0.90570073  0.65465367]
 [-0.78968115 -0.57299434 -1.52752522]]
hidden weights matrix shape: (3, 4)
[[ 0.49671415 -0.1382643   0.64768854  1.52302986]
 [-0.23415337 -0.23413696  1.57921282  0.76743473]
 [-0.46947439  0.54256004 -0.46341769 -0.46572975]]
output weights matrix shape: (4, 1)
[[ 0.24196227]
 [-1.91328024]
 [-1.72491783]
 [-0.56228753]]
labels matrix shape: (8, 1)
[[1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]]
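As a quick sanity check on the z-score step, each column of the normalized matrix should have a mean close to 0 and a standard deviation close to 1. The sketch below recomputes the normalization on the same raw values; the names `raw` and `normalized` are introduced here only for illustration.

```python
import numpy as np

# same raw values as the feature matrix above: [height (m), weight (kg), score (0-10)]
raw = np.array([
    [1.80, 80, 8],
    [1.65, 70, 6],
    [1.75, 95, 5],
    [1.60, 60, 4],
    [1.82, 77, 9],
    [1.55, 50, 3],
    [1.78, 85, 7],
    [1.62, 65, 2],
], dtype=float)

# z-score per column; the small epsilon guards against division by zero
normalized = (raw - raw.mean(axis=0, keepdims=True)) / (raw.std(axis=0, keepdims=True) + 1e-8)

print(normalized.mean(axis=0))  # each entry should be very close to 0
print(normalized.std(axis=0))   # each entry should be very close to 1
```

Normalizing per column matters here because weight (tens of kg) would otherwise dominate height (around 1 to 2 m) in the first matrix product.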


### Neural Network ***Forward Pass*** – 1 Hidden Layer

We will create functions for each part of the forward pass:
1. **Hidden layer linear transformation** – multiply the feature matrix by the weight matrix, add bias, and produce the pre-activation values for the hidden layer.
2. **Hidden layer activation (ReLU)** – introduce non-linearity so the network can learn complex patterns.
3. **Output layer linear transformation** – take the hidden layer activations, multiply by the output layer weights, add bias, and produce the output logits.
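4. **Output layer activation (Sigmoid)** – squash the logits into the range (0, 1) to get probabilities. The sigmoid is folded into the loss, so it is not applied as a separate forward-pass step; it is only applied explicitly when computing gradients and predictions.
5. **Loss (BCE from logits)** – measure how far the predicted probabilities are from the target labels.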

---

#### 1. Hidden layer linear transformation
$$
Z1 = \text{features\_matrix} \cdot \text{weights\_matrix} + \text{bias}
$$

Where:
- `features_matrix` = input data `(n_samples, n_features)`
- `weights_matrix` = hidden layer weights `(n_features, n_hidden)`
- `bias` = hidden layer bias `(1, n_hidden)`
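The shapes above rely on NumPy broadcasting: the `(1, n_hidden)` bias row is added to every row of the `(n_samples, n_hidden)` product. A minimal sketch with random stand-in values (the sizes 8, 3 and 4 mirror this notebook; the bias value 0.5 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_hidden = 8, 3, 4

features_matrix = rng.standard_normal((n_samples, n_features))
weights_matrix = rng.standard_normal((n_features, n_hidden))
bias = np.full((1, n_hidden), 0.5)  # non-zero so the effect of the bias is visible

# the single bias row is broadcast across all samples
Z1 = features_matrix @ weights_matrix + bias

print(Z1.shape)  # (8, 4)
# adding the bias shifts every row by the same vector
print(np.allclose(Z1 - features_matrix @ weights_matrix, bias))  # True
```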

---

#### 2. Hidden layer activation (ReLU)
$$
A1 = \max(0, Z1)
$$

Where:
- `A1` = hidden layer activation output `(n_samples, n_hidden)`
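A tiny illustration of the ReLU formula, using made-up pre-activation values (`Z1_example` and `A1_example` are hypothetical names, not variables from this notebook):

```python
import numpy as np

Z1_example = np.array([[-2.0,  0.0, 3.5],
                       [ 1.2, -0.7, 0.0]])

# negative entries become 0; non-negative entries pass through unchanged
A1_example = np.maximum(0, Z1_example)

print(A1_example)
```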

---

#### 3. Output layer linear transformation
$$
Z2 = A1 \cdot W2 + b2
$$

Where:
- `W2` = output layer weights `(n_hidden, 1)`
- `b2` = output layer bias `(1, 1)`

---

#### 5. Binary Cross-Entropy (BCE) Loss – from logits
$$
\text{loss} = \frac{1}{n_{\text{samples}}} \sum_{i=1}^{n_{\text{samples}}}
\left[ \max(z_i, 0) - z_i \cdot \text{labels}_i + \log\left( 1 + e^{-\lvert z_i \rvert} \right) \right]
$$

Where:
- `z` = logits `(n_samples, 1)` from the output layer transformation (before sigmoid)
- `labels` = true labels `(n_samples, 1)`
- `n_samples` = number of rows in `features_matrix`
- This formulation is **numerically stable** and does **not** require applying the sigmoid in the forward pass.
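To illustrate the stability claim, the sketch below compares the from-logits form with the textbook form that applies the sigmoid first. The function names `bce_from_logits` and `bce_naive`, and the example logits, are assumptions made for this demonstration only.

```python
import numpy as np

def bce_from_logits(z, labels):
    # stable form: max(z, 0) - z * y + log(1 + exp(-|z|))
    return np.mean(np.maximum(z, 0) - z * labels + np.log1p(np.exp(-np.abs(z))))

def bce_naive(z, labels):
    # textbook form: apply the sigmoid first, then take logs of the probabilities
    p = 1.0 / (1.0 + np.exp(-z))
    return np.mean(-(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

labels = np.array([[1.0], [0.0], [1.0], [0.0]])

# moderate logits: both forms agree
z_small = np.array([[2.0], [-1.0], [0.5], [3.0]])
print(np.isclose(bce_from_logits(z_small, labels), bce_naive(z_small, labels)))  # True

# extreme logits: the naive form overflows or hits log(0); the stable form stays finite
z_large = np.array([[-800.0], [800.0], [50.0], [-50.0]])
with np.errstate(over="ignore", divide="ignore", invalid="ignore"):
    naive_large = bce_naive(z_large, labels)
stable_large = bce_from_logits(z_large, labels)

print(naive_large)   # not finite (inf or nan)
print(stable_large)  # 400.0
```

The stable form never exponentiates a positive number, which is why it avoids the overflow.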







We will implement each formula above in order of how they're applied during the forward pass:

In [324]:
def hidden_layer_output_transformation(input_feature_matrix, hidden_layer_weights, hidden_layer_bias):
    return input_feature_matrix @ hidden_layer_weights + hidden_layer_bias

def hidden_ReLU_activation(hidden_layer_linear_output):
    return np.maximum(0, hidden_layer_linear_output)

def output_layer_transformation(hidden_layer_activation_output, output_layer_weights, output_layer_bias):
    return hidden_layer_activation_output @ output_layer_weights + output_layer_bias

def BCE_loss(output_layer_linear_output, target_labels):
    return np.mean(
        np.maximum(output_layer_linear_output, 0)
        - output_layer_linear_output * target_labels
        + np.log1p(np.exp(-np.abs(output_layer_linear_output)))
    )

### Neural Network ***Backpropagation*** – 1 Hidden Layer (BCE from logits)

We will compute gradients for each parameter using the chain rule, then update the weights and biases.
1. **Gradient of loss w.r.t. output logits** – Figure out how much each output logit (before sigmoid) is pushing the loss up or down, so we know the direction to adjust them.
2. **Gradient w.r.t. output layer weights & bias** – See how much each connection from the hidden layer to the output contributed to the error, so we can strengthen or weaken them.
3. **Gradient w.r.t. hidden layer activations** – Work backwards to see how much the hidden neurons themselves are responsible for the error at the output.
4. **Gradient w.r.t. hidden layer weights & bias** – Determine how much each connection from the inputs to the hidden neurons needs to be adjusted to fix the error.
5. **Update weights & biases** – Apply the changes (scaled by the learning rate) so the network gets a bit better at predicting next time.

---

#### 1. Gradient of loss w.r.t. output layer logits
$$
dZ2 \;=\; \frac{\sigma(Z2) - \text{labels}}{n_{\text{samples}}}
$$

Where:
- `Z2` = output logits `(n_samples, 1)`
- `labels` = true labels `(n_samples, 1)`
- `σ(·)` = sigmoid applied element-wise
- `n_samples` = number of rows in `features_matrix`
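
A short derivation sketch of where this expression comes from. The per-sample loss symbol `ℓ_i` is introduced here only for illustration. For either sign of `z_i`, the stable loss term simplifies to a softplus form:

$$
\ell_i \;=\; \max(z_i, 0) - z_i \cdot \text{labels}_i + \log\left(1 + e^{-\lvert z_i \rvert}\right)
\;=\; \log\left(1 + e^{z_i}\right) - z_i \cdot \text{labels}_i
$$

Differentiating with respect to `z_i`:

$$
\frac{\partial \ell_i}{\partial z_i} \;=\; \frac{e^{z_i}}{1 + e^{z_i}} - \text{labels}_i \;=\; \sigma(z_i) - \text{labels}_i
$$

Because the loss is the mean of `ℓ_i` over all samples, each entry of `dZ2` also carries the factor `1 / n_samples`.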

---

#### 2. Output layer parameter gradients
$$
dW2 \;=\; A1^\top \cdot dZ2
\qquad\qquad
db2 \;=\; \sum_{i=1}^{n_{\text{samples}}} (dZ2)_i
$$

Where:
- `A1` = hidden layer ReLU output `(n_samples, n_hidden)`
- `W2` = output layer weights `(n_hidden, 1)`
- `b2` = output layer bias `(1, 1)`
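
One way to gain confidence in these formulas is a finite-difference check: nudge each parameter slightly and compare the change in loss with the analytic gradient. The sketch below uses random stand-in data and hypothetical helper names (`bce_from_logits`, `sigmoid`); it is a verification aid, not part of the training code.

```python
import numpy as np

def bce_from_logits(z, labels):
    return np.mean(np.maximum(z, 0) - z * labels + np.log1p(np.exp(-np.abs(z))))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # adequate here: the demo logits are small

rng = np.random.default_rng(0)
n_samples, n_hidden = 6, 4
A1 = np.abs(rng.standard_normal((n_samples, n_hidden)))  # stand-in for ReLU outputs
W2 = rng.standard_normal((n_hidden, 1))
b2 = np.zeros((1, 1))
labels = rng.integers(0, 2, size=(n_samples, 1)).astype(float)

# analytic gradients from the formulas above
dZ2 = (sigmoid(A1 @ W2 + b2) - labels) / n_samples
dW2 = A1.T @ dZ2
db2 = np.sum(dZ2, axis=0, keepdims=True)

# central finite differences, one weight at a time
eps = 1e-6
dW2_numeric = np.zeros_like(W2)
for j in range(n_hidden):
    step = np.zeros_like(W2)
    step[j, 0] = eps
    loss_plus = bce_from_logits(A1 @ (W2 + step) + b2, labels)
    loss_minus = bce_from_logits(A1 @ (W2 - step) + b2, labels)
    dW2_numeric[j, 0] = (loss_plus - loss_minus) / (2 * eps)

db2_numeric = (bce_from_logits(A1 @ W2 + b2 + eps, labels)
               - bce_from_logits(A1 @ W2 + b2 - eps, labels)) / (2 * eps)

print(np.max(np.abs(dW2 - dW2_numeric)))  # should be tiny
print(abs(db2[0, 0] - db2_numeric))       # should be tiny
```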

---

#### 3. Backprop into hidden activations
$$
dA1 \;=\; dZ2 \cdot W2^\top
\qquad\qquad
dZ1 \;=\; dA1 \odot \mathbf{1}(Z1 > 0)
$$

Where:
- `W2.T` = transpose of `W2` `(1, n_hidden)`
- `⊙` denotes element-wise product
- `Z1` = hidden pre-activations `(n_samples, n_hidden)`
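
The indicator `1(Z1 > 0)` acts as a mask: gradients only flow through hidden units that were active in the forward pass. A small sketch with made-up values (the name `relu_mask` is introduced here for illustration):

```python
import numpy as np

Z1 = np.array([[ 0.8, -1.5, 0.0],
               [-0.2,  2.0, 1.1]])
dA1 = np.array([[ 0.3, 0.4, -0.5],
                [-0.1, 0.2,  0.6]])

relu_mask = Z1 > 0     # boolean array, True where the unit was active
dZ1 = dA1 * relu_mask  # booleans behave as 1.0 / 0.0 in the product

print(relu_mask)
print(dZ1)  # entries where Z1 <= 0 are zeroed out
```

Note that with the strict inequality, a unit whose pre-activation is exactly 0 receives zero gradient.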

---

#### 4. Hidden layer parameter gradients
$$
dW1 \;=\; X^\top \cdot dZ1
\qquad\qquad
db1 \;=\; \sum_{i=1}^{n_{\text{samples}}} (dZ1)_i
$$

Where:
- `X` = input features `(n_samples, n_features)`
- `W1` = hidden layer weights `(n_features, n_hidden)`
- `b1` = hidden layer bias `(1, n_hidden)`

---

#### 5. Parameter updates (gradient descent)
$$
W1 \leftarrow W1 - \eta \cdot dW1
\qquad
b1 \leftarrow b1 - \eta \cdot db1
$$
$$
W2 \leftarrow W2 - \eta \cdot dW2
\qquad
b2 \leftarrow b2 - \eta \cdot db2
$$

Where:
- `η` = learning rate `(scalar)`


In [325]:
def stable_sigmoid(x):
    pos = x >= 0
    out = np.empty_like(x, dtype=float)
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    e = np.exp(x[~pos])
    out[~pos] = e / (1.0 + e)
    return out

def compute_loss_gradient_wrt_logits(output_layer_linear_output, target_labels):
    num_samples = output_layer_linear_output.shape[0]
    return (stable_sigmoid(output_layer_linear_output) - target_labels) / num_samples

def output_layer_param_gradients(hidden_layer_ReLU_output, loss_gradient_wrt_logits):
    output_layer_weight_gradients = hidden_layer_ReLU_output.T @ loss_gradient_wrt_logits
    output_layer_bias_gradients   = np.sum(loss_gradient_wrt_logits, axis=0, keepdims=True)
    return output_layer_weight_gradients, output_layer_bias_gradients

def backprop_to_hidden(loss_gradient_wrt_logits, output_layer_weights, hidden_layer_linear_output):
    hidden_layer_activation_gradients     = loss_gradient_wrt_logits @ output_layer_weights.T
    hidden_layer_pre_activation_gradients = hidden_layer_activation_gradients * (hidden_layer_linear_output > 0)
    return hidden_layer_pre_activation_gradients

def hidden_layer_param_gradients(input_feature_matrix, hidden_layer_pre_activation_gradients):
    hidden_layer_weight_gradients = input_feature_matrix.T @ hidden_layer_pre_activation_gradients
    hidden_layer_bias_gradients   = np.sum(hidden_layer_pre_activation_gradients, axis=0, keepdims=True)
    return hidden_layer_weight_gradients, hidden_layer_bias_gradients

def apply_parameter_updates(hidden_layer_weights, hidden_layer_bias,
                            output_layer_weights, output_layer_bias,
                            hidden_layer_weight_gradients, hidden_layer_bias_gradients,
                            output_layer_weight_gradients, output_layer_bias_gradients,
                            learning_rate):
    output_layer_weights -= learning_rate * output_layer_weight_gradients
    output_layer_bias    -= learning_rate * output_layer_bias_gradients
    hidden_layer_weights -= learning_rate * hidden_layer_weight_gradients
    hidden_layer_bias    -= learning_rate * hidden_layer_bias_gradients
    return hidden_layer_weights, hidden_layer_bias, output_layer_weights, output_layer_bias
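
A note on how `apply_parameter_updates` behaves: `-=` on a NumPy array modifies that array in place, so the caller's weights change even if the return value is ignored. The sketch below contrasts this with a rebinding assignment; both helper functions are hypothetical and exist only for this comparison.

```python
import numpy as np

def update_in_place(weights, gradients, learning_rate):
    weights -= learning_rate * gradients           # modifies the caller's array
    return weights

def update_with_copy(weights, gradients, learning_rate):
    weights = weights - learning_rate * gradients  # builds a new array; the caller's array is untouched
    return weights

gradients = np.ones((2, 2))

w_a = np.zeros((2, 2))
update_in_place(w_a, gradients, 0.05)   # return value ignored
print(w_a)                              # every entry is now -0.05

w_b = np.zeros((2, 2))
update_with_copy(w_b, gradients, 0.05)  # return value ignored
print(w_b)                              # still all zeros
```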


### Training

Here we train the network for 100 epochs of full-batch gradient descent to obtain the trained weights.

In [326]:
num_epochs = 100
for epoch in range(1, num_epochs + 1):
    # 1) forward pass
    hidden_layer_linear_output = hidden_layer_output_transformation(input_feature_matrix, hidden_layer_weights, hidden_layer_bias)
    hidden_layer_ReLU_output = hidden_ReLU_activation(hidden_layer_linear_output)
    output_layer_linear_output = output_layer_transformation(hidden_layer_ReLU_output, output_layer_weights, output_layer_bias)

    # 2) compute loss
    forward_pass_loss = BCE_loss(output_layer_linear_output, target_labels)
    print(f"Forward pass loss: {forward_pass_loss}")

    # 3) backward pass (compute gradients)
    loss_gradient_wrt_logits = compute_loss_gradient_wrt_logits(output_layer_linear_output, target_labels)
    output_layer_weight_gradients, output_layer_bias_gradients = output_layer_param_gradients(hidden_layer_ReLU_output, loss_gradient_wrt_logits)
    hidden_layer_pre_activation_gradients = backprop_to_hidden(loss_gradient_wrt_logits, output_layer_weights, hidden_layer_linear_output)
    hidden_layer_weight_gradients, hidden_layer_bias_gradients = hidden_layer_param_gradients(input_feature_matrix, hidden_layer_pre_activation_gradients)

    # 4) parameter update (the arrays are modified in place, so the return value is not needed)
    apply_parameter_updates(hidden_layer_weights, hidden_layer_bias,
                            output_layer_weights, output_layer_bias,
                            hidden_layer_weight_gradients, hidden_layer_bias_gradients,
                            output_layer_weight_gradients, output_layer_bias_gradients,
                            learning_rate)


Forward pass loss: 1.672443319506854
Forward pass loss: 1.4576551623423253
Forward pass loss: 1.2916749313892337
Forward pass loss: 1.1566915082738294
Forward pass loss: 1.060687634863001
Forward pass loss: 0.9771363823392609
Forward pass loss: 0.9225672280312348
Forward pass loss: 0.8863186738641593
Forward pass loss: 0.8525682030873303
Forward pass loss: 0.8211639804930416
Forward pass loss: 0.7919575913373185
Forward pass loss: 0.7648046745215458
Forward pass loss: 0.7395656346300895
Forward pass loss: 0.7161063356619484
Forward pass loss: 0.694298706547937
Forward pass loss: 0.6773253259494607
Forward pass loss: 0.6642707813501343
Forward pass loss: 0.6448822325261164
Forward pass loss: 0.6268146012497637
Forward pass loss: 0.6099223186249237
Forward pass loss: 0.5940820523461696
Forward pass loss: 0.579188621124995
Forward pass loss: 0.5651517344764907
Forward pass loss: 0.5519005278851353
Forward pass loss: 0.5393705278894347
Forward pass loss: 0.5274884259662774
... (remaining output truncated)

### Evaluation

Let's check that the loss has decreased and that the trained weights produce accurate classifications.

In [327]:
# Recompute the forward pass with the final trained weights
hidden_layer_linear_output = hidden_layer_output_transformation(input_feature_matrix, hidden_layer_weights, hidden_layer_bias)
hidden_layer_ReLU_output = hidden_ReLU_activation(hidden_layer_linear_output)
output_layer_linear_output = output_layer_transformation(hidden_layer_ReLU_output, output_layer_weights, output_layer_bias)

# forward_pass_loss still holds the loss from the last training epoch (computed before that epoch's update)
print("BCE loss at last training epoch:", forward_pass_loss)

# probs & preds
probs = stable_sigmoid(output_layer_linear_output)
preds = (probs >= 0.5).astype(float)

acc = (preds == target_labels).mean()
print("final loss:", BCE_loss(output_layer_linear_output, target_labels))
print("final accuracy:", acc)
print("probs:\n", probs.ravel())
print("preds:\n", preds.ravel())
print("labels:\n", target_labels.ravel())


BCE loss at last training epoch: 0.2177540982813407
final loss: 0.2162249904805389
final accuracy: 1.0
probs:
 [0.76974516 0.5956664  0.1752454  0.13338143 0.76476179 0.02156242
 0.7811936  0.07437165]
preds:
 [1. 1. 0. 0. 1. 0. 1. 0.]
labels:
 [1. 1. 0. 0. 1. 0. 1. 0.]


As you can see, the model is learning: the loss falls from about 1.67 on the first epoch to about 0.22 after 100 epochs, and the predictions match the expected labels for all eight training samples.