## Backpropagation task

Bronwyn Bowles-King

In [1]:
# Import libraries
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
# Define functions

# Sigmoid activation function and its derivative

def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))


# Tanh activation function and its derivative

def tanh(x):
    return np.tanh(x)


def tanh_derivative(x):
    return 1 - np.tanh(x)**2


In [3]:
np.random.seed(42)

# Generate synthetic data for classification
X = np.random.uniform(-2, 2, size=(1000, 1))  # Creates array X of 1000 random values in range [-2, 2].
y = (X >= 0).astype(int)  # Binary classification label added for X non-negative (class 1) or negative (class 0).

# Preview data
print(X[0:5])
print(y[0:5])

[[-0.50183952]
 [ 1.80285723]
 [ 0.92797577]
 [ 0.39463394]
 [-1.37592544]]
[[0]
 [1]
 [1]
 [1]
 [0]]


In [4]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [5]:
# Initialise model parameters
input_size = 1
hidden_size = 4
output_size = 1

# Weights and biases
W1 = np.random.randn(input_size, hidden_size) * 0.01
W2 = np.random.randn(hidden_size, output_size) * 0.01

b1 = np.zeros((1, hidden_size))
b2 = np.zeros((1, output_size))

# Training parameters
num_iter = 1000
learning_rate = 0.01

In [6]:
# Training loop on 75% of the data

losses = []

for iter in range(num_iter):
    # Forward pass 
    Z1 = np.dot(X_train, W1) + b1
    A1 = sigmoid(Z1)
    Z2 = np.dot(A1, W2) + b2
    y_pred = sigmoid(Z2)

    # Compute loss with binary cross-entropy (BCE)
    loss = -np.mean(y_train * np.log(y_pred + 1e-8) + (1 - y_train) * np.log(1 - y_pred + 1e-8))
    losses.append(loss)

    # Backward pass for BCE + sigmoid
    dZ2 = (y_pred - y_train) / len(y_train) 
    dW2 = np.dot(A1.T, dZ2)
    db2 = np.sum(dZ2, axis=0, keepdims=True)
    dA1 = np.dot(dZ2, W2.T)
    dZ1 = dA1 * sigmoid_derivative(Z1)
    dW1 = np.dot(X_train.T, dZ1)
    db1 = np.sum(dZ1, axis=0, keepdims=True)

    # Update weights and biases
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1

    if iter % 100 == 0:
        print(f'Iteration {iter}, Loss: {loss:.4f}')

Iteration 0, Loss: 0.6931
Iteration 100, Loss: 0.6931
Iteration 200, Loss: 0.6931
Iteration 300, Loss: 0.6931
Iteration 400, Loss: 0.6931
Iteration 500, Loss: 0.6931
Iteration 600, Loss: 0.6931
Iteration 700, Loss: 0.6931
Iteration 800, Loss: 0.6930
Iteration 900, Loss: 0.6930


In [7]:
# Make predictions on 25% test set

Z1_test = np.dot(X_test, W1) + b1
A1_test = sigmoid(Z1_test)
Z2_test = np.dot(A1_test, W2) + b2
y_pred_test = sigmoid(Z2_test)

y_pred_labels = (y_pred_test >= 0.5).astype(int)

# Calculate loss on test set
test_loss = -np.mean(y_test * np.log(y_pred_test + 1e-8) + (1 - y_test) * np.log(1 - y_pred_test + 1e-8))

# Calculate accuracy
accuracy = np.mean(y_pred_labels == y_test)

print(f'Test loss: {test_loss:.4f}')
print(f'Test accuracy: {accuracy * 100:.2f}%')

# Display the first 10 predictions and actual values
print("\nPredictions:", y_pred_labels.flatten()[:11])
print("Actual:", y_test.flatten()[:11])

Test loss: 0.6936
Test accuracy: 47.60%

Predictions: [1 1 1 1 1 1 1 1 1 1 1]
Actual: [0 1 0 0 1 1 0 1 0 1 1]


After running this baseline test, we can see the neural network has performed poorly in predicting class 0 (negative values) as well as in class 1 (positive values). It predicts all classes as 1 or positive values, meaning that it cannot really distinguish between 0 or 1 properly. The accuracy score (47.6%) thus remains around 50%. 

The model above applied the sigmoid function for the forward pass and its derivative for the backwards pass. The parameters were an input size of 1 unit or neuron, hidden layer of 4, and output size of 1. The weights (W1 and W2) are randomised (random.randn) small numbers (0.01) based on the size of the input, hidden and output layers. For the training aspect, the parameters are 1 000 epochs or training loops with a learning rate of 0.01.

We now test the impact of different values for the learning rate and number of iterations. So we will do a few tests below to see how changing one parameter at a time affects the outcome. All the other parameters will be kept the same as the baseline test above. 

**Test with different numbers of epochs**

The MLP training was run again using the same process as before. The results when doubling the number of training epochs to 2 000, as shown below, were the same as the baseline of 47.6% accuracy. For the test using half as many epochs as the baseline (500), the results were again the same (47.6%), but this is not shown below. This indicates that the number of training runs are not the main problem here, at least not on their own. 

In [8]:
# Initialise model parameters
input_size = 1
hidden_size = 4
output_size = 1

# Weights and biases
W1 = np.random.randn(input_size, hidden_size) * 0.01
W2 = np.random.randn(hidden_size, output_size) * 0.01

b1 = np.zeros((1, hidden_size))
b2 = np.zeros((1, output_size))

# Training parameters
num_iter = 2000  # Increase epochs x 2
learning_rate = 0.01

# Training loop on 75% of the data

losses = []

for iter in range(num_iter):
    # Forward pass 
    Z1 = np.dot(X_train, W1) + b1
    A1 = sigmoid(Z1)
    Z2 = np.dot(A1, W2) + b2
    y_pred = sigmoid(Z2)

    # Compute loss with binary cross-entropy (BCE)
    loss = -np.mean(y_train * np.log(y_pred + 1e-8) + (1 - y_train) * np.log(1 - y_pred + 1e-8))
    losses.append(loss)

    # Backward pass for BCE + sigmoid
    dZ2 = (y_pred - y_train) / len(y_train) 
    dW2 = np.dot(A1.T, dZ2)
    db2 = np.sum(dZ2, axis=0, keepdims=True)
    dA1 = np.dot(dZ2, W2.T)
    dZ1 = dA1 * sigmoid_derivative(Z1)
    dW1 = np.dot(X_train.T, dZ1)
    db1 = np.sum(dZ1, axis=0, keepdims=True)

    # Update weights and biases
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1

    if iter % 100 == 0:
        print(f'Iteration {iter}, Loss: {loss:.4f}')

# Make predictions on the 25% test set

Z1_test = np.dot(X_test, W1) + b1
A1_test = sigmoid(Z1_test)
Z2_test = np.dot(A1_test, W2) + b2
y_pred_test = sigmoid(Z2_test)

y_pred_labels = (y_pred_test >= 0.5).astype(int)

# Calculate loss on test set
test_loss = -np.mean(y_test * np.log(y_pred_test + 1e-8) + (1 - y_test) * np.log(1 - y_pred_test + 1e-8))

# Calculate accuracy
accuracy = np.mean(y_pred_labels == y_test)

print(f'Test loss: {test_loss:.4f}')
print(f'Test accuracy: {accuracy * 100:.2f}%')

Iteration 0, Loss: 0.6931
Iteration 100, Loss: 0.6931
Iteration 200, Loss: 0.6931
Iteration 300, Loss: 0.6931
Iteration 400, Loss: 0.6930
Iteration 500, Loss: 0.6930
Iteration 600, Loss: 0.6930
Iteration 700, Loss: 0.6929
Iteration 800, Loss: 0.6929
Iteration 900, Loss: 0.6928
Iteration 1000, Loss: 0.6927
Iteration 1100, Loss: 0.6926
Iteration 1200, Loss: 0.6925
Iteration 1300, Loss: 0.6923
Iteration 1400, Loss: 0.6920
Iteration 1500, Loss: 0.6917
Iteration 1600, Loss: 0.6913
Iteration 1700, Loss: 0.6908
Iteration 1800, Loss: 0.6902
Iteration 1900, Loss: 0.6893
Test loss: 0.6888
Test accuracy: 61.20%


**Test with different learning rates**

We will now try a slower learning rate of 0.001 and a faster rate of 0.1. The training runs will return to 1 000. Instead of improving, the neura network continues to predict all classes as positive numbers with a slower learning rate of 0.001. However, a faster learning rate, as shown below, helps as the accuracy is now very high at 98.8%. 

The network thus benefits from a faster rate as it is able to find a good solution more quickly, but loses accuracy when training is too slow. This may be because the dataset is fairly simple. The learning rate of 0.1 is not necessarily the best one for this network when it comes to unseen data as other factors can affect the real performance of the network. 

In [9]:
input_size = 1
hidden_size = 4
output_size = 1

W1 = np.random.randn(input_size, hidden_size) * 0.01
W2 = np.random.randn(hidden_size, output_size) * 0.01

b1 = np.zeros((1, hidden_size))
b2 = np.zeros((1, output_size))

num_iter = 1000
learning_rate = 0.1  # Faster learning rate

losses = []

for iter in range(num_iter):
    Z1 = np.dot(X_train, W1) + b1
    A1 = sigmoid(Z1)
    Z2 = np.dot(A1, W2) + b2
    y_pred = sigmoid(Z2)

    loss = -np.mean(y_train * np.log(y_pred + 1e-8) + (1 - y_train) * np.log(1 - y_pred + 1e-8))
    losses.append(loss)

    dZ2 = (y_pred - y_train) / len(y_train) 
    dW2 = np.dot(A1.T, dZ2)
    db2 = np.sum(dZ2, axis=0, keepdims=True)
    dA1 = np.dot(dZ2, W2.T)
    dZ1 = dA1 * sigmoid_derivative(Z1)
    dW1 = np.dot(X_train.T, dZ1)
    db1 = np.sum(dZ1, axis=0, keepdims=True)

    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1

    if iter % 100 == 0:
        print(f'Iteration {iter}, Loss: {loss:.4f}')

Z1_test = np.dot(X_test, W1) + b1
A1_test = sigmoid(Z1_test)
Z2_test = np.dot(A1_test, W2) + b2
y_pred_test = sigmoid(Z2_test)

y_pred_labels = (y_pred_test >= 0.5).astype(int)

test_loss = -np.mean(y_test * np.log(y_pred_test + 1e-8) + (1 - y_test) * np.log(1 - y_pred_test + 1e-8))

accuracy = np.mean(y_pred_labels == y_test)

print(f'Test loss: {test_loss:.4f}')
print(f'Test accuracy: {accuracy * 100:.2f}%')

Iteration 0, Loss: 0.6932
Iteration 100, Loss: 0.6931
Iteration 200, Loss: 0.6925
Iteration 300, Loss: 0.6864
Iteration 400, Loss: 0.6333
Iteration 500, Loss: 0.4581
Iteration 600, Loss: 0.2952
Iteration 700, Loss: 0.2075
Iteration 800, Loss: 0.1598
Iteration 900, Loss: 0.1311
Test loss: 0.1188
Test accuracy: 98.80%


**Test with different numbers of nodes**

The learning rate is returned to 0.01 as is the case for the baseline network. When different numbers of nodes are used for the hidden layer, whether more or fewer than 4, the network does not necessarily perform better, if at all. 

With twice as many (8) nodes, the score is still low at 47.6%, and the same as the previous baseline test, as shown below. Increasing the number of nodes even further does not improve the accuracy above around 50%. With half as many nodes as the baseline (2), the accuracy score is slightly better but not encouraging at 51.2% (not shown below). 

In [10]:
input_size = 1
hidden_size = 8  # Adjusted nodes
output_size = 1

W1 = np.random.randn(input_size, hidden_size) * 0.01
W2 = np.random.randn(hidden_size, output_size) * 0.01

b1 = np.zeros((1, hidden_size))
b2 = np.zeros((1, output_size))

num_iter = 1000
learning_rate = 0.01

losses = []

for iter in range(num_iter):
    Z1 = np.dot(X_train, W1) + b1
    A1 = sigmoid(Z1)
    Z2 = np.dot(A1, W2) + b2
    y_pred = sigmoid(Z2)

    loss = -np.mean(y_train * np.log(y_pred + 1e-8) + (1 - y_train) * np.log(1 - y_pred + 1e-8))
    losses.append(loss)

    dZ2 = (y_pred - y_train) / len(y_train) 
    dW2 = np.dot(A1.T, dZ2)
    db2 = np.sum(dZ2, axis=0, keepdims=True)
    dA1 = np.dot(dZ2, W2.T)
    dZ1 = dA1 * sigmoid_derivative(Z1)
    dW1 = np.dot(X_train.T, dZ1)
    db1 = np.sum(dZ1, axis=0, keepdims=True)

    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1

    if iter % 100 == 0:
        print(f'Iteration {iter}, Loss: {loss:.4f}')

Z1_test = np.dot(X_test, W1) + b1
A1_test = sigmoid(Z1_test)
Z2_test = np.dot(A1_test, W2) + b2
y_pred_test = sigmoid(Z2_test)

y_pred_labels = (y_pred_test >= 0.5).astype(int)

test_loss = -np.mean(y_test * np.log(y_pred_test + 1e-8) + (1 - y_test) * np.log(1 - y_pred_test + 1e-8))

accuracy = np.mean(y_pred_labels == y_test)

print(f'Test loss: {test_loss:.4f}')
print(f'Test accuracy: {accuracy * 100:.2f}%')

Iteration 0, Loss: 0.6931
Iteration 100, Loss: 0.6931
Iteration 200, Loss: 0.6931
Iteration 300, Loss: 0.6931
Iteration 400, Loss: 0.6930
Iteration 500, Loss: 0.6930
Iteration 600, Loss: 0.6930
Iteration 700, Loss: 0.6929
Iteration 800, Loss: 0.6929
Iteration 900, Loss: 0.6928
Test loss: 0.6933
Test accuracy: 47.60%


**Change the activation function in the hidden layer** 

The hyperbolic tangent (tanh) function is an alternative to the sigmoid function. It is similar to the sigmoid, but gives output values between -1 and 1, and it has a simple derivative calculation (Geeks4Geeks, 2025; Khan, 2024). The tanh function is tested for the hidden layer, but the outer layer still uses the sigmoid function. All other parameters are returned to the baseline. 

The network performs surprisingly well, showing an accuracy of 98.8%, despite the slower learning rate that has so far not been successful (0.01). The network thus benefits from combining two different activation functions, sigmoid and tanh. Alternatively, a faster learning rate can be used for the network. In the last code chunk, I will combine both the faster learning rate and use tanh in the hidden layer. 

In [11]:
input_size = 1
hidden_size = 4
output_size = 1

W1 = np.random.randn(input_size, hidden_size) * 0.01
W2 = np.random.randn(hidden_size, output_size) * 0.01

b1 = np.zeros((1, hidden_size))
b2 = np.zeros((1, output_size))

num_iter = 1000
learning_rate = 0.01

losses = []

for iter in range(num_iter):
    Z1 = np.dot(X_train, W1) + b1
    A1 = tanh(Z1)  # tanh for hidden layers
    Z2 = np.dot(A1, W2) + b2
    y_pred = sigmoid(Z2)  # sigmoid for output

    # Clip predictions to ensure values are kept in a stable range
    y_pred = np.clip(y_pred, 1e-8, 1 - 1e-8)

    loss = -np.mean(y_train * np.log(y_pred) + (1 - y_train) * np.log(1 - y_pred))
    losses.append(loss)

    # Backward pass for BCE + tanh
    dZ2 = (y_pred - y_train) / len(y_train) 
    dW2 = np.dot(A1.T, dZ2)
    db2 = np.sum(dZ2, axis=0, keepdims=True)
    dA1 = np.dot(dZ2, W2.T)
    dZ1 = dA1 * tanh_derivative(A1)  
    dW1 = np.dot(X_train.T, dZ1)
    db1 = np.sum(dZ1, axis=0, keepdims=True)

    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1

    if iter % 100 == 0:
        print(f'Iteration {iter}, Loss: {loss:.4f}')

Z1_test = np.dot(X_test, W1) + b1
A1_test = tanh(Z1_test)
Z2_test = np.dot(A1_test, W2) + b2
y_pred_test = sigmoid(Z2_test)

y_pred_labels = (y_pred_test >= 0.5).astype(int)

test_loss = -np.mean(y_test * np.log(y_pred_test + 1e-8) + (1 - y_test) * np.log(1 - y_pred_test + 1e-8))

accuracy = np.mean(y_pred_labels == y_test)

print(f'Test loss: {test_loss:.4f}')
print(f'Test accuracy: {accuracy * 100:.2f}%')

Iteration 0, Loss: 0.6928
Iteration 100, Loss: 0.6923
Iteration 200, Loss: 0.6907
Iteration 300, Loss: 0.6866
Iteration 400, Loss: 0.6758
Iteration 500, Loss: 0.6501
Iteration 600, Loss: 0.5983
Iteration 700, Loss: 0.5191
Iteration 800, Loss: 0.4293
Iteration 900, Loss: 0.3490
Test loss: 0.2940
Test accuracy: 98.80%


**Faster learning rate and tanh activation function in the hidden layer** 

Combining the faster learning rate of 0.1 and applying tanh as the activation function in the hidden layer has now achieved 99.6% accuracy for the network. This is not a major improvement over using only one of these at a time, but still shows that the two parameters work well for this small network.

In [12]:
input_size = 1
hidden_size = 4
output_size = 1

W1 = np.random.randn(input_size, hidden_size) * 0.01
W2 = np.random.randn(hidden_size, output_size) * 0.01

b1 = np.zeros((1, hidden_size))
b2 = np.zeros((1, output_size))

num_iter = 1000
learning_rate = 0.1  # Faster learning rate

losses = []

for iter in range(num_iter):
    Z1 = np.dot(X_train, W1) + b1
    A1 = tanh(Z1)  # tanh for hidden layers
    Z2 = np.dot(A1, W2) + b2
    y_pred = sigmoid(Z2)  # sigmoid for output

    # Clip predictions to ensure values are kept in a stable range
    y_pred = np.clip(y_pred, 1e-8, 1 - 1e-8)

    loss = -np.mean(y_train * np.log(y_pred) + (1 - y_train) * np.log(1 - y_pred))
    losses.append(loss)

    # Backward pass for BCE + tanh
    dZ2 = (y_pred - y_train) / len(y_train) 
    dW2 = np.dot(A1.T, dZ2)
    db2 = np.sum(dZ2, axis=0, keepdims=True)
    dA1 = np.dot(dZ2, W2.T)
    dZ1 = dA1 * tanh_derivative(A1)  
    dW1 = np.dot(X_train.T, dZ1)
    db1 = np.sum(dZ1, axis=0, keepdims=True)

    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1

    if iter % 100 == 0:
        print(f'Iteration {iter}, Loss: {loss:.4f}')

Z1_test = np.dot(X_test, W1) + b1
A1_test = tanh(Z1_test)
Z2_test = np.dot(A1_test, W2) + b2
y_pred_test = sigmoid(Z2_test)

y_pred_labels = (y_pred_test >= 0.5).astype(int)

test_loss = -np.mean(y_test * np.log(y_pred_test + 1e-8) + (1 - y_test) * np.log(1 - y_pred_test + 1e-8))

accuracy = np.mean(y_pred_labels == y_test)

print(f'Test loss: {test_loss:.4f}')
print(f'Test accuracy: {accuracy * 100:.2f}%')

Iteration 0, Loss: 0.6931
Iteration 100, Loss: 0.3803
Iteration 200, Loss: 0.1064
Iteration 300, Loss: 0.0694
Iteration 400, Loss: 0.0548
Iteration 500, Loss: 0.0467
Iteration 600, Loss: 0.0415
Iteration 700, Loss: 0.0377
Iteration 800, Loss: 0.0349
Iteration 900, Loss: 0.0327
Test loss: 0.0352
Test accuracy: 99.60%


**References**

Geeks4Geeks. (2025). Tanh Activation in Neural Network. https://www.geeksforgeeks.org/deep-learning/tanh-activation-in-neural-network

HyperionDev. (2025). Learning Algorithms. Private repository, GitHub.

HyperionDev. (2025). Neural Networks. Private repository, GitHub.

Khan, M. A. (2024). The Heart of Neural Networks: Understanding Activation Functions. Medium. https://medium.com/@mohammedashfaqkhan000/the-heart-of-neural-networks-understanding-activation-functions-298b86cddd99