## **Proof of BP3**

BP3 relates the partial derivative of the cost function with respect to the biases in the network to the error term of the corresponding neuron. Specifically, it establishes that this derivative is equal to the error term itself:

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j$$


This equation simplifies the computation of the gradient of the cost function with respect to the biases: once backpropagation has computed the error terms, the bias gradients are available with no further work.

### **Derivation**

To derive BP3, we start with the definition of the error term $\delta_j^l$ for a neuron in layer $l$:

$$\delta^l_j = \frac{\partial C}{\partial z^l_j}$$

where $z_j^l$ is the weighted input to the neuron. Given that $z_j^l$ is directly affected by the bias $b_j^l$ (since $z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l$), the derivative of $z_j^l$ with respect to $b_j^l$ is 1:

$$\frac{\partial z^l_j}{\partial b^l_j} = 1$$

Applying the chain rule, we find that:

$$\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j} \cdot \frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j \cdot 1 = \delta^l_j$$

This completes the proof that the gradient of the cost function with respect to any bias in the network equals the error term of the corresponding neuron.
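
As a sanity check, BP3 can be compared against a finite-difference estimate of the bias gradient. The sketch below is illustrative only: it assumes a single sigmoid layer with a quadratic cost, and the names `cost`, `numeric_grad_b`, and `bp3_max_error` are made up for this example rather than taken from the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, a_prev, y):
    # Assumed quadratic cost for one layer: C = 0.5 * ||sigmoid(w a_prev + b) - y||^2
    a = sigmoid(w @ a_prev + b)
    return 0.5 * np.sum((a - y) ** 2)

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 4))      # weights of a 4-input, 3-neuron layer
b = rng.normal(size=3)           # one bias per neuron
a_prev = rng.normal(size=4)      # activations feeding the layer
y = np.array([0.0, 1.0, 0.0])    # target output

# Error term delta_j = dC/dz_j for this output layer (quadratic cost, sigmoid)
a = sigmoid(w @ a_prev + b)
delta = (a - y) * a * (1 - a)

# Central finite differences: perturb one bias at a time
eps = 1e-6
numeric_grad_b = np.zeros_like(b)
for j in range(b.size):
    step = np.zeros_like(b)
    step[j] = eps
    numeric_grad_b[j] = (cost(w, b + step, a_prev, y)
                         - cost(w, b - step, a_prev, y)) / (2 * eps)

# BP3 predicts that the numerical bias gradient matches delta
bp3_max_error = np.max(np.abs(numeric_grad_b - delta))
print(bp3_max_error)
```

If BP3 holds, the printed discrepancy should be tiny, limited only by the finite-difference step and floating-point round-off.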

## **Proof of BP4**

BP4 explains how to compute the gradient of the cost function with respect to any weight in the network. It shows that this gradient is the product of the error term of the neuron receiving the weight and the activation of the neuron providing the input:

$$\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l$$

This relationship is crucial for updating the weights in the network during the learning process.
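
For context, a sketch of the gradient descent update that this gradient feeds into, assuming a learning rate $\eta$ and a mini-batch of $m$ training examples $x$ (these symbols are introduced here for illustration; they correspond to `eta` and the mini-batch length in the notebook code):

$$w_{jk}^l \rightarrow w_{jk}^l - \frac{\eta}{m} \sum_x a_k^{x,l-1} \delta_j^{x,l}$$

Here $a_k^{x,l-1}$ and $\delta_j^{x,l}$ denote the activation and error term computed for training example $x$.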

### **Derivation**

Consider the effect of a weight $w_{jk}^l$ on the cost function $C$. This weight influences $C$ through its effect on the weighted input $z_j^l$, which then affects the activation $a_j^l$, and subsequently the overall cost $C$:

$$z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l$$

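Given the activation function $\sigma$, the activation $a_j^l$ is $\sigma(z_j^l)$. In the sum defining $z_j^l$, only the term $w_{jk}^l a_k^{l-1}$ depends on the particular weight $w_{jk}^l$, so differentiating $z_j^l$ with respect to $w_{jk}^l$ leaves just the activation $a_k^{l-1}$ from the previous layer: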

$$\frac{\partial z_j^l}{\partial w_{jk}^l} = a_k^{l-1}$$


Using the definition of the error term $\delta_j^l = \frac{\partial C}{\partial z_j^l}$, and applying the chain rule, we can express the partial derivative of $C$ with respect to $w_{jk}^l$ as:

$$\frac{\partial C}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l} \cdot \frac{\partial z_j^l}{\partial w_{jk}^l} = \delta_j^l \cdot a_k^{l-1}$$


This establishes BP4: the rate of change of the cost function with respect to a weight is the product of the error term of the neuron receiving that weight and the activation of the neuron providing the input.
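
In matrix terms, BP4 says that the weight gradient for one training example is the outer product of the error vector $\delta^l$ and the previous-layer activation vector $a^{l-1}$. The sketch below, which uses random placeholder values for `delta` and `a_prev` rather than values from a real network, checks that summing these per-example outer products over a mini-batch gives the same result as a single matrix product, which is the form the matrix-based code in this notebook relies on.

```python
import numpy as np

rng = np.random.default_rng(1)
batch_size, n_in, n_out = 5, 4, 3

# Placeholder values: row m holds the vectors for training example m
a_prev = rng.normal(size=(batch_size, n_in))   # activations a^{l-1}
delta = rng.normal(size=(batch_size, n_out))   # error terms delta^l

# BP4 per example: grad[j, k] = delta_j * a_prev_k, an outer product
per_example = [np.outer(delta[m], a_prev[m]) for m in range(batch_size)]
summed = np.sum(per_example, axis=0)

# Matrix form: one product sums over the whole mini-batch at once
batched = delta.T @ a_prev

bp4_shapes_match = summed.shape == (n_out, n_in)
bp4_forms_agree = np.allclose(summed, batched)
print(bp4_shapes_match, bp4_forms_agree)
```

The gradient has the same shape as the weight matrix, `(n_out, n_in)`, so it can be subtracted from the weights directly in a gradient descent step.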

In [None]:
# Matrix-based approach to backpropagation over a mini-batch

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Activation function and its derivative
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

# Loading and preprocessing the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

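# One-hot encode the targets. OneHotEncoder returns a sparse matrix by default,
# so convert it to a dense array (the `sparse=False` argument was removed in
# newer scikit-learn releases).
encoder = OneHotEncoder()
y_onehot = encoder.fit_transform(y.reshape(-1, 1)).toarray()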

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_onehot, test_size=0.2, random_state=42)

# Augment the input data with a bias term
X_train_augmented = np.hstack((np.ones((X_train.shape[0], 1)), X_train))
X_test_augmented = np.hstack((np.ones((X_test.shape[0], 1)), X_test))

class MatrixBasedNN:
    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        for w in self.weights:
            a = sigmoid(np.dot(a, w.T))
        return a

    def backprop(self, x, y):
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        activation = x
        activations = [x]
        zs = []
        for w in self.weights:
            z = np.dot(activation, w.T)
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)

        # Backward pass
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        nabla_w[-1] = np.dot(delta.T, activations[-2])

        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(delta, self.weights[-l+1]) * sp
            nabla_w[-l] = np.dot(delta.T, activations[-l-1])

        return nabla_w

    def update_mini_batch(self, mini_batch, eta):
        # mini_batch is a tuple (X_batch, y_batch) with one example per row
        X_batch, y_batch = mini_batch
        # A single matrix-based backprop call returns gradients summed over the batch
        nabla_w = self.backprop(X_batch, y_batch)
        self.weights = [w-(eta/len(X_batch))*nw for w, nw in zip(self.weights, nabla_w)]

    def train(self, X, y, epochs, mini_batch_size, eta):
        n = len(X)
        for j in range(epochs):
            permutation = np.random.permutation(n)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]
            mini_batches = [(X_shuffled[k:k+mini_batch_size], y_shuffled[k:k+mini_batch_size])
                            for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)

# Initialize and train the network
net = MatrixBasedNN([5, 3, 3])  # Including the bias input
net.train(X_train_augmented, y_train, epochs=1000, mini_batch_size=10, eta=1.0)

# Example: Evaluate the network
predictions = net.feedforward(X_test_augmented)
predicted_classes = np.argmax(predictions, axis=1)
true_classes = np.argmax(y_test, axis=1)
accuracy = np.mean(predicted_classes == true_classes)
print(f"Test Accuracy: {accuracy * 100:.2f}%")
