1. **Proof for Equation (BP3)**

Equation to prove: $\partial C / \partial b^l_j = \delta^l_j$



**Proof:**

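In backpropagation, the error term $\delta^l_j$ of the $j$th neuron in layer $l$ is defined as:

In [None]:
$$\delta^l_j = \frac{\partial C}{\partial z^l_j}$$


That is, $\delta^l_j$ is the rate of change of the cost with respect to the weighted input $z^l_j$ of that neuron.

The factor $\sigma'(z^l_j)$, the derivative of the activation function, is not part of this definition. It appears only when the error is computed for the output layer $L$, where $\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)$.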

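Take the partial derivative of the cost function with respect to the bias $b^l_j$ in layer $l$. The cost depends on $b^l_j$ only through the weighted input $z^l_j$, so the chain rule gives:


In [None]:
$$\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial b^l_j}$$

The second factor, $\partial z^l_j / \partial b^l_j$, is the derivative of the weighted input $z^l_j$ with respect to the bias $b^l_j$.

Since $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$ and the bias enters this sum with coefficient 1, the derivative is 1.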


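Substituting $\partial z^l_j / \partial b^l_j = 1$ into the chain-rule expression gives:

In [None]:
$$\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j}$$


The right-hand side is the error term $\delta^l_j = \partial C / \partial z^l_j$ of the corresponding neuron, so:

In [None]:
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j$$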
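The result can be checked numerically. The following is a minimal sketch, assuming a single sigmoid layer and the quadratic cost (neither is required by the proof); the layer sizes, the random seed, and the helper name `cost` are chosen only for illustration. It compares a central finite-difference estimate of $\partial C / \partial b^l_j$ with $\delta^l_j$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, a_prev, y):
    """Quadratic cost of one sigmoid layer (an assumption for this check)."""
    a = sigmoid(w @ a_prev + b)
    return 0.5 * np.sum((a - y) ** 2)

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 3))       # hypothetical layer: 3 inputs, 2 neurons
b = rng.normal(size=(2, 1))
a_prev = rng.normal(size=(3, 1))  # activations of the previous layer
y = rng.normal(size=(2, 1))

# Error term delta = dC/dz; for the quadratic cost this is (a - y) * sigma'(z)
a = sigmoid(w @ a_prev + b)
delta = (a - y) * a * (1 - a)

# Central finite-difference estimate of dC/db_j for each neuron j
eps = 1e-6
grad_b = np.zeros_like(b)
for j in range(b.shape[0]):
    step = np.zeros_like(b)
    step[j] = eps
    grad_b[j] = (cost(w, b + step, a_prev, y)
                 - cost(w, b - step, a_prev, y)) / (2 * eps)

bp3_holds = np.allclose(grad_b, delta, atol=1e-7)
print(bp3_holds)
```

The comparison uses a tolerance because finite differences are only approximate.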

2. **Proof for Equation (BP4)**

Equation to prove: $\partial C / \partial w^l_{jk} = a^{l-1}_k \delta^l_j$

**Proof:**

As in the proof of (BP3), the error term $\delta^l_j$ of the $j$th neuron in layer $l$ is defined as:

In [None]:
$$\delta^l_j = \frac{\partial C}{\partial z^l_j}$$

It is the rate of change of the cost with respect to the weighted input $z^l_j$.

The derivative of the activation function, $\sigma'(z^l_j)$, is not part of this definition; it is needed only to evaluate $\delta^l_j$ in practice.

The weighted input $z^l_j$ can be expressed in terms of the weights and biases:

In [None]:
$$z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$$


This expression represents the weighted sum of inputs from the neurons in layer l−1, plus the bias.

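Take the partial derivative of the cost with respect to the weight $w^l_{jk}$. The cost depends on $w^l_{jk}$ only through the weighted input $z^l_j$, so the chain rule gives:

In [None]:
$$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}}$$

The second factor follows from the expression for $z^l_j$. Only the $k$th term of the sum contains $w^l_{jk}$, and it is linear in that weight:

In [None]:
$$\frac{\partial z^l_j}{\partial w^l_{jk}} = a^{l-1}_k$$

Substituting back into the chain-rule expression:

In [None]:
$$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \, a^{l-1}_k$$

By the definition of the error term, $\partial C / \partial z^l_j = \delta^l_j$, so:

In [None]:
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$$

For the output layer $L$ with the quadratic cost $C = \frac{1}{2} \sum_j (a^L_j - y_j)^2$, the error term is $\delta^L_j = (a^L_j - y_j) \sigma'(z^L_j)$, so the gradient becomes $(a^L_j - y_j) \sigma'(z^L_j) a^{L-1}_k$. Note that $\partial a^L_j / \partial w^L_{jk} = \sigma'(z^L_j) a^{L-1}_k$; the factor $\sigma'(z^L_j)$ must not be dropped.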
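For a whole layer, (BP4) says that the weight gradient is the outer product of the error vector and the previous layer's activations. The sketch below checks this numerically, again assuming a single sigmoid layer with the quadratic cost; the sizes, the seed, and the helper name `cost` are illustrative assumptions, not part of the proof.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, a_prev, y):
    """Quadratic cost of one sigmoid layer (an assumption for this check)."""
    a = sigmoid(w @ a_prev + b)
    return 0.5 * np.sum((a - y) ** 2)

rng = np.random.default_rng(1)
w = rng.normal(size=(2, 3))       # hypothetical layer: 3 inputs, 2 neurons
b = rng.normal(size=(2, 1))
a_prev = rng.normal(size=(3, 1))
y = rng.normal(size=(2, 1))

# Error term delta = dC/dz, shape (2, 1)
a = sigmoid(w @ a_prev + b)
delta = (a - y) * a * (1 - a)

# BP4 for all (j, k) at once: entry (j, k) equals a_prev[k] * delta[j]
grad_w_bp4 = delta @ a_prev.T

# Central finite-difference estimate of dC/dw_jk, one weight at a time
eps = 1e-6
grad_w_fd = np.zeros_like(w)
for j in range(w.shape[0]):
    for k in range(w.shape[1]):
        step = np.zeros_like(w)
        step[j, k] = eps
        grad_w_fd[j, k] = (cost(w + step, b, a_prev, y)
                           - cost(w - step, b, a_prev, y)) / (2 * eps)

bp4_holds = np.allclose(grad_w_bp4, grad_w_fd, atol=1e-7)
print(bp4_holds)
```

The outer-product form `delta @ a_prev.T` is the same expression that appears as `np.dot(delta, activations[-2].transpose())` in per-example backpropagation code.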


3. The fully matrix-based approach to backpropagation over a mini-batch operates on whole matrices instead of looping over individual training examples. The mini-batch is stored as a matrix whose columns are the training examples, so the forward and backward passes each take one matrix multiplication per layer rather than one per example. This replaces the Python-level loop with vectorized operations, which is where the speed-up comes from.
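As a small illustration of the idea, the sketch below computes the weighted inputs of one layer both ways and confirms that they agree. The layer size, the number of examples, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # hypothetical layer: 3 inputs, 4 neurons
b = rng.normal(size=(4, 1))
examples = [rng.normal(size=(3, 1)) for _ in range(5)]

# Loop-based: one matrix-vector product per training example
z_loop = [W @ x + b for x in examples]

# Matrix-based: stack the examples as columns and multiply once.
# The bias column vector b broadcasts across all 5 columns.
X = np.hstack(examples)       # shape (3, 5)
Z = W @ X + b                 # shape (4, 5)

# Column i of Z is the weighted input for example i
same = all(np.allclose(Z[:, [i]], z_loop[i]) for i in range(len(examples)))
print(same)
```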

Load the Iris dataset: load the data and add a constant column $x_0 = 1$ to the input features.

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

In [10]:
# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

In [11]:
# Add a constant column to X
X = np.column_stack((np.ones(X.shape[0]), X))

In [12]:
# Convert labels to one-hot encoding
y_one_hot = np.eye(3)[y]


In [13]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_one_hot, test_size=0.2, random_state=42)

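Modify the Neural Network Class


The update_mini_batch_matrix method is where the matrix-based backpropagation algorithm is implemented.

In the matrix-based approach, the input mini-batch is transposed so that each column of X is one training example, which lets a layer's weighted inputs be computed as a single matrix product. Assume that the input variables are augmented with a column of 1s, so that the first-layer bias can be treated as an extra weight w0 multiplying the constant input x0 = 1.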
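The effect of the augmentation can be sketched as follows. This is an illustration under assumed shapes, with hypothetical names `W_aug` and `X_aug`: prepending the bias to the weight matrix and a row of 1s to the transposed inputs gives the same weighted inputs as adding the bias explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # hypothetical layer: 4 features, 3 neurons
b = rng.normal(size=(3, 1))
X = rng.normal(size=(4, 6))   # 6 examples stored as columns

# Absorb the bias: w0 becomes the first column of the weight matrix,
# and x0 = 1 becomes the first row of the (transposed) input matrix.
W_aug = np.hstack([b, W])                          # shape (3, 5)
X_aug = np.vstack([np.ones((1, X.shape[1])), X])   # shape (5, 6)

absorbed = np.allclose(W_aug @ X_aug, W @ X + b)
print(absorbed)
```

In the notebook the examples are stored as rows, so the constant appears there as a column of 1s; after transposing it becomes the row of 1s used here.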

In [14]:
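import numpy as np


# sigmoid and sigmoid_prime are module-level functions so that the
# Network methods can call them by name.
def sigmoid(z):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid, applied elementwise."""
    return sigmoid(z) * (1 - sigmoid(z))


class Network(object):
    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]

    def update_mini_batch_matrix(self, mini_batch, eta):
        """Update the network's weights and biases using matrix-based
        backpropagation for a single mini-batch."""
        mini_batch_size = len(mini_batch)

        # Stack inputs and labels column-wise: one column per example
        X = np.column_stack([np.ravel(x) for x, y in mini_batch])
        Y = np.column_stack([np.ravel(y) for x, y in mini_batch])

        # Forward propagation for the whole mini-batch at once
        activation = X
        activations = [X]  # activation matrices, layer by layer
        zs = []            # weighted-input matrices, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation) + b  # b broadcasts across the columns
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)

        # Backward propagation: each column of delta is one example's error
        nabla_b = [None] * len(self.biases)
        nabla_w = [None] * len(self.weights)
        delta = self.cost_derivative(activations[-1], Y) * sigmoid_prime(zs[-1])
        nabla_b[-1] = delta.sum(axis=1, keepdims=True)  # summed over the batch
        nabla_w[-1] = np.dot(delta, activations[-2].T)  # summed over the batch
        for l in range(2, self.num_layers):
            delta = np.dot(self.weights[-l+1].T, delta) * sigmoid_prime(zs[-l])
            nabla_b[-l] = delta.sum(axis=1, keepdims=True)
            nabla_w[-l] = np.dot(delta, activations[-l-1].T)

        # Gradient-descent step with the gradient averaged over the mini-batch
        self.weights = [w - (eta / mini_batch_size) * nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta / mini_batch_size) * nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple "(nabla_b, nabla_w)" representing the gradient for
        the cost function C_x for a single example. "nabla_b" and "nabla_w"
        are layer-by-layer lists of numpy arrays, similar to "self.biases"
        and "self.weights"."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())

        for l in range(2, self.num_layers):  # range replaces Python 2's xrange
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def cost_derivative(self, output_activations, y):
        """Return the partial derivatives of the quadratic cost with respect
        to the output activations."""
        return (output_activations - y)


# Random toy data: 100 examples with 3 inputs and 2 outputs each
training_data = [(np.random.randn(3, 1), np.random.randn(2, 1)) for _ in range(100)]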

The function update_mini_batch_matrix receives a mini-batch and a learning rate (eta) as parameters. It first extracts the inputs X and the labels Y from the mini-batch as matrices with one column per training example. It then runs forward propagation with matrix multiplications, computes the output error delta (the derivative of the cost with respect to the output activations, multiplied elementwise by sigmoid_prime of the weighted inputs), and obtains the weight gradient nabla_w as a matrix product of delta with the transposed input activations. Finally, it updates the weights using the gradient averaged over the mini-batch. Because the per-example loop is replaced by a few matrix operations, most of the work is done inside NumPy's optimized routines rather than in Python.
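One way to gain confidence in a matrix-based implementation is to compare its gradients with those accumulated example by example. The sketch below does this with a standalone helper, `backprop`, written for this check; it is an assumption-laden illustration (sigmoid activations, quadratic cost, a hypothetical 3-4-2 network), not the class method itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop(weights, biases, X, Y):
    """Gradients summed over the columns of X (one example per column).
    Assumes sigmoid activations and the quadratic cost."""
    activations, zs = [X], []
    for w, b in zip(weights, biases):
        zs.append(w @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    delta = (activations[-1] - Y) * sigmoid_prime(zs[-1])
    nabla_w = [None] * len(weights)
    nabla_b = [None] * len(biases)
    for l in range(1, len(weights) + 1):
        if l > 1:
            # Propagate the error one layer back
            delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta.sum(axis=1, keepdims=True)
        nabla_w[-l] = delta @ activations[-l - 1].T
    return nabla_w, nabla_b

rng = np.random.default_rng(0)
sizes = [3, 4, 2]  # hypothetical network
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=(n_out, 1)) for n_out in sizes[1:]]
X = rng.normal(size=(3, 10))  # mini-batch of 10 examples
Y = rng.normal(size=(2, 10))

# Matrix-based: a single call on the whole mini-batch
nw_matrix, nb_matrix = backprop(weights, biases, X, Y)

# Loop-based: one call per example, gradients accumulated
nw_loop = [np.zeros_like(w) for w in weights]
nb_loop = [np.zeros_like(b) for b in biases]
for i in range(X.shape[1]):
    nw_i, nb_i = backprop(weights, biases, X[:, [i]], Y[:, [i]])
    nw_loop = [acc + g for acc, g in zip(nw_loop, nw_i)]
    nb_loop = [acc + g for acc, g in zip(nb_loop, nb_i)]

gradients_match = all(
    np.allclose(g_matrix, g_loop)
    for g_matrix, g_loop in zip(nw_matrix + nb_matrix, nw_loop + nb_loop)
)
print(gradients_match)
```

The two results agree up to floating-point rounding because the matrix product `delta @ activations.T` is exactly the sum of the per-example outer products.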