# HW2: Problem 2: Working out Backpropagation

Read Chapter 2 of Michael Nielsen's article/book from top to bottom:

* [http://neuralnetworksanddeeplearning.com/chap2.html](http://neuralnetworksanddeeplearning.com/chap2.html)

He outlines a few exercises in that article which you must complete. Complete parts a, b, and c below:

a. He invites you to write out a proof of equation BP3

b. He invites you to write out a proof of equation BP4

c. He proposes that you code a fully matrix-based approach to backpropagation over a mini-batch. Implement this, with explanation, changing the notation so that instead of having a separate bias term, you assume that the input variables are augmented with a "column" of "1"s and that the bias becomes the weight $w_0$.

Your submission should be a single Jupyter notebook. For the proofs in "a." and "b.", use markdown cells with LaTeX for the equations, and include text that explains your steps. Part "c." is an implementation problem: you need to understand and modify the code that Michael Nielsen provides so that it uses a matrix-based approach. Again, don't keep the biases separate. After reading the data in (use the iris data set), create a new column corresponding to $x_0=1$; as mentioned above and discussed in class (see notes), the bias term can then be treated as a weight $w_0$. Again, use markdown cells around your code, and comments within it, to explain your work. Test the code on the iris data set with 4 input nodes (5 with the constant 1), three hidden nodes, and three output nodes, one for each species/class.
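
The augmentation described above can be illustrated with a small sketch. It uses synthetic data rather than the iris set, and the names `X_aug` and `W_aug` are hypothetical, chosen only for this example. It shows that stacking the bias vector on top of the weight matrix and adding a column of ones to the inputs gives the same weighted inputs as keeping the bias separate.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))   # 6 synthetic examples, 4 features
W = rng.normal(size=(4, 3))   # ordinary weights for 3 neurons
b = rng.normal(size=3)        # separate bias, one entry per neuron

X_aug = np.hstack([np.ones((6, 1)), X])   # x_0 = 1 in column 0
W_aug = np.vstack([b, W])                 # row 0 holds w_0, the former bias

# Both formulations produce the same weighted inputs.
print(np.allclose(X @ W + b, X_aug @ W_aug))  # True
```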

## a. Proof of Michael Nielsen's equation BP3

## Proof of BP3

Equation BP3 states that the rate of change of the cost with respect to any bias equals the error of the corresponding neuron:

\[
\frac{\partial C}{\partial b_j^l} = \delta_j^l
\]

(In Nielsen's numbering, the output-layer formula $\delta^L = \nabla_a C \odot \sigma'(z^L)$ is BP1 and the recursion $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$ is BP2; neither is needed for this proof.)

The error of neuron $j$ in layer $l$ is defined as the partial derivative of the cost with respect to that neuron's weighted input:

\[
\delta_j^l = \frac{\partial C}{\partial z_j^l}
\]

Recall that the weighted input is given by:

\[
z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l
\]

The bias $b_j^l$ influences the cost only through $z_j^l$, so the chain rule gives:

\[
\frac{\partial C}{\partial b_j^l} = \frac{\partial C}{\partial z_j^l} \frac{\partial z_j^l}{\partial b_j^l}
\]

Differentiating the expression for $z_j^l$ with respect to $b_j^l$, every term of the sum is constant and only the bias term remains:

\[
\frac{\partial z_j^l}{\partial b_j^l} = 1
\]

Substituting this and the definition of $\delta_j^l$ into the chain rule, we get:

\[
\frac{\partial C}{\partial b_j^l} = \delta_j^l \cdot 1 = \delta_j^l
\]

Thus, equation BP3 is proven.
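
As a sanity check on the derivation above, the sketch below compares the hidden-layer error $\delta^1$ obtained by backpropagation with a finite-difference estimate of $\partial C / \partial b^1$ on a small random network. The layer sizes, the quadratic cost, and the names `cost`, `delta1`, and `numeric` are assumptions made for this illustration and do not come from Nielsen's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(w1, b1, w2, b2, x, y):
    """Quadratic cost 0.5 * ||a^2 - y||^2 for a single training example."""
    a1 = sigmoid(w1 @ x + b1)
    a2 = sigmoid(w2 @ a1 + b2)
    return 0.5 * np.sum((a2 - y) ** 2)

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
w2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
x, y = rng.normal(size=4), np.array([1.0, 0.0])

# Backpropagation: output-layer error first, then the hidden-layer error.
z1 = w1 @ x + b1
a1 = sigmoid(z1)
z2 = w2 @ a1 + b2
a2 = sigmoid(z2)
delta2 = (a2 - y) * sigmoid_prime(z2)
delta1 = (w2.T @ delta2) * sigmoid_prime(z1)

# Central finite differences with respect to each hidden-layer bias.
eps = 1e-6
numeric = np.zeros_like(b1)
for j in range(b1.size):
    step = np.zeros_like(b1)
    step[j] = eps
    numeric[j] = (cost(w1, b1 + step, w2, b2, x, y)
                  - cost(w1, b1 - step, w2, b2, x, y)) / (2 * eps)

# The largest discrepancy should be close to zero.
print(np.max(np.abs(numeric - delta1)))
```

Perturbing a bias $b_j^1$ shifts the weighted input $z_j^1$ by the same amount, which is why the two quantities are expected to agree.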


## b. Proof of Michael Nielsen's equation BP4

## Proof of BP4

Equation BP4 gives the partial derivative of the cost with respect to any weight in the network. We need to show that:

\[
\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l
\]

where $a_k^{l-1}$ is the activation of neuron $k$ in the previous layer, and $\delta_j^l$ is the error of neuron $j$ in layer $l$, measured with respect to that neuron's weighted input $z_j^l$.

Proof:
1. The weight $w_{jk}^l$ influences the cost only through the weighted input $z_j^l$, so the chain rule gives:

\[
\frac{\partial C}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l} \frac{\partial z_j^l}{\partial w_{jk}^l}
\]

2. Recall that $z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l$, thus:

\[
\frac{\partial z_j^l}{\partial w_{jk}^l} = a_k^{l-1}
\]

3. Substituting into the chain rule, we get:

\[
\frac{\partial C}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l} a_k^{l-1}
\]

4. By definition of $\delta_j^l$:

\[
\delta_j^l = \frac{\partial C}{\partial z_j^l}
\]

5. Substituting $\delta_j^l$ into the partial derivative expression, we get:

\[
\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l
\]

Thus, equation BP4 is proven.
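
The result can also be checked numerically. The sketch below is a minimal example for a single sigmoid layer with a quadratic cost; the sizes and the names `analytic` and `numeric` are assumptions made for illustration. It builds the full gradient matrix as the outer product of $\delta^l$ and $a^{l-1}$ and compares every entry with a central finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, a_prev, y):
    """Quadratic cost of one sigmoid layer for a single example."""
    a = sigmoid(w @ a_prev + b)
    return 0.5 * np.sum((a - y) ** 2)

rng = np.random.default_rng(1)
w, b = rng.normal(size=(3, 4)), rng.normal(size=3)
a_prev, y = rng.uniform(size=4), np.array([0.0, 1.0, 0.0])

a = sigmoid(w @ a_prev + b)
delta = (a - y) * a * (1.0 - a)      # delta_j = dC/dz_j for this layer
analytic = np.outer(delta, a_prev)   # entry (j, k) is a_k^{l-1} * delta_j^l

# Central finite difference for every weight w_jk.
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(w.shape[0]):
    for k in range(w.shape[1]):
        step = np.zeros_like(w)
        step[j, k] = eps
        numeric[j, k] = (cost(w + step, b, a_prev, y)
                         - cost(w - step, b, a_prev, y)) / (2 * eps)

# The largest discrepancy should be close to zero.
print(np.max(np.abs(numeric - analytic)))
```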

## c. Using both markdown cells and code cells, implement a fully matrix-based approach to backpropagation over a mini-batch. Explain the change of notation in which, instead of having a separate bias term, the input variables are augmented with a "column" of "1"s and the bias becomes the weight $w_0$.
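
One way to write the change of notation out explicitly is given here. It assumes the quadratic cost and a layout with one training example per row, both chosen to match the code in this section; the capital-letter symbols are introduced only for this explanation.

Let $X$ be the $m \times 5$ mini-batch matrix whose rows are the examples, each including the constant entry $x_0 = 1$, and let $Y$ be the $m \times 3$ matrix of one-hot labels. With weight matrices $W^1$ ($5 \times 3$) and $W^2$ ($3 \times 3$), the forward pass for the whole mini-batch is:

\[
Z^1 = X W^1, \qquad A^1 = \sigma(Z^1), \qquad Z^2 = A^1 W^2, \qquad A^2 = \sigma(Z^2)
\]

For the cost $C = \frac{1}{2m} \sum_x \| a^2(x) - y(x) \|^2$, the errors of all examples are computed at once, with row $x$ of each matrix holding the error vector of example $x$:

\[
\Delta^2 = (A^2 - Y) \odot \sigma'(Z^2), \qquad \Delta^1 = (\Delta^2 (W^2)^T) \odot \sigma'(Z^1)
\]

The gradients follow from BP4, summed over the mini-batch by the matrix product:

\[
\frac{\partial C}{\partial W^2} = \frac{1}{m} (A^1)^T \Delta^2, \qquad \frac{\partial C}{\partial W^1} = \frac{1}{m} X^T \Delta^1
\]

Because the $x_0$ column of $X$ is all ones, the matching row of $\frac{1}{m} X^T \Delta^1$ is the batch average of $\delta^1$, which is the gradient a separate bias would have received. In this row-per-example layout each $W^l$ is the transpose of Nielsen's $w^l$.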

In [2]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# One-hot encode the labels
encoder = OneHotEncoder(sparse_output=False)  # Use sparse_output instead of sparse
y = encoder.fit_transform(y.reshape(-1, 1))

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

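# Add the constant input x_0 = 1 as the first column, so that the first row
# of the input-layer weights plays the role of the bias weight w_0
X = np.hstack([np.ones((X.shape[0], 1)), X])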

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights with small random values
        self.w1 = np.random.randn(input_size, hidden_size) * 0.01
        self.w2 = np.random.randn(hidden_size, output_size) * 0.01

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def sigmoid_prime(self, z):
        return self.sigmoid(z) * (1 - self.sigmoid(z))

    def forward(self, X):
        self.z1 = np.dot(X, self.w1)
        self.a1 = self.sigmoid(self.z1)
        self.z2 = np.dot(self.a1, self.w2)
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y, output, learning_rate=1.0):
        m = X.shape[0]  # number of examples in the batch

        # Output-layer error for the quadratic cost: (a - y) * sigma'(z)
        self.output_error = output - y
        self.output_delta = self.output_error * self.sigmoid_prime(self.z2)

        # Propagate the error back to the hidden layer
        self.z1_error = np.dot(self.output_delta, self.w2.T)
        self.z1_delta = self.z1_error * self.sigmoid_prime(self.z1)

        # Gradient-descent step using the batch-averaged gradients
        self.w2 -= learning_rate * np.dot(self.a1.T, self.output_delta) / m
        self.w1 -= learning_rate * np.dot(X.T, self.z1_delta) / m

    def train(self, X, y, epochs=1000, learning_rate=1.0):
        # learning_rate scales every weight update; 1.0 applies the
        # batch-averaged gradient unscaled
        for epoch in range(epochs):
            output = self.forward(X)
            self.backward(X, y, output, learning_rate)
            
    def predict(self, X):
        output = self.forward(X)
        return np.argmax(output, axis=1)
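
In the class above only the input carries the constant 1, so the output layer has no bias weight. The sketch below shows one possible way to apply the same augmentation to the hidden layer. It is not the class above: the helper names `add_ones`, `forward`, `cost`, and `gradients`, the synthetic data, and the $4 \times 3$ shape of `w2` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_ones(a):
    """Prepend a column of ones so that column 0 acts as the constant input."""
    return np.hstack([np.ones((a.shape[0], 1)), a])

def forward(X, w1, w2):
    """X is (m, 5) and already augmented; w1 is (5, 3); w2 is (4, 3)."""
    a1 = sigmoid(X @ w1)
    a1_aug = add_ones(a1)        # (m, 4): constant unit plus three hidden units
    a2 = sigmoid(a1_aug @ w2)
    return a1, a1_aug, a2

def cost(X, Y, w1, w2):
    """Quadratic cost averaged over the mini-batch."""
    return 0.5 * np.sum((forward(X, w1, w2)[2] - Y) ** 2) / X.shape[0]

def gradients(X, Y, w1, w2):
    """Gradients of cost() with respect to w1 and w2."""
    m = X.shape[0]
    a1, a1_aug, a2 = forward(X, w1, w2)
    delta2 = (a2 - Y) * a2 * (1.0 - a2)
    # The constant unit has no incoming weights, so row 0 of w2 is skipped
    # when the error is propagated back to the hidden layer.
    delta1 = (delta2 @ w2[1:].T) * a1 * (1.0 - a1)
    return X.T @ delta1 / m, a1_aug.T @ delta2 / m

rng = np.random.default_rng(0)
X = add_ones(rng.normal(size=(8, 4)))        # synthetic stand-in for scaled features
Y = np.eye(3)[rng.integers(0, 3, size=8)]    # synthetic one-hot labels
w1 = rng.normal(size=(5, 3)) * 0.1
w2 = rng.normal(size=(4, 3)) * 0.1
g1, g2 = gradients(X, Y, w1, w2)
print(g1.shape, g2.shape)  # (5, 3) (4, 3)
```

With this layout, row 0 of each weight matrix holds the $w_0$ weights of that layer, and the gradient shapes match the weight shapes so the same update rule applies to both.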

In [5]:
# Initialize and train the neural network
nn = NeuralNetwork(input_size=5, hidden_size=3, output_size=3)
nn.train(X_train, y_train)
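
The call above passes the whole training set through the network on every epoch. A mini-batch variant needs a way to split the data into shuffled batches; the sketch below shows one possible helper. The name `iterate_minibatches` and the demo arrays are hypothetical and are not part of the notebook's code.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs that cover the data once in random order."""
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X_demo = np.arange(20.0).reshape(10, 2)   # 10 synthetic examples, 2 features
y_demo = np.arange(10)
batches = list(iterate_minibatches(X_demo, y_demo, batch_size=4, rng=rng))
print([xb.shape[0] for xb, _ in batches])  # [4, 4, 2]
```

Within each epoch, every batch `(xb, yb)` could then be passed to `nn.forward(xb)` followed by `nn.backward(xb, yb, output)`. Since `backward` divides by the number of rows it receives, the gradient is averaged over the mini-batch.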

In [6]:
# Predict on the test set
predictions = nn.predict(X_test)

In [7]:
# Calculate accuracy
accuracy = np.mean(predictions == np.argmax(y_test, axis=1))
print(f'Accuracy: {accuracy * 100:.2f}%')

Accuracy: 96.67%
