
Name: Safal Thapa

Empid (last 5 digits): 30789

HW2-Problem 6: Backpropagation Proofs and Implementation (Nielsen)

**Part a: Proof of Equation BP3**

Write out a complete proof of equation BP3 from Nielsen's chapter (the equation for the rate of change of the cost with respect to any bias in the network).



<br/>
<center> Answer </center>

BP3 states that the rate of change of the cost with respect to any bias in the network equals the error of the corresponding neuron:

$$
\frac{\partial C}{\partial b^l_{j}} = \delta^l_{j}
$$

Proof:
Consider neuron $j$ in layer $l$. Its weighted input is
$$
z_j^{\,l} = \sum_k w_{jk}^{\,l}\, a_k^{\,l-1} + b_j^{\,l}
$$

Its activation is $a_j^{\,l} = \sigma\!\left(z_j^{\,l}\right)$, and $C$ denotes the cost function.

The bias $b_j^{\,l}$ affects the cost only through $z_j^{\,l}$, so we apply the chain rule:

$$
\frac{\partial C}{\partial b_j^{\,l}}
=
\frac{\partial C}{\partial z_j^{\,l}}
\frac{\partial z_j^{\,l}}{\partial b_j^{\,l}}
\tag{a}
$$


From the definition, we have
$
z_j^{\,l} = \sum_k w_{jk}^{\,l}\, a_k^{\,l-1} + b_j^{\,l}
$.

Take the derivative with respect to $b_j^{\,l}$. The sum $\sum_k w_{jk}^{\,l}\, a_k^{\,l-1}$ does not involve the bias $b_j^{\,l}$, so its derivative is 0. The bias term $b_j^{\,l}$ is added directly, so its derivative is 1. Hence,
$$
\frac{\partial z_j^{\,l}}{\partial b_j^{\,l}} = 1
\tag{b}
$$

Recall that the error of neuron $j$ in layer $l$ is defined as

$$
\delta_j^{\,l} = \frac{\partial C}{\partial z_j^{\,l}}
\tag{c}$$

Substituting (b) and (c) into equation (a), we get
$$
\begin{aligned}
\frac{\partial C}{\partial b_j^{\,l}}
&= \frac{\partial C}{\partial z_j^{\,l}} \frac{\partial z_j^{\,l}}{\partial b_j^{\,l}} \\
&= \frac{\partial C}{\partial z_j^{\,l}} \cdot 1 \\
&= \delta_j^{\,l}
\end{aligned}
$$
which is BP3.

Hence, proved.
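As a sanity check on BP3, the analytic error $\delta_j^{\,l}$ can be compared with a finite-difference estimate of $\partial C / \partial b_j^{\,l}$. The following is only a minimal sketch: it assumes a single sigmoid layer with the quadratic cost, and the helper `cost`, the layer sizes, and the random seed are illustrative choices, not part of Nielsen's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b, a_prev, y):
    """Quadratic cost C = 0.5 * ||sigma(W a_prev + b) - y||^2 for one layer (assumed)."""
    a = sigmoid(W @ a_prev + b)
    return 0.5 * np.sum((a - y) ** 2)

rng = np.random.default_rng(0)       # illustrative seed
W = rng.normal(size=(3, 4))          # 3 neurons, 4 inputs (illustrative sizes)
b = rng.normal(size=3)
a_prev = rng.normal(size=4)
y = np.array([1.0, 0.0, 0.0])

# Analytic error: delta_j = dC/dz_j = (a_j - y_j) * sigma'(z_j)
z = W @ a_prev + b
a = sigmoid(z)
delta = (a - y) * a * (1.0 - a)

# Central finite-difference estimate of dC/db_j, one bias at a time
eps = 1e-6
numeric_grad_b = np.zeros_like(b)
for j in range(b.size):
    step = np.zeros_like(b)
    step[j] = eps
    numeric_grad_b[j] = (cost(W, b + step, a_prev, y)
                         - cost(W, b - step, a_prev, y)) / (2 * eps)

# BP3 predicts that the two vectors agree up to finite-difference error
bp3_max_error = np.max(np.abs(numeric_grad_b - delta))
print(bp3_max_error)
```

If BP3 holds, the printed difference should be close to zero, limited only by the accuracy of the finite-difference step.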

**Part b: Proof of Equation BP4**

Write out a complete proof of equation BP4 from Nielsen's chapter (the equation for the rate of change of the cost with respect to any weight in the network).

<br/>
<center> Answer </center>

BP4 states that the rate of change of the cost with respect to any weight in the network is the product of the input activation and the error of the receiving neuron:

$$
\frac{\partial C}{\partial w_{jk}^{\,l}} = a_k^{\,l-1} \, \delta_j^{\,l}
$$

Proof:
The weight $w_{jk}^{\,l}$ affects the cost only through $z_j^{\,l}$, so we apply the chain rule:

$$
\frac{\partial C}{\partial w_{jk}^{\,l}}
=
\frac{\partial C}{\partial z_j^{\,l}}
\cdot
\frac{\partial z_j^{\,l}}{\partial w_{jk}^{\,l}}
\tag{d}
$$

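From the definition $z_j^{\,l} = \sum_k w_{jk}^{\,l}\, a_k^{\,l-1} + b_j^{\,l}$, only the term $w_{jk}^{\,l}\, a_k^{\,l-1}$ depends on $w_{jk}^{\,l}$, so taking the partial derivative with respect to $w_{jk}^{\,l}$ gives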
$$
\frac{\partial z_{j}^{\,l}}{\partial w_{jk}^{\,l}} = a_k^{\,l-1}
\tag{e}
$$

By the definition of the error of neuron $j$ in layer $l$,
$$
\delta_j^{\,l} = \frac{\partial C}{\partial z_j^{\,l}}
\tag{f}
$$

Substituting (e) and (f) into (d), we get
$$
\frac{\partial C}{\partial w_{jk}^{\,l}}
=
a_k^{\,l-1}\cdot
\delta_j^{\,l}
$$

which is BP4. Hence, proved.
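For a single training example, BP3 and BP4 can be collected into one matrix expression, which is the form a matrix-based implementation typically uses. The block below is a sketch under assumed notation that does not appear in the proofs above: $\tilde{a}^{\,l-1}$ is the activation vector with a leading 1 prepended, and $\tilde{W}^{\,l}$ is the weight matrix with the bias vector placed in its first column.

$$
\tilde{a}^{\,l-1} = \begin{bmatrix} 1 \\ a^{\,l-1} \end{bmatrix}, \qquad
\tilde{W}^{\,l} = \begin{bmatrix} b^{\,l} & W^{\,l} \end{bmatrix}, \qquad
z^{\,l} = \tilde{W}^{\,l}\, \tilde{a}^{\,l-1}
$$

$$
\frac{\partial C}{\partial \tilde{W}^{\,l}}
= \delta^{\,l} \left(\tilde{a}^{\,l-1}\right)^{T}
= \begin{bmatrix} \delta^{\,l} & \delta^{\,l} \left(a^{\,l-1}\right)^{T} \end{bmatrix}
$$

Entry $j$ of the first column is $\delta_j^{\,l}$, which is BP3, and entry $(j, k)$ of the remaining block is $a_k^{\,l-1}\, \delta_j^{\,l}$, which is BP4.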




**Part c: Matrix-Based Backpropagation Implementation**

Implement a fully matrix-based backpropagation algorithm over a mini-batch:

- Augment input variables with a "column" of 1s (instead of a separate bias term)
- Treat bias as weight $w_0$
- Use Nielsen's code as a starting point, but rewrite it to use matrix notation (no separate bias)
- Test on the Iris dataset with:
  - 4 input features (5 with the constant column)
  - 3 hidden layer nodes
  - 3 output nodes (one per class)
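The idea of treating the bias as weight $w_0$ can be illustrated on one layer. The sketch below assumes samples are stored as columns, so the constant input becomes a row of 1s, and it places the bias in column 0 of the weight matrix. The names `W_aug`, `A_aug`, the sizes, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)        # illustrative seed
n_in, n_out, m = 4, 3, 5              # layer sizes and mini-batch size (illustrative)

W = rng.normal(size=(n_out, n_in))    # ordinary weights
b = rng.normal(size=(n_out, 1))       # separate bias, one entry per neuron
A = rng.normal(size=(n_in, m))        # activations, one column per sample

# Separate-bias form: z = W a + b (b broadcasts across the m columns)
Z_separate = W @ A + b

# Augmented form: bias in column 0 of the weights, a row of 1s on top of A
W_aug = np.hstack([b, W])                  # shape (n_out, n_in + 1)
A_aug = np.vstack([np.ones((1, m)), A])    # shape (n_in + 1, m)
Z_augmented = W_aug @ A_aug

# Both forms should give the same weighted inputs
same_z = np.allclose(Z_separate, Z_augmented)
print(same_z)
```

Because the two forms give the same weighted inputs, a single augmented weight matrix per layer is enough, and no separate bias update is needed.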

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder

class Network(object):
    def __init__(self, sizes):
        self.num_layers = len(sizes)
        # Each weight matrix has one extra column (column 0) that holds the bias weight w_0
        self.weights = [np.random.randn(y, x + 1)
                        for x, y in zip(sizes[:-1], sizes[1:])]

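    def feedforward(self, a):
        """Return the network output for inputs `a` of shape (features, samples)."""
        for w in self.weights:
            # Prepend a row of 1s so that column 0 of w acts as the bias
            a_aug = np.vstack([np.ones((1, a.shape[1])), a])
            a = sigmoid(np.dot(w, a_aug))
        return a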

    def update_mini_batch(self, mini_batch_x, mini_batch_y, eta):
        m = mini_batch_x.shape[1]

        # Call backprop once for the entire matrix
        nabla_w = self.backprop(mini_batch_x, mini_batch_y)

        # Update weights using the gradients
        self.weights = [w - (eta / m) * nw
                        for w, nw in zip(self.weights, nabla_w)]

    def backprop(self, x, y):
        """Matrix-based version of backpropagation."""
        nabla_w = [np.zeros(w.shape) for w in self.weights]

        # --- Feedforward ---
        activation = x
        activations = [x]
        zs = []

        for w in self.weights:
            # Augment input with a row of 1s
            a_aug = np.vstack([np.ones((1, activation.shape[1])), activation])
            z = np.dot(w, a_aug)
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)

        # Output-layer error (BP1): dC/da times sigma'(z), one column per sample
        delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])

        # Gradient for the last layer (using augmented activations from layer L-1)
        a_prev_aug = np.vstack([np.ones((1, activations[-2].shape[1])), activations[-2]])
        nabla_w[-1] = np.dot(delta, a_prev_aug.transpose())

        # Backprop through hidden layers
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)

            # slice out the w0 (bias) weights when
            # backpropagating error to previous layers
            delta = np.dot(self.weights[-l+1][:, 1:].transpose(), delta) * sp

            # Gradient for layer -l
            a_prev_aug = np.vstack([np.ones((1, activations[-l-1].shape[1])), activations[-l-1]])
            nabla_w[-l] = np.dot(delta, a_prev_aug.transpose())

        return nabla_w

    def cost_derivative(self, output_activations, y):
        """Return dC/da for the quadratic cost C = 0.5 * ||a - y||^2."""
        return (output_activations - y)

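def sigmoid(z):
    """Elementwise logistic sigmoid."""
    return 1.0/(1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid, computed from a single sigmoid evaluation."""
    s = sigmoid(z)
    return s * (1 - s)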


# Load and prepare data
iris = load_iris()
X = iris.data.T
y_labels = iris.target.reshape(-1, 1)

encoder = OneHotEncoder(sparse_output=False)
Y = encoder.fit_transform(y_labels).T

# 4 inputs, 3 hidden, 3 outputs
net = Network([4, 3, 3])

# Training: one full-batch gradient step per epoch; accuracy is measured on the training data
print("Training Model...")
for epoch in range(1001):
    net.update_mini_batch(X, Y, eta=0.3)
    if epoch % 200 == 0:
        outputs = net.feedforward(X)
        acc = np.mean(np.argmax(outputs, axis=0) == np.argmax(Y, axis=0))
        print(f"Epoch {epoch}: Accuracy {acc*100:.2f}%")

Training Model...
Epoch 0: Accuracy 33.33%
Epoch 200: Accuracy 74.67%
Epoch 400: Accuracy 96.00%
Epoch 600: Accuracy 98.00%
Epoch 800: Accuracy 98.00%
Epoch 1000: Accuracy 97.33%
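The loop above passes all 150 samples to `update_mini_batch` in a single batch. Below is a minimal sketch of how data stored with samples as columns could be shuffled and split into smaller mini-batches. The helper `iterate_mini_batches`, the batch size of 32, and the stand-in arrays `X_demo` and `Y_demo` are hypothetical and are not part of the code above.

```python
import numpy as np

def iterate_mini_batches(X, Y, batch_size, rng):
    """Yield shuffled (X_batch, Y_batch) pairs; samples are stored as columns."""
    n_samples = X.shape[1]
    order = rng.permutation(n_samples)            # new random order of column indices
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]     # the last batch may be smaller
        yield X[:, idx], Y[:, idx]

# Stand-in data with the same layout as the Iris arrays: (features, samples)
rng = np.random.default_rng(0)                    # illustrative seed
X_demo = rng.normal(size=(4, 150))
Y_demo = np.eye(3)[:, rng.integers(0, 3, size=150)]   # random one-hot columns

batch_shapes = [(xb.shape, yb.shape)
                for xb, yb in iterate_mini_batches(X_demo, Y_demo, 32, rng)]
print(batch_shapes)
```

Each pair could then be passed to `net.update_mini_batch(X_batch, Y_batch, eta)` inside the epoch loop. That method already divides the summed gradient by the number of columns in the batch, so no other change should be needed.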
