# Multiple Layer Training: Backpropagation 

## Complete Algorithm

### Forward Pass
```
a⁰ = x
for l = 1 to L:
    zˡ = Wˡ · aˡ⁻¹ + bˡ
    aˡ = σ(zˡ)
```

### Backward Pass
```
# 1. Compute output error (quadratic cost, sigmoid output: σ'(zᴸ) = aᴸ ⊙ (1 - aᴸ))
δᴸ = (aᴸ - y) ⊙ aᴸ ⊙ (1 - aᴸ)

# 2. Backpropagate through hidden layers
for l = L-1 down to 1:
    δˡ = ((Wˡ⁺¹)ᵀ · δˡ⁺¹) ⊙ aˡ ⊙ (1 - aˡ)

# 3. Compute gradients
for l = 1 to L:
    ∂C/∂Wˡ = δˡ · (aˡ⁻¹)ᵀ
    ∂C/∂bˡ = δˡ
```
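
As a short derivation of step 1, assume the quadratic cost $C = \frac{1}{2}\lVert a^L - y \rVert^2$ and a sigmoid output layer. The factor $\frac{1}{2}$ is an assumption made here so that it cancels on differentiation:

$$\delta^L \equiv \frac{\partial C}{\partial z^L} = \frac{\partial C}{\partial a^L} \odot \frac{\partial a^L}{\partial z^L} = (a^L - y) \odot \sigma'(z^L) = (a^L - y) \odot a^L \odot (1 - a^L)$$

With a different cost or output activation, only this first step changes; the recursion in step 2 and the gradient formulas in step 3 keep the same form.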

### Parameter Update (Gradient Descent)
```
for l = 1 to L:
    Wˡ = Wˡ - η · ∂C/∂Wˡ
    bˡ = bˡ - η · ∂C/∂bˡ
```
Where $\eta$ is the learning rate.
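
To see the update rule in isolation, here is a minimal sketch on a one-parameter toy cost. The cost $C(w) = (w - 3)^2$, the starting point, and the value of `eta` are illustrative assumptions, not part of the network above.

```python
# Toy cost C(w) = (w - 3)^2 with gradient dC/dw = 2 * (w - 3); minimum at w = 3.
eta = 0.1   # learning rate (illustrative value)
w = 0.0     # arbitrary starting point

for step in range(100):
    grad = 2.0 * (w - 3.0)   # gradient of the toy cost at the current w
    w = w - eta * grad       # same form as W = W - eta * dC/dW

print(w)  # ends up very close to 3
```

Each step shrinks the distance to the minimum by a constant factor here; with a learning rate that is too large the same loop would overshoot and diverge.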

**Optional: read from here for your own knowledge; you can skip it.**
## Batch Processing (Code Enhancement)

Batch processing improves on the basic algorithm by pushing $m$ examples through the network at once, one example per column:

- **Input**: $X$ with shape $(input\_size \times batch\_size)$, where $batch\_size = m$
- **Gradients** are averaged over the batch:
  $$\frac{\partial C}{\partial W^l} = \frac{1}{m} \sum_{i=1}^m \delta^l_{(i)} (a^{l-1}_{(i)})^T$$
  $$\frac{\partial C}{\partial b^l} = \frac{1}{m} \sum_{i=1}^m \delta^l_{(i)}$$
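
The sums above never need an explicit loop: with one example per column, a single matrix product computes the same average. The following sketch checks this on random data; the shapes and the names `delta` and `a_prev` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                               # batch size (illustrative)
delta = rng.normal(size=(3, m))     # delta^l, one column per example
a_prev = rng.normal(size=(4, m))    # a^{l-1}, one column per example

# Explicit average of per-example outer products, as in the formulas
dW_loop = sum(np.outer(delta[:, i], a_prev[:, i]) for i in range(m)) / m
db_loop = sum(delta[:, i:i + 1] for i in range(m)) / m

# Vectorized equivalents
dW_vec = delta @ a_prev.T / m
db_vec = delta.sum(axis=1, keepdims=True) / m

print(np.allclose(dW_loop, dW_vec), np.allclose(db_loop, db_vec))
```

The vectorized forms are the ones used in `backward` in the code section.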

## Why This Works: The Chain Rule Insight

The fundamental mathematics is the chain rule:

$$\frac{\partial C}{\partial W^l} = \frac{\partial C}{\partial a^L} \cdot \frac{\partial a^L}{\partial a^{L-1}} \cdot \ldots \cdot \frac{\partial a^{l+1}}{\partial a^l} \cdot \frac{\partial a^l}{\partial z^l} \cdot \frac{\partial z^l}{\partial W^l}$$

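Instead of expanding this long product separately for every layer, backpropagation computes it recursively, reusing $\delta^{l+1}$ (shown here in scalar form; in matrix form $W^{l+1}$ is transposed and the last factor is applied elementwise, as in step 2 of the backward pass):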

$$\frac{\partial C}{\partial z^l} = \underbrace{\frac{\partial C}{\partial z^{l+1}}}_{\delta^{l+1}} \cdot \underbrace{\frac{\partial z^{l+1}}{\partial a^l}}_{W^{l+1}} \cdot \underbrace{\frac{\partial a^l}{\partial z^l}}_{\sigma'(z^l)}$$
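
The recursion can be checked numerically on a scalar network, where every factor is a plain number. This is a minimal sketch; the weights, input, and target are arbitrary illustrative values, and the cost is assumed to be $\frac{1}{2}(a^2 - y)^2$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative scalar network: x -> z1 -> a1 -> z2 -> a2
x, y = 0.7, 1.0
w1, b1, w2, b2 = 0.5, -0.2, 1.5, 0.1

def cost_from_z1(z1):
    """Cost as a function of z1 only, used for the finite-difference check."""
    a1 = sigmoid(z1)
    a2 = sigmoid(w2 * a1 + b2)
    return 0.5 * (a2 - y) ** 2

# Forward pass
z1 = w1 * x + b1
a1 = sigmoid(z1)
a2 = sigmoid(w2 * a1 + b2)

# Backward pass using the recursion
delta2 = (a2 - y) * a2 * (1 - a2)       # dC/dz2
delta1 = delta2 * w2 * a1 * (1 - a1)    # delta^{l+1} * W^{l+1} * sigma'(z^l)

# Central finite difference of the cost with respect to z1
eps = 1e-6
numeric = (cost_from_z1(z1 + eps) - cost_from_z1(z1 - eps)) / (2 * eps)

print(delta1, numeric)  # the two values should agree to many digits
```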

## Key Mathematical Insights

1. **Local Computation**: Each layer only needs information from adjacent layers
2. **Dynamic Programming**: We reuse computed $\delta^{l+1}$ to compute $\delta^l$
3. **Efficiency**: Storing $z^l$ and $a^l$ during forward pass avoids recomputation
4. **Sigmoid Efficiency**: $\sigma'(z^l) = a^l(1-a^l)$ uses pre-computed activations
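
Insight 4 follows from differentiating the sigmoid directly. A short derivation, using the standard definition $\sigma(z) = 1/(1 + e^{-z})$:

$$\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)$$

Since $a^l = \sigma(z^l)$ is already stored from the forward pass, evaluating $\sigma'(z^l)$ needs only elementwise arithmetic and no extra exponentials.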

## The Essence

**Backpropagation calculates how much each weight contributes to the final error by:**
1. **Forward pass**: Compute all activations
2. **Backward pass**: Propagate errors from output back to input
3. **Update**: Nudge weights in the direction that reduces error

The mathematical elegance is that we can compute exact gradients for millions of parameters using only local information and efficient recursive computation!
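
One way to convince yourself that the gradients are exact is a gradient check: compare backpropagation against finite differences on a tiny network. The sketch below is self-contained and uses its own helper functions (`forward`, `cost`, `backprop` are names chosen here for illustration); it assumes sigmoid layers and the batch-averaged quadratic cost $\frac{1}{2m}\sum (a^L - y)^2$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, biases, X):
    """Return the list of activations [a^0, ..., a^L]."""
    activations = [X]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations

def cost(weights, biases, X, Y):
    """Quadratic cost averaged over the batch."""
    m = X.shape[1]
    return 0.5 * np.sum((forward(weights, biases, X)[-1] - Y) ** 2) / m

def backprop(weights, biases, X, Y):
    """Gradients of `cost` w.r.t. each weight matrix via the delta recursion."""
    m = X.shape[1]
    acts = forward(weights, biases, X)
    delta = (acts[-1] - Y) * acts[-1] * (1 - acts[-1])   # output error
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        grads[l] = delta @ acts[l].T / m
        if l > 0:  # propagate the error one layer back
            delta = (weights[l].T @ delta) * acts[l] * (1 - acts[l])
    return grads

rng = np.random.default_rng(0)
sizes = [2, 3, 1]  # illustrative architecture
weights = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(2)]
biases = [rng.normal(size=(sizes[i + 1], 1)) for i in range(2)]
X = rng.random((2, 4))
Y = rng.random((1, 4))

grads = backprop(weights, biases, X, Y)

# Perturb each weight in turn and compare with a central finite difference
eps = 1e-6
max_err = 0.0
for l, W in enumerate(weights):
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + eps
        c_plus = cost(weights, biases, X, Y)
        W[idx] = original - eps
        c_minus = cost(weights, biases, X, Y)
        W[idx] = original  # restore the weight
        numeric = (c_plus - c_minus) / (2 * eps)
        max_err = max(max_err, abs(numeric - grads[l][idx]))

print(f"max |numeric - backprop| = {max_err:.2e}")
```

Note the cost difference: the finite-difference check needs two forward passes per weight, while backpropagation gets every gradient from one forward and one backward pass.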

## The Code:

In [86]:
import numpy as np


# Sigmoid activation and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)


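# Mean Squared Error (MSE) loss, used for reporting only.
# Note: backward() uses (a - y) as the output error, i.e. the gradient of
# 0.5 * (a - y)**2, so the reported loss differs by a constant factor from
# the quantity actually being minimized; the minimizer is the same.
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)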


# Deep Neural Network Class
class DeepNeuralNetwork:
    def __init__(self, layer_sizes, learning_rate=0.1):
        """
        Initialize a deep neural network.

        Args:
            layer_sizes: List of integers representing the number of neurons in each layer.
                         Example: [2, 4, 3, 1] for input=2, hidden1=4, hidden2=3, output=1
            learning_rate: Learning rate for gradient descent
        """
        self.layer_sizes = layer_sizes
        self.lr = learning_rate
        self.L = len(layer_sizes) - 1  # Number of weight layers

        # Initialize weights and biases
        self.weights = []
        self.biases = []

        for i in range(self.L):
            # He (Kaiming) initialization: scale by sqrt(2 / fan_in). It was designed
            # for ReLU layers; plain random initialization also works for these
            # small sigmoid networks.
            w = np.random.randn(layer_sizes[i + 1], layer_sizes[i]) * np.sqrt(
                2.0 / layer_sizes[i]
            )
            b = np.zeros((layer_sizes[i + 1], 1))
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, X):
        """Forward pass through all layers"""
        self.activations = [X]  # Store all activations, starting with input
        self.z_values = []  # Store all z values (before activation)

        current_activation = X

        for i in range(self.L):
            z = self.weights[i] @ current_activation + self.biases[i]
            current_activation = sigmoid(z)

            self.z_values.append(z)
            self.activations.append(current_activation)

        return current_activation

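    def backward(self, X, Y):
        """Backward pass: compute batch-averaged gradients and apply one update step.

        Must be called after forward(), which caches the activations used here.
        """
        m = X.shape[1]  # batch size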
        dW = [np.zeros_like(w) for w in self.weights]
        db = [np.zeros_like(b) for b in self.biases]

        # Output layer error - using activations for derivative
        dZ = (self.activations[-1] - Y) * self.activations[-1] * (1 - self.activations[-1])
        dW[-1] = (dZ @ self.activations[-2].T) / m
        db[-1] = np.sum(dZ, axis=1, keepdims=True) / m

        # Backpropagate through hidden layers
        for l in range(self.L - 2, -1, -1):
            dA = self.weights[l + 1].T @ dZ
            dZ = dA * self.activations[l + 1] * (1 - self.activations[l + 1])
            dW[l] = (dZ @ self.activations[l].T) / m
            db[l] = np.sum(dZ, axis=1, keepdims=True) / m

        # Gradient-descent update: W = W - lr * dW, b = b - lr * db
        for l in range(self.L):
            self.weights[l] -= self.lr * dW[l]
            self.biases[l] -= self.lr * db[l]

    def train(self, X, Y, epochs=10000, print_every=1000):
        """Train the network"""
        for epoch in range(1, epochs + 1):
            output = self.forward(X)
            loss = mse_loss(Y, output)
            self.backward(X, Y)

            if epoch % print_every == 0:
                print(f"Epoch {epoch} - Loss: {loss:.6f}")

    def predict(self, X):
        """Make predictions"""
        output = self.forward(X)
        return (output > 0.5).astype(int)

    def get_architecture(self):
        """Return the network architecture"""
        return f"Network Architecture: {self.layer_sizes}"

    def print_parameters(self):
        """Print weights and biases in a detailed format"""
        print("\n" + "=" * 50)
        print("DETAILED NETWORK PARAMETERS")
        print("=" * 50)

        for i in range(self.L):
            print(f"\nLayer {i+1}:")
            print("-" * 20)
            print(f"Shape: {self.layer_sizes[i]} → {self.layer_sizes[i+1]}")

            print(f"\nWeights W{i+1}:")
            print(self.weights[i])

            print(f"\nBiases b{i+1}:")
            print(self.biases[i])

### Example 1: XOR. [2 different architectures]

In [None]:
if __name__ == "__main__":
    # XOR dataset (labels [0, 1, 1, 0]; XNOR would be [1, 0, 0, 1])
    X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])  # 2 inputs
    Y = np.array([[0, 1, 1, 0]])  # 1 output

    # Test different architectures
    architectures = [
        [2, 2, 1],  # Original: 1 hidden layer
        [2, 4, 3, 1],  # 2 hidden layers
    ]

    for i, arch in enumerate(architectures):
        print(f"\n{'='*50}")
        print(f"Testing Architecture {i+1}: {arch}")
        print(f"{'='*50}")

        dnn = DeepNeuralNetwork(layer_sizes=arch, learning_rate=1.0)
        print(dnn.get_architecture())

        dnn.train(X, Y, epochs=10000, print_every=2500)

        print("\nPredictions:")
        preds = dnn.predict(X)
        print(preds)
        print("Expected:")
        print(Y)

        # Calculate accuracy
        accuracy = np.mean(preds == Y)
        print(f"Accuracy: {accuracy*100:.1f}%")

        # Print all parameters in detailed format
        dnn.print_parameters() 


Testing Architecture 1: [2, 2, 1]
Network Architecture: [2, 2, 1]
Epoch 2500 - Loss: 0.014748
Epoch 5000 - Loss: 0.002191
Epoch 7500 - Loss: 0.001114
Epoch 10000 - Loss: 0.000737

Predictions:
[[0 1 1 0]]
Expected:
[[0 1 1 0]]
Accuracy: 100.0%

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 2

Weights W1:
[[-4.5100367  -4.50108298]
 [-6.16458819 -6.11866504]]

Biases b1:
[[6.69228454]
 [2.45115722]]

Layer 2:
--------------------
Shape: 2 → 1

Weights W2:
[[ 8.96569965 -9.17934788]]

Biases b2:
[[-4.20628334]]

Testing Architecture 2: [2, 4, 3, 1]
Network Architecture: [2, 4, 3, 1]
Epoch 2500 - Loss: 0.066375
Epoch 5000 - Loss: 0.000992
Epoch 7500 - Loss: 0.000414
Epoch 10000 - Loss: 0.000253

Predictions:
[[0 1 1 0]]
Expected:
[[0 1 1 0]]
Accuracy: 100.0%

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 4

Weights W1:
[[-4.22754832  0.36076758]
 [ 5.29270515  5.52160067]
 [-2.27174012  4.54026932]
 [-2.38659954 -2.50347866]]

Biases b1:
[[

### Example 2: Approximate a real-valued function. [sin(pi*x1)*x2]

In [103]:
# y = sin(pi*x1)*x2
samples = 8  # small for readability
X = np.random.rand(2, samples)
Y = np.sin(np.pi * X[0]) * X[1]
Y = Y.reshape(1, -1)

nn = DeepNeuralNetwork([2, 16, 8, 4, 1], learning_rate=1.0)
nn.train(X, Y, epochs=60000, print_every=10000)

print("\nInput Data (X):")
print(X)
print("\nExpected Output (Y):")
print(Y)
print("\nPredictions (raw sigmoid outputs):")
print(nn.forward(X))
nn.print_parameters()

Epoch 10000 - Loss: 0.004201
Epoch 20000 - Loss: 0.002874
Epoch 30000 - Loss: 0.001800
Epoch 40000 - Loss: 0.000131
Epoch 50000 - Loss: 0.000031
Epoch 60000 - Loss: 0.000024

Input Data (X):
[[0.38062329 0.87797432 0.86805669 0.8059254  0.79003044 0.30467914
  0.08091928 0.40298018]
 [0.17352451 0.69495109 0.34609973 0.9756102  0.64097208 0.82248056
  0.13252467 0.86201448]]

Expected Output (Y):
[[0.16146379 0.25993535 0.13938939 0.55865808 0.39280788 0.67243518
  0.03332812 0.82228248]]

Predictions (raw sigmoid outputs):
[[0.16134229 0.25117566 0.14795882 0.5631561  0.39165357 0.67406507
  0.03184265 0.81842488]]

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 16

Weights W1:
[[-0.77060362  0.72514064]
 [ 3.62699595  0.97586552]
 [ 1.25492842 -0.55942382]
 [-2.35826629  0.02750124]
 [ 0.18589521 -0.50173288]
 [-1.11769793  0.39494011]
 [-0.72614234  0.50262718]
 [ 1.4335219  -1.18083421]
 [-1.65277038 -0.26017755]
 [-1.67840034  0.39611865]
 [ 4.72445078  0.92

### Example 3: Circular Boundary.

In [89]:
# np.random.seed(0)
n = 8  # keep small for display
X = np.random.rand(2, n)
# Label is 1 when the point lies outside the circle x1^2 + x2^2 = 0.5
Y = (X[0] ** 2 + X[1] ** 2 > 0.5).astype(int).reshape(1, -1)

nn = DeepNeuralNetwork([2, 8, 4, 1], learning_rate=1.0)
nn.train(X, Y, epochs=60000, print_every=10000)

print("\nInput Data (X):")
print(X)
print("\nExpected Output (Y):")
print(Y)
print("\nPredictions:")
print(nn.predict(X))
nn.print_parameters()

Epoch 10000 - Loss: 0.000977
Epoch 20000 - Loss: 0.000270
Epoch 30000 - Loss: 0.000146
Epoch 40000 - Loss: 0.000098
Epoch 50000 - Loss: 0.000073
Epoch 60000 - Loss: 0.000058

Input Data (X):
[[0.62289048 0.08534746 0.05168172 0.53135463 0.54063512 0.6374299
  0.72609133 0.97585208]
 [0.51630035 0.32295647 0.79518619 0.27083225 0.43897142 0.07845638
  0.02535074 0.96264841]]

Expected Output (Y):
[[1 0 1 0 0 0 1 1]]

Predictions:
[[1 0 1 0 0 0 1 1]]

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 8

Weights W1:
[[ 1.0156926   0.85903158]
 [ 5.91187352 -7.51470929]
 [-5.85505464 -0.63357222]
 [-0.3232339   3.49072259]
 [-0.97684704 -3.29629978]
 [ 1.48011458  7.79498761]
 [ 1.20956698 -3.46969306]
 [-0.62880934  2.61278913]]

Biases b1:
[[-0.53032483]
 [-3.73224668]
 [ 4.0280814 ]
 [-1.31435266]
 [ 2.16386161]
 [-4.54046492]
 [-0.00807215]
 [-0.35612492]]

Layer 2:
--------------------
Shape: 8 → 4

Weights W2:
[[-7.35074774e-01  2.75143425e+00  3.93937840e+00 -3.2

### Example 4: Sanity Checks [AND/OR]

In [90]:
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
Y = np.array([[0, 1, 1, 1]]) # OR operation

nn = DeepNeuralNetwork([2, 1], learning_rate=0.3)
nn.train(X, Y, epochs=10000, print_every=2000)

print("\nInput Data (X):")
print(X)
print("\nExpected Output (Y):")
print(Y)
print("\nPredictions:")
print(nn.predict(X))
nn.print_parameters()

Epoch 2000 - Loss: 0.010988
Epoch 4000 - Loss: 0.005067
Epoch 6000 - Loss: 0.003222
Epoch 8000 - Loss: 0.002343
Epoch 10000 - Loss: 0.001833

Input Data (X):
[[0 0 1 1]
 [0 1 0 1]]

Expected Output (Y):
[[0 1 1 1]]

Predictions:
[[0 1 1 1]]

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 1

Weights W1:
[[5.85586768 5.85570795]]

Biases b1:
[[-2.67861634]]


In [91]:
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
Y = np.array([[0, 0, 0, 1]]) # AND operation

nn = DeepNeuralNetwork([2, 1], learning_rate=0.3)
nn.train(X, Y, epochs=5000, print_every=1000)

print("\nInput Data (X):")
print(X)
print("\nExpected Output (Y):")
print(Y)
print("\nPredictions:")
print(nn.predict(X))
nn.print_parameters()  

Epoch 1000 - Loss: 0.042057
Epoch 2000 - Loss: 0.021467
Epoch 3000 - Loss: 0.013894
Epoch 4000 - Loss: 0.010110
Epoch 5000 - Loss: 0.007882

Input Data (X):
[[0 0 1 1]
 [0 1 0 1]]

Expected Output (Y):
[[0 0 0 1]]

Predictions:
[[0 0 0 1]]

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 1

Weights W1:
[[4.28824405 4.28825023]]

Biases b1:
[[-6.53252905]]
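
Because these two networks are single sigmoid neurons, the printed parameters can be checked by hand. The sketch below copies the weights and biases from the OR and AND outputs above and recomputes the predictions; it is a verification aid, not part of the training code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])

# Parameters copied from the printed outputs of the OR and AND runs
W_or, b_or = np.array([[5.85586768, 5.85570795]]), np.array([[-2.67861634]])
W_and, b_and = np.array([[4.28824405, 4.28825023]]), np.array([[-6.53252905]])

preds = {}
for name, W, b in [("OR", W_or, b_or), ("AND", W_and, b_and)]:
    out = sigmoid(W @ X + b)                 # raw sigmoid outputs
    preds[name] = (out > 0.5).astype(int)    # same 0.5 threshold as predict()
    print(name, np.round(out, 3), preds[name])
```

In both cases the weights are nearly equal and only the bias differs in how many active inputs it takes to push $z$ above zero: one input is enough for OR, while AND needs both.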
