# Multi-Layer Training: Backpropagation

## Complete Algorithm

### Forward Pass
```
a⁰ = x
for l = 1 to L:
    zˡ = Wˡ · aˡ⁻¹ + bˡ
    aˡ = σ(zˡ)
```

### Backward Pass
```
# 1. Compute output error (sigmoid output layer, cost C = ½‖aᴸ - y‖²)
δᴸ = (aᴸ - y) ⊙ aᴸ ⊙ (1 - aᴸ)

# 2. Backpropagate through hidden layers
for l = L-1 down to 1:
    δˡ = ((Wˡ⁺¹)ᵀ · δˡ⁺¹) ⊙ aˡ ⊙ (1 - aˡ)

# 3. Compute gradients
for l = 1 to L:
    ∂C/∂Wˡ = δˡ · (aˡ⁻¹)ᵀ
    ∂C/∂bˡ = δˡ
```
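
A common way to gain confidence in these formulas is to compare them against finite differences. The block below is a minimal sketch, assuming a single training example, sigmoid activations everywhere, and the quadratic cost $C = \frac{1}{2}\lVert a^L - y\rVert^2$; the helper names `forward`, `cost`, and `backprop` are illustrative and are not the class defined later in this document.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(weights, biases, x):
    """Return the activations [a0, a1, ..., aL] for one column vector x."""
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations


def cost(weights, biases, x, y):
    """Quadratic cost C = 0.5 * ||aL - y||^2 (assumed for this sketch)."""
    return 0.5 * np.sum((forward(weights, biases, x)[-1] - y) ** 2)


def backprop(weights, biases, x, y):
    """Gradients dC/dW and dC/db from the three backward-pass steps."""
    a = forward(weights, biases, x)
    delta = (a[-1] - y) * a[-1] * (1 - a[-1])                    # step 1
    grads_W, grads_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        grads_W.insert(0, delta @ a[l].T)                        # step 3
        grads_b.insert(0, delta)
        if l > 0:
            delta = (weights[l].T @ delta) * a[l] * (1 - a[l])   # step 2
    return grads_W, grads_b


rng = np.random.default_rng(0)
sizes = [2, 3, 1]  # arbitrary small architecture
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]
x = rng.standard_normal((2, 1))
y = np.array([[1.0]])

grads_W, _ = backprop(weights, biases, x, y)

# Central finite differences, one weight at a time.
eps = 1e-6
max_diff = 0.0
for l, W in enumerate(weights):
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + eps
        c_plus = cost(weights, biases, x, y)
        W[idx] = original - eps
        c_minus = cost(weights, biases, x, y)
        W[idx] = original
        numeric = (c_plus - c_minus) / (2 * eps)
        max_diff = max(max_diff, abs(numeric - grads_W[l][idx]))

print(f"max |numeric - backprop| = {max_diff:.2e}")
```

If the two gradients agree to many decimal places, the backward pass is very likely implemented correctly; a large gap usually points at an indexing or transpose mistake.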

### Parameter Update (Gradient Descent)
```
for l = 1 to L:
    Wˡ = Wˡ - η · ∂C/∂Wˡ
    bˡ = bˡ - η · ∂C/∂bˡ
```
Here $\eta$ is the learning rate.
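
To make the update rule concrete, here is a small worked sketch for a hypothetical one-neuron network with a single training example. The values of `w`, `b`, `x`, `y`, and `eta` are made up for illustration, and the quadratic cost $\frac{1}{2}(a - y)^2$ is assumed.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical single neuron: one weight, one bias, one training example.
w, b = 0.5, 0.0
x, y = 1.0, 1.0
eta = 0.1  # learning rate


def cost(w, b):
    return 0.5 * (sigmoid(w * x + b) - y) ** 2


a = sigmoid(w * x + b)
delta = (a - y) * a * (1 - a)      # output error for this neuron
grad_w, grad_b = delta * x, delta  # dC/dw and dC/db

cost_before = cost(w, b)
w, b = w - eta * grad_w, b - eta * grad_b  # one gradient descent step
cost_after = cost(w, b)

print(f"cost before: {cost_before:.6f}, after: {cost_after:.6f}")
```

Because the prediction starts below the target, the error `delta` is negative, so the step increases both `w` and `b` and the cost drops slightly.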

**Read from here on for your own knowledge; *you can skip it*.**
## Batch Processing (Code Enhancement)

Batch processing extends the basic algorithm by pushing $m$ examples (with $m = batch\_size$) through the network simultaneously:

- **Input**: $X$ with shape $(input\_size \times batch\_size)$, one example per column
- **Gradients** are averaged over the batch:
  $$\frac{\partial C}{\partial W^l} = \frac{1}{m} \sum_{i=1}^m \delta^l_{(i)} (a^{l-1}_{(i)})^T$$
  $$\frac{\partial C}{\partial b^l} = \frac{1}{m} \sum_{i=1}^m \delta^l_{(i)}$$
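
The two averages above collapse into a single matrix product and a row-wise sum once the per-example vectors are stacked as columns. The following sketch checks that equivalence on random data; the array names `A_prev` and `Delta` and the sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                  # batch size
A_prev = rng.standard_normal((3, m))   # a^{l-1}: one column per example
Delta = rng.standard_normal((2, m))    # delta^l: one column per example

# Per-example form: average of m outer products and of m error vectors.
dW_loop = sum(np.outer(Delta[:, i], A_prev[:, i]) for i in range(m)) / m
db_loop = sum(Delta[:, [i]] for i in range(m)) / m

# Vectorized form: one matrix product and one row-wise sum.
dW_vec = (Delta @ A_prev.T) / m
db_vec = np.sum(Delta, axis=1, keepdims=True) / m

print(np.allclose(dW_loop, dW_vec), np.allclose(db_loop, db_vec))
```

The vectorized form is what makes batching cheap: the sum over examples happens inside the matrix multiplication instead of in a Python loop.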

## Why This Works: The Chain Rule Insight

The underlying mathematics is the chain rule, applied through every layer between $W^l$ and the cost:

$$\frac{\partial C}{\partial W^l} = \frac{\partial C}{\partial a^L} \cdot \frac{\partial a^L}{\partial a^{L-1}} \cdot \ldots \cdot \frac{\partial a^{l+1}}{\partial a^l} \cdot \frac{\partial a^l}{\partial z^l} \cdot \frac{\partial z^l}{\partial W^l}$$

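Instead of evaluating this long product separately for every layer, backpropagation computes it recursively, reusing the error of layer $l+1$ to obtain the error of layer $l$:

$$\frac{\partial C}{\partial z^l} = \underbrace{\frac{\partial C}{\partial z^{l+1}}}_{\delta^{l+1}} \cdot \underbrace{\frac{\partial z^{l+1}}{\partial a^l}}_{W^{l+1}} \cdot \underbrace{\frac{\partial a^l}{\partial z^l}}_{\sigma'(z^l)}$$

In matrix form the product with $W^{l+1}$ uses its transpose and the $\sigma'$ factor is applied element-wise: $\delta^l = \left((W^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)$, which is exactly step 2 of the backward pass.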
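
The recursion needs a starting value at the output layer. A short derivation, assuming the quadratic cost $C = \frac{1}{2}\lVert a^L - y\rVert^2$ and a sigmoid output layer (this is the cost for which step 1 of the backward pass comes out exactly):

$$\frac{\partial C}{\partial a^L} = a^L - y, \qquad \frac{\partial a^L}{\partial z^L} = \sigma'(z^L) = a^L \odot (1 - a^L)$$

$$\delta^L = \frac{\partial C}{\partial z^L} = (a^L - y) \odot a^L \odot (1 - a^L)$$

The `mse_loss` function in the code reports $\text{mean}\left((y - a^L)^2\right)$, which differs from this cost only by a constant factor. That factor rescales the effective learning rate but does not change the direction of the update.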

## Key Mathematical Insights

1. **Local Computation**: Each layer only needs information from adjacent layers
2. **Dynamic Programming**: We reuse computed $\delta^{l+1}$ to compute $\delta^l$
3. **Efficiency**: Storing $z^l$ and $a^l$ during forward pass avoids recomputation
4. **Sigmoid Efficiency**: $\sigma'(z^l) = a^l(1-a^l)$ uses pre-computed activations
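
Insight 4 can be checked numerically. The sketch below compares the activation-based derivative $a(1-a)$ with a central finite-difference estimate of $\sigma'(z)$; the sample points and step size are arbitrary choices.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


z = np.linspace(-5.0, 5.0, 11)
a = sigmoid(z)                 # would already be cached by the forward pass

from_activation = a * (1 - a)  # derivative with no extra exp() calls
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference

max_err = np.max(np.abs(from_activation - numeric))
print(f"max error: {max_err:.2e}")
```

Since the activations are stored anyway, the backward pass gets every $\sigma'(z^l)$ for the cost of one subtraction and one multiplication per neuron.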

## The Essence

**Backpropagation calculates how much each weight contributes to the final error by:**
1. **Forward pass**: Compute all activations
2. **Backward pass**: Propagate errors from output back to input
3. **Update**: Nudge weights in the direction that reduces error

The elegance is that one forward pass and one backward pass yield the exact gradient for every parameter, even in networks with millions of them, using only local information and the recursive reuse of $\delta$ values.

## The Code:

In [104]:
import numpy as np


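# Sigmoid activation and its derivative.
# Note: backward() below uses a * (1 - a) from the cached activations,
# so sigmoid_derivative is provided for reference only.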
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)


# Mean Squared Error (MSE) Loss
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)


# Deep Neural Network Class
class DeepNeuralNetwork:
    def __init__(self, layer_sizes, learning_rate=0.1):
        """
        Initialize a deep neural network.

        Args:
            layer_sizes: List of integers representing the number of neurons in each layer.
                         Example: [2, 4, 3, 1] for input=2, hidden1=4, hidden2=3, output=1
            learning_rate: Learning rate for gradient descent
        """
        self.layer_sizes = layer_sizes
        self.lr = learning_rate
        self.L = len(layer_sizes) - 1  # Number of weight layers

        # Initialize weights and biases
        self.weights = []
        self.biases = []

        for i in range(self.L):
            # He (Kaiming) initialization, scale sqrt(2 / fan_in), for better training
            # of deep networks; plain random initialization also works for these examples.
            w = np.random.randn(layer_sizes[i + 1], layer_sizes[i]) * np.sqrt(
                2.0 / layer_sizes[i]
            )
            b = np.zeros((layer_sizes[i + 1], 1))
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, X):
        """Forward pass through all layers"""
        self.activations = [X]  # Store all activations, starting with input
        self.z_values = []  # Store all z values (before activation)

        current_activation = X

        for i in range(self.L):
            z = self.weights[i] @ current_activation + self.biases[i]
            current_activation = sigmoid(z)

            self.z_values.append(z)
            self.activations.append(current_activation)

        return current_activation

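    def backward(self, X, Y):
        """Backward pass: batch-averaged gradients, then one gradient descent step.

        Call forward(X) first; this method reads the activations it cached.
        """
        m = X.shape[1]  # batch size (number of columns in X)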
        dW = [np.zeros_like(w) for w in self.weights]
        db = [np.zeros_like(b) for b in self.biases]

        # Output layer error - using activations for derivative
        dZ = (self.activations[-1] - Y) * self.activations[-1] * (1 - self.activations[-1])
        dW[-1] = (dZ @ self.activations[-2].T) / m
        db[-1] = np.sum(dZ, axis=1, keepdims=True) / m

        # Backpropagate through hidden layers
        for l in range(self.L - 2, -1, -1):
            dA = self.weights[l + 1].T @ dZ
            dZ = dA * self.activations[l + 1] * (1 - self.activations[l + 1])
            dW[l] = (dZ @ self.activations[l].T) / m
            db[l] = np.sum(dZ, axis=1, keepdims=True) / m

        # Update parameters (gradient descent step)
        for l in range(self.L):
            self.weights[l] -= self.lr * dW[l]
            self.biases[l] -= self.lr * db[l]

    def train(self, X, Y, epochs=10000, print_every=1000):
        """Train the network"""
        for epoch in range(1, epochs + 1):
            output = self.forward(X)
            loss = mse_loss(Y, output)
            self.backward(X, Y)

            if epoch % print_every == 0:
                print(f"Epoch {epoch} - Loss: {loss:.6f}")

    def predict(self, X):
        """Make predictions"""
        output = self.forward(X)
        return (output > 0.5).astype(int)

    def print_parameters(self):
        """Print weights and biases in a detailed format"""
        print("\n" + "=" * 50)
        print("DETAILED NETWORK PARAMETERS")
        print("=" * 50)

        for i in range(self.L):
            print(f"\nLayer {i+1}:")
            print("-" * 20)
            print(f"Shape: {self.layer_sizes[i]} → {self.layer_sizes[i+1]}")

            print(f"\nWeights W{i+1}:")
            print(self.weights[i])

            print(f"\nBiases b{i+1}:")
            print(self.biases[i])
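
One robustness note on `sigmoid` as written above: `np.exp(-x)` overflows in float64 once `x` drops below roughly -709, which raises a `RuntimeWarning` even though the returned value is still 0.0. The examples in this document are unlikely to reach that range, but a common stable variant is sketched below. The name `stable_sigmoid` is hypothetical and not part of the code above.

```python
import numpy as np


def stable_sigmoid(x):
    """Sigmoid that only ever exponentiates non-positive numbers."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))  # always in [0, 1], so it cannot overflow
    # x >= 0: 1 / (1 + exp(-x));  x < 0: exp(x) / (1 + exp(x))
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))


x = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
print(stable_sigmoid(x))
```

Both branches are mathematically the same function; the split only controls which side of zero the exponent is evaluated on.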

### Example 1: XOR. [2 different architectures]

In [105]:
if __name__ == "__main__":
    # XOR dataset: the output is 1 exactly when the two inputs differ
    X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])  # 2 inputs
    Y = np.array([[0, 1, 1, 0]])  # 1 output

    # Test different architectures
    architectures = [
        [2, 2, 1],  # Original: 1 hidden layer
        [2, 4, 3, 1],  # 2 hidden layers
    ]

    for i, arch in enumerate(architectures):
        print(f"\n{'='*50}")
        print(f"Testing Architecture {i+1}: {arch}")
        print(f"{'='*50}")

        dnn = DeepNeuralNetwork(layer_sizes=arch, learning_rate=1.0)
        print(f"Network Architecture: {dnn.layer_sizes}")

        dnn.train(X, Y, epochs=10000, print_every=2500)

        print("\nPredictions:")
        preds = dnn.predict(X)
        print(preds)
        print("Expected:")
        print(Y)

        # Calculate accuracy
        accuracy = np.mean(preds == Y)
        print(f"Accuracy: {accuracy*100:.1f}%")

        # Print all parameters in detailed format
        dnn.print_parameters()


Testing Architecture 1: [2, 2, 1]
Network Architecture: [2, 2, 1]
Epoch 2500 - Loss: 0.008217
Epoch 5000 - Loss: 0.001890
Epoch 7500 - Loss: 0.001024
Epoch 10000 - Loss: 0.000694

Predictions:
[[0 1 1 0]]
Expected:
[[0 1 1 0]]
Accuracy: 100.0%

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 2

Weights W1:
[[4.32814095 4.33226122]
 [6.1725246  6.18915995]]

Biases b1:
[[-6.64946256]
 [-2.71010771]]

Layer 2:
--------------------
Shape: 2 → 1

Weights W2:
[[-9.5152431   8.83491387]]

Biases b2:
[[-4.05803911]]

Testing Architecture 2: [2, 4, 3, 1]
Network Architecture: [2, 4, 3, 1]
Epoch 2500 - Loss: 0.082558
Epoch 5000 - Loss: 0.001050
Epoch 7500 - Loss: 0.000443
Epoch 10000 - Loss: 0.000271

Predictions:
[[0 1 1 0]]
Expected:
[[0 1 1 0]]
Accuracy: 100.0%

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 4

Weights W1:
[[-3.25750153 -4.16018297]
 [ 4.60884898 -1.28422546]
 [ 3.12279228  3.70908962]
 [-3.78894799  5.69800386]]

Biases b1:
[[ 0
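
The printed parameters for the `[2, 2, 1]` architecture can be checked by hand. The sketch below copies those weights and biases from the output above and reruns the forward pass on the four inputs; nothing here is retrained.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Parameters copied from the printed output for architecture [2, 2, 1].
W1 = np.array([[4.32814095, 4.33226122],
               [6.1725246, 6.18915995]])
b1 = np.array([[-6.64946256], [-2.71010771]])
W2 = np.array([[-9.5152431, 8.83491387]])
b2 = np.array([[-4.05803911]])

X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])

hidden = sigmoid(W1 @ X + b1)       # shape (2, 4): one column per input pair
output = sigmoid(W2 @ hidden + b2)  # shape (1, 4)

print(np.round(hidden, 2))          # rows: hidden unit 1, hidden unit 2
print((output > 0.5).astype(int))   # thresholded predictions
```

With these values the first hidden unit is only strongly active when both inputs are 1 (AND-like), while the second is active when at least one input is 1 (OR-like). The output layer weights the first negatively and the second positively, so it fires for "OR but not AND", which is 1 exactly when the two inputs differ and reproduces the printed predictions `[[0 1 1 0]]`.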

### Example 2: Approximate a Real-Valued Function. [sin(pi*x1) * x2]

In [111]:
# y = sin(pi*x1)*x2
samples = 8  # small for readability
X = np.random.rand(2, samples)
Y = np.sin(np.pi * X[0]) * X[1]
Y = Y.reshape(1, -1)

nn = DeepNeuralNetwork([2, 16, 8, 4, 1], learning_rate=1.0)
nn.train(X, Y, epochs=60000, print_every=10000)

print("\nInput Data (X):")
print(X)
print("\nExpected Output (Y):")
print(Y)
print("\nPredictions (raw sigmoid outputs):")
print(nn.forward(X))
nn.print_parameters()

Epoch 10000 - Loss: 0.000376
Epoch 20000 - Loss: 0.000167
Epoch 30000 - Loss: 0.000132
Epoch 40000 - Loss: 0.000108
Epoch 50000 - Loss: 0.000088
Epoch 60000 - Loss: 0.000072

Input Data (X):
[[0.35781408 0.64781745 0.12292068 0.88865908 0.50308395 0.44934974
  0.58586479 0.62478386]
 [0.07177581 0.68261722 0.24193168 0.71395263 0.82253479 0.80395851
  0.55250097 0.52016989]]

Expected Output (Y):
[[0.06473329 0.61032699 0.09112119 0.24467048 0.82249619 0.79380187
  0.53252086 0.48070937]]

Predictions (raw sigmoid outputs):
[[0.06770083 0.61295924 0.0894697  0.24606049 0.80584454 0.80772176
  0.53790728 0.47316372]]

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 16

Weights W1:
[[ 0.47584095  0.26162599]
 [-0.75520716 -0.36543449]
 [-4.51653899 -1.10811182]
 [-0.77520137 -0.623343  ]
 [ 0.99998008 -0.44012063]
 [ 2.84582339  0.01331806]
 [-0.09993276 -0.26916367]
 [-0.78879763  0.55031018]
 [-2.23794086 -1.80890857]
 [-0.81560549 -2.16527384]
 [-1.79168577 -1.43

### Example 3: Circular Boundary.

In [107]:
# np.random.seed(0)
n = 8  # keep small for display
X = np.random.rand(2, n)
# Label 1 for points outside the circle x1^2 + x2^2 = 0.5, else 0
Y = (X[0] ** 2 + X[1] ** 2 > 0.5).astype(int).reshape(1, -1)

nn = DeepNeuralNetwork([2, 8, 4, 1], learning_rate=1.0)
nn.train(X, Y, epochs=60000, print_every=10000)

print("\nInput Data (X):")
print(X)
print("\nExpected Output (Y):")
print(Y)
print("\nPredictions:")
print(nn.predict(X))
nn.print_parameters()

Epoch 10000 - Loss: 0.000131
Epoch 20000 - Loss: 0.000056
Epoch 30000 - Loss: 0.000035
Epoch 40000 - Loss: 0.000025
Epoch 50000 - Loss: 0.000019
Epoch 60000 - Loss: 0.000016

Input Data (X):
[[0.06364091 0.83137351 0.59897851 0.114933   0.09385728 0.90962682
  0.66920026 0.8292868 ]
 [0.8789789  0.57177235 0.51744635 0.43042741 0.31694659 0.43459596
  0.77387966 0.60192343]]

Expected Output (Y):
[[1 1 1 0 0 1 1 1]]

Predictions:
[[1 1 1 0 0 1 1 1]]

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 8

Weights W1:
[[ 1.57006618  0.63402747]
 [ 1.98527999  3.38740314]
 [ 1.37591758  1.85056585]
 [ 2.57984573  3.05800054]
 [-2.76522636 -3.31187039]
 [-0.13812362  2.49277785]
 [-0.16039805 -0.11814049]
 [ 0.57923325 -0.06735808]]

Biases b1:
[[-0.54917717]
 [-2.30512394]
 [-0.79900545]
 [-2.23164424]
 [ 2.41675412]
 [-1.07661481]
 [ 0.06638561]
 [ 0.02778843]]

Layer 2:
--------------------
Shape: 8 → 4

Weights W2:
[[ 0.51969422 -1.17277826 -0.42442142 -1.10886568  0.

### Example 4: Sanity Checks [OR/AND]

In [108]:
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
Y = np.array([[0, 1, 1, 1]]) # OR operation

nn = DeepNeuralNetwork([2, 1], learning_rate=0.3)
nn.train(X, Y, epochs=10000, print_every=2000)

print("\nInput Data (X):")
print(X)
print("\nExpected Output (Y):")
print(Y)
print("\nPredictions:")
print(nn.predict(X))
nn.print_parameters()

Epoch 2000 - Loss: 0.011966
Epoch 4000 - Loss: 0.005287
Epoch 6000 - Loss: 0.003313
Epoch 8000 - Loss: 0.002392
Epoch 10000 - Loss: 0.001864

Input Data (X):
[[0 0 1 1]
 [0 1 0 1]]

Expected Output (Y):
[[0 1 1 1]]

Predictions:
[[0 1 1 1]]

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 1

Weights W1:
[[5.83867821 5.83822696]]

Biases b1:
[[-2.66976654]]


In [109]:
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
Y = np.array([[0, 0, 0, 1]]) # AND operation

nn = DeepNeuralNetwork([2, 1], learning_rate=0.3)
nn.train(X, Y, epochs=5000, print_every=1000)

print("\nInput Data (X):")
print(X)
print("\nExpected Output (Y):")
print(Y)
print("\nPredictions:")
print(nn.predict(X))
nn.print_parameters()

Epoch 1000 - Loss: 0.041687
Epoch 2000 - Loss: 0.021355
Epoch 3000 - Loss: 0.013843
Epoch 4000 - Loss: 0.010082
Epoch 5000 - Loss: 0.007865

Input Data (X):
[[0 0 1 1]
 [0 1 0 1]]

Expected Output (Y):
[[0 0 0 1]]

Predictions:
[[0 0 0 1]]

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 1

Weights W1:
[[4.29076795 4.29076243]]

Biases b1:
[[-6.5362845]]
