# Multiple Layer Training: Backpropagation 

## Complete Algorithm

### Forward Pass
```
a⁰ = x
for l = 1 to L:
    zˡ = Wˡ · aˡ⁻¹ + bˡ
    aˡ = σ(zˡ)
```
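The loop above maps almost line for line onto NumPy. The following is a minimal sketch, assuming column-vector inputs and sigmoid activations in every layer; the name `forward_pass` and the `[2, 3, 1]` layer sizes are illustrative choices, not part of the algorithm itself.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward_pass(x, weights, biases):
    """Return the pre-activations z^l and activations a^l of every layer."""
    activations = [x]  # a^0 = x
    zs = []
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b  # z^l = W^l a^{l-1} + b^l
        zs.append(z)
        activations.append(sigmoid(z))  # a^l = sigma(z^l)
    return zs, activations


rng = np.random.default_rng(0)
sizes = [2, 3, 1]  # illustrative: 2 inputs, 3 hidden units, 1 output
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in sizes[1:]]

x = np.array([[0.5], [-1.0]])
zs, activations = forward_pass(x, weights, biases)
print([a.shape for a in activations])  # [(2, 1), (3, 1), (1, 1)]
```

Keeping every `z` and `a` around is what lets the backward pass reuse them instead of recomputing.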

### Backward Pass
```
# 1. Compute output error (assumes sigmoid output and quadratic cost C = ½‖aᴸ - y‖²)
δᴸ = (aᴸ - y) ⊙ aᴸ ⊙ (1 - aᴸ)

# 2. Backpropagate through hidden layers
for l = L-1 down to 1:
    δˡ = ((Wˡ⁺¹)ᵀ · δˡ⁺¹) ⊙ aˡ ⊙ (1 - aˡ)

# 3. Compute gradients
for l = 1 to L:
    ∂C/∂Wˡ = δˡ · (aˡ⁻¹)ᵀ
    ∂C/∂bˡ = δˡ
```
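One way to gain confidence in these formulas is a numerical gradient check: compare the backpropagated gradients against central finite differences of the cost. The sketch below assumes a single training example, sigmoid activations, and the quadratic cost $C = \frac{1}{2}\lVert a^L - y \rVert^2$; the helper names (`cost`, `backprop`) and the tiny `[2, 3, 1]` network are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost(weights, biases, x, y):
    """Quadratic cost C = 0.5 * ||a^L - y||^2 for a single example."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return 0.5 * np.sum((a - y) ** 2)


def backprop(weights, biases, x, y):
    """Return per-layer (dC/dW, dC/db) using the delta recursion."""
    acts = [x]
    for W, b in zip(weights, biases):
        acts.append(sigmoid(W @ acts[-1] + b))

    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    # Output error: (a^L - y) * a^L * (1 - a^L)
    delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1])
    for l in range(len(weights) - 1, -1, -1):
        grads_W[l] = delta @ acts[l].T  # acts[l] is the input to layer l
        grads_b[l] = delta
        if l > 0:
            # Propagate the error one layer back
            delta = (weights[l].T @ delta) * acts[l] * (1 - acts[l])
    return grads_W, grads_b


rng = np.random.default_rng(1)
sizes = [2, 3, 1]  # illustrative architecture
weights = [rng.standard_normal((o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((o, 1)) for o in sizes[1:]]
x = rng.standard_normal((2, 1))
y = np.array([[1.0]])

grads_W, grads_b = backprop(weights, biases, x, y)

# Perturb each weight by +/- eps and compare the slope with backprop's value
eps = 1e-6
max_err = 0.0
for l, W in enumerate(weights):
    for idx in np.ndindex(*W.shape):
        original = W[idx]
        W[idx] = original + eps
        c_plus = cost(weights, biases, x, y)
        W[idx] = original - eps
        c_minus = cost(weights, biases, x, y)
        W[idx] = original  # restore the weight
        numeric = (c_plus - c_minus) / (2 * eps)
        max_err = max(max_err, abs(numeric - grads_W[l][idx]))

print(f"largest |numeric - analytic| = {max_err:.2e}")
```

If the two disagree by more than rounding noise, the usual suspects are a missing transpose or a derivative evaluated at the wrong layer.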

### Parameter Update (Gradient Descent)
```
for l = 1 to L:
    Wˡ = Wˡ - η · ∂C/∂Wˡ
    bˡ = bˡ - η · ∂C/∂bˡ
```
Here $\eta$ is the learning rate, which sets the size of each step.
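To see the update rule in isolation, it can help to apply it to a one-parameter toy cost. The sketch below uses $C(w) = (w - 3)^2$, which is an assumption made purely for illustration (it is not the network's cost), and the helper `gradient_descent` is a hypothetical name.

```python
def gradient_descent(grad, w0, eta, steps):
    """Repeat w <- w - eta * grad(w) and return the final w."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


def grad(w):
    # Gradient of the toy cost C(w) = (w - 3)^2, whose minimum is at w = 3
    return 2.0 * (w - 3.0)


w_small = gradient_descent(grad, w0=0.0, eta=0.1, steps=50)
w_large = gradient_descent(grad, w0=0.0, eta=1.1, steps=50)

print(abs(w_small - 3.0))  # shrinks toward 0: each step moves closer to the minimum
print(abs(w_large - 3.0))  # grows: each step overshoots the minimum by more than it gained
```

For this toy cost the distance to the minimum is multiplied by $(1 - 2\eta)$ at every step, so any $\eta$ above 1 diverges. Real networks have no such closed form, which is why the learning rate is usually tuned by trial.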

---
**Read from here on for your own knowledge; *you can skip it*.**
## Batch Processing (Code Enhancement)

The code below improves on the basic algorithm by processing $m$ examples simultaneously, one per column:

- **Input**: $X$ with shape $(input\_size \times batch\_size)$, where $batch\_size = m$
- **Gradients** are averaged over the batch, with $(i)$ indexing the $i$-th example:
  $$\frac{\partial C}{\partial W^l} = \frac{1}{m} \sum_{i=1}^m \delta^l_{(i)} (a^{l-1}_{(i)})^T$$
  $$\frac{\partial C}{\partial b^l} = \frac{1}{m} \sum_{i=1}^m \delta^l_{(i)}$$
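In code, the sum over examples does not need an explicit loop: stacking the per-example $\delta^l_{(i)}$ and $a^{l-1}_{(i)}$ as columns turns it into a single matrix product. The following sketch checks that equivalence on random data; the shapes (3 units, 4 inputs, 5 examples) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                   # batch size
delta = rng.standard_normal((3, m))     # column i holds delta^l for example i
a_prev = rng.standard_normal((4, m))    # column i holds a^{l-1} for example i

# Literal translation of the formulas: average of per-example terms
dW_loop = sum(np.outer(delta[:, i], a_prev[:, i]) for i in range(m)) / m
db_loop = sum(delta[:, [i]] for i in range(m)) / m

# Vectorized form: one matrix product and one row-wise sum
dW_vec = delta @ a_prev.T / m
db_vec = np.sum(delta, axis=1, keepdims=True) / m

print(np.allclose(dW_loop, dW_vec), np.allclose(db_loop, db_vec))  # True True
```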

## Why This Works: The Chain Rule Insight

The fundamental mathematics is the chain rule:

$$\frac{\partial C}{\partial W^l} = \frac{\partial C}{\partial a^L} \cdot \frac{\partial a^L}{\partial a^{L-1}} \cdot \ldots \cdot \frac{\partial a^{l+1}}{\partial a^l} \cdot \frac{\partial a^l}{\partial z^l} \cdot \frac{\partial z^l}{\partial W^l}$$

Instead of computing this massive chain directly, backpropagation computes it recursively:

$$\frac{\partial C}{\partial z^l} = \underbrace{\frac{\partial C}{\partial z^{l+1}}}_{\delta^{l+1}} \cdot \underbrace{\frac{\partial z^{l+1}}{\partial a^l}}_{W^{l+1}} \cdot \underbrace{\frac{\partial a^l}{\partial z^l}}_{\sigma'(z^l)}$$
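Written out per component, the recursion makes the transpose and the element-wise product explicit. The following is a sketch of the standard derivation, using $\delta^l_j = \partial C / \partial z^l_j$ and $z^{l+1}_k = \sum_j W^{l+1}_{kj}\,\sigma(z^l_j) + b^{l+1}_k$ from the forward pass:

$$\delta^l_j = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \cdot \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \delta^{l+1}_k \, W^{l+1}_{kj} \, \sigma'(z^l_j)$$

The sum over $k$ is the $j$-th entry of $(W^{l+1})^T \delta^{l+1}$, so in vector form:

$$\delta^l = \left((W^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)$$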

## Key Mathematical Insights

1. **Local Computation**: Each layer only needs information from adjacent layers
2. **Dynamic Programming**: We reuse computed $\delta^{l+1}$ to compute $\delta^l$
3. **Efficiency**: Storing $z^l$ and $a^l$ during the forward pass avoids recomputing them in the backward pass
4. **Sigmoid Efficiency**: $\sigma'(z^l) = a^l(1-a^l)$ uses pre-computed activations
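The sigmoid identity in the last point is easy to confirm numerically. This minimal sketch compares $a(1-a)$ with a central finite-difference estimate of $\sigma'(z)$; the grid of test points and the step size `eps` are arbitrary choices.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


z = np.linspace(-4, 4, 9)   # test points, including z = 0
a = sigmoid(z)              # activations, already available after the forward pass

analytic = a * (1 - a)      # derivative computed from the activations alone
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(f"largest difference = {max_diff:.2e}")
print(analytic.max())  # 0.25, reached at z = 0
```

The derivative never exceeds 0.25, so every sigmoid layer the error passes through scales it down by at least a factor of four.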

## The Essence

**Backpropagation calculates how much each weight contributes to the final error by:**
1. **Forward pass**: Compute all activations
2. **Backward pass**: Propagate errors from output back to input
3. **Update**: Nudge weights in the direction that reduces error

The elegance is that one forward pass and one backward pass yield exact gradients for every parameter, even millions of them, using only local information and a simple recursion.

## The Code:

In [1]:
import numpy as np


# Sigmoid activation and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)


# Mean Squared Error (MSE) Loss
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)


# Deep Neural Network Class
class DeepNeuralNetwork:
    def __init__(self, layer_sizes, learning_rate=0.1):
        """
        Initialize a deep neural network.

        Args:
            layer_sizes: List of integers representing the number of neurons in each layer.
                         Example: [2, 4, 3, 1] for input=2, hidden1=4, hidden2=3, output=1
            learning_rate: Learning rate for gradient descent
        """
        self.layer_sizes = layer_sizes
        self.lr = learning_rate
        self.L = len(layer_sizes) - 1  # Number of weight layers

        # Initialize weights and biases
        self.weights = []
        self.biases = []

        for i in range(self.L):
            # He (Kaiming) initialization: scale weights by sqrt(2 / fan_in) for
            # better training of deep networks. Plain random initialization
            # also works for these small examples.
            w = np.random.randn(layer_sizes[i + 1], layer_sizes[i]) * np.sqrt(
                2.0 / layer_sizes[i]
            )
            b = np.zeros((layer_sizes[i + 1], 1))
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, X):
        """Forward pass through all layers"""
        self.activations = [X]  # Store all activations, starting with input
        self.z_values = []  # Store all z values (before activation)

        current_activation = X

        for i in range(self.L):
            z = self.weights[i] @ current_activation + self.biases[i]
            current_activation = sigmoid(z)

            self.z_values.append(z)
            self.activations.append(current_activation)

        return current_activation

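    def backward(self, X, Y):
        """Backward pass: compute batch-averaged gradients and update parameters in place"""
        m = X.shape[1]  # Number of examples in the batch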
        dW = [np.zeros_like(w) for w in self.weights]
        db = [np.zeros_like(b) for b in self.biases]

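        # Output layer error, with the sigmoid derivative written via activations.
        # Constant factors from differentiating mse_loss are folded into the learning rate.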
        dZ = (self.activations[-1] - Y) * self.activations[-1] * (1 - self.activations[-1])
        dW[-1] = (dZ @ self.activations[-2].T) / m
        db[-1] = np.sum(dZ, axis=1, keepdims=True) / m

        # Backpropagate through hidden layers
        for l in range(self.L - 2, -1, -1):
            dA = self.weights[l + 1].T @ dZ
            dZ = dA * self.activations[l + 1] * (1 - self.activations[l + 1])
            dW[l] = (dZ @ self.activations[l].T) / m
            db[l] = np.sum(dZ, axis=1, keepdims=True) / m

        # Update parameters (gradient descent step)
        for l in range(self.L):
            self.weights[l] -= self.lr * dW[l]
            self.biases[l] -= self.lr * db[l]

    def train(self, X, Y, epochs=10000, print_every=1000):
        """Train the network"""
        for epoch in range(1, epochs + 1):
            output = self.forward(X)
            loss = mse_loss(Y, output)
            self.backward(X, Y)

            if epoch % print_every == 0:
                print(f"Epoch {epoch} - Loss: {loss:.6f}")

    def predict(self, X):
        """Make predictions"""
        output = self.forward(X)
        return (output > 0.5).astype(int)

    def print_parameters(self):
        """Print weights and biases in a detailed format"""
        print("\n" + "=" * 50)
        print("DETAILED NETWORK PARAMETERS")
        print("=" * 50)

        for i in range(self.L):
            print(f"\nLayer {i+1}:")
            print("-" * 20)
            print(f"Shape: {self.layer_sizes[i]} → {self.layer_sizes[i+1]}")

            print(f"\nWeights W{i+1}:")
            print(self.weights[i])

            print(f"\nBiases b{i+1}:")
            print(self.biases[i])

### Example 1: XOR. [2 different architectures]

In [2]:
if __name__ == "__main__":
    # XOR dataset (output is 1 only when the two inputs differ)
    X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])  # 2 inputs
    Y = np.array([[0, 1, 1, 0]])  # 1 output

    # Test different architectures
    architectures = [
        [2, 2, 1],  # Original: 1 hidden layer
        [2, 4, 3, 1],  # 2 hidden layers
    ]

    for i, arch in enumerate(architectures):
        print(f"\n{'='*50}")
        print(f"Testing Architecture {i+1}: {arch}")
        print(f"{'='*50}")

        dnn = DeepNeuralNetwork(layer_sizes=arch, learning_rate=1.0)
        print(f"Network Architecture: {dnn.layer_sizes}")

        dnn.train(X, Y, epochs=10000, print_every=2500)

        print("\nPredictions:")
        preds = dnn.predict(X)
        print(preds)
        print("Expected:")
        print(Y)

        # Calculate accuracy
        accuracy = np.mean(preds == Y)
        print(f"Accuracy: {accuracy*100:.1f}%")

        # Print all parameters in detailed format
        dnn.print_parameters() 


Testing Architecture 1: [2, 2, 1]
Network Architecture: [2, 2, 1]
Epoch 2500 - Loss: 0.005531
Epoch 5000 - Loss: 0.001676
Epoch 7500 - Loss: 0.000956
Epoch 10000 - Loss: 0.000663

Predictions:
[[0 1 1 0]]
Expected:
[[0 1 1 0]]
Accuracy: 100.0%

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 2

Weights W1:
[[-4.55115009 -4.55214052]
 [-6.0840895  -6.0888564 ]]

Biases b1:
[[6.76396353]
 [2.42951796]]

Layer 2:
--------------------
Shape: 2 → 1

Weights W2:
[[ 9.0720307  -9.31196078]]

Biases b2:
[[-4.25744572]]

Testing Architecture 2: [2, 4, 3, 1]
Network Architecture: [2, 4, 3, 1]
Epoch 2500 - Loss: 0.006881
Epoch 5000 - Loss: 0.000919
Epoch 7500 - Loss: 0.000448
Epoch 10000 - Loss: 0.000289

Predictions:
[[0 1 1 0]]
Expected:
[[0 1 1 0]]
Accuracy: 100.0%

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 4

Weights W1:
[[ 2.08670653 -4.13538263]
 [ 3.54714705 -0.01671905]
 [ 5.76475613  5.80337491]
 [ 1.35298455  1.32866933]]

Biases b1:
[[

### Example 3: Circular Boundary.

In [3]:
# np.random.seed(0)  # Uncomment for reproducible data
n = 8  # keep small for display
X = np.random.rand(2, n)
# Label 1 if the point lies outside the circle x1^2 + x2^2 = 0.5, else 0
Y = (np.sum(X**2, axis=0) > 0.5).astype(int).reshape(1, -1)

nn = DeepNeuralNetwork([2, 8, 4, 1], learning_rate=1.0)
nn.train(X, Y, epochs=60000, print_every=10000)

print("\nInput Data (X):")
print(X)
print("\nExpected Output (Y):")
print(Y)
print("\nPredictions:")
print(nn.predict(X))
nn.print_parameters()

Epoch 10000 - Loss: 0.000120
Epoch 20000 - Loss: 0.000053
Epoch 30000 - Loss: 0.000033
Epoch 40000 - Loss: 0.000024
Epoch 50000 - Loss: 0.000019
Epoch 60000 - Loss: 0.000015

Input Data (X):
[[0.25335268 0.74614652 0.88184622 0.73484997 0.96319048 0.54675465
  0.58390522 0.54025938]
 [0.03328745 0.38489802 0.49099051 0.9442814  0.68576348 0.49448593
  0.52153613 0.03154741]]

Expected Output (Y):
[[0 1 1 1 1 1 1 0]]

Predictions:
[[0 1 1 1 1 1 1 0]]

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 8

Weights W1:
[[ 1.44973103  3.9747974 ]
 [ 1.4490117   1.83976245]
 [-0.74202255 -0.53225695]
 [ 0.23454735 -0.9068718 ]
 [ 0.49963383  2.36529899]
 [-0.81507762 -4.74976815]
 [-0.49520996 -1.04680001]
 [-1.58550405 -1.39571535]]

Biases b1:
[[-1.6275902 ]
 [-0.91345002]
 [ 0.10956359]
 [ 0.00614543]
 [-0.55488528]
 [ 1.44143708]
 [ 0.1397682 ]
 [ 0.66556763]]

Layer 2:
--------------------
Shape: 8 → 4

Weights W2:
[[-1.21503017 -1.15414077  0.05533435  0.42504381 -0.

### Example 4: Sanity Checks [OR/AND]

In [4]:
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
Y = np.array([[0, 1, 1, 1]]) # OR operation

nn = DeepNeuralNetwork([2, 1], learning_rate=0.3)
nn.train(X, Y, epochs=10000, print_every=2000)

print("\nInput Data (X):")
print(X)
print("\nExpected Output (Y):")
print(Y)
print("\nPredictions:")
print(nn.predict(X))
nn.print_parameters()

Epoch 2000 - Loss: 0.011472
Epoch 4000 - Loss: 0.005178
Epoch 6000 - Loss: 0.003268
Epoch 8000 - Loss: 0.002368
Epoch 10000 - Loss: 0.001849

Input Data (X):
[[0 0 1 1]
 [0 1 0 1]]

Expected Output (Y):
[[0 1 1 1]]

Predictions:
[[0 1 1 1]]

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 1

Weights W1:
[[5.84650411 5.8473781 ]]

Biases b1:
[[-2.6741001]]


In [5]:
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
Y = np.array([[0, 0, 0, 1]]) # AND operation

nn = DeepNeuralNetwork([2, 1], learning_rate=0.3)
nn.train(X, Y, epochs=5000, print_every=1000)

print("\nInput Data (X):")
print(X)
print("\nExpected Output (Y):")
print(Y)
print("\nPredictions:")
print(nn.predict(X))
nn.print_parameters()  

Epoch 1000 - Loss: 0.043621
Epoch 2000 - Loss: 0.021935
Epoch 3000 - Loss: 0.014105
Epoch 4000 - Loss: 0.010227
Epoch 5000 - Loss: 0.007955

Input Data (X):
[[0 0 1 1]
 [0 1 0 1]]

Expected Output (Y):
[[0 0 0 1]]

Predictions:
[[0 0 0 1]]

DETAILED NETWORK PARAMETERS

Layer 1:
--------------------
Shape: 2 → 1

Weights W1:
[[4.27796714 4.27796683]]

Biases b1:
[[-6.51719745]]
