## Mini Deep Neural Network to learn XOR (from scratch)

This notebook implements a small multi-layer neural network (one hidden layer) that learns the XOR function, built from scratch using only NumPy. XOR is not linearly separable, so no single-layer model can represent it. The problem has a long and influential history in neural network research, and today it serves as a classic toy example for showing why hidden layers and non-linear activations matter.

## History of Deep Neural Networks and the XOR Problem

### 1958 – The Perceptron (Frank Rosenblatt)
- Introduced the single-layer perceptron.
- Could only solve linearly separable problems (e.g. AND, OR).
- Failed to solve XOR — a non-linearly separable problem.
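
The limitation listed above can be checked directly. The following is a minimal sketch, assuming the classic perceptron learning rule with a step activation; the helper names (`train_perceptron`, `accuracy`) and the `_demo` variables are illustrative and are not part of the network built later in this notebook.

```python
import numpy as np

# Truth tables for two logic gates on the same four inputs
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets_demo = {
    "AND": np.array([0, 0, 0, 1]),  # linearly separable
    "XOR": np.array([0, 1, 1, 0]),  # not linearly separable
}

def train_perceptron(X, t, epochs=100):
    """Classic perceptron rule: weights change only on misclassified samples."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, t_i in zip(X, t):
            pred = int(x_i @ w + b > 0)   # step activation
            w += (t_i - pred) * x_i
            b += (t_i - pred)
    return w, b

def accuracy(X, t, w, b):
    """Fraction of samples the single-layer perceptron classifies correctly."""
    return float(np.mean((X @ w + b > 0).astype(int) == t))

results_demo = {}
for name, t in targets_demo.items():
    w, b = train_perceptron(X_demo, t)
    results_demo[name] = accuracy(X_demo, t, w, b)
    print(f"{name}: accuracy = {results_demo[name]:.2f}")
```

AND is learned perfectly, while XOR never reaches full accuracy: no single straight line separates its two classes, so the weights keep cycling regardless of the number of epochs.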

### 1969 – Minsky & Papert’s Critique
- Published the book *Perceptrons*.
- Proved mathematically that single-layer perceptrons cannot represent XOR.
- This contributed to a significant decline in neural network research (often associated with the first "AI winter").

### 1986 – Backpropagation and the Revival
- Rumelhart, Hinton, and Williams popularized the backpropagation algorithm for training multi-layer networks.
- Showed that multi-layer networks (with hidden layers and non-linear activations) can solve XOR.
- This marked the rebirth of interest in neural networks and deep learning.

### XOR as a Benchmark Problem
- XOR became a standard test case for demonstrating:
  - The power of hidden layers.
  - Non-linear decision boundaries.
  - Correctness of backpropagation implementations.
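
To make the role of the hidden layer concrete, here is a small sketch of a two-layer network with hand-picked weights and step activations that computes XOR as AND(OR(x1, x2), NAND(x1, x2)). The weights and names (`W_hidden`, `W_out`, `step`) are assumptions chosen for illustration; they are not the weights the trained network in this notebook ends up with.

```python
import numpy as np

def step(z):
    """Step activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: column 0 acts as OR, column 1 as NAND (hand-picked weights)
W_hidden = np.array([[1.0, -1.0],
                     [1.0, -1.0]])
b_hidden = np.array([-0.5, 1.5])

# Output layer: AND of the two hidden units
W_out = np.array([[1.0], [1.0]])
b_out = np.array([-1.5])

hidden = step(X_demo @ W_hidden + b_hidden)
xor_out = step(hidden @ W_out + b_out).ravel()
print(xor_out)  # [0 1 1 0]
```

Each hidden unit draws one linear boundary; combining the two yields the non-linear decision region that a single layer cannot produce.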

### Summary Table

| Year  | Milestone                           | Significance                                        |
|-------|-------------------------------------|-----------------------------------------------------|
| 1958  | Rosenblatt’s Perceptron             | Early trainable neural net; cannot solve XOR        |
| 1969  | Minsky & Papert’s critique          | Proved XOR unsolvable for single-layer perceptrons  |
| 1986  | Backpropagation (Rumelhart et al.)  | Multi-layer networks can learn XOR                  |
| Today | XOR in teaching and benchmarking    | A classic test for non-linear learning              |


In [37]:
import numpy as np

## Specify input, activation function, and parameters

In [38]:
# XOR input (X) and output (y)
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

y = np.array([[0], [1], [1], [0]])

# Sigmoid activation and its derivative
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    # Expects the activation a = sigmoid(z), not z itself: sigma'(z) = a * (1 - a)
    return a * (1 - a)

# Initialize parameters
np.random.seed(42)

# Number of neurons per layer
input_size = 2
hidden_size = 4
output_size = 1

lr = 0.1  # Learning rate

# Initialize weights and biases
W1 = np.random.randn(input_size, hidden_size)
b1 = np.zeros((1, hidden_size))
W2 = np.random.randn(hidden_size, output_size)
b2 = np.zeros((1, output_size))
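
The derivative helper takes the activation rather than the pre-activation, which is easy to get wrong. A quick numerical check, sketched here with a central finite difference (the step size `eps_demo` and the test points `z_demo` are arbitrary choices), confirms that `a * (1 - a)` matches the slope of the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    # a is the activation sigmoid(z), not z itself
    return a * (1 - a)

z_demo = np.linspace(-5, 5, 11)
eps_demo = 1e-6

# Central finite difference approximation of d(sigmoid)/dz
numeric = (sigmoid(z_demo + eps_demo) - sigmoid(z_demo - eps_demo)) / (2 * eps_demo)
# Analytic derivative, evaluated from the activation
analytic = sigmoid_derivative(sigmoid(z_demo))

max_err = np.max(np.abs(numeric - analytic))
print(f"max |numeric - analytic| = {max_err:.2e}")
```

Passing `z` directly instead of `sigmoid(z)` would give a noticeably different result, which is why the training loop below calls the helper on `a1` and `y_hat`.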

## Training loop

### Backpropagation

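The gradients below are those of the summed squared error $\frac{1}{2}\sum (y - \hat{y})^2$. The mean squared error reported during training differs only by a constant factor, which is absorbed into the learning rate $\eta$. Here $\odot$ denotes element-wise multiplication, and the sigmoid derivative is written in terms of the activation: $\sigma'(z) = \sigma(z) \odot (1 - \sigma(z))$.

$\delta^{[2]} = -(y - \hat{y}) \odot \sigma'(z^{[2]})$, with $\sigma'(z^{[2]}) = \hat{y} \odot (1 - \hat{y})$

$\nabla W^{[2]} = (a^{[1]})^\top \delta^{[2]}$

$\nabla b^{[2]} = \sum \delta^{[2]}$ &nbsp;&nbsp; (*across batch*)

$\delta^{[1]} = (\delta^{[2]} (W^{[2]})^\top) \odot \sigma'(z^{[1]})$, with $\sigma'(z^{[1]}) = a^{[1]} \odot (1 - a^{[1]})$

$\nabla W^{[1]} = X^\top \delta^{[1]}$

$\nabla b^{[1]} = \sum \delta^{[1]}$ &nbsp;&nbsp; (*across batch*)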
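
A common way to validate backpropagation formulas like these is a gradient check against finite differences. The sketch below assumes the loss is half the summed squared error (the loss for which these formulas are the exact gradient) and uses random parameters; the function names and the `_chk` variables are illustrative and not part of the notebook's training code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def half_sse(params, X, y):
    """0.5 * sum of squared errors for the 2-layer sigmoid network."""
    W1, b1, W2, b2 = params
    y_hat = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    return 0.5 * np.sum((y - y_hat) ** 2)

def analytic_grads(params, X, y):
    """Gradients from the backpropagation formulas, ordered like params."""
    W1, b1, W2, b2 = params
    a1 = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(a1 @ W2 + b2)
    delta2 = -(y - y_hat) * y_hat * (1 - y_hat)
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)
    return [X.T @ delta1, delta1.sum(axis=0, keepdims=True),
            a1.T @ delta2, delta2.sum(axis=0, keepdims=True)]

def numeric_grads(params, X, y, eps=1e-6):
    """Central finite differences, perturbing one parameter entry at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            loss_plus = half_sse(params, X, y)
            p[idx] = old - eps
            loss_minus = half_sse(params, X, y)
            p[idx] = old  # restore the original value
            g[idx] = (loss_plus - loss_minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
X_chk = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_chk = np.array([[0], [1], [1], [0]], dtype=float)
params_chk = [rng.normal(size=(2, 4)), rng.normal(size=(1, 4)),
              rng.normal(size=(4, 1)), rng.normal(size=(1, 1))]

max_diff = max(np.max(np.abs(a - n)) for a, n in
               zip(analytic_grads(params_chk, X_chk, y_chk),
                   numeric_grads(params_chk, X_chk, y_chk)))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

A very small difference indicates that the analytic gradients agree with the numerical ones; a large one usually points to a sign error or a missing transpose.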

---

### Gradient Descent Update

$W^{[2]} \leftarrow W^{[2]} - \eta \cdot \nabla W^{[2]}$

$b^{[2]} \leftarrow b^{[2]} - \eta \cdot \nabla b^{[2]}$

$W^{[1]} \leftarrow W^{[1]} - \eta \cdot \nabla W^{[1]}$

$b^{[1]} \leftarrow b^{[1]} - \eta \cdot \nabla b^{[1]}$
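
The same update rule can be seen in isolation on a one-dimensional problem. This is a minimal sketch that minimizes the hypothetical function $f(w) = (w - 3)^2$; the function, the starting point, and the number of steps are assumptions chosen only to illustrate the rule.

```python
# Minimize f(w) = (w - 3)^2 with the update w <- w - eta * grad
eta_demo = 0.1
w_demo = 0.0

for _ in range(50):
    grad = 2 * (w_demo - 3)      # df/dw at the current w
    w_demo -= eta_demo * grad    # gradient descent step

print(f"w after 50 steps: {w_demo:.5f}")
```

Each step shrinks the distance to the minimum at $w = 3$ by a factor of $1 - 2\eta = 0.8$, so `w_demo` approaches 3. The network's weight matrices are updated in exactly the same way, one gradient per parameter array.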


In [39]:
# Training loop
for epoch in range(20000):
    
    # --- Feedforward pass ---
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)

    z2 = a1 @ W2 + b2
    y_hat = sigmoid(z2)  # prediction y_hat

    # --- Compute loss (mean squared error) ---
    loss = np.mean((y - y_hat) ** 2)

    # --- Backpropagation ---
    d_loss_y_hat = -(y - y_hat)
    d_y_hat_z2 = sigmoid_derivative(y_hat)
    delta2 = d_loss_y_hat * d_y_hat_z2   # error signal at the output layer

    dW2 = a1.T @ delta2
    db2 = np.sum(delta2, axis=0, keepdims=True)

    d_a1_z1 = sigmoid_derivative(a1)
    delta1 = (delta2 @ W2.T) * d_a1_z1   # error signal at the hidden layer

    dW1 = X.T @ delta1
    db1 = np.sum(delta1, axis=0, keepdims=True)

    # --- Gradient descent update ---
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1

    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.5f}")

Epoch 0, Loss: 0.28319
Epoch 1000, Loss: 0.24523
Epoch 2000, Loss: 0.21241
Epoch 3000, Loss: 0.15033
Epoch 4000, Loss: 0.05716
Epoch 5000, Loss: 0.02093
Epoch 6000, Loss: 0.01068
Epoch 7000, Loss: 0.00668
Epoch 8000, Loss: 0.00468
Epoch 9000, Loss: 0.00353
Epoch 10000, Loss: 0.00279
Epoch 11000, Loss: 0.00228
Epoch 12000, Loss: 0.00192
Epoch 13000, Loss: 0.00164
Epoch 14000, Loss: 0.00143
Epoch 15000, Loss: 0.00126
Epoch 16000, Loss: 0.00113
Epoch 17000, Loss: 0.00102
Epoch 18000, Loss: 0.00092
Epoch 19000, Loss: 0.00084


## Prediction

In [40]:
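# Final predictions
# Note: y_hat comes from the forward pass of the last training iteration,
# i.e. it was computed just before the final weight update.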
print("\nPredictions after training:")
print(np.round(y_hat, 3))
print("mse = ", np.mean((y - y_hat) ** 2))


Predictions after training:
[[0.02 ]
 [0.974]
 [0.97 ]
 [0.034]]
mse =  0.000775122615185277
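
The network outputs values between 0 and 1 rather than hard class labels. One simple way to turn them into XOR outputs, sketched here with the rounded predictions printed above and an assumed decision threshold of 0.5, is:

```python
import numpy as np

# Predictions as printed above (rounded to 3 decimals)
y_hat_printed = np.array([[0.02], [0.974], [0.97], [0.034]])
y_true = np.array([[0], [1], [1], [0]])

# The 0.5 threshold is an assumption, the usual choice for a sigmoid output
labels = (y_hat_printed > 0.5).astype(int)

print(labels.ravel())  # [0 1 1 0]
print("all correct:", bool(np.all(labels == y_true)))
```

All four thresholded outputs match the XOR truth table.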
