# Day 4: Keeping Neural Networks Simple (MDL) 🧠

Welcome to Day 4 of 30 Papers in 30 Days!

Today we are tackling a profound idea from Geoffrey Hinton and Drew van Camp (1993): **"Keeping Neural Networks Simple by Minimizing the Description Length of the Weights"**.

This paper applied the Minimum Description Length (MDL) principle to neural network weights, which boils down to **Compression = Generalization**. If a model fits the data and its weights can be described with fewer bits, it tends to generalize better.

## What You'll Learn

1.  **The Problem**: Why standard networks are "brittle" and overconfident.
2.  **The Solution**: How adding **noise** to weights forces simplicity.
3.  **The Implementation**: Building a **Bayesian Neural Network** from scratch.
4.  **The Visualization**: Seeing the famous **"Uncertainty Envelope"**.

## The Big Idea (in 30 seconds)

Imagine you are a teacher grading a student.
* **Standard Student (Overfitting):** Memorizes the textbook word-for-word. Gets 100% on the practice test, but fails if you rephrase the question.
* **MDL Student (Generalization):** Remembers the *concepts* loosely. Might get a few details wrong, but understands the logic and passes any test.

We force the network to be the second student by telling it: **"You are not allowed to memorize precise weights (e.g., 5.12391). You can only memorize fuzzy ranges (e.g., roughly 5)."**
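
To make "fuzzy ranges" concrete, here is a minimal sketch of one way to represent such a weight: a mean plus Gaussian noise scaled by a sigma. The helper `sample_weight` and the specific numbers are illustrative assumptions, not taken from the `MDLNetwork` implementation used later.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_weight(mu, sigma, rng):
    """Draw one noisy weight: the mean plus Gaussian noise scaled by sigma."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps

mu = np.array([5.0])        # what the weight is "roughly"
sharp = np.array([0.001])   # tiny sigma: a precise weight, expensive to describe
fuzzy = np.array([0.5])     # large sigma: a fuzzy weight, cheap to describe

# Sample each weight many times and compare how much the draws spread out.
sharp_draws = np.array([sample_weight(mu, sharp, rng)[0] for _ in range(1000)])
fuzzy_draws = np.array([sample_weight(mu, fuzzy, rng)[0] for _ in range(1000)])

print(f"sharp weight spread: {sharp_draws.std():.4f}")
print(f"fuzzy weight spread: {fuzzy_draws.std():.4f}")
```

Both weights are "roughly 5", but only the sharp one pins the value down to several decimal places.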

Let's build it! 🚀

In [None]:
# Setup
import numpy as np
import matplotlib.pyplot as plt
import sys
import os

# Add the notebook's working directory to the path for imports
sys.path.insert(0, os.getcwd())

# Import our MDL implementation
from implementation import MDLNetwork
from visualization import (
    plot_uncertainty_envelope,
    plot_weight_distributions,
    plot_loss_dynamics,
    plot_snr_analysis,
    analyze_compression_stats
)

# Set random seed for reproducibility
np.random.seed(42)

print("✅ All imports successful!")
print(f"NumPy version: {np.__version__}")

## 1. The Data: A "Gappy" Sine Wave 🌊

To see why MDL is cool, we need a tricky dataset.
We will generate a sine wave, but we will **delete the middle part**.

* **Standard NN:** Will confidently hallucinate a line through the gap.
* **Bayesian NN:** Will say "I don't know what's in the gap!" (High Uncertainty).

This "knowing what you don't know" is crucial for AI safety.

In [None]:
# Generate noisy sine wave with a GAP
def generate_gappy_data(n=100):
    # Left side (-3 to -1)
    X1 = np.random.uniform(-3, -1, n//2)
    # Right side (1 to 3)
    X2 = np.random.uniform(1, 3, n//2)
    # Combine
    X = np.concatenate([X1, X2])
    # Add noise (sized to X, so an odd n does not cause a shape mismatch)
    y = np.sin(X) + np.random.normal(0, 0.1, X.shape)
    return X.reshape(-1, 1), y.reshape(-1, 1)

X_train, y_train = generate_gappy_data(100)

# Visualization range (including the gap)
X_test = np.linspace(-4, 4, 200).reshape(-1, 1)

plt.figure(figsize=(10, 5))
plt.scatter(X_train, y_train, c='red', label='Training Data')
plt.axvspan(-1, 1, color='gray', alpha=0.2, label='The GAP (No Data)')
plt.plot(X_test, np.sin(X_test), 'k--', alpha=0.5, label='True Sine Wave')
plt.title("The Challenge: Predict what happens in the Gap")
plt.legend()
plt.show()

## 2. Training the Bayesian Brain 🧠

We will now train our `MDLNetwork`. Unlike a normal network, this one has **two** loss terms:

1.  **Error Cost (NLL):** "Did I get the prediction right?"
2.  **Complexity Cost (KL):** "Did I use simple weights?"

The `kl_weight` parameter controls the balance. 
* Too low? It overfits (memorizes).
* Too high? It underfits (ignores data to be simple).
* Just right? It fits the data and stays humble where there is none.
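
As a sketch, the two terms can be written as a single objective. Here $N$ is the number of training points, $\beta$ is `kl_weight`, and $q_j = \mathcal{N}(\mu_j, \sigma_j^2)$ is the fuzzy distribution of weight $j$. The zero-mean Gaussian prior $p = \mathcal{N}(0, \sigma_p^2)$ is an assumption made for illustration; the notebook does not show which prior `MDLNetwork` uses.

$$
\mathcal{L} = \underbrace{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}_{\text{Error Cost}} \;+\; \beta \cdot \underbrace{\frac{1}{N}\sum_{j}\mathrm{KL}\left(q_j \,\|\, p\right)}_{\text{Complexity Cost}}
$$

$$
\mathrm{KL}\left(q_j \,\|\, p\right) = \log\frac{\sigma_p}{\sigma_j} + \frac{\sigma_j^2 + \mu_j^2}{2\sigma_p^2} - \frac{1}{2}
$$

Under that assumed prior, a weight costs nothing when $\mu_j = 0$ and $\sigma_j = \sigma_p$, and it gets more expensive as it becomes sharper (smaller $\sigma_j$) or moves away from zero.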

In [None]:
# Initialize Network
net = MDLNetwork(input_size=1, hidden_size=20, output_size=1)

# Hyperparameters
epochs = 2000
lr = 0.01
kl_weight = 0.1  # The "Simplicity Pressure"

# Storage for plotting
history = {'total': [], 'nll': [], 'kl': []}

print("Training Bayesian Network...")
print("=" * 50)

for epoch in range(epochs):
    # 1. Forward Pass (This samples random weights!)
    # Every time we call this, the network is slightly different.
    preds = net.forward(X_train)
    
    # 2. Data Loss (MSE, which matches a Gaussian NLL up to scale and a constant)
    nll = np.mean((preds - y_train)**2)
    d_nll = 2 * (preds - y_train) / len(X_train)
    
    # 3. Complexity Loss (KL Divergence)
    kl = net.total_kl() / len(X_train)
    
    # 4. Total Loss
    loss = nll + kl_weight * kl
    
    # Store history
    history['total'].append(loss)
    history['nll'].append(nll)
    history['kl'].append(kl)
    
    # 5. Backward Pass
    net.backward(d_nll)
    
    # 6. Update Weights
    net.update_weights(lr, kl_weight)
    
    if epoch % 200 == 0:
        print(f"Epoch {epoch:4d} | Total: {loss:.4f} | Error: {nll:.4f} | Complexity: {kl:.4f}")

print("\n✅ Training Complete!")

## 3. The Battle: Complexity vs. Error ⚔️

Let's look at how the network learned. 
Usually, the **Error** drops quickly, but the **Complexity** (KL) might rise at first as the network learns structure, before stabilizing.

In [None]:
plot_loss_dynamics(history)

## 4. The Uncertainty Envelope 📉

This is one of the most recognizable visualizations in Bayesian deep learning.

We will run the network **100 times** on the test data.
Because the weights are "fuzzy" (random distributions), each run produces a slightly different line.

* **Where we have data:** All the lines agree (Low Variance).
* **In the GAP:** The lines disagree (High Variance).

Together the lines form a tube, and its width represents what the model **doesn't know**.
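
The "run it many times" recipe can be sketched as follows. The one-weight model `stochastic_forward` is a hypothetical stand-in, not the trained `MDLNetwork`, and the two-standard-deviation band is an assumed choice; `plot_uncertainty_envelope` may compute its band differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a Bayesian model: y = w * x with one fuzzy weight w.
w_mu, w_sigma = 1.0, 0.2

def stochastic_forward(x, rng):
    """One forward pass using a freshly sampled weight."""
    w = w_mu + w_sigma * rng.standard_normal()
    return w * x

x_grid = np.linspace(-4, 4, 9).reshape(-1, 1)

# Stack 100 stochastic passes: shape (n_samples, n_points, 1).
samples = np.stack([stochastic_forward(x_grid, rng) for _ in range(100)])

mean = samples.mean(axis=0)   # the central prediction
std = samples.std(axis=0)     # how much the sampled lines disagree
lower, upper = mean - 2 * std, mean + 2 * std  # the envelope

for x, lo, hi in zip(x_grid[:, 0], lower[:, 0], upper[:, 0]):
    print(f"x={x:+.1f}  band=[{lo:+.2f}, {hi:+.2f}]")
```

In this toy model the band widens as |x| grows, simply because the weight noise is multiplied by x. In the real network, the widening inside the gap comes from weights the data never constrained.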

In [None]:
plot_uncertainty_envelope(net, X_train, y_train, X_test, n_samples=100)

## 5. Peeking Inside the Brain (Weight Distributions) 🔬

The paper's title is about "Minimizing Description Length."
How do we see that?

We look at the **Sigma** (Uncertainty) of the weights.
* **Sharp Weights (Low Sigma):** The model says "This weight MUST be exactly 0.5."
* **Fuzzy Weights (High Sigma):** The model says "This weight can be anything around 0.1. It doesn't matter."

**Key Insight:** The fuzzy weights carry **almost no information** (close to 0 bits). They are effectively compressed away!
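
To put rough numbers on this, the sketch below measures the cost of a weight in bits as the KL divergence from a prior. The standard normal prior and the helper `kl_bits` are assumptions for illustration; `MDLNetwork` may use a different prior.

```python
import numpy as np

def kl_bits(mu, sigma, prior_sigma=1.0):
    """KL( N(mu, sigma^2) || N(0, prior_sigma^2) ), converted from nats to bits."""
    kl_nats = (np.log(prior_sigma / sigma)
               + (sigma**2 + mu**2) / (2 * prior_sigma**2)
               - 0.5)
    return kl_nats / np.log(2)

# The two example weights from the bullets above.
sharp_bits = kl_bits(mu=0.5, sigma=0.01)  # "MUST be exactly 0.5"
fuzzy_bits = kl_bits(mu=0.1, sigma=1.0)   # "anything around 0.1"

print(f"sharp weight: {sharp_bits:.3f} bits")
print(f"fuzzy weight: {fuzzy_bits:.3f} bits")
```

Under these assumptions the sharp weight costs several bits, while the fuzzy one costs a small fraction of a bit because it is nearly indistinguishable from the prior.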

In [None]:
plot_weight_distributions(net, 'layer1')
plot_snr_analysis(net)

## 6. How much did we compress? 🗜️

Finally, let's calculate the statistics.
If a weight has a Signal-to-Noise Ratio (SNR, the size of its mean relative to its sigma) < 0.5, it is basically noise.
We can set it to zero (prune it) with little effect on the model's predictions.
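
A minimal sketch of that pruning rule, assuming SNR is defined as |mean| / sigma (a common convention; the exact definition inside `analyze_compression_stats` is not shown here). The helper `prune_by_snr` and the example values are hypothetical.

```python
import numpy as np

def prune_by_snr(mu, sigma, threshold=0.5):
    """Zero out weights whose |mu| / sigma falls below the threshold."""
    snr = np.abs(mu) / sigma
    keep = snr >= threshold
    return np.where(keep, mu, 0.0), keep

# Four made-up weights: two clear signals, two that are mostly noise.
mu = np.array([0.80, -0.02, 0.30, 0.05])
sigma = np.array([0.10, 0.50, 0.20, 0.40])

pruned, keep = prune_by_snr(mu, sigma, threshold=0.5)

print("kept mask:     ", keep)
print("pruned weights:", pruned)
print(f"fraction pruned: {1 - keep.mean():.0%}")
```

The fraction pruned is one simple way to summarize how much of the network was "compressed away".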

In [None]:
analyze_compression_stats(net, threshold_snr=0.5)

## 7. Key Takeaways 🎯

### 1. Generalization = Compression
By punishing the model for having precise weights (the KL term), we forced it to find a **simple solution**. A simple solution is less likely to memorize noise in the training data, and it stays honest about the "Gap" instead of inventing an answer.

### 2. Uncertainty is Useful
Unlike a standard neural network, which would lie confidently in the gap, this model admits "I don't know." This matters in safety-critical settings such as self-driving cars and medical AI.

### 3. Noise is a Feature, not a Bug
We injected noise *during* training. This noise prevented the model from memorizing the data. It's like training a runner with a heavy backpack: when you take it off (or average the results), they are stronger.

---

**Congratulations!** 🎉 You have implemented one of the deepest concepts in AI theory.

**Next Up:** We move from theory back to architecture with **Pointer Networks**!