# Day 25: The Physics of Scale
## Interactive Analysis: Scaling Laws for Neural Language Models

This notebook walks through the scaling laws of Kaplan et al. (2020), "Scaling Laws for Neural Language Models", from first principles, checking each claim with a small experiment.

### Learning Objectives:
1. **Derive the 6N Rule**: Calculate FLOPs for arbitrary scale.
2. **The 12Ld^2 Rule**: Verify parameter counting on a scratch-built Transformer.
3. **The Kaplan Sweep**: Sweep model sizes and witness the log-log straight line.
4. **Non-Linear Fitting**: Fit the irreducible loss $L_\infty$ using `MasterFitter`.
5. **Compute Optimization**: Find the optimal frontier for a fixed budget.
6. **The Predictor**: Predict GPT-3 performance from 1M parameter data.
7. **Explaining the Chinchilla Gap**: Identify why Kaplan's curves were slightly biased.

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from implementation import KaplanTransformer, MasterFitter, ComputeEconomy

%matplotlib inline
plt.style.use('ggplot')

## 1. The Arithmetic of a Transformer
Kaplan et al. (Section 2.1) state that the number of non-embedding parameters $N$ of a Transformer is roughly $12Ld_{model}^2$, assuming the attention width equals $d_{model}$ and the feed-forward width is $4d_{model}$.
Let's verify this by building a model and counting.

In [None]:
def analyze_parameters(d_model=256, n_layers=4):
    model = KaplanTransformer(vocab_size=100, d_model=d_model, n_heads=4, n_layers=n_layers)
    
    actual_n = model.count_parameters(mode="Kaplan")
    theoretical_n = 12 * n_layers * (d_model**2)
    
    print(f"\033[1mConfiguration: d_model={d_model}, L={n_layers}\033[0m")
    print(f"Actual N (Kaplan): {actual_n:,}")
    print(f"Theoretical 12Ld^2: {theoretical_n:,}")
    print(f"Precision: {100 - abs(actual_n - theoretical_n)/theoretical_n*100:.2f}%")

analyze_parameters(128, 2)
analyze_parameters(512, 12)
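The same count can be reproduced by hand, without building a model. The sketch below is a minimal, assumption-laden version: `non_embedding_params` is a hypothetical helper (not part of `implementation`), and it assumes a standard block with four $d_{model} \times d_{model}$ attention projections and a feed-forward width of $4d_{model}$. The optional bias and LayerNorm terms show why the rule is only approximate.

```python
def non_embedding_params(d_model, n_layers, d_ff=None, include_small_terms=False):
    """Count weights in a stack of standard Transformer blocks (hypothetical helper)."""
    d_ff = 4 * d_model if d_ff is None else d_ff

    attn = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * d_ff       # up-projection and down-projection
    per_layer = attn + ffn         # equals 12 * d_model**2 when d_ff = 4 * d_model

    if include_small_terms:
        per_layer += 4 * d_model           # attention biases
        per_layer += d_ff + d_model        # feed-forward biases
        per_layer += 2 * 2 * d_model       # two LayerNorms (gain and bias each)

    return n_layers * per_layer


for d_model, n_layers in [(128, 2), (512, 12)]:
    exact = non_embedding_params(d_model, n_layers)
    with_small = non_embedding_params(d_model, n_layers, include_small_terms=True)
    rule = 12 * n_layers * d_model ** 2
    print(f"d_model={d_model}, L={n_layers}: rule={rule:,}  "
          f"matrices={exact:,}  with biases/LayerNorm={with_small:,}")
```

The bias and LayerNorm terms grow linearly in $d_{model}$ while the matrices grow quadratically, so the relative gap shrinks as the model widens.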

## 2. Compute Economy: The 6N Rule
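Training costs roughly $6N$ FLOPs per token (about $2N$ for the forward pass and $4N$ for the backward pass), so total compute is $C \approx 6ND$ for $D$ training tokens.
Let's see how many PF-days it would take to train models of various sizes on 300 billion tokens (the GPT-3 budget).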

In [None]:
n_params = [1e6, 1e8, 1e9, 1.75e11] # 1M to GPT-3
tokens = 300e9

for n in n_params:
    c_pf = ComputeEconomy.calculate_c_pfdays(n, tokens)
    print(f"N: {n:10.0e} | Compute: {c_pf:10.2e} PF-days")
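For readers without the `implementation` module, the arithmetic behind the 6N rule fits in a few lines. This is a sketch under the assumption that a PF-day means $10^{15}$ FLOP/s sustained for 24 hours; `training_flops` and `pf_days` are hypothetical names, and `ComputeEconomy.calculate_c_pfdays` may differ in detail.

```python
FLOPS_PER_PF_DAY = 1e15 * 60 * 60 * 24  # 8.64e19 FLOPs in one petaflop/s-day


def training_flops(n_params, n_tokens):
    """Total training compute C ~ 6 * N * D (non-embedding params, tokens)."""
    return 6.0 * n_params * n_tokens


def pf_days(n_params, n_tokens):
    """Convert C = 6ND into PF-days."""
    return training_flops(n_params, n_tokens) / FLOPS_PER_PF_DAY


tokens = 300e9
for n in (1e6, 1e8, 1e9, 1.75e11):
    print(f"N: {n:10.0e} | Compute: {pf_days(n, tokens):10.2e} PF-days")
```

Because $C$ is linear in $N$ at a fixed token count, each tenfold increase in model size costs exactly ten times the compute.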

## 3. The Power Law: Log-Log Linearity
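We will now simulate a scaling sweep and use `MasterFitter` to recover the coefficients. Note that with a nonzero irreducible loss $L_\infty$, only the reducible part $L - L_\infty$ is a straight line on log-log axes; the raw loss flattens toward $L_\infty$ at large $N$.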

In [None]:
# Generate synthetic data with noise and irreducible loss
ns_empirical = np.logspace(5, 8, 8)
l_inf_true = 1.7
alpha_true = 0.076
nc_true = 8.8e13

rng = np.random.default_rng(0)  # fixed seed so the fit is reproducible
ls_empirical = l_inf_true + (nc_true / ns_empirical)**alpha_true + rng.normal(0, 0.01, ns_empirical.size)

# Fit
fitter = MasterFitter(ns_empirical, ls_empirical)
fitter.fit()

print(f"Recovered alpha: {fitter.alpha:.4f} (Target: {alpha_true})")
print(f"Recovered L_inf: {fitter.l_inf:.4f} (Target: {l_inf_true})")

# Plot
n_plot = np.logspace(4, 12, 100)
plt.figure(figsize=(10, 5))
plt.scatter(ns_empirical, ls_empirical, label="Empirical Pts")
plt.plot(n_plot, fitter.predict(n_plot), 'r--', label="Fitted Law")
plt.xscale('log')
plt.yscale('log')
plt.title("The Universal Law of Model Size")
plt.xlabel("Parameters (N)")
plt.ylabel("Loss (L)")
plt.legend()
plt.show()
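The internals of `MasterFitter` are not shown here, so the following is only one simple way such a fit could work, not a description of that class. The sketch grid-searches $L_\infty$ and, for each candidate, fits a straight line to $\log(L - L_\infty)$ versus $\log N$ by ordinary least squares. `fit_offset_power_law` is a hypothetical name, and the demo data is noise-free for clarity.

```python
import math


def fit_offset_power_law(ns, ls, step=0.01):
    """Fit L = l_inf + (n_c / N)**alpha by grid search over l_inf plus log-log OLS."""
    xs = [math.log(n) for n in ns]
    x_mean = sum(xs) / len(xs)
    sxx = sum((x - x_mean) ** 2 for x in xs)

    best = None
    n_steps = int(min(ls) / step)  # candidates stay strictly below min(ls)
    for i in range(n_steps):
        l_inf = i * step
        ys = [math.log(l - l_inf) for l in ls]
        y_mean = sum(ys) / len(ys)
        slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
        intercept = y_mean - slope * x_mean
        # Residual in log space measures how straight the reducible loss is
        sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, l_inf, -slope, intercept)

    _, l_inf, alpha, intercept = best
    n_c = math.exp(intercept / alpha)  # intercept = alpha * log(n_c)
    return l_inf, alpha, n_c


# Noise-free demo with the same ground truth as the cell above
ns = [10 ** (5 + 3 * k / 7) for k in range(8)]
ls = [1.7 + (8.8e13 / n) ** 0.076 for n in ns]
l_inf_hat, alpha_hat, n_c_hat = fit_offset_power_law(ns, ls)
print(f"l_inf={l_inf_hat:.2f}, alpha={alpha_hat:.4f}, n_c={n_c_hat:.2e}")
```

With a small exponent such as 0.076, $L_\infty$ and $\alpha$ trade off against each other, so even modest noise can shift the recovered values noticeably. A nonlinear least-squares routine with sensible bounds is the more usual choice in practice.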

## 4. Compute-Optimal Frontier
Given a fixed compute budget (0.01 PF-days in the cell below), which $N$ and $D$ should you pick?
Kaplan et al. (2020) and Chinchilla (Hoffmann et al., 2022) give different answers here. Let's visualize the trade-off.

In [None]:
budget = 1e-2 # 0.01 PF-days
n_candidates = np.logspace(6, 9, 100)

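# C = 6 * N * D  =>  D = C / (6 * N)
def get_d(n, c_pfdays):
    flops = c_pfdays * 1e15 * 60 * 60 * 24  # 1 PF-day = 8.64e19 FLOPs
    return flops / (6 * n)

# Tokens affordable at each model size under the fixed budget
d_candidates = get_d(n_candidates, budget)
print(f"Tokens per run: {d_candidates.min():.2e} to {d_candidates.max():.2e}")

# NOTE: the fitted law depends on N only, so this curve is the data-unlimited
# loss L(N). It does not penalize the shrinking token count D, so it has no minimum.
losses_cap = [fitter.predict(n) for n in n_candidates]

plt.figure(figsize=(10, 5))
plt.plot(n_candidates, losses_cap, label="Fitted L(N), data-unlimited")
plt.xscale('log')
plt.xlabel("Model Size (N)")
plt.ylabel("Estimated Loss")
plt.title("Finding the Optimum: Kaplan Frontier")
plt.legend()
plt.show()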
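A fixed-budget trade-off only appears once the loss depends on both $N$ and $D$. The sketch below assumes the joint form $L(N, D) = \left[(N_c/N)^{\alpha_N/\alpha_D} + D_c/D\right]^{\alpha_D}$ from Kaplan et al., with constants as quoted in that paper's summary. They are used here for illustration only and are not the output of `MasterFitter`. `joint_loss` and `best_split` are hypothetical helpers, and this simple sweep ignores the batch-size and training-step considerations of the paper's full analysis.

```python
import math

# Illustrative constants (assumed, not fitted in this notebook)
ALPHA_N, N_C = 0.076, 8.8e13
ALPHA_D, D_C = 0.095, 5.4e13
FLOPS_PER_PF_DAY = 1e15 * 60 * 60 * 24


def joint_loss(n, d):
    """Loss as a function of model size N and dataset size D (no irreducible term)."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D


def best_split(budget_pfdays, n_min=1e6, n_max=1e9, steps=100):
    """Sweep N on a log grid, set D = C / (6N), and return the lowest-loss (N, D, L)."""
    flops = budget_pfdays * FLOPS_PER_PF_DAY
    best = None
    for k in range(steps):
        n = n_min * (n_max / n_min) ** (k / (steps - 1))
        d = flops / (6 * n)  # tokens affordable at this model size
        loss = joint_loss(n, d)
        if best is None or loss < best[2]:
            best = (n, d, loss)
    return best


n_opt, d_opt, l_opt = best_split(0.01)
print(f"N* ~ {n_opt:.2e} params, D* ~ {d_opt:.2e} tokens, loss ~ {l_opt:.3f}")
```

Small models are limited by capacity and large models by too few tokens, so the curve has an interior minimum. Where that minimum falls depends strongly on the assumed exponents, which is the crux of the Kaplan versus Chinchilla disagreement.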

## Summary of Insights
1. **Scale is predictable**: From roughly 100k to 1.5B parameters, loss follows the fitted power law closely.
2. **Irreducible Loss**: No matter how big the model, $L_\infty$ remains (entropy of the source).
3. **Compute is the currency**: All scaling factors eventually map back to FLOPs.

---