# Scaling Law Analysis (Bonus)
This notebook fits a simple scaling law to the provided hypothetical training results, predicts loss for a larger run, and recommends a compute-optimal allocation under a fixed compute budget.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

data = pd.DataFrame({
    "model_params": [1e8, 3e8, 1e9, 3e9],
    "dataset_tokens": [1e10, 3e10, 1e11, 3e11],
    "compute_pf_days": [0.1, 0.5, 2.0, 10.0],
    "final_loss": [2.50, 2.10, 1.75, 1.50],
})
data

## 1) Fit a scaling law
Because model size and dataset size scale together in the table (tokens = 100 × parameters in every row), the data cannot uniquely identify separate exponents for model-size and data-size effects. A more robust choice with this data is to fit a compute-scaling law:

\[ L(C) = L_\infty + K\,C^{-k} \]

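where \(C\) is compute in PF-days, \(L_\infty\) is the irreducible loss, \(K\) is a scale constant, and \(k\) is the compute exponent. With four data points and three free parameters, the fit has only one residual degree of freedom, so the fitted values are indicative rather than precise.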
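The identifiability problem described above can be checked numerically. The sketch below hard-codes the table's values under the illustrative names `n_params` and `n_tokens`, builds the design matrix for a hypothetical two-exponent log-linear fit, log L = a + b·log N + c·log D, and inspects its rank. That regression form is an assumption made for illustration only; it is not fitted anywhere in this notebook.

```python
import numpy as np

# Values copied from the table above; the variable names are illustrative.
n_params = np.array([1e8, 3e8, 1e9, 3e9])
n_tokens = np.array([1e10, 3e10, 1e11, 3e11])

# Design matrix for a hypothetical log-linear fit with separate exponents:
#   log L = a + b*log N + c*log D
design = np.column_stack([
    np.ones_like(n_params),
    np.log(n_params),
    np.log(n_tokens),
])

# log D = log N + log 100 in every row, so the third column is a linear
# combination of the first two. An explicit tolerance guards against
# floating-point noise in the singular values.
rank = np.linalg.matrix_rank(design, tol=1e-8)
print(f"columns: {design.shape[1]}, rank: {rank}")
```

A rank below the number of columns means the coefficients b and c cannot be separated; only their combined effect along the frontier is identified, which is why a single-variable compute fit is used instead.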

In [None]:
C = data["compute_pf_days"].to_numpy()
L = data["final_loss"].to_numpy()

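def loss_vs_compute(C, L_inf, K, k):
    """Saturating power law: irreducible loss L_inf plus a term that decays with compute C."""
    return L_inf + K * (C ** (-k))

# Bounded nonlinear least squares: p0 is the initial guess, and the bounds keep
# L_inf, K, and k non-negative.
params, cov = curve_fit(
    loss_vs_compute,
    C, L,
    p0=[0.3, 1.5, 0.15],
    bounds=([0.0, 0.0, 0.0], [5.0, 10.0, 2.0]),
    maxfev=200000
)

L_inf, K, k = params
params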

In [None]:
# Plot: loss vs compute (log-x)
C_grid = np.logspace(np.log10(C.min()), np.log10(C.max()), 200)
plt.figure(figsize=(7,4))
plt.scatter(C, L)
plt.plot(C_grid, loss_vs_compute(C_grid, *params))
plt.xscale("log")
plt.xlabel("Compute (PF-days, log scale)")
plt.ylabel("Final loss")
plt.title("Compute scaling fit: L(C) = L_inf + K*C^{-k}")
plt.show()

print(f"Fitted parameters: L_inf={L_inf:.3f}, K={K:.3f}, k={k:.3f}")

## 2) Predict expected loss for a 10B-parameter model trained on 1T tokens
To use the compute-scaling fit, we need an estimated compute for (10B params, 1T tokens). The provided table implicitly uses a fixed ratio of tokens to parameters:

\[ \text{tokens} \approx 100\times \text{parameters} \]

We fit compute as a power-law function of model size along this frontier:

\[ C(N) = c_0\,N^{p} \]

and then infer compute for 10B params (and 1T tokens) under the same scaling regime.
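As a rough cross-check on the compute column, one can compare it with the widely used approximation C ≈ 6ND FLOPs for dense transformer training. This rule of thumb is an assumption brought in from outside the table, as is the conversion 1 PF-day = 10^15 FLOP/s × 86,400 s, and the helper name `approx_pf_days` is hypothetical. If the two estimates diverge, the table's compute figures may follow a different accounting, and the extrapolated compute for 10B/1T should be read with that in mind.

```python
import numpy as np

FLOPS_PER_PF_DAY = 1e15 * 86_400  # one petaFLOP/s sustained for one day

def approx_pf_days(n_params, n_tokens):
    """Rule-of-thumb training compute, C ~ 6*N*D FLOPs, converted to PF-days."""
    return 6.0 * np.asarray(n_params) * np.asarray(n_tokens) / FLOPS_PER_PF_DAY

# Table values, hard-coded so the sketch stands alone.
n_params = np.array([1e8, 3e8, 1e9, 3e9])
n_tokens = np.array([1e10, 3e10, 1e11, 3e11])
table_pf_days = np.array([0.1, 0.5, 2.0, 10.0])

rule_pf_days = approx_pf_days(n_params, n_tokens)
for n, table, rule in zip(n_params, table_pf_days, rule_pf_days):
    print(f"N={n:.0e}: table={table:g} PF-days, 6ND estimate={rule:.3g} PF-days")

# Same rule applied to the prediction target (10B params, 1T tokens).
target_pf_days = float(approx_pf_days(10e9, 1e12))
print(f"6ND estimate for 10B/1T: {target_pf_days:.0f} PF-days")
```

Because D = 100N on this frontier, the 6ND rule implies compute growing as N², so comparing it with the fitted exponent \(p\) is a quick way to see whether the table follows the same accounting.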

In [None]:
N = data["model_params"].to_numpy()
D = data["dataset_tokens"].to_numpy()

# Verify frontier ratio
ratio = D / N
print("Tokens/parameter ratios:", ratio)

# Fit compute as a function of N (since D scales ~100N here)
logN = np.log(N)
logC = np.log(C)
p, logc0 = np.polyfit(logN, logC, 1)
c0 = np.exp(logc0)

print(f"Compute fit: C(N) = {c0:.3e} * N^{p:.3f}")

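def compute_from_N(n):
    """Estimated compute (PF-days) for a model with n parameters on the tokens = 100 x params frontier."""
    return c0 * (n ** p)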

# Prediction target
N_target = 10e9   # 10B parameters
D_target = 1e12   # 1T tokens (matches ~100 tokens/param)
C_target = compute_from_N(N_target)

L_target = loss_vs_compute(C_target, *params)

C_target, L_target

In [None]:
plt.figure(figsize=(7,4))
plt.scatter(N, C, label="data")
N_grid = np.logspace(np.log10(N.min()), np.log10(N.max()), 200)
plt.plot(N_grid, compute_from_N(N_grid), label="fit")
plt.xscale("log")
plt.yscale("log")
plt.xlabel("Model parameters (log scale)")
plt.ylabel("Compute (PF-days, log scale)")
plt.title("Compute vs model size along the provided scaling frontier")
plt.legend()
plt.show()

print(f"Estimated compute for 10B params / 1T tokens: {C_target:.1f} PF-days")
print(f"Predicted loss at that compute: {L_target:.3f}")

## 3) Recommend an optimal allocation under a fixed compute budget (20 PF-days)
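Because every run in this dataset sits on the same frontier (tokens ≈ 100×params), the data cannot show whether a different ratio would do better. A practical recommendation under a fixed compute budget is therefore to stay on that frontier and pick the largest model/data pair that fits the budget. Note that 20 PF-days is twice the largest observed budget (10 PF-days), so this is also an extrapolation.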

Using the fitted compute-vs-size relation, we solve for \(N\) such that \(C(N)=20\) PF-days, then set \(D\approx 100N\).
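
For reference, the inversion follows in one step from the fitted power law. The symbol \(r\) for the tokens-per-parameter ratio is introduced here for illustration (the table has \(r = 100\)), as are the starred names for the recommended sizes:

\[ C(N) = c_0\,N^{p} \;\Longrightarrow\; N^{*} = \left(\frac{C_{\text{budget}}}{c_0}\right)^{1/p}, \qquad D^{*} = r\,N^{*} \]

Provided the fitted \(p\) is positive, \(C(N)\) increases with \(N\), so \(N^{*}\) is the largest model on the frontier that fits within the budget.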

In [None]:
budget = 20.0  # PF-days

# Solve C(N) = budget => N = (budget/c0)^(1/p)
N_budget = (budget / c0) ** (1.0 / p)
D_budget = 100.0 * N_budget  # follow the observed frontier ratio

L_budget = loss_vs_compute(budget, *params)

print(f"Compute budget: {budget} PF-days")
print(f"Recommended (frontier) model size: {N_budget/1e9:.2f}B params")
print(f"Recommended (frontier) dataset size: {D_budget/1e12:.2f}T tokens")
print(f"Expected loss at {budget} PF-days from compute scaling fit: {L_budget:.3f}")

## 4) Assumptions & limitations
1) **Collinearity of N and D:** In the provided table, dataset size and model size scale together (D≈100N). With only four points on a single frontier, it is not possible to reliably disentangle separate scaling exponents for model size and data size.

2) **Compute proxy:** I treat PF-days as a consistent proxy for training compute and assume the compute definition is comparable across runs. In practice, hardware efficiency, sequence length, optimizer, and implementation details shift this relationship.

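3) **Extrapolation risk:** Predicting at 10B/1T means extrapolating beyond the observed range: the target model is more than 3× larger than the largest run (3B params), and its estimated compute exceeds the largest observed budget (10 PF-days). The prediction also chains two fits (C(N), then L(C)), so errors in the first propagate into the second. Scaling laws are often approximately valid only over limited ranges, and constants/exponents can change with architecture, data mixture, and optimization.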

4) **Single-metric view:** I model only final loss. Real deployments care about downstream task performance, calibration, safety, and robustness, which may not track loss monotonically.

5) **Frontier optimality:** The allocation recommendation assumes the original runs are close to compute-optimal (a reasonable heuristic in scaling work), but without additional off-frontier data points (e.g., same compute with different N vs D), this cannot be proven.
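
One crude way to gauge the extrapolation risk in limitation 3 is to refit the compute-vs-size power law with each run left out in turn and see how much the extrapolated compute for 10B parameters moves. The sketch below is a minimal illustration under that assumption; the helper name `extrapolated_compute` is hypothetical, and the table values are hard-coded so it runs on its own.

```python
import numpy as np

# Table values, hard-coded so the sketch stands alone.
n_params = np.array([1e8, 3e8, 1e9, 3e9])
compute_pf_days = np.array([0.1, 0.5, 2.0, 10.0])
N_TARGET = 10e9  # 10B parameters

def extrapolated_compute(n, c, n_target):
    """Fit log C = log c0 + p*log N by least squares and evaluate at n_target."""
    slope, intercept = np.polyfit(np.log(n), np.log(c), 1)
    return float(np.exp(intercept) * n_target ** slope)

full_estimate = extrapolated_compute(n_params, compute_pf_days, N_TARGET)

# Refit with each run left out in turn.
loo_estimates = []
for i in range(len(n_params)):
    keep = np.arange(len(n_params)) != i
    loo_estimates.append(
        extrapolated_compute(n_params[keep], compute_pf_days[keep], N_TARGET)
    )

print(f"All four runs: {full_estimate:.1f} PF-days")
print("Leave-one-out:", ", ".join(f"{c:.1f}" for c in loo_estimates))
```

The spread across refits is at best a lower bound on the uncertainty: it reflects sensitivity to individual points, not a change in the scaling regime beyond the observed range.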