**2.3.6 Pre-training Scaling Laws**

---

### Estimating Your FLOPs Budget

Before you start pre-training, you can estimate how much compute you will need with the standard approximation:

$$
\text{FLOPs} \;=\; 6 \;\times\; (\#\text{tokens}) \;\times\; (\#\text{parameters})
$$

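* The factor of 6 accounts for the forward and backward passes: roughly 2 FLOPs per parameter per token for the forward pass and about 4 for the backward pass.
* You can convert a FLOPs budget into “GPU × days” by dividing it by the sustained throughput of one GPU.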

  * **Example:** 100 × NVIDIA A800 GPUs running for 30 days.

    * Assume one A800 sustains ∼210 TFLOP/s.
    * Total FLOPs ≈ $210\!\times\!10^{12}$ FLOP/s × 100 GPUs × 30 days × 24 h/day × 3600 s/h ≈ $5.4\times10^{22}$ FLOPs.
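
The arithmetic above can be reproduced with a minimal Python sketch. The function names are hypothetical, and the 210 TFLOP/s figure is the sustained throughput assumed in the example, not a measured value.

```python
SECONDS_PER_DAY = 24 * 3600


def training_flops(num_tokens: float, num_params: float) -> float:
    """Approximate training compute: 6 FLOPs per parameter per token."""
    return 6.0 * num_tokens * num_params


def cluster_flops(num_gpus: int, days: float, flops_per_gpu_per_s: float) -> float:
    """Total FLOPs a cluster delivers at a given sustained per-GPU throughput."""
    return num_gpus * days * SECONDS_PER_DAY * flops_per_gpu_per_s


# 100 GPUs for 30 days at an assumed 210 TFLOP/s each.
budget = cluster_flops(num_gpus=100, days=30, flops_per_gpu_per_s=210e12)
print(f"{budget:.2e}")  # 5.44e+22
```

Comparing `training_flops(tokens, params)` against `budget` tells you whether a planned run fits on the cluster.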

**Common Units**

* 1 T-token = $10^{12}$ tokens
* 1 B-parameter = $10^9$ parameters

---

### Compute-Optimal Model Sizing

Given a fixed FLOPs budget $C$, the formula above fixes the product of tokens and parameters:

$$
(\#\text{tokens}) \;\times\; (\#\text{parameters}) = \frac{C}{6} = \text{constant.}
$$

* **Trade-off:** More params ↔ fewer tokens; fewer params ↔ more tokens.
* **Illustration (for $5.4\times10^{22}$ FLOPs, using tokens = FLOPs / (6 × parameters)):**

  * A 7 B-param model → ≈ 1.3 T tokens
  * A 70 B-param model → ≈ 0.13 T tokens
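
Rearranging FLOPs = 6 × tokens × parameters gives tokens = FLOPs / (6 × parameters). The short Python sketch below applies that rearrangement to a fixed budget; the function name is hypothetical and the model sizes are the two from the illustration.

```python
def tokens_for_budget(flops_budget: float, num_params: float) -> float:
    """Tokens affordable at a fixed budget under FLOPs = 6 * tokens * params."""
    return flops_budget / (6.0 * num_params)


BUDGET = 5.4e22  # FLOPs, the 100-GPU, 30-day example budget

for params in (7e9, 70e9):
    tokens = tokens_for_budget(BUDGET, params)
    print(f"{params / 1e9:.0f} B params -> {tokens / 1e12:.2f} T tokens")
```

Because the product is constant, multiplying the parameter count by 10 divides the affordable token count by 10.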

---

### LLaMA3’s Empirical Scaling Law

LLaMA3’s authors trained models from 40 M to 16 B parameters at compute budgets between $6\times10^{18}$ and $10^{22}$ FLOPs, measuring validation loss along each “iso-FLOPs” curve (fixed budget, varying model size). The minimum of each curve marks the compute-optimal model for that budget. They then fit a power law for the **compute-optimal token count** $N^*(C)$ at budget $C$:

$$
N^*(C) = A\,C^\alpha
\quad\text{with fitted parameters }
\alpha = 0.537,\;
A = 0.299.
$$

Plotting these predicted values against the measured optima shows close agreement across the fitted range, which spans more than three orders of magnitude in compute.
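
As a rough sketch of how such a fit works, the Python below evaluates $N^*(C) = A\,C^\alpha$ with the constants quoted above and recovers them again by least squares in log-log space. The function names are hypothetical, the sample budgets are chosen only to cover the stated range, and the "measurements" are generated from the fitted law itself as a round-trip check; they are not the paper's data.

```python
import math

ALPHA = 0.537  # exponent quoted above
A = 0.299      # prefactor quoted above


def optimal_tokens(flops_budget: float, a: float = A, alpha: float = ALPHA) -> float:
    """Compute-optimal token count N*(C) = A * C**alpha."""
    return a * flops_budget ** alpha


def implied_params(flops_budget: float) -> float:
    """Parameter count implied by N*(C) under FLOPs = 6 * tokens * params."""
    return flops_budget / (6.0 * optimal_tokens(flops_budget))


def fit_power_law(budgets, optima):
    """Least-squares line in log-log space: log N = log A + alpha * log C."""
    xs = [math.log(c) for c in budgets]
    ys = [math.log(n) for n in optima]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    alpha = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - alpha * mean_x)
    return a, alpha


# Synthetic round trip: points generated from the law, then refitted.
budgets = [6e18, 1e19, 1e20, 1e21, 1e22]
optima = [optimal_tokens(c) for c in budgets]
a_fit, alpha_fit = fit_power_law(budgets, optima)
print(f"A = {a_fit:.3f}, alpha = {alpha_fit:.3f}")
```

With real iso-FLOPs minima in place of the synthetic `optima`, the same regression would give the fitted $A$ and $\alpha$, and `implied_params` turns a budget into a model size.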

---

### Predicting Downstream Metrics (e.g. ARC-Challenge)

1. **Benchmark setup:**

   * For each multiple-choice option, compute the model’s normalized negative log-likelihood (NLL), which ranks the options the same way as perplexity, and select the lowest-NLL choice.
2. **Two-step prediction pipeline:**

   * **Step 1:** Fit the relationship between training FLOPs and NLL on the ARC validation set using the small compute-optimal models from the scaling-law runs, then extrapolate it to predict the target model’s NLL before you train.
   * **Step 2:** Fit a sigmoid that maps normalized NLL to accuracy, using data from the smaller models and from LLaMA 2.
3. **Outcome:** This lets you forecast a model’s final accuracy *before* spending huge compute. For the 405 B-parameter model, the forecast only slightly underestimated the final accuracy.
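
The two steps can be sketched in Python as follows. Everything here is an illustrative assumption rather than the authors' procedure: the function names, the linear NLL-vs-log10(FLOPs) form, the 0-to-1 sigmoid (which ignores the 25% chance floor of a four-option task), the target budget, and all the numbers. The data points are generated from those hypothetical parameters, not taken from any real run.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def sigmoid_accuracy(nll: float, midpoint: float, steepness: float) -> float:
    """Accuracy falls from 1 toward 0 as NLL rises past `midpoint`."""
    return 1.0 / (1.0 + math.exp(steepness * (nll - midpoint)))


def fit_sigmoid(nlls, accs):
    """Fit the sigmoid by linear regression on logit(accuracy)."""
    # logit(acc) = -steepness * nll + steepness * midpoint
    logits = [math.log(a / (1.0 - a)) for a in accs]
    slope, intercept = fit_line(nlls, logits)
    steepness = -slope
    return intercept / steepness, steepness


# Hypothetical small-model data, generated from made-up parameters.
small_flops = [1e19, 1e20, 1e21, 1e22]
small_nll = [3.0 - 0.1 * math.log10(c) for c in small_flops]
small_acc = [sigmoid_accuracy(n, midpoint=1.0, steepness=5.0) for n in small_nll]

# Step 1: extrapolate NLL to the target budget.
slope, intercept = fit_line([math.log10(c) for c in small_flops], small_nll)
target_flops = 1e25  # hypothetical target budget
predicted_nll = slope * math.log10(target_flops) + intercept

# Step 2: map the predicted NLL to accuracy.
midpoint, steepness = fit_sigmoid(small_nll, small_acc)
predicted_acc = sigmoid_accuracy(predicted_nll, midpoint, steepness)
print(f"predicted NLL = {predicted_nll:.3f}, predicted accuracy = {predicted_acc:.3f}")
```

Fitting on logit(accuracy) keeps the sketch dependency-free; a nonlinear least-squares fit with a chance-level floor would be a closer match for real benchmark data.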

---

### Practical Takeaways

* For models **≤ 70 B**, the model size is often fixed by other constraints rather than by the compute budget, so “compute-optimal” sizing matters less.
* At **larger scales**, where the FLOPs budget is the binding constraint, finding the compute-optimal trade-off is crucial.
* As future hardware makes 400 B+ models relatively “small,” you’ll be able to allocate even more tokens.
* **Scaling laws** not only identify the best size but also quantify your return on investment (compute cost, carbon footprint, etc.), guiding cost-benefit analyses and dev-loss projections.
* While this approach works very well for multiple-choice tasks, complex reasoning (chain-of-thought, math) may require more sophisticated predictive frameworks.
