**2.3.4 Pre-training Strategy**

> **Reference recipes:**
> • MiniCPM series
> • phi series
> • DeepSeekMath

These give battle-tested schedules, hyperparameters, and curriculum ideas you can adapt for large-scale Chinese pre-training.

---

### Optimal Batch Size

* **Why it matters:**
  The batch size trades off convergence speed against compute & memory cost.

  * If it’s **too large**, each step consumes huge amounts of data and compute, and beyond a point you see diminishing returns on loss reduction.
  * If it’s **too small**, you need many more steps to process the same amount of data, which wastes wall-clock time, and the loss may not fall efficiently.

* **Empirical study:**
  On models of 0.009 B, 0.036 B, and 0.17 B parameters, six different batch sizes were tested. The loss surface (C4 loss) vs. batch size and training steps shows a clear “ridge” of optimum batch-size for each model scale.

* **Rule of thumb for C4 loss:**

  $$
    \text{Optimal batch size} \;\;=\;\; \frac{1.2110 \times 10^9}{L^{6.2393}}
  $$

  where $L$ is the C4 loss, so the optimal batch size grows as the loss falls.

*(In the published plots, the red curve traces this optimum across models.)*
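To get a feel for the numbers, the rule of thumb can be evaluated directly. The sketch below assumes, as in the MiniCPM report, that $L$ is the C4 loss; the function name `optimal_batch_size` is hypothetical, and the unit of the result (assumed here to be tokens per batch) is not stated in this section.

```python
def optimal_batch_size(c4_loss: float) -> float:
    """Rule-of-thumb optimal batch size for a given C4 loss.

    Implements bs = 1.2110e9 / L**6.2393 from the text.
    Assumption: the result is measured in tokens per batch.
    """
    if c4_loss <= 0:
        raise ValueError("c4_loss must be positive")
    return 1.2110e9 / (c4_loss ** 6.2393)


if __name__ == "__main__":
    # Illustrative loss values only; they are not taken from the study.
    for loss in (3.5, 3.0, 2.5):
        size = optimal_batch_size(loss)
        print(f"C4 loss {loss:.1f} -> batch size ~{size:,.0f}")
```

Because the exponent is large, a modest drop in loss implies a much larger optimal batch, which is why the target loss should be estimated before the batch size is fixed.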

---

### WSD (Warmup–Steady–Decay) Scheduler

Most large-model pre-training naturally falls into three phases:

1. **Warmup**
2. **Steady-state**
3. **Decay (annealing)**

During decay, you typically introduce higher-quality data and apply an annealing schedule (e.g. cosine). The **WSD scheduler** formalizes this:

* Let

  * $W$ = number of warmup steps,
  * $S$ = step at which the steady-state phase ends (an absolute step index, not a length),
  * $D$ = number of decay steps.

* Define the learning rate at step $s$ as:

  $$
    \mathrm{lr}(s) =
    \begin{cases}
      \displaystyle \frac{s}{W}\,\eta, & 0 \le s < W, \\[6pt]
      \eta,                         & W \le s < S, \\[3pt]
      f(s - S)\,\eta,               & S \le s < S + D,
    \end{cases}
  $$

  where

  * $\eta$ = maximum learning rate,
  * $f(\cdot)$ is any monotonically decreasing function with $0 < f(\cdot)\le1$.
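The piecewise definition above translates almost line for line into code. This is a minimal sketch, not a reference implementation: `wsd_lr` and `exp_decay` are hypothetical names, the exponential half-life decay is just one possible choice of $f$, and raising an error outside $0 \le s < S + D$ is a choice made here because the formula leaves that range undefined.

```python
from typing import Callable


def exp_decay(half_life: float) -> Callable[[float], float]:
    """Return f(t) = 0.5 ** (t / half_life), one assumed choice of f.

    f(0) = 1 and f decreases monotonically while staying above 0.
    """
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return lambda t: 0.5 ** (t / half_life)


def wsd_lr(step: int, eta: float, W: int, S: int, D: int,
           f: Callable[[float], float]) -> float:
    """Learning rate at `step` under the Warmup-Steady-Decay schedule.

    eta: maximum learning rate
    W:   number of warmup steps
    S:   step at which the steady phase ends
    D:   number of decay steps
    f:   decay function with 0 < f(t) <= 1
    """
    if not (0 < W <= S) or D < 0:
        raise ValueError("expected 0 < W <= S and D >= 0")
    if step < 0 or step >= S + D:
        raise ValueError("step is outside the schedule")
    if step < W:                 # warmup: linear ramp from 0 to eta
        return step / W * eta
    if step < S:                 # steady: hold the maximum rate
        return eta
    return f(step - S) * eta     # decay: anneal from eta downwards


if __name__ == "__main__":
    # Illustrative values only.
    f = exp_decay(half_life=100)
    for s in (0, 500, 1000, 5000, 9000, 9100, 9999):
        print(s, wsd_lr(s, eta=0.01, W=1000, S=9000, D=1000, f=f))
```

Since the steady-phase rate does not depend on `D`, the decay can be started from any steady-phase checkpoint by choosing `S` after the fact.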

**Benefits of WSD:**

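1. You can **checkpoint and resume** from the steady phase, because the learning rate there does not depend on the total number of training steps.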
2. It often **outperforms** a pure cosine schedule.
3. Phases are **explicit**, making it easy to swap in different data or curricula.

---

### Pre-training Tricks

* **Keep it simple for long runs:**
  Continued pre-training (CPT) runs can last **more than a month**. If the model fits in GPU memory, **avoid** extra parallelism (tensor, pipeline, or sequence parallelism), offloading, and activation recomputation; each adds moving parts that complicate debugging.

* **Include instruction data:**
  Early evidence suggests that **mixing in instruction-style examples** (i.e., “prompt→response” pairs) helps your model learn more versatile generation behavior.
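One simple way to mix instruction data into a pre-training stream is to draw each example from the instruction pool with a fixed probability. The sketch below is only an illustration: the names `sample_mixed_batch` and `format_instruction`, the plain-text template, and the mixing ratio are all hypothetical, and the text above does not recommend a specific ratio.

```python
import random
from typing import List, Sequence, Tuple


def format_instruction(pair: Tuple[str, str]) -> str:
    """Render a (prompt, response) pair as plain text (hypothetical template)."""
    prompt, response = pair
    return f"{prompt}\n{response}"


def sample_mixed_batch(pretrain_docs: Sequence[str],
                       instruction_pairs: Sequence[Tuple[str, str]],
                       instruct_ratio: float,
                       batch_size: int,
                       rng: random.Random) -> List[str]:
    """Draw a batch in which each slot is an instruction example
    with probability `instruct_ratio`, otherwise a pre-training document."""
    if not 0.0 <= instruct_ratio <= 1.0:
        raise ValueError("instruct_ratio must be in [0, 1]")
    batch = []
    for _ in range(batch_size):
        if rng.random() < instruct_ratio:
            batch.append(format_instruction(rng.choice(instruction_pairs)))
        else:
            batch.append(rng.choice(pretrain_docs))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for reproducibility
    docs = ["some web text", "some book text"]
    pairs = [("Translate 'hello' to French.", "bonjour")]
    # 0.05 is an illustrative ratio, not a recommendation from the text.
    for example in sample_mixed_batch(docs, pairs, 0.05, 8, rng):
        print(repr(example))
```

Sampling per example rather than per batch keeps the instruction share roughly constant throughout the run, so it can be tuned with a single parameter in small-model experiments.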

---

### Multi-Stage Training Flow

A typical four-stage workflow once your data and code are ready:

1. **Warmup**
   Slowly ramp your LR up to the maximum.

2. **Main training**
   Use a schedule such as cosine decay, a constant rate, or a constant rate followed by decay (as in WSD). Choose between them using small-model experiments or published results.

3. **Long-context adaptation**
   Increase RoPE’s base frequency and bump up sequence length so the model masters longer texts.

4. **Final annealing**
   Train on high-quality or in-domain data, such as instruction fine-tuning (IFT) data, to “hone” the model for benchmarks and downstream tasks.
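For the long-context stage (step 3), it helps to see what changing the RoPE base actually does. The sketch below uses the standard RoPE frequencies $\theta_i = \text{base}^{-2i/d}$ and prints the wavelength of each frequency pair; raising the base lowers the rotation frequencies, so the slowest pairs span many more positions. The function name `rope_wavelengths` and the values for the head dimension and the two bases are illustrative assumptions, not settings taken from the text.

```python
import math
from typing import List


def rope_wavelengths(head_dim: int, base: float) -> List[float]:
    """Wavelength, in token positions, of each RoPE frequency pair.

    Pair i rotates with angular frequency base ** (-2 * i / head_dim),
    so its wavelength is 2 * pi * base ** (2 * i / head_dim).
    """
    if head_dim <= 0 or head_dim % 2 != 0:
        raise ValueError("head_dim must be a positive even number")
    return [2 * math.pi * base ** (2 * i / head_dim)
            for i in range(head_dim // 2)]


if __name__ == "__main__":
    # Illustrative settings only.
    for base in (10_000.0, 1_000_000.0):
        waves = rope_wavelengths(head_dim=128, base=base)
        print(f"base={base:>11,.0f}  shortest={waves[0]:.2f}  "
              f"longest={waves[-1]:,.0f} positions")
```

The shortest wavelength is unchanged by the base, which is why local position information is preserved while the usable range grows.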

> **In practice** you often run **two or more phases**:
>
> 1. First, train on the **entire corpus**.
> 2. Then, fine-tune on a **smaller slice**.
> 3. Finally, do a **short anneal** on a **high-quality subset**.
