**2.3.5 Training Resilience & Monitoring**

---

### 1. Monitor “Channel”-Specific Losses

Track training loss **separately** on at least three data streams:

* **Chinese knowledge** texts
* **English knowledge** texts
* **Code** snippets

Separate curves help you spot imbalances or domain-specific issues early, before they are averaged away in the aggregate loss.
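
A minimal sketch of per-channel tracking, assuming each batch comes from a single stream and that the training loop reports its mean loss and token count. The class name `ChannelLossTracker` and the channel labels are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class ChannelLossTracker:
    """Accumulates a token-weighted mean loss separately for each data channel."""

    def __init__(self):
        self._loss_sum = defaultdict(float)  # sum of (mean loss * token count)
        self._tokens = defaultdict(int)      # tokens seen per channel

    def update(self, channel, mean_loss, num_tokens):
        """Record one batch: its channel label, mean loss, and token count."""
        self._loss_sum[channel] += mean_loss * num_tokens
        self._tokens[channel] += num_tokens

    def summary(self):
        """Return {channel: token-weighted mean loss} for the current window."""
        return {
            ch: self._loss_sum[ch] / n
            for ch, n in self._tokens.items()
            if n > 0
        }

    def reset(self):
        """Clear the accumulators, e.g. after each logging interval."""
        self._loss_sum.clear()
        self._tokens.clear()


# Toy usage with made-up numbers (not real training results).
tracker = ChannelLossTracker()
tracker.update("zh_knowledge", mean_loss=2.0, num_tokens=100)
tracker.update("zh_knowledge", mean_loss=3.0, num_tokens=300)
tracker.update("code", mean_loss=1.0, num_tokens=50)
channel_means = tracker.summary()
```

Weighting by token count keeps a few short batches from dominating a channel's average; logging `summary()` and calling `reset()` at a fixed interval yields one curve per stream.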

### 2. Watch for Loss Spikes

A **loss spike**—either a sudden jump or a sharp drop—often indicates corrupted or mis-formatted data.

* **Sudden jump** (very high loss) can mean gibberish tokens or binary noise.
* **Sudden drop** (very low loss) can mean empty lines, repeated tokens, or constant inputs.

Even if spikes don’t irreversibly damage the model, eliminating them leads to more stable convergence.
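
One way to flag both kinds of spike automatically is to compare each new loss with a rolling window of recent values. The sketch below is a simple z-score heuristic, not a method taken from any particular training report; `LossSpikeDetector`, the window size, the threshold, and the `min_std` floor are all assumptions to tune for your run.

```python
from collections import deque
import statistics


class LossSpikeDetector:
    """Flags losses that deviate strongly from a rolling window of recent losses."""

    def __init__(self, window=100, threshold=4.0, min_history=20, min_std=0.01):
        self.history = deque(maxlen=window)  # recent non-spike losses
        self.threshold = threshold           # z-score magnitude that counts as a spike
        self.min_history = min_history       # steps to observe before flagging anything
        self.min_std = min_std               # floor so a flat window does not flag noise

    def check(self, loss):
        """Return 'jump', 'drop', or None. Flagged values are kept out of the window."""
        verdict = None
        if len(self.history) >= self.min_history:
            values = list(self.history)
            mean = statistics.fmean(values)
            std = max(statistics.pstdev(values), self.min_std)
            z = (loss - mean) / std
            if z > self.threshold:
                verdict = "jump"   # suspiciously high: gibberish or binary noise?
            elif z < -self.threshold:
                verdict = "drop"   # suspiciously low: empty or repeated inputs?
        if verdict is None:
            self.history.append(loss)
        return verdict


# Toy usage with made-up loss values.
detector = LossSpikeDetector(window=50, threshold=4.0, min_history=20)
for step in range(30):
    detector.check(2.0 + 0.1 * (step % 2))  # ordinary, slightly noisy losses

jump = detector.check(5.0)     # far above the recent mean
drop = detector.check(0.1)     # far below the recent mean
normal = detector.check(2.05)  # within the usual range
```

When a step is flagged, logging the batch's source files and sample IDs makes it possible to inspect the offending data afterwards.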

### 3. Track Perplexity (PPL) on Sampled Validation Sets

Perplexity trends tell you how well the model is learning natural-language structure:

1. **Warm-up phase**: PPL often rises at first, then starts falling.
2. **Plateau**: After initial drop, it may drift upward slightly.
3. **Domain data**: When you introduce new, in-domain text, PPL should decline again.

> **Practice**: From each distinct data source, randomly sample ~200 examples and compute PPL periodically as your “health check.”
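
The sampling step can be sketched as follows, assuming each source is available as an in-memory list of examples. The function name `sample_validation_sets` and the source labels are hypothetical. A fixed seed is used so that every checkpoint is scored on the same examples, which keeps the PPL curve comparable over time.

```python
import random


def sample_validation_sets(sources, k=200, seed=0):
    """Draw up to k examples from each source, reproducibly.

    sources: dict mapping a source name to a list of examples.
    Returns a dict with the same keys and the sampled lists as values.
    """
    rng = random.Random(seed)  # private RNG, independent of global random state
    return {
        name: rng.sample(examples, min(k, len(examples)))
        for name, examples in sources.items()
    }


# Toy usage with placeholder strings standing in for real documents.
sources = {
    "zh_knowledge": [f"zh-{i}" for i in range(1000)],
    "code": [f"code-{i}" for i in range(50)],  # fewer than k examples: keep all
}
val_sets = sample_validation_sets(sources)
```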

### 4. Evaluate Few-Shot Prompting at Multiple Checkpoints

Save model snapshots at different training steps and measure few-shot performance (e.g. accuracy on held-out tasks). This reveals which point gives you the best balance of general understanding and domain adaptation.

---

## What Is Perplexity?

Perplexity is a standard metric in language modeling:

$$
\mathrm{Perplexity}(\mathrm{Model})
= \exp\Bigl(-\frac{1}{N}\sum_{i=1}^N \log P(w_i \mid w_1,\dots,w_{i-1})\Bigr).
$$

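Here $N$ is the number of tokens in the evaluated text and $w_i$ is the $i$-th token, so perplexity is the exponential of the average per-token negative log-likelihood, that is, of the cross-entropy loss. Lower perplexity means the model assigns higher probability to the tokens that actually occur.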
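
As a small worked example of the formula, the sketch below computes perplexity from a list of per-token natural-log probabilities. How those log-probabilities are obtained from a model is outside the scope of the snippet; the function name `perplexity` is just an illustrative choice.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-probability over all tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among 4 options, so its perplexity is 4.
uniform_log_probs = [math.log(0.25)] * 8
ppl = perplexity(uniform_log_probs)
```

For a validation set with many examples, pooling the log-probabilities of all tokens before exponentiating matches the formula above; averaging per-example perplexities would give a different number.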

---

## Remedies for Loss Spikes

If you encounter a disruptive spike in training loss, consider the following fixes:

1. **Rollback & resume**
   Load the last checkpoint **before** the spike and continue training from there.
   *(Technique borrowed from GLM-130B’s training report.)*

2. **Reduce optimizer ε**
   A smaller epsilon in Adam/AdamW can stabilize updates.

3. **Scale down shallow-layer gradients**
   Multiply gradients in the lower (early) layers by a small factor to damp abrupt updates; GLM-130B, for instance, shrinks the embedding-layer gradient in this way.

4. **Use the WSD scheduler**
   The “Warmup-Stable-Decay” learning-rate schedule (from MiniCPM) can smooth out training transitions.

5. **Apply z-loss regularization**
   Add a small penalty on the squared log of the softmax normalizer $Z$ (the sum of exponentiated logits) so that the logits cannot drift to very large magnitudes.
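
Remedies 4 and 5 are easy to illustrate in plain Python. The sketch below makes several assumptions: the decay stage of the WSD schedule is linear (other decay shapes are possible), the z-loss coefficient defaults to `1e-4`, and the function names `wsd_lr` and `cross_entropy_with_z_loss` are hypothetical. A real training loop would apply the same formulas to tensors inside its framework.

```python
import math


def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay learning rate for a 0-indexed step."""
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable: hold the peak learning rate constant.
        return peak_lr
    # Decay: linear descent to min_lr (assumed shape), then stay there.
    progress = min(step - warmup_steps - stable_steps, decay_steps) / decay_steps
    return min_lr + (peak_lr - min_lr) * (1.0 - progress)


def cross_entropy_with_z_loss(logits, target, z_coeff=1e-4):
    """Cross-entropy for one token plus z_coeff * (log Z)^2.

    logits: list of floats, one per vocabulary entry.
    target: index of the correct token.
    Returns (total_loss, cross_entropy, z_penalty).
    """
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    ce = log_z - logits[target]
    z_penalty = z_coeff * log_z ** 2
    return ce + z_penalty, ce, z_penalty


# Toy usage: the schedule at a few steps, and the penalty for small vs. large logits.
lrs = [wsd_lr(s, 1.0, 10, 100, 50) for s in (0, 50, 135, 1000)]
_, _, small_penalty = cross_entropy_with_z_loss([0.0, 0.0], target=0)
_, _, large_penalty = cross_entropy_with_z_loss([50.0, 50.0], target=0)
```

Adding the same constant to every logit leaves the cross-entropy unchanged but increases the z penalty, which is what pulls $\log Z$ back toward zero.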

---

By systematically monitoring these signals and applying targeted remedies, you’ll maintain a robust, spike-resilient training process for your LLM.
