### 2.3.8 Continued Pretraining

---

#### **Domain Continued Pretraining (DCTP)**

* **Purpose:**
  After base pretraining, continue pretraining on domain-specific data to inject specialized knowledge. Code Llama, for example, starts from Llama 2 and continues training on code-heavy data.

* **How:**
  Same pretraining pipeline (no architectural change), but using domain-specific corpora.

* **Data mixing:**

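  * Usually mix general-domain and domain-specific data; keeping general data in the mix limits forgetting of general capabilities.
  * Suggested starting ratio (a rule of thumb to tune per domain): **7 : 2 : 1**
    * 70% general base data
    * 20% domain-specific data
    * 10% instruction or other high-value data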

* **Typical pipeline:**

  * Continue training using the same tokenizer & model.
  * Clean + deduplicate domain data beforehand.
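
The 7 : 2 : 1 mix above can be approximated with weighted sampling over the data sources. The following is a minimal sketch, not a production data loader: the function `mix_sources`, the source names, and the toy documents are all hypothetical, and the ratio is applied per sampled document, whereas real pipelines usually balance by token count.

```python
import random
from collections import Counter
from typing import Dict, Iterator, List, Tuple


def mix_sources(
    sources: Dict[str, List[str]],
    weights: Dict[str, float],
    n_samples: int,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, document) pairs, choosing the source by weight."""
    if set(sources) != set(weights):
        raise ValueError("sources and weights must have the same keys")
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    for _ in range(n_samples):
        # Pick a source according to the mixing ratio, then a document from it.
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, rng.choice(sources[name])


# Hypothetical toy corpora; real corpora would be streamed from disk.
toy_sources = {
    "general": ["general doc A", "general doc B"],
    "domain": ["domain doc A", "domain doc B"],
    "instruction": ["instruction doc A"],
}
mix_weights = {"general": 0.7, "domain": 0.2, "instruction": 0.1}

n_draws = 10_000
counts = Counter(name for name, _ in mix_sources(toy_sources, mix_weights, n_draws))
# Empirical fractions should land close to 0.7 / 0.2 / 0.1.
print({name: counts[name] / n_draws for name in mix_weights})
```

Sampling with a fixed seed keeps the mix reproducible across runs, which makes it easier to compare different ratios.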

---

#### **Long Context Continued Pretraining (LCTP)**

* **Reference Papers:**

  * Code Llama: Open Foundation Models for Code
  * Effective Long-Context Scaling of Foundation Models
  * YaRN: Efficient Context Window Extension of Large Language Models

* **Setup:**

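  * Use roughly **20B tokens** of long-text data.
  * Follow **Code Llama-style** long-context training: a dedicated fine-tuning stage on long sequences after the main training run.
  * Prefer **NTK-aware** scaling, which adjusts the RoPE base frequency instead of linearly interpolating positions.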

* **Hyperparameters:**
  Based on CodeLlama’s long-context finetuning setup:

  * Gradually increase context length:

    * Start from 4k tokens → grow to 16k tokens.
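  * Increase the RoPE base period $\theta$. The rotation frequency of the $i$-th dimension pair is

    $$
    \theta_i = \theta^{-2i/d}, \qquad i = 0, 1, \dots, \frac{d}{2} - 1
    $$

    * Here $d$ is the attention head dimension, not the hidden size. Code Llama raises the base period from $\theta = 10000$ to $\theta = 1000000$.
  * A larger base period slows the rotations, which reduces the decay of attention scores between distant tokens and the bias towards short-distance attention.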
  * Smoothly adapting to longer contexts helps models generalize across different sequence lengths.

* **Engineering Notes:**

  * Add **context-parallel attention** (shard each sequence across devices) so that long sequences fit in memory and train faster.
  * Use memory-efficient attention kernels such as FlashAttention; standard attention memory grows quadratically with sequence length.
  * Extend the supported context window to **32k / 128k tokens** in stages, starting from ~8k.

* **Common Long Context Use Cases:**

  * Document-level reasoning
  * Code generation
  * Dialogue models with very long turns
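
To make the RoPE adjustment from the hyperparameter notes concrete, the sketch below computes rotation frequencies under the standard RoPE formulation, $\theta_i = \text{base}^{-2i/d}$, and compares the base period 10,000 with the 1,000,000 reported for Code Llama. The helper names and the head dimension of 128 are assumptions for illustration. `ntk_aware_base` uses the commonly cited NTK-aware rule $\text{base}' = \text{base} \cdot s^{d/(d-2)}$, which is a different way of choosing the base than Code Llama's fixed value.

```python
import math
from typing import List


def rope_frequencies(head_dim: int, base: float) -> List[float]:
    """Rotation frequencies theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Wavelength in tokens of the slowest-rotating dimension pair."""
    slowest = rope_frequencies(head_dim, base)[-1]
    return 2.0 * math.pi / slowest


def ntk_aware_base(base: float, scale: float, head_dim: int) -> float:
    """NTK-aware rule of thumb: enlarge the base for a context scale factor."""
    return base * scale ** (head_dim / (head_dim - 2))


HEAD_DIM = 128  # assumed head dimension, for illustration only

for base in (10_000.0, 1_000_000.0):
    wavelength = longest_wavelength(HEAD_DIM, base)
    print(f"base={base:>11,.0f}  longest wavelength ~ {wavelength:,.0f} tokens")

# Base suggested by the NTK-aware rule for a 4x context extension (4k -> 16k).
print(f"NTK-aware base for 4x: {ntk_aware_base(10_000.0, 4.0, HEAD_DIM):,.0f}")
```

The larger base stretches the slowest wavelengths far beyond the training sequence length, so distant positions remain distinguishable instead of wrapping around.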

---

#### **Key Idea Summary**

* Domain continued pretraining injects *what* the model should know.
* Long context continued pretraining teaches *how* the model handles longer sequences.
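* Both are often combined. **Code Llama** is a documented example: it continues pretraining Llama 2 on code and then adds a long-context fine-tuning stage. Closed models such as GPT-4o, Claude 3, and Gemini may use similar stages, but their training recipes are not public.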

---

