**2.3.2 Determining Model Architecture & Memory Budget**

* **Architecture**
  All of our variants use the “LLaMA-style” backbone (RoPE positional embeddings + grouped-query attention + RMSNorm + SwiGLU activations).

* **Memory Footprint by Precision & Fine-tuning Method**
  | Method                            | Bits | 7 B    | 14 B   | 30 B   | 70 B     | x B      |
  |-----------------------------------|:----:|:------:|:------:|:------:|:--------:|:--------:|
  | **Full (fp32)**                   |  32  | 120 GB | 240 GB | 600 GB | 1,200 GB | ~18x GB  |
  | **Full (pure bf16)**              |  16  |  60 GB | 120 GB | 300 GB |  600 GB  | ~8x GB   |
  | Freeze/LoRA/GaLore/APOLLO/BAdam\* |  16  |  16 GB |  32 GB |  64 GB |  160 GB  | ~2x GB   |
  | **QLoRA (8-bit)**                 |   8  |  10 GB |  20 GB |  40 GB |   80 GB  | ~x GB    |
  | **QLoRA (4-bit)**                 |   4  |   6 GB |  12 GB |  24 GB |   48 GB  | ~x/2 GB  |
  | **QLoRA (2-bit)**                 |   2  |   4 GB |   8 GB |  16 GB |   24 GB  | ~x/4 GB  |

  > **Key takeaway:**
  >
  > * **Full-parameter** training (fp32 or pure bf16) needs 60 to 120 GB even at 7 B, and 600 GB or more at 70 B.
  > * **Freeze/LoRA-style** methods need roughly 2 GB per billion parameters, about a quarter of what pure-bf16 full training requires.
  > * **QLoRA** in 4- or 2-bit mode cuts GPU memory further: a 30 B model fits in about 24 GB at 4-bit and 16 GB at 2-bit, which is within reach of a single high-end consumer GPU.
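
The architecture and the memory table can be tied together with a little arithmetic. The sketch below is a rough illustration, not a measurement tool: `LlamaStyleConfig`, `count_params`, and `estimate_memory_gb` are hypothetical names, the parameter count assumes no bias terms and counts only weight matrices, RMSNorm vectors, and embeddings (RoPE has no learned parameters), and the GB-per-billion multipliers are copied from the last column of the table. Those multipliers are a rule of thumb, and the tabulated figures for small models are somewhat higher than the rule gives.

```python
from dataclasses import dataclass


@dataclass
class LlamaStyleConfig:
    """Hypothetical container for the shape of a LLaMA-style decoder."""
    vocab_size: int
    hidden_size: int
    num_layers: int
    num_heads: int
    num_kv_heads: int            # fewer than num_heads means grouped-query attention
    ffn_size: int                # SwiGLU intermediate width
    tie_embeddings: bool = False


def count_params(cfg: LlamaStyleConfig) -> int:
    """Approximate parameter count, assuming no bias terms."""
    head_dim = cfg.hidden_size // cfg.num_heads
    kv_width = cfg.num_kv_heads * head_dim
    # Q and O projections are hidden x hidden; K and V shrink under GQA.
    attn = 2 * cfg.hidden_size * cfg.hidden_size + 2 * cfg.hidden_size * kv_width
    # SwiGLU uses three matrices: gate, up, and down.
    mlp = 3 * cfg.hidden_size * cfg.ffn_size
    # Two RMSNorm weight vectors per layer.
    norms = 2 * cfg.hidden_size
    embed = cfg.vocab_size * cfg.hidden_size * (1 if cfg.tie_embeddings else 2)
    # The trailing hidden_size term is the final RMSNorm.
    return cfg.num_layers * (attn + mlp + norms) + embed + cfg.hidden_size


# GB per billion parameters, taken from the "x B" column of the table above.
GB_PER_BILLION = {
    "full_fp32": 18.0,
    "full_bf16": 8.0,
    "lora_16bit": 2.0,
    "qlora_8bit": 1.0,
    "qlora_4bit": 0.5,
    "qlora_2bit": 0.25,
}


def estimate_memory_gb(num_params: float, method: str) -> float:
    """Rule-of-thumb GPU memory estimate in GB for a given method."""
    return num_params / 1e9 * GB_PER_BILLION[method]


# Illustrative shape for a roughly 7 B model (values are an assumption).
example = LlamaStyleConfig(
    vocab_size=32000, hidden_size=4096, num_layers=32,
    num_heads=32, num_kv_heads=32, ffn_size=11008,
)

if __name__ == "__main__":
    n = count_params(example)
    for method in GB_PER_BILLION:
        print(f"{method:>11}: {estimate_memory_gb(n, method):7.1f} GB")
```

Setting `num_kv_heads` below `num_heads` shows how grouped-query attention trims the K/V projections, and swapping the method key shows how far each row of the table moves the budget for the same model.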

---

**2.3.3 Choosing a Training Framework**

* **Recommended:**

  * **Megatron-LM** for pure pre-training.
  * If you’re working with the Qwen family, grab **Alibaba’s Pai-Megatron-Patch**:
    [https://github.com/alibaba/Pai-Megatron-Patch](https://github.com/alibaba/Pai-Megatron-Patch)

* **Avoid** using DeepSpeed (and its OpenRLHF/DeepSpeed-Chat wrappers) for your *pre-training* stage.

> **Why Megatron-LM?**
>
> 1. **Optimized parallelism** – Megatron’s tensor- and pipeline-parallel kernels are highly tuned (RoPE lives in APEX ops, and an APEX MLP kernel is on the way).
> 2. **Transparent config** – `arguments.py` exposes hundreds of knobs (dropout, layer settings, etc.), so you can dial in exactly what you need.
> 3. **Fast startup** – Even a 100 B-parameter model loads in under a minute, which makes iterative debugging a breeze.
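
To make the parallelism point concrete, here is a minimal sketch of how a GPU count splits into tensor-, pipeline-, and data-parallel groups, and how the global batch then maps to micro-batches per step. The helper `parallel_layout` is hypothetical (it is not part of Megatron-LM); its inputs correspond to the Megatron-LM arguments `--tensor-model-parallel-size`, `--pipeline-model-parallel-size`, `--micro-batch-size`, and `--global-batch-size`. It assumes only these three parallel dimensions are in use and ignores context and expert parallelism.

```python
def parallel_layout(world_size: int,
                    tensor_parallel: int,
                    pipeline_parallel: int,
                    micro_batch_size: int,
                    global_batch_size: int) -> dict:
    """Derive data-parallel size and micro-batches per step (hypothetical helper)."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by TP * PP")
    # Whatever is left after model parallelism becomes data parallelism.
    data_parallel = world_size // model_parallel

    samples_per_pass = micro_batch_size * data_parallel
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global_batch_size must be divisible by micro_batch_size * DP"
        )
    # Each data-parallel replica accumulates this many micro-batches per step.
    num_microbatches = global_batch_size // samples_per_pass

    return {
        "data_parallel": data_parallel,
        "num_microbatches": num_microbatches,
    }


if __name__ == "__main__":
    # 64 GPUs, TP=8 within a node, PP=2 across nodes.
    print(parallel_layout(64, 8, 2, micro_batch_size=1, global_batch_size=512))
```

Checking this arithmetic before launch is cheap, and it catches the most common configuration mistake: a global batch size that does not divide evenly across the data-parallel replicas.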

---

**2.3.4 Pre-training Strategy**

> **Study these reference recipes:**
>
> * **MiniCPM** series
> * **phi** series
> * **DeepSeekMath**

They provide battle-tested schedules, hyperparameters, and curriculum techniques that you can adapt for your own large-scale Chinese pre-training.
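
As one example of the kind of schedule these recipes describe, the MiniCPM report uses a warmup-stable-decay (WSD) learning-rate schedule. The sketch below is a simplified version under stated assumptions: the function name `wsd_lr` and its parameters are hypothetical, and the linear decay shape is chosen here for brevity rather than taken from the report, so consult the original recipe for the exact decay function and step counts.

```python
def wsd_lr(step: int,
           max_lr: float,
           warmup_steps: int,
           stable_steps: int,
           decay_steps: int,
           min_lr: float = 0.0) -> float:
    """Simplified warmup-stable-decay schedule (linear decay is an assumption)."""
    if step < warmup_steps:
        # Linear warmup from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Long plateau at the peak learning rate.
        return max_lr
    decay_step = step - warmup_steps - stable_steps
    if decay_step >= decay_steps:
        return min_lr
    # Short final phase: interpolate linearly from max_lr down to min_lr.
    frac = decay_step / decay_steps
    return max_lr + (min_lr - max_lr) * frac


if __name__ == "__main__":
    for s in (0, 999, 50_000, 95_000, 99_999):
        lr = wsd_lr(s, max_lr=1e-2, warmup_steps=1_000,
                    stable_steps=89_000, decay_steps=10_000)
        print(s, lr)
```

Because the plateau has no fixed end point, a checkpoint taken during the stable phase can be continued for longer or branched into a decay phase later, which is convenient when the total token budget is not known up front.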
