**Section 2.3.1 Training a Tokenizer**

> **Before starting pre-training**, you first need to choose a backbone model and then train a tokenizer on both your collected in-domain data and larger, general-purpose corpora, using a subword algorithm such as BPE, byte-level BPE (BBPE), or WordPiece.
>
> A very common issue is that **most top-performing language models have not been sufficiently pre-trained on Chinese text** (this was true in 2024, although by 2025 there are stronger Chinese-capable models such as DeepSeek-R1). As a result, many practitioners still begin with an English-pretrained model, adapt its tokenizer to Chinese data, and then continue pre-training on Chinese text, hoping to transfer the model's strong English capabilities into the Chinese domain.
>
> In practice, a tokenizer’s job is to split each input sentence into a sequence of tokens. For example:
>
> ```
> “你好世界”  →  [“你”, “好”, “世”, “界”]
> ```
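
To make the BPE option concrete, here is a toy, character-level sketch of how merge rules can be learned and then replayed at tokenization time. The function names (`train_bpe`, `apply_merge`, `tokenize`) and the three-sentence corpus are illustrative assumptions, not part of any library; production tokenizers add pre-tokenization, byte fallback, and far larger corpora.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings (toy version)."""
    # Start from single characters; each sentence is one symbol sequence.
    sequences = [list(text) for text in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, best_count = pair_counts.most_common(1)[0]
        if best_count < 2:
            break  # no pair repeats, so another merge would not compress anything
        merges.append(best_pair)
        sequences = [apply_merge(seq, best_pair) for seq in sequences]
    return merges


def tokenize(text, merges):
    """Tokenize by replaying the learned merges in training order."""
    seq = list(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Hypothetical mini-corpus, only for illustration.
corpus = ["你好世界", "你好朋友", "世界你好"]
merges = train_bpe(corpus, num_merges=10)
print(merges)                      # [('你', '好'), ('世', '界')]
print(tokenize("你好世界", merges))  # ['你好', '世界']
```

With enough data, frequent character pairs such as “你好” become single tokens, whereas an untrained character-level split yields one token per character as in the example above.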

---

### Tokenization Tips

1. **Numeric splitting**
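   Avoid inconsistent number splits. For instance, if “9.11” is split into `[“9”, “.”, “11”]`, the multi-digit chunk “11” can mislead the model on comparisons such as 9.9 > 9.11. Ensure your rules handle decimals and number formats correctly.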

2. **Compression ratio control**
   Measure the average number of characters represented by one token.

   * If the ratio is too low (too many tokens for the same text), training becomes inefficient.
   * If the ratio is too high (too few tokens for the same text), the model’s expressiveness suffers.

   For Chinese models, a reasonable target is roughly **1 token ≈ 1.5 characters**.

3. **Manually remove unwanted or sensitive tokens**
   Weed out overly rare, private, or otherwise undesirable tokens before training.

4. **Domain-specific tokens**
   If you know your application domain in advance, **add tokens** for frequent domain terms to the vocabulary (e.g. in a medical setting, include tokens like “阿莫西林” or “青霉素”). This both reduces sequence length and preserves semantic integrity.

5. **Leave room for future tokens**
   Make sure your chosen vocabulary size sits comfortably below the number of rows in your model’s embedding matrix: **reserve about 1,000 slots**. Later on, you can assign those slots to brand-new tokens that weren’t seen during the original pre-training.

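Tips 1 and 2 lend themselves to quick checks. The sketch below isolates every digit during pre-tokenization (one common way to keep number handling consistent; the regex is an assumption, not a rule taken from a specific tokenizer) and measures characters per token for any tokenizer passed in as a callable. The names `split_digits` and `chars_per_token` are hypothetical.

```python
import re

# Assumed pre-tokenization rule: each digit is its own piece, so "9.11" can
# never produce a multi-digit chunk such as "11". Whitespace is dropped.
_PIECE = re.compile(r"\d|[^\d\s]+")


def split_digits(text):
    """Split text into pieces, isolating every digit."""
    return _PIECE.findall(text)


def chars_per_token(texts, tokenize):
    """Average number of characters represented by one token."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_chars / total_tokens


print(split_digits("9.9 > 9.11"))
# ['9', '.', '9', '>', '9', '.', '1', '1']

# Two stand-in tokenizers: one token per character, and fixed two-character chunks.
samples = ["你好世界", "阿莫西林"]
print(chars_per_token(samples, list))  # 1.0
print(chars_per_token(samples, lambda t: [t[i:i + 2] for i in range(0, len(t), 2)]))  # 2.0
```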
---

**Vocabulary Expansion**

To **lower the difficulty** of training a model on Chinese text, it’s common to **extend the original tokenizer’s vocabulary** by manually adding high-frequency Chinese tokens. This “vocab expansion” lets you:

* Start with an existing pretrained tokenizer (e.g. LLaMA’s)
* Append a set of new Chinese tokens
* Re-train only the embeddings for those new tokens (a lightweight second phase)

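The first two steps can be sketched as a plain dictionary operation: existing token IDs stay untouched and new tokens receive the next free IDs. The function `extend_vocab`, its `embedding_rows` check, and the tiny vocabulary are illustrative assumptions; real tokenizers also carry merge rules or piece scores that must be extended consistently.

```python
def extend_vocab(base_vocab, new_tokens, embedding_rows=None):
    """Return (extended vocab, ids of newly added tokens).

    Existing ids are never changed, so the pretrained embedding rows stay valid.
    Assumes base ids are contiguous, starting at 0.
    """
    vocab = dict(base_vocab)
    next_id = max(vocab.values(), default=-1) + 1
    added_ids = []
    for token in new_tokens:
        if token in vocab:
            continue  # already covered by the base tokenizer (or a duplicate)
        vocab[token] = next_id
        added_ids.append(next_id)
        next_id += 1
    # Optional guard: the vocabulary must still fit into the embedding matrix.
    if embedding_rows is not None and next_id > embedding_rows:
        raise ValueError(f"vocab needs {next_id} rows, only {embedding_rows} available")
    return vocab, added_ids


# Hypothetical miniature vocabulary, only for illustration.
base_vocab = {"<unk>": 0, "the": 1, "要": 2}
vocab, added_ids = extend_vocab(
    base_vocab, ["原理", "了解更多", "帮你", "要"], embedding_rows=8
)
print(added_ids)       # [3, 4, 5]
print(8 - len(vocab))  # 2 rows still free for future tokens
```
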
For example, comparing two tokenizers:

| tokenizer                             | original vocab size | added tokens | total vocab size |
| ------------------------------------- | ------------------- | ------------ | ---------------- |
| decapoda-research/llama-7b-hf         | 32,000              | 0            | 32,000           |
| ziqingyang/chinese-llama-plus-lora-7b | 32,000              | 17,953       | 49,953           |
| **difference**                        | 0                   | +17,953      | +17,953          |

* **Tokenizer 1 (LLaMA-7B)** has no extra Chinese tokens.
* **Tokenizer 2** adds **17,953** new tokens, most of which are common Chinese words or phrases.

A small sample of the unique tokens added by Tokenizer 2:

```
0: 原理        (“principle”)
1: 了解更多    (“learn more”)
2: 帮你        (“help you”)
3: 要          (“want to”)
…
```
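
A listing like the one above can be produced by a set difference between the two vocabularies. The sketch below uses tiny made-up vocabularies and a hypothetical helper `vocab_diff`; with real tokenizers you would pass in their token-to-ID mappings instead.

```python
def vocab_diff(base_vocab, extended_vocab):
    """Tokens present in `extended_vocab` but not in `base_vocab`, in id order."""
    base_tokens = set(base_vocab)
    extra = [tok for tok in extended_vocab if tok not in base_tokens]
    return sorted(extra, key=lambda tok: extended_vocab[tok])


# Made-up miniature vocabularies standing in for the two tokenizers.
base = {"<unk>": 0, "the": 1, "▁of": 2}
extended = {"<unk>": 0, "the": 1, "▁of": 2, "原理": 3, "了解更多": 4, "帮你": 5, "要": 6}

for i, token in enumerate(vocab_diff(base, extended)):
    print(f"{i}: {token}")

# Sanity check on the sizes quoted in the table: 49,953 - 32,000 = 17,953.
assert 49_953 - 32_000 == 17_953
```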

By merging these high-utility Chinese tokens into the vocabulary and then doing a **brief second-stage embedding pre-training** (e.g. on several billion Chinese characters), you give your model a much stronger foundation for understanding and generating Chinese text.

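The second-stage idea, updating only the rows that belong to new tokens, can be illustrated in a few lines of plain Python. Initializing new rows to the mean of the existing embeddings is a common heuristic and an assumption here, not something the text above prescribes; the helper names and the 2-dimensional embeddings are likewise hypothetical. A real setup would do the same with a deep-learning framework and actual gradients.

```python
def init_new_rows(embeddings, num_new):
    """Append `num_new` rows, each set to the mean of the existing rows (assumed heuristic)."""
    dim = len(embeddings[0])
    mean = [sum(row[j] for row in embeddings) / len(embeddings) for j in range(dim)]
    return [row[:] for row in embeddings] + [mean[:] for _ in range(num_new)]


def sgd_step_new_only(embeddings, grads, first_new_id, lr=0.1):
    """One SGD update applied only to rows >= first_new_id; original rows stay frozen."""
    for i in range(first_new_id, len(embeddings)):
        embeddings[i] = [w - lr * g for w, g in zip(embeddings[i], grads[i])]
    return embeddings


# Two pretrained rows plus one new token; gradients are placeholder values.
pretrained = [[1.0, 2.0], [3.0, 4.0]]
table = init_new_rows(pretrained, num_new=1)
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
table = sgd_step_new_only(table, grads, first_new_id=2, lr=0.5)
print(table)  # [[1.0, 2.0], [3.0, 4.0], [1.5, 2.5]]
```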