# Doge


Train the `Doge` small language model proposed in the paper [Wonderful Matrices](https://arxiv.org/abs/2412.11834).
Doge is based on the Transformers framework, replacing the `Multi-Head Attention` in the sequence transformation part with `Dynamic Mask Attention`, and replacing the `MLP` in the state transformation part with `CDMoE`.

![doge_architecture](../assets/doge_architecture.png)

## PreTrain

### Download Pre-Training Datasets


For the pre-training dataset, we selected the high-quality text dataset `fineweb-edu-dedup` and the synthetic instruction dataset `cosmopedia-v2`, supplemented with `python-edu` and `finemath` to strengthen the model's code and mathematical capabilities.


> Note: Due to the large size of the dataset, at least 2TB of storage space is required.

In [None]:
# Fill in the save path, cache path, and number of processes
!python ./examples/pretraining/scripts/download_datasets.py --save_dir ./datasets --cache_dir ./cache --num_proc 1

### Preprocess Datasets


We need to use the `tokenizer` to convert the dataset into `input_ids` and `attention_mask` that the model can accept.
Doge uses the `LlamaTokenizer`, which has a vocabulary size of `32768` and uses the `[INST]` and `[/INST]` tags to mark instructions. It also includes utility tokens, but we won't use them here.
Datasets like cosmopedia-v2 include two fields, `prompt` and `text`, which we will mark as user content and assistant content.

```python
conversation = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": text},
]
return tokenizer.apply_chat_template(conversation, tokenize=True, padding='max_length', truncation=True, max_length=MAX_LENGTH, return_dict=True)
```


Of course, you can also add some instruction prompts yourself.


```python
conversation = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am an AI assistant named `Doge`, I am a language model trained by `Shi Jingze` based on the `Doge` architecture, and my task is to provide appropriate answers and support to users based on their questions and requests."},
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": text},
]
```

We recommend using the [Doge-tokenizer](https://huggingface.co/JingzeShi/Doge-tokenizer) to process the dataset. It was trained from the `Llama-3.3` tokenizer on the `smollm-corpus` and has a vocabulary size of `32768`. The training script can be found [here](./pretraining/scripts/train_tokenizer_from_old.py).

In [None]:
# Fill in the dataset path, save path, tokenizer path, number of samples, max length, and number of processes
# NOTE: We keep only 256B tokens of data, with a ratio of fineweb-edu:cosmopedia-v2:python-edu:finemath = 7:2:0.5:0.5. To train a larger model, increase the scale of the dataset yourself.
!python ./examples/pretraining/scripts/preprocess_datasets.py --datasets_dir ./datasets --save_dir ./datasets --tokenizer_path JingzeShi/Doge-tokenizer --train_examples 128000000 --test_examples 1000 --max_length 2048 --num_proc 16
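
The sketch below works through the arithmetic behind these arguments. It assumes the `7:2:0.5:0.5` ratio is applied to example counts and that every example is padded or truncated to `--max_length`. The helper `split_by_ratio` is hypothetical and is not part of `preprocess_datasets.py`.

```python
from fractions import Fraction

# Values taken from the command above.
TRAIN_EXAMPLES = 128_000_000
MAX_LENGTH = 2048

# Mixing ratio from the NOTE above.
RATIO = {"fineweb-edu": 7, "cosmopedia-v2": 2, "python-edu": 0.5, "finemath": 0.5}


def split_by_ratio(total, ratio):
    """Hypothetical helper: split `total` examples across datasets by `ratio`."""
    # Fractions avoid floating-point rounding for weights such as 0.5.
    weights = {name: Fraction(str(weight)) for name, weight in ratio.items()}
    total_weight = sum(weights.values())
    return {name: int(total * weight / total_weight) for name, weight in weights.items()}


examples_per_dataset = split_by_ratio(TRAIN_EXAMPLES, RATIO)
for name, count in examples_per_dataset.items():
    print(f"{name}: {count:,} examples")

# Upper bound on tokens if every example fills the full context length.
print(f"total tokens: {TRAIN_EXAMPLES * MAX_LENGTH:,}")
```

Under these assumptions, `fineweb-edu` contributes 89,600,000 examples, `cosmopedia-v2` 25,600,000, and `python-edu` and `finemath` 6,400,000 each.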

### Concatenate Datasets


We combine the tokenized fineweb-edu, cosmopedia-v2, python-edu, and finemath datasets into the `pretraining` dataset.
We then shuffle it with `seed=233` and split off `1,000` samples as the test set.

In [None]:
# Fill in the dataset path, save path, number of samples, and number of processes
!python ./examples/pretraining/scripts/concatenate_datasets.py --datasets_dir ./datasets --save_dir ./datasets --train_examples 128000000 --test_examples 1000 --num_proc 16
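
As a rough illustration of the shuffle-then-split step, here is a minimal standard-library sketch. The function `shuffle_and_split` is hypothetical; the actual `concatenate_datasets.py` script may implement this differently, and only the seed and test-set size come from the text above.

```python
import random


def shuffle_and_split(examples, test_examples, seed=233):
    """Hypothetical helper: shuffle with a fixed seed, then hold out a test set."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    shuffled = list(examples)  # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    test = shuffled[:test_examples]
    train = shuffled[test_examples:]
    return train, test


# Toy data standing in for tokenized examples.
train, test = shuffle_and_split(range(10_000), test_examples=1_000)
print(len(train), len(test))
```

Fixing the seed means that rerunning the step reproduces the same train and test sets, which keeps evaluation comparable across runs.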

### Configure Model


We configure a `20M` small model for training and testing.

| Model | Params | n_layers | d_model | d_ff | n_heads | kv_heads | n_experts | n_expert_heads | n_expert_pre_head |
|---|---|---|---|---|---|---|---|---|---|
| Doge-20M | 13M | 8 | 256 | 512 | 2 | 1 | - | - | - |
| Doge-MoE-20M | 15M | 8 | 256 | 512 | 2 | 1 | 512 | 1 | 2 |
| Doge-60M | 54M | 16 | 512 | 1024 | 4 | 2 | - | - | - |
| Doge-MoE-80M | 75M | 16 | 512 | 1024 | 4 | 2 | 1024 | 2 | 4 |
| Doge-160M | 152M | 24 | 768 | 1536 | 6 | 3 | - | - | - |
| Doge-MoE-220M | 224M | 24 | 768 | 1536 | 6 | 3 | 1536 | 3 | 6 |
| Doge-320M | 335M | 32 | 1024 | 2048 | 8 | 4 | - | - | - |
| Doge-MoE-500M | 505M | 32 | 1024 | 2048 | 8 | 4 | 2048 | 4 | 8 |


- n_layers is the number of decoder layers in the model
- d_model is the hidden layer dimension of the model
- n_heads is the number of heads of multi-head attention, d_model // n_heads is best kept above 64
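
The head-dimension guideline can be checked against the dense configurations in the table with a few lines of arithmetic. The values below are copied from the table; the dictionary layout is only for illustration.

```python
# (d_model, n_heads) for the dense models listed in the table above.
configs = {
    "Doge-20M": (256, 2),
    "Doge-60M": (512, 4),
    "Doge-160M": (768, 6),
    "Doge-320M": (1024, 8),
}

# Per-head dimension is d_model // n_heads.
head_dims = {name: d_model // n_heads for name, (d_model, n_heads) in configs.items()}

for name, head_dim in head_dims.items():
    status = "ok" if head_dim > 64 else "too small"
    print(f"{name}: head_dim={head_dim} ({status})")
```

Every configuration in the table has a head dimension of 128, which is above the suggested minimum of 64.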


> The `Doge-MoE` model can inherit the dense activation parameters of the `Doge` model, and increase the sparse activation parameters by setting `n_experts`, `n_expert_heads`, `n_expert_pre_head`.

### Configure Pre-Training Hyperparameters

| Model | tokens | max_train_steps | accumulate_steps | learning_rate | scheduler | warmup_ratio | decay_ratio | weight_decay | min_lr_rate |
|---|---|---|---|---|---|---|---|---|---|
| Doge-20M | 4B | 8,000 | 256 | 8e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |
| Doge-60M | 16B | 16,000 | 512 | 6e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |
| Doge-160M | 32B | 24,000 | 768 | 4e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |
| Doge-320M | 64B | 32,000 | 1024 | 2e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |

> Following the experience of the [SmolLM blog](https://huggingface.co/blog/smollm), we scale the token-to-parameter ratio from [Chinchilla](https://arxiv.org/pdf/2203.15556) by 10 times.
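
One way to read the token budgets in the table is sketched below. It assumes the commonly quoted Chinchilla rule of thumb of roughly 20 training tokens per parameter, which is not stated in this document, and multiplies it by 10. The function `token_budget` is hypothetical.

```python
# Assumption: the widely cited Chinchilla rule of thumb, not a value from this document.
CHINCHILLA_TOKENS_PER_PARAM = 20
# Scaling factor taken from the note above.
SCALE = 10


def token_budget(n_params):
    """Hypothetical helper: training tokens for a model with `n_params` parameters."""
    return n_params * CHINCHILLA_TOKENS_PER_PARAM * SCALE


for name, n_params in [("Doge-20M", 20_000_000), ("Doge-160M", 160_000_000), ("Doge-320M", 320_000_000)]:
    print(f"{name}: {token_budget(n_params) / 1e9:.0f}B tokens")
```

Under this assumption the nominal 20M, 160M, and 320M sizes give 4B, 32B, and 64B tokens, which matches the table. The 16B listed for Doge-60M is higher than the 12B this estimate would give, so treat it as a guideline rather than an exact rule.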

> `warmup_stable_decay` lets us resume training from a checkpoint on a larger dataset at any time; see [Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations](https://arxiv.org/pdf/2405.18392).
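
To make the shape of the schedule concrete, here is a minimal sketch of a warmup-stable-decay learning-rate curve using the `warmup_ratio`, `decay_ratio`, and `min_lr_rate` columns from the table. It assumes linear warmup and linear decay; the scheduler used by the training script may use a different decay shape, and `wsd_lr` is a hypothetical function.

```python
def wsd_lr(step, max_steps, peak_lr, warmup_ratio=0.1, decay_ratio=0.1, min_lr_rate=0.0):
    """Hypothetical helper: learning rate at `step`, assuming linear warmup and decay."""
    warmup_steps = int(max_steps * warmup_ratio)
    decay_steps = int(max_steps * decay_ratio)
    decay_start = max_steps - decay_steps
    min_lr = peak_lr * min_lr_rate

    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: ramp linearly from the peak down to the minimum.
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr - (peak_lr - min_lr) * progress


# Doge-20M settings from the table: 8,000 steps, peak learning rate 8e-3.
for step in (0, 400, 800, 4000, 7200, 7600, 8000):
    print(step, wsd_lr(step, max_steps=8000, peak_lr=8e-3))
```

With the Doge-20M settings, the rate ramps up over the first 800 steps, holds at `8e-3` until step 7,200, and decays to 0 by step 8,000. Because the stable phase is flat, a checkpoint taken there can be continued on more data before the final decay is applied.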

### Pre-Training Model

In [None]:
# Fill in the config path; all arguments are in the config file
!python ./examples/pretraining/scripts/pt.py --config_path ./examples/pretraining/configs/Doge-20M.yaml

### Usage


After training is complete, we can use `AutoModelForCausalLM` of `Transformers` to load the model, and use `AutoTokenizer` to load `LlamaTokenizer`.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("JingzeShi/Doge-20M")
model = AutoModelForCausalLM.from_pretrained("JingzeShi/Doge-20M", trust_remote_code=True)

In [None]:
inputs = tokenizer("Hey how are you doing?", return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(out))

## FineTuning

### Download Fine-Tuning Datasets


For the fine-tuning dataset, we selected the synthetic instruction dataset `smoltalk` for supervised fine-tuning.

In [None]:
# Fill in the save path, cache path, and number of processes
!python ./examples/finetuning/scripts/download_datasets.py --save_dir ./datasets --cache_dir ./cache --num_proc 1

### Process Fine-Tuning Datasets


We apply the `chat template` to the fine-tuning datasets.

In [None]:
# Fill in the dataset path, save path, tokenizer path, and number of processes
!python ./examples/finetuning/scripts/preprocess_datasets.py --datasets_dir ./datasets --save_dir ./datasets --tokenizer_path JingzeShi/Doge-tokenizer --num_proc 8

### SFT Model

We first perform SFT on the model to make it generate responses that follow the `prompt`.

In [None]:
# Fill in the config path; all arguments are in the config file
!python ./examples/finetuning/scripts/sft.py --config_path ./examples/finetuning/configs/Doge-20M-Instruct-SFT.yaml

### DPO Model

Then we use reinforcement learning to align the SFT model with human preferences. Here we use the `DPO` algorithm.

In [None]:
!python ./examples/finetuning/scripts/dpo.py --config_path ./examples/finetuning/configs/Doge-20M-Instruct-DPO.yaml
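
For intuition about what the DPO step optimizes, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It is a plain-Python illustration, not the code in `dpo.py`; the function name `dpo_loss` and the value `beta=0.1` are assumptions chosen for the example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp, ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Hypothetical helper: DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a response under the
    policy being trained or under the frozen reference (SFT) model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The policy equals the reference: the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# The policy prefers the chosen response more than the reference does: lower loss.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model, so no separate reward model is needed.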

### Usage

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("JingzeShi/Doge-20M-Instruct")
model = AutoModelForCausalLM.from_pretrained("JingzeShi/Doge-20M-Instruct", trust_remote_code=True)

In [None]:
generation_config = GenerationConfig(
    max_new_tokens=100,
    use_cache=True,
    do_sample=True,
    temperature=0.8,
    repetition_penalty=1.0,
)
# Stream generated tokens to stdout as they are produced, skipping the prompt
streamer = TextStreamer(
    tokenizer=tokenizer,
    skip_prompt=True,
)

In [None]:
prompt = "Hi, how are you doing today?"

conversation = [
    {"role": "user", "content": prompt}
]
# Apply the chat template and tokenize the conversation into input ids
inputs = tokenizer.apply_chat_template(
    conversation=conversation,
    tokenize=True,
    return_tensors="pt",
)

outputs = model.generate(
    inputs,
    tokenizer=tokenizer,
    generation_config=generation_config,
    streamer=streamer,
)

## Evaluation


First, install `miniconda`.


```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```


Then create an evaluation environment.


```bash
conda create -n lighteval python=3.10.12 
conda activate lighteval
pip install lighteval[accelerate]
```


Finally, we run the evaluation script.


If you use Linux, you can run the following command.


```bash
bash ./examples/evaluate/eval_downstream_tasks.sh
```


If you use Windows, you can run the following command.


```powershell
. ./examples/evaluate/eval_downstream_tasks.ps1
```


> NOTE: `MODEL` in the script can also be set to the path of a saved checkpoint; you only need to register the saved model for it to run.