# Doge


Train the `Doge` small language model proposed in the paper [Wonderful Matrices](https://arxiv.org/abs/2412.11834).
Doge is based on the Transformers framework, replacing the `Multi-Head Attention` in the sequence transformation part with `Dynamic Mask Attention`, and replacing the `MLP` in the state transformation part with `CDMoE`.

![doge_architecture](https://github.com/SmallDoges/small-doge/blob/main/assets/doge_architecture.png?raw=1)

## PreTraining

### Download Pre-Training Datasets


For the pre-training dataset, we selected the high-quality text `fineweb-edu-dedup`, the synthetic instruction dataset `cosmopedia-v2`, and supplemented it with `python-edu` and `fine-math` to ensure the model's code and mathematical capabilities.


> Note: Due to the large size of the dataset, at least 2TB of storage space is required. If you do not have enough storage space, you can choose to download part of the dataset [here](./utils/download_pt_datasets.py).

In [6]:
!git clone https://github.com/SmallDoges/small-doge.git


Cloning into 'small-doge'...
remote: Enumerating objects: 2091, done.[K
remote: Counting objects: 100% (479/479), done.[K
remote: Compressing objects: 100% (295/295), done.[K
remote: Total 2091 (delta 233), reused 250 (delta 154), pack-reused 1612 (from 2)[K
Receiving objects: 100% (2091/2091), 11.50 MiB | 35.06 MiB/s, done.
Resolving deltas: 100% (1073/1073), done.


In [7]:
%cd small-doge
!pip install boto3 datasets


/content/small-doge/small-doge


In [8]:
# Padding save path, cache path and number of processes
!python ./examples/utils/download_pt_datasets.py --save_dir ./datasets --cache_dir ./cache --num_proc 1

Namespace(save_dir='./datasets', cache_dir='./cache', num_proc=1, is_parallel=False)
README.md: 100% 7.05k/7.05k [00:00<00:00, 26.2MB/s]
Resolving data files: 100% 104/104 [00:00<00:00, 1280.02it/s]
Resolving data files: 100% 234/234 [00:00<00:00, 43487.40it/s]
Downloading data:   0% 0/234 [00:00<?, ?files/s]Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`

train-00000-of-00234.parquet:   0% 0.00/2.40G [00:00<?, ?B/s][A
train-00000-of-00234.parquet:   0% 10.5M/2.40G [00:00<01:28, 27.0MB/s][A
train-00000-of-00234.parquet:   1% 31.5M/2.40G [00:00<00:33, 69.5MB/s][A
train-00000-of-00234.parquet:   2% 41.9M/2.40G [00:00<00:30, 77.9MB/s][A
train-00000-of-00234.parquet:   3% 62.9M/2.40G [00:00<00:27, 83.5MB/s][A
train-00000-of-00234.parquet:   3% 73.4M/2.40G [00:01<00:36, 64.3MB/s][A
train-00000-of-00234.p

### Preprocess Datasets


We need to use the `tokenizer` to convert the dataset into `input_ids` and `attention_mask` that the model can accept.
If uses the `LlamaTokenizer`, which has a vocabulary size of `32768`, and uses the `[INST]` and `[/INST]` tags to mark instructions. It also includes utility tokens, but we won't use them here.
Datasets like cosmopedia-v2 include two fields, `prompt` and `text`, which we will mark as user content and assistant content.

```python
conversation = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": text},
]
return tokenizer.apply_chat_template(conversation, tokenize=True, padding='max_length', truncation=True, max_length=MAX_LENGTH, return_dict=True)
```


Of course, you can also add some instruction prompts yourself.


```python
conversation = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am an AI assistant named `Doge`, I am a language model trained by `Shi Jingze` based on the `Doge` architecture, and my task is to provide appropriate answers and support to users based on their questions and requests."},
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": text},
]
```

Here we recommend using the [Doge-tokenizer](https://huggingface.co/SmallDoge/Doge-tokenizer) to process the dataset. It is trained by the `Llama-3.3` tokenizer on the `smollm-corpus`, with a vocabulary size of `32768`. The training script can be found [here](./examples/utils/train_tokenizer_from_old.py).

In [9]:
# Padding dataset path, save path, tokenizer path, number of samples, max length and number of processes
# NOTE: We only keep 256B tokens dataset, the ratio of fineweb-edu:cosmopedia-v2:python-edu:finemath = 7:2:0.5:0.5, if you need to train larger model, please increase the scale of the dataset by yourself
!python ./examples/utils/preprocess_pt_datasets.py --datasets_dir ./datasets --save_dir ./datasets --tokenizer_name_or_path SamllDoge/Doge-tokenizer --train_examples 128000000 --test_examples 1000 --max_length 2048 --num_proc 16

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_http.py", line 409, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.11/dist-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/SamllDoge/Doge-tokenizer/resolve/main/tokenizer_config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/transformers/utils/hub.py", line 470, in cached_files
    hf_hub_download(
  File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py", line 1008, in hf_hub_download


### Concatenate Datasets


We combine the fineweb-edu_tokenized, cosmopedia-v2, python-edu, and finemath datasets into the `pretraining` dataset.
Then shuffle the order `seed=233`, and split out `1,000` samples as the test set.

In [10]:
# Padding dataset path, save path, number of samples and number of processes
!python ./examples/utils/concatenate_pt_datasets.py --datasets_dir ./datasets --save_dir ./datasets --train_examples 128000000 --test_examples 1000 --num_proc 16

Traceback (most recent call last):
  File "/content/small-doge/small-doge/./examples/utils/concatenate_pt_datasets.py", line 35, in <module>
    main(args)
  File "/content/small-doge/small-doge/./examples/utils/concatenate_pt_datasets.py", line 7, in main
    fineweb_edu_dataset = load_from_disk(args.datasets_dir + '/fineweb-edu_processed')
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/datasets/load.py", line 2140, in load_from_disk
    raise FileNotFoundError(f"Directory {dataset_path} not found")
FileNotFoundError: Directory ./datasets/fineweb-edu_processed not found


### Configure Model


We configure a `20M` small model for training and testing.

| Model | Params | n_layers | d_model | d_ff | n_heads | kv_heads | n_exprets | n_expert_heads | n_expert_pre_head |
|---|---|---|---|---|---|---|---|---|---|
| Doge-20M | 13M | 8 | 256 | 512 | 2 | 1 | - | - | - |
| Doge-60M | 54M | 16 | 512 | 1024 | 4 | 2 | - | - | - |
| Doge-160M | 152M | 24 | 768 | 1536 | 6 | 3 | - | - | - |
| Doge-320M | 335M | 32 | 1024 | 2048 | 8 | 4 | - | - | - |

- n_layers is the number of decoder layers in the model
- d_model is the hidden layer dimension of the model
- n_heads is the number of heads of multi-head attention, d_model // n_heads is best kept above 64


> The `Doge-MoE` model can inherit the dense activation parameters of the `Doge` model, and increase the sparse activation parameters by setting `n_experts`, `n_expert_heads`, `n_expert_pre_head`.

### Configure Pre-Training Hyperparameters

| Model | tokens | max_train_steps | accumulate_steps | learning_rate | scheduler | warmup_ratio | decay_ratio | weight_decay | min_lr_rate |
|---|---|---|---|---|---|---|---|---|---|
| Doge-20M | 4B | 8,000 | 256 | 8e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |
| Doge-60M | 16B | 16,000 | 512 | 6e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |
| Doge-160M | 32B | 24,000 | 768 | 4e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |
| Doge-320M | 64B | 32,000 | 1024 | 2e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |

> According to the experience of [SmolLM blog](https://huggingface.co/blog/smollm), we will scale the parameters in [Chinchilla](https://arxiv.org/pdf/2203.15556) by 10 times the scaling ratio of tokens.

> `warmup_stable_decay` is used to continue training with checkpoints on larger datasets at any time, see [Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations](https://arxiv.org/pdf/2405.18392).

### Pre-Training Model

In [11]:
# Padding config path, all arguments are in the config file
!ACCELERATE_LOG_LEVEL=info accelerate launch ./src/small_doge/pt.py --config_file recipes/accelerate_configs/single_gpu.yaml --config recipes/doge/Doge-20M/config_full.yaml

	`--num_processes` was set to a value of `0`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
/usr/bin/python3: can't open file '/content/small-doge/small-doge/./src/small_doge/pt.py': [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 50, in main
    args.func(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 1198, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 785, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', './src/small_doge/pt.py', '--config_file',

### Usage


After training is complete, we can use `AutoModelForCausalLM` of `Transformers` to load the model, and use `AutoTokenizer` to load `LlamaTokenizer`.

In [12]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-20M")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-20M", trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/56.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.36M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

configuration_doge.py:   0%|          | 0.00/13.3k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/SmallDoge/Doge-20M:
- configuration_doge.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_doge.py:   0%|          | 0.00/54.4k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/SmallDoge/Doge-20M:
- modeling_doge.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/52.5M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/139 [00:00<?, ?B/s]

In [13]:
inputs = tokenizer("Hey how are you doing?", return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(out))

["<|begin_of_text|>Hey how are you doing? I'm not sure what you are doing. I'm not sure what you are doing. I'm"]


## Instruction Fine-Tuning

### Download Fine-Tuning Datasets


For the fine-tuning dataset, we selected the `smoltalk` dataset for SFT, and the `ultrafeedback_binarized` dataset for DPO.

In [22]:
# Padding save path, cache path and number of processes
!python ./examples/utils/download_ft_datasets.py --save_dir ./datasets --cache_dir ./cache --num_proc 1

DatasetDict({
    train: Dataset({
        features: ['messages', 'source'],
        num_rows: 1043917
    })
    test: Dataset({
        features: ['messages', 'source'],
        num_rows: 54948
    })
})
Saving the dataset (2/9 shards):  22% 231982/1043917 [00:01<00:04, 171274.25 examples/s]
Traceback (most recent call last):
  File "/content/small-doge/small-doge/./examples/utils/download_ft_datasets.py", line 57, in <module>
    main(args)
  File "/content/small-doge/small-doge/./examples/utils/download_ft_datasets.py", line 39, in main
    download_smoltalk(args.save_dir, args.cache_dir, args.num_proc)
  File "/content/small-doge/small-doge/./examples/utils/download_ft_datasets.py", line 8, in download_smoltalk
    dataset.save_to_disk(save_dir + "/smoltalk", num_proc=num_proc)
  File "/usr/local/lib/python3.11/dist-packages/datasets/dataset_dict.py", line 1348, in save_to_disk
    dataset.save_to_disk(
  File "/usr/local/lib/python3.11/dist-packages/datasets/arrow_dataset.py", li

### Process Fine-Tuning Datasets


We'll apply Fine-Tuning datasets with `chat templete` .

In [15]:
# Padding dataset path, save path, tokenizer path, number process.
!python ./examples/utils/preprocess_ft_datasets.py --datasets_dir ./datasets --save_dir ./datasets --tokenizer_name_or_path SmallDoge/Doge-tokenizer --num_proc 8

Traceback (most recent call last):
  File "/content/small-doge/small-doge/./examples/utils/preprocess_ft_datasets.py", line 5, in <module>
    import trl
ModuleNotFoundError: No module named 'trl'


### Concatenate Datasets

If you download more datasets for fine-tuning, we need to merge and shuffle them together.

In [16]:
# Padding dataset path, save path and number of processes
!python ./examples/utils/concatenate_ft_datasets.py --datasets_dir ./datasets --save_dir ./datasets --num_proc 16

Traceback (most recent call last):
  File "/content/small-doge/small-doge/./examples/utils/concatenate_ft_datasets.py", line 75, in <module>
    main(args)
  File "/content/small-doge/small-doge/./examples/utils/concatenate_ft_datasets.py", line 55, in main
    concatenate_sft_datasets(args.datasets_dir, args.save_dir, args.num_proc)
  File "/content/small-doge/small-doge/./examples/utils/concatenate_ft_datasets.py", line 6, in concatenate_sft_datasets
    smoltalk_dataset = load_from_disk(datasets_dir + '/smoltalk_processed')
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/datasets/load.py", line 2140, in load_from_disk
    raise FileNotFoundError(f"Directory {dataset_path} not found")
FileNotFoundError: Directory ./datasets/smoltalk_processed not found


### SFT Model

We first perform SFT on the model to make it generate responses that follow the `prompt`.

In [17]:
# Padding config path, all arguments are in the config file
!ACCELERATE_LOG_LEVEL=info accelerate launch ./src/small_doge/sft.py --config_file recipes/accelerate_configs/single_gpu.yaml --config recipes/doge/Doge-20M-Instruct/sft/config_full.yaml

	`--num_processes` was set to a value of `0`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
/usr/bin/python3: can't open file '/content/small-doge/small-doge/./src/small_doge/sft.py': [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 50, in main
    args.func(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 1198, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 785, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', './src/small_doge/sft.py', '--config_file

### DPO Model

Then we use reinforcement learning to align SFT model with human preferences, here we use the `DPO` algorithm.

In [18]:
# Padding config path, all arguments are in the config file
!ACCELERATE_LOG_LEVEL=info accelerate launch ./src/small_doge/dpo.py --config_file recipes/accelerate_configs/single_gpu.yaml --config recipes/doge/Doge-20M-Instruct/dpo/config_full.yaml

	`--num_processes` was set to a value of `0`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
/usr/bin/python3: can't open file '/content/small-doge/small-doge/./src/small_doge/dpo.py': [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 50, in main
    args.func(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 1198, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 785, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', './src/small_doge/dpo.py', '--config_file

### Usage

In [19]:
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-20M-Instruct")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-20M-Instruct", trust_remote_code=True)

tokenizer_config.json:   0%|          | 0.00/56.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.36M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.09k [00:00<?, ?B/s]

configuration_doge.py:   0%|          | 0.00/13.3k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/SmallDoge/Doge-20M-Instruct:
- configuration_doge.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_doge.py:   0%|          | 0.00/54.4k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/SmallDoge/Doge-20M-Instruct:
- modeling_doge.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/52.5M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/139 [00:00<?, ?B/s]

In [20]:
generation_config = GenerationConfig(
      max_new_tokens=100,
      use_cache=True,
      do_sample=True,
      temperature=0.8,
      repetition_penalty=1.0
)
steamer = TextStreamer(
      tokenizer=tokenizer,
      skip_prompt=True
)

In [21]:
prompt = "Hi, how are you doing today?"

conversation = [
      {"role": "user", "content": prompt}
]
inputs = tokenizer.apply_chat_template(
    conversation=conversation,
    tokenize=True,
    return_tensors="pt",
)

outputs = model.generate(
    inputs,
    tokenizer=tokenizer,
    generation_config=generation_config,
    streamer=steamer
)

<|start_header_id|>assistant<|end_header_id|>
Hi, I'm reaching out to meet you in the new room. I'm excited to present your work on a new project. I've been working on a real project to create a positive impact that reflects your interest and willingness to learn and experience with the project. I'm eager to meet you in the new room and discuss your creativity.

I'm excited to discuss the potential and potential of our project in a real-world setting. I'm here to support you with our guidance


## Evaluation

Refer to [evaluation](../evaluation/README.md).