# Doge

This notebook trains the `Doge` small language model proposed in the paper [Wonderful Matrices](https://arxiv.org/abs/2407.16958).
Doge builds on the Transformers framework, replacing the `Multi-Head Attention` in the sequence transformation part with `Inner Function Attention`, and the `MLP` in the state transformation part with `CDMoE`.

![doge_architecture](./assets/doge_architecture.png)

## Download Pre-Training and Fine-Tuning Datasets

For the pre-training dataset, we selected the high-quality synthetic datasets `cosmopedia-v2` and `chinese-cosmopedia`, supplemented with `python-edu` to support the model's coding ability.

For the fine-tuning dataset, we selected the `0625`, `7M`, and `Gen` subsets of `Infinity-Instruct`.

> Note: Because the datasets are large, at least 2TB of storage space is required.

In [None]:
# Fill in the save path, cache path, and number of processes
!python scripts/download_datasets.py --save_dir S:/datasets --cache_dir S:/datasets/cache --num_proc 16

## Preprocess Datasets

We need to use the `tokenizer` to convert the datasets into the `input_ids` and `attention_mask` that the model accepts.
Doge uses the `LlamaTokenizer`, which has a vocabulary size of `32768` and marks instructions with the `[INST]` and `[/INST]` tags. It also includes tool tokens, but we won't use them here.
Datasets such as cosmopedia-v2 and Infinity-Instruct contain two fields, `prompt` and `text`, which we treat as the user instruction prompt and the model output text respectively.

```python
prompt = f"[INST]{prompt}[/INST]"
return tokenizer(prompt, text, padding='max_length', truncation=True, max_length=MAX_LENGTH)
```

You can also prepend an instruction prompt of your own.

```python
prompt = f"[INST]You are an AI assistant named `Doge`, you are a language model trained by `Shi Jingze` based on the `Doge` architecture, and your task is to provide appropriate replies and support to users based on their questions and requests.\n你是一个名为 `Doge` 的人工智能助手, 你是由 `石竞泽` 基于 `Doge` 架构训练的语言模型, 你的任务是针对用户的问题和要求提供适当的答复和支持.\n[/INST][INST]{prompt}[/INST]"
return tokenizer(prompt, text, padding='max_length', truncation=True, max_length=MAX_LENGTH)
```
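The prompt construction in the two snippets above can be factored into a small helper. This is a minimal sketch, not the code in `scripts/preprocess_datasets.py`: the name `build_prompt` and its `system` parameter are hypothetical, and the helper only reproduces the `[INST]...[/INST]` wrapping shown above.

```python
from typing import Optional


def build_prompt(prompt: str, system: Optional[str] = None) -> str:
    """Wrap a user prompt in [INST] tags, optionally preceded by a system instruction.

    Hypothetical helper for illustration; it mirrors the string templates above.
    """
    user_block = f"[INST]{prompt}[/INST]"
    if system is None:
        return user_block
    # The system instruction gets its own [INST] block ahead of the user prompt.
    return f"[INST]{system}[/INST]{user_block}"


print(build_prompt("What is 2 + 2?"))
print(build_prompt("What is 2 + 2?", system="You are an AI assistant named `Doge`."))
```

The resulting string would then be passed to the tokenizer together with `text`, as in the snippets above.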

In [None]:
# Fill in the dataset path, save path, tokenizer path, max length, and number of processes
!python scripts/preprocess_datasets.py --datasets_dir S:/datasets --save_dir S:/datasets --tokenizer_path ./tokenizer --max_len 2048 --num_proc 16

## Concatenate Datasets

We combine the cosmopedia-v2, chinese-cosmopedia, and python-edu datasets into the `pretrain` dataset, and the 0625, 7M, and Gen datasets into the `finetune` dataset.
We then shuffle each combined dataset with `seed=233` and hold out `1,000` samples as the test set.
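The merge, shuffle, and split logic can be illustrated on plain Python lists. This is only a sketch of the idea under stated assumptions: `merge_shuffle_split` is a hypothetical name, the toy lists stand in for the real datasets, and `scripts/merge_datasets.py` may implement the step differently (for example with a dataset library rather than lists).

```python
import random


def merge_shuffle_split(datasets, seed=233, test_size=1000):
    """Concatenate lists of samples, shuffle deterministically, and hold out a test set."""
    merged = [sample for dataset in datasets for sample in dataset]
    if test_size >= len(merged):
        raise ValueError("test_size must be smaller than the merged dataset")
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(merged)
    # The first test_size shuffled samples become the test set, the rest the train set.
    return merged[test_size:], merged[:test_size]


# Toy stand-ins for the three pre-training datasets.
cosmopedia = [f"cosmopedia-{i}" for i in range(3000)]
chinese_cosmopedia = [f"chinese-{i}" for i in range(2000)]
python_edu = [f"python-{i}" for i in range(1000)]

train, test = merge_shuffle_split([cosmopedia, chinese_cosmopedia, python_edu])
print(len(train), len(test))
```

Fixing the seed means the same 1,000 samples are held out on every run, so evaluation numbers stay comparable across training runs.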

In [None]:
# Fill in the dataset path, save path, and number of processes
!python scripts/merge_datasets.py --datasets_dir S:/datasets --save_dir S:/datasets --num_proc 16

## Configure Model Parameters

We configure a small `25M` model for a training test.

| Params | n_layers | d_model | n_heads | n_inner_v | d_cross_domain | d_expert | n_experts | n_expert_heads | n_expert_per_head |
|--------|----------|---------|---------|-----------|----------------|----------|-----------|----------------|-------------------|
| 25M    | 8        | 256     | 2       | 2         | 1024           | 256      | 256       | 1              | 2                 |
| 80M    | 12       | 512     | 4       | 4         | 2048           | 512      | 512       | 1              | 2                 |
| 200M   | 16       | 768     | 6       | 6         | 3072           | 768      | 768       | 2              | 4                 |
| 450M   | 24       | 1024    | 8       | 8         | 4096           | 1024     | 1024      | 2              | 4                 |

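- `n_layers` is the number of decoder layers in the model.
- `d_model` is the hidden dimension of the model.
- `n_heads` is the number of attention heads in InnerFuncAttn; `d_model // n_heads` is best kept at 64 or above.
- `n_inner_v` is the number of values (V) in InnerFuncAttn; `d_model // n_inner_v` is best kept at 64 or above.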
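The two "64 or above" guidelines are easy to check for each row of the table. The sketch below copies `d_model`, `n_heads`, and `n_inner_v` from the table; the names `CONFIGS`, `MIN_DIM`, and `per_head_dims` are illustrative and not part of the Doge codebase.

```python
# (d_model, n_heads, n_inner_v) copied from the table above.
CONFIGS = {
    "25M": (256, 2, 2),
    "80M": (512, 4, 4),
    "200M": (768, 6, 6),
    "450M": (1024, 8, 8),
}
MIN_DIM = 64  # recommended lower bound from the notes above


def per_head_dims(d_model, n_heads, n_inner_v):
    """Return (d_model // n_heads, d_model // n_inner_v)."""
    return d_model // n_heads, d_model // n_inner_v


for name, (d_model, n_heads, n_inner_v) in CONFIGS.items():
    head_dim, value_dim = per_head_dims(d_model, n_heads, n_inner_v)
    ok = head_dim >= MIN_DIM and value_dim >= MIN_DIM
    print(f"{name}: head_dim={head_dim}, value_dim={value_dim}, ok={ok}")
```

The same check can be run before training a custom configuration to confirm it stays within the recommended range.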

In [5]:
from yaml import safe_load

# Read the configuration file; adjust the path to match your setup
with open('./model/config/doge_25M.yaml', 'r', encoding='utf-8') as f:
    config = safe_load(f)

config['model']

{'vocab_size': 32768,
 'hidden_size': 256,
 'num_hidden_layers': 8,
 'hidden_bias': False,
 'hidden_dropout': 0.0,
 'hidden_act': 'silu',
 'max_position_embeddings': 16384,
 'rope_theta': 10000.0,
 'use_cache': True,
 'pad_token_id': 0,
 'bos_token_id': 1,
 'eos_token_id': 2,
 'num_attention_heads': 4,
 'num_inner_values': 2,
 'cross_domain_intermediate_size': 1024,
 'private_expert_intermediate_size': 256,
 'num_cdmmoe_experts': 256,
 'num_cdmmoe_heads': 1,
 'num_cdmmoe_experts_per_head': 2}

## Configure Pre-Training Hyperparameters

| Params | tokens | num_train_epochs | per_epoch_max_steps | accumulate_steps | learning_rate | warmup_ratio | weight_decay | min_lr_rate |
|--------|--------|------------------|---------------------|------------------|---------------|--------------|--------------|-------------|
| 25M    | 1B     | 2                | 4,000               | 128              | 8e-4          | 0.1          | 0.01         | 0.1         |
| 80M    | 4B     | 2                | 8,000               | 256              | 6e-4          | 0.1          | 0.01         | 0.1         |
| 200M   | 16B    | 2                | 16,000              | 512              | 5e-4          | 0.1          | 0.01         | 0.1         |
| 450M   | 64B    | 2                | 32,000              | 1024             | 4e-4          | 0.1          | 0.01         | 0.1         |
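To see how `learning_rate`, `warmup_ratio`, and `min_lr_rate` interact, the sketch below computes a learning rate per step. It rests on assumptions the table does not confirm: that the schedule is linear warmup followed by cosine decay down to `min_lr_rate * learning_rate`, and that the total step count is `num_train_epochs * per_epoch_max_steps`. The function name `lr_at_step` is hypothetical.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.1, min_lr_rate=0.1):
    """Learning rate under an ASSUMED linear-warmup + cosine-decay-to-floor schedule."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to min_lr_rate * peak_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_rate + (1.0 - min_lr_rate) * cosine)


# 25M row: 2 epochs x 4,000 steps (assumed total), peak learning rate 8e-4.
total_steps = 2 * 4000
for step in (0, 400, 800, 4400, 8000):
    print(step, lr_at_step(step, total_steps, peak_lr=8e-4))
```

Under these assumptions the 25M run would warm up over the first 800 steps and end at one tenth of its peak learning rate.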

In [5]:
!python train.py --config_path ./model/config/doge_25M.yaml --logging_dir ./log --output_dir ./results --tokenizer_path ./tokenizer 

TrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=400,
eval_strategy=steps,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=F

DogeForCausalLM(
  (model): DogeModel(
    (word_embed): Embedding(32768, 256, padding_idx=0)
    (rotary_emb): RotaryEmbedding()
    (layers): ModuleList(
      (0-7): 8 x DogeDecoderLayer(
        (in_attn_layernorm): RMSNorm((256,), eps=1e-06)
        (attn): DogeInnerFuncAttn(
          (q_proj): Linear(in_features=256, out_features=256, bias=False)
          (k_proj): Linear(in_features=256, out_features=256, bias=False)
          (v_queries): Linear(in_features=256, out_features=128, bias=False)
          (v_embed): Embedding(2, 256)
          (o_proj): Linear(in_features=256, out_features=256, bias=False)
        )
        (in_ff_layernorm): RMSNorm((256,), eps=1e-06)
        (feed_forward): DogeCDMoE(
          (act_fn): SiLU()
          (shared_up_proj): Linear(in_features=256, out_features=1024, bias=False)
          (shared_down_proj): Linear(in_features=1024, out_features=256, bias=False)
          (queries): Linear(in_features=256, out_features=256, bias=False)
          (

In [16]:
!python train.py --config_path ./model/config/doge_320M.yaml --logging_dir ./log --output_dir ./results --tokenizer_path ./tokenizer 

{'model': {'vocab_size': 32768, 'hidden_size': 1024, 'num_hidden_layers': 16, 'hidden_bias': False, 'hidden_dropout': 0.0, 'hidden_act': 'silu', 'max_position_embeddings': 16384, 'rope_theta': 10000.0, 'use_cache': True, 'pad_token_id': 0, 'bos_token_id': 1, 'eos_token_id': 2, 'num_attention_heads': 8, 'num_inner_values': 8, 'cross_domain_intermediate_size': 4096, 'private_expert_intermediate_size': 1024, 'num_cdmmoe_experts': 1024, 'num_cdmmoe_heads': 2, 'num_cdmmoe_experts_per_head': 4}}


DogeForCausalLM(
  (model): DogeModel(
    (word_embed): Embedding(32768, 1024, padding_idx=0)
    (rotary_emb): RotaryEmbedding()
    (layers): ModuleList(
      (0-15): 16 x DogeDecoderLayer(
        (in_attn_layernorm): RMSNorm((1024,), eps=1e-06)
        (attn): DogeInnerFuncAttn(
          (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (v_queries): Linear(in_features=1024, out_features=128, bias=False)
          (v_embed): Embedding(8, 1024)
          (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (in_ff_layernorm): RMSNorm((1024,), eps=1e-06)
        (feed_forward): DogeCDMoE(
          (act_fn): SiLU()
          (shared_up_proj): Linear(in_features=1024, out_features=4096, bias=False)
          (shared_down_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (queries): Linear(in_features=1024, out_features=2048, bias=F

## Evaluation

In [1]:
!git clone https://github.com/huggingface/lighteval.git

Cloning into 'lighteval'...
Filtering content: 100% (2/2)
Filtering content: 100% (2/2), 65.44 MiB | 1.52 MiB/s, done.
fatal: active `post-checkout` hook found during `git clone`:
	E:/Doge/lighteval/.git/hooks/post-checkout
For security reasons, this is disallowed by default.
If this is intentional and the hook should actually be run, please
run the command again with `GIT_CLONE_PROTECTION_ACTIVE=false`
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'



In [2]:
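# Use %cd so the directory change persists ("!cd" only affects a temporary subshell)
%cd lighteval
# Double quotes keep the extras specifier intact; single quotes are passed through literally on Windows
%pip install ".[accelerate,quantization,adapters]"

Note: you may need to restart the kernel to use updated packages.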
