# Doge

Train the `Doge` small language model proposed in the paper [Wonderful Matrices](https://arxiv.org/abs/2407.16958).
Doge builds on the Transformers framework, replacing the `Multi-Head Attention` in the sequence transformation part with `Inner Function Attention` and the `MLP` in the state transformation part with `CDMoE`.

![doge_architecture](./assets/doge_architecture.png)

## Download Pre-Training and Fine-Tuning Datasets

For pre-training, we use the high-quality text of `fineweb-edu-dedup` and the synthetic instruction dataset `cosmopedia-v2`, supplemented with `python-edu` and `open-web-math` to strengthen the model's code and math capabilities.

> Note: The datasets are large, so at least 2TB of storage space is required.

In [None]:
# Fill in the save path, cache path, and number of processes
!python scripts/download_datasets.py --save_dir S:/datasets --cache_dir S:/datasets/cache --num_proc 16

## Preprocess Datasets

We use the `tokenizer` to convert the datasets into the `input_ids` and `attention_mask` that the model accepts.
Doge uses `LlamaTokenizer`, which has a vocabulary size of `32768` and marks instructions with the `[INST]` and `[/INST]` tags. It also includes tool tokens, but we do not use them here.
Datasets such as cosmopedia-v2 include two fields, `prompt` and `text`, which we treat as the user instruction prompt and the model output text, respectively.

```python
prompt = f"[INST]{prompt}[/INST]"
return tokenizer(prompt, text, padding='max_length', truncation=True, max_length=MAX_LENGTH)
```

You can also prepend instruction prompts of your own, for example:

```python
prompt = f"[INST]You are an AI assistant named `Doge`, you are a language model trained by `Shi Jingze` based on the `Doge` architecture, and your task is to provide appropriate replies and support to users based on their questions and requests.\n你是一个名为 `Doge` 的人工智能助手, 你是由 `石竞泽` 基于 `Doge` 架构训练的语言模型, 你的任务是针对用户的问题和要求提供适当的答复和支持.\n[/INST][INST]{prompt}[/INST]"
return tokenizer(prompt, text, padding='max_length', truncation=True, max_length=MAX_LENGTH)
```
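The two prompt layouts above differ only in an optional leading instruction block. As a minimal sketch of that string construction, the hypothetical helper `build_prompt` below (not part of the repository) assembles the text and leaves tokenization out; the preprocessing script may build the string differently.

```python
from typing import Optional


def build_prompt(user_prompt: str, system_prompt: Optional[str] = None) -> str:
    """Wrap a user prompt, and optionally a system-style prompt, in [INST] tags.

    Hypothetical helper for illustration only.
    """
    parts = []
    if system_prompt is not None:
        # The optional instruction block comes first, in its own tag pair.
        parts.append(f"[INST]{system_prompt}[/INST]")
    parts.append(f"[INST]{user_prompt}[/INST]")
    return "".join(parts)


plain = build_prompt("What is a matrix?")
with_system = build_prompt("What is a matrix?", "You are an AI assistant named `Doge`.")
print(plain)
print(with_system)
```

The resulting string would then be passed to the tokenizer together with `text`, as in the snippets above.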

In [None]:
# Fill in the dataset path, save path, tokenizer path, number of tokens, max length, and number of processes
# NOTE: We keep only 100B tokens, in the ratio fineweb-edu:cosmopedia-v2:python-edu:open-web-math = 7:2:0.5:0.5. To train a larger model, increase the dataset size yourself.
!python scripts/preprocess_datasets.py --datasets_dir S:/datasets --save_dir S:/datasets --tokenizer_path ./tokenizer --tokens 100_000_000_000 --max_len 2048 --num_proc 16
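To make the 7:2:0.5:0.5 ratio concrete, the sketch below works out how many of the 100B tokens each dataset contributes. It only does the arithmetic; how `preprocess_datasets.py` applies the ratio internally is not shown here, and the names `RATIO` and `split_tokens` are made up for this example.

```python
from fractions import Fraction

TOTAL_TOKENS = 100_000_000_000  # value passed via --tokens

# Mixing weights from the note above; Fraction keeps 0.5 exact.
RATIO = {
    "fineweb-edu": Fraction(7),
    "cosmopedia-v2": Fraction(2),
    "python-edu": Fraction(1, 2),
    "open-web-math": Fraction(1, 2),
}


def split_tokens(total: int, ratio: dict) -> dict:
    """Distribute `total` tokens across datasets in proportion to `ratio`."""
    weight_sum = sum(ratio.values())
    return {name: int(total * weight / weight_sum) for name, weight in ratio.items()}


budgets = split_tokens(TOTAL_TOKENS, RATIO)
for name, n_tokens in budgets.items():
    print(f"{name}: {n_tokens:,} tokens")
```

Under this reading, fineweb-edu supplies 70B tokens, cosmopedia-v2 20B, and python-edu and open-web-math 5B each.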

## Concatenate Datasets

We combine the tokenized fineweb-edu, cosmopedia-v2, python-edu, and open-web-math datasets into a single `pretrain` dataset.
We then shuffle it with `seed=233` and hold out `1,000` samples as the test set.
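The concatenate, shuffle, and split steps can be illustrated in plain Python. This is a toy stand-in, not the logic of `concatenate_datasets.py`: the samples are placeholder strings rather than tokenized sequences, and `concat_shuffle_split` is a hypothetical helper.

```python
import random


def concat_shuffle_split(parts, test_size, seed):
    """Concatenate sample lists, shuffle with a fixed seed, and hold out a test set."""
    merged = [sample for part in parts for sample in part]
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    rng.shuffle(merged)
    return merged[test_size:], merged[:test_size]


# Toy stand-ins for the four tokenized datasets (sizes follow the 7:2:0.5:0.5 mix).
parts = [
    [f"fineweb-edu-{i}" for i in range(7000)],
    [f"cosmopedia-v2-{i}" for i in range(2000)],
    [f"python-edu-{i}" for i in range(500)],
    [f"open-web-math-{i}" for i in range(500)],
]

train, test = concat_shuffle_split(parts, test_size=1000, seed=233)
print(len(train), len(test))
```

Because the seed is fixed, rerunning the split yields the same train and test samples, which keeps evaluation comparable across runs.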

In [None]:
# Fill in the dataset path, save path, and number of processes
!python scripts/concatenate_datasets.py --datasets_dir S:/datasets --save_dir S:/datasets --num_proc 16

## Configure Model Parameters

We configure a small `200M` model for training and testing.

| Params | n_layers | d_model | n_heads | n_inner_v | d_cross_domain | d_expert | n_experts | n_expert_heads | n_expert_per_head |
|--------|----------|---------|---------|-----------|----------------|----------|-----------|----------------|-------------------|
| 200M   | 12       | 768     | 6       | 6         | 3072           | 768      | 3072      | 1              | 2                 |
| 420M   | 16       | 1024    | 8       | 8         | 4096           | 1024     | 4096      | 2              | 4                 |
| 1.3B   | 24       | 1536    | 12      | 12        | 6144           | 1536     | 6144      | 3              | 6                 |
| 3.2B   | 32       | 2048    | 16      | 16        | 8192           | 2048     | 8192      | 4              | 8                 |

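- `n_layers` is the number of decoder layers in the model
- `d_model` is the hidden dimension of the model
- `n_heads` is the number of attention heads in InnerFuncAttn; it is best to keep `d_model // n_heads` at 64 or above
- `n_inner_v` is the number of V (value) embeddings in InnerFuncAttn; it is best to keep `d_model // n_inner_v` at 64 or above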
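The 64-or-above guideline is easy to check against the table. The sketch below copies `d_model`, `n_heads`, and `n_inner_v` from the four rows and computes both quotients; the dictionary layout and the `per_head_dims` helper are illustrative and do not mirror the repository's config format.

```python
# Values copied from the model-size table above.
MODEL_SIZES = {
    "200M": {"d_model": 768, "n_heads": 6, "n_inner_v": 6},
    "420M": {"d_model": 1024, "n_heads": 8, "n_inner_v": 8},
    "1.3B": {"d_model": 1536, "n_heads": 12, "n_inner_v": 12},
    "3.2B": {"d_model": 2048, "n_heads": 16, "n_inner_v": 16},
}
MIN_DIM = 64  # recommended lower bound for both quotients


def per_head_dims(cfg):
    """Return (d_model // n_heads, d_model // n_inner_v) for one configuration."""
    return cfg["d_model"] // cfg["n_heads"], cfg["d_model"] // cfg["n_inner_v"]


for name, cfg in MODEL_SIZES.items():
    head_dim, inner_v_dim = per_head_dims(cfg)
    ok = head_dim >= MIN_DIM and inner_v_dim >= MIN_DIM
    print(f"{name}: head_dim={head_dim}, inner_v_dim={inner_v_dim}, ok={ok}")
```

Every row in the table works out to 128 for both quotients, comfortably above the recommended minimum.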

## Configure Pre-Training Hyperparameters

| Params | tokens | num_train_epochs | per_epoch_max_steps | accumulate_steps | learning_rate | warmup_ratio | weight_decay | min_lr_rate |
|--------|--------|------------------|---------------------|------------------|---------------|--------------|--------------|-------------|
| 200M   | 5B     | 1                | 10,000              | 256              | 4e-4          | 0.1          | 0.01         | 0.1         |
| 400M   | 20B    | 1                | 20,000              | 512              | 3e-4          | 0.1          | 0.01         | 0.1         |
| 1.3B   | 80B    | 1                | 40,000              | 1024             | 2e-4          | 0.1          | 0.01         | 0.1         |
| 3.2B   | 320B   | 1                | 80,000              | 2048             | 1e-4          | 0.1          | 0.01         | 0.1         |
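The `tokens` column can be related to the step counts with a little arithmetic. The sketch below assumes that each accumulation step processes a single sequence of 2048 tokens (the `--max_len` used during preprocessing) and that `warmup_ratio` is a fraction of the total optimizer steps. Both are assumptions made for illustration, not settings confirmed by the table, and the helper names are hypothetical.

```python
MAX_LEN = 2048  # sequence length used in the preprocessing step

# (per_epoch_max_steps, accumulate_steps, warmup_ratio) copied from the table above.
SCHEDULE = {
    "200M": (10_000, 256, 0.1),
    "400M": (20_000, 512, 0.1),
    "1.3B": (40_000, 1024, 0.1),
    "3.2B": (80_000, 2048, 0.1),
}


def tokens_per_epoch(max_steps, accumulate_steps, max_len=MAX_LEN, micro_batch_size=1):
    """Upper bound on tokens seen in one epoch.

    ASSUMPTION: each accumulation step holds `micro_batch_size` sequences of
    `max_len` tokens; padded positions are counted as tokens.
    """
    return max_steps * accumulate_steps * micro_batch_size * max_len


def warmup_steps(max_steps, warmup_ratio):
    """ASSUMPTION: warmup covers `warmup_ratio` of all optimizer steps."""
    return round(max_steps * warmup_ratio)


for name, (max_steps, accumulate_steps, warmup_ratio) in SCHEDULE.items():
    n_tokens = tokens_per_epoch(max_steps, accumulate_steps)
    n_warmup = warmup_steps(max_steps, warmup_ratio)
    print(f"{name}: ~{n_tokens / 1e9:.1f}B tokens, {n_warmup} warmup steps")
```

Under these assumptions the products land close to the `tokens` column, for example 10,000 × 256 × 2048 ≈ 5.2B tokens for the 200M row.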

In [None]:
from yaml import safe_load

# Read the configuration file; adjust the path to match your setup
with open('./model/config/doge_25M.yaml', 'r', encoding='utf-8') as f:
    config = safe_load(f)

config['model']

## Model Training

In [6]:
# Specify the config path, logging directory, output directory, and tokenizer path; to resume training from a checkpoint, also pass --resume_from_checkpoint
!python train.py --config_path ./model/config/doge_200M.yaml --logging_dir ./logs --output_dir ./results --tokenizer_path ./tokenizer 

DogeForCausalLM(
  (model): DogeModel(
    (word_embed): Embedding(32768, 768, padding_idx=0)
    (rotary_emb): RotaryEmbedding()
    (layers): ModuleList(
      (0-11): 12 x DogeDecoderLayer(
        (in_attn_layernorm): RMSNorm((768,), eps=1e-06)
        (attn): DogeInnerFuncAttn(
          (q_proj): Linear(in_features=768, out_features=768, bias=False)
          (k_proj): Linear(in_features=768, out_features=768, bias=False)
          (v_queries): Linear(in_features=768, out_features=128, bias=False)
          (v_embed): Embedding(6, 768)
          (o_proj): Linear(in_features=768, out_features=768, bias=False)
        )
        (in_ff_layernorm): RMSNorm((768,), eps=1e-06)
        (feed_forward): DogeCDMoE(
          (act_fn): SiLU()
          (shared_up_proj): Linear(in_features=768, out_features=3072, bias=False)
          (shared_down_proj): Linear(in_features=3072, out_features=768, bias=False)
          (queries): Linear(in_features=768, out_features=768, bias=False)
 

## Evaluation

First, install `miniconda`.

```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```

Then create an evaluation environment.

```bash
conda create -n lighteval python=3.10.12 
conda activate lighteval
pip install lighteval[accelerate]
```

Finally, run the evaluation cell.


In [None]:
MODEL = "LoserCheems/Doge-25M"
OUTPUT_DIR = "./lighteval_results"
!lighteval accelerate \
--model_args="pretrained=$MODEL" \
--output_dir $OUTPUT_DIR \
--override_batch_size 16 \
--tasks "original|mmlu|0|1,lighteval|triviaqa|0|1,lighteval|arc:easy|0|1,lighteval|piqa|0|1,leaderboard|hellaswag|0|1,lighteval|openbookqa|0|1,leaderboard|winogrande|0|1"
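The `--tasks` value is a comma-separated list in which each entry has four `|`-separated fields. Reading them as suite, task name, few-shot count, and a truncation flag is an assumption based on lighteval's task-spec convention. The sketch below, with a hypothetical `build_tasks_arg` helper, rebuilds the same string from a list so that tasks are easier to add or remove.

```python
# One tuple per benchmark, mirroring the entries in the command above.
TASKS = [
    ("original", "mmlu", 0, 1),
    ("lighteval", "triviaqa", 0, 1),
    ("lighteval", "arc:easy", 0, 1),
    ("lighteval", "piqa", 0, 1),
    ("leaderboard", "hellaswag", 0, 1),
    ("lighteval", "openbookqa", 0, 1),
    ("leaderboard", "winogrande", 0, 1),
]


def build_tasks_arg(tasks):
    """Join each task's fields with '|' and the tasks themselves with ','."""
    return ",".join("|".join(str(field) for field in task) for task in tasks)


tasks_arg = build_tasks_arg(TASKS)
print(tasks_arg)
```

The printed string can be pasted into the `--tasks` argument unchanged.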