Support megatron dataset for T5 #6659

LaiXinyi823 · 2023-08-09T02:52:29Z

PR types

New features

PR changes

APIs

Description

Support megatron dataset for T5

paddle-bot · 2023-08-09T02:52:34Z

Thanks for your contribution!

codecov · 2023-08-09T03:27:29Z

Codecov Report

Merging #6659 (6a41b11) into develop (e49842c) will decrease coverage by 0.17%.
Report is 2 commits behind head on develop.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           develop    #6659      +/-   ##
===========================================
- Coverage    60.06%   59.90%   -0.17%     
===========================================
  Files          552      554       +2     
  Lines        81755    81975     +220     
===========================================
  Hits         49105    49105              
- Misses       32650    32870     +220

Files Changed	Coverage Δ
paddlenlp/experimental/transformers/__init__.py	`0.00% <0.00%> (ø)`
...dlenlp/experimental/transformers/bloom/__init__.py	`0.00% <0.00%> (ø)`
...dlenlp/experimental/transformers/bloom/modeling.py	`0.00% <0.00%> (ø)`
...erimental/transformers/fused_transformer_layers.py	`0.00% <0.00%> (ø)`
...enlp/experimental/transformers/generation_utils.py	`0.00% <0.00%> (ø)`

CLAassistant · 2023-08-09T03:43:47Z

All committers have signed the CLA.

ZHUI · 2023-08-24T07:00:23Z

examples/language_model/t5/README.md

@@ -26,11 +26,13 @@
 python -u  create_pretraining_data.py \


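In the data ID conversion step, we need to set tokenzer_name to select the tokenizer that matches the T5 model. The script below produces the processed pretraining data: token ids in baike_sample_ids.npy and document index information in baike_sample_idx.npz. (A processed pretraining dataset is provided here; click the links to download it.)

Sample data needs to be provided here.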

KB-Ding

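There are conflicts that need to be resolved.
Have the old and new data formats been checked against each other? For example, take the old npy files and the new bin files produced from the same data, fix the seed, run the first few steps, and check whether both versions yield the same data.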
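
A minimal sketch of the kind of check suggested above, assuming both formats can be wrapped in a function that returns the token ids of document `i`. The toy corpus, the two in-memory layouts, and the helper names (`first_batches`, `batches_equal`, `get_old`, `get_new`) are hypothetical and stand in for the real npy and bin/idx readers, which are not shown in this thread.

```python
import numpy as np


def first_batches(get_doc, num_docs, seed, steps, batch_size):
    """Draw the first `steps` batches using a seeded shuffle of document ids."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(num_docs)
    batches = []
    for step in range(steps):
        picked = order[step * batch_size:(step + 1) * batch_size]
        batches.append([get_doc(int(i)) for i in picked])
    return batches


def batches_equal(a, b):
    """Return True only if every document in every batch matches exactly."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if len(x) != len(y):
            return False
        for u, v in zip(x, y):
            if not np.array_equal(u, v):
                return False
    return True


# Toy corpus standing in for one tokenized dataset (hypothetical data).
docs = [np.arange(n, dtype=np.int32) + 10 * n for n in (3, 5, 2, 4, 6, 1, 7, 3)]
lens = np.array([len(d) for d in docs])

# "Old" layout (assumed): flat ids plus per-document start positions and lengths.
old_ids = np.concatenate(docs)
old_starts = np.concatenate([[0], np.cumsum(lens)[:-1]])


def get_old(i):
    return old_ids[old_starts[i]:old_starts[i] + lens[i]]


# "New" layout (assumed): flat buffer plus num_docs + 1 offsets.
new_buf = np.concatenate(docs)
new_offsets = np.concatenate([[0], np.cumsum(lens)])


def get_new(i):
    return new_buf[new_offsets[i]:new_offsets[i + 1]]


old_batches = first_batches(get_old, len(docs), seed=1234, steps=3, batch_size=2)
new_batches = first_batches(get_new, len(docs), seed=1234, steps=3, batch_size=2)
same = batches_equal(old_batches, new_batches)
```

In a real comparison, `get_old` and `get_new` would be replaced by the actual dataset readers, and the training loss over the same steps could be compared as well.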

KB-Ding · 2023-09-05T11:48:11Z

examples/language_model/t5/README.md

@@ -95,6 +98,7 @@ python -u  -m paddle.distributed.launch \
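 - `dataloader_num_workers` Number of DataLoader sampling processes; if data input is the bottleneck, try increasing it.
 - `eval_steps` Interval between model evaluations.
 - `device` Training device; defaults to GPU.
+- `data_impl` Format of the prepared input data files; defaults to mmap, and can be set to mmap or lazy.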


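Please add the difference between mmap and lazy: the "mmap" format builds a memory mapping when the data is loaded, while the "lazy" format reads directly from the file when the data is loaded.

Changed to: Format of the prepared input data files; defaults to mmap, and can be set to mmap or lazy. The "mmap" format builds a memory mapping when the data is loaded, while the "lazy" format reads directly from the file when the data is loaded.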
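
The distinction described above can be illustrated with a small sketch. It is not the PaddleNLP implementation: the file name `toy_tokens.bin`, the `uint16` dtype, and the helpers `read_mmap` and `read_lazy` are assumptions made for illustration only.

```python
import os
import tempfile

import numpy as np


def read_mmap(path, dtype, start, length):
    """Memory-map the file; only the pages that are touched get loaded."""
    arr = np.memmap(path, dtype=dtype, mode="r")
    # Copy the slice so the result no longer depends on the mapping.
    return np.array(arr[start:start + length])


def read_lazy(path, dtype, start, length):
    """Seek to the byte offset and read the requested tokens from the file."""
    itemsize = np.dtype(dtype).itemsize
    with open(path, "rb") as f:
        f.seek(start * itemsize)
        buf = f.read(length * itemsize)
    return np.frombuffer(buf, dtype=dtype).copy()


# Write a toy token file (hypothetical data, not a real pretraining corpus).
tokens = np.arange(1000, dtype=np.uint16)
path = os.path.join(tempfile.mkdtemp(), "toy_tokens.bin")
tokens.tofile(path)

a = read_mmap(path, np.uint16, start=100, length=8)
b = read_lazy(path, np.uint16, start=100, length=8)
```

Both helpers return the same tokens; they differ only in how the bytes reach memory.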

KB-Ding · 2023-09-05T11:49:27Z

examples/language_model/t5/t5_run_pretrain_trainer.py

@@ -120,6 +120,7 @@ class DataArguments:
        default=3,
        metadata={"help": "Max N Grams"},
    )
+    data_impl: str = field(default="mmap", metadata={"help": "Data implementation."})


Suggest keeping this consistent with llama: help="mmap/lazy format converted from preprocessed data."

Changed to: "help": "mmap/lazy format converted from preprocessed data."

KB-Ding · 2023-09-05T12:02:03Z

model_zoo/ernie-1.0/args.py

@@ -32,7 +32,7 @@ def parse_args(MODEL_CLASSES):
    parser.add_argument("--input_dir", default=None, type=str, required=True, help="The input directory where the data will be read from.", )
    parser.add_argument("--output_dir", default=None, type=str, required=True, help="The output directory where the training logs and checkpoints will be written.")
    parser.add_argument("--split", type=str, default='949,50,1', help="Train/valid/test data split.")
-
+    parser.add_argument("--data_impl", type=str, default='mmap', help="mmap/lazy format converted from json.")


Better to keep this consistent with llama: "mmap/lazy format converted from preprocessed data"

Changed to: help="mmap/lazy format converted from preprocessed data."
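
As a small sketch of how the agreed help text reads in a parser, the snippet below defines the option on its own. The `choices` restriction and the `build_parser` helper are assumptions added for illustration; the diff above only shows `type`, `default`, and `help`.

```python
import argparse


def build_parser():
    """Build a parser with only the --data_impl option (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data_impl",
        type=str,
        default="mmap",
        choices=["mmap", "lazy"],  # assumption: reject any other value early
        help="mmap/lazy format converted from preprocessed data.",
    )
    return parser


default_args = build_parser().parse_args([])
lazy_args = build_parser().parse_args(["--data_impl", "lazy"])
```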

KB-Ding · 2023-09-05T12:12:44Z

examples/language_model/t5/README.md

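In the data preparation section, the provided samples (token ids: baike_sample_ids.npy, document index information: baike_sample_idx.npz) should be changed to the bin and idx formats. For data preparation, refer to here; note the parameter configuration, refer to here.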

LaiXinyi823 · 2023-09-11T10:00:33Z

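There are conflicts that need to be resolved.

Have the old and new data formats been checked against each other? For example, take the old npy files and the new bin files produced from the same data, fix the seed, run the first few steps, and check whether both versions yield the same data.

Conflicts resolved.
Ran a few test steps: the old and new versions yield the same data, and the loss matches.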

gongel

I don't see any changes in paddlenlp/data/indexed_dataset.py; it doesn't need to be included in the PR.

gongel · 2023-09-12T08:45:21Z

examples/language_model/t5/README.md

@@ -20,17 +20,19 @@

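 The data pipeline is a very important part of pretraining. The [preprocessing documentation](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/preprocess/README.md) provides an overview diagram of how the data is transformed, and users can consult it for the details of data preparation.

-In the data ID conversion step, we need to set tokenzer_name to select the tokenizer that matches the T5 model. The script below produces the processed pretraining data: token ids in [`baike_sample_ids.npy`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data//baike_sample_ids.npy) and document index information in [`baike_sample_idx.npz`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data//baike_sample_idx.npz). (A processed pretraining dataset is provided here; click the links to download it.)
+In the data ID conversion step, we need to set tokenzer_name to select the tokenizer that matches the T5 model. The script below produces the processed pretraining data: token ids in [`gpt_openwebtext.bin`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data/gpt_openwebtext.bin) and document index information in [`gpt_openwebtext.idx`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data/gpt_openwebtext.idx). (A processed pretraining dataset is provided here; click the links to download it.)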


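Why does this produce a dataset with the gpt prefix?

In the end T5 uses GPT's openwebtext dataset, so the prefix is gpt. Should it be changed to t5?

What is the dataset named when users follow your steps? Keep it consistent; it has to be reproducible.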

Changed.

examples/language_model/t5/t5_run_pretrain_trainer.py

LaiXinyi823 · 2023-09-12T09:04:33Z

I don't see any changes in paddlenlp/data/indexed_dataset.py; it doesn't need to be included in the PR.

ok

Support megatron dataset for T5

LaiXinyi823 force-pushed the develop branch 2 times, most recently from a47f2dc to 3bed5b1 Compare August 9, 2023 03:43

LaiXinyi823 force-pushed the develop branch 5 times, most recently from e843a53 to 37ce41e Compare August 9, 2023 09:12

ZHUI reviewed Aug 24, 2023

KB-Ding reviewed Sep 5, 2023

LaiXinyi823 force-pushed the develop branch 7 times, most recently from e7ab798 to 9cd67e7 Compare September 11, 2023 09:58

LaiXinyi823 force-pushed the develop branch from 9cd67e7 to 08ce7df Compare September 12, 2023 02:13

KB-Ding approved these changes Sep 12, 2023

gongel reviewed Sep 12, 2023

LaiXinyi823 force-pushed the develop branch 6 times, most recently from 06d6b9a to 673d356 Compare September 12, 2023 15:49

LaiXinyi823 force-pushed the develop branch 2 times, most recently from 1c9e3c4 to d225406 Compare September 12, 2023 16:13

fix T5 readme

6a41b11

Support megatron dataset for T5

LaiXinyi823 force-pushed the develop branch from d225406 to 6a41b11 Compare September 12, 2023 16:21

gongel approved these changes Sep 13, 2023

gongel merged commit a43138b into PaddlePaddle:develop Sep 13, 2023
9 checks passed
