[Pre-Training] Add tutorial for clue small 14g dataset #1555

Merged: 9 commits, Jan 15, 2022
Changes from 1 commit
52 changes: 50 additions & 2 deletions examples/language_model/data_tools/README.md
@@ -131,7 +131,7 @@ chinese words:
Optional. Whether the WWM (whole word masking) strategy is needed. In general, Bert/Ernie models need it, while GPT does not.
--cn_seg_func {lac,seg,jieba}
Words segment function for chinese words.
Default is lac; jieba is faster.
Default is jieba; jieba is faster, while the lac model is more complex.
Member:
The adjective "complex" is not an accurate description here.
It should say that the lac segmentation model is more accurate, but computationally more expensive.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

--cn_splited Is chinese corpus is splited in to words.
Optional; for text that has already been word-segmented. If this option is set, cn_seg_func has no effect.
For example, a pre-segmented text string: "百度 手机助手 是 Android 手机 的 权威 资源平台"
@@ -148,7 +148,7 @@ common config:
--workers WORKERS Number of worker processes to launch
Number of processes used to convert text into token ids.
```
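
To make the segmentation options above concrete, here is a minimal sketch (not part of the original tools; it only assumes the jieba package is installed) of how jieba-style segmentation produces the whitespace-separated text accepted via `--cn_splited`:
```python
# A minimal sketch (not from this repo): how jieba segmentation yields the
# whitespace-separated form accepted via --cn_splited.
# Only assumes the jieba package is installed (pip install jieba).
import jieba

text = "百度手机助手是Android手机的权威资源平台"
words = jieba.lcut(text)      # list of tokens; exact splits depend on jieba's dictionary
pre_split = " ".join(words)   # e.g. "百度 手机助手 是 Android 手机 的 权威 资源平台"
print(pre_split)
```
The exact splits depend on jieba's dictionary, so the output may differ slightly from the example string above.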
Using the script below, we obtain the processed pre-training data: the token ids `baike_sample_ids.npy` and the article index information `baike_sample_idx.npz`.
```
python -u create_pretraining_data.py \
--model_name ernie-1.0 \
@@ -190,3 +190,51 @@ sh run_static.sh
## References

Note: most of the data pipeline is adapted from [Megatron](https://github.com/NVIDIA/Megatron-LM); we would like to express our thanks.


# Appendix

## Clue corpus small dataset processing tutorial
**Dataset overview**: usable for language modeling, pre-training, generative tasks, and more. The dataset exceeds 14 GB, with nearly 4,000 well-formed txt files and about 5 billion characters. Most of it comes from the nlp_chinese_corpus project.
It contains the following sub-corpora (14 GB in total): the news corpus news2016zh_corpus, the community interaction corpus webText2019zh_corpus, the Wikipedia corpus wiki2019zh_corpus, and the comments corpus comments2019zh_corpus.

**Dataset download**:
Users can download it from the official GitHub page, https://github.com/CLUEbenchmark/CLUE. For convenience, we also provide AI Studio download links: [part1](https://aistudio.baidu.com/aistudio/datasetdetail/60598), [part2](https://aistudio.baidu.com/aistudio/datasetdetail/124357). If you use the AI Studio version, you can verify the md5 checksums after downloading:
Member:
"githu" is missing a "b"; it should be "github".

Collaborator Author:
done

```shell
> md5sum ./*
8a8be341ebce39cfe9524fb0b46b08c5 ./comment2019zh_corpus.zip
4bdc2c941a7adb4a061caf273fea42b8 ./news2016zh_corpus.zip
fc582409f078b10d717caf233cc58ddd ./webText2019zh_corpus.zip
157dacde91dcbd2e52a60af49f710fa5 ./wiki2019zh_corpus.zip
```
Unzip the files:
```shell
unzip comment2019zh_corpus.zip -d clue_corpus_small_14g/comment2019zh_corpus
unzip news2016zh_corpus.zip -d clue_corpus_small_14g/news2016zh_corpus
unzip webText2019zh_corpus.zip -d clue_corpus_small_14g/webText2019zh_corpus
unzip wiki2019zh_corpus.zip -d clue_corpus_small_14g/wiki2019zh_corpus
```
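
As an optional sanity check, the sketch below (it only assumes the unzip commands above were run from the current directory) counts the extracted txt files and their total size; the numbers should roughly match the ~4,000 files and 14 GB quoted in the dataset overview:
```python
# A sketch: count the extracted txt files and their total size under clue_corpus_small_14g/.
# Assumes the four unzip commands above were run from the current directory.
from pathlib import Path

root = Path("clue_corpus_small_14g")
txt_files = list(root.rglob("*.txt"))
total_bytes = sum(f.stat().st_size for f in txt_files)
print(f"{len(txt_files)} txt files, {total_bytes / 1024 ** 3:.1f} GiB")
```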
Convert the txt files to jsonl format:
```
python trans_to_json.py --input_path ./clue_corpus_small_14g --output_path clue_corpus_small_14g.jsonl
```
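
Before the heavier conversion step, it can help to peek at the first record of the generated file. This sketch only assumes one JSON object per line (which is what the jsonl format implies); the exact field names are defined by `trans_to_json.py`:
```python
# A sketch: read the first record of the jsonl file to confirm the conversion.
# Only assumes one JSON object per line; field names are defined by trans_to_json.py.
import json

with open("clue_corpus_small_14g.jsonl", "r", encoding="utf-8") as f:
    first = json.loads(f.readline())
print(list(first.keys()))
print(str(first)[:200])  # preview of the first record
```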
We now have the dataset in jsonl format. Next we feed it to the pre-training data pipeline, using ernie as the example.
```
python -u create_pretraining_data.py \
--model_name ernie-1.0 \
--tokenizer_name ErnieTokenizer \
--input_path clue_corpus_small_14g.jsonl \
--split_sentences\
--chinese \
--cn_whole_word_segment \
--cn_seg_func jieba \
--output_prefix clue_corpus_small_14g_20220104 \
--workers 48 \
--log_interval 10000
```
The data contains roughly `15702702` documents. Because word segmentation is time-consuming, the conversion takes about an hour. The training data is produced in the current directory:
```
clue_corpus_small_14g_20220104_ids.npy
clue_corpus_small_14g_20220104_idx.npz
```
Users can now use this data for pre-training tasks.
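
If you want to confirm the produced files are readable before launching training, here is a minimal sketch; it only assumes the `.npy`/`.npz` files are standard NumPy archives, while the meaning of each array is defined by `create_pretraining_data.py`:
```python
# A sketch: confirm the produced files are readable NumPy archives.
# The meaning of each array is defined by create_pretraining_data.py, not by this snippet.
import numpy as np

ids = np.load("clue_corpus_small_14g_20220104_ids.npy", mmap_mode="r")  # memory-map to avoid loading GBs into RAM
idx = np.load("clue_corpus_small_14g_20220104_idx.npz")
print("token ids:", ids.shape, ids.dtype)
print("index arrays:", idx.files)
```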
@@ -86,7 +86,7 @@ def get_args():
group.add_argument(
'--cn_seg_func',
type=str,
default='lac',
default='jieba',
choices=['lac', 'seg', 'jieba'],
help='Words segment function for chinese words.')
group.add_argument(