Add English Documents for UTC (#4476)
* [UTC] Add English documents

* [UTC] Add English documents

* [utc] update readmes

* [utc] add xclue link
LemonNoel committed Jan 17, 2023
1 parent c1359cb commit c21ebfc
Showing 4 changed files with 400 additions and 4 deletions.
8 changes: 6 additions & 2 deletions applications/zero_shot_text_classification/README.md
@@ -1,3 +1,5 @@
Simplified Chinese | [English](README_en.md)

# Zero-shot Text Classification

**Table of Contents**
@@ -27,7 +29,7 @@
**Highlights of the Zero-shot Text Classification Application:**

- **Comprehensive Coverage 🎓:** Covers all mainstream text classification tasks and supports multi-task training, meeting developers' diverse needs for putting text classification into production.
- **State-of-the-Art Performance 🏃:** The UTC model, with its outstanding classification performance, serves as the training backbone and provides strong zero-shot and few-shot learning capabilities.
- **State-of-the-Art Performance 🏃:** The UTC model, with its outstanding classification performance, serves as the training backbone and provides strong zero-shot and few-shot learning capabilities. The model ranks first on both [ZeroCLUE](https://www.cluebenchmarks.com/zeroclue.html) and [FewCLUE](https://www.cluebenchmarks.com/fewclue.html) (as of January 11, 2023).
- **Easy to Use:** With Taskflow, three lines of code enable quick inference without any labeled data, and a single command starts text classification training and deployment, lowering the barrier to multi-task text classification in production.
- **Efficient Tuning ✊:** Developers can easily get started with data annotation and model training without any machine learning background.

@@ -156,7 +158,9 @@ python -u -m paddle.distributed.launch --gpus "0,1" run_train.py \
* `seed`: Global random seed; defaults to 42.
* `model_name_or_path`: Pre-trained model used for few-shot training. Defaults to "utc-large".
* `output_dir`: Required. Directory where the model is saved after training or compression; defaults to `None`.
* `dev_path`: Path to the development set; defaults to `None`.
* `dataset_path`: Directory containing the dataset files; defaults to `./data/`.
* `train_file`: Training set file name; defaults to `train.txt`.
* `dev_file`: Development set file name; defaults to `dev.txt`.
* `max_seq_len`: Maximum sequence length of the text. When the input, including the labels, exceeds the maximum length, the input text is split automatically; the label part is never split. Defaults to 512.
* `per_device_train_batch_size`: Batch size per GPU core/CPU for training; defaults to 8.
* `per_device_eval_batch_size`: Batch size per GPU core/CPU for evaluation; defaults to 8.
253 changes: 253 additions & 0 deletions applications/zero_shot_text_classification/README_en.md
@@ -0,0 +1,253 @@
[Simplified Chinese](README.md) | English

# Zero-shot Text Classification

**Table of contents**
- [1. Zero-shot Text Classification Application](#1)
- [2. Quick Start](#2)
- [2.1 Code Structure](#21)
- [2.2 Data Annotation](#22)
- [2.3 Finetuning](#23)
- [2.4 Evaluation](#24)
- [2.5 Inference](#25)
- [2.6 Deployment](#26)
- [2.7 Experiments](#27)

<a name="1"></a>

## 1. Zero-shot Text Classification

This project provides an end-to-end application solution for universal text classification based on Universal Task Classification (UTC) fine-tuning, covering the full lifecycle of **data labeling, model training and model deployment**. We hope this guide helps you apply zero-shot text classification techniques in your own products or models.

<div align="center">
<img width="700" alt="UTC model architecture" src="https://user-images.githubusercontent.com/25607475/212268807-66181bcb-d3f9-4086-9d4a-de4d1d0933c2.png">
</div>

Text classification refers to assigning a set of categories to a given input text. Despite the advantages of fine-tuning, applying text classification techniques in practice remains challenging due to domain adaptation, the lack of labeled data, and other issues. This PaddleNLP Zero-shot Text Classification Guide builds on our UTC from the Unified Semantic Matching (USM) model series and provides an industrial-level solution that supports universal text classification tasks, including but not limited to **sentiment analysis, semantic matching, intent recognition and event detection**, allowing you to accomplish multiple tasks with a single model. In addition, our method achieves good generalization performance through multi-task pretraining.

**Highlights:**

- **Comprehensive Coverage**🎓: Covers various mainstream text classification tasks, including but not limited to sentiment analysis, semantic matching, intent recognition and event detection.

- **State-of-the-Art Performance**🏃: Strong performance from the UTC model, which ranks first on [ZeroCLUE](https://www.cluebenchmarks.com/zeroclue.html)/[FewCLUE](https://www.cluebenchmarks.com/fewclue.html) as of 01/11/2023.

- **Easy to Use**⚡: Three lines of code with Taskflow give you out-of-the-box zero-shot text classification; a single command starts model training or deployment.

- **Efficient Tuning**✊: Developers can easily get started with the data labeling and model training process without a background in Machine Learning.

<a name="2"></a>

## 2. Quick start

For a quick start, you can directly use `paddlenlp.Taskflow` out of the box to leverage the model's zero-shot performance, as in the sketch below. For production use cases, we recommend labeling a small amount of data and fine-tuning the model to further improve performance.
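Below is a minimal out-of-the-box sketch; the schema simply reuses a subset of the medical intent labels that appear later in this guide, and any custom label set can be substituted:

```python
from pprint import pprint

from paddlenlp import Taskflow

# The schema is the list of candidate labels; no labeled data or training is required.
schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议"]
cls = Taskflow("zero_shot_text_classification", schema=schema)

# Classify a piece of text against the candidate labels.
pprint(cls("中性粒细胞比率偏低"))
```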

<a name="21"></a>

### 2.1 Code structure

```shell
.
├── deploy/simple_serving/ # model deployment script
├── utils.py # data processing tools
├── run_train.py # model fine-tuning script
├── run_eval.py # model evaluation script
├── label_studio.py # data format conversion script
├── label_studio_text.md # data annotation instruction
└── README.md
```
<a name="22"></a>

### 2.2 Data labeling

We recommend using [Label Studio](https://labelstud.io/) for data labeling. You can export labeled data in Label Studio and convert them into the required input format. Please refer to [Label Studio Data Labeling Guide](./label_studio_text_en.md) for more details.

Here we provide a pre-labeled example dataset `Medical Question Intent Classification Dataset`, which you can download with the following command. We will show how to use the data conversion script to generate training/validation/test set files for fine-tuning.

Download the medical question intent classification dataset:

```shell
wget https://bj.bcebos.com/paddlenlp/datasets/utc-medical.tar.gz
tar -xvf utc-medical.tar.gz
mv utc-medical data
rm utc-medical.tar.gz
```

Generate training/validation set files:

```shell
python label_studio.py \
--label_studio_file ./data/label_studio.json \
--save_dir ./data \
--splits 0.8 0.1 0.1 \
--options ./data/label.txt
```
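For reference, the file passed to ``--options`` (here ``./data/label.txt``, assumed to ship with the downloaded dataset) is a plain text file with one candidate label per line. For the medical intent dataset it corresponds to the label set also used in the inference example in Section 2.5:

```text
病情诊断
治疗方案
病因分析
指标解读
就医建议
疾病表述
后果表述
注意事项
功效作用
医疗费用
其他
```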

For multi-task training, you can convert the data for each task separately with the script and then merge the converted files into one directory, as in the sketch below.
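A minimal merging sketch, assuming each task has been converted into its own directory (the directory names below are hypothetical) containing the `train.txt`/`dev.txt`/`test.txt` files produced by the conversion script:

```python
from pathlib import Path

# Hypothetical per-task output directories from separate label_studio.py runs.
task_dirs = [Path("./data/task_medical"), Path("./data/task_news")]
merged_dir = Path("./data/merged")
merged_dir.mkdir(parents=True, exist_ok=True)

for split in ("train.txt", "dev.txt", "test.txt"):
    with open(merged_dir / split, "w", encoding="utf-8") as merged_file:
        for task_dir in task_dirs:
            split_path = task_dir / split
            if split_path.exists():
                # Append this task's examples to the combined split file.
                merged_file.write(split_path.read_text(encoding="utf-8"))
```

The merged directory can then be passed to training via `--dataset_path ./data/merged`.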

<a name="23"></a>

### 2.3 Finetuning

Use the following command to fine-tune the model using `utc-large` as the pre-trained model, and save the fine-tuned model to `./checkpoint/model_best/`:

Single GPU:

```shell
python run_train.py \
--device gpu \
--logging_steps 10 \
--save_steps 100 \
--eval_steps 100 \
--seed 1000 \
--model_name_or_path utc-large \
--output_dir ./checkpoint/model_best \
--dataset_path ./data/ \
--max_seq_length 512 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 8 \
--num_train_epochs 20 \
--learning_rate 1e-5 \
--do_train \
--do_eval \
--do_export \
--export_model_dir ./checkpoint/model_best \
--overwrite_output_dir \
--disable_tqdm True \
--metric_for_best_model macro_f1 \
--load_best_model_at_end True \
--save_total_limit 1
```

Multiple GPUs:

```shell
python -u -m paddle.distributed.launch --gpus "0,1" run_train.py \
--device gpu \
--logging_steps 10 \
--save_steps 100 \
--eval_steps 100 \
--seed 1000 \
--model_name_or_path utc-large \
--output_dir ./checkpoint/model_best \
--dataset_path ./data/ \
--max_seq_length 512 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 8 \
--num_train_epochs 20 \
--learning_rate 1e-5 \
--do_train \
--do_eval \
--do_export \
--export_model_dir ./checkpoint/model_best \
--overwrite_output_dir \
--disable_tqdm True \
--metric_for_best_model macro_f1 \
--load_best_model_at_end True \
--save_total_limit 1
```

Parameters:

* `device`: Training device; either 'cpu' or 'gpu' can be selected. Defaults to GPU.
* `logging_steps`: The interval steps of log printing during training, the default is 10.
* `save_steps`: The number of interval steps to save the model checkpoint during training, the default is 100.
* `eval_steps`: The number of interval steps at which the model is evaluated during training, the default is 100.
* `seed`: global random seed, default is 42.
* `model_name_or_path`: The pre-trained model used for few shot training. Defaults to "utc-large".
* `output_dir`: Required, the model directory saved after model training or compression; the default is `None`.
* `dataset_path`: The directory to dataset; defaults to `./data`.
* `train_file`: Training file name; defaults to `train.txt`.
* `dev_file`: Development file name; defaults to `dev.txt`.
* `max_seq_len`: The maximum segmentation length of the text and label candidates. When the input exceeds the maximum length, the input text will be automatically segmented. The default is 512.
* `per_device_train_batch_size`: The batch size of each GPU core/CPU used for training, the default is 8.
* `per_device_eval_batch_size`: Batch size per GPU core/CPU for evaluation, default is 8.
* `num_train_epochs`: Number of training epochs; 100 can be used together with early stopping. The default is 10.
* `learning_rate`: The maximum learning rate for training, UTC recommends setting it to 1e-5; the default value is 3e-5.
* `do_train`: Whether to perform fine-tuning; pass this flag to enable training. Not set by default.
* `do_eval`: Whether to run evaluation; pass this flag to enable evaluation. Not set by default.
* `do_export`: Whether to export the static graph model; pass this flag to enable export. Not set by default.
* `export_model_dir`: Directory to export the static graph model; the default is `./checkpoint/model_best`.
* `overwrite_output_dir`: If `True`, overwrite the contents of the output directory. If `output_dir` points to a checkpoint directory, use it to continue training.
* `disable_tqdm`: Whether to disable the tqdm progress bar.
* `metric_for_best_model`: Metric used to select the best model; UTC recommends setting it to `macro_f1`. The default is None.
* `load_best_model_at_end`: Whether to load the best model at the end of training; usually used together with `metric_for_best_model`. The default is False.
* `save_total_limit`: If set, limits the total number of checkpoints kept and deletes older checkpoints from `output_dir`. Defaults to None.

<a name="24"></a>

### 2.4 Evaluation

Model evaluation:

```shell
python run_eval.py \
--model_path ./checkpoint/model_best \
--test_path ./data/test.txt \
--per_device_eval_batch_size 2 \
--max_seq_len 512 \
--output_dir ./checkpoint_test
```

Parameters:

- `model_path`: The path of the model folder for evaluation, which must contain the model weight file `model_state.pdparams` and the configuration file `model_config.json`.
- `test_path`: The test set file for evaluation.
- `per_device_eval_batch_size`: Batch size for evaluation; adjust it according to your hardware. The default is 8.
- `max_seq_len`: The maximum segmentation length of the text and label candidates. When the input exceeds the maximum length, the input text will be automatically segmented. The default is 512.

<a name="25"></a>

### 2.5 Inference

You can use `paddlenlp.Taskflow` to load your custom model by specifying the path of the model weight file through `task_path`.

```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"]
>>> my_cls = Taskflow("zero_shot_text_classification", schema=schema, task_path='./checkpoint/model_best', precision="fp16")
>>> pprint(my_cls("中性粒细胞比率偏低"))
```

<a name="26"></a>

### 2.6 Deployment

We provide a deployment solution built on PaddleNLP SimpleServing, with which you can easily set up your own service in a few lines of code.

```python
# Save at server.py
from paddlenlp import SimpleServer, Taskflow

schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议"]
utc = Taskflow("zero_shot_text_classification",
               schema=schema,
               task_path="../../checkpoint/model_best/",
               precision="fp32")
app = SimpleServer()
app.register_taskflow("taskflow/utc", utc)
```

```shell
# Start the server
paddlenlp server server:app --host 0.0.0.0 --port 8990
```

The service supports FP16 (half-precision) inference and multi-process serving for acceleration.
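Once the server is running, it can be queried over HTTP. The client below is only an illustrative sketch: it assumes the host and port used above and a JSON payload with the input texts under a `data.text` field, following the pattern of other PaddleNLP SimpleServing examples; see the scripts under `deploy/simple_serving/` for the exact request schema.

```python
import json

import requests

# Endpoint registered above via app.register_taskflow("taskflow/utc", utc).
url = "http://0.0.0.0:8990/taskflow/utc"
headers = {"Content-Type": "application/json"}

# Assumed payload layout: input texts under data.text.
payload = {"data": {"text": ["中性粒细胞比率偏低"]}}

response = requests.post(url, headers=headers, data=json.dumps(payload))
print(response.json())
```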

<a name="27"></a>

### 2.7 Experiments

The results reported here are based on the development set of KUAKE-QIC.

| | Accuracy | Micro F1 | Macro F1 |
| :------: | :--------: | :--------: | :--------: |
| 0-shot | 28.69 | 87.03 | 60.90 |
| 5-shot | 64.75 | 93.34 | 80.33 |
| 10-shot | 65.88 | 93.76 | 81.34 |
| full-set | 81.81 | 96.65 | 89.87 |

where k-shot means that there are k annotated samples per label for training.
applications/zero_shot_text_classification/label_studio_text.md
@@ -1,3 +1,5 @@
Simplified Chinese | [English](label_studio_text_en.md)

# Label Studio User Guide for Text Classification Tasks

**Table of Contents**
@@ -105,7 +107,7 @@ label-studio start

Rename the exported file to ``label_studio.json`` and place it in the ``./data`` directory. It can then be converted to the UTC data format with the [label_studio.py](./label_studio.py) script.

During data conversion, the label candidates used for model training are constructed automatically. For example, in medical intent classification the label candidates are ``["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"]``, which can be configured through the ``options`` parameter.
During data conversion, you also need to provide the label candidates in the `./data/label.txt` file, one label per line. For example, in medical intent classification the label candidates are ``["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"]``; they can also be configured directly through the ``options`` parameter.

```shell
python label_studio.py \
@@ -122,7 +124,7 @@ python label_studio.py \
- ``label_studio_file``: Data annotation file exported from Label Studio.
- ``save_dir``: Directory where the training data is saved; defaults to the ``data`` directory.
- ``splits``: Proportions used to split the dataset into training, development and test sets. The default [0.8, 0.1, 0.1] splits the data at a ratio of ``8:1:1``.
- ``options``: Category labels for the classification task. If the input is a file, the file contains one label per line. Defaults to None, in which case the label candidate set is constructed automatically from the input data, which can be time-consuming for large datasets.
- ``options``: Category labels for the classification task. If the input is a file, the file contains one label per line.
- ``is_shuffle``: Whether to shuffle the dataset; defaults to True.
- ``seed``: Random seed; defaults to 1000.
