add hierarchical text classification #2501

lugimzzz · 2022-06-13T12:37:40Z

PR types

Others

PR changes

Others

Description

新增层次分类算法，包含训练，预测，静态图导出，裁剪等算法，支持paddle serving， triton， onnxruntime多种部署方式。

以下是本项目主要代码结构及说明：

hierarchical_classification/
├── deploy # 部署
│   └── onnxruntime # 导出ONNX模型并基于ONNXRuntime部署
│   │   ├── infer.py # ONNXRuntime推理部署示例
│   │   ├── predictor.py
│   │   └── README.md # 使用说明
│   ├── serving # 基于Paddle Serving 部署
│   │   ├──config.yml # 层次分类任务启动服务端的配置文件
│   │   ├──rpc_client.py # 层次分类任务发送pipeline预测请求的脚本
│   │   └──service.py # 层次分类任务启动服务端的脚本
│   └── triton # 基于Triton server 部署
│       ├── README.md # 使用说明
│       ├── seqcls_grpc_client.py # 客户端测试代码
│       └── models # 部署模型
│           ├── seqcls
│           │   └── config.pbtxt
│           ├── seqcls_model
│           │   └──config.pbtxt
│           ├── seqcls_postprocess
│           │   ├── 1
│           │   │   └── model.py
│           │   └── config.pbtxt
│           └── tokenizer
│               ├── 1
│               │   └── model.py
│               └── config.pbtxt
├── train.py # 训练评估脚本
├── predict.py # 预测脚本
├── export_model.py # 动态图参数导出静态图参数脚本
├── utils.py # 工具函数脚本
├── metric.py # metric脚本
├── prune.py # 裁剪脚本
├── prune_trainer.py # 裁剪trainer脚本
├── prune_config.py # 裁剪训练参数配置
├── requirements.txt # 环境依赖
└── README.md # 使用说明

wawltor · 2022-06-15T03:35:48Z

applications/text_classification/hierarchical_classification/README.md

@@ -0,0 +1,251 @@
+# 多标签层次分类任务


pre-commit 没有安装，code style check没有过

wawltor · 2022-06-15T03:38:23Z

applications/text_classification/hierarchical_classification/README.md

+## 层次分类任务介绍
+
+多标签层次分类任务指自然语言处理任务中，每个样本具有多个标签标记，并且标签集合中存在预定义的树状结构或有向无环图结构，多标签层次分类需要充分考虑标签集之间的层次结构关系来预测层次化预测结果。在现实场景中，大量的数据如新闻分类、专利分类、学术论文分类等标签集合存在预定义的层次化结构，需要利用算法为文本自动标注更细粒度和更准确的标签。本项目中采用通用多标签层次分类算法将每个结点的标签路径视为一个多类标签，使用单个分类器进行决策。
+


这块能不能用一些图片或者例子来示意层次分类了？

新增图片表示：
如下图所示(R代表根节点)，层次分类任务中标签层次结构分为两类，一类为树状结构，另一类为有向无环图(DAG)结构。有向无环图结构与树状结构区别在于，有向无环图中的节点可能存在不止一个父节点。层次分类问题可以被视为一个多标签问题，以左图树状结构为例，如果一个样本属于类别1.2.1，样本也天然地同时属于类别1和类别1.2两个样本标签。本项目采用通用多标签层次分类算法，将每个结点的标签路径视为一个多分类标签，使用单个多标签分类器进行决策。以上面的例子为例，该样本包含三个标签：1、1->1.2、1->1.2->1.2.1。

wawltor · 2022-06-15T03:39:10Z

applications/text_classification/hierarchical_classification/README.md

+
+## 层次分类任务介绍
+
+多标签层次分类任务指自然语言处理任务中，每个样本具有多个标签标记，并且标签集合中存在预定义的树状结构或有向无环图结构，多标签层次分类需要充分考虑标签集之间的层次结构关系来预测层次化预测结果。在现实场景中，大量的数据如新闻分类、专利分类、学术论文分类等标签集合存在预定义的层次化结构，需要利用算法为文本自动标注更细粒度和更准确的标签。本项目中采用通用多标签层次分类算法将每个结点的标签路径视为一个多类标签，使用单个分类器进行决策。


多类标签->多分类标签

wawltor · 2022-06-15T03:43:30Z

applications/text_classification/hierarchical_classification/README.md

+## 模型微调
+
+我们以层次分类公开数据集WOS(Web of Science)为示例，在训练集上进行模型训练，并在开发集上验证，开发集中选出的最优的模型在测试集上进行评估。WOS数据集是一个层次文本分类数据集，包含7个父类和134子类，每个样本对应一个父类标签和子类标签，父类标签和子类标签间具有层次结构关系，WOS数据集已内置到PaddleNLP中。
+


，WOS数据集已内置到PaddleNLP中。这句话可以去掉

wawltor · 2022-06-15T03:44:09Z

applications/text_classification/hierarchical_classification/README.md

+单卡训练
+```shell
+$ unset CUDA_VISIBLE_DEVICES
+$ python train.py --early_stop


如果是单卡训练不应该 unset CUDA_VISIBLE_DEVICES , 多卡的时候需要unset一下

wawltor · 2022-06-15T03:45:01Z

applications/text_classification/hierarchical_classification/README.md

+指定GPU卡号/多卡训练
+```shell
+$ unset CUDA_VISIBLE_DEVICES
+$ python -m paddle.distributed.launch --gpus "0,1" train.py --early_stop


这里建议 --gpus "0,1" -> --gpus "0", 因为大多数用户是没有多卡，可以备注一下如果用多卡，可以指定0，1，这样的数字

已修改：

指定GPU卡号/多卡训练

unset CUDA_VISIBLE_DEVICES python -m paddle.distributed.launch --gpus "0" train.py --early_stop

使用多卡训练可以指定多个GPU卡号，例如 --gpus "0,1"

wawltor · 2022-06-15T03:48:19Z

applications/text_classification/hierarchical_classification/README.md

+**NOTE:**
+* 如需恢复模型训练，则可以设置 `init_from_ckpt` ， 如 `init_from_ckpt=checkpoints/macro/model_state.pdparams` 。
+* 如需训练中文层次分类任务，只需更换预训练模型参数 `model_name` 。中文训练任务推荐使用"ernie-3.0-base-zh"，更多可选模型可参考[Transformer预训练模型](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer)。
+* 如需使用ernie-tiny模型，则需要提前先安装sentencepiece依赖，如 `pip install sentencepiece`。


这块话去掉把，后续我们建议用户使用的ERNIE-3.0-tiny，使用的tokenizer是不需要sentencepiece

wawltor · 2022-06-15T03:50:16Z

applications/text_classification/hierarchical_classification/README.md

+
+程序运行时将会自动进行训练，评估，测试。同时训练过程中会自动保存开发集上最佳 Macro F1 值和最佳 Micro F1 值的模型在指定的 `save_dir` 中，保存模型文件结构如下所示：
+
+```text


这里有个疑问，一般情况下如果选择macro和micro两个其中的模型了

已修改，默认保存最佳macro f1模型参数

wawltor · 2022-06-15T03:53:01Z

applications/text_classification/hierarchical_classification/README.md

+
+* `params_path`：动态图训练保存的参数路径；默认为"./checkpoint/macro/model_state.pdparams"。
+* `output_path`：静态图图保存的参数路径；默认为"./export"。
+* `dataset`：训练数据集;默认为wos数据集。


如果是导出模型的话应该是不需要dataset

已修改，使用num_classes来确定AutoModelForSequenceClassification中类别数：

num_classes：任务标签类别数;默认为wos数据集类别数141。

wawltor · 2022-06-15T03:56:09Z

applications/text_classification/hierarchical_classification/README.md

+
+```shell
+python deploy/paddle2onnx/infer.py --model_path_prefix ./export/wos/1/float32
+```


这个脚本的模型输入会有点怪，因为这个时候还没有进行模型的裁减的工作

示例export_model.py导出的静态图模型可以用于onnxruntime推理

wawltor · 2022-06-15T03:59:48Z

applications/text_classification/hierarchical_classification/README.md

+
+启动裁剪：
+```shell
+$ python prune.py --output_dir ./export


裁剪这块的输入模型是什么了？

已修改

启动裁剪：

python prune.py --output_dir ./prune --params_dir ./checkpoint/model_state.pdparams

wawltor · 2022-06-15T04:00:17Z

applications/text_classification/hierarchical_classification/README.md

+  * `params_dir`：待预测模型参数文件；默认为"./checkpoint/macro/model_state.pdparams"。
+  * `model_name_or_path`：选择预训练模型；默认为"bert-base-uncased"。
+
+以上参数都可通过 `python xxx.py --dataset xx --params_dir xx` 的方式传入）


xxx.py -> prune.py

已修改：
以上参数都可通过 python prune.py --dataset xx --params_dir xx 的方式传入）

wawltor · 2022-06-15T04:02:08Z

applications/text_classification/hierarchical_classification/README.md

+
+1. 数据集：WOS（英文层次分类数据集）
+
+2. 计算卡：V100、CUDA11.2


1.物理机环境
系统: CentOS Linux release 7.5.1804
GPU: Tesla V100-SXM2-32GB * 8
CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz * 40
CUDA: 11
cuDNN: 8.0.4
Driver Version: 450.80.02
内存: 502 GB

GPU相关信息再细致一点，参考上面

已修改：

物理机环境

系统: CentOS Linux release 7.7.1908 (Core)

GPU: Tesla V100-SXM2-32GB * 8

CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz

CUDA: 11.2

cuDNN: 8.1.0

Driver Version: 460.27.04

内存: 630 GB

wawltor · 2022-06-15T04:03:52Z

applications/text_classification/hierarchical_classification/README.md

+| BERT base             | 86.06 | 81.29 | 8.80  |
+| BERT base+裁剪(3/4)    | 86.83(+0.77) | 81.08(-0.21) | 6.85   |
+| BERT base+裁剪(2/3)    | 86.77(+0.71) | 80.48(-0.81) | 5.98  |
+| BERT base+裁剪(1/4)    | 86.40(+0.34) | 80.79(-0.5) | 2.51   |


这块可能最好是基于ERNIE-2.0-en来做实验，这里的实验细节没有看明白， BERT base+裁剪(3/4) 裁剪越多，latency最高？

裁剪后面数值为保留比例，已更换ernie-2.0-base-en进行实验：

Micro F1 Macro F1 latency(ms)

ERNIE 2.0 85.71 80.82 8.80

ERNIE 2.0+裁剪(保留比例3/4) 86.83(+1.12) 81.78(+0.96) 6.85

ERNIE 2.0+裁剪(保留比例2/3) 86.74(+1.03) 81.64(+0.82) 5.98

ERNIE 2.0+裁剪(保留比例1/4) 85.79(+0.08) 79.53(-1.29) 2.51

wawltor · 2022-06-15T04:06:09Z

applications/text_classification/hierarchical_classification/train.py

+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import paddle


这里的import要符合PE8, 这块可以看看
一般按照这个规则来，标准库模块，第三方模块，自用模块
paddle这种一般都是向后排

wawltor · 2022-06-15T04:06:52Z

applications/text_classification/hierarchical_classification/train.py

+)
+parser.add_argument(
+    '--model_name',
+    default="bert-base-uncased",


这块换成ERNIE EN系列的模型

wawltor · 2022-06-15T06:06:29Z

applications/text_classification/hierarchical_classification/train.py

+    if paddle.distributed.get_world_size() > 1:
+        paddle.distributed.init_parallel_env()
+
+    # load and preprocess dataset


注释手字母大写，整体都改一下了

wawltor · 2022-06-15T06:08:22Z

applications/text_classification/hierarchical_classification/train.py

+
+    # batchify dataset
+    collate_fn = DataCollatorWithPadding(tokenizer)
+    train_batch_sampler = BatchSampler(


这块有点问题，如果是多卡时，应该使用的DistributedBatchSampler

已修改：

if paddle.distributed.get_world_size() > 1: train_batch_sampler = DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) else: train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=True)

wawltor · 2022-06-15T06:12:54Z

applications/text_classification/hierarchical_classification/train.py

+                lr_scheduler.step()
+            optimizer.clear_grad()
+
+            global_step += 1


是因为step是从1开始，这块的global_step这块的逻辑是不是有点不太对

global_step是从0开始

wawltor · 2022-06-15T06:13:56Z

applications/text_classification/hierarchical_classification/train.py

+            early_stop_count = 0
+            best_micro_f1_score = micro_f1_score
+            model._layers.save_pretrained(save_best_micro_path)
+            tokenizer.save_pretrained(save_best_micro_path)


Save的逻辑可能再讨论一下，是不是mean值会更好，放出太多的模型会让用户比较迷

选择保留最佳macro f1模型

wawltor · 2022-06-15T06:18:17Z

applications/text_classification/hierarchical_classification/train.py

+            tokenizer.save_pretrained(save_best_micro_path)
+
+
+def test():


这里比较有疑问的是，和evaluate的区别是什么，看这里加载的模型是marco f1值最好的模型，这里的考虑是什么了

test()是打算评测测试集表现，已去掉test函数

wawltor · 2022-06-15T06:20:23Z

applications/text_classification/hierarchical_classification/train.py

+        if step % 100 == 0:
+            logger.info("step %d, %d samples processed" %
+                        (step, step * args.batch_size))
+    metric.report()


这里的report里面看起来是使用了sklearn相关的函数，在requirement没有体现出来

已新加requirements.txt

wawltor · 2022-06-15T06:21:28Z

applications/text_classification/hierarchical_classification/utils.py

+from paddlenlp.utils.log import logger
+
+
+# 构建验证集evaluate函数


注释改成英文注释，统一一下

wawltor · 2022-06-15T06:22:08Z

applications/text_classification/hierarchical_classification/utils.py

+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import paddle


import关系要符合一下PE8标准

wawltor · 2022-06-15T06:23:22Z

applications/text_classification/hierarchical_classification/utils.py

+@paddle.no_grad()
+def evaluate(model, criterion, metric, data_loader):
+    """
+    Given a dataset, it evals model and computes the metric.


evals看起来没有这个单测，eval->evaluate

已修改为evaluates

wawltor · 2022-06-15T06:29:06Z

applications/text_classification/hierarchical_classification/prune.py

+import yaml
+import functools
+from typing import Optional
+import paddle


这块的import关系再check一下

wawltor · 2022-06-15T06:31:35Z

applications/text_classification/hierarchical_classification/prune_trainer.py

+from metric import MetricReport
+
+nn.MultiHeadAttention._ori_forward = paddle.nn.MultiHeadAttention.forward
+nn.MultiHeadAttention._ori_prepare_qkv = nn.MultiHeadAttention._prepare_qkv


这块的函数变换，可以问一下佳琪的原因，然后注释一下

已添加注释：

# Paddleslim will modify MultiHeadAttention.forward and MultiHeadAttention._prepare_qkv # Original forward and _prepare_qkv should be saved before import paddleslim nn.MultiHeadAttention._ori_forward = paddle.nn.MultiHeadAttention.forward nn.MultiHeadAttention._ori_prepare_qkv = nn.MultiHeadAttention._prepare_qkv

wawltor · 2022-06-15T06:50:05Z

applications/text_classification/hierarchical_classification/export.py

@@ -0,0 +1,58 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.


这块的文件改一下吧，和之前的一些保持一致，export.py->export_model.py

已修改脚本名

wawltor · 2022-06-15T06:50:32Z

applications/text_classification/hierarchical_classification/export.py

+                    help="The path to model parameters to be loaded.")
+parser.add_argument("--output_path", type=str, default='./export',
+                    help="The path of model parameter in static graph to be saved.")
+parser.add_argument("--dataset", default="wos", type=str,


dataset没有使用，可以去掉

wawltor · 2022-06-15T06:51:47Z

applications/text_classification/hierarchical_classification/predict.py

+args = parser.parse_args()
+
+
+def predict(data, label_list):


加一个@paddle.no_grad，节省显存占用

wawltor · 2022-06-15T06:53:31Z

applications/text_classification/hierarchical_classification/predict.py

+            results.append(labels)
+
+    for idx, text in enumerate(data):
+        label_name = [label_list[r] for r in results[idx]]


这里对结果进行解析的过程中，回到之前的问题，因为这次是一个层次分类的任务，最终输出看起来是多标签分类的结果

已修改

for r in results[idx]: if r < 7: level1.append(label_list[r]) else: level2.append(label_list[r]) print('predicted result:') print('level 1 : {} level 2 : {}'.format(', '.join(level1), ', '.join( level2)))

wawltor · 2022-06-15T06:55:13Z

applications/text_classification/hierarchical_classification/deploy/paddle2onnx/README.md

+## 环境准备
+
+模型转换与ONNXRuntime预测部署依赖Paddle2ONNX和ONNXRuntime，Paddle2ONNX支持将Paddle模型转化为ONNX模型格式，算子目前稳定支持导出ONNX Opset 7~15，更多细节可参考：[Paddle2ONNX](https://github.com/PaddlePaddle/Paddle2ONNX)
+


这里的依赖安装是不是也要增加一下paddle2onnx

paddlenlp安装的时候已经安装paddle2onnx了

wawltor · 2022-06-15T07:11:55Z

applications/text_classification/hierarchical_classification/deploy/paddle2onnx/README.md

+
+2. 计算卡：V100、CUDA11.2
+
+3. CPU 信息：Intel(R) Xeon(R) Gold 6271C CPU


这里的硬件设备信息，根据trianer再正式一点

已修改：
2. 物理机环境

系统: CentOS Linux release 7.7.1908 (Core) GPU: Tesla V100-SXM2-32GB * 8 CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz CUDA: 11.2 cuDNN: 8.1.0 Driver Version: 460.27.04 内存: 630 GB

wawltor · 2022-06-15T07:12:46Z

applications/text_classification/hierarchical_classification/deploy/paddle2onnx/infer.py

+
+import paddle
+import argparse
+from predictor import Predictor


这块的import往下摞

wawltor · 2022-06-15T07:28:23Z

applications/text_classification/hierarchical_classification/deploy/paddle2onnx/predictor.py

+import onnxruntime as ort
+from paddlenlp.transformers import AutoTokenizer
+import paddle.nn.functional as F
+from sklearn.metrics import f1_score


咱们的框架是有F1只算的 AccuracyAndF1

sklearn可以计算macro和micro f1

wawltor · 2022-06-15T07:35:40Z

applications/text_classification/hierarchical_classification/deploy/paddle2onnx/predictor.py

+    def printer(self, infer_result, input_datas):
+        label = infer_result["label"]
+        confidence = infer_result["confidence"]
+        for i, input_data in enumerate(input_datas):


这里看看，能不能改成logger

lugimzzz · 2022-06-15T14:01:15Z

按照comment修改了代码和readme，基于ernie-2.0-en的性能和精度评测结果后续补上，serving部署部分明后两天补上。
裁剪部分的参数width_mul还是按照保留比例设置，因为涉及改的代码比较多，在readme上性能和精度评测结果栏里注明了是裁剪保留比例。

lugimzzz · 2022-06-19T15:52:02Z

已增加triton serving部署部分代码和文档说明

lugimzzz · 2022-06-20T07:11:13Z

完善triton serving部分，并补充ERNIE 2.0在wos数据集上表现和性能（裁剪，fp16，int8）

lugimzzz · 2022-06-21T08:14:38Z

由于pre-commit-config.yaml变化,更新了yapf版本，导致代码codestyle不通过。通过merge develop分支最新代码，更新pre-commit文件

lugimzzz · 2022-06-23T14:25:00Z

新增层次分类读取本地数据方式，修改paddlenlp/dataset/wos.py以支持以内置数据集格式读取本地数据集。目前本地数据集读取支持树状图和有向无环图标签层次结构，支持数据集层次标签不同深度，用训练、预测、部署仅需提供设定格式本地数据集路径和参数配置。

ArtificialZeng · 2022-06-24T03:10:03Z

新增层次分类读取本地数据方式，修改paddlenlp/dataset/wos.py以支持以内置数据集格式读取本地数据集。目前本地数据集读取支持树状图和有向无环图标签层次结构，支持数据集层次标签不同深度，用训练、预测、部署仅需提供设定格式本地数据集路径和参数配置。

Traceback (most recent call last):
File "E:\A_IDE\Anaconda3\envs\pp212_gpu\lib\threading.py", line 926, in _bootstrap_inner
self.run()
File "E:\A_IDE\Anaconda3\envs\pp212_gpu\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "E:\A_IDE\Anaconda3\envs\pp212_gpu\lib\site-packages\paddle\fluid\dataloader\dataloader_iter.py", line 216, in _thread_loop
self._thread_done_event)
File "E:\A_IDE\Anaconda3\envs\pp212_gpu\lib\site-packages\paddle\fluid\dataloader\fetcher.py", line 121, in fetch
data.append(self.dataset[idx])
File "E:\A_IDE\Anaconda3\envs\pp212_gpu\lib\site-packages\paddlenlp\datasets\dataset.py", line 272, in getitem
idx]) if self._transform_pipline else self.new_data[idx]
File "E:\A_IDE\Anaconda3\envs\pp212_gpu\lib\site-packages\paddlenlp\datasets\dataset.py", line 263, in _transform
data = fn(data)
File "G:\googleDownload\PaddleNLP-hierarchical\PaddleNLP-hierarchical\applications\text_classification\hierarchical_classification\utils.py", line 73, in preprocess_function
for i in range(label_nums)
File "G:\googleDownload\PaddleNLP-hierarchical\PaddleNLP-hierarchical\applications\text_classification\hierarchical_classification\utils.py", line 73, in
for i in range(label_nums)
KeyError: 'label'

虽然数据可以读取本地了，但还是报错

lugimzzz · 2022-06-24T03:19:26Z

hierarchical\applications\text_classification\hierarchical_classification\utils.py", line 73, in preprocess_function

utils中preprocess_function.py已经发生变动，如需使用请使用最新代码（其他多个文件也在输入输出部分出现变化，请一并拉取）

ArtificialZeng · 2022-06-24T03:20:04Z

新增层次分类读取本地数据方式，修改paddlenlp/dataset/wos.py以支持以内置数据集格式读取本地数据集。目前本地数据集读取支持树状图和有向无环图标签层次结构，支持数据集层次标签不同深度，用训练、预测、部署仅需提供设定格式本地数据集路径和参数配置。

W0624 11:18:51.985149 25272 gpu_context.cc:306] device: 0, cuDNN Version: 8.0.
Exception in thread Thread-4:
Traceback (most recent call last):
File "E:\A_IDE\Anaconda3\envs\pp212_gpu\lib\threading.py", line 926, in _bootstrap_inner
self.run()
File "E:\A_IDE\Anaconda3\envs\pp212_gpu\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "E:\A_IDE\Anaconda3\envs\pp212_gpu\lib\site-packages\paddle\fluid\dataloader\dataloader_iter.py", line 216, in _thread_loop
self._thread_done_event)
File "E:\A_IDE\Anaconda3\envs\pp212_gpu\lib\site-packages\paddle\fluid\dataloader\fetcher.py", line 121, in fetch
data.append(self.dataset[idx])
File "E:\A_IDE\Anaconda3\envs\pp212_gpu\lib\site-packages\paddlenlp\datasets\dataset.py", line 268, in getitem
) if self._transform_pipline else self.new_data[idx]
File "E:\A_IDE\Anaconda3\envs\pp212_gpu\lib\site-packages\paddlenlp\datasets\dataset.py", line 258, in _transform
data = fn(data)
TypeError: preprocess_function() got an unexpected keyword argument 'label_nums'

bug是不是有点太多了

ArtificialZeng · 2022-06-24T03:38:31Z

hierarchical\applications\text_classification\hierarchical_classification\utils.py", line 73, in preprocess_function

utils中preprocess_function.py已经发生变动，如需使用请使用最新代码（其他多个文件也在输入输出部分出现变化，请一并拉取）

谢谢，大佬，拉取最新的后跑通了。

请问可以应用于商品同款匹配嘛

lugimzzz · 2022-06-24T03:40:36Z

本层次文本分类任务目前主要用于对文本片段进行分类，确定文本在不同层次标签中所属的类别。同款商品匹配不建议使用文本分类，中文商品匹配可以参考application中语义索引任务进行搭建

ArtificialZeng · 2022-06-24T06:32:15Z

本层次文本分类任务目前主要用于对文本片段进行分类，确定文本在不同层次标签中所属的类别。同款商品匹配不建议使用文本分类，中文商品匹配可以参考application中语义索引任务进行搭建

什么时候有中文的文本分类，怎么示例用英文的呢？
以下网站有一个中文专利的：要是下载不了可以留邮箱我发您
https://opendata.pku.edu.cn/dataset.xhtml?persistentId=doi:10.18170/DVN/ASRTHL&version=2.0

lugimzzz · 2022-06-24T06:43:17Z

什么时候有中文的文本分类，怎么示例用英文的呢？
考虑到目前层次分类相关中文数据集非正式数据集，所以层次分类使用英文数据集进行示例，之后层次分类中文数据集会以notebook的形式呈现，放入到文档中。之后多标签，多分类的文本分类都采用中文数据集示例。

lugimzzz · 2022-06-24T07:15:25Z

加入Paddle Serving部署

wawltor

LGTM

kevindragon · 2022-06-24T10:35:13Z

运行的时候报错：

python -m paddle.distributed.launch --gpus 0 applications/text_classification/hierarchical_classification/train.py --early_stop

paddlepaddle 2.3.0
paddlenlp 2.3.0.dev

lugimzzz · 2022-06-24T15:53:13Z

运行的时候报错：

python -m paddle.distributed.launch --gpus 0 applications/text_classification/hierarchical_classification/train.py --early_stop

因为paddlenlp/dataset/wos.py有变化，目前代码还未merge到develop分支中，如果要加载wos数据集，建议替换paddlenlp安装目录下paddlenlp/dataset/wos.py
note:可以使用pip show paddlenlp确定安装location.

ArtificialZeng · 2022-06-27T02:05:27Z

LGTM

什么时候合并到主分支，刚在本地运行成功了

wawltor self-requested a review June 14, 2022 06:38

wawltor self-assigned this Jun 14, 2022

wawltor added the enhancement New feature or request label Jun 14, 2022

wawltor reviewed Jun 15, 2022

View reviewed changes

add_hierarchical_classification

839225e

lugimzzz mentioned this pull request Jun 21, 2022

层次化文本分类有此功能嘛？能否用paddleNLP实现 #2600

Closed

lugimzzz added 2 commits June 22, 2022 07:17

modify_wos_dataset

ebc5d4d

optimize_codes

c11a44b

lugimzzz added 2 commits June 24, 2022 07:08

add_paddle_serving

3f59f81

add_paddle_serving

18a5f0a

add_paddle_serving

a39786b

wawltor approved these changes Jun 24, 2022

View reviewed changes

PaddlePaddle locked and limited conversation to collaborators Jun 27, 2022

PaddlePaddle unlocked this conversation Jun 27, 2022

wawltor closed this Jun 27, 2022

wawltor reopened this Jun 27, 2022

wawltor merged commit 08f0633 into PaddlePaddle:develop Jun 27, 2022

lugimzzz mentioned this pull request Jun 27, 2022

PaddleNLP 2.3.4 Release Note Candidate #2632

Closed

lugimzzz deleted the hierarchical branch July 4, 2022 09:37

		## 层次分类任务介绍

		多标签层次分类任务指自然语言处理任务中，每个样本具有多个标签标记，并且标签集合中存在预定义的树状结构或有向无环图结构，多标签层次分类需要充分考虑标签集之间的层次结构关系来预测层次化预测结果。在现实场景中，大量的数据如新闻分类、专利分类、学术论文分类等标签集合存在预定义的层次化结构，需要利用算法为文本自动标注更细粒度和更准确的标签。本项目中采用通用多标签层次分类算法将每个结点的标签路径视为一个多类标签，使用单个分类器进行决策。

		## 模型微调

		我们以层次分类公开数据集WOS(Web of Science)为示例，在训练集上进行模型训练，并在开发集上验证，开发集中选出的最优的模型在测试集上进行评估。WOS数据集是一个层次文本分类数据集，包含7个父类和134子类，每个样本对应一个父类标签和子类标签，父类标签和子类标签间具有层次结构关系，WOS数据集已内置到PaddleNLP中。


		程序运行时将会自动进行训练，评估，测试。同时训练过程中会自动保存开发集上最佳 Macro F1 值和最佳 Micro F1 值的模型在指定的 `save_dir` 中，保存模型文件结构如下所示：

		```text


		1. 数据集：WOS（英文层次分类数据集）

		2. 计算卡：V100、CUDA11.2

	Micro F1	Macro F1	latency(ms)
ERNIE 2.0	85.71	80.82	8.80
ERNIE 2.0+裁剪(保留比例3/4)	86.83(+1.12)	81.78(+0.96)	6.85
ERNIE 2.0+裁剪(保留比例2/3)	86.74(+1.03)	81.64(+0.82)	5.98
ERNIE 2.0+裁剪(保留比例1/4)	85.79(+0.08)	79.53(-1.29)	2.51

		from paddlenlp.utils.log import logger


		# 构建验证集evaluate函数

		@@ -0,0 +1,58 @@
		# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.

		## 环境准备

		模型转换与ONNXRuntime预测部署依赖Paddle2ONNX和ONNXRuntime，Paddle2ONNX支持将Paddle模型转化为ONNX模型格式，算子目前稳定支持导出ONNX Opset 7~15，更多细节可参考：[Paddle2ONNX](https://github.com/PaddlePaddle/Paddle2ONNX)


		2. 计算卡：V100、CUDA11.2

		3. CPU 信息：Intel(R) Xeon(R) Gold 6271C CPU

add hierarchical text classification #2501

add hierarchical text classification #2501

Conversation

lugimzzz commented Jun 13, 2022 • edited Loading

PR types

PR changes

Description

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lugimzzz Jun 20, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lugimzzz Jun 20, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lugimzzz commented Jun 13, 2022 •

edited

Loading

lugimzzz Jun 20, 2022 •

edited

Loading

lugimzzz Jun 20, 2022 •

edited

Loading