Update Taskflow word_segmentation and ner tasks #1666

Merged
merged 40 commits into from
Mar 15, 2022
@linjieccc linjieccc commented Feb 8, 2022

PR types

New features

PR changes

Others

Description

  • Add fast and accurate modes for word segmentation

    • Fast-mode word segmentation
    >>> from paddlenlp import Taskflow
    
    >>> seg = Taskflow("word_segmentation", mode="fast")
    >>> seg("第十四届全运会在西安举办")
    ['第十四届', '全运会', '在', '西安', '举办']
    • Accurate-mode word segmentation
    >>> from paddlenlp import Taskflow
    
    >>> seg = Taskflow("word_segmentation", mode="accurate")
    >>> seg("李伟拿出具有科学性、可操作性的《陕西省高校管理体制改革实施方案》")
    ['李伟', '拿出', '具有', '科学性', '、', '可操作性', '的', '《', '陕西省高校管理体制改革实施方案', '》']
  • Add fast mode for NER

    • Fast-mode NER
    >>> from paddlenlp import Taskflow
    
    >>> ner = Taskflow("ner", mode="fast")
    >>> ner("三亚是一个美丽的城市")
    [('三亚', 'LOC'), ('是', 'v'), ('一个', 'm'), ('美丽', 'a'), ('的', 'u'), ('城市', 'n')]
  • The NER task can return entity words only

    • Fast mode + entity words only
    >>> from paddlenlp import Taskflow
    
    >>> ner = Taskflow("ner", mode="fast", entity_only=True)
    >>> ner("三亚是一个美丽的城市")
    [('三亚', 'LOC')]
    • Accurate mode + entity/concept words only
    >>> from paddlenlp import Taskflow
    
    >>> ner = Taskflow("ner", mode="accurate", entity_only=True)
    >>> ner("《孤女》是2010年九州出版社出版的小说,作者是余兼羽")
    [('孤女', '作品类_实体'), ('2010年', '时间类'), ('九州出版社', '组织机构类'), ('出版', '场景事件'), ('小说', '作品类_概念'), ('作者', '人物类_概念'), ('余兼羽', '人物类_实体')]
  • Add AutoSplitter/AutoJoiner, which automatically split input text of unlimited length, for the following tasks:

    • Word Segmentation(LAC & WordTag)
    • Pos Tagging
    • Lexical Analysis
    • Knowledge Mining(WordTag)
    • NER (LAC & WordTag)
    • Text Correction
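
The split-then-join idea behind the last item can be sketched as follows. This is a minimal illustration, not PaddleNLP's actual AutoSplitter/AutoJoiner: the names `split_text`, `run_on_long_text`, and `BREAK_CHARS` are hypothetical, and the sketch assumes per-chunk results can be merged by simple concatenation.

```python
from typing import Callable, List

# Hypothetical set of characters after which a cut is considered safe.
BREAK_CHARS = "。!?;,、.!?;, "


def split_text(text: str, max_len: int) -> List[str]:
    """Cut text into chunks of at most max_len characters,
    preferring to cut right after a punctuation mark."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            # Walk backwards through the window looking for a safe cut point.
            for i in range(end, start, -1):
                if text[i - 1] in BREAK_CHARS:
                    end = i
                    break
            # If none is found, fall back to a hard cut at max_len.
        chunks.append(text[start:end])
        start = end
    return chunks


def run_on_long_text(text: str, max_len: int,
                     predict: Callable[[str], List]) -> List:
    """Run predict on each chunk and concatenate the per-chunk results."""
    results = []
    for chunk in split_text(text, max_len):
        results.extend(predict(chunk))
    return results
```

Here `predict` stands in for any per-chunk model call that returns a list, such as a segmenter limited to `max_len` characters. Cutting at punctuation reduces the chance of splitting a word or entity across two chunks, though a hard cut can still do so when a window contains no punctuation.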

@linjieccc linjieccc changed the title Support infinite length input for Taskflow Add WordTag for word_segmentation task & add AutoSplitter for some Taskflow task Mar 1, 2022
docs/model_zoo/taskflow.md (outdated, resolved)
paddlenlp/taskflow/knowledge_mining.py (outdated, resolved)
paddlenlp/taskflow/models/lexical_analysis_model.py (outdated, resolved)
paddlenlp/taskflow/task.py (outdated, resolved)
@linjieccc linjieccc changed the title Add WordTag for word_segmentation task & add AutoSplitter for some Taskflow task Update Taskflow word_segmentation and ner tasks Mar 10, 2022
@@ -21,12 +21,16 @@
import itertools

import numpy as np
import jieba
Collaborator

I suggest not importing jieba directly at module level. Use jieba only inside JiebaTask, and show a prompt if it is not installed.

Contributor Author

As discussed, jieba is a default dependency.
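
For context, the reviewer's suggestion (not adopted in this PR, since jieba remains a default dependency) amounts to a guarded, deferred import. A minimal sketch of that pattern, assuming a hypothetical helper named `require_module` that is not part of PaddleNLP:

```python
import importlib
from types import ModuleType
from typing import Optional


def require_module(name: str, pip_name: Optional[str] = None) -> ModuleType:
    """Import a module on first use and fail with an install hint if it is missing.

    Hypothetical helper for illustration only.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"'{name}' is required for this task but is not installed. "
            f"Try: pip install {pip_name or name}"
        ) from exc
```

A task class would then call `require_module("jieba")` inside its constructor instead of placing `import jieba` at the top of the file, so users who never run that task do not need the package.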

self.predictor.run()
pred_tags = self.output_handle[0].copy_to_cpu()
all_pred_tags.extend(pred_tags.tolist())
with dygraph_mode_guard():
Collaborator

Why is this running in dynamic graph mode? Shouldn't it be static graph mode?

Contributor Author

Fixed.

all_preds_can.extend([pred_id_can.tolist()])
pred_ids.extend([pred_id_can[:, 0].tolist()])

with dygraph_mode_guard():
Collaborator

Same as above.

Contributor Author

Fixed.

tags_ids = self.output_handle[0].copy_to_cpu()
results.extend(tags_ids.tolist())
lens.extend(seq_len.tolist())
with dygraph_mode_guard():
Collaborator

Same as above.

Contributor Author

Fixed.

Collaborator

@wawltor wawltor left a comment

LGTM

@linjieccc linjieccc merged commit 1e2ee01 into PaddlePaddle:develop Mar 15, 2022