Add FasterTokenizer model in experiment #1220

Merged
merged 17 commits into from
Nov 1, 2021

Steffy-zxf
Contributor

PR types

New features

PR changes

APIs

Description

  1. Add the FasterTokenizer model in experiment
  2. Add a demo of FasterTokenizer usage in experiment

@ZeyuChen ZeyuChen self-assigned this Oct 26, 2021
@ZeyuChen ZeyuChen added this to In progress in PaddleNLP v2.2 via automation Oct 26, 2021
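Member

@ZeyuChen ZeyuChen left a comment

Rename to_string_tensor to to_tensor and move it into experimental.
Dynamic-to-static conversion should be built into the FasterModelForXXXX classes, so that the upper-level dynamic-to-static interface does not expose STRINGS objects.
When exporting through dynamic-to-static conversion, export both probs and the argmax result; this makes the inference results more convenient to use.
Do not make users write their own softmax operator implementation.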
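
The last two points can be illustrated with a minimal sketch in plain Python (not Paddle): a post-processing step that returns both the softmax probabilities and the argmax prediction, so callers never have to implement softmax themselves. The function name `postprocess` and the example logits are hypothetical and are not part of this PR.

```python
import math
from typing import List, Tuple


def postprocess(logits: List[List[float]]) -> Tuple[List[List[float]], List[int]]:
    """Return (probs, predictions) for a batch of logits (hypothetical helper)."""
    probs: List[List[float]] = []
    predictions: List[int] = []
    for row in logits:
        # Subtract the row maximum before exponentiating for numerical stability.
        row_max = max(row)
        exps = [math.exp(x - row_max) for x in row]
        total = sum(exps)
        row_probs = [e / total for e in exps]
        probs.append(row_probs)
        # argmax: index of the largest probability in the row.
        predictions.append(max(range(len(row_probs)), key=row_probs.__getitem__))
    return probs, predictions


if __name__ == "__main__":
    batch_probs, batch_preds = postprocess([[2.0, 0.5], [0.1, 3.0]])
    print(batch_probs, batch_preds)
```

Returning both values from one call means a caller can read the predicted label directly and still inspect the confidence when needed.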

rm -rf *

# same with the demo.cc
DEMO_NAME=demo
Member

Don't call it DEMO; this is not a demo.

Contributor Author

Renamed to text_cls_infer.

@@ -0,0 +1,64 @@

Member

Remove these inexplicable blank lines.

Member

Don't name the file demo; rename it to infer.cc.

Member

Or ernie_infer. Later on we will probably also need to distinguish between sentence classification and sequence labeling tasks.

Contributor Author

Done, renamed to text_cls_infer.

Member

seq_cls_infer/token_cls_infer would probably be more consistent with the class names.

"办理入住手续,节省时间。"};

std::vector<float> probs;
Run(predictor.get(), &data, &probs);
Member

The printed results need to be shown.

Member

The output should be a const reference; keep data as the input.

Member

If this demo is only for classification, state that clearly and keep it separate from sequence labeling.

Member

seq_cls_infer/token_cls_infer would probably be more consistent with the class names.

Contributor Author

Done, renamed to seq_cls_infer.

}

void Run(Predictor* predictor,
std::vector<std::string>* input_data,
Member

The input should be passed by const reference; only the output should be a pointer.

Contributor Author

Changed to:

void Run(Predictor* predictor,
         const std::vector<std::string>& input_data,
         std::vector<float>* logits,
         std::vector<int64_t>* predictions) 


import paddle
import paddlenlp
from paddlenlp.experimental import FastSequenceClassificationModel
Member

Use Faster; we use Faster consistently as the technical codename across the whole project.

Contributor Author

Done

return logits


class FastSequenceClassificationModel(object):
Member

FasterModelForSequenceClassification

Contributor Author

Done

import paddle.fluid.core as core

__all__ = ['to_string_tensor', 'to_vocab_tensor']
Member

Move the whole thing into paddlenlp.experimental.

Contributor Author

Done

raise ValueError("Unknown name %s. Now %s supports %s" %
(pretrained_model_name_or_path, cls.__name__,
list(name_model.keys())))

Member

Add a to_static interface on top of this class so that the STRINGS type is not exposed externally.

Contributor Author

Done

else:
raise ValueError("Unknown name %s. Now %s supports %s" %
(pretrained_model_name_or_path, cls.__name__,
list(name_model.keys())))
Member

Add a to_static interface on top of this class so that the STRINGS type is not exposed externally.

Contributor Author

Done

@@ -12,10 +12,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import paddle
Member

Move this into paddlenlp/experimental/.

Contributor Author

Done

Steffy-zxf added 6 commits October 27, 2021 17:00
2. rename to_vocab_tensor to to_vocab_buff
3. add to_static() api to FasterModel
2. support from_pretrained() given a local directory
set(CUDA_LIB "/usr/local/cuda/lib64/" CACHE STRING "CUDA Library")
else()
if(CUDA_LIB STREQUAL "")
set(CUDA_LIB "C:\\Program\ Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\lib\\x64")
Member

The hardcoded path in this command may not be correct; it needs to be tested and verified on Windows later.

losses.append(loss.numpy())
correct = metric.compute(logits, labels)
metric.update(correct)
accu = metric.accumulate()
Member

Should accumulate here be called outside the loop or inside it?
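
To make the question concrete, here is a minimal sketch, assuming the metric keeps running totals across `update()` calls, so that `accumulate()` reports accuracy over everything seen so far. `RunningAccuracy` and `evaluate` are hypothetical stand-ins written for illustration; they are not the metric class or the evaluation function used in this PR.

```python
from typing import List


class RunningAccuracy:
    """Hypothetical stand-in for a metric exposing update() and accumulate()."""

    def __init__(self) -> None:
        self.correct = 0
        self.total = 0

    def update(self, correct: List[int]) -> None:
        # `correct` holds 1 for a right prediction and 0 for a wrong one.
        self.correct += sum(correct)
        self.total += len(correct)

    def accumulate(self) -> float:
        # Accuracy over all batches seen so far, not just the latest one.
        return self.correct / self.total if self.total else 0.0


def evaluate(batches: List[List[int]], inside_loop: bool) -> float:
    """Return the final accuracy, calling accumulate() inside or outside the loop."""
    metric = RunningAccuracy()
    accu = 0.0
    for correct in batches:
        metric.update(correct)
        if inside_loop:
            accu = metric.accumulate()  # recomputed after every batch
    if not inside_loop:
        accu = metric.accumulate()  # computed once, after the last batch
    return accu


if __name__ == "__main__":
    data = [[1, 0, 1], [1, 1], [0]]
    print(evaluate(data, inside_loop=True), evaluate(data, inside_loop=False))
```

Under that assumption both placements yield the same final value; calling it once after the loop only avoids redundant work, while calling it inside the loop is useful when a per-batch running value is logged.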



def create_dataloader(dataset, mode='train', batch_size=1):

Member

Remove the unnecessary blank lines.

text = '小说是文学的一种样式,一般描写人物故事,塑造多种多样的人物形象,但亦有例外。它是拥有不完整布局、发展及主题的文学作品。而对话是不是具有鲜明的个性,每个人物说的没有独特的语言风格,是衡量小说水准的一个重要标准。与其他文学样式相比,小说的容量较大,它可以细致的展现人物性格和命运,可以表现错综复杂的矛盾冲突,同时还可以描述人物所处的社会生活环境。小说一词,最早见于《庄子·外物》:“饰小说以干县令,其于大达亦远矣。”这里所说的小说,是指琐碎的言谈、小的道理,与现时所说的小说相差甚远。文学中,小说通常指长篇小说、中篇、短篇小说和诗的形式。小说是文学的一种样式,一般描写人物故事,塑造多种多样的人物形象,但亦有例外。它是拥有不完整布局、发展及主题的文学作品。而对话是不是具有鲜明的个性,每个人物说的没有独特的语言风格,是衡量小说水准的一个重要标准。与其他文学样式相比,小说的容量较大,它可以细致的展现人物性格和命运,可以表现错综复杂的矛盾冲突,同时还可以描述人物所处的社会生活环境。小说一词,最早见于《庄子·外物》:“饰小说以干县令,其于大达亦远矣。”这里所说的小说,是指琐碎的言谈、小的道理,与现时所说的小说相差甚远。文学中'
data = [text[:max_seq_length]] * 100

pp_tokenizer = FasterTokenizer(vocab, do_lower_case=False)
Member

Should this interface be aligned with the API experience of XXXTokenizer.from_pretrained? And do we need

PaddleNLP v2.2 automation moved this from In progress to Reviewer approved Nov 1, 2021
Member

@ZeyuChen ZeyuChen left a comment

LGTM

@ZeyuChen ZeyuChen merged commit 9e198d0 into PaddlePaddle:develop Nov 1, 2021
PaddleNLP v2.2 automation moved this from Reviewer approved to Done Nov 1, 2021
2 participants