
Upgrade Roberta tokenizer #1821

Merged
merged 13 commits into PaddlePaddle:develop from the roberta branch on Mar 28, 2022
Conversation

Contributor

@yingyibiao yingyibiao commented Mar 23, 2022

PR types

Performance optimization

PR changes

Models

Description

  1. Upgrade the Roberta tokenizer to support both "Bert"-style and "BPE"-style tokenizers (see the usage sketch below).
  2. Add an optional argument "output_hidden_states" to return the hidden states of every hidden layer.
  3. Move the community directory one level up in BOS.
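
Below is a minimal usage sketch of points 1 and 2, not code from this PR; the checkpoint names ("roberta-wwm-ext", "roberta-base") and the exact return layout under output_hidden_states=True are assumptions and may differ from what PaddleNLP actually ships.

import paddle
from paddlenlp.transformers import RobertaModel, RobertaTokenizer

# "Bert"-style checkpoint (WordPiece vocab.txt); name assumed.
zh_tok = RobertaTokenizer.from_pretrained("roberta-wwm-ext")
# "BPE"-style checkpoint (vocab.json + merges.txt); name assumed.
en_tok = RobertaTokenizer.from_pretrained("roberta-base")

model = RobertaModel.from_pretrained("roberta-wwm-ext")
input_ids = paddle.to_tensor([zh_tok("欢迎使用PaddleNLP")["input_ids"]])
# With output_hidden_states=True the per-layer hidden states are expected to be
# returned alongside sequence_output and pooled_output (exact packing assumed).
outputs = model(input_ids, output_hidden_states=True)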

@yingyibiao yingyibiao marked this pull request as ready for review March 23, 2022 08:18
Member

@ZeyuChen ZeyuChen left a comment

Please double-check whether the URL changes break downloads for any models.
Also, after the 2.3 release, will we be able to monitor download counts for community models versus our own models?

paddlenlp/utils/downloader.py
@yingyibiao
Contributor Author

"Please double-check whether the URL changes break downloads for any models. Also, after the 2.3 release, will we be able to monitor download counts for community models versus our own models?"

  1. The migration of all model URLs is complete.
  2. Model download counts are not being monitored yet.

attention_mask = attention_mask.unsqueeze(axis=[1, 2])
attention_mask = attention_mask.unsqueeze(
axis=[1, 2]).astype(paddle.get_default_dtype())
attention_mask = (1.0 - attention_mask) * -1e4
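
For context, the new lines above turn a 0/1 padding mask of shape [batch_size, seq_len] into an additive bias of shape [batch_size, 1, 1, seq_len]: 0 where attention is allowed and -1e4 where it is masked, so it can simply be added to the raw attention scores before softmax. A standalone sketch:

import paddle

# [batch_size, seq_len]; 1 marks a real token, 0 marks padding.
attention_mask = paddle.to_tensor([[1, 1, 1, 0, 0]])
# Broadcast over heads and query positions: [batch_size, 1, 1, seq_len].
attention_mask = paddle.unsqueeze(
    attention_mask, axis=[1, 2]).astype(paddle.get_default_dtype())
# 0.0 for real tokens, -1e4 for padding; padded positions ~vanish after softmax.
attention_mask = (1.0 - attention_mask) * -1e4
print(attention_mask)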
Collaborator

Is this consistent with the ndim == 2 behavior in the other models? Has this mask semantics already been unified?

Contributor Author

Not yet fully unified; this PR unifies bert and roberta.

@@ -360,17 +375,26 @@ def forward(self,
attention_mask = paddle.unsqueeze(
attention_mask, axis=[1, 2]).astype(paddle.get_default_dtype())
attention_mask = (1.0 - attention_mask) * -1e4
attention_mask.stop_gradient = True
Collaborator

Why was this removed? attention_mask really can have stop_gradient set, right?

Contributor Author

stop_gradient is already True for attention_mask by default, so this line was redundant.

Collaborator

Did you confirm this by printing attention_mask?

Contributor Author

attention_mask is a tensor that is unrelated to any parameters, so it has no gradient. (Also verified by printing: stop_gradient=True.)
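
A quick check of this point, assuming the mask is an ordinary (non-parameter) Paddle tensor:

import paddle

# Tensors that are not parameters default to stop_gradient=True in Paddle,
# so the removed `attention_mask.stop_gradient = True` line was a no-op.
mask = paddle.ones([1, 1, 1, 5])
print(mask.stop_gradient)  # True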

sequence_output = encoder_outputs
pooled_output = self.pooler(sequence_output)
return sequence_output, pooled_output
if output_hidden_states:
Collaborator

Also keep a plan in mind for unifying this consistently across all the models that follow, and look into whether this feature request can be made pluggable #1752 (comment)

Contributor Author

OK. roberta currently uses the same scheme as bert.
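
For illustration, a toy version of the per-layer collection pattern such an output_hidden_states flag implies; this is not the actual RoBERTa/BERT code, and details such as layer types and return packing differ:

import paddle
import paddle.nn as nn

class TinyEncoder(nn.Layer):
    """Toy encoder mimicking the output_hidden_states behavior."""

    def __init__(self, num_layers=2, hidden=8):
        super().__init__()
        self.layers = nn.LayerList(
            [nn.Linear(hidden, hidden) for _ in range(num_layers)])

    def forward(self, x, output_hidden_states=False):
        all_hidden_states = [x] if output_hidden_states else None
        for layer in self.layers:
            x = layer(x)
            if output_hidden_states:
                all_hidden_states.append(x)  # keep every layer's output
        return (x, all_hidden_states) if output_hidden_states else x

encoder = TinyEncoder()
out, hidden_states = encoder(paddle.randn([2, 4, 8]), output_hidden_states=True)
print(len(hidden_states))  # embedding input plus one entry per layer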

"roberta-base-ft-cluener2020-chn":
"https://bj.bcebos.com/paddlenlp/models/transformers/community/nosaydomore/uer_roberta_base_finetuned_cluener2020_chinese/vocab.txt",
"roberta-base-chn-extractive-qa":
"https://bj.bcebos.com/paddlenlp/models/transformers/community/nosaydomore/uer_roberta_base_chinese_extractive_qa/vocab.txt",
Collaborator

We need to be clear about whether this affects how existing models are used.

Contributor Author

It has no impact on existing usage.

pretrained_model_name_or_path, *model_args, **kwargs)
else:
return RobertaBPETokenizer.from_pretrained(
pretrained_model_name_or_path, *model_args, **kwargs)
Collaborator

If the goal is to solve the problem that a tokenizer like the Roberta tokenizer, which wraps multiple tokenizers requiring different resource files, has when loading community models via from_pretrained, can we extract a common solution that also works for other tokenizers with multiple different resource files, such as ALBERT? As it stands, the cost for a developer to contribute this kind of tokenizer is still considerably higher than for other tokenizers.

Contributor Author

As discussed last week, this scheme only adds a lightweight wrapper on top of the concrete tokenizer implementations, so the development cost is much lower than the original approach (e.g. ALBERT) and it is less error-prone.

Collaborator

The key point here is to figure out how to generalize the approach instead of only considering RobertaTokenizer, and to provide a scheme that is simple for developers. Ordinary model contributors cannot be expected to know about things like COMMUNITY_MODEL_PREFIX, so how do we let them solve this class of problems as well?

Collaborator

"the development cost is much lower than the original approach (e.g. ALBERT)"

That is mainly because the original implementation did not use __getattribute__ to forward requests to the wrapped tokenizer, which is why the original code looks longer.
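
A hedged sketch of the forwarding idea, not PaddleNLP code: the wrapper stores the concrete tokenizer and routes attribute access to it through __getattribute__, so the wrapper itself stays tiny. FakeBPETokenizer is a hypothetical stand-in used only to exercise the wrapper.

class ForwardingTokenizer:
    """Thin wrapper that delegates everything to the wrapped tokenizer."""

    def __init__(self, inner_tokenizer):
        # Store without going through our own attribute machinery.
        object.__setattr__(self, "_inner", inner_tokenizer)

    def __getattribute__(self, name):
        inner = object.__getattribute__(self, "_inner")
        try:
            return getattr(inner, name)  # forward to the real tokenizer
        except AttributeError:
            return object.__getattribute__(self, name)  # fall back to the wrapper

class FakeBPETokenizer:
    def tokenize(self, text):
        return text.split()

tok = ForwardingTokenizer(FakeBPETokenizer())
print(tok.tokenize("hello world"))  # ['hello', 'world']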

Contributor Author

OK, I will take another look at this later.

Collaborator

Please file issues for all currently known problems and add them to the development plan @yingyibiao

Contributor Author

OK

@@ -88,7 +88,7 @@ def __init__(self,
sentencepiece_model_file,
do_lower_case=False,
remove_space=True,
keep_accents=False,
keep_accents=True,
Collaborator

Why was this default value changed? If it changed, will any examples be affected?

Contributor Author

The corresponding default in HF is True; NLP colleagues reported this issue.
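
For readers unfamiliar with the flag, this is roughly what keep_accents controls in sentencepiece-based tokenizers (illustrative only, not the PaddleNLP implementation): with keep_accents=False, text is NFKD-normalized and combining accent marks are dropped before tokenization.

import unicodedata

def preprocess(text, keep_accents=True):
    # keep_accents=False strips combining accent marks after NFKD normalization.
    if not keep_accents:
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return text

print(preprocess("naïve café", keep_accents=True))   # naïve café
print(preprocess("naïve café", keep_accents=False))  # naive cafe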

@yingyibiao yingyibiao merged commit 3351ab0 into PaddlePaddle:develop Mar 28, 2022
@yingyibiao yingyibiao deleted the roberta branch March 28, 2022 10:55
ZeyuChen pushed a commit to ZeyuChen/PaddleNLP that referenced this pull request Apr 17, 2022
* update roberta

* update roberta tokenizer

* update roberta tokenizer

* update

* update