Why must the vocab contain both [UNK] and <unk>? #60

Closed
guijuzhejiang opened this issue Jul 3, 2021 · 6 comments
Labels
enhancement New feature or request

Comments

guijuzhejiang commented Jul 3, 2021

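Judging from the rules in the code, the vocab must contain both [UNK] and <unk>, otherwise an error is raised. Both tokens represent unknown words, right? Is there any difference between them?
Also, I noticed that in the example English vocab some tokens share the same id, as shown below. I don't understand why. Won't duplicate ids overwrite each other? When building my own vocab, do I also need to use duplicate ids?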
<unk> 0
<s> 1
</s> 2
[UNK] 0
[PAD] 0
[CLS] 1
[SEP] 2

Collaborator

sserdoubleh commented Jul 3, 2021

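The reason is that <unk> is the OOV representation used by the external tokenization tool itself. Knover supports multiple tokenizers and needs one unified [UNK] marker. For now, as a convenience, a mapping was added to the vocab; the proper fix would be to implement an unk_id property on the Tokenizer class.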
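The unk_id property idea can be illustrated with a minimal sketch. `SketchTokenizer` and its methods are hypothetical names chosen for illustration and are not Knover's actual API; the point is only that the tokenizer exposes its own OOV id, so the vocab no longer needs a separate [UNK] alias.

```python
class SketchTokenizer:
    """Hypothetical tokenizer wrapper (an assumption, not Knover's class)."""

    def __init__(self, vocab, unk_token="<unk>"):
        # vocab maps token string -> integer id.
        self.vocab = vocab
        # Each backend names its own OOV token, e.g. "<unk>" for sentencepiece.
        self.unk_token = unk_token

    @property
    def unk_id(self):
        # Callers ask for the id instead of looking up a fixed "[UNK]" entry.
        return self.vocab[self.unk_token]

    def convert_tokens_to_ids(self, tokens):
        # Tokens missing from the vocab fall back to the OOV id.
        return [self.vocab.get(t, self.unk_id) for t in tokens]


vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3}
tok = SketchTokenizer(vocab)
ids = tok.convert_tokens_to_ids(["<s>", "hello", "world", "</s>"])
print(ids)
```

With a property like this, code that needs the unknown-word id would read `tok.unk_id` regardless of which tokenizer backend is in use.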

Author

guijuzhejiang commented Jul 3, 2021

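@sserdoubleh Understood, thanks for the explanation.
Regarding the duplicate token ids: shouldn't the ids be unique? For example, isn't something like the following the correct form?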
<unk> 0
<s> 1
</s> 2
[UNK] 3
[PAD] 4
[CLS] 5
[SEP] 6

@sserdoubleh
Collaborator

[UNK] <-> <unk>
[CLS] <-> <s>
[SEP] <-> </s>
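These pairs are fully equivalent, so the two tokens in each pair must share the same id.
As for PAD: in code, padding conventionally uses 0, so it is set to 0. The conflict with UNK arises mainly because sentencepiece assigns id 0 to <unk> by default when building the model (there should be a way to change this, but the previously pretrained models would have to be changed as well, which is troublesome). So you need to watch out for this conflict when writing code.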
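To make the aliasing concrete, here is a small sketch that parses the vocab lines quoted in this thread. The whitespace-separated "token id" format and the variable names are assumptions for illustration, not Knover's loader. Token-to-id lookup works with shared ids; only the reverse direction becomes ambiguous.

```python
from collections import defaultdict

# Vocab entries as quoted in the issue: aliases deliberately share ids.
vocab_text = """\
<unk> 0
<s> 1
</s> 2
[UNK] 0
[PAD] 0
[CLS] 1
[SEP] 2
"""

token_to_id = {}
id_to_tokens = defaultdict(list)
for line in vocab_text.splitlines():
    token, idx = line.split()
    token_to_id[token] = int(idx)        # each token keeps its own key
    id_to_tokens[int(idx)].append(token)  # several tokens may share an id

# Aliases resolve to the same id, so nothing is overwritten in this direction.
print(token_to_id["[UNK]"] == token_to_id["<unk>"])
# The reverse mapping is one-to-many: id 0 stands for three tokens.
print(id_to_tokens[0])
```

Because the tokens are distinct dictionary keys, shared ids do not overwrite anything on lookup; the cost is that id 0 alone cannot tell you whether it means UNK or PAD.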

Author

guijuzhejiang commented Jul 3, 2021

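@sserdoubleh Thanks for such a detailed reply. My understanding is that this equivalence means [UNK], [CLS], and [SEP] overwrite <unk>, <s>, and </s>. I guess the latter are probably not used in PLATO training. The troublesome part is PAD and MASK. As you said, if PAD has id 0, it overwrites unk. So my idea is to train my own sentencepiece model and add PAD and MASK to control_symbols, so that the generated spm model and vocab contain PAD and MASK, with ids 3 and 4. Does this approach seem reasonable to you?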

Collaborator

sserdoubleh commented Jul 3, 2021

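Well, the equivalence does not mean that the former overwrites the latter. It exists only for convenience, to accommodate different Tokenizers. We plan to upgrade the Tokenizer later to improve usability, so stay tuned.

As for the UNK/PAD conflict: if you are training a new sentencepiece model, you can add PAD and MASK to control_symbols as you described (though it is best to keep the PAD id at 0, because the current code assumes the PAD id is 0), and then there is no conflict. With the existing code, the conflict should not cause problems in practice: after the Reader logic has run, CLS and SEP mark the sequence boundaries, so a 0 inside them is UNK and a 0 outside them is PAD. In general this works fine. The only drawback is that the code cannot obtain padding_mask in one step via token_ids==0.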
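The padding_mask caveat can be shown with a short sketch. `pad_batch` is a hypothetical helper, not Knover's Reader, and the mask convention (1 for real tokens, 0 for padding) is an assumption. It builds the mask from sequence lengths, which stays correct when UNK and PAD share id 0, and compares it with the naive token_ids==0 approach.

```python
PAD_ID = 0  # shared with UNK in the vocab discussed in this thread


def pad_batch(seqs, pad_id=PAD_ID):
    """Pad sequences to equal length and derive the mask from lengths."""
    max_len = max(len(s) for s in seqs)
    token_ids = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    # 1 marks a real token, 0 marks padding (assumed convention).
    padding_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return token_ids, padding_mask


# [CLS]=1 and [SEP]=2 bound each sequence; the 0 in the first one is UNK.
batch = [[1, 5, 0, 7, 2], [1, 6, 2]]
token_ids, padding_mask = pad_batch(batch)

# Naive mask from the ids alone: wrongly treats the UNK as padding.
naive_mask = [[int(t != PAD_ID) for t in row] for row in token_ids]

print(padding_mask)
print(naive_mask)
```

The length-based mask keeps the UNK position as a real token, while the naive mask drops it, which is the situation the comment above warns about.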

sserdoubleh added the enhancement label Jul 3, 2021
@guijuzhejiang
Author

@sserdoubleh Thanks for the reply, and thanks to Baidu for this round of promotion. I will keep following the upcoming upgrades.
