Fix XLM tokenizer #2551

Merged: 2 commits into PaddlePaddle:develop, Jun 27, 2022

Conversation

@JunnYu (Member) commented on Jun 17, 2022

PR types

Bug fixes

PR changes

Models

Description

Fix the XLM tokenizer: override the base class's tokenize method and add lang and bypass_tokenizer parameters to _tokenize (see the sketch after the comparison script below).

Note: HF's tokenizer does not add **kwargs at the line linked below, so the HF results for lang="zh" in the comparisons that follow are wrong — it still uses the default lang="en":
https://github.com/huggingface/transformers/blob/3c7e56fbb11f401de2528c1dcf0e282febc031cd/src/transformers/tokenization_utils.py#L547

from paddlenlp.transformers import XLMTokenizer as PPNLPXLMTokenizer
from transformers import XLMTokenizer as HFXLMTokenizer

hf_tokenizer = HFXLMTokenizer.from_pretrained("xlm-mlm-tlm-xnli15-1024")
ppnlp_tokenizer = PPNLPXLMTokenizer.from_pretrained("xlm-mlm-tlm-xnli15-1024")
text = "今天是个好日子Very good day。"
text_with_special_tokens = "今天是<special0>个好日子Very good day。"
langs = ["en", "zh"]

# (1) Compare _tokenize
for lang in langs:
    print(f"lang: {lang}")
    o1 = ppnlp_tokenizer._tokenize(text, lang=lang)
    o2 = hf_tokenizer._tokenize(text, lang=lang)
    print(o1)
    print(o2)
    print("=" * 50)
    # lang: en
    # ['今</w>', '天</w>', '是</w>', '个</w>', '好</w>', '日</w>', '子</w>', 'very</w>', 'good</w>', 'day</w>', '.</w>']
    # ['今</w>', '天</w>', '是</w>', '个</w>', '好</w>', '日</w>', '子</w>', 'very</w>', 'good</w>', 'day</w>', '.</w>']
    # ==================================================
    # lang: zh
    # ['今天</w>', '是</w>', '个</w>', '好', '日', '子</w>', 'very</w>', 'good</w>', 'day</w>', '.</w>']
    # ['今天</w>', '是</w>', '个</w>', '好', '日', '子</w>', 'very</w>', 'good</w>', 'day</w>', '.</w>']
    # ==================================================

# (2) Compare tokenize: the HF tokenizer does not recognize the lang argument, so it never reaches _tokenize and the default lang="en" is used.
for lang in langs:
    print(f"lang: {lang}")
    o1 = ppnlp_tokenizer.tokenize(text_with_special_tokens, lang=lang)
    o2 = hf_tokenizer.tokenize(text_with_special_tokens, lang=lang)
    print(o1)
    print(o2)
    print("=" * 50)
    # Keyword arguments {'lang': 'en'} not recognized.
    # Keyword arguments {'lang': 'zh'} not recognized.
    # lang: en
    # ['今</w>', '天</w>', '是</w>', '<special0>', '个</w>', '好</w>', '日</w>', '子</w>', 'very</w>', 'good</w>', 'day</w>', '.</w>']
    # ['今</w>', '天</w>', '是</w>', '<special0>', '个</w>', '好</w>', '日</w>', '子</w>', 'very</w>', 'good</w>', 'day</w>', '.</w>']
    # ==================================================
    # lang: zh
    # ['今天</w>', '是</w>', '<special0>', '个</w>', '好', '日', '子</w>', 'very</w>', 'good</w>', 'day</w>', '.</w>']
    # ['今</w>', '天</w>', '是</w>', '<special0>', '个</w>', '好</w>', '日</w>', '子</w>', 'very</w>', 'good</w>', 'day</w>', '.</w>']
    # ==================================================

# (3) Compare __call__
for lang in langs:
    print(f"lang: {lang}")
    o1 = ppnlp_tokenizer(
        text_with_special_tokens, lang=lang, return_attention_mask=False
    )
    o2 = hf_tokenizer(text_with_special_tokens, lang=lang, return_attention_mask=False)
    print(o1)
    print(o2)
    print("=" * 50)
    # Keyword arguments {'lang': 'en'} not recognized.
    # Keyword arguments {'lang': 'zh'} not recognized.
    # lang: en
    # {'input_ids': [0, 5363, 3555, 135, 4, 2048, 6431, 3079, 1622, 34868, 22829, 4635, 15, 1], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
    # {'input_ids': [0, 5363, 3555, 135, 4, 2048, 6431, 3079, 1622, 34868, 22829, 4635, 15, 1], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
    # ==================================================
    # lang: zh
    # {'input_ids': [0, 39896, 135, 4, 2048, 14548, 3427, 1622, 34868, 22829, 4635, 15, 1], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
    # {'input_ids': [0, 5363, 3555, 135, 4, 2048, 6431, 3079, 1622, 34868, 22829, 4635, 15, 1], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
    # ==================================================
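
The core of the fix can be sketched with a toy example. The two classes below are hypothetical stand-ins, not the PaddleNLP or HF source: the base tokenize() gives callers no way to pass extra keyword arguments down, so lang never reaches _tokenize; the override forwards them.

class BaseTokenizer:
    def tokenize(self, text):
        # No **kwargs: callers cannot pass lang through this method.
        return self._tokenize(text)

    def _tokenize(self, text, lang="en", bypass_tokenizer=False):
        # bypass_tokenizer=True treats `text` as an already-split token list.
        tokens = text if bypass_tokenizer else text.split()
        return [f"{tok}/{lang}" for tok in tokens]

class FixedTokenizer(BaseTokenizer):
    def tokenize(self, text, **kwargs):
        # Override: forward lang / bypass_tokenizer down to _tokenize.
        return self._tokenize(text, **kwargs)

print(BaseTokenizer().tokenize("good day"))              # ['good/en', 'day/en']
print(FixedTokenizer().tokenize("good day", lang="zh"))  # ['good/zh', 'day/zh']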

@gongel self-requested a review on June 17, 2022 03:58
@gongel (Member) left a comment:

Fixed: #2080

@gongel merged commit 7e1a433 into PaddlePaddle:develop on Jun 27, 2022