
[Bug Fixes] update chatglm1 tokenizer #7870

Merged

Conversation

@wj-Mcat (Contributor) commented Jan 19, 2024

PR types

Bug fixes

PR changes

Models

Description

Fix the issue where the ChatGLM1 tokenizer cannot tokenize [gMASK] as a single special token.


paddle-bot bot commented Jan 19, 2024

Thanks for your contribution!

@wj-Mcat (Contributor, Author) commented Jan 19, 2024

The chatglm1 tokenizer output is:

[2024-01-19 08:12:28,699] [    INFO] - Found /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/tokenizer_config.json
[2024-01-19 08:12:28,699] [    INFO] - We are using <class 'paddlenlp.transformers.chatglm.tokenizer.ChatGLMTokenizer'> to load 'THUDM/chatglm-6b-v1.1'.
[2024-01-19 08:12:28,700] [    INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/ice_text.model
[2024-01-19 08:12:28,700] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/added_tokens.json and saved to /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/
[2024-01-19 08:12:31,287] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/added_tokens.json> not exist
[2024-01-19 08:12:31,287] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/special_tokens_map.json and saved to /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/
[2024-01-19 08:12:33,327] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/special_tokens_map.json> not exist
[2024-01-19 08:12:33,327] [    INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/tokenizer_config.json
[2024-01-19 08:12:33,328] [    INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/chat_template.json
[2024-01-19 08:12:33,587] [    INFO] - Assigning ['[gMASK]'] to the additional_special_tokens key of the tokenizer
============================================================
token -> [MASK]
tokens-> ['[MASK]']
ids    -> [130000]
============================================================
token -> [gMASK]
tokens-> ['[gMASK]']
ids    -> [130001]
============================================================
token -> <sop>
tokens-> ['<sop>']
ids    -> [130004]
============================================================
token -> <eop>
tokens-> ['<eop>']
ids    -> [130005]
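
A minimal sketch of the kind of script that could produce the output above (the loop body and variable names are assumptions, not the exact script used in this PR):

```python
from paddlenlp.transformers import ChatGLMTokenizer

# Load the ChatGLM v1 tokenizer from the community model hub.
tokenizer = ChatGLMTokenizer.from_pretrained("THUDM/chatglm-6b-v1.1")

# Special tokens that the fixed tokenizer should keep whole and map to single ids.
for token in ["[MASK]", "[gMASK]", "<sop>", "<eop>"]:
    tokens = tokenizer.tokenize(token)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print("=" * 60)
    print(f"token -> {token}")
    print(f"tokens-> {tokens}")
    print(f"ids    -> {ids}")
```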


codecov bot commented Jan 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (c1ccafa) 56.67% compared to head (a4f8652) 56.68%.
Report is 7 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #7870   +/-   ##
========================================
  Coverage    56.67%   56.68%           
========================================
  Files          588      588           
  Lines        89243    89249    +6     
========================================
+ Hits         50580    50590   +10     
+ Misses       38663    38659    -4     

☔ View full report in Codecov by Sentry.

@wj-Mcat (Contributor, Author) commented Jan 22, 2024

The adjusted chat_template.json is:

{
    "conversation": ["[Round {{index}}]\n问:{{user}}\n答:{% if is_last %}[gMASK]<sop>{% endif %}", "{{bot}}\n<eop>"],
    "query": "[Round {{index}}]\n问:{{query}}\n答:[gMASK]<sop>"
}

chatglm1 only supports single-turn training.
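
For illustration, a sketch of how the "query" template renders for a single turn, and how the fixed tokenizer treats the rendered special tokens (the starting value of {{index}} and the use of tokenizer.tokenize here are assumptions for illustration, not part of the PR):

```python
from paddlenlp.transformers import ChatGLMTokenizer

tokenizer = ChatGLMTokenizer.from_pretrained("THUDM/chatglm-6b-v1.1")

# Manually render the "query" template for a single-turn prompt
# (assumes {{index}} starts at 0; the template engine fills it in normally).
query = "你好"
prompt = f"[Round {0}]\n问:{query}\n答:[gMASK]<sop>"

# With the fixed tokenizer, [gMASK] and <sop> stay as single special
# tokens instead of being split into sub-pieces.
print(tokenizer.tokenize(prompt)[-2:])  # expected: ['[gMASK]', '<sop>']
```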

@wj-Mcat wj-Mcat marked this pull request as ready for review January 22, 2024 12:04
@wj-Mcat wj-Mcat merged commit b50db1c into PaddlePaddle:develop Jan 23, 2024
8 of 10 checks passed
JunnYu pushed a commit that referenced this pull request Jan 24, 2024
* update chatglm1 tokenizer

* update additional_special_token

* add is_training tag

* fix linting