
[Bug] ChatGLM2 tokenizer mismatch between eos_token_id and eos_token #105

@LZHgrla

Description

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
tokenizer.encode(tokenizer.eos_token)  # [64790, 64792, 2893, 30917, 30994]
tokenizer.eos_token_id  # 2

XTuner uses tokenizer.encode(tokenizer.eos_token) instead of tokenizer.eos_token_id when processing data. As a result, the appended tokens do not match the model's true EOS id, so the fine-tuned ChatGLM2 cannot stop generation.
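A minimal sketch of the fix (this is not XTuner's actual code; the helper name and stub values are hypothetical): append tokenizer.eos_token_id directly instead of re-encoding the eos_token string, since encoding the literal string goes through the normal tokenization path and yields prefix tokens plus ordinary sub-tokens rather than the special id 2.

```python
def append_eos(input_ids, tokenizer):
    """Terminate a tokenized sample with the model's true EOS id."""
    # Buggy approach: tokenizer.encode(tokenizer.eos_token) re-tokenizes the
    # eos string as plain text, producing e.g. [64790, 64792, 2893, 30917, 30994]
    # for ChatGLM2 instead of the special id 2.
    # Correct approach: use the id the model was trained to stop on.
    return input_ids + [tokenizer.eos_token_id]

# Hypothetical stand-in tokenizer to illustrate the mismatch without
# downloading THUDM/chatglm2-6b.
class StubTokenizer:
    eos_token = "</s>"
    eos_token_id = 2

    def encode(self, text):
        # Mimics ChatGLM2 re-encoding the eos string into ordinary tokens.
        return [64790, 64792, 2893, 30917, 30994]

tok = StubTokenizer()
print(append_eos([10, 11, 12], tok))  # [10, 11, 12, 2]
```

With this change the training samples end in id 2, which matches the id the model emits (and generation code checks for) at end of sequence.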

Labels: bug