[Bug]: Special tokens of the GPTChinese tokenizer and model do not match #6005

@xiaotingyun

Software environment

- paddlepaddle: 2.4.0
- paddlepaddle-gpu: 2.4.0
- paddlenlp: 2.5.2

Duplicate issues

  • I have searched the existing issues

Error description

The special tokens of the GPTChinese tokenizer and model do not match: tokenizer.bos_token_id falls outside the vocabulary range.

Steps to reproduce & code

import paddle
import paddle.nn as nn
import paddlenlp
from paddlenlp.transformers import GPTChineseTokenizer, GPTLMHeadModel
import time
from tqdm import tqdm

# Load the tokenizer and the model from the same pretrained checkpoint.
tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-small-cn-distill')
model = GPTLMHeadModel.from_pretrained('gpt-cpm-small-cn-distill')

# Special-token ids reported by the tokenizer vs. those reported by the model.
print(tokenizer.bos_token_id, tokenizer.eos_token_id)
print(model.bos_token_id, model.eos_token_id)
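
To make the two symptoms above easier to check mechanically, here is a minimal sketch in plain Python. The helper names `out_of_range_ids` and `mismatched_ids` are hypothetical (they are not part of paddlenlp), and the numbers in the example are placeholders chosen for illustration, not the actual output of the script above.

```python
from typing import Dict, Optional, Tuple


def out_of_range_ids(
    special_ids: Dict[str, Optional[int]], vocab_size: int
) -> Dict[str, int]:
    """Return special-token ids that are not valid indices for a vocabulary of vocab_size entries."""
    return {
        name: token_id
        for name, token_id in special_ids.items()
        if token_id is not None and not (0 <= token_id < vocab_size)
    }


def mismatched_ids(
    tokenizer_ids: Dict[str, Optional[int]], model_ids: Dict[str, Optional[int]]
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return {name: (tokenizer_id, model_id)} for every name whose two ids differ."""
    names = sorted(set(tokenizer_ids) | set(model_ids))
    return {
        name: (tokenizer_ids.get(name), model_ids.get(name))
        for name in names
        if tokenizer_ids.get(name) != model_ids.get(name)
    }


# Placeholder values for illustration only; NOT the real ids from this issue.
example_vocab_size = 100
example_tokenizer_ids = {"bos": 100, "eos": 7}
example_model_ids = {"bos": 0, "eos": 7}

print(out_of_range_ids(example_tokenizer_ids, example_vocab_size))  # {'bos': 100}
print(mismatched_ids(example_tokenizer_ids, example_model_ids))     # {'bos': (100, 0)}
```

With the reproduction script, the dictionaries could be built from `tokenizer.bos_token_id`, `tokenizer.eos_token_id`, `model.bos_token_id` and `model.eos_token_id`. Where the vocabulary size comes from (for example a `vocab_size` attribute on the tokenizer or the model's embedding shape) is an assumption to verify against the installed paddlenlp version.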

Labels: bug, triage