
Error: IndexError: piece id is out of range. #1

Open
nuochenpku opened this issue Aug 18, 2023 · 8 comments

Comments

@nuochenpku

Hi, when I set the batch size to more than 1, this error occurs: piece id is out of range.
Could you help me fix it?

@Haskely
Owner

Haskely commented Aug 18, 2023

Could you provide the complete error output, please? A screenshot would also be acceptable.

@nuochenpku
Author

[screenshot of the error traceback]

@nuochenpku
Author

I am using transformers 4.32.0.

@Haskely
Owner

Haskely commented Aug 19, 2023

All I can tell from this is that a token_id output by the model is out of range, i.e. its value exceeds the vocabulary size of the tokenizer. There may be something wrong with the tokenizer, but it is unclear why this is happening.

Have you made any modifications to the script? If so, please provide the complete script. If not, that's really confusing...🤯
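As a quick diagnostic before decoding, you could check the generated ids against the vocabulary size yourself. A minimal sketch (the helper name is mine; in real code `vocab_size` would come from `len(tokenizer)`, and 32000 is just the usual LLaMA vocabulary size):

```python
def out_of_range_ids(token_ids, vocab_size):
    """Return the token ids the tokenizer cannot decode.

    SentencePiece raises "IndexError: piece id is out of range"
    exactly when an id falls outside [0, vocab_size).
    """
    return [t for t in token_ids if not (0 <= t < vocab_size)]

# Example with a LLaMA-style vocabulary of 32000 pieces:
print(out_of_range_ids([1, 29871, 32001], 32000))  # [32001]
```

If the list is non-empty, the problem is in generation (or the model's embedding size), not in the decode step itself.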

@nuochenpku
Author

I only modified the batch size. Or can you tell me the tokenizer version?

@Haskely
Owner

Haskely commented Aug 21, 2023

> I just modify the batchsize. or can you tell me the tokenizer version?

The tokenizer and model both come from model_path="OFA-Sys/gsm8k-rft-llama7b-u13b", i.e. https://huggingface.co/OFA-Sys/gsm8k-rft-llama7b-u13b/tree/main , which has only one version, so it can't be a version issue.

Additionally, I noticed you said the issue only arises when batch_size > 1, meaning it runs normally when batch_size = 1, right? If so, the problem is likely with the special tokens. Try setting the pad id manually with tokenizer.pad_token_id = 0 and see if it helps.
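The reason batch_size > 1 is special: batching forces the tokenizer to pad shorter prompts to a common length, so an unset or out-of-range pad_token_id injects invalid ids into the batch. A minimal pure-Python sketch of that padding step (the function name is mine; pad id 0 mirrors the suggested tokenizer.pad_token_id = 0 fix):

```python
def pad_batch(sequences, pad_token_id=0):
    """Left-pad token-id sequences to equal length, as a causal-LM
    tokenizer does when it encodes a batch of prompts."""
    max_len = max(len(s) for s in sequences)
    return [[pad_token_id] * (max_len - len(s)) + s for s in sequences]

# Two prompts of different lengths; the shorter one gets pad ids.
batch = pad_batch([[1, 306, 4966], [1, 3889]])
# Every padded id must stay inside the vocabulary, otherwise
# decoding fails with "piece id is out of range".
```

With batch_size = 1 no padding ever happens, which is why that case works even with a broken pad id.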

@nuochenpku
Author

Sorry for the late response. In fact, tokenizer.pad_token_id = 0 is already set in LlamaTokenizer, so the error still exists.

@nuochenpku
Author

Update: I found that this error happens when I use OFA-Sys/gsm8k-rft-llama7b2-u13b. There is no error with the OFA-Sys/gsm8k-rft-llama7b-u13b checkpoint.
