
"Unexpected end of data" when decoding partial Unicode characters with World tokenizer #102

Closed
cgisky1980 opened this issue Jun 15, 2023 · 6 comments · Fixed by #104
Closed
Labels
bug Something isn't working good first issue Good for newcomers

Comments

cgisky1980 commented Jun 15, 2023

Model: RWKV World 3B or 7B, Q8_0

Input:

Translate the following text into Korean: "Hello"

Output:

File "/www/wenda-pi/llms/rwkvcpp/rwkv_tokenizer.py", line 94, in decode
    return self.decodeBytes(tokens).decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

If I modify that line of the file to

return self.decodeBytes(tokens).decode('utf-8', 'ignore')

the output is

안하세요.

but the correct output should be

안녕하세요

The character 녕 is lost.

With the RWKV World FP16 model, the output is correct.
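The error is easy to reproduce in isolation. The sketch below (plain Python, independent of the tokenizer) shows why decoding a byte stream that is cut in the middle of a multi-byte UTF-8 character raises exactly this exception:

```python
# Each Hangul syllable is 3 bytes in UTF-8, so a token boundary that
# falls mid-character leaves a truncated byte sequence behind.
data = "안녕하세요".encode("utf-8")  # 15 bytes, 3 per syllable
partial = data[:4]                   # first syllable plus 1 byte of 녕

try:
    partial.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # unexpected end of data
```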

@saharNooby
Collaborator

> by model rwkv world fp16 is correct

Do you mean that the same code works correctly when using the FP16 model?

@saharNooby saharNooby changed the title Some character output errors "Unexpected end of data" when decoding partial Unicode characters with World tokenizer Jun 15, 2023
@cgisky1980
Author

cgisky1980 commented Jun 15, 2023

> by model rwkv world fp16 is correct

> Do you mean that the same code works correctly when using FP16 model?

Yes. I mean that models without quantization don't have this problem.

@saharNooby saharNooby added bug Something isn't working good first issue Good for newcomers labels Jun 16, 2023
@saharNooby
Collaborator

Then I think this issue actually points at 2 separate problems:

  1. the quantized model produces less correct text than the non-quantized model
  2. a UnicodeDecodeError is thrown

For 1, there is no real solution. Quantization reduces quality; this is expected, since information is cut from the model to make it smaller.

2 is an actual bug that can be fixed. I'll put it into my backlog, but anyone can take it.
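One way to fix problem 2 is to stop decoding eagerly per token and instead buffer a trailing partial sequence until its remaining bytes arrive. A minimal sketch using Python's incremental UTF-8 decoder (an assumption about the approach, not the actual patch from #104; `decodeBytes` output is assumed to be raw UTF-8 bytes):

```python
import codecs

# An incremental decoder holds an incomplete trailing multi-byte
# sequence internally instead of raising, and emits the character
# once the remaining bytes arrive in a later chunk.
decoder = codecs.getincrementaldecoder("utf-8")()

hello = "안녕".encode("utf-8")
chunks = [hello[:4], hello[4:]]  # split in the middle of 녕

out = ""
for chunk in chunks:
    out += decoder.decode(chunk)  # no exception on the partial chunk
print(out)  # 안녕
```

In a streaming generation loop, the same decoder instance would be kept alive across tokens, so no character is ever dropped at a token boundary.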

@konflictue

> Modify the above file
> return self.decodeBytes(tokens).decode('utf-8','ignore')
>
> lost character 녕

@cgisky1980 You can also try the 'replace' mode for errors here, which has a better chance of producing all characters properly when using the RWKV World models.
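For comparison, a quick sketch of the two error modes on a truncated sequence: 'ignore' drops the offending bytes silently, while 'replace' substitutes U+FFFD so the loss is at least visible:

```python
data = "안녕".encode("utf-8")[:4]        # truncated inside 녕

print(data.decode("utf-8", "ignore"))    # 안   (bytes silently dropped)
print(data.decode("utf-8", "replace"))   # 안�  (U+FFFD marks the loss)
```

Note that neither mode recovers 녕; that requires not splitting the byte stream mid-character in the first place.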

@cgisky1980
Author

cgisky1980 commented Jun 17, 2023

> Then I think this issue actually points at 2 separate problems:
>
>   1. quantized model produces less correct text than non-quantized model
>   2. UnicodeDecodeError is thrown
>
> For 1, there is no real solution. Quantization reduces quality, it is expected, since information is cut from the model to make it smaller.

Because this issue is 100% reproducible with the 3B and 7B models, I don't think it's a problem of accuracy loss.

P.S. https:github.com//issues/19 this is the good first issue. LOL

@cgisky1980
Author

> @cgisky1980 You can also try the 'replace' mode for errors here, which had better chances for producing all characters properly when using the world models of rwkv.

Yes, it works.

cgisky1980 added a commit to cgisky1980/wenda-pi that referenced this issue Jun 19, 2023
@saharNooby saharNooby linked a pull request Jun 21, 2023 that will close this issue