"Unexpected end of data" when decoding partial Unicode characters with World tokenizer #102
Comments
Do you mean that the same code works correctly when using the FP16 model? |
I mean, models without quantization don't have this problem |
Then I think this issue actually points at 2 separate problems: 1) quantization degrading output quality, and 2) decoding crashing when a token sequence ends in the middle of a multi-byte UTF-8 character. |
For 1, there is no real solution. Quantization reduces quality; that is expected, since information is cut from the model to make it smaller. 2 is an actual bug that can be fixed. I'll put it into my backlog, but anyone can take it. |
@cgisky1980 You can also try the 'replace' mode for errors here, which has a better chance of producing all characters properly when using the World models of RWKV. |
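For illustration, here is how the 'replace' error mode differs from 'ignore' on a byte stream cut off mid-character (a minimal sketch in plain Python, not the actual rwkv.cpp tokenizer code):

```python
# 안녕 encodes to 6 UTF-8 bytes; dropping the last byte leaves the
# three-byte sequence for 녕 (EB 85 95) incomplete.
data = "안녕".encode("utf-8")[:-1]

# 'ignore' silently discards the partial character.
print(data.decode("utf-8", "ignore"))   # 안

# 'replace' substitutes U+FFFD, so the truncation stays visible.
print(data.decode("utf-8", "replace"))  # 안�
```

With 'replace', a truncated character is at least visible as � instead of vanishing entirely, which makes this class of bug much easier to spot.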
Because this issue is 100% reproducible in the 3B and 7B models, I don't think it's a problem of accuracy loss. P.S. https:github.com//issues/19, this is the good first issue. LOL |
Yes, it works. |
Model: RWKV World 3B or 7B, quantized to Q8_0
Input:
Translate the following text into Korean: "Hello"
Output:
File "/www/wenda-pi/llms/rwkvcpp/rwkv_tokenizer.py", line 94, in decode
return self.decodeBytes(tokens).decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data
After modifying the above file to use
return self.decodeBytes(tokens).decode('utf-8', 'ignore')
the output is
안하세요.
but the correct output should be
안녕하세요
The character 녕 is lost.
With the FP16 RWKV World model, the output is correct.
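A more robust fix than 'ignore' is to buffer incomplete trailing bytes until the rest of the character arrives in a later chunk. A sketch using Python's standard-library incremental decoder (a hypothetical wrapper for illustration, not the actual rwkv_tokenizer.py code):

```python
import codecs

class StreamDecoder:
    """Decodes a UTF-8 byte stream chunk by chunk, holding back an
    incomplete multi-byte sequence instead of raising or dropping it."""

    def __init__(self):
        self._decoder = codecs.getincrementaldecoder("utf-8")()

    def feed(self, data: bytes) -> str:
        # Returns only fully decoded characters; partial trailing bytes
        # are buffered internally until the next feed() call completes them.
        return self._decoder.decode(data)

# Split the bytes of "안녕하세요" so that 녕 (EB 85 95) straddles
# the chunk boundary, as happens when tokens are decoded one at a time.
raw = "안녕하세요".encode("utf-8")
dec = StreamDecoder()
out = dec.feed(raw[:5]) + dec.feed(raw[5:])
print(out)  # 안녕하세요
```

Because the decoder carries its state across calls, no character is lost at chunk boundaries, which would address problem 2 without the quality trade-off of 'ignore'.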