
"Unexpected end of data" when decoding partial Unicode characters with World tokenizer #102

Closed
cgisky1980 opened this issue Jun 15, 2023 · 6 comments · Fixed by #104
Closed
Labels
bug Something isn't working good first issue Good for newcomers

Comments

cgisky1980 commented Jun 15, 2023

Model: RWKV World 3B or 7B, Q8_0

Input:

Translate the following text into Korean: "Hello"

Output:

File "/www/wenda-pi/llms/rwkvcpp/rwkv_tokenizer.py", line 94, in decode
    return self.decodeBytes(tokens).decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

If I modify that line of the file to

return self.decodeBytes(tokens).decode('utf-8', 'ignore')

the output is

안하세요.

but the correct output should be

안녕하세요

The character 녕 is lost.

With the RWKV World FP16 model, the output is correct.
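The error is easy to reproduce in isolation. The sketch below (plain Python, independent of the tokenizer) shows why decoding a byte stream that is cut in the middle of a multi-byte UTF-8 character raises exactly this exception:

```python
# Each Hangul syllable is 3 bytes in UTF-8, so a token boundary that
# falls mid-character leaves a truncated byte sequence behind.
data = "안녕하세요".encode("utf-8")  # 15 bytes, 3 per syllable
partial = data[:4]                   # first syllable plus 1 byte of 녕

try:
    partial.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # unexpected end of data
```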

@saharNooby
Collaborator

> by model rwkv world fp16 is correct

Do you mean that the same code works correctly when using the FP16 model?

@saharNooby saharNooby changed the title Some character output errors "Unexpected end of data" when decoding partial Unicode characters with World tokenizer Jun 15, 2023
@cgisky1980
Author

cgisky1980 commented Jun 15, 2023

> by model rwkv world fp16 is correct

> Do you mean that the same code works correctly when using FP16 model?

Yes. I mean that models without quantization don't have this problem.

@saharNooby saharNooby added bug Something isn't working good first issue Good for newcomers labels Jun 16, 2023
@saharNooby
Collaborator

Then I think this issue actually points at 2 separate problems:

  1. the quantized model produces less correct text than the non-quantized model
  2. a UnicodeDecodeError is thrown

For 1, there is no real solution. Quantization reduces quality; this is expected, since information is cut from the model to make it smaller.

2 is an actual bug that can be fixed. I'll put it into my backlog, but anyone can take it.
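One way to fix problem 2 is to stop decoding eagerly per token and instead buffer a trailing partial sequence until its remaining bytes arrive. A minimal sketch using Python's incremental UTF-8 decoder (an assumption about the approach, not the actual patch from #104; `decodeBytes` output is assumed to be raw UTF-8 bytes):

```python
import codecs

# An incremental decoder holds an incomplete trailing multi-byte
# sequence internally instead of raising, and emits the character
# once the remaining bytes arrive in a later chunk.
decoder = codecs.getincrementaldecoder("utf-8")()

hello = "안녕".encode("utf-8")
chunks = [hello[:4], hello[4:]]  # split in the middle of 녕

out = ""
for chunk in chunks:
    out += decoder.decode(chunk)  # no exception on the partial chunk
print(out)  # 안녕
```

In a streaming generation loop, the same decoder instance would be kept alive across tokens, so no character is ever dropped at a token boundary.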

@konflictue

> Modify the above file
> return self.decodeBytes(tokens).decode('utf-8','ignore')
>
> lost character 녕

@cgisky1980 You can also try the 'replace' mode for errors here, which has a better chance of producing all characters properly when using the RWKV World models.
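For comparison, a quick sketch of the two error modes on a truncated sequence: 'ignore' drops the offending bytes silently, while 'replace' substitutes U+FFFD so the loss is at least visible:

```python
data = "안녕".encode("utf-8")[:4]        # truncated inside 녕

print(data.decode("utf-8", "ignore"))    # 안   (bytes silently dropped)
print(data.decode("utf-8", "replace"))   # 안�  (U+FFFD marks the loss)
```

Note that neither mode recovers 녕; that requires not splitting the byte stream mid-character in the first place.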

@cgisky1980
Author

cgisky1980 commented Jun 17, 2023

> Then I think this issue actually points at 2 separate problems:
>
>   1. quantized model produces less correct text than non-quantized model
>   2. UnicodeDecodeError is thrown
>
> For 1, there is no real solution. Quantization reduces quality, it is expected, since information is cut from the model to make it smaller.

Because this issue is 100% reproducible with the 3B and 7B models, I don't think it's a problem of accuracy loss.

P.S. https:github.com//issues/19 this is the good first issue. LOL

@cgisky1980
Author

> @cgisky1980 You can also try the 'replace' mode for errors here, which had better chances for producing all characters properly when using the world models of rwkv.

Yes, it works.

cgisky1980 added a commit to cgisky1980/wenda-pi that referenced this issue Jun 19, 2023
@saharNooby saharNooby linked a pull request Jun 21, 2023 that will close this issue