[Bug Report] UTF-8 Decode error #707
Comments
I can confirm that Java edition generates the byte sequence |
Using this website I am able to decode it correctly. |
Right, so after quite a bit of digging I have worked out that the encoding Java Edition uses is a modified version of UTF-8, which Python's UTF-8 decoder cannot parse once a string contains code points above U+FFFF. |
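To make the failure concrete, here is a minimal sketch of decoding Java-style modified UTF-8 in Python. It assumes the standard modified UTF-8 rules (supplementary characters stored as a surrogate pair of 3 bytes each, U+0000 stored as C0 80). `decode_modified_utf8` is a hypothetical helper name, not part of Amulet or Amulet NBT, and the byte string is derived from those rules for the "🏹" character in the bug report.

```python
def decode_modified_utf8(data: bytes) -> str:
    """Decode Java-style modified UTF-8 (hypothetical helper, not an Amulet API)."""
    # Modified UTF-8 stores U+0000 as the two bytes C0 80.
    data = data.replace(b"\xc0\x80", b"\x00")
    # "surrogatepass" lets the 3-byte encoded surrogate halves through as lone surrogates.
    with_surrogates = data.decode("utf-8", errors="surrogatepass")
    # A UTF-16 round trip joins each high/low surrogate pair into one code point.
    utf16 = with_surrogates.encode("utf-16", errors="surrogatepass")
    return utf16.decode("utf-16", errors="surrogatepass")


# "🏹" (U+1F3F9) as the surrogate pair D83C DFF9, each half encoded in 3 bytes.
java_bytes = b"\xed\xa0\xbc\xed\xbf\xb9"

try:
    java_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    # The strict decoder rejects encoded surrogates.
    strict_ok = False

decoded = decode_modified_utf8(java_bytes)
```

The UTF-16 round trip is only one way to merge the surrogate halves. It does nothing for genuinely malformed input, which needs separate handling.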
We might have to include something like this to be able to correctly decode the data. I will have to do some more thinking on what is the best way to handle the various encoding schemes |
I think it would make sense to have a customisable encoding and decoding scheme so that the calling code can handle the encoding as it wishes. |
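One possible shape for such a customisable scheme is sketched below. Everything here is an assumption for illustration: `StringCodec`, `strict_utf8`, `lossy_utf8` and `read_string_payload` are hypothetical names, not the API that Amulet NBT ended up with.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StringCodec:
    """Hypothetical pair of callables supplied by the calling code."""

    decode: Callable[[bytes], str]
    encode: Callable[[str], bytes]


# Strict UTF-8: raises UnicodeDecodeError on anything malformed.
strict_utf8 = StringCodec(
    decode=lambda b: b.decode("utf-8"),
    encode=lambda s: s.encode("utf-8"),
)

# Lossy UTF-8: substitutes U+FFFD for undecodable bytes.
lossy_utf8 = StringCodec(
    decode=lambda b: b.decode("utf-8", errors="replace"),
    encode=lambda s: s.encode("utf-8"),
)


def read_string_payload(payload: bytes, codec: StringCodec = strict_utf8) -> str:
    """Stand-in for a reader that defers string decoding to the caller's codec."""
    return codec.decode(payload)
```

With this layout the library never has to guess the encoding: a caller loading a Java world could pass a modified UTF-8 codec, while a caller that only needs readable text could pass the lossy one.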
rebuilds to the correct encoding if there are any issues
converts the incompatible 4-byte UTF-8 sequences to the 6-byte encoding that Minecraft Java uses, also known as CESU-8 |
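The 4-byte to 6-byte conversion can be sketched as follows. This is an illustration under stated assumptions, not the code referred to above: `encode_modified_utf8` is a hypothetical name, and it follows Java's modified UTF-8 rules. Strictly speaking CESU-8 and modified UTF-8 differ in one detail: modified UTF-8 also writes U+0000 as C0 80, which CESU-8 does not.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text using Java's modified UTF-8 rules (hypothetical helper)."""
    out = bytearray()
    for char in text:
        code = ord(char)
        if code == 0:
            # Modified UTF-8 never emits a raw zero byte.
            out += b"\xc0\x80"
        elif code >= 0x10000:
            # Split the code point into a UTF-16 surrogate pair, then
            # encode each half as its own 3-byte sequence (6 bytes total).
            code -= 0x10000
            for unit in (0xD800 | (code >> 10), 0xDC00 | (code & 0x3FF)):
                out += chr(unit).encode("utf-8", errors="surrogatepass")
        else:
            # Everything in the Basic Multilingual Plane matches standard UTF-8.
            out += char.encode("utf-8", errors="surrogatepass")
    return bytes(out)


standard = "🏹".encode("utf-8")        # 4 bytes in standard UTF-8
modified = encode_modified_utf8("🏹")  # 6 bytes in the Java form
```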
Here are the cases we have to deal with:
Modified UTF-8 (Java).
Malformed data. Bedrock has been known to save malformed UTF-8 byte sequences.
Arbitrary data. Bedrock has recently used the TAG_String class to store arbitrary bytes.
We should also design this with the assumption that other string encoding schemes could be found in the future. |
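For the malformed and arbitrary data cases, Python already ships one lossless option: the `surrogateescape` error handler, which maps each undecodable byte to a lone surrogate in the range U+DC80 to U+DCFF and reverses that on encode. The sketch below only illustrates that built-in behaviour. The helper names are hypothetical, and it is not a claim about what Amulet NBT does.

```python
def decode_lossless(data: bytes) -> str:
    """Decode UTF-8, smuggling undecodable bytes through as lone surrogates."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Reverse decode_lossless, restoring the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")


malformed = b"name\xff\xfe"  # 0xFF and 0xFE can never appear in valid UTF-8
text = decode_lossless(malformed)
round_tripped = encode_lossless(text)
```

The resulting `str` contains lone surrogates, so it round-trips byte for byte but cannot be printed or written with a strict encoder. It also treats Java's encoded surrogate pairs as three bad bytes each, so modified UTF-8 still needs its own handling.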
If we can somehow sort out an escaping system that gets automatically unescaped at encoding I would like to switch back to the string method. |
An option for decoding the error values is to escape them.
To re-encode you do this (there might be a better way).
This does have the issue that
I am tempted to say we should do something like this but pick a rarely used Unicode character as the escape character. |
Doing so would mean that the escape character we pick cannot be used in a normal string.
|
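The collision risk behind the escape character choice can be shown with Python's built-in `backslashreplace` handler. The code originally posted in this comment did not survive, so this is an assumed stand-in rather than a reconstruction of it.

```python
raw_invalid = b"caf\xe9"   # a lone 0xE9 byte, which is not valid UTF-8
raw_literal = b"caf\\xe9"  # valid UTF-8 text: backslash, "x", "e", "9"

# Both inputs decode to the same str, so the escape cannot be reversed safely.
text_from_invalid = raw_invalid.decode("utf-8", errors="backslashreplace")
text_from_literal = raw_literal.decode("utf-8", errors="backslashreplace")
collide = text_from_invalid == text_from_literal
```

Because two different byte strings produce the same text, an encoder cannot tell whether `\xe9` should become one byte or four characters. Picking a rarely used escape character shrinks that overlap but does not remove it, which is the trade-off described above.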
So the byte sequence |
Here is my UTF-8 escape implementation.
import codecs
import re

def _escape_replace(err):
    # Only decode errors are escaped; anything else is re-raised unchanged.
    if isinstance(err, UnicodeDecodeError):
        # Emit ␛xFF for the offending byte and resume at the next byte.
        return f"␛x{err.object[err.start]:02X}", err.start + 1
    raise err

codecs.register_error("escapereplace", _escape_replace)

def utf8_escape_decoder(b: bytes) -> str:
    """UTF-8 decoder that escapes error bytes to the form ␛xFF"""
    return b.decode(errors="escapereplace")

# ␛ (U+241B) is E2 90 9B in UTF-8. Match hexadecimal digits only after the "x".
EscapePattern = re.compile(b"\xe2\x90\x9bx([0-9a-fA-F]{2})")  # ␛xFF

def utf8_escape_encoder(s: str) -> bytes:
    """UTF-8 encoder that converts ␛x[0-9a-fA-F]{2} back to individual bytes"""
    return EscapePattern.sub(lambda m: bytes([int(m.group(1), 16)]), s.encode())
|
This should be fixed in 0.10.0b1 with the new NBT library |
Bug Report
Current Behaviour:
Unable to load chunks containing strings with special characters.
Expected behavior:
The chunk loads I guess.
Steps To Reproduce:
Create a world and place an item with special characters in its name:
/setblock ~ ~ ~ minecraft:chest{Items:[{Count:1b,Slot:0b,id:"minecraft:acacia_boat",tag:{display:{Name:'{"text":"🏹"}'}}}]}
Attempt to load the world with Amulet. Accessing that chunk raises an error instead of loading it.
Environment:
Additional context
Possibly related to Amulet NBT's #13
The issue is not present on Amulet 0.9.1. I haven't tested 0.9.2 or 0.9.3.
Attachments
Screenshots
Example of problematic chest
Worlds
Amulet_UTF8_Error.zip (A world containing a chest that causes the issue in chunk 0 1)