
[Bug Report] UTF-8 Decode error #707

Closed
5uso opened this issue Jun 10, 2022 · 13 comments
Labels: priority: high (this ticket needs resolving quickly), type: bug (Something isn't working)

Comments

@5uso

5uso commented Jun 10, 2022

Bug Report

Current Behaviour:

Unable to load chunks containing strings with special characters.

Expected behavior:

The chunk loads I guess.

Steps To Reproduce:

  1. Create a world and place an item with special characters in its name: /setblock ~ ~ ~ minecraft:chest{Items:[{Count:1b,Slot:0b,id:"minecraft:acacia_boat",tag:{display:{Name:'{"text":"🏹"}'}}}]}

  2. Attempt to load the world with amulet. When trying to access that chunk, an error will occur instead.

Environment:

  • OS: Windows
  • Minecraft Platform: Java
  • Minecraft Version: 1.19 (Also tested on 1.16.5)
  • Amulet Version: 0.9.4

Additional context

Possibly related to Amulet NBT's #13

The issue is not present on amulet 0.9.1. I haven't tested 0.9.2 or 0.9.3.

Attachments

Screenshots

[Screenshot: example of the problematic chest]

Worlds

Amulet_UTF8_Error.zip (A world containing a chest that causes the issue in chunk 0 1)

5uso added the state: triage (the severity of this ticket needs evaluating) and type: bug (Something isn't working) labels on Jun 10, 2022
@gentlegiantJGC
Member

I can confirm that Java edition generates the byte sequence \xed\xa0\xbc\xed\xbf\xb9 when you type in 🏹; however, I cannot find an encoding that decodes it correctly.
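
For reference, here is a quick check (my own snippet, not from Amulet's code) showing that Python's strict UTF-8 codec rejects that byte sequence because it encodes UTF-16 surrogate values:

data = b"\xed\xa0\xbc\xed\xbf\xb9"  # the bytes Java edition writes for 🏹
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xed ... (surrogate sequences are rejected)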

@gentlegiantJGC
Member

Using this website I am able to decode it correctly.
https://mothereff.in/utf-8

@gentlegiantJGC
Member

Right, so after quite a bit of digging I have worked out that the encoding Java edition uses is a modified version of UTF-8 (null is written as \xc0\x80 and characters outside the Basic Multilingual Plane as surrogate pairs), which Python's standard UTF-8 codec refuses to parse.

@gentlegiantJGC
Member

We might have to include something like this to be able to correctly decode the data:
https://pypi.org/project/mutf8/

I will have to do some more thinking about the best way to handle the various encoding schemes.
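
A rough sketch of what using it could look like (assuming the decode_modified_utf8 / encode_modified_utf8 helpers that the mutf8 package documents on PyPI):

from mutf8 import decode_modified_utf8, encode_modified_utf8

java_bytes = b"\xed\xa0\xbc\xed\xbf\xb9"  # the sequence from this issue
text = decode_modified_utf8(java_bytes)
print(text)  # 🏹
assert encode_modified_utf8(text) == java_bytes  # round-trips without loss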

gentlegiantJGC added the priority: high (this ticket needs resolving quickly) label and removed the state: triage (the severity of this ticket needs evaluating) label on Jun 11, 2022
@gentlegiantJGC
Member

I think it would make sense to have a customisable encoding and decoding scheme so that the calling code can handle the encoding as it wishes.
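
Purely as an illustration of the idea (these names are hypothetical, not Amulet NBT's actual API), the reader could take the decoder as a parameter and stay encoding-agnostic:

from typing import Callable

def read_tag_string(
    raw: bytes,
    string_decoder: Callable[[bytes], str] = lambda b: b.decode("utf-8"),
) -> str:
    # The reader does not assume an encoding; the caller supplies the decoder.
    return string_decoder(raw)

# A Java world loader could pass a Modified UTF-8 decoder, while a Bedrock
# loader could pass a decoder that escapes or passes through invalid bytes.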

@PREMIEREHELL

PREMIEREHELL commented Jun 11, 2022

@property
def py_str(self) -> str:
    # Decode allowing lone surrogates through, then round-trip via UTF-16 so
    # that surrogate pairs are combined back into the real code points.
    s_8_data = self.py_bytes.decode('utf_8', errors='surrogatepass')
    b_16_data = s_8_data.encode('utf_16', errors='surrogatepass')
    s_16_data = b_16_data.decode('utf_16')
    return s_16_data

This rebuilds the correct string if there are any issues with surrogate-encoded characters.
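
As a standalone check (assuming py_bytes holds the raw TAG_String payload), the same steps applied to the byte sequence from this issue:

raw = b"\xed\xa0\xbc\xed\xbf\xb9"
s = raw.decode("utf_8", errors="surrogatepass")                   # "\ud83c\udff9" (lone surrogates)
s = s.encode("utf_16", errors="surrogatepass").decode("utf_16")   # pairs them back up
print(s)  # 🏹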

def utf8s_to_utf8m(e_bytes):
    """Convert standard UTF-8 bytes to Java's Modified UTF-8."""
    new_str = []
    i = 0
    while i < len(e_bytes):
        byte1 = e_bytes[i]
        if (byte1 & 0x80) == 0:
            # 1-byte character; null is re-encoded as the overlong pair 0xC0 0x80
            if byte1 == 0:
                new_str.append(0xC0)
                new_str.append(0x80)
            else:
                new_str.append(byte1)
        elif (byte1 & 0xE0) == 0xC0:
            # 2-byte character, copied unchanged
            new_str.append(byte1)
            i += 1
            new_str.append(e_bytes[i])
        elif (byte1 & 0xF0) == 0xE0:
            # 3-byte character, copied unchanged
            new_str.append(byte1)
            i += 1
            new_str.append(e_bytes[i])
            i += 1
            new_str.append(e_bytes[i])
        elif (byte1 & 0xF8) == 0xF0:
            # 4-byte character: decode the code point and re-encode it as a
            # surrogate pair (2 x 3 bytes)
            i += 1
            byte2 = e_bytes[i]
            i += 1
            byte3 = e_bytes[i]
            i += 1
            byte4 = e_bytes[i]
            u21 = (byte1 & 0x07) << 18
            u21 += (byte2 & 0x3F) << 12
            u21 += (byte3 & 0x3F) << 6
            u21 += (byte4 & 0x3F)
            new_str.append(0xED)
            new_str.append(0xA0 + (((u21 >> 16) - 1) & 0x0F))
            new_str.append(0x80 + ((u21 >> 10) & 0x3F))
            new_str.append(0xED)
            new_str.append(0xB0 + ((u21 >> 6) & 0x0F))
            new_str.append(byte4)
        i += 1
    return bytes(new_str)

cdef void write_string(bytes value, object buffer, bint little_endian):
    # Java NBT is big-endian, so convert to Modified UTF-8 before writing.
    if not little_endian:
        value = utf8s_to_utf8m(value)
    cdef short length = <short> len(value)
    cdef char* s = value
    to_little_endian(&length, 2, little_endian)
    cwrite(buffer, <char*> &length, 2)
    cwrite(buffer, s, len(value))

This converts the incompatible 4-byte UTF-8 sequences into the 6-byte surrogate-pair encoding that Minecraft Java is using, aka CESU-8.
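
For example (my own quick check, not part of the snippet above):

assert utf8s_to_utf8m("🏹".encode("utf-8")) == b"\xed\xa0\xbc\xed\xbf\xb9"
assert utf8s_to_utf8m(b"\x00") == b"\xc0\x80"  # embedded null becomes the overlong pair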

@gentlegiantJGC
Member

gentlegiantJGC commented Jun 12, 2022

Here are the cases we have to deal with:
Standard utf-8 (bedrock)

\x00 null
\x41 a
\xc2\xb6 ¶
\xe3\x8b\xa0 ㋠
\xF0\x90\x8C\xB0 𐌰

Modified utf-8 (java)

\xc0\x80 null
\x41 a
\xc2\xb6 ¶
\xe3\x8b\xa0 ㋠
\xed\xa0\x80\xed\xbc\xb0 𐌰

Malformed data. Bedrock has been known to save malformed utf-8 byte sequences.
\xa7l incorrectly sliced from \xc2\xa7l

Arbitrary data. Bedrock has recently used the TAG_String class to store arbitrary bytes.
\x00\x00\x00\x03\x00\x00\x00\x06
\x06\x00\x00\x00\xFD\xFF\xFF\xFF

We should also design this with the assumption that other string encoding schemes could be found in the future.
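
As a quick illustration (my own check) of why the malformed and arbitrary cases need special handling, a strict UTF-8 decode rejects them outright:

for raw in (b"\xa7l", b"\x06\x00\x00\x00\xFD\xFF\xFF\xFF"):
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        print(raw, "->", e.reason)  # both fail with "invalid start byte"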

@gentlegiantJGC
Member

  1. Store everything as bytes

    1. Everything stored in the raw format can be saved back without data loss
    2. A bad encoding won't be noticed until the value is accessed
    3. This requires the accessing code to know the encoding
    4. Copying the tag to a platform with a different encoding will cause errors
  2. Store everything as string

    1. The decoding and encoding methods would have to specify the encoding scheme
    2. Invalid byte sequences would have to get escaped somehow so they get encoded back to the same byte sequence
    3. Tags can easily be copied between platforms because the encoding is done at the end.

If we can somehow sort out an escaping system that gets automatically unescaped at encoding I would like to switch back to the string method.

@gentlegiantJGC
Member

An option for decoding the error values is to escape them
b"\x06\x00\x00\x00\xFD\xFF\xFF\xFF".decode(errors="backslashreplace") -> "\x06\x00\x00\x00\\xfd\\xff\\xff\\xff"
Note that on the left \xFD is one byte but on the right \\xfd is four characters.

To reencode you do this (there might be a better way)
"\x06\x00\x00\x00\\xfd\\xff\\xff\\xff".encode("utf-8").decode("unicode_escape").encode("latin") -> b'\x06\x00\x00\x00\xfd\xff\xff\xff'

This does have the issue that b"\\xfd" would be decoded correctly but upon reencoding would be converted to one byte.
b"\\xfd \xfd" -> "\\xfd \\xfd" -> b"\xfd \xfd"

I am tempted to say we should do something like this but pick a rarely used unicode character as the escape character.

@gentlegiantJGC
Member

Doing so would mean that the escape character we pick cannot be used in a normal string.
Escape suggestions:

  1. Stick with \xFF. This is fairly standard, but it may conflict if someone wanted to use \x in a normal string, which I think is a non-negligible risk.
  2. ␛FF. I did a search of Unicode and found this symbol (U+241B SYMBOL FOR ESCAPE), which seems to be designed as a visual escape symbol. Again, there is a non-negligible chance this could appear in a normal string.
  3. Combine the two: ␛xFF. I think it is very unlikely that ␛x would appear in that combination in normal text.

@gentlegiantJGC
Member

So the byte sequence b"\x06\x00\x00\x00\xFD\xFF\xFF\xFF", when decoded using UTF-8 with our special error function, would create the string "\x06\x00\x00\x00␛xFD␛xFF␛xFF␛xFF", where the first four values are single characters and each of the last four is made up of four characters.

@gentlegiantJGC
Member

gentlegiantJGC commented Jun 12, 2022

Here is my UTF-8 escape implementation.
There may be some issues with the encoder if the byte sequence being matched happens to also match other characters, but I think the chance is low.
Edit: I just tried putting every possible byte before this sequence and parsing it, and only the first 128 were valid (1 byte per character, with the escape character being handled correctly).

import codecs
import re


def _escape_replace(err):
    # Error handler: replace each undecodable byte with the escape form ␛xFF
    if isinstance(err, UnicodeDecodeError):
        return f"␛x{err.object[err.start]:02X}", err.start + 1
    raise err


codecs.register_error("escapereplace", _escape_replace)


def utf8_escape_decoder(b: bytes) -> str:
    """UTF-8 decoder that escapes error bytes to the form ␛xFF"""
    return b.decode(errors="escapereplace")


EscapePattern = re.compile(b"\xe2\x90\x9bx([0-9a-fA-F]{2})")  # ␛xFF

def utf8_escape_encoder(s: str) -> bytes:
    """UTF-8 encoder that converts ␛x[0-9a-fA-F]{2} back to individual bytes"""
    return EscapePattern.sub(lambda m: bytes([int(m.groups()[0], 16)]), s.encode())
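
Running the snippet on the arbitrary-data example above round-trips correctly (my own check):

raw = b"\x06\x00\x00\x00\xFD\xFF\xFF\xFF"
escaped = utf8_escape_decoder(raw)
print(escaped)                               # '\x06\x00\x00\x00␛xFD␛xFF␛xFF␛xFF'
assert utf8_escape_encoder(escaped) == raw   # escapes encode back to the original bytes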

@gentlegiantJGC
Member

This should be fixed in 0.10.0b1 with the new NBT library
