
[Bug Report] UTF-8 Decode error #707

Closed
5uso opened this issue Jun 10, 2022 · 13 comments
Labels: priority: high (this ticket needs resolving quickly), type: bug (Something isn't working)

Comments

@5uso

5uso commented Jun 10, 2022

Bug Report

Current Behaviour:

Unable to load chunks containing strings with special characters.

Expected behavior:

The chunk loads I guess.

Steps To Reproduce:

  1. Create a world and place an item with special characters in its name: /setblock ~ ~ ~ minecraft:chest{Items:[{Count:1b,Slot:0b,id:"minecraft:acacia_boat",tag:{display:{Name:'{"text":"🏹"}'}}}]}

  2. Attempt to load the world with amulet. When trying to access that chunk, an error will occur instead.

Environment:

  • OS: Windows
  • Minecraft Platform: Java
  • Minecraft Version: 1.19 (Also tested on 1.16.5)
  • Amulet Version: 0.9.4

Additional context

Possibly related to Amulet NBT's #13

The issue is not present on amulet 0.9.1. I haven't tested 0.9.2 or 0.9.3.

Attachments

Screenshots

[Screenshot: example of the problematic chest]

Worlds

Amulet_UTF8_Error.zip (A world containing a chest that causes the issue in chunk 0 1)

5uso added the state: triage (the severity of this ticket needs evaluating) and type: bug (Something isn't working) labels on Jun 10, 2022
@gentlegiantJGC
Member

I can confirm that Java edition generates the byte sequence \xed\xa0\xbc\xed\xbf\xb9 when you type in 🏹; however, I cannot find an encoding that decodes it correctly.
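
For reference, here is a quick check (my own snippet, not from Amulet's code) showing that Python's strict UTF-8 codec rejects that byte sequence because it encodes UTF-16 surrogate values:

data = b"\xed\xa0\xbc\xed\xbf\xb9"  # the bytes Java edition writes for 🏹
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xed ... (surrogate sequences are rejected)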

@gentlegiantJGC
Member

Using this website I am able to decode it correctly.
https://mothereff.in/utf-8

@gentlegiantJGC
Member

Right, so after quite a bit of digging I have worked out that the encoding Java edition uses is a modified version of UTF-8 (null is written as \xc0\x80 and characters outside the Basic Multilingual Plane as surrogate pairs), which Python's standard UTF-8 codec refuses to parse.

@gentlegiantJGC
Member

We might have to include something like this to be able to correctly decode the data:
https://pypi.org/project/mutf8/

I will have to do some more thinking about the best way to handle the various encoding schemes.
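
A rough sketch of what using it could look like (assuming the decode_modified_utf8 / encode_modified_utf8 helpers that the mutf8 package documents on PyPI):

from mutf8 import decode_modified_utf8, encode_modified_utf8

java_bytes = b"\xed\xa0\xbc\xed\xbf\xb9"  # the sequence from this issue
text = decode_modified_utf8(java_bytes)
print(text)  # 🏹
assert encode_modified_utf8(text) == java_bytes  # round-trips without loss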

gentlegiantJGC added the priority: high (this ticket needs resolving quickly) label and removed the state: triage (the severity of this ticket needs evaluating) label on Jun 11, 2022
@gentlegiantJGC
Member

I think it would make sense to have a customisable encoding and decoding scheme so that the calling code can handle the encoding as it wishes.
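
Purely as an illustration of the idea (these names are hypothetical, not Amulet NBT's actual API), the reader could take the decoder as a parameter and stay encoding-agnostic:

from typing import Callable

def read_tag_string(
    raw: bytes,
    string_decoder: Callable[[bytes], str] = lambda b: b.decode("utf-8"),
) -> str:
    # The reader does not assume an encoding; the caller supplies the decoder.
    return string_decoder(raw)

# A Java world loader could pass a Modified UTF-8 decoder, while a Bedrock
# loader could pass a decoder that escapes or passes through invalid bytes.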

@PREMIEREHELL

PREMIEREHELL commented Jun 11, 2022

@property
def py_str(self) -> str:
    # Decode allowing lone surrogates through, then round-trip via UTF-16 so
    # that surrogate pairs are combined back into the real code points.
    s_8_data = self.py_bytes.decode('utf_8', errors='surrogatepass')
    b_16_data = s_8_data.encode('utf_16', errors='surrogatepass')
    s_16_data = b_16_data.decode('utf_16')
    return s_16_data

This rebuilds the correct string if there are any issues with surrogate-encoded characters.
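
As a standalone check (assuming py_bytes holds the raw TAG_String payload), the same steps applied to the byte sequence from this issue:

raw = b"\xed\xa0\xbc\xed\xbf\xb9"
s = raw.decode("utf_8", errors="surrogatepass")                   # "\ud83c\udff9" (lone surrogates)
s = s.encode("utf_16", errors="surrogatepass").decode("utf_16")   # pairs them back up
print(s)  # 🏹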

def utf8s_to_utf8m(e_bytes):
    """Convert standard UTF-8 bytes to Java's Modified UTF-8."""
    new_str = []
    i = 0
    while i < len(e_bytes):
        byte1 = e_bytes[i]
        if (byte1 & 0x80) == 0:
            # 1-byte character; null is re-encoded as the overlong pair 0xC0 0x80
            if byte1 == 0:
                new_str.append(0xC0)
                new_str.append(0x80)
            else:
                new_str.append(byte1)
        elif (byte1 & 0xE0) == 0xC0:
            # 2-byte character, copied unchanged
            new_str.append(byte1)
            i += 1
            new_str.append(e_bytes[i])
        elif (byte1 & 0xF0) == 0xE0:
            # 3-byte character, copied unchanged
            new_str.append(byte1)
            i += 1
            new_str.append(e_bytes[i])
            i += 1
            new_str.append(e_bytes[i])
        elif (byte1 & 0xF8) == 0xF0:
            # 4-byte character: decode the code point and re-encode it as a
            # surrogate pair (2 x 3 bytes)
            i += 1
            byte2 = e_bytes[i]
            i += 1
            byte3 = e_bytes[i]
            i += 1
            byte4 = e_bytes[i]
            u21 = (byte1 & 0x07) << 18
            u21 += (byte2 & 0x3F) << 12
            u21 += (byte3 & 0x3F) << 6
            u21 += (byte4 & 0x3F)
            new_str.append(0xED)
            new_str.append(0xA0 + (((u21 >> 16) - 1) & 0x0F))
            new_str.append(0x80 + ((u21 >> 10) & 0x3F))
            new_str.append(0xED)
            new_str.append(0xB0 + ((u21 >> 6) & 0x0F))
            new_str.append(byte4)
        i += 1
    return bytes(new_str)

cdef void write_string(bytes value, object buffer, bint little_endian):
    # Java NBT is big-endian, so convert to Modified UTF-8 before writing.
    if not little_endian:
        value = utf8s_to_utf8m(value)
    cdef short length = <short> len(value)
    cdef char* s = value
    to_little_endian(&length, 2, little_endian)
    cwrite(buffer, <char*> &length, 2)
    cwrite(buffer, s, len(value))

This converts the incompatible 4-byte UTF-8 sequences into the 6-byte surrogate-pair encoding that Minecraft Java is using, aka CESU-8.
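
For example (my own quick check, not part of the snippet above):

assert utf8s_to_utf8m("🏹".encode("utf-8")) == b"\xed\xa0\xbc\xed\xbf\xb9"
assert utf8s_to_utf8m(b"\x00") == b"\xc0\x80"  # embedded null becomes the overlong pair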

@gentlegiantJGC
Member

gentlegiantJGC commented Jun 12, 2022

Here are the cases we have to deal with:
Standard utf-8 (bedrock)

\x00 null
\x41 a
\xc2\xb6 ¶
\xe3\x8b\xa0 ㋠
\xF0\x90\x8C\xB0 𐌰

Modified utf-8 (java)

\xc0\x80 null
\x41 a
\xc2\xb6 ¶
\xe3\x8b\xa0 ㋠
\xed\xa0\x80\xed\xbc\xb0 𐌰

Malformed data. Bedrock has been known to save malformed utf-8 byte sequences.
\xa7l incorrectly sliced from \xc2\xa7l

Arbitrary data. Bedrock has recently used the TAG_String class to store arbitrary bytes.
\x00\x00\x00\x03\x00\x00\x00\x06
\x06\x00\x00\x00\xFD\xFF\xFF\xFF

We should also design this with the assumption that other string encoding schemes could be found in the future.
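
As a quick illustration (my own check) of why the malformed and arbitrary cases need special handling, a strict UTF-8 decode rejects them outright:

for raw in (b"\xa7l", b"\x06\x00\x00\x00\xFD\xFF\xFF\xFF"):
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        print(raw, "->", e.reason)  # both fail with "invalid start byte"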

@gentlegiantJGC
Member

  1. Store everything as bytes

    1. Everything stored in the raw format can be saved back without data loss
    2. A bad encoding won't be noticed until the value is accessed
    3. This requires the accessing code to know the encoding
    4. Copying the tag to a platform with a different encoding will cause errors
  2. Store everything as string

    1. The decoding and encoding methods would have to specify the encoding scheme
    2. Invalid byte sequences would have to get escaped somehow so they get encoded back to the same byte sequence
    3. Tags can easily be copied between platforms because the encoding is done at the end.

If we can somehow sort out an escaping system that gets automatically unescaped at encoding I would like to switch back to the string method.

@gentlegiantJGC
Member

An option for decoding the error values is to escape them
b"\x06\x00\x00\x00\xFD\xFF\xFF\xFF".decode(errors="backslashreplace") -> "\x06\x00\x00\x00\\xfd\\xff\\xff\\xff"
Note that on the left \xFD is one byte but on the right \\xfd is four characters.

To reencode you do this (there might be a better way)
"\x06\x00\x00\x00\\xfd\\xff\\xff\\xff".encode("utf-8").decode("unicode_escape").encode("latin") -> b'\x06\x00\x00\x00\xfd\xff\xff\xff'

This does have the issue that b"\\xfd" would be decoded correctly but upon reencoding would be converted to one byte.
b"\\xfd \xfd" -> "\\xfd \\xfd" -> b"\xfd \xfd"

I am tempted to say we should do something like this but pick a rarely used unicode character as the escape character.

@gentlegiantJGC
Member

Doing so would mean that the escape character we pick cannot be used in a normal string.
Escape suggestions:

  1. Stick with \xFF. This is fairly standard, but it may conflict if someone wanted to use \x in a normal string, which I think is a non-negligible risk.
  2. ␛FF. I did a search of Unicode and found this symbol (U+241B SYMBOL FOR ESCAPE), which seems to be designed as a visual escape symbol. Again, there is a non-negligible chance this could appear in a normal string.
  3. Combine the two: ␛xFF. I think it is very unlikely that ␛x would appear in that combination in normal text.

@gentlegiantJGC
Member

So the byte sequence b"\x06\x00\x00\x00\xFD\xFF\xFF\xFF", when decoded using UTF-8 with our special error function, would create the string "\x06\x00\x00\x00␛xFD␛xFF␛xFF␛xFF", where the first four values are single characters and each of the last four is made up of four characters.

@gentlegiantJGC
Member

gentlegiantJGC commented Jun 12, 2022

Here is my UTF-8 escape implementation.
There may be some issues with the encoder if the byte sequence being matched happens to also match other characters, but I think the chance is low.
Edit: I just tried putting every possible byte before this sequence and parsing it, and only the first 128 were valid (1 byte per character, with the escape character being handled correctly).

import codecs
import re


def _escape_replace(err):
    # Error handler: replace each undecodable byte with the escape form ␛xFF
    if isinstance(err, UnicodeDecodeError):
        return f"␛x{err.object[err.start]:02X}", err.start + 1
    raise err


codecs.register_error("escapereplace", _escape_replace)


def utf8_escape_decoder(b: bytes) -> str:
    """UTF-8 decoder that escapes error bytes to the form ␛xFF"""
    return b.decode(errors="escapereplace")


EscapePattern = re.compile(b"\xe2\x90\x9bx([0-9a-fA-F]{2})")  # ␛xFF

def utf8_escape_encoder(s: str) -> bytes:
    """UTF-8 encoder that converts ␛x[0-9a-fA-F]{2} back to individual bytes"""
    return EscapePattern.sub(lambda m: bytes([int(m.groups()[0], 16)]), s.encode())
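
Running the snippet on the arbitrary-data example above round-trips correctly (my own check):

raw = b"\x06\x00\x00\x00\xFD\xFF\xFF\xFF"
escaped = utf8_escape_decoder(raw)
print(escaped)                               # '\x06\x00\x00\x00␛xFD␛xFF␛xFF␛xFF'
assert utf8_escape_encoder(escaped) == raw   # escapes encode back to the original bytes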

@gentlegiantJGC
Member

This should be fixed in 0.10.0b1 with the new NBT library
