Error in python code #5

VectorASD · 2023-01-09T17:11:24Z

When decoding a 6-byte sequence, the code uses "0x10000 |". That is not correct. In standard UTF-8, the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx can encode 21 bits, but the 6-byte MUTF-8 form has only 20 bits available, so you need to ADD 0x10000 rather than OR it in. The encoder does not take this 0x10000 into account at all. Try encoding, for example, "📋" yourself and then decoding it. The result is 🥴.
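
To illustrate the difference between adding and OR-ing, here is a minimal sketch of decoding one 6-byte sequence. The helper name decode_pair and its combine parameter are hypothetical and are not taken from the mutf8 package; the sketch assumes the 4 + 6 + 4 + 6 bit layout of an encoded surrogate pair.

```python
import operator


def decode_pair(six: bytes, combine) -> int:
    """Decode one 6-byte MUTF-8 surrogate pair (hypothetical helper).

    `combine` is either operator.add or operator.or_, so the two
    approaches can be compared on the same input.
    """
    b1, b2, b3, b4, b5, b6 = six
    if b1 != 0xED or b4 != 0xED:
        raise ValueError("not a 6-byte surrogate pair")
    # 4 + 6 + 4 + 6 = 20 payload bits, i.e. the code point minus 0x10000.
    payload = (
        ((b2 & 0x0F) << 16)
        | ((b3 & 0x3F) << 10)
        | ((b5 & 0x0F) << 6)
        | (b6 & 0x3F)
    )
    return combine(0x10000, payload)


# U+1F4CB: bit 16 of the payload is clear, so both operations agree.
clipboard = bytes.fromhex("eda0bdedb38b")
print(hex(decode_pair(clipboard, operator.add)))  # 0x1f4cb
print(hex(decode_pair(clipboard, operator.or_)))  # 0x1f4cb

# U+20000: bit 16 of the payload is set, so OR loses it.
plane2 = bytes.fromhex("eda180edb080")
print(hex(decode_pair(plane2, operator.add)))  # 0x20000
print(hex(decode_pair(plane2, operator.or_)))  # 0x10000
```

For correctly encoded input the two operations agree below U+20000, so the difference only shows up once bit 16 of the 20-bit payload is set.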

armijnhemel · 2023-03-15T16:32:07Z

I tried to replicate it but couldn't:

>>> import mutf8
>>> emoji = '\U0001f4cb'
>>> encoded = mutf8.encode_modified_utf8(emoji)
>>> encoded
b'\xed\xa1\xbd\xed\xb3\x8b'
>>> mutf8.decode_modified_utf8(encoded) == emoji
True

TkTech · 2023-03-25T20:51:10Z

@VectorASD Would you be able to provide a minimal reproduction example?

VectorASD · 2023-03-26T05:33:40Z

Setting aside the fact that str_data_b is an instance of my own class (it lets me write the sections of a dex file independently of one another and then glue them together and fill in the linking data at the end), this is exactly what a fully working MUTF-8 encoder looks like:

import io

def MUTF8(Str):
    # sdb = self.str_data_b
    sdb = io.BytesIO()
    Str = [ord(let) for let in Str]
    # Length in UTF-16 code units: supplementary characters count twice.
    L = len(Str) + sum(1 for c in Str if c >= 0x10000)
    #pos = sdb.tell()
    #  sdb.pos()
    #  sdb.uleb128(L)
    for c in Str:
        if c == 0:
            data = 192, 128  # NUL is written as two bytes
        elif c < 128:
            data = (c,)  # 7 bits
        elif c < 0x800:
            data = 192 | c >> 6, 128 | (c & 63)  # 5 + 6 = 11 bits
        elif c < 0x10000:
            data = 224 | c >> 12, 128 | (c >> 6 & 63), 128 | (c & 63)  # 4 + 6 + 6 = 16 bits
        else:
            c -= 0x10000  # only the 20-bit offset is stored
            data = (  # 4 + 6 + 4 + 6 = 20 bits
                237, 160 | c >> 16, 128 | (c >> 10 & 63),
                237, 176 | (c >> 6 & 15), 128 | (c & 63))
        sdb.write(bytes(data))
    #  sdb.write(b"\0")
    #pos2 = sdb.tell()
    #sdb.seek(pos)
    #print("•", sdb.data.read(pos2 - pos).hex())
    #assert sdb.tell() == pos2
    print(sdb.getvalue().hex())

MUTF8("\U0001f4cb")

It prints eda0bdedb38b instead of the erroneous eda1bdedb38b.

armijnhemel · 2023-03-26T12:51:20Z

I tried one of the examples from https://docs.rs/residua-mutf8/latest/mutf8/ and I can see that the two implementations are giving different results. The Rust version converts \U00010401 to b'\xed\xa0\x81\xed\xb0\x81' whereas the mutf8 python package gives b'\xed\xa1\x81\xed\xb0\x81' as the result:

>>> bla = '\U00010401'
>>> mutf8.encode_modified_utf8(bla)
b'\xed\xa1\x81\xed\xb0\x81'

When testing with some data from Android ( https://android.googlesource.com/platform/development/+/63bf1087ebb06b59e3d82cbc5ccd4485704c6b91/vndk/tools/definition-tool/tests/test_dex_file.py#29 ) I see the same thing happen. So it seems that @VectorASD is correct that there is an error.

armijnhemel · 2023-03-26T14:03:21Z

It seems that the problem is in the encoding, not the decoding:

>>> bla = '\U00010401'
>>> encoded = mutf8.encode_modified_utf8(bla)
>>> encoded
b'\xed\xa1\x81\xed\xb0\x81'
>>> encoded2 = b'\xed\xa0\x81\xed\xb0\x81'
>>> encoded == encoded2
False
>>> bla == mutf8.decode_modified_utf8(encoded)
True
>>> bla == mutf8.decode_modified_utf8(encoded2)
True
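
As an independent cross-check of the expected bytes, the standard library can produce the same encoding for a single supplementary character: split it into its UTF-16 surrogates, then encode each surrogate as 3-byte UTF-8 with the surrogatepass error handler. The function name expected_mutf8 is hypothetical, and this sketch only handles one character at or above U+10000.

```python
def expected_mutf8(ch: str) -> bytes:
    """Reference bytes for one supplementary character (hypothetical helper)."""
    if len(ch) != 1 or ord(ch) < 0x10000:
        raise ValueError("expected a single character at or above U+10000")
    # UTF-16BE yields 4 bytes: the high surrogate followed by the low surrogate.
    utf16 = ch.encode("utf-16-be")
    high = int.from_bytes(utf16[:2], "big")
    low = int.from_bytes(utf16[2:], "big")
    # "surrogatepass" lets the utf-8 codec emit lone surrogates as 3 bytes each.
    return b"".join(
        chr(unit).encode("utf-8", "surrogatepass") for unit in (high, low)
    )


print(expected_mutf8("\U00010401"))        # b'\xed\xa0\x81\xed\xb0\x81'
print(expected_mutf8("\U0001f4cb").hex())  # eda0bdedb38b
```

Both values match the Rust output and the hand-written encoder quoted in this thread, not the bytes currently produced by encode_modified_utf8.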
