Error in python code #5

Open
VectorASD opened this issue Jan 9, 2023 · 5 comments

@VectorASD

When decoding a 6-byte sequence, the code uses "0x10000 |". That is not correct. Ordinary UTF-8 in the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx can carry 21 bits, but the 6-byte MUTF-8 form has only 20 bits available, so 0x10000 has to be ADDED rather than combined with an OR operation. The encoder does not take this 0x10000 into account at all. Try encoding "📋" yourself, for example, and then decoding it. The result is 🥴.
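To make the arithmetic concrete, here is a small sketch that works on code points only, not on bytes. The helper name `add_vs_or` is made up for this illustration and is not part of the mutf8 package. It assumes the encoder has correctly subtracted 0x10000, so the payload fits in 20 bits.

```python
# 20 payload bits plus the 0x10000 offset reach exactly the top of Unicode.
MAX_PAYLOAD = (1 << 20) - 1
assert 0x10000 + MAX_PAYLOAD == 0x10FFFF


def add_vs_or(code_point: int) -> tuple[int, int]:
    """Return (result using +, result using |) for one supplementary code point.

    Hypothetical helper: models only the final step of decoding a surrogate pair.
    """
    payload = code_point - 0x10000  # what a correct encoder stores in the 20 bits
    return 0x10000 + payload, 0x10000 | payload


for cp in (0x10401, 0x1F4CB, 0x20000):
    added, ored = add_vs_or(cp)
    print(hex(cp), hex(added), hex(ored))
# 0x10401 0x10401 0x10401
# 0x1f4cb 0x1f4cb 0x1f4cb
# 0x20000 0x20000 0x10000
```

The two operations agree only while bit 16 of the payload is clear, which is the case for U+10000 through U+1FFFF. From U+20000 upward the OR form loses information.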

@armijnhemel

I tried to replicate it but couldn't:

>>> import mutf8
>>> emoji = '\U0001f4cb'
>>> encoded = mutf8.encode_modified_utf8(emoji)
>>> encoded
b'\xed\xa1\xbd\xed\xb3\x8b'
>>> mutf8.decode_modified_utf8(encoded) == emoji
True

@TkTech
Owner

TkTech commented Mar 25, 2023

@VectorASD Would you be able to provide a minimal reproduction example?

@VectorASD
Author

Leaving aside the fact that str_data_b is an instance of my own class, which lets you write the sections of a dex file independently of each other and then, at the end, glue them together and fill in the binding data, this is exactly what a fully working MUTF-8 encoder looks like:

import io

def MUTF8(Str):
    # sdb = self.str_data_b
    sdb = io.BytesIO()
    Str = [ord(let) for let in Str]
    L = len(Str) + sum(1 for c in Str if c >= 0x10000)
    #pos = sdb.tell()
    #  sdb.pos()
    #  sdb.uleb128(L)
    for c in Str:
      if c == 0: data = 192, 128
      elif c < 128: data = (c,) # 7 bits
      elif c < 0x800: data = 192 | c >> 6, 128 | (c & 63) # 5 + 6 = 11 bits
      elif c < 0x10000: data = 224 | c >> 12, 128 | (c >> 6 & 63), 128 | (c & 63) # 4 + 6 + 6 = 16 bits
      else:
        c -= 0x10000
        data = ( # 4 + 6 + 4 + 6 = 20 bits
          237, 160 | c >> 16, 128 | (c >> 10 & 63),
          237, 176 | (c >> 6 & 15), 128 | (c & 63))
      sdb.write(bytes(data))
    #  sdb.write(b"\0")
    #pos2 = sdb.tell()
    #sdb.seek(pos)
    #print("•", sdb.data.read(pos2 - pos).hex())
    #assert sdb.tell() == pos2
    print(sdb.getvalue().hex())
MUTF8("\U0001f4cb")

It prints eda0bdedb38b instead of the erroneous eda1bdedb38b.

@armijnhemel

armijnhemel commented Mar 26, 2023

I tried one of the examples from https://docs.rs/residua-mutf8/latest/mutf8/ and I can see that the two implementations are giving different results. The Rust version converts \U00010401 to b'\xed\xa0\x81\xed\xb0\x81' whereas the mutf8 python package gives b'\xed\xa1\x81\xed\xb0\x81' as the result:

>>> bla = '\U00010401'
>>> mutf8.encode_modified_utf8(bla)
b'\xed\xa1\x81\xed\xb0\x81'

When testing with some data from Android ( https://android.googlesource.com/platform/development/+/63bf1087ebb06b59e3d82cbc5ccd4485704c6b91/vndk/tools/definition-tool/tests/test_dex_file.py#29 ) I see the same thing happen. So it seems that @VectorASD is correct that there is an error.
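One way to cross-check which byte string is expected, without relying on either package, is to build the surrogate pair with Python's standard codecs. This is only a sketch: the helper name `reference_mutf8_pair` is hypothetical, and it handles a single supplementary-plane character and nothing else.

```python
def reference_mutf8_pair(ch: str) -> bytes:
    """Encode one supplementary character as two 3-byte surrogates.

    Hypothetical helper for cross-checking; not part of the mutf8 package.
    """
    utf16 = ch.encode("utf-16-be")  # 4 bytes: high surrogate, then low surrogate
    if len(utf16) != 4:
        raise ValueError("expected exactly one character above U+FFFF")
    high = chr(int.from_bytes(utf16[:2], "big"))
    low = chr(int.from_bytes(utf16[2:], "big"))
    # "surrogatepass" lets the UTF-8 codec emit each lone surrogate as 3 bytes.
    return (high + low).encode("utf-8", "surrogatepass")


print(reference_mutf8_pair("\U00010401"))        # b'\xed\xa0\x81\xed\xb0\x81'
print(reference_mutf8_pair("\U0001f4cb").hex())  # eda0bdedb38b
```

Both results agree with the Rust crate and with the eda0bdedb38b value reported earlier in this thread.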

It seems that the problem is in the encoding, not the decoding:

>>> bla = '\U00010401'
>>> encoded = mutf8.encode_modified_utf8(bla)
>>> encoded
b'\xed\xa1\x81\xed\xb0\x81'
>>> encoded2 = b'\xed\xa0\x81\xed\xb0\x81'
>>> encoded == encoded2
False
>>> bla == mutf8.decode_modified_utf8(encoded)
True
>>> bla == mutf8.decode_modified_utf8(encoded2)
True
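A possible explanation for why both byte strings decode to the same character, assuming the decoder combines the bits with "0x10000 |" as described in the first comment. The helper `payload20` is hypothetical and only models the bit extraction; it is not the package's actual code.

```python
def payload20(six: bytes) -> int:
    """Extract the 4 + 6 + 4 + 6 = 20 payload bits from a 6-byte surrogate pair.

    Hypothetical helper; assumes bytes 0 and 3 are the 0xED lead bytes.
    """
    return (
        ((six[1] & 0x0F) << 16)
        | ((six[2] & 0x3F) << 10)
        | ((six[4] & 0x0F) << 6)
        | (six[5] & 0x3F)
    )


from_package = b"\xed\xa1\x81\xed\xb0\x81"  # mutf8.encode_modified_utf8 output above
from_rust = b"\xed\xa0\x81\xed\xb0\x81"     # residua-mutf8 output above

for six in (from_package, from_rust):
    p = payload20(six)
    print(hex(p), hex(0x10000 | p), hex(0x10000 + p))
# 0x10401 0x10401 0x20401
# 0x401 0x10401 0x10401
```

Under that assumption, the OR hides the extra bit that the encoder leaves in the payload, so the round trip succeeds. A decoder that adds 0x10000 would read the package's bytes as U+20401.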
