Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surrogate pair encoding error #6

Open
gentlegiantJGC opened this issue Jun 27, 2023 · 6 comments · May be fixed by #7
Open

Surrogate pair encoding error #6

gentlegiantJGC opened this issue Jun 27, 2023 · 6 comments · May be fixed by #7

Comments

@gentlegiantJGC
Copy link

I have been doing a deep dive into the the code of this library and found something funky going on with the surrogate pair encoding.

The issue seems to be from the top 4 bits but I can't find a simple document explaining how they are supposed to work.

I will update this with more info when I have it but my current findings are below.

Column 1 is v
Column 2 is ord(decode_modified_utf8(encode_modified_utf8(chr(v)))) (note how the last 4 values do not match the input
Column 3 is encode_modified_utf8(chr(v))
Column 4 is column 3 in binary

65536		65536		b'\xed\xa1\x80\xed\xb0\x80'	    0000000000000000111011011010000110000000111011011011000010000000
131072		196608		b'\xed\xa2\x80\xed\xb0\x80'	    0000000000000000111011011010001010000000111011011011000010000000
262144		327680		b'\xed\xa4\x80\xed\xb0\x80'	    0000000000000000111011011010010010000000111011011011000010000000
524288		589824		b'\xed\xa8\x80\xed\xb0\x80'	    0000000000000000111011011010100010000000111011011011000010000000
1048576		65536		b'\xed\xa0\x80\xed\xb0\x80'	    0000000000000000111011011010000010000000111011011011000010000000

I think the issue is on encoding because using other decoding tools gives the same value decoding the encoded value.

@gentlegiantJGC
Copy link
Author

This is the best documentation I have found on this part of the format.
https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF

It looks like you need to add 0x10000 not or it like you are

@gentlegiantJGC gentlegiantJGC linked a pull request Jun 27, 2023 that will close this issue
@TkTech
Copy link
Owner

TkTech commented Jun 28, 2023

Thanks for the effort on this. Do you have test strings (or a file) this was failing to parse so I can write tests for it and verify your fixes?

@gentlegiantJGC
Copy link
Author

Here are some unicode indexes and the mutf-8 byte sequences that they should encode to.

(
    (0, b"\xC0\x80"),
    (1, b"\x01"),
    (2, b"\x02"),
    (4, b"\x04"),
    (8, b"\x08"),
    (16, b"\x10"),
    (32, b"\x20"),
    (64, b"\x40"),
    (128, b"\xc2\x80"),
    (256, b"\xc4\x80"),
    (512, b"\xc8\x80"),
    (1024, b"\xd0\x80"),
    (2048, b"\xe0\xa0\x80"),
    (4096, b"\xe1\x80\x80"),
    (8192, b"\xe2\x80\x80"),
    (16384, b"\xe4\x80\x80"),
    (32768, b"\xe8\x80\x80"),
    (65536, b"\xed\xa0\x80\xed\xb0\x80"),
    (131072, b"\xed\xa1\x80\xed\xb0\x80"),
    (262144, b"\xed\xa3\x80\xed\xb0\x80"),
    (524288, b"\xed\xa7\x80\xed\xb0\x80"),
    (1048576, b"\xed\xaf\x80\xed\xb0\x80"),
)

@TkTech
Copy link
Owner

TkTech commented Jun 30, 2023

I took a quick look and your changes break other tests that I'm fairly confident are correct, so I will need to take a closer look at this after the long weekend.

@tuxor1337
Copy link

tuxor1337 commented Jul 21, 2023

@TkTech Any news on this? Here's an implementation of the conversion that doesn't have this bug: https://gist.github.com/BarelyAliveMau5/000e7e453b6d4ebd0cb06f39bc2e7aec Unfortunately, it's just a random Gist without PyPI package or so. Of course, it would be desirable to have a working implementation as part of a well-maintained package.

Here's the example that lead me here:

>>> import mutf8
>>> mutf8.encode_modified_utf8("𝕭")
b'\xed\xa1\xb5\xed\xb5\xad'
>>> utf8s_to_utf8m("𝕭".encode())
b'\xed\xa0\xb5\xed\xb5\xad'

@gentlegiantJGC
Copy link
Author

The thing that originally brought this to my attention was the bow and arrow emoji.
This 🏹 should encode to \xed\xa0\xbc\xed\xbf\xb9 in mutf8 but your library encodes it to \xed\xa1\xbc\xed\xbf\xb9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants