Surrogate pair encoding error #6
Comments
This is the best documentation I have found on this part of the format. It looks like you need to add |
Thanks for the effort on this. Do you have test strings (or a file) this was failing to parse so I can write tests for it and verify your fixes? |
Here are some Unicode code points and the MUTF-8 byte sequences that they should encode to. (
(0, b"\xC0\x80"),
(1, b"\x01"),
(2, b"\x02"),
(4, b"\x04"),
(8, b"\x08"),
(16, b"\x10"),
(32, b"\x20"),
(64, b"\x40"),
(128, b"\xc2\x80"),
(256, b"\xc4\x80"),
(512, b"\xc8\x80"),
(1024, b"\xd0\x80"),
(2048, b"\xe0\xa0\x80"),
(4096, b"\xe1\x80\x80"),
(8192, b"\xe2\x80\x80"),
(16384, b"\xe4\x80\x80"),
(32768, b"\xe8\x80\x80"),
(65536, b"\xed\xa0\x80\xed\xb0\x80"),
(131072, b"\xed\xa1\x80\xed\xb0\x80"),
(262144, b"\xed\xa3\x80\xed\xb0\x80"),
(524288, b"\xed\xa7\x80\xed\xb0\x80"),
(1048576, b"\xed\xaf\x80\xed\xb0\x80"),
) |
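For anyone turning the table above into tests, here is a minimal reference encoder sketch that reproduces those byte sequences. It is not this library's implementation; `encode_mutf8_sketch` and `_three_bytes` are hypothetical names chosen for illustration, and the code only assumes the usual MUTF-8 rules (NUL as `C0 80`, supplementary code points as two 3-byte surrogates).

```python
def _three_bytes(unit: int) -> bytes:
    """Encode a 16-bit value (0x0800..0xFFFF) as a 3-byte UTF-8 style sequence."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def encode_mutf8_sketch(text: str) -> bytes:
    """Hypothetical reference encoder, written only to check the table above."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"  # NUL is never written as a single zero byte
        elif cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            out += _three_bytes(cp)
        else:
            # Supplementary plane: subtract 0x10000, then split into surrogates.
            v = cp - 0x10000
            out += _three_bytes(0xD800 | (v >> 10))
            out += _three_bytes(0xDC00 | (v & 0x3FF))
    return bytes(out)


# A few rows from the table above.
assert encode_mutf8_sketch(chr(0)) == b"\xc0\x80"
assert encode_mutf8_sketch(chr(2048)) == b"\xe0\xa0\x80"
assert encode_mutf8_sketch(chr(65536)) == b"\xed\xa0\x80\xed\xb0\x80"
```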
I took a quick look and your changes break other tests that I'm fairly confident are correct, so I will need to take a closer look at this after the long weekend. |
@TkTech Any news on this?
Here's an implementation of the conversion that doesn't have this bug: https://gist.github.com/BarelyAliveMau5/000e7e453b6d4ebd0cb06f39bc2e7aec
Unfortunately, it's just a standalone Gist without a PyPI package. Of course, it would be desirable to have a working implementation as part of a well-maintained package.
Here's the example that led me here:
>>> import mutf8
>>> mutf8.encode_modified_utf8("𝕭")
b'\xed\xa1\xb5\xed\xb5\xad'
>>> utf8s_to_utf8m("𝕭".encode())
b'\xed\xa0\xb5\xed\xb5\xad' |
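The two outputs above differ only in the second byte (`\xa1` versus `\xa0`). The arithmetic below is a sketch of one possible explanation, not a confirmed diagnosis of the library: skipping the `0x10000` subtraction before splitting U+1D56D into surrogates happens to reproduce the `\xa1` bytes. `utf16_units` and `three_bytes` are hypothetical helpers written for this illustration.

```python
def utf16_units(cp, subtract_offset=True):
    """Split a supplementary code point into (high, low) surrogate values."""
    v = cp - 0x10000 if subtract_offset else cp
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)


def three_bytes(unit):
    """Encode one 16-bit surrogate value as three bytes."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


cp = 0x1D56D  # the character from the example above

with_offset = b"".join(three_bytes(u) for u in utf16_units(cp))
without_offset = b"".join(three_bytes(u) for u in utf16_units(cp, subtract_offset=False))

print(with_offset)     # b'\xed\xa0\xb5\xed\xb5\xad'
print(without_offset)  # b'\xed\xa1\xb5\xed\xb5\xad'
```

Only the high surrogate changes, because the low 10 bits are unaffected by subtracting `0x10000`.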
The thing that originally brought this to my attention was the bow and arrow emoji. |
I have been doing a deep dive into the code of this library and found something funky going on with the surrogate pair encoding.
The issue seems to come from the top 4 bits, but I can't find a simple document explaining how they are supposed to work.
I will update this with more info when I have it but my current findings are below.
Column 1 is v
Column 2 is ord(decode_modified_utf8(encode_modified_utf8(chr(v)))) (note how the last 4 values do not match the input)
Column 3 is encode_modified_utf8(chr(v))
Column 4 is column 3 in binary
I think the issue is in the encoder, because other decoding tools return the same value when decoding the encoded bytes.
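One way to cross-check encoder output without relying on this library's decoder is to use only Python's standard codecs. The sketch below assumes well-formed input with properly paired surrogates; `decode_mutf8_sketch` is a hypothetical name, and the byte strings are the two outputs quoted elsewhere in this thread.

```python
def decode_mutf8_sketch(data: bytes) -> str:
    """Independent MUTF-8 decoder built from standard codecs (illustrative only)."""
    # C0 80 is the MUTF-8 spelling of NUL; plain UTF-8 decoding rejects it.
    data = data.replace(b"\xc0\x80", b"\x00")
    # "surrogatepass" lets the 3-byte surrogate encodings through as code units.
    units = data.decode("utf-8", "surrogatepass")
    # Round-tripping through UTF-16 joins each surrogate pair into one character.
    # An unpaired surrogate raises UnicodeDecodeError here.
    return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


expected = decode_mutf8_sketch(b"\xed\xa0\xb5\xed\xb5\xad")
observed = decode_mutf8_sketch(b"\xed\xa1\xb5\xed\xb5\xad")

print(hex(ord(expected)))  # 0x1d56d
print(hex(ord(observed)))  # 0x2d56d
```

If an independent decoder like this one agrees with the library's decoder on the bytes the library produced, the mismatch most likely originates on the encoding side.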