Surrogate pair encoding error #6

gentlegiantJGC · 2023-06-27T08:56:01Z

I have been doing a deep dive into the the code of this library and found something funky going on with the surrogate pair encoding.

The issue seems to be from the top 4 bits but I can't find a simple document explaining how they are supposed to work.

I will update this with more info when I have it but my current findings are below.

Column 1 is v
Column 2 is ord(decode_modified_utf8(encode_modified_utf8(chr(v)))) (note how the last 4 values do not match the input
Column 3 is encode_modified_utf8(chr(v))
Column 4 is column 3 in binary

65536		65536		b'\xed\xa1\x80\xed\xb0\x80'	    0000000000000000111011011010000110000000111011011011000010000000
131072		196608		b'\xed\xa2\x80\xed\xb0\x80'	    0000000000000000111011011010001010000000111011011011000010000000
262144		327680		b'\xed\xa4\x80\xed\xb0\x80'	    0000000000000000111011011010010010000000111011011011000010000000
524288		589824		b'\xed\xa8\x80\xed\xb0\x80'	    0000000000000000111011011010100010000000111011011011000010000000
1048576		65536		b'\xed\xa0\x80\xed\xb0\x80'	    0000000000000000111011011010000010000000111011011011000010000000

I think the issue is on encoding because using other decoding tools gives the same value decoding the encoded value.

The text was updated successfully, but these errors were encountered:

gentlegiantJGC · 2023-06-27T10:33:57Z

This is the best documentation I have found on this part of the format.
https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF

It looks like you need to add 0x10000 not or it like you are

TkTech · 2023-06-28T09:47:43Z

Thanks for the effort on this. Do you have test strings (or a file) this was failing to parse so I can write tests for it and verify your fixes?

gentlegiantJGC · 2023-06-28T12:24:09Z

Here are some unicode indexes and the mutf-8 byte sequences that they should encode to.

(
    (0, b"\xC0\x80"),
    (1, b"\x01"),
    (2, b"\x02"),
    (4, b"\x04"),
    (8, b"\x08"),
    (16, b"\x10"),
    (32, b"\x20"),
    (64, b"\x40"),
    (128, b"\xc2\x80"),
    (256, b"\xc4\x80"),
    (512, b"\xc8\x80"),
    (1024, b"\xd0\x80"),
    (2048, b"\xe0\xa0\x80"),
    (4096, b"\xe1\x80\x80"),
    (8192, b"\xe2\x80\x80"),
    (16384, b"\xe4\x80\x80"),
    (32768, b"\xe8\x80\x80"),
    (65536, b"\xed\xa0\x80\xed\xb0\x80"),
    (131072, b"\xed\xa1\x80\xed\xb0\x80"),
    (262144, b"\xed\xa3\x80\xed\xb0\x80"),
    (524288, b"\xed\xa7\x80\xed\xb0\x80"),
    (1048576, b"\xed\xaf\x80\xed\xb0\x80"),
)

TkTech · 2023-06-30T09:32:55Z

I took a quick look and your changes break other tests that I'm fairly confident are correct, so I will need to take a closer look at this after the long weekend.

tuxor1337 · 2023-07-21T22:37:55Z

@TkTech Any news on this? Here's an implementation of the conversion that doesn't have this bug: https://gist.github.com/BarelyAliveMau5/000e7e453b6d4ebd0cb06f39bc2e7aec Unfortunately, it's just a random Gist without PyPI package or so. Of course, it would be desirable to have a working implementation as part of a well-maintained package.

Here's the example that lead me here:

>>> import mutf8
>>> mutf8.encode_modified_utf8("𝕭")
b'\xed\xa1\xb5\xed\xb5\xad'
>>> utf8s_to_utf8m("𝕭".encode())
b'\xed\xa0\xb5\xed\xb5\xad'

gentlegiantJGC · 2024-02-07T10:35:08Z

The thing that originally brought this to my attention was the bow and arrow emoji.
This 🏹 should encode to \xed\xa0\xbc\xed\xbf\xb9 in mutf8 but your library encodes it to \xed\xa1\xbc\xed\xbf\xb9

gentlegiantJGC linked a pull request Jun 27, 2023 that will close this issue

Fixed surrogate pair encoding #7

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Surrogate pair encoding error #6

Surrogate pair encoding error #6

gentlegiantJGC commented Jun 27, 2023

gentlegiantJGC commented Jun 27, 2023

TkTech commented Jun 28, 2023 •

edited

gentlegiantJGC commented Jun 28, 2023

TkTech commented Jun 30, 2023

tuxor1337 commented Jul 21, 2023 •

edited

gentlegiantJGC commented Feb 7, 2024

Surrogate pair encoding error #6

Surrogate pair encoding error #6

Comments

gentlegiantJGC commented Jun 27, 2023

gentlegiantJGC commented Jun 27, 2023

TkTech commented Jun 28, 2023 • edited

gentlegiantJGC commented Jun 28, 2023

TkTech commented Jun 30, 2023

tuxor1337 commented Jul 21, 2023 • edited

gentlegiantJGC commented Feb 7, 2024

TkTech commented Jun 28, 2023 •

edited

tuxor1337 commented Jul 21, 2023 •

edited