get_num_bytes() falls back to the number of codepoints? #14

vadim-berman · 2018-06-22T07:24:26Z

Hi Jakob,

It's your fan club again.

Looks like there might be an issue in the non-sso mode for get_num_bytes(). In this case, I called substr() on a Russian string (92 bytes long, 52 code points) with a question mark in the middle, and got it butchered because get_num_bytes() returned 52.

I tried to figure out the logic following if (sso_inactive()) - that is, line 839 onwards - but couldn't, and instead replaced it with something crude but working:

	// Sum the encoded length of each codepoint in [index, index + cp_count).
	size_type byte_count = 0;
	for (size_type current_code_point = index; current_code_point < index + cp_count; current_code_point++)
		byte_count += get_codepoint_bytes(at(current_code_point));
	return byte_count;

I'm pretty sure it's not as optimised as what you planned originally but that's the best I could do...

I still get some parts butchered further down the line, for some reason, no idea why...
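
For readers without the library at hand, the same idea can be sketched over a plain std::string. This is not tinyutf8's code: sequence_length and count_bytes are hypothetical names, and the sketch assumes well-formed UTF-8. It walks the lead bytes and sums the sequence lengths, which is linear in the string length, just like the workaround above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not tinyutf8's API: byte length of the UTF-8 sequence
// that starts with lead byte `b`. Assumes well-formed input.
inline std::size_t sequence_length(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx (ASCII)
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                          // unexpected byte: step over it
}

// Hypothetical helper: bytes occupied by `cp_count` codepoints starting at
// codepoint `cp_index`, found by a linear scan (no lookup table).
inline std::size_t count_bytes(const std::string& s, std::size_t cp_index,
                               std::size_t cp_count) {
    std::size_t pos = 0;
    // Skip the codepoints in front of the requested range.
    for (std::size_t i = 0; i < cp_index && pos < s.size(); ++i)
        pos += sequence_length(static_cast<unsigned char>(s[pos]));
    const std::size_t start = pos;
    // Walk the requested codepoints, stopping at the end of the string.
    for (std::size_t i = 0; i < cp_count && pos < s.size(); ++i)
        pos += sequence_length(static_cast<unsigned char>(s[pos]));
    assert(start <= pos);
    // Clamp in case the final sequence is truncated.
    const std::size_t end = pos < s.size() ? pos : s.size();
    return end - (start < end ? start : end);
}
```

Calling count_bytes(text, 0, n) on a string with two-byte Cyrillic letters returns more than n, which is the property the reported get_num_bytes() result was missing.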

DuffsDevice · 2018-06-23T16:24:43Z

Dear Vadim,

Good to hear from you! ;)

Do you have a snippet with which I can reproduce the failure?
The idea behind the referenced lines is the following: if the look-up table (LUT, which contains the byte indices at which multibyte codepoints reside) is active and has length 0, then there are no codepoints with more than one byte. That means the number of bytes equals the number of codepoints. That's why it says "return cp_count;".

So I think this part of code is fine, but somehow the LUT size or the LUT mode is broken.
Can you follow the trail?
I can have a look at your code if you can strip it down - Maybe we're in luck and this bug got silently fixed by fixing #15 :)
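
The look-up-table idea described above can be sketched as follows. This is an illustration under assumptions, not tinyutf8's actual data layout: build_lut, prefix_bytes and lead_length are hypothetical names, the table is a plain std::vector of byte offsets, and the input is assumed to be well-formed UTF-8 with cp_count no larger than the string's codepoint count.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: byte length of a UTF-8 sequence from its lead byte.
inline std::size_t lead_length(unsigned char b) {
    return b < 0x80 ? 1 : (b & 0xE0) == 0xC0 ? 2 : (b & 0xF0) == 0xE0 ? 3 : 4;
}

// Records the byte offset of every multibyte codepoint in `s`.
inline std::vector<std::size_t> build_lut(const std::string& s) {
    std::vector<std::size_t> lut;
    for (std::size_t pos = 0; pos < s.size();) {
        const std::size_t len = lead_length(static_cast<unsigned char>(s[pos]));
        if (len > 1) lut.push_back(pos);
        pos += len;
    }
    return lut;
}

// Bytes occupied by the first `cp_count` codepoints, using the table.
inline std::size_t prefix_bytes(const std::string& s,
                                const std::vector<std::size_t>& lut,
                                std::size_t cp_count) {
    // Empty table: every codepoint is one byte, so bytes == codepoints.
    if (lut.empty()) return cp_count;
    std::size_t bytes = 0, cps = 0;
    for (std::size_t offset : lut) {
        assert(offset >= bytes && offset < s.size());
        const std::size_t run = offset - bytes;  // single-byte codepoints before this entry
        if (cps + run >= cp_count) return bytes + (cp_count - cps);
        cps += run + 1;                          // the run plus the multibyte codepoint
        bytes = offset + lead_length(static_cast<unsigned char>(s[offset]));
    }
    return bytes + (cp_count - cps);             // the remainder is single-byte
}
```

The empty-table shortcut is only correct while the table matches the string. If the table is empty or stale while the string does contain multibyte codepoints, prefix_bytes returns cp_count, which is the symptom reported in this issue.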

vadim-berman · 2018-06-24T03:18:28Z

Hi Jakob,

Thanks for the speedy fixes!

The good news: #15 did fix some of the cases.
The bad news: not all :) .

I had to play a bit to reproduce where it fails, and in order for this to happen (return codepoints instead of the bytes), I had to:

* initialise the string using utf8_string::append(utf8_string)
* make the string longer than 50 characters
* invoke substr on 50+ characters

I imagine it means sso=off?

Example code (sorry that it's Russian but I inserted a numeral to make it easier for you to track):

text_string test;
test.append(u8"длинная строка длиннее чем 50 знаков и с многобайтовыми знаками");
cout << test.substr(0, 52);

The result will be a string truncated after 29 characters, right after the numeral, with the last codepoint chopped off after the first byte.

In my case, append was unnecessary, so I replaced it with a normal assignment, and the problem was resolved. Calling a substring with a shorter argument also does not cause the issue.
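
The truncation point is consistent with the requested codepoint count being used as a byte count. The sketch below checks that arithmetic on a plain std::string; it does not use tinyutf8, and count_codepoints and cut_with_codepoints_as_bytes are hypothetical names that only mimic the suspected fault.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts codepoints in well-formed UTF-8 by counting every byte that is
// not a continuation byte (10xxxxxx).
inline std::size_t count_codepoints(const std::string& bytes) {
    std::size_t n = 0;
    for (unsigned char b : bytes)
        if ((b & 0xC0) != 0x80) ++n;
    return n;
}

// Mimics the suspected fault: `cp_count` is treated as a number of bytes.
inline std::string cut_with_codepoints_as_bytes(const std::string& s,
                                                std::size_t cp_count) {
    assert(cp_count <= s.size());
    return s.substr(0, cp_count);
}
```

Applied to the example string with cp_count = 52, the cut keeps 52 bytes: 23 two-byte Cyrillic letters, 4 spaces and the 2 digits, that is 29 codepoints ending right after the numeral. The additional half-chopped codepoint mentioned above is not explained by this arithmetic alone.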

DuffsDevice · 2018-06-24T19:35:01Z

Hi Vadim,

Thank you for the (very 👍) minimal example and also the hints about the circumstances in which the error occurs; this speeds up the debugging process incredibly!
I'll have a look hopefully tomorrow and will get back to you ASAP.

Have a nice Sunday, maybe watching soccer :D

Jakob

vadim-berman · 2018-06-25T01:15:54Z

Great - thanks, Jakob! Glad it helped.

Best regards, Vadim

DuffsDevice · 2018-06-27T18:19:20Z

Hi Vadim,

I have found and fixed the bug. Your bug has revealed two other ones in raw_replace and raw_insert!
Additionally, I took the time to add logic so that the decision whether to use a LUT takes into account whether a LUT already exists and adjusts the threshold accordingly.

Thank you, your help has made finding the bug rather easy!

Best Regards
Jakob

vadim-berman · 2018-06-28T02:38:18Z

Hi Jakob,

Excellent! Glad it helped. Thanks a lot for maintaining the library.

PS. I don't know whether this is unwanted advice, but while browsing the code, I noticed that pieces with similar or identical logic are often duplicated (on the evils of duplication: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself; and I definitely recommend The Pragmatic Programmer). I imagine this is because you want the library to be highly optimised. If the level of optimisation is high enough for the function calls to play a role, then maybe you can consider macros?

Best regards, Vadim

DuffsDevice · 2018-06-28T07:40:42Z

You definitely have a point. Which lines would you try to factor out (regardless of how)?

vadim-berman · 2018-06-28T08:06:28Z

Thanks, Jakob. Off the top of my head:

* get_num_bytes_from_start() should probably make use of get_num_bytes(); currently it has its own logic
* there are multiple calls to get_codepoint_bytes() for the same reason, basically converting code points to bytes

These are the ones that I saw. It could be the other way around (bytes to codepoints), too.

Best regards, Vadim

DuffsDevice · 2018-06-29T20:56:05Z

Yes, get_num_bytes and get_num_bytes_from_start share a lot of code, since the latter is a specialization of the former. Similarly append, raw_erase and raw_insert are all specializations of raw_replace. I decided to optimize them inside their own functions, which of course means redundancy to some extent :-|
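
The trade-off can be made concrete with a sketch of the non-redundant alternative, in which the specializations simply delegate to the general operation. This is not how tinyutf8 is written (the comment above says each function is optimized separately); the names replace_bytes, insert_bytes, erase_bytes and append_bytes are hypothetical and operate on a plain std::string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// General operation: replace `len` bytes starting at `pos` with `repl`.
inline void replace_bytes(std::string& s, std::size_t pos, std::size_t len,
                          const std::string& repl) {
    assert(pos <= s.size());
    s.replace(pos, len, repl);  // `len` is clamped to the end of the string
}

// Insertion is a replacement of zero bytes.
inline void insert_bytes(std::string& s, std::size_t pos, const std::string& text) {
    replace_bytes(s, pos, 0, text);
}

// Erasure is a replacement with nothing.
inline void erase_bytes(std::string& s, std::size_t pos, std::size_t len) {
    replace_bytes(s, pos, len, std::string());
}

// Appending is an insertion at the end.
inline void append_bytes(std::string& s, const std::string& text) {
    replace_bytes(s, s.size(), 0, text);
}
```

Delegating keeps a single place to fix bugs such as the ones found in raw_replace and raw_insert, at the cost of whatever the general routine does that a specialized one could skip.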

vadim-berman · 2018-06-30T05:13:54Z

Understood. Thanks for making the effort to review!

Best regards, Vadim

vadim-berman changed the title ~~get_num_bytes() falls back to number of codepoints?~~ get_num_bytes() falls back to the number of codepoints? Jun 24, 2018

DuffsDevice closed this as completed Jun 27, 2018

vadim-berman mentioned this issue Jan 2, 2020

utf8_string::get_num_bytes_from_start returns incorrect value #33

Closed
