get_num_bytes() falls back to the number of codepoints? #14
Comments
Dear Vadim, Good to hear from you! ;) Do you have a snippet with which I can reproduce the failure? I think this part of the code is fine, but somehow the LUT size or the LUT mode is broken.
Hi Jakob, Thanks for the speedy fixes! The good news: #15 did fix some of the cases. I had to experiment a bit to find where it still fails, and for that to happen (codepoints returned instead of bytes), I had to:
I imagine it means sso=off? Example code (sorry that it's Russian but I inserted a numeral to make it easier for you to track):
The result will be a string truncated after 29 characters, right after the numeral, with the last codepoint chopped off after the first byte. In my case, |
Hi Vadim, Thank you for the (very 👍) minimal example and also the hints about the circumstances in which the error occurs; this speeds up the debugging process incredibly! Have a nice Sunday, maybe watching soccer :D Jakob
Great - thanks, Jakob! Glad it helped. Best regards, Vadim |
Hi Vadim, I have found and fixed the bug. Your bug has revealed two other ones in raw_replace and raw_insert! Thank you, your help has made finding the bug rather easy! Best Regards |
Hi Jakob,
Excellent! Glad it helped.
Thanks a lot for maintaining the library.
PS. I don’t know whether this is unwanted advice, but while browsing the code, I noticed that pieces with similar or identical logic are often duplicated (on the evils of duplication: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself; I also definitely recommend The Pragmatic Programmer). I imagine this is because you want the library to be highly optimised. If the level of optimisation is high enough for function calls to play a role, then maybe you can consider macros?
Best regards, Vadim
From: Jakob Riedle <notifications@github.com>
Sent: Thursday, 28 June 2018 2:19 AM
To: DuffsDevice/tinyutf8 <tinyutf8@noreply.github.com>
Cc: vadim-berman <vadim.berman@gmail.com>; Author <author@noreply.github.com>
Subject: Re: [DuffsDevice/tinyutf8] get_num_bytes() falls back to the number of codepoints? (#14)
Hi Vadim,
I have found and fixed the bug. Your bug has revealed two other ones in raw_replace and raw_insert!
Additionally, I took the time to add logic so that the decision about whether or not to have a LUT takes into account whether a LUT already exists and adjusts the threshold accordingly.
Thank you, your help has made finding the bug rather easy!
Best Regards
Jakob
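One plausible reading of the LUT change described above is a hysteresis rule: a string that already has a LUT keeps it down to a lower size than the size at which a new LUT would be created, so the table is not repeatedly built and dropped near the boundary. The sketch below illustrates that idea only. The function name wants_lut and both threshold values are hypothetical and are not taken from tinyutf8, and the direction of the adjustment is an assumption.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

// Hypothetical thresholds in bytes; NOT the values tinyutf8 uses.
constexpr std::size_t lut_create_threshold = 64; // build a LUT at or above this size
constexpr std::size_t lut_keep_threshold   = 32; // keep an existing LUT at or above this size

static_assert(lut_keep_threshold <= lut_create_threshold,
              "keeping a LUT must not be harder than creating one");

// Decide whether a buffer of `buffer_bytes` bytes should carry a LUT,
// given whether it currently has one (assumed hysteresis rule).
inline bool wants_lut(std::size_t buffer_bytes, bool has_lut) {
    const std::size_t threshold = has_lut ? lut_keep_threshold
                                          : lut_create_threshold;
    return buffer_bytes >= threshold;
}

} // namespace sketch
```

With these illustrative numbers, a 48-byte buffer keeps a LUT it already has but would not get a new one.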
You definitely have a point. Which lines would you try to factor out (regardless of how)?
Thanks, Jakob.
Off the top of my head:
* get_num_bytes_from_start() should probably make use of get_num_bytes() – currently it has its own logic
* there are multiple calls to get_codepoint_bytes() for the same reason, basically converting code points to bytes
These are the ones that I saw. It could be the other way around (bytes to codepoints), too.
Best regards, Vadim
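The first suggestion above, having get_num_bytes_from_start() reuse get_num_bytes(), could look roughly like the following. This is a sketch over a plain std::string, not the library's code: the function names mirror the thread, but the signatures and bodies are assumptions made for illustration, and no lookup table is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {

// General case: number of bytes spanned by `cp_count` code points,
// starting at byte offset `byte_start`. Illustrative signature only.
inline std::size_t get_num_bytes(const std::string& s,
                                 std::size_t byte_start,
                                 std::size_t cp_count) {
    assert(byte_start <= s.size());
    std::size_t pos = byte_start;
    while (cp_count > 0 && pos < s.size()) {
        ++pos; // consume the lead byte
        // skip continuation bytes (bit pattern 10xxxxxx)
        while (pos < s.size() &&
               (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80)
            ++pos;
        --cp_count;
    }
    return pos - byte_start;
}

// Special case expressed through the general one instead of duplicating it.
inline std::size_t get_num_bytes_from_start(const std::string& s,
                                            std::size_t cp_count) {
    return get_num_bytes(s, 0, cp_count);
}

} // namespace sketch
```

Whether the extra call costs anything measurable depends on inlining, which is the trade-off being discussed in this thread.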
Yes, get_num_bytes and get_num_bytes_from_start share a lot of code, since the latter is a specialization of the former. Similarly append, raw_erase and raw_insert are all specializations of raw_replace. I decided to optimize them inside their own functions, which of course means redundancy to some extent :-| |
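To make the "specialization" relationship concrete, here is a sketch in which insert, erase and append are thin wrappers around a single replace primitive. It operates on std::string and byte offsets purely for illustration; the function names echo the thread, but the signatures are assumptions and tinyutf8 deliberately does not implement them this way, preferring a separately optimised body per function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {

// General primitive: replace `len` bytes at byte offset `pos` with `repl`.
inline void raw_replace(std::string& s, std::size_t pos, std::size_t len,
                        const std::string& repl) {
    assert(pos <= s.size());
    s.replace(pos, len, repl); // std::string clamps `len` to the end of s
}

// Insert = replace an empty range.
inline void raw_insert(std::string& s, std::size_t pos,
                       const std::string& text) {
    raw_replace(s, pos, 0, text);
}

// Erase = replace a range with nothing.
inline void raw_erase(std::string& s, std::size_t pos, std::size_t len) {
    raw_replace(s, pos, len, std::string());
}

// Append = insert at the end.
inline void append(std::string& s, const std::string& text) {
    raw_replace(s, s.size(), 0, text);
}

} // namespace sketch
```

The wrappers remove the duplication at the price of running the general code path for every operation.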
Understood. Thanks for making the effort to review!
Best regards, Vadim
Hi Jakob,
It's your fan club again.
Looks like there might be an issue in the non-SSO mode for get_num_bytes(). In this case, I called substr() on a Russian string (92 bytes long, 52 code points) with a question mark in the middle, and got it butchered because get_num_bytes() returned 52.
I tried to figure out the logic following if (sso_inactive()) - that is, line 839 onwards - but couldn't, and instead substituted it with something crude but working:
I'm pretty sure it's not as optimised as what you planned originally but that's the best I could do...
I still get some parts butchered further down the line, for some reason, no idea why...
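The replacement code mentioned in the report is not preserved in this thread. For readers who want to see the distinction at stake, the following is a minimal sketch of a linear byte count for a given number of code points, decoded from the UTF-8 lead bytes. The helper names utf8_sequence_length and bytes_for_codepoints are hypothetical and are not tinyutf8 functions; the code works on a plain std::string and uses no lookup table.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: length in bytes of the UTF-8 sequence introduced by
// lead byte `b`. Invalid lead bytes count as 1 so the scan always advances.
inline std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                          // stray continuation or invalid byte
}

// Hypothetical helper: bytes occupied by `cp_count` code points starting at
// byte offset `byte_start`. Runs in time linear in the bytes scanned.
inline std::size_t bytes_for_codepoints(const std::string& s,
                                        std::size_t byte_start,
                                        std::size_t cp_count) {
    assert(byte_start <= s.size());
    std::size_t pos = byte_start;
    while (cp_count > 0 && pos < s.size()) {
        pos += utf8_sequence_length(static_cast<unsigned char>(s[pos]));
        --cp_count;
    }
    if (pos > s.size()) pos = s.size(); // final sequence was truncated
    return pos - byte_start;
}
```

For a string mixing two-byte Cyrillic letters with a one-byte question mark, the byte count exceeds the code point count, which is why returning the latter in place of the former cuts a character in half.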