Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_num_bytes() falls back to the number of codepoints? #14

Closed
vadim-berman opened this issue Jun 22, 2018 · 10 comments
Closed

get_num_bytes() falls back to the number of codepoints? #14

vadim-berman opened this issue Jun 22, 2018 · 10 comments

Comments

@vadim-berman
Copy link

vadim-berman commented Jun 22, 2018

Hi Jakob,

It's your fan club again.

Looks like there might be an issue in the non-sso mode for get_num_bytes(). In this case, I called substr() on a Russian string (92 bytes long, 52 code points) with a question mark in the middle, and got it butchered because get_num_bytes() returned 52.

I tried to figure out the logic following if (sso_inactive()) - that is, line 839 onwards - but couldn't, and instead substituted it by something crude but working:

	size_type byte_count = 0;
	for (size_type current_code_point = index; current_code_point< index + cp_count; current_code_point++)
		byte_count += get_codepoint_bytes(at(current_code_point));
	return byte_count;

I'm pretty sure it's not as optimised as what you planned originally but that's the best I could do...

I still get some parts butchered further down the line, for some reason, no idea why...

@DuffsDevice
Copy link
Owner

Dear Vadim,

Good to hear from you! ;)

Do you have a snippet with which I can reporduce the failiure?
The idea behind the referenced lines is the following: If the Look-Up-Table (which contains byte indices at which multbyte codepoints reside) is active and has length '0', then there are no codepoints with more than ony byte. That means: The number of bytes equals the number of codepoints. That's why it says "return cp_count;".

So I think this part of code is fine, but somehow the LUT size or the LUT mode is broken.
Can you follow the trail?
I can have a look at your code if you can strip it down - Maybe we're in luck and this bug got silently fixed by fixing #15 :)

@vadim-berman
Copy link
Author

vadim-berman commented Jun 24, 2018

Hi Jakob,

Thanks for the speedy fixes!

The good news: the #15 did fix some of the cases.
The bad news: not all :) .

I had to play a bit to reproduce where it fails, and in order for this to happen (return codepoints instead of the bytes), I had to:

  • initialise the string using utf8_string::append(utf8_string)
  • make the string longer than 50 characters
  • invoke substr on 50+ characters

I imagine it means sso=off?

Example code (sorry that it's Russian but I inserted a numeral to make it easier for you to track):

text_string test;
test.append(u8"длинная строка длиннее чем 50 знаков и с многобайтовыми знаками");
cout << subtest.substr(0, 52);

The result will be a string truncated after 29 characters, right after the numeral, with the last codepoint chopped off after the first byte.

In my case, append was unnecessary, so I replaced it with a normal assignment, and the problem was resolved. Calling a substring with a shorter argument also does not cause the issue.

@vadim-berman vadim-berman changed the title get_num_bytes() falls back to number of codepoints? get_num_bytes() falls back to the number of codepoints? Jun 24, 2018
@DuffsDevice
Copy link
Owner

Hi Vadim,

Thank you for the (very 👍) minimal example and also the hints abou the circumstances in which the error occours, this incredibly speeds up the debugging process!
I'll have a look hopefully tomorrow and will get back to you ASAP.

Have a nice sunday maybe whaching soccer :D

Jakob

@vadim-berman
Copy link
Author

Great - thanks, Jakob! Glad it helped.

Best regards, Vadim

@DuffsDevice
Copy link
Owner

Hi Vadim,

I have found and fixed the bug. Your bug has revealed two other ones in raw_replace and raw_insert!
Additionally, I also took the time to add logic that when deciding about whether or not to have a LUT takes into account whether there already is a lut and adjusts the threshold accordingly.

Thank you, your help has made finding the bug rather easy!

Best Regards
Jakob

@vadim-berman
Copy link
Author

vadim-berman commented Jun 28, 2018 via email

@DuffsDevice
Copy link
Owner

You definitely got a point. What lines would you try to outsource (regardless of in what way)?

@vadim-berman
Copy link
Author

vadim-berman commented Jun 28, 2018 via email

@DuffsDevice
Copy link
Owner

Yes, get_num_bytes and get_num_bytes_from_start share a lot of code, since the latter is a specialization of the former. Similarly append, raw_erase and raw_insert are all specializations of raw_replace. I decided to optimize them inside their own functions, which of course means redundancy to some extent :-|

@vadim-berman
Copy link
Author

vadim-berman commented Jun 30, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants