utf8_string::get_num_bytes_from_start returns incorrect value #33

vadim-berman · 2020-01-02T08:06:00Z

Hi Jakob,

Happy New Year!

Looks like the issue you kept fixing strikes again.

It's pretty much the same pattern: mostly plain Western European text with one multibyte interloper.

Similar to #14, but in a different place: get_num_bytes_from_start. I came across it when using find_first_of.

Another glitch, which may stem from the same piece of code, is that substr truncates the result. That said, if the block starting with if( utf8_string::is_lut_active( lut_iter ) )... in get_num_bytes_from_start is disabled, find_first_of returns the correct result.

Here is the sample snippet demonstrating both:

    utf8_string findFirstBug = u8"The project, therefore, “is an investment in the power of the adolescent girls which is so important to breaking the inter-generational transmission of poverty, violence, exclusion and discrimination in building our societies for a better future”";

    std::cout << "White space found at: " << findFirstBug.find_first_of(U" \t\r\n", 26) << std::endl; // returns 29 instead of 27
    std::cout << "Total len: " << findFirstBug.length() << " but the substring is truncated by 2 characters: " << findFirstBug.substr(0, 246) << std::endl;
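
For cross-checking the expected values without any lookup table, here is a minimal sketch of a plain linear scan. The name naive_num_bytes_from_start is hypothetical and is not part of tiny-utf8; it assumes well-formed UTF-8 and simply walks the lead bytes. The sample string is a shortened prefix of the one above, with the opening quotation mark (U+201C) written as its three UTF-8 bytes.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical reference helper (not tiny-utf8's implementation): number of
// bytes occupied by the first `cp_count` code points of well-formed UTF-8.
constexpr std::size_t naive_num_bytes_from_start(std::string_view s, std::size_t cp_count)
{
    std::size_t byte_idx = 0;
    while (cp_count > 0 && byte_idx < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[byte_idx]);
        std::size_t len = 1;            // ASCII (or a stray continuation byte)
        if (lead >= 0xF0)      len = 4; // 11110xxx
        else if (lead >= 0xE0) len = 3; // 1110xxxx
        else if (lead >= 0xC0) len = 2; // 110xxxxx
        byte_idx += len;
        --cp_count;
    }
    // Clamp in case the last sequence was cut off by the end of the buffer.
    return byte_idx < s.size() ? byte_idx : s.size();
}

// Shortened prefix of the sample text; U+201C is spelled out as E2 80 9C.
constexpr std::string_view sample =
    "The project, therefore, " "\xE2\x80\x9C" "is an investment";

// 24 ASCII characters, then the 3-byte quote, then 'i' and 's':
// code point 27 (the space) starts at byte offset 29.
static_assert(naive_num_bytes_from_start(sample, 27) == 29, "byte offset of code point 27");
```

In this prefix the quotation mark is the only multibyte character, so code point index 27 sits at byte offset 29. That matches the wrong value reported above, which may or may not be a coincidence.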

BTW, I see that we talked about code reuse in #14. I wonder if you could encapsulate the LUT block that you use in several functions; it seems like it might save some effort in the future.


vadim-berman · 2020-01-03T06:23:01Z

Update: a quick workaround for both glitches, and for some additional ones, is to have is_lut_active return false, which essentially disables the LUT.

DuffsDevice · 2020-01-06T11:52:56Z

Hi Vadim,

Happy New Year! 🍾
The issue was a really minor one, but quite significant!
Have a look at the commit. The issue was that the maximum value of a uint8 is 255, but that is all you need to address a buffer of 256 bytes 😜
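
To illustrate the boundary described above, here is a rough sketch of picking an index width from a buffer size. The helper index_width_for is hypothetical and is not taken from the commit; it only assumes that the stored values are byte offsets into the buffer, so the largest value to represent is buffer_size - 1 rather than buffer_size.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper (an assumption, not tiny-utf8's code): width in bytes
// of the smallest unsigned type able to hold every offset into a buffer of
// `buffer_size` bytes.
constexpr std::size_t index_width_for(std::size_t buffer_size)
{
    // Valid offsets are 0 .. buffer_size - 1, so compare the largest offset,
    // not the size itself, against each type's maximum.
    const std::size_t max_offset = buffer_size > 0 ? buffer_size - 1 : 0;
    if (max_offset <= std::numeric_limits<std::uint8_t>::max())  return sizeof(std::uint8_t);
    if (max_offset <= std::numeric_limits<std::uint16_t>::max()) return sizeof(std::uint16_t);
    if (max_offset <= std::numeric_limits<std::uint32_t>::max()) return sizeof(std::uint32_t);
    return sizeof(std::uint64_t);
}

// A 256-byte buffer still fits one-byte offsets; 257 bytes does not.
static_assert(index_width_for(256) == 1, "offsets 0..255 fit in uint8");
static_assert(index_width_for(257) == 2, "offset 256 needs uint16");
```

If different functions draw this line at different sizes (for example, one at 255 and another at 256), they would disagree about the entry width for buffers right at the boundary, which is the kind of off-by-one this sketch is meant to show.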

Thanks for pointing this out, it helped to improve correctness and code safety a lot!

Cheers!
Jakob

vadim-berman · 2020-01-06T12:08:51Z

Great, thank you very much for the prompt turnaround, Jakob! I'll test it tomorrow.

DuffsDevice · 2020-01-06T12:15:44Z

You're very welcome! 😃

vadim-berman · 2020-01-07T01:17:47Z

Excellent. All the test cases worked, thank you!

DuffsDevice · 2020-01-07T06:12:51Z

Super 👍

DuffsDevice closed this as completed in 4fc49c8 Jan 6, 2020

DuffsDevice added the bug label Jan 7, 2020
