utf8_string::get_num_bytes_from_start returns incorrect value #33

Closed
vadim-berman opened this issue Jan 2, 2020 · 6 comments
vadim-berman commented Jan 2, 2020

Hi Jakob,

Happy New Year!

Looks like the issue you kept fixing has struck again.

It's pretty much the same pattern: mostly plain Western European text with one multibyte interloper.

Similar to #14, but in a different place: get_num_bytes_from_start. I came across it when using find_first_of.

Another glitch, which may stem from the same piece of code, is that substr truncates the result. That said, if the block starting with if( utf8_string::is_lut_active( lut_iter ) )... in get_num_bytes_from_start is disabled, find_first_of returns the correct result.

Here is the sample snippet demonstrating both:

    // Mostly ASCII text with multibyte curly quotes (“ and ”)
    utf8_string findFirstBug = u8"The project, therefore, “is an investment in the power of the adolescent girls which is so important to breaking the inter-generational transmission of poverty, violence, exclusion and discrimination in building our societies for a better future”";

    std::cout << "White space found at: " << findFirstBug.find_first_of(U" \t\r\n", 26) << std::endl; // returns 29 instead of 27
    std::cout << "Total len: " << findFirstBug.length() << " but the substring is truncated by 2 characters: " << findFirstBug.substr(0, 246) << std::endl;
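
For cross-checking the expected values, here is a minimal sketch of a LUT-free reference that walks the UTF-8 lead bytes linearly. It is not the library's implementation: the names utf8_sequence_length, bytes_from_start and find_first_ascii_space are hypothetical, it works on a plain std::string, and it assumes well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence introduced by lead byte `b`.
// Assumes well-formed input; stray continuation bytes are counted as 1.
std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;
}

// Byte offset at which code point number `cp_index` starts.
// Linear scan from the beginning, no lookup table involved.
std::size_t bytes_from_start(const std::string& s, std::size_t cp_index) {
    std::size_t offset = 0;
    while (cp_index > 0 && offset < s.size()) {
        offset += utf8_sequence_length(static_cast<unsigned char>(s[offset]));
        --cp_index;
    }
    return offset < s.size() ? offset : s.size();  // clamp to the end
}

// Code point index of the first ASCII whitespace at or after code point
// `start_cp`, or std::string::npos if there is none.
std::size_t find_first_ascii_space(const std::string& s, std::size_t start_cp) {
    std::size_t offset = bytes_from_start(s, start_cp);
    std::size_t cp = start_cp;
    while (offset < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[offset]);
        if (b == ' ' || b == '\t' || b == '\r' || b == '\n') return cp;
        offset += utf8_sequence_length(b);
        ++cp;
    }
    return std::string::npos;
}
```

On the beginning of the sample string ("The project, therefore, “is an investment"), bytes_from_start(s, 26) gives 28 and find_first_ascii_space(s, 26) gives 27. The value 29 is the byte offset of that same space, since the opening quote “ occupies three bytes.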

BTW, I see that we were talking about code reuse in #14. I am wondering if you could encapsulate that LUT block that you use in several functions; it seems like it might save some effort in the future.

@vadim-berman
Author

Update: a quick workaround for both glitches, and some additional ones, is to make is_lut_active return false, which essentially disables the LUT.

@DuffsDevice
Owner

DuffsDevice commented Jan 6, 2020

Hi Vadim,

Happy New Year! 🍾
The issue was a really minor one, but with significant consequences!
Have a look at the commit. The problem was that the maximum uint8 value is 255, and that is exactly the largest index you need to address a buffer of 256 bytes 😜
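
The off-by-one described here can be sketched as follows. This is only an illustration of the boundary condition, not the code from the commit: index_width_for is a hypothetical helper, and the thread does not say in which direction the original comparison was off.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper: smallest index width (in bytes) that can address
// every position in a buffer of `buffer_size` bytes.
std::size_t index_width_for(std::size_t buffer_size) {
    // Valid indices are 0 .. buffer_size - 1, so the largest index, not the
    // buffer size, has to be compared against each type's maximum.
    const std::size_t max_index = buffer_size == 0 ? 0 : buffer_size - 1;
    if (max_index <= std::numeric_limits<std::uint8_t>::max())
        return sizeof(std::uint8_t);   // buffers of up to 256 bytes
    if (max_index <= std::numeric_limits<std::uint16_t>::max())
        return sizeof(std::uint16_t);  // buffers of up to 65536 bytes
    if (max_index <= std::numeric_limits<std::uint32_t>::max())
        return sizeof(std::uint32_t);
    return sizeof(std::uint64_t);
}
```

With this rule, a 256-byte buffer still gets one-byte indices, and the switch to two-byte indices happens at 257 bytes. Comparing the buffer size itself against 255 would move that boundary by one.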

Thanks for pointing this out, it helped to improve correctness and code safety a lot!

Cheers!
Jakob

@vadim-berman
Author

Great, thank you very much for the prompt turnaround, Jakob! I'll test it tomorrow.

@DuffsDevice
Owner

You're very welcome! 😃

@vadim-berman
Author

Excellent. All the test cases worked, thank you!

@DuffsDevice
Owner

Super 👍

DuffsDevice added the bug label Jan 7, 2020