Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not cache utf8 offsets for non-canonical lengths #18727

Merged
merged 1 commit into from May 24, 2021
Merged

Conversation

Leont
Copy link
Contributor

@Leont Leont commented Apr 19, 2021

In particular, if the length is beyond the end, it should not be stored as the end.

This fixes #18588

@hvds
Copy link
Contributor

hvds commented Apr 19, 2021

Looking through this commit, I'm slightly confused - when is canonical_position ever true?

@Leont
Copy link
Contributor Author

Leont commented Apr 19, 2021

Looking through this commit, I'm slightly confused - when is canonical_position ever true?

sv_pos_u2b_forwards contains this piece of code.

while (s < send && uoffset) {
    --uoffset;
    s += UTF8SKIP(s);
}

So either it stops because we're at the right offset, or because we're at the end of the string (or both, but that's really just a special case of the former).

In the former case, canonical_position would be true, in the latter it wouldn't.

@hvds
Copy link
Contributor

hvds commented Apr 19, 2021

Looking through this commit, I'm slightly confused - when is canonical_position ever true?

sv_pos_u2b_forwards contains this piece of code.

while (s < send && uoffset) {
    --uoffset;
    s += UTF8SKIP(s);
}

So either it stops because we're at the right offset, or because we're at the end of the string (or both, but that's really just a special case of the former).

In the former case, canonical_position would be true, in the latter it wouldn't.

Thanks, I see my mistake now - I had (repeatedly) read it as *canonical_position = uoffset = 0, rolling two assignments into a single statement rather than *canonical_position = uoffset == 0, a single assignment of the result of a comparison.

In part it was natural for me to read it the wrong way because I would always write such a construct as *canonical_position = (uoffset == 0); I think it would be a good idea to add the parens in this case too.

@khwilliamson
Copy link
Contributor

I had the same initial reaction as @hvds, but did spot the == when I looked more slowly. Parens would have helped me see it immediately. I would be more comfortable with a comment. Unless a bug fix is face-palm worthy, it wasn't obviously a problem, and won't be again 6 months from now

@Leont
Copy link
Contributor Author

Leont commented Apr 20, 2021

Fair enough, I'll add it

In particular, if the length is beyond the end, it should not be stored as the end.
@Leont Leont merged commit e6e9dd2 into blead May 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Search for empty substring with index returns different results with utf8 magic
3 participants