Do not cache utf8 offsets for non-canonical lengths #18727

Leont · 2021-04-19T16:55:06Z

In particular, if the length is beyond the end, it should not be stored as the end.

This fixes #18588

hvds · 2021-04-19T19:09:34Z

Looking through this commit, I'm slightly confused - when is canonical_position ever true?

Leont · 2021-04-19T20:47:34Z

> Looking through this commit, I'm slightly confused - when is canonical_position ever true?

sv_pos_u2b_forwards contains this piece of code.

while (s < send && uoffset) {
    --uoffset;
    s += UTF8SKIP(s);
}

So either it stops because we're at the right offset, or because we're at the end of the string (or both, but that's really just a special case of the former).

In the former case, canonical_position would be true, in the latter it wouldn't.
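The two exit conditions can be sketched in isolation. This is a simplified illustration, not the actual Perl source: `utf8_skip` is a hypothetical stand-in for the UTF8SKIP macro, `pos_u2b_forwards` and its signature are assumed names, and the input is assumed to be well-formed UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for UTF8SKIP: byte length of the character
 * starting at *s, judged from its leading byte only. */
static size_t utf8_skip(const unsigned char *s)
{
    if (*s < 0x80) return 1;
    if ((*s & 0xE0) == 0xC0) return 2;
    if ((*s & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Walk forward up to uoffset characters, stopping at send.
 * Returns the byte offset reached. *canonical is true only when every
 * requested character was consumed, i.e. the loop did not stop merely
 * because the string ran out. */
static size_t pos_u2b_forwards(const unsigned char *start,
                               const unsigned char *send,
                               size_t uoffset, bool *canonical)
{
    const unsigned char *s = start;

    while (s < send && uoffset) {
        --uoffset;
        s += utf8_skip(s);
    }
    if (s > send)          /* clamp in case a truncated character overshot */
        s = send;

    *canonical = (uoffset == 0);
    return (size_t)(s - start);
}
```

For the four-byte string "a", "é", "b", asking for character offset 2 or 3 yields a canonical position, while asking for offset 5 stops at the end of the string with a non-canonical one, which is the case that presumably should not be cached.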

hvds · 2021-04-19T21:55:00Z

> > Looking through this commit, I'm slightly confused - when is canonical_position ever true?
>
> sv_pos_u2b_forwards contains this piece of code.
>
>     while (s < send && uoffset) {
>         --uoffset;
>         s += UTF8SKIP(s);
>     }
>
> So either it stops because we're at the right offset, or because we're at the end of the string (or both, but that's really just a special case of the former).
>
> In the former case, canonical_position would be true, in the latter it wouldn't.

Thanks, I see my mistake now - I had (repeatedly) read it as *canonical_position = uoffset = 0, rolling two assignments into a single statement rather than *canonical_position = uoffset == 0, a single assignment of the result of a comparison.

In part it was natural for me to read it the wrong way because I would always write such a construct as *canonical_position = (uoffset == 0); I think it would be a good idea to add the parens in this case too.
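To make the difference between the two readings concrete, here is a small sketch. The function names are hypothetical and exist only for illustration; the unparenthesized form in `comparison_result` mirrors the statement under discussion.

```c
#include <stdbool.h>
#include <stddef.h>

/* The statement as written: == binds tighter than =, so this is
 * flag = (*uoffset == 0). The offset itself is left untouched. */
static bool comparison_result(const size_t *uoffset)
{
    bool flag;
    flag = *uoffset == 0;
    return flag;
}

/* The misreading: a chained assignment. It zeroes the offset and the
 * flag is always false, whatever the offset was before. */
static bool chained_assignment(size_t *uoffset)
{
    bool flag;
    flag = *uoffset = 0;
    return flag;
}
```

Both forms compile cleanly, so the parentheses only help the human reader; the compiler parses the unparenthesized comparison the same way.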

khwilliamson · 2021-04-20T11:03:04Z

I had the same initial reaction as @hvds, but did spot the == when I looked more slowly. Parens would have helped me see it immediately. I would be more comfortable with a comment. Unless a bug fix is face-palm worthy, it wasn't obviously a problem, and won't be again 6 months from now.

Leont · 2021-04-20T19:03:00Z

Fair enough, I'll add it.

Leont force-pushed the leont/utf8-index branch from 19f71fa to 205a709 Compare April 19, 2021 17:08

Leont requested a review from khwilliamson April 19, 2021 17:10

Leont added the type-Unicode label Apr 26, 2021

Leont force-pushed the leont/utf8-index branch from 205a709 to 1a61ae0 Compare April 30, 2021 11:22

Leont force-pushed the leont/utf8-index branch from 1a61ae0 to 345942a Compare May 23, 2021 10:36

Commit cf86783: Do not cache utf8 offsets for non-canonical lengths

Leont force-pushed the leont/utf8-index branch from 345942a to cf86783 Compare May 23, 2021 10:44

Leont merged commit e6e9dd2 into blead May 24, 2021