Remove some UTF8SKIPs #23698
Merged
3554bf6 to 2e4b43d
2e4b43d to 404e163
tonycoz
reviewed
Sep 11, 2025
tonycoz
reviewed
Sep 11, 2025
This moves the trivial case to before the complicated one, which is easier to comprehend. And instead of complementing the conditional, use a different name (that evaluates to that complement) which makes it clearer what's going on.
This cleans up some ragged edges and makes things fit in 80 columns.
The && in this expression already makes the result a boolean; no need to cast it to such. Removing it allows the entire expression to fit on one line.
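As a small illustration of why the cast is redundant: in C the `&&` operator always yields an `int` that is exactly 0 or 1. The flag test below is hypothetical and is not the expression changed in this commit; it only shows that the cast and uncast spellings agree.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag test, not code from this PR. */
static bool both_set_cast(unsigned flags, unsigned a, unsigned b)
{
    return (bool)((flags & a) && (flags & b));  /* cast adds nothing */
}

static bool both_set(unsigned flags, unsigned a, unsigned b)
{
    return (flags & a) && (flags & b);          /* && already yields 0 or 1 */
}

/* The two spellings agree for any input. */
static void check_same(unsigned flags)
{
    assert(both_set(flags, 0x2, 0x4) == both_set_cast(flags, 0x2, 0x4));
}
```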
This will be useful in the next commits.
Instead of a bool, they will now return the number of bytes that comprise the character being checked. So the result can be used as a bool, just as before; or the extra information can save recalculations, as done in the future commits.
Or 0 when the character isn't of type FOO. This allows these macros to be used as booleans, as previously, or to tell you how many bytes there are in the matched UTF-8 character. This was always trivially the case for ASCII-range characters, as the former boolean 0 or 1 was already the correct length when they matched. The previous commit extended this to return the length for above-Latin1 characters. This commit is the final piece. Latin1 characters that aren't ASCII are always two bytes, so just multiply the boolean result by 2, yielding 0 if there is no match or 2 if there is.
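The three ranges described above can be sketched as a plain function. This is only an illustration on an ASCII platform: `toy_letter_len` and its toy "letter" class (ASCII letters, U+00C0..U+00FF other than U+00D7 and U+00F7, and Hiragana U+3041..U+3096) are assumptions made for the example, not the real isFOO_utf8_safe macros or a real Unicode property.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an isFOO_utf8_safe-style check.
 * Returns the byte length of the character at s if it is in the toy class,
 * or 0 if it is not (or if the input is truncated or malformed). */
static size_t toy_letter_len(const unsigned char *s, const unsigned char *e)
{
    assert(s < e);

    if (s[0] < 0x80) {
        /* ASCII: the 0/1 boolean already equals the byte length. */
        return (size_t)((s[0] >= 'A' && s[0] <= 'Z')
                     || (s[0] >= 'a' && s[0] <= 'z'));
    }

    if (e - s < 2 || (s[1] & 0xC0) != 0x80)
        return 0;                       /* truncated or malformed */

    if (s[0] == 0xC2 || s[0] == 0xC3) {
        /* Non-ASCII Latin1 is always two bytes: scale the boolean by 2.
         * 0xC3 0x97 is U+00D7 and 0xC3 0xB7 is U+00F7. */
        return 2 * (size_t)(s[0] == 0xC3 && s[1] != 0x97 && s[1] != 0xB7);
    }

    /* Above Latin1: Hiragana is E3 81 81 through E3 82 96, three bytes. */
    if (s[0] == 0xE3 && e - s >= 3 && (s[2] & 0xC0) == 0x80) {
        if (s[1] == 0x81) return 3 * (size_t)(s[2] >= 0x81);
        if (s[1] == 0x82) return 3 * (size_t)(s[2] <= 0x96);
    }
    return 0;
}
```

A caller can still write `if (toy_letter_len(s, e))`, and one that needs to advance can keep the result instead of computing the length again.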
This value is now returned from the isSPACE_utf8_safe macro. Use it instead of re-deriving it.
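A minimal before/after sketch of the caller-side change, assuming an ASCII platform. Every name here is hypothetical: `sketch_space_len` stands in for a length-returning space check (covering only a handful of space characters), and `sketch_skip` stands in for a UTF8SKIP-style lookup on the lead byte.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length implied by a lead byte (stand-in for a UTF8SKIP-style lookup). */
static size_t sketch_skip(unsigned char lead)
{
    if (lead < 0xC0) return 1;   /* ASCII, or a stray continuation byte */
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    return 4;
}

/* Length of the space character at s, or 0 if it is not one. Only ASCII
 * whitespace, U+0085, U+00A0, U+2028 and U+2029 are recognized here. */
static size_t sketch_space_len(const unsigned char *s, const unsigned char *e)
{
    assert(s < e);
    if (s[0] < 0x80)
        return (size_t)(s[0] == ' ' || (s[0] >= '\t' && s[0] <= '\r'));
    if (e - s >= 2 && s[0] == 0xC2)
        return 2 * (size_t)(s[1] == 0x85 || s[1] == 0xA0);
    if (e - s >= 3 && s[0] == 0xE2 && s[1] == 0x80)
        return 3 * (size_t)(s[2] == 0xA8 || s[2] == 0xA9);
    return 0;
}

/* Before: treat the result as a boolean, then re-derive the length. */
static const unsigned char *skip_spaces_rederive(const unsigned char *s,
                                                 const unsigned char *e)
{
    while (s < e && sketch_space_len(s, e))
        s += sketch_skip(*s);
    return s;
}

/* After: advance by the length the check already computed. */
static const unsigned char *skip_spaces_reuse(const unsigned char *s,
                                              const unsigned char *e)
{
    size_t len;
    while (s < e && (len = sketch_space_len(s, e)) != 0)
        s += len;
    return s;
}
```

Both loops stop at the same place; the second one avoids the extra lookup per character.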
This value is now returned from the isID(FIRST|CONT)_utf8_safe macros. Use it instead of re-deriving it.
This value is now returned from the isID(FIRST|CONT)_lazy_if_safe macros. Use it instead of re-deriving it. This also simplifies the code.
The previous commit removed a surrounding block; outdent correspondingly.
This value is now returned from the isID(FIRST|CONT)_lazy_if_safe macros. Use it instead of re-deriving it.
404e163 to 8887d78
I decided to split the part that makes behavior changes out for a later p.r.
tonycoz
approved these changes
Sep 15, 2025
Functions that verify something is UTF-8 necessarily parse the whole thing, so they have its byte length at their fingertips. Many of those functions return that length when the input is valid, or 0 if not. Thus they can be used as bools (0 versus non-zero), but they also let the caller use the count instead of re-deriving it.
This p.r. changes the functions that returned a simple boolean to instead return the count. Previously they were just discarding that number.
It then changes the callers of these that re-derive that value to use the returned count instead.
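To make the "length or 0" convention concrete, here is a sketch of a single-character validator. `sketch_valid_utf8_char_len` is a hypothetical name, and the function follows strict Unicode rules (no surrogates, nothing above U+10FFFF), which is an assumption for the example and not a description of the Perl core functions touched here.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the byte length of the well-formed UTF-8 character starting at s,
 * or 0 if the bytes in [s, e) do not start with one. */
static size_t sketch_valid_utf8_char_len(const unsigned char *s,
                                         const unsigned char *e)
{
    size_t len, i;
    unsigned long cp;

    assert(s <= e);
    if (s == e) return 0;
    if (s[0] < 0x80) return 1;

    if (s[0] >= 0xC2 && s[0] <= 0xDF)      { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0)        { len = 3; cp = s[0] & 0x0F; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { len = 4; cp = s[0] & 0x07; }
    else return 0;  /* continuation byte, overlong C0/C1 lead, or above F4 */

    if ((size_t)(e - s) < len) return 0;   /* truncated */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) return 0;
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;

    return len;   /* the parse already knows the length, so hand it back */
}
```

A caller that only wants a yes/no answer tests the result for non-zero; a caller that wants to step forward adds it to its pointer.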