@khwilliamson khwilliamson commented Sep 9, 2025

Functions that verify that something is UTF-8 necessarily parse the whole thing, so they have its byte length at their fingertips. Many of those functions return that length when the input is valid, or 0 if not. Thus they can be used as bools (0 versus non-zero), while also letting the caller use the count instead of re-deriving it.
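
The idea can be sketched in C roughly as follows. This is a simplified illustration, not Perl's actual code: `utf8_char_len` and `utf8_valid_len` are hypothetical names, and the check covers only standard UTF-8 of up to 4 bytes (it verifies lead and continuation byte structure but does not reject surrogates or overlong 3- and 4-byte forms).

```c
#include <stddef.h>

/* Hypothetical helper (not a Perl API): number of bytes in the structurally
 * well-formed UTF-8 character starting at s, never reading at or past e;
 * 0 if there is no such character. */
static size_t utf8_char_len(const unsigned char *s, const unsigned char *e)
{
    size_t len, i;

    if (s >= e) return 0;
    if (s[0] < 0x80) return 1;                        /* ASCII */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) len = 2;   /* C0, C1 are overlong */
    else if (s[0] >= 0xE0 && s[0] <= 0xEF) len = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) len = 4;
    else return 0;                                    /* invalid lead byte */

    if ((size_t)(e - s) < len) return 0;              /* truncated */
    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;          /* not a continuation */
    return len;
}

/* Hypothetical helper: total byte length of s..e if every character is
 * well-formed, else 0.  Note that an empty string also yields 0. */
static size_t utf8_valid_len(const unsigned char *s, const unsigned char *e)
{
    const unsigned char *p = s;

    while (p < e) {
        size_t n = utf8_char_len(p, e);
        if (n == 0) return 0;
        p += n;
    }
    return (size_t)(p - s);
}
```

Either function can still be used directly in an `if`, since any valid non-empty input gives a non-zero result.
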

This p.r. changes the functions that returned a simple boolean to instead return the count. Previously they were just discarding that number.

It then changes the callers of these that re-derive that value to use the returned count instead.

  • This set of changes does not require a perldelta entry.

@tonycoz tonycoz mentioned this pull request Sep 11, 2025
This moves the trivial case to before the complicated one, which is
easier to comprehend.  And instead of complementing the conditional, use
a different name (that evaluates to that complement) which makes it
clearer what's going on.
This cleans up some ragged edges and makes things fit in 80 columns.
The && in this expression already makes the result a boolean; no need to
cast it to such.  Removing it allows the entire expression to fit on one
line.
This will be useful in the next commits.
Instead of a bool, they will now return the number of bytes that
comprise the character being checked.  So the result can be used as a
bool, just as before; or the extra information can save recalculations,
as is done in later commits.
They return 0 when the character isn't of type FOO.  This allows these
macros to be used as booleans, as previously; or to give you how many
bytes there are in the matched UTF-8 character.

This was always trivially the case for ASCII-range characters, as the
former boolean 0,1 gave you the correct length if they matched.

The previous commit extended this to return the length for above-Latin1
characters.

This commit is the final piece.  Latin1 characters that aren't ASCII
are always two bytes in UTF-8.  So just multiply the boolean result by 2,
yielding 0 if no match or 2 bytes if matched.
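
A minimal sketch of the multiply-by-2 step, under stated assumptions: `space_utf8_len`, `is_ascii_space` and `is_latin1_space` are hypothetical names rather than the real macros, the whitespace sets are illustrative, and the above-Latin1 branch handled by the previous commit is left out (this sketch simply returns 0 there).

```c
#include <stddef.h>

/* Illustrative boolean (0 or 1) classifiers on single code points. */
static int is_ascii_space(unsigned c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

static int is_latin1_space(unsigned cp)
{
    return cp == 0x85 || cp == 0xA0;      /* NEL, NO-BREAK SPACE */
}

/* Hypothetical stand-in for an isSPACE-style check: returns the byte
 * length of the space character at s, or 0 if there is none. */
static size_t space_utf8_len(const unsigned char *s, const unsigned char *e)
{
    if (s >= e) return 0;

    /* ASCII: the boolean 0/1 already is the correct length. */
    if (s[0] < 0x80)
        return (size_t) is_ascii_space(s[0]);

    /* Non-ASCII Latin1: lead byte C2 or C3 plus one continuation byte,
     * so a match is always 2 bytes.  Multiply the boolean by 2. */
    if ((s[0] == 0xC2 || s[0] == 0xC3)
        && e - s >= 2 && (s[1] & 0xC0) == 0x80)
    {
        unsigned cp = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
        return 2 * (size_t) is_latin1_space(cp);
    }

    return 0;   /* above-Latin1 characters are omitted from this sketch */
}
```
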
This value is now returned from the isSPACE_utf8_safe macro.  Use it
instead of re-deriving it.
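
On the caller side, the change might look something like the sketch below. `space_len` is a hypothetical stand-in for the macro (it knows only a few ASCII spaces and U+00A0), and `skip_spaces` is an invented caller; the point is only that the loop advances by the returned count rather than calling something like `UTF8SKIP` to work the length out a second time.

```c
#include <stddef.h>

/* Hypothetical stand-in classifier: byte length of the space character
 * at s, or 0.  Recognizes only space, tab, newline and U+00A0. */
static size_t space_len(const unsigned char *s, const unsigned char *e)
{
    if (s >= e) return 0;
    if (s[0] == ' ' || s[0] == '\t' || s[0] == '\n') return 1;
    if (e - s >= 2 && s[0] == 0xC2 && s[1] == 0xA0) return 2;
    return 0;
}

/* Invented caller: returns a pointer to the first non-space character.
 * The result of the check doubles as loop condition and as step size. */
static const unsigned char *
skip_spaces(const unsigned char *s, const unsigned char *e)
{
    size_t len;

    while ((len = space_len(s, e)) != 0)
        s += len;
    return s;
}
```
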
This value is now returned from the isID(FIRST|CONT)_utf8_safe macros.
Use it instead of re-deriving it.
This value is now returned from the isID(FIRST|CONT)_lazy_if_safe macros.
Use it instead of re-deriving it.  This also simplifies the code.
The previous commit removed a surrounding block; outdent correspondingly.
This value is now returned from the isID(FIRST|CONT)_lazy_if_safe macros.
Use it instead of re-deriving it.
@khwilliamson khwilliamson changed the title from "Remove some UTF8SKIPs; use isIDCONT" to "Remove some UTF8SKIPs" Sep 15, 2025
@khwilliamson

I decided to split the part that makes behavior changes out for a later p.r.

@khwilliamson khwilliamson merged commit 5ea209f into Perl:blead Sep 15, 2025
33 checks passed