Remove some UTF8SKIPs #23698

khwilliamson · 2025-09-09T00:26:05Z

Functions that verify that something is UTF-8 necessarily parse the whole thing, so they have its byte length at their fingertips. Many of those functions return that length when the input is valid, or 0 if not. Thus they can be used as booleans (0 is false, non-zero is true), while also letting the caller use the count instead of re-deriving it.

This p.r. changes the functions that returned a simple boolean to instead return the count. Previously they were just discarding that number.

It then changes the callers of these that re-derive that value to use the returned count instead.
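The pattern can be sketched as follows. This is only an illustration, not the code in the p.r.: `is_space_len` and `skip_space` are hypothetical names, the classifier recognizes just a handful of space characters, and an ASCII (non-EBCDIC) platform is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a "safe" UTF-8 classifier.  Returns the byte
 * length of the space character starting at s, or 0 if there is none (or
 * if the character would run past e).  Only a few spaces are recognized. */
static size_t is_space_len(const unsigned char *s, const unsigned char *e)
{
    assert(s != NULL && e != NULL);

    if (s >= e)
        return 0;

    /* ASCII space, or TAB through CR: one byte */
    if (*s == ' ' || (*s >= '\t' && *s <= '\r'))
        return 1;

    /* U+0085 NEXT LINE and U+00A0 NO-BREAK SPACE: two bytes */
    if (e - s >= 2 && s[0] == 0xC2 && (s[1] == 0x85 || s[1] == 0xA0))
        return 2;

    /* U+2003 EM SPACE: three bytes */
    if (e - s >= 3 && s[0] == 0xE2 && s[1] == 0x80 && s[2] == 0x83)
        return 3;

    return 0;
}

/* Caller: the return value serves both as the truth test and as the
 * amount to advance, so no separate length lookup is needed. */
static const unsigned char *skip_space(const unsigned char *s,
                                       const unsigned char *e)
{
    size_t len;

    while ((len = is_space_len(s, e)) != 0)
        s += len;

    return s;
}
```

A caller that previously tested the boolean and then looked up the character's length separately can now do both with the single call, as `skip_space` does.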

This set of changes does not require a perldelta entry.

Instead of a bool, these functions now return the number of bytes that make up the character being checked. The result can still be used as a bool, just as before, or the extra information can save recalculation, as is done in later commits.

These macros now return the byte length of the matched character, or 0 when the character isn't of type FOO. This allows them to be used as booleans, as previously, or to tell you how many bytes there are in the matched UTF-8 character. This was always trivially the case for ASCII-range characters, since the former boolean 0 or 1 was already the correct length on a match. The previous commit extended this to return the length for above-Latin1 characters. This commit is the final piece. Latin1 characters that aren't ASCII are always two bytes, so just multiply the boolean result by 2, yielding 0 if there is no match or 2 if there is.
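The three ranges can be sketched like this. All names here are hypothetical, the classifiers are deliberately simplified (the Latin1 test ignores U+00AA, U+00B5 and U+00BA, and the above-Latin1 test knows only a few characters), and an ASCII platform is assumed. It is not the handy.h implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical boolean classifiers for the two single-byte-code-point ranges */
static bool ascii_is_alpha(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

static bool latin1_is_alpha(unsigned cp)   /* simplified: U+00C0..U+00FF */
{
    return cp >= 0xC0 && cp <= 0xFF && cp != 0xD7 && cp != 0xF7;
}

/* Hypothetical above-Latin1 check that already returns a length:
 * Greek U+03B1..U+03C9 (two bytes) and U+3042 HIRAGANA A (three bytes). */
static size_t above_latin1_alpha_len(const unsigned char *s,
                                     const unsigned char *e)
{
    if (e - s >= 2 && ((s[0] == 0xCE && s[1] >= 0xB1 && s[1] <= 0xBF)
                    || (s[0] == 0xCF && s[1] >= 0x80 && s[1] <= 0x89)))
        return 2;
    if (e - s >= 3 && s[0] == 0xE3 && s[1] == 0x81 && s[2] == 0x82)
        return 3;
    return 0;
}

/* Returns the byte length of the alphabetic character at s, or 0 */
static size_t is_alpha_utf8_len(const unsigned char *s, const unsigned char *e)
{
    assert(s != NULL && e != NULL);

    if (s >= e)
        return 0;

    if (s[0] < 0x80)            /* ASCII: the bool 0/1 is already the length */
        return ascii_is_alpha(s[0]);

    if (s[0] == 0xC2 || s[0] == 0xC3) {   /* Latin1, non-ASCII: two bytes */
        if (e - s < 2 || (s[1] & 0xC0) != 0x80)
            return 0;
        unsigned cp = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
        return latin1_is_alpha(cp) * 2;   /* 0 if no match, 2 if matched */
    }

    return above_latin1_alpha_len(s, e);
}
```

Multiplying by 2 keeps the Latin1 branch free of any length lookup, because every code point in U+0080..U+00FF encodes as exactly two bytes with a start byte of 0xC2 or 0xC3.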

khwilliamson · 2025-09-15T19:38:35Z

I decided to split the part that makes behavior changes out for a later p.r.

khwilliamson force-pushed the advance branch from 3554bf6 to 2e4b43d on September 9, 2025 16:29

github-actions bot added the hasConflicts label Sep 9, 2025

khwilliamson force-pushed the advance branch from 2e4b43d to 404e163 on September 9, 2025 17:38

khwilliamson removed the hasConflicts label Sep 9, 2025

tonycoz reviewed Sep 11, 2025

regcomp_internal.h (resolved)

tonycoz mentioned this pull request Sep 11, 2025

Avoid some UTF8SKIPs #23695

Closed

tonycoz reviewed Sep 11, 2025

toke.c (outdated, resolved)

khwilliamson added 11 commits September 15, 2025 12:47

handy.h: Swap order of conditionals for clarity

592b079

This moves the trivial case before the complicated one, which is easier to comprehend. And instead of complementing the conditional, it uses a different name (one that evaluates to that complement), which makes it clearer what's going on.

handy.h: White space only

fd90e73

This cleans up some ragged edges and makes things fit in 80 columns.

handy.h: Remove unnecessary cast

04b84be

The && in this expression already makes the result a boolean, so there is no need to cast it to one. Removing the cast allows the entire expression to fit on one line.

utf8.c: Replace macro by a static function

e5ef7eb

This will be useful in the next commits

class.c: Avoid UTF8SKIPs

ba197ce

This value is now returned from the isSPACE_utf8_safe macro. Use it instead of re-deriving it.

pp_ctl.c: Avoid UTF8SKIPs

b784fff

This value is now returned from the isID(FIRST|CONT)_utf8_safe macros. Use it instead of re-deriving it.

regcomp.c: Avoid UTF8SKIPs

d16ee55

This value is now returned from the isID(FIRST|CONT)_lazy_if_safe macros. Use it instead of re-deriving it. This also simplifies the code

regcomp.c: White space only

70a8504

The previous commit removed a surrounding block; outdent correspondingly

toke.c: Avoid UTF8SKIPs

8887d78

This value is now returned from the isID(FIRST|CONT)_lazy_if_safe macros. Use it instead of re-deriving it.

khwilliamson force-pushed the advance branch from 404e163 to 8887d78 on September 15, 2025 19:36

khwilliamson changed the title from "Remove some UTF8SKIPs; use isIDCONT" to "Remove some UTF8SKIPs" on Sep 15, 2025

tonycoz approved these changes Sep 15, 2025

khwilliamson merged commit 5ea209f into Perl:blead Sep 15, 2025
33 checks passed

tonycoz mentioned this pull request Oct 2, 2025

Add isIDCONT_lazy_if_safe() #23775

Merged
