regcomp.c - add optimistic eval (*{ ... }) and (**{ ... })
This adds (*{ ... }) and (**{ ... }) as equivalents to (?{ ... }) and
(??{ ... }). The only difference is that the star variants are
"optimistic" and are defined to never disable optimisations. This is
especially relevant now that use of (?{ ... }) prevents important
optimisations anywhere in the pattern, instead of the older and inconsistent
rules where it only affected the parts that contained the EVAL.

It is also very useful for injecting debugging style expressions to the
pattern to understand what the regex engine is actually doing. The older
style (?{ ... }) variants would change the regex engine's behavior, meaning
this was not as effective a tool as it could have been.

Similarly it is now possible to test that a given regex optimisation
works correctly using (*{ ... }), which was not possible with (?{ ... }).
demerphq committed Jan 19, 2023
1 parent 09b3a40 commit c224bbd
Showing 12 changed files with 229 additions and 48 deletions.
14 changes: 14 additions & 0 deletions pod/perldelta.pod
@@ -27,6 +27,20 @@ here, but most should go in the L</Performance Enhancements> section.

[ List each enhancement as a =head2 entry ]

=head2 Optimistic Eval in Patterns

The use of C<(?{ ... })> and C<(??{ ... })> in a pattern disables various
optimizations globally in that pattern. This may or may not be desired by the
programmer. This release adds the C<(*{ ... })> and C<(**{ ... })>
equivalents. The only difference is that they do not and will never disable
any optimisations in the regex engine. This may make them more unstable in the
sense that they may be called more or fewer times in the future; however, the
number of times they execute will truly match how the regex engine functions.
For example, certain types of optimisation are disabled when C<(?{ ... })> is
included in a pattern, so that patterns which are O(N) in normal use become
O(N*N) with a C<(?{ ... })> pattern in them. Switching to C<(*{ ... })> means
the pattern will stay O(N).
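
The O(N) versus O(N*N) claim above can be illustrated with a toy matcher. The following is only a sketch, not Perl's engine: it hard-codes the pattern `x+y`, and the names `match_xplus_y` and `use_skip` are hypothetical. The `use_skip` switch mimics the idea described by the regcomp.c comment in this commit, "x+ must match at the 1st pos of run of x's": once a run of `x` has failed to lead to a match, no later start inside the same run can succeed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy matcher for the fixed pattern x+y (hypothetical, not Perl's engine).
 * Returns the offset where the match starts, or -1 if there is none.
 * *steps receives the number of character comparisons performed. */
long match_xplus_y(const char *s, size_t n, int use_skip, unsigned long *steps)
{
    assert(s != NULL && steps != NULL);
    *steps = 0;

    size_t i = 0;
    while (i < n) {
        size_t j = i;

        /* Consume the run of 'x' that starts at i. */
        while (j < n) {
            (*steps)++;
            if (s[j] != 'x')
                break;
            j++;
        }

        if (j == i) {           /* no 'x' here, try the next start */
            i++;
            continue;
        }

        /* x+ matched s[i..j); the next character must be 'y'. */
        if (j < n) {
            (*steps)++;
            if (s[j] == 'y')
                return (long)i;
        }

        /* Failure. With the skip idea enabled, jump past the whole run,
         * because every start inside it would fail the same way. */
        i = use_skip ? j : i + 1;
    }
    return -1;
}
```

On a subject of N `x` characters with no `y`, the version without the skip rescans the tail of the run from every start position, so the comparison count grows quadratically, while the skipping version touches each character once.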

=head1 Security

XXX Any security-related notices go here. In particular, any security
60 changes: 51 additions & 9 deletions pod/perlre.pod
@@ -1990,6 +1990,18 @@ keep track of the number of nested parentheses. For example:
/the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
print "color = $color, animal = $animal\n";

The use of this construct disables some optimisations globally in the
pattern, and the pattern may execute much slower as a consequence.
Use a C<*> instead of the C<?> block to create an optimistic form of
this construct. C<(*{ ... })> should not disable any optimisations.

=item C<(*{ I<code> })>
X<(*{})> X<regex, optimistic code>

This is *exactly* the same as C<(?{ I<code> })> with the exception
that it does not disable B<any> optimisations at all in the regex engine.
How often it is executed may vary from perl release to perl release.
In a failing match it may not even be executed at all.

=item C<(??{ I<code> })>
X<(??{})>
@@ -2047,6 +2059,20 @@ consuming any input string will also result in a fatal error. The depth
at which that happens is compiled into perl, so it can be changed with a
custom build.

The use of this construct disables some optimisations globally in the pattern,
and the pattern may execute much slower as a consequence. Use a C<*> instead
of the C<?> to create an optimistic form of this construct: C<(**{...})>
may be used as a replacement and should not disable any optimisations, but is
likely to be even more volatile from perl version to perl version than
C<(??{...})> is.

=item C<(**{ I<code> })>
X<(**{})> X<regex, postponed optimistic>

This is exactly the same as C<(??{ I<code> })> however it does not disable
B<any> optimisations. It is even more likely to change from version to version
of perl. In a failing match it may not even be executed at all.

=item C<(?I<PARNO>)> C<(?-I<PARNO>)> C<(?+I<PARNO>)> C<(?R)> C<(?0)>
X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>
X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
@@ -2201,7 +2227,15 @@ Full syntax: C<< (?(?=I<lookahead>)I<then>|I<else>) >>
=item C<(?{ I<CODE> })>

Treats the return value of the code block as the condition.
Full syntax: C<< (?(?{ I<code> })I<then>|I<else>) >>
Full syntax: C<< (?(?{ I<CODE> })I<then>|I<else>) >>

Note that use of this construct may globally affect the performance
of the pattern. Consider using C<(*{ I<CODE> })> instead.

=item C<(*{ I<CODE> })>

Treats the return value of the code block as the condition.
Full syntax: C<< (?(*{ I<CODE> })I<then>|I<else>) >>

=item C<(R)>

@@ -3293,14 +3327,15 @@ part of this regular expression needs to be converted explicitly

=head2 Embedded Code Execution Frequency

The exact rules for how often C<(??{})> and C<(?{})> are executed in a pattern
are unspecified. In the case of a successful match you can assume that
they DWIM and will be executed in left to right order the appropriate
number of times in the accepting path of the pattern as would any other
meta-pattern. How non-accepting pathways and match failures affect the
number of times a pattern is executed is specifically unspecified and
may vary depending on what optimizations can be applied to the pattern
and is likely to change from version to version.
The exact rules for how often C<(?{})> and C<(??{})> are executed in a pattern
are unspecified, as are their even less well defined equivalents C<(*{})> and
C<(**{})>. In the case of a successful match you can assume that they DWIM and
will be executed in left to right order the appropriate number of times in the
accepting path of the pattern as would any other meta-pattern. How non-
accepting pathways and match failures affect the number of times a pattern is
executed is specifically unspecified and may vary depending on what
optimizations can be applied to the pattern and is likely to change from
version to version.

For instance in

@@ -3326,6 +3361,13 @@ example:

will output "o" twice.

For historical and consistency reasons the use of normal code blocks
anywhere in a pattern will disable certain optimisations. As of 5.37.7
you can use an "optimistic" codeblock, C<(*{ ... })> or C<(**{ ... })>
if you do *not* wish to disable these optimisations. This may result
in code blocks being called less often than they would have been had
they not been optimistic.

=head2 PCRE/Python Support

As of Perl 5.10.0, Perl supports several Python/PCRE-specific extensions
52 changes: 42 additions & 10 deletions regcomp.c
@@ -1902,7 +1902,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
!sawlookahead &&
(OP(first) == STAR &&
REGNODE_TYPE(OP(REGNODE_AFTER(first))) == REG_ANY) &&
!(RExC_rx->intflags & PREGf_ANCH) && !pRExC_state->code_blocks)
!(RExC_rx->intflags & PREGf_ANCH) && !(RExC_seen & REG_PESSIMIZE_SEEN))
{
/* turn .* into ^.* with an implied $*=1 */
const int type =
@@ -1915,7 +1915,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
}
if (sawplus && !sawminmod && !sawlookahead
&& (!sawopen || !RExC_sawback)
&& !pRExC_state->code_blocks) /* May examine pos and $& */
&& !(RExC_seen & REG_PESSIMIZE_SEEN)) /* May examine pos and $& */
/* x+ must match at the 1st pos of run of x's */
RExC_rx->intflags |= PREGf_SKIP;

@@ -2167,20 +2167,27 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
}
if (RExC_seen & REG_GPOS_SEEN)
RExC_rx->intflags |= PREGf_GPOS_SEEN;

if (RExC_seen & REG_PESSIMIZE_SEEN)
RExC_rx->intflags |= PREGf_PESSIMIZE_SEEN;

if (RExC_seen & REG_LOOKBEHIND_SEEN)
RExC_rx->extflags |= RXf_NO_INPLACE_SUBST; /* inplace might break the
lookbehind */
if (pRExC_state->code_blocks)
RExC_rx->extflags |= RXf_EVAL_SEEN;
if (RExC_seen & REG_VERBARG_SEEN)
{

if (RExC_seen & REG_VERBARG_SEEN) {
RExC_rx->intflags |= PREGf_VERBARG_SEEN;
RExC_rx->extflags |= RXf_NO_INPLACE_SUBST; /* don't understand this! Yves */
}

if (RExC_seen & REG_CUTGROUP_SEEN)
RExC_rx->intflags |= PREGf_CUTGROUP_SEEN;

if (pm_flags & PMf_USE_RE_EVAL)
RExC_rx->intflags |= PREGf_USE_RE_EVAL;

if (RExC_paren_names)
RXp_PAREN_NAMES(RExC_rx) = MUTABLE_HV(SvREFCNT_inc(RExC_paren_names));
else
@@ -2944,6 +2951,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth)
I32 num; /* numeric backreferences */
SV * max_open; /* Max number of unclosed parens */
I32 was_in_lookaround = RExC_in_lookaround;
I32 fake_eval = 0; /* matches paren */

/* The difference between the following variables can be seen with *
* the broken pattern /(?:foo/ where segment_parse_start will point *
@@ -3000,6 +3008,16 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth)
goto parse_rest;
}
else if ( *RExC_parse == '*') { /* (*VERB:ARG), (*construct:...) */
if (RExC_parse[1] == '{') {
fake_eval = '{';
goto handle_qmark;
}
else
if ( RExC_parse[1] == '*' && RExC_parse[2] == '{' ) {
fake_eval = '?';
goto handle_qmark;
}

char *start_verb = RExC_parse + 1;
STRLEN verb_len;
char *start_arg = NULL;
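
The dispatch added in this hunk can be read as a small classifier: after the opening parenthesis, `*{` is treated as if the parser had seen `?{`, and `**{` as if it had seen `??{`. The sketch below restates that under assumptions: `sketch_fake_eval` is a hypothetical name, and the explicit length parameter is an addition of the sketch rather than something the hunk shows.

```c
#include <assert.h>
#include <stddef.h>

/* p points just past the '(' of a group, n is the number of bytes left.
 * Returns the character the (?...) handler should act on:
 *   '{' for (*{ ... })   which is handled like (?{ ... })
 *   '?' for (**{ ... })  which is handled like (??{ ... })
 *   0   for anything else, e.g. (*VERB) or (*pla:...)
 * Hypothetical helper; the bounds checks are an assumption of this sketch. */
char sketch_fake_eval(const char *p, size_t n)
{
    assert(p != NULL);
    if (n >= 2 && p[0] == '*' && p[1] == '{')
        return '{';
    if (n >= 3 && p[0] == '*' && p[1] == '*' && p[2] == '{')
        return '?';
    return 0;
}
```

A nonzero result corresponds to the `fake_eval` value in the hunk, which the shared handler later turns into `is_optimistic = 1`.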
@@ -3310,7 +3328,9 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth)
return ret;
}
else if (*RExC_parse == '?') { /* (?...) */
bool is_logical = 0;
handle_qmark:
; /* make sure the label has a statement associated with it*/
bool is_logical = 0, is_optimistic = 0;
const char * const seqstart = RExC_parse;
const char * endptr;
const char non_existent_group_msg[]
@@ -3323,8 +3343,14 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth)
}

RExC_parse_inc_by(1); /* past the '?' */
paren = *RExC_parse; /* might be a trailing NUL, if not
well-formed */
if (!fake_eval) {
paren = *RExC_parse; /* might be a trailing NUL, if not
well-formed */
is_optimistic = 0;
} else {
is_optimistic = 1;
paren = fake_eval;
}
RExC_parse_inc();
if (RExC_parse > RExC_end) {
paren = '\0';
@@ -3705,10 +3731,13 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth)
}
pRExC_state->code_index++;
nextchar(pRExC_state);
if (!is_optimistic)
RExC_seen |= REG_PESSIMIZE_SEEN;

if (is_logical) {
regnode_offset eval;
ret = reg_node(pRExC_state, LOGICAL);
FLAGS(REGNODE_p(ret)) = 2;

eval = reg2Lanode(pRExC_state, EVAL,
n,
@@ -3717,13 +3746,15 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth)
* return value */
RExC_flags & RXf_PMf_COMPILETIME
);
FLAGS(REGNODE_p(ret)) = 2;
FLAGS(REGNODE_p(eval)) = is_optimistic * EVAL_OPTIMISTIC_FLAG;
if (! REGTAIL(pRExC_state, ret, eval)) {
REQUIRE_BRANCHJ(flagp, 0);
}
return ret;
}
ret = reg2Lanode(pRExC_state, EVAL, n, 0);
FLAGS(REGNODE_p(ret)) = is_optimistic * EVAL_OPTIMISTIC_FLAG;

return ret;
}
case '(': /* (?(?{...})...) and (?(?=...)...) */
@@ -3737,7 +3768,8 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth)
|| RExC_parse[1] == '<'
|| RExC_parse[1] == '{'))
|| ( RExC_parse[0] == '*' /* (?(*...)) */
&& ( memBEGINs(RExC_parse + 1,
&& ( RExC_parse[1] == '{'
|| ( memBEGINs(RExC_parse + 1,
(Size_t) (RExC_end - (RExC_parse + 1)),
"pla:")
|| memBEGINs(RExC_parse + 1,
@@ -3760,7 +3792,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth)
"negative_lookahead:")
|| memBEGINs(RExC_parse + 1,
(Size_t) (RExC_end - (RExC_parse + 1)),
"negative_lookbehind:"))))
"negative_lookbehind:")))))
) { /* Lookahead or eval. */
I32 flag;
regnode_offset tail;
5 changes: 5 additions & 0 deletions regcomp.h
@@ -108,6 +108,7 @@ typedef struct regexp_internal {
#define PREGf_ANCH_SBOL 0x00000800
#define PREGf_ANCH_GPOS 0x00001000
#define PREGf_RECURSE_SEEN 0x00002000
#define PREGf_PESSIMIZE_SEEN 0x00004000

#define PREGf_ANCH \
( PREGf_ANCH_SBOL | PREGf_ANCH_GPOS | PREGf_ANCH_MBOL )
@@ -976,6 +977,7 @@ ARGp_SET_inline(struct regnode *node, SV *ptr) {
#define REG_UNFOLDED_MULTI_SEEN 0x00000400
/* spare */
#define REG_UNBOUNDED_QUANTIFIER_SEEN 0x00001000
#define REG_PESSIMIZE_SEEN 0x00002000


START_EXTERN_C
@@ -1426,6 +1428,9 @@ typedef enum {
#include "reginline.h"
#endif

#define EVAL_OPTIMISTIC_FLAG 128
#define EVAL_FLAGS_MASK (EVAL_OPTIMISTIC_FLAG-1)

#endif /* PERL_REGCOMP_H_ */
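
The two macros added above reserve the top bit of an 8-bit flags field for the optimistic marker and leave the low seven bits for other values. A minimal sketch of that bit layout follows. The macro values are copied from this hunk, while the `sketch_` helpers are hypothetical, and packing both pieces into one byte is an assumption of the sketch (in the commit the marker is set on the EVAL node and the value 2 on the LOGICAL node).

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in the regcomp.h hunk of this commit. */
#define EVAL_OPTIMISTIC_FLAG 128
#define EVAL_FLAGS_MASK (EVAL_OPTIMISTIC_FLAG-1)

/* Hypothetical helper: combine a 7-bit value with the optimistic marker. */
uint8_t sketch_pack(uint8_t low_bits, int is_optimistic)
{
    assert(low_bits <= EVAL_FLAGS_MASK);
    return (uint8_t)(low_bits | (is_optimistic ? EVAL_OPTIMISTIC_FLAG : 0));
}

/* Hypothetical helper: test the reserved top bit. */
int sketch_is_optimistic(uint8_t flags)
{
    return (flags & EVAL_OPTIMISTIC_FLAG) != 0;
}

/* Hypothetical helper: recover the value with the marker masked off,
 * in the same spirit as "scan->flags & EVAL_FLAGS_MASK" in regexec.c. */
uint8_t sketch_low_bits(uint8_t flags)
{
    return (uint8_t)(flags & EVAL_FLAGS_MASK);
}
```

Masking before use means code that only cares about the low bits keeps working whether or not the marker is present.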

/*
6 changes: 5 additions & 1 deletion regcomp_debug.c
@@ -174,7 +174,7 @@ S_regdump_intflags(pTHX_ const char *lead, const U32 flags)

ASSUME(REG_INTFLAGS_NAME_SIZE <= sizeof(flags)*8);

for (bit=0; bit<REG_INTFLAGS_NAME_SIZE; bit++) {
for (bit=0; bit<=REG_INTFLAGS_NAME_SIZE; bit++) {
if (flags & (1<<bit)) {
if (!set++ && lead)
Perl_re_printf( aTHX_ "%s", lead);
@@ -871,6 +871,10 @@ Perl_regprop(pTHX_ const regexp *prog, SV *sv, const regnode *o, const regmatch_
}
else if (op == SBOL)
Perl_sv_catpvf(aTHX_ sv, " /%s/", o->flags ? "\\A" : "^");
else if (op == EVAL) {
if (o->flags & EVAL_OPTIMISTIC_FLAG)
Perl_sv_catpvf(aTHX_ sv, " optimistic");
}

/* add on the verb argument if there is one */
if ( ( k == VERB || op == ACCEPT || op == OPFAIL ) && o->flags) {
3 changes: 3 additions & 0 deletions regcomp_internal.h
@@ -1193,6 +1193,9 @@ static const scan_data_t zero_scan_data = {
if (RExC_seen & REG_UNBOUNDED_QUANTIFIER_SEEN) \
Perl_re_printf( aTHX_ "REG_UNBOUNDED_QUANTIFIER_SEEN "); \
\
if (RExC_seen & REG_PESSIMIZE_SEEN) \
Perl_re_printf( aTHX_ "REG_PESSIMIZE_SEEN "); \
\
Perl_re_printf( aTHX_ "\n"); \
});

18 changes: 9 additions & 9 deletions regcomp_study.c
@@ -2693,10 +2693,10 @@ Perl_study_chunk(pTHX_
if ( RE_OPTIMIZE_CURLYX_TO_CURLYN
&& OP(oscan) == CURLYX
&& data
&& !pRExC_state->code_blocks /* XXX: for now disable whenever eval
is seen anywhere. We need a better
way. */
&& ( ( data->flags & (SF_IN_PAR|SF_HAS_EVAL) ) == SF_IN_PAR )
&& !(RExC_seen & REG_PESSIMIZE_SEEN) /* XXX: for now disable whenever a
non optimistic eval is seen
anywhere.*/
&& ( data->flags & SF_IN_PAR ) /* has parens */
&& !deltanext
&& minnext == 1
&& mutate_ok
@@ -2750,10 +2750,10 @@ Perl_study_chunk(pTHX_
if ( RE_OPTIMIZE_CURLYX_TO_CURLYM
&& OP(oscan) == CURLYX
&& data
&& !pRExC_state->code_blocks /* XXX: for now disable whenever eval
is seen anywhere. We need a better
way. */
&& !(data->flags & (SF_HAS_PAR|SF_HAS_EVAL))
&& !(RExC_seen & REG_PESSIMIZE_SEEN) /* XXX: for now disable whenever a
non optimistic eval is seen
anywhere.*/
&& !(data->flags & SF_HAS_PAR) /* no parens! */
&& !deltanext /* atom is fixed width */
&& minnext != 0 /* CURLYM can't handle zero width */
/* Nor characters whose fold at run-time may be
@@ -3469,7 +3469,7 @@ Perl_study_chunk(pTHX_
}
}
else if (OP(scan) == EVAL) {
if (data)
if (data && !(scan->flags & EVAL_OPTIMISTIC_FLAG) )
data->flags |= SF_HAS_EVAL;
}
else if ( REGNODE_TYPE(OP(scan)) == ENDLIKE ) {
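
The change to the EVAL branch above means that the study pass records "this chunk contains an eval" only for non-optimistic blocks. A rough sketch of that filtering over a flat node array is shown below. The node layout, `SKETCH_HAS_EVAL` and the `sketch_` names are hypothetical simplifications and do not reflect the real regnode structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EVAL_OPTIMISTIC_FLAG 128   /* value from the regcomp.h hunk */
#define SKETCH_HAS_EVAL 0x1u       /* hypothetical stand-in for SF_HAS_EVAL */

enum sketch_op { SKETCH_EXACT, SKETCH_EVAL };

struct sketch_node {
    enum sketch_op op;
    uint8_t flags;
};

/* Walk the nodes and report whether a pessimizing eval was found.
 * Optimistic EVAL nodes are deliberately invisible to this pass. */
unsigned sketch_study(const struct sketch_node *nodes, size_t count)
{
    assert(nodes != NULL || count == 0);
    unsigned data_flags = 0;
    for (size_t i = 0; i < count; i++) {
        if (nodes[i].op == SKETCH_EVAL
            && !(nodes[i].flags & EVAL_OPTIMISTIC_FLAG))
            data_flags |= SKETCH_HAS_EVAL;
    }
    return data_flags;
}
```

Because optimistic blocks never set the flag, later decisions that depend on it behave as if the block were not there.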
2 changes: 1 addition & 1 deletion regexec.c
@@ -8585,7 +8585,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
break;

case LOGICAL: /* modifier for EVAL and IFMATCH */
logical = scan->flags;
logical = scan->flags & EVAL_FLAGS_MASK; /* reserve a bit for optimistic eval */
break;

/*******************************************************************
3 changes: 2 additions & 1 deletion regnodes.h
@@ -2876,11 +2876,12 @@ EXTCONST char * const PL_reg_intflags_name[] = {
"ANCH_SBOL", /* (1<<11) - 0x00000800 - PREGf_ANCH_SBOL */
"ANCH_GPOS", /* (1<<12) - 0x00001000 - PREGf_ANCH_GPOS */
"RECURSE_SEEN", /* (1<<13) - 0x00002000 - PREGf_RECURSE_SEEN */
"PESSIMIZE_SEEN", /* (1<<14) - 0x00004000 - PREGf_PESSIMIZE_SEEN */
};
#endif /* DOINIT */

#ifdef DEBUGGING
# define REG_INTFLAGS_NAME_SIZE 14
# define REG_INTFLAGS_NAME_SIZE 15
#endif
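
The name table and its size constant exist so that debugging output can turn set bits into readable names. The sketch below shows one way to decode a flags word against such a table. It covers only the four entries visible in this hunk (bits 11 to 14), and `sketch_flag_names`, `SKETCH_FIRST_BIT` and the output-buffer interface are hypothetical. The loop iterates strictly below the table size so every index stays inside the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Only the entries visible in the regnodes.h hunk; bit numbers as listed. */
#define SKETCH_FIRST_BIT 11
static const char *const sketch_names[] = {
    "ANCH_SBOL",        /* (1<<11) */
    "ANCH_GPOS",        /* (1<<12) */
    "RECURSE_SEEN",     /* (1<<13) */
    "PESSIMIZE_SEEN",   /* (1<<14) */
};
#define SKETCH_NAME_COUNT (sizeof sketch_names / sizeof sketch_names[0])

/* Write the names of all set bits, each followed by a space, into out.
 * Returns the number of characters written (excluding the NUL). */
size_t sketch_flag_names(uint32_t flags, char *out, size_t outsz)
{
    assert(out != NULL && outsz > 0);
    size_t used = 0;
    out[0] = '\0';
    for (size_t i = 0; i < SKETCH_NAME_COUNT; i++) {
        if (flags & (UINT32_C(1) << (SKETCH_FIRST_BIT + i))) {
            int w = snprintf(out + used, outsz - used, "%s ", sketch_names[i]);
            if (w < 0 || (size_t)w >= outsz - used) {
                out[used] = '\0';   /* drop the partial name and stop */
                break;
            }
            used += (size_t)w;
        }
    }
    return used;
}
```

With the table and the size constant kept in step, adding a flag such as `PESSIMIZE_SEEN` needs one new table entry and one increment of the constant, which is what this hunk does.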

/* The following have no fixed length. U8 so we can do strchr() on it. */