From 7c932d07cab18751bfc7515b4320436273a459e2 Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Fri, 19 Oct 2018 09:48:34 -0600 Subject: [PATCH] Remove sizing pass from regular expression compiler This commit removes the sizing pass for regular expression compilation. It attempts to be the minimum required to do this. Future patches are in the works that improve it,, and there is certainly lots more that could be done. This is being done now for security reasons, as there have been several bugs leading to CVEs where the sizing pass computed the size improperly, and a crafted pattern could allow an attack. This means that simple bugs can too easily become attack vectors. This is NOT the AST that people would like, but it should make it easier to move the code in that direction. Instead of a sizing pass, as the pattern is parsed, new space is malloc'd for each regnode found. To minimize the number of such mallocs that actually go out and request memory, an initial guess is made, based on the length of the pattern being compiled. The guessed amount is malloc'd and then immediately released. Hopefully that memory won't be gobbled up by another process before we actually gain control of it. The guess is currently simply the number of bytes in the pattern. Patches and/or suggestions are welcome on improving the guess or this method. This commit does not mean, however, that only one pass is done in all cases. Currently there are several situations that require extra passes. These are: a) If the pattern isn't UTF-8, but encounters a construct that requires it to become UTF-8, the parse is immediately stopped, the translation is done, and the parse restarted. This is unchanged from how it worked prior to this commit. b) If the pattern uses /d rules and it encounters a construct that requires it to become /u, the parse is immediately stopped and restarted using /u rules. A future enhancement is to only restart if something has been encountered that would generate something different than what has already been generated, as many operations are the same under both /d and /u. Prior to this commit, in rare circumstances was the parse immediately restarted. Only those few that changed the sizing did so. Instead the sizing pass was allowed to complete and then the generation pass ran, using /u. Some CVEs were caused by faulty implementation here. c) Very large patterns may need to have long jumps in their program. Prior to this commit, that was determined in the sizing pass, and all jumps were made long during generation. Now, the first time the need for a long jump is detected, the parse is immediately restarted, and all jumps are made long. I haven't investigated enough to be sure, but it might be sufficient to just redo the current jump, making it long, and then switch to using just long jumps, without having to restart the parse from the beginning. d) If a reference that could be to capturing parentheses doesn't find such parentheses, a flag is set. For references that could be octal constants, they are assumed to be those constants instead of a capturing group. At the end of the parse, if the flag indicates either that the assumption(s) were wrong or that it is a fatal reference to a non-existent group, the pattern is reparsed knowing the total number of these groups. e) If (?R) or (?0) are encountered, the flag listed in item d) above is set to force a reparse. I did have code in place that avoided doing the reparse, but I wasn't confident enough that our test suite exercises that area of the code enough to have exposed all the potential interaction bugs, and I think this construct is used rarely enough to not worry about avoiding the reparse at this point in the development. f) If (?|pattern) is encountered, the behavior is the same as in item e) above. The pattern will end up being reparsed after the total number of parenthesized groups are known. I decided not to invest the effort at this time in trying to get this to work without a reparse. It might be that if we are continuing the parse to just count parentheses, and we encounter a construct that normally would restart the parse immediately, that we could defer that restart. This would cut down the maximum number of parses required. As of this commit, the worst case is we find something that requires knowing all the parentheses; later we have to switch to /u rules and so the parse is restarted. Still later we have to switch to long jumps, and the parse is restarted again. Still later we have to upgrade to UTF-8, and the parse is restarted yet again. Then the parse is completed, and the final total of parentheses is known, so everything is redone a final time. Deferring those intermediate restarts would save a bunch of reparsing. Prior to this commit, warnings were issued only during the code generation pass, which didn't get called unless the sizing pass(es) completed successfully. But now, we don't know if the pass will succeed, fail, or whether it will have to be restarted. To avoid outputting the same warning more than once, the position in the parse of the last warning generated is kept (across parses). The code looks at that position when it is about to generate a warning. If the parsing has previously gotten that far, it assumes that the warning has already been generated, and suppresses it this time. The current state of parsing is such that I believe this assumption is valid. If the parses had divergent paths, that assumption could become invalid. --- regcomp.c | 618 ++++++++++++++++++++++++++++++++++-------------------- regcomp.h | 3 +- 2 files changed, 393 insertions(+), 228 deletions(-) diff --git a/regcomp.c b/regcomp.c index 8f338ae926ad..65991f4b4c09 100644 --- a/regcomp.c +++ b/regcomp.c @@ -147,12 +147,23 @@ struct RExC_state_t { U32 seen; SSize_t size; /* Number of regnode equivalents in pattern */ - I32 npar; /* Capture buffer count, (OPEN) plus - one. ("par" 0 is the whole - pattern)*/ - I32 total_par; /* Capture buffer count after parse - completed, (OPEN) plus one. ("par" 0 - is the whole pattern)*/ + + /* position beyond 'precomp' of the warning message furthest away from + * 'precomp'. During the parse, no warnings are raised for any problems + * earlier in the parse than this position. This works if warnings are + * raised the first time a given spot is parsed, and if only one + * independent warning is raised for any given spot */ + Size_t latest_warn_offset; + + I32 npar; /* Capture buffer count so far in the + parse, (OPEN) plus one. ("par" 0 is + the whole pattern)*/ + I32 total_par; /* During initial parse, is either 0, + or -1; the latter indicating a + reparse is needed. After that pass, + it is what 'npar' became after the + pass. Hence, it being > 0 indicates + we are in a reparse situation */ I32 nestroot; /* root parens we are in - used by accept */ I32 extralen; @@ -215,7 +226,7 @@ struct RExC_state_t { bool strict; bool study_started; bool in_script_run; - bool pass1; + bool use_BRANCHJ; }; #define RExC_flags (pRExC_state->flags) @@ -230,6 +241,7 @@ struct RExC_state_t { #define RExC_start (pRExC_state->start) #define RExC_end (pRExC_state->end) #define RExC_parse (pRExC_state->parse) +#define RExC_latest_warn_offset (pRExC_state->latest_warn_offset ) #define RExC_whilem_seen (pRExC_state->whilem_seen) /* Set during the sizing pass when there is a LATIN SMALL LETTER SHARP S in any @@ -247,7 +259,6 @@ struct RExC_state_t { others */ #endif #define RExC_emit (pRExC_state->emit) -#define RExC_pass1 (pRExC_state->pass1) #define RExC_emit_start (pRExC_state->emit_start) #define RExC_emit_bound (pRExC_state->emit_bound) #define RExC_sawback (pRExC_state->sawback) @@ -284,7 +295,7 @@ struct RExC_state_t { #define RExC_study_started (pRExC_state->study_started) #define RExC_warn_text (pRExC_state->warn_text) #define RExC_in_script_run (pRExC_state->in_script_run) -#define RExC_use_BRANCHJ (!SIZE_ONLY && RExC_extralen) +#define RExC_use_BRANCHJ (pRExC_state->use_BRANCHJ) /* Heuristic check on the complexity of the pattern: if TOO_NAUGHTY, we set * a flag to disable back-off on the fixed/floating substrings - if it's @@ -341,7 +352,6 @@ struct RExC_state_t { #define REQUIRE_UTF8(flagp) STMT_START { \ if (!UTF) { \ - assert(PASS1); \ *flagp = RESTART_PARSE|NEED_UTF8; \ return 0; \ } \ @@ -354,16 +364,26 @@ struct RExC_state_t { #define REQUIRE_UNI_RULES(flagp, restart_retval) \ STMT_START { \ if (DEPENDS_SEMANTICS) { \ - assert(PASS1); \ set_regex_charset(&RExC_flags, REGEX_UNICODE_CHARSET); \ RExC_uni_semantics = 1; \ - if (RExC_seen_unfolded_sharp_s) { \ *flagp |= RESTART_PARSE; \ return restart_retval; \ - } \ } \ } STMT_END +#define BRANCH_MAX_OFFSET U16_MAX +#define REQUIRE_BRANCHJ(flagp, restart_retval) \ + STMT_START { \ + RExC_use_BRANCHJ = 1; \ + *flagp |= RESTART_PARSE; \ + return restart_retval; \ + } STMT_END + +#define REQUIRE_PARENS_PASS \ + STMT_START { \ + if (RExC_total_parens == 0) RExC_total_parens = -1; \ + } STMT_END + /* Executes a return statement with the value 'X', if 'flags' contains any of * 'RESTART_PARSE', 'NEED_UTF8', or 'extra'. If so, *flagp is set to those * flags */ @@ -693,6 +713,10 @@ static const scan_data_t zero_scan_data = { STMT_START { \ if (RExC_rx_sv) \ SAVEFREESV(RExC_rx_sv); \ + if (RExC_open_parens) \ + SAVEFREEPV(RExC_open_parens); \ + if (RExC_close_parens) \ + SAVEFREEPV(RExC_close_parens); \ } STMT_END /* @@ -800,17 +824,22 @@ static const scan_data_t zero_scan_data = { #define TURN_OFF_WARNINGS_IN_SUBSTITUTE_PARSE RExC_copy_start_in_constructed = NULL #define RESTORE_WARNINGS RExC_copy_start_in_constructed = RExC_precomp -/* Outputting warnings is generally deferred until the 2nd pass. This is - * because the first pass can be restarted, for example if the pattern has to - * be converted to UTF-8. If a warning had already been output earlier in the - * pass, it would be re-output after the restart. Pass 2 is never restarted, - * so the problem simply goes away if we defer the output to that pass. See - * [perl #122671]. 'RExC_copy_start_in_constructed' being NULL is a flag to - * not generate any warnings */ +/* Since a warning can be generated multiple times as the input is reparsed, we + * output it the first time we come to that point in the parse, but suppress it + * otherwise. 'RExC_copy_start_in_constructed' being NULL is a flag to not + * generate any warnings */ #define TO_OUTPUT_WARNINGS(loc) \ - (PASS2 && RExC_copy_start_in_constructed) + ( RExC_copy_start_in_constructed \ + && ((xI(loc)) - RExC_precomp) > (Ptrdiff_t) RExC_latest_warn_offset) -#define UPDATE_WARNINGS_LOC(loc) NOOP +/* After we've emitted a warning, we save the position in the input so we don't + * output it again */ +#define UPDATE_WARNINGS_LOC(loc) \ + STMT_START { \ + if (TO_OUTPUT_WARNINGS(loc)) { \ + RExC_latest_warn_offset = (xI(loc)) - RExC_precomp; \ + } \ + } STMT_END /* 'warns' is the output of the packWARNx macro used in 'code' */ #define _WARN_HELPER(loc, warns, code) \ @@ -7234,7 +7263,6 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, /* ignore the utf8ness if the pattern is 0 length */ RExC_utf8 = RExC_orig_utf8 = (plen == 0 || IN_BYTES) ? 0 : SvUTF8(pat); - RExC_rx_sv = NULL; \ RExC_uni_semantics = 0; RExC_seen_unfolded_sharp_s = 0; RExC_contains_locale = 0; @@ -7245,7 +7273,16 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, RExC_frame_head= NULL; RExC_frame_last= NULL; RExC_frame_count= 0; + RExC_latest_warn_offset = 0; + RExC_use_BRANCHJ = 0; RExC_total_parens = 0; + RExC_open_parens = NULL; + RExC_close_parens = NULL; + RExC_paren_names = NULL; + RExC_size = 0; +#ifdef DEBUGGING + RExC_paren_name_list = NULL; +#endif DEBUG_r({ RExC_mysv1= sv_newmortal(); @@ -7289,6 +7326,12 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, return old_re; } + /* Allocate the pattern's SV */ + RExC_rx_sv = Rx = (REGEXP*) newSV_type(SVt_REGEXP); + RExC_rx = ReANY(Rx); + if ( RExC_rx == NULL ) + FAIL("Regexp out of space"); + rx_flags = orig_rx_flags; if ( initial_charset == REGEX_DEPENDS_CHARSET @@ -7300,8 +7343,6 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, set_regex_charset(&rx_flags, REGEX_UNICODE_CHARSET); } - RExC_copy_start_in_constructed = RExC_copy_start_in_input = RExC_precomp = exp; - RExC_flags = rx_flags; RExC_pm_flags = pm_flags; if (runtime_code) { @@ -7325,43 +7366,82 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, RExC_maxlen = 0; RExC_in_lookbehind = 0; RExC_seen_zerolen = *exp == '^' ? -1 : 0; - RExC_extralen = 0; #ifdef EBCDIC RExC_recode_x_to_native = 0; #endif RExC_in_multi_char_class = 0; - /* First pass: determine size, legality. */ - RExC_pass1 = TRUE; - RExC_parse = exp; - RExC_start = RExC_copy_start_in_constructed = exp; - RExC_end = exp + plen; - RExC_precomp_end = RExC_end; - RExC_naughty = 0; - RExC_npar = 1; + RExC_start = RExC_copy_start_in_constructed = RExC_copy_start_in_input = RExC_precomp = exp; + RExC_precomp_end = RExC_end = exp + plen; RExC_nestroot = 0; - RExC_size = 0L; - RExC_emit = 1; RExC_whilem_seen = 0; - RExC_open_parens = 0; - RExC_close_parens = 0; RExC_end_op = NULL; - RExC_paren_names = NULL; -#ifdef DEBUGGING - RExC_paren_name_list = NULL; -#endif RExC_recurse = NULL; RExC_study_chunk_recursed = NULL; RExC_study_chunk_recursed_bytes= 0; RExC_recurse_count = 0; pRExC_state->code_index = 0; - /* We allocate scratch space as large as the largest node, for use in the - * first pass. Since many functions return RExC_emit on success, and '0' - * if an error, RExC_emit must never be 0, so we set it to 1 and double - * the scratch space */ - Newxc(RExC_emit_start, 2 * sizeof(regnode_ssc), char, regnode); - SAVEFREEPV(RExC_emit_start); + /* Initialize the string in the compiled pattern. This is so that there is + * something to output if necessary */ + set_regex_pv(pRExC_state, Rx); + + DEBUG_PARSE_r({ + Perl_re_printf( aTHX_ + "Starting parse and generation\n"); + RExC_lastnum=0; + RExC_lastparse=NULL; + }); + + /* Allocate space and zero-initialize. Note, the two step process + of zeroing when in debug mode, thus anything assigned has to + happen after that */ + if (! RExC_size) { + + /* On the first pass of the parse, we guess how big this will be. Then + * we grow in one operation to that amount and then give it back. As + * we go along, we re-allocate what we need. + * + * XXX Currently the guess is essentially that the pattern will be an + * EXACT node with one byte input, one byte output. This is crude, and + * better heuristics are welcome. + * + * On any subsequent passes, we guess what we actually computed in the + * latest earlier pass. Such a pass probably didn't complete so is + * missing stuff. We could improve those guesses by knowing where the + * parse stopped, and use the length so far plus apply the above + * assumption to what's left. */ + RExC_size = STR_SZ(RExC_end - RExC_start); + } + + Newxc(RExC_rxi, sizeof(regexp_internal) + RExC_size, char, regexp_internal); + if ( RExC_rxi == NULL ) + FAIL("Regexp out of space"); + + Zero(RExC_rxi, sizeof(regexp_internal) + RExC_size, char); + RXi_SET( RExC_rx, RExC_rxi ); + + /* We start from 0 (over from 0 in the case this is a reparse. The first + * node parsed will give back any excess memory we have allocated so far). + * */ + RExC_size = 0; + + /* non-zero initialization begins here */ + RExC_rx->engine= eng; + RExC_rx->extflags = rx_flags; + RXp_COMPFLAGS(RExC_rx) = orig_rx_flags & RXf_PMf_FLAGCOPYMASK; + + if (pm_flags & PMf_IS_QR) { + RExC_rxi->code_blocks = pRExC_state->code_blocks; + if (RExC_rxi->code_blocks) { + RExC_rxi->code_blocks->refcnt++; + } + } + + RExC_rx->intflags = 0; + + RExC_flags = rx_flags; /* don't let top level (?i) bleed */ + RExC_parse = exp; /* This NUL is guaranteed because the pattern comes from an SV*, and the sv * code makes sure the final byte is an uncounted NUL. But should this @@ -7372,13 +7452,35 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, * etc. So it is worth noting. */ assert(*RExC_end == '\0'); - DEBUG_PARSE_r( - Perl_re_printf( aTHX_ "Starting first pass (sizing)\n"); - RExC_lastnum=0; - RExC_lastparse=NULL; - ); + RExC_naughty = 0; + RExC_npar = 1; + RExC_emit_start = RExC_rxi->program; + pRExC_state->code_index = 0; + + *((char*) RExC_emit_start) = (char) REG_MAGIC; + RExC_emit = 1; + + /* Do the parse */ + if (reg(pRExC_state, 0, &flags, 1)) { + + /* Success!, But if RExC_total_parens < 0, we need to redo the parse + * knowing how many parens there actually are */ + if (RExC_total_parens < 0) { + flags |= RESTART_PARSE; + } + + /* We have that number in RExC_npar */ + RExC_total_parens = RExC_npar; + } + else if (! MUST_RESTART(flags)) { + ReREFCNT_dec(Rx); + Perl_croak(aTHX_ "panic: reg returned failure to re_op_compile, flags=%#" UVxf, (UV) flags); + } + /* XXX dent */ + + /* Here, we either have success, or we have to redo the parse for some reason */ + if (MUST_RESTART(flags)) { - if (reg(pRExC_state, 0, &flags, 1) == 0) { /* It's possible to write a regexp in ascii that represents Unicode codepoints outside of the byte range, such as via \x{100}. If we detect such a sequence we have to convert the entire pattern to utf8 @@ -7387,8 +7489,17 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, at least some part of the pattern, and therefore must convert the whole thing. -- dmq */ - if (MUST_RESTART(flags)) { if (flags & NEED_UTF8) { + + /* We have stored the offset of the final warning output so + * far. That must be adjusted. Any variant characters between + * the start of the pattern and this warning count for 2 bytes + * in the final, so just add them again */ + if (UNLIKELY(RExC_latest_warn_offset > 0)) { + RExC_latest_warn_offset += + variant_under_utf8_count((U8 *) exp, (U8 *) exp + + RExC_latest_warn_offset); + } S_pat_upgrade_to_utf8(aTHX_ pRExC_state, &exp, &plen, pRExC_state->code_blocks ? pRExC_state->code_blocks->count : 0); DEBUG_PARSE_r(Perl_re_printf( aTHX_ "Need to redo parse after upgrade\n")); @@ -7397,132 +7508,72 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, DEBUG_PARSE_r(Perl_re_printf( aTHX_ "Need to redo parse\n")); } - goto redo_parse; + if (RExC_total_parens > 0) { + /* Make enough room for all the known parens, and zero it */ + Renew(RExC_open_parens, RExC_total_parens, regnode_offset); + Zero(RExC_open_parens, RExC_total_parens, regnode_offset); + RExC_open_parens[0] = 1; /* +1 for REG_MAGIC */ + + Renew(RExC_close_parens, RExC_total_parens, regnode_offset); + Zero(RExC_close_parens, RExC_total_parens, regnode_offset); + } + else { /* Parse did not complete. Reinitialize the parentheses + structures */ + RExC_total_parens = 0; + if (RExC_open_parens) { + Safefree(RExC_open_parens); + RExC_open_parens = NULL; + } + if (RExC_close_parens) { + Safefree(RExC_close_parens); + RExC_close_parens = NULL; + } } - Perl_croak(aTHX_ "panic: reg returned failure to re_op_compile for sizing pass, flags=%#" UVxf, (UV) flags); - } - - DEBUG_PARSE_r({ - Perl_re_printf( aTHX_ - "Required size %" IVdf " nodes\n" - "Starting second pass (creation)\n", - (IV)RExC_size); - RExC_lastnum=0; - RExC_lastparse=NULL; - }); - /* The first pass could have found things that force Unicode semantics */ - if ((RExC_utf8 || RExC_uni_semantics) - && get_regex_charset(rx_flags) == REGEX_DEPENDS_CHARSET) - { - set_regex_charset(&rx_flags, REGEX_UNICODE_CHARSET); - } + /* Clean up what we did in this parse */ + SvREFCNT_dec_NN(RExC_rx_sv); - /* Small enough for pointer-storage convention? - If extralen==0, this means that we will not need long jumps. */ - if (RExC_size >= 0x10000L && RExC_extralen) - RExC_size += RExC_extralen; - else - RExC_extralen = 0; - if (RExC_whilem_seen > 15) - RExC_whilem_seen = 15; + goto redo_parse; + } - /* Allocate space and zero-initialize. Note, the two step process - of zeroing when in debug mode, thus anything assigned has to - happen after that */ - RExC_rx_sv = Rx = (REGEXP*) newSV_type(SVt_REGEXP); - RExC_rx = ReANY(Rx); - Newxc(RExC_rxi, sizeof(regexp_internal) + (unsigned)RExC_size * sizeof(regnode), - char, regexp_internal); - if ( RExC_rx == NULL || RExC_rxi == NULL ) - FAIL("Regexp out of space"); -#ifdef DEBUGGING - /* avoid reading uninitialized memory in DEBUGGING code in study_chunk() */ - Zero(RExC_rxi, sizeof(regexp_internal) + (unsigned)RExC_size * sizeof(regnode), - char); -#else - /* bulk initialize base fields with 0. */ - Zero(RExC_rxi, sizeof(regexp_internal), char); -#endif + /* Here, we have successfully parsed and generated the pattern's program + * for the regex engine. We are ready to finish things up and look for + * optimizations. */ - /* non-zero initialization begins here */ - RXi_SET( RExC_rx, RExC_rxi ); - RExC_rx->engine= eng; - RExC_rx->extflags = rx_flags; - RXp_COMPFLAGS(RExC_rx) = orig_rx_flags & RXf_PMf_FLAGCOPYMASK; + SetProgLen(RExC_rxi,RExC_size); - if (pm_flags & PMf_IS_QR) { - RExC_rxi->code_blocks = pRExC_state->code_blocks; - if (RExC_rxi->code_blocks) - RExC_rxi->code_blocks->refcnt++; + /* The values for the two variables below are now immutable, we can add + * them to the list to free without overwhelming it */ + if (RExC_open_parens) { + SAVEFREEPV(RExC_open_parens); + } + if (RExC_close_parens) { + SAVEFREEPV(RExC_close_parens); } - /* Set up the string to compile, with correct modifiers, etc */ + /* Update the string to compile, with correct modifiers, etc */ set_regex_pv(pRExC_state, Rx); - RExC_rx->intflags = 0; - RExC_total_parens = RExC_npar; - RExC_rx->nparens = RExC_total_parens - 1; /* set early to validate backrefs */ + RExC_rx->nparens = RExC_total_parens - 1; + + /* Uses the upper 4 bits of the FLAGS field, so keep within that size */ + if (RExC_whilem_seen > 15) + RExC_whilem_seen = 15; + + DEBUG_PARSE_r({ + Perl_re_printf( aTHX_ + "Required size %" IVdf " nodes\n", (IV)RExC_size); + RExC_lastnum=0; + RExC_lastparse=NULL; + }); - /* Useful during FAIL. */ #ifdef RE_TRACK_PATTERN_OFFSETS - Newxz(RExC_offsets, 2*RExC_size+1, U32); /* MJD 20001228 */ DEBUG_OFFSETS_r(Perl_re_printf( aTHX_ "%s %" UVuf " bytes for offset annotations.\n", RExC_offsets ? "Got" : "Couldn't get", (UV)((2*RExC_size+1) * sizeof(U32)))); #endif - SetProgLen(RExC_rxi, RExC_size); - RExC_rx_sv = Rx; - - /* Second pass: emit code. */ - RExC_pass1 = FALSE; - RExC_flags = rx_flags; /* don't let top level (?i) bleed */ - RExC_pm_flags = pm_flags; - RExC_parse = exp; - RExC_end = exp + plen; - RExC_naughty = 0; - RExC_emit_start = RExC_rxi->program; - RExC_emit = 1; - RExC_emit_bound = RExC_rxi->program + RExC_size + 1; - pRExC_state->code_index = 0; - - *((char*) RExC_emit_start) = (char) REG_MAGIC; - /* setup various meta data about recursion, this all requires - * RExC_npar to be correctly set, and a bit later on we clear it */ - if (RExC_seen & REG_RECURSE_SEEN) { - DEBUG_OPTIMISE_MORE_r(Perl_re_printf( aTHX_ - "%*s%*s Setting up open/close parens\n", - 22, "| |", (int)(0 * 2 + 1), "")); - - /* setup RExC_open_parens, which holds the address of each - * OPEN tag, and to make things simpler for the 0 index - * the start of the program - this is used later for offsets */ - Newxz(RExC_open_parens, RExC_npar, regnode_offset); - SAVEFREEPV(RExC_open_parens); - RExC_open_parens[0] = RExC_emit; - - /* setup RExC_close_parens, which holds the address of each - * CLOSE tag, and to make things simpler for the 0 index - * the end of the program - this is used later for offsets */ - Newxz(RExC_close_parens, RExC_npar, regnode_offset); - SAVEFREEPV(RExC_close_parens); - /* we dont know where end op starts yet, so we dont - * need to set RExC_close_parens[0] like we do RExC_open_parens[0] above */ - /* Note, RExC_npar is 1 + the number of parens in a pattern. - * So its 1 if there are no parens. */ - RExC_study_chunk_recursed_bytes= (RExC_npar >> 3) + - ((RExC_npar & 0x07) != 0); - Newx(RExC_study_chunk_recursed, - RExC_study_chunk_recursed_bytes * RExC_npar, U8); - SAVEFREEPV(RExC_study_chunk_recursed); - } - RExC_npar = 1; - if (reg(pRExC_state, 0, &flags, 1) == 0) { - ReREFCNT_dec(Rx); - Perl_croak(aTHX_ "panic: reg returned failure to re_op_compile for generation pass, flags=%#" UVxf, (UV) flags); - } DEBUG_OPTIMISE_r( Perl_re_printf( aTHX_ "Starting post parse optimization\n"); ); @@ -7535,6 +7586,16 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, SAVEFREEPV(RExC_recurse); } + if (RExC_seen & REG_RECURSE_SEEN) { + /* Note, RExC_total_parens is 1 + the number of parens in a pattern. + * So its 1 if there are no parens. */ + RExC_study_chunk_recursed_bytes= (RExC_total_parens >> 3) + + ((RExC_total_parens & 0x07) != 0); + Newx(RExC_study_chunk_recursed, + RExC_study_chunk_recursed_bytes * RExC_total_parens, U8); + SAVEFREEPV(RExC_study_chunk_recursed); + } + reStudy: RExC_rx->minlen = minlen = sawlookahead = sawplus = sawopen = sawminmod = 0; DEBUG_r( @@ -8538,8 +8599,18 @@ S_reg_scan_name(pTHX_ RExC_state_t *pRExC_state, U32 flags) he_str = hv_fetch_ent( RExC_paren_names, sv_name, 0, 0 ); if ( he_str ) sv_dat = HeVAL(he_str); - if ( ! sv_dat ) /* Didn't find group */ + if ( ! sv_dat ) { /* Didn't find group */ + + /* It might be a forward reference; we can't fail until we + * know, by completing the parse to get all the groups, and + * then reparsing */ + if (RExC_total_parens > 0) { vFAIL("Reference to nonexistent named group"); + } + else { + REQUIRE_PARENS_PASS; + } + } return sv_dat; } else { @@ -11185,10 +11256,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth) /* FALLTHROUGH */ case '\'': /* (?'...') */ name_start = RExC_parse; - svname = reg_scan_name(pRExC_state, - SIZE_ONLY /* reverse test from the others */ - ? REG_RSN_RETURN_NAME - : REG_RSN_RETURN_NULL); + svname = reg_scan_name(pRExC_state, REG_RSN_RETURN_NAME); if ( RExC_parse == name_start || RExC_parse >= RExC_end || *RExC_parse != paren) @@ -11196,7 +11264,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth) vFAIL2("Sequence (?%c... not terminated", paren=='>' ? '<' : paren); } - if (SIZE_ONLY) { + { HE *he_str; SV *sv_dat = NULL; if (!svname) /* shouldn't happen */ @@ -11288,6 +11356,11 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth) buffers in alternations share the same numbers */ paren = ':'; after_freeze = freeze_paren = RExC_npar; + + /* XXX This construct currently requires an extra pass. + * Investigation would be required to see if that could be + * changed */ + REQUIRE_PARENS_PASS; break; case ':': /* (?:...) */ case '>': /* (?>...) */ @@ -11302,6 +11375,11 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth) FAIL("Sequence (?R) not terminated"); num = 0; RExC_seen |= REG_RECURSE_SEEN; + + /* XXX These constructs currently require an extra pass. + * It probably could be changed */ + REQUIRE_PARENS_PASS; + *flagp |= POSTPONED; goto gen_recurse_regop; /*notreached*/ @@ -11374,8 +11452,17 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth) */ num = RExC_npar + num; if (num < 1) { + + /* It might be a forward reference; we can't fail until + * we know, by completing the parse to get all the + * groups, and then reparsing */ + if (RExC_total_parens > 0) { RExC_parse++; vFAIL("Reference to nonexistent group"); + } + else { + REQUIRE_PARENS_PASS; + } } } else if ( paren == '+' ) { num = RExC_npar + num - 1; @@ -11391,10 +11478,21 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth) ret = reg2Lanode(pRExC_state, GOSUB, num, RExC_recurse_count); if (!SIZE_ONLY) { - if (num > (I32)RExC_rx->nparens) { + if (num >= RExC_npar) { + + /* It might be a forward reference; we can't fail until we + * know, by completing the parse to get all the groups, and + * then reparsing */ + if (RExC_total_parens > 0) { + if (num >= RExC_total_parens) { RExC_parse++; vFAIL("Reference to nonexistent group"); - } + } + } + else { + REQUIRE_PARENS_PASS; + } + } RExC_recurse_count++; DEBUG_OPTIMISE_MORE_r(Perl_re_printf( aTHX_ "%*s%*s Recurse #%" UVuf " to %" IVdf "\n", @@ -11690,9 +11788,11 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth) } else REGTAIL(pRExC_state, ret, ender); +#if 0 /* Removing this doesn't cause failures in the test suite -- khw */ RExC_size++; /* XXX WHY do we need this?!! For large programs it seems to be required but I can't figure out why. -- dmq*/ +#endif return ret; } RExC_parse += UTF ? UTF8SKIP(RExC_parse) : 1; @@ -11734,6 +11834,36 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth) capturing_parens: parno = RExC_npar; RExC_npar++; + if (RExC_total_parens <= 0) { + /* If we are in our first pass through (and maybe only pass), + * we need to allocate memory for the capturing parentheses + * data structures. Since we start at npar=1, when it reaches + * 2, for the first time it has something to put in it. Above + * 2 means we extend what we already have */ + if (RExC_npar == 2) { + /* setup RExC_open_parens, which holds the address of each + * OPEN tag, and to make things simpler for the 0 index the + * start of the program - this is used later for offsets */ + Newxz(RExC_open_parens, RExC_npar, regnode_offset); + RExC_open_parens[0] = 1; /* +1 for REG_MAGIC */ + + /* setup RExC_close_parens, which holds the address of each + * CLOSE tag, and to make things simpler for the 0 index + * the end of the program - this is used later for offsets + * */ + Newxz(RExC_close_parens, RExC_npar, regnode_offset); + /* we dont know where end op starts yet, so we dont need to + * set RExC_close_parens[0] like we do RExC_open_parens[0] + * above */ + } + else { + Renew(RExC_open_parens, RExC_npar, regnode_offset); + Zero(RExC_open_parens + RExC_npar - 1, 1, regnode_offset); + + Renew(RExC_close_parens, RExC_npar, regnode_offset); + Zero(RExC_close_parens + RExC_npar - 1, 1, regnode_offset); + } + } ret = reganode(pRExC_state, OPEN, parno); if (!SIZE_ONLY ){ @@ -12069,6 +12199,11 @@ S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, I32 first, U32 depth) /* FIXME adding one for every branch after the first is probably * excessive now we have TRIE support. (hv) */ MARK_NAUGHTY(1); + if ( chain > (SSize_t) BRANCH_MAX_OFFSET + && ! RExC_use_BRANCHJ) + { + REQUIRE_BRANCHJ(flagp, 0); + } REGTAIL(pRExC_state, chain, latest); } chain = latest; @@ -12238,7 +12373,6 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth) NEXT_OFF(REGNODE_p(ret)) = 3; /* Go over NOTHING to LONGJMP. */ REGTAIL(pRExC_state, ret, reg_node(pRExC_state, NOTHING)); - if (SIZE_ONLY) RExC_whilem_seen++, RExC_extralen += 3; MARK_NAUGHTY_EXP(1, 4); /* compound interest */ } @@ -12746,12 +12880,6 @@ S_alloc_maybe_populate_EXACT(pTHX_ RExC_state_t *pRExC_state, PERL_ARGS_ASSERT_ALLOC_MAYBE_POPULATE_EXACT; - /* Don't bother to check for downgrading in PASS1, as it doesn't make any - * sizing difference, and is extra work that is thrown away */ - if (downgradable && ! PASS2) { - downgradable = FALSE; - } - if (! len_passed_in) { if (UTF) { if (UVCHR_IS_INVARIANT(code_point)) { @@ -12850,16 +12978,15 @@ S_alloc_maybe_populate_EXACT(pTHX_ RExC_state_t *pRExC_state, } } - if (SIZE_ONLY) { - RExC_size += STR_SZ(len); + if (downgradable) { + change_engine_size(pRExC_state, STR_SZ(len)); } - else { + RExC_emit += STR_SZ(len); STR_LEN(REGNODE_p(node)) = len; if (! len_passed_in) { Copy((char *) character, STRING(REGNODE_p(node)), len, char); } - } *flagp |= HASWIDTH; @@ -13540,8 +13667,20 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth) RExC_parse++; } if (!SIZE_ONLY) { - if (num > (I32)RExC_rx->nparens) + if (num >= (I32)RExC_npar) { + + /* It might be a forward reference; we can't fail until we + * know, by completing the parse to get all the groups, and + * then reparsing */ + if (RExC_total_parens > 0) { + if (num >= RExC_total_parens) { vFAIL("Reference to nonexistent group"); + } + } + else { + REQUIRE_PARENS_PASS; + } + } } RExC_sawback = 1; ret = reganode(pRExC_state, @@ -13623,6 +13762,9 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth) * created for the new category. */ U8 node_type = EXACT; + /* Assume node will be fully used; the excess is given back at the end */ + Ptrdiff_t initial_size = STR_SZ(256); + bool next_is_quantifier; char * oldp = NULL; @@ -13642,9 +13784,12 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth) * */ bool uni_semantics_at_node_start; - /* The node_type may change below, but since the size of the node - * doesn't change, it works */ - ret = reg_node(pRExC_state, node_type); + /* Allocate an EXACT node. The node_type may change below to + * another EXACTish node, but since the size of the node doesn't + * change, it works */ + ret = regnode_guts(pRExC_state, node_type, initial_size, "exact"); + FILL_NODE(ret, node_type); + RExC_emit++; /* In pass1, folded, we use a temporary buffer instead of the * actual node, as the node doesn't exist yet */ @@ -14404,6 +14549,8 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth) loopdone: /* Jumped to when encounters something that shouldn't be in the node */ + change_engine_size(pRExC_state, - (initial_size - STR_SZ(len))); + /* I (khw) don't know if you can get here with zero length, but the * old code handled this situation by creating a zero-length EXACT * node. Might as well be NOTHING instead */ @@ -15443,13 +15590,6 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist, * compile time values are valid in all runtime cases */ REQUIRE_UNI_RULES(flagp, 0); - /* This will return only an ANYOF regnode, or (unlikely) something smaller - * (such as EXACT). Thus we can skip most everything if just sizing. We - * call regclass to handle '[]' so as to not have to reinvent its parsing - * rules here (throwing away the size it computes each time). And, we exit - * upon an unescaped ']' that isn't one ending a regclass. To do both - * these things, we need to realize that something preceded by a backslash - * is escaped, so we have to keep track of backslashes */ if (SIZE_ONLY) { UV nest_depth = 0; /* how many nested (?[...]) constructs */ @@ -15646,8 +15786,8 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist, skip_to_be_ignored_text(pRExC_state, &RExC_parse, TRUE /* Force /x */ ); - if (RExC_parse >= RExC_end) { - Perl_croak(aTHX_ "panic: Read past end of '(?[ ])'"); + if (RExC_parse >= RExC_end) { /* Fail */ + break; } curchar = UCHARAT(RExC_parse); @@ -15822,6 +15962,10 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist, "flags=%#" UVxf, (UV) *flagp); } + if (! current) { + break; + } + /* function call leaves parse pointing to the ']', except if we * faked it */ if (is_posix_class) { @@ -15841,6 +15985,9 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist, case ')': if (av_tindex_skip_len_mg(fence_stack) < 0) { + if (UCHARAT(RExC_parse - 1) == ']') { + break; + } RExC_parse++; vFAIL("Unexpected ')'"); } @@ -16027,6 +16174,9 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist, default: RExC_parse += (UTF) ? UTF8SKIP(RExC_parse) : 1; + if (RExC_parse >= RExC_end) { + break; + } vFAIL("Unexpected character"); handle_operand: @@ -16086,7 +16236,18 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist, RExC_parse += (UTF) ? UTF8SKIP(RExC_parse) : 1; } /* End of loop parsing through the construct */ + vFAIL("Syntax error in (?[...])"); + done: + + if (RExC_parse >= RExC_end || RExC_parse[1] != ')') { + if (RExC_parse < RExC_end) { + RExC_parse++; + } + + vFAIL("Unexpected ']' with no following ')' in (?[..."); + } + if (av_tindex_skip_len_mg(fence_stack) >= 0) { vFAIL("Unmatched ("); } @@ -16370,8 +16531,10 @@ S_output_or_return_posix_warnings(pTHX_ RExC_state_t *pRExC_state, AV* posix_war PREPARE_TO_DIE; } } - Perl_warner(aTHX_ packWARN(WARN_REGEXP), "%s", + if (TO_OUTPUT_WARNINGS(RExC_parse)) { + Perl_warner(aTHX_ packWARN(WARN_REGEXP), "%s", SvPVX(msg)); + } SvREFCNT_dec_NN(msg); } } @@ -18029,20 +18192,6 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, } } - /* Assume we are going to generate an ANYOF-type node. */ - op = (posixl) - ? ANYOFPOSIXL - : (LOC) - ? ANYOFL - : ANYOF; - ret = reganode(pRExC_state, op, 0); - - if (SIZE_ONLY) { - return ret; - } - - /****** !SIZE_ONLY (Pass 2) AFTER HERE *********/ - /* If folding, we calculate all characters that could fold to or from the * ones already on the list */ if (cp_foldable_list) { @@ -18424,8 +18573,6 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, } if (ret_invlist) { - assert(cp_list); - *ret_invlist = cp_list; SvREFCNT_dec(swash); @@ -18729,13 +18876,16 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, } /* It's going to be an ANYOF node. */ - OP(REGNODE_p(ret)) = (use_anyofd) + op = (use_anyofd) ? ANYOFD : ((posixl) ? ANYOFPOSIXL : ((LOC) ? ANYOFL : ANYOF)); + ret = regnode_guts(pRExC_state, op, regarglen[op], "anyof"); + FILL_NODE(ret, op); /* We set the argument later */ + RExC_emit += 1 + regarglen[op]; ANYOF_FLAGS(REGNODE_p(ret)) = anyof_flags; /* Here, contains all the code points we can determine at @@ -19235,6 +19385,28 @@ S_change_engine_size(pTHX_ RExC_state_t *pRExC_state, const Ptrdiff_t size) PERL_ARGS_ASSERT_CHANGE_ENGINE_SIZE; RExC_size += size; + + Renewc(RExC_rxi, + sizeof(regexp_internal) + (RExC_size + 1) * sizeof(regnode), + /* +1 for REG_MAGIC */ + char, + regexp_internal); + if ( RExC_rxi == NULL ) + FAIL("Regexp out of space"); + RXi_SET(RExC_rx, RExC_rxi); + + RExC_emit_start = RExC_rxi->program; + if (size > 0) { + Zero(REGNODE_p(RExC_emit), size, regnode); + } + +#ifdef RE_TRACK_PATTERN_OFFSETS + Renew(RExC_offsets, 2*RExC_size+1, U32); + if (size > 0) { + Zero(RExC_offsets + 2*(RExC_size - size) + 1, 2 * size, U32); + } + RExC_offsets[0] = 2*RExC_size+1; +#endif } STATIC regnode_offset @@ -19252,21 +19424,14 @@ S_regnode_guts(pTHX_ RExC_state_t *pRExC_state, const U8 op, const STRLEN extra_ PERL_ARGS_ASSERT_REGNODE_GUTS; - assert(extra_size >= regarglen[op]); - - if (SIZE_ONLY) { SIZE_ALIGN(RExC_size); change_engine_size(pRExC_state, (Ptrdiff_t) 1 + extra_size); - return(ret); - } - if (REGNODE_p(RExC_emit) >= RExC_emit_bound) - Perl_croak(aTHX_ "panic: reg_node overrun trying to emit %d, %p>=%p", - op, (void*)REGNODE_p(RExC_emit), (void*)RExC_emit_bound); - NODE_ALIGN_FILL(REGNODE_p(ret)); #ifndef RE_TRACK_PATTERN_OFFSETS PERL_UNUSED_ARG(name); #else + assert(extra_size >= regarglen[op] || PL_regkind[op] == ANYOF); + if (RExC_offsets) { /* MJD */ MJD_OFFSET_DEBUG( ("%s:%d: (op %s) %s %" UVuf " (len %" UVuf ") (max %" UVuf ").\n", @@ -19314,7 +19479,7 @@ S_reganode(pTHX_ RExC_state_t *pRExC_state, U8 op, U32 arg) PERL_ARGS_ASSERT_REGANODE; /* ANYOF are special cased to allow non-length 1 args */ - assert(regarglen[op] == 1 || PL_regkind[op] == ANYOF); + assert(regarglen[op] == 1); if (PASS2) { regnode_offset ptr = ret; @@ -19352,9 +19517,8 @@ S_reg2Lanode(pTHX_ RExC_state_t *pRExC_state, const U8 op, const U32 arg1, const * IMPORTANT NOTE - it is the *callers* responsibility to correctly * set up NEXT_OFF() of the inserted node if needed. Something like this: * -* reginsert(pRExC, OPFAIL, orig_emit, depth+1); -* if (PASS2) -* NEXT_OFF(orig_emit) = regarglen[OPFAIL] + NODE_STEP_REGNODE; +* reginsert(pRExC, OPFAIL, orig_emit, depth+1); +* NEXT_OFF(orig_emit) = regarglen[OPFAIL] + NODE_STEP_REGNODE; * * ALSO NOTE - FLAGS(newly-inserted-operator) will be set to 0 as well. */ @@ -19378,9 +19542,11 @@ S_reginsert(pTHX_ RExC_state_t *pRExC_state, const U8 op, RExC_size += size; return; } + assert(!RExC_study_started); /* I believe we should never use reginsert once we have started studying. If this is wrong then we need to adjust RExC_recurse below like we do with RExC_open_parens/RExC_close_parens. */ + change_engine_size(pRExC_state, (Ptrdiff_t) size); src = REGNODE_p(RExC_emit); RExC_emit += size; dst = REGNODE_p(RExC_emit); diff --git a/regcomp.h b/regcomp.h index f2ce68fb715e..c0b2c07ec504 100644 --- a/regcomp.h +++ b/regcomp.h @@ -381,10 +381,9 @@ struct regnode_ssc { #define REG_MAGIC 0234 -#define SIZE_ONLY RExC_pass1 +#define SIZE_ONLY FALSE #define PASS1 SIZE_ONLY #define PASS2 (! SIZE_ONLY) - /* An ANYOF node is basically a bitmap with the index being a code point. If * the bit for that code point is 1, the code point matches; if 0, it doesn't * match (complemented if inverted). There is an additional mechanism to deal