regcomp.c: Avoid a panic on malformed qr/\N{..}/i
Input constructs that expand to more than one character are handled very
specially when they occur within a bracketed character class.  What
happens effectively is that they are removed from the class and parsed
separately, using the regular code in regcomp.c to generate something
like a trie for them.  The other characters within the class are handled
normally.

The specially handled stuff is parsed from a separate string.  In the
case where that stuff is of the form \N{U+...}, I neglected to
adequately consider that the syntax could trigger an error.  When such
an error is raised, it can violate our assumptions about the state of
things, and lead to a panic.

The code actually parses the construct twice: the first time while
deciding whether it expands to multiple characters (so that it can be
separated out), and the second time to actually figure out and return
the expansion.  This commit fixes the bug by adding error checking
during the first pass.  Previously, the minimal amount of work was done
to be able to find the count of characters in the expansion.  Now, more
work is done to do the checking, as we go along with the counting.  This
actually results in less special case code needing to be executed, so
there is a net code removal from this commit.
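
The difference between the two first-pass strategies can be sketched in
standalone C.  This is only an illustration, not the code in regcomp.c:
the function names count_by_dots and count_and_check, the buffer-based
output, and the -1 error return are all assumptions made for the example,
and it only looks at the text between "U+" and the closing brace, treating
each field as a plain run of hex digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Old strategy (simplified): the count is one more than the number of
 * dots.  Nothing is validated, so "41.4G" still reports 2. */
static size_t count_by_dots(const char *start, const char *endbrace)
{
    size_t count = 1;
    const char *p = start;

    assert(start <= endbrace);
    while (p < endbrace
           && (p = memchr(p, '.', (size_t)(endbrace - p))) != NULL)
    {
        count++;
        p++;                            /* step past the dot */
    }
    return count;
}

/* New strategy (sketch): walk every dot-separated field, check that it is
 * a non-empty run of hex digits, and count as we go.  If 'out' is
 * non-NULL, also build the "(?:\x{..}\x{..})" form; if it is NULL, only
 * count and check.  Returns 0 on success, -1 on malformed input or if
 * 'out' is too small (hypothetical convention for this example). */
static int count_and_check(const char *start, const char *endbrace,
                           char *out, size_t out_size, size_t *cp_count)
{
    const char *p = start;
    size_t count = 0;
    size_t used = 0;

    assert(start <= endbrace);
    if (out) {
        if (out_size < 4) return -1;
        memcpy(out, "(?:", 3);
        used = 3;
    }

    for (;;) {
        const char *field = p;

        while (p < endbrace && isxdigit((unsigned char) *p)) p++;
        if (p == field) return -1;                  /* empty field */
        if (p < endbrace && *p != '.') return -1;   /* stray character */

        if (out) {
            size_t len = (size_t)(p - field);

            /* "\x{" + digits + "}" and still room for ")" and NUL */
            if (used + 3 + len + 1 + 2 > out_size) return -1;
            memcpy(out + used, "\\x{", 3);  used += 3;
            memcpy(out + used, field, len); used += len;
            out[used++] = '}';
        }
        count++;

        if (p >= endbrace) break;
        p++;                            /* step past the dot */
    }

    if (out) {
        out[used++] = ')';
        out[used] = '\0';
    }
    if (cp_count) *cp_count = count;
    return 0;
}
```

With input "41.4G", count_by_dots reports 2 code points and the problem
only surfaces in a later pass, whereas count_and_check fails immediately,
whether or not the caller asked for the expansion.  Sharing one loop for
both the count-only and the build cases is what lets the separate counting
code go away.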
khwilliamson committed May 19, 2018
1 parent bc929b6 commit 94a3864
Showing 1 changed file with 19 additions and 29 deletions.
48 changes: 19 additions & 29 deletions regcomp.c
@@ -12242,8 +12242,8 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state,
      * *node_p, nor *code_point_p, nor *flagp.
      *
      * If <cp_count> is not NULL, the caller wants to know the length (in code
-     * points) that this \N sequence matches. This is set even if the function
-     * returns FALSE, as detailed below.
+     * points) that this \N sequence matches. This is set, and the input is
+     * parsed for errors, even if the function returns FALSE, as detailed below.
      *
      * There are 5 possibilities here, as detailed in the next 5 paragraphs.
      *
@@ -12466,57 +12466,47 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state,
         }

         /* Here, looks like its really a multiple character sequence. Fail
-         * if that's not what the caller wants. */
-        if (! node_p) {
-
-            /* But even if failing, we count the code points if requested, and
-             * don't back up up the pointer as the caller is expected to
-             * handle this situation */
-            if (cp_count) {
-                char * dot = RExC_parse + 1;
-                do {
-                    dot = (char *) memchr(dot, '.', endbrace - dot);
-                    if (! dot) {
-                        break;
-                    }
-                    count++;
-                    dot++;
-                } while (dot < endbrace);
-                count++;
-
-                *cp_count = count;
-                RExC_parse = endbrace;
-                nextchar(pRExC_state);
-            }
-            else { /* Back up the pointer. */
-                RExC_parse = p;
-            }
+         * if that's not what the caller wants. But continue with counting
+         * and error checking if they still want a count */
+        if (! node_p && ! cp_count) {
             return FALSE;
         }

         /* What is done here is to convert this to a sub-pattern of the
          * form \x{char1}\x{char2}... and then call reg recursively to
          * parse it (enclosing in "(?: ... )" ). That way, it retains its
          * atomicness, while not having to worry about special handling
-         * that some code points may have. */
+         * that some code points may have. We don't create a subpattern,
+         * but go through the motions of code point counting and error
+         * checking, if the caller doesn't want a node returned. */

-        if (count == 1) {
+        if (node_p && count == 1) {
             substitute_parse = newSVpvs("?:");
         }

       do_concat:

+        if (node_p) {
         /* Convert to notation the rest of the code understands */
         sv_catpv(substitute_parse, "\\x{");
         sv_catpvn(substitute_parse, start_digit, RExC_parse - start_digit);
         sv_catpv(substitute_parse, "}");
+        }

         /* Move to after the dot (or ending brace the final time through.)
          * */
         RExC_parse++;
         count++;

     } while (RExC_parse < endbrace);
+
+    if (! node_p) { /* Doesn't want the node */
+        assert (cp_count);
+
+        *cp_count = count;
+        return FALSE;
+    }
+
     sv_catpv(substitute_parse, ")");

 #ifdef EBCDIC