PATCH: [perl #56444] delayed interpolation of \N{...}

make regen embed.fnc needs to be run on this patch. This patch fixes Bugs #56444 and #62056. Hopefully we have finally gotten this right. The parser used to handle all the escaped constants, expanding \x2e to its single byte equivalent. The problem is that for regexp patterns, this is a '.', which is a metacharacter and has special meaning that \x2e does not. So things were changed so that the parser didn't expand things in patterns. But this causes problems for \N{NAME}, when the pattern doesn't get evaluated until runtime, as for example when it has a scalar reference in it, like qr/$foo\N{NAME}/. We want the value for \N{NAME} that was in effect at the point during the parsing phase that this regex was encountered in, but we don't actually look at it until runtime, when these bug reports show that it is gone. The solution is for the tokenizer to parse \N{NAME}, but to compile it into an intermediate value that won't ever be considered a metacharacter. We have chosen to compile NAME to its equivalent code point value, and express it in the already existing \N{U+...} form. This indicates to the regex compiler that the original input was a named character and retains the value it had at that point in the parse. This means that \N{U+...} now always must imply Unicode semantics for the string or pattern it appeared in. Previously there was an inconsistency, where effectively \N{NAME} implied Unicode semantics, but \N{U+...} did not necessarily. So now, any string or pattern that has either of these forms is utf8 upgraded. A complication is that a charnames handler can return a sequence of multiple characters instead of just one. To deal with this case, the tokenizer will generate a constant of the form \N{U+c1.c2.c2...}, where c1 etc are the individual characters. Perhaps this will be made a public interface someday, but I decided to not expose it externally as far as possible for now in case we find reason to change it. It is possible to defeat this by passing it in a single quoted string to the regex compiler, so the documentation will be changed to discourage that. A further complication is that \N can have an additional meaning: to match a non-newline. This means that the two meanings have to be disambiguated. embed.fnc was changed to make public the function regcurly() in regcomp.c so that it could be referred to in toke.c to see if the ... in \N{...} is a legal quantifier like {2,}. This is used in the disambiguation. toke.c was changed to update some out-dated relevant comments. It now parses \N in patterns. If it determines that it isn't a named sequence, it passes it through unchanged. This happens when there is no brace after the \N, or no closing brace, or if the braces enclose a legal quantifier. Previously there has been essentially no restriction on what can come between the braces so that a custom translator can accept virtually anything. Now, legal quantifiers are assumed to mean that the \N is a "match non-newline that quantity of times". I removed the #ifdef'd out code that had been left in in case pack U reverted to earlier behavior. I did this because it complicated things, and because the change to pack U has been in long enough and shown that it is correct so it's not likely to be reverted. \N meaning a named character is handled differently depending on whether this is a pattern or not. In all cases, the output will be upgraded to utf8 because a named character implies Unicode semantics. If not a pattern, the \N is parsed into a utf8 string, as before. Otherwise it will be parsed into the intermediate \N{U+...} form. If the original was already a valid \N{U+...} constant, it is passed through unchanged. I now check that the sequence returned by the charnames handler is not malformed, which was lacking before. The code in regcomp.c which dealt with interfacing with the charnames handler has been removed. All the values should be determined by the time regcomp.c gets involved. The affected subroutine is necessarily restructured. An EXACT-type node is generated for the character sequence. Such a node has a capacity of 255 bytes, and so it is possible to overflow it. This wasn't checked for before, but now it is, and a warning issued and the overflowing characters are discarded.
Perl · Feb 19, 2010 · ff3f963 · ff3f963
1 parent 8df7d2a
commit ff3f963
Show file tree

Hide file tree

Showing 10 changed files with 659 additions and 374 deletions.
diff --git a/embed.fnc b/embed.fnc
@@ -165,6 +165,7 @@ npR	|MEM_SIZE|malloc_good_size	|size_t nbytes
 
 AnpR	|void*	|get_context
 Anp	|void	|set_context	|NN void *t
+EpRnP	|I32	|regcurly	|NN const char *s
 
 END_EXTERN_C
 
@@ -1706,7 +1707,6 @@ Es	|regnode*|regbranch	|NN struct RExC_state_t *pRExC_state \
 Es	|STRLEN	|reguni		|NN const struct RExC_state_t *pRExC_state \
 				|UV uv|NN char *s
 Es	|regnode*|regclass	|NN struct RExC_state_t *pRExC_state|U32 depth
-ERsn	|I32	|regcurly	|NN const char *s
 Es	|regnode*|reg_node	|NN struct RExC_state_t *pRExC_state|U8 op
 Es	|UV	|reg_recode	|const char value|NN SV **encp
 Es	|regnode*|regpiece	|NN struct RExC_state_t *pRExC_state \

diff --git a/pod/perl5120delta.pod b/pod/perl5120delta.pod
@@ -237,9 +237,10 @@ for some or all operations. (Yuval Kogman)
 
 A new regex escape has been added, C<\N>. It will match any character that
 is not a newline, independently from the presence or absence of the single
-line match modifier C</s>. (If C<\N> is followed by an opening brace and
+line match modifier C</s>.  It is not usable within a character class.
+(If C<\N> is followed by an opening brace and
 by a letter, perl will still assume that a Unicode character name is
-coming, so compatibility is preserved.) (Rafael Garcia-Suarez)
+coming, so compatibility is preserved.) (Rafael Garcia-Suarez).
 
 This will break a L<custom charnames translator|charnames/CUSTOM TRANSLATORS>
 which allows numbers for character names, as C<\N{3}> will now mean to match 3
@@ -2464,24 +2465,6 @@ take a block as their first argument, like
 
 =item *
 
-The C<charnames> pragma may generate a run-time error when a regex is
-interpolated [RT #56444]:
-
-    use charnames ':full';
-    my $r1 = qr/\N{THAI CHARACTER SARA I}/;
-    "foo" =~ $r1;    # okay
-    "foo" =~ /$r1+/; # runtime error
-
-A workaround is to generate the character outside of the regex:
-
-    my $a = "\N{THAI CHARACTER SARA I}";
-    my $r1 = qr/$a/;
-
-However, C<$r1> must be used within the scope of the C<use charnames> for this
-to work.
-
-=item *
-
 Some regexes may run much more slowly when run in a child thread compared
 with the thread the pattern was compiled into [RT #55600].
 

diff --git a/pod/perldiag.pod b/pod/perldiag.pod
@@ -1912,10 +1912,10 @@ about 250 characters for simple names, and somewhat more for compound
 names (like C<$A::B>).  You've exceeded Perl's limits.  Future versions
 of Perl are likely to eliminate these arbitrary limitations.
 
-=item Ignoring %s in character class in regex; marked by <-- HERE in m/%s/
+=item Ignoring zero length \N{} in character class"
 
-(W) Named Unicode character escapes (\N{...}) may return multi-char
-or zero length sequences. When such an escape is used in a character class
+(W) Named Unicode character escapes (\N{...}) may return a
+zero length sequence.  When such an escape is used in a character class
 its behaviour is not well defined. Check that the correct escape has
 been used, and the correct charname handler is in scope.
 
@@ -2395,6 +2395,10 @@ See also L<Encode/"Handling Malformed Data">.
 (F) Perl thought it was reading UTF-16 encoded character data but while
 doing it Perl met a malformed Unicode surrogate.
 
+=item Malformed UTF-8 returned by \N
+
+(F) The charnames handler returned malformed UTF-8.
+
 =item Malformed UTF-8 string in pack
 
 (F) You tried to pack something that didn't comply with UTF-8 encoding
@@ -2467,7 +2471,7 @@ supplied.
 (F) The argument to the indicated command line switch must follow
 immediately after the switch, without intervening spaces.
 
-=item Missing %sbrace%s on \N{}
+=item Missing braces on \N{}
 
 (F) Wrong syntax of character name literal C<\N{charname}> within
 double-quotish context.
@@ -2506,7 +2510,34 @@ can vary from one line to the next.
 
 =item Missing right brace on %s
 
-(F) Missing right brace in C<\x{...}>, C<\p{...}> or C<\P{...}>.
+(F) Missing right brace in C<\x{...}>, C<\p{...}>, C<\P{...}>, or C<\N{...}>.
+
+=item Missing right brace on \\N{} or unescaped left brace after \\N.  Assuming the latter
+
+(W syntax)
+C<\N> has traditionally been followed by a name enclosed in braces,
+meaning the character (or sequence of characters) given by that name.
+Thus C<\N{ASTERISK}> is another way of writing C<*>, valid in both
+double-quoted strings and regular expression patterns.
+In patterns, it doesn't have the meaning an unescaped C<*> does.
+
+Starting in Perl 5.12.0, C<\N> also can have an additional meaning in patterns,
+namely to match a non-newline character.  (This is like C<.> but is not
+affected by the C</s> modifier.)
+
+This can lead to some ambiguities.  When C<\N> is not followed immediately by a
+left brace, Perl assumes the "match non-newline character" meaning.  Also, if
+the braces form a valid quantifier such as C<\N{3}> or C<\N{5,}>, Perl assumes
+that this means to match the given quantity of non-newlines (in these examples,
+3, and 5 or more, respectively).  In all other case, where there is a C<\N{>
+and a matching C<}>, Perl assumes that a character name is desired.
+
+However, if there is no matching C<}>, Perl doesn't know if it was mistakenly
+omitted, or if "match non-newline" followed by "match a C<{>" was desired.
+It assumes the latter because that is actually a valid interpretation as
+written, unlike the other case.  If you meant the former, you need to add the
+matching right brace.  If you did mean the latter, you can silence this warning
+by writing instead C<\N\{>.
 
 =item Missing right curly or square bracket
 
@@ -2593,6 +2624,13 @@ that yet.
 sense to try to declare one with a package qualifier on the front.  Use
 local() if you want to localize a package variable.
 
+=item \\N in a character class must be a named character: \\N{...}
+
+The new (5.12) meaning of C<\N> to match non-newlines is not valid in a
+bracketed character class, for the same reason that C<.> in a character class
+loses its specialness: it matches almost everything, which is probably not what
+you want.
+
 =item Name "%s::%s" used only once: possible typo
 
 (W once) Typographical errors often show up as unique variable names.
@@ -2605,6 +2643,11 @@ NOTE: This warning detects symbols that have been used only once so $c, @c,
 the same; if a program uses $c only once but also uses any of the others it
 will not trigger this warning.
 
+=item Invalid hexadecimal number in \\N{U+...}
+
+(F) The character constant represented by C<...> is not a valid hexadecimal
+number.
+
 =item Negative '/' count in unpack
 
 (F) The length count obtained from a length/code unpack operation was
@@ -4943,6 +4986,20 @@ C<< @foo->[23] >> or C<< @$ref->[99] >>.  Versions of perl <= 5.6.1 used to
 allow this syntax, but shouldn't have. It is now deprecated, and will be
 removed in a future version.
 
+=item Using just the first character returned by \N{} in character class
+
+(W) A charnames handler may return a sequence of more than one character.
+Currently all but the first one are discarded when used in a regular
+expression pattern bracketed character class.
+
+=item Using just the first characters returned by \N{}
+
+(W) A charnames handler may return a sequence of characters.  There is a finite
+limit as to the number of characters that can be used, which this sequence
+exceeded.  In the message, the characters in the sequence are separated by
+dots, and each is shown by its ordinal in hex.  Anything to the left of the
+C<HERE> was retained; anything to the right was discarded.
+
 =item UTF-16 surrogate %s
 
 (W utf8) You tried to generate half of a UTF-16 surrogate by