PATCH: [perl #56444] delayed interpolation of \N{...}
make regen embed.fnc
needs to be run on this patch.

This patch fixes Bugs #56444 and #62056.

Hopefully we have finally gotten this right.  The parser used to handle
all the escaped constants, expanding \x2e to its single byte equivalent.
The problem is that for regexp patterns, this is a '.', which is a
metacharacter and has special meaning that \x2e does not.  So things
were changed so that the parser didn't expand things in patterns.  But
this causes problems for \N{NAME}, when the pattern doesn't get
evaluated until runtime, as for example when it has a scalar reference
in it, like qr/$foo\N{NAME}/.  We want the value for \N{NAME} that was
in effect at the point during the parsing phase that this regex was
encountered in, but we don't actually look at it until runtime, when
these bug reports show that it is gone.  The solution is for the
tokenizer to parse \N{NAME}, but to compile it into an intermediate
value that won't ever be considered a metacharacter.  We have chosen to
compile NAME to its equivalent code point value, and express it in the
already existing \N{U+...} form.  This indicates to the regex compiler
that the original input was a named character and retains the value it
had at that point in the parse.

This means that \N{U+...} now always must imply Unicode semantics for
the string or pattern it appeared in.  Previously there was an
inconsistency, where effectively \N{NAME} implied Unicode semantics, but
\N{U+...} did not necessarily.  So now, any string or pattern that has
either of these forms is utf8 upgraded.

A complication is that a charnames handler can return a sequence of
multiple characters instead of just one.  To deal with this case, the
tokenizer will generate a constant of the form \N{U+c1.c2.c3...}, where
c1 etc. are the individual characters.  Perhaps this will be made a
public interface someday, but I decided to not expose it externally as
far as possible for now in case we find reason to change it.  It is
possible to defeat this by passing it in a single quoted string to the
regex compiler, so the documentation will be changed to discourage that.
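
As a rough illustration of the intermediate form described above, the
following C sketch joins a sequence of code points into a
\N{U+c1.c2...} constant.  The function name, the upper-case hex digits,
and the buffer handling are assumptions made for this example; none of
them is taken from toke.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from toke.c): writes "\N{U+c1.c2...}" for the
 * n code points in cps into out, which holds outsz bytes.
 * Returns the length written (excluding the NUL), or -1 if out is too small. */
int sketch_format_named_sequence(const unsigned long *cps, size_t n,
                                 char *out, size_t outsz)
{
    static const char prefix[] = "\\N{U+";
    size_t used = sizeof prefix - 1;
    size_t i;

    assert(cps != NULL && out != NULL && n > 0);
    if (outsz <= used)
        return -1;
    memcpy(out, prefix, used);

    for (i = 0; i < n; i++) {
        /* Code points after the first are preceded by a dot. */
        int w = snprintf(out + used, outsz - used, "%s%lX",
                         i ? "." : "", cps[i]);
        if (w < 0 || (size_t)w >= outsz - used)
            return -1;
        used += (size_t)w;
    }

    if (outsz - used < 2)          /* need room for '}' and the NUL */
        return -1;
    out[used++] = '}';
    out[used] = '\0';
    return (int)used;
}
```

Under these assumptions, the two code points 0x41 and 0x301 would be
written as \N{U+41.301}.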

A further complication is that \N can have an additional meaning: to
match a non-newline.  This means that the two meanings have to be
disambiguated.

embed.fnc was changed to make public the function regcurly() in
regcomp.c so that it could be referred to in toke.c to see if the ... in
\N{...} is a legal quantifier like {2,}.  This is used in the
disambiguation.
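
To make the quantifier test concrete, here is a minimal C sketch of a
check for brace quantifiers such as {2,}.  It is a hypothetical
stand-in written for this explanation, not the regcurly() source, and
it assumes a quantifier is one of {n}, {n,} or {n,m} with decimal
digits only.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not the real regcurly()): returns true if s
 * starts with a brace quantifier of the form {n}, {n,} or {n,m}. */
bool sketch_is_quantifier(const char *s)
{
    assert(s != NULL);
    if (*s++ != '{')
        return false;
    if (!isdigit((unsigned char)*s))        /* lower bound is required */
        return false;
    while (isdigit((unsigned char)*s))
        s++;
    if (*s == ',') {
        s++;
        while (isdigit((unsigned char)*s))  /* upper bound is optional */
            s++;
    }
    return *s == '}';
}
```

With this sketch, "{3}", "{2,}" and "{2,5}" count as quantifiers, while
"{,5}" and "{LATIN SMALL LETTER A}" do not.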

toke.c was changed to update some out-dated relevant comments.
It now parses \N in patterns.  If it determines that it isn't a named
sequence, it passes it through unchanged.  This happens when there is no
brace after the \N, or no closing brace, or if the braces enclose a
legal quantifier.  Previously there was essentially no restriction
on what could come between the braces, so that a custom translator could
accept virtually anything.  Now, legal quantifiers are assumed to mean
that the \N is a "match non-newline that quantity of times".
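
The three pass-through cases above can be sketched as a small C
decision function.  The enum, the function names, and the simplified
quantifier test are all assumptions for illustration; the real logic
lives in toke.c and is more involved.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

typedef enum {
    SKETCH_NON_NEWLINE,  /* pass \N through: "match a non-newline" */
    SKETCH_NAMED_CHAR    /* \N{...} names a character or sequence */
} sketch_n_kind;

/* Simplified, hypothetical quantifier test: {n}, {n,} or {n,m}. */
static bool sketch_braces_are_quantifier(const char *s)
{
    if (*s++ != '{' || !isdigit((unsigned char)*s))
        return false;
    while (isdigit((unsigned char)*s))
        s++;
    if (*s == ',') {
        s++;
        while (isdigit((unsigned char)*s))
            s++;
    }
    return *s == '}';
}

/* after_N points at the text immediately following "\N" in a pattern. */
sketch_n_kind sketch_classify_backslash_N(const char *after_N)
{
    assert(after_N != NULL);
    if (*after_N != '{')                      /* no brace after \N */
        return SKETCH_NON_NEWLINE;
    if (strchr(after_N, '}') == NULL)         /* no closing brace */
        return SKETCH_NON_NEWLINE;
    if (sketch_braces_are_quantifier(after_N))/* e.g. \N{3} or \N{5,} */
        return SKETCH_NON_NEWLINE;
    return SKETCH_NAMED_CHAR;                 /* everything else is a name */
}
```

Here "{ASTERISK}" and "{U+263A}" are classified as named characters,
whereas "{3}", "{5,}", "{abc" and "abc" all fall back to the
non-newline meaning.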

I removed the #ifdef'd out code that had been left in in case pack U
reverted to earlier behavior.  I did this because it complicated things,
and because the change to pack U has been in long enough and shown that
it is correct so it's not likely to be reverted.

\N meaning a named character is handled differently depending on whether
this is a pattern or not.  In all cases, the output will be upgraded to
utf8 because a named character implies Unicode semantics.  If not a
pattern, the \N is parsed into a utf8 string, as before.  Otherwise it
will be parsed into the intermediate \N{U+...} form.  If the original
was already a valid \N{U+...} constant, it is passed through unchanged.

I now check that the sequence returned by the charnames handler is not
malformed, which was lacking before.

The code in regcomp.c which dealt with interfacing with the charnames
handler has been removed.  All the values should be determined by the
time regcomp.c gets involved.  The affected subroutine is necessarily
restructured.

An EXACT-type node is generated for the character sequence.  Such a node
has a capacity of 255 bytes, and so it is possible to overflow it.  This
wasn't checked for before, but now it is: a warning is issued and the
overflowing characters are discarded.
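
The overflow check can be pictured with the following C sketch, which
counts how many characters of a sequence fit within 255 bytes and
drops the rest.  It assumes the capacity is measured in standard UTF-8
bytes and that code points do not exceed 0x10FFFF; the names are
hypothetical and the code is not taken from regcomp.c.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_EXACT_CAPACITY 255   /* bytes, as stated in the commit message */

/* Bytes needed to encode cp in standard UTF-8 (assumes cp <= 0x10FFFF). */
size_t sketch_utf8_len(unsigned long cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Returns how many leading code points of cps fit in the capacity.
 * If bytes_used is non-NULL, it receives the bytes those characters take.
 * A caller would warn when the result is less than n. */
size_t sketch_chars_that_fit(const unsigned long *cps, size_t n,
                             size_t *bytes_used)
{
    size_t bytes = 0;
    size_t kept = 0;

    assert(cps != NULL || n == 0);
    while (kept < n) {
        size_t len = sketch_utf8_len(cps[kept]);
        if (bytes + len > SKETCH_EXACT_CAPACITY)
            break;                  /* this and later characters are dropped */
        bytes += len;
        kept++;
    }
    if (bytes_used != NULL)
        *bytes_used = bytes;
    return kept;
}
```

Stopping at a character boundary, rather than at exactly 255 bytes,
keeps the retained prefix well-formed.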
Karl Williamson authored and rgs committed Feb 19, 2010
1 parent 8df7d2a commit ff3f963
Showing 10 changed files with 659 additions and 374 deletions.
2 changes: 1 addition & 1 deletion embed.fnc
@@ -165,6 +165,7 @@ npR |MEM_SIZE|malloc_good_size |size_t nbytes

AnpR |void* |get_context
Anp |void |set_context |NN void *t
+EpRnP |I32 |regcurly |NN const char *s

END_EXTERN_C

@@ -1706,7 +1707,6 @@ Es |regnode*|regbranch |NN struct RExC_state_t *pRExC_state \
Es |STRLEN |reguni |NN const struct RExC_state_t *pRExC_state \
|UV uv|NN char *s
Es |regnode*|regclass |NN struct RExC_state_t *pRExC_state|U32 depth
-ERsn |I32 |regcurly |NN const char *s
Es |regnode*|reg_node |NN struct RExC_state_t *pRExC_state|U8 op
Es |UV |reg_recode |const char value|NN SV **encp
Es |regnode*|regpiece |NN struct RExC_state_t *pRExC_state \
23 changes: 3 additions & 20 deletions pod/perl5120delta.pod
@@ -237,9 +237,10 @@ for some or all operations. (Yuval Kogman)

A new regex escape has been added, C<\N>. It will match any character that
is not a newline, independently from the presence or absence of the single
-line match modifier C</s>. (If C<\N> is followed by an opening brace and
+line match modifier C</s>. It is not usable within a character class.
+(If C<\N> is followed by an opening brace and
by a letter, perl will still assume that a Unicode character name is
-coming, so compatibility is preserved.) (Rafael Garcia-Suarez)
+coming, so compatibility is preserved.) (Rafael Garcia-Suarez).

This will break a L<custom charnames translator|charnames/CUSTOM TRANSLATORS>
which allows numbers for character names, as C<\N{3}> will now mean to match 3
@@ -2464,24 +2465,6 @@ take a block as their first argument, like

=item *

-The C<charnames> pragma may generate a run-time error when a regex is
-interpolated [RT #56444]:
-
-use charnames ':full';
-my $r1 = qr/\N{THAI CHARACTER SARA I}/;
-"foo" =~ $r1; # okay
-"foo" =~ /$r1+/; # runtime error
-
-A workaround is to generate the character outside of the regex:
-
-my $a = "\N{THAI CHARACTER SARA I}";
-my $r1 = qr/$a/;
-
-However, C<$r1> must be used within the scope of the C<use charnames> for this
-to work.
-
-=item *
-
Some regexes may run much more slowly when run in a child thread compared
with the thread the pattern was compiled into [RT #55600].

67 changes: 62 additions & 5 deletions pod/perldiag.pod
@@ -1912,10 +1912,10 @@ about 250 characters for simple names, and somewhat more for compound
names (like C<$A::B>). You've exceeded Perl's limits. Future versions
of Perl are likely to eliminate these arbitrary limitations.

-=item Ignoring %s in character class in regex; marked by <-- HERE in m/%s/
+=item Ignoring zero length \N{} in character class"

-(W) Named Unicode character escapes (\N{...}) may return multi-char
-or zero length sequences. When such an escape is used in a character class
+(W) Named Unicode character escapes (\N{...}) may return a
+zero length sequence. When such an escape is used in a character class
its behaviour is not well defined. Check that the correct escape has
been used, and the correct charname handler is in scope.

@@ -2395,6 +2395,10 @@ See also L<Encode/"Handling Malformed Data">.
(F) Perl thought it was reading UTF-16 encoded character data but while
doing it Perl met a malformed Unicode surrogate.

+=item Malformed UTF-8 returned by \N
+
+(F) The charnames handler returned malformed UTF-8.
+
=item Malformed UTF-8 string in pack

(F) You tried to pack something that didn't comply with UTF-8 encoding
@@ -2467,7 +2471,7 @@ supplied.
(F) The argument to the indicated command line switch must follow
immediately after the switch, without intervening spaces.

-=item Missing %sbrace%s on \N{}
+=item Missing braces on \N{}

(F) Wrong syntax of character name literal C<\N{charname}> within
double-quotish context.
@@ -2506,7 +2510,34 @@ can vary from one line to the next.

=item Missing right brace on %s

-(F) Missing right brace in C<\x{...}>, C<\p{...}> or C<\P{...}>.
+(F) Missing right brace in C<\x{...}>, C<\p{...}>, C<\P{...}>, or C<\N{...}>.
+
+=item Missing right brace on \\N{} or unescaped left brace after \\N. Assuming the latter
+
+(W syntax)
+C<\N> has traditionally been followed by a name enclosed in braces,
+meaning the character (or sequence of characters) given by that name.
+Thus C<\N{ASTERISK}> is another way of writing C<*>, valid in both
+double-quoted strings and regular expression patterns.
+In patterns, it doesn't have the meaning an unescaped C<*> does.
+
+Starting in Perl 5.12.0, C<\N> also can have an additional meaning in patterns,
+namely to match a non-newline character. (This is like C<.> but is not
+affected by the C</s> modifier.)
+
+This can lead to some ambiguities. When C<\N> is not followed immediately by a
+left brace, Perl assumes the "match non-newline character" meaning. Also, if
+the braces form a valid quantifier such as C<\N{3}> or C<\N{5,}>, Perl assumes
+that this means to match the given quantity of non-newlines (in these examples,
+3, and 5 or more, respectively). In all other case, where there is a C<\N{>
+and a matching C<}>, Perl assumes that a character name is desired.
+
+However, if there is no matching C<}>, Perl doesn't know if it was mistakenly
+omitted, or if "match non-newline" followed by "match a C<{>" was desired.
+It assumes the latter because that is actually a valid interpretation as
+written, unlike the other case. If you meant the former, you need to add the
+matching right brace. If you did mean the latter, you can silence this warning
+by writing instead C<\N\{>.

=item Missing right curly or square bracket

@@ -2593,6 +2624,13 @@ that yet.
sense to try to declare one with a package qualifier on the front. Use
local() if you want to localize a package variable.

+=item \\N in a character class must be a named character: \\N{...}
+
+The new (5.12) meaning of C<\N> to match non-newlines is not valid in a
+bracketed character class, for the same reason that C<.> in a character class
+loses its specialness: it matches almost everything, which is probably not what
+you want.
+
=item Name "%s::%s" used only once: possible typo

(W once) Typographical errors often show up as unique variable names.
@@ -2605,6 +2643,11 @@ NOTE: This warning detects symbols that have been used only once so $c, @c,
the same; if a program uses $c only once but also uses any of the others it
will not trigger this warning.

+=item Invalid hexadecimal number in \\N{U+...}
+
+(F) The character constant represented by C<...> is not a valid hexadecimal
+number.
+
=item Negative '/' count in unpack

(F) The length count obtained from a length/code unpack operation was
@@ -4943,6 +4986,20 @@ C<< @foo->[23] >> or C<< @$ref->[99] >>. Versions of perl <= 5.6.1 used to
allow this syntax, but shouldn't have. It is now deprecated, and will be
removed in a future version.

+=item Using just the first character returned by \N{} in character class
+
+(W) A charnames handler may return a sequence of more than one character.
+Currently all but the first one are discarded when used in a regular
+expression pattern bracketed character class.
+
+=item Using just the first characters returned by \N{}
+
+(W) A charnames handler may return a sequence of characters. There is a finite
+limit as to the number of characters that can be used, which this sequence
+exceeded. In the message, the characters in the sequence are separated by
+dots, and each is shown by its ordinal in hex. Anything to the left of the
+C<HERE> was retained; anything to the right was discarded.
+
=item UTF-16 surrogate %s

(W utf8) You tried to generate half of a UTF-16 surrogate by