Skip to content

Commit

Permalink
Deprecate unescaped literal "{" in regex patterns
Browse files Browse the repository at this point in the history
This commit also causes escaped (by a backslash) "(", "[", and "{" to be
considered literally.  In the previous 2 Perl versions, the escaping was
ignored, and a (default-on) deprecation warning was raised.  Now that we
have warned for 2 release cycles, we can change the meaning.of escaping
to actually do something

Warning when a literal left brace is not escaped by a backslash, will
allow us to eventually use this character in more contexts as being
meta, allowing us to extend the language.  For example, the lower limit
of a quantifier could be omited, and better error checking instituted,
or things like \w could be followed by a {...} indicating some special
word character, like \w{Greek} to restrict to just Greek word
characters.

We tried to do this in v5.16, and many CPAN modules changed to backslash
their left braces at that time.  However we had to back out that change
before 5.16 shipped because it turned out that escaping a left brace in
some contexts didn't work, namely when the brace would normally be a
metacharacter (for example surrounding a quantifier), and the pattern
delimiters were { }.  Instead we raised the useless backslash warning
mentioned above, which has now been there for the requisite 2 cycles.

This patch partially reverts 2 patches.  The first,
e62d0b1, partially reverted
the deprecation of unescaped literal left brace.  The other,
4d68ffa, instituted the deprecation of
the useless left-characters.

Note that, as in the original attempt to deprecate,  we don't raise a
warning if the left brace is the first character in the pattern.  This
is because in that position it can't be a metacharacter, so we don't
require any disambiguation, and we found that if we did raise an error,
there were quite a few places where this occurred.
  • Loading branch information
khwilliamson committed Jun 13, 2014
1 parent 4a7e65a commit 412f55b
Show file tree
Hide file tree
Showing 18 changed files with 141 additions and 229 deletions.
10 changes: 2 additions & 8 deletions dquote_static.c
Expand Up @@ -15,11 +15,7 @@
Pulled from regcomp.c.
*/
PERL_STATIC_INLINE I32
S_regcurly(pTHX_ const char *s,
const bool rbrace_must_be_escaped /* Should the terminating '} be
preceded by a backslash? This
is an abnormal case */
)
S_regcurly(pTHX_ const char *s)
{
PERL_UNUSED_CONTEXT;
PERL_ARGS_ASSERT_REGCURLY;
Expand All @@ -36,9 +32,7 @@ S_regcurly(pTHX_ const char *s,
s++;
}

return (rbrace_must_be_escaped)
? *s == '\\' && *(s+1) == '}'
: *s == '}';
return *s == '}';
}

/* XXX Add documentation after final interface and behavior is decided */
Expand Down
4 changes: 1 addition & 3 deletions embed.fnc
Expand Up @@ -1170,8 +1170,7 @@ Ap |char* |re_intuit_start|NN REGEXP * const rx \
|NULLOK re_scream_pos_data *data
Ap |SV* |re_intuit_string|NN REGEXP *const r
#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_TOKE_C)
EiPR |I32 |regcurly |NN const char *s \
|const bool rbrace_must_be_escaped
EiPR |I32 |regcurly |NN const char *s
#endif
Ap |I32 |regexec_flags |NN REGEXP *const rx|NN char *stringarg \
|NN char *strend|NN char *strbeg \
Expand Down Expand Up @@ -2345,7 +2344,6 @@ sR |char* |scan_inputsymbol|NN char *start
sR |char* |scan_pat |NN char *start|I32 type
sR |char* |scan_str |NN char *start|int keep_quoted \
|int keep_delims|int re_reparse \
|bool deprecate_escaped_matching \
|NULLOK char **delimp
sR |char* |scan_subst |NN char *start
sR |char* |scan_trans |NN char *start
Expand Down
4 changes: 2 additions & 2 deletions embed.h
Expand Up @@ -1019,7 +1019,7 @@
#define grok_bslash_c(a,b) S_grok_bslash_c(aTHX_ a,b)
#define grok_bslash_o(a,b,c,d,e,f,g) S_grok_bslash_o(aTHX_ a,b,c,d,e,f,g)
#define grok_bslash_x(a,b,c,d,e,f,g) S_grok_bslash_x(aTHX_ a,b,c,d,e,f,g)
#define regcurly(a,b) S_regcurly(aTHX_ a,b)
#define regcurly(a) S_regcurly(aTHX_ a)
# endif
# if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_UTF8_C)
#define _add_range_to_invlist(a,b,c) Perl__add_range_to_invlist(aTHX_ a,b,c)
Expand Down Expand Up @@ -1693,7 +1693,7 @@
#define scan_ident(a,b,c,d) S_scan_ident(aTHX_ a,b,c,d)
#define scan_inputsymbol(a) S_scan_inputsymbol(aTHX_ a)
#define scan_pat(a,b) S_scan_pat(aTHX_ a,b)
#define scan_str(a,b,c,d,e,f) S_scan_str(aTHX_ a,b,c,d,e,f)
#define scan_str(a,b,c,d,e) S_scan_str(aTHX_ a,b,c,d,e)
#define scan_subst(a) S_scan_subst(aTHX_ a)
#define scan_trans(a) S_scan_trans(aTHX_ a)
#define scan_word(a,b,c,d,e) S_scan_word(aTHX_ a,b,c,d,e)
Expand Down
3 changes: 1 addition & 2 deletions handy.h
Expand Up @@ -963,8 +963,7 @@ patched there. The file as of this writing is cpan/Devel-PPPort/parts/inc/misc
# define _CC_QUOTEMETA 21
# define _CC_NON_FINAL_FOLD 22
# define _CC_IS_IN_SOME_FOLD 23
# define _CC_BACKSLASH_FOO_LBRACE_IS_META 31 /* temp, see mk_PL_charclass.pl */
/* Unused: 24-30
/* Unused: 24-31
* If more bits are needed, one could add a second word for non-64bit
* QUAD_IS_INT systems, using some #ifdefs to distinguish between having a 2nd
* word or not. The IS_IN_SOME_FOLD bit is the most easily expendable, as it
Expand Down
56 changes: 28 additions & 28 deletions l1_char_class_tab.h

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions lib/B/Deparse-core.t
Expand Up @@ -91,9 +91,9 @@ sub testit {

my $got_text = $deparse->coderef2text($code_ref);

unless ($got_text =~ /^{
unless ($got_text =~ /^\{
package test;
BEGIN {\${\^WARNING_BITS} = "[^"]*"}
BEGIN \{\$\{\^WARNING_BITS} = "[^"]*"}
use strict 'refs', 'subs';
use feature [^\n]+
\Q$vars\E\(\) = (.*)
Expand Down
21 changes: 21 additions & 0 deletions pod/perldelta.pod
Expand Up @@ -150,6 +150,15 @@ This non-graphic character is essentially indistinguishable from a
regular space, and so should not be allowed. See
L<charnames/CUSTOM ALIASES>.

=head2 A literal C<"{"> should now be escaped in a pattern

If you want a literal left curly bracket (also called a left brace) in a
regular expression pattern, you should now escape it by either
preceding it with a backslash (C<"\{">) or enclosing it within square
brackets C<"[{]">, or by using C<\Q>; otherwise a deprecation warning
will be raised. This was first announced as forthcoming in the v5.16
release; it will allow future extensions to the language to happen.

=head2 Module removals

XXX Remove this section if inapplicable.
Expand Down Expand Up @@ -768,6 +777,18 @@ exec() and backticks: the commands would end up looking for C</bin/sh>
instead of C</system/bin/sh>, and so would fail for the vast majority
of devices, leaving C<$!> as C<ENOENT>.

=item *

C<qr(...\(...\)...)>,
C<qr[...\[...\]...]>,
and
C<qr{...\{...\}...}>
now work. Previously it was impossible to escape these three
left-characters with a backslash within a regular expression pattern
where otherwise they would be considered metacharacters, and the pattern
opening delimiter was the character, and the closing delimiter was its
mirror character.

=back

=head1 Known Problems
Expand Down
37 changes: 12 additions & 25 deletions pod/perldiag.pod
Expand Up @@ -5657,6 +5657,18 @@ C<undef *foo>.
(A) You've accidentally run your script through B<csh> instead of Perl.
Check the #! line, or manually feed your script into Perl yourself.

=item Unescaped left brace in regex is deprecated, passed through in regex;
marked by <-- HERE in m/%s/

(D deprecated, regexp) You used a literal C<"{"> character in a regular
expression pattern. You should change to use C<"\{"> instead, because a future
version of Perl (tentatively v5.26) will consider this to be a syntax error. If
the pattern delimiters are also braces, any matching right brace
(C<"}">) should also be escaped to avoid confusing the parser, for
example,

qr{abc\{def\}ghi}

=item unexec of %s into %s failed!

(F) The unexec() routine failed for some reason. See your local FSF
Expand Down Expand Up @@ -6128,31 +6140,6 @@ discovered. See L<perlre>.
same length as the replacelist. See L<perlop> for more information
about the /d modifier.

=item Useless use of '\'; doesn't escape metacharacter '%c'

(D deprecated) You wrote a regular expression pattern something like
one of these:

m{ \x\{FF\} }x
m{foo\{1,3\}}
qr(foo\(bar\))
s[foo\[a-z\]bar][baz]

The interior braces, square brackets, and parentheses are treated as
metacharacters even though they are backslashed; instead write:

m{ \x{FF} }x
m{foo{1,3}}
qr(foo(bar))
s[foo[a-z]bar][baz]

The backslashes have no effect when a regular expression pattern is
delimited by C<{}>, C<[]>, or C<()>, which ordinarily are
metacharacters, and the delimiters are also used, paired, within the
interior of the pattern. It is planned that a future Perl release will
change the meaning of constructs like these so that the backslashes
will have an effect, so remove them from your code.

=item Useless use of \E

(W misc) You have a \E in a double-quotish string without a C<\U>,
Expand Down
20 changes: 6 additions & 14 deletions pod/perlre.pod
Expand Up @@ -556,20 +556,12 @@ X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>

(If a curly bracket occurs in any other context and does not form part of
a backslashed sequence like C<\x{...}>, it is treated as a regular
character. In particular, the lower quantifier bound is not optional,
and a typo in a quantifier silently causes it to be treated as the
literal characters. For example,

/o{4,a}/

compiles to match the sequence of six characters
S<C<"o { 4 , a }">>. It is planned to eventually require literal uses
of curly brackets to be escaped, say by preceding them with a backslash
or enclosing them within square brackets, (C<"\{"> or C<"[{]">). This
change will allow for future syntax extensions (like making the lower
bound of a quantifier optional), and better error checking. In the
meantime, you should get in the habit of escaping all instances where
you mean a literal "{".)
character. However, a deprecation warning is raised for all such
occurrences, and in Perl v5.26, literal uses of a curly bracket will be
required to be escaped, say by preceding them with a backslash (C<"\{">)
or enclosing them within square brackets (C<"[{]">). This change will
allow for future syntax extensions (like making the lower bound of a
quantifier optional), and better error checking of quantifiers.)

The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
Expand Down
4 changes: 2 additions & 2 deletions proto.h
Expand Up @@ -7150,7 +7150,7 @@ PERL_STATIC_INLINE bool S_grok_bslash_x(pTHX_ char** s, UV* uv, const char** err
#define PERL_ARGS_ASSERT_GROK_BSLASH_X \
assert(s); assert(uv); assert(error_msg)

PERL_STATIC_INLINE I32 S_regcurly(pTHX_ const char *s, const bool rbrace_must_be_escaped)
PERL_STATIC_INLINE I32 S_regcurly(pTHX_ const char *s)
__attribute__warn_unused_result__
__attribute__pure__
__attribute__nonnull__(pTHX_1);
Expand Down Expand Up @@ -7629,7 +7629,7 @@ STATIC char* S_scan_pat(pTHX_ char *start, I32 type)
#define PERL_ARGS_ASSERT_SCAN_PAT \
assert(start)

STATIC char* S_scan_str(pTHX_ char *start, int keep_quoted, int keep_delims, int re_reparse, bool deprecate_escaped_matching, char **delimp)
STATIC char* S_scan_str(pTHX_ char *start, int keep_quoted, int keep_delims, int re_reparse, char **delimp)
__attribute__warn_unused_result__
__attribute__nonnull__(pTHX_1);
#define PERL_ARGS_ASSERT_SCAN_STR \
Expand Down
24 changes: 14 additions & 10 deletions regcomp.c
Expand Up @@ -225,7 +225,7 @@ struct RExC_state_t {

#define ISMULT1(c) ((c) == '*' || (c) == '+' || (c) == '?')
#define ISMULT2(s) ((*s) == '*' || (*s) == '+' || (*s) == '?' || \
((*s) == '{' && regcurly(s, FALSE)))
((*s) == '{' && regcurly(s)))

/*
* Flags to be passed up and down.
Expand Down Expand Up @@ -10484,7 +10484,7 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)

op = *RExC_parse;

if (op == '{' && regcurly(RExC_parse, FALSE)) {
if (op == '{' && regcurly(RExC_parse)) {
maxpos = NULL;
#ifdef RE_TRACK_PATTERN_OFFSETS
parse_start = RExC_parse; /* MJD */
Expand Down Expand Up @@ -10760,7 +10760,7 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state, regnode** node_p,

/* Disambiguate between \N meaning a named character versus \N meaning
* [^\n]. The former is assumed when it can't be the latter. */
if (*p != '{' || regcurly(p, FALSE)) {
if (*p != '{' || regcurly(p)) {
RExC_parse = p;
if (! node_p) {
/* no bare \N allowed in a charclass */
Expand Down Expand Up @@ -11341,12 +11341,6 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
vFAIL("Internal urp");
/* Supposed to be caught earlier. */
break;
case '{':
if (!regcurly(RExC_parse, FALSE)) {
RExC_parse++;
goto defchar;
}
/* FALLTHROUGH */
case '?':
case '+':
case '*':
Expand Down Expand Up @@ -12028,8 +12022,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
goto normal_default;
} /* End of switch on '\' */
break;
case '{':
/* Currently we don't warn when the lbrace is at the start
* of a construct. This catches it in the middle of a
* literal string, or when its the first thing after
* something like "\b" */
if (! SIZE_ONLY
&& (len || (p > RExC_start && isALPHA_A(*(p -1)))))
{
ckWARNregdep(p + 1, "Unescaped left brace in regex is deprecated, passed through");
}
/*FALLTHROUGH*/
default: /* A literal character */

normal_default:
if (UTF8_IS_START(*p) && UTF) {
STRLEN numlen;
Expand Down
8 changes: 0 additions & 8 deletions regen/mk_PL_charclass.pl
Expand Up @@ -47,7 +47,6 @@
XDIGIT
VERTSPACE
IS_IN_SOME_FOLD
BACKSLASH_FOO_LBRACE_IS_META
);

# Read in the case fold mappings.
Expand Down Expand Up @@ -236,13 +235,6 @@ sub Is_Non_Final_Fold {
$re = qr/\p{Is_Non_Final_Fold}/;
} elsif ($name eq 'IS_IN_SOME_FOLD') {
$re = qr/\p{_Perl_Any_Folds}/;
} elsif ($name eq 'BACKSLASH_FOO_LBRACE_IS_META') {

# This is true for FOO where FOO is the varying character in:
# \a{, \b{, \c{, ...
# and the sequence has non-literal meaning to Perl; so it is true
# for 'x' because \x{ is special, but not 'a' because \a{ isn't.
$re = qr/[gkNopPx]/;
} else { # The remainder have the same name and values as Unicode
$re = eval "qr/\\p{$name}/";
use Carp;
Expand Down
38 changes: 12 additions & 26 deletions t/lib/warnings/toke
Expand Up @@ -1453,7 +1453,8 @@ sub { # do not actually call require
EXPECT
########
# toke.c
# [perl #113094], [perl #119101]
# [perl #113094], [perl #119101], since reverted so no warnings generated
use warnings;
print "aa" =~ m{^a\{1,2\}$}, "\n";
print "aa" =~ m{^a\x\{61\}$}, "\n";
print "a\\x{6F}" =~ m{^a\\x\{6F\}$}, "\n";
Expand All @@ -1463,35 +1464,20 @@ print "a\\\\o" =~ m{^a\\\\\x\{6F\}$}, "\n";
print "aa" =~ m{^a{1,2}$}, "\n";
print "aq" =~ m[^a\[a-z\]$], "\n";
print "aq" =~ m(^a\(q\)$), "\n";
no warnings 'deprecated';
print "aa" =~ m{^a\{1,2\}$}, "\n";
print "aa" =~ m{^a\x\{61\}$}, "\n";
print "a\\x{6F}" =~ m{^a\\x\{6F\}$}, "\n";
print "a\\o" =~ m{^a\\\x\{6f\}$}, "\n";
print "aq" =~ m[^a\[a-z\]$], "\n";
print "aq" =~ m(^a\(q\)$), "\n";
EXPECT
Useless use of '\'; doesn't escape metacharacter '{' at - line 3.
Useless use of '\'; doesn't escape metacharacter '{' at - line 4.
Useless use of '\'; doesn't escape metacharacter '{' at - line 6.
Useless use of '\'; doesn't escape metacharacter '{' at - line 8.
Useless use of '\'; doesn't escape metacharacter '[' at - line 10.
Useless use of '\'; doesn't escape metacharacter '(' at - line 11.
1
1
1
1
1
1
1
1
q
1
1
Illegal hexadecimal digit '\' ignored at - line 5.
Illegal hexadecimal digit '\' ignored at - line 5.
Illegal hexadecimal digit '\' ignored at - line 7.
Illegal hexadecimal digit '\' ignored at - line 7.
Illegal hexadecimal digit '\' ignored at - line 9.
Illegal hexadecimal digit '\' ignored at - line 9.


1

1

1
q
########
# toke.c
#[perl #119123] disallow literal control character variables
Expand Down
2 changes: 1 addition & 1 deletion t/op/kvhslice.t
Expand Up @@ -151,7 +151,7 @@ plan tests => 44;
my $v = eval '%h{a}';
is (scalar @warn, 1, 'warning in scalar context');
like $warn[0],
qr{^%h{"a"} in scalar context better written as \$h{"a"}},
qr{^%h\{"a"\} in scalar context better written as \$h\{"a"\}},
"correct warning text";
}
{
Expand Down
12 changes: 11 additions & 1 deletion t/re/pat.t
Expand Up @@ -20,7 +20,7 @@ BEGIN {
require './test.pl';
}

plan tests => 733; # Update this when adding/deleting tests.
plan tests => 737; # Update this when adding/deleting tests.

run_tests() unless caller;

Expand Down Expand Up @@ -1596,6 +1596,16 @@ EOP
ok("abc" =~ /a…b…c/x, "NEL is white-space under /x");
}

{
ok('a(b)c' =~ qr(a\(b\)c), "'\\(' is a literal in qr(...)");
ok('a[b]c' =~ qr[a\[b\]c], "'\\[' is a literal in qr[...]");
ok('a{3}c' =~ qr{a\{3\}c}, # Only failed when { could be a meta
"'\\{' is a literal in qr{...}, where it could be a quantifier");

# This one is for completeness
ok('a<b>c' =~ qr<a\<b\>c>, "'\\<' is a literal in qr<...>)");
}

} # End of sub run_tests

1;
1 change: 1 addition & 0 deletions t/re/pat_advanced.t
Expand Up @@ -1260,6 +1260,7 @@ sub run_tests {

{
# \, breaks {3,4}
no warnings qw{deprecated regexp};
ok "xaaay" !~ /xa{3\,4}y/, '\, in a pattern';
ok "xa{3,4}y" =~ /xa{3\,4}y/, '\, in a pattern';

Expand Down

0 comments on commit 412f55b

Please sign in to comment.