regex: \c inside (?[]) causes panics and unexpected behavior #14934
Comments
From victor@drawall.ccCreated by @Grimy* /(?[(\c]) / panics: Read past end of '(?[ ])'. * /(?[\c#])/, /(?[\c[])/, /(?[\c\])/ and /(?[\c]])/ all report a syntax error. * /(?[(\c])/ matches a single "\c]". * /(?[(\c]) ]\b/ behaves like /\c]b/. * /(?[\c[]](])/ behaves like /\c[\]/. * /(?[\c#] All of these bugs were found on the current blead Perl Info
|
From victor@drawall.ccThe attached patch works-for-me™. Should I add new tests checking for |
From @jkeenanOn Wed Sep 30 09:15:21 2015, victor@drawall.cc wrote:
It appears that the patch did not get attached. Can you re-try? Thank you very much. -- |
The RT System itself - Status changed from 'new' to 'open' |
From victor@drawall.ccSorry for the mess-up. Here’s the patch again. I’m also including it From 96cb469f3930a276be8cedac7d23e9a26899be08 Mon Sep 17 00:00:00 2001 The \cX notation for control characters used to cause panics and unexpected The solution is to ignore the byte following \c during the first parsing pass AUTHORS | 1 + Inline Patchdiff --git a/AUTHORS b/AUTHORS
index 451c707..ebd9222 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -1222,6 +1222,7 @@ Unicode Consortium <unicode.org>
Vadim Konovalov <vkonovalov@lucent.com>
Valeriy E. Ushakov <uwe@ptc.spbu.ru>
Vernon Lyon <vlyon@cpan.org>
+Victor Adam <victor@drawall.cc>
Victor Efimov <victor@vsespb.ru>
Viktor Turskyi <koorchik@gmail.com>
Ville Skyttä <scop@cs132170.pp.htv.fi>
diff --git a/regcomp.c b/regcomp.c
index 4f4bb44..a8e80ee 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13454,6 +13454,10 @@ S_handle_regex_sets(pTHX_ RExC_state_t
|
From victor@drawall.cc0001-regex-handle-cX-inside.patchFrom 96cb469f3930a276be8cedac7d23e9a26899be08 Mon Sep 17 00:00:00 2001
From: Victor Adam <victor@drawall.cc>
Date: Sun, 27 Sep 2015 10:22:08 +0200
Subject: [PATCH] regex: handle \cX inside (?[])
The \cX notation for control characters used to cause panics and unexpected
behavior when used insed an extended character class. See bug #126181.
The solution is to ignore the byte following \c during the first parsing pass
of a (?[]) construct.
---
AUTHORS | 1 +
regcomp.c | 4 ++++
2 files changed, 5 insertions(+)
diff --git a/AUTHORS b/AUTHORS
index 451c707..ebd9222 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -1222,6 +1222,7 @@ Unicode Consortium <unicode.org>
Vadim Konovalov <vkonovalov@lucent.com>
Valeriy E. Ushakov <uwe@ptc.spbu.ru>
Vernon Lyon <vlyon@cpan.org>
+Victor Adam <victor@drawall.cc>
Victor Efimov <victor@vsespb.ru>
Viktor Turskyi <koorchik@gmail.com>
Ville Skyttä <scop@cs132170.pp.htv.fi>
diff --git a/regcomp.c b/regcomp.c
index 4f4bb44..a8e80ee 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13454,6 +13454,10 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist,
* default: case next time and keep on incrementing until
* we find one of the invariants we do handle. */
RExC_parse++;
+ if (*RExC_parse == 'c') {
+ /* Skip the \cX notation for control characters */
+ RExC_parse++;
+ }
break;
case '[':
{
--
2.4.5
|
From @khwilliamsonOn 09/30/2015 02:21 AM, Victor ADAM wrote:
Yes, please. The easiest place to put them is in t/re/re_tests. Your patch looks good to me. I would split the author portion into its Karl Williamson |
From victor@drawall.ccAs discussed, I’ve split my patch in two : 0001*.patch adds my name to |
From victor@drawall.cc0001-AUTHORS-Add-myself.patchFrom 4c62e3b50838f5d7c1f5320046740d1f500b1737 Mon Sep 17 00:00:00 2001
From: Victor Adam <victor@drawall.cc>
Date: Fri, 2 Oct 2015 19:38:25 +0200
Subject: [PATCH 1/3] AUTHORS: Add myself
---
AUTHORS | 1 +
1 file changed, 1 insertion(+)
diff --git a/AUTHORS b/AUTHORS
index 451c707..ebd9222 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -1222,6 +1222,7 @@ Unicode Consortium <unicode.org>
Vadim Konovalov <vkonovalov@lucent.com>
Valeriy E. Ushakov <uwe@ptc.spbu.ru>
Vernon Lyon <vlyon@cpan.org>
+Victor Adam <victor@drawall.cc>
Victor Efimov <victor@vsespb.ru>
Viktor Turskyi <koorchik@gmail.com>
Ville Skyttä <scop@cs132170.pp.htv.fi>
--
2.5.3
|
From victor@drawall.cc0002-test-Add-regex-tests.patchFrom 648d5fc4a052dab62bdcc9f28af234b0ab163063 Mon Sep 17 00:00:00 2001
From: Victor Adam <victor@drawall.cc>
Date: Fri, 2 Oct 2015 19:35:19 +0200
Subject: [PATCH 2/3] test: Add regex tests
This adds tests for RT tickets #126177, #126178, #126179, #126180, #126185,
#126181, #126186, #126187 and #126222.
---
t/re/re_tests | 45 +++++++++++++++++++++++++++++++++++++++++++++
t/re/regex_sets.t | 19 +++++++++++++++++++
t/re/regexp.t | 1 +
3 files changed, 65 insertions(+)
diff --git a/t/re/re_tests b/t/re/re_tests
index 9d5fa73..7d83ee1 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -1926,5 +1926,50 @@ A+(*PRUNE)BC(?{}) AAABC y $& AAABC
/(a+){1}+a/ aaa n - - # [perl #125825]
+# RT #126178: these should cause a syntax error
+(?i Bc - Unmatched (
+(?^ Bc - Unmatched (
+(?- Bc - Unmatched (
+(?a-a Bc - Unmatched (
+
+# RT #126177: (?n) should disable numbered captures
+(?n)(abc) abc y -$1 -
+'(?-n)(abc)'n abc y -$1 -abc
+
+# RT #126179: \N{} should cause an error (even when skipping the lexer)
+[\N{}] Bc - Zero length \N{}
+
+# RT #126180: those segfault inside (?[])
+[\ &!] Bsc - Incomplete expression
+[\ +!] Bsc - Incomplete expression
+[\ -!] Bsc - Incomplete expression
+[\ ^!] Bsc - Incomplete expression
+[\ |!] Bsc - Incomplete expression
+
+# RT #126185: (?-p) should be an error
+(?-p) Bc - Regexp modifier \"a\" may not appear after the \"-\"
+'(?-p)'p Bc - Regexp modifier \"a\" may not appear after the \"-\"
+
+# RT #126186: (*ACCEPT:arg) should either be an error or set $REGMARK correctly
+(*MARK:arg) y $REGMARK arg
+(*ACCEPT:arg) By $REGMARK arg
+
+# RT #126187: these produce warnings (even when skipping the tests!)
+# \p c - Can't find Unicode property TODO
+# \p^ c - Can't find Unicode property TODO
+# \P c - Can't find Unicode property TODO
+# \P^ c - Can't find Unicode property TODO
+
+# RT #126222
+(?(?!)) By - -
+
+# RT #126253
+.{1}?? Bc - Nested quantifiers
+.{1}?+ Bc - Nested quantifiers
+.{1}+? c - Nested quantifiers
+.{1}++ c - Nested quantifiers
+.{1}*? c - Nested quantifiers
+.{1}*+ c - Nested quantifiers
+
# Keep these lines at the end of the file
# vim: softtabstop=0 noexpandtab
diff --git a/t/re/regex_sets.t b/t/re/regex_sets.t
index ee161b2..185c70f 100644
--- a/t/re/regex_sets.t
+++ b/t/re/regex_sets.t
@@ -138,6 +138,25 @@ if (! is_miniperl() && locales_enabled('LC_CTYPE')) {
}
}
+# RT #126181: \cX behaves strangely inside (?[])
+{
+ no warnings qw(syntax regexp);
+
+ eval { $_ = '/(?[(\c]) /'; qr/$_/ };
+ like($@, qr/^Syntax error/, '/(?[(\c]) / should not panic');
+ eval { $_ = '(?[\c#]' . "\n])"; qr/$_/ };
+ like($@, qr/^Syntax error/, '/(?[(\c]) / should not panic');
+ eval { $_ = '(?[(\c])'; qr/$_/ };
+ like($@, qr/^Syntax error/, '/(?[(\c])/ should be a syntax error');
+ eval { $_ = '(?[(\c]) ]\b'; qr/$_/ };
+ like($@, qr/^Syntax error/, '/(?[(\c]) ]\b/ should be a syntax error');
+ eval { $_ = '(?[\c[]](])'; qr/$_/ };
+ like($@, qr/^Syntax error/, '/(?[\c[]](])/ should be a syntax error');
+ like("\c#", qr/(?[\c#])/, '\c# should match itself');
+ like("\c[", qr/(?[\c[])/, '\c[ should match itself');
+ like("\c\ ", qr/(?[\c\])/, '\c\ should match itself');
+ like("\c]", qr/(?[\c]])/, '\c] should match itself');
+}
done_testing();
diff --git a/t/re/regexp.t b/t/re/regexp.t
index f27a027..cbac03f 100644
--- a/t/re/regexp.t
+++ b/t/re/regexp.t
@@ -99,6 +99,7 @@ use strict;
use warnings FATAL=>"all";
use vars qw($bang $ffff $nulnul); # used by the tests
use vars qw($qr $skip_amp $qr_embed $qr_embed_thr $regex_sets); # set by our callers
+use vars qw($REGMARK); # set from tested patterns
--
2.5.3
|
From victor@drawall.cc0003-regex-handle-cX-inside.patchFrom afd2df3109b876d25120b34a47fc6d952d99380d Mon Sep 17 00:00:00 2001
From: Victor Adam <victor@drawall.cc>
Date: Sun, 27 Sep 2015 10:22:08 +0200
Subject: [PATCH 3/3] regex: handle \cX inside (?[])
The \cX notation for control characters used to cause panics and unexpected
behavior when used insed an extended character class. See bug #126181.
The solution is to ignore the byte following \c during the first parsing pass
of a (?[]) construct.
---
regcomp.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/regcomp.c b/regcomp.c
index 4f4bb44..a8e80ee 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13454,6 +13454,10 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist,
* default: case next time and keep on incrementing until
* we find one of the invariants we do handle. */
RExC_parse++;
+ if (*RExC_parse == 'c') {
+ /* Skip the \cX notation for control characters */
+ RExC_parse++;
+ }
break;
case '[':
{
--
2.5.3
|
From victor@drawall.ccThanks for the pointers! I already love re_tests, it makes it really Unfortunately, it doesn’t look like I can test the last example using Anyway, I’m working right now on adding tests for this and the other |
From victor@drawall.ccAs discussed, I’ve split my patch in two : 0001*.patch adds my name to |
From @khwilliamsonOn 10/03/2015 12:25 AM, Victor ADAM wrote:
One of our rules is that individual patches can't break blead on a major Syntax error in (?[...]) in regex m/(?[\c#])/ at re/regex_sets.t line 155. I'm afraid this patch was problematic to apply, as you had the And thanks for doing all this. |
From victor@drawall.ccSorry for the long delay. I’ve rebased my work on top of the current |
From victor@drawall.cc0001-AUTHORS-Add-myself.patchFrom 969a278ddf8f298c12df76ace52ee5af27814cc6 Mon Sep 17 00:00:00 2001
From: Victor Adam <victor@drawall.cc>
Date: Fri, 2 Oct 2015 19:38:25 +0200
Subject: [PATCH 1/2] AUTHORS: Add myself
---
AUTHORS | 1 +
1 file changed, 1 insertion(+)
diff --git a/AUTHORS b/AUTHORS
index 451c707..ebd9222 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -1222,6 +1222,7 @@ Unicode Consortium <unicode.org>
Vadim Konovalov <vkonovalov@lucent.com>
Valeriy E. Ushakov <uwe@ptc.spbu.ru>
Vernon Lyon <vlyon@cpan.org>
+Victor Adam <victor@drawall.cc>
Victor Efimov <victor@vsespb.ru>
Viktor Turskyi <koorchik@gmail.com>
Ville Skyttä <scop@cs132170.pp.htv.fi>
--
2.5.3
|
From victor@drawall.cc0002-regex-handle-cX-inside.patchFrom 4e40d5385b4ac0dd89f159ef411f90068b39b3b4 Mon Sep 17 00:00:00 2001
From: Victor Adam <victor@drawall.cc>
Date: Sun, 27 Sep 2015 10:22:08 +0200
Subject: [PATCH 2/2] regex: handle \cX inside (?[])
The \cX notation for control characters used to cause panics and unexpected
behavior when used insed an extended character class. See bug #126181.
The solution is to ignore the byte following \c during the first parsing pass
of a (?[]) construct.
---
regcomp.c | 4 ++++
t/re/regex_sets.t | 19 +++++++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/regcomp.c b/regcomp.c
index 540f71c..beec98d 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13444,6 +13444,10 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist,
* default: case next time and keep on incrementing until
* we find one of the invariants we do handle. */
RExC_parse++;
+ if (*RExC_parse == 'c') {
+ /* Skip the \cX notation for control characters */
+ RExC_parse++;
+ }
break;
case '[':
{
diff --git a/t/re/regex_sets.t b/t/re/regex_sets.t
index 5c683ba..a5941ba 100644
--- a/t/re/regex_sets.t
+++ b/t/re/regex_sets.t
@@ -139,6 +139,25 @@ if (! is_miniperl() && locales_enabled('LC_CTYPE')) {
}
}
+# RT #126181: \cX behaves strangely inside (?[])
+{
+ no warnings qw(syntax regexp);
+
+ eval { $_ = '/(?[(\c]) /'; qr/$_/ };
+ like($@, qr/^Syntax error/, '/(?[(\c]) / should not panic');
+ eval { $_ = '(?[\c#]' . "\n])"; qr/$_/ };
+ like($@, qr/^Syntax error/, '/(?[(\c]) / should not panic');
+ eval { $_ = '(?[(\c])'; qr/$_/ };
+ like($@, qr/^Syntax error/, '/(?[(\c])/ should be a syntax error');
+ eval { $_ = '(?[(\c]) ]\b'; qr/$_/ };
+ like($@, qr/^Syntax error/, '/(?[(\c]) ]\b/ should be a syntax error');
+ eval { $_ = '(?[\c[]](])'; qr/$_/ };
+ like($@, qr/^Syntax error/, '/(?[\c[]](])/ should be a syntax error');
+ like("\c#", qr/(?[\c#])/, '\c# should match itself');
+ like("\c[", qr/(?[\c[])/, '\c[ should match itself');
+ like("\c\ ", qr/(?[\c\])/, '\c\ should match itself');
+ like("\c]", qr/(?[\c]])/, '\c] should match itself');
+}
done_testing();
--
2.5.3
|
From @khwilliamsonThanks, applied as |
@khwilliamson - Status changed from 'open' to 'pending release' |
@mauke - Status changed from 'pending release' to 'resolved' |
Migrated from rt.perl.org#126181 (status was 'resolved')
Searchable as RT126181$
The text was updated successfully, but these errors were encountered: