-
Notifications
You must be signed in to change notification settings - Fork 540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CVE-2018-6797] heap-buffer-overflow (WRITE of size 1) in S_regatom (regcomp.c) #16185
Comments
From @geeknikTriggered while fuzzing Perl v5.27.4-29-gdc41635. od -tx1 ./test514 ==28186==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60700000ac58 is located 0 bytes to the right of 72-byte region SUMMARY: AddressSanitizer: heap-buffer-overflow /root/perl/regcomp.c:13652 When tested against Perl 5.22.1 under Valgrind, the following occurs: ==5420== Invalid write of size 1 |
From @geeknik |
From @khwilliamsonI'll look at this when I get around to reworking this portion of the code later in 5.27 |
The RT System itself - Status changed from 'new' to 'open' |
From @tonycozOn Fri, 06 Oct 2017 16:46:50 -0700, khw wrote:
Simplifies to the attached, which doesn't assert under valgrind etc, but does cause a panic: $ ./perl ../132227b.pl Tony |
From @tonycoz |
From @tonycozOn Tue, 30 Jan 2018 20:53:29 -0800, tonyc wrote:
What's happening is the first pass is emitting the double-s as a single byte, but the second pass emits "ss", and there isn't enough room for it. If you have a larger number of ß this can push the following bytes emitted well beyond the end of the allocated block. At minimum this has the potential to cause a segmentation fault, or to corrupt the arena, leading to a crash further on. Depending on the heap implementation it may be possible to perform a nastier exploit - an attacker has almost complete control over the bytes written. I do believe this is a security issue. Tony |
From @khwilliamsonOn 01/31/2018 08:44 PM, Tony Cook via RT wrote:
Since you've done so much analysis, do you have a patch, or know the |
From @demerphqFWIW, this looks like a reparse bug. The pattern starts out latin-1. So I would look into what is happening with \N{U+0} and why it isnt Yves On 1 February 2018 at 18:27, Karl Williamson <public@khwilliamson.com> wrote:
-- |
From @demerphqOn 2 February 2018 at 13:21, demerphq <demerphq@gmail.com> wrote:
Yep, that is it, or more or less, although we ARE trying to trigger $ git diff Inline Patchdiff --git a/regcomp.c b/regcomp.c
index 6dbfed5..2668df5 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -352,7 +352,7 @@ struct RExC_state_t {
assert(PASS1); \
set_regex_charset(&RExC_flags, REGEX_UNICODE_CHARSET); \
RExC_uni_semantics = 1; \
- if (RExC_seen_unfolded_sharp_s) { \
+ if (!UTF || RExC_seen_unfolded_sharp_s) { \
*flagp |= RESTART_PASS1; \
return restart_retval; \
}
With my patch: $ ./perl -Ilib -Mre=Debug,ALL -le'/00\N{U+2}\xDF/i' ---------->8--------------->8------------- Without: $ ./perl -Ilib -Mre=Debug,ALL -le'/00\N{U+2}\xDF/i' Yves -- |
From @demerphqOn 2 February 2018 at 17:40, demerphq <demerphq@gmail.com> wrote:
I think the attached patch is better. Basically we only need to reparse if we see a \xDF and we are in Yves -- |
From @demerphqrt_132227.patchcommit d194576035c4a6d99fada0278b2247baf0fcf6a6
Author: Yves Orton <demerphq@gmail.com>
Date: Fri Feb 2 22:20:57 2018 +0100
fixup /00\N{U+0}\xDF/i
diff --git a/regcomp.c b/regcomp.c
index 6dbfed5..801f857 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13757,6 +13757,8 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* 'ss' */
if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
RExC_seen_unfolded_sharp_s = 1;
+ if (RExC_uni_semantics)
+ REQUIRE_UTF8(flagp);
maybe_exactfu = FALSE;
}
else if (maybe_exactfu
|
From @khwilliamsonOn 02/02/2018 02:26 PM, demerphq wrote:
Yves' patch unnecessarily upgrades to UTF-8. And on irc, he indicated Attached is my patch. The problem occurs when we are under /d semantics. An EXACTF node is What really should happen in this situation is addressed in my patch. Yves wanted to know why it works now if the \xDF comes before the \N{}. |
From @khwilliamson0010-132227.patchFrom 03d81af254f2dc298aa2e8925eaba73d00dcb440 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Fri, 2 Feb 2018 15:14:27 -0700
Subject: [PATCH 10/10] 132227
---
regcomp.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/regcomp.c b/regcomp.c
index 6dbfed52ab..fa04aa3f41 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13756,6 +13756,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* /u. This includes the multi-char fold SHARP S to
* 'ss' */
if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+
+ /* If the node started out having uni rules, we
+ * wouldn't have gotten here. So this means
+ * something in the middle has changed it, but
+ * didn't think it needed to reparse. But this
+ * sharp s now does indicate the need for
+ * reparsing. */
+ if (RExC_uni_semantics) {
+ *flagp |= RESTART_PASS1;
+ return NULL;
+ }
+
RExC_seen_unfolded_sharp_s = 1;
maybe_exactfu = FALSE;
}
--
2.11.0
|
From @demerphqOn 2 February 2018 at 23:39, Karl Williamson <public@khwilliamson.com> wrote:
Yes, im a bit behind the times.
Yep, makes sense to me. Should we wrap this logic in a well named
No, that wasn't my point :-), I was conflating reparse and upgrade in some of my communications and So when you were saying "it shouldnt upgrade", and I was looking at Once the fact that you can reparse without upgrading penetrated my All I was concerned with was that /00\N{U+0}\xDF/ and /00\xDF\N{U+0}/ FWIW, I would like to push a different patch to blead, and we can use I wrapped all the macros that are used for RESTART_PASS1 (including cheers, -- |
From @demerphqwrap_restart.patchcommit 45fa60a82457aeca7f3e1cf2f486005e0bb94f22
Author: Yves Orton <demerphq@gmail.com>
Date: Sat Feb 3 14:02:17 2018 +0100
wrap RESTART_PASS1 logic in a macro layer for clarity and reduction of duplicate code
diff --git a/regcomp.c b/regcomp.c
index 6dbfed5..c139ae8 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -359,6 +359,33 @@ struct RExC_state_t {
} \
} STMT_END
+#define REQUIRE_LATIN_RULES(flagp) \
+ STMT_START { \
+ if (RExC_uni_semantics) { \
+ *flagp |= RESTART_PASS1; \
+ return NULL; \
+ } \
+ } STMT_END
+
+
+#define RETURN_NULL_ON_RESTART_OR_FLAGS(flags,flagp,extra) \
+ STMT_START { \
+ if ((flags) & (RESTART_PASS1|NEED_UTF8|(extra))) { \
+ *(flagp) = (flags) & (RESTART_PASS1|NEED_UTF8|(extra)); \
+ return NULL; \
+ } \
+ } STMT_END
+
+
+#define RETURN_NULL_ON_RESTART_FLAGP_OR_FLAGS(flagp,extra) \
+ if (*(flagp) & (RESTART_PASS1|(extra))) return NULL
+
+#define MUST_RESTART(flags) ((flags) & (RESTART_PASS1))
+
+
+#define RETURN_NULL_ON_RESTART(flags,flagp) RETURN_NULL_ON_RESTART_OR_FLAGS(flags,flagp,0)
+#define RETURN_NULL_ON_RESTART_FLAGP(flagp) RETURN_NULL_ON_RESTART_FLAGP_OR_FLAGS(flagp,0)
+
/* This converts the named class defined in regcomp.h to its equivalent class
* number defined in handy.h. */
#define namedclass_to_classnum(class) ((int) ((class) / 2))
@@ -7204,14 +7231,14 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
at least some part of the pattern, and therefore must convert the whole
thing.
-- dmq */
- if (flags & RESTART_PASS1) {
+ if (MUST_RESTART(flags)) {
if (flags & NEED_UTF8) {
S_pat_upgrade_to_utf8(aTHX_ pRExC_state, &exp, &plen,
pRExC_state->code_blocks ? pRExC_state->code_blocks->count : 0);
+ DEBUG_PARSE_r(Perl_re_printf( aTHX_ "Need to redo pass 1 after upgrade\n"));
}
else {
- DEBUG_PARSE_r(Perl_re_printf( aTHX_
- "Need to redo pass 1\n"));
+ DEBUG_PARSE_r(Perl_re_printf( aTHX_ "Need to redo pass 1\n"));
}
goto redo_first_pass;
@@ -10864,10 +10891,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
}
ret = reg(pRExC_state, 's', &flags, depth+1);
- if (flags & (RESTART_PASS1|NEED_UTF8)) {
- *flagp = flags & (RESTART_PASS1|NEED_UTF8);
- return NULL;
- }
+ RETURN_NULL_ON_RESTART(flags,flagp);
return ret;
}
@@ -11249,10 +11273,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
ret->flags = 1;
tail = reg(pRExC_state, 1, &flag, depth+1);
- if (flag & (RESTART_PASS1|NEED_UTF8)) {
- *flagp = flag & (RESTART_PASS1|NEED_UTF8);
- return NULL;
- }
+ RETURN_NULL_ON_RESTART(flag,flagp);
REGTAIL(pRExC_state, ret, tail);
goto insert_if;
}
@@ -11361,10 +11382,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
REGTAIL(pRExC_state, ret, reganode(pRExC_state, IFTHEN, 0));
br = regbranch(pRExC_state, &flags, 1,depth+1);
if (br == NULL) {
- if (flags & (RESTART_PASS1|NEED_UTF8)) {
- *flagp = flags & (RESTART_PASS1|NEED_UTF8);
- return NULL;
- }
+ RETURN_NULL_ON_RESTART(flags,flagp);
FAIL2("panic: regbranch returned NULL, flags=%#" UVxf,
(UV) flags);
} else
@@ -11382,10 +11400,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
lastbr = reganode(pRExC_state, IFTHEN, 0);
if (!regbranch(pRExC_state, &flags, 1,depth+1)) {
- if (flags & (RESTART_PASS1|NEED_UTF8)) {
- *flagp = flags & (RESTART_PASS1|NEED_UTF8);
- return NULL;
- }
+ RETURN_NULL_ON_RESTART(flags,flagp);
FAIL2("panic: regbranch returned NULL, flags=%#" UVxf,
(UV) flags);
}
@@ -11490,10 +11505,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
/* branch_len = (paren != 0); */
if (br == NULL) {
- if (flags & (RESTART_PASS1|NEED_UTF8)) {
- *flagp = flags & (RESTART_PASS1|NEED_UTF8);
- return NULL;
- }
+ RETURN_NULL_ON_RESTART(flags,flagp);
FAIL2("panic: regbranch returned NULL, flags=%#" UVxf, (UV) flags);
}
if (*RExC_parse == '|') {
@@ -11537,10 +11549,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
br = regbranch(pRExC_state, &flags, 0, depth+1);
if (br == NULL) {
- if (flags & (RESTART_PASS1|NEED_UTF8)) {
- *flagp = flags & (RESTART_PASS1|NEED_UTF8);
- return NULL;
- }
+ RETURN_NULL_ON_RESTART(flags,flagp);
FAIL2("panic: regbranch returned NULL, flags=%#" UVxf, (UV) flags);
}
REGTAIL(pRExC_state, lastbr, br); /* BRANCH -> BRANCH. */
@@ -11755,10 +11764,7 @@ S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, I32 first, U32 depth)
if (latest == NULL) {
if (flags & TRYAGAIN)
continue;
- if (flags & (RESTART_PASS1|NEED_UTF8)) {
- *flagp = flags & (RESTART_PASS1|NEED_UTF8);
- return NULL;
- }
+ RETURN_NULL_ON_RESTART(flags,flagp);
FAIL2("panic: regpiece returned NULL, flags=%#" UVxf, (UV) flags);
}
else if (ret == NULL)
@@ -11828,11 +11834,8 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
ret = regatom(pRExC_state, &flags,depth+1);
if (ret == NULL) {
- if (flags & (TRYAGAIN|RESTART_PASS1|NEED_UTF8))
- *flagp |= flags & (TRYAGAIN|RESTART_PASS1|NEED_UTF8);
- else
- FAIL2("panic: regatom returned NULL, flags=%#" UVxf, (UV) flags);
- return(NULL);
+ RETURN_NULL_ON_RESTART_OR_FLAGS(flags,flagp,TRYAGAIN);
+ FAIL2("panic: regatom returned NULL, flags=%#" UVxf, (UV) flags);
}
op = *RExC_parse;
@@ -12349,10 +12352,7 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state,
SvREFCNT_dec_NN(substitute_parse);
if (! *node_p) {
- if (flags & (RESTART_PASS1|NEED_UTF8)) {
- *flagp = flags & (RESTART_PASS1|NEED_UTF8);
- return FALSE;
- }
+ RETURN_NULL_ON_RESTART(flags,flagp);
FAIL2("panic: reg returned NULL to grok_bslash_N, flags=%#" UVxf,
(UV) flags);
}
@@ -12752,8 +12752,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
NULL,
NULL);
if (ret == NULL) {
- if (*flagp & (RESTART_PASS1|NEED_UTF8))
- return NULL;
+ RETURN_NULL_ON_RESTART_FLAGP_OR_FLAGS(flagp,NEED_UTF8);
FAIL2("panic: regclass returned NULL to regatom, flags=%#" UVxf,
(UV) *flagp);
}
@@ -12777,10 +12776,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
}
goto tryagain;
}
- if (flags & (RESTART_PASS1|NEED_UTF8)) {
- *flagp = flags & (RESTART_PASS1|NEED_UTF8);
- return NULL;
- }
+ RETURN_NULL_ON_RESTART(flags,flagp);
FAIL2("panic: reg returned NULL to regatom, flags=%#" UVxf,
(UV) flags);
}
@@ -13065,8 +13061,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
TRUE, /* Allow an optimized regnode result */
NULL,
NULL);
- if (*flagp & RESTART_PASS1)
- return NULL;
+ RETURN_NULL_ON_RESTART_FLAGP(flagp);
/* regclass() can only return RESTART_PASS1 and NEED_UTF8 if
* multi-char folds are allowed. */
if (!ret)
@@ -13105,8 +13100,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
break;
}
- if (*flagp & RESTART_PASS1)
- return NULL;
+ RETURN_NULL_ON_RESTART_FLAGP(flagp);
/* Here, evaluates to a single code point. Go get that */
RExC_parse = parse_start;
@@ -13434,8 +13428,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
) {
if (*flagp & NEED_UTF8)
FAIL("panic: grok_bslash_N set NEED_UTF8");
- if (*flagp & RESTART_PASS1)
- return NULL;
+ RETURN_NULL_ON_RESTART_FLAGP(flagp);
/* Here, it wasn't a single code point. Go close
* up this EXACTish node. The switch() prior to
@@ -13756,6 +13749,15 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* /u. This includes the multi-char fold SHARP S to
* 'ss' */
if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+
+ /* If the node started out having uni rules, we
+ * wouldn't have gotten here. So this means
+ * something in the middle has changed it, but
+ * didn't think it needed to reparse. But this
+ * sharp s now does indicate the need for
+ * reparsing. */
+ REQUIRE_LATIN_RULES(flagp);
+
RExC_seen_unfolded_sharp_s = 1;
maybe_exactfu = FALSE;
}
@@ -16383,8 +16385,8 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth,
if (*flagp & NEED_UTF8)
FAIL("panic: grok_bslash_N set NEED_UTF8");
- if (*flagp & RESTART_PASS1)
- return NULL;
+
+ RETURN_NULL_ON_RESTART_FLAGP(flagp);
if (cp_count < 0) {
vFAIL("\\N in a character class must be a named character: \\N{...}");
@@ -17353,7 +17355,7 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth,
ret = reg(pRExC_state, 1, ®_flags, depth+1);
- *flagp |= reg_flags&(HASWIDTH|SIMPLE|SPSTART|POSTPONED|RESTART_PASS1|NEED_UTF8);
+ *flagp |= reg_flags & (HASWIDTH|SIMPLE|SPSTART|POSTPONED|RESTART_PASS1|NEED_UTF8);
/* And restore so can parse the rest of the pattern */
RExC_parse = save_parse;
|
From @khwilliamsonOn 02/03/2018 06:32 AM, demerphq wrote:
7 months ago I made a proposal, fully explained here, and in followup posts: But briefly, under /i matching there may very well be characters One of the things the optimizer does is to look for sequences of bytes But if the whole pattern is /i, there is no such sequence of bytes that What I proposed is for the regex parser to split the EXACTish nodes into That would mean the optimizer would very likely be able to find a fixed I chatted with Yves about this yesterday, and he liked it; which he also I now have a branch that implements this. The reason it is relevant to I therefore propose that we do this, and leave my basic patch to include In the proposal of July, I suggested that a new compile-time only node The reason for doing that is that there is overhead in the engine I suppose we could join short nodes, and leave longer ones separate. And finally, I have realized some more things related to this ticket. If the order in the ticket's pattern were reversed. and the DF came In general, we do need to restart the parse, because the \N{} could be a |
From @demerphqOn 4 February 2018 at 01:11, Karl Williamson <public@khwilliamson.com> wrote:
[snip]
[snip]
[snip]
I don't follow that part entirely. I would like to see the branch. But I think we should do the following at this point for /blead/: I will push the patch i posted in my previous reply, with the cleanup Then afterwards Karl can apply his changes to EXACT joining and typing For backports we will do the minimal patch proposed by Karl, and Then we fix both with one back-release. Personally I think the Trie issue justifies posting fixes back to 5.22 cheers, -- |
From @tonycozOn Fri, 02 Feb 2018 14:40:35 -0800, public@khwilliamson.com wrote:
The patch makes sense to me, and prevents the failures in both my simple test case and the original. I was confused about the meaning of seen_unfolded_sharp_s, since we were seeing the \xDF in a /i regexp, thinking "hey this is folded", but I assume it means something like "we've emitted a \xDF, but because we weren't in unicode rules, we didn't fold it". The original test case causes overflows back to 5.18, but the patch applies back to through the perls we support, 5.24 and 5.26. The CVE form requests the versions an issue applies to, while the specific code that's failing here only exists in 5.24 and later, I suspect the issue was inherited from those earlier versions when the code was re-worked. So I expect I'll put versions 5.18 through 5.26 on the CVE request. The initial request won't include enough detail to reproduce the issue, here's what I expect to update the form with once the issue is public: Vulnerability type: Vendor of the product: Product: Version: Has vendor confirmed or acknowledged the vulnerability? Attack vector: Impact: - the other choices for Impact are: Affected Components: Attack vector Suggested description of the vulnerability for use in the CVE A crafted regular expression can cause a heap buffer overflow, with control over the bytes written Discoverer(s)/Credits Brian Carpenter <brian.carpenter@gmail.com> Reference(s) https://rt.perl.org/Public/Bug/Display.html?id=132227 Additional Information (blank) I'll only fill in Vulnerability Type through version and the description when initially requesting it (tomorrow) and fill the rest in when it goes public. To see the form, visit: and select "Request a CVE ID". Tony |
From @tonycozOn Sun, 04 Feb 2018 16:17:13 -0800, tonyc wrote:
A crafted regular expression can cause a heap buffer write overflow, with control over the bytes written Tony |
From @khwilliamsonOn 02/04/2018 02:10 AM, demerphq wrote:
I think it would be better to to do my (attached) patch first, but I'm Basically, the problem goes away, in passing, because a new algorithm is The algorithm is to always start out with an EXACT node. As long as The reason blead fails is the node type is computed before the \N{} is |
From @khwilliamson0003-regcomp.c-Under-i-segregate-folding-vs-non-folding-c.patchFrom 084d1c50e1e4072257acd6a855a26796204eeb91 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Sun, 4 Feb 2018 19:43:00 -0700
Subject: [PATCH 3/3] regcomp.c: Under/i segregate folding vs non-folding
characters
For matching sequences of characters, the regex compiler generates
various flavors of EXACT-type nodes. The optimizer uses those nodes to
look for sequences that must be in the matched string. In this way, the
pattern matching engine may be able to quickly rule out any possible
match altogether, or to narrow down the places in the target string that
might match.
Under /i matching, this generally has not been possible, because there
is no fixed string that the optimizer can grab onto, as something can
match, say, either 'A' or 'a', etc. However, in many patterns that
contain text, there are characters that are fixed even under /i. Things
like tabs, space, and punctuation, for example.
This commit segregates such folding vs non-folding characters into
separate nodes. I proposed this 7 months ago:
http://nntp.perl.org/group/perl.perl5.porters/245342
and in talking with Yves recently, decided to go ahead with it.
In the proposal of July, I suggested that a new node type be used to
mark those nodes which are under /i but contain no characters that match
other than themselves under /i. These nodes could be joined with either
a plain EXACT node, or an EXACTFish one to create longer nodes.
The reason for joining is that there is overhead in the engine whenever
we switch to the next node. But the reason for not doing it is that it
is slower to match /i nodes than ones that an memEQ can be used on.
I suppose we could join short nodes, and leave longer ones separate, but
that decision can be deferred based on real-world experience.
This patch also consolidates into one place the handling of the Latin
Sharp S, in order to avoid extra #ifdefs, and cause the logic to be
linearly shown.
---
regcomp.c | 246 ++++++++++++++++++++++++++++++++++++++++----------------------
1 file changed, 158 insertions(+), 88 deletions(-)
diff --git a/regcomp.c b/regcomp.c
index f5b2ddb871..928a70bc65 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -3695,10 +3695,7 @@ S_construct_ahocorasick_from_trie(pTHX_ RExC_state_t *pRExC_state, regnode *sour
* XXX khw thinks this should be enhanced to fill EXACT (at least) nodes as full
* as possible, even if that means splitting an existing node so that its first
* part is moved to the preceeding node. This would maximise the efficiency of
- * memEQ during matching. Elsewhere in this file, khw proposes splitting
- * EXACTFish nodes into portions that don't change under folding vs those that
- * do. Those portions that don't change may be the only things in the pattern that
- * could be used to find fixed and floating strings.
+ * memEQ during matching.
*
* If a node is to match under /i (folded), the number of characters it matches
* can be different than its character length if it contains a multi-character
@@ -13297,7 +13294,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
char *s0;
U8 upper_parse = MAX_NODE_STRING_SIZE;
- U8 node_type = compute_EXACTish(pRExC_state);
+
+ /* We start out as an EXACT node, even if under /i, until we find a
+ * character which is in a fold. The algorithm now segregates into
+ * separate nodes, characters that fold from those that don't under
+ * /i. (This hopefull will create nodes that are fixed strings
+ * even under /i, giving the optimizer something to grab onto to.)
+ * So, if a node has something in it and the next character is in
+ * the opposite category, that node is closed up, and the function
+ * returns. Then regatom is called again, and a new node is
+ * created for the new category. */
+ U8 node_type = EXACT;
+
bool next_is_quantifier;
char * oldp = NULL;
@@ -13311,14 +13319,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* which don't participate in folds with Latin1-range characters,
* as the latter's folds aren't known until runtime. (We don't
* need to figure this out until pass 2) */
- bool maybe_exactfu = PASS2
- && (node_type == EXACTF || node_type == EXACTFL);
-
- /* If a folding node contains only code points that don't
- * participate in folds, it can be changed into an EXACT node,
- * which allows the optimizer more things to look for, and is
- * faster to match */
- bool maybe_exact;
+ bool maybe_exactfu = PASS2;
/* The node_type may change below, but since the size of the node
* doesn't change, it works */
@@ -13332,15 +13333,6 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
reparse:
- /* We look for the EXACTFish to EXACT node optimizaton only if
- * folding. (And we don't need to figure this out until pass 2).
- * XXX It might actually make sense to split the node into portions
- * that are exact and ones that aren't, so that we could later use
- * the exact ones to find the longest fixed and floating strings.
- * One would want to join them back into a larger node. One could
- * use a pseudo regnode like 'EXACT_ORIG_FOLD' */
- maybe_exact = FOLD && PASS2;
-
/* This breaks under rare circumstances. If folding, we do not
* want to split a node at a character that is a non-final in a
* multi-char fold, as an input string could just happen to want to
@@ -13357,7 +13349,9 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
/* Here, we have a literal character. Find the maximal string of
* them in the input that we can fit into a single EXACTish node.
- * We quit at the first non-literal or when the node gets full */
+ * We quit at the first non-literal or when the node gets full, or
+ * under /i the categorization of folding/non-folding character
+ * changes */
for (p = RExC_parse; len < upper_parse && p < RExC_end; ) {
/* In most cases each iteration adds one byte to the output.
@@ -13709,8 +13703,17 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
else if (LOC && is_PROBLEMATIC_LOCALE_FOLD_cp(ender)) {
/* Here are folding under /l, and the code point is
- * problematic. First, we know we can't simplify things */
- maybe_exact = FALSE;
+ * problematic. If this is the first character in the
+ * node, change the node type to folding. Otherwise, if
+ * this is the first problematic character, close up the
+ * existing node, so can start a new node with this one */
+ if (! len) {
+ node_type = EXACTFL;
+ }
+ else if (node_type == EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
/* This code point means we can't simplify things */
maybe_exactfu = FALSE;
@@ -13730,44 +13733,74 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* do for both passes is the PASS2 code for non-folding */
goto not_fold_common;
}
- else /* A regular FOLD code point */
- if (! ( UTF
-#if UNICODE_MAJOR_VERSION > 3 /* no multifolds in early Unicode */ \
- || (UNICODE_MAJOR_VERSION == 3 && ( UNICODE_DOT_VERSION > 0) \
- || UNICODE_DOT_DOT_VERSION > 0)
- /* See comments for join_exact() as to why we fold
- * this non-UTF at compile time */
- || ( node_type == EXACTFU
- && ender == LATIN_SMALL_LETTER_SHARP_S)
-#endif
- )) {
+ else /* A regular FOLD code point */
+ if (! UTF)
+ {
/* Here, are folding and are not UTF-8 encoded; therefore
* the character must be in the range 0-255, and is not /l.
* (Not /l because we already handled these under /l in
* is_PROBLEMATIC_LOCALE_FOLD_cp) */
- if (IS_IN_SOME_FOLD_L1(ender)) {
- maybe_exact = FALSE;
+ if (! IS_IN_SOME_FOLD_L1(ender)) {
- /* See if the character's fold differs between /d and
- * /u. This includes the multi-char fold SHARP S to
- * 'ss' */
- if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
- RExC_seen_unfolded_sharp_s = 1;
- maybe_exactfu = FALSE;
+ /* Start a new node for this non-folding character if
+ * previous ones in the node were folded */
+ if (len && node_type != EXACT) {
+ p = oldp;
+ goto loopdone;
}
- else if (maybe_exactfu
- && (PL_fold[ender] != PL_fold_latin1[ender]
+
+ *(s++) = (char) ender;
+ }
+ else { /* Here, does participate in some fold */
+
+ /* if this is the first character in the node, change
+ * its type to folding. Otherwise, if this is the
+ * first folding character in the node, close up the
+ * existing node, so can start a new node with this
+ * one. */
+ if (! len) {
+ node_type = compute_EXACTish(pRExC_state);
+ }
+ else if (node_type == EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ /* See if the character's fold differs between /d and
+ * /u. On non-ancient Unicode versions, his includes
+ * the multi-char fold SHARP S to 'ss' */
+
#if UNICODE_MAJOR_VERSION > 3 /* no multifolds in early Unicode */ \
|| (UNICODE_MAJOR_VERSION == 3 && ( UNICODE_DOT_VERSION > 0) \
|| UNICODE_DOT_DOT_VERSION > 0)
- || ( len > 0
- && isALPHA_FOLD_EQ(ender, 's')
- && isALPHA_FOLD_EQ(*(s-1), 's'))
+
+ if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+ /* See comments for join_exact() as to why we fold
+ * this non-UTF at compile time */
+ if (node_type == EXACTFU) {
+ *(s++) = 's';
+
+ /* Let the code below add in the extra 's' */
+ ender = 's';
+ added_len = 2;
+ }
+ else {
+ RExC_seen_unfolded_sharp_s = 1;
+ maybe_exactfu = FALSE;
+ }
+ }
+ else if ( len
+ && isALPHA_FOLD_EQ(ender, 's')
+ && isALPHA_FOLD_EQ(*(s-1), 's'))
+ {
+ maybe_exactfu = FALSE;
+ }
+ else
#endif
- )) {
+
+ if (PL_fold[ender] != PL_fold_latin1[ender]) {
maybe_exactfu = FALSE;
}
- }
/* Even when folding, we store just the input
* character, as we have an array that finds its fold
@@ -13775,7 +13808,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
*(s++) = (char) ender;
}
}
- else { /* FOLD, and UTF (or sharp s) */
+ else { /* FOLD, and UTF */
/* Unlike the non-fold case, we do actually have to
* calculate the fold in pass 1. This is for two reasons,
* the folded length may be longer than the unfolded, and
@@ -13784,41 +13817,72 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* of a potential multi-char fold, and have to back off
* accordingly. */
- UV folded;
if (isASCII_uni(ender)) {
- folded = toFOLD(ender);
- *(s)++ = (U8) folded;
+
+ /* As above, we close up and start a new node if the
+ * previous characters don't match the fold/non-fold
+ * state of this one. And if this is the first
+ * character in the node, and it folds, we change the
+ * node away from being EXACT */
+ if (! IS_IN_SOME_FOLD_L1(ender)) {
+ if (len && node_type != EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ *(s)++ = (U8) ender;
+ }
+ else { /* Is in a fold */
+
+ if (! len) {
+ node_type = compute_EXACTish(pRExC_state);
+ }
+ else if (node_type == EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ *(s)++ = (U8) toFOLD(ender);
+ }
}
else { /* Not ASCII */
STRLEN foldlen;
- folded = _to_uni_fold_flags(
+ /* As above, we close up and start a new node if the
+ * previous characters don't match the fold/non-fold
+ * state of this one. And if this is the first
+ * character in the node, and it folds, we change the
+ * node away from being EXACT */
+ if (! _invlist_contains_cp(PL_utf8_foldable, ender)) {
+ if (len && node_type != EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ s = (char *) uvchr_to_utf8((U8 *) s, ender);
+ added_len = UVCHR_SKIP(ender);
+ }
+ else {
+
+ if (! len) {
+ node_type = compute_EXACTish(pRExC_state);
+ }
+ else if (node_type == EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ ender = _to_uni_fold_flags(
ender,
(U8 *) s,
&foldlen,
FOLD_FLAGS_FULL | ((ASCII_FOLD_RESTRICTED)
? FOLD_FLAGS_NOMIX_ASCII
: 0));
- s += foldlen;
- added_len = foldlen;
- }
- /* If this node only contains non-folding code points so
- * far, see if this new one is also non-folding */
- if (maybe_exact) {
- if (folded != ender) {
- maybe_exact = FALSE;
- }
- else {
- /* Here the fold is the original; we have to check
- * further to see if anything folds to it */
- if (_invlist_contains_cp(PL_utf8_foldable,
- ender))
- {
- maybe_exact = FALSE;
- }
+ s += foldlen;
+ added_len = foldlen;
}
}
- ender = folded;
}
len += added_len;
@@ -14004,21 +14068,27 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
OP(ret) = NOTHING;
}
else {
- if (FOLD) {
- /* If 'maybe_exact' is still set here, means there are no
- * code points in the node that participate in folds;
- * similarly for 'maybe_exactfu' and code points that match
- * differently depending on UTF8ness of the target string
- * (for /u), or depending on locale for /l */
- if (maybe_exact) {
- OP(ret) = (LOC)
- ? EXACTL
- : EXACT;
+ OP(ret) = node_type;
+
+ /* If the node type is EXACT here, check to see if it
+ * should be EXACTL. */
+ if (node_type == EXACT) {
+ if (LOC) {
+ OP(ret) = EXACTL;
}
- else if (maybe_exactfu) {
- OP(ret) = (LOC)
- ? EXACTFLU8
- : EXACTFU;
+ }
+
+ if (FOLD) {
+ /* If 'maybe_exactfu' is set, then there are no code points
+ * that match differently depending on UTF8ness of the
+ * target string (for /u), or depending on locale for /l */
+ if (maybe_exactfu) {
+ if (node_type == EXACTF) {
+ OP(ret) = EXACTFU;
+ }
+ else if (node_type == EXACTFL) {
+ OP(ret) = EXACTFLU8;
+ }
}
}
--
2.11.0
|
From @khwilliamsonThat patch doesn't apply to blead; the very slightly revised one here does |
From @khwilliamson0001-regcomp.c-Under-i-segregate-folding-vs-non-folding-c.patchFrom fc2ba75031cf1477d32b3eff3115e2c4ce4f4d43 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Sun, 4 Feb 2018 19:43:00 -0700
Subject: [PATCH] regcomp.c: Under/i segregate folding vs non-folding
characters
For matching sequences of characters, the regex compiler generates
various flavors of EXACT-type nodes. The optimizer uses those nodes to
look for sequences that must be in the matched string. In this way, the
pattern matching engine may be able to quickly rule out any possible
match altogether, or to narrow down the places in the target string that
might match.
Under /i matching, this generally has not been possible, because there
is no fixed string that the optimizer can grab onto, as something can
match, say, either 'A' or 'a', etc. However, in many patterns that
contain text, there are characters that are fixed even under /i. Things
like tabs, space, and punctuation, for example.
This commit segregates such folding vs non-folding characters into
separate nodes. I proposed this 7 months ago:
http://nntp.perl.org/group/perl.perl5.porters/245342
and in talking with Yves recently, decided to go ahead with it.
In the proposal of July, I suggested that a new node type be used to
mark those nodes which are under /i but contain no characters that match
other than themselves under /i. These nodes could be joined with either
a plain EXACT node, or an EXACTFish one to create longer nodes.
The reason for joining is that there is overhead in the engine whenever
we switch to the next node. But the reason for not doing it is that it
is slower to match /i nodes than ones that an memEQ can be used on.
I suppose we could join short nodes, and leave longer ones separate, but
that decision can be deferred based on real-world experience.
This patch also consolidates into one place the handling of the Latin
Sharp S, in order to avoid extra #ifdefs, and cause the logic to be
linearly shown.
---
regcomp.c | 246 ++++++++++++++++++++++++++++++++++++++++----------------------
1 file changed, 158 insertions(+), 88 deletions(-)
diff --git a/regcomp.c b/regcomp.c
index 6f89a8ec30..dbdfd7b694 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -3695,10 +3695,7 @@ S_construct_ahocorasick_from_trie(pTHX_ RExC_state_t *pRExC_state, regnode *sour
* XXX khw thinks this should be enhanced to fill EXACT (at least) nodes as full
* as possible, even if that means splitting an existing node so that its first
* part is moved to the preceeding node. This would maximise the efficiency of
- * memEQ during matching. Elsewhere in this file, khw proposes splitting
- * EXACTFish nodes into portions that don't change under folding vs those that
- * do. Those portions that don't change may be the only things in the pattern that
- * could be used to find fixed and floating strings.
+ * memEQ during matching.
*
* If a node is to match under /i (folded), the number of characters it matches
* can be different than its character length if it contains a multi-character
@@ -13297,7 +13294,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
char *s0;
U8 upper_parse = MAX_NODE_STRING_SIZE;
- U8 node_type = compute_EXACTish(pRExC_state);
+
+ /* We start out as an EXACT node, even if under /i, until we find a
+ * character which is in a fold. The algorithm now segregates into
+ * separate nodes, characters that fold from those that don't under
+ * /i. (This hopefull will create nodes that are fixed strings
+ * even under /i, giving the optimizer something to grab onto to.)
+ * So, if a node has something in it and the next character is in
+ * the opposite category, that node is closed up, and the function
+ * returns. Then regatom is called again, and a new node is
+ * created for the new category. */
+ U8 node_type = EXACT;
+
bool next_is_quantifier;
char * oldp = NULL;
@@ -13311,14 +13319,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* which don't participate in folds with Latin1-range characters,
* as the latter's folds aren't known until runtime. (We don't
* need to figure this out until pass 2) */
- bool maybe_exactfu = PASS2
- && (node_type == EXACTF || node_type == EXACTFL);
-
- /* If a folding node contains only code points that don't
- * participate in folds, it can be changed into an EXACT node,
- * which allows the optimizer more things to look for, and is
- * faster to match */
- bool maybe_exact;
+ bool maybe_exactfu = PASS2;
/* The node_type may change below, but since the size of the node
* doesn't change, it works */
@@ -13332,15 +13333,6 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
reparse:
- /* We look for the EXACTFish to EXACT node optimizaton only if
- * folding. (And we don't need to figure this out until pass 2).
- * XXX It might actually make sense to split the node into portions
- * that are exact and ones that aren't, so that we could later use
- * the exact ones to find the longest fixed and floating strings.
- * One would want to join them back into a larger node. One could
- * use a pseudo regnode like 'EXACT_ORIG_FOLD' */
- maybe_exact = FOLD && PASS2;
-
/* This breaks under rare circumstances. If folding, we do not
* want to split a node at a character that is a non-final in a
* multi-char fold, as an input string could just happen to want to
@@ -13357,7 +13349,9 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
/* Here, we have a literal character. Find the maximal string of
* them in the input that we can fit into a single EXACTish node.
- * We quit at the first non-literal or when the node gets full */
+ * We quit at the first non-literal or when the node gets full, or
+ * under /i the categorization of folding/non-folding character
+ * changes */
for (p = RExC_parse; len < upper_parse && p < RExC_end; ) {
/* In most cases each iteration adds one byte to the output.
@@ -13709,8 +13703,17 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
else if (LOC && is_PROBLEMATIC_LOCALE_FOLD_cp(ender)) {
/* Here are folding under /l, and the code point is
- * problematic. First, we know we can't simplify things */
- maybe_exact = FALSE;
+ * problematic. If this is the first character in the
+ * node, change the node type to folding. Otherwise, if
+ * this is the first problematic character, close up the
+ * existing node, so can start a new node with this one */
+ if (! len) {
+ node_type = EXACTFL;
+ }
+ else if (node_type == EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
/* This code point means we can't simplify things */
maybe_exactfu = FALSE;
@@ -13730,51 +13733,81 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* do for both passes is the PASS2 code for non-folding */
goto not_fold_common;
}
- else /* A regular FOLD code point */
- if (! ( UTF
-#if UNICODE_MAJOR_VERSION > 3 /* no multifolds in early Unicode */ \
- || (UNICODE_MAJOR_VERSION == 3 && ( UNICODE_DOT_VERSION > 0) \
- || UNICODE_DOT_DOT_VERSION > 0)
- /* See comments for join_exact() as to why we fold
- * this non-UTF at compile time */
- || ( node_type == EXACTFU
- && ender == LATIN_SMALL_LETTER_SHARP_S)
-#endif
- )) {
+ else /* A regular FOLD code point */
+ if (! UTF)
+ {
/* Here, are folding and are not UTF-8 encoded; therefore
* the character must be in the range 0-255, and is not /l.
* (Not /l because we already handled these under /l in
* is_PROBLEMATIC_LOCALE_FOLD_cp) */
- if (IS_IN_SOME_FOLD_L1(ender)) {
- maybe_exact = FALSE;
+ if (! IS_IN_SOME_FOLD_L1(ender)) {
- /* See if the character's fold differs between /d and
- * /u. This includes the multi-char fold SHARP S to
- * 'ss' */
- if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
- RExC_seen_unfolded_sharp_s = 1;
- maybe_exactfu = FALSE;
+ /* Start a new node for this non-folding character if
+ * previous ones in the node were folded */
+ if (len && node_type != EXACT) {
+ p = oldp;
+ goto loopdone;
}
- else if (maybe_exactfu
- && (PL_fold[ender] != PL_fold_latin1[ender]
+
+ *(s++) = (char) ender;
+ }
+ else { /* Here, does participate in some fold */
+
+ /* if this is the first character in the node, change
+ * its type to folding. Otherwise, if this is the
+ * first folding character in the node, close up the
+ * existing node, so can start a new node with this
+ * one. */
+ if (! len) {
+ node_type = compute_EXACTish(pRExC_state);
+ }
+ else if (node_type == EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ /* See if the character's fold differs between /d and
+ * /u. On non-ancient Unicode versions, his includes
+ * the multi-char fold SHARP S to 'ss' */
+
#if UNICODE_MAJOR_VERSION > 3 /* no multifolds in early Unicode */ \
|| (UNICODE_MAJOR_VERSION == 3 && ( UNICODE_DOT_VERSION > 0) \
|| UNICODE_DOT_DOT_VERSION > 0)
- || ( len > 0
- && isALPHA_FOLD_EQ(ender, 's')
- && isALPHA_FOLD_EQ(*(s-1), 's'))
+
+ if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+ /* See comments for join_exact() as to why we fold
+ * this non-UTF at compile time */
+ if (node_type == EXACTFU) {
+ *(s++) = 's';
+
+ /* Let the code below add in the extra 's' */
+ ender = 's';
+ added_len = 2;
+ }
+ else {
+ RExC_seen_unfolded_sharp_s = 1;
+ maybe_exactfu = FALSE;
+ }
+ }
+ else if ( len
+ && isALPHA_FOLD_EQ(ender, 's')
+ && isALPHA_FOLD_EQ(*(s-1), 's'))
+ {
+ maybe_exactfu = FALSE;
+ }
+ else
#endif
- )) {
+
+ if (PL_fold[ender] != PL_fold_latin1[ender]) {
maybe_exactfu = FALSE;
}
- }
/* Even when folding, we store just the input
* character, as we have an array that finds its fold
* quickly */
*(s++) = (char) ender;
}
- else { /* FOLD, and UTF (or sharp s) */
+ else { /* FOLD, and UTF */
/* Unlike the non-fold case, we do actually have to
* calculate the fold in pass 1. This is for two reasons,
* the folded length may be longer than the unfolded, and
@@ -13783,41 +13816,72 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* of a potential multi-char fold, and have to back off
* accordingly. */
- UV folded;
if (isASCII_uni(ender)) {
- folded = toFOLD(ender);
- *(s)++ = (U8) folded;
+
+ /* As above, we close up and start a new node if the
+ * previous characters don't match the fold/non-fold
+ * state of this one. And if this is the first
+ * character in the node, and it folds, we change the
+ * node away from being EXACT */
+ if (! IS_IN_SOME_FOLD_L1(ender)) {
+ if (len && node_type != EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ *(s)++ = (U8) ender;
+ }
+ else { /* Is in a fold */
+
+ if (! len) {
+ node_type = compute_EXACTish(pRExC_state);
+ }
+ else if (node_type == EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ *(s)++ = (U8) toFOLD(ender);
+ }
}
else { /* Not ASCII */
STRLEN foldlen;
- folded = _to_uni_fold_flags(
+ /* As above, we close up and start a new node if the
+ * previous characters don't match the fold/non-fold
+ * state of this one. And if this is the first
+ * character in the node, and it folds, we change the
+ * node away from being EXACT */
+ if (! _invlist_contains_cp(PL_utf8_foldable, ender)) {
+ if (len && node_type != EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ s = (char *) uvchr_to_utf8((U8 *) s, ender);
+ added_len = UVCHR_SKIP(ender);
+ }
+ else {
+
+ if (! len) {
+ node_type = compute_EXACTish(pRExC_state);
+ }
+ else if (node_type == EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ ender = _to_uni_fold_flags(
ender,
(U8 *) s,
&foldlen,
FOLD_FLAGS_FULL | ((ASCII_FOLD_RESTRICTED)
? FOLD_FLAGS_NOMIX_ASCII
: 0));
- s += foldlen;
- added_len = foldlen;
- }
- /* If this node only contains non-folding code points so
- * far, see if this new one is also non-folding */
- if (maybe_exact) {
- if (folded != ender) {
- maybe_exact = FALSE;
- }
- else {
- /* Here the fold is the original; we have to check
- * further to see if anything folds to it */
- if (_invlist_contains_cp(PL_utf8_foldable,
- ender))
- {
- maybe_exact = FALSE;
- }
+ s += foldlen;
+ added_len = foldlen;
}
}
- ender = folded;
}
len += added_len;
@@ -14003,21 +14067,27 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
OP(ret) = NOTHING;
}
else {
- if (FOLD) {
- /* If 'maybe_exact' is still set here, means there are no
- * code points in the node that participate in folds;
- * similarly for 'maybe_exactfu' and code points that match
- * differently depending on UTF8ness of the target string
- * (for /u), or depending on locale for /l */
- if (maybe_exact) {
- OP(ret) = (LOC)
- ? EXACTL
- : EXACT;
+ OP(ret) = node_type;
+
+ /* If the node type is EXACT here, check to see if it
+ * should be EXACTL. */
+ if (node_type == EXACT) {
+ if (LOC) {
+ OP(ret) = EXACTL;
}
- else if (maybe_exactfu) {
- OP(ret) = (LOC)
- ? EXACTFLU8
- : EXACTFU;
+ }
+
+ if (FOLD) {
+ /* If 'maybe_exactfu' is set, then there are no code points
+ * that match differently depending on UTF8ness of the
+ * target string (for /u), or depending on locale for /l */
+ if (maybe_exactfu) {
+ if (node_type == EXACTF) {
+ OP(ret) = EXACTFU;
+ }
+ else if (node_type == EXACTFL) {
+ OP(ret) = EXACTFLU8;
+ }
}
}
--
2.11.0
|
From @khwilliamsonOn 02/04/2018 08:38 PM, Karl Williamson wrote:
And, in thinking about it overnight, I realized that my patch does not qr/0b\N{U+41}\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF/i' Compiling REx |
From @khwilliamsonOn 02/05/2018 10:41 AM, Karl Williamson wrote:
The small patch I originally sent turns out to be overkill. We don't And that patch is easily incorporated into my other one, which the 2nd So here's what I think should happen We need to decide if we're going to sneak this in or wait for the CVE date. If we sneak it in, do we do it with Yves or my patch? Note that his |
From @khwilliamson0002-132227.patchFrom 65ca3a816676489b315d69c69ae58818ee7e5a89 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Fri, 2 Feb 2018 15:14:27 -0700
Subject: [PATCH 2/2] 132227
---
regcomp.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/regcomp.c b/regcomp.c
index 6dbfed52ab..67f9c93287 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13756,6 +13756,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* /u. This includes the multi-char fold SHARP S to
* 'ss' */
if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+
+ /* If the node started out having uni rules, we
+ * wouldn't have gotten here. So this means
+ * something in the middle has changed it, but
+ * didn't think it needed to reparse. But this
+ * sharp s now does indicate the need for
+ * reparsing. */
+ if (RExC_uni_semantics) {
+ p = oldp;
+ goto loopdone;
+ }
+
RExC_seen_unfolded_sharp_s = 1;
maybe_exactfu = FALSE;
}
--
2.11.0
|
From @khwilliamson0001-regcomp.c-Under-i-segregate-folding-vs-non-folding-c.patchFrom b7665795ecc38f0b218e3c25cec93e12549e1215 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Sun, 4 Feb 2018 19:43:00 -0700
Subject: [PATCH] regcomp.c: Under/i segregate folding vs non-folding
characters
For matching sequences of characters, the regex compiler generates
various flavors of EXACT-type nodes. The optimizer uses those nodes to
look for sequences that must be in the matched string. In this way, the
pattern matching engine may be able to quickly rule out any possible
match altogether, or to narrow down the places in the target string that
might match.
Under /i matching, this generally has not been possible, because there
is no fixed string that the optimizer can grab onto, as something can
match, say, either 'A' or 'a', etc. However, in many patterns that
contain text, there are characters that are fixed even under /i. Things
like tabs, space, and punctuation, for example.
This commit segregates such folding vs non-folding characters into
separate nodes. I proposed this 7 months ago:
http://nntp.perl.org/group/perl.perl5.porters/245342
and in talking with Yves recently, decided to go ahead with it.
In the proposal of July, I suggested that a new node type be used to
mark those nodes which are under /i but contain no characters that match
other than themselves under /i. These nodes could be joined with either
a plain EXACT node, or an EXACTFish one to create longer nodes.
The reason for joining is that there is overhead in the engine whenever
we switch to the next node. But the reason for not doing it is that it
is slower to match /i nodes than ones that an memEQ can be used on.
I suppose we could join short nodes, and leave longer ones separate, but
that decision can be deferred based on real-world experience.
This patch also consolidates into one place the handling of the Latin
Sharp S, in order to avoid extra #ifdefs, and cause the logic to be
linearly shown.
---
regcomp.c | 252 ++++++++++++++++++++++++++++++++++++++++----------------------
regexec.c | 5 ++
2 files changed, 169 insertions(+), 88 deletions(-)
diff --git a/regcomp.c b/regcomp.c
index 6f89a8ec30..934d0ae819 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -3695,10 +3695,7 @@ S_construct_ahocorasick_from_trie(pTHX_ RExC_state_t *pRExC_state, regnode *sour
* XXX khw thinks this should be enhanced to fill EXACT (at least) nodes as full
* as possible, even if that means splitting an existing node so that its first
* part is moved to the preceeding node. This would maximise the efficiency of
- * memEQ during matching. Elsewhere in this file, khw proposes splitting
- * EXACTFish nodes into portions that don't change under folding vs those that
- * do. Those portions that don't change may be the only things in the pattern that
- * could be used to find fixed and floating strings.
+ * memEQ during matching.
*
* If a node is to match under /i (folded), the number of characters it matches
* can be different than its character length if it contains a multi-character
@@ -13297,7 +13294,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
char *s0;
U8 upper_parse = MAX_NODE_STRING_SIZE;
- U8 node_type = compute_EXACTish(pRExC_state);
+
+ /* We start out as an EXACT node, even if under /i, until we find a
+ * character which is in a fold. The algorithm now segregates into
+ * separate nodes, characters that fold from those that don't under
+ * /i. (This hopefull will create nodes that are fixed strings
+ * even under /i, giving the optimizer something to grab onto to.)
+ * So, if a node has something in it and the next character is in
+ * the opposite category, that node is closed up, and the function
+ * returns. Then regatom is called again, and a new node is
+ * created for the new category. */
+ U8 node_type = EXACT;
+
bool next_is_quantifier;
char * oldp = NULL;
@@ -13311,14 +13319,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* which don't participate in folds with Latin1-range characters,
* as the latter's folds aren't known until runtime. (We don't
* need to figure this out until pass 2) */
- bool maybe_exactfu = PASS2
- && (node_type == EXACTF || node_type == EXACTFL);
-
- /* If a folding node contains only code points that don't
- * participate in folds, it can be changed into an EXACT node,
- * which allows the optimizer more things to look for, and is
- * faster to match */
- bool maybe_exact;
+ bool maybe_exactfu = PASS2;
/* The node_type may change below, but since the size of the node
* doesn't change, it works */
@@ -13332,15 +13333,6 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
reparse:
- /* We look for the EXACTFish to EXACT node optimizaton only if
- * folding. (And we don't need to figure this out until pass 2).
- * XXX It might actually make sense to split the node into portions
- * that are exact and ones that aren't, so that we could later use
- * the exact ones to find the longest fixed and floating strings.
- * One would want to join them back into a larger node. One could
- * use a pseudo regnode like 'EXACT_ORIG_FOLD' */
- maybe_exact = FOLD && PASS2;
-
/* This breaks under rare circumstances. If folding, we do not
* want to split a node at a character that is a non-final in a
* multi-char fold, as an input string could just happen to want to
@@ -13357,7 +13349,9 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
/* Here, we have a literal character. Find the maximal string of
* them in the input that we can fit into a single EXACTish node.
- * We quit at the first non-literal or when the node gets full */
+ * We quit at the first non-literal or when the node gets full, or
+ * under /i the categorization of folding/non-folding character
+ * changes */
for (p = RExC_parse; len < upper_parse && p < RExC_end; ) {
/* In most cases each iteration adds one byte to the output.
@@ -13709,8 +13703,17 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
else if (LOC && is_PROBLEMATIC_LOCALE_FOLD_cp(ender)) {
/* Here are folding under /l, and the code point is
- * problematic. First, we know we can't simplify things */
- maybe_exact = FALSE;
+ * problematic. If this is the first character in the
+ * node, change the node type to folding. Otherwise, if
+ * this is the first problematic character, close up the
+ * existing node, so can start a new node with this one */
+ if (! len) {
+ node_type = EXACTFL;
+ }
+ else if (node_type == EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
/* This code point means we can't simplify things */
maybe_exactfu = FALSE;
@@ -13730,51 +13733,87 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* do for both passes is the PASS2 code for non-folding */
goto not_fold_common;
}
- else /* A regular FOLD code point */
- if (! ( UTF
-#if UNICODE_MAJOR_VERSION > 3 /* no multifolds in early Unicode */ \
- || (UNICODE_MAJOR_VERSION == 3 && ( UNICODE_DOT_VERSION > 0) \
- || UNICODE_DOT_DOT_VERSION > 0)
- /* See comments for join_exact() as to why we fold
- * this non-UTF at compile time */
- || ( node_type == EXACTFU
- && ender == LATIN_SMALL_LETTER_SHARP_S)
-#endif
- )) {
+ else /* A regular FOLD code point */
+ if (! UTF)
+ {
/* Here, are folding and are not UTF-8 encoded; therefore
* the character must be in the range 0-255, and is not /l.
* (Not /l because we already handled these under /l in
* is_PROBLEMATIC_LOCALE_FOLD_cp) */
- if (IS_IN_SOME_FOLD_L1(ender)) {
- maybe_exact = FALSE;
+ if (! IS_IN_SOME_FOLD_L1(ender)) {
- /* See if the character's fold differs between /d and
- * /u. This includes the multi-char fold SHARP S to
- * 'ss' */
- if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
- RExC_seen_unfolded_sharp_s = 1;
- maybe_exactfu = FALSE;
+ /* Start a new node for this non-folding character if
+ * previous ones in the node were folded */
+ if (len && node_type != EXACT) {
+ p = oldp;
+ goto loopdone;
}
- else if (maybe_exactfu
- && (PL_fold[ender] != PL_fold_latin1[ender]
+
+ *(s++) = (char) ender;
+ }
+ else { /* Here, does participate in some fold */
+
+ /* if this is the first character in the node, change
+ * its type to folding. Otherwise, if this is the
+ * first folding character in the node, close up the
+ * existing node, so can start a new node with this
+ * one. */
+ if (! len) {
+ node_type = compute_EXACTish(pRExC_state);
+ }
+ else if (node_type == EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ /* See if the character's fold differs between /d and
+ * /u. On non-ancient Unicode versions, his includes
+ * the multi-char fold SHARP S to 'ss' */
+
#if UNICODE_MAJOR_VERSION > 3 /* no multifolds in early Unicode */ \
|| (UNICODE_MAJOR_VERSION == 3 && ( UNICODE_DOT_VERSION > 0) \
|| UNICODE_DOT_DOT_VERSION > 0)
- || ( len > 0
- && isALPHA_FOLD_EQ(ender, 's')
- && isALPHA_FOLD_EQ(*(s-1), 's'))
+
+ if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+ if (node_type == EXACTF && RExC_uni_semantics) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ /* See comments for join_exact() as to why we fold
+ * this non-UTF at compile time */
+ if (node_type == EXACTFU) {
+ *(s++) = 's';
+
+ /* Let the code below add in the extra 's' */
+ ender = 's';
+ added_len = 2;
+ }
+ else {
+ RExC_seen_unfolded_sharp_s = 1;
+ maybe_exactfu = FALSE;
+ }
+ }
+ else if ( len
+ && isALPHA_FOLD_EQ(ender, 's')
+ && isALPHA_FOLD_EQ(*(s-1), 's'))
+ {
+ maybe_exactfu = FALSE;
+ }
+ else
#endif
- )) {
+
+ if (PL_fold[ender] != PL_fold_latin1[ender]) {
maybe_exactfu = FALSE;
}
- }
/* Even when folding, we store just the input
* character, as we have an array that finds its fold
* quickly */
*(s++) = (char) ender;
+ }
}
- else { /* FOLD, and UTF (or sharp s) */
+ else { /* FOLD, and UTF */
/* Unlike the non-fold case, we do actually have to
* calculate the fold in pass 1. This is for two reasons,
* the folded length may be longer than the unfolded, and
@@ -13783,41 +13822,72 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* of a potential multi-char fold, and have to back off
* accordingly. */
- UV folded;
if (isASCII_uni(ender)) {
- folded = toFOLD(ender);
- *(s)++ = (U8) folded;
+
+ /* As above, we close up and start a new node if the
+ * previous characters don't match the fold/non-fold
+ * state of this one. And if this is the first
+ * character in the node, and it folds, we change the
+ * node away from being EXACT */
+ if (! IS_IN_SOME_FOLD_L1(ender)) {
+ if (len && node_type != EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ *(s)++ = (U8) ender;
+ }
+ else { /* Is in a fold */
+
+ if (! len) {
+ node_type = compute_EXACTish(pRExC_state);
+ }
+ else if (node_type == EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ *(s)++ = (U8) toFOLD(ender);
+ }
}
else { /* Not ASCII */
STRLEN foldlen;
- folded = _to_uni_fold_flags(
+ /* As above, we close up and start a new node if the
+ * previous characters don't match the fold/non-fold
+ * state of this one. And if this is the first
+ * character in the node, and it folds, we change the
+ * node away from being EXACT */
+ if (! _invlist_contains_cp(PL_utf8_foldable, ender)) {
+ if (len && node_type != EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ s = (char *) uvchr_to_utf8((U8 *) s, ender);
+ added_len = UVCHR_SKIP(ender);
+ }
+ else {
+
+ if (! len) {
+ node_type = compute_EXACTish(pRExC_state);
+ }
+ else if (node_type == EXACT) {
+ p = oldp;
+ goto loopdone;
+ }
+
+ ender = _to_uni_fold_flags(
ender,
(U8 *) s,
&foldlen,
FOLD_FLAGS_FULL | ((ASCII_FOLD_RESTRICTED)
? FOLD_FLAGS_NOMIX_ASCII
: 0));
- s += foldlen;
- added_len = foldlen;
- }
- /* If this node only contains non-folding code points so
- * far, see if this new one is also non-folding */
- if (maybe_exact) {
- if (folded != ender) {
- maybe_exact = FALSE;
- }
- else {
- /* Here the fold is the original; we have to check
- * further to see if anything folds to it */
- if (_invlist_contains_cp(PL_utf8_foldable,
- ender))
- {
- maybe_exact = FALSE;
- }
+ s += foldlen;
+ added_len = foldlen;
}
}
- ender = folded;
}
len += added_len;
@@ -14003,21 +14073,27 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
OP(ret) = NOTHING;
}
else {
- if (FOLD) {
- /* If 'maybe_exact' is still set here, means there are no
- * code points in the node that participate in folds;
- * similarly for 'maybe_exactfu' and code points that match
- * differently depending on UTF8ness of the target string
- * (for /u), or depending on locale for /l */
- if (maybe_exact) {
- OP(ret) = (LOC)
- ? EXACTL
- : EXACT;
+ OP(ret) = node_type;
+
+ /* If the node type is EXACT here, check to see if it
+ * should be EXACTL. */
+ if (node_type == EXACT) {
+ if (LOC) {
+ OP(ret) = EXACTL;
}
- else if (maybe_exactfu) {
- OP(ret) = (LOC)
- ? EXACTFLU8
- : EXACTFU;
+ }
+
+ if (FOLD) {
+ /* If 'maybe_exactfu' is set, then there are no code points
+ * that match differently depending on UTF8ness of the
+ * target string (for /u), or depending on locale for /l */
+ if (maybe_exactfu) {
+ if (node_type == EXACTF) {
+ OP(ret) = EXACTFU;
+ }
+ else if (node_type == EXACTFL) {
+ OP(ret) = EXACTFLU8;
+ }
}
}
diff --git a/regexec.c b/regexec.c
index 08bf7134cc..8e4ba1700f 100644
--- a/regexec.c
+++ b/regexec.c
@@ -1,4 +1,7 @@
/* regexec.c
+
+ valgrind ./perl -Ilib -DU -le "use re qw(Debug EXTRA); 'xyxy' =~ m'(0|\D*0?x)*?\1';" 2>&1 | more
+ *
*/
/*
@@ -5855,6 +5858,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
state_num = OP(scan);
reenter_switch:
+ DEBUG_U(PerlIO_printf( Perl_debug_log, "%s:%d: locinput=%p, strend=%p, nextchr=%d\n", __FILE__, __LINE__, locinput , reginfo->strend, UCHARAT(locinput)));
DEBUG_EXECUTE_r(
if (state_num <= REGNODE_MAX) {
SV * const prop = sv_newmortal();
@@ -5875,6 +5879,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
to_complement = 0;
SET_nextchr;
+ DEBUG_U(PerlIO_printf( Perl_debug_log, "%s:%d: locinput=%p, strend=%p, nextchr=%d\n", __FILE__, __LINE__, locinput , reginfo->strend, nextchr));
assert(nextchr < 256 && (nextchr >= 0 || nextchr == NEXTCHR_EOS));
switch (state_num) {
--
2.11.0
|
From @tonycozOn Mon, 05 Feb 2018 12:39:26 -0800, public@khwilliamson.com wrote:
Thanks.
I'd prefer to wait until the issue is public before updating blead, but it's up to you. Could you please provide a version of your most recent patch for maint-5.26/maint-5.24 with a better note than "132227"? This will need to go downstream for vendors to update their perl packages. I've requested a CVE ID for this issue. Tony |
From @xsawyerxI will contact the vendors. I just need a short description, a fix, On 6 February 2018 at 07:44, Tony Cook via RT
|
From @tonycozOn Mon, 05 Feb 2018 21:44:23 -0800, tonyc wrote:
This issue is CVE-2018-6797. Tony |
From @tonycozOn Tue, 06 Feb 2018 21:01:35 -0800, tonyc wrote:
Patches for 5.24 and 5.26, merged and de-conflicted (and with a more useful subject than "132227") Karl fixed this in blead in 141513f. Tony |
From @tonycoz5.24-0005-perl-132227-restart-a-node-if-we-change-to-uni-rules.patchFrom e02d7478ebfc399a9d10ba0df60eee217aa7ab8f Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Fri, 2 Feb 2018 15:14:27 -0700
Subject: (perl #132227) restart a node if we change to uni rules within the
node and encounter a sharp S
This could lead to a buffer overflow.
---
regcomp.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/regcomp.c b/regcomp.c
index c6c7cb4925..d79bd191c9 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13323,6 +13323,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* /u. This includes the multi-char fold SHARP S to
* 'ss' */
if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+
+ /* If the node started out having uni rules, we
+ * wouldn't have gotten here. So this means
+ * something in the middle has changed it, but
+ * didn't think it needed to reparse. But this
+ * sharp s now does indicate the need for
+ * reparsing. */
+ if (RExC_uni_semantics) {
+ p = oldp;
+ goto loopdone;
+ }
+
RExC_seen_unfolded_sharp_s = 1;
maybe_exactfu = FALSE;
}
--
2.11.0
|
From @tonycoz5.26-0006-perl-132227-restart-a-node-if-we-change-to-uni-rules.patchFrom 6041b59a21326bd9c2d58034943f07233d5ac1ec Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Fri, 2 Feb 2018 15:14:27 -0700
Subject: (perl #132227) restart a node if we change to uni rules within the
node and encounter a sharp S
This could lead to a buffer overflow.
---
regcomp.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/regcomp.c b/regcomp.c
index b2de7f05e1..943029b154 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13538,6 +13538,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* /u. This includes the multi-char fold SHARP S to
* 'ss' */
if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+
+ /* If the node started out having uni rules, we
+ * wouldn't have gotten here. So this means
+ * something in the middle has changed it, but
+ * didn't think it needed to reparse. But this
+ * sharp s now does indicate the need for
+ * reparsing. */
+ if (RExC_uni_semantics) {
+ p = oldp;
+ goto loopdone;
+ }
+
RExC_seen_unfolded_sharp_s = 1;
maybe_exactfu = FALSE;
}
--
2.11.0
|
From @khwilliamsonOn 02/19/2018 08:31 PM, Tony Cook via RT wrote:
Actually, Tony asked me not to fix it in blead, and so I didn't. I wrote earlier in the thread that the following pattern still triggers $ blead -le panic: reg_node overrun trying to emit 34, 55ba06753904>=55ba067538a4 at |
From @tonycozOn Mon, 19 Feb 2018 20:07:30 -0800, public@khwilliamson.com wrote:
Ok, thanks. Do you have a patch that applies to blead to fix the problem? Tony |
From @khwilliamsonOn 02/19/2018 09:07 PM, Karl Williamson wrote:
Blead patch attached. |
From @khwilliamson0005-perl-132227-restart-node-if-change-to-uni-rules.patchFrom c76970a0a5101dea085fe22ec93fdb1ed95b655e Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Thu, 22 Feb 2018 22:45:19 -0700
Subject: [PATCH 5/5] (perl #132227) restart node if change to uni rules
This is only necessary if within the node we encounter a sharp S.
Otherwise this could lead to a buffer overflow.
---
regcomp.c | 18 ++++++++++++++++++
t/re/pat_advanced.t | 4 ++++
2 files changed, 22 insertions(+)
diff --git a/regcomp.c b/regcomp.c
index 34ac9169f2..a78df8b8cb 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13968,6 +13968,24 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
ender = 's';
added_len = 2;
}
+ else if (RExC_uni_semantics) {
+
+ /* Here, we are supossed to be using Unicode
+ * rules, but this folding node is not. This
+ * happens during pass 1 when the node started
+ * out not under Unicode rules, but a \N{} was
+ * encountered during the processing of it,
+ * causing Unicode rules to be switched into.
+ * Pass 1 continues uninteruppted, as by the
+ * time we get to pass 2, we will know enough
+ * to generate the correct folds. Except in
+ * this one case, we need to restart the node,
+ * because the fold of the sharp s requires 2
+ * characters, and the sizing needs to account
+ * for that. */
+ p = oldp;
+ goto loopdone;
+ }
else {
RExC_seen_unfolded_sharp_s = 1;
maybe_exactfu = FALSE;
diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t
index d90ceeb5bd..d9c4773d2d 100644
--- a/t/re/pat_advanced.t
+++ b/t/re/pat_advanced.t
@@ -2332,6 +2332,10 @@ EOF
unlike("s\N{U+DF}", qr/^\x{00DF}/i, "\"s\\N{U+DF}\", qr/^\\x{00DF}/i");
}
+ { # Bug #132227, caused failed assertion
+ ok(qr/0b\N{U+41}\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF/i, "No seqgfault [perl #132227]");
+ }
+
# User-defined Unicode properties to match above-Unicode code points
sub Is_31_Bit_Super { return "110000\t7FFFFFFF\n" }
sub Is_Portable_Super { return '!utf8::Any' } # Matches beyond 32 bits
--
2.11.0
|
From @xsawyerxThe public disclosure date is set for March 15th. |
From @geeknikIs it okay to disclose this bug now? |
From @tonycozOn Thu, Mar 15, 2018 at 11:38:58PM -0500, Brian C. wrote:
Sorry, no, downstream has delayed disclosure. The actual disclosure date is under discussion AFAIK. Tony |
From @geeknikWe are long past the industry standard of 90 days now. |
From @xsawyerxIndustry standard is also to respect downstream vendors concerns and we On Fri, Mar 16, 2018, 20:26 Brian C. <brian.carpenter@gmail.com> wrote:
|
From @xsawyerxThe disclosure date has been postponed officially to April 14th. On 16 March 2018 at 20:28, Sawyer X <xsawyerx@gmail.com> wrote:
|
@xsawyerx - Status changed from 'open' to 'pending release' |
From @xsawyerxOn Mon, 19 Mar 2018 15:27:19 -0700, xsawyerx@gmail.com wrote:
Moved to public queue. |
From @khwilliamsonBlead commit that fixed this is 2407a17 |
From @khwilliamsonThank you for filing this report. You have helped make Perl better. With the release yesterday of Perl 5.28.0, this and 185 other issues have been Perl 5.28.0 may be downloaded via: If you find that the problem persists, feel free to reopen this ticket. |
@khwilliamson - Status changed from 'pending release' to 'resolved' |
Migrated from rt.perl.org#132227 (status was 'resolved')
Searchable as RT132227$
The text was updated successfully, but these errors were encountered: