-
Notifications
You must be signed in to change notification settings - Fork 560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
5.8 Regression: some regexes with multi-char folds loop #10193
Comments
From @khwilliamsonThis is a bug report for perl from khw@khw-desktop.nonet, "\x{100}\x{00DF}" =~ /\x{100}\N{U+73}+/i both loop. This bug is in 5.10 but not 5.8. I don't know the full Flags: This perlbug was built using Perl 5.11.5 - Fri Feb 19 08:36:52 MST 2010 Site configuration information for perl 5.11.4: Configured by khw at Sat Jan 23 13:32:28 MST 2010. Summary of my perl5 (revision 5 version 11 subversion 4) configuration: Locally applied patches: @INC for perl 5.11.4: Environment for perl 5.11.4: PATH=/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/opt/real/RealPlayer:/home/khw/cxoffice/bin |
From @khwilliamsonkarl williamson (via RT) wrote:
I should have mentioned that this is more likely to happen in 5.11.5 or |
From @obra
Even though this is likely to happen "more" in 5.11.5 than it did in -Jesse |
The RT System itself - Status changed from 'new' to 'open' |
From @khwilliamsonJesse Vincent wrote:
I'm not either. I'm thinking that it should be in the known problems of |
From @obraOn Thu, Feb 25, 2010 at 04:31:33PM -0700, karl williamson wrote:
Sure. Do you have opinions on ideal phrasing? -- |
From @khwilliamsonjesse wrote:
If I could phrase things ideally, I might have been a poet and not a |
From @khwilliamsonkarl williamson (via RT) wrote:
I looked at this enough to see part of what is causing it, and that it The problem I suspect stems from the attempt to shoehorn in multi-char The case fold of FB00 is 'ff'. So /f+/i should match it (\N{U+66} is The problem is that each f matches only half of FB00, but the code In a way this is not a regression from 5.8, because it didn't work Hopefully, Yves' redesign of multi-char folding will fix this whole thing. |
From @khwilliamsonIn 5.8 it doesn't match correctly either, but it doesn't loop forever. |
From @khwilliamson0001-PATCH-perl-72998-regex-looping.patchFrom fbf9950bdadcfe4479e137c54e2546b23a221061 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@khw-desktop.(none)>
Date: Tue, 13 Apr 2010 21:25:36 -0600
Subject: [PATCH] PATCH: [perl #72998] regex looping
If a character folds to multiple ones in case-insensitive matching,
it should not match just one of those, or the regular expression can
loop. This patch doesn't fix the problem of it not matching correctly,
but it simply causes it to fail now, rather than much much later, by
disallowing the case where it thought it matched but did not advance the
pointer to the next character.
Fixing the problem in general of multi-char folding characters is a much
bigger issue, to be addressed later.
---
t/re/re.t | 10 +++++++++-
utf8.c | 3 ++-
2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/t/re/re.t b/t/re/re.t
index 87965f2..aae8120 100644
--- a/t/re/re.t
+++ b/t/re/re.t
@@ -51,6 +51,14 @@ if ('1234'=~/(?:(?<A>\d)|(?<C>!))(?<B>\d)(?<A>\d)(?<B>\d)/){
}
is(regnames_count(),3);
}
+
+ {
+ # Bug #72998; this can loop
+ watchdog(2);
+ eval '"\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i';
+ pass("Didn't loop");
+ }
+
# New tests above this line, don't forget to update the test count below!
-BEGIN { plan tests => 18 }
+BEGIN { plan tests => 19 }
# No tests here!
diff --git a/utf8.c b/utf8.c
index 9ed0663..1a6077c 100644
--- a/utf8.c
+++ b/utf8.c
@@ -2609,7 +2609,8 @@ Perl_ibcmp_utf8(pTHX_ const char *s1, char **pe1, register UV l1, bool u1, const
/* A match is defined by all the scans that specified
* an explicit length reaching their final goals. */
- match = (f1 == 0 || p1 == f1) && (f2 == 0 || p2 == f2);
+ match = (n1 == 0 && n2 == 0 /* Must not match partial char; Bug #72998 */
+ && (f1 == 0 || p1 == f1) && (f2 == 0 || p2 == f2));
if (match) {
if (pe1)
--
1.5.6.3
|
From @khwilliamsonThis is a small revision to the patch I submitted yesterday. The I actually have a more extensive patch prepared that cleans up the whole |
From @khwilliamson0001-PATCH-perl-72998-regex-looping.patchFrom 93eb9a395ae991368a9b7fd3691c13791b7d5cf5 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@khw-desktop.(none)>
Date: Tue, 13 Apr 2010 21:25:36 -0600
Subject: [PATCH] PATCH: [perl #72998] regex looping
If a character folds to multiple ones in case-insensitive matching,
it should not match just one of those, or the regular expression can
loop. For example, \N{LATIN SMALL LIGATURE FF} folds to 'ff', and so
"\N{LATIN SMALL LIGATURE FF}" =~ /f+/i
should match. Prior to this patch, this function returned that there is
a match, but left the matching string pointer at the beginning of the
"\N{LATIN SMALL LIGATURE FF}" because it doesn't make sense to match
just half a character, and at this level it doesn't know about the '+'.
This leaves things in an inconsistent state, with the reporting of a
match, but the input pointer unchanged, the result of which is a loop.
I don't know how to fix this so that it correctly matches, and there are
semantic issues with doing so. For example, if
"\N{LATIN SMALL LIGATURE FF}" =~ /ff/i
matches, then one would think that so should
"\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i
But $1 and $2 don't really make sense here, since they both refer to the
half of the same character.
So this patch just returns failure if only a partial character is
matched. That leaves things consistent, and solves the problem of
looping, so that Perl doesn't hang on such a construct, but leaves the
ultimate solution for another day.
---
t/re/re.t | 10 +++++++++-
utf8.c | 3 ++-
2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/t/re/re.t b/t/re/re.t
index 87965f2..249c6dd 100644
--- a/t/re/re.t
+++ b/t/re/re.t
@@ -51,6 +51,14 @@ if ('1234'=~/(?:(?<A>\d)|(?<C>!))(?<B>\d)(?<A>\d)(?<B>\d)/){
}
is(regnames_count(),3);
}
+
+ { # Keep this test last, as whole script will be interrupted if times out
+ # Bug #72998; this can loop
+ watchdog(2);
+ eval '"\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i';
+ pass("Didn't loop");
+ }
+
# New tests above this line, don't forget to update the test count below!
-BEGIN { plan tests => 18 }
+BEGIN { plan tests => 19 }
# No tests here!
diff --git a/utf8.c b/utf8.c
index 9ed0663..1a6077c 100644
--- a/utf8.c
+++ b/utf8.c
@@ -2609,7 +2609,8 @@ Perl_ibcmp_utf8(pTHX_ const char *s1, char **pe1, register UV l1, bool u1, const
/* A match is defined by all the scans that specified
* an explicit length reaching their final goals. */
- match = (f1 == 0 || p1 == f1) && (f2 == 0 || p2 == f2);
+ match = (n1 == 0 && n2 == 0 /* Must not match partial char; Bug #72998 */
+ && (f1 == 0 || p1 == f1) && (f2 == 0 || p2 == f2));
if (match) {
if (pe1)
--
1.5.6.3
|
From @rgarciaOn 15 April 2010 03:35, karl williamson <public@khwilliamson.com> wrote:
I agree. |
@rgs - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#72998 (status was 'resolved')
Searchable as RT72998$
The text was updated successfully, but these errors were encountered: