-
Notifications
You must be signed in to change notification settings - Fork 560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
regex CURLYs fail on utf8 latin1 chars when pattern not utf8 #10743
Comments
From @khwilliamsonThis is a bug report for perl from khw@khw-desktop.nonet, Neither pattern nor string utf8: Upgrade string to utf8 but leave pattern non-utf8: As above, but force pattern to utf8 by adding a char above latin1 that As above, but remove upgrade so string is not utf8, pattern is: To summarize, when the string is utf8 and the pattern isn't, it fails to Flags: Site configuration information for perl 5.11.4: Configured by khw at Sat Jan 23 13:32:28 MST 2010. Summary of my perl5 (revision 5 version 11 subversion 4) configuration: Locally applied patches: @INC for perl 5.11.4: Environment for perl 5.11.4: PATH=/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/opt/real/RealPlayer:/home/khw/cxoffice/bin |
From @AbigailOn Tue, Feb 16, 2010 at 01:41:02PM -0800, karl williamson wrote:
But only in a combination of capturing and a quantifier: #!/usr/bin/perl use 5.010; use strict; use Test::More; $_ = "\xF7"; ok /\x{00F7}/, "No capture, no quantifier, no anchors"; ok /^\x{00F7}$/, "No capture, no quantifier, anchors"; done_testing; __END__ I get these failures in 5.8.[89] and 5.10.1 as well. |
The RT System itself - Status changed from 'new' to 'open' |
From @khwilliamsonI thought there was already a ticket for this, but couldn't find it. |
From @khwilliamson0001-regexec.c-utf8-strings-don-t-match-non-utf8-self.patchFrom 9ca5f258da774908a47d340499d1bc8e20e6eeb0 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 19 Oct 2010 20:45:16 -0600
Subject: [PATCH] regexec.c: utf8 strings don't match non-utf8 self
Some constructs with characters that are variant under utf8 have a utf8
string not matching it's non-utf8 self. The solution is to test for
this case and change to use both in utf8 while doing the comparison.
---
regexec.c | 45 +++++++++++++++++++++++++++++++++++++++++----
t/re/pat.t | 14 +++++++++++++-
2 files changed, 54 insertions(+), 5 deletions(-)
diff --git a/regexec.c b/regexec.c
index 901703f..a1a1bc2 100644
--- a/regexec.c
+++ b/regexec.c
@@ -5750,10 +5750,47 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth)
case CANY:
scan = loceol;
break;
- case EXACT: /* length of string is 1 */
- c = (U8)*STRING(p);
- while (scan < loceol && UCHARAT(scan) == c)
- scan++;
+ case EXACT:
+ /* To get here, EXACT nodes must have *byte* length == 1. That means
+ * they match only characters in the string that can be expressed as a
+ * single byte. For non-utf8 strings, that means a simple match. For
+ * utf8 strings, the character matched must be an invariant, or
+ * downgradable to a single byte. The pattern's utf8ness is
+ * irrelevant, as it must be a single byte, so either it isn't utf8, or
+ * if it is it's an invariant */
+
+ assert(! UTF_PATTERN || UNI_IS_INVARIANT(c));
+ {
+ U8* pattern = (U8*) STRING(p);
+ U8 c = UCHARAT(pattern);
+ if ((! utf8_target) || UNI_IS_INVARIANT(c)) {
+
+ /* Here, the string isn't utf8, or the character in the EXACT
+ * node is the same in utf8 as not, so can just do equality.
+ * Each matching char must be 1 byte long */
+ while (scan < loceol && UCHARAT(scan) == c) {
+ scan++;
+ }
+ }
+ else {
+
+ /* Here, the string is utf8, and the char to match is different
+ * in utf8 than not. Fastest to find the two utf8 bytes that
+ * represent c, and then look for those in sequence in the utf8
+ * string */
+ U8 high = UTF8_TWO_BYTE_HI(c);
+ U8 low = UTF8_TWO_BYTE_LO(c);
+ loceol = PL_regeol;
+ while (hardcount < max
+ && scan + 1 < loceol
+ && UCHARAT(scan) == high
+ && UCHARAT(scan + 1) == low)
+ {
+ scan += 2;
+ hardcount++;
+ }
+ }
+ }
break;
case EXACTF: /* length of string is 1 */
c = (U8)*STRING(p);
diff --git a/t/re/pat.t b/t/re/pat.t
index c007880..4668104 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -23,7 +23,7 @@ BEGIN {
}
-plan tests => 398; # Update this when adding/deleting tests.
+plan tests => 402; # Update this when adding/deleting tests.
run_tests() unless caller;
@@ -1072,6 +1072,18 @@ sub run_tests {
}
+ { # Some constructs with Latin1 characters cause a utf8 string not to
+ # match itself in non-utf8
+ my $c = "\xc0";
+ my $pattern = my $utf8_pattern = qr/((\xc0)+,?)/;
+ utf8::upgrade($utf8_pattern);
+ ok $c =~ $pattern, "\\xc0 =~ $pattern; Neither pattern nor target utf8";
+ ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; pattern utf8, target not";
+ utf8::upgrade($c);
+ ok $c =~ $pattern, "\\xc0 =~ $pattern; target utf8, pattern not";
+ ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; Both target and pattern utf8";
+ }
+
{
# Test that a regex followed by an operator and/or a statement modifier work
# These tests use string-eval so that it reports a clean error when it fails
--
1.5.6.3
|
From @khwilliamsonkarl williamson (via RT) wrote:
I realized that I had put the assert before the value it was testing was |
From @khwilliamson0001-regexec.c-utf8-strings-don-t-match-non-utf8-self.patchFrom 8701660d127695c1e77d8dea57cb23ef8ef16cd8 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 19 Oct 2010 20:45:16 -0600
Subject: [PATCH] regexec.c: utf8 strings don't match non-utf8 self
Some constructs with characters that are variant under utf8 have a utf8
string not matching it's non-utf8 self. The solution is to test for
this case and change to use both in utf8 while doing the comparison.
---
regexec.c | 40 +++++++++++++++++++++++++++++++++++++---
t/re/pat.t | 14 +++++++++++++-
2 files changed, 50 insertions(+), 4 deletions(-)
diff --git a/regexec.c b/regexec.c
index 901703f..662cf7d 100644
--- a/regexec.c
+++ b/regexec.c
@@ -5750,10 +5750,44 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth)
case CANY:
scan = loceol;
break;
- case EXACT: /* length of string is 1 */
+ case EXACT:
+ /* To get here, EXACT nodes must have *byte* length == 1. That means
+ * they match only characters in the string that can be expressed as a
+ * single byte. For non-utf8 strings, that means a simple match. For
+ * utf8 strings, the character matched must be an invariant, or
+ * downgradable to a single byte. The pattern's utf8ness is
+ * irrelevant, as it must be a single byte, so either it isn't utf8, or
+ * if it is it's an invariant */
+
c = (U8)*STRING(p);
- while (scan < loceol && UCHARAT(scan) == c)
- scan++;
+ assert(! UTF_PATTERN || UNI_IS_INVARIANT(c));
+ if ((! utf8_target) || UNI_IS_INVARIANT(c)) {
+
+ /* Here, the string isn't utf8, or the character in the EXACT
+ * node is the same in utf8 as not, so can just do equality.
+ * Each matching char must be 1 byte long */
+ while (scan < loceol && UCHARAT(scan) == c) {
+ scan++;
+ }
+ }
+ else {
+
+ /* Here, the string is utf8, and the char to match is different
+ * in utf8 than not. Fastest to find the two utf8 bytes that
+ * represent c, and then look for those in sequence in the utf8
+ * string */
+ U8 high = UTF8_TWO_BYTE_HI(c);
+ U8 low = UTF8_TWO_BYTE_LO(c);
+ loceol = PL_regeol;
+ while (hardcount < max
+ && scan + 1 < loceol
+ && UCHARAT(scan) == high
+ && UCHARAT(scan + 1) == low)
+ {
+ scan += 2;
+ hardcount++;
+ }
+ }
break;
case EXACTF: /* length of string is 1 */
c = (U8)*STRING(p);
diff --git a/t/re/pat.t b/t/re/pat.t
index c007880..4668104 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -23,7 +23,7 @@ BEGIN {
}
-plan tests => 398; # Update this when adding/deleting tests.
+plan tests => 402; # Update this when adding/deleting tests.
run_tests() unless caller;
@@ -1072,6 +1072,18 @@ sub run_tests {
}
+ { # Some constructs with Latin1 characters cause a utf8 string not to
+ # match itself in non-utf8
+ my $c = "\xc0";
+ my $pattern = my $utf8_pattern = qr/((\xc0)+,?)/;
+ utf8::upgrade($utf8_pattern);
+ ok $c =~ $pattern, "\\xc0 =~ $pattern; Neither pattern nor target utf8";
+ ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; pattern utf8, target not";
+ utf8::upgrade($c);
+ ok $c =~ $pattern, "\\xc0 =~ $pattern; target utf8, pattern not";
+ ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; Both target and pattern utf8";
+ }
+
{
# Test that a regex followed by an operator and/or a statement modifier work
# These tests use string-eval so that it reports a clean error when it fails
--
1.5.6.3
|
From @khwilliamsonkarl williamson (via RT) wrote:
This slightly revises the first patch to remove an unnecessary block and |
From @khwilliamson0001-Subject-regexec.c-utf8-doesn-t-match-non-utf8-self.patchFrom 3c117e8450fa50df5f635aaa4b20fa3eda8dc66e Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Wed, 20 Oct 2010 10:20:29 -0600
Subject: [PATCH] Subject: regexec.c: utf8 doesn't match non-utf8 self
Some regex patterns don't match a character with itself when the target
string is in utf8 and the pattern isn't, and the character is variant
under utf8. (This means only Latin1-range characters in the pattern are
affected.)
The solution is to test for this case and use the utf8 representation of
the pattern character for the comparison.
---
regexec.c | 40 +++++++++++++++++++++++++++++++++++++---
t/re/pat.t | 14 +++++++++++++-
2 files changed, 50 insertions(+), 4 deletions(-)
diff --git a/regexec.c b/regexec.c
index 901703f..f87c2fa 100644
--- a/regexec.c
+++ b/regexec.c
@@ -5750,10 +5750,44 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth)
case CANY:
scan = loceol;
break;
- case EXACT: /* length of string is 1 */
+ case EXACT:
+ /* To get here, EXACT nodes must have *byte* length == 1. That means
+ * they match only characters in the string that can be expressed as a
+ * single byte. For non-utf8 strings, that means a simple match. For
+ * utf8 strings, the character matched must be an invariant, or
+ * downgradable to a single byte. The pattern's utf8ness is
+ * irrelevant, as it must be a single byte, so either it isn't utf8, or
+ * if it is it's an invariant */
+
c = (U8)*STRING(p);
- while (scan < loceol && UCHARAT(scan) == c)
- scan++;
+ assert(! UTF_PATTERN || UNI_IS_INVARIANT(c));
+ if ((! utf8_target) || UNI_IS_INVARIANT(c)) {
+
+ /* Here, the string isn't utf8, or the character in the EXACT
+ * node is the same in utf8 as not, so can just do equality.
+ * Each matching char must be 1 byte long */
+ while (scan < loceol && UCHARAT(scan) == c) {
+ scan++;
+ }
+ }
+ else {
+
+ /* Here, the string is utf8, and the char to match is different
+ * in utf8 than not. Fastest to find the two utf8 bytes that
+ * represent c, and then look for those in sequence in the utf8
+ * string */
+ U8 high = UTF8_TWO_BYTE_HI(c);
+ U8 low = UTF8_TWO_BYTE_LO(c);
+ loceol = PL_regeol;
+ while (hardcount < max
+ && scan + 1 < loceol
+ && UCHARAT(scan) == high
+ && UCHARAT(scan + 1) == low)
+ {
+ scan += 2;
+ hardcount++;
+ }
+ }
break;
case EXACTF: /* length of string is 1 */
c = (U8)*STRING(p);
diff --git a/t/re/pat.t b/t/re/pat.t
index c007880..4668104 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -23,7 +23,7 @@ BEGIN {
}
-plan tests => 398; # Update this when adding/deleting tests.
+plan tests => 402; # Update this when adding/deleting tests.
run_tests() unless caller;
@@ -1072,6 +1072,18 @@ sub run_tests {
}
+ { # Some constructs with Latin1 characters cause a utf8 string not to
+ # match itself in non-utf8
+ my $c = "\xc0";
+ my $pattern = my $utf8_pattern = qr/((\xc0)+,?)/;
+ utf8::upgrade($utf8_pattern);
+ ok $c =~ $pattern, "\\xc0 =~ $pattern; Neither pattern nor target utf8";
+ ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; pattern utf8, target not";
+ utf8::upgrade($c);
+ ok $c =~ $pattern, "\\xc0 =~ $pattern; target utf8, pattern not";
+ ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; Both target and pattern utf8";
+ }
+
{
# Test that a regex followed by an operator and/or a statement modifier work
# These tests use string-eval so that it reports a clean error when it fails
--
1.5.6.3
|
From @khwilliamson0002-regexec.c-utf8-doesn-t-match-i-nonutf8-self.patchFrom 36a48cf433c58efc5d7c81e082c42d4c21423fa6 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Wed, 20 Oct 2010 11:11:13 -0600
Subject: [PATCH] regexec.c: utf8 doesn't match /i nonutf8 self
This is a continuation of [perl #78464]. It fixes it also for the /i
flag. After this, a character should match itself in the regrepeat
function, even if one is in utf8 and the other isn't, for both /i and
not.
The solution is to move the code for handling /i into the non-i
structure so that the decisions about utf8 are all in one place. When
the string is in utf8, it uses the utf8-fold function.
This has the added effect of fixing a few cases where a utf8 string did
not match a fold in a non-utf8 pattern. I haven't added tests for
these, as it only fixes a few cases where this is a problem, and I'm
working on a comprehensive solution to the problem, accompanied by
extensive tests.
---
regexec.c | 109 +++++++++++++++++++++++++++++++++++++++++++----------------
t/re/pat.t | 4 ++
2 files changed, 83 insertions(+), 30 deletions(-)
diff --git a/regexec.c b/regexec.c
index f87c2fa..842afaf 100644
--- a/regexec.c
+++ b/regexec.c
@@ -5750,8 +5750,12 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth)
case CANY:
scan = loceol;
break;
+ case EXACTFL:
+ PL_reg_flags |= RF_tainted;
+ /* FALL THROUGH */
case EXACT:
- /* To get here, EXACT nodes must have *byte* length == 1. That means
+ case EXACTF:
+ /* To get here, EXACTish nodes must have *byte* length == 1. That means
* they match only characters in the string that can be expressed as a
* single byte. For non-utf8 strings, that means a simple match. For
* utf8 strings, the character matched must be an invariant, or
@@ -5761,47 +5765,92 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth)
c = (U8)*STRING(p);
assert(! UTF_PATTERN || UNI_IS_INVARIANT(c));
+
if ((! utf8_target) || UNI_IS_INVARIANT(c)) {
/* Here, the string isn't utf8, or the character in the EXACT
* node is the same in utf8 as not, so can just do equality.
* Each matching char must be 1 byte long */
- while (scan < loceol && UCHARAT(scan) == c) {
- scan++;
+ switch (OP(p)) {
+ case EXACT:
+ while (scan < loceol && UCHARAT(scan) == c) {
+ scan++;
+ }
+ break;
+ case EXACTF:
+ while (scan < loceol &&
+ (UCHARAT(scan) == c || UCHARAT(scan) == PL_fold[c]))
+ {
+ scan++;
+ }
+ break;
+ case EXACTFL:
+ while (scan < loceol &&
+ (UCHARAT(scan) == c || UCHARAT(scan) == PL_fold_locale[c]))
+ {
+ scan++;
+ }
+ break;
+ default:
+ Perl_croak(aTHX_ "panic: Unexpected op %u", OP(p));
}
}
else {
- /* Here, the string is utf8, and the char to match is different
- * in utf8 than not. Fastest to find the two utf8 bytes that
- * represent c, and then look for those in sequence in the utf8
- * string */
- U8 high = UTF8_TWO_BYTE_HI(c);
- U8 low = UTF8_TWO_BYTE_LO(c);
- loceol = PL_regeol;
- while (hardcount < max
- && scan + 1 < loceol
- && UCHARAT(scan) == high
- && UCHARAT(scan + 1) == low)
- {
- scan += 2;
- hardcount++;
+ /* Here, the string is utf8, and the pattern char is different
+ * in utf8 than not. */
+
+ switch (OP(p)) {
+ case EXACT:
+ {
+ /* Fastest to find the two utf8 bytes that represent c, and
+ * then look for those in sequence in the utf8 string */
+ U8 high = UTF8_TWO_BYTE_HI(c);
+ U8 low = UTF8_TWO_BYTE_LO(c);
+ loceol = PL_regeol;
+
+ while (hardcount < max
+ && scan + 1 < loceol
+ && UCHARAT(scan) == high
+ && UCHARAT(scan + 1) == low)
+ {
+ scan += 2;
+ hardcount++;
+ }
+ }
+ break;
+ case EXACTFL: /* Doesn't really make sense, but is best we can
+ do. The documents warn against mixing locale
+ and utf8 */
+ case EXACTF:
+ { /* utf8 string, so use utf8 foldEQ */
+ char *tmpeol = loceol;
+ while (hardcount < max
+ && foldEQ_utf8(scan, &tmpeol, 0, utf8_target,
+ STRING(p), NULL, 1, UTF_PATTERN))
+ {
+ scan = tmpeol;
+ tmpeol = loceol;
+ hardcount++;
+ }
+
+ /* XXX Note that the above handles properly the German
+ * sharp ss in the pattern matching ss in the string. But
+ * it doesn't handle properly cases where the string
+ * contains say 'LIGATURE ff' and the pattern is 'f+'.
+ * This would require, say, a new function or revised
+ * interface to foldEQ_utf8(), in which the maximum number
+ * of characters to match could be passed and it would
+ * return how many actually did. This is just one of many
+ * cases where multi-char folds don't work properly, and so
+ * the fix is being deferred */
+ }
+ break;
+ default:
+ Perl_croak(aTHX_ "panic: Unexpected op %u", OP(p));
}
}
break;
- case EXACTF: /* length of string is 1 */
- c = (U8)*STRING(p);
- while (scan < loceol &&
- (UCHARAT(scan) == c || UCHARAT(scan) == PL_fold[c]))
- scan++;
- break;
- case EXACTFL: /* length of string is 1 */
- PL_reg_flags |= RF_tainted;
- c = (U8)*STRING(p);
- while (scan < loceol &&
- (UCHARAT(scan) == c || UCHARAT(scan) == PL_fold_locale[c]))
- scan++;
- break;
case ANYOF:
if (utf8_target) {
loceol = PL_regeol;
diff --git a/t/re/pat.t b/t/re/pat.t
index 4668104..d4bbbb8 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -1078,10 +1078,14 @@ sub run_tests {
my $pattern = my $utf8_pattern = qr/((\xc0)+,?)/;
utf8::upgrade($utf8_pattern);
ok $c =~ $pattern, "\\xc0 =~ $pattern; Neither pattern nor target utf8";
+ ok $c =~ /$pattern/i, "\\xc0 =~ /$pattern/i; Neither pattern nor target utf8";
ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; pattern utf8, target not";
+ ok $c =~ /$utf8_pattern/i, "\\xc0 =~ /$pattern/i; pattern utf8, target not";
utf8::upgrade($c);
ok $c =~ $pattern, "\\xc0 =~ $pattern; target utf8, pattern not";
+ ok $c =~ /$pattern/i, "\\xc0 =~ /$pattern/i; target utf8, pattern not";
ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; Both target and pattern utf8";
+ ok $c =~ /$utf8_pattern/i, "\\xc0 =~ /$pattern/i; Both target and pattern utf8";
}
{
--
1.5.6.3
|
From @cpansproutOn Wed Oct 20 10:38:45 2010, public@khwilliamson.com wrote:
|
The RT System itself - Status changed from 'new' to 'open' |
@cpansprout - Status changed from 'open' to 'resolved' |
From @demerphqis this applied? seems reasonable On 20 October 2010 19:38, karl williamson <public@khwilliamson.com> wrote:
-- |
Migrated from rt.perl.org#78464 (status was 'resolved')
Searchable as RT78464$
The text was updated successfully, but these errors were encountered: