Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex CURLYs fail on utf8 latin1 chars when pattern not utf8 #10743

Closed
p5pRT opened this issue Oct 20, 2010 · 14 comments
Closed

regex CURLYs fail on utf8 latin1 chars when pattern not utf8 #10743

p5pRT opened this issue Oct 20, 2010 · 14 comments

Comments

@p5pRT
Copy link
Collaborator

@p5pRT p5pRT commented Oct 20, 2010

Migrated from rt.perl.org#78464 (status was 'resolved')

Searchable as RT78464$

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 16, 2010

From @khwilliamson

This is a bug report for perl from khw@​khw-desktop.nonet,
generated with the help of perlbug 1.39 running under perl 5.11.4.


Neither pattern nor string utf8​:
./perl -Ilib -E 'my $c = "\xF7"; say 0 + $c =~ /((\x{00F7})+,?)/'
1

Upgrade string to utf8 but leave pattern non-utf8​:
./perl -Ilib -E 'my $c = "\xF7"; utf8​::upgrade($c); say 0 + $c =~
/((\x{00F7})+,?)/'
0

As above, but force pattern to utf8 by adding a char above latin1 that
can match 0 times​:
./perl -Ilib -E 'my $c = "\xF7"; utf8​::upgrade($c); say 0 + $c =~
/((\x{00F7})+\x{100}?)/'
1

As above, but remove upgrade so string is not utf8, pattern is​:
./perl -Ilib -E 'my $c = "\xF7"; say 0 + $c =~ /((\x{00F7})+\x{100}?)/'
1

To summarize, when the string is utf8 and the pattern isn't, it fails to
match itself.



Flags​:
  category=core
  severity=medium


Site configuration information for perl 5.11.4​:

Configured by khw at Sat Jan 23 13​:32​:28 MST 2010.

Summary of my perl5 (revision 5 version 11 subversion 4) configuration​:
  Commit id​: d0a7635
  Platform​:
  osname=linux, osvers=2.6.27-16-generic, archname=i686-linux
  uname='linux khw-desktop 2.6.27-16-generic #1 smp tue dec 1
17​:56​:54 utc 2009 i686 gnulinux '
  config_args='-s -d -Dprefix=/home/khw/blead -Dusedevel
-D'optimize=-g3' -A'optimize=-g3' -A'optimize=-O0''
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=undef, usemultiplicity=undef
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe
-fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64',
  optimize='-O0 -g3',
  cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector
-I/usr/local/include'
  ccversion='', gccversion='4.3.2', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /lib /usr/lib
  libs=-lnsl -ldl -lm -lcrypt -lutil -lc
  perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
  libc=/lib/libc-2.8.90.so, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version='2.8.90'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -g3 -g3 -O0 -L/usr/local/lib
-fstack-protector'

Locally applied patches​:


@​INC for perl 5.11.4​:
  /home/khw/blead/lib/perl5/site_perl/5.11.4/i686-linux
  /home/khw/blead/lib/perl5/site_perl/5.11.4
  /home/khw/blead/lib/perl5/5.11.4/i686-linux
  /home/khw/blead/lib/perl5/5.11.4
  /home/khw/blead/lib/perl5/site_perl
  .


Environment for perl 5.11.4​:
  HOME=/home/khw
  LANG=en_US.UTF-8
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)

PATH=/home/khw/bin​:/home/khw/print/bin​:/bin​:/usr/local/sbin​:/usr/local/bin​:/usr/sbin​:/usr/bin​:/sbin​:/usr/games​:/opt/real/RealPlayer​:/home/khw/cxoffice/bin
  PERL_BADLANG (unset)
  SHELL=/bin/ksh

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 17, 2010

From @Abigail

On Tue, Feb 16, 2010 at 01​:41​:02PM -0800, karl williamson wrote​:

# New Ticket Created by karl williamson
# Please include the string​: [perl #72888]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=72888 >

This is a bug report for perl from khw@​khw-desktop.nonet,
generated with the help of perlbug 1.39 running under perl 5.11.4.

-----------------------------------------------------------------
Neither pattern nor string utf8​:
./perl -Ilib -E 'my $c = "\xF7"; say 0 + $c =~ /((\x{00F7})+,?)/'
1

Upgrade string to utf8 but leave pattern non-utf8​:
./perl -Ilib -E 'my $c = "\xF7"; utf8​::upgrade($c); say 0 + $c =~
/((\x{00F7})+,?)/'
0

As above, but force pattern to utf8 by adding a char above latin1 that
can match 0 times​:
./perl -Ilib -E 'my $c = "\xF7"; utf8​::upgrade($c); say 0 + $c =~
/((\x{00F7})+\x{100}?)/'
1

As above, but remove upgrade so string is not utf8, pattern is​:
./perl -Ilib -E 'my $c = "\xF7"; say 0 + $c =~ /((\x{00F7})+\x{100}?)/'
1

To summarize, when the string is utf8 and the pattern isn't, it fails to
match itself.

But only in a combination of capturing and a quantifier​:

  #!/usr/bin/perl

  use 5.010;

  use strict;
  use warnings;
  no warnings 'syntax';

  use Test​::More;

  $_ = "\xF7";
  utf8​::upgrade ($_);

  ok /\x{00F7}/, "No capture, no quantifier, no anchors";
  ok /(\x{00F7})/, "Capture, no quantifier, no anchors";
  ok /\x{00F7}+/, "No capture, + quantifier, no anchors";
  ok /(\x{00F7})+/, "Capture, + quantifier, no anchors";

  ok /^\x{00F7}$/, "No capture, no quantifier, anchors";
  ok /^(\x{00F7})$/, "Capture, no quantifier, anchors";
  ok /^\x{00F7}+$/, "No capture, + quantifier, anchors";
  ok /^\x{00F7}*$/, "No capture, * quantifier, anchors";
  ok /^\x{00F7}?$/, "No capture, ? quantifier, anchors";
  ok /^(\x{00F7})+$/, "Capture, + quantifier, anchors";
  ok /^(\x{00F7})*$/, "Capture, * quantifier, anchors";
  ok /^(\x{00F7})?$/, "Capture, ? quantifier, anchors";

  done_testing;

  __END__
  ok 1 - No capture, no quantifier, no anchors
  ok 2 - Capture, no quantifier, no anchors
  ok 3 - No capture, + quantifier, no anchors
  not ok 4 - Capture, + quantifier, no anchors
  # Failed test 'Capture, + quantifier, no anchors'
  # at /tmp/eep line 18.
  ok 5 - No capture, no quantifier, anchors
  ok 6 - Capture, no quantifier, anchors
  ok 7 - No capture, + quantifier, anchors
  ok 8 - No capture, * quantifier, anchors
  ok 9 - No capture, ? quantifier, anchors
  not ok 10 - Capture, + quantifier, anchors
  # Failed test 'Capture, + quantifier, anchors'
  # at /tmp/eep line 25.
  not ok 11 - Capture, * quantifier, anchors
  # Failed test 'Capture, * quantifier, anchors'
  # at /tmp/eep line 26.
  not ok 12 - Capture, ? quantifier, anchors
  # Failed test 'Capture, ? quantifier, anchors'
  # at /tmp/eep line 27.
  1..12
  # Looks like you failed 4 tests of 12.

I get these failures in 5.8.[89] and 5.10.1 as well.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 17, 2010

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 20, 2010

From @khwilliamson

I thought there was already a ticket for this, but couldn't find it.
Patch attached.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 20, 2010

From @khwilliamson

0001-regexec.c-utf8-strings-don-t-match-non-utf8-self.patch
From 9ca5f258da774908a47d340499d1bc8e20e6eeb0 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 19 Oct 2010 20:45:16 -0600
Subject: [PATCH] regexec.c: utf8 strings don't match non-utf8 self

Some constructs with characters that are variant under utf8 have a utf8
string not matching it's non-utf8 self.  The solution is to test for
this case and change to use both in utf8 while doing the comparison.
---
 regexec.c  |   45 +++++++++++++++++++++++++++++++++++++++++----
 t/re/pat.t |   14 +++++++++++++-
 2 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/regexec.c b/regexec.c
index 901703f..a1a1bc2 100644
--- a/regexec.c
+++ b/regexec.c
@@ -5750,10 +5750,47 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth)
     case CANY:
 	scan = loceol;
 	break;
-    case EXACT:		/* length of string is 1 */
-	c = (U8)*STRING(p);
-	while (scan < loceol && UCHARAT(scan) == c)
-	    scan++;
+    case EXACT:
+	/* To get here, EXACT nodes must have *byte* length == 1.  That means
+	 * they match only characters in the string that can be expressed as a
+	 * single byte.  For non-utf8 strings, that means a simple match.  For
+	 * utf8 strings, the character matched must be an invariant, or
+	 * downgradable to a single byte.  The pattern's utf8ness is
+	 * irrelevant, as it must be a single byte, so either it isn't utf8, or
+	 * if it is it's an invariant */
+
+	assert(! UTF_PATTERN || UNI_IS_INVARIANT(c));
+	{
+	    U8* pattern = (U8*) STRING(p);
+	    U8 c = UCHARAT(pattern);
+	    if ((! utf8_target) || UNI_IS_INVARIANT(c)) {
+
+		/* Here, the string isn't utf8, or the character in the EXACT
+		 * node is the same in utf8 as not, so can just do equality.
+		 * Each matching char must be 1 byte long */
+		while (scan < loceol && UCHARAT(scan) == c) {
+		    scan++;
+		}
+	    }
+	    else {
+
+		/* Here, the string is utf8, and the char to match is different
+		 * in utf8 than not.  Fastest to find the two utf8 bytes that
+		 * represent c, and then look for those in sequence in the utf8
+		 * string */
+		U8 high = UTF8_TWO_BYTE_HI(c);
+		U8 low = UTF8_TWO_BYTE_LO(c);
+		loceol = PL_regeol;
+		while (hardcount < max
+		       && scan + 1 < loceol
+		       && UCHARAT(scan) == high
+		       && UCHARAT(scan + 1) == low)
+		{
+		    scan += 2;
+		    hardcount++;
+		}
+	    }
+	}
 	break;
     case EXACTF:	/* length of string is 1 */
 	c = (U8)*STRING(p);
diff --git a/t/re/pat.t b/t/re/pat.t
index c007880..4668104 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -23,7 +23,7 @@ BEGIN {
 }
 
 
-plan tests => 398;  # Update this when adding/deleting tests.
+plan tests => 402;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -1072,6 +1072,18 @@ sub run_tests {
 
     }
 
+    {   # Some constructs with Latin1 characters cause a utf8 string not to
+        # match itself in non-utf8
+        my $c = "\xc0";
+        my $pattern = my $utf8_pattern = qr/((\xc0)+,?)/;
+        utf8::upgrade($utf8_pattern);
+        ok $c =~ $pattern, "\\xc0 =~ $pattern; Neither pattern nor target utf8";
+        ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; pattern utf8, target not";
+        utf8::upgrade($c);
+        ok $c =~ $pattern, "\\xc0 =~ $pattern; target utf8, pattern not";
+        ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; Both target and pattern utf8";
+    }
+
     {
         # Test that a regex followed by an operator and/or a statement modifier work
         # These tests use string-eval so that it reports a clean error when it fails
-- 
1.5.6.3

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 20, 2010

From @khwilliamson

karl williamson (via RT) wrote​:

# New Ticket Created by karl williamson
# Please include the string​: [perl #78464]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=78464 >

I thought there was already a ticket for this, but couldn't find it.
Patch attached.

I realized that I had put the assert before the value it was testing was
set, so it was only chance that it passed the test suite. Here is a
slightly revised patch

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 20, 2010

From @khwilliamson

0001-regexec.c-utf8-strings-don-t-match-non-utf8-self.patch
From 8701660d127695c1e77d8dea57cb23ef8ef16cd8 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 19 Oct 2010 20:45:16 -0600
Subject: [PATCH] regexec.c: utf8 strings don't match non-utf8 self

Some constructs with characters that are variant under utf8 have a utf8
string not matching it's non-utf8 self.  The solution is to test for
this case and change to use both in utf8 while doing the comparison.
---
 regexec.c  |   40 +++++++++++++++++++++++++++++++++++++---
 t/re/pat.t |   14 +++++++++++++-
 2 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/regexec.c b/regexec.c
index 901703f..662cf7d 100644
--- a/regexec.c
+++ b/regexec.c
@@ -5750,10 +5750,44 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth)
     case CANY:
 	scan = loceol;
 	break;
-    case EXACT:		/* length of string is 1 */
+    case EXACT:
+	/* To get here, EXACT nodes must have *byte* length == 1.  That means
+	 * they match only characters in the string that can be expressed as a
+	 * single byte.  For non-utf8 strings, that means a simple match.  For
+	 * utf8 strings, the character matched must be an invariant, or
+	 * downgradable to a single byte.  The pattern's utf8ness is
+	 * irrelevant, as it must be a single byte, so either it isn't utf8, or
+	 * if it is it's an invariant */
+
 	c = (U8)*STRING(p);
-	while (scan < loceol && UCHARAT(scan) == c)
-	    scan++;
+	assert(! UTF_PATTERN || UNI_IS_INVARIANT(c));
+	if ((! utf8_target) || UNI_IS_INVARIANT(c)) {
+
+	    /* Here, the string isn't utf8, or the character in the EXACT
+		* node is the same in utf8 as not, so can just do equality.
+		* Each matching char must be 1 byte long */
+	    while (scan < loceol && UCHARAT(scan) == c) {
+		scan++;
+	    }
+	}
+	else {
+
+	    /* Here, the string is utf8, and the char to match is different
+		* in utf8 than not.  Fastest to find the two utf8 bytes that
+		* represent c, and then look for those in sequence in the utf8
+		* string */
+	    U8 high = UTF8_TWO_BYTE_HI(c);
+	    U8 low = UTF8_TWO_BYTE_LO(c);
+	    loceol = PL_regeol;
+	    while (hardcount < max
+		    && scan + 1 < loceol
+		    && UCHARAT(scan) == high
+		    && UCHARAT(scan + 1) == low)
+	    {
+		scan += 2;
+		hardcount++;
+	    }
+	}
 	break;
     case EXACTF:	/* length of string is 1 */
 	c = (U8)*STRING(p);
diff --git a/t/re/pat.t b/t/re/pat.t
index c007880..4668104 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -23,7 +23,7 @@ BEGIN {
 }
 
 
-plan tests => 398;  # Update this when adding/deleting tests.
+plan tests => 402;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -1072,6 +1072,18 @@ sub run_tests {
 
     }
 
+    {   # Some constructs with Latin1 characters cause a utf8 string not to
+        # match itself in non-utf8
+        my $c = "\xc0";
+        my $pattern = my $utf8_pattern = qr/((\xc0)+,?)/;
+        utf8::upgrade($utf8_pattern);
+        ok $c =~ $pattern, "\\xc0 =~ $pattern; Neither pattern nor target utf8";
+        ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; pattern utf8, target not";
+        utf8::upgrade($c);
+        ok $c =~ $pattern, "\\xc0 =~ $pattern; target utf8, pattern not";
+        ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; Both target and pattern utf8";
+    }
+
     {
         # Test that a regex followed by an operator and/or a statement modifier work
         # These tests use string-eval so that it reports a clean error when it fails
-- 
1.5.6.3

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 20, 2010

From @khwilliamson

karl williamson (via RT) wrote​:

# New Ticket Created by karl williamson
# Please include the string​: [perl #78464]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=78464 >

I thought there was already a ticket for this, but couldn't find it.
Patch attached.

This slightly revises the first patch to remove an unnecessary block and
  consequent indent. The second patch extends the fix to the /i flag.
With both patches, the affected cases will have a character match its
variant self with and without /i

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 20, 2010

From @khwilliamson

0001-Subject-regexec.c-utf8-doesn-t-match-non-utf8-self.patch
From 3c117e8450fa50df5f635aaa4b20fa3eda8dc66e Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Wed, 20 Oct 2010 10:20:29 -0600
Subject: [PATCH] Subject: regexec.c: utf8 doesn't match non-utf8 self

Some regex patterns don't match a character with itself when the target
string is in utf8 and the pattern isn't, and the character is variant
under utf8.  (This means only Latin1-range characters in the pattern are
affected.)

The solution is to test for this case and use the utf8 representation of
the pattern character for the comparison.
---
 regexec.c  |   40 +++++++++++++++++++++++++++++++++++++---
 t/re/pat.t |   14 +++++++++++++-
 2 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/regexec.c b/regexec.c
index 901703f..f87c2fa 100644
--- a/regexec.c
+++ b/regexec.c
@@ -5750,10 +5750,44 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth)
     case CANY:
 	scan = loceol;
 	break;
-    case EXACT:		/* length of string is 1 */
+    case EXACT:
+	/* To get here, EXACT nodes must have *byte* length == 1.  That means
+	 * they match only characters in the string that can be expressed as a
+	 * single byte.  For non-utf8 strings, that means a simple match.  For
+	 * utf8 strings, the character matched must be an invariant, or
+	 * downgradable to a single byte.  The pattern's utf8ness is
+	 * irrelevant, as it must be a single byte, so either it isn't utf8, or
+	 * if it is it's an invariant */
+
 	c = (U8)*STRING(p);
-	while (scan < loceol && UCHARAT(scan) == c)
-	    scan++;
+	assert(! UTF_PATTERN || UNI_IS_INVARIANT(c));
+	if ((! utf8_target) || UNI_IS_INVARIANT(c)) {
+
+	    /* Here, the string isn't utf8, or the character in the EXACT
+	     * node is the same in utf8 as not, so can just do equality.
+	     * Each matching char must be 1 byte long */
+	    while (scan < loceol && UCHARAT(scan) == c) {
+		scan++;
+	    }
+	}
+	else {
+
+	    /* Here, the string is utf8, and the char to match is different
+	     * in utf8 than not.  Fastest to find the two utf8 bytes that
+	     * represent c, and then look for those in sequence in the utf8
+	     * string */
+	    U8 high = UTF8_TWO_BYTE_HI(c);
+	    U8 low = UTF8_TWO_BYTE_LO(c);
+	    loceol = PL_regeol;
+	    while (hardcount < max
+		   && scan + 1 < loceol
+		   && UCHARAT(scan) == high
+		   && UCHARAT(scan + 1) == low)
+	    {
+		scan += 2;
+		hardcount++;
+	    }
+	}
 	break;
     case EXACTF:	/* length of string is 1 */
 	c = (U8)*STRING(p);
diff --git a/t/re/pat.t b/t/re/pat.t
index c007880..4668104 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -23,7 +23,7 @@ BEGIN {
 }
 
 
-plan tests => 398;  # Update this when adding/deleting tests.
+plan tests => 402;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -1072,6 +1072,18 @@ sub run_tests {
 
     }
 
+    {   # Some constructs with Latin1 characters cause a utf8 string not to
+        # match itself in non-utf8
+        my $c = "\xc0";
+        my $pattern = my $utf8_pattern = qr/((\xc0)+,?)/;
+        utf8::upgrade($utf8_pattern);
+        ok $c =~ $pattern, "\\xc0 =~ $pattern; Neither pattern nor target utf8";
+        ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; pattern utf8, target not";
+        utf8::upgrade($c);
+        ok $c =~ $pattern, "\\xc0 =~ $pattern; target utf8, pattern not";
+        ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; Both target and pattern utf8";
+    }
+
     {
         # Test that a regex followed by an operator and/or a statement modifier work
         # These tests use string-eval so that it reports a clean error when it fails
-- 
1.5.6.3

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 20, 2010

From @khwilliamson

0002-regexec.c-utf8-doesn-t-match-i-nonutf8-self.patch
From 36a48cf433c58efc5d7c81e082c42d4c21423fa6 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Wed, 20 Oct 2010 11:11:13 -0600
Subject: [PATCH] regexec.c: utf8 doesn't match /i nonutf8 self

This is a continuation of [perl #78464].  It fixes it also for the /i
flag.  After this, a character should match itself in the regrepeat
function, even if one is in utf8 and the other isn't, for both /i and
not.

The solution is to move the code for handling /i into the non-i
structure so that the decisions about utf8 are all in one place.  When
the string is in utf8, it uses the utf8-fold function.

This has the added effect of fixing a few cases where a utf8 string did
not match a fold in a non-utf8 pattern.  I haven't added tests for
these, as it only fixes a few cases where this is a problem, and I'm
working on a comprehensive solution to the problem, accompanied by
extensive tests.
---
 regexec.c  |  109 +++++++++++++++++++++++++++++++++++++++++++----------------
 t/re/pat.t |    4 ++
 2 files changed, 83 insertions(+), 30 deletions(-)

diff --git a/regexec.c b/regexec.c
index f87c2fa..842afaf 100644
--- a/regexec.c
+++ b/regexec.c
@@ -5750,8 +5750,12 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth)
     case CANY:
 	scan = loceol;
 	break;
+    case EXACTFL:
+	PL_reg_flags |= RF_tainted;
+	/* FALL THROUGH */
     case EXACT:
-	/* To get here, EXACT nodes must have *byte* length == 1.  That means
+    case EXACTF:
+	/* To get here, EXACTish nodes must have *byte* length == 1.  That means
 	 * they match only characters in the string that can be expressed as a
 	 * single byte.  For non-utf8 strings, that means a simple match.  For
 	 * utf8 strings, the character matched must be an invariant, or
@@ -5761,47 +5765,92 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth)
 
 	c = (U8)*STRING(p);
 	assert(! UTF_PATTERN || UNI_IS_INVARIANT(c));
+
 	if ((! utf8_target) || UNI_IS_INVARIANT(c)) {
 
 	    /* Here, the string isn't utf8, or the character in the EXACT
 	     * node is the same in utf8 as not, so can just do equality.
 	     * Each matching char must be 1 byte long */
-	    while (scan < loceol && UCHARAT(scan) == c) {
-		scan++;
+	    switch (OP(p)) {
+	    case EXACT:
+		while (scan < loceol && UCHARAT(scan) == c) {
+		    scan++;
+		}
+		break;
+	    case EXACTF:
+		while (scan < loceol &&
+		    (UCHARAT(scan) == c || UCHARAT(scan) == PL_fold[c]))
+		{
+		    scan++;
+		}
+		break;
+	    case EXACTFL:
+		while (scan < loceol &&
+		    (UCHARAT(scan) == c || UCHARAT(scan) == PL_fold_locale[c]))
+		{
+		    scan++;
+		}
+		break;
+	    default:
+		Perl_croak(aTHX_ "panic: Unexpected op %u", OP(p));
 	    }
 	}
 	else {
 
-	    /* Here, the string is utf8, and the char to match is different
-	     * in utf8 than not.  Fastest to find the two utf8 bytes that
-	     * represent c, and then look for those in sequence in the utf8
-	     * string */
-	    U8 high = UTF8_TWO_BYTE_HI(c);
-	    U8 low = UTF8_TWO_BYTE_LO(c);
-	    loceol = PL_regeol;
-	    while (hardcount < max
-		   && scan + 1 < loceol
-		   && UCHARAT(scan) == high
-		   && UCHARAT(scan + 1) == low)
-	    {
-		scan += 2;
-		hardcount++;
+	    /* Here, the string is utf8, and the pattern char is different
+	     * in utf8 than not.  */
+
+	    switch (OP(p)) {
+	    case EXACT:
+		{
+		    /* Fastest to find the two utf8 bytes that represent c, and
+		     * then look for those in sequence in the utf8 string */
+		    U8 high = UTF8_TWO_BYTE_HI(c);
+		    U8 low = UTF8_TWO_BYTE_LO(c);
+		    loceol = PL_regeol;
+
+		    while (hardcount < max
+			   && scan + 1 < loceol
+			   && UCHARAT(scan) == high
+			   && UCHARAT(scan + 1) == low)
+		    {
+			scan += 2;
+			hardcount++;
+		    }
+		}
+		break;
+	    case EXACTFL:   /* Doesn't really make sense, but is best we can
+			       do.  The documents warn against mixing locale
+			       and utf8 */
+	    case EXACTF:
+		{   /* utf8 string, so use utf8 foldEQ */
+		    char *tmpeol = loceol;
+		    while (hardcount < max
+			   && foldEQ_utf8(scan, &tmpeol, 0, utf8_target,
+				          STRING(p), NULL, 1, UTF_PATTERN))
+		    {
+			scan = tmpeol;
+			tmpeol = loceol;
+			hardcount++;
+		    }
+
+		    /* XXX Note that the above handles properly the German
+		     * sharp ss in the pattern matching ss in the string.  But
+		     * it doesn't handle properly cases where the string
+		     * contains say 'LIGATURE ff' and the pattern is 'f+'.
+		     * This would require, say, a new function or revised
+		     * interface to foldEQ_utf8(), in which the maximum number
+		     * of characters to match could be passed and it would
+		     * return how many actually did.  This is just one of many
+		     * cases where multi-char folds don't work properly, and so
+		     * the fix is being deferred */
+		}
+		break;
+	    default:
+		Perl_croak(aTHX_ "panic: Unexpected op %u", OP(p));
 	    }
 	}
 	break;
-    case EXACTF:	/* length of string is 1 */
-	c = (U8)*STRING(p);
-	while (scan < loceol &&
-	       (UCHARAT(scan) == c || UCHARAT(scan) == PL_fold[c]))
-	    scan++;
-	break;
-    case EXACTFL:	/* length of string is 1 */
-	PL_reg_flags |= RF_tainted;
-	c = (U8)*STRING(p);
-	while (scan < loceol &&
-	       (UCHARAT(scan) == c || UCHARAT(scan) == PL_fold_locale[c]))
-	    scan++;
-	break;
     case ANYOF:
 	if (utf8_target) {
 	    loceol = PL_regeol;
diff --git a/t/re/pat.t b/t/re/pat.t
index 4668104..d4bbbb8 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -1078,10 +1078,14 @@ sub run_tests {
         my $pattern = my $utf8_pattern = qr/((\xc0)+,?)/;
         utf8::upgrade($utf8_pattern);
         ok $c =~ $pattern, "\\xc0 =~ $pattern; Neither pattern nor target utf8";
+        ok $c =~ /$pattern/i, "\\xc0 =~ /$pattern/i; Neither pattern nor target utf8";
         ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; pattern utf8, target not";
+        ok $c =~ /$utf8_pattern/i, "\\xc0 =~ /$pattern/i; pattern utf8, target not";
         utf8::upgrade($c);
         ok $c =~ $pattern, "\\xc0 =~ $pattern; target utf8, pattern not";
+        ok $c =~ /$pattern/i, "\\xc0 =~ /$pattern/i; target utf8, pattern not";
         ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; Both target and pattern utf8";
+        ok $c =~ /$utf8_pattern/i, "\\xc0 =~ /$pattern/i; Both target and pattern utf8";
     }
 
     {
-- 
1.5.6.3

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 21, 2010

From @cpansprout

On Wed Oct 20 10​:38​:45 2010, public@​khwilliamson.com wrote​:

karl williamson (via RT) wrote​:

# New Ticket Created by karl williamson
# Please include the string​: [perl #78464]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=78464 >

I thought there was already a ticket for this, but couldn't find it.
Patch attached.

This slightly revises the first patch to remove an unnecessary block and
consequent indent. The second patch extends the fix to the /i flag.
With both patches, the affected cases will have a character match its
variant self with and without /i

Thank you. Applied as 634c83a and
d4e0b82.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 21, 2010

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 21, 2010

@cpansprout - Status changed from 'open' to 'resolved'

@p5pRT p5pRT closed this Oct 21, 2010
@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 28, 2010

From @demerphq

is this applied? seems reasonable

On 20 October 2010 19​:38, karl williamson <public@​khwilliamson.com> wrote​:

karl williamson (via RT) wrote​:

# New Ticket Created by  karl williamson # Please include the string​:
 [perl #78464]
# in the subject line of all future correspondence about this issue. #
<URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=78464 >

I thought there was already a ticket for this, but couldn't find it. Patch
attached.

This slightly revises the first patch to remove an unnecessary block and
 consequent indent.  The second patch extends the fix to the /i flag. With
both patches, the affected cases will have a character match its variant
self with and without /i

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant
You can’t perform that action at this time.