Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

5.8 Regression: some regexes with multi-char folds loop #10193

Closed
p5pRT opened this issue Feb 22, 2010 · 14 comments
Closed

5.8 Regression: some regexes with multi-char folds loop #10193

p5pRT opened this issue Feb 22, 2010 · 14 comments

Comments

@p5pRT
Copy link
Collaborator

@p5pRT p5pRT commented Feb 22, 2010

Migrated from rt.perl.org#72998 (status was 'resolved')

Searchable as RT72998$

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 22, 2010

From @khwilliamson

This is a bug report for perl from khw@​khw-desktop.nonet,
generated with the help of perlbug 1.39 running under perl 5.11.4.


"\x{100}\x{00DF}" =~ /\x{100}\N{U+73}+/i
"\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i

both loop. This bug is in 5.10 but not 5.8. I don't know the full
circumstances, but preliminarily, it appears to affect patterns which
are in utf8 with the fold a sequence of ASCII characters, and when input
with \N. This causes a CURLYM regnode to be generated, instead of a
PLUS node in the examples above. But it fails with * as well.



Flags​:
  category=core
  severity=medium


This perlbug was built using Perl 5.11.5 - Fri Feb 19 08​:36​:52 MST 2010
It is being executed now by Perl 5.11.4 - Sat Jan 23 13​:32​:28 MST 2010.

Site configuration information for perl 5.11.4​:

Configured by khw at Sat Jan 23 13​:32​:28 MST 2010.

Summary of my perl5 (revision 5 version 11 subversion 4) configuration​:
  Commit id​: d0a7635
  Platform​:
  osname=linux, osvers=2.6.27-16-generic, archname=i686-linux
  uname='linux khw-desktop 2.6.27-16-generic #1 smp tue dec 1
17​:56​:54 utc 2009 i686 gnulinux '
  config_args='-s -d -Dprefix=/home/khw/blead -Dusedevel
-D'optimize=-g3' -A'optimize=-g3' -A'optimize=-O0''
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=undef, usemultiplicity=undef
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe
-fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64',
  optimize='-O0 -g3',
  cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector
-I/usr/local/include'
  ccversion='', gccversion='4.3.2', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /lib /usr/lib
  libs=-lnsl -ldl -lm -lcrypt -lutil -lc
  perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
  libc=/lib/libc-2.8.90.so, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version='2.8.90'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -g3 -g3 -O0 -L/usr/local/lib
-fstack-protector'

Locally applied patches​:


@​INC for perl 5.11.4​:
  /home/khw/blead/lib/perl5/site_perl/5.11.4/i686-linux
  /home/khw/blead/lib/perl5/site_perl/5.11.4
  /home/khw/blead/lib/perl5/5.11.4/i686-linux
  /home/khw/blead/lib/perl5/5.11.4
  /home/khw/blead/lib/perl5/site_perl
  .


Environment for perl 5.11.4​:
  HOME=/home/khw
  LANG=en_US.UTF-8
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)

PATH=/home/khw/bin​:/home/khw/print/bin​:/bin​:/usr/local/sbin​:/usr/local/bin​:/usr/sbin​:/usr/bin​:/sbin​:/usr/games​:/opt/real/RealPlayer​:/home/khw/cxoffice/bin
  PERL_BADLANG (unset)
  SHELL=/bin/ksh

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 23, 2010

From @khwilliamson

karl williamson (via RT) wrote​:

"\x{100}\x{00DF}" =~ /\x{100}\N{U+73}+/i
"\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i

both loop. This bug is in 5.10 but not 5.8. I don't know the full
circumstances, but preliminarily, it appears to affect patterns which
are in utf8 with the fold a sequence of ASCII characters, and when input
with \N. This causes a CURLYM regnode to be generated, instead of a
PLUS node in the examples above. But it fails with * as well.
-----------------------------------------------------------------

I should have mentioned that this is more likely to happen in 5.11.5
than before because of the fixes for \N{}. Now this error will happen
for constructs like​:
"\x{FB00}" =~ /\N{U+66}+/
because the \N{U+66} now forces the pattern into utf8;

or
"\X{FB00}" =~ /\N{LATIN SMALL LETTER F}/
because the \N{} actually will compile correctly, unlike before.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 25, 2010

From @obra

"\x{100}\x{00DF}" =~ /\x{100}\N{U+73}+/i
"\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i

both loop. This bug is in 5.10 but not 5.8. I don't know the full
circumstances, but preliminarily, it appears to affect patterns which
are in utf8 with the fold a sequence of ASCII characters, and when input
with \N. This causes a CURLYM regnode to be generated, instead of a
PLUS node in the examples above. But it fails with * as well.

Even though this is likely to happen "more" in 5.11.5 than it did in
5.12, I'm not sure it's serious enough that I'd call it a blocker for
the .0 release.

-Jesse

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 25, 2010

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 25, 2010

From @khwilliamson

Jesse Vincent wrote​:

"\x{100}\x{00DF}" =~ /\x{100}\N{U+73}+/i
"\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i

both loop. This bug is in 5.10 but not 5.8. I don't know the full
circumstances, but preliminarily, it appears to affect patterns which
are in utf8 with the fold a sequence of ASCII characters, and when input
with \N. This causes a CURLYM regnode to be generated, instead of a
PLUS node in the examples above. But it fails with * as well.

Even though this is likely to happen "more" in 5.11.5 than it did in
5.12, I'm not sure it's serious enough that I'd call it a blocker for
the .0 release.

-Jesse

I'm not either. I'm thinking that it should be in the known problems of
the delta. N'est pas?

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 26, 2010

From @obra

On Thu, Feb 25, 2010 at 04​:31​:33PM -0700, karl williamson wrote​:

Jesse Vincent wrote​:

"\x{100}\x{00DF}" =~ /\x{100}\N{U+73}+/i
"\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i

both loop. This bug is in 5.10 but not 5.8. I don't know the full
circumstances, but preliminarily, it appears to affect patterns which
are in utf8 with the fold a sequence of ASCII characters, and when input
with \N. This causes a CURLYM regnode to be generated, instead of a
PLUS node in the examples above. But it fails with * as well.

Even though this is likely to happen "more" in 5.11.5 than it did in
5.12, I'm not sure it's serious enough that I'd call it a blocker for
the .0 release.

-Jesse

I'm not either. I'm thinking that it should be in the known problems of
the delta. N'est pas?

Sure. Do you have opinions on ideal phrasing?

--

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 26, 2010

From @khwilliamson

jesse wrote​:

On Thu, Feb 25, 2010 at 04​:31​:33PM -0700, karl williamson wrote​:

Jesse Vincent wrote​:

"\x{100}\x{00DF}" =~ /\x{100}\N{U+73}+/i
"\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i

both loop. This bug is in 5.10 but not 5.8. I don't know the full
circumstances, but preliminarily, it appears to affect patterns which
are in utf8 with the fold a sequence of ASCII characters, and when input
with \N. This causes a CURLYM regnode to be generated, instead of a
PLUS node in the examples above. But it fails with * as well.
Even though this is likely to happen "more" in 5.11.5 than it did in
5.12, I'm not sure it's serious enough that I'd call it a blocker for
the .0 release.

-Jesse

I'm not either. I'm thinking that it should be in the known problems of
the delta. N'est pas?

Sure. Do you have opinions on ideal phrasing?

If I could phrase things ideally, I might have been a poet and not a
programmer. (But then again that doesn't pay as well :)). I'll submit
a patch as a starting point.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Mar 2, 2010

From @khwilliamson

karl williamson (via RT) wrote​:

This is a bug report for perl from khw@​khw-desktop.nonet,
generated with the help of perlbug 1.39 running under perl 5.11.4.

-----------------------------------------------------------------
"\x{100}\x{00DF}" =~ /\x{100}\N{U+73}+/i
"\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i

both loop. This bug is in 5.10 but not 5.8. I don't know the full
circumstances, but preliminarily, it appears to affect patterns which
are in utf8 with the fold a sequence of ASCII characters, and when input
with \N. This causes a CURLYM regnode to be generated, instead of a
PLUS node in the examples above. But it fails with * as well.
-----------------------------------------------------------------

I looked at this enough to see part of what is causing it, and that it
apparently isn't easy to fix; perhaps it would be to someone more
familiar with regexec.c.

The problem I suspect stems from the attempt to shoehorn in multi-char
folds into code that never dreamed of the possibility.

The case fold of FB00 is 'ff'. So /f+/i should match it (\N{U+66} is
another way of writing 'f').

The problem is that each f matches only half of FB00, but the code
doesn't know how to deal with half a match, so it matches the half, and
then leaves the pointer to FB00 unchanged.

In a way this is not a regression from 5.8, because it didn't work
properly in 5.8 either; it just failed quickly without looping. If you
replace the \N{U+66} with the equivalent \x{66} or just f, it also
doesn't match properly, but fails quickly. The difference is that the
\N{} construct forces it into using CURLYM instead of another type of
CURLY, and CURLYM loops, and the others fail quickly.

Hopefully, Yves' redesign of multi-char folding will fix this whole thing.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 14, 2010

From @khwilliamson

In 5.8 it doesn't match correctly either, but it doesn't loop forever.
This restores that better but not idyllic state.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 14, 2010

From @khwilliamson

0001-PATCH-perl-72998-regex-looping.patch
From fbf9950bdadcfe4479e137c54e2546b23a221061 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@khw-desktop.(none)>
Date: Tue, 13 Apr 2010 21:25:36 -0600
Subject: [PATCH] PATCH: [perl #72998] regex looping

If a character folds to multiple ones in case-insensitive matching,
it should not match just one of those, or the regular expression can
loop.  This patch doesn't fix the problem of it not matching correctly,
but it simply causes it to fail now, rather than much much later, by
disallowing the case where it thought it matched but did not advance the
pointer to the next character.

Fixing the problem in general of multi-char folding characters is a much
bigger issue, to be addressed later.
---
 t/re/re.t |   10 +++++++++-
 utf8.c    |    3 ++-
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/t/re/re.t b/t/re/re.t
index 87965f2..aae8120 100644
--- a/t/re/re.t
+++ b/t/re/re.t
@@ -51,6 +51,14 @@ if ('1234'=~/(?:(?<A>\d)|(?<C>!))(?<B>\d)(?<A>\d)(?<B>\d)/){
     }
     is(regnames_count(),3);
 }
+
+    {
+        # Bug #72998; this can loop 
+        watchdog(2);
+        eval '"\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i';
+        pass("Didn't loop");
+    }
+
 # New tests above this line, don't forget to update the test count below!
-BEGIN { plan tests => 18 }
+BEGIN { plan tests => 19 }
 # No tests here!
diff --git a/utf8.c b/utf8.c
index 9ed0663..1a6077c 100644
--- a/utf8.c
+++ b/utf8.c
@@ -2609,7 +2609,8 @@ Perl_ibcmp_utf8(pTHX_ const char *s1, char **pe1, register UV l1, bool u1, const
 
      /* A match is defined by all the scans that specified
       * an explicit length reaching their final goals. */
-     match = (f1 == 0 || p1 == f1) && (f2 == 0 || p2 == f2);
+     match = (n1 == 0 && n2 == 0    /* Must not match partial char; Bug #72998 */
+	     && (f1 == 0 || p1 == f1) && (f2 == 0 || p2 == f2));
 
      if (match) {
 	  if (pe1)
-- 
1.5.6.3

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 15, 2010

From @khwilliamson

This is a small revision to the patch I submitted yesterday. The
differences are that a comment has been added to the test case telling
that it should be kept last in the file, and the commit message is much
more detailed about the problem this patches.

I actually have a more extensive patch prepared that cleans up the whole
routine to be faster and clearer, but I think this should go in 5.12.1,
as it solves not a Perl crash, but a hung Perl, which is perhaps worse,
and I thought it prudent to submit a minimal patch for the maintenance
release.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 15, 2010

From @khwilliamson

0001-PATCH-perl-72998-regex-looping.patch
From 93eb9a395ae991368a9b7fd3691c13791b7d5cf5 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@khw-desktop.(none)>
Date: Tue, 13 Apr 2010 21:25:36 -0600
Subject: [PATCH] PATCH: [perl #72998] regex looping

If a character folds to multiple ones in case-insensitive matching,
it should not match just one of those, or the regular expression can
loop.  For example, \N{LATIN SMALL LIGATURE FF} folds to 'ff', and so
    "\N{LATIN SMALL LIGATURE FF}" =~ /f+/i
should match.  Prior to this patch, this function returned that there is
a match, but left the matching string  pointer at the beginning of the
"\N{LATIN SMALL LIGATURE FF}" because it doesn't make sense to match
just half a character, and at this level it doesn't know about the '+'.
This leaves things in an inconsistent state, with the reporting of a
match, but the input pointer unchanged, the result of which is a loop.

I don't know how to fix this so that it correctly matches, and there are
semantic issues with doing so.  For example, if
    "\N{LATIN SMALL LIGATURE FF}" =~ /ff/i
matches, then one would think that so should
    "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i
But $1 and $2 don't really make sense here, since they both refer to the
half of the same character.

So this patch just returns failure if only a partial character is
matched.  That leaves things consistent, and solves the problem of
looping, so that Perl doesn't hang on such a construct, but leaves the
ultimate solution for another day.
---
 t/re/re.t |   10 +++++++++-
 utf8.c    |    3 ++-
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/t/re/re.t b/t/re/re.t
index 87965f2..249c6dd 100644
--- a/t/re/re.t
+++ b/t/re/re.t
@@ -51,6 +51,14 @@ if ('1234'=~/(?:(?<A>\d)|(?<C>!))(?<B>\d)(?<A>\d)(?<B>\d)/){
     }
     is(regnames_count(),3);
 }
+
+    { # Keep this test last, as whole script will be interrupted if times out
+        # Bug #72998; this can loop 
+        watchdog(2);
+        eval '"\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i';
+        pass("Didn't loop");
+    }
+
 # New tests above this line, don't forget to update the test count below!
-BEGIN { plan tests => 18 }
+BEGIN { plan tests => 19 }
 # No tests here!
diff --git a/utf8.c b/utf8.c
index 9ed0663..1a6077c 100644
--- a/utf8.c
+++ b/utf8.c
@@ -2609,7 +2609,8 @@ Perl_ibcmp_utf8(pTHX_ const char *s1, char **pe1, register UV l1, bool u1, const
 
      /* A match is defined by all the scans that specified
       * an explicit length reaching their final goals. */
-     match = (f1 == 0 || p1 == f1) && (f2 == 0 || p2 == f2);
+     match = (n1 == 0 && n2 == 0    /* Must not match partial char; Bug #72998 */
+	     && (f1 == 0 || p1 == f1) && (f2 == 0 || p2 == f2));
 
      if (match) {
 	  if (pe1)
-- 
1.5.6.3

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 15, 2010

From @rgarcia

On 15 April 2010 03​:35, karl williamson <public@​khwilliamson.com> wrote​:

This is a small revision to the patch I submitted yesterday.  The
differences are that a comment has been added to the test case telling that
it should be kept last in the file, and the commit message is much more
detailed about the problem this patches.

I actually have a more extensive patch prepared that cleans up the whole
routine to be faster and clearer, but I think this should go in 5.12.1, as
it solves not a Perl crash, but a hung Perl, which is perhaps worse, and I
thought it prudent to submit a minimal patch for the maintenance release.

I agree.
Thanks, applied to bleadperl.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 15, 2010

@rgs - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant
You can’t perform that action at this time.