Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regexes which interpolate list elements incorrectly parsed #16478

Open
p5pRT opened this issue Mar 26, 2018 · 9 comments
Open

Regexes which interpolate list elements incorrectly parsed #16478

p5pRT opened this issue Mar 26, 2018 · 9 comments

Comments

@p5pRT
Copy link

@p5pRT p5pRT commented Mar 26, 2018

Migrated from rt.perl.org#133027 (status was 'open')

Searchable as RT133027$

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Mar 26, 2018

From david@cantrell.org.uk

Created by david@cantrell.org.uk

If you interpolate a list element - ie $x[...] into a regex,
It appears to not always be parsed correctly. I would expect
this code to print "bar" four times. Instead it prints​:

bar
bar
foo
bar

$x[1] = "foo"; $_ = "foo"; s/$x[1]/bar/; print "$_\n";
$x[10] = "foo"; $_ = "foo"; s/$x[10]/bar/; print "$_\n";
$x[100] = "foo"; $_ = "foo"; s/$x[100]/bar/; print "$_\n";
$x[1000] = "foo"; $_ = "foo"; s/$x[1000]/bar/; print "$_\n";

NB I didn't discover this myself. User 'ernix' on stackoverflow
did, but I couldn't see anything for it in rt.perl.org. Here's his
original message​:

https://stackoverflow.com/questions/49492181/regex-substitution-using-a-list-element-with-just-3-digits-index-doesnt-work-as

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.27.11:

Configured by dc at Mon Mar 26 16:20:20 BST 2018.

Summary of my perl5 (revision 5 version 27 subversion 11) configuration:
  Commit id: 811612a11efb1ebc131370e8238d3512779354f8
  Platform:
    osname=linux
    osvers=4.9.0-6-amd64
    archname=x86_64-linux-thread-multi
    uname='linux chimera-dev-vm 4.9.0-6-amd64 #1 smp debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 gnulinux '
    config_args='-de -Duse64bitall -Dusethreads -Dprefix=/home/dc/perl-blead -Dusedevel'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
    bincompat5005=undef
  Compiler:
    cc='cc'
    ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2'
    optimize='-O2'
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='6.3.0 20170516'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags =' -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/6/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
    libs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.24.so
    so=so
    useshrplib=false
    libperl=libperl.a
    gnulibc_version='2.24'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E'
    cccdlflags='-fPIC'
    lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector-strong'



@INC for perl 5.27.11:
    /home/dc/perl-blead/lib/site_perl/5.27.11/x86_64-linux-thread-multi
    /home/dc/perl-blead/lib/site_perl/5.27.11
    /home/dc/perl-blead/lib/5.27.11/x86_64-linux-thread-multi
    /home/dc/perl-blead/lib/5.27.11


Environment for perl 5.27.11:
    HOME=/home/dc
    LANG=en_GB.UTF-8
    LANGUAGE=en_GB:en
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/dc/bin:/opt/perlbrew/bin:/opt/perlbrew/perls/perl-5.22.3/bin:/home/dc/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
    PERLBREW_BASHRC_VERSION=0.39
    PERLBREW_HOME=/home/dc/.perlbrew
    PERLBREW_MANPATH=/opt/perlbrew/perls/perl-5.22.3/man
    PERLBREW_PATH=/opt/perlbrew/bin:/opt/perlbrew/perls/perl-5.22.3/bin
    PERLBREW_PERL=perl-5.22.3
    PERLBREW_ROOT=/opt/perlbrew
    PERLBREW_VERSION=0.39
    PERL_BADLANG (unset)
    PERL_UNICODE=AS
    SHELL=/bin/bash

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Mar 26, 2018

From @Abigail

On Mon, Mar 26, 2018 at 09​:14​:26AM -0700, David Cantrell wrote​:

# New Ticket Created by David Cantrell
# Please include the string​: [perl #133027]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=133027 >

This is a bug report for perl from david@​cantrell.org.uk,
generated with the help of perlbug 1.41 running under perl 5.27.11.

-----------------------------------------------------------------
[Please describe your issue here]

If you interpolate a list element - ie $x[...] into a regex,
It appears to not always be parsed correctly. I would expect
this code to print "bar" four times. Instead it prints​:

bar
bar
foo
bar

$x[1] = "foo"; $_ = "foo"; s/$x[1]/bar/; print "$_\n";
$x[10] = "foo"; $_ = "foo"; s/$x[10]/bar/; print "$_\n";
$x[100] = "foo"; $_ = "foo"; s/$x[100]/bar/; print "$_\n";
$x[1000] = "foo"; $_ = "foo"; s/$x[1000]/bar/; print "$_\n";

NB I didn't discover this myself. User 'ernix' on stackoverflow
did, but I couldn't see anything for it in rt.perl.org. Here's his
original message​:

https://stackoverflow.com/questions/49492181/regex-substitution-using-a-list-element-with-just-3-digits-index-doesnt-work-as

If one uses Deparse, it becomes clear where the difference is​:

  $x[1] = 'foo'; $_ = 'foo'; s/$x[1]/bar/; print "$_\n";
  $x[10] = 'foo'; $_ = 'foo'; s/$x[10]/bar/; print "$_\n";
  $x[100] = 'foo'; $_ = 'foo'; s/${x}[100]/bar/; print "$_\n";
  $x[1000] = 'foo'; $_ = 'foo'; s/$x[1000]/bar/; print "$_\n";

No idea why this happens though.

Abigail

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Mar 26, 2018

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Mar 26, 2018

From @ilmari

Abigail <abigail@​abigail.be> writes​:

If one uses Deparse, it becomes clear where the difference is​:

$x[1] = 'foo'; $_ = 'foo'; s/$x[1]/bar/; print "$_\n";
$x[10] = 'foo'; $_ = 'foo'; s/$x[10]/bar/; print "$_\n";
$x[100] = 'foo'; $_ = 'foo'; s/${x}[100]/bar/; print "$_\n";
$x[1000] = 'foo'; $_ = 'foo'; s/$x[1000]/bar/; print "$_\n";

No idea why this happens though.

Additionally, if all the three digits are the same, it doesn't happen​:

$ perl -MO=Deparse -e '$x[666] = "foo"; $_ = "foo"; s/$x[666]/bar/; print "$_\n";'
$x[666] = 'foo';
$_ = 'foo';
s/$x[666]/bar/;
print "$_\n";

from #london.pm​:

  <@​mst> oh, there's a thingy in the parser where it tries to work
  out what [] means that adds up a bunch of things to get a
  score to decide which way to break
<@​ilmari> huh, $x[111] works
<@​ilmari> and 222, 333 etc
<@​ilmari> it deparses at s/${x}[...]/bar/ when it fails
  <@​mst> my guess would be that 'all the same character inside the
  []' changes the scorte
<@​ilmari> ô_Ô
<@​ilmari> oh, it's trying to guess if it's an array subscript or a
  character class
<@​ilmari> but only if it's three characters...
  <@​mst> that heuristic scares me more than S_intuit_method
<@​DrHyde> i'm kinda surprised that the parser doesn't first try as hard
  as it can to find variables to interpolate, and only treat
  $x[...] as normal regex line-noise if it can't
  <@​mst> the treatment of [] is *special*
 

- ilmari
--
"The surreality of the universe tends towards a maximum" -- Skud's Law
"Never formulate a law or axiom that you're not prepared to live with
the consequences of." -- Skud's Meta-Law

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Mar 27, 2018

From @iabyn

On Mon, Mar 26, 2018 at 06​:08​:26PM +0100, Dagfinn Ilmari Mannsåker wrote​:

<@​mst> oh, there's a thingy in the parser where it tries to work
out what [] means that adds up a bunch of things to get a
score to decide which way to break

The code is in S_intuit_more(). From the code comments​:

  * Returns TRUE if there's more to the expression (e.g., a subscript),
  * FALSE otherwise.
  ...
  * if we're in a pattern and the first char is a [
  * [] returns FALSE
  * [SOMETHING] has a funky algorithm to decide whether it's a
  * character class or not. It has to deal with things like
  * /$foo[-3]/ and /$foo[$bar]/ as well as /$foo[$\d]+/
  ....
  /* This is the one truly awful dwimmer necessary to conflate C and sed. */

  ....
  /* this is terrifying, and it works */

That last comment is followed by 100 lines of code which calculates a
weight heuristic, and which I haven't even attempted to understand.

--
Spock (or Data) is fired from his high-ranking position for not being able
to understand the most basic nuances of about one in three sentences that
anyone says to him.
  -- Things That Never Happen in "Star Trek" #19

@khwilliamson
Copy link
Contributor

@khwilliamson khwilliamson commented Apr 10, 2022

I now understand what's going on here.

The issue is how does one decide between [13579] meaning a subscript vs meaning a class of the odd digits. I haven't checked but it could well be that [97531] would be considered a subscript, and [13579] a character class. See #19614 for more details

The code changed in that PR is a heuristic for deciding. Now that I understand it, I can think of more criteria that could be thrown in, but it would still remain a heuristic. It's hard for me to imagine that it works as poorly as it does with perl working as well as it does. This ticket indicates that if it guess wrong in favor of a character class that things don't work. I think it must be that if it guesses things wrong the other direction, that it all gets sorted out somehow later. Otherwise, we'd be having many more bugs like this one.

The code special cases a construct consisting of precisely 1 or 2 digits to be a subscript. It doesn't special case other all-digit operands. The bias is towards it being a character class. So at three digits, it remains so. At 4 digits, with the final three being the same number '000' in this case, the bias is overcome and it is considered a subscript.

The bias is increased if the digits are consecutive [123] would be more likely to be considered a character class; and it is decreased if the same character occurs multiple times, with each extra occurrence value more towards making it a subscript than the previous such occurrence.

If in fact guessing wrong towards it being a subscript is recovered from, then we could just assume all digits means subscript.

But if guessing a subscript always is recovered from when wrong, why have this routine at all?

@hvds
Copy link
Contributor

@hvds hvds commented Apr 10, 2022

I doubt I could find it back, but many (20?) years ago I reported this with a script that showed the very random looking patterns the heuristic throws up for a long list of integers. It's crazy stuff, but nobody seemed interested in touching it.

I think it must be that if it guesses things wrong [in favour of a subscript], that it all gets sorted out somehow later. Otherwise, we'd be having many more bugs like this one.

It would be useful to find evidence of that, I don't recall any mention of such a correction mechanism before.

@khwilliamson
Copy link
Contributor

@khwilliamson khwilliamson commented Apr 17, 2022

The current code is buggy, it turns out. I don't see a correction mechanism, and the heuristics do fail

@demerphq
Copy link
Contributor

@demerphq demerphq commented Apr 18, 2022

This ticket indicates that if it guess wrong in favor of a character class that things don't work. I think it must be that if it guesses things wrong the other direction, that it all gets sorted out somehow later. Otherwise, we'd be having many more bugs like this one.

I would assume that if it biases towards assuming it is a character class then it probably does the right thing most of the time, especially if the subscript is long. I find it hard to believe that many people are writing patterns with high subscript values in array lookups.

eg,

/$thing[13579]/

its hard to believe that many people actually would have an array with 13k+ elements in it that they want to interpolate in a pattern. So if it assumes that is a character-class I doubt many, if any, people would stumble on the bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants