Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tied separator and arguments in join() give unexpected results #21458

Closed
book opened this issue Sep 6, 2023 · 11 comments · Fixed by #21502
Closed

Tied separator and arguments in join() give unexpected results #21458

book opened this issue Sep 6, 2023 · 11 comments · Fixed by #21502
Assignees

Comments

@book
Copy link
Contributor

book commented Sep 6, 2023

This is a bug report for perl, generated with the help of perlbug 1.43 running under perl 5.39.3.


Using a self-modifying tied variable as the separator in join gives unexpected results.

Description

While implementing PPC 0013, I've been exploring the behaviour of join with overloaded separator and arguments, and also with tied separator and arguments.

This case is about using a self-modifying (in FETCH) tied variable both as a separator and multiple times in the argument list.

Expected behavior

The following (passing) test case sets expectations:

  • FETCH is called once on a tied separator, and the returned string is used to join the arguments
  • a join with a single argument detected at compilation is replaced with an OP_STRINGIFY, so FETCH is never called
  • when the tied separator is also used in the argument list, FETCH is called once to get the separating string, and once for each occurence of the variable in the argument list
use Test::More;
    
# simple tied variable
{
    package S;
    our $fetched;
    sub TIESCALAR { my $x = '-';   $fetched = 0; bless \$x }
    sub FETCH     { my $y = shift; $fetched++;   $$y }

    package main;
    my $t;

    tie $t, 'S';
    is( join( $t, a .. c ), 'a-b-c', 'tied separator' );
    is( $S::fetched,    1,       'FETCH called once' );

    tie $t, 'S';
    is( join( $t, 'a' ), 'a', 'tied separator on single item join' );
    is( $S::fetched,     0,   'FETCH not called' );

    tie $t, 'S';
    is( join( $t, 'a', $t, 'b', $t, 'c' ),
        'a---b---c', 'tied separator also in the join arguments' );
    is( $S::fetched, 3, 'FETCH called 1 + 2 times' );
}

done_testing;

Steps to Reproduce

When the tied variable is self-modifying in FETCH, the result does not seem to follow the above expectations:

use Test::More;

# self-modifying tied variable
{

    package SM;
    our $fetched;
    sub TIESCALAR { my $x = "1";   $fetched = 0; bless \$x }
    sub FETCH     { my $y = shift; $fetched++;   $$y += 3 }

    package main;
    my $t;

    tie $t, "SM";
    is( join( $t, a .. c ), 'a4b4c', 'tied separator' );
    is( $SM::fetched,   1,       'FETCH called once' );

    tie $t, "SM";
    is( join( $t, 'a' ), 'a', 'tied separator on single item join' );
    is( $SM::fetched,    0,   'FETCH not called' );

    tie $t, "SM";
    is( join( $t, "a", $t, "b", $t, "c" ),
        'a474b4104c', 'tied separator also in the join arguments' );
    is( $SM::fetched, 3, 'FETCH called 1 + 2 times' );
}

done_testing;

In the last call to join, FETCH is called the expected number of times, but the result is not what we expect, based on the behaviour described in the first case:

#     expected: 'a474b4104c'
                  ^ separator (FETCH)
                   ^ tied argument (FETCH)
                    ^ separator (cached)
                      ^ separator (cached)
                       ^ tied argument (FETCH)
                         ^ separator (cached)

versus:

#          got: 'a477b7101c'
                  ^ separator (FETCH)
                   ^ tied argument (FETCH)
                    ^ ??? separator (previously fetched value?)
                      ^ ??? separator (same as before?)
                       ^ tied argument (FETCH)
                         ^ ???

Flags

  • category=core
  • severity=medium

Perl configuration

Site configuration information for perl 5.39.3:

Configured by book at Wed Sep  6 08:18:57 CEST 2023.

Summary of my perl5 (revision 5 version 39 subversion 3) configuration:
  Commit id: 008ae6a0ef7f87df8ee75ca523c6adf1ae0b768d
  Platform:
    osname=linux
    osvers=6.2.0-31-generic
    archname=x86_64-linux
    uname='linux zlopp 6.2.0-31-generic #31~22.04.1-ubuntu smp preempt_dynamic wed aug 16 13:45:26 utc 2 x86_64 x86_64 x86_64 gnulinux '
    config_args='-des -Dusedevel'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=undef
    usemultiplicity=undef
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
  Compiler:
    cc='cc'
    ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    optimize='-O2'
    cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='11.4.0'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags =' -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/x86_64-linux-gnu /usr/lib /usr/lib64
    libs=-lpthread -ldl -lm -lcrypt -lutil -lc
    perllibs=-lpthread -ldl -lm -lcrypt -lutil -lc
    libc=/lib/x86_64-linux-gnu/libc.so.6
    so=so
    useshrplib=false
    libperl=libperl.a
    gnulibc_version='2.35'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E'
    cccdlflags='-fPIC'
    lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector-strong'


---
@INC for perl 5.39.3:
    lib
    /usr/local/lib/perl5/site_perl/5.39.3/x86_64-linux
    /usr/local/lib/perl5/site_perl/5.39.3
    /usr/local/lib/perl5/5.39.3/x86_64-linux
    /usr/local/lib/perl5/5.39.3

---
Environment for perl 5.39.3:
    HOME=/home/book
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LC_ADDRESS=fr_FR.UTF-8
    LC_IDENTIFICATION=fr_FR.UTF-8
    LC_MEASUREMENT=fr_FR.UTF-8
    LC_MONETARY=fr_FR.UTF-8
    LC_NAME=fr_FR.UTF-8
    LC_NUMERIC=fr_FR.UTF-8
    LC_PAPER=fr_FR.UTF-8
    LC_TELEPHONE=fr_FR.UTF-8
    LC_TIME=fr_FR.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/book/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash
@leonerd
Copy link
Contributor

leonerd commented Sep 15, 2023

This feels like definitely a bug in Perl_do_join(), in doop.c.
It begins by taking the length and string buffer pointer of the delim sv:

 664     STRLEN delimlen;
 665     const char * const delims = SvPV_const(delim, delimlen);

This pointer is now just pointing into the SvPV buffer of the delim SV itself. Meaning, later on every time it gets sv_catpvn'ed into the buffer that join is building, it copies the current value of that delim SV, not the initial value at entry time into the function.

I suspect a fix for this would be to copy the delim SV into a temporary buffer and use that.

    STRLEN delimlen;
    const char * const delims = SvPV_const(delim, delimlen);
    delims = savepvn(delims, delimlen);
    SAVEFREEPV(delims);

@leonerd
Copy link
Contributor

leonerd commented Sep 15, 2023

Attempted fix at: #21484

@iabyn
Copy link
Contributor

iabyn commented Sep 15, 2023 via email

@demerphq
Copy link
Collaborator

demerphq commented Sep 15, 2023

Personally I think we should:

  • declare in the documentation that the behaviour of join is undefined if
    the value of the separator changes during the course of the join. This
    leaves us with flexibility in future for whatever optimisations or
    overload adding or whatever we later choose to do.

I think we should declare in the docs that ties and overloads on the separator will be triggered at most once, and that the value of the separator will not change for the lifetime of the join operation. Anything else means the user has no "fast" way to stringify and concatenate a set of items. Put alternatively I think join should be implemented such that the user can assume it is the fastest way to concatenate a set of items, and that every aspect of its implementation should be optimized for speed.

Part of the issue here is that there are multiple valid interpretations of what join does in terms of the operators it represents. So for instance I can think of multiple different interpretations of join($sep,$x,$y,$z)

  1. $x . $sep . $y . $sep . $z (being one of:)
  2. (($x . $sep) . $y) . $sep) . $z
  3. (($x . $sep) . $y) . ($sep . $z))
  4. (($x . $sep) . ($y . $sep)) . $z
  5. ($x . ( $sep . $y)) . ( $sep . $z )
  6. $x . (( $sep . $y) . ( $sep . $z ))
  7. "$x" . "$sep" . "$y" . "$sep" . "$z"
  8. do{ $tmp_sep = "$sep"; "$x" . $tmp_sep . "$y" . $tmp_sep . "$z" }
  9. do{ my $tmp_sep = "$sep"; my @parts = stringify_in_undefined_order($x,$y,$z); $x . $sep . $y . $sep . $z }

Personally I would vote for #7 or #8, as under those interpretations join() would NEVER return an overloaded object as it would be the result of joining the stringified version of its arguments (which IMO should never return anything but a string). The point being that I would prefer we leave it completely undefined what order the arguments are stringified in, and what order they are joined in so that we are allowed to implement join in whatever way would achieve maximum speed.

I understand that there are those that want join() to be equivalent to some particular operator sequence, but we have never defined that sequence, and there are multiple valid optrees for this type of expression, which would be equally valid depending on the type of parser used to process the expression.

  • the code should be robust enough to handle the separator changing without being potentially insecure or crashworthy.

I strongly agree. This has come up in other places, like named user defined unicode sequences in the regex engine.

Or perhaps, after the delimiter is appended to the target string for the first time, use that part of the target string as the source for any future delimiter appends?

This makes sense to me personally and how i would expect it to be done.

Yves

@book
Copy link
Contributor Author

book commented Sep 15, 2023

I think we should declare in the docs that ties and overloads on the separator will be triggered at most once, and that the value of the separator will not change for the lifetime of the join operation.

I agree with this for tied variables. "At most once" is also backwards compatible with the current behaviour.

(@demerphq: I'll reply separately to the rest of your points.)

@book
Copy link
Contributor Author

book commented Sep 15, 2023

@demerphq I didn't list "overload" in my reply above, because I wanted to gather my thoughts and not reply too quickly.

However, if we say "ties will be triggered at most once", then there's definitely a point to be made about also "applying the concat overload on the separator at most once", if only for consistency regarding side-effects.

So I've come to agree with the full sentence I quoted. 🎉

My wording would be:

Ties and overloads on the separator will be triggered at most once during a join. The value used for the join operation is the stringification of the given separator (and therefore will not change during the join operation). If the separator is not needed to perform the join (e.g. on a single item or a list with at most one element), then no side-effects are performed on the separator and the rest of the arguments, and a single item will be passed through as is. Joining an empty list returns the empty string.

So, the separator should be stringified at most once, if needed to perform the join. Any concat overload on the separator would be bypassed (since it's stringified instead).

Here's what I mean by "if needed": while looking at the code with @leonerd, we discovered that join $sep, $item (single item detected at compile-time) is actually optimized to "$item" (stringification of the item).

It's rather easy to check the size of the argument list, so I think we should actually modify the join compile-time optimization and run-time implementation so that a join with a single item will return the item as-is (no stringification, no side-effects) and not trigger any side-effect on the separator.

@demerphq
Copy link
Collaborator

@book I think there are a couple of problems with this:

a join with a single item will return the item as-is (no stringification, no side-effects) and not trigger any side-effect on the separator

First there is existing behavior, join on a single item returns the stringified form of the item, possibly throwing warnings if that item is undefined.

$ perl -le' my $r = join "", []; print "is ref" if ref $r; print $r;'
ARRAY(0x55d58a3825e8)
$ perl -MData::Dumper -wle'my $undef; print Dumper(join "", $undef);'
Use of uninitialized value $undef in join or string at -e line 1.
$VAR1 = '';

Additionally, I would expect all of the following to return the same thing:

  1. join("", $x, "")
  2. join("","", $x)
  3. join($x,"","")
  4. join("",$x)
  5. join($anything,$x)

My point is I would expect that join() returns a string always, and currently that is what happens. The separator is special however, I would not expect it to be stringified if there were less than two items to be joined. IOW, I would not expect it to be stringified if there were 0 or 1 items in the join expression.

Also worth considering is what happens with an empty list. I would expect join to return the empty string when joining 0 items.

$ perl -MData::Dumper -le'my @empty; print Dumper(join "", @empty);'
$VAR1 = '';

So I think that join() must continue to return a string. I dont think it should ever return an object or pass through an argument. I think there is room to debate how many times the separator should be stringified, but not the return from the operator.

@book
Copy link
Contributor Author

book commented Sep 15, 2023

The conversation on what join returns is out of scope for this issue, I suggest we continue it elsewhere (the list, or some issue on the PPCs repository).

About this issue, if everyone agrees with:

Ties and overloads on the separator will be triggered at most once during a join. The value used for the join operation is the stringification of the given separator (and therefore will not change during the join operation). If the separator is not needed to perform the join (e.g. on a single item or a list with at most one element), then no side-effects are performed on the separator.

(i.e. what I wrote above, without the pass-through part)

then:

  • @leonerd's patch fixes the issue I found
  • we need to prevent side effects on the separator if the list has 0 or 1 elements
  • we can add the test I wrote to the t/op/join.t script
  • we should update the join documentation to mention side-effects on the separator

@tonycoz
Copy link
Contributor

tonycoz commented Sep 18, 2023

So, the separator should be stringified at most once, if needed to perform the join. Any concat overload on the separator would be bypassed (since it's stringified instead)

Isn't this a step away from ppc0013?

ppc0013 states:

# when @list contains elements with concat overloading,
# we expect this code:
my $ret = join $sep, @list;

# to behave like this code:
my $ret = reduce { ( $a . $sep ) . $b } @list;

but the above means that if only $sep has an overloaded concat that overloaded concat will be ignored.

It also means that unlike the reduce(...) the overloading on $sep is only called once.

For the trivial common case where $sep is not modified, is a PV, isn't overloaded and isn't magical, SvPV() is cheap, most likely cheaper than savepvn() as Dave pointed out.

Or as an optimisation, only SvPV() it anew after each time you encounter a SMG argument?

If delim is overloaded SvPV() might return a new string each time, no SMG needed at all. Or string overloading on the list elements can modify delim, still no SMG needed.

@book
Copy link
Contributor Author

book commented Sep 18, 2023

Yes, it is a step away from PPC 0013. It is also a step forward in a negotiation with @demerphq, who doesn't want PPC 0013 to happen at all. (Quoting an earlier message of his: "I think that join() must continue to return a string. I dont think it should ever return an object or pass through an argument.")

However, I'd rather have this conversation outside of this issue, which is about an actual bug in the current implementation of join.

@demerphq
Copy link
Collaborator

@book, FWIW, its not that I /want/ to block PSC 0013, it is because I think it is sufficiently flawed that we are /required/ to block it until those flaws can be resolved. I agree this ticket isnt the forum for that discussion however. I will try to find time in the next day or two to post something which summarizes the problems I see in it.

tonycoz added a commit to tonycoz/perl5 that referenced this issue Sep 21, 2023
tonycoz added a commit to tonycoz/perl5 that referenced this issue Sep 21, 2023
This code had a few problems:

- changes to the content of delim from set or overload magic could
  result in the separator between elements changing during the
  process of the join.
- changes to the content of delim which allocated a new PVX
  resulted in access to freed memory
- changes to the flags of delim, the UTF-8 flag in particular, could
  result in an invalid joined string, either mojibake or an invalidly
  encoded upgraded string

To avoid that, we copy the separator, either into a local buffer
if it's large enough, or an allocated buffer, and save the flag we
use, to prevent changes to the delim SV from changing or invalidating
the delimpv value.

Fixes Perl#21458 and some similar problems.
tonycoz added a commit that referenced this issue Oct 5, 2023
tonycoz added a commit that referenced this issue Oct 5, 2023
This code had a few problems:

- changes to the content of delim from set or overload magic could
  result in the separator between elements changing during the
  process of the join.
- changes to the content of delim which allocated a new PVX
  resulted in access to freed memory
- changes to the flags of delim, the UTF-8 flag in particular, could
  result in an invalid joined string, either mojibake or an invalidly
  encoded upgraded string

To avoid that, we copy the separator, either into a local buffer
if it's large enough, or an allocated buffer, and save the flag we
use, to prevent changes to the delim SV from changing or invalidating
the delimpv value.

Fixes #21458 and some similar problems.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants