${10}, ${11}, ... does not work when 'use utf8' and 'use strict' is in use #19989

bram-perl · 2022-07-24T13:16:44Z

Description

This is basically the same issue as #12948, in #12948 fixes were applied¹ but these
only work when use utf8 is not in use².

In addition to that: under use utf8 it allows things like ${010} which were also forbidden in #12948.

Steps to reproduce

Using ${10}

 $ ./perl -Ilib -le 'use strict; use utf8; "abcdefghijkl" =~ m/(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)/ or die; print ${10};'
 Can't use string ("10") as a SCALAR ref while "strict refs" in use at -e line 1.

 $ ./perl -Ilib -le 'use strict;           "abcdefghijkl" =~ m/(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)/ or die; print ${10};'
 j

Using ${010}

 $ ./perl -Ilib -e 'use utf8; print ${010};'
 $ ./perl -Ilib -e '          print ${010};'
 Numeric variables with more than one digit may not start with '0' at -e line 1.

Additional information

Looking at the second set of changes applied commit 60267e1 (L10174-10187) with a bit of context (L10166-10173) shows that the changes are made inside the else block that belongs to if (is_utf8) { which is why it falls back to the old (incorrect) behaviour under use utf8.

    10166        if (is_utf8) {
    10167            const STRLEN skip = UTF8SKIP(s);
    10168            STRLEN i;
    10169            d[skip] = '\0';
    10170            for ( i = 0; i < skip; i++ )
    10171                d[i] = *s++;
    10172        }
    10173        else {
    10174            *d = *s++;
    10175            /* special case to handle ${10}, ${11} the same way we handle ${1} etc */
    10176            if (isDIGIT(*d)) {
    10177                bool is_zero= *d == '0' ? TRUE : FALSE;
    10178                char *digit_start= d;
    10179                while (s < PL_bufend && isDIGIT(*s)) {
    10180                    d++;
    10181                    if (d >= e)
    10182                        Perl_croak(aTHX_ "%s", ident_too_long);
    10183                    *d= *s++;
    10184                }
    10185                if (is_zero && d - digit_start > 1)
    10186                    Perl_croak(aTHX_ ident_var_zero_multi_digit);
    10187            }
    10188            d[1] = '\0';
    10189        }

commit 60267e1 (by @demerphq) ↩
this was observed/diagnosed with the help of @simcop2387 because perlbot did not trigger the error ↩

The text was updated successfully, but these errors were encountered:

demerphq · 2022-07-27T10:31:24Z

This is curious. I am surprised I didn't notice the discrepancy before, but if you look at that code it is related to parsing a single digit. Or seems to be anyway.

toke.c line 10161:

    if ((s <= PL_bufend - ((is_utf8)
                          ? UTF8SKIP(s)
                          : 1))
        && VALID_LEN_ONE_IDENT(s, PL_bufend, is_utf8))
    {
        if (is_utf8) {
            const STRLEN skip = UTF8SKIP(s);
            STRLEN i; 
            d[skip] = '\0';
            for ( i = 0; i < skip; i++ )
                d[i] = *s++;
        }
        else {
            *d = *s++;
            /* special case to handle ${10}, ${11} the same way we handle ${1} etc */
            if (isDIGIT(*d)) {
                bool is_zero= *d == '0' ? TRUE : FALSE;
                char *digit_start= d;
                while (s < PL_bufend && isDIGIT(*s)) {
                    d++; 
                    if (d >= e)
                        Perl_croak(aTHX_ "%s", ident_too_long);
                    *d= *s++;
                }   
                if (is_zero && d - digit_start > 1)
                    Perl_croak(aTHX_ ident_var_zero_multi_digit);
            }
            d[1] = '\0';
        }

VALID_LEN_ONE_IDENT() and the d[1] = '\0' and d[skip]= '\0' all strongly indicate that this logic relates to parsing a single character of input, so why would this logic be involved at all in parsing a multi digit variable. I note that the VALID_LEN_ONE_IDENT() macro, used only here, doesnt check if the identifier is actually one character:

toke.c line 10092

#define VALID_LEN_ONE_IDENT(s, e, is_utf8)                                  \
    (isGRAPH_A(*(s)) || ((is_utf8)                                          \
                         ? isIDFIRST_utf8_safe(s, e)                        \
                         : (isGRAPH_L1(*s)                                  \
                            && LIKELY((U8) *(s) != LATIN1_TO_NATIVE(0xAD)))))

It feels like there is more than one thing going wrong here. At the very least that macro is misleading as heck!

demerphq · 2022-07-27T11:08:39Z

Ok, so I get it now. That macro totally threw me off. That logic is just involved in parsing the first digit and the macro is misleading. I have a PR for this brewing.

In 60267e1 I patched toke.c to refuse $00 but did not properly handle ${00} and related cases when the code was unicode. Part of the reason was the confusing macro VALID_LEN_ONE_IDENT() which despite its name does not restrict what it matches to things which are one character long. Since the VALID_LEN_ONE_IDENT() macro is used in only one place and its name and placement is confusing I have moved it back into the code inline as part of this fix. I have also added more comments about what is going on, and moved the related comment directly next to the code that it affects. If it moved out of this code then we should think of a better name and be more careful and clear about checking things like length. I would argue the logic is used to parse what might be called a variable "description", and thus it is not identical to code which might validate an actual parsed variable name. Eg, ${^Var} is a description of the variable whose "name" is "\026ar". The exception of course is $^ whose name actually is "^". A byproduct of this change is that the logic to detect duplicated leading zeros is now quite a bit simpler. This includes more tests for leading zero checks. See Issue #12948, Issue #19986, and Issue #19989.

demerphq · 2022-07-27T12:02:34Z

See #20000

bram-perl · 2022-07-27T12:32:48Z

I think some tests need to be added for this (/ for #12948) as well;
The only test that I can find that uses ${10} in in t/re/subst.t for something that is unrelated; and that appears to be running without use strict; so it won't catch the issue.

I'm not sure where a good place would be to add those; could add some of them in t/re/re_tests which does get run with does use strict; but a test were use strict is enabled explicitly might be better?

demerphq · 2022-07-27T12:40:50Z

We have tests in b/t/comp/parser_run.t but maybe that isn't what you had in mind?

bram-perl · 2022-07-27T12:49:47Z

There are two issues;

variables such as: ${00}, ${0123}, ... for those tests are added and these look fine;
the original problem (in On RHS of s///, ${9} works but ${10} does not #12948) that lead to such variables being disallowed: ${10} doesn't/didn't work when use strict "refs" is/was in use.

i.e.

$ ./perl -Ilib -le 'use strict; "abcdefghijkl" =~ m/(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)/ or die; print ${10};'

With older perls (and currently in blead when use utf8 is in use) it would error with:

Can't use string ("10") as a SCALAR ref while "strict refs" in use at -e line 1.

With blead (and without use utf8) this prints the same as $10 (i.e. 'j')

So what I expect(ed) to see somewhere is a test that verifies that ${10}, ${11}, ... work under use strict "refs"; (i.e. same behavior as ${9}, that works under use strict so ${10} should work as well)

demerphq · 2022-07-27T14:07:10Z

Ok. For what its worth I have not found the code that causes ${9} and ${10} to be handled differently. It is very annoying. ${9} seems to produce a different op tree but I have yet to find it.

In 60267e1 I patched toke.c to refuse $00 but did not properly handle ${00} and related cases when the code was unicode. Part of the reason was the confusing macro VALID_LEN_ONE_IDENT() which despite its name does not restrict what it matches to things which are one character long. Since the VALID_LEN_ONE_IDENT() macro is used in only one place and its name and placement is confusing I have moved it back into the code inline as part of this fix. I have also added more comments about what is going on, and moved the related comment directly next to the code that it affects. If it moved out of this code then we should think of a better name and be more careful and clear about checking things like length. I would argue the logic is used to parse what might be called a variable "description", and thus it is not identical to code which might validate an actual parsed variable name. Eg, ${^Var} is a description of the variable whose "name" is "\026ar". The exception of course is $^ whose name actually is "^". A byproduct of this change is that the logic to detect duplicated leading zeros is now quite a bit simpler. This includes more tests for leading zero checks. See Issue #12948, Issue #19986, and Issue #19989.

In 60267e1 I patched toke.c to refuse $00 but did not properly handle ${00} and related cases when the code was unicode. Part of the reason was the confusing macro VALID_LEN_ONE_IDENT() which despite its name does not restrict what it matches to things which are one character long. Since the VALID_LEN_ONE_IDENT() macro is used in only one place and its name and placement is confusing I have moved it back into the code inline as part of this fix. I have also added more comments about what is going on, and moved the related comment directly next to the code that it affects. If it moved out of this code then we should think of a better name and be more careful and clear about checking things like length. I would argue the logic is used to parse what might be called a variable "description", and thus it is not identical to code which might validate an actual parsed variable name. Eg, ${^Var} is a description of the variable whose "name" is "\026ar". The exception of course is $^ whose name actually is "^". This includes more tests for leading zero checks. See Issue #12948, Issue #19986, and Issue #19989.

…strict. Executive summary: in ${ .. } style notation consistently forbid octal and allow multi-digit longer decimal values under strict. The vars ${1} through ${9} have always been allowed under strict, but ${10} threw an error unlike its equivalent variable $10. In 60267e1 I patched toke.c to refuse octal like $001 but did not properly handle ${001} and related cases when the code was under 'use utf8'. Part of the reason was the confusing macro VALID_LEN_ONE_IDENT() which despite its name does not restrict what it matches to things which are one character long. Since the VALID_LEN_ONE_IDENT() macro is used in only one place and its name and placement is confusing I have moved it back into the code inline as part of this fix. I have also added more comments about what is going on, and moved the related comment directly next to the code that it affects. If it moved out of this code then we should think of a better name and be more careful and clear about checking things like length. I would argue the logic is used to parse what might be called a variable "description", and thus it is not identical to code which might validate an actual parsed variable name. Eg, ${^Var} is a description of the variable whose "name" is "\026ar". The exception of course is $^ whose name actually is "^". This includes more tests for allowed vars and forbidden var names. See Issue #12948, Issue #19986, and Issue #19989.

bram-perl · 2022-08-01T14:01:42Z

Change was merged on blead (c432f9f), closing the issue

…strict. Executive summary: in ${ .. } style notation consistently forbid octal and allow multi-digit longer decimal values under strict. The vars ${1} through ${9} have always been allowed under strict, but ${10} threw an error unlike its equivalent variable $10. In 60267e1 I patched toke.c to refuse octal like $001 but did not properly handle ${001} and related cases when the code was under 'use utf8'. Part of the reason was the confusing macro VALID_LEN_ONE_IDENT() which despite its name does not restrict what it matches to things which are one character long. Since the VALID_LEN_ONE_IDENT() macro is used in only one place and its name and placement is confusing I have moved it back into the code inline as part of this fix. I have also added more comments about what is going on, and moved the related comment directly next to the code that it affects. If it moved out of this code then we should think of a better name and be more careful and clear about checking things like length. I would argue the logic is used to parse what might be called a variable "description", and thus it is not identical to code which might validate an actual parsed variable name. Eg, ${^Var} is a description of the variable whose "name" is "\026ar". The exception of course is $^ whose name actually is "^". This includes more tests for allowed vars and forbidden var names. See Issue Perl#12948, Issue Perl#19986, and Issue Perl#19989.

bram-perl added the Needs Triage label Jul 24, 2022

demerphq mentioned this issue Jul 27, 2022

toke.c - improve handling of $00 and ${00} #20000

Merged

jkeenan removed the Needs Triage label Jul 27, 2022

bram-perl closed this as completed Aug 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

${10}, ${11}, ... does not work when 'use utf8' and 'use strict' is in use #19989

${10}, ${11}, ... does not work when 'use utf8' and 'use strict' is in use #19989

bram-perl commented Jul 24, 2022 •

edited

Loading

demerphq commented Jul 27, 2022

demerphq commented Jul 27, 2022

demerphq commented Jul 27, 2022

bram-perl commented Jul 27, 2022

demerphq commented Jul 27, 2022

bram-perl commented Jul 27, 2022

demerphq commented Jul 27, 2022

bram-perl commented Aug 1, 2022

${10}, ${11}, ... does not work when 'use utf8' and 'use strict' is in use #19989

${10}, ${11}, ... does not work when 'use utf8' and 'use strict' is in use #19989

Comments

bram-perl commented Jul 24, 2022 • edited Loading

Footnotes

demerphq commented Jul 27, 2022

demerphq commented Jul 27, 2022

demerphq commented Jul 27, 2022

bram-perl commented Jul 27, 2022

demerphq commented Jul 27, 2022

bram-perl commented Jul 27, 2022

demerphq commented Jul 27, 2022

bram-perl commented Aug 1, 2022

bram-perl commented Jul 24, 2022 •

edited

Loading