Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

${10}, ${11}, ... does not work when 'use utf8' and 'use strict' is in use #19989

Closed
bram-perl opened this issue Jul 24, 2022 · 8 comments
Closed

Comments

@bram-perl
Copy link

bram-perl commented Jul 24, 2022

Description

This is basically the same issue as #12948, in #12948 fixes were applied1 but these
only work when use utf8 is not in use2.

In addition to that: under use utf8 it allows things like ${010} which were also forbidden in #12948.

Steps to reproduce

  1. Using ${10}

     $ ./perl -Ilib -le 'use strict; use utf8; "abcdefghijkl" =~ m/(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)/ or die; print ${10};'
     Can't use string ("10") as a SCALAR ref while "strict refs" in use at -e line 1.
    
     $ ./perl -Ilib -le 'use strict;           "abcdefghijkl" =~ m/(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)/ or die; print ${10};'
     j
    
  2. Using ${010}

     $ ./perl -Ilib -e 'use utf8; print ${010};'
     $ ./perl -Ilib -e '          print ${010};'
     Numeric variables with more than one digit may not start with '0' at -e line 1.
    

Additional information

Looking at the second set of changes applied commit 60267e1 (L10174-10187) with a bit of context (L10166-10173) shows that the changes are made inside the else block that belongs to if (is_utf8) { which is why it falls back to the old (incorrect) behaviour under use utf8.

    10166        if (is_utf8) {
    10167            const STRLEN skip = UTF8SKIP(s);
    10168            STRLEN i;
    10169            d[skip] = '\0';
    10170            for ( i = 0; i < skip; i++ )
    10171                d[i] = *s++;
    10172        }
    10173        else {
    10174            *d = *s++;
    10175            /* special case to handle ${10}, ${11} the same way we handle ${1} etc */
    10176            if (isDIGIT(*d)) {
    10177                bool is_zero= *d == '0' ? TRUE : FALSE;
    10178                char *digit_start= d;
    10179                while (s < PL_bufend && isDIGIT(*s)) {
    10180                    d++;
    10181                    if (d >= e)
    10182                        Perl_croak(aTHX_ "%s", ident_too_long);
    10183                    *d= *s++;
    10184                }
    10185                if (is_zero && d - digit_start > 1)
    10186                    Perl_croak(aTHX_ ident_var_zero_multi_digit);
    10187            }
    10188            d[1] = '\0';
    10189        }

Footnotes

  1. commit 60267e1 (by @demerphq)

  2. this was observed/diagnosed with the help of @simcop2387 because perlbot did not trigger the error

@demerphq
Copy link
Collaborator

This is curious. I am surprised I didn't notice the discrepancy before, but if you look at that code it is related to parsing a single digit. Or seems to be anyway.

toke.c line 10161:

    if ((s <= PL_bufend - ((is_utf8)
                          ? UTF8SKIP(s)
                          : 1))
        && VALID_LEN_ONE_IDENT(s, PL_bufend, is_utf8))
    {
        if (is_utf8) {
            const STRLEN skip = UTF8SKIP(s);
            STRLEN i; 
            d[skip] = '\0';
            for ( i = 0; i < skip; i++ )
                d[i] = *s++;
        }
        else {
            *d = *s++;
            /* special case to handle ${10}, ${11} the same way we handle ${1} etc */
            if (isDIGIT(*d)) {
                bool is_zero= *d == '0' ? TRUE : FALSE;
                char *digit_start= d;
                while (s < PL_bufend && isDIGIT(*s)) {
                    d++; 
                    if (d >= e)
                        Perl_croak(aTHX_ "%s", ident_too_long);
                    *d= *s++;
                }   
                if (is_zero && d - digit_start > 1)
                    Perl_croak(aTHX_ ident_var_zero_multi_digit);
            }
            d[1] = '\0';
        }

VALID_LEN_ONE_IDENT() and the d[1] = '\0' and d[skip]= '\0' all strongly indicate that this logic relates to parsing a single character of input, so why would this logic be involved at all in parsing a multi digit variable. I note that the VALID_LEN_ONE_IDENT() macro, used only here, doesnt check if the identifier is actually one character:

toke.c line 10092

#define VALID_LEN_ONE_IDENT(s, e, is_utf8)                                  \
    (isGRAPH_A(*(s)) || ((is_utf8)                                          \
                         ? isIDFIRST_utf8_safe(s, e)                        \
                         : (isGRAPH_L1(*s)                                  \
                            && LIKELY((U8) *(s) != LATIN1_TO_NATIVE(0xAD)))))

It feels like there is more than one thing going wrong here. At the very least that macro is misleading as heck!

@demerphq
Copy link
Collaborator

Ok, so I get it now. That macro totally threw me off. That logic is just involved in parsing the first digit and the macro is misleading. I have a PR for this brewing.

demerphq added a commit that referenced this issue Jul 27, 2022
In 60267e1 I patched toke.c to refuse
$00 but did not properly handle ${00} and related cases when the code
was unicode. Part of the reason was the confusing macro
VALID_LEN_ONE_IDENT() which despite its name does not restrict what it
matches to things which are one character long.

Since the VALID_LEN_ONE_IDENT() macro is used in only one place and its
name and placement is confusing I have moved it back into the code
inline as part of this fix. I have also added more comments about what
is going on, and moved the related comment directly next to the code
that it affects. If it moved out of this code then we should think of a
better name and be more careful and clear about checking things like
length. I would argue the logic is used to parse what might be called a
variable "description", and thus it is not identical to code which might
validate an actual parsed variable name. Eg, ${^Var} is a description of
the variable whose "name" is "\026ar". The exception of course is $^
whose name actually is "^".

A byproduct of this change is that the logic to detect duplicated
leading zeros is now quite a bit simpler.

This includes more tests for leading zero checks.

See Issue #12948, Issue #19986, and Issue #19989.
@demerphq
Copy link
Collaborator

See #20000

@bram-perl
Copy link
Author

I think some tests need to be added for this (/ for #12948) as well;
The only test that I can find that uses ${10} in in t/re/subst.t for something that is unrelated; and that appears to be running without use strict; so it won't catch the issue.

I'm not sure where a good place would be to add those; could add some of them in t/re/re_tests which does get run with does use strict; but a test were use strict is enabled explicitly might be better?

@demerphq
Copy link
Collaborator

We have tests in b/t/comp/parser_run.t but maybe that isn't what you had in mind?

@bram-perl
Copy link
Author

There are two issues;

  • variables such as: ${00}, ${0123}, ... for those tests are added and these look fine;
  • the original problem (in On RHS of s///, ${9} works but ${10} does not #12948) that lead to such variables being disallowed: ${10} doesn't/didn't work when use strict "refs" is/was in use.

i.e.

$ ./perl -Ilib -le 'use strict; "abcdefghijkl" =~ m/(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)/ or die; print ${10};'

With older perls (and currently in blead when use utf8 is in use) it would error with:

Can't use string ("10") as a SCALAR ref while "strict refs" in use at -e line 1.

With blead (and without use utf8) this prints the same as $10 (i.e. 'j')

So what I expect(ed) to see somewhere is a test that verifies that ${10}, ${11}, ... work under use strict "refs"; (i.e. same behavior as ${9}, that works under use strict so ${10} should work as well)

@demerphq
Copy link
Collaborator

Ok. For what its worth I have not found the code that causes ${9} and ${10} to be handled differently. It is very annoying. ${9} seems to produce a different op tree but I have yet to find it.

demerphq added a commit that referenced this issue Jul 27, 2022
In 60267e1 I patched toke.c to refuse
$00 but did not properly handle ${00} and related cases when the code
was unicode. Part of the reason was the confusing macro
VALID_LEN_ONE_IDENT() which despite its name does not restrict what it
matches to things which are one character long.

Since the VALID_LEN_ONE_IDENT() macro is used in only one place and its
name and placement is confusing I have moved it back into the code
inline as part of this fix. I have also added more comments about what
is going on, and moved the related comment directly next to the code
that it affects. If it moved out of this code then we should think of a
better name and be more careful and clear about checking things like
length. I would argue the logic is used to parse what might be called a
variable "description", and thus it is not identical to code which might
validate an actual parsed variable name. Eg, ${^Var} is a description of
the variable whose "name" is "\026ar". The exception of course is $^
whose name actually is "^".

A byproduct of this change is that the logic to detect duplicated
leading zeros is now quite a bit simpler.

This includes more tests for leading zero checks.

See Issue #12948, Issue #19986, and Issue #19989.
demerphq added a commit that referenced this issue Jul 28, 2022
In 60267e1 I patched toke.c to refuse
$00 but did not properly handle ${00} and related cases when the code
was unicode. Part of the reason was the confusing macro
VALID_LEN_ONE_IDENT() which despite its name does not restrict what it
matches to things which are one character long.

Since the VALID_LEN_ONE_IDENT() macro is used in only one place and its
name and placement is confusing I have moved it back into the code
inline as part of this fix. I have also added more comments about what
is going on, and moved the related comment directly next to the code
that it affects. If it moved out of this code then we should think of a
better name and be more careful and clear about checking things like
length. I would argue the logic is used to parse what might be called a
variable "description", and thus it is not identical to code which might
validate an actual parsed variable name. Eg, ${^Var} is a description of
the variable whose "name" is "\026ar". The exception of course is $^
whose name actually is "^".

This includes more tests for leading zero checks.

See Issue #12948, Issue #19986, and Issue #19989.
demerphq added a commit that referenced this issue Jul 28, 2022
In 60267e1 I patched toke.c to refuse
$00 but did not properly handle ${00} and related cases when the code
was unicode. Part of the reason was the confusing macro
VALID_LEN_ONE_IDENT() which despite its name does not restrict what it
matches to things which are one character long.

Since the VALID_LEN_ONE_IDENT() macro is used in only one place and its
name and placement is confusing I have moved it back into the code
inline as part of this fix. I have also added more comments about what
is going on, and moved the related comment directly next to the code
that it affects. If it moved out of this code then we should think of a
better name and be more careful and clear about checking things like
length. I would argue the logic is used to parse what might be called a
variable "description", and thus it is not identical to code which might
validate an actual parsed variable name. Eg, ${^Var} is a description of
the variable whose "name" is "\026ar". The exception of course is $^
whose name actually is "^".

This includes more tests for leading zero checks.

See Issue #12948, Issue #19986, and Issue #19989.
demerphq added a commit that referenced this issue Jul 28, 2022
…strict.

Executive summary: in ${ .. } style notation consistently forbid octal
and allow multi-digit longer decimal values under strict. The vars
${1} through ${9} have always been allowed under strict, but ${10} threw
an error unlike its equivalent variable $10.

In 60267e1 I patched toke.c to refuse
octal like $001 but did not properly handle ${001} and related cases when
the code was under 'use utf8'. Part of the reason was the confusing macro
VALID_LEN_ONE_IDENT() which despite its name does not restrict what it
matches to things which are one character long.

Since the VALID_LEN_ONE_IDENT() macro is used in only one place and its
name and placement is confusing I have moved it back into the code
inline as part of this fix. I have also added more comments about what
is going on, and moved the related comment directly next to the code
that it affects. If it moved out of this code then we should think of a
better name and be more careful and clear about checking things like
length. I would argue the logic is used to parse what might be called a
variable "description", and thus it is not identical to code which might
validate an actual parsed variable name. Eg, ${^Var} is a description of
the variable whose "name" is "\026ar". The exception of course is $^
whose name actually is "^".

This includes more tests for allowed vars and forbidden var names.

See Issue #12948, Issue #19986, and Issue #19989.
demerphq added a commit that referenced this issue Jul 30, 2022
…strict.

Executive summary: in ${ .. } style notation consistently forbid octal
and allow multi-digit longer decimal values under strict. The vars
${1} through ${9} have always been allowed under strict, but ${10} threw
an error unlike its equivalent variable $10.

In 60267e1 I patched toke.c to refuse
octal like $001 but did not properly handle ${001} and related cases when
the code was under 'use utf8'. Part of the reason was the confusing macro
VALID_LEN_ONE_IDENT() which despite its name does not restrict what it
matches to things which are one character long.

Since the VALID_LEN_ONE_IDENT() macro is used in only one place and its
name and placement is confusing I have moved it back into the code
inline as part of this fix. I have also added more comments about what
is going on, and moved the related comment directly next to the code
that it affects. If it moved out of this code then we should think of a
better name and be more careful and clear about checking things like
length. I would argue the logic is used to parse what might be called a
variable "description", and thus it is not identical to code which might
validate an actual parsed variable name. Eg, ${^Var} is a description of
the variable whose "name" is "\026ar". The exception of course is $^
whose name actually is "^".

This includes more tests for allowed vars and forbidden var names.

See Issue #12948, Issue #19986, and Issue #19989.
@bram-perl
Copy link
Author

Change was merged on blead (c432f9f), closing the issue

scottchiefbaker pushed a commit to scottchiefbaker/perl5 that referenced this issue Nov 3, 2022
…strict.

Executive summary: in ${ .. } style notation consistently forbid octal
and allow multi-digit longer decimal values under strict. The vars
${1} through ${9} have always been allowed under strict, but ${10} threw
an error unlike its equivalent variable $10.

In 60267e1 I patched toke.c to refuse
octal like $001 but did not properly handle ${001} and related cases when
the code was under 'use utf8'. Part of the reason was the confusing macro
VALID_LEN_ONE_IDENT() which despite its name does not restrict what it
matches to things which are one character long.

Since the VALID_LEN_ONE_IDENT() macro is used in only one place and its
name and placement is confusing I have moved it back into the code
inline as part of this fix. I have also added more comments about what
is going on, and moved the related comment directly next to the code
that it affects. If it moved out of this code then we should think of a
better name and be more careful and clear about checking things like
length. I would argue the logic is used to parse what might be called a
variable "description", and thus it is not identical to code which might
validate an actual parsed variable name. Eg, ${^Var} is a description of
the variable whose "name" is "\026ar". The exception of course is $^
whose name actually is "^".

This includes more tests for allowed vars and forbidden var names.

See Issue Perl#12948, Issue Perl#19986, and Issue Perl#19989.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants