about_parsing additional expression mode start characters #3440

Closed
msftrncs opened this issue Dec 16, 2018 · 13 comments · Fixed by #8053
Labels
area-engine Area - PowerShell engine area-native-cmds Area - native command support

Comments

@msftrncs

msftrncs commented Dec 16, 2018

Issue Details

I have left the template blank at this time, but I believe this might affect all versions.


  • In argument mode, each value is treated as an expandable string unless it begins with one of the following special characters: dollar sign ($), at sign (@), single quotation mark ('), double quotation mark ("), or an opening parenthesis (().

If preceded by one of these characters, the value is treated as a value expression.


In about_Parsing, additional characters trigger expression mode, including:

  • '-' or '+' or '!', as long as it is in turn followed by a numeric value or one of the other mentioned characters, i.e., a valid expression.

-$a, -2, and !$a are parsed in expression mode.
-hello, +hello, and !test are parsed in argument mode.

Also, I think the last line should actually be part of the bulleted paragraph.

I think this whole section also fails to explain that the mode is determined by the first token of each command statement and then applies to the remainder of the statement, noting that parentheses and braces begin a new sub-statement. Instead, this section implies that the mode is determined for each token (it mentions that the tokens are interpreted independently, but they are not, as the interpretation of all tokens after the first one depends on the first one's evaluation). Assuming a function named 'hello', in hello $a-3 the token $a-3 is still treated as an expandable string, not an expression. An example in the doc also shows this: Write-Output $a/H.
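
A minimal sketch of that last point, assuming $a holds an integer and using Write-Output as a stand-in for the hypothetical 'hello' function: the same characters give a string in argument mode and a number once parentheses open a new context.

```powershell
# Assumption: $a holds an integer; Write-Output stands in for the hypothetical 'hello' function.
$a = 5

# Argument mode: the compound token $a-3 is expanded like the string "$a-3".
$asArgument = Write-Output $a-3

# Parentheses open a new context, and its contents are parsed in expression mode.
$asExpression = Write-Output ($a-3)

$asArgument      # 5-3
$asExpression    # 2
```
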

Also, I think the word 'value' may have gotten used where 'token' should have been used, for consistency. The doc started off talking about 'tokens' and then switches to 'values' in the bulleted items.



@sdwheeler sdwheeler added Reference area-engine Area - PowerShell engine labels Dec 17, 2018
@msftrncs
Author

I noticed today that the opening bracket [ may also be one of those characters that define expression mode.

@mklement0
Contributor

@msftrncs, some good points, but I don't think -, +, or [ force expression mode, as the following examples demonstrate:

(write-output -10).GetType().Name
(write-output +10).GetType().Name
(write-output !0).GetType().Name
(write-output [string]).GetType().Name

All these commands output String, indicating that the arguments were parsed as strings (following these chars. with a variable reference wouldn't make a difference).

Unquoted "number-looking" literals without a sign - e.g., 10, 0xa, 2.0 - are, however, half-parsed in expression mode: they are parsed as (suitably typed) numbers that, however, retain their original string representation via their [psobject] wrapper:

PS> (Write-Output 0xa).GetType().Name
Int32   # !! Parsed as number

# However, on output the original string representation is retained:
PS> Write-Output 0xa
0xa  # !! Not, 10, as you would get with Write-Output (0xa)

This awkward hybrid behavior must be retained for backward compatibility, however; it is implemented inconsistently in PowerShell code, unfortunately: see PowerShell/PowerShell#9157 for background information (note what said issue proposes as a resolution is misguided - I'm planning to revise it soon).

However, a char. that is missing from the list of expression-initiating chars. is {, because an unquoted {...} token is parsed as a script block:

PS> Write-Output {ha}
ha   # !! stringified script-block == literal contents between { and }

As for compound tokens such as $a/H, see #3038, whose basis is PowerShell/PowerShell#6467.

@msftrncs
Author

@mklement0, in the examples of (write-output xxx), you have already switched to argument mode. My issue concerns the point at which the parser decides between expression mode and argument mode. Once in a given mode, that mode remains until certain delimiting characters come along, such as the closing ')' in your examples. The opening '(' does two things: it switches to expression mode, and it then enters a subexpression, which starts with a new decision between argument mode and expression mode. The command name sends the subexpression to argument mode. If a + character had appeared instead of a command name, it would have switched to expression mode.
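
As a rough illustration of that distinction (a sketch, not taken from the docs), the two lines below differ only in what comes first inside the parentheses; the variable names are made up for the example.

```powershell
# A command name comes first inside (...), so +10 is parsed as an argument.
$viaCommand = (Write-Output +10).GetType().Name

# + comes first inside (...), so the contents are parsed as an expression.
$viaExpression = (+10).GetType().Name

$viaCommand       # String
$viaExpression    # Int32
```
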

@msftrncs
Author

I should note, regarding my previous comment about '[', that it results in expression mode only if the content after the '[' is a type name and not an attribute, so the '[' by itself does not mark the switch to expression mode.

@mklement0
Contributor

mklement0 commented Mar 30, 2019

@msftrncs: I thought this issue was about which characters at the start of a token in argument mode decide whether argument mode or expression mode is chosen for that argument - that's what the passage from the docs you're quoting in the initial post is about - and that passage is missing {.

My point was that +, -, and [ at the start of a token in argument mode do not switch to expression mode - are we in agreement there?

[ starting something that looks like a type literal doesn't change anything:

PS> Write-Output [int]
[int]  # string literal
PS> (Write-Output [int]).GetType().Name
String  

Now, intra-token use of special characters seems to follow the same rules: inside an unquoted token, encountering one of the special chars.:

  • starts a new parsing context
  • implicitly ends the previous argument

PS> Write-Output a(2)  # parses as *2* arguments
a  # string literal 'a'
2 # [int] 2 due to (2) being parsed as an expression

Again, a + does not do that:

PS> Write-Output foo+10
foo+10  # single string literal 

@mklement0
Contributor

mklement0 commented Mar 30, 2019

@msftrncs: I think I now understand where the confusion lies:

You are talking about what characters determine whether to enter argument or expression mode, either at the start of a statement (start of a line or after ; or |) or after (, $( , @(, or { have forced a new parsing context in argument mode.

And, yes, I agree that it's worth spelling out how that decision is made in the docs.

@msftrncs
Author

@mklement0,

You are talking about what characters determine whether to enter argument or expression mode, either at the start of a statement (start of a line or after ; or |) or after (, $( , or @( have forced a new parsing context in argument mode.

Correct, as that is what I take the 'about_parsing' document to be referring to.

@mklement0
Contributor

mklement0 commented Mar 30, 2019

@msftrncs:

  • about_Parsing currently discusses the two parsing modes separately.

  • The passage you quote in your initial post relates to already being in argument mode ("In argument mode, ..."), and what rules apply to command arguments - it is that behavior that I've tried to clarify in my previous comments.

What's missing is a description of what determines which parsing mode is chosen when:

The rules, from what I understand, are (I have not looked at the source code):

  • A new parsing context (in which the decision between argument and expression mode must be made) is entered:

    • at the start of a new statement (e.g., the start of a line or after a statement-separating ;)
    • at the start of a new pipeline segment (after |), though semantically only commands (argument mode), not expressions are allowed there.
    • inside $(...), @(...), (...), and {...} in expression mode
    • inside $(...) in double-quoted strings
    • inside $(...), @(...), (...), and {...} in argument mode, where recognized as such
  • Argument mode is entered:

    • if the first token is syntactically an unquoted command name (e.g., Get-Date or git)
    • or it is one of the command-invocation operators, & or .
      • & is by itself unequivocally the call operator.
      • ., by contrast, is only recognized as the dot-sourcing operator if followed by a space, $, (, $(, ', ", or {
        • otherwise:
          • if followed by a decimal integer, it is interpreted as a decimal fraction and therefore starts an expression (e.g., .7)
          • otherwise: it is interpreted as the start of a command name (e.g., .foo)
  • Expression mode is entered with any of the following:

    • The characters that are also special when already in argument mode (which includes {, as mentioned)

      • To recap, these are: $ @ ' " ( {

      • $(, @(, and ( start subexpressions ($(...), @(...), (...)) that constitute a new, nested parsing context, inside of which the parsing mode is determined anew.

      • Similarly, {, which starts a script block ({...}), constitutes a new, nested parsing context.

    • [ - it is invariably interpreted as the start of a type literal such as [int].

    • +

    • -, but only if followed by +, -, a space, a number literal, or any of the special characters from argument mode.

      • Counterexample: -foo is interpreted as a command name (argument mode).
    • ., but only if followed by a decimal integer without sign; e.g., .123, which is the same as 0.123 and therefore a [double].

    • Number literals (e.g., 10, 0xa, 2.0, 1e2) - optionally preceded by + or -

      • A digit by itself does not necessarily start an expression; counterexample: 7z
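
One way to check these rules without reading the source is to ask the parser itself. The sketch below uses the public System.Management.Automation.Language.Parser API; Get-ParseMode is a hypothetical helper name, and mapping CommandAst to "Argument" and CommandExpressionAst to "Expression" is an assumption about how the two modes surface in the AST. Nothing is executed, so the sample commands do not need to exist.

```powershell
# Get-ParseMode is a hypothetical helper, not part of PowerShell.
function Get-ParseMode {
    param([Parameter(Mandatory)][string] $Source)

    $tokens = $null
    $errors = $null
    # Parse only; the source text is never run.
    $ast = [System.Management.Automation.Language.Parser]::ParseInput(
        $Source, [ref] $tokens, [ref] $errors)

    # Find the first command or expression element in the parsed text.
    $first = $ast.Find(
        { param($node) $node -is [System.Management.Automation.Language.CommandBaseAst] },
        $true)

    if ($first -is [System.Management.Automation.Language.CommandAst]) { 'Argument' }
    elseif ($first -is [System.Management.Automation.Language.CommandExpressionAst]) { 'Expression' }
    else { 'Other' }
}

# Samples drawn from the rules listed above.
foreach ($sample in 'Get-Date', '& foo', '.foo', '-foo', '7z', '-2', '.7', '[int]', '$a', '{ 1 }') {
    '{0,-10} {1}' -f $sample, (Get-ParseMode $sample)
}
```

If the rules above are right, the first five samples should report Argument and the rest Expression; any mismatch would point at a rule that needs revisiting.
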

@msftrncs
Author

I understand what you are getting at, @mklement0. It has dawned on me now that this particular document is trying to describe something different from what I was thinking. However, I still have a problem with it. I think it's describing the wrong subject, and it's phrasing everything poorly. I think it was understood well from other areas that '(' starts a subexpression. However, '$' does not start an expression in the same sense. Yes, I can reference members of the object, but I cannot go beyond that, so I do not feel that is an expression, and definitely not expression mode. (I consider it a reference.)

I think we both agree that people need a better explanation of the mode switch that occurs when you start a statement with a function name versus a variable reference (as one example). A function name expects to be followed by parameter arguments, which may be expressions if that's what they are called. Expression mode, however, allows arithmetic and the other operators, which cannot be used directly in an argument without a subexpression, and this is what I always thought the 'about_parsing' document was trying to describe.

BTW, Here is my list of what is allowed with the dot-source operator, in REGEX form, as is in my PR #156 in EditorSyntax:

\.(?=\*?[\s,;&|{}\(\)]|\$[\p{L}$?^:_{])

The . may be followed directly by a *, or the characters you stated, or a &, or a |. I am missing the quotes and the sub-statement. I don't know how I determined a '*', but I cannot confirm that works now. I see I need to add more, and clean up the '*'. (Yes, the & or | represent invalid arguments, but from a syntax highlighting point of view, they have to be allowed.)

Everything else you list I have handled.

Ultimately, I think I will close this issue. I think the document needs to be clarified, but that can be a new issue that starts off in the right context.

@mklement0
Contributor

mklement0 commented Mar 30, 2019

Thanks, @msftrncs.

Fully agreed that there's a lot of room for improvement of the help topic at hand.

Good point re a variable reference (e.g. $var) at the start of an argument in argument mode not being a full expression, though you can access a property or call a method; the help topic currently calls it a value expression, but that term isn't defined.
Inside an argument, string-expansion rules then apply, where you can only reference a variable as a whole (Write-Output $PSVersionTable.PSEdition -> 'Core' vs. Write-Output foo$PSVersionTable.PSEdition -> 'fooSystem.Management.Automation.PSVersionHashTable.PSEdition').
I just remembered that I tried to write a comprehensive overview of how unquoted arguments are parsed in this Stack Overflow answer.
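
The same effect can be reproduced with any object; the sketch below uses a made-up hashtable ($info, a hypothetical stand-in for $PSVersionTable) so that the result does not depend on the PowerShell edition in use.

```powershell
# $info is a hypothetical stand-in for $PSVersionTable.
$info = @{ Edition = 'Core' }

# The variable reference starts the argument: member access works.
$whole = Write-Output $info.Edition

# The reference sits inside a compound token: only $info is expanded,
# and .Edition is appended as literal text.
$compound = Write-Output foo$info.Edition

# $(...) restores member access inside the compound token.
$viaSubexpression = Write-Output foo$($info.Edition)

$whole               # Core
$compound            # fooSystem.Collections.Hashtable.Edition
$viaSubexpression    # fooCore
```
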

@msftrncs
Author

On the unquoted expansion, the only thing I am having problems parsing in TextMate is a reference such as

echo $a$b$c.length

You get .length (if all variables are unassigned), but

echo $a.length$b.length$c.length yields '0 0 0' (each on separate lines)

@mklement0
Contributor

mklement0 commented Apr 1, 2019

Yeah, that's surprising:

a$b$c.length is treated like "a$b$c.length", i.e., like an expandable string.

$a.length$b.length$c.length is treated like $a.length $b.length $c.length, i.e., 3 distinct arguments.

Such compound tokens - by that I mean the direct concatenation of two or more distinct syntax constructs that may or may not be parsed as a single argument - show surprising behavior:

  • If the first token is an expression (something that starts with one of the special argument-mode chars.), whatever comes after starts a new argument.

    • The exception - which you've observed - is if the expression is a simple variable reference ($a), as opposed to a variable reference plus member access ($a.length) - see next point.
  • If the first token is either an unquoted literal or a simple variable reference, it is combined with what comes after as an implicit expandable string, with the added feature of recognizing not only $(...) as part of the same compound token, but also quoted tokens, which have their quotes stripped.

    • Note that an additional token starting with just ( rather than $( as well as with { again starts a new argument (with @( and @{ not getting recognized as such, so that the @ is appended to the previous token).

This makes for a surprising asymmetry, which is the subject of PowerShell/PowerShell#6467

Notable examples:

##  $(...) asymmetry

# $(...) after unquoted literal: 1 string argument
PS> Write-Output 3$(1+2)
33

# $(...) before unquoted literal: 2 arguments (both [int])
PS> Write-Output $(1+2)3
3
3

## Quoted-string asymmetry:

# Quoted string after unquoted literal: 1 string argument (with quotes stripped)
PS> Write-Output 3'3'
33

# Quoted string before unquoted literal: 2 arguments (1 string, 1 [int])
PS> Write-Output '3'3
3
3

@sdwheeler
Contributor

Adding link to PowerShell/PowerShell#6467
