
Declarative macro expansion #573

Closed
SimplyTheOther opened this issue Jul 17, 2021 · 4 comments
Comments

@SimplyTheOther
Member

SimplyTheOther commented Jul 17, 2021

This issue is designed to document the planned macro expansion system for declarative macros and to allow discussion regarding potential improvements to it.

All macro invocations are parsed in the Parser into either MacroInvocation or MacroInvocationSemi objects, depending on context. MacroInvocationSemis can be items, associated items, or statements - basically anywhere a semicolon would be required (though curly-bracket macros do not require one). MacroInvocations can be types, patterns, or expressions - basically anywhere a semicolon should never be used. This was done largely due to the Reference's separation into these two forms, and has its advantages and disadvantages.

However, in order to simplify expansion, both MacroInvocation and MacroInvocationSemi classes have a MacroInvocData member variable, which stores the actual "data" of the macro (including name of macro and all tokens used inside). This was intended to allow the MacroExpander to operate by expanding instances of the data class rather than having to write two virtually identical methods for each expand method.

The basic idea for macro expansion is for the MacroExpander to scan through the entire AST, storing any macro definitions found (MacroRulesDefinition for declarative macros) and then attempting to expand any macro invocations found. The result of each expanded macro invocation (if it can currently be expanded) will then replace the invocation in the AST.

To actually expand declarative macros, the plan is for the expand_decl_macro method in MacroExpander to be used. This takes the invocation data and the rules definition, and should return an ASTFragment. This sounds simple enough, but the actual implementation required is quite complicated and has multiple potential approaches, as detailed in #17. In short, Rust macro definitions can (and frequently do) have several potential "match patterns", and if one does not match, the next should be tried. The basic consensus at the time was to follow this "backtracking" approach, rather than attempt a predictive one or use a generated parser.

With the backtracking approach, the macro parser would call specific methods of a new Parser instance on the token stream stored in the macro - e.g. parse_expr when an expression non-terminal was found as the next token in the match pattern. If there was a parse error, the macro parser would abort parsing that match pattern and attempt the next. If the macro parser successfully matched the entire "match pattern" with no parse errors, then the "replacement pattern" specified by that match-replacement pair would be used to generate the AST fragment.

Since the regular Parser class is being reused by the MacroParser, it requires some sort of (wrapped) stream of tokens to operate on. However, due to how Rust uses the concept of a "token tree" (either a non-delimiter token or a token tree within delimiters) in macros (and also since the Reference models it as such), I thought it was easier to store macro tokens in a token tree form. As such, there needs to be a conversion of this token tree form into a token stream before the Parser uses it. However, intuition tells me that there are likely issues with doing this that I haven't thought of yet.

Another issue is that the Parser may, in rare cases, mutate the token stream. For example, it will split a >> (right-shift) token into two > (right angle bracket) tokens if it expects right angle brackets at the end of a generic type. However, this will theoretically apply for some "match patterns" but not others - i.e. it could be a generic close in one, but a right-shift in a different one. The best way I could think of to solve that issue (i.e. to avoid expensive copies of the entire token stream each time) was to record each time the parser modified the token stream inside the stream's wrapper, and to have a "reset" function on the wrapper that reverses all changes. For example, a stack holding pairs of "token pointer" and "stream index" inside the wrapper could be used to implement this.

The last paragraph was where I got up to when thinking about the implementation. I'm sure there's other issues or implementation details that I haven't thought of yet.

Notes:

  • By "declarative macro", I refer to what the Rust Reference calls "macros by example" or "MBE" - macros that are based entirely on declarative "if it matches this, replace it with this" logic. This is in contrast to "procedural macros" that can actually run code at compile time. Declarative macros, or at least user-defined ones, are defined using macro_rules, while procedural macros are defined differently.
  • By "macro invocation", I mean a place where a macro is called, or executed if you like. It is the place where the macro will be replaced by a result at compile time.
  • By "macro definition", I mean the place where a specific declarative macro is defined, i.e. the rules regarding what to match and what to replace said matches with.
  • The Reference mathematically defines how declarative macros should behave in naively ambiguous situations here.
  • According to the Reference, Rust uses a somewhat complicated double-scoping system for macros that probably complicates the simple model that I have assumed above. There is "textual scope" (comparable to the scope of a declared variable - to end of enclosing scope) and "path scope", and different macro invocations can force one or the other.
    • Similar to variable declarations, macro definitions can shadow others. As a result, macro definitions would have to be stored in some sort of "list of maps" data structure.
  • Macro expansion is significantly complicated by the fact that it and name resolution are interlaced, in the sense that neither can be fully completed without the other. This is because some macro lookups require name resolution to occur, and some name resolution requires macro expansion to have finished to generate something, etc. rustc has a complex way of doing this.
  • Declarative macros are hygienic, which means that the variables within are in their own scope. Special handling that I have not thought about yet is likely to be required for this.
  • An AST fragment is conceptually just a "fragment" of the AST that can syntactically replace the macro invocation at its specific context. Because of the many different contexts a macro invocation can be in (statement, pattern, expression, etc.), an AST fragment has to be very freeform.
    • My current best idea for its representation is as an ordered list of so-called SingleASTNodes, where a SingleASTNode is a single AST node of any type - expression, statement, item, etc. SingleASTNode should be a tagged union, but due to implementation difficulties it is currently a struct and so wastes memory. There are advantages and disadvantages to this representation as well.
  • The reason why AST::Token stores a pointer to Lexer::Token is to prevent the requirement for potentially expensive re-conversion of tokens between the two types, while still allowing AST::Token to be represented heterogeneously as a kind of TokenTree. The re-conversions would be required if, for example, there was a macro invocation within the expression being matched.
  • The reason why the Parser splits tokens like >> into two >s in certain contexts is that it seemed to me like the best solution to the problem at the time. The problem is that the lexer, which has no syntactic information available (only regular-language information), cannot disambiguate between a right-shift and two right angle brackets without a space between them. I thought that this would be the best option as it best reflects the "true" token stream as interpreted by the parser, and other options seemed to require more special handling.
  • The reason why the Parser is a template class is to allow the use of different "wrappers" for token streams, other than the Lexer class that it was originally built for. Similarly, the reason why the parser calls add_error instead of rust_error_at is to allow parse errors to be discarded by the macro parser when it tries a new match pattern - rust_error_at cannot be recovered from.
  • Some previous discussion mainly on declarative macros can be found in Macro expansion #17.
@SimplyTheOther SimplyTheOther added this to the Macro Expansion milestone Jul 17, 2021
@bjorn3

bjorn3 commented Jul 17, 2021

> Another issue is that the Parser may, in rare cases, mutate the token stream. […]

Rustc is moving towards a model where >> and the like are represented as individual tokens with additional spacing information, I believe. Proc macros already behave this way: https://doc.rust-lang.org/stable/proc_macro/enum.Spacing.html. Rust-analyzer's parser does the same.

@SimplyTheOther
Member Author

> Rustc is moving towards a model where >> and the like are represented as individual tokens with additional spacing information […]

I did consider implementing it like that, and I suppose it would be the best option if you really wanted to keep your token stream immutable, but to me it seemed like a total pain to handle multi-character tokens with that method. For example, a right-shift-assign operator becomes three tokens you now have to track and deal with. So I think the logic in the parser would be significantly complicated by that.

Additionally, a representation like that would make it much more complicated for the Pratt parser-based expression precedence system to work.

@bjorn3

bjorn3 commented Jul 17, 2021

@philberty philberty added this to To do in Control Flow 3 Macros via automation Jan 4, 2022
@philberty philberty moved this from To do to In progress in Control Flow 3 Macros Feb 8, 2022
philberty added a commit that referenced this issue Feb 16, 2022
This is the first pass at implementing macros; more test cases are needed.

This does not support repetition matchers, but it supports simple
declarative macros and transcribes them. The approach taken here is that
we reuse our existing parser to call the appropriate functions as specified
by the MacroFragmentType enum; if the parser does not have errors
parsing that item, then it must be a match.

Then once we match a rule, we have a map of the token begin/end offsets
for each fragment match. This is then used to adjust and create a new token
stream for the macro rule definition, so that when we feed it to the parser
the tokens are already substituted. The resulting expression or item is
then attached to the respective macro invocation, and this is then name
resolved and used for HIR lowering.

Fixes #17 #22
Addresses #573
bors bot added a commit that referenced this issue Feb 17, 2022
938: First pass at declarative macro expansion  r=philberty a=philberty

This does not support repetition matchers, but it supports simple
declarative macros and transcribes them. The approach taken here is that
we reuse our existing parser to call the appropriate functions as specified
by the MacroFragmentType enum; if the parser does not have errors
parsing that item, then it must be a match.

Then once we match a rule, we have a map of the token begin/end offsets
for each fragment match. This is then used to adjust and create a new token
stream for the macro rule definition, so that when we feed it to the parser
the tokens are already substituted. The resulting expression or item is
then attached to the respective macro invocation, and this is then name
resolved and used for HIR lowering.
    
Fixes #17 #22
Addresses #573

Co-authored-by: Philip Herron <philip.herron@embecosm.com>
@philberty philberty moved this from In progress to Review in progress in Control Flow 3 Macros Feb 17, 2022
@philberty philberty moved this from Review in progress to Reviewer approved in Control Flow 3 Macros Feb 18, 2022
@philberty
Member

This is now being done as part of the macros milestone; we used much of the content here to begin our approach.

Control Flow 3 Macros automation moved this from Reviewer approved to Done Mar 14, 2022