
Declarative macro expansion #573

Closed
SimplyTheOther opened this issue Jul 17, 2021 · 4 comments
Comments

@SimplyTheOther
Member

SimplyTheOther commented Jul 17, 2021

This issue is designed to document the planned macro expansion system for declarative macros and to allow discussion regarding potential improvements to it.

All macro invocations are parsed in the Parser into either MacroInvocation or MacroInvocationSemi objects, depending on context. MacroInvocationSemis can be items, associated items, or statements - basically anywhere a semicolon would be required (though curly-bracket macros do not require one). MacroInvocations can be types, patterns, or expressions - basically anywhere a semicolon should never be used. This was done largely due to the Reference's separation into these two forms, and has its advantages and disadvantages.

However, in order to simplify expansion, both MacroInvocation and MacroInvocationSemi classes have a MacroInvocData member variable, which stores the actual "data" of the macro (including name of macro and all tokens used inside). This was intended to allow the MacroExpander to operate by expanding instances of the data class rather than having to write two virtually identical methods for each expand method.

The basic idea for macro expansion is for the MacroExpander to scan through the entire AST, storing any macro definitions found (MacroRulesDefinition for declarative macros) and then attempting to expand any macro invocations found. The result of each expanded macro invocation (if it can currently be expanded) will then replace the invocation in the AST.

To actually expand declarative macros, the plan is for the expand_decl_macro method in MacroExpander to be used. This takes the invocation data and the rules definition, and should return an ASTFragment. This sounds simple enough, but the actual implementation required is quite complicated and has multiple potential approaches, as detailed in #17. In short, Rust macro definitions can (and frequently do) have several potential "match patterns", and if one does not match, the next should be tried. The basic consensus at the time was to follow this "backtracking" approach, rather than attempt a predictive one or use a generated parser.

With the backtracking approach, the macro parser would call specific methods of a new Parser instance on the token stream stored in the macro - e.g. parse_expr when an expression non-terminal was found as the next token in the match pattern. If there was a parse error, the macro parser would abort parsing that match pattern and attempt the next. If the macro parser successfully matched the entire "match pattern" with no parse errors, then the "replacement pattern" specified by that match-replacement pair would be used to generate the AST fragment.

Since the regular Parser class is being reused by the MacroParser, it requires some sort of (wrapped) stream of tokens to operate on. However, due to how Rust uses the concept of a "token tree" (either a non-delimiter token or a token tree within delimiters) in macros (and also since the Reference models it as such), I thought it was easier to store macro tokens in a token tree form. As such, there needs to be a conversion of this token tree form into a token stream before the Parser uses it. However, intuition tells me that there are likely issues with doing this that I haven't thought of yet.

Another issue is that the Parser may, in rare cases, mutate the token stream. For example, it will split a >> (right-shift) token into two > (right angle bracket) tokens if it expects right angle brackets at the end of a generic type. However, this will theoretically apply for some "match patterns" but not others - i.e. it could be a generic close in one, but a right-shift in a different one. The best way I could think of to solve that issue (i.e. to avoid expensive copies of the entire token stream each time) was to record each time the parser modified the token stream inside the stream's wrapper, and to have a "reset" function on the wrapper that reverses all changes. For example, a stack holding pairs of "token pointer" and "stream index" inside the wrapper could be used to implement this.

The last paragraph was where I got up to when thinking about the implementation. I'm sure there's other issues or implementation details that I haven't thought of yet.

Notes:

  • By "declarative macro", I refer to what the Rust Reference calls "macros by example" or "MBE" - macros that are based entirely on declarative "if it matches this, replace it with this" logic. This is in contrast to "procedural macros" that can actually run code at compile time. Declarative macros, or at least user-defined ones, are defined using macro_rules, while procedural macros are defined differently.
  • By "macro invocation", I mean a place where a macro is called, or executed if you like. It is the place where the macro will be replaced by a result at compile time.
  • By "macro definition", I mean the place where a specific declarative macro is defined, i.e. the rules regarding what to match and what to replace said matches with.
  • The Reference mathematically defines how declarative macros should behave in naively ambiguous situations here.
  • According to the Reference, Rust uses a somewhat complicated double-scoping system for macros that probably complicates the simple model that I have assumed above. There is "textual scope" (comparable to the scope of a declared variable - to end of enclosing scope) and "path scope", and different macro invocations can force one or the other.
    • Similar to variable declarations, macro definitions can shadow others. As a result, macro definitions would have to be stored in some sort of "list of maps" data structure.
  • Macro expansion is significantly complicated by the fact that it and name resolution are interlaced, in the sense that neither can be fully completed without the other. This is because some macro lookups require name resolution to occur, and some name resolution requires macro expansion to have finished to generate something, etc. rustc has a complex way of doing this.
  • Declarative macros are hygienic, which means that the variables within are in their own scope. Special handling that I have not thought about yet is likely to be required for this.
  • An AST fragment is conceptually just a "fragment" of the AST that can syntactically replace the macro invocation at its specific context. Because of the many different contexts a macro invocation can be in (statement, pattern, expression, etc.), an AST fragment has to be very freeform.
    • My current best idea for its representation is as an ordered list of so-called SingleASTNodes, where a SingleASTNode is a single AST node of any type - expression, statement, item, etc. SingleASTNode should be a tagged union, but due to implementation difficulties it is currently a struct and so wastes memory. There are advantages and disadvantages to this representation as well.
  • The reason why AST::Token stores a pointer to Lexer::Token is to prevent the requirement for potentially expensive re-conversion of tokens between the two types, while still allowing AST::Token to be represented heterogeneously as a kind of TokenTree. The re-conversions would be required if, for example, there was a macro invocation within the expression being matched.
  • The reason why the Parser splits tokens like >> into two >s in certain contexts is that it seemed to me like the best solution to the problem at the time. The problem is that the lexer, which has no syntactic information available (only regular-language information), cannot disambiguate between a right-shift and two right angle brackets without a space between them. I thought that this would be the best option as it best reflects the "true" token stream as interpreted by the parser, and other options seemed to require more special handling.
  • The reason why the Parser is a template class is to allow the use of different "wrappers" for token streams, other than the Lexer class that it was originally built for. Similarly, the reason why the parser calls add_error instead of rust_error_at is to allow parse errors to be discarded by the macro parser when it tries a new match pattern - rust_error_at cannot be recovered from.
  • Some previous discussion mainly on declarative macros can be found in Macro expansion #17.
@SimplyTheOther SimplyTheOther added this to the Macro Expansion milestone Jul 17, 2021
@bjorn3

bjorn3 commented Jul 17, 2021

> Another issue is that the Parser may, in rare cases, mutate the token stream. […]

Rustc is moving towards a model where >> and the like are represented as individual tokens with additional spacing information, I believe. Proc macros already behave this way: https://doc.rust-lang.org/stable/proc_macro/enum.Spacing.html. Rust-analyzer's parser does the same.

@SimplyTheOther
Member Author

> Rustc is moving towards a model where >> and the like are represented as individual tokens with additional spacing information […]

I did consider implementing it like that, and I suppose it would be the best option if you really wanted to keep your token stream immutable, but to me it seemed like a total pain to handle multi-character tokens with that method. For example, a right-shift-assign operator becomes three tokens you now have to track and deal with. So I think the logic in the parser would be significantly complicated by that.

Additionally, a representation like that would make it much more complicated for the Pratt parser-based expression precedence system to work.

@bjorn3

bjorn3 commented Jul 17, 2021

@philberty philberty added this to To do in Control Flow 3 Macros via automation Jan 4, 2022
@philberty philberty moved this from To do to In progress in Control Flow 3 Macros Feb 8, 2022
philberty added a commit that referenced this issue Feb 16, 2022
This is the first pass at implementing macros; more test cases are needed.

This does not support repetition matchers, but it supports simple
declarative macros and transcribes them. The approach taken here is that
we reuse our existing parser to call the appropriate functions as specified
by the MacroFragmentType enum; if the parser does not have errors
parsing that item, then it must be a match.

Then once we match a rule, we have a map of the token begin/end offsets
for each fragment match. This is then used to adjust and create a new token
stream for the macro rule definition, so that when we feed it to the parser
the tokens are already substituted. The resulting expression or item is
then attached to the respective macro invocation, and this is then name
resolved and used for HIR lowering.

Fixes #17 #22
Addresses #573
bors bot added a commit that referenced this issue Feb 17, 2022
938: First pass at declarative macro expansion  r=philberty a=philberty

This does not support repetition matchers, but it supports simple
declarative macros and transcribes them. The approach taken here is that
we reuse our existing parser to call the appropriate functions as specified
by the MacroFragmentType enum; if the parser does not have errors
parsing that item, then it must be a match.

Then once we match a rule, we have a map of the token begin/end offsets
for each fragment match. This is then used to adjust and create a new token
stream for the macro rule definition, so that when we feed it to the parser
the tokens are already substituted. The resulting expression or item is
then attached to the respective macro invocation, and this is then name
resolved and used for HIR lowering.
    
Fixes #17 #22
Addresses #573

Co-authored-by: Philip Herron <philip.herron@embecosm.com>
@philberty philberty moved this from In progress to Review in progress in Control Flow 3 Macros Feb 17, 2022
@philberty philberty moved this from Review in progress to Reviewer approved in Control Flow 3 Macros Feb 18, 2022
@philberty
Member

This is now being done as part of the macros milestone; we used much of the content here to begin our approach.

Control Flow 3 Macros automation moved this from Reviewer approved to Done Mar 14, 2022