
Macro expansion #17

Closed
NalaGinrut opened this issue May 5, 2020 · 13 comments · Fixed by #938

Comments

@NalaGinrut
Contributor

Macro expansion happens before name resolution.

This issue is opened to record this pipeline stage on the kanban board.

@NalaGinrut
Contributor Author

NalaGinrut commented May 16, 2020

Here are the related RFCs:

@NalaGinrut
Contributor Author

NalaGinrut commented May 16, 2020

The current parser returns the MacroItem as a unique_ptr, so we lose the chance to store the macro definition in a table for the expansion step. Querying its definition by traversing the AST is too slow, since the AST is not necessarily balanced.

If we just store its unique_ptr in the table and don't put the MacroItem on the AST, that could be a solution. But then we lose the chance to dump the macro definition in the AST dumper, because the expansion will only substitute the matched case.

So I think we may have to maintain 2 copies of each macro definition. This is not the best way, but it could be optimized in future refactoring.

Any better idea?
@philberty @SimplyTheOther
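A minimal sketch of the alternative discussed above, keeping ownership in the AST and storing only non-owning pointers in a lookup table, might look like this (names such as Crate and build_macro_table are illustrative, not gccrs's actual API):

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the AST's MacroItem node.
struct MacroItem {
  std::string name;
  std::string body; // simplified: a real definition holds rules, not a string
};

// The AST keeps ownership via unique_ptr; the lookup table only stores
// non-owning raw pointers, so no second copy of the definition is needed.
struct Crate {
  std::vector<std::unique_ptr<MacroItem>> items;
};

using MacroTable = std::map<std::string, const MacroItem *>;

// One pass over the (flattened) AST items fills the table; as long as the
// table does not outlive the AST, the raw pointers stay valid.
MacroTable build_macro_table(const Crate &crate) {
  MacroTable table;
  for (const auto &item : crate.items)
    table[item->name] = item.get();
  return table;
}
```

This keeps the MacroItem visible to the AST dumper while still giving the expansion step an O(log n) lookup.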

@SimplyTheOther
Member

SimplyTheOther commented May 17, 2020

Wouldn’t you be able to store a reference to the MacroItem (or rather a reference to a unique_ptr containing the MacroItem) in the table?

Alternatively, you could parse the macro definition into a different form during the expansion phase and put the results of that into the table, and maybe even remove the macro definition item from the constructed AST after that stage if it isn’t required anymore.

Or just move the MacroItem unique_ptr into the table while in the expansion phase, so that the MacroItem is still in the AST during the AST dump phase and so remains in the dump.

@NalaGinrut
Contributor Author

NalaGinrut commented May 17, 2020

> Wouldn’t you be able to store a reference to the MacroItem (or rather a reference to a unique_ptr containing the MacroItem) in the table?

Yes, I can, but it doesn't sound safe to store the unwrapped reference in the table, since it's not unique anymore.

> Alternatively, you could parse the macro definition into a different form during the expansion phase and put the results of that into the table, and maybe even remove the macro definition item from the constructed AST after that stage if it isn’t required anymore.

I think you're suggesting a pass to eliminate dead macros; yes, I can do that. But I don't think it's related to my question, which is about how to query the definition quickly. The elimination has to rely on that lookup.

> Or just move the MacroItem unique_ptr into the table while in the expansion phase, so that the MacroItem is still in the AST during the AST dump phase and so remains in the dump.

But then I have to search for the MacroItem during expansion by traversing the AST the first time.
Hmm... maybe you mean a pass that eliminates macro nodes from the AST and builds the lookup table in one go. Let me think about it...

Let me elaborate on the situation for other readers here: the parser interfaces are all unified to return unique_ptr, so I have no chance to move it into a lookup table before returning it. But I can do that during expansion, since the reference is unique there.

@SimplyTheOther
Member

Yeah, I think I misunderstood what you were trying to say. I assumed you originally meant that you'd do a "macro definition finding" pass over the AST that located all definitions and stored a parsed form in a table.
Personally, I think doing this pass would be the "cleanest" option if it isn't too slow and doesn't have other problems.

> Yes, I can, but sounds not safe to store the unwrapped reference into the table, since it's not unique anymore.

As long as the lifetime of the table is shorter than the lifetime of the AST node (which it should be, unless the macro nodes are deleted during some other process), the reference should be fine. However, I can see how this could become an issue in the future and why you'd be wary about it.

> I think you're suggesting a pass to eliminate dead macros, yes I can do that.
> But I don't think it's related to my question, my question is about how to query it quickly. The elimination has to rely on it.

> But I have to search for the MacroItem in expansion by traversing AST for the first time.
> Hmm...maybe you mean a pass to do eliminate-macro-node in AST and build lookup table in a row. Let me think about it...

It wasn't mainly centred around "eliminating" macros so much as "detecting" them, but yes, that's what I was suggesting. It would be the same kind of thing as a pass for name resolution: the AST is recursively traversed, and any macro definitions found can be parsed into a usable form and stored in a table (or possibly some kind of reverse-linked list of tables to preserve scope).
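The "reverse-linked list of tables" idea could be sketched as a chain of scopes, each holding its own definitions plus a pointer to its parent (a toy model with illustrative names; the real implementation would store parsed macro definitions rather than strings):

```cpp
#include <map>
#include <string>

// Each scope's table points back to its parent scope; lookup walks
// outward from the innermost scope, so inner definitions shadow outer
// ones, mirroring how name resolution scoping works.
struct MacroScope {
  const MacroScope *parent = nullptr;
  std::map<std::string, std::string> defs; // name -> (simplified) definition

  const std::string *lookup(const std::string &name) const {
    for (const MacroScope *s = this; s != nullptr; s = s->parent) {
      auto it = s->defs.find(name);
      if (it != s->defs.end())
        return &it->second;
    }
    return nullptr; // not defined in any enclosing scope
  }
};
```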

> Let me elaborate on the situation for other readers here: the parser interfaces are all unified to return unique_ptr, so I have no chance to move it into a lookup table before return it. But I can do it in expansion since the reference is unique.

We could take the scorched-earth approach and make everything shared_ptr if we can't find any other satisfactory options, though I'm a bit worried about the performance implications of that.

@NalaGinrut
Contributor Author

> Not even mainly centred around "eliminating" macros rather than just "detecting" them, but yes, that's what I was suggesting. It would be the same kind of thing as a pass for name resolution - the AST is recursively traversed and any macro definitions found can be parsed to a usable form and stored in a table (or possibly some kind of reverse-linked list of tables to preserve scope or something).

I think we may have to do it during name resolution, since expansion and resolution may be interleaved if there are macros within macros. Fortunately, Rust doesn't support forward declaration, so the lookup table can be built without any backward tracing.

> We could take the scorched-earth approach and make everything shared_ptr if we can't find any other satisfactory options, though I'm a bit worried about the performance implications of that.

Definitely; I just want to make the minimal change. So far I think unique_ptr is not a bad choice, and we don't have to change it if macros are the only concern.

@SimplyTheOther
Member

@NalaGinrut Are you currently working on this? If not, I was intending to have a try at creating some of the macro expansion code.

@NalaGinrut
Contributor Author

NalaGinrut commented Aug 15, 2020

Currently, no. I'm very busy with my startup. Feel free to try what you like; I'll be back ASAP.

@SimplyTheOther
Member

One issue I have discovered while attempting to start introducing macro expansion is how the parsing of macro_rules declarative macros should be implemented.

Essentially, macro_rules declarative macros work by declaring several different patterns that the invocation tokens can match, which are tried in order until one pattern "matches" the invocation. This is kind of a bad description, so it might help to look at the description of it in the spec or on Daniel Keep's website (which I personally believe gives a particularly good explanation).

Anyway, the issue I encountered was in implementing this "matching" feature. I have thought of a few approaches:

  1. Do what is stated as conceptually occurring, i.e. matching the invocation and definition tokens one at a time until they differ (or a non-terminal in the definition, e.g. an expression capture, fails to parse). This could be considered a "backtracking" approach.
    The advantages of this approach are its conceptual simplicity and probably more straightforward implementation.
    The disadvantage is inefficiency - any parsed captures will be disposed of after each rule is tried upon failure, even if captures would be in the "same" spot, requiring lots of re-parsing, which could be very inefficient and slow if there are a large number of rules and small number of captures.
    mrustc seems to have originally implemented a system similar to this, and decided to change it, according to their docs, due to efficiency considerations regarding tokens. I believe that the token efficiency consideration would not apply to gccrs due to a different lexer model (which uses shared_ptr, meaning copies should not have to occur).

  2. Take an approach similar to rustc, which is an "NFA-based regex parser" "similar to an Earley parser". A basic description of it can be found in the rustc dev guide.
    The advantage of this approach is that it seems to work well, judging by rustc's use of it and apparent lack of plans to replace it. From my preliminary research on Earley parsers (i.e. reading the Wikipedia article), it appears that they prevent the "combinatorial explosion" caused by backtracking with the first possible solution, which would increase efficiency in the worst case - it has cubic time complexity.
    The disadvantages are unknown to me at this point - I will have to do a lot more research to understand the concepts behind it and their implications. Also, rustc's use of it is apparently different to a pure Earley parser.
    I believe mrustc also may have historically used a system similar to this - the "decision tree" approach mentioned in their docs seems to be a way of attempting to use a similar concept. On the other hand, I have a very poor understanding of what an Earley parser actually is, so I may be misunderstanding here, since the "decision tree" seems much weaker than an Earley parser judging by its failure to work correctly.

  3. Take an approach similar to mrustc, which seems to use a simplified second parser for lookahead to determine whether a rule would work (i.e. only a cheaper parsing has to occur before a matching rule is found) and then uses the regular parser to actually parse the matching rule. The system is somewhat described here.
    The advantage of this approach is that the use of the second parser reduces resource usage compared to using a full parser.
    The disadvantage is that two separate parsers need to be maintained and kept "in sync", which is probably really annoying if they are not programmed in a similar "style".

  4. Attempt to do some kind of lookahead (maybe just for non-terminal tokens) in a different way. Advantages and disadvantages obviously depend on the exact method of implementation.
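As a rough illustration of option 1, a toy backtracking matcher might try each rule in order and simply discard all work on failure. This sketch only handles literal tokens plus a single-token wildcard; real matching would parse non-terminal captures (e.g. $e:expr) and repetition operators:

```cpp
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// A pattern token "$" stands in for a non-terminal capture and matches
// any single invocation token; everything else must match literally.
static bool rule_matches(const Tokens &pattern, const Tokens &invocation) {
  if (pattern.size() != invocation.size())
    return false; // no repetition operators in this sketch
  for (size_t i = 0; i < pattern.size(); ++i)
    if (pattern[i] != "$" && pattern[i] != invocation[i])
      return false;
  return true;
}

// Backtracking: rules are tried in order, and all work done for a
// failed rule is thrown away. Returns the first matching rule's index,
// or -1 if no rule matches.
int match_macro(const std::vector<Tokens> &rules, const Tokens &invocation) {
  for (size_t i = 0; i < rules.size(); ++i)
    if (rule_matches(rules[i], invocation))
      return static_cast<int>(i);
  return -1;
}
```

The inefficiency described above shows up here: a capture parsed while trying rule 0 would be re-parsed from scratch when trying rule 1, even if it sits at the same position.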

Does anyone else have any opinions or thoughts on the possible options we could take here? I'm a bit unsure on how to proceed, but I'm leaning towards the first option (at least initially).

@philberty
Member

philberty commented Jan 3, 2021

My gut feeling would be approach 1 or 3. I doubt many people have much experience with Earley parsers. Approach 3 seems reasonable and might be simpler to implement compared to approach 1, but I don't have any real opinion either way.

@NalaGinrut
Contributor Author

NalaGinrut commented Jan 4, 2021

Before I make my comment, I have to mention that we're still brainstorming, so I'm not pushing any of my suggestions.

The MBE in rustc is an NFA-based regex parser inspired by the spirit of an Earley parser. This sounds like they designed a dedicated algorithm for their case rather than using an Earley parser directly. IMO, this is a good way for a project, but it may require a lot of time.

Can we make our design flexible enough that we can replace the algorithm in the future? If so, we could use the backtracking approach in the beginning. This may save us time while we search for the best solution.
@philberty @SimplyTheOther

@SimplyTheOther
Member

> My gut feeling would be approach 1 or 3. I doubt many people have much experience with Earley parsers. Approach 3 seems reasonable and might be simpler to implement compared to approach 1, but i don't have any real opinion either way.

> Can we make our design more flexible that we can replace the algorithm in the future? If so, we may use backtracking way in the beginning. This may save time for us to find the best solution in the beginning.
> @philberty @SimplyTheOther

I was also leaning towards implementing a "backtracking"-style algorithm (i.e. option 1).

As for replacing the algorithm, I'm sure that the macro parser can be implemented in such a way that whatever calls the parse_single_macro function (or whatever it will be called) does not have to worry about the inner operations of the macro parsing algorithm. I'm not sure if this is what you mean by "replace the algorithm", or if you meant something more flexible than this.
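One possible shape for such a replaceable design, sketched here with hypothetical names (parse_single_macro is the function name floated above; MatchAlgorithm and MacroExpander are this sketch's inventions), is to hide the matching algorithm behind a single entry point so callers never depend on its internals:

```cpp
#include <functional>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// The matching algorithm is just a callable: given the rules and the
// invocation tokens, return the matched rule index or -1. A backtracking
// or NFA-based matcher can be swapped in without touching callers.
using MatchAlgorithm =
    std::function<int(const std::vector<Tokens> &, const Tokens &)>;

class MacroExpander {
public:
  explicit MacroExpander(MatchAlgorithm algo) : algo_(std::move(algo)) {}

  // The parse_single_macro-style entry point: all matching state stays
  // local to the call, which keeps the abstraction hygienic (no global
  // side effects leak out of the algorithm).
  int parse_single_macro(const std::vector<Tokens> &rules,
                         const Tokens &invocation) const {
    return algo_(rules, invocation);
  }

private:
  MatchAlgorithm algo_;
};
```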

@NalaGinrut
Contributor Author

@SimplyTheOther I think your suggestion is to hygienically abstract it into a parse_single_macro function. This is flexible enough to implement my idea of "replacing the algorithm". Let me emphasize hygiene here, in case it introduces global side effects.

@philberty philberty added this to the Macro Expansion milestone Jan 4, 2022
philberty added a commit that referenced this issue Feb 16, 2022
This is the first pass at implementing macros; more test cases are needed.

This does not support repetition matchers, but it supports simple
declarative macros and transcribes them. The approach taken here is that
we reuse our existing parser to call the appropriate functions as specified
by the MacroFragmentType enum; if the parser has no errors parsing that
item, then it must be a match.

Then, once we match a rule, we have a map of the token begin/end offsets
for each fragment match. This is then used to adjust and create a new token
stream for the macro rule definition, so that when we feed it to the parser
the tokens are already substituted. The resulting expression or item is
then attached to the respective macro invocation, which is then name
resolved and used for HIR lowering.

Fixes #17 #22
Addresses #573
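The offset-map substitution this commit describes could be sketched roughly as follows (heavily simplified: fragment names map directly to [begin, end) offsets into the invocation token stream, and tokens are plain strings; names are illustrative, not gccrs's actual code):

```cpp
#include <map>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;
struct Range { size_t begin, end; }; // [begin, end) into the invocation

// Build the substituted token stream for a matched rule: literal tokens
// from the rule definition are copied through; each fragment reference
// is replaced by the invocation tokens its match covered. The result is
// fed back to the regular parser with substitutions already applied.
Tokens transcribe(const Tokens &rule_def, const Tokens &invocation,
                  const std::map<std::string, Range> &fragments) {
  Tokens out;
  for (const auto &tok : rule_def) {
    auto it = fragments.find(tok);
    if (it == fragments.end()) {
      out.push_back(tok); // literal token, copied as-is
    } else {
      const Range &r = it->second; // splice in the matched tokens
      for (size_t i = r.begin; i < r.end; ++i)
        out.push_back(invocation[i]);
    }
  }
  return out;
}
```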
bors bot added a commit that referenced this issue Feb 17, 2022
938: First pass at declarative macro expansion  r=philberty a=philberty


Co-authored-by: Philip Herron <philip.herron@embecosm.com>
@bors bors bot closed this as completed in 2c03f34 Feb 17, 2022