
Trivia interface #2

Open
c42f opened this issue Dec 16, 2021 · 2 comments

c42f commented Dec 16, 2021

I was thinking a bit about the "right" interface to trivia.

The rust-analyzer people are discussing it over at rust-lang/rust-analyzer#6584 so they've got some good background reading there. It seems generally awkward with no obviously right answer.

IIUC there are two common interfaces:

  • Roslyn, Swift libsyntax — attach multiple trivia tokens to each side of nontrivia nodes (or just nontrivia leaf nodes?). The nontrivia nodes themselves become "fatter", and the depth of the tree is increased by 1.
  • rust-analyzer, IntelliJ(??) — attach trivia tokens as arbitrary children of any interior nodes. So they're generally siblings of nontrivia nodes.

The rust-analyzer model is appealing because it leads to simpler data structures with less internal structure. Also it's more general because the trivia might be naturally interspersed with nontrivia children but without a natural attachment to any of the children. But we could go for either approach, or something else entirely.
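To make the contrast concrete, here's a toy sketch of how the tokens of a fragment like "a + b  # note" could be stored under each model. The types and names are purely illustrative, not the real JuliaSyntax, Roslyn or rust-analyzer data structures:

# A plain trivia/nontrivia token.
struct Token
    kind::Symbol
    text::String
end

# (1) Roslyn / Swift libsyntax style: each nontrivia token carries its own
# leading and trailing trivia, so interior nodes contain no trivia at all.
struct FatToken
    kind::Symbol
    text::String
    leading::Vector{Token}    # trivia tokens appearing before this token
    trailing::Vector{Token}   # trivia tokens appearing after it
end

struct FatNode
    kind::Symbol
    children::Vector{Union{FatNode,FatToken}}
end

roslyn_style = FatNode(:call, [
    FatToken(:Identifier, "a", Token[], [Token(:Whitespace, " ")]),
    FatToken(:Operator,   "+", Token[], [Token(:Whitespace, " ")]),
    FatToken(:Identifier, "b", Token[], [Token(:Whitespace, "  "),
                                         Token(:Comment, "# note")]),
])

# (2) rust-analyzer style: trivia tokens are ordinary children of interior
# nodes, sitting as siblings of the nontrivia tokens.
struct ThinNode
    kind::Symbol
    children::Vector{Union{ThinNode,Token}}
end

ra_style = ThinNode(:call, [
    Token(:Identifier, "a"), Token(:Whitespace, " "),
    Token(:Operator,   "+"), Token(:Whitespace, " "),
    Token(:Identifier, "b"), Token(:Whitespace, "  "),
    Token(:Comment, "# note"),
])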

Whitespace trivia

A useful observation: we can't attach whitespace so that both of the following hold (a small example follows the list):

  • Every node represents a contiguous span of bytes in the source file - a fundamental property of green trees
  • We respect the visual tree structure — in the sense that nested nodes should only contain whitespace relevant to their own internal tree structure (rather than where they are placed in the larger tree.)
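Here's a small illustration of the conflict, using my own example (indented with the usual four spaces per level):

function f(cond)
    if cond
        x = 1   # the 8 spaces before "x = 1" must lie inside the if-block's
    end         # byte span (contiguity), yet their width is determined by how
end             # deeply that block happens to be nested in the enclosing tree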

In general, I guess whitespace will become inconsistent during a refactoring pass and will need to be regenerated. This is obviously true for moving blocks of code around, but it's even true for a refactoring as simple as renaming identifiers. For example, renaming elements of expressions which span multiple lines:

func(arg1, arg2, ...
     argN, argN1)
^^^^
# problematic whitespace if length of func symbol changes

So I'm kind of convinced that there's no natural representation of whitespace within the green tree, so we may as well do whatever is efficient and simple to implement.

Symbols

Consider a simple thing like (b + c) + (b + c)^2 and a pass which identifies common subexpressions to get

x = b + c
(x) + (x)^2

Here we can and should remove the parentheses (which are trivia after parsing, due to being used for grouping only). What do we even do here? Like whitespace, it seems refactorings will regularly break this kind of trivia and require that it's regenerated from a model of the precedence rules.
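As a sketch of what regenerating parentheses from a precedence model could look like, here's a toy reprinter over Expr with made-up precedence numbers and no associativity handling (not JuliaSyntax's actual precedence rules):

# Toy precedence table (made-up numbers; only the relative order matters).
const PREC = Dict(:+ => 11, :* => 12, :^ => 15)

# Reprint a call expression, inserting parentheses only where a child binds
# less tightly than the operator of its enclosing context.
function reprint(ex::Expr, parent_prec::Int = 0)
    @assert ex.head == :call
    op = ex.args[1]
    prec = PREC[op]
    args = [a isa Expr ? reprint(a, prec) : string(a) for a in ex.args[2:end]]
    s = join(args, " $op ")
    return prec < parent_prec ? "(" * s * ")" : s
end

reprint(:((x) + (x)^2))  # "x + x ^ 2"   - the grouping parens around x are gone
reprint(:(x * (x + 1)))  # "x * (x + 1)" - parens re-derived around the +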

Comments

What about comments? This is much more relevant and I think we should aim for "comments are likely to survive symbolic refactoring and remain attached in the right places".

It seems likely there are cases where one or the other model wins here, depending on the situation. Some prototyping with simple example refactoring passes might be necessary to get a feel for the pros and cons.
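One heuristic worth prototyping (my assumption about a reasonable starting point, not anything implemented): attach a comment to the preceding nontrivia token when they share a line (a trailing comment), and otherwise to the next nontrivia token. Over a flat token list that might look roughly like:

# Hypothetical token type and trivia kinds, just for the sketch.
struct Tok
    kind::Symbol
    text::String
end

istrivia(t::Tok) = t.kind in (:Whitespace, :NewlineWs, :Comment)

# Map each comment's index to the index of the nontrivia token it attaches to.
function attach_comments(toks::Vector{Tok})
    attachment = Dict{Int,Int}()
    for (i, t) in enumerate(toks)
        t.kind == :Comment || continue
        prev = findprev(x -> !istrivia(x), toks, i)
        next = findnext(x -> !istrivia(x), toks, i)
        # Trailing comment: some nontrivia token earlier on the same line.
        same_line = prev !== nothing &&
                    !any(toks[j].kind == :NewlineWs for j in prev+1:i-1)
        if same_line
            attachment[i] = prev
        elseif next !== nothing
            attachment[i] = next
        elseif prev !== nothing
            attachment[i] = prev
        end
    end
    return attachment
end

toks = [Tok(:Identifier, "x"), Tok(:Whitespace, " "), Tok(:Comment, "# trailing"),
        Tok(:NewlineWs, "\n"), Tok(:Comment, "# leading"),
        Tok(:NewlineWs, "\n"), Tok(:Identifier, "y")]
attach_comments(toks)   # Dict(3 => 1, 5 => 7)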

Impact on the parser

One big benefit we have in the ParseStream interface is that trivia is mostly invisible to the parser. So in theory we can adjust trivia attachment heuristics (within whichever model is chosen) independently of the parser code. Julia is sensitive to whitespace and newlines in selected situations, but after parsing is done this information is no longer needed and it may be consistent to split and recombine trivia however we like by floating the boundaries of nodes across the trivia tokens.

davidanthoff commented

I think the rust-analyzer folks consider their choice a mistake in hindsight. At least that is how I am interpreting matklad's comment helix-editor/helix#5231 (comment) from December (he seems to be the person who contributed the vast majority of rust-analyzer initially). I think the discussion in rust-lang/rust-analyzer#6584 is probably less clear on that point because they of course face the choice of disrupting a lot of work if they change now, years after the initial choice.

To me this all kind of points towards following the Roslyn model... At least right now I have the impression that most of the folks that have really dug into this question and then built large scale tools on top of things seem to think that the Roslyn design is the way to go. I think the fact that rust-analyzer tried out the other option and now considers that a mistake is particularly telling.

I guess one other option would be that we try to implement a Roslyn-like tree structure just for the LS side of things?


c42f commented Mar 18, 2023

That's an interesting comment for sure. Thanks so much for pointing it out :)

I have several thoughts about this.

face the choice of disrupting a lot of work if they change now, years after the initial choice

I doubt there is "one true trivia attachment to rule them all", or if there is, it's quite heuristic - not necessarily something you want to hardcode into the parser, I'd say. However, it's possible to support various whitespace attachment algorithms with the existing parser, because we can shuffle whitespace between nodes after parsing but before tree building. This is what I did in the prototype conversion to CSTParser.jl data structures, in the code near the comment

# Rearrange whitespace trivia tokens so that they're always *trailing*

Thus we can play with whitespace attachment heuristics without any need to touch the parser code. In principle, different applications could choose different rules for whitespace attachment. This might make composition of such code difficult though?
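As a sketch of that kind of post-parse regrouping (over an arbitrary flat child list with a pluggable trivia predicate, rather than the real GreenNode/ParseStream machinery):

# Regroup a node's children so each nontrivia child owns the trivia tokens
# that follow it, i.e. whitespace/comments become trailing trivia.
function attach_trailing(children, istrivia)
    leading = Any[]          # trivia appearing before the first nontrivia child
    groups = Vector{Any}[]   # groups[i] = [nontrivia child, its trailing trivia...]
    for c in children
        if istrivia(c)
            push!(isempty(groups) ? leading : groups[end], c)
        else
            push!(groups, Any[c])
        end
    end
    return leading, groups
end

# E.g. for the token kinds of "f(x)  # call":
kinds = [:Identifier, :LParen, :Identifier, :RParen, :Whitespace, :Comment]
attach_trailing(kinds, k -> k in (:Whitespace, :Comment))
# groups: [[:Identifier], [:LParen], [:Identifier], [:RParen, :Whitespace, :Comment]]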

I think we can support "whitespace attached to AST leaf nodes" (the Roslyn model?) as a generalization of our current rules; it's a fairly small tweak to how things work right now.

Right now,

  • A GreenNode leaf is a token (trivia or nontrivia)
  • A SyntaxNode leaf is a token (nontrivia)

How about, instead:

  • A GreenNode leaf is a token (trivia or nontrivia)
  • A SyntaxNode leaf may be an internal node of the green tree, covering exactly one nontrivia token, and it takes the kind of that nontrivia token. (Alternatively, to avoid introducing extra internal GreenNodes into the tree, we could allow a SyntaxNode to cover a range of green children.)

I think this provides a data model which supports the various kinds of trivia attachment.
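A rough sketch of that shape, with toy types and illustrative field names rather than the real GreenNode/SyntaxNode definitions:

# Green leaves are tokens (trivia or nontrivia); interior nodes just have
# kind + span + children.
struct GreenToy
    kind::Symbol
    span::Int                    # width in bytes
    children::Vector{GreenToy}   # empty for leaf tokens
end

# A syntax-tree leaf may point at a green *interior* node which covers exactly
# one nontrivia token plus its attached trivia, and takes that token's kind.
struct SyntaxToy
    green::GreenToy
    position::Int                # absolute byte offset into the source
    children::Vector{SyntaxToy}  # empty here, even though the green node has children
end

# "x " (an identifier plus one trailing space) as a single syntax leaf:
green_ident = GreenToy(:Identifier, 2,
                       [GreenToy(:Identifier, 1, GreenToy[]),
                        GreenToy(:Whitespace, 1, GreenToy[])])
syntax_leaf = SyntaxToy(green_ident, 1, SyntaxToy[])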

I somewhat suspect we may need a tree type, or some utilities, somewhere between GreenNode and SyntaxNode: GreenNode isn't much fun to work with directly, and SyntaxNode doesn't include trivia.


The example shown in helix-editor/helix#5231 (comment) is a good one, though I've got some other thoughts about it: I feel like moving code around like that generally requires indentation or other whitespace changes, and thus it's a hopeless task to think that a "proper" attachment of trivia would ever solve this problem.

Instead, it would be better to focus on really good code reformatting which tries to preserve the ambient style while also adjusting to the location where the code is pasted. (I've been thinking about formatting a lot lately. I feel like training an ML model of whitespace is the right way to go here - with a data driven approach there's some hope of going beyond a bunch of complicated heuristics and into the land of "good taste in formatting is following these examples but knowing when to break them". Which is how humans express style guides. I've yet to see any formatting tool achieve this to my satisfaction.)
