New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TOML stdlib] Support parsing comments into data structure so they can be written back out #42697
Comments
I thought about this but one reason why I didn't attempt this is that the parser here is also used in Julia Base for loading pacakge and there you want it to be as simple as possible. |
That makes sense, but it seems a real shame to not allow comments in Project.tomls or things like preference files (though I haven’t used those yet). Maybe Base can use something like a (vendored copy of?) an older version of the package until all the bugs are worked out, eg some kind of LTS release of the package. |
There is no strict requirement that Base uses the same parser, no. The first step however is to actually have a TOML parser that produces a "concrete syntax tree". You would need to keep track of comments, indentation, if tables are inline or not etc. Probably a bit tricky to get right. |
This is very tricky to do. As @KristofferC says, you'd want to parse a "concrete syntax tree" instead of a simple data structure that is currently returned. But you still want that to behave like a data structure since otherwise you can't access the parsed data. Ok, so maybe you can have a concrete syntax tree type that also behaves like a dict/array/whatever and remembers comments. Does it also allow modification of the data structure? What happens to comments when you modify the data structure? I can come up with messy examples, but I suspect you can see what I'm getting at. The interaction between comments and data structure modification is just super unclear, let alone the implementation. Note that without modification this is a non-issue: if you only need to read a TOML fine and not modify it then preserving comments doesn't matter. That doesn't mean that this is impossible, but before any implementation is considered, the first step is having some reasonable rules about what happens to comments when the TOML data is modified. |
Yeah, that makes sense. I was thinking maybe there can be a canonical form that it could be parsed into, so that it wouldn't fully roundtrip with spacing and formatting and everything, but could at least preserve the contents of the comments. E.g. every value in the nested set of dicts is a tuple with the actual value and any comments encountered in the parsing of that value, and they get written out in some canonical way (the comment is always written the line above the value or something). With something like that, I think modification is more straightforward; either you modify the comment or not, and if you drop the key then you lose the comments. And the dict with tuple values could of course be an AbstractDict that exposes the value element of the tuple as the value for Obviously there's a lot of details in there to figure out in there still (how to choose which value to associate each comment to, how to write them out etc). |
Another consideration: currently we canonicalize the format of a file when we write the TOML out — we sort table keys and sections. We also convert some ways of expression data structures to other ways of expressing the same data structure. Would this involve not doing that and keeping things in the same order? If so, I'm not sure it's worth it. Keeping the file in a canonical form is really quite nice, IMO. Maybe we can simplify the scope of this: what kinds of comments did you want to preserve? Some comments are much more obvious how to preserve than others. |
FWIW it looks like tomlkit does the full white-space preserving thing in 4k lines of python, which doesn't sound absolutely terrible. (edit: I'm not saying we should do that though).
Yeah, I don't think it needs to and I don't think it's incompatible with canonicalizing, we just should also parse and canonicalize the comments IMO.
Hm, I was thinking things like any subtleties in compat annotations or maybe explaining dependencies name = "MyPkg"
uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76"
version = "1.0.3"
[deps]
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
# `Dep` is used for fooing bars
Dep = "0796e94c-ce3b-5d07-9a54-7f471281c624"
[compat]
julia = "1"
Dep = "~1.1" # see issue#123 before bumping to v1.2
[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
[targets]
test = ["Test"] |
Would be helpful to see what its rules are. Based on your example, it looks like the rule is that a comment on a line by itself is associated with the following item and a trailing comment is associated with the item where it appears. What about comments before a table section? name = "MyPkg"
uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76"
version = "1.0.3"
# what is this comment attached to?
[deps]
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
# `Dep` is used for fooing bars
Dep = "0796e94c-ce3b-5d07-9a54-7f471281c624"
[compat]
julia = "1"
Dep = "~1.1" # see issue#123 before bumping to v1.2
[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
[targets]
test = ["Test"] Do blank lines affect this? Is there a way to put a comment at the top of the file that won't get deleted if the first key in the file is deleted? Is there a way to put a comment at the end of a section without it being attached to the following section? Btw, I suspect that access to reading or modifying comments is not worth it: they should be viewed as something that humans insert into the file manually which are only for human consumption. |
tomlkit seems to have zero documentation of any of its design decisions for how comments are associated with TOML data, which is not exactly reassuring. I would guess that its behavior is full of unpleasant surprises. |
Hmm, maybe there should be a notion of a top-level comment as well? So something like # this is a "top-level" comment since there is a blank line following it
name = "MyPkg"
uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76"
version = "1.0.3"
# This comment is attached to the `deps` table since there is no blank line separating it
[deps]
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
# `Dep` is used for fooing bars
Dep = "0796e94c-ce3b-5d07-9a54-7f471281c624"
[compat]
julia = "1"
Dep = "~1.1" # see issue#123 before bumping to v1.2
[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
[targets]
test = ["Test"] |
Ok, now we're getting somewhere. So there seem to be three kinds of standalone comments (as opposed to comments on the same line as or inside of some data):
That still leaves some questions. What about this: foo = 123
# comment
bar = 456 Is this comment associated with foo = 123
# comment
bar = 456 This comment doesn't look associated with either foo = 123
# 1
# 2
# 3
bar = 456 Do we view this as one big comment block attached to |
Those rules make sense. I wonder though if one could get away without trailing comments? And just have "any top-level comment in the section is canonicalized as a leading comment". So then foo = 123
# 1
# 2
# 3
bar = 456 would be canoncialized as # 1
# 2
foo = 123
# 3
bar = 456 And I think the same could be inside a table etc (though maybe this is what you mean by a section). In other words, a comment can either be attached or not, and if it is not attached it shows up at the top of whatever container it is in (the whole TOML file, the table, etc). And for what counts as |
We could, but eliminating trailing comments doesn't simplify the problem at all: if we decide that comment 2 belongs to |
Ah, I'm not saying 2 is associated to |
Ah, ok. That's certainly simple enough. Not sure how people will feel about it though — I suspect that people will feel that this is moving their comments around arbitrarily, but maybe this is fine. I do think that at least inline comments should be supported and remain attached to whatever item they're next to. |
Note that rather than modifying TOML to return a special concrete syntax tree data structure, you could instead have a |
Yeah, I agree it might be seen as annoying. I think however going from "comments get deleted" to "attach your comments like docstrings or they get moved to the top" is a big win and a simple enough rule that I'm hoping folks will just be happy to have comments and quickly learn the rule. Though maybe once we get used to having comments we will want more and more 😅. But I do think moving is a lot better than deleting (deleting is scary especially if it seems unpredictable, e.g. you don't know the rules, and feels like data loss) so I think some kind of fallback is needed if we don't want to support all positions, and putting them at the top seems simple enough and supports the "intro comment" usage. But maybe there's a better fallback. |
Trying the "comments are just a block of text that goes before an element" approach seems like a good idea. I like that it associates just one (multiline) comment string with each node in the TOML data structure. It would transform a comment like this: foo = 123 # comment into a comment like this: # comment
foo = 123 but I think that's ok. Similarly it would turn this: # comment 1
foo = 123 # comment 2 into this: # comment 1
# comment 2
foo = 123 I would probably make the API like this: data, comments = TOML.parsefile(file, comments = true)
TOML.print(io, data, comments = comments) The |
What's holding this up? I don't know what other people think, but I'd be fine with this approach. |
As always in open source, the default is for nothing to happen, because someone needs to spend time for it to be otherwise. So "no blockers" doesn't mean "it'll happen now", it means "now anyone who wants to spend time on it is empowered to do so". So comments like this are pretty annoying in my opinion, because usually it means you want other people to spend time on it but don't want to spend your own. edit: FYI, you can convey pretty much the same thing without annoying people as much by saying something like "+1 for this feature, it would be great to have. The approach sounds fine to me." The key difference is that would be expressing your desire for the feature without implying it is someone else's duty to implement it for you. |
Didn't mean it to be read in any other way, apologies if the tone came out wrong. Also was genuinely curious if there was any fundamental blocking issue. I actually thought (probably very naively) that this wouldn't be so difficult. A practical solution (perhaps) is to support only small comments at the end of a line:
For me this is the most important thing, since I always forget why I put a compat bound. I would say if a comment is at the end of a line like this, then it's fine to remove that line (with the comment) after All the modifications Pkg does to Project.toml file are adding or removing entire lines (is that so?). What if: Before modifying the Project.toml, we copy all the lines into a variable. This hack though would probably go on Pkg.jl, not in TOML.jl. Again, this is probably naive since I'm only considering how I typically use my Project.toml, and there are likely edge cases I'm not seeing. |
Ah okay, sorry for misinterpreting. I don’t think there are any fundamental blockers; seems like the API Stefan laid out would work and it “just” needs implementation. I’m not really sure how hard that would be; I took a quick look and saw comments are currently handled with a few functions that skip them when encountered, so I guess that would have to be replaced with functions that parse them and associate them with the right keys. |
so we can e.g. use comments in Project.toml's without Pkg removing them.
I found toml-lang/toml#284 which led me to https://github.com/joelself/tomllib (rust, 5 years since the last commit so maybe not supporting TOML v1) and https://github.com/sdispater/tomlkit (python, actively updated) which do support comments.
I wonder what would be involved in doing this? I wonder if a
TOMLDict
would have to change from just aDict{String, Any}
.The text was updated successfully, but these errors were encountered: