-
-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Declarative Grammar for the Nix Language #1102
Comments
Looks like the BNFC source requires I think compiles using Not sure about the portability of GHC-generated LLVM bitcode... probably has the same issue as GHCJS + Node should still definitely be workable, assuming we are okay with adding Node to our build closure (AFAIK it's available on pretty much every platform, considering that it is based on V8, so bootstrapping shouldn't be a big deal, and its runtime closure is only 30 MB). |
cc @Gabriel439 |
Can BNFC track state for our string escaping logic? See Gabriella439/nixfmt#3 |
@domenkozar |
2: Performance. Sometimes we do parse quite large files that are mainly JSON-like data describing even thousands of packages. I expect these are relatively easy on the evaluator and it wouldn't be nice if it should slow down significantly. In as slightly different 3: Making nix depend on Haskell stuff could be problematic, but IMHO it wouldn't be a significant obstacle if we could lift that dependency in released tarballs (by distributing the generated stuff within; few enough people will want to patch the parser). x: Some small corners of the syntax do feel unintuitive to me, but I expect most users won't appreciate a BNF-like description. True parser "bugs" seem very rare to me. |
How fancy a grammar do we need? I suppose we could write a grammar that can be compiled to C++ by C++ and compiled to Haskell by Haskell. |
There are some less difficult changes we could make to the Nix expression parser that may still be worthwhile:
|
Hmm? I am quite surprised by this. |
At the very least, it requires fancy lexing to be parsed remotely efficiently. |
Sure that makes sense, but is a rather different from being context-sensitive :). |
Fancier lexing might work. I expect |
IIRC it shouldn't matter that the |
Ericson2314: yeah you're definitely right; it was one of those situations where my inability to come up with a working BNFC led me to rationalize vcunat: I had that thought, but what of |
Hmm, yes that's also a correct antiquation :-) If the lexer does something like: |
Yeah we just need context-free, not regular, lexing. This isn't so bad: you can think of s-expresion -> lisp and token tree -> Rust as an analogous composition of context-free grammars. A "token tree" language for Nix, as @vcunat hints at, could look something like:
the string stuff is really nothing but escape sequence handling---until you actually parse an antiquoted token tree its regular. I did a fairly ugly and probably not-quite-correct job---somebody with more experience in this area could clean it up. In practice, I'd hope we could use some sort of macro / parameterized rule to deduplicate string and multi-line-string. |
I think this should be a commit hook that generates C++ code to check in to minimize the closure size This discussion also won't get very far until we actually implement a proof-of-concept parser using BNFC to get actual numbers on how fast it is compared to the current lparser One other approach is to factor out Nix's lexer logic into a small C library that can be linked into other languages since the lexer contains most of the complexity. This C library would just take an input string and output a stream of tokens. Once you have something that can generate the token stream it's much easier to write a parser in each downstream language. |
@Gabriel439 one problem with that is the optimal way to stream is in fact quiet language-specific. There's great things Haskell and Rust do that are just too much of a pain to write in C (not sure about C++). |
C can "stream" well enough e.g. by invoking a callback with each token separately. If you inline these calls, it could be rather efficient. EDIT: but we're digressing, I think. |
FWIW I've made an attempt at the BNF and LEX in NixIDEA using Grammar-Kit(extendedPin false). They're not perfect though, the lexer keeps a state and I had to manually add I would also really like it if there was one canonical BNF/LEX definition to consume/work with. |
BNFC's LBNF grammar targets LALR(1) parsers generators which are not powerful enough to parse Nix. The one-token look-ahead cannot decide whether
|
It might be interesting to see if we can describe Nix as a parsing expression grammar (PEG), and, if so, evaluate the performance of a parser generated from such a PEG. In discussion with @peti and others, we came to the conclusion that we might be able to get away with finite lookahead if we restrict the maximum length of a Symbol. Basically, when we encounter a
I can't think of any other cases, but I am not an expert on the Nix grammar. |
I've started implementing a PEG parser for nix based on PEGTL a while ago. The ambiguity of attribute set and argument set causes backtracking and is of course a performance penalty. The other situation where backtracking happens rather often is caused by native url parsing ( I can share the work if someone want's to have a look, but the current state is nowhere near readable 😆 |
@groxxda Please do share, I'd like to see it. I think the URL parsing can be handled by special-casing (not in such a way that it compromises the correctness of the grammar; we can use take advantage of the assymmetric choice in PEG to do such optimizations) the scenario where the URI begins with |
@taktoa Grammar-Kit parsers are PEGs |
Historical note: Once upon a time Nix used SGLR instead of Bison/Yacc. SGLR is a scannerless GLR parser, so lexical syntax is defined in the same formalism as the context-free syntax. I switched to Bison/Yacc because parsing speed wasn't quite good enough (you don't want a command like Here is the last version of the SDF grammar: https://github.com/NixOS/nix/blob/c41a3ec3a9ebf0bfde5fe23fa10d800afe2d9bd9/src/libexpr/nix.sdf Nix used libsglr in C++, but you can also use it from to command line to get an AST in ATerm format (e.g. |
@edolstra thanks for the insight 👍 My current state on the parser implementation with PEGTL is the following: I'd be very thankful if you have an idea how I could get more meaningful numbers for the current parser. |
I've created a tool that parses Nix files provided on the command line using the current Bison-based parser, but does not print the output expression. Running this command with all of nixpkgs as input gives roughly 2 seconds of runtime, so presumably that's our target for performance. Currently, nixexpr (the parser @groxxda is working on) seems to parse all of nixpkgs in about 2.3 seconds, but currently does not desugar its AST; in the current Flex/Bison parser, we do a lot of desugaring and correctness-checking (e.g.: for unbound variables) before we escape the parser. I think we should split these phases; it may cost some performance when we evaluate everything, but it will save some performance when we only want to evaluate a small portion of nixpkgs (and this is usually the case). Also, by splitting parsing into separate parsing and checking phases, we can use the parsed output for things like editor integration, which need to handle partially incorrect input and need an AST that can be pretty printed back into the text it was parsed from without data loss. Plus, having the AST parser and pretty printer be exact inverses makes them easy to test with things like autocheck. Of course, when you are developing and want these checks to occur at "compile-time", rather than at evaluation time, you can just call nix-instantiate with an appropriate flag, so there's really not much loss. |
BTW, @groxxda, I'm currently writing a little parser generator in Haskell that takes a PEG grammar and targets PEGTL (and also it tries to approximate a BNFC-like syntax). So far I've written an ADT for a PEG and I'm now mostly done writing a parser for that type (using trifecta). If you could come up with BNFC-style rules (augmented with asymmetric choice, |
@taktoa Note there is a ABNF -> PEGTL converter written in PEGTL: https://github.com/ColinH/PEGTL/blob/master/examples/abnf2pegtl.cc |
I was about to bring up Earley but apparently #1491 . |
That issue is actually about something different: the ease of writing parsers in Nix. |
Can we close this issue? |
Sure, though at some point it would probably still be a good idea to rewrite Nix's parser. |
I completely agree. I think this ticket should be closed solely because I don't believe that keeping it open serves any purpose. That's not to say that the subject matter discussed in this ticket wouldn't be important -- it is! |
Intro
I think we probably all agree that it would be nice to have a proper BNF for the Nix language, and that ideally the "official" parser would be compiled from that BNF. Here's why:
nix-parse
executable that either pretty-prints the AST or outputs it in JSON/XML/s-expression format).There is a tool that would allow for such a thing, called BNFC. It has a very BNF-like syntax and has many backends (C/C++ via Bison/Flex, Haskell via Happy/Alex, OCaml, LaTeX for documentation), and can even generate pretty-printers.
Obstacles
The obstacles I see to using BNFC for the Nix parser are:
Obstacle 1
Regarding obstacle 1, this is unavoidable but I might be willing to do a large amount of the work here (though it would be nice to have a point of contact on IRC for questions about the Nix internals).
Obstacle 2
Regarding obstacle 2: I think we mainly need an answer to the question "How much time is currently spent in the Nix parser versus other parts of evaluation?". If the answer is "not very much", then the speed of the generated parser will probably not be much of an issue. Otherwise, I would be willing to write a BNFC grammar and compare its speed to the current parser, assuming we agree on a solution for obstacle 3.
Obstacle 3
Regarding obstacle 3, I have thought of a few different options for mitigating the issue (I've put the "cons" associated with each option in bold, and any notable "pros" in italics):
Mitigations
nativeBuildInput
.-fvia-C
, check that into a repo, and have Nix depend on the built version of that code.-fvia-C
may not support the required extensions.patchelf
s in the libraries needed for GHC's runtime (GMP, libc, and libpthread, mainly).Personally, I would say that:
-fvia-C
or JHC.Summary
I might be willing to make the BNF for Nix and do the preliminary work to compare efficiency with the existing parser, iff people think this is a good idea.
The main open questions are:
-fvia-C
or JHC?The text was updated successfully, but these errors were encountered: