-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lexer sometimes is not reversible #360
Comments
Well, it is standard that a lexer (or regular expression matcher) takes the longest match, so
then |
@andreasabel , hi, thanks a lot for answer. It seems you don't understand me. My point is this: BNFC should guarantee that printing is always reversible. In other words, BNFC should guarantee that for any AST "x" the following statement is true: "if If BNFC unable to provide such guarantee, then BNFC should give error instead of producing .y files. I want reliable tool. Such that I can simply write input file for BNFC and be sure that BNFC will check everything for me |
Ok, I see, Askar, you want what is often called a round-trip property giving you correctness of your parser and printer. I think you cannot get this property in the form you stated, without any constraints on Consider this BNFC grammar: Prg. Program ::= "return" Ident; Then A typical round-trip property is (writing
A bit slopply said, In the OP, if I take One way to exclude your counterexample would be to forbid whitespace in tokens. But there are other obstacles to the round-trip property.
All in all, lexer, parser, and printer need to be carefully developed to ensure the correctness property. This is likely a few years R&D time, and I would do this in Agda rather than Haskell, in order to prove the round-trip property. And, I would not target lexer and parser generators, but output code directly a parser, so I can reason about its behavior in Agda... Anyway, do you have a concrete list of situations you want to be warned about by BNFC? |
I agree. For this reason printing should be partial function (something like I think printing and parsing should be functions with such signatures: print :: AST -> Maybe String
parse :: String -> Maybe AST And the following should be true: Of course, this is just sketch, you may have "Either SomeError String" instead of "Maybe String", etc. Currently I am searching "perfect parsing solution" for my needs, I often need to parse something. I often use Haskell. I need something like BNFC, i. e. a tool which takes complete description of a language and produces both parser and printer. (Not necessary pretty printer, just a printer). I described my problem here: https://mail.haskell.org/pipermail/haskell-cafe/2021-January/133275.html , I recommend you reading that, I give a lot of links to other projects. It seems BNFC currently doesn't provide reversibility, so I will continue searching. If I found nothing, I will write such solution. I already spent a lot of time thinking how to do this, so I am nearly sure this is possible.
All they can be dealt with. By putting additional restrictions on the language, which should not be too restrictive in practice.
Yes. For this reason BNFC should find all pairs of token types which should be always separated by whitespace (in your example identifier and semicolon will be such pair). Such pairs can be easily found using some library for dealing with regular languages, i. e. library which can determine whether two given regular languages intersect. For Haskell such library is http://hackage.haskell.org/package/kleene . (But, of course, you can manually implement needed algorithms.) In fact, I already started to write my parsing solution. I already wrote lexer using "kleene". The next step will be converting list of tokens back to text. I am sure I will get reversibility thanks to rejecting of "wrong" languages using "kleene". I can show you code if you want.
Yes. We should check ambiguity and reject ambiguous grammars. This is possibly the hardest part, because checking for ambiguity for arbitrary CFG is impossible on Turing machine. But we can restrict ourselves to some subset of set of all CFGs (see my letter https://mail.haskell.org/pipermail/haskell-cafe/2021-January/133275.html for options). Or we can just try to find ambiguity using brute force and (if grammar seems to be non-ambiguous) parse input using some method which can detect ambiguities in parse time, for example, Earley algorithm. So we will find most ambiguities at moment of invocation of BNFC and will find remaining ones at moment of actual parse. This is method I plan to use in my solution.
Yes. So we should either force user to manually rewrite this grammar to be non-ambiguous (this may grow its size x2) or give user some tools to disambiguate it (bison has such constructs).
Yes
I am more optimistic. :) I already did some research. I think it will take me 1 month to finish my project. I already started, as I said. I already have plan for my project. First, I will write reversible lexer. This could be done by carefully checking properties of regular languages of tokens using "kleene". Then reversible parser. I will probably use some existing implementation of Earley algorithm. Good property of this algorithm is this: it gives all possible parse trees, so we can detect ambiguity in parse time, if we overlooked it before. I will use brute force in parser generation time combined with additional checking for ambiguity in parse time, as I described. The following stage is converting raw output of parser (parse tree) to AST and vice versa. I didn't think about this a lot. But I think this is possible using something like this: http://hackage.haskell.org/package/invertible-0.2.0.7/docs/Data-Invertible-TH.html . As additional guard I plan to check my implementation using https://hackage.haskell.org/package/QuickCheck . I. e. just test that reversibility equations using QuickCheck.
I want full reversibility. Okey, I understand you. I will continue my search and probably will write my own solution. Also. https://www.brics.dk/xsugar/ does reversible transformation between arbitrary language and XML. Xsugar even proves non-ambiguity for CFG. So, Xsugar proves that this is possible. Xsugar uses single file to describe both lexer and parser, Xsugar's descriptions are similar to BNFC's. The reason why I don't want to use Xsugar is this: Xsugar's lexer is not flexible, in particular it is not even possible to make Xsugar's lexer to correctly parse JSON strings. Also, https://github.com/boomerang-lang/boomerang does reversible transformation between two languages. So, it proves possibility of such solution, too. The problem is this: boomerang does not produce AST, it produces text (as well as I understand, I am currently investigating this project). So, I will have to convert target language, say, to s-expression and then parse this s-expression in my Haskell code. Still, boomerang is reasonable, and possibly I will stick to it |
Thanks for the pointers. Other works that you could look at verified parsing:
My take on this would be a more refined version of Unfortunately, atm I do not have time to work on the grand plan of a verified parser generator. My goals for BNFC since I took it over are more modest: If BNFC accepts a grammar, then the parser generation does not crash and produces compilable code. This is still not true in all cases, I did work on this in #235, #278, #289... |
@andreasabel , I wrote library I sketched here in #360 (comment) . Here it is: https://mail.haskell.org/pipermail/haskell-cafe/2021-July/134217.html |
I'm using BNFC 2.9.1. Consider this .cf file:
Generate haskell code from it and call this from
main
:You will get
Right (EAB (AB "a b"))
. I. e. your lexer is not reversible.I think it is hard to make lexer reversible in this pathological case. But at least BNFC should print some error to warn user that something is wrong
The text was updated successfully, but these errors were encountered: