Allow designation of tokens as garbage #46

Closed
arrdem opened this issue Aug 20, 2013 · 6 comments

@arrdem

arrdem commented Aug 20, 2013

At present, one can emulate the behaviour of a lex/yacc grammar in instaparse only by modifying the source grammar to account explicitly for the whitespace that a lexer would traditionally discard. Over the course of the C or Pascal language grammar this can easily amount to hundreds of rule changes, since one must allow for whitespace in every nonterminal concatenation, whereas the standard source grammars simply say "discard whitespace" and assume a separate lexer.

It would be awesome if the top-level parser took an :ignored "ignored-forms-rule" option, which would be used implicitly as a token sink throughout the grammar.
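To make the request concrete, here is a minimal sketch in Python of what an implicit token sink could look like. It is not instaparse code: `match_sequence` and its `ignored` parameter are hypothetical names invented for this illustration, and the "grammar" is just a flat sequence of regular expressions.

```python
import re

def match_sequence(patterns, text, ignored=None):
    """Match regex `patterns` in order against all of `text`.

    If `ignored` is given, text matching it is silently consumed before
    every pattern and at the end of input (the hypothetical token sink).
    Returns the list of matched strings, or None if the match fails.
    """
    skip = re.compile(ignored) if ignored else None

    def skip_ignored(pos):
        # Consume one run of ignorable text, if any, starting at pos.
        if skip is None:
            return pos
        m = skip.match(text, pos)
        return m.end() if m else pos

    pos = 0
    captured = []
    for pattern in patterns:
        pos = skip_ignored(pos)
        m = re.compile(pattern).match(text, pos)
        if m is None:
            return None
        captured.append(m.group())
        pos = m.end()
    pos = skip_ignored(pos)
    return captured if pos == len(text) else None


assignment = [r"[a-z]+", r"=", r"[0-9]+"]
print(match_sequence(assignment, "x = 42"))                  # fails: spaces not allowed for
print(match_sequence(assignment, "x = 42", ignored=r"\s+"))  # spaces consumed implicitly
```

Without `ignored`, every place where whitespace may appear would have to be written into `assignment` by hand, which is the rule-by-rule burden described above.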

@Engelberg
Owner

A few people have asked for the ability to work with pre-tokenized
strings. The reasons, so far, have fallen into two categories:

  1. Wanting to work with a language where indenting is meaningful (like
    Python), so it would be easier to have an "indent" and "dedent" token.
  2. People wanting to ignore whitespace.

Your request falls into the second category. So this may resolve itself
once I figure out how to incorporate the token request. However, I am
intrigued by your suggestion that there may be a simpler way to handle
category #2, with some sort of :ignored keyword. Can you elaborate on this
some more and provide an example? I'd love to hear how you envision this
working. If there is a way to address this need without providing
full-fledged token handling, that would be useful for me to think about.

@arrdem
Author

arrdem commented Aug 21, 2013

In a classic shift/reduce parser as generated by Yacc the entire
generation of the parse tree is a result of the parser state
transition tables which transform known sequences of tokens into
tree fragments and compose tree fragments.

What I'm proposing is simply a special case of such a rule
transition which, rather than yielding a token or tree fragment,
yields an empty tree fragment or nothing at all. This would allow
any token series which produces the designated "garbage" token to
be rendered entirely invisible to the parser.

Let's assume that I have some grammar production
"translation-unit" which happens to represent the grammar for the
ANSI C99 language. This grammar comes with the caveats that the
sub-grammar for macros is entirely discarded (all macros having
been textually expanded before the 'true' AST is built), and that
all comments and all whitespace are discarded as well.

I can trivially provide a grammar for macros, as they are a
regular language, and likewise for whitespace and C99
comments. From these three I can define a rule "junk" to be the
alternation, with * repetition, of these tokens. If I indicate to
the parser engine that "junk" is to be discarded, just as < >
blocked rules match while producing no tokens, the "junk" rule
would be able to match everywhere, consuming input but producing
no tokens.
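As a rough illustration of such a "junk" rule, the following Python sketch builds it as a starred alternation of three regular expressions and discards it before every token. This is only a sketch under stated assumptions: the names `JUNK`, `TOKEN` and `tokens` are invented here, preprocessor handling is reduced to "a `#` up to the end of the line", and the token pattern is a toy, not a real C99 lexer.

```python
import re

WHITESPACE = r"\s+"
COMMENT = r"/\*.*?\*/|//[^\n]*"   # block comments and line comments
DIRECTIVE = r"#[^\n]*"            # assumption: directives fit on one line

# "junk" = (whitespace | comment | directive)*
JUNK = re.compile(rf"(?:{WHITESPACE}|{COMMENT}|{DIRECTIVE})*", re.DOTALL)

# Toy token pattern: identifiers, integers, or any single other character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]")

def tokens(source):
    """Return the tokens of `source`, silently dropping all junk."""
    pos = 0
    out = []
    while True:
        # JUNK can match the empty string, so this always succeeds.
        pos = JUNK.match(source, pos).end()
        if pos == len(source):
            return out
        m = TOKEN.match(source, pos)
        out.append(m.group())
        pos = m.end()


src = "#include <stdio.h>\n/* entry\n point */\nint main() { return 0; } // done\n"
print(tokens(src))
# ['int', 'main', '(', ')', '{', 'return', '0', ';', '}']
```

Here the discarding happens in a separate tokenizing pass, which is the lex/yacc arrangement; the proposal in this thread is for the parser itself to apply the same rule implicitly.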

If "junk" happens to conflict with the primary production of the
language then I'm screwed and will get no parse tree but a valid
parse outcome. However, I suspect that your ambiguous grammar
support should be able to just do the right thing.

My apologies if I'm misunderstanding the exact mechanism that
instaparse uses; my background in untooled language
implementation is yacc and RDPs, sad being an instaparse
equivalent implemented along those lines.

@Engelberg
Owner

I've added this as an experimental feature in 1.2.3. Check it out and let me know what you think:
https://github.com/Engelberg/instaparse/blob/master/docs/ExperimentalFeatures.md#auto-whitespace

@arrdem
Author

arrdem commented Oct 10, 2013

Awesome! After reading, while I agree that whitespace obliteration will be the primary use case for this feature, I would personally use a more general name than :auto-whitespace. Other than that, looks awesome! Thanks for spending the time to add it to this toolkit.

@Engelberg
Owner

Thanks for suggesting the feature.
Do you have suggestions for a better keyword than :auto-whitespace?
I know you mentioned :ignored before, but I was worried that was too far at
the other extreme in terms of not really making clear what its intended
purpose was. Are we ignoring rules? Is information being thrown out?
That's basically what I was concerned about.

@arrdem
Author

arrdem commented Oct 16, 2013

After some pondering a better keyword doesn't really spring to mind. In either case I'd retain the demo above as part of the documentation which should serve to make its behavior plain no matter the given name. Within reason of course. :auto-whitespace would seem to be your most requested use case, so I'd just go with that. Someone who knows enough about parsers to want to do token discarding by name will read the docs and realize that :auto-whitespace is sufficiently powerful. Or at least I would hope so.
