Allow designation of tokens as garbage #46

arrdem · 2013-08-20T17:51:47Z

At present, one can emulate the behaviour of a lex/yacc grammar using instaparse only by modifying the source grammar to explicitly account for traditionally ignored whitespace characters. Over the course of the C or Pascal language grammar this can easily amount to hundreds of rule changes, as one must explicitly provide for the possibility of whitespace in every nonterminal concatenation, where the standard source grammars simply state "discard whitespace" and assume a separate lexer.

It would be awesome if the top-level parser took an :ignored "ignored-forms-rule" option, which would be implicitly used as a token sink throughout the grammar.
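To make the request concrete, here is a minimal sketch of the rewrite that currently has to be done by hand and that an :ignored option could do implicitly. It is plain Python, not instaparse: the dict-of-lists grammar representation, the `weave_ignored` helper, and the rule names are all hypothetical and chosen only for illustration.

```python
def weave_ignored(grammar, ignored="junk"):
    """Return a copy of `grammar` with `ignored` placed between adjacent symbols.

    `grammar` maps a rule name to a list of alternatives; each alternative is
    a list of symbols. This representation is an assumption made for the
    sketch, not how instaparse stores grammars.
    """
    woven = {}
    for name, alternatives in grammar.items():
        woven[name] = []
        for sequence in alternatives:
            new_sequence = []
            for i, symbol in enumerate(sequence):
                if i > 0:
                    # One manual edit per concatenation point in the grammar.
                    new_sequence.append(ignored)
                new_sequence.append(symbol)
            woven[name].append(new_sequence)
    return woven


# Hypothetical two-rule fragment of a C-like grammar.
example = {
    "assignment": [["identifier", "'='", "expression", "';'"]],
    "expression": [["term"], ["term", "'+'", "expression"]],
}
woven_example = weave_ignored(example)
```

Even this two-rule fragment gains five "junk" references, which is why the count runs into the hundreds for a full C or Pascal grammar.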


Engelberg · 2013-08-21T20:39:09Z

A few people have asked for the ability to work with pre-tokenized
strings. The reasons, so far, have fallen into two categories:

1. Wanting to work with a language where indenting is meaningful (like
   Python), so it would be easier to have an "indent" and "dedent" token.
2. People wanting to ignore whitespace.

Your request falls into the second category. So this may resolve itself
once I figure out how to incorporate the token request. However, I am
intrigued by your suggestion that there may be a simpler way to handle
category #2, with some sort of :ignored keyword. Can you elaborate on this
some more and provide an example? I'd love to hear how you envision this
working. If there is a way to address this need without providing
full-fledged token handling, that would be useful for me to think about.

arrdem · 2013-08-21T20:55:10Z

In a classic shift/reduce parser as generated by Yacc, the parse
tree is generated entirely by the parser's state transition
tables, which transform known sequences of tokens into tree
fragments and compose those fragments.

What I'm proposing is simply a special case of such a rule
transition which, rather than yielding a token or tree fragment,
yields an empty tree fragment or nothing. This would allow any
token series which produces the designated "garbage" token to be
rendered entirely invisible to the parser.

Let's assume that I have some grammar production
"translation-unit" which happens to represent the grammar for
the ANSI C99 language. This grammar comes with the caveats that
the sub-grammar for macros is entirely discarded (all macros
having been textually expanded before the 'true' AST is built),
and that all comments and all whitespace are likewise discarded.

I can trivially provide a grammar for macros as they are a
regular language, and likewise for whitespace and C99
comments. From these three I can define a rule "junk" to be the
alternation, with * repetition, of these tokens. By indicating to
the parser engine that "junk" is to be discarded, just as rules
wrapped in < > match while producing no tokens, the "junk" rule
would be able to match everywhere, consuming input but producing
no tokens.
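A rough sketch of the "junk" idea, written in plain Python rather than instaparse so that it stands alone. The `JUNK` and `TOKEN` patterns and the `tokenize` helper are hypothetical, and the macro handling is a deliberate simplification (any `#` up to the end of the line), not a real C99 preprocessor grammar.

```python
import re

# "junk" as the alternation, under * repetition, of whitespace, line
# comments, block comments, and (simplified) preprocessor lines.
JUNK = re.compile(r"(?:\s+|//[^\n]*|/\*.*?\*/|#[^\n]*)*", re.DOTALL)

# A deliberately tiny token set; a real C99 lexer needs far more than this.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[-+*/=;(){}]")


def tokenize(text):
    """Return the tokens of `text`, matching junk everywhere but emitting none of it."""
    tokens = []
    # JUNK can match the empty string, so match() always succeeds here.
    pos = JUNK.match(text, 0).end()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append(m.group())
        # Junk is consumed after every token and produces nothing.
        pos = JUNK.match(text, m.end()).end()
    return tokens


source = "#define N 10\nint x = 1; // set x\n/* block\n comment */ x = x + 2;"
result = tokenize(source)
```

Here `result` contains only the tokens of the two statements; the macro line, both comments, and all whitespace are consumed without leaving a trace in the output.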

If "junk" happens to conflict with the primary production of the
language then I'm screwed and will get no parse tree but a valid
parse outcome; however, I suspect that your ambiguous grammar
support should be able to just do the right thing.

My apologies if I'm misunderstanding the exact mechanism that
instaparse uses; my background in untooled language
implementation is yacc and RDPs, sad being an instaparse
equivalent implemented along those lines.

Engelberg · 2013-10-08T08:41:29Z

I've added this as an experimental feature in 1.2.3. Check it out and let me know what you think:
https://github.com/Engelberg/instaparse/blob/master/docs/ExperimentalFeatures.md#auto-whitespace

arrdem · 2013-10-10T02:40:27Z

Awesome! After reading, while I agree that whitespace obliteration will be the primary use case for this feature, I would personally use a more general name than :auto-whitespace. Other than that, looks awesome! Thanks for spending the time to add it to this toolkit.

Engelberg · 2013-10-10T02:57:34Z

Thanks for suggesting the feature.
Do you have suggestions for a better keyword than :auto-whitespace?
I know you mentioned :ignored before, but I was worried that was too far at
the other extreme in terms of not really making clear what its intended
purpose was. Are we ignoring rules? Is information being thrown out?
That's basically what I was concerned about.

arrdem · 2013-10-16T17:08:47Z

After some pondering a better keyword doesn't really spring to mind. In either case I'd retain the demo above as part of the documentation which should serve to make its behavior plain no matter the given name. Within reason of course. :auto-whitespace would seem to be your most requested use case, so I'd just go with that. Someone who knows enough about parsers to want to do token discarding by name will read the docs and realize that :auto-whitespace is sufficiently powerful. Or at least I would hope so.

Engelberg closed this as completed Oct 8, 2013
