Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Allow parsing stream (sequence) of arbitrary tokens, not a string #10

Open
Dimagog opened this Issue Apr 14, 2013 · 8 comments

Comments

Projects
None yet
4 participants

Dimagog commented Apr 14, 2013

This may be a duplicate of #9, but maybe not.
If I already have a lexer that generates a seq of tokens I'd like to be able to parse it with instaparse.

Owner

Engelberg commented Apr 15, 2013

When I started this project, I specifically wanted to make it easy to create a lexer-less parser. To achieve that goal, it was essential to allow regexes for the terminals. Regexes in Java only work on text, so I decided to focus on parsing strings, rather than general sequences of tokens.

The underlying algorithm could in principle be adapted to work on sequences of tokens, it just wouldn't be possible to use regular expressions with such a sequence. I don't see a realistic way to mix and match the lexer approach with the regexes-as-terminals approach -- it seems you'd have to pick one or the other to specify your tokens. I suppose I could just say "No regular expressions are allowed if you're operating on something other than strings."

One thing that worries me is that it complicates the specification of the grammar since the token definitions and token format have to be defined elsewhere.

Overall, this seems like a worthwhile extension (and your Python example is a great use case), but I need to think through some of these issues before diving in. Thanks for the suggestion.

Owner

Engelberg commented Apr 15, 2013

In the meantime, there's a workaround you could potentially employ. You could map tokens like INDENT and DEDENT to unused characters and then rebuild it as a string, then run instaparse on that. I believe ASCII characters 0-8 and 11-31 are unused and could serve as tokens.

Just a thought: could you allow mixed streams, with some pre-parsed terminals and some strings?

Another case which is hard to parse without an external lexer: mustache.with set delimiters.

Owner

Engelberg commented May 20, 2013

I'm continuing to think about this, but it's definitely a non-trivial
change that requires careful consideration.

Here's one question:

If instaparse provides for streams of tokens, is instaparse also obligated
to provide an easy-to-use lexer? Or are there already good tools for
that? Or is the stream-of-tokens scenario mostly relevant for things like
Python whitespace and mustache's set delimiters which would probably
require a very customized lexer anyway?

Right now, if I needed to parse mustache, first, I'd use a regular
expression to scan the string for spots where the delimiters change. Then,
I'd preprocess the sections, converting the delimiters to unused ASCII
characters, say, 30 and 31. Then, I'd put it all back as a single string
to use with instaparse.

I should say that 1) I'm not convinced this is the right thing for instaparse, but want to explore it 2) the workaround is neat but is pretty bad for preserving line information in error messages or anything that should preserve formatting 3) if you did decide to do it, and I'm not saying you should, I don't think you should feel the need to deliver a kick-ass lexer as well. Leave that for someone else :)

Dimagog commented May 21, 2013

I'd say that if Instaparse is given a stream of tokens, then by definition it is out of lexing business for this particular scenario. Now it's possible (but not required) that this token stream was generated by another pass of Instaparse (lexer), then massaged by some custom code, and then passed to Instaparse (parser).

What about a mechanism for injecting a partial function that would be mapped across the sequence of terminals generated by a given production? I'm not sure how you'd implement this, or whether it would even be possible, but something akin to the positive and negative lookahead features of Instaparse.

tokens ::= #(maptok %) token*

word-batches ::= #(partition 10 %) words
words ::= words+

Perhaps the parser could be given an extra :mappings argument, to map symbols to partials. This would allow for injecting a function that could inject INDENT and DEDENTS appropriately.

In a sense, this is using Instaparse as both the lexer and the grammar processor at the same time, and having a syntax for stitching both parts together in the same syntax. You could do this in code, as Mark points out, but the convenience of declaring it in the grammar would make a big difference.

This paper discusses an approach to the indentation problem

http://michaeldadams.org/papers/layout_parsing/LayoutParsing.pdf

@aengelberg aengelberg pushed a commit to aengelberg/instaparse that referenced this issue Jun 17, 2015

@lbradstreet lbradstreet Merge pull request #10 from aengelberg/aengelberg-cljs-patch-3
Tweak cljs test runner
e28de20

@daveyarwood daveyarwood referenced this issue in alda-lang/alda May 2, 2016

Closed

Experiment: streaming parse and eval #225

@Engelberg Engelberg self-assigned this May 27, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment