Allow creating custom terminal parse rules #9

Closed
Dimagog opened this Issue Apr 14, 2013 · 3 comments

3 participants

@Dimagog

Imagine I wanted to parse an indentation-sensitive language like Python.
The standard technique is to have a lexer that analyzes whitespace and generates WS or INDENT and DEDENT special tokens.
I'd like to be able to write a WS rule in Clojure and plug it into existing grammar.

@Engelberg
Owner

Issue #10 is going to be enough to remind me of what you're looking for. I don't offhand see a way to solve Issue 9 without solving 10. So I'm going to close this one and keep issue 10 open. If you have an idea about how to tackle issue 9 in a way that is independent of issue 10, feel free to reopen the issue.

@Engelberg Engelberg closed this Apr 15, 2013
@roryokane

It is possible to solve this issue without being able to parse arbitrary tokens as in #10. The Ruby PEG-parsing library Parslet supports enough functionality to describe indentation, while only being able to parse strings. The specific functionality that allows this are the dynamic, capture, and scope methods, described in “Parser construction”. Instaparse could probably support parsing of indentation by copying those three parser combinators.

Here is a Ruby program that uses Parslet’s scoped captures and dynamic to parse indentation:

require 'parslet'

class IndentationSensitiveParser < Parslet::Parser
    rule(:indent) { str('  ') }

    rule(:one_more_indent) {
        dynamic { |s,c|
            begin
                (str(c.captures[:indents]) >> indent).capture(:indents)
            rescue Parslet::Scope::NotFound
                # start at 0 indents
                str('').capture(:indents)
            end
        }
    }

    rule(:newline) { str("\n") }

    rule(:identifier) { match['A-Za-z0-9'].repeat(1).as(:identifier) }

    rule(:node) {
        scope {
            one_more_indent >>
            identifier >>
            newline.maybe >>
            node.repeat(0).as(:children)
        }
    }

    rule(:document) { node.repeat }

    root :document
end

(Parslet currently has a caching bug that makes this code not work – its cache doesn’t realize that the rule one_more_indent can return different results in different scopes – but the principle is sound.)

For this approach to work in Instaparse, parser combinators must be used, so that you can pass an actual function to dynamic. Here is a mockup of the previous parser represented in Instaparse, assuming that dynamic, scope, and capture were implemented:

(ns example.core
    (:require [instaparse.core :as insta])
    (:use [instaparse.combinators]))

(defn one-more-indent-body [source context]
      (let [captures (:captures context)]
        (if (contains? captures :indents)
          (capture :indents (cat (string (get captures :indents)) (nt :indent)))
          (capture :indents Epsilon))))
(def indentation-sensitive-parser
     (insta/parser
       {:indent (hide-tag (hide (string "  ")))
        :one-more-indent (hide-tag (hide (dynamic one-more-indent-body)))
        :newline (hide-tag (hide (string "\n")))
        :identifier (plus (regexp "[A-Za-z0-9]"))
        :node (scope (cat
                       (hide (nt :one-more-indent))
                       (hide (nt :identifier))
                       (hide (opt (nt :newline)))
                       (nt :children)))
        :children (star (nt :node))
        :document (star (nt :node))}
       :start :document))

I could have made that more concise by using merge and ebnf to represent most of those rules in strings, and representing only one-more-indent using parser combinators.

An alternative is not implementing capture and scope, and instead using dynamic along with a function to generate various parsers of different numbers of indents, equivalent to the code in this Stack Overflow answer. The following code mockup is equivalent to the code in that answer, and uses only dynamic.

(defn indent [depth]
      (hide (rep depth depth (string '  '))))
(defn node [depth]
      (cat
        (hide (indent depth))
        (hide (nt :identifier))
        (hide (opt (nt :newline)))
        (children depth)))
(defn children [depth]
      (dynamic (fn [source context]
                   (star (node (+ 1 depth))))))

(def indentation-sensitive-parser
     (insta/parser
       {:newline (hide-tag (hide (string "\n")))
        :identifier (plus (regexp "[A-Za-z0-9]"))
        :document (star (node 0))}
       :start :document))

My only difficulty in translation was converting .as(:children) – I could hide the other parts of the node rule, but I couldn’t make the children be captured under a name, since it is a function in this representation, rather than a non-terminal in the grammar map.

Depending on how much preprocessing of grammars your parsing engine does, dynamic might be hard to implement. Still, I would prefer implementing this approach over implementing #10 and parsing my own INDENT and DEDENT tokens. That would require me to write a pre-parser by hand to generate those tokens while ignoring them in places where indentation is insignificant (such as in multi-line strings), which would either duplicate the work I would have already put into describing those places in Instaparse, or require careful calls to my Instaparse parser that must not rely on indentation. Whereas this way, reading indentation is part of the main phase of parsing, so it can conveniently and safely use the existing rules describing multi-line strings and such.

@Engelberg
Owner

Thanks for the idea.

I think most people are interested in indent-aware parsing because they want to parse Python or Python-like files. So before pursuing this avenue too deeply, I'd like to be sure this approach is robust enough to handle the full complexities of Python's indentation details. For example, lines that are a mixture of tabs and spaces. Lines that split in the middle of a list, continuing on the next line (so indentation is ignored). Continuation marks at the end of the line to ignore indentation on the next line. etc.

So if you can point me to further evidence that this technique is sufficient to handle Python's full indentation syntax without a lexing pass, that would be wonderful.

On the implementation front, it sounds doable. Caching would indeed be a thorny issue (as it apparently is for the Ruby parser, since you said the Ruby version doesn't work quite right).

For example,
S = AB D | BC D
AB = 'a' | 'b'
BC = 'b' | 'c'
D = 'd'

So imagine parsing the string 'bd'. When AB matches the first 'b', the (AB D) parser triggers the D parser to try parsing starting from the second character (which works). Later the (BC D) parser triggers the D parser to try parsing starting from the second character. Because the grammar is context-free, we can safely use the cached result.

Imagine that AB and BC consumed different amounts of whitespace, and D was something dependent upon the indentation level. Now the D parse depends upon context, and the caching needs to take that context into account. That's probably doable, but I worry that it will cause an explosion of possibilities that need to be stored in the cache. Unfortunately, memory consumption caused by caching is already one of instaparse's weak points -- I don't relish the thought of adding a feature that would make it even worse. But as long as it doesn't worsen the memory consumption of parsers that aren't taking advantage of these context-saving features, it's still an option to consider.

One bit of good news is that I don't think implementing the "scope" operator would be necessary. From what I can tell, the reason it is necessary in Ruby is because they are storing everything in a mutable hash. So "scope" does some kind of stack-like preservation of the mutable hash so it can be safely overwritten. But with Clojure's immutable hash maps, this should be a non-issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment