Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nondeterministic Parsing #2241

Open
Thom1729 opened this issue Mar 28, 2018 · 2 comments

Comments

@Thom1729
Copy link

commented Mar 28, 2018

This idea has been discussed elsewhere, both here and on Discord. I'm writing it up more formally here.

Background

Sublime's parser is effectively a deterministic pushdown automaton. This makes it capable of parsing any deterministic context-free language. Many common programming languages are deterministic context-free, but not all. Some languages like YAML and Python are context-sensitive. Parsing those languages is outside the scope of this proposal. But some, like JavaScript, are context-free but not deterministic context-free. Sublime cannot correctly parse these languages.

Motivating Example

For a concrete example, consider the following two JavaScript snippets:

(foo = 100);
(foo = 100) => bar;

The first statement contains a parenthesized expression that in turn contains an assignment expression. By the scope naming guidelines, foo should be scoped variable.other. The second statement contains an arrow function whose parameter foo has a default value of 100, and foo should be scoped variable.parameter.function. This distinction is highly visible in most color schemes.

In very simple cases like this, we can write syntax rules that can tell the examples apart. But such rules must fail in more complex cases for two reasons. The first and best-known is that Sublime only parses one line at a time. The second and more important is that in the example, 100 could be replaced by an arbitrarily complicated expression, which cannot generally be matched by a regular expression anyway.

Here is a more complex example:

const myFunc = ({
    someArg,
    anotherArg: renamedArg,
    argWithA: (defaultValue()),
    ...others,
}) => {
    // Do stuff.
};

Code like this is very idiomatic. In particular, it is a common way of defining a React component. I have code like this all over the place. Unfortunately, Sublime will parse the arguments list as a parenthesized expression and the function parameters will not be highlighted correctly. (The JavaScript syntax could instead guess that it was an arrow function arguments list, but this would be wrong even more often.) There is simply no way for a deterministic parser to handle this common example.

Workarounds

Currently, we work around the problem with a regexp lookahead. This fails when there are newlines between the opening paren and the eventual =>. It also fails when there are complex expressions within an arrow function argument list. These workarounds fail conspicuously in many common cases. Unfortunately, there is no way to substantially improve them. When in doubt, we parse an ambiguous construct as a parenthesized expression, which does fail gracefully.

Implementing multiline regexp matching would improve the situation. Lookaheads could identify more common patterns. The heuristics involved would increase in complexity as well as effectiveness. Perfect parsing would still be beyond reach with a nondeterministic context-free parser.

Implementation With Oniguruma Regexps

The astute reader may observe that old-style Oniguruma regexps actually have more power than the Sublime parser itself. In theory, such a regexp could "look past" a complex expression in a way that Sublime's parser cannot. Such a regexp would be extremely complicated. A first attempt at an Oniguruma regexp that recognizes a JavaScript expression is as follows:

(?x)((?:
  [^(){}\[\]/'"`]+ # Everything else
  
  | //.*?$    # Line comment
  | /\*.*?\*/ # Block comment

  | '(?:[^'\\]+|\.)*' # Single-quoted string
  | "(?:[^"\\]+|\.)*" # Double-quoted string
  | /(?:[^/\\]+|\.)*/ # Regexp literal

  | `(?: # Template literal
      [^`\\\$]+
      | \.
      | \$(?!\{)
      | $\{\g<-1>\} # Interpolated expression
    )*`

  | \( \g<-1> \) # Parentheses
  | \[ \g<-1> \] # Square brackets
  | \{ \g<-1> \} # Curly brackets
)*)

This is certainly a formidable regexp. Anyone extending the syntax (e.g. with YAML Macros) would likely have to modify this arcane expression. It would use the old, slower regexp engine, and its heavy reliance on recursive subexpression calls raises performance concerns. Parenthesized expressions are very common in JavaScript, so this expression would be run many times in a typical parse.

While this expression should handle valid JavaScript, it is unlikely that it would handle invalid JavaScript in exactly the same way as the syntax itself, and if it doesn't, then the user could see badly mangled highlighting while they type. Finally, this approach requires unlimited multiline lookahead; without that, it is a non-starter. For all of these reasons, I don't believe that this is a good or adequate solution to the problem.

Proposed Solution

To parse nondeterministic context-free grammars, I propose an extension to sublime-syntax to allow it to function as a nondeterministic pushdown automaton. This is a fully general solution.

Example

Before delving into the technical details of the proposed solution, this is what it might look like when used to solve the motivating example above:

contexts:
  expression:
    - match: (?=\()
      push:
        - parenthesized-expression-after
        - parenthesized-expression

  parenthesized-expression:
    - match: \(
      scope: punctuation.section.group
      set:
        - meta_scope: meta.group
        - match: \)
          scope: punctuation.section.group
          set: parenthesized-expression
        - include: expression

  parenthesized-expression-after:
    - meta_backtracking_key: PARENTHESIZED_EXPRESSION
    - meta_backtracking_else: arrow-function
    - match: =>
      fail: PARENTHESIZED_EXPRESSION
    - match: (?=\S) # Anything else.
      pop: true

Explanation

I presume that the parser currently maintains a stack of contexts, and that it can look up meta information related to each context. In order to allow nondeterministic parsing, it will need to maintain a stack of (context, index) pairs, also keeping track of the character index when the context was pushed. The overhead for this should hopefully be negligible.

This proposal adds two context-level meta rules, meta_backtracking_key and meta_backtracking_else. These rules would be mutuall dependent; using only one wouldn't make sense. It also adds a fail key that would be used in match rules. This fail key would be exclusive with push, pop, set, and embed.

When a rule with a fail key is matched, the parser will:

  • Pop contexts until it has popped a context whose meta_backtracking_key equals the value of the fail key.*
  • Reset the current parsing index to the one popped off the stack.
  • push the value of the popped context's meta_backtracking_else.
  • Resume parsing.

* If no such exists, shout at at the syntax author; this is an avoidable error. It may be possible to detect this statically.

Discussion

This proposal should allow Sublime to parse any context-free grammar, a rare distinction. The benefits to JavaScript in particular would be significant.

Performance

When no fail rule is encountered, the performance impact of this proposal should be negligible. The only additional overhead is a single index per context on the stack. As a result, there should be no performance impact for existing syntax definitions.

Under this proposal, the parser only does extra work when it encounters a fail rule, signifying a wrong guess. In this case, the penalty is proportional to the amount of backtracking that the parser must do. This, in turn, depends on the nature of the language and the syntax author's use of the feature. (In this respect, this proposal resembles a backtracking regexp engine.)

For example, in JavaScript, there would be one "miss" per arrow function.** Each arrow function argument list would have to be re-parsed. Typically, these lists are short and simple, although one could construct pathological cases that would require re-parsing large chunks of code. The worst-case runtime is exponential (like backtracking regexps).

** In practice, many of these misses could be avoided via lookaheads that "short-circuit" the cases that can already be handled today. It's an open question whether such lookaheads or this implementation would be faster in such simple cases.

There are some more sophisticated solutions offering guaranteed-polynomial-time parsing, such as Earley's algorithm or GLL. If better worse-case performance were desired, it's likely that Sublime's implementation of this proposal could be extended to incorporate more sophisticated techniques. For instance, pathological reparsing could be avoided using a sort of branch prediction cache.

In a typical syntax, in a typical file, the performance should be perfectly adequate -- no worse than occasionally using an Oniguruma regexp.

Complexity

One of the main benefits of this proposal is that it does not interfere with any existing syntaxes. Syntax authors would only need (or want) to use it when no other solution exists.

When it is needed, I believe that this proposal is the simplest thing that could possibly work. I've played around with other implementations, and this was the simplest to implement and use.

Other Implications

Any parsing extension that works across multiple lines runs into a known issue: the current line limitation exists to enable an optimization. That optimization, in its most general form, is incompatible with correct parsing of JavaScript and other languages. However, we can preserve that optimization to the greatest possible degree.

For each line L, we would keep track of the earliest prior line P whose parse is affected by the contents of L (one number per line in the view). When no nondeterministic contexts are used, each line would refer to itself. However, when a context with a meta_backtracking_name is popped from the stack on line L, we would mark L as affecting the line P that pushed that context (which we're already keeping track of on the context stack) We would mark L thus regardless of whether the context was popped normally or as part of a fail action. Then, when doing a partial reparse that included line L, we would see that line P would also need to be reparsed, so we'd expand the reparsing region to include it. (We wouldn't have to check these newly included lines. This is an important advantage over multiline regexp matching, where this expansion could snowball.)

By keeping track of dependencies like this, it should always be possible to re-use partial results from prior parses without missing changes.

Related issues

This proposal would probably solve the following issues:

  • #1693 - Multi-line lookahead.
  • #1495 - Pop multiple contexts.

In addition, in JavaScript it could also be used to reliably:

  • Parse import.meta and similar constructs with unexpected newlines.
  • Determine whether a variable or object key is assigned a function literal (to populate the symbol list).
  • Detect the last name in a chain of accessors (to give it a special scope in some cases).

These problems could be solved with multiline regexps; however, they could be solved just as easily with this proposal, with no need to implement both features. I expect that the same is true of features in many languages that would otherwise require multiline regexps.

@Thom1729

This comment has been minimized.

Copy link
Author

commented Mar 30, 2018

After some thought, I've determined that an additional optimization is needed to avoid poor performance in pathological cases.

Motivating example

(
    a = (
        b = (
            c = (
                0
            ) => null
        ) => null
    ) => null
) => null

As established above, the JavaScript syntax should guess that an open paren denotes a parenthesized expression. Then, if it turns out to actually be an arguments list, it would backtrack. In this case, the parser would backtrack at each level, but at each level it would redo all of the work it had backtracked. This would lead to exponential runtime in a naive implementation. While such patterns would never occur in real JavaScript, an unscrupulous person could deliberately create a pathological code sample.

Solution

The solution, fortunately, is not very complicated. We can cache the result of each "decision" for the duration of a single parse.

Whenever we fail, we create a "failure record" specifying the failed context (the one with the meta_backtracking_name) and the index when it was pushed (which we had on the stack). Then, whenever we push a context with a meta_backtracking_name, we check for a failure record for that context at that index. If we find one, then we immediately fail that context and push the value of meta_backtracking_else.

The correctness of this shortcut is due to the stack-based nature of the parser. If the text does not change (e.g. within a single parse), then a given context pushed at a given index will always succeed or always fail; contexts below it on the stack, including nondeterministic contexts, cannot affect this result. We do not need to record successes (the typical case).

The set of failure records could be stored in any manner allowing fast lookup. Various straightforward implementations would provide effective constant-time access.

In addition to handling pathological parsing, this approach should slightly improve parsing performance in typical cases. Each alternative need only be tried once per parse, limiting the total amount of backtracking that can occur.

@Thom1729

This comment has been minimized.

Copy link
Author

commented Feb 10, 2019

N.B.: Atom's new tree-sitter system does support nondeterministic context-free languages, which allows it to handle JavaScript arrow functions and similar constructs. If Sublime were to implement this proposal, then it would again be strictly more powerful than Atom's highlighter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
2 participants
You can’t perform that action at this time.