Debugging Grammar rules

Introduction

I've spent a lot of time working on and debugging grammar rules. As a result, I've accumulated some street experience.

Also in the process, I've refined the spark parser to make it easier to debug.

CheckSets()

First, the parser has an instance method called checkSets() which does some grammar verification. It returns four values:

  • non-terminals that appear on the left-hand side of a rule but not on the right-hand side of any rule. Note that the start symbol is also something that is not expected to appear on the right-hand side of a rule.
  • non-terminals (not tokens) that appear on the right-hand side of some rule, but not on the left-hand side of any rule.
  • right-recursive rules
  • tokens, or terminal symbols, seen in the grammar

The last two items are not errors like the first two; they are more for information. An Earley parser parses left-recursive rules faster than right-recursive ones, so when possible right-recursive rules should be rewritten. The list of terminal symbols may be useful in conjunction with other phases, like lexing or possibly semantic actions, which may also produce a list of terminal symbols to compare against.
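For instance, a right-recursive way to write a list of expressions is (these rules are made up for illustration, not taken from the package's tests):

exprs ::= expr exprs
exprs ::= expr

checkSets() would report the first rule as right-recursive. The left-recursive rewrite below accepts the same token sequences but is parsed more efficiently:

exprs ::= exprs expr
exprs ::= expr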

Below I'll give grammars that have errors or non-optimal behavior.

These examples come from the test_checker.py unit test.

Unused LHS

Consider this grammar:

class UnusedLHS(GenericParser):
    def __init__(self, start='expr'):
        GenericParser.__init__(self, start)

    def p_unused_lhs(self, args):
        """
        expr ::= expr ADD term
        expr ::= term
        factor ::= term
        term ::= NUMBER
        """

When this Python code is run:

        parser = UnusedLHS()
        lhs, rhs, tokens, right_recursive = parser.checkSets()

then the variable lhs will be set(['factor']), because factor isn't on the right-hand side of any rule. If we wrote this as:

class UnusedLHS(GenericParser):
    def __init__(self, start='calc'):
        GenericParser.__init__(self, start)

    def p_unused_lhs(self, args):
        """
        calc   ::= expr
        expr   ::= expr ADD term
        expr   ::= term
        factor ::= term
        term   ::= NUMBER
        """

We would have the same lhs value. Even though calc doesn't appear on the right-hand side of a rule, that is okay because it is the start symbol.

Unexpanded Nonterminal

Consider this grammar:

class UnexpandedNonterminal(GenericParser):
    def __init__(self, start='expr'):
        GenericParser.__init__(self, start)

    def p_expr_unexpanded(self, args):
        """
        expr ::= expr ADD term
        expr ::= term2
        term ::= NUMBER
        """

When this Python code is run:

        parser = UnexpandedNonterminal()
        lhs, rhs, tokens, right_recursive = parser.checkSets()

then the variable rhs will be set(['term2']), because term2 doesn't appear on the left-hand side of any rule and it isn't a terminal symbol. The parser assumes symbols in all capitals are terminal symbols.
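One way to make use of this, sketched here with a made-up test class (the real checks for these grammars live in test_checker.py), is to assert on the checkSets() results inside a unit test so that grammar regressions get caught automatically:

import unittest

class TestGrammarChecks(unittest.TestCase):
    def test_unexpanded_nonterminal(self):
        # UnexpandedNonterminal is the parser class defined above.
        parser = UnexpandedNonterminal()
        lhs, rhs, tokens, right_recursive = parser.checkSets()
        # term2 appears only on the right-hand side, so it should be reported.
        self.assertEqual(rhs, set(['term2']))

if __name__ == '__main__':
    unittest.main()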

Grammar hacking

Simplify

A useful technique I've found for tracking down problems is to simplify: both the test case you try and the grammar.

In contrast to the first spark parser, you can now add comments. Please use them. Comments can also be used when testing a feature of a grammar, to block out rules that are irrelevant to what you are testing. This often simplifies tracing a grammar.

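Here is a hypothetical sketch of what that can look like; the grammar is made up for illustration, and it assumes # starts a comment inside a rule docstring (check your version of the parser for the exact comment syntax):

class ExprParser(GenericParser):
    def __init__(self, start='expr'):
        GenericParser.__init__(self, start)

    def p_rules(self, args):
        """
        expr   ::= expr ADD term
        expr   ::= term

        # The MULT rule below is blocked out while debugging the ADD rules,
        # so the trace shows only the reductions we care about.
        # term ::= term MULT factor

        term   ::= factor
        factor ::= NUMBER
        """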

Similarly, you should simplify the program that you are tracing the grammar on.

[give example]

Overspecify

Sometimes I've found it useful to put in a custom rule temporarily just to see that I get a match.

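As a hypothetical sketch, borrowing the kwarg tokens from the Reduction Rules section below: a p_ method temporarily added to the decompiler's parser class might look like this. The rule is made up; it exists only to confirm that a match happens.

    def p_debug_probe(self, args):
        """
        # Temporary, overly specific rule: if this reduces in the trace,
        # I know the LOAD_CONST LOAD_CONST sequence is reaching the parser.
        # Remove it once debugging is done.
        kwarg ::= LOAD_CONST LOAD_CONST
        """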

Overgeneralize

This parser is extremely flexible in the grammars that it accepts and in the means it provides for filtering things out.

Reduction Rules

Although this isn't preferred, one can write a very general set of rules and then check specifics in a reduction rule.

Here is an example. In Python 3, function definitions can have annotations, like this:

def foo(a: int):

A fragment of the bytecode in some versions of Python 3, e.g. 3.4, looks like this:

LOAD_NAME                'int'
LOAD_CONST               ('a',)
LOAD_CONST               <code object foo>

And the grammar rule for the annotation arg, the middle LOAD_CONST instruction above, is:

annotate_arg ::= expr

The two LOAD_CONST instructions could also match a keyword argument. The code:

def foo(*, a=5):

gives bytecode fragment:

LOAD_CONST            'a'
LOAD_CONST            5

And the grammar rule for a keyword argument is:

kwarg ::= LOAD_CONST expr

The way the Python decompiler distinguishes these cases is to have reduction rules for both annotate_arg and kwarg. For annotate_arg, the attribute of the matched LOAD_CONST token has to be a tuple for the rule to reduce. For kwarg, the attribute of its LOAD_CONST token has to be a string.
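The check itself amounts to an isinstance test on the token's attribute. A sketch of that logic (the function names and the .attr field are illustrative; how the check is hooked into the parser's reduction machinery depends on the decompiler's code):

def annotate_arg_ok(load_const_token):
    # annotate_arg reduces only when the constant pushed is a tuple
    # (the tuple of parameter names, e.g. ('a',)).
    return isinstance(load_const_token.attr, tuple)

def kwarg_ok(load_const_token):
    # kwarg reduces only when the constant pushed is a string
    # (the keyword argument's name, e.g. 'a').
    return isinstance(load_const_token.attr, str)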

Tracing rules

[to be completed]

Here I'm going to debug a real problem I encountered while working on an Emacs Lisp decompiler. I'm not going to explain Emacs Lisp, or describe either its LAP bytecode or my deparsing grammar. I'm just going to walk through the debugging! From that you may infer a little about the languages.

First my Lisp code:

(defun if2(e)
  (if (<= e 4)
      (setq e 10))
  (if (null f)
      (setq f 3)))

Now here is the LAP code after it has been processed for parsing (COME_FROM and LABEL instructions have been added, and the function header is omitted):

    0 VARREF     e
    1 CONSTANT   4
    2 LEQ
    3 GOTO-IF-NIL 1
    6 CONSTANT   10
    7 VARSET     e
    8 COME_FROM  3
    8 LABEL      :1
  8:1 VARREF     f
    9 NOT
   10 GOTO-IF-NIL-ELSE-POP 2
   13 CONSTANT   3
   14 DUP
   15 VARSET     f
   16 COME_FROM  10
   16 LABEL      :2
 16:2 RETURN

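The reduction trace below comes from running the parser with its debug output turned on. Roughly, that looks like the sketch below; the parser class is a stand-in for my real Emacs Lisp deparsing grammar, and the debug dictionary keys follow DEFAULT_DEBUG in recent spark_parser releases, so check your version:

from spark_parser import DEFAULT_DEBUG, GenericParser

class TracedParser(GenericParser):
    def __init__(self, start='fn_body'):
        # Start from the packaged defaults and turn on just the reduction trace.
        debug_opts = dict(DEFAULT_DEBUG)
        debug_opts['reduce'] = True
        GenericParser.__init__(self, start, debug=debug_opts)

    def p_rules(self, args):
        """
        fn_body ::= VARREF
        """
        # Toy rule; the real deparsing grammar is much larger.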
Now we'll go into the trace:

name_expr ::= VARREF (1)
expr ::= name_expr (1)
expr_stmt ::= expr \e_opt_discard (1)
exprs ::= expr_stmt (1)
body ::= exprs (1)
fn_body ::= body \e_opt_return (1)
START ::= |- fn_body (1)
...

On each line is a grammar reduction rule. The number in parentheses is the position of the last token we have read. In the above, all of those reductions come from reading just the first instruction, VARREF. The fact that we have reduced all the way to START means that this one instruction corresponds to a complete, valid Emacs Lisp function:

(defun foo (e) e )
;;             ^  that single VARREF instruction

So by tracking the portions that reduce to START and keeping track of the instruction number, you might be able to understand how far we have parsed and what that means.

I once had a bug where, when I got to the last token, I had seen a reduction to fn_body but not to START. I thought something must be wrong in the algorithm. No: what was going on was that there was an additional nonterminal before fn_body that hadn't been consumed.

Now continuing...

name_expr ::= CONSTANT (2)
expr ::= name_expr (2)
expr_stmt ::= expr \e_opt_discard (2)
expr_stmt ::= expr \e_opt_discard (2)
exprs ::= expr_stmt (2)
exprs ::= exprs expr_stmt (2)
body ::= exprs (2)
body ::= exprs (2)
fn_body ::= body \e_opt_return (2)
START ::= |- fn_body (2)
...

Ok. So now with just two instructions we have a complete Emacs Lisp function. It is now:

(defun foo (e) e 2)
;;             ^ ^  a VARREF plus a CONSTANT instruction

We're really zipping along! Note that we see \e_opt_return. There is a grammar rule for opt_return that looks like this:

opt_return ::= RETURN?

which is a shorthand for

opt_return ::= RETURN
opt_return ::=

The e_ in the reduction simply means that we took the second, empty, grammar rule.
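In a grammar docstring the shorthand is written directly. A hypothetical fragment, where only the opt_return and fn_body rules come from the real grammar and the rest is simplified for illustration:

class OptReturnExample(GenericParser):
    def __init__(self, start='fn_body'):
        GenericParser.__init__(self, start)

    def p_rules(self, args):
        """
        fn_body    ::= body opt_return
        body       ::= VARREF
        opt_return ::= RETURN?
        """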