# Debugging Grammar rules
I've spent a lot of time working on and debugging grammar rules. As a result, I've accumulated some street experience.
In the process, I've also refined the SPARK parser to make it easier to debug.
First, the parser has an instance method called `checkSets()` which does some grammar verification. It returns four sets:
- non-terminals on the left-hand side that are not on the right-hand side of any rule. Note that the start symbol is not expected to be on the right-hand side of any rule, so it is excluded here.
- non-terminals (not tokens) that are on the right-hand side of some rule, but not on the left-hand side of a rule.
- right-recursive rules
- tokens or terminal symbols seen in the grammar
The last two items are not errors like the first two; they are more for information. An Earley parser parses left-recursive rules faster than right-recursive rules, so when possible, right-recursive rules should be rewritten. The list of terminal symbols may be useful in conjunction with other phases, like lexing or possibly semantic actions, which may also produce a list of terminal symbols to compare against.
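For example (hypothetical rules), this right-recursive pair:

```
stmts ::= stmt stmts
stmts ::= stmt
```

can be rewritten left-recursively, accepting exactly the same strings:

```
stmts ::= stmts stmt
stmts ::= stmt
```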
Below I'll give grammars that have errors or non-optimal behavior.
These examples come from the `test_checker.py` unit test.
Consider this grammar:
```python
from spark_parser import GenericParser

class UnusedLHS(GenericParser):
    def __init__(self, start='expr'):
        GenericParser.__init__(self, start)

    def p_unused_lhs(self, args):
        """
        expr ::= expr ADD term
        expr ::= term
        factor ::= term
        term ::= NUMBER
        """
```
When this Python code is run:

```python
parser = UnusedLHS()
lhs, rhs, tokens, right_recursive = parser.checkSets()
```
then the variable `lhs` will be `set(['factor'])` because `factor` isn't on the right-hand side of any rule. If we wrote this as:
```python
class UnusedLHS(GenericParser):
    def __init__(self, start='calc'):
        GenericParser.__init__(self, start)

    def p_unused_lhs(self, args):
        """
        calc ::= expr
        expr ::= expr ADD term
        expr ::= term
        factor ::= term
        term ::= NUMBER
        """
```
we would have the same `lhs` value. Even though `calc` doesn't appear on the right-hand side of a rule, that is okay because it is the start symbol.
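Concretely, following the pattern above:

```python
parser = UnusedLHS()  # start symbol is now 'calc'
lhs, rhs, tokens, right_recursive = parser.checkSets()
# lhs is still set(['factor']); 'calc' is excluded because
# it is the start symbol
```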
Consider this grammar:
```python
class UnexpandedNonterminal(GenericParser):
    def __init__(self, start='expr'):
        GenericParser.__init__(self, start)

    def p_expr_unexpanded(self, args):
        """
        expr ::= expr ADD term
        expr ::= term2
        term ::= NUMBER
        """
```
When this Python code is run:

```python
parser = UnexpandedNonterminal()
lhs, rhs, tokens, right_recursive = parser.checkSets()
```
then the variable `rhs` will be `set(['term2'])` because `term2` doesn't appear on the left-hand side of a rule and it isn't a terminal symbol. The parser assumes symbols in all capitals are terminal symbols.
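Here, `term2` looks like a typo for `term`; renaming it reconnects the rule and empties `rhs`:

```
expr ::= expr ADD term
expr ::= term
term ::= NUMBER
```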
A useful technique I've found for tracing problems is to simplify both the test case you try and the grammar.
In contrast to the original SPARK parser, you can now add comments. Please use them. Comments can also be used when testing a feature of a grammar to block out rules that are irrelevant to what you are testing. This often simplifies tracing a grammar.
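A sketch with hypothetical rules: suppose you are testing only addition. The subtraction rule can be blocked out with a comment so it can't participate in the parse and the trace stays small:

```python
def p_expr_rules(self, args):
    """
    expr ::= expr ADD term
    # expr ::= expr SUBTRACT term   (blocked out: irrelevant to this test)
    expr ::= term
    term ::= NUMBER
    """
```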
Similarly, simplify the program you are tracing the grammar on.
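For example, with a hypothetical scanner and parser: rather than tracing the grammar over a whole source file, cut the input down to the smallest fragment that still misbehaves:

```python
# Instead of tracing the parse of an entire program...
# tokens = scan(open("big_module.el").read())   # hypothetical scan()
# ...shrink the input to the smallest fragment that still shows the problem:
tokens = scan("(setq e 10)")
ast = parser.parse(tokens)
```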
Sometimes I've found it useful to put in a custom rule temporarily just to see that I get a match.
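A sketch, with made-up rule names: if you suspect a token sequence should be recognized but nothing reduces over it, temporarily add a throwaway rule that matches the sequence directly, hook it into a reachable nonterminal, and watch for its reduction in the trace:

```python
def p_debug_probe(self, args):
    """
    # Temporary rules -- delete after debugging!
    expr ::= debug_probe
    debug_probe ::= VARREF CONSTANT LEQ
    """
```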
This parser is extremely flexible in the grammars that it accepts and in the means for filtering things out. Although this isn't preferred, one can write a very general set of rules and then check specifics in a reduction rule.
Here is an example. In Python 3, function definitions can be annotated, like this:
```python
def foo(a: int):
```
A fragment of the bytecode in some versions of Python 3, e.g. 3.4, looks like this:
```
LOAD_NAME  'int'
LOAD_CONST ('a',)
LOAD_CONST <code object foo>
```
And the grammar rule for the annotation arg, the middle `LOAD_CONST` instruction above, is:

```
annotate_arg ::= expr
```
The two `LOAD_CONST` instructions could also match a keyword argument.
The code:

```python
def foo(*, a=5):
```
gives this bytecode fragment:

```
LOAD_CONST 'a'
LOAD_CONST 5
```
And the grammar rule for a keyword argument is:

```
kwarg ::= LOAD_CONST expr
```
The way the Python decompiler distinguishes these cases is to have reduction rules for both `annotate_arg` and `kwarg`. For `annotate_arg`, its token attribute has to be a tuple for the rule to reduce. For a `kwarg`, its `LOAD_CONST` token argument has to be a string.
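Here is a sketch of how such reduction checks might be written, assuming spark_parser's `check_reduce`/`reduce_is_invalid` hook; the class name is illustrative, `tokens[first].attr` assumes token objects carry the constant in an `attr` attribute, and details may differ across versions:

```python
from spark_parser import GenericParser

class AnnotateAwareParser(GenericParser):
    def __init__(self, start="expr"):
        GenericParser.__init__(self, start)
        # Ask the parser to consult reduce_is_invalid() with the matched
        # tokens before committing to these reductions.
        self.check_reduce["annotate_arg"] = "tokens"
        self.check_reduce["kwarg"] = "tokens"

    def reduce_is_invalid(self, rule, ast, tokens, first, last):
        lhs = rule[0]
        if lhs == "annotate_arg":
            # Reduce only when the LOAD_CONST attribute is a tuple of
            # parameter names, e.g. ('a',).
            return not isinstance(tokens[first].attr, tuple)
        if lhs == "kwarg":
            # Reduce only when the LOAD_CONST attribute is a string,
            # the keyword argument's name.
            return not isinstance(tokens[first].attr, str)
        return False
```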
Here I'm going to debug a real problem I encountered while working on an Emacs Lisp decompiler. I'm not going to explain Emacs Lisp, describe its LAP bytecode, or describe my deparsing grammar. I'm just going to walk through the debugging! From that you may infer a little about the languages.
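The reduction trace we'll look at is produced by the parser's debug options. A sketch of turning it on, assuming spark_parser's `DEFAULT_DEBUG` dictionary (option names may vary by version):

```python
from spark_parser import GenericParser, DEFAULT_DEBUG

debug_opts = dict(DEFAULT_DEBUG)
debug_opts["reduce"] = True   # print each grammar reduction as it happens
# parser = LapParser(start="START", debug=debug_opts)  # hypothetical class
```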
First my Lisp code:
```lisp
(defun if2 (e)
  (if (<= e 4)
      (setq e 10))
  (if (null f)
      (setq f 3)))
```
Now here is the LAP code after it has been processed for parsing (`COME_FROM` and `LABEL` instructions have been added, while the function header has been omitted):
```
 0    VARREF e
 1    CONSTANT 4
 2    LEQ
 3    GOTO-IF-NIL 1
 6    CONSTANT 10
 7    VARSET e
 8    COME_FROM 3
 8    LABEL :1
 8:1  VARREF f
 9    NOT
10    GOTO-IF-NIL-ELSE-POP 2
13    CONSTANT 3
14    DUP
15    VARSET f
16    COME_FROM 10
16    LABEL :2
16:2  RETURN
```
Now we'll go into the trace:
```
name_expr ::= VARREF (1)
expr ::= name_expr (1)
expr_stmt ::= expr \e_opt_discard (1)
exprs ::= expr_stmt (1)
body ::= exprs (1)
fn_body ::= body \e_opt_return (1)
START ::= |- fn_body (1)
...
```
Each line is a grammar reduction rule. The number in parentheses is the position of the last token we have read. In the above, all of those reductions come from reading just the first instruction, `VARREF`. The fact that we have reduced all the way to `START` means that one instruction corresponds to a full, valid Emacs Lisp function:
```lisp
(defun foo (e) e )
;;             ^ that single VARREF instruction
```
So by tracking the portions that reduce to `START` and keeping track of the instruction number, you can understand how far we have parsed and what that means.
I had a bug where, when I got to the last token, I saw a reduction to `fn_body` but not to `START`. I thought something must be wrong in the algorithm. No: what was going on is that there must have been an additional nonterminal before `fn_body` that hadn't been consumed.
Now continuing...
```
name_expr ::= CONSTANT (2)
expr ::= name_expr (2)
expr_stmt ::= expr \e_opt_discard (2)
expr_stmt ::= expr \e_opt_discard (2)
exprs ::= expr_stmt (2)
exprs ::= exprs expr_stmt (2)
body ::= exprs (2)
body ::= exprs (2)
fn_body ::= body \e_opt_return (2)
START ::= |- fn_body (2)
...
```
OK. So now with just two instructions we have a complete Emacs Lisp function. It is now:

```lisp
(defun foo (e) e 2)
;;             ^ ^ a VARREF plus a CONSTANT instruction
```
We're really zipping along! Note that we see `\e_opt_return`. There is a grammar rule for `opt_return` that looks like this:

```
opt_return ::= RETURN?
```

which is shorthand for:

```
opt_return ::= RETURN
opt_return ::=
```

The `\e_` prefix in the reduction simply means that we took the second, empty grammar rule.