# The Magic of Tree Walking

In this chapter we continue our discussion of IR design.  We introduce the idea of the Abstract Syntax Tree (AST) as an IR which is well suited for more complex languages.  We show that this intermediate representation can be directly derived from the grammar itself which facilitates the IR design. We illustrate these ideas with a pretty printer program that processes a simple language called Cuppa1.  This example also illustrates a nice way to process abstract syntax trees via something called *tree walking*.  We finish the chapter by designing a tree walker which is an interpreter for our Cuppa1 language.

In [2]:
# let the notebook access the code folder
import sys
sys.path.insert(1,"code")

# Abstract Syntax Trees

Our Exp1bytecode language from the last chapter was so straightforward that the best IR was an abstract representation of the instructions which we could then directly interpret.  In more complex languages, especially higher level languages, it is usually not possible to design such simple IRs.  Instead we use Abstract Syntax Trees (ASTs).

<center>
<img src="figures/chap05/1/figure/Slide1.jpg" alt="">
Fig 1. The parse tree and corresponding abstract syntax tree for expression: `+ x - y z`.
</center>


One way to think about ASTs is as parse trees with all the derivation information deleted.  Think about the grammar,
```
exp : + exp exp
    | - exp exp
    | x
    | y
    | z
```
and consider the expression,
```
+ x - y z
```
Figure 1 shows both the parse tree and the AST for this expression.  It should be easy to see that both trees represent the same computation, namely that the difference of `y` and `z` is added to `x`.  Abstract syntax trees just represent these computations in a simpler and much more accessible way.

There is another interesting aspect of ASTs as IRs: Because every valid program has a parse tree, it is always possible to construct an AST for every valid program.  In this way ASTs are the IR of choice because it doesn’t matter how complex the programming language, there will always be an AST representation for any given valid program in that language.

## The Tuple Representation of ASTs

A convenient way to represent AST nodes is with the following structure,
```Python
(TYPE [, child1, child2,...])
```
A tree node is a tuple where the first component represents the type or name of the node followed by zero or more components each representing a child of the current node.

Consider the abstract syntax tree from Figure 1 for our expression `+ x - y z`.  That AST can be encoded in our tuple notation as follows,

In [3]:
ast = ('+', 'x', ('-', 'y', 'z'))

To make this easier to read, the module `grammar_stuff` provides a function, `dump_AST`, that prints the AST in a much more readable format.

In [4]:
from grammar_stuff import dump_AST
dump_AST(ast)


(+ x 
  |(- y z))
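
To see how little machinery the tuple encoding needs, here is a minimal sketch of a recursive function that evaluates the AST from Figure 1.  The function name `eval_prefix` and the variable values in `env` are assumptions made up for this illustration; they are not part of `grammar_stuff`.

```python
def eval_prefix(node, env):
    """Evaluate a tuple-encoded AST for the prefix expression grammar above."""
    # a leaf is just a variable name; look up its (assumed) value
    if isinstance(node, str):
        return env[node]
    # an interior node has the shape (TYPE, child1, child2)
    op, left, right = node
    lval = eval_prefix(left, env)
    rval = eval_prefix(right, env)
    if op == '+':
        return lval + rval
    elif op == '-':
        return lval - rval
    else:
        raise ValueError("unknown node type {}".format(op))

ast = ('+', 'x', ('-', 'y', 'z'))
env = {'x': 1, 'y': 5, 'z': 2}   # hypothetical variable values
result = eval_prefix(ast, env)
print(result)                    # x + (y - z) = 1 + (5 - 2) = 4
```

The function first looks at the node type in the first tuple component and then recurses into the children, which is the basic pattern behind all tree processing on this representation.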


# The Cuppa1 Programming Language

The Cuppa1 language is a high-level language and borrows many of its syntactic features from the C/Java family of languages.  The language provides high-level programming language constructs such as while loops and if-then-else statements in addition to standard arithmetic infix notation for expressions.

Here is a first peek at the PLY grammar of the language.

```Python
# %load code/cuppa1_gram.py
# grammar for Cuppa1

from ply import yacc
from cuppa1_lex import tokens, lexer

# set precedence and associativity
# NOTE: all arithmetic operators need to have tokens
#       so that we can put them into the precedence table
precedence = (
              ('left', 'EQ', 'LE'),
              ('left', 'PLUS', 'MINUS'),
              ('left', 'TIMES', 'DIVIDE'),
              ('right', 'UMINUS')
             )


def p_grammar(_):
    '''
    program : stmt_list

    stmt_list : stmt stmt_list
              | empty

    stmt : ID '=' exp opt_semi
         | GET ID opt_semi
         | PUT exp opt_semi
         | WHILE '(' exp ')' stmt
         | IF '(' exp ')' stmt opt_else
         | '{' stmt_list '}'

    opt_else : ELSE stmt
             | empty
             
    opt_semi : ';'
             | empty

    exp : exp PLUS exp
        | exp MINUS exp
        | exp TIMES exp
        | exp DIVIDE exp
        | exp EQ exp
        | exp LE exp
        | INTEGER
        | ID
        | '(' exp ')'
        | MINUS exp %prec UMINUS
    '''
    pass

def p_empty(p):
    'empty :'
    pass

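def p_error(t):
    # PLY passes None when the error occurs at the end of the input
    if t is None:
        print("Syntax error at end of input")
    else:
        print("Syntax error at '%s'" % t.value)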

### build the parser
parser = yacc.yacc()
```

Ignore the precedence table for now.  Let's take a look at the grammar proper.  We see that programs are statement lists.  A statement list is a possibly empty list of statements and in Cuppa1 there are six different kinds of statements:

* The assignment statement assigns the value of the expression `exp` on the right side of the `=` sign to the variable `ID` on the left side of the `=` sign.

* The get statement asks the user to input an integer value which the get statement will assign to the variable `ID`. 

* The put statement prints out the value of the expression `exp` to the terminal.

* The while loop executes the statement `stmt` that makes up its loop body as long as the condition `exp` evaluates to true.

* The if statement executes the statement `stmt` if the condition `exp` evaluates to true; otherwise it executes the else statement, if present.

* The block statement is a list of statements appearing within curly braces.

Note that statements can be followed by an optional semicolon.

Reading further in the grammar we see that expressions are infix and Cuppa1 supports all the normal arithmetic operations.  It also supports two relational operators: equal (`==`) and less than or equal (`<=`).  Expressions can also be integer values and variable names as well as parenthesized expressions and unary minus.

Now is a good time to take a look at the precedence table at the top of the grammar.  We need to tell the parser generator about the precedence and associativity of each arithmetic operator so that expressions are parsed correctly.  Consider the expression `3 * 4 + 1`.  Without the precedence and associativity information the grammar is ambiguous, and an LR parser that resolves the resulting shift/reduce conflict by shifting (the usual default) would parse this as `3 * (4 + 1)`, which is not correct because multiplication has a higher precedence than addition.  If we inform the parser of the precedence and associativity of the various operators then an LR parser will parse this correctly as `(3 * 4) + 1`.

Looking at the precedence table more closely we see that all the operators are left associative with the exception of the unary minus, which is right associative.
We also see that the relational operators have the lowest precedence, addition and subtraction are at the next level, the multiplication and division operators are above that, and finally the unary minus has the highest level of precedence.  That is, each row in the precedence table represents a precedence level and the rows are sorted in order of increasing precedence.
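
The difference between the two parses can be made concrete with tuple-encoded trees.  The sketch below builds both trees for `3 * 4 + 1` by hand and evaluates them.  The helper `evaluate` and the node shape `(op, left, right)` are assumptions for this illustration only; they are not the AST format produced by the Cuppa1 frontend.

```python
import operator

# map operator symbols to the corresponding Python functions
OPS = {'+': operator.add, '*': operator.mul}

def evaluate(node):
    """Evaluate a hand-built expression tree of the shape (op, left, right)."""
    if isinstance(node, int):
        return node
    op, left, right = node
    return OPS[op](evaluate(left), evaluate(right))

# tree obtained when multiplication binds tighter: (3 * 4) + 1
with_precedence = ('+', ('*', 3, 4), 1)
# tree obtained when the parser simply keeps shifting: 3 * (4 + 1)
without_precedence = ('*', 3, ('+', 4, 1))

print(evaluate(with_precedence))     # 13
print(evaluate(without_precedence))  # 15
```

The two trees contain the same operators and operands, but their shape determines the order of evaluation and therefore the result.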


From the lexical perspective there are no surprises.

```Python
# %load code/cuppa1_lex.py
# Lexer for Cuppa1

from ply import lex

reserved = {
    'get' : 'GET',
    'put' : 'PUT',
    'if' : 'IF',
    'else' : 'ELSE',
    'while' : 'WHILE'
}

literals = [';','=','(',')','{','}']

tokens = [
          'PLUS','MINUS','TIMES','DIVIDE',
          'EQ','LE',
          'INTEGER','ID',
          ] + list(reserved.values())

t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_EQ      = r'=='
t_LE      = r'<='

t_ignore = ' \t'

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value,'ID')    # Check for reserved words
    return t

def t_INTEGER(t):
    r'[0-9]+'
    return t

def t_COMMENT(t):
    r'//.*'
    pass

def t_NEWLINE(t):
    r'\n'
    pass

def t_error(t):
    print("Illegal character %s" % t.value[0])
    t.lexer.skip(1)

# build the lexer
lexer = lex.lex(debug=0)
```

Perhaps the one thing to point out is that comments are now part of the language: a comment starts with `//` and runs to the end of the line.
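
Two details of the lexer can be tried out in isolation with Python's standard `re` module, without involving PLY at all.  The following sketch reuses the `reserved` table and the comment pattern `//.*` from the lexer; the helper `classify` and the sample input are assumptions made up for this illustration.

```python
import re

# same reserved word table as in the lexer specification
reserved = {
    'get': 'GET',
    'put': 'PUT',
    'if': 'IF',
    'else': 'ELSE',
    'while': 'WHILE',
}

def classify(word):
    """Mimic the lookup in t_ID: reserved words win, everything else is an ID."""
    return reserved.get(word, 'ID')

# same pattern as t_COMMENT; '.' does not match a newline,
# so a comment ends at the end of its line
comment = re.compile(r'//.*')

source = "x = x + 1 // increment x\nput x"
without_comments = comment.sub('', source)

print(classify('while'))        # WHILE
print(classify('whilex'))       # ID
print(repr(without_comments))   # 'x = x + 1 \nput x'
```

Because identifiers and reserved words share one pattern, a name such as `whilex` is matched as a whole and only then looked up, so it is classified as an ordinary identifier.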

Here are a couple of programs written in Cuppa1.

In [5]:
from cuppa1_gram import parser

Generating LALR tables


In [6]:
p1 = \
'''
get x
x = x +1
put x
'''

parser.parse(p1)

In [7]:
fact = \
'''
get x;
y = 1;
while (1 <= x)
{
      y = y * x;
      x = x - 1;
}
put y;
'''

parser.parse(fact)

Here is an interesting program that loops forever but does nothing,

In [8]:
loop = "while (1) {}"

parser.parse(loop)

# The Cuppa1 Frontend

In the technical jargon of programming language implementation a *frontend* is a parser that constructs an abstract syntax tree for an input program and fills in the symbol table with basic variable information. Our Cuppa1 frontend consists of the parser specification from above enriched with embedded actions that construct a tuple-based AST and fill in the symbol table with the names of variables that are defined using either the assignment statement or the get statement.

Before we get into the details of the frontend we need to introduce the Cuppa1 state.  A state in Cuppa1 is not unlike the state of the abstract machine for the Exp1bytecode.  Here the state is the computation represented by the AST together with a symbol table.  A state is an object,

In [9]:
# %load code/cuppa1_state

class State:
    def __init__(self):
        self.initialize()

    def initialize(self):
        # symbol table to hold variable-value associations
        self.symbol_table = {}

        # when done parsing this variable will hold our AST
        self.AST = None

state = State()


The member function `initialize` initializes a state object by clearing the symbol table and setting the AST to `None`.
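
One plausible use of `initialize` is to reset the shared state object between two runs of the frontend.  The sketch below repeats the class definition so that it is self-contained and fills in the state by hand instead of running the parser; the values assigned here are stand-ins chosen for this illustration.

```python
class State:
    def __init__(self):
        self.initialize()

    def initialize(self):
        # symbol table to hold variable-value associations
        self.symbol_table = {}
        # when done parsing this variable will hold our AST
        self.AST = None

state = State()

# pretend the frontend just processed the program "get x"
state.symbol_table['x'] = None
state.AST = ('seq', ('get', 'x'), ('nil',))

# reset the state before processing the next program
state.initialize()
print(state.symbol_table, state.AST)   # {} None
```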

The full frontend specification appears in the [Appendix](Cuppa1 Frontend.ipynb).  Here we take a look at telling aspects of the specification.  Perhaps the best place to start is the specification of the statements.

## Statements

In [13]:
# %load -s p_stmt code/cuppa1_frontend_gram.py
def p_stmt(p):
    '''
    stmt : ID '=' exp opt_semi
         | GET ID opt_semi
         | PUT exp opt_semi
         | WHILE '(' exp ')' stmt
         | IF '(' exp ')' stmt opt_else
         | '{' stmt_list '}'
    '''
    if p[2] == '=':
        p[0] = ('assign', p[1], p[3])
        state.symbol_table[p[1]] = None
    elif p[1] == 'get':
        p[0] = ('get', p[2])
        state.symbol_table[p[2]] = None
    elif p[1] == 'put':
        p[0] = ('put', p[2])
    elif p[1] == 'while':
        p[0] = ('while', p[3], p[5])
    elif p[1] == 'if':
        p[0] = ('if', p[3], p[5], p[6])
    elif p[1] == '{':
        p[0] = ('block', p[2])
    else:
        raise ValueError("unexpected symbol {}".format(p[1]))


Looking at the code carefully we see that the embedded actions build AST nodes for the corresponding statements.  Consider the assignment statement,
```
stmt : ID '=' exp opt_semi
```
The embedded actions for this grammar rule construct the following tuple where `p[1]` contains the name of the variable of the token `ID` and 
`p[3]` contains the AST generated by the non-terminal `exp`,
```Python
('assign', p[1], p[3])
```
In addition, the variable name in `p[1]` is written to the symbol table,
```Python
state.symbol_table[p[1]] = None
```
However, we are not doing any processing at this point so the value we record for the variable is set to `None`.  As pointed out above when we discussed the syntax of Cuppa1, the non-terminal `opt_semi` means that the statement can be followed by an optional semicolon.  

The get statement specification follows a very similar pattern.  Here is the grammar rule for get statements,
```
stmt : GET ID opt_semi
```
The corresponding embedded actions build a tuple where `p[2]` holds the variable name,
```Python
('get', p[2])
```
Again, the variable is recorded in the symbol table with its value set to `None`,
```Python
state.symbol_table[p[2]] = None
```
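
The two actions just discussed can be exercised without PLY by passing in a plain Python list that mimics the production object `p`.  In this sketch the function `stmt_action`, the module-level `symbol_table`, and the placeholder values `EXP` and `SEMI` are assumptions for illustration; in particular, the real value delivered by `opt_semi` is not shown in this chapter and is not used by these actions.

```python
symbol_table = {}   # stands in for state.symbol_table

def stmt_action(p):
    """Mimic the assignment and get branches of the p_stmt action."""
    if p[2] == '=':
        symbol_table[p[1]] = None       # declare the variable, no value yet
        return ('assign', p[1], p[3])
    elif p[1] == 'get':
        symbol_table[p[2]] = None
        return ('get', p[2])
    else:
        raise ValueError("unexpected symbol {}".format(p[1]))

EXP = ('<exp AST>',)   # placeholder for the AST of the right-hand side
SEMI = ';'             # placeholder for whatever opt_semi delivers

# index 0 is the result slot in PLY, so the symbols start at index 1
assign_node = stmt_action([None, 'x', '=', EXP, SEMI])
get_node = stmt_action([None, 'get', 'y', SEMI])

print(assign_node)     # ('assign', 'x', ('<exp AST>',))
print(get_node)        # ('get', 'y')
print(symbol_table)    # {'x': None, 'y': None}
```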

The remainder of the embedded actions are pretty straightforward with the exception perhaps of the if statement.  From our discussion on the syntax of Cuppa1 above we know that if statements come in two flavors: `if-then` and `if-then-else`.  Our frontend specification has to deal with these two flavors effectively.  From the grammar snippet above we have the rule,
```
stmt : IF '(' exp ')' stmt opt_else
```
which specifies that the `if-then` statement can be followed by an optional else-part.  The tuple that is being constructed is,
```Python
('if', p[3], p[5], p[6])
```
where `p[3]` is the AST generated by the non-terminal `exp` and `p[5]` is the AST generated by the non-terminal `stmt`.
The question is, what does `p[6]` contain?  In order to answer this we need to take a look at the embedded actions for
the rule defining the non-terminal `opt_else`.

In [14]:
# %load -s p_opt_else code/cuppa1_frontend_gram.py
def p_opt_else(p):
    '''
    opt_else : ELSE stmt
             | empty
    '''
    if p[1] == 'else':
        p[0] = p[2]
    else:
        p[0] = p[1]
    

The embedded actions for the `opt_else` non-terminal specify that if an else-part exists then copy the AST of the corresponding statement.  That means that `p[6]` above will have the AST of the else-statement if the else-part exists. If the else-part does not exist then we copy whatever the non-terminal `empty` produces.

For this we need to take a look at the definition of the non-terminal `empty`.

In [15]:
# %load -s p_empty code/cuppa1_frontend_gram.py
def p_empty(p):
    'empty :'
    p[0] = ('nil',)


The embedded action for `empty` constructs a tuple with no children and of the type `nil`.  That means, `p[6]` in the embedded action for the if statement will have the node,
```Python
('nil',)
```
indicating that there is no else-part.  This is worthwhile to remember because later on, when we actually walk the tree in order to do some processing, we need to be aware of these two distinct tree patterns.
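
As a minimal sketch of how code that processes the tree might tell the two patterns apart, consider the function below.  The name `describe_if` and the placeholder `COND` are assumptions for this illustration; only the overall node shape `('if', cond, then_stmt, else_stmt)` and the `('nil',)` node come from the frontend.

```python
def describe_if(node):
    """Report which of the two if-statement patterns a node matches."""
    kind, cond, then_stmt, else_stmt = node
    if kind != 'if':
        raise ValueError("expected an if node, got {}".format(kind))
    # a missing else-part is represented by the node ('nil',)
    if else_stmt[0] == 'nil':
        return 'if-then'
    return 'if-then-else'

COND = ('<exp AST>',)   # placeholder for the AST of the condition

if_then = ('if', COND, ('get', 'x'), ('nil',))
if_then_else = ('if', COND, ('get', 'x'), ('get', 'y'))

print(describe_if(if_then))        # if-then
print(describe_if(if_then_else))   # if-then-else
```

Since the else-part is always present as a child, every `if` node has exactly three children and the check reduces to looking at the type of the last one.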

## Statement Lists and Programs

Statement lists and programs are specified as follows in the frontend,

In [None]:
# %load -s p_prog,p_stmt_list code/cuppa1_frontend_gram.py
def p_prog(p):
    '''
    program : stmt_list
    '''
    state.AST = p[1]


def p_stmt_list(p):
    '''
    stmt_list : stmt stmt_list
              | empty
    '''
    if (len(p) == 3):
        p[0] = ('seq', p[1], p[2])
    elif (len(p) == 2):
        p[0] = p[1]



Programs in Cuppa1 are lists of statements.  The embedded action for  the rule `program : stmt_list`  will copy
the AST computed by the non-terminal `stmt_list` in `p[1]` into the AST field of the state.

The rule for statement lists,
```
stmt_list : stmt stmt_list
          | empty

```
states that a statement list consists of a statement followed by a statement list or it is empty.  
For non-empty statement lists we need a way to glue all the different statements together in the AST.
The traditional way to do this is with a 'sequence' node.
Let's take a look at the embedded action for non-empty statement lists,
```Python
p[0] = ('seq', p[1], p[2])
```
Here we construct an AST node of type `seq` where the first child is the AST computed by the non-terminal `stmt` and
the second child is the AST of the rest of the statement list computed by the non-terminal `stmt_list`.  

If the statement list is empty then the corresponding embedded action,
```Python
p[0] = p[1]
```
will copy the AST computed by the non-terminal `empty`.  We know from our previous discussion that this AST is the tree node `('nil',)`.

Now, for a program with two statements,
```
Stmt1
Stmt2
```
our frontend will compute the following AST,
```Python
('seq',
    <Stmt1>,
    ('seq',
       <Stmt2>,
       ('nil',)))
```
where `<Stmt1>` and `<Stmt2>` represent the ASTs for the corresponding program statements `Stmt1` and `Stmt2`.
It is easy to see that statement lists are `nil` terminated lists constructed with `seq` nodes.  We will study this in more detail once we have finished our discussion of the frontend.
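
The `nil` terminated structure is easy to traverse.  The following sketch follows the chain of `seq` nodes and collects the statements in a Python list; the function name `flatten_seq` is an assumption made up for this illustration and is not part of the Cuppa1 code.

```python
def flatten_seq(node):
    """Collect the statements of a nil-terminated seq list in order."""
    stmts = []
    # each seq node holds one statement and the rest of the list
    while node[0] == 'seq':
        _, stmt, rest = node
        stmts.append(stmt)
        node = rest
    # a well-formed list ends with the node ('nil',)
    if node[0] != 'nil':
        raise ValueError("statement list is not nil terminated")
    return stmts

# AST for the two-statement program "get x  get y"
ast = ('seq',
          ('get', 'x'),
          ('seq',
              ('get', 'y'),
              ('nil',)))

print(flatten_seq(ast))        # [('get', 'x'), ('get', 'y')]
print(flatten_seq(('nil',)))   # []
```

The empty program is simply the node `('nil',)`, so the same function handles it without a special case.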


## Expressions

We start our discussion of expressions in Cuppa1 with the binary operators,

In [None]:
# %load -s p_binop_exp code/cuppa1_frontend_gram.py
def p_binop_exp(p):
    '''
    exp : exp PLUS exp
        | exp MINUS exp
        | exp TIMES exp
        | exp DIVIDE exp
        | exp EQ exp
        | exp LE exp
    '''
    p[0] = ('binop', p[2], p[1], p[3])
 

For each of 