# Procrastination Pays Off: Program Analysis with Intermediate Representations

<!--
\index{intermediate representation}
\index{IR}
\index{bytecode interpreter}
\index{virtual machine}
\index{abstract syntax tree}
\index{AST}
\index{pretty printer}
-->

In this chapter we show that the simple, syntax directed scheme of processing programming
languages shown in Chapter 2 is not powerful enough to handle certain
standard programming constructs such as the `jump to label` instruction.  We show that such
constructs can be processed by first constructing an intermediate representation (IR) of the program
and then using this IR during the actual processing of the program.  We illustrate these ideas with a simple
bytecode language (virtual machine).  We then discuss why the *ad hoc* IR design
we used for the bytecode interpreter has its limitations when designing processors for more complex
languages.  Finally, we introduce the Abstract Syntax Tree (AST) as an intermediate representation
and show that it can be derived directly from the grammar itself, giving us
a more principled way of constructing intermediate representations.
We illustrate these ideas with a pretty printer program for a simple high-level language.

In [1]:
import sys
sys.path.insert(0,"code")

# Limit of Syntax Directed Processing

In Chapter 3 we introduced syntax directed interpretation as a way to add semantics to programming
languages.
However, this scheme fails when a language construct needs access to information that cannot be computed from the local syntactic structure and has not yet been entered into a table such as the symbol table.
Classic examples of this are the goto-statement in C and the `jump to label` machine code instruction.
In order to examine this more closely we extend our Exp1 language with conditional and unconditional jump instructions
and call the new language *Exp1bytecode*.
This new language is based on our Exp1 language but introduces five new statements: 

* `noop` - a statement that does nothing.
* `stop` - a statement that halts the execution.
* `jumpT exp label` - a statement that evaluates `exp` and then jumps to the `label` if the expression evaluates to true.
* `jumpF exp label` - a statement that evaluates `exp` and then jumps to the `label` if the expression evaluates to false.
* `jump label` - an unconditional jump to the `label`.

Recall that in Exp1 expressions are based on integer values.
Therefore, in order to compute the truth values necessary for the conditional jump instructions we adopt the following convention: an expression value of zero represents the boolean value false and a non-zero expression value represents the boolean value true.

Our Exp1bytecode language also introduces the idea of labeled statements as targets for jump statements.
Labels are names followed by a colon that precede a statement.
For example,
```
      store x 5;
L1:   store x (- x 1);
      jumpT x L1;
```
This program loops while `x` is non-zero.
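
To make the control flow concrete, here is a minimal Python sketch of what this three-line program does. The function name `run_countdown` and the `trace` list are our own illustration and not part of Exp1bytecode. Note that the decrement happens before the test, so the loop body runs at least once.

```python
def run_countdown(start):
    """Mimic the three-instruction program above, recording each value of x."""
    x = start                  # store x 5;
    trace = []
    while True:
        x = x - 1              # L1: store x (- x 1);
        trace.append(x)
        if x == 0:             # jumpT x L1;  falls through once x is zero
            break
    return trace

print(run_countdown(5))
```

The loop ends as soon as `x` reaches zero because `jumpT` treats zero as false and lets execution continue with the next instruction.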

<!--
\index{relational operator}
-->

In order to write some interesting programs in this new language we also introduce two new operators:

* `=` - the equality relational operator.
* `=<` - the less-equal relational operator. 

Both operators return zero for the boolean value false and one for the boolean value true.
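
The truth value convention and the two relational operators can be sketched in a few lines of Python. The helper names `as_bool`, `eq`, and `le` are hypothetical and only serve this illustration.

```python
def as_bool(value):
    """Exp1bytecode convention: zero is false, any non-zero value is true."""
    return value != 0

def eq(left, right):
    """The `=` operator: 1 if the operands are equal, otherwise 0."""
    return 1 if left == right else 0

def le(left, right):
    """The `=<` operator: 1 if left is less than or equal to right, otherwise 0."""
    return 1 if left <= right else 0

print(eq(3, 3), le(4, 2), as_bool(-7), as_bool(0))
```

Because the operators produce ordinary integers, their results can be used directly as the condition of a `jumpT` or `jumpF` instruction.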

Here is the grammar for our Exp1bytecode language.

In [None]:
# %load code/exp1bytecode_gram.py
from ply import yacc
from exp1bytecode_lex import tokens, lexer

def p_grammar(_):
    '''
    prog : stmt_list

    stmt_list : labeled_stmt stmt_list
              | empty

    labeled_stmt : label_def stmt

    label_def : NAME ':' 
              | empty

    stmt : PRINT exp ';'
         | STORE NAME exp ';'
         | JUMPT exp label ';'
         | JUMPF exp label ';'
         | JUMP label ';'
         | STOP ';'
         | NOOP ';'

    exp : '+' exp exp
        | '-' exp exp
        | '-' exp
        | '*' exp exp
        | '/' exp exp
        | EQ exp exp
        | LE exp exp
        | '(' exp ')'
        | var
        | NUMBER
        
    label : NAME
    var : NAME
    '''
    pass

def p_empty(p):
    'empty :'
    pass

def p_error(t):
    print("Syntax error at '%s'" % t.value)

parser = yacc.yacc()


You can still clearly see the Exp1 lineage shining through.  However, lists of statements are now lists of labeled statements where a label definition is a name followed by a colon.  Statements now include all the statements from our earlier design discussion of this language and we have also enriched expressions to include all the standard arithmetic operators including the unary minus.
The unary minus introduces shift/reduce conflicts into our parser because the unary minus expression `'-' exp` is a prefix to the binary subtraction expression `'-' exp exp`.  However, the standard conflict resolution of LR parsers for shift/reduce conflicts is to always shift when possible.  This behavior is exactly what we want so we can just leave it the way it is.

Here is the corresponding lexical analyzer for Exp1bytecode.

In [None]:
# %load code/exp1bytecode_lex.py
# Lexer for Exp1bytecode

from ply import lex

reserved = {
    'store' : 'STORE',
    'print' : 'PRINT',
    'jumpT' : 'JUMPT',
    'jumpF' : 'JUMPF',
    'jump'  : 'JUMP',
    'stop'  : 'STOP',
    'noop'  : 'NOOP'
}

literals = [':',';','+','-','*','/','(',')']

tokens = ['NAME','NUMBER','EQ','LE'] + list(reserved.values())

t_EQ = '='
t_LE = '=<'
t_ignore = ' \t'

def t_NAME(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value,'NAME')    # Check for reserved words
    return t

def t_NUMBER(t):
    r'[0-9]+'
    t.value = int(t.value)
    return t

def t_NEWLINE(t):
    r'\n'
    pass

def t_COMMENT(t):
    r'\#.*'
    pass
    
def t_error(t):
    print("Illegal character %s" % t.value[0])
    t.lexer.skip(1)

# build the lexer
lexer = lex.lex()


There should be no real surprises in this definition.  The only real change is that we added comments to our language in the form of `#` comments.  As usual with these kinds of comments, once you start a comment it spans the rest of the line. The regular expression for this is `\#.*` - match the hash symbol followed by zero or more characters, not including the newline character.
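
We can check the behavior of the comment pattern outside of the lexer with Python's `re` module. This is only a sketch; the `source` string is a made-up example, and the lexer itself discards comments through `t_COMMENT` rather than through `re.sub`.

```python
import re

# the same pattern used by t_COMMENT in the lexer
COMMENT = re.compile(r'\#.*')

source = "store x 5;  # initialize x\nprint x;    # show it\n"

# '.' does not match a newline, so each comment stops at the end of its line
stripped = COMMENT.sub('', source)
print(repr(stripped))
```

Both comments disappear while the two newline characters survive, which is exactly the "rest of the line" behavior described above.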

Let's exercise this parser.

In [5]:
from exp1bytecode_gram import parser

Generating LALR tables


As input program we will use our example program from above.

In [6]:
input_stream = \
'''
      store x 5;
L1:   store x (- x 1);
      jumpT x L1;

'''

In [7]:
parser.parse(input_stream)

Good! No syntax or other errors were reported.  That means our parser works.

<!--
\index{label!definition}
\index{label!reference}
-->

Now, back to our problem at hand: the syntax directed interpretation of this language.
As long as we are dealing with expressions in Exp1bytecode things are fine.
We could easily envision providing syntax directed interpretation for expressions similar to what we
did in Exp1,
```Python
   ...

def p_arith_exp(p):
    """
    exp : '+' exp exp
        | '-' exp exp
        | '(' exp ')'
    """
    if p[1] == '+':
        p[0] = ' (+' + p[2] + p[3] + ')'
    elif p[1] == '-':
        p[0] = ' (-' + p[2] + p[3] + ')'
    elif p[1] == '(':
        p[0] = p[2]
    else:
        raise ValueError("parsing weirdness in expressions: {} !".format(p[1]))

def p_var_exp(p):
    "exp : var"
    p[0] = p[1]
    
def p_num_exp(p):
    "exp : num"
    p[0] = p[1]

   ...
```
All information is available at the point in time when we recognize a syntactic structure and we are able to evaluate the embedded rules.
The same is true for `print` and `store` statements.  All the information required to interpret these statements
is available at the time their syntax is recognized by the parser.

<!--
\index{forward jump}
-->

Trouble arises when we try to perform syntax directed interpretation of jump statements. Consider,
```Python
def p_stmt(p):
    '''
    stmt :
         ...
         | JUMP label ';'
         ...
    '''
    if p[1] == 'jump':
        target = p[2]
        # and we are in deep trouble - jump target is not local!?!
```
Trying to do this in a syntax directed fashion gets us into deep trouble because the target of the jump instruction
is not local to the syntactic unit of the jump instruction.  As a matter of fact, if the jump is a forward jump
in the code then we will not have actually seen the target yet!

Consider the following,
```
      store x 10;
      jumpT (= x 10) L1;
      print 0;
      stop;
L1:   print 1;
      stop;
```
This program stores the value ten in `x`, then checks if `x` has the value ten.  
If so, it jumps forward to the label `L1` and prints out the value one and stops the execution.
Otherwise it prints out the value zero and stops the execution.
It is a silly program but it illustrates the point quite nicely that the syntax directed processing of the `jumpT` statement
will fail because at the point of interpreting the jump statement we have not seen the label definition yet.
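
The problem can be reproduced without a parser. The sketch below assumes a hypothetical flat encoding of the program as `(label, instruction)` pairs in source order, with the jump condition left out for brevity. It processes the pairs in a single pass, the way a syntax directed scheme would, and reports the first jump whose target has not been seen yet.

```python
# (label, instruction) pairs in source order; an encoding invented for this sketch
listing = [
    (None, ('store', 'x', 10)),
    (None, ('jumpT', 'L1')),      # condition omitted for brevity
    (None, ('print', 0)),
    (None, ('stop',)),
    ('L1', ('print', 1)),
    (None, ('stop',)),
]

def first_unresolved_jump(listing):
    """Return (position, label) of the first jump whose target label
    has not been defined at the point where the jump is processed."""
    seen_labels = set()
    for position, (label, instr) in enumerate(listing):
        if label is not None:
            seen_labels.add(label)
        if instr[0] in ('jump', 'jumpT', 'jumpF'):
            target = instr[-1]
            if target not in seen_labels:
                return position, target
    return None

print(first_unresolved_jump(listing))
```

The jump at position 1 refers to `L1`, which is only defined at position 4, so a single pass has nothing to jump to.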

# Decoupling Syntax Analysis and Semantic Processing

<!--
\index{interpreter}
\index{syntax analysis}
\index{semantic analysis}
\index{intermediate representation}
\index{IR}
-->

In order to interpret languages like Exp1bytecode we need to decouple the syntax analysis from the actual interpretation, that is,
we need to procrastinate with our interpretation of the program by first constructing an abstract representation of it, the *intermediate representation*, and then in turn evaluate or interpret this abstract representation.
This fits nicely with our architecture of an interpreter of Figure 6 in Chapter 1.
Here our interpreter has two phases that are coupled with an intermediate representation of the program.
The first phase builds the intermediate representation and the second phase interprets the intermediate representation.

We can say the following about any kind of intermediate representation:
> An intermediate representation (IR) is an abstract representation of the original program.

## Feature Driven IR Design

<!--
\index{IR design}
-->

Since the IR is at the core of our interpreter, a good IR design is paramount:

> A good IR should be easy to construct and easy to process.

Here we take an approach to IR design that is driven by particular features of our language at hand.
In our case we can view Exp1bytecode as representing an abstract machine.

## The Exp1bytecode Interpreter

---
<center>
<img src="figures/chap04/1/figure/Slide1.jpg" alt="">
Fig 1. IR design for the Exp1bytecode interpreter.
</center>

---


If we look at the Exp1bytecode syntax we can identify three major characteristics of this language:

* We have variables that hold values and these values can be changed and referenced by instructions.
* We have conditional and unconditional jumps which use label definitions and references to specify the range of the jumps.
* Programs in this language consist of a sequence of instructions.

Given these features of Exp1bytecode and given the fact that the
language looks like very abstract machine code, one design choice is to make our IR resemble a virtual machine that consists of three
major components:

* A symbol table to hold variable definitions.
* A label table to hold label definitions.
* A list of instructions.
 
Figure 1 shows our IR design.
Our abstract machine is shown with the program,
```
   store x 10 ;
L1:
   print x ;
   store x (- x 1) ;
   jumpT x L1 ;
   stop ;
```
loaded and ready to be interpreted.
Given that programs in the IR representation still look very much like the programs in the original textual representation the IR is easy to construct.
Also, given that programs are represented as a list of instructions they are easy to interpret -- we simply 
walk down the list of instructions and execute each one in turn.
So it seems that our IR design fulfills the two key points of IR design we made above: easy to construct and easy to process.
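
As a sketch of this design, the three components can be bundled into one small Python class. The class name `AbstractMachine` and its `add` method are assumptions made for this illustration; the implementation later in this chapter uses module-level variables instead. The instruction tuples follow the shape used by that implementation.

```python
from dataclasses import dataclass, field

@dataclass
class AbstractMachine:
    program: list = field(default_factory=list)       # list of instruction tuples
    symbol_table: dict = field(default_factory=dict)  # variable name -> value
    label_table: dict = field(default_factory=dict)   # label name -> instruction index

    def add(self, instr, label=None):
        """Append an instruction; if it is labeled, record its index."""
        if label:
            self.label_table[label] = len(self.program)
        self.program.append(instr)

machine = AbstractMachine()
machine.add(('store', 'x', (10,)))
machine.add(('print', ('x',)), label='L1')
machine.add(('store', 'x', ('-', ('x',), (1,))))
machine.add(('jumpT', ('x',), 'L1'))
machine.add(('stop',))

print(machine.label_table)
```

After loading the program, the label table maps `L1` to index 1, the position of the `print` instruction, and the symbol table is still empty because nothing has been executed yet.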

Now, let us just think through the issue with labels that we had before when we attempted a syntax directed approach to the interpretation of Exp1bytecode.
In our new IR design labels behave much like variables in the sense that we have a definition point and we have label references.
The definition points of labels are the labeled instructions and the reference points are the labels in the jump
instructions.
In order to deal with this effectively our IR uses a label table that records the instructions that act as definition points for particular labels.
In our example in Figure 1 we see that the label table holds the label `L1` and the entry 
for this label points to the definition point of this label, namely the print statement in the program.
Label references point back to the label table and therefore we can find and resolve the targets for any jumps that occur in a program.
Also note that forward references are no longer a problem because during the syntax analysis phase we will have seen all label
definition points and entered them into the label table before the semantic phase started.
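
Once the whole program has been loaded, every jump target can be checked against the label table before interpretation starts. The following sketch assumes instructions are tuples whose last element is the target label for the three jump instructions; the helper `undefined_labels` is hypothetical and not part of the chapter's interpreter.

```python
def undefined_labels(program, label_table):
    """Return the jump targets that have no entry in the label table."""
    jumps = {'jump', 'jumpT', 'jumpF'}
    missing = []
    for instr in program:
        if instr[0] in jumps and instr[-1] not in label_table:
            missing.append(instr[-1])
    return missing

# the forward-jump example from the previous section in IR form
program = [
    ('store', 'x', (10,)),
    ('jumpT', ('=', ('x',), (10,)), 'L1'),
    ('print', (0,)),
    ('stop',),
    ('print', (1,)),
    ('stop',),
]
label_table = {'L1': 4}

print(undefined_labels(program, label_table))
print(undefined_labels(program + [('jump', 'L2')], label_table))
```

The forward reference to `L1` resolves without trouble because the table is complete by the time we look at the jump, while the reference to the undefined label `L2` is reported.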

Here is an animation of the abstract machine executing our program.

<!-- chap04 q1 -->

<a href="http://www.youtube.com/watch?feature=player_embedded&v=7oY-FS0jHvo" target="_blank">
<img style='border:1px solid #000000' src="movie.jpg" width="120" height="90" />
</a>


### Implementation

The IR construction is embedded in the grammar specification for our parser.  Let's take a look.

In [None]:
# %load code/exp1bytecode_interp_gram
from ply import yacc
from exp1bytecode_lex import tokens, lexer

# define the structures of our abstract machine
addr_ix = 0
program = []
symbol_table = dict()
label_table = dict()

def p_prog(_):
    '''
    prog : stmt_list
    '''
    pass

def p_stmt_list(_):
    '''
    stmt_list : labeled_stmt stmt_list
              | empty
    '''
    pass

def p_labeled_stmt(p):
    '''
    labeled_stmt : label_def stmt
    '''
    global label_table
    global program
    global addr_ix
    # if label exists record it in the label table
    if p[1]:
        label_table[p[1]] = addr_ix
    # append stmt to program
    program.append(p[2])
    addr_ix += 1

def p_label_def(p):
    '''
    label_def : NAME ':' 
              | empty
    '''
    p[0] = p[1]

def p_stmt(p):
    '''
    stmt : PRINT exp ';'
         | STORE NAME exp ';'
         | JUMPT exp label ';'
         | JUMPF exp label ';'
         | JUMP label ';'
         | STOP ';'
         | NOOP ';'
    '''
    # for each stmt assemble the appropriate tuple
    if p[1] == 'print':
        p[0] = ('print', p[2])
    elif p[1] == 'store':
        p[0] = ('store', p[2], p[3])
    elif p[1] == 'jumpT':
        p[0] = ('jumpT', p[2], p[3])
    elif p[1] == 'jumpF':
        p[0] = ('jumpF', p[2], p[3])
    elif p[1] == 'jump':
        p[0] = ('jump', p[2])
    elif p[1] == 'stop':
        p[0] = ('stop',)
    elif p[1] == 'noop':
        p[0] = ('noop',)
    else:
        raise ValueError("Unexpected stmt value: {}".format(p[1]))

def p_bin_exp(p):
    '''
    exp : '+' exp exp
        | '-' exp exp
        | '*' exp exp
        | '/' exp exp
        | EQ exp exp
        | LE exp exp
    '''
    p[0] = (p[1], p[2], p[3])
    
def p_uminus_exp(p):
    '''
    exp : '-' exp
    '''
    p[0] = ('UMINUS', p[2])
    
def p_paren_exp(p):
    '''
    exp : '(' exp ')'
    '''
    # parens are not necessary in trees
    p[0] = p[2]
    
def p_var_exp(p):
    '''
    exp : var
    '''
    p[0] = (p[1],)

def p_number_exp(p):
    '''
    exp : NUMBER
    '''
    p[0] = (int(p[1]),)

def p_label_or_var(p):
    '''
    label : NAME
    var : NAME
    '''
    p[0] = p[1]

def p_empty(p):
    '''
    empty :
    '''
    p[0] = ''

def p_error(t):
    print("Syntax error at '%s'" % t.value)

parser = yacc.yacc()


After the preamble that includes our lexer module we see that we are defining the data structures for our abstract machine: a program list for our instructions, a symbol table dictionary for variable lookups, and our label table for resolving jump targets.  We also have an address index variable that keeps track of where we are inserting instructions into the program list.

The next interesting piece of code is the parsing function for the rule,
```
labeled_stmt : label_def stmt
```
The embedded action associated with this rule is,
```Python
# if label exists record it in the label table
if p[1]:
    label_table[p[1]] = addr_ix
# append stmt to program
program.append(p[2])
addr_ix += 1
```
Pretty straightforward.

Parsing statements themselves constructs a tuple consisting of the name of the instruction together with its arguments. Take a look,
```
'''
stmt : PRINT exp ';'
    ...
     | JUMP label ';'
    ...
'''
```
```Python
    # for each stmt assemble the appropriate tuple
    if p[1] == 'print':
        p[0] = ('print', p[2])
    ...
    elif p[1] == 'jump':
        p[0] = ('jump', p[2])
    ...
```
Here the tuple for the print instruction consists of the name `print` and the expression tree that represents the value to be printed out.
The tuple for the jump instruction consists of the name `jump` and the name of the target label.
Similarly for all the other instructions.

Finally, the last bit of interesting code in the grammar specification is the definition of expressions,
```
'''
exp : '+' exp exp
    | '-' exp exp
    | '*' exp exp
    | '/' exp exp
    | EQ exp exp
    | LE exp exp
'''
```
```Python
p[0] = (p[1], p[2], p[3])
```
Here we construct a labeled term tree from the expressions.  According to these rules the expression,
```
=< + 3 2 * 3 2
```
would give rise to the labeled term tree,
```
('=<', ('+', (3,), (2,)), ('*', (3,), (2,)))
```
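
To see how the nesting of the tuples follows the prefix structure of the expression, here is a small hand-written sketch that builds the same kind of tree from a list of tokens. It is not the PLY parser: the function `build_tree` is hypothetical and, as a simplifying assumption, it handles only binary operators, numbers, and variable names, with no unary minus and no parentheses.

```python
BINARY_OPS = {'+', '-', '*', '/', '=', '=<'}

def build_tree(tokens):
    """Consume tokens from the front of the list and return a term tree."""
    tok = tokens.pop(0)
    if tok in BINARY_OPS:
        # an operator is followed by its two operand expressions
        left = build_tree(tokens)
        right = build_tree(tokens)
        return (tok, left, right)
    if tok.isdigit():
        return (int(tok),)
    return (tok,)          # anything else is treated as a variable name

tree = build_tree('=< + 3 2 * 3 2'.split())
print(tree)
```

Each operator becomes the label of a tuple whose remaining elements are the trees of its operands, and numbers and variables become one-element tuples at the leaves.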

Let's give our parser a test run.

In [9]:
from exp1bytecode_interp_gram import parser, program, symbol_table, label_table
import pprint
pp = pprint.PrettyPrinter()

Generating LALR tables


In [10]:
input_stream = \
'''
   store x 10 ;
L1:
   print x ;
   store x (- x 1) ;
   jumpT x L1 ;
   stop ;
'''

In [11]:
parser.parse(input_stream)

In [12]:
pp.pprint(program)

[('store', 'x', (10,)),
 ('print', ('x',)),
 ('store', 'x', ('-', ('x',), (1,))),
 ('jumpT', ('x',), 'L1'),
 ('stop',)]


In [13]:
pp.pprint(label_table)

{'L1': 1}


In [14]:
pp.pprint(symbol_table)

{}


In order to interpret the programs in our IR we need two functions.  The first one is the interpretation of instructions on the program list.

In [23]:
def interp_program():
    global program
    global symbol_table
    global label_table
    
    # We cannot use the list iterator here because we
    # need to be able to interpret jump instructions
    
    # start at the first instruction in program
    instr_ix = 0

    # keep interpreting until we ran out of instructions
    # or we hit a 'stop'
    while True:
        if instr_ix == len(program):
            # no more instructions
            break
        else:    
            # get instruction from program
            instr = program[instr_ix]

        # instruction format: (name, arg1, arg2, ...)
        name = instr[0]
        
        # interpret instruction
        if name == 'print':
            # PRINT exp 
            exp_tree = instr[1]
            val = eval_exp_tree(exp_tree)
            print("> {}".format(val))
            instr_ix += 1
            
        elif name == 'store':
            # STORE NAME exp
            var_name = instr[1]
            val = eval_exp_tree(instr[2])
            symbol_table[var_name] = val
            instr_ix += 1

        elif name == 'jumpT':
            # JUMPT exp label
            val = eval_exp_tree(instr[1])
            if val:
                instr_ix = label_table.get(instr[2], None)
            else:
                instr_ix += 1

        elif name == 'jumpF':
            # JUMPF exp label
            val = eval_exp_tree(instr[1])
            if not val:
                instr_ix = label_table.get(instr[2], None)
            else:
                instr_ix += 1

        elif name == 'jump':
            # JUMP label
            instr_ix = label_table.get(instr[1], None)

        elif name == 'stop':
            # STOP
            break
            
        elif name == 'noop':
            # NOOP
            instr_ix += 1
            
        else:
            raise ValueError("Unexpected instruction name: {}".format(name))
   

In order to have a complete interpreter for our abstract machine we need to provide the function that evaluates expression trees. 

In [16]:
def eval_exp_tree(node):
    global symbol_table
    
    # tree nodes are tuples (NAME, [arg1, arg2,...])
    
    name = node[0]
    
    if name == '+':
        # '+' exp exp
        v_left = eval_exp_tree(node[1])
        v_right = eval_exp_tree(node[2])
        return v_left + v_right

    elif name == '-':
        # '-' exp exp
        v_left = eval_exp_tree(node[1])
        v_right = eval_exp_tree(node[2])
        return v_left - v_right

    elif name == '*':
        # '*' exp exp
        v_left = eval_exp_tree(node[1])
        v_right = eval_exp_tree(node[2])
        return v_left * v_right

    elif name == '/':
        # '/' exp exp
        v_left = eval_exp_tree(node[1])
        v_right = eval_exp_tree(node[2])
        return v_left // v_right

    elif name == '=':
        # '=' exp exp
        v_left = eval_exp_tree(node[1])
        v_right = eval_exp_tree(node[2])
        return v_left == v_right

    elif name == '=<':
        # '=<' exp exp
        v_left = eval_exp_tree(node[1])
        v_right = eval_exp_tree(node[2])
        return v_left <= v_right
    
    elif name == 'UMINUS':
        # 'UMINUS' exp
        val = eval_exp_tree(node[1])
        return - val
    
    elif type(name) is str:
        # NAME
        return symbol_table.get(name,0)
    
    elif type(name) is int:
        # NUMBER
        return name
 

In [17]:
eval_exp_tree(('=<', ('+', (3,),  (2,)), ('*', (3,), (2,))))


True
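
The chain of `elif` branches for the binary operators is very regular, so one alternative is a table that maps operator names to functions. The sketch below is a possible variation under that assumption and not the evaluator used in this chapter; the name `eval_tree` and the explicit `symbols` parameter are our own choices.

```python
import operator

# operator name -> function implementing it
BINARY = {
    '+': operator.add,
    '-': operator.sub,
    '*': operator.mul,
    '/': operator.floordiv,
    '=': operator.eq,
    '=<': operator.le,
}

def eval_tree(node, symbols):
    """Evaluate a term tree of the shape built by the parser above."""
    name = node[0]
    if len(node) == 3:
        # binary operator with two subtrees
        return BINARY[name](eval_tree(node[1], symbols),
                            eval_tree(node[2], symbols))
    if len(node) == 2 and name == 'UMINUS':
        return -eval_tree(node[1], symbols)
    if isinstance(name, int):
        return name
    # variable reference; undefined variables default to 0
    return symbols.get(name, 0)

print(eval_tree(('=<', ('+', (3,), (2,)), ('*', (3,), (2,))), {}))
```

Adding a new binary operator then only requires a new table entry, at the cost of making the evaluation of each operator a little less explicit.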

In [24]:
interp_program()


> 10
> 9
> 8
> 7
> 6
> 5
> 4
> 3
> 2
> 1


In [25]:
symbol_table

{'x': 0}

# Summary

# Bibliographic Notes

# Exercises

1. How would you change Exp1bytecode to make it amenable to syntax directed interpretation?
   (**Hint:** Add structured programming constructs.)
   Implement a grammar specification for your new language and illustrate that it can support syntax directed interpretation.

2. (project) Write a syntax directed interpreter for the language you designed in Exercise 1.

3. (project) Consider our Exp1bytecode language given by the grammar earlier in this chapter.
   Add a new branching instruction called `compare` to the language.
   The syntax of this instruction is as follows,
   ```
   stmt : 'compare' exp exp label label label ';'
   ```
   and its semantics can be described like this,
   * If the first expression has a value less than the second expression then jump to the first label.
   * If the expressions have equal values then jump to the second label.
   * If the first expression has a value larger than the second expression then jump to the third label.

   Modify the interpreter for Exp1bytecode to accommodate this new instruction.