# Program Analysis with Intermediate Representations

<!--
\index{intermediate representation}
\index{IR}
\index{bytecode interpreter}
\index{virtual machine}
\index{abstract syntax tree}
\index{AST}
\index{pretty printer}
-->

In this chapter we show that the simple, syntax directed scheme of processing programming
languages introduced in Chapter 2 is not powerful enough to handle certain
standard programming constructs, such as the `jump to label` instruction.  We show that such
constructs can be processed by first constructing an intermediate representation (IR) of the program
and then using this IR during the actual processing of the program.  We illustrate these ideas with a simple
bytecode language (virtual machine).  We then discuss why the *ad hoc* IR design
we used for the bytecode interpreter has its limitations when designing processors for more complex
languages.  Finally, we introduce the Abstract Syntax Tree (AST) as an intermediate representation
and show that it can be derived directly from the grammar itself, giving us
a more principled way of constructing intermediate representations.
We illustrate these ideas with a pretty printer program for a simple high-level language.

## Beyond Syntax Directed Processing

In Chapter 2 we introduced syntax directed interpretation as a way to add semantics to programming
languages.
However, this scheme fails when a language construct needs access to information that cannot be computed from the local syntactic structure and is not yet available elsewhere, for instance in the symbol table.
Classic examples of this are the goto statement in C and the `jump to label` machine code instruction.
In order to examine this a little bit closer we extend our Exp1 language with conditional and unconditional jump instructions
and call the new language Exp1bytecode.  
This new language is based on our Exp1 language but introduces five new statements: 
\begin{description}
\item[{\icd noop}] -- a statement that does nothing.
\item[{\icd stop}] -- a statement that halts the execution.
\item[{\icd jumpT exp label}] -- a statement that evaluates {\icd exp} and then jumps to the {\icd label} if the expression evaluates to true.
\item[{\icd jumpF exp label}] -- a statement that evaluates {\icd exp} and then jumps to the {\icd label} if the expression evaluates to false.
\item[{\icd jump label}] -- an unconditional jump to the {\icd label}.
\end{description}
Recall that in Exp1 expressions are based on integer values.
Therefore, in order to compute the truth values necessary for the conditional jump instructions we adopt the following convention: an expression value of zero represents the boolean value false and a non-zero expression value represents the boolean value true.

Our Exp1bytecode language also introduces the idea of labeled statements as targets for jump statements.
Labels are names followed by a colon that precede a statement.
For example,
\begin{code}
      store x 5;
L1:   store x (- x 1);
      jumpT x L1;
\end{code}
This program loops while {\icd x} is non-zero.

\index{relational operator}
In order to write some interesting programs in this new language we also introduce two new operators:
\begin{description}
\item[{\icd =}] -- the equality relational operator.
\item[{\icd =<}] -- the less-equal relational operator. 
\end{description}
Both operators return zero for the boolean value false and one for the boolean value true.
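
The truth-value convention and the two relational operators can be sketched in Java as follows.
This is only an illustration: the class name {\icd TruthConvention} and the helper names {\icd isTrue}, {\icd eq}, and {\icd le} are hypothetical and are not part of the interpreter developed in this chapter.

```java
public class TruthConvention {
    // Zero represents false; any non-zero value represents true.
    static boolean isTrue(int value) {
        return value != 0;
    }

    // The '=' operator: one if both operands are equal, zero otherwise.
    static int eq(int a, int b) {
        return a == b ? 1 : 0;
    }

    // The '=<' operator: one if a is less than or equal to b, zero otherwise.
    static int le(int a, int b) {
        return a <= b ? 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(eq(10, 10));  // prints 1
        System.out.println(le(3, 2));    // prints 0
        System.out.println(isTrue(-7));  // prints true
    }
}
```

Because the operators produce ordinary integers, their results can be fed directly into {\icd jumpT} and {\icd jumpF} without a separate boolean type.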

%%%%%%%% figure  %%%%%%%%%
\myfigureA
{chap03:exp1bytecode-gram}
{\input{figures/chap03/1/exp1bytecode-gram.tex}}
{The grammar specification for the Exp1bytecode language.}

\index{label!definition}
\index{label!reference}
Figure~\ref{chap03:exp1bytecode-gram} shows the full grammar specification for our Exp1bytecode language.
The specification is straightforward with the exception perhaps of the labels.
Labels can appear in two different contexts.  On line 10 we see that labels may appear in a label definition, and
they may appear as label references in the jump statements on lines 14 through 16.
A label definition is the label name followed by a colon.
According to the rule on line 8 that defines programs as non-empty lists of statements, a label definition may appear
as part of a statement.
In order for this language to make sense only one label definition per label name is allowed.
With respect to label references we allow multiple references to the same label definition.
In other words, the same labeled statement can be the target of multiple jump instructions.

Now, back to our problem at hand: the syntax directed interpretation of this language.
As long as we are dealing with expressions in Exp1bytecode things are fine,
\antlrlistingnomath
exp returns [Integer value]
	:	'+' e1=exp e2=exp 	{ $value = $e1.value + $e2.value; }
	|	'-' e1=exp e2=exp 	{ $value = $e1.value - $e2.value; }
	
	...
	
	|	'(' e=exp ')' 		{ $value = $e.value; }
	|	rhsvar 				{ $value = $rhsvar.value; }
	|	NUM					{ $value = Integer.valueOf($NUM.text); }
	;
	
rhsvar returns [Integer value]
	:	NAME				{ $value = lookup($NAME.text); }
	;
\end{lstlisting}
All information is available at the point in time when we recognize a syntactic structure and we are able to evaluate the semantic rules.
Trouble arises when we try to perform syntax directed interpretation of jump statements.
\antlrlistingnomath
stmt:

	...

	| 'jumpT' exp label ';'  { if ($exp.value != 0) jumpTo($label.name); }
	| 'jumpF' exp label ';'  { if ($exp.value == 0) jumpTo($label.name); }
	| 'jump' label ';'	     { jumpTo($label.name); }

	...

	;
	
label returns [String name]
	: NAME	{ $name = $NAME.text; }
	;
\end{lstlisting}
\index{forward jump}
From a syntax directed perspective labels have a value, their label name, but this value does not tell us where to jump to.
In order to know where to jump to we would need the label definition point and this is where the trouble arises in the function
{\icd jumpTo} in the code above: the label definition point cannot be computed in a syntax directed fashion.
Even if we treated labels like variables, in that they have a definition point and reference points, and used a label table in which we
store labels together with their definition points, this would not
work because the jump instruction we are processing might be a forward jump, that is, we have not even seen
the label definition yet.
Consider the following,
\begin{code}
      store x 10;
      jumpT (= x 10) L1;
      print 0;
      stop;
L1:   print 1;
      stop;
\end{code}
This program stores the value ten in {\icd x}, then checks if {\icd x} has the value ten.  
If so, it jumps forward to the label {\icd L1} and prints out the value one and stops the execution.
Otherwise it prints out the value zero and stops the execution.
It is a silly program but it illustrates the point quite nicely that the syntax directed processing of the {\icd jumpT} statement
will fail because at the point of processing the jump statement we have not seen the label definition yet.
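
The failure can be made concrete with a small Java sketch that mimics a single left-to-right pass over a program.
It is not a parser: statements are given as plain strings, a colon is assumed to occur only in label definitions, and the names {\icd OnePassLabels} and {\icd unresolvedAtJump} are hypothetical.
The sketch records each label definition as it is encountered and reports every jump whose target is still unknown at the moment the jump is processed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OnePassLabels {
    // Scans the statements once, in order, and returns the labels that were
    // referenced by a jump before their definition had been seen.
    static List<String> unresolvedAtJump(String[] program) {
        Map<String, Integer> labelTable = new HashMap<>();
        List<String> unresolved = new ArrayList<>();
        for (int i = 0; i < program.length; i++) {
            String stmt = program[i].trim();
            int colon = stmt.indexOf(':');
            if (colon >= 0) {
                // Label definition: remember the statement index it refers to.
                labelTable.put(stmt.substring(0, colon).trim(), i);
                stmt = stmt.substring(colon + 1).trim();
            }
            if (stmt.startsWith("jump")) {
                // The label reference is the last token before the semicolon.
                String[] tokens = stmt.replace(";", "").trim().split("\\s+");
                String target = tokens[tokens.length - 1];
                if (!labelTable.containsKey(target)) {
                    unresolved.add(target);
                }
            }
        }
        return unresolved;
    }

    public static void main(String[] args) {
        String[] forward = {
            "store x 10;", "jumpT (= x 10) L1;", "print 0;",
            "stop;", "L1: print 1;", "stop;"
        };
        String[] backward = {
            "store x 5;", "L1: store x (- x 1);", "jumpT x L1;"
        };
        System.out.println(unresolvedAtJump(forward));   // prints [L1]
        System.out.println(unresolvedAtJump(backward));  // prints []
    }
}
```

The backward jump of the loop program is resolvable in a single pass, whereas the forward jump is not, and that is exactly the case a syntax directed interpreter cannot handle.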

\index{interpreter}
\index{syntax analysis}
\index{semantic analysis}
\index{intermediate representation}
\index{IR}
In order to interpret languages like Exp1bytecode we decouple the syntax analysis from the actual interpretation, that is,
we build an interpreter that consists of two phases: the syntax analysis and the semantic analysis.
The two phases are coupled by an intermediate representation (IR) of the program.
In other words, we adopt the architecture of the interpreter of Chapter~\ref{chap:prog-lang} in 
Figure~\ref{chap01:interpreter} on page~\pageref{chap01:interpreter}.

%%%%%%%%%%%%%%%%%%%% new section %%%%%%%%%%%%%%%%%%%%%%%
\section{Feature Driven IR Design}

%%%%%%%% figure  %%%%%%%%%
\myfigureA
{chap03:exp1bytecode-IR}
{\includegraphics[width=5.2in]{figures/chap03/2/figure.jpg}}
{IR design for the Exp1bytecode interpreter.}

\index{IR design}
Our interpreter consists of two phases that communicate with each other using the IR.
Since the IR is at the core of our interpreter, a good IR design is paramount:
\begin{myquote}
A good IR should be easy to construct and easy to process.
\end{myquote}
Here we take an approach to IR design that is driven by particular features of the language at hand.

%%%%%%%%%%%%%%%%%%%% new section %%%%%%%%%%%%%%%%%%%%%%%
\subsection{An Example: A Bytecode Interpreter}

If we look at Exp1bytecode we can identify three major characteristics of this language:
\begin{itemize}
\item We have variables that hold values and these values can be changed and referenced by instructions.
\item We have conditional and unconditional jumps which use label definitions and references to specify the range of the jumps.
\item Programs in this language consist of a sequence of concrete instructions.
\end{itemize}
Given these features of Exp1bytecode and given the fact that the
language looks like very abstract machine code or bytecode, one design choice is to make our IR resemble a virtual machine that consists of three
major entities:
\begin{itemize}
\item A symbol table to hold variable definitions.
\item A label table to hold label definitions.
\item A container (in the object-oriented sense) to hold a list of instructions, that is, a container that holds our program to be interpreted.
\end{itemize} 
Figure~\ref{chap03:exp1bytecode-IR} shows this IR design.
Here the abstract machine is shown with the program,
\begin{code}
   store x 10 ;
L1:
   print x ;
   store x (- x 1) ;
   jumpT x L1 ;
   stop ;
\end{code}
loaded into its data structures.
Given that programs in the IR still look very much like the programs in the original textual representation,
the IR should be easy to construct (which it is, as we will see later on).
Also, given that programs are represented as a linked list of instructions, the programs should also be easy to interpret: we simply
walk down the list of instructions and execute each one in turn.
Again, as we will see later on, in an OO environment this is straightforward.
So it seems that our IR design fulfills the two key requirements of IR design we stated above: easy to construct and easy to process.

Now, let us just think through the issue with labels that we had before when we attempted a syntax directed approach to the interpretation 
of Exp1bytecode.
In our IR design labels behave much like variables in the sense that you have a definition point and you have label references.
In order to deal with this effectively our IR uses a label table that records the instructions that act as definition points for particular labels.
In our example in Figure~\ref{chap03:exp1bytecode-IR} we see that the label table holds the label {\icd L1} and the entry 
for this label points to the definition point of this label, namely the print statement in the program.
Label references point back to the label table and therefore we can find and resolve the targets for any jumps that occur in a program.
Also note that forward references are no longer a problem because the separate syntax analysis phase will have seen all label
definition points and entered them into the label table before the semantic phase starts.
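
The following Java sketch illustrates how such an IR resolves a forward jump.
It rests on several simplifying assumptions: the class {\icd MiniVM} and its members are hypothetical, instructions are stored in an array-backed list and addressed by index rather than linked together, instructions are written as lambdas instead of dedicated instruction classes, and the first phase is simulated by hand instead of being driven by a parser.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MiniVM {
    // The three IR entities: symbol table, label table, instruction list.
    final Map<String, Integer> symbols = new HashMap<>();
    final Map<String, Integer> labels = new HashMap<>();
    final List<Instr> program = new ArrayList<>();
    // Printed values are collected here so that they can be inspected.
    final List<Integer> output = new ArrayList<>();

    // An instruction returns the index of the next instruction, or -1 to halt.
    interface Instr {
        int exec(MiniVM vm, int pc);
    }

    // Phase 1 helper: append an instruction; a non-null label is defined
    // as pointing at that instruction.
    void add(String label, Instr instr) {
        if (label != null) {
            labels.put(label, program.size());
        }
        program.add(instr);
    }

    // Label reference: look up the definition point in the label table.
    int target(String label) {
        Integer index = labels.get(label);
        if (index == null) {
            throw new IllegalStateException("undefined label " + label);
        }
        return index;
    }

    // Phase 2: execute instructions until one halts or the program ends.
    void run() {
        int pc = 0;
        while (pc >= 0 && pc < program.size()) {
            pc = program.get(pc).exec(this, pc);
        }
    }

    // Builds and runs the forward-jump program; returns the printed values.
    static List<Integer> demo() {
        MiniVM vm = new MiniVM();
        // store x 10;
        vm.add(null, (m, pc) -> { m.symbols.put("x", 10); return pc + 1; });
        // jumpT (= x 10) L1;
        vm.add(null, (m, pc) -> m.symbols.get("x") == 10 ? m.target("L1") : pc + 1);
        // print 0;
        vm.add(null, (m, pc) -> { m.output.add(0); return pc + 1; });
        // stop;
        vm.add(null, (m, pc) -> -1);
        // L1: print 1;
        vm.add("L1", (m, pc) -> { m.output.add(1); return pc + 1; });
        // stop;
        vm.add(null, (m, pc) -> -1);
        vm.run();
        return vm.output;
    }

    public static void main(String[] args) {
        System.out.println(demo());  // prints [1]
    }
}
```

The label {\icd L1} is looked up only when the jump executes, and by then the whole program, including the definition of {\icd L1}, has already been loaded into the IR.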

%%%% qr code %%%%
\qrcode
{Scan the QR code or use the URL in order to see an animation of the Exp1bytecode virtual machine.}
{qrcodes/chap03/q1/qrcode.png}
{\bookurl/b/3/q1/figure.mov}

%%%%%%%%%%%%%%%% new section %%%%%%%%%%%%%%%%%%%
\subsubsection{Design Details}

%%%%%%%% figure  %%%%%%%%%
\myfigureA
{chap03:top-level-exp1bytecode}
{\input{figures/chap03/3/figure.tex}}
{The top-level class for the Exp1bytecode interpreter.}

We begin our detailed analysis of the design by looking at the top-level class of the interpreter in Figure~\ref{chap03:top-level-exp1bytecode}.
The most striking thing about this class is that all its members are static members, that is, the class itself really just acts as a container for all the
data structures of our interpreter.
On lines 7 through 9 we see the declaration of our IR and on lines 12 through 15 we see the declaration of our lexer and parser objects together
with their appropriate input streams.

The top-level class also contains the static main function that is the entry point for the interpreter.
The layout of this function is slightly different from the layout of the main functions we encountered in the previous chapter.
Here we are embedding the code in a try-catch block because now the syntax as well as the semantic phase can generate errors and the
most convenient way to deal with these is as exceptions.

As perhaps expected, the main function first instantiates all the IR objects (lines 21 through 23) and then continues to instantiate the
lexer and parser objects (lines 26 through 29).
Once these objects have been instantiated we are ready to start our interpretation.
On line 32 we call the parsing function associated with the start symbol of our grammar in order to begin the syntax analysis.
Once this phase is complete we execute the semantic analysis phase by calling the \ilisting{run} function associated with our IR.

%%%%%%%% figure  %%%%%%%%%
\myfigureA
{chap03:exp1bytecode-grammar1}
{\input{figures/chap03/4/figure.tex}}
{Statement and expression grammar for the Exp1bytecode interpreter.}

The syntax analysis of our interpreter does nothing more than to construct the IR for the current input program.
This becomes evident when we take a look at the grammar specification for our bytecode interpreter.
Figure~\ref{chap03:exp1bytecode-grammar1} shows the statement and expression grammar snippets of our Exp1bytecode language.
Notice that the actions of the rules do not encode any computation but instead contain code that simply instantiates objects that mimic
the syntax just recognized.
For example, for the \ilisting{print} statement we create an object \ilisting{PrintInstr}, for the \ilisting{store} statement we create an object
\ilisting{StoreInstr}, and so on.
Something similar happens when we parse expressions, with one difference: the objects for expressions are linked together as an expression
tree.
If you look at the expression rules carefully you will see that each expression object (except for variables and numbers) has the expression
trees of its subexpressions linked in.
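
To illustrate the idea of expression objects that are linked into a tree, consider the following Java sketch.
The class names {\icd Expr}, {\icd Num}, {\icd Var}, {\icd Plus}, and {\icd Minus} as well as the {\icd eval} method are assumptions made for this example; the actual classes used by the interpreter may be named and organized differently.

```java
import java.util.HashMap;
import java.util.Map;

public class ExprTreeSketch {
    // Common base type of all expression nodes.
    interface Expr {
        int eval(Map<String, Integer> symbols);
    }

    // Leaf node: an integer constant.
    static final class Num implements Expr {
        final int value;
        Num(int value) { this.value = value; }
        public int eval(Map<String, Integer> symbols) { return value; }
    }

    // Leaf node: a variable reference, resolved when the tree is evaluated.
    static final class Var implements Expr {
        final String name;
        Var(String name) { this.name = name; }
        public int eval(Map<String, Integer> symbols) {
            Integer value = symbols.get(name);
            if (value == null) {
                throw new IllegalStateException("undefined variable " + name);
            }
            return value;
        }
    }

    // Interior node: has the trees of both subexpressions linked in.
    static final class Plus implements Expr {
        final Expr left, right;
        Plus(Expr left, Expr right) { this.left = left; this.right = right; }
        public int eval(Map<String, Integer> symbols) {
            return left.eval(symbols) + right.eval(symbols);
        }
    }

    static final class Minus implements Expr {
        final Expr left, right;
        Minus(Expr left, Expr right) { this.left = left; this.right = right; }
        public int eval(Map<String, Integer> symbols) {
            return left.eval(symbols) - right.eval(symbols);
        }
    }

    public static void main(String[] args) {
        // Tree for the expression (- x 1); building it computes nothing.
        Expr tree = new Minus(new Var("x"), new Num(1));

        Map<String, Integer> symbols = new HashMap<>();
        symbols.put("x", 10);
        System.out.println(tree.eval(symbols));  // prints 9
        symbols.put("x", 3);
        System.out.println(tree.eval(symbols));  // prints 2
    }
}
```

Because the tree is built once and evaluated later, the same expression can be evaluated repeatedly against different variable values, which is what a statement inside a loop requires.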

%%%%%%%% figure  %%%%%%%%%
\myfigureA
{chap03:exp1bytecode-class-hierarchies}
{
	\includegraphics[width=3.4in]{figures/chap03/5/instr-hierarchy.pdf}\\
	a)\\
	\includegraphics[width=1.8in]{figures/chap03/5/expr-hierarchy.pdf}\\
	b)
}
{Class hierarchies of the IR.}

There is one more thing we should point out: the return types of both statements and expressions are not given in terms of any specific type but instead are given as a base type of the statement types and expression types, respectively.
This suggests the class hierarchies shown in Figure~\ref{chap03:exp1bytecode-class-hierarchies}.
\begin{summary}
\end{summary}

\begin{bibnotes}
\end{bibnotes}

\begin{exercises}

\ex
\label{chap03:exp1bytecode-new}
How would you change Exp1bytecode to make it amenable to syntax directed interpretation?
({\bf Hint:} Add structured programming constructs.) 
Implement a grammar specification for your new language and illustrate that it can support syntax directed interpretation.

\ex (project)
Write a syntax directed interpreter for the language you designed in Exercise~\ref{chap03:exp1bytecode-new}.


\ex (project)
Consider our Exp1bytecode language given in Figure~\ref{chap03:exp1bytecode-gram}.
Add a new branching instruction called {\icd compare} to the language.
The syntax of this instruction is as follows,
\antlrlisting
stmt:	'compare' exp exp label label label ';' 
\end{lstlisting}
and its semantics can be described like this,
\begin{itemize}
\item If the first expression has a value less than the second expression then jump to the first label.
\item If the expressions have equal values then jump to the second label.
\item If the first expression has a value larger than the second expression then jump to the third label.
\end{itemize}
Modify the interpreter for Exp1bytecode to accommodate this new instruction.


\end{exercises}


