Skip to content

Compiler architecture

Idloj edited this page Mar 23, 2016 · 25 revisions
Clone this wiki locally

See also Engine architecture, since the design of the compiler and the design of the engine are of course intertwined.

Development branches

As described at Branches, NetLogo development work has diverged into two separate branches: 5.x and hexy.

The compiler is rather different in these branches. A lot of refactoring work on the compiler was done for NetLogo-Headless, which was later merged into hexy, described at zOld: Compiler Refactoring Notes.

Major changes on hexy include:

  • dist/depend.graffle updated for package structure changes. prospective compiler hackers will want to study this.
  • compile package split into parse (aka "the front end"), compile.middle, and compile.back packages (formalizing the informal division that already existed).
  • Most common stuff, including the AST classes for representing procedure bodies, has been moved to the core package.
  • new parse package. depends only on core package, unlike compile which still has grungy ties to agent and nvm.
  • StructureParser rewritten from scratch, using Scala's parser combinators library. top-level StructureParser class is in parse; the combinators live in parse.StructureCombinators. StructureConverter takes the nice clean top-level AST classes in parse.StructureDeclarations and turns them into grungy api.Program and nvm.Procedure objects
  • IdentifierParser is now called Namer and was rewritten by decomposing it into a bunch of independent NameHandlers
  • Some things that were formerly considered optimizations, like let-verification, agent type checking, and task variable verification, now live in the parser so that they can be used by Tortoise.

The code for the middle and back ends hasn't changed much. The action has all been on the front end because the goal of these changes is to better support Tortoise, and Tortoise only uses the front end. In some instances code that was inappropriate for Tortoise was relocated from the front end to the middle end.

The rest of this page describes the 5.x compiler. The 5.x information remains relevant since it will probably be maintained indefinitely and people are still building extensions and so forth for it. (Perhaps we should make separate wiki pages for the hexy compiler?)

Packages

The compiler is in the org.nlogo.compiler, org.nlogo.generator, and org.nlogo.lex packages.

Main entry points

The primary class is Compiler, which implements CompilerInterface. The main entry points are compileProgram() and compileMoreCode(). These correspond to two different modes of operation: we're either compiling the entire Code tab, or we're compiling an additional snippet of code (e.g. code in a button, or typed into the Command Center) which is compiled with reference to an already-compiled Code tab.

If compilation fails, a CompilerException is raised.

If compilation succeeds, the result of compilation is a CompilerResults object:

case class CompilerResults(procedures: Seq[Procedure], program: Program)

In the case of compileMoreCode(), procedures will contain only a single Procedure.

Other entry points

Some additional entry points support things like syntax highlighting, syntax checking, etc.

Phases

Compilation proceeds in phases. Each phase is implemented by a different class. If you read through Compiler.compile(), you will see each phase in order. The phases are: Tokenizer, StructureParser, IdentifierParser, ExpressionParser, Visitors, Optimizer, TypeParser, ArgumentStuffer, Assembler, Generator.

The phases may informally be divided into three groups: front end, middle end, and back end. The front end does parsing and some semantic analysis; the middle end does AST rewriting and further semantic analysis; the back end does linearization of control structures and code generation.

The phases are described individually below.

Front end phases

Tokenizer

  • input: String containing program text
  • output: Seq[Token]

The entry points are in Tokenizer, which uses the TokenLexer class automatically generated by JFlex from the specification in autogen/TokenLexer.flex.

Tokenizing (a.k.a lexing) NetLogo isn't very different from tokenizing any other language. One odd thing we do is distinguish between different kinds of identifiers (command primitives, reporter primitives, built-in variables, etc.) at tokenization time (using a helper class called TokenMapper); you might have expected that would be handled by IdentifierParser. (I'm not aware of any special reason we do it here instead of there.)

StructureParser

  • input: Seq[Token]
  • output: StructureParser.Results (comprising a Program, a Map[String, Procedure], and a Map[Procedure, Iterable[Token]])

StructureParser breaks the code up into declarations and procedures. The information in the declarations (such as globals, breeds, etc.) is stored in the new Program that is returned. Procedures are created, and the tokens in each procedure body are stored in a map.

IdentifierParser

  • input: for each procedure, a Procedure and an Iterator[Token]
  • output: Seq[Token]

IdentifierParser resolves named references in the program. For example it connects procedure calls to the Procedure objects they call and connects variable references to the variables they refer to (be they observer, patch, turtle, breed, parameters, locals, or let).

ExpressionParser

  • input: Procedure, Iterable[Token]
  • output: Seq[ProcedureDefinition]

ExpressionParser converts tokens to parse trees. This is what is usually meant by "parsing". It involves figuring out which reporters are inputs for which other reporters and commands, and interpreting brackets and parentheses, while taking operator precedence into account.

The result type is Seq[ProcedureDefinition] rather than just ProcedureDefinition because any task bodies inside the procedure result in separate ProcedureDefinition objects.

Middle end phases

Visitors

  • input: ProcedureDefinition
  • side effects: mutates ASTs

This is actually several classes, each of which could be considered a distinct phase of compilation, but since each class does a very small job, we lump them together as a single "phase".

Each visitor implements the AstVisitor interface. This use of the Visitor Pattern makes it very easy to write new classes that do some kind of processing of the parse tree. Putting each kind of processing in a separate visitor keeps the code modular.

At present there are eight visitors:

  • ReferenceVisitor: fill in reference field of prims that take references to agent variables
  • ConstantFolder: as an optimization, perform computations at compile time (see http://en.wikipedia.org/wiki/Constant_folding)
  • SimpleOfVisitor: convert _of(_*variable) to _*variableof
  • TaskVisitor: process reporter tasks and references to task variables
  • LocalsVisitor: change let variables into Procedure-level local variables (slots in the Activation) where possible
  • SetVisitor: make _set calls more specific, e.g. _setobservervariable
  • CarefullyVisitor: connect _errormessage reporters to their enclosing _carefully commands

Optimizer

Technically this is just a ninth visitor, but it's significant enough to deserve treatment as a phase unto itself.

  • input: ProcedureDefinition
  • side effects: mutates ASTs

Rewrites parts of the parse tree into forms which should run faster.

For example the combination of _any and _with is rewritten to a call to _anywith which exits early once an example is found.

Another example is the replacing of _with(_patches,_equal(_patchvariable:PXCOR,_constdouble:5.0)) with _patchcolumn:5.

Each optimization is a singleton object within Optimizer.

Here's a probably very out of date list of the optimizations performed:

@ = got lost between 4.0 and 4.1 when the optimizer was rewritten from
the ground up. maybe should be re-added someday.

== Visitors

reduce interpreter overhead by reducing arity:
  _of(_turtle/patch/breedvariable) => _turtle/patch/breedvariableof  [SimpleOfVisitor]

avoid constructing LetMap & Let objects:
  _let/_setletvariable => _setprocedurevariable  [LocalsVisitor]
  _letvariable => _procedurevariable             [LocalsVisitor]

== Optimizer

early loop exit:
  _any(_with) => _anywith
  _greatherthan(_countwith,_constdouble:0) => _anywith
  _notequal(_countwith,_constdouble:0) => _anywith
  _equal(_countwith,_constdouble:0) => _not(_anywith)
  _oneof(_with) => _oneofwith
@ _any(_breedon) => _anybreedon
@ _any(_turtleson) => _anyturtleson
@ _equal(_count,_constdouble:0) => _not(_any)
@ _greatherthan(_count,_constdouble:0) => _any
@ _notequal(_count,_constdouble:0) => _any

avoid constructing AgentSet:
  _count(_with) => _countwith
  _other(_with) => _otherwith
  _any(_other) => _anyother
  _any(_otherwith) => _anyotherwith
  _count(_other) => _countother
  _count(_otherwith) => _countotherwith
@  _other(_turtleshere) => _otherturtleshere
@  _other(_breedhere) => _otherbreedhere
@  _count(_turtleshere) => _countturtleshere
@  _count(_otherturtleshere) => _countotherturtleshere
@  _count(_breedhere) => _countbreedhere
@  _count(_otherbreedhere) => _countotherbreedhere

generate simpler code when something about arguments is known:
  _random(_constdouble) => _randomconst if arg is positive integer
@@  _call => _tailcall
@@ _callreport => _tailcallreport

avoid extra work by optimizing special cases:
  _fd(_constdouble:1) => _fd1
  _patchat(-1/0/1,-1/0/1) => _patchnorth/south/etc.
  _with(_patches,_equal1double(px/ycor)) => _patchcol/_patchrow
  _with(_patches,_equal(px/ycor,_observer/procedurevariable)) => _patchcol/_patchrow

avoid constructing list, avoid shuffling (and for some, reduce interpreter overhead):
  _sum(_patchvariableof(_neighbors4)) => _nsum4
  _sum(_patchvariableof(_neighbors)) => _nsum
@ _max/mean/min/sum(_of) => _max/min/mean/sumof  // Flocking, GasLab, Ising
@  _max/mean/min/sum(_patch/turtle/breedvariableof) => _max/mean/min/sumpatch/turtle/breedvariableof
@  (uh oh, there's also "turtlevariabledouble", "turtleorlinkvariable", etc. this gets complicated)

AgentTypeChecker

  • input: Seq[ProcedureDefinition]
  • side effects: mutates usableBy field in Procedures

This phase is unique to NetLogo. It involves figuring out what is observer code, what is turtle code, what is patch code and what is link code. Code can be runnable by all four, or just by some of them. If it isn't runnable by any agent type, a compiler error occurs.

A long comment at the top of the class explains the details.

Back end phases

ArgumentStuffer

  • input: ProcedureDefinition
  • side effects: mutates args arrays in Instructions

Fills the args arrays, in all of the Instructions anywhere in the Procedure, with Reporters.

No real action happens here. We're basically just discarding the "scaffolding" of the tree of AstNodes, leaving just the Command and Reporter objects themselves.

Assembler

  • input: ProcedureDefinition
  • output: stores Array[Command] in code field of Procedure

Mainly we're just discarding the "scaffolding" of the tree of AstNodes, leaving just the Command and Reporter objects themselves.

Exception: optimization of some types of tail recursion happens here, though. (Perhaps that could be reimplemented as a Visitor?)

Most commands are assembled generically, but some commands are “custom assembled”. Typically these are control structures such as ask and while. Each custom assembled command has an assemble() method which directs its own assembly via an AssemblerAssistant.

Generator

  • input: Procedure
  • side effects: alters/replaces Command and Reporter objects in the procedure's code field

Translates some commands and reporters directly into JVM byte code. Replaces Command and Reporter instances with instances of built-on-the-fly custom subclasses of GeneratedCommand and GeneratedReporter.

Optional phase, can be disabled. Doesn't run in applet, since applets can't use custom classloaders.

Something went wrong with that request. Please try again.