Permalink
Switch branches/tags
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
633 lines (514 sloc) 31 KB

MoarVM JIT compiler overview

Introduction

This document attempts to describe the structure of the MoarVM JIT compiler and document some of the design decisions that shape it.

Components

Dynamic optimizer

This part is noteworthy because it is not, in fact, part of the JIT compiler at all. It is a separate subsystem that handles all of the type logging, graph construction and rewriting, call inlining - in short, all of the interesting optimizations of a dynamic language system. The JIT (legacy and ‘expression’) is a code-generation backend for this system. MoarVM is rather unique in this since in most systems, the order is reversed.

In contrast to what might be called ‘language-level’ optimizations implemented by spesh, the objective of the JIT is to remove intepreter-level inefficiencies.

‘Lego’ JIT compiler

The legacy or ‘lego’ JIT compiler was the first to be developed back in 2014. It has this name because it works by emitting independent fragments of machine code that are stitched together. These fragments are mapped directly from the ‘spesh graph’ (representing a MoarVM subroutine or ‘frame’) that is the input to the compiler. There are very few optimizations applied by the legacy compiler (one notable optimization is the inlining of a polymorphic REPR function pointer if the type of the object is known).

The lego JIT produces a structure called the ‘JIT graph’, which, while a technically correct name, is a bit of a misnomer since it is really a singly-linked list. It uses labels rather than an explicit control flow graph (like spesh does) to indicate internal structure. The JIT graph is then compiled to JIT code. There is in fact little reason to maintain this two-phase setup; the legacy JIT could work just as well in a single phase.

Even for the new JIT, the legacy JIT provides the necessary scaffolding to integrate with the interpreter (such as a proper function prologue and epilogue, support for exceptions and invokish operators, and the various bits and pieces required for correct deoptimization.

DynASM assembler

DynASM is a third-party library from the luajit project that we’ve modified to support register addressing on the x86-64 architecture. It allows you to interleave the assembly code you’d like to emit with the C code that does the regular compiling business. DynASM has two parts:

  • A lua script to preprocess a ‘dasc’ (which is C interleaved with assembly code) file and output a C file.
  • A C header to #include into the generated C file that contains all the functions to assemble and link the machine code at runtime

DynASM source files are architecture-specific.

Expression trees

The expression ‘tree’ is the intermediate format of the new ‘expression’ JIT compiler. The tree is a technically incorrect misnomer - it actually represents a graph.

The expression tree is built from a spesh graph by mapping MoarVM instructions to templates. These templates are written in a textual format by a human (or perhaps a script), that is precompiled to a header file during MoarVM compilation. Take note that while the expression templates look a bit like a high-level language some may be familiar with, the concepts it expresses are decidedly ‘low-level’ and can be mapped more-or-less directly to the machine code of a hypothetical RISC CPU.

Syntax

The syntax used for expression templates is derived from LISP.

  • Lists are the basic constructs of the language. A list is delimited by opening =’(‘= and closing =’)’= parentheses. Lists can be nested. Items in the lists are separated by whitespace.
  • A word is anything that matches the perl regular expression of [^\s\(\)#"'].
    • A word that consists only of alphanumeric symbols (and possibly underline) is called a symbol: sp_p6oget_o. The first symbol in a list is generally it’s operator.
    • Suffixed by : is generally a keyword, e.g. let:
    • Prefixed by $ is a reference, either to a MoarVM instruction operand ($1, $2) or to a declared name ($val).
    • Prefixed by ^ invokes a list macro, which can be declared with the macro: keyword.
    • Macro parameters are prefixed by =,= (e.g. =,obj=), and they are replaced with whatever symbol or list is ‘bound’ to them when the macro is invoked.
    • Parameter macro’s are indicated by a prefix &, e.g. &sizeof. They are always substituted with a function macro: (&sizeof foo) becomes sizeof(foo). These must form constant expressions at compilation time.
  • A string is anything delimited by =”= - but note that escape sequences are not supported, so =”foo " bar”= doesn’t do what you might think it does

An example of a template definition would be as follows:

(template: sp_p6oget_o
    (let: (($val (load (add (^p6obody $1) $2) ptr_sz)))
           (if (nz $val) $val (^vmnull))))

The let: keyword is not an operator but a special construct for the template precompiler. It allows the declaration of names for values that are computed and reused in further operators.

And an example of a macro definition would be:

(macro: ^getf (,object ,type ,field)
   (load (addr ,object (&offsetof ,type ,field))
         (&SIZEOF_MEMBER ,type ,field)))

Operators

These are all the operators that are known by the expression language as of today (<2017-09-25 Mon>).

OperatorShapeTypeSemantics
LOAD(load reg $size)regLoad a value from a memory address
STORE(store reg reg $size)voidStore a value to a memory address
CONST(const $value $size)regLoad a constant value to a register
ADDR(addr reg $ofs)regAdd a constant offset to a pointer
IDX(idx reg reg $scale)regCompute a pointer for an index in an array
COPY(copy reg)regCopy this value (opaque to tiling)
LT(lt reg reg)flagFirst operand is smaller than second
LE(le reg reg)flagFirst operand is smaller than or equal to second
EQ(eq reg reg)flagFirst operand is equal to second
NE(ne reg reg)flagFirst operand is not equal to second
GE(ge reg reg)flagFirst operand is larger than or equal to second
GT(gt reg reg)flagFirst operand is larger than second
NZ(nz reg)flagOperand is nonzero
ZR(zr reg)flagOperand equals zero
CAST(cast reg $to $from $sign)regConvert a value from smaller to larger representation
FLAGVAL(flagval flag)regBinary value of comparison operator
DISCARD(discard reg)voidDiscard value of child operator
ADD(add reg reg)regAdd two integer values
SUB(sub reg reg)regSubtract second operand from first
AND(and reg reg)regBinary AND (intersection) of two operands
OR(or reg reg)regBinary OR (union) of two operands
XOR(xor reg reg)regBinary XOR of two operands
NOT(not reg)regBinary negation of operand
ALL(all flag+)flagVariadic short-circuit logical AND
ANY(any flag+)flagVariadic short-circuit logical OR
DO(do void* reg)regExecute multiple statements and return last expression
DOV(do void+)voidExecute multiple statements
WHEN(when flag void)voidExecute statement if condition is met
IF(if flag reg reg)regYield value of conditional expression
IFV(ifv flag void void)voidConditionally execute one of two statements
BRANCH(branch reg)voidJump to code position
LABEL(label $label)regLoad position of code
MARK(mark (label $label))voidMark this position
CALL(call reg (arglist ...) $size)regCall a function and return a value
CALLV(call reg (arglist ...))voidCall a function without a return value
ARGLIST(arglist (carg reg $type)+)c_argsSetup function call arguments
CARG(carg reg $type)voidAnnotate value with ‘parameter type’
GUARD(guard void $before $after)voidWrap a statement with code before and after
TC(tc)regRefer to MoarVM thread context
CU(cu)regRefer to current compilation unit
LOCAL(local)regRefer to current frame working memory
STACK(stack)regRefer to top of C frame stack

Tree iteration and manipulation

Instruction selection (tiler)

Next to the register allocator, the instruction selection algorithm (or the ‘tiler’) is the most complex part of the JIT. It is fortunately fairly stable. The goal of tiling is to match the intermediate representation (the expression tree/graph) to the cheapest sequence of instructions on the target architecture (x86-64) that implements the semantics of the expression tree. The implementation is heavily based on the paper by Aho et al - and in fact, the expression tree IR was designed mostly to accomodate it.

How tiling works

The following is a necessarily incomplete description of the actual process - much better described by either the paper linked above or the source code that implements it.

A ‘tile’ in the expression JIT describes a number of overlapping concepts, for which I rely on the reader to disambiguate by context. (Humans are good at that).

  • A pattern that can be matched against an expression tree and which is defined textually, much like the expression templates.
  • An object for the JIT compiler to represent a machine instruction, and
  • A function that emits the machine code to the compiler buffer, with parameters substituted

The textual ‘tile definition’ contains the name of the function that implements it (in the example below store_addr and test_addr_const), the pattern that the tile matches, the symbol that is substituted for the pattern in a succesful match, and the cost of doing so (in terms of compiled code).

(tile: store_addr 
   (store (addr reg $ofs) reg $size) void 5)
(tile: test_addr_const  
   (nz (and (load (addr reg $ofs) $size) (const $val $size))) flag 4)

Conceptually, tiling is a process of reducing the tree structure by replacing nodes by the symbols defined by the tiles. The following example may serve as an illustration. Given the following set of tiles:

TilePatternSymbolAssembly code
local(local)regrbx
const(const $value)regmov reg, $value
addr(addr reg $offset)reglea reg, [reg+$offset]
load(load reg $size)regmov out, [reg]
add(add reg reg)regadd reg1, reg2
store(store reg reg $size)voidmov [reg1], reg2

We can reduce a tree and generate code as follows (note that the order of reducing is bottom-up / postorder from left-to-right):

TreeTileCode
(add (load (addr (local) 0x8)) (const 17))local -> reg
(add (load (addr reg 0x8)) (const 17))addr -> reglea rax, [rbx+0x8]
(add (load reg) (const 17))load -> regmov rcx, [rax]
(add reg (const 17))const -> regmov rdx, 17
(add reg reg)add -> regadd rcx, rdx

Tiling never takes constant parameters into account, which is a severe limitation - some operators are radically different between 8-bit and 64 bit sizes on x86-64. Maybe we can implement architecture-specific templates to optimize our way out of this.

Picking tiles

The difference between the model above and the real implementation are:

  • A tile can cover more complex expression trees (see the example tiles above)
  • A given subtree may be reduced by different sets of tiles, and
  • We’d like to choose the cheapest such set, and we’d like to do that efficiently.

Before going further, be warned: the following is rather complex and even now it makes my head hurt. Feel free to skip this section.

In order to figure out if a certain tile can be applied to a given tree, we’d need to know if it’s pattern matches. The pattern matches if it’s structure matches and if the expressions ‘below’ it can be matched to the symbols at its leafs. Because the tiler uses postorder iteration, we can ensure that symbols have been assigned to the leafs when we consider the head operator of the tile.

To avoid having to traverse the tree to find out if the leafs match to the pattern, during precompilation the tile patterns are ‘flattened’ and the nested lists replaced with ‘synthetic symbols’. From the sublist a new (partial) tile pattern is generated that reduces to this generated symbol and compiles to nothing). See for some examples the table below. Because the resulting patterns are flat, they can be matched by inspecting only the symbols of the direct children of the operator.

OriginalFlat
(load (addr reg)) -> reg(load @sym1) -> reg
(addr reg) -> @sym1
(store (addr reg) reg) -> void(store @sym2 reg) -> void
(addr reg) -> @sym2
(add reg (const)) -> reg(add reg @sym3)
(const) -> @sym3
(add reg (load (addr reg))) -> reg(add reg @sym4) -> reg
(load @sym5) -> @sym4
(addr reg) -> @sym5

From the table it is possible to see that a single tile list pattern (like ‘(addr reg)’) can be reduced to many different symbols. In fact, given a set of tiles and their associated costs, we can compute all possible combinations of symbols (symbol sets) that a tile pattern can be reduced to and we can also compute which tile would be the most efficient implementation of that operator given the symbol sets of it’s children. By precomputing this information selecting the optimal tile for an operator can be reduced to a table lookup at runtime.

For completeness I’ll try to describe how the precomputation process works. The central concept it is based on is that of ‘symbol sets’ (symsets). We begin by initializing a map of all symbol sets, which are initially just the individual symbols that are generated by all tile patterns (called @rules).

$sets{$_->{sym}} = [$_->{sym}] for @rules;

And a reverse lookup table is also generated, from all symbols within a set, to all symbol sets that they occur in. Again, initially this is just the symbol itself.

while (my ($k, $v) = each %sets) {
   # Use a nested hash for set semantics
   $lookup{$_}{$k} = 1 for @$v;
}

Then for each tile pattern, we lookup all symbols that could be combined with the symbols in the patterns, and we store the symbol that this reduces to. The name ‘%trie’ is also kind of inaccurate, but it gets the picture accross, which is that this is a nested associated lookup table. The purpose is to combine all tiles (and their associated symbols) that map to the same symbol sets together - because that means that these tiles can cover the same tree structures.

for my $s_k1 (keys %{$lookup{$sym1}}) {
   for my $s_k2 (keys %{$lookup{$sym2}}) {
        $trie{$head, $s_k1, $s_k2}{$rule_nr} = $rule->{sym};
    }
}

Having done that, we generate new symbol sets from this lookup table - all the values of the generated hashes are symbols that can be produced by tiles that map to the same trees:

my %new_sets;
for my $gen (values %trie) {
    my @set = sort(main::uniq(values %$gen));
    my $key = join(':', @set);
    $new_sets{$key} = [@set];
}

In general, we run this process multiple iterations with the new sets of symbols, because the symbolsets that exist influence the tile patterns that are considered to be equivalent. For instance, in the table above the tile patterns generating @sym1, @sym2 and @sym5 are equivalent, and so after a single iteration there will be only a single set containing those symbols. This means that the patterns of (load @sym1) and (load @sym5) are equivalent, and hence that @sym4 is always combined with reg. So this process has to iterate until there are no more change in the sets, at which point it can read (from the same table) the all possible combinations of sets. Having this, it is a small matter to compute the cheapest tile of the set to cover a given tree [fn:cost].

my (%seen, @rulesets);
for my $symset (values %trie) {
    my @rule_nrs = main::sortn(keys %$symset);
    my $key = join $;, @rule_nrs;
    push @rulesets, [@rule_nrs] unless $seen{$key}++;
}
return @rulesets;

Register allocation

The legacy JIT compiler did not require register allocation, because it never persisted values in registers between fragments. The expression JIT compiler does, that’s the whole purpose of it.

Earlier attempts

There have been three attempts at implementing a register allocator for the expression JIT. Ironically, all of them have been based on the same algorithm, which is linear scan register allocation on SSA form.

The first attempt tried to do register allocation ‘online’ with code generation, i.e. during tree walking. This does not work because you’ll need to know the live range of the values your allocating, otherwise you don’t know when a register can be freed, and you’ll either run out or you might be incorrect when freeing them. Furthermore, you’ll need to manage aliases, i.e. values that are the same (created by COPY operators) and nodes that ‘merge’ (created by IF opreators). Never mind the case where we compile a CALL node in a conditional block.

The second attempt improved on that in two ways:

  • It made a real attempt to compute the live range extents of values. Initally based on tree iteration order; however, after introducing the tile list, based on tile list position (which is what we use now).
  • It used a ‘cross-linked-list’ to identify values that either shared a location or a definition. This was used to deal with aliases and merged values.

However, this proved still not quite sufficient. It had inherited from the earlier online algorithm the design of allocating a register for a value at the first instance it is encountered, and assigning that register to instructions that use the value when they’d be encountered. Hence it required the ability to determine the ‘current’ position for a value, including if that value is spilled or not. This is somewhat complicated when a value is represented by multiple different things (due to aliasing and merging, in a cross-linked list). What is more, it would only try to resolve aliases at register-assignment time, i.e. when it’d encounter the relevant COPY or IF operator. So that would actually change the extent of the live range while it was ‘under operation’, which further complicates several key functionalities such as deciding when a register can be freed.

Current implementation

The third attempt is the current one and departs from the previous two attempts in the following crucial ways:

  • Rather than have a value be synonymous with its definition (the instruction and operator that created it) it represents a set of definitions and uses , that are joined together by aliases and merges. It uses a separate phase and a union-find data structure to implement these values (which are called ‘live ranges’ consistently). This ensures that all uses of the ‘same’ value use the same register location for it.
  • During the allocation phase, it iterates over the values in ascending order (i.e. by first definition), rather than over the instructions as the second version did or over the tree as the first. A register can be reused only after its value has expired, which is when the algorithm iterates past its last use. This means that at a single point in the program, no two values can be allocated to the same register, which is the essential correctness property for a register allocator.
  • Whenever a register is allocated, it is directly assigned to the instructions that use its value. When that register is changed for whatever reason, we update it as well. Thus we never have to query ‘what is the current location of the value of $x’, which gets tricky to answer in case of spilling, and the register assigned to an instruction is always ‘up to date’.
  • It is sometimes necessary to ‘spill’ a value from a register to memory, either because we need more registers or because the ABI forces us to - this is true for all register allocation algorithms. At any point in the program where it is necessary to spil a value, all instructions that define or use that value are wrapped with instructions to move that value between register and memory. (This ensures that the value is spilled to the same position in all possible code paths). The single live range that covered all these uses is then split into ‘atomic’ live ranges that cover only the instruction to load or store the value and the original instruction that would use or define it. The upshot is that a live range that is processed always refers to a value that resides in a register.
  • Because of value merging and aliasing, it is sometimes possible for a value to be undefined within its own life range. This can be demonstrated by code below. In it, $a is first defined as 10, then conditionally redefined as the result of $a * 2 - 3. So between the definition of $b and the assignment of the ‘new’ value of $a, the ‘old’ value of $a = 10 is no longer valid. Hence, it should not be stored to memory at that point. Such undefined ranges are called ‘live range holes’, and they are an of example of the ‘complexity waterbed’ of compiling - SSA form code doesn’t have them, but makes it really complicated to ensure the same register is allocated for all the separate uses of a value.
my $a = 10;
if (rand < 0.5) {
    my $b = $a * 2;
    $a = $b - 3;
}
printf "\$a = %d\n", $a;

Function call ABI

It is (unfortunately) necessary to have the register allocator handle the C function call ABI. This is because certain function arguments are supposed to be passed by in specific registers (meaning they have to be placed there) and some registers can be overwritten by the function that is called (in fact, all of them). So in the general case, the JIT needs to:

  • Place values in the correct registers and/or stack positions, making sure not to overwrite values that are still necessary for the call.
  • Load values from memory if they are not present in registers. Note that a function call can have more arguments than registers, so it is not always possible to load all values into registers prior to the call. So we cannot rely on the normal register allocator mechanisms to ensure a value is present.
  • Ensure all values that ‘survive’ the call are spilled to memory.

This is currently implemented by a function prepare_arglist_and_call that is over 144 lines long and implements topological sort, cycle-breaking and register-conflict resolution with values used by the CALL operator itself (e.g. in dynamic dispatch).

I feel like this functionality - rearranging registers to deal with register requirements of specific operators - could be generalized, but I’m not sure if it is worth it. The x86-64 instruction set has plenty of ‘irregular’ instructions that have specific register requirements however, virtually all of them use just the single rax register, which I’ve decided to keep as the spare register anyway. So I’m not sure what the value of generalizing would be, and I’m fairly sure it would introduce even more complexity.

Challenges

What is still necessary for completeness is the ability to allocate multiple register classes, specifically floating point registers. The suggested way of handling that is by giving out different number ranges per class, and letting the tiles sort out the details of getting the correct number.

Another thing that is not necessary but certainly nice-to-have is a way to ensure that the register assignment actually meets the constraints of the target architecture. Specifically, while the expression JIT assumes a system that operates as follows:

a = operator(b, c)

What we actually have is on x86-64 is more like:

a = operator(a, b)

The situation is actually considerably more complex due to the many addressing modes that are possible. (For instance, indexed memory access uses a third register ‘c’.). Currently we handle that in tiles by moving values arround, but it would be much nicer if the register allocator could enforce this constraint. (The allocator keeps a ‘scratch’ register free at all times for this purpose).

Finally, the live range hole finding algorithm iterates over the basic blocks of the JIT code in backwards linear order, which is fine so long as there are no loops (the tile list control flow always flows forward), which is true so long as we only ever try to compile single basic blocks, but that is an assumption I’m explicitly trying to break. So I’ll need to find a better, more general iteration order.

Logging

To aid debugging, the JIT compiler logs to file. This logging is adhoc and line-based (for the most part). It is not currently useful for general consumption or instrumentation. Adding to that, the linear scan allocator has numerous debug logging statements that are conditionally compiled in.

It would probably be a nice improvement to write ‘structural’ logging information to the JIT or spesh log (that could be useful for instrumentation), and use conditionally compiled ‘adhoc’ logging statements solely for debugging. And it’d be nicer still to have conditional compilation on the section of JIT output that you’d really be interested in (e.g. expression building or register allocation).

The JIT log is setup at initialization time in src/moar.c, based on the environment variable MVM_JIT_LOG. This is arguably not a good design for embedding; then again, for embedding, this is not a file you’d want to expose anyway.

Expression trees and tile lists have their own loggers, which implement a more structured format. An expression tree log represents the tree as a directed graph in the ‘dot’ language. There’s probably still some formatting we could apply to that, but on the whole I’m satisfied with that approach. The tile list logger represents each tile with a debug string, which is stored when the tiler table generator processes the tile pttern definitions. It also lists the basic block structure of the program.

Tools

Notes

Memory management

The JIT uses three different strategies for memory management:

  • Spesh region allocation for the JIT graph.
  • Dynamically allocated vectors for the expression tree, tile lists and related structures.
  • Anonymous memory pages for the generated machine code.

Spesh region allocation

The spesh dynamic optimization framework contains a region allocator that is bound to the spesh graph.

Dynamic vectors

Memory pages

Operating systems that try to provide some measure of security to C programs generally do not allow execution of heap-allocated memory.

[fn:cost] Actually, it is not very simple at all, and this is mostly because we’ve ‘flattened’ the tile list, so we somehow have to ‘reconstruct’ the cost of a tile set that is equivalent to the tree matched by the original. This part really needs to be revisited at some point.