This document describes the role of tiles and how to implement additional tiles for an architecture.
A tile is the combination of a tree pattern and a code generation rule. The patterns are defined in a tile list file. Tiles are architecture-specific, unlike expression templates. The path for the x64 tile list is 'src/jit/x64/tiles.list'. If and when someone chooses to port the JIT compiler, other architectures will need to define their own tile list files.
The textual format used for tiles is, as in the expression templates, symbolic expressions (s-expressions). They are described in some detail in that document (expr.md). A pattern is matched and replaced with the nonterminal, which are then used in other pattern matches. For instance, given the following tree:
(add
(load
(addr
(local)
8
)
int_sz
)
(const
16
int_sz
)
)
Which, by the way, stands for:
= LOCAL[8] + 16
And the following simplified set of patterns:
(add reg reg) -> reg
(local) -> reg
(load reg sz) -> reg
(addr reg ofs) -> reg
We can do the following bottom-up transformation:
(add (load (addr (local) 8) int_sz) (const 16 int_sz))
(add (load (addr reg 8) int_sz) (const 16 int_sz))
(add (load reg int_sz) (const 16 int_sz))
(add reg (const 16 int_sz))
(add reg reg)
reg
In this fashion tile patterns are used to succesively simplify the input tree to a final terminal symbol. The actual process to implement this is complicated and beside the point of this article. The reason to care about it is that when using a complex instruction set (CISC) architecture any instruction can typically implement a large part of any expression tree, and doing so is often worth it.
As I'm writing this documentation, the tile library is not yet complete, nor will it be for the foreseeable future. This is primarily because the instruction set provided by x64 processors is large; there are many possible sets of instructions for any given computation. So my aim is to help interested parties in completing the tile library, which I hope should be a rewarding passtime. I warn you that a little (but not a lot) assembly language knowledge is required.
A tile requires a pattern description in the tile list file. Say I'm writing a tile for the hardware implementation of the binary XOR operator. In x64 we have a number of options, including but not limited too:
xor rax, rcx /* register-to-register */
xor rax, 0x0ba1 /* constant to register */
xor rax, [rcx+ofs] /* memory with offset to register */
xor rax, [rcx+rdx*8] /* indexed memory to register */
xor [rax+ofs], rcx /* register to memory */
xor [rax+ofs], 0x12 /* constant to memory */
Let's focus on the first three, since it is my suspicion that these will be the most common (the latter two imply a STORE and a LOAD from the same operand, which is not directly supported in the memory model of the compiler, although it may be in the future).
A tile description is a list that starts with the keyword
tile:
, followed by the tile name, the pattern proper,
the symbol that it generates, and the (estimated) tile cost. There
are three supported symbols, namely reg, flag and
void. The flag symbol refers to the output of a comparison or
'testing' operation and is in general only useful in conditional
operations such as IF and WHEN. The reg symbol refers to a
physical register and is what we'll be using for the xor
implementations. (This is again not future-exclusive, because I may
want to add different symbols for numeric and/or SIMD registers). The
void symbol represents operations which yield no results. (NB: in
the past there was a mem symbol that stood for (compound) memory
access. This approach has been abandoned after it was clear
The first version of xor
corresponds to the following
pattern:
(tile: xor (xor reg reg) reg 2)
Meaning it is a tile called 'xor', matching the expression tree node
XOR
(or MVM_JIT_XOR
) and two register
arguments, yields a register, and has a cost of 2. The derrivation of
costs is explained below. By constrast, the second and third version
correspond to these patterns:
(tile: xor_consst (xor reg (const)) reg 3)
(tile: xor_addr (xor reg (load (addr reg))) reg 6)
You may notice that the (const)
and (addr reg)
nodes are stripped from their argument values. That is
because argument values do not take part in tile matching. As a
consequence, tiles should be written to accept (or reject) any
arguments they will encounter.
By the way, it is by no means necessary that a tile represents a single machine code instruction, although this is the case for most tiles currently implemented.
Finally, I note that the symbol names you choose in your tile must
have an analogue defined somewhere in a suitable C header file, like
MVM_JIT_REG
and MVM_JIT_FLAG
which are
defined in src/jit/expr.h; and also that all values marked
reg
are managed by the register allocator (and only those
values). I'm still looking for a good way to represent constraints on
registers, so expect some changes in the future. More details on the
interaction of tiles with the register allocator are described below.
Calculating the proper cost of a tile is something of a dark art form because there are many different factors to take into account. Some of the more important ones are:
- Code size - the machine code cache is pretty small I've heard, so smaller code is generally better for performance
- Memory access - memory access is often unavoidable from the point of view of the tiler, but memory reads and writes are expensive and this cost should be reflected in the tile.
- Register use - registers are a limited resource and spilling them involves memory traffic, so the (temporary) use of a register should receive a cost penalty.
With that in mind, I use the following scheme to calculate costs:
-
Per instruction issued, I count 1. Per constant stored in the instruction stream, I also count 1. Hence
(xor reg (const))
is more expensive than(xor reg reg)
. -
Per register used as output or as temporary variable space, I count 1. This is irrelevant for 'call' nodes which spill values. Note that using input registers is free, since their costs are calculated elsewhere. Hence
(xor reg (const))
(total cost 3) is cheaper than(const) + (xor reg reg)
(total cost 4), because the second version uses 2 instructions and two registers.In contrast, operations yielding
flag
values don't pay for registers - the cost of converting flags to register values if necessary is paid by other operations. -
Per memory access operation, I count a cost of 4. To be fair, this isn't completely relevant since memory-access operations do not usually compete with non-memory-access operations (the memory traffic is either explicitly required by the code or implicitly by the register allocator, and not very often optional). Still, by using the same cost values consistently the tiler can generate correct tables.
The basic memory model of the new JIT is register to register operations. (By the way at, this is at odds with the rest of the VM, which uses memory-to-memory operations, which is unavoidable if you're developing a virtual machine.). Operations are generally expected to write values to registers and to read values, ultimately from registers. Thus, the expression tree format that we are tiling is deliberately closer to the CPU architecture.
The important bit about tiles and registers is that as long as you do not need any temporary registers or any specific registers, you don't need to bother with the register allocator at all. Registers will be automatically allocated to write values to and your register operands will be equally automatically loaded if ever spilled. Temporary register allocations cannot cause any 'in-use' register to be unloaded, nor can temporary registers themselves be accidentally freed. Also, do not try to free registers yourself, the register allocator will handle that quite adequately and a double free can wreak havoc in the same way it can do in malloc. There is no reason to ever do so since the register allocator will free registers as soon as it is possible to do so.
The tile implementation is responsible for the output of suitable machine code for it's architecture. For machine code output we use the DynASM library. Please read that link for detailed information on DynASM. With DynASM, tile implementation becomes simple.
The tile implementation is defined in src/jit/x64/tiles.dasc
. This
file is included from src/jit/x64/emit.dasc
, which is the file that
includes the original JIT architecture-specific code. A tile
implementation looks much like this:
/* in src/jit/x64/tiles.dasc, supposing we only care about 8 byte xor */
MVM_JIT_TILE_DECL(xor) {
int dst = tile->values[0];
int src_a = tile->values[1];
int src_b = tile->values[2];
if (values[0]->size < 8)
MVM_oops(tc, "oops!");
if (src_a != dst) {
| mov Rq(dst), Rq(src_a);
}
| xor Rq(dst), Rq(src_b);
}
The lines starting with a '|' are machine code that is to be emitted to the compiler. Of note is that in the expression model a node takes any number of input nodes (typically zero, one or two) and yields an output node (typically a value into a register). Hence the destination and source nodes are different. However, in x64 (and indeed all modes of x86) only two operands per instruction are supported, one of which typically acts as destination operand (as well as a source operand). Hence the xor operation here described overwrites the first operand. In case the first operand wil be used later, this is of course unacceptable. Hence the mov on the 9th line. However, in case the first operand isn't used later, reusing the register can serve as a minor but significant optimization. The register allocator tries to ensure that this is the case for all relevant tiles, and tile writers should try to take advantage of this if possible. Note that architectures which use more standard (and sane) three-operand instructions do not have to take this into account. Note also that the size of the operands is a parameter; usually you should use the biggest of the child operand sizes as definitive. Please, please, please check and fail on sizes you do not expect to handle, because bugs in emitted code are much more difficult to debug than bugs in the code generator.
It is probably of interest what the exact argument list definition of a tile really is. A tile function receives the following arguments in order:
-
MVMThreadContext *tc
- the current thread context passed to all MVM functions -
MVMJitCompiler *compiler
- the compiler object, which is aliased via macro magically to a DynASM handle object, which magically ensures that the dynasm statements work. Also the general manager of labels and registers etc. -
MVMJitTile *tile
- the tile object that holds the values and parameter arguments. -
MVMJitExprTree *tree
- the tree object that is being compiled. Contains pretty much all data relevant to the compilation that is not kept in the compiler structure; however in tiles it's mostly used to extract argument data, since other data is made available in more convenient ways.