Permalink
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
493 lines (338 sloc) 15.1 KB
Plasma Abstract Machine
=======================
Paul Bone <paul@plasmalang.org>
v0.1, October 2018: Draft.
Copyright (C) 2015-2018 Plasma Team
License: CC BY-SA 4.0
:toc:
This document describes the behaviour of the Plasma Abstract Machine (PZ
Machine). The PZ file format is described in link:pz_format.html[PZ
Bytecode Format]. Implementations of the PZ abstract machine are
shall be discussed elsewhere (TODO)
In this document we use the textual version of the .pz files for
illustrative purposes. However the textual format is never used as an
interchange format and rarely used as a language so it does not need or have
a specification.
link:https://github.com/PlasmaLang/plasma/tree/master/docs/pz_machine.txt[Contribute to this page]
== Basic data types
The abstract machine supports words of varying sizes, with the symbols
representing them.
- 8bit (+w8+)
- 16bit (+w16+)
- 32bit (+w32+)
- 64bit (+w64+)
- fast word width (+w+)
- a word width the same width as a pointer (+wptr+)
- a pointer (+ptr+)
A fast word width is a width that should be the fasted word width for
integers on the platform. This may take into account register size, memory
usage and maybe implementation convenience.
A word with the same width as a pointer and a pointer differ only in whether
the garbage collector may trace them. Which is significant in some contexts
(like structures) but not in others (like instruction parameter widths).
.TODO: Polymorphism
NOTE: Handle polymorphism for +wptr+/+ptr+. We'll probably remove wptr and
handle the pointer vs non-pointer distinction another way.
Some instructions only make sense with either signed or unsigned data, this
is up to individual instructions, the PZ format and abstract machine don't
care. This way "move a 32bit word" makes sense regardless of whether the
word is signed, unsigned, or something else (float, bitfield etc).
The PZ machine also supports structures and arrays, more on those later.
== Registers
The PZ Machine is a stack based machine, it has two registers:
* the program counter (PC) and
* the environment pointer (EV)
the program counter "points to" the next instruction to execute.
While the environment pointer points to the current environment which is
used by closures to refer to their environment.
== Stacks
The basic abstract machine is a stack machine with two stacks. A return
stack and an expression stack. The return stack is used to handle procedure
call and return including saved closure environments. Very little control
of the return stack is available. Both basic instructions and procedures
are a transformation of the top of the expression stack.
== Notation
A procedure or instruction's signature may look like:
add (w w - w)
This describes the instruction + as taking two words from the top of stack
and replacing them with a word. Calling conventions for procedures work the
same way. The expression stack is used for argument passing and temporary
storage.
fibs (w - w)
From a callee's perspective, there is little difference between an
instruction and a call.
If an instruction is available for all word sizes it may be written as:
add (* * - *)
This is a convention only, there is no support for polymorphism. When using
the textual format for PZ, you may disambiguate which instruction you need
with a suffix.
eg:
add:8
add:16
add:32
add:64
add:w (fast word width)
add (no suffix also means fast word width)
add:ptr (pointer word width)
This works similarly for literal data. This is a byte containing the number
23.
23:8
This is only available for instructions, not calls.
Also in our notation we indicate immediate data with CamelCase, and in the
case of calls and literal data, the instruction name is not provided. The
instruction to use is available via context.
== High level bytecode items
Each item in a bytecode file belongs in one of four types and is referred
to by a 32bit ID. Each item type has its own ID-space. In other words data
item 5 and procedure 5 are separate. Names are used in .pzt files but are
discarded when these are compiled to .pz files. The exceptions are imported
items, exported items (TODO), and in the future some names and other
information may be stored for debugging.
=== Imports
Each import is a fully qualified reference to a closure in another module.
The list of imports is provided so that .pz files can then refer to each
import by an Import ID. The import section of the .pz file maps these IDs
to names for looking up in other module's symbol tables.
=== Structs
A struct is a record type, and has a lot in common with a C struct. Each
struct has a fixed number of fields and each field has a width (as above).
Structs allow the bytecode interpreter to make its own data layout
decisions. Which it may do differently on different platforms.
.Example usage of this information
TIP: When a program is loaded and the loader reads a struct type. For that
struct type it computes offsets for each of the fields, computes the total
size.
=== Data
Data items come in three types:
* Basic data: a single data item of a specific width.
* Array data: a number of data items of the same width, usually packed
together.
* Structure data: a structure of data, the data item provides the struct ID
and the value of each field. Structure fields may contain:
** Basic data (like above)
** A reference to another data structure
** A reference to an imported closure
** (more to come)
=== Procedures
Procedures contain executable code. A procedure's signature is a "stack
transformation" it represents the top of stack values before and after a
call to this procedure. This is explained above.
Procedures are made up of blocks which are used for control flow. The first
block in each procedure is executed when the procedure is called. Within
each procedure blocks are numbered sequentially starting at 0. Jump
instructions refer to their destination by block ID.
Note that execution can never "fall through" a block, the last instruction
in every block must be an unconditional control flow instruction.
=== Closures
Closures can be created by code by bundling a procedure reference with an
environment. The environment is a heap allocated struct. See
+make_closure+ below.
When calling a closure the environment is placed into the +EV+ register and
the previous one is pushed onto the stack to be restored after the procedure
call.
Closures are also created with the closure declaration, the procedure and
data ids are specified.
=== Options
The PZ format also contains options. Only one option of each type is
allowed, and the only type (now) is the entrypoint. The entrypoint option
specifies the ID of the closure that should be executed to run the module.
== Instructions
Each instruction is made from an opcode, between zero and two operand widths
and optionally an immediate value.
=== Zero extend, Sign extend and Truncate
ze (* - *)
se (* - *)
trunc (* - *)
Zero extends, sign extends or truncates the value on the top of the stack.
By truncate we mean discard the most significant bytes. While most
instructions work on a single operand width, these instructions use two
operand widths. For example.
ze (w16 - w32)
Note that it is not necessary (or advised) to use these instructions to
convert to and from pointer data, for example to manipulate tagged pointers.
=== Arithmetic
add (* * - *)
sub (* * - *)
mul (* * - *)
div (* * - *)
mod (* * - *)
Integer addition, subtraction, multiplication, division and modulus.
lshift (* w8 - *)
rshift (* w8 - *)
and (* * - *)
or (* * - *)
xor (* * - *)
Bitwise operations. Note that right shift is unsigned. A signed version
will be added later.
not (* - *)
Logical negation
=== Comparison
lt_u (* * - w)
lt_s (* * - w)
gt_u (* * - w)
gt_s (* * - w)
eq (* * - w)
Less than and greater than on unsigned and signed data. Note that the
result is always fast word width. Likewise conditional instructions always
take their argument in the fast word width.
=== Stack manipulation
Stack manipulation instructions don't care about data width, the machine
conceptually places all data in the same sized slots.
drop N
Drop the top _N_ items from the stack.
roll N
Rotate the top _N_ items on the stack. The top _N_-1 items move to the
left, the leftmost item becomes the rightmost. (rightmost is top-of-stack).
Note that +roll 2+ is the same as +swap+.
pick N
Push the _N_th item on the stack to the top of the stack. Note that +pick 1+
is the same as "dup"
=== Procedure calls: call and ret
call ProcId (-)
Call the procedure given by ProcId. Push the value of the program counter
and environment pointer onto the return stack and load the program counter
with the address of the first instruction in the first block of the
procedure.
The order of the return address and environment on the stack are not
specified, provided that the call instructions and +ret+ instruction agree
about the order.
call ImportId (-)
call ClosureId (-)
Perform a closure-call. As with +call+ above but deconstruct the closure
that ImportId or ClosureId refer to, use the target address and new
environment in that closure for the new values for those registers.
TODO: Calling a closure id is not yet supported
tcall ProcId (-)
Call the procedure given by ProcId, Replace the top of the return stack
with the current value of the program counter. Load the program counter
with the address of the first instruction in the first block of the
procedure.
TODO: Implement tail calls for calling closures and imports
call_ind (ptr -)
Indirect call. All indirect calls are also closure calls.
The value at the top of the stack is a closure and is deconstructed into a
the new environment and a pointer to code, the call proceeds the same as
+call_closure+.
ret (-)
Pop two values off the return stack, place the return address into the
program counter register, and the saved environment in the environment
address register.
=== Jumps: jmp, cjmp
jmp BlockId (-)
Jump unconditionally to the indicated block by loading the address of the
first instruction of the block into the program counter.
Note that only blocks can be the target of jump instructions, this way all
jmp targets are known.
cjmp BlockId (w -)
Pop a value of the expression stack, if it is non-zero load the address of
the first instruction of the given block into the program counter.
Note that this instruction always consumes the value on the stack.
TODO: indirect jumps or some mechanism for computed gotos.
=== Make closure
make_closure ProcId (ptr - ptr)
Form a closure from the given procedure and the structure pointed to by the
TOS. Return the new closure on the TOS.
It would be possible to make closures a special type of struct, such as a
struct whose first argument is a function pointer. This would use less
memory and could be used by further optimisations. However we've chosen to
get closures working with this more naive method and probably change it
later.
Also, closures could be implemented entirely within the compiler. However
by making them a PZ machine construct we can take advantage of them for
code loading and potentially other things.
=== Loops
TODO: Some loops may be handled differently than using blocks and jumps,
=== Data
==== Load immediate number
N (- *)
Loads the immediate value onto the expression stack. (N is any value).
==== Load code reference
ProcId (- ptr)
Loads the address of the procedure referenced by ProcId.
==== Load and store memory
load StructId FieldNum (ptr - * ptr)
Read the value of a field from the object at the given address.
_StructId_ and _FieldNum_ are literal.
store StructId FieldNum (* ptr - ptr)
Store a value into the field of an object at the given address.
TODO: Make sure that we can easily handle memory barriers for GC.
TODO: Ordinary and array loads and stores.
==== Memory allocation
alloc StructId (- ptr)
alloc_mutable StructId (- ptr)
alloc_array ElementWidth (w - ptr)
alloc_array_mutable ElementWidth (w - ptr)
Plasma will use immutable structures more often than mutable ones, so
immutable is the "normal" type.
Note that this does not prevent store from being used on immutable data,
but doing so would be bad.
The GC among other things will use immutability information to optimise its
algorithms.
==== Retrive environment
The environment is a pointer to a struct, that pointer can be retrieved with
get_env (- ptr)
Then the resulting pointer can be used as a regular struct.
== Garbage collection
The Garbage Collector must be aware of which values are pointers and which
are not. Above we explained how information about structures can be used to
calculate this for typical heap cells.
This information must also be available for stack frames. There are
multiple ways to make this available at runtime. One simple solution is at
each GC save point execute code that updates a word in the stack frame
containing a bitfield that specifies which stack slots contain a pointer
into the heap.
We will probably require .pz programs to provide such bitfields within their
instruction streams. Since we use separate expression and return stacks it
will need to include information about how many of the top stack values
belong to the current procedure.
.TODO: Polymorphism
NOTE: Any polymorphic values will need their "is a pointer" bit filled in at
runtime. We can generate runtime code that takes an argument and constructs
the bit field. This information can be passed to the procedure by adding
extra argument(s) to the procedure, which is how polymorphism
transformations work in general.
=== Optimisations
The objects' bitfields can easily be stored together, as mark bits are
already stored in a GC. To save further on memory usage objects with
particular layouts can be allocated in particular heap regions. These heap
regions themselves provide this information. If a heap layout stores object
sizes with the object, the bitfields for most object sizes could easily be
packed with the object size.
Some of the information required for stack frames is implicit within the
instruction stream. Requiring it to be made explicit makes writing
+pzrun+ easier, but some of it could be omitted in a later version.
== Builtin operations
See runtime/pz_builtin.c
.Misc
----
print (ptr -)
int_to_string (w - ptr)
die ()
----
.Pointer tagging
----
// Combine a pointer and a tag into a tagged pointer
make_tag (ptr ptr - ptr)
// Combine a word and a tag into a tagged word (shifting the word)
shift_make_tag (ptr ptr - ptr)
// Extract the pointer and tag from a tagged pointer
break_tag (ptr - ptr ptr)
// Extract the word and tag from a tagged word (shifting the word)
break_shift_tag (ptr - ptr ptr)
// Unshift a tagged value
unshift_value (ptr - ptr)
----
.Deprecated
----
concat_string (w w - w)
----
== Linking to other modules
TODO
== Working with foreign code
TODO
== Using PZ
=== A note about data
The stack cannot be used to store complex data, neither can it be
intermediate in the instruction stream. Complex data (structs and arrays)
must be either statically allocated or allocated on the heap. In either
case the PZ machine needs to know about the structure or array being used.