Caution
July 18, 2010: Oops, dates on my computer were screwed up for one day, hence the commits timestamps were all wrong. I fixed it, and force-pushed the changes to github. Unfortunately, if you cloned / fetched the repository recently, you'll have to re-clone it again now -- merges / rebases will not work otherwise.
This project is inspired by this series of blog posts by Anatoly Vorobey (the link is in Russian).
Scheme implementations are a dime a dozen these days. In fact, the best ones are even free. (My personal favorite is Chicken Scheme ). Nobody needs another scheme interpreter, not even me. So why write it? For fun, obviously. Besides, sometimes I think every programmer should implement some kind of Lisp at least once -- and no, suffering the effects of Greenspun's 10th Rule doesn't give you a free pass.
So. The plan is to implement a relatively straightforward Scheme interpreter in C. I plan to keep C codebase as small as possible -- maybe under 2000 lines tops, maybe under 1500.
Minimum viable feature set: a Lisp-1, with lexical scoping, Fixnum and floating point arithmetics, unhygienic macros, tail-call elimination, reentrant continuations, and automatic memory management (garbage collection).
Note
July 18, 2010: The minimum viable feature set implemented. It's rough around the edges, but all the big things are in place and seem to work. Took me 5 days to get here, by git log.
- Latest achievements:
- Mark-and-sweep stop-the-world Garbage Collection
- gray set implementation is more efficient now
- C source lines: 1134, LoC: 942.
- Mark-and-sweep stop-the-world Garbage Collection
- What works:
- Some builtin arithmetics (fixnum and double), list functions.
lambda
, non-builtin function calls.- Lexical scope.
- Tail-call elimination.
quasiquote
and user defined macros viadefmacro
.call-with-current-continuation
works.read
/print
implemented, Read-Eval-Print Loop reimplemented in scheme.- Automatic memory management / Garbage Collection.
- What doesn't:
- Error handling is just not there.
- vectors
- syntax-rules
As of now, it's a more or less vanilla SECD1 machine, modified for varargs, special forms and tail calls elimination. Modifications are as follows:
- Stack is not a simple list, but a list of lists.
apply
removes the top list of the stack (so we can support varargs). - Some
PROCEDURE
s are markedFL_SYNTAX
. When SECD machine detects a syntax call (takes some look-ahead), arguments are not evaluated. Also, ifFL_EVAL
flag is set for syntax, the return value of the procedure is queued up for re-evaluation. - When a tail call is detected, new dump frames allocation is skipped in
apply
.
Tagged values. scm_val
is C long
. Lower 2 bits define the semantics of the upper 30 / 62 bits as follows. We rely on the cell allocator to always align cells on 4-byte boundary. Since we have our own allocator, it's easy to enforce.
Machine word bit values | scm_val type |
---|---|
|
Cell ptr, type info in cell |
|
Fixnum |
<24 or 56 bits of data><6 bits of tag>10 | Extended tag (next 6 bits) |
Primitive types are: CHAR
, BOOL
, FIXNUM
, SYMBOL
, SPECIAL
.
PROCEDURE is a toplevel type. flags used are SYNTAX, BUILTIN.
- PROCEDURE
is a cons(DEFINITION, env)
- DEFINITION for non-BUILTIN
is a cons(formals, body)
- DEFINITION for BUILTIN
is a cons(CFUNC, hint)
- CFUNC
is an
scm_val (*cfunc)(scm_val params, <SECD machine ptr>, scm_val hint)
In SECD machine2, continuation is the content of dump register. So, basically, we capture the state of SECD machine, and we can restore it later.
I admittedly don't understand macros well. For now, quasiquote
is implemented, and hooked up as the mechanism for user-defined macros. It cons()
-es like there's no tomorrow, of course, but hey, it gets the job done.
Pretty naive tri-color mark-and-sweep Garbage Collection3. We do our own C stack walking to collect pointers referencing something inside of our memory pool. GC provides gc_register()
call to notify GC about memory locations which may be of interest for GC. SECD machine uses it to register S
, E
, C
and D
.
- Memory management: we can try to force every non-cell blob object (like string data) to be always pointed at by exactly one cell. Then we get a pool for cells and another pool for blobs. Objects in cell pool can be garbage-collected trivially (walking C-stack may be necessary, though), blobs are freed when the referencing cell is GC-ed. I don't think I care enough to do real compaction -- freelist should be enough.
- Garbage Collection improvements:
- unroll the unnecessary "scm-aware
cons()
" code changes gc_unregister()
- memory management for blobs (like strings, file descriptors, etc) and vectors
- unroll the unnecessary "scm-aware
- Error handling (probably via error continuation?)
- More builtin primitives
- Bootstrap prelude.scm further
- 64-bit support and other portability issues
No idea yet, some code cleanup is due, I guess. After that, memory management improvements, error handling and scheme bootstrapping.