Standalone brainfuck compiler
Python
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.
tests
LICENSE
README.md
bfc.py

README.md

BFC

BFC is an ahead-of-time compiler for brainfuck. It takes a brainfuck source file and outputs native code in the form of a 32-bit ELF executable. The executable only depends on Linux syscalls, not the C library or anything else. This is basically as fast as unoptimized brainfuck code is going to get.

The output program allocates 1 MB (1048576 cells) of memory on the stack for the tape. That is more than enough, even for complex programs like the mandelbrot renderer by Erik Bosman. Because the stack grows downwards, the increase and decrease pointer operations are reversed in the output code.

Although a language like C is generally more suitable for low-level development like this, I decided to implement it in Python because it makes the code simply more readable and easy to understand. Additionally, it removes frustrations like having to implement file reading code for the 1000th time. The source should be valid Python 2.7 and Python 3.x code.

Usage

Prepare a brainfuck source file, like tests/hello.bf and invoke the compiler by running:

$ ./bfc.py tests/hello.bf

If the file could be read and correctly parsed, then an executable named hello will be produced in the same directory.

$ tests/hello
Hello World!

Tests

Some sample brainfuck programs have been included for testing the compiler.

  • hello: Hello world
  • infinite: Infinite loop
  • mandelbrot: 128x48 ascii mandelbrot renderer by Erik Bosman
  • rot13: ROT13 cipher

Details

Since this project simultaneously serves as a learning sample for basic compiler architectures, it is set up like a regular compiler. The input source goes through the

  • Lexing
  • Parsing
  • Optimizing
  • Compiling
  • Linking

passes to result in an output executable. The lexer simply discards all characters that are not part of the brainfuck language and turns the characters into tokens, implemented as child classes of Token.

The parser is a recursive descent parser and basically only exists for loops. It's included for completeness more than anything else, although the tree it creates does help a bit with code generation logic.

The parser is implemented according to the following LL(1) grammar in EBNF form:

program  = { command | loop }
command  = ">" | "<" | "+" | "-" | "." | ","
loop     = "[", { command | loop }, "]

The optimizer merges repeated increment and decrement instructions, so that the code generator can easily emit add and subtract instructions instead.

The code generator produces one or two instructions for each node in the AST. Jumps are written with the correct relative addresses corresponding to loop beginnings and endings. Currently it always uses the 16/32 bit jump instructions for simplicity. Optimization for executable size could be added by falling back to 8 bit jumps if possible.

The final pass creates a bare-bones ELF file that instructs the loader to load the entire file into memory (as is common with executables) and run the program positioned in the file after the ELF and program header.

There are no checks at runtime, so if you mess up the program, it will happily write or read memory where it shouldn't. Because the behaviour of brainfuck is rather hard to predict, the only solution would be to add bounds checks to every read and write, but all these branches would significantly impact performance.

The code generated for each token is listed below. Input and output is handled by the read and write syscalls on Linux. They are called using interrupt 0x80. The exit syscall is used to end the program.

>

dec esp

<

inc esp

+

inc [esp]

-

dec [esp]

.

mov eax, 0x4 ; write
mov ebx, 0x1 ; stdout
mov ecx, esp
mov edx, 0x1 ; write 1 byte at pointer
int 0x80

,

mov eax, 0x3 ; read
mov ebx, 0x0 ; stdin
mov ecx, esp
mov edx, 0x1 ; read 1 byte to pointer
int 0x80

[

loop:
    cmp [esp], 0
    je loop_end

]

    jmp loop
loop_end:

EOF

mov eax, 1 ; exit
mov ebx, 0 ; EXIT_SUCCESS (= 0)
int 0x80

It may be an interesting experiment to mark the code section itself as writable and allow for self-modifying code that way.

License

BFC is licensed under the MIT license, of which the terms are outlined in the LICENSE file.