Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
266 lines (201 sloc) 8.6 KB


Task was easy, but that's not what I wanted to share

So there was this +300M brainfuck file... (329 080 541 to be precise).

It was a multilayer, so every next layer was also brainfuck file but a lot smaller:

329 080 541 original
 41 145 939 file1
  5 148 580 file2
    646 203 file3
     82 092 file4
     10 991 file5
      1 914 file6
        221 file7

last file was following python script:

import hashlib
class Tolkien:
    def __init__(self):
        self.quote = 'Not all those who wander are lost'
if __name__ == '__main__':
    t = Tolkien()

(If solution is the only thing that you're interested in, you can stop reading here)

AOT v1

So the biggest problem was probably making interpreter that is able to crunch that 300M file.

But without much hassle, I got an interpreter that could do that in 30s.

That was certainly fine for CTF, but it bothered me for some time and I thought that's lame, so I've come up with an idea of making an AOT for this task. I have bad experiances with asmjit, and I was already familiar with luajit's dynasm (, so I thought about using it.

I came up with following code (boilerplate dynasm code removed) first.cpp

That's where the problems started. Internally dynasm is handling some sort of sections and it's getting problematic, if you'll have to many labels (and that certainly will be the case for a brainfuck to x64 binary translation).

The problem comes from the fact how it's calculating so called 'positions' into pointers within a sections. I've tried requesting some help on luajit mailing list, but without much luck.

I was to lazy to analyze the code and do a proper fix, so instead I've changed the macros that handle calculations, so I've turned following macros from dynasm_x86.h:

/* Macros to convert positions (8 bit section + 24 bit index). */

#define DASM_POS2IDX(pos)       ((pos)&0x00ffffff)
#define DASM_POS2BIAS(pos)      ((pos)&0xff000000)
#define DASM_SEC2POS(sec)       ((sec)<<24)
#define DASM_POS2SEC(pos)       ((pos)>>24)
#define DASM_POS2PTR(D, pos)    (D->sections[DASM_POS2SEC(pos)].rbuf + (pos))

into (I know I have only one section, so we'll use only single bit for section, and 31 bits for index):

#define DASM_POS2IDX(pos)       ((pos)&0x7fffffff)
#define DASM_POS2BIAS(pos)      ((pos)&0x80000000)
#define DASM_SEC2POS(sec)       ((sec)<<31)
#define DASM_POS2SEC(pos)       ((pos)>>31)
#define DASM_POS2PTR(D, pos)    (D->sections[DASM_POS2SEC(pos)].rbuf + pos)

Now that itself did not work, as the input file was big enough to cause integer calculations, so I've fixed few more things:

typedef struct dasm_Section {
  int *rbuf;		/* Biased buffer pointer (negative section bias). */
  int *buf;		/* True buffer pointer. */
  size_t bsize;		/* Buffer size in bytes. */
  size_t /*int*/ pos;		/* Biased buffer position. */
  size_t /*int*/ epos;		/* End of biased buffer position - max single put. */
  int ofs;		/* Byte offset into section. */
} dasm_Section;

pos and epos in struct above became size_t, and at beginnig of dasm_put, there was also pos that became size_t size_t pos = sec->pos;

now there's one more thing that might be a problem if you don't have 4G of free mem ^^ basically dasm rellocation always try to double memory, I didn't need that, but you might need following hack, that slows down allocation a bit once you hit 512M:

#ifndef DASM_M_GROW
#define DASM_M_GROW(ctx, t, p, sz, need) \
  do { \
    size_t _sz = (sz), _need = (need); \
    if (_sz < _need) { \
      if (_sz < 16) _sz = 16; \
      if (_sz < 1024 * 1024 * 1024) \
          while (_sz < _need) _sz += _sz; \
      else \
          while (_sz < _need) _sz += 128 * 1024 * 1024; \
      (p) = (t *)realloc((p), _sz); \
      if ((p) == NULL) exit(1); \
      (sz) = _sz; \
    } \
  } while(0)

So with all of that in place it finally worked, and I was able to get get output of 300M file in time between 16-26 secs (depending on avail memory, allocations, system load etc.)

I've tried playing a bit with dynasm allocator (not to reallocate section), but that did not give much boost.

I collected bit more details, about time taken (in seconds):

action time
code generation 5.19
dynasm link 1.73
encode time 3.25
execution time 5.22
--------------- -------------
overall 15.39

At this point I thought, I should be able to beat 3 first parts, as dynasm needs to take care about handling labels.

But brainfuck is simple, so I thought it shouldn't be that hard to beat that value, right?

AOT v2

And it isn't :)

We can generate all the code on the fly in a single pass.

But first reminder how brainfuck loops work like.

  • [ - if byte(data pointer) == 0, jump after ]
  • ] - jump back to correpsonding opening [

So we will need simple trick when generating [, we'll gonna make following code:

  cmp [rdi], 0
  jnz continue_loop
  jmp dummy_value

  ... code goes here

; now handle closing brace
jmp loop_start

When generating, I'm gonna track all loop starts on the stack, and when generating jump at closing brace:

  • get from the stack position of loop_start, and generate jmp loop_start
  • calculate proper value to fill dummy_value, so that the jump will go to loop_end

That is what second.cpp does, easy peasy, time 6.53 :)

action time
code generation 1.25
execution time 5.28
--------------- -------------
overall 6.53


Now, next question is, can this get any better? Sure, with just a little bit of cheating.

The key is an observation, that there are many loops in the file that look like: [---->++<]

I'm storing this as a new insturction that will have arguments M(count1, count2) count1 is negative if there were - and positive it it were +, but in the files provided count1 is always negative.

The preprocessing step is bit lengthy, but it's not that complicated. Take a look at optimizer.cpp

So I'm changing whole loop into M(a,b) filling the rest with blanks. This can be used to calculate value of next cell as follows (pseudo code):

	starting = cell[ current ]
	a = -a
	while (starting % a)
		starting += 256;

	result = starting / a * b;
	cell[ current ] = 0
	cell[ current + 1 ] += result;

How to use it is along with second.cpp is left as an excercise.

Now time for the final number time: 2.8s :)


Obviously last step is a cheat, and if applied directly to interpreter, it would improve it's speed as well.

  • Is it an overkill? Who am I to argue.
  • Was it fun? OFC

Point was, that even from dumb task, you can get somewhat valuable knowledge.


So I've applied optimizations as well, and collected times from few runs, here are final numbers:


optimizer interpreter total
0.866119 1.18828 2.05485
0.776477 1.18891 1.9658
0.711183 1.18157 1.89319
0.721473 1.19604 1.91792
0.680486 1.18171 1.86262
0.677328 1.18375 1.8615
0.657009 1.17936 1.83681
0.669505 1.18598 1.85591
0.670191 1.18677 1.85737
0.669077 1.18755 1.8571
average 0.7098848 1.185992 1.896307


optimizer x64 generation execution total
0.731768 0.87682 1.00158 2.64302
0.677816 0.853774 1.00222 2.567
0.628973 0.843486 1.00134 2.50852
0.615942 0.854813 1.00879 2.51432
0.619122 0.846164 1.01687 2.51639
0.615045 0.855564 1.02103 2.52412
0.616443 0.84457 1.0055 2.49951
0.616248 0.842644 1.01138 2.52595
0.631289 0.847689 1.01104 2.52361
0.618018 0.84542 1.01319 2.51087
average 0.6370664 0.8510944 1.009294 2.533331

So it's pretty clear interpreter is faster ^^ If you compare exectution only vs interpreter the difference is not that big.

Was it worth it? Certainly :)