• The New Queue

    defunkt 30 Oct 2008

    Yesterday we moved to a new queue, Shopify’s delayed_job (or dj).

    After trying a few different solutions in the early days, we settled on Ara Howard’s Bj. It was fine for quite a while, but some of the design decisions haven’t been working out for us lately. Bj allows you to spawn exactly one worker per machine – we want a machine dedicated to workers. Bj loads a new Rails environment for every job submitted – we want to load a new Rails environment one time only. Both of these decisions carry performance implications.

    If we were to run one Bj per machine, we’d only have four workers running, as GitHub consists of four ultra-beefy app slices. Unlike most contemporary web apps, the fewer slices we have the better – fewer machines connected to our networked file system means less network chatter and lock contention. As some of the jobs take a while to run (60+ seconds), four workers is a very low number. We want something like 20, but we’d settle for as few as 8.

    We did hack Bj to allow multiple instances to run on a machine, but that ended up being counterproductive due to design decision #2: loading a new Rails environment for each job.

    See, Rails takes a while to start up. Not only do you have to load all the associated libraries, but each require statement needs to look through the entire, massive load path – a load path that includes the Rails app, Rubygems, the Rails source code, and all of our plugins. Doing this over and over, multiple times a minute, burns a lot of CPU and takes a lot of time. In some cases, the Rails load time is 99% of the entire background job’s lifetime. Spawning a whole bunch of Bjs on a single machine meant we effectively DoS’d the poor CPU.
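
    To make the startup tax concrete, here is a rough back-of-the-envelope sketch. The numbers are assumptions picked so that boot time is 99% of a job's lifetime, matching the worst case described above; they are not measurements from our machines.

```ruby
# Illustrative numbers only; none of these figures come from measurements.
BOOT_SECONDS = 9.9   # assumed cost of loading the Rails environment once
WORK_SECONDS = 0.1   # assumed cost of the job's real work

# Share of a job's lifetime spent booting when every job boots Rails.
def boot_share(boot, work)
  boot / (boot + work)
end

# Total seconds for n jobs when each job boots its own environment.
def per_job_boot_total(n, boot, work)
  n * (boot + work)
end

# Total seconds for n jobs when one long-lived worker boots once.
def boot_once_total(n, boot, work)
  boot + n * work
end

job_count = 100
puts format("boot share per job: %.0f%%", boot_share(BOOT_SECONDS, WORK_SECONDS) * 100)
puts format("boot per job:  %.1f s", per_job_boot_total(job_count, BOOT_SECONDS, WORK_SECONDS))
puts format("boot once:     %.1f s", boot_once_total(job_count, BOOT_SECONDS, WORK_SECONDS))
```

    With a per-job boot, the cost grows with every job submitted; with a long-lived worker, it is paid once and then amortized.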

    I started working on a solution, but it was at this point we realized we were doing something wrong. These are not flaws in Bj; they are design decisions – these two ideas make Bj a pleasure to work with and perfect for simple sites. It’s working great on FamSpam. We had simply outgrown it, and hacking Bj would have been error-prone and time-consuming. Luckily, we had seen people praising Dj in the past, and a solid recommendation from technoweenie was all we needed.

    The transition took about an hour and a half – from installing the plugin to successfully running Dj on the production site, complete with local and staging trial runs (and bug fixes). Because we had changed queues so many times in the past, we were already using a simple interface of our own, RockQueue, for submitting jobs.

    RockQueue meant we didn’t have to change any application code, just infrastructure. I highly recommend an abstraction like this for vendor-specific APIs that would normally be littered all throughout your app, as changing vendors can become a major pain.
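
    As a minimal sketch of what such an abstraction can look like: the code below is hypothetical and is not the RockQueue source. The names JobQueue, MemoryAdapter, InlineAdapter, and WelcomeEmail are all invented for illustration. The idea is that application code only ever calls one method, and the vendor lives behind a swappable adapter.

```ruby
# Hypothetical sketch of a queue facade; not the actual RockQueue source.
module JobQueue
  # Adapter that keeps jobs in memory, handy for tests and local runs.
  class MemoryAdapter
    attr_reader :jobs

    def initialize
      @jobs = []
    end

    def push(job_class, *args)
      @jobs << [job_class, args]
    end
  end

  # Adapter that runs the job immediately instead of queueing it.
  class InlineAdapter
    def push(job_class, *args)
      job_class.new(*args).perform
    end
  end

  class << self
    # Swap this one object to change queue vendors.
    attr_accessor :adapter

    # The only call application code ever makes.
    def push(job_class, *args)
      adapter.push(job_class, *args)
    end
  end
end

# An example job: any class whose instances respond to #perform.
class WelcomeEmail
  SENT = []

  def initialize(login)
    @login = login
  end

  def perform
    SENT << @login
  end
end

JobQueue.adapter = JobQueue::MemoryAdapter.new
JobQueue.push(WelcomeEmail, "alice")   # queued, not run

JobQueue.adapter = JobQueue::InlineAdapter.new
JobQueue.push(WelcomeEmail, "alice")   # run right away
```

    Moving to a new queue then means writing one more adapter that responds to push, while every call site in the app stays untouched.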

    Anyway, Dj lets us spawn as many workers on a machine as we want. They’re just rake tasks running a loop, after all. It deals with locking and retries in a simple way, and works much like Bj. The queue is much faster now that we don’t have to pay the Rails startup tax.
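
    The "loop with locking and retries" idea can be sketched in a few lines. This is a toy in-memory version written for illustration, not Dj's implementation: Dj keeps its jobs in a database table, whereas here a Mutex stands in for the database lock, and the Job struct, the Worker class, and the MAX_ATTEMPTS limit are all assumptions.

```ruby
# Hypothetical in-memory sketch; delayed_job itself stores jobs in a database table.
Job = Struct.new(:payload, :attempts, :locked_by, :done, :failed)

class Worker
  MAX_ATTEMPTS = 3  # assumed limit, chosen for illustration

  def initialize(name, jobs, mutex)
    @name = name
    @jobs = jobs
    @mutex = mutex
  end

  # Claim the next free job; the mutex stands in for a database-level lock.
  def reserve
    @mutex.synchronize do
      job = @jobs.find { |j| !j.done && !j.failed && j.locked_by.nil? }
      job.locked_by = @name if job
      job
    end
  end

  # Run one job; on error, count the attempt and release it for a retry.
  def work_one
    job = reserve
    return false unless job

    begin
      job.payload.call
      job.done = true
    rescue StandardError
      job.attempts += 1
      job.failed = true if job.attempts >= MAX_ATTEMPTS
    ensure
      job.locked_by = nil
    end
    true
  end

  # The loop itself: keep working until nothing is left to claim.
  def run
    loop { break unless work_one }
  end
end

calls = 0
flaky = lambda do
  calls += 1
  raise "transient failure" if calls < 2
end

queue = [
  Job.new(flaky, 0, nil, false, false),
  Job.new(-> { :ok }, 0, nil, false, false)
]
lock = Mutex.new
threads = 2.times.map { |i| Thread.new { Worker.new("worker-#{i}", queue, lock).run } }
threads.each(&:join)
```

    Because a job is claimed under the lock before it runs, two workers never pick up the same job, which is what makes it safe to start as many workers as the machine can handle.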

    We now have a single machine dedicated to running background tasks. We’re running 20 Dj workers on it with great success. There is no science behind this number.

    Since people have already started asking “why didn’t you use queue X” or “you should use queue Y,” it seems reasonable to address that: we were very happy with Bj and wanted a similar system, albeit with a few different opinions. Dj is that system. It is simple, required no research beyond the short README, works wonderfully with Rails, is fast, is hackable, solves both the queue and the worker problems, and has no external dependencies. Also, it’s hosted on GitHub!

    Dj is our 5th queue. In the past we’ve used SQS, ActiveMQ, Starling, and Bj. Dj is so far my favorite.

    In a future post I’ll discuss the ways in which we use (and abuse) our queue. Count on it.

  • Comments

    mtodd Thu Oct 30 11:41:10 -0700 2008

    Awesome, love to see good solutions to interesting problems.

    I’d be interested to see what your thoughts and critiques of the other queues were (SQS, ActiveMQ, and Bj). I know I saw in a tweet about ActiveMQ specifically that it wasn’t a complete system because it left the workers entirely up to you… if I recall correctly.

    paulca Thu Oct 30 12:17:05 -0700 2008

    Great stuff, thanks for the write-up. I used a combination of the Ruby daemons gem and god for worker queues on Exceptional and Qwitter, will definitely have to look into Dj.

    joergbattermann Thu Oct 30 12:17:06 -0700 2008

    Oh… no starling? Got half a minute to elaborate why you moved away from it?

    collin Thu Oct 30 13:36:44 -0700 2008

    It’s sort of on topic. I’ve been poking around with http://memcachedb.org/memcacheq/ for about half a day.

    It’s nice to have a queue that speaks memcache, though it is a little rough around the edges; it could have taken more than an hour and a half to get going, depending on how things went.

    I’ve gotten really excited about queues recently, I’ll have to give Dj a look. Thanks for pointing it out.

    ropiku Thu Oct 30 14:14:17 -0700 2008

    I’m interested to see your comments on SQS and ActiveMQ. RabbitMQ is also interesting (and using nanite with it).

    qrush Thu Oct 30 14:43:13 -0700 2008

    Great post! Definitely going to watch Dj if I ever need to use it.

    dustin Thu Oct 30 14:53:45 -0700 2008

    I still quite disagree that memcached support in a job queue is good.

    What github’s done that is really good for them is abstracted away from the API. That makes the technology a bit less important. dj works well for them now, but if they decide it doesn’t, they can evaluate something else without being afraid of the impact of implementing it everywhere.

    ismasan Sat Nov 01 13:05:08 -0700 2008

    Did you give Beanstalk a try as well? I’ve heard nothing but praise about it…

    mmmurf Mon Nov 03 13:47:54 -0800 2008

    What don’t you guys like about Starling? I’ve been using it and it seems to work pretty well.

    alistairholt Thu Nov 06 03:41:59 -0800 2008

    Great, always interesting to see more talk about queues. I too would love to see your comments on all of the previous queue systems you have tried.

    deepak Tue Feb 17 03:10:13 -0800 2009

    Have you guys written RockQueue, can you release the source?
