Behold the power of open source! adamcooke has reskinned Resque’s web UI.
Also now includes live updating:
Thanks Adam!
Resque is our Redis-backed library for creating background jobs, placing those jobs on multiple queues, and processing them later.
Background jobs can be any Ruby class or module that responds to
perform. Your existing classes can easily be converted to background
jobs or you can create new classes specifically to do work. Or, you
can do both.
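As a rough illustration of the "anything that responds to perform" idea, here is a minimal sketch using a toy in-memory queue. It does not use Resque itself; `ToyQueue`, `Archive`, and the arguments are hypothetical names chosen for the example, and a real queue would serialize the class name and arguments into Redis rather than an array.

```ruby
# A job is any class or module that responds to `perform`.
class Archive
  def self.perform(repo_id, branch = "master")
    "archived repo #{repo_id} at #{branch}"
  end
end

# Toy in-memory stand-in for a queue (an assumption, not Resque's code).
class ToyQueue
  def initialize
    @jobs = []
  end

  # Store the class name and arguments, the way a real queue would
  # serialize them before a worker picks them up later.
  def enqueue(klass, *args)
    unless klass.respond_to?(:perform)
      raise ArgumentError, "#{klass} must respond to perform"
    end
    @jobs << [klass.name, args]
  end

  # Pop the oldest job and run it; returns nil when the queue is empty.
  def work_one
    name, args = @jobs.shift
    return nil unless name
    Object.const_get(name).perform(*args)
  end
end

queue = ToyQueue.new
queue.enqueue(Archive, 44, "masterbrew")
result = queue.work_one
```

The point of the sketch is that the job class knows nothing about the queue: any existing class gains background behaviour simply by exposing `perform`.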
All the details are in the README. We've used it to process over 10m jobs since our move to Rackspace and are extremely happy with it.
But why another background library?
We've used many different background job systems at GitHub. SQS, Starling, ActiveMessaging, BackgroundJob, DelayedJob, and beanstalkd. Each change was out of necessity: we were running into a limitation of the current system and needed to either fix it or move to something designed with that limitation in mind.
With SQS, the limitation was latency. We were a young site and heard stories on Amazon forums of multiple minute lag times between push and pop. That is, once you put something on a queue you wouldn't be able to get it back for what could be a while. That scared us so we moved.
ActiveMessaging was next, but only briefly. We wanted something focused more on Ruby itself and less on libraries. That is, our jobs should be Ruby classes or objects, whatever makes sense for our app, and not subclasses of some framework's design.
BackgroundJob (bj) was a perfect compromise: you could process Ruby jobs or Rails jobs in the background. How you structured the jobs was largely up to you. It even included priority levels, which would let us make "repo create" and "fork" jobs run faster than the "warm some caches" jobs.
However, bj loaded the entire Rails environment for each job. Loading Rails is no small feat: it is CPU-expensive and takes a few seconds. So for a job that may take less than a second, you could have 8 - 20s of added overhead depending on how big your app is, how many dependencies it requires, and how bogged down your CPU is at that time.
DelayedJob (dj) fixed this problem: it is similar to bj, with a database-backed queue and priorities, but its workers are persistent. They only load Rails when started, then process jobs in a loop.
Jobs are just YAML-marshalled Ruby objects. With some magic you can turn any method call into a job to be processed later.
Perfect. DJ lacked a few features we needed but we added them and contributed the changes back.
We used DJ very successfully for a few months before running into some issues. First: backed up queues. DJ works great with small datasets, but once your site starts overloading and the queue backs up (to, say, 30,000 pending jobs) its queries become expensive. Creating jobs can take 2s+ and acquiring locks on jobs can take 2s+, as well. This means an added 2s per job created for each page load. On a page that fires off two jobs, you're at a baseline of 4s before doing anything else.
If your queue is backed up because your site is overloaded, this added overhead just makes the problem worse.
Solution: move to beanstalkd. beanstalkd is great because it's fast, supports multiple queues, supports priorities, and speaks YAML natively. A huge queue has constant time push and pop operations, unlike a database-backed queue.
beanstalkd also has experimental persistence - we need persistence.
However, we quickly missed DJ features: seeing failed jobs, seeing pending jobs (beanstalkd only allows you to 'peek' ahead at the next pending job), manipulating the queue (e.g. running through and removing all jobs that were created by a bug or with a bad job name), etc. A database-queue gives you a lot of cool features. So we moved back to DJ - the tradeoff was worth it.
Second: if a worker gets stuck, or is processing a job that will take hours, DJ has facilities to release a lock and retry that job when another worker is looking for work. But that stuck worker, even though his work has been released, is still processing a job that you most likely want to abort or fail.
You want that worker to fail or restart. We added code so that, instead of simply retrying a job that failed due to timeout, other workers will a) fail that job permanently then b) restart the locked worker.
In a sense, all the workers were babysitting each other.
But what happens when all the workers are processing stuck or long jobs? Your queue quickly backs up.
What you really need is a manager: someone like monit or god who can watch workers and kill stale ones.
Also, your workers will probably grow in memory a lot during the course of their life. So you need to either make sure you never create too many objects or "leak" memory, or you need to kill them when they get too large (just like you do with your frontend web instances).
At this point we have workers processing jobs with god watching them and killing any that are a) bloated or b) stale.
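The "bloated or stale" rule can be sketched as a small decision function. This is only an illustration of the policy, not god's configuration or API; the `WorkerStat` struct and both thresholds are assumptions made up for the example.

```ruby
# Hypothetical snapshot of a worker as a monitor might see it.
WorkerStat = Struct.new(:pid, :rss_mb, :job_started_at)

MAX_RSS_MB = 300       # assumed memory ceiling
MAX_JOB_SECONDS = 600  # assumed limit for a single job

# Returns :bloated, :stale, or nil (leave the worker alone).
def reason_to_kill(worker, now)
  return :bloated if worker.rss_mb > MAX_RSS_MB
  if worker.job_started_at && now - worker.job_started_at > MAX_JOB_SECONDS
    return :stale
  end
  nil
end

now = Time.at(10_000)
workers = [
  WorkerStat.new(101, 120, now - 5),     # healthy
  WorkerStat.new(102, 450, now - 5),     # too much memory
  WorkerStat.new(103, 90, now - 3600),   # stuck on one job for an hour
  WorkerStat.new(104, 80, nil)           # idle
]
verdicts = workers.map { |w| [w.pid, reason_to_kill(w, now)] }.to_h
```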
But how do we know all this is going on? How do we know what's sitting
on the queue? As I mentioned earlier, we had a web interface which
would show us pending items and try to infer how many workers are
working. But that's not easy - how do you have a worker you just
kill -9'd gracefully manage its own state? We added a process to
inspect workers and add their info to memcached, which our web
frontend would then read from.
But who monitors that process? And do we have one running on each server? This is quickly becoming very complicated.
Also we have another problem: startup time. There's a multi-second startup cost when loading a Rails environment, not to mention the added CPU time. With lots of workers doing lots of jobs being restarted on a non-trivial basis, that adds up.
It boils down to this: GitHub is a warzone. We are constantly overloaded and rely very, very heavily on our queue. If it's backed up, we need to know why. We need to know if we can fix it. We need workers to not get stuck and we need to know when they are stuck.
We need to see what the queue is doing. We need to see what jobs have failed. We need stats: how long are workers living, how many jobs are they processing, how many jobs have been processed total, how many errors have there been, are errors being repeated, did a deploy introduce a new one?
We need a background job system as serious as our web framework. I highly recommend DelayedJob to anyone whose site is not 50% background work.
But GitHub is 50% background work.
In the Old Architecture, GitHub had one slice dedicated to processing background jobs. We ran 25 DJ workers on it and all they did was run jobs. It was known as our "utility" slice.
In the New Architecture, certain jobs needed to be run on certain machines. With our emphasis on sharding data and high availability, a single utility slice no longer fit the bill.
Both beanstalkd and bj supported named queues or "tags," but DelayedJob did not. Basically we needed a way to say "this job has a tag of X" and then, when starting workers, tell them to only be interested in jobs with a tag of X.
For example, our "archive" background job creates tarballs and zip files for download. It needs to be run on the machine which serves tarballs and zip files. We'd tag the archive job with "file-serve" and only run it on the file serving slice. We could then re-use this tag with other jobs that needed to only be run on the file serving slice.
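The tagging idea can be sketched with a handful of named in-memory queues. This is an assumption-laden toy, not DelayedJob's or Resque's implementation; `TaggedQueues` and the job strings are hypothetical.

```ruby
# Toy named queues: jobs are pushed under a tag, and a worker only
# asks for the tags it was started with.
class TaggedQueues
  def initialize
    @queues = Hash.new { |hash, tag| hash[tag] = [] }
  end

  def push(tag, job)
    @queues[tag] << job
  end

  # Return the next job from the first listed tag that has work, or nil.
  def pop(*tags)
    tags.each do |tag|
      job = @queues[tag].shift
      return job if job
    end
    nil
  end
end

queues = TaggedQueues.new
queues.push("file-serve", "archive mojombo/jekyll")
queues.push("default", "warm some caches")

# A worker on the file serving slice only listens to "file-serve".
file_server_job = queues.pop("file-serve")
nothing_left = queues.pop("file-serve")
```

A worker started with several tags would simply pass them all to `pop`, in priority order.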
We added this feature to DelayedJob but then realized it was an opportunity to re-evaluate our background job situation. Did someone else support this already? Was there a system which met our upcoming needs (distributed worker management - god/monit for workers on multiple machines along with visibility into the state)? Should we continue adding features to DelayedJob? Our fork had deviated from master and the merge (plus subsequent testing) was not going to be fun.
We made a list of all the things we needed on paper and started re-evaluating a lot of the existing solutions. Kestrel, AMQP, beanstalkd (persistence still hadn't been rolled into an official release a year after being pushed to master).
Here's that list:
Can you name a system with all of these features:
I can. Redis.
If we let Redis handle the hard queue problems, we can focus on the hard worker problems: visibility, reliability, and stats.
And that's Resque.
With a web interface for monitoring workers, a parent / child forking model for responsiveness, swappable failure backends (so we can send exceptions to, say, Hoptoad), and the power of Redis, we've found Resque to be a perfect fit for our architecture and needs.
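To give a feel for what a parent / child forking model buys you, here is a minimal sketch (Unix only). It is not Resque's worker code; `run_in_child` is a hypothetical helper. The parent forks a child for each job and waits for it, so a job that crashes or gets killed takes down only the child.

```ruby
# Run a block in a forked child and report how the child ended.
def run_in_child
  pid = fork do
    begin
      yield
      exit!(0)          # skip at_exit handlers inherited from the parent
    rescue Exception
      exit!(1)          # any failure becomes a non-zero exit status
    end
  end
  _, status = Process.wait2(pid)
  # success? is nil when the child was killed by a signal.
  status.success? ? :ok : :failed
end

good   = run_in_child { 1 + 1 }
raised = run_in_child { raise "boom" }
killed = run_in_child { Process.kill("KILL", Process.pid) }
```

Because the child exits after every job, whatever memory the job allocated goes back to the operating system, and the parent stays small and responsive.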
We hope you enjoy it. We certainly do!
ab5tract has a great post looking at GitHub “through the lens of the ethics of commons-based peer production.”
A key quote which puts it in perspective for me is this:
The software further induces virtue in its participants through the `git blame` function, which immediately calls up the person responsible for a commit. In practice it is used as much to know who to praise as it is to know who to berate, but it fulfills one of the paper’s common criteria for extant commons-based peer production: that of a mechanism to mitigate the potential impacts of malicious users. Slashdot has its moderation system, Wikipedia its editors, and git has `blame`. In fact this functionality is a crucial part of what enables the ‘virtue spreading virtue’ element of such peer production.
Read the blog post for the whole scope. Thanks ab5tract!

It’s time for a meetup! This week we’ll be going to a faraway place, a place once thought to only exist in legend, a land they call the ‘Richmond’. According to local lore there’s a bar called Buckshot where you can play skee ball and shuffleboard while you shoot the breeze. And even if you don’t actually spot any mythical beasts, you might get a chance to talk to Chris about unicorns. 8pm Thursday November 5th.
We’ve had a number of inquiries into why we chose ldirectord as our primary load balancer for the new GitHub architecture. As I’ve mentioned before (and more on this later), we’ve hired the excellent team at Anchor as our server specialists. Our team lead over there is Matt Palmer, and we left the choice of load balancer up to him and his expertise. He’s taken it upon himself to explain the driving factors behind his choice, and it makes for an enlightening read if you’re interested in such things. Just head on over to the Anchor blog to check it out:
http://www.anchor.com.au/blog/2009/10/load-balancing-at-github-why-ldirectord/
Be sure to read the comments where the author of haproxy weighs in on the post and adds some additional perspective. Like most technology decisions, there is no single correct answer. Only tradeoffs and preferences.
At 07:53 PDT this morning the site was hit with an abnormal number of SSH connections. The script that runs after an SSH connection is accepted makes an RPC call to the backend to check for the existence of the repository so that we can display a nice error message if it is not present. The vast number of these calls that came in simultaneously caused some delays in the backend that cascaded to the frontends and resulted in a piling up of the scripts waiting for their RPC results. This, in turn, caused load to spike on the frontends further exacerbating the problem. I removed the RPC call from the SSH script to prevent this bottlenecking and soon after the barrage of SSH connections ceased.
Another unrelated problem caused the outage to continue even after the SSH connection load became nominal. Last night I deployed some package upgrades to our RPC stack that had tested out fine in staging for two days. While debugging the SSH problem, I restarted the backend RPC servers to rule them out as the problem source. This was the first time these processes had been restarted since the package upgrades, as they were deemed to be backward compatible with the changes and staging had shown no problems in this regard. However, it appears that these restarts put the RPC servers into an unworking state, and they began serving requests very sporadically. After failing to identify the problem within a short period, we decided to roll back to the previous known working state. After the packages were rolled back and the daemons restarted, the site picked up and began operating normally.
Full site operation returned at 09:34 PDT (some sporadic uptime was seen during the outage).
Over the next week we will be doing several things:
On a positive note, the outage led me to identify the source of several subtle bugs that have been eluding our detection for a few weeks. We are all rapidly learning the quirks of our new architecture in a production environment, and every problem leads to a more robust system in the future. Thanks for your patience over the last month and during the coming months as we work to improve the GitHub experience on every level.
jQuery, which we use for GitHub itself, is now hosted right here: http://github.com/jquery/jquery
If you’ve never contributed to the project, now’s a great time. Welcome, team!
Bob Silverberg just posted a nice guide to Setting up a Mac to Work with Git and GitHub. Thanks Bob!
Time to get your Rebase on! Send me a message about your project if you want to see it featured here, and please check out the Rebase howto as well. I’d love to see more than just web development stuff too. (but don’t stop that either!) Perhaps a collection of computer graphics related projects? AI? Music? You name it, just send me a message!

SubSonic is not your average ORM for .NET. This C# library is a veritable workhorse that follows convention over configuration and even allows developers to choose different data mapping paradigms, one such being Active Record. I wasn’t kidding about the workhorse bit: out of the box, it’s got support for LINQ, connecting to multiple DBs, and even a starter app to get you going. There’s an unbelievable amount of information on how to use it with your flavor of .NET on their wiki, and definitely check out how this shapes up compared to the other available ORMs. Besides, who else can beat screencasts set to Led Zeppelin and Rush?
ChicagoBoss claims to bring together the best of Rails and Django into the world of Erlang. Sounds neat, but how exactly does that work? Check out the MVC examples here and even some fledgling API docs for how this framework is shaping up. Another neat thing: Tokyo Tyrant/Cabinet support is built in, so you can add key/value store to the list of buzzwords that Boss already has. Get forking!
Firepicker is a Firefox extension that adds a color picker into Firebug. Now, you won’t have to fumble around trying to find a specific application on your OS to do this when you’re playing with CSS in Firebug. Secondly, if you’re new to XUL and Firefox development in the first place, this is a great project to look at to get started. Check out some screenshots and how to install it here.
Papervision3D‘s readme may be short, but don’t let that deter you. It’s better if you just go look for yourself. Ok, so it’s a fully immersable 3D world written in ActionScript that’s open source. Whoever the first person to write a game for this environment is, please invite me to your private beach and/or yacht. There could be a ton of neat ways to implement this: perhaps a more 3D StreetView, planetarium, panoramas, the list goes on and on. Papervision’s dev blog has a lot of neat related links too.

Come get your drink on with the people of the Hub at Blackbird this Thursday October 22nd at 8pm. Also, be sure to look out for a possible Drinkup:Shanghai, China edition in the next few days. PJ and Scott are headed out for KungFuRails right now!
As I detailed in How We Made GitHub Fast, we have created a new data serialization and RPC protocol to power the GitHub backend. We have big plans for these technologies and I’d like to take a moment to explain what makes them special and the philosophy behind their creation.
The serialization format is called BERT (Binary ERlang Term) and is based on the existing external term format already implemented by Erlang. The RPC protocol is called BERT-RPC and is a simple protocol built on top of BERT packets.
You can view the current specifications at http://bert-rpc.org.
This is a long article; if you want to see some example code of how easy it is to set up an Erlang/Ruby BERT-RPC server and call it from a Ruby BERT-RPC client, skip to the end.
For the new GitHub architecture, we decided to use a simple RPC mechanism to expose the Git repositories as a service. This allows us to federate users across disparate file servers and eliminates the need for a shared file system.
Choosing a data serialization and RPC protocol was a difficult task. My first thought was to look at Thrift and Protocol Buffers since they are both gaining traction as modern, low-latency RPC implementations.
I had some contact with Thrift when I worked at Powerset, I talk to a lot of people that use Thrift at their jobs, and Scott is using Thrift as part of some Cassandra experiments we’re doing. As much as I want to like Thrift, I just can’t. I find the entire concept behind IDLs and code generation abhorrent. Coming from a background in dynamic languages and automated testing, these ideas just seem silly. The developer overhead required to constantly maintain IDLs and keep the corresponding implementation code up to date is too frustrating. I don’t do these things when I write application code, so why should I be forced to do them when I write RPC code?
Protocol Buffers ends up looking very similar to Thrift. More IDLs and more code generation. Any solution that relies on these concepts does not fit well with my worldview. In addition, the set of types available to both Thrift and Protocol Buffers feels limiting compared to what I’d like to easily transmit over the wire.
XML-RPC, SOAP, and other XML based protocols are hardly even worth mentioning. They are unnecessarily verbose and complex. XML is not convertible to a simple unambiguous data structure in any language I’ve ever used. I’ve wasted too many hours of my life clumsily extracting data from XML files to feel anything but animosity towards the format.
JSON-RPC is a nice system, much more in line with how I see the world. It’s simple, relatively compact, has support for a decent set of types, and works well in an agile workflow. A big problem here, though, is the lack of support for native binary data. Our applications will be transmitting large amounts of binary data, and it displeases me to think that every byte of binary data I send across the wire would have to be encoded into an inferior representation just because JSON is a text-based protocol.
After becoming thoroughly disenchanted with the current “state of the art” RPC protocols, I sat down and started thinking about what the ideal solution would look like. I came up with a list that looked something like this:
I mentioned before that I like JSON. I love the concept of extracting a subset of a language and using that to facilitate interprocess communication. This got me thinking about the work I’d done with Erlectricity. About two years ago I wrote a C extension for Erlectricity to speed up the deserialization of Erlang’s external term format. I remember being very impressed with the simplicity of the serialization format and how easy it was to parse. Since I was considering using Erlang more within the GitHub architecture, an Erlang-centric solution might be really nice. Putting these pieces together, I was struck by an idea.
What if I extracted the generic parts of Erlang’s external term format and made that into a standard for interprocess communication? What if Erlang had the equivalent of JavaScript’s JSON? And what if an RPC protocol could be built on top of that format? What would those things look like and how simple could they be made?
Of course, the first thing any project needs is a good name, so I started brainstorming acronyms. EETF (Erlang External Term Format) is the obvious one, but it’s boring and not accurate for what I wanted to do since I would only be using a subset of EETF. After a while I came up with BERT for Binary ERlang Term. Not only did this moniker precisely describe the nature of the idea, but it was nearly a person’s name, just like JSON, offering a tip of the hat to my source of inspiration.
Over the next few weeks I sketched out specifications for BERT and BERT-RPC and showed them to a bunch of my developer friends. I got some great feedback on ways to simplify some confusing parts of the spec and was able to boil things down to what I think is the simplest manifestation that still enables the rich set of features that I want these technologies to support.
The responses were generally positive, and I found a lot of people looking for something simple to replace the nightmarish solutions they were currently forced to work with. If there’s one thing I’ve learned in doing open source over the last 5 years, it’s that if I find an idea compelling, then there are probably a boatload of people out there that will feel the same way. So I went ahead with the project and created reference implementations in Ruby that would eventually become the backbone of the new GitHub architecture.
But enough talk, let’s take a look at the Ruby workflow and you’ll see what I mean when I say that BERT and BERT-RPC are built around a philosophy of simplicity and Getting Things Done.
To give you an idea of how easy it is to get a Ruby based BERT-RPC service running, consider the following simple calculator service:
# calc.rb
require 'ernie'
mod(:calc) do
  fun(:add) do |a, b|
    a + b
  end
end
This is a complete service file suitable for use by my Erlang/Ruby hybrid BERT-RPC server framework called Ernie. You start up the service like so:
$ ernie -p 9999 -n 10 -h calc.rb
This fires up the server on port 9999 and spawns ten Ruby workers to handle requests. Ernie takes care of balancing and queuing incoming connections. All you have to worry about is writing your RPC functions; Ernie takes care of the rest.
To call the service, you can use my Ruby BERT-RPC client called BERTRPC like so:
require 'bertrpc'
svc = BERTRPC::Service.new('localhost', 9999)
svc.call.calc.add(1, 2)
# => 3
That’s it! Nine lines of code to a working example. No IDLs. No code generation. If the module and function that you call from the client exist on the server, then everything goes well. If they don’t, then you get an exception, just like your application code.
Since a BERT-RPC client can be written in any language, you could easily call the calculator service from Python or JavaScript or Lua or whatever. BERT and BERT-RPC are intended to make communicating between different languages as streamlined as possible.
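For anyone tempted to write a client in another language, here is a rough sketch of what the `calc.add(1, 2)` call might look like on the wire. It hand-encodes a small subset of Erlang's external term format and assumes the request shape `{call, Module, Function, Arguments}` with a 4-byte big-endian length header in front of each packet; check the spec at bert-rpc.org before relying on any of it. The `Tuple` struct and the helper names are made up for the example and are not part of the BERT or BERTRPC libraries.

```ruby
# Ruby has no tuple type, so this sketch wraps tuple elements in a struct.
Tuple = Struct.new(:elements)

# Encode a term using a subset of Erlang's external term format.
def encode_term(term)
  case term
  when Tuple    # SMALL_TUPLE_EXT (104): arity byte, then the elements
    [104, term.elements.length].pack("CC") +
      term.elements.map { |e| encode_term(e) }.join
  when Symbol   # ATOM_EXT (100): 2-byte length, then the name
    name = term.to_s.b
    [100, name.bytesize].pack("Cn") + name
  when Integer
    if term.between?(0, 255)
      [97, term].pack("CC")      # SMALL_INTEGER_EXT
    elsif term.between?(-2**31, 2**31 - 1)
      [98, term].pack("Cl>")     # INTEGER_EXT, 32-bit signed big-endian
    else
      raise ArgumentError, "bignums are outside this sketch"
    end
  when Array
    return [106].pack("C") if term.empty?   # NIL_EXT is the empty list
    # LIST_EXT (108): 4-byte length, elements, NIL_EXT tail
    [108, term.length].pack("CN") +
      term.map { |e| encode_term(e) }.join + [106].pack("C")
  else
    raise ArgumentError, "unsupported term: #{term.inspect}"
  end
end

# A complete term starts with the format version byte 131.
def encode_bert(term)
  [131].pack("C") + encode_term(term)
end

# Frame a payload with a 4-byte big-endian length header.
def berp(payload)
  [payload.bytesize].pack("N") + payload
end

def rpc_call(mod, fun, args)
  berp(encode_bert(Tuple.new([:call, mod, fun, args])))
end

packet = rpc_call(:calc, :add, [1, 2])
```

Writing `packet` to a socket connected to the server and reading back a length-prefixed reply is all a minimal client would need to do.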
The Ernie framework and the BERTRPC library power the new GitHub and we use them exactly as-is. They’ve been in use since the move to Rackspace three weeks ago and are responsible for serving over 300 million RPC requests in that period. They are still incomplete implementations of the spec, but I plan to flesh them out as time goes on.
If you find BERT and BERT-RPC intriguing, I’d love to hear your feedback. The best place to hold discussions is on the official mailing list. If you want to participate, I’d love to see implementations in more languages. Together, we can make BERT and BERT-RPC the easiest way to get RPC done in every language!
Now that things have settled down from the move to Rackspace, I wanted to take some time to go over the architectural changes that we’ve made in order to bring you a speedier, more scalable GitHub.
In my first draft of this article I spent a lot of time explaining why we made each of the technology choices that we did. After a while, however, it became difficult to separate the architecture from the discourse and the whole thing became confusing. So I’ve decided to simply explain the architecture and then write a series of follow up posts with more detailed analyses of exactly why we made the choices we did.
There are many ways to scale modern web applications. What I will be describing here is the method that we chose. This should by no means be considered the only way to scale an application. Consider it a case study of what worked for us given our unique requirements.
We expose three primary protocols to end users of GitHub: HTTP, SSH, and Git. When browsing the site with your favorite browser, you’re using HTTP. When you clone, pull, or push to a private URL like git@github.com:mojombo/jekyll.git you’re doing so via SSH. When you clone or pull from a public repository via a URL like git://github.com/mojombo/jekyll.git you’re using the Git protocol.
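As a small illustration, the three URL shapes above can be told apart with a few patterns. This is just a sketch for the reader; `protocol_for` is a hypothetical helper, not something from our codebase.

```ruby
# Classify a repository or page URL by the protocol it will use.
def protocol_for(url)
  case url
  when %r{\Ahttps?://}        then :http
  when %r{\Agit://}           then :git
  when /\A[\w.-]+@[\w.-]+:/   then :ssh   # scp-style user@host:path
  end
end

http = protocol_for("http://github.com/mojombo/jekyll")
ssh  = protocol_for("git@github.com:mojombo/jekyll.git")
git  = protocol_for("git://github.com/mojombo/jekyll.git")
```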
The easiest way to understand the architecture is by tracing how each of these requests propagates through the system.
For this example I’ll show you how a request for a tree page such as http://github.com/mojombo/jekyll happens.
The first thing your request hits after coming down from the internet is the active load balancer. For this task we use a pair of Xen instances running ldirectord. These are called lb1a and lb1b. At any given time one of these is active and the other is waiting to take over in case of a failure in the master. The load balancer doesn’t do anything fancy. It forwards TCP packets to various servers based on the requested IP and port and can remove misbehaving servers from the balance pool if necessary. In the event that no servers are available for a given pool it can serve a simple static site instead of refusing connections.
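The pool behaviour described here can be sketched in a few lines. This is a toy model, not ldirectord: the round-robin scheduling and the `BalancePool` class are assumptions for illustration, and the `:fallback` symbol stands in for the static site.

```ruby
# Toy balance pool: hand out healthy servers in turn, drop misbehaving
# ones, and fall back to a static site when nothing is left.
class BalancePool
  def initialize(servers)
    @servers = servers
    @healthy = servers.dup
    @cursor = 0
  end

  def mark_down(server)
    @healthy.delete(server)
  end

  def mark_up(server)
    @healthy << server if @servers.include?(server) && !@healthy.include?(server)
  end

  def pick
    return :fallback if @healthy.empty?
    server = @healthy[@cursor % @healthy.length]
    @cursor += 1
    server
  end
end

pool = BalancePool.new(%w[fe1 fe2])
first_three = [pool.pick, pool.pick, pool.pick]
pool.mark_down("fe1")
after_failure = pool.pick
pool.mark_down("fe2")
when_empty = pool.pick
```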
For requests to the main website, the load balancer ships your request off to one of the four frontend machines. Each of these is an 8 core, 16GB RAM bare metal server. Their names are fe1, …, fe4. Nginx accepts the connection and sends it to a Unix domain socket upon which sixteen Unicorn worker processes are selecting. One of these workers grabs the request and runs the Rails code necessary to fulfill it.
Many pages require database lookups. Our MySQL database runs on two 8 core, 32GB RAM bare metal servers with 15k RPM SAS drives. Their names are db1a and db1b. At any given time, one of them is master and one is slave. MySQL replication is accomplished via DRBD.
If the page requires information about a Git repository and that data is not cached, then it will use our Grit library to retrieve the data. In order to accommodate our Rackspace setup, we’ve modified Grit to do something special. We start by abstracting out every call that needs access to the filesystem into the Grit::Git object. We then replace Grit::Git with a stub that makes RPC calls to our Smoke service. Smoke has direct disk access to the repositories and essentially presents Grit::Git as a service. It’s called Smoke because Smoke is just Grit in the cloud. Get it?
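The stubbing trick can be sketched roughly as follows. None of this is Grit's or Smoke's actual code: `GitStub`, `LoopbackTransport`, `FakeRemoteGit`, and the method names are hypothetical, and the transport here calls the remote object in-process where a real one would serialize the call and send it over the network.

```ruby
# Stand-in for the remote side, which has direct disk access.
class FakeRemoteGit
  def rev_parse(repo, ref)
    "#{repo}@#{ref}"
  end
end

# Stand-in for an RPC transport; records calls and forwards them.
class LoopbackTransport
  attr_reader :calls

  def initialize(target)
    @target = target
    @calls = []
  end

  def call(method_name, args)
    @calls << method_name
    @target.public_send(method_name, *args)
  end
end

# The stub exposes the same methods as the local object would, but
# every one of them becomes a remote call.
class GitStub
  REMOTE_METHODS = [:rev_parse].freeze

  def initialize(transport)
    @transport = transport
  end

  REMOTE_METHODS.each do |name|
    define_method(name) { |*args| @transport.call(name, args) }
  end
end

transport = LoopbackTransport.new(FakeRemoteGit.new)
git = GitStub.new(transport)
sha = git.rev_parse("mojombo/jekyll", "master")
```

Callers keep using the same interface; only the object behind it changes.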
The stubbed Grit makes RPC calls to smoke, which is a load balanced hostname that maps back to the fe machines. Each frontend runs four ProxyMachine instances behind HAProxy that act as routing proxies for Smoke calls. ProxyMachine is my content aware (layer 7) TCP routing proxy that lets us write the routing logic in Ruby. The proxy examines the request and extracts the username of the repository that has been specified. We then use a proprietary library called Chimney (it routes the smoke!) to look up the route for that user. A user’s route is simply the hostname of the file server on which that user’s repositories are kept.
Chimney finds the route by making a call to Redis. Redis runs on the database servers. We use Redis as a persistent key/value store for the routing information and a variety of other data.
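Since Chimney is proprietary, here is only a guess at the shape of a route lookup. A plain Hash stands in for Redis, and the key format, class name, and hostnames in the table are all assumptions for illustration.

```ruby
# Toy route table: maps a username to the file server holding that
# user's repositories. The Hash stands in for a key/value store.
class RouteTable
  def initialize(store)
    @store = store
  end

  # "mojombo/jekyll" -> "mojombo"
  def username_for(repo)
    repo.split("/", 2).first
  end

  # Returns the file server hostname, or nil when there is no route.
  def route_for(repo)
    @store["route:#{username_for(repo)}"]
  end
end

store = { "route:mojombo" => "fs1", "route:defunkt" => "fs3" }
routes = RouteTable.new(store)
known = routes.route_for("mojombo/jekyll")
unknown = routes.route_for("nobody/nothing")
```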
Once the Smoke proxy has determined the user’s route, it establishes a transparent proxy to the proper file server. We have four pairs of fileservers. Their names are fs1a, fs1b, …, fs4a, fs4b. These are 8 core, 16GB RAM bare metal servers, each with six 300GB 15K RPM SAS drives arranged in RAID 10. At any given time one server in each pair is active and the other is waiting to take over should there be a fatal failure in the master. All repository data is constantly replicated from the master to the slave via DRBD.
Every file server runs two Ernie RPC servers behind HAProxy. Each Ernie spawns 15 Ruby workers. These workers take the RPC call and reconstitute and perform the Grit call. The response is sent back through the Smoke proxy to the Rails app where the Grit stub returns the expected Grit response.
When Unicorn is finished with the Rails action, the response is sent back through Nginx and directly to the client (outgoing responses do not go back through the load balancer).
Finally, you see a pretty web page!
The above flow is what happens when there are no cache hits. In many cases the Rails code uses Evan Weaver’s Ruby memcached client to query the Memcache servers that run on each slave file server. Since these machines are otherwise idle, we place 12GB of Memcache on each. These servers are aliased as memcache1, …, memcache4.
For our data serialization and RPC protocol we are using BERT and BERT-RPC. You haven’t heard of them before because they’re brand new. I invented them because I was not satisfied with any of the available options that I evaluated, and I wanted to experiment with an idea that I’ve had for a while. Before you freak out about NIH syndrome (or to help you refine your freak out), please read my accompanying article Introducing BERT and BERT-RPC about how these technologies came to be and what I intend for them to solve.
If you’d rather just check out the spec, head over to http://bert-rpc.org.
For the code hungry, check out my Ruby BERT serialization library BERT, my Ruby BERT-RPC client BERTRPC, and my Erlang/Ruby hybrid BERT-RPC server Ernie. These are the exact libraries we use at GitHub to serve up all repository data.
Git uses SSH for encrypted communications between you and the server. In order to understand how our architecture deals with SSH connections, it is first important to understand how this works in a simpler setup.
Git relies on the fact that SSH allows you to execute commands on a remote server. For instance, the command ssh tom@frost ls -al runs ls -al in the home directory of my user on the frost server. I get the output of the command on my local terminal. SSH is essentially hooking up the STDIN, STDOUT, and STDERR of the remote machine to my local terminal.
If you run a command like git clone tom@frost:mojombo/bert, what Git is doing behind the scenes is SSHing to frost, authenticating as the tom user, and then remotely executing git upload-pack mojombo/bert. Now your client can talk to that process on the remote server by simply reading and writing over the SSH connection. Neat, huh?
Of course, allowing arbitrary execution of commands is unsafe, so SSH includes the ability to restrict what commands can be executed. In a very simple case, you can restrict execution to git-shell which is included with Git. All this script does is check the command that you’re trying to execute and ensure that it’s one of git upload-pack, git receive-pack, or git upload-archive. If it is indeed one of those, it uses exec to replace the current process with that new process. After that, it’s as if you had just executed that command directly.
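The check that a restricted shell performs can be sketched like this. It is not git-shell's source, just the idea in Ruby: `restricted_argv` is a hypothetical helper that accepts only the three subcommands named above and returns the argument vector a real shell would then hand to exec.

```ruby
require "shellwords"

ALLOWED_SUBCOMMANDS = %w[upload-pack receive-pack upload-archive].freeze

# Parse the requested command line and refuse anything that is not
# one of the permitted git subcommands.
def restricted_argv(command_line)
  words = Shellwords.split(command_line)
  unless words.first == "git" && ALLOWED_SUBCOMMANDS.include?(words[1])
    raise ArgumentError, "command not allowed: #{command_line}"
  end
  words   # a real shell would now call exec(*words)
end

argv = restricted_argv("git upload-pack 'mojombo/bert'")
```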
So, now that you know how Git’s SSH operations work in a simple case, let me show you how we handle this in GitHub’s architecture.
First, your Git client initiates an SSH session. The connection comes down off the internet and hits our load balancer.
From there, the connection is sent to one of the frontends where SSHD accepts it. We have patched our SSH daemon to perform public key lookups from our MySQL database. Your key identifies your GitHub user and this information is sent along with the original command and arguments to our proprietary script called Gerve (Git sERVE). Think of Gerve as a super smart version of git-shell.
Gerve verifies that your user has access to the repository specified in the arguments. If you are the owner of the repository, no database lookups need to be performed, otherwise several SQL queries are made to determine permissions.
Once access has been verified, Gerve uses Chimney to look up the route for the owner of the repository. The goal now is to execute your original command on the proper file server and hook your local machine up to that process. What better way to do this than with another remote SSH execution!
I know it sounds crazy but it works great. Gerve simply uses exec(3) to replace itself with a call to ssh git@<route> <command> <arg>. After this call, your client is hooked up to a process on a frontend machine which is, in turn, hooked up to a process on a file server.
Think of it this way: after determining permissions and the location of the repository, the frontend becomes a transparent proxy for the rest of the session. The only drawback to this approach is that the internal SSH is unnecessarily encumbered by the overhead of encryption/decryption when none is strictly required. It’s possible we may replace this internal SSH call with something more efficient, but this approach is just too damn simple (and still very fast) to make me worry about it very much.
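The Gerve flow described above (check access, look up the route, then exec an internal SSH call) can be sketched roughly as follows. Gerve itself is proprietary, so everything here is an assumption: the function names are made up, and the `collaborators` and `routes` dictionaries are hypothetical stand-ins for the SQL permission queries and the Chimney lookup.

```python
def owner_of(repo):
    """Return the user part of a "user/repo" name."""
    return repo.split("/", 1)[0]


def has_access(user, repo, collaborators):
    """Owners pass without any lookup; everyone else needs a permission check."""
    if user == owner_of(repo):
        return True
    # Stand-in for the SQL queries mentioned in the text (hypothetical data shape).
    return user in collaborators.get(repo, set())


def build_proxy_argv(user, command, repo, routes, collaborators):
    """Build the argv for the internal `ssh git@<route> <command> <arg>` call."""
    if not has_access(user, repo, collaborators):
        raise PermissionError("%s may not access %s" % (user, repo))
    # Stand-in for the Chimney lookup: routes are keyed by repository owner.
    route = routes[owner_of(repo)]
    return ["ssh", "git@" + route, command, repo]


if __name__ == "__main__":
    routes = {"mojombo": "fs1.example"}  # hypothetical file server name
    print(build_proxy_argv("mojombo", "git-upload-pack", "mojombo/bert", routes, {}))
```

In a real frontend the final step would be an exec of that argv (for example `os.execvp("ssh", argv)`), so the frontend process is replaced by the SSH client and simply relays bytes for the rest of the session.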
Performing public clones and pulls via Git is similar to how the SSH method works. Instead of using SSH for authentication and encryption, however, it relies on a server side Git Daemon. This daemon accepts connections, verifies the command to be run, and then uses fork(2) and exec(3) to spawn a worker that then becomes the command process.
With this in mind, I’ll show you how a public clone operation works.
First, your Git client issues a request containing the command and repository name you wish to clone. This request enters our system on the load balancer.
From there, the request is sent to one of the frontends. Each frontend runs four ProxyMachine instances behind HAProxy that act as routing proxies for the Git protocol. The proxy inspects the request and extracts the username (or gist name) of the repo. It then uses Chimney to look up the route. If there is no route or any other error is encountered, the proxy speaks the Git protocol and sends back an appropriate message to the client. Once the route is known, the repo name (e.g. mojombo/bert) is translated into its path on disk (e.g. a/a8/e2/95/mojombo/bert.git). On our old setup that had no proxies, we had to use a modified daemon that could convert the user/repo into the correct filepath. By doing this step in the proxy, we can now use an unmodified daemon, allowing for a much easier upgrade path.
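As a rough illustration of the inspect-and-rewrite step, the sketch below parses a Git protocol request, swaps in the on-disk path, and re-encodes it. The request framing (four hex length digits, then the command, a space, the path, and NUL-separated extras) follows the Git daemon protocol, but the sharding scheme is only a guess: `disk_path` derives the directory prefix from an MD5 of the username, which is an assumption made to mimic the shape of the example path, not the actual scheme. All function names are hypothetical.

```python
import hashlib


def parse_request(packet):
    """Split a Git daemon request into (command, path, extra bytes)."""
    length = int(packet[:4].decode("ascii"), 16)  # length includes the 4 digits
    payload = packet[4:length]
    command_and_path, _, extra = payload.partition(b"\0")
    command, _, path = command_and_path.partition(b" ")
    return command.decode(), path.decode(), extra


def build_request(command, path, extra=b""):
    """Encode a request, prefixing the payload with its hex length."""
    payload = command.encode() + b" " + path.encode() + b"\0" + extra
    return b"%04x" % (len(payload) + 4) + payload


def disk_path(user, name):
    """Assumed sharding: prefix directories taken from an MD5 of the username."""
    digest = hashlib.md5(user.encode()).hexdigest()
    shards = [digest[0], digest[0:2], digest[2:4], digest[4:6]]
    return "/".join(shards + [user, name + ".git"])


def rewrite_request(packet):
    """Return the same request with the repo name replaced by its disk path."""
    command, path, extra = parse_request(packet)
    repo = path.strip("/")
    if repo.endswith(".git"):
        repo = repo[:-4]
    user, name = repo.split("/", 1)
    return build_request(command, "/" + disk_path(user, name), extra)


if __name__ == "__main__":
    request = build_request("git-upload-pack", "/mojombo/bert.git", b"host=github.com\0")
    print(rewrite_request(request))
```

Because the rewritten request is an ordinary Git protocol request, the daemon on the file server needs no knowledge of the sharding layout, which is the upgrade-path benefit described above.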
Next, the Git proxy establishes a transparent proxy with the proper file server and sends the modified request (with the converted repository path). Each file server runs two Git Daemon processes behind HAProxy. The daemon speaks the pack file protocol and streams data back through the Git proxy and directly to your Git client.
Once your client has all the data, you’ve cloned the repository and can get to work!
In addition to the primary web application and Git hosting systems, we also run a variety of other sub-systems and side-systems. Sub-systems include the job queue, archive downloads, billing, mirroring, and the svn importer. Side-systems include GitHub Pages, Gist, gem server, and a bunch of internal tools. You can look forward to explanations of how some of these work within the new architecture, and what new technologies we’ve created to help our application run more smoothly.
The architecture outlined here has allowed us to properly scale the site and resulted in massive performance increases across the entire site. Our average Rails response time on our previous setup was anywhere from 500ms to several seconds depending on how loaded the slices were. Moving to bare metal and federated storage on Rackspace has brought our average Rails response time to consistently under 100ms. In addition, the job queue now has no problem keeping up with the 280,000 background jobs we process every day. We still have plenty of headroom to grow with the current set of hardware, and when the time comes to add more machines, we can add new servers on any tier with ease. I’m very pleased with how well everything is working, and if you’re like me, you’re enjoying the new and improved GitHub every day!
Today marks Ryan Tomayko’s first day as a GitHubber. He’ll be helping make GitHub more stable, reliable, and awesome.
Ryan has consistently impressed all of us with his work on Sinatra and Rack, the awesomeness of shotgun and git-sh, his prolific writing and linking, and his various other projects.
You can follow his blog, his Twitter, or his GitHub.
Welcome to the team, Ryan!
We will be having a maintenance window tonight from 23:00 to 23:59 PDT. A very small amount of web unavailability will be required during this period.
We will be upgrading some core libraries to versions that are not compatible with what is currently running, so all daemons must be restarted simultaneously. For this to go smoothly, we will be disabling the web app for perhaps 30 seconds.
UPDATE: Maintenance was completed successfully. Total web unavailability was a tad more than estimated at one minute and 40 seconds. Some job runners did not restart cleanly and as a result some jobs failed, but all job runners are operating normally now. If you experienced any problems during the maintenance window, don’t hesitate to contact us at http://support.github.com.
UNICEF is using SMS to help those in need. And they’re doing it with open source.
You can read all about RapidSMS, their Mobile and SMS platform, but here’s a snippet:
The impact a RapidSMS implementation has on UNICEF’s work practices is dramatic. In October 2008, Ethiopia experienced crippling droughts. Faced with the possibility of famine, UNICEF Ethiopia launched a massive food distribution program to supply the high-protein food Plumpy’nut to under-nourished children at more than 1,800 feeding centres in the country. Previously, UNICEF monitored the distribution of food by sending a small set of individuals who traveled to each feeding center. The monitor wrote down the amount of food that was received, was distributed, and if more food was needed. There had been a two week to two month delay between the collection of that data and analysis, prolonging action. In a famine situation each day can mean the difference between recovery, starvation, or even death.
The Ethiopian implementation of RapidSMS completely eliminated the delay. After a short training session the monitors would enter information directly into their mobile phones as SMS messages. This data would instantaneously appear on the server and immediately be visualized into graphs showing potential distribution problem and displayed on a map clearly showing where the problems were. The data could be seen, not only by the field office, but by the regional office, supply division and even headquarters, greatly improving response coordination. The process of entering the data into phones was also easier and more cost effective for the monitors themselves leading to quick adoption of the technology.
What a great use of technology. The site says, “GSMA [predicts] that by 2010, 90% of the world will be covered by mobile networks.” Seems like SMS is going to become more important and more ubiquitous in the future.
Check out the RapidSMS home page or browse the source, right here on GitHub: http://github.com/rapidsms/rapidsms