GitHub was created as a side project, but it seems to have struck a nerve and gained traction quickly. As such, a lot of the infrastructure decisions were made not figuring on this sort of growth:

One of the major pieces of the infrastructure is how we store the repositories. The way it was originally set up worked great for a while, but it wasn’t sustainable.
As an example, let’s take my github-services repository. Here’s where it was stored prior to yesterday:
/our-shared-drive/pjhyett/github-services.git
Straightforward and simple, with the added benefit that the repo was easy to locate in the file system if we needed to debug an issue.
That works well unless you have thousands of folders sitting in the same directory. GFS tried as best as it could, but with the amount of IO we do at GitHub writing to and reading from the file system, a change had to be made quickly.
After migrating last night, taking the same repository, this is where it’s currently stored:
/our-shared-drive/5/52/af/b5/pjhyett/github-services.git
Instead of every user sitting in one directory, we’ve sharded the repositories based on an MD5 of the username. A large change to be sure, but after some number crunching, our very own Tom Preston-Werner told me everyone on the planet can sign up twice and we still won’t have to change the way we shard our repositories after this.
Another interesting point worth mentioning is that the first directory, ‘5’, was set up specifically so we could add multiple GFS mounts (we currently use just one) using a simple numbering system to help scale the data when we start bumping up against that wall again.
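To make the layout concrete, here is a minimal Python sketch of how such a path could be derived. It is an illustration, not the code GitHub runs: the function name `shard_path` and its parameters are hypothetical, and it assumes the three two-character directories are the first six hex digits of the username's MD5, which the post does not spell out.

```python
import hashlib
from pathlib import PurePosixPath


def shard_path(username: str, repo: str, mount: int = 5,
               root: str = "/our-shared-drive") -> PurePosixPath:
    """Build a sharded repository path (hypothetical helper).

    Assumption: the shard directories are the first six hex digits of
    the MD5 of the username, split into three two-character levels.
    """
    digest = hashlib.md5(username.encode("utf-8")).hexdigest()
    # Three levels of two hex characters each: 256 entries per level at most.
    levels = [digest[i:i + 2] for i in range(0, 6, 2)]
    # The leading mount number leaves room for more mounts later.
    return PurePosixPath(root, str(mount), *levels, username, f"{repo}.git")


print(shard_path("pjhyett", "github-services"))
```

Under that assumption, each level holds at most 256 subdirectories, so no single directory grows with the number of users until the 256 × 256 × 256 leaf buckets start to fill up.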
Now, the question you may all be asking is why we didn’t do this from the beginning. The simple answer is it would have taken more time and prevented us from launching when we did. We could have spent a couple of extra weeks in the beginning figuring out and preventing bottlenecks, but the site may not have taken off and then we would have built a scalable site that three people use.
Truth be told, it’s a great problem to have, and the site is humming along smoothly now. Now we can get back to doing fun things like building new features for you guys and gals. Keep an eye out for the big one we’re launching next week!



By slicing the MD5 into 3 different directories, you may have many more directories than users, and as you grow, the traversal of the tree will be a bottleneck.
About 130 million users could be sustained with just prefixing the username directory with the first 3 letters of the hash, like:
/52a/pjhyett/...
I did 161616*32000. This assumes your FS can do 32K files within a directory (like plain old ext3). If it can support 64K files within a dir, then you could use the first 4 letters of the hash and support over a billion users.
Just my initial thoughts, someone tell me if my math is off or if I missed something obvious.
Bah, it tried to format my math.
I tried to say “16 times 16 times 16 times 32K”
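The arithmetic in this comment can be checked with a short Python sketch. The function name `capacity` is made up for illustration, and the per-directory limits (32,000 and 64,000) are the commenter's own figures; the estimate also assumes usernames hash evenly across the prefixes.

```python
def capacity(prefix_hex_digits: int, entries_per_dir: int) -> int:
    """Users supported by one directory level keyed on a hex prefix.

    Each hex digit has 16 possible values, so a prefix of n digits
    gives 16**n directories, each holding entries_per_dir users.
    """
    return 16 ** prefix_hex_digits * entries_per_dir


print(capacity(3, 32_000))  # 131072000, the "about 130 million" figure
print(capacity(4, 64_000))  # 4194304000, "over a billion"
```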
@up_the_irons 32K files is out of the question on GFS. That’s way way too many.
I knew Rails didn’t scale.
Yeah rails never scales, one one.
Looking forward to features (and tweaks) of awesomeness!
Nice to know you know how to deal with problems. I’m enjoying github much, much, much. Thank you for that.
glad you guys got it solved
@up_the_irons I should have mentioned that there wasn’t any change in average response time, but the real win is there haven’t been any GFS hiccups since the migration. Chris is right, though, we had problems long before 32k files in a directory.
Okay, now that scaling issues are out of the way, can we get back to work? ;)
I need a paginator on the popularly watched and forked projects pages.
The All Projects page is useless when I click it, because by default it displays empty projects from new users.
The popularly watched projects page displays projects so popular that they are almost mainstream and everyone knows about them. I want to see the less popular projects and ideas that people are playing with. I already know about rails, merb and crap gang.
I had the same problem. The problem is a GFS scaling problem, and engineyard uses GFS. I solved the problem in a slightly different way, and some people who are interested in this problem might want to check out my post on it (w/ code):
http://forum.engineyard.com/forums/5/topics/80