• Post-Mortem of This Morning's Outage

    mojombo 30 Oct 2009

    At 07:53 PDT this morning the site was hit with an abnormally large number of SSH connections. The script that runs after an SSH connection is accepted makes an RPC call to the backend to check that the repository exists, so that we can display a nice error message if it does not. The vast number of these calls arriving simultaneously caused delays in the backend that cascaded to the frontends, where scripts piled up waiting for their RPC results. This, in turn, caused load to spike on the frontends, further exacerbating the problem. I removed the RPC call from the SSH script to eliminate this bottleneck, and soon afterward the barrage of SSH connections ceased.
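    The fix described above was to remove the RPC call outright. As an illustration only, and not what was deployed, here is a minimal Python sketch of another way to keep such a check from piling up: bound the wait and fail open when the backend is slow. The names `repo_exists_with_timeout`, `should_show_missing_repo_error`, `rpc_check`, and the 0.5 second default are all hypothetical.

```python
import queue
import threading


def repo_exists_with_timeout(rpc_check, repo_path, timeout_s=0.5):
    """Ask the backend whether repo_path exists, waiting at most timeout_s.

    rpc_check is a hypothetical callable standing in for the real RPC client.
    Returns True or False when the backend answers in time, and None when it
    is too slow or raises, so the caller can skip the check instead of waiting.
    """
    result = queue.Queue(maxsize=1)

    def worker():
        try:
            result.put(bool(rpc_check(repo_path)))
        except Exception:
            # A failing backend is treated the same as a slow one.
            result.put(None)

    # Daemon thread: an abandoned call never blocks the SSH script's exit.
    threading.Thread(target=worker, daemon=True).start()
    try:
        return result.get(timeout=timeout_s)
    except queue.Empty:
        return None


def should_show_missing_repo_error(rpc_check, repo_path, timeout_s=0.5):
    """Show the nice error only when the backend positively says 'missing'."""
    return repo_exists_with_timeout(rpc_check, repo_path, timeout_s) is False
```

    With this shape, a slow backend costs each SSH session at most the timeout, and the friendly error message is simply skipped under load rather than holding the connection open. Note that the abandoned call still occupies a thread until it returns, so this limits waiting on the frontend but does not by itself reduce load on the backend.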

    A second, unrelated problem caused the outage to continue even after the SSH connection load returned to normal. Last night I deployed some package upgrades to our RPC stack that had tested out fine in staging for two days. While debugging the SSH problem, I restarted the backend RPC servers to rule them out as the source of the problem. This was the first time these processes had been restarted since the package upgrades, because the running processes had been deemed backward compatible with the changes and staging had shown no problems in this regard. However, it appears that these restarts put the RPC servers into a non-working state, and they began serving requests only sporadically. After failing to identify the problem within a short period, we decided to roll back to the previous known working state. Once the packages were rolled back and the daemons restarted, the site picked up and began operating normally.

    Full site operation returned at 09:34 PDT (some sporadic uptime was seen during the outage).

    Over the next week we will be doing several things:

    • Further testing on staging to attempt to reproduce the behavior seen on production and resolve the underlying issue.
    • Better SSH script logging to more quickly identify abnormal behavior.
    • Working towards a more fine-grained rolling deploy of infrastructure packages to limit the impact of unforeseen problems.
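    As a rough illustration of the rolling-deploy idea in the last item, here is a minimal Python sketch. It is not our deploy tooling: `rolling_deploy`, `upgrade`, `is_healthy`, and `rollback` are hypothetical names, and the sketch assumes that `upgrade` both installs the packages and restarts the daemon, so that the health check exercises the new code rather than a process started before the upgrade.

```python
def rolling_deploy(hosts, upgrade, is_healthy, rollback, batch_size=1):
    """Upgrade hosts one batch at a time, stopping at the first unhealthy batch.

    upgrade, is_healthy, and rollback are hypothetical callables that each
    take a host name. upgrade is assumed to install the new packages AND
    restart the service on that host.

    Returns the list of upgraded hosts on success, or an empty list if a
    batch failed its health check and every touched host was rolled back.
    """
    done = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            upgrade(host)
        done.extend(batch)

        if not all(is_healthy(host) for host in batch):
            # Undo in reverse order; hosts in later batches were never touched.
            for host in reversed(done):
                rollback(host)
            return []
    return done
```

    With a batch size of one, a bad package takes down at most a single RPC server before the deploy halts, instead of surfacing on every server at the next restart.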

    On a positive note, the outage led me to identify the source of several subtle bugs that have been eluding our detection for a few weeks. We are all rapidly learning the quirks of our new architecture in a production environment, and every problem leads to a more robust system in the future. Thanks for your patience over the last month and during the coming months as we work to improve the GitHub experience on every level.

  • Comments

    Llanilek Fri Oct 30 12:07:07 -0700 2009

    glad to see you all back up and running... !!!

    mdarby Fri Oct 30 12:11:32 -0700 2009

    Thanks for your continued transparency!

    mickey Fri Oct 30 12:13:27 -0700 2009

    Thanks for this transparency in explaining the downtime. It's not just interesting, it's simply awesome!

    cadwallion Fri Oct 30 12:14:29 -0700 2009

    It is through your outages that I realize how dependent I am on your continued service, and through post-mortems like this that I resist the urge to freak out about that dependency.

    mharper Fri Oct 30 12:23:03 -0700 2009

    That's one badass unicorn!

    stevenhaddox Fri Oct 30 12:24:59 -0700 2009

    This is the most amazing communication I've ever seen from a company regarding outages, causes, and details. Props and kudos to everyone on the team and best of luck working out the bugs that haunt us all.

    kgbland Fri Oct 30 15:14:58 -0700 2009

    Thank you for the transparency, really good reading.

    cryos Fri Oct 30 20:45:51 -0700 2009

    I am also very appreciative of the transparency you have shown. The outage was inconvenient, but much easier to take when you are willing to explain what happened, why and what you learned. Thanks for providing such a great service, even in outages.

    I also loved the unicorn!

    eculver Sat Oct 31 00:28:27 -0700 2009

    Thanks for sharing the insight. We ALL learn from this.

    kolektiv Sun Nov 01 14:15:52 -0800 2009

    Yeah seconded. (Thirded? etc.) Keep being transparent when there's issues and I'll keep being cool with it. Things do happen. Pretending they don't is rubbish, keep on with the honesty and I'll be here a long time.

    raldred Mon Nov 02 00:58:30 -0800 2009

    Honesty always prevails, thank you github.

    SecretDiamond Tue Nov 03 20:31:19 -0800 2009

    I agree. Honesty is always the best policy!

    ThomasHabets Sat Nov 07 11:34:22 -0800 2009

    I love these post-mortems. They serve to teach as well as show that we are in good hands.
