Gracefully handle MongoDB replicaset changes #200

Closed
GUI opened this issue Mar 16, 2015 · 1 comment

GUI commented Mar 16, 2015

During our last upgrade of some system-level stuff a few weeks ago, we unexpectedly experienced about 15 seconds of downtime across our service. It occurred when I forced the MongoDB cluster to switch the primary server to a different server. I've gone through this process in the past without incurring downtime, so I finally got around to investigating how the MongoDB replicaset change triggered a brief outage.

I discovered two things that could happen when MongoDB was electing a new primary server:

  • The API backends config loaded from MongoDB would get wiped, which could result in all API backends disappearing (this happened if the config reloader ran before the new primary was elected). This effectively took down all the APIs, which obviously is a very bad thing.
  • API key lookups on individual requests could fail. Since we verify API keys against the MongoDB database, lookups could fail for any API that required API keys. These requests were being retried, but possibly not for long enough to outlast the election.
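
To make the two failure modes concrete, here is a minimal Python sketch of the kind of guards that avoid them. It is illustrative only and is not the code from the actual commits: `PrimaryUnavailable`, `BackendConfig`, and `retry_until` are hypothetical names, the database is replaced by plain callables, and treating a suddenly empty config as suspect is an assumption on my part.

```python
import time


class PrimaryUnavailable(Exception):
    """Hypothetical error standing in for 'no primary elected yet'."""


class BackendConfig:
    """Keeps the last known good backend list across failed reloads."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable returning a list of backends
        self.backends = []

    def reload(self):
        """Return True if the config was replaced, False if the old one was kept."""
        try:
            fresh = self._fetch()
        except PrimaryUnavailable:
            return False      # keep serving the previous config
        if not fresh and self.backends:
            # Assumption: an empty result right after a non-empty one is
            # more likely a failed read than a real "delete everything".
            return False
        self.backends = fresh
        return True


def retry_until(lookup, timeout=15.0, delay=0.5,
                sleep=time.sleep, clock=time.monotonic):
    """Retry lookup() until it succeeds or the timeout would be exceeded."""
    deadline = clock() + timeout
    while True:
        try:
            return lookup()
        except PrimaryUnavailable:
            if clock() + delay > deadline:
                raise         # out of time: surface the failure
            sleep(delay)
```

The point of the deadline is that the retry window needs to be longer than a typical election; the 15 second default here is only a placeholder chosen to match the outage length mentioned above.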

This has been addressed in a few updates:

So now, whether MongoDB is completely down or just going through a replicaset re-election, things should behave. The only downside of the current approach is that requests using API keys may pause until the new primary server is selected, so requests during this window may take 5-10 seconds to complete. These primary server changes aren't super common, so I don't think this is a huge issue. Caching the API key data locally on the servers would improve this and would eventually be good to roll out, but these fixes at least prevent outright failures. (Caching is hopefully on the horizon and is already implemented in the experimental Lua revamp: NREL/api-umbrella#111.)
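
For reference, the local caching idea could look roughly like the Python sketch below. This is an assumption-laden illustration, not the implementation from the Lua revamp: `ApiKeyCache` is a hypothetical name, the database lookup is an injected callable, `ConnectionError` stands in for whatever error the driver raises while there is no primary, and serving stale entries during an outage is a design choice that may not suit every deployment.

```python
import time


class ApiKeyCache:
    """Small in-process TTL cache in front of an API key lookup."""

    def __init__(self, lookup, ttl=60.0, clock=time.monotonic):
        self._lookup = lookup     # callable: api_key -> record
        self._ttl = ttl
        self._clock = clock
        self._entries = {}        # api_key -> (expires_at, record)

    def get(self, api_key):
        now = self._clock()
        entry = self._entries.get(api_key)
        if entry is not None and entry[0] > now:
            return entry[1]       # fresh hit, no database round trip
        try:
            record = self._lookup(api_key)
        except ConnectionError:
            if entry is not None:
                return entry[1]   # database unreachable: serve the stale record
            raise                 # never seen this key, nothing to fall back on
        self._entries[api_key] = (now + self._ttl, record)
        return record
```

With something like this in place, keys that were seen recently keep working through an election without the 5-10 second pause; only keys that were never cached still have to wait for the new primary. The trade-off is that a revoked key may stay valid for up to the TTL, or longer while the database is unreachable.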

@GUI GUI self-assigned this Mar 16, 2015
@GUI GUI added this to the Sprint 17 (3/9-3/20) milestone Mar 16, 2015

GUI commented Mar 16, 2015

As noted, fixed by a few different commits across projects. These updates have been rolled out to the servers.

@GUI GUI closed this as completed Mar 16, 2015