The Lua rollout plan #294

Closed · 6 tasks done
GUI opened this issue Oct 27, 2015 · 6 comments

GUI commented Oct 27, 2015

We have a significant update to the API Umbrella platform we're going to be releasing: NREL/api-umbrella#183. This issue is to coordinate how we're going to update the api.data.gov stack with this update.

  • Phase 1: Testing
    • This is a significant update. It should be backwards compatible, but given the scope of changes, we want to do lots of testing.
    • We've been doing multi-day stress testing and looking for any potential problems like memory leaks.
      • One memory leak was identified and fixed. Otherwise, all tests currently look good, and the new stack performs much better than the old stack on the same tests.
  • Phase 2: Limited rollout
    • Deploy the new code base to a new proxy server (running the router and web components).
      • We'll also be taking this as an opportunity to migrate our servers to HVM AMIs for better virtualization performance.
      • I'd also like to tweak our disk partition setup to make it easier to move the API Umbrella data between servers.
    • Configure that server to use our existing database servers (the upgraded stack also includes upgraded database versions, but we will defer these upgrades during this phase).
    • Be sure to add back in global rate limits (see the sketch after this list).
    • Figure out a NAT setup to route outgoing traffic from the new proxy server through one of our existing IPs (so any API backends with IP restrictions won't need to be updated with a new IP). Two options:
      • Set up a dedicated NAT instance, assign one of our existing elastic IPs to it, and route both our old and new servers through this instance.
      • Allow one of our existing proxy servers to act as a NAT instance, and route the outgoing traffic for our new server through that existing server.
    • Add this new server to the NREL ELB. We'll be using NREL as the test candidate, since I have the most insight and monitoring on the underlying APIs.
    • Keep an eye on things. At this point, one third of NREL traffic should be going to the new stack, so I think we just want to give things a bit of time to make sure we don't hit any unanticipated issues when dealing with real traffic.
  • Phase 3: Wider partial rollout
    • Notify agencies of the upcoming upgrades. We hopefully shouldn't be anticipating any issues, but it still seems worth notifying everyone, since live traffic will be shifting to the new stack.
    • Add the new proxy server to all other ELBs.
    • With this setup, one third of all traffic should be hitting the new stack across all agencies. So continue to keep an eye on things to make sure we don't hit any weird issues.
  • Phase 4: Wider complete rollout
    • Spin up a second server with the new stack on an HVM AMI.
    • Add that server to all the ELBs.
    • Remove the old servers from the ELBs.
    • At this point, all traffic should be routing through the new proxy stack.
  • Phase 5: Database upgrades
    • Spin up two new database servers on HVM AMIs with the new partition structure.
    • Add these servers to the mongodb and elasticsearch clusters, but ensure we configure things so these servers get a full copy of the data (we don't just want to expand the size of our cluster and only store partial data on these machines).
    • Update the config on the proxy servers to point to these new servers.
    • Remove old servers from database clusters so only new servers are being hit.
  • Phase 6: Cleanup
    • Shut down the old stack of servers.
    • Perform a final backup and then delete the servers once we're comfortable.
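For reference, here's a rough sketch of what a shared-dict-based global rate limit could look like in the new OpenResty/Lua stack. This is not API Umbrella's actual implementation; the dict name, threshold, and window size are all illustrative assumptions:

```lua
-- Minimal sketch of a fixed-window global rate limit (illustrative only).
-- nginx.conf would need something along the lines of:
--   lua_shared_dict rate_limits 10m;
--   access_by_lua_file /path/to/global_rate_limit.lua;

local dict = ngx.shared.rate_limits
local limit = 1000 -- hypothetical global requests-per-second cap

-- One counter key per one-second window.
local key = "global:" .. ngx.time()
local count, err = dict:incr(key, 1)
if not count then
  -- First request in this window: create the counter with a short TTL.
  local ok, add_err = dict:add(key, 1, 2)
  if ok then
    count = 1
  elseif add_err == "exists" then
    count = dict:incr(key, 1)
  else
    ngx.log(ngx.ERR, "rate limit dict error: ", add_err)
    return -- fail open rather than reject traffic on an internal error
  end
end

if count and count > limit then
  ngx.header["Retry-After"] = 1
  return ngx.exit(429)
end
```

The fixed one-second window keeps the sketch short; a sliding window or leaky bucket would smooth out bursts at window boundaries.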

GUI commented Nov 2, 2015

I've completed the initial rollout for testing against the NREL APIs. So far things have looked pretty good, with no big issues and lots of nice benefits (lower memory use, lower CPU use, etc.). A few small things have cropped up, mostly around edge cases that the live traffic has helped pinpoint (e.g., geocoding results for analytics containing a city but no state/region). I've been fixing those issues and keeping an eye on traffic, but otherwise nothing has really impacted functionality.

We'll continue to monitor things, but I think we can reach out to agencies soon about planning the wider rollout.


GUI commented Nov 5, 2015

As a quick update, I've been seeing some unexpected memory growth on the production system running the new stack. It's not happening super quickly, but it's something I've been looking into, so I wanted to make note of it for reference.

This memory growth didn't show up in the multi-day stress tests I ran, but I have a couple of theories as to what's going on with production now:

  • What I'm hoping is happening is that the shared dict containing local caches of API key information and rate limits is simply slowly filling up as the system sees more unique API keys. However, this cache size is capped, so things should hopefully level off. I've previously run load tests involving lots of API keys, but for the multi-day tests where I was keeping an eye on memory growth, I was unfortunately using just a single API key. So after that key was cached, there was no further growth in the local API key cache. Now, on production, we're seeing a wide variety of API keys, which is possibly what's causing this cache and memory usage to grow (a sketch of this kind of capped cache follows this list).
  • If it's not that, then the other avenues I've explored haven't led anywhere conclusive. I've done some memory profiling on the production system and generated flame graphs (which helped pinpoint an earlier, legitimate memory leak), but I'm not seeing anything obvious. There are a few curious things in the graphs related to gzip handling, so that might be an area to explore if this isn't simply a matter of the API key cache filling up. If we need to dive further into this, we should come up with a better, more realistic test plan: to try to replicate the problem, we should throw more realistic requests and responses at it (multiple API keys for requests, plus larger, streaming response bodies with a mix of pre-gzipped and non-gzipped responses).
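To illustrate the first theory, here's roughly what a capped shared-dict cache of API key details looks like in OpenResty. This isn't our actual code (the dict name, TTL, and lookup helper are illustrative); the point is that a lua_shared_dict has a fixed size set in nginx.conf, and set() evicts least-recently-used entries once the zone is full, so memory should plateau rather than grow without bound:

```lua
-- Illustrative sketch of a capped API key cache (not the actual API Umbrella
-- code). Assumes nginx.conf declares something like:
--   lua_shared_dict api_keys 10m;

local cjson = require "cjson"

local api_keys = ngx.shared.api_keys
local CACHE_TTL = 60 -- seconds; illustrative

-- Hypothetical stand-in for the real database lookup.
local function lookup_key_in_database(api_key)
  return { key = api_key, roles = {}, rate_limit = "default" }
end

local function fetch_key_details(api_key)
  local cached = api_keys:get(api_key)
  if cached then
    return cjson.decode(cached)
  end

  -- Cache miss: fetch fresh details and store them with a TTL. Once the
  -- shared dict is full, set() forcibly evicts least-recently-used entries
  -- instead of growing memory, which is why usage should eventually level off.
  local details = lookup_key_in_database(api_key)
  local ok, err, forcible = api_keys:set(api_key, cjson.encode(details), CACHE_TTL)
  if forcible then
    ngx.log(ngx.WARN, "api_keys dict full; evicted older entries")
  end
  return details
end
```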

I'm hoping it's the first option, since that means we don't really have a memory leak, just a slowly-filling cache that does have an eventual cap. And there are some recent signs that maybe point towards that:

[Screenshot: memory usage graph, 2015-11-04 8:13 PM]

I reloaded things at around 7AM, so that's the big drop-off, but the rate of increase does appear to be leveling off now. It's still increasing some, which I had sort of expected to stop by now based on some calculations, but we'll see how this looks tomorrow given another good chunk of hours. Prior to the 7AM reload, I also made some tweaks to better tune the default sizes of our shared memory dicts inside nginx, so that might also be helping.


GUI commented Nov 9, 2015

Status update on the memory growth: despite appearing to level off, the memory usage continued to grow. I think I've tracked it down to the geoip2 module; the memory growth was easily reproducible when making requests from many different IP addresses. Switching to the geoip module that's built into nginx (which uses the legacy dataset) should resolve this, and after some deeper digging and testing, the switch should also improve overall memory usage. More details in this commit message: NREL/api-umbrella@19f2283
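For context, with the builtin module the geocoding fields come from nginx variables rather than per-request Lua lookups. A rough sketch of reading them from Lua (not necessarily how API Umbrella wires this up; it assumes nginx was built with --with-http_geoip_module and a geoip_city directive pointing at the legacy city database):

```lua
-- Sketch of reading geocoding fields from nginx's builtin geoip module.
-- Assumes nginx.conf contains something like:
--   geoip_city /path/to/GeoLiteCity.dat;

local geo = {
  country = ngx.var.geoip_city_country_code,
  region  = ngx.var.geoip_region,
  city    = ngx.var.geoip_city,
  lat     = tonumber(ngx.var.geoip_latitude),
  lon     = tonumber(ngx.var.geoip_longitude),
}

-- These values can then be attached to the analytics log record. The builtin
-- module's memory footprint is essentially fixed once the database file is
-- loaded, which lines up with the better memory characteristics noted above.
```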

So we'll continue to keep an eye on that, but otherwise I think the plan is to announce the wider rollout for the week of November 16 and roll things out slowly to each agency domain over that week.


GUI commented Nov 17, 2015

A couple of status updates on the technical stuff:

  • I found a bug in the new code: if an API key was used and an admin then edited that key (to add a role, change its rate limits, etc.), those edits would not be picked up for that specific key. This was due to a difference in how the new stack caches API key information and polls for changes. It has been fixed in NREL/api-umbrella@7c1514f (an illustrative sketch of this kind of cache-invalidation polling follows this list).
  • The memory growth issues are still being frustratingly bizarre and difficult to debug. As noted before, I was able to reproduce some memory growth with the geoip2 module, so I thought that was it, but even after switching to the builtin geoip module (which does seem to have better memory characteristics regardless), we've continued to see memory growth beyond what I'd expect. I've been unable to reproduce the issue locally, which makes it hard to debug, and even profiling the live servers shows nothing obvious (no Lua GC issues, no apparent memory leaks in the C code, etc.). I'm still trying to get to the bottom of this and reproduce it in a controlled way, but the growth isn't huge, so if worst comes to worst, we can implement a temporary stopgap by reloading nginx once a day (which is actually far less often than we're currently reloading it). I still have a couple of ideas for how to reproduce this locally; my next one is related to the dyups module and simulating upstream DNS changes while keepalive connections are also being kept open.
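As a side note on the first item, the general shape of the fix is a cache that gets actively invalidated when keys change. Here's a purely illustrative sketch (not the actual change in the linked commit; the dict name and helper are assumptions) of a background poll that evicts updated keys from the local shared-dict cache:

```lua
-- Illustrative only: a background timer (e.g. started from
-- init_worker_by_lua_block) that polls for API keys changed since the last
-- poll and drops them from the local cache so admin edits take effect.

local api_keys = ngx.shared.api_keys -- lua_shared_dict api_keys 10m;
local last_poll = 0

-- Hypothetical helper: return the API keys updated since `since` (epoch secs).
local function fetch_updated_keys(since)
  return {} -- stub; a real implementation would query the key store
end

local function poll(premature)
  if premature then return end
  local now = ngx.time()
  for _, key in ipairs(fetch_updated_keys(last_poll)) do
    -- Drop the stale cached copy; the next request re-fetches fresh details.
    api_keys:delete(key)
  end
  last_poll = now
end

local ok, err = ngx.timer.every(1, poll) -- poll interval is illustrative
if not ok then
  ngx.log(ngx.ERR, "failed to start API key poll timer: ", err)
end
```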

And in terms of the general rollout, we announced our plans to roll out the changes to agencies this week. We're rolling things out to agencies one at a time on the following schedule:

  • Monday, 2015-11-16:
    • Business USA
    • Commerce
    • BBG
  • Tuesday, 2015-11-17:
    • NASA
    • USDA
  • Wednesday, 2015-11-18:
    • FEC
  • Thursday, 2015-11-19:
    • api.data.gov
    • FDA


GUI commented Nov 19, 2015

Quick update for today: Things seem to be progressing well (knock on wood). The only notable issue discovered during the rollout this week was a pretty minor one: a bug caused requests not to be logged in the analytics database if the request came from an IP address that geocoded to a city name containing an accent or other special character (Tórshavn, Faroe Islands is one example). This didn't affect a huge number of requests, and it has now been fixed.
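This comment doesn't spell out the root cause or the fix, but one common way this class of bug shows up is that the legacy GeoIP city database returns names encoded as ISO-8859-1, which a strict UTF-8 consumer (a JSON encoder, Elasticsearch) will reject. A minimal, purely illustrative Latin-1 to UTF-8 conversion (not necessarily the actual fix) looks like this:

```lua
-- Illustrative only: convert an ISO-8859-1 (Latin-1) string to UTF-8 before
-- it goes into the analytics record, so accented city names don't produce
-- invalid UTF-8 downstream.
local function latin1_to_utf8(s)
  return (s:gsub("[\128-\255]", function(c)
    local b = c:byte()
    return string.char(0xC0 + math.floor(b / 64), 0x80 + (b % 64))
  end))
end

-- Example: latin1_to_utf8("T\243rshavn") --> "Tórshavn" as valid UTF-8.
```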


GUI commented Nov 21, 2015

The transition is fully complete! 🌟 🌟
