The grand Lua pull request #183

Merged · 2,082 commits into master · Oct 30, 2015
Conversation


@GUI (Member) commented Oct 27, 2015

For background, at the beginning of the year, we began experimenting with rewriting some of API Umbrella's core in Lua & OpenResty. That toying around has proved worthwhile, so we're moving forward with this plan. See this issue for some of the original thinking and more details: #86

The short version is that these updates should be backwards compatible (there are a few small caveats noted below), the platform is faster, and this cleans up a few pain points.

There are still a few loose ends, but it's very close, so I wanted to start a pull request where we can start to document the changes.

Code Organization

This consolidates most of our separate repositories into this single git repo. The gatekeeper, router, and web projects have been merged into the api-umbrella repo, while api-umbrella-config and omnibus-api-umbrella have been rendered defunct.

There are a couple of reasons behind this shift:

  1. The gatekeeper and router are merged since all of that logic is combined inside OpenResty now.
  2. We've increasingly seen how interconnected some of these core components are. While things still remain modularized (and you can run different pieces of API Umbrella on different servers), we are basically pulling all the core functionality together to make it easier to work on the project.

To be clear, we're not aiming towards a monolithic codebase with these changes, but we want to make the codebase easier to work on, which hopefully these code organization changes will do. As a quick example of some of the pain points with the previous setup, see this recent feature. It required coordinating 3 pull requests, with some duplicate config across projects, and the router changes having failing tests until the gatekeeper changes were merged and released.

Architecture

As outlined in #86, the proposed architecture was essentially:

[ incoming requests ] => [ API Umbrella (nginx) ] => [ API Backends ]

That didn't quite pan out once we got to adding the caching layer. The plan had been to use nginx's built-in caching layer, but the major pitfall is that nginx's caching does not work when response streaming is enabled. Since response streaming is important to us, we can't use this approach for now (but if that issue gets resolved in nginx, we can revisit).

What we ended up landing on for the architecture looks like:

[ incoming requests ] => [ nginx ] => [ TrafficServer (caching) ] => [ nginx ] => [ API Backends ]

So while we still have the nginx sandwich we were trying to simplify, things are still dramatically better: we remove the multiple nodejs processes, and all the relevant proxy configuration now lives inside the same nginx instance (rather than having to sync config changes between the independent nodejs processes and the nginx backend router).

The switch to TrafficServer for caching was because it performs a fair bit better than Varnish in this sandwich situation and still lets us achieve sub-millisecond overhead (to be clear, both Varnish and TrafficServer perform very well as caches, but TrafficServer seems to have a lower proxy overhead on a per-request basis). We've also been eyeing TrafficServer for a while for other reasons. The primary one (streaming) was obviated by the Varnish 4 upgrade, but there are still some nice things TrafficServer addresses: more flexible configuration for keep-alive settings, a hybrid disk and memory cache (so we can cache larger things on disk), support for HTTP/2, and TrafficServer is pursuing a Lua approach for plugins, which nicely aligns with OpenResty.

I experimented with Ledge, a Lua-based HTTP cache that can be embedded, but the tests didn't pan out too well (there were some proxy behavior issues that didn't match our expectations for a transparent proxy, and the performance wasn't as good). Those initial experiments were much earlier this year, so it's possible things have changed since then (Ledge is under pretty active development), but for now we'll stick with a more established proxy solution.

Backwards Compatibility

By and large, this upgrade should be backwards-compatible with previous API Umbrella releases, despite the significant underlying changes. In nearly all cases, the proxying functionality should remain unchanged, and the internal admin APIs we provide are also unchanged. However, there are a couple of small areas where we do have backwards-incompatible changes. They mostly affect some more esoteric features that I don't think were actively being used by anyone, but it's worth noting:

  • Logic inside dynamic headers: On an API backend, if you set custom headers to be passed to the API backend, you could use Handlebars.js variables inside to reference other variables. The functionality remains unchanged for simple variables. However, if you used Handlebars.js logic, like an {{#if}} block, the syntax has changed since we are now just using Mustache templates for parsing. Something like {{#if headers.x-missing}}{{headers.x-missing}}{{else}}default{{/if}} in Handlebars would now become {{#headers.x-missing}}{{headers.x-missing}}{{/headers.x-missing}}{{^headers.x-missing}}default{{/headers.x-missing}}; see the sketch after this list.
  • Global rate limits disabled by default: Previously, we had some default rate limits that applied within nginx on a per-server basis to prevent high-levels of abuse. However, these were always confusing, since they operated at a server level and were completely separate from API Umbrella's managed rate limits. These global limits have been disabled by default, but you can still turn them on via config settings if you'd like.
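
To illustrate the new syntax, here's a minimal sketch using the lustache library as a stand-in Mustache implementation (the library choice and the rendering code are assumptions for illustration, not API Umbrella's internal template handling):

```lua
-- Sketch only: render the Mustache-style header template against a table of
-- request headers. lustache is used here purely as an example Mustache library.
local lustache = require "lustache"

-- Mustache equivalent of the old Handlebars {{#if}}/{{else}} logic:
local template = "{{#headers.x-missing}}{{headers.x-missing}}{{/headers.x-missing}}" ..
  "{{^headers.x-missing}}default{{/headers.x-missing}}"

-- Header absent: the inverted section supplies the fallback value.
print(lustache:render(template, { headers = {} })) --> "default"

-- Header present: the normal section renders the header's value.
print(lustache:render(template, { headers = { ["x-missing"] = "hello" } })) --> "hello"
```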

Analytics Logging

Gathering the log data for analytics was a particular pain point in the old stack. It had grown rather unwieldy over time, with multiple queues trying to aggregate data from the multiple points in our proxy stack (e.g., combining analytics from the nodejs processes and nginx). The overall approach is much, much simpler now that everything is handled inside OpenResty. Another big improvement is that we have a much simpler approach to queuing this data using heka, which is much faster and solves potentially unbounded memory growth with disk buffering: 18F/api.data.gov#233
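
As a rough illustration of the general shape of this (a simplified sketch; the field names, shared dict, and queue key are assumptions, not the actual logging code), per-request analytics can be gathered in OpenResty's log phase and buffered locally for a separate shipper such as heka to drain:

```lua
-- log_by_lua sketch: collect a few request fields at log time and push them
-- onto a shared-memory list for an external process to pick up.
-- The "analytics_buffer" dict and the field selection are illustrative only.
local cjson = require "cjson"

local record = {
  host = ngx.var.host,
  uri = ngx.var.uri,
  status = tonumber(ngx.var.status),
  request_time = tonumber(ngx.var.request_time),
  remote_addr = ngx.var.remote_addr,
  timestamp = ngx.now(),
}

local length, err = ngx.shared.analytics_buffer:rpush("queue", cjson.encode(record))
if not length then
  ngx.log(ngx.ERR, "failed to buffer analytics record: ", err)
end
```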

Performance

Basically, things are much faster and more efficient. Our general goal based on prototypes had been to get the local time spent in the API Umbrella proxy down below 1 millisecond. I ended up stripping some of the initial over-eager optimizations to simplify the initial code, so I think we might have crept up more into the 1-2ms range (depending on hardware). However, things are still definitely faster and more efficient than the old stack in any benchmark I've run. I also think there's quite a bit of room for further optimization if we take a more focused look at that again (which should be easier to do now that we have all the functionality out of the way).

Resiliency

The stack should be much better at handling unexpected failures, particularly the database going down. We now cache more things, like valid API keys, locally, so even in the event of a complete database outage, API Umbrella's proxying should be able to continue serving recent, existing users for a while (new or uncached users will fail, but this at least reduces the general impact of a small database outage).
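
As a rough sketch of the idea (the dict name, TTL, and database helper below are illustrative assumptions, not the actual implementation), a local shared-memory cache lets key verification keep working through short database outages:

```lua
-- Sketch: check a local shared dict before the database, so recently seen API
-- keys keep resolving even if the database is briefly unreachable.
local cjson = require "cjson"

-- Hypothetical stand-in for the real database lookup.
local function fetch_key_from_database(api_key)
  return nil, "database unavailable"
end

local function lookup_api_key(api_key)
  local cache = ngx.shared.api_key_cache

  local cached = cache:get(api_key)
  if cached then
    return cjson.decode(cached)
  end

  local record, err = fetch_key_from_database(api_key)
  if record then
    cache:set(api_key, cjson.encode(record), 300) -- keep a local copy for 5 minutes
    return record
  end

  return nil, err
end
```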

We've also added a health check API endpoint that can be used to better determine the overall health of the API Umbrella platform. This could be used in load balancers to proactively remove servers that might be responding but aren't actually able to service requests.
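
For example (a minimal sketch; the endpoint path and the "status" field shown here are assumptions for illustration), a monitoring script or load balancer check could poll the health endpoint and only keep servers reporting a healthy status:

```lua
-- Sketch of polling a health endpoint with lua-resty-http. The path and the
-- response format are assumptions, not a documented contract.
local http = require "resty.http"
local cjson = require "cjson"

local function server_is_healthy(host)
  local httpc = http.new()
  httpc:set_timeout(2000) -- don't let a hung server stall the check

  local res, err = httpc:request_uri("http://" .. host .. "/api-umbrella/v1/health")
  if not res or res.status ~= 200 then
    return false, err
  end

  local ok, body = pcall(cjson.decode, res.body)
  if not ok or type(body) ~= "table" then
    return false, "unparseable health response"
  end

  return body.status == "green"
end
```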

Process Management

One area that continued to cause pain in the old stack was how we were running the various processes that make up API Umbrella. API Umbrella is composed of multiple processes, but we want to offer an easy way to start and stop all of these services together. What we had before had grown a little unwieldy and could occasionally get in a funky state where things would not stop or start properly. All of that has been much simplified and improved.

We were previously using a combination of forever.js and supervisord, which was a bit of an odd mix, since they both do similar things (essentially, supervisord was managing the processes, and forever.js was responsible for starting supervisord and responding to custom signals). We've switched to just using perp for all of this. It's much lighter weight and more reliable, since we replace the "api-umbrella" process with the perpboot process and then trap the specific signals we need to respond to. This helps clean up our overly complex process tree and removes the need to run our own persistent process just to respond to signals. It should lead to faster starts/stops, and it also removes some overly fancy logic we had that could lead to API Umbrella shutting down if a single component failed to start (which isn't ideal from a resiliency perspective).
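
As a loose sketch of the signal-trapping idea (assuming luaposix for signal handling; the reload stub is hypothetical and this is not the actual CLI code):

```lua
-- Sketch: trap the signals we care about and let perp supervise the services.
local signal = require "posix.signal"
local unistd = require "posix.unistd"

-- Hypothetical stand-in for whatever performs the configuration reload.
local function reload_services()
  print("reloading API Umbrella services...")
end

signal.signal(signal.SIGHUP, function()
  reload_services()
end)

signal.signal(signal.SIGTERM, function()
  print("shutting down")
  os.exit(0)
end)

-- Stay alive; the supervised services are managed by perp, not this loop.
while true do
  unistd.sleep(60)
end
```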

Configuration Reloading

Despite improvements made, we still had some issues where sending a "reload" signal to API Umbrella wouldn't always reload specific config settings. Thanks to the process management simplifications made, this has been greatly simplified to better guarantee that a reload will always get the completely updated config.

DNS Changes & Keep-Alive Connections

To handle DNS changes to API backend servers, we currently end up reloading nginx. This works, but it leads to hyper-frequent nginx reloads if you have a lot of API backends with changing IPs. These reloads should be zero-downtime, but I think they may lead to corner-case issues with keep-alive connections. I haven't proved this, but my current theory is that this might be responsible for some ELB proxying issues we've spotted.

In the new stack, all of the DNS changes are handled inside nginx without any reloads necessary. Overall, it's much simpler (and I'm hoping might explain/fix possible keep-alive issues). In addition, there's a path towards making the code even simpler, since a future version of OpenResty will natively support managing upstream server connections (right now we rely on an nginx plugin that works, but has a few rough edges).
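
As a rough illustration of the general technique (a sketch only; the shared dict, nameserver, interval, and hostname are illustrative assumptions, not API Umbrella's actual resolver code), backend hostnames can be re-resolved on a timer inside OpenResty and the results stored in shared memory, with no nginx reload involved:

```lua
-- Sketch: periodically resolve an API backend hostname with lua-resty-dns and
-- publish the addresses in a shared dict for the balancing code to read.
local resolver = require "resty.dns.resolver"
local cjson = require "cjson"

local function refresh_host(premature, hostname)
  if premature then
    return
  end

  local r, err = resolver:new({ nameservers = { "127.0.0.1" }, timeout = 2000 })
  if not r then
    ngx.log(ngx.ERR, "failed to create resolver: ", err)
    return
  end

  local answers = r:query(hostname, { qtype = resolver.TYPE_A })
  if answers and not answers.errcode then
    local ips = {}
    for _, answer in ipairs(answers) do
      if answer.address then
        table.insert(ips, answer.address)
      end
    end
    ngx.shared.resolved_hosts:set(hostname, cjson.encode(ips))
  end

  ngx.timer.at(5, refresh_host, hostname) -- schedule the next refresh
end

ngx.timer.at(0, refresh_host, "api-backend.example.com")
```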

Packaging Improvements

The binary packages for API Umbrella are much smaller in size since we've been able to simplify our dependencies. We've also made an effort to rely more on system dependencies wherever possible, rather than packaging all of our dependencies into our own binary. In addition to a smaller binary, this also means that when there are things like OpenSSL security updates, you only have to worry about updating the copy on your system, rather than also needing to upgrade API Umbrella.

We were previously using omnibus for building our packages (which is what led to a lot of the over-eager dependency inclusion), but now we're using a simpler standalone make process combined with fpm. Docker is being used to create packages for different distros. These changes should speed up the packaging process so it will be easier for us to release packages more frequently.

We'll also be looking to publish our binaries in real Yum and Apt repos so that dependencies can be automatically fetched (this is more of an issue for apt). This should also simplify upgrading API Umbrella.

Misc

There's probably more I'm forgetting, so I'll update this if I remember anything else. There's also more we need to outline on the roadmap related to this (shifting our APIs to Lua, unifying our testing frameworks), but this is the bulk of the big revamp we want to get out of the way.

If you've made it through this long-winded list, you'll probably see that the changes really extend beyond the switch to Lua. Taking a fresh look at things more broadly has allowed us to make improvements in other areas we've identified over the years. I hate to do such a massive code dump all at once in this pull request, but that's unfortunately how things turned out. However, with the major portions of the revamp out of the way, it should be much easier to improve and iterate moving forward. I don't take these kinds of big changes (including a language switch) lightly, so hopefully we won't be doing anything like this again in the foreseeable future. But the benefits of the revamp have proved worthwhile, and we're excited about where this will let us go. To the future!

GUI and others added 30 commits June 5, 2015 09:28
At this point, all the test/server/* tests should pass most of the time.
There still seem to be some sporadic testing issues that can cause some
failures we still need to track down, but things are basically
implemented.
Instead of `*example.com`, we'll now use `.example.com` which will match
both `example.com` and `*.example.com`. This ensures we get along with
nginx server name parsing.
This also uncovered a problem with wildcards and our hostname
normalization, which largely defeated the wildcard handling. Instead,
let's only apply the host normalization to the default host, since that
was the only place we really wanted it.
This reduces the need to publish multiple API Backends when only
reordering a single API backend for matching purposes. This becomes more
important when there are admin groups and scopes in use, since it
prevents one admin group from affecting another group's APIs when
performing a reorder.

Previously, we ordered each API backend with a numeric counter like 1,
2, 3, 4, 5, etc. So if you wanted to move the last backend to the front,
you actually ended up having to change the order for all the backends.

The new approach is similar in that it uses a counter, but we provide a
significant gap between each item (10,000). So instead, API backends
added would by default have a sort order of 0, 10000, 20000, 30000,
40000, etc. If you wanted to shift the last backend to the front, it
would get a new sort order of -10000, while leaving the rest of the
backends untouched. More important than simply going negative, if you
wanted to move the last backend to the second position, it would leave
the rest of the items untouched by choosing a sort order in between the
existing values (so 5,000 in this example). Subsequent moves in between
the first two items would continue dividing the available integer range
in half (2,500, 1,250, etc.)

With this approach of leaving gaps and then filling them in to fulfill
moves, there is a theoretical point where the surrounding items will
need to be shifted. Similarly, the sort order uses 32bit signed
integers, so there's also a theoretical maximum upper bound and lower
bound if enough entries get added with the gaps in place. Both
situations should be handled in this implementation by shifting
surrounding records to make room as needed. In those cases, other
records may still be changed when sorting, so this isn't a 100% perfect
solution, but I think both situations should be exceedingly rare given
the amount of data typically present and the number of reorders that
typically happen.
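
A rough sketch of the gap-based ordering described above (illustrative only; the function names and gap handling are assumptions, not the actual implementation):

```lua
-- Sketch of gap-based sort orders: place a moved backend halfway between its
-- new neighbors so the surrounding records usually don't need to change.
local GAP = 10000

-- Sort order for a backend appended to the end of the list (0, 10000, 20000, ...).
local function next_append_order(current_max_order)
  return (current_max_order or -GAP) + GAP
end

-- Sort order for a backend moved between two existing neighbors.
-- prev_order is nil when moving to the front; next_order is nil at the end.
local function order_for_move(prev_order, next_order)
  if prev_order and next_order then
    local midpoint = math.floor((prev_order + next_order) / 2)
    if midpoint == prev_order or midpoint == next_order then
      -- Gap exhausted: the caller would need to shift neighbors to make room.
      return nil, "gap exhausted"
    end
    return midpoint
  elseif next_order then
    return next_order - GAP -- moved to the front
  elseif prev_order then
    return prev_order + GAP -- moved to the end
  else
    return 0 -- first item in an empty list
  end
end

print(next_append_order(nil))   --> 0
print(order_for_move(nil, 0))   --> -10000 (last item moved to the front)
print(order_for_move(0, 10000)) --> 5000 (moved between the first two items)
print(order_for_move(0, 5000))  --> 2500
```
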
Implement different approach to API backend reordering to reduce reorder churn
- Roles from sub-URL settings now are appended by default to any
  original roles, rather than overwriting them. This seems more
  intuitive and less dangerous, since you're less likely to accidentally
  lift all role restrictions by simply adding a sub-url setting.
- There's a new option to explicitly have sub-URL roles override the
  parent roles, but by default, they are appended.
- When multiple roles are defined on the API, the API key must now have
  all of the roles in order to access the API, rather than any single
  role (see the sketch after this list). I don't think the previous
  approach of only requiring a single role was very intuitive, and I
  think this approach makes more sense, particularly with sub-URL roles
  appending to the base roles by default.
- Fix crash possible if API backend was setup with role requirements but
  api keys not required. While this combination of settings doesn't make
  sense, it may have been encountered with an API backend that required
  a role, but then a sub-URL setting trying to disable key requirements
  (for CORS preflight requests as an example).
- Remove special handling of "admin" role that previously granted access
  to all APIs. This wasn't really documented anywhere, and I don't think
  we have a strong use-case for this kind of global admin role, so it
  seems safest to remove this so a user isn't accidentally granted
  global access.
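
A rough sketch of the "must have all roles" check described in this list (illustrative only; the function name and call style are assumptions, not the actual code):

```lua
-- Sketch: an API key now needs every role required by the matched API backend,
-- rather than any single one of them.
local function key_has_all_required_roles(required_roles, key_roles)
  local key_role_set = {}
  for _, role in ipairs(key_roles or {}) do
    key_role_set[role] = true
  end

  for _, role in ipairs(required_roles or {}) do
    if not key_role_set[role] then
      return false
    end
  end

  return true
end

print(key_has_all_required_roles({ "private", "write" }, { "private" }))          --> false
print(key_has_all_required_roles({ "private", "write" }, { "private", "write" })) --> true
```
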
The override roles checkbox is part of some changes to how roles get
applied and inherited made in:
NREL/api-umbrella-gatekeeper@94670b0

This turned into a larger change than just adding the single override
checkbox for roles, since we only want this checkbox to show up in the
context of the sub-url settings, and not the other places those
settings fields are shared (API global settings). So that we can more
easily control pieces of the settings fields, it's been turned into an
Ember component instead of a render call, which gives us the ability to
pass extra variables around. This in turn led to a few other things:

- Turn the selectize input from a sort of strange view wrapper into an
  easyform input type. This should be more straightforward, and since
  Views will be removed in Ember 2, it seems like a good move to go
  ahead and make.
- Tweaks to the selectize and ace editor inputs to improve usability and
  testability. Now the associated labels for these input types are
  correct, so clicking on the labels lands you in the correct input
  field, rather than still being tied to the underlying, hidden field.
  This and some related changes help with testability of these input
  types.
- More shifting of text in the api and settings forms to i18n while
  we're at it.
- A much more complete full integration test for the whole API backend
  and API user sections. Since we needed to shift the shared settings
  from a view/controller to a component, we needed a much more thorough
  integration suite to ensure this didn't break any piece unexpectedly.
Add option to override roles in sub-settings (and some refactoring)
Fixes and changes to how roles get applied
Related to it being possible to enter invalid error data (but valid
YAML) via the web UI: #153

In these cases, we will now return the default internal server error,
which at least prevents bad error responses without any HTTP status
code at all, which could really throw some clients into a tizzy.
This ensures that the error data isn't an unexpected type at the root
like a string: #153
Ensure that the error data yaml entered is the expected type (hash)
Better error handling for if error data is unexpectedly not an object.
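
A minimal sketch of the kind of check described here (assuming a YAML parser such as lyaml; the fallback data and function are illustrative, not the actual code):

```lua
-- Sketch: only accept error data that parses to a table (hash); anything else
-- falls back to a default internal server error definition.
local lyaml = require "lyaml"

local DEFAULT_ERROR_DATA = {
  internal_server_error = {
    status_code = 500,
    message = "Internal Server Error",
  },
}

local function parse_error_data(yaml_string)
  local ok, data = pcall(lyaml.load, yaml_string)
  if not ok or type(data) ~= "table" then
    -- Valid YAML can still be a bare string or number at the root; reject it.
    return DEFAULT_ERROR_DATA
  end
  return data
end

print(parse_error_data("just a string").internal_server_error.status_code) --> 500
```
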
I'm favoring the ideas of simplifying our stack down to Lua more and
more, so let's go ahead and switch the CLI app out for Lua.
We have a pull-request in, so hopefully this will be integrated into the
releases at some point. In the meantime, we need it for the CLI app to
dump YAML properly.
We're also giving our app an explicit namespace in lua for better
separation.
Also try to improve the build process a bit, so that if things are
already installed, we don't have to perform any actions.
GUI added 20 commits October 25, 2015 22:35
Shift the /etc/resolv.conf and /etc/hosts parsing to the CLI, since this
only needs to be done once per reload, and then use the values stored in
our runtime config.
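
A small sketch of the sort of parsing involved (illustrative only; not the actual CLI code):

```lua
-- Sketch: pull the nameserver entries out of /etc/resolv.conf once at
-- startup/reload time so the runtime config can carry them.
local function parse_nameservers(path)
  local nameservers = {}
  local file = io.open(path, "r")
  if not file then
    return nameservers
  end

  for line in file:lines() do
    -- Skip comments; capture "nameserver <address>" lines.
    if not line:match("^%s*[#;]") then
      local address = line:match("^%s*nameserver%s+(%S+)")
      if address then
        table.insert(nameservers, address)
      end
    end
  end

  file:close()
  return nameservers
end

for _, address in ipairs(parse_nameservers("/etc/resolv.conf")) do
  print(address)
end
```
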
This has been replaced by the web backends concept.
Implement the "services" logic so that only certain processes startup
depending on the broader service groups defined for api umbrella.
- Track down a couple of processes that weren't in our default port range
  for api umbrella services (~14000).
- Bind all the internal-only services to 127.0.0.1 instead of the public
  network interface.
- Make sure all of our test processes run on different ports than the
  default ports (so tests can be run at the same time).
- Remove a few bits of defunct config related to ports.
The CI environment is having some odd issues with the rake version under
the static site suddenly. Bump some web dependencies while we're at it.
Running into rack bundler issues in the newer version. I think perhaps
the various rack changes made in the newer versions of Puma aren't
compatible with this older Rails 3.2 version of rack.
@GUI mentioned this pull request Oct 27, 2015
Right now, we'll use the standard CentOS 6 output, but we'll tailor this
later for other distros.
GUI added a commit that referenced this pull request Oct 30, 2015
The grand Lua pull request
@GUI merged commit 7dbcfb1 into master Oct 30, 2015
@perfaram
Contributor

Whoa.
