The grand Lua pull request #183

Merged · 2,082 commits into master · Oct 30, 2015
Conversation


@GUI (Member) commented Oct 27, 2015

For background, at the beginning of the year, we began experimenting with rewriting some of API Umbrella's core in Lua & OpenResty. That toying around has proved worthwhile, so we're moving forward with this plan. See this issue for some of the original thinking and more details: #86

The short version is that these updates should be backwards compatible (there are a few small caveats noted below), the platform is faster, and this cleans up a few pain points.

There are still a few loose ends, but it's very close, so I wanted to start a pull request where we can start to document the changes.

Code Organization

This consolidates most of our separate repositories into this single git repo. The gatekeeper, router, and web projects have been merged into the api-umbrella repo, while api-umbrella-config and omnibus-api-umbrella have been rendered defunct.

There are a couple of reasons behind this shift:

  1. The gatekeeper and router are merged since all of that logic is combined inside OpenResty now.
  2. We've increasingly seen how interconnected some of these core components are. While things still remain modularized (and you can run different pieces of API Umbrella on different servers), we are basically pulling all the core functionality together to make it easier to work on the project.

To be clear, we're not aiming towards a monolithic codebase with these changes, but we want to make the codebase easier to work on, which hopefully these code organization changes will do. As a quick example of some of the pain points with the previous setup, see this recent feature. It required coordinating 3 pull requests, with some duplicate config across projects, and the router changes having failing tests until the gatekeeper changes were merged and released.

Architecture

As outlined in #86, the proposed architecture was essentially:

[ incoming requests ] => [ API Umbrella (nginx) ] => [ API Backends ]

That didn't quite pan out once we got to adding the caching layer. The plan had been to use nginx's built-in caching layer, but the major pitfall is that nginx's caching does not work when response streaming is enabled. Since response streaming is important to us, we can't use this approach for now (but if that issue gets resolved in nginx, we can revisit).

What we ended up landing on for the architecture looks like:

[ incoming requests ] => [ nginx ] => [ TrafficServer (caching) ] => [ nginx ] => [ API Backends ]

So while we still have the nginx sandwich we were trying to simplify, things are still dramatically better: we remove the multiple nodejs processes, and all the relevant proxy configuration now lives inside the same nginx instance (rather than having to sync config changes between the independent nodejs processes and the nginx backend router).

The switch to TrafficServer for caching was because it performs a fair bit better than Varnish in this sandwich situation and still lets us achieve sub-millisecond overhead (to be clear, both Varnish and TrafficServer perform very well as caches, but TrafficServer seems to have a lower proxy overhead on a per-request basis). We've also been eyeing TrafficServer for a while for other reasons. The primary one (streaming) was obviated by the Varnish 4 upgrade, but there are still some nice things TrafficServer addresses: more flexible configuration for keep-alive settings, a hybrid disk and memory cache (so we can cache larger things on disk), support for HTTP/2, and TrafficServer is pursuing a Lua approach for plugins, which nicely aligns with OpenResty.

I experimented with Ledge, a Lua-based HTTP cache that can be embedded, but the tests didn't pan out too well (there were some proxy behavior issues that didn't match our expectations for a transparent proxy, and the performance wasn't as good). Those initial experiments were much earlier this year, so it's possible things have changed since then (Ledge is under pretty active development), but for now we'll stick with a more established proxy solution.

Backwards Compatibility

By and large, this upgrade should be backwards-compatible with previous API Umbrella releases, despite the significant underlying changes. In nearly all cases, the proxying functionality should remain unchanged, and the internal admin APIs we provide are also unchanged. However, there are a couple of small areas where we do have backwards-incompatible changes. They mostly affect some more esoteric features that I don't think were actively being used by anyone, but it's worth noting:

  • Logic inside dynamic headers: On an API backend, if you set custom headers to be passed to the API backend, you could use Handlebars.js variables inside to reference other variables. The functionality remains unchanged for simple variables. However, if you used Handlebars.js logic, like an {{#if}} block, the syntax has changed since we are now just using Mustache templates for parsing. Something like {{#if headers.x-missing}}{{headers.x-missing}}{{else}}default{{/if}} in Handlebars would now become {{#headers.x-missing}}{{headers.x-missing}}{{/headers.x-missing}}{{^headers.x-missing}}default{{/headers.x-missing}}; see the sketch after this list.
  • Global rate limits disabled by default: Previously, we had some default rate limits that applied within nginx on a per-server basis to prevent high-levels of abuse. However, these were always confusing, since they operated at a server level and were completely separate from API Umbrella's managed rate limits. These global limits have been disabled by default, but you can still turn them on via config settings if you'd like.
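
To illustrate the new syntax, here's a minimal sketch using the lustache library as a stand-in Mustache implementation (the library choice and the rendering code are assumptions for illustration, not API Umbrella's internal template handling):

```lua
-- Sketch only: render the Mustache-style header template against a table of
-- request headers. lustache is used here purely as an example Mustache library.
local lustache = require "lustache"

-- Mustache equivalent of the old Handlebars {{#if}}/{{else}} logic:
local template = "{{#headers.x-missing}}{{headers.x-missing}}{{/headers.x-missing}}" ..
  "{{^headers.x-missing}}default{{/headers.x-missing}}"

-- Header absent: the inverted section supplies the fallback value.
print(lustache:render(template, { headers = {} })) --> "default"

-- Header present: the normal section renders the header's value.
print(lustache:render(template, { headers = { ["x-missing"] = "hello" } })) --> "hello"
```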

Analytics Logging

Gathering the log data for analytics was a particular pain point in the old stack. It had grown rather unwieldy over time, with multiple queues trying to aggregate data from the multiple points in our proxy stack (e.g., combining analytics from the nodejs processes and nginx). The overall approach is much, much simpler now that everything is handled inside OpenResty. Another big improvement is that we have a much simpler approach to queuing this data using heka, which is much faster and solves potentially unbounded memory growth with disk buffering: 18F/api.data.gov#233
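
As a rough illustration of the general shape of this (a simplified sketch; the field names, shared dict, and queue key are assumptions, not the actual logging code), per-request analytics can be gathered in OpenResty's log phase and buffered locally for a separate shipper such as heka to drain:

```lua
-- log_by_lua sketch: collect a few request fields at log time and push them
-- onto a shared-memory list for an external process to pick up.
-- The "analytics_buffer" dict and the field selection are illustrative only.
local cjson = require "cjson"

local record = {
  host = ngx.var.host,
  uri = ngx.var.uri,
  status = tonumber(ngx.var.status),
  request_time = tonumber(ngx.var.request_time),
  remote_addr = ngx.var.remote_addr,
  timestamp = ngx.now(),
}

local length, err = ngx.shared.analytics_buffer:rpush("queue", cjson.encode(record))
if not length then
  ngx.log(ngx.ERR, "failed to buffer analytics record: ", err)
end
```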

Performance

Basically, things are much faster and more efficient. Our general goal based on prototypes had been to get the local time spent in the API Umbrella proxy down below 1 millisecond. I ended up stripping some of the initial over-eager optimizations to simplify the initial code, so I think we might have crept up more into the 1-2ms range (depending on hardware). However, things are still definitely faster and more efficient than the old stack in any benchmark I've run. I also think there's quite a bit of room for further optimization if we take a more focused look at that again (which should be easier to do now that we have all the functionality out of the way).

Resiliency

The stack should be much better at handling unexpected failures, particularly the database going down. We now cache more things, like valid API keys, locally, so even in the event of a complete database outage, API Umbrella's proxying should be able to continue serving recent, existing users for a while (new or uncached users will fail, but this at least reduces the general impact of a small database outage).
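
As a rough sketch of the idea (the dict name, TTL, and database helper below are illustrative assumptions, not the actual implementation), a local shared-memory cache lets key verification keep working through short database outages:

```lua
-- Sketch: check a local shared dict before the database, so recently seen API
-- keys keep resolving even if the database is briefly unreachable.
local cjson = require "cjson"

-- Hypothetical stand-in for the real database lookup.
local function fetch_key_from_database(api_key)
  return nil, "database unavailable"
end

local function lookup_api_key(api_key)
  local cache = ngx.shared.api_key_cache

  local cached = cache:get(api_key)
  if cached then
    return cjson.decode(cached)
  end

  local record, err = fetch_key_from_database(api_key)
  if record then
    cache:set(api_key, cjson.encode(record), 300) -- keep a local copy for 5 minutes
    return record
  end

  return nil, err
end
```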

We've also added a health check API endpoint that can be used to better determine the overall health of the API Umbrella platform. This could be used in load balancers to proactively remove servers that might be responding but aren't actually able to service requests.
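
For example (a minimal sketch; the endpoint path and the "status" field shown here are assumptions for illustration), a monitoring script or load balancer check could poll the health endpoint and only keep servers reporting a healthy status:

```lua
-- Sketch of polling a health endpoint with lua-resty-http. The path and the
-- response format are assumptions, not a documented contract.
local http = require "resty.http"
local cjson = require "cjson"

local function server_is_healthy(host)
  local httpc = http.new()
  httpc:set_timeout(2000) -- don't let a hung server stall the check

  local res, err = httpc:request_uri("http://" .. host .. "/api-umbrella/v1/health")
  if not res or res.status ~= 200 then
    return false, err
  end

  local ok, body = pcall(cjson.decode, res.body)
  if not ok or type(body) ~= "table" then
    return false, "unparseable health response"
  end

  return body.status == "green"
end
```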

Process Management

One area that continued to cause pain in the old stack was how we were running the various processes that make up API Umbrella. API Umbrella is composed of multiple processes, but we want to offer an easy way to start and stop all of these services together. What we had before had grown a little unwieldy and could occasionally get in a funky state where things would not stop or start properly. All of that has been much simplified and improved.

We were previously using a combination of forever.js and supervisord, which was a bit of an odd mix, since they both do similar things (essentially, supervisord was managing the processes, and forever.js was responsible for starting supervisord and responding to custom signals). We've switched to just using perp for all of this. It's much lighter weight and more reliable, since we replace the "api-umbrella" process with the perpboot process and then trap the specific signals we need to respond to. This helps clean up our overly complex process tree and removes the need to run our own persistent process just to respond to signals. It should lead to faster starts/stops, and it also removes some overly fancy logic we had that could lead to API Umbrella shutting down if a single component failed to start (which isn't ideal from a resiliency perspective).
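
As a loose sketch of the signal-trapping idea (assuming luaposix for signal handling; the reload stub is hypothetical and this is not the actual CLI code):

```lua
-- Sketch: trap the signals we care about and let perp supervise the services.
local signal = require "posix.signal"
local unistd = require "posix.unistd"

-- Hypothetical stand-in for whatever performs the configuration reload.
local function reload_services()
  print("reloading API Umbrella services...")
end

signal.signal(signal.SIGHUP, function()
  reload_services()
end)

signal.signal(signal.SIGTERM, function()
  print("shutting down")
  os.exit(0)
end)

-- Stay alive; the supervised services are managed by perp, not this loop.
while true do
  unistd.sleep(60)
end
```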

Configuration Reloading

Despite improvements made, we still had some issues where sending a "reload" signal to API Umbrella wouldn't always reload specific config settings. Thanks to the process management simplifications made, this has been greatly simplified to better guarantee that a reload will always get the completely updated config.

DNS Changes & Keep-Alive Connections

To handle DNS changes to API backend servers, we currently end up reloading nginx. This works, but it leads to hyper-frequent nginx reloads if you have a lot of API backends with changing IPs. These reloads should be zero-downtime, but I think they may lead to corner-case issues with keep-alive connections. I haven't proved this, but my current theory is that this might be responsible for some ELB proxying issues we've spotted.

In the new stack, all of the DNS changes are handled inside nginx without any reloads necessary. Overall, it's much simpler (and I'm hoping might explain/fix possible keep-alive issues). In addition, there's a path towards making the code even simpler, since a future version of OpenResty will natively support managing upstream server connections (right now we rely on an nginx plugin that works, but has a few rough edges).
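
As a rough illustration of the general technique (a sketch only; the shared dict, nameserver, interval, and hostname are illustrative assumptions, not API Umbrella's actual resolver code), backend hostnames can be re-resolved on a timer inside OpenResty and the results stored in shared memory, with no nginx reload involved:

```lua
-- Sketch: periodically resolve an API backend hostname with lua-resty-dns and
-- publish the addresses in a shared dict for the balancing code to read.
local resolver = require "resty.dns.resolver"
local cjson = require "cjson"

local function refresh_host(premature, hostname)
  if premature then
    return
  end

  local r, err = resolver:new({ nameservers = { "127.0.0.1" }, timeout = 2000 })
  if not r then
    ngx.log(ngx.ERR, "failed to create resolver: ", err)
    return
  end

  local answers = r:query(hostname, { qtype = resolver.TYPE_A })
  if answers and not answers.errcode then
    local ips = {}
    for _, answer in ipairs(answers) do
      if answer.address then
        table.insert(ips, answer.address)
      end
    end
    ngx.shared.resolved_hosts:set(hostname, cjson.encode(ips))
  end

  ngx.timer.at(5, refresh_host, hostname) -- schedule the next refresh
end

ngx.timer.at(0, refresh_host, "api-backend.example.com")
```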

Packaging Improvements

The binary packages for API Umbrella are much smaller in size since we've been able to simplify our dependencies. We've also made an effort to rely more on system dependencies wherever possible, rather than packaging all of our dependencies into our own binary. In addition to a smaller binary, this also means that when there are things like OpenSSL security updates, you only have to worry about updating the copy on your system, rather than also needing to upgrade API Umbrella.

We were previously using omnibus for building our packages (which is what led to a lot of the over-eager dependency inclusion), but now we're using a simpler standalone make process combined with fpm. Docker is being used to create packages for different distros. These changes should speed up the packaging process so it will be easier for us to release packages more frequently.

We'll also be looking to publish our binaries in real Yum and Apt repos so that dependencies can be automatically fetched (this is more of an issue for apt). This should also simplify upgrading API Umbrella.

Misc

There's probably more I'm forgetting, so I'll update this if I remember anything else. There's also more we need to outline on the roadmap related to this (shifting our APIs to Lua, unifying our testing frameworks), but this is the bulk of the big revamp we want to get out of the way.

If you've made it through this long-winded list, you'll probably see that the changes really extend beyond the switch to Lua. Taking a fresh look at things more broadly has allowed us to make improvements in other areas we've identified over the years. I hate to do such a massive code dump all at once in this pull request, but that's unfortunately how things turned out. However, with the major portions of the revamp out of the way, it should be much easier to improve and iterate moving forward. I don't take these kinds of big changes (including a language switch) lightly, so hopefully we won't be doing anything like this again in the foreseeable future. But the benefits of the revamp have proved worthwhile, and we're excited about where this will let us go. To the future!

GUI and others added 30 commits June 5, 2015 09:28
At this point, all the test/server/* tests should pass most of the time.
There still seem to be some sporadic testing issues that can cause some
failures we still need to track down, but things are basically
implemented.
Instead of `*example.com`, we'll now use `.example.com` which will match
both `example.com` and `*.example.com`. This ensures we get along with
nginx server name parsing.
This also uncovered a problem with wildcards and our hostname
normalization, which largely defeated the wildcard handling. Instead,
let's only apply the host normalization to the default host, since that
was the only place we really wanted it.
This reduces the need to publish multiple API Backends when only
reordering a single API backend for matching purposes. This becomes more
important when there are admin groups and scopes in use, since it
prevents one admin group from affecting another group's APIs when
performing a reorder.

Previously, we ordered each API backend with a numeric counter like 1,
2, 3, 4, 5, etc. So if you wanted to move the last backend to the front,
you actually ended up having to change the order for all the backends.

The new approach is similar in that it uses a counter, but we provide a
significant gap between each item (10,000). So instead, API backends
added would by default have a sort order of 0, 10000, 20000, 30000,
40000, etc. If you wanted to shift the last backend to the front, it
would get a new sort order of -10000, while leaving the rest of the
backends untouched. More important than simply going negative, if you
wanted to move the last backend to the second position, it would leave
the rest of the items untouched by choosing a sort order in between the
existing values (so 5,000 in this example). Subsequent moves in between
the first two items would continue dividing the available integer range
in half (2,500, 1,250, etc.)

With this approach of leaving gaps and then filling them in to fulfill
moves, there is a theoretical point where the surrounding items will
need to be shifted. Similarly, the sort order uses 32bit signed
integers, so there's also a theoretical maximum upper bound and lower
bound if enough entries get added with the gaps in place. Both
situations should be handled in this implementation by shifting
surrounding records to make room as needed. In those cases, other
records may still be changed when sorting, so this isn't a 100% perfect
solution, but I think both situations should be exceedingly rare given
the amount of data typically present and the number of reorders that
typically happen.
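
A rough sketch of the gap-based ordering described above (illustrative only; the function names and gap handling are assumptions, not the actual implementation):

```lua
-- Sketch of gap-based sort orders: place a moved backend halfway between its
-- new neighbors so the surrounding records usually don't need to change.
local GAP = 10000

-- Sort order for a backend appended to the end of the list (0, 10000, 20000, ...).
local function next_append_order(current_max_order)
  return (current_max_order or -GAP) + GAP
end

-- Sort order for a backend moved between two existing neighbors.
-- prev_order is nil when moving to the front; next_order is nil at the end.
local function order_for_move(prev_order, next_order)
  if prev_order and next_order then
    local midpoint = math.floor((prev_order + next_order) / 2)
    if midpoint == prev_order or midpoint == next_order then
      -- Gap exhausted: the caller would need to shift neighbors to make room.
      return nil, "gap exhausted"
    end
    return midpoint
  elseif next_order then
    return next_order - GAP -- moved to the front
  elseif prev_order then
    return prev_order + GAP -- moved to the end
  else
    return 0 -- first item in an empty list
  end
end

print(next_append_order(nil))   --> 0
print(order_for_move(nil, 0))   --> -10000 (last item moved to the front)
print(order_for_move(0, 10000)) --> 5000 (moved between the first two items)
print(order_for_move(0, 5000))  --> 2500
```
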
Implement different approach to API backend reordering to reduce reorder churn
- Roles from sub-URL settings now are appended by default to any
  original roles, rather than overwriting them. This seems more
  intuitive and less dangerous, since you're less likely to accidentally
  lift all role restrictions by simply adding a sub-url setting.
- There's a new option to explicitly have sub-URL roles override the
  parent roles, but by default, they are appended.
- When multiple roles are defined on the API, the API key must now have
  all of the roles in order to access the API, rather than any single
  role (see the sketch after this list). I don't think the previous
  approach of only requiring a single role was very intuitive, and I
  think this approach makes more sense, particularly with sub-URL roles
  appending to the base roles by default.
- Fix crash possible if API backend was setup with role requirements but
  api keys not required. While this combination of settings doesn't make
  sense, it may have been encountered with an API backend that required
  a role, but then a sub-URL setting trying to disable key requirements
  (for CORS preflight requests as an example).
- Remove special handling of "admin" role that previously granted access
  to all APIs. This wasn't really documented anywhere, and I don't think
  we have a strong use-case for this kind of global admin role, so it
  seems safest to remove this so a user isn't accidentally granted
  global access.
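
A rough sketch of the "must have all roles" check described in this list (illustrative only; the function name and call style are assumptions, not the actual code):

```lua
-- Sketch: an API key now needs every role required by the matched API backend,
-- rather than any single one of them.
local function key_has_all_required_roles(required_roles, key_roles)
  local key_role_set = {}
  for _, role in ipairs(key_roles or {}) do
    key_role_set[role] = true
  end

  for _, role in ipairs(required_roles or {}) do
    if not key_role_set[role] then
      return false
    end
  end

  return true
end

print(key_has_all_required_roles({ "private", "write" }, { "private" }))          --> false
print(key_has_all_required_roles({ "private", "write" }, { "private", "write" })) --> true
```
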
The override roles checkbox is part of some changes to how roles get
applied and inherited made in:
NREL/api-umbrella-gatekeeper@94670b0

This turned into a larger change than just adding the single override
checkbox for roles, since we only want this checkbox to show up in the
context of the sub-url settings, and not the other places those
settings fields are shared (API global settings). So that we can more
easily control pieces of the settings fields, it's been turned into an
Ember component instead of a render call, which gives us the ability to
pass extra variables around. This in turn led to a few other things:

- Turn the selectize input from a sort of strange view wrapper into an
  easyform input type. This should be more straightforward, and since
  Views will be removed in Ember 2, it seems like a good move to go
  ahead and make.
- Tweaks to the selectize and ace editor inputs to improve usability and
  testability. Now the associated labels for these input types are
  correct, so clicking on the labels lands you in the correct input
  field, rather than still being tied to the underlying, hidden field.
  This and some related changes help with testability of these input
  types.
- More shifting of text in the api and settings forms to i18n while
  we're at it.
- A much more complete full integration test for the whole API backend
  and API user sections. Since we needed to shift the shared settings
  from a view/controller to a component, we needed a much more thorough
  integration suite to ensure this didn't break any piece unexpectedly.
Add option to override roles in sub-settings (and some refactoring)
Fixes and changes to how roles get applied
Related to it being possible to enter invalid error data (but valid
YAML) via the web UI: #153

In these cases, we will now return the default internal server error,
which at least prevents bad error responses without any HTTP status
code at all, which could really throw some clients into a tizzy.
This ensures that the error data isn't an unexpected type at the root
like a string: #153
Ensure that the error data yaml entered is the expected type (hash)
Better error handling for if error data is unexpectedly not an object.
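
A minimal sketch of the kind of check described here (assuming a YAML parser such as lyaml; the fallback data and function are illustrative, not the actual code):

```lua
-- Sketch: only accept error data that parses to a table (hash); anything else
-- falls back to a default internal server error definition.
local lyaml = require "lyaml"

local DEFAULT_ERROR_DATA = {
  internal_server_error = {
    status_code = 500,
    message = "Internal Server Error",
  },
}

local function parse_error_data(yaml_string)
  local ok, data = pcall(lyaml.load, yaml_string)
  if not ok or type(data) ~= "table" then
    -- Valid YAML can still be a bare string or number at the root; reject it.
    return DEFAULT_ERROR_DATA
  end
  return data
end

print(parse_error_data("just a string").internal_server_error.status_code) --> 500
```
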
I'm favoring the ideas of simplifying our stack down to Lua more and
more, so let's go ahead and switch the CLI app out for Lua.
We have a pull-request in, so hopefully this will be integrated into the
releases at some point. In the meantime, we need it for the CLI app to
dump YAML properly.
We're also giving our app an explicit namespace in lua for better
separation.
Also try to improve the build process a bit, so that if things are
already installed, we don't have to perform any actions.
GUI added 20 commits October 25, 2015 22:35
Shift the /etc/resolv.conf and /etc/hosts parsing to the CLI, since this
only needs to be done once per reload, and then use the values stored in
our runtime config.
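
A small sketch of the sort of parsing involved (illustrative only; not the actual CLI code):

```lua
-- Sketch: pull the nameserver entries out of /etc/resolv.conf once at
-- startup/reload time so the runtime config can carry them.
local function parse_nameservers(path)
  local nameservers = {}
  local file = io.open(path, "r")
  if not file then
    return nameservers
  end

  for line in file:lines() do
    -- Skip comments; capture "nameserver <address>" lines.
    if not line:match("^%s*[#;]") then
      local address = line:match("^%s*nameserver%s+(%S+)")
      if address then
        table.insert(nameservers, address)
      end
    end
  end

  file:close()
  return nameservers
end

for _, address in ipairs(parse_nameservers("/etc/resolv.conf")) do
  print(address)
end
```
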
This has been replaced by the web backends concept.
Implement the "services" logic so that only certain processes startup
depending on the broader service groups defined for api umbrella.
- Track down a couple of processes that weren't in our default port range
  for api umbrella services (~14000).
- Bind all the internal-only services to 127.0.0.1 instead of the public
  network interface.
- Make sure all of our test processes run on different ports than the
  default ports (so tests can be run at the same time).
- Remove a few bits of defunct config related to ports.
The CI environment is having some odd issues with the rake version under
the static site suddenly. Bump some web dependencies while we're at it.
Running into rack bundler issues in the newer version. I think perhaps
the various rack changes made in the newer versions of Puma aren't
compatible with this older Rails 3.2 version of rack.
@GUI mentioned this pull request Oct 27, 2015
Right now, we'll use the standard CentOS 6 output, but we'll tailor this
later for other distros.
GUI added a commit that referenced this pull request Oct 30, 2015
The grand Lua pull request
@GUI merged commit 7dbcfb1 into master Oct 30, 2015
@perfaram
Contributor

Whoa.
