feat(clustering) retiring Serf - new cache invalidation mechanism #2561

Merged
merged 12 commits into next from feat/retiring-serf on Jun 16, 2017

@thibaultcha
Member

thibaultcha commented May 25, 2017

This rewrite makes Kong independent of any external agent when dealing with cache invalidation events. It opens the way to #2355 and to a simpler installation and usage experience for Kong.

Incentive

As we know, Kong caches datastore entities (mostly configuration entities such as APIs, Consumers, Plugins, Credentials...) to prevent unnecessary overhead and database round trips on a per-request basis. Up to and including Kong 0.10, Serf has been the dependency Kong relies upon to broadcast invalidation events to the other members of a cluster. This approach is not without compromises, the main ones being:

  • A dependency on third-party software (not Nginx/OpenResty related).
  • The need for a CLI to start and stop Kong (vs. the Nginx binary alone), because the Serf agent must be started as a daemon.
  • This makes deployments and containerization somewhat awkward.
  • Invoking Serf from a running Kong node requires some blocking I/O.

New Implementation

TLDR: database polling of a time-series entity called "cluster events".

Cluster events are implemented for both PostgreSQL and Cassandra. A Kong node can broadcast() an event containing data (text) on a channel, and other Kong nodes can subscribe() to that channel and receive said events at a configurable polling interval.
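
As a rough illustration of this surface (only the broadcast()/subscribe() names and the "invalidations" channel come from this PR; the accessor and exact signatures below are assumptions for the sake of example, not the final API):

-- node A: publish an invalidation event (the data is plain text)
local singletons = require "kong.singletons"
local cluster_events = singletons.cluster_events  -- assumed accessor

local ok, err = cluster_events:broadcast("invalidations", "apis:123")
if not ok then
  ngx.log(ngx.ERR, "failed to broadcast cluster event: ", err)
end

-- node B: subscribe to the same channel; the callback runs once per
-- received event, at the configured polling interval
cluster_events:subscribe("invalidations", function(data)
  ngx.log(ngx.DEBUG, "received invalidation event: ", data)
end)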

Here is a high-level, non-exhaustive list of changes:

  • New cluster_events.lua module. An abstract framework for both PostgreSQL and Cassandra to broadcast messages across the cluster.
  • As per our defined constant, events stay in the database for 1 day (TTL).
  • For Cassandra, our table uses a partition key made of the channel events are published to. More details in the related commit, but this means R/W operations on this table are quite efficient, though we can only handle roughly 2 billion events a day (Cassandra's per-partition limit). TTLs are natively supported.
  • For PostgreSQL, we use a trigger on insert to clean up expired events, emulating the TTL.
  • Cassandra has pagination support when reading events, PostgreSQL doesn't (yet).
  • The interval at which events are polled is configurable (5 seconds by default).
  • To mitigate eventual consistency in distributed environments, we introduce a configurable "propagation delay" that postpones the handling of an event until its data should have been propagated. This is an initial implementation.
  • The new mlcache.lua module is a temporary module awaiting the completion of lua-resty-mlcache, and potentially the replacement of lua-resty-worker-events.
  • A new cache.lua module replaces the old database_cache.lua module. It handles inter-worker and inter-node invalidations, and relies on mlcache.lua for better performance and more powerful features.
  • All entities accessed via database_cache.get_or_set() have been replaced.
  • We try to provide node-level consistency (a request made to a node's Admin API is instantly reflected in that node's proxy server) for entities. Currently, this is only done for the router (and could be improved). Other entities like the balancer will need some refactoring to achieve this easily, and smaller, single entities will need an IPC module offering more consistency.
  • To generate the cache key of an entity, we introduce a new schema property in most entities: cache_key, which can be a table listing one or several fields. This makes cache key generation based on variable(s) easier and safer (see the sketch after this list).
  • Cached entities now have a TTL of 1 hour (previously "forever"), and 5 minutes for a lookup miss. This change of default behavior is a safeguard against potentially stale data.
  • As another safeguard against potentially stale data, the /cache endpoint on the Admin API has been kept.
  • We remove mediator_lua and use lua-resty-worker-events instead. The latter can do the job we used the former for, and has more features essential to our usage; keeping both was a duplicate/unnecessary dependency.
  • We created the skeleton for a pub/sub API to subscribe to CRUD operations on database entities, such as apis:updated or consumers:deleted. Plugins should be able to listen for those events and act accordingly.
  • With this change, Kong becomes purely stateless. There is no longer a nodes table storing each and every node currently connected to the cluster. This is by design for now, until the need for one eventually arises.
  • Core entities invalidation tests have been moved to a new test suite, spec/04-invalidations, and are written more efficiently than the previously existing tests, avoiding the many "sleep" calls (NOPs) used to wait for event propagation. This is thanks to the new design and will be improved later on with an upcoming IPC module ensuring inter-worker consistency for each request.
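
To illustrate the new cache_key property and the new cache module, here is a minimal sketch. The entity, field names, and exact DAO/cache signatures below are illustrative assumptions, not necessarily what this PR ships:

-- hypothetical schema excerpt: declare which field(s) compose the cache key
return {
  table       = "keyauth_credentials",
  primary_key = { "id" },
  cache_key   = { "key" },  -- one or several fields
  fields      = {
    id          = { type = "id" },
    consumer_id = { type = "id" },
    key         = { type = "string", unique = true },
  },
}

At runtime, the key would be derived from the declared field(s) and used to probe the cache, with the callback only hitting the database on a miss:

local singletons = require "kong.singletons"

local function load_credential(key)
  local rows, err = singletons.dao.keyauth_credentials:find_all { key = key }
  if err then
    return nil, err
  end
  return rows[1]
end

local cache_key = singletons.dao.keyauth_credentials:cache_key("my-api-key")
local credential, err = singletons.cache:get(cache_key, nil, load_credential,
                                             "my-api-key")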

Path forward

This should be reviewed and heavily tested. More tests should be written, especially around the cluster_events module itself. Particular attention should be given to the invalidation of core entities; any help reviewing and testing that area would be greatly appreciated.

Reviews and comments are welcome, questions as well! Local testing of this would be extremely helpful.

Regarding CI, the Travis builds might still fail at this point, partly for unrelated reasons (non-deterministic tests), partly because this is still in a "polishing" state.

Here is a small list of leftover todos:

  • Run the new invalidation test suite on Travis CI
  • Find a way to test the quiet option of the DAO (those tests are currently commented out for linting reasons)
  • Write more tests for the cluster_events module with both databases
  • Audit the lua-resty-lock options of the mlcache module.

@thibaultcha thibaultcha added this to the 0.11.0 milestone May 25, 2017

@thibaultcha thibaultcha self-assigned this May 25, 2017

@p0pr0ck5

First round of comments, not enough time to walk deeply through everything today but wanted to get some thoughts on paper. Also, I notice several integration tests are failing consistently in a local env based on commit 689ae8c: https://gist.github.com/p0pr0ck5/6b31ddbcdf3a941aaccd3bb9314758e5

local DEBUG = ngx.DEBUG
local SHM_CACHE = "cache"

@p0pr0ck5

p0pr0ck5 May 30, 2017

Contributor

lets use a more narrowly namespaced name

@thibaultcha

thibaultcha Jun 2, 2017

Member

This is not a namespace prefix, but the name of the shm entry, which is already defined in the Nginx config.

@p0pr0ck5

p0pr0ck5 Jun 2, 2017

Contributor

yes, "namespace" is colloquial, referring to usage within the shm. it may be possible that other users want to use this shm? we should be a good citizen imo and be clear about what our keys are doing in the shm.

@thibaultcha

thibaultcha Jun 16, 2017

Member

Alright, let's take advantage of the fact that 0.11 will have many updates to the Nginx config template already to choose a better name, and group the breaking changes in one.

I suggest we prefix all of our shms with kong_*. How about: kong_db_cache for this one?

@p0pr0ck5

p0pr0ck5 Jun 16, 2017

Contributor

sgtm!

@thibaultcha

thibaultcha Jun 16, 2017

Member

Cool. Let's do that right after the merge, to have one single commit taking care of them all at once, to preserve atomicity then.

local remaining_ttl = ttl - (now() - at)
-- value_type of 0 is a nil entry
if value_type == 0 then

@p0pr0ck5

p0pr0ck5 May 30, 2017

Contributor

i wonder if we should add an unmarshaller for nil types, so we avoid a branch here? such a dispatch could simply just return nil itself, so we would return the return signature of remaining_ttl, nil, nil. my guess is we'll have quite a lot of negative caches, so this branch becomes fairly unbiased (though this is a fairly subjective assumption).
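
For illustration, the suggested dispatch might look roughly like this, reusing the locals from the excerpt above (the type codes and unmarshaller bodies are assumptions, not the module's actual ones):

-- hypothetical dispatch table replacing the value_type == 0 branch
local unmarshallers = {
  [0] = function() return nil end,               -- negative (nil) cache entry
  [1] = function(str) return str end,            -- string
  [2] = function(str) return tonumber(str) end,  -- number
}

local value = unmarshallers[value_type](str_value)
return remaining_ttl, nil, value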

@thibaultcha

thibaultcha Jun 2, 2017

Member

I would not address this here, but in the actual lua-resty-mlcache repo

local setmetatable = setmetatable
local LOCK_KEY_PREFIX = "lua-resty-mlcache:lock:"

@p0pr0ck5

p0pr0ck5 May 30, 2017

Contributor

pedantic: do we call this key lua-resty-mlcache when this lib is not technically lua-resty-mlcache?

@thibaultcha

thibaultcha Jun 16, 2017

Member

Yeah, I think that's fine, maybe we'll get to replacing this file with the upstream mlcache lib

-- still not in shm, we are responsible for running the callback
local ok, err = pcall(cb, ...)

@p0pr0ck5

p0pr0ck5 May 30, 2017

Contributor

lets say something super crappy happens here, like the callback causes the worker to crash. would we end up with a stale lock in the shm? is this a case for making the resty.lock timeout configurable, or at least thinner than the default (30 secs)?

@thibaultcha

thibaultcha Jun 2, 2017

Member

This is why this PR says we need to audit those settings, another small adjustment to make.

@thibaultcha

thibaultcha Jun 14, 2017

Member

This is why there was a timeout value in the original implementation. In the current one, with a double-lock mechanism, there is a timeout for both locks as well, in case such a scenario happens.

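For reference, a minimal sketch of how lua-resty-lock bounds such scenarios (the shm name, key, and values here are placeholders, not the ones this PR settles on):

local resty_lock = require "resty.lock"

-- "timeout" bounds how long a worker waits to acquire the lock;
-- "exptime" bounds how long a held lock can live if its holder never
-- unlocks it (e.g. a crashed worker), so a stale lock eventually clears
local lock, err = resty_lock:new("cache_locks", {
  timeout = 5,   -- seconds spent waiting for the lock
  exptime = 10,  -- seconds before a held lock auto-expires
})
if not lock then
  return nil, "could not create lock: " .. err
end

local elapsed, err = lock:lock("my_cache_key")
if not elapsed then
  return nil, "could not acquire lock: " .. err
end

-- ... run the expensive callback and populate the shm here ...

local ok, err = lock:unlock()
if not ok then
  return nil, "could not release lock: " .. err
end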

@Tieske

another incomplete review. But it's a start.

Unless I'm mistaken, we're using a single cache, right? Will we be making entity-specific caches (in Lua land) later? E.g. grabbing an API from an LRU cache that also holds 10,000 Consumers takes longer than necessary compared to splitting them up.
Using multiple caches might require a different cache-key setup, as the key would also need to carry which entity (which cache) it is about. If so, we must make sure that the new cache-key supports that way of working.

Similarly for wrapping entity-specific code. E.g. Consumer fetching is duplicated all over the codebase, whereas we could implement a single module for Consumers with a get_consumer_by_uuid method that hides all underlying invalidation, events, caches, etc. I guess that's another refactor after this one, but it may be good to agree on it here.

# of an entity.
# Single-datacenter setups or PostgreSQL
# servers should suffer no such delays, and
# this value can be safely set to 0.

@Tieske

Tieske Jun 2, 2017

Member

I do not understand what this setting does?

@thibaultcha

thibaultcha Jun 15, 2017

Member

Updated the description to clarify the meaning

@thefosk

thefosk Jun 2, 2017

Member

Turns out we may want to keep Kong nodes registering themselves in the database, and keep node health checks, to show this information in our other products.

@thibaultcha

thibaultcha Jun 2, 2017

Member

I know, this can come in another PR though, with more meaningful data and a better and more modularized implementation.

@mwaaas

mwaaas Jun 9, 2017

Is there an ETA for this ("With this change, Kong becomes purely stateless")? The current behavior is blocking us from using Kong in EB Docker.

@mwaaas

Not seeing documentation for how to use the new cache strategy.

@p0pr0ck5

p0pr0ck5 Jun 9, 2017

Contributor

@Tieske can you help me understand your meaning behind this comment:

eg. grabbing an api from an lru cache that also holds 10000 consumers takes longer than necessary if we would split them up.

LRU cache operations are O(1). There is no performance gain by using multiple LRU caches (if anything, it's just more complexity in management).

thibaultcha added some commits Apr 5, 2017

feat(cluster) new cluster events module
This module is an abstract framework that allows for broadcasting and
subscribing (pub/sub) to events across all Kong nodes connected to the
same database. It works for both PostgreSQL and Cassandra.

It consists of time-series entities stored as rows inside the Kong
database. This module allows for several "channels" (although we will
most likely only use one as of now, the "invalidations" channel).

For Cassandra, each channel is a separate row in the database. The user
should have appropriate replication factor settings to ensure
availability. This makes insertion and retrieval of this time series
data particularly efficient. For Postgres, each event is its own row. We
create the necessary indexes for performance reasons. The low
cardinality of the "channels" column is helpful for such indexes.

An "event" is pure text. How said text is serialized (if complex schemas
are required) is the sole responsibility of the user.

An "event" is run once and only once per node. There is an shm tracking
the already ran event (via a v4 UUID key) for a given Nginx node.

"events" are periodically polled from a single worker via a recurring
`ngx.timer`. There is no "special worker" responsible for such polling,
to prevent a worker crash from corrupting a running node (if it stops
polling for events).

To mitigate eventual consistency in distributed systems (e.g. a
multi-datacenter Cassandra cluster), the "events" are polled with a
"backwards offset" (the last 60s, for example) to ensure none were
missed because they were replicated too late. This means we fetch the
same events several times, but they will only run once anyway, as
previously mentioned.
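
A rough sketch of what the Cassandra side could look like (illustrative only; the migration name, column names, and options below are assumptions, and the real definition lives in this commit's migration files):

-- hypothetical Cassandra migration excerpt: "channel" is the partition key
-- (one partition per channel), "at"/"id" are clustering columns keeping
-- events time-ordered, and rows are inserted with a 1-day TTL
return {
  {
    name = "xxxx_add_cluster_events",  -- placeholder name
    up = [[
      CREATE TABLE IF NOT EXISTS cluster_events(
        channel text,
        at      timestamp,
        id      uuid,
        data    text,
        PRIMARY KEY (channel, at, id)
      );
    ]],
  },
}
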
feat(mlcache) new [temporary] kong.mlcache module
This module is to replace the current database_cache.lua module with a
nicer and more powerful interface. This module is currently being
developed as a separate, generic lua-resty library and hence no tests
from this library were inserted into Kong's test suite.

This module is to disappear in favor of a dependency on said
lua-resty-mlcache library (once completed).
feat(cache) new kong.cache module to replace database_cache.lua
This module takes advantage of the new `kong.cluster_events` and
`lua-resty-mlcache` modules to build a more powerful caching mechanism
for Kong configuration entities (datastore entities).

We support inter-worker and inter-node invalidations (via
`lua-resty-worker-events` and `kong.cluster_events`), TTLs, infinite
TTLs, negative TTLs (cache misses), better error handling, and better
overall performance. It also supports a more intuitive cache "probing"
mechanism.
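
A minimal usage sketch of the kind of lookup this enables (the exact accessor and option names are assumptions; see the module itself for the real signature):

local singletons = require "kong.singletons"

local function load_consumer(consumer_id)
  -- only runs on a cache miss; its return value gets cached (including nil,
  -- which is negatively cached with the shorter miss TTL)
  return singletons.dao.consumers:find { id = consumer_id }
end

local consumer_id = "a-consumer-uuid"

-- L1 (Lua-land LRU) -> L2 (shm) -> L3 (callback/database) lookup
local consumer, err = singletons.cache:get("consumers:" .. consumer_id, nil,
                                           load_consumer, consumer_id)
if err then
  ngx.log(ngx.ERR, "could not load consumer: ", err)
end

-- invalidate across workers and, via cluster_events, across nodes
singletons.cache:invalidate("consumers:" .. consumer_id)
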
refactor(core) use new invalidations framework
We take advantage of the new `kong.cache` module to build a better
configuration entities caching mechanism, with invalidation events that
do not use the locally running Serf agent.

This refactors all caching branches of the core entities (APIs,
Consumers, Plugins, Credentials...) and provides more reliable and
performant integration tests.

We purposefully do not remove any Serf-related code. At this stage, 2
invalidation mechanisms are running side by side.

Additionally, this takes advantage of `lua-resty-worker-events`
supporting inter-worker and intra-worker communication to build the
foundation of "event-driven" handlers (such as being able to
subscribe to CRUD operations on the configuration entities, for future
plugins and such). `lua-resty-worker-events` is - unfortunately -
*eventually consistent* across workers, and will *eventually* either be
improved, or replaced with a consistent - and more performant -
implementation (one is currently being developed as we speak, but future
OpenResty work may also open the way to proper IPC via cosockets).
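
For illustration, the lua-resty-worker-events surface used here looks roughly like this (the source/event names and payload are placeholders, not necessarily the ones Kong ends up emitting; the module is assumed to have been configured during init):

local worker_events = require "resty.worker.events"

-- register a handler for entity CRUD events
worker_events.register(function(data, event, source, pid)
  ngx.log(ngx.DEBUG, "received ", source, ":", event, " for entity ", data.id)
end, "crud", "apis")

-- post an event from any worker; every worker eventually receives it
local ok, err = worker_events.post("crud", "apis", {
  id = "an-api-uuid",
  operation = "update",
})
if not ok then
  ngx.log(ngx.ERR, "failed to post event: ", err)
end
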
feat(admin) update /cache/ endpoint to use newer caching module
We simply replace the underlying `database_cache.lua` dependency with
our new `kong.cache` module. There is no distinction between Lua-land or
shm cache anymore. The shm cache *is considered* the source of truth for
whether a value is cached or not because:

1) the LRU, Lua-land cache is a convenience cache (emphasized by its
   size being defined as a number of items rather than a memory size,
   and potentially smaller than the shm).
2) shm and LRU lookups are of the same order of magnitude, whereas a DB
   lookup is significantly more expensive.
refactor(plugins) use new invalidations framework
Like our previous work to use the new `kong.cache` module for core
entities instead of `database_cache.lua`, this takes care of it for all
entities used in plugins (theoretically not part of Kong core). It is
also considered a separate unit of work for the sake of limiting the
patch size of a single commit, and easing future archaeology work.
refactor(*) retire Serf - for good
Remove *all* references to Serf from our various components: datastore,
CLI (commands), core, configuration, and Admin API. We also stop
spawning Serf as a daemon from the CLI, and remove commands that don't
make sense anymore (such as checking for daemon processes running).

From now on, only our new cache invalidations events mechanism is
running.

A follow-up commit will take care of introducing a migration removing
our `nodes` table in the datastore.

From now on, the need for a CLI to start and stop Kong is drastically
reduced, almost nullified: assuming the Nginx configuration is written
and exists in the prefix, Kong can start via the Nginx binary directly,
and be the foreground process in container deployments.

This opens the way to better Nginx integration, such as described in

🎆
feat(conf) new properties for datastore cache
3 new properties now define the behavior of the new invalidation events
framework and of configuration entities caching. Their role and behavior
are detailed in this commit, as part of the default configuration file.
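
For illustration, these would be set in kong.conf along these lines (the property names follow this PR's description and the db_update_propagation setting quoted above; exact names and defaults should be checked against the shipped kong.conf.default):

# polling interval, in seconds, at which cluster events are fetched
db_update_frequency = 5

# extra delay, in seconds, before a polled event is handled, to account for
# replication lag (e.g. multi-datacenter Cassandra); can safely stay at 0 for
# single-datacenter or PostgreSQL setups
db_update_propagation = 0

# TTL, in seconds, of datastore entities cached in-memory (1 hour)
db_cache_ttl = 3600
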
feat(conf) warn log when dangerous cache setting detected
When using Cassandra, one must not leave `db_update_propagation` at
its default of 0, at the risk of missing invalidation events.

This logs a warning message in the CLI. The "special format" that
warrants a timestamp, a log level, and increased visibility of a log in
the CLI was downgraded to "warn".

In the future, we will need to make this logging facility log in ngx_lua
with the proper level.
@p0pr0ck5

YOLO

@thibaultcha thibaultcha merged commit 055e3af into next Jun 16, 2017

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed

@thibaultcha thibaultcha deleted the feat/retiring-serf branch Jun 16, 2017

@olderwei

olderwei Nov 30, 2017

Since version 0.11 has removed the kong cluster command, how do I know which nodes make up a cluster?

@thibaultcha

thibaultcha Nov 30, 2017

Member

@olderwei In 0.11 you cannot: all Kong nodes are purely stateless. The kong cluster command was only useful for debugging purposes when a Kong node's Serf agent seemed to not be receiving events. You should use orchestration tools to manage your cluster and have visibility into its current state.

@olderwei

olderwei Nov 30, 2017

@thibaultcha so do you have any recommendations about the orchestration tools?
