feat(clustering) retiring Serf - new cache invalidation mechanism #2561
Conversation
(force-pushed from deee788 to 689ae8c)
kong/cluster_events.lua (Outdated)
```lua
end


if not ngx_debug and _init then
  return error("kong.cluster_events was already instanciated")
```
Reading through and caught this: `instanciated` -> `instantiated`
First round of comments, not enough time to walk deeply through everything today but wanted to get some thoughts on paper. Also, I notice several integration tests are failing consistently in a local env based on commit 689ae8c: https://gist.github.com/p0pr0ck5/6b31ddbcdf3a941aaccd3bb9314758e5
```lua
local DEBUG = ngx.DEBUG


local SHM_CACHE = "cache"
```
let's use a more narrowly namespaced name
This is not a namespace prefix, but the name of the shm entry, which is already defined in the Nginx config.
yes, "namespace" is colloquial, referring to usage within the shm. it may be possible that other users want to use this shm? we should be a good citizen imo and be clear about what our keys are doing in the shm.
Alright, let's take advantage of the fact that 0.11 will have many updates to the Nginx config template already to choose a better name, and group the breaking changes in one. I suggest we prefix all of our shms with `kong_*`. How about `kong_db_cache` for this one?
sgtm!
Cool. Let's do that right after the merge, to have one single commit taking care of them all at once, and preserve atomicity.
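For reference, a hypothetical sketch of what the renamed declaration could look like in the generated Nginx configuration (the `lua_shared_dict` directive is standard ngx_lua; the size shown here is purely illustrative):

```
# illustrative excerpt of the Nginx config template after the rename
lua_shared_dict kong_db_cache 128m;
```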
kong/cmd/restart.lua (Outdated)
```lua
@@ -10,6 +10,7 @@ local function execute(args)
   end

   pcall(stop.execute, args)
+  ngx.sleep(0.1)
```
this change seems.... tangentially related? a code comment about what's going on here (if it's necessary) would be welcome.
I have to do a small update here, ignore.
Now properly implemented with a `kill -0` loop to make sure Nginx is properly terminated before trying to start it again.
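For illustration, a minimal Lua sketch of such a wait loop (the helper name and retry counts are hypothetical; the actual CLI implementation may differ):

```lua
-- Sketch of a `kill -0` wait loop: signal 0 sends nothing, it only tests
-- whether the process still exists.
local function wait_for_termination(pid, max_tries)
  for _ = 1, max_tries or 10 do
    -- `kill -0` exits with status 0 while the process is still alive
    local status = os.execute("kill -0 " .. pid .. " >/dev/null 2>&1")
    if status ~= 0 and status ~= true then -- Lua 5.1 returns a number, 5.2+ a boolean
      return true -- process is gone, safe to start Nginx again
    end
    os.execute("sleep 0.1") -- CLI context: a blocking sleep is acceptable here
  end
  return nil, "process " .. pid .. " is still running"
end
```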
kong/kong.lua (Outdated)
```lua
@@ -145,23 +132,18 @@ function Kong.init()
   local conf_path = pl_path.join(ngx.config.prefix(), ".kong_env")
   local config = assert(conf_loader(conf_path))

-  local events = Events() -- retrieve node plugins
-  local dao = assert(DAOFactory.new(config, events)) -- instanciate long-lived DAO
+  local dao = assert(DAOFactory.new(config)) -- instanciate long-lived DAO
```
instanciate -> instantiate
kong/cluster_events.lua (Outdated)
```lua
            dao_factory.db_type)
end


-- instanciation
```
instantiation
```lua
-local _, err = cache.sh_incr(cache_key, value)
-if err then
-  ngx_log("[rate-limiting] could not increment counter for period '"..period.."': "..tostring(err))
+local ok, err = shm:add(cache_key, 0, EXPIRATIONS[period])
```
so, fun story: when you call `shm:incr` on a value, regardless of initial state, the node's `expires` member is unconditionally set to 0. for this current case, this call to `add` is meaningless, and we can stick with a single call of `incr(cache_key, value, 0)` below. this leaves us with the problem of keys that are only removed via LRU. :| See openresty/lua-nginx-module#942 for more context
I remember this issue being raised already. We could update this in this PR, or in another one :)
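For clarity, a sketch of the add-then-incr pattern under discussion, against a hypothetical `ngx.shared` dict (the dict name and `EXPIRATIONS` table are illustrative):

```lua
local EXPIRATIONS = { second = 1, minute = 60, hour = 3600 } -- illustrative
local shm = ngx.shared.kong_cache -- hypothetical dict name

local function increment(cache_key, period, value)
  -- add() only succeeds (and sets the TTL) on first insertion;
  -- "exists" simply means another request initialized the key first
  local ok, err = shm:add(cache_key, 0, EXPIRATIONS[period])
  if not ok and err ~= "exists" then
    return nil, err
  end

  -- caveat from the comment above: on affected lua-nginx-module versions,
  -- incr() unconditionally resets the key's expiration to 0 (never expires);
  -- see openresty/lua-nginx-module#942
  return shm:incr(cache_key, value)
end
```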
```lua
local remaining_ttl = ttl - (now() - at)

-- value_type of 0 is a nil entry
if value_type == 0 then
```
i wonder if we should add an unmarshaller for nil types, so we avoid a branch here? such a dispatch could simply just return nil itself, so we would return the return signature of `remaining_ttl, nil, nil`. my guess is we'll have quite a lot of negative caches, so this branch becomes fairly unbiased (though this is a fairly subjective assumption).
I would not address this here, but in the actual lua-resty-mlcache repo
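For illustration, the kind of dispatch table the comment suggests (the type codes and names here are hypothetical, not the actual mlcache code):

```lua
-- hypothetical unmarshaller dispatch, keyed by the serialized value_type
local unmarshallers = {
  [0] = function() return nil end,              -- nil (negative cache) entry
  [1] = function(str) return str end,           -- string
  [2] = function(str) return tonumber(str) end, -- number
}

-- the explicit branch on value_type == 0 then disappears at the call site:
local value = unmarshallers[value_type](str_value)
return remaining_ttl, nil, value
```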
kong/cluster_events.lua (Outdated)
```lua
local CURRENT_AT_KEY = "cluster_events:at"


local POLL_LOCK_EXPTIME = 60 * 5 -- 5 mins
```
should we derive this from something configurable, such as `db_update_frequency`, rather than a hardcoded value?
```lua
local setmetatable = setmetatable


local LOCK_KEY_PREFIX = "lua-resty-mlcache:lock:"
```
pedantic: do we call this key `lua-resty-mlcache` when this lib is not technically lua-resty-mlcache?
Yeah, I think that's fine, maybe we'll get to replacing this file with the upstream mlcache lib
```lua
-- still not in shm, we are responsible for running the callback

local ok, err = pcall(cb, ...)
```
let's say something super crappy happens here, like the callback causes the worker to crash. would we end up with a stale lock in the shm? is this a case for making the resty.lock timeout configurable, or at least thinner than the default (30 secs)?
This is why this PR says we need to audit those settings, another small adjustment to make.
This is why there was a timeout value in the original implementation. In the current one, with a double lock mechanism, there is a timeout for both locks as well, in case such a scenario happens.
kong/core/handler.lua (Outdated)
```lua
    cache:invalidate("parsed_ssl_certificates:" .. sni.name)
  end
end, "crud", "ssl_certificates")
```
style: missing second whitespace line here
another incomplete review. But it's a start.

Unless I'm mistaken we're using a single cache, right? Will we be making entity-specific caches (on Lua land) later? e.g. grabbing an api from an LRU cache that also holds 10000 consumers takes longer than necessary if we would split them up. Using multiple caches might require a different cache-key setup, as it would also need to carry the info for which entity (which cache) it is about. If so, we must make sure that the new cache-key supports that way of working.

Similar for wrapping entity-specific code. E.g. consumer fetching is duplicated all over the codebase, whereas we could implement a single module for consumers that would have `get_consumer_by_uuid` as a method and would hide all underlying invalidation, events, caches, etc. But I guess that's another refactor after this, but maybe good to agree on here.
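A sketch of the entity-specific wrapper idea floated above (the module shape, key format, and DAO call are illustrative, not existing Kong APIs):

```lua
local Consumers = {}

-- hides cache probing, datastore fallback, and invalidation details from
-- the many call sites that currently duplicate this logic
function Consumers.get_consumer_by_uuid(cache, dao, uuid)
  return cache:get("consumers:" .. uuid, nil, function()
    return dao.consumers:find { id = uuid } -- only runs on a cache miss
  end)
end

return Consumers
```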
kong.conf.default (Outdated)
```
#db_cache_miss_ttl = 300  # Time-to-live (in seconds) of a miss from the
                          # datastore. Such misses can be cached for a
                          # different period of time than hits
                          # (configurable via `cache_ttl`).
```
this is not necessary, as long as we consistently implement the "add" hook/event for each entity. Invalidation upon adding will remove the cache-miss-sentinel.
Both of those TTL values are a safeguard against eventually stale data. It's a very bad idea imo to not have a TTL in our cache, and to solely rely on the invalidation events being called. It could go wrong, somehow...
yes, but we don't need another one for the cache-miss, just use the regular one, as it is only a safeguard.
I'd be fine with not having it. It's just that they offer better granularity and are more future-proof, with minimum cognitive overhead from a user friendliness perspective...
thinking about this again: shouldn't there be a setting that determines how far back events are being dealt with? Then the whole expiry thing is not necessary, right? If a case occurs where events are missed, then increasing that setting should suffice. If not, there is a serious bug somewhere. Correct?
Removed this setting to rely on `db_cache_ttl` in the latest version of this branch.
```
                          # of an entity.
                          # Single-datacenter setups or PostgreSQL
                          # servers should suffer no such delays, and
                          # this value can be safely set to 0.
```
I do not understand what this setting does?
Updated the description to clarify the meaning
kong/cluster_events.lua (Outdated)
```lua
    return
  end

  local ok, err = timer_at(self.poll_interval, poll_handler, self)
```
a probably non-problem here might be that the `self` here will never be GC'ed. So if this `self` object might be renewed, and disposed of during the lifetime of the worker process, then there is a problem; otherwise not.
Yeah it should be fine, because we ensure that instances of this module are singletons.
kong/cluster_events.lua (Outdated)
```lua
  -- unlock

  self.shm:delete(POLL_LOCK_KEY)
end
```
if you delete the lock, then another worker might do a poll in between, hence shortening the poll interval. Or am I missing something?
t in seconds;
t1 - worker 1 does a poll
t1.5 - worker 1 completes the poll, deletes the lock
t3 - worker 2 tries the lock, gets it and starts a poll
t3.5 - worker 2 completes the poll
t6 - worker 1 tries again (every 5 seconds)
etc.
the above will have done 2 polls in 5 seconds. Correct?
Possible way of working:
- all workers start polling at 0.1 second intervals
- worker starting a poll acquires lock, with a timeout of 5 seconds
- does the poll thing, but never unlocks.
- all other workers will not get the lock and hence not poll
- first worker to try after the lock expired will do the next poll
etc.
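A sketch of that scheme, using `shm:add()` as a lock that is never explicitly released and simply expires after `poll_interval` seconds (the lock key and `poll` routine are illustrative):

```lua
local function try_poll(premature, self)
  if premature then
    return
  end

  -- only one worker can add the key until it expires; "exists" means
  -- another worker polled within the last poll_interval seconds
  local ok, err = self.shm:add("cluster_events:poll_lock", true,
                               self.poll_interval)
  if ok then
    poll(self) -- hypothetical poll routine; the lock is never deleted
  elseif err ~= "exists" then
    ngx.log(ngx.ERR, "could not acquire poll lock: ", err)
  end

  -- every worker retries at a short, fixed interval (0.1s in the proposal)
  assert(ngx.timer.at(0.1, try_poll, self))
end
```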
An alternative approach could be to leverage the `flags` portion with a number representing `ngx.now()` + `poll_interval`, where the lock can only be acquired when the poller's `ngx.now()` is greater than such a value. But this becomes problematic with poll intervals < 1s, and feels kinda fishy.

Alt-alternatively, we can solve the problem of misaligned workers as @Tieske described by forcing each execution to a floor period of time modulo poll interval, e.g., with an interval of 5s, each worker is forced to run at t1, t6, t11, etc., by calculating the timer_at as a diff from `now()` and the next floor up. Of course, both of these scenarios imply that we care about explicitly and consistently running polls exactly every `poll_interval` seconds. Do we?
@Tieske The behavior you described was the one in place before a short review by @p0pr0ck5 a few weeks ago, in a private fork. Off the top of my head, I forget what the original, other issue was, but I will try to find it, and if none, then yes, we can come back to this behavior, which also has the benefit of the smaller cognitive load.
Ok, asked @p0pr0ck5 to remind me: it was that if a poll takes longer to execute than the lock exptime, we'd have 2 running timers with poll at the same time.

Here is a 3rd alternative: use 2 locks, one with exptime, one removed when the poll is complete by the current worker, and only allow a worker to poll if both locks are available.
> it was that if a poll takes longer to execute than the lock exptime, we'd have 2 running timers with poll at the same time.

Yes, that is the drawback indeed.

> use 2 locks, one with exptime, one removed when the poll is complete by the current worker, and only allow a worker to poll if both locks are available.

I like this one best, as that would be the more elegant solution.

Now what are sensible defaults then? Lock 1 has a 5 second timeout (causing a 5 second interval). Lock 2 has a 30 second (??) timeout as a safety feature? And only being able to acquire 1 lock issues a warning in the logs, right?

How often will each worker try to acquire the locks to do a poll? 1/10th of the poll interval, e.g. 5 seconds => 0.5 second. Then the poll interval will be between 5 and 5.5 seconds. With 4 workers the average interval time would then be 0.5/4 + 5 = 5.125 seconds.
Switched to use a double lock mechanism now
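For reference, a hedged sketch of what such a double-lock check could look like (key names, exptimes, and the `poll` routine are illustrative, not the merged code):

```lua
local function try_poll(self)
  -- lock 1: expires on its own and paces the poll interval
  local ok, err = self.shm:add("cluster_events:interval_lock", true,
                               self.poll_interval)
  if not ok then
    if err ~= "exists" then return nil, err end
    return true -- another worker polled less than poll_interval ago
  end

  -- lock 2: safety lock, deleted only once the poll completes, guarding
  -- against a poll that outlives lock 1
  ok, err = self.shm:add("cluster_events:running_lock", true, 300)
  if not ok then
    if err ~= "exists" then return nil, err end
    ngx.log(ngx.WARN, "a previous poll is still running")
    return true
  end

  local pok, perr = pcall(poll, self) -- hypothetical poll routine
  self.shm:delete("cluster_events:running_lock") -- release only lock 2
  if not pok then
    return nil, perr
  end
  return true
end
```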
Turns out that we may want to keep Kong nodes registering to the database, and having node health-checks, to show this information in our other products.
I know, this can come in another PR though, with more meaningful data and a better and more modularized implementation.
(force-pushed from 993a0e5 to 5939253)
kong/core/handler.lua (Outdated)
```lua
local entity_channel           = fmt("%s", data.schema.table)
local entity_operation_channel = fmt("%s:%s", data.schema.table,
                                     data.schema.operation)
```
Big mistake. This should be `data.operation`. This is one of the things we'll need tests for once the design is ok for everybody.
(force-pushed from 5939253 to 09ac0c7)
Is there an ETA for this? "With this change, Kong becomes purely stateless" is currently blocking us from using it in EB Docker.
Not seeing documentation for how to use the new cache strategy.
@Tieske can you help me understand your meaning behind this comment:
> LRU cache operations are O(1). There is no performance gain by using multiple LRU caches (if anything, it's just more complexity in management).
(force-pushed from 09ac0c7 to ae0a0e7)
(force-pushed from 62fbe3f to 2cd2b6c)
(force-pushed from 02b7ef4 to 0c67b92)
This module is an abstract framework that allows for broadcasting and subscribing (pub/sub) to events across all Kong nodes connected to the same database. It works for both PostgreSQL and Cassandra. It consists of time series entities stored as rows inside the Kong database. This module allows for several "channels" (although we will most likely only use one as of now, the "invalidations" channel).

For Cassandra, each channel is a separate row in the database, which makes insertion and retrieval of this time series data particularly efficient. The user should have appropriate replication factor settings to ensure availability. For Postgres, each event is its own row. We create the necessary indexes for performance reasons; the low cardinality of the "channels" column is helpful for such indexes.

An "event" is pure text. How said text is serialized (if complex schemas are required) is the sole responsibility of the user. An "event" is run once and only once per node: an shm tracks the already-run events (via a v4 UUID key) for a given Nginx node.

"events" are periodically polled by a single worker via a recurring `ngx.timer`. There is no "special worker" responsible for such polling, to prevent a worker crash from corrupting a running node (if it stops polling for events). To mitigate distributed-system delays (e.g. a multi-datacenter Cassandra cluster), the "events" are polled with a "backwards offset" (last 60s for example) to ensure none were missed if they were replicated too late. This means we fetch the same events several times, but they will only run once anyway, as previously mentioned.
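Based on the vocabulary above, usage would look roughly like this (the constructor and exact signatures are guesses from this PR's description, not a documented API):

```lua
-- hypothetical instantiation against the long-lived DAO
local cluster_events = ClusterEvents.new { dao_factory = dao }

-- node A: publish a plain-text event on the "invalidations" channel
assert(cluster_events:broadcast("invalidations", "consumers:123"))

-- node B: register a callback; the recurring ngx.timer polling the
-- database hands over events this node has not run yet
cluster_events:subscribe("invalidations", function(data)
  ngx.log(ngx.DEBUG, "received cluster event: ", data)
end)
```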
(force-pushed from 4c39ac6 to 944bad5)
This module is to replace the current database_cache.lua module with a nicer and more powerful interface. This module is currently being developed as a separate, generic lua-resty library, and hence no tests from this library were inserted into Kong's test suite. This module is to disappear in favor of a dependency on the said lua-resty-mlcache library (once completed).
This module takes advantage of the new `kong.cluster_events` and `lua-resty-mlcache` modules to build a more powerful caching mechanism for Kong configuration entities (datastore entities). We support inter-worker and inter-node invalidations (via `lua-resty-worker-events` and `kong.cluster_events`), TTL, infinite TTL, negative TTL (cache miss), better error handling, and better overall performance. It also supports a more intuitive cache "probing" mechanism.
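Following the lua-resty-mlcache API shape (`cache:get(key, opts, callback, ...)`), a probe-or-load call could look like the sketch below; the DAO call and key format are illustrative, and the exact `kong.cache` signatures may differ:

```lua
local function load_consumer(id)
  -- only runs on a cache miss: a single trip to the datastore
  return dao.consumers:find { id = id }
end

local consumer, err = cache:get("consumers:" .. id, nil, load_consumer, id)
if err then
  return nil, err
end

-- `consumer` may legitimately be nil here: a datastore miss is itself
-- cached (negative TTL), avoiding repeated round trips for absent entities
```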
(force-pushed from 05fb389 to 8740757)
We take advantage of the new `kong.cache` module to build a better configuration entities caching mechanism, with invalidation events that do not use the locally running Serf agent. This refactors all caching branches of the core entities (APIs, Consumers, Plugins, Credentials...) and provides more reliable and performant integration tests. We purposefully do not remove any Serf-related code: at this stage, the two invalidation mechanisms are running side by side.

Additionally, this takes advantage of `lua-resty-worker-events` supporting inter-worker and intra-worker communication to build the foundation of "event-driven" handlers (such as being able to subscribe to CRUD operations on the configuration entities, for future plugins and such). `lua-resty-worker-events` is, unfortunately, *eventually consistent* across workers, and will *eventually* either be improved, or replaced with a consistent (and more performant) implementation (one is currently being developed as we speak, but future OpenResty work may also open the way to proper IPC via cosockets).
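As an example of the event-driven handlers this enables, here is a sketch using the lua-resty-worker-events `register`/`post` functions, with the `"crud"` source and per-entity channel naming visible elsewhere in this PR (the payload shape is an assumption):

```lua
local worker_events = require "resty.worker.events"

-- subscribe: react to any CRUD operation on Consumers, in every worker
worker_events.register(function(data)
  ngx.log(ngx.DEBUG, "consumer event: ", data.operation)
end, "crud", "consumers")

-- publish: notify all workers of a create operation (eventually consistent)
local ok, err = worker_events.post("crud", "consumers", {
  operation = "create",
  entity    = { id = "..." }, -- illustrative payload
})
```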
We simply replace the underlying `database_cache.lua` dependency with our new `kong.cache` module. There is no distinction between the Lua-land and shm caches anymore. The shm cache *is considered* the source of truth for whether a value is cached or not, because: 1) the LRU, Lua-land cache is a convenience cache (emphasized by its size unit being a number of items rather than memory, and potentially smaller than the shm); 2) the shm and LRU lookups are of the same order of magnitude, whereas a DB lookup is significantly more expensive.
Like our previous work to use the new `kong.cache` module for core entities instead of `database_cache.lua`, this takes care of it for all entities used in plugins (theoretically not part of Kong core). It is also kept as a separate unit of work for the sake of limiting the patch size of a single commit, and to ease future archaeology work.
Remove *all* references to Serf from our various components: datastore, CLI (commands), core, configuration, and Admin API. We also stop spawning Serf as a daemon from the CLI, and remove commands that don't make sense anymore (such as checking for daemon processes running). From now on, only our new cache invalidation events mechanism is running. A follow-up commit will take care of introducing a migration removing our `nodes` table from the datastore. From now on, the need for a CLI to start and stop Kong is drastically reduced, almost nullified: assuming the Nginx configuration is written and exists in the prefix, Kong can start via the Nginx binary directly, and be the foreground process in container deployments. This opens the way to better Nginx integration, such as described in :fireworks:
3 new properties now define the behavior of the new invalidation events framework and configuration entities caching. Their role and behavior are detailed in this commit, as part of the default configuration file.
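The three properties referenced throughout this thread are `db_update_frequency`, `db_update_propagation`, and `db_cache_ttl`; a hedged sketch of how they could appear in `kong.conf.default` (values and wording here are illustrative, not the merged text):

```
#db_update_frequency = 5    # Frequency (in seconds) at which to poll the
                            # datastore for cluster events (e.g.
                            # invalidations).

#db_update_propagation = 0  # Extra delay (in seconds) before polled events
                            # are processed, to accommodate datastore
                            # replication delays (e.g. Cassandra).

#db_cache_ttl = 3600        # Time-to-live (in seconds) of entities cached
                            # by a node; a safeguard against stale data.
```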
(force-pushed from 68a9bb2 to dd6f5f8)
When using Cassandra, one must not leave `db_update_propagation` at its default of 0, at the risk of missing invalidation events. This logs a warning message in the CLI. The "special format" that warrants a timestamp, log level, and increased visibility of a log in the CLI was downgraded to "warn". In the future, we will need to make this logging facility log in ngx_lua with the proper level.
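A minimal sketch of the described safeguard (the condition, field names, and message wording are assumptions, not the actual CLI code):

```lua
-- somewhere in CLI/configuration validation (illustrative):
if conf.database == "cassandra" and conf.db_update_propagation == 0 then
  log.warn("when using Cassandra, it is recommended to set " ..
           "db_update_propagation to a value greater than 0")
end
```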
(force-pushed from dd6f5f8 to e5cce15)
YOLO
Since version 0.11 has removed the `kong cluster` command, how do I know which nodes make up a cluster?
@olderwei In 0.11 you cannot: all Kong nodes are purely stateless. The …
@thibaultcha so do you have any recommendations about orchestration tools?
This rewrite makes Kong independent when dealing with cache invalidation events. This opens the way to #2355 and to a simpler installation and usage experience for Kong.
Incentive
As we know, Kong caches datastore entities (configuration entities mostly, such as APIs, Consumers, Plugins, Credentials...) to prevent unnecessary overhead and database round trips on a per-request basis. As of Kong 0.10, Serf is the dependency Kong relies upon to broadcast invalidation events to other members of the cluster. This approach is not without compromises, the main ones being:
New Implementation
TLDR: database polling of a time-series entity called "cluster events".
Cluster events are implemented in both PostgreSQL and Cassandra. A Kong node can `broadcast()` an event containing data (text) on a channel, and other Kong nodes can `subscribe()` to a channel and receive said events at a configurable interval.

Here is a high-level, non-exhaustive list of changes:
- New `cluster_events.lua` module: an abstract framework for both PostgreSQL and Cassandra to broadcast messages across the cluster.
- The `mlcache.lua` module is a temporary module awaiting the completion of lua-resty-mlcache, and potentially the replacement of lua-resty-worker-events.
- The new `cache.lua` module replaces the old `database_cache.lua` module. It handles inter-worker and inter-node invalidations, and relies on `mlcache.lua` for better performance and more powerful features.
- Usages of `database_cache.get_or_set()` have been replaced.
- `cache_key` can be a table taking multiple values. This makes cache key generation based on variable(s) easier and safer.
- The `/cache` endpoint on the Admin API has been kept.
- Events such as `apis:updated` or `consumers:deleted` are now propagated; plugins should be able to listen on those events and act accordingly.
- There is no `nodes` table anymore storing each and every node currently connected to the cluster. This is by design for now, until the need to do so eventually arises.
- New tests are in `spec/04-invalidations` and are written in a more efficient manner than the previously existing tests, avoiding too many "sleep" calls (NOP) to wait for events propagation. This is thanks to the new design, and will be improved with a better upcoming IPC module ensuring inter-worker consistency for each request later on.

Path forward
This should be reviewed and heavily tested. There should be some more tests written, especially around the `cluster_events` module itself. Particular attention should be given to the invalidation of core entities, and attention given there would be greatly appreciated.

Reviews and comments are welcome, questions as well! Local testing of this would be extremely helpful.
Regarding CI, the Travis builds might still fail at this point, partly for unrelated reasons (non-deterministic tests), partly because this is still in a "polishing" state.
Here is a small list of leftover todos:

- `quiet` option of the DAO (they are currently commented out for linting reasons)
- `mlcache` module