-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hotfix(cache): removes deadlocks when cache callbacks error #2197
Conversation
Main cause is exiting early on error conditions, when the lock has not yet been released. It occurs both in the `get_or_set` method itself as well as in the callbacks all over the code base.
kong/tools/database_cache.lua
Outdated
|
||
local ok, err = _M.set(key, value[1], ttl) | ||
if not ok then | ||
ngx_log(ngx.ERR, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error message needs some context:
ngx_log(ngx.ERR, "[db cache] could not set value from cache cb: ", err)
kong/tools/database_cache.lua
Outdated
value = pack(cb(...)) | ||
|
||
if value[1] ~= nil then | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line
kong/tools/database_cache.lua
Outdated
@@ -4,6 +4,9 @@ local json_decode = require("cjson.safe").decode | |||
local cache = ngx.shared.cache | |||
local ngx_log = ngx.log | |||
local gettime = ngx.now | |||
local utils = require "kong.tools.utils" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By convention across the entire code base, requires go at the top.
kong/tools/database_cache.lua
Outdated
-- shm cache success | ||
local ok, lerr = lock:unlock() | ||
if not ok and lerr then | ||
ngx_log(ngx.ERR, "failed to unlock: ", lerr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those messages need more contexts as well:
ngx_log(ngx.ERR, "[db cache] failed to unlock mutex: ", lerr)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is why we often have such a function at the beginning of a module:
local function log(lvl, ...)
return ngx.log(lvl, "[module_name] ", ...)
end
kong/tools/database_cache.lua
Outdated
ngx_log(ngx.ERR, "failed to unlock: ", err) | ||
local ok, lerr = lock:unlock() | ||
if not ok and lerr then | ||
ngx_log(ngx.ERR, "failed to unlock: ", lerr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto:
ngx_log(ngx.ERR, "[db cache] failed to unlock mutex: ", lerr)
end | ||
|
||
return value | ||
return unpack(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few things:
- in places where we hit a lock error, we just
return
- not in the new places L.194 and L.188
- here on L.203, we might return
nil, nil, ...
All of these points are a problem because the way we check for errors after get_ot_set()
is like this:
local thing, err = cache.get_or_set(...)
if err then
return responses.send_HTTP_INTERNAL_SERVER_ERROR(err)
end
Hence, I feel like we should always return an error to the caller if such happens in get_or_set()
, instead of failing at a later point in the caller's code and make tracing of the error harder.
I think the main issue is the callers are not handling cache miss properly (missing entity from the DB hit), and they should, and lock issues should be reported internally only. But this is the fastest change we can do to address it right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thibaultcha Not sure I'm following your reasoning. Hence here's what I did and why;
in places where we hit a lock error, we just return
That is for creating and acquiring the lock, non-recoverable, so exit
not in the new places L.194 and L.188
We already got the info from the db, so recoverable, the current request can be completed, yet the error on the 'process level' (locking/setting) is properly logged => the error is a get_or_set
error
here on L.203, we might return nil, nil, ...
This allows the callback to return multiple return values in case of errors to distinguish between for example "return a nil + error" result or "terminate request with a 500" result => the error is a caller error and is dealt with by the caller.
There is one occasion where "nil+error_msg+status_code" is returned for example
Hence, I feel like we should always return an error to the caller if such happens in get_or_set(), instead of failing at a later point in the caller's code and make tracing of the error harder.
The way I wrote it now was to allow the caller to determine what error to return in what format, and to deal with it as well. Also for tracing, you want the error to be logged from the calling code (line numbers and function names), dealing with it from get_or_set
would make tracing a lot harder as you wouldn't know where it originated.
Makes sense? Did I miss something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows the callback to return multiple return values
got that, but now if a callback does not properly return an error, we have:
return nil, [nil]
And the caller does:
local _, err = get_or_set()
if err then -- nil, because this is the [nil] one returned by the get_or_set cb
end
-- we missed our chance to return HTTP 500 because the callback errored
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then still something further down it will go awry, because no data was returned. So it will surface as a bug anyway.
And this situation will always exist, as the callback should be able to return nil
for the data. And then return nil
(same as return nil, nil
) is a valid non-error response. So the only way to test for an error will always be on err
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thibaultcha verified all get_or_set
accesses again, fixed one occurrence where a wrong error was thrown/logged.
imo good to go now.
nil, load_consumer, conf.anonymous, true) | ||
if err then | ||
responses.send_HTTP_INTERNAL_SERVER_ERROR(err) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the issue I wanted to point out in our talk is more how plugins can sometimes expect the cached value to always be present without considering it might just not exist in the data store. Agreed, if we found a key credential in the datastore, we should have a Consumer attached to it as well.
In this case, maybe we want to throw a 500, or maybe we'd rather handle that case just like the missing credentials above, and return HTTP 403. So two things:
- Should we return HTTP 403 here if no Consumer was retrieved from the DB?
- Are we sure no other plugins is guilty of the behavior I described above? aka always assuming we'll retrieve an entity from
cache.get_or_set()
?
Some more background on the actual issue if such a case happens in this plugin:
If the cache.get_or_set
call returns nil, nil
, then the set_consumer
function will error out when trying to index consumer.id
. This will throw a Lua error, bother the Kong user, and return an HTTP 500 to the client making this call. Instead, see L.108 to L.110, where the plugin handles the case where no credential was found in the DB (request provided with authentication, but such authentication is invalid), and thus, replies HTTP 403 to the client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we return HTTP 403 here if no Consumer was retrieved from the DB?
This case is not a 403, but a 500. A consumer was configured as "anonymous" but with a bad uuid (was never there), or with a consumer that was later deleted.
Hence a configuration error, and hence a 500 imo. This is also the quicker route to the maintainers that something is wrong with the configuration
Are we sure no other plugins is guilty of the behavior I described above? aka always assuming we'll retrieve an entity from
cache.get_or_set()
?
All callbacks properly return the error, that's what I checked. But then still, this behaviour is a contract that the caller itself created between the callback he provided and his result handling code. It is not a responsibility of the caching mechanism to handle this.
Anyway, this is probably going away with an lru cache anyway, the way we discussed it.
Can you help merge the fix to kong 0.9.x |
Main cause is exiting early on error conditions, when the lock has not
yet been released. It occurs both in the
get_or_set
method itself aswell as in the callbacks all over the code base.
Issues resolved
Fix #2186