hotfix(cache): removes deadlocks when cache callbacks error #2197

Tieske · 2017-03-13T13:26:59Z

Main cause is exiting early on error conditions, when the lock has not
yet been released. It occurs both in the get_or_set method itself as
well as in the callbacks all over the code base.

Issues resolved

Fix #2186

Main cause is exiting early on error conditions, when the lock has not yet been released. It occurs both in the `get_or_set` method itself as well as in the callbacks all over the code base.

thibaultcha · 2017-03-13T18:56:25Z

kong/tools/database_cache.lua

+
+    local ok, err = _M.set(key, value[1], ttl)
+    if not ok then
+      ngx_log(ngx.ERR, err)


error message needs some context:

ngx_log(ngx.ERR, "[db cache] could not set value from cache cb: ", err)

thibaultcha · 2017-03-13T18:56:36Z

kong/tools/database_cache.lua

+  value = pack(cb(...))
+
+  if value[1] ~= nil then
+


thibaultcha · 2017-03-13T19:00:43Z

kong/tools/database_cache.lua

@@ -4,6 +4,9 @@ local json_decode = require("cjson.safe").decode
 local cache = ngx.shared.cache
 local ngx_log = ngx.log
 local gettime = ngx.now
+local utils = require "kong.tools.utils"


By convention across the entire code base, requires go at the top.

thibaultcha · 2017-03-13T19:01:28Z

kong/tools/database_cache.lua

+    -- shm cache success
+    local ok, lerr = lock:unlock()
+    if not ok and lerr then
+      ngx_log(ngx.ERR, "failed to unlock: ", lerr)


Those messages need more contexts as well:

ngx_log(ngx.ERR, "[db cache] failed to unlock mutex: ", lerr)

This is why we often have such a function at the beginning of a module:

local function log(lvl, ...) return ngx.log(lvl, "[module_name] ", ...) end

thibaultcha · 2017-03-13T19:01:39Z

kong/tools/database_cache.lua

-    ngx_log(ngx.ERR, "failed to unlock: ", err)
+  local ok, lerr = lock:unlock()
+  if not ok and lerr then
+    ngx_log(ngx.ERR, "failed to unlock: ", lerr)


ditto:

ngx_log(ngx.ERR, "[db cache] failed to unlock mutex: ", lerr)

thibaultcha · 2017-03-13T19:17:19Z

kong/tools/database_cache.lua

  end

-  return value
+  return unpack(value)


A few things:

in places where we hit a lock error, we just return

not in the new places L.194 and L.188

here on L.203, we might return nil, nil, ...

All of these points are a problem because the way we check for errors after get_ot_set() is like this:

local thing, err = cache.get_or_set(...) if err then return responses.send_HTTP_INTERNAL_SERVER_ERROR(err) end

Hence, I feel like we should always return an error to the caller if such happens in get_or_set(), instead of failing at a later point in the caller's code and make tracing of the error harder.

I think the main issue is the callers are not handling cache miss properly (missing entity from the DB hit), and they should, and lock issues should be reported internally only. But this is the fastest change we can do to address it right now.

@thibaultcha Not sure I'm following your reasoning. Hence here's what I did and why;

in places where we hit a lock error, we just return

That is for creating and acquiring the lock, non-recoverable, so exit

not in the new places L.194 and L.188

We already got the info from the db, so recoverable, the current request can be completed, yet the error on the 'process level' (locking/setting) is properly logged => the error is a get_or_set error

here on L.203, we might return nil, nil, ...

This allows the callback to return multiple return values in case of errors to distinguish between for example "return a nil + error" result or "terminate request with a 500" result => the error is a caller error and is dealt with by the caller.
There is one occasion where "nil+error_msg+status_code" is returned for example

Hence, I feel like we should always return an error to the caller if such happens in get_or_set(), instead of failing at a later point in the caller's code and make tracing of the error harder.

The way I wrote it now was to allow the caller to determine what error to return in what format, and to deal with it as well. Also for tracing, you want the error to be logged from the calling code (line numbers and function names), dealing with it from get_or_set would make tracing a lot harder as you wouldn't know where it originated.

Makes sense? Did I miss something?

This allows the callback to return multiple return values

got that, but now if a callback does not properly return an error, we have:

return nil, [nil]

And the caller does:

local _, err = get_or_set() if err then -- nil, because this is the [nil] one returned by the get_or_set cb end -- we missed our chance to return HTTP 500 because the callback errored

But then still something further down it will go awry, because no data was returned. So it will surface as a bug anyway.

And this situation will always exist, as the callback should be able to return nil for the data. And then return nil (same as return nil, nil) is a valid non-error response. So the only way to test for an error will always be on err.

@thibaultcha verified all get_or_set accesses again, fixed one occurrence where a wrong error was thrown/logged.

imo good to go now.

thibaultcha · 2017-03-20T20:52:07Z

kong/plugins/key-auth/handler.lua

+                            nil, load_consumer, conf.anonymous, true)
+      if err then
+        responses.send_HTTP_INTERNAL_SERVER_ERROR(err)
+      end


I guess the issue I wanted to point out in our talk is more how plugins can sometimes expect the cached value to always be present without considering it might just not exist in the data store. Agreed, if we found a key credential in the datastore, we should have a Consumer attached to it as well.

In this case, maybe we want to throw a 500, or maybe we'd rather handle that case just like the missing credentials above, and return HTTP 403. So two things:

Should we return HTTP 403 here if no Consumer was retrieved from the DB?

Are we sure no other plugins is guilty of the behavior I described above? aka always assuming we'll retrieve an entity from cache.get_or_set()?

Some more background on the actual issue if such a case happens in this plugin:

If the cache.get_or_set call returns nil, nil, then the set_consumer function will error out when trying to index consumer.id. This will throw a Lua error, bother the Kong user, and return an HTTP 500 to the client making this call. Instead, see L.108 to L.110, where the plugin handles the case where no credential was found in the DB (request provided with authentication, but such authentication is invalid), and thus, replies HTTP 403 to the client.

Should we return HTTP 403 here if no Consumer was retrieved from the DB?

This case is not a 403, but a 500. A consumer was configured as "anonymous" but with a bad uuid (was never there), or with a consumer that was later deleted.

Hence a configuration error, and hence a 500 imo. This is also the quicker route to the maintainers that something is wrong with the configuration

Are we sure no other plugins is guilty of the behavior I described above? aka always assuming we'll retrieve an entity from cache.get_or_set()?

All callbacks properly return the error, that's what I checked. But then still, this behaviour is a contract that the caller itself created between the callback he provided and his result handling code. It is not a responsibility of the caching mechanism to handle this.

Anyway, this is probably going away with an lru cache anyway, the way we discussed it.

rkt2spc · 2017-12-25T06:28:45Z

Can you help merge the fix to kong 0.9.x

fix(cache): removes deadlocks when cache callbacks error

f484793

Main cause is exiting early on error conditions, when the lock has not yet been released. It occurs both in the `get_or_set` method itself as well as in the callbacks all over the code base.

Tieske mentioned this pull request Mar 13, 2017

Cache lock problem #2186

Closed

thibaultcha requested changes Mar 13, 2017

View reviewed changes

thibaultcha added the pr/changes requested Changes were requested to this PR by a maintainer. Please address them and ping back once done. label Mar 13, 2017

reworked review comments

9c973d1

thibaultcha added pr/status/needs review and removed pr/changes requested Changes were requested to this PR by a maintainer. Please address them and ping back once done. labels Mar 14, 2017

Tieske mentioned this pull request Mar 14, 2017

failed to acquire cache lock and failed to unlock #2166

Closed

fix(plugin): properly log the 500 error

3ddbadd

thibaultcha reviewed Mar 20, 2017

View reviewed changes

thibaultcha approved these changes Mar 21, 2017

View reviewed changes

thibaultcha merged commit 5c7e511 into master Mar 21, 2017

thibaultcha deleted the fix/cachelocks branch March 21, 2017 17:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hotfix(cache): removes deadlocks when cache callbacks error #2197

hotfix(cache): removes deadlocks when cache callbacks error #2197

Tieske commented Mar 13, 2017

thibaultcha Mar 13, 2017

thibaultcha Mar 13, 2017

thibaultcha Mar 13, 2017

thibaultcha Mar 13, 2017

thibaultcha Mar 13, 2017

thibaultcha Mar 13, 2017

thibaultcha Mar 13, 2017

Tieske Mar 13, 2017

thibaultcha Mar 13, 2017

Tieske Mar 14, 2017 •

edited

Loading

Tieske Mar 20, 2017

thibaultcha Mar 20, 2017

Tieske Mar 21, 2017

rkt2spc commented Dec 25, 2017

hotfix(cache): removes deadlocks when cache callbacks error #2197

hotfix(cache): removes deadlocks when cache callbacks error #2197

Conversation

Tieske commented Mar 13, 2017

Issues resolved

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Tieske Mar 14, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rkt2spc commented Dec 25, 2017

Tieske Mar 14, 2017 •

edited

Loading