Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hotfix(cache): removes deadlocks when cache callbacks error #2197

Merged
merged 3 commits into from
Mar 21, 2017

Conversation

Tieske
Copy link
Member

@Tieske Tieske commented Mar 13, 2017

Main cause is exiting early on error conditions, when the lock has not
yet been released. It occurs both in the get_or_set method itself as
well as in the callbacks all over the code base.

Issues resolved

Fix #2186

Main cause is exiting early on error conditions, when the lock has not
yet been released. It occurs both in the `get_or_set` method itself as
well as in the callbacks all over the code base.
@Tieske Tieske mentioned this pull request Mar 13, 2017

local ok, err = _M.set(key, value[1], ttl)
if not ok then
ngx_log(ngx.ERR, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error message needs some context:

ngx_log(ngx.ERR, "[db cache] could not set value from cache cb: ", err)

value = pack(cb(...))

if value[1] ~= nil then

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line

@@ -4,6 +4,9 @@ local json_decode = require("cjson.safe").decode
local cache = ngx.shared.cache
local ngx_log = ngx.log
local gettime = ngx.now
local utils = require "kong.tools.utils"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By convention across the entire code base, requires go at the top.

-- shm cache success
local ok, lerr = lock:unlock()
if not ok and lerr then
ngx_log(ngx.ERR, "failed to unlock: ", lerr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those messages need more contexts as well:

ngx_log(ngx.ERR, "[db cache] failed to unlock mutex: ", lerr)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why we often have such a function at the beginning of a module:

local function log(lvl, ...)
  return ngx.log(lvl, "[module_name] ", ...)
end

ngx_log(ngx.ERR, "failed to unlock: ", err)
local ok, lerr = lock:unlock()
if not ok and lerr then
ngx_log(ngx.ERR, "failed to unlock: ", lerr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto:

ngx_log(ngx.ERR, "[db cache] failed to unlock mutex: ", lerr)

end

return value
return unpack(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things:

  • in places where we hit a lock error, we just return
  • not in the new places L.194 and L.188
  • here on L.203, we might return nil, nil, ...

All of these points are a problem because the way we check for errors after get_ot_set() is like this:

local thing, err = cache.get_or_set(...)
if err then
  return responses.send_HTTP_INTERNAL_SERVER_ERROR(err)
end

Hence, I feel like we should always return an error to the caller if such happens in get_or_set(), instead of failing at a later point in the caller's code and make tracing of the error harder.

I think the main issue is the callers are not handling cache miss properly (missing entity from the DB hit), and they should, and lock issues should be reported internally only. But this is the fastest change we can do to address it right now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thibaultcha Not sure I'm following your reasoning. Hence here's what I did and why;

in places where we hit a lock error, we just return

That is for creating and acquiring the lock, non-recoverable, so exit

not in the new places L.194 and L.188

We already got the info from the db, so recoverable, the current request can be completed, yet the error on the 'process level' (locking/setting) is properly logged => the error is a get_or_set error

here on L.203, we might return nil, nil, ...

This allows the callback to return multiple return values in case of errors to distinguish between for example "return a nil + error" result or "terminate request with a 500" result => the error is a caller error and is dealt with by the caller.
There is one occasion where "nil+error_msg+status_code" is returned for example

Hence, I feel like we should always return an error to the caller if such happens in get_or_set(), instead of failing at a later point in the caller's code and make tracing of the error harder.

The way I wrote it now was to allow the caller to determine what error to return in what format, and to deal with it as well. Also for tracing, you want the error to be logged from the calling code (line numbers and function names), dealing with it from get_or_set would make tracing a lot harder as you wouldn't know where it originated.

Makes sense? Did I miss something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows the callback to return multiple return values

got that, but now if a callback does not properly return an error, we have:

return nil, [nil]

And the caller does:

local _, err = get_or_set()
if err then -- nil, because this is the [nil] one returned by the get_or_set cb

end

-- we missed our chance to return HTTP 500 because the callback errored

Copy link
Member Author

@Tieske Tieske Mar 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then still something further down it will go awry, because no data was returned. So it will surface as a bug anyway.

And this situation will always exist, as the callback should be able to return nil for the data. And then return nil (same as return nil, nil) is a valid non-error response. So the only way to test for an error will always be on err.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thibaultcha verified all get_or_set accesses again, fixed one occurrence where a wrong error was thrown/logged.

imo good to go now.

@thibaultcha thibaultcha added the pr/changes requested Changes were requested to this PR by a maintainer. Please address them and ping back once done. label Mar 13, 2017
@thibaultcha thibaultcha added pr/status/needs review and removed pr/changes requested Changes were requested to this PR by a maintainer. Please address them and ping back once done. labels Mar 14, 2017
nil, load_consumer, conf.anonymous, true)
if err then
responses.send_HTTP_INTERNAL_SERVER_ERROR(err)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the issue I wanted to point out in our talk is more how plugins can sometimes expect the cached value to always be present without considering it might just not exist in the data store. Agreed, if we found a key credential in the datastore, we should have a Consumer attached to it as well.

In this case, maybe we want to throw a 500, or maybe we'd rather handle that case just like the missing credentials above, and return HTTP 403. So two things:

  1. Should we return HTTP 403 here if no Consumer was retrieved from the DB?
  2. Are we sure no other plugins is guilty of the behavior I described above? aka always assuming we'll retrieve an entity from cache.get_or_set()?

Some more background on the actual issue if such a case happens in this plugin:

If the cache.get_or_set call returns nil, nil, then the set_consumer function will error out when trying to index consumer.id. This will throw a Lua error, bother the Kong user, and return an HTTP 500 to the client making this call. Instead, see L.108 to L.110, where the plugin handles the case where no credential was found in the DB (request provided with authentication, but such authentication is invalid), and thus, replies HTTP 403 to the client.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return HTTP 403 here if no Consumer was retrieved from the DB?

This case is not a 403, but a 500. A consumer was configured as "anonymous" but with a bad uuid (was never there), or with a consumer that was later deleted.

Hence a configuration error, and hence a 500 imo. This is also the quicker route to the maintainers that something is wrong with the configuration

Are we sure no other plugins is guilty of the behavior I described above? aka always assuming we'll retrieve an entity from cache.get_or_set()?

All callbacks properly return the error, that's what I checked. But then still, this behaviour is a contract that the caller itself created between the callback he provided and his result handling code. It is not a responsibility of the caching mechanism to handle this.

Anyway, this is probably going away with an lru cache anyway, the way we discussed it.

@thibaultcha thibaultcha merged commit 5c7e511 into master Mar 21, 2017
@thibaultcha thibaultcha deleted the fix/cachelocks branch March 21, 2017 17:47
@rkt2spc
Copy link

rkt2spc commented Dec 25, 2017

Can you help merge the fix to kong 0.9.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cache lock problem
3 participants