Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(runloop) prefetch services when rebuilding router #4101

Closed
wants to merge 1 commit into from

Conversation

p0pr0ck5
Copy link
Contributor

Summary

Trade space for time to improve performance when rebuilding the router by loading the service associated with a given route from a local table, rather than requiring a separate query for each route in the cluster. This reduces the number of queries needed to rebuild the router with N routes from N to roughly N / 1000 * 2.

This is a proof-of-concept commit that needs further discussion (for example, should we cache services in memory over a longer lifetime, rather than using a function-local table for potentially several thousand objects).

Full changelog

  • prefetch services when rebuilding router

Trade space for time to improve performance when rebuilding the
router by loading the service associated with a given route from
a local table, rather than requiring a separate query for each
route in the cluster. This reduces the number of queries needed to
rebuild the router with N routes from N to roughly N / 1000 * 2.

This is a proof-of-concept commit that needs further discussion
(for example, should we cache services in memory over a longer lifetime,
rather than using a function-local table for potentially several
thousand objects).
@bungle bungle added the pr/discussion This PR is being debated. Probably just a few details. label Dec 16, 2018
@p0pr0ck5
Copy link
Contributor Author

Given the comments in #4055 (comment), I'm changing the label to this to 'please review'. I think this is a valid change that should be pushed as part of 1.0 as it involves no additional necessary configuration and no breaking client change.

@p0pr0ck5 p0pr0ck5 added pr/please review and removed pr/discussion This PR is being debated. Probably just a few details. labels Dec 17, 2018
@bungle
Copy link
Member

bungle commented Dec 17, 2018

Why not use caching strategy from #4102 instead?

@bungle
Copy link
Member

bungle commented Dec 17, 2018

TBH, I think deciding on these these will be post-1.0.0. The issue of 500/401 is already (mostly) solved.

@p0pr0ck5
Copy link
Contributor Author

The goal here was to implement an improvement with as little behavioral change as possible, to be tested in accordance with the traffic patterns noted in #4055. Caching Services potentially imbues non-trivial memory pressure on a given node, depending on the size of the cluster config; I was hoping to have this change tested with as little divergence from current/RC4 code as possible. But definitely, caching object has value if it's not too big a burden to bear in the shm. I've no strong feelings about caching Service objects in-memory vs fetching them, but tbvh I don't have the bandwidth the implement and test a larger changeset at this point. Hopefully someone can review this and help drive it forward into 1.0. I'll be happy to help with a pair of hands where and when i can :)

@p0pr0ck5
Copy link
Contributor Author

The issue of 500/401 is already solved.

I'm not sure that it is, per #4055. The latest test results from @jeremyjpj0916 for code that is released (available as of #4084; see #4055 (comment)) indicated that the constant-500 errors were solved (due to the incorrect mutex handling), but potentially-sporadic 500 responses are still very possible (see #4055 (comment)). This PR attempts to solve that problem and that problem alone - future optimizations (timer-based rebuild, configurable consistency, etc) are out of scope here.

@bungle
Copy link
Member

bungle commented Dec 17, 2018

We already presented a maybe theoretical problem with paging on routes. Your PR introduces another (possibly theoretical) issue: think about new route and service being added while building a router using paging. You get new route in results, but the related service is not found in your gathered services, so your code will return an error. The caching does not have that problem.

@p0pr0ck5
Copy link
Contributor Author

Iterative results. This approach is a pragmatic- send less expensive network traffic where we don't need. In odd cases where many Routes are added near-simultaneously, the router likely breaks as-is. This PR doesn't try to change this; it just makes talking to the DB (and in particular, strongly-consistent C* installations, the behavior of which is easily forgotten during development) less expensive. Nothing more, nothing less.

@bungle
Copy link
Member

bungle commented Dec 17, 2018

This PR changes that as it makes hitting here more possible:
https://github.com/Kong/kong/pull/4101/files#diff-1f0f137350308286de76ec59740c8edaR150

At least you could add a fallback to normal select in that case.

@bungle
Copy link
Member

bungle commented Dec 17, 2018

As I am obviously biased as I would merge #4102 or #4107 (similar to what we do with plugins already) instead of this, so I ask others to do review and possible merge (I do not have anything against it, though I do not see why it cannot wait).

@p0pr0ck5
Copy link
Contributor Author

This will get handled elsewhere

@p0pr0ck5 p0pr0ck5 deleted the perf/prefetch-router-services branch January 11, 2019 00:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants