-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf(runloop) prefetch services when rebuilding router #4101
Conversation
Trade space for time to improve performance when rebuilding the router by loading the service associated with a given route from a local table, rather than requiring a separate query for each route in the cluster. This reduces the number of queries needed to rebuild the router with N routes from N to roughly N / 1000 * 2. This is a proof-of-concept commit that needs further discussion (for example, should we cache services in memory over a longer lifetime, rather than using a function-local table for potentially several thousand objects).
Given the comments in #4055 (comment), I'm changing the label to this to 'please review'. I think this is a valid change that should be pushed as part of 1.0 as it involves no additional necessary configuration and no breaking client change. |
Why not use |
TBH, I think deciding on these these will be |
The goal here was to implement an improvement with as little behavioral change as possible, to be tested in accordance with the traffic patterns noted in #4055. Caching Services potentially imbues non-trivial memory pressure on a given node, depending on the size of the cluster config; I was hoping to have this change tested with as little divergence from current/RC4 code as possible. But definitely, caching object has value if it's not too big a burden to bear in the shm. I've no strong feelings about caching Service objects in-memory vs fetching them, but tbvh I don't have the bandwidth the implement and test a larger changeset at this point. Hopefully someone can review this and help drive it forward into 1.0. I'll be happy to help with a pair of hands where and when i can :) |
I'm not sure that it is, per #4055. The latest test results from @jeremyjpj0916 for code that is released (available as of #4084; see #4055 (comment)) indicated that the constant-500 errors were solved (due to the incorrect mutex handling), but potentially-sporadic 500 responses are still very possible (see #4055 (comment)). This PR attempts to solve that problem and that problem alone - future optimizations (timer-based rebuild, configurable consistency, etc) are out of scope here. |
We already presented a maybe theoretical problem with paging on |
Iterative results. This approach is a pragmatic- send less expensive network traffic where we don't need. In odd cases where many Routes are added near-simultaneously, the router likely breaks as-is. This PR doesn't try to change this; it just makes talking to the DB (and in particular, strongly-consistent C* installations, the behavior of which is easily forgotten during development) less expensive. Nothing more, nothing less. |
This PR changes that as it makes hitting here more possible: At least you could add a fallback to normal select in that case. |
This will get handled elsewhere |
Summary
Trade space for time to improve performance when rebuilding the router by loading the service associated with a given route from a local table, rather than requiring a separate query for each route in the cluster. This reduces the number of queries needed to rebuild the router with N routes from N to roughly N / 1000 * 2.
This is a proof-of-concept commit that needs further discussion (for example, should we cache services in memory over a longer lifetime, rather than using a function-local table for potentially several thousand objects).
Full changelog