New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove scheduled routing #11776
Remove scheduled routing #11776
Conversation
@@ -365,7 +365,6 @@ private static Settings getRandomNodeSettings(long seed) { | |||
Random random = new Random(seed); | |||
Builder builder = Settings.settingsBuilder() | |||
// decrease the routing schedule so new nodes will be added quickly - some random value between 30 and 80 ms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment can go away..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, missed it..
I think it's awesome how easy this turned up to be. I looked at all the places we submit a cluster state update task and didn't see any place we miss a reroute (and would rely on the 10s) apart from the node join we already knew about. Made some small suggestions. |
this.schedule = settings.getAsTime("cluster.routing.schedule", timeValueSeconds(10)); | ||
clusterService.addFirst(this); | ||
if (clusterService != null) { | ||
clusterService.addFirst(this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at some point we should really add these ourside of the classes itself IMO. but that is totally unrelated
@bleskes I pushed a first review round, I don't like the suggestion on the atomic reference one, I don't think it adds a lot of value, on the other hand, I have an idea on what would, but its a bigger change, will see if I can work on it and if it stays small |
LGTM. I'm +1 on 1.7 - it makes things easier to trace, reason about and debug. |
560116d
to
3a85874
Compare
I am ok with this too... this entire circular dep stuff is just really bad but lets move on |
Today, we have scheduled reroute that kicks every 10 seconds and checks if a reroute is needed. We use it when adding nodes, since we don't reroute right away once its added, and give it a time window to add additional nodes. We do have recover after nodes setting and such in order to wait for enough nodes to be added, and also, it really depends at what part of the 10s window you end up, sometimes, it might not be effective at all. In general, its historic from the times before we had recover after nodes and such. This change removes the 10s scheduling, simplifies RoutingService, and adds explicit reroute when a node is added to the system. It also adds unit tests to RoutingService. closes elastic#11776
3a85874
to
435ce7f
Compare
Today, we have scheduled reroute that kicks every 10 seconds and checks if a reroute is needed. We use it when adding nodes, since we don't reroute right away once its added, and give it a time window to add additional nodes. We do have recover after nodes setting and such in order to wait for enough nodes to be added, and also, it really depends at what part of the 10s window you end up, sometimes, it might not be effective at all. In general, its historic from the times before we had recover after nodes and such. This change removes the 10s scheduling, simplifies RoutingService, and adds explicit reroute when a node is added to the system. It also adds unit tests to RoutingService. closes #11776
elastic#11776 has simplified our rerouting logic by removing a scheduled background reroute in favor of an explicit reroute during the cluster state processing of a node join (the only place where we didn't do it explicitly). While that change is conceptually good, it change semantics a bit in two ways: - shard listing actions underpinning shard allocation do not have access to that new node yet (causing errors during shard allocation see elastic#11923 - the very first cluster state published to a node already has shard assignments to it. This surfaced other issues we are working to fix separately This commit changes the reroute to be done post processing the initial join cluster state to side step these issues while we work on a longer term solution.
Today, we have scheduled reroute that kicks every 10 seconds and checks if a
reroute is needed. We use it when adding nodes, since we don't reroute right
away once its added, and give it a time window to add additional nodes.
We do have recover after nodes setting and such in order to wait for enough
nodes to be added, and also, it really depends at what part of the 10s window
you end up, sometimes, it might not be effective at all. In general, its historic
from the times before we had recover after nodes and such.
This change removes the 10s scheduling, simplifies RoutingService, and adds
explicit reroute when a node is added to the system. It also adds unit tests
to RoutingService.