Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove scheduled routing #11776

Merged
merged 1 commit into from Jun 23, 2015
Merged

Conversation

kimchy
Copy link
Member

@kimchy kimchy commented Jun 19, 2015

Today, we have scheduled reroute that kicks every 10 seconds and checks if a
reroute is needed. We use it when adding nodes, since we don't reroute right
away once its added, and give it a time window to add additional nodes.

We do have recover after nodes setting and such in order to wait for enough
nodes to be added, and also, it really depends at what part of the 10s window
you end up, sometimes, it might not be effective at all. In general, its historic
from the times before we had recover after nodes and such.

This change removes the 10s scheduling, simplifies RoutingService, and adds
explicit reroute when a node is added to the system. It also adds unit tests
to RoutingService.

@@ -365,7 +365,6 @@ private static Settings getRandomNodeSettings(long seed) {
Random random = new Random(seed);
Builder builder = Settings.settingsBuilder()
// decrease the routing schedule so new nodes will be added quickly - some random value between 30 and 80 ms
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment can go away..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, missed it..

@bleskes
Copy link
Contributor

bleskes commented Jun 19, 2015

I think it's awesome how easy this turned up to be. I looked at all the places we submit a cluster state update task and didn't see any place we miss a reroute (and would rely on the 10s) apart from the node join we already knew about.

Made some small suggestions.

this.schedule = settings.getAsTime("cluster.routing.schedule", timeValueSeconds(10));
clusterService.addFirst(this);
if (clusterService != null) {
clusterService.addFirst(this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at some point we should really add these ourside of the classes itself IMO. but that is totally unrelated

@kimchy
Copy link
Member Author

kimchy commented Jun 19, 2015

@bleskes I pushed a first review round, I don't like the suggestion on the atomic reference one, I don't think it adds a lot of value, on the other hand, I have an idea on what would, but its a bigger change, will see if I can work on it and if it stays small

@bleskes
Copy link
Contributor

bleskes commented Jun 22, 2015

LGTM. I'm +1 on 1.7 - it makes things easier to trace, reason about and debug.

@s1monw
Copy link
Contributor

s1monw commented Jun 23, 2015

I am ok with this too... this entire circular dep stuff is just really bad but lets move on

Today, we have scheduled reroute that kicks every 10 seconds and checks if a
reroute is needed. We use it when adding nodes, since we don't reroute right
away once its added, and give it a time window to add additional nodes.

We do have recover after nodes setting and such in order to wait for enough
nodes to be added, and also, it really depends at what part of the 10s window
you end up, sometimes, it might not be effective at all. In general, its historic
from the times before we had recover after nodes and such.

This change removes the 10s scheduling, simplifies RoutingService, and adds
explicit reroute when a node is added to the system. It also adds unit tests
to RoutingService.

closes elastic#11776
@kimchy kimchy added the v1.7.0 label Jun 23, 2015
@kimchy kimchy merged commit 435ce7f into elastic:master Jun 23, 2015
@kevinkluge kevinkluge removed the review label Jun 23, 2015
kimchy added a commit that referenced this pull request Jun 23, 2015
Today, we have scheduled reroute that kicks every 10 seconds and checks if a
reroute is needed. We use it when adding nodes, since we don't reroute right
away once its added, and give it a time window to add additional nodes.

We do have recover after nodes setting and such in order to wait for enough
nodes to be added, and also, it really depends at what part of the 10s window
you end up, sometimes, it might not be effective at all. In general, its historic
from the times before we had recover after nodes and such.

This change removes the 10s scheduling, simplifies RoutingService, and adds
explicit reroute when a node is added to the system. It also adds unit tests
to RoutingService.

closes #11776
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jun 30, 2015
elastic#11776 has simplified our rerouting logic by removing a scheduled background reroute in favor of an explicit reroute during the cluster state processing of a node join (the only place where we didn't do it explicitly). While that change is conceptually good, it change semantics a bit in two ways:

 - shard listing actions underpinning shard allocation do not have access to that new node yet (causing errors during shard allocation see elastic#11923
 - the very first cluster state published to a node already has shard assignments to it. This surfaced other issues we are working to fix separately

 This commit changes the reroute to be done post processing the initial join cluster state to side step these issues while we work on a longer term solution.
@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Cluster labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >enhancement v1.7.0 v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants