index.unassigned.node_left.delayed_timeout not working stably in 1.7 #12566

Closed
mkliu opened this issue Jul 30, 2015 · 7 comments

mkliu commented Jul 30, 2015

For example in the case below (data retrieved from _cluster/health)
Right after I kill the node:

{
  "cluster_name" : "essandbox-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 19,
  "number_of_data_nodes" : 13,
  "active_primary_shards" : 783,
  "active_shards" : 1480,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 86,
  "delayed_unassigned_shards" : 86,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

I set the timeout to 30s. The node is back around 10s later, but shards only gradually start recovering ~1.5 min later, and not at the speed I'm expecting. I also don't know why there are relocating_shards.
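
(For reference, this timeout is an index-level setting; a minimal sketch of how it would be set on all indices through the index settings API follows. The use of _all is illustrative, not my exact command:)

curl -XPUT 'localhost:9200/_all/_settings' -d '
{
  "index.unassigned.node_left.delayed_timeout": "30s"
}'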

Worse, sometimes after a while it looks as if recovery has stopped entirely, and I have to manually reroute the unassigned shards.

{
  "cluster_name" : "essandbox-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 14,
  "active_primary_shards" : 783,
  "active_shards" : 1482,
  "relocating_shards" : 2,
  "initializing_shards" : 1,
  "unassigned_shards" : 83,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

dakrone commented Jul 30, 2015

@mkliu when the node left the cluster, did you see the log message about the delay in the ES logs? It should look like:

delaying allocation for [N] unassigned shards, next check in [Ns]

(where N is a number). Can you paste what it says?


mkliu commented Jul 30, 2015

July 30th 2015, 11:13:47.609    essandbox-cluster   INFO    [xxx] delaying allocation for [86] unassigned shards, next check in [29.1s]
July 30th 2015, 11:14:19.007    essandbox-cluster   INFO    [xxx] delaying allocation for [0] unassigned shards, next check in [0s]
July 30th 2015, 11:14:19.580    essandbox-cluster   INFO    [xxx] delaying allocation for [0] unassigned shards, next check in [0s]


dakrone commented Jul 30, 2015

@mkliu according to the timestamps it looks like it did do the reroute at the correct time (13:47, then ~30 seconds later at 14:19).

The log message is confusing and will be fixed by #12532


mkliu commented Jul 30, 2015

@dakrone hmm, it's actually not doing the reroute. As described in the first post, I had to manually kick-start it in the end. The

  "delaying allocation for [0] unassigned shards"

message goes on and on and on.
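
(For context, the manual kick-start was a cluster reroute. A minimal sketch of allocating one stuck replica with the 1.x reroute API, where the index name, shard number, and node name are placeholders:)

curl -XPOST 'localhost:9200/_cluster/reroute' -d '
{
  "commands": [
    {
      "allocate": {
        "index": "my_index",
        "shard": 0,
        "node": "some-data-node"
      }
    }
  ]
}'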


dakrone commented Aug 5, 2015

@mkliu can you increase the logging level for your cluster to DEBUG and make the master log available so I can take a look?
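
(One way to do that without a restart is the dynamic logger setting in the cluster settings API; the logger name below is only a guess at the relevant package, and editing logging.yml on the master works as well:)

curl -XPUT 'localhost:9200/_cluster/settings' -d '
{
  "transient": {
    "logger.cluster.routing.allocation": "DEBUG"
  }
}'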


dakrone commented Oct 30, 2015

I think this may have been fixed by #12678. @mkliu can you confirm?

@clintongormley

No further feedback. Closing

@lcawl added the :Distributed/Distributed label and removed the :Allocation label on Feb 13, 2018
@clintongormley added the :Distributed/Allocation label and removed the :Distributed/Distributed label on Feb 14, 2018