master: "CurrentState[RELOCATED] operation only allowed when started/recovering" (and three stuck shards) #22043

Closed
blinken opened this issue Dec 8, 2016 · 12 comments
Labels: discuss, :Distributed/Distributed, :Distributed/Recovery

Comments


blinken commented Dec 8, 2016

Hi,

Overnight our Elasticsearch 5.0.0 cluster stopped ingesting data. There seemed to be two (possibly related) faults.

Firstly, the master had a 47GB log file and was spamming these messages as fast as it could:

[2016-12-08T02:13:07,748][WARN ][o.e.i.s.IndexShard       ] [dal146] [logstash-na-runit-2016.12.07][4] failed to flush index
org.elasticsearch.index.shard.IllegalIndexShardStateException: CurrentState[RELOCATED] operation only allowed when started/recovering
        at org.elasticsearch.index.shard.IndexShard.verifyStartedOrRecovering(IndexShard.java:1136) ~[elasticsearch-5.0.0.jar:5.0.0]
        at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:764) ~[elasticsearch-5.0.0.jar:5.0.0]
        at org.elasticsearch.index.shard.IndexShard$2.doRun(IndexShard.java:1740) [elasticsearch-5.0.0.jar:5.0.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:504) [elasticsearch-5.0.0.jar:5.0.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.0.0.jar:5.0.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_92]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_92]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]

Note that our master-eligible nodes also hold data; however, the shard referenced is hosted on a different node (dal154), and there were no errors in the logs on that node. Disabling allocation, restarting dal154 and re-enabling allocation did not stop the master from printing these error messages. I then did the same disable/restart/enable cycle for the master (which caused it to fail over to another node), and that appears to have resolved the issue.
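
For reference, the disable/restart/re-enable cycle used the standard cluster settings API; a minimal sketch of the calls (the exact form we ran is from memory):

PUT /_cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.enable" : "none"
  }
}

# restart the node and wait for it to rejoin, then:

PUT /_cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.enable" : "all"
  }
}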

At the same time, we had three shards (in three different indices) which seemed to be stuck in an endless loop trying to relocate. Here's a screenshot of the Marvel interface (the stage would oscillate between "INIT" and "TRANSLOG" without the bytes/files indicators increasing):

[Screenshot: Marvel recovery view of the stuck shards, stage oscillating between INIT and TRANSLOG]
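
For reference, the same stage information can also be pulled without Marvel via the cat recovery API, e.g.:

# the stage/files/bytes columns correspond to what the Marvel view shows
GET /_cat/recovery/logstash-na-nginx-2016.12.06?v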

At the time, our cluster routing allocation settings required that these indices move on to {dal276,dal277}-* (cold storage).
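
For context, the cold-storage requirement is most likely an index-level allocation filter of roughly this shape (the attribute name store appears in the node attributes in the reroute error further down; the value cold is only illustrative):

# sketch only - attribute value not copied from our config
PUT /logstash-na-nginx-2016.12.06/_settings
{
  "index.routing.allocation.require.store" : "cold"
}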

Here are the entries for these shards from /_cluster/state/routing_table (I've annotated the node IDs with their hostnames):

          "14": [
            {
              "state": "INITIALIZING",
              "primary": false,
              "node": "3hrYNKELRq-KY8_ju3O65A",  # dal277-2
              "relocating_node": null,
              "shard": 14,
              "index": "logstash-na-nginx-2016.12.06",
              "expected_shard_size_in_bytes": 100899614438,
              "recovery_source": {
                "type": "PEER"
              },
              "allocation_id": {
                "id": "zfBTYzxSRE-iE27v2CErSg"
              }
            },
            {
              "state": "RELOCATING",
              "primary": true,
              "node": "6U_xqPDvSxKvGiWkRfxCyQ",  # dal155
              "relocating_node": "yvqmYP7CR1yXUE-04lKkDw",  # dal276-3
              "shard": 14,
              "index": "logstash-na-nginx-2016.12.06",
              "expected_shard_size_in_bytes": 100876123539,
              "allocation_id": {
                "id": "zBJA9XBwRIOaIRBDkNuwLA",
                "relocation_id": "ZXZWraThToeVPxJB-bnMGQ"
              }
            }

...

          "0": [
            {
              "state": "STARTED",
              "primary": false,
              "node": "3hrYNKELRq-KY8_ju3O65A",  # dal277-2
              "relocating_node": null,
              "shard": 0,
              "index": "logstash-na-nginx-2016.12.05",
              "allocation_id": {
                "id": "w8lY2i8BSqy7VMLIPsMUaw"
              }
            },
            {
              "state": "RELOCATING",
              "primary": true,
              "node": "IzU7DVD7TluamdfRn5XQ3A",  # dal140
              "relocating_node": "GN4AYurzS6SCpDFyKFPVUw",  # dal276-2
              "shard": 0,
              "index": "logstash-na-nginx-2016.12.05",
              "expected_shard_size_in_bytes": 97135123662,
              "allocation_id": {
                "id": "qQYAmpscQkq5xyZ0BgOy_w",
                "relocation_id": "FySZvSodQA2EB59xiN6KsA"
              }
            }
          ],

...

          "6": [
            {
              "state": "STARTED",
              "primary": false,
              "node": "k7SPcS4OTQ6vT63_SUbTqg",  # dal277-1
              "relocating_node": null,
              "shard": 6,
              "index": "logstash-na-nginx-2016.12.04",
              "allocation_id": {
                "id": "G3ohX-y1T5aOVU2LYjn25g"
              }
            },
            {
              "state": "RELOCATING",
              "primary": true,
              "node": "_IDNGuRgQ-iOZrk91r6JNQ",  # dal149
              "relocating_node": "GN4AYurzS6SCpDFyKFPVUw",  # dal276-2
              "shard": 6,
              "index": "logstash-na-nginx-2016.12.04",
              "expected_shard_size_in_bytes": 77842706635,
              "allocation_id": {
                "id": "qgs_lwAMQhyRynL91AsSwQ",
                "relocation_id": "vRpq5yK2S1y-kleA3rSsXA"
              }
            }
          ],

To fix these shards, I tried running the /_cluster/reroute "move" and "cancel" commands, but both the source and the destination nodes reported that they did not have the shard:

POST /_cluster/reroute
{
  "commands" : [ {
        "cancel" : {
            "index" : "logstash-na-runit-2016.12.05", 
            "shard" : "0",
            "node" : "dal140"
        }
      }
  ]
}

Typical response:

{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[dal142][199.231.78.147:9300][cluster:admin/reroute]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "[cancel_allocation] can't cancel 0, failed to find it on node {dal140}{IzU7DVD7TluamdfRn5XQ3A}{CJdG7tkARnizzBMRZW7tLw}{199.231.78.228}{199.231.78.228:9300}{host=dal140, store=hot}"
  },
  "status": 400
}

The same error occurred for node=dal276-2 and for the other two broken shards.
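
For completeness, the "move" attempts had roughly this shape (a sketch; the values shown correspond to one of the stuck shards above):

POST /_cluster/reroute
{
  "commands" : [ {
        "move" : {
            "index" : "logstash-na-nginx-2016.12.05",
            "shard" : 0,
            "from_node" : "dal140",
            "to_node" : "dal276-2"
        }
      }
  ]
}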

I set the whole cluster to DEBUG-level logging and captured the following on dal276-2:

[2016-12-08T03:08:20,436][DEBUG][o.e.i.t.Translog         ] [dal276-2] [logstash-na-nginx-2016.12.04][6] wipe translog location - creating new translog
[2016-12-08T03:08:20,437][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.04][6] no translog ID present in the current generation - creating one
[2016-12-08T03:08:20,722][DEBUG][o.e.i.f.p.ParentChildIndexFieldData] [dal276-2] [logstash-na-nginx-2016.12.05] global-ordinals [_parent] took [3.3micros]
[2016-12-08T03:08:20,725][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.05][0] close now acquiring writeLock
[2016-12-08T03:08:20,725][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.05][0] close acquired writeLock
[2016-12-08T03:08:20,725][DEBUG][o.e.i.t.Translog         ] [dal276-2] [logstash-na-nginx-2016.12.05][0] translog closed
[2016-12-08T03:08:20,731][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.05][0] engine closed [api]
[2016-12-08T03:08:20,811][DEBUG][o.e.i.f.p.ParentChildIndexFieldData] [dal276-2] [logstash-na-nginx-2016.12.04] global-ordinals [_parent] took [2.8micros]
[2016-12-08T03:08:20,815][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.04][6] close now acquiring writeLock
[2016-12-08T03:08:20,815][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.04][6] close acquired writeLock
[2016-12-08T03:08:20,815][DEBUG][o.e.i.t.Translog         ] [dal276-2] [logstash-na-nginx-2016.12.04][6] translog closed
[2016-12-08T03:08:20,820][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.04][6] engine closed [api]
[2016-12-08T03:08:21,301][DEBUG][o.e.i.t.Translog         ] [dal276-2] [logstash-na-nginx-2016.12.05][0] wipe translog location - creating new translog
[2016-12-08T03:08:21,302][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.05][0] no translog ID present in the current generation - creating one
[2016-12-08T03:08:21,385][DEBUG][o.e.i.t.Translog         ] [dal276-2] [logstash-na-nginx-2016.12.04][6] wipe translog location - creating new translog
[2016-12-08T03:08:21,386][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.04][6] no translog ID present in the current generation - creating one
[2016-12-08T03:08:21,730][DEBUG][o.e.i.f.p.ParentChildIndexFieldData] [dal276-2] [logstash-na-nginx-2016.12.05] global-ordinals [_parent] took [4.1micros]
[2016-12-08T03:08:21,733][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.05][0] close now acquiring writeLock
[2016-12-08T03:08:21,733][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.05][0] close acquired writeLock
[2016-12-08T03:08:21,733][DEBUG][o.e.i.t.Translog         ] [dal276-2] [logstash-na-nginx-2016.12.05][0] translog closed
[2016-12-08T03:08:21,739][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.05][0] engine closed [api]
[2016-12-08T03:08:21,791][DEBUG][o.e.i.f.p.ParentChildIndexFieldData] [dal276-2] [logstash-na-nginx-2016.12.04] global-ordinals [_parent] took [3micros]
[2016-12-08T03:08:21,794][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.04][6] close now acquiring writeLock
[2016-12-08T03:08:21,794][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.04][6] close acquired writeLock
[2016-12-08T03:08:21,795][DEBUG][o.e.i.t.Translog         ] [dal276-2] [logstash-na-nginx-2016.12.04][6] translog closed
[2016-12-08T03:08:21,799][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.04][6] engine closed [api]
[2016-12-08T03:08:22,331][DEBUG][o.e.i.t.Translog         ] [dal276-2] [logstash-na-nginx-2016.12.05][0] wipe translog location - creating new translog
[2016-12-08T03:08:22,332][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.05][0] no translog ID present in the current generation - creating one
[2016-12-08T03:08:22,385][DEBUG][o.e.i.t.Translog         ] [dal276-2] [logstash-na-nginx-2016.12.04][6] wipe translog location - creating new translog
[2016-12-08T03:08:22,387][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.04][6] no translog ID present in the current generation - creating one
[2016-12-08T03:08:22,774][DEBUG][o.e.i.f.p.ParentChildIndexFieldData] [dal276-2] [logstash-na-nginx-2016.12.04] global-ordinals [_parent] took [3.9micros]
[2016-12-08T03:08:22,779][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.04][6] close now acquiring writeLock
[2016-12-08T03:08:22,779][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.04][6] close acquired writeLock
[2016-12-08T03:08:22,779][DEBUG][o.e.i.t.Translog         ] [dal276-2] [logstash-na-nginx-2016.12.04][6] translog closed
[2016-12-08T03:08:22,780][DEBUG][o.e.i.f.p.ParentChildIndexFieldData] [dal276-2] [logstash-na-nginx-2016.12.05] global-ordinals [_parent] took [1.6micros]
[2016-12-08T03:08:22,784][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.04][6] engine closed [api]
[2016-12-08T03:08:22,785][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.05][0] close now acquiring writeLock
[2016-12-08T03:08:22,785][DEBUG][o.e.i.e.Engine           ] [dal276-2] [logstash-na-nginx-2016.12.05][0] close acquired writeLock

There were no logs mentioning any of the three indices on dal140 or the master.

In the end I disabled allocation and restarted dal276-2 and dal276-3, and the issue appears to be resolved; the following three full recoveries are taking place as the cluster comes back up (only one of them is for one of the originally broken indices):

[Screenshot: Marvel recovery view showing the three full recoveries in progress]

Here are the routing table entries (again from /_cluster/state) for the three shards now:

          "14": [
            {
              "state": "STARTED",
              "primary": false,
              "node": "3hrYNKELRq-KY8_ju3O65A",  # dal277-2
              "relocating_node": null,
              "shard": 14,
              "index": "logstash-na-nginx-2016.12.06",
              "allocation_id": {
                "id": "XUqiQZKdTm6sNIa0dRbEWA"
              }
            },
            {
              "state": "RELOCATING",
              "primary": true,
              "node": "6U_xqPDvSxKvGiWkRfxCyQ",  # dal155
              "relocating_node": "GN4AYurzS6SCpDFyKFPVUw",  # dal276-2
              "shard": 14,
              "index": "logstash-na-nginx-2016.12.06",
              "expected_shard_size_in_bytes": 100875566603,
              "allocation_id": {
                "id": "zBJA9XBwRIOaIRBDkNuwLA",
                "relocation_id": "st_vHPOxThWZQBqJ5MV0Bw"
              }
            }
          ]

...

          "0": [
            {
              "state": "STARTED",
              "primary": true,
              "node": "3hrYNKELRq-KY8_ju3O65A",  # dal277-2
              "relocating_node": null,
              "shard": 0,
              "index": "logstash-na-nginx-2016.12.05",
              "allocation_id": {
                "id": "w8lY2i8BSqy7VMLIPsMUaw"
              }
            },
            {
              "state": "STARTED",
              "primary": false,
              "node": "GN4AYurzS6SCpDFyKFPVUw",  # dal276-2
              "relocating_node": null,
              "shard": 0,
              "index": "logstash-na-nginx-2016.12.05",
              "allocation_id": {
                "id": "6b9TalyaTmanWO3DKKdNug"
              }
            }
          ],

...

          "6": [
            {
              "state": "STARTED",
              "primary": false,
              "node": "GN4AYurzS6SCpDFyKFPVUw",  # dal276-2
              "relocating_node": null,
              "shard": 6,
              "index": "logstash-na-nginx-2016.12.04",
              "allocation_id": {
                "id": "qot4rsA_Rq2nf9bRivb2qA"
              }
            },
            {
              "state": "STARTED",
              "primary": true,
              "node": "k7SPcS4OTQ6vT63_SUbTqg",  # dal277-1
              "relocating_node": null,
              "shard": 6,
              "index": "logstash-na-nginx-2016.12.04",
              "allocation_id": {
                "id": "G3ohX-y1T5aOVU2LYjn25g"
              }
            }
          ],

Elasticsearch version:
5.0.0

Plugins installed:
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [aggs-matrix-stats]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [ingest-common]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [lang-expression]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [lang-groovy]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [lang-mustache]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [lang-painless]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [percolator]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [reindex]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [transport-netty3]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [transport-netty4]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded plugin [x-pack]

JVM version:
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)

OS version:
Debian GNU/Linux 7.10 (wheezy)
Linux dal276 3.2.0-4-amd64 #1 SMP Debian 3.2.78-1 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
The Elasticsearch cluster should not stop ingesting logs, and relocation should not get into a state that requires a node restart to resolve.

Steps to reproduce:
Unknown.

Provide logs (if relevant):
See above


bleskes commented Dec 8, 2016

@blinken how much time passed between the start of the log spamming and the relocation command that moves shards from hot to cold?


blinken commented Dec 8, 2016

The index logstash-na-runit-2016.12.07 would have been created at 2016-12-07 00:01 UTC. Our Curator policy moves indices to cold storage 24 hours after they are created, and the next Curator run after that point was at 2016-12-08 00:07:01 UTC, so I'd expect the index to be moved off hot storage at that time.

The master started spamming the logs at 2016-12-07 08:31:39 UTC, only about eight hours after the index was created. Here's a list of recent Curator runs; I don't see much correlation with the issue starting at 2016-12-07 08:31:39.

(last few runs)
2016-12-07 02:07:02.20316
2016-12-07 03:07:02.18692
2016-12-07 10:07:02.35900
2016-12-07 16:07:02.36152
2016-12-07 18:07:02.48396
2016-12-07 19:07:02.25172
2016-12-07 22:07:02.39559
2016-12-08 00:07:02.54826
2016-12-08 01:07:01.98269
2016-12-08 08:07:01.79631

clintongormley added the :Allocation, :Distributed/Recovery and discuss labels Dec 9, 2016

blinken commented Dec 12, 2016

Thanks all, I notice this has a fix going in for 5.0.3 / 5.1.2. I'll test when those versions are released and advise if we run into this again. Appreciate the quick response!


ywelsch commented Dec 12, 2016

@blinken The fix covers the log-file spamming. I don't have a good theory, though, for the three shards that seemed to be stuck in an endless loop trying to relocate. Please report back if anything like this occurs again.


blinken commented Dec 14, 2016

@ywelsch OK, here's a similar case I just noticed. The logstash-na-nginx-2016.12.12 shard 6 replica has been unassigned for at least 24 hours now, and on investigation I find that in Marvel it's looping between INIT and TRANSLOG (shown below).

[Screenshot: Marvel recovery view for logstash-na-nginx-2016.12.12 shard 6, looping between INIT and TRANSLOG]
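
As an aside, the allocation explain API (new in 5.0) is another way to inspect why the replica stays unassigned; a minimal sketch:

GET /_cluster/allocation/explain
{
  "index" : "logstash-na-nginx-2016.12.12",
  "shard" : 6,
  "primary" : false
}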

I note that there's a second entry in Marvel for the same shard, with the same translog data; this seems to be a UI bug, and I suspect it's trying to show that the primary is also relocating:

          "6": [
            {
              "state": "RELOCATING",
              "primary": true,
              "node": "GFhQ_bRuQgyTxSDih69efw",
              "relocating_node": "yvqmYP7CR1yXUE-04lKkDw",
              "shard": 6,
              "index": "logstash-na-nginx-2016.12.12",
              "expected_shard_size_in_bytes": 97396189390,
              "allocation_id": {
                "id": "PYlTDKUKSEyAwt90PDe--Q",
                "relocation_id": "wS62LxLNSTiIG52ujUYWeA"
              }
            },
            {
              "state": "INITIALIZING",
              "primary": false,
              "node": "3hrYNKELRq-KY8_ju3O65A",
              "relocating_node": null,
              "shard": 6,
              "index": "logstash-na-nginx-2016.12.12",
              "recovery_source": {
                "type": "PEER"
              },
              "allocation_id": {
                "id": "RMrA_0sARNWkTrW1TvzwwA"
              },
              "unassigned_info": {
                "reason": "NODE_LEFT",
                "at": "2016-12-13T19:02:13.914Z",
                "delayed": true,
                "details": "node_left[3hrYNKELRq-KY8_ju3O65A]",
                "allocation_status": "no_attempt"
              }
            }
          ],

Correct me if I'm wrong, but my understanding is that a primary should not be relocating while the secondary shard is in the process of initializing?

I don't mind leaving it in this state overnight (UTC), as we still seem to be ingesting data; let me know if you would like me to collect any specific logs etc.


ywelsch commented Dec 14, 2016

a primary should not be relocating while the secondary shard is in the process of initializing

That's actually OK; we allow this to happen concurrently. Once primary relocation completes, the replica fails over to the new primary to resume its initialization.

let me know if you would like me to collect any specific logs etc.

It looks like an issue with primary relocation. At the end of primary relocation there is a (usually) short phase where replica recoveries are delayed (this would explain the switching back and forth of the replica from INIT to TRANSLOG). The primary relocation seems to be stuck. To investigate this further I need:

  • stack dumps from the nodes involved in the primary relocation (source node: GFhQ_bRuQgyTxSDih69efw and target node: yvqmYP7CR1yXUE-04lKkDw).
  • the list of tasks currently running on these nodes, via /_tasks?nodes=GFhQ_bRuQgyTxSDih69efw,yvqmYP7CR1yXUE-04lKkDw (see also https://www.elastic.co/guide/en/elasticsearch/reference/5.0/tasks.html and the sketch after this list).
  • the full logs from the master node and the two nodes above.
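
Roughly, for the task list and a (partial) view of the stacks, something like the following; a full jstack dump from the JVM gives more detail than the hot threads API:

GET /_tasks?nodes=GFhQ_bRuQgyTxSDih69efw,yvqmYP7CR1yXUE-04lKkDw&detailed=true

# hot threads only sample the busiest threads, so this is not a full stack dump
GET /_nodes/GFhQ_bRuQgyTxSDih69efw,yvqmYP7CR1yXUE-04lKkDw/hot_threads?threads=9999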

If you don't want to share this information publicly you can send it to my e-mail address (first_name@elastic.co).


blinken commented Dec 14, 2016

No problem, I'll send via email. Unfortunately it seems dal111 (GFhQ_bRuQgyTxSDih69efw) has generated 208GB of logs since midnight UTC. I'll attach the last 100k lines unless there's specific data you would like.


ywelsch commented Dec 14, 2016

I'll attach the last 100k lines

Sounds good. The logs might not reveal much (the interesting bits in this case are only logged at TRACE level, and increasing logging to that level would probably bring the cluster down). The stack traces and task lists, on the other hand, could reveal something interesting.


blinken commented Dec 14, 2016

I've sent the data through except the stack traces (which I'll follow up with tomorrow morning).

ywelsch added a commit that referenced this issue Dec 29, 2016
Recoveries are tracked on the target node using RecoveryTarget objects that are kept in a RecoveriesCollection. Each recovery has a unique id that is communicated from the recovery target to the source so that it can call back to the target and execute actions using the right recovery context. In case of a network disconnect, recoveries are retried. At the moment, the same recovery id is reused for the restarted recovery. This can lead to confusion though if the disconnect is unilateral and the recovery source continues with the recovery process. If the target reuses the same recovery id while doing a second attempt, there might be two concurrent recoveries running on the source for the same target.

This commit changes the recovery retry process to use a fresh recovery id. It also waits for the first recovery attempt to be fully finished (all resources locally freed) to further prevent concurrent access to the shard. Finally, in case of primary relocation, it also fails a second recovery attempt if the first attempt moved past the finalization step, as the relocation source can then be moved to RELOCATED state and start indexing as primary into the target shard (see TransportReplicationAction). Resetting the target shard in this state could mean that indexing is halted until the recovery retry attempt is completed and could also destroy existing documents indexed and acknowledged before the reset.

Relates to #22043
ywelsch added two more commits that referenced this issue Dec 29, 2016 (same commit message as above).

ywelsch commented Dec 29, 2016

As some of the conversation went by e-mail (sharing private logs etc.), I'm going to quickly summarize our findings here.

The endless recovery loop can happen when the initial connection between recovery target and source is prematurely closed. A fix has been made in #22325 that detects this situation and correctly initiates a second recovery attempt instead of looping endlessly. The fix will be released as part of ES v5.1.2.

We've also investigated why the connection could have been prematurely closed (and suggest a fix, see below). In this specific case we think that it was triggered by an inactivity timeout on the connection. The reason for this is best explained by a visualization of how the recovery process proceeds at the connection level:

Recovery Source            Recovery Target


            (channel 1) start recovery
  <-------------------------------------

     send file chunk X  (channel 2)
  ------------------------------------->
                    (channel 2)  ack
  <-------------------------------------

     send file chunk Y  (channel 2)
  ------------------------------------->
                    (channel 2)  ack
  <-------------------------------------

  ...

    ack  (channel 1)
  ------------------------------------->

This should illustrate that while the files are being sent from the recovery source to the target there is an idle channel (channel 1) waiting for the recovery to finish. @blinken confirmed that recoveries / relocations take multiple hours (large shards over slow connections). We therefore suggest configuring / enabling connection keep-alive options: OS-level TCP keepalives and/or the transport-level ping (transport.ping_schedule).


blinken commented Jan 3, 2017

Just to close the circle on my end, the issue has not recurred in the last two weeks after setting the following sysctls:

net.ipv4.tcp_keepalive_time=300
net.ipv4.tcp_keepalive_intvl=10
net.ipv4.tcp_keepalive_probes=18

I also changed cluster.routing.allocation.node_concurrent_recoveries from 8 to 4, which should reduce network load.
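
For reference, that setting is dynamic, so it can be applied with a cluster settings update of roughly this shape (transient shown; persistent works too):

PUT /_cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.node_concurrent_recoveries" : 4
  }
}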

I have not deployed transport.ping_schedule (but I plan to).

Thanks again to everyone for the assistance here.


ywelsch commented Jan 3, 2017

@blinken Thanks for the update. I'm closing the issue now as we have a fix for v5.1.2 and a confirmed workaround.

ywelsch closed this as completed Jan 3, 2017
lcawl added the :Distributed/Distributed label and removed the :Allocation label Feb 13, 2018