
Bulk index tasks stuck forever #31099

Closed
jeancornic opened this issue Jun 5, 2018 · 16 comments
Labels
>bug :Distributed/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search.


jeancornic commented Jun 5, 2018

Elasticsearch version (bin/elasticsearch --version): 6.1.3

Plugins installed: []

JVM version (java -version):

java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

OS version (uname -a if on a Unix-like system):

Linux *** 4.4.0-1049-aws #58-Ubuntu SMP Fri Jan 12 23:17:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

We're seeing a weird behaviour: bulk index actions that never finish.
They are triggered by the Elasticsearch Java client (version 6.1.3), through a BulkRequestBuilder, with the default timeout (1m). It seems we never enter the ActionListener callbacks.

I found out recently that it's linked to tasks running indefinitely. When checking the /_tasks API, I'm seeing tasks that have been running for several days (!).

$ curl 'localhost:9200/_tasks?pretty&human&actions=indices:data/write/bulk&detailed'
...
    "pd5MT4eUSEqsOy_oq_q-vw" : {
      "name" : "***",
      "transport_address" : "***",
      "host" : "***",
      "ip" : "***",
      "roles" : [
        "data"
      ],
      "attributes" : {
        "availability_zone" : "us-east-1c",
        "tag" : "fresh"
      },
      "tasks" : {
        "pd5MT4eUSEqsOy_oq_q-vw:208477658" : {
          "node" : "pd5MT4eUSEqsOy_oq_q-vw",
          "id" : 208477658,
          "type" : "netty",
          "action" : "indices:data/write/bulk",
          "start_time" : "2018-04-12T21:17:45.175Z",
          "start_time_in_millis" : 1523567865175,
          "running_time" : "53.4d",
          "running_time_in_nanos" : 4622108629250516,
          "cancellable" : false
        }
      }
    },
...

I would expect the bulk action to run for 1 min at most and then time out.

I came across several related issues / PRs on GitHub:

Steps to reproduce: I have no idea; it happens randomly since we migrated from 5.6 to 6.1.3.

Provide logs (if relevant): found nothing relevant in either the ES or the client logs.

mayya-sharipova added the >bug and :Core/Infra/Core labels Jun 5, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-core-infra

@jeancornic (Author)

Do you need more details? Or would you like me to run some commands on the cluster?

I also found out recently that this bug is responsible for stuck relocations. Somehow the ongoing bulk operation prevents the relocation from finishing completely, and the workaround I found is to close / re-open the index.
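
For reference, a minimal sketch of that workaround (the index name is just a placeholder for one of the affected indices):

$ curl -XPOST 'localhost:9200/stuck-index-name/_close'
$ curl -XPOST 'localhost:9200/stuck-index-name/_open'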

s1monw (Contributor) commented Jun 14, 2018

@bleskes this seems more of a distributed systems issue? Can you follow up?

bleskes (Contributor) commented Jun 18, 2018

@jeancornic what's peculiar about this issue is that the task indices:data/write/bulk is a top-level bulk request. That one potentially creates the index (are you using this feature?), then breaks the request into sub-requests containing all operations targeted at a specific shard, and sends those off. Are you seeing any of those being stuck? You can identify them by indices:data/write/bulk[s]* (note the [s] at the end).
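
For example, something along these lines should list the shard-level sub-requests grouped under their parent bulk task (a sketch using the same _tasks API as above):

$ curl 'localhost:9200/_tasks?pretty&human&detailed&group_by=parents&actions=indices:data/write/bulk*'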

@jeancornic (Author)

@bleskes sure, a JSON extract follows (group_by=parents view).
And no, we disabled auto_create_index.
We see them on several nodes, several indices.

    ...
    "pd5MT4eUSEqsOy_oq_q-vw:208477658" : {
      "node" : "pd5MT4eUSEqsOy_oq_q-vw",
      "id" : 208477658,
      "type" : "netty",
      "action" : "indices:data/write/bulk",
      "description" : "requests[938], indices[]",
      "start_time" : "2018-04-12T21:17:45.175Z",
      "start_time_in_millis" : 1523567865175,
      "running_time" : "66.7d",
      "running_time_in_nanos" : 5765089168375518,
      "cancellable" : false,
      "children" : [
        {
          "node" : "pd5MT4eUSEqsOy_oq_q-vw",
          "id" : 208477692,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "rerouted"
          },
          "description" : "requests[6], index[17.1523351040000.0]",
          "start_time" : "2018-04-12T21:17:45.177Z",
          "start_time_in_millis" : 1523567865177,
          "running_time" : "66.7d",
          "running_time_in_nanos" : 5765089166199822,
          "cancellable" : false,
          "parent_task_id" : "pd5MT4eUSEqsOy_oq_q-vw:208477658",
          "children" : [
            {
              "node" : "pacubgT0TfeaE0sIHWzbfA",
              "id" : 261543271,
              "type" : "netty",
              "action" : "indices:data/write/bulk[s]",
              "status" : {
                "phase" : "waiting_on_primary"
              },
              "description" : "requests[6], index[17.1523351040000.0]",
              "start_time" : "2018-04-12T21:17:45.182Z",
              "start_time_in_millis" : 1523567865182,
              "running_time" : "66.7d",
              "running_time_in_nanos" : 5765089159914921,
              "cancellable" : false,
              "parent_task_id" : "pd5MT4eUSEqsOy_oq_q-vw:208477692",
              "children" : [
                {
                  "node" : "pacubgT0TfeaE0sIHWzbfA",
                  "id" : 261543299,
                  "type" : "direct",
                  "action" : "indices:data/write/bulk[s][p]",
                  "status" : {
                    "phase" : "primary"
                  },
                  "description" : "requests[6], index[17.1523351040000.0]",
                  "start_time" : "2018-04-12T21:17:45.196Z",
                  "start_time_in_millis" : 1523567865196,
                  "running_time" : "66.7d",
                  "running_time_in_nanos" : 5765089146064289,
                  "cancellable" : false,
                  "parent_task_id" : "pacubgT0TfeaE0sIHWzbfA:261543271"
                },
                {
                  "node" : "3lgZ2bOCQs6xXcUda6OyuQ",
                  "id" : 256995983,
                  "type" : "netty",
                  "action" : "indices:data/write/bulk[s][r]",
                  "status" : {
                    "phase" : "replica"
                  },
                  "description" : "requests[6], index[17.1523351040000.0]",
                  "start_time" : "2018-04-12T21:17:45.217Z",
                  "start_time_in_millis" : 1523567865217,
                  "running_time" : "66.7d",
                  "running_time_in_nanos" : 5765089124501837,
                  "cancellable" : false,
                  "parent_task_id" : "pacubgT0TfeaE0sIHWzbfA:261543271"
                }
              ]
            }
          ]
        }
      ]
    },
    ...
    "Pnwo6h_ATI6aV0WDKnubfw:225377459" : {
      "node" : "Pnwo6h_ATI6aV0WDKnubfw",
      "id" : 225377459,
      "type" : "netty",
      "action" : "indices:data/write/bulk",
      "description" : "requests[706], indices[]",
      "start_time" : "2018-04-13T07:04:22.308Z",
      "start_time_in_millis" : 1523603062308,
      "running_time" : "66.3d",
      "running_time_in_nanos" : 5729892034527237,
      "cancellable" : false,
      "children" : [
        {
          "node" : "Pnwo6h_ATI6aV0WDKnubfw",
          "id" : 225377480,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "rerouted"
          },
          "description" : "requests[11], index[210.1523585220000.0]",
          "start_time" : "2018-04-13T07:04:22.310Z",
          "start_time_in_millis" : 1523603062310,
          "running_time" : "66.3d",
          "running_time_in_nanos" : 5729892032405972,
          "cancellable" : false,
          "parent_task_id" : "Pnwo6h_ATI6aV0WDKnubfw:225377459",
          "children" : [
            {
              "node" : "pacubgT0TfeaE0sIHWzbfA",
              "id" : 279050835,
              "type" : "netty",
              "action" : "indices:data/write/bulk[s]",
              "status" : {
                "phase" : "waiting_on_primary"
              },
              "description" : "requests[11], index[210.1523585220000.0]",
              "start_time" : "2018-04-13T07:04:22.312Z",
              "start_time_in_millis" : 1523603062312,
              "running_time" : "66.3d",
              "running_time_in_nanos" : 5729892030298850,
              "cancellable" : false,
              "parent_task_id" : "Pnwo6h_ATI6aV0WDKnubfw:225377480",
              "children" : [
                {
                  "node" : "pacubgT0TfeaE0sIHWzbfA",
                  "id" : 279050836,
                  "type" : "direct",
                  "action" : "indices:data/write/bulk[s][p]",
                  "status" : {
                    "phase" : "primary"
                  },
                  "description" : "requests[11], index[210.1523585220000.0]",
                  "start_time" : "2018-04-13T07:04:22.312Z",
                  "start_time_in_millis" : 1523603062312,
                  "running_time" : "66.3d",
                  "running_time_in_nanos" : 5729892030288175,
                  "cancellable" : false,
                  "parent_task_id" : "pacubgT0TfeaE0sIHWzbfA:279050835"
                },
                {
                  "node" : "pd5MT4eUSEqsOy_oq_q-vw",
                  "id" : 220870271,
                  "type" : "netty",
                  "action" : "indices:data/write/bulk[s][r]",
                  "status" : {
                    "phase" : "replica"
                  },
                  "description" : "requests[11], index[210.1523585220000.0]",
                  "start_time" : "2018-04-13T07:04:22.320Z",
                  "start_time_in_millis" : 1523603062320,
                  "running_time" : "66.3d",
                  "running_time_in_nanos" : 5729892022670077,
                  "cancellable" : false,
                  "parent_task_id" : "pacubgT0TfeaE0sIHWzbfA:279050835"
                }
              ]
            }
          ]
        }
      ]
    },
    ...
   "bGuSrzBFTUii9qto-vPdQw:2391948877" : {
      "node" : "bGuSrzBFTUii9qto-vPdQw",
      "id" : 2391948877,
      "type" : "netty",
      "action" : "indices:data/write/bulk",
      "description" : "requests[835], indices[]",
      "start_time" : "2018-06-12T16:05:56.318Z",
      "start_time_in_millis" : 1528819556318,
      "running_time" : "5.9d",
      "running_time_in_nanos" : 513398024435017,
      "cancellable" : false,
      "children" : [
        {
          "node" : "bGuSrzBFTUii9qto-vPdQw",
          "id" : 2391948914,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "rerouted"
          },
          "description" : "requests[4], index[841.1528651620000.1]",
          "start_time" : "2018-06-12T16:05:56.323Z",
          "start_time_in_millis" : 1528819556323,
          "running_time" : "5.9d",
          "running_time_in_nanos" : 513398020168910,
          "cancellable" : false,
          "parent_task_id" : "bGuSrzBFTUii9qto-vPdQw:2391948877",
          "children" : [
            {
              "node" : "evdbfKMRT_CDufvaHRaDAA",
              "id" : 2819410023,
              "type" : "netty",
              "action" : "indices:data/write/bulk[s]",
              "status" : {
                "phase" : "waiting_on_primary"
              },
              "description" : "requests[4], index[841.1528651620000.1]",
              "start_time" : "2018-06-12T16:05:56.322Z",
              "start_time_in_millis" : 1528819556322,
              "running_time" : "5.9d",
              "running_time_in_nanos" : 513398020793778,
              "cancellable" : false,
              "parent_task_id" : "bGuSrzBFTUii9qto-vPdQw:2391948914",
              "children" : [
                {
                  "node" : "evdbfKMRT_CDufvaHRaDAA",
                  "id" : 2819410080,
                  "type" : "direct",
                  "action" : "indices:data/write/bulk[s][p]",
                  "status" : {
                    "phase" : "primary"
                  },
                  "description" : "requests[4], index[841.1528651620000.1]",
                  "start_time" : "2018-06-12T16:05:56.346Z",
                  "start_time_in_millis" : 1528819556346,
                  "running_time" : "5.9d",
                  "running_time_in_nanos" : 513397996695634,
                  "cancellable" : false,
                  "parent_task_id" : "evdbfKMRT_CDufvaHRaDAA:2819410023"
                },
                {
                  "node" : "FDW9DMoWTiOX7AeQ8xmdCA",
                  "id" : 2589448571,
                  "type" : "netty",
                  "action" : "indices:data/write/bulk[s][r]",
                  "status" : {
                    "phase" : "replica"
                  },
                  "description" : "requests[4], index[841.1528651620000.1]",
                  "start_time" : "2018-06-12T16:05:56.380Z",
                  "start_time_in_millis" : 1528819556380,
                  "running_time" : "5.9d",
                  "running_time_in_nanos" : 513397960736985,
                  "cancellable" : false,
                  "parent_task_id" : "evdbfKMRT_CDufvaHRaDAA:2819410023"
                }
              ]
            }
          ]
        }
      ]
    },
    ...

bleskes added the :Distributed/CRUD label and removed the :Core/Infra/Core label Jun 18, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

bleskes (Contributor) commented Jun 18, 2018

@jeancornic thanks. It seems these are stuck waiting for a replica that never responded. The two tickets that you mention are similar in nature, but I don't see anything here that points in that direction.

I understand that you see this regularly. Is that true? If so, can you maybe upgrade to 6.3.0 and see how it goes? It will help reduce the scope of the search.

Some more generic questions about your setup:

  1. Do you use dynamic mapping?
  2. Is there anything else of note in the logs? The replica nodes (FDW9DMoWTiOX7AeQ8xmdCA & pd5MT4eUSEqsOy_oq_q-vw) are interesting.

jeancornic (Author) commented Jun 19, 2018

@bleskes Thanks for those comments.
We see this regularly indeed, several times a week.

Migrating to 6.3.0 is not very straightforward for us to be honest.

  1. We don't use dynamic mapping
  2. Investigated another, more recent task than the one in the example:
"WpsOgw9JTxivBHkMatVPrQ:2158610039" : {
      "node" : "WpsOgw9JTxivBHkMatVPrQ",
      "id" : 2158610039,
      "type" : "netty",
      "action" : "indices:data/write/bulk",
      "description" : "requests[671], indices[]",
      "start_time" : "2018-06-17T22:06:29.484Z",
      "start_time_in_millis" : 1529273189484,
      "running_time" : "1.7d",
      "running_time_in_nanos" : 153023445214125,
      "cancellable" : false,
      "children" : [
        {
          "node" : "WpsOgw9JTxivBHkMatVPrQ",
          "id" : 2158610093,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "rerouted"
          },
          "description" : "requests[29], index[13.1529271600000.1]",
          "start_time" : "2018-06-17T22:06:29.487Z",
          "start_time_in_millis" : 1529273189487,
          "running_time" : "1.7d",
          "running_time_in_nanos" : 153023441823453,
          "cancellable" : false,
          "parent_task_id" : "WpsOgw9JTxivBHkMatVPrQ:2158610039",
          "children" : [
            {
              "node" : "63tYulq1Q7qhsMAcQXR6Bg",
              "id" : 1928749409,
              "type" : "netty",
              "action" : "indices:data/write/bulk[s]",
              "status" : {
                "phase" : "waiting_on_primary"
              },
              "description" : "requests[29], index[13.1529271600000.1]",
              "start_time" : "2018-06-17T22:06:29.488Z",
              "start_time_in_millis" : 1529273189488,
              "running_time" : "1.7d",
              "running_time_in_nanos" : 153023440722690,
              "cancellable" : false,
              "parent_task_id" : "WpsOgw9JTxivBHkMatVPrQ:2158610093",
              "children" : [
                {
                  "node" : "FDW9DMoWTiOX7AeQ8xmdCA",
                  "id" : 2871784213,
                  "type" : "netty",
                  "action" : "indices:data/write/bulk[s][r]",
                  "status" : {
                    "phase" : "replica"
                  },
                  "description" : "requests[29], index[13.1529271600000.1]",
                  "start_time" : "2018-06-17T22:06:29.544Z",
                  "start_time_in_millis" : 1529273189544,
                  "running_time" : "1.7d",
                  "running_time_in_nanos" : 153023387248385,
                  "cancellable" : false,
                  "parent_task_id" : "63tYulq1Q7qhsMAcQXR6Bg:1928749409"
                },
                {
                  "node" : "63tYulq1Q7qhsMAcQXR6Bg",
                  "id" : 1928749487,
                  "type" : "direct",
                  "action" : "indices:data/write/bulk[s][p]",
                  "status" : {
                    "phase" : "primary"
                  },
                  "description" : "requests[29], index[13.1529271600000.1]",
                  "start_time" : "2018-06-17T22:06:29.499Z",
                  "start_time_in_millis" : 1529273189499,
                  "running_time" : "1.7d",
                  "running_time_in_nanos" : 153023430180000,
                  "cancellable" : false,
                  "parent_task_id" : "63tYulq1Q7qhsMAcQXR6Bg:1928749409"
                }
              ]
            }
          ]
        }
      ]
    }

In the logs (on the 3 nodes WpsOgw9JTxivBHkMatVPrQ, 63tYulq1Q7qhsMAcQXR6Bg & FDW9DMoWTiOX7AeQ8xmdCA), I don't find anything relevant:

  • indexing slowlogs, < 500ms

  • some debug logs, async flush request, async roll translog generation & merge

...
[2018-06-17T22:11:22,131][DEBUG][o.e.i.s.IndexShard       ] [***] [13.1529271600000.1][4] submitting async flush request
...
[2018-06-17T22:06:37,917][DEBUG][o.e.i.s.IndexShard       ] [***] [13.1529271600000.1][0] submitting async roll translog generation request
...
[2018-06-17T22:06:41,358][DEBUG][o.e.i.e.InternalEngine$EngineMergeScheduler] [***] [13.1529271600000.1][4] merge segment [_7e] done: took [25.4s], [375.2 MB], [526,460 docs], [0s stopped], [17.8s throttled], [364.3 MB written], [16.5 MB/sec throttle]
...

I'm also seeing one

[2018-06-19T02:18:09,096][WARN ][o.e.t.n.Netty4Transport  ] [***] send message failed [channel: org.elasticsearch.transport.netty4.NettyTcpChannel@542b0ecb]
java.nio.channels.ClosedChannelException: null
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]

but it occurred 4 hours after the bulk, so it is probably unrelated.

bleskes (Contributor) commented Jun 22, 2018

@jeancornic thanks. Nothing pops up directly here, which means it's going to be a long search. It's good that it happens frequently, as it greatly increases the chance of finding out what it is.

The last log message you sent is important:

[2018-06-19T02:18:09,096][WARN ][o.e.t.n.Netty4Transport  ] [***] send message failed [channel: org.elasticsearch.transport.netty4.NettyTcpChannel@542b0ecb]
java.nio.channels.ClosedChannelException: null
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]

Can you post the full stack trace for this? Can you also check other nodes for transport-level logs that look suspicious?

linyy2233 commented Oct 9, 2018

We also encountered the same problem: this task caused very high load on several nodes, and then the cluster could not accept writes. Restarting the nodes with high load recovers the cluster. Is there a solution? Thank you.

@linyy2233

node hot_threads output (screenshot)

ywelsch (Contributor) commented Oct 9, 2018

@linyy2233 it looks like you're having a different (unrelated) problem, possibly caused by having a very large JSON number in your document? Can you open a topic on discuss.elastic.co please for further investigation? Thanks.

Also, as there has been no reply on the original ticket, closing this.

ywelsch closed this as completed Oct 9, 2018
seborys commented Dec 20, 2018

@jeancornic Did you ever get more info on this? Experiencing the same.

ywelsch (Contributor) commented Dec 20, 2018

For the case where tasks are stuck on replicas (Tasks are organized in a tree hierarchy, and the leaves are tasks with an action name indices:data/write/bulk[s][r], i.e., ending in [r]), we found the root cause yesterday and created a fix (see #36770). Note that there can be other reasons for stuck tasks, so if your leaf tasks are not replica tasks, they might be stuck for a different reason.
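
For example, a sketch of how to check which kind of leaf tasks you have, using the same _tasks API as earlier in this thread:

$ curl 'localhost:9200/_tasks?pretty&human&detailed&group_by=parents&actions=indices:data/write/bulk*'

In the output, leaf tasks whose action ends in [s][r] are replica tasks (covered by the fix in #36770); leaves ending in [s][p] are primary tasks and point to a different cause.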

@linyy2233

Thanks, I changed the numeric_detection parameter to false and the problem no longer appears.
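
For reference, numeric_detection is a dynamic-mapping setting; a minimal sketch of disabling it when creating an index on 6.x (index and type names are placeholders):

$ curl -XPUT 'localhost:9200/my-index' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "_doc": {
      "numeric_detection": false
    }
  }
}'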

IvanBoyko commented Mar 30, 2020

We experience the same problem on 6.8.7.

Is there any generic workaround to delete stuck tasks?
We are OK with losing this data; we would just like to recover the cluster.

Example:

    "hep4suvHTgWKgYGcDoseCA:5469": {
      "node": "hep4suvHTgWKgYGcDoseCA",
      "id": 5469,
      "type": "transport",
      "action": "indices:data/write/bulk",
      "description": "requests[2], indices[.monitoring-es-6-2020.03.30]",
      "start_time_in_millis": 1585591712424,
      "running_time_in_nanos": 6472053578715,
      "cancellable": false,
      "headers": {},
      "children": [
        {
          "node": "hep4suvHTgWKgYGcDoseCA",
          "id": 5483,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "status": {
            "phase": "waiting_on_primary"
          },
          "description": "requests[2], index[.monitoring-es-6-2020.03.30]",
          "start_time_in_millis": 1585591712440,
          "running_time_in_nanos": 6472037944119,
          "cancellable": false,
          "parent_task_id": "hep4suvHTgWKgYGcDoseCA:5469",
          "headers": {},
          "children": [
            {
              "node": "hep4suvHTgWKgYGcDoseCA",
              "id": 6148,
              "type": "direct",
              "action": "indices:data/write/bulk[s][p]",
              "status": {
                "phase": "primary"
              },
              "description": "requests[2], index[.monitoring-es-6-2020.03.30]",
              "start_time_in_millis": 1585591720956,
              "running_time_in_nanos": 6463522255014,
              "cancellable": false,
              "parent_task_id": "hep4suvHTgWKgYGcDoseCA:5483",
              "headers": {}
            }
          ]
        }
      ]
    }
