Spark ElasticsearchIllegalArgumentException[No data node with id[...] found] #637
Comments
@devoncrouse I assume you are talking about reads. Currently, due to the way Lucene and ES work, ES-Hadoop cannot simply move a read to a different client, since there is no guarantee that the documents would be returned in the same order. A scan is really tied to a node: if the node dies, ES-Hadoop has no proper, supported way to resume the scroll on a different node (or to restart it and properly discard the already-loaded data). Hence why, once scanning has actually started, it keeps looking for the same node id. Client nodes can help here if they behave better than data nodes; note that data nodes typically disappear due to GCs, which tend to happen during writes (not so much reads). Increasing the cluster size or allocating more memory typically fixes this issue.
And by the way, try using the latest ES-Hadoop 2.2-rc1; it contains an important fix for a bug that did not preserve the configuration properly across the job stages.
Thanks for the context; that makes sense. It's not so much that the data nodes are unstable; in my case, cluster scaling events and forced failures (Chaos Monkey) are exposing the issue. I'll try the latest release, and adapt my application to fail fast in this case and restart the read from scratch. I'll let you close this if appropriate; it's flagged as a bug, but I'm not sure whether this is something to be potentially addressed in the future.
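The fail-fast-and-restart approach described above can be sketched generically. This is a minimal illustration, not es-spark API: `run_job` is a hypothetical callable wrapping the whole ES read, and any exception is treated as a reason to redo the read from scratch.

```python
import time

def read_with_restart(run_job, max_attempts=3, backoff_s=5.0, sleep=time.sleep):
    """Run a full ES read job; on failure, restart it from scratch.

    `run_job` is a hypothetical callable wrapping the es-spark read. Since
    a scroll cannot be resumed on another node, the only safe recovery is
    to discard everything and re-run the entire read.
    """
    last_err = None
    for attempt in range(1, max_attempts + 1):
        try:
            return run_job()
        except Exception as err:  # e.g. "No data node with id[...] found"
            last_err = err
            if attempt < max_attempts:
                sleep(backoff_s)  # give the cluster time to recover
    raise last_err
```

The key point is that nothing from a partial read is kept; there is no way to know which documents would be re-delivered in a different order by a fresh scan.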
It's a missing feature, mainly due to the way ES works right now. Once sequence/ordered ids make it into ES, things will be different: one then has guarantees such as "I've read up to key X", and can restart the scan on a replica/fallback node from X onwards.
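To illustrate why an ordered key makes resumption possible, here is a toy sketch (not ES code): `fetch_page` is a hypothetical fetcher returning documents with key greater than a given value, sorted by key. Because the consumer tracks the last key it saw, a restarted scan against a replica skips everything already read.

```python
def resumable_scan(fetch_page, start_after=None, page_size=2):
    """Scan documents ordered by a monotonically increasing key.

    `fetch_page(after_key, size)` is a hypothetical page fetcher returning
    (key, doc) pairs with key > after_key, in key order. If the scan dies,
    restarting with start_after=<last seen key> re-reads nothing.
    """
    last_key = start_after
    while True:
        page = fetch_page(last_key, page_size)
        if not page:
            return
        for key, doc in page:
            last_key = key  # remember progress so a restart can resume here
            yield key, doc
```

This is exactly the guarantee a plain scroll lacks: a scroll's cursor lives on one node, while an ordered key lives with the reader and works against any replica.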
Closing the issue. Once the feature makes it into ES, I will open a new one (it's on the watch list).
There have been a couple of issues raised with this error, but I think this one is a little different, and seems explainable.
I have an ES 1.7.x cluster with dedicated master/client/data nodes, and everything is happy. I even have Chaos Monkey running, and cluster recovery works exactly as I'd expect.
The problem for me is more in the Apache Spark integration: when a node is lost during an active job, es-spark seems to relentlessly retry pulling the data from the same (now non-existent) node id, without attempting to figure out where the replica(s) are or where the primary went:
These settings are nice in theory, but ineffective for this problem as stated above:
Error example:
Maybe there's a configuration that would help alleviate this (besides `es.nodes.client.only`; I'm moving a lot of data and would prefer not to take the performance hit or scale up client nodes), and if anyone has a workaround I'd appreciate it.