
Spark ElasticsearchIllegalArgumentException[No data node with id[...] found] #637

Closed
devoncrouse opened this issue Dec 24, 2015 · 5 comments


@devoncrouse

There have been a couple of issues raised with this error, but I think this one is a little different, and it seems explainable.

I have an ES 1.7.x cluster with dedicated master, client, and data nodes, and everything is happy. I even have Chaos Monkey running, and cluster recovery works exactly as I'd expect.

The problem for me is more in the Apache Spark integration: when a node is lost during an active job, es-spark seems to retry relentlessly against the same (now non-existent) node id, without attempting to figure out where the replica(s) are or where the primary went.

These settings are nice in theory, but ineffective for the problem stated above:

es.batch.write.retry.count=300
es.batch.write.retry.wait=5s
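
For reference, a minimal sketch of how these settings are typically wired into a Spark job (the app name and node address are illustrative; the keys themselves are standard es-hadoop settings, and note they only govern bulk-write retries, not scan/scroll reads):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: standard es-hadoop settings passed via SparkConf.
// These retry settings apply to bulk writes only, not to scroll reads,
// which is why they don't help with the failure described above.
val conf = new SparkConf()
  .setAppName("es-export")                  // illustrative name
  .set("es.nodes", "10.100.30.222")         // illustrative address from the log
  .set("es.batch.write.retry.count", "300")
  .set("es.batch.write.retry.wait", "5s")
val sc = new SparkContext(conf)
```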

Error example:

Job aborted due to stage failure: Task 2 in stage 406.0 failed 4 times, most recent failure: Lost task 2.3 in stage 406.0 (TID 53333, 10.100.30.222): org.apache.spark.util.TaskCompletionListenerException: ElasticsearchIllegalArgumentException[No data node with id[czeFxNWjSW6wYzkNq8sUhA] found]
    at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:83)
    at org.apache.spark.scheduler.Task.run(Task.scala:72)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Maybe there's a configuration that would help alleviate this (besides es.nodes.client.only; I'm moving a lot of data and would prefer not to take the performance hit or have to scale up the client nodes). If anyone has a workaround, I'd appreciate it.

@costin (Member) commented Jan 11, 2016

@devoncrouse I assume you are talking about reads. Currently, due to the way Lucene and ES work, ES-Hadoop cannot simply move a read to a different node, since there is no guarantee that the documents would be returned in the same order. A scan is really tied to a node: if the node dies, ES-Hadoop has no proper, supported way to resume the scroll on a different node (or to restart it and discard the already-loaded data). That is why, once the scan has actually started, it keeps looking for the same node id.

Client nodes can help here if they behave better than your data nodes; note that data nodes typically disappear due to GC pauses, which tend to happen during writes (not so much during reads). Increasing the cluster size or allocating more memory typically fixes this.
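
For completeness, a sketch of the client-only routing option mentioned in the original report (es.nodes.client.only is a real es-hadoop setting; the node address is illustrative):

```scala
import org.apache.spark.SparkConf

// Sketch: route all es-spark traffic through client (non-data) nodes.
// The address below is made up; point it at your own client nodes.
val conf = new SparkConf()
  .set("es.nodes", "client-node-1:9200")
  .set("es.nodes.client.only", "true")
```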

@costin (Member) commented Jan 11, 2016

And by the way, try using the latest ES-Hadoop, 2.2-rc1. It contains an important fix for a bug that did not preserve the configuration properly across job stages.
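
If it helps, pulling that release into an sbt build would look roughly like the sketch below (the coordinates are assumed to follow the usual es-hadoop artifact naming; verify the exact artifact and version string on Maven Central):

```scala
// build.sbt sketch; artifact name and version are assumed, please verify.
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.2.0-rc1"
```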

@devoncrouse (Author)

Thanks for the context; that makes sense. It's not so much that the data nodes are unstable; cluster scaling events and forced failures (Chaos Monkey) are what expose the issue in my case. I'll try the latest release and adapt my application to fail fast in this case and restart the read from scratch.
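
For anyone landing here, a rough sketch of that fail-fast-and-restart approach (the wrapper, its name, and the retry policy are illustrative, not es-hadoop API; `esRDD` comes from the es-spark Scala integration):

```scala
import org.apache.spark.{SparkContext, SparkException}
import org.elasticsearch.spark._ // adds esRDD to SparkContext

// Illustrative wrapper: if a scan dies because its node is gone, the
// scroll context is lost with it, so restart the whole read from scratch.
// readWithRestart and maxAttempts are made-up names for this sketch.
def readWithRestart(sc: SparkContext, resource: String, maxAttempts: Int = 3): Long = {
  var lastError: Throwable = null
  for (attempt <- 1 to maxAttempts) {
    try {
      // Materialize the RDD so a mid-scan node loss surfaces here.
      return sc.esRDD(resource).count()
    } catch {
      case e: SparkException => lastError = e // start the scan over
    }
  }
  throw lastError
}
```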

I'll let you close this if appropriate; I flagged it as a bug, but I'm not sure whether it's something you'd potentially address in the future.

@costin (Member) commented Jan 11, 2016

It's a missing feature, mainly due to the way ES works right now. Once sequence/ordered IDs make it into ES, things will be different, since one then gets guarantees along the lines of: I've read up to key X, so restart the scan on a replica/fallback node, but only from X onwards.
It's still early days, but at least that would expose such capabilities to clients (such as ES-Hadoop) directly in ES; trying to replicate them outside of ES is a losing game.

@costin (Member) commented Jan 24, 2016

Closing the issue. Once the feature makes it into ES, I'll open a new one (it's on the watch list).

costin closed this as completed Jan 24, 2016