Spark ElasticsearchIllegalArgumentException[No data node with id[...] found] #637
Comments
@devoncrouse I assume you are talking about reads. Currently, due to the way Lucene and ES work, ES-Hadoop cannot simply move a read to a different client, since there is no guarantee that the documents would be returned in the same order. A scan is really tied to a node: if the node dies, ES-Hadoop has no proper, supported way to resume the scroll on a different node (or to restart it and properly discard the already-loaded data). Hence why, once scanning has actually started, it keeps looking for the same node id. Client nodes can help here if they behave better than data nodes; note that data nodes typically disappear due to GCs, which tend to happen during writes (not so much reads). Increasing the cluster size or allocating more memory typically fixes this issue.
And by the way, try using the latest ES-Hadoop 2.2-rc1; it contains an important fix for a bug that did not preserve the configuration properly across the job stages.
Thanks for the context; that makes sense. It's not so much that the data nodes are unstable; in my case, cluster scaling events and forced failures (Chaos Monkey) are exposing the issue. I'll try the latest release, and adapt my application to fail fast in this case and restart the read from scratch. I'll let you close this if appropriate; it's flagged as a bug, but I'm not sure whether this is something to be potentially addressed in the future.
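The fail-fast-and-restart approach described above can be sketched generically. This is a minimal illustration, not es-spark API: `run_job` is a hypothetical callable wrapping the whole ES read, and any exception is treated as a reason to redo the read from scratch.

```python
import time

def read_with_restart(run_job, max_attempts=3, backoff_s=5.0, sleep=time.sleep):
    """Run a full ES read job; on failure, restart it from scratch.

    `run_job` is a hypothetical callable wrapping the es-spark read. Since
    a scroll cannot be resumed on another node, the only safe recovery is
    to discard everything and re-run the entire read.
    """
    last_err = None
    for attempt in range(1, max_attempts + 1):
        try:
            return run_job()
        except Exception as err:  # e.g. "No data node with id[...] found"
            last_err = err
            if attempt < max_attempts:
                sleep(backoff_s)  # give the cluster time to recover
    raise last_err
```

The key point is that nothing from a partial read is kept; there is no way to know which documents would be re-delivered in a different order by a fresh scan.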
It's a missing feature, mainly due to the way ES works right now. Once sequence/ordered ids make it into ES, things will be different: one then has guarantees such as "I've read up to key X", and can restart the scan on a replica/fallback node from X onwards.
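To illustrate why an ordered key makes resumption possible, here is a toy sketch (not ES code): `fetch_page` is a hypothetical fetcher returning documents with key greater than a given value, sorted by key. Because the consumer tracks the last key it saw, a restarted scan against a replica skips everything already read.

```python
def resumable_scan(fetch_page, start_after=None, page_size=2):
    """Scan documents ordered by a monotonically increasing key.

    `fetch_page(after_key, size)` is a hypothetical page fetcher returning
    (key, doc) pairs with key > after_key, in key order. If the scan dies,
    restarting with start_after=<last seen key> re-reads nothing.
    """
    last_key = start_after
    while True:
        page = fetch_page(last_key, page_size)
        if not page:
            return
        for key, doc in page:
            last_key = key  # remember progress so a restart can resume here
            yield key, doc
```

This is exactly the guarantee a plain scroll lacks: a scroll's cursor lives on one node, while an ordered key lives with the reader and works against any replica.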
Closing the issue. Once the feature makes it into ES, I will open a new one (it's on the watch list).
There have been a couple of issues raised with this error, but I think this one is a little different, and seems explainable.
I have an ES 1.7.x cluster with dedicated master/client/data nodes, and everything is happy. I even have Chaos Monkey running, and cluster recovery works exactly as I'd expect.
The problem for me is more in the Apache Spark integration: when a node is lost during an active job, es-spark seems to relentlessly retry pulling the data from the same (now non-existent) node id, without attempting to figure out where the replica(s) are or where the primary went:
These settings are nice in theory, but ineffective for this problem as stated above:
Error example:
Maybe there's a configuration that would help alleviate this (besides `es.nodes.client.only`; I'm moving a lot of data and would prefer not to take the performance hit or scale up client nodes), and if anyone has a workaround I'd appreciate it.