[Spark] Is there a way to make elasticsearch-hadoop stick to the load-balancer (client) instead of trying to ping all the data nodes? #373
It turns out that only … Basically, this environment is set up based on this post: … According to the code in …, is there a way to make …?
@wingchen es-hadoop/spark relies on connecting to the data nodes directly to support a parallel, node-to-node architecture. In other words, for each read and write, for each shard of the target index, es-hadoop/spark creates a task/split that works directly against the data node. This way, each task works locally with the data without impacting the rest of the cluster. If we were to support client nodes, the benefits of such a parallelized query would go away, since everything would be shoved through the client: there would be no locality and, in fact, no parallelism. In your case, rather than having 2 parallel tasks reading data from …
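To make that concrete, here is a minimal Spark (Scala) sketch of a default read, assuming the elasticsearch-spark connector is on the classpath; the master, node address, and index/type are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Default setup: node discovery is on, so the connector learns about
// the data nodes behind the seed address given in es.nodes.
val conf = new SparkConf()
  .setAppName("es-shard-parallelism")
  .setMaster("local[*]")                  // placeholder master
  .set("es.nodes", "es-data-1:9200")      // placeholder seed node

val sc = new SparkContext(conf)

// One partition (and thus one task) per shard of the target index,
// each task reading directly from the data node hosting its shard.
val rdd = sc.esRDD("myindex/mytype")      // placeholder index/type
println(s"partitions = ${rdd.partitions.length}") // equals the shard count
```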
Thanks for the quick response (and for replying during the weekend). I agree with you: it's faster to have the driver query each individual node instead of sticking to the client node. We are evaluating the impact of changing the config this way. I got past the previous error with this new code update. However, I still run into an exception:
It turns out …
@wingchen What's your configuration? Can you please turn on logging all the way to TRACE?
This is too little for a gist, so I am posting it here:
It turns out that all the other logs are the same, but I am looking into why, too.
There should be a lot more than that, such as what data is sent to and received from Elasticsearch; make sure to enable TRACE logging.
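For instance, with the log4j 1.x that Spark shipped at the time, the connector's REST layer can be bumped to TRACE programmatically; the equivalent log4j.properties line would be log4j.logger.org.elasticsearch.hadoop.rest=TRACE:

```scala
import org.apache.log4j.{Level, Logger}

// TRACE on the REST package logs the request and response bodies
// exchanged with Elasticsearch, which is what is missing above.
Logger.getLogger("org.elasticsearch.hadoop.rest").setLevel(Level.TRACE)
```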
Correction. The only … still exists. I have updated all the logs: https://gist.github.com/wingchen/7eb26cffbe59f6626e5a The client node was activated successfully, though:
@wingchen I've added a fix in master and pushed another build. There are better messages in case no client nodes are found (including the case of discovery being disabled), as well as better node filtering. Can you please try it out and report back?
Thanks a lot for the update. I am in PST. Unfortunately we got an NPE on the new commit: https://gist.github.com/wingchen/f5b2d86ca128c7aea2b4
@costin Do you have any update? Thanks.
I pushed an update a couple of days ago but forgot to comment :( Can you please try it out and report back? If you're online, it would be great if we could connect over IRC to get to the bottom of this. My user is …
I am on IRC now.
@wingchen The final fix has been committed and its relevant build pushed to Maven. Try it out and let me know if we can close this issue. Cheers,
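For readers hitting the same scenario: the connector exposes a client-only mode through the es.nodes.client.only setting, which routes all traffic through the client (load-balancer) nodes at the cost of the locality and parallelism described above. A minimal write sketch, with placeholder master, address, and index/type:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Route every request through the client nodes instead of talking
// to the data nodes directly; shard locality is deliberately given up.
val conf = new SparkConf()
  .setAppName("es-client-only")
  .setMaster("local[*]")                   // placeholder master
  .set("es.nodes", "es-client-1:9200")     // placeholder client node
  .set("es.nodes.client.only", "true")

val sc = new SparkContext(conf)
sc.makeRDD(Seq(Map("message" -> "ping")))
  .saveToEs("myindex/mytype")              // placeholder index/type
```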
Confirmed fixed. Thanks.
Closing the issue.
This is Spark-related.

I got the following exception with the code: …

Exception: …

It seems to go and try other nodes even after es.nodes.discovery is explicitly turned off. Here comes the /_nodes?pretty output (I removed some sensitive info): …
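For context, the setup that triggers this is presumably along these lines: es.nodes pointing at the load-balancer and discovery disabled, yet reads still reaching for the data nodes. The address is a placeholder:

```scala
import org.apache.spark.SparkConf

// Point the connector at the load-balancer only and turn node
// discovery off, expecting all traffic to stay on that one address.
val conf = new SparkConf()
  .set("es.nodes", "es-lb:9200")          // placeholder load-balancer
  .set("es.nodes.discovery", "false")
```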