
explicit client nodes vs preferred parallelism. Method to confirm what spark is actually using. #437

Closed
jeffsteinmetz opened this issue Apr 28, 2015 · 2 comments

Comments

@jeffsteinmetz

@costin , regarding your comment from the other closed issue (at the end of this note):

I am actually OK with it using the data nodes directly. To confirm: if I pass two client-node IP addresses in the Spark conf, will it detect the actual data nodes (could be 8, 16, 32, or hundreds) and use those? We use client nodes since their IP addresses stay fairly stable. It seems best if Spark talks directly to the data nodes, and this appears to be the default option, which is great (hoping to confirm).
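For concreteness, this is the kind of setup I mean - a minimal sketch, with placeholder addresses standing in for our two client nodes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder client-node addresses; the intent is that these act only as the
// initial contact points and the connector finds the rest of the cluster itself.
val conf = new SparkConf()
  .setAppName("es-spark-read")
  .set("es.nodes", "10.0.0.1,10.0.0.2") // the two "stable" client nodes
  .set("es.port", "9200")

val sc = new SparkContext(conf)
```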

I browsed around the Beta 4 docs; where do they describe what you mention below as a way to "force" client nodes only?

Is there a way to investigate via logger output which nodes it will actually use? With many data nodes and only 2 client nodes, to get the best parallelism from jobs it would be great to confirm that Spark is distributing the calls across all data nodes/shards via the partition scheme rather than through the client nodes (even though the client-node IPs were what went into the Spark conf to seed the "gossip"/discovery of the larger cluster).

And vice versa: if "force client only" is enabled, confirm that it worked.

--- original note ---

@jeffsteinmetz For some reason I've only found this comment now - apologies for the huge delay. The latest Beta (4) has support for client-only nodes - in other words, es-hadoop can be forced to connect to the cluster only through these nodes. This clearly affects parallelism since the queries are distributed between these nodes (instead of going to the data nodes directly), but performance doesn't seem to be affected too much - it really depends on the volume and on how important locality is.
In other words, if you are doing HUGE bulk reads you might find it slower; if not, you are unlikely to spot any difference.

@jeffsteinmetz
Author

Seems this may have answered my question:
#387

@costin
Member

costin commented Apr 29, 2015

@jeffsteinmetz to answer your questions,

To confirm: if I pass two client-node IP addresses in the Spark conf, will it detect the actual data nodes (could be 8, 16, 32, or hundreds) and use those?

Yes. The nodes you specify are 'seed' IPs. Note that, if desired, this behaviour can be disabled so that discovery is avoided and only these nodes are used for cluster-state operations.
However, the connections to the data nodes are still made.
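As a rough sketch (placeholder addresses; worth double-checking the property names against the Beta 4 docs):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Default: the declared nodes are just seeds; the rest of the cluster is
  // discovered and the data nodes are contacted directly.
  .set("es.nodes", "10.0.0.1,10.0.0.2")
  // Optional: disable discovery so only the declared nodes are used for
  // cluster-state operations (connections to the data nodes still happen).
  .set("es.nodes.discovery", "false")
```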

I browsed around the Beta 4 docs; where do they describe what you mention below as a way to "force" client nodes only?

I think that section was left out of the update by mistake - in fact, I'll raise an issue about this. es.nodes.client.only is the setting, and indeed the docs don't mention it...
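Something along these lines (a sketch, since the docs are missing that section; addresses are placeholders):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("es.nodes", "10.0.0.1,10.0.0.2")  // the client nodes (placeholder addresses)
  .set("es.nodes.client.only", "true")   // force all traffic through the client nodes
```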

Is there a way to investigate via logger output which nodes it will actually use?

You've already seen from the other ticket that turning on logging for the rest package gives you a LOT of information. Additionally, using some type of wire analyzer also works.
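For example, one way to bump that logging programmatically (a sketch assuming the log4j 1.x API; the equivalent log4j.properties entry should work as well):

```scala
import org.apache.log4j.{Level, Logger}

// Raise the REST layer to TRACE to see which nodes each task actually talks to.
Logger.getLogger("org.elasticsearch.hadoop.rest").setLevel(Level.TRACE)
```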
