
explicit client nodes vs preferred parallelism. Method to confirm what spark is actually using. #437

Closed
jeffsteinmetz opened this issue Apr 28, 2015 · 2 comments

Comments

@jeffsteinmetz

@costin , regarding your comment from the other closed issue (at the end of this note):

I am actually OK with it using the data nodes directly. To confirm: if I pass two client-node IP addresses in the Spark conf, will it detect the actual data nodes (could be 8, 16, 32, or hundreds) and use those? We use client nodes since their IP addresses stay fairly stable. It seems best if Spark talks directly to the data nodes, and this appears to be the default option, which is great (hoping to confirm).
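For concreteness, this is the kind of setup I mean - a minimal sketch, with placeholder addresses standing in for our two client nodes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder client-node addresses; the intent is that these act only as the
// initial contact points and the connector finds the rest of the cluster itself.
val conf = new SparkConf()
  .setAppName("es-spark-read")
  .set("es.nodes", "10.0.0.1,10.0.0.2") // the two "stable" client nodes
  .set("es.port", "9200")

val sc = new SparkContext(conf)
```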

I browsed around the Beta 4 docs; where do they describe what you mention below as a way to "force" client nodes only?

Is there a way to investigate via logger output which nodes it will actually use? With many data nodes and only 2 client nodes, to get the best parallelism from jobs it would be great to confirm that Spark is distributing the calls across all data nodes/shards via the partition scheme rather than through the client nodes (even though the client-node IPs were what went into the Spark conf to seed the "gossip"/discovery of the larger cluster).

And vice versa: if "force client only" is enabled, confirm that it worked.

--- original note ---

@jeffsteinmetz For some reason I've only found this comment now - apologies for the huge delay. The latest Beta (4) has support for client-only nodes - in other words, es-hadoop can be forced to connect to the cluster only through these nodes. This clearly affects parallelism since the queries are distributed between these nodes (instead of going to the data nodes directly), but performance doesn't seem to be affected too much - it really depends on the volume and on how important locality is.
In other words, if you are doing HUGE bulk reads you might find it slower; if not, you are unlikely to spot any difference.

@jeffsteinmetz
Author

Seems this may have answered my question:
#387

@costin
Member

costin commented Apr 29, 2015

@jeffsteinmetz to answer your questions,

To confirm: if I pass two client-node IP addresses in the Spark conf, will it detect the actual data nodes (could be 8, 16, 32, or hundreds) and use those?

Yes. The nodes you specify are 'seed' IPs. Note that, if desired, this behaviour can be disabled so that discovery is avoided and only these nodes are used for cluster-state operations.
However, the connections to the data nodes are still made.
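As a rough sketch (placeholder addresses; worth double-checking the property names against the Beta 4 docs):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Default: the declared nodes are just seeds; the rest of the cluster is
  // discovered and the data nodes are contacted directly.
  .set("es.nodes", "10.0.0.1,10.0.0.2")
  // Optional: disable discovery so only the declared nodes are used for
  // cluster-state operations (connections to the data nodes still happen).
  .set("es.nodes.discovery", "false")
```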

I browsed around the Beta 4 docs; where do they describe what you mention below as a way to "force" client nodes only?

I think that section was left out of the update by mistake - in fact, I'll raise an issue about this. es.nodes.client.only is the setting, and indeed the docs don't mention it...
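Something along these lines (a sketch, since the docs are missing that section; addresses are placeholders):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("es.nodes", "10.0.0.1,10.0.0.2")  // the client nodes (placeholder addresses)
  .set("es.nodes.client.only", "true")   // force all traffic through the client nodes
```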

Is there a way to investigate via logger output which nodes it will actually use?

You've already seen from the other ticket that turning on logging for the rest package gives you a LOT of information. Additionally, using some type of wire analyzer also works.
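For example, one way to bump that logging programmatically (a sketch assuming the log4j 1.x API; the equivalent log4j.properties entry should work as well):

```scala
import org.apache.log4j.{Level, Logger}

// Raise the REST layer to TRACE to see which nodes each task actually talks to.
Logger.getLogger("org.elasticsearch.hadoop.rest").setLevel(Level.TRACE)
```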
