@costin , regarding your comment from the other closed issue (at the end of this note):
I am actually ok with it using the "data" nodes directly. To confirm: if I pass 2 client node IP addresses in the spark conf, it "will" detect the actual data nodes (could be 8, 16, 32 nodes, hundreds) and use those, correct? We use client nodes since their IP addresses stay pretty stable. It seems best if Spark talks directly to the data nodes, and it seems like this is the default option - which is great (hoping to confirm).
I browsed around the beta 4 docs. Where in the docs does it describe what you mention below as a way to "force" clients only?
Is there a way to investigate via logger output what nodes it will actually use? With many data nodes and only 2 clients, to get the best parallelism from jobs, it would be great if I could confirm that Spark is distributing the calls to all data nodes/shards via the partition scheme rather than to the client nodes (even though it was the client node IPs in the spark conf that were used to initialize the "gossip" or "discovery" of the larger cluster).
Then vice versa: if "force client only" is enabled, confirm that it worked.
--- original note ---
@jeffsteinmetz For some reason I've only found this comment now - apologies for the huge delay. The latest Beta (4) has support for client-only nodes - in other words, es-hadoop can be forced to connect to the cluster only through these nodes. Clearly it affects parallelism, since the queries are distributed between these nodes (instead of going to the data nodes directly), but performance doesn't seem to be affected too much - it really depends on the volume and how important locality is.
In other words, if you are doing HUGE bulk reads, you might find it slower, if not, you are unlikely to spot any difference.
To confirm: if I pass 2 client node IP addresses in the spark conf, it "will" detect the actual data nodes (could be 8, 16, 32 nodes, hundreds) and use those, correct?
Yes. The nodes you specify are 'seed' IPs. Note that, if desired, this behaviour can be disabled and discovery skipped, so that only these nodes are used for cluster state operations.
However, connections to the data nodes are still made.
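As a rough sketch of the seed-node setup described above: the helper below just assembles es-hadoop settings as a plain dict of strings, the form typically handed to a Spark or Hadoop configuration. The helper name and the IP addresses are made up for illustration, and I am assuming that es.nodes.discovery is the setting meant by "discovery" here (it defaults to true in the es-hadoop configuration docs).

```python
def es_seed_conf(seed_nodes, port=9200, discovery=True):
    """Build es-hadoop connection settings as a dict of strings.

    seed_nodes: addresses used only as entry points into the cluster.
    discovery:  assumed to map to es.nodes.discovery; when true, es-hadoop
                looks up the rest of the cluster starting from the seeds.
    """
    return {
        "es.nodes": ",".join(seed_nodes),
        "es.port": str(port),
        "es.nodes.discovery": "true" if discovery else "false",
    }


# Hypothetical client-node addresses used as seeds.
conf = es_seed_conf(["10.0.0.1", "10.0.0.2"])
for key in sorted(conf):
    print(key, "=", conf[key])
```

The resulting dict can then be passed wherever the job accepts es-hadoop settings, for example as the conf argument of a Hadoop input format read in PySpark.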
I browsed around the beta 4 docs, where in the docs does it describe what you mention below as a way to "force" clients only?
I think that section was left un-updated by mistake - in fact I'll raise an issue about this. es.nodes.client.only is the setting and indeed, the docs don't mention it...
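For illustration, switching an existing set of es-hadoop settings to client-only routing could look like the sketch below. Only the es.nodes.client.only key comes from this thread; the helper name and the addresses are hypothetical.

```python
def with_client_only(conf, enabled=True):
    """Return a copy of the settings with es.nodes.client.only set.

    The input dict is left untouched so the default (direct to data
    nodes) configuration can still be reused for comparison runs.
    """
    updated = dict(conf)
    updated["es.nodes.client.only"] = "true" if enabled else "false"
    return updated


# Hypothetical seed addresses of the two client nodes.
base = {"es.nodes": "10.0.0.1,10.0.0.2"}
routed = with_client_only(base)
print(routed["es.nodes.client.only"])
```

Keeping both dicts around makes it easy to run the same job in each mode and compare which nodes show up in the logs.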
Is there a way to investigate via logger output what nodes it will actually use?
As you've already seen from the other ticket, turning on logging on the rest package gives you a LOT of information. Additionally, using some type of wire analyzer also works.
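One possible way to turn that logging on, assuming the job logs through log4j 1.x with a properties file (as Spark commonly did at the time) and that "the rest package" is org.elasticsearch.hadoop.rest: the small helper below only generates the properties line, and its name is made up for this example.

```python
def rest_logging_line(level="TRACE"):
    """Return a log4j 1.x properties line for the es-hadoop REST package.

    Assumes log4j 1.x properties syntax; other logging backends need
    their own equivalent configuration.
    """
    return "log4j.logger.org.elasticsearch.hadoop.rest=" + level


line = rest_logging_line()
print(line)
```

The line would be appended to the log4j.properties used by the job. Since reads run inside the executors, the setting most likely needs to be in place there as well as on the driver to see which nodes each task contacts.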