
Keep original nodes when adding discoveredHosts in SettingsUtils #256

Closed
wants to merge 1 commit into from

Conversation

bseibel

@bseibel bseibel commented Aug 28, 2014

This fixes the case where all queries hit one node even when
the partition sets a specific node.

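For context, a minimal sketch of the kind of change the title suggests - merging the freshly discovered hosts into the originally configured list instead of replacing it. The class and method names below are illustrative only, not the actual SettingsUtils API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Illustrative sketch only -- not the real SettingsUtils code.
// The idea: keep the originally configured nodes and append the
// discovered ones, rather than letting discovery replace the originals.
class DiscoveredHostsSketch {
    static List<String> mergeNodes(List<String> originalNodes, List<String> discoveredNodes) {
        LinkedHashSet<String> merged = new LinkedHashSet<>(originalNodes); // originals keep priority
        merged.addAll(discoveredNodes);                                    // discovered ones follow
        return new ArrayList<>(merged);
    }
}
```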
@costin
Member

costin commented Aug 28, 2014

@bseibel thanks for the PR. Can you clarify, though, what issue you are trying to address? We deliberately use the discovered nodes instead of the provided ones, since the given ones can be masters or clients and, if valid, will be included in the discovered ones anyway.
Do you see unusual load or abnormal behavior on certain nodes?

@bseibel
Author

bseibel commented Aug 28, 2014

Right, but since the list is being replaced by the discovered nodes, each partition's client will always select the same node to connect to, which means all traffic ends up getting routed through a single node in the cluster instead of directly to the node that has the shard.

@costin
Member

costin commented Aug 28, 2014

@bseibel the discovery nodes are used only to query information about the cluster state - once the index's shard information has been retrieved, each task will communicate with the appropriate node/shard regardless of the discovered nodes.
That's in fact the reason we recommend folks set up only a few nodes for discovery - the heavy work is done on the target nodes.

Whether the discovered nodes or the specified ones are used, the metadata calls currently go to the nodes in the same order. We could try to improve this by using some kind of randomizer for the discovery nodes, but then again this is a one-time, per-job, lightweight call.
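As a rough illustration of the randomizer idea mentioned above (purely a sketch, not code from the project):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Purely illustrative: shuffle a copy of the discovery node list so that
// per-job metadata calls do not always hit the nodes in the same order.
class DiscoveryNodeShuffleSketch {
    static List<String> shuffledCopy(List<String> discoveryNodes) {
        List<String> copy = new ArrayList<>(discoveryNodes);
        Collections.shuffle(copy);
        return copy;
    }
}
```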

@costin
Member

costin commented Aug 28, 2014

@bseibel By the way, you can double-check this by executing a query on an index spread across multiple nodes and seeing what connections are made - you can find this out by enabling logging (see the reference docs chapter).
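For instance, assuming a log4j setup, an entry along these lines should surface the REST-level connections (the category is based on the org.elasticsearch.hadoop.rest package; see the logging section of the reference docs for the recommended categories and levels):

```properties
# Hedged example, assuming log4j: raise the REST layer category to see
# which nodes each task actually connects to.
log4j.category.org.elasticsearch.hadoop.rest=DEBUG
```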

@bseibel
Author

bseibel commented Aug 28, 2014

@costin You are correct in saying that each task should communicate with the appropriate node/shard regardless of the discovered nodes; what I'm trying to say is that this is not what is actually happening. :)

EsInputFormat (line 199):

settings.setHosts(esSplit.nodeIp).setPort(esSplit.httpPort);

These settings are used to create a RestRepository, which passes them to the RestClient constructor, which does:

network = new NetworkClient(settings, SettingsUtils.nodes(settings));

which ends up passing in the previously discovered node list instead of the hosts just set on the settings object, because SettingsUtils.nodes uses the discovered list when it exists.

Thus each task's NetworkClient is going to select the same node, causing all queries to be directed to the same node in the cluster, even though we previously computed (when determining splits) which node each split should optimally query.
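To make the chain concrete, here is a minimal sketch (not the actual es-hadoop code; the names are illustrative) of the selection behaviour being described: once a discovered list exists, it wins over the host that was just pinned for the split, so every task ends up choosing from the same list.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only -- not the real SettingsUtils.nodes(). Shows the effect
// described above: a previously discovered node list takes precedence over
// the single host that EsInputFormat just set for this split.
class NodeSelectionSketch {
    static List<String> nodes(String pinnedSplitHost, String discoveredNodesCsv) {
        if (discoveredNodesCsv != null && !discoveredNodesCsv.isEmpty()) {
            // discovered list replaces the pinned host -> every task picks from the same list
            return Arrays.asList(discoveredNodesCsv.split(","));
        }
        return Arrays.asList(pinnedSplitHost); // what the split-aware code intended
    }
}
```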

@bseibel
Author

bseibel commented Aug 29, 2014

@costin Sorry, I realized after a night of sleep that I've probably omitted a very important piece of information: this is happening when running as a Spark job :)

@costin costin closed this in 8ba8fa6 Sep 1, 2014
@costin
Member

costin commented Sep 1, 2014

@bseibel Hi. I've identified the issue (and also improved the Hadoop code-base to be similar to the native Spark one) and pushed a fix plus additional logging to give insight into what is going on.
I'd like to do more testing to be sure the issue is fully resolved; however, the first batch passes just fine.

I've already pushed a snapshot to Maven as 2.1.0.BUILD-SNAPSHOT. Please give it a try and let me know whether it works for you or not.
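A minimal dependency declaration for trying it out might look like this (assuming the project's snapshot repository is already configured in your build):

```xml
<!-- Hedged pom fragment; assumes a snapshot repository for the project
     is already configured in the build. -->
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>2.1.0.BUILD-SNAPSHOT</version>
</dependency>
```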

@costin costin reopened this Sep 1, 2014
@bseibel
Author

bseibel commented Sep 2, 2014

@costin Thanks! This takes care of the issue. And a much nicer fix than my one-liner ;)

@costin costin added the bug label Sep 2, 2014
@costin
Member

costin commented Sep 2, 2014

Thanks for the feedback and the report - closing the issue.
