Do not route MapReduce reads and writes through non-data nodes #512

Closed
jasontedor opened this issue Jul 20, 2015 · 1 comment

Comments

@jasontedor
Member

The embedded REST client in the MapReduce integration in Elasticsearch Hadoop will by default route through non-data nodes (e.g., dedicated master nodes). This can be problematic as non-data nodes tend to have fewer resources (e.g., network and memory) than data nodes; routing reads using org.elasticsearch.hadoop.mr.EsInputFormat and writes using org.elasticsearch.hadoop.mr.EsOutputFormat through them puts pressure on these resources.

For example, dedicated master nodes usually have smaller heaps than data nodes. Reading and writing from Hadoop MapReduce jobs will cause the network buffers in the Elasticsearch JVM on these nodes to fill up, leading to continual GC churn. This churn can lead to GC pauses that cause the node to be temporarily partitioned from the cluster; this is especially problematic when the node is the elected master node.

This can be avoided by disabling es.nodes.discovery and only specifying the data nodes in the cluster in the es.nodes property. However, disabling es.nodes.discovery and managing through es.nodes is not user-friendly, especially for large clusters or clusters with frequent node maintenance.
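
As a minimal sketch of the workaround just described, the settings could be assembled as below. The class name, method name, and host names are hypothetical, and `java.util.Properties` stands in for the job configuration so the example stays self-contained; in a real MapReduce job the same keys would be set on the Hadoop `Configuration` object with `conf.set(key, value)`.

```java
import java.util.Properties;

public class DataNodesOnlyWorkaround {
    // Builds the connector settings for the manual workaround.
    // The caller supplies the data node addresses; nothing is discovered.
    static Properties workaroundSettings(String... dataNodes) {
        Properties settings = new Properties();
        // List only the data nodes explicitly.
        settings.setProperty("es.nodes", String.join(",", dataNodes));
        // Turn off discovery so no other cluster nodes are added to the list.
        settings.setProperty("es.nodes.discovery", "false");
        return settings;
    }

    public static void main(String[] args) {
        // Placeholder host names, not real nodes.
        Properties settings = workaroundSettings("data-1:9200", "data-2:9200");
        settings.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The drawback noted above is visible here: the list of data nodes has to be maintained by hand whenever nodes join or leave the cluster.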

A more user-friendly approach would be to not route through non-data nodes by default, while still permitting it via a configuration option. Since the data-nodes-only behavior would be enabled by default, this is a behavior change from previous versions.
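
The proposed behavior can be illustrated with a small filtering sketch. This is not the connector's actual implementation: the class, the method, the `dataOnly` flag, and the node addresses are all hypothetical, and the map of node address to "holds data" is assumed to come from whatever discovery step the client already performs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NodeFilterSketch {
    // Returns the addresses the client may route through.
    // nodes maps a node address to whether that node holds data.
    // When dataOnly is true, non-data nodes are skipped.
    static List<String> routableNodes(Map<String, Boolean> nodes, boolean dataOnly) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Boolean> node : nodes.entrySet()) {
            if (!dataOnly || node.getValue()) {
                result.add(node.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder cluster: one dedicated master and two data nodes.
        Map<String, Boolean> discovered = new LinkedHashMap<>();
        discovered.put("master-1:9200", false);
        discovered.put("data-1:9200", true);
        discovered.put("data-2:9200", true);

        // Proposed default: only data nodes are used.
        System.out.println(routableNodes(discovered, true));
        // Opt-out: every discovered node is used, as in previous versions.
        System.out.println(routableNodes(discovered, false));
    }
}
```

With the flag on, the dedicated master is dropped from the routing list; with it off, the list matches the previous behavior.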

@jasontedor jasontedor self-assigned this Jul 20, 2015
jasontedor added a commit to jasontedor/elasticsearch-hadoop that referenced this issue Jul 20, 2015
This commit adds a configuration option es.nodes.data.only for managing whether or not reads using EsInputFormat and writes using
EsOutputFormat in Hadoop MapReduce jobs route through non-data nodes. The reason for avoiding non-data nodes is
that these nodes are usually resource-constrained relative to data nodes. This configuration option is enabled by default.

Closes elastic#512
costin added a commit that referenced this issue Sep 1, 2015
@costin
Member

costin commented Sep 1, 2015

Merged into master and, once the CI passes, into 2.1.x.
Excellent PR - thanks again for it. Looking forward to the next one ;)

Cheers,

jasontedor added a commit that referenced this issue Sep 1, 2015
This commit adds a configuration option es.nodes.data.only for managing whether or not reads using EsInputFormat and writes using
EsOutputFormat in Hadoop MapReduce jobs route through non-data nodes. The reason for avoiding non-data nodes is
that these nodes are usually resource-constrained relative to data nodes. This configuration option is enabled by default.

Closes #512

(cherry picked from commit c23a79f)
costin added a commit that referenced this issue Sep 1, 2015
Polish code

relates #512

(cherry picked from commit b7d3e59)