Do not route MapReduce reads and writes through non-data nodes #512

jasontedor · 2015-07-20T01:19:35Z

The embedded REST client in the MapReduce integration in Elasticsearch Hadoop will by default route through non-data nodes (e.g., dedicated master nodes). This can be problematic as non-data nodes tend to have fewer resources (e.g., network and memory) than data nodes; routing reads using org.elasticsearch.hadoop.mr.EsInputFormat and writes using org.elasticsearch.hadoop.mr.EsOutputFormat through them puts pressure on these resources.

For example, dedicated master nodes usually have smaller heaps than data nodes. Reading and writing from Hadoop MapReduce jobs causes the network buffers in the Elasticsearch JVM on these nodes to fill up, leading to continual GC churn. This churn can lead to GC pauses that cause the node to be temporarily partitioned from the cluster; this is especially problematic when the node is the elected master node.

This can be avoided by disabling es.nodes.discovery and only specifying the data nodes in the cluster in the es.nodes property. However, disabling es.nodes.discovery and managing through es.nodes is not user-friendly, especially for large clusters or clusters with frequent node maintenance.
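For reference, the workaround above could look roughly like the following. This is only a sketch: the host names are made up, and `java.util.Properties` stands in for the job configuration so the snippet runs without Hadoop on the classpath. In a real job the same two keys would be set on the Hadoop `Configuration` instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

final class DataNodesOnlyWorkaround {

    // Builds the two settings described above: discovery off, explicit data-node list.
    static Properties settings(List<String> dataNodes) {
        Properties props = new Properties();
        // Stop the connector from discovering (and routing through) other nodes.
        props.setProperty("es.nodes.discovery", "false");
        // List only the data nodes; this list has to be maintained by hand.
        props.setProperty("es.nodes", String.join(",", dataNodes));
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical data-node addresses, for illustration only.
        Properties props = settings(Arrays.asList("data-1:9200", "data-2:9200"));
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The hand-maintained list is exactly the part that gets painful on large clusters or when nodes are frequently added and removed.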

A more user-friendly approach would be to skip non-data nodes by default and permit routing through them via a configuration option. Because the new default restricts routing to data nodes, this is a behavior change from previous versions.
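To make the proposal concrete, here is a minimal sketch of the filtering idea. It is not the connector's actual implementation: the `Node` class, the `routable` method, and the `dataOnly` flag are hypothetical names used only for illustration, and the addresses are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NodeFilterSketch {

    // Hypothetical description of a discovered node; not a type from elasticsearch-hadoop.
    static final class Node {
        final String address;
        final boolean data;

        Node(String address, boolean data) {
            this.address = address;
            this.data = data;
        }
    }

    // Returns the addresses that requests may be routed through.
    // With dataOnly = true (the proposed default) non-data nodes are skipped.
    static List<String> routable(List<Node> discovered, boolean dataOnly) {
        List<String> result = new ArrayList<>();
        for (Node node : discovered) {
            if (!dataOnly || node.data) {
                result.add(node.address);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Node> discovered = Arrays.asList(
                new Node("master-1:9200", false),
                new Node("data-1:9200", true),
                new Node("data-2:9200", true));
        System.out.println(routable(discovered, true));
        System.out.println(routable(discovered, false));
    }
}
```

Passing `false` restores the previous behavior, where every discovered node is a routing candidate.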


This commit adds a configuration option es.nodes.data.only for managing whether or not reads using EsInputFormat and writes using EsOutputFormat in Hadoop MapReduce jobs route through non-data nodes. The reason for avoiding non-data nodes is that they are usually resource-constrained relative to data nodes. This configuration option is enabled by default. Closes elastic#512

Polish code relates #512

costin · 2015-09-01T10:29:01Z

Merged into master and, once the CI passes, into 2.1.x.
Excellent PR - thanks again for it. Looking forward to the next one ;)

Cheers,

This commit adds a configuration option es.nodes.data.only for managing whether or not reads using EsInputFormat and writes using EsOutputFormat in Hadoop MapReduce jobs route through non-data nodes. The reason for avoiding non-data nodes is that they are usually resource-constrained relative to data nodes. This configuration option is enabled by default. Closes #512 (cherry picked from commit c23a79f)

Polish code relates #512 (cherry picked from commit b7d3e59)

jasontedor added :MR enhancement labels Jul 20, 2015

jasontedor self-assigned this Jul 20, 2015

jasontedor mentioned this issue Jul 20, 2015

Do not route MapReduce reads and writes through non-data nodes #513

Closed

jasontedor closed this as completed in c23a79f Sep 1, 2015

costin added a commit that referenced this issue Sep 1, 2015

Improve error and debug message

b7d3e59

Polish code relates #512

costin added v2.1.2 v2.2.0-beta1 labels Sep 1, 2015

costin added a commit that referenced this issue Sep 1, 2015

Improve error and debug message

1209d9f

Polish code relates #512 (cherry picked from commit b7d3e59)
