The embedded REST client in the MapReduce integration in Elasticsearch Hadoop will by default route through non-data nodes (e.g., dedicated master nodes). This can be problematic as non-data nodes tend to have fewer resources (e.g., network and memory) than data nodes; routing reads using org.elasticsearch.hadoop.mr.EsInputFormat and writes using org.elasticsearch.hadoop.mr.EsOutputFormat through them puts pressure on these resources.
For example, dedicated master nodes usually have smaller heaps than data nodes. Reading and writing from Hadoop MapReduce jobs will cause the network buffers in the Elasticsearch JVM on these nodes to fill up, leading to continual GC churn. This churn can lead to GC pauses long enough that the node is temporarily partitioned from the cluster; this is especially problematic when the node is the elected master node.
This can be avoided by disabling es.nodes.discovery and only specifying the data nodes in the cluster in the es.nodes property. However, disabling es.nodes.discovery and managing through es.nodes is not user-friendly, especially for large clusters or clusters with frequent node maintenance.
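As a rough illustration of the workaround described above, the sketch below builds the two settings by hand. It uses a plain `java.util.Properties` object so that it stays self-contained; in a real job the same keys would typically be set on the Hadoop job configuration. The class name, the helper `pinToDataNodes`, and the host names are hypothetical and chosen only for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class DataNodesWorkaround {

    // Builds settings that disable node discovery and pin the connector
    // to an explicit, manually maintained list of data nodes.
    // (Hypothetical helper; not part of Elasticsearch Hadoop.)
    static Properties pinToDataNodes(List<String> dataNodeHosts) {
        Properties settings = new Properties();
        // Without discovery, only the nodes listed in es.nodes are used.
        settings.setProperty("es.nodes.discovery", "false");
        // Comma-separated list of data nodes only.
        settings.setProperty("es.nodes", String.join(",", dataNodeHosts));
        return settings;
    }

    public static void main(String[] args) {
        // Hypothetical host names; every data node must be listed by hand.
        Properties settings =
                pinToDataNodes(Arrays.asList("data-1:9200", "data-2:9200"));
        settings.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The list passed to `pinToDataNodes` has to be updated whenever a data node is added, removed, or taken down for maintenance, which is the maintenance burden this issue is about.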
A more user-friendly approach would be to not route through non-data nodes by default but permit routing through non-data nodes via a configuration option. Since this configuration option will be enabled by default, this is a behavior change from previous versions.
This commit adds a configuration option es.nodes.data.only for managing whether or not reads using EsInputFormat and writes using
EsOutputFormat in Hadoop MapReduce jobs route through non-data nodes. Routing through non-data nodes is worth avoiding
because these nodes are usually resource-constrained relative to data nodes. This configuration option is enabled by default.
Closes elastic#512
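A minimal sketch of how a job might opt back in to routing through non-data nodes with this option. As an assumption for illustration, the settings are modeled with `java.util.Properties` rather than the Hadoop job configuration, and the class and both helper methods are hypothetical; the fallback to `true` in `isDataOnly` mirrors the statement that the option is enabled by default, and is not the connector's actual parsing code.

```java
import java.util.Properties;

public class DataOnlyRouting {

    // Option name taken from the commit message above.
    static final String DATA_ONLY = "es.nodes.data.only";

    // Hypothetical helper: returns settings with the option set explicitly.
    static Properties routing(boolean dataNodesOnly) {
        Properties settings = new Properties();
        settings.setProperty(DATA_ONLY, Boolean.toString(dataNodesOnly));
        return settings;
    }

    // Hypothetical helper: reads the option, treating "unset" as enabled
    // because the commit message says the option is enabled by default.
    static boolean isDataOnly(Properties settings) {
        return Boolean.parseBoolean(settings.getProperty(DATA_ONLY, "true"));
    }

    public static void main(String[] args) {
        // Unset: requests avoid non-data nodes.
        System.out.println(isDataOnly(new Properties()));
        // Explicitly disabled: routing through non-data nodes is permitted again.
        System.out.println(isDataOnly(routing(false)));
    }
}
```

Jobs that relied on the earlier behavior would need to set the option to `false` explicitly, since leaving it unset now means non-data nodes are skipped.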
This commit adds a configuration option es.nodes.data.only for managing whether or not reads using EsInputFormat and writes using
EsOutputFormat in Hadoop MapReduce jobs route through non-data nodes. Routing through non-data nodes is worth avoiding
because these nodes are usually resource-constrained relative to data nodes. This configuration option is enabled by default.
Closes #512
(cherry picked from commit c23a79f)