Hi all. I am trying to store a Hive table into an Elasticsearch 1.7 index following this approach: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html.
At the end of the ingestion phase, the ES index's document count is greater than the Hive table's row count.
The same problem occurs when sending the same data from Spark. Hive and Spark are running on CDH 5.4.
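For reference, the mapping that guide describes looks roughly like this; a minimal sketch, where the jar path, index name, node address and columns are placeholders rather than my actual ones:

```sql
-- Hive table backed by an Elasticsearch index via ES-Hadoop,
-- following the hive.html guide linked above.
ADD JAR /path/to/elasticsearch-hadoop.jar;

CREATE EXTERNAL TABLE es_export (
  id   BIGINT,
  name STRING,
  ts   TIMESTAMP
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'myindex/mytype',
  'es.nodes'    = 'eshost:9200'
);

-- The ingestion phase is then a plain INSERT from the source table.
INSERT OVERWRITE TABLE es_export
SELECT id, name, ts FROM source_table;
```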
I see some job failures due to this exception:
WARN TaskSetManager: Lost task 58.0 in stage 11.0 (TID 4791, xxx): org.elasticsearch.hadoop.EsHadoopException: Could not write all entries [32/471680] (maybe ES was overloaded?). Bailing out...
This is a documented issue (https://www.elastic.co/guide/en/elasticsearch/hadoop/current/performance.html), but it seems that usually the problem is having fewer docs in ES, not more.
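That page mostly covers tuning the bulk writes; a sketch of what that would look like on the table above (the setting names are from the ES-Hadoop configuration docs, the values are illustrative, not recommendations):

```sql
ALTER TABLE es_export SET TBLPROPERTIES (
  -- documents per bulk request (default 1000)
  'es.batch.size.entries' = '500',
  -- data per bulk request (default 1mb)
  'es.batch.size.bytes' = '512kb',
  -- retries before a task bails out, as in the log above (default 3)
  'es.batch.write.retry.count' = '5',
  -- wait between retries (default 10s)
  'es.batch.write.retry.wait' = '30s'
);
```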
I would say that YARN's task re-submission mechanism causes Hadoop to re-send records, which duplicates documents in ES. Does this explanation make sense? Any suggestions on how to fix it?
Thanks in advance for your help.
Hi.
Yes, it does. In the case of Map Reduce, one can actually try to prevent this from happening, as documented here. Hive also has an option for this (which ES-Hadoop should document), namely hive.mapred.reduce.tasks.speculative.execution - can you set that to false and see whether it makes any difference?
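Something along these lines before running the INSERT; the first property is the Hive-level switch, and the mapred.* pair are the Hadoop-level ones the ES-Hadoop docs mention:

```sql
-- Turn off speculative execution so a straggler task is not cloned;
-- a clone would write the same rows to ES a second time.
SET hive.mapred.reduce.tasks.speculative.execution = false;
SET mapred.map.tasks.speculative.execution = false;
SET mapred.reduce.tasks.speculative.execution = false;
```

For the Spark job, the counterpart setting would be spark.speculation, which is off by default.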