Hadoop-Spark2Elasticsearch data ingestion problem: Elasticsearch index docs count is greater than Hive table rows count #628

steccami · 2015-12-16T08:58:00Z

Hi all. I am trying to store a Hive table into an Elasticsearch 1.7 index following this approach: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html.
At the end of the ingestion phase, the document count of the ES index is greater than the row count of the Hive table.
The same problem occurs when sending the same data from Spark. Hive and Spark are running on CDH 5.4.

I see some job failures due to this exception:
WARN TaskSetManager: Lost task 58.0 in stage 11.0 (TID 4791, xxx): org.elasticsearch.hadoop.EsHadoopException: Could not write all entries [32/471680] (maybe ES was overloaded?). Bailing out...
This is a documented issue (https://www.elastic.co/guide/en/elasticsearch/hadoop/current/performance.html), but the usual symptom seems to be fewer docs in ES, not more.

My guess is that the YARN job re-submission mechanism causes Hadoop to re-send records, which duplicates docs in ES. Does this explanation make sense? Any suggestions on how to fix it?

Thanks in advance for your help.
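
The duplication hypothesis above can be illustrated with a small simulation. This is only a sketch, not ES-Hadoop code: a plain dict stands in for the ES index, and `bulk_index`, `row_key`, and the "fails after 6 rows" scenario are hypothetical names and numbers chosen for illustration. It assumes that documents written without an explicit ID each receive a freshly generated one, so a re-sent record becomes a new document.

```python
import uuid


def bulk_index(index, records, id_field=None):
    """Write records into a dict that stands in for an ES index."""
    for rec in records:
        # Without an explicit ID, every write gets a fresh generated ID,
        # so re-sending the same record creates a second document.
        doc_id = rec[id_field] if id_field else uuid.uuid4().hex
        index[doc_id] = rec


# Hypothetical source table with 10 rows.
rows = [{"row_key": i, "value": f"v{i}"} for i in range(10)]

# First task attempt writes 6 rows and then fails;
# the retried attempt re-sends all 10 rows.
auto_ids = {}
bulk_index(auto_ids, rows[:6])
bulk_index(auto_ids, rows)

# Same scenario, but each document ID is derived from the row itself.
explicit_ids = {}
bulk_index(explicit_ids, rows[:6], id_field="row_key")
bulk_index(explicit_ids, rows, id_field="row_key")

print(len(rows), len(auto_ids), len(explicit_ids))  # 10 16 10
```

With generated IDs the index ends up with more documents than source rows, which matches the symptom described. With IDs taken from the data, the retry overwrites instead of duplicating. ES-Hadoop has an `es.mapping.id` setting for designating an ID field; whether this table has a suitable unique column is not stated in the thread.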

costin · 2016-01-15T22:22:27Z

Hi.
Yes, it does. In the case of MapReduce, one can actually try to prevent this from happening, as documented here. Hive also has an option for this (which ES-Hadoop should document), namely hive.mapred.reduce.tasks.speculative.execution. Can you set that to false and see whether it makes any difference?
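
As a rough sketch of what trying this could look like, the snippet below renders the relevant properties as Hive `SET` statements and `spark-submit` flags. Only `hive.mapred.reduce.tasks.speculative.execution` comes from the reply above; the `mapreduce.*` and `spark.speculation` properties are added here on the assumption that the cluster runs Hadoop 2 MapReduce and Spark, and the helper function names are hypothetical.

```python
# Properties that turn off speculative (duplicate) task attempts.
HIVE_SETTINGS = {
    # Named in the reply above.
    "hive.mapred.reduce.tasks.speculative.execution": "false",
    # Assumption: Hadoop 2 MapReduce property names.
    "mapreduce.map.speculative": "false",
    "mapreduce.reduce.speculative": "false",
}

# Assumption: Spark's own speculation switch (off by default).
SPARK_SETTINGS = {"spark.speculation": "false"}


def hive_set_statements(settings):
    """Render settings as statements to run at the top of a Hive script."""
    return [f"SET {key}={value};" for key, value in settings.items()]


def spark_submit_flags(settings):
    """Render settings as --conf flags for spark-submit."""
    return [f"--conf {key}={value}" for key, value in settings.items()]


print("\n".join(hive_set_statements(HIVE_SETTINGS)))
print(" ".join(spark_submit_flags(SPARK_SETTINGS)))
```

Note that these switches only stop speculative attempts. A task that fails outright, as in the "Could not write all entries" warning, and is then retried may still re-send records it had already written.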

steccami · 2016-01-19T13:16:12Z

Thank you very much for your reply! In the future I will check this parameter very carefully.
Regards.

costin · 2016-01-29T10:26:26Z

Closing the issue.

costin added question v2.1.3 v2.2.0 :Hive labels Jan 15, 2016

costin added a commit that referenced this issue Jan 15, 2016

[DOC] Extend section on speculative execution

f2c4325

related #628

costin added a commit that referenced this issue Jan 16, 2016

[DOC] Extend section on speculative execution

71c4afa

related #628 (cherry picked from commit f2c4325)

costin closed this as completed Jan 29, 2016
