Hadoop-Spark2Elasticsearch data ingestion problem: Elasticsearch index docs count is greater than Hive table rows count #628

Closed
steccami opened this issue Dec 16, 2015 · 3 comments

@steccami

Hi all. I am trying to store a Hive table into an Elasticsearch 1.7 index following this approach: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html.
At the end of the ingestion phase, the ES index document count is greater than the Hive table row count.
The same problem occurs when sending the same data from Spark. Hive and Spark are running on CDH 5.4.

I see some job failures due to this exception:
WARN TaskSetManager: Lost task 58.0 in stage 11.0 (TID 4791, xxx): org.elasticsearch.hadoop.EsHadoopException: Could not write all entries [32/471680] (maybe ES was overloaded?). Bailing out...
This is a documented issue (https://www.elastic.co/guide/en/elasticsearch/hadoop/current/performance.html), but it seems that the usual symptom is having fewer docs in ES, not more.

My guess is that the YARN job re-submission mechanism causes Hadoop to re-send records that were already written, which duplicates documents in ES. Does this explanation make sense? Any suggestions on how to fix it?
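
To make the suspected mechanism concrete, here is a minimal sketch in plain Python. It does not use ES-Hadoop or a real cluster: `FakeIndex`, `run_task`, and `ingest_with_retry` are hypothetical names, and the assumption is that a task writes a few documents, fails, and is then re-run from the start. With auto-generated ids the documents written before the failure stay in the index, so the final count exceeds the number of source rows.

```python
import itertools


class FakeIndex:
    """Toy stand-in for an ES index (hypothetical, not the real client)."""

    def __init__(self):
        self.docs = {}
        self._auto_ids = itertools.count()

    def index(self, doc, doc_id=None):
        # Without an explicit id, every write creates a new document,
        # which mirrors auto-generated ids.
        if doc_id is None:
            doc_id = f"auto-{next(self._auto_ids)}"
        self.docs[doc_id] = doc


def run_task(rows, index, fail_after=None, use_row_key=False):
    """Write rows in order; optionally fail once `fail_after` rows are written."""
    for written, (key, value) in enumerate(rows):
        if fail_after is not None and written == fail_after:
            raise RuntimeError("simulated bulk write failure")
        index.index(value, doc_id=key if use_row_key else None)


def ingest_with_retry(rows, use_row_key):
    """Run a task that fails partway, then re-run it from the start."""
    index = FakeIndex()
    try:
        run_task(rows, index, fail_after=3, use_row_key=use_row_key)
    except RuntimeError:
        # The scheduler re-submits the whole task; rows 0..2 are sent again.
        run_task(rows, index, use_row_key=use_row_key)
    return len(index.docs)


rows = [(f"row-{i}", {"n": i}) for i in range(5)]
duplicated = ingest_with_retry(rows, use_row_key=False)
idempotent = ingest_with_retry(rows, use_row_key=True)
print(duplicated, idempotent)  # 8 5
```

The second call is only an illustration of why stable document ids would make a re-run harmless. In ES-Hadoop that would presumably mean mapping a unique column to the document id (for example via `es.mapping.id`), which is an assumption beyond what this thread discusses.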

Thanks in advance for your help.

@costin
Member

costin commented Jan 15, 2016

Hi.
Yes, it does. In the case of Map/Reduce, one can actually try to prevent this from happening, as documented here. Hive also has an option for this (which ES-Hadoop should document), namely hive.mapred.reduce.tasks.speculative.execution. Can you set that to false and see whether it makes any difference?
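
As a rough sketch of what turning speculation off could look like, the snippet below renders the relevant properties as Hive `SET` statements and as `spark-submit` arguments. Only `hive.mapred.reduce.tasks.speculative.execution` comes from the comment above; the `mapreduce.*.speculative` and `spark.speculation` property names, and the helper functions themselves, are assumptions added for illustration.

```python
# Hive property named in the comment above.
HIVE_PROPS = {"hive.mapred.reduce.tasks.speculative.execution": "false"}

# Assumed equivalents for plain Map/Reduce and Spark (not named in this thread).
MAPREDUCE_PROPS = {
    "mapreduce.map.speculative": "false",
    "mapreduce.reduce.speculative": "false",
}
SPARK_PROPS = {"spark.speculation": "false"}


def hive_set_statements(props):
    """Render properties as Hive session `SET` statements, sorted by key."""
    return [f"SET {key}={value};" for key, value in sorted(props.items())]


def spark_submit_flags(props):
    """Render properties as `--conf key=value` arguments for spark-submit."""
    flags = []
    for key, value in sorted(props.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


hive_lines = hive_set_statements({**HIVE_PROPS, **MAPREDUCE_PROPS})
spark_flags = spark_submit_flags(SPARK_PROPS)
print("\n".join(hive_lines))
print(" ".join(spark_flags))
```

The `SET` lines would go at the top of the Hive script that writes to the ES-backed table. Note that these settings only address speculative duplicates; a task that fails and is retried after a partial write can still re-send records.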

costin added a commit that referenced this issue Jan 15, 2016
costin added a commit that referenced this issue Jan 16, 2016
related #628

(cherry picked from commit f2c4325)
@steccami
Author

Thank you very much for your reply! In the future I will check this parameter very carefully.
Regards.

@costin
Member

costin commented Jan 29, 2016

Closing the issue.

@costin costin closed this as completed Jan 29, 2016