Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark: saveToEs can evaluate the RDD twice #631

Closed
randallwhitman opened this issue Dec 16, 2015 · 3 comments
Closed

Spark: saveToEs can evaluate the RDD twice #631

randallwhitman opened this issue Dec 16, 2015 · 3 comments

Comments

@randallwhitman
Copy link

Agree or disagree:

When saving an RDD to Elasticsearch, the Elasticsearch-Spark connector uses RDD.take to check whether the input RDD is empty (line 59). This means that if the input RDD has not been persisted, it will be evaluated twice under saveToEs. If the input RDD has a wide dependency, this could be costly. Thus, the caller should persist the RDD before calling saveToEs.

    if (rdd == null || rdd.partitions.length == 0 || rdd.take(1).length == 0) {
      return
    }
@costin
Copy link
Member

costin commented Jan 8, 2016

@randallwhitman The 'pattern' above is actually used (or at least used to, need to double check) in Spark as well. The idea is to optimize and avoid the whole job that gets started behind the scenes, if the RDD is empty. Do you encounter any issues with your RDDs? Do you actually find the RDD being evaluated twice?
rdd.take(1) should trigger minimal evaluation of the RDD content which should be reused.

Just to keep things clean, I've removed the take(1) for now.

@randallwhitman
Copy link
Author

"Do you actually find the RDD being evaluated twice?"

At this point, no, I have not staged a test of it. I had first posed this question as to best practices for using the Es-Spark connector.

costin added a commit that referenced this issue Jan 8, 2016
@costin
Copy link
Member

costin commented Jan 8, 2016

I've checked the Spark code and the mailing list and there's no clear consensus; clearly the operation is not cheap and highly dependent on the RDD. To avoid any side-effects, I've removed the take(1) bit - it might trigger some unneeded jobs (when the RDD is empty) but at least, it is more clearer in general case.
Closing the issue...

@costin costin closed this as completed Jan 8, 2016
costin added a commit that referenced this issue Jan 16, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants