When saving an RDD to Elasticsearch, the Elasticsearch-Spark connector uses RDD.take to check whether the input RDD is empty (line 59). This means that if the input RDD has not been persisted, it will be evaluated twice under saveToEs. If the input RDD has a wide dependency, this could be costly. Thus, the caller should persist the RDD before calling saveToEs.
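The cost described above comes from lazy evaluation: an unpersisted RDD re-runs its whole lineage for every action. The following is a pure-Python analogue of that behavior (the class and names are hypothetical, not the Spark API); it counts how many times the "lineage" runs with and without persisting before the emptiness check and the save.

```python
class LazyDataset:
    """Mimics an unpersisted RDD: each action re-runs the compute function."""

    def __init__(self, compute):
        self._compute = compute   # stands in for an expensive lineage, e.g. a wide dependency
        self._cache = None

    def take(self, n):
        return list(self.materialize())[:n]

    def materialize(self):
        if self._cache is not None:
            return self._cache    # persisted: reuse the prior evaluation
        return self._compute()    # unpersisted: recompute the whole lineage

    def persist(self):
        self._cache = list(self._compute())
        return self


evaluations = 0

def expensive_compute():
    global evaluations
    evaluations += 1              # count how often the lineage actually runs
    return [1, 2, 3]

# Unpersisted: the emptiness check plus the save evaluate the data twice.
rdd = LazyDataset(expensive_compute)
if rdd.take(1):                   # first evaluation (the connector's check)
    list(rdd.materialize())       # second evaluation (the actual save)
assert evaluations == 2

# Persisted first: the lineage runs once and both actions reuse it.
evaluations = 0
rdd = LazyDataset(expensive_compute).persist()
if rdd.take(1):
    list(rdd.materialize())
assert evaluations == 1
```

In real Spark code the equivalent of `persist()` is calling `rdd.persist()` (or `rdd.cache()`) before `saveToEs`, which is exactly the workaround the issue suggests.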
@randallwhitman The 'pattern' above is actually used (or at least used to be; I need to double-check) in Spark itself as well. The idea is an optimization: avoid starting the whole job behind the scenes if the RDD is empty. Do you encounter any issues with your RDDs? Do you actually find the RDD being evaluated twice? rdd.take(1) should trigger minimal evaluation of the RDD content, which should be reused.
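The claim that take(1) triggers only "minimal evaluation" can be illustrated with a lazy source in pure Python (a sketch, not the Spark API): pulling a single element from a generator materializes only that element, not the full dataset.

```python
import itertools

produced = []

def records():
    # Lazy source standing in for an RDD partition iterator.
    for i in range(1_000_000):
        produced.append(i)        # track how many records are actually computed
        yield i

# Analogue of rdd.take(1): consume just the first element.
first = list(itertools.islice(records(), 1))

assert first == [0]
assert len(produced) == 1         # only one record was materialized
```

The caveat raised in the issue still applies, though: in Spark this cheapness holds per partition scan, but with a wide dependency even producing the first element may require a shuffle of upstream stages.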
Just to keep things clean, I've removed the take(1) for now.
I've checked the Spark code and the mailing list and there's no clear consensus; the operation is clearly not cheap and is highly dependent on the RDD. To avoid any side effects, I've removed the take(1) bit - it might trigger some unneeded jobs (when the RDD is empty), but at least it is clearer in the general case.
Closing the issue...
Agree or disagree: