When saving an RDD to Elasticsearch, the Elasticsearch-Spark connector uses RDD.take to check whether the input RDD is empty (line 59). This means that if the input RDD has not been persisted, it will be evaluated twice under saveToEs. If the input RDD has a wide dependency, this could be costly. Thus, the caller should persist the RDD before calling saveToEs.
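The cost described above comes from lazy evaluation: an unpersisted RDD re-runs its whole lineage for every action. The following is a pure-Python analogue of that behavior (the class and names are hypothetical, not the Spark API); it counts how many times the "lineage" runs with and without persisting before the emptiness check and the save.

```python
class LazyDataset:
    """Mimics an unpersisted RDD: each action re-runs the compute function."""

    def __init__(self, compute):
        self._compute = compute   # stands in for an expensive lineage, e.g. a wide dependency
        self._cache = None

    def take(self, n):
        return list(self.materialize())[:n]

    def materialize(self):
        if self._cache is not None:
            return self._cache    # persisted: reuse the prior evaluation
        return self._compute()    # unpersisted: recompute the whole lineage

    def persist(self):
        self._cache = list(self._compute())
        return self


evaluations = 0

def expensive_compute():
    global evaluations
    evaluations += 1              # count how often the lineage actually runs
    return [1, 2, 3]

# Unpersisted: the emptiness check plus the save evaluate the data twice.
rdd = LazyDataset(expensive_compute)
if rdd.take(1):                   # first evaluation (the connector's check)
    list(rdd.materialize())       # second evaluation (the actual save)
assert evaluations == 2

# Persisted first: the lineage runs once and both actions reuse it.
evaluations = 0
rdd = LazyDataset(expensive_compute).persist()
if rdd.take(1):
    list(rdd.materialize())
assert evaluations == 1
```

In real Spark code the equivalent of `persist()` is calling `rdd.persist()` (or `rdd.cache()`) before `saveToEs`, which is exactly the workaround the issue suggests.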
@randallwhitman The 'pattern' above is actually used (or at least used to be; I need to double-check) in Spark itself as well. The idea is an optimization: avoid starting the whole job behind the scenes if the RDD is empty. Do you encounter any issues with your RDDs? Do you actually find the RDD being evaluated twice? rdd.take(1) should trigger minimal evaluation of the RDD content, which should be reused.
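The claim that take(1) triggers only "minimal evaluation" can be illustrated with a lazy source in pure Python (a sketch, not the Spark API): pulling a single element from a generator materializes only that element, not the full dataset.

```python
import itertools

produced = []

def records():
    # Lazy source standing in for an RDD partition iterator.
    for i in range(1_000_000):
        produced.append(i)        # track how many records are actually computed
        yield i

# Analogue of rdd.take(1): consume just the first element.
first = list(itertools.islice(records(), 1))

assert first == [0]
assert len(produced) == 1         # only one record was materialized
```

The caveat raised in the issue still applies, though: in Spark this cheapness holds per partition scan, but with a wide dependency even producing the first element may require a shuffle of upstream stages.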
Just to keep things clean, I've removed the take(1) for now.
I've checked the Spark code and the mailing list and there's no clear consensus; the operation is clearly not cheap and is highly dependent on the RDD. To avoid any side effects, I've removed the take(1) bit - it might trigger some unneeded jobs (when the RDD is empty), but at least it is clearer in the general case.
Closing the issue...
Agree or disagree: