Make Spark know the partitioning of the read data #153

EnricoMi · 2021-11-24T14:23:36Z

The connector partitions the graph so that Spark can read it in parallel, but Spark knows nothing about that partitioning. Say the connector partitions the graph by predicate and uid range: Spark does not know this and will repartition / shuffle the read data if it wants to join on partition or uid. If Spark knew the exact partitioning scheme, it could avoid unneeded shuffle steps.

Check to what extent Spark allows data sources to report their partitioning.
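To make the goal concrete, here is a toy model in plain Python (not Spark or connector code) of the property Spark would need to be told about: every distinct key value lives in exactly one partition, so grouping or joining on that key needs no shuffle. The names `partition_by_predicate_and_uid_range`, `is_clustered_by` and `uid_chunk` are made up for illustration, and the rows are simplified to `(predicate, uid)` pairs. On the Spark side, the DataSource V2 API has a `SupportsReportPartitioning` interface that looks like the place to start; whether it can express this scheme still needs checking.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Row = Tuple[str, int]  # (predicate, uid): a simplified, hypothetical row shape


def partition_by_predicate_and_uid_range(rows: List[Row], uid_chunk: int) -> List[List[Row]]:
    """Group rows so each partition holds one predicate and one uid range."""
    partitions: Dict[Tuple[str, int], List[Row]] = {}
    for predicate, uid in rows:
        # uid // uid_chunk identifies the uid range this row falls into
        partitions.setdefault((predicate, uid // uid_chunk), []).append((predicate, uid))
    return list(partitions.values())


def is_clustered_by(partitions: List[List[Row]], key: Callable[[Row], Hashable]) -> bool:
    """True if every distinct key value appears in exactly one partition."""
    owner: Dict[Hashable, int] = {}
    for index, partition in enumerate(partitions):
        for row in partition:
            # The first partition seen for a key becomes its owner;
            # seeing the key in any other partition breaks the clustering.
            if owner.setdefault(key(row), index) != index:
                return False
    return True


rows: List[Row] = [("name", 1), ("name", 2), ("name", 7), ("friend", 1), ("friend", 8)]
all_partitions = partition_by_predicate_and_uid_range(rows, uid_chunk=5)

# Reading a single predicate: each uid lands in exactly one partition.
name_partitions = partition_by_predicate_and_uid_range(
    [row for row in rows if row[0] == "name"], uid_chunk=5
)

clustered_by_predicate_and_uid = is_clustered_by(all_partitions, key=lambda row: row)
clustered_by_uid_all_predicates = is_clustered_by(all_partitions, key=lambda row: row[1])
clustered_by_uid_one_predicate = is_clustered_by(name_partitions, key=lambda row: row[1])
```

In this model the data is clustered by `(predicate, uid)`, and by `uid` alone when only one predicate is read, but not by `uid` across several predicates, because the same uid then shows up in one partition per predicate. Only the first two cases are ones where Spark could skip the shuffle if it knew the scheme.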

EnricoMi added the enhancement New feature or request label Nov 24, 2021
