[FEA] Support for a custom DataSource V2 which supplies Arrow data #1072

Closed · Dooyoung-Hwang opened this issue Nov 5, 2020 · 9 comments
Labels: feature request (New feature or request), P0 (Must have for release)

@Dooyoung-Hwang (Contributor) commented Nov 5, 2020

**Is your feature request related to a problem? Please describe.**
When I executed an aggregation query against our custom data source, the physical plan looked like this:

```scala
spark.sql("SELECT bucket, count(*) FROM test_table GROUP BY bucket").explain(true)
```

```
== Physical Plan ==
*(2) GpuColumnarToRow false
+- GpuHashAggregate(keys=[bucket#423], functions=[gpucount(1)], output=[bucket#423, count(1)#570L])
   +- GpuCoalesceBatches TargetSize(2147483647)
      +- GpuColumnarExchange gpuhashpartitioning(bucket#423, 10), true, [id=#1210]
         +- GpuHashAggregate(keys=[bucket#423], functions=[partial_gpucount(1)], output=[bucket#423, count#574L])
            +- GpuRowToColumnar TargetSize(2147483647)
               +- *(1) Scan R2Relation(com.skt.spark.r2.RedisConfig@232fa9c6,2147483647) [bucket#423] PushedFilters: [], ReadSchema: struct<bucket:string>
```

This shows that InternalRows are built first and then transformed into ColumnarBatches by the GpuRowToColumnar plan. If the custom DataSource could provide an RDD[ColumnarBatch] to spark-rapids directly, it would be more efficient because the conversion overhead would be removed.

**Describe the solution you'd like**

  1. In spark-rapids, add a Scala trait (or Java interface) through which a data source can supply an RDD of ColumnarBatch; see the sketch after this list.
  2. If the class in a custom V1 DataSource that extends BaseRelation also implements this interface, the physical plan that scans the custom V1 source can also be overridden by spark-rapids.
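
A minimal sketch of what such a hook might look like (the trait and method names are hypothetical, not an existing spark-rapids API):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical mix-in for a V1 BaseRelation: a relation that implements
// this could hand spark-rapids columnar batches directly, letting the
// plugin skip the row-to-columnar conversion.
trait SupportsColumnarScan {
  /** Returns the scan result as columnar batches instead of internal rows. */
  def buildColumnarScan(requiredColumns: Array[String]): RDD[ColumnarBatch]
}
```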

The changed physical plan would look like this:

```
== Physical Plan ==
*(1) GpuColumnarToRow false
+- GpuHashAggregate(keys=[bucket#423], functions=[gpucount(1)], output=[bucket#423, count(1)#570L])
   +- GpuCoalesceBatches TargetSize(2147483647)
      +- GpuColumnarExchange gpuhashpartitioning(bucket#423, 10), true, [id=#1210]
         +- GpuHashAggregate(keys=[bucket#423], functions=[partial_gpucount(1)], output=[bucket#423, count#574L])
            +- GpuV1SourceScan Batched: true, DataFilters: [], Format: r2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<bucket:string>
```
Dooyoung-Hwang added the "? - Needs Triage" and "feature request" labels on Nov 5, 2020
@jlowe (Member) commented Nov 5, 2020

> If the custom DataSource could provide an RDD[ColumnarBatch] to spark-rapids directly, it would be more efficient because the conversion overhead would be removed.

Does this RDD[ColumnarBatch] contain GPU data or CPU data? If the latter, there would still be a conversion from host columnar data to device columnar data. That type of conversion is already supported by the plugin, but it's important to note that a (cheaper) conversion would still occur: the plan would have a HostColumnarToGpu node instead of a GpuRowToColumnar node.
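
For context, the host-columnar path is what a DataSource V2 source opts into through Spark's PartitionReaderFactory. A minimal sketch, with an illustrative class name and the source-specific reader left unimplemented:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}
import org.apache.spark.sql.vectorized.ColumnarBatch

class MyColumnarReaderFactory extends PartitionReaderFactory {
  // Declaring columnar reads lets the plugin insert the cheaper
  // HostColumnarToGpu transition instead of GpuRowToColumnar.
  override def supportColumnarReads(partition: InputPartition): Boolean = true

  override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] =
    ??? // return batches of host columnar data (source-specific)

  // The row-based reader is still required by the interface, but it is
  // unused when Spark takes the columnar path.
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    throw new UnsupportedOperationException("columnar scan only")
}
```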

sameerz removed the "? - Needs Triage" label on Nov 10, 2020
tgravescs added this to "To do" in Release 0.4 via automation on Jan 5, 2021
tgravescs added the "P0 (Must have for release)" label on Jan 5, 2021
@tgravescs (Collaborator):

After discussion: DataSource V1 doesn't support columnar, so we will switch to DataSource V2. With DataSource V2, custom data sources just work, and we insert a HostColumnarToGpu transition to get the data onto the GPU.

In this case I believe the data will already be in Arrow format (ArrowColumnVector), so we can investigate making HostColumnarToGpu smarter about getting the data onto the GPU. See the sketch below.
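
To illustrate: Spark can wrap Arrow vectors in ArrowColumnVector, so the batches reaching HostColumnarToGpu are already backed by Arrow buffers. A sketch of what a source might produce (the helper name is hypothetical):

```scala
import org.apache.arrow.vector.ValueVector
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}

// Hypothetical helper: wrap one Arrow vector per column into a Spark
// ColumnarBatch without copying the underlying Arrow buffers.
def arrowToBatch(vectors: Seq[ValueVector]): ColumnarBatch = {
  val columns: Array[ColumnVector] = vectors.map(v => new ArrowColumnVector(v)).toArray
  val numRows = if (vectors.isEmpty) 0 else vectors.head.getValueCount
  new ColumnarBatch(columns, numRows)
}
```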

tgravescs changed the title from "[FEA] Support API which a custom V1 DataSource can provide RDD[ColumnarBatch] to spark-rapids instead of RDD[InternalRow] or RDD[Row]" to "[FEA] Support for a custom DataSource V2 which supplies Arrow data" on Jan 6, 2021
@tgravescs (Collaborator):

Note: looking at a couple of sample queries, they use round of a decimal, for which support is in progress, and they also use average of a decimal, which we don't support yet.

@tgravescs (Collaborator):

Note: for sample queries and data we can look at the NYC taxi ride dataset and queries:

https://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
Explanation: https://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html
The 4 queries can also be found here: https://tech.marksblogg.com/omnisci-macos-macbookpro-mbp.html

Results from other solutions: https://tech.marksblogg.com/benchmarks.html

@sameerz (Collaborator) commented Jan 6, 2021

Rounding support is being worked on in #1244.
Average should work once we support casting, which is being tracked in this issue: #1330.

@tgravescs (Collaborator) commented Jan 19, 2021

Note we may also need percentile_approx here.

@tgravescs (Collaborator):

cuDF issue for percentile_approx: rapidsai/cudf#7170

@tgravescs (Collaborator):

The main functionality to support a faster copy when using DataSource V2 supplying Arrow data is committed under #1622. It supports primitive types and strings. It does not support decimal or nested types yet.

@tgravescs (Collaborator):

Note: filed a separate issue for the write side, #1648.
I'm going to close this, as the initial version is committed.

Release 0.4 automation moved this from To do to Done Feb 2, 2021
tgravescs modified the milestone: Feb 1 - Feb 12 on Feb 2, 2021