Support serializing packed tables directly for the normal shuffle path #10818
Conversation
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Making this a draft because there are still 5 unit tests failing.
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Worked around the failing tests by disabling the GPU serde, and filed issue #10823 to track the follow-up.
build
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build
This is a quick first pass
@@ -1788,6 +1788,15 @@ val SHUFFLE_COMPRESSION_LZ4_CHUNK_SIZE = conf("spark.rapids.shuffle.compression.
    .integerConf
    .createWithDefault(20)

  val SHUFFLE_GPU_SERDE_ENABLED =
    conf("spark.rapids.shuffle.serde.enabled")
Let's change this to:
spark.rapids.shuffle.serde.type
There are two types so far: "CPU" and "GPU". The way the flag is used is fine; we would still convert it to a boolean isGpuSerdeEnabled, but we would test whether spark.rapids.shuffle.serde.type == "GPU".
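A minimal sketch of what the suggested entry could look like, assuming the RapidsConf builder exposes the usual doc/internal/stringConf/checkValues/createWithDefault methods used by neighboring entries (this is not the actual PR change):

```scala
// Hedged sketch only: builder methods and wording are assumptions based on how
// other entries in RapidsConf are defined.
val SHUFFLE_SERDE_TYPE = conf("spark.rapids.shuffle.serde.type")
  .doc("Serialization format for the normal shuffle path. \"CPU\" keeps the " +
    "existing JCudfSerialization path; \"GPU\" serializes packed tables directly.")
  .internal()
  .stringConf
  .checkValues(Set("CPU", "GPU"))
  .createWithDefault("CPU")

// Call sites would still reduce it to a boolean, e.g.:
// val isGpuSerdeEnabled = get(SHUFFLE_SERDE_TYPE) == "GPU"
```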
It's internal right now, but as we learn more about this method we need to add documentation that says when to use which, or come up with smart heuristics that pick CPU/GPU automatically (so we could add another type: AUTO).
Done for the config name part
@@ -504,6 +504,7 @@ class AdaptiveQueryExecSuite
      // disable DemoteBroadcastHashJoin rule from removing BHJ due to empty partitions
      .set(SQLConf.NON_EMPTY_PARTITION_RATIO_FOR_BROADCAST_JOIN.key, "0")
      .set(RapidsConf.TEST_ALLOWED_NONGPU.key, "ShuffleExchangeExec,HashPartitioning")
      .set(RapidsConf.SHUFFLE_GPU_SERDE_ENABLED.key, "false")
We should remove these override settings since disabled is the default.
Done
import org.apache.spark.sql.vectorized.ColumnarBatch

private sealed trait TableSerde {
  protected val P_MAGIC_NUM: Int = 0x43554447 // "CUDF".asInt + 1
Since we have our own P_MAGIC_NUM, could we not use it to detect whether the data is GpuTableSerde or JCudfSerialized? This should be follow-on work, but I imagine a case where we might want to use a specific serialization format for columns of a certain type, size, or complexity versus the other.
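As a rough illustration of that follow-on idea, the read side could peek at the leading magic value to pick the deserializer. Everything below is hypothetical (the helper, the enum, and the assumption that JCudfSerialized data starts with a different magic); only P_MAGIC_NUM comes from this PR:

```scala
import java.io.{BufferedInputStream, DataInputStream}

object ShuffleSerdeFormat extends Enumeration {
  val GpuTableSerde, JCudfSerialized = Value
}

// Peek at the first 4 bytes of the shuffle stream without consuming them, then
// route to the matching deserializer based on the magic number.
def detectFormat(buffered: BufferedInputStream): ShuffleSerdeFormat.Value = {
  val pMagicNum = 0x43554447 // "CUDF".asInt + 1, as defined in TableSerde
  buffered.mark(4)
  val magic = new DataInputStream(buffered).readInt()
  buffered.reset() // the chosen deserializer re-reads the header from 'buffered'
  if (magic == pMagicNum) ShuffleSerdeFormat.GpuTableSerde
  else ShuffleSerdeFormat.JCudfSerialized // assumes any other magic means JCudf data
}
```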
whoops, I didn't mean to approve before.
Which queries were slower? It would be great to get some feedback from you on what is different between the customer query and the NDS queries. Also, which queries got faster from NDS? That would be interesting. I did also write internally as I'd like to see more standard configurations used for this benchmark on the next run, so we can compare apples-to-apples with our baseline.
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
…uffle-gpu-serde Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build
Please add more context about why the test cases in #10823 are failing before merging this PR. We'd like to understand if that issue needs to be addressed as part of this PR.
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Done
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Moving this to draft since the perf is not as good as we expected. The previous 2x speedup was obtained only when setting the executor cores to 2, but it is supposed to be 16.
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@winningsix I am going to close this since we failed to find a case that can benefit from it. We can reopen it if we find one in the future.
Contributes to #10790
Fix #10841
This PR tries to accelerate the normal shuffle path by partitioning and slicing tables on the GPU.
The sliced tables are already serializable, so they can be written to the shuffle output stream directly, along with lightweight metadata (a TableMeta) to rebuild each table on the shuffle read side.
On the shuffle read side, the newly introduced PackedTableIterator reads the tables from the shuffle input stream and rebuilds them on the GPU by leveraging the existing utilities (MetaUtils, GpuCompressedColumnVector). Next, the existing GpuCoalesceBatches node concatenates the batches for the downstream operators, similar to what the Rapids Shuffle does.
This led to some perf regression in NDS runs, so the feature is disabled by default. However, we got about a 2x speedup for a customer query (only when setting the executor cores to 2, while it is supposed to be 16).
Waiting for more tests ...
Numbers for 3k parquet data on our cluster.