
Host Memory OOM handling for RowToColumnarIterator #10617

Merged · 8 commits · Apr 1, 2024

Conversation

@jbrennan333 (Collaborator) commented Mar 20, 2024

Closes #8887

This adds host memory OOM handling for the slower path of GpuRowToColumnarExec.

Most of this patch adds code to allocate a single spillable host buffer up front and then slice it up for each column builder. It does a first pass of slicing (one slice per column), and then the RapidsHostColumnBuilder slices its buffer further to pre-allocate the data, offsets, and validity buffers for itself and its children. The pre-allocation logic is optional for the column builders; we can still use them in the old way of dynamically growing host buffers as needed.
The validity buffer is pre-allocated for all nullable columns, but it is only used if nulls were actually added.
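The slicing scheme described above can be sketched roughly as follows. This is a simplified Python model using integer offsets in place of real HostMemoryBuffer slices; the function and field names are illustrative stand-ins, not the plugin's actual API:

```python
def plan_column_slices(total_bytes, columns):
    """Carve one pre-allocated buffer into per-column regions.

    columns: list of dicts giving estimated byte sizes for the
    data, offsets, and validity buffers of each column.
    Returns a list of {part: (offset, length)} slice plans.
    """
    plan = []
    cursor = 0
    for col in columns:
        slices = {}
        for part in ("data", "offsets", "validity"):
            size = col.get(part, 0)
            if size > 0:
                slices[part] = (cursor, size)
                cursor += size
        plan.append(slices)
    if cursor > total_bytes:
        raise MemoryError("estimated sizes exceed the single allocation")
    return plan

# Two columns: a fixed-width int column and a nullable string column.
cols = [
    {"data": 4 * 1024, "validity": 128},
    {"data": 16 * 1024, "offsets": 4 * 1024 + 4, "validity": 128},
]
plan = plan_column_slices(64 * 1024, cols)
```

The key property this models is that the whole batch comes out of one up-front (spillable) allocation, so the host memory cost is visible to the OOM framework at a single point instead of being spread over many small dynamic growths.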

This also handles cases where we overwrite one of the pre-allocated buffers for the columns. We use the Retryable interface to add checkpoint/restore logic to the RapidsHostColumnBuilders. We checkpoint before writing out a row, and then if we overwrite while writing a row, we restore all of the columns to the checkpointed state. We then save the row that was in progress for later processing. If it is the very first row, or if we have a coalesce goal for a single batch, we re-enable dynamic growth for the builders and try that row again. This risks an OOM, but prevents an outright failure for this case.
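The checkpoint/restore flow can be sketched as follows. This is a toy Python model of the idea; the builder class and its fields are hypothetical stand-ins for RapidsHostColumnBuilder, not the real Scala implementation:

```python
class FixedBuilder:
    """Toy column builder with a fixed capacity and checkpoint/restore."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.length = 0
        self._saved = 0

    def checkpoint(self):
        self._saved = self.length

    def restore(self):
        self.length = self._saved

    def append(self, nbytes):
        if self.length + nbytes > self.capacity:
            raise OverflowError("would overwrite the pre-allocated buffer")
        self.length += nbytes


def write_rows(builders, rows):
    """Write rows until one would overwrite a pre-allocated buffer.

    Returns (rows_written, leftover_row); the leftover row is saved
    for later processing, mirroring the description above.
    """
    written = 0
    for row in rows:
        for b in builders:
            b.checkpoint()  # checkpoint all columns before each row
        try:
            for b, nbytes in zip(builders, row):
                b.append(nbytes)
        except OverflowError:
            for b in builders:
                b.restore()  # roll every column back to the checkpoint
            return written, row
        written += 1
    return written, None


builders = [FixedBuilder(10), FixedBuilder(10)]
written, leftover = write_rows(builders, [(4, 4), (4, 4), (4, 4)])
```

In the real patch, the saved leftover row is retried with dynamic growth re-enabled when it is the very first row or when the coalesce goal requires a single batch.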

This has primarily been tested via the existing integration and unit tests, and also by running NDS locally and on a larger cluster. A performance check of NDS at 3TB found no significant performance impact. I added some smaller batch sizes to the main row_conversion_test to force it into the overwrite code paths.

@jbrennan333 jbrennan333 self-assigned this Mar 20, 2024
@jbrennan333 jbrennan333 added feature request New feature or request reliability Features to improve reliability or bugs that severely impact the reliability of the plugin labels Mar 20, 2024
@jbrennan333 (Collaborator Author): build

@jbrennan333 (Collaborator Author):

Put up commits to merge up to latest, fix a unit test failure, and parameterize batchSizeBytes for the test_row_conversion integration test. By testing with 4MB and 1KB batch sizes, the test now exercises the new code paths that deal with overwriting one of the host columns.

@jbrennan333 (Collaborator Author): build

@jbrennan333 (Collaborator Author):

Some integration tests were failing because column views were being created with a validity buffer even when there were no nulls. The old code would never create the validity buffer if there were no nulls. This code pre-allocates it in case we need it, but if it ends up unused, we need to close it and set it to null before creating the GPU columns.
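That fix amounts to a check like the following before building the device columns. This is an illustrative Python sketch; the Slice class and finalize_validity name are hypothetical, not the plugin's actual code:

```python
class Slice:
    """Stand-in for a pre-allocated validity buffer slice."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def finalize_validity(validity, null_count):
    """Release the pre-allocated validity buffer when no nulls were
    written, so the column view is built without one (matching the
    old behavior that the integration tests expect)."""
    if validity is not None and null_count == 0:
        validity.close()  # free the unused slice
        return None       # column view gets no validity buffer
    return validity


no_nulls = Slice()
with_nulls = Slice()
kept_a = finalize_validity(no_nulls, null_count=0)
kept_b = finalize_validity(with_nulls, null_count=3)
```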

@jbrennan333 (Collaborator Author): build

@jbrennan333 (Collaborator Author): build

@jbrennan333 (Collaborator Author):

I have added the host memory retries to this. I will update the description.

@jbrennan333 (Collaborator Author): build

@jbrennan333 jbrennan333 marked this pull request as ready for review March 22, 2024 21:54
@jbrennan333 (Collaborator Author): build

@jbrennan333 (Collaborator Author):

I think the premerge failures may be unrelated:

[2024-03-24T18:49:20.236Z] Caused by: ai.rapids.cudf.CudfException: CUDF failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-708-cuda11/thirdparty/cudf/cpp/src/io/comp/nvcomp_adapter.cpp:688: Compression error: nvCOMP 2.4 or newer is required for Zstandard compression

It looks like a build issue where spark-rapids-jni failed to pull in the correct nvcomp version.

@jlowe (Member) commented Mar 25, 2024:

> It looks like a build issue where spark-rapids-jni failed to pull in the correct nvcomp version.

That seems like a scary error. How could we be pulling in such an old nvcomp version during the build?

Tracked by #10627

@jbrennan333 (Collaborator Author): build

@revans2 (Collaborator) previously approved these changes Mar 26, 2024:

The code looks good.

@jbrennan333 (Collaborator Author):

This looks like a testing framework failure:

2024-03-26T17:09:05.0979031Z [2024-03-26T17:08:28.028Z] ../../../../integration_tests/src/main/python/join_test.py::test_broadcast_join_right_struct_as_key[Right-Struct(['child0', String],['child1', Byte],['child2', Short],['child3', Integer],['child4', Long],['child5', Boolean],['child6', Date],['child7', Timestamp],['child8', Null],['child9', Decimal(12,2)])][DATAGEN_SEED=1711463898, TZ=UTC, INJECT_OOM, IGNORE_ORDER({'local': True})] Could not connect to ci-scala213-jenkins-rapids-premerge-github-9235-fwfbp-6fdvt to send interrupt signal to process
2024-03-26T17:09:05.0980062Z [2024-03-26T17:08:28.052Z] ci-scala213-jenkins-rapids-premerge-github-9235-fwfbp-6fdvt was marked offline: Connection was broken: java.nio.channels.ClosedChannelException
2024-03-26T17:09:05.0980760Z [2024-03-26T17:08:28.053Z] 	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:155)
2024-03-26T17:09:05.0981428Z [2024-03-26T17:08:28.053Z] 	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:143)
2024-03-26T17:09:05.0981952Z [2024-03-26T17:08:28.053Z] 	at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:789)
2024-03-26T17:09:05.0982662Z [2024-03-26T17:08:28.053Z] 	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:30)
2024-03-26T17:09:05.0983346Z [2024-03-26T17:08:28.053Z] 	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:70)
2024-03-26T17:09:05.0984040Z [2024-03-26T17:08:28.053Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2024-03-26T17:09:05.0984768Z [2024-03-26T17:08:28.053Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2024-03-26T17:09:05.0985228Z [2024-03-26T17:08:28.053Z] 	at java.base/java.lang.Thread.run(Thread.java:829)
2024-03-26T17:09:05.0985393Z [2024-03-26T17:08:28.053Z] 
2024-03-26T17:09:05.0986407Z [2024-03-26T17:08:28.066Z] Fail to find or publish report...java.io.IOException: Unable to create live FilePath for ci-scala213-jenkins-rapids-premerge-github-9235-fwfbp-6fdvt
2024-03-26T17:09:05.0987491Z [2024-03-26T17:08:28.084Z] ci-scala213-jenkins-rapids-premerge-github-9235-fwfbp-6fdvt was marked offline: Connection was broken: java.nio.channels.ClosedChannelException
2024-03-26T17:09:05.0988131Z [2024-03-26T17:08:28.084Z] 	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:155)
2024-03-26T17:09:05.0988807Z [2024-03-26T17:08:28.084Z] 	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:143)
2024-03-26T17:09:05.0989327Z [2024-03-26T17:08:28.084Z] 	at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:789)
2024-03-26T17:09:05.0990089Z [2024-03-26T17:08:28.084Z] 	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:30)
2024-03-26T17:09:05.0990769Z [2024-03-26T17:08:28.084Z] 	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:70)
2024-03-26T17:09:05.0991459Z [2024-03-26T17:08:28.084Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2024-03-26T17:09:05.0992138Z [2024-03-26T17:08:28.084Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2024-03-26T17:09:05.0992542Z [2024-03-26T17:08:28.084Z] 	at java.base/java.lang.Thread.run(Thread.java:829)
2024-03-26T17:09:05.0992717Z [2024-03-26T17:08:28.084Z] 
2024-03-26T17:09:05.0993619Z Unable to create live FilePath for ci-scala213-jenkins-rapids-premerge-github-9235-fwfbp-6fdvt****** Result of stage Premerge CI 2 is FAILURE ******

@jbrennan333 (Collaborator Author): build

@jbrennan333 jbrennan333 changed the title Change GpuRowToColumnarIterator to allocate a single buffer for builders Host Memory OOM handling for RowToColumnarIterator Mar 26, 2024
@jbrennan333 (Collaborator Author):

I have been doing some additional testing with ScaleTest query 7. Running this on my desktop at scale 1, complexity 10, and with ShuffleExchangeExec disabled, I can force host memory OOMs in this code (RowToColumnarIterator). To see the impact, I changed the original RapidsHostColumnBuilder code to use HostAlloc.alloc() instead of HostMemoryBuffer.allocate() so I could see where we start running out of memory.

Before this patch, I hit a CPU OOM at 16GB of heap memory (no OOM at 17GB).
With this patch, I hit a CPU OOM at 6GB of heap memory (no OOM at 7GB).

I am running with 16 CPU cores, 16GB executor memory, and 4 concurrent GPU tasks.

@jbrennan333 (Collaborator Author):

I'm seeing a lot of Java gateway errors in the premerge build log:

2024-03-26T19:43:23.7922131Z [2024-03-26T19:38:38.162Z] ConnectionRefusedError: [Errno 111] Connection refused
2024-03-26T19:43:23.7922604Z [2024-03-26T19:38:38.162Z] --------------------------- Captured stderr teardown ---------------------------
2024-03-26T19:43:23.7923269Z [2024-03-26T19:38:38.162Z] 2024-03-26 19:13:51 ERROR    An error occurred while trying to connect to the Java server (****:35623)
2024-03-26T19:43:23.7923586Z [2024-03-26T19:38:38.162Z] Traceback (most recent call last):
2024-03-26T19:43:23.7924860Z [2024-03-26T19:38:38.162Z]   File "/home/jenkins/agent/workspace/jenkins-rapids_premerge-github-9236-ci-2/.download/spark-3.1.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 977, in _get_connection
2024-03-26T19:43:23.7925175Z [2024-03-26T19:38:38.162Z]     connection = self.deque.pop()
2024-03-26T19:43:23.7925543Z [2024-03-26T19:38:38.162Z] IndexError: pop from an empty deque
2024-03-26T19:43:23.7925719Z [2024-03-26T19:38:38.162Z] 
2024-03-26T19:43:23.7926212Z [2024-03-26T19:38:38.162Z] During handling of the above exception, another exception occurred:
2024-03-26T19:43:23.7926381Z [2024-03-26T19:38:38.162Z] 
2024-03-26T19:43:23.7926694Z [2024-03-26T19:38:38.162Z] Traceback (most recent call last):
2024-03-26T19:43:23.7927922Z [2024-03-26T19:38:38.162Z]   File "/home/jenkins/agent/workspace/jenkins-rapids_premerge-github-9236-ci-2/.download/spark-3.1.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1115, in start

@jbrennan333 (Collaborator Author):

So far I have not been able to repro any premerge test failures locally, so I merged up to HEAD and will kick off the build again.

@jbrennan333 (Collaborator Author): build

@jbrennan333 (Collaborator Author):

I am still having trouble reproducing these premerge integration test failures. I have been able to run the full join_test.py with no failures. All of the tests that fail during premerge pass for me locally.

@jbrennan333 (Collaborator Author):

I found a bug (a leaked spillable host buffer) while trying to repro. That might explain the premerge test failures.

@jbrennan333 (Collaborator Author): build

@sameerz sameerz removed the feature request New feature or request label Mar 27, 2024
@jbrennan333 (Collaborator Author):

As another test, I ran the full NDS power run at scale 100 on my desktop, with

spark.rapids.sql.exec.ShuffleExchangeExec=false
spark.rapids.memory.host.offHeapLimit.enabled=true
spark.rapids.memory.host.offHeapLimit.size=2G

All queries passed, and the output was validated.

@jbrennan333 (Collaborator Author) commented Apr 1, 2024:

I filed a follow-up PR, #10647, to add host memory OOM handling to the other places where GpuColumnarBatchBuilder is used.

@jbrennan333 (Collaborator Author):

I ran one final NDS A/B performance check on an 8-node A100 cluster, and there was no measurable performance impact from this change.

@jbrennan333 jbrennan333 merged commit c28c7fa into NVIDIA:branch-24.04 Apr 1, 2024
43 checks passed
jlowe added a commit to jlowe/spark-rapids that referenced this pull request Apr 2, 2024
)"

This reverts commit c28c7fa.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
jlowe added a commit that referenced this pull request Apr 2, 2024
…10657)

This reverts commit c28c7fa.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
Labels
reliability Features to improve reliability or bugs that severely impact the reliability of the plugin
Development

Successfully merging this pull request may close these issues.

[FEA] Add Host Memory Retry for Row to Columnar Conversion
4 participants