CUDA RUNTIME API error: DeviceSetLimit failed with error cudaErrorInvalidValue #776

hfp · 2024-04-03T14:18:17Z

This PR seems to cause:

CUDA RUNTIME API error: DeviceSetLimit failed with error cudaErrorInvalidValue.

(Tested on an H100 device.)

Originally posted by @hfp in #767 (comment)

alazzaro · 2024-04-03T14:34:07Z

According to the CUDA description:

> cudaLimitPrintfFifoSize controls the size in bytes of the shared FIFO used by the printf() device system call. Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses the printf() device system call - in such case cudaErrorInvalidValue will be returned.

However, we don't call printf anywhere (all calls are masked), and I don't understand why we see this problem only on H100...

hfp · 2024-04-03T14:40:06Z

I do not understand it either. I have not root-caused the issue yet, let alone reported the software versions such as CUDA (or HPCSDK). I am currently retrying with this change.

alazzaro · 2024-04-03T14:41:54Z

> I do not understand it either. I have not root-caused the issue yet, let alone reported the software versions such as CUDA (or HPCSDK). I am currently retrying with this change.

That makes sense...

hfp · 2024-04-03T14:48:38Z

Since DeviceSetLimit is governed by ACC_API_CALL, the symbol NDEBUG must not be defined for reproducing the issue.
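The interaction between the two points above (the call-ordering rule and the NDEBUG dependency) can be sketched in plain C++ without a GPU. Everything in this sketch is hypothetical: `MockDevice`, `mock_set_printf_fifo_size`, and `mock_launch_printf_kernel` are stand-ins that only mimic the documented rule for cudaLimitPrintfFifoSize, and `CHECKED_CALL` is an illustrative macro, not the actual definition of ACC_API_CALL (which is not shown in this thread). It assumes that with NDEBUG defined the call is still made but its result is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for the runtime's error codes (NOT the CUDA API).
enum MockError { kSuccess = 0, kInvalidValue = 1 };

// Hypothetical device state: tracks whether a kernel using printf() ran.
struct MockDevice {
  bool printf_kernel_launched = false;
  std::size_t printf_fifo_size = 1u << 20;
};

// Mimics the quoted rule: resizing the printf FIFO is rejected once a
// kernel that uses printf() has been launched.
MockError mock_set_printf_fifo_size(MockDevice& dev, std::size_t bytes) {
  assert(bytes > 0);
  if (dev.printf_kernel_launched) return kInvalidValue;
  dev.printf_fifo_size = bytes;
  return kSuccess;
}

// Stand-in for launching a kernel that calls printf() on the device.
void mock_launch_printf_kernel(MockDevice& dev) {
  dev.printf_kernel_launched = true;
}

// Number of failures reported by CHECKED_CALL.
static int g_reported_errors = 0;

// Illustrative checking macro (assumption: the real ACC_API_CALL differs).
// Without NDEBUG the result is checked and reported; with NDEBUG the call
// is still made, but its result is silently dropped.
#if !defined(NDEBUG)
#define CHECKED_CALL(expr)                                 \
  do {                                                     \
    if ((expr) != kSuccess) {                              \
      ++g_reported_errors;                                 \
      std::fprintf(stderr, "error: %s failed\n", #expr);   \
    }                                                      \
  } while (0)
#else
#define CHECKED_CALL(expr) \
  do {                     \
    (void)(expr);          \
  } while (0)
#endif

// Sets the limit once before and once after a printf kernel launch and
// returns how many errors the checking macro reported.
int set_limit_after_launch() {
  MockDevice dev;
  g_reported_errors = 0;
  CHECKED_CALL(mock_set_printf_fifo_size(dev, 1024));  // before launch: accepted
  mock_launch_printf_kernel(dev);
  CHECKED_CALL(mock_set_printf_fifo_size(dev, 2048));  // after launch: rejected
  return g_reported_errors;
}
```

Under these assumptions, `set_limit_after_launch()` reports one error in a build without NDEBUG and none in a build with NDEBUG, which is consistent with the failure only being visible in debug-style builds.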

alazzaro · 2024-04-16T14:06:35Z

Let's leave this ticket open... I think the issue occurs when the runtime (RT) fails to build a kernel, but I'm not sure...

alazzaro · 2024-04-16T14:11:01Z

(Taking over from #777 (comment))

I think we can move the call to a more convenient place...

What do you suggest? Putting it into acc_init may not be the right thing as it is device specific.

I wonder if the code in question should be removed entirely?

I am starting to think this is the right solution, but I need more time to investigate it (see my previous comment).

hfp added a commit to hfp/dbcsr that referenced this issue Apr 3, 2024

Fixed cp2k#776

c002658

Citation: "Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses the printf() device system call - in such case cudaErrorInvalidValue will be returned."

hfp mentioned this issue Apr 3, 2024

Avoid unused const variable/verbose_print warning #767

Merged

hfp mentioned this issue Apr 4, 2024

Fixed #776 #777

Merged

hfp closed this as completed in 31ddf41 Apr 16, 2024

hfp reopened this Apr 16, 2024
