CUDA RUNTIME API error: DeviceSetLimit failed with error cudaErrorInvalidValue #776

Open
hfp opened this issue Apr 3, 2024 · 6 comments

Comments

@hfp
Member

hfp commented Apr 3, 2024

This PR seems to cause:

CUDA RUNTIME API error: DeviceSetLimit failed with error cudaErrorInvalidValue.

(tested on an H100 device)

Originally posted by @hfp in #767 (comment)

hfp added a commit to hfp/dbcsr that referenced this issue Apr 3, 2024
Citation: "Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses the printf() device system call - in such case cudaErrorInvalidValue will be returned."
@alazzaro
Member

alazzaro commented Apr 3, 2024

According to the CUDA description:

cudaLimitPrintfFifoSize controls the size in bytes of the shared FIFO used by the printf() device system call. Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses the printf() device system call - in such case cudaErrorInvalidValue will be returned.

But then we don't call any printf (all are masked). And I don't understand why we see this problem only on H100...

@hfp
Member Author

hfp commented Apr 3, 2024

I do not understand it either. I have not yet root-caused the issue, let alone reported the software versions such as CUDA (or HPCSDK). I am currently retrying with this change.

@alazzaro
Member

alazzaro commented Apr 3, 2024

"I do not understand it either. I have not yet root-caused the issue, let alone reported the software versions such as CUDA (or HPCSDK). I am currently retrying with this change."

It makes sense...

@hfp
Member Author

hfp commented Apr 3, 2024

Since DeviceSetLimit is governed by ACC_API_CALL, the symbol NDEBUG must not be defined for reproducing the issue.

hfp added a commit to hfp/dbcsr that referenced this issue Apr 4, 2024
* Citation: "Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses the printf() device system call - in such case cudaErrorInvalidValue will be returned."
* Since DeviceSetLimit is governed by ACC_API_CALL, the symbol NDEBUG must not be defined for reproducing the issue.
@hfp hfp mentioned this issue Apr 4, 2024
hfp added a commit to hfp/dbcsr that referenced this issue Apr 16, 2024
* Citation: "Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses the printf() device system call - in such case cudaErrorInvalidValue will be returned."
* Since DeviceSetLimit is governed by ACC_API_CALL, the symbol NDEBUG must not be defined for reproducing the issue.
@alazzaro
Member

Let's leave this ticket open... I think the issue arises when the runtime (RT) fails to build a kernel, but I'm not sure...

@alazzaro
Member

alazzaro commented Apr 16, 2024

(Taking over from #777 (comment))

I think we can move the call to a more convenient place...

What do you suggest? Putting it into acc_init may not be the right thing as it is device specific.

I wonder if the code in question should be removed entirely?

I am starting to think this is the right solution, but I need more time to investigate it (see my previous comment).

@hfp hfp closed this as completed in 31ddf41 Apr 16, 2024
@hfp hfp reopened this Apr 16, 2024