Hotfix guard kernel launches #1364

TorreZuk · 2023-10-17T16:29:54Z

Cherry-picks "guard most kernel GGL launches (#2087)" (b281770).
The macro looks at hipPeekAtLastError to determine the status of the asynchronous kernel launch only.
New code pattern: all kernel launch functions return a status.

* guard kernel GGL launches; a difference between the hipPeekAtLastError results before and after a launch causes a rocBLAS failure status
* log these HIP errors to rocblas_cerr as causing the failure status

amcamd · 2023-10-17T22:46:36Z

Can you add a section to the file docs/API_Reference_Guide.rst after the section Asynchronous API? It should explain the use of hipPeekAtLastError(), something like the text below, but please edit it:

Kernel launch status error checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The function hipPeekAtLastError() is called before and after kernel launches. This detects incorrect launch parameters, for example exceeding the max work-group size or the max number of work-items. It also detects if the code is running on the incorrect GPU. Note that hipPeekAtLastError() does not flush the last error. The disadvantage of using hipPeekAtLastError() is that if the previous last error from another kernel launch is the same as the error from the current kernel, then no error is reported. You can prevent this by flushing the last error with hipGetLastError() before calling a rocBLAS function. Note that both hipPeekAtLastError() and hipGetLastError() run synchronously on the CPU and they only check the kernel launch, not the asynchronous work done by the kernel.
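To illustrate the masking case described above, here is a minimal C++ model that runs without a GPU. It is only a sketch under stated assumptions, not the rocBLAS macro or the HIP runtime: `Err`, `FakeRuntime`, and `guarded_launch` are hypothetical names, with `peek()` and `get()` standing in for hipPeekAtLastError() and hipGetLastError().

```cpp
#include <cassert>

// Hypothetical error codes; the real HIP runtime uses hipError_t.
enum class Err { Success, InvalidConfiguration, InvalidDevice };

// Hypothetical model of the runtime's sticky "last error" state.
struct FakeRuntime {
    Err last = Err::Success;

    // Stand-in for hipPeekAtLastError(): reads the last error, does not clear it.
    Err peek() const { return last; }

    // Stand-in for hipGetLastError(): reads the last error and resets it.
    Err get() {
        Err e = last;
        last = Err::Success;
        return e;
    }

    // Simulated kernel launch: a failed launch overwrites the last error.
    void launch(Err outcome) {
        if (outcome != Err::Success) last = outcome;
    }
};

// Guard in the style discussed above: compare the peeked error before and
// after the launch. Returns true when no new error was observed.
bool guarded_launch(FakeRuntime& rt, Err outcome) {
    const Err before = rt.peek();
    rt.launch(outcome);
    const Err after = rt.peek();
    return before == after;
}
```

In this model, a failed launch on a clean state is reported, a second failure with the same error code goes unnoticed because the peeked value does not change, and calling `get()` first (the flush) makes the guard report the failure again.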

amcamd

Requested additional documentation.

TorreZuk · 2023-10-17T22:55:16Z

Okay, I can add something like that. I think it checks thread limits but not work-item limits. The other launch styles check max work-groups. However, 0 work-groups is checked. ... the pre-existing error may not be from a kernel launch, but a matching error is probably very unlikely.

amcamd · 2023-10-17T23:00:57Z

OK, it will be best to list only some of the things it is known to check. We do not want to claim it checks something we do not know it checks, and we do not want to document hipPeekAtLastError(). We do not document what we do not implement :)

TorreZuk · 2023-10-17T23:21:56Z

> OK, it will be best to list only some of the things it is known to check. We do not want to claim it checks something we do not know it checks, and we do not want to document hipPeekAtLastError(). We do not document what we do not implement :)

You can review the last commit.

TorreZuk and others added 3 commits October 17, 2023 09:34

guard most kernel GGL launches (#2087)

b281770

* guard kernel GGL launches; a difference between the hipPeekAtLastError results before and after a launch causes a rocBLAS failure status
* log these HIP errors to rocblas_cerr as causing the failure status

back out other 6.1 arg name change

29dc755

document hipGetLastError API change handling

2a67892

TorreZuk requested review from amcamd, mahmoodw, daineAMD, bragadeesh, NaveenElumalaiAMD, rkamd, yoichiyoshida, babakpst and nakajee as code owners October 17, 2023 16:29

amcamd approved these changes Oct 17, 2023


document launch error checking

c518edc

TorreZuk merged commit 72e5736 into release-staging/rocm-rel-6.0 Oct 18, 2023
1 of 2 checks passed

TorreZuk deleted the hotfix-guard-kernel-launches branch October 18, 2023 00:45
