Hotfix guard kernel launches #1364

Merged
merged 4 commits into from
Oct 18, 2023

TorreZuk
Contributor

  • Cherry-picks "guard most kernel GGL launches (#2087)" (b281770)
  • The macro checks hipPeekAtLastError to determine the status of the asynchronous kernel launch only, not of the kernel's execution
  • New code pattern: all kernel launch functions return a status
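
The pattern in the bullets above can be sketched roughly as follows. This is a minimal illustration in plain C++, not the macro from this PR: `Status`, `DeviceError`, `peek_last_error`, `fake_launch`, `GUARDED_LAUNCH`, and `scale_vector` are all hypothetical stand-ins, and the 1024-thread limit is an assumed value. The real code would call `hipPeekAtLastError()` around an actual kernel launch.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; not the rocBLAS or HIP API.
enum class Status { success, internal_error };
enum class DeviceError { none, invalid_configuration };

// Simulated "last error" slot. Peeking reads it without clearing it.
static DeviceError g_last_error = DeviceError::none;

DeviceError peek_last_error() { return g_last_error; }

// Simulated asynchronous launch: only the launch parameters are validated.
// The 1024-thread limit is an assumption made for this sketch.
void fake_launch(int blocks, int threads)
{
    if(blocks <= 0 || threads <= 0 || threads > 1024)
        g_last_error = DeviceError::invalid_configuration;
}

// Guard: peek before and after the launch, and return a failure status from
// the enclosing function if the two peeked values differ.
#define GUARDED_LAUNCH(...)                       \
    do                                            \
    {                                             \
        DeviceError before_ = peek_last_error();  \
        fake_launch(__VA_ARGS__);                 \
        DeviceError after_ = peek_last_error();   \
        if(after_ != before_)                     \
            return Status::internal_error;        \
    } while(0)

// Every launch function returns a status, so callers can propagate failures.
Status scale_vector(int n)
{
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads; // n == 0 gives 0 blocks
    GUARDED_LAUNCH(blocks, threads);
    return Status::success;
}
```

In this sketch, `scale_vector(1000)` returns `Status::success`, while `scale_vector(0)` computes zero blocks and returns `Status::internal_error`. A caller would check the returned status and pass it up rather than assume the launch succeeded.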

TorreZuk and others added 3 commits October 17, 2023 09:34
* guard kernel GGL launches; a difference between the hipPeekAtLastError results before and after a launch causes a rocBLAS failure status
* log these HIP errors to rocblas_cerr as the cause of the failure status
@amcamd
Contributor

amcamd commented Oct 17, 2023

Can you add a section to the file docs/API_Reference_Guide.rst after the "Asynchronous API" section? It should explain the use of hipPeekAtLastError, something like the text below, but please edit it:

Kernel launch status error checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The function hipPeekAtLastError() is called before and after kernel launches. This detects incorrect launch parameters, for example exceeding the maximum work-group size or the maximum number of work-items. It also detects if the code is running on the incorrect GPU. Note that hipPeekAtLastError() does not clear the last error. A disadvantage of this approach is that if the last error left by an earlier kernel launch is the same as the error from the current launch, then no error is reported. You can prevent this by calling hipGetLastError() to clear the last error before calling a rocBLAS function. Note that both hipPeekAtLastError() and hipGetLastError() run synchronously on the CPU, and they check only the kernel launch, not the asynchronous work done by the kernel.
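
The masking caveat in this paragraph, and the effect of clearing the last error first, can be illustrated with a small model. This is only a sketch in plain C++ under stated assumptions: `FakeRuntime`, `Err`, and `guarded_launch_fails` are hypothetical names, and the 1024-thread limit is an assumed value. Here `peek()` plays the role of `hipPeekAtLastError()` (read without clearing) and `get()` plays the role of `hipGetLastError()` (read and reset).

```cpp
#include <cassert>

// Hypothetical model of a runtime's "last error" slot; not the HIP API.
enum class Err { none, bad_launch };

struct FakeRuntime
{
    Err last = Err::none;

    // Reads the last error and leaves it in place.
    Err peek() const { return last; }

    // Reads the last error and resets it to "none".
    Err get()
    {
        Err e = last;
        last  = Err::none;
        return e;
    }

    // Simulated launch: records an error only for bad launch parameters.
    // The 1024-thread limit is an assumption made for this sketch.
    void launch(int threads)
    {
        if(threads <= 0 || threads > 1024)
            last = Err::bad_launch;
    }
};

// Returns true when the before/after comparison reports a launch failure.
bool guarded_launch_fails(FakeRuntime& rt, int threads)
{
    Err before = rt.peek();
    rt.launch(threads);
    Err after = rt.peek();
    return after != before;
}
```

With a fresh `FakeRuntime rt`, the first `guarded_launch_fails(rt, 4096)` reports a failure. A second identical call reports none, because the same error is already in the slot. Calling `rt.get()` in between clears the slot, so the next bad launch is detected again.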

Contributor

@amcamd amcamd left a comment

Requested additional documentation.

@TorreZuk
Contributor Author

Okay, I can add something like that. I think it checks thread limits but not work-item limits. The other launch styles check max work groups. But 0 work groups is checked. ... The pre-existing error may not be from a kernel launch, but it is very unlikely that it would match the new error.

@amcamd
Contributor

amcamd commented Oct 17, 2023

OK, it will be best to list only some of the things it is known to check. We do not want to claim it checks something we do not know it checks, and we do not want to document hipPeekAtLastError() itself. We do not document what we do not implement :)

@TorreZuk
Contributor Author

> OK, it will be best to list only some of the things it is known to check. We do not want to claim it checks something we do not know it checks, and we do not want to document hipPeekAtLastError() itself. We do not document what we do not implement :)

You can review the last commit.

@TorreZuk TorreZuk merged commit 72e5736 into release-staging/rocm-rel-6.0 Oct 18, 2023
1 of 2 checks passed
@TorreZuk TorreZuk deleted the hotfix-guard-kernel-launches branch October 18, 2023 00:45