Tests fail on AMD MI25, ROCm 1.6.4 #4

Closed
jszuppe opened this issue Nov 10, 2017 · 1 comment

Contributor

jszuppe commented Nov 10, 2017

rocRAND's tests test_hiprand_kernel, test_hiprand_api, and test_rocrand_kernel_philox4x32_10 fail intermittently on AMD MI25 with ROCm 1.6.4. These tests don't fail on ROCm 1.6.3 or on CUDA 8/9. As far as we know right now, they also don't fail on any other device with ROCm 1.6.4. Currently, we suspect the problem is in ROCm, not in rocRAND.

After investigating, we think it's some kind of synchronisation bug that shows up only in very specific situations. Until it's fixed, you can use the temporary workarounds from branch rocm_164_mi25_workarounds.

Most features (including the most popular ones) should not be affected by this bug.

Environment

Hardware:

  • AMD Radeon Instinct MI25

Software versions:

  • ROCm 1.6.4
  • HIP 1.3.17385
  • HCC clang version 6.0.0 (based on HCC 1.0.17412-f590a25-821e6d8-64e7fc7)
  • rocRAND master (452ef66)

Workarounds

The possible workarounds for this bug are:

  • adding additional synchronization after kernels and before copying the memory (as presented in branch rocm_164_mi25_workarounds; you can try using hipStreamWaitEvent() or hipStreamSynchronize() which should have less impact on performance),
  • setting environment variable HCC_OPT_FLUSH to 0, or
  • setting HIP_LAUNCH_BLOCKING to 1.
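
For the two environment-variable workarounds, a small launcher can set the variable and then start the failing test so that the child process inherits it. The following is only a minimal sketch, assuming a POSIX system: the names `Workaround`, `apply_workaround`, and `run_with_workaround` are hypothetical and not part of rocRAND or HIP, and the sketch assumes that the runtime reads these variables when the test process starts.

```cpp
#include <cstdlib>    // std::system
#include <stdlib.h>   // setenv (POSIX)
#include <string>

// Hypothetical helper type: which environment-variable workaround to apply.
enum class Workaround { DisableOptFlush, BlockingLaunch };

// Sets the chosen variable in this process's environment, so that child
// processes started afterwards inherit it. Returns true on success.
bool apply_workaround(Workaround w) {
    switch (w) {
    case Workaround::DisableOptFlush:
        return setenv("HCC_OPT_FLUSH", "0", /*overwrite=*/1) == 0;
    case Workaround::BlockingLaunch:
        return setenv("HIP_LAUNCH_BLOCKING", "1", /*overwrite=*/1) == 0;
    }
    return false;
}

// Applies the workaround, then runs `command` through the shell.
// Returns -1 if the variable could not be set, otherwise the raw
// status reported by std::system.
int run_with_workaround(Workaround w, const std::string& command) {
    if (!apply_workaround(w)) {
        return -1;
    }
    return std::system(command.c_str());
}
```

A call such as `run_with_workaround(Workaround::BlockingLaunch, "./test_hiprand_kernel")` would then run one of the affected tests with HIP_LAUNCH_BLOCKING set to 1; the path to the test binary here is an assumption and depends on your build directory.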

Please comment if you have problems applying the workarounds, or if you experience a similar bug in a different place or on a different device.

jszuppe added the bug label on Nov 10, 2017
jszuppe changed the title from "Runtime errors on AMD MI25, ROCm 1.6.4" to "Tests fail on AMD MI25, ROCm 1.6.4" on Nov 10, 2017
bragadeesh added a commit that referenced this issue Dec 14, 2017
Fix generate_func to pass-by-value
Contributor Author

jszuppe commented Jul 13, 2018

I can't reproduce the random failures on ROCm 1.8.1. Please reopen if you see these random failures again.

jszuppe closed this as completed on Jul 13, 2018
SergiyKostrov pushed a commit to SergiyKostrov/rocRAND that referenced this issue Dec 12, 2022
Add license and contributing.md

Closes ROCm#4

See merge request !15