8 test failures against openblas #367

Closed
littlewu2508 opened this issue Jan 19, 2022 · 7 comments

@littlewu2508

After resolving OpenMathLib/OpenBLAS#3513 and #363, I ran the tests for rocSOLVER-rocm-4.3.0 against openblas-0.3.19, and 8 test failures were reported:

[  FAILED  ] 8 tests, listed below:
[  FAILED  ] checkin_lapack/SYGS2.batched__float/16, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'L' (76, 0x4C) })
[  FAILED  ] checkin_lapack/SYGS2.batched__float/18, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'U' (85, 0x55) })
[  FAILED  ] checkin_lapack/SYGS2.strided_batched__float/16, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'L' (76, 0x4C) })
[  FAILED  ] checkin_lapack/SYGS2.strided_batched__float/18, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'U' (85, 0x55) })
[  FAILED  ] checkin_lapack/SYGST.batched__float/16, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'L' (76, 0x4C) })
[  FAILED  ] checkin_lapack/SYGST.batched__float/18, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'U' (85, 0x55) })
[  FAILED  ] checkin_lapack/SYGST.strided_batched__float/16, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'L' (76, 0x4C) })
[  FAILED  ] checkin_lapack/SYGST.strided_batched__float/18, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'U' (85, 0x55) })

The full test log, gzip-compressed:
rocSOLVER-rocm4.3.0-against-openblas.log.gz

Earlier I reported 8 test failures when comparing rocBLAS against openblas; they may be related to these failures.

Environment

Hardware:
GPU: Vega 20 [Radeon VII]
CPU: AMD Ryzen 7 5800X

Software:
Linux: 5.15.8
ROCK: Upstream Kernel
ROCR: v4.3.0
rocBLAS: rocm-4.3.0 with Tensile asm_full
Host Compiler: gcc-11.2
Device Compiler: hipcc-4.3.0, llvm-rocm-4.3.0

@cgmb
Collaborator

cgmb commented Jan 19, 2022

From your log:

[ RUN      ] checkin_lapack/SYGS2.batched__float/16
/fast/portage/sci-libs/rocSOLVER-4.3.0/work/rocSOLVER-rocm-4.3.0/clients/include/testing_sygsx_hegsx.hpp:446: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.00523617 vs 8.34465e-06
[  FAILED  ] checkin_lapack/SYGS2.batched__float/16, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'L' (76, 0x4C) }) (14 ms)
...
[ RUN      ] checkin_lapack/SYGS2.batched__float/18
/fast/portage/sci-libs/rocSOLVER-4.3.0/work/rocSOLVER-rocm-4.3.0/clients/include/testing_sygsx_hegsx.hpp:446: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 1.34363e-05 vs 8.34465e-06
[  FAILED  ] checkin_lapack/SYGS2.batched__float/18, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'U' (85, 0x55) }) (15 ms)
...
[ RUN      ] checkin_lapack/SYGS2.strided_batched__float/16
/fast/portage/sci-libs/rocSOLVER-4.3.0/work/rocSOLVER-rocm-4.3.0/clients/include/testing_sygsx_hegsx.hpp:446: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.00523617 vs 8.34465e-06
[  FAILED  ] checkin_lapack/SYGS2.strided_batched__float/16, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'L' (76, 0x4C) }) (10 ms)
...
[ RUN      ] checkin_lapack/SYGS2.strided_batched__float/18
/fast/portage/sci-libs/rocSOLVER-4.3.0/work/rocSOLVER-rocm-4.3.0/clients/include/testing_sygsx_hegsx.hpp:446: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 1.34363e-05 vs 8.34465e-06
[  FAILED  ] checkin_lapack/SYGS2.strided_batched__float/18, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'U' (85, 0x55) }) (9 ms)
...
[ RUN      ] checkin_lapack/SYGST.batched__float/16
/fast/portage/sci-libs/rocSOLVER-4.3.0/work/rocSOLVER-rocm-4.3.0/clients/include/testing_sygsx_hegsx.hpp:446: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.00525629 vs 8.34465e-06
[  FAILED  ] checkin_lapack/SYGST.batched__float/16, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'L' (76, 0x4C) }) (14 ms)
...
[ RUN      ] checkin_lapack/SYGST.batched__float/18
/fast/portage/sci-libs/rocSOLVER-4.3.0/work/rocSOLVER-rocm-4.3.0/clients/include/testing_sygsx_hegsx.hpp:446: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 9.74946e-06 vs 8.34465e-06
[  FAILED  ] checkin_lapack/SYGST.batched__float/18, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'U' (85, 0x55) }) (14 ms)
...
[ RUN      ] checkin_lapack/SYGST.strided_batched__float/16
/fast/portage/sci-libs/rocSOLVER-4.3.0/work/rocSOLVER-rocm-4.3.0/clients/include/testing_sygsx_hegsx.hpp:446: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.00525629 vs 8.34465e-06
[  FAILED  ] checkin_lapack/SYGST.strided_batched__float/16, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'L' (76, 0x4C) }) (10 ms)
...
[ RUN      ] checkin_lapack/SYGST.strided_batched__float/18
/fast/portage/sci-libs/rocSOLVER-4.3.0/work/rocSOLVER-rocm-4.3.0/clients/include/testing_sygsx_hegsx.hpp:446: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 9.74946e-06 vs 8.34465e-06
[  FAILED  ] checkin_lapack/SYGST.strided_batched__float/18, where GetParam() = ({ 70, 100, 110 }, { '1' (49, 0x31), 'U' (85, 0x55) }) (10 ms)

At first glance, half of those failures (/18) appear to be minor differences in precision. The other half (/16) will require more investigation.
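To put rough numbers on that split, the bound in the failing check is `n * get_epsilon<T>()`. The sketch below assumes that `n` is 70 (the first entry of the test parameters) and that `get_epsilon<float>()` matches `std::numeric_limits<float>::epsilon()`; together these assumptions reproduce the 8.34465e-06 bound printed in the log. The helpers `tolerance` and `error_ratio` are hypothetical names used for illustration and are not part of rocSOLVER.

```cpp
#include <cassert>
#include <limits>

// Bound used by the failing check: n * machine epsilon of T.
// Assumption: get_epsilon<T>() in the test client behaves like
// std::numeric_limits<T>::epsilon().
template <typename T>
double tolerance(int n)
{
    assert(n > 0);
    return static_cast<double>(n)
        * static_cast<double>(std::numeric_limits<T>::epsilon());
}

// How many times larger than the bound the observed error is.
// A ratio slightly above 1 suggests rounding noise; a ratio in the
// hundreds suggests a real discrepancy between the two results.
template <typename T>
double error_ratio(double max_error, int n)
{
    return max_error / tolerance<T>(n);
}
```

Calling `error_ratio<float>(max_error, 70)` with the values from the log gives roughly 1.2 to 1.6 for the /18 cases (9.74946e-06 and 1.34363e-05) and more than 600 for the /16 cases (0.00523617 and 0.00525629), which is consistent with treating the two groups differently.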

@cgmb
Collaborator

cgmb commented Apr 14, 2022

Hi @littlewu2508. It appears that this issue has been resolved in recent versions of rocSOLVER.

I managed to reproduce these test failures using rocSOLVER for rocm-4.3.0 with OpenBLAS 0.3.19. However, when I tried with rocSOLVER for rocm-4.5.2 and rocm-5.1.1 (still with OpenBLAS 0.3.19), the tests all passed. In fact, when I updated to OpenBLAS 0.3.20 (for the laswp fix you helped with), rocSOLVER for rocm-5.1.1 passed the full test suite!

@littlewu2508
Author

That's brilliant. I'll test rocSOLVER rocm-5.x.

It would be perfect if the fix could be backported to 4.3.x. Many users stick to the 4.3.x versions because they use older cards like Fury.

@cgmb
Collaborator

cgmb commented Apr 15, 2022

A new 4.3.x release is unlikely, but if you're compiling from source, you could build newer versions of rocBLAS/rocSOLVER on older versions of the rocm-dev stack.

I know for a fact that the versions of rocBLAS and rocSOLVER tagged at rocm-4.5.2 will build using the HIP stack from ROCm 4.3.1. In fact, I think you can probably build the versions of rocBLAS and rocSOLVER tagged at rocm-5.1.1 with the HIP stack from ROCm 4.3.1. Just be sure to rebuild all the {roc,hip}{BLAS,SOLVER} libraries at the new tag, since they all use private (unstable) APIs provided by rocBLAS.

@littlewu2508
Author

> A new 4.3.x release is unlikely, but if you're compiling from source, you could build newer versions of rocBLAS/rocSOLVER on older versions of the rocm-dev stack.
>
> I know for a fact that the versions of rocBLAS and rocSOLVER tagged at rocm-4.5.2 will build using the HIP stack from ROCm 4.3.1. In fact, I think you can probably build the versions of rocBLAS and rocSOLVER tagged at rocm-5.1.1 with the HIP stack from ROCm 4.3.1. Just be sure to rebuild all the {roc,hip}{BLAS,SOLVER} libraries at the new tag, since they all use private (unstable) APIs provided by rocBLAS.

Thanks for pointing that out!

@cgmb
Collaborator

cgmb commented Apr 16, 2022

> I think you can probably build the versions of rocBLAS and rocSOLVER tagged at rocm-5.1.1 with the HIP stack from ROCm 4.3.1

Just to confirm. Yes, this is possible.

littlewu2508 added a commit to littlewu2508/gentoo that referenced this issue May 2, 2022
According to ROCm/rocSOLVER#367 (comment)
hip and low-level runtimes of rocm does not need to be the same version
with high-level libraries. Loosen dev-util/hip SLOT dependencies

Package-Manager: Portage-3.0.30, Repoman-3.0.3
Signed-off-by: Yiyang Wu <xgreenlandforwyy@gmail.com>
@littlewu2508
Author

I tested rocSOLVER-rocm-5.0.2 against openblas-0.3.20 on Radeon RX 6700XT, and all tests have passed. Thanks!

littlewu2508 added a commit to littlewu2508/gentoo that referenced this issue May 2, 2022
According to ROCm/rocSOLVER#367 (comment)
hip and low-level runtimes of rocm does not need to be the same version
with high-level libraries. Loosen dev-util/hip SLOT dependencies
All tests passed on single Radeon RX 6700XT

Package-Manager: Portage-3.0.30, Repoman-3.0.3
Signed-off-by: Yiyang Wu <xgreenlandforwyy@gmail.com>
gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue May 3, 2022
According to ROCm/rocSOLVER#367 (comment)
hip and low-level runtimes of rocm does not need to be the same version
with high-level libraries. Loosen dev-util/hip SLOT dependencies
All tests passed on single Radeon RX 6700XT

Package-Manager: Portage-3.0.30, Repoman-3.0.3
Signed-off-by: Yiyang Wu <xgreenlandforwyy@gmail.com>
Signed-off-by: Benda Xu <heroxbd@gentoo.org>