
Conversation

@alugorey commented Jan 24, 2023

New PR, rebased on rocm5.5_internal_testing (later retargeted to rocm5.6_internal_testing)

Integrates batched versions of getrs, geqrf, getrf, getri, and gels
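For readers unfamiliar with the term, a "batched" LAPACK-style driver factors or solves a whole stack of small, independent matrices in one library call instead of looping on the host. A rough illustration of the calling convention using NumPy's stacked-array support (illustrative only, not the rocSOLVER API being integrated here):

```python
import numpy as np

# A batched driver operates on a stack of independent systems in one call.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3, 3))   # batch of 4 independent 3x3 systems
b = rng.standard_normal((4, 3, 1))

x = np.linalg.solve(A, b)            # one call solves all 4 systems

# Each batch entry matches the corresponding per-matrix solve.
for i in range(4):
    assert np.allclose(x[i], np.linalg.solve(A[i], b[i]))
```

The batched rocSOLVER entry points serve the same purpose on the GPU: amortizing launch overhead across many small problems, which a per-matrix loop cannot do.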

@pruthvistony (Collaborator)

jenkins retest this please

@pruthvistony (Collaborator)

Checking the log - http://rocmhead:8080/job/pytorch/job/pytorch-test-2/520/consoleText

Jan 25 19:11:42 test_qr_batched_cuda_complex128 (main.TestLinalgCUDA) ... ok (0.005s)
Jan 25 19:11:42 test_qr_batched_cuda_complex64 (main.TestLinalgCUDA) ... FAIL (0.003s)
Jan 25 19:11:42 test_qr_batched_cuda_float32 (main.TestLinalgCUDA) ... ok (0.004s)
Jan 25 19:11:42 test_qr_batched_cuda_float64 (main.TestLinalgCUDA) ... ok (0.004s)
Jan 25 19:11:42 test_qr_cuda_complex128 (main.TestLinalgCUDA) ... skip: cuSOLVER not available (0.002s)
Jan 25 19:11:42 test_qr_cuda_complex64 (main.TestLinalgCUDA) ... skip: cuSOLVER not available (0.002s)
Jan 25 19:11:42 test_qr_cuda_float32 (main.TestLinalgCUDA) ... skip: cuSOLVER not available (0.002s)
Jan 25 19:11:42 test_qr_cuda_float64 (main.TestLinalgCUDA) ... skip: cuSOLVER not available (0.002s)

Jan 25 19:11:42 ======================================================================
Jan 25 19:11:42 FAIL [0.003s]: test_qr_batched_cuda_complex64 (main.TestLinalgCUDA)
Jan 25 19:11:42 ----------------------------------------------------------------------
Jan 25 19:11:42 Traceback (most recent call last):
Jan 25 19:11:42 File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2067, in wrapper
Jan 25 19:11:42 method(*args, **kwargs)
Jan 25 19:11:42 File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_device_type.py", line 378, in instantiated_test
Jan 25 19:11:42 result = test(self, **param_kwargs)
Jan 25 19:11:42 File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_device_type.py", line 866, in dep_fn
Jan 25 19:11:42 return fn(slf, *args, **kwargs)
Jan 25 19:11:42 File "test_linalg.py", line 3640, in test_qr_batched
Jan 25 19:11:42 self.assertEqual(q, exp_q)
Jan 25 19:11:42 File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2583, in assertEqual
Jan 25 19:11:42 assert_equal(
Jan 25 19:11:42 File "/opt/conda/lib/python3.8/site-packages/torch/testing/_comparison.py", line 1118, in assert_equal
Jan 25 19:11:42 raise error_metas[0].to_error(msg)
Jan 25 19:11:42 AssertionError: Tensor-likes are not close!
Jan 25 19:11:42
Jan 25 19:11:42 Mismatched elements: 105 / 105 (100.0%)
Jan 25 19:11:42 Greatest absolute difference: nan at index (0, 0, 1) (up to 1e-05 allowed)
Jan 25 19:11:42 Greatest relative difference: nan at index (0, 0, 1) (up to 1.3e-06 allowed)
Jan 25 19:11:42
Jan 25 19:11:42 ----------------------------------------------------------------------
Jan 25 19:11:42 Ran 716 tests in 167.970s
Jan 25 19:11:42
Jan 25 19:11:42 FAILED (failures=4, errors=10, skipped=67)

@alugorey (Author)

(quoting the failing test_qr_batched_cuda_complex64 log from the previous comment)

Ah, yes. I vaguely remember hitting this in my testing. It's weird because it only fails for complex64, not complex128 or the floats. I'll have to look deeper.

@alugorey (Author)

@pruthvistony @jithunnair-amd I have uncovered a bug in an associated API in rocSOLVER/rocBLAS.
The aforementioned test is failing an assertion because a tensor returned by torch.linalg.qr is filled with NaNs. I drilled down into this API and found the math function apply_orgqr() to be the culprit. It is a wrapper around the hipsolver function hipsolverDnZungqr, which in turn wraps a rocSOLVER/rocBLAS function here: https://github.com/ROCmSoftwarePlatform/hipSOLVER/blob/develop/library/src/amd_detail/hipsolver.cpp#L1754
The API in question is returning NaNs in the "A" tensor, causing the assertion above to fail.
Interestingly, this only occurs when the scalar type is complex64. The same test passes with float32, float64, and complex128.

I have reached out to @jzuniga-amd about it to talk next steps. Until then, this work is blocked. I will create an appropriate JIRA ticket and update relevant ones after I speak with Juan.
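For reference, this failure mode is easy to spot outside the test harness by checking the returned factor for NaNs directly. A minimal sketch using NumPy in place of torch.linalg.qr (shapes and names are illustrative, not taken from the PR):

```python
import numpy as np

# Batched complex64 input, loosely resembling the failing test (illustrative).
rng = np.random.default_rng(0)
a = (rng.standard_normal((3, 7, 5))
     + 1j * rng.standard_normal((3, 7, 5))).astype(np.complex64)

# Per-matrix QR; a batched backend would do all three in one call.
qs = []
for mat in a:
    q_i, r_i = np.linalg.qr(mat)
    qs.append(q_i)
q = np.stack(qs)

# A healthy backend returns finite factors; the hipsolver bug described
# above manifested as an all-NaN Q, which fails any closeness assertion.
assert np.isfinite(q).all(), "QR factor contains NaN/Inf"
```

The test-suite assertion (`assertEqual(q, exp_q)`) fails with "Tensor-likes are not close!" for exactly this reason: every comparison against NaN is a mismatch, hence the 105/105 (100.0%) mismatch count in the log.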

@pruthvistony (Collaborator)

@alugorey
how about skipping complex64 and trying to get this PR merged?
cc @jeffdaily @jithunnair-amd

@alugorey (Author)

@alugorey how about skipping complex64 and trying to get this PR merged? cc @jeffdaily @jithunnair-amd

I'm fine with that. I think it makes the most sense to finish up this PR and have the fix for the issue I described come as a separate PR associated with different JIRAs. The test in question is only tangentially related to the batched drivers; it is more an issue with hipsolver's orgqr that we uncovered when switching to the non-MAGMA backend.

@jithunnair-amd (Collaborator)

@alugorey Absolutely, let's skip the unit tests and create an issue to make sure we track it. And then we should be good to merge this PR.
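A skip along these lines can be expressed with a standard unittest decorator. The condition flag below is hypothetical (PyTorch's own suite derives it internally, e.g. via decorators like skipCUDAIfRocm); this is only a sketch of the mechanism:

```python
import unittest

ON_ROCM = True  # hypothetical flag; PyTorch would derive this from torch.version.hip

class TestLinalgSketch(unittest.TestCase):
    @unittest.skipIf(ON_ROCM, "hipsolver orgqr returns NaNs for complex64 (tracked separately)")
    def test_qr_batched_complex64(self):
        self.fail("would exercise torch.linalg.qr on a complex64 batch")

    def test_qr_batched_float32(self):
        self.assertTrue(True)  # placeholder for the dtype variants that pass

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(TestLinalgSketch))
# One test skipped with a reason pointing at the tracking issue, the rest run.
assert len(result.skipped) == 1 and result.wasSuccessful()
```

Recording the tracking-issue reference in the skip reason keeps the skip discoverable later, which is the point of filing the issue before merging.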

@pruthvistony (Collaborator)

An sccache error caused the build to fail; retriggering it.

@pruthvistony (Collaborator)

jenkins retest this please


@pruthvistony (Collaborator) commented Jan 31, 2023

@alugorey ,
Many cases are failing - http://rocmhead:8080/job/pytorch/job/pytorch-ci/558/
The failures in test 1 and test 2 here are passing on the IFU PR.

@alugorey (Author) commented Feb 1, 2023

@pruthvistony
The failing linalg tests are due to the complex64 type not being supported in the batched drivers. I will skip those as well.

@alugorey (Author) commented Feb 1, 2023

@jithunnair-amd I am waiting to create a JIRA to deal with the complex64 issue as I am still working with @jzuniga-amd to understand the root of the problem. Most recently, he said:

"Complex64 is supported. Now, as we discussed, in your workflow there are some inconsistencies as you call functions that expect complex data with reals, and vice versa. I suppose that if a function that expect a complex matrix receive a real matrix, the autopromotion in C++ will set the imaginary part to zero, and if a function that expect a real matrix receive a complex matrix, then it will ignore the imaginary part... but I am not sure, I have to verify this. If this is not the case, then you will probably need to modify your workflow to ensure correctness of the data types that are passed to the APIs; adding the logic to take care of this within the library routines will be too expensive.   
I will run some tests with the data you send me and let you know."

So there is some work to be done on our end to verify whether the data is actually being passed to the APIs correctly. Moreover, is this a ROCm-specific issue, or does it also affect CUDA? I will look into it deeper when I have the bandwidth.
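The real/complex autopromotion Juan describes has a direct Python analogue: constructing a complex value from a real zero-fills the imaginary part, while extracting the real part of a complex value silently discards the imaginary part. A small sketch of both directions (illustrative of the C++ behavior he is describing, not the hipsolver code path itself):

```python
import numpy as np

# Real -> complex: the imaginary part is zero-filled on promotion.
real_in = np.float32(3.0)
promoted = np.complex64(real_in)
assert promoted.imag == 0.0

# Complex -> real: taking .real silently drops the imaginary part.
complex_in = np.complex64(3.0 + 4.0j)
demoted = np.float32(complex_in.real)
assert demoted == np.float32(3.0)
```

If the wrappers ever pass real buffers where the library expects complex (or vice versa), the bit layout of the data differs from these value-level conversions, so correctness cannot be assumed; that is why Juan suggests verifying the workflow rather than relying on promotion.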

@jithunnair-amd jithunnair-amd changed the base branch from rocm5.5_internal_testing to rocm5.6_internal_testing February 2, 2023 00:44
@jithunnair-amd (Collaborator) commented Feb 2, 2023

@alugorey Since we're very close to rocm5.5_internal_testing being frozen and this particular feature is not a must-have for that branch, I've cloned rocm5.5_internal_testing to rocm5.6_internal_testing and retargeted this PR to the latter. This should allow you more time to fix the various UT failures we are seeing on the CI for this PR.

For reference, this was the last CI run on rocm5.5_internal_testing for this PR: http://rocmhead:8080/job/pytorch/job/pytorch-ci/563/

@alugorey (Author) commented Feb 2, 2023

@jithunnair-amd @pruthvistony
The test cases are failing because the CI backend's ROCm version is 5.4. The fixes provided by the hipsolver team were introduced in ROCm 5.5, starting at build number 11208.

@alugorey (Author) commented Feb 2, 2023

The following tests are failing in CI with ROCm 5.4 because that version is missing a patch from hipsolver. I have confirmed they all pass using a ROCm 5.5 build (at least mainline build 11208):

TestMetaCUDA.test_dispatch_meta_outplace_ormqr_cuda_complex128
TestMetaCUDA.test_dispatch_meta_outplace_ormqr_cuda_complex64
TestMetaCUDA.test_dispatch_meta_outplace_ormqr_cuda_float32
TestMetaCUDA.test_dispatch_meta_outplace_ormqr_cuda_float64
TestMetaCUDA.test_dispatch_symbolic_meta_outplace_all_strides_ormqr_cuda_float32
TestMetaCUDA.test_dispatch_symbolic_meta_outplace_ormqr_cuda_complex128
TestMetaCUDA.test_dispatch_symbolic_meta_outplace_ormqr_cuda_complex64
TestMetaCUDA.test_dispatch_symbolic_meta_outplace_ormqr_cuda_float64
TestMetaCUDA.test_dispatch_symbolic_meta_outplace_ormqr_cuda_float32
TestMetaCUDA.test_meta_outplace_ormqr_cuda_complex128
TestMetaCUDA.test_meta_outplace_ormqr_cuda_complex64
TestMetaCUDA.test_meta_outplace_ormqr_cuda_float32
TestMetaCUDA.test_meta_outplace_ormqr_cuda_float64
TestCommonCUDA.test_dtypes_ormqr_cuda
TestCommonCUDA.test_noncontiguous_samples_ormqr_cuda_complex64
TestBwdGradientsCUDA.test_fn_grad_ormqr_cuda_complex128
TestBwdGradientsCUDA.test_fn_grad_ormqr_cuda_float64

@jithunnair-amd

@jithunnair-amd (Collaborator)

@alugorey Let's remove the skips for any UTs that you see passing in your local runs with a ROCm5.5 build.

@alugorey (Author) commented Feb 7, 2023

@alugorey Let's remove the skips for any UTs that you see passing in your local runs with a ROCm5.5 build.

@jithunnair-amd There are no such tests being skipped. The ones that are skipped fail locally due to the complex64 issue. The tests listed in my previous comment are the ones that pass locally but not in CI; they are currently not being skipped. I will create a GitHub task to triage the complex64 issue: after speaking with Juan, he thinks the data being passed to hipsolver may be invalid, in which case it would of course fail. I need to do some digging on that. I will add a task for next sprint.

@jithunnair-amd jithunnair-amd merged this pull request into ROCm:rocm5.6_internal_testing Feb 8, 2023
pruthvistony pushed a commit that referenced this pull request Feb 9, 2023
* Integrate new batched linalg drivers

* Skip test_qr_batched; ROCM doesn't support QR decomp for complex dtype

* Skip complex types, hipsolver does not support

* Skip complex types in other batched tests as well
pruthvistony pushed a commit that referenced this pull request May 23, 2023
* Integrate new batched linalg drivers

* Skip test_qr_batched; ROCM doesn't support QR decomp for complex dtype

* Skip complex types, hipsolver does not support

* Skip complex types in other batched tests as well
alugorey added a commit to alugorey/pytorch that referenced this pull request Jun 7, 2023
* Skip test_qr_batched; ROCM doesn't support QR decomp for complex dtype

* Skip complex types, hipsolver does not support

* Skip complex types in other batched tests as well
alugorey added a commit to alugorey/pytorch that referenced this pull request Jun 12, 2023
* Skip test_qr_batched; ROCM doesn't support QR decomp for complex dtype

* Skip complex types, hipsolver does not support

* Skip complex types in other batched tests as well
alugorey added a commit to alugorey/pytorch that referenced this pull request Jul 21, 2023
* Skip test_qr_batched; ROCM doesn't support QR decomp for complex dtype

* Skip complex types, hipsolver does not support

* Skip complex types in other batched tests as well
pruthvistony pushed a commit that referenced this pull request Sep 12, 2023
* Integrate new batched linalg drivers

* Skip test_qr_batched; ROCM doesn't support QR decomp for complex dtype

* Skip complex types, hipsolver does not support

* Skip complex types in other batched tests as well
pruthvistony added a commit that referenced this pull request Sep 28, 2023
pruthvistony added a commit that referenced this pull request Sep 29, 2023
* Revert "Workaround of SWDEV-407984 (#1254)"

This reverts commit e3a6481.

* Revert "[ROCM] Fix TestLinalgCUDA.test_qr_cuda_complex64."

This reverts commit 146e291.

* Revert "Integrate new batched linalg drivers (#1163)"

This reverts commit 5cf7807.

* Updated changes for SWDEV-407984

* Update a missing constant in hipify

* NIT related changes