
Conversation

@alugorey commented Jan 24, 2023

New PR, rebased on rocm5.5_internal_testing (later retargeted to rocm5.6_internal_testing)

Integrates batched versions of getrs, geqrf, getrf, getri, and gels
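For readers unfamiliar with the term, a "batched" LAPACK-style driver factors or solves a whole stack of small, independent matrices in one library call instead of looping on the host. A rough illustration of the calling convention using NumPy's stacked-array support (illustrative only, not the rocSOLVER API being integrated here):

```python
import numpy as np

# A batched driver operates on a stack of independent systems in one call.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3, 3))   # batch of 4 independent 3x3 systems
b = rng.standard_normal((4, 3, 1))

x = np.linalg.solve(A, b)            # one call solves all 4 systems

# Each batch entry matches the corresponding per-matrix solve.
for i in range(4):
    assert np.allclose(x[i], np.linalg.solve(A[i], b[i]))
```

The batched rocSOLVER entry points serve the same purpose on the GPU: amortizing launch overhead across many small problems, which a per-matrix loop cannot do.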

@pruthvistony (Collaborator)

jenkins retest this please

@pruthvistony (Collaborator)

Checking the log - http://rocmhead:8080/job/pytorch/job/pytorch-test-2/520/consoleText

Jan 25 19:11:42 test_qr_batched_cuda_complex128 (main.TestLinalgCUDA) ... ok (0.005s)
Jan 25 19:11:42 test_qr_batched_cuda_complex64 (main.TestLinalgCUDA) ... FAIL (0.003s)
Jan 25 19:11:42 test_qr_batched_cuda_float32 (main.TestLinalgCUDA) ... ok (0.004s)
Jan 25 19:11:42 test_qr_batched_cuda_float64 (main.TestLinalgCUDA) ... ok (0.004s)
Jan 25 19:11:42 test_qr_cuda_complex128 (main.TestLinalgCUDA) ... skip: cuSOLVER not available (0.002s)
Jan 25 19:11:42 test_qr_cuda_complex64 (main.TestLinalgCUDA) ... skip: cuSOLVER not available (0.002s)
Jan 25 19:11:42 test_qr_cuda_float32 (main.TestLinalgCUDA) ... skip: cuSOLVER not available (0.002s)
Jan 25 19:11:42 test_qr_cuda_float64 (main.TestLinalgCUDA) ... skip: cuSOLVER not available (0.002s)

Jan 25 19:11:42 ======================================================================
Jan 25 19:11:42 FAIL [0.003s]: test_qr_batched_cuda_complex64 (main.TestLinalgCUDA)
Jan 25 19:11:42 ----------------------------------------------------------------------
Jan 25 19:11:42 Traceback (most recent call last):
Jan 25 19:11:42 File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2067, in wrapper
Jan 25 19:11:42 method(*args, **kwargs)
Jan 25 19:11:42 File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_device_type.py", line 378, in instantiated_test
Jan 25 19:11:42 result = test(self, **param_kwargs)
Jan 25 19:11:42 File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_device_type.py", line 866, in dep_fn
Jan 25 19:11:42 return fn(slf, *args, **kwargs)
Jan 25 19:11:42 File "test_linalg.py", line 3640, in test_qr_batched
Jan 25 19:11:42 self.assertEqual(q, exp_q)
Jan 25 19:11:42 File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2583, in assertEqual
Jan 25 19:11:42 assert_equal(
Jan 25 19:11:42 File "/opt/conda/lib/python3.8/site-packages/torch/testing/_comparison.py", line 1118, in assert_equal
Jan 25 19:11:42 raise error_metas[0].to_error(msg)
Jan 25 19:11:42 AssertionError: Tensor-likes are not close!
Jan 25 19:11:42
Jan 25 19:11:42 Mismatched elements: 105 / 105 (100.0%)
Jan 25 19:11:42 Greatest absolute difference: nan at index (0, 0, 1) (up to 1e-05 allowed)
Jan 25 19:11:42 Greatest relative difference: nan at index (0, 0, 1) (up to 1.3e-06 allowed)
Jan 25 19:11:42
Jan 25 19:11:42 ----------------------------------------------------------------------
Jan 25 19:11:42 Ran 716 tests in 167.970s
Jan 25 19:11:42
Jan 25 19:11:42 FAILED (failures=4, errors=10, skipped=67)

@alugorey (Author)

(quoting the failing test_qr_batched_cuda_complex64 log from the previous comment)

Ah, yes. I vaguely remember hitting this in my testing. It's weird because it only fails for complex64, not complex128 or the floats. I'll have to look deeper.

@alugorey (Author)

@pruthvistony @jithunnair-amd I have uncovered a bug in an associated API in rocSOLVER/rocBLAS.
The aforementioned test is failing an assertion because a tensor returned by torch.linalg.qr is filled with NaNs. I drilled down into this API and found the math function apply_orgqr() to be the culprit. It is a wrapper around the hipsolver function hipsolverDnZungqr, which in turn wraps a rocSOLVER/rocBLAS function here: https://github.com/ROCmSoftwarePlatform/hipSOLVER/blob/develop/library/src/amd_detail/hipsolver.cpp#L1754
The API in question is returning NaNs in the "A" tensor, causing the assertion above to fail.
Interestingly, this only occurs when the scalar type is complex64. The same test passes with float32, float64, and complex128.

I have reached out to @jzuniga-amd about it to talk next steps. Until then, this work is blocked. I will create an appropriate JIRA ticket and update relevant ones after I speak with Juan.
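For reference, this failure mode is easy to spot outside the test harness by checking the returned factor for NaNs directly. A minimal sketch using NumPy in place of torch.linalg.qr (shapes and names are illustrative, not taken from the PR):

```python
import numpy as np

# Batched complex64 input, loosely resembling the failing test (illustrative).
rng = np.random.default_rng(0)
a = (rng.standard_normal((3, 7, 5))
     + 1j * rng.standard_normal((3, 7, 5))).astype(np.complex64)

# Per-matrix QR; a batched backend would do all three in one call.
qs = []
for mat in a:
    q_i, r_i = np.linalg.qr(mat)
    qs.append(q_i)
q = np.stack(qs)

# A healthy backend returns finite factors; the hipsolver bug described
# above manifested as an all-NaN Q, which fails any closeness assertion.
assert np.isfinite(q).all(), "QR factor contains NaN/Inf"
```

The test-suite assertion (`assertEqual(q, exp_q)`) fails with "Tensor-likes are not close!" for exactly this reason: every comparison against NaN is a mismatch, hence the 105/105 (100.0%) mismatch count in the log.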

@pruthvistony (Collaborator)

@alugorey
how about skipping complex64 and trying to get this PR merged?
cc @jeffdaily @jithunnair-amd

@alugorey (Author)

@alugorey how about skipping complex64 and trying to get this PR merged? cc @jeffdaily @jithunnair-amd

I'm fine with that. I think it makes the most sense to finish up this PR and have the fix for the issue I described come as a separate PR associated with different JIRAs. The test in question is only tangentially related to the batched drivers; it is more an issue with hipsolver's orgqr that we uncovered when switching to the non-MAGMA backend.

@jithunnair-amd (Collaborator)

@alugorey Absolutely, let's skip the unit tests and create an issue to make sure we track it. And then we should be good to merge this PR.
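A skip along these lines can be expressed with a standard unittest decorator. The condition flag below is hypothetical (PyTorch's own suite derives it internally, e.g. via decorators like skipCUDAIfRocm); this is only a sketch of the mechanism:

```python
import unittest

ON_ROCM = True  # hypothetical flag; PyTorch would derive this from torch.version.hip

class TestLinalgSketch(unittest.TestCase):
    @unittest.skipIf(ON_ROCM, "hipsolver orgqr returns NaNs for complex64 (tracked separately)")
    def test_qr_batched_complex64(self):
        self.fail("would exercise torch.linalg.qr on a complex64 batch")

    def test_qr_batched_float32(self):
        self.assertTrue(True)  # placeholder for the dtype variants that pass

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(TestLinalgSketch))
# One test skipped with a reason pointing at the tracking issue, the rest run.
assert len(result.skipped) == 1 and result.wasSuccessful()
```

Recording the tracking-issue reference in the skip reason keeps the skip discoverable later, which is the point of filing the issue before merging.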

@pruthvistony (Collaborator)

An sccache error caused the build to fail; retriggering it.

@pruthvistony (Collaborator)

jenkins retest this please


@pruthvistony (Collaborator) commented Jan 31, 2023

@alugorey ,
Many cases are failing - http://rocmhead:8080/job/pytorch/job/pytorch-ci/558/
The failures in test 1 and test 2 here are passing on the IFU PR.

@alugorey (Author) commented Feb 1, 2023

@pruthvistony
The failing linalg tests are due to the complex64 type not being supported in the batched drivers. I will skip those as well.

@alugorey (Author) commented Feb 1, 2023

@jithunnair-amd I am waiting to create a JIRA to deal with the complex64 issue as I am still working with @jzuniga-amd to understand the root of the problem. Most recently, he said:

"Complex64 is supported. Now, as we discussed, in your workflow there are some inconsistencies as you call functions that expect complex data with reals, and vice versa. I suppose that if a function that expect a complex matrix receive a real matrix, the autopromotion in C++ will set the imaginary part to zero, and if a function that expect a real matrix receive a complex matrix, then it will ignore the imaginary part... but I am not sure, I have to verify this. If this is not the case, then you will probably need to modify your workflow to ensure correctness of the data types that are passed to the APIs; adding the logic to take care of this within the library routines will be too expensive.   
I will run some tests with the data you send me and let you know."

So there is some work to be done on our end to verify whether the data is actually being passed to the APIs correctly. Moreover, is this a ROCm-specific issue, or does it also affect CUDA? I will look into it deeper when I have the bandwidth.
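The real/complex autopromotion Juan describes has a direct Python analogue: constructing a complex value from a real zero-fills the imaginary part, while extracting the real part of a complex value silently discards the imaginary part. A small sketch of both directions (illustrative of the C++ behavior he is describing, not the hipsolver code path itself):

```python
import numpy as np

# Real -> complex: the imaginary part is zero-filled on promotion.
real_in = np.float32(3.0)
promoted = np.complex64(real_in)
assert promoted.imag == 0.0

# Complex -> real: taking .real silently drops the imaginary part.
complex_in = np.complex64(3.0 + 4.0j)
demoted = np.float32(complex_in.real)
assert demoted == np.float32(3.0)
```

If the wrappers ever pass real buffers where the library expects complex (or vice versa), the bit layout of the data differs from these value-level conversions, so correctness cannot be assumed; that is why Juan suggests verifying the workflow rather than relying on promotion.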

@jithunnair-amd jithunnair-amd changed the base branch from rocm5.5_internal_testing to rocm5.6_internal_testing February 2, 2023 00:44
@jithunnair-amd (Collaborator) commented Feb 2, 2023

@alugorey Since we're very close to rocm5.5_internal_testing being frozen and this particular feature is not a must-have for that branch, I've cloned rocm5.5_internal_testing to rocm5.6_internal_testing and retargeted this PR to the latter. This should allow you more time to fix the various UT failures we are seeing on the CI for this PR.

For reference, this was the last CI run on rocm5.5_internal_testing for this PR: http://rocmhead:8080/job/pytorch/job/pytorch-ci/563/

@alugorey (Author) commented Feb 2, 2023

@jithunnair-amd @pruthvistony
The test cases are failing because the CI backend's ROCm version is 5.4. The fixes provided by the hipsolver team were introduced in ROCm 5.5, starting at build number 11208.

@alugorey (Author) commented Feb 2, 2023

The following tests are failing in CI with ROCm 5.4 because that version is missing a patch from hipsolver. I have confirmed they all pass using a ROCm 5.5 build (at least mainline build 11208):

TestMetaCUDA.test_dispatch_meta_outplace_ormqr_cuda_complex128
TestMetaCUDA.test_dispatch_meta_outplace_ormqr_cuda_complex64
TestMetaCUDA.test_dispatch_meta_outplace_ormqr_cuda_float32
TestMetaCUDA.test_dispatch_meta_outplace_ormqr_cuda_float64
TestMetaCUDA.test_dispatch_symbolic_meta_outplace_all_strides_ormqr_cuda_float32
TestMetaCUDA.test_dispatch_symbolic_meta_outplace_ormqr_cuda_complex128
TestMetaCUDA.test_dispatch_symbolic_meta_outplace_ormqr_cuda_complex64
TestMetaCUDA.test_dispatch_symbolic_meta_outplace_ormqr_cuda_float64
TestMetaCUDA.test_dispatch_symbolic_meta_outplace_ormqr_cuda_float32
TestMetaCUDA.test_meta_outplace_ormqr_cuda_complex128
TestMetaCUDA.test_meta_outplace_ormqr_cuda_complex64
TestMetaCUDA.test_meta_outplace_ormqr_cuda_float32
TestMetaCUDA.test_meta_outplace_ormqr_cuda_float64
TestCommonCUDA.test_dtypes_ormqr_cuda
TestCommonCUDA.test_noncontiguous_samples_ormqr_cuda_complex64
TestBwdGradientsCUDA.test_fn_grad_ormqr_cuda_complex128
TestBwdGradientsCUDA.test_fn_grad_ormqr_cuda_float64

@jithunnair-amd

@jithunnair-amd (Collaborator)

@alugorey Let's remove the skips for any UTs that you see passing in your local runs with a ROCm5.5 build.

@alugorey (Author) commented Feb 7, 2023

@alugorey Let's remove the skips for any UTs that you see passing in your local runs with a ROCm5.5 build.

@jithunnair-amd There are no such tests being skipped. The ones that are skipped fail locally due to the complex64 issue. The tests listed in my previous comment are the ones that pass locally but not in CI; they are currently not being skipped. I will create a GitHub task to triage the complex64 issue: after speaking with Juan, he thinks the data being passed to hipsolver may be invalid, in which case it would of course fail. I need to do some digging on that. I will add a task for next sprint.

@jithunnair-amd jithunnair-amd merged this pull request into ROCm:rocm5.6_internal_testing Feb 8, 2023
pruthvistony pushed a commit that referenced this pull request Feb 9, 2023
* Integrate new batched linalg drivers

* Skip test_qr_batched; ROCM doesn't support QR decomp for complex dtype

* Skip complex types, hipsolver does not support

* Skip complex types in other batched tests as well
pruthvistony pushed a commit that referenced this pull request May 23, 2023
* Integrate new batched linalg drivers

* Skip test_qr_batched; ROCM doesn't support QR decomp for complex dtype

* Skip complex types, hipsolver does not support

* Skip complex types in other batched tests as well
alugorey added a commit to alugorey/pytorch that referenced this pull request Jun 7, 2023
* Skip test_qr_batched; ROCM doesn't support QR decomp for complex dtype

* Skip complex types, hipsolver does not support

* Skip complex types in other batched tests as well
alugorey added a commit to alugorey/pytorch that referenced this pull request Jun 12, 2023
* Skip test_qr_batched; ROCM doesn't support QR decomp for complex dtype

* Skip complex types, hipsolver does not support

* Skip complex types in other batched tests as well
alugorey added a commit to alugorey/pytorch that referenced this pull request Jul 21, 2023
* Skip test_qr_batched; ROCM doesn't support QR decomp for complex dtype

* Skip complex types, hipsolver does not support

* Skip complex types in other batched tests as well
pruthvistony pushed a commit that referenced this pull request Sep 12, 2023
* Integrate new batched linalg drivers

* Skip test_qr_batched; ROCM doesn't support QR decomp for complex dtype

* Skip complex types, hipsolver does not support

* Skip complex types in other batched tests as well
pruthvistony added a commit that referenced this pull request Sep 28, 2023
pruthvistony added a commit that referenced this pull request Sep 29, 2023
* Revert "Workaround of SWDEV-407984 (#1254)"

This reverts commit e3a6481.

* Revert "[ROCM] Fix TestLinalgCUDA.test_qr_cuda_complex64."

This reverts commit 146e291.

* Revert "Integrate new batched linalg drivers (#1163)"

This reverts commit 5cf7807.

* Updated changes for SWDEV-407984

* Update a missing constant in hipify

* NIT related changes