Integrate new batched linalg drivers #1163
|
jenkins retest this please |
|
Checking the log - http://rocmhead:8080/job/pytorch/job/pytorch-test-2/520/consoleText
Jan 25 19:11:42 test_qr_batched_cuda_complex128 (__main__.TestLinalgCUDA) ... ok (0.005s)
Jan 25 19:11:42 ====================================================================== |
Ah, yes. I vaguely remember hitting this in my testing. It's weird because it only fails for complex64, not complex128 or the floats. I'll have to look deeper. |
|
@pruthvistony @jithunnair-amd I have uncovered a bug in an associated API in rocsolver/rocblas. I have reached out to @jzuniga-amd about it to talk next steps. Until then, this work is blocked. I will create an appropriate JIRA ticket and update relevant ones after I speak with Juan. |
|
@alugorey |
I'm fine with that. I think it makes the most sense to finish up this PR and have the fix for the issue I illustrated come as a different PR associated with different JIRAs. The test in question is only tangentially related to the batched drivers and is more an issue with hipsolver's orgqr that we uncovered when switching to non-magma backend. |
|
@alugorey Absolutely, let's skip the unit tests and create an issue to make sure we track it. And then we should be good to merge this PR. |
|
The build failed with an sccache error; retriggering it. |
|
jenkins retest this please |
|
@alugorey , |
|
@pruthvistony |
|
@jithunnair-amd I am waiting to create a JIRA to deal with the complex64 issue as I am still working with @jzuniga-amd to understand the root of the problem. Most recently, he said: "Complex64 is supported. Now, as we discussed, in your workflow there are some inconsistencies as you call functions that expect complex data with reals, and vice versa. I suppose that if a function that expects a complex matrix receives a real matrix, the autopromotion in C++ will set the imaginary part to zero, and if a function that expects a real matrix receives a complex matrix, then it will ignore the imaginary part... but I am not sure, I have to verify this. If this is not the case, then you will probably need to modify your workflow to ensure correctness of the data types that are passed to the APIs; adding the logic to take care of this within the library routines will be too expensive." So there is some work to be done on our end to verify whether the data is actually being passed to the tests correctly. Moreover, is this a ROCm-specific issue, or is it also an issue for CUDA? I will look into it deeper when I have the bandwidth. |
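The two cases described in the quote can be illustrated in plain Python. This is only a sketch of why the second case silently loses information; it says nothing about what hipsolver or rocsolver actually do with mismatched inputs, and every function name below is hypothetical.

```python
def promote_real_to_complex(values):
    """Real data handed to a complex routine: imaginary parts become zero."""
    return [complex(v) for v in values]


def truncate_complex_to_real(values):
    """Complex data handed to a real routine: imaginary parts are discarded."""
    return [complex(v).real for v in values]


def require_matching_kind(values, expect_complex):
    """Hypothetical caller-side guard: reject data whose kind does not match."""
    has_complex = any(isinstance(v, complex) for v in values)
    if has_complex != expect_complex:
        kind = "complex" if expect_complex else "real"
        raise TypeError(f"routine expects {kind} input")
    return values


real_data = [1.0, -2.5]
complex_data = [1.0 + 2.0j, 3.0 - 1.0j]

# Promotion keeps every value; truncation drops the imaginary parts.
print(promote_real_to_complex(real_data))
print(truncate_complex_to_real(complex_data))
```

A guard like `require_matching_kind` on the caller side is one way to surface a mismatch early instead of relying on implicit conversion.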
|
@alugorey Since we're very close to rocm5.5_internal_testing being frozen and this particular feature is not a must-have for that branch, I've cloned rocm5.5_internal_testing to rocm5.6_internal_testing and retargeted this PR to the latter. This should allow you more time to fix the various UT failures we are seeing on the CI for this PR. For reference, this was the last CI run on rocm5.5_internal_testing for this PR: http://rocmhead:8080/job/pytorch/job/pytorch-ci/563/ |
|
@jithunnair-amd @pruthvistony |
|
The following tests are failing in CI with ROCm 5.4 due to that version missing a patch from hipsolver. I have confirmed they all pass using a ROCm5.5 build (at least mainline build No. 11208) |
|
@alugorey Let's remove the skips for any UTs that you see passing in your local runs with a ROCm5.5 build. |
@jithunnair-amd There are no such tests being skipped. The ones that are skipped fail locally due to the complex64 issue. The tests listed in my previous comment are the ones that pass locally but not in CI. They are currently not being skipped. I will create a GitHub task to triage the complex64 issue: after speaking with Juan, it seems the data being passed to hipsolver may be invalid, in which case the call would of course fail. I need to do some digging on that. I will add a task for next sprint.
* Integrate new batched linalg drivers * Skip test_qr_batched; ROCM doesn't support QR decomp for complex dtype * Skip complex types, hipsolver does not support * Skip complex types in other batched tests as well
* Skip test_qr_batched; ROCM doesn't support QR decomp for complex dtype * Skip complex types, hipsolver does not support * Skip complex types in other batched tests as well
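The skip commits above can be pictured with a small `unittest` sketch that generates one test per dtype and skips the complex ones on a given platform. This is not the code from the PR: `RUNNING_ON_ROCM`, `should_skip`, and the test body are hypothetical placeholders, and the real tests live in PyTorch's own test suite.

```python
import unittest

# Hypothetical stand-in for a platform flag; not imported from PyTorch.
RUNNING_ON_ROCM = True

COMPLEX_DTYPES = ("complex64", "complex128")
ALL_DTYPES = ("float32", "float64") + COMPLEX_DTYPES


def should_skip(dtype):
    """Skip complex dtypes only when running on the (assumed) ROCm platform."""
    return RUNNING_ON_ROCM and dtype in COMPLEX_DTYPES


def make_test(dtype):
    @unittest.skipIf(should_skip(dtype), "complex dtypes skipped on ROCm pending triage")
    def test(self):
        # Placeholder body; a real test would run a batched QR for this dtype.
        self.assertIn(dtype, ALL_DTYPES)
    return test


class BatchedQRTest(unittest.TestCase):
    pass


# Attach one generated test method per dtype, e.g. test_qr_batched_complex64.
for _dtype in ALL_DTYPES:
    setattr(BatchedQRTest, f"test_qr_batched_{_dtype}", make_test(_dtype))
```

Keeping the skip condition in one function makes it easy to remove once the underlying complex64 issue is triaged.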
This reverts commit 5cf7807.
* Revert "Workaround of SWDEV-407984 (#1254)" This reverts commit e3a6481. * Revert "[ROCM] Fix TestLinalgCUDA.test_qr_cuda_complex64." This reverts commit 146e291. * Revert "Integrate new batched linalg drivers (#1163)" This reverts commit 5cf7807. * Updated changes for SWDEV-407984 * Update a missing constant in hipify * NIT related changes
New PR rebased on rocm5.5_internal_testing (later retargeted to rocm5.6_internal_testing).
Integrates batched versions of getrs, geqrf, getrf, getri, and gels.
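As a conceptual illustration of what "batched" means for two of these routines, the sketch below runs an LU factorization (the getrf step) and a solve (the getrs step) over a list of small matrices in pure Python. It is a teaching sketch under stated assumptions, not how hipsolver or rocsolver implement their drivers; a batched driver performs the whole loop in a single library call. All function names are hypothetical.

```python
def lu_factor(a):
    """LU factorization with partial pivoting of one square matrix (getrf-like)."""
    n = len(a)
    lu = [row[:] for row in a]
    piv = list(range(n))  # piv[i] is the original row now sitting at position i
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve A x = b from the factors above (getrs-like)."""
    n = len(lu)
    x = [b[piv[i]] for i in range(n)]  # apply the row permutation to b
    for i in range(n):                 # forward substitution with unit-lower L
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):       # back substitution with upper U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def batched_solve(batch_a, batch_b):
    """Factor and solve every system in the batch, one after another."""
    return [lu_solve(*lu_factor(a), b) for a, b in zip(batch_a, batch_b)]


batch_a = [[[2.0, 1.0], [1.0, 3.0]], [[0.0, 1.0], [1.0, 0.0]]]
batch_b = [[3.0, 5.0], [2.0, 3.0]]
solutions = batched_solve(batch_a, batch_b)
```

The second matrix has a zero in its leading position, so it only factors correctly because of the pivoting step.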