Workaround of SWDEV-407984 #1254
Conversation
Tested locally on Navi 32 (after reverting hipblasLt): no regressions across the complete list of UTs.
The hipblasLt integration revert is already done in #1253, so this PR should include only the changes related to SWDEV-406932.
@xinyazhang Please rebase this PR branch so it reflects only the changes relevant to the geqrf issue.
Force-pushed from 7e6faac to 5a9b0d5.
@jithunnair-amd @pruthvistony Done. All geqrf-related local tests passed.
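For context, here is a minimal sketch of the kind of local geqrf/QR sanity check referenced above; the shapes, dtypes, and tolerances are illustrative assumptions, not the actual unit-test suite:

~~~python
# Hedged sketch of a local geqrf/QR sanity check on a ROCm device
# (ROCm builds expose HIP GPUs through the "cuda" device type).
# Shapes, dtypes, and tolerances are illustrative assumptions.
import torch

device = "cuda"
for dtype in (torch.float32, torch.complex64):
    a = torch.randn(16, 8, dtype=dtype, device=device)
    # geqrf is the low-level LAPACK-style driver underlying torch.linalg.qr
    qr_factors, tau = torch.geqrf(a)
    q, r = torch.linalg.qr(a)
    # Q @ R must reproduce the input within floating-point tolerance
    torch.testing.assert_close(q @ r, a, rtol=1e-4, atol=1e-4)
print("geqrf/QR sanity check passed")
~~~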
LGTM
* Workaround of SWDEV-407984
* Use >= 50700 and < 50800 to match all ROCm 5.7.x releases
* Removed ROCM_VERSION < 50800
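For reference, a minimal sketch of the arithmetic behind that range check, assuming the common ROCM_VERSION encoding of major * 10000 + minor * 100 + patch (so 5.7.1 becomes 50701); the helper names are illustrative only:

~~~python
# Sketch of the version-range check, assuming ROCM_VERSION is encoded as
# major * 10000 + minor * 100 + patch (so 5.7.1 -> 50701). Illustrative only.
def rocm_version(major: int, minor: int, patch: int) -> int:
    return major * 10000 + minor * 100 + patch

def needs_workaround(version: int) -> bool:
    # >= 50700 and < 50800 captures every ROCm 5.7.x release,
    # regardless of patch level, and nothing outside that series.
    return 50700 <= version < 50800

assert needs_workaround(rocm_version(5, 7, 0))
assert needs_workaround(rocm_version(5, 7, 1))
assert not needs_workaround(rocm_version(5, 6, 1))
assert not needs_workaround(rocm_version(5, 8, 0))
~~~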
This reverts commit e3a6481.
* Revert "Workaround of SWDEV-407984 (#1254)" This reverts commit e3a6481. * Revert "[ROCM] Fix TestLinalgCUDA.test_qr_cuda_complex64." This reverts commit 146e291. * Revert "Integrate new batched linalg drivers (#1163)" This reverts commit 5cf7807. * Updated changes for SWDEV-407984 * Update a missing constant in hipify * NIT related changes
…exact sizes per expert needed, instead of max_len (#1254)

This PR switches generate_permute_indices to use the exact sizes needed per expert instead of max_len, so it now returns a tensor of size sum(m_sizes) rather than max_len. This may resolve the current issue [here](pytorch/torchtitan#1237).

Testing: Ran both unit tests with dynamic padding; both pass. Verified this resolves NaNs when running llama4 (credit @raymin0223). pytorch/torchtitan#1237 (comment)

~~~
permuted_indices_gpu=tensor([ 0,  1,  2,  3, 16, 17, 18, 19, 32, 33, 34, 35, 48, 49, 50, 51,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                              4,  5,  6,  7, 20, 21, 22, 23, 36, 37, 38, 39, 52, 53, 54, 55,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                              8,  9, 10, 11, 24, 25, 26, 27, 40, 41, 42, 43, 56, 57, 58, 59,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                             12, 13, 14, 15, 28, 29, 30, 31, 44, 45, 46, 47, 60, 61, 62, 63,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
                            device='cuda:0', dtype=torch.int32),
permuted_indices_cpu=tensor([ 0,  1,  2,  3, 16, 17, 18, 19, 32, 33, 34, 35, 48, 49, 50, 51,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                              4,  5,  6,  7, 20, 21, 22, 23, 36, 37, 38, 39, 52, 53, 54, 55,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                              8,  9, 10, 11, 24, 25, 26, 27, 40, 41, 42, 43, 56, 57, 58, 59,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                             12, 13, 14, 15, 28, 29, 30, 31, 44, 45, 46, 47, 60, 61, 62, 63,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
                            dtype=torch.int32)
m_sizes=tensor([32, 32, 32, 32], device='cuda:0', dtype=torch.int32)
Success

tokens_per_expert_group = tensor([4, 0, 2, 3, 1, 0, 0, 5], device='cuda:0', dtype=torch.int32)
total_tokens_per_expert = tensor([5, 0, 2, 8], device='cuda:0')
m_sizes = tensor([8, 8, 8, 8], device='cuda:0', dtype=torch.int32)
m_offsets = tensor([ 8, 16, 24, 32], device='cuda:0', dtype=torch.int32)
permuted_indices = tensor([ 0,  1,  2,  3,  9, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                            4,  5, -1, -1, -1, -1, -1, -1,  6,  7,  8, 10, 11, 12, 13, 14],
                          device='cuda:0', dtype=torch.int32)
Expert 1 has zero tokens and 8 slots with all -1
All tests passed successfully!
~~~
Also Fixes #SWDEV-406932
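To illustrate the behaviour described in the permute-indices commit above, here is a simplified single-group sketch of padding each expert's token indices into an aligned block, with unused slots set to -1 and an output of length sum(m_sizes); the function name and block size are illustrative assumptions, not the torchtitan API:

~~~python
# Simplified, single-group sketch (not the torchtitan implementation) of
# building permuted indices with per-expert padding: every expert owns a
# block of slots rounded up to a multiple of `block` (minimum one block),
# unused slots are -1, and the output length is sum(m_sizes), not max_len.
import torch

def permute_indices_sketch(tokens_per_expert: torch.Tensor, block: int = 8):
    blocks = torch.clamp((tokens_per_expert + block - 1) // block, min=1)
    m_sizes = (blocks * block).to(torch.int32)
    out = torch.full((int(m_sizes.sum()),), -1, dtype=torch.int32)
    token, offset = 0, 0
    for count, size in zip(tokens_per_expert.tolist(), m_sizes.tolist()):
        out[offset:offset + count] = torch.arange(token, token + count, dtype=torch.int32)
        token += count
        offset += size
    return out, m_sizes

tokens = torch.tensor([3, 0, 5])              # tokens routed to three experts
permuted, m_sizes = permute_indices_sketch(tokens)
print(m_sizes)   # tensor([8, 8, 8], dtype=torch.int32); the empty expert keeps a full -1 block
print(permuted)  # indices 0..2 and 3..7 in their blocks, every other slot is -1
~~~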