Conversation

@xinyazhang

Also Fixes #SWDEV-406932

@xinyazhang
Author

xinyazhang commented Jul 10, 2023

Tested locally on Navi 32 (after reverting hipblasLt). No regressions.

(py_3.8) xinyazha@f18cdb1110e2:~/rocm-pytorch/test$ sh ~/t_geqrf.sh
[W Module.cpp:1349] Warning: cuDNN Benchmark limit is not supported in MIOpen and will have no effect. (function operator())
/home/xinyazha/rocm-pytorch/torch/_functorch/deprecated.py:65: UserWarning: We've integrated functorch into PyTorch. As the final step of the integration, functorch.grad is deprecated as of PyTorch 2.0 and will be deleted in a future version of PyTorch >= 2.3. Please use torch.func.grad instead; see the PyTorch 2.0 release notes and/or the torch.func migration guide for more details https://pytorch.org/docs/master/func.migrating.html
  warn_deprecated('grad')
.../home/xinyazha/rocm-pytorch/torch/testing/_internal/opinfo/core.py:1097: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at /home/xinyazha/rocm-pytorch/aten/src/ATen/native/BatchLinearAlgebra.cpp:2426.)
  return self.op(*args, **kwargs)
....../home/xinyazha/rocm-pytorch/torch/_functorch/deprecated.py:73: UserWarning: We've integrated functorch into PyTorch. As the final step of the integration, functorch.vjp is deprecated as of PyTorch 2.0 and will be deleted in a future version of PyTorch >= 2.3. Please use torch.func.vjp instead; see the PyTorch 2.0 release notes and/or the torch.func migration guide for more details https://pytorch.org/docs/master/func.migrating.html
  warn_deprecated('vjp')
............../home/xinyazha/rocm-pytorch/torch/_functorch/deprecated.py:61: UserWarning: We've integrated functorch into PyTorch. As the final step of the integration, functorch.vmap is deprecated as of PyTorch 2.0 and will be deleted in a future version of PyTorch >= 2.3. Please use torch.vmap instead; see the PyTorch 2.0 release notes and/or the torch.func migration guide for more details https://pytorch.org/docs/master/func.migrating.html
  warn_deprecated('vmap', 'torch.vmap')
..../home/xinyazha/rocm-pytorch/torch/autograd/__init__.py:319: UserWarning: There is a performance drop because we have not yet implemented the batching rule for aten::ormqr. Please file us an issue on GitHub so that we can prioritize its implementation. (Triggered internally at /home/xinyazha/rocm-pytorch/aten/src/ATen/functorch/BatchedFallback.cpp:82.)
  result = Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/home/xinyazha/rocm-pytorch/torch/autograd/__init__.py:319: UserWarning: There is a performance drop because we have not yet implemented the batching rule for aten::tril_. Please file us an issue on GitHub so that we can prioritize its implementation. (Triggered internally at /home/xinyazha/rocm-pytorch/aten/src/ATen/functorch/BatchedFallback.cpp:82.)
  result = Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
................./home/xinyazha/rocm-pytorch/torch/testing/_internal/opinfo/core.py:1097: UserWarning: There is a performance drop because we have not yet implemented the batching rule for aten::ormqr. Please file us an issue on GitHub so that we can prioritize its implementation. (Triggered internally at /home/xinyazha/rocm-pytorch/aten/src/ATen/functorch/BatchedFallback.cpp:82.)
  return self.op(*args, **kwargs)
....
----------------------------------------------------------------------
Ran 48 tests in 244.121s

OK
/home/xinyazha/rocm-pytorch/torch/_functorch/deprecated.py:61: UserWarning: We've integrated functorch into PyTorch. As the final step of the integration, functorch.vmap is deprecated as of PyTorch 2.0 and will be deleted in a future version of PyTorch >= 2.3. Please use torch.vmap instead; see the PyTorch 2.0 release notes and/or the torch.func migration guide for more details https://pytorch.org/docs/master/func.migrating.html
  warn_deprecated('vmap', 'torch.vmap')
../home/xinyazha/rocm-pytorch/test/functorch/common_utils.py:32: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at /home/xinyazha/rocm-pytorch/aten/src/ATen/native/BatchLinearAlgebra.cpp:2426.)
  out = op(*pytree.tree_unflatten(new_args, args_spec), **kwarg_values)
.[W Module.cpp:1349] Warning: cuDNN Benchmark limit is not supported in MIOpen and will have no effect. (function operator())
../home/xinyazha/rocm-pytorch/torch/testing/_internal/opinfo/core.py:1097: UserWarning: There is a performance drop because we have not yet implemented the batching rule for aten::ormqr. Please file us an issue on GitHub so that we can prioritize its implementation. (Triggered internally at /home/xinyazha/rocm-pytorch/aten/src/ATen/functorch/BatchedFallback.cpp:82.)
  return self.op(*args, **kwargs)
/home/xinyazha/rocm-pytorch/torch/_functorch/vmap.py:621: UserWarning: There is a performance drop because we have not yet implemented the batching rule for aten::ormqr. Please file us an issue on GitHub so that we can prioritize its implementation. (Triggered internally at /home/xinyazha/rocm-pytorch/aten/src/ATen/functorch/BatchedFallback.cpp:82.)
  batched_outputs = func(*batched_inputs, **kwargs)
/home/xinyazha/rocm-pytorch/test/functorch/common_utils.py:266: UserWarning: There is a performance drop because we have not yet implemented the batching rule for aten::ormqr. Please file us an issue on GitHub so that we can prioritize its implementation. (Triggered internally at /home/xinyazha/rocm-pytorch/aten/src/ATen/functorch/BatchedFallback.cpp:82.)
  return op(*args, **kwargs)
..
----------------------------------------------------------------------
Ran 7 tests in 6.299s

OK
..............
----------------------------------------------------------------------
Ran 14 tests in 97.597s

OK
......test_linalg.py:3537: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at /home/xinyazha/rocm-pytorch/aten/src/ATen/native/BatchLinearAlgebra.cpp:2426.)
  Q, R = torch.qr(A, some=some)
test_linalg.py:3555: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at /home/xinyazha/rocm-pytorch/aten/src/ATen/native/BatchLinearAlgebra.cpp:2438.)
  torch.qr(A, some=some, out=(Q_out, R_out))
....
----------------------------------------------------------------------
Ran 10 tests in 5.419s

OK
.............
----------------------------------------------------------------------
Ran 13 tests in 52.167s

OK
.......[W Module.cpp:1349] Warning: cuDNN Benchmark limit is not supported in MIOpen and will have no effect. (function operator())
............................................
----------------------------------------------------------------------
Ran 51 tests in 367.304s

OK
...[W Module.cpp:1349] Warning: cuDNN Benchmark limit is not supported in MIOpen and will have no effect. (function operator())
./home/xinyazha/rocm-pytorch/torch/testing/_internal/opinfo/core.py:769: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at /home/xinyazha/rocm-pytorch/aten/src/ATen/native/BatchLinearAlgebra.cpp:2426.)
  gradcheck_wrapper: Callable = lambda op, *args, **kwargs: op(*args, **kwargs)
..........
----------------------------------------------------------------------
Ran 14 tests in 131.775s

OK
....[W Module.cpp:1349] Warning: cuDNN Benchmark limit is not supported in MIOpen and will have no effect. (function operator())
./home/xinyazha/rocm-pytorch/torch/testing/_internal/opinfo/core.py:769: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at /home/xinyazha/rocm-pytorch/aten/src/ATen/native/BatchLinearAlgebra.cpp:2426.)
  gradcheck_wrapper: Callable = lambda op, *args, **kwargs: op(*args, **kwargs)
...........
----------------------------------------------------------------------
Ran 16 tests in 249.297s

OK
....../home/xinyazha/rocm-pytorch/torch/testing/_internal/common_jit.py:160: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at /home/xinyazha/rocm-pytorch/aten/src/ATen/native/BatchLinearAlgebra.cpp:2426.)
  results = func(*inputs, **kwargs)
..
----------------------------------------------------------------------
Ran 8 tests in 105.055s

OK
............[W Module.cpp:1349] Warning: cuDNN Benchmark limit is not supported in MIOpen and will have no effect. (function operator())
../home/xinyazha/rocm-pytorch/torch/testing/_internal/opinfo/core.py:1097: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at /home/xinyazha/rocm-pytorch/aten/src/ATen/native/BatchLinearAlgebra.cpp:2426.)
  return self.op(*args, **kwargs)
......
----------------------------------------------------------------------
Ran 20 tests in 25.719s

OK
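The torch.qr deprecation warnings repeated throughout the log all describe the same migration: the boolean `some` argument of `torch.qr` maps onto the string `mode` argument of `torch.linalg.qr`. A minimal sketch of that mapping (the helper name is ours, purely for illustration):

```python
def qr_mode_from_some(some: bool) -> str:
    # torch.qr(A, some=True)  corresponds to torch.linalg.qr(A, mode='reduced')  (the default)
    # torch.qr(A, some=False) corresponds to torch.linalg.qr(A, mode='complete')
    return 'reduced' if some else 'complete'
```

For an m-by-n input, 'reduced' returns Q of shape (m, k) and R of shape (k, n) with k = min(m, n), while 'complete' returns a square m-by-m Q.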

The complete list of UTs

PYTORCH_TEST_WITH_ROCM=1 python functorch/test_ops.py  TestOperatorsCUDA.test_grad_linalg_qr_cuda_float32  TestOperatorsCUDA.test_grad_ormqr_cuda_float32  TestOperatorsCUDA.test_grad_pca_lowrank_cuda_float32  TestOperatorsCUDA.test_grad_qr_cuda_float32  TestOperatorsCUDA.test_grad_svd_lowrank_cuda_float32  TestOperatorsCUDA.test_jvp_linalg_qr_cuda_float32  TestOperatorsCUDA.test_jvp_pca_lowrank_cuda_float32  TestOperatorsCUDA.test_jvp_qr_cuda_float32  TestOperatorsCUDA.test_jvp_svd_lowrank_cuda_float32  TestOperatorsCUDA.test_jvpvjp_linalg_qr_cuda_float32  TestOperatorsCUDA.test_jvpvjp_pca_lowrank_cuda_float32  TestOperatorsCUDA.test_jvpvjp_qr_cuda_float32  TestOperatorsCUDA.test_jvpvjp_svd_lowrank_cuda_float32  TestOperatorsCUDA.test_vjp_linalg_qr_cuda_float32  TestOperatorsCUDA.test_vjp_ormqr_cuda_float32  TestOperatorsCUDA.test_vjp_pca_lowrank_cuda_float32  TestOperatorsCUDA.test_vjp_qr_cuda_float32  TestOperatorsCUDA.test_vjp_svd_lowrank_cuda_float32  TestOperatorsCUDA.test_vjpvjp_linalg_qr_cuda_float32  TestOperatorsCUDA.test_vjpvjp_ormqr_cuda_float32  TestOperatorsCUDA.test_vjpvjp_pca_lowrank_cuda_float32  TestOperatorsCUDA.test_vjpvjp_qr_cuda_float32  TestOperatorsCUDA.test_vjpvjp_svd_lowrank_cuda_float32  TestOperatorsCUDA.test_vjpvmap_linalg_qr_cuda_float32  TestOperatorsCUDA.test_vjpvmap_qr_cuda_float32  TestOperatorsCUDA.test_vmap_autograd_grad_linalg_qr_cuda_float32  TestOperatorsCUDA.test_vmap_autograd_grad_linalg_qr_cuda_float64  TestOperatorsCUDA.test_vmap_autograd_grad_ormqr_cuda_float32  TestOperatorsCUDA.test_vmap_autograd_grad_ormqr_cuda_float64  TestOperatorsCUDA.test_vmap_autograd_grad_pca_lowrank_cuda_float32  TestOperatorsCUDA.test_vmap_autograd_grad_pca_lowrank_cuda_float64  TestOperatorsCUDA.test_vmap_autograd_grad_qr_cuda_float32  TestOperatorsCUDA.test_vmap_autograd_grad_qr_cuda_float64  TestOperatorsCUDA.test_vmap_autograd_grad_svd_lowrank_cuda_float32  TestOperatorsCUDA.test_vmap_autograd_grad_svd_lowrank_cuda_float64  TestOperatorsCUDA.test_vmapjvpall_has_batch_rule_linalg_qr_cuda_float32  TestOperatorsCUDA.test_vmapjvpall_has_batch_rule_qr_cuda_float32  TestOperatorsCUDA.test_vmapjvpall_linalg_qr_cuda_float32  TestOperatorsCUDA.test_vmapjvpall_qr_cuda_float32  TestOperatorsCUDA.test_vmapjvpvjp_linalg_qr_cuda_float32  TestOperatorsCUDA.test_vmapjvpvjp_qr_cuda_float32  TestOperatorsCUDA.test_vmapvjp_has_batch_rule_linalg_qr_cuda_float32  TestOperatorsCUDA.test_vmapvjp_has_batch_rule_qr_cuda_float32  TestOperatorsCUDA.test_vmapvjp_linalg_qr_cuda_float32  TestOperatorsCUDA.test_vmapvjp_ormqr_cuda_float32  TestOperatorsCUDA.test_vmapvjp_qr_cuda_float32  TestOperatorsCUDA.test_vmapvjpvjp_linalg_qr_cuda_float32  TestOperatorsCUDA.test_vmapvjpvjp_qr_cuda_float32
PYTORCH_TEST_WITH_ROCM=1 python functorch/test_vmap.py  TestVmapOperatorsOpInfoCUDA.test_op_has_batch_rule_geqrf_cuda_float32  TestVmapOperatorsOpInfoCUDA.test_op_has_batch_rule_linalg_qr_cuda_float32  TestVmapOperatorsOpInfoCUDA.test_op_has_batch_rule_qr_cuda_float32  TestVmapOperatorsOpInfoCUDA.test_vmap_exhaustive_geqrf_cuda_float32  TestVmapOperatorsOpInfoCUDA.test_vmap_exhaustive_linalg_qr_cuda_float32  TestVmapOperatorsOpInfoCUDA.test_vmap_exhaustive_ormqr_cuda_float32  TestVmapOperatorsOpInfoCUDA.test_vmap_exhaustive_qr_cuda_float32
PYTORCH_TEST_WITH_ROCM=1 python test_decomp.py  TestDecompCUDA.test_comprehensive_geqrf_cuda_complex128  TestDecompCUDA.test_comprehensive_geqrf_cuda_complex64  TestDecompCUDA.test_comprehensive_geqrf_cuda_float32  TestDecompCUDA.test_comprehensive_geqrf_cuda_float64  TestDecompCUDA.test_comprehensive_linalg_qr_cuda_complex128  TestDecompCUDA.test_comprehensive_linalg_qr_cuda_complex64  TestDecompCUDA.test_comprehensive_linalg_qr_cuda_float32  TestDecompCUDA.test_comprehensive_linalg_qr_cuda_float64  TestDecompCUDA.test_comprehensive_ormqr_cuda_float32  TestDecompCUDA.test_comprehensive_ormqr_cuda_float64  TestDecompCUDA.test_comprehensive_qr_cuda_complex128  TestDecompCUDA.test_comprehensive_qr_cuda_complex64  TestDecompCUDA.test_comprehensive_qr_cuda_float32  TestDecompCUDA.test_comprehensive_qr_cuda_float64
PYTORCH_TEST_WITH_ROCM=1 python test_linalg.py  TestLinalgCUDA.test_geqrf_cuda_float32  TestLinalgCUDA.test_geqrf_cuda_float64  TestLinalgCUDA.test_householder_product_cuda_float32  TestLinalgCUDA.test_householder_product_cuda_float64  TestLinalgCUDA.test_ormqr_cuda_float32  TestLinalgCUDA.test_ormqr_cuda_float64  TestLinalgCUDA.test_qr_cuda_complex128  TestLinalgCUDA.test_qr_cuda_complex64  TestLinalgCUDA.test_qr_cuda_float32  TestLinalgCUDA.test_qr_cuda_float64
PYTORCH_TEST_WITH_ROCM=1 python test_meta.py  TestMetaCUDA.test_dispatch_meta_outplace_ormqr_cuda_complex128  TestMetaCUDA.test_dispatch_meta_outplace_ormqr_cuda_complex64  TestMetaCUDA.test_dispatch_meta_outplace_ormqr_cuda_float32  TestMetaCUDA.test_dispatch_meta_outplace_ormqr_cuda_float64  TestMetaCUDA.test_dispatch_symbolic_meta_outplace_all_strides_ormqr_cuda_float32  TestMetaCUDA.test_dispatch_symbolic_meta_outplace_ormqr_cuda_complex128  TestMetaCUDA.test_dispatch_symbolic_meta_outplace_ormqr_cuda_complex64  TestMetaCUDA.test_dispatch_symbolic_meta_outplace_ormqr_cuda_float32  TestMetaCUDA.test_dispatch_symbolic_meta_outplace_ormqr_cuda_float64  TestMetaCUDA.test_meta_outplace_ormqr_cuda_complex128  TestMetaCUDA.test_meta_outplace_ormqr_cuda_complex64  TestMetaCUDA.test_meta_outplace_ormqr_cuda_float32  TestMetaCUDA.test_meta_outplace_ormqr_cuda_float64
PYTORCH_TEST_WITH_ROCM=1 python test_ops.py  TestCommonCUDA.test_dtypes_ormqr_cuda  TestCommonCUDA.test_noncontiguous_samples_geqrf_cuda_complex64  TestCommonCUDA.test_noncontiguous_samples_geqrf_cuda_float32  TestCommonCUDA.test_noncontiguous_samples_linalg_qr_cuda_complex64  TestCommonCUDA.test_noncontiguous_samples_linalg_qr_cuda_float32  TestCommonCUDA.test_noncontiguous_samples_ormqr_cuda_complex64  TestCommonCUDA.test_noncontiguous_samples_ormqr_cuda_float32  TestCommonCUDA.test_noncontiguous_samples_pca_lowrank_cuda_float32  TestCommonCUDA.test_noncontiguous_samples_qr_cuda_complex64  TestCommonCUDA.test_noncontiguous_samples_qr_cuda_float32  TestCommonCUDA.test_noncontiguous_samples_svd_lowrank_cuda_float32  TestCommonCUDA.test_variant_consistency_eager_geqrf_cuda_complex64  TestCommonCUDA.test_variant_consistency_eager_geqrf_cuda_float32  TestCommonCUDA.test_variant_consistency_eager_linalg_qr_cuda_complex64  TestCommonCUDA.test_variant_consistency_eager_linalg_qr_cuda_float32  TestCommonCUDA.test_variant_consistency_eager_ormqr_cuda_complex64  TestCommonCUDA.test_variant_consistency_eager_ormqr_cuda_float32  TestCommonCUDA.test_variant_consistency_eager_pca_lowrank_cuda_float32  TestCommonCUDA.test_variant_consistency_eager_qr_cuda_complex64  TestCommonCUDA.test_variant_consistency_eager_qr_cuda_float32  TestCommonCUDA.test_variant_consistency_eager_svd_lowrank_cuda_float32  TestCompositeComplianceCUDA.test_backward_linalg_qr_cuda_float32  TestCompositeComplianceCUDA.test_backward_ormqr_cuda_float32  TestCompositeComplianceCUDA.test_backward_qr_cuda_float32  TestCompositeComplianceCUDA.test_forward_ad_linalg_qr_cuda_float32  TestCompositeComplianceCUDA.test_forward_ad_qr_cuda_float32  TestCompositeComplianceCUDA.test_operator_geqrf_cuda_float32  TestCompositeComplianceCUDA.test_operator_linalg_qr_cuda_float32  TestCompositeComplianceCUDA.test_operator_ormqr_cuda_float32  TestCompositeComplianceCUDA.test_operator_pca_lowrank_cuda_float32  TestCompositeComplianceCUDA.test_operator_qr_cuda_float32  TestCompositeComplianceCUDA.test_operator_svd_lowrank_cuda_float32  TestFakeTensorCUDA.test_fake_autocast_ormqr_cuda_float32  TestFakeTensorCUDA.test_fake_crossref_backward_amp_linalg_qr_cuda_float32  TestFakeTensorCUDA.test_fake_crossref_backward_amp_ormqr_cuda_float32  TestFakeTensorCUDA.test_fake_crossref_backward_amp_qr_cuda_float32  TestFakeTensorCUDA.test_fake_crossref_backward_no_amp_linalg_qr_cuda_float32  TestFakeTensorCUDA.test_fake_crossref_backward_no_amp_ormqr_cuda_float32  TestFakeTensorCUDA.test_fake_crossref_backward_no_amp_qr_cuda_float32  TestFakeTensorCUDA.test_fake_ormqr_cuda_float32  TestFakeTensorCUDA.test_pointwise_ops_ormqr_cuda_float32  TestMathBitsCUDA.test_conj_view_geqrf_cuda_complex64  TestMathBitsCUDA.test_conj_view_linalg_qr_cuda_complex64  TestMathBitsCUDA.test_conj_view_ormqr_cuda_complex64  TestMathBitsCUDA.test_conj_view_qr_cuda_complex64  TestMathBitsCUDA.test_neg_view_geqrf_cuda_float64  TestMathBitsCUDA.test_neg_view_linalg_qr_cuda_float64  TestMathBitsCUDA.test_neg_view_ormqr_cuda_float64  TestMathBitsCUDA.test_neg_view_pca_lowrank_cuda_float64  TestMathBitsCUDA.test_neg_view_qr_cuda_float64  TestMathBitsCUDA.test_neg_view_svd_lowrank_cuda_float64
PYTORCH_TEST_WITH_ROCM=1 python test_ops_fwd_gradients.py  TestFwdGradientsCUDA.test_fn_fwgrad_bwgrad_linalg_qr_cuda_complex128  TestFwdGradientsCUDA.test_fn_fwgrad_bwgrad_linalg_qr_cuda_float64  TestFwdGradientsCUDA.test_fn_fwgrad_bwgrad_ormqr_cuda_complex128  TestFwdGradientsCUDA.test_fn_fwgrad_bwgrad_pca_lowrank_cuda_float64  TestFwdGradientsCUDA.test_fn_fwgrad_bwgrad_qr_cuda_complex128  TestFwdGradientsCUDA.test_fn_fwgrad_bwgrad_qr_cuda_float64  TestFwdGradientsCUDA.test_fn_fwgrad_bwgrad_svd_lowrank_cuda_float64  TestFwdGradientsCUDA.test_forward_mode_AD_linalg_qr_cuda_complex128  TestFwdGradientsCUDA.test_forward_mode_AD_linalg_qr_cuda_float64  TestFwdGradientsCUDA.test_forward_mode_AD_ormqr_cuda_complex128  TestFwdGradientsCUDA.test_forward_mode_AD_pca_lowrank_cuda_float64  TestFwdGradientsCUDA.test_forward_mode_AD_qr_cuda_complex128  TestFwdGradientsCUDA.test_forward_mode_AD_qr_cuda_float64  TestFwdGradientsCUDA.test_forward_mode_AD_svd_lowrank_cuda_float64
PYTORCH_TEST_WITH_ROCM=1 python test_ops_gradients.py  TestBwdGradientsCUDA.test_fn_grad_linalg_qr_cuda_complex128  TestBwdGradientsCUDA.test_fn_grad_linalg_qr_cuda_float64  TestBwdGradientsCUDA.test_fn_grad_ormqr_cuda_complex128  TestBwdGradientsCUDA.test_fn_grad_ormqr_cuda_float64  TestBwdGradientsCUDA.test_fn_grad_pca_lowrank_cuda_float64  TestBwdGradientsCUDA.test_fn_grad_qr_cuda_complex128  TestBwdGradientsCUDA.test_fn_grad_qr_cuda_float64  TestBwdGradientsCUDA.test_fn_grad_svd_lowrank_cuda_float64  TestBwdGradientsCUDA.test_fn_gradgrad_linalg_qr_cuda_complex128  TestBwdGradientsCUDA.test_fn_gradgrad_linalg_qr_cuda_float64  TestBwdGradientsCUDA.test_fn_gradgrad_ormqr_cuda_complex128  TestBwdGradientsCUDA.test_fn_gradgrad_ormqr_cuda_float64  TestBwdGradientsCUDA.test_fn_gradgrad_pca_lowrank_cuda_float64  TestBwdGradientsCUDA.test_fn_gradgrad_qr_cuda_complex128  TestBwdGradientsCUDA.test_fn_gradgrad_qr_cuda_float64  TestBwdGradientsCUDA.test_fn_gradgrad_svd_lowrank_cuda_float64
PYTORCH_TEST_WITH_ROCM=1 python test_ops_jit.py  TestJitCUDA.test_variant_consistency_jit_geqrf_cuda_complex64  TestJitCUDA.test_variant_consistency_jit_geqrf_cuda_float32  TestJitCUDA.test_variant_consistency_jit_linalg_qr_cuda_complex64  TestJitCUDA.test_variant_consistency_jit_linalg_qr_cuda_float32  TestJitCUDA.test_variant_consistency_jit_ormqr_cuda_complex64  TestJitCUDA.test_variant_consistency_jit_ormqr_cuda_float32  TestJitCUDA.test_variant_consistency_jit_qr_cuda_complex64  TestJitCUDA.test_variant_consistency_jit_qr_cuda_float32
PYTORCH_TEST_WITH_ROCM=1 python test_schema_check.py  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_geqrf_cuda_complex128  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_geqrf_cuda_complex64  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_geqrf_cuda_float32  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_geqrf_cuda_float64  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_linalg_qr_cuda_complex128  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_linalg_qr_cuda_complex64  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_linalg_qr_cuda_float32  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_linalg_qr_cuda_float64  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_ormqr_cuda_complex128  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_ormqr_cuda_complex64  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_ormqr_cuda_float32  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_ormqr_cuda_float64  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_pca_lowrank_cuda_float32  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_pca_lowrank_cuda_float64  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_qr_cuda_complex128  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_qr_cuda_complex64  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_qr_cuda_float32  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_qr_cuda_float64  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_svd_lowrank_cuda_float32  TestSchemaCheckModeOpInfoCUDA.test_schema_correctness_svd_lowrank_cuda_float64

Collaborator

@pruthvistony pruthvistony left a comment


The hipblasLt integration revert is already done in #1253, so this PR needs to include only the changes related to SWDEV-406932.

@jithunnair-amd
Collaborator

The hipblasLt integration revert is already done in #1253, so this PR needs to include only the changes related to SWDEV-406932.

@xinyazhang Please rebase this PR branch so it reflects only the changes relevant to the geqrf issue.

@xinyazhang xinyazhang force-pushed the xinyazhang/geqrf_batchsize0 branch from 7e6faac to 5a9b0d5 on July 12, 2023 at 21:34
@xinyazhang
Author

@jithunnair-amd @pruthvistony Done. All geqrf-related local tests passed.

@jithunnair-amd jithunnair-amd merged commit a75ea71 into rocm5.7_internal_testing Jul 13, 2023
Collaborator

@pruthvistony pruthvistony left a comment


LGTM

pruthvistony pushed a commit that referenced this pull request Sep 12, 2023
* Workaround of SWDEV-407984

* Use >= 50700 and < 50800 to match all ROCm 5.7.x releases

* Removed ROCM_VERSION < 50800
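The version bounds in the commit messages above rely on PyTorch's integer encoding of the ROCm version, under which every ROCm 5.7.x release satisfies 50700 <= ROCM_VERSION < 50800. A minimal sketch of that encoding (the Python helper names are made up for illustration; the real check is a C preprocessor guard on ROCM_VERSION):

```python
def rocm_version_int(major: int, minor: int, patch: int = 0) -> int:
    # PyTorch encodes ROCM_VERSION as major * 10000 + minor * 100 + patch,
    # e.g. ROCm 5.7.0 -> 50700.
    return major * 10000 + minor * 100 + patch

def matches_rocm_57x(version: int) -> bool:
    # The range check used for the workaround: true for 5.7.0, 5.7.1, ... only.
    return 50700 <= version < 50800
```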
pruthvistony added a commit that referenced this pull request Sep 28, 2023
pruthvistony added a commit that referenced this pull request Sep 29, 2023
* Revert "Workaround of SWDEV-407984 (#1254)"

This reverts commit e3a6481.

* Revert "[ROCM] Fix TestLinalgCUDA.test_qr_cuda_complex64."

This reverts commit 146e291.

* Revert "Integrate new batched linalg drivers (#1163)"

This reverts commit 5cf7807.

* Updated changes for SWDEV-407984

* Update a missing constant in hipify

* NIT related changes
akashveramd pushed a commit that referenced this pull request Jun 13, 2025
…rt needed, instead of max_len (#1254)

This PR switches generate_permute_indices to using the exact sizes needed per expert, instead of max_len.
Thus, we now return a tensor of size sum(m_sizes) instead of max_len.
This may resolve the current issue
[here](pytorch/torchtitan#1237).

Testing:
Ran both unit tests with dynamic padding; both pass.
Verified this resolves NaNs when running llama4 (credit @raymin0223).
pytorch/torchtitan#1237 (comment)

~~~
permuted_indices_gpu=tensor([ 0, 1, 2, 3, 16, 17, 18, 19, 32, 33, 34,
35, 48, 49, 50, 51, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 4, 5, 6, 7,
20, 21, 22, 23, 36, 37, 38, 39, 52, 53, 54, 55, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 8, 9, 10, 11, 24, 25, 26, 27,
40, 41, 42, 43, 56, 57, 58, 59, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, 12, 13, 14, 15, 28, 29, 30, 31, 44, 45, 46, 47,
60, 61, 62, 63, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1], device='cuda:0', dtype=torch.int32), 
permuted_indices_cpu=tensor([ 0, 1, 2, 3, 16, 17, 18, 19, 32, 33, 34,
35, 48, 49, 50, 51, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 4, 5, 6, 7,
20, 21, 22, 23, 36, 37, 38, 39, 52, 53, 54, 55, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 8, 9, 10, 11, 24, 25, 26, 27,
40, 41, 42, 43, 56, 57, 58, 59, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, 12, 13, 14, 15, 28, 29, 30, 31, 44, 45, 46, 47,
60, 61, 62, 63, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1], dtype=torch.int32)
m_sizes=tensor([32, 32, 32, 32], device='cuda:0', dtype=torch.int32)
Success
tokens_per_expert_group = tensor([4, 0, 2, 3, 1, 0, 0, 5],
device='cuda:0', dtype=torch.int32)
total_tokens_per_expert = tensor([5, 0, 2, 8], device='cuda:0')
m_sizes = tensor([8, 8, 8, 8], device='cuda:0', dtype=torch.int32)
m_offsets = tensor([ 8, 16, 24, 32], device='cuda:0', dtype=torch.int32)
permuted_indices = tensor([ 0, 1, 2, 3, 9, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, 4, 5,
        -1, -1, -1, -1, -1, -1,  6,  7,  8, 10, 11, 12, 13, 14],
       device='cuda:0', dtype=torch.int32)
Expert 1 has zero tokens and 8 slots with all -1
All tests passed successfully!
~~~
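The second example in the log above can be reproduced with a small pure-Python sketch of the exact-sizes permutation. The function name mirrors the PR, but the signature and the alignment-padding rule (round each expert's slot count up to a multiple of the alignment, with a minimum of one group) are assumptions for illustration; the real generate_permute_indices operates on CUDA tensors in torchtitan.

```python
def generate_permute_indices(tokens_per_expert_group, num_ranks, num_local_experts,
                             alignment=8):
    """Return (m_sizes, m_offsets, permuted_indices) using exact per-expert sizes."""
    # Total tokens each local expert receives, summed across sending ranks.
    totals = [0] * num_local_experts
    for r in range(num_ranks):
        for e in range(num_local_experts):
            totals[e] += tokens_per_expert_group[r * num_local_experts + e]

    # Pad each expert's slot count up to the alignment (at least one full group,
    # even for an expert that received zero tokens).
    m_sizes = [max(alignment, -(-t // alignment) * alignment) for t in totals]

    # Inclusive running offsets of the padded groups.
    m_offsets, acc = [], 0
    for s in m_sizes:
        acc += s
        m_offsets.append(acc)

    # Tokens arrive rank-major, grouped by expert within each rank; record the
    # starting position of every (rank, expert) chunk in that layout.
    starts, pos = {}, 0
    for r in range(num_ranks):
        for e in range(num_local_experts):
            starts[(r, e)] = pos
            pos += tokens_per_expert_group[r * num_local_experts + e]

    # For each expert, gather its token indices from every rank, then pad with -1.
    permuted = []
    for e in range(num_local_experts):
        row = []
        for r in range(num_ranks):
            n = tokens_per_expert_group[r * num_local_experts + e]
            row.extend(range(starts[(r, e)], starts[(r, e)] + n))
        row.extend([-1] * (m_sizes[e] - len(row)))
        permuted.extend(row)
    return m_sizes, m_offsets, permuted
```

On the tokens_per_expert_group = [4, 0, 2, 3, 1, 0, 0, 5] example above (2 ranks, 4 local experts), this yields m_sizes = [8, 8, 8, 8], m_offsets = [8, 16, 24, 32], and the same permuted indices, with expert 1 (zero tokens) getting a full group of -1 padding; the output length is sum(m_sizes) rather than num_local_experts * max_len.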