Workaround of SWDEV-407984 #1254
Conversation
Tested locally on Navi 32 (after reverting hipblasLt): no regressions across the complete list of UTs.
The hipblasLt integration revert is already done in #1253, so this PR should include only the changes related to SWDEV-406932.
@xinyazhang Please rebase this PR branch so it reflects only the changes relevant to the geqrf issue.
Force-pushed from 7e6faac to 5a9b0d5.
@jithunnair-amd @pruthvistony Done. All geqrf-related local tests passed.
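For context, here is a minimal sketch of the kind of local geqrf/QR sanity check referenced above; the shapes, dtypes, and tolerances are illustrative assumptions, not the actual unit-test suite:

~~~python
# Hedged sketch of a local geqrf/QR sanity check on a ROCm device
# (ROCm builds expose HIP GPUs through the "cuda" device type).
# Shapes, dtypes, and tolerances are illustrative assumptions.
import torch

device = "cuda"
for dtype in (torch.float32, torch.complex64):
    a = torch.randn(16, 8, dtype=dtype, device=device)
    # geqrf is the low-level LAPACK-style driver underlying torch.linalg.qr
    qr_factors, tau = torch.geqrf(a)
    q, r = torch.linalg.qr(a)
    # Q @ R must reproduce the input within floating-point tolerance
    torch.testing.assert_close(q @ r, a, rtol=1e-4, atol=1e-4)
print("geqrf/QR sanity check passed")
~~~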
LGTM
* Workaround of SWDEV-407984
* Use >= 50700 and < 50800 to match all ROCm 5.7.x releases
* Removed ROCM_VERSION < 50800
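For reference, a minimal sketch of the arithmetic behind that range check, assuming the common ROCM_VERSION encoding of major * 10000 + minor * 100 + patch (so 5.7.1 becomes 50701); the helper names are illustrative only:

~~~python
# Sketch of the version-range check, assuming ROCM_VERSION is encoded as
# major * 10000 + minor * 100 + patch (so 5.7.1 -> 50701). Illustrative only.
def rocm_version(major: int, minor: int, patch: int) -> int:
    return major * 10000 + minor * 100 + patch

def needs_workaround(version: int) -> bool:
    # >= 50700 and < 50800 captures every ROCm 5.7.x release,
    # regardless of patch level, and nothing outside that series.
    return 50700 <= version < 50800

assert needs_workaround(rocm_version(5, 7, 0))
assert needs_workaround(rocm_version(5, 7, 1))
assert not needs_workaround(rocm_version(5, 6, 1))
assert not needs_workaround(rocm_version(5, 8, 0))
~~~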
This reverts commit e3a6481.
* Revert "Workaround of SWDEV-407984 (#1254)" This reverts commit e3a6481. * Revert "[ROCM] Fix TestLinalgCUDA.test_qr_cuda_complex64." This reverts commit 146e291. * Revert "Integrate new batched linalg drivers (#1163)" This reverts commit 5cf7807. * Updated changes for SWDEV-407984 * Update a missing constant in hipify * NIT related changes
…exact sizes per expert needed, instead of max_len (#1254)

This PR switches generate_permute_indices to use the exact sizes needed per expert instead of max_len, so it now returns a tensor of size sum(m_sizes) rather than max_len. This may resolve the current issue [here](pytorch/torchtitan#1237).

Testing: Ran both unit tests with dynamic padding; both pass. Verified this resolves NaNs when running llama4 (credit @raymin0223). pytorch/torchtitan#1237 (comment)

~~~
permuted_indices_gpu=tensor([ 0,  1,  2,  3, 16, 17, 18, 19, 32, 33, 34, 35, 48, 49, 50, 51,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                              4,  5,  6,  7, 20, 21, 22, 23, 36, 37, 38, 39, 52, 53, 54, 55,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                              8,  9, 10, 11, 24, 25, 26, 27, 40, 41, 42, 43, 56, 57, 58, 59,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                             12, 13, 14, 15, 28, 29, 30, 31, 44, 45, 46, 47, 60, 61, 62, 63,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
                            device='cuda:0', dtype=torch.int32),
permuted_indices_cpu=tensor([ 0,  1,  2,  3, 16, 17, 18, 19, 32, 33, 34, 35, 48, 49, 50, 51,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                              4,  5,  6,  7, 20, 21, 22, 23, 36, 37, 38, 39, 52, 53, 54, 55,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                              8,  9, 10, 11, 24, 25, 26, 27, 40, 41, 42, 43, 56, 57, 58, 59,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                             12, 13, 14, 15, 28, 29, 30, 31, 44, 45, 46, 47, 60, 61, 62, 63,
                             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
                            dtype=torch.int32)
m_sizes=tensor([32, 32, 32, 32], device='cuda:0', dtype=torch.int32)
Success

tokens_per_expert_group = tensor([4, 0, 2, 3, 1, 0, 0, 5], device='cuda:0', dtype=torch.int32)
total_tokens_per_expert = tensor([5, 0, 2, 8], device='cuda:0')
m_sizes = tensor([8, 8, 8, 8], device='cuda:0', dtype=torch.int32)
m_offsets = tensor([ 8, 16, 24, 32], device='cuda:0', dtype=torch.int32)
permuted_indices = tensor([ 0,  1,  2,  3,  9, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                            4,  5, -1, -1, -1, -1, -1, -1,  6,  7,  8, 10, 11, 12, 13, 14],
                          device='cuda:0', dtype=torch.int32)
Expert 1 has zero tokens and 8 slots with all -1
All tests passed successfully!
~~~
Also Fixes #SWDEV-406932
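To illustrate the behaviour described in the permute-indices commit above, here is a simplified single-group sketch of padding each expert's token indices into an aligned block, with unused slots set to -1 and an output of length sum(m_sizes); the function name and block size are illustrative assumptions, not the torchtitan API:

~~~python
# Simplified, single-group sketch (not the torchtitan implementation) of
# building permuted indices with per-expert padding: every expert owns a
# block of slots rounded up to a multiple of `block` (minimum one block),
# unused slots are -1, and the output length is sum(m_sizes), not max_len.
import torch

def permute_indices_sketch(tokens_per_expert: torch.Tensor, block: int = 8):
    blocks = torch.clamp((tokens_per_expert + block - 1) // block, min=1)
    m_sizes = (blocks * block).to(torch.int32)
    out = torch.full((int(m_sizes.sum()),), -1, dtype=torch.int32)
    token, offset = 0, 0
    for count, size in zip(tokens_per_expert.tolist(), m_sizes.tolist()):
        out[offset:offset + count] = torch.arange(token, token + count, dtype=torch.int32)
        token += count
        offset += size
    return out, m_sizes

tokens = torch.tensor([3, 0, 5])              # tokens routed to three experts
permuted, m_sizes = permute_indices_sketch(tokens)
print(m_sizes)   # tensor([8, 8, 8], dtype=torch.int32); the empty expert keeps a full -1 block
print(permuted)  # indices 0..2 and 3..7 in their blocks, every other slot is -1
~~~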