Closed
Description
1752 / 16 = 109.5, so 1752 is not a multiple of 16, and the results are incorrect.
By contrast, 1760 / 16 = 110 and 1744 / 16 = 109, so both neighboring sizes are multiples of 16, and both run correctly.
This is probably an issue with the (16, 16) stmatrix store, which requires the stored extent to be a multiple of 16.
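The divisibility arithmetic can be double-checked with a minimal Python sketch. It assumes the relevant tile extent is 16, taken from the (16, 16) stmatrix shape mentioned in this issue. `STMATRIX_TILE`, `is_tile_aligned`, and `round_up_to_tile` are hypothetical names used only for illustration; they are not nvFuser APIs.

```python
# Assumption: 16 is the extent of the (16, 16) stmatrix tile named in this issue.
STMATRIX_TILE = 16


def is_tile_aligned(size: int, tile: int = STMATRIX_TILE) -> bool:
    """Return True when `size` is an exact multiple of `tile`."""
    return size % tile == 0


def round_up_to_tile(size: int, tile: int = STMATRIX_TILE) -> int:
    """Return the smallest multiple of `tile` that is >= `size`."""
    return -(-size // tile) * tile


# Report quotient and remainder for the three sizes discussed in the issue.
for size in (1744, 1752, 1760):
    quotient, remainder = divmod(size, STMATRIX_TILE)
    print(
        f"{size} = {quotient} * {STMATRIX_TILE} + {remainder}; "
        f"aligned: {is_tile_aligned(size)}"
    )
```

In this sketch, 1752 leaves a remainder of 8, while 1744 and 1760 divide evenly. `round_up_to_tile(1752)` only reports the next aligned size (1760); it is not meant as a proposed fix.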
To Reproduce:
NVFUSER_ENABLE=fuse_matmul NVFUSER_DISABLE=matmul_expr_eval python profile_matmul.py 1752 4720 584 NN --verbose --validate
Error Message:
===== Matmul Parameters ========
MMA macro: Hopper_64_256_16
CircularBufferOptions:
circular_buffer_smem_write: true
circular_buffer_smem_read: false
smem_circular_buffer_stage: 4
smem_circular_buffer_prefetch_gap: 1
SupportedVectorization:
a: 8
b: 8
epilogue: 8
MatMulTileOptions: warp tile [64, 256, 64], CTA tile [128, 256, 64]
Async global mem load: true
Indexing mode: int32_t
Tile rasterization order: column-major
Grid swizzle factor: 1
Tiling strategy: OneTilePerCTA
Buffering loop level: CTATiles
Circular buffering strategy: WarpSpecialized
__cluster_dims__(1, 1, 1)
Use shared memory epilogue: 1
Promote re-use of prologue shared memory: 1
Split-K factor: 1
====================================
Traceback (most recent call last):
  File "/opt/pytorch/nvfuser/doc/dev/python_scheduling/profile_matmul.py", line 209, in <module>
    main()
  File "/opt/pytorch/nvfuser/doc/dev/python_scheduling/profile_matmul.py", line 198, in main
    baseline_result, nvf_result = test_matmul_nvf(
                                  ^^^^^^^^^^^^^^^^
  File "/opt/pytorch/nvfuser/doc/dev/python_scheduling/profile_matmul.py", line 135, in test_matmul_nvf
    assert torch.allclose(
           ^^^^^^^^^^^^^^^
AssertionError