Incorrect results when problem size M is not divisible by 16. #3963

Open
rdspring1 opened this issue Feb 25, 2025 · 2 comments · May be fixed by #4100

rdspring1 commented Feb 25, 2025

1752 / 16 = 109.5, so M = 1752 is not a multiple of 16, and the results are incorrect. 1760 / 16 = 110 and 1744 / 16 = 109, so both are multiples of 16, and both run correctly.

This is probably an issue with the (16, 16) stmatrix store, given its multiple-of-16 requirement.
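The arithmetic above can be checked with a few lines of Python. This is only a sketch of the divisibility test; the constant `STMATRIX_TILE` and the helper `is_tile_aligned` are hypothetical names for illustration, and the value 16 comes from the (16, 16) stmatrix hypothesis rather than from any confirmed nvFuser constant.

```python
# Assumption: the stmatrix store works on (16, 16) tiles, as hypothesized above.
STMATRIX_TILE = 16


def is_tile_aligned(size: int, tile: int = STMATRIX_TILE) -> bool:
    """Return True when `size` is an exact multiple of `tile`."""
    return size % tile == 0


if __name__ == "__main__":
    # The three M values mentioned in this report.
    for m in (1744, 1752, 1760):
        quotient, remainder = divmod(m, STMATRIX_TILE)
        status = "aligned" if is_tile_aligned(m) else "NOT aligned"
        print(f"M={m}: {quotient} full tiles, remainder {remainder} ({status})")
```

Only M = 1752 leaves a remainder (8 rows), which matches the one failing size.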

To Reproduce:
NVFUSER_ENABLE=fuse_matmul NVFUSER_DISABLE=matmul_expr_eval python profile_matmul.py 1752 4720 584 NN --verbose --validate

Error Message:

===== Matmul Parameters ========

MMA macro: Hopper_64_256_16
CircularBufferOptions:
  circular_buffer_smem_write: true
  circular_buffer_smem_read: false
  smem_circular_buffer_stage: 4
  smem_circular_buffer_prefetch_gap: 1
SupportedVectorization:
  a: 8
  b: 8
  epilogue: 8
MatMulTileOptions: warp tile [64, 256, 64], CTA tile [128, 256, 64]
Async global mem load: true
Indexing mode: int32_t
Tile rasterization order: column-major
Grid swizzle factor: 1
Tiling strategy: OneTilePerCTA
Buffering loop level: CTATiles
Circular buffering strategy: WarpSpecialized
__cluster_dims__(1, 1, 1)
Use shared memory epilogue: 1
Promote re-use of prologue shared memory: 1
Split-K factor: 1
====================================

Traceback (most recent call last):
  File "/opt/pytorch/nvfuser/doc/dev/python_scheduling/profile_matmul.py", line 209, in <module>
    main()
  File "/opt/pytorch/nvfuser/doc/dev/python_scheduling/profile_matmul.py", line 198, in main
    baseline_result, nvf_result = test_matmul_nvf(
                                  ^^^^^^^^^^^^^^^^
  File "/opt/pytorch/nvfuser/doc/dev/python_scheduling/profile_matmul.py", line 135, in test_matmul_nvf
    assert torch.allclose(
           ^^^^^^^^^^^^^^^
AssertionError

rdspring1 self-assigned this Feb 25, 2025
rdspring1 changed the title from "Incorrect results when problem size M is not divisible by 32." to "Incorrect results when problem size M is not divisible by 16." Feb 25, 2025

jacobhinkle commented

For the time being, should we just disable the smem epilogue when N is not divisible by 16?

jacobhinkle commented

Why does the problem size affect the stmatrix? Our TMA stores should use the M,N size of the CTA tile, right? So in this case 128x256 (which will be chopped into 64x64 subtiles for the TMA). Still, I would think that stmatrix, which just fills in a 128x256 smem buffer, would not cause this. Then again, the TMA load should fill OOB with zeros (or NaN if misconfigured), so I would not expect TMA to cause the problem either.
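To make the tile sizes in this question concrete, here is a rough sketch of how the failing problem maps onto the 128x256 CTA tile from the parameter dump. It does not explain the failure; it only shows where a partial 16-row block would fall. The helper `edge_tile_extent` is a hypothetical name, the 16-element store granularity is the hypothesis from the report, and N = 4720 assumes the positional arguments of `profile_matmul.py` are M, N, K.

```python
def edge_tile_extent(size: int, cta_tile: int, store_tile: int = 16):
    """Describe the last (edge) CTA tile along one dimension.

    Returns (num_cta_tiles, valid_extent_in_last_tile,
             full_store_tiles_in_last_tile, leftover_elements).
    """
    num_cta_tiles = -(-size // cta_tile)  # ceiling division
    valid = size - (num_cta_tiles - 1) * cta_tile
    full, leftover = divmod(valid, store_tile)
    return num_cta_tiles, valid, full, leftover


if __name__ == "__main__":
    # CTA tile [128, 256, 64] from the Matmul Parameters dump.
    print("M:", edge_tile_extent(1752, 128))  # failing size
    print("M:", edge_tile_extent(1760, 128))  # passing size
    # Assumption: the second CLI argument (4720) is N.
    print("N:", edge_tile_extent(4720, 256))
```

Under these assumptions, the last CTA tile along M holds 88 valid rows for M = 1752, which is five full 16-row blocks plus 8 leftover rows. The passing size M = 1760 and the N dimension both end on a 16-element boundary.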

jacobhinkle added a commit that referenced this issue Mar 18, 2025

This just disables stmatrix when the problem size M or N is not a
multiple of 16. Note that we still use TMA for the epilogue when possible.

Fixes #3963. This is actually a workaround, but it might be the final fix
unless we have important perf to gain from a proper fix for
non-divisible sizes.
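
The condition described in this commit message can be sketched as a simple predicate. This is not the actual nvFuser change (which is not shown in this thread); `can_use_stmatrix` and its parameters are hypothetical names used only to illustrate the rule.

```python
def can_use_stmatrix(m: int, n: int, tile_m: int = 16, tile_n: int = 16) -> bool:
    """Hypothetical guard: allow stmatrix only for tile-aligned problem sizes.

    Mirrors the commit message: stmatrix is disabled unless both M and N
    are multiples of 16. The TMA epilogue decision is separate and is not
    modeled here.
    """
    return m % tile_m == 0 and n % tile_n == 0


if __name__ == "__main__":
    # Assumption: 4720 is the N dimension of the reproduction command.
    for m, n in ((1752, 4720), (1760, 4720)):
        choice = "stmatrix" if can_use_stmatrix(m, n) else "fallback store"
        print(f"M={m}, N={n}: {choice}")
```

With such a guard, the failing size (M = 1752) would take the non-stmatrix path, while aligned sizes keep the stmatrix store.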
@jacobhinkle jacobhinkle linked a pull request Mar 18, 2025 that will close this issue