@jataylo jataylo commented Jun 16, 2025

Ensure fused nodes that allocate buffers come before kernels that use those buffers.

In one example we observed:

  • op8 creates buf10 which mutates buf8
  • triton_poi_fused_index_put_lift_fresh_2 kernel tries to use buf8 and buf9
  • op6_op7_op16 (fused node) creates buf8 and buf9

But the standard topological sort didn't ensure that the fused node creating buf8 and buf9 came before the kernel using them.

After this PR, we identify that op8 performs a mutation on buf8, find the node responsible for creating that buffer (op6_op7_op16), and add an explicit dependency so that op8 now depends on op6_op7_op16. The graph is then ordered accordingly.
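The dependency fix described above can be sketched with a toy model. This is not the Inductor scheduler code: the `Node` class, its fields, and the helpers `add_allocation_deps` and `schedule` are hypothetical names made up for illustration, and the standard-library `graphlib.TopologicalSorter` stands in for the real ordering pass. Only the node and buffer names come from the example above.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Node:
    """Hypothetical stand-in for a scheduler node (not an Inductor class)."""
    name: str
    creates: set = field(default_factory=set)   # buffers this node allocates
    mutates: set = field(default_factory=set)   # buffers this node mutates
    deps: set = field(default_factory=set)      # names of nodes that must run first


def add_allocation_deps(nodes):
    """For every mutated buffer, depend on the node that creates it."""
    creator = {buf: n.name for n in nodes for buf in n.creates}
    for n in nodes:
        for buf in n.mutates:
            owner = creator.get(buf)
            if owner is not None and owner != n.name:
                n.deps.add(owner)


def schedule(nodes):
    """Return node names in an order that respects the recorded deps."""
    graph = {n.name: n.deps for n in nodes}
    return list(TopologicalSorter(graph).static_order())


# Nodes from the example: op8 mutates buf8, which the fused node creates.
nodes = [
    Node("op8", creates={"buf10"}, mutates={"buf8"}),
    Node("op6_op7_op16", creates={"buf8", "buf9"}),
]

add_allocation_deps(nodes)
order = schedule(nodes)
print(order)
```

Without the call to `add_allocation_deps`, neither node has a recorded dependency, so nothing in this toy graph forces the fused node to run first. With the extra edge, the only valid order places op6_op7_op16 before op8.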

Note that this issue is not seen in PT2.7, and it is not clear why. We will hold off on upstreaming this until we observe a similar issue on nightly.

Reproducer code (simplified from megatron)
https://gist.github.com/jataylo/10bedef08323441c588d2965ad963ae8

Execute with

torchrun --nproc_per_node 1 repro.py

Before PR

[rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 466, in __call__
[rank0]:     return self.current_callable(inputs)
[rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/utils.py", line 2128, in run
[rank0]:     return model(new_inputs)
[rank0]:   File "/tmp/torchinductor_root/gp/cgpe6weswyihhm442ugdhqxypbr7urxgk3adfr25onncik6tvthr.py", line 423, in call
[rank0]:     triton_poi_fused_index_put_lift_fresh_2.run(buf9, buf8, 256, grid=grid(256), stream=stream0)
[rank0]: UnboundLocalError: local variable 'buf9' referenced before assignment
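For readers unfamiliar with this error class, the snippet below is a minimal sketch of how a mis-ordered function body produces it. The function `call_wrong_order` is a hypothetical stand-in for the generated `call` function, not actual Inductor output: it reads a local buffer name before the statement that assigns it has run.

```python
def call_wrong_order():
    # The "kernel launch" reads buf9 first...
    result = buf9 + 1
    # ...but the "allocating node" only assigns it afterwards.
    buf9 = 0
    return result


error = None
try:
    call_wrong_order()
except UnboundLocalError as exc:
    # Python treats buf9 as a local because it is assigned in the function,
    # so reading it before that assignment raises UnboundLocalError.
    error = exc

print(type(error).__name__)
```

The exact wording of the message differs between Python versions, so the sketch only checks the exception type.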

Note that the simpler repro fails on both CUDA and ROCm and shows a logic issue across PT2.6. More details are in the gist.

rocm-repo-management-api bot commented Jun 16, 2025

Jenkins build for 031bef105e88333bdde283491951000086ed5722 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@jataylo jataylo marked this pull request as ready for review June 23, 2025 09:06

jataylo commented Jun 24, 2025

@jithunnair-amd

@jithunnair-amd jithunnair-amd merged commit 8b22352 into ROCm:release/2.6 Jun 27, 2025
1 of 6 checks passed
@jithunnair-amd jithunnair-amd changed the title from "[SWDEV-531526] [SWDEV-527340] Allocation of buffers ordered before compute" to "[release/2.6] [SWDEV-531526] [SWDEV-527340] Allocation of buffers ordered before compute" Jun 27, 2025