@amd-sriram
Collaborator

update the param_id calculation so that it works on both CPX and SPX modes

Motivation

These 4 tests failed on an MI308 machine because the machine had virtual GPUs:

test_learning_pipelining_without_interleaving 

test_learning_pipelining_with_interleaving 

test_learning_async_pipelining_without_interleaving 

test_learning_async_pipelining_with_interleaving 

Technical Details

On CPX, local device ids are not laid out in contiguous DP groups, so the computed param_id doesn't match the stage that initialized the module.

The solution is to use:

param_id = parallel_state.get_pipeline_model_parallel_rank() + vm_id * pipeline_model_parallel_world_size
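A minimal sketch of this indexing, written as a standalone function so it can be checked without a distributed launch. The helper name `compute_param_id` and the example sizes are illustrative assumptions, not part of apex; in the test itself the rank comes from `parallel_state.get_pipeline_model_parallel_rank()`. The id depends only on the pipeline rank and the virtual chunk index, so it is unaffected by how local device ids are laid out.

```python
def compute_param_id(pp_rank: int, vm_id: int, pp_world_size: int) -> int:
    """Return the id of the model chunk held by pipeline rank `pp_rank`
    for virtual chunk `vm_id` (hypothetical helper mirroring the formula above)."""
    if not 0 <= pp_rank < pp_world_size:
        raise ValueError("pp_rank must be in [0, pp_world_size)")
    if vm_id < 0:
        raise ValueError("vm_id must be non-negative")
    return pp_rank + vm_id * pp_world_size


if __name__ == "__main__":
    # Assumed example sizes: 4 pipeline stages, 2 virtual chunks per stage.
    pp_world_size = 4
    num_virtual_chunks = 2
    for vm_id in range(num_virtual_chunks):
        for pp_rank in range(pp_world_size):
            param_id = compute_param_id(pp_rank, vm_id, pp_world_size)
            print(f"vm_id={vm_id} pp_rank={pp_rank} -> param_id={param_id}")
```

With these sizes, every (rank, chunk) pair maps to a distinct id in the range 0 to 7, and without interleaving (`vm_id` fixed at 0) the id is just the pipeline rank.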

Test Plan

The individual tests were run with:

python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_async_pipelining_with_interleaving

python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_async_pipelining_without_interleaving

python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_pipelining_with_interleaving

python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_pipelining_without_interleaving

All the tests were run with:

python3 run_test.py --include run_transformer

Test Result

Tested with this Docker image:
registry-sc-harbor.amd.com/framework/compute-rocm-rel-7.0:24_ubuntu22.04_py3.10_pytorch_lw_rocm7.0_internal_testing_d36b5258

Attached log file
Run_Transformer_test.txt

Fixes : https://ontrack-internal.amd.com/browse/SWDEV-548434

@amd-sriram amd-sriram self-assigned this Aug 11, 2025
@amd-sriram amd-sriram merged commit 4b03581 into master Aug 11, 2025
@amd-sriram amd-sriram deleted the fix_transformer_cpx branch August 11, 2025 16:16
amd-sriram added a commit to ROCm/pytorch that referenced this pull request Aug 11, 2025
update the param_id calculation so that it works on both CPX and SPX modes - ROCm/apex#271
@amd-sriram
Collaborator Author

! cherry-pick release/1.8.0

amd-sriram added a commit to ROCm/pytorch that referenced this pull request Aug 12, 2025
Commit Messages:
- update the param_id calculation so that it works on both CPX and SPX modes (#271)

PRs:
- ROCm/apex#271

Fixes:
- https://example.com/issue-271
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Aug 12, 2025
update the param_id calculation so that it works on both CPX and SPX
modes - ROCm/apex#271

Fixes : https://ontrack-internal.amd.com/browse/SWDEV-548434