update the param_id calculation so that it works on both CPX and SPX modes #271
Motivation
These four tests failed on an MI308 machine because it had virtual GPUs.
Technical Details
On CPX, local device IDs are not laid out in contiguous DP groups, so the computed param_id does not match the stage that initialized the module.
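To make this concrete, here is a minimal, hypothetical sketch (the values and the standalone `param_id_for` helper are illustrative, not the actual apex code) of a rank-based param_id: because it is derived from the pipeline stage rank and virtual model index rather than the local device ID, it is stable no matter how device IDs are laid out on CPX or SPX.

```python
# Hypothetical illustration: 4 pipeline stages and 2 virtual model
# chunks (vm_id 0 and 1). A rank-based id depends only on the stage
# rank and vm_id, never on the local device id layout.

pipeline_model_parallel_world_size = 4  # assumed stage count for this sketch

def param_id_for(pipeline_rank: int, vm_id: int) -> int:
    """Unique id per (pipeline stage, virtual model chunk) pair."""
    return pipeline_rank + vm_id * pipeline_model_parallel_world_size

# Every (rank, vm_id) pair maps to a distinct, layout-independent id.
ids = {param_id_for(r, v) for r in range(4) for v in range(2)}
assert ids == set(range(8))
```

Because the mapping never consults device IDs, the same param_id is computed whether or not the DP groups are contiguous.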
The fix is to compute:
param_id = parallel_state.get_pipeline_model_parallel_rank() + vm_id * pipeline_model_parallel_world_size
Test Plan
The individual tests were run with:
python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_async_pipelining_with_interleaving
python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_async_pipelining_without_interleaving
python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_pipelining_with_interleaving
python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_pipelining_without_interleaving
And the full suite was run with:
python3 run_test.py --include run_transformer
Test Result
Tested with this Docker image:
registry-sc-harbor.amd.com/framework/compute-rocm-rel-7.0:24_ubuntu22.04_py3.10_pytorch_lw_rocm7.0_internal_testing_d36b5258
Attached log file
Run_Transformer_test.txt
Fixes: https://ontrack-internal.amd.com/browse/SWDEV-548434