@amd-sriram
Collaborator

update the param_id calculation so that it works on both CPX and SPX modes

Motivation

These four tests failed on an MI308 machine because the machine exposed virtual GPUs:

test_learning_pipelining_without_interleaving 

test_learning_pipelining_with_interleaving 

test_learning_async_pipelining_without_interleaving 

test_learning_async_pipelining_with_interleaving 

Technical Details

On CPX, local device ids are not laid out in contiguous DP groups, so the computed param_id does not match the stage that initialized the module.

The solution is to derive param_id from the pipeline-parallel rank instead:

param_id = parallel_state.get_pipeline_model_parallel_rank() + vm_id * pipeline_model_parallel_world_size
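
As a rough illustration of what this formula produces, here is a minimal standalone sketch. It does not call apex; `compute_param_id` is a hypothetical helper, `pp_rank` stands in for the value returned by `parallel_state.get_pipeline_model_parallel_rank()`, and `vm_id` is assumed to be the index of the virtual model chunk used by the interleaved schedule (the PR does not define it explicitly).

```python
def compute_param_id(pp_rank: int, vm_id: int, pp_world_size: int) -> int:
    """Hypothetical helper mirroring the formula in this PR.

    pp_rank:       pipeline-model-parallel rank of the current process
    vm_id:         virtual model chunk index (assumed meaning; 0 without interleaving)
    pp_world_size: number of pipeline stages
    """
    if not 0 <= pp_rank < pp_world_size:
        raise ValueError("pp_rank must be in [0, pp_world_size)")
    if vm_id < 0:
        raise ValueError("vm_id must be non-negative")
    # The id depends only on the pipeline rank and the chunk index,
    # not on how local device ids happen to be laid out on the node.
    return pp_rank + vm_id * pp_world_size


# Illustrative values only: 4 pipeline stages, 2 virtual chunks per stage.
pp_world_size = 4
num_chunks = 2
ids = [
    [compute_param_id(rank, vm, pp_world_size) for rank in range(pp_world_size)]
    for vm in range(num_chunks)
]
print(ids)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Because the id is built from the pipeline rank rather than from a device index, each (stage, chunk) pair gets a distinct id regardless of whether the node runs in CPX or SPX mode.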

Test Plan

The individual tests were run with these commands:

python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_async_pipelining_with_interleaving

python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_async_pipelining_without_interleaving

python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_pipelining_with_interleaving

python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_pipelining_without_interleaving

All the run_transformer tests were run with this command:

python3 run_test.py --include run_transformer

Test Result

Tested with this Docker image:
registry-sc-harbor.amd.com/framework/compute-rocm-rel-7.0:24_ubuntu22.04_py3.10_pytorch_lw_rocm7.0_internal_testing_d36b5258

Attached log file
Run_Transformer_test.txt

Fixes : https://ontrack-internal.amd.com/browse/SWDEV-548434

@amd-sriram amd-sriram self-assigned this Aug 12, 2025
@amd-sriram amd-sriram merged commit 3f26640 into release/1.8.0 Aug 12, 2025
@amd-sriram amd-sriram deleted the 4b03581_release_1.8.0 branch August 12, 2025 06:51
amd-sriram added a commit to ROCm/pytorch that referenced this pull request Aug 12, 2025
Commit Messages:
- update the param_id calculation so that it works on both CPX and SPX modes (#271) (#272)

PRs:
- ROCm/apex#272

Fixes:
- https://example.com/issue-271
- https://example.com/issue-272
amd-sriram added a commit to ROCm/pytorch that referenced this pull request Aug 12, 2025
Commit Messages:
- update the param_id calculation so that it works on both CPX and SPX modes (#271) (#272)

PRs:
- ROCm/apex#272

Fixes:
- https://example.com/issue-272
- https://example.com/issue-271
amd-sriram added a commit to ROCm/pytorch that referenced this pull request Aug 12, 2025
Commit Messages:
- update the param_id calculation so that it works on both CPX and SPX modes (#271) (#272)
- reset parameters for FusedDenseGeluDense similar to FusedDense to make the test_gelu pass (#269) (#270)

Co-authored-by: Sriram Kumar <skishore@amd.com>

PRs:
- ROCm/apex#272

Fixes:
- https://ontrack-internal.amd.com/browse/SWDEV-540029
- https://example.com/issue-272
- https://example.com/issue-271
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Aug 12, 2025
Commit Messages:
- update the param_id calculation so that it works on both CPX and SPX
modes (#271) (#272)

PRs:
- ROCm/apex#272

Fixes : https://ontrack-internal.amd.com/browse/SWDEV-548434
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Aug 12, 2025
Commit Messages:
- update the param_id calculation so that it works on both CPX and SPX
modes (#271) (#272)
- reset parameters for FusedDenseGeluDense similar to FusedDense to make
the test_gelu pass (#269) (#270)


PRs:
- ROCm/apex#272
- ROCm/apex#269

Fixes:
- https://ontrack-internal.amd.com/browse/SWDEV-540029
- https://ontrack-internal.amd.com/browse/SWDEV-548434
tvukovic-amd pushed a commit to ROCm/pytorch that referenced this pull request Aug 20, 2025
Commit Messages:
- update the param_id calculation so that it works on both CPX and SPX
modes (#271) (#272)
- reset parameters for FusedDenseGeluDense similar to FusedDense to make
the test_gelu pass (#269) (#270)


PRs:
- ROCm/apex#272
- ROCm/apex#269

Fixes:
- https://ontrack-internal.amd.com/browse/SWDEV-540029
- https://ontrack-internal.amd.com/browse/SWDEV-548434