update the param_id calculation so that it works on both CPX and SPX modes #271
Motivation
These four tests failed on an MI308 machine because it had virtual GPUs.
Technical Details
On CPX, local device IDs are not laid out in contiguous DP groups, so the computed param_id does not match the stage that initialized the module.
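To make this concrete, here is a minimal, hypothetical sketch (the values and the standalone `param_id_for` helper are illustrative, not the actual apex code) of a rank-based param_id: because it is derived from the pipeline stage rank and virtual model index rather than the local device ID, it is stable no matter how device IDs are laid out on CPX or SPX.

```python
# Hypothetical illustration: 4 pipeline stages and 2 virtual model
# chunks (vm_id 0 and 1). A rank-based id depends only on the stage
# rank and vm_id, never on the local device id layout.

pipeline_model_parallel_world_size = 4  # assumed stage count for this sketch

def param_id_for(pipeline_rank: int, vm_id: int) -> int:
    """Unique id per (pipeline stage, virtual model chunk) pair."""
    return pipeline_rank + vm_id * pipeline_model_parallel_world_size

# Every (rank, vm_id) pair maps to a distinct, layout-independent id.
ids = {param_id_for(r, v) for r in range(4) for v in range(2)}
assert ids == set(range(8))
```

Because the mapping never consults device IDs, the same param_id is computed whether or not the DP groups are contiguous.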
The fix is to compute:
param_id = parallel_state.get_pipeline_model_parallel_rank() + vm_id * pipeline_model_parallel_world_size
Test Plan
The individual tests were run with:
python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_async_pipelining_with_interleaving
python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_async_pipelining_without_interleaving
python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_pipelining_with_interleaving
python3 tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py -k test_learning_pipelining_without_interleaving
And the full suite was run with:
python3 run_test.py --include run_transformer
Test Result
Tested with this Docker image:
registry-sc-harbor.amd.com/framework/compute-rocm-rel-7.0:24_ubuntu22.04_py3.10_pytorch_lw_rocm7.0_internal_testing_d36b5258
Attached log file
Run_Transformer_test.txt
Fixes: https://ontrack-internal.amd.com/browse/SWDEV-548434