feat: drafted a solution to safely share the parallelization degrees across the config #412
Conversation
rrutmann
left a comment
We're not sure about the usefulness of a new component mesh_definition. All required information should already be present in device_mesh.
In addition, the benchmarking functionality is unrelated to the topic of this PR and should probably be moved to a separate PR.
```python
),
# Device mesh
ComponentEntity("device_mesh", "default", get_device_mesh, DeviceMeshConfig),
ComponentEntity("device_mesh", "parallel_degree", get_parallel_degree, ParallelDegreeConfig),
```
parallel_degree does not share the same interface as device_mesh. Should we use a different component type here?
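To illustrate the suggestion, here is a minimal, hypothetical sketch of a registry keyed by (component type, variant), where parallel_degree gets its own component type instead of being a variant of device_mesh. Only the names `ComponentEntity`, `get_device_mesh`, `get_parallel_degree`, `DeviceMeshConfig`, and `ParallelDegreeConfig` come from the diff above; the registry itself and all signatures are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

# Hypothetical entity/registry sketch; the real project structure may differ.
@dataclass
class ComponentEntity:
    component_type: str
    variant: str
    factory: Callable[..., Any]
    config_cls: type

class ComponentRegistry:
    def __init__(self) -> None:
        self._entities: Dict[Tuple[str, str], ComponentEntity] = {}

    def register(self, entity: ComponentEntity) -> None:
        self._entities[(entity.component_type, entity.variant)] = entity

    def get(self, component_type: str, variant: str) -> ComponentEntity:
        return self._entities[(component_type, variant)]

# Placeholder factories and configs, for illustration only.
class DeviceMeshConfig: ...
class ParallelDegreeConfig: ...
def get_device_mesh(**kwargs): ...
def get_parallel_degree(**kwargs): ...

registry = ComponentRegistry()
registry.register(ComponentEntity("device_mesh", "default", get_device_mesh, DeviceMeshConfig))
# Suggested change: register under a separate component type, since the
# interface differs from device_mesh, rather than as a device_mesh variant.
registry.register(ComponentEntity("parallel_degree", "default", get_parallel_degree, ParallelDegreeConfig))
```

Keeping one interface per component type means consumers of `device_mesh` never receive a `parallel_degree` entity with an incompatible shape.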
```python
if experiment_path.is_dir():
    present_files = list(experiment_path.iterdir())
    if len(present_files) == 1 and expected_config_file_path not in present_files:
        raise RunningEnvError(
```
Why don't we raise an error anymore?
Slurm might already have placed some error and stdout logs into the experiment folder locally, which would have triggered the exception in this case.
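The behaviour discussed here could be sketched as a check that tolerates pre-existing Slurm log files but still fails on genuinely unexpected content. This is a hypothetical helper, not the PR's actual code; the assumption that Slurm logs end in `.out`/`.err` is mine:

```python
from pathlib import Path

class RunningEnvError(Exception):
    """Raised when the experiment folder is in an unexpected state."""

def check_experiment_dir(experiment_path: Path, expected_config_file_path: Path) -> None:
    # Sketch: ignore Slurm stdout/stderr logs (assumed *.out / *.err here)
    # that may already exist, but raise on any other unexpected file.
    if not experiment_path.is_dir():
        return
    unexpected = [
        p for p in experiment_path.iterdir()
        if p != expected_config_file_path and p.suffix not in {".out", ".err"}
    ]
    if unexpected:
        raise RunningEnvError(f"Unexpected files in experiment folder: {unexpected}")
```

With this variant, a folder containing only Slurm logs passes the check, while stray files still abort the run.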
```yaml
local_rank: ${cuda_env:LOCAL_RANK}
global_rank: ${cuda_env:RANK}
world_size: ${cuda_env:WORLD_SIZE}
mesh_definition:
```
For what do we need a new component mesh_definition? Can't we just access the values defined in device_mesh across the config?
```diff
  device_type: cuda
  data_parallel_replicate_degree: 1
- data_parallel_shard_degree: ${settings.cuda_env.world_size} # i.e., fully sharded
+ data_parallel_shard_degree: ${settings.mesh_definition.dp_degree}
```
Since mesh_definition only contains the single entry dp_degree, does this mean that we no longer support data_parallel_replicate_degree?
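For illustration, here is a stdlib-only sketch of how a `${settings.mesh_definition.dp_degree}`-style reference could resolve against a nested config. The real config system presumably handles this (the `${...}` syntax resembles OmegaConf interpolation); the resolver below and the surrounding dict layout are assumptions, with only the key names taken from the snippets above:

```python
import re
from typing import Any, Dict

def resolve(ref: str, config: Dict[str, Any]) -> Any:
    # Resolve a single "${a.b.c}" interpolation against a nested dict.
    match = re.fullmatch(r"\$\{([\w.]+)\}", ref)
    if match is None:
        return ref  # plain value, no interpolation
    node: Any = config
    for key in match.group(1).split("."):
        node = node[key]
    return node

config = {
    "settings": {"mesh_definition": {"dp_degree": 8}},
    "device_mesh": {"data_parallel_shard_degree": "${settings.mesh_definition.dp_degree}"},
}

value = resolve(config["device_mesh"]["data_parallel_shard_degree"], config)
# value == 8: the shard degree is shared from the single dp_degree entry
```

This shows why the question matters: with only `dp_degree` available under `mesh_definition`, there is no key left for a replicate degree to interpolate from.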
rrutmann
left a comment
After removing mesh_definition and changing the component type of dp_degree, the PR looks good to us. @le1nux Please have a final look, then you can merge it.
All tests pass again.
What does this PR do?
This PR ..
General Changes
Breaking Changes
Checklist before submitting final PR
- Tests run (python tests/tests.py)
- Changelog updated (CHANGELOG_DEV.md)