
Conversation

@le1nux (Member) commented Oct 14, 2025

…across the config

What does this PR do?

This PR ..

General Changes

  • ..

Breaking Changes

  • ..

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@le1nux marked this pull request as ready for review October 23, 2025 17:19
@rrutmann (Collaborator) left a comment

We're not sure about the usefulness of a new mesh_definition component. All of the required information should already be present in device_mesh.

In addition, the benchmarking functionality is not related to the topic of this PR and should perhaps be moved to a separate PR.

),
# Device mesh
ComponentEntity("device_mesh", "default", get_device_mesh, DeviceMeshConfig),
ComponentEntity("device_mesh", "parallel_degree", get_parallel_degree, ParallelDegreeConfig),
Collaborator

parallel_degree does not share the same interface as device_mesh. Should we use a different component type here?
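One way this could look (a sketch only; get_parallel_degree, ParallelDegreeConfig, and the ComponentEntity signature are taken from the quoted snippet, while the key name parallel_degrees is a hypothetical choice) is to register the degree under its own component type instead of as a second variant of device_mesh:

# sketch: give the parallel degree its own component type (key names are illustrative)
ComponentEntity("device_mesh", "default", get_device_mesh, DeviceMeshConfig),
ComponentEntity("parallel_degrees", "default", get_parallel_degree, ParallelDegreeConfig),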

if experiment_path.is_dir():
    present_files = list(experiment_path.iterdir())
    if len(present_files) == 1 and expected_config_file_path not in present_files:
        raise RunningEnvError(
Collaborator

Why don't we raise an error anymore?

Member Author

Slurm might already have placed error and stdout logs in the experiment folder; in that case, the old check would have thrown an exception.
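A more tolerant variant of the check could, for instance, ignore typical Slurm log files and only fail on other unexpected content (a sketch only; the suffix set and the error message are assumptions, while experiment_path, expected_config_file_path, and RunningEnvError come from the snippet above):

# sketch: tolerate Slurm stdout/stderr logs, fail only on other unexpected files
slurm_log_suffixes = {".out", ".err", ".log"}  # assumed Slurm log naming
if experiment_path.is_dir():
    unexpected_files = [
        path
        for path in experiment_path.iterdir()
        if path != expected_config_file_path and path.suffix not in slurm_log_suffixes
    ]
    if unexpected_files:
        raise RunningEnvError(f"Experiment folder {experiment_path} already contains {unexpected_files}.")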

local_rank: ${cuda_env:LOCAL_RANK}
global_rank: ${cuda_env:RANK}
world_size: ${cuda_env:WORLD_SIZE}
mesh_definition:
Collaborator

What do we need the new mesh_definition component for? Can't we just access the values defined in device_mesh across the config?
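Reusing the values directly from device_mesh could look roughly like this (a sketch only; the interpolation syntax follows the quoted config, but the exact paths such as settings.device_mesh and the consuming component are assumptions):

device_mesh:
  device_type: cuda
  data_parallel_shard_degree: ${settings.cuda_env.world_size}
# elsewhere in the config, reference the value instead of a separate mesh_definition
some_component:
  config:
    dp_degree: ${settings.device_mesh.data_parallel_shard_degree}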

device_type: cuda
data_parallel_replicate_degree: 1
data_parallel_shard_degree: ${settings.cuda_env.world_size} # i.e., fully sharded
data_parallel_shard_degree: ${settings.mesh_definition.dp_degree}
Collaborator

Since mesh_definition only contains a single entry, dp_degree, does this mean that we no longer support data_parallel_replicate_degree?
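If the replicate degree should stay configurable, the degree block could expose both values and have device_mesh reference them (a sketch only; the key names dp_replicate_degree and dp_shard_degree are hypothetical, and the interpolation syntax follows the quoted config):

mesh_definition:
  dp_replicate_degree: 1  # hypothetical second entry
  dp_shard_degree: ${settings.cuda_env.world_size}  # i.e., fully sharded
device_mesh:
  data_parallel_replicate_degree: ${settings.mesh_definition.dp_replicate_degree}
  data_parallel_shard_degree: ${settings.mesh_definition.dp_shard_degree}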

@rrutmann (Collaborator) left a comment

After removing mesh_definition and changing the component type of dp_degree, the PR looks good to us. @le1nux, please have a final look; then you can merge it.

@le1nux (Member Author) commented Oct 26, 2025

All tests run through again.

@le1nux merged commit f2f5014 into main Oct 26, 2025
3 checks passed
@le1nux deleted the parallelization_degree_config_improvements branch October 26, 2025 10:10