@le1nux (Member) commented Oct 13, 2025

What does this PR do?

This PR fixes the MFU and throughput calculations by taking the data parallel (dp) degree into account instead of the world size. When parallelization strategies are used on top of FSDP, the world size differs from the data parallel degree, and this must be reflected in the throughput and MFU metric calculations, which is what this PR does.
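To illustrate the idea, here is a minimal, hypothetical sketch (not the code from this PR) of how the corrected metrics can use the dp degree for the token count while still aggregating peak FLOPs over the full world size; all names (dp_degree, peak_flops_per_device, num_params, etc.) are illustrative assumptions rather than identifiers from the Modalities code base:

```python
# Hypothetical sketch of the corrected metric logic; names are illustrative
# and not taken from the Modalities code base.

def throughput_and_mfu(
    local_batch_size: int,       # batch size per data parallel rank
    sequence_length: int,
    dp_degree: int,              # data parallel degree, not the world size
    world_size: int,             # total number of devices (dp * tp * pp * ...)
    num_params: int,
    step_time_s: float,
    peak_flops_per_device: float,
) -> tuple[float, float]:
    # Only data parallel ranks see distinct samples, so the number of tokens
    # processed per step scales with the dp degree, not with the world size.
    tokens_per_step = local_batch_size * sequence_length * dp_degree
    throughput = tokens_per_step / step_time_s  # tokens per second

    # Rough 6 * N * T estimate of the model FLOPs for one training step.
    model_flops = 6 * num_params * tokens_per_step

    # All devices contribute compute, so peak FLOPs are aggregated over the
    # full world size in the MFU denominator.
    mfu = model_flops / (step_time_s * peak_flops_per_device * world_size)
    return throughput, mfu
```

With tensor or pipeline parallelism on top of FSDP, using world_size instead of dp_degree in the token count would overcount the processed tokens and inflate both metrics.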

Breaking Changes

  • Configs need to be adapted to correctly use dp degree rather than world size.

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@therealdavidos (Collaborator) left a comment:

looks good!

@rrutmann (Collaborator) left a comment:

Looks good to me. However, I’m wondering if we’re duplicating information between mesh_definition and device_mesh.

@le1nux (Member Author) commented Oct 14, 2025

> Looks good to me. However, I’m wondering if we’re duplicating information between mesh_definition and device_mesh.

Now referencing the mesh definition.

@le1nux le1nux merged commit 9c40714 into benchmark_tooling Oct 14, 2025
3 checks passed
@le1nux le1nux deleted the metrics_dp_degree_fix branch October 14, 2025 09:12