simplify esm2 CP collator creation#1383

Closed
pstjohn wants to merge 1 commit into NVIDIA:main from pstjohn:pstjohn/esm2-cp-collator

pstjohn (Collaborator) commented Dec 16, 2025

Makes create_cp_dataloader essentially a thin wrapper around create_thd_dataloader: it modifies the collator and wraps the returned dataloader in place.
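For readers unfamiliar with the pattern, here is a minimal, torch-free sketch of "reuse the THD factory, then adjust the collator and wrap the loader". Every name in it (SimpleLoader, thd_collate, pad_for_cp, CPShardedLoader, create_thd_loader, create_cp_loader) is a hypothetical stand-in, not the code in this PR. The contiguous per-rank split is also an assumption made for brevity; a real context-parallel implementation may use a load-balanced split instead.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Sequence

Collate = Callable[[list[list[int]]], dict]


@dataclass
class SimpleLoader:
    """Hypothetical stand-in for a dataloader: batches `data`, then applies `collate_fn`."""

    data: Sequence[list[int]]
    batch_size: int
    collate_fn: Collate

    def __iter__(self) -> Iterator[dict]:
        for start in range(0, len(self.data), self.batch_size):
            yield self.collate_fn(list(self.data[start : start + self.batch_size]))


def thd_collate(samples: list[list[int]]) -> dict:
    """Pack samples into one flat token list plus cumulative sequence lengths."""
    cu_seqlens = [0]
    for sample in samples:
        cu_seqlens.append(cu_seqlens[-1] + len(sample))
    return {"input_ids": [tok for s in samples for tok in s], "cu_seqlens": cu_seqlens}


def create_thd_loader(data: Sequence[list[int]], batch_size: int) -> SimpleLoader:
    """Hypothetical base factory (stands in for the THD dataloader factory)."""
    return SimpleLoader(data, batch_size, thd_collate)


def pad_for_cp(collate_fn: Collate, cp_size: int, pad_id: int = 0) -> Collate:
    """Wrap a collator so the packed length is a multiple of `cp_size`."""

    def collate(samples: list[list[int]]) -> dict:
        batch = collate_fn(samples)
        pad = -len(batch["input_ids"]) % cp_size
        if pad:
            batch["input_ids"] = batch["input_ids"] + [pad_id] * pad
            # Assumption: record the padding as one extra pseudo-sequence.
            batch["cu_seqlens"] = batch["cu_seqlens"] + [len(batch["input_ids"])]
        return batch

    return collate


class CPShardedLoader:
    """Yield only this rank's contiguous slice of every packed batch."""

    def __init__(self, loader: SimpleLoader, cp_rank: int, cp_size: int):
        self.loader = loader
        self.cp_rank = cp_rank
        self.cp_size = cp_size

    def __iter__(self) -> Iterator[dict]:
        for batch in self.loader:
            chunk = len(batch["input_ids"]) // self.cp_size
            lo = self.cp_rank * chunk
            yield {**batch, "input_ids": batch["input_ids"][lo : lo + chunk]}


def create_cp_loader(data, batch_size: int, cp_rank: int, cp_size: int) -> CPShardedLoader:
    loader = create_thd_loader(data, batch_size)  # reuse the base factory
    loader.collate_fn = pad_for_cp(loader.collate_fn, cp_size)  # modify the collator
    return CPShardedLoader(loader, cp_rank, cp_size)  # wrap the loader
```

The point of the design is that all dataset and batching configuration lives in one factory, so the CP variant only adds the two CP-specific steps on top.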


def _get_group_local_rank(group: torch.distributed.ProcessGroup | None = None) -> int:
    """Rank of the current process within `group`."""
    if group is None:
Collaborator (review comment):

Isn't this just the following?

cp_rank = device_mesh.get_local_rank("cp")
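To make the suggestion concrete: PyTorch's DeviceMesh.get_local_rank(mesh_dim) returns the calling process's index along the named mesh dimension. The pure-Python sketch below reproduces that arithmetic without needing a distributed launch. It assumes the mesh is a row-major arrangement of global ranks 0..N-1 and that the dimensions are named ("dp", "cp"); both the helper names and that layout are illustrative assumptions, not taken from this PR.

```python
from math import prod


def mesh_coordinates(global_rank: int, mesh_shape: tuple[int, ...]) -> tuple[int, ...]:
    """Coordinates of `global_rank` in a row-major mesh of the given shape."""
    if not 0 <= global_rank < prod(mesh_shape):
        raise ValueError(f"rank {global_rank} is outside a mesh of shape {mesh_shape}")
    coords = []
    # Peel off the fastest-varying (last) dimension first.
    for size in reversed(mesh_shape):
        global_rank, coord = divmod(global_rank, size)
        coords.append(coord)
    return tuple(reversed(coords))


def local_rank(
    global_rank: int,
    mesh_shape: tuple[int, ...],
    dim_names: tuple[str, ...],
    dim: str,
) -> int:
    """Index of `global_rank` along the mesh dimension called `dim`."""
    return mesh_coordinates(global_rank, mesh_shape)[dim_names.index(dim)]


# 8 processes arranged as 2 data-parallel groups of 4 context-parallel ranks.
cp_rank = local_rank(5, (2, 4), ("dp", "cp"), "cp")
```

With the mesh carrying this information, a separate helper that resolves the rank from a process group becomes unnecessary.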

"""
        self.dataloader = dataloader
-       self.cp_rank = cp_rank
+       self.cp_rank = _get_group_local_rank(cp_group) if cp_rank is None else cp_rank
Collaborator (review comment):

This should just be:

cp_rank = device_mesh.get_local_rank("cp")
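One way the reviewer's suggestion could look in the wrapper's constructor is sketched below. The class names (MeshLike, FakeMesh, CPAwareLoader) are hypothetical and the wrapper is reduced to rank handling only. The only thing it relies on is that the mesh object exposes get_local_rank(mesh_dim), which a real DeviceMesh does; FakeMesh is a test double so the sketch runs on a single CPU process.

```python
from typing import Iterable, Iterator, Protocol


class MeshLike(Protocol):
    """Anything exposing `get_local_rank(mesh_dim)`, e.g. a DeviceMesh."""

    def get_local_rank(self, mesh_dim: str) -> int: ...


class FakeMesh:
    """Hypothetical test double that returns fixed per-dimension ranks."""

    def __init__(self, coords: dict[str, int]):
        self._coords = coords

    def get_local_rank(self, mesh_dim: str) -> int:
        return self._coords[mesh_dim]


class CPAwareLoader:
    """Hypothetical wrapper that derives its CP rank from the mesh."""

    def __init__(self, dataloader: Iterable, device_mesh: MeshLike):
        self.dataloader = dataloader
        # No optional cp_rank argument and no process-group fallback:
        # the mesh is the single source of truth for the rank.
        self.cp_rank = device_mesh.get_local_rank("cp")

    def __iter__(self) -> Iterator:
        return iter(self.dataloader)


loader = CPAwareLoader([{"input_ids": [1, 2]}], FakeMesh({"dp": 0, "cp": 3}))
```

Taking the mesh as the only input removes the two-way branch in the original line and keeps the rank lookup in one place.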

pstjohn force-pushed the pstjohn/esm2-cp-collator branch from aa43e98 to dd04039 on December 22, 2025 16:43
Signed-off-by: Peter St. John <pstjohn@nvidia.com>

pass device mesh rather than process_group to cp dataset

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

fix train_ddp_cp

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

add test for cp_dataloader that doesn't require datacenter hardware

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

remove logger level with hydra

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

fix cp test api

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

bump fp8 test tolerance

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
pstjohn force-pushed the pstjohn/esm2-cp-collator branch from ea2fae7 to d10fb25 on December 22, 2025 19:00

pstjohn (Collaborator, Author) commented Dec 23, 2025

Squashed this into #1382.

pstjohn closed this on Dec 23, 2025