@awaelchli awaelchli commented Jul 22, 2024

Fixes #251

Fixes:

I validated this fix using Multi-Node training in Lightning AI.

@awaelchli awaelchli changed the title WIP: Fix index error when resuming dataset on world size > 0 WIP: Fix index errors on world size > 0 Jul 22, 2024
chunk_indexes_per_nodes: Any = [[] for _ in range(distributed_env.num_nodes)]
process_per_node = distributed_env.world_size // distributed_env.num_nodes
for rank, chunks_per_rank in enumerate(chunks_per_ranks):
    chunk_indexes_per_nodes[0 if distributed_env.num_nodes == 1 else rank // process_per_node].extend(

Since chunks are now associated by worker and not rank (#237), the grouping by node here was still wrong. I extracted this grouping to a new function so I can more easily test it.
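As a rough illustration of the grouping described here, the following is a minimal sketch and not the function added in this PR. The name `group_chunks_by_node` is hypothetical, and it assumes the per-worker lists are ordered so that all workers of one node are contiguous and every node has the same number of workers.

```python
from typing import List


def group_chunks_by_node(chunks_per_worker: List[List[int]], num_nodes: int) -> List[List[int]]:
    """Hypothetical helper: merge per-worker chunk lists into one list per node.

    Assumes workers are ordered node by node and are evenly split across nodes.
    """
    if num_nodes < 1 or len(chunks_per_worker) % num_nodes != 0:
        raise ValueError("the number of workers must be a positive multiple of num_nodes")

    workers_per_node = len(chunks_per_worker) // num_nodes
    chunks_per_node: List[List[int]] = [[] for _ in range(num_nodes)]
    for worker_idx, chunks in enumerate(chunks_per_worker):
        # Integer division maps each contiguous block of workers to its node.
        chunks_per_node[worker_idx // workers_per_node].extend(chunks)
    return chunks_per_node
```

Keeping the grouping in a standalone function like this makes it possible to test it with plain lists, without setting up a distributed environment.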


chunks_per_ranks = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15]]
shuffled_indexes = _intra_node_chunk_shuffle(_DistributedEnv(8, 7, 2), chunks_per_ranks, 42, 2)
assert shuffled_indexes == [5, 2, 0, 7, 6, 1, 3, 4, 13, 10, 8, 15, 14, 9, 11, 12]

Previously, the test asserted only the final output list, which by itself is not very meaningful (apart from checking that it doesn't change from one version to the next).

I extended the test to show the other important properties:

  1. The shuffle is consistent across all ranks
  2. The shuffle is different from one epoch to the next.
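The two properties can be illustrated with a small standalone sketch. This is not the `_intra_node_chunk_shuffle` implementation; `shuffle_within_nodes` is a hypothetical stand-in that assumes the random generator is seeded with `seed + epoch` and never with the rank.

```python
import random
from typing import List


def shuffle_within_nodes(chunks_per_node: List[List[int]], seed: int, epoch: int) -> List[int]:
    """Hypothetical stand-in: shuffle each node's chunks and concatenate the results."""
    # The seed depends on the epoch but not on the rank, so every rank
    # that calls this with the same inputs computes the same order.
    rng = random.Random(seed + epoch)
    shuffled_indexes: List[int] = []
    for node_chunks in chunks_per_node:
        node_copy = list(node_chunks)  # leave the caller's list untouched
        rng.shuffle(node_copy)
        shuffled_indexes.extend(node_copy)
    return shuffled_indexes


chunks_per_node = [list(range(0, 8)), list(range(8, 16))]

# Property 1: two "ranks" calling with the same seed and epoch agree.
order_on_rank_a = shuffle_within_nodes(chunks_per_node, seed=42, epoch=0)
order_on_rank_b = shuffle_within_nodes(chunks_per_node, seed=42, epoch=0)

# Property 2: the order varies over epochs.
orders_over_epochs = {tuple(shuffle_within_nodes(chunks_per_node, 42, e)) for e in range(5)}
```

Because each node's chunks are shuffled separately, the first eight entries of the result are always a permutation of node 0's chunks, matching the shape of the expected list in the test above.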

@awaelchli awaelchli changed the title WIP: Fix index errors on world size > 0 Fix index errors on world size > 0 Jul 22, 2024
@awaelchli awaelchli marked this pull request as ready for review July 22, 2024 12:02
@awaelchli awaelchli requested a review from tchaton as a code owner July 22, 2024 12:02
@tchaton tchaton left a comment

Looks great!

@awaelchli awaelchli merged commit 25c9df3 into main Jul 22, 2024
@awaelchli awaelchli deleted the bugfix/resume-index-error branch July 22, 2024 12:25

Successfully merging this pull request may close these issues.

Index error in replay chunks on world size > 0