-
Notifications
You must be signed in to change notification settings - Fork 80
Fix index errors on world size > 0 #252
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
for more information, see https://pre-commit.ci
…ing-ai/litdata into bugfix/resume-index-error
for more information, see https://pre-commit.ci
| chunk_indexes_per_nodes: Any = [[] for _ in range(distributed_env.num_nodes)] | ||
| process_per_node = distributed_env.world_size // distributed_env.num_nodes | ||
| for rank, chunks_per_rank in enumerate(chunks_per_ranks): | ||
| chunk_indexes_per_nodes[0 if distributed_env.num_nodes == 1 else rank // process_per_node].extend( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since chunks are now associated by worker and not rank (#237), the grouping by node here was still wrong. I extracted this grouping to a new function so I can more easily test it.
|
|
||
| chunks_per_ranks = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15]] | ||
| shuffled_indexes = _intra_node_chunk_shuffle(_DistributedEnv(8, 7, 2), chunks_per_ranks, 42, 2) | ||
| assert shuffled_indexes == [5, 2, 0, 7, 6, 1, 3, 4, 13, 10, 8, 15, 14, 9, 11, 12] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, the test was asserting the final output list which by itself is not very meaningful (apart from asserting it doesn't change from one version to the next).
I extended the test to show the other important properities:
- The shuffle is consistent across all ranks
- The shuffle is different from one epoch to the next.
tchaton
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great !
Fixes #251
Fixes:
I validated this fix using Multi-Node training in Lightning AI.