
[ENHANCEMENT] Tools for distributed data preprocessing #492

Open
adammoody opened this issue Sep 9, 2023 · 1 comment
Labels
stale No activity in 60 days on issue or PR

adammoody commented Sep 9, 2023

For the BigScience effort, I developed preprocess_data_dist.py to parallelize the preprocessing of large datasets. For example, with 32 compute nodes it reduces the time to preprocess the 1 TiB OSCAR dataset from days to an hour, while producing indexed_dataset files identical to those created by the preprocess_data.py script. An overview of the usage is written up in the "distributed data preprocessing" subsection here:

https://github.com/adammoody/Megatron-DeepSpeed-1/tree/distdata#data-preprocessing-distributed

A pointer to the actual script:

https://github.com/adammoody/Megatron-DeepSpeed-1/blob/distdata/tools/preprocess_data_dist.py
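
To make the sharding idea concrete, here is a minimal sketch of the pattern, not the actual script: each rank claims every world-th line of the input and writes its own shard. The tokenize() helper, the file names, and the "text" field are placeholders I invented for illustration.

```python
import json
import torch.distributed as dist

def tokenize(text):
    # Placeholder tokenizer; a real run would call the Megatron tokenizer.
    return text.split()

def main():
    # Assumes launch via torchrun, which sets the rendezvous env vars.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world = dist.get_world_size()

    # Each rank streams the whole file but processes only every
    # world-th line, writing the result to its own shard file.
    with open("dataset.jsonl") as fin, open(f"shard_{rank}.txt", "w") as fout:
        for i, line in enumerate(fin):
            if i % world != rank:
                continue
            doc = json.loads(line)
            fout.write(" ".join(tokenize(doc["text"])) + "\n")

    dist.barrier()  # ensure all shards are complete before any merge step

if __name__ == "__main__":
    main()
```

Launched with, e.g., torchrun --nproc_per_node=8 shard_sketch.py, each rank works independently; the real script additionally writes the Megatron indexed_dataset format and handles merging the shards.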

There is a related, simpler script that merges multiple indexed_dataset files into a single file. This is handy if one has preprocessed a large dataset in pieces using some other method and wants to join those pieces efficiently into a single dataset:

https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tools/merge_preprocessed_data.py
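
The core merge idea, in a hedged sketch: concatenate the binary payloads of the shards and rebase each shard's offset index into the merged file. The file names and the flat .idx.npy offset layout here are assumptions for illustration; the real indexed_dataset format is more involved.

```python
import shutil
import numpy as np

def merge(shard_prefixes, out_prefix):
    # Append each shard's payload bytes, and shift its offsets so they
    # point into the merged payload rather than the original shard.
    base = 0
    merged = []
    with open(out_prefix + ".bin", "wb") as out_bin:
        for prefix in shard_prefixes:
            with open(prefix + ".bin", "rb") as f:
                shutil.copyfileobj(f, out_bin)
            offsets = np.load(prefix + ".idx.npy")
            merged.append(offsets + base)
            base = out_bin.tell()
    np.save(out_prefix + ".idx", np.concatenate(merged))

merge(["shard_0", "shard_1"], "merged")
```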

Additionally, I have an indexed_json.py module whose class builds an index for a JSONL file, allowing random access to individual lines. It uses multiple processes to read different regions of the source file, collectively scanning for and recording the locations of the newline characters that delimit its lines:

https://github.com/adammoody/Megatron-DeepSpeed-1/blob/distdata/tools/indexed_json.py
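
The collective scan it describes can be sketched roughly as follows, here with mpi4py; the function names are mine, not the actual indexed_json.py API, and the sketch assumes the file ends with a trailing newline.

```python
import os
import numpy as np
from mpi4py import MPI

def build_line_index(path):
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Split the file into one contiguous byte range per rank.
    total = os.path.getsize(path)
    start = rank * total // size
    end = (rank + 1) * total // size

    # Scan this rank's region for newline positions.
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    local, pos = [], data.find(b"\n")
    while pos != -1:
        local.append(start + pos)
        pos = data.find(b"\n", pos + 1)

    # Gather everyone's newline offsets; line i starts one byte after
    # newline i-1, and line 0 starts at offset 0.
    newlines = np.array(sorted(sum(comm.allgather(local), [])), dtype=np.int64)
    return np.concatenate(([0], newlines[:-1] + 1))

starts = build_line_index("dataset.jsonl")
# Random access: seek straight to line 12345 instead of scanning.
with open("dataset.jsonl", "rb") as f:
    f.seek(starts[12345])
    print(f.readline().decode("utf-8"))
```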

All three of these rely on a new DistData class that abstracts a number of collective operations helpful for accessing shared files. The version contributed to BigScience uses only torch.distributed, though I also have a version that can optionally use mpi4py. I personally prefer mpi4py, which is easier to launch on my system and a bit more robust.

https://github.com/adammoody/Megatron-DeepSpeed-1/blob/distdata/tools/distdata.py
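
As a rough picture of the abstraction, one small class can expose the same collectives over either backend, chosen at construction time. The method names and constructor argument below are my invention, not the actual DistData API.

```python
class DistData:
    """Sketch of a backend-agnostic collectives wrapper (names invented)."""

    def __init__(self, backend="torch"):
        self.backend = backend
        if backend == "mpi":
            from mpi4py import MPI
            self.comm = MPI.COMM_WORLD
            self.rank = self.comm.Get_rank()
        else:
            import torch.distributed as dist
            dist.init_process_group(backend="gloo")
            self.dist = dist
            self.rank = dist.get_rank()

    def bcast(self, obj, root=0):
        # Broadcast any picklable object from root to all ranks.
        if self.backend == "mpi":
            return self.comm.bcast(obj, root=root)
        lst = [obj if self.rank == root else None]
        self.dist.broadcast_object_list(lst, src=root)
        return lst[0]

    def barrier(self):
        if self.backend == "mpi":
            self.comm.Barrier()
        else:
            self.dist.barrier()
```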

These scripts are particularly useful on systems with multiple compute nodes connected to a high-performance parallel file system. Since people running Megatron-LM likely have access to such resources, I expect they could benefit other Megatron-LM users as well.

The indexed_dataset file format has been updated recently, and the above work needs to be refreshed to match the new format.

Would this be of interest?


github-actions bot commented Nov 8, 2023

Marking as stale. No activity in 60 days.

github-actions bot added the stale label on Nov 8, 2023