
[ENHANCEMENT] Tools for distributed data preprocessing #492

Open
adammoody opened this issue Sep 9, 2023 · 1 comment
Labels
stale No activity in 60 days on issue or PR

adammoody commented Sep 9, 2023

For the BigScience effort, I developed preprocess_data_dist.py to parallelize the preprocessing of large datasets. For example, with 32 compute nodes it reduces the time to preprocess the 1 TiB OSCAR dataset from days to an hour, while producing indexed_dataset files identical to those created by the preprocess_data.py script. An overview of the usage is written up in the "distributed data preprocessing" subsection here:

https://github.com/adammoody/Megatron-DeepSpeed-1/tree/distdata#data-preprocessing-distributed

A pointer to the actual script:

https://github.com/adammoody/Megatron-DeepSpeed-1/blob/distdata/tools/preprocess_data_dist.py
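
To make the sharding idea concrete, here is a minimal sketch of the pattern, not the actual script: each rank claims every world-th line of the input and writes its own shard. The tokenize() helper, the file names, and the "text" field are placeholders I invented for illustration.

```python
import json
import torch.distributed as dist

def tokenize(text):
    # Placeholder tokenizer; a real run would call the Megatron tokenizer.
    return text.split()

def main():
    # Assumes launch via torchrun, which sets the rendezvous env vars.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world = dist.get_world_size()

    # Each rank streams the whole file but processes only every
    # world-th line, writing the result to its own shard file.
    with open("dataset.jsonl") as fin, open(f"shard_{rank}.txt", "w") as fout:
        for i, line in enumerate(fin):
            if i % world != rank:
                continue
            doc = json.loads(line)
            fout.write(" ".join(tokenize(doc["text"])) + "\n")

    dist.barrier()  # ensure all shards are complete before any merge step

if __name__ == "__main__":
    main()
```

Launched with, e.g., torchrun --nproc_per_node=8 shard_sketch.py, each rank works independently; the real script additionally writes the Megatron indexed_dataset format and handles merging the shards.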

There is a related, simpler script that merges multiple indexed_dataset files into a single file. This is handy if one has preprocessed a large dataset in pieces using some other method and wants to join those pieces efficiently into a single dataset:

https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tools/merge_preprocessed_data.py
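
The core merge idea, in a hedged sketch: concatenate the binary payloads of the shards and rebase each shard's offset index into the merged file. The file names and the flat .idx.npy offset layout here are assumptions for illustration; the real indexed_dataset format is more involved.

```python
import shutil
import numpy as np

def merge(shard_prefixes, out_prefix):
    # Append each shard's payload bytes, and shift its offsets so they
    # point into the merged payload rather than the original shard.
    base = 0
    merged = []
    with open(out_prefix + ".bin", "wb") as out_bin:
        for prefix in shard_prefixes:
            with open(prefix + ".bin", "rb") as f:
                shutil.copyfileobj(f, out_bin)
            offsets = np.load(prefix + ".idx.npy")
            merged.append(offsets + base)
            base = out_bin.tell()
    np.save(out_prefix + ".idx", np.concatenate(merged))

merge(["shard_0", "shard_1"], "merged")
```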

Additionally, I have an indexed_json.py module whose class builds an index for a JSONL file, allowing random access to individual lines. It uses multiple processes to read different regions of the source file, collectively scanning for and recording the locations of the newline characters that delimit its lines:

https://github.com/adammoody/Megatron-DeepSpeed-1/blob/distdata/tools/indexed_json.py
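
The collective scan it describes can be sketched roughly as follows, here with mpi4py; the function names are mine, not the actual indexed_json.py API, and the sketch assumes the file ends with a trailing newline.

```python
import os
import numpy as np
from mpi4py import MPI

def build_line_index(path):
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Split the file into one contiguous byte range per rank.
    total = os.path.getsize(path)
    start = rank * total // size
    end = (rank + 1) * total // size

    # Scan this rank's region for newline positions.
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    local, pos = [], data.find(b"\n")
    while pos != -1:
        local.append(start + pos)
        pos = data.find(b"\n", pos + 1)

    # Gather everyone's newline offsets; line i starts one byte after
    # newline i-1, and line 0 starts at offset 0.
    newlines = np.array(sorted(sum(comm.allgather(local), [])), dtype=np.int64)
    return np.concatenate(([0], newlines[:-1] + 1))

starts = build_line_index("dataset.jsonl")
# Random access: seek straight to line 12345 instead of scanning.
with open("dataset.jsonl", "rb") as f:
    f.seek(starts[12345])
    print(f.readline().decode("utf-8"))
```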

All three of these rely on a new DistData class that abstracts a number of collective operations helpful for accessing shared files. The version contributed to BigScience uses only torch.distributed, though I also have a version that can optionally use mpi4py. I personally prefer mpi4py, which is easier to launch on my system and a bit more robust.

https://github.com/adammoody/Megatron-DeepSpeed-1/blob/distdata/tools/distdata.py
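
As a rough picture of the abstraction, one small class can expose the same collectives over either backend, chosen at construction time. The method names and constructor argument below are my invention, not the actual DistData API.

```python
class DistData:
    """Sketch of a backend-agnostic collectives wrapper (names invented)."""

    def __init__(self, backend="torch"):
        self.backend = backend
        if backend == "mpi":
            from mpi4py import MPI
            self.comm = MPI.COMM_WORLD
            self.rank = self.comm.Get_rank()
        else:
            import torch.distributed as dist
            dist.init_process_group(backend="gloo")
            self.dist = dist
            self.rank = dist.get_rank()

    def bcast(self, obj, root=0):
        # Broadcast any picklable object from root to all ranks.
        if self.backend == "mpi":
            return self.comm.bcast(obj, root=root)
        lst = [obj if self.rank == root else None]
        self.dist.broadcast_object_list(lst, src=root)
        return lst[0]

    def barrier(self):
        if self.backend == "mpi":
            self.comm.Barrier()
        else:
            self.dist.barrier()
```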

These scripts are particularly useful on systems with multiple compute nodes connected to a high-performance parallel file system. Since people running Megatron-LM likely have access to such resources, I expect they could benefit other Megatron-LM users as well.

The indexed_dataset file format has been updated recently, and the above work needs to be refreshed to match the new format.

Would this be of interest?


github-actions bot commented Nov 8, 2023

Marking as stale. No activity in 60 days.

github-actions bot added the stale label on Nov 8, 2023