For the BigScience effort, I developed `preprocess_data_dist.py` to parallelize data preprocessing of large datasets. As an example, using 32 compute nodes reduces the time it takes to preprocess the 1 TiB OSCAR dataset from days to about an hour. It produces `indexed_dataset` files that are identical to those created by the `preprocess_data.py` script. An overview of the usage is written up in the "distributed data preprocessing" subsection here: https://github.com/adammoody/Megatron-DeepSpeed-1/tree/distdata#data-preprocessing-distributed

A pointer to the actual script: https://github.com/adammoody/Megatron-DeepSpeed-1/blob/distdata/tools/preprocess_data_dist.py
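To illustrate the general pattern (not the actual implementation), here is a minimal sketch of how ranks can carve a JSONL file into disjoint byte ranges and tokenize their own slice in parallel. The `tokenize` callback and the plain-text output format are placeholders; the real script writes `indexed_dataset` files, and the rank/size plumbing assumes `torch.distributed` has been initialized:

```python
import json
import os
import torch.distributed as dist

def process_shard(input_path, output_prefix, tokenize):
    """Tokenize a disjoint byte range of a JSONL file on each rank.

    `tokenize` and the output format are placeholders; the real script
    writes indexed_dataset files instead of plain token lines.
    """
    rank, size = dist.get_rank(), dist.get_world_size()
    file_size = os.path.getsize(input_path)
    start = rank * file_size // size
    end = (rank + 1) * file_size // size

    with open(input_path, "rb") as f:
        if start > 0:
            f.seek(start - 1)
            f.readline()  # advance to the next line boundary
        with open(f"{output_prefix}.part{rank}.txt", "w") as out:
            # Each line is processed by the rank that owns its first byte.
            while f.tell() < end:
                line = f.readline()
                if not line:
                    break
                tokens = tokenize(json.loads(line)["text"])
                out.write(" ".join(map(str, tokens)) + "\n")
```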
There is a related, simpler script that merges multiple `indexed_dataset` files into a single file. This is handy if one has preprocessed a large dataset in pieces using some other method and wants to efficiently join those into a single dataset: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tools/merge_preprocessed_data.py
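The core of the merge is just a loop over the parts. A minimal sketch, assuming the older Megatron `indexed_dataset` builder API (`MMapIndexedDatasetBuilder` with `merge_file_`); exact module paths and signatures may differ across versions:

```python
import numpy as np
from megatron.data import indexed_dataset

def merge_parts(part_prefixes, merged_prefix):
    """Append each partial dataset to a single merged dataset."""
    builder = indexed_dataset.MMapIndexedDatasetBuilder(
        merged_prefix + ".bin", dtype=np.int32
    )
    for prefix in part_prefixes:
        # merge_file_ appends one part's .bin data and index entries
        builder.merge_file_(prefix)
    builder.finalize(merged_prefix + ".idx")
```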
Additionally, I have an `indexed_json.py` class that creates an index for JSONL files, allowing random access to different lines in the file. It uses multiple processes to read different regions of the source file and collectively scans and records the locations of the newline characters that delimit lines: https://github.com/adammoody/Megatron-DeepSpeed-1/blob/distdata/tools/indexed_json.py
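The underlying idea in single-process form: record the starting byte offset of every line once, then any line can be fetched with a single seek. The real `indexed_json.py` splits the newline scan across processes and persists the offsets to an index file; this sketch just shows the data structure:

```python
import json

class SimpleJsonlIndex:
    """Single-process sketch of a newline-offset index for JSONL."""

    def __init__(self, path):
        self.path = path
        self.offsets = [0]  # offsets[i] = starting byte of line i
        with open(path, "rb") as f:
            for line in f:
                self.offsets.append(self.offsets[-1] + len(line))

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        # One seek plus one bounded read per random access.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[i])
            length = self.offsets[i + 1] - self.offsets[i]
            return json.loads(f.read(length))
```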
All three of these rely on a new `DistData` class that abstracts a number of collective operations helpful for accessing shared files. The version contributed to BigScience only uses `torch.distributed`, though I also have a version that can optionally use mpi4py. I personally prefer mpi4py, which is easier to launch on my system and a bit more robust: https://github.com/adammoody/Megatron-DeepSpeed-1/blob/distdata/tools/distdata.py
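As one example of the kind of collective such a class wraps, here is a sketch of an exclusive prefix sum over per-rank byte counts, which lets every rank compute its own starting offset before writing into a shared output file. This is built on `torch.distributed.all_gather` and is only illustrative of the pattern, not the actual `DistData` interface:

```python
import torch
import torch.distributed as dist

def exscan_offsets(local_bytes):
    """Return this rank's starting byte offset in a shared file,
    computed as an exclusive prefix sum of per-rank sizes."""
    size = dist.get_world_size()
    counts = [torch.zeros(1, dtype=torch.long) for _ in range(size)]
    dist.all_gather(counts, torch.tensor([local_bytes], dtype=torch.long))
    rank = dist.get_rank()
    return int(sum(c.item() for c in counts[:rank]))
```

With the offsets in hand, each rank can seek to its own position and write its portion of the file concurrently without coordination.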
These scripts are particularly useful for those who have access to systems with multiple compute nodes that are connected to a high-performance parallel file system. Since people running Megatron-LM likely have access to such resources, I expect this could be useful to other Megatron-LM users.
The `indexed_dataset` file format has been updated recently, and the above work needs to be refreshed to match the new file format.
Would this be of interest?