Ordering of deduplicated datasets? #76

zplizzi · 2023-03-17T18:48:54Z

I just wanted to confirm: do the 3 versions you have of the deduplicated data have the same data ordering?

https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps (tokenized + sharded)
https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated (jsonl)
https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile (raw, parquet)

I was hoping to use the jsonl one but wanted to ensure it will accurately replicate the data ordering in your tokenized dataset that you used for training.


haileyschoelkopf · 2023-03-20T14:55:25Z

Hi! The latter two datasets have the exact same ordering of data points.
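For anyone who wants to check the ordering claim on their own copies, here is a minimal sketch, not part of the Pythia or GPT-NeoX tooling. It walks two document streams in lockstep and reports the first position where they disagree. The function name `first_order_mismatch` is hypothetical, and loading the documents from the jsonl or parquet files is left to the caller.

```python
from itertools import zip_longest

# Sentinel marking "this stream ran out", so that an empty-string document
# is not mistaken for a missing one.
_MISSING = object()


def first_order_mismatch(docs_a, docs_b):
    """Return the index of the first position where two document streams differ.

    Both arguments are iterables of document texts. Returns None when the
    streams have the same documents in the same order and the same length.
    """
    pairs = zip_longest(docs_a, docs_b, fillvalue=_MISSING)
    for index, (a, b) in enumerate(pairs):
        if a is _MISSING or b is _MISSING:
            return index  # one stream is shorter than the other
        if a != b:
            return index  # same position, different document
    return None
```

Because it consumes plain iterables, the two sources can be streamed rather than loaded fully into memory.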

preprocess_data.py in the GPT-NeoX repo should be deterministic in creating the shuffle order and tokenizing data, but there's an off chance that a rerun produces a different result (maybe due to very strange multiprocessing issues, etc.).

So, if you'd like to replicate the tokenized Deduplicated Pile in our exact data order, in the Megatron .bin and .idx format, the tokenized + sharded version is the safest bet. If you do prefer to tokenize it yourself, though, I could probably provide you a checksum of the tokenized files to confirm things look ok!
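As a rough illustration of that kind of checksum comparison, here is a minimal sketch that computes SHA-256 digests for the `.bin` and `.idx` files in a directory. The choice of SHA-256, the helper names, and the directory layout are all assumptions for illustration. No official checksums are given in this thread.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large shards never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_shards(directory, suffixes=(".bin", ".idx")):
    """Map each matching file name in `directory` to its SHA-256 digest."""
    return {
        p.name: sha256_of_file(p)
        for p in sorted(Path(directory).iterdir())
        if p.is_file() and p.suffix in suffixes
    }
```

Running `checksum_shards` on a self-tokenized copy and on the downloaded shards, then comparing the two dictionaries, would show whether the files are byte-identical.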

zplizzi · 2023-03-20T22:28:25Z

Thank you!

stabilize-ai · 2023-04-15T00:10:14Z

Going from @zplizzi's original message, do the 2nd and 3rd links below contain the same examples as the ones used for Pythia training, and in the same order?

https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps (tokenized + sharded)
https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated (jsonl)
https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile (raw, parquet)

haileyschoelkopf · 2023-04-22T19:29:46Z

The 2nd and 3rd links do not. They theoretically do contain the same data if you run prepare_data.py on the 2nd link, but due to multiprocessing weirdness I can't guarantee the shuffle would end up the same, which is why we supply the first link.

zplizzi closed this as completed Mar 20, 2023

zplizzi mentioned this issue Mar 23, 2023

What tool do you use for your data preprocessing/binarization? #69

Closed
