Ordering of deduplicated datasets? #76
Comments
Hi! The latter two datasets have the exact same ordering of data points.
So, if you’d like to replicate the tokenized Deduplicated Pile in our exact data order, in the Megatron
Thank you!
Going from @zplizzi's original message, do the 2nd and 3rd links below contain the same examples as the ones used for Pythia training, and in the same order?
The 2nd and 3rd links do not. They theoretically do contain the same data if you run prepare_data.py on the 2nd link, but due to multiprocessing weirdness I can't guarantee the shuffle would end up the same, which is why we supply the first link.
I just wanted to confirm that the 3 versions you have of the deduplicated data have the same data ordering?
I was hoping to use the jsonl one but wanted to ensure it will accurately replicate the data ordering in your tokenized dataset that you used for training.
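
For anyone wanting to check ordering themselves, here is a minimal sketch of how two jsonl copies could be compared document by document. It assumes each line is a JSON object with a `text` field (an assumption about the file layout, not something confirmed in this thread), and the function names are hypothetical. It only compares raw text between two jsonl files; it does not check anything against the tokenized Megatron data.

```python
import hashlib
import json
from itertools import zip_longest


def doc_hashes(path, text_key="text"):
    """Yield one SHA-256 digest per JSONL record, in file order.

    Assumes every non-empty line is a JSON object with a `text_key` field.
    """
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            yield hashlib.sha256(record[text_key].encode("utf-8")).hexdigest()


def first_order_mismatch(path_a, path_b, text_key="text"):
    """Return the index of the first record where the two files differ.

    Returns None if both files contain the same documents in the same order.
    A length difference counts as a mismatch at the shorter file's end.
    """
    pairs = zip_longest(doc_hashes(path_a, text_key), doc_hashes(path_b, text_key))
    for i, (a, b) in enumerate(pairs):
        if a != b:
            return i
    return None
```

Both files are streamed, so memory use stays flat even for large dumps. A result of `None` would indicate that the two copies agree on both content and order.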