Ordering of deduplicated datasets? #76

Closed
zplizzi opened this issue Mar 17, 2023 · 4 comments

@zplizzi

zplizzi commented Mar 17, 2023

I just wanted to confirm that the 3 versions you have of the deduplicated data have the same data ordering?

I was hoping to use the jsonl one but wanted to ensure it will accurately replicate the data ordering in your tokenized dataset that you used for training.

@haileyschoelkopf
Collaborator

Hi! The latter two datasets have the exact same ordering of data points.

preprocess_data.py in the GPT-NeoX repo should be deterministic in creating the shuffle order and tokenizing data, but there’s an off chance that a rerun produces a different order (maybe due to very strange multiprocessing issues, etc.).

So, if you’d like to replicate the tokenized Deduplicated Pile in our exact data order, in the Megatron .bin and .idx format, the tokenized + sharded version is the safest bet. If you do prefer to tokenize it yourself, though, I could probably provide you a checksum of the tokenized files to confirm things look OK!
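For anyone who wants to compare checksums of self-tokenized files against a reference, here is a minimal sketch. It assumes SHA-256 as the hash (the thread does not say which algorithm would be used), and `sha256_of_file` and `checksum_shards` are hypothetical helper names, not part of GPT-NeoX.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large shards are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_shards(directory):
    """Return {filename: sha256 hex digest} for every .bin and .idx file in `directory`."""
    directory = Path(directory)
    return {
        p.name: sha256_of_file(p)
        for p in sorted(directory.iterdir())
        if p.suffix in {".bin", ".idx"}
    }
```

Calling `checksum_shards("path/to/tokenized")` on both sides and comparing the resulting dictionaries would show whether the files are byte-identical. A mismatch only says that the bytes differ, not where or why.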

@zplizzi
Author

zplizzi commented Mar 20, 2023

Thank you!

@stabilize-ai

Going from @zplizzi's original message, do the 2nd and 3rd links below contain the same examples as the ones used for Pythia training, and in the same order?

@haileyschoelkopf
Collaborator

The 2nd and 3rd links do not. They theoretically do contain the same data if you run prepare_data.py on the 2nd link, but due to multiprocessing weirdness I can't guarantee the shuffle would end up the same, which is why we supply the first link.
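If you have two JSONL copies of the data and want to check whether they share the same record order, one rough way is to hash each line and find the first position where the two files disagree. This is only a sketch: `record_digests` and `first_order_mismatch` are hypothetical names, and the comparison is on raw bytes, so two records with identical JSON content but different whitespace or key order would count as a mismatch.

```python
import hashlib
from itertools import zip_longest


def record_digests(path):
    """Yield one SHA-256 digest per non-empty line, in file order."""
    with open(path, "rb") as f:
        for line in f:
            line = line.strip()
            if line:
                yield hashlib.sha256(line).hexdigest()


def first_order_mismatch(path_a, path_b):
    """Return the index of the first differing record, or None if the order matches."""
    # zip_longest pads the shorter file with None, so a length
    # difference is reported as a mismatch at the first missing record.
    pairs = zip_longest(record_digests(path_a), record_digests(path_b))
    for index, (digest_a, digest_b) in enumerate(pairs):
        if digest_a != digest_b:
            return index
    return None
```

A result of `None` means both files contain the same lines in the same order. It says nothing about whether that order matches what was used for training.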
