Question For Training Dataset #9

dumpmemory · 2022-11-08T07:52:37Z

Apart from the database and index data on Hugging Face, is the train_data.json in the repo meant only as an example? Would you mind releasing the full train and test datasets so the results can be reproduced?

bling0830 · 2022-11-11T09:15:46Z

We explain how to retrieve the library and how to handle training data in index-server/README.md. The Pile is used as the training corpus.

We used the first 400,000 texts of pile_00.json and the first 400,000 texts of pile_29.json, for a total of 800,000 texts, as the training corpus. Texts longer than 1025 are truncated.
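For anyone trying to reproduce this selection, here is a minimal sketch of the steps described above. It is not the authors' script. It assumes each file stores one JSON object per line with a `"text"` field, and the helper names (`first_texts`, `truncate`, `build_corpus`) are hypothetical. The thread does not say whether 1025 counts characters or tokens, so `truncate` works on any sequence and can be applied to a string or to a list of token ids.

```python
import itertools
import json
from typing import Iterable, Iterator, List

MAX_LEN = 1025            # limit quoted in the thread; its unit is not stated there
TEXTS_PER_FILE = 400_000  # number of texts taken from the start of each file


def first_texts(lines: Iterable[str], limit: int) -> Iterator[str]:
    """Yield the "text" field of the first `limit` non-empty JSON lines.

    Assumption: one JSON object per line, each with a "text" key.
    """
    records = (json.loads(line) for line in lines if line.strip())
    for record in itertools.islice(records, limit):
        yield record["text"]


def truncate(seq, max_len: int = MAX_LEN):
    """Keep at most `max_len` items of a string or a token list."""
    return seq[:max_len]


def build_corpus(paths: List[str], per_file: int = TEXTS_PER_FILE) -> List[str]:
    """Concatenate the truncated leading texts of every file in `paths`."""
    corpus: List[str] = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            corpus.extend(truncate(text) for text in first_texts(handle, per_file))
    return corpus
```

Calling `build_corpus(["pile_00.json", "pile_29.json"])` would then return up to 800,000 texts, provided both files are present locally in the assumed layout. If the limit is meant in tokens, tokenize each text first and pass the token list to `truncate` instead.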

dumpmemory · 2022-11-11T16:02:35Z

Thanks for your support.

dumpmemory closed this as completed Nov 9, 2022

dumpmemory reopened this Nov 10, 2022

dumpmemory closed this as completed Nov 11, 2022

dumpmemory mentioned this issue Dec 28, 2022

Unable to reproduce Langboat/ReGPT-125M-200G's PPL result. #17

Open
