New oasst export dataset loader #1854

Merged
andreaskoepf merged 8 commits into main from oasst_dataset_loader on Feb 26, 2023

andreaskoepf
Collaborator

  • Support jsonl and jsonl.gz input files
  • Use the oasst export classes for parsing (these classes now reside in oasst_shared)
  • Extract all usable leaf nodes (the last assistant replies of conversation threads)
  • Allow filtering by language and by top_k (based on ranking results)
  • Split into train/eval sets, ensuring that conversation threads from the same tree never end up in both train and eval
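
The steps listed above can be illustrated with a minimal sketch. This is not the code from this PR: it works on plain dicts instead of the oasst_shared export classes, and the field names (`prompt`, `replies`, `role`, `text`, `lang`, `rank`) as well as the function names `read_trees`, `leaf_threads`, and `split_by_tree` are assumptions made for illustration. The `top_k` handling (keep the k best-ranked replies at each level) is also a simplification and may differ from the merged loader.

```python
import gzip
import json
import random
from pathlib import Path


def read_trees(path):
    """Yield one JSON object per non-empty line of a .jsonl or .jsonl.gz file."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def leaf_threads(root, lang=None, top_k=None):
    """Return root-to-leaf lists of texts whose last message is an assistant reply.

    Assumed (hypothetical) node fields: "text", "role", "lang", "rank", "replies".
    """
    threads = []

    def walk(node, prefix):
        # Drop the whole branch if the node is not in the requested language.
        if lang is not None and node.get("lang") != lang:
            return
        thread = prefix + [node["text"]]
        replies = node.get("replies") or []
        if top_k is not None:
            # Keep the top_k best-ranked replies; unranked replies sort last.
            replies = sorted(
                replies,
                key=lambda r: (r.get("rank") is None, r.get("rank") or 0),
            )[:top_k]
        if not replies:
            # A leaf is only usable if it is an assistant message.
            if node.get("role") == "assistant":
                threads.append(thread)
            return
        for reply in replies:
            walk(reply, thread)

    walk(root, [])
    return threads


def split_by_tree(trees, eval_fraction=0.2, seed=0, **filters):
    """Split at tree level so threads of one tree never land in both sets."""
    trees = list(trees)
    random.Random(seed).shuffle(trees)
    n_eval = int(len(trees) * eval_fraction)
    eval_trees, train_trees = trees[:n_eval], trees[n_eval:]
    train = [t for tree in train_trees for t in leaf_threads(tree["prompt"], **filters)]
    eval_set = [t for tree in eval_trees for t in leaf_threads(tree["prompt"], **filters)]
    return train, eval_set
```

A call such as `split_by_tree(read_trees("export.jsonl.gz"), lang="en", top_k=2)` (with a hypothetical file name) would then return two lists of threads. Shuffling and splitting whole trees before extracting threads is what keeps threads that share a prompt out of both sets.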

@sanagno sanagno added the ml label Feb 25, 2023
@andreaskoepf andreaskoepf merged commit 8eda31a into main Feb 26, 2023
@andreaskoepf andreaskoepf deleted the oasst_dataset_loader branch February 26, 2023 11:56

2 participants