new SFT datasets #853

Merged: 5 commits into main from sft-dataset-update on Jan 20, 2023

theblackcat102
Collaborator

Added translation and additional datasets to the pool:

translation:

  • WMT 2019
  • TED Talk translation
  • DiveMT: expert-written translation dataset

safety:

  • Prosocial dialogue in both dialogue generation and explanation format

others:

  • @Rallio67's instruction tuning dataset (instruct_tuning)
  • Other summarization datasets: debate_sum, tldr_news

The WMT and TED translation datasets can be large (a few GB per language pair), so be careful not to include too many language pairs.
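One way to act on the size warning above is to cap the selected language pairs by a download budget. The following is only a minimal sketch, not code from this PR: the function name `select_language_pairs`, the `EXAMPLE_SIZES_GB` table, and the sizes in it are hypothetical placeholders, not measured values.

```python
from typing import Dict, List


def select_language_pairs(
    requested: List[str], size_gb: Dict[str, float], budget_gb: float
) -> List[str]:
    """Return requested pairs, in order, skipping any pair that would
    push the running total above budget_gb."""
    selected: List[str] = []
    total = 0.0
    for pair in requested:
        if pair not in size_gb:
            raise KeyError(f"no size estimate for language pair {pair!r}")
        if total + size_gb[pair] <= budget_gb:
            selected.append(pair)
            total += size_gb[pair]
    return selected


# Placeholder sizes for illustration only (assumed, not measured).
EXAMPLE_SIZES_GB = {"de-en": 3.0, "zh-en": 2.5, "ru-en": 4.0}

print(select_language_pairs(["de-en", "zh-en", "ru-en"], EXAMPLE_SIZES_GB, budget_gb=6.0))
```

The resulting list could then be passed to whatever dataset configuration selects the translation pairs, so the budget is enforced before anything is downloaded.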

@github-actions

pre-commit failed.
Please run pre-commit run --all-files locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md

@sanagno (Collaborator) left a comment

Looks great. I also agree that we should upload the datasets to the hub. Perhaps we can have a dataset_preprocess folder where we create the hub datasets, and then we simply download the final versions.

@sanagno merged commit 8838698 into main on Jan 20, 2023
@sanagno deleted the sft-dataset-update branch on January 20, 2023 08:47

2 participants