Add more datasets and some fixes #1455

theblackcat102 · 2023-02-11T04:04:22Z

Mainly added the OA private datasets:

essay_instruction: essay writing
private_tuning: newly generated instruction dataset
translated instruction: translations of the instructions above
eli5: askh, eli5, asks datasets

Added a collate function that mixes different tasks within a batch, so that attention learns to attend to the relevant previous context and question
Fixed bugs from another PR
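
The in-batch mixing idea can be sketched roughly as follows. This is a minimal illustration in plain Python, not the collator from this PR: the name `mix_batch`, the `mix_probability` parameter, and the representation of a conversation as a list of strings are all assumptions made for the example.

```python
import random
from typing import List, Optional

Conversation = List[str]  # alternating question/answer turns (assumed layout)


def mix_batch(
    batch: List[Conversation],
    mix_probability: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[Conversation]:
    """Hypothetical helper: prepend an unrelated conversation to some samples."""
    rng = rng or random.Random()
    mixed = []
    for i, conversation in enumerate(batch):
        # Candidate distractors are all other conversations in the same batch.
        others = [c for j, c in enumerate(batch) if j != i]
        if others and rng.random() < mix_probability:
            distractor = rng.choice(others)
            # Unrelated turns come first; the sample's own turns stay last,
            # so the model has to pick out the relevant context.
            mixed.append(list(distractor) + list(conversation))
        else:
            mixed.append(list(conversation))
    return mixed
```

Setting `mix_probability=0.0` leaves every sample unchanged, which gives a simple way to turn the augmentation off.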

sanagno · 2023-02-11T09:19:10Z

Thanks!
We can merge; I would propose some simple changes:

Add a flag to choose whether to use the new mixing in the collator
Add a script to split the OAPrivate data for sft, rm and rl
I would merge all the data utils from sft, rm and rl. Let me know if you agree; I can do it today or tomorrow.
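
One way the proposed split script could work is sketched below. This is only an assumption about its shape: the function name `split_for_stages`, the fraction values, and the fixed seed are hypothetical and not taken from the repository.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def split_for_stages(
    samples: Sequence[T],
    fractions: Dict[str, float],
    seed: int = 0,
) -> Dict[str, List[T]]:
    """Hypothetical helper: shuffle once, then cut into disjoint slices per stage."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(len(samples)))
    # A seeded shuffle keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    names = list(fractions)
    splits: Dict[str, List[T]] = {}
    start = 0
    for k, name in enumerate(names):
        if k == len(names) - 1:
            # The last stage takes the remainder so rounding drops no sample.
            end = len(indices)
        else:
            end = start + int(fractions[name] * len(indices))
        splits[name] = [samples[i] for i in indices[start:end]]
        start = end
    return splits


# Example usage with made-up fractions for the three stages.
splits = split_for_stages(list(range(10)), {"sft": 0.5, "rm": 0.3, "rl": 0.2})
```

Keeping the three subsets disjoint means the reward model and RL stages do not see samples already used for supervised fine-tuning.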

sanagno · 2023-02-11T09:22:55Z

model/supervised_finetuning/custom_datasets/dialogue_collator.py

+
+            # Add a way for the model to terminate generation
+            # When we predict the start of a new expected question, we want to be able to stop generation
+            messages.append(self.tokenizer.eos_token)


For this collator to work, we need to replace this eos token with the tag. Let me know if I am wrong.

I would change it anyway, for both this collator and the default collator
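
A rough sketch of the suggestion, assuming the terminator is made configurable so that either the EOS token or a tag can close the sequence. The function `build_messages` and the tag strings `<q>` and `<a>` are placeholders for illustration; they are not the tokens or the code used in `dialogue_collator.py`.

```python
from typing import List


def build_messages(
    turns: List[str],
    question_tag: str,
    answer_tag: str,
    terminator: str,
) -> List[str]:
    """Hypothetical helper: tag alternating turns and append a terminator."""
    messages = []
    for i, turn in enumerate(turns):
        # Even positions are treated as questions, odd positions as answers.
        tag = question_tag if i % 2 == 0 else answer_tag
        messages.append(tag + turn)
    # The terminator marks where generation should stop; passing the question
    # tag here means "the next question starts" acts as the stop signal.
    messages.append(terminator)
    return messages


# Placeholder tags, chosen only for this example.
with_tag = build_messages(["Hi", "Hello"], "<q>", "<a>", terminator="<q>")
```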

sanagno

We can merge for now, and I can make some changes now

theblackcat102 added 13 commits February 1, 2023 22:14

[feature] Add mix conversation augmentation

f8eba68

[feature] Add OA translated QA

9be4c92

[feature] mix generation from different tasks

1041564

[fix] Custom collate_fn for training

8b20805

[feature] Add OA private RM dataset

0be4d88

[feature] Add rallio new instruction dataset v3

7421615

[feature] Add missing hindi and spanish prompt for translation

af1c62c

[fix] transformers import error

a39cbab

[fix] patch translated history conversation

2c35ff6

Merge branch 'main' into add-dataset

a1b90bf

[fix] add comments for translation data

3434760

[merge] Fix conflict

bcebbbc

[fix] Fix other PR merge bug

9e69117

theblackcat102 requested a review from sanagno as a code owner February 11, 2023 04:04

theblackcat102 added the ml label Feb 11, 2023

sanagno reviewed Feb 11, 2023

sanagno approved these changes Feb 11, 2023

sanagno merged commit 0610865 into main Feb 11, 2023

sanagno deleted the add-dataset branch February 11, 2023 09:24

tomohideshibata mentioned this pull request Feb 12, 2023

Fix a bitwise operator #1413

Closed
