About the way to generate 99,121 synthetic prompts from TGRT Self-Instruct #13

Open
Harry-mic opened this issue Nov 16, 2023 · 4 comments

Comments

@Harry-mic

Hello, thanks for your awesome work and code!

However, I am confused about how you generated the TGRT Self-Instruct prompts. You mention in the paper that you first handwrote 20 instruction types and then generated topics from these types. Finally, instructions were generated from each "instruction type - topic" pair.

Therefore, my first question is:
How many topics did you generate for each instruction type? I see in Appendix G that your prompt generates 10 topics for each instruction type.

My second question is:
How many instructions are generated for each "instruction type - topic" pair? You end up with 99,121 synthetic prompts from TGRT Self-Instruct, so if every "instruction type - topic" pair generates only one instruction, does that mean you generated at least 99,121 topics?

Thanks a lot for your help!

@Edward-Sun
Contributor

Hi Harryis,

Yes. We generated around 120k synthetic topics (after filtering on the topics) from TGRT Self-Instruct, generated the corresponding 120k prompts, and did some filtering on the prompts to get the final 99k prompts.

As can be seen from the code, we randomly sample from the existing topics to construct new topics. So ideally we would get the same number of topics for each instruction type, but in practice the numbers differ because of filtering.
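To illustrate why the per-type counts drift apart, here is a minimal sketch. The filter rules (case-insensitive de-duplication and a three-word limit) and the instruction type names are assumptions made up for this example; the actual filtering used in the repository is not described in this thread.

```python
from collections import Counter


def filter_topics(pairs):
    """Hypothetical filter: drop case-insensitive duplicates and topics over three words."""
    seen = set()
    kept = []
    for instruction_type, topic in pairs:
        key = (instruction_type, topic.strip().lower())
        if key in seen or len(topic.split()) > 3:
            continue
        seen.add(key)
        kept.append((instruction_type, topic.strip()))
    return kept


# Three raw topics per (made-up) instruction type before filtering.
raw = [
    ("Brainstorming", "Urban Gardening"),
    ("Brainstorming", "urban gardening"),  # duplicate, will be dropped
    ("Brainstorming", "Board Games"),
    ("Math", "Linear Algebra"),
    ("Math", "Graph Theory"),
    ("Math", "Prime Numbers"),
]

kept = filter_topics(raw)
counts = Counter(instruction_type for instruction_type, _ in kept)
print(counts)
```

Both types start with the same number of raw topics, but the filter removes a different number from each, so the final counts are no longer equal.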

@Harry-mic
Author

Harry-mic commented Nov 18, 2023

Thanks a lot for your reply!

It's clear to me that you use 20 instruction types to generate 120k synthetic topics, which means each instruction type generates about 120k / 20 = 6k topics. However, how do you "randomly sample topics from existing topics to construct new topics"? As far as I know, the topic generation prompt below doesn't involve selecting existing topics; it only involves one of the 20 instruction types:


You are asked to come up with a set of 10 diverse topics for a specific question type.

Here are the requirements:

1. Try not to repeat the words for each topic to maximize diversity.
2. Each topic should contain up to three words.
3. Each topic should be a noun phrase, its first word should be capitalized.
4. The topics should be closely related to the given question type: {}.

List of 10 topics:

Or do you use existing topics as in-context examples after "List of 10 topics:", so that the topic examples are built up iteratively?
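The two readings can be made concrete with a small sketch. The template text is the prompt quoted above; the function name `build_topic_prompt` and the way seed topics are appended as numbered list items are assumptions for illustration, not the repository's confirmed behavior.

```python
TOPIC_PROMPT = """You are asked to come up with a set of 10 diverse topics for a specific question type.

Here are the requirements:

1. Try not to repeat the words for each topic to maximize diversity.
2. Each topic should contain up to three words.
3. Each topic should be a noun phrase, its first word should be capitalized.
4. The topics should be closely related to the given question type: {}.

List of 10 topics:
"""


def build_topic_prompt(instruction_type, seed_topics=()):
    """Fill in the question type and (assumed) prepend seed topics as list items."""
    prompt = TOPIC_PROMPT.format(instruction_type)
    for i, topic in enumerate(seed_topics, start=1):
        prompt += f"{i}. {topic}\n"
    # Leave the next list number open so the model continues the list.
    prompt += f"{len(seed_topics) + 1}."
    return prompt


# Reading 1: no existing topics, only the instruction type is filled in.
plain = build_topic_prompt("Math")
# Reading 2: existing topics are shown as in-context examples.
seeded = build_topic_prompt("Math", ["Linear Algebra", "Prime Numbers"])
print(seeded)
```

With an empty `seed_topics`, the prompt depends only on the instruction type; with seeds, the same template carries existing topics forward as examples.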

Thanks a lot for your help!

@Edward-Sun
Contributor

Hi Harryis,

We generate the topics in several rounds (called generation_epoch in the code). In each round, we sample topics from all previous rounds as seeds to produce new topics.
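A rough sketch of such a round-based loop is shown below. Only the name `generation_epoch` comes from this thread; `fake_generate` is a stand-in for the actual LLM call, and `num_seeds`, the per-type seed pool, and the de-duplication step are assumptions for illustration.

```python
import random


def fake_generate(instruction_type, seed_topics, round_idx, n=10):
    """Stand-in for the LLM call: returns n placeholder topics for this round."""
    return [f"{instruction_type} Topic {round_idx}-{i}" for i in range(n)]


def generate_topics(instruction_types, generation_epoch=3, num_seeds=3, rng=None):
    """Grow a topic pool per instruction type over several rounds."""
    rng = rng or random.Random(0)
    topics = {t: [] for t in instruction_types}
    for round_idx in range(generation_epoch):
        for t in instruction_types:
            pool = topics[t]
            # Sample seeds from everything produced in earlier rounds
            # (empty in the first round, so the prompt has no examples).
            seeds = rng.sample(pool, min(num_seeds, len(pool)))
            new_topics = fake_generate(t, seeds, round_idx)
            fresh = [x for x in new_topics if x not in pool]
            pool.extend(fresh)
    return topics


topics = generate_topics(["Math", "Coding"], generation_epoch=3)
print({t: len(v) for t, v in topics.items()})
```

In this sketch the first round runs without seeds and later rounds draw their seeds from the accumulated pool, which is one way to read "sample topics from the previous rounds".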

@Harry-mic
Author

Oh, I get it! Thanks!

2 participants