About the way to generate 99,121 synthetic prompts from TGRT Self-Instruct #13

Harry-mic · 2023-11-16T09:37:12Z

Hello, thanks for your awesome work and code!

However, I am confused about how you generated the TGRT Self-Instruct data. The paper says that you first hand-wrote 20 instruction types, then generated topics from these types, and finally generated instructions from each "instruction type - topic" pair.

Therefore, my first question is:
How many topics did you generate for each instruction type? I see in Appendix G that your prompt generates 10 topics per instruction type.

My second question is:
How many instructions are generated for each "instruction type - topic" pair? You end up with 99,121 synthetic prompts from TGRT Self-Instruct, so if each "instruction type - topic" pair generates only one instruction, does that mean you generated at least 99,121 topics?

Thanks a lot for your help!

Edward-Sun · 2023-11-17T11:08:58Z

Hi Harryis,

Yes. We generated around 120k synthetic topics (after filtering on the topics) from TGRT Self-Instruct, generated the corresponding 120k prompts, and did some filtering on the prompts to get the final 99k prompts.

As can be seen from the code, we randomly sample topics from existing topics to construct new topics. So ideally we would get the same number of topics for each instruction type, but the actual numbers differ due to filtering.

Harry-mic · 2023-11-18T01:49:04Z

Thanks a lot for your reply!

It's clear to me that you use 20 instruction types to generate 120k synthetic topics, which means each instruction type generates about 120k / 20 = 6k topics. However, how do you "randomly sample topics from existing topics to construct new topics"? As far as I know, the topic generation prompt below doesn't involve selecting existing topics; it only involves one of the 20 instruction types:


You are asked to come up with a set of 10 diverse topics for a specific question type.

Here are the requirements:

1. Try not to repeat the words for each topic to maximize diversity.
2. Each topic should contain up to three words.
3. Each topic should be a noun phrase, its first word should be capitalized.
4. The topics should be closely related to the given question type: {}.

List of 10 topics:

Or do you use existing topics as in-context (ICL) examples after "List of 10 topics:", so that the topic examples are updated iteratively?

Thanks a lot for your help!

Edward-Sun · 2023-11-19T08:32:44Z

Hi Harryis,

We generate the topics in several rounds (called generation_epoch in the code), where in each round, we sample from the topics generated in previous rounds as seeds to produce new topics.
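
For readers following along, here is a minimal sketch of what such round-based seeding could look like. It is not the repository's actual code: `build_prompt`, `generate_topics`, `complete`, `num_epochs`, and `num_seeds` are hypothetical names, `complete` stands in for whatever LLM call returns a list of topic strings, and sampling seeds only from the same instruction type's pool is an assumption. Only the prompt text and the name `generation_epoch` come from this thread.

```python
import random
from typing import Callable, Dict, List, Optional

# Topic-generation prompt as quoted earlier in this thread.
TOPIC_PROMPT = """You are asked to come up with a set of 10 diverse topics for a specific question type.

Here are the requirements:

1. Try not to repeat the words for each topic to maximize diversity.
2. Each topic should contain up to three words.
3. Each topic should be a noun phrase, its first word should be capitalized.
4. The topics should be closely related to the given question type: {}.

List of 10 topics:
"""


def build_prompt(instruction_type: str, seed_topics: List[str]) -> str:
    """Fill in the instruction type and append seed topics as in-context examples."""
    prompt = TOPIC_PROMPT.format(instruction_type)
    for i, topic in enumerate(seed_topics, start=1):
        prompt += f"{i}. {topic}\n"
    # Leave the next list item open for the model to continue.
    prompt += f"{len(seed_topics) + 1}."
    return prompt


def generate_topics(
    instruction_types: List[str],
    complete: Callable[[str], List[str]],  # hypothetical LLM wrapper
    num_epochs: int = 3,                   # assumed value, for illustration
    num_seeds: int = 3,                    # assumed value, for illustration
    rng: Optional[random.Random] = None,
) -> Dict[str, List[str]]:
    """Grow a per-type topic pool over several rounds, seeding from earlier rounds."""
    rng = rng or random.Random(0)
    pool: Dict[str, List[str]] = {t: [] for t in instruction_types}
    for generation_epoch in range(num_epochs):
        for itype in instruction_types:
            existing = pool[itype]
            # Round 0 has no seeds; later rounds sample from earlier output.
            seeds = rng.sample(existing, min(num_seeds, len(existing)))
            seen = set(existing)
            for topic in complete(build_prompt(itype, seeds)):
                topic = topic.strip()
                if topic and topic not in seen:  # simple exact-match dedup
                    existing.append(topic)
                    seen.add(topic)
    return pool
```

Under these assumptions the first round runs with no seeds, and every later round shows the model a few previously generated topics, which is one way the pool could grow from 10 topics per prompt toward thousands per instruction type before filtering.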

Harry-mic · 2023-11-20T07:45:31Z

Oh, I get it! Thanks!
