Creating augmented data using few-shot prompts for explanations of jokes, logical inferences, etc. #261

huu4ontocord · 2023-01-02T07:12:23Z

See https://www.lesswrong.com/posts/EHbJ69JDs4suovpLw/testing-palm-prompts-on-gpt3.

Try doing 2-, 3- or 4-shot inference on something like GPT-JT, GPT-NeoX-20B, or Galactica.

After we find a promising model and configuration, we can scrape the net for jokes and paragraphs with logical inferences to create dialog data.

Human: Tell me a joke about {extract keywords from joke}
Assistant: {joke}
Human: Explain the joke.
Assistant: {explanation}
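
A minimal sketch of how the dialog template above could be filled in. The function name `build_joke_dialog` is hypothetical, and keyword extraction is assumed to happen elsewhere (the keywords are simply passed in as a list):

```python
# Sketch only: fills the four-turn template from the comment above.
# `build_joke_dialog` is a hypothetical helper, not part of any existing codebase.
DIALOG_TEMPLATE = (
    "Human: Tell me a joke about {keywords}\n"
    "Assistant: {joke}\n"
    "Human: Explain the joke.\n"
    "Assistant: {explanation}"
)


def build_joke_dialog(joke: str, explanation: str, keywords: list[str]) -> str:
    """Return one dialog sample for a scraped joke and a generated explanation."""
    return DIALOG_TEMPLATE.format(
        keywords=", ".join(keywords),  # assumption: keywords were extracted upstream
        joke=joke.strip(),
        explanation=explanation.strip(),
    )


if __name__ == "__main__":
    print(
        build_joke_dialog(
            joke="Why did the chicken cross the road? To get to the other side.",
            explanation="The listener expects a punchline, but gets a literal answer.",
            keywords=["chicken", "road"],
        )
    )
```

The explanation string would come from whichever few-shot model turns out to work best; nothing here depends on a particular model.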

See also https://storage.googleapis.com/pathways-language-model/PaLM-paper.pdf

huu4ontocord · 2023-01-05T00:46:18Z

Adding explanations at the end of existing instruction dataset answers where the answers are classifications (see P3, Natural Instructions, etc.):

For example,

This is a movie review for the movie {movie}: {review}. This movie review is {classification} because ...[your created answer]

This is a movie review for the movie {movie}: {review}. This movie review is {classification} because ...generated answer
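
One way to turn the template above into a few-shot prompt, as a rough sketch. The helper name `build_few_shot_prompt` and the dict keys (`movie`, `review`, `label`, `explanation`) are assumptions made for illustration; the hand-written explanations play the role of "[your created answer]" and the model is expected to continue after the final "because":

```python
# Sketch only: assembles a k-shot prompt from hand-written explanations.
# Helper name and dict keys are hypothetical.
PROMPT = (
    "This is a movie review for the movie {movie}: {review}. "
    "This movie review is {label} because"
)


def build_few_shot_prompt(shots: list[dict], target: dict) -> str:
    """Join solved examples with an unsolved target the model should complete."""
    parts = []
    for shot in shots:
        stem = PROMPT.format(
            movie=shot["movie"], review=shot["review"], label=shot["label"]
        )
        parts.append(f"{stem} {shot['explanation']}")
    # The target ends at "because" so the model generates the explanation.
    parts.append(
        PROMPT.format(
            movie=target["movie"], review=target["review"], label=target["label"]
        )
    )
    return "\n\n".join(parts)
```

Passing 2, 3 or 4 entries in `shots` gives the 2-, 3- or 4-shot variants.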

We can also follow this up with explanations for other "hard" things like:

explain riddles, poems (metaphors), analogies, songs

smytjf11 · 2023-01-05T03:39:45Z

Going with the movie reviews idea, could we use the Rotten Tomatoes dataset to generate prompts, maybe supplement with one of the models fine tuned on it as well?

https://huggingface.co/datasets/rotten_tomatoes
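
A rough sketch of how rows from that dataset could be mapped to prompts. It assumes each row has a `text` field and an integer `label` field with 0 for negative and 1 for positive; since the rows carry no movie title, the `{movie}` slot is dropped here. The function name `row_to_prompt` is hypothetical:

```python
# Sketch only: converts one Rotten Tomatoes row into an explanation prompt.
# Assumes rows look like {"text": str, "label": 0 or 1}; no movie title is available.
LABEL_NAMES = {0: "negative", 1: "positive"}


def row_to_prompt(row: dict) -> str:
    """Build the 'because ...' stem for a single review."""
    review = row["text"].strip()
    label = LABEL_NAMES[row["label"]]
    return f"This is a movie review: {review} This movie review is {label} because"


# Usage with the Hugging Face `datasets` library (not executed here):
#   from datasets import load_dataset
#   ds = load_dataset("rotten_tomatoes", split="train")
#   prompts = [row_to_prompt(row) for row in ds]
```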

momegas · 2023-01-17T18:21:45Z

So the idea is to create a dataset with explanations, for example by taking the movie dataset and doing this:
This is a movie review for the movie {movie}: {review}. This movie review is {classification} because ...[your created answer]
Am I right?
I'm interested in picking this up. How large should the dataset be?

huu4ontocord · 2023-01-22T04:00:11Z

@momegas Yes. If it is very compute intensive, it doesn't need to be large. Maybe see if you can get it to work first, and then we can discuss size. We can run it on some extra compute.

mikegarts · 2023-03-20T09:12:48Z

Sounds like a very cool task and I would love to give it a try if it is still relevant :) @ontocord

kkie02 · 2023-03-23T15:12:04Z

@ontocord I'd like to have a try, can you tell me your name in Discord? Maybe we can talk a little bit more there.
@mikegarts Maybe we can work on it together? More data is better for this project.
My name in Discord is QiKo

mikegarts · 2023-03-25T14:33:13Z

@kkie02 Sure, I'm in discord as mikegarts. Feel free to ping me.
Btw, I just opened a PR with a somewhat relevant instruction dataset, #2209, but would love to cooperate on further work.

Regarding #261: this is a reproduction of the LogicInference dataset from the paper https://openreview.net/pdf?id=HAGeIS_Lcg9. I think it will be helpful for improving the model's logical inference ability. The GitHub page of the LogicInference dataset: https://github.com/google-research/google-research/tree/master/logic_inference_dataset.

This dataset is meant to provide more data for the Open Assistant project. To fit its requirements, there are three columns: INSTRUCTION, RESPONSE, SOURCE.

The results in this dataset differ slightly from those introduced in the original paper:

1. Of the three splits (IID/OOD/length), only IID is used. In the original paper, it seems that the model reaches better performance with data generated by this split method.
2. In the original paper, there are two forms of responses: LOGICINFERENCEb (with the answer at the beginning) and LOGICINFERENCEe (with the answer at the end). This dataset uses LOGICINFERENCEe, which means that for every question the model first does the logical inference and gives the final answer at the end.
3. In the original paper, some parameters in generate_dataset.py are:

   N_INFERENCE_PROBLEMS = 5000
   N_VARIATIONS = 25
   N_EXAMPLES = 200000
   TRAIN_RATIO = 0.9
   LENGTH_SPLIT_THRESHOLD = 4
   RANDOM_SEED = 0

   I chose some new parameters:

   N_INFERENCE_PROBLEMS = 10000
   N_VARIATIONS = 25
   N_EXAMPLES = 55000
   TRAIN_RATIO = 1
   LENGTH_SPLIT_THRESHOLD = 4
   RANDOM_SEED = 1111

The original script generated 4814 different inference problems and extended them to around 200,000 Q-A pairs. My settings generated 5491 different inference problems and extended them to around 54,607 Instruction-Response pairs. I think that for the Open Assistant project the number of different inference problems is probably more important; generating many similar Instruction-Response pairs would only add training time without much benefit.

---------

Co-authored-by: Andreas Koepf <andreas.koepf@provisio.com>
Co-authored-by: Oliver Stanley <olivergestanley@gmail.com>
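
To make the trade-off in the dataset description above concrete, here is a small calculation of how many Instruction-Response pairs each distinct inference problem is expanded into under the two settings. It uses only the counts quoted above (which are themselves approximate); the variable and function names are illustrative:

```python
# Sketch only: compares pairs-per-problem for the two configurations quoted above.
original = {"problems": 4814, "pairs": 200_000}  # paper settings (approximate pair count)
reduced = {"problems": 5491, "pairs": 54_607}    # settings chosen for Open Assistant


def pairs_per_problem(stats: dict) -> float:
    """Average number of generated pairs per distinct inference problem."""
    return stats["pairs"] / stats["problems"]


if __name__ == "__main__":
    print(round(pairs_per_problem(original), 1))  # about 41.5
    print(round(pairs_per_problem(reduced), 1))   # about 9.9
```

So the new settings yield slightly more distinct problems while producing roughly a quarter as many near-duplicate variations per problem.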

echo0x22 · 2023-05-10T20:20:10Z

Going to work in this field, but with more specific tasks (semantics, logic, reasoning)
#3122

andreaskoepf · 2023-06-14T08:37:49Z

Closing old data issue.

huu4ontocord added the data label Jan 2, 2023

huu4ontocord changed the title ~~Creating augmented data using few-shot prompts for jokes explanations and logical inferences~~ Creating augmented data using few-shot prompts for explanations of jokes, logical inferences, etc. Jan 5, 2023

huu4ontocord assigned momegas Jan 22, 2023

momegas removed their assignment Feb 1, 2023

kkie02 mentioned this issue Apr 5, 2023

add Logic Inference Dataset #2337

Merged

andreaskoepf closed this as completed Jun 14, 2023
