# Writing prompt data augmentation task

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LAION-AI/Open-Assistant/blob/main/notebooks/data-augmentation/writing-prompt/writing_prompt.ipynb)

# Pipeline

The goal of this task is to auto-generate question/answer samples from the WritingPrompts dataset to feed Open-Assistant. To do that, we need to standardize the way a prompt is written. We chose a set of prompt templates that make the generation process feasible. Here are the templates we applied:

* Base template: every prompt gets this sample.
> User: write me a story about: {stripped_prompt} -> Rosey: Sure, here's a story about: {stripped_prompt}:\n{story}

where `stripped_prompt` is the cleaned prompt, obtained by using regex patterns to remove the parts of a prompt that do not fit the template, and `story` is the actual answer to the prompt.

* General constraints: a prompt in which a regex pattern found a constraint also gets this sample.
> Base template, {stripped_constraint} -> Rosey: Sure, here's a story about: {stripped_prompt}, {stripped_constraint}:\n{story}

where `stripped_constraint` is the constraint found.

* Answer beginning constraints: this constraint specifies how the answer should start.  
> Base template, starting with: {beginning} -> Rosey: Sure, here's a story about: {stripped_prompt}, starting with: {beginning}:\n{story}

where `beginning` is the first sentence of a story.

* Answer end constraints: this constraint specifies how the answer should end.  
> Base template, ending with: {ending} -> Rosey: Sure, here's a story about {stripped_prompt}: ending with: {ending}\n{story}

where `ending` is the last sentence of a story.

* Answer middle constraints: this constraint specifies what the middle of the answer should be about.  
> Base template, where the middle of the story is about: {middle} -> Rosey: Sure, here's a story about: {stripped_prompt}, where the middle of the story is about: {middle}:\n{story}

where `middle` is a summary, produced by a generative model, of the story without its first and last sentences.
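The templates above can be read as plain string formatting. The following is a minimal sketch, not the notebook's own implementation: the helper names `base_sample` and `constrained_sample` are hypothetical, and every constrained variant is assumed to follow the punctuation of the "starting with" template.

```python
# Hypothetical helpers illustrating the templates; the names are not from the notebook.
BASE_USER = "User: write me a story about: {stripped_prompt}"
BASE_ANSWER = "Rosey: Sure, here's a story about: {stripped_prompt}"


def base_sample(stripped_prompt, story):
    """Fill the base template with a cleaned prompt and its story."""
    user = BASE_USER.format(stripped_prompt=stripped_prompt)
    answer = BASE_ANSWER.format(stripped_prompt=stripped_prompt)
    return f"{user} -> {answer}:\n{story}"


def constrained_sample(stripped_prompt, story, label, value):
    """Append the same constraint clause to both sides of the base template.

    `label` is e.g. "starting with: " (or "" for a general constraint).
    """
    clause = f", {label}{value}"
    user = BASE_USER.format(stripped_prompt=stripped_prompt) + clause
    answer = BASE_ANSWER.format(stripped_prompt=stripped_prompt) + clause
    return f"{user} -> {answer}:\n{story}"


sample = constrained_sample("a dragon", "Once. End.", "starting with: ", "Once.")
print(sample)
```

Keeping the constraint clause identical on the user side and the answer side is what lets one helper cover the general, beginning, ending and middle variants.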

To get the samples we used the following pipeline:

* **Get data**: download the dataset from Kaggle.
* **Pre-processing**: load the source/target (i.e. prompt/story) files of every split (train/valid/test) and merge them into one pandas DataFrame, enriching it with tabular info about the sample tags.
* **Triage prompts**: we picked prompts sorted by frequency and built regex patterns for some of them to extract a stripped prompt and the related constraint.
* **Split stories**: after removing the first and last sentences of a story, we applied a sentence sliding window to get summaries of the story's middle.
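The "tabular info about the sample tags" in the pre-processing step amounts to one 0/1 column per tag. Here is a small stdlib-only sketch of that idea; `TAG_RE` and `tag_flags` are illustrative names, and the regex is a simplified variant of the one used later in the notebook.

```python
import re

# Matches a two-character tag in brackets, e.g. "[ WP ]", "(cw)" or "{EU}".
TAG_RE = re.compile(r"[\(\{\[]\s*(\w{2})\s*[\]\}\)]")


def tag_flags(prompt, known_tags=("wp", "cw", "eu")):
    """Return a {tag: 0/1} mapping saying which known tags appear in the prompt."""
    found = {tag.lower() for tag in TAG_RE.findall(prompt)}
    return {tag: int(tag in found) for tag in known_tags}


flags = tag_flags("[ CW ] Kill the writer in first-person narrative .")
print(flags)
```

Each key of the returned mapping corresponds to one tag column in the merged DataFrame.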

## Get data from Kaggle


In [None]:
# helper functions
import json


def save_credentials(d):
    with open("/root/.kaggle/kaggle.json", "w") as outfile:
        json.dump(d, outfile)

In [None]:
# uncomment the following instructions, in case you want to save a .kaggle.json
# d = {}
# d['username'] = 'user'
# d['key'] = 'key'
#!mkdir ~/.kaggle
# save_credentials(d)
!mv ~/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

mv: cannot stat '/mnt/home/fabraz/kaggle.json': No such file or directory


In [None]:
#!pip install kaggle

In [None]:
!kaggle datasets download -d ratthachat/writing-prompts

/bin/bash: kaggle: command not found


In [None]:
!unzip writing-prompts.zip

Archive:  writing-prompts.zip
  inflating: writingPrompts/README   
  inflating: writingPrompts/test.wp_source  
  inflating: writingPrompts/test.wp_target  
  inflating: writingPrompts/train.wp_source  
  inflating: writingPrompts/train.wp_target  
  inflating: writingPrompts/valid.wp_source  
  inflating: writingPrompts/valid.wp_target  


## Pre-processing

In [1]:
import pandas as pd
from IPython.display import display, HTML

In [3]:
# helper functions
import re


def load_file(path, names):
    with open(path, "r") as f:
        lines = f.readlines()
    return pd.DataFrame(lines, columns=names)


def load_data():
    tags = {
        "WP": "Writing Prompt",
        "SP": "Simple Prompt",
        "EU": "Established Universe",
        "CW": "Constrained Writing",
        "TT": "Theme Thursday",
        "PM": "Prompt Me",
        "MP": "Media Prompt",
        "IP": "Image Prompt",
        "PI": "Prompt Inspired",
        "OT": "Off Topic",
        "RF": "Reality Fiction",
    }

    dfConcat = pd.DataFrame()
    for split in ["train", "valid", "test"]:
        df = load_file(f"writingPrompts/{split}.wp_source", ["prompt"])
        for tag in tags.keys():
            df[tag.lower()] = df["prompt"].map(lambda x: check_tag(x, tag.lower()))
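        # NOTE: [2, -1] selects only two columns ("sp" and "rf"), so this is not a count
        # over all tags; df.iloc[:, 1:] at this point would sum every tag column.
        df["tagCounter"] = df.iloc[:, [2, -1]].sum(axis=1)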
        df["splitLineIndex"] = df.index
        story = load_file(f"writingPrompts/{split}.wp_target", ["story"])
        df["story"] = story["story"]
        df["split"] = split
        dfConcat = pd.concat([dfConcat, df])
    return dfConcat


def check_tag(item, tag):
    r = re.compile(r"[\(\{\[]\s*[\w]{2}\s*[\]\}\)]\s*")
    m = r.findall(item.lower())
    if len(m) > 0:
        for group in m:
            if tag in group:
                return 1
    return 0


def show_data(df):
    html_string = "<"
    html_string += "html><"
    html_string += "head><title>HTML Pandas Dataframe with CSS</title></head"
    html_string += "><"
    html_string += 'link rel="stylesheet" type="text/css" href="df_style.css"/'
    html_string += "><"
    html_string += """body>
                    {table}
                  </body>
                </html
                """
    html_string += ">"
    # raw string avoids invalid escape sequences such as "\<"
    df = df.replace(r"<newline>|< newline >|<new line>", "\n", regex=True)
    df.style.set_properties(**{"text-align": "left"}).set_table_styles(
        [dict(selector="th", props=[("text-align", "left")])]
    )
    html = df.to_html()
    html_string = html_string.format(table=html)
    html_string = (
        html_string.replace(r"\n", "<br>")
        .replace("<td>", '<td style="text-align:left">')
        .replace("<th>", '<th style="text-align:left">')
    )
    display(HTML(html_string))


def get_samples(df, n, constraint=None, show=True):
    # Take the first n rows (a Series only accepts a single .iloc indexer).
    head = df.iloc[:n]
    samples = zip(head.index, head["prompt"], head["story"])
    df = pd.DataFrame(samples, columns=["index", "prompt", "story"])
    if constraint is not None:
        df = df[df["prompt"].str.contains(constraint)]
    return df

In [4]:
!head -n2 writingPrompts/test.wp_source

[ WP ] Leonardo DiCaprio in a fit of rage begins to torpedo his own career by deliberately acting poorly and taking on bad films . He finally wins an oscar for starring in Paul Blart : Mall Cop 3 .
[ CW ] Kill the writer in first-person narrative .


In [5]:
ds = load_data()

In [6]:
ds.head(3)

Unnamed: 0,prompt,wp,sp,eu,cw,tt,pm,mp,ip,pi,ot,rf,tagCounter,splitLineIndex,story,split
0,[ WP ] You 've finally managed to discover the...,1,0,0,0,0,0,0,0,0,0,0,0,0,"So many times have I walked on ruins , the rem...",train
1,"[ WP ] The moon is actually a giant egg , and ...",1,0,0,0,0,0,0,0,0,0,0,0,1,"-Week 18 aboard the Depth Reaver , Circa 2023-...",train
2,[ WP ] You find a rip in time walking through ...,1,0,0,0,0,0,0,0,0,0,0,0,2,"I was feckin ' sloshed , mate . First time I e...",train


In [7]:
print(ds.shape)

(303358, 16)


In [8]:
ds[ds["split"] == "test"].iloc[:2, [13, 0, 14, -1]].columns

Index(['splitLineIndex', 'prompt', 'story', 'split'], dtype='object')

### Samples

#### Train

In [9]:
show_data(ds[ds["split"] == "train"].iloc[:2][["splitLineIndex", "prompt", "story", "split"]]);

Unnamed: 0,splitLineIndex,prompt,story,split
0,0,"[ WP ] You 've finally managed to discover the secret to immortality . Suddenly , Death appears before you , hands you a business card , and says , `` When you realize living forever sucks , call this number , I 've got a job offer for you . ''","So many times have I walked on ruins , the remainings of places that I loved and got used to.. At first I was scared , each time I could feel my city , my current generation collapse , break into the black hole that thrives within it , I could feel humanity , the way I 'm able to feel my body.. After a few hundred years , the pattern became obvious , no longer the war and damage that would devastate me over and over again in the far past was effecting me so dominantly . It 's funny , but I felt as if after gaining what I desired so long , what I have lived for my entire life , only then , when I achieved immortality I started truly aging . 5 world wars have passed , and now they feel like a simple sickeness that would pass by every so often , I could no longer evaluate the individual human as a being of its own , the importance of mortals is merely the same as the importance of my skin cells ; They are a part of a mechanism so much more advanced , a mechanism that is so dear to my fallen heart a mechanism that I have seen fall and rise so many times , a mechanism that when lost all of which it had , had me loosing my will to live , for the first time in all of my thousands years of existence . Acceptance , something so important . a skill that has proved itself worthy dozens of times , an ability that looks so easy to achieve , a gift , that I was n't able to aquire in all my years , until now . When the ashes on the ground flew into the now empty air upon humanity 's fall , I felt as if all of it 's weight was crushing me . 
Ignorance took over and I searched years for a hope , a sign of the very same patterns that I used to watch reappear every hundred years , the very core of my will to exist that was now no more that I so strongly wish was . If you have ever wondered if silence can drive people crazy , it can.. I ca n't feel my legs , I have walked for days , just to hear the sound of gravel , crushed bones , crushed buildings and crushed civilizations under my steps to keep my sanity.. until I remembered , the day in my far past . The day of my rebirth , I took out of my pocket a small plastic box , with nine buttons and a small glass window . I could n't believe this was our past , I could n't believe how far we have been able to progress and yet , be destroyed by our own violence . I slowly dialed the number I was given , exactly 1729 years ago . I dropped a tear , a tear that was too slow to hit the ground as I got sucked into the darkness that emerged around me . A chill went through my spine as I saw my destiny rise above me , I could see the white teeth under the dark cloack ... `` You have finally arrived '' He projected into my mind , with the most chilling cold and unhuman voice . `` I 'm ready to obey '' I answered . I knew who was sitting infront of me , and it was time for me to obey him , after all these years of playing god , even I came to it . Funny is n't it ? Even by achieving immortality , death , is inescapable .",train
1,1,"[ WP ] The moon is actually a giant egg , and it has just started to hatch .","-Week 18 aboard the Depth Reaver , Circa 2023- I walk about the dull gray halls , the artificial gravity making my steps feel almost as if they were on land . Almost . I glance out a window as I pass it by . There 's the sun , and there 's the moon right there . And , of course , there 's the Earth . I kinda miss it . Then again , space is pretty cool . It 's got some brilliant views , and the wifi is surprisingly good . Even countless miles away from the Earth , I can crush Silver noobs on CS GO . I pass by Dale Malkowitz , the head scientist on board . `` Evening , Dale , '' I say . `` What up , Danny ? '' he replies cordially . `` Nothin ' much . A little bored , I guess . '' He shakes his head in disbelief . `` I really , *really* do n't understand how you can be bored in space . '' `` Well hey , '' I say slightly defensively , `` Aside from the views , it 's kinda ... dull . And empty . And stuff . '' `` Whatever you say , Wittell , '' he says , not unkindly . Then he walks off . A few moments pass , and then I decide to look out the window right by me . As my eyes scan the inky blackness of space ( again ) , I notice something odd about the moon 's surface . It 's slightly ... cracked . `` Hey , Malkowitz ? '' I call out , `` You might wan na check this out ! '' He walks over to me casually , probably expecting nothing . `` What ? '' he asks , `` What do you see ? '' I point at the moon . His brow furrows . `` Huh ... I guess there 's something up with the surface . I 'll have to look into tha- '' Suddenly , the surface cracks a little more . We glance at each other , and then back at the moon , and then at each other again , and then back at the moon again . `` What 's going on ? '' I ask , alarmed . He 's silent for a minute or two , mouth hanging open . Then , he calls out : `` Janice ! Terry ! Johnny ! Get over here ! Something 's up with the moon . 
'' The other crewmates enter , unsure of what to expect . As their eyes lay upon the moon 's surface cracks , they widen . And , by coincidence , more cracks appear at that very moment . And then more . And more . And more . And more ... Little bits of the moon begin to float away , torn free of the rest of the surface . We all stare , speechless . And then ... it happens . It *happens* . The side of the moon facing us is ... torn away by a ... Human ... hand ? And we see ... A giant ... human face ? ! Surprisingly , I can hear my thoughts over my racing heart . *I ca n't help but feel as if I recognize that face ... from the ... * *Internet . * Suddenly , the great face 's lips move . Of course , none of us can actually *hear* it speak , because of the laws of space and whatnot . However , I can read its lips , and it appears to be saying : `` Are you sure about that ? ''",train


#### Valid

In [10]:
show_data(ds[ds["split"] == "valid"].iloc[:2][["splitLineIndex", "prompt", "story", "split"]]);

Unnamed: 0,splitLineIndex,prompt,story,split
0,0,"[ WP ] Every person in the world undergoes a `` goodness '' test . It 's designed to give a score from 1 to 200 , where 1 is pure evil , and 200 is an angel in human body . Then the world is divided into 200 zones , where people can live among their own kind .","Clancy Marguerian , 154 , private first class of the 150+ army , sits in his foxhole . Tired cold , wet and hungry , the only thing preventing him from laying down his rifle and walking towards the enemy lines in surrender is the knowledge that however bad he has it here , life as a 50-100 POW is surely much worse . He 's fighting to keep his eyes open and his rifle ready when the mortar shells start landing near him . He hunkers lower . After a few minutes under the barrage , Marguerian hears hurried footsteps , a grunt , and a thud as a soldier leaps into the foxhole . The man 's uniform is tan , he must be a 50-100 . The two men snarl and grab at eachother , grappling in the small foxhole . Abruptly , their faces come together . `` Clancy ? '' `` Rob ? '' Rob Hall , 97 , Corporal in the 50-100 army grins , as the situation turns from life or death struggle , to a meeting of two college friends . He lets go of Marguerian 's collar . `` Holy shit Clancy , you 're the last person I expected to see here '' `` Yeah '' `` Shit man , I did n't think I 'd ever see 'Mr . volunteers every saturday morning at the food shelf ' , not after The Reorganization at least '' `` Yeah Rob , it is something is n't it '' `` Man , I 'm sorry I tried to kill you there , hey , I heard you guys were out of food , here , you can share my dinner '' Clancy marvels , even after all this : The Reorganization , the coalitions , the war , Rob is still his old , chatty self . The two men sit , Rob chatting away , Clancy forcing out pleasantries . They pass Rob 's rations between them . 
`` Clancy my man , I heard a group of terrorist 5 's took have formed some kind of cult , and they 're rallying all the < 50 in their own coalition '' `` Oh yeah ? '' `` Yeah , I mean , that sucks and everything , cause those are some scary dudes , but I heard that there 's going to be a truce between our countries in a few days , why do n't we just hang out here , pretty soon we wo n't even be enemies anymore ! '' `` Yeah , Rob , that sounds like a plan '' `` Man , I 'm so glad I found you again , in a few days , this war will be over , and things will be cool between us and , hey , remember Sarah ? I heard she 's a 151 , maybe I 'll look her up , I 'll be sure to visit you too once I can get a pass to sector 150-155 , it 'll probably be tough though , even before the war , you had to do sooo much paperwork to be allowed to visit , I wonder if passes will even be reinstated after the truce ends , hey , did I ever tell you about the time ... '' Rob babbles as he dozes off , grinning up at Clancy . When Clancy is sure that his friend is asleep , he slits Rob 's throat with his bayonet . Clancy climbs out of the foxhole , and stumbles his way back to battalion HQ .",valid
1,1,[ WP ] Space mining is on the rise . The Space tanker Exxon Valdez 2.0 crash and spill its cargo . Write a news story covering the event .,"„… and the little duckling will never be able to walk again. ” The artificial intelligence paused a moment for dramatic effect before continuing with its broadcast with a different voice . “ What a hearth breaking story , Frank . But now to another story that may leave you feel equally dirty . The automated space tanker Exxon Valdez 2.0 collided with an asteroid on its way to the Jupiter moon Ganymede . According to the ship owner the ship is out of control and leaking its content into space. ” “ That ’ s right , Fred . And the content of the ship has it in it , as they say ” , the computer said in first voice again , “ The whole tanker was filled with ‘ biological waste products ’ coming from research and mining stations in the Kuiper Belt. ” “ Biological waste products ? You don ’ t mean ... ” “ Yes , Fred ! ” Dramatic pause . “ I am talking about poop . Lots of it . And apparently it ’ s spilling everywhere. ” “ Better call the plumbers , Frank. ” “ Not any time soon , Fred . A spokesperson of the ship owner stated and I quote – ‘ Space is kind of big and empty , we expect no one to care , so why should we ? ’ Apparently they will just build a new ship and be done with it. ” “ That ’ s one way not to deal with the problem . But why doesn ’ t the ship fly home ? Shouldn ’ t the AI on board be able to handle such a problem ? ” “ Well , the issue is that the part in charge to deal with asteroid impacts like that has been impacted by the asteroid. ” “ Ouch . Talk about a bad run. ” “ True , especially if you take the name of the ship in consideration. ” “ Oh ? Exxon Valdez 2.0 it was , isn ’ t that right , Frank ? ” “ You ’ re absolutely right , Fred . Did you know the ship was named after an infamous ship of the twentieth century back on old Earth ? 
Apparently the Exxon Valdez of old was used for transporting petroleum across the oceans of Earth . Petroleum , as some of our listeners might not know , was a brownish black , gooey liquid comprised of biological matter which was transformed under high pressure for millions of years . Quite ironically the Exxon Valdez was infamous for crashing and spilling its cargo. ” “ Well , talk about making a bad name for yourself . Now both ships will go down in history for spilling black gooey stuff where it doesn ’ t belong . Who had that bright idea for such a name anyway ? ” “ Well , Fred , the company made its first plunder by holding a naming contest on the internet. ” “ Oh , will they ever learn ? ” “ Apparently not , Fred . Predictably someone tried to make a joke out of it . A niche side of history role players got wind of the contest and made it its goal to get it named after the infamous Exxon Valdez . Apparently they thought it would be funny , and given the content both ships were ferrying around , they might have a point. ” “ Funny , indeed , Frank . What ’ s the name of the side ? ” “ Well , Fred , it ’ s called Reddit . The people there mostly talk in outdated lingo and memes and watch cat pictures back from a time when the internet only was local on Earth. ” “ Truly a herald of the dark ages. ” “ You might be right about that , Fred . I assume they just thought it was funny . I guess this happens , when you let the internet decide on things. ” “ Well , Frank , when you think about the content both ships were ferrying around , they might have been right . Embarrassing for the company , but funny for everyone else. ” “ It might get worse than that , Fred . Environmentalists are up in arms . They claim that the human waste products spilling out of the ship might collide with Jupiter ’ s moon Europa within the next few millennia and might contaminate the biospheres with Earth life . 
Apparently there are a lot of bacteria and the likes in poop and some might be able survive the harsh conditions of space and end up impacting on the restricted moon. ” “ Oh dear , Frank , does the Monolith know about it yet ? I am sure it won ’ t let us hear the end of it. ”",valid


#### Test

In [11]:
show_data(ds[ds["split"] == "test"].iloc[:2][["splitLineIndex", "prompt", "story", "split"]]);

Unnamed: 0,splitLineIndex,prompt,story,split
0,0,[ WP ] Leonardo DiCaprio in a fit of rage begins to torpedo his own career by deliberately acting poorly and taking on bad films . He finally wins an oscar for starring in Paul Blart : Mall Cop 3 .,"The wet marble floor pressed on his cheek like a thousand hands slapping his face frozen in time . Smattering piss of rain ignored his indignant mumblings . His eyes fluttered . Pins and needs ran from finger to shoulder as he pushed back against the floor , contorting his aching body into a cross legged position . Last night was bad . He gathered that . His routine dullness of though crept inwards from the edges of his mind toward the black mist that veiled his most recent memories . He struggled to recall whatever he could n't recall but only for a moment before he decided it probably was n't worth the effort . He glanced around the room for a few minutes before concluding that he probably did n't know where he was . His investigation was n't entirely fruitless , he discovered a mostly full bottle of vodka . It was cheap but would definitely get the job done . Taking a few swigs made it childishly easy to ignore that gigantic black cloud of fog blotting out whatever the hell he did before he woke up . There was a mirror in the room and for want of anything more interesting to study he gazed at himself . It was a game he 'd play with himself , glancing at the mirror and seeing if he could recognize the person looking back . If he did n't know better he 'd have guessed he was a very successful mattress salesman , or perhaps a bum who had managed to score some luck gambling . His face was portly and unshaven , in that limbo place where it had been too many days without being clean and too few days to become a beard . His stomach was round but firm , like a basketball stuffed under a shirt and then semi deflated . The hair was long and unruly , receding far into the past . But his eyes were the giveaway . 
Looking closely enough at them he could still see an intensity . It was n't the sharp kind he carried in his youth but rather like a rusted dagger . Still sharp enough to cut . `` DiCaprio . '' The curse rasped out of him in a choke . After all these years spent working on the hallmark channel and tv series based on mediocre movies he was still there . Despite his best efforts to bury himself under all of the alchol and drugs he was still in there . He thought for sure after the bankruptcy he 'd be done , but no that god damned rerelease of Titanic the royalties started pouring in and he could n't get rid of the money . Not even the live action version of the nut job could destroy him . Cursing he hurled the bottle at the mirror but his wet hands slipped and instead of a shattering crash there was only a thud as the bottle bounced off the dry wall and rolled on the floor . His rage thwarted by his impotence he slumped against the floor and finally noticed why there was rain coming into this room . The window was smashed . He looked at the bottle , confused . No , he had n't done that . At least not with the vodka . He looked back at the glass etched around the window sill and his eyes hung on the red that stained the jagged teeth . The headache crept back towards the front of his mind while the bloody glass pinned his eyes in place . What the fuck happened last night ?",test
1,1,[ CW ] Kill the writer in first-person narrative .,"It 's been three days since my boyfriend pissed off the neighbors . They had to be pissed , he called the police on them . The neighbors had been harboring a runaway criminal . We did n't live in a bad neighborhood , there were families and good people living here with solid steady jobs . They cared about their yards and such . But , there was a bad egg , our neighbors to the south of us were shady . We could hear them yelling at their dog many times a week . Strange smoke often came out of their house , and the lights in the garage were on at odd hours . We never had proof until now that our concerns are legitimate . The car the escaped criminal was driving had been parked at the neighbor 's house and my boyfriend decided he should turn them in . This lead to the police parking in front of *our* house , and watching them through our bedroom window for hours until they caught him . They had to know it was us . And it freaked me out . I had started tucking my pink taser in my jacket pocket when I took my miniature Yorkie out to go potty . My neighbor to the north , Jay , seemed to notice my tension , so when he saw me step outside , he 'd come out and chat with me . He 'd ask me about work , and talk to me about his latest construction jobs . Jay always pretend to be grabbing something out of his massive pick-up truck . It usually followed the same pattern - he grabs something out of his truck , sees me out with my dog , then starts in on how it baffles him how such a tiny dog was smarter than most of the people he worked with . We 'd both gripe about our jobs and laugh about stupid customers , chase the puppy down when she tried to go after squirrels , and then part ways until the next potty break . The sun was beginning to set when my dog started doing her potty dance by the door . I put on my jacket , slipped my taser in my pocket , and opened the door . 
She bolted out the door and went straight for the squirrel sniffing around the sidewalk . `` NO ! BAD GIRL , COME HERE ! '' The squirrel started running across the road and her tiny legs skittered out of it . I ran after her , swearing as I tripped over a crack in the road . I felt a snap in my ankle and I went down . The roar of a large pick-up engine was too close and I did n't know what to look at - my little dog bouncing across the neighbor 's lawn , or the tires that were n't slowing down fast enough . I chose neither and closed my eyes . The last thing I heard was the clatter of of work boots and Jay voice cracking , `` Oh god , oh god , oh god ... ''",test


## Augmentation 

In [12]:
from tqdm import tqdm

### Triage Prompts

1. Take the list of prompts ordered by frequency
2. Define regex patterns for prompt and constraint
3. Generate prompts
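The first two steps can be sketched on toy data with the standard library alone. This is an illustration under stated assumptions, not the notebook's code: the prompts are made up, the threshold is lowered from the notebook's "more than 20 stories" to suit the toy data, and `PROMPT_RE` / `CONSTRAINT_RE` are single simplified alternatives rather than the full pattern lists.

```python
import re
from collections import Counter

# Toy data; the notebook works on the full prompt column instead.
prompts = [
    "[ WP ] In 100 words or less , describe a sunset .",
    "[ WP ] In 100 words or less , describe a sunset .",
    "[ WP ] A one-off prompt .",
]

# Step 1: order prompts by frequency and keep the repeated ones.
MIN_RECORDS = 1  # toy threshold; the notebook keeps prompts with more than 20 stories
frequent = [p for p, n in Counter(prompts).most_common() if n > MIN_RECORDS]

# Step 2: one simplified pattern each for the prompt body and its constraint.
PROMPT_RE = re.compile(r"In 100 words or less , ([. \w,]+)\.")
CONSTRAINT_RE = re.compile(r"(In 100 words or less)")


def triage(prompt):
    """Return (stripped_prompt, stripped_constraint); either may be None."""
    body = PROMPT_RE.search(prompt)
    constraint = CONSTRAINT_RE.search(prompt)
    return (
        body.group(1).strip() if body else None,
        constraint.group(1).lower() if constraint else None,
    )


results = [triage(p) for p in frequent]
print(results)
```

The two extracted values are what the templates call `stripped_prompt` and `stripped_constraint`.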

In [13]:
df_rep = ds.groupby(["prompt", "split"]).size().reset_index().rename(columns={0: "records"})

In [14]:
df_rep = df_rep[df_rep["records"] > 20].sort_values(["records"], ascending=False)
# _str = df_rep[df_rep['records']>20].sort_values(['records'], ascending=False).iloc[1,0]

In [17]:
topPrompts20Reps = df_rep[df_rep["records"] > 20].sort_values(["records"], ascending=False)["prompt"].tolist()

In [19]:
topPrompts20Reps[:5]

['[ WP ] Write the letter that you always wanted to , but never did .\n',
 "[ WP ] There is no prompt . Just write a story you 've always been thinking about or one you 've been thinking about sharing . Anything goes .\n",
 "[ WP ] This is the prologue ( or the first chapter ) of the novel you 've always wanted to write .\n",
 '[ WP ] Write a short story where the first sentence has 20 words , 2nd sentence has 19 , 3rd has 18 etc . Story ends with a single word .\n',
 "[ WP ] Killing Hitler has become a sport amongst time travelers . Points are awarded for creativity and difficulty . You are last year 's champion , how did you win ?\n"]

In [15]:
# df_rep[df_rep["split"] == "valid"].iloc[1:3, 0]
# topPrompts20Reps += df_rep[df_rep["split"] == "valid"].iloc[1:3, 0].to_list()

In [21]:
print(f"We found {len(topPrompts20Reps)} prompts having more than 20 stories")

We found 1015 prompts having more than 20 stories


In [22]:
PROMPT_PATTERNS = "(Lucifer\snever[\s\w,]+)|\
([\. \w,]+)\.\s+Tell me|\
(All injuries[\. \w,]+)\.|\
(?<!\])(At your[\. \w,]+)\.|\
Daily Prompt \: ([\. \w,]+)|\
In 100 words or less , ([\. \w,]+)\.|\
(Last words/thoughts[\. \w,]+)\.|\
(Magic is Hereditary.*) \[|\
word limit (\) [\. \w,\/]+) \.|\
(Make me love the person you love)|\
(Pack a punch) in 150 words|\
(The last man on earth[\. \w,\/]+kill himself)|\
(The year is 2352 [\. \w,\/'-]+)\.|\
(A person dies[\. \w,\/]+)\.?|\
^[wW]rite a story([\. \w,\/]+) |\
^[wW]rite about ([\. \w,\/-]+)\.?|\
^Writing Prompt (?:\: [wW]rite|\
\[ WP \]) ([\. \w,\/']+) ?|\
^(You 're a[\. \w,\/']+)|\
(You 're moments[\. \w,\/']+)\.|\
(Describe the room you [\. \w\/']+)|\
 (Get me hooked \. [ \w,\/']+)|\
[\. \w\/',\`]+ , (tell a horror story)|\
(Make me cry)|\
(Make me hate your character)|\
(Most responses on here have a twist[\. \w\/',\`;]+)|\
(Pick your favorite[\(\)\. \w\/',\`;]+beginning)|\
(Start your story[\(\)\. \w\/',\`;]+meanings \.)|\
(The [\. \w\/',\`;]+ reader)|\
(Two people[\. \w,\/']+bench)|\
Write (a gruesome story)|\
Write (a möb[\. \w,\/']+story) that|\
(Write the letter [ ,\w]+) |\
There is no prompt[ \.\w]+(you[ \.\w']+\.)|\
(A peaceful alien race[ \.\w'-]+)\.|\
(This is the prologue[\(\) \.\w'-]+)\.|\
Write a short story where (the first[\(\) \.\w'-,]+)\.|\
(Write the first and last paragraph[\(\) \.\w'-,]+)\.|\
(Killing Hitler has[\(\) \.\w'-,\?]+)|\
(You live in a city full[\(\) \.\w'-,\?\#]+)|\
\`\` She said she loved him . [\`'\(\) \.\w'-,\?\#]+\.|\
(A soldier on the front dies[\(\) \.\w'-,\?\#]+)|\
(You discover a grand hall[\(\) \.\w'-,\?\#]+)|\
(A boy asks a girl out . It 's high[\(\) \.\w'-,\?\#]+)|\
(When everyone turns 18 , they receive a pet[\(\) \.\w'-,\?\#]+)|\
(To get in Heaven , you have to [\/\(\) \.\w'-,\?\#]+)|\
(You are born without emotions [;\/\(\) \.\w'-,\?\#]+)|\
(You are a teenager with the ability[\`;\/\(\) \.\w'-,\?\#]+)|\
(You live in a world where every person [\`;\/\(\) \.\w'-,\?\#]+)"


CONST_PATTERNS = "Daily Prompt \: [\. \w,]+\[ ([\. \w,\:]+)|\
(In 100 words or less) , ([\. \w,\:]+) \.|\
Make a story \( ([\. \w,\:]+) |\
Pack a punch (in 150 words)|\
Describe the room you [\. \w\/']+([\. \w,\:\/]+)\.|\
Get me hooked \. Reel me in \. ([\. \w\/',\`]+)\.|\
 ([\. \w\/',\`]+) , tell a horror story|\
Make me cry ([ \w\/',\`]+).?|\
(in 150 words or less)|\
Pick your favorite[\(\)\. \w\/',\`;]+beginning \. ([ \w\/',\`]+)|\
Start your story[\(\)\. \w\/',\`;]+meanings \.([ \w\/',\`]+\.)|\
The [\. \w\/',\`;]+ reader ,([\. \w\/',\`;]+)|\
Two people[\. \w,\/']+bench \. ([\. \w,\:]+)|\
Write a gruesome story ([\. \w,\:]+)|\
Write a möb[\. \w,\/']+story (that[\. \w,\/']+)"

### Add summary columns to data

In [23]:
#!pip install spacy -qqq

We aim to augment the data as follows:
* Prompt: 
  * whole
  * + constraints
* Story:
  * whole
  * beginning
  * middle - sliding window summarized
  * end
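To make the story split concrete, here is a hedged sketch on a toy list of sentences. The names `sliding_windows` and `split_story` are illustrative, the window size of 2 is chosen only to keep the output short, and the summarization of each middle chunk is left out.

```python
from itertools import islice


def sliding_windows(items, size):
    """Yield every run of `size` consecutive items as a tuple."""
    it = iter(items)
    win = tuple(islice(it, size))
    if len(win) == size:
        yield win
    for item in it:
        win = win[1:] + (item,)
        yield win


def split_story(sentences, size=2):
    """Split a list of sentences into beginning, windowed middle chunks, and ending."""
    beginning, *middle, ending = sentences  # needs at least two sentences
    chunks = [" ".join(w) for w in sliding_windows(middle, size)]
    return beginning, chunks, ending


sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
beginning, chunks, ending = split_story(sentences)
print(beginning, chunks, ending)
```

Each middle chunk would then be passed to the summarization model to produce the `middle` value used in the templates.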

#### Summarization

In [24]:
#!pip install transformers

In [25]:
# @markdown utils
from transformers.utils.logging import set_verbosity

set_verbosity(40)  # 40 == logging.ERROR: only show errors

import warnings

# ignore hf pipeline complaints
warnings.filterwarnings("ignore", category=UserWarning, module="transformers")
warnings.filterwarnings("ignore", category=FutureWarning, module="transformers")

In [26]:
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-base-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)

In [27]:
params = {
    "max_length": 1024,
    "min_length": 8,
    "no_repeat_ngram_size": 3,
    "early_stopping": False,
    "repetition_penalty": 3.5,
    "length_penalty": 0.3,
    "encoder_no_repeat_ngram_size": 3,
    "num_beams": 4,
}  # parameters for text generation out of model

#### Interpolation

In [28]:
import spacy

In [29]:
# helper functions

import re


def extract_prompt_parts(prompt, pattern):
    """
    searches the prompt for the given pattern (case-insensitive) and returns
    the matched text, or None when nothing matches
    """
    pattern = pattern.replace("\\\n", "\\")
    if m := re.search(pattern, prompt, re.IGNORECASE):
        if len(m.groups()) > 0:
            return m.group(0)
    return None


from spacy.lang.en import English


def get_sentences(_str):
    chunks = _str.split("\n")
    sentences = []
    nlp = English()
    nlp.add_pipe("sentencizer")
    for chunk in chunks:
        doc = nlp(chunk)
        sentences += [sent.text.strip() for sent in doc.sents]
    return sentences


from itertools import islice


def window(seq, n=2):
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield " ".join(result)
    for elem in it:
        result = result[1:] + (elem,)
        yield " ".join(result)


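def extract_story_parts(story):
    """splits a story into first sentence, middle windows and last sentence"""
    sentences = get_sentences(story)
    if len(sentences) < 2:
        # too short to have a distinct beginning and ending
        return None, [], None
    beginning = sentences.pop(0)
    ending = sentences.pop(-1)
    middles = window(sentences, 4)
    return beginning, middles, ending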


def clear_prompt(prompt):
    return re.sub(r"^[Ww]rite ", "", prompt)


def get_sample_dict(split, id, text):
    return {"split": split, "splitLineIndex": id, "text": text}


def generate_instruction_diologs(df):
    dialogs = []
    # unused reverse template: User: What is this story about: {story} -> Rosey: I think it's about: {stripped_prompt}
    dialogBase = """User: write me a story about: {stripped_prompt}"""
    dialog1 = """ -> Rosey: Sure, here's a story about: {stripped_prompt}:\n{story}"""
    dialog2 = """, {stripped_constraint} -> Rosey: Sure, here's a story about: {stripped_prompt}, {stripped_constraint}:\n{story}"""
    dialog3 = """, starting with: {beggining} -> Rosey: Sure, here's a story about: {stripped_prompt}, starting with: {beggining}:\n{story}"""
    dialog4 = """, ending with: {ending} -> Rosey: Sure, here's a story about {stripped_prompt}: ending with: {ending}\n{story}"""
    dialog5 = """, where the middle of the story is about: {middle} -> Rosey: Sure, here's a story about: {stripped_prompt}, where the middle of the story is about: {middle}:\n{story}"""

    df_rep = df.groupby(["prompt"]).size().reset_index().rename(columns={0: "records"})
    df_rep.sort_values(["records"], ascending=False, inplace=True)
    pbar = tqdm()
    pbar.reset(total=len(df_rep))
    for prompt in df_rep.iloc[:, 0]:
        strippedPrompt = extract_prompt_parts(prompt, PROMPT_PATTERNS)
        if strippedPrompt is None:
            continue
        strippedPrompt = clear_prompt(strippedPrompt)
        strippedConstraint = extract_prompt_parts(prompt, CONST_PATTERNS)

        for row in df[df["prompt"] == prompt].itertuples():
            try:
                story = (
                    row.story.replace("<newline>", "\n")
                    .replace("< newline >", "\n")
                    .replace("<new line>", "\n")
                    .strip()
                )
                beginning, middles, ending = extract_story_parts(story)
                dialogBeg = dialogBase.format(stripped_prompt=strippedPrompt)
                dialog = dialogBeg + dialog1.format(story=story, stripped_prompt=strippedPrompt)
                dialogs.append(get_sample_dict(row.split, row.splitIndex, dialog))
                if strippedConstraint is not None:
                    dialog = dialogBeg + dialog2.format(
                        stripped_prompt=strippedPrompt, stripped_constraint=strippedConstraint, story=story
                    )
                    dialogs.append(get_sample_dict(row.split, row.splitIndex, dialog))
                dialog = dialogBeg + dialog3.format(stripped_prompt=strippedPrompt, story=story, beggining=beginning)
                dialogs.append(get_sample_dict(row.split, row.splitIndex, dialog))
                dialog = dialogBeg + dialog4.format(stripped_prompt=strippedPrompt, story=story, ending=ending)
                dialogs.append(get_sample_dict(row.split, row.splitIndex, dialog))
                middlesSumarizered = summarizer(middles, **params)
                for middle, sumarizedMiddle in zip(middles, middlesSumarizered):
                    # dialogs.append(dialogBeg + dialog5.format(stripped_prompt=strippedPrompt, story=story, middle=middle))
                    dialog = dialogBeg + dialog5.format(
                        stripped_prompt=strippedPrompt, story=story, middle=sumarizedMiddle[0]["summary_text"]
                    )
                    dialogs.append(get_sample_dict(row.split, row.splitIndex, dialog))
                pbar.update()
            except Exception as e:
                print(f"{row.split}/{row.splitIndex}")
                raise e
        pbar.refresh()
    return dialogs


def filter_data(
    dataset,
    negativeTagFilter=None,
    positiveTagFilter=None,
    patternFilter=None,
):
    """
    > filter_data(dataset['train'],negativeTagFilter=['ip'], positiveTagFilter=['pm'] )
    """
    prompt = dataset["prompt"]
    if negativeTagFilter is not None:
        prompt = prompt[(prompt[negativeTagFilter] < 1).any(axis=1)]
    if positiveTagFilter is not None:
        prompt = prompt[prompt[positiveTagFilter].gt(0).all(axis=1)]
    if patternFilter is not None:
        prompt = prompt[prompt["prompt"].str.contains(patternFilter)]
    story = dataset["story"]
    story = story.iloc[prompt.index]
    return {"prompt": prompt, "story": story}


def generate_instruction_diologs(prompt, df):
    dialogs = []
    # unused reverse template: User: What is this story about: {story} -> Rosey: I think it's about: {stripped_prompt}
    dialogBase = """User: write me a story about: {stripped_prompt}"""
    dialog1 = """ -> Rosey: Sure, here's a story about: {stripped_prompt}:\n{story}"""
    dialog2 = """, {stripped_constraint} -> Rosey: Sure, here's a story about: {stripped_prompt}, {stripped_constraint}:\n{story}"""
    dialog3 = """, starting with: {beggining} -> Rosey: Sure, here's a story about: {stripped_prompt}, starting with: {beggining}:\n{story}"""
    dialog4 = """, ending with: {ending} -> Rosey: Sure, here's a story about {stripped_prompt}: ending with: {ending}\n{story}"""
    dialog5 = """, where the middle of the story is about: {middle} -> Rosey: Sure, here's a story about: {stripped_prompt}, where the middle of the story is about: {middle}:\n{story}"""

    strippedPrompt = extract_prompt_parts(prompt, PROMPT_PATTERNS)
    if strippedPrompt is not None:
        strippedPrompt = clear_prompt(strippedPrompt)
        strippedConstraint = extract_prompt_parts(prompt, CONST_PATTERNS)
        pbar = tqdm(ascii=True, desc="stories")
        pbar.reset(total=len(df[df["prompt"] == prompt]))
        for row in df[df["prompt"] == prompt].itertuples():
            try:
                story = (
                    row.story.replace("<newline>", "\n")
                    .replace("< newline >", "\n")
                    .replace("<new line>", "\n")
                    .strip()
                )
                dialogBeg = dialogBase.format(stripped_prompt=strippedPrompt)
                dialog = dialogBeg + dialog1.format(story=story, stripped_prompt=strippedPrompt)
                dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                if strippedConstraint is not None:
                    dialog = dialogBeg + dialog2.format(
                        stripped_prompt=strippedPrompt, stripped_constraint=strippedConstraint, story=story
                    )
                    dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                beginning, middles, ending = extract_story_parts(story)
                if beginning is not None:
                    dialog = dialogBeg + dialog3.format(
                        stripped_prompt=strippedPrompt, story=story, beggining=beginning
                    )
                    dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                    dialog = dialogBeg + dialog4.format(stripped_prompt=strippedPrompt, story=story, ending=ending)
                    dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                    # materialize the windows so the lazy generator is consumed only once
                    middles = list(middles)
                    middlesSumarizered = summarizer(middles, **params) if middles else []
                    for middle, sumarizedMiddle in zip(middles, middlesSumarizered):
                        # each result may be a dict or a one-element list of dicts
                        if isinstance(sumarizedMiddle, list):
                            sumarizedMiddle = sumarizedMiddle[0]
                        # dialogs.append(dialogBeg + dialog5.format(stripped_prompt=strippedPrompt, story=story, middle=middle))
                        dialog = dialogBeg + dialog5.format(
                            stripped_prompt=strippedPrompt, story=story, middle=sumarizedMiddle["summary_text"]
                        )
                        dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                pbar.update()
            except Exception as e:
                print(f"{row.split}/{row.splitLineIndex}")
                raise e
            pbar.refresh()
    return dialogs

### Generate 

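The loop below processes the prompts in batches of `step` and rewrites the parquet file after each batch, so an interruption loses at most the batch in progress.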

In [None]:
## filter dataset to take only prompts with frequency greater than 20 stories.
dialogs = []
i = 0
start = 0
step = 10
for index in range(start, len(topPrompts20Reps), step):
    pbar = tqdm(ascii=True, desc="prompt")
    pbar.reset(total=len(topPrompts20Reps[index : index + step]))
    for prompt in topPrompts20Reps[index : index + step]:
        tmpDialogs = generate_instruction_diologs(prompt, ds)
        if tmpDialogs is not None:
            dialogs += tmpDialogs
        pbar.update()
    if len(dialogs) > 0:
        pd.DataFrame(dialogs).to_parquet("writing-prompts-aug.parquet")
    pbar.refresh()
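
The batch-and-checkpoint pattern used in the cell above can be isolated as a small helper. The sketch below is an illustration only: `process_in_batches`, `process` and `save` are hypothetical names, and the `save` callback stands in for the `to_parquet` call.

```python
def process_in_batches(items, step, process, save):
    """Process items in batches of `step`, saving all results after each batch."""
    results = []
    for start in range(0, len(items), step):
        for item in items[start : start + step]:
            batch = process(item)
            if batch is not None:
                results += batch
        if results:
            # checkpoint: persist everything accumulated so far
            save(list(results))
    return results


# toy usage: "processing" a prompt yields two samples,
# and "saving" records how many rows each checkpoint holds
snapshots = []
results = process_in_batches(
    ["p1", "p2", "p3"],
    step=2,
    process=lambda p: [p + "-a", p + "-b"],
    save=lambda rows: snapshots.append(len(rows)),
)
```

Because every checkpoint rewrites the full result set, the cost of saving grows with the number of samples. Writing one file per batch would be an alternative if that becomes a bottleneck.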

In [None]:
df = pd.read_parquet("writing-prompts-aug.parquet")

In [None]:
for split in df["split"].unique():
    # drop the split column and the old index, then write one file per split
    df_aux = df[df["split"] == split].drop(columns="split").reset_index(drop=True)
    df_aux.to_parquet(f"{split}.parquet")