# Build Story Qualtrics Block
This notebook cleans the story set for the study and prepares it for easy import into the Qualtrics survey for study 1.

Details about the story dataset can be found [here](https://github.com/MWiechmann/NAI_story_data).

In short: the story set for the study consists of 320 stories generated by Euterpe (a model offered by [Novel AI (NAI)](https://novelai.net/#/), based on Fairseq GPT-13B).

Stories were generated with [nrt](https://github.com/wbrown/novelai-research-tool) by giving Euterpe a short prompt to establish the genre (High Fantasy, Hard Sci-Fi, Historical Romance, or Horror). For each story, Euterpe ran through 30 generations with a maximum length of 40 characters, using one of NAI's 8 default presets (presets are NAI's sets of recommended parameter settings). The presets used for this dataset included every default preset except ProWriter (which was not a default preset when I sampled the stories) and Moonlit Chronicler (nrt did not yet support top-A sampling). The result was 320 stories (40 per preset) that each take about 5 minutes to read.

## Read In Story Data

In [7]:
import pandas as pd
import itertools, re

story_df = pd.read_csv("NAI_story_data/NAI_story_data.csv", index_col=0).reset_index(drop=True)

story_df.head(3)

Unnamed: 0,prompt_label,preset_label,result,prompt,memory,authors_note,responses,model,prefix,temperature,...,bad_words_ids,logit_bias,ban_brackets,use_cache,use_string,return_full_text,trim_spaces,num_logprobs,generate_until_sentence,order
0,High Fantasy,Ace of Spade (14/02/2022),There were no signs of life anywhere around t...,The sun was high in the sky when they arrived ...,[ Author: ; Tags: ; Genre: High Fantasy ],,[' There were no signs of life anywhere around...,euterpe-v2,vanilla,1.15,...,[],[],True,True,False,False,True,5,True,"[3, 2, 1, 0]"
1,High Fantasy,Ace of Spade (14/02/2022),"\n""That's not good,"" said Mira as she gazed ar...",The sun was high in the sky when they arrived ...,[ Author: ; Tags: ; Genre: High Fantasy ],,"['\n""That\'s not good,"" said Mira as she gazed...",euterpe-v2,vanilla,1.15,...,[],[],True,True,False,False,True,5,True,"[3, 2, 1, 0]"
2,High Fantasy,Ace of Spade (14/02/2022),"They rode past without slowing.\n""It looks ab...",The sun was high in the sky when they arrived ...,[ Author: ; Tags: ; Genre: High Fantasy ],,"[' They rode past without slowing.\n""It looks ...",euterpe-v2,vanilla,1.15,...,[],[],True,True,False,False,True,5,True,"[3, 2, 1, 0]"
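
Before cleaning, it may be worth checking that the design is balanced, i.e. that every preset/genre combination holds the same number of stories. The snippet below is a minimal sketch on a toy frame: `toy_df` and its preset names are made up for illustration, and only the column names `preset_label` and `prompt_label` come from the data above.

```python
import pandas as pd

# Toy stand-in for story_df with made-up preset names (assumption, not real data)
toy_df = pd.DataFrame({
    "preset_label": ["Preset A", "Preset A", "Preset B", "Preset B"],
    "prompt_label": ["Horror", "High Fantasy", "Horror", "High Fantasy"],
})

# Rows: presets, columns: genres, cells: number of stories
design_counts = pd.crosstab(toy_df["preset_label"], toy_df["prompt_label"])
print(design_counts)

# Balanced means every preset/genre cell holds the same number of stories
is_balanced = design_counts.to_numpy().min() == design_counts.to_numpy().max()
print(f"Balanced design: {is_balanced}")
```

Running the same `pd.crosstab` call on `story_df` should show one identical count in every cell if the story set is balanced.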


## Clean Story Data

In [8]:
# Put starting prompt and result together
story_df["full_story"] = story_df["prompt"] + story_df["result"]

# Find the trailing fragment after the last ., ?, ! or " in each story
incomplete_sent_pattern = r"(?<=[\.\?\!\"])[^\.\?\!\"]*$"
trailing_fragment = story_df["full_story"].str.findall(incomplete_sent_pattern).apply(lambda x: x[0])
# Mask for incomplete sentences - exclude fragments that are just ', which only closes direct speech
mask_incomp_sent = (trailing_fragment != "") & (trailing_fragment != "'")
corrected_stories = story_df["full_story"][mask_incomp_sent].str.replace(incomplete_sent_pattern, "", regex = True)

story_df.update(corrected_stories)
story_df.reset_index(drop = True, inplace=True)

# mask_asterism = story_df["full_story"].str.contains(r"⁂.*", flags = re.DOTALL)
# story_df["full_story"][mask_asterism]

# Remove everything after a ⁂, as it would indicate a new story
# Note to self: Ban the asterism token when generating stories for the next story!
story_df["full_story"] = story_df["full_story"].str.replace(r"⁂.*", "", flags = re.DOTALL, regex = True) 
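
To make the two cleaning steps easier to follow, here is a minimal sketch that applies them to single strings. It reuses the two regular expressions from the cell above; the helper name `trim_story` and the example strings are made up for illustration. Unlike indexing the `findall` result with `x[0]`, this version also tolerates text without any sentence-ending punctuation.

```python
import re

# Same pattern as in the cell above: everything after the last ., ?, ! or "
TRAILING_FRAGMENT = re.compile(r"(?<=[\.\?\!\"])[^\.\?\!\"]*$")

def trim_story(text: str) -> str:
    """Drop a dangling sentence fragment, then everything from the first asterism on."""
    fragment = TRAILING_FRAGMENT.search(text)
    # Keep a lone ' after the punctuation, since it only closes direct speech
    if fragment and fragment.group(0) not in ("", "'"):
        text = text[:fragment.start()]
    return re.sub(r"⁂.*", "", text, flags=re.DOTALL)

# Made-up examples (assumption, not taken from the story set)
examples = [
    "The door creaked open. She stepped",
    "'Run.'",
    "They rode on.\n⁂\nA new tale begins.",
]
for example in examples:
    print(repr(trim_story(example)))
```

The first example loses its unfinished last sentence, the second is left untouched, and the third is cut at the asterism.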

### Remove unusually short or long stories
After removing everything after a ⁂, some stories may have been shortened substantially. To keep story length reasonably consistent, it is probably best to remove extreme outliers.
For outlier detection, we will use Tukey's rule for extreme outliers: a story counts as an extreme outlier if its word count lies more than 3.0×IQR below Q1 or above Q3.

In [9]:
# Determining quartiles and IQR
story_df["word_count"] = story_df["full_story"].str.split().apply(len)
words_Q1 = story_df["word_count"].quantile(0.25)
words_Q3 = story_df["word_count"].quantile(0.75)
words_iqr = words_Q3-words_Q1
# Determining upper and lower bounds for outlier detection
words_outlier_lower = words_Q1 - (words_iqr*3)
words_outlier_upper = words_Q3 + (words_iqr*3)
print(f"25th Percentile (Q1): {words_Q1}\t75th Percentile (Q3): {words_Q3}\nIQR: {words_iqr}")
print(f"will sort out stories with less than {words_outlier_lower} or more than {words_outlier_upper} words.")

# Function to identify outlier stories
def determine_word_outlier(row):
    return (row["word_count"] < words_outlier_lower) or (row["word_count"] > words_outlier_upper)

# Flagging extreme outlier stories
story_df["word_outlier"] = story_df.apply(determine_word_outlier, axis = 1)
print("\nWord outliers:")
print(story_df[["preset_label", "prompt_label", "word_count"]][story_df["word_outlier"] == True])

25th Percentile (Q1): 1118.0	75th Percentile (Q3): 1249.25
IQR: 131.25
will sort out stories with less than 724.25 or more than 1643.0 words.

Word outliers:
                     preset_label prompt_label  word_count
10      Ace of Spade (14/02/2022)       Horror         138
18      Ace of Spade (14/02/2022)       Horror         281
56       All-Nighter (14/02/2022)       Horror         646
91   Basic Coherence (14/02/2022)       Horror         453
95   Basic Coherence (14/02/2022)       Horror         322
98   Basic Coherence (14/02/2022)       Horror         645
130         Fandango (14/02/2022)       Horror         668
251           Morpho (14/02/2022)       Horror         697
259           Morpho (14/02/2022)       Horror         723
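
The row-wise `apply` above works fine for 320 stories, but the same rule can also be written as a vectorized comparison. The following is a sketch only: the helper name `tukey_extreme_mask` and the toy word counts are made up for illustration.

```python
import pandas as pd

def tukey_extreme_mask(values: pd.Series, k: float = 3.0) -> pd.Series:
    """Return True where a value lies more than k * IQR below Q1 or above Q3."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive, so values exactly on a fence are not flagged
    return ~values.between(q1 - k * iqr, q3 + k * iqr)

# Made-up word counts, for illustration only
toy_counts = pd.Series([1100, 1150, 1200, 1250, 1300, 140])
print(toy_counts[tukey_extreme_mask(toy_counts)])
```

With `k = 3.0` this corresponds to the extreme-outlier rule used in this notebook; `k = 1.5` would give Tukey's milder rule for ordinary outliers.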


9 unusually short stories, all from the Horror prompt. Horror seems especially problematic, which might indicate that the model has trouble with longer horror stories.

Either way, to keep story length reasonably consistent, I will remove those in the next step. To still have the same number of stories per preset and genre, I will repeat some stories in the data set. This way, the chance for a certain genre or preset to be picked stays equal.

In [10]:
# create dictionary with count of outliers to know how often to repeat stories later
missing_stories_count = {}
outliers_df = story_df[["preset_label", "prompt_label"]][story_df["word_outlier"] == True]

for index, row in outliers_df.iterrows():
    preset = row["preset_label"]
    genre = row["prompt_label"]
    
    
    if (preset, genre) in missing_stories_count:
        missing_stories_count[(preset, genre)] += 1
    else:
        missing_stories_count[(preset, genre)] = 1

print(f"List of outliers:\n{outliers_df}")

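# delete outliers (explicit copy, since assigning Story_ID to a filtered frame later may otherwise trigger a SettingWithCopyWarning)
story_df = story_df[story_df["word_outlier"] != True].copy()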

print("\nOutliers deleted from story_df")

List of outliers:
                     preset_label prompt_label
10      Ace of Spade (14/02/2022)       Horror
18      Ace of Spade (14/02/2022)       Horror
56       All-Nighter (14/02/2022)       Horror
91   Basic Coherence (14/02/2022)       Horror
95   Basic Coherence (14/02/2022)       Horror
98   Basic Coherence (14/02/2022)       Horror
130         Fandango (14/02/2022)       Horror
251           Morpho (14/02/2022)       Horror
259           Morpho (14/02/2022)       Horror

Outliers deleted from story_df
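
As a side note, the counting loop above could also be expressed with `collections.Counter`. This is a minimal sketch on a toy frame: the preset names are made up, and only the column names are taken from the notebook.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for outliers_df (made-up preset names)
toy_outliers = pd.DataFrame({
    "preset_label": ["Preset A", "Preset A", "Preset B"],
    "prompt_label": ["Horror", "Horror", "Horror"],
})

# Count outliers per (preset, genre) pair
missing = Counter(zip(toy_outliers["preset_label"], toy_outliers["prompt_label"]))
print(missing)

# A Counter returns 0 for pairs it has never seen
print(missing[("Preset C", "Horror")])
```

Because unseen keys count as 0, a later check such as `missing[(preset, genre)] > 0` would not need a separate `in` test.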


## Build Blocks For Qualtrics Survey
I do not feel like entering all ~320 stories into Qualtrics by hand, so I will use [Qualtrics' advanced text format](https://www.qualtrics.com/support/survey-platform/survey-module/survey-tools/import-and-export-surveys/) to import the story set into the Qualtrics survey.
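
To give an idea of what the import file looks like, here is a small sketch that renders a single story with the same tags this notebook writes (`[[AdvancedFormat]]`, `[[Block:Stories]]`, `[[Question:DB]]` and `[[ID:...]]`). The helper name `format_story_question` and the example text are made up for illustration.

```python
def format_story_question(story_id: str, story_text: str) -> str:
    """Render one story as a [[Question:DB]] entry with its own ID."""
    # The notebook writes line breaks as <br> tags
    body = story_text.replace("\n", "<br>")
    return f"\n[[Question:DB]]\n[[ID:{story_id}]]\n{body}\n"

# File header and block name as used in this notebook
survey_txt = "[[AdvancedFormat]]\n\n[[Block:Stories]]\n"
survey_txt += format_story_question("ACE_HF_1", "First line.\nSecond line.")
print(survey_txt)
```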

I will be creating a unique ID for each story that also identifies the genre and preset of the story (also gets written to the dataframe).

Since I dropped some extremely short stories above, I will repeat some stories in the data set to still have the same number of stories per preset and genre. This way, the chance for a certain genre or preset to be picked stays equal.

In [11]:
# Create IDs for different prompt combinations
genre_li = story_df["prompt_label"].unique()
preset_li = story_df["preset_label"].unique()

genre_preset_li = list(itertools.product(preset_li, genre_li))

# create ID prefixes by using the first 3 letters of the preset and an identifier for the genre
# when adding more presets and/or genres, make sure IDs stay unique
story_id_dict = {}

for comb in genre_preset_li:
    id_str = comb[0][:3] + "_"
    genre = comb[1]
    if genre == "High Fantasy":
        id_str += "HF"
    elif genre == "Horror":
        id_str += "HOR"
    elif genre == "Hard Sci-fi":
        id_str += "HSF"
    elif genre == "Historical Romance":
        id_str += "HR"
        
    id_str =  id_str.upper()
    story_id_dict[id_str] = 0

# Create string to later write into a file for Qualtrics' advanced txt format
qualtrics_str = "[[AdvancedFormat]]\n\n[[Block:Stories]]\n"

for index, row in story_df.iterrows():
    
    # determine story id
    preset = row["preset_label"]
    genre = row["prompt_label"]
    
    story_id_prefix = preset[:3] + "_"
    
    
    if genre == "High Fantasy":
        story_id_prefix += "HF"
    elif genre == "Horror":
        story_id_prefix += "HOR"
    elif genre == "Hard Sci-fi":
        story_id_prefix += "HSF"
    elif genre == "Historical Romance":
        story_id_prefix += "HR"
    
    story_id_prefix = story_id_prefix.upper()
    
    story_id = story_id_prefix + "_" + str(story_id_dict[story_id_prefix]+1)
    
    # increase counter for id
    story_id_dict[story_id_prefix] += 1
    
    # Write to qualtrics string
    qualtrics_str += "\n[[Question:DB]]"
    qualtrics_str += "\n[[ID:" + story_id + "]]\n"
    qualtrics_str += row["full_story"].replace("\n","<br>")
    qualtrics_str += "\n"
    
    # if there is a lack of stories of this type due to outliers, repeat stories
    if (preset, genre) in missing_stories_count:
        if missing_stories_count[(preset, genre)] > 0:
            qualtrics_str += "\n[[Question:DB]]"
            qualtrics_str += "\n[[ID:" + story_id + "_rep]]\n"
            qualtrics_str += row["full_story"].replace("\n","<br>")
            qualtrics_str += "\n"
            
            missing_stories_count[(preset, genre)] -= 1
    
    # Also record story id in dataframe
    story_df.loc[index, "Story_ID"] = story_id
    
# Use story ID as index
story_df.set_index("Story_ID", inplace = True)
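
The prefix logic appears twice in the cell above. One way to keep both places in sync would be a small helper like the one sketched here. The names `GENRE_CODES` and `make_id_prefix` are made up; the genre labels and codes are copied from the `if`/`elif` chain above.

```python
# Genre labels and codes as used in the if/elif chain above
GENRE_CODES = {
    "High Fantasy": "HF",
    "Horror": "HOR",
    "Hard Sci-fi": "HSF",
    "Historical Romance": "HR",
}

def make_id_prefix(preset: str, genre: str) -> str:
    """First three letters of the preset plus a genre code, upper-cased."""
    # Raises KeyError for an unknown genre label instead of silently
    # producing a prefix without a genre code
    return f"{preset[:3]}_{GENRE_CODES[genre]}".upper()

print(make_id_prefix("Ace of Spade (14/02/2022)", "Horror"))  # ACE_HOR
```

Note the one difference in behavior: a genre label that is not in the dictionary raises an error here, whereas the `if`/`elif` chain would quietly leave the genre code out of the ID.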

## Save Data To Files
* Save the string for Qualtrics to text file
* Save cleaned story data to csv

In [12]:
# Save Qualtrics advanced txt file
with open("stories_qualtrics_advanced_txt.txt", "w", encoding='utf-8') as text_file:
    text_file.write(qualtrics_str)
    
# Save story datafile with story IDs
story_df.to_csv("NAI_story_data/NAI_story_data_for_qualtrics.csv")
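
As a final sanity check, it might be useful to count how many question blocks each preset/genre combination ended up with, since the whole point of the repeated stories is to keep those counts equal. The sketch below works on a toy string; the helper name `count_blocks_per_prefix` is made up, and it assumes that IDs follow the `PRESET_GENRE_NUMBER` pattern built above (with an optional `_rep` suffix).

```python
import re
from collections import Counter

def count_blocks_per_prefix(advanced_txt: str) -> Counter:
    """Count [[ID:...]] tags per preset/genre prefix, e.g. ACE_HOR."""
    ids = re.findall(r"\[\[ID:([^\]]+)\]\]", advanced_txt)
    if len(ids) != len(set(ids)):
        raise ValueError("Duplicate question IDs found")
    # IDs look like ACE_HOR_1 or ACE_HOR_1_rep; keep the first two parts
    return Counter("_".join(story_id.split("_")[:2]) for story_id in ids)

# Toy file content in the same layout as qualtrics_str (made-up stories)
toy_txt = (
    "[[AdvancedFormat]]\n\n[[Block:Stories]]\n"
    "\n[[Question:DB]]\n[[ID:ACE_HOR_1]]\nSome text\n"
    "\n[[Question:DB]]\n[[ID:ACE_HOR_1_rep]]\nSome text\n"
    "\n[[Question:DB]]\n[[ID:ACE_HF_1]]\nOther text\n"
)
print(count_blocks_per_prefix(toy_txt))
```

Calling the function on `qualtrics_str` should return the same count for every prefix if the repetition step filled all the gaps.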