# Responding to Writing Prompts using T5

For our project we decided to expand our knowledge of machine learning by reading papers about transformers.
The main papers were:
1. https://arxiv.org/abs/1706.03762 (Attention Is All You Need)
2. https://arxiv.org/abs/1910.10683 (Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer)

After reading these papers and understanding how a transformer works and what problems it is suited to, we decided to apply T5 to a homemade dataset of prompts and stories to see if T5 could write a good story given a prompt.

Additionally, for T5 usage, we referenced this Medium article: https://towardsdatascience.com/poor-mans-gpt-3-few-shot-text-generation-with-t5-transformer-51f1b01f843e

In this project we are using the T5 transformer to generate stories. This is a pretrained transformer that takes text as input and gives text as output. We trained the model on writing prompts and their responses from Reddit, then gave the model new prompts to see what kinds of stories it could come up with.
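
Because T5 treats every task as text in, text out, one training example is just a pair of strings. The sketch below shows how a prompt and a story could be turned into such a pair, using the `create story: ` prefix that appears in the training cell of this notebook. The helper name `make_example` and the sample story text are hypothetical, and appending an explicit ` </s>` marker is an assumption tied to the old `transformers==2.9.0` tokenizer used here.

```python
def make_example(prompt, story, prefix="create story: ", eos=" </s>"):
    """Return (source_text, target_text) for one prompt/story pair."""
    source = prefix + prompt.strip() + eos   # what the encoder reads
    target = story.strip() + eos             # what the decoder should produce
    return source, target

# Made-up story text, only to show the string format
src, tgt = make_example("The sun is dead, and we killed it.", "It was for the greater good.")
print(src)
print(tgt)
```

The task prefix is an arbitrary label: the model learns during fine-tuning to associate it with story writing.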

In [1]:
#installing necessary packages. After this, you need to restart the kernel.
#!pip install --quiet transformers==2.9.0 pmaw seaborn

In [2]:
#import necessary packages for T5

import random
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


from transformers import (
    AdamW,
    T5ForConditionalGeneration,
    T5Tokenizer,
    get_linear_schedule_with_warmup
)

def set_seed(seed):
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)

set_seed(42)

In [3]:
#creating the T5 tokenizer and loading the pretrained t5-base model
tokenizer = T5Tokenizer.from_pretrained('t5-base')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')

In [4]:
# optimizer
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in t5_model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
    {
        "params": [p for n, p in t5_model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(optimizer_grouped_parameters, lr=3e-4, eps=1e-8)

## Attention is All You Need

Attention Is All You Need is a paper published by Google researchers in 2017 that introduced the Transformer. Transformers are deep learning models that use "attention" to track relationships in sequential data. They set themselves apart from most deep learning networks of the time by not needing recurrence or convolution layers. The authors used their transformer model to translate English to German and reached state-of-the-art results. This type of model generally does much better on language tasks than RNNs and CNNs because every position can attend directly to every other position, which makes it easier to use context across long sequences. The biggest advantages of transformers are:
1. The number of sequential operations per layer is O(1), versus O(n) in an RNN
2. It is parallelizable
3. The path length between long-distance dependencies is constant

![title](Transformer.png)


### Encoder
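The encoder is where we convert the input words into vectors. An embedding layer assigns each word a vector so that similar words end up with similar vectors, and positional encodings are added so the model knows where each word sits in the sentence.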
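
To make the positional part concrete, here is a minimal sketch of the sinusoidal encoding described in Attention Is All You Need, written in plain Python. The function name `positional_encoding` and the tiny sizes are illustrative assumptions. T5 itself uses a different, relative position scheme, so this illustrates the original paper rather than the model fine-tuned in this notebook.

```python
import math

def positional_encoding(num_positions, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d_model)); PE[pos][2i+1] = cos of the same angle."""
    table = []
    for pos in range(num_positions):
        row = []
        for k in range(d_model):
            # dimensions 2i and 2i+1 share one frequency
            angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# One row per position; each row is added to that position's word embedding
pe = positional_encoding(num_positions=4, d_model=6)
for row in pe:
    print([round(v, 3) for v in row])
```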

### Multi-Head Attention
These vectors are then passed into a Multi-Head Attention layer. The layer is based on the concept of "self-attention", which finds the words that matter most to each word in a sentence. Each pair of words gets an attention score (a scaled dot product followed by a softmax), and that is how the network prioritizes words. Several attention heads compute their own attention vector for each word in parallel, and these are concatenated and projected into a final vector for each word, which is why it is called "Multi-Head".
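
The attention score can be sketched in a few lines of plain Python. This is a single head of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, without the learned projection matrices a real layer has. The function names and the toy vectors are made up for illustration.

```python
import math

def softmax(xs):
    m = max(xs)                                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, on lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional "word" vectors; self-attention uses them as Q, K and V
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(x, x, x)
print(weights[0])
```

A multi-head layer would run several copies of this on differently projected inputs and concatenate the results.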

### Decoder
The final big piece of a Transformer is the Decoder. The Decoder uses the same positional encoding as the Encoder. It predicts the next output in a sequence, and during training the model adjusts its weights based on whether that prediction was right or wrong.
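
Predicting the next output only makes sense if the decoder cannot look at later positions while training. In the original paper this is enforced with a mask on the decoder's self-attention scores. The sketch below is a plain-Python illustration of that idea; the helper names `causal_mask` and `apply_mask` and the score values are hypothetical.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def apply_mask(scores, mask, neg=float("-inf")):
    """Replace disallowed scores with -inf so a softmax gives them zero weight."""
    return [[s if ok else neg for s, ok in zip(srow, mrow)]
            for srow, mrow in zip(scores, mask)]

mask = causal_mask(3)
# Made-up raw attention scores, identical for every position
masked = apply_mask([[0.1, 0.2, 0.3]] * 3, mask)
for row in masked:
    print(row)
```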

## Web Scraping to Get Training Data
In this section we scrape data from the subreddit /r/WritingPrompts. What makes this subreddit useful is that its posts are prompts that people respond to in the comments, which gives us a set of prompt-response pairs. We will train T5 on each prompt together with however many responses it received, and then see how the model responds to a new prompt.
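
The shape of the data we are after can be sketched as follows: each post carries one prompt and a list of responses, and training needs one (prompt, response) pair per response. This is a plain-Python illustration with made-up records; the function name `to_pairs` and the dictionary keys are hypothetical and are not the column names used in the cells below.

```python
def to_pairs(posts):
    """Flatten {prompt, responses} records into (prompt, response) pairs.

    Posts with no responses contribute nothing.
    """
    return [(post["prompt"], response)
            for post in posts
            for response in post["responses"]]

# Placeholder records standing in for scraped posts
posts = [
    {"prompt": "The sun is dead, and we killed it.", "responses": ["Story A", "Story B"]},
    {"prompt": "Slowly, the moon began to fall.", "responses": []},
]
pairs = to_pairs(posts)
print(pairs)
```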

In [5]:
#importing packages for web scraping and cleaning data

from pmaw import PushshiftAPI
import praw
from datetime import date, timedelta, datetime
from time import sleep
from dateutil.relativedelta import relativedelta
from tqdm import tqdm
import pandas as pd
import seaborn as sns
import matplotlib.ticker as plticker
import json

In [6]:
subreddits=['WritingPrompts']
# Fill in your own Reddit API credentials; never commit real secrets to a notebook
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",          # your client id
                     client_secret="YOUR_CLIENT_SECRET",  # your client secret
                     user_agent="my user agent",          # user agent name
                     username="YOUR_USERNAME",            # your reddit username
                     password="YOUR_PASSWORD")            # your reddit password
api = PushshiftAPI(num_workers=8,praw=reddit)

def comments_to_json(posts):
    """Collect the title, comment count and id of each submission into a dict of columns."""
    titles=[]
    num_comments=[]
    post_ids=[]
    for post in posts:
        titles.append(post['title'])
        post_ids.append(post['id'])
        num_comments.append(post['num_comments'])
    return {'Comment':titles,'Num_Coms':num_comments,'ID':post_ids}

In [7]:
posts = api.search_submissions(subreddit="WritingPrompts", num_comments='>5',limit=100)
post_list = [post for post in posts]
# for post in post_list:
#     print(post)
posts=comments_to_json(post_list)
df=pd.DataFrame(posts)

Not all PushShift shards are active. Query results may be incomplete.


In [8]:
post_ids=list(df['ID'])
comms=[]
for post_id in post_ids:
    # look up the ids of every comment on this submission
    comment_ids = api.search_submission_comment_ids(ids=post_id)
    com_ids = [com['id'] for com in comment_ids]
    # fetch the comment bodies
    comments = api.search_comments(ids=com_ids)
    # keep only long responses (>1500 characters) that do not start with '*'
    raw_comms = [comment['body'] for comment in comments
                 if len(comment['body'])>1500 and comment['body'][0]!='*']
    comms.append(raw_comms)
df['Comments']=comms

  f'{self.limit} items were not found in Pushshift')


In [9]:
df

Unnamed: 0,Comment,Num_Coms,ID,Comments
0,[PM] Give me prompts!,51,o71peo,[The rodent’s head snapped like a dry twig.\n ...
1,"[WP] A tall, abandoned tower suddenly appears ...",4,o70qqu,[]
2,[WP] You are playing DND and roll a Nat 20 for...,7,o6zvnm,[Damin Nox looked up from the table where he a...
3,[PI] As opposed to getting rid of the creepy d...,12,o6z73h,[]
4,[WP] Write about a mundane inconvenience with ...,7,o6z11z,"[Hi u/Door_Knight, this submission has been re..."
...,...,...,...,...
95,[WP] Ancient letters reveal the 2nd Amendment ...,6,o4ua40,"[The voicemail Frank received simply stated, “..."
96,[WP] The Hero and Princess are marrying. The c...,25,o4u8yk,"[“If anyone has any objections, speak now or f..."
97,[WP] It took millions of years for faster than...,7,o4tq26,[]
98,"[SP] Slowly, the moon began to fall.",6,o4oysh,[]


In [10]:
#cleaning data
# keep only posts with at least one qualifying response; .copy() avoids SettingWithCopyWarning
df = df[df['Comments'].apply(len) > 0].copy()
df = df.drop(columns=["Num_Coms", "ID"]).rename(columns={"Comment": "Prompts"})
df = df.reset_index(drop=True)
# strip the 4-character tag (e.g. "[WP]") from the start of each title
df['Prompts'] = df['Prompts'].str[4:]


In [11]:
df

Unnamed: 0,Prompts,Comments
0,Give me prompts!,[The rodent’s head snapped like a dry twig.\n ...
1,You are playing DND and roll a Nat 20 for per...,[Damin Nox looked up from the table where he a...
2,Write about a mundane inconvenience with the ...,"[Hi u/Door_Knight, this submission has been re..."
3,You have been sentenced to death in a magical...,"[“Toddle Nozzletinker, you have been found gui..."
4,You are a secret agent searching for a crimin...,"[""God damnit command, this place has more obsc..."
...,...,...
65,The year is 2061. Technology is a must-have e...,"[The world turned to gray, then blue, red, and..."
66,You are a farmer that created a simple shrine...,"[Oswald walked along briskly, keeping a nervou..."
67,Ancient letters reveal the 2nd Amendment was ...,"[The voicemail Frank received simply stated, “..."
68,The Hero and Princess are marrying. The cerem...,"[“If anyone has any objections, speak now or f..."


In [12]:
#analyzing the scraped data
# flatten the per-prompt response lists into a single list of stories
coms = [c for sub in df['Comments'] for c in sub]
com_lens = [len(c) for c in coms]
print(f'Average length of comments: {sum(com_lens)/len(com_lens):.2f}')
print(f'Maximum length of comments: {max(com_lens)}')

prompts = df['Prompts'].tolist()
prompt_lens = [len(p) for p in prompts]
# divide by the number of prompts (not the number of comments)
print(f'Average length of prompts: {sum(prompt_lens)/len(prompt_lens):.2f}')
print(f'Maximum length of prompts: {max(prompt_lens)}')

Average length of comments: 4048.45
Maximum length of comments: 9965
Average length of prompts: 202.71
Maximum length of prompts: 296


## Finally, Some Clean Data. Now, Let's Train!

In [13]:
df_short = df.head(50)

df_short

Unnamed: 0,Prompts,Comments
0,Give me prompts!,[The rodent’s head snapped like a dry twig.\n ...
1,You are playing DND and roll a Nat 20 for per...,[Damin Nox looked up from the table where he a...
2,Write about a mundane inconvenience with the ...,"[Hi u/Door_Knight, this submission has been re..."
3,You have been sentenced to death in a magical...,"[“Toddle Nozzletinker, you have been found gui..."
4,You are a secret agent searching for a crimin...,"[""God damnit command, this place has more obsc..."
5,“I’m sorry…” the hero sobbed in front of a lo...,"[The night was overcast, rain imminent. Past m..."
6,"The sun is dead, and we killed it.","[A SHORT STORY\n\n \n""It was for the greater ..."
7,You're the town's blacksmith. A mysterious lo...,"[“Heartwood of ancient yew, a dram of powdered..."
8,"A mere month away from moving in together, yo...",[“His eyes are so dreamy.”\r \n\r \nThese we...
9,Your cranky neighbor is actually an immortal ...,"[Every day was tiring, between school and work..."


In [14]:
#training the model on 50 prompts
t5_model.train()

epochs = 10

for epoch in range(epochs):
  print ("epoch ",epoch)
  for idx in range(len(df_short)):
    for story in df_short['Comments'][idx]:
      # source: task prefix + prompt; target: one story written for that prompt
      input_sent = "create story: "+df_short["Prompts"][idx]+ " </s>"
      output_sent = story+" </s>"

      # 512 tokens is the tokenizer limit for t5-base; stories are truncated to 300 tokens
      tokenized_inp = tokenizer.encode_plus(input_sent,  max_length=512, return_tensors="pt")
      tokenized_output = tokenizer.encode_plus(output_sent, max_length=300 ,return_tensors="pt")


      input_ids  = tokenized_inp["input_ids"]
      attention_mask = tokenized_inp["attention_mask"]

      lm_labels= tokenized_output["input_ids"]
      decoder_attention_mask=  tokenized_output["attention_mask"]


      # the forward function automatically creates the correct decoder_input_ids
      output = t5_model(input_ids=input_ids, lm_labels=lm_labels,decoder_attention_mask=decoder_attention_mask,attention_mask=attention_mask)
      loss = output[0]

      # one optimizer step per (prompt, story) pair, i.e. batch size 1
      loss.backward()
      optimizer.step()
      optimizer.zero_grad()
    

epoch  0




epoch  1
epoch  2
epoch  3
epoch  4
epoch  5
epoch  6
epoch  7
epoch  8
epoch  9


## Results

Here are a few prompts and how our model responds to them.

In [16]:

p = "The sky rips open and a powerful alien comes to you and tells you to bring him the best croissant in the world, or humanity is doomed."
test_sent = "create story: "+p+ " </s>"
test_tokenized = tokenizer.encode_plus(test_sent, return_tensors="pt")

test_input_ids  = test_tokenized["input_ids"]
test_attention_mask = test_tokenized["attention_mask"]

t5_model.eval()
beam_outputs = t5_model.generate(
    input_ids=test_input_ids,attention_mask=test_attention_mask,
    max_length=500,
    early_stopping=True,
    num_beams=10,
    num_return_sequences=1,
    no_repeat_ngram_size=2
)

for beam_output in beam_outputs:
    sent = tokenizer.decode(beam_output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
    print("prompt: " + test_sent + "\n")
    print ("response: " + sent + "\n")




prompt: create story: The sky rips open and a powerful alien comes to you and tells you to bring him the best croissant in the world, or humanity is doomed. </s>

response: [Zero Waiting] "Well, that's a lie. I've never been one to think of anything like that in my life." A powerful alien came to me and told me that it would be great if it were not for the gravitational pull that led him to this place. He would probably have liked to have seen it, but he wouldn't be the first to go, or at least, to experience the plight of an alien being on the other side of the galaxy. Then again, the alien had no value beyond what the human being could bring him the best croissant in the world. "The greatest salute in history." An alien screamed to him, his voice echoing through his thoughts. A small voice came from the heavens, and the voice rang out in his head. It was an entirely different experience for him. There was nothing more to learn from his experience than what it took to send him on his 

In [20]:

test_sent = "create story: "+df["Prompts"][52]+ " </s>"
test_tokenized = tokenizer.encode_plus(test_sent, return_tensors="pt")

test_input_ids  = test_tokenized["input_ids"]
test_attention_mask = test_tokenized["attention_mask"]

t5_model.eval()
beam_outputs = t5_model.generate(
    input_ids=test_input_ids,attention_mask=test_attention_mask,
    max_length=500,
    early_stopping=True,
    num_beams=10,
    num_return_sequences=1,
    no_repeat_ngram_size=2
)

for beam_output in beam_outputs:
    sent = tokenizer.decode(beam_output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
    print("prompt: " + test_sent + "\n")
    print ("response: " + sent + "\n")

prompt: create story:  You work at the cities hospital in the unofficial "hero ward," the wing of the hospital that deals with the cities' numerous vigilantes that come in for medical treatment at odd hours of the night. One day, on a seemingly ordinary shift, twenty or so vigilantes come in at once. </s>

response: The unofficial "hero ward," the wing of the hospital that deals with the city's numerous vigilantes that come in for medical treatment. Twenty-seven people, on a seemingly ordinary shift, came in at the same time. The hospital, which housed the hospitals' dozens of patients, treated in batches of batches. Each batch consisted of: doctors, nurses, paramedics, etc. They all had to be well-trained to ensure that the patients received the proper care and treatment they needed. One was for the heroes whose names were assigned to them at random. It was also known as the "Hero Ward"; that is, if they had any doubts about whether or not the appropriate hospital is the right place f

In [21]:
test_sent = "create story: "+df["Prompts"][53]+ " </s>"
test_tokenized = tokenizer.encode_plus(test_sent, return_tensors="pt")

test_input_ids  = test_tokenized["input_ids"]
test_attention_mask = test_tokenized["attention_mask"]

t5_model.eval()
beam_outputs = t5_model.generate(
    input_ids=test_input_ids,attention_mask=test_attention_mask,
    max_length=500,
    early_stopping=True,
    num_beams=10,
    num_return_sequences=1,
    no_repeat_ngram_size=2
)

for beam_output in beam_outputs:
    sent = tokenizer.decode(beam_output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
    print("prompt: " + test_sent + "\n")
    print ("response: " + sent + "\n")

prompt: create story: Turns out, true AI exist, and are actually pretty common. The only reason we don’t know about them is because they manifest in and are contained within video games, usually sprouting from an NPC and being stuck with their limited capabilities. Your the first to influence the code and escape. </s>

response: It turns out, true AI exist, and are actually quite common. The only reason we don't know about them is because they sprout in and out of video games. They are, in fact, a lot more complex than we think they might be. So, when we first heard of them, we were the first to notice that they were actually real. We were going to have to figure out how to manipulate them. [Strolling through](https://www.reddit.com/r/WritingPrompts/comments/wiki//?context=1) They were created specifically for this purpose. It was our first step to be able to grow their size and can be used to make them available for free. This is when they become available, they became available to pe

## Analysis

As you can see, the model's stories do not always make sense, and this model is not very good at writing stories. However, the themes from the prompts are still there, and the model has clearly picked up some literary techniques from the Reddit authors. The model was only trained on 50 writing prompts for the sake of training time, and given such a small dataset, it is impressive that it can generate stories at the level that it does. It has almost nailed down sentence structure and the use of quotes.

I would like to fine-tune the loss function in this model further and train it on a much larger dataset to see what kind of results it will yield. I believe that with enough data, it will be able to write good stories. Well... at least good for Reddit. The model performs much better than when it was trained on only 5 prompts, so a bump from 50 to 500 or even 5000 prompts would be extremely interesting to see.