# Poem Generation using FastAI


In this tutorial you will see how to fine-tune a pretrained transformer model from HuggingFace's transformers library. FastAI's data loaders keep the data pipeline simple. The same approach can be used with any of HuggingFace's pretrained models; below we experiment with GPT-2.

## Import Libraries


In [1]:
# from fastbook import *
from fastai.text.all import *
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

In [2]:
pretrained_weights = 'gpt2'
tokenizer = GPT2TokenizerFast.from_pretrained(pretrained_weights)
model = GPT2LMHeadModel.from_pretrained(pretrained_weights)


## Read Data
The data is organized by folder. There are two main folders: forms (e.g. haiku, sonnet) and topics (e.g. love, peace). Each main folder contains one subfolder per subcategory, and the poem `.txt` files live inside those subfolders.
With fastai, the data is easy to read with the `get_text_files` function. You can select all folders or only specific ones.

In [None]:
path = '../input/poemsdataset'

In [None]:
poems = get_text_files(path, folders = ['forms','topics'])
print("There are",len(poems),"poems in the dataset")
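
For readers who want to inspect the folder layout without fastai, the same listing can be approximated with the standard library. This is a minimal sketch, not how `get_text_files` is implemented; the helper names `list_poems` and `count_by_subfolder` are made up for illustration, and it assumes the forms/topics layout described above.

```python
from collections import Counter
from pathlib import Path

def list_poems(root, folders=("forms", "topics")):
    """Return every .txt file under the chosen top-level folders of `root`."""
    root = Path(root)
    files = []
    for folder in folders:
        # rglob descends into every subfolder (ballad, haiku, love, ...)
        files.extend(sorted((root / folder).rglob("*.txt")))
    return files

def count_by_subfolder(files):
    """Count poems per immediate parent folder, e.g. 'ballad' or 'love'."""
    return Counter(f.parent.name for f in files)
```

Calling `count_by_subfolder(list_poems(path))` gives a quick overview of how many poems each form or topic contains, which helps when deciding what to train on.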

We'll start by training the model on ballads. There are only 100 ballads, so training won't take long. You can add more poem forms, however. For instance, haiku would be interesting to experiment with, to see whether the model maintains the 5-7-5 syllable structure. You can also point the path at the topics folder instead of the forms folder and try out poem topics such as love, anger, or depression.

In [None]:
ballads = get_text_files(path+'/forms', folders = ['ballad'])
print("There are",len(ballads),"ballads in the dataset")

In [None]:
txt = poems[0].open().read()  # read the first file
print(txt)

## Prepare the Data



In [None]:
ballads = L(o.open().read() for o in ballads) # to make things easy we gather all texts in a single fastai L list

In [None]:
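def flatten(A):
    """Recursively flatten nested Python lists into one flat list."""
    rt = []
    for i in A:
        if isinstance(i,list): rt.extend(flatten(i))  # descend into nested lists
        else: rt.append(i)                            # keep plain items (here: poem strings)
    return rt

all_ballads = flatten(ballads)  # plain Python list of poem strings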

In [None]:
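class TransformersTokenizer(Transform):
    "Wrap a HuggingFace tokenizer so fastai can encode texts and decode predictions"
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x):
        # text -> tokens -> tensor of token ids
        toks = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    # tensor of token ids -> text, wrapped in TitledStr so fastai can display it
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))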

In [None]:
splits = [range_of(70), list(range(70, 100))] # 70/30 split: first 70 ballads for training, last 30 for validation
tls = TfmdLists(all_ballads, TransformersTokenizer(tokenizer), splits=splits, dl_type=LMDataLoader)

In [None]:
show_at(tls.train, 0)

In [None]:
bs,sl = 4,256
dls = tls.dataloaders(bs=bs, seq_len=sl)
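
To see what `bs` and `seq_len` control, here is a simplified, plain-Python sketch of how a language-model data loader can turn tokenized texts into batches. It is an illustration under stated assumptions, not fastai's `LMDataLoader` code (which also handles shuffling and other details); the function name `lm_batches` is hypothetical. The key idea is that all texts are joined into one stream, and each target sequence is the input sequence shifted by one token.

```python
def lm_batches(token_streams, bs, seq_len):
    """Yield (inputs, targets) batches from a list of token-id lists."""
    # Concatenate every text into one long stream of token ids.
    stream = [tok for text in token_streams for tok in text]
    # Split the stream into `bs` contiguous rows; each row keeps one extra
    # token so the last input position still has a target.
    row_len = (len(stream) - 1) // bs
    rows = [stream[i * row_len:(i + 1) * row_len + 1] for i in range(bs)]
    # Walk along the rows `seq_len` tokens at a time.
    for start in range(0, row_len, seq_len):
        end = min(start + seq_len, row_len)
        xs = [row[start:end] for row in rows]            # model inputs
        ys = [row[start + 1:end + 1] for row in rows]    # same tokens shifted by one
        yield xs, ys

# Two tiny "texts" of token ids, batch size 2, sequence length 3.
batches = list(lm_batches([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9, 10, 11, 12]], bs=2, seq_len=3))
```

With `bs,sl = 4,256` as above, each batch would hold 4 rows of up to 256 token ids, and the model learns to predict every next token in each row.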

In [None]:
dls.show_batch(max_n=2)

## Fine-tuning the model

In [None]:
class DropOutput(Callback):
    # The HuggingFace model returns several outputs; keep only the first (the logits) for the loss
    def after_pred(self): self.learn.pred = self.pred[0]

In [None]:
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), cbs=[DropOutput], metrics=Perplexity()).to_fp16()

In [None]:
learn.validate()
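
`learn.validate()` reports the validation loss together with the `Perplexity` metric. Perplexity is the exponential of the average cross-entropy loss, so the two numbers carry the same information on different scales. The sketch below shows that relationship in plain Python; the function names are illustrative, and the inputs are the probabilities the model assigned to the correct next tokens.

```python
import math

def cross_entropy(probs_of_targets):
    """Mean negative log-likelihood of the correct next tokens."""
    return -sum(math.log(p) for p in probs_of_targets) / len(probs_of_targets)

def perplexity(probs_of_targets):
    """Perplexity is exp(cross-entropy)."""
    return math.exp(cross_entropy(probs_of_targets))

# A model that always gives the correct token probability 1/4 is as
# uncertain as a uniform choice among 4 tokens.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

A lower perplexity after fine-tuning means the model finds the validation ballads less surprising than it did before training.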

In [None]:
learn.lr_find()

In [None]:
learn.fit_one_cycle(1, 1e-4)

## Poem Generation Example

In [None]:
prompt = 'love is ridiculous' # create an initial text prompt to start your generated text
prompt_ids = tokenizer.encode(prompt)
inp = tensor(prompt_ids)[None].cuda()
inp.shape

Adding the `num_beams` and `no_repeat_ngram_size` arguments makes a huge difference, as explained [here](https://huggingface.co/blog/how-to-generate). Beam search reduces the risk of missing hidden high-probability word sequences by keeping the `num_beams` most likely hypotheses at each time step and eventually choosing the hypothesis with the highest overall probability. Without it, generation falls back to greedy search, which simply picks the most probable next word at every step. Beam search usually finds an output sequence with higher probability than greedy search, but it is not guaranteed to find the most likely output.

Without `no_repeat_ngram_size`, the output is likely to repeat itself. Setting it adds a penalty that ensures no n-gram appears twice, by setting to 0 the probability of any next word that would create an already seen n-gram.
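
The difference between greedy search and beam search can be seen on a toy example. The sketch below does not use the GPT-2 model; `TOY_LM` is a hypothetical hand-written table of next-word probabilities, with numbers made up for illustration, and `beam_search` is a simplified illustration rather than the algorithm as implemented in `generate`.

```python
import math

# Hypothetical toy "language model": next-word probabilities given the words so far.
TOY_LM = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}

def beam_search(prompt, steps, num_beams):
    """Keep the `num_beams` most probable hypotheses at every step."""
    beams = [(0.0, tuple(prompt))]  # (log-probability, words so far)
    for _ in range(steps):
        candidates = []
        for logp, words in beams:
            for word, p in TOY_LM[words].items():
                candidates.append((logp + math.log(p), words + (word,)))
        # Keep only the best `num_beams` hypotheses.
        beams = sorted(candidates, reverse=True)[:num_beams]
    logp, words = beams[0]
    return list(words), math.exp(logp)

greedy = beam_search(["The"], steps=2, num_beams=1)  # greedy search is a beam of width 1
beam = beam_search(["The"], steps=2, num_beams=2)
```

Greedy search commits to "nice" because it is the most probable first word, and ends with a sequence of probability 0.2. With two beams, "dog" stays alive long enough for the high-probability continuation "has" to be found, giving a sequence of probability 0.36.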

In [None]:
preds = learn.model.generate(inp, max_length=60, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(preds[0].cpu().numpy(), skip_special_tokens=True))
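
The effect of `no_repeat_ngram_size=2` in the call above can be sketched in plain Python. The helper below, whose name `banned_next_tokens` is made up for illustration, works out which next tokens would repeat an n-gram that has already been generated; during generation those tokens have their probability set to 0. It is a simplified sketch of the idea, not the transformers implementation.

```python
def banned_next_tokens(generated, n):
    """Tokens that would complete an n-gram already present in `generated`."""
    if n < 1 or len(generated) < n:
        return set()  # no complete n-gram has been generated yet
    # The last n-1 tokens are the start of whatever n-gram comes next.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = tuple(generated[i:i + n])
        # If an earlier n-gram starts with the same n-1 tokens,
        # its final token would create a repeat.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

# "love is" has already appeared, so after the second "love" the word "is" is banned.
banned = banned_next_tokens(["love", "is", "blind", "and", "love"], n=2)
```

Larger values of `n` are less restrictive: with `n=3`, a word pair may repeat as long as the word that follows it differs.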

In [None]:
prompt = "I don't know what I would do"
prompt_ids = tokenizer.encode(prompt)
inp = tensor(prompt_ids)[None].cuda()
preds = learn.model.generate(inp, max_length=60, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(preds[0].cpu().numpy(), skip_special_tokens=True))