# Deep Learning - Lab 11
### Text Generation Using Transformers
### Story Generation from a Prompt

-----------------------------------------------

### Name: Faizaan Al Faisal

----------------------------------------------

This lab is regarding using a dataset of Writing Prompts, which consists of prompts followed by stories regarding those prompts, and using a Transformer Model (like GPT-2 from the Hugging Face Library) to create a model that can generate stories from prompts

-----------------------------------------------

## Module Imports & Downloads
Importing all necessary modules and libraries.

In [1]:
!git clone https://github.com/huggingface/transformers
!pip install transformers
!mkdir "/kaggle/working/combined/"

fatal: destination path 'transformers' already exists and is not an empty directory.
mkdir: cannot create directory ‘/kaggle/working/combined/’: File exists


In [2]:
import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers.optimization import AdamW, get_linear_schedule_with_warmup

import random
import numpy as np
import pandas as pd 
import logging
from tqdm import tqdm
import math
import os

-----------------------------------------

# Global Parameters
Using these parameters to modify overall command of the notebook. Modifies training parameters, dataset, etc.

In [75]:
# randomization seed
np.random.seed(100)

# model/training parameters
model_name = "gpt2"
# model_name = "distilgpt2"
max_len = 512
batch_size = 4
epochs = 1
warmup = 0.1
weight_decay = 0.01
lr = 5e-5
adam_eps = 1e-8

# dataset
sample_percent = 15
file_dir = "/kaggle/input/writing-prompts/writingPrompts/"
combined_dir = "/kaggle/working/combined/"

-----------------------------------------------

## Data Preprocessing
First, sample the original dataset. The datasets provided are huge and are not being properly converted to dataset objects due to their size and resource limitations.

Next combine the source and target files into one combined set, and store in /kaggle/working/combined for sake of training model more easily.

Now, data is stored like "prompt goes here [ENDPROMPT] story goes here \n". This is the format of inputs/outputs that GPT2 model expects and works best with, and so is our preferred approach.

In [76]:
# decrease the size of sets by sampling, then combine source/target into one file
# target is appended to source after [ENDPROMPT] token, newline is added at end of combined
def sample_and_combine(sample_percent, max_len, file_dir, combined_dir, name):
    with open(file_dir + name + ".wp_source", "r", encoding="utf-8") as f:
        source = f.readlines()
    with open(file_dir + name + ".wp_target", "r", encoding="utf-8") as f:
        target = f.readlines()
    
    # ensure same number of items in source and target
    assert len(source) == len(target)
    # calculate number of samples
    samples = int(len(source) * (sample_percent / 100))
    # randomly select indices from amount sampled
    indices = random.sample(range(len(source)), samples)
    # shorten the lists to only those in samples
    source = [source[i] for i in indices]
    target = [target[i] for i in indices]
    
    # combine the source and target with endprompt token in between
    combined = [source[i].rstrip()+ " <ENDPROMPT> " + " ".join(target[i].split()[0:max_len]) for i in range(len(target))]
    text = []
    with open(combined_dir + name + ".wp_combined", "w") as f:
        for x in combined:
            f.write(x.strip() + "\n")
            text.append(x)
    return text

def clean_data(s):
    punct_chars = '!,.:;?'
    for p in punct_chars:
        s = s.replace(' ' + p, p)
    contractions = ["n't", "'s", "' s", "'re", "'ve", "' ve", "'ll", "'am", "' m", "'m", "'ve", "'s"]
    for c in contractions:
        s = s.replace(' ' + c, c)
    s = s.replace('<newline>', '\n')
    return s

# generate combined files for the datasets
train_data = sample_and_combine(sample_percent, max_len, file_dir, combined_dir, "train")
valid_data = sample_and_combine(sample_percent, max_len, file_dir, combined_dir, "valid")
test_data = sample_and_combine(sample_percent, max_len, file_dir, combined_dir, "test")

# clean the data slightly by better dealing with contractions and weird characters
train_data = list(map(clean_data, train_data))
valid_data = list(map(clean_data, valid_data))
test_data  = list(map(clean_data, test_data))

In [77]:
train_data[1]

"[ IP ] Freelance Death Wizard <ENDPROMPT> The blue ink on my arm was fading fast but I had memorized it. \n \n `` Mercenary Magic: Death's Door Pub '' \n \n I hurried through the streets, having waited until almost midnight. A cold rain fell against my coat, pulled up tight to ward off the November chill. The cobbled streets were mostly empty but for the odd wanderer or drunk. None payed me any attention. \n \n Rounding a corner I saw it. Unmistakable. A black plank of wood with a bright blue eye emblazoned on it hung from two thick black chains. The street lights were out along the whole street except for the two in front of the pub. It was an eerie sight, bright yellow flickering lights against a nearly pitch black backdrop. \n \n The building itself was wooden, with two massive windows with iron frames jutting out into the street. There was yellow light escaping through tiny holes and gaps in the thick black curtains drawn on the inside of the windows. \n \n Glancing back and forth

------------------------------------------------

# Data Formatting
Data needs to be properly processed to be acceptable by the GPT-2 Models. First, we must tokenize it using the GPT2 Tokenizer objects from the HF-Transformers library.

Next, tokenize the data of all of the sampled/combined files.

Then, GPT2 model needs objects with three categories: "input_ids", "attention_mask", and "labels". The tokenizer generates input_ids and attention_mask, but the labels need to be set.

Finally, since we are working with the PyTorch side of the Hugging Face Transformers, we must create a custom dataset and dataloader to feed the data to the model properly

In [78]:
# create gpt2 tokenizer object and set padding token
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id

# tokenize the 3 datasets, generates 2d list of input_ids & attention_mask
tokenized_train = tokenizer(train_data, padding=True,truncation=True,max_length=max_len)
tokenized_valid = tokenizer(valid_data, padding=True,truncation=True,max_length=max_len)
tokenized_test  = tokenizer(test_data,  padding=True,truncation=True,max_length=max_len)

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

In [79]:
# create labels for the data
def generate_labels(inputs):
    labels=[]
    for ids, attention in zip(inputs['input_ids'],inputs['attention_mask']):
        label = ids.copy()
        real_length = sum(attention)
        padding_length = len(attention) - real_length
        label[:] = label[:real_length] + [-100] * padding_length
        labels.append(label)
    
    inputs['labels']=labels
    
generate_labels(tokenized_train)
generate_labels(tokenized_valid)
generate_labels(tokenized_test)

## PyTorch Data Handling
Creating a custom Dataset class alongside creating the dataloader objects to fed information to the model during eval and training.

In [80]:
class LabDataset:
    def __init__(self, inputs):
        self.ids = inputs['input_ids']
        self.attention = inputs['attention_mask']
        self.labels = inputs['labels']

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):

        return  [
                    torch.tensor(self.ids[idx], dtype=torch.long),
                    torch.tensor(self.attention[idx], dtype=torch.long),
                    torch.tensor(self.labels[idx], dtype=torch.long)
                ]

In [81]:
# creating dataset objects
train_dataset = LabDataset(tokenized_train)
valid_dataset = LabDataset(tokenized_valid)
test_dataset  = LabDataset(tokenized_test)

# creating dataloader objects
train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
valid_loader = torch.utils.data.DataLoader(valid_dataset, shuffle=True, batch_size=batch_size)
test_loader = torch.utils.data.DataLoader(test_dataset, shuffle=True, batch_size=batch_size)

----------------------------------------------

# Modular Functions
Creating generic functions to train and evaluate the model for one epoch. This allows the code to be more modular and easy to manipulate.

In [82]:
# run one epoch of training
def model_train(model, optimizer, scheduler, dataloader):
    model.train()
    loss = []
    perplexity_vals = []
    
    for inputs in tqdm(dataloader, desc="Training"):
        # break inputs into all parts and put all on devices
        input_ids, attention, labels = inputs
        input_ids = input_ids.to('cuda')
        attention = attention.to('cuda')
        labels = labels.to('cuda')
        
        # get outputs
        optimizer.zero_grad()
        output = model(input_ids=input_ids, attention_mask=attention, labels=labels)
        # calculate loss
        logits = output.logits
        batch_loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
        # backward prop + steps
        batch_loss.backward()
        optimizer.step()
        scheduler.step()
        # store values for calculation of loss + perplexity
        loss.append(batch_loss.cpu().item())
        perplexity_vals.append(math.exp(batch_loss.cpu().item()))

        del batch_loss

    return  {
                "Loss": np.mean(loss),
                "Perplexity": np.mean(perplexity_vals)
            }

In [83]:
# run one epoch of validation or testing
def model_eval(model, dataloader, testing=False):
    model.eval()
    eval_loss = []

    desc = "Testing" if testing else "Validation"
    for inputs in tqdm(dataloader, desc=desc):
        input_ids, attention, labels = inputs
        input_ids = input_ids.to('cuda')
        attention = attention.to('cuda')
        labels = labels.to('cuda')
        
        with torch.no_grad():
            output = model(input_ids=input_ids, attention_mask=attention, labels=labels)
            logits = output.logits
            batch_loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

        eval_loss.append(batch_loss.cpu().item())
        del batch_loss

    average_loss = np.mean(eval_loss)
    perplexity = math.exp(average_loss)
    
    return  {
                "Loss": average_loss,
                "Perplexity": perplexity,
            }

In [84]:
def generate_stories(model, tokenizer, prompt, target, k=0, p=0.9, output_length=300, temperature=0.9, 
                   num_return_sequences=3, max_new_tokens=0, repetition_penalty=1.0):
    
    print("\n  Prompt \n=====================================\n")
    print(prompt + "\n")
    print("\n\n  Target Story \n=====================================\n")
    print(target + "\n")
    
    tokenizer.pad_token_id = tokenizer.eos_token_id
    tokenizer.pad_token = tokenizer.eos_token
    encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    
    attention_mask = torch.ones_like(encoded_prompt)

    # Access the underlying model if using DataParallel
    model = model.module if isinstance(model, torch.nn.DataParallel) else model
    model = model.to('cpu')
    model.eval()

    output_sequences = model.generate(
        input_ids=encoded_prompt,
        attention_mask=attention_mask,
        max_length=output_length,
        temperature=temperature,
        top_k=k,
        top_p=p,
        repetition_penalty=repetition_penalty,
        do_sample=True,
        num_return_sequences=num_return_sequences,
#         max_new_tokens=max_new_tokens,
    )

    if len(output_sequences.shape) > 2:
        output_sequences.squeeze_()

    for idx, generated_sequence in enumerate(output_sequences):
        print(f"\n\n Generated Sequence {idx + 1} \n=====================================\n")
        generated_sequence = generated_sequence.tolist()
        text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)
        text = text[: text.find(tokenizer.eos_token)]
        print(text)
        
    model = nn.DataParallel(model.cuda())


-----------------------------------------------

# Model Training
Now we must fine tune the model with all that we have done so far

In [85]:
# model creation
model = GPT2LMHeadModel.from_pretrained(model_name)
model = model.to("cuda")

# model set to use both GPUs
model = nn.DataParallel(model)

# optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, eps=adam_eps,weight_decay=weight_decay)

# lr scheduler
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=len(train_loader) * epochs * warmup,
    num_training_steps=len(train_loader) * epochs
)

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [86]:
# eval on test set before fine-tuning
print(f"Evaluating Model on Testing Set Before Training:\n=====================================\n")
test_results = model_eval(model, test_loader, testing=True)
print(f"Loss before Fine-Tuning: {test_results['Loss']}")
print(f"Perplexity before Fine-Tuning: {test_results['Perplexity']}")

Evaluating Model on Testing Set Before Training:



Testing: 100%|██████████| 568/568 [01:39<00:00,  5.72it/s]

Loss before Fine-Tuning: 9.25445453717675
Perplexity before Fine-Tuning: 10451.016623188278





In [90]:
# generate some stories before and after training
token = "<ENDPROMPT>"
chosen = 5
prompt = test_data[chosen][ : test_data[chosen].find(token)]
story = test_data[chosen][test_data[chosen].find(token)+len(token) : ]
generate_stories(model, tokenizer, prompt, story)


  Prompt 

[ WP ] You've spent the last two years living as a dog-like pet to an alien family who abducted/adopted/rescued. One day they come home with a new human... 



  Target Story 

 `` Come on Liz, time to eat, then we'll go for a walk. '' 
 
 I stretched from my cot, blinked at the creature staring at me holding a backpack with a leash. This again... Well, at least I wasn't a zoo exhibit. `` Morning, '' I grumbled, stretching and groaning. My vocal chords were unable to make many of the sounds that this strance species could, but I could communicate a little. `` Too early, '' I whined. My family, if i could call them that, were all early risers. Before... before here I never got up before the sun was up. Here, the sun wasn't even peeking at the horizon. `` Sleep more. '' 
 
 `` No little one, we have to go for a walk. You're too lazy when I don't walk you, and you don't want the human catchers to get you. Then you can watch TV all day and sleep if you want. '' 
 
 I frowned. N

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.




 Generated Sequence 1 

[ WP ] You've spent the last two years living as a dog-like pet to an alien family who abducted/adopted/rescued. One day they come home with a new human...  We assume you, the adoptive parent, is upset. You find your "home" so the parents of the new human who has escaped that shack.  A therapist says this human has a "special relationship with you. It's as if they have come from the dead, with no memories. The human's mother went through a terrible mental breakdown, but has come back to life, with you, and your love and care. So you're so motivated to find her, you go up into the house. The therapist takes care of your home, and we spend the next three years living with her. They help you realize that you love them as you know them."
Deist Designs, 2010-2011
"The first people we tried, it was just a fun thing. The first time we came to San Francisco was a year or so ago. We came to San Francisco for an event. We bought some very good wooden bicycles and everyt

In [91]:
# fine tune the model
for epoch in range(epochs):

    print(f"Epoch {epoch + 1}:\n=====================================\n Training:\n")
    train_metrics = model_train(model, optimizer, scheduler, train_loader)
    print(f"Loss: {train_metrics['Loss']}")
    print(f"Perplexity: {train_metrics['Perplexity']}")

    
    print(f"=====================================\n Validation:\n")
    valid_metrics = model_eval(model, valid_loader)    
    print(f"Loss: {train_metrics['Loss']}")
    print(f"Perplexity: {train_metrics['Perplexity']}")

Epoch 1:
 Training:



Training: 100%|██████████| 10223/10223 [1:22:45<00:00,  2.06it/s]


Loss: 0.06648484685055595
Perplexity: 16.567529318647402
 Validation:



Validation: 100%|██████████| 586/586 [01:42<00:00,  5.71it/s]

Loss: 0.06648484685055595
Perplexity: 16.567529318647402





In [33]:
# evaluate the model again after fine-tuning
print(f"Evaluating Model on Testing Set after Fine-Tuning:\n=====================================\n")
test_results = model_eval(model, test_loader, testing=True)
print(f"Loss after Fine-Tuning: {test_results['Loss']}")
print(f"Perplexity after Fine-Tuning: {test_results['Perplexity']}")

Evaluating Model on Testing Set after Fine-Tuning:



Testing: 100%|██████████| 379/379 [00:43<00:00,  8.71it/s]

Loss after Fine-Tuning: 0.0003140458930951868
Perplexity after Fine-Tuning: 1.0003140952106693





In [89]:
# generate more stories from the same prompt as before fine-tuning to see the difference
token = "<ENDPROMPT>"
chosen = 4
prompt = test_data[chosen][ : test_data[chosen].find(token)]
story = test_data[chosen][test_data[chosen].find(token)+len(token) : ]
print(len(prompt))
generate_stories(model, tokenizer, prompt, story)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


72

  Prompt 

[ WP ] `` I'm sorry, but the thing you were looking for is sold out. '' 



  Target Story 

 “ I ’ m sorry, but the thing you were looking for is sold out. ” The government agent on the other end of the phone had said the sentence softly, yet firmly. She said nothing more, careful not to waste her resources to reason with a desperate, dying man. 
 
 It couldn ’ t be. There was just no way. When the product first emerged on the market, it was a satire. I purchased a few cans, as did all my friends. We were amused. Who knew, that it would actually catch on. It remained on sale, spreading, like wildfire. People purchased the cans as jokes, to show their friends, to silently protest. The government neither condemned nor encouraged them initially. It knew that attacking the product would give it power. 
 
 As time went on, the pollution spread. Smog that could previously be blocked out by mere masks could now permeate even the best clothing. A fifteen minute walk outside req

In [None]:
# chose 10 random prompts from the test dataset and run the generate_stories function using same logic as before 
for i in range(10):
    chosen = random.sample(len(test_data), 1)
    prompt = test_data[chosen][ : test_data[chosen].find(token)]
    story = test_data[chosen][test_data[chosen].find(token)+len(token) : ]
    print(len(prompt))
    generate_stories(model, tokenizer, prompt, story)

--------------------------------------

# Conclusion

This lab taught us the basics of the Hugging Face Transformers library, and a specific Transformer model GPT2. This model is capable of lots of language tasks such as story generation, translation, etc. Figuring out something new and complex like this was rewarding and will provide high level experience for future tasks.