## Introduction
GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.

Zero-Shot Transfer: The pre-training task for GPT-2 is solely language modeling. All the downstream language tasks are framed as predicting conditional probabilities and there is no task-specific fine-tuning.

In [13]:
import numpy as np
import pandas as pd
import torch
import logging
from tqdm import tqdm
import math
import argparse
import os

In [14]:
!git clone https://github.com/huggingface/transformers
!pip install transformers/
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers.optimization import AdamW, get_linear_schedule_with_warmup

fatal: destination path 'transformers' already exists and is not an empty directory.
Processing ./transformers
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... [?25ldone
[?25h  Created wheel for transformers: filename=transformers-4.45.0.dev0-py3-none-any.whl size=9694583 sha256=ca40e01b7d793f5b76e31ef585354a19ddd9fc42f77e3866875ad71792206114
  Stored in directory: /tmp/pip-ephem-wheel-cache-rhatsl6i/wheels/7e/b2/24/0b3be37b3b423a6f2fd25fd6368a1f4b0888942789c7e68bc6
Successfully built transformers
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.45.0.dev0
    Uninstalling transformers-4.45.0.dev0:
      Successfully uninstalled transformers-4.45.0.dev0
Successfully installed transforme

In [15]:
parser = argparse.ArgumentParser()
parser.add_argument('--seed', type = int, default = 88888)
parser.add_argument('--model_name', default = "gpt-2", type = str)
parser.add_argument('--max_seq_length', default = 512 , type = int)
parser.add_argument('--train_batch_size', default = 4, type = int)
parser.add_argument('--valid_batch_size', default = 4, type = int)
parser.add_argument('--num_train_epochs', default = 1, type = int)
parser.add_argument('--warmup', default = .1 , type = float)
parser.add_argument('--learning_rate', default = 5e-5, type = float)
parser.add_argument('--input_text_path', default = '/kaggle/input/story-text', type = str)
args, _ = parser.parse_known_args()

## Prepare the data
Combine the prompt and story, do a little text clean. There are train, valid and test dataset in the original dataset. And the prompts and stories are in seperate files. For a example, the valid.wp_source has the writing prompts and valid.wp_target has the corresponding stories. The train dataset is very large. Since kaggle notebook limits the kernel running time to 3 hours. I decide to take the valid dataset as my train dataset, and the test dataset as valid dataset.

In order to feed the prompt an story together to GPT-2, I combine the prompts and stories togeter.Thus every line in the combined file includes the prompt and it's corresponding story.

In [16]:
DATAPATH = args.input_text_path

def combine_text(prompt, story):
    fp = open(os.path.join(DATAPATH, prompt), encoding = 'utf8')
    fs = open(os.path.join(DATAPATH, story), encoding = 'utf8')
    prompts = fp.readlines()
    stories = fs.readlines()
    assert len(prompts) == len(stories), "Unbalance length"
    combine = []
    for i in range(len(prompts)):
        combine.append(prompts[i].rstrip() + ' <sep> '+" ".join(stories[i].split()[:300]))
    return combine



def cleanpunctuation(s):
    # Usage : Text cleaning with punctuations
    for p in '!,.:;?':
        s=s.replace(' '+p,p)
    s=s.replace(' '+'n\'t','n\'t')
    s=s.replace(' '+'\'s','\'s')
    s=s.replace(' '+'\'re','\'re')
    s=s.replace(' '+'\'ve','\'ve')
    s=s.replace(' '+'\'ll','\'ll')
    s=s.replace(' '+'\'am','\'am')
    s=s.replace(' '+'\'m','\'m')
    s=s.replace(' '+'\' m','\'m')
    s=s.replace(' '+'\'m','\'m')
    s=s.replace(' '+'\' ve','\'ve')
    s=s.replace(' '+'\' s','\'s')
    s=s.replace('<newline>','\n')
    return s

train_text = combine_text('valid.wp_source', 'valid.wp_target')
train_text = list(map(cleanpunctuation, train_text))
valid_text = combine_text("test.wp_source", "test.wp_target")
valid_text = list(map(cleanpunctuation, valid_text))

## Tokenize and load to dataloader¶
GPT-2 uses BPE to tokenize the text squence.BPE merges frequently co-occurred byte pairs in a greedy manner. In order to let the sequences in the same batch have the same length, I set the max length of sequence as 512, and truncate the longer sequence and pad the shorter sequence. Since the tokenizer function only return the input_ids and attention_mask. For training purpose, I need to feed the labels(targets) to the model. So I create labels sequence for every input_ids squence. In the label sequence,I rule out the padding tokens by set it to -100 to avoid compute loss on them. And also GPT-2 will automatically shift the labels to the right to match the inputs_ids, so I don't need to deal with it.

In [17]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

inputs_train = tokenizer(train_text, padding = True,
                         truncation = True, max_length = args.max_seq_length)
inputs_valid = tokenizer(valid_text, padding = True, truncation = True,
                        max_length = args.max_seq_length)

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]



In [18]:
def create_labels(inputs):
    labels = []
    for ids, attention_mask in zip(inputs['input_ids'],inputs['attention_mask']):
        label = ids.copy()
        real_len = sum(attention_mask)
        padding_len = len(attention_mask)-sum(attention_mask)
        label[:] = label[:real_len]+[-100]*padding_len
        labels.append(label)
    inputs['labels'] = labels

create_labels(inputs_train)
create_labels(inputs_valid)

In [19]:
class StoryDataset:
    def __init__(self, inputs):
        self.ids = inputs['input_ids']
        self.attention_mask = inputs['attention_mask']
        self.labels = inputs['labels']
    def __len__(self):
        return len(self.ids)
    def __getitem__(self, item):
        return [torch.tensor(self.ids[item], dtype = torch.long),
               torch.tensor(self.attention_mask[item], dtype = torch.long),
               torch.tensor(self.labels[item], dtype = torch.long)]

In [20]:
train_batch_size = args.train_batch_size
valid_batch_size = args.valid_batch_size
train_data = StoryDataset(inputs_train)
train_dataloader = torch.utils.data.DataLoader(train_data, shuffle = False,
                                              batch_size = train_batch_size)
valid_data = StoryDataset(inputs_valid)
valid_dataloader = torch.utils.data.DataLoader(valid_data, shuffle = True,
                                              batch_size = valid_batch_size)

In [23]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.to(device)
model.eval()
eval_loss = []

for inputs in tqdm(valid_dataloader, desc = "eval"):
    d1, d2, d3 = inputs
    d1 = d1.to('cuda')
    d2 = d2.to('cuda')
    d3 = d3.to('cuda')
    
    with torch.no_grad():
        output = model(input_ids = d1, attention_mask = d2,
                      labels = d3)
        batch_loss = output[0]
    eval_loss += [batch_loss.cpu().item()]
    del batch_loss
eval_loss=np.mean(eval_loss)
perplexity=math.exp(eval_loss)
print(f'The average perplexity for valid dataset before fine-tuning is {perplexity}') 

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

eval: 100%|██████████| 3785/3785 [06:01<00:00, 10.47it/s]

The average perplexity for valid dataset before fine-tuning is 39.27932676546657





Let's pick a prompt from the valid dataset and input it into the model, have the model generate a 300 words long story. The output stories is really great! I use the generate method comes with the model. The method currently supports greedy decoding, beam-search decoding, sampling with temperature, sampling with top-k or nucleus sampling. The meanings of key arguments are below:
* 1)do_sample: if set to False greedy decoding is used.
* 2)The temperature is used to module the next token probabilities.
* 3)top_k is the number of highest probability vocabulary tokens to keep for top-k-filtering.
* 4)top_p is the cumulative probability of parameter highest probability vocabulary tokens to keep for nucleus sampling.
* 5)repetition_penalty is the parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty.

In [24]:
prompt = valid_text[300][:valid_text[300].find('<sep>')]
target = valid_text[300][valid_text[300].find("<sep>")+5:]

def generate_story(prompt, target, k = 0, p = .9, output_length = 300, temperature = 1,
                  num_return_sequences = 3, repitition_penalty = 1.):
    print("| Prompt |\n")
    print(prompt + "\n")
    print("| Generated |\n")
    print(target + "\n")
    encoded_prompt = tokenizer.encode(prompt, add_special_tokens = False,
                                     return_tensors ='pt')
    model.to('cpu')
    model.eval()
    output_sequences = model.generate(
    input_ids = encoded_prompt,
    max_length = output_length,
    temperature = temperature,
    top_k = k,
    top_p = p,
    repetition_penalty = repitition_penalty,
    do_sample = True,
    num_return_sequences = num_return_sequences)
    
    if len(output_sequences.shape)>2:
        output_sequences.squeeze_()
    for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
        print("|| Generated Sequence {} ||".format(generated_sequence_idx+1))
        generated_sequece = generated_sequence.tolist()
        # Decode text
        text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces = True)
        # Remove all text after eos token
        text = text[:text.find(tokenizer.eos_token)]
        print(text)

generate_story(prompt, target)

| Prompt |

Children's logic dictates the way the world works. [ WP ] 

| Generated |

 “ That ’ s not an option I ’ m currently willing to exercise. ” 
 
 I pinch the bridge of my nose to stave off the headache building behind my eyes. If this goes on much longer, I ’ m gon na have to start to start cutting back on the vegetables. 
 
 “ She ’ s dangerous, Jimmy. You know that. You ’ ve seen it. Dealt with it first hand. She just doesn ’ t play by anyone ’ s rules. ” 
 
 Ali finished off her sucker and unwrapped a fresh one, offering it to me. I declined. I ’ d sworn off the things after my third cavity scare. That one saw me at the dentist for the third time in as many months. I don ’ t care what my dad says, I know that guy is evil. Who owns a drill like that? A murderer, that ’ s who. I still hear the damn thing in my nightmares. 
 
 While she savored the smooth flavor of blue-raspberry, I pondered her words. We both knew she was right. The situation was spiralling out of control. T

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


|| Generated Sequence 1 ||
Children's logic dictates the way the world works. [ WP ]  But does this work for almost anybody?
Here's what used to be this passage on why Religion is the moral order, again out of all the arguments concerning why:
"Deeper stages of human society do not yield either less liberty or more safety, as does a breach of the rules and norms of an open society, or more there are simply more bugs and ornaments left to reduce the number of ambiguities in particular persons, the degrees of alarm in the mind, and the likelihood of unintended or mischievous incidents. By contriving such practicality to enable mankind to maintain its basic functioning of safety it is a more effective means of controlling external threats and maintaining the dignity of conscience. That truth about purity and chastity is now shattered. This is the first stage of religious morality which is at the root of the highest of all other life. So it is the first stage of religious morality, in virt

# Fine-tune the model
The number of training samples is 15620. With one GPU to train the model, it tooks about 21 minutes to run 1 epoch. After 1 epoche learning, the perplexity for valid dataset is about 24, which is better than the score before fine- tuning.

In [26]:
num_train_epochs = args.num_train_epochs
training_steps_per_epoch = len(train_dataloader)
total_num_training_steps = int(training_steps_per_epoch*num_train_epochs)
weight_decay = 0
learning_rate = args.learning_rate
adam_epsilon = 1e-8
warmup_steps = int(total_num_training_steps*args.warmup)
no_decay = ['bias', "LayerNorm.weight"]

optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=adam_epsilon)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_num_training_steps
)

In [27]:
print("***** Running training *****")
print("  Total_num_training_step = {}".format(total_num_training_steps))
print("  Num Epochs = {}".format(num_train_epochs))
print(f"  Train_batch_size per device = {train_batch_size}")
print(f"  Valid_batch_size per device = {valid_batch_size}")
model.to('cuda')
for epoch in range(num_train_epochs):
    print(f"Start epoch{epoch+1} of {num_train_epochs}")
    train_loss=0
    epoch_iterator = tqdm(train_dataloader,desc='Iteration')
    model.train()
    model.zero_grad()    
    for _, inputs in enumerate(epoch_iterator):        
        d1,d2,d3=inputs
        d1=d1.to('cuda')
        d2=d2.to('cuda')
        d3=d3.to('cuda')
        output = model(input_ids=d1, attention_mask=d2,labels=d3)
        batch_loss=output[0]
        batch_loss.backward()
        optimizer.step()
        scheduler.step()
        model.zero_grad()
        train_loss+=batch_loss.item()
        epoch_iterator.set_description('(batch loss=%g)' % batch_loss.item())
        del batch_loss
    print(f'Average train loss per example={train_loss/training_steps_per_epoch} in epoch{epoch+1}')    
    print(f'Starting evaluate after epoch {epoch+1}')
    eval_loss=[]    
    model.eval()    
    for inputs in tqdm(valid_dataloader, desc="eval"):
        d1,d2,d3=inputs
        d1=d1.to('cuda')        
        d2=d2.to('cuda')
        d3=d3.to('cuda')
        with torch.no_grad():
            output = model(input_ids=d1, attention_mask=d2,labels=d3)
            batch_loss=output[0]
        eval_loss+=[batch_loss.cpu().item()]
        del batch_loss
    eval_loss=np.mean(eval_loss)
    perplexity=math.exp(eval_loss)
    print(f'Average valid loss per example={eval_loss} in epoch{epoch+1}')    
    print(f'Perplextiy for valid dataset in epoch{epoch+1} is {perplexity}')

***** Running training *****
  Total_num_training_step = 3905
  Num Epochs = 1
  Train_batch_size per device = 4
  Valid_batch_size per device = 4
Start epoch1 of 1


(batch loss=2.89359): 100%|██████████| 3905/3905 [21:30<00:00,  3.03it/s]


Average train loss per example=3.2843164310870496 in epoch1
Starting evaluate after epoch 1


eval: 100%|██████████| 3785/3785 [06:00<00:00, 10.49it/s]

Average valid loss per example=3.182704730140958 in epoch1
Perplextiy for valid dataset in epoch1 is 24.11188156833926





## Generate Stories
Use the fine tuned model to generate stories with the same prompt before fine-tune

In [28]:
prompt = valid_text[300][:valid_text[300].find('<sep>')]
target = valid_text[300][valid_text[300].find('<sep>')+5:]
generate_story(prompt, target)

| Prompt |

Children's logic dictates the way the world works. [ WP ] 

| Generated |

 “ That ’ s not an option I ’ m currently willing to exercise. ” 
 
 I pinch the bridge of my nose to stave off the headache building behind my eyes. If this goes on much longer, I ’ m gon na have to start to start cutting back on the vegetables. 
 
 “ She ’ s dangerous, Jimmy. You know that. You ’ ve seen it. Dealt with it first hand. She just doesn ’ t play by anyone ’ s rules. ” 
 
 Ali finished off her sucker and unwrapped a fresh one, offering it to me. I declined. I ’ d sworn off the things after my third cavity scare. That one saw me at the dentist for the third time in as many months. I don ’ t care what my dad says, I know that guy is evil. Who owns a drill like that? A murderer, that ’ s who. I still hear the damn thing in my nightmares. 
 
 While she savored the smooth flavor of blue-raspberry, I pondered her words. We both knew she was right. The situation was spiralling out of control. T

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


|| Generated Sequence 1 ||
Children's logic dictates the way the world works. [ WP ] 
 
 In a perfect world we all feel superior to the man or woman we love. We are kings and queens, blessed with a limitless amount of power and thousands upon thousands of genes. We are blessed with vast, red-colored planets with the ability to sustain life billions of miles away. 
 
 In a perfect world we all would be born who have the same basic human genetics but who neither used as much power nor knew of the dangers of genetic modification. But that is not the case. The balance between good and evil are always tipped. 
 
 In a perfect world, every generation has the same ultimate capability - we are kings and queens. But the balance is tipped wrong. The planet is barren with only nature that we can manipulate - and when that happens, we lose control over our chosen masters. Then, we lose the world, and life continues. 
 
 The truth is, no one ever truly believes in us. We have been through so much, 