# About

This notebook loads a pre-trained GPT-2 language model and fine-tunes it to generate text.

In [1]:
!pip install transformers datasets



# START


In [2]:
import torch
from torch.utils.data import DataLoader
from tqdm.auto import tqdm

# The AdamW implementation in transformers is deprecated in favor of the PyTorch one
from torch.optim import AdamW
from transformers import get_scheduler
from transformers import pipeline

from datasets import load_dataset

from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    TextDataset,
    set_seed
)

In [5]:
# Load pretrained tokenizer and model
model_name = "pranavpsv/gpt2-genre-story-generator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50266, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )


In [4]:
# Sanity check of pre-trained model
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
stories = generator("<BOS> <superhero> Shrek", max_length=200, num_return_sequences=2)
print(*[story['generated_text'] + "\n\n\n------------------------\n" for story in stories])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


<BOS> <superhero> Shrek and Aunt May look after their two young nephews, and their son, after they are kicked out of their house, and Uncle Vernon is arrested. Shrek and his friend Fudd are sent to reform school, while Aunt May takes on their father's job as a janitor after Aunt May passes away. Together, the three children visit various museums, learning to use the magic of their grandfather's house. When Shrek meets Fudd, they bond, and while they argue over Uncle Vernon's job and Uncle May's contract with the IRS, their friendship blossoms into love and together they find a new home. Meanwhile, Fudd is arrested again and sends to jail, while the children return home.


------------------------
 <BOS> <superhero> Shrek and his friends Peter and pals are living happily, but Peter is depressed and Peter cannot accept it. Wandering along the streets of his fictional city of "Rotten Milk", he meets two children, two little girls, and a little boy named Finn. A small shopkeeper, Mr. Pyp, 

In [12]:
# Special tokens:
print(*tokenizer.all_special_tokens)

<BOS> <EOS> <|endoftext|> <PAD> <superhero> <action> <drama> <thriller> <horror> <sci_fi>


In [None]:
# HOW TRAINING DATA SHOULD LOOK
# The dataset should be a t

["<BOS> <action> <drama> Inception <SEP> Plot text of Inception <EOS>",
 "<BOS> <superhero> <action> <thriller> Batman Begins <SEP> Plot text of Batman Begins <EOS>",
 "<BOS> <sci_fi> <action> Star Wars <SEP> Plot text of Star Wars <EOS>"]
# We must therefore add a special token <SEP> to separate the title from the plot,
# as well as a token for each additional genre we want to support.
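To make the format above concrete, here is a minimal sketch of a helper that builds one training line. The name `format_example` and its arguments are hypothetical and not part of the original notebook. It assumes each genre is already written as its special token (for example `<action>`), and it collapses whitespace so that every example stays on a single line, since the `text` loader used later treats each line as one example.

```python
def format_example(genres, title, plot):
    """Build one training line: <BOS> <genre>... title <SEP> plot <EOS>."""
    genre_part = " ".join(genres)
    # Collapse newlines and repeated spaces so the example stays on one line.
    clean_plot = " ".join(plot.split())
    return f"<BOS> {genre_part} {title} <SEP> {clean_plot} <EOS>"


line = format_example(
    ["<action>", "<drama>"],
    "Inception",
    "Plot text\nof Inception",
)
print(line)
```

Writing one such line per movie into `train.txt` and `validate.txt` would give files in the shape this notebook expects.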

In [13]:
# Add additional special tokens
special_tokens = tokenizer.additional_special_tokens
special_tokens.extend(['<comedy>','<SEP>']) # TODO add all genres/keywords we want to allow
new_special_tokens_dict = {'additional_special_tokens': special_tokens}
num_added_toks = tokenizer.add_special_tokens(new_special_tokens_dict)

# We must resize token embeddings since new special tokens were added
model.resize_token_embeddings(len(tokenizer))

# Special tokens:
print(*tokenizer.all_special_tokens)

<BOS> <EOS> <|endoftext|> <PAD> <superhero> <action> <drama> <thriller> <horror> <sci_fi> <comedy> <SEP>


# Create dataset

First, we load the dataset.

In [None]:
datasets = load_dataset("text", data_files={"train": "train.txt", "validation": "validate.txt"})
print(datasets['train'][0])

### Causal language modeling
For causal language modeling (CLM) we are going to take all the texts in our dataset and concatenate them after they are tokenized. Then we will split them into examples of a certain sequence length. This way the model will receive chunks of contiguous text that may look like:
    
    part of text 1
    
or

    end of text 1 <BOS_TOKEN> beginning of text 2
    
 
depending on whether they span over several of the original texts in the dataset or not.
**Also the labels will be the same as the inputs, shifted to the left.**
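The alignment between inputs and labels can be illustrated on plain Python lists. This is a toy sketch with made-up token ids; it only mimics how each position is paired with the next token and does not compute any loss.

```python
# Toy token ids standing in for one chunk of tokenized text.
input_ids = [10, 11, 12, 13, 14]
labels = list(input_ids)  # labels start out as a copy of the inputs

# Next-token prediction: the token at position i is scored against token i + 1.
model_inputs = input_ids[:-1]
targets = labels[1:]  # the labels, shifted one position to the left
pairs = list(zip(model_inputs, targets))
print(pairs)
```

Each pair reads as (token the model sees, token it should predict next).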

In [None]:
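def tokenize_function(examples):
    # No padding or truncation here: the texts are concatenated and split into
    # fixed-size blocks afterwards, so padding tokens would only fill those blocks.
    return tokenizer(examples["text"])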

tokenized_datasets = datasets.map(tokenize_function, batched=True, remove_columns=["text"])
print(tokenizer.decode(tokenized_datasets["train"][0]["input_ids"]))

Now for the harder part: we need to concatenate all our texts together, then split the result into small chunks of a certain block_size. To do this, we will use the map method again, with the option batched=True. This option lets us change the number of examples in the datasets by returning a different number of examples than we received. This way, we can create our new samples from a batch of examples.
First, we grab the maximum length our model was pretrained with. This might be a bit too big to fit in our GPU RAM, so here we take less, at just 128.

In [None]:
#block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:


In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First, note that we duplicate the inputs for our labels. This is because the models in the 🤗 Transformers library shift the labels internally, so we don't need to do it manually.

Also note that by default, the map method sends batches of 1,000 examples to the preprocessing function. So here, the remainder is dropped once per batch of 1,000 examples, to make the concatenated tokenized texts of each batch a multiple of block_size. You can adjust this behavior by passing a larger batch size (which will also be processed more slowly). You can also speed up the preprocessing by using multiprocessing:
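The effect of the batch size on the dropped remainder can be seen on a toy example. The sketch below uses a simplified stand-in for the grouping step (the hypothetical `group_into_blocks`, operating on plain lists of made-up token ids) rather than the notebook's function, so it runs without any dataset.

```python
def group_into_blocks(token_lists, block_size):
    """Concatenate token lists and split them into full blocks, dropping the tail."""
    flat = [tok for tokens in token_lists for tok in tokens]
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


texts = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10], [11, 12]]  # 12 tokens in total

# One batch holding all four texts: 12 tokens give 3 blocks of 4, nothing is dropped.
one_batch = group_into_blocks(texts, block_size=4)

# Two batches of two texts each: 7 tokens give 1 block (3 dropped),
# then 5 tokens give 1 block (1 dropped).
two_batches = group_into_blocks(texts[:2], 4) + group_into_blocks(texts[2:], 4)

print(len(one_batch), len(two_batches))
```

Smaller batches mean more remainders, so more tokens are discarded overall.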

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
print(lm_datasets)

We can check that our datasets have changed: the samples now contain chunks of block_size contiguous tokens, potentially spanning several of our original texts.

In [None]:
print(tokenizer.decode(lm_datasets["train"][0]["input_ids"]))
print()
print(tokenizer.decode(lm_datasets["train"][1]["input_ids"]))

Now that the data has been cleaned, we're ready to instantiate our Trainer.

In [None]:
training_args = TrainingArguments(
    "finetuning_test",
    evaluation_strategy = "epoch",
    num_train_epochs=1,
    learning_rate=1e-5,
    weight_decay=0.01,
    #push_to_hub=True,
)

The last argument, push_to_hub (commented out above), sets everything up so that the model is pushed to the Hub regularly during training. Keep it disabled if you have not set up authentication with the Hub. If you want to save your model locally under a name that differs from the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the hub_model_id argument to set the repo name (it needs to be the full name, including your namespace: for instance "sgugger/gpt-finetuned-wikitext2" or "huggingface/gpt-finetuned-wikitext2").
We pass all of these along to the Trainer class:

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

In [None]:
train_results = trainer.train()

In [None]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
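Perplexity is the exponential of the average per-token cross-entropy loss, which is why the cell above exponentiates `eval_loss`. A small worked example with made-up loss values (for illustration only, not results from this model):

```python
import math

# Hypothetical per-token negative log-likelihoods (natural log).
token_losses = [2.0, 3.0, 4.0]
mean_loss = sum(token_losses) / len(token_losses)
perplexity = math.exp(mean_loss)
print(f"Perplexity: {perplexity:.2f}")

# Sanity check: a model that spreads probability uniformly over 50 tokens
# has a loss of log(50) per token, hence a perplexity of about 50.
uniform_perplexity = math.exp(-math.log(1 / 50))
```

Lower is better: a perplexity of N roughly means the model is as uncertain as if it were choosing uniformly among N tokens at each step.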

### TODO
Make inference (create a pipeline?) https://huggingface.co/transformers/task_summary.html#text-generation
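One possible shape for this step is sketched below. The helper names `build_prompt` and `generate_plots` are hypothetical. The sketch assumes the fine-tuned model should be prompted in the training format, ending right after `<SEP>`, and that the generator behaves like the `text-generation` pipeline used in the sanity check (a callable returning a list of dicts with a `generated_text` key).

```python
def build_prompt(genre_token, title):
    """Build a prompt in the training format, stopping right after <SEP>."""
    return f"<BOS> {genre_token} {title} <SEP>"


def generate_plots(generator, genre_token, title, **generation_kwargs):
    """Call a text-generation callable and return only the generated strings."""
    prompt = build_prompt(genre_token, title)
    outputs = generator(prompt, **generation_kwargs)
    return [out["generated_text"] for out in outputs]


# With the fine-tuned model this could be used as follows (not run here):
# generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
# plots = generate_plots(generator, "<comedy>", "Shrek",
#                        max_length=200, num_return_sequences=2)
```

Passing the generator in as an argument keeps the helper independent of how the model is loaded.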