# About

This notebook fine-tunes a pre-trained GPT-2 language model to generate text (movie plots conditioned on genre).

In [1]:
!pip install transformers datasets git-lfs



In [1]:
from huggingface_hub import notebook_login
notebook_login()

Login successful
Your token has been saved to /Users/antonclaesson/.huggingface/token


# START


In [2]:
import math
import torch

from datasets import load_dataset
from transformers import (
    pipeline,
    AdamW,
    get_scheduler,
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    TextDataset,
    set_seed
)

#from tqdm.auto import tqdm

In [4]:
# Load pretrained tokenizer and model
model_name = "pranavpsv/gpt2-genre-story-generator"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, config=config)
model = AutoModelForCausalLM.from_pretrained(model_name)

In [5]:
# Sanity check of pre-trained model
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
stories = generator("<BOS> <superhero> Shrek", max_length=200, num_return_sequences=2)
print(*[story['generated_text'] + "\n\n\n------------------------\n" for story in stories])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


<BOS> <superhero> Shrek, a New York orphaned boy, and his Uncle Ben travel to Camp Crystal while Ben is away on a business trip from New York City. Upon arrival at Camp Crystal Ben is welcomed by Queen Hippolyta's daughter, Princess Evangeline; and the three visit the Fortress of Solitude for the annual Quidditch match. After much exploration of the Fortress Ben meets King Eobard Thordan, who falls in love with Ben and helps to secure an alliance between the two kingdoms. During the event, Ben and Evangeline are spotted by Thordan who plans to capture them for a reward.  Thordan captures Ben, the children, and Evangeline's maid, and forces Ben to take him to El Nino, a mountain pass, while the rest of Ben's group are captured by the rebels. Thordan also captures Evangeline, the Queen's new servant, and forces Ben and Ben's group to return home, but Evangeline


------------------------


------------------------



# Load dataset

First, we inspect the genres, then load the dataset and split it into training and validation sets.

In [14]:
# Display genres and count
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data/genres.csv', names=['genre', 'count'])
data = data.sort_values(by='count', ascending=False)
pd.set_option('display.max_rows', 7)
data.head(100000)

| Unnamed: 0 | genre | count |
|---|---|---|
| 9 | drama | 19134 |
| 16 | comedy | 10467 |
| 18 | romance film | 6666 |
| ... | ... | ... |
| 111 | c-movie | 1 |
| 286 | comdedy | 1 |
| 362 | homoeroticism | 1 |


In [31]:
# Load dataset from text file called "data.txt" and split into train/val
datasets = load_dataset("text", data_files="data.txt")['train']
datasets = datasets.train_test_split(train_size=0.975)
datasets['validation'] = datasets.pop('test')
datasets

Using custom data configuration default-aac4e4e1cce5f43e
Reusing dataset text (/Users/antonclaesson/.cache/huggingface/datasets/text/default-aac4e4e1cce5f43e/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)



Loading cached split indices for dataset at /Users/antonclaesson/.cache/huggingface/datasets/text/default-aac4e4e1cce5f43e/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/cache-fffe7ee87ffe1501.arrow and /Users/antonclaesson/.cache/huggingface/datasets/text/default-aac4e4e1cce5f43e/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/cache-ed1c6d8aa0f9e101.arrow


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 40176
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1031
    })
})
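
The split sizes above add up to 41,207 lines, of which 97.5% go to training. As a rough illustration, here is a pure-Python sketch of a shuffled split. It assumes the training count is the floor of `train_size * n_rows`; `shuffled_split` is a hypothetical helper written for this note, not the `datasets` implementation.

```python
import random

def shuffled_split(n_rows, train_size, seed=0):
    """Return (train_indices, validation_indices) for a shuffled split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # fixed seed keeps the split reproducible
    n_train = int(train_size * n_rows)     # floor; assumed rounding rule
    return indices[:n_train], indices[n_train:]

train_idx, val_idx = shuffled_split(41207, 0.975)
print(len(train_idx), len(val_idx))  # 40176 1031
```

Fixing the seed is a design choice that makes the validation set identical across runs, which keeps perplexity numbers comparable.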

In [55]:
# Example
print(datasets['train'][5]['text'] + '\n')
print(datasets['train'][6]['text'] + '\n')
print(datasets['train'][47]['text'])

<BOS> <thriller> <melodrama> <adventure> <black-and-white> <drama> <romance film> Ceiling Zero <SEP> Old pals Jake Lee, Tex Clarke and Dizzy Davis flew together in the Army  during World War I. Almost twenty years later, Jake is the manager of the Newark, New Jersey branch of Federal Airlines, a New York-based airline company. Tex works as an airmail pilot and Dizzy, also still flying planes, is seeking employment with his friends. Prior to his hot-shot arrival , a New York associate warns Jake about Dizzy, calling him unreliable and troublesome. Insulted, Jake replies that Dizzy is one of the best pilots in the country, telling a few stories about his fearlessness and bravery. Jake hires Dizzy as an airmail pilot. Dizzy is immediately attracted to 'Tommy' Thomas, a 19 year old girl also working there, who has just learned to fly solo. In order to go on a date with her, Dizzy, scheduled for a flight to Cincinnati in the evening, pretends he is suddenly sick and gets Tex to replace him.

As can be seen, the examples have different lengths. Examples longer than 1024 tokens need to be truncated, since this is the maximum input length of GPT-2.
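
Judging from the example printed above, each line appears to follow the layout `<BOS> <genre> ... title <SEP> plot`. A small sketch of a helper that assembles a prompt in that layout could look as follows. `build_prompt` is a hypothetical name, and the layout is inferred from the sample rather than from a documented specification.

```python
def build_prompt(genres, title=""):
    """Assemble '<BOS> <genre> ... title <SEP>' in the layout seen in the data."""
    parts = ["<BOS>"] + [f"<{genre}>" for genre in genres]
    if title:
        # The title is followed by <SEP>, after which the plot text starts
        parts += [title, "<SEP>"]
    return " ".join(parts)

print(build_prompt(["thriller", "drama"], "Ceiling Zero"))
# <BOS> <thriller> <drama> Ceiling Zero <SEP>
```

The resulting string can be passed to the text-generation pipeline in the same way as the `"<BOS> <superhero> Shrek"` prompt used in the sanity check.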

## Tokenization

We now need to tokenize the dataset. The pre-trained model that we are using has special tokens for only a few genres. We need to add special tokens for all of our new genres, as well as a special \<SEP\> token, to the tokenizer.

In [58]:
new_special_tokens = ['<SEP>']
new_special_tokens.extend(['<' + str(genre) + '>' for genre in data['genre']])
print(new_special_tokens[:10])

['<SEP>', '<drama>', '<comedy>', '<romance film>', '<thriller>', '<action>', '<world cinema>', '<crime fiction>', '<horror>', '<black-and-white>']


In [59]:
# Add new special tokens to the tokenizer 
special_tokens = tokenizer.additional_special_tokens
special_tokens.extend(new_special_tokens) 
new_special_tokens_dict = {'additional_special_tokens': special_tokens}
num_added_toks = tokenizer.add_special_tokens(new_special_tokens_dict)

# We must resize token embeddings since new special tokens were added
model.resize_token_embeddings(len(tokenizer))

# Special tokens:
print(*tokenizer.all_special_tokens[:20])

<BOS> <EOS> <|endoftext|> <PAD> <superhero> <action> <drama> <thriller> <horror> <sci_fi> <SEP> <comedy> <romance film> <world cinema> <crime fiction> <black-and-white> <indie> <action/adventure> <adventure> <family film>
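
To see why the embedding matrix has to grow when tokens are added, here is a toy model of the two steps using plain Python containers. The helpers `add_tokens` and `resize_embeddings` are hypothetical stand-ins for illustration only; they are not the library's implementation, and the initialisation scale of 0.02 is an assumption.

```python
import random

def add_tokens(vocab, new_tokens):
    """Add unseen tokens to vocab (token -> id); return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:          # tokens already known are skipped
            vocab[token] = len(vocab)   # next free id
            added += 1
    return added

def resize_embeddings(embeddings, new_size, dim, rng):
    """Append randomly initialised rows until there is one row per token id."""
    while len(embeddings) < new_size:
        embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return embeddings

rng = random.Random(0)
vocab = {"<BOS>": 0, "<EOS>": 1, "<drama>": 2}
embeddings = [[0.0] * 4 for _ in vocab]   # one 4-dimensional row per token

num_added = add_tokens(vocab, ["<SEP>", "<drama>", "<comedy>"])
resize_embeddings(embeddings, len(vocab), dim=4, rng=rng)
print(num_added, len(vocab), len(embeddings))  # 2 5 5
```

Without the resize step, the ids of the new tokens would point past the end of the embedding table, which is why the resize has to follow the addition of tokens.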


**Tokenize the dataset**

We tokenize the dataset. The tokenized examples contain the columns 'attention_mask', which marks real tokens with 1 and padding tokens with 0, and 'input_ids', which holds the vocabulary id of each token. We drop the 'text' column since it is no longer needed.


In [11]:
def tokenize_function(examples):
    """
    padding='max_length' to pad to a length specified by the max_length argument 
    or the maximum length accepted by the model.
    truncation=True to truncate to a maximum length accepted by the model
    """
    return tokenizer(examples["text"], padding='max_length', truncation=True)

tokenized_datasets = datasets.map(tokenize_function, batched=True, remove_columns=["text"])
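
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a minimal sketch that works on plain lists of token ids. `pad_or_truncate` is a hypothetical helper, and the pad id of 0 is chosen only for the example.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return fixed-length input_ids plus the matching attention_mask."""
    kept = token_ids[:max_length]      # truncation: drop everything past max_length
    n_pad = max_length - len(kept)     # padding needed; 0 if the input was truncated
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }

short = pad_or_truncate([11, 12, 13], max_length=5, pad_id=0)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)
print(short)  # {'input_ids': [11, 12, 13, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [1, 2, 3, 4, 5], 'attention_mask': [1, 1, 1, 1, 1]}
```

Padding every example to the maximum length is simple but wasteful for short texts; padding per batch is a common alternative.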


In [12]:
# Return the dataset columns as PyTorch tensors
tokenized_datasets.set_format("torch")
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids'],
        num_rows: 40176
    })
    validation: Dataset({
        features: ['attention_mask', 'input_ids'],
        num_rows: 1031
    })
})

In [13]:
# Finally, extract the splits and optionally select a small subset
# (here 2 training examples and 1 validation example, for a quick test run)
train_set = tokenized_datasets['train'].select(list(range(2)))
valid_set = tokenized_datasets['validation'].select(list(range(1)))

In [14]:
train_set[0]

{'attention_mask': tensor([1, 1, 1,  ..., 1, 1, 1]),
 'input_ids': tensor([50257,   220, 50263,  ..., 22028,   323,   338])}

### Training
First, we set up the training arguments. The last argument, `push_to_hub=True`, configures the run so that the model is pushed to the Hub regularly during training.

Then we pass the training arguments to the `Trainer`.

In [20]:
finetuned_model_name = "movie-plot-generator"
training_args = TrainingArguments(
    finetuned_model_name,
    evaluation_strategy = "epoch",
    num_train_epochs=1,
    learning_rate=1e-5,
    weight_decay=0.01,
    push_to_hub=True
)

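from transformers import DataCollatorForLanguageModeling

# The tokenized data has no 'labels' column, so the model would not return a loss.
# With mlm=False the collator copies input_ids into labels and masks padding positions.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=valid_set,
    data_collator=data_collator,
)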

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Cloning https://huggingface.co/AntonClaesson/movie-plot-generator into local empty directory.
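
No learning-rate scheduler is specified above. As far as we know, `Trainer` then defaults to a linear decay from `learning_rate` down to zero with no warmup, but this is worth checking against the installed version. A minimal sketch of such a schedule, with `linear_decay_lr` as a hypothetical helper:

```python
def linear_decay_lr(step, total_steps, base_lr=1e-5, warmup_steps=0):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup
        return base_lr * step / max(1, warmup_steps)
    # Afterwards, decay linearly so the rate reaches 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

schedule = [linear_decay_lr(step, total_steps=4) for step in range(5)]
print(schedule)
```

With only two training examples and one epoch, the run has very few optimizer steps, so the rate drops to zero almost immediately.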


In [None]:
train_results=trainer.train()

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
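
Perplexity is the exponential of the mean negative log-likelihood per token, which is why `math.exp` is applied to the evaluation loss above. A small self-contained sketch, where `perplexity` is a hypothetical helper that takes the probabilities the model assigned to the observed tokens:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always spreads its probability evenly over 4 tokens
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 2))  # 4.0
```

A perplexity of 4 therefore means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens.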

### Push to Hub

Push the tokenizer and the model to the Hub.

In [26]:
# Save to the directory named by the variable, not to a literal "finetuned_model_name" folder
tokenizer.save_pretrained(finetuned_model_name)
tokenizer.push_to_hub(finetuned_model_name)

tokenizer config file saved in movie-plot-generator/tokenizer_config.json
Special tokens file saved in movie-plot-generator/special_tokens_map.json


OSError: On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean


In [None]:
trainer.push_to_hub()

------
### Causal language modeling
For causal language modeling (CLM) we are going to take all the texts in our dataset and concatenate them after they are tokenized. Then we will split them into examples of a certain sequence length. This way the model will receive chunks of contiguous text that may look like:
    
    part of text 1
    
or

    end of text 1 <BOS_TOKEN> beginning of text 2
    
 
depending on whether they span over several of the original texts in the dataset or not.
**Also the labels will be the same as the inputs, shifted to the left.**

Now for the harder part: we need to concatenate all our texts together then split the result in small chunks of a certain block_size. To do this, we will use the map method again, with the option batched=True. This option actually lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.
First, we grab the maximum length our model was pretrained with. This might be a bit too big to fit in our GPU RAM; in that case, decrease the size.

In [None]:
#block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:


In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First note that we duplicate the inputs for our labels. This is because the models in the 🤗 Transformers library shift the labels internally when computing the loss, so we don't need to do it manually.
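
To illustrate what that shift amounts to, here is a small sketch that pairs each input position with the label it is scored against. `next_token_pairs` is a hypothetical helper for illustration, not a library function, and the token ids are arbitrary.

```python
def next_token_pairs(input_ids, labels):
    """Pair each input position with the label it is trained to predict."""
    # The prediction at position i is compared with labels[i + 1];
    # the last input position has no target and is dropped.
    return list(zip(input_ids[:-1], labels[1:]))

input_ids = [50257, 220, 50263, 464]
labels = list(input_ids)   # labels are a plain copy of the inputs
print(next_token_pairs(input_ids, labels))
# [(50257, 220), (220, 50263), (50263, 464)]
```

Each pair reads as "given the tokens up to and including the first id, predict the second id".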

Also note that by default, the map method will send a batch of 1,000 examples to be treated by the preprocessing function. So here, we will drop the remainder to make the concatenated tokenized texts a multiple of block_size every 1,000 examples. You can adjust this behavior by passing a higher batch size (which will also be processed more slowly). You can also speed up the preprocessing by using multiprocessing:

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
print(lm_datasets)

And we can check our datasets have changed: now the samples contain chunks of block_size contiguous tokens, potentially spanning over several of our original texts.

In [None]:
print(tokenizer.decode(lm_datasets["train"][0]["input_ids"]))
print()
print(tokenizer.decode(lm_datasets["train"][1]["input_ids"]))
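
For a check that does not need the full dataset, the grouping logic can also be exercised on a toy batch. The sketch below restates `group_texts` with `block_size` as an explicit parameter so that it runs on its own; the toy ids are made up for the example.

```python
def group_texts(examples, block_size):
    """Concatenate all sequences per column and cut them into block_size chunks."""
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # Labels are a copy of the inputs
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result

toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1]],
}
grouped = group_texts(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Three texts with 9 tokens in total become two blocks of 4, the first block spans two original texts, and the trailing token 9 is dropped.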

Now that the data has been preprocessed, we're ready to instantiate our Trainer.