### TODO: 

**Fix class imbalance** 
  * Use NLP data augmentation (https://neptune.ai/blog/data-augmentation-nlp): examples that contain rare genres and no common genres can be oversampled with augmentation techniques. Problem: this is a multi-label problem (an example can have many genres), so oversampling one genre also changes the counts of the other genres on the same example. A simple way to decide how much to oversample is a "repeat ceiling", i.e. a cap on how often one example may be repeated. A better way would be to utilize the methods described in the following links to decide which examples to oversample:
  https://medium.com/thecyphy/handling-data-imbalance-in-multi-label-classification-mlsmote-531155416b87
  https://link.springer.com/chapter/10.1007/978-3-642-40846-5_16
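
A minimal sketch of the "repeat ceiling" idea, assuming each example comes with a list of genre labels. The function name `repeat_factors`, the `ceiling` parameter, and the rule of basing the factor on the most frequent genre of an example are assumptions made for illustration; they are not taken from the linked methods. An example is repeated only if all of its genres are rare, which matches the "rare genres and no common genres" condition.

```python
from collections import Counter


def repeat_factors(label_sets, ceiling=5):
    """Return how many times each example should appear after oversampling.

    label_sets: one list of genre labels per example (multi-label).
    ceiling:    hypothetical cap on the number of copies of one example.
    """
    # Count how often each genre occurs across the whole dataset.
    counts = Counter(label for labels in label_sets for label in labels)
    max_count = max(counts.values())

    factors = []
    for labels in label_sets:
        if not labels:
            factors.append(1)  # nothing to balance for unlabeled examples
            continue
        # The most frequent genre of the example decides the factor, so an
        # example that also carries a common genre is never repeated.
        most_common = max(counts[label] for label in labels)
        factors.append(min(ceiling, max(1, max_count // most_common)))
    return factors


label_sets = [["drama"], ["drama", "horror"], ["horror"], ["drama"], ["indie"], ["drama"]]
print(repeat_factors(label_sets, ceiling=3))
```

Each repeated copy would then be passed through an augmentation step instead of being duplicated verbatim.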
  
  
**Fine-tune using distilgpt2** 
  * GPT-2 is too large for our GPU, so we use the distilled version: https://huggingface.co/distilgpt2
  * For best performance: build an LM dataset from the plots only and fine-tune the model on it for a while. (One large text document with the plots one after the other. See the example notebook from Hugging Face.)
  * Once the model outputs plot-like text, train it on the labeled dataset with genres and titles.
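
A rough sketch of how the plots-only LM document could be assembled. The helper name `build_lm_corpus` is hypothetical, and separating plots with `<|endoftext|>` (the end-of-text token of the GPT-2 family, including distilgpt2) is an assumption about how the plots should be delimited, not something this notebook does.

```python
def build_lm_corpus(plots, separator="<|endoftext|>"):
    """Join plots into one large text, one plot after the other."""
    # Collapse whitespace inside each plot and skip empty entries.
    cleaned = [" ".join(plot.split()) for plot in plots if plot.strip()]
    # Put every plot on its own line, terminated by the separator token.
    return "".join(plot + separator + "\n" for plot in cleaned)


corpus = build_lm_corpus(["A man  returns home.\n", "", "Two friends travel."])
print(corpus)
```

The resulting string can be written to a text file and loaded the same way as the labeled dataset.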
  
**To start**
Since we probably won't have time to do everything, let's start with the current dataset and see what happens if we simply fine-tune without any augmentation and without pre-training on plot text.
   


# About

This notebook trains a GPT-2 language model from scratch on movie plots and uses it to generate text.

In [1]:
!pip install transformers datasets git-lfs ipywidgets



In [2]:
!conda list

# packages in environment at C:\Users\Anton\Anaconda3\envs\storygen:
#
# Name                    Version                   Build  Channel
abseil-cpp                20210324.2           h0e60522_0    conda-forge
aiohttp                   3.7.4.post0      py38h294d835_0    conda-forge
argon2-cffi               21.1.0           py38h294d835_0    conda-forge
arrow-cpp                 5.0.0           py38h9929e98_8_cpu    conda-forge
async-timeout             3.0.1                   py_1000    conda-forge
async_generator           1.10                       py_0    conda-forge
attrs                     21.2.0             pyhd8ed1ab_0    conda-forge
aws-c-cal                 0.5.11               he19cf47_0    conda-forge
aws-c-common              0.6.2                h8ffe710_0    conda-forge
aws-c-event-stream        0.2.7               h70e1b0c_13    conda-forge
aws-c-io                  0.10.5               h2fe331c_0    conda-forge
aws-checksums             0.1.11               h1e232aa_

pycparser                 2.20               pyh9f0ad1d_2    conda-forge
pygments                  2.10.0             pyhd3eb1b0_0  
pyopenssl                 21.0.0             pyhd8ed1ab_0    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pyqt                      5.12.3           py38haa244fe_7    conda-forge
pyqt-impl                 5.12.3           py38h885f38d_7    conda-forge
pyqt5-sip                 4.19.18          py38h885f38d_7    conda-forge
pyqtchart                 5.12             py38h885f38d_7    conda-forge
pyqtwebengine             5.12.1           py38h885f38d_7    conda-forge
pyrsistent                0.18.0                   pypi_0    pypi
pysocks                   1.7.1            py38haa244fe_3    conda-forge
python                    3.8.12               h6244533_0  
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python-xxhash             2.0.2            py38h294d835_0    conda-forge
python_abi 

In [3]:
from huggingface_hub import notebook_login
notebook_login()


# START


In [4]:
import math
import re
import pickle
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.auto import tqdm

import torch
from datasets import load_dataset
from transformers import (
    TrainerCallback,
    GPT2Config,
    GPT2Tokenizer,
    GPT2LMHeadModel,
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline,
    AdamW,
    TrainingArguments,
    Trainer,
)

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

model_name = 'movie-plot-generation-from-scratch'

# Load dataset

First, we load the dataset.

In [5]:
# Load the dataset from the text file "data_top_15_genres.txt". We won't use a validation set.
dataset = load_dataset("text", data_files="data_top_15_genres.txt")['train']
dataset

Using custom data configuration default-2fcf8d2135508f85
Reusing dataset text (C:\Users\Anton\.cache\huggingface\datasets\text\default-2fcf8d2135508f85\0.0.0\e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)






Dataset({
    features: ['text'],
    num_rows: 37031
})

## Tokenizer training

We now need to tokenize the dataset. We create a tokenizer and train it on our data.

In [6]:
# Add special tokens for each genre
genres = ['romantic drama', 'short film', 'family film',
          'adventure', 'action/adventure', 'indie',
          'black-and-white', 'horror', 'crime fiction',
          'world cinema', 'action', 'thriller', 
          'romance film', 'comedy', 'drama']

special_tokens = ['<UNK>', '<BOS>', '<EOS>', '<PAD>', '<SEP>']
genre_tokens =  [f'<{genre}>' for genre in genres]
all_special_tokens = special_tokens + genre_tokens

tokenizer = Tokenizer(BPE(unk_token="<UNK>"))
trainer = BpeTrainer(special_tokens=all_special_tokens, vocab_size=50257)
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(dataset['text'], trainer)

# load our tokenizer into huggingface transformers library 
# For some reason the special tokens are not assigned to the corresponding properties 
# even though tokenization works as intended. We therefore add the special tokens manually.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer, 
    model_input_names=['input_ids', 'attention_mask'])
special_tokens_dict = {'additional_special_tokens': genre_tokens}
tokenizer.add_special_tokens(special_tokens_dict)
tokenizer.unk_token = '<UNK>'
tokenizer.bos_token = '<BOS>'
tokenizer.eos_token = '<EOS>'
tokenizer.pad_token = '<PAD>'
tokenizer.sep_token = '<SEP>'

# Save 
tokenizer.save_pretrained(model_name)

('movie-plot-generation-from-scratch\\tokenizer_config.json',
 'movie-plot-generation-from-scratch\\special_tokens_map.json',
 'movie-plot-generation-from-scratch\\tokenizer.json')

### Define transformer model

In [7]:
# Load a new GPT2 model with 512 max length
config = GPT2Config(
    vocab_size=50257,
    n_positions=512,
    n_ctx=512,
)
model = GPT2LMHeadModel(config=config)

# Load tokenizer 
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)
tokenizer

PreTrainedTokenizerFast(name_or_path='movie-plot-generation-from-scratch', vocab_size=50257, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={'bos_token': '<BOS>', 'eos_token': '<EOS>', 'unk_token': '<UNK>', 'sep_token': '<SEP>', 'pad_token': '<PAD>', 'additional_special_tokens': ['<romantic drama>', '<short film>', '<family film>', '<adventure>', '<action/adventure>', '<indie>', '<black-and-white>', '<horror>', '<crime fiction>', '<world cinema>', '<action>', '<thriller>', '<romance film>', '<comedy>', '<drama>']})

**Tokenize the dataset**

We tokenize the dataset. The tokenized examples contain the columns 'attention_mask', which marks padding tokens, and 'input_ids', which holds the id of each token (a word or a piece of a word). We drop the text column, as it is no longer needed. Also note that we duplicate the inputs to create our labels. This is because the models in the 🤗 Transformers library shift the labels by one position internally, so we don't need to do it manually.

In [8]:
def tokenize_function(examples):
    result = tokenizer(examples["text"], max_length=512, padding='max_length', truncation=True)
    result["labels"] = result["input_ids"].copy()
    return result

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

#Make dataset format pytorch tensors
tokenized_dataset.set_format("torch")

# Finally, select a subset if wanted
train_set = tokenized_dataset#.select(list(range(10)))
train_set





Dataset({
    features: ['attention_mask', 'input_ids', 'labels'],
    num_rows: 37031
})
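
Because `padding='max_length'` is used, the copied labels also contain the id of `<PAD>` at every padded position, so those positions contribute to the loss unless they are masked. A common convention is to set them to -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The helper below sketches that idea on plain lists; `mask_padding` is a hypothetical name and is not used in this notebook.

```python
def mask_padding(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids into labels, replacing padded positions by ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Two toy sequences padded to length 5 (3 stands in for the <PAD> id).
input_ids = [[1, 17, 42, 2, 3], [1, 99, 2, 3, 3]]
attention_mask = [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]
print(mask_padding(input_ids, attention_mask))
```

Inside `tokenize_function`, the same expression could be applied to `result["input_ids"]` and `result["attention_mask"]` in place of the plain copy.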

### Training
First, set up the training arguments.
The `push_to_hub` argument, which would push the model to the Hub regularly during training, is not set here.

Then pass the training arguments to the Trainer.

In [9]:
class SaveTokenizer(TrainerCallback):
    """
    A callback used to save the tokenizer whenever a model checkpoint is saved.
    """
    def on_save(self, args, state, control, **kwargs):
        tokenizer.save_pretrained(model_name)

        
ce_loss = torch.nn.CrossEntropyLoss()
        
def compute_metrics(eval_pred):
    """
    The compute function needs to receive a tuple (with logits and labels)
    and has to return a dictionary with string keys (the name of the metric) and float values.
    It will be called at the end of each evaluation phase on the whole arrays of predictions/labels.
    """
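    logits, labels = eval_pred
    # Calculate perplexity https://huggingface.co/transformers/perplexity.html
    # "the exponentiation of the cross-entropy between the data and model predictions."
    # The Trainer passes NumPy arrays, so convert them to tensors first.
    logits = torch.as_tensor(logits)
    labels = torch.as_tensor(labels)
    # Shift so that the logits at position t are scored against the token at t + 1,
    # then flatten to (tokens, vocab) and (tokens,) as CrossEntropyLoss expects.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.shape[-1])
    shift_labels = labels[:, 1:].reshape(-1)
    perplexity = math.exp(ce_loss(shift_logits, shift_labels).item())

    return {'perplexity': perplexity}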
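
To make the definition concrete, here is a dependency-free sketch of perplexity for a single sequence: the logits at position t are scored against the token at position t + 1, the negative log-probabilities are averaged, and the result is exponentiated. The names `log_softmax` and `perplexity` and the list-based inputs are assumptions for illustration; the notebook itself works on batched arrays.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def perplexity(logits, token_ids, ignore_index=-100):
    """exp(mean cross-entropy) of next-token predictions for one sequence.

    logits:    one list of vocabulary scores per position.
    token_ids: the token id at each position (ignore_index marks padding).
    Assumes at least one target is not ignored.
    """
    total, count = 0.0, 0
    # logits[t] predicts token_ids[t + 1], hence the shift by one.
    for step_logits, target in zip(logits[:-1], token_ids[1:]):
        if target == ignore_index:
            continue
        total -= log_softmax(step_logits)[target]
        count += 1
    return math.exp(total / count)


# A model that is uniform over a 4-token vocabulary.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
print(perplexity(uniform_logits, [1, 2, 3]))
```

For a uniform model the perplexity equals the vocabulary size, which gives a useful sanity check for any implementation.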

In [10]:
torch.cuda.empty_cache()
batch_size = 2 # 1:34:39 for one epoch (no evaluation steps) with batch_size = 2

training_args = TrainingArguments(
    model_name,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=batch_size,
    save_steps=2000,
    save_total_limit=1,
    log_level='info',
    logging_steps=250
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    compute_metrics=compute_metrics,
    callbacks=[SaveTokenizer],
)

In [None]:
train_results = trainer.train()
# Save the training results. Load with: pickle.load(open(model_name + "/train_results.pickle", "rb"))
with open(model_name + "/train_results.pickle", "wb") as f:
    pickle.dump(train_results, f)

model.save_pretrained(model_name)
tokenizer.save_pretrained(model_name)

***** Running training *****
  Num examples = 37031
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 18516


Step,Training Loss
250,4.3881
500,3.5488
750,3.7126
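
The step count in the log follows from the dataset size and the batch size. A quick check, assuming a single device and no gradient accumulation (as reported in the log); `optimization_steps` is a hypothetical helper, not a library function.

```python
import math


def optimization_steps(num_examples, batch_size, epochs=1):
    """Number of optimizer steps when every batch triggers one update."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    return math.ceil(num_examples / batch_size) * epochs


print(optimization_steps(37031, 2))
```

With 37031 examples and a batch size of 2 this gives the 18516 steps shown in the log.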


In [None]:
# Inference test
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
stories = generator("<BOS> <romantic drama> Expecting the unexpected <SEP> Kajsa and Anton are", max_length=512, num_return_sequences=4)
print(*[story['generated_text'] + "\n\n\n------------------------\n" for story in stories])
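
The prompt in the inference test suggests the layout `<BOS> <genre> title <SEP> plot`. A small helper for building such prompts is sketched below, assuming that this layout also matches the lines of the training file (which this notebook does not show); `build_prompt` is a hypothetical name.

```python
def build_prompt(genres, title, plot_start=""):
    """Build a generation prompt: <BOS> <genre>... title <SEP> plot_start."""
    # Each genre becomes one special token, e.g. "<romantic drama>".
    genre_tokens = " ".join(f"<{genre}>" for genre in genres)
    prompt = f"<BOS> {genre_tokens} {title} <SEP>"
    # Optionally seed the plot with a few opening words.
    return f"{prompt} {plot_start}" if plot_start else prompt


print(build_prompt(["romantic drama"], "Expecting the unexpected", "Kajsa and Anton are"))
```

The genres passed in should be among the 15 genres that were added as special tokens; any other genre would be split into ordinary subword tokens.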

### Push to HUB

Push tokenizer and model to hub

### Causal language modeling
For causal language modeling (CLM) we are going to take all the texts in our dataset and concatenate them after they are tokenized. Then we will split them into examples of a certain sequence length. This way the model will receive chunks of contiguous text that may look like:
    
    part of text 1
    
or

    end of text 1 <BOS_TOKEN> beginning of text 2
    
 
depending on whether they span over several of the original texts in the dataset or not.

Now for the harder part: we need to concatenate all our texts together and then split the result into small chunks of a certain block_size. To do this, we will use the map method again, with the option batched=True. This option lets us change the number of examples in the dataset by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.
First, we grab the maximum length our model was pretrained with. This might be a bit too big to fit in our GPU RAM; in that case, decrease the size.

In [None]:
#block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:


In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First note that we duplicate the inputs for our labels. This is because the models in the 🤗 Transformers library shift the labels by one position internally, so we don't need to do it manually.

Also note that by default, the map method will send a batch of 1,000 examples to be treated by the preprocessing function. So here, we will drop the remainder to make the concatenated tokenized texts a multiple of block_size every 1,000 examples. You can adjust this behavior by passing a higher batch size (which will also be processed more slowly). You can also speed up the preprocessing by using multiprocessing:

In [None]:
lm_datasets = tokenized_dataset.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
print(lm_datasets)
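
To make the grouping concrete, here is a toy run of the same logic on three short "texts". This is a self-contained variant that takes `block_size` as a parameter instead of reading the global; the token ids are made up.

```python
def group_texts(examples, block_size):
    """Concatenate tokenized texts and cut them into blocks of block_size."""
    # Concatenate all texts of the batch, column by column.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # Labels are a copy of the inputs; the model shifts them internally.
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


# Three texts with 3 + 2 + 4 = 9 tokens in total.
toy_batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
grouped = group_texts(toy_batch, block_size=4)
print(grouped["input_ids"])
```

Nine tokens yield two blocks of four; the ninth token is dropped, and the first block spans the boundary between the first two texts.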

And we can check our datasets have changed: now the samples contain chunks of block_size contiguous tokens, potentially spanning over several of our original texts.

In [None]:
# The dataset was loaded as a single split, so there is no "train" key to index.
print(tokenizer.decode(lm_datasets[0]["input_ids"]))
print()
print(tokenizer.decode(lm_datasets[1]["input_ids"]))

Now that the data has been cleaned, we're ready to instantiate our Trainer.