### TODO: 

**Fix class imbalance** 
  * Use NLP data augmentation (https://neptune.ai/blog/data-augmentation-nlp). Examples that contain rare genres and no common genres can be oversampled with augmentation techniques. Problem: this is a multi-label problem (each plot can have many genres). To decide how much to oversample, we can use a "repeat ceiling". A better way would be to utilize
  https://medium.com/thecyphy/handling-data-imbalance-in-multi-label-classification-mlsmote-531155416b87
  https://link.springer.com/chapter/10.1007/978-3-642-40846-5_16 to decide which examples to oversample.
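
The "repeat ceiling" idea can be sketched in plain Python. This is only an illustration under assumptions that are not part of the notes above: the names `repeat_factors`, `common_threshold`, `target_count` and `repeat_ceiling` are hypothetical, an example is oversampled only when none of its genres is common, and the repeat count is derived from its rarest genre. MLSMOTE selects examples differently.

```python
import math
from collections import Counter

def repeat_factors(examples, common_threshold, target_count, repeat_ceiling):
    """Return how many copies of each example to keep (1 = no oversampling).

    examples: list of genre lists, one list per movie plot.
    """
    counts = Counter(genre for genres in examples for genre in genres)
    factors = []
    for genres in examples:
        # Skip examples with no genres or with at least one common genre:
        # repeating them would also inflate the majority labels.
        if not genres or any(counts[g] >= common_threshold for g in genres):
            factors.append(1)
            continue
        rarest = min(counts[g] for g in genres)
        # Aim for target_count occurrences of the rarest genre, capped by the ceiling.
        factors.append(min(repeat_ceiling, math.ceil(target_count / rarest)))
    return factors

# Toy data: "drama" is common, "filipino" and "world history" are rare.
toy = [["drama"], ["drama", "comedy"], ["drama", "filipino"], ["filipino"], ["world history"]]
print(repeat_factors(toy, common_threshold=3, target_count=4, repeat_ceiling=3))
```

The ceiling keeps a genre with only a handful of plots from being repeated so often that the model memorizes those plots.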
  
  
**Fine-tune using distilgpt2** 
  * GPT2 is too large for our GPU so we use distilled version https://huggingface.co/distilgpt2
  * For best performance: Make LM dataset using plots only and finetune the model a bit on this. (Large text document with each plot after the other. See example notebook from hugging face)
  * Once the model outputs plot-like text, we want to train using the labeled dataset with genres and title.
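
A possible way to build the plots-only corpus is sketched below. It assumes every labeled line follows the `<BOS> <genre> ... title <SEP> plot` layout used in this notebook; the helper name `extract_plot` and the two toy lines are hypothetical.

```python
def extract_plot(line, sep_token="<SEP>"):
    """Return only the plot text of one labeled example (everything after <SEP>)."""
    _, found, plot = line.partition(sep_token)
    # Lines without a separator are returned unchanged rather than dropped.
    return plot.strip() if found else line.strip()

labeled = [
    "<BOS> <action> Shrek in the swamp <SEP> He was in the swamp.",
    "<BOS> <drama> <comedy> Sandwich <SEP> Sher Singh is a struggling scriptwriter.",
]
# One plot after the other, as a single large text document.
corpus = "\n".join(extract_plot(line) for line in labeled)
print(corpus)
```

The resulting text could then be written to a file and loaded with `load_dataset("text", ...)` in the same way as `data.txt`.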
  
**To start**

Since we probably won't have time to do everything, let's start with the current dataset and simply fine-tune without any augmentation and without pre-training on plot text, to see what happens.
   


# About

This notebook fine-tunes a pre-trained GPT-2 language model (distilgpt2) to generate movie plots from a genre and title prompt.

In [None]:
!pip install transformers datasets git-lfs ipywidgets

In [None]:
!conda list

In [None]:
from huggingface_hub import notebook_login
notebook_login()

# START


In [1]:
import math
import pickle
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.auto import tqdm

import torch
from datasets import load_dataset
from transformers import (
    TrainerCallback,
    GPT2Config,
    GPT2Tokenizer,
    GPT2LMHeadModel,
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline,
    AdamW,
    TrainingArguments,
    Trainer,
)

In [2]:
start_from_checkpoint = False

# Load pretrained tokenizer and model
finetuned_model_name = 'movie-plot-generator'

if start_from_checkpoint:
    config=AutoConfig.from_pretrained(finetuned_model_name)
    tokenizer = AutoTokenizer.from_pretrained(finetuned_model_name)
    model = AutoModelForCausalLM.from_pretrained(finetuned_model_name, config=config)
else:
    model_name = 'distilgpt2' 
    config=AutoConfig.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name, config=config)
    model = AutoModelForCausalLM.from_pretrained(model_name)

In [13]:
# Check the pre-trained model (only possible with the pranavpsv model, and without the <SEP> token)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
stories = generator("<BOS> <action> Shrek in the swamp <SEP> He was in the ", max_length=200, num_return_sequences=2)
print(*[story['generated_text'] + "\n\n\n------------------------\n" for story in stories])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


<BOS> <action> Shrek in the swamp <SEP> He was in the 


------------------------
 <BOS> <action> Shrek in the swamp <SEP> He was in the 


------------------------



# Load dataset

First, we load the dataset and split it into training and validation sets.

In [3]:
# Display genres and count
data = pd.read_csv('data/genres.csv',names=['genre', 'count'])
data=data.sort_values(by='count', ascending=False)
pd.set_option('display.max_rows', 7)
data.head(None)

| Unnamed: 0 | genre | count |
|---|---|---|
| 9 | drama | 19134 |
| 16 | comedy | 10467 |
| 18 | romance film | 6666 |
| ... | ... | ... |
| 255 | bengali cinema | 10 |
| 102 | filipino | 10 |
| 260 | world history | 10 |


In [4]:
# Load dataset from text file called "data.txt" and split into train/val
datasets = load_dataset("text", data_files="data.txt")['train']
datasets = datasets.train_test_split(train_size=0.985, seed=42)
datasets['validation'] = datasets.pop('test')
datasets

Using custom data configuration default-31bdbee93af814ae
Reusing dataset text (C:\Users\Anton\.cache\huggingface\datasets\text\default-31bdbee93af814ae\0.0.0\e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)



Loading cached split indices for dataset at C:\Users\Anton\.cache\huggingface\datasets\text\default-31bdbee93af814ae\0.0.0\e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5\cache-6b25b4449dc837a1.arrow and C:\Users\Anton\.cache\huggingface\datasets\text\default-31bdbee93af814ae\0.0.0\e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5\cache-20ce305499314001.arrow





DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 40188
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 613
    })
})

In [5]:
# Example
print(datasets['train'][0]['text'] + '\n')
print(datasets['train'][1]['text'] + '\n')
print(datasets['train'][2]['text'])

<BOS> <romantic comedy> <world cinema> <drama> <comedy> <romantic drama> <romance film> Sandwich <SEP> Sher Singh a.k.a. ShekharGovinda , a struggling movie scriptwriter, is in love with a temperamental girl Nisha . However, Nisha's wealthy father arranges her marriage to his friend's son Vicky . Sher Singh rushes home in response to a telegram to find that his marriage has been arranged with a local girl, Sweetie , in return for his sister's marriage to Sweetie's brother. Since Sher's sister has a lame foot, this may be her only chance of finding a good husband. For his sister's sake, therefore, Sher gets married to Sweetie and returns to Mumbai. Meanwhile, Nisha refuses to marry Vicky and is ready to kill herself. To pacify Nisha, her father forces Sher to get married to her. Sher doesn't get a chance to explain about his first marriage. However, after the wedding, he confesses to Nisha's father that he is already married. Nisha's father advises Sher not to tell her anything. Meanwhi

As can be seen, the examples are of different lengths. Examples longer than 1024 tokens need to be truncated, as this is the maximum input length of GPT-2.
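
To see how many examples would be affected by truncation, the token counts can be compared against the limit. The helper below is a hypothetical sketch: `count_over_limit` is not part of this notebook, and whitespace splitting stands in for a real tokenizer so that the snippet runs on its own. With the actual tokenizer, `tokenizer.encode` could be passed as `encode`.

```python
def count_over_limit(texts, encode, max_length=1024):
    """Count how many texts produce more than max_length tokens."""
    return sum(1 for text in texts if len(encode(text)) > max_length)

# Toy run: only the second text has more than 4 whitespace-separated tokens.
texts = ["a b c", "a b c d e f", "a"]
print(count_over_limit(texts, str.split, max_length=4))
```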

## Tokenization

We now need to tokenize the dataset. The original tokenizer doesn't have all the special tokens we require.

In [6]:
print(*tokenizer.all_special_tokens)

<|endoftext|>


We need to add the special tokens that we use in our dataset. 

In [7]:
# Set special tokens
tokenizer.bos_token = '<BOS>'
tokenizer.eos_token = '<EOS>'
tokenizer.pad_token = '<PAD>'
tokenizer.sep_token = '<SEP>'

# Add special tokens
new_special_tokens = ['<BOS>','<EOS>','<PAD>','<SEP>']
new_special_tokens.extend(['<' + str(genre) + '>' for genre in data['genre']])
print(f'Number of added genres: {len(new_special_tokens)}')
special_tokens = tokenizer.additional_special_tokens

special_tokens.extend(new_special_tokens) 
new_special_tokens_dict = {'additional_special_tokens': special_tokens}
num_added_toks = tokenizer.add_special_tokens(new_special_tokens_dict)

# We must resize token embeddings since new special tokens were added
model.resize_token_embeddings(len(tokenizer))

print(model.config.vocab_size, tokenizer.vocab_size + len(tokenizer.get_added_vocab()))
assert(model.config.vocab_size == tokenizer.vocab_size + len(tokenizer.get_added_vocab()))
print(*tokenizer.all_special_tokens)

Number of added genres: 265
50522 50522
<BOS> <EOS> <|endoftext|> <SEP> <PAD> <drama> <comedy> <romance film> <thriller> <action> <world cinema> <crime fiction> <horror> <black-and-white> <indie> <action/adventure> <adventure> <family film> <short film> <romantic drama> <animation> <musical> <science fiction> <mystery> <romantic comedy> <fantasy> <comedy film> <crime thriller> <war film> <period piece> <japanese movies> <comedy-drama> <film adaptation> <documentary> <silent film> <psychological thriller> <bollywood> <western> <chinese movies> <black comedy> <lgbt> <teen> <parody> <family drama> <children's/family> <coming of age> <martial arts film> <cult> <sports> <television movie> <slasher> <suspense> <biographical film> <biography> <supernatural> <satire> <political drama> <film noir> <slapstick> <melodrama> <children's> <action thrillers> <crime drama> <b-movie> <costume drama> <biopic [feature]> <history> <music> <art film> <ensemble film> <creature film> <spy> <gangster film> <b

In [8]:
tokenizer.tokenize('<BOS> <drama> He was' )

['<BOS>', 'Ġ', '<drama>', 'ĠHe', 'Ġwas']

**Tokenize the dataset**

We tokenize the dataset. The tokenized examples contain the columns 'attention_mask', which masks out the padding tokens, and 'input_ids', which holds the vocabulary id of each token. We drop the text column as it is no longer needed.

In [9]:
def tokenize_function(examples):
    """
    padding='max_length' to pad to a length specified by the max_length argument 
    or the maximum length accepted by the model.
    truncation=True to truncate each sequence to the maximum length accepted by the model
    """
    result = tokenizer(examples["text"], padding='max_length', truncation=True) # Max input according to model(1024)
    #result = tokenizer(examples["text"], max_length=512, padding='max_length', truncation=True)

    result["labels"] = result["input_ids"].copy()
    return result

tokenized_datasets = datasets.map(tokenize_function, batched=True, remove_columns=["text"])

Loading cached processed dataset at C:\Users\Anton\.cache\huggingface\datasets\text\default-31bdbee93af814ae\0.0.0\e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5\cache-9f952363e49421e1.arrow
Loading cached processed dataset at C:\Users\Anton\.cache\huggingface\datasets\text\default-31bdbee93af814ae\0.0.0\e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5\cache-f12dd1a3e64af226.arrow


Note that we duplicate the inputs to create our labels. This is because the models in the 🤗 Transformers library shift the labels internally, so we don't need to do it manually.
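
Because the examples are padded to `max_length`, copying `input_ids` verbatim means that padding positions also count toward the loss. A common remedy, sketched here as an assumption rather than as what this notebook does, is to set the labels to -100 wherever `attention_mask` is 0; -100 is the default `ignore_index` of `torch.nn.CrossEntropyLoss`. The helper name `mask_padding_labels` is hypothetical.

```python
def mask_padding_labels(batch, ignore_index=-100):
    """Copy input_ids to labels, replacing padded positions with ignore_index."""
    batch["labels"] = [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(batch["input_ids"], batch["attention_mask"])
    ]
    return batch

# Toy batch: the second sequence ends with two padding positions (mask = 0).
toy = {
    "input_ids": [[5, 6, 7, 8], [5, 6, 0, 0]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 0, 0]],
}
print(mask_padding_labels(toy)["labels"])
```

The function works on the plain lists that a batched `map` call passes around, so it could be applied inside `tokenize_function` in place of the plain copy.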

In [10]:
#Make dataset format pytorch tensors
tokenized_datasets.set_format("torch")
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels'],
        num_rows: 40188
    })
    validation: Dataset({
        features: ['attention_mask', 'input_ids', 'labels'],
        num_rows: 613
    })
})

In [11]:
# Finally, extract the datasets and select a subset if wanted
train_set = tokenized_datasets['train']#.select(list(range(10)))
valid_set = tokenized_datasets['validation']#.select(list(range(2)))

In [12]:
print(train_set, valid_set)

Dataset({
    features: ['attention_mask', 'input_ids', 'labels'],
    num_rows: 40188
}) Dataset({
    features: ['attention_mask', 'input_ids', 'labels'],
    num_rows: 613
})


### Training
First, set up the training arguments.
The last argument, `push_to_hub`, controls whether the model is pushed to the Hub regularly during training; it is set to `False` here.

Then pass the training arguments to the Trainer.

In [13]:
class SaveTokenizer(TrainerCallback):
    """
    A callback used to save the tokenizer whenever a model checkpoint is saved.
    """
    def on_save(self, args, state, control, **kwargs):
        tokenizer.save_pretrained(f"./{finetuned_model_name}/")

        
ce_loss = torch.nn.CrossEntropyLoss()
        
def compute_metrics(eval_pred):
    """
    The compute function needs to receive a tuple (with logits and labels)
    and has to return a dictionary with string keys (the name of the metric) and float values.
    It will be called at the end of each evaluation phase on the whole arrays of predictions/labels.
    """
    logits, labels = eval_pred
    # Calculate perplexity https://huggingface.co/transformers/perplexity.html
    # "the exponentiation of the cross-entropy between the data and model predictions."
    
    perplexity = math.exp(ce_loss(logits, labels))
    
    return {'perplexity': perplexity}

In [14]:
torch.cuda.empty_cache()
batch_size = 2 # 1:34:39 for one epoch (no evaluation steps) with batch_size = 2

training_args = TrainingArguments(
    finetuned_model_name,
    evaluation_strategy = "no",
    num_train_epochs=3,
    learning_rate=1e-6,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_steps=5000,
    save_total_limit=1,
    push_to_hub=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=valid_set,
    compute_metrics=compute_metrics,
    callbacks=[SaveTokenizer],
)

In [None]:
train_results = trainer.train()
# Persist the training summary; reload with pickle.load(open("train_results.pickle", "rb"))
with open("train_results.pickle", "wb") as f:
    pickle.dump(train_results, f)

model.save_pretrained(f"./{finetuned_model_name}/")
tokenizer.save_pretrained(f"./{finetuned_model_name}/")

***** Running training *****
  Num examples = 40188
  Num Epochs = 3
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 60282


| Step | Training Loss |
|---|---|
| 500 | 12.7062 |
| 1000 | 2.6767 |
| 1500 | 1.7077 |
| 2000 | 1.5865 |
| 2500 | 1.4661 |
| 3000 | 1.5165 |
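
The total of 60282 optimization steps reported above follows directly from the dataset size, the batch size and the number of epochs. A short check, assuming a single device and no gradient accumulation (the function name is hypothetical):

```python
import math

def total_optimization_steps(num_examples, batch_size, epochs):
    """Optimizer updates for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

print(total_optimization_steps(40188, batch_size=2, epochs=3))
```

With `save_steps=5000`, this corresponds to roughly twelve checkpoints over the run, of which only the latest is kept because `save_total_limit=1`.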


In [26]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 2
  Batch size = 2


TypeError: cross_entropy_loss(): argument 'input' (position 1) must be Tensor, not numpy.ndarray
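
The `TypeError` above arises because the `Trainer` hands `compute_metrics` NumPy arrays, while `torch.nn.CrossEntropyLoss` expects tensors. One possible fix, sketched here with NumPy only, also shifts the logits against the labels: the model does that shift internally for its own loss, but not for the arrays passed to `compute_metrics`. The sketch assumes that labels equal to -100 should be ignored and that the full logits array fits in memory, which may not hold for the whole validation set at 1024 tokens per example.

```python
import math
import numpy as np

def compute_metrics(eval_pred, ignore_index=-100):
    """Perplexity from NumPy logits (batch, seq, vocab) and labels (batch, seq)."""
    logits, labels = eval_pred
    # Position i predicts token i + 1, so drop the last logit and the first label.
    logits = np.asarray(logits)[:, :-1, :]
    labels = np.asarray(labels)[:, 1:]
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of each target token, skipping ignored positions.
    keep = labels != ignore_index
    safe_labels = np.where(keep, labels, 0)
    token_log_probs = np.take_along_axis(log_probs, safe_labels[..., None], axis=-1)[..., 0]
    mean_nll = -token_log_probs[keep].mean()
    return {"perplexity": math.exp(mean_nll)}

# Toy check: uniform logits over 4 tokens should give a perplexity of about 4.
toy_logits = np.zeros((1, 3, 4))
toy_labels = np.array([[0, 1, 2]])
print(compute_metrics((toy_logits, toy_labels)))
```

Since `trainer.evaluate()` already reports `eval_loss`, taking `math.exp(eval_results['eval_loss'])` as in the cell above is a cheaper route to the same quantity and avoids collecting the logits at all.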

In [None]:
tokenizer.push_to_hub(finetuned_model_name)
trainer.push_to_hub(finetuned_model_name)

In [27]:
# Inference test
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
stories = generator("<BOS> <action> The dark knight <SEP>", max_length=500, num_return_sequences=2)
print(*[story['generated_text'] + "\n\n\n------------------------\n" for story in stories])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


<BOS> <action> The dark knight <SEP>


------------------------
 <BOS> <action> The dark knight <SEP>


------------------------



### Push to HUB

Push tokenizer and model to hub

------
------
### Causal language modeling
For causal language modeling (CLM) we are going to take all the texts in our dataset and concatenate them after they are tokenized. Then we will split them in examples of a certain sequence length. This way the model will receive chunks of contiguous text that may look like:
    
    part of text 1
    
or

    end of text 1 <BOS_TOKEN> beginning of text 2
    
 
depending on whether they span over several of the original texts in the dataset or not.
**Also the labels will be the same as the inputs, shifted to the left.**
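
As a small illustration of that shift, the target at each position is simply the next token of the same sequence. The token strings below are made up for the example:

```python
tokens = ["<BOS>", "He", "was", "in", "the", "swamp"]

# The model reads tokens[:-1] and is trained to predict tokens[1:].
inputs = tokens[:-1]
targets = tokens[1:]

for seen, predicted in zip(inputs, targets):
    print(f"{seen!r} -> {predicted!r}")
```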

Now for the harder part: we need to concatenate all our texts together then split the result in small chunks of a certain block_size. To do this, we will use the map method again, with the option batched=True. This option actually lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.
First, we grab the maximum length our model was pretrained with. This might be a bit too big to fit in our GPU RAM; in that case, decrease the size.

In [None]:
#block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:


In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First note that we duplicate the inputs for our labels. This is because the models in the 🤗 Transformers library shift the labels internally, so we don't need to do it manually.
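
To see the grouping and the dropped remainder on a small input, the same logic can be run on toy data. In this sketch `block_size` is passed as an argument instead of being read from a global, and the token ids are made up:

```python
def group_texts(examples, block_size):
    """Concatenate tokenized examples and cut them into blocks of block_size tokens."""
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Three toy "texts" with 10 tokens in total; with block_size=4 the last 2 are dropped.
toy = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}
print(group_texts(toy, block_size=4)["input_ids"])
```

The first block spans the boundary between the first and second text, which is the "end of text 1, beginning of text 2" case described earlier.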

Also note that by default, the map method will send a batch of 1,000 examples to be treated by the preprocessing function. So here, we will drop the remainder to make the concatenated tokenized texts a multiple of block_size every 1,000 examples. You can adjust this behavior by passing a higher batch size (which will also be processed more slowly). You can also speed up the preprocessing by using multiprocessing:

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
print(lm_datasets)

And we can check our datasets have changed: now the samples contain chunks of block_size contiguous tokens, potentially spanning over several of our original texts.

In [None]:
print(tokenizer.decode(lm_datasets["train"][0]["input_ids"]))
print()
print(tokenizer.decode(lm_datasets["train"][1]["input_ids"]))

Now that the data has been cleaned, we're ready to instantiate our Trainer.