[![Colab Badge Link](https://img.shields.io/badge/open-in%20colab-blue)](https://colab.research.google.com/github/Glasgow-AI4BioMed/tutorials/blob/main/further_pretraining_a_lm.ipynb)

# Further pretraining a language model and generating text with it

This Colab demonstrates taking a pretrained language model (`distilgpt2` in this case), pretraining it further using the HuggingFace Trainer, and finally generating new text with it.

The first part is largely based on the [HuggingFace language modeling tutorial](https://huggingface.co/docs/transformers/tasks/language_modeling).

## Install dependencies

If needed, you could install dependencies with the command below:

```
pip install transformers datasets accelerate
```

## Further pretraining a language model

The first part uses some new text to further pretrain the language model.

### Get text to further pretrain

We'll download [Shakespeare sonnets](https://raw.githubusercontent.com/brunoklein99/deep-learning-notes/master/shakespeare.txt) to further pretrain the language model on.

In [None]:
!wget https://raw.githubusercontent.com/brunoklein99/deep-learning-notes/master/shakespeare.txt

And we'll load it up and store it as one long string.

In [None]:
with open('shakespeare.txt') as f:
  shakespeare = [ line.strip() for line in f ]
  shakespeare = shakespeare[4:] # Skip the title
  shakespeare = " ".join(shakespeare)

Let's look at the beginning of it:

In [None]:
shakespeare[:100]

### Tokenize the text

Now we'll tokenize the text and convert it to token IDs. We'll use the `distilgpt2` model here. Note that we're stuck with the tokenizer that was created when the model was first trained, so if our new text contains interesting new words, we can't adapt the tokenizer to handle them well and it may split them up strangely.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

For instance, Shakespeare uses the word 'honorificabilitudinitatibus' in Love’s Labour’s Lost. Maybe it would be important that the tokenizer deals with it gracefully. As we're building on an existing language model, we have to keep the already existing tokenizer (and cannot create a new one). Let's see how it does.

In [None]:
tokenizer.tokenize('honorificabilitudinitatibus')

Not great but probably doesn't matter too much here. This may be more important in domains such as biomedical text where there are a lot of uncommon words that a general-purpose tokenizer badly butchers into unhelpful subword tokens.

Now let's tokenize our big bit of text.

In [None]:
tokenized = tokenizer(shakespeare)

It gives a warning about the tokenized text being far longer than the maximum sequence length for this model (1024). We need to split it into blocks to be processed one at a time.

There are two key fields that we'll examine: `input_ids` and `attention_mask`.

The `input_ids` are the numeric token identifiers, and the `attention_mask` tells the Transformer which tokens to ignore, such as padding (which we shouldn't have here). Let's look at the `input_ids` first:

In [None]:
tokenized['input_ids'][:10]

We can check what those `input_ids` translate back to in text using the `.decode` function of the tokenizer.

In [None]:
tokenizer.decode([4863, 37063, 301, 8109, 356, 6227, 2620, 11, 1320, 12839])

What about the `attention_mask`? We don't have padding so this shouldn't show much. It should contain a `1` for tokens to pay attention to and `0` for tokens to ignore.

In [None]:
tokenized['attention_mask'][:10]

All `1`s. In fact, if we check the whole `attention_mask` there are only 1s as we have no padding and hence no tokens to ignore.

In [None]:
set(tokenized['attention_mask'])
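
To make the role of the `attention_mask` more concrete, here is a minimal pure-Python sketch of what padding would look like if we had several sequences of different lengths. The `pad_batch` helper and the `pad_id` value are hypothetical and chosen only for illustration; this is not how the tokenizer is implemented internally.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token IDs to a common length and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that should be ignored
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up sequences of different lengths
batch = pad_batch([[11, 12, 13], [21]])
print(batch["input_ids"])       # [[11, 12, 13], [21, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```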

### Splitting the dataset into blocks

How many tokens do we have in the whole corpus?

In [None]:
total_length = len(tokenized['input_ids'])
total_length

As noted in a warning when we tokenized the text, it's too long to process in one go. We need to split it into chunks. Let's follow the [HuggingFace tutorial](https://huggingface.co/docs/transformers/tasks/language_modeling)'s choice of 128.

In [None]:
block_size = 128

And for simplicity, we want our total length to be an exact multiple of the block size, so let's make that happen:

In [None]:
total_length = (total_length // block_size) * block_size
total_length

Now we split the `input_ids` and `attention_mask` in `tokenized` into blocks of length 128.

In [None]:
tokenized_blocks = {
    k: [t[i : i + block_size] for i in range(0, total_length, block_size)] for k, t in tokenized.items()
}
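
The truncate-then-slice logic is easier to see on a tiny example. The sketch below applies the same idea to ten made-up token IDs with a block size of 4; `split_into_blocks` is a hypothetical helper written for this illustration, not part of HuggingFace.

```python
def split_into_blocks(token_ids, block_size):
    """Split a flat list of token IDs into equal blocks, dropping the remainder."""
    usable_length = (len(token_ids) // block_size) * block_size
    return [token_ids[i : i + block_size] for i in range(0, usable_length, block_size)]

toy_ids = list(range(10))  # stand-in for real token IDs
toy_blocks = split_into_blocks(toy_ids, block_size=4)
print(toy_blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Note that the last two IDs (8 and 9) are discarded because they don't fill a whole block. With a large corpus, losing fewer than `block_size` tokens is usually negligible.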

How many blocks have we got?

In [None]:
len(tokenized_blocks['input_ids'])

We're going to be training using this data so we need to tell the system what the expected output is. In causal language modelling, we're doing next token prediction. Hence the tokens that are used as input are effectively the intended outputs as well. Practically, they are shifted over by one, so that the target output token for an input token is the next one (and not itself). But HuggingFace does that shift for us, and we just copy the `input_ids` in as a field called `labels` that the Trainer picks up.

In [None]:
tokenized_blocks["labels"] = tokenized_blocks["input_ids"].copy()
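
As a rough illustration of the shift described above, the sketch below pairs each position in a block with the token that should be predicted next. The `next_token_pairs` helper is hypothetical, and the block uses toy word-level pieces rather than real `distilgpt2` tokens.

```python
def next_token_pairs(block):
    """Pair each input position with the token the model should predict next."""
    inputs = block[:-1]   # every token except the last has a successor
    targets = block[1:]   # the same tokens shifted left by one
    return list(zip(inputs, targets))

# Toy pieces for illustration only (not the real tokenization)
toy_block = ["From", " fairest", " creatures", " we"]
for current, target in next_token_pairs(toy_block):
    print(repr(current), "->", repr(target))
```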

### Creating a Dataset object

Before we can start running this, we need to turn this data into a `Dataset` object that HuggingFace is happy to play with. We can use the `from_dict` function for that.

In [None]:
from datasets import Dataset

lm_dataset = Dataset.from_dict(tokenized_blocks)
lm_dataset

And practically we want a training set and a validation set so that we can watch the various metrics to understand how well the model is training and generalizing. We can use the `.train_test_split` function of the `Dataset` object for this.

In [None]:
lm_dataset = lm_dataset.train_test_split(test_size=0.2, shuffle=True)
lm_dataset

### Training!

Now we get ready to actually train the language model. First we set up a `DataCollatorForLanguageModeling`, which does the nice job of batching the data and getting it into the right form for our task (language modeling). We use `mlm=False`, which tells it that we are not doing masked language modeling; instead, we are doing causal language modeling.

In [None]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

Now we load the pretrained `distilgpt2` model. This has already been trained on lots of text, and we're going to take it a bit further and train it on the Shakespeare text.

In [None]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

And now for some actual training. Realistically, you would need to try many parameter settings, monitor the validation loss and decide the best set up. But for now, we'll just pick some values.

**Importantly:** We haven't told HuggingFace anything about GPUs, but it will, by default, check if one is available and use it. This lab should have a GPU, so training should run quickly.

In [None]:
training_args = TrainingArguments(
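    output_dir="notused", # Nothing is written here because save_strategy="no" below
    save_strategy="no", # We'll save the model ourselves at the end (checkpoints may be worth keeping for longer, slower training runs)
    eval_strategy="epoch", # Evaluate on the validation set once per epoch (older transformers releases call this evaluation_strategy)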
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=3,
    report_to="none" # Let's not use wandb here
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

Yay, we've trained a model. It would be a good idea to try different hyperparameters (e.g. more epochs, different learning rate, etc) to see what extra performance can be achieved.
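
One common way to make the validation loss easier to interpret when comparing hyperparameter settings is to convert it to perplexity, which is the exponential of the mean per-token cross-entropy loss. The sketch below shows only that conversion. The `perplexity_from_loss` helper is hypothetical and the loss values are invented for illustration; they are not results from this notebook. With the trainer above, you would likely pass in `trainer.evaluate()["eval_loss"]`.

```python
import math

def perplexity_from_loss(mean_cross_entropy):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)

# Made-up loss values, purely to show that a lower loss gives a lower perplexity
for loss in (4.0, 3.5):
    print(f"loss={loss:.2f} -> perplexity={perplexity_from_loss(loss):.1f}")
```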

Now we'll save the model to disk where it could be loaded in another process and used for generation. You could also use the `trainer.save_model` function here.

In [None]:
model.save_pretrained("shakespeare_model")

You can also save the tokenizer (though nothing has changed as we used an unchanged `distilgpt2` tokenizer).

In [None]:
tokenizer.save_pretrained("shakespeare_model")

Let's see what the files look like. The important one is `config.json` that HuggingFace looks for when it tries to load a model.

In [None]:
!ls shakespeare_model

## Using a language model for generating text

We'll use a text generation pipeline now. We can either provide a specific model & tokenizer (which may be helpful if we need to do some custom things) or give it the name for it to load itself.

Let's do the longer way first, where we load the model and tokenizer ourselves:

**Importantly:** We do need to tell the pipeline to use the GPU here (with `device='cuda:0'`).

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# This loads the model and tokenizer from disk
model = AutoModelForCausalLM.from_pretrained("shakespeare_model")
tokenizer = AutoTokenizer.from_pretrained("shakespeare_model") # Could also have loaded the `distilgpt2` tokenizer as it is the same

generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device='cuda:0')

Alternatively, an equivalent way of loading it is to give the pipeline the name (`"shakespeare_model"`) that we saved it under earlier. HuggingFace will always search the local directory first for that model before going to the [HuggingFace Hub](https://huggingface.co/docs/hub/index) and downloading it (if one there matches).

In [None]:
generator = pipeline("text-generation", model="shakespeare_model", device='cuda:0')
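
The hard-coded `device='cuda:0'` is expected to fail on a machine without a GPU. If you want the same notebook to run in both settings, one option is to choose the device at runtime. This is a minimal sketch, assuming PyTorch is the backend; `pick_device` is a hypothetical helper, not a HuggingFace function.

```python
def pick_device():
    """Return 'cuda:0' when a CUDA GPU is usable, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed we cannot query for a GPU
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)
```

The resulting string could then be passed to the pipeline as `device=device` in place of the literal `'cuda:0'`.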

Now we can pass in some text to the text generation pipeline. We need to tell it how many extra tokens to generate with `max_new_tokens`.

In [None]:
generator("Is that a dagger which I see", max_new_tokens=20)

We can also pass in a few sequences:

In [None]:
several_sequences = [
    "To be, or not to",
    "All the world's a",
    "A horse! a horse! my kingdom for a",
    "Friends, Romans, countrymen, lend me your"
]

generator(several_sequences, max_new_tokens=20)

There are various parameters for text generation. The defaults may be set in `model.generation_config` or fall back to the defaults in the [GenerationConfig documentation](https://huggingface.co/docs/transformers/v4.30.0/main_classes/text_generation).

Let's examine setting a few of them manually. We'll explicitly ask for three possible sequences (using `num_return_sequences=3`) using sampling (`do_sample=True`) so that there is a random factor in generation. Sampling works well for making interesting text, but for experiments with a language model it is more typical to not use sampling.

In [None]:
generator("Is that a dagger which I see", max_new_tokens=20, do_sample=True, num_return_sequences=3)

To generate some text deterministically without sampling (which is often the approach for experiments on language models), use `do_sample=False`. This outputs the most likely token each time.

In [None]:
generator("Is that a dagger which I see", max_new_tokens=20, do_sample=False)
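
Picking the most likely token at every step is often called greedy decoding. The sketch below shows a single step of it on an invented next-token distribution; the tokens, probabilities, and the `greedy_pick` helper are all made up for illustration and do not come from the model.

```python
def greedy_pick(next_token_probs):
    """Return the token with the highest probability (ties go to the first seen)."""
    return max(next_token_probs, key=next_token_probs.get)

# Hypothetical next-token distribution, invented for illustration
toy_probs = {" before": 0.45, " in": 0.30, " here": 0.20, " banana": 0.05}
print(repr(greedy_pick(toy_probs)))  # ' before'
```

Because there is no randomness involved, repeating the call always gives the same token, which is why this mode suits reproducible experiments.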

You can also set `top_p`, which restricts sampling to the smallest set of most likely tokens whose cumulative probability reaches `top_p`. This filters out less common tokens (and reduces the likelihood of generating really odd-looking text).

In [None]:
generator("Is that a dagger which I see", max_new_tokens=20, do_sample=True, top_p=0.7)
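
To give a feel for what `top_p` does, here is a simplified pure-Python sketch of the filtering idea (often called nucleus sampling): keep the most likely tokens until their cumulative probability reaches `top_p`, then renormalise. The distribution and the `top_p_filter` helper are invented for illustration, and the library's actual implementation works on logits and may differ in detail.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    # Renormalise so the surviving probabilities sum to 1 again
    return {token: p / total for token, p in kept}

# Hypothetical next-token distribution, invented for illustration
toy_probs = {" before": 0.45, " in": 0.30, " here": 0.20, " banana": 0.05}
filtered = top_p_filter(toy_probs, top_p=0.7)
print(filtered)
```

With `top_p=0.7`, only the two most likely tokens survive here, so the unlikely `" banana"` can no longer be sampled.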

Or `temperature`. See [this page](https://lukesalamone.github.io/posts/what-is-temperature/) for more of an explanation. Low temperature makes it more deterministic, higher temperature makes it more "creative".

In [None]:
generator("Is that a dagger which I see", max_new_tokens=20, do_sample=True, temperature=0.1)
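
The effect of temperature can be sketched in a few lines: the model's raw scores (logits) are divided by the temperature before the softmax turns them into probabilities. The logits below are invented and `softmax_with_temperature` is a hypothetical helper, so treat this as an illustration of the idea rather than the library's code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a temperature of 0.1 almost all of the probability lands on the top token, which is why low temperatures behave nearly deterministically, while higher temperatures spread probability over the alternatives.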

You could also use `return_full_text=False` to get only the newly generated text (instead of the prompt plus the generated text).

In [None]:
generator("Is that a dagger which I see", max_new_tokens=20, do_sample=False, return_full_text=False)

There are a lot of different parameters that can be explored when using sampling for text generation (with `do_sample=True`). However, there are plenty of scenarios where you don't want sampling. The parameters can be examined on the [GenerationConfig documentation page](https://huggingface.co/docs/transformers/v4.30.0/main_classes/text_generation).

## Further Reading

HuggingFace provides a good [blog post](https://huggingface.co/blog/how-to-generate) about language generation that goes over many of the techniques, including beam search. Note again that many of these techniques use sampling (and will be non-deterministic), which isn't always what is desired. It depends on the problem. There are also details of the different generation algorithms on [this page](https://huggingface.co/docs/transformers/v4.30.0/generation_strategies).