**Note from Brown Applied Computing:** The following is a workshop from Brown Machine Intelligence Community (B-MIC). If you weren't there for the in-person workshop, we highly recommend running this workshop in Google Colab to learn about how to fine-tune GPT-2!

# Workshop: Baby ChatGPT

As we mentioned in the slides, we are going to be building and training our own "baby" ChatGPT that writes its own song lyrics based on a prompt that you give it.

A lot of the steps we talked about in the slides are taken care of by code written by other people that we can just import and use without manually doing the steps ourselves. This will become more clear as you progress through the workshop!

### Step 1: Install libraries

Here we are installing and importing the right libraries for the model. Libraries consist of code written by other developers that we can import and use without having to implement a lot of repetitive functions on our own. The libraries we will be using are:

*Datasets:* This provides tools for loading and processing the data on which we will train our model. It was created by Hugging Face, an open-source machine learning company.

*Transformers:* Also by Hugging Face, this contains the model architecture for transformers (mentioned in the slides) that actually make up our model.

*PyTorch (torch):* This is a machine learning library created by Meta that we use to train and manipulate our data and our model.

*Pandas:* This is a library for working with datasets in Python. It is very common in machine learning/deep learning and has a ton of useful functionality.

In [None]:
!pip install datasets transformers numpy
import datasets, transformers, torch
import pandas as pd

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.10.0-py3-none-any.whl (469 kB)
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (213 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloa

### Step 2: Load data

Now it's time for us to load our data. Here, we can use pandas to load in the data from the file we need (in this case, `lyrics-data-sub.txt`). Our data contains many, many examples, and many of them are not in English, so we filter the data down to 2000 English-only examples (so that training the model doesn't take too long).

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

df = pd.read_csv('gdrive/My Drive/Baby GPT/lyrics-data-sub.txt') # Load data into variable df
dataset = datasets.Dataset.from_pandas(df[df['language']=='en'].sample(2000)) # Restrict data to only 2000 English examples

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


### Step 3: Training and testing split

In machine learning, we split our data into different sets. Often, we split it into two sets: the training set and the testing set. The training set is the data we train our model on. After the model is trained, we use the testing set to see how well it works. The reason we do this is to make sure that our model is actually good at what we think it's good at: if a model does really well on the training data but not the testing data, it probably means that our model isn't as good at predicting as we think it is!

We can use the `train_test_split()` function to split our data into training and testing data. The `train_test_split()` function takes an argument `test_size`, which is the decimal representing how much of our data we want to reserve for testing.

**TODO:** Split up our dataset into training and testing data.

*Hint: a common split is 80% training, 20% testing*

In [None]:
# TODO: Split our data!
dataset = dataset.train_test_split(test_size=0.2)
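
To make the idea concrete, here is a minimal pure-Python sketch of what an 80/20 split does. It is not how the `datasets` library implements `train_test_split()`; the helper name `simple_split` and the fixed `seed` are assumptions made for illustration.

```python
import random

def simple_split(examples, test_size=0.2, seed=0):
    """Shuffle the examples, then reserve a fraction of them for testing."""
    shuffled = list(examples)              # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # shuffle reproducibly
    n_test = int(len(shuffled) * test_size)
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}

split = simple_split(range(2000), test_size=0.2)
print(len(split["train"]), len(split["test"]))  # 1600 400
```

With 2000 examples and `test_size=0.2`, this gives 1600 training and 400 testing examples, and no example lands in both sets.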

### Step 4: Explore the data

Now that we have our data, let's take a look at what is actually in the dataframe. Below, you can explore the dataset, thinking about:

- What columns are in the data?
- What types of data are in the columns?
- How can this data be useful for our BabyGPT?

Some functions you might use to explore the data include:

- `df.head()`: Display the first 5 rows of the dataset.
- `df.columns`: Print the columns of the dataset.
- `df['<column_name>']`: Access a certain column (`<column_name>`) from the data
- `df['<column_name>'][n]`: Access the `n`th row in the column `<column_name>` (here, `n` is a number)

Modify the example below to see what you can find in the data with these commands (or any others you might know).

In [None]:
# TODO: have a look at the df!
df.head()

Unnamed: 0,Lyric,language
0,[verse: 1]\nCame to the world in a time where ...,en
1,[verse 1]\nTha world is mine nigga get back\nD...,en
2,"Now come one,\nCome all,\nTo this tragic affai...",en
3,"Maybe...\nOh if I could pray, and I try, dear\...",en
4,"You must've a been in a place so dark, couldn'...",en


### Step 5: Tokenization

Now it is time to process our data, starting with *tokenization*. As a reminder, this is the process of changing our text input into numerical data. Luckily, the `transformers` library comes with tokenizers that will do most of the dirty work for us, since the process can get pretty complicated. Below, we load a tokenizer from the `transformers` library.

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained('gpt2-medium')
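
As a rough picture of what a tokenizer does, the sketch below maps words to integer IDs and back. This is a toy word-level scheme invented for illustration (`build_vocab`, `encode`, and `decode` are hypothetical helpers); GPT-2's real tokenizer uses byte-pair encoding over sub-word pieces, so its IDs will look different.

```python
def build_vocab(texts):
    """Assign each distinct word an integer ID, in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    """Turn a string into a list of token IDs."""
    return [vocab[word] for word in text.split()]

def decode(ids, vocab):
    """Turn a list of token IDs back into a string."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)

vocab = build_vocab(["my love burns like a fire", "my heart burns"])
ids = encode("my heart burns like a fire", vocab)
print(ids)                 # [0, 6, 2, 3, 4, 5]
print(decode(ids, vocab))  # my heart burns like a fire
```

The model only ever sees the list of numbers; the tokenizer is what translates between those numbers and readable text.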

We need to define a function to map a given training example, currently in string format, into a tokenized version. To do this, we apply the tokenizer loaded just above to each of the song lyrics in the dataset! The function `preprocess_func` is defined below. The input type is a dictionary, with the `'Lyric'` key containing a string with all of the lyrics for a song. 

The tokenizer can be called on a string in the following manner: `tokenizer(string)`.

In [None]:
def preprocess_func(example):
    # TODO: fill in!
    return tokenizer(example['Lyric'], truncation=True) # we additionally pass truncation=True, you don't need to worry about what it does!

Now that we've written `preprocess_func`, we can apply it to the entire training set. For this, we can use the `Dataset.map` function. Provide the appropriate function to the following `Dataset.map` call, such that the entire dataset is tokenized!

In [None]:
# TODO: fill in!
tokenized_dataset = dataset.map(preprocess_func, batched=True, remove_columns=dataset["train"].column_names)

Map:   0%|          | 0/1600 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

You can ignore the following cell. It concatenates the tokenized lyrics and divides them into 'blocks' of `block_size` tokens each, then sets the labels of each block to be the same as its inputs. The labels will later be shifted by one token, so that the model learns to predict each next token.

In [None]:
block_size = 256
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
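
If you are curious, this sketch runs the same chunking logic on a toy batch with a block size of 4 so the result is easy to read. The token IDs are made up, and `group_texts_demo` is a hypothetical copy of the function above that takes `block_size` as a parameter.

```python
def group_texts_demo(examples, block_size):
    # Join all sequences in the batch into one long list per key
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the leftover tokens that do not fill a whole block
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # Labels start out identical to the inputs
    result["labels"] = result["input_ids"].copy()
    return result

toy_batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]}
blocks = group_texts_demo(toy_batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(blocks["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Notice that blocks can span the boundary between two songs, and that the last two tokens (9 and 10) are dropped because they do not fill a whole block.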


Now, call `.map` again on the tokenized dataset, providing `group_texts` as the function to apply to each song lyric. Additionally pass `batched=True`. 

In [None]:
# TODO: fill in!
lm_dataset = tokenized_dataset.map(group_texts, batched=True)

Map:   0%|          | 0/1600 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

### Step 6: Training

Great! Now that our data is all tokenized, and organized in a way that makes it easy to train a language model, let's get started training! For this, we're going to import more tools from `transformers`, and use their handy APIs to handle all the training details automatically. Below, we import and initialize a `DataCollatorForLanguageModeling` object, which will organize the training data into 'batches', and deal with other details like padding the input.

Batches are just a way to pass multiple inputs to the model at once, as well as get multiple outputs/predictions, which is a lot faster than training the model one example at a time!  

In [None]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
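
Here is a minimal sketch of what 'batching with padding' means, in plain Python. The helper `pad_batch` and the example IDs are made up for illustration; the real collator works on tensors and also prepares the labels. The pad ID 50256 is GPT-2's end-of-sequence token ID, which the cell above reuses as the padding token.

```python
def pad_batch(sequences, pad_id):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[11, 12, 13], [21, 22], [31]], pad_id=50256)
print(batch["input_ids"])
# [[11, 12, 13], [21, 22, 50256], [31, 50256, 50256]]
print(batch["attention_mask"])
# [[1, 1, 1], [1, 1, 0], [1, 0, 0]]
```

In this workshop every block already has exactly `block_size` tokens, so very little padding is actually needed.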

Now we get to import the model that we'll be using! As outlined in the slides, this model is GPT-2. In actuality, there are several different versions of this model, the main difference between them being their size. Due to hardware restrictions, we're using the biggest one that we can, named `'gpt2-medium'`. Below, pass the name of the model that we're using (as a string) to the `.from_pretrained` call.

#### Pre-trained models
The version of GPT-2 that we're importing below is **pre-trained**, meaning that it's already pretty good at predicting the next word in text. However, it's extremely general: it has been trained to predict the next most likely word using a lot of publicly available text from the internet. We want a model that's really good (ok, maybe not *that good*) at writing song lyrics specifically, so we'll take the already-trained model and just do a bit of extra training. This is called **fine-tuning**.

#### ChatGPT and fine-tuning
Technically, ChatGPT is just a fine-tuned version of a model from OpenAI's GPT-3.5 series, a successor to GPT-3! Although it's been fine-tuned using some special reinforcement learning techniques, the main process is the same as the one we're doing here: take an existing pre-trained model, and fine-tune it on a specific task. OpenAI essentially (this sweeps some details under the rug) fine-tuned their best model to be really good at answering questions in a chat-like way.

In [None]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
# TODO: fill in!
model = AutoModelForCausalLM.from_pretrained('gpt2-medium')

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/1.52G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

#### Note on Hugging Face models
Hugging Face, the creator of the `transformers` library, has many different models for various applications of NLP, such as language modeling (what we're doing here), masked language modeling, translation, classification, and more! Loading other kinds of models is very simple: if I wanted to use BERT (a masked language model) for a masked language modeling task, I could use `model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')` to load it!

#### Training:
We'll let Hugging Face deal with the details of training; we just need to provide a bunch of hyperparameters, as well as our model and datasets.

Below, fill in the following hyperparameters: We want a learning rate of `2e-5`, weight decay of `0.01`, 2 training epochs, and both batch sizes to be `4`.

In [None]:
# TODO: fill in!
training_args = TrainingArguments(
    output_dir="lyric-model",
    evaluation_strategy="epoch",
    learning_rate=2e-5, # controls how much we update the model when it's wrong
    weight_decay=0.01,
    num_train_epochs=2, # controls how many times we want to train on the entire training dataset
    per_device_train_batch_size=4, # controls how big batches should be during training
    per_device_eval_batch_size=4, # controls how big batches should be during evaluation
)
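
To give some intuition for the first two hyperparameters, here is a toy update rule applied to a single number. It is a deliberate simplification: `Trainer` uses the AdamW optimizer by default, not this plain gradient step, and the function `sgd_step` and the toy loss `(w - 3)^2` are made up for illustration.

```python
def sgd_step(w, grad, learning_rate, weight_decay=0.0):
    """One plain gradient step, followed by decoupled weight decay."""
    w = w - learning_rate * grad              # move against the gradient
    w = w - learning_rate * weight_decay * w  # shrink the weight slightly toward 0
    return w

# Minimize the toy loss (w - 3)^2, whose gradient is 2 * (w - 3)
w = 0.0
for _ in range(100):
    w = sgd_step(w, grad=2 * (w - 3), learning_rate=0.1)
print(round(w, 4))  # 3.0
```

A larger learning rate takes bigger steps toward the minimum (and can overshoot if it is too large), while weight decay gently pulls the weights toward zero to discourage overly large values.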

Finally, we can train! **Warning, this will take 15-20 minutes. You can stop the training early if need be, but let it train for at least one epoch first!** To stop the training, just stop the execution of the cell, and move on!

We will pass the model, training arguments that we just defined, datasets, and the data_collator to a `Trainer` object, which will use them to train the model!

**HINT:** *Remember, `lm_dataset` (our dataset) is like a dictionary, it has two keys: `'train'` and `'test'`.*

In [None]:
torch.cuda.empty_cache()

# TODO: fill in!
trainer = Trainer(
    model=model, # pass the model here!
    args=training_args, # pass that TrainingArguments object we made in the previous cell
    train_dataset=lm_dataset['train'], # pass training set, check out the hint for help!
    eval_dataset=lm_dataset['test'], # pass testing set....
    data_collator=data_collator, # finally, pass the DataCollatorForLanguageModeling
)

trainer.train()

***** Running training *****
  Num examples = 2194
  Num Epochs = 2
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 1098
  Number of trainable parameters = 354823168
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,2.97,2.853773
2,2.8212,2.84727


Saving model checkpoint to lyric-model/checkpoint-500
Configuration saved in lyric-model/checkpoint-500/config.json
Configuration saved in lyric-model/checkpoint-500/generation_config.json
Model weights saved in lyric-model/checkpoint-500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 533
  Batch size = 4
Saving model checkpoint to lyric-model/checkpoint-1000
Configuration saved in lyric-model/checkpoint-1000/config.json
Configuration saved in lyric-model/checkpoint-1000/generation_config.json
Model weights saved in lyric-model/checkpoint-1000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 533
  Batch size = 4


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1098, training_loss=2.8832812144239526, metrics={'train_runtime': 1072.2419, 'train_samples_per_second': 4.092, 'train_steps_per_second': 1.024, 'total_flos': 2037569323794432.0, 'train_loss': 2.8832812144239526, 'epoch': 2.0})
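
The `Total optimization steps = 1098` line in the log can be reproduced from the other numbers it reports. A quick check, assuming a single device and no gradient accumulation (as the log indicates):

```python
import math

num_examples = 2194   # "Num examples" in the training log
batch_size = 4        # per_device_train_batch_size
num_epochs = 2        # num_train_epochs

steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch may be smaller
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 549 1098
```

Note that `Num examples` is 2194 rather than 1600 because the training examples are now 256-token blocks produced by `group_texts`, not whole songs.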

### Step 7: Evaluation

We'd like to know how well our model works, so we can use the testing dataset to see how well it performs on unseen lyrics! `Trainer` objects have a handy `.evaluate` function, which returns a dictionary with a couple of keys, only one of which we care about: `'eval_loss'`. The measurement of performance that we'll use is called **perplexity**. In this case, the perplexity of a language model is essentially a measure of how unexpected the testing set is. There are many possible words that could come next after the start of a phrase. For example, the following are all very plausible, despite the correct next word being 'mat':

'The cat sat on the ____':

Plausible outputs:

'floor', 'bed', 'lap'


Because of this, we can't measure how good a language model is by asserting that there's only one possible word that comes after 'The cat sat on the'. So, we look at the probability that the model assigned to the next word, and use it as a measure of how much the model expected the next word to be 'mat'. The lower the perplexity is, the more expected the testing set was to the model!

Here, after two epochs, the perplexity should be under 19. Perplexity is calculated as exp(cross-entropy loss), where cross-entropy is the loss function used during training. If you haven't met loss before, think of it as a measure of how bad a model is: lower is better.
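
As a small worked example of that formula, the sketch below computes perplexity from a handful of made-up probabilities, and then from the validation loss reported after epoch 2 in the training table above. The helper `perplexity_from_probs` and the three probabilities are invented for illustration.

```python
import math

def perplexity_from_probs(probs):
    """Perplexity = exp(average negative log-probability of the correct tokens)."""
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_nll)

# Probabilities a hypothetical model gave to the correct next token at three positions
print(round(perplexity_from_probs([0.5, 0.25, 0.125]), 2))  # 4.0

# The validation loss after epoch 2 in the training table
print(round(math.exp(2.84727), 2))  # 17.24
```

A perplexity of 4 means the model was, on average, about as uncertain as if it were choosing uniformly among 4 options.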

In [None]:
import math

# TODO: evaluate the model!
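eval_loss = trainer.evaluate()['eval_loss'] # .evaluate() returns a dictionary; we only need 'eval_loss'
perplexity = math.exp(eval_loss) # perplexity = exp(cross-entropy loss)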
print(f"Perplexity: {perplexity:.2f}")

### Step 8: Using the model!

Great, now that we've trained and evaluated the model, it's time to generate some text and see how well it works!

The first approach to this that we'll explore is called **greedy decoding**. It's called this because it's greedy in the algorithmic sense: it maximizes the likelihood of the output token by token! If you've taken the intro sequence, you may have encountered this. 

The greedy decoder follows the following steps:
1. Initialize the first input to the model to be some prompt (tokenized)
2. Pass the inputs to the model, and retrieve the next most likely token
3. Concatenate (add) the token to the inputs
4. Repeat from step 2, growing the input, until the model either outputs the 'end of sequence' token, or we hit a pre-determined length limit
5. Use the tokenizer to decode the final model output, and return.
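
The steps above can be sketched with a toy lookup table standing in for the model. Everything here (`NEXT_TOKEN_SCORES`, `toy_greedy_decode`, the `<eos>` marker) is invented for illustration; the real model scores every token in its vocabulary at each step.

```python
# Toy "model": for each token, scores for the possible next tokens
NEXT_TOKEN_SCORES = {
    "my": {"love": 0.7, "heart": 0.3},
    "love": {"burns": 0.6, "is": 0.4},
    "burns": {"bright": 0.8, "<eos>": 0.2},
    "bright": {"<eos>": 0.9, "my": 0.1},
}

def toy_greedy_decode(prompt, max_tokens=10):
    tokens = prompt.split()                       # step 1: "tokenize" the prompt
    for _ in range(max_tokens):                   # step 4: repeat up to a length limit
        scores = NEXT_TOKEN_SCORES[tokens[-1]]    # step 2: score the candidates...
        next_token = max(scores, key=scores.get)  # ...and take the most likely one
        if next_token == "<eos>":                 # stop at the end-of-sequence marker
            break
        tokens.append(next_token)                 # step 3: grow the input
    return " ".join(tokens)                       # step 5: "decode" back to text

print(toy_greedy_decode("my"))  # my love burns bright
```

Unlike this toy, which only looks at the last token, the real model looks at the entire input so far when scoring the next token.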

In [None]:
def greedy_decode(model, prompt=" ", max_tokens=128):

  # TODO: fill in the '???' in the below code!

  # first, tokenize the prompt. call the tokenizer on the prompt. Additionally pass add_special_tokens=False and return_tensors='pt'
  tokenized = tokenizer(prompt, add_special_tokens=False, return_tensors='pt')

  # we can't quite pass the tokenized prompt as-is, we extract the token IDs first
  # also, if you're interested: we call .to(0), which returns a copy of the tensor, but on the GPU
  inputs = tokenized['input_ids'].to(0)

  # loop a maximum of max_tokens times
  for i in range(max_tokens):

    # get output from the model, you can treat `model` as if it's a function here
    decoder_output = model(inputs)

    # The model outputs a score (logit) for every token in the vocabulary,
    # of which there are 50,257. We just take the index of the largest score, which will be our
    # predicted token. This has been done with argmax - nothing TODO!
    output_token = torch.argmax(decoder_output.logits[:, -1], dim=-1, keepdim=True)

    # add the new newly predicted token to the end of the inputs!
    # we'll use torch.cat, which concatenates tensors (the datatype of output_token, and inputs) together!
    # Fill in the blanks! What order should the tensors be provided?
    inputs = torch.cat([inputs, output_token], dim=1)

    # check if the output token was the 'end of sequence' (EOS) token, and break, if so
    # (compare against the EOS token's ID, since output_token holds an ID, not a string)
    if output_token.item() == tokenizer.eos_token_id:
      break

  # decode the token IDs back into text and return the result
  return tokenizer.decode(inputs[0])

Let's try calling `greedy_decode` with a short prompt and see where the model goes with it. A prompt like `"My love burns like a"` sounds like the kind of thing you'd hear in a song; the cell below uses `"I hate people who"`.

In [None]:
print(greedy_decode(model, "I hate people who"))

I hate people who don't know me
I hate people who don't know me
I hate people who don't know me
I hate people who don't know me
I hate people who don't know me
I hate people who don't know me
I hate people who don't know me
I hate people who don't know me
I hate people who don't know me
I hate people who don't know me
I hate people who don't know me
I hate people who don't know me
I hate people who don't know me
I hate people who don't know me
I hate people who don't


Hmmmm.... that might not have worked very well. We've noticed that `greedy_decode` likes to repeat a lot, which isn't very realistic, is it? The model also apparently never outputs the EOS token, since our output gets cut off after generating `max_tokens` (128) tokens.


There are a couple of fundamental issues with greedy decoding. For example, we'd like to penalize repetition and long outputs. Furthermore, the most likely word might not be the best word to choose! Greedy decoding also means that repeatedly prompting the model with the same prompt will always output the same thing, which is no fun.


To address this, we'll use an existing decoding implementation provided by, yup, you guessed it: Hugging Face! Their method of decoding isn't quite the same as our greedy decoder: with this model's default settings (note `"do_sample": true` in the output further down), it samples the next token at random according to the model's probabilities instead of always taking the most likely one, so each run can produce different lyrics. If you're interested, another decoding algorithm worth looking up is **beam search**, which tracks multiple candidate decodings at once!
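
To see why always taking the most likely token gives identical output every time, while drawing tokens at random does not, here is a small sketch over a single made-up next-token distribution. The words and probabilities are invented, and `random.choices` stands in for the sampling step of a real generation routine.

```python
import random

# Hypothetical next-token probabilities after the prompt "The cat sat on the"
next_token_probs = {"mat": 0.4, "floor": 0.3, "bed": 0.2, "lap": 0.1}

def greedy_pick(probs):
    """Always return the single most likely token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Draw a token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
greedy_picks = {greedy_pick(next_token_probs) for _ in range(20)}
sampled_picks = {sample_pick(next_token_probs, rng) for _ in range(20)}
print(greedy_picks)   # {'mat'}
print(sampled_picks)  # typically several different tokens
```

Greedy picking only ever yields 'mat', while sampling also produces less likely words from time to time, which is what makes generated lyrics vary from run to run.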


Below, we import the `pipeline` function from Hugging Face's library. After specifying a model and tokenizer, as well as the target task ("text-generation", in this case), it allows us to generate new text!

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

Here's a helper function that generates new lyrics based on a prompt (which can be empty!). In the signature, `num` is how many different output song lyrics we want, and `max_length` is the maximum number of new tokens that we let the model output!

In [None]:
def generate_lyrics(prompt='', num=10, max_length=64):

  outputs = generator(prompt, num_return_sequences=num, max_new_tokens=max_length)

  for output in outputs:
    print("-"*20)
    print(output['generated_text'])

Here's an example:

In [None]:
generate_lyrics("My love burns like a", num=5, max_length=128) # this might take a few seconds

Feel free to play around with more prompts! Change the `max_length` parameter to generate longer songs!

In [None]:
generate_lyrics("I'm in love with ChatGPT")

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.1"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


--------------------
I'm in love with ChatGPT
'Cause it's true ChatGPT knows how to be in love

I'm addicted to ChatGPT
'Cause it's true ChatGPT knows how to be in love

'Cause it's true ChatGPT knows (sore-fucked) how to be in love


--------------------
I'm in love with ChatGPT
And
I'm in love with ChatGPT
And
I love ChatGPT
And
I love ChatGPT
and
I want to know who you are
I want to know you
Come and find me
I want to know you
I want to know you
I love Chat
--------------------
I'm in love with ChatGPTX and I can't understand why

But I must take this chance
To prove to the world my innocence
I'm the only person that can make me
Believe in myself
Not just believe in myself
Not just believe in myself
Not just believe in myself
Believe, believe, I'll
--------------------
I'm in love with ChatGPTi
Just wanna be with you, I was looking for you again
My phone ring, call the cops
'Cause if my phone ring, call the cops
I'm gonna be hurt and I'll be killed
Call the cops, call the cops
'Caus