# LLM Fine-tuning

- Instructor: Jake Snell
- Date: January 23, 2024

The contents of this tutorial are based on the following guide:
- https://huggingface.co/docs/transformers/v4.37.0/tasks/language_modeling

**Note:** Before beginning, be sure to select "Runtime > Change runtime type > T4 GPU". This will make fine-tuning a lot faster.

In [None]:
!pip install transformers[torch] datasets evaluate accelerate -U

## Part 1: Text Generation

First, we need to download an LLM from HuggingFace. Here we use GPT-2, but feel free to experiment with your own choice of LLM by browsing https://huggingface.co/models?pipeline_tag=text-generation&sort=trending.

In [None]:
# Let's grab GPT-2 from HuggingFace

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda:0")

In [None]:
# Let's get the corresponding tokenizer as well
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [None]:
# Now let's tokenize a sample sentence
tokenized_sentence = tokenizer("One good tokenizer is worth more than a hundred bad ones.")
tokenized_sentence

**Question**: Based on the tokenizer output, is your tokenizer a character-level, word-level, or subword-level tokenizer? How can you tell?



**Exercise.** There are several methods for sampling a text sequence from a language model. Using the guide at https://huggingface.co/blog/how-to-generate, choose at least 2 sampling methods and implement them. Which technique generates higher quality text samples? What happens when you seed the text with different phrases, such as "Wherefore art" or "Four score and seven"? Does the output match what you would expect?
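
To make the difference between decoding strategies concrete, here is a minimal sketch on a toy next-token distribution. The vocabulary, the logits, and the helper names (`softmax`, `greedy`, `sample_top_k`) are made up for illustration and are not part of the `transformers` API. In the exercise itself, the arguments to `model.generate` described in the linked guide do this work for you.

```python
import math
import random

# Toy next-token scores (logits); the vocabulary and values are invented.
vocab = ["thou", "the", "a", "art", "banana"]
logits = [2.0, 1.5, 0.5, 0.2, -1.0]

def softmax(scores, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(scores):
    """Greedy decoding: always pick the index of the highest-scoring token."""
    return max(range(len(scores)), key=lambda i: scores[i])

def sample_top_k(scores, k, temperature=1.0, rng=random):
    """Top-k sampling: keep the k best tokens, renormalize, then sample one."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    probs = softmax([scores[i] for i in top], temperature)
    return rng.choices(top, weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
greedy_token = vocab[greedy(logits)]
sampled_tokens = [vocab[sample_top_k(logits, k=3, rng=rng)] for _ in range(10)]
print("greedy:", greedy_token)
print("top-k samples:", sampled_tokens)
```

Greedy decoding returns the same token every time, while top-k sampling varies between runs but never leaves the three highest-scoring tokens.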

In [None]:
# Write your code to sample from the model here!

## Part 2: Basic Fine-tuning

Now we will download a small dataset to fine-tune our LLM. We will use the [TinyShakespeare](https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt) dataset here, but feel free to find your own dataset on HuggingFace: https://huggingface.co/datasets?sort=trending. If you do, be sure to choose one of the filters under "Natural Language Processing" so you get a text dataset.


In [None]:
from datasets import load_dataset

dataset = load_dataset("Trelis/tiny-shakespeare")

In [None]:
dataset

**Quick check**: Verify that the text in the dataset is what you expect. Based on the original [TinyShakespeare](https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt) file, what strategy do you think was used to split the text into train and test sets? Would you have used this strategy, or something else?

In [None]:
# First, we will need to tokenize this dataset using our tokenizer
tokenized_dataset = dataset.map(
    lambda example: tokenizer(example["Text"]),
    batched=True,
    num_proc=4,
    remove_columns=dataset["train"].column_names
)

In [None]:
tokenized_dataset

In [None]:
# Now, we will need to split the rows into blocks
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder that does not fill a whole block. We could pad instead
    # if the model supported it; customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
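
To see what `group_texts` does, it can help to run the same logic on a toy batch. The sketch below uses a hypothetical copy of the function, `group_into_blocks`, that takes `block_size` as an argument so it does not overwrite the notebook's `block_size = 128`. The token IDs are made up, so it runs without the tokenizer or the dataset.

```python
def group_into_blocks(examples, block_size):
    """Same logic as group_texts, with block_size passed in explicitly."""
    # Concatenate every column's lists into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Slice each column into consecutive chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling, the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result

# Three made-up "rows" holding 3 + 5 + 2 = 10 token IDs in total.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])    # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(len(blocks["labels"]))  # 2
```

Here the three input rows become two blocks of four tokens, and the last two token IDs are dropped.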

In [None]:
finetuning_dataset = tokenized_dataset.map(group_texts, batched=True, num_proc=4)

In [None]:
finetuning_dataset

**Question:** Why is the number of rows different between `tokenized_dataset` and `finetuning_dataset`? Given the number of rows in a split from `tokenized_dataset`, can you write down an expression for the number of rows in the corresponding `finetuning_dataset` split?

In [None]:
# Here we set up the data collator to pass into the training loop
from transformers import DataCollatorForLanguageModeling, TrainingArguments, Trainer

# GPT-2 has no padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [None]:
training_args = TrainingArguments(
    output_dir="my_awesome_shakespeare_model",
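    # Evaluate on the test split at the end of every epoch. Newer transformers
    # releases rename this argument to `eval_strategy`.
    evaluation_strategy="epoch",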
    learning_rate=2e-5,
    weight_decay=0.01
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=finetuning_dataset["train"],
    eval_dataset=finetuning_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

In [None]:
# Here we evaluate perplexity
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
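
Perplexity here is the exponential of the average per-token cross-entropy loss (in nats), which is roughly what `eval_loss` reports. As a sanity check on the formula, the sketch below computes it by hand for a few made-up token probabilities. The numbers are illustrative and do not come from the model.

```python
import math

# Hypothetical probabilities a model assigned to the correct next token.
token_probs = [0.5, 0.25, 0.125, 0.5]

# Cross-entropy is the mean negative log-probability (natural log).
loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(loss)

print(f"loss: {loss:.4f}")
print(f"perplexity: {perplexity:.4f}")

# A model that is uniform over V tokens has perplexity V.
vocab_size = 50257  # GPT-2's vocabulary size
uniform_loss = -math.log(1 / vocab_size)
uniform_perplexity = math.exp(uniform_loss)
print(f"uniform perplexity: {uniform_perplexity:.0f}")
```

One way to read the result: a perplexity of about 3.4 means the model is, on average, as uncertain as if it were choosing uniformly among about 3.4 tokens at each step.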

**Exercise**: Modify the code above to experiment with different learning rates, weight decay, and/or number of epochs.

1. How do these choices affect training loss and validation loss? Which fine-tuning strategy is best?
2. Use your sampling strategies from Part 1 above to sample from your fine-tuned model. How do the samples compare?

## Part 3: PEFT

Another approach to fine-tuning is known as Parameter-Efficient Fine-Tuning (PEFT), which updates only a small number of parameters while keeping the pretrained weights frozen. See the slides for a diagram of LoRA (Hu et al., 2021).
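
As a rough illustration of the LoRA idea: instead of updating a full weight matrix `W` of shape `d x k`, LoRA freezes `W` and trains two small matrices, `B` (`d x r`) and `A` (`r x k`), so the layer computes `W x + (alpha / r) * B (A x)`. The sketch below is a pure-Python toy, not the `peft` library's implementation. The helper names and the tiny sizes are made up, and 768 is used in the parameter count only because it is GPT-2's hidden size.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def count_params(d, k, r):
    """Trainable parameters for full fine-tuning vs. LoRA on one d x k matrix."""
    return d * k, r * (d + k)

rng = random.Random(0)
d, k, r, alpha = 4, 3, 2, 4  # tiny, made-up sizes

W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, d x k
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B = [[0.0] * r for _ in range(d)]                               # trainable, d x r, starts at zero

x = [1.0, 2.0, 3.0]
base_output = matvec(W, x)
initial_output = lora_forward(W, A, B, x, alpha, r)
# Because B starts at zero, the adapted layer initially matches the frozen one.
print(initial_output == base_output)

full, lora = count_params(d=768, k=768, r=8)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.2%}")
```

With these example sizes, LoRA trains about 2% as many parameters for that matrix as full fine-tuning would.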

**Exercise:** Use the HuggingFace [PEFT Guide](https://huggingface.co/docs/peft/quicktour) as a base to implement LoRA or the PEFT technique of your choice. Fine-tune your original LLM using PEFT, being sure to record the training loss and validation loss. After PEFT fine-tuning is complete, generate some samples from your model.

1. How do the training/validation losses and generated samples compare to your model from Part 2? Which model is better, in your opinion?
2. How does the time taken for fine-tuning differ between ordinary fine-tuning and PEFT?
3. What are some benefits of PEFT relative to ordinary fine-tuning? Which technique would you recommend using in practice?

## Homework (Optional)

- Now that you know how to fine-tune a model for text generation, choose another text-based task. You could choose translation, text classification, or anything else that takes text as input. Fine-tune the LLM of your choice on this new task. How well does it perform? Did the promise of "foundation model + adaptation" live up to what you expected, or is there something still to be desired?
- For a more in-depth look at transformers, you can check out ["The Annotated Transformer" tutorial](https://nlp.seas.harvard.edu/annotated-transformer/) by Sasha Rush. It covers masking and positional encoding in more detail, and also discusses advanced topics such as label smoothing and learning rate scheduling.

# Thank you for joining us this Wintersession! We wish you the best of luck on your machine learning journey!

If you have any questions or comments, please reach out to me at <js2523@princeton.edu>.