4.1 Supervised fine-tuning and RL fine-tuning
==========
**Author:** Polina Tsvilodub

* understanding instruction tuning (examples), understanding data collection for reward models, trying out reward models, RL for LMs in practice (packages), (optional: discussing different rule-based approaches to alignment)
* recap distinction between pretraining, task fine-tuning, instruction fine-tuning, assistant fine-tuning
* Understanding RLHF: reward model objective, ways to elicit feedback, issues with various approaches.
* RL: understand the issue with the **differential bottleneck and how RL actually helps to solve it**

This sheet provides an overview of different flavours of fine-tuning of LLMs and their respective use cases. The focus is on *RL-based* fine-tuning, and specifically on *RLHF* (reinforcement learning from human feedback).

**TODO:** more prose about RLHF seemingly being the secret fairy dust of helpful and nice assistants.

The key learning goals for this sheet are:
* be able to identify the type of fine-tuning used for a particular model and explain it conceptually
* gain a practical understanding of RLHF components, in particular:
  * creation and use of a reward model
  * training steps for the policy
  * important hyperparameters of RLHF.

## Flavours of fine-tuning

In [sheet 2.5](https://cogsciprag.github.io/Understanding-LLMs-course/tutorials/02e-intro-to-hf.html), we already brushed over the distinction between pretrained and fine-tuned models. In particular, fine-tuning was introduced as training LMs on *task-specific* datasets, e.g., for movie review generation or classification. For classification, LMs are usually fine-tuned with a classification head (see [sheet 3.2](https://cogsciprag.github.io/Understanding-LLMs-course/tutorials/03b-transformers-heads-training.html) for a recap). However, from here on, we will focus on fine-tuning models for *generative* tasks, i.e., simply fine-tuning LMs for next-word prediction on a specific task.

We have seen fine-tuning of a generative LM for question answering in homework 1. Here, the specific task the model is supposed to learn is constituted by the specific dataset and the formatting of the questions and the answers. The model was trained on examples of questions with correct answers; i.e., this is *supervised fine-tuning* where the model was shown what the desired behavior is (i.e., which next tokens it is supposed to predict).

For state-of-the-art LLMs, it has been identified that there is a particular kind of task that *general-purpose LLMs*, and in particular *assistants*, should be able to complete: namely, **instruction-following**. That is, rather than creating LMs that can only do QA on a specific dataset with a particular formatting, the community started to build *instruction-tuned* LLMs, which generate sensible outputs given user instructions ranging from "Provide ten recipes with tofu in a bullet list format" to "Summarize the following scientific paper". 
This has also been achieved with supervised fine-tuning, where the input-output pairs in the dataset consist of example instructions and their respective completions.
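To make the structure of such instruction-completion pairs concrete, here is a minimal sketch of how a single pair could be turned into one training sequence for next-token prediction. The prompt template, the whitespace "tokenizer" and the helper names are all hypothetical stand-ins. The sketch also assumes one common (but not universal) design choice: the loss is only computed on the completion tokens, which is done by setting the labels of the prompt tokens to `-100`, the value that PyTorch's cross-entropy loss ignores by default.

```python
# Hypothetical prompt template; real instruction datasets and models each define their own.
TEMPLATE = "### Instruction:\n{instruction}\n### Response:\n"
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def toy_tokenize(text, vocab):
    """Map whitespace-separated words to integer ids (stand-in for a real tokenizer)."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def build_sft_example(instruction, response, vocab):
    """Concatenate prompt and response; mask the prompt positions in the labels."""
    prompt_ids = toy_tokenize(TEMPLATE.format(instruction=instruction), vocab)
    response_ids = toy_tokenize(response, vocab)
    input_ids = prompt_ids + response_ids
    # The next-token loss is only computed where the label is not IGNORE_INDEX,
    # i.e., only on the tokens of the demonstrated response.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}


vocab = {}
example = build_sft_example(
    "Summarize the following sentence: The cat sat on the mat.",
    "A cat sat on a mat.",
    vocab,
)
print(example["input_ids"])
print(example["labels"])
```

With a real model, the same idea would be applied with the model's own tokenizer and prompt format instead of the toy versions used here.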

**TODO**: distinction to self-supervised training. mention catastrophic forgetting. behavioral cloning.

> <strong><span style="color:#D83D2B;">Exercise 4.1.1: Supervised-finetuning</span></strong>
>
> 1. Take a look at [this](https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset?row=23) dataset. For which kind of fine-tuning is it intended? What kinds of examples are there?
> 2. Consider the following use cases for which you want to build an LM: (a) an assistant tutor model for students for different subjects, (b) a model for answering highly specific questions about a medical knowledge base, (c) a model intended for writing abstracts of scientific papers. What kind of fine-tuning set up would you consider (i.e., what kind of fine-tuning dataset would you ideally choose)?
> 3. Please come up with a prompt for testing whether an LM can follow instructions. Use the following code to test the instruction-following performance of an instruction-tuned model (Phi-3) and a simple small LM (GPT-2). Feel free to play around with the decoding parameters! (**WARNING**: the instruction-tuned model is a large model, so it can take a moment to load on Colab. Please also be aware of it if you execute the notebook locally!) 

In [2]:
# import packages

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch



In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
tokenizer_instruct = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
# move the models to the same device as the inputs used below
model_instruct = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct").to(device)

# the small GPT-2 checkpoint is called "gpt2" on the Hugging Face Hub
tokenizer_lm = AutoTokenizer.from_pretrained("gpt2")
model_lm = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

In [None]:
instruction_text = #### YOUR TEXT HERE ####
input_ids_instruct = tokenizer_instruct.encode(instruction_text, return_tensors="pt").to(device)
input_ids_lm = tokenizer_lm.encode(instruction_text, return_tensors="pt").to(device)

prediction_instruct = model_instruct.generate(
    input_ids_instruct,
    max_new_tokens=50,  # limit the length of the continuation; feel free to change
)
print("Instruction-tuned model's prediction: ", tokenizer_instruct.decode(prediction_instruct[0], skip_special_tokens=True))

prediction_lm = model_lm.generate(
    input_ids_lm,
    max_new_tokens=50,  # same budget for a fair comparison
)
print("GPT-2's prediction: ", tokenizer_lm.decode(prediction_lm[0], skip_special_tokens=True))

The distinctions above concern the *content* of the fine-tuning, i.e., the content of the input-output demonstrations in the datasets used for supervised fine-tuning.

Additionally, the lecture introduced different *methods* of efficient supervised fine-tuning, which is especially important for large LMs that take a lot of resources to train. 
The QA fine-tuning that we did in homework 1 was naive fine-tuning. That is, during the fine-tuning, all parameters were updated. However, as explained in the lecture, the more common state-of-the-art approach to fine-tuning is *parameter-efficient*, i.e., only a *selected* subset of the pretrained model parameters, or a *small set of new parameters* is updated. 

The following code provides a simple example of vanilla selective fine-tuning where only the last transformer block and the final layer norm (`ln_f`) of GPT-2 would be fine-tuned, and all other layers are frozen (❄️). Note that in GPT-2 the LM head shares its weights with the input embedding matrix (`wte`), so it does not appear as a separate entry in `named_parameters()` and stays frozen here. Concretely, freezing means that we neither compute gradients for these parameters nor change their values. Usually, parameters are (un)frozen by layer or component.

Of course, the same approach can be used for (un)freezing any other subset of layers, in any other `transformers` model. For this, it is useful to know how to inspect and access different components of a pretrained model, as was briefly shown in [sheet 3.1](https://cogsciprag.github.io/Understanding-LLMs-course/tutorials/03a-tokenization-transformers.html#transformers).

> Optionally, you can reuse the code from the homework to fine-tune this partially frozen model on the QA task from the homework. Do your results change?

In [3]:
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2")

# first, we inspect the model's architecture
print(gpt2_model)



GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)




In [4]:
# next, we list the names of the model's parameters
for name, _ in gpt2_model.named_parameters():
    print(name)

transformer.wte.weight
transformer.wpe.weight
transformer.h.0.ln_1.weight
transformer.h.0.ln_1.bias
transformer.h.0.attn.c_attn.weight
transformer.h.0.attn.c_attn.bias
transformer.h.0.attn.c_proj.weight
transformer.h.0.attn.c_proj.bias
transformer.h.0.ln_2.weight
transformer.h.0.ln_2.bias
transformer.h.0.mlp.c_fc.weight
transformer.h.0.mlp.c_fc.bias
transformer.h.0.mlp.c_proj.weight
transformer.h.0.mlp.c_proj.bias
transformer.h.1.ln_1.weight
transformer.h.1.ln_1.bias
transformer.h.1.attn.c_attn.weight
transformer.h.1.attn.c_attn.bias
transformer.h.1.attn.c_proj.weight
transformer.h.1.attn.c_proj.bias
transformer.h.1.ln_2.weight
transformer.h.1.ln_2.bias
transformer.h.1.mlp.c_fc.weight
transformer.h.1.mlp.c_fc.bias
transformer.h.1.mlp.c_proj.weight
transformer.h.1.mlp.c_proj.bias
transformer.h.2.ln_1.weight
transformer.h.2.ln_1.bias
transformer.h.2.attn.c_attn.weight
transformer.h.2.attn.c_attn.bias
transformer.h.2.attn.c_proj.weight
transformer.h.2.attn.c_proj.bias
transformer.h.2.ln_2

In [6]:
# next, we define which layers NOT to freeze 
# (of course, we can do this vice versa and define which layers to freeze)

layers_to_unfreeze = ["transformer.h.11", "transformer.ln_f.weight", "transformer.ln_f.bias"]

# iterate over model's parameters
# note that, by default, all parameters have requires_grad = True (i.e., they are unfrozen)
for name, param in gpt2_model.named_parameters():
    # freeze every parameter whose name does not start with one of the prefixes above
    if not any(name.startswith(prefix) for prefix in layers_to_unfreeze):
        param.requires_grad = False

In [7]:
# now we check how many parameters are trainable
params = [p for p in gpt2_model.parameters() if p.requires_grad]
print(f'The model has {sum(p.numel() for p in params):,} trainable parameters')

The model has 7,089,408 trainable parameters
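
What freezing does during training can be checked on a toy model. The following sketch assumes a small two-layer network as a stand-in for a pretrained LM; the layer sizes and the random data are made up for illustration. After one optimizer step, the frozen layer has received no gradient and its weights are unchanged.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for a pretrained model: two linear layers.
toy_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer (parameter names in nn.Sequential start with the layer index).
for name, param in toy_model.named_parameters():
    if name.startswith("0."):
        param.requires_grad = False

frozen_before = toy_model[0].weight.clone()

# Only pass the trainable parameters to the optimizer.
optimizer = torch.optim.SGD(
    [p for p in toy_model.parameters() if p.requires_grad], lr=0.1
)

# One training step on random data.
x, y = torch.randn(16, 4), torch.randint(0, 2, (16,))
loss = nn.functional.cross_entropy(toy_model(x), y)
loss.backward()
optimizer.step()

print("Frozen layer has a gradient:", toy_model[0].weight.grad is not None)
print("Frozen layer changed:", not torch.equal(frozen_before, toy_model[0].weight))
print("Trainable layer has a gradient:", toy_model[2].weight.grad is not None)
```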


> <strong><span style="color:#D83D2B;">Exercise 4.1.2: PEFT</span></strong>
>
> 1. Compare the number above with the number of trainable parameters in the original model. What changed? How do you expect this to affect fine-tuning results?
> 2. Suppose that we wanted to use rank $r=4$ LoRA for fine-tuning the decoder self-attention block of GPT-2. How many parameters would the lower rank matrices A and B have (see slides 27-29 for reference)?

### Outlook: PEFT in practice

Below, more optional resources for the other types of fine-tuning introduced in the lecture (LoRA, QLoRA) can be found.
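
As a first orientation, the core idea of LoRA can be sketched in a few lines of PyTorch. This is a simplified illustration under stated assumptions, not the implementation used by any particular library: the class name `LoRALinear`, the toy layer size (16 x 16) and the rank are chosen for illustration only. The pretrained weight stays frozen, and only two small matrices `A` and `B` are trained; their product is added to the output of the frozen layer.

```python
import torch
from torch import nn


class LoRALinear(nn.Module):
    """Wraps a frozen linear layer and adds a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 2, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # A is initialised randomly and B with zeros, so the update starts at zero.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)


torch.manual_seed(0)
base = nn.Linear(16, 16)
lora = LoRALinear(base, r=2)

n_trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in lora.parameters())
print(f"trainable: {n_trainable}, total: {n_total}")

# At initialisation the wrapped layer behaves exactly like the pretrained one.
x = torch.randn(3, 16)
print(torch.allclose(lora(x), base(x)))
```

Because `B` starts at zero, fine-tuning begins from the behavior of the pretrained layer, and only the low-rank matrices move away from it during training.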

**TODO**

## RL fine-tuning

Reinforcement learning is often introduced as a separate type of machine learning, in addition to supervised and unsupervised learning. Reinforcement learning broadly defines the field of study and the methods for training agents to learn to take actions that (optimally) achieve a goal, based on experience in the environment. It can be seen as the computational formalization of trial-and-error learning.

The key difference to supervised learning is that the agent (the terms "model", "LM" and "agent" will be used interchangeably in this section) learns which actions are useful for achieving the goal itself, rather than being shown the "ground truth" optimal actions as in supervised learning. The task of the developer is, therefore, to correctly specify the goal, and the agent will "discover" a way to achieve it. 
In the formal framework which underpins RL (namely, MDPs), the goal is represented via the *reward function*. This function assigns high rewards to desired outcomes, and low rewards to undesired ones, therefore implicitly representing a goal. 
It is important to note that, in general, RL allows the developer to specify what the agent should learn to do (i.e., the goal), but *not necessarily how*, exactly. Correctly specifying the goal is far from trivial, and a lot of current research in the field of *alignment* goes into understanding how to specify such goals (more on this in future sessions).
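
The idea that only the goal is specified, while the agent finds out which actions achieve it, can be illustrated with a deliberately tiny example. The sketch below assumes a made-up reward function over three actions (think of three candidate next tokens) and a softmax policy over them. For simplicity and reproducibility, it follows the exact gradient of the expected reward; in practice, this expectation is estimated from sampled actions. No demonstration of the "correct" action is ever shown to the policy.

```python
import math

# Hypothetical reward function over three possible actions.
# Only the goal is specified here, not how to reach it.
REWARDS = [0.1, 1.0, 0.3]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.0, 0.0, 0.0]  # the policy starts out uniform
learning_rate = 1.0

for step in range(200):
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, REWARDS))
    # Exact gradient of the expected reward J w.r.t. each logit:
    # dJ/dlogit_k = pi_k * (r_k - J)
    grads = [p * (r - expected_reward) for p, r in zip(probs, REWARDS)]
    logits = [l + learning_rate * g for l, g in zip(logits, grads)]

final_probs = softmax(logits)
best_action = max(range(len(final_probs)), key=final_probs.__getitem__)
print("final policy:", [round(p, 3) for p in final_probs])
print("preferred action:", best_action)
```

The policy ends up putting most of its probability mass on the action with the highest reward, although it was only ever told how good each action is.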

Using RL for fine-tuning LLMs is one of the main methodological innovations that seems to have led to the impressive performance of SOTA LLMs. In particular, RL makes it possible to fine-tune LLMs towards human preferences and commercial usability, because its mechanics lend themselves to training a model based on a more abstract signal of whether an output is good or bad (i.e., letting it discover how to generate output that receives a "good" reward!), rather than based on particular demonstrations. This is especially useful because the objectives of fine-tuning SOTA LLMs often include aspects that are very difficult to specify via specific demonstrations, like being *helpful, honest, harmless* ([Bai et al., 2022](https://arxiv.org/pdf/2204.05862)).
To this end, instead of using supervised learning, *human feedback* can be used as a reward signal to fine-tune the model. 
This is why the fine-tuning technique in question is called Reinforcement Learning from Human Feedback (RLHF).

Below, practical aspects of the core components of RLHF are discussed. These core components are: the policy (i.e., the backbone LLM), the supervised fine-tuning data (SFT), the reward modeling data and the resulting reward model, and, finally the training objective (commonly, the PPO algorithm). 
As with standard LM training, there are packages which implement some of the steps required for training for us. We will look at the `transformers` based package `trl`. 
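
One hyperparameter that typically matters in this setup is the strength of the penalty that keeps the fine-tuned policy close to the initial (SFT) model. The sheet does not spell out the objective, so the following is only a sketch of a commonly used formulation: the reward model's score for a sampled response is reduced by `beta` times an estimate of the KL divergence between the policy and the frozen reference model. The function name, the value of `beta` and all numbers are hypothetical.

```python
def kl_shaped_reward(rm_score, logprobs_policy, logprobs_ref, beta=0.1):
    """Reward-model score minus a penalty for drifting away from the reference model.

    logprobs_policy / logprobs_ref: per-token log-probabilities of the *same*
    sampled response under the current policy and the frozen reference model.
    """
    # Sample-based estimate of the sequence-level KL term: sum of log-ratios.
    kl_estimate = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return rm_score - beta * kl_estimate


# Hypothetical numbers for a three-token response.
policy_lp = [-0.2, -0.5, -0.1]
ref_lp = [-1.0, -0.9, -0.6]
for beta in (0.0, 0.1, 1.0):
    print(beta, round(kl_shaped_reward(2.0, policy_lp, ref_lp, beta), 3))
```

With `beta = 0`, the policy is free to chase the reward model's score; larger values make responses that the reference model finds unlikely less attractive.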

### Supervised fine-tuning

**TODO**

### Reward modeling

A core component of RL fine-tuning is the reward model. For training a reward model, human annotations (i.e., human feedback) are required. As discussed in the lecture, these can take different forms. 
Below, examples of a reward model and reward modeling dataset based on simple binary comparison are shown.
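
For binary comparisons, a reward model is commonly trained so that the response preferred by the annotator receives a higher scalar score than the rejected one. A minimal sketch of such a pairwise objective is given below; it assumes the widely used formulation in which the loss is the negative log-sigmoid of the score difference. The function name and the example scores are made up for illustration.

```python
import math


def pairwise_rm_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written in a numerically stable way
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


# Hypothetical scalar scores a reward model might assign to two responses.
print(pairwise_rm_loss(2.0, -1.0))  # chosen scored higher -> small loss
print(pairwise_rm_loss(0.0, 0.0))   # no preference -> log(2)
print(pairwise_rm_loss(-1.0, 2.0))  # ranking is wrong -> large loss
```

Note that only the difference between the two scores enters the loss, so the absolute scale of the rewards is not fixed by this objective.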

> <strong><span style="color:#D83D2B;">Exercise 4.1.3: Reward modeling</span></strong>
>
> 1. Inspect this dataset. What do the individual columns contain? Based on slide XXX, which values go where?
> 2. Look at the code below and make sure to understand it. What was the RM trained for? Test it on a few of your own intuitive examples; do the scores (ordinally) correspond to your intuition?

In [None]:
from datasets import load_dataset
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM, 
    AutoModelForSequenceClassification,
    LogitsProcessorList,
    MinLengthLogitsProcessor,
    TemperatureLogitsWarper,
    StoppingCriteriaList,
    MaxLengthCriteria,
)
import torch
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np 

In [None]:
# Load reward model 
reward_tokenizer = AutoTokenizer.from_pretrained("lvwerra/distilbert-imdb")
# TODO try to find a helpfulness model
reward_model = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")

In [None]:
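# Run an example from the IMDB train split to see how the reward model works
# (assumes the IMDB dataset is available as "imdb" on the Hugging Face Hub)
imdb_ds = load_dataset("imdb")

# truncate long reviews to the maximum input length of the reward model
input_reward = reward_tokenizer(imdb_ds['train'][0]['text'], truncation=True, return_tensors='pt')
with torch.no_grad():  # no gradients are needed for scoring
    out_reward = reward_model(**input_reward)
print("Raw output format of the reward model: ", out_reward)
# transform logits to probabilities
reward = torch.softmax(out_reward.logits, dim=1)
print(reward) # reward at index 1 is the probability of being positive; i.e., this can be used as the training reward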

**TODO:** have people look at the RM objective and the creating of RM data.

### Outlook: Why is RL so useful?

**TODO:** speak about the differential bottleneck.

### Outlook

Other sources of reward.