4.1 Supervised fine-tuning and RL fine-tuning
==========
**Author:** Polina Tsvilodub

* understanding instruction tuning (examples), understanding data collection for reward models, trying out reward models, RL for LMs in practice (packages), (optional: discussing different rule-based approaches to alignment)
* recap distinction between pretraining, task fine-tuning, instruction fine-tuning, assistant fine-tuning
* Understanding RLHF: reward model objective, ways to elicit feedback, issues with various approaches.
* RL: understand the issue with the **differential bottleneck and how RL actually helps to solve it**

This sheet provides an overview of different flavours of fine-tuning for LLMs and their respective use cases. The focus is on *RL-based* fine-tuning and, specifically, on *RLHF* (reinforcement learning from human feedback).

**TODO:** more prose about RLHF seemingly being the secret fairy dust of helpful and nice assistants.

The key learning goals for this sheet are:
* be able to identify the type of fine-tuning used for a particular model and explain it conceptually
* gain a practical understanding of RLHF components, in particular:
  * creation and use of a reward model
  * training steps for the policy
  * important hyperparameters of RLHF.

## Flavours of fine-tuning

In [sheet 2.5](https://cogsciprag.github.io/Understanding-LLMs-course/tutorials/02e-intro-to-hf.html), we already touched on the distinction between pretrained and fine-tuned models (please recap if needed!). In particular, fine-tuning was introduced as training LMs on *task-specific* datasets, e.g., for movie review generation or classification. For classification, LMs are usually fine-tuned with a classification head (see [sheet 3.2](https://cogsciprag.github.io/Understanding-LLMs-course/tutorials/03b-transformers-heads-training.html) for a recap). From here on, however, we will focus on fine-tuning models for *generative* tasks, i.e., simply fine-tuning LMs for next-word prediction on a specific task.

We have seen fine-tuning of a generative LM for question answering in homework 1. There, the task the model is supposed to learn is defined by the dataset and by the formatting of the questions and answers. The model was trained on examples of questions with correct answers; i.e., this is *supervised fine-tuning*, where the model is "shown" what the desired behavior is (i.e., which next tokens it is supposed to predict).

For state-of-the-art LLMs, one kind of task has emerged as something that *general-purpose LLMs*, and in particular *assistants*, should be able to complete: **instruction-following**. That is, rather than creating LMs that can only do QA on a specific dataset with a particular formatting, the community started to build *instruction-tuned* LLMs, which generate sensible outputs given user instructions ranging from "Provide ten recipes with tofu in a bullet list format" to "Summarize the following scientific paper".
This has also been achieved with supervised fine-tuning, where the input-output pairs in the dataset consist of example instructions and their respective completions.
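
To make the supervised setup concrete, the following is a minimal sketch of how one instruction-completion pair could be turned into a training example. It assumes one common variant in which the loss is computed only on the completion tokens (some setups train on all tokens instead). The prompt template, the whitespace "tokenizer", and the helper `build_sft_example` are hypothetical and only serve illustration; `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
# Label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100

def build_sft_example(instruction, completion, vocab):
    """Turn one (instruction, completion) pair into input ids and labels.

    Whitespace splitting stands in for a real subword tokenizer, and the
    "Instruction: ... Response:" template is a made-up example format.
    """
    prompt_tokens = f"Instruction: {instruction} Response:".split()
    completion_tokens = completion.split() + ["<eos>"]
    tokens = prompt_tokens + completion_tokens
    # Assign ids on the fly; a real tokenizer has a fixed vocabulary.
    input_ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
    # Mask the prompt so that only completion tokens contribute to the loss.
    n_prompt = len(prompt_tokens)
    labels = [IGNORE_INDEX] * n_prompt + input_ids[n_prompt:]
    return input_ids, labels

vocab = {}
input_ids, labels = build_sft_example("Name one tofu dish.", "Mapo tofu.", vocab)
print(input_ids)
print(labels)
```

The model still sees the full sequence as input; the mask only decides which next-token predictions are penalized during training.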

**TODO**: distinction to self-supervised training.

> <strong><span style="color:#D83D2B;">Exercise 4.1.1: Supervised fine-tuning</span></strong>
>
> 1. Take a look at [this](https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset?row=23) dataset. For which kind of fine-tuning is it intended? What kinds of examples are there?
> 2. Consider the following use cases: an assistant tutor model for students of different subjects, a model for answering highly specific questions about a knowledge base, and a model intended for writing abstracts of scientific papers. What kind of fine-tuning setup would you consider for each?
> 3. Please come up with a prompt that tests whether an LM can follow instructions. Use the following code to compare the instruction-following performance of an instruction-tuned model and a simple small LM. Feel free to play around with the decoding parameters! (**WARNING**: the instruction-tuned model is a large model, so it can take a moment to load on Colab. Please also keep this in mind if you execute the notebook locally!)

In [None]:
# import packages

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
tokenizer_instruct = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
# move the model to the same device as the inputs
model_instruct = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct").to(device)

# the smallest GPT-2 checkpoint is published on the Hugging Face Hub as "gpt2"
tokenizer_lm = AutoTokenizer.from_pretrained("gpt2")
model_lm = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

In [None]:
instruction_text = #### YOUR TEXT HERE ####
input_ids_instruct = tokenizer_instruct.encode(instruction_text, return_tensors="pt").to(device)
input_ids_lm = tokenizer_lm.encode(instruction_text, return_tensors="pt").to(device)

prediction_instruct = model_instruct.generate(
    input_ids_instruct,
    max_new_tokens=50,  # the default generation length is short; adjust as needed
)
print("Instruction-tuned model's prediction: ", tokenizer_instruct.decode(prediction_instruct[0], skip_special_tokens=True))

prediction_lm = model_lm.generate(
    input_ids_lm,
    max_new_tokens=50,  # same budget for both models to keep the comparison fair
)
print("GPT-2's prediction: ", tokenizer_lm.decode(prediction_lm[0], skip_special_tokens=True))

## RL fine-tuning

**The following is just a draft**:

This part introduces the Reinforcement Learning from Human Feedback (RLHF) approach as it was used for training InstructGPT and ChatGPT. The core components of the approach are discussed, with a particular focus on reward modeling. The second part provides an overview of developments in the domain of policy-gradient methods (e.g., advantage estimation and gradient variance reduction techniques). The session focuses on Proximal Policy Optimization (PPO), one of the most popular deep RL algorithms in the current literature.
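
As a first intuition for PPO, the sketch below computes the clipped surrogate objective for a small batch of sampled tokens in plain Python. It is an illustration under simplifying assumptions, not a training loop: the log-probabilities and advantages are passed in as plain lists, and the function name `ppo_clipped_objective` and the default `clip_eps=0.2` are choices made here for the example.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch of sampled actions (tokens).

    logp_new / logp_old: log-probabilities of the sampled tokens under the
    current policy and under the policy that generated the samples.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the current and the sampling policy.
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)

# A token whose probability rose from 0.5 to 0.75, with positive advantage:
objective = ppo_clipped_objective([math.log(0.75)], [math.log(0.5)], [1.0])
print(objective)
```

The objective is maximized during training. The clipping removes the incentive to move the policy far away from the one that generated the samples within a single update, which is why `clip_eps` is one of the hyperparameters worth understanding.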

In [None]:
from datasets import load_dataset
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM, 
    AutoModelForSequenceClassification,
    LogitsProcessorList,
    MinLengthLogitsProcessor,
    TemperatureLogitsWarper,
    StoppingCriteriaList,
    MaxLengthCriteria,
)
import torch
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np 

In [None]:
# Load reward model 
reward_tokenizer = AutoTokenizer.from_pretrained("lvwerra/distilbert-imdb")
# TODO try to find a helpfulness model
reward_model = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")

In [None]:
# Run an example from the IMDB train split to see how the reward model works
imdb_ds = load_dataset("imdb")

# truncate long reviews to the maximum input length of the reward model
input_reward = reward_tokenizer(imdb_ds['train'][0]['text'], return_tensors='pt', truncation=True)
# no gradients are needed for a forward pass that only scores the text
with torch.no_grad():
    out_reward = reward_model(**input_reward)
print("Raw output format of the reward model: ", out_reward)
# transform logits to probabilities
reward = torch.softmax(out_reward.logits, dim=1)
print(reward) # reward at index 1 is the probability of being positive; i.e., this can be used as the training reward

**TODO:** have people look at the RM objective and the creating of RM data.
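
As a starting point for the reward model objective, here is a minimal sketch of the pairwise preference loss used in InstructGPT-style reward modeling: given scalar rewards for a preferred ("chosen") and a dispreferred ("rejected") response to the same prompt, the loss is the negative log-sigmoid of their difference. The function name `pairwise_reward_loss` and the example reward values are made up for illustration; in practice the rewards come from a learned model, and the loss is averaged over many comparison pairs.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(diff)) equals softplus(-diff); written in a numerically stable form.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

# The loss is small when the model ranks the pair correctly ...
print(pairwise_reward_loss(2.0, 0.0))
# ... and large when it prefers the rejected response.
print(pairwise_reward_loss(0.0, 2.0))
```

Note that only the difference between the two rewards matters, so the absolute scale of the reward is not pinned down by this objective.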

### Outlook: Why is RL so useful?

**TODO:** speak about the differential bottleneck.

Outlook:

More utilities for fine-tuning: PEFT, QLoRA, quantization, DeepSpeed.