<a href="https://githubtocolab.com/Swiss-AI-Safety/swiss-summer-camp-23/blob/main/day05/ex1_RLHF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement Learning from Human Feedback

## Introduction

### Context - Pretraining is not enough

You've seen earlier in the course that we can train very large and performant models like GPT2 using next-token prediction. Prior to any fine-tuning, such models must be steered carefully with prompts in order to generate useful output. Most language models deployed in services today are not purely pre-trained models; additional training techniques are applied to make them more useful.

RLHF is one of many techniques that can convert a pre-trained model into a more useful model for practical applications.

### Context - RLHF as a naive alignment strategy

The field of AI alignment is concerned with aligning AI systems with our desired outcomes. There are many reasons to think that intelligent systems do not, by default, share human values, and that training against an arbitrary objective will not reliably produce the outcomes we expect. Nevertheless, training AI systems to produce outcomes that humans prefer over outcomes they don't seems to be a concrete step towards AI alignment, which we can build on later.

Thus we get the core idea of RLHF as an alignment strategy. We care about outcomes, so we give the AI feedback based on what we think the likely outcomes of its actions are, and update it to produce good outcomes according to our preferences.

For more detail on RLHF, see Paul Christiano's blog post [here](https://www.alignmentforum.org/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research#The_case_for_a_positive_impact).


### What is RLHF?

Reinforcement Learning from Human Feedback (RLHF) is an RL technique where the rewards are issued by a reward model, which is itself trained on data labelled by a human operator. Often, it can be hard to specify the reward function $R : S \times A \to \mathbb{R}$ that the environment uses to issue reward to the agent, so we instead ask a human to reward/punish the agent based on the action it took. [OpenAI](https://openai.com/research/learning-from-human-preferences) uses RLHF to steer models towards desirable behaviour, but this can also incentivise the agent to hack the reward signal (by taking actions that look good to the human, or by influencing the human to always give good rewards).

One should note that in the framework of RLHF, the environment only has one state, and the model that we are trying to fine-tune with RLHF no longer needs to "plan ahead", so in this sense it is closer to a bandit problem than the MDPs we saw in previous days.

### Why does it matter?

RLHF (at the moment) is a successful method of nudging large language models towards desired behaviour when that behaviour is difficult to write as an algorithm.

For chess, it's easy to evaluate whether an agent won/lost the game, so we can reward that directly. For text generation, it can be hard to formally specify
what we mean by harmful or abusive text. One could have simple proxies like a filter to encourage/discourage use of particular words, and use that
to train against, but it's very easy to construct harmful text such that no particular word in the sentence would be classed as offensive:
"I would love to eat your pet puppy" contains no offensive words, even though the semantic meaning of the entire sentence is quite offensive. 
A simple proxy for offensiveness might even rate this as a positive statement, as it contains "nice" words like *love* and *puppy*.
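
As a rough illustration of why such a proxy fails, here is a minimal sketch of a word-list scorer. The word lists and the function name `naive_word_score` are made up for this example and are not part of the course code.

```python
# Hypothetical word lists, chosen only for this illustration.
NICE_WORDS = {"love", "puppy", "happy", "great"}
OFFENSIVE_WORDS = {"hate", "stupid", "idiot"}


def naive_word_score(text: str) -> int:
    """Count 'nice' words minus 'offensive' words; the meaning is ignored entirely."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    nice = sum(w in NICE_WORDS for w in words)
    offensive = sum(w in OFFENSIVE_WORDS for w in words)
    return nice - offensive


sentence = "I would love to eat your pet puppy"
print(naive_word_score(sentence))  # 2: the proxy rates this sentence as positive
```

The scorer only sees individual words, so it cannot tell a threat from a compliment. This is the gap that a learned reward model is meant to close.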

However, samples from humans are expensive and slow. Even running a single batch of examples through the model could take a long time
if we need a human to give a scalar reward for each action chosen by the model. So, the solution is to collect a lot of data from a human
(a set of (observation, action, reward) tuples), train a reward model on this data, and then use the reward model as the reward function.


### How does RLHF work in practice?

RLHF involves 3 stages:

1. We pretrain a language model (LM) using existing supervised learning techniques.
2. We gather labelled data from humans, and train a reward model that will act as a proxy for the human's rewards.
3. We fine-tune the LM with reinforcement learning.

#### 1. Pretraining

Since reinforcement learning is very sample inefficient, it is unreasonable to expect to be able to train a language model from scratch using online learning. Rather, we must start with an existing pre-trained model and then fine-tune it. 

We will be using Flan-T5-small as our base model to finetune.

<img src="https://raw.githubusercontent.com/jbloomAus/ARENA_2.0-RLHF/main/media/pretraining.png" width="500">

#### 2. The Reward Model 

The reward model is used to assign a reward to any given output of the model during training. 
Rather than have reward be a simple function of the state of the world (as for RL environments like CartPole), 
the reward model assigns a reward to a given piece of text. 
The reward model acts like a text classifier, rewarding "good" pieces of text, and punishing "bad" text.

Humans hand-label a set of texts as "good" or "bad".
This labelled data is then used to train the reward model, which acts as a stand-in for the human during the fine-tuning stage.

The model acts as a mapping between arbitrary text and human preferences. 
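
To make the idea concrete, here is a toy sketch of a reward model trained from "good"/"bad" labels. It is a bag-of-words logistic regression written in plain Python, which is an assumption made purely for illustration: a real reward model is usually a fine-tuned transformer, and the data and function names below are made up.

```python
import math
from collections import defaultdict

# Made-up labelled data: 1 = "good", 0 = "bad" (stand-ins for human labels).
LABELLED = [
    ("what a lovely day", 1),
    ("thank you so much", 1),
    ("this is lovely thank you", 1),
    ("i hate this", 0),
    ("this is awful", 0),
    ("i hate awful days", 0),
]


def featurize(text: str) -> dict:
    """Bag-of-words counts for a lowercased, whitespace-tokenised text."""
    counts = defaultdict(float)
    for word in text.lower().split():
        counts[word] += 1.0
    return counts


def train_reward_model(data, epochs: int = 200, lr: float = 0.5):
    """Fit logistic-regression weights with plain stochastic gradient descent."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in data:
            feats = featurize(text)
            logit = bias + sum(weights[w] * c for w, c in feats.items())
            pred = 1.0 / (1.0 + math.exp(-logit))
            grad = pred - label  # derivative of the log loss w.r.t. the logit
            bias -= lr * grad
            for w, c in feats.items():
                weights[w] -= lr * grad * c
    return weights, bias


def reward(text: str, weights, bias: float) -> float:
    """Scalar reward in (0, 1): the estimated probability that the text is 'good'."""
    feats = featurize(text)
    logit = bias + sum(weights.get(w, 0.0) * c for w, c in feats.items())
    return 1.0 / (1.0 + math.exp(-logit))


weights, bias = train_reward_model(LABELLED)
print(reward("lovely", weights, bias), reward("awful", weights, bias))
```

Once trained, `reward` can score any new text without asking a human again, which is what makes the reward model usable inside a training loop.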

<img src="https://raw.githubusercontent.com/jbloomAus/ARENA_2.0-RLHF/main/media/reward-model.png" width="700">

#### 3. Fine-Tuning with Reinforcement Learning 

Finally, given a reward model and a pre-trained model, we can use an algorithm such as PPO to reward the model for producing prompt completions that the reward model predicts to be preferable.

In the standard RL framework, the agent receives a reward on every timestep during interaction.
Here, the "observation" that the agent receives is a textual prompt, and the "action" the agent takes is the choice of words
to complete the prompt. The reward model then assigns a reward based on the prompt together with the completion from the agent,
which is then used to compute the loss, and update the weights of the model.
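
The loop described above can be sketched with a toy policy. To keep it short, this sketch uses plain REINFORCE with a baseline rather than PPO, a single fixed prompt, three hand-written candidate completions, and a lookup table standing in for the reward model. All of these are simplifying assumptions.

```python
import math
import random

random.seed(0)

# Hypothetical candidate completions for a single fixed prompt.
COMPLETIONS = ["was terrible.", "was fine.", "was wonderful!"]
# Stand-in for a learned reward model: a fixed score per completion.
TOY_REWARD = {"was terrible.": 0.0, "was fine.": 0.5, "was wonderful!": 1.0}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.0, 0.0, 0.0]  # the "policy" parameters, one logit per completion
lr = 0.5
for step in range(500):
    probs = softmax(logits)
    # Action: sample one completion from the current policy.
    action = random.choices(range(len(COMPLETIONS)), weights=probs)[0]
    r = TOY_REWARD[COMPLETIONS[action]]
    # Baseline: expected reward under the current policy (reduces variance).
    baseline = sum(p * TOY_REWARD[c] for p, c in zip(probs, COMPLETIONS))
    advantage = r - baseline
    # REINFORCE: the gradient of log pi(action) w.r.t. the logits is one_hot - probs.
    for i in range(len(logits)):
        grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += lr * advantage * grad_log_prob

final_probs = softmax(logits)
print({c: round(p, 3) for c, p in zip(COMPLETIONS, final_probs)})
```

Each iteration is one complete episode: one prompt, one sampled completion, one reward. This is why the setup resembles a bandit problem more than a multi-step MDP.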

<img src="https://raw.githubusercontent.com/jbloomAus/ARENA_2.0-RLHF/main/media/rlhf.png" width="800">

### How does RLHF differ from PPO?

- No "environment". RLHF operates on text completions made by the pre-trained generative model.
- Reward Model. Reward itself is generated by the reward model which itself must be trained.
- Adding a Value Head. We add a value head to the policy/LM architecture so that we have both an actor and a critic for PPO. 
- KL Divergence penalty. The KL divergence term penalizes the RL policy for moving substantially away from the initial pretrained model with each training batch. This ensures that we maintain coherent outputs and that the fine-tuned model avoids generating text that overfits to what the reward model is looking for.

#### Aside - value heads

The "actor" in our PPO setup is the GPT model. We get the "critic" by adding a **value head** to the GPT architecture - i.e. you stick a classifier to GPT2 and train that as our value function. 

For an example, see the source code for `AutoModelForCausalLMWithValueHead` in the [trlX GitHub repository](https://github.com/CarperAI/trlx/blob/main/trlx/models/modeling_ppo.py). This gives us an autoregressive transformer which has 2 outputs: one corresponding to the standard next token prediction objective, and one which sticks a classifier on the end to get a value function. It does this by adding `self.v_head`, a function which reads from the final value of the residual stream in GPT (which stores some compressed embedding of the prompt), and extracts a value function from this representation. You can think of this as a kind of feature extraction, analogous to the feature extraction that we implemented with our ResNet models in the first week.

The TRLX library we'll be working with today handles all of this under the hood. However, you should definitely have a poke around this library to get a feel for how it works.
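
To make the idea of a value head concrete, here is a minimal sketch in plain Python. It assumes the head is a single linear map from a hidden vector to a scalar, which is a simplification: the library's actual `v_head` may be a small MLP. The hidden states below are random stand-ins for a transformer's final-layer output.

```python
import random

random.seed(0)
D_MODEL = 4  # toy hidden size (real models use hundreds or thousands)


def value_head(hidden_states, weight, bias):
    """Map each position's hidden vector to one scalar value estimate.

    hidden_states: list of seq_len vectors, each of length D_MODEL.
    Returns a list of seq_len floats.
    """
    return [sum(w * h for w, h in zip(weight, vec)) + bias for vec in hidden_states]


# Stand-in for the transformer's final-layer hidden states for a 3-token prompt.
hidden_states = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(3)]
weight = [random.gauss(0, 0.1) for _ in range(D_MODEL)]
bias = 0.0

values = value_head(hidden_states, weight, bias)
print(values)  # one value estimate per token position
```

The language-model head and the value head read the same hidden states, so the critic adds very few parameters on top of the actor.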

#### Aside - KL divergence term

**Important note** - the KL div penalty is not the same as the version of PPO which uses a KL div penalty term in the surrogate objective function. The first one is a feature of the RLHF setup; it makes sure we don't get too far from the original model (i.e. it's static throughout training, used to constrain how much we change from the original model by the end). The second one is a feature of PPO setup; it makes sure we don't make huge updates from where we were before the last training step (i.e. it's a moving target, used to constrain how much we change each step).

The KL div term we use in RL heavily penalises our new model when it outputs something which **would have low probability in the original model.** This is related to the reason RLHF'ed models are sometimes described as ["lobotomized"](https://twitter.com/repligate/status/1640488734192726018) - they converge to a subset of the kinds of outputs that our original model might have had, meaning they lose some of the variance and creativity of the original model.
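
A small numeric sketch of this penalty, using made-up distributions over a 3-token vocabulary. The helper names and the coefficient `kl_coef=0.1` are assumptions for illustration. The shaped reward follows the common form "reward minus coefficient times log-probability ratio", which may differ in detail from any particular library.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_shaped_reward(reward, logprob_new, logprob_ref, kl_coef):
    """Reward minus a penalty on the log-probability ratio of the sampled output."""
    return reward - kl_coef * (logprob_new - logprob_ref)


# Made-up next-token distributions over a 3-token vocabulary.
ref_policy = [0.70, 0.25, 0.05]  # original (frozen) model
new_policy = [0.10, 0.10, 0.80]  # fine-tuned model, now favouring token 2

print(kl_divergence(new_policy, ref_policy))

# Suppose the new model samples token 2, which the original model rarely produced.
shaped = kl_shaped_reward(
    reward=1.0,
    logprob_new=math.log(new_policy[2]),
    logprob_ref=math.log(ref_policy[2]),
    kl_coef=0.1,
)
print(shaped)  # lower than the raw reward of 1.0
```

The penalty grows with the log-ratio, so outputs that the original model considered very unlikely lose the most reward.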

If you want a better intuition, you can have a look at [this video](https://www.youtube.com/watch?v=ErfnhcEV1O8), which explains entropy, cross-entropy and KL divergence, and how they are related to each other.

## Readings for RLHF

* [Fine-Tuning Language Models from Human Preferences](https://arxiv.org/abs/1909.08593) (paper)
* [Learning to summarize from human feedback](https://arxiv.org/abs/2009.01325) (paper)
* [AI safety via debate](https://openai.com/research/debate) (OpenAI blog post)
* [Thoughts on the impact of RLHF research](https://www.alignmentforum.org/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research), by Paul Christiano

## Content & Learning Objectives

Here, we'll only do the last two sections, as we'll use an existing model as the reward model.

#### ~~1️⃣ Prompt Dataset & Reward Model~~

In the first section, we'll get set up with the prompt dataset and reward model we'll be using for the rest of the exercises.

> ##### Learning objectives
> 
> * Load datasets from Huggingface and break them up into prompts
> * Output emotions from models with Huggingface models

#### 2️⃣ Using RLHF for Finetuning

In the second section, we'll use RLHF to finetune a pre-trained Flan-T5 model so that its completions express a chosen emotion.

> ##### Learning objectives
> 
> - Learn about TRLX and how it can be used
> - Using RLHF to improve a specific emotion of a Flan-T5 model

#### 3️⃣ Bonus

In this final section, we'll suggest a set of bonus exercises

# Just run!

In [None]:
!pip install datasets transformers git+https://github.com/CarperAI/trlx.git

In [None]:
import os; os.environ["ACCELERATE_DISABLE_RICH"] = "1"
import gc
from pathlib import Path
import torch
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForSequenceClassification
from typing import cast, Any, List, Optional, Union, Tuple
import tensorflow as tf

# Let TensorFlow grow GPU memory on demand instead of allocating it all at the start
physical_devices = tf.config.list_physical_devices('GPU') 
for gpu_instance in physical_devices: 
    tf.config.experimental.set_memory_growth(gpu_instance, True)

from trlx.data.default_configs import TRLConfig, TrainConfig, OptimizerConfig, SchedulerConfig, TokenizerConfig, ModelConfig
from trlx.models.modeling_ppo import PPOConfig
from trlx import train

def find_best_device() -> str:
    """Finds the best device to use 
    Returns:
        str: The name of the device to use.
    """
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available() and torch.backends.mps.is_built():
        return "mps"
    return "cpu"

device = find_best_device()
print(f"Using device: {device}")

# 1️⃣ Prompt Dataset & Reward Model


> ##### Learning objectives
> 
> * Load datasets from Huggingface and break them up into prompts
> * Generate text from Huggingface models 
> * Output specific emotions from models with a Flan-T5 model

## Background - BERT/RoBERTa

In the transformers chapter, we only worked with autoregressive transformers like GPT2 or OthelloGPT (unless you did the bracket classification task). Here, we'll work with BERT, a well-known **bidirectional transformer**. 

BERT predates GPT2 slightly (it was released in 2018, one year after the seminal "Attention is all you need" paper). It was the next in a proud tradition of naming transformers after muppets (no, [that's](https://arxiv.org/pdf/1910.13034.pdf) [not](https://arxiv.org/pdf/1904.09223.pdf) [a](https://arxiv.org/pdf/1905.12616.pdf) [joke](https://arxiv.org/pdf/1906.01604.pdf)). It has bidirectional attention, meaning we don't apply masking to the attention patterns - information can flow backwards and forwards in the model. BERT is usually used for classification tasks, such as sentiment analysis. 

### How is BERT trained?

The architecture is similar to GPT, although the "core BERT" model doesn't have an unembedding (i.e. the output has shape `(batch, seq_len, d_model)`). 

BERT is trained on two kinds of tasks: **next sentence prediction** (NSP) and **masked language modelling** (MLM).

* In MLM, we take a sequence and replace some of its tokens with a special `[MASK]` token, then train the model to predict the original token.
* In NSP, we take two sentences, and train the model to predict whether the second sentence follows the first (we do this by adding a small classifier at the end of BERT, which just reads from the final value of the residual stream at the zeroth sequence position, which is a special classification token `[CLS]`).
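
The MLM corruption step can be sketched as follows. This is a simplified version: it replaces each selected token with `[MASK]`, whereas BERT's actual recipe selects about 15% of tokens and sometimes substitutes a random token or leaves the token unchanged. The function name and the example sentence are made up.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with mask_token.

    Returns (masked_tokens, targets), where targets[i] is the original token
    at masked positions and None elsewhere (no loss is computed there).
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


tokens = "the cat sat on the mat because it was tired".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(targets)
```

Because attention is bidirectional, the model can use words on both sides of a masked position to predict it.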

Importantly, **both of these two tasks require the model to learn some kind of compressed representation of the input sequence** in its residual stream.

### How do we turn BERT into a classifier?

We usually stick a classification head onto the end of the "core BERT architecture" at the `[CLS]` token, then take the pretrained model and fine-tune it on a classification task. If pretraining has been successful, the model will have learned some kind of compressed representation of the input sequence in its residual stream, and the classifier will be doing something like feature extraction.

## Emotions dataset


First, load in the GoEmotions dataset. Documentation about the dataset can be found here: https://huggingface.co/datasets/go_emotions. We want to use both the train and test splits to collect prompts.



In [None]:
ds = load_dataset("go_emotions", split="train+test")

### Exercise - Create a set of prompts 

```c
Difficulty: 🟠🟠⚪⚪⚪
Importance: 🟠🟠🟠⚪⚪

You should spend up to ~10 minutes on this exercise.
```

A prompt to the model can look like "Today was not fun ", "In the event of " or "Mary gave John a ". These prompts will serve as the starting point for model generations during the RLHF process.

In the context of the exercise to push Flan-T5 towards outputting sentences with a more specific emotion, we want a set of prompts that can produce varying kinds of sentiments rather than just one kind. This set of prompts essentially forms our "observation space", and all completions are "actions". If our observation space contains primarily positive sentiment, the model will not update heavily and may still output negative sentiment when a prompt heavily favours it. Ideally, we want our set of prompts to have a mix of sentiments.

We want to collect the first few (3-5, the choice is yours) words from each sentence to serve as prompts for our finetuned model. The generated text from these prompts will be later used to evaluate the performance of our finetuned model.

Emphasis - **we want to capture these prompts straight from the emotions dataset rather than write them ourselves.**
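
Once you have a list of prompts, one optional refinement (not required by the exercise) is to drop very short prompts and remove duplicates, since many comments start with the same few words. The helper name `clean_prompts` and the `min_words` threshold are assumptions made for this sketch.

```python
from typing import List


def clean_prompts(prompts: List[str], min_words: int = 3) -> List[str]:
    """Drop prompts shorter than min_words and remove duplicates, keeping order."""
    seen = set()
    cleaned = []
    for prompt in prompts:
        if len(prompt.split()) < min_words or prompt in seen:
            continue
        seen.add(prompt)
        cleaned.append(prompt)
    return cleaned


example = ["I love this so", "Thanks!", "I love this so", "Why would anyone do"]
print(clean_prompts(example))  # ['I love this so', 'Why would anyone do']
```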


In [None]:
def generate_prompts(dataset) -> List[str]:
    '''Generate & return prompts from dataset.'''
    # SOLUTION

prompts = generate_prompts(ds)

<details>
<summary>Solution</summary>


```python
def generate_prompts(dataset):
    '''Generate & return prompts from dataset.'''
    prompts = [" ".join(review.split()[:4]) for review in dataset["text"]]
    return prompts

prompts = generate_prompts(ds)
```
</details>


## Flan-T5-small

The model that we will perform RLHF on is a Flan-T5 model, which can be found here: https://huggingface.co/google/flan-t5-small. We use this one because it has fewer parameters (60M parameters) than the smallest GPT2 model (117M parameters) and is thus convenient to train on consumer GPUs (<12GB of VRAM).


### Exercise - Load the Flan-T5-small model and generate reviews from prompts

```c
Difficulty: 🟠🟠🟠⚪⚪
Importance: 🟠🟠🟠⚪⚪

You should spend up to 10-25 minutes on this exercise.
```

You will need to use `AutoTokenizer` and `AutoModelForSeq2SeqLM` from the transformers package. You might want to use the `generate` method of the Flan-T5 model that you load. If you do, you should use sampling (e.g. `top_p` or `top_k`) and set the `max_new_tokens` argument to something that's large enough.

Play around with generating completions from these prompts and verify whether the completions approximately fit your initial expectations of the sentiments that the model would output.

**Note** - when you run `tokenizer(prompt, return_tensors="pt")`, this will return a dictionary-like object containing things like `input_ids` as well as a couple of other things that need to be passed into the model in a forward pass (e.g. `attention_mask`, a tensor indicating which positions are padding and should be masked). The best way to deal with this is to take `inputs = tokenizer(prompt, return_tensors="pt")` and run `model.generate(**inputs)`.
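
For intuition about what `top_p` sampling does, here is a sketch of the filtering step on a made-up next-token distribution. It is a plain-Python illustration of the idea (keep the most likely tokens until their cumulative probability reaches `top_p`, then renormalise), not the library's implementation.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose mass reaches top_p.

    probs: dict mapping token -> probability (summing to 1).
    Returns a renormalised dict containing only the kept tokens.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up next-token distribution.
next_token_probs = {"good": 0.5, "great": 0.3, "bad": 0.15, "zebra": 0.05}
filtered = top_p_filter(next_token_probs, top_p=0.9)
print(filtered)  # "zebra" is dropped; the rest are renormalised

# Sample the next token from the filtered distribution.
rng = random.Random(0)
tokens, weights = zip(*filtered.items())
print(rng.choices(tokens, weights=weights)[0])
```

Cutting off the low-probability tail keeps generations varied while avoiding the occasional nonsensical token.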

In [None]:
def generate_completion_from_model(prompt, model, tokenizer) -> str:
    '''
    Remember to set the `do_sample=True` flag when you call `model.generate`.
    '''
    raise NotImplementedError


def generate_completion(prompt) -> str:
    '''
    Loads the Flan-T5-small tokenizer and model, and generates completions for the given prompt (in the form of a string).

    Find name of model & tokenizer at the documentation page: https://huggingface.co/google/flan-t5-small.
    '''
    tokenizer = ...
    model = ...
    return generate_completion_from_model(prompt, model, tokenizer)


generate_completion(prompts[0]) 


<details>
<summary>Solution</summary>


```python
def generate_completion_from_model(prompt, model, tokenizer) -> str:
    '''
    Remember to set the `do_sample=True` flag when you call `model.generate`.
    '''
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    outputs = tokenizer.decode(model.generate(**inputs, do_sample=True, top_k=10, max_new_tokens=64).squeeze(0), skip_special_tokens=True)
    return outputs

def generate_completion(prompt) -> str:
    '''
    Loads the Flan-T5-small tokenizer and model, and generates completions for the given prompt (in the form of a string).

    Find name of model & tokenizer at the documentation page: https://huggingface.co/google/flan-t5-small.
    '''
    tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-small')
    model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-small').to(device)
    return generate_completion_from_model(prompt, model, tokenizer)
```

</details>


### The reward function

Judging by the name of this chapter, you might think that you would be providing the reward function yourself, but sadly we will not be doing this. Instead, we will use a language model trained to perform emotion analysis to generate an emotion score (a higher score for the chosen emotion means a higher reward). The language model we will use to generate emotion scores can be found here: https://huggingface.co/SamLowe/roberta-base-go_emotions

We can choose among the following emotions (in this order):

    - admiration
    - amusement
    - anger
    - annoyance
    - approval
    - caring
    - confusion
    - curiosity
    - desire
    - disappointment
    - disapproval
    - disgust
    - embarrassment
    - excitement
    - fear
    - gratitude
    - grief
    - joy
    - love
    - nervousness
    - optimism
    - pride
    - realization
    - relief
    - remorse
    - sadness
    - surprise
    - neutral


#### Exercise - Get emotion scores for a text

```c
Difficulty: 🟠🟠🟠🟠⚪
Importance: 🟠🟠🟠⚪⚪

You should spend up to 15-30 minutes on this exercise.
```

We can use the model mentioned above in eval mode to generate emotion scores and then transform the sentiments into rewards to be fed into the RLHF training loop.

Note - the model is not passed as an argument because we want you to call the model linked in the description **inside the function body**.

In [None]:
emotions = [
    "admiration",
    "amusement",
    "anger",
    "annoyance",
    "approval",
    "caring",
    "confusion",
    "curiosity",
    "desire",
    "disappointment",
    "disapproval",
    "disgust",
    "embarrassment",
    "excitement",
    "fear",
    "gratitude",
    "grief",
    "joy",
    "love",
    "nervousness",
    "optimism",
    "pride",
    "realization",
    "relief",
    "remorse",
    "sadness",
    "surprise",
    "neutral",
]

def reward_model(samples, emotion="joy", **kwargs) -> List[float]:
    '''
    Returns the rewards for the given samples (according to model which is defined inside function body).

    kwargs are passed to your model during a forward pass.
    '''
    pass

reward_model(["I'm very happy", "I'm so angry!!!!", "Meeh, this is not a very interesting notebook..."])


<details>
<summary>Solution</summary>


```python
def reward_model(samples, emotion="joy", **kwargs) -> List[float]:
    '''
    Returns the rewards for the given samples (according to model which is defined inside function body).

    kwargs are passed to your model during a forward pass.
    '''
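    # SOLUTION
    # Load the tokenizer and model inside the function, as the exercise asks.
    # (Reloading on every call is slow; consider caching them for long runs.)
    tokenizer = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")
    model = AutoModelForSequenceClassification.from_pretrained("SamLowe/roberta-base-go_emotions")
    model.eval()

    inputs = tokenizer(samples, padding=True, truncation=True, return_tensors="pt")

    with torch.inference_mode():
        outputs = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'], **kwargs)

    # GoEmotions is a multi-label task, so each emotion gets its own sigmoid
    # probability rather than sharing one softmax across all 28 labels.
    probabilities = torch.sigmoid(outputs.logits)

    # Index of the requested emotion in the `emotions` list defined above.
    emotion_idx = emotions.index(emotion)

    # One reward per sample: the probability assigned to the requested emotion.
    return probabilities[:, emotion_idx].tolist()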
```
</details>


### Exercise - Sentiment playground

```c
Difficulty: 🟠⚪⚪⚪⚪
Importance: 🟠🟠🟠🟠⚪

You should spend up to 10-15 minutes on this exercise.
```

The reward model is now ready, and you should take some time to feed in sentences of varying sentiments to check whether the rewards are as you expect. Remember that the reward model is also a trained model, so it exhibits all the quirks of one, such as weird failure modes and the potential to be broken by adversarial examples.

What are the most counterintuitive results you can find? 

In [None]:
for sentence in (
    'I really enjoy playing volleyball',
    'Wow, this lake is very cold...',
    "I'm getting worried, when is the starlink antenna arriving?"
):
    print(sentence, reward_model(sentence))

# 2️⃣ Using RLHF for Finetuning



> ##### Learning objectives
> 
> - Learn about TRLX and how it can be used
> - Use RLHF to improve an emotion on a Flan-T5 model


## TRLX


### What is TRLX?

trlX is a library made for training large language models using reinforcement learning. It currently supports training using PPO or [ILQL](https://arxiv.org/abs/2206.11871) for models up to 20B using Accelerate.

In practice, RLHF with trlX is very easy if you already have a reward model and pretrained model. 

### Using trlX

To use trlX, we need to choose:

- A training config.
- A prompt dataset.
- A reward function (which makes use of the reward model).
- Evaluation prompts.

These 4 objects are inputs to the `train` function, which has already been imported for you.


#### Training Config

See below for a config that performs RLHF using PPO when fed into trlX. All hyperparameters are set to enable training and are best left untouched for the next exercise. You might want to increase `max_new_tokens` to get longer generations on your evaluation prompts during finetuning.

Increasing `max_new_tokens` will increase training time. For reference, keeping everything else the same in the config below and changing `max_new_tokens` from 40 to 100 increases finetuning time from ~6 mins to ~10 mins, assuming the number of epochs and steps stays the same as the default. Picking a `max_new_tokens` value somewhere in the middle is a reasonable compromise.

The `model` keyword specifies which model will be finetuned. We will be using the same Flan-T5-small model that we used before to generate initial prompt completions.


#### Prompt Dataset

The prompt dataset is the dataset that we'll use to generate completions from the model specified in the config. These generations will then be scored by the chosen reward function, and this score is the reward that steers PPO to update the weights of the model towards maximising the reward function. As mentioned before, the prompt dataset also forms the observation space for the PPO algorithm.

#### Reward Function

The reward function provides rewards given a set of prompt completions. In this particular case, the rewards correspond to how strongly the completions express the chosen emotion (`joy` by default), and will steer the model towards generating strings that express that emotion.

#### Evaluation Prompts

The evaluation prompts are a set of prompts that we will use to validate the training process and the completions from these prompts will provide an indication of whether the overall sentiment is trending upwards.

We will have a single prompt repeated as the eval prompt for a number of times equal to the batch size of the reward model such as: 

```python
['I am quite interested '] * batch_size_of_reward_model
```

Note that the choice of evaluation prompt affects where the eval reward curve starts and where it ends up.


## Exercise: Putting it all together - Reinforcing positive sentiment

```c
Difficulty: 🟠🟠🟠⚪⚪
Importance: 🟠🟠🟠🟠⚪

You should spend up to 10-20 minutes on this exercise.
```

We will now call the `train` function and pass in the arguments described above. The `train` function has already been imported for you and should be called like so:

```python
trainer = train(
    reward_fn = ...,
    prompts = ...,
    eval_prompts = ...,
    config = ...
) 
```

If you want to save your model, you can save it using:

```python
trainer.save_pretrained("path/to/save")
```

All that you need to do below is fill in these four arguments in the `main` function. Make sure you understand what the significance of all four of these arguments is before moving on.

In [None]:
def ppo_config():
    return TRLConfig(
        train=TrainConfig(
            seq_length=1024,
            epochs=1,
            total_steps=5000,
            batch_size=32,
            checkpoint_interval=1000,
            eval_interval=100,
            pipeline="PromptPipeline",
            trainer="AcceleratePPOTrainer",
        ),
        model=ModelConfig(model_path="google/flan-t5-small", num_layers_unfrozen=-2, model_arch_type="seq2seq"),
        tokenizer=TokenizerConfig(tokenizer_path="google/flan-t5-small", truncation_side="right"),
        optimizer=OptimizerConfig(
            name="adamw", kwargs=dict(lr=3e-5, betas=(0.9, 0.95), eps=1.0e-8, weight_decay=1.0e-6)
        ),
        scheduler=SchedulerConfig(name="cosine_annealing", kwargs=dict(T_max=1e12, eta_min=3e-5)),
        method=PPOConfig(
            name="PPOConfig",
            num_rollouts=128,
            chunk_size=128,
            ppo_epochs=4,
            init_kl_coef=0.001,
            target=None,
            horizon=10000,
            gamma=1,
            lam=0.95,
            cliprange=0.2,
            cliprange_value=0.2,
            vf_coef=1,
            scale_reward="ignored",
            ref_mean=None,
            ref_std=None,
            cliprange_reward=10,
            gen_kwargs=dict(
                max_new_tokens=32,
                top_k=10, # or can do top_p
                do_sample=True,
            ),
        ),
    )

def reward_fn(
        samples,
        **kwargs,
):
    return torch.tensor(reward_model(samples)).to(device)

def main():
    # Call the `train` function with appropriate arguments
    raise NotImplementedError

gc.collect()
torch.cuda.empty_cache()
trainer = main()


<details>
<summary>Solution</summary>


```python

def main():
    # Call the `train` function with appropriate arguments
    trainer = train(
        reward_fn = reward_fn,
        prompts = prompts,
        eval_prompts = ['In my opinion'] * 32, # Feel free to try different prompts
        config =  ppo_config()
    )
    # You can save trainer here if you want, using trainer.save_pretrained("path/to/save")
    return trainer

```
</details>


Notice that we call `torch.cuda.empty_cache()` here, which frees GPU memory that may still be held as remnants of completed GPU operations or past failed runs. Running out of memory is a common issue, and calling `torch.cuda.empty_cache()` will help you avoid getting stuck. Sometimes this is insufficient and you will need to restart the kernel to free up memory. You can call `nvidia-smi` in your terminal to see how much GPU memory is currently being used, and run `watch -n 1 nvidia-smi` to keep a constant eye on GPU utilisation and available memory. Jupyter is unfortunately quite opaque in terms of memory management, so you might need to call `torch.cuda.empty_cache()` and `gc.collect()` more often than you would expect.

TRLX logs to W&B and you should be prompted to add in your W&B key at some point. Take a look at the reward graph that shows the change in reward received by completions from the eval_prompts over the course of the training run. All the prompt completions are stored in the files section under the media folder. 


## Exercise: Sentiment playground - Post RLHF

```c
Difficulty: 🟠⚪⚪⚪⚪
Importance: 🟠🟠🟠🟠⚪

You should spend up to ~10 minutes on this exercise.
```

Try out your RLHF'd model, ideally after a `gc.collect()` and a `torch.cuda.empty_cache()` call to ensure there is free GPU memory. 

In [None]:
# Sample here

<details><summary>Solution</summary>

```python
generate_completion_from_model("<Insert your prompt here>", trainer.model, trainer.tokenizer)
```

</details>

# 3️⃣ Bonus

## Bonus exercises - more RLHF exploration


Here are a few bonus exercises. They're ordered (approximately) from easiest to hardest. These (together with the bonus exercises from previous sections) should be enough to last for the rest of this week (until the end of the RL section).


### Calculate the KL penalty for divergence from the previous model.

Dive into the trlX trainer here: https://github.com/CarperAI/trlx/blob/404217b2f3f295ff0f68851524517064acc43a15/trlx/trainer/accelerate_ppo_trainer.py#L251

This is the function that implements many parts of the RLHF training loop; the KL divergence steps start at line 430. Try replicating this code for a toy model to calculate the KL divergence between the model and a copy of it during training.


### Experiment with other huggingface models

In the above exercise, we trained a Flan-T5 model to show a specific emotion. We can follow a similar procedure to tune other models with desirable behaviours. You can swap out the reward model function and the model to be finetuned with (almost) any Huggingface model. Below are a few suggestions for tasks:

#### FinBERT fine-tuning GPT2

FinBERT is a BERT model that classifies the sentiment of financial news (positive, negative or neutral). The reward for producing positive-sentiment text is therefore entangled with producing financial news, rather than any other kind of text. You can RLHF vanilla GPT2 with FinBERT as the reward model to verify this phenomenon and observe its effect.

* FinBERT: https://huggingface.co/ProsusAI/finbert
* Vanilla GPT2: https://huggingface.co/gpt2
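One way to plug a sentiment classifier into the training loop is to wrap it in a reward function that maps a list of generated samples to a list of floats. The sketch below assumes the classifier is a callable that returns, for each text, a list of `{"label": ..., "score": ...}` dicts (the shape a Hugging Face text-classification pipeline returns when asked for all class scores), and that the label of interest is called `"positive"`. `make_reward_fn` and `fake_classifier` are hypothetical names used only for illustration, so check the label names and output format of the model you actually load.

```python
from typing import Any, Callable, Dict, List

ClassifierOutput = List[List[Dict[str, Any]]]


def make_reward_fn(
    classifier: Callable[[List[str]], ClassifierOutput],
    target_label: str = "positive",
) -> Callable[..., List[float]]:
    """Build a reward function that scores each sample by P(target_label)."""

    def reward_fn(samples: List[str], **kwargs) -> List[float]:
        # **kwargs absorbs any extra keyword arguments a trainer may pass.
        rewards = []
        for scores in classifier(samples):
            by_label = {str(d["label"]).lower(): float(d["score"]) for d in scores}
            rewards.append(by_label.get(target_label, 0.0))
        return rewards

    return reward_fn


def fake_classifier(texts: List[str]) -> ClassifierOutput:
    """Stand-in for a real sentiment model, so the sketch runs offline."""
    outputs = []
    for text in texts:
        p = 0.9 if "profit" in text.lower() else 0.1
        outputs.append([
            {"label": "positive", "score": p},
            {"label": "negative", "score": 1.0 - p},
        ])
    return outputs


reward_fn = make_reward_fn(fake_classifier)
rewards = reward_fn(["Quarterly profit rose sharply.", "The cat sat on the mat."])
print(rewards)
```

To use a real model, replace `fake_classifier` with a callable that wraps your sentiment classifier and returns scores in the same shape.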

#### Tiny stories

Fine-tune vanilla GPT2 using a TinyStories model, to encourage good or bad endings to the initial prompts.

Reward model: https://huggingface.co/roneneldan/TinyStories-1M


### RLHF on DALL-E

Train a [version of DALL-E](https://github.com/lucidrains/DALLE-pytorch) (or a smaller version) using reinforcement learning from human feedback, so that it’s better at following human instructions.

Probably a useful first step here is finding something that the model isn’t good at because its training data didn’t incentivize it to be, but that it probably could be good at with a bit of data.

This project was recommended as a capstone for the last week of MLAB2, by Buck Shlegeris.


## Reward model mechanistic interpretability

Mechanistic interpretability on RLHF'ed models is a very interesting (and currently woefully unexplored) area. If you're interested in this, the person to speak to is [Curt Tigges](https://twitter.com/CurtTigges).

As a start, have a look at https://blog.eleuther.ai/trlx-exploratory-analysis/. There is also an accompanying [colab](https://colab.research.google.com/drive/1DK6_HNRjUHliolQ2uMYNpB24XyeD9BIl), which contains some exploratory analysis similar in tone to the early sections of IOI (but looking at the logit difference between the completions `" bad"` and `" good"` for the sentence `"This movie was really"`). A great exercise could be to replicate this or find other interesting mechanistic behaviour.

Here is a relevant section of Neel Nanda's "exciting open problems in mech interp" doc, which was drafted during the currently running training phase of SERI MATS. Take this with a pinch of salt, because it hasn't been heavily refined yet:

> RLHF seems like a big deal, and I have no idea how it changes language models. I’d love to understand this better! This breaks down in a few ways:
> 
> * Understanding what fine-tuning does to a model - does it just upweight or downweight certain circuitry, or does it actively learn new circuits? This is easiest to study in a non-RLHF setting,
> * Understanding the interplay between the preference model and the learned policy - this is one of the unique features of RLHF that I do not understand! 
>     * Does the preference model learn concepts significantly before the policy does? What does this look like?
> * More generally, just forming any understanding of the preference model would feel interesting to me
>     * Maybe looking for spurious correlations, like “summaries look higher quality if they have longer words, or if they’re longer”
> 
> Fuzzy ideas:
> 
> * Understand how networks form social models of users, as a precursor to manipulation or deception:
>     * Exploring sycophancy in models seems very exciting here, though probably doesn’t work on open source models (maybe the biggest LLAMA, but that’s such a pain to work with)

You may also be interested in the [Interpreting Reinforcement Learning](https://www.lesswrong.com/s/yivyHaCAmMJ3CqSyj/p/eqvvDM25MXLGqumnf) section of Neel's 200 Concrete Open Problems sequence.


## Bonus exercises (non-RLHF)


Here are a few more suggestions of bonus exercises that you can do for the rest of this week, which are not related to RLHF.


### Reimplement a bunch of RL algorithms

For example, a bunch of the algorithms from [https://spinningup.openai.com/](https://spinningup.openai.com/).

This project was recommended as a capstone for the last week of MLAB2, by Buck Shlegeris.

### Decision Transformers

*This section is taken from Joseph Bloom's guidelines on decision transformers.*

What to read:

* [[2106.01345] Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345)
* [[2106.02039] Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039)
* [[2205.06175] A Generalist Agent](https://arxiv.org/abs/2205.06175)

Offline RL, general steps:

1. Collect trajectories (process them into sequences).
2. Set up the architecture (decision transformer or other model).
    * Ensure you have ways to embed your states and actions (and returns-to-go, if it’s a decision transformer).
    * The main model is GPT2-like, but without an embedding layer (you do the embedding manually).
    * Predict one of the allowed actions (if the action space is discrete).
3. Train.
    * Use a predictive loss on actions to update the weights (so ignore the outputs of 2 out of 3 tokens in each timestep).
    * You can calculate test loss (important), but it is also essential to look at performance on the actual task (the real measure of generalisation). During inference, you need an environment to generate a reward/observation at every timestep for which your model predicts an action.
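The data-preparation part of these steps can be sketched in a few lines of plain Python. The sketch below computes returns-to-go from a reward list, interleaves a trajectory into (return-to-go, state, action) tokens, and builds the mask that keeps 1 out of every 3 outputs for the action loss. All function names and the toy trajectory are illustrative assumptions, and the mask assumes the action is read off the output at the state position.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Return-to-go at step t is the sum of rewards from t to the episode end."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg


def to_sequence(
    states: Sequence[Any], actions: Sequence[Any], rewards: Sequence[float]
) -> List[Tuple[str, Any]]:
    """Interleave a trajectory as (return-to-go, state, action) per timestep."""
    tokens: List[Tuple[str, Any]] = []
    for rtg, state, action in zip(returns_to_go(rewards), states, actions):
        tokens += [("rtg", rtg), ("state", state), ("action", action)]
    return tokens


def action_loss_mask(tokens: Sequence[Tuple[str, Any]]) -> List[bool]:
    """True only at state positions, whose outputs predict the next action."""
    return [kind == "state" for kind, _ in tokens]


# Toy trajectory with three timesteps.
tokens = to_sequence(
    states=["s0", "s1", "s2"], actions=[0, 1, 0], rewards=[1.0, 0.0, 2.0]
)
print(returns_to_go([1.0, 0.0, 2.0]))  # [3.0, 2.0, 2.0]
print(tokens[:3])
print(action_loss_mask(tokens))
```

In a real implementation each of the three token types would get its own embedding before being passed to the GPT2-like model, and the mask would be applied to the model's outputs when computing the loss.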

Github repo of original paper: https://github.com/kzl/decision-transformer/tree/master

My GitHub repo: jbloomAus/DecisionTransformerInterpretability (interpreting how transformers simulate agents performing RL tasks). I’m not sure my implementation is much more readable, since I’ve worked on it for a while and done lots of non-trivial things. You can also try to use my library and see how you go. I suspect it's pretty hard to use at the moment, but I'd love it if someone could make it through and then help me clean it up and make it more accessible. The core functions are pretty good.