# Reinforcement Learning from Human Feedback (RLHF)

An LLM trained on public internet data can generate text that is harmful, false, or unhelpful. RLHF is a tuning technique that has been critical for aligning an LLM's output with human preferences and values. It has been central to the rise of LLMs, and it can be useful even if we are not training an LLM from scratch but are instead building an application whose values we want to set. Fine-tuning is one way to do this, but RLHF can be useful in many cases. RLHF gathers data on which responses humans prefer and uses it to train the model to generate more responses that match those preferences.

## How RLHF works
Let's say we want to tune our model on summarization tasks. We might start by gathering some text samples to summarize and then have humans produce a summary for each input. We can use these human-generated summaries to create pairs of input text and summary, and we could train a model directly on these pairs. But there is no one correct way to summarize a piece of text. Natural language is flexible, and there are often many ways to say the same thing, so multiple summaries can be equally valid. Each summary might be technically correct, but different people and different audiences will have different preferences, and preferences are hard to quantify. Problems like entity extraction and classification have correct answers, but sometimes the task we want to teach the model doesn't have a clear, objectively best answer.

So instead of trying to find the best summary for a particular input text, we frame the problem a little differently. We gather information on human preferences by giving a human labeler two candidate summaries and asking which one they prefer. And instead of a standard supervised tuning process, where we tune the model to map an input to a single correct answer, we use reinforcement learning to tune the model to produce responses that are aligned with human preferences.<br>
**Supervised Fine Tuning →** {input text, summary}<br>
**RLHF →** {input text, summary 1, summary 2, human preference}<br>
RLHF consists of three stages. First, we create a preference dataset. Then we use this dataset to train a reward model with supervised learning. Finally, we use the reward model in an RL loop to fine-tune our base LLM.

Here we will use **Llama 2** as the base model. To build the preference dataset, we could ask labelers to rate each response on an absolute scale, but this does not yield the best results because scales are subjective and tend to vary across people. A better approach is to show the labeler two responses and ask which one they prefer. So a preference dataset records a human labeler's preference between two possible model outputs for the same input. Note that this dataset captures the preferences of the human labelers, not human preference in general. Creating a preference dataset can be one of the trickiest parts of the process, because we first need to define our alignment criteria. What are we trying to achieve by tuning? Do we want to make the model more useful, less toxic, more positive? We need to be clear on this so that we can give the labelers specific instructions and choose the correct labels for the task. Once we have done that, step one is complete.

Next we move on to step two: we take this preference dataset and use it to train a **reward model**. With RLHF and LLMs, this reward model is generally itself another LLM. At inference time we want the reward model to take any prompt and completion and return a scalar value that indicates how good that completion is for the given prompt. So the reward model is essentially a **regression model**: it outputs numbers. The reward model is trained on the preference dataset, using triplets of a prompt and two completions, the winning candidate and the losing candidate. For each candidate completion, we get the model to produce a score, and the loss function is a combination of these scores. Intuitively, we can think of this as trying to maximize the difference in score between the winning candidate and the losing candidate. Once we have trained this model, we can pass in a prompt and a completion and get back a score indicating how good the completion is. How good a completion is remains subjective, but we can think of it this way: the higher the number, the better the completion aligns with the preferences of the people who labeled the data.

In the final step, where the RL of RLHF comes into play, we use this reward model to tune the base LLM. Our goal is to tune the base LLM to produce completions that maximize the reward given by the reward model. If the base LLM produces completions that better align with the preferences of the people who labeled the data, it receives higher rewards from the reward model. To do this we introduce a second dataset, the prompt dataset. As the name implies, it is a dataset of prompts with no completions.<br>
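
Returning to the reward model: the intuition of widening the score gap between the winning and losing candidates is often written as a pairwise logistic loss. The sketch below assumes that specific form (the text above only says the loss combines the two scores). The function name `pairwise_preference_loss` and the example scores are hypothetical.

```python
import math


def pairwise_preference_loss(score_winner: float, score_loser: float) -> float:
    """Return -log(sigmoid(score_winner - score_loser)).

    Assumption: a logistic (Bradley-Terry style) pairwise loss; the
    text above does not specify the exact loss function.
    """
    margin = score_winner - score_loser
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch so exp() cannot overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward-model scores for one (winner, loser) pair.
good_ranking = pairwise_preference_loss(2.0, -1.0)  # winner scored higher
bad_ranking = pairwise_preference_loss(-1.0, 2.0)   # winner scored lower
print(good_ranking < bad_ranking)  # prints True
```

The loss shrinks as the winner's score rises above the loser's, so minimizing it pushes the two scores apart in the preferred direction.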
<img src="Images/RL1.png" width="600" height="400"><br>
RL is useful when we want to train a model to solve a task that involves a complex and fairly open-ended objective. We may not know in advance what the optimal solution is, but we can give the model rewards to guide it towards an optimal series of steps. In RL we frame a problem as an **agent** learning to solve a task by interacting with an **environment**. The agent performs an action on the environment, which changes the state of the environment, and the agent receives a reward that helps it learn the rules of that environment. For example, **AlphaGo** is a model trained with RL. It learns how to play the board game Go by trying things and receiving rewards or penalties based on its actions. This loop of taking actions and receiving rewards repeats for many steps, and this is how the agent learns. Note that this framework differs from supervised learning because there is no supervision: the agent isn't shown any examples that map from input to output. Instead, it learns by interacting with the environment, exploring a space of possible actions, and then adjusting its path. The agent's learned understanding of how rewarding each possible action is, given the current conditions, is stored in a function. This function takes as input the current state of the environment and outputs a set of possible actions that the agent can take next, along with the probability that each action will lead to a higher reward. This function, which maps the current state to the set of actions, is called the **policy**, and the goal of RL is to learn a policy that maximizes the reward. People often describe the policy as the brain of the agent, because it determines the decisions that the agent takes.<br>
<img src="Images/RL2.png" width="600" height="400"><br>
In RLHF the policy is the base LLM that we want to tune. The current state is whatever is in the context, and the actions are generating tokens. Each time the base LLM outputs a completion, it receives a reward from the reward model indicating how aligned that generated text is. Learning the policy that maximizes the reward amounts to obtaining an LLM that produces completions with high scores from the reward model. The policy is learned with a policy gradient method, **proximal policy optimization (PPO)**, which is a standard RL algorithm.
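
To make this mapping concrete, the toy sketch below treats the policy as a function from a context (the state) to a probability distribution over next tokens (the actions). The vocabulary, the logits, and the function names are hypothetical stand-ins; a real LLM computes the logits with a neural network, and PPO itself is not shown here.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_policy(context):
    """Hypothetical policy: map a state (the context) to action probabilities.

    The logits are hard-coded for illustration only; an LLM would
    compute them from the context.
    """
    vocabulary = ["good", "bad", "<eos>"]
    if context.endswith("[summary]: "):
        logits = [2.0, 0.5, 1.0]
    else:
        logits = [0.0, 0.0, 3.0]
    return dict(zip(vocabulary, softmax(logits)))


action_probs = toy_policy("Some post text. [summary]: ")
next_token = max(action_probs, key=action_probs.get)  # greedy action choice
print(next_token)
```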

Each time we update the weights, the policy should get a little better at outputting aligned text. In practice we usually add a penalty term to ensure the tuned model doesn't stray too far from the base model.<br>
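
One common form of this penalty subtracts a scaled estimate of how far the tuned model's token probabilities have moved from the base model's. The sketch below assumes that form; `beta`, the function name, and the log-probabilities are hypothetical values chosen for illustration.

```python
def penalized_reward(reward, tuned_logprobs, base_logprobs, beta=0.1):
    """Subtract a divergence penalty from the reward model's score.

    Assumption: the penalty is beta times the summed per-token
    log-probability gap between the tuned and base models on the
    sampled completion (a simple KL-style estimate).
    """
    if len(tuned_logprobs) != len(base_logprobs):
        raise ValueError("log-probability lists must have the same length")
    drift = sum(t - b for t, b in zip(tuned_logprobs, base_logprobs))
    return reward - beta * drift


# Hypothetical per-token log-probabilities for one sampled completion.
tuned = [-0.2, -0.1, -0.3]
base = [-0.9, -1.1, -0.8]
print(penalized_reward(1.5, tuned, base))
```

When the tuned model assigns its own completion much higher probability than the base model does, `drift` grows and the effective reward drops, which discourages the policy from moving too far.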
**Full Fine Tuning →** retrains the base model, updating all of the model's weights (parameters). Because LLMs are large, this takes a lot of time, so we can try PEFT instead.<br>
**Parameter Efficient Fine Tuning →** retrains the base model but updates only a small subset of its parameters, keeping all the others frozen. This small subset can consist of existing parameters or of newly added ones.

# Datasets For Reinforcement Learning Training

"Reinforcement Learning from Human Feedback" **(RLHF)** requires the following datasets:
- Preference dataset
  - Input prompt, candidate response 0, candidate response 1, choice (candidate 0 or 1)
- Prompt dataset
  - Input prompt only, no response
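
The two record shapes above can be checked with a small validator. This is a minimal sketch that assumes the field names `input_text`, `candidate_0`, `candidate_1`, and `choice` used by the sample files later in this document; the helper names and the example records are hypothetical.

```python
PREFERENCE_KEYS = {"input_text", "candidate_0", "candidate_1", "choice"}


def is_preference_record(record):
    """Check that a dict has the four preference fields and a 0/1 choice."""
    return (
        isinstance(record, dict)
        and PREFERENCE_KEYS <= record.keys()
        and record["choice"] in (0, 1)
    )


def is_prompt_record(record):
    """Check that a dict carries a prompt and no candidate responses."""
    return (
        isinstance(record, dict)
        and "input_text" in record
        and not any(key.startswith("candidate_") for key in record)
    )


# Hypothetical records, shortened for illustration.
preference = {
    "input_text": "Some post. [summary]: ",
    "candidate_0": " Summary A",
    "candidate_1": " Summary B",
    "choice": 1,
}
prompt = {"input_text": "Some post. [summary]: "}
print(is_preference_record(preference), is_prompt_record(prompt))
```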

### Preference dataset
This dataset has already been pre-processed.

In [1]:
preference_dataset_path = 'sample_preference.jsonl'

In [2]:
import json

In [3]:
preference_data = []

In [4]:
with open(preference_dataset_path) as f:
    for line in f:
        preference_data.append(json.loads(line))

In [5]:
sample_1 = preference_data[0]

In [6]:
print(type(sample_1))

<class 'dict'>


In [7]:
# This dictionary has four keys
print(sample_1.keys())

dict_keys(['input_text', 'candidate_0', 'candidate_1', 'choice'])


The key `'input_text'` holds the prompt.

In [8]:
sample_1['input_text']

'I live right next to a huge university, and have been applying for a variety of jobs with them through their faceless electronic jobs portal (the "click here to apply for this job" type thing) for a few months. \n\nThe very first job I applied for, I got an interview that went just so-so. But then, I never heard back (I even looked up the number of the person who called me and called her back, left a voicemail, never heard anything).\n\nNow, when I\'m applying for subsequent jobs - is it that same HR person who is seeing all my applications?? Or are they forwarded to the specific departments?\n\nI\'ve applied for five jobs there in the last four months, all the resumes and cover letters tailored for each open position. Is this hurting my chances? I never got another interview there, for any of the positions. [summary]: '

In [9]:
# Try another example from the list; every input_text ends the same way
preference_data[2]['input_text'][-50:]

'plan something in those circumstances. [summary]: '

Print `'candidate_0'` and `'candidate_1'`; these are two completions for the same prompt.

In [10]:
print(f"candidate_0:\n{sample_1.get('candidate_0')}\n")
print(f"candidate_1:\n{sample_1.get('candidate_1')}\n")

candidate_0:
 When applying through a massive job portal, is just one HR person seeing ALL of them?

candidate_1:
 When applying to many jobs through a single university jobs portal, is just one HR person reading ALL my applications?



Print `'choice'`; this is the human labeler's preference between the two completions (candidate_0 and candidate_1).

In [11]:
print(f"choice: {sample_1.get('choice')}")

choice: 1
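
Since `choice` is the index of the preferred candidate, a record can be turned into the (prompt, winner, loser) triplet that reward model training uses. The helper below is a hypothetical sketch, and its example record is a shortened stand-in for `sample_1` with placeholder prompt text.

```python
def to_triplet(record):
    """Return (prompt, winner, loser) using the record's 0/1 choice."""
    choice = record["choice"]
    if choice not in (0, 1):
        raise ValueError(f"choice must be 0 or 1, got {choice!r}")
    winner = record[f"candidate_{choice}"]
    loser = record[f"candidate_{1 - choice}"]
    return record["input_text"], winner, loser


# Shortened stand-in for sample_1; the real input_text is much longer.
record = {
    "input_text": "(post text) [summary]: ",
    "candidate_0": " When applying through a massive job portal, "
                   "is just one HR person seeing ALL of them?",
    "candidate_1": " When applying to many jobs through a single university "
                   "jobs portal, is just one HR person reading ALL my applications?",
    "choice": 1,
}
prompt, winner, loser = to_triplet(record)
print(winner)
```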


### Prompt dataset

In [12]:
prompt_dataset_path = 'sample_prompt.jsonl'

In [13]:
prompt_data = []

In [14]:
with open(prompt_dataset_path) as f:
    for line in f:
        prompt_data.append(json.loads(line))

In [15]:
# Check how many prompts there are in this dataset
len(prompt_data)

6

**Note**: It is important that the prompts in both datasets, the preference and the prompt, come from the same distribution. 

Here all the prompts come from the same dataset of [Reddit posts](https://github.com/openai/summarize-from-feedback).
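
A cheap sanity check in this direction is to confirm that every prompt in both files uses the same format, for example that each one ends with the `[summary]: ` marker seen in the outputs in this document. The sketch below checks only that formatting convention, not the underlying distribution. The helper name and the example records are hypothetical.

```python
SUMMARY_MARKER = "[summary]: "


def all_end_with_marker(records, marker=SUMMARY_MARKER):
    """Return True if every record's input_text ends with the marker."""
    return all(record["input_text"].endswith(marker) for record in records)


# Hypothetical stand-ins for preference_data and prompt_data.
preference_like = [{"input_text": "First post. [summary]: "}]
prompt_like = [
    {"input_text": "Second post. [summary]: "},
    {"input_text": "Third post without the marker"},
]
print(all_end_with_marker(preference_like), all_end_with_marker(prompt_like))
```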

In [16]:
# Function to print the information in the prompt dataset with a better visualization
def print_d(d):
    for key, val in d.items():
        print(f"key:{key}\nval:{val}\n")

In [17]:
print_d(prompt_data[0])

key:input_text
val:I noticed this the very first day! I took a picture of it to send to one of my friends who is a fellow redditor. Later when I was getting to know my suitemates, I asked them if they ever used reddit, and they showed me the stencil they used to spray that! Along with the lion which is his trademark. 
 But [summary]: 



In [18]:
# Try with another prompt from the list 
print_d(prompt_data[1])

key:input_text
val:Nooooooo, I loved my health class! My teacher was amazing! Most days we just went outside and played and the facility allowed it because the health teacher's argument was that teens need to spend time outside everyday and he let us do that. The other days were spent inside with him teaching us how to live a healthy lifestyle. He had guest speakers come in and reach us about nutrition and our final was open book...if we even had a final.... [summary]: 

