# Lesson 3: Tune an LLM with RLHF

#### Project environment setup

The RLHF training process is implemented as a machine learning pipeline in the Google Cloud Pipeline Components library. The pipeline can run on any platform that supports Kubeflow Pipelines (an open source framework), including Google Cloud's Vertex AI Pipelines.

To run it locally, install the following:

```Python
!pip3 install google-cloud-pipeline-components
!pip3 install kfp
```

In [1]:
!pip3 install google-cloud-pipeline-components
!pip3 install kfp

Collecting google-cloud-pipeline-components
  Downloading google_cloud_pipeline_components-2.18.0-py3-none-any.whl.metadata (6.0 kB)
Collecting kfp<2.11.0,>=2.6.0 (from google-cloud-pipeline-components)
  Downloading kfp-2.10.1.tar.gz (343 kB)

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-chroma 0.2.0 requires langchain-core!=0.3.0,!=0.3.1,!=0.3.10,!=0.3.11,!=0.3.12,!=0.3.13,!=0.3.14,!=0.3.2,!=0.3.3,!=0.3.4,!=0.3.5,!=0.3.6,!=0.3.7,!=0.3.8,!=0.3.9,<0.4.0,>=0.2.43, but you have langchain-core 0.1.53 which is incompatible.
langsmith 0.1.147 requires requests-toolbelt<2.0.0,>=1.0.0, but you have requests-toolbelt 0.10.1 which is incompatible.
opentelemetry-proto 1.29.0 requires protobuf<6.0,>=5.0, but you have protobuf 4.25.5 which is incompatible.




In [2]:
# Import (RLHF is currently in preview)
from google_cloud_pipeline_components.preview.llm \
import rlhf_pipeline

In [3]:
# Import from Kubeflow Pipelines
from kfp import compiler

In [4]:
# Define a path to the yaml file
RLHF_PIPELINE_PKG_PATH = "rlhf_pipeline.yaml"

In [5]:
# Execute the compile function
compiler.Compiler().compile(
    pipeline_func = rlhf_pipeline,
    package_path = RLHF_PIPELINE_PKG_PATH
)

In [13]:
# Print the first 100 lines of the YAML file
!powershell -Command "Get-Content rlhf_pipeline.yaml -TotalCount 100"

# PIPELINE DEFINITION
# Name: rlhf-train-template
# Description: Performs reinforcement learning from human feedback.
# Inputs:
#    accelerator_type: str [Default: 'GPU']
#    deploy_model: bool [Default: True]
#    encryption_spec_key_name: str [Default: '']
#    eval_dataset: str
#    instruction: str
#    kl_coeff: float [Default: 0.1]
#    large_model_reference: str
#    location: str [Default: '{{$.pipeline_google_cloud_location}}']
#    model_display_name: str
#    preference_dataset: str
#    project: str [Default: '{{$.pipeline_google_cloud_project_id}}']
#    prompt_dataset: str
#    prompt_sequence_length: int [Default: 512.0]
#    reinforcement_learning_rate_multiplier: float [Default: 1.0]
#    reinforcement_learning_train_steps: int [Default: 1000.0]
#    reward_model_learning_rate_multiplier: float [Default: 1.0]
#    reward_model_train_steps: int [Default: 1000.0]
#    target_sequence_length: int [Default: 64.0]
#    tensorboard_resource_id: str [Default: '']
# Outputs:
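
The comment header above lists every pipeline input together with its default, if it has one. As a minimal sketch, assuming the header keeps the `name: type [Default: value]` layout shown in this output, the inputs that have no default can be listed programmatically. The function name `parse_pipeline_inputs` and the shortened `HEADER` sample are illustrative and not part of the library.

```python
import re

# Shortened copy of the header printed above, used here as sample input.
HEADER = """\
# PIPELINE DEFINITION
# Name: rlhf-train-template
# Inputs:
#    eval_dataset: str
#    kl_coeff: float [Default: 0.1]
#    prompt_sequence_length: int [Default: 512.0]
# Outputs:
"""

def parse_pipeline_inputs(yaml_text):
    """Return {input_name: (type_name, default_or_None)} from the comment header."""
    inputs = {}
    in_inputs = False
    for line in yaml_text.splitlines():
        if not line.startswith("#"):
            break  # the leading comment header has ended
        body = line[1:].strip()
        if body == "Inputs:":
            in_inputs = True
            continue
        if body == "Outputs:":
            break
        if in_inputs:
            # Assumed layout: "name: type" optionally followed by " [Default: value]"
            match = re.fullmatch(r"(\w+): (\w+)(?: \[Default: (.*)\])?", body)
            if match:
                name, type_name, default = match.groups()
                inputs[name] = (type_name, default)
    return inputs

inputs = parse_pipeline_inputs(HEADER)
# Inputs without a default value in the header
required = [name for name, (_, default) in inputs.items() if default is None]
print(required)
```

To apply this to the compiled file, pass the text of `rlhf_pipeline.yaml` (for example, `open(RLHF_PIPELINE_PKG_PATH).read()`) instead of `HEADER`.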

## Define the Vertex AI pipeline job

### Define the location of the training and evaluation data
Previously, the datasets were loaded from small JSONL files, but for typical training jobs, the datasets are much larger, and are usually stored in cloud storage (in this case, Google Cloud Storage).

**Note:** Make sure that the three datasets are stored in the same Google Cloud Storage bucket.
```Python
parameter_values={
        "preference_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text_small/summarize_from_feedback_tfds/comparisons/train/*.jsonl",
        "prompt_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/train/*.jsonl",
        "eval_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/val/*.jsonl",
    ...
```
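
The note about keeping all three datasets in one bucket can be checked before submitting a job. The following is a minimal sketch using only the standard library; the helper name `gcs_bucket` is illustrative, and the check only compares the URI strings without contacting Cloud Storage.

```python
from urllib.parse import urlparse

def gcs_bucket(uri):
    """Return the bucket name of a gs:// URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    return parsed.netloc

dataset_uris = {
    "preference_dataset":
        "gs://vertex-ai/generative-ai/rlhf/text_small/summarize_from_feedback_tfds/comparisons/train/*.jsonl",
    "prompt_dataset":
        "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/train/*.jsonl",
    "eval_dataset":
        "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/val/*.jsonl",
}

# Collect the distinct bucket names; exactly one is expected.
buckets = {gcs_bucket(uri) for uri in dataset_uris.values()}
if len(buckets) != 1:
    raise ValueError(f"datasets are spread across buckets: {sorted(buckets)}")
print(buckets)  # {'vertex-ai'}
```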

### Choose the foundation model to be tuned

In this case, we are tuning the [Llama-2](https://ai.meta.com/llama/) foundation model. The pipeline parameter that names the LLM to tune is **large_model_reference**.

In this course, we're tuning llama-2-7b, but you can also run an RLHF pipeline on Vertex AI to tune other models, such as T5x or text-bison@001.

```Python
parameter_values={
        "large_model_reference": "llama-2-7b",
        ...
```

### Calculate the number of reward model training steps

**reward_model_train_steps** is the number of steps to use when training the reward model. This depends on the size of your preference dataset. For best results, we recommend training the reward model over the preference dataset for 20-30 epochs.

$$ stepsPerEpoch = \left\lceil \frac{datasetSize}{batchSize} \right\rceil$$
$$ trainSteps = stepsPerEpoch \times numEpochs$$

The RLHF pipeline parameters take the number of training steps, not the number of epochs. Here's an example of how to convert epochs to training steps, given that the batch size for this pipeline is fixed at 64 examples per batch.



In [14]:
# Preference dataset size
PREF_DATASET_SIZE = 3000

In [15]:
# Batch size is fixed at 64
BATCH_SIZE = 64

In [16]:
import math

In [17]:
REWARD_STEPS_PER_EPOCHS = math.ceil(PREF_DATASET_SIZE / BATCH_SIZE)
print(REWARD_STEPS_PER_EPOCHS)

47


In [18]:
REWARD_NUM_EPOCHS = 30

In [21]:
# Calculate number of steps in the reward model training
reward_model_train_steps = REWARD_STEPS_PER_EPOCHS * REWARD_NUM_EPOCHS

In [22]:
print(reward_model_train_steps)

1410


### Calculate the number of reinforcement learning training steps
The **reinforcement_learning_train_steps** parameter is the number of reinforcement learning steps to perform when tuning the base model. 
- The number of training steps depends on the size of your prompt dataset. Usually, this model should train over the prompt dataset for roughly 10-20 epochs.
- Reward hacking: if given too many training steps, the policy model may figure out a way to exploit the reward and exhibit undesired behavior.

In [23]:
# prompt dataset size
PROMPT_DATASET_SIZE = 2000

In [24]:
# Batch size is fixed at 64
BATCH_SIZE = 64

In [25]:
import math

In [26]:
RL_STEPS_PER_EPOCHS = math.ceil(PROMPT_DATASET_SIZE / BATCH_SIZE)
print(RL_STEPS_PER_EPOCHS)

32


In [27]:
RL_NUM_EPOCHS = 10

In [29]:
# Calculate the number of steps in the RL training
reinforcement_learning_train_steps = RL_STEPS_PER_EPOCHS * RL_NUM_EPOCHS

In [30]:
print(reinforcement_learning_train_steps)

320
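
The reward model and reinforcement learning calculations follow the same pattern, so they can be folded into one helper. This is a minimal sketch; the function name `train_steps` is illustrative and not part of the pipeline library, and the default batch size of 64 is the fixed value stated earlier.

```python
import math

def train_steps(dataset_size, num_epochs, batch_size=64):
    """Convert a number of epochs into a number of training steps."""
    if dataset_size <= 0 or num_epochs <= 0 or batch_size <= 0:
        raise ValueError("dataset_size, num_epochs and batch_size must be positive")
    # A partially filled final batch still counts as one step, hence the ceiling.
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    return steps_per_epoch * num_epochs

print(train_steps(3000, 30))  # reward model: 1410
print(train_steps(2000, 10))  # reinforcement learning: 320
```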


### Define the instruction

- Choose the task-specific instruction that you want to use to tune the foundational model.  For this example, the instruction is "Summarize in less than 50 words."
- You can choose different instructions, for example, "Write a reply to the following question or comment." Note that you would also need to collect your preference dataset with the same instruction added to the prompt, so that both the responses and the human preferences are based on that instruction.

In [31]:
# Completed values for the dictionary
parameter_values={
        "preference_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text_small/summarize_from_feedback_tfds/comparisons/train/*.jsonl",
        "prompt_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/train/*.jsonl",
        "eval_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/val/*.jsonl",
        "large_model_reference": "llama-2-7b",
        "reward_model_train_steps": 1410,
        "reinforcement_learning_train_steps": 320, # results from the calculations above
        "reward_model_learning_rate_multiplier": 1.0,
        "reinforcement_learning_rate_multiplier": 1.0,
        "kl_coeff": 0.1, # increased to reduce reward hacking
        "instruction":\
    "Summarize in less than 50 words"}

### Train with full dataset: dictionary 'parameter_values' 

- When training with the full dataset, adjust the settings to get the best results in the evaluation (next lesson). The values below came out of several training experiments with the pipeline; they are the best-performing parameter values found.

```python
parameter_values={
        "preference_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text/summarize_from_feedback_tfds/comparisons/train/*.jsonl",
        "prompt_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text/reddit_tfds/train/*.jsonl",
        "eval_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text/reddit_tfds/val/*.jsonl",
        "large_model_reference": "llama-2-7b",
        "reward_model_train_steps": 10000,
        "reinforcement_learning_train_steps": 10000, 
        "reward_model_learning_rate_multiplier": 1.0,
        "reinforcement_learning_rate_multiplier": 0.2,
        "kl_coeff": 0.1,
        "instruction":\
    "Summarize in less than 50 words"}
```

### Set up Google Cloud to run the Vertex AI pipeline

The Vertex AI SDK is already installed in this classroom environment. If you were running this in your own project, you would install the Vertex AI SDK like this:
```Python
!pip3 install google-cloud-aiplatform
```

In [32]:
!pip3 install google-cloud-aiplatform



In [36]:
# Authenticate in utils
from utils import authenticate
credentials, PROJECT_ID, STAGING_BUCKET = authenticate()

# The RLHF pipeline is available in this region
REGION = "us-east4"

## Run the pipeline job on Vertex AI

Now that we have created our dictionary of values, we can create a PipelineJob. The RLHF pipeline will then execute on Vertex AI: it does not run locally in the notebook, but on a server in Google Cloud.

In [37]:
import google.cloud.aiplatform as aiplatform

In [38]:
aiplatform.init(project = PROJECT_ID,
                location = REGION,
                credentials = credentials)

In [39]:
# Look at the path for the YAML file
RLHF_PIPELINE_PKG_PATH

'rlhf_pipeline.yaml'

In [40]:
job = aiplatform.PipelineJob(
    display_name="tutorial-rlhf-tuning",
    pipeline_root=STAGING_BUCKET,
    template_path=RLHF_PIPELINE_PKG_PATH,
    parameter_values=parameter_values)

In [41]:
job.run()

Error when trying to get or create a GCS bucket for the pipeline output artifacts
Traceback (most recent call last):
  File "C:\ProgramData\miniconda3\Lib\site-packages\google\cloud\aiplatform\pipeline_jobs.py", line 464, in submit
    gcs_utils.create_gcs_bucket_for_pipeline_artifacts_if_it_does_not_exist(
  File "C:\ProgramData\miniconda3\Lib\site-packages\google\cloud\aiplatform\utils\gcs_utils.py", line 232, in create_gcs_bucket_for_pipeline_artifacts_if_it_does_not_exist
    storage_client = storage.Client(
                     ^^^^^^^^^^^^^^^
  File "C:\ProgramData\miniconda3\Lib\site-packages\google\cloud\storage\client.py", line 228, in __init__
    super(Client, self).__init__(
  File "C:\ProgramData\miniconda3\Lib\site-packages\google\cloud\client\__init__.py", line 321, in __init__
    Client.__init__(
  File "C:\ProgramData\miniconda3\Lib\site-packages\google\cloud\client\__init__.py", line 167, in __init__
    raise ValueError(_GOOGLE_AUTH_CREDENTIALS_HELP)
ValueError: Thi

ServiceUnavailable: 503 Getting metadata from plugin failed with error: 'str' object has no attribute 'before_request'
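
The final error reports that a `'str'` object has no `before_request` attribute. `before_request` is a method that google-auth credential objects provide, so one plausible reading (an assumption, since the `authenticate` helper is not shown here) is that a string, such as a key path or token, was passed where a credentials object was expected. A small guard like the hypothetical `check_credentials` below could surface this before calling `aiplatform.init`; it is a sketch and not part of the SDK.

```python
def check_credentials(credentials):
    """Raise early if `credentials` does not look like a google-auth credentials object."""
    # google-auth credentials expose before_request(); plain strings do not.
    if isinstance(credentials, str) or not hasattr(credentials, "before_request"):
        raise TypeError(
            "expected a credentials object, got "
            f"{type(credentials).__name__}; pass the loaded credentials, "
            "not a path or token string"
        )
    return credentials
```

It would be called as `check_credentials(credentials)` right after `authenticate()` returns, so a wrong type fails in the notebook rather than during job submission.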