# Fine-tuning an LLM with RLHF

## Project environment setup

The RLHF training process is implemented as a machine learning pipeline in the Google Cloud Pipeline Components library. The pipeline can run on any platform that supports Kubeflow Pipelines (an open source framework), including Google Cloud's Vertex AI Pipelines.

To run it locally, install the following:

```Python
!pip3 install google-cloud-pipeline-components
!pip3 install kfp
```

## Compile the pipeline

In [1]:
!pip install google_cloud_pipeline_components
!pip install kfp
!pip install pyyaml
!pip install python-dotenv




In [2]:
from google_cloud_pipeline_components.preview.llm import rlhf_pipeline # Import the RLHF pipeline
import google.cloud.aiplatform as aiplatform
from kfp import compiler # Import from KubeFlow pipelines
import math
import yaml
import re
import sys
import os

custom_module_dir = 'C:/Users/MD. REZUWAN HASAN/Desktop/Jupyter Notebooks/rlhf/modules' # Directory that contains utils.py
sys.path.append(custom_module_dir) # sys.path entries must be directories, not .py files

from utils import authenticate # Authenticate in utils

In [3]:
util_path = "C:/Users/MD. REZUWAN HASAN/Desktop/Jupyter Notebooks/rlhf/utilities"

RLHF_PIPELINE_PKG_PATH = f"{util_path}/rlhf_pipeline.yaml" # Define a path to the yaml file

In [4]:
# Execute the compile function

compiler.Compiler().compile( 
    pipeline_func=rlhf_pipeline,
    package_path=RLHF_PIPELINE_PKG_PATH
)

## Reading the YAML file

In [5]:
# !head rlhf_pipeline.yaml # Print the first lines of the YAML file

# Function to extract comments from YAML file
def extract_comments(file_path):
    with open(file_path, 'r') as file:
        content = file.read()
    
    
    comments = re.findall(r'#.*', content) # Regex to find comments
    return comments

yaml_file_path = RLHF_PIPELINE_PKG_PATH
comments = extract_comments(yaml_file_path)
print("Comments found in the YAML file:")
#print(comments)

comments

Comments found in the YAML file:


['# PIPELINE DEFINITION',
 '# Name: rlhf-train-template',
 '# Description: Performs reinforcement learning from human feedback.',
 '# Inputs:',
 "#    accelerator_type: str [Default: 'GPU']",
 '#    deploy_model: bool [Default: True]',
 "#    encryption_spec_key_name: str [Default: '']",
 '#    eval_dataset: str',
 '#    instruction: str',
 '#    kl_coeff: float [Default: 0.1]',
 '#    large_model_reference: str',
 "#    location: str [Default: '{{$.pipeline_google_cloud_location}}']",
 '#    model_display_name: str',
 '#    preference_dataset: str',
 "#    project: str [Default: '{{$.pipeline_google_cloud_project_id}}']",
 '#    prompt_dataset: str',
 '#    prompt_sequence_length: int [Default: 512.0]',
 '#    reinforcement_learning_rate_multiplier: float [Default: 1.0]',
 '#    reinforcement_learning_train_steps: int [Default: 1000.0]',
 '#    reward_model_learning_rate_multiplier: float [Default: 1.0]',
 '#    reward_model_train_steps: int [Default: 1000.0]',
 '#    target_sequence_
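
The regex `#.*` used above matches every `#` in the file, including any that appear inside quoted strings or descriptions. If only the pipeline header is of interest, a narrower approach is to read the leading comment lines and stop at the first non-comment line. The following is a minimal sketch of that idea; the function name `read_header_comments` and the sample file content are hypothetical and are not part of the pipeline library.

```python
import tempfile
from pathlib import Path


def read_header_comments(file_path):
    """Return the leading '#' comment lines of a text file.

    Reading stops at the first line that is not a comment, so '#'
    characters that appear later (for example inside strings) are ignored.
    """
    header = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            stripped = line.strip()
            if not stripped.startswith("#"):
                break
            header.append(stripped)
    return header


# Demo on a small stand-in file (hypothetical content, not the real pipeline YAML).
sample = "# PIPELINE DEFINITION\n# Name: demo\ncomponents:\n  note: 'a # inside a string'\n"
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.yaml"
    sample_path.write_text(sample, encoding="utf-8")
    header = read_header_comments(sample_path)

print(header)
```

With the real file, the call would be `read_header_comments(RLHF_PIPELINE_PKG_PATH)`.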

In [6]:
# Load the YAML data
with open(yaml_file_path, 'r') as file:
    data = yaml.safe_load(file)

print(data)

{'components': {'comp-bulk-inferrer': {'executorLabel': 'exec-bulk-inferrer', 'inputDefinitions': {'parameters': {'accelerator_count': {'description': 'Number of accelerators.', 'parameterType': 'NUMBER_INTEGER'}, 'accelerator_type': {'description': 'Type of accelerator.', 'parameterType': 'STRING'}, 'dataset_split': {'description': 'Perform inference on this split of the input dataset.', 'parameterType': 'STRING'}, 'encryption_spec_key_name': {'defaultValue': '', 'description': 'Customer-managed encryption key. If this is set,\nthen all resources created by the CustomJob will be encrypted with the\nprovided encryption key. Note that this is not supported for TPU at the\nmoment.', 'isOptional': True, 'parameterType': 'STRING'}, 'image_uri': {'parameterType': 'STRING'}, 'input_dataset_path': {'description': 'Path to dataset to use for inference.', 'parameterType': 'STRING'}, 'input_model': {'description': 'Model to use for inference.', 'parameterType': 'STRING'}, 'inputs_sequence_length

In [7]:
len(data)

6
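
The length alone says little about what the six top-level entries contain. A small helper that lists each key with the type and size of its value gives a quicker overview than printing the whole dictionary. This is only a sketch: `describe_top_level` is a hypothetical helper, and `sample_data` is a stand-in whose only key taken from the output above is `components`.

```python
def describe_top_level(data):
    """Map each top-level key to the type name and size of its value."""
    summary = {}
    for key, value in data.items():
        # Only containers and strings have a meaningful length.
        size = len(value) if isinstance(value, (dict, list, str)) else None
        summary[key] = (type(value).__name__, size)
    return summary


# Hypothetical stand-in for the dictionary returned by yaml.safe_load above.
sample_data = {
    "components": {"comp-bulk-inferrer": {}},
    "exampleVersion": "1.0",
}
for key, (type_name, size) in describe_top_level(sample_data).items():
    print(f"{key}: {type_name} (size {size})")
```

In the notebook, `describe_top_level(data)` would summarize the loaded pipeline definition.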

## Define the Vertex AI pipeline job

### Define the location of the training and evaluation data
Previously, the datasets were loaded from small JSONL files, but for typical training jobs, the datasets are much larger, and are usually stored in cloud storage (in this case, Google Cloud Storage).

**Note:** Make sure that the three datasets are stored in the same Google Cloud Storage bucket.
```Python
parameter_values={
        "preference_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text_small/summarize_from_feedback_tfds/comparisons/train/*.jsonl",
        "prompt_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/train/*.jsonl",
        "eval_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/val/*.jsonl",
    ...
```
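
The note about keeping all three datasets in one bucket can be checked before submitting a job. The sketch below parses each `gs://` URI with the standard library and compares bucket names; the helper names `gcs_bucket` and `share_one_bucket` are assumptions made for illustration, not part of any Google Cloud library.

```python
from urllib.parse import urlparse


def gcs_bucket(uri):
    """Return the bucket name of a gs:// URI, or raise ValueError."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a Cloud Storage URI: {uri!r}")
    return parsed.netloc


def share_one_bucket(uris):
    """True if every URI points into the same bucket."""
    return len({gcs_bucket(uri) for uri in uris}) == 1


dataset_uris = [
    "gs://vertex-ai/generative-ai/rlhf/text_small/summarize_from_feedback_tfds/comparisons/train/*.jsonl",
    "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/train/*.jsonl",
    "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/val/*.jsonl",
]
print(share_one_bucket(dataset_uris))
```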

### Choose the foundation model to be tuned

In this case, we are tuning the [Llama-2](https://ai.meta.com/llama/) foundation model. The parameter that names the LLM to tune is **large_model_reference**.

In this course, we're tuning llama-2-7b, but you can also run an RLHF pipeline on Vertex AI to tune models such as T5x or text-bison@001.

```Python
parameter_values={
        "large_model_reference": "llama-2-7b",
        ...
```

### Calculate the number of reward model training steps

**reward_model_train_steps** is the number of steps to use when training the reward model. This depends on the size of your preference dataset. For best results, we recommend training the reward model over the preference dataset for 20-30 epochs.

$$ stepsPerEpoch = \left\lceil \frac{datasetSize}{batchSize} \right\rceil$$
$$ trainSteps = stepsPerEpoch \times numEpochs$$

The RLHF pipeline expects the number of training steps rather than the number of epochs. The cells further down convert epochs to training steps, given that the batch size for this pipeline is fixed at 64 examples per batch.



### Calculate the number of reinforcement learning training steps
The **reinforcement_learning_train_steps** parameter is the number of reinforcement learning steps to perform when tuning the base model. 
- The number of training steps depends on the size of your prompt dataset. Usually, the policy model should train over the prompt dataset for roughly 10-20 epochs.
- Reward hacking: if given too many training steps, the policy model may figure out a way to exploit the reward and exhibit undesired behavior.
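
Besides limiting the number of training steps, RLHF setups commonly guard against reward hacking by subtracting a KL penalty from the reward, scaled by a coefficient such as the `kl_coeff` input listed in the pipeline header. The sketch below assumes that common formulation; the source does not confirm that the Vertex AI pipeline computes the penalty exactly this way, and the function and argument names are hypothetical.

```python
def kl_penalized_reward(reward, logprob_policy, logprob_reference, kl_coeff=0.1):
    """Subtract a scaled per-sample KL estimate from the reward model's score."""
    # Per-sample estimate of how far the tuned policy has drifted from the base model.
    kl_estimate = logprob_policy - logprob_reference
    return reward - kl_coeff * kl_estimate


# Same raw reward, but the second response has drifted further from the base model.
close_to_base = kl_penalized_reward(reward=1.0, logprob_policy=-2.0, logprob_reference=-2.5)
far_from_base = kl_penalized_reward(reward=1.0, logprob_policy=-2.0, logprob_reference=-6.0)
print(close_to_base > far_from_base)
```

Under this formulation, a larger `kl_coeff` makes drift from the base model more costly, so the policy has less room to exploit quirks of the reward model.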

In [8]:
# Reward model fine-tuning parameters

PREF_DATASET_SIZE = 3000 # Preference dataset size
BATCH_SIZE = 64 # Batch size is fixed at 64
REWARD_NUM_EPOCHS = 30 # Number of epochs to train the reward model

# Base LLM fine-tuning parameters

PROMPT_DATASET_SIZE = 2000 # Prompt dataset size
RL_NUM_EPOCHS = 10 # Number of epochs for reinforcement learning

In [9]:
# Reward model
print("For reward model")
print()

REWARD_STEPS_PER_EPOCH = math.ceil(PREF_DATASET_SIZE / BATCH_SIZE)
print(f"steps per epoch: {REWARD_STEPS_PER_EPOCH}")

reward_model_train_steps = REWARD_STEPS_PER_EPOCH * REWARD_NUM_EPOCHS #number of steps in the reward model training
print(f"train steps: {reward_model_train_steps}")

print()
print("=======================================================")
print()

# Base LLM
print("For Base LLM")
print()

RL_STEPS_PER_EPOCH = math.ceil(PROMPT_DATASET_SIZE / BATCH_SIZE)
print(f"RL steps per epoch: {RL_STEPS_PER_EPOCH}")

reinforcement_learning_train_steps = RL_STEPS_PER_EPOCH * RL_NUM_EPOCHS
print(f"RL train steps: {reinforcement_learning_train_steps}") # Number of RL steps to perform when tuning the base model

For reward model

steps per epoch: 47
train steps: 1410

=======================================================

For Base LLM

RL steps per epoch: 32
RL train steps: 320
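
The two calculations above follow the same formula, so they can be wrapped in one small function. This is a convenience sketch, not part of the pipeline library; the name `train_steps` is hypothetical, and the default batch size of 64 comes from the text above.

```python
import math


def train_steps(dataset_size, num_epochs, batch_size=64):
    """Convert a number of epochs into a number of training steps."""
    if dataset_size <= 0 or num_epochs <= 0 or batch_size <= 0:
        raise ValueError("dataset_size, num_epochs and batch_size must be positive")
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    return steps_per_epoch * num_epochs


print(train_steps(3000, 30))  # reward model: 1410
print(train_steps(2000, 10))  # reinforcement learning: 320
```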


### Define the instruction

- Choose the task-specific instruction that you want to use to tune the foundational model.  For this example, the instruction is "Summarize in less than 50 words."
- You can choose different instructions, for example, "Write a reply to the following question or comment." Note that you would also need to collect your preference dataset with the same instruction added to the prompt, so that both the responses and the human preferences are based on that instruction.

In [10]:
# Completed values for the dictionary
parameter_values={
        "preference_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text_small/summarize_from_feedback_tfds/comparisons/train/*.jsonl",
        "prompt_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/train/*.jsonl",
        "eval_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/val/*.jsonl",
        "large_model_reference": "llama-2-7b",
        "reward_model_train_steps": 1410, # result from the calculation above
        "reinforcement_learning_train_steps": 320, # result from the calculation above
        "reward_model_learning_rate_multiplier": 1.0,
        "reinforcement_learning_rate_multiplier": 1.0,
        "kl_coeff": 0.1, # pipeline default; a larger value penalizes drift from the base model more strongly, which helps limit reward hacking
        "instruction":\
    "Summarize in less than 50 words"}

### Train with the full dataset: the 'parameter_values' dictionary

- To train on the full dataset, adjust the settings as shown below. These values were chosen after several training experiments with the pipeline and gave the best results in the evaluation (next lesson).

```python
parameter_values={
        "preference_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text/summarize_from_feedback_tfds/comparisons/train/*.jsonl",
        "prompt_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text/reddit_tfds/train/*.jsonl",
        "eval_dataset": \
    "gs://vertex-ai/generative-ai/rlhf/text/reddit_tfds/val/*.jsonl",
        "large_model_reference": "llama-2-7b",
        "reward_model_train_steps": 10000,
        "reinforcement_learning_train_steps": 10000, 
        "reward_model_learning_rate_multiplier": 1.0,
        "reinforcement_learning_rate_multiplier": 0.2,
        "kl_coeff": 0.1,
        "instruction":\
    "Summarize in less than 50 words"}
```
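
A misspelled key in `parameter_values` is easy to miss until the job is submitted. One inexpensive safeguard is to compare the dictionary keys against the input names shown in the YAML header earlier. The sketch below does that with a hypothetical helper, `unknown_keys`. Because the header printed above is truncated, `KNOWN_INPUTS` is incomplete, so a flagged key is a prompt to double-check rather than proof of an error.

```python
# Input names copied from the (truncated) YAML header printed earlier.
# The real pipeline defines more inputs than are visible there.
KNOWN_INPUTS = {
    "accelerator_type", "deploy_model", "encryption_spec_key_name",
    "eval_dataset", "instruction", "kl_coeff", "large_model_reference",
    "location", "model_display_name", "preference_dataset", "project",
    "prompt_dataset", "prompt_sequence_length",
    "reinforcement_learning_rate_multiplier",
    "reinforcement_learning_train_steps",
    "reward_model_learning_rate_multiplier", "reward_model_train_steps",
}


def unknown_keys(parameter_values, known_inputs=KNOWN_INPUTS):
    """Return the keys that do not match any known pipeline input name."""
    return sorted(key for key in parameter_values if key not in known_inputs)


# "kl_coef" is a deliberate typo to show what the check reports.
example = {"kl_coef": 0.1, "instruction": "Summarize in less than 50 words"}
print(unknown_keys(example))
```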

### Set up Google Cloud to run the Vertex AI pipeline

The Vertex AI SDK is already installed in this classroom environment. If you were running this in your own project, you would install it like this:
```Python
!pip3 install google-cloud-aiplatform
```

In [11]:
credentials, PROJECT_ID, STAGING_BUCKET = authenticate()

REGION = "europe-west4" # The RLHF pipeline is available in this region

In [12]:
credentials

'DLAI_CREDENTIALS'

In [13]:
PROJECT_ID

'DLAI_PROJECT'

In [14]:
STAGING_BUCKET

'gs://gcp-sc2-rlhf'

## Run the pipeline job on Vertex AI

Now that the dictionary of parameter values is complete, we can create a PipelineJob. The RLHF pipeline then executes on Vertex AI, that is, on Google Cloud servers rather than locally in this notebook.

In [15]:
aiplatform.init(project = PROJECT_ID,
                location = REGION,
                credentials = credentials)

In [16]:
aiplatform

<module 'google.cloud.aiplatform' from 'C:\\Users\\MD. REZUWAN HASAN\\anaconda3\\envs\\rlhf\\Lib\\site-packages\\google\\cloud\\aiplatform\\__init__.py'>

In [17]:
# Look at the path for the YAML file
RLHF_PIPELINE_PKG_PATH

'C:/Users/MD. REZUWAN HASAN/Desktop/Jupyter Notebooks/rlhf/utilities/rlhf_pipeline.yaml'

### Create and run the pipeline job
- Here is how you would create the pipeline job and run it if you were working on your own project.
- This job takes about a full day to run with multiple accelerators (TPUs/GPUs), and so we're not going to run it in this classroom.

- To create the pipeline job:

```Python
job = aiplatform.PipelineJob(
    display_name="tutorial-rlhf-tuning",
    pipeline_root=STAGING_BUCKET,
    template_path=RLHF_PIPELINE_PKG_PATH,
    parameter_values=parameter_values)
```
- To run the pipeline job:

```Python
job.run()
```

- The content team has run this RLHF training pipeline to tune the Llama-2 model, and in the next lesson, you'll get to evaluate the log data to compare the performance of the tuned model with the original foundational model.

In [18]:
job = aiplatform.PipelineJob(
    display_name="tutorial-rlhf-tuning",
    pipeline_root=STAGING_BUCKET,
    template_path=RLHF_PIPELINE_PKG_PATH,
    parameter_values=parameter_values)

In [19]:
job.run()

Error when trying to get or create a GCS bucket for the pipeline output artifacts
Traceback (most recent call last):
  File "C:\Users\MD. REZUWAN HASAN\anaconda3\envs\rlhf\Lib\site-packages\google\cloud\aiplatform\pipeline_jobs.py", line 464, in submit
    gcs_utils.create_gcs_bucket_for_pipeline_artifacts_if_it_does_not_exist(
  File "C:\Users\MD. REZUWAN HASAN\anaconda3\envs\rlhf\Lib\site-packages\google\cloud\aiplatform\utils\gcs_utils.py", line 232, in create_gcs_bucket_for_pipeline_artifacts_if_it_does_not_exist
    storage_client = storage.Client(
                     ^^^^^^^^^^^^^^^
  File "C:\Users\MD. REZUWAN HASAN\anaconda3\envs\rlhf\Lib\site-packages\google\cloud\storage\client.py", line 227, in __init__
    super(Client, self).__init__(
  File "C:\Users\MD. REZUWAN HASAN\anaconda3\envs\rlhf\Lib\site-packages\google\cloud\client\__init__.py", line 321, in __init__
    Client.__init__(
  File "C:\Users\MD. REZUWAN HASAN\anaconda3\envs\rlhf\Lib\site-packages\google\cloud\clien

ServiceUnavailable: 503 Getting metadata from plugin failed with error: 'str' object has no attribute 'before_request'
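
The final message, `'str' object has no attribute 'before_request'`, is consistent with `credentials` being the placeholder string `'DLAI_CREDENTIALS'` shown in cell 12 rather than a real credentials object. The google-auth library calls `before_request()` on the credentials it is given, and a plain string has no such method. A quick duck-type check before calling `aiplatform.init` can surface this earlier. The helper name and the `FakeCredentials` class below are hypothetical and exist only to demonstrate the check.

```python
def looks_like_google_credentials(obj):
    """Duck-type check: google-auth credentials expose a callable before_request()."""
    return callable(getattr(obj, "before_request", None))


class FakeCredentials:
    """Hypothetical stand-in used only to demonstrate the check."""

    def before_request(self, request, method, url, headers):
        pass


print(looks_like_google_credentials("DLAI_CREDENTIALS"))
print(looks_like_google_credentials(FakeCredentials()))
```

In your own project, real credentials could come from, for example, `google.auth.default()`, which returns a `(credentials, project_id)` tuple based on the environment's application default credentials.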