# Running TRL methods

The toolkit implements some of the [TRL](https://github.com/huggingface/trl) methods via a `StructuralControl` wrapper. This guide shows how to run several fine-tuning and preference-optimization methods:

- SFT (supervised fine-tuning)
- DPO (direct preference optimization)
- APO (anchored preference optimization)
- SPPO (self-play preference optimization)

Note that while [SPPO](https://github.com/uclaml/SPPO) is not part of TRL, it follows many of the same abstractions, so we include it in our TRL wrapper.

As shown below, each of these methods follows the same high-level pattern in our toolkit:

- Create a control (e.g., `SFT`, `DPO`, `APO`, `SPPO`) with training arguments and a dataset.
- Wrap the control in a `SteeringPipeline`.
- Call the `steer()` method to run the training logic.
- Run inference (and optionally merge any adapters into the base model for export).

## Setup

If you are running this notebook in Google Colab, uncomment the following cell to install the toolkit. You can skip this cell if you are running the notebook in a virtual environment where the package is already installed.

In [None]:
# !git clone https://github.com/IBM/AISteer360.git
# %cd AISteer360

The following authentication steps may be necessary to access any gated models (after being granted access by Hugging Face). Uncomment the following if you need to log in to the Hugging Face Hub:

In [None]:
# !pip install python-dotenv
# !pip install ipywidgets
# from dotenv import load_dotenv
# import os

# load_dotenv()
# token = os.getenv("HUGGINGFACE_TOKEN")
# from huggingface_hub import login
# login(token=token)

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
import torch
import warnings

warnings.filterwarnings('ignore', category=UserWarning)

## SFT with LoRA

We first run SFT on a prompt/answer-style corpus.

The following example runs SFT, implemented using our wrapper around TRL's `SFTTrainer` class. First, import the `SteeringPipeline` class and the `SFT` control class.

In [None]:
from aisteer360.algorithms.core.steering_pipeline import SteeringPipeline
from aisteer360.algorithms.structural_control.wrappers.trl.sfttrainer.control import SFT

The example shows supervised fine-tuning of a small model on a 500-record sample of a Hugging Face preference dataset. We load the tokenizer and preprocess the dataset to convert it to a standard format for SFT.

In [None]:
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

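def preprocess(example):
    # 'chosen' is a list of chat messages; the last one holds the assistant's answer
    answer = example['chosen'][-1]['content']
    text = f"Question: {example['prompt']}\n\nAnswer: {answer}"
    # pad/truncate every example to a fixed length of 1024 tokens
    tok_data = tokenizer(text, truncation=True, padding='max_length', max_length=1024, return_tensors="pt")
    return {
        'input_ids': tok_data['input_ids'][0],
        'attention_mask': tok_data['attention_mask'][0]
    }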

dataset = load_dataset(
    'HuggingFaceH4/ultrafeedback_binarized',
    split='train_prefs',
)

subset_size = 500
dataset = dataset.select(list(range(subset_size)))
train_dataset = dataset.map(preprocess, remove_columns=dataset.column_names)
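The text template used in preprocessing matters again at inference time, because the model is prompted with the same `Question: ... Answer:` layout it was trained on. The following is a minimal sketch of keeping both forms in one place; the helper names `build_inference_prompt` and `build_training_text` are hypothetical and not part of the toolkit.

```python
# Hypothetical helpers (not part of the toolkit): define the prompt layout once
# so that the training text and the inference prompt cannot drift apart.
PROMPT_PREFIX = "Question: {prompt}\n\nAnswer:"


def build_inference_prompt(prompt: str) -> str:
    """Text fed to the model at generation time; it stops right after 'Answer:'."""
    return PROMPT_PREFIX.format(prompt=prompt)


def build_training_text(prompt: str, answer: str) -> str:
    """Full SFT text: the inference prompt followed by the reference answer."""
    return f"{build_inference_prompt(prompt)} {answer}"


print(build_training_text("What is 2 + 2?", "4"))
```

Because the training text starts with exactly the inference prompt, the fine-tuned model sees a familiar prefix when it is asked to generate an answer.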

Next, we instantiate the SFT control with the `train_dataset` and an `output_dir` for saving the steered model. We set `use_peft` to `True` (the default is `False`) and set `peft_type` to enable LoRA. Finally, we override some of the default training arguments. The SFT control is based on TRL's `SFTConfig` class and inherits its default training arguments, some of which can be overridden, as shown below. Please refer to `aisteer360.algorithms.structural_control.wrappers.trl.args.py` and `aisteer360.algorithms.structural_control.wrappers.trl.sfttrainer.args.py` for the list of these parameters and their default values. The parameters used for LoRA training are similarly based on the `LoraConfig` class, and their default values can be overridden in the same way.

In [None]:
from peft import PeftType

# control
sft = SFT(
    train_dataset=train_dataset,
    use_peft=True,
    peft_type=PeftType.LORA,
    **{
        "per_device_train_batch_size": 4,
        "num_train_epochs": 3,
        "learning_rate": 2e-5,
        "output_dir": "./tmp/Qwen2.5-0.5B-SFT-LoRA-Steer",
        "logging_steps": 100,
        "save_strategy": "no",
        "lora_alpha": 16,
    },
)
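To get a feel for what these training arguments imply, the sketch below works out the number of optimizer steps and logging events for this configuration. It assumes a single device and no gradient accumulation; `training_schedule` is a hypothetical helper written for this illustration, not a toolkit or TRL function.

```python
import math


def training_schedule(num_records, per_device_batch_size, num_epochs, logging_steps):
    """Rough step arithmetic, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_records / per_device_batch_size)
    total_steps = steps_per_epoch * num_epochs
    # One log entry is emitted every `logging_steps` optimizer steps.
    log_events = total_steps // logging_steps
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "log_events": log_events,
    }


# 500 records, batch size 4, 3 epochs, logging every 100 steps
print(training_schedule(500, 4, 3, 100))
```

Under these assumptions the run is short (a few hundred steps), which is why `logging_steps` of 100 produces only a handful of log lines.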


We then create the `SteeringPipeline`, passing it the `model_name_or_path` and the `sft` control.

In [None]:
sft_pipeline = SteeringPipeline(
    model_name_or_path=MODEL_NAME,
    controls=[sft],
    device_map="auto",
    hf_model_kwargs={"dtype": torch.bfloat16 if torch.cuda.is_available() else torch.float32},
)

We now run the training process by calling the `steer()` method of the SFT pipeline.

In [None]:
sft_pipeline.steer()

In [None]:
dataset_test = load_dataset(
    'HuggingFaceH4/ultrafeedback_binarized',
    split='test_prefs',
)

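# Prompt the steered model with a held-out test record, using the training template
prompt_text = f"Question: {dataset_test[0]['prompt']}\n\nAnswer:"
enc = tokenizer(prompt_text, return_tensors="pt", padding=True).to(sft_pipeline.model.device)
print(f"Question: {dataset_test[0]['prompt']}")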

steered_response = sft_pipeline.generate_text(
    input_ids=enc["input_ids"],
    attention_mask=enc["attention_mask"],
    max_new_tokens=20
)
print("output (SFT):")
print(steered_response)

In [None]:
# Releasing memory resources
import gc
del sft_pipeline.model, sft_pipeline
gc.collect()
torch.cuda.empty_cache()

Next, we load the saved LoRA adapter, merge it into the base model, and save the merged model so that it can serve as the starting point for the DPO and APO examples below.

In [None]:
from transformers import AutoModelForCausalLM
from peft import PeftModel, PeftConfig

# Must match the `output_dir` passed to the SFT control above
lora_adapter_path = "./tmp/Qwen2.5-0.5B-SFT-LoRA-Steer"
merged_model_path = "./tmp/Qwen2.5-0.5B-SFT-LoRA-Steer-Merged"

print('# Load PEFT config')
config = PeftConfig.from_pretrained(lora_adapter_path)

print('# Load base model')
base_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(lora_adapter_path)

print('# Get PeftModel')
peft_model = PeftModel.from_pretrained(base_model, lora_adapter_path, adapter_name='abcd')
peft_model.set_adapter('abcd')  # set adapter as active

print("# Merge adapter into model")
merged_model = peft_model.merge_and_unload()

# Save the merged weights and the tokenizer side by side
merged_model.save_pretrained(merged_model_path)
tokenizer.save_pretrained(merged_model_path)
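Conceptually, merging a LoRA adapter folds the low-rank update back into each adapted weight matrix. The sketch below shows the standard LoRA formula `W' = W + (lora_alpha / r) * B @ A` on tiny hand-written matrices in pure Python. The function names and the toy values are made up for illustration, and this is not how PEFT implements `merge_and_unload()` internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A) for one adapted layer."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example with rank r = 1 (hypothetical numbers)
W = [[1.0, 0.0], [0.0, 1.0]]   # base weight, 2 x 2
A = [[1.0, 2.0]]               # LoRA A, r x in  = 1 x 2
B = [[0.5], [0.0]]             # LoRA B, out x r = 2 x 1
merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be saved and reloaded like an ordinary checkpoint.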

In [None]:
del base_model, tokenizer, peft_model, merged_model
gc.collect()
torch.cuda.empty_cache()

## DPO

Next, we further steer the merged SFT LoRA model above using DPO.

In the example below, we use a preference dataset that is already in the conversational format that DPO expects, so no preprocessing is needed.

In [None]:
model_name = "./tmp/Qwen2.5-0.5B-SFT-LoRA-Steer-Merged"  # path of the merged model saved above
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'
tokenizer.truncation_side = 'left'
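The tokenizer is configured to pad on the left. For decoder-only generation this keeps the last real prompt token directly next to the position where new tokens are produced. The sketch below illustrates the idea with plain Python lists; `pad_batch` is a hypothetical helper, not a `transformers` function.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, ones = [pad_id] * n_pad, [1] * len(seq)
        if side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append([0] * n_pad + ones)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(ones + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0, side="left")
print(ids)
print(mask)
```

With left padding, every row ends in a real token, so generation continues from actual prompt content rather than from padding.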

In [None]:
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
subset_size = 500
dataset = dataset.select(list(range(subset_size)))

In [None]:
example = dataset[0]
example
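For orientation, a conversational preference record pairs a chosen and a rejected conversation, each given as a list of role/content messages. The sketch below uses a made-up record with that shape (the contents are hypothetical, and any extra columns in the real dataset are omitted) and shows one way to pull out the last user turn.

```python
# Hypothetical record shaped like a conversational preference example
record = {
    "chosen": [
        {"role": "user", "content": "Name a primary color."},
        {"role": "assistant", "content": "Red is a primary color."},
    ],
    "rejected": [
        {"role": "user", "content": "Name a primary color."},
        {"role": "assistant", "content": "Green beans."},
    ],
}


def last_user_message(messages):
    """Return the content of the most recent user turn in a conversation."""
    for message in reversed(messages):
        if message["role"] == "user":
            return message["content"]
    raise ValueError("conversation contains no user message")


print(last_user_message(record["chosen"]))
```

Searching by role is slightly more defensive than indexing a fixed position such as `[-2]`, since it does not depend on the exact number of turns.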

To use DPO, we import the corresponding DPO control.

In [None]:
from aisteer360.algorithms.structural_control.wrappers.trl.dpotrainer.control import DPO

DPO steering is run the same way as SFT above. We create the DPO control and then build the steering pipeline, passing it the model name and the `dpo` control.

In [None]:
# control
dpo = DPO(
    train_dataset=dataset,
    **{
        "per_device_train_batch_size": 4,
        "num_train_epochs": 3,
        "learning_rate": 2e-5,
        "output_dir": "Qwen2.5-0.5B-DPO-Steer",
        "logging_steps": 100,
        "save_strategy": "no",
    },
)

In [None]:
# steering pipeline
dpo_pipeline = SteeringPipeline(
    model_name_or_path=model_name,
    controls=[dpo],
    device_map="auto" if torch.cuda.is_available() else "cpu",  
    hf_model_kwargs={"dtype": torch.bfloat16 if torch.cuda.is_available() else torch.float32},
)

In [None]:
dpo_pipeline.steer()
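The training run above optimizes the DPO objective. As a rough illustration, the sketch below computes the standard sigmoid DPO loss for a single preference pair from sequence log-probabilities. The function name and the numbers are hypothetical, and `beta=0.1` is used here only for illustration; the actual training uses TRL's batched implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: margin is 0, so the loss equals log(2)
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy prefers the chosen answer more than the reference does: lower loss
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss only depends on how much more the policy favors the chosen response over the rejected one, relative to the reference model.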

In [None]:
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="test")
question = 'QUESTION' + dataset[0]['chosen'][-2]['content'].rsplit('QUESTION', 1)[-1]
print(question)
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": question}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
enc = tokenizer(text, return_tensors="pt", padding=True, padding_side="left").to(dpo_pipeline.model.device)
steered_response = dpo_pipeline.generate_text(
    input_ids=enc["input_ids"],
    attention_mask=enc["attention_mask"],
    max_new_tokens=100,
    do_sample=True
)
print("output (DPO):")
print(steered_response)


In [None]:
# Releasing memory resources
import gc
del dpo_pipeline.model, dpo_pipeline
gc.collect()
torch.cuda.empty_cache()

## APO

We now demonstrate how to run APO, starting from the same merged SFT LoRA model. APO is run in the same manner as DPO above.

In [None]:
model_name = "./tmp/Qwen2.5-0.5B-SFT-LoRA-Steer-Merged"  # path of the merged model saved above
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'
tokenizer.truncation_side = 'left'

In [None]:
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
subset_size = 500
dataset = dataset.select(list(range(subset_size)))

We import the APO control and pass it to the `SteeringPipeline`, just as we did for DPO.

In [None]:
from aisteer360.algorithms.structural_control.wrappers.trl.apotrainer.control import APO

In [None]:
# control
apo = APO(
    train_dataset=dataset,
    **{
        "per_device_train_batch_size": 4,
        "num_train_epochs": 3,
        "learning_rate": 2e-5,
        "output_dir": "Qwen2.5-0.5B-APO-Steer",
        "logging_steps": 100,
        "save_strategy": "no",
    },
)

In [None]:
# steering pipeline
apo_pipeline = SteeringPipeline(
    model_name_or_path=model_name,
    controls=[apo],
    device_map="auto" if torch.cuda.is_available() else "cpu",  
    hf_model_kwargs={"dtype": torch.bfloat16 if torch.cuda.is_available() else torch.float32},
)

In [None]:
apo_pipeline.steer()
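To illustrate what "anchored" means, the sketch below computes the APO-zero variant of the loss for a single pair, written as in the APO paper up to an additive constant. Whether the toolkit's `APO` control uses this variant is an assumption; the function names, `beta=0.1`, and the inputs are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def apo_zero_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """APO-zero loss for one pair, where each log-ratio is log(policy / reference).

    The first term rewards raising the chosen response above the reference,
    the second rewards pushing the rejected response below the reference.
    """
    loss_chosen = 1.0 - sigmoid(beta * chosen_logratio)
    loss_rejected = sigmoid(beta * rejected_logratio)
    return loss_chosen + loss_rejected


# Policy identical to the reference
print(apo_zero_loss(0.0, 0.0))
# Two cases with the same chosen-minus-rejected margin of 5
print(apo_zero_loss(10.0, 5.0))
print(apo_zero_loss(5.0, 0.0))
```

Unlike the DPO loss, which depends only on the difference between the two log-ratios, each APO term depends on its own log-ratio relative to the reference model, so the last two calls need not give the same value.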

In [None]:
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="test")
question = 'QUESTION' + dataset[0]['chosen'][-2]['content'].rsplit('QUESTION', 1)[-1]
print(question)
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": question}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
enc = tokenizer(text, return_tensors="pt", padding=True, padding_side="left").to(apo_pipeline.model.device)
steered_response = apo_pipeline.generate_text(
    input_ids=enc["input_ids"],
    attention_mask=enc["attention_mask"],
    max_new_tokens=100,
    do_sample=True
)
print("output (APO):")
print(steered_response)

In [None]:
# Releasing memory resources
del apo_pipeline.model, apo_pipeline
gc.collect()
torch.cuda.empty_cache()

## SPPO

To run SPPO, extra classes need to be imported, and multiple iterations of steering can be performed. The example below is based on the [SPPO paper](https://arxiv.org/abs/2405.00675), and the iteration code is based on scripts from the [SPPO GitHub repository](https://github.com/uclaml/SPPO/tree/main).

The example shows 3 iterations of SPPO applied to a Mistral model using a Hugging Face prompt dataset.

In [None]:
!pip install llm-blender

In [None]:
from aisteer360.algorithms.structural_control.wrappers.trl.sppotrainer.control import SPPO

from aisteer360.algorithms.structural_control.wrappers.trl.sppotrainer.utils import (
    set_seed,
    apply_template,
    ranking,
    from_ranks,
    prepare_score,
    apply_chat_template,
    process_dataset,
    prepare_dataset_from_prompts
)
    

In [None]:
def run_SPPO(to_be_steered_model_path_or_name, data, sppo_temp_dir, start_iter_num=1, end_iter_num=1, maxlen=2048,
             num_prompts=5, additional_train_datasets=None):
    """Build the SPPO control and pipeline, run steering, and return the pipeline."""
    # output directory for the steered model
    checkpoints_path = f"{sppo_temp_dir}/checkpoints/SPPO-FINAL"

    # control
    sppo = SPPO(
        train_dataset=data,
        eval_dataset=None,
        **{
            "per_device_train_batch_size": 4,
            "num_train_epochs": 1,
            "learning_rate": 5.0e-7,
            "output_dir": checkpoints_path,
            "save_strategy": "no",
            "beta": 0.001,
            "optim": "rmsprop",
            "loss_type": "sppo",
            "max_prompt_length": 128,
            "max_length": 512,
        },
    )

    # steering pipeline
    sppo_pipeline = SteeringPipeline(
        model_name_or_path=to_be_steered_model_path_or_name,
        controls=[sppo],
        device_map="auto" if torch.cuda.is_available() else "cpu",
        hf_model_kwargs={"dtype": torch.bfloat16 if torch.cuda.is_available() else torch.float32},
    )

    sppo_pipeline.steer(
        num_prompts=num_prompts,
        start_iter_num=start_iter_num,
        end_iter_num=end_iter_num,
        additional_train_datasets=additional_train_datasets,
        sppo_temp_dir=sppo_temp_dir,
        maxlen=maxlen,
    )

    return sppo_pipeline

In [None]:
# Based on https://github.com/uclaml/SPPO/blob/main/run_sppo_mistral.sh

warnings.filterwarnings('ignore', category=RuntimeWarning)

start_iter_num = 1
end_iter_num = 3
num_prompts = 2  # number of responses to generate for each prompt (default is 5)

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # alternative: "Qwen/Qwen2.5-0.5B-Instruct"

# prompt datasets to be used, one per iteration
prompt_datasets = [
    "UCLA-AGI/data-mistral-7b-instruct-sppo-iter1",
    "UCLA-AGI/data-mistral-7b-instruct-sppo-iter2",
    "UCLA-AGI/data-mistral-7b-instruct-sppo-iter3",
]

m_name = BASE_MODEL.split("/")[-1]
sppo_temp_dir = m_name + "_SPPO"

# We use just 10 records of each dataset for the demonstration
subset_size = 10

# dataset for the first iteration of this run
dataset = load_dataset(prompt_datasets[start_iter_num - 1], split="train")
data = dataset.select(list(range(subset_size)))
del dataset

# datasets for the remaining iterations
additional_train_datasets = []
for dset in range(start_iter_num, end_iter_num):
    dataset = load_dataset(prompt_datasets[dset], split="train")
    addl_data = dataset.select(list(range(subset_size)))
    additional_train_datasets.append(addl_data)
    del dataset

# start from the base model, or resume from the checkpoint of the previous iteration
if start_iter_num == 1:
    to_be_steered_model_path_or_name = BASE_MODEL
else:
    to_be_steered_model_path_or_name = f"{sppo_temp_dir}/checkpoints/SPPO-Iter{start_iter_num-1}"

sppo_pipeline = run_SPPO(
    to_be_steered_model_path_or_name,
    data=data,
    sppo_temp_dir=sppo_temp_dir,
    start_iter_num=start_iter_num,
    end_iter_num=end_iter_num,
    additional_train_datasets=additional_train_datasets,
    num_prompts=num_prompts,
)
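The index arithmetic in the cell above can be easy to misread, so here is a small sketch that isolates it: which prompt dataset feeds the first iteration of a run, which datasets are passed as additional ones, and which model the run starts from. The helper names and the placeholder dataset names are hypothetical; the logic mirrors the cell above.

```python
def plan_sppo_datasets(prompt_datasets, start_iter_num, end_iter_num):
    """Split the per-iteration datasets into (first, additional) for one run."""
    if not 1 <= start_iter_num <= end_iter_num <= len(prompt_datasets):
        raise ValueError("need 1 <= start_iter_num <= end_iter_num <= number of datasets")
    first = prompt_datasets[start_iter_num - 1]                # iterations are 1-indexed
    additional = prompt_datasets[start_iter_num:end_iter_num]  # remaining iterations
    return first, additional


def starting_model(base_model, sppo_temp_dir, start_iter_num):
    """Base model for a fresh run, otherwise the previous iteration's checkpoint."""
    if start_iter_num == 1:
        return base_model
    return f"{sppo_temp_dir}/checkpoints/SPPO-Iter{start_iter_num - 1}"


names = ["iter1-data", "iter2-data", "iter3-data"]  # placeholder dataset names
print(plan_sppo_datasets(names, 1, 3))
print(plan_sppo_datasets(names, 2, 3))
print(starting_model("org/base-model", "base-model_SPPO", 2))
```

Resuming a run at iteration 2, for example, would start from the iteration-1 checkpoint and use the second and third datasets.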


In [None]:
dataset = load_dataset("UCLA-AGI/data-mistral-7b-instruct-sppo-iter1", split="train")

subset_size = 10

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token = tokenizer.eos_token
# The first `subset_size` records of this dataset were used for training,
# so the record at index `subset_size` was not part of that subset
prompt = apply_template(dataset[subset_size]["prompt"], tokenizer)
print(prompt)

enc = tokenizer(prompt, return_tensors="pt").to(sppo_pipeline.model.device)

steered_response = sppo_pipeline.generate_text(
    input_ids=enc["input_ids"],
    attention_mask=enc["attention_mask"],
    max_new_tokens=100,
    do_sample=True
)
print("output (SPPO):")
print(steered_response)