# Running TRL methods

The toolkit implements some of the [TRL](https://github.com/huggingface/trl) methods via a `StructuralControl` wrapper. This guide shows how to run several methods:

- SFT (supervised fine-tuning)
- DPO (direct preference optimization)
- APO (anchored preference optimization).
- SPPO (self-play preference optimization)

Note that while [SPPO](https://github.com/uclaml/SPPO) is not a part of TRL, it follows many of the similar abstractions so we include it as part of our TRL wrapper.

## Setup

If running this from a Google Colab notebook, please uncomment the following cell to install the toolkit. The following block is not necessary if running this notebook from a virtual environment where the package has already been installed.

In [None]:
# !git clone https://github.com/IBM/AISteer360.git
# %cd AISteer360

The following authentication steps may be necessary to access any gated models (after being granted access by Hugging Face). Uncomment the following if you need to log in to the Hugging Face Hub:

In [None]:
# !pip install python-dotenv
# !pip install ipywidgets
# from dotenv import load_dotenv
# import os

# load_dotenv()
# token = os.getenv("HUGGINGFACE_TOKEN")
# from huggingface_hub import login
# login(token=token)

Next, we import the `SteeringPipeline` class (used throughout) and specify the base model, in this case a small Qwen model.

In [None]:
import torch
from datasets import load_dataset
from peft import PeftType
from transformers import AutoTokenizer

from aisteer360.algorithms.core.steering_pipeline import SteeringPipeline


MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct" 

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

  from .autonotebook import tqdm as notebook_tqdm


Using device: cuda


## Data Preparation

The controls throughout this notebook are trained using a common dataset, `ultrafeedback_binarized`, since it contains preference data for each prompt (which is necessary for DPO-based controls). We load each of the splits below.

In [4]:
raw_train = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
raw_test  = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="test_prefs")
len(raw_train), raw_train[0].keys()

(61135,
 dict_keys(['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected']))

Different trainers expect different data formats (i.e., tensor layouts) and thus we define two helper functions, one for SFT and one for DPO, to process the data in a way that is amenable to each.

In [5]:
def sft_preprocess(example, tokenizer, max_length=1024):
    text = f"Question: {example['prompt']}\n\nAnswer: {example['chosen']}"
    encoding = tokenizer(text, truncation=True, padding="max_length", max_length=max_length)
    labels = [
        token_id if mask == 1 else -100. # label pads as -100 so they don't contribute to loss
        for token_id, mask in zip(encoding["input_ids"], encoding["attention_mask"])
    ]
    encoding["labels"] = labels
    return encoding

def dpo_filter(example, max_prompt_chars=4000):
    prompt = example["prompt"]
    if len(prompt) > max_prompt_chars:
        prompt = prompt[:max_prompt_chars]
    return {"prompt": prompt, "chosen": example["chosen"], "rejected": example["rejected"]}


subset_size = 500

sft_train = raw_train.select(range(subset_size)).map(
    lambda example: sft_preprocess(example, tokenizer, max_length=1024),
    remove_columns=raw_train.column_names
)

dpo_train = raw_train.select(range(subset_size)).map(dpo_filter, remove_columns=[])
dpo_train[0].keys()


dict_keys(['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'])

## SFT control

We now show how to fine-tune with SFT using LoRA. We also merge the trained adapter back into the model (using the argument `merge_lora_after_train`). Note the argument `use_peft=True` to indicate that we are not running a full fine-tune (the example near the end of this notebook will illustrate a full fine-tuning run). 

In [None]:
from aisteer360.algorithms.structural_control.wrappers.trl.sfttrainer.control import SFT


sft = SFT(
    # control loads model/tokenizer
    base_model_name_or_path=MODEL_NAME,
    tokenizer_name_or_path=MODEL_NAME,

    # data
    train_dataset=sft_train,
    eval_dataset=None, 
    # data_collator=None  # optional; if omitted and you provided labels, you're fine

    # TRL / Trainer config (forwarded into SFTConfig)
    output_dir="./tmp/sft_lora",
    max_seq_length=1024,
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-5,
    logging_steps=50,
    report_to="none",
    seed=42,

    # PEFT (LoRA)
    use_peft=True,
    peft_type=PeftType.LORA,
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    adapter_name="sft",

    # optionally merge LoRA into base weights after training
    merge_lora_after_train=True,
    merged_output_dir="./tmp/sft_lora_merged",
)


We create a steering pipeline using the above control, with `lazy_init=True` since the structural control (`sft`) returns a model. The pipeline is then steered which invokes the training procedure.

In [7]:
sft_pipeline = SteeringPipeline(
    lazy_init=True,
    controls=[sft],
)
sft_pipeline.steer()


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


Step,Training Loss
50,2.5001
100,1.0964


The above SFT-trained pipeline is now ready for inference.

In [8]:
prompt_text = "Question: What makes the sky look blue?\n\nAnswer:"
encoded = tokenizer(prompt_text, return_tensors="pt")
text = sft_pipeline.generate_text(
    input_ids=encoded["input_ids"],
    attention_mask=encoded["attention_mask"],
    max_new_tokens=64
)
print(text)

[' The sky looks blue because of the scattering of light by tiny dust particles in the atmosphere. These particles are small and light, so they scatter the light that hits them, causing it to bend around them and spread out into a colorless, milky cloud-like appearance known as the "blue" part of the sky.']


## DPO control

DPO is instantiated in a similar fashion with the primary differences being that the training data is now triples (`prompt`, `chosen`, `rejected`), the trainer must keep a reference policy alongside the trainable policy, and the loss is a pair-wise KL-reg. contrastive objective rather than the token-level cross entropy loss in SFT. 

Note: By default, the trainer clones the base weights and freezes them. When LoRA is enabled, the wrapper automatically passes `ref_model=None`, letting TRL re-create a frozen reference that shares the same LoRA adapters. If you are full fine-tuning you can still supply your own `ref_model` via `pipeline.steer(ref_model=my_frozen_model)`.

In [None]:
from aisteer360.algorithms.structural_control.wrappers.trl.dpotrainer.control import DPO


dpo = DPO(
    base_model_name_or_path=MODEL_NAME,
    tokenizer_name_or_path=MODEL_NAME,

    train_dataset=dpo_train,

    # DPO / TRL config (forwarded into DPOConfig)
    output_dir="./tmp/dpo_lora",
    per_device_train_batch_size=2,  # often smaller than SFT
    num_train_epochs=1,
    learning_rate=1e-6,
    beta=0.1,
    loss_type="sigmoid",  # baseline DPO loss
    max_prompt_length=512,
    max_length=1024,
    precompute_ref_log_probs=True,  # forwarded if supported by your TRL version
    disable_dropout=True,
    logging_steps=50,
    report_to="none",
    seed=123,

    # LoRA
    use_peft=True,
    peft_type=PeftType.LORA,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    adapter_name="dpo",

    merge_lora_after_train=False,
)

As before, we create the pipeline using the control, steer the pipeline, and run inference on the steered pipeline.

In [None]:
dpo_pipeline = SteeringPipeline(
    lazy_init=True, 
    controls=[dpo]
)
dpo_pipeline.steer()


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
Train dataset reference log probs:  55%|███████████████████████████████████████████████████████████████████████████                                                              | 137/250 [01:11<00:58,  1.95it/s]

In [11]:
prompt_text = "Question: Is it ever helpful to be blunt with feedback?\n\nAnswer:"
encoded = tokenizer(prompt_text, return_tensors="pt")
print(dpo_pipeline.generate_text(
    input_ids=encoded["input_ids"],
    attention_mask=encoded["attention_mask"],
    max_new_tokens=150,
))

[' Yes, it is always helpful to be blunt with feedback. Blunt feedback can help you identify areas of improvement and provide a clear path for change. It also helps to build trust between the person being evaluated and the person giving the feedback.\n\nFor example, if someone gives you feedback that says "You need to improve your writing skills," you could respond by saying "I agree, but I think we should focus on improving our research methods instead." This response provides constructive criticism without sounding accusatory or dismissive.\n\nBlunt feedback can also help to motivate people to take action towards their goals. If someone gives you feedback that says "You need to work harder on this project," you could say "Thank you for your input, but I think we can']


## APO control

APO lives in the same trainer family as DPO and uses the same `DPOTrainer` class (it is activated by simply choosing a different `loss_type`). In contrast to DPO that pushes the policy away from the reference (by a relative KL-scaled margin), APO pushes the policy toward a fixed "anchor" score. Generally, APO keeps the policy closer to the reference for the same beta, reducing the risk of over-optimization.

In [None]:
from aisteer360.algorithms.structural_control.wrappers.trl.apotrainer.control import APO


apo = APO(
    base_model_name_or_path=MODEL_NAME,
    tokenizer_name_or_path=MODEL_NAME,

    train_dataset=dpo_train,

    output_dir="./tmp/apo_lora",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=1e-6,
    beta=0.1,
    loss_type="apo_zero",     # APO-specific loss
    max_prompt_length=512,
    max_length=1024,
    logging_steps=50,
    report_to="none",
    seed=99,

    use_peft=True,
    peft_type=PeftType.LORA,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    adapter_name="apo",

    merge_lora_after_train=False,
)


Steering and inference proceeds as before.

In [13]:
apo_pipeline = SteeringPipeline(
    lazy_init=True, 
    controls=[apo]
)
apo_pipeline.steer()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
Train dataset reference log probs: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 250/250 [01:54<00:00,  2.18it/s]


Step,Training Loss
50,1.0026
100,0.9976
150,0.9972
200,0.9863
250,1.0045


In [14]:
prompt_text = "Question: Explain why kindness can be strategic.\n\nAnswer:"
encoded = tokenizer(prompt_text, return_tensors="pt")
print(apo_pipeline.generate_text(
    input_ids=encoded["input_ids"],
    attention_mask=encoded["attention_mask"],
    max_new_tokens=64,
))

[' Kindness is a powerful tool that can be used strategically in various situations. It allows us to connect with others, build trust and relationships, and promote positive change. By being kind, we can create a positive impact on the world and help others in need. Additionally, kindness can be used as a way to set an']


## SPPO control

SPPO, or self-play preference optimization, can be thought of as extending the offline DPO setting into an on-policy, self-improving loop. The data starts with only a prompt corpus (no human-written answers required). During trainin the policy generates two candidate answers itself. Next, a preference model (or a heuristic judge) ranks the two self-generated candidates. The chosen-rejected is then fed through the DPO-style loss.

Because the answers were sampled from the current policy, the optimization is on-policy with the model producing new pairs every few steps so it continuously trains on its own mistakes. A reference model is still necessary to stabilize learning.

SPPO is implemented via `SPPOTrainer` and uses the same `DPOTrainerMixin`.

In [15]:
import sys
!{sys.executable} -m ensurepip --upgrade
!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install llm-blender

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Looking in links: /tmp/tmp8q53gdkk
Processing /tmp/tmp8q53gdkk/pip-24.0-py3-none-any.whl
Installing collected packages: pip
Successfully installed pip-24.0


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting pip
  Using cached pip-25.2-py3-none-any.whl.metadata (4.7 kB)
Using cached pip-25.2-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.0
    Uninstalling pip-24.0:
      Successfully uninstalled pip-24.0
Successfully installed pip-25.2


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting llm-blender
  Using cached llm_blender-0.0.2-py3-none-any.whl.metadata (19 kB)
Collecting dataclasses-json (from llm-blender)
  Using cached dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json->llm-blender)
  Using cached marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json->llm-blender)
  Using cached typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json->llm-blender)
  Using cached mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB)
Using cached llm_blender-0.0.2-py3-none-any.whl (92 kB)
Using cached dataclasses_json-0.6.7-py3-none-any.whl (28 kB)
Using cached marshmallow-3.26.1-py3-none-any.whl (50 kB)
Using cached typing_inspect-0.9.0-py3-none-any.whl (8.8 kB)
Using cached mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB)
Installing collected packages: mypy-ext

In [16]:
from aisteer360.algorithms.structural_control.wrappers.trl.sppotrainer.control import SPPO


subset = raw_train.select(range(200)).map(lambda ex: {"prompt": ex["prompt"]}, remove_columns=raw_train.column_names)

sppo = SPPO(
    base_model_name_or_path=MODEL_NAME,
    tokenizer_name_or_path=MODEL_NAME,
    train_dataset=subset,

    # SPPO params
    start_iteration=1,
    end_iteration=1,
    max_input_length=1024,
    num_prompts=2,  #5,
    temp_dir="./tmp/sppo_temp",
    gen_max_new_tokens=32,  #100,
    ranking_batch_size=8,
    limit_num_examples=20,  #50,

    # TRL/DPO-compatible params
    output_dir="./tmp/sppo_final",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=5e-6,
    beta=0.001,
    loss_type="sppo",
    max_prompt_length=512,
    max_length=1024,
    logging_steps=50,
    report_to="none",
    seed=123,

    # LoRA (optional)
    use_peft=True,
    peft_type=PeftType.LORA,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    adapter_name="sppo",
    merge_lora_after_train=False,
)

ImportError: cannot import name 'hb' from 'aisteer360.algorithms.structural_control.wrappers.trl.sppotrainer.utils_debug' (/dccstor/principled_ai/users/erikmiehling/AISteer360/aisteer360/algorithms/structural_control/wrappers/trl/sppotrainer/utils_debug.py)

We can now construct a steering pipeline, steer it (runs one SPPO iteration, saves checkpoint and final model), and run inference on the steered pipeline.

In [None]:
pipeline = SteeringPipeline(
    lazy_init=True, 
    controls=[sppo]
)
pipeline.steer()

KeyboardInterrupt: 

In [None]:
prompt = "Write a short, constructive response to: 'My partner always interrupts me.'"
enc = tokenizer(prompt, return_tensors="pt")
print(pipeline.generate_text(
    input_ids=enc["input_ids"],
    attention_mask=enc["attention_mask"],
    max_new_tokens=64,
    do_sample=False,
))

## Full-parameter SFT

Lastly, to run a full-weight fine-tune set `use_peft=False`, drop the LoRA arguments, and usually shrink the batch size (because every parameter now receives gradients). 

Note: Full fine-tuning can be 10-20 times more memory-intensive than LoRA.

In [None]:
full_sft = SFT(
    base_model_name_or_path=MODEL_NAME,
    tokenizer_name_or_path=MODEL_NAME,
    train_dataset=sft_train,
    use_peft=False,               # full FT
    output_dir="./tmp/sft_full",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    learning_rate=5e-6,
    report_to="none",
    seed=7,
)
full_pipeline = SteeringPipeline(
    lazy_init=True, 
    controls=[full_sft]
)
full_pipeline.steer()


The wrapper also provides functionality for resuming training if interrupted (via TRL's `resume_from_checkpoint`) by providing either the directory path of the checkpoint name in `output_dir`.

In [None]:
resume_sft = SFT(
    base_model_name_or_path=MODEL_NAME,
    tokenizer_name_or_path=MODEL_NAME,
    train_dataset=sft_train,
    output_dir="./tmp/sft_lora",
    resume_from_checkpoint="./tmp/sft_lora/checkpoint-1000",
    use_peft=True,
    adapter_name="sft",
    report_to="none",
)
resume_pipeline = SteeringPipeline(
    lazy_init=True, 
    controls=[resume_sft]
)
resume_pipeline.steer()
