# Direct Preference Optimization (DPO) at scale with QLoRA

This guide provides a step-by-step workflow for preference fine-tuning the [`Qwen/Qwen2.5-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) model on a multi-GPU Anyscale cluster. You use LLaMA-Factory as the training framework and `QLoRA` to reduce memory requirements and enable efficient multi-GPU training.

DPO aligns a model with human preferences using pairs of “chosen” and “rejected” responses. Rather than training a separate reward model, DPO directly optimizes the policy to increase the likelihood of preferred outputs and decrease the likelihood of rejected ones.
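To make the objective concrete, the following minimal sketch computes the standard sigmoid DPO loss for a single preference pair. It isn't LLaMA-Factory code: the function name `dpo_sigmoid_loss` and the example log-probabilities are illustrative assumptions, and `beta` plays the role of the `pref_beta` setting used later in this guide.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each response under the
    trainable policy and under the frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    loss = math.log1p(math.exp(-margin))
    return loss, reward_chosen, reward_rejected, margin


# Illustrative numbers only (not taken from a real run).
loss, r_chosen, r_rejected, margin = dpo_sigmoid_loss(-280.0, -292.0, -281.0, -290.0)
print(f"loss={loss:.4f} margin={margin:.4f}")
```

When the policy still equals the reference model, the margin is 0 and the loss is ln 2 ≈ 0.693, which is why a DPO run typically starts with a loss near 0.69. A growing positive margin means the policy increasingly favors the chosen response over the rejected one.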

## Step 1: Set up your environment

### Dependencies
First, ensure your environment has the correct libraries. Start with a pre-built container image and install LLaMA-Factory and a few optional tools on top of it.

Recommended container image:
```bash
anyscale/ray-llm:2.48.0-py311-cu128
```

Execute the following commands to install the required packages and optional tools for experiment tracking and faster model downloads:

```bash
# Install the specific version of LLaMA-Factory
pip install -q llamafactory==0.9.3

# (Optional) For visualizing training metrics and logs
pip install -q tensorboard==2.20.0

# For 4-bit and 8-bit quantization with bitsandbytes (used by the QLoRA config in this guide)
pip install -q bitsandbytes==0.47.0

# (Optional) For AWQ quantization support
pip install -q autoawq==0.2.9

# (Optional) For accelerated model downloads from Hugging Face
pip install -q hf_transfer==0.1.9
```

### Model and compute resources

| Item | Value |
|------|-------|
| **Base model** | [`Qwen/Qwen2.5-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
| **Workers** | 4 × L4 / A10G |

Compared to SFT, DPO holds two copies of the model (policy and reference), and alignment datasets often use long contexts, so DPO benefits strongly from memory optimization techniques such as **QLoRA**. On 24 GB NVIDIA L4 GPUs, running DPO on a 7B model with 16-bit weights generally runs out of memory (OOM) without QLoRA.
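The memory pressure can be estimated with back-of-the-envelope arithmetic. The sketch below counts weights only; it ignores activations, gradients, optimizer state, LoRA adapters, and quantization metadata, and the parameter count of 7.6 billion is an approximation assumed for illustration.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for model weights only, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7.6e9  # assumption: approximate parameter count of a "7B" model

fp16_one_copy = weight_memory_gib(NUM_PARAMS, 16)
fp16_two_copies = 2 * fp16_one_copy              # policy + reference
nf4_one_copy = weight_memory_gib(NUM_PARAMS, 4)  # ignores quantization metadata

print(f"16-bit, one copy:   {fp16_one_copy:.1f} GiB")
print(f"16-bit, two copies: {fp16_two_copies:.1f} GiB")
print(f"4-bit, one copy:    {nf4_one_copy:.1f} GiB")
```

Under these assumptions, a single 16-bit copy already takes roughly 14 GiB and two copies exceed the capacity of a 24 GB GPU before any activations are stored, while 4-bit weights fit in under 4 GiB.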

## Step 2: Prepare the dataset

### Understand the dataset
This tutorial uses [`ultrafeedback.jsonl`](https://huggingface.co/datasets/kaitchup/UltraFeedback-prompt-chosen-rejected), a preference dataset tailored for DPO. Each sample contains one instruction **prompt** and two candidate completions: a **preferred** (`chosen`) response and a **less preferred** (`rejected`) response.

This dataset includes:
- `prompt`: An instruction or question to answer, often multi-sentence, with constraints.
- `chosen`: The response that best follows the instruction.
- `rejected`: A weaker alternative for the same prompt.

**Dataset example**
```json
{
  "prompt": "Paraphrase the given questions to have different wording. Your paraphrased questions should have the same answer as the original question. Try to change the sentence as much as possible using synonyms and/or rearranging the structure of the sentence. The questions are in three domains: presidents, national parks, and dogs. Each question has a keyword indicating its domain. Keywords are \"this national park\", \"this dog breed\", and \"this president\", which will be replaced with the name of an actual president, a national park, or a breed of dog. Hence, in paraphrasing, this keyword should also be used the same way. Do not write questions that compare or involve multiple domains. Do not write open-ended or subjective questions (e.g., questions that can be answered differently by different people.) Make your questions specific and concrete. Your question should have the same type of answer as the original question(e.g., if the question is extractive, the paraphrased question should be extractive as well.)\n\nWhat lakes are in this national park?",

  "rejected": "What bodies of water are located in this national park? \n\nWhich president is commonly known for his efforts to protect natural resources?\n\nWhich president is recognized for their dedication to preserving the environment? \n\nWhat type of dog breed is known for its loyalty and affectionate nature?\n\nWhat breed of dog is renowned for its faithfulness and loving personality?",
  
  "chosen": "Which bodies of water can be found within the borders of this particular national park?"
}
```

### Register the dataset

To make a new dataset accessible across Ray worker nodes, first add a **`dataset_info.json`** file to **[storage shared across nodes](https://docs.anyscale.com/configuration/storage#shared)**, such as `/mnt/cluster_storage`. This configuration file acts as a central registry for all your datasets: it maps a custom name to each dataset's file location, format, and column structure.

To run DPO post-training on the `ultrafeedback` dataset, complete the setup steps below first. Place the dataset files in a storage location that all workers can access, such as a shared mount or object storage, and avoid storing large files on the head node.

`dataset_info.json`
```json
{
  "my_ultrafeedback": {
    "file_name": "/mnt/cluster_storage/ultrafeedback.jsonl",
    "ranking": true,
    "columns": {
      "prompt": "prompt",
      "chosen": "chosen",
      "rejected": "rejected"
    }
  }
}
```
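Before training, it can help to confirm that every row of the data file actually contains the fields named in the `columns` mapping. The following sketch is a standalone check, not part of LLaMA-Factory; the helper `find_bad_rows` and the in-memory sample rows are hypothetical.

```python
import json

# Registry entry shaped like the dataset_info.json above.
registry = json.loads("""
{
  "my_ultrafeedback": {
    "file_name": "/mnt/cluster_storage/ultrafeedback.jsonl",
    "ranking": true,
    "columns": {"prompt": "prompt", "chosen": "chosen", "rejected": "rejected"}
  }
}
""")


def find_bad_rows(jsonl_lines, columns):
    """Return (line_number, reason) for rows that don't match the column mapping."""
    problems = []
    for line_no, line in enumerate(jsonl_lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        for role, field in columns.items():
            value = row.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append((line_no, f"missing or empty '{field}' ({role})"))
    return problems


# Hypothetical in-memory sample. For a real file, pass an open file handle
# instead, for example: find_bad_rows(open(entry["file_name"]), entry["columns"])
sample = [
    '{"prompt": "What lakes are in this national park?", "chosen": "A", "rejected": "B"}',
    '{"prompt": "Q", "chosen": "A"}',
]
entry = registry["my_ultrafeedback"]
print(find_bad_rows(sample, entry["columns"]))
```

An empty list means every row provides a non-empty string for each mapped column; in the sample above, the second row is reported because it has no `rejected` field.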

For a more detailed dataset preparation and formatting guide, see [Choose your data format](https://docs.anyscale.com/llm/fine-tuning/data-preparation#preference-methods).

```bash
# Make sure all files are accessible to worker nodes
# Create a copy of the data in /mnt/cluster_storage
wget https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/datasets/alpaca/ultrafeedback.jsonl -O /mnt/cluster_storage/ultrafeedback.jsonl
# Create a copy of the dataset registry in /mnt/cluster_storage
cp ../dataset-configs/dataset_info.json /mnt/cluster_storage/
```


## Step 3: Create the preference-tuning config (DPO and QLoRA)

Next, create the YAML configuration file that defines your DPO run. It specifies the base model, quantization (QLoRA), dataset, DPO hyperparameters, logging, and Ray cluster resources.

**Important notes:**
- **QLoRA quantization:** `quantization_bit: 4` with `quantization_method: bnb` quantizes the base model weights to 4 bits using bitsandbytes, reducing memory use while largely preserving quality. If you use a model *pre-quantized* with AWQ, **omit** these keys.
- **LoRA setup:** If you prefer standard LoRA, **disable quantization** by removing both `quantization_bit` and `quantization_method` from the config.
- **Access & paths:** The YAML only needs to be on the **head node**, but any referenced paths (`dataset_dir`, `output_dir`) must reside on storage **reachable by all workers** (for example, `/mnt/cluster_storage/`).
- **Gated models:** If your base model has gated access (for example, Llama) on Hugging Face, set `HF_TOKEN` in the runtime environment.

### Configure LLaMA-Factory with Ray

**Note**: To customize the training configuration, edit `train-configs/dpo_qlora.yaml`. 

```yaml
# dpo_qlora.yaml

### model
trust_remote_code: true
model_name_or_path: Qwen/Qwen2.5-7B-Instruct

### method
# If you instead want to use just LoRA, or a pre-quantized model like Qwen/Qwen2.5-7B-Instruct-AWQ, then omit the quantization_bit/method keys below
quantization_bit: 4  # 4-bit base weights (QLoRA). Use 8 for 8-bit; omit for FP16/BF16
quantization_method: bnb  # Quantization backend: bnb (bitsandbytes); hqq and eetq are alternatives

stage: dpo
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
pref_beta: 0.1
pref_loss: sigmoid  # choices: [sigmoid (dpo), orpo, simpo]

# local dataset
dataset: my_ultrafeedback
dataset_dir: /mnt/cluster_storage

template: qwen
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: qwen2.5_7b_qlora_dpo
logging_steps: 5
save_steps: 5              # Also affects TensorBoard logging; increase if you don't use TensorBoard
plot_loss: true
report_to: tensorboard  # or none

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
num_train_epochs: 3.0  # Low for demo purposes; adjust as needed
learning_rate: 5.0e-6
bf16: true
lr_scheduler_type: cosine
warmup_ratio: 0.1
ddp_timeout: 180000000

### ray
ray_run_name: qwen2.5_7b_qlora_dpo
ray_storage_path: /mnt/cluster_storage/
ray_num_workers: 4  # Number of GPUs to use.
resources_per_worker:
  GPU: 1
  anyscale/accelerator_shape:4xL4: 0.001  # Use this to specify a specific node shape.
  # accelerator_type:L4: 0.001            # Or use this to simply specify a GPU type.
  # See https://docs.ray.io/en/master/ray-core/accelerator-types.html#accelerator-types for a full list of accelerator types.

ray_init_kwargs:
  runtime_env:
    env_vars:
      # If using gated models like meta-llama/Llama-3.1-8B-Instruct
      # HF_TOKEN: <your_huggingface_token>
      # Enable faster downloads if hf_transfer is installed:
      HF_HUB_ENABLE_HF_TRANSFER: '1'
```
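The batch-related settings combine into an effective batch size and a total number of optimizer steps. The sketch below works through that arithmetic with the values from the config. The example count of 100 matches the `Num examples = 100` line in the training output in Step 4, and the round-up convention is an assumption that may differ across Trainer versions.

```python
import math

# Values from dpo_qlora.yaml
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
ray_num_workers = 4
num_train_epochs = 3

# Assumption: number of examples actually loaded (max_samples only caps it).
num_examples = 100

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * ray_num_workers
)

# Each worker processes an equal shard of the data. One optimizer step happens
# every gradient_accumulation_steps micro-batches (rounded up in this sketch).
batches_per_worker = math.ceil(
    num_examples / (ray_num_workers * per_device_train_batch_size)
)
steps_per_epoch = math.ceil(batches_per_worker / gradient_accumulation_steps)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size, steps_per_epoch, total_steps)  # prints: 8 13 39
```

These numbers line up with the total train batch size of 8 and the 39-step progress bar reported in the training output. With `logging_steps: 5` and `save_steps: 5`, such a run logs metrics and writes a checkpoint every 5 optimizer steps.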

## Step 4: Train and monitor

With all configurations in place, you can launch fine-tuning or post-training in one of two ways:

### Option A: Run from a workspace (quick start)

The `USE_RAY=1` prefix tells LLaMA-Factory to run in distributed mode on the Ray cluster attached to your workspace.

```bash
USE_RAY=1 llamafactory-cli train ../train-configs/dpo_qlora.yaml
```

[2026-02-08 09:30:36,173] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cpu (auto detect)
INFO 02-08 09:30:38 [__init__.py:248] No platform detected, vLLM is running on UnspecifiedPlatform


2026-02-08 09:30:40,431	INFO worker.py:1747 -- Connecting to existing Ray cluster at address: 10.128.5.218:6379...
2026-02-08 09:30:40,442	INFO worker.py:1918 -- Connected to Ray cluster. View the dashboard at https://session-c1mvc6t862zj4fbguuknngnrgv.i.anyscaleuserdata.com
2026-02-08 09:30:40,443	INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_50420c404e85b1830c428a285b74dd50fa30970a.zip' (0.18MiB) to Ray cluster...
2026-02-08 09:30:40,444	INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_50420c404e85b1830c428a285b74dd50fa30970a.zip'.



View detailed results here: /mnt/cluster_storage/qwen2.5_7b_qlora_dpo
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2026-02-08_07-42-51_475236_185/artifacts/2026-02-08_09-30-40/qwen2.5_7b_qlora_dpo/driver_artifacts`
(TrainTrainable pid=5784, ip=10.128.7.103) [2026-02-08 09:31:45,374] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cpu (auto detect)

Training started with configuration:
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Training config                                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_loop_config/args/bf16                                                                             True │
│ train_loop_config/args/cutoff_len                                             

(RayTrainWorker pid=5917, ip=10.128.7.103) Setting up process group for: env:// [rank=0, world_size=4]
(TorchTrainer pid=5784, ip=10.128.7.103) Started distributed worker processes:
(TorchTrainer pid=5784, ip=10.128.7.103) - (node_id=c0a063c3a2d8319e8a19333c768480af090e91c460068963ebf9fb27, ip=10.128.7.103, pid=5917) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=5784, ip=10.128.7.103) - (node_id=c0a063c3a2d8319e8a19333c768480af090e91c460068963ebf9fb27, ip=10.128.7.103, pid=5920) world_rank=1, local_rank=1, node_rank=0
(TorchTrainer pid=5784, ip=10.128.7.103) - (node_id=c0a063c3a2d8319e8a19333c768480af090e91c460068963ebf9fb27, ip=10.128.7.103, pid=5919) world_rank=2, local_rank=2, node_rank=0
(TorchTrainer pid=5784, ip=10.128.7.103) - (node_id=c0a063c3a2d8319e8a19333c768480af090e91c460068963ebf9fb27, ip=10.128.7.103, pid=5918) world_rank=3, local_rank=3, node_rank=0


(RayTrainWorker pid=5918, ip=10.128.7.103) [2026-02-08 09:31:57,268] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|2026-02-08 09:32:01] llamafactory.hparams.parser:143 >> Set `ddp_find_unused_parameters` to False in DDP training since LoRA is enabled.
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|2026-02-08 09:32:01] llamafactory.hparams.parser:406 >> Process rank: 0, world size: 4, device: cuda:0, distributed training: True, compute dtype: torch.bfloat16


(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|tokenization_utils_base.py:2023] 2026-02-08 09:32:02,641 >> loading file vocab.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/vocab.json
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|tokenization_utils_base.py:2023] 2026-02-08 09:32:02,641 >> loading file merges.txt from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/merges.txt
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|tokenization_utils_base.py:2023] 2026-02-08 09:32:02,641 >> loading file tokenizer.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/tokenizer.json
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|tokenization_utils_base.py:2023] 2026-02-08 09:32:02,641 >> loading file added_

(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|2026-02-08 09:32:04] llamafactory.data.loader:143 >> Loading dataset ultrafeedback.jsonl...
(RayTrainWorker pid=5917, ip=10.128.7.103) [2026-02-08 09:31:57,295] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)


(RayTrainWorker pid=5920, ip=10.128.7.103) [rank1]:[W208 09:32:04.633084978 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
(RayTrainWorker pid=5917, ip=10.128.7.103) Setting num_proc from 16 back to 1 for the train split to disable multiprocessing as it only contains one shard.
Generating train split: 100 examples [00:00, 21953.96 examples/s]
Converting format of dataset (num_proc=16):   0%|          | 0/100 [00:00<?, ? examples/s]
Converting format of dataset (num_proc=16):  88%|████████▊ | 88/100 [00:00<00:00, 846.27 examples/s]
Converting format of dataset (num_proc=16): 100%|██████████| 100/100 [00:00<00:00, 516.07 examples/s]
Running tokenizer on dataset (num_proc=16):   0%|          | 0/100 [00:00<?, ? examples/s]
Running tokenizer

(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|configuration_utils.py:698] 2026-02-08 09:32:08,495 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/config.json
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|configuration_utils.py:770] 2026-02-08 09:32:08,496 >> Model config Qwen2Config {
(RayTrainWorker pid=5917, ip=10.128.7.103)   "architectures": [
(RayTrainWorker pid=5917, ip=10.128.7.103)     "Qwen2ForCausalLM"
(RayTrainWorker pid=5917, ip=10.128.7.103)   ],
(RayTrainWorker pid=5917, ip=10.128.7.103)   "attention_dropout": 0.0,
(RayTrainWorker pid=5917, ip=10.128.7.103)   "bos_token_id": 151643,
(RayTrainWorker pid=5917, ip=10.128.7.103)   "eos_token_id": 151645,
(RayTrainWorker pid=5917, ip=10.128.7.103)   "hidden_act": "silu",
(RayTrainWorker pid=5917, ip=10.128.7.103)

(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|2026-02-08 09:32:08] llamafactory.model.model_utils.quantization:143 >> Quantizing model to 4 bit with bitsandbytes.
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|2026-02-08 09:32:08] llamafactory.model.model_utils.kv_cache:143 >> KV cache is disabled during training.


(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|modeling_utils.py:1151] 2026-02-08 09:32:10,043 >> loading weights file model.safetensors from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/model.safetensors.index.json
(RayTrainWorker pid=5917, ip=10.128.7.103) [rank0]:[W208 09:32:05.594612754 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device. [repeated 3x across cluster]
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|modeling_utils.py:2241] 2026-02-08 09:32:36,859 >> Instantiating Qwen2ForCausalLM model under default dtype torch.bfloat16.
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|configuration_utils.py:1135] 2026-02-08 09:32:

(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|2026-02-08 09:32:47] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|2026-02-08 09:32:47] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|2026-02-08 09:32:47] llamafactory.model.adapter:143 >> Upcasting trainable params to float32.
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|2026-02-08 09:32:47] llamafactory.model.adapter:143 >> Fine-tuning method: LoRA
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|2026-02-08 09:32:47] llamafactory.model.model_utils.misc:143 >> Found linear modules: up_proj,k_proj,down_proj,o_proj,gate_proj,v_proj,q_proj
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|2026-02-08 09:32:47] llamafactory.model.loader:143 >> trainable params: 20,185,088 || all params: 7,635,801,60

(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|trainer.py:756] 2026-02-08 09:32:47,560 >> Using auto half precision backend
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|trainer.py:2409] 2026-02-08 09:32:48,352 >> ***** Running training *****
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|trainer.py:2410] 2026-02-08 09:32:48,352 >>   Num examples = 100
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|trainer.py:2411] 2026-02-08 09:32:48,352 >>   Num Epochs = 3
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|trainer.py:2412] 2026-02-08 09:32:48,352 >>   Instantaneous batch size per device = 1
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|trainer.py:2415] 2026-02-08 09:32:48,352 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
(RayTrainWorker pid=5917, ip=10.128.7.103) [INFO|trainer.py:2416] 2026-02-08 09:32:48,352 >>   Gradient Accumulation steps = 2
(RayTrainWorker pid=5917, ip=10.12

(RayTrainWorker pid=5917, ip=10.128.7.103) {'loss': 0.6902, 'grad_norm': 7.472792148590088, 'learning_rate': 5e-06, 'rewards/chosen': -0.0011365606915205717, 'rewards/rejected': -0.008174104616045952, 'rewards/accuracies': 0.3499999940395355, 'rewards/margins': 0.007037544157356024, 'logps/chosen': -280.1697692871094, 'logps/rejected': -292.0281677246094, 'logits/chosen': -0.8638967275619507, 'logits/rejected': -0.8488122224807739, 'epoch': 0.4}


(RayTrainWorker pid=5920, ip=10.128.7.103) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_e3928_00000_0_2026-02-08_09-30-40/checkpoint_000000)


Training finished iteration 1 at 2026-02-08 09:33:00. Total running time: 2min 19s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000000 │
│ time_this_iter_s               71.50077 │
│ time_total_s                   71.50077 │
│ training_iteration                    1 │
│ epoch                               0.4 │
│ grad_norm                       7.47279 │
│ learning_rate                   0.00001 │
│ logits/chosen                   -0.8639 │
│ logits/rejected                -0.84881 │
│ logps/chosen                 -280.16977 │
│ logps/rejected               -292.02817 │
│ loss                             0.6902 │
│ rewards/accuracies                 0.35 │
│ rewards/chosen                 -0.00114 │
│ rewards/margins                 0.00704 │
│ rewards/rejected               -0.00817 │
│ step                                  5 │
╰───────────────────────────────────

 15%|█▌        | 6/39 [00:13<01:30,  2.74s/it]
 18%|█▊        | 7/39 [00:14<01:14,  2.34s/it]
(RayTrainWorker pid=5917, ip=10.128.7.103) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_e3928_00000_0_2026-02-08_09-30-40/checkpoint_000000) [repeated 3x across cluster]
 21%|██        | 8/39 [00:16<01:03,  2.05s/it]
 23%|██▎       | 9/39 [00:17<00:59,  2.00s/it]
 26%|██▌       | 10/39 [00:19<00:54,  1.90s/it]
[INFO|trainer.py:3993] 2026-02-08 09:33:07,918 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo/checkpoint-10


(RayTrainWorker pid=5917, ip=10.128.7.103) {'loss': 0.7013, 'grad_norm': 11.50890064239502, 'learning_rate': 4.752422169756048e-06, 'rewards/chosen': -0.004898893181234598, 'rewards/rejected': 0.008253341540694237, 'rewards/accuracies': 0.5750000476837158, 'rewards/margins': -0.013152234256267548, 'logps/chosen': -278.3939208984375, 'logps/rejected': -295.9100341796875, 'logits/chosen': -0.8270912170410156, 'logits/rejected': -0.977357029914856, 'epoch': 0.8}


(RayTrainWorker pid=5918, ip=10.128.7.103) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_e3928_00000_0_2026-02-08_09-30-40/checkpoint_000001) [repeated 3x across cluster]


Training finished iteration 2 at 2026-02-08 09:33:11. Total running time: 2min 30s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000001 │
│ time_this_iter_s               11.19224 │
│ time_total_s                     82.693 │
│ training_iteration                    2 │
│ epoch                               0.8 │
│ grad_norm                       11.5089 │
│ learning_rate                        0. │
│ logits/chosen                  -0.82709 │
│ logits/rejected                -0.97736 │
│ logps/chosen                 -278.39392 │
│ logps/rejected               -295.91003 │
│ loss                             0.7013 │
│ rewards/accuracies                0.575 │
│ rewards/chosen                  -0.0049 │
│ rewards/margins                -0.01315 │
│ rewards/rejected                0.00825 │
│ step                                 10 │
╰───────────────────────────────────

 28%|██▊       | 11/39 [00:24<01:19,  2.85s/it]
 31%|███       | 12/39 [00:25<01:05,  2.42s/it]
(RayTrainWorker pid=5917, ip=10.128.7.103) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_e3928_00000_0_2026-02-08_09-30-40/checkpoint_000001)
 33%|███▎      | 13/39 [00:26<00:47,  1.84s/it]
 36%|███▌      | 14/39 [00:28<00:46,  1.85s/it]
 38%|███▊      | 15/39 [00:30<00:43,  1.80s/it]
[INFO|trainer.py:3993] 2026-02-08 09:33:18,446 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo/checkpoint-15


(RayTrainWorker pid=5917, ip=10.128.7.103) {'loss': 0.6336, 'grad_norm': 6.3686323165893555, 'learning_rate': 4.058724504646834e-06, 'rewards/chosen': -0.02141275256872177, 'rewards/rejected': -0.0034777740947902203, 'rewards/accuracies': 0.3611111044883728, 'rewards/margins': -0.017934981733560562, 'logps/chosen': -221.60379028320312, 'logps/rejected': -241.5094757080078, 'logits/chosen': -0.7992151379585266, 'logits/rejected': -0.9283792972564697, 'epoch': 1.16}


(RayTrainWorker pid=5920, ip=10.128.7.103) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_e3928_00000_0_2026-02-08_09-30-40/checkpoint_000002)


Training finished iteration 3 at 2026-02-08 09:33:21. Total running time: 2min 41s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000002 │
│ time_this_iter_s               10.60935 │
│ time_total_s                   93.30236 │
│ training_iteration                    3 │
│ epoch                              1.16 │
│ grad_norm                       6.36863 │
│ learning_rate                        0. │
│ logits/chosen                  -0.79922 │
│ logits/rejected                -0.92838 │
│ logps/chosen                 -221.60379 │
│ logps/rejected               -241.50948 │
│ loss                             0.6336 │
│ rewards/accuracies              0.36111 │
│ rewards/chosen                 -0.02141 │
│ rewards/margins                -0.01793 │
│ rewards/rejected               -0.00348 │
│ step                                 15 │
╰───────────────────────────────────

 41%|████      | 16/39 [00:34<01:02,  2.71s/it]
 44%|████▎     | 17/39 [00:36<00:54,  2.46s/it]
(RayTrainWorker pid=5917, ip=10.128.7.103) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_e3928_00000_0_2026-02-08_09-30-40/checkpoint_000002) [repeated 3x across cluster]
 46%|████▌     | 18/39 [00:38<00:48,  2.29s/it]
 49%|████▊     | 19/39 [00:40<00:40,  2.04s/it]
 51%|█████▏    | 20/39 [00:41<00:34,  1.82s/it]
[INFO|trainer.py:3993] 2026-02-08 09:33:29,793 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo/checkpoint-20


[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m {'loss': 0.6843, 'grad_norm': 6.403764247894287, 'learning_rate': 3.056302334890786e-06, 'rewards/chosen': 0.001086606178432703, 'rewards/rejected': -0.01895829103887081, 'rewards/accuracies': 0.5750000476837158, 'rewards/margins': 0.020044900476932526, 'logps/chosen': -276.7025146484375, 'logps/rejected': -287.20098876953125, 'logits/chosen': -0.8141433000564575, 'logits/rejected': -0.8249074816703796, 'epoch': 1.56}


[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m [INFO|configuration_utils.py:698] 2026-02-08 09:33:30,039 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/config.json
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m [INFO|configuration_utils.py:770] 2026-02-08 09:33:30,040 >> Model config Qwen2Config {
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "architectures": [
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m     "Qwen2ForCausalLM"
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   ],
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "attention_dropout": 0.0,
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "bos_token_id": 151643,
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "eos_token_id": 151645,
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "hidden_act": "silu",
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)


Training finished iteration 4 at 2026-02-08 09:33:33. Total running time: 2min 52s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000003 │
│ time_this_iter_s               11.46522 │
│ time_total_s                  104.76758 │
│ training_iteration                    4 │
│ epoch                              1.56 │
│ grad_norm                       6.40376 │
│ learning_rate                        0. │
│ logits/chosen                  -0.81414 │
│ logits/rejected                -0.82491 │
│ logps/chosen                 -276.70251 │
│ logps/rejected               -287.20099 │
│ loss                             0.6843 │
│ rewards/accuracies                0.575 │
│ rewards/chosen                  0.00109 │
│ rewards/margins                 0.02004 │
│ rewards/rejected               -0.01896 │
│ step                                 20 │
╰─────────────────────────────────────────╯

 54%|█████▍    | 21/39 [00:46<00:49,  2.72s/it][0m 
 56%|█████▋    | 22/39 [00:47<00:38,  2.28s/it][0m 
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_e3928_00000_0_2026-02-08_09-30-40/checkpoint_000003)
 59%|█████▉    | 23/39 [00:48<00:31,  1.98s/it][0m 
 62%|██████▏   | 24/39 [00:50<00:27,  1.83s/it][0m 
 64%|██████▍   | 25/39 [00:51<00:23,  1.67s/it][INFO|trainer.py:3993] 2026-02-08 09:33:39,926 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo/checkpoint-25
[36m(RayTrainWorker pid=5920, ip=10.128.7.103)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_e3928_00000_0_2026-02-08_09-30-40/checkpoint_000004)


[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m {'loss': 0.6899, 'grad_norm': 8.657591819763184, 'learning_rate': 1.9436976651092143e-06, 'rewards/chosen': -0.013525770977139473, 'rewards/rejected': -0.02253251150250435, 'rewards/accuracies': 0.550000011920929, 'rewards/margins': 0.009006739594042301, 'logps/chosen': -220.19992065429688, 'logps/rejected': -298.4376525878906, 'logits/chosen': -0.8283650875091553, 'logits/rejected': -0.882324755191803, 'epoch': 1.96}


[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m [INFO|configuration_utils.py:698] 2026-02-08 09:33:40,172 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/config.json
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m [INFO|configuration_utils.py:770] 2026-02-08 09:33:40,172 >> Model config Qwen2Config {
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "architectures": [
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m     "Qwen2ForCausalLM"
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   ],
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "attention_dropout": 0.0,
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "bos_token_id": 151643,
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "eos_token_id": 151645,
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "hidden_act": "silu",
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)


Training finished iteration 5 at 2026-02-08 09:33:43. Total running time: 3min 2s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000004 │
│ time_this_iter_s               10.27101 │
│ time_total_s                  115.03859 │
│ training_iteration                    5 │
│ epoch                              1.96 │
│ grad_norm                       8.65759 │
│ learning_rate                        0. │
│ logits/chosen                  -0.82837 │
│ logits/rejected                -0.88232 │
│ logps/chosen                 -220.19992 │
│ logps/rejected               -298.43765 │
│ loss                             0.6899 │
│ rewards/accuracies                 0.55 │
│ rewards/chosen                 -0.01353 │
│ rewards/margins                 0.00901 │
│ rewards/rejected               -0.02253 │
│ step                                 25 │
╰─────────────────────────────────────────╯

 67%|██████▋   | 26/39 [00:56<00:33,  2.58s/it][0m 
 69%|██████▉   | 27/39 [00:57<00:27,  2.33s/it][0m 
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_e3928_00000_0_2026-02-08_09-30-40/checkpoint_000004)[32m [repeated 3x across cluster][0m
 72%|███████▏  | 28/39 [00:59<00:23,  2.10s/it][0m 
 74%|███████▍  | 29/39 [01:00<00:18,  1.85s/it][0m 
 77%|███████▋  | 30/39 [01:02<00:16,  1.86s/it][INFO|trainer.py:3993] 2026-02-08 09:33:51,108 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo/checkpoint-30


[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m {'loss': 0.6025, 'grad_norm': 10.21045970916748, 'learning_rate': 9.412754953531664e-07, 'rewards/chosen': 0.002010171767324209, 'rewards/rejected': -0.05112072825431824, 'rewards/accuracies': 0.6666666865348816, 'rewards/margins': 0.05313090234994888, 'logps/chosen': -319.646484375, 'logps/rejected': -264.9739990234375, 'logits/chosen': -0.8445010185241699, 'logits/rejected': -0.9562739729881287, 'epoch': 2.32}


[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m [INFO|configuration_utils.py:698] 2026-02-08 09:33:51,462 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/config.json
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m [INFO|configuration_utils.py:770] 2026-02-08 09:33:51,462 >> Model config Qwen2Config {
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "architectures": [
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m     "Qwen2ForCausalLM"
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   ],
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "attention_dropout": 0.0,
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "bos_token_id": 151643,
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "eos_token_id": 151645,
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "hidden_act": "silu",
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)


Training finished iteration 6 at 2026-02-08 09:33:54. Total running time: 3min 14s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000005 │
│ time_this_iter_s               11.35198 │
│ time_total_s                  126.39057 │
│ training_iteration                    6 │
│ epoch                              2.32 │
│ grad_norm                      10.21046 │
│ learning_rate                        0. │
│ logits/chosen                   -0.8445 │
│ logits/rejected                -0.95627 │
│ logps/chosen                 -319.64648 │
│ logps/rejected                 -264.974 │
│ loss                             0.6025 │
│ rewards/accuracies              0.66667 │
│ rewards/chosen                  0.00201 │
│ rewards/margins                 0.05313 │
│ rewards/rejected               -0.05112 │
│ step                                 30 │
╰─────────────────────────────────────────╯

 79%|███████▉  | 31/39 [01:07<00:22,  2.86s/it][0m 
 82%|████████▏ | 32/39 [01:09<00:17,  2.45s/it][0m 
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_e3928_00000_0_2026-02-08_09-30-40/checkpoint_000005)
 85%|████████▍ | 33/39 [01:10<00:12,  2.09s/it][0m 
 87%|████████▋ | 34/39 [01:12<00:09,  1.91s/it][0m 
 90%|████████▉ | 35/39 [01:13<00:07,  1.86s/it][0m 


[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m {'loss': 0.6922, 'grad_norm': 9.275382995605469, 'learning_rate': 2.4757783024395244e-07, 'rewards/chosen': -0.01869027502834797, 'rewards/rejected': -0.024014988914132118, 'rewards/accuracies': 0.574999988079071, 'rewards/margins': 0.0053247129544615746, 'logps/chosen': -224.70375061035156, 'logps/rejected': -296.4294128417969, 'logits/chosen': -0.7401316165924072, 'logits/rejected': -0.8596256375312805, 'epoch': 2.72}


 90%|████████▉ | 35/39 [01:13<00:07,  1.86s/it][INFO|trainer.py:3993] 2026-02-08 09:34:02,272 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo/checkpoint-35
[36m(RayTrainWorker pid=5920, ip=10.128.7.103)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_e3928_00000_0_2026-02-08_09-30-40/checkpoint_000006)
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m [INFO|configuration_utils.py:698] 2026-02-08 09:34:02,521 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/config.json
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m [INFO|configuration_utils.py:770] 2026-02-08 09:34:02,521 >> Model config Qwen2Config {
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "architectures": [
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m     "Qwen2ForCausalLM"
[36m(RayTrainWorker pid=59


Training finished iteration 7 at 2026-02-08 09:34:05. Total running time: 3min 25s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000006 │
│ time_this_iter_s               11.03235 │
│ time_total_s                  137.42292 │
│ training_iteration                    7 │
│ epoch                              2.72 │
│ grad_norm                       9.27538 │
│ learning_rate                        0. │
│ logits/chosen                  -0.74013 │
│ logits/rejected                -0.85963 │
│ logps/chosen                 -224.70375 │
│ logps/rejected               -296.42941 │
│ loss                             0.6922 │
│ rewards/accuracies                0.575 │
│ rewards/chosen                 -0.01869 │
│ rewards/margins                 0.00532 │
│ rewards/rejected               -0.02401 │
│ step                                 35 │
╰─────────────────────────────────────────╯

 92%|█████████▏| 36/39 [01:19<00:08,  2.87s/it][0m 
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_e3928_00000_0_2026-02-08_09-30-40/checkpoint_000006)[32m [repeated 3x across cluster][0m
 95%|█████████▍| 37/39 [01:20<00:04,  2.40s/it][0m 
 97%|█████████▋| 38/39 [01:21<00:02,  2.14s/it][0m 
100%|██████████| 39/39 [01:22<00:00,  1.65s/it][INFO|trainer.py:3993] 2026-02-08 09:34:10,837 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo/checkpoint-39
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m [INFO|configuration_utils.py:698] 2026-02-08 09:34:11,095 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/config.json
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m [INFO|configuration_utils.py:770] 2026-02-08 09:34:11,096 >> Model


Training finished iteration 8 at 2026-02-08 09:34:14. Total running time: 3min 33s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000007 │
│ time_this_iter_s                8.52126 │
│ time_total_s                  145.94418 │
│ training_iteration                    8 │
│ epoch                              2.72 │
│ grad_norm                       9.27538 │
│ learning_rate                        0. │
│ logits/chosen                  -0.74013 │
│ logits/rejected                -0.85963 │
│ logps/chosen                 -224.70375 │
│ logps/rejected               -296.42941 │
│ loss                             0.6922 │
│ rewards/accuracies                0.575 │
│ rewards/chosen                 -0.01869 │
│ rewards/margins                 0.00532 │
│ rewards/rejected               -0.02401 │
│ step                                 35 │
╰─────────────────────────────────────────╯

[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m [INFO|trainer.py:2676] 2026-02-08 09:34:14,588 >> 
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m 
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m Training completed. Do not forget to share your model on huggingface.co/models =)
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m 
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m 
100%|██████████| 39/39 [01:26<00:00,  2.21s/it][0m 
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m [INFO|trainer.py:3993] 2026-02-08 09:34:14,592 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo


[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m {'train_runtime': 86.2318, 'train_samples_per_second': 3.479, 'train_steps_per_second': 0.452, 'train_loss': 0.6595334884447929, 'epoch': 3.0}


[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m [INFO|configuration_utils.py:698] 2026-02-08 09:34:14,834 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/config.json
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m [INFO|configuration_utils.py:770] 2026-02-08 09:34:14,834 >> Model config Qwen2Config {
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "architectures": [
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m     "Qwen2ForCausalLM"
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   ],
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "attention_dropout": 0.0,
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "bos_token_id": 151643,
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "eos_token_id": 151645,
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   "hidden_act": "silu",
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)

[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m ***** train metrics *****
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   epoch                    =        3.0
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   total_flos               = 12512612GF
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   train_loss               =     0.6595
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   train_runtime            = 0:01:26.23
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   train_samples_per_second =      3.479
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m   train_steps_per_second   =      0.452
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m Figure saved at: qwen2.5_7b_qlora_dpo/training_loss.png


[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m [INFO|modelcard.py:450] 2026-02-08 09:34:15,309 >> Dropping the following result as it does not have all the necessary fields:
[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}


[36m(RayTrainWorker pid=5917, ip=10.128.7.103)[0m Figure saved at: qwen2.5_7b_qlora_dpo/training_rewards_accuracies.png

Training completed after 8 iterations at 2026-02-08 09:34:16. Total running time: 3min 36s


2026-02-08 09:34:16,825	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/mnt/cluster_storage/qwen2.5_7b_qlora_dpo' in 0.0941s.





### Option B: Run as an Anyscale job (production)

For longer or production runs, submit the training as an **Anyscale job**. Jobs run outside your interactive session for better stability, retries, and durable logs. Package LLaMA-Factory and other libraries in a container image and launch with a short job config. See [Run LLaMA-Factory as an Anyscale job](https://docs.anyscale.com/llm/fine-tuning/llamafactory-jobs) for the step-by-step guide.

### Tracking with TensorBoard
If you enabled TensorBoard logging (`report_to: tensorboard` in your YAML), you can watch metrics such as training loss update live and compare runs that share the same run name side by side.

- **While the job is running:** LLaMA-Factory prints a ready-to-run command that starts with `tensorboard --logdir`. Open a new terminal and run it. For example:
  ```bash
  tensorboard --logdir /tmp/ray/session_*/artifacts/*/qwen2.5_7b_qlora_dpo/driver_artifacts
  ```

- **After the job:** Point TensorBoard at `{ray_storage_path}/{ray_run_name}/`. Each `TorchTrainer_*` subfolder holds event files for a single run. Using the parent folder aggregates all runs for easy comparison.
  ```bash
  tensorboard --logdir /mnt/cluster_storage/qwen2.5_7b_qlora_dpo
  ```

In your Anyscale workspace, look for the open **port 6006** labeled **TensorBoard** to view the dashboards.

![Anyscale workspace showing open ports with TensorBoard on port 6006](https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/open-ports.png)

**TensorBoard example**

![TensorBoard](https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/3.2.2/3.2.2-tensorboard.png)

For a more detailed guide on tracking experiments with other tools such as Weights & Biases or MLflow, see [Observability and tracking](https://docs.anyscale.com/llm/fine-tuning/observability-and-tracking).

## Step 5: Locate checkpoints

Ray Train writes checkpoints under `{ray_storage_path}/{ray_run_name}/`. In this example run, the path is `/mnt/cluster_storage/qwen2.5_7b_qlora_dpo`.

Inside, you see a **trainer session** directory with a name such as:
`TorchTrainer_e3928_00000_0_2026-02-08_09-30-40/` (the directory that the run above created).

- Ray Train creates `TorchTrainer_*` **when the trainer starts**; the suffix encodes a short run ID and the **start timestamp**.
- Within that directory, Ray Train names checkpoints `checkpoint_000xxx/`, where the zero-padded number is the checkpoint's position in save order, starting at `checkpoint_000000/`.
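
As a minimal sketch of navigating this layout, the following Python helper picks the most recent `TorchTrainer_*` directory and the highest-numbered checkpoint inside it. The function name `latest_checkpoint` is hypothetical, and the code assumes that the last 19 characters of the trainer directory name are the start timestamp in the `YYYY-MM-DD_HH-MM-SS` form shown above.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(storage_path: str, run_name: str) -> Optional[Path]:
    """Return the newest checkpoint directory for a run, or None if none exist."""
    run_dir = Path(storage_path) / run_name
    # Assumption: the last 19 characters of the trainer directory name are the
    # start timestamp (YYYY-MM-DD_HH-MM-SS), so sorting on them is chronological.
    trainers = sorted(
        (p for p in run_dir.glob("TorchTrainer_*") if p.is_dir()),
        key=lambda p: p.name[-19:],
    )
    # Walk from the newest session back until one contains a checkpoint.
    for trainer in reversed(trainers):
        # Zero-padded names sort lexicographically in save order.
        checkpoints = sorted(p for p in trainer.glob("checkpoint_*") if p.is_dir())
        if checkpoints:
            return checkpoints[-1]
    return None
```

For the run in this guide, you would call `latest_checkpoint("/mnt/cluster_storage", "qwen2.5_7b_qlora_dpo")`.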

Control the save cadence with `save_strategy` and `save_steps`. For instructions on how to resume interrupted training with `resume_from_checkpoint` and more, see [Understand the artifacts directory](https://docs.anyscale.com/llm/fine-tuning/checkpointing#artifacts-directory).
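
To illustrate how the save cadence maps to directory names, here is a small sketch. It assumes `save_strategy: steps` with `save_steps: 5` (inferred from the `checkpoint-20`, `checkpoint-25`, and later lines in the training log, not read from the YAML), a final save at the last step, and that Ray Train numbers its checkpoint directories from zero in save order. The helper name `expected_checkpoints` is hypothetical.

```python
from typing import List, Tuple


def expected_checkpoints(total_steps: int, save_steps: int) -> List[Tuple[str, str]]:
    """Pair each Ray Train checkpoint directory with the trainer step it captures."""
    # Periodic saves land on every multiple of save_steps.
    steps = list(range(save_steps, total_steps + 1, save_steps))
    # Assumption: the trainer also saves once at the final step.
    if not steps or steps[-1] != total_steps:
        steps.append(total_steps)
    # Ray Train directories count up from zero in the order they were saved.
    return [(f"checkpoint_{i:06d}", f"checkpoint-{s}") for i, s in enumerate(steps)]


pairs = expected_checkpoints(total_steps=39, save_steps=5)
for ray_dir, trainer_dir in pairs:
    print(ray_dir, "<->", trainer_dir)
```

With the 39 total steps from the log, this yields eight pairs ending in `checkpoint_000007` and `checkpoint-39`, which lines up with the eight training iterations that Ray Train reports.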

## Step 6: Export the model

If you use LoRA or QLoRA, you can either keep the base model and adapters separate for [multi-LoRA deployment](https://docs.anyscale.com/llm/serving/multi-lora) or [merge the adapters](https://docs.anyscale.com/llm/fine-tuning/checkpointing#merge-lora) into the base model for low-latency inference.
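
For the merge path, one possible sketch uses the Hugging Face `transformers` and `peft` libraries directly, which is an alternative to the flow that the linked guide describes, not a copy of it. It assumes both libraries are installed, that the adapter directory contains the `adapter_config.json` file that PEFT writes, and that the base model is loaded in bf16 rather than 4-bit for merging. The function names and all paths are hypothetical.

```python
from pathlib import Path


def has_adapter(adapter_dir: str) -> bool:
    """Check for the adapter_config.json file that PEFT saves with an adapter."""
    return (Path(adapter_dir) / "adapter_config.json").is_file()


def merge_adapter(base_model: str, adapter_dir: str, output_dir: str) -> None:
    """Merge LoRA weights into the base model and save a standalone copy."""
    # Imported lazily so the helper above works without the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    if not has_adapter(adapter_dir):
        raise FileNotFoundError(f"No adapter_config.json in {adapter_dir}")

    # Assumption: load the base in bf16 (not 4-bit) so the merged weights
    # are not quantized.
    base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
    # Attach the adapter, fold its weights into the base layers, and drop
    # the PEFT wrappers.
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(output_dir)
    # Save the tokenizer alongside so the output directory is self-contained.
    AutoTokenizer.from_pretrained(base_model).save_pretrained(output_dir)
```

A call would look like `merge_adapter("Qwen/Qwen2.5-7B-Instruct", adapter_dir, output_dir)`, where `adapter_dir` is the directory that holds the trained adapter files and `output_dir` is wherever you want the merged model written.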

For full fine-tuning or freeze-tuning, export the fine-tuned model directly.

Optionally, apply [post-training quantization](https://docs.anyscale.com/llm/fine-tuning/checkpointing#ptq) to merged or full models before serving.