# Kahneman–Tversky Optimization (KTO) at scale with LoRA

This guide provides a step-by-step workflow for preference fine-tuning the [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model on a multi-GPU Anyscale cluster. You use **LLaMA-Factory** as the training framework and **LoRA** to reduce the memory footprint and enable efficient multi-GPU training.

KTO aligns a model to human preferences using **single binary labels (accept or reject)** instead of pairwise “chosen versus rejected” comparisons. KTO directly optimizes the policy on these unary signals, simplifying data preparation while still encouraging preferred behavior and discouraging undesired outputs.
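
To make the objective concrete, the following is a minimal sketch of the per-example KTO loss as formulated in the KTO paper. It isn't LLaMA-Factory's implementation. It assumes you already have the policy-to-reference log-ratio for one completion and a batch-level KL estimate; the function name, argument names, and the default weights of 1.0 are illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def kto_example_loss(
    log_ratio: float,      # log pi_theta(y|x) - log pi_ref(y|x) for one completion
    kl_estimate: float,    # batch-level reference point z0
    desirable: bool,       # the unary label: True = accept, False = reject
    beta: float = 0.1,     # same role as `pref_beta` in the training config
    lambda_d: float = 1.0, # hypothetical weight for desirable examples
    lambda_u: float = 1.0, # hypothetical weight for undesirable examples
) -> float:
    """Sketch of the KTO loss for a single labeled completion."""
    z0 = max(kl_estimate, 0.0)  # the reference point is clamped at zero
    if desirable:
        # Reward completions whose log-ratio rises above the reference point.
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - z0)))
    # Penalize undesirable completions whose log-ratio rises above it.
    return lambda_u * (1.0 - sigmoid(beta * (z0 - log_ratio)))
```

When the policy still equals the reference model, the log-ratio and the KL estimate are both zero, so this sketch returns 0.5 for either label. That is consistent with the loss values near 0.5 early in the training log later in this guide.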

## Step 1: Set up your environment

### Dependencies
First, ensure your environment has the correct libraries. Start with a pre-built container image and install LLaMA-Factory and DeepSpeed on top of it.

Recommended container image:
```bash
anyscale/ray-llm:2.48.0-py311-cu128
```

Execute the following commands to install the required packages and optional tools for experiment tracking and faster model downloads.

In [1]:
%%bash
# Install the specific version of LLaMA-Factory
pip install -q llamafactory==0.9.3

# (Optional) For accelerated model downloads from Hugging Face
pip install -q hf_transfer==0.1.9

# (Optional) Acceleration methods (ensure CUDA/Torch compatibility)
pip install -q liger-kernel==0.6.2

# (Optional) Experiment tracking library
pip install -q mlflow==3.4.0

Successfully registered `llamafactory` package to be installed on all cluster nodes.
View and update dependencies here: https://console.anyscale.com/cld_a6j8iubw9rqbyigfwk9fut4amk/prj_a8aurpnjjkhushuarbyy4kwkre/workspaces/expwrk_kpm6l9gjz6gdcskt2zb8i3fie6?workspace-tab=dependencies
Successfully registered `hf_transfer` package to be installed on all cluster nodes.
Successfully registered `liger-kernel` package to be installed on all cluster nodes.
Successfully registered `mlflow` package to be installed on all cluster nodes.

### Model and compute resources

| Item | Value |
|------|-------|
| **Base model** | [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |
| **Workers** | 4 × L40S / A100 (1 GPU each) |

Compared to SFT, KTO typically holds two copies of the model (policy and reference), and alignment datasets often use long contexts, so Anyscale recommends GPUs with larger VRAM. Techniques such as **LoRA** and memory-efficient attention can further reduce memory pressure.
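
As a rough illustration of why VRAM matters here, the sketch below estimates the memory taken by the model weights alone. It assumes bf16 storage (2 bytes per parameter) and uses the "all params" count reported in the training log later in this guide. It ignores activations, optimizer state, and any weight sharing between the policy and the reference model, so treat it as back-of-the-envelope arithmetic only.

```python
def weights_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GiB (bf16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1024**3


ALL_PARAMS = 8_051_232_768  # "all params" from the training log (base model plus LoRA adapter)

one_copy = weights_gib(ALL_PARAMS)
print(f"one copy of the weights:  {one_copy:.1f} GiB")
print(f"two copies (upper bound): {2 * one_copy:.1f} GiB")
```

One copy comes to roughly 15 GiB, so two full copies would take about 30 GiB before counting activations.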

## Step 2: Prepare the dataset

### Understand the dataset
This tutorial uses `kto_en_demo`, a unary-preference dataset for KTO. Each record contains a multi-turn ShareGPT-style dialogue with a **binary label** indicating whether the modeled behavior is preferred.

This dataset contains:
- `messages`: Turn-by-turn chat between a user and the assistant.
- `label`: A boolean (`true` or `false`) indicating whether the example is preferred.

**Note:** To maintain role alignment in ShareGPT format, you must follow a strict turn order: `human` and `observation` (tool output) turns must appear in odd-numbered positions, while `gpt` and `function_call` turns must appear in even-numbered positions. In `kto_en_demo`, the `user` and `assistant` roles fill the `human` and `gpt` slots. The model learns to generate the content in the `gpt` and `function_call` turns.

**Dataset example**
```json
{
"messages": [
    { "role": "user", "content": "Compare and contrast the roles of the hippocampus and the prefrontal cortex..." },
    { "role": "assistant", "content": "The human brain is a highly complex organ, responsible for a myriad of cognitive functions..." },
    { "role": "user", "content": "Discuss the mechanisms through which the prefrontal cortex ..." },
    { "role": "assistant", "content": "The prefrontal cortex (PFC)..." },
    { "role": "user", "content": "Can you elaborate on the role of the amygdala..." },
    { "role": "assistant", "content": "The amygdala plays a crucial role in the emotional processing of stored memories..." }
],
"label": true
}
```
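
The turn-order rule and label type can be checked before training. The following is a minimal sketch, assuming records use only the `user` and `assistant` roles shown above (no system, tool, or function-call turns); `validate_kto_record` is a hypothetical helper, not part of LLaMA-Factory.

```python
def validate_kto_record(record: dict) -> list[str]:
    """Return a list of problems found in one KTO record (empty if none)."""
    problems = []
    messages = record.get("messages", [])
    if not messages:
        problems.append("no messages")
    for i, msg in enumerate(messages):
        # Odd-numbered positions (1st, 3rd, ...) are user turns; even ones are assistant turns.
        expected = "user" if i % 2 == 0 else "assistant"
        if msg.get("role") != expected:
            problems.append(f"turn {i + 1}: expected {expected!r}, got {msg.get('role')!r}")
    if messages and messages[-1].get("role") != "assistant":
        problems.append("last turn must come from the assistant")
    if not isinstance(record.get("label"), bool):
        problems.append("label must be a boolean")
    return problems


example = {
    "messages": [
        {"role": "user", "content": "Question"},
        {"role": "assistant", "content": "Answer"},
    ],
    "label": True,
}
print(validate_kto_record(example))
```

An empty list means the record passed every check in this sketch.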

### Register the dataset

To specify new datasets that are accessible across Ray worker nodes, you must first add a **`dataset_info.json`** to **[storage shared across nodes](https://docs.anyscale.com/configuration/storage#shared)** such as `/mnt/cluster_storage`. This configuration file acts as a central registry for all your datasets. It maps a custom name to your dataset file location, format, and column structure. 

If you plan to run KTO post-training on the `kto_en_demo` dataset, first complete the setup steps below. Ensure that you place the dataset files in a storage location that all workers can access (for example, a shared mount or object storage). Avoid storing large files on the head node. 

`dataset_info.json`

- `kto_tag` maps the unary preference label used by KTO.
- `tags` helps the loader interpret role/content fields in ShareGPT-style records.

```json
{
  "my_kto_en_demo": {
    "file_name": "/mnt/cluster_storage/kto_en_demo.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "kto_tag": "label"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  }
}
```

For a more detailed dataset preparation and formatting guide, see [Choose your data format](https://docs.anyscale.com/llm/fine-tuning/data-preparation#kto).
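
Before launching a run, it can help to confirm that the registry entry resolves to a readable file and to look at the label balance. The sketch below assumes the dataset file is a single JSON array of records and that the registry has the shape shown above; `count_kto_labels` is a hypothetical helper name.

```python
import json
from collections import Counter
from pathlib import Path


def count_kto_labels(dataset_dir: str, name: str) -> Counter:
    """Resolve `name` through dataset_info.json and count the KTO labels."""
    root = Path(dataset_dir)
    info = json.loads((root / "dataset_info.json").read_text())[name]

    # `file_name` may be absolute (as in this guide) or relative to dataset_dir.
    data_path = Path(info["file_name"])
    if not data_path.is_absolute():
        data_path = root / data_path

    label_key = info["columns"]["kto_tag"]
    records = json.loads(data_path.read_text())
    return Counter(bool(record[label_key]) for record in records)
```

For example, `count_kto_labels("/mnt/cluster_storage", "my_kto_en_demo")` returns a `Counter` keyed by `True` and `False` once both files are in place.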

In [2]:
%%bash
# Make sure all files are accessible to worker nodes
# Create a copy of the data in /mnt/cluster_storage
wget https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/datasets/sharegpt/kto_en_demo.json -O /mnt/cluster_storage/kto_en_demo.json
# Create a copy of the dataset registry in /mnt/cluster_storage
cp ../dataset-configs/dataset_info.json /mnt/cluster_storage/

--2026-02-08 17:35:46--  https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/datasets/sharegpt/kto_en_demo.json
Resolving anyscale-public-materials.s3.us-west-2.amazonaws.com (anyscale-public-materials.s3.us-west-2.amazonaws.com)... 52.218.216.97, 3.5.84.219, 3.5.82.122, ...
Connecting to anyscale-public-materials.s3.us-west-2.amazonaws.com (anyscale-public-materials.s3.us-west-2.amazonaws.com)|52.218.216.97|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 913519 (892K) [application/json]
Saving to: ‘/mnt/cluster_storage/kto_en_demo.json’

     0K .......... .......... .......... .......... ..........  5%  264K 3s
    50K .......... .......... .......... .......... .......... 11%  327K 3s
   100K .......... .......... .......... .......... .......... 16%  429K 2s
   150K .......... .......... .......... .......... .......... 22%  604K 2s
   200K .......... .......... .......... .......... .......... 28% 99.8M 1s
   250K ...

## Step 3: Create the preference-tuning config (KTO and LoRA)

Create a YAML file that defines your **KTO** run. It specifies the base model, dataset, **LoRA** settings, KTO hyperparameters, optional acceleration methods, logging, and Ray cluster resources.

**Important notes:**
- **Acceleration libraries:** `liger-kernel` can reduce VRAM use and improve throughput across multiple transformer ops, but actual speed and memory gains vary with GPU architecture, sequence length, batch size, precision, and kernel availability. Benchmark your training workloads to confirm improvements.
- **Access and paths:** The YAML only needs to be on the **head node**, but any referenced paths (for example, `dataset_dir`, `ray_storage_path`, `output_dir`) must be on **shared storage** (such as `/mnt/cluster_storage/`) visible to all workers.
- **Gated models:** If your base model has gated access on Hugging Face, set `HF_TOKEN` in the runtime environment.
- **Memory tips:** If VRAM is tight, consider switching to [QLoRA](https://github.com/ray-project/ray/blob/master/doc/source/ray-overview/examples/llamafactory-llm-fine-tune/notebooks/dpo_qlora.ipynb) (4/8-bit) and adding the corresponding quantization keys.

### Configure LLaMA-Factory with Ray

**Note**: To customize the training configuration, edit `train-configs/kto_lora.yaml`. 

```yaml
# kto_lora.yaml

### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
trust_remote_code: true

### method
stage: kto
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
pref_beta: 0.1

### acceleration methods
enable_liger_kernel: true  # Reduce VRAM and improve throughput across multiple transformer ops

### dataset
dataset: my_kto_en_demo
dataset_dir: /mnt/cluster_storage

template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: llama3_8b_lora_kto
logging_steps: 5
save_steps: 50
plot_loss: true
overwrite_output_dir: true
report_to: mlflow   # or none

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
num_train_epochs: 3.0  # Low for demo purpose; adjust as needed
learning_rate: 5.0e-6
bf16: true
lr_scheduler_type: cosine
warmup_ratio: 0.1
ddp_timeout: 180000000

### ray
ray_run_name: llama3_8b_kto_lora
ray_storage_path: /mnt/cluster_storage/
ray_num_workers: 4
resources_per_worker:
  GPU: 1
  anyscale/accelerator_shape:4xL40S: 0.001  # Pin a specific node shape
  # accelerator_type:L40S: 0.001            # or just request a GPU type

ray_init_kwargs:
  runtime_env:
    env_vars:
      # If using gated models like meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your_huggingface_token>
      # Enable faster downloads if hf_transfer is installed:
      HF_HUB_ENABLE_HF_TRANSFER: '1'
      # If using mlflow for experiment tracking
      MLFLOW_TRACKING_URI: "https://<your_cloud_id>.cloud.databricks.com"
      MLFLOW_TRACKING_TOKEN: "<mlflow_tracking_token>"
      MLFLOW_EXPERIMENT_NAME: "/Users/<your_user_id>/experiment_name"
```
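
A few numbers in this config can be sanity-checked by hand against the training log. The sketch below assumes the Llama 3 8B layer dimensions from the model's `config.json` (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers) and that `lora_target: all` adapts the seven linear modules listed in the log. The step-count rounding is also an assumption; it can differ between Trainer versions.

```python
import math

# Assumed Llama 3 8B dimensions (taken from the model's config.json).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 1024, 32

# (in_features, out_features) of each linear module adapted by `lora_target: all`.
MODULES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank: int) -> int:
    """Each adapted module adds an (in x rank) and a (rank x out) matrix."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in MODULES.values())
    return per_layer * LAYERS


def effective_batch_size(num_workers: int, per_device_bs: int, grad_accum: int) -> int:
    return num_workers * per_device_bs * grad_accum


def total_steps(num_examples: int, num_workers: int, per_device_bs: int,
                grad_accum: int, epochs: int) -> int:
    """Optimizer steps, assuming a final partial accumulation window still counts."""
    batches_per_epoch = math.ceil(num_examples / (num_workers * per_device_bs))
    return math.ceil(batches_per_epoch / grad_accum) * epochs


print(lora_params(8))                   # adapter parameters for lora_rank: 8
print(effective_batch_size(4, 1, 2))    # global batch size
print(total_steps(300, 4, 1, 2, 3))     # optimizer steps over 3 epochs
```

With these assumptions, the three values are 20,971,520 trainable parameters, a total batch size of 8, and 114 steps, which match the "trainable params", "Total train batch size", and progress-bar totals in the training log for the 300-example dataset.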

## Step 4: Train and monitor

**Note**: For gated models such as [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), ensure that you accept the license agreement for the models on the Hugging Face site and set `HF_TOKEN` in the runtime environment. If you installed MLflow, configure its credentials. Otherwise, set `report_to: none` in `kto_lora.yaml` to avoid `api_token not set` errors.

With all configurations in place, you can launch fine-tuning or post-training in one of two ways:

### Option A: Run from a workspace (quick start)

The `USE_RAY=1` prefix tells LLaMA-Factory to run in distributed mode on the Ray cluster attached to your workspace.

In [3]:
%%bash
USE_RAY=1 llamafactory-cli train ../train-configs/kto_lora.yaml

[2026-02-08 17:36:00,899] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cpu (auto detect)
INFO 02-08 17:36:03 [__init__.py:248] No platform detected, vLLM is running on UnspecifiedPlatform


2026-02-08 17:36:05,301	INFO worker.py:1747 -- Connecting to existing Ray cluster at address: 10.128.4.189:6379...
2026-02-08 17:36:05,313	INFO worker.py:1918 -- Connected to Ray cluster. View the dashboard at https://session-c1mvc6t862zj4fbguuknngnrgv.i.anyscaleuserdata.com
2026-02-08 17:36:05,315	INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_873ed4d0505528fa538926e1170e8ec9f1d45599.zip' (0.29MiB) to Ray cluster...
2026-02-08 17:36:05,316	INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_873ed4d0505528fa538926e1170e8ec9f1d45599.zip'.



View detailed results here: /mnt/cluster_storage/llama3_8b_kto_lora
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2026-02-08_15-34-45_799476_185/artifacts/2026-02-08_17-36-05/llama3_8b_kto_lora/driver_artifacts`
(TrainTrainable pid=4037, ip=10.128.6.27) [2026-02-08 17:36:13,659] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cpu (auto detect)

Training started with configuration:
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Training config                                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_loop_config/args/bf16                                                                             True │
│ train_loop_config/args/cutoff_len                                                  

(RayTrainWorker pid=4171, ip=10.128.6.27) Setting up process group for: env:// [rank=0, world_size=4]
(TorchTrainer pid=4037, ip=10.128.6.27) Started distributed worker processes: 
(TorchTrainer pid=4037, ip=10.128.6.27) - (node_id=b4b2552a8e4226a8a0d5e49624b9a465a3e4259cfbe479ced7e89233, ip=10.128.6.27, pid=4171) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=4037, ip=10.128.6.27) - (node_id=b4b2552a8e4226a8a0d5e49624b9a465a3e4259cfbe479ced7e89233, ip=10.128.6.27, pid=4172) world_rank=1, local_rank=1, node_rank=0
(TorchTrainer pid=4037, ip=10.128.6.27) - (node_id=b4b2552a8e4226a8a0d5e49624b9a465a3e4259cfbe479ced7e89233, ip=10.128.6.27, pid=4173) world_rank=2, local_rank=2, node_rank=0
(TorchTrainer pid=4037, ip=10.128.6.27) - (node_id=b4b2552a8e4226a8a0d5e49624b9a465a3e4259cfbe479ced7e89233, ip=10.128.6.27, pid=4170) world_rank=3, local_rank=3, node_rank=0


(RayTrainWorker pid=4172, ip=10.128.6.27) [2026-02-08 17:36:25,494] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|2026-02-08 17:36:29] llamafactory.hparams.parser:143 >> Set `ddp_find_unused_parameters` to False in DDP training since LoRA is enabled.
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|2026-02-08 17:36:29] llamafactory.hparams.parser:406 >> Process rank: 0, world size: 4, device: cuda:0, distributed training: True, compute dtype: torch.bfloat16


(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|tokenization_utils_base.py:2023] 2026-02-08 17:36:31,498 >> loading file tokenizer.json from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/tokenizer.json
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|tokenization_utils_base.py:2023] 2026-02-08 17:36:31,498 >> loading file tokenizer.model from cache at None
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|tokenization_utils_base.py:2023] 2026-02-08 17:36:31,499 >> loading file added_tokens.json from cache at None
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|tokenization_utils_base.py:2023] 2026-02-08 17:36:31,499 >> loading file special_tokens_map.json from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/special_tokens_map.json
(RayTrainWorker pid=4171, ip=10.128.6.2

(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|2026-02-08 17:36:33] llamafactory.data.template:143 >> Add pad token: <|eot_id|>
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|2026-02-08 17:36:33] llamafactory.data.template:143 >> Add <|eom_id|> to stop words.
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|2026-02-08 17:36:33] llamafactory.data.loader:143 >> Loading dataset /mnt/cluster_storage/kto_en_demo.json...
(RayTrainWorker pid=4171, ip=10.128.6.27) [2026-02-08 17:36:25,516] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)


(RayTrainWorker pid=4171, ip=10.128.6.27) Setting num_proc from 16 back to 1 for the train split to disable multiprocessing as it only contains one shard.
Generating train split: 300 examples [00:00, 16526.02 examples/s]
Converting format of dataset (num_proc=16):   0%|          | 0/300 [00:00<?, ? examples/s]
Converting format of dataset (num_proc=16):  76%|███████▌  | 228/300 [00:00<00:00, 2273.80 examples/s]
Converting format of dataset (num_proc=16): 100%|██████████| 300/300 [00:00<00:00, 1488.04 examples/s]
Running tokenizer on dataset (num_proc=16):   0%|          | 0/300 [00:00<?, ? examples/s]
Running tokenizer on dataset (num_proc=16):   6%|▋         | 19/300 [00:00<00:14, 19.77 examples/s]
Running tokenizer on dataset (num_proc=16):  13%|█▎        | 38/300 [00:01<00:06, 39.26 examples/s]
Running tokenizer on dataset (num_proc=16):  19%|█▉        | 57/300 [00:01<00:04, 57.42 examples/s]
Running tokenizer on dataset (num_proc=16):  25%|██▌       | 76/300 [00:01<00:03, 

(RayTrainWorker pid=4171, ip=10.128.6.27) training example:
(RayTrainWorker pid=4171, ip=10.128.6.27) input_ids:
(RayTrainWorker pid=4171, ip=10.128.6.27) [128000, 128006, 882, 128007, 271, 791, 12411, 17657, 9849, 374, 2133, 1306, 21930, 1698, 11, 9660, 315, 6500, 2082, 430, 4685, 5694, 389, 279, 19002, 315, 8191, 3932, 311, 3839, 477, 24927, 872, 2930, 7640, 627, 48, 25, 16299, 374, 279, 1888, 12399, 315, 420, 4652, 5380, 38053, 701, 4320, 505, 512, 4444, 570, 4435, 198, 5462, 570, 13482, 198, 3100, 570, 8184, 198, 5549, 570, 10170, 17146, 4842, 198, 40, 1781, 279, 4320, 374, 128009, 128006, 78191, 128007, 271, 46, 2319, 297, 2319, 23128, 23128, 0, 353, 70, 343, 3491, 9, 6914, 757, 1781, 1131, 507, 2319, 297, 2319, 23128, 23128, 0, 353, 70, 343, 3491, 9, 578, 1888, 12399, 315, 420, 4652, 374, 1131, 353, 3696, 372, 1119, 9, 1131, 423, 0, 10170, 17146, 4842, 0, 816, 352, 0, 353, 6263, 29037, 9, 578, 12411, 17657, 9849, 374, 7556, 922, 21930, 1698, 11, 902, 37

(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|configuration_utils.py:698] 2026-02-08 17:36:39,153 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/config.json
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|configuration_utils.py:770] 2026-02-08 17:36:39,154 >> Model config LlamaConfig {
(RayTrainWorker pid=4171, ip=10.128.6.27)   "architectures": [
(RayTrainWorker pid=4171, ip=10.128.6.27)     "LlamaForCausalLM"
(RayTrainWorker pid=4171, ip=10.128.6.27)   ],
(RayTrainWorker pid=4171, ip=10.128.6.27)   "attention_bias": false,
(RayTrainWorker pid=4171, ip=10.128.6.27)   "attention_dropout": 0.0,
(RayTrainWorker pid=4171, ip=10.128.6.27)   "bos_token_id": 128000,
(RayTrainWorker pid=4171, ip=10.128.6.27)   "eos_token_id": 128009,
(RayTrainWorker pid=4171, ip=10.128.6

(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|2026-02-08 17:36:39] llamafactory.model.model_utils.kv_cache:143 >> KV cache is disabled during training.
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|2026-02-08 17:36:39] llamafactory.model.model_utils.liger_kernel:143 >> Current training stage does not support chunked cross entropy.
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|2026-02-08 17:36:39] llamafactory.model.model_utils.liger_kernel:143 >> Liger kernel has been applied to the model.


(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|modeling_utils.py:1151] 2026-02-08 17:36:40,010 >> loading weights file model.safetensors from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/model.safetensors.index.json
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|modeling_utils.py:2241] 2026-02-08 17:37:01,704 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|configuration_utils.py:1135] 2026-02-08 17:37:01,708 >> Generate config GenerationConfig {
(RayTrainWorker pid=4171, ip=10.128.6.27)   "bos_token_id": 128000,
(RayTrainWorker pid=4171, ip=10.128.6.27)   "eos_token_id": 128009,
(RayTrainWorker pid=4171, ip=10.128.6.27)   "use_cache": false
(RayTrainWorker pid=4171, ip=10.128.6.27) }
Loading checkpoint sh

(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|2026-02-08 17:37:07] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|2026-02-08 17:37:07] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|2026-02-08 17:37:07] llamafactory.model.adapter:143 >> Upcasting trainable params to float32.
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|2026-02-08 17:37:07] llamafactory.model.adapter:143 >> Fine-tuning method: LoRA
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|2026-02-08 17:37:07] llamafactory.model.model_utils.misc:143 >> Found linear modules: down_proj,v_proj,up_proj,o_proj,q_proj,gate_proj,k_proj


Loading checkpoint shards:  75%|███████▌  | 3/4 [00:05<00:01,  1.76s/it] [repeated 11x across cluster]


(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|2026-02-08 17:37:08] llamafactory.model.loader:143 >> trainable params: 20,971,520 || all params: 8,051,232,768 || trainable%: 0.2605


(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|trainer.py:756] 2026-02-08 17:37:08,992 >> Using auto half precision backend
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|trainer.py:2409] 2026-02-08 17:37:09,550 >> ***** Running training *****
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|trainer.py:2410] 2026-02-08 17:37:09,550 >>   Num examples = 300
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|trainer.py:2411] 2026-02-08 17:37:09,550 >>   Num Epochs = 3
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|trainer.py:2412] 2026-02-08 17:37:09,550 >>   Instantaneous batch size per device = 1
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|trainer.py:2415] 2026-02-08 17:37:09,550 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|trainer.py:2416] 2026-02-08 17:37:09,550 >>   Gradient Accumulation steps = 2
(RayTrainWorker pid=4171, ip=10.128.6.27)

(RayTrainWorker pid=4171, ip=10.128.6.27) {'loss': 0.4957, 'grad_norm': 3.1498160362243652, 'learning_rate': 1.6666666666666667e-06, 'rewards/chosen': 0.019817462989262173, 'logps/chosen': -413.7041015625, 'logits/chosen': -23078601.14285714, 'rewards/rejected': -0.10288289189338684, 'logps/rejected': -970.846435546875, 'logits/rejected': -29555157.333333332, 'rewards/margins': 0.12270035488264902, 'kl': 0.3308525085449219, 'epoch': 0.13}


  5%|▌         | 6/114 [00:11<02:46,  1.54s/it]
  6%|▌         | 7/114 [00:12<02:41,  1.51s/it]
  7%|▋         | 8/114 [00:13<02:36,  1.47s/it]
  8%|▊         | 9/114 [00:15<02:29,  1.42s/it]
  9%|▉         | 10/114 [00:16<02:22,  1.37s/it]


(RayTrainWorker pid=4171, ip=10.128.6.27) {'loss': 0.5045, 'grad_norm': 3.2115280628204346, 'learning_rate': 3.7500000000000005e-06, 'rewards/chosen': 0.008976592123508454, 'logps/chosen': -376.96708984375, 'logits/chosen': -37118019.2, 'rewards/rejected': 0.02862915098667145, 'logps/rejected': -346.9309814453125, 'logits/rejected': -27582691.2, 'rewards/margins': -0.019652558863162993, 'kl': 1.0791082382202148, 'epoch': 0.27}


 10%|▉         | 11/114 [00:17<02:20,  1.36s/it]
 11%|█         | 12/114 [00:19<02:13,  1.31s/it]
 11%|█▏        | 13/114 [00:20<02:15,  1.34s/it]
 12%|█▏        | 14/114 [00:21<02:06,  1.26s/it]
 13%|█▎        | 15/114 [00:22<02:02,  1.24s/it]


(RayTrainWorker pid=4171, ip=10.128.6.27) {'loss': 0.4998, 'grad_norm': 2.4262821674346924, 'learning_rate': 4.995258321842611e-06, 'rewards/chosen': -0.007094318668047587, 'logps/chosen': -185.7591552734375, 'logits/chosen': -7981500.0, 'rewards/rejected': 0.012033191110406603, 'logps/rejected': -137.62447684151786, 'logits/rejected': -7264017.714285715, 'rewards/margins': -0.01912750977845419, 'kl': 1.341604232788086, 'epoch': 0.4}


 14%|█▍        | 16/114 [00:23<01:58,  1.21s/it]
 15%|█▍        | 17/114 [00:25<01:54,  1.18s/it]
 16%|█▌        | 18/114 [00:26<01:54,  1.19s/it]
 17%|█▋        | 19/114 [00:27<01:57,  1.24s/it]
 18%|█▊        | 20/114 [00:28<01:59,  1.28s/it]


(RayTrainWorker pid=4171, ip=10.128.6.27) {'loss': 0.5031, 'grad_norm': 2.7462239265441895, 'learning_rate': 4.942120794399002e-06, 'rewards/chosen': -0.025224685668945312, 'logps/chosen': -362.2128092447917, 'logits/chosen': -32110208.0, 'rewards/rejected': 0.07098617404699326, 'logps/rejected': -414.457275390625, 'logits/rejected': -10392526.0, 'rewards/margins': -0.09621085971593857, 'kl': 0.2899312973022461, 'epoch': 0.53}


 18%|█▊        | 21/114 [00:30<01:59,  1.29s/it]
 19%|█▉        | 22/114 [00:31<02:00,  1.31s/it]
 20%|██        | 23/114 [00:32<01:56,  1.28s/it]
 21%|██        | 24/114 [00:34<01:56,  1.29s/it]
 22%|██▏       | 25/114 [00:35<01:56,  1.31s/it]


(RayTrainWorker pid=4171, ip=10.128.6.27) {'loss': 0.4917, 'grad_norm': 3.474839448928833, 'learning_rate': 4.83118057351089e-06, 'rewards/chosen': 0.011423779651522636, 'logps/chosen': -415.261474609375, 'logits/chosen': -15809656.0, 'rewards/rejected': 0.0009204866364598274, 'logps/rejected': -133.61837768554688, 'logits/rejected': -4753514.0, 'rewards/margins': 0.010503293015062809, 'kl': 0.49626922607421875, 'epoch': 0.67}


 23%|██▎       | 26/114 [00:36<01:52,  1.28s/it]
 24%|██▎       | 27/114 [00:37<01:50,  1.26s/it]
 25%|██▍       | 28/114 [00:39<01:50,  1.28s/it]
 25%|██▌       | 29/114 [00:40<01:48,  1.28s/it]
 26%|██▋       | 30/114 [00:41<01:47,  1.28s/it]


(RayTrainWorker pid=4171, ip=10.128.6.27) {'loss': 0.4966, 'grad_norm': 3.14445424079895, 'learning_rate': 4.665063509461098e-06, 'rewards/chosen': -0.00511518990000089, 'logps/chosen': -320.1336669921875, 'logits/chosen': -43194530.666666664, 'rewards/rejected': -0.008688926696777344, 'logps/rejected': -399.90087890625, 'logits/rejected': -20492260.0, 'rewards/margins': 0.003573736796776454, 'kl': 0.44983482360839844, 'epoch': 0.8}


 27%|██▋       | 31/114 [00:43<01:44,  1.26s/it]
 28%|██▊       | 32/114 [00:44<01:42,  1.25s/it]
 29%|██▉       | 33/114 [00:45<01:43,  1.27s/it]
 30%|██▉       | 34/114 [00:46<01:43,  1.30s/it]
 31%|███       | 35/114 [00:48<01:43,  1.31s/it]


(RayTrainWorker pid=4171, ip=10.128.6.27) {'loss': 0.5011, 'grad_norm': 3.3660974502563477, 'learning_rate': 4.447701436314176e-06, 'rewards/chosen': -0.0157562255859375, 'logps/chosen': -319.4732177734375, 'logits/chosen': -2980669.8, 'rewards/rejected': -0.02387329190969467, 'logps/rejected': -322.132763671875, 'logits/rejected': 1842462.4, 'rewards/margins': 0.008117066323757173, 'kl': 0.14889907836914062, 'epoch': 0.93}


 32%|███▏      | 36/114 [00:49<01:41,  1.30s/it]
 32%|███▏      | 37/114 [00:50<01:39,  1.30s/it]
 33%|███▎      | 38/114 [00:51<01:24,  1.11s/it]
 34%|███▍      | 39/114 [00:52<01:27,  1.16s/it]
 35%|███▌      | 40/114 [00:54<01:27,  1.18s/it]


(RayTrainWorker pid=4171, ip=10.128.6.27) {'loss': 0.4468, 'grad_norm': 2.8790392875671387, 'learning_rate': 4.184239109116393e-06, 'rewards/chosen': -0.044304912288983665, 'logps/chosen': -240.10538736979166, 'logits/chosen': -22431341.333333332, 'rewards/rejected': 0.013453165690104166, 'logps/rejected': -928.1993815104166, 'logits/rejected': -30207528.0, 'rewards/margins': -0.05775807797908783, 'kl': 0.40143871307373047, 'epoch': 1.05}


 36%|███▌      | 41/114 [00:55<01:28,  1.21s/it]
 37%|███▋      | 42/114 [00:56<01:27,  1.21s/it]
 38%|███▊      | 43/114 [00:57<01:28,  1.25s/it]
 39%|███▊      | 44/114 [00:59<01:28,  1.27s/it]
 39%|███▉      | 45/114 [01:00<01:27,  1.27s/it]


(RayTrainWorker pid=4171, ip=10.128.6.27) {'loss': 0.4913, 'grad_norm': 3.927604913711548, 'learning_rate': 3.880912432401265e-06, 'rewards/chosen': -0.018583680192629497, 'logps/chosen': -118.12808227539062, 'logits/chosen': -6146621.333333333, 'rewards/rejected': -0.049512343747275214, 'logps/rejected': -311.44423130580356, 'logits/rejected': -4062452.285714286, 'rewards/margins': 0.030928663554645717, 'kl': 0.0, 'epoch': 1.19}


 40%|████      | 46/114 [01:01<01:28,  1.30s/it]
 41%|████      | 47/114 [01:02<01:23,  1.25s/it]
 42%|████▏     | 48/114 [01:04<01:24,  1.27s/it]
 43%|████▎     | 49/114 [01:05<01:23,  1.29s/it]
 44%|████▍     | 50/114 [01:06<01:22,  1.28s/it]
[INFO|trainer.py:3993] 2026-02-08 17:38:16,507 >> Saving model checkpoint to llama3_8b_lora_kto/checkpoint-50


(RayTrainWorker pid=4171, ip=10.128.6.27) {'loss': 0.4908, 'grad_norm': 2.7847020626068115, 'learning_rate': 3.544900862216959e-06, 'rewards/chosen': -0.015361366527421134, 'logps/chosen': -232.72154017857142, 'logits/chosen': -5122232.0, 'rewards/rejected': -0.0729100505510966, 'logps/rejected': -227.35882568359375, 'logits/rejected': -17326696.0, 'rewards/margins': 0.05754868402367547, 'kl': 1.5837535858154297, 'epoch': 1.32}


(RayTrainWorker pid=4170, ip=10.128.6.27) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/llama3_8b_kto_lora/TorchTrainer_b359e_00000_0_2026-02-08_17-36-05/checkpoint_000000)
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|configuration_utils.py:698] 2026-02-08 17:38:16,731 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/config.json
(RayTrainWorker pid=4171, ip=10.128.6.27) [INFO|configuration_utils.py:770] 2026-02-08 17:38:16,732 >> Model config LlamaConfig {
(RayTrainWorker pid=4171, ip=10.128.6.27)   "architectures": [
(RayTrainWorker pid=4171, ip=10.128.6.27)     "LlamaForCausalLM"
(RayTrainWorker pid=4171, ip=10.128.6.27)   ],
(RayTrainWorker pid=4171, ip=10.128.6.27)   "attention_bias": false,
(RayTrainWorker pid=4171, ip=10.128.6.27)


Training finished iteration 1 at 2026-02-08 17:38:19. Total running time: 2min 14s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000000 │
│ time_this_iter_s              123.13147 │
│ time_total_s                  123.13147 │
│ training_iteration                    1 │
│ epoch                              1.32 │
│ grad_norm                        2.7847 │
│ kl                              1.58375 │
│ learning_rate                        0. │
│ logits/chosen                 -5122232. │
│ logits/rejected              -17326696. │
│ logps/chosen                 -232.72154 │
│ logps/rejected               -227.35883 │
│ loss                             0.4908 │
│ rewards/chosen                 -0.01536 │
│ rewards/margins                 0.05755 │
│ rewards/rejected               -0.07291 │
│ step                                 50 │
╰─────────────────────────────────────────╯

 45%|████▍     | 51/114 [01:11<02:25,  2.31s/it]
 46%|████▌     | 52/114 [01:12<02:04,  2.01s/it]
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/llama3_8b_kto_lora/TorchTrainer_b359e_00000_0_2026-02-08_17-36-05/checkpoint_000000)[32m [repeated 3x across cluster][0m
 46%|████▋     | 53/114 [01:14<01:49,  1.80s/it]
 47%|████▋     | 54/114 [01:15<01:39,  1.66s/it]
 48%|████▊     | 55/114 [01:16<01:31,  1.55s/it]


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m {'loss': 0.4962, 'grad_norm': 3.4667067527770996, 'learning_rate': 3.184157475180208e-06, 'rewards/chosen': -0.009737288313252586, 'logps/chosen': -425.35477120535717, 'logits/chosen': -25772352.0, 'rewards/rejected': -0.07607295115788777, 'logps/rejected': -427.5118001302083, 'logits/rejected': 8856781.333333334, 'rewards/margins': 0.06633566284463518, 'kl': 1.3006458282470703, 'epoch': 1.45}


 49%|████▉     | 56/114 [01:18<01:25,  1.48s/it]
 50%|█████     | 57/114 [01:19<01:22,  1.44s/it]
 51%|█████     | 58/114 [01:20<01:19,  1.42s/it]
 52%|█████▏    | 59/114 [01:21<01:11,  1.31s/it]
 53%|█████▎    | 60/114 [01:23<01:10,  1.30s/it]


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m {'loss': 0.4845, 'grad_norm': 4.169820785522461, 'learning_rate': 2.8072207266617856e-06, 'rewards/chosen': 0.02910013198852539, 'logps/chosen': -160.93770751953124, 'logits/chosen': -5199646.0, 'rewards/rejected': -0.06544250845909119, 'logps/rejected': -299.449169921875, 'logits/rejected': -15725676.8, 'rewards/margins': 0.09454264044761658, 'kl': 0.0, 'epoch': 1.59}


 54%|█████▎    | 61/114 [01:24<01:08,  1.30s/it]
 54%|█████▍    | 62/114 [01:25<01:08,  1.31s/it]
 55%|█████▌    | 63/114 [01:27<01:06,  1.31s/it]
 56%|█████▌    | 64/114 [01:28<01:03,  1.28s/it]
 57%|█████▋    | 65/114 [01:29<01:02,  1.28s/it]


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m {'loss': 0.4793, 'grad_norm': 3.130155086517334, 'learning_rate': 2.4230123536095746e-06, 'rewards/chosen': 0.01244094967842102, 'logps/chosen': -264.1098876953125, 'logits/chosen': -38034163.2, 'rewards/rejected': -0.06333073377609252, 'logps/rejected': -439.5146484375, 'logits/rejected': -38250316.8, 'rewards/margins': 0.07577168345451354, 'kl': 0.34583091735839844, 'epoch': 1.72}


 58%|█████▊    | 66/114 [01:30<00:59,  1.24s/it]
 59%|█████▉    | 67/114 [01:32<00:59,  1.26s/it]
 60%|█████▉    | 68/114 [01:33<00:57,  1.25s/it]
 61%|██████    | 69/114 [01:34<00:58,  1.29s/it]
 61%|██████▏   | 70/114 [01:36<00:56,  1.29s/it]


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m {'loss': 0.4938, 'grad_norm': 3.2651515007019043, 'learning_rate': 2.040626205458574e-06, 'rewards/chosen': 0.03988765552639961, 'logps/chosen': -339.7088928222656, 'logits/chosen': -47686432.0, 'rewards/rejected': 0.0029575349763035774, 'logps/rejected': -67.38799285888672, 'logits/rejected': 4996398.5, 'rewards/margins': 0.036930120550096035, 'kl': 0.0, 'epoch': 1.85}


 62%|██████▏   | 71/114 [01:37<00:54,  1.27s/it]
 63%|██████▎   | 72/114 [01:38<00:53,  1.27s/it]
 64%|██████▍   | 73/114 [01:39<00:52,  1.27s/it]
 65%|██████▍   | 74/114 [01:41<00:50,  1.27s/it]
 66%|██████▌   | 75/114 [01:42<00:47,  1.22s/it]


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m {'loss': 0.4749, 'grad_norm': 3.6366844177246094, 'learning_rate': 1.6691130013008514e-06, 'rewards/chosen': -0.008752670884132386, 'logps/chosen': -152.84658203125, 'logits/chosen': -27358192.0, 'rewards/rejected': -0.103485107421875, 'logps/rejected': -497.47470703125, 'logits/rejected': -23272065.6, 'rewards/margins': 0.09473243653774262, 'kl': 0.018812179565429688, 'epoch': 1.99}


 67%|██████▋   | 76/114 [01:42<00:40,  1.07s/it]
 68%|██████▊   | 77/114 [01:44<00:41,  1.12s/it]
 68%|██████▊   | 78/114 [01:45<00:41,  1.14s/it]
 69%|██████▉   | 79/114 [01:46<00:41,  1.17s/it]
 70%|███████   | 80/114 [01:47<00:40,  1.18s/it]


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m {'loss': 0.4234, 'grad_norm': 4.389901161193848, 'learning_rate': 1.3172661079099752e-06, 'rewards/chosen': 0.11917743682861329, 'logps/chosen': -551.941552734375, 'logits/chosen': -40424889.6, 'rewards/rejected': -0.14022178947925568, 'logps/rejected': -545.833251953125, 'logits/rejected': -30141144.0, 'rewards/margins': 0.25939922630786894, 'kl': 0.07353878021240234, 'epoch': 2.11}


 71%|███████   | 81/114 [01:48<00:39,  1.19s/it]
 72%|███████▏  | 82/114 [01:50<00:37,  1.18s/it]
 73%|███████▎  | 83/114 [01:51<00:37,  1.22s/it]
 74%|███████▎  | 84/114 [01:52<00:36,  1.23s/it]
 75%|███████▍  | 85/114 [01:54<00:36,  1.26s/it]


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m {'loss': 0.4875, 'grad_norm': 3.5152552127838135, 'learning_rate': 9.934134090518593e-07, 'rewards/chosen': 0.023151906828085583, 'logps/chosen': -342.3819580078125, 'logits/chosen': -34245232.0, 'rewards/rejected': -0.0966060683131218, 'logps/rejected': -302.7929992675781, 'logits/rejected': 3230285.0, 'rewards/margins': 0.11975797514120738, 'kl': 0.0, 'epoch': 2.24}


 75%|███████▌  | 86/114 [01:55<00:35,  1.28s/it]
 76%|███████▋  | 87/114 [01:56<00:34,  1.26s/it]
 77%|███████▋  | 88/114 [01:57<00:32,  1.25s/it]
 78%|███████▊  | 89/114 [01:58<00:30,  1.22s/it]
 79%|███████▉  | 90/114 [02:00<00:28,  1.19s/it]


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m {'loss': 0.4811, 'grad_norm': 2.9937925338745117, 'learning_rate': 7.052201923388955e-07, 'rewards/chosen': 0.03335037330786387, 'logps/chosen': -193.55106608072916, 'logits/chosen': -6227604.666666667, 'rewards/rejected': -0.11973920890263148, 'logps/rejected': -304.60145786830356, 'logits/rejected': -6605980.571428572, 'rewards/margins': 0.15308958221049535, 'kl': 0.0, 'epoch': 2.37}


 80%|███████▉  | 91/114 [02:01<00:28,  1.24s/it]
 81%|████████  | 92/114 [02:02<00:27,  1.27s/it]
 82%|████████▏ | 93/114 [02:04<00:26,  1.28s/it]
 82%|████████▏ | 94/114 [02:05<00:25,  1.26s/it]
 83%|████████▎ | 95/114 [02:06<00:22,  1.20s/it]


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m {'loss': 0.4764, 'grad_norm': 4.578560829162598, 'learning_rate': 4.5950771910944603e-07, 'rewards/chosen': 0.09679057200749715, 'logps/chosen': -452.3634440104167, 'logits/chosen': -20889709.333333332, 'rewards/rejected': -0.1498851776123047, 'logps/rejected': -526.0985107421875, 'logits/rejected': -38741048.0, 'rewards/margins': 0.24667574961980182, 'kl': 0.25988006591796875, 'epoch': 2.51}


 84%|████████▍ | 96/114 [02:07<00:22,  1.25s/it]
 85%|████████▌ | 97/114 [02:08<00:21,  1.24s/it]
 86%|████████▌ | 98/114 [02:10<00:20,  1.27s/it]
 87%|████████▋ | 99/114 [02:11<00:18,  1.25s/it]
 88%|████████▊ | 100/114 [02:12<00:17,  1.22s/it][INFO|trainer.py:3993] 2026-02-08 17:39:22,183 >> Saving model checkpoint to llama3_8b_lora_kto/checkpoint-100


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m {'loss': 0.4731, 'grad_norm': 2.9918088912963867, 'learning_rate': 2.620917716123444e-07, 'rewards/chosen': 0.033857100776263645, 'logps/chosen': -195.83489118303572, 'logits/chosen': -26336368.0, 'rewards/rejected': -0.1773396929105123, 'logps/rejected': -254.9541219075521, 'logits/rejected': -4914320.666666667, 'rewards/margins': 0.21119679368677594, 'kl': 0.0, 'epoch': 2.64}


[36m(RayTrainWorker pid=4170, ip=10.128.6.27)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/llama3_8b_kto_lora/TorchTrainer_b359e_00000_0_2026-02-08_17-36-05/checkpoint_000001)
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m [INFO|configuration_utils.py:698] 2026-02-08 17:39:22,400 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/config.json
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m [INFO|configuration_utils.py:770] 2026-02-08 17:39:22,401 >> Model config LlamaConfig {
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m   "architectures": [
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m     "LlamaForCausalLM"
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m   ],
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m   "attention_bias": false,
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[


Training finished iteration 2 at 2026-02-08 17:39:26. Total running time: 3min 20s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000001 │
│ time_this_iter_s               66.23456 │
│ time_total_s                  189.36603 │
│ training_iteration                    2 │
│ epoch                              2.64 │
│ grad_norm                       2.99181 │
│ kl                                   0. │
│ learning_rate                        0. │
│ logits/chosen                -26336368. │
│ logits/rejected          -4914320.66667 │
│ logps/chosen                 -195.83489 │
│ logps/rejected               -254.95412 │
│ loss                             0.4731 │
│ rewards/chosen                  0.03386 │
│ rewards/margins                  0.2112 │
│ rewards/rejected               -0.17734 │
│ step                                100 │
╰─────────────────────────────────────────╯

 89%|████████▊ | 101/114 [02:17<00:31,  2.40s/it]
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/llama3_8b_kto_lora/TorchTrainer_b359e_00000_0_2026-02-08_17-36-05/checkpoint_000001)[32m [repeated 3x across cluster][0m
 89%|████████▉ | 102/114 [02:18<00:24,  2.04s/it]
 90%|█████████ | 103/114 [02:20<00:20,  1.83s/it]
 91%|█████████ | 104/114 [02:21<00:16,  1.67s/it]
 92%|█████████▏| 105/114 [02:22<00:14,  1.56s/it]


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m {'loss': 0.4848, 'grad_norm': 3.185372829437256, 'learning_rate': 1.1764499893210879e-07, 'rewards/chosen': 0.11968272179365158, 'logps/chosen': -335.6123046875, 'logits/chosen': -26846008.0, 'rewards/rejected': -0.2101486325263977, 'logps/rejected': -351.06842041015625, 'logits/rejected': -6585261.0, 'rewards/margins': 0.3298313543200493, 'kl': 0.0, 'epoch': 2.77}


 93%|█████████▎| 106/114 [02:24<00:12,  1.50s/it]
 94%|█████████▍| 107/114 [02:25<00:10,  1.45s/it]
 95%|█████████▍| 108/114 [02:26<00:08,  1.40s/it]
 96%|█████████▌| 109/114 [02:28<00:06,  1.37s/it]
 96%|█████████▋| 110/114 [02:29<00:05,  1.36s/it]


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m {'loss': 0.4875, 'grad_norm': 3.454695224761963, 'learning_rate': 2.958631979685156e-08, 'rewards/chosen': -0.10164896647135417, 'logps/chosen': -352.5401611328125, 'logits/chosen': -23750568.0, 'rewards/rejected': -0.1265277521950858, 'logps/rejected': -479.50341796875, 'logits/rejected': -38374107.428571425, 'rewards/margins': 0.02487878572373163, 'kl': 1.4368000030517578, 'epoch': 2.91}


 97%|█████████▋| 111/114 [02:30<00:04,  1.34s/it]
 98%|█████████▊| 112/114 [02:32<00:02,  1.34s/it]
 99%|█████████▉| 113/114 [02:33<00:01,  1.33s/it]
100%|██████████| 114/114 [02:34<00:00,  1.15s/it][INFO|trainer.py:3993] 2026-02-08 17:39:43,754 >> Saving model checkpoint to llama3_8b_lora_kto/checkpoint-114
[36m(RayTrainWorker pid=4170, ip=10.128.6.27)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/llama3_8b_kto_lora/TorchTrainer_b359e_00000_0_2026-02-08_17-36-05/checkpoint_000002)
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m [INFO|configuration_utils.py:698] 2026-02-08 17:39:43,968 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/config.json
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m [INFO|configuration_utils.py:770] 2026-02-08 17:39:43,969 >> Model config LlamaConfig {
[36m(RayTra


Training finished iteration 3 at 2026-02-08 17:39:48. Total running time: 3min 42s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000002 │
│ time_this_iter_s               21.82934 │
│ time_total_s                  211.19537 │
│ training_iteration                    3 │
│ epoch                           2.90667 │
│ grad_norm                        3.4547 │
│ kl                               1.4368 │
│ learning_rate                        0. │
│ logits/chosen                -23750568. │
│ logits/rejected         -38374107.42857 │
│ logps/chosen                 -352.54016 │
│ logps/rejected               -479.50342 │
│ loss                             0.4875 │
│ rewards/chosen                 -0.10165 │
│ rewards/margins                 0.02488 │
│ rewards/rejected               -0.12653 │
│ step                                110 │
╰─────────────────────────────────────────╯

[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m [INFO|trainer.py:2676] 2026-02-08 17:39:48,072 >> 
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m 
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m Training completed. Do not forget to share your model on huggingface.co/models =)
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m 
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m 
100%|██████████| 114/114 [02:38<00:00,  1.39s/it]
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m [INFO|trainer.py:3993] 2026-02-08 17:39:48,076 >> Saving model checkpoint to llama3_8b_lora_kto


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m {'train_runtime': 158.5156, 'train_samples_per_second': 5.678, 'train_steps_per_second': 0.719, 'train_loss': 0.48332400907549944, 'epoch': 3.0}


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m [INFO|configuration_utils.py:698] 2026-02-08 17:39:48,292 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/config.json
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m [INFO|configuration_utils.py:770] 2026-02-08 17:39:48,293 >> Model config LlamaConfig {
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m   "architectures": [
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m     "LlamaForCausalLM"
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m   ],
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m   "attention_bias": false,
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m   "attention_dropout": 0.0,
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m   "bos_token_id": 128000,
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m   "eos_token_id": 128009,
[36m(RayTrainWorker pid=4171, ip=10.128.6

[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m ***** train metrics *****
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m   epoch                    =        3.0
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m   total_flos               = 19610724GF
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m   train_loss               =     0.4833
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m   train_runtime            = 0:02:38.51
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m   train_samples_per_second =      5.678
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m   train_steps_per_second   =      0.719
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m Figure saved at: llama3_8b_lora_kto/training_loss.png


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m [INFO|modelcard.py:450] 2026-02-08 17:39:48,782 >> Dropping the following result as it does not have all the necessary fields:
[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}


[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m Figure saved at: llama3_8b_lora_kto/training_rewards_chosen.png

Training completed after 3 iterations at 2026-02-08 17:39:50. Total running time: 3min 44s


2026-02-08 17:39:50,308	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/mnt/cluster_storage/llama3_8b_kto_lora' in 0.0159s.





[36m(RayTrainWorker pid=4171, ip=10.128.6.27)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/llama3_8b_kto_lora/TorchTrainer_b359e_00000_0_2026-02-08_17-36-05/checkpoint_000002)[32m [repeated 3x across cluster][0m

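The per-step metric dictionaries printed by the workers in the output above can be pulled out of a saved log for a quick sanity check. The following is a minimal sketch, not part of LLaMA-Factory or Ray: `parse_metrics` is a hypothetical helper, and `sample` is an abbreviated copy of the step-50 line from this run. It assumes each metrics record is a Python-literal dictionary at the end of a log line, as in this output.

```python
import ast
import re
from typing import Optional

# Matches a trailing "{...}" dictionary at the end of a log line.
METRICS_RE = re.compile(r"\{.*\}\s*$")


def parse_metrics(line: str) -> Optional[dict]:
    """Return the per-step metrics dict from a log line, or None if there is none."""
    match = METRICS_RE.search(line)
    if match is None:
        return None
    try:
        metrics = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None
    # Keep only per-step training records, which carry the KTO reward fields.
    if not isinstance(metrics, dict) or "rewards/margins" not in metrics:
        return None
    return metrics


# Abbreviated copy of the step-50 record from the output above.
sample = (
    "(RayTrainWorker pid=4171, ip=10.128.6.27) "
    "{'loss': 0.4908, 'rewards/chosen': -0.015361366527421134, "
    "'rewards/rejected': -0.0729100505510966, "
    "'rewards/margins': 0.05754868402367547, 'kl': 1.5837535858154297, 'epoch': 1.32}"
)

metrics = parse_metrics(sample)
if metrics is not None:
    gap = metrics["rewards/chosen"] - metrics["rewards/rejected"]
    print(f"epoch {metrics['epoch']}: margin={metrics['rewards/margins']:.5f}, chosen-rejected={gap:.5f}")
```

In this run, the logged `rewards/margins` equals `rewards/chosen` minus `rewards/rejected` at every logging step, so a margin that trends upward means the policy is separating accepted from rejected responses.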

### Option B — Run as an Anyscale job (production)

For longer or production runs, submit the training as an **Anyscale job**. Jobs run outside your interactive session for better stability, retries, and durable logs. You package LLaMA-Factory and other libraries in a container image and launch with a short job config. See [Run LLaMA-Factory as an Anyscale job](https://docs.anyscale.com/llm/fine-tuning/llamafactory-jobs) for the step-by-step guide.

### Tracking with MLflow

If you enabled MLflow logging (`report_to: mlflow` in your YAML), LLaMA-Factory logs metrics (loss, learning rate, etc.), parameters, and artifacts to your configured MLflow tracking server.

**Example YAML snippet:**

```yaml
report_to: mlflow

ray_init_kwargs:
  runtime_env:
    env_vars:
      MLFLOW_TRACKING_URI: "https://<your_cloud_id>.cloud.databricks.com"
      MLFLOW_TRACKING_TOKEN: "<mlflow_tracking_token>"
      MLFLOW_EXPERIMENT_NAME: "/Users/<your_user_id>/experiment_name"
```

**MLflow example**

![MLflow](https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/3.2.3/3.2.3-mlflow.png)

For a more detailed guide on tracking experiments with MLflow and other tools such as Weights & Biases, see [Observability and tracking](https://docs.anyscale.com/llm/fine-tuning/observability-and-tracking).

## Step 5: Locate checkpoints

Ray Train writes checkpoints under `ray_storage_path/ray_run_name`. In this example run, the path is: `/mnt/cluster_storage/llama3_8b_kto_lora`. 

Inside, you see a **trainer session** directory with a name like:
`TorchTrainer_b359e_00000_0_2026-02-08_17-36-05`.

- Ray Train creates `TorchTrainer_*` **when the trainer starts**; the suffix encodes a short run ID and the **start timestamp**.
- Within that directory, Ray Train names checkpoints `checkpoint_000xxx/`, where the zero-padded number is the checkpoint's index in save order, starting at `checkpoint_000000`.

Control the save cadence with `save_strategy` and `save_steps`. For instructions on how to resume interrupted training with `resume_from_checkpoint` and more, see [Understand the artifacts directory](https://docs.anyscale.com/llm/fine-tuning/checkpointing#artifacts-directory).
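The directory layout described above can be navigated programmatically. The following is a minimal sketch using only the Python standard library; `latest_checkpoint` is a hypothetical helper, not a Ray Train API. It assumes that every `TorchTrainer_*` name ends with a `YYYY-MM-DD_HH-MM-SS` start timestamp and that checkpoint indices are zero-padded, so both sort correctly as plain strings.

```python
from pathlib import Path
from typing import Optional

TIMESTAMP_LEN = len("2026-02-08_17-36-05")  # trailing start timestamp in the session name


def latest_checkpoint(run_dir: str) -> Optional[Path]:
    """Return the newest checkpoint_* directory of the newest TorchTrainer_* session."""
    root = Path(run_dir)
    if not root.is_dir():
        return None
    # Pick the session with the latest start timestamp (assumed naming convention).
    sessions = sorted(
        (p for p in root.glob("TorchTrainer_*") if p.is_dir()),
        key=lambda p: p.name[-TIMESTAMP_LEN:],
    )
    if not sessions:
        return None
    # Zero-padded indices sort lexicographically in save order.
    checkpoints = sorted(p for p in sessions[-1].glob("checkpoint_*") if p.is_dir())
    return checkpoints[-1] if checkpoints else None


if __name__ == "__main__":
    # Path from this example run; adjust to your ray_storage_path/ray_run_name.
    print(latest_checkpoint("/mnt/cluster_storage/llama3_8b_kto_lora"))
```

For the run shown in this guide, the helper would resolve to the `checkpoint_000002` directory inside the `TorchTrainer_b359e_*` session, provided no later session exists under the same run name.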

## Step 6: Export the model

If you use LoRA, you can keep the base model and adapters separate for [multi-LoRA deployment](https://docs.anyscale.com/llm/serving/multi-lora) or [merge the adapters](https://docs.anyscale.com/llm/fine-tuning/checkpointing#merge-lora) into the base model for low-latency inference. 
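As an illustration of the merge path, the sketch below writes a small export config that you could pass to `llamafactory-cli export`. Treat it as a starting point under stated assumptions rather than the documented procedure: `write_merge_config` is a hypothetical helper, the key names follow the LoRA merge example shipped with LLaMA-Factory as best understood here, the sketch assumes the CLI accepts a `.json` config as well as YAML, and the adapter and export paths are placeholders. Check all of these against the linked merge guide and your installed version.

```python
import json
from pathlib import Path


def write_merge_config(adapter_dir: str, export_dir: str,
                       config_path: str = "merge_lora_kto.json") -> Path:
    """Write an export config for merging LoRA adapters into the base model."""
    config = {
        "model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",  # base model from this guide
        "adapter_name_or_path": adapter_dir,  # directory holding the trained LoRA adapter
        "template": "llama3",                 # chat template name (assumed for Llama 3)
        "finetuning_type": "lora",
        "export_dir": export_dir,             # where the merged model is written
        "export_size": 5,                     # assumed: shard size in GB
        "export_device": "cpu",
    }
    path = Path(config_path)
    path.write_text(json.dumps(config, indent=2))
    return path


if __name__ == "__main__":
    # Both paths are placeholders; point adapter_dir at your own checkpoint.
    cfg = write_merge_config("llama3_8b_lora_kto", "llama3_8b_kto_merged")
    print(f"Run: llamafactory-cli export {cfg}")
```

Keeping the adapter separate avoids this step entirely, which is the simpler choice if you plan a multi-LoRA deployment.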

For full fine-tuning or freeze-tuning, export the fine-tuned model directly.

You may optionally apply [post-training quantization](https://docs.anyscale.com/llm/fine-tuning/checkpointing#ptq) on merged or full models before serving.