# Supervised Fine-Tuning (SFT) at scale with DeepSpeed

This guide provides a step-by-step workflow for supervised fine-tuning of the [`Qwen/Qwen2.5-32B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) model on a multi-GPU Anyscale cluster. You use LLaMA-Factory as the training framework and DeepSpeed to manage memory efficiently and scale the training process.

SFT is a technique to adapt a pre-trained model to specific tasks. By showing the model high-quality examples of instructions and their desired outputs, you teach it to follow new instructions more accurately.

## Step 1: Set up your environment

### Dependencies
First, ensure your environment has the correct libraries. Start with a pre-built container image and install LLaMA-Factory and DeepSpeed on top of it.

Recommended container image:
```bash
anyscale/ray-llm:2.48.0-py311-cu128
```

Execute the following commands to install the required packages and optional tools for experiment tracking and faster model downloads:

```bash
# Install the specific version of LLaMA-Factory
pip install -q llamafactory==0.9.3

# Install DeepSpeed for large-scale training
pip install -q deepspeed==0.16.9

# (Optional) For experiment tracking with Weights & Biases
pip install -q wandb==0.21.3

# (Optional) For accelerated model downloads from Hugging Face
pip install -q hf_transfer==0.1.9
```

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jupyter-server 1.24.0 requires anyio<4,>=3.1.0, but you have anyio 4.12.1 which is incompatible.

Successfully registered `llamafactory` package to be installed on all cluster nodes.
Successfully registered `deepspeed` package to be installed on all cluster nodes.
Successfully registered `wandb` package to be installed on all cluster nodes.
Successfully registered `hf_transfer` package to be installed on all cluster nodes.

### Model and compute resources

DeepSpeed ZeRO-3 partitions parameters, gradients, and optimizer states across multiple GPUs, enabling supervised fine-tuning (SFT) of 30B+ LLMs on just 4 GPUs.

| Item | Value |
|------|-------|
| **Base model** | [`Qwen/Qwen2.5-32B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) |
| **Worker node** | 1 node with 4 × L40S or 4 × A100-40G |
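
As a rough sanity check on why four GPUs are enough, the sketch below estimates the per-GPU footprint of the sharded base weights alone. It assumes about 32.76B parameters stored in bf16 (2 bytes each) and an even split across GPUs. The function name is illustrative, and the estimate ignores activations, LoRA adapter weights, optimizer state, and framework buffers, so treat it as a lower bound rather than a measurement.

```python
def sharded_weight_gib(num_params: float, bytes_per_param: int, num_gpus: int) -> float:
    """Per-GPU memory in GiB for model weights split evenly across GPUs."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_gpus / 2**30


# Assumed inputs: ~32.76B parameters, bf16 (2 bytes per parameter), 4 GPUs.
per_gpu_gib = sharded_weight_gib(32.76e9, bytes_per_param=2, num_gpus=4)
print(f"about {per_gpu_gib:.1f} GiB of base weights per GPU")
```

Under these assumptions the weights take roughly 15 GiB per GPU, which leaves headroom on a 48 GB L40S or a 40 GB A100 for the rest of the training state.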

## Step 2: Prepare the dataset

### Understand the dataset
This tutorial uses [`glaive_toolcall_en_demo`](https://huggingface.co/datasets/zuol/glaive_toolcall_en_demo/tree/main), a dataset designed to teach models how to use tools (also known as function calling).

This dataset contains conversational examples where the model needs to interact with external tools. Each entry includes:
* `conversations`: A turn-by-turn log between the user (`human`) and the assistant (`gpt`).
* `tools`: A JSON schema describing the functions the model can call.

**Note**: The `conversations` list may include special turns such as `function_call` (the model deciding to call a tool) and `observation` (the result returned from the tool). This structure is well suited to teaching the model multi-step tool-use behavior. To maintain role alignment in the ShareGPT format, follow a strict turn order: `human` and `observation` (tool output) turns must appear in odd-numbered positions, while `gpt` and `function_call` turns must appear in even-numbered positions. The model learns to generate the content of the `gpt` and `function_call` turns.

**Dataset example**

```json
{
"conversations": [
    {
    "from": "human",
    "value": "Hi, I have some ingredients and I want to cook something. Can you help me find a recipe?"
    },
    {
    "from": "gpt",
    "value": "Of course! I can help you with that. Please tell me what ingredients you have."
    },
    {
    "from": "human",
    "value": "I have chicken, bell peppers, and rice."
    },
    {
    "from": "function_call",
    "value": "{\"name\": \"search_recipes\", \"arguments\": {\"ingredients\": [\"chicken\", \"bell peppers\", \"rice\"]}}"
    },
    {
    "from": "observation",
    "value": "{\"recipes\": [{\"name\": \"Chicken and Bell Pepper Stir Fry\", \"ingredients\": [\"chicken\", \"bell peppers\", \"rice\"], \"instructions\": \"Cut the chicken into small pieces. Slice the bell peppers. Cook the rice. Stir fry the chicken and bell peppers. Serve over rice.\"}, {\"name\": \"Chicken and Rice Casserole\", \"ingredients\": [\"chicken\", \"bell peppers\", \"rice\"], \"instructions\": \"Cook the chicken and rice separately. Mix them together with the bell peppers in a casserole dish. Bake until golden brown.\"}]}"
    },
    {
    "from": "gpt",
    "value": "I found two recipes for you. The first one is \"Chicken and Bell Pepper Stir Fry\". The instructions are: Cut the chicken into small pieces. Slice the bell peppers. Cook the rice. Stir fry the chicken and bell peppers. Serve over rice. The second one is \"Chicken and Rice Casserole\". The instructions are: Cook the chicken and rice separately. Mix them together with the bell peppers in a casserole dish. Bake until golden brown. Which one would you like to try?"
    },
    {
    "from": "human",
    "value": "They both sound delicious, but I think I'll try the stir fry. Can you order the ingredients for me?"
    },
    {
    "from": "gpt",
    "value": "I'm sorry, but as an AI, I don't have the capability to perform external tasks such as ordering ingredients. However, I can help you find more recipes or provide cooking tips if you need."
    }
],
"tools": "[{\"name\": \"search_recipes\", \"description\": \"Search for recipes based on ingredients\", \"parameters\": {\"type\": \"object\", \"properties\": {\"ingredients\": {\"type\": \"array\", \"items\": {\"type\": \"string\"}, \"description\": \"The ingredients to search for\"}}, \"required\": [\"ingredients\"]}}]"
}
```
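
The turn-order rule described above can be checked mechanically before training. The following is a minimal sketch and not part of LLaMA-Factory: `find_turn_order_errors` is a hypothetical helper that inspects only the `from` field of each turn and reports any role that sits in the wrong position.

```python
ODD_ROLES = {"human", "observation"}    # allowed at positions 1, 3, 5, ...
EVEN_ROLES = {"gpt", "function_call"}   # allowed at positions 2, 4, 6, ...


def find_turn_order_errors(conversations):
    """Return a list of messages describing turns that break the ShareGPT order."""
    errors = []
    for position, turn in enumerate(conversations, start=1):
        role = turn.get("from")
        allowed = ODD_ROLES if position % 2 == 1 else EVEN_ROLES
        if role not in allowed:
            errors.append(
                f"turn {position}: role {role!r} not allowed, "
                f"expected one of {sorted(allowed)}"
            )
    return errors


# Abbreviated version of the record shown above (values shortened).
sample = [
    {"from": "human", "value": "I have chicken, bell peppers, and rice."},
    {"from": "function_call", "value": "{\"name\": \"search_recipes\"}"},
    {"from": "observation", "value": "{\"recipes\": []}"},
    {"from": "gpt", "value": "I found two recipes for you."},
]
print(find_turn_order_errors(sample))
```

To check a whole file, load it with `json.load` and call the helper on each record's `conversations` list.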

### Register the dataset

To specify new datasets that are accessible across Ray worker nodes, you must first add a **`dataset_info.json`** to **[storage shared across nodes](https://docs.anyscale.com/configuration/storage#shared)** such as `/mnt/cluster_storage`. This configuration file acts as a central registry for all your datasets. It maps a custom name to your dataset file location, format, and column structure. 

Before you run SFT on the `glaive_toolcall_en_demo` dataset, complete the setup steps below. Place the dataset files in a storage location that all workers can access (for example, a shared mount or object storage), and avoid storing large files on the head node.

`dataset_info.json`
```json
{
  "my_glaive_toolcall_en_demo": {
      "file_name": "/mnt/cluster_storage/glaive_toolcall_en_demo.json",
      "formatting": "sharegpt",
      "columns": {
          "messages": "conversations",
          "tools": "tools"
      }
  }
}
```
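
If you prefer to generate the registry instead of editing it by hand, the sketch below shows one way to add a ShareGPT entry with the same fields as above. `build_sharegpt_entry` and `register_dataset` are hypothetical helpers, not LLaMA-Factory APIs. The demo writes to a temporary directory so it runs anywhere; on the cluster you would point it at `/mnt/cluster_storage/dataset_info.json`.

```python
import json
import tempfile
from pathlib import Path


def build_sharegpt_entry(file_name: str) -> dict:
    """Registry entry mirroring the fields used in this guide."""
    return {
        "file_name": file_name,
        "formatting": "sharegpt",
        "columns": {"messages": "conversations", "tools": "tools"},
    }


def register_dataset(registry_path, name: str, file_name: str) -> dict:
    """Add or update one dataset in dataset_info.json and return the registry."""
    registry_path = Path(registry_path)
    registry = json.loads(registry_path.read_text()) if registry_path.exists() else {}
    registry[name] = build_sharegpt_entry(file_name)
    registry_path.write_text(json.dumps(registry, indent=2))
    return registry


# Demo in a temporary directory (stand-in for shared storage).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_registry = register_dataset(
        Path(tmp_dir) / "dataset_info.json",
        "my_glaive_toolcall_en_demo",
        "/mnt/cluster_storage/glaive_toolcall_en_demo.json",
    )
print(json.dumps(demo_registry, indent=2))
```

Because the helper merges into any existing file, registering a second dataset keeps earlier entries intact.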

For a more detailed dataset preparation and formatting guide, see [Choose your data format](https://docs.anyscale.com/llm/fine-tuning/data-preparation#sft).

```bash
# Make sure all files are accessible to worker nodes
# Create a copy of the data in /mnt/cluster_storage
wget https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/datasets/sharegpt/glaive_toolcall_en_demo.json -O /mnt/cluster_storage/glaive_toolcall_en_demo.json
# Create a copy of the dataset registry in /mnt/cluster_storage
cp ../dataset-configs/dataset_info.json /mnt/cluster_storage/
```

--2026-02-08 08:44:10--  https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/datasets/sharegpt/glaive_toolcall_en_demo.json
Resolving anyscale-public-materials.s3.us-west-2.amazonaws.com (anyscale-public-materials.s3.us-west-2.amazonaws.com)... 3.5.85.37, 52.92.130.162, 52.92.177.58, ...
Connecting to anyscale-public-materials.s3.us-west-2.amazonaws.com (anyscale-public-materials.s3.us-west-2.amazonaws.com)|3.5.85.37|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 738925 (722K) [application/json]
Saving to: ‘/mnt/cluster_storage/glaive_toolcall_en_demo.json’


## Step 3: Create the fine-tuning config (SFT with DeepSpeed)

Next, create the main YAML configuration file, which acts as the master recipe for your fine-tuning job. It specifies the base model, the fine-tuning method (LoRA), the dataset, training hyperparameters, cluster resources, and more.

**Important notes:**
- **W&B tracking:** To track experiments with Weights & Biases (W&B), set `report_to: wandb` in the config and provide `WANDB_API_KEY` in the runtime environment. If you don't want to use W&B, set `report_to: none` to avoid errors.
- **Access and paths:** The YAML only needs to be on the **head node**, but any referenced paths (`dataset_dir`, `output_dir`) must reside on storage **reachable by all workers** (for example, `/mnt/cluster_storage/`).
- **Gated models:** If your base model has gated access (for example, Llama) on Hugging Face, set `HF_TOKEN` in the runtime environment.
- **GPU selection and placement:** The config uses a 4xL40S node (`anyscale/accelerator_shape:4xL40S`) so that all 4 GPUs are on the same machine, which is important for efficient DeepSpeed ZeRO-3 communication. You can switch to other multi-GPU nodes such as `4xA100-40GB` or any other node type with comparable or more VRAM, depending on your cloud availability.
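
Several of these notes can be checked before you launch a job. The sketch below is a hypothetical pre-flight helper, not a LLaMA-Factory feature. It takes the training config as a plain dictionary (for example, the result of parsing the YAML) and flags two common mistakes: W&B reporting without an API key, and paths outside shared storage. Which keys to check and the `/mnt/cluster_storage` prefix are assumptions based on this guide's config.

```python
def preflight_issues(config: dict, shared_root: str = "/mnt/cluster_storage") -> list:
    """Return human-readable warnings for a parsed training config."""
    issues = []

    # W&B needs an API key in the Ray runtime environment.
    env_vars = (
        config.get("ray_init_kwargs", {}).get("runtime_env", {}).get("env_vars", {})
    )
    api_key = str(env_vars.get("WANDB_API_KEY", ""))
    if config.get("report_to") == "wandb" and (not api_key or api_key.startswith("<")):
        issues.append("report_to is 'wandb' but WANDB_API_KEY is missing or a placeholder")

    # Paths that every worker reads or writes should live on shared storage.
    for key in ("dataset_dir", "deepspeed", "ray_storage_path"):
        path = config.get(key)
        if path is not None and not str(path).startswith(shared_root):
            issues.append(f"{key}={path!r} is not under {shared_root}")

    return issues


# Illustrative config with two problems.
example_config = {
    "report_to": "wandb",
    "dataset_dir": "data",
    "ray_storage_path": "/mnt/cluster_storage/",
}
for issue in preflight_issues(example_config):
    print("WARNING:", issue)
```

An empty list means the helper found nothing to flag. It does not prove the config is valid.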

### Configure LLaMA-Factory with Ray

**Note**: To customize the training configuration, edit `train-configs/sft_lora_deepspeed.yaml`. 

```yaml
# sft_lora_deepspeed.yaml

### model
model_name_or_path: Qwen/Qwen2.5-32B-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### deepspeed
deepspeed: /mnt/cluster_storage/ds_z3_config.json # path to the DeepSpeed config

### dataset
dataset: my_glaive_toolcall_en_demo
dataset_dir: /mnt/cluster_storage

template: qwen
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: qwen2.5_32b_lora_sft
logging_steps: 5
save_steps: 50
plot_loss: true
report_to: wandb # or none

### train
per_device_train_batch_size: 1 # Adjust this depending on your GPU memory and sequence length
gradient_accumulation_steps: 4
num_train_epochs: 3.0
learning_rate: 1.0e-4
bf16: true
lr_scheduler_type: cosine
warmup_ratio: 0.1
ddp_timeout: 180000000

### ray
ray_run_name: qwen2.5_32b_lora_sft
ray_storage_path: /mnt/cluster_storage/
ray_num_workers: 4  # Number of GPUs to use
resources_per_worker:
  GPU: 1
  # accelerator_type:L40S: 0.001            # Use this to simply specify a GPU type (may place GPUs on separate nodes).
  anyscale/accelerator_shape:4xL40S: 0.001  # Prefer this for DeepSpeed so all 4 GPUs are on the same node.
  # See https://docs.ray.io/en/master/ray-core/accelerator-types.html#accelerator-types for a full list of accelerator types.
ray_init_kwargs:
  runtime_env:
    env_vars:
      # If using wandb for experiments tracking
      WANDB_API_KEY: <your_wandb_token>
      # If using gated models like meta-llama/Llama-3.1-8B-Instruct
      # HF_TOKEN: <your_huggingface_token>
      # If hf_transfer is installed
      HF_HUB_ENABLE_HF_TRANSFER: '1'
```

**Note:**
This configuration assumes `4xL40S` GPUs are available in your cloud environment. If not, you can substitute with `4xA100-40G` (or another supported accelerator with similar VRAM).
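
To see how the batch settings interact, the sketch below derives the effective batch size and the number of optimizer steps from the values in this config. It assumes 300 training examples, which is the number the trainer reports for this dataset, and it rounds partial accumulation windows up. Other trainer versions may round differently, so treat the step count as an estimate. The function name is illustrative.

```python
import math


def training_schedule(num_examples, num_workers, per_device_batch_size,
                      grad_accum_steps, num_epochs):
    """Return (effective_batch_size, steps_per_epoch, total_optimizer_steps)."""
    effective_batch = per_device_batch_size * num_workers * grad_accum_steps
    # Micro-batches each worker processes per epoch.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_workers))
    # One optimizer step per accumulation window (the last window may be partial).
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    total_steps = math.ceil(steps_per_epoch * num_epochs)
    return effective_batch, steps_per_epoch, total_steps


# Values from sft_lora_deepspeed.yaml, with an assumed 300 examples.
schedule = training_schedule(
    num_examples=300,
    num_workers=4,
    per_device_batch_size=1,
    grad_accum_steps=4,
    num_epochs=3.0,
)
print(schedule)  # (16, 19, 57)
```

Doubling `gradient_accumulation_steps` doubles the effective batch size and roughly halves the number of optimizer steps, without changing per-GPU memory for activations.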

### DeepSpeed configuration
DeepSpeed is an open-source deep-learning optimization library developed by Microsoft, aimed at enabling large-model training. Higher ZeRO stages (1→3) and enabling CPU offload reduce GPU VRAM usage, but might cause slower training.
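
To make the trade-off between ZeRO stages concrete, the sketch below applies the model-state accounting from the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for 16-bit weights, 2 bytes for 16-bit gradients, and 12 bytes for fp32 optimizer state. Stage 1 shards the optimizer state, stage 2 also shards gradients, and stage 3 also shards the weights. This is an estimate for full fine-tuning that ignores activations and CPU offload. With LoRA, as in this guide, the frozen base weights carry no gradients or optimizer state, so the real footprint is much smaller. The function name and example numbers are illustrative.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Per-GPU model-state memory in GB (10^9 bytes) for mixed-precision Adam."""
    params = 2 * num_params   # 16-bit weights
    grads = 2 * num_params    # 16-bit gradients
    optim = 12 * num_params   # fp32 weights copy, momentum, variance
    if stage >= 1:
        optim /= num_gpus     # stage 1: shard optimizer state
    if stage >= 2:
        grads /= num_gpus     # stage 2: also shard gradients
    if stage >= 3:
        params /= num_gpus    # stage 3: also shard weights
    return (params + grads + optim) / 1e9


# Hypothetical full fine-tuning of a ~32.76B-parameter model on 4 GPUs.
for stage in range(4):
    gb = zero_model_state_gb(32.76e9, num_gpus=4, stage=stage)
    print(f"ZeRO stage {stage}: {gb:.1f} GB of model state per GPU")
```

Even at stage 3, full fine-tuning at this scale would not fit on four 48 GB GPUs under these assumptions, which is one reason this guide pairs ZeRO-3 with LoRA.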

To enable DeepSpeed, create a separate DeepSpeed config in **[storage shared across nodes](https://docs.anyscale.com/configuration/storage#shared)** and reference it from your main training YAML config with:

```yaml
deepspeed: /mnt/cluster_storage/ds_z3_config.json
```

Below is a sample ZeRO-3 config:

`ds_z3_config.json`
```json
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
},
"bf16": {
    "enabled": "auto"
},
"zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
}
}
```

For a more detailed guide on acceleration and optimization methods including DeepSpeed on Ray, see [Speed and memory optimizations](https://docs.anyscale.com/llm/fine-tuning/speed-and-memory-optimizations).

```bash
# Create a copy of the DeepSpeed configuration file in /mnt/cluster_storage
cp ../deepspeed-configs/ds_z3_config.json /mnt/cluster_storage/
```

## Step 4: Train and monitor

**Note**: If you installed Weights & Biases, set `WANDB_API_KEY` in the runtime environment. Otherwise, set `report_to: none` in `sft_lora_deepspeed.yaml` to avoid `api_token not set` errors.

With all configurations in place, you can launch fine-tuning or post-training in one of two ways:

### Option A: Run from a workspace (quick start)

The `USE_RAY=1` prefix tells LLaMA-Factory to run in distributed mode on the Ray cluster attached to your workspace.

```bash
USE_RAY=1 llamafactory-cli train ../train-configs/sft_lora_deepspeed.yaml
```

[2026-02-08 08:51:32,783] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cpu (auto detect)
INFO 02-08 08:51:34 [__init__.py:248] No platform detected, vLLM is running on UnspecifiedPlatform

2026-02-08 08:51:37,024 INFO worker.py:1747 -- Connecting to existing Ray cluster at address: 10.128.5.218:6379...
2026-02-08 08:51:37,035 INFO worker.py:1918 -- Connected to Ray cluster. View the dashboard at https://session-c1mvc6t862zj4fbguuknngnrgv.i.anyscaleuserdata.com
2026-02-08 08:51:37,037 INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_1598f5ff1c5690e55af864dcf6517685e149d19c.zip' (0.07MiB) to Ray cluster...
2026-02-08 08:51:37,037 INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_1598f5ff1c5690e55af864dcf6517685e149d19c.zip'.

View detailed results here: /mnt/cluster_storage/qwen2.5_32b_lora_sft
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2026-02-08_07-42-51_475236_185/artifacts/2026-02-08_08-51-37/qwen2.5_32b_lora_sft/driver_artifacts`
(TrainTrainable pid=2869, ip=10.128.7.103) [2026-02-08 08:51:45,442] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cpu (auto detect)

(RayTrainWorker pid=3001, ip=10.128.7.103) Setting up process group for: env:// [rank=0, world_size=4]
(TorchTrainer pid=2869, ip=10.128.7.103) Started distributed worker processes:
(TorchTrainer pid=2869, ip=10.128.7.103) - (node_id=c0a063c3a2d8319e8a19333c768480af090e91c460068963ebf9fb27, ip=10.128.7.103, pid=3001) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=2869, ip=10.128.7.103) - (node_id=c0a063c3a2d8319e8a19333c768480af090e91c460068963ebf9fb27, ip=10.128.7.103, pid=3002) world_rank=1, local_rank=1, node_rank=0
(TorchTrainer pid=2869, ip=10.128.7.103) - (node_id=c0a063c3a2d8319e8a19333c768480af090e91c460068963ebf9fb27, ip=10.128.7.103, pid=3003) world_rank=2, local_rank=2, node_rank=0
(TorchTrainer pid=2869, ip=10.128.7.103) - (node_id=c0a063c3a2d8319e8a19333c768480af090e91c460068963ebf9fb27, ip=10.128.7.103, pid=3000) world_rank=3, local_rank=3, node_rank=0

(RayTrainWorker pid=3001, ip=10.128.7.103) [2026-02-08 08:51:56,572] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(RayTrainWorker pid=3001, ip=10.128.7.103) [2026-02-08 08:52:00,532] [INFO] [comm.py:669:init_distributed] cdb=None
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|2026-02-08 08:52:00] llamafactory.hparams.parser:143 >> Set `ddp_find_unused_parameters` to False in DDP training since LoRA is enabled.
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|2026-02-08 08:52:00] llamafactory.hparams.parser:406 >> Process rank: 0, world size: 4, device: cuda:0, distributed training: True, compute dtype: torch.bfloat16

(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|tokenization_utils_base.py:2023] 2026-02-08 08:52:01,201 >> loading file vocab.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-32B-Instruct/snapshots/5ede1c97bbab6ce5cda5812749b4c0bdf79b18dd/vocab.json
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|tokenization_utils_base.py:2023] 2026-02-08 08:52:01,201 >> loading file merges.txt from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-32B-Instruct/snapshots/5ede1c97bbab6ce5cda5812749b4c0bdf79b18dd/merges.txt
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|tokenization_utils_base.py:2023] 2026-02-08 08:52:01,201 >> loading file tokenizer.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-32B-Instruct/snapshots/5ede1c97bbab6ce5cda5812749b4c0bdf79b18dd/tokenizer.json

(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|2026-02-08 08:52:02] llamafactory.data.loader:143 >> Loading dataset glaive_toolcall_en_demo.json...
(RayTrainWorker pid=3002, ip=10.128.7.103) [2026-02-08 08:51:56,931] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)

(RayTrainWorker pid=3003, ip=10.128.7.103) [rank2]:[W208 08:52:02.339028791 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 2]  using GPU 2 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
Converting format of dataset (num_proc=16):   0%|          | 0/300 [00:00<?, ? examples/s]
Converting format of dataset (num_proc=16):  88%|████████▊ | 264/300 [00:00<00:00, 2638.98 examples/s]
Converting format of dataset (num_proc=16): 100%|██████████| 300/300 [00:00<00:00, 1549.14 examples/s]
Running tokenizer on dataset (num_proc=16):   0%|          | 0/300 [00:00<?, ? examples/s]
Running tokenizer on dataset (num_proc=16):   6%|▋         | 19/300 [00:00<00:08, 31.32 examples/s]
Running tokenizer on dataset (num_proc=16):  19%|█▉        | 57/300 [00:00<00:02, 85.98 examples/s]

(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|configuration_utils.py:698] 2026-02-08 08:52:06,973 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-32B-Instruct/snapshots/5ede1c97bbab6ce5cda5812749b4c0bdf79b18dd/config.json
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|configuration_utils.py:770] 2026-02-08 08:52:06,973 >> Model config Qwen2Config {
(RayTrainWorker pid=3001, ip=10.128.7.103)   "architectures": [
(RayTrainWorker pid=3001, ip=10.128.7.103)     "Qwen2ForCausalLM"
(RayTrainWorker pid=3001, ip=10.128.7.103)   ],
(RayTrainWorker pid=3001, ip=10.128.7.103)   "attention_dropout": 0.0,
(RayTrainWorker pid=3001, ip=10.128.7.103)   "bos_token_id": 151643,
(RayTrainWorker pid=3001, ip=10.128.7.103)   "eos_token_id": 151645,
(RayTrainWorker pid=3001, ip=10.128.7.103)   "hidden_act": "silu",

(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|2026-02-08 08:52:06] llamafactory.model.model_utils.kv_cache:143 >> KV cache is disabled during training.

(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|modeling_utils.py:1151] 2026-02-08 08:52:07,236 >> loading weights file model.safetensors from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-32B-Instruct/snapshots/5ede1c97bbab6ce5cda5812749b4c0bdf79b18dd/model.safetensors.index.json
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|modeling_utils.py:3881] 2026-02-08 08:52:07,238 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|configuration_utils.py:1135] 2026-02-08 08:52:07,248 >> Generate config GenerationConfig {
(RayTrainWorker pid=3001, ip=10.128.7.103)   "bos_token_id": 151643,
(RayTrainWorker pid=3001, ip=10.128.7.103)   "eos_token_id": 151645,
(RayTrainWorker pid=3001, ip=10.128.7.103)   "use_cache": false
(RayTrainWorker pid=3001, ip=10.128.7.103) }

(RayTrainWorker pid=3001, ip=10.128.7.103) [2026-02-08 08:52:07,238] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 4


Loading checkpoint shards:   0%|          | 0/17 [00:00<?, ?it/s]
(RayTrainWorker pid=3001, ip=10.128.7.103) [rank0]:[W208 08:52:03.115556105 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device. [repeated 3x across cluster]

(RayTrainWorker pid=3001, ip=10.128.7.103) [2026-02-08 08:52:07,916] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 771, num_elems = 32.76B

Loading checkpoint shards:   6%|▌         | 1/17 [00:00<00:11,  1.40it/s]
Loading checkpoint shards:   0%|          | 0/17 [00:00<?, ?it/s] [repeated 3x across cluster]
Loading checkpoint shards:  29%|██▉       | 5/17 [00:05<00:14,  1.18s/it] [repeated 19x across cluster]
Loading checkpoint shards:  59%|█████▉    | 10/17 [00:11<00:08,  1.18s/it] [repeated 17x across cluster]
Loading checkpoint shards:  82%|████████▏ | 14/17 [00:16<00:03,  1.22s/it] [repeated 19x across cluster]
Loading checkpoint shards:  94%|█████████▍| 16/17 [00:18<00:01,  1.20s/it]
Loading checkpoint shards: 100%|██████████| 17/17 [00:19<00:00,  1.13s/it]
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|modeling_utils.py:5131] 2026-02-08 08:52:28,004 >> All model checkpoint weights were used when initializing Qwen2ForCausalLM.

(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|2026-02-08 08:52:28] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|2026-02-08 08:52:28] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|2026-02-08 08:52:28] llamafactory.model.adapter:143 >> DeepSpeed ZeRO3 detected, remaining trainable params in float32.
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|2026-02-08 08:52:28] llamafactory.model.adapter:143 >> Fine-tuning method: LoRA
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|2026-02-08 08:52:28] llamafactory.model.model_utils.misc:143 >> Found linear modules: k_proj,down_proj,v_proj,up_proj,q_proj,gate_proj,o_proj

(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|trainer.py:756] 2026-02-08 08:52:29,128 >> Using auto half precision backend

(RayTrainWorker pid=3001, ip=10.128.7.103) [2026-02-08 08:52:29,587] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed info: version=0.16.9, git-hash=unknown, git-branch=unknown
(RayTrainWorker pid=3001, ip=10.128.7.103) [2026-02-08 08:52:29,654] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
(RayTrainWorker pid=3001, ip=10.128.7.103) [2026-02-08 08:52:29,661] [INFO] [logging.py:107:log_dist] [Rank 0] Using client Optimizer as basic optimizer
(RayTrainWorker pid=3001, ip=10.128.7.103) [2026-02-08 08:52:29,661] [INFO] [logging.py:107:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
(RayTrainWorker pid=3001, ip=10.128.7.103) [2026-02-08 08:52:29,769] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW

(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|trainer.py:2409] 2026-02-08 08:52:35,664 >> ***** Running training *****
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|trainer.py:2410] 2026-02-08 08:52:35,664 >>   Num examples = 300
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|trainer.py:2411] 2026-02-08 08:52:35,664 >>   Num Epochs = 3
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|trainer.py:2412] 2026-02-08 08:52:35,664 >>   Instantaneous batch size per device = 1
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|trainer.py:2415] 2026-02-08 08:52:35,664 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|trainer.py:2416] 2026-02-08 08:52:35,664 >>   Gradient Accumulation steps = 4
(RayTrainWorker pid=3001, ip=10.128.7.103) [INFO|trainer.py:2417] 2026-02-08 08:52:35,664 >>   Total optimization steps = 57

(RayTrainWorker pid=3001, ip=10.128.7.103) [2026-02-08 08:52:35,647] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
(RayTrainWorker pid=3001, ip=10.128.7.103) [2026-02-08 08:52:35,647] [INFO] [utils.py:782:see_memory_usage] MA 15.43 GB         Max_MA 15.43 GB         CA 15.69 GB         Max_CA 16 GB
(RayTrainWorker pid=3001, ip=10.128.7.103) [2026-02-08 08:52:35,648] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 28.64 GB, percent = 1.6%
(RayTrainWorker pid=3001, ip=10.128.7.103) [2026-02-08 08:52:35,648] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
(RayTrainWorker pid=3001, ip=10.128.7.103) [2026-02-08 08:52:35,648] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None
  2%|▏         | 1/57 [00:16<15:43, 16.84s/it]

(autoscaler +1m25s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.

  4%|▎         | 2/57 [00:26<11:40, 12.73s/it]
  5%|▌         | 3/57 [00:36<10:15, 11.40s/it]
  7%|▋         | 4/57 [00:46<09:31, 10.77s/it]
  9%|▉         | 5/57 [00:56<09:01, 10.41s/it]

(RayTrainWorker pid=3001, ip=10.128.7.103) {'loss': 0.5419, 'grad_norm': 0.29426564591950005, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.27}

 11%|█         | 6/57 [01:05<08:40, 10.21s/it]
 12%|█▏        | 7/57 [01:15<08:23, 10.08s/it]
 14%|█▍        | 8/57 [01:25<08:09,  9.98s/it]
 16%|█▌        | 9/57 [01:35<07:56,  9.92s/it]
 18%|█▊        | 10/57 [01:45<07:43,  9.87s/it]

(RayTrainWorker pid=3001, ip=10.128.7.103) {'loss': 0.5142, 'grad_norm': 0.1509424780319976, 'learning_rate': 9.91486549841951e-05, 'epoch': 0.53}

 19%|█▉        | 11/57 [01:54<07:32,  9.84s/it]
 21%|██        | 12/57 [02:04<07:21,  9.81s/it]
 23%|██▎       | 13/57 [02:14<07:11,  9.80s/it]
 25%|██▍       | 14/57 [02:24<07:00,  9.78s/it]
 26%|██▋       | 15/57 [02:33<06:50,  9.78s/it]

(RayTrainWorker pid=3001, ip=10.128.7.103) {'loss': 0.4473, 'grad_norm': 0.12575272194787512, 'learning_rate': 9.405060971428923e-05, 'epoch': 0.8}

 28%|██▊       | 16/57 [02:43<06:40,  9.77s/it]
 30%|██▉       | 17/57 [02:53<06:30,  9.77s/it]
 32%|███▏      | 18/57 [03:03<06:20,  9.76s/it]
 33%|███▎      | 19/57 [03:10<05:41,  8.98s/it]
 35%|███▌      | 20/57 [03:19<05:40,  9.20s/it]

(RayTrainWorker pid=3001, ip=10.128.7.103) {'loss': 0.3856, 'grad_norm': 0.12078352261151609, 'learning_rate': 8.480669729814635e-05, 'epoch': 1.05}

 37%|███▋      | 21/57 [03:29<05:37,  9.37s/it]
 39%|███▊      | 22/57 [03:39<05:31,  9.48s/it]
 40%|████      | 23/57 [03:49<05:24,  9.55s/it]
 42%|████▏     | 24/57 [03:58<05:16,  9.61s/it]
 44%|████▍     | 25/57 [04:08<05:08,  9.65s/it]

(RayTrainWorker pid=3001, ip=10.128.7.103) {'loss': 0.3079, 'grad_norm': 0.1151843729189665, 'learning_rate': 7.228691778882693e-05, 'epoch': 1.32}

 46%|████▌     | 26/57 [04:18<05:00,  9.68s/it]
 47%|████▋     | 27/57 [04:28<04:50,  9.70s/it]
 49%|████▉     | 28/57 [04:37<04:41,  9.71s/it]
 51%|█████     | 29/57 [04:47<04:31,  9.71s/it]
 53%|█████▎    | 30/57 [04:57<04:22,  9.71s/it]

(RayTrainWorker pid=3001, ip=10.128.7.103) {'loss': 0.2882, 'grad_norm': 0.08515978682873142, 'learning_rate': 5.7669582743934284e-05, 'epoch': 1.59}

 54%|█████▍    | 31/57 [05:07<04:12,  9.72s/it]
 56%|█████▌    | 32/57 [05:16<04:03,  9.73s/it]
 58%|█████▊    | 33/57 [05:26<03:53,  9.73s/it]
 60%|█████▉    | 34/57 [05:36<03:43,  9.72s/it]
 61%|██████▏   | 35/57 [05:45<03:33,  9.72s/it]

(RayTrainWorker pid=3001, ip=10.128.7.103) {'loss': 0.3361, 'grad_norm': 0.06301017189365173, 'learning_rate': 4.233041725606572e-05, 'epoch': 1.85}

 63%|██████▎   | 36/57 [05:55<03:24,  9.73s/it]
 65%|██████▍   | 37/57 [06:05<03:14,  9.73s/it]
 67%|██████▋   | 38/57 [06:12<02:51,  9.01s/it]
 68%|██████▊   | 39/57 [06:22<02:45,  9.22s/it]
 70%|███████   | 40/57 [06:32<02:39,  9.36s/it]

(RayTrainWorker pid=3001, ip=10.128.7.103) {'loss': 0.3296, 'grad_norm': 0.12993469598011465, 'learning_rate': 2.771308221117309e-05, 'epoch': 2.11}

 72%|███████▏  | 41/57 [06:41<02:31,  9.47s/it]
 74%|███████▎  | 42/57 [06:51<02:23,  9.54s/it]
 75%|███████▌  | 43/57 [07:01<02:14,  9.60s/it]
 77%|███████▋  | 44/57 [07:11<02:05,  9.64s/it]
 79%|███████▉  | 45/57 [07:20<01:55,  9.66s/it]

(RayTrainWorker pid=3001, ip=10.128.7.103) {'loss': 0.3066, 'grad_norm': 0.06914665181850518, 'learning_rate': 1.5193302701853673e-05, 'epoch': 2.37}

 81%|████████  | 46/57 [07:30<01:46,  9.67s/it]
 82%|████████▏ | 47/57 [07:40<01:36,  9.68s/it]
 84%|████████▍ | 48/57 [07:49<01:27,  9.69s/it]
 86%|████████▌ | 49/57 [07:59<01:17,  9.70s/it][0m 
 88%|████████▊ | 50/57 [08:09<01:07,  9.70s/it][0m 


[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m {'loss': 0.3301, 'grad_norm': 0.08821184709024278, 'learning_rate': 5.949390285710776e-06, 'epoch': 2.64}


[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [INFO|trainer.py:3993] 2026-02-08 09:01:01,146 >> Saving model checkpoint to qwen2.5_32b_lora_sft/checkpoint-50
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [INFO|configuration_utils.py:698] 2026-02-08 09:01:01,428 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-32B-Instruct/snapshots/5ede1c97bbab6ce5cda5812749b4c0bdf79b18dd/config.json
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [INFO|configuration_utils.py:770] 2026-02-08 09:01:01,429 >> Model config Qwen2Config {
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m   "architectures": [
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m     "Qwen2ForCausalLM"
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m   ],
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m   "attention_dropout": 0.0,
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m   "bos_token_id": 151643,
[36m(RayTrainWorker pid=3001, ip=10

[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [2026-02-08 09:01:02,035] [INFO] [logging.py:107:log_dist] [Rank 0] [Torch] Checkpoint global_step49 is about to be saved!
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [2026-02-08 09:01:02,085] [INFO] [logging.py:107:log_dist] [Rank 0] Saving model checkpoint: qwen2.5_32b_lora_sft/checkpoint-50/global_step49/zero_pp_rank_0_mp_rank_00_model_states.pt
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [2026-02-08 09:01:02,085] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving qwen2.5_32b_lora_sft/checkpoint-50/global_step49/zero_pp_rank_0_mp_rank_00_model_states.pt...
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [2026-02-08 09:01:02,136] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved qwen2.5_32b_lora_sft/checkpoint-50/global_step49/zero_pp_rank_0_mp_rank_00_model_states.pt.
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [2026-02-08 09:01:02,139] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] 

[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_32b_lora_sft/TorchTrainer_6ecdb_00000_0_2026-02-08_08-51-37/checkpoint_000000)



Training finished iteration 1 at 2026-02-08 09:01:09. Total running time: 9min 32s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000000 │
│ time_this_iter_s              561.34709 │
│ time_total_s                  561.34709 │
│ training_iteration                    1 │
│ epoch                              2.64 │
│ grad_norm                       0.08821 │
│ learning_rate                   0.00001 │
│ loss                             0.3301 │
│ step                                 50 │
╰─────────────────────────────────────────╯
Training saved a checkpoint for iteration 1 at: (local)/mnt/cluster_storage/qwen2.5_32b_lora_sft/TorchTrainer_6ecdb_00000_0_2026-02-08_08-51-37/checkpoint_000000


 89%|████████▉ | 51/57 [08:43<01:42, 17.13s/it][0m 
[36m(RayTrainWorker pid=3002, ip=10.128.7.103)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_32b_lora_sft/TorchTrainer_6ecdb_00000_0_2026-02-08_08-51-37/checkpoint_000000)[32m [repeated 3x across cluster][0m
 91%|█████████ | 52/57 [08:53<01:14, 14.93s/it][0m 
 93%|█████████▎| 53/57 [09:03<00:53, 13.40s/it][0m 
 95%|█████████▍| 54/57 [09:13<00:36, 12.33s/it][0m 
 96%|█████████▋| 55/57 [09:23<00:23, 11.58s/it][0m 


[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m {'loss': 0.3133, 'grad_norm': 0.09929192734739711, 'learning_rate': 8.513450158049108e-07, 'epoch': 2.91}


 98%|█████████▊| 56/57 [09:32<00:11, 11.06s/it][0m 
100%|██████████| 57/57 [09:40<00:00,  9.96s/it][0m 
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [INFO|trainer.py:3993] 2026-02-08 09:02:32,148 >> Saving model checkpoint to qwen2.5_32b_lora_sft/checkpoint-57
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [INFO|configuration_utils.py:698] 2026-02-08 09:02:32,378 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-32B-Instruct/snapshots/5ede1c97bbab6ce5cda5812749b4c0bdf79b18dd/config.json
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [INFO|configuration_utils.py:770] 2026-02-08 09:02:32,379 >> Model config Qwen2Config {
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m   "architectures": [
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m     "Qwen2ForCausalLM"
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m   ],
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m   "attention_dropout": 0.0,
[36m(Ra

[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [2026-02-08 09:02:33,064] [INFO] [logging.py:107:log_dist] [Rank 0] [Torch] Checkpoint global_step56 is about to be saved!
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [2026-02-08 09:02:33,112] [INFO] [logging.py:107:log_dist] [Rank 0] Saving model checkpoint: qwen2.5_32b_lora_sft/checkpoint-57/global_step56/zero_pp_rank_0_mp_rank_00_model_states.pt
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [2026-02-08 09:02:33,112] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving qwen2.5_32b_lora_sft/checkpoint-57/global_step56/zero_pp_rank_0_mp_rank_00_model_states.pt...
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [2026-02-08 09:02:33,161] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved qwen2.5_32b_lora_sft/checkpoint-57/global_step56/zero_pp_rank_0_mp_rank_00_model_states.pt.
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [2026-02-08 09:02:33,163] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] 

[36m(RayTrainWorker pid=3002, ip=10.128.7.103)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_32b_lora_sft/TorchTrainer_6ecdb_00000_0_2026-02-08_08-51-37/checkpoint_000001)
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [INFO|trainer.py:2676] 2026-02-08 09:02:38,612 >> 
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m 
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m Training completed. Do not forget to share your model on huggingface.co/models =)
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m 
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m 



Training finished iteration 2 at 2026-02-08 09:02:38. Total running time: 11min 1s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000001 │
│ time_this_iter_s                89.5918 │
│ time_total_s                  650.93889 │
│ training_iteration                    2 │
│ epoch                           2.90667 │
│ grad_norm                       0.09929 │
│ learning_rate                        0. │
│ loss                             0.3133 │
│ step                                 55 │
╰─────────────────────────────────────────╯
Training saved a checkpoint for iteration 2 at: (local)/mnt/cluster_storage/qwen2.5_32b_lora_sft/TorchTrainer_6ecdb_00000_0_2026-02-08_08-51-37/checkpoint_000001


100%|██████████| 57/57 [10:03<00:00, 10.59s/it][0m 


[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m {'train_runtime': 602.9399, 'train_samples_per_second': 1.493, 'train_steps_per_second': 0.095, 'train_loss': 0.36622098924820884, 'epoch': 3.0}


[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [INFO|trainer.py:3993] 2026-02-08 09:02:55,205 >> Saving model checkpoint to qwen2.5_32b_lora_sft
[36m(RayTrainWorker pid=3000, ip=10.128.7.103)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_32b_lora_sft/TorchTrainer_6ecdb_00000_0_2026-02-08_08-51-37/checkpoint_000001)[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [INFO|configuration_utils.py:698] 2026-02-08 09:02:55,432 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-32B-Instruct/snapshots/5ede1c97bbab6ce5cda5812749b4c0bdf79b18dd/config.json
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [INFO|configuration_utils.py:770] 2026-02-08 09:02:55,432 >> Model config Qwen2Config {
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m   "architectures": [
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m     "Qwen2ForCausalL

[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m ***** train metrics *****
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m   epoch                    =        3.0
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m   total_flos               =    68936GF
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m   train_loss               =     0.3662
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m   train_runtime            = 0:10:02.93
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m   train_samples_per_second =      1.493
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m   train_steps_per_second   =      0.095


[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m [INFO|modelcard.py:450] 2026-02-08 09:02:55,941 >> Dropping the following result as it does not have all the necessary fields:
[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}


[36m(RayTrainWorker pid=3001, ip=10.128.7.103)[0m Figure saved at: qwen2.5_32b_lora_sft/training_loss.png

Training completed after 2 iterations at 2026-02-08 09:02:58. Total running time: 11min 20s


2026-02-08 09:02:58,095	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/mnt/cluster_storage/qwen2.5_32b_lora_sft' in 0.0139s.





### Option B: Run as an Anyscale job (production)

For longer or production runs, submit the training as an **Anyscale job**. Jobs run outside your interactive session for better stability, retries, and durable logs. You package LLaMA-Factory and other libraries in a container image and launch with a short job config. See [Run LLaMA-Factory as an Anyscale job](https://docs.anyscale.com/llm/fine-tuning/llamafactory-jobs) for the step-by-step guide.

### Monitoring with Weights & Biases (WandB)
If you enabled Weights & Biases (with `report_to: wandb` in the training config YAML file), you can monitor your training job in real time. Look for the training loss to trend downward over the run, which indicates the model is learning. Small step-to-step fluctuations are normal.

**Weights & Biases example**

![WandB](https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/3.2.1/3.2.1-wandb.png)

For a more detailed guide on tracking experiments with tools such as Weights & Biases or MLflow, see [Observability and tracking](https://docs.anyscale.com/llm/fine-tuning/observability-and-tracking).
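
If you didn't enable an experiment tracker, you can still check the loss trend from the console output. The following is a minimal sketch, assuming the trainer prints one Python-style metrics dict per logging step, as in the output of this run. The helper name `extract_losses` and the `sample_log` excerpt are illustrative and not part of LLaMA-Factory or Ray.

```python
import ast
import re

# Matches the metrics dict printed at each logging step, for example:
# {'loss': 0.5419, 'grad_norm': ..., 'learning_rate': ..., 'epoch': 0.27}
METRICS_RE = re.compile(r"\{'loss':.*?\}")


def extract_losses(log_text: str) -> list[tuple[float, float]]:
    """Return (epoch, loss) pairs found in raw console output."""
    points = []
    for match in METRICS_RE.finditer(log_text):
        metrics = ast.literal_eval(match.group(0))
        points.append((metrics["epoch"], metrics["loss"]))
    return points


# Excerpt copied from the training output of this run.
sample_log = """
(RayTrainWorker pid=3001, ip=10.128.7.103) {'loss': 0.5419, 'grad_norm': 0.29426564591950005, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.27}
(RayTrainWorker pid=3001, ip=10.128.7.103) {'loss': 0.4473, 'grad_norm': 0.12575272194787512, 'learning_rate': 9.405060971428923e-05, 'epoch': 0.8}
(RayTrainWorker pid=3001, ip=10.128.7.103) {'loss': 0.3133, 'grad_norm': 0.09929192734739711, 'learning_rate': 8.513450158049108e-07, 'epoch': 2.91}
"""

points = extract_losses(sample_log)
first_loss, last_loss = points[0][1], points[-1][1]
print(f"{len(points)} log points, loss {first_loss} -> {last_loss}")
```

The final summary dict that starts with `'train_runtime'` is skipped on purpose, because the pattern only matches dicts that begin with `'loss'`.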

## Step 5: Locate checkpoints

Ray Train writes checkpoints under `ray_storage_path/ray_run_name`. In this example run, the path is: `/mnt/cluster_storage/qwen2.5_32b_lora_sft`. 

Inside, you find a **trainer session** directory with a name like:
`TorchTrainer_6ecdb_00000_0_2026-02-08_08-51-37/`.

- Ray Train creates `TorchTrainer_*` **when the trainer starts**; the suffix encodes a short run ID and the **start timestamp**.
- Within that directory, Ray Train names checkpoints `checkpoint_000xxx/`, where the number is a zero-based index that increases with each saved checkpoint (`checkpoint_000000`, `checkpoint_000001`, and so on).
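
The layout described above can be searched with a few lines of Python. This is a minimal sketch that relies only on the directory naming shown in this guide. The helper name `latest_checkpoints` is hypothetical, and the demo builds a throwaway copy of the layout in a temporary directory instead of reading `/mnt/cluster_storage`.

```python
import tempfile
from pathlib import Path


def latest_checkpoints(run_dir: Path) -> dict[str, Path]:
    """Map each trainer session directory name to its newest checkpoint."""
    latest = {}
    for session in sorted(run_dir.glob("TorchTrainer_*")):
        # Zero-padded names sort correctly as plain strings.
        checkpoints = sorted(
            p for p in session.glob("checkpoint_*") if p.is_dir()
        )
        if checkpoints:
            latest[session.name] = checkpoints[-1]
    return latest


# Recreate the layout from this run in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "qwen2.5_32b_lora_sft"
    session = run_dir / "TorchTrainer_6ecdb_00000_0_2026-02-08_08-51-37"
    for name in ("checkpoint_000000", "checkpoint_000001"):
        (session / name).mkdir(parents=True)

    found = latest_checkpoints(run_dir)
    for session_name, checkpoint in found.items():
        print(session_name, "->", checkpoint.name)
```

On a real cluster, you would pass `Path("/mnt/cluster_storage/qwen2.5_32b_lora_sft")` as `run_dir`. The function returns one entry per session because a run name can contain several `TorchTrainer_*` directories if you launch training more than once.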

Control the save cadence with `save_strategy` and `save_steps`. For instructions on how to resume interrupted training with `resume_from_checkpoint` and more, see [Understand the artifacts directory](https://docs.anyscale.com/llm/fine-tuning/checkpointing#artifacts-directory).

## Step 6: Export the model

If you use LoRA, you can keep the base model and adapters separate for [multi-LoRA deployment](https://docs.anyscale.com/llm/serving/multi-lora) or [merge the adapters](https://docs.anyscale.com/llm/fine-tuning/checkpointing#merge-lora) into the base model for low-latency inference. 
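
To see why merging removes the adapter overhead at inference time, consider the following toy example. It assumes the standard LoRA formulation, in which the merged weight is `W + (alpha / r) * B @ A`. It's a pure-Python illustration on a 2x2 matrix, not the export code that LLaMA-Factory runs, and the names `matmul` and `merge_lora` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold the low-rank update into the base weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy values: a 2x2 base weight and a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]   # W, shape (out, in)
lora_a = [[1.0, 2.0]]               # A, shape (rank, in)
lora_b = [[0.5], [-1.0]]            # B, shape (out, rank)
alpha, rank = 2.0, 1
x = [[3.0], [4.0]]                  # one input column vector

# Separate adapters: base matmul plus two extra low-rank matmuls per call.
base_out = matmul(weight, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
separate_out = [
    [b[0] + (alpha / rank) * a[0]] for b, a in zip(base_out, adapter_out)
]

# Merged weights: a single matmul gives the same result.
merged = merge_lora(weight, lora_a, lora_b, alpha, rank)
merged_out = matmul(merged, x)

print(separate_out, merged_out)
```

Keeping the adapters separate costs the extra low-rank multiplications on every forward pass but lets several adapters share one base model, which is the trade-off between the two options above.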

For full fine-tuning or freeze-tuning, export the fine-tuned model directly.

You can optionally apply [post-training quantization](https://docs.anyscale.com/llm/fine-tuning/checkpointing#ptq) to merged or full models before serving.