# Continued Pre-Training (CPT) at scale with DeepSpeed

This guide provides a step-by-step workflow for continued pre-training of the [`google/gemma-3-4b-pt`](https://huggingface.co/google/gemma-3-4b-pt) model on a multi-GPU Anyscale cluster. It uses LLaMA-Factory as the training framework and `DeepSpeed` to manage memory efficiently and scale the training process.

CPT is a technique to further adapt a pre-trained base model on large-scale unlabeled text. By continuing to train on high-quality corpora, you adapt the model to new domain knowledge and improve generalization. This notebook performs full fine-tuning of the base model instead of using parameter-efficient fine-tuning (PEFT) techniques.

- **Full fine-tuning vs LoRA:** Full fine-tuning generally yields the best quality but requires significantly more compute, longer training, and large checkpoints. LoRA is much faster and cheaper with small adapter checkpoints, but typically shows the most improvement on curated, simplified corpora (gains on broad/noisy corpora may be limited). See [Compare full vs freeze vs PEFT](https://docs.anyscale.com/llm/fine-tuning#compare-full-vs-freeze-vs-parameter-efficient-fine-tuning-peft) and [LoRA speed and memory optimizations](https://docs.anyscale.com/llm/fine-tuning/speed-and-memory-optimizations#lora).

## Step 1: Set up your environment

### Dependencies
First, ensure your environment has the correct libraries. Start with a pre-built container image and install LLaMA-Factory and DeepSpeed on top of it.

Recommended container image:
```bash
anyscale/ray-llm:2.48.0-py311-cu128
```

Execute the following commands to install the required packages and optional tools for experiment tracking and faster model downloads:


In [1]:
%%bash
# Install the specific version of LLaMA-Factory
pip install -q llamafactory==0.9.3

# Install DeepSpeed for large-scale training
pip install -q deepspeed==0.16.9

# (Optional) For accelerated model downloads from Hugging Face
pip install -q hf_transfer==0.1.9

# (Optional) Experiment tracking library
pip install -q mlflow==3.4.0


Successfully registered `llamafactory` package to be installed on all cluster nodes.
View and update dependencies here: https://console.anyscale.com/cld_a6j8iubw9rqbyigfwk9fut4amk/prj_a8aurpnjjkhushuarbyy4kwkre/workspaces/expwrk_kpm6l9gjz6gdcskt2zb8i3fie6?workspace-tab=dependencies
Successfully registered `deepspeed` package to be installed on all cluster nodes.
Successfully registered `hf_transfer` package to be installed on all cluster nodes.
Successfully registered `mlflow` package to be installed on all cluster nodes.

### Model and compute resources

DeepSpeed ZeRO-3 partitions parameters, gradients, and optimizer states across multiple GPUs, enabling CPT of mid-sized LLMs on just 4 GPUs.

| Item | Value |
|------|-------|
| **Base model** | [`google/gemma-3-4b-pt`](https://huggingface.co/google/gemma-3-4b-pt) |
| **Worker node** | 1 node with 4 × L40S or 4 × A100-40G GPUs |
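
As a rough illustration of why this partitioning helps, the sketch below estimates per-GPU memory for model states only (parameters, gradients, and optimizer states). It assumes the commonly cited mixed-precision Adam accounting of 2 + 2 + 12 bytes per parameter; that figure, the function name, and the 4.97B element count (the `num_elems` value DeepSpeed reports in the training log in Step 4, which includes the frozen vision tower) are assumptions for illustration. Activations, buffers, and fragmentation are ignored, so real usage differs.

```python
def zero_model_state_gib(num_params: float, num_gpus: int, stage: int) -> float:
    """Rough per-GPU memory (GiB) for model states under a given ZeRO stage."""
    # Assumed bytes per parameter: 16-bit weights, 16-bit gradients, Adam states.
    param_bytes, grad_bytes, optim_bytes = 2.0, 2.0, 12.0
    if stage >= 1:  # ZeRO-1 partitions optimizer states
        optim_bytes /= num_gpus
    if stage >= 2:  # ZeRO-2 also partitions gradients
        grad_bytes /= num_gpus
    if stage >= 3:  # ZeRO-3 also partitions parameters
        param_bytes /= num_gpus
    return num_params * (param_bytes + grad_bytes + optim_bytes) / 2**30


for stage in range(4):
    estimate = zero_model_state_gib(4.97e9, num_gpus=4, stage=stage)
    print(f"ZeRO-{stage}: ~{estimate:.1f} GiB of model states per GPU")
```

Under these assumptions, only stage 3 divides all three components by the GPU count, which is why it is the stage this guide relies on.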

## Step 2: Prepare the dataset

### Understand the dataset
This tutorial uses a small JSONL sample of [C4](https://huggingface.co/datasets/allenai/c4), a corpus of cleaned English web text derived from Common Crawl that's widely used for language-model pre-training. Each line is a JSON object with at least a `text` field. For demo purposes, the sample `c4.jsonl` (hosted on S3) contains only the first 100 records of the original C4 dataset so that runs finish quickly.

**Dataset example**

```json
{"text": "Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.", "timestamp": "2019-04-25 12:57:54", "url": "https://klyq.com/beginners-bbq-class-taking-place-in-missoula/"}
```
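
Before training, it can help to confirm that every line of a corpus matches this shape. The following is a minimal sketch of such a check; the function name `count_valid_records` is hypothetical, and the check only verifies that each non-blank line is a JSON object with a non-empty string in the chosen field.

```python
import json
from pathlib import Path


def count_valid_records(path: str, field: str = "text") -> int:
    """Count JSONL lines that are JSON objects with a non-empty string `field`."""
    valid = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # raises json.JSONDecodeError on malformed JSON
            value = record.get(field) if isinstance(record, dict) else None
            if not isinstance(value, str) or not value:
                raise ValueError(f"line {line_no}: missing or empty '{field}' field")
            valid += 1
    return valid


# Example (path assumed from this guide):
# count_valid_records("/mnt/cluster_storage/c4.jsonl")
```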

### Register the dataset

To specify new datasets that are accessible across Ray worker nodes, you must first add a **`dataset_info.json`** to **[storage shared across nodes](https://docs.anyscale.com/configuration/storage#shared)** such as `/mnt/cluster_storage`. This configuration file acts as a central registry for all your datasets. It maps a custom name to your dataset file location, format, and column structure. 

If you plan to run CPT on this text dataset, first complete the setup steps below. Ensure that you place the dataset files in a storage location that all workers can access (for example, a shared mount or object storage). Avoid storing large files on the head node.

`dataset_info.json`
```json
{
  "my_cpt_c4": {
      "file_name": "/mnt/cluster_storage/c4.jsonl",
      "columns": {
          "prompt": "text"
      }
  }
}
```

For a more detailed dataset preparation and formatting guide, see [Choose your data format](https://docs.anyscale.com/llm/fine-tuning/data-preparation#continued-pretraining).
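
If you prefer to generate the registry rather than copy a prepared file, a small helper can write an entry with the same structure as the example above. This is a sketch, not part of LLaMA-Factory: the `register_dataset` name and its arguments are hypothetical, and it only reproduces the `file_name` and `columns.prompt` keys shown in this guide.

```python
import json
from pathlib import Path


def register_dataset(registry_path: str, name: str, data_file: str,
                     text_column: str = "text") -> dict:
    """Add or update one pre-training dataset entry in a dataset_info.json file."""
    path = Path(registry_path)
    # Start from the existing registry so that other datasets are preserved.
    registry = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    registry[name] = {"file_name": data_file, "columns": {"prompt": text_column}}
    path.write_text(json.dumps(registry, indent=2) + "\n", encoding="utf-8")
    return registry


# Example (paths assumed from this guide):
# register_dataset("/mnt/cluster_storage/dataset_info.json",
#                  "my_cpt_c4", "/mnt/cluster_storage/c4.jsonl")
```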


In [2]:
%%bash
# Make sure all files are accessible to worker nodes
# Create a copy of the data in /mnt/cluster_storage
wget https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/datasets/alpaca/c4.jsonl -O /mnt/cluster_storage/c4.jsonl
# Create a copy of the dataset registry in /mnt/cluster_storage
cp ../dataset-configs/dataset_info.json /mnt/cluster_storage/


--2026-02-08 17:21:23--  https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/datasets/alpaca/c4.jsonl
Resolving anyscale-public-materials.s3.us-west-2.amazonaws.com (anyscale-public-materials.s3.us-west-2.amazonaws.com)... 3.5.85.188, 52.92.168.50, 52.218.181.209, ...
Connecting to anyscale-public-materials.s3.us-west-2.amazonaws.com (anyscale-public-materials.s3.us-west-2.amazonaws.com)|3.5.85.188|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 218450 (213K) [application/x-www-form-urlencoded]
Saving to: ‘/mnt/cluster_storage/c4.jsonl’


## Step 3: Create the pre-training config (CPT with DeepSpeed)

Next, create the main YAML configuration file, which serves as the master recipe for your pre-training job. It specifies the base model, the training method (full fine-tuning), the dataset, training hyperparameters, cluster resources, and more.

**Important notes:**
- **MLflow tracking:** To track experiments with MLflow, set `report_to: mlflow` in the config. If you don't want to use MLflow, set `report_to: none` to avoid errors.
- **Access and paths:** The YAML only needs to be on the **head node**, but any referenced paths (`dataset_dir`, `output_dir`) must reside on storage **reachable by all workers** (for example, `/mnt/cluster_storage/`).
- **Gated models:** If your base model has gated access (for example, Gemma) on Hugging Face, set `HF_TOKEN` in the runtime environment.
- **GPU selection and placement:** The config uses a 4xL40S node (`anyscale/accelerator_shape:4xL40S`) so that all 4 GPUs are on the same machine, which is important for efficient DeepSpeed ZeRO-3 communication. You can switch to other multi-GPU nodes such as `4xA100-40G` or any other node type with comparable or more VRAM, depending on your cloud availability.

### Configure LLaMA-Factory with Ray

**Note**: To customize the training configuration, edit `train-configs/cpt_deepspeed.yaml`. 

```yaml
# cpt_deepspeed.yaml

### model
model_name_or_path: google/gemma-3-4b-pt
trust_remote_code: true

### method
stage: pt
do_train: true
finetuning_type: full

### deepspeed
deepspeed: /mnt/cluster_storage/ds_z3_config.json # path to the DeepSpeed config

### dataset
dataset: my_cpt_c4
dataset_dir: /mnt/cluster_storage

template: gemma
cutoff_len: 512
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: gemma3_4b_full_cpt
logging_steps: 2
save_steps: 50
plot_loss: true
report_to: mlflow   # or none

### train
per_device_train_batch_size: 1 # Adjust this depending on your GPU memory and sequence length
gradient_accumulation_steps: 2
num_train_epochs: 2.0
learning_rate: 1.0e-4
bf16: true
lr_scheduler_type: cosine
warmup_ratio: 0.1
ddp_timeout: 180000000

### ray
ray_run_name: gemma3_4b_full_cpt
ray_storage_path: /mnt/cluster_storage/
ray_num_workers: 4  # Number of GPUs to use
resources_per_worker:
  GPU: 1
  # accelerator_type:L40S: 0.001            # Use this to simply specify a GPU type (may place GPUs on separate nodes).
  anyscale/accelerator_shape:4xL40S: 0.001  # Prefer this for DeepSpeed so all 4 GPUs are on the same node.
  # See https://docs.ray.io/en/master/ray-core/accelerator-types.html#accelerator-types for a full list of accelerator types.
ray_init_kwargs:
  runtime_env:
    env_vars:
      # If using gated models like google/gemma-3-4b-pt
      HF_TOKEN: <your_huggingface_token>
      # If hf_transfer is installed
      HF_HUB_ENABLE_HF_TRANSFER: '1'
      # If using mlflow for experiments tracking
      MLFLOW_TRACKING_URI: "https://<your_cloud_id>.cloud.databricks.com"
      MLFLOW_TRACKING_TOKEN: "<mlflow_tracking_token>"
      MLFLOW_EXPERIMENT_NAME: "/Users/<your_user_id>/experiment_name"
```

**Note:**
This configuration assumes `4xL40S` GPUs are available in your cloud environment. If not, you can substitute with `4xA100-40G` (or another supported accelerator with similar VRAM).

Together, `stage: pt` and `finetuning_type: full` configure this run as full continued pre-training on this C4-based corpus, producing full model checkpoints rather than lightweight adapters.
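
To see how the batch-related settings combine, the sketch below computes the effective batch size and the number of optimizer steps. The step-counting logic is an approximation of how a data-parallel trainer counts updates, not a copy of any library's code, and the function name is hypothetical. The value of 80 examples is the `Num examples` figure that the training log in Step 4 reports after preprocessing.

```python
import math


def training_steps(num_examples: int, per_device_batch: int, grad_accum: int,
                   num_workers: int, epochs: float) -> tuple:
    """Return (effective batch size, total optimizer steps) for a data-parallel run."""
    effective_batch = per_device_batch * grad_accum * num_workers
    # Each worker sees its own shard of the data in every epoch.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_workers))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return effective_batch, math.ceil(epochs * updates_per_epoch)


print(training_steps(num_examples=80, per_device_batch=1, grad_accum=2,
                     num_workers=4, epochs=2.0))  # (8, 20)
```

Both numbers agree with the `Total train batch size` and `Total optimization steps` lines in the training log.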

### DeepSpeed configuration
DeepSpeed is an open-source deep-learning optimization library developed by Microsoft, aimed at enabling large-model training. Moving to a higher ZeRO stage (1 → 3) or enabling CPU offload reduces GPU memory usage, but can slow training because of the extra communication and host-to-device transfers.

To enable DeepSpeed, create a separate DeepSpeed config in the **[storage shared across nodes](https://docs.anyscale.com/configuration/storage#shared)** and reference it from your main training YAML config with:

```yaml
deepspeed: /mnt/cluster_storage/ds_z3_config.json
```

Below is a sample ZeRO-3 config:

`ds_z3_config.json`
```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```

For a more detailed guide on acceleration and optimization methods including DeepSpeed on Ray, see [Speed and memory optimizations](https://docs.anyscale.com/llm/fine-tuning/speed-and-memory-optimizations).
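
If the model states still don't fit in GPU memory, CPU offload is the next lever mentioned above. The sketch below derives an offload variant from a ZeRO-3 config by adding DeepSpeed's `offload_optimizer` and `offload_param` sections. The helper name `with_cpu_offload` is hypothetical, the `pin_memory` setting is an assumption rather than a tuned recommendation, and the abbreviated `base` dictionary stands in for the full config shown earlier.

```python
import copy
import json


def with_cpu_offload(ds_config: dict) -> dict:
    """Return a copy of a ZeRO-3 config with optimizer and parameter offload to CPU."""
    cfg = copy.deepcopy(ds_config)  # leave the caller's config untouched
    zero = cfg.setdefault("zero_optimization", {})
    if zero.get("stage") != 3:
        raise ValueError("parameter offload requires ZeRO stage 3")
    zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return cfg


# Abbreviated stand-in for ds_z3_config.json.
base = {"bf16": {"enabled": "auto"}, "zero_optimization": {"stage": 3, "overlap_comm": False}}
print(json.dumps(with_cpu_offload(base), indent=2))
```

You would save the result as a separate JSON file on shared storage and point the `deepspeed:` key at it, accepting slower steps in exchange for lower GPU memory use.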


In [3]:
%%bash
# Create a copy of the DeepSpeed configuration file in /mnt/cluster_storage
cp ../deepspeed-configs/ds_z3_config.json /mnt/cluster_storage/


## Step 4: Train and monitor

**Note**: For gated models such as [`google/gemma-3-4b-pt`](https://huggingface.co/google/gemma-3-4b-pt), ensure that you accept the license agreement for the models on the Hugging Face site and set `HF_TOKEN` in the runtime environment. If you installed MLflow, configure its credentials. Otherwise, set `report_to: none` in `cpt_deepspeed.yaml` to avoid `api_token not set` errors.
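
A quick pre-launch check can catch the most common setup mistakes described in this note. The sketch below is a hypothetical helper, not part of LLaMA-Factory or Ray: it takes the mapping you place under `ray_init_kwargs.runtime_env.env_vars` and a list of files expected on shared storage, then reports a missing `HF_TOKEN`, values that still contain `<...>` placeholders, and files that don't exist.

```python
from pathlib import Path


def preflight(env_vars: dict, required_paths: list) -> list:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    if not env_vars.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set (needed for gated models)")
    for key, value in env_vars.items():
        # Template placeholders such as <your_huggingface_token> must be replaced.
        if "<" in str(value) and ">" in str(value):
            problems.append(f"{key} still contains a placeholder value")
    for path in required_paths:
        if not Path(path).exists():
            problems.append(f"missing file: {path}")
    return problems


# Example (paths assumed from this guide):
# preflight({"HF_TOKEN": "hf_..."}, ["/mnt/cluster_storage/c4.jsonl",
#                                    "/mnt/cluster_storage/dataset_info.json",
#                                    "/mnt/cluster_storage/ds_z3_config.json"])
```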

With all configurations in place, you can launch pre-training in one of two ways:

### Option A: Run from a workspace (quickstart)

The `USE_RAY=1` prefix tells LLaMA-Factory to run in distributed mode on the Ray cluster attached to your workspace.


In [5]:
%%bash
USE_RAY=1 llamafactory-cli train ../train-configs/cpt_deepspeed.yaml


[2026-02-08 17:32:21,905] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cpu (auto detect)
INFO 02-08 17:32:24 [__init__.py:248] No platform detected, vLLM is running on UnspecifiedPlatform


2026-02-08 17:32:26,365	INFO worker.py:1747 -- Connecting to existing Ray cluster at address: 10.128.4.189:6379...
2026-02-08 17:32:26,376	INFO worker.py:1918 -- Connected to Ray cluster. View the dashboard at https://session-c1mvc6t862zj4fbguuknngnrgv.i.anyscaleuserdata.com
2026-02-08 17:32:26,377	INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_873ed4d0505528fa538926e1170e8ec9f1d45599.zip' (0.29MiB) to Ray cluster...
2026-02-08 17:32:26,378	INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_873ed4d0505528fa538926e1170e8ec9f1d45599.zip'.



View detailed results here: /mnt/cluster_storage/gemma3_4b_full_cpt
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2026-02-08_15-34-45_799476_185/artifacts/2026-02-08_17-32-26/gemma3_4b_full_cpt/driver_artifacts`
(TrainTrainable pid=2619, ip=10.128.6.27) [2026-02-08 17:32:34,692] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cpu (auto detect)

Training started with configuration:
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Training config                                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_loop_config/args/bf16                                                                             True │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m Setting up process group for: env:// [rank=0, world_size=4]
[36m(TorchTrainer pid=2619, ip=10.128.6.27)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=2619, ip=10.128.6.27)[0m - (node_id=b4b2552a8e4226a8a0d5e49624b9a465a3e4259cfbe479ced7e89233, ip=10.128.6.27, pid=2754) world_rank=0, local_rank=0, node_rank=0
[36m(TorchTrainer pid=2619, ip=10.128.6.27)[0m - (node_id=b4b2552a8e4226a8a0d5e49624b9a465a3e4259cfbe479ced7e89233, ip=10.128.6.27, pid=2755) world_rank=1, local_rank=1, node_rank=0
[36m(TorchTrainer pid=2619, ip=10.128.6.27)[0m - (node_id=b4b2552a8e4226a8a0d5e49624b9a465a3e4259cfbe479ced7e89233, ip=10.128.6.27, pid=2753) world_rank=2, local_rank=2, node_rank=0
[36m(TorchTrainer pid=2619, ip=10.128.6.27)[0m - (node_id=b4b2552a8e4226a8a0d5e49624b9a465a3e4259cfbe479ced7e89233, ip=10.128.6.27, pid=2752) world_rank=3, local_rank=3, node_rank=0


(RayTrainWorker pid=2753, ip=10.128.6.27) [2026-02-08 17:32:46,453] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(RayTrainWorker pid=2754, ip=10.128.6.27) [2026-02-08 17:32:50,555] [INFO] [comm.py:669:init_distributed] cdb=None
(RayTrainWorker pid=2754, ip=10.128.6.27) [INFO|2026-02-08 17:32:51] llamafactory.hparams.parser:406 >> Process rank: 0, world size: 4, device: cuda:0, distributed training: True, compute dtype: torch.bfloat16
(RayTrainWorker pid=2755, ip=10.128.6.27) [2026-02-08 17:32:46,600] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)

(RayTrainWorker pid=2754, ip=10.128.6.27) [INFO|tokenization_utils_base.py:2023] 2026-02-08 17:32:51,599 >> loading file tokenizer.model from cache at /home/ray/.cache/huggingface/hub/models--google--gemma-3-4b-pt/snapshots/cc012e0a6d0787b4adcc0fa2c4da74402494554d/tokenizer.model
(RayTrainWorker pid=2754, ip=10.128.6.27) [INFO|tokenization_utils_base.py:2023] 2026-02-08 17:32:51,599 >> loading file tokenizer.json from cache at /home/ray/.cache/huggingface/hub/models--google--gemma-3-4b-pt/snapshots/cc012e0a6d0787b4adcc0fa2c4da74402494554d/tokenizer.json
(RayTrainWorker pid=2754, ip=10.128.6.27) [INFO|tokenization_utils_base.py:2023] 2026-02-08 17:32:51,599 >> loading file added_tokens.json from cache at /home/ray/.cache/huggingface/hub/models--google--gemma-3-4b-pt/snapshots/cc012e0a6d0787b4adcc0fa2c4da74402494554d/added_tokens.json

(RayTrainWorker pid=2754, ip=10.128.6.27) [INFO|2026-02-08 17:32:58] llamafactory.data.template:143 >> Replace eos token: <end_of_turn>.
(RayTrainWorker pid=2754, ip=10.128.6.27) [INFO|2026-02-08 17:32:58] llamafactory.data.loader:143 >> Loading dataset /mnt/cluster_storage/c4.jsonl...
(RayTrainWorker pid=2753, ip=10.128.6.27) [2026-02-08 17:32:51,444] [INFO] [comm.py:669:init_distributed] cdb=None [repeated 3x across cluster]
(RayTrainWorker pid=2753, ip=10.128.6.27) [INFO|2026-02-08 17:32:51] llamafactory.hparams.parser:406 >> Process rank: 2, world size: 4, device: cuda:2, distributed training: True, compute dtype: torch.bfloat16 [repeated 3x across cluster]


(RayTrainWorker pid=2754, ip=10.128.6.27) [rank0]:[W208 17:32:59.880781223 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device. [repeated 3x across cluster]

[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m training example:
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m input_ids:
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [2, 3844, 56179, 6679, 32110, 15444, 528, 8227, 167326, 236888, 107, 6294, 611, 1461, 531, 974, 2480, 657, 3043, 14788, 56179, 236881, 1599, 795, 735, 506, 5506, 236764, 2247, 672, 580, 822, 14626, 1492, 236761, 9853, 236764, 5857, 236743, 236778, 236778, 523, 6154, 4109, 6679, 56179, 34117, 236764, 22801, 10219, 571, 699, 54785, 46672, 88554, 50995, 236761, 1293, 795, 577, 10299, 496, 52766, 1984, 1012, 573, 4677, 1015, 8150, 531, 974, 2480, 607, 910, 50353, 6130, 236761, 107, 2209, 795, 3786, 611, 4326, 611, 1202, 531, 1281, 531, 20811, 528, 496, 39684, 5580, 56179, 8105, 236764, 2440, 8403, 236764, 23642, 236764, 90353, 236764, 11495, 6799, 532, 92371, 236764, 2915, 115440, 532, 4304, 1938, 236761, 107, 818, 2157, 531, 577, 528, 506, 1012, 563, 609, 236800, 236810, 810, 1589, 236764, 532, 573, 69589, 625

[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|configuration_utils.py:698] 2026-02-08 17:33:10,220 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--google--gemma-3-4b-pt/snapshots/cc012e0a6d0787b4adcc0fa2c4da74402494554d/config.json
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|configuration_gemma3.py:311] 2026-02-08 17:33:10,221 >> text_config is None, using default Gemma3TextConfig text config.
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|configuration_gemma3.py:319] 2026-02-08 17:33:10,221 >> vision_config is None, using default SiglipVisionConfig vision config.
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|configuration_utils.py:770] 2026-02-08 17:33:10,222 >> Model config Gemma3Config {
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m   "architectures": [
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m     "Gemma3ForConditionalGeneration"
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)

[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|2026-02-08 17:33:10] llamafactory.model.model_utils.kv_cache:143 >> KV cache is disabled during training.


[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|modeling_utils.py:1151] 2026-02-08 17:33:10,289 >> loading weights file model.safetensors from cache at /home/ray/.cache/huggingface/hub/models--google--gemma-3-4b-pt/snapshots/cc012e0a6d0787b4adcc0fa2c4da74402494554d/model.safetensors.index.json
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|modeling_utils.py:3881] 2026-02-08 17:33:10,289 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|configuration_utils.py:1135] 2026-02-08 17:33:10,303 >> Generate config GenerationConfig {
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m   "use_cache": false
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m }
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m 


[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [2026-02-08 17:33:10,290] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 4


[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|modeling_utils.py:2241] 2026-02-08 17:33:10,735 >> Instantiating SiglipVisionModel model under default dtype torch.float32.
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|modeling_utils.py:2241] 2026-02-08 17:33:11,157 >> Instantiating Gemma3TextModel model under default dtype torch.float32.
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [2026-02-08 17:33:11,768] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 884, num_elems = 4.97B


Loading checkpoint shards:  50%|█████     | 1/2 [00:00<00:00,  1.48it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.06s/it]
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|modeling_utils.py:5131] 2026-02-08 17:33:14,884 >> All model checkpoint weights were used when initializing Gemma3ForConditionalGeneration.
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m 
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|modeling_utils.py:5139] 2026-02-08 17:33:14,884 >> All the weights of Gemma3ForConditionalGeneration were initialized from the model checkpoint at google/gemma-3-4b-pt.
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m If your task is similar to the task the model of the checkpoint was trained on, you can already use Gemma3ForConditionalGeneration for predictions without further training.
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|configuration_utils.py:1090] 2026-02-08 17:33:14,990 >> loading configuration file generation_conf

[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|2026-02-08 17:33:15] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|2026-02-08 17:33:15] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|2026-02-08 17:33:15] llamafactory.model.adapter:143 >> DeepSpeed ZeRO3 detected, remaining trainable params in float32.
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|2026-02-08 17:33:15] llamafactory.model.adapter:143 >> Fine-tuning method: Full
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|2026-02-08 17:33:15] llamafactory.model.model_utils.visual:143 >> Set vision model not trainable: ['vision_tower'].
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|2026-02-08 17:33:15] llamafactory.model.model_utils.visual:143 >> Set multi model projector not trainable: multi_mod

[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|trainer.py:756] 2026-02-08 17:33:15,123 >> Using auto half precision backend


[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [2026-02-08 17:33:15,534] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed info: version=0.16.9, git-hash=unknown, git-branch=unknown
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [2026-02-08 17:33:15,551] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [2026-02-08 17:33:15,553] [INFO] [logging.py:107:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [2026-02-08 17:33:15,553] [INFO] [logging.py:107:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [2026-02-08 17:33:15,578] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [2026-02-08 17:33:15,578] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for opt

[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|trainer.py:2409] 2026-02-08 17:33:20,525 >> ***** Running training *****
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|trainer.py:2410] 2026-02-08 17:33:20,525 >>   Num examples = 80
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|trainer.py:2411] 2026-02-08 17:33:20,525 >>   Num Epochs = 2
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|trainer.py:2412] 2026-02-08 17:33:20,525 >>   Instantaneous batch size per device = 1
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|trainer.py:2415] 2026-02-08 17:33:20,525 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|trainer.py:2416] 2026-02-08 17:33:20,525 >>   Gradient Accumulation steps = 2
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|trainer.py:2417] 2026-02-08 17:33:20,525 >>   Total optimization steps = 20
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[

[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [2026-02-08 17:33:20,517] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [2026-02-08 17:33:20,517] [INFO] [utils.py:782:see_memory_usage] MA 7.44 GB         Max_MA 9.94 GB         CA 11.08 GB         Max_CA 11 GB 
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [2026-02-08 17:33:20,518] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 28.99 GB, percent = 1.6%
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [2026-02-08 17:33:20,518] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [2026-02-08 17:33:20,518] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [2026-02-08 17:33:20,518] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed LR Scheduler = No

(RayTrainWorker pid=2754, ip=10.128.6.27) {'loss': 2.6595, 'grad_norm': 6.205087548139399, 'learning_rate': 5e-05, 'epoch': 0.2}
(RayTrainWorker pid=2754, ip=10.128.6.27) {'loss': 2.7983, 'grad_norm': 9.87541869169372, 'learning_rate': 9.924038765061042e-05, 'epoch': 0.4}
(RayTrainWorker pid=2754, ip=10.128.6.27) {'loss': 2.9596, 'grad_norm': 5.664050132477795, 'learning_rate': 9.330127018922194e-05, 'epoch': 0.6}
(RayTrainWorker pid=2754, ip=10.128.6.27) {'loss': 2.7879, 'grad_norm': 5.743530685137607, 'learning_rate': 8.213938048432697e-05, 'epoch': 0.8}
(RayTrainWorker pid=2754, ip=10.128.6.27) {'loss': 3.0082, 'grad_norm': 7.396988436336389, 'learning_rate': 6.710100716628344e-05, 'epoch': 1.0}
(RayTrainWorker pid=2754, ip=10.128.6.27) {'loss': 1.3599, 'grad_norm': 4.791414721084176, 'learning_rate': 5e-05, 'epoch': 1.2}
(RayTrainWorker pid=2754, ip=10.128.6.27) {'loss': 1.3562, 'grad_norm': 4.401493998603981, 'learning_rate': 3.289899283371657e-05, 'epoch': 1.4}
(RayTrainWorker pid=2754, ip=10.128.6.27) {'loss': 1.0892, 'grad_norm': 11.302256057240426, 'learning_rate': 1.7860619515673033e-05, 'epoch': 1.6}
(RayTrainWorker pid=2754, ip=10.128.6.27) {'loss': 1.2647, 'grad_norm': 8.079666045810312, 'learning_rate': 6.698729810778065e-06, 'epoch': 1.8}
100%|██████████| 20/20 [00:41<00:00,  1.90s/it]
(RayTrainWorker pid=2754, ip=10.128.6.27) {'loss': 1.2577, 'grad_norm': 5.5098817392068895, 'learning_rate': 7.596123493895991e-07, 'epoch': 2.0}


[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|trainer.py:3993] 2026-02-08 17:34:04,016 >> Saving model checkpoint to gemma3_4b_full_cpt/checkpoint-20
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|configuration_gemma3.py:311] 2026-02-08 17:34:04,020 >> text_config is None, using default Gemma3TextConfig text config.
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|configuration_gemma3.py:319] 2026-02-08 17:34:04,020 >> vision_config is None, using default SiglipVisionConfig vision config.
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|configuration_gemma3.py:311] 2026-02-08 17:34:04,020 >> text_config is None, using default Gemma3TextConfig text config.
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|configuration_gemma3.py:319] 2026-02-08 17:34:04,020 >> vision_config is None, using default SiglipVisionConfig vision config.
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|configuration_gemma3.py:311] 2026-02-08 17:34:04,021 >> te


Training finished iteration 1 at 2026-02-08 17:34:36. Total running time: 2min 9s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000000 │
│ time_this_iter_s              118.66813 │
│ time_total_s                  118.66813 │
│ training_iteration                    1 │
│ epoch                                2. │
│ grad_norm                       5.50988 │
│ learning_rate                        0. │
│ loss                             1.2577 │
│ step                                 20 │
╰─────────────────────────────────────────╯
Training saved a checkpoint for iteration 1 at: (local)/mnt/cluster_storage/gemma3_4b_full_cpt/TorchTrainer_30d7b_00000_0_2026-02-08_17-32-26/checkpoint_000000


[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/gemma3_4b_full_cpt/TorchTrainer_30d7b_00000_0_2026-02-08_17-32-26/checkpoint_000000)
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|image_processing_base.py:260] 2026-02-08 17:34:38,766 >> Image processor saved in gemma3_4b_full_cpt/checkpoint-20/preprocessor_config.json
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|tokenization_utils_base.py:2356] 2026-02-08 17:34:38,767 >> chat template saved in gemma3_4b_full_cpt/checkpoint-20/chat_template.jinja
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|tokenization_utils_base.py:2525] 2026-02-08 17:34:38,822 >> tokenizer config file saved in gemma3_4b_full_cpt/checkpoint-20/tokenizer_config.json
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|tokenization_utils_base.py:2534] 2026-02-08 17:34:38,822 >> Special tokens file saved in gemma3_4b_full_cpt/checkpoint-20/s

[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m {'train_runtime': 81.6498, 'train_samples_per_second': 1.96, 'train_steps_per_second': 0.245, 'train_loss': 2.054127204418182, 'epoch': 2.0}


[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|tokenization_utils_base.py:2525] 2026-02-08 17:34:42,235 >> tokenizer config file saved in gemma3_4b_full_cpt/tokenizer_config.json
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|tokenization_utils_base.py:2534] 2026-02-08 17:34:42,236 >> Special tokens file saved in gemma3_4b_full_cpt/special_tokens_map.json
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|processing_utils.py:709] 2026-02-08 17:34:45,526 >> processor saved in gemma3_4b_full_cpt/processor_config.json
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|trainer.py:3993] 2026-02-08 17:34:47,597 >> Saving model checkpoint to gemma3_4b_full_cpt
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|configuration_gemma3.py:311] 2026-02-08 17:34:47,600 >> text_config is None, using default Gemma3TextConfig text config.
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|configuration_gemma3.py:319] 2026-02-08 17:34:47,600 >> vision_config i

[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m ***** train metrics *****
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m   epoch                    =        2.0
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m   total_flos               =      343GF
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m   train_loss               =     2.0541
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m   train_runtime            = 0:01:21.64
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m   train_samples_per_second =       1.96
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m   train_steps_per_second   =      0.245


[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m [INFO|modelcard.py:450] 2026-02-08 17:34:59,304 >> Dropping the following result as it does not have all the necessary fields:
[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}


[36m(RayTrainWorker pid=2754, ip=10.128.6.27)[0m Figure saved at: gemma3_4b_full_cpt/training_loss.png

Training completed after 1 iterations at 2026-02-08 17:35:01. Total running time: 2min 34s


2026-02-08 17:35:01,183	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/mnt/cluster_storage/gemma3_4b_full_cpt' in 0.0171s.
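
To compare runs or plot the loss yourself, you can pull the per-step metric dictionaries out of a captured log like the one above. The following is a minimal sketch, not part of LLaMA-Factory or Ray: `parse_step_metrics` is a hypothetical helper, and it assumes each metrics record is printed as a Python-literal dict on a single line. The sample lines are abridged from the output above.

```python
import ast
import re

# Matches a Python-dict-style payload such as {'loss': 1.3599, ...}.
_DICT_RE = re.compile(r"\{.*\}")


def parse_step_metrics(log_text: str) -> list[dict]:
    """Return the per-step metric dicts (those with a 'loss' key) found in log_text."""
    metrics = []
    for line in log_text.splitlines():
        match = _DICT_RE.search(line)
        if match is None:
            continue
        try:
            record = ast.literal_eval(match.group(0))
        except (ValueError, SyntaxError):
            continue  # Not a literal dict (for example, a truncated line).
        if isinstance(record, dict) and "loss" in record:
            metrics.append(record)
    return metrics


# Abridged sample: two step records, one progress bar, one end-of-run summary.
sample = """\
(RayTrainWorker pid=2754, ip=10.128.6.27) {'loss': 1.2647, 'grad_norm': 8.079666045810312, 'learning_rate': 6.698729810778065e-06, 'epoch': 1.8}
 95%|...| 19/20 [00:39<00:01,  1.91s/it]
(RayTrainWorker pid=2754, ip=10.128.6.27) {'loss': 1.2577, 'grad_norm': 5.5098817392068895, 'learning_rate': 7.596123493895991e-07, 'epoch': 2.0}
(RayTrainWorker pid=2754, ip=10.128.6.27) {'train_runtime': 81.6498, 'train_loss': 2.054127204418182, 'epoch': 2.0}
"""
steps = parse_step_metrics(sample)
print([(m["epoch"], m["loss"]) for m in steps])
```

The end-of-run summary record is skipped because it reports `train_loss` rather than `loss`. If you enabled MLflow, the tracking server already stores these values, so this kind of parsing is mainly useful for quick checks on raw logs.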





### Option B: Run as an Anyscale job (production)

For longer or production runs, submit the training as an **Anyscale job**. Jobs run outside your interactive session, which gives you better stability, retries, and durable logs. Package LLaMA-Factory and other libraries in a container image, then launch the job with a short job config. See [Run LLaMA-Factory as an Anyscale job](https://docs.anyscale.com/llm/fine-tuning/llamafactory-jobs) for the step-by-step guide.

### Tracking with MLflow

If you enabled MLflow logging (`report_to: mlflow` in your YAML), LLaMA-Factory logs metrics (loss, learning rate, etc.), parameters, and artifacts to your configured MLflow tracking server.

**Example YAML snippet:**

```yaml
report_to: mlflow

ray_init_kwargs:
  runtime_env:
    env_vars:
      MLFLOW_TRACKING_URI: "https://<your_cloud_id>.cloud.databricks.com"
      MLFLOW_TRACKING_TOKEN: "<mlflow_tracking_token>"
      MLFLOW_EXPERIMENT_NAME: "/Users/<your_user_id>/experiment_name"
```
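
Before launching a run, it can help to confirm that none of the values in the snippet above still hold a template placeholder such as `<your_cloud_id>`. The following is a minimal sketch using only the standard library. `find_unset_mlflow_vars` is a hypothetical helper, and treating all three variables as required is an assumption about this Databricks-hosted setup, not an MLflow rule.

```python
import re

# Variable names taken from the YAML snippet; treating all three as
# required is an assumption for this setup.
REQUIRED_MLFLOW_VARS = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_TOKEN",
    "MLFLOW_EXPERIMENT_NAME",
)

# Template placeholders look like <your_cloud_id>.
_PLACEHOLDER_RE = re.compile(r"<[^<>]+>")


def find_unset_mlflow_vars(env_vars: dict[str, str]) -> list[str]:
    """Return required variables that are missing, empty, or still contain a <placeholder>."""
    problems = []
    for name in REQUIRED_MLFLOW_VARS:
        value = env_vars.get(name, "")
        if not value or _PLACEHOLDER_RE.search(value):
            problems.append(name)
    return problems


# Example: the experiment name is filled in, the other two are not.
env_vars = {
    "MLFLOW_TRACKING_URI": "https://<your_cloud_id>.cloud.databricks.com",
    "MLFLOW_TRACKING_TOKEN": "<mlflow_tracking_token>",
    "MLFLOW_EXPERIMENT_NAME": "/Users/someone@example.com/experiment_name",
}
unset = find_unset_mlflow_vars(env_vars)
print(unset)
```

Pass in the `env_vars` mapping exactly as it appears under `ray_init_kwargs.runtime_env` in your YAML. An empty result means every value is filled in; it doesn't verify that the token or URI is valid.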

**MLflow example**

![MLflow](https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/3.2.4/mlflow.png)

For a more detailed guide on tracking experiments with tools such as MLflow or Weights & Biases, see [Observability and tracking](https://docs.anyscale.com/llm/fine-tuning/observability-and-tracking).

## Step 5: Locate checkpoints

Ray Train writes checkpoints under `ray_storage_path/ray_run_name`. In this example run, the path is: `/mnt/cluster_storage/gemma3_4b_full_cpt`. 

Inside, you see a **trainer session** directory. For the run shown above, it's named:
`TorchTrainer_30d7b_00000_0_2026-02-08_17-32-26/`.

- Ray Train creates `TorchTrainer_*` **when the trainer starts**; the suffix encodes a short run ID and the **start timestamp**.
- Within that directory, Ray Train names checkpoints `checkpoint_000xxx/`, where the zero-padded number is the checkpoint's position in save order, starting at `checkpoint_000000`.

Control the save cadence with `save_strategy` and `save_steps`. For instructions on how to resume interrupted training with `resume_from_checkpoint` and more, see [Understand the artifacts directory](https://docs.anyscale.com/llm/fine-tuning/checkpointing#artifacts-directory).
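
The directory layout described above can be walked programmatically to find the most recent checkpoint. The following is a minimal sketch using only the standard library. `latest_ray_checkpoint` and `session_start` are hypothetical helpers, not Ray APIs, and the code assumes every session directory name ends with a `YYYY-MM-DD_HH-MM-SS` start timestamp. The demo builds a throwaway directory tree instead of reading `/mnt/cluster_storage`.

```python
import tempfile
from pathlib import Path
from typing import Optional


def session_start(session_dir: Path) -> str:
    """Extract the trailing YYYY-MM-DD_HH-MM-SS start timestamp from a session name."""
    return "_".join(session_dir.name.rsplit("_", 2)[-2:])


def latest_ray_checkpoint(run_dir: Path) -> Optional[Path]:
    """Return the highest-numbered checkpoint in the most recently started session."""
    sessions = [p for p in run_dir.glob("TorchTrainer_*") if p.is_dir()]
    if not sessions:
        return None
    newest_session = max(sessions, key=session_start)
    checkpoints = [p for p in newest_session.glob("checkpoint_*") if p.is_dir()]
    # Zero-padded indices sort correctly as plain strings.
    return max(checkpoints, key=lambda p: p.name, default=None)


# Demo on a temporary tree that mimics the layout: two sessions, three checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    old = run_dir / "TorchTrainer_8c6a5_00000_0_2025-09-09_09-53-45"
    new = run_dir / "TorchTrainer_30d7b_00000_0_2026-02-08_17-32-26"
    (old / "checkpoint_000003").mkdir(parents=True)
    (new / "checkpoint_000000").mkdir(parents=True)
    (new / "checkpoint_000001").mkdir(parents=True)
    found = latest_ray_checkpoint(run_dir)
    found_relative = found.relative_to(run_dir).as_posix()

print(found_relative)
```

On a real cluster, you would call `latest_ray_checkpoint(Path("/mnt/cluster_storage/gemma3_4b_full_cpt"))`. The function returns `None` when no session or checkpoint directory exists yet.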

## Step 6: Export the model

If you use LoRA, you can keep the base model and adapters separate for [multi-LoRA deployment](https://docs.anyscale.com/llm/serving/multi-lora) or [merge the adapters](https://docs.anyscale.com/llm/fine-tuning/checkpointing#merge-lora) into the base model for low-latency inference. 

For full fine-tuning or freeze-tuning, export the fine-tuned model directly.

Optionally, apply [post-training quantization](https://docs.anyscale.com/llm/fine-tuning/checkpointing#ptq) to merged or full models before serving.
