LocalBackend fork_checkpoint doesn't update vLLM's initial LoRA #651

@arcticfly

Problem

When using LocalBackend._experimental_fork_checkpoint with PipelineTrainer, the forked LoRA weights are not loaded by the vLLM inference server at startup.

Root cause: model.register(backend) creates an empty LoRA checkpoint at checkpoints/0000. Then _experimental_fork_checkpoint copies the source checkpoint to checkpoints/{source_step} (e.g. 0686). But when vLLM starts, it loads the adapter at @0 — which is the empty 0000 checkpoint, not the forked one.

Sequence:

  1. model.register(backend) → creates checkpoints/0000/ (empty LoRA)
  2. backend._experimental_fork_checkpoint(model, from_model="kl-000-1") → creates checkpoints/0686/ (real weights)
  3. vLLM starts with lora_modules=[LoRAModulePath(name='model@0', path='checkpoints/0000')]
  4. Training begins from an empty adapter instead of the forked checkpoint

Verification:

$ md5sum checkpoints/0000/adapter_model.safetensors checkpoints/0686/adapter_model.safetensors
3fb4a12a...  checkpoints/0000/adapter_model.safetensors   # empty
98dd58ba...  checkpoints/0686/adapter_model.safetensors   # forked
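
The same check can be scripted, which is handy when asserting in a test that the fork actually landed. This is a minimal sketch, assuming the checkpoints/NNNN/adapter_model.safetensors layout shown above; file_digest and adapters_differ are hypothetical helper names, not part of ART.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def adapters_differ(checkpoints_dir: Path, step_a: int, step_b: int) -> bool:
    """True if the adapter weights stored at the two steps differ in content."""
    name = "adapter_model.safetensors"  # file name taken from the md5sum output above
    a = checkpoints_dir / f"{step_a:04d}" / name
    b = checkpoints_dir / f"{step_b:04d}" / name
    return file_digest(a) != file_digest(b)
```

With the bug present, adapters_differ(checkpoints_dir, 0, 686) returns True; after the workaround below is applied it should return False.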

Current workaround

After calling _experimental_fork_checkpoint, copy the forked checkpoint files over 0000:

import shutil

await backend._experimental_fork_checkpoint(model, from_model=src, ...)

# Overwrite the empty 0000 checkpoint with the forked weights.
# fork_step is the step of the source checkpoint (686 in the example above).
checkpoints_dir = art_path / project / "models" / model.name / "checkpoints"
step0_dir = checkpoints_dir / "0000"
forked_dir = checkpoints_dir / f"{fork_step:04d}"
for f in forked_dir.iterdir():
    shutil.copy2(f, step0_dir / f.name)

Suggested fix

_experimental_fork_checkpoint on LocalBackend should either:

  1. Copy the forked checkpoint to 0000 (overwriting the empty one), or
  2. Update the model's state so vLLM knows to load the forked step instead of @0
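
For option 2, one possible approach is to resolve the step from the checkpoint directories on disk instead of assuming 0. The sketch below is only an illustration under that assumption: latest_checkpoint_step is a hypothetical helper, not an existing ART or vLLM function, and it relies on the numeric checkpoints/NNNN naming shown above. How the result would be wired into the lora_modules argument is left open.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint_step(checkpoints_dir: Path) -> Optional[int]:
    """Return the highest numeric step directory under checkpoints_dir, or None."""
    steps = [
        int(p.name)
        for p in checkpoints_dir.iterdir()
        # Only directories with purely numeric names (e.g. "0000", "0686") count as steps.
        if p.is_dir() and p.name.isdecimal()
    ]
    return max(steps, default=None)
```

In the scenario from this issue, where checkpoints/ contains 0000 and 0686, the helper returns 686, so the server could be started with the forked adapter rather than the empty one.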

This issue is specific to LocalBackend. ServerlessBackend handles fork differently: it uploads the checkpoint as a W&B artifact with the correct step alias.
