## Problem

When using `LocalBackend._experimental_fork_checkpoint` with `PipelineTrainer`, the forked LoRA weights are not loaded by the vLLM inference server at startup.

**Root cause:** `model.register(backend)` creates an empty LoRA checkpoint at `checkpoints/0000`. Then `_experimental_fork_checkpoint` copies the source checkpoint to `checkpoints/{source_step}` (e.g. `0686`). But when vLLM starts, it loads the adapter at `@0`, which is the empty `0000` checkpoint, not the forked one.
**Sequence:**

1. `model.register(backend)` → creates `checkpoints/0000/` (empty LoRA)
2. `backend._experimental_fork_checkpoint(model, from_model="kl-000-1")` → creates `checkpoints/0686/` (real weights)
3. vLLM starts with `lora_modules=[LoRAModulePath(name='model@0', path='checkpoints/0000')]`
4. Training begins from an empty adapter instead of the forked checkpoint
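The step→path coupling in step 3 can be illustrated with a small sketch. The `LoRAModulePath` dataclass below is a hypothetical stand-in (with the `name`/`path` fields from the log line above), not vLLM's actual class:

```python
from dataclasses import dataclass


@dataclass
class LoRAModulePath:
    """Hypothetical stand-in for vLLM's LoRAModulePath (name/path as logged above)."""
    name: str
    path: str


def lora_module_for(model_name: str, step: int, ckpt_root: str) -> LoRAModulePath:
    """The adapter path is derived purely from the step number, so step 0
    always resolves to the (empty) 0000 directory, never the forked 0686."""
    return LoRAModulePath(name=f"{model_name}@{step}", path=f"{ckpt_root}/{step:04d}")


print(lora_module_for("model", 0, "checkpoints"))
# → LoRAModulePath(name='model@0', path='checkpoints/0000')
```

Because the path is computed from the step alone, copying real weights to `0686` has no effect on what `model@0` resolves to.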
**Verification:**

```shell
$ md5sum checkpoints/0000/adapter_model.safetensors checkpoints/0686/adapter_model.safetensors
3fb4a12a...  checkpoints/0000/adapter_model.safetensors  # empty
98dd58ba...  checkpoints/0686/adapter_model.safetensors  # forked
```
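The same check can be scripted. A minimal helper, with the directory layout and `adapter_model.safetensors` filename taken from the report above (step numbers are this repro's values, not fixed constants):

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def checkpoints_match(root: Path, a: str = "0000", b: str = "0686") -> bool:
    """True when two checkpoint steps hold byte-identical adapter weights.
    With this bug present, 0000 (empty) and 0686 (forked) do NOT match."""
    return (file_md5(root / a / "adapter_model.safetensors")
            == file_md5(root / b / "adapter_model.safetensors"))
```

After the workaround below is applied, `checkpoints_match(...)` should return `True`.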
## Current workaround

After calling `_experimental_fork_checkpoint`, copy the forked checkpoint files over `0000`:

```python
import shutil

await backend._experimental_fork_checkpoint(model, from_model=src, ...)

# Overwrite the empty 0000 checkpoint with the forked weights
step0_dir = art_path / project / "models" / model.name / "checkpoints" / "0000"
forked_dir = art_path / project / "models" / model.name / "checkpoints" / f"{fork_step:04d}"
for f in forked_dir.iterdir():
    shutil.copy2(f, step0_dir / f.name)
```
## Suggested fix

`_experimental_fork_checkpoint` on `LocalBackend` should either:

- Copy the forked checkpoint to `0000` (overwriting the empty one), or
- Update the model's state so vLLM knows to load the forked step instead of `@0`
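The first option amounts to a small copy step at the end of the fork path. A minimal sketch, where the function name is hypothetical and the directory layout is assumed from this report, not taken from the actual `LocalBackend` implementation:

```python
import shutil
from pathlib import Path


def sync_fork_to_step0(checkpoints_dir: Path, fork_step: int) -> None:
    """Copy the forked checkpoint's files over 0000 so the model@0
    adapter vLLM loads at startup holds the forked weights."""
    src = checkpoints_dir / f"{fork_step:04d}"
    dst = checkpoints_dir / "0000"
    dst.mkdir(parents=True, exist_ok=True)
    for f in src.iterdir():
        if f.is_file():
            shutil.copy2(f, dst / f.name)  # copy2 preserves file metadata
```

Doing this inside `_experimental_fork_checkpoint` would make the workaround above unnecessary.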
This issue is specific to `LocalBackend`; `ServerlessBackend` handles forking differently (it uploads the checkpoint as a W&B artifact with the correct step alias).