## Problem

When using `LocalBackend._experimental_fork_checkpoint` with `PipelineTrainer`, the forked LoRA weights are not loaded by the vLLM inference server at startup.

**Root cause:** `model.register(backend)` creates an empty LoRA checkpoint at `checkpoints/0000`. Then `_experimental_fork_checkpoint` copies the source checkpoint to `checkpoints/{source_step}` (e.g. `0686`). But when vLLM starts, it loads the adapter at `@0`, which is the empty `0000` checkpoint, not the forked one.
**Sequence:**

1. `model.register(backend)` → creates `checkpoints/0000/` (empty LoRA)
2. `backend._experimental_fork_checkpoint(model, from_model="kl-000-1")` → creates `checkpoints/0686/` (real weights)
3. vLLM starts with `lora_modules=[LoRAModulePath(name='model@0', path='checkpoints/0000')]`
4. Training begins from an empty adapter instead of the forked checkpoint
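The step→path coupling in step 3 can be illustrated with a small sketch. The `LoRAModulePath` dataclass below is a hypothetical stand-in (with the `name`/`path` fields from the log line above), not vLLM's actual class:

```python
from dataclasses import dataclass


@dataclass
class LoRAModulePath:
    """Hypothetical stand-in for vLLM's LoRAModulePath (name/path as logged above)."""
    name: str
    path: str


def lora_module_for(model_name: str, step: int, ckpt_root: str) -> LoRAModulePath:
    """The adapter path is derived purely from the step number, so step 0
    always resolves to the (empty) 0000 directory, never the forked 0686."""
    return LoRAModulePath(name=f"{model_name}@{step}", path=f"{ckpt_root}/{step:04d}")


print(lora_module_for("model", 0, "checkpoints"))
# → LoRAModulePath(name='model@0', path='checkpoints/0000')
```

Because the path is computed from the step alone, copying real weights to `0686` has no effect on what `model@0` resolves to.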
**Verification:**

```shell
$ md5sum checkpoints/0000/adapter_model.safetensors checkpoints/0686/adapter_model.safetensors
3fb4a12a...  checkpoints/0000/adapter_model.safetensors  # empty
98dd58ba...  checkpoints/0686/adapter_model.safetensors  # forked
```
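The same check can be scripted. A minimal helper, with the directory layout and `adapter_model.safetensors` filename taken from the report above (step numbers are this repro's values, not fixed constants):

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def checkpoints_match(root: Path, a: str = "0000", b: str = "0686") -> bool:
    """True when two checkpoint steps hold byte-identical adapter weights.
    With this bug present, 0000 (empty) and 0686 (forked) do NOT match."""
    return (file_md5(root / a / "adapter_model.safetensors")
            == file_md5(root / b / "adapter_model.safetensors"))
```

After the workaround below is applied, `checkpoints_match(...)` should return `True`.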
## Current workaround

After calling `_experimental_fork_checkpoint`, copy the forked checkpoint files over `0000`:

```python
import shutil

await backend._experimental_fork_checkpoint(model, from_model=src, ...)

# Overwrite the empty 0000 checkpoint with the forked weights
step0_dir = art_path / project / "models" / model.name / "checkpoints" / "0000"
forked_dir = art_path / project / "models" / model.name / "checkpoints" / f"{fork_step:04d}"
for f in forked_dir.iterdir():
    shutil.copy2(f, step0_dir / f.name)
```
## Suggested fix

`_experimental_fork_checkpoint` on `LocalBackend` should either:

- Copy the forked checkpoint to `0000` (overwriting the empty one), or
- Update the model's state so vLLM knows to load the forked step instead of `@0`
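The first option amounts to a small copy step at the end of the fork path. A minimal sketch, where the function name is hypothetical and the directory layout is assumed from this report, not taken from the actual `LocalBackend` implementation:

```python
import shutil
from pathlib import Path


def sync_fork_to_step0(checkpoints_dir: Path, fork_step: int) -> None:
    """Copy the forked checkpoint's files over 0000 so the model@0
    adapter vLLM loads at startup holds the forked weights."""
    src = checkpoints_dir / f"{fork_step:04d}"
    dst = checkpoints_dir / "0000"
    dst.mkdir(parents=True, exist_ok=True)
    for f in src.iterdir():
        if f.is_file():
            shutil.copy2(f, dst / f.name)  # copy2 preserves file metadata
```

Doing this inside `_experimental_fork_checkpoint` would make the workaround above unnecessary.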
This issue is specific to `LocalBackend`; `ServerlessBackend` handles forking differently (it uploads the checkpoint as a W&B artifact with the correct step alias).