Gate train_state.pt load behind an ownership check #62

Open
mmshad wants to merge 1 commit into main from train-state-ownership-check


@mmshad
Collaborator

@mmshad mmshad commented Apr 22, 2026

Summary

  • Factor torch.load(train_state.pt, weights_only=False) out into _load_train_state(path). The helper refuses to load when path.stat().st_uid != os.getuid() and warns on group- or world-writable files.
  • Closes the "group-writable shared checkpoint dir" and "imported pretrained checkpoint" footguns: anyone who can plant a pickled train_state.pt under the checkpoint path currently gets code execution in the training process on resume.
  • Same-UID attacks remain out of scope (an attacker who can write as you has already won). This PR also does not switch the load to weights_only=True; that is Tier 2 (scheduler/extra split + safe_globals allowlist), tracked separately.

Closes #61

Test plan

  • uv run pytest tests/unit/test_checkpoint_security.py -v (5 tests covering helper + full CheckpointManager.load() path).
  • Full unit suite clean.
  • Sanity resume test (e.g., smoke TestAutoResume) still succeeds on a same-UID checkpoint.
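
For readers who want to see how the two behaviours could be exercised without root, the following is a hedged sketch and not the contents of tests/unit/test_checkpoint_security.py. `check_ownership` is a hypothetical stand-in for the pre-load checks so the example runs on its own; foreign ownership is simulated by patching `os.getuid` rather than by changing the file's owner.

```python
import os
import stat
import tempfile
import warnings
from pathlib import Path
from unittest import mock


def check_ownership(path: Path) -> None:
    """Hypothetical stand-in for the pre-load checks (POSIX only)."""
    st = path.stat()
    if st.st_uid != os.getuid():
        raise PermissionError(f"{path} is not owned by the current user")
    if st.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
        warnings.warn(f"{path} is group- or world-writable")


def test_rejects_foreign_owner() -> None:
    with tempfile.TemporaryDirectory() as d:
        path = Path(d) / "train_state.pt"
        path.write_bytes(b"")
        # Pretend to run as another user; chown-ing the file would need root.
        with mock.patch("os.getuid", return_value=path.stat().st_uid + 1):
            try:
                check_ownership(path)
            except PermissionError:
                pass
            else:
                raise AssertionError("expected PermissionError")


def test_warns_on_group_writable() -> None:
    with tempfile.TemporaryDirectory() as d:
        path = Path(d) / "train_state.pt"
        path.write_bytes(b"")
        path.chmod(0o660)  # owner and group may write
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            check_ownership(path)
        assert len(caught) == 1
```

Both functions follow pytest's naming convention, so they can be collected as written; a real suite would more likely use the `tmp_path` and `monkeypatch` fixtures.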

torch.load(weights_only=False) on train_state.pt runs any __reduce__
in the file, so anyone who can plant a train_state.pt under a
checkpoint dir that the training process reads gets code execution in
the process that holds WANDB_API_KEY, HF_TOKEN, and SLURM creds. On
shared HPC filesystems and for imported "pretrained" checkpoints that
write side is attacker-reachable.
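
The mechanism can be shown with the standard library alone, since torch.load with weights_only=False unpickles the file with Python's pickle machinery. The sketch below uses a deliberately harmless callable (`os.getpid`); the `Payload` class is hypothetical and exists only for illustration.

```python
import os
import pickle


class Payload:
    """Stand-in for a planted object; a real attack would name os.system."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call os.getpid()".
        return (os.getpid, ())


blob = pickle.dumps(Payload())

# Unpickling calls the callable named in the byte stream, inside this process.
result = pickle.loads(blob)
print(result == os.getpid())
```

The loaded value is whatever the named callable returned, not a `Payload` instance, which is why the contents of the file decide what runs at load time.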

Factor the load into a _load_train_state helper that refuses files not
owned by the current UID and warns on group- or world-writable files.
Same-UID attacks are out of scope (the attacker already wins), but
this closes the group-writable shared checkpoint dir foot-gun and
makes the trust boundary visible in the code. Not a full defense —
Tier 2 (weights_only=True with safe_globals, scheduler split) tracked
separately.
@mmshad mmshad requested a review from Naeemkh April 22, 2026 01:19
@mmshad mmshad self-assigned this Apr 22, 2026
Development

Successfully merging this pull request may close these issues.

train_state.pt is loaded with weights_only=False without a trust check