Gate train_state.pt load behind an ownership check by mmshad · Pull Request #62 · KempnerInstitute/KempnerForge

mmshad · 2026-04-22T01:18:40Z

Summary

Factor torch.load(train_state.pt, weights_only=False) out into _load_train_state(path). The helper refuses to load when path.stat().st_uid != os.getuid() and warns on group- or world-writable files.
Closes the "group-writable shared checkpoint dir" and "imported pretrained checkpoint" footguns: anyone who can plant a pickled train_state.pt under the checkpoint path currently gets code execution in the training process on resume.
Same-UID attacks remain out of scope (if the attacker writes as you, they already win). This also does not convert to weights_only=True; that is Tier 2 (scheduler/extra split + safe_globals allowlist), tracked separately.

Closes #61

Test plan

uv run pytest tests/unit/test_checkpoint_security.py -v (5 tests covering helper + full CheckpointManager.load() path).
Full unit suite clean.
Sanity resume test (e.g., smoke TestAutoResume) still succeeds on a same-UID checkpoint.

torch.load(weights_only=False) on train_state.pt runs any __reduce__ in the file, so anyone who can plant a train_state.pt under a checkpoint dir that the training process reads gets code execution in the process that holds WANDB_API_KEY, HF_TOKEN, and SLURM creds. On shared HPC filesystems and for imported "pretrained" checkpoints, that write side is attacker-reachable.

Factor the load into a _load_train_state helper that refuses files not owned by the current UID and warns on group- or world-writable files.

Same-UID attacks are out of scope (the attacker already wins), but this closes the group-writable shared checkpoint dir footgun and makes the trust boundary visible in the code. Not a full defense: Tier 2 (weights_only=True with safe_globals, scheduler split) is tracked separately.
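The __reduce__ behavior described above can be reproduced with the standard pickle module, which torch.load relies on when weights_only=False. The sketch below is deliberately harmless. The Payload class is hypothetical, and eval of a constant expression stands in for whatever an attacker would actually run.

```python
import pickle


class Payload:
    """Hypothetical object whose pickled form calls a function on load."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" instead of the object's state.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Unpickling executes the stored call; no Payload instance is ever rebuilt.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

The loaded value is whatever the callable returned, not a Payload. That is why the file's owner, not its contents, is the trust boundary here: the code runs before the caller can inspect anything.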

mmshad requested a review from Naeemkh April 22, 2026 01:19

mmshad self-assigned this Apr 22, 2026

mmshad wants to merge 1 commit into main from train-state-ownership-check
