Add --resume flag for training continuation #10

jeremymanning · 2025-09-19T18:01:58Z

Summary

Added --resume flag to continue training from existing checkpoints
Supports deterministic continuation by saving/restoring random states
Enhanced remote training script with resume capability
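For reviewers who want a picture of what deterministic continuation involves, here is a minimal sketch of capturing and restoring random states. It is not the code in model_utils.py: the function names `capture_rng_state` and `restore_rng_state` are hypothetical, and NumPy and PyTorch are treated as optional so the snippet runs even where they are not installed.

```python
import random

try:  # NumPy and PyTorch are optional in this sketch (assumption)
    import numpy as np
except ImportError:
    np = None
try:
    import torch
except ImportError:
    torch = None


def capture_rng_state():
    """Collect the state of every random number generator that is available."""
    state = {"python": random.getstate()}
    if np is not None:
        state["numpy"] = np.random.get_state()
    if torch is not None:
        state["torch"] = torch.get_rng_state()
        if torch.cuda.is_available():
            # One state tensor per visible CUDA device
            state["cuda"] = torch.cuda.get_rng_state_all()
    return state


def restore_rng_state(state):
    """Put each generator back into the state captured earlier."""
    random.setstate(state["python"])
    if np is not None and "numpy" in state:
        np.random.set_state(state["numpy"])
    if torch is not None and "torch" in state:
        torch.set_rng_state(state["torch"])
        if torch.cuda.is_available() and "cuda" in state:
            torch.cuda.set_rng_state_all(state["cuda"])
```

In a checkpointing setup, the dictionary returned by `capture_rng_state()` would be stored alongside the model weights and passed to `restore_rng_state()` before training resumes, so that the next random draw matches what an uninterrupted run would have produced.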

Changes

run_llm_stylometry.sh: Added -r, --resume flag support
generate_figures.py: Added --resume argument and logic to skip/continue models
main.py:
- Added check_model_complete() function to verify training status
- Skips completed models, resumes partially trained ones, and restarts models that have no weights
- Removes logs for models without weights (e.g., after repo clone)
model_utils.py: Save/restore random states (Python, NumPy, PyTorch, CUDA)
remote_train.sh: Added --resume flag support for remote training
README.md: Comprehensive documentation of resume functionality
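The resume behavior described for main.py can be read as a small decision table. The sketch below is an illustration under assumptions, not the actual implementation: `plan_resume_action` and the action labels it returns are hypothetical names, and the three boolean inputs stand in for whatever check_model_complete() and the surrounding logic actually inspect.

```python
def plan_resume_action(has_weights, is_complete, has_logs):
    """Choose what to do with one model when --resume is given.

    All argument names and return labels are hypothetical.
    """
    if has_weights and is_complete:
        return "skip"  # training already finished
    if has_weights:
        return "resume"  # partially trained: continue from the checkpoint
    if has_logs:
        # Logs without weights (e.g., after a repo clone): drop the stale
        # logs, then train from scratch.
        return "clear_logs_and_restart"
    return "restart"  # nothing on disk yet


# Example: the cloned-repo edge case, where logs exist but weights do not
action = plan_resume_action(has_weights=False, is_complete=False, has_logs=True)
```

Keeping the decision separate from the filesystem checks makes each branch easy to exercise in the test plan without training a model.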

Test plan

Test --resume flag locally
Verify correct model weight file detection (model.safetensors)
Test model completion detection
Verify random state preservation
Test remote training with --resume flag
Documentation updated
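The weight file detection item could be exercised with something like the following. This is a hedged sketch only: `has_model_weights` is a hypothetical helper, the only detail taken from this PR is the filename model.safetensors, and treating an empty file as "no weights" is an assumption of the sketch.

```python
import tempfile
from pathlib import Path

WEIGHT_FILENAME = "model.safetensors"  # filename named in the test plan


def has_model_weights(model_dir):
    """Return True if the directory holds a non-empty weight file."""
    weight_path = Path(model_dir) / WEIGHT_FILENAME
    # Assumption: a zero-byte file counts as missing weights
    return weight_path.is_file() and weight_path.stat().st_size > 0


# Check detection against a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    before = has_model_weights(model_dir)  # no weight file yet
    (model_dir / WEIGHT_FILENAME).write_bytes(b"\0")
    after = has_model_weights(model_dir)  # weight file now present
```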

🤖 Generated with Claude Code

- Added --resume flag to run_llm_stylometry.sh and generate_figures.py
- Implemented check_model_complete() to verify model training status
- Resume logic skips completed models, resumes partial ones, restarts those without weights
- Save/restore random states (Python, NumPy, PyTorch, CUDA) for consistent continuation
- Updated README with resume documentation
- Handles edge case of cloned repos with logs but no weights

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Added --resume/-r flag to remote_train.sh for continuing interrupted training
- Script now passes resume mode through SSH to remote server
- Updated README with resume documentation for remote training
- Supports combining --kill and --resume flags for restart scenarios

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Add --resume flag for training continuation

jeremymanning and others added 2 commits September 19, 2025 12:59

jeremymanning merged commit fd3100a into ContextLab:main Sep 19, 2025

jeremymanning referenced this pull request in jeremymanning/llm-stylometry Oct 20, 2025

Merge pull request #10 from jeremymanning/main

d598d9d

Add --resume flag for training continuation
