@jeremymanning
Member

Summary

  • Added --resume flag to continue training from existing checkpoints
  • Supports deterministic continuation by saving/restoring random states
  • Enhanced remote training script with resume capability
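For readers unfamiliar with how such a flag is usually wired up, here is a minimal sketch of a boolean `--resume` option using `argparse`. Only the flag name comes from this PR; `build_parser` and `describe_mode` are hypothetical names used for illustration, and the real argument handling in generate_figures.py may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the --resume flag name is taken from the PR.
    parser = argparse.ArgumentParser(description="Train models (illustrative sketch)")
    parser.add_argument(
        "--resume",
        action="store_true",  # absent -> False, present -> True
        help="continue training from existing checkpoints instead of starting over",
    )
    return parser


def describe_mode(argv: list[str]) -> str:
    """Return 'resume' if --resume was passed, otherwise 'fresh'."""
    args = build_parser().parse_args(argv)
    return "resume" if args.resume else "fresh"
```

Using `action="store_true"` keeps the default behavior (training from scratch) unchanged for existing invocations.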

Changes

  • run_llm_stylometry.sh: Added -r, --resume flag support
  • generate_figures.py: Added --resume argument and logic to skip/continue models
  • main.py:
    • Added check_model_complete() function to verify training status
    • Skips completed models, resumes partially trained ones, and restarts those without weights
    • Removes logs for models without weights (e.g., after repo clone)
  • model_utils.py: Save/restore random states (Python, NumPy, PyTorch, CUDA)
  • remote_train.sh: Added --resume flag support for remote training
  • README.md: Comprehensive documentation of resume functionality
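The skip/resume/restart decision described above could be sketched roughly as follows. This is not the PR's `check_model_complete()` implementation: `plan_action`, the `Action` enum, and the `training_state.json` progress file with its `epochs_completed` key are all assumptions made for illustration. Only the `model.safetensors` file name appears in this PR.

```python
import json
from enum import Enum
from pathlib import Path

WEIGHTS_FILE = "model.safetensors"  # weight file name mentioned in the test plan
STATE_FILE = "training_state.json"  # hypothetical progress record, not from the PR


class Action(Enum):
    SKIP = "skip"        # training already finished
    RESUME = "resume"    # weights exist but training is incomplete
    RESTART = "restart"  # no weights on disk, so start from scratch


def plan_action(model_dir: Path, target_epochs: int) -> Action:
    """Decide what a resume run should do for one model directory."""
    model_dir = Path(model_dir)
    if not (model_dir / WEIGHTS_FILE).is_file():
        # Covers the cloned-repo case: logs may exist, but weights do not.
        return Action.RESTART
    state_path = model_dir / STATE_FILE
    if not state_path.is_file():
        # Weights without a progress record: treat as partially trained.
        return Action.RESUME
    epochs_done = json.loads(state_path.read_text()).get("epochs_completed", 0)
    return Action.SKIP if epochs_done >= target_epochs else Action.RESUME
```

Checking for the weight file first means stale logs alone never cause a model to be treated as resumable.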

Test plan

  • Test --resume flag locally
  • Verify correct model weight file detection (model.safetensors)
  • Test model completion detection
  • Verify random state preservation
  • Test remote training with --resume flag
  • Documentation updated
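One way to check random state preservation is to capture the generator states, draw some numbers, restore the states, and confirm the same numbers come out again. The sketch below is an assumption about how that might look, not the code in model_utils.py; `capture_rng_state` and `restore_rng_state` are hypothetical names. NumPy and PyTorch are treated as optional so the example also runs with the standard library alone.

```python
import random

try:  # NumPy and PyTorch are optional in this sketch
    import numpy as np
except ImportError:
    np = None
try:
    import torch
except ImportError:
    torch = None


def capture_rng_state() -> dict:
    """Snapshot every available random number generator."""
    state = {"python": random.getstate()}
    if np is not None:
        state["numpy"] = np.random.get_state()
    if torch is not None:
        state["torch"] = torch.get_rng_state()
        if torch.cuda.is_available():
            state["cuda"] = torch.cuda.get_rng_state_all()
    return state


def restore_rng_state(state: dict) -> None:
    """Restore generators from a snapshot made by capture_rng_state()."""
    random.setstate(state["python"])
    if np is not None and "numpy" in state:
        np.random.set_state(state["numpy"])
    if torch is not None and "torch" in state:
        torch.set_rng_state(state["torch"])
        if torch.cuda.is_available() and "cuda" in state:
            torch.cuda.set_rng_state_all(state["cuda"])


# Usage: draws made after a restore repeat the draws made after the capture.
snapshot = capture_rng_state()
first = [random.random() for _ in range(3)]
restore_rng_state(snapshot)
second = [random.random() for _ in range(3)]
assert first == second
```

In a real checkpoint the snapshot would be written to disk alongside the model weights so that a later process can restore it.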

🤖 Generated with Claude Code

jeremymanning and others added 2 commits September 19, 2025 12:59
- Added --resume flag to run_llm_stylometry.sh and generate_figures.py
- Implemented check_model_complete() to verify model training status
- Resume logic skips completed models, resumes partial ones, restarts those without weights
- Save/restore random states (Python, NumPy, PyTorch, CUDA) for consistent continuation
- Updated README with resume documentation
- Handles edge case of cloned repos with logs but no weights

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Added --resume/-r flag to remote_train.sh for continuing interrupted training
- Script now passes resume mode through SSH to remote server
- Updated README with resume documentation for remote training
- Supports combining --kill and --resume flags for restart scenarios

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@jeremymanning jeremymanning merged commit fd3100a into ContextLab:main Sep 19, 2025
jeremymanning referenced this pull request in jeremymanning/llm-stylometry Oct 20, 2025
Add --resume flag for training continuation