Example: Forking long training runs #14
Open
LeoRoccoBreedt wants to merge 124 commits into main from lb/forking-long-runs
…en and project name
…calculate gradient norms for batch (step) rather than epoch
…debugging when building LLM's
…sses using the data loader
These need to be updated to the final branch when merged
…ging_model_training
Reviewer's Guide
Introduce a new how-to notebook demonstrating forking, checkpointing, and parallel long-training runs with Neptune, and add it to the CI test-notebooks workflow.
Sequence diagram for forking and resuming a training run
sequenceDiagram
actor User
participant Notebook
participant NeptuneRun
participant Checkpoint
User->>Notebook: Load checkpoint
Notebook->>Checkpoint: Read model/optimizer state
User->>Notebook: Create forked Neptune Run
Notebook->>NeptuneRun: Initialize forked run (fork_run_id, fork_step)
User->>Notebook: Resume training
Notebook->>NeptuneRun: Log metrics for forked run
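The fork-and-resume flow in this diagram can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's code: `StubRun` is a hypothetical stand-in for the Neptune run object so the example runs without credentials, the checkpoint is a plain dict, `resume_from_checkpoint` is an illustrative helper name, and the "training" is a dummy loss. Only the names `fork_run_id` and `fork_step` come from the diagram.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StubRun:
    """Hypothetical stand-in for a Neptune run; it only records what is logged."""
    run_id: str
    fork_run_id: Optional[str] = None
    fork_step: Optional[int] = None
    metrics: dict = field(default_factory=dict)

    def log_metrics(self, data: dict, step: int) -> None:
        self.metrics[step] = dict(data)


def resume_from_checkpoint(checkpoint: dict, new_run_id: str, extra_steps: int) -> StubRun:
    """Create a forked run at the checkpointed step and keep logging from there."""
    fork_step = checkpoint["global_step"]
    run = StubRun(run_id=new_run_id, fork_run_id=checkpoint["run_id"], fork_step=fork_step)
    # The fork continues after the fork point, so its first step is fork_step + 1.
    for step in range(fork_step + 1, fork_step + 1 + extra_steps):
        dummy_loss = 1.0 / step  # placeholder for a real training step
        run.log_metrics({"train/loss": dummy_loss}, step=step)
    return run


checkpoint = {"epoch": 2, "global_step": 100, "run_id": "parent-run"}
forked = resume_from_checkpoint(checkpoint, new_run_id="fork-1", extra_steps=3)
print(sorted(forked.metrics))  # [101, 102, 103]
```

The point of the sketch is the step bookkeeping: the fork inherits the parent's history up to `fork_step` and logs only the steps that follow it.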
Sequence diagram for launching multiple parallel forks
sequenceDiagram
actor User
participant Notebook
participant ThreadPoolExecutor
participant NeptuneRun
participant Checkpoint
User->>Notebook: Define trial configs
User->>Notebook: Launch ThreadPoolExecutor
Notebook->>ThreadPoolExecutor: Submit forked runs
ThreadPoolExecutor->>Notebook: For each config
Notebook->>Checkpoint: Load checkpoint
Notebook->>NeptuneRun: Initialize forked run
Notebook->>NeptuneRun: Log metrics for forked run (parallel)
ThreadPoolExecutor->>User: Return run URLs
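The parallel-fork pattern in this diagram might look something like the sketch below. It is an assumption-laden illustration rather than the notebook's implementation: `run_fork` and `launch_forks` are hypothetical helper names, the trial body is a placeholder, and the returned URL uses a made-up `example.invalid` host in place of whatever the real run object reports.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fork(config: dict, checkpoint: dict) -> str:
    """Simulate one forked trial and return a placeholder URL for it."""
    run_id = f"{checkpoint['run_id']}-lr-{config['lr']}"
    # A real trial would load the checkpoint, create the forked run,
    # and train here; this sketch only builds an identifier.
    return f"https://example.invalid/runs/{run_id}"


def launch_forks(configs: list, checkpoint: dict, max_workers: int = 4) -> list:
    """Submit one fork per config and collect URLs in the order of `configs`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order even though trials run concurrently.
        return list(pool.map(lambda cfg: run_fork(cfg, checkpoint), configs))


checkpoint = {"global_step": 100, "run_id": "parent-run"}
configs = [{"lr": 0.01}, {"lr": 0.001}]
urls = launch_forks(configs, checkpoint)
print(urls)
```

In the diagram each submitted trial loads the checkpoint itself; doing so inside the worker, rather than sharing one loaded model across threads, would keep the trials from mutating common state.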
Class diagram for SimpleModel and training utilities
classDiagram
class SimpleModel {
+__init__(input_size, hidden_size, output_size, num_layers)
+forward(x)
model: nn.Sequential
}
class Run {
+log_metrics(data, step)
+log_configs(configs)
+add_tags(tags)
+close()
+get_run_url()
_run_id
}
class Checkpoint {
+save_checkpoint(epoch, global_step, model, optimizer, run)
+load_checkpoint(model, optimizer, checkpoint_path)
epoch
global_step
model_state_dict
optimizer_state_dict
run_id
}
class train {
+train(run, model, params, train_loader, optimizer, epoch_start, step_start, forked)
}
SimpleModel <.. train : used by
Run <.. train : used by
Checkpoint <.. train : used by
Checkpoint <.. Run : run_id used for forking
train <.. ThreadPoolExecutor : used for parallel forks
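To make the Checkpoint fields in this diagram concrete, here is a small round-trip sketch. It is a simplification and not the PR's code: the standard library `pickle` module stands in for whatever serialization the notebook uses, the model and optimizer states are plain dicts, and the function signatures are reduced from the ones listed in the diagram (they take a `path` and raw state instead of model, optimizer, and run objects).

```python
import os
import pickle
import tempfile


def save_checkpoint(path, epoch, global_step, model_state, optimizer_state, run_id):
    """Write the checkpoint fields named in the class diagram to one file."""
    payload = {
        "epoch": epoch,
        "global_step": global_step,
        "model_state_dict": model_state,
        "optimizer_state_dict": optimizer_state,
        "run_id": run_id,  # kept so a later fork can point back at this run
    }
    with open(path, "wb") as fh:
        pickle.dump(payload, fh)


def load_checkpoint(path):
    """Read the checkpoint dict back from disk."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.pkl")
    save_checkpoint(
        path,
        epoch=2,
        global_step=100,
        model_state={"w": [0.1, 0.2]},
        optimizer_state={"lr": 0.01},
        run_id="parent-run",
    )
    restored = load_checkpoint(path)

print(restored["global_step"], restored["run_id"])  # 100 parent-run
```

Storing `run_id` and `global_step` next to the weights is what lets a fork be created later from the file alone, which matches the "run_id used for forking" relation in the diagram.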
Hey @LeoRoccoBreedt - I've reviewed your changes and they look great!
Description
Include a summary of the changes and the related issue.
Related to: <ClickUp/JIRA task name>
Any expected test failures?
Torch support for Python 3.13.
Add a [X] to relevant checklist items
❔ This change
✔️ Pre-merge checklist
🧪 Test Configuration
Summary by Sourcery
Add a new how-to guide notebook demonstrating how to fork and resume long training runs with Neptune and include it in the CI test-notebooks workflow.
New Features:
Documentation:
Tests:
Summary by Sourcery
Add a tutorial notebook for forking long model training runs and include it in the CI test-notebooks workflow
New Features:
Documentation:
Tests: