Example: Forking long training runs #14

LeoRoccoBreedt · 2025-05-14T13:10:56Z

Description

Include a summary of the changes and the related issue.

Related to: <ClickUp/JIRA task name>

Any expected test failures?
Torch support for Python 3.13.

Add a [X] to relevant checklist items

❔ This change

adds a new feature
fixes breaking code
is cosmetic (refactoring/reformatting)

✔️ Pre-merge checklist

Refactored code (sourcery)
Tested code locally
Precommit installed and run before pushing changes
Added code to GitHub tests (notebooks, scripts)
Updated GitHub README
Updated the projects overview page on Notion

🧪 Test Configuration

OS: Windows
Python version: 3.12
Neptune version: 0.12
Affected libraries with version:

Summary by Sourcery

Add a new how-to guide notebook demonstrating how to fork and resume long training runs with Neptune and include it in the CI test-notebooks workflow.

New Features:

Introduce a tutorial notebook on forking and resuming long training runs with Neptune.

Documentation:

Add user-facing how-to guide for long-run forking in the documentation.

Tests:

Update the test-notebooks CI workflow to include the new forking-long-runs notebook.

Summary by Sourcery

Add a tutorial notebook for forking long model training runs and include it in the CI test-notebooks workflow

New Features:

Introduce a how-to guide notebook demonstrating forking and resuming long training runs with Neptune

Documentation:

Add user-facing documentation for long-run forking in the how-to-guides

Tests:

Update the test-notebooks CI workflow to include the new forking-long-runs notebook

sourcery-ai · 2025-05-14T13:11:00Z

Reviewer's Guide

Introduce a new how-to notebook demonstrating forking, checkpointing, and parallel long-training runs with Neptune, and add it to the CI test-notebooks workflow.

Sequence diagram for forking and resuming a training run

sequenceDiagram
    actor User
    participant Notebook
    participant NeptuneRun
    participant Checkpoint
    User->>Notebook: Load checkpoint
    Notebook->>Checkpoint: Read model/optimizer state
    User->>Notebook: Create forked Neptune Run
    Notebook->>NeptuneRun: Initialize forked run (fork_run_id, fork_step)
    User->>Notebook: Resume training
    Notebook->>NeptuneRun: Log metrics for forked run
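
The fork-and-resume flow in the diagram above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: `StubRun` is a hypothetical stand-in for a tracker run (it is not the Neptune client), `resume_as_fork` is an invented helper, and the rule that a fork only logs steps after `fork_step` is an assumption made for the sketch. Only the names `fork_run_id`, `fork_step`, `log_metrics`, `run_id`, and `global_step` come from the guide.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StubRun:
    """Hypothetical stand-in for a tracker run; NOT the Neptune client."""

    run_id: str
    fork_run_id: Optional[str] = None
    fork_step: Optional[int] = None
    metrics: dict = field(default_factory=dict)

    def log_metrics(self, data: dict, step: int) -> None:
        # Assumption for this sketch: a fork only logs steps after its fork point.
        if self.fork_step is not None and step <= self.fork_step:
            raise ValueError("step must be greater than fork_step")
        self.metrics[step] = dict(data)


def resume_as_fork(checkpoint: dict, new_run_id: str, extra_steps: int) -> StubRun:
    """Create a forked run from checkpoint metadata and keep logging after it."""
    start = checkpoint["global_step"]
    run = StubRun(
        run_id=new_run_id,
        fork_run_id=checkpoint["run_id"],  # parent run recorded in the checkpoint
        fork_step=start,                   # last step inherited from the parent
    )
    for step in range(start + 1, start + 1 + extra_steps):
        run.log_metrics({"loss": 1.0 / step}, step=step)  # placeholder loss value
    return run


checkpoint = {"run_id": "base-run", "global_step": 100, "epoch": 5}
fork = resume_as_fork(checkpoint, new_run_id="fork-1", extra_steps=3)
print(fork.fork_run_id, fork.fork_step, sorted(fork.metrics))
```

Storing the parent `run_id` and `global_step` in the checkpoint is what lets the fork be created later without consulting the original run.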

Sequence diagram for launching multiple parallel forks

sequenceDiagram
    actor User
    participant Notebook
    participant ThreadPoolExecutor
    participant NeptuneRun
    participant Checkpoint
    User->>Notebook: Define trial configs
    User->>Notebook: Launch ThreadPoolExecutor
    Notebook->>ThreadPoolExecutor: Submit forked runs
    ThreadPoolExecutor->>Notebook: For each config
    Notebook->>Checkpoint: Load checkpoint
    Notebook->>NeptuneRun: Initialize forked run
    Notebook->>NeptuneRun: Log metrics for forked run (parallel)
    ThreadPoolExecutor->>User: Return run URLs
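
The parallel-fork pattern above could look roughly like the following sketch. `run_forked_trial`, the `lr` config key, and the placeholder URL are all hypothetical; a real worker would load the checkpoint, create the forked run, and train before returning the run's URL.

```python
from concurrent.futures import ThreadPoolExecutor


def run_forked_trial(config: dict) -> str:
    """Hypothetical worker: stands in for load checkpoint -> fork -> train."""
    run_id = f"fork-lr-{config['lr']}"
    # Placeholder URL; a real worker would return the tracker's run URL.
    return f"https://example.invalid/runs/{run_id}"


trial_configs = [{"lr": 0.1}, {"lr": 0.01}, {"lr": 0.001}]

with ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in input order, so URLs line up with trial_configs.
    run_urls = list(executor.map(run_forked_trial, trial_configs))

for config, url in zip(trial_configs, run_urls):
    print(config, url)
```

Threads suit this case when each worker spends most of its time waiting on I/O or on a framework that releases the GIL; CPU-bound pure-Python trials would likely need processes instead.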

Class diagram for SimpleModel and training utilities

classDiagram
    class SimpleModel {
        +__init__(input_size, hidden_size, output_size, num_layers)
        +forward(x)
        model: nn.Sequential
    }
    class Run {
        +log_metrics(data, step)
        +log_configs(configs)
        +add_tags(tags)
        +close()
        +get_run_url()
        _run_id
    }
    class Checkpoint {
        +save_checkpoint(epoch, global_step, model, optimizer, run)
        +load_checkpoint(model, optimizer, checkpoint_path)
        epoch
        global_step
        model_state_dict
        optimizer_state_dict
        run_id
    }
    class train {
        +train(run, model, params, train_loader, optimizer, epoch_start, step_start, forked)
    }
    SimpleModel <.. train : used by
    Run <.. train : used by
    Checkpoint <.. train : used by
    Checkpoint <.. Run : run_id used for forking
    train <.. ThreadPoolExecutor : used for parallel forks
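
As a rough illustration of the `Checkpoint` fields in the class diagram, the sketch below saves and restores them with the standard library only. The signatures here are simplified and hypothetical (they take plain dictionaries and a path rather than model, optimizer, and run objects), and `pickle` is a stand-in; with PyTorch, `torch.save` and `torch.load` would typically take its place.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, epoch, global_step, model_state, optimizer_state, run_id):
    """Write the fields listed in the class diagram to a single file."""
    checkpoint = {
        "epoch": epoch,
        "global_step": global_step,
        "model_state_dict": model_state,
        "optimizer_state_dict": optimizer_state,
        "run_id": run_id,  # kept so a later fork knows its parent run
    }
    with open(path, "wb") as f:
        pickle.dump(checkpoint, f)


def load_checkpoint(path):
    """Read the checkpoint dictionary back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.pkl")
    save_checkpoint(
        path,
        epoch=5,
        global_step=100,
        model_state={"w": [0.1, 0.2]},   # placeholder weights
        optimizer_state={"lr": 0.01},    # placeholder optimizer state
        run_id="base-run",
    )
    restored = load_checkpoint(path)

print(restored["run_id"], restored["global_step"])
```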

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Add tutorial notebook illustrating run forking and checkpoint workflows | Create a new notebook outlining environment setup, model definition, checkpoint save/load, and Neptune initialization. Implement training functions that handle initial runs, single forks, and parallel forks with metric logging. Provide example code blocks for creating base experiments, loading checkpoints, forking runs, and launching multiple forks. Document analysis of fork relationships and Neptune lineage tracking. | `how-to-guides/forking-long-runs/fork_long_training_runs.ipynb` |
| Register the new notebook in the CI test-notebooks workflow | Insert the new notebook path into the alphabetical list of tested notebooks in test-notebooks.yml. | `.github/workflows/test-notebooks.yml` |

sourcery-ai

Hey @LeoRoccoBreedt - I've reviewed your changes and they look great!

LeoRoccoBreedt added 30 commits February 26, 2025 11:07

feat: Added initial Pytorch example to monitor batching and per layer metrics

b837338

refactor: Update introduction section for more clarity on notebook use

9994528

chore: change how the custom run id gets automatically generated

223fde1

chore: update instructions on how users can get and set their API token and project name

ceb3a24

chore: update the introduction to be more foundation model training orientated

39f5bb9

chore: update dataset section with a better description

69fdfba

refactor: update training loop where grads, norms and activations are tracked

1cde287

refactor: update batch size and edit gradient norm logging code

2da28f4

chore: add data file to ignore for pytorch example

123fea2

refactor: update model architecture layers and update training loop

86b6d5d

refactor: update accuracy calculation to not output the percentage

a39077a

refactor: update model architecture layers, accuracy calculation and calculate gradient norms for batch (step) rather than epoch

3e2d453

feat: Added a pytorch text-based example that is used to demonstrate debugging when building LLM's

8a3da57

refactor: add validation and test loss calculation for each epoch

d03fd43

refactor: update configs and parameters

b7f8e08

refactor: update logged configs

b113a36

refactor: calculate activations per layer

7258993

refactor: add tracking for grad norms

ae330e6

refactor: add gradient tracking per epoch

d4d160c

chore: remove unneeded section

890bf26

refactor: add fully connected layer to model for more complexity

ebd34c7

chore: fix activation saving for layers

ae9abc6

refactor: update packages for example

6f23761

refactor: update dataset to be used in example

2c82a0f

refactor: update evaluation function that calculates the validation losses using the data loader

48451d6

refactor: update training loop to work with new data

5b5fbdd

chore: remove unused sections

8ff6f30

chore: re-organize notebook layout

b32b27b

chore: cleanup and add parameters in right place

4d659a2

refactor: add all debugging metrics to the same dictionary variable

5e7ec6e

LeoRoccoBreedt added 12 commits April 9, 2025 10:13

chore: Add header links to GH, Neptune and docs

62227bc

These need to be updated to the final branch when merged

Merge commit 'fc4bc5ee2d6e3297c8611991e60f372ea785d213' into lb/debugging_model_training

f64d54d

style: minor updates to markdown

505f41c

small changes

a1eb757

testing with mnist dataset

4246341

refactor: update example and workflow

edc46d4

chore: update colab link and cleanup notebook

9b66894

chore: remove unused code for this example

074490e

chore: update notebook workflow tests

1d56369

chore: remove more unused code

065a6d8

chore: update wrong gitignore message

ac67a23

chore: remove unused images

cfa0035

LeoRoccoBreedt changed the title ~~Lb/forking long runs~~ Example: Forking long training runs May 14, 2025

LeoRoccoBreedt added 3 commits May 14, 2025 15:15

Merge commit '0bda334ac5ae771af476abc351a8c6e92cd9d5fb' into lb/forking-long-runs

802e228

fix: add numpy to dependencies

fc78863

fix: f-string error

646cd60

LeoRoccoBreedt self-assigned this Jun 6, 2025

LeoRoccoBreedt and others added 8 commits July 1, 2025 16:27

Merge branch 'main' into lb/forking-long-runs

5a08a83

Merge branch 'lb/forking_with_checkpoints' into lb/forking-long-runs

c6649c7

feat: add multi-fork support in example

df126bc

refactor: updates to forking example

4dd14be

fix: update dependencies in notebook

dc1a4d8

refactor: removed unused notebook

929b0a1

Merge branch 'main' into lb/forking-long-runs

3d1014e

chore: minor fixes to the notebook

fd2b920

LeoRoccoBreedt marked this pull request as ready for review July 10, 2025 10:26

LeoRoccoBreedt requested a review from a team as a code owner July 10, 2025 10:26

sourcery-ai bot reviewed Jul 10, 2025

style: wording from marketing alignment on problem we are solving

3dd3832
