Fix small issue with train/testing examples by mkolodner-sc · Pull Request #331 · Snapchat/GiGL

mkolodner-sc · 2025-09-22T16:40:31Z

Scope of work done

This PR addresses two issues:

If should_skip_training is True, we still try to log using train_start_time after testing, but that variable is never initialized on this path, which causes an error.
If should_skip_training is True, we still try to re-save the model to the model URI, which should only happen after training.

We fix this by moving the train_start_time initialization and the model-saving code into the branch that runs only when should_skip_training=False, and we add a timer that logs how long testing takes.
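The restructured control flow can be sketched roughly as follows. This is a minimal illustration, not the code from the example scripts: `run_example`, `train_fn`, `load_fn`, `test_fn`, `save_fn`, and `log` are hypothetical names, and loading an existing model when training is skipped is an assumption. Only `should_skip_training` and `train_start_time` come from the PR description.

```python
import time


def run_example(should_skip_training, train_fn, load_fn, test_fn, save_fn, log=print):
    """Hypothetical driver illustrating the fix; not the actual GiGL example code."""
    if not should_skip_training:
        # train_start_time exists only on the training path, so it can never
        # be read uninitialized when training is skipped.
        train_start_time = time.monotonic()
        model = train_fn()
        log(f"Training took {time.monotonic() - train_start_time:.2f}s")
        # Saving happens only after training has actually run.
        save_fn(model)
    else:
        # Assumption: when training is skipped, an existing model is loaded
        # and is not re-saved.
        model = load_fn()

    # Testing gets its own timer, independent of the training branch.
    test_start_time = time.monotonic()
    test_fn(model)
    log(f"Testing took {time.monotonic() - test_start_time:.2f}s")
```

With this shape, skipping training calls neither the timer for training nor `save_fn`, which covers both issues described above.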

There is also a small consistency change involving this code block:

    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # Releases all unoccupied cached memory currently held by the caching allocator on the CUDA-enabled GPU
        torch.cuda.synchronize()  # Ensures all CUDA operations have finished
    torch.distributed.barrier()  # Waits for all processes to reach the current point

This block currently runs before .shutdown() in train but after .shutdown() in test. The two should be consistent: the block should run before .shutdown() in both cases, so that the dataloaders are shut down only once it is safe to do so (all processes have reached the shutdown stage) and all cached memory is released before the .shutdown() call drops its references.
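The intended ordering can be sketched as a small helper that is shared by the train and test paths. This is an illustration under assumptions: `sync_then_shutdown` and the `loader` argument are hypothetical names, the guarded `torch` import exists only so the sketch runs where PyTorch is not installed, and the `is_initialized()` check before the barrier is an added safeguard that the block in the PR does not include.

```python
try:
    import torch
except ImportError:  # lets the sketch run without PyTorch installed
    torch = None


def sync_then_shutdown(loader):
    """Hypothetical helper: clean up and synchronize first, shut down last."""
    if torch is not None:
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release unoccupied cached GPU memory
            torch.cuda.synchronize()  # wait for outstanding CUDA work to finish
        # Assumption: only call barrier() when a process group is initialized.
        if torch.distributed.is_available() and torch.distributed.is_initialized():
            torch.distributed.barrier()  # wait for all processes to reach this point
    # Shut down only after every process has reached the barrier.
    loader.shutdown()
```

Calling the same helper from both paths keeps the ordering identical in training and testing.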

Where is the documentation for this feature?: N/A

Did you add automated tests or write a test plan?

Updated Changelog.md? NO

Ready for code review?: NO

mkolodner-sc added 2 commits September 22, 2025 16:38

Update

cf2efe8

update

b01f406

kmontemayor2-sc approved these changes Sep 22, 2025

Comment thread examples/link_prediction/heterogeneous_training.py Outdated

mkolodner-sc added 2 commits September 22, 2025 16:58

add comment for else for should_skip_training=True

01a27aa

Fix

fa24cdf

yliu2-sc approved these changes Sep 22, 2025

mkolodner-sc marked this pull request as ready for review September 22, 2025 20:37

mkolodner-sc requested review from nshah-sc, svij-sc, xgao4-sc and zfan3-sc as code owners September 22, 2025 20:37

mkolodner-sc added this pull request to the merge queue Sep 22, 2025

Merged via the queue into main with commit c9a5653 Sep 22, 2025
4 checks passed

mkolodner-sc deleted the mkolodner-sc/fix_train_example branch September 22, 2025 21:45

mkolodner-sc merged 4 commits into main from mkolodner-sc/fix_train_example
