Skip to content

Fix small issue with train/testing examples#331

Merged
mkolodner-sc merged 4 commits intomainfrom
mkolodner-sc/fix_train_example
Sep 22, 2025
Merged

Fix small issue with train/testing examples#331
mkolodner-sc merged 4 commits intomainfrom
mkolodner-sc/fix_train_example

Conversation

@mkolodner-sc
Copy link
Copy Markdown
Collaborator

@mkolodner-sc mkolodner-sc commented Sep 22, 2025

Scope of work done

This PR addresses two issues:

  • If should_skip_training is True, we try to log a train_start_time after testing, but it has not been initialized, which causes error.
  • If should_skip_training is True, we still try to re-save the model to the model uri, which should only be done after training.

We fix this by moving the train_start_time and model saving code to the code which runs if should_skip_training=False, and we introduce a time for logging how long the testing takes.

There is also a small consistency change here where we have the code block:

    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # Releases all unoccupied cached memory currently held by the caching allocator on the CUDA-enabled GPU
        torch.cuda.synchronize()  # Ensures all CUDA operations have finished
    torch.distributed.barrier()  # Waits for all processes to reach the current point

which is currently done before calling .shutdown() in train and done after calling .shutdown() in test. It's good to be consistent between the two here, and this block should be called before .shutdown() in both cases so that we are only shutting down the dataloaders once it is safe to do so (all processes have reached the shutdown stage) and all cached memory is cleaned up before references are lost in .shutdown() call.

Where is the documentation for this feature?: N/A

Did you add automated tests or write a test plan?

Updated Changelog.md? NO

Ready for code review?: NO

Comment thread examples/link_prediction/heterogeneous_training.py Outdated
@mkolodner-sc mkolodner-sc marked this pull request as ready for review September 22, 2025 20:37
@mkolodner-sc mkolodner-sc added this pull request to the merge queue Sep 22, 2025
Merged via the queue into main with commit c9a5653 Sep 22, 2025
4 checks passed
@mkolodner-sc mkolodner-sc deleted the mkolodner-sc/fix_train_example branch September 22, 2025 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants