Fixes for OGB experiments #18
szaman19 commented on Jul 23, 2025
- Fixes single-process bug by using the comm barrier rather than the dist barrier, with a no-op barrier for the single-process comm case
- Fixes multi-process bug and updates README for clearer instructions
Pull Request Overview
This PR addresses bugs in the OGB experiments related to process synchronization in both single-process and multi-process scenarios. The changes improve the robustness of distributed training by properly handling barriers and provide clearer documentation for running distributed experiments.
- Replaces `dist.barrier()` calls with `comm.barrier()` to use the communication backend abstraction
- Adds a barrier method implementation for single-process communication with no-op behavior
- Updates documentation with specific instructions for HPC environments and torchrun-hpc usage
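
A minimal sketch of the no-op barrier idea described in the list above. The names `Communicator`, `SingleProcessComm`, and `prepare_and_sync` are hypothetical and are not taken from the DGraph code base; the point is only that call sites can call `comm.barrier()` unconditionally when the single-process communicator supplies a barrier that does nothing.

```python
from typing import List, Protocol


class Communicator(Protocol):
    """Hypothetical minimal interface; not DGraph's actual API."""

    def get_world_size(self) -> int: ...

    def barrier(self) -> None: ...


class SingleProcessComm:
    """Stand-in communicator for a run with exactly one process."""

    def get_world_size(self) -> int:
        return 1

    def barrier(self) -> None:
        # No-op: with a single rank there is nobody else to wait for.
        return None


def prepare_and_sync(comm: Communicator, log: List[str]) -> None:
    """Call site that synchronizes without knowing which backend is in use."""
    log.append("before barrier")
    comm.barrier()
    log.append("after barrier")


log: List[str] = []
prepare_and_sync(SingleProcessComm(), log)
```

With this shape, a multi-process communicator could forward `barrier()` to its backend while the call site stays unchanged.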
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| experiments/OGB/main.py | Adds single-process barrier no-op and replaces direct dist.barrier() calls with comm.barrier() |
| experiments/OGB/Readme.md | Updates documentation with torchrun-hpc command and HPC-specific configuration notes |
| DGraph/distributed/nccl/NCCLBackendEngine.py | Fixes class-level initialization state tracking and adds proper barrier method implementation |

     def destroy(self) -> None:
    -    if self._initialized:
    +    if NCCLBackendEngine._is_initialized:

The code references 'NCCLBackendEngine._is_initialized' but based on the context, this appears to be changing from 'self._initialized'. This could cause issues if '_is_initialized' is not properly defined as a class variable or if other methods still reference 'self._initialized'.
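
The concern in this comment can be illustrated with a small, self-contained sketch. `Engine` is a hypothetical stand-in, not the real `NCCLBackendEngine`; it only shows how a class-level flag is shared across instances, and how assigning through `self` (or any instance) creates a separate instance attribute that shadows it.

```python
class Engine:
    """Hypothetical stand-in for a backend engine with class-level state."""

    _is_initialized = False  # class attribute: one flag shared by all instances

    def init(self) -> None:
        if not Engine._is_initialized:
            # Real code would set up the communication backend here.
            Engine._is_initialized = True

    def destroy(self) -> None:
        if Engine._is_initialized:
            # Real code would tear down the communication backend here.
            Engine._is_initialized = False


a, b = Engine(), Engine()
a.init()
shared_seen_by_b = b._is_initialized  # read falls through to the class attribute

# Pitfall: assigning through an instance creates an instance attribute
# that shadows the class flag instead of updating it.
a._is_initialized = False
still_set_on_class = Engine._is_initialized

# Clean up the shadowing attribute, then tear down through another instance.
del a._is_initialized
b.destroy()
cleared_after_destroy = Engine._is_initialized
```

Reading and writing the flag consistently through the class name, as in `init` and `destroy` here, avoids the mixed state the comment warns about.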
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
FYI, here is the bug I found:
@szaman19 Would it be possible to fix the
@bvanessen I think this PR is ready to go
The last fix was too quick

    # For the first time, the code downloads and processes the data
    # doing that on all ranks causes a race condition
    comm_object.barrier()
    # Load the dataset on all other ranks

@szaman19 What is the race here? Is it that the first rank downloads the dataset to local disk and the rest then load it from local disk, so the race is between concurrent downloads?
Yup, for the first run, OGB will download the raw data, unzip it, and then delete the raw data. Subsequent runs search for the processed files and use them. If we don't lock around it, the concurrent downloads and processing calls result in OS errors.
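
The locking pattern described here can be sketched without any distributed runtime. This is an illustration only, under loud assumptions: threads stand in for ranks, `threading.Barrier` stands in for `comm_object.barrier()`, and writing a small file stands in for OGB's download and processing step. `load_dataset` and `run` are hypothetical names.

```python
import os
import tempfile
import threading
from typing import List, Optional, Tuple


def load_dataset(rank: int, barrier: threading.Barrier,
                 path: str, downloads: List[int]) -> str:
    if rank == 0:
        # Only rank 0 performs the one-time "download and process" step.
        with open(path, "w") as f:
            f.write("processed")
        downloads.append(rank)
    # Every rank waits here until rank 0 has finished writing the file.
    barrier.wait()
    # All ranks can now read the already-processed file safely.
    with open(path) as f:
        return f.read()


def run(world_size: int = 4) -> Tuple[List[int], List[Optional[str]]]:
    downloads: List[int] = []
    results: List[Optional[str]] = [None] * world_size
    barrier = threading.Barrier(world_size)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dataset.txt")

        def worker(rank: int) -> None:
            results[rank] = load_dataset(rank, barrier, path, downloads)

        threads = [threading.Thread(target=worker, args=(r,))
                   for r in range(world_size)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return downloads, results


downloads, results = run()
```

Only one simulated rank touches the download path, and the others read after the barrier, which is the ordering the reply above relies on to avoid OS errors from concurrent downloads.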