Skip to content

Fixes for OGB experiments #18

Merged
bvanessen merged 12 commits intoLBANN:mainfrom
szaman19:ogbn-patch
Aug 25, 2025
Merged

Fixes for OGB experiments #18
bvanessen merged 12 commits intoLBANN:mainfrom
szaman19:ogbn-patch

Conversation

@szaman19
Copy link
Copy Markdown
Contributor

  • Fixes single process bug by using comm barrier rather than dist barrier and having a no-op for single process comm case
  • Fixes multi-process bug and updates README for clearer instructions

@szaman19 szaman19 requested review from bvanessen and Copilot July 23, 2025 20:40
@szaman19 szaman19 self-assigned this Jul 23, 2025
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses bugs in the OGB experiments related to process synchronization in both single-process and multi-process scenarios. The changes improve the robustness of distributed training by properly handling barriers and provide clearer documentation for running distributed experiments.

  • Replaces dist.barrier() calls with comm.barrier() to use communication backend abstraction
  • Adds barrier method implementation for single-process communication with no-op behavior
  • Updates documentation with specific instructions for HPC environments and torchrun-hpc usage

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
experiments/OGB/main.py Adds single-process barrier no-op and replaces direct dist.barrier() calls with comm.barrier()
experiments/OGB/Readme.md Updates documentation with torchrun-hpc command and HPC-specific configuration notes
DGraph/distributed/nccl/NCCLBackendEngine.py Fixes class-level initialization state tracking and adds proper barrier method implementation

Comment thread experiments/OGB/Readme.md Outdated

def destroy(self) -> None:
if self._initialized:
if NCCLBackendEngine._is_initialized:
Copy link

Copilot AI Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code references 'NCCLBackendEngine._is_initialized' but based on the context, this appears to be changing from 'self._initialized'. This could cause issues if '_is_initialized' is not properly defined as a class variable or if other methods still reference 'self._initialized'.

Copilot uses AI. Check for mistakes.
szaman19 and others added 2 commits July 23, 2025 16:45
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@KIwabuchi
Copy link
Copy Markdown
Contributor

FYI, here is the bug I found:

diff --git a/DGraph/data/ogbn_datasets.py b/DGraph/data/ogbn_datasets.py
index d8ebddd..8125d58 100644
--- a/DGraph/data/ogbn_datasets.py
+++ b/DGraph/data/ogbn_datasets.py
@@ -201,7 +201,7 @@ class DistributedOGBWrapper(Dataset):
         dir_name = dir_name if dir_name is not None else os.getcwd() + "/data"

         if not os.path.exists(dir_name):
-            os.makedirs(dir_name)
+            os.makedirs(dir_name, exist_ok=True)

         cached_graph_file = f"{dir_name}/{dname}_graph_data_{self._world_size}.pt"

@KIwabuchi
Copy link
Copy Markdown
Contributor

@szaman19 Would it be possible to fix the makedirs bug I found and merge this PR soon?
We want to use this example in the project I'm working on.

@szaman19
Copy link
Copy Markdown
Contributor Author

@szaman19 Would it be possible to fix the makedirs bug I found and merge this PR soon? We want to use this example in the project I'm working on.

@bvanessen I think this PR is ready to go

# For the first time, the code downloads and processes the data
# doing that on all ranks causes a race condition
comm_object.barrier()
# Load the dataset on all other ranks
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szaman19 What is the issue with the race here, is it that the first one will download the data set to local disk and then the rest will load from local disk? (race is concurrent downloads?)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, for the first run, OGB will download the raw data, unzip it, and then delete the raw data. Subsequent runs search for the processed files and use them. If we don't lock around it, the concurrent downloads and processing calls result in OS errors.

@bvanessen bvanessen merged commit 954bd99 into LBANN:main Aug 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants