
Fix: Defensively close GPU device FDs in dataloader worker processes#3684

Merged
asolergi-nv merged 2 commits into NVIDIA:main from hexinw-nvidia:close_nvidia_fds
Mar 18, 2026

Conversation

@hexinw-nvidia
Contributor

@hexinw-nvidia hexinw-nvidia commented Mar 4, 2026

This ensures workers do not keep references into NVIDIA memory space after fork. This helps ensure GPU memory can be reclaimed even if a dataloader worker is delayed or fails to exit.
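The defensive close can be sketched as a scan of the worker's own file-descriptor table via /proc, closing any descriptor that resolves to a /dev/nvidia* device node. This is a hypothetical illustration of the approach described above, not the actual code in this PR; the function names (`close_nvidia_fds`, `worker_init_fn`) are placeholders.

```python
import os

def close_nvidia_fds():
    """Close any inherited /dev/nvidia* file descriptors in this process.

    Hypothetical sketch of the defensive close this PR describes;
    the real Megatron-LM implementation may differ.
    """
    fd_dir = "/proc/self/fd"
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd already gone (e.g. the listdir fd itself)
        if target.startswith("/dev/nvidia"):
            try:
                os.close(int(name))
            except OSError:
                pass  # best effort: never fail the worker over a close

def worker_init_fn(worker_id: int) -> None:
    # Runs once per dataloader worker, right after fork, so the worker
    # drops the GPU device references it inherited from the parent rank.
    close_nvidia_fds()
```

A function like this would typically be wired in as the DataLoader's `worker_init_fn` so it runs before the worker touches any data.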

How to Reproduce / Validate:

  1. Force a long-running dataloader worker
    Modify GPTDataset.__getitem__ to insert:

    time.sleep(3600)

    This simulates a stuck dataloader worker (e.g., blocked in I/O).

  2. Start training
    Launch a 1-node Megatron-LM job with:

    --num-workers > 0

  3. Verify dataloader workers are alive
    On the GPU node:

    sudo fuser -v /dev/nvidia*

    You should see the dataloader worker processes listed.
    With this patch, they should not retain active /dev/nvidia* file
    descriptors even though they are running.

  4. Trigger a rank failure
    Send SIGTERM to one of the training ranks:

    kill -15 <rank_pid>

  5. Observe GPU memory reclaim
    Run:

    nvidia-smi

    The corresponding rank’s GPU memory usage should return to 0
    immediately (assuming no other GPU-holding child processes such as async
    checkpoint workers are present in this test).

  6. Baseline (without this patch)
    Repeat the same steps without this change.

    After killing the rank in step 4, you will observe that GPU memory remains non-zero in nvidia-smi, because the dataloader worker still holds /dev/nvidia* references.
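The fuser check in step 3 can also be done per process by reading /proc directly, which makes it easy to confirm that a specific dataloader worker holds no GPU device descriptors. This is a generic inspection sketch, not part of the patch; WORKER_PID is a placeholder for a PID taken from the fuser output.

```shell
# List any /dev/nvidia* file descriptors a given process still holds.
# WORKER_PID is a placeholder; substitute a dataloader worker PID from step 3.
WORKER_PID=$$
ls -l /proc/"$WORKER_PID"/fd 2>/dev/null | grep '/dev/nvidia' \
    || echo "no /dev/nvidia fds held"
```

With the patch applied, the grep should match nothing for dataloader workers even while they are running.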

@copy-pr-bot

copy-pr-bot bot commented Mar 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 4, 2026 02:37
@Phlip79
Member

Phlip79 commented Mar 4, 2026

/claude review

@Phlip79
Member

Phlip79 commented Mar 4, 2026

We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged.

Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process at submit.md.

@Phlip79 Phlip79 marked this pull request as draft March 4, 2026 23:48
@hexinw-nvidia hexinw-nvidia marked this pull request as ready for review March 5, 2026 18:08
@hexinw-nvidia
Contributor Author

/claude review

@ericharper ericharper requested a review from asolergi-nv March 9, 2026 23:00
@hexinw-nvidia
Contributor Author

/ok to test d70e411

@chtruong814
Contributor

/ok to test d70e411

@copy-pr-bot

copy-pr-bot bot commented Mar 13, 2026

/ok to test d70e411

@chtruong814, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@chtruong814
Contributor

/ok to test 1136a37

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Mar 13, 2026
@asolergi-nv asolergi-nv added this pull request to the merge queue Mar 18, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23256564704

Merged via the queue into NVIDIA:main with commit 1259982 Mar 18, 2026
51 of 53 checks passed