
Fix: Defensively close GPU device FDs in dataloader worker processes#3684

Merged
asolergi-nv merged 2 commits into NVIDIA:main from hexinw-nvidia:close_nvidia_fds
Mar 18, 2026

Conversation

@hexinw-nvidia
Contributor

@hexinw-nvidia hexinw-nvidia commented Mar 4, 2026

This ensures workers do not keep references into NVIDIA memory space after fork. This helps ensure GPU memory can be reclaimed even if a dataloader worker is delayed or fails to exit.
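The defensive close can be sketched as a scan of the worker's own file-descriptor table via /proc, closing any descriptor that resolves to a /dev/nvidia* device node. This is a hypothetical illustration of the approach described above, not the actual code in this PR; the function names (`close_nvidia_fds`, `worker_init_fn`) are placeholders.

```python
import os

def close_nvidia_fds():
    """Close any inherited /dev/nvidia* file descriptors in this process.

    Hypothetical sketch of the defensive close this PR describes;
    the real Megatron-LM implementation may differ.
    """
    fd_dir = "/proc/self/fd"
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd already gone (e.g. the listdir fd itself)
        if target.startswith("/dev/nvidia"):
            try:
                os.close(int(name))
            except OSError:
                pass  # best effort: never fail the worker over a close

def worker_init_fn(worker_id: int) -> None:
    # Runs once per dataloader worker, right after fork, so the worker
    # drops the GPU device references it inherited from the parent rank.
    close_nvidia_fds()
```

A function like this would typically be wired in as the DataLoader's `worker_init_fn` so it runs before the worker touches any data.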

How to Reproduce / Validate:

  1. Force a long-running dataloader worker
    Modify GPTDataset.__getitem__ to insert:

    time.sleep(3600)

    This simulates a stuck dataloader worker (e.g., blocked in I/O).

  2. Start training
    Launch a 1-node Megatron-LM job with:

    --num-workers > 0

  3. Verify dataloader workers are alive
    On the GPU node:

    sudo fuser -v /dev/nvidia*

    You should see the dataloader worker processes listed.
    With this patch, they should not retain active /dev/nvidia* file
    descriptors even though they are running.

  4. Trigger a rank failure
    Send SIGTERM to one of the training ranks:

    kill -15 <rank_pid>

  5. Observe GPU memory reclaim
    Run:

    nvidia-smi

    The corresponding rank’s GPU memory usage should return to 0
    immediately (assuming no other GPU-holding child processes such as async
    checkpoint workers are present in this test).

  6. Baseline (without this patch)
    Repeat the same steps without this change.

    After killing the rank in step 4, you will observe that GPU memory remains non-zero in nvidia-smi, because the dataloader worker still holds /dev/nvidia* references.
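The fuser check in step 3 can also be done per process by reading /proc directly, which makes it easy to confirm that a specific dataloader worker holds no GPU device descriptors. This is a generic inspection sketch, not part of the patch; WORKER_PID is a placeholder for a PID taken from the fuser output.

```shell
# List any /dev/nvidia* file descriptors a given process still holds.
# WORKER_PID is a placeholder; substitute a dataloader worker PID from step 3.
WORKER_PID=$$
ls -l /proc/"$WORKER_PID"/fd 2>/dev/null | grep '/dev/nvidia' \
    || echo "no /dev/nvidia fds held"
```

With the patch applied, the grep should match nothing for dataloader workers even while they are running.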

@copy-pr-bot

copy-pr-bot bot commented Mar 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 4, 2026 02:37
@Phlip79
Member

Phlip79 commented Mar 4, 2026

/claude review

@Phlip79
Member

Phlip79 commented Mar 4, 2026

We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged.

Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process at submit.md.

@Phlip79 Phlip79 marked this pull request as draft March 4, 2026 23:48
@hexinw-nvidia hexinw-nvidia marked this pull request as ready for review March 5, 2026 18:08
@hexinw-nvidia
Contributor Author

/claude review

@ericharper ericharper requested a review from asolergi-nv March 9, 2026 23:00
@hexinw-nvidia
Contributor Author

/ok to test d70e411

@chtruong814
Contributor

/ok to test d70e411

@copy-pr-bot

copy-pr-bot bot commented Mar 13, 2026

/ok to test d70e411

@chtruong814, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@chtruong814
Contributor

/ok to test 1136a37

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Mar 13, 2026
@asolergi-nv asolergi-nv added this pull request to the merge queue Mar 18, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23256564704

Merged via the queue into NVIDIA:main with commit 1259982 Mar 18, 2026
51 of 53 checks passed