fix(ler): force num_workers=0 when torch.compile is active to prevent segfault by ivanbasov · Pull Request #31 · NVIDIA/Ising-Decoding

ivanbasov · 2026-03-30T18:56:59Z

Summary

- In `run_inference_and_decode_pre_decoder_memory`: force `num_workers=0` when `_applied_compile` is True
- In `compute_syndrome_density_reduction`: force `num_workers=0` when `PREDECODER_TORCH_COMPILE` is enabled
- Root cause: `torch.compile` combined with a DataLoader using `multiprocessing_context=spawn` segfaults during LER validation at the end of training, leaving 20 leaked semaphores and a `core dumped` message (run 23575476552)
- `torch.compile` remains active; only the multiprocessing workers are disabled when compile is on. A single-process DataLoader should have no meaningful throughput cost for inference
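The gating described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in `logical_error_rate.py`: the helper names `compile_enabled_from_env` and `effective_num_workers`, the set of values treated as "disabled", and the default used when the variable is unset are all assumptions made for the example.

```python
import os


def compile_enabled_from_env(var="PREDECODER_TORCH_COMPILE", default="1"):
    """Hypothetical parser for the compile toggle.

    Assumption: "0", "false", "off", "no" and the empty string mean disabled;
    anything else means enabled. The real parsing rules may differ.
    """
    value = os.environ.get(var, default).strip().lower()
    return value not in {"0", "false", "off", "no", ""}


def effective_num_workers(requested, compile_active):
    """Return 0 workers when torch.compile is active, else the requested count."""
    if compile_active:
        # Single-process loading avoids combining spawn workers with compile.
        return 0
    return max(0, requested)
```

A caller would pass the result as `num_workers` when building the DataLoader, for example `effective_num_workers(4, compile_enabled_from_env())`.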

Test plan

- Trigger the `orientation-inference` job via `workflow_dispatch` and confirm the "Train all orientations" step completes without a segfault
- Confirm there are no `Segmentation fault` or `leaked semaphore` warnings in the LER validation log
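The second check could be automated with a small log scan along these lines. This is only a sketch: the function name `find_failure_lines` and the exact patterns are assumptions, chosen to match the symptoms quoted in the summary (`Segmentation fault`, `leaked semaphore`, `core dumped`).

```python
import re

# Patterns taken from the symptoms described in this PR; matching is case-insensitive.
FAILURE_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"segmentation fault", r"leaked semaphore", r"core dumped")
]


def find_failure_lines(log_text):
    """Return (line_number, line) pairs for log lines matching a failure pattern."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits
```

An empty result for the LER validation log would correspond to the test plan passing; any returned pair points at the offending line.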

🤖 Generated with Claude Code

bmhowe23 · 2026-03-30T19:13:04Z

> Adds `PREDECODER_TORCH_COMPILE: "0"` to the Train all orientations (O1–O4) step in `.github/workflows/long-running-tests.yml`

If we are adding this to all orientations (O1-O4), then when is torch compilation remaining enabled?

…fault torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ivanbasov · 2026-03-30T21:24:19Z

> > Adds `PREDECODER_TORCH_COMPILE: "0"` to the Train all orientations (O1–O4) step in `.github/workflows/long-running-tests.yml`
>
> If we are adding this to all orientations (O1-O4), then when is torch compilation remaining enabled?

Thanks for the great point! I originally attempted to fix this by disabling torch.compile in the CI
workflow step, but that would have left the compilation path untested entirely. Instead, the fix is in
logical_error_rate.py — when torch.compile is active, the DataLoader is forced to num_workers=0 to avoid
the spawn + compile segfault. torch.compile remains enabled for all orientations; only multiprocessing
workers are disabled during inference/LER evaluation where they aren't needed for throughput anyway.
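One detail worth keeping in mind when forcing `num_workers=0`: as far as I know, PyTorch's `DataLoader` rejects a `multiprocessing_context` argument when `num_workers` is 0, so the spawn context has to be dropped together with the workers. A rough sketch of building the keyword arguments that way (the helper name `dataloader_kwargs` is hypothetical and not taken from `logical_error_rate.py`):

```python
def dataloader_kwargs(batch_size, requested_workers, compile_active):
    """Build DataLoader keyword arguments, disabling workers under torch.compile.

    Hypothetical helper for illustration only.
    """
    num_workers = 0 if compile_active else max(0, requested_workers)
    kwargs = {"batch_size": batch_size, "num_workers": num_workers}
    if num_workers > 0:
        # Only meaningful for multi-process loading; omitted when num_workers is 0.
        kwargs["multiprocessing_context"] = "spawn"
    return kwargs
```

The result would be used as `DataLoader(dataset, **dataloader_kwargs(64, 4, compile_active))`, where `dataset` and the numbers are placeholders.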

kvmto · 2026-03-31T11:43:53Z

I am not sure how we ran into this; PR #29 was made to test for this problem and make sure we had removed it.

Removing parallelism is a nice quick hack, but we need it for production.
I thought the LER computation could be costly, but if you think it doesn't hurt throughput, let's remove it and merge 👍
I was rather sure I had solved this problem though, hmmm

ivanbasov · 2026-03-31T17:31:11Z

Seems to be already fixed by #29.

ivanbasov requested review from bmhowe23 and kvmto March 30, 2026 18:58

ivanbasov force-pushed the fix/ci-torch-compile-orient-segfault branch from 7f0f6c8 to 8410009 Compare March 30, 2026 21:05

ivanbasov changed the title ~~fix(ci): disable torch.compile in orientation training to prevent segfault~~ fix(ler): force num_workers=0 when torch.compile is active to prevent segfault Mar 30, 2026

ivanbasov force-pushed the fix/ci-torch-compile-orient-segfault branch from 8410009 to ec856fa Compare March 30, 2026 21:19

ivanbasov force-pushed the fix/ci-torch-compile-orient-segfault branch from ec856fa to 6161a3e Compare March 30, 2026 21:21

ivanbasov closed this Mar 31, 2026

ivanbasov wants to merge 1 commit into NVIDIA:main from ivanbasov:fix/ci-torch-compile-orient-segfault

3 participants
