fix(ler): force num_workers=0 when torch.compile is active to prevent segfault #31

Closed
ivanbasov wants to merge 1 commit into NVIDIA:main from ivanbasov:fix/ci-torch-compile-orient-segfault


@ivanbasov
Member

@ivanbasov ivanbasov commented Mar 30, 2026

Summary

  • In `run_inference_and_decode_pre_decoder_memory`: force `num_workers=0` when `_applied_compile` is True
  • In `compute_syndrome_density_reduction`: force `num_workers=0` when `PREDECODER_TORCH_COMPILE` is enabled
  • Root cause: `torch.compile` + DataLoader `multiprocessing_context=spawn` segfaults during LER validation at end of training — 20 leaked semaphores and `core dumped` (run 23575476552)
  • torch.compile remains active; only multiprocessing workers are disabled when compile is on — single-process DataLoader has no meaningful throughput cost for inference
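
The guard described in the summary can be sketched roughly as below. This is a minimal illustration, not the code in `logical_error_rate.py`: the helper names `compile_enabled` and `loader_kwargs` are hypothetical, and the set of values treated as "enabled" for `PREDECODER_TORCH_COMPILE` (and the unset-means-disabled default) are assumptions, since the PR only shows `"0"` being used to disable compilation.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Assumption: these spellings count as "enabled". The PR itself only shows "0" (disabled).
_TRUTHY = {"1", "true", "yes", "on"}


def compile_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read PREDECODER_TORCH_COMPILE; an unset variable is treated as disabled (assumption)."""
    env = os.environ if env is None else env
    return env.get("PREDECODER_TORCH_COMPILE", "0").strip().lower() in _TRUTHY


def loader_kwargs(requested_workers: int, compile_active: bool) -> Dict[str, Any]:
    """Build DataLoader keyword arguments, dropping worker processes when compile is on."""
    if requested_workers < 0:
        raise ValueError("requested_workers must be non-negative")
    # Single-process loading whenever torch.compile is active.
    num_workers = 0 if compile_active else requested_workers
    kwargs: Dict[str, Any] = {"num_workers": num_workers}
    if num_workers > 0:
        # DataLoader accepts multiprocessing_context only together with num_workers > 0,
        # so the spawn context is passed only when workers are actually used.
        kwargs["multiprocessing_context"] = "spawn"
    return kwargs
```

A caller would then pass the result through, for example `DataLoader(dataset, **loader_kwargs(4, compile_enabled()))`, or use the `_applied_compile` flag in place of the environment check. Omitting `multiprocessing_context` in the single-process case matters because, to my understanding, `DataLoader` rejects that argument when `num_workers=0`.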

Test plan

  • Trigger `orientation-inference` job via `workflow_dispatch` and confirm "Train all orientations" step completes without segfault
  • Confirm no `Segmentation fault` or `leaked semaphore` warnings in the LER validation log
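
The log check in the second item could be automated with something like the following sketch. The function name `find_crash_markers` is hypothetical, and the marker list is assembled from the strings mentioned in this PR (`Segmentation fault`, `leaked semaphore`, `core dumped`); the exact wording in a real log may differ.

```python
import re
from typing import List, Tuple

# Case-insensitive markers taken from the test plan and the root-cause note.
_CRASH_PATTERN = re.compile(
    r"segmentation fault|leaked semaphore|core dumped", re.IGNORECASE
)


def find_crash_markers(log_text: str) -> List[Tuple[int, str]]:
    """Return (1-based line number, line) pairs for each line containing a crash marker."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if _CRASH_PATTERN.search(line)
    ]
```

An empty result for the LER validation log would satisfy the check; any returned pair points at the line to inspect.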

🤖 Generated with Claude Code

@ivanbasov ivanbasov requested review from bmhowe23 and kvmto March 30, 2026 18:58
@bmhowe23
Collaborator

> Adds `PREDECODER_TORCH_COMPILE: "0"` to the Train all orientations (O1–O4) step in `.github/workflows/long-running-tests.yml`

If we are adding this to all orientations (O1–O4), then when does torch compilation remain enabled?

@ivanbasov ivanbasov force-pushed the fix/ci-torch-compile-orient-segfault branch from 7f0f6c8 to 8410009 Compare March 30, 2026 21:05
@ivanbasov ivanbasov changed the title fix(ci): disable torch.compile in orientation training to prevent segfault fix(ler): force num_workers=0 when torch.compile is active to prevent segfault Mar 30, 2026
@ivanbasov ivanbasov force-pushed the fix/ci-torch-compile-orient-segfault branch from 8410009 to ec856fa Compare March 30, 2026 21:19
…fault

torch.compile=on combined with DataLoader spawn workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ivanbasov ivanbasov force-pushed the fix/ci-torch-compile-orient-segfault branch from ec856fa to 6161a3e Compare March 30, 2026 21:21
@ivanbasov
Member Author

> Adds `PREDECODER_TORCH_COMPILE: "0"` to the Train all orientations (O1–O4) step in `.github/workflows/long-running-tests.yml`
>
> If we are adding this to all orientations (O1-O4), then when is torch compilation remaining enabled?

Thanks for the great point! I originally attempted to fix this by disabling torch.compile in the CI workflow step, but that would have left the compilation path untested entirely. Instead, the fix is in `logical_error_rate.py`: when torch.compile is active, the DataLoader is forced to `num_workers=0` to avoid the spawn + compile segfault. torch.compile remains enabled for all orientations; only multiprocessing workers are disabled during inference/LER evaluation, where they aren't needed for throughput anyway.

@kvmto
Collaborator

kvmto commented Mar 31, 2026

I am not sure how we ran into this; the following PR was made to test and confirm that this problem was gone:
#29

Removing parallelism is a nice quick hack, but we need it for production.
I thought the LER computation could be costly, but if you think this doesn't hurt throughput, let's remove it and merge 👍
I was fairly sure I had solved this problem, though. Hmm.

@ivanbasov
Member Author

This seems to be already fixed by #29.

@ivanbasov ivanbasov closed this Mar 31, 2026