[cleanup] Remove NCCL_CUMEM_ENABLE=0 from prepare_runtime_environment #1600

Merged
SumanthRH merged 1 commit into NovaSky-AI:main from erictang000:remove_nccl_cumem_enable
Apr 30, 2026

Conversation

Collaborator

@erictang000 erictang000 commented Apr 30, 2026

We previously had the following snippet in prepare_runtime_environment:

    # NOTE (charlie): See https://github.com/vllm-project/vllm/blob/c6b0a7d3ba03ca414be1174e9bd86a97191b7090/vllm/worker/worker_base.py#L445
    # and https://docs.vllm.ai/en/v0.9.2/usage/troubleshooting.html?h=nccl_cumem_enable#known-issues
    if cfg.generator.inference_engine.weight_sync_backend == "nccl":
        env_vars["NCCL_CUMEM_ENABLE"] = "0"

The NCCL bug that required this was resolved in NCCL 2.22.3, and this override was removed from vLLM:

NCCL: NVIDIA/nccl#1234
vLLM: vllm-project/vllm#24141

Since PyTorch ships its own NCCL and we are pinned to 2.10.0 (NCCL 2.26.2), which is newer than the release containing the fix, it seems safe to remove this env var. It was only needed as a workaround for older NCCL versions.
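The reasoning above comes down to comparing the bundled NCCL version against 2.22.3. As a minimal sketch of what a version-gated workaround could look like if it ever had to be kept for older NCCL builds (the helper names `detect_nccl_version`, `nccl_has_cumem_fix`, and `nccl_env_overrides` are hypothetical and not part of this PR, which removes the override unconditionally):

```python
from typing import Dict, Optional, Tuple

# First NCCL release containing the fix, per the PR description.
FIXED_NCCL_VERSION: Tuple[int, int, int] = (2, 22, 3)


def detect_nccl_version() -> Optional[Tuple[int, ...]]:
    """Best-effort lookup of the NCCL version bundled with PyTorch.

    Returns None when PyTorch is missing, built without NCCL, or reports
    the version in an unexpected format.
    """
    try:
        import torch  # optional dependency for this sketch

        return tuple(torch.cuda.nccl.version())
    except Exception:
        return None


def nccl_has_cumem_fix(version: Tuple[int, ...]) -> bool:
    """True if the given (major, minor, patch) is at or past the fixed release."""
    return tuple(version[:3]) >= FIXED_NCCL_VERSION


def nccl_env_overrides(
    weight_sync_backend: str, version: Optional[Tuple[int, ...]]
) -> Dict[str, str]:
    """Return env vars to set; only disables cuMem on known-old NCCL builds."""
    env_vars: Dict[str, str] = {}
    if (
        weight_sync_backend == "nccl"
        and version is not None
        and not nccl_has_cumem_fix(version)
    ):
        env_vars["NCCL_CUMEM_ENABLE"] = "0"
    return env_vars
```

With the NCCL 2.26.2 version mentioned above, `nccl_env_overrides("nccl", (2, 26, 2))` returns an empty dict, which matches the behavior after this PR.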

In fact, Nemo-RL sets this env var to 1 (link).

Verified that GSM8K still works with this flag removed, in both colocated and non-colocated setups, and with vLLM tp=2.

Max GPU memory utilization is also slightly lower with this env var removed, since removing it enables the memory optimizations in newer NCCL versions (as mentioned in vllm-project/vllm#24141):

image
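For anyone wanting to reproduce a comparison like this, here is a rough sketch that polls `nvidia-smi` and tracks the per-GPU peak of `memory.used`. The function names are illustrative, and this is not necessarily how the numbers in the screenshot were collected.

```python
import subprocess
from typing import List

# Query used memory (MiB) for every visible GPU, one number per line.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_used_mib(output: str) -> List[int]:
    """Parse nvidia-smi output into a list of MiB values, one per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def sample_memory_used_mib() -> List[int]:
    """Take one sample; requires nvidia-smi on PATH."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_memory_used_mib(result.stdout)


def peak_per_gpu(samples: List[List[int]]) -> List[int]:
    """Reduce a series of samples to the maximum observed value per GPU."""
    return [max(per_gpu) for per_gpu in zip(*samples)]
```

Calling `sample_memory_used_mib()` periodically during a run with the env var set, and again during a run without it, then comparing the two `peak_per_gpu` results, gives a like-for-like view of peak usage.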
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 1 additional finding.

@SumanthRH SumanthRH merged commit f6a61a8 into NovaSky-AI:main Apr 30, 2026
5 of 6 checks passed
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request removes the logic that sets the NCCL_CUMEM_ENABLE environment variable to '0' when the NCCL weight synchronization backend is used during runtime environment preparation. I have no feedback to provide as there are no review comments.
