fix CVEs: nemo-toolkit >=2.7.2, xgrammar >=0.1.32, delete ray_dist.jar #1612

Merged
ayushdg merged 7 commits into NVIDIA-NeMo:main from mohammadaaftabv:audio-cve-fixes on Apr 2, 2026

Conversation

@mohammadaaftabv mohammadaaftabv commented Mar 16, 2026

Summary

Fix four HIGH-severity CVEs affecting Curator dependencies: two in nemo-toolkit (deserialization RCE) and two in transitive dependencies (xgrammar DoS, jackson-core DoS). Also fix a latent flaky test in test_base.py.

CVEs addressed

| CVE / Advisory | Severity | Component | Fix |
| --- | --- | --- | --- |
| GHSA-9379-mwvr-7wxx / CVE-2025-33245 | HIGH (CVSS 8.0) | nemo-toolkit (RCE via unsafe deserialization) | Bump nemo_toolkit[asr] from ==2.4.0 to >=2.7.2 |
| GHSA-hvjw-vp7g-39h5 / CVE-2025-33253 | HIGH (CVSS 7.8) | nemo-toolkit (RCE via unsafe deserialization) | Bump nemo_toolkit[asr] from ==2.4.0 to >=2.7.2 |
| GHSA-7rgv-gqhr-fxg3 / CVE-2026-25048 | HIGH (CVSS 8.7) | xgrammar (DoS via uncontrolled recursion) | Override xgrammar>=0.1.32 in pyproject.toml |
| GHSA-72hv-8253-57qq | HIGH (CVSS 8.7) | jackson-core bundled in Ray's ray_dist.jar (async parser DoS) | Delete ray_dist.jar in Dockerfile |

CVE details

nemo-toolkit (GHSA-9379-mwvr-7wxx, GHSA-hvjw-vp7g-39h5): NeMo < 2.6.1 loads model checkpoints with torch.load() (without weights_only=True) and pickle.load(), both of which can execute arbitrary code embedded in the file. An attacker who convinces a user to load a maliciously crafted .nemo or .ckpt file can achieve remote code execution. In Curator, InferenceAsrNemoStage calls ASRModel.from_pretrained(), which is the exact deserialization path these CVEs target. Fixed in nemo-toolkit >= 2.6.1; bumped to >= 2.7.2 to pick up the latest fixes.
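
To illustrate why unrestricted pickle loading is an RCE path, here is a minimal stdlib-only sketch. It is not NeMo's actual fix (which, per the advisory text above, relates to `weights_only=True` in `torch.load()`); `RestrictedUnpickler`, `SAFE_BUILTINS`, and `Malicious` are hypothetical names chosen for this example.

```python
import io
import pickle

# Hypothetical allow-list for illustration only; a real loader needs its own.
SAFE_BUILTINS = {"dict", "list", "tuple", "set", "int", "float", "str"}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global outside the allow-list."""

    def find_class(self, module, name):
        if module == "builtins" and name in SAFE_BUILTINS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Deserialize bytes while blocking arbitrary callables."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Malicious:
    """Stand-in for a crafted checkpoint object."""

    def __reduce__(self):
        # Plain pickle.loads() would call eval("1 + 1") here; a real
        # payload could call os.system() instead.
        return (eval, ("1 + 1",))


payload = pickle.dumps(Malicious())

try:
    restricted_loads(payload)
except pickle.UnpicklingError as exc:
    print(f"blocked: {exc}")
```

Plain data such as dictionaries of numbers still round-trips through `restricted_loads`, because it never needs to resolve a global.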

xgrammar (GHSA-7rgv-gqhr-fxg3): Constructing a grammar rule with ~30,000 layers of nested parentheses triggers a segfault via uncontrolled recursion (CWE-674) in xgrammar's syntax parsing. Remote attackers can crash any app using xgrammar (e.g., vllm structured output) without authentication. Fixed in xgrammar 0.1.32. vllm pins xgrammar==0.1.29, so we use override-dependencies to bump it to >=0.1.32.
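
The general defence against CWE-674 is to bound nesting depth before (or instead of) recursing. The sketch below shows that idea in Python with an iterative scan; it is not xgrammar's actual patch, and `check_nesting` and the `max_depth=1000` default are assumptions made for illustration.

```python
def check_nesting(expr: str, max_depth: int = 1000) -> int:
    """Return the deepest parenthesis nesting level in expr.

    Raises ValueError if the nesting exceeds max_depth or the
    parentheses are unbalanced. The scan is iterative, so the
    input cannot exhaust the call stack.
    """
    depth = 0
    deepest = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            if depth > max_depth:
                raise ValueError(f"nesting deeper than {max_depth} levels")
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')'")
    if depth != 0:
        raise ValueError("unbalanced '('")
    return deepest


# An input shaped like the one in the advisory is rejected up front.
hostile = "(" * 30000 + "a" + ")" * 30000
try:
    check_nesting(hostile)
except ValueError as exc:
    print(f"rejected: {exc}")
```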

jackson-core (GHSA-72hv-8253-57qq): The non-blocking (async) JSON parser in jackson-core bypasses the maxNumberLength constraint (default 1000 chars). Attackers can send JSON with arbitrarily long numbers, causing OutOfMemoryError and CPU exhaustion. The vulnerable jackson-core 2.16.1 is bundled inside ray_dist.jar, a Java binary artifact in the Ray Python package. Since Curator never uses Ray's Java support, we delete the JAR in the Dockerfile. A build-time verification step fails the build if the JAR persists. Ray has merged the upstream fix (ray-project/ray#61808, jackson-databind 2.16.1 → 2.18.6) but has not released it yet (latest is still Ray 2.54.0).
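
The delete-then-verify pattern described above can be sketched in Python as follows. The PR implements this in the Dockerfile, not in Python, so this is only an illustration of the logic; `delete_and_verify` is a hypothetical helper and the root directory to scan is left to the caller.

```python
from pathlib import Path


def delete_and_verify(root: Path, filename: str = "ray_dist.jar") -> int:
    """Delete every file named `filename` under `root`, then re-scan.

    Returns the number of files removed. Raises RuntimeError if any
    copy is still present afterwards, mirroring a build-time guard
    that fails the build when the JAR persists.
    """
    removed = 0
    for path in list(root.rglob(filename)):
        if path.is_file():
            path.unlink()
            removed += 1

    leftovers = [p for p in root.rglob(filename) if p.is_file()]
    if leftovers:
        raise RuntimeError(f"{filename} still present: {leftovers}")
    return removed
```

Separating the deletion from the verification means the guard still fires if a later install step reintroduces the file.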

Cross-modality impact

  • nemo-toolkit bump only affects the audio_cpu / audio_cuda12 extras. No other modality depends on nemo-toolkit.
  • xgrammar is used internally by vllm and is never imported directly by Curator. The override simply bumps the transitive dependency version resolved by uv.
  • ray_dist.jar deletion only removes Ray's unused Java support. No Curator backend (Xenna, RayData, or Dask) invokes Ray Java.

Changes

| File | What changed |
| --- | --- |
| pyproject.toml | Bump nemo_toolkit[asr] from ==2.4.0 to >=2.7.2; move xgrammar from constraint-dependencies (>=0.1.21) to override-dependencies (>=0.1.32) |
| uv.lock | Regenerated (nemo-toolkit 2.4.0 → 2.7.2, xgrammar 0.1.29 → 0.1.32, +8 new deps, -6 removed deps) |
| docker/Dockerfile | Delete ray_dist.jar post-install with build-time verification guard |
| tests/stages/common/test_base.py | Fix latent flaky test: test_with_method_thread_safety now sorts thread results by worker_id before asserting per-worker values, removing dependence on non-deterministic thread completion order |

Testing

  • Docker build: Full --no-cache build with CURATOR_EXTRA=audio_cuda12 succeeds. ray_dist.jar deletion verified by build-time guard. NeMo 2.7.2 installs cleanly with no dependency conflicts.
  • FLEURS end-to-end pipeline in Docker: pipeline.py --backend ray_data --gpus 1 completed successfully — 394 tasks processed in 27s with GPU (InferenceAsrNemoStage ran as a Ray GPU actor with num_gpus=1.0).
  • Unit tests: All 112 audio tests pass (pytest tests/stages/audio/ tests/tasks/test_audio_task.py).
  • Flaky test fix: test_with_method_thread_safety was failing non-deterministically under CI load because it assumed threads complete in creation order. The test exists on main with the same bug — this PR fixes it.
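
The ordering problem behind the flaky test can be sketched as below. This is not the body of test_with_method_thread_safety; `run_workers` and the `value = worker_id * 10` payload are hypothetical stand-ins that only demonstrate sorting by `worker_id` before asserting.

```python
import threading
from typing import Dict, List


def run_workers(n: int) -> List[Dict[str, int]]:
    """Start n threads that each append one result, in completion order."""
    results: List[Dict[str, int]] = []
    lock = threading.Lock()

    def worker(worker_id: int) -> None:
        value = worker_id * 10  # per-worker value to verify later
        with lock:
            results.append({"worker_id": worker_id, "value": value})

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_workers(8)

# Completion order is not guaranteed, so sort before per-worker assertions.
ordered = sorted(results, key=lambda r: r["worker_id"])
for expected_id, record in enumerate(ordered):
    assert record["worker_id"] == expected_id
    assert record["value"] == expected_id * 10
```

Asserting on `results[i]` directly would only pass when thread `i` happens to finish `i`-th, which is the assumption that breaks under CI load.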

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot Bot commented Mar 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps Bot commented Mar 16, 2026

Greptile Summary

This PR addresses four HIGH-severity CVEs by bumping nemo_toolkit[asr] to >=2.7.2, overriding xgrammar to >=0.1.32 (needed because vllm uses a strict ==0.1.29 pin), and deleting ray_dist.jar from the Dockerfile with a build-time guard that fails the build if the JAR persists. The test_base.py flaky-test fix mentioned in the PR description does not appear in the actual diff — it may have been squashed, deferred, or the description is aspirational.

Confidence Score: 5/5

  • Safe to merge; all remaining findings are P2 style/documentation nits with no impact on correctness or security.
  • No P0 or P1 issues found. The CVE fixes are correctly implemented: the nemo-toolkit bump removes RCE-vulnerable deserialization paths, the xgrammar override (not just constraint) is the right mechanism to bypass vllm's strict pin, and the Dockerfile verification guard correctly fails the build if ray_dist.jar persists. The two P2 findings (undocumented pynvml removal and a comment typo) do not affect runtime behaviour.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docker/Dockerfile | Adds deletion of ray_dist.jar (bundled vulnerable jackson-core 2.16.1) with a build-time verification guard that fails the build if the JAR persists; clean and correct implementation. |
| pyproject.toml | Bumps nemo_toolkit[asr] to >=2.7.2, moves xgrammar to override-dependencies at >=0.1.32, moves transformers constraint from override to constraint-dependencies, and silently removes pynvml>=13.0.1 without documentation. |
| uv.lock | Auto-regenerated lockfile reflecting nemo-toolkit 2.4.0→2.7.2, xgrammar 0.1.29→0.1.32, and associated transitive dependency changes; not manually edited. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[uv sync --locked\nnemo-toolkit 2.7.2, xgrammar 0.1.32] --> B[Delete aiohttp thirdparty dir]
    B --> C[find ray_dist.jar -delete\nGHSA-72hv-8253-57qq fix]
    C --> D{ray_dist.jar\nstill present?}
    D -- Yes --> E[exit 1 - Build fails]
    D -- No --> F[Build succeeds]

    G[pyproject.toml\nnemo_toolkit asr >=2.7.2\nGHSA-9379 / GHSA-hvjw] --> A
    H[pyproject.toml\nxgrammar >=0.1.32 override\nGHSA-7rgv-gqhr-fxg3] --> A

Reviews (13): Last reviewed commit: "Merge branch 'main' into audio-cve-fixes"

Comment thread pyproject.toml
"protobuf>=5.29.5", # Override nemo-toolkits constraint of ~=5.29.5
"setuptools>=80.10.1", # Override setuptools range in other dependencies to address CVE GHSA-58pv-8j8x-9vj2
"transformers<=4.55.2", # Else Cosmos Embed imports fail
"xgrammar>=0.1.32", # Override vllm's ==0.1.29 pin to address CVE GHSA-7rgv-gqhr-fxg3 (DoS via multi-layer nesting)

xgrammar override may break vllm structured output

vllm pins xgrammar==0.1.29 strictly because it relies on a stable internal API for grammar-based structured generation. Overriding to >=0.1.32 is the correct mechanism to address the CVE, but xgrammar releases between 0.1.29 and 0.1.32 may have introduced API changes that break vllm's usage. It would be worth confirming (e.g., via the CI test suite or a quick manual check of the xgrammar changelog) that vllm's structured-output feature continues to work with xgrammar>=0.1.32 before merging.
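
As a first step toward that check, one could confirm which xgrammar version was actually resolved in the environment. The sketch below does only that; it says nothing about API compatibility with vllm. `parse_version`, `meets_minimum`, and `installed_xgrammar_ok` are hypothetical helpers, and the parser assumes a plain dotted release string such as `0.1.32` with no pre-release or local suffix.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a plain dotted release like '0.1.32' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed: str, minimum: str = "0.1.32") -> bool:
    """Return True if `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


def installed_xgrammar_ok() -> Optional[bool]:
    """Check the installed xgrammar; None means it is not installed."""
    try:
        return meets_minimum(metadata.version("xgrammar"))
    except metadata.PackageNotFoundError:
        return None
```

A passing version check would still need to be followed by an actual structured-output run through vllm to address the concern raised here.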

@mohammadaaftabv mohammadaaftabv changed the title fix CVEs: bump nemo-toolkit>=2.6.1, xgrammar>=0.1.32, delete ray_dist… fix CVEs: bump nemo-toolkit>=2.6.1, xgrammar>=0.1.32, delete ray_dist.jar, add --backend ray to FLEURS tutorial Mar 16, 2026
@mohammadaaftabv

/ok to test bf57325

@mohammadaaftabv

/ok to test 60dc9cd

Comment thread uv.lock

It looks like there are conflicts here. Maybe @thomasdhc can help unblock?

I can take over the PR. For context, older uv versions generate lock files in a different format (no time), which causes a big diff and conflicts. I've re-updated that in #1682 after it was changed in #1608 (I think).

mohammadaaftabv and others added 6 commits April 1, 2026 16:16
- Override vllm's xgrammar==0.1.29 pin with >=0.1.32 (GHSA-7rgv-gqhr-fxg3: DoS via uncontrolled recursion)
- Delete ray_dist.jar in Dockerfile (GHSA-72hv-8253-57qq: jackson-core async parser DoS)
- Regenerate uv.lock

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
- Bump nemo_toolkit[asr] from ==2.4.0 to >=2.7.2
  Fixes CVE-2025-33245 (GHSA-9379-mwvr-7wxx, CVSS 8.0) and
  CVE-2025-33253 (GHSA-hvjw-vp7g-39h5, CVSS 7.8): RCE via unsafe
  deserialization in nemo-toolkit < 2.6.1
- Override xgrammar >=0.1.32 (GHSA-7rgv-gqhr-fxg3: DoS via recursion)
- Delete ray_dist.jar in Dockerfile (GHSA-72hv-8253-57qq: jackson-core DoS)
- Regenerate uv.lock

Verified: Docker build succeeds, FLEURS e2e pipeline completes
with GPU (394 tasks, 27s, RayDataExecutor).

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
test_with_method_thread_safety relied on thread completion order matching
thread creation order, which is non-deterministic under CI load.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
- Undo the test sorting fix (separate concern from CVE fixes)
- Regenerate uv.lock after rebase with main

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

ayushdg commented Apr 1, 2026

Summary of changes made on top of @mohammadaaftabv's work:

  1. Removed the pytest changes, to be addressed in a separate PR.
  2. Moved the transformers version from override-dependencies to constraint-dependencies, since the newer nemo-toolkit no longer pins an older transformers.

@sarahyurick sarahyurick left a comment

Thanks!
