fix CVEs: nemo-toolkit >=2.7.2, xgrammar >=0.1.32, delete ray_dist.jar by mohammadaaftabv · Pull Request #1612 · NVIDIA-NeMo/Curator

mohammadaaftabv · 2026-03-16T18:26:16Z

Summary

Fix four HIGH-severity CVEs affecting Curator dependencies: two in nemo-toolkit (deserialization RCE) and two in transitive dependencies (xgrammar DoS, jackson-core DoS). Also fix a latent flaky test in test_base.py.

CVEs addressed

CVE / Advisory	Severity	Component	Fix
GHSA-9379-mwvr-7wxx / CVE-2025-33245	HIGH (CVSS 8.0)	nemo-toolkit (RCE via unsafe deserialization)	Bump `nemo_toolkit[asr]` from `==2.4.0` to `>=2.7.2`
GHSA-hvjw-vp7g-39h5 / CVE-2025-33253	HIGH (CVSS 7.8)	nemo-toolkit (RCE via unsafe deserialization)	Bump `nemo_toolkit[asr]` from `==2.4.0` to `>=2.7.2`
GHSA-7rgv-gqhr-fxg3 / CVE-2026-25048	HIGH (CVSS 8.7)	xgrammar (DoS via uncontrolled recursion)	Override `xgrammar>=0.1.32` in `pyproject.toml`
GHSA-72hv-8253-57qq	HIGH (CVSS 8.7)	jackson-core bundled in Ray's `ray_dist.jar` (async parser DoS)	Delete `ray_dist.jar` in Dockerfile

CVE details

nemo-toolkit (GHSA-9379-mwvr-7wxx, GHSA-hvjw-vp7g-39h5): NeMo < 2.6.1 deserializes model checkpoints with pickle.load() or with torch.load() without weights_only=True. An attacker who convinces a user to load a maliciously crafted .nemo or .ckpt file can achieve remote code execution. In Curator, InferenceAsrNemoStage calls ASRModel.from_pretrained(), which is the exact deserialization path these CVEs target. The issues are fixed in nemo-toolkit >= 2.6.1; this PR bumps to >= 2.7.2 to pick up the latest fixes.
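To illustrate why loading an untrusted checkpoint with plain pickle amounts to code execution, here is a minimal, stdlib-only sketch. It is not NeMo's code or its fix: `Exploit`, `AllowListUnpickler`, and `safe_loads` are hypothetical names, and the allow-list approach only mirrors the general idea behind `torch.load(weights_only=True)`, which restricts unpickling to a known set of safe types.

```python
import io
import pickle


class Exploit:
    """Hypothetical malicious object: __reduce__ makes pickle call a function at load time."""

    def __reduce__(self):
        # A real payload would call something like os.system; eval on a
        # harmless expression is enough to show arbitrary code running.
        return (eval, ("6 * 7",))


class AllowListUnpickler(pickle.Unpickler):
    """Refuses to resolve any global (class or function) that is not explicitly allowed."""

    ALLOWED_GLOBALS = frozenset({("collections", "OrderedDict")})

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def safe_loads(data: bytes):
    """Deserialize bytes, permitting only plain containers, scalars, and allow-listed globals."""
    return AllowListUnpickler(io.BytesIO(data)).load()


payload = pickle.dumps(Exploit())

# Plain pickle.loads runs eval("6 * 7") during deserialization.
print(pickle.loads(payload))  # 42

# The allow-list unpickler rejects the same payload before anything runs.
try:
    safe_loads(payload)
except pickle.UnpicklingError as err:
    print(err)  # blocked global: builtins.eval
```

Note that nothing about the payload looks like code until it is deserialized, which is why a trusted-looking .nemo or .ckpt file is enough to trigger the attack.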

xgrammar (GHSA-7rgv-gqhr-fxg3): Constructing a grammar rule with ~30,000 layers of nested parentheses triggers a segfault via uncontrolled recursion (CWE-674) in xgrammar's syntax parsing. Remote attackers can crash any app using xgrammar (e.g., vllm structured output) without authentication. Fixed in xgrammar 0.1.32. vllm pins xgrammar==0.1.29, so we use override-dependencies to bump it to >=0.1.32.
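As a rough illustration of the failure mode and one possible mitigation, the sketch below measures parenthesis nesting with a loop instead of recursion, so its own stack use stays constant however deep the input goes. It is a hypothetical pre-check an application could run before handing an untrusted grammar to a recursive parser: `max_paren_depth`, `check_grammar_nesting`, and the 1,000-level limit are assumptions, not part of xgrammar or vllm, and the scan naively counts parentheses even inside quoted literals.

```python
MAX_NESTING_DEPTH = 1_000  # hypothetical limit; not an xgrammar or vllm setting


def max_paren_depth(grammar: str) -> int:
    """Return the deepest '(' nesting level, computed iteratively (no recursion).

    Naive on purpose: parentheses inside quoted literals are counted too.
    """
    depth = deepest = 0
    for ch in grammar:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth = max(depth - 1, 0)  # tolerate stray closing parens
    return deepest


def check_grammar_nesting(grammar: str, limit: int = MAX_NESTING_DEPTH) -> None:
    """Raise ValueError instead of letting a recursive parser exhaust its stack."""
    depth = max_paren_depth(grammar)
    if depth > limit:
        raise ValueError(f"grammar nesting depth {depth} exceeds limit {limit}")


# Roughly the shape of the advisory's trigger: ~30,000 nested groups.
hostile = "root ::= " + "(" * 30_000 + '"a"' + ")" * 30_000
```

Rejecting such input up front keeps the crash-prone parse path out of reach, but upgrading to xgrammar 0.1.32, as this PR does, is what actually removes the vulnerability.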

jackson-core (GHSA-72hv-8253-57qq): The non-blocking (async) JSON parser in jackson-core bypasses the maxNumberLength constraint (default 1000 chars). Attackers can send JSON with arbitrarily long numbers, causing OutOfMemoryError and CPU exhaustion. The vulnerable jackson-core 2.16.1 is bundled inside ray_dist.jar, a Java binary artifact in the Ray Python package. Since Curator never uses Ray's Java support, we delete the JAR in the Dockerfile. A build-time verification step fails the build if the JAR persists. Ray has merged the upstream fix (ray-project/ray#61808, jackson-databind 2.16.1 → 2.18.6) but has not released it yet (latest is still Ray 2.54.0).
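The PR implements this as a shell step in docker/Dockerfile. As a hypothetical Python equivalent of the same delete-then-verify pattern (`remove_and_verify` is an illustrative name, not part of the PR), the sketch below assumes the JAR sits somewhere under a given site-packages root and fails loudly if any copy survives.

```python
from pathlib import Path


def remove_and_verify(root: Path, name: str = "ray_dist.jar") -> list[Path]:
    """Delete every file called `name` under `root`, then fail if any copy remains."""
    # Materialize the matches first so files are not deleted while the glob iterates.
    matches = [p for p in root.rglob(name) if p.is_file()]
    for jar in matches:
        jar.unlink()

    # Build-time guard: re-scan and raise if anything survived.
    leftovers = list(root.rglob(name))
    if leftovers:
        raise RuntimeError(f"{name} still present: {leftovers}")
    return matches
```

In an image build this might be invoked as `remove_and_verify(Path(sysconfig.get_paths()["purelib"]))`; an uncaught exception makes the interpreter exit non-zero, which fails the `RUN` step just like the shell guard.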

Cross-modality impact

nemo-toolkit bump only affects the audio_cpu / audio_cuda12 extras. No other modality depends on nemo-toolkit.
xgrammar is used internally by vllm and is never imported directly by Curator. The override simply bumps the transitive dependency version resolved by uv.
ray_dist.jar deletion only removes Ray's unused Java support. No Curator backend (Xenna, RayData, or Dask) invokes Ray Java.

Changes

File	What changed
`pyproject.toml`	Bump `nemo_toolkit[asr]` from `==2.4.0` to `>=2.7.2`; move xgrammar from `constraint-dependencies` (`>=0.1.21`) to `override-dependencies` (`>=0.1.32`)
`uv.lock`	Regenerated (nemo-toolkit 2.4.0 → 2.7.2, xgrammar 0.1.29 → 0.1.32, +8 new deps, -6 removed deps)
`docker/Dockerfile`	Delete `ray_dist.jar` post-install with build-time verification guard
`tests/stages/common/test_base.py`	Fix latent flaky test: `test_with_method_thread_safety` now sorts thread results by `worker_id` before asserting per-worker values, removing dependence on non-deterministic thread completion order
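After regenerating uv.lock, a quick sanity check that the resolved versions actually clear the CVE floors could look like the following. This is a hypothetical helper, not something in the PR: `CVE_FLOORS`, `meets_floor`, `installed_versions`, and `find_violations` are illustrative names, and the comparison only looks at the numeric release segment, so pre-release tags are ignored; `packaging.version` would be the robust choice for full PEP 440 ordering.

```python
import re
from importlib import metadata

# Minimum safe versions taken from this PR (hypothetical check, not part of Curator).
CVE_FLOORS = {"nemo_toolkit": "2.7.2", "xgrammar": "0.1.32"}


def release_tuple(version: str) -> tuple[int, ...]:
    """Parse the leading numeric release segment, e.g. '0.1.32.post1' -> (0, 1, 32)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_floor(installed: str, floor: str) -> bool:
    """Compare release tuples after zero-padding them to the same length."""
    a, b = release_tuple(installed), release_tuple(floor)
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)) >= b + (0,) * (width - len(b))


def installed_versions(names) -> dict[str, str]:
    """Look up installed distribution versions, skipping packages that are absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


def find_violations(versions: dict[str, str], floors: dict[str, str] = CVE_FLOORS) -> list[str]:
    """Return the names of installed packages that are below their CVE floor."""
    return sorted(
        name
        for name, floor in floors.items()
        if name in versions and not meets_floor(versions[name], floor)
    )
```

A CI step could then run `find_violations(installed_versions(CVE_FLOORS))` and fail if the returned list is non-empty.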

Testing

Docker build: Full --no-cache build with CURATOR_EXTRA=audio_cuda12 succeeds. ray_dist.jar deletion verified by build-time guard. NeMo 2.7.2 installs cleanly with no dependency conflicts.
FLEURS end-to-end pipeline in Docker: pipeline.py --backend ray_data --gpus 1 completed successfully, processing 394 tasks in 27s on GPU (InferenceAsrNemoStage ran as a Ray GPU actor with num_gpus=1.0).
Unit tests: All 112 audio tests pass (pytest tests/stages/audio/ tests/tasks/test_audio_task.py).
Flaky test fix: test_with_method_thread_safety was failing non-deterministically under CI load because it assumed threads complete in creation order. The same bug exists on main; this PR fixes it.
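The ordering bug is easy to reproduce in isolation. Below is a minimal sketch of the pattern described, not the actual test in tests/stages/common/test_base.py: `run_workers` and its per-worker computation are hypothetical. Threads append results in whatever order they finish, so the assertions sort by `worker_id` first instead of trusting list position.

```python
import random
import threading
import time


def run_workers(num_workers: int) -> list[dict]:
    """Run one thread per worker; results arrive in completion order, not creation order."""
    results = []
    lock = threading.Lock()

    def worker(worker_id: int) -> None:
        time.sleep(random.uniform(0, 0.01))  # simulate scheduling jitter under CI load
        value = worker_id * 10  # hypothetical per-worker computation
        with lock:
            results.append({"worker_id": worker_id, "value": value})

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def test_results_are_per_worker() -> None:
    # Sorting by worker_id makes the assertions independent of thread completion order.
    results = sorted(run_workers(8), key=lambda r: r["worker_id"])
    assert [r["worker_id"] for r in results] == list(range(8))
    for r in results:
        assert r["value"] == r["worker_id"] * 10
```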

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-03-16T18:26:20Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-03-16T18:29:55Z

Greptile Summary

This PR addresses four HIGH-severity CVEs by bumping nemo_toolkit[asr] to >=2.7.2, overriding xgrammar to >=0.1.32 (needed because vllm uses a strict ==0.1.29 pin), and deleting ray_dist.jar from the Dockerfile with a build-time guard that fails the build if the JAR persists. The test_base.py flaky-test fix mentioned in the PR description does not appear in the actual diff — it may have been squashed, deferred, or the description is aspirational.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/documentation nits with no impact on correctness or security.
No P0 or P1 issues found. The CVE fixes are correctly implemented: the nemo-toolkit bump removes RCE-vulnerable deserialization paths, the xgrammar override (not just constraint) is the right mechanism to bypass vllm's strict pin, and the Dockerfile verification guard correctly fails the build if ray_dist.jar persists. The two P2 findings (undocumented pynvml removal and a comment typo) do not affect runtime behaviour.
No files require special attention.

Important Files Changed

Filename	Overview
docker/Dockerfile	Adds deletion of ray_dist.jar (bundled vulnerable jackson-core 2.16.1) with a build-time verification guard that fails the build if the JAR persists; clean and correct implementation.
pyproject.toml	Bumps nemo_toolkit[asr] to >=2.7.2, moves xgrammar to override-dependencies at >=0.1.32, moves transformers constraint from override to constraint-dependencies, and silently removes pynvml>=13.0.1 without documentation.
uv.lock	Auto-regenerated lockfile reflecting nemo-toolkit 2.4.0→2.7.2, xgrammar 0.1.29→0.1.32, and associated transitive dependency changes; not manually edited.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[uv sync --locked\nnemo-toolkit 2.7.2, xgrammar 0.1.32] --> B[Delete aiohttp thirdparty dir]
    B --> C[find ray_dist.jar -delete\nGHSA-72hv-8253-57qq fix]
    C --> D{ray_dist.jar\nstill present?}
    D -- Yes --> E[exit 1 - Build fails]
    D -- No --> F[Build succeeds]

    G[pyproject.toml\nnemo_toolkit asr >=2.7.2\nGHSA-9379 / GHSA-hvjw] --> A
    H[pyproject.toml\nxgrammar >=0.1.32 override\nGHSA-7rgv-gqhr-fxg3] --> A
```

_Reviews (13): Last reviewed commit: "Merge branch 'main' into audio-cve-fixes"_

greptile-apps · 2026-03-16T18:29:59Z

```diff
    "protobuf>=5.29.5",  # Override nemo-toolkits constraint of ~=5.29.5
    "setuptools>=80.10.1", # Override setuptools range in other dependencies to address CVE GHSA-58pv-8j8x-9vj2
    "transformers<=4.55.2", # Else Cosmos Embed imports fail
+    "xgrammar>=0.1.32", # Override vllm's ==0.1.29 pin to address CVE GHSA-7rgv-gqhr-fxg3 (DoS via multi-layer nesting)
```


xgrammar override may break vllm structured output

vllm pins xgrammar==0.1.29 strictly because it relies on a stable internal API for grammar-based structured generation. Overriding to >=0.1.32 is the correct mechanism to address the CVE, but xgrammar releases between 0.1.29 and 0.1.32 may have introduced API changes that break vllm's usage. It would be worth confirming (e.g., via the CI test suite or a quick manual check of the xgrammar changelog) that vllm's structured-output feature continues to work with xgrammar>=0.1.32 before merging.

mohammadaaftabv · 2026-03-16T19:51:30Z

/ok to test bf57325

mohammadaaftabv · 2026-03-30T12:02:40Z

/ok to test 60dc9cd

sarahyurick · 2026-03-31T17:20:30Z

It looks like there are conflicts here. Maybe @thomasdhc can help unblock?

I can take over the PR. For context, the older uv version generates lock files in a different format (without timestamps), which causes a big diff and conflicts. I've re-updated that in #1682 after it was changed in #1608 (I think).

- Override vllm's xgrammar==0.1.29 pin with >=0.1.32 (GHSA-7rgv-gqhr-fxg3: DoS via uncontrolled recursion)
- Delete ray_dist.jar in Dockerfile (GHSA-72hv-8253-57qq: jackson-core async parser DoS)
- Regenerate uv.lock

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

- Bump nemo_toolkit[asr] from ==2.4.0 to >=2.7.2. Fixes CVE-2025-33245 (GHSA-9379-mwvr-7wxx, CVSS 8.0) and CVE-2025-33253 (GHSA-hvjw-vp7g-39h5, CVSS 7.8): RCE via unsafe deserialization in nemo-toolkit < 2.6.1
- Override xgrammar >=0.1.32 (GHSA-7rgv-gqhr-fxg3: DoS via recursion)
- Delete ray_dist.jar in Dockerfile (GHSA-72hv-8253-57qq: jackson-core DoS)
- Regenerate uv.lock

Verified: Docker build succeeds, FLEURS e2e pipeline completes with GPU (394 tasks, 27s, RayDataExecutor).

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

test_with_method_thread_safety relied on thread completion order matching thread creation order, which is non-deterministic under CI load.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

- Undo the test sorting fix (separate concern from CVE fixes)
- Regenerate uv.lock after rebase with main

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

ayushdg · 2026-04-01T23:22:27Z

Summary of changes made on top of @mohammadaaftabv work:

- Removed the pytest changes; they will be addressed in a separate PR.
- Moved the transformers version pin from override-dependencies to constraint-dependencies, since newer nemo-toolkit versions no longer pin an older transformers.

sarahyurick

Thanks!

greptile-apps Bot reviewed Mar 16, 2026


mohammadaaftabv changed the title ~~fix CVEs: bump nemo-toolkit>=2.6.1, xgrammar>=0.1.32, delete ray_dist…~~ fix CVEs: bump nemo-toolkit>=2.6.1, xgrammar>=0.1.32, delete ray_dist.jar, add --backend ray to FLEURS tutorial Mar 16, 2026

copy-pr-bot Bot temporarily deployed to test March 16, 2026 19:52 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci March 16, 2026 19:52 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci March 30, 2026 11:39 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci March 30, 2026 11:39 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci March 30, 2026 11:39 Inactive

sarahyurick reviewed Mar 31, 2026


mohammadaaftabv and others added 6 commits April 1, 2026 16:16

fix flaky test: sort thread results by worker_id before asserting

d6c828e

Revert test_base.py changes and regenerate lockfile

de0c95f

remove pynvml since nvidia-ml-py already covers those deps

fc1fa5d

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

Handle conflicts and remove transfomers override to use constraint now

7fa01de

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

sarahyurick approved these changes Apr 1, 2026


Merge branch 'main' into audio-cve-fixes

b3a9303

thomasdhc approved these changes Apr 2, 2026


This was referenced Apr 3, 2026

docs: add fern docs for 26.04 PRs (#1160, #1575, #1576, #1603, #1612, #1652) #1730

Closed

docs: add release notes and container docs for CVE fixes (PR #1612) #1733

Merged

ayushdg mentioned this pull request Apr 3, 2026

fix(deps): Address container scan vulnerabilities in nemo_toolkit #1740

Closed

sarahyurick mentioned this pull request Apr 15, 2026

26.04 CVE Fixes - first draft #1600

Closed

ayushdg merged 7 commits into NVIDIA-NeMo:main from mohammadaaftabv:audio-cve-fixes

4 participants
