fix CVEs: nemo-toolkit >=2.7.2, xgrammar >=0.1.32, delete ray_dist.jar #1612
Greptile Summary
This PR addresses four HIGH-severity CVEs by bumping
Confidence Score: 5/5
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[uv sync --locked\nnemo-toolkit 2.7.2, xgrammar 0.1.32] --> B[Delete aiohttp thirdparty dir]
B --> C[find ray_dist.jar -delete\nGHSA-72hv-8253-57qq fix]
C --> D{ray_dist.jar\nstill present?}
D -- Yes --> E[exit 1 - Build fails]
D -- No --> F[Build succeeds]
G[pyproject.toml\nnemo_toolkit asr >=2.7.2\nGHSA-9379 / GHSA-hvjw] --> A
H[pyproject.toml\nxgrammar >=0.1.32 override\nGHSA-7rgv-gqhr-fxg3] --> A
Reviews (13): Last reviewed commit: "Merge branch 'main' into audio-cve-fixes"
"protobuf>=5.29.5", # Override nemo-toolkits constraint of ~=5.29.5
"setuptools>=80.10.1", # Override setuptools range in other dependencies to address CVE GHSA-58pv-8j8x-9vj2
"transformers<=4.55.2", # Else Cosmos Embed imports fail
"xgrammar>=0.1.32", # Override vllm's ==0.1.29 pin to address CVE GHSA-7rgv-gqhr-fxg3 (DoS via multi-layer nesting)
xgrammar override may break vllm structured output
vllm pins xgrammar==0.1.29 strictly because it relies on a stable internal API for grammar-based structured generation. Overriding to >=0.1.32 is the correct mechanism to address the CVE, but xgrammar releases between 0.1.29 and 0.1.32 may have introduced API changes that break vllm's usage. It would be worth confirming (e.g., via the CI test suite or a quick manual check of the xgrammar changelog) that vllm's structured-output feature continues to work with xgrammar>=0.1.32 before merging.
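As a first step toward the manual check suggested above, one could confirm which xgrammar version actually resolved in the environment. The following is a minimal sketch, not part of this PR: the helper names `parse_version`, `meets_minimum`, and `installed_xgrammar_version` are hypothetical, and the numeric parsing assumes plain `X.Y.Z` release strings with no pre-release suffix. It only reports the installed version; it does not exercise vllm's structured-output path, which still needs the CI suite or a manual run.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a plain release string such as '0.1.32' into (0, 1, 32)."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed: str, minimum: str = "0.1.32") -> bool:
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_xgrammar_version() -> Optional[str]:
    """Look up the installed xgrammar version, or None if it is absent."""
    try:
        return metadata.version("xgrammar")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_xgrammar_version()
    if found is None:
        print("xgrammar is not installed in this environment")
    else:
        print(f"xgrammar {found}: meets minimum = {meets_minimum(found)}")
```

Running this inside the built image would show whether the override took effect in the resolved environment.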
/ok to test bf57325
/ok to test 60dc9cd
It looks like there are conflicts here. Maybe @thomasdhc can help unblock?
- Override vllm's xgrammar==0.1.29 pin with >=0.1.32 (GHSA-7rgv-gqhr-fxg3: DoS via uncontrolled recursion)
- Delete ray_dist.jar in Dockerfile (GHSA-72hv-8253-57qq: jackson-core async parser DoS)
- Regenerate uv.lock
Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
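The ray_dist.jar removal follows a delete-then-verify pattern: remove every copy of the file, then fail if any copy is still found. The sketch below illustrates that pattern in Python under stated assumptions; it is not the Dockerfile step itself (which the flowchart shows as a `find ... -delete`), the function names `delete_jars` and `verify_absent` are hypothetical, and the directory layout in the demo is made up.

```python
import tempfile
from pathlib import Path
from typing import List


def delete_jars(root: Path, jar_name: str = "ray_dist.jar") -> List[Path]:
    """Delete every file named jar_name under root and return the removed paths."""
    removed = []
    for path in sorted(root.rglob(jar_name)):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed


def verify_absent(root: Path, jar_name: str = "ray_dist.jar") -> None:
    """Guard step: raise if any jar_name file still exists under root."""
    leftovers = list(root.rglob(jar_name))
    if leftovers:
        raise RuntimeError(f"{jar_name} still present: {leftovers}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        site = Path(tmp)
        # Hypothetical layout, used only to have something to delete.
        jar = site / "ray" / "jars" / "ray_dist.jar"
        jar.parent.mkdir(parents=True)
        jar.write_bytes(b"placeholder")
        print("removed:", delete_jars(site))
        verify_absent(site)  # raises if the delete step missed a copy
```

Keeping the guard separate from the delete means a future package layout change that hides the JAR from the delete step surfaces as a build failure instead of a silent regression.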
- Bump nemo_toolkit[asr] from ==2.4.0 to >=2.7.2
  Fixes CVE-2025-33245 (GHSA-9379-mwvr-7wxx, CVSS 8.0) and CVE-2025-33253 (GHSA-hvjw-vp7g-39h5, CVSS 7.8): RCE via unsafe deserialization in nemo-toolkit < 2.6.1
- Override xgrammar >=0.1.32 (GHSA-7rgv-gqhr-fxg3: DoS via recursion)
- Delete ray_dist.jar in Dockerfile (GHSA-72hv-8253-57qq: jackson-core DoS)
- Regenerate uv.lock
Verified: Docker build succeeds, FLEURS e2e pipeline completes with GPU (394 tasks, 27s, RayDataExecutor).
Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
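To make "RCE via unsafe deserialization" concrete, here is a generic, harmless illustration of why loading an untrusted pickle is dangerous, using only the standard library. It is not NeMo's code and not the fix shipped in nemo-toolkit: the `Payload` class, the `NoGlobalsUnpickler` class, and `restricted_loads` are hypothetical names for this sketch, and the payload calls the benign built-in `sorted` where a real attack would name something destructive.

```python
import io
import pickle


class Payload:
    """Stand-in for a crafted object inside an untrusted checkpoint."""

    def __reduce__(self):
        # Tells pickle to call sorted([3, 1, 2]) at load time. The callable
        # is chosen by whoever produced the bytes, not by the loader.
        return (sorted, ([3, 1, 2],))


class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global (function or class)."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def restricted_loads(data: bytes):
    """Load pickled data while rejecting every global lookup."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()


blob = pickle.dumps(Payload())

# Plain pickle.loads runs the callable embedded in the blob.
result = pickle.loads(blob)

# The restricted loader rejects the same blob...
try:
    restricted_loads(blob)
    blocked = False
except pickle.UnpicklingError:
    blocked = True

# ...but still reads plain data that needs no global lookups.
plain = restricted_loads(pickle.dumps({"step": 1, "loss": [0.5, 0.25]}))
```

Rejecting every global is stricter than a real checkpoint loader can afford; practical mitigations use an allow-list of known-safe types instead.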
test_with_method_thread_safety relied on thread completion order matching thread creation order, which is non-deterministic under CI load.
Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
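The ordering problem described in this commit can be sketched as follows. This is an illustration only, not the code in tests/stages/common/test_base.py: `run_workers`, the result dictionaries, and the sleep timings are all hypothetical. The workers are arranged so that later-created threads tend to finish first, which is the situation that breaks index-based assertions.

```python
import threading
import time
from typing import Dict, List


def run_workers(num_workers: int = 4) -> List[Dict[str, int]]:
    """Start workers and collect their results in completion order."""
    results: List[Dict[str, int]] = []
    lock = threading.Lock()

    def work(worker_id: int) -> None:
        # Earlier workers sleep longer, so completion order will usually
        # differ from creation order.
        time.sleep(0.02 * (num_workers - worker_id))
        with lock:
            results.append({"worker_id": worker_id, "value": worker_id * 10})

    threads = [threading.Thread(target=work, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_workers()

# Fragile: results[i] belongs to worker i only if threads finish in order.
# Robust: sort by worker_id first, then assert per-worker values.
ordered = sorted(results, key=lambda r: r["worker_id"])
for expected_id, item in enumerate(ordered):
    assert item["worker_id"] == expected_id
    assert item["value"] == expected_id * 10
```

Sorting keeps the assertions about per-worker values while dropping the accidental assertion about scheduling.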
- Undo the test sorting fix (separate concern from CVE fixes)
- Regenerate uv.lock after rebase with main
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Summary of changes made on top of @mohammadaaftabv's work:
Summary
Fix four HIGH-severity CVEs affecting Curator dependencies: two in nemo-toolkit (deserialization RCE) and two in transitive dependencies (xgrammar DoS, jackson-core DoS). Also fix a latent flaky test in test_base.py.

CVEs addressed
- GHSA-9379-mwvr-7wxx (nemo-toolkit): Bump nemo_toolkit[asr] from ==2.4.0 to >=2.7.2
- GHSA-hvjw-vp7g-39h5 (nemo-toolkit): Bump nemo_toolkit[asr] from ==2.4.0 to >=2.7.2
- GHSA-7rgv-gqhr-fxg3 (xgrammar): Override xgrammar>=0.1.32 in pyproject.toml
- GHSA-72hv-8253-57qq (jackson-core in ray_dist.jar, async parser DoS): Delete ray_dist.jar in Dockerfile

CVE details
nemo-toolkit (GHSA-9379-mwvr-7wxx, GHSA-hvjw-vp7g-39h5): NeMo < 2.6.1 uses torch.load() / pickle.load() without weights_only=True when loading model checkpoints. An attacker who convinces a user to load a maliciously crafted .nemo or .ckpt file can achieve remote code execution. In Curator, InferenceAsrNemoStage calls ASRModel.from_pretrained(), the exact deserialization path these CVEs target. Fixed in nemo-toolkit >= 2.6.1; bumped to >= 2.7.2 for latest fixes.

xgrammar (GHSA-7rgv-gqhr-fxg3): Constructing a grammar rule with ~30,000 layers of nested parentheses triggers a segfault via uncontrolled recursion (CWE-674) in xgrammar's syntax parsing. Remote attackers can crash any app using xgrammar (e.g., vllm structured output) without authentication. Fixed in xgrammar 0.1.32. vllm pins xgrammar==0.1.29, so we use override-dependencies to bump it to >=0.1.32.

jackson-core (GHSA-72hv-8253-57qq): The non-blocking (async) JSON parser in jackson-core bypasses the maxNumberLength constraint (default 1000 chars). Attackers can send JSON with arbitrarily long numbers, causing OutOfMemoryError and CPU exhaustion. The vulnerable jackson-core 2.16.1 is bundled inside ray_dist.jar, a Java binary artifact in the Ray Python package. Since Curator never uses Ray's Java support, we delete the JAR in the Dockerfile. A build-time verification step fails the build if the JAR persists. Ray has merged the upstream fix (ray-project/ray#61808, jackson-databind 2.16.1 → 2.18.6) but has not released it yet (latest is still Ray 2.54.0).

Cross-modality impact
audio_cpu/audio_cuda12 extras. No other modality depends on nemo-toolkit.
uv.

Changes
- pyproject.toml: nemo_toolkit[asr] from ==2.4.0 to >=2.7.2; move xgrammar from constraint-dependencies (>=0.1.21) to override-dependencies (>=0.1.32)
- uv.lock
- docker/Dockerfile: ray_dist.jar post-install with build-time verification guard
- tests/stages/common/test_base.py: test_with_method_thread_safety now sorts thread results by worker_id before asserting per-worker values, removing dependence on non-deterministic thread completion order

Testing
- --no-cache build with CURATOR_EXTRA=audio_cuda12 succeeds. ray_dist.jar deletion verified by build-time guard. NeMo 2.7.2 installs cleanly with no dependency conflicts.
- pipeline.py --backend ray_data --gpus 1 completed successfully: 394 tasks processed in 27s with GPU (InferenceAsrNemoStage ran as a Ray GPU actor with num_gpus=1.0).
- pytest tests/stages/audio/ tests/tasks/test_audio_task.py
- test_with_method_thread_safety was failing non-deterministically under CI load because it assumed threads complete in creation order. The test exists on main with the same bug; this PR fixes it.

Checklist