Replace in-repo LLM ONNX export with TensorRT-Edge-LLM #1210
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Removes the legacy LLM ONNX export pipeline and utilities, deletes the example export script and tests, and replaces the README with a TensorRT-Edge-LLM–focused LLM/VLM quantize→export→build→inference workflow and CLI example flows.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 1
🧹 Nitpick comments (1)
examples/torch_onnx/README.md (1)
105-110: Pin TensorRT-Edge-LLM to a stable release tag for reproducible docs. The current install flow tracks the default branch, so commands can silently drift and break over time. The latest stable release is `v0.6.0` (Mar 19, 2026), which includes both the `tensorrt-edgellm-quantize-llm` and `tensorrt-edgellm-export-llm` CLI commands. Update the snippet to clone with `--branch v0.6.0` or reference a specific commit SHA instead of relying on the default branch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` around lines 105 - 110, The clone step in the README currently tracks the default branch causing non-reproducible installs; update the git clone command for TensorRT-Edge-LLM to pin the stable release by adding --branch v0.6.0 (or replace with a specific commit SHA) so the sequence starting at the git clone line uses a fixed tag; ensure the README shows git clone --branch v0.6.0 https://github.com/NVIDIA/TensorRT-Edge-LLM.git (or the chosen SHA) before the subsequent git submodule update --init --recursive, venv creation, activation, and pip3 install steps.
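As an illustration of the pinned flow this comment describes, here is a dry-run sketch that prints the commands instead of executing them (the `v0.6.0` tag comes from the review comment above and is not verified here; substitute whichever tag or SHA you actually pin):

```shell
set -eu

# Tag taken from the review comment above; verify it exists upstream before pinning.
TAG="v0.6.0"
REPO="https://github.com/NVIDIA/TensorRT-Edge-LLM.git"

# --branch accepts tags as well as branches, and --depth 1 keeps the checkout small.
# The commands are printed rather than run so the sketch needs no network access.
CLONE_CMD="git clone --branch ${TAG} --depth 1 ${REPO}"
echo "${CLONE_CMD}"
echo "cd TensorRT-Edge-LLM"
echo "git submodule update --init --recursive"
```

Pinning via `--branch <tag>` rather than a post-clone `git checkout` keeps the clone shallow and the instructions one line shorter.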
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/torch_onnx/README.md`:
- Around line 117-123: The README's System requirements section currently lists
conflicting VRAM guidance ("GPU VRAM: 8-16 GB ... 20-48 GB for models up to 8B")
that contradicts the later statement saying "80 GB may be needed for up to 8B";
update both places under the "**System requirements:**" heading and the later
mention so they match and are split by model size, precision and loading method
(e.g., FP16/INT8 with tensor-parallel or offloading: 20–48 GB; FP32 or no
offload: ~80 GB), and add a short parenthetical note naming the methods that
require the higher VRAM (e.g., "full FP32/no offload") so readers can provision
correctly; ensure the same phrasing and numbers are used in both occurrences.
---
Nitpick comments:
In `@examples/torch_onnx/README.md`:
- Around line 105-110: The clone step in the README currently tracks the default
branch causing non-reproducible installs; update the git clone command for
TensorRT-Edge-LLM to pin the stable release by adding --branch v0.6.0 (or
replace with a specific commit SHA) so the sequence starting at the git clone
line uses a fixed tag; ensure the README shows git clone --branch v0.6.0
https://github.com/NVIDIA/TensorRT-Edge-LLM.git (or the chosen SHA) before the
subsequent git submodule update --init --recursive, venv creation, activation,
and pip3 install steps.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c95fd1ad-b69a-4966-ad57-7cfd28bbe508
📒 Files selected for processing (7)
- examples/torch_onnx/README.md
- examples/torch_onnx/llm_export.py
- modelopt/onnx/llm_export_utils/__init__.py
- modelopt/onnx/llm_export_utils/export_utils.py
- modelopt/onnx/llm_export_utils/quantization_utils.py
- modelopt/onnx/llm_export_utils/surgeon_utils.py
- tests/examples/torch_onnx/test_llm_export.py
💤 Files with no reviewable changes (6)
- tests/examples/torch_onnx/test_llm_export.py
- modelopt/onnx/llm_export_utils/__init__.py
- modelopt/onnx/llm_export_utils/surgeon_utils.py
- modelopt/onnx/llm_export_utils/quantization_utils.py
- examples/torch_onnx/llm_export.py
- modelopt/onnx/llm_export_utils/export_utils.py
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #1210 +/- ##
==========================================
+ Coverage 75.42% 76.72% +1.30%
==========================================
Files 353 353
Lines 40603 40596 -7
==========================================
+ Hits 30623 31149 +526
+ Misses 9980 9447 -533
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
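As a quick sanity check (not part of the original report), the percentages above follow directly from the hit/line counts in the table:

```python
# Recompute Codecov's numbers from the table above.
before = 30623 / 40603 * 100   # main: reported as 75.42%
after = 31149 / 40596 * 100    # PR head: reported as 76.72% (Codecov truncates)
delta = after - before         # reported as +1.30%

assert 40596 - 40603 == -7           # the "-7" lines row
assert abs(before - 75.42) < 0.01
assert abs(after - 76.72) < 0.01
assert abs(delta - 1.30) < 0.02
```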
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
The modelopt.onnx.llm_export_utils module was only used by the now-removed examples/torch_onnx/llm_export.py script. No other code in the codebase imports from it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
977e68c to 9f26f2c — Compare
♻️ Duplicate comments (1)
examples/torch_onnx/README.md (1)
117-123: ⚠️ Potential issue | 🟠 Major — Conflicting VRAM requirements remain unresolved.
Line 122 specifies 20-48 GB for models up to 8B, but Line 249 states 80 GB may be needed for up to 8B. This conflicts with the previous review comment and should be reconciled to avoid user confusion.
Also applies to: 249-249
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` around lines 117 - 123, The README's "System requirements" section contains conflicting GPU VRAM guidance for ~8B models: the "GPU VRAM" bullet states 20-48 GB while another place later claims 80 GB; reconcile them by picking a single authoritative recommendation or by clarifying conditions that produce different requirements (e.g., full-precision vs quantized, CPU/GPU offload, sequence length, and pipeline/PEFT usage). Update the "GPU VRAM" bullet in the System requirements and the later VRAM statement so they match and, if needed, add a short parenthetical explaining when 20-48 GB suffices vs when 80 GB might be required (e.g., no quantization and full context on a single GPU).
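One way to reconcile the two figures in the comment above is a weight-only back-of-envelope estimate. This sketch deliberately ignores activations, KV cache, and framework overhead, so real peak usage is higher; the point is only that precision alone moves the weight footprint by large factors:

```python
def weight_gib(params_billion: float, bytes_per_param: float) -> float:
    """Weight-only memory in GiB; activations, KV cache, and overhead are extra."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# For an 8B model, weights alone span a ~8x range across precisions, which is
# one plausible source of the 20-48 GB vs 80 GB discrepancy being discussed:
for name, bytes_pp in [("FP32", 4.0), ("FP16/BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{name:10s} ~{weight_gib(8, bytes_pp):5.1f} GiB")
```

Roughly 30 GiB of FP32 weights for 8B parameters, plus activations and calibration overhead, makes an 80 GB recommendation plausible for the unquantized path, while FP16/INT paths fit the 20-48 GB band.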
🧹 Nitpick comments (2)
examples/torch_onnx/README.md (2)
120-120: Clarify compute capability requirements. Line 120 states "Compute Capability 8.0+ (Ampere or newer)" as a general requirement, but Line 204 indicates FP8 specifically requires "SM89+ hardware (Hopper, Ada)", which is Compute Capability 8.9+. Consider clarifying that different quantization methods have different minimum requirements.
📝 Suggested clarification
```diff
-- GPU VRAM: 8-16 GB for models up to 3B, 20-48 GB for models up to 8B
+- NVIDIA GPU: Compute Capability 8.0+ (Ampere or newer) for INT4; 8.9+ (Hopper, Ada) for FP8; Blackwell for NVFP4
+- GPU VRAM: 8-16 GB for models up to 3B, 20-48 GB for models up to 8B
```

Also applies to: 204-204
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` at line 120, Update the README text to clearly distinguish general GPU requirements from FP8-specific requirements: change the general bullet "NVIDIA GPU with Compute Capability 8.0+ (Ampere or newer)" to note that most quantization methods require Compute Capability 8.0+ (Ampere), and add an explicit note that FP8 quantization requires SM89 / Compute Capability 8.9+ (Hopper, Ada) as stated later; ensure the README entries around the two mentions (the original 8.0+ line and the FP8 line at 204) are harmonized so readers know which quantization modes need SM89+ versus which work on 8.0+.
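The harmonized requirements this comment asks for can be expressed as a small capability gate. The thresholds below are the review's own claims (8.0 for INT4/INT8, 8.9 for FP8); the Blackwell-class `(10, 0)` value for NVFP4 is an assumption, and none of them are verified against TensorRT-Edge-LLM's docs:

```python
# Minimum compute capability per method, per the review's claims above
# (not verified against TensorRT-Edge-LLM's own documentation).
MIN_CC = {
    "INT4 AWQ": (8, 0),          # Ampere or newer
    "INT8 SmoothQuant": (8, 0),  # Ampere or newer
    "FP8": (8, 9),               # SM89+: Ada, Hopper
    "NVFP4": (10, 0),            # Blackwell-class (assumed value)
}

def supported(method: str, cc: tuple) -> bool:
    # Tuple comparison orders (major, minor) lexicographically,
    # which matches how compute capabilities compare.
    return cc >= MIN_CC[method]

assert not supported("FP8", (8, 6))   # an Ampere consumer part
assert supported("FP8", (8, 9))       # Ada
assert supported("INT4 AWQ", (8, 6))
```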
104-110: Consider pinning the TensorRT-Edge-LLM version. The installation clones the latest code from the main branch, which could introduce breaking changes over time. Consider adding a specific tag or commit hash for reproducibility.
📌 Suggested enhancement
```diff
 # Clone and install TensorRT-Edge-LLM
 git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
 cd TensorRT-Edge-LLM
+git checkout v1.0.0  # Pin to a specific stable version
 git submodule update --init --recursive
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` around lines 104 - 110, The README currently clones the latest TensorRT-Edge-LLM main branch which can break reproducibility; update the git clone step that references "TensorRT-Edge-LLM" so it pins a stable release (tag) or a specific commit hash (e.g., use a clone+checkout to the chosen tag/commit or clone with --branch <tag> --depth 1) and document which tag/commit you pinned so installs are deterministic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@examples/torch_onnx/README.md`:
- Around line 117-123: The README's "System requirements" section contains
conflicting GPU VRAM guidance for ~8B models: the "GPU VRAM" bullet states 20-48
GB while another place later claims 80 GB; reconcile them by picking a single
authoritative recommendation or by clarifying conditions that produce different
requirements (e.g., full-precision vs quantized, CPU/GPU offload, sequence
length, and pipeline/PEFT usage). Update the "GPU VRAM" bullet in the System
requirements and the later VRAM statement so they match and, if needed, add a
short parenthetical explaining when 20-48 GB suffices vs when 80 GB might be
required (e.g., no quantization and full context on a single GPU).
---
Nitpick comments:
In `@examples/torch_onnx/README.md`:
- Line 120: Update the README text to clearly distinguish general GPU
requirements from FP8-specific requirements: change the general bullet "NVIDIA
GPU with Compute Capability 8.0+ (Ampere or newer)" to note that most
quantization methods require Compute Capability 8.0+ (Ampere), and add an
explicit note that FP8 quantization requires SM89 / Compute Capability 8.9+
(Hopper, Ada) as stated later; ensure the README entries around the two mentions
(the original 8.0+ line and the FP8 line at 204) are harmonized so readers know
which quantization modes need SM89+ versus which work on 8.0+.
- Around line 104-110: The README currently clones the latest TensorRT-Edge-LLM
main branch which can break reproducibility; update the git clone step that
references "TensorRT-Edge-LLM" so it pins a stable release (tag) or a specific
commit hash (e.g., use a clone+checkout to the chosen tag/commit or clone with
--branch <tag> --depth 1) and document which tag/commit you pinned so installs
are deterministic.
📒 Files selected for processing (6)
- examples/torch_onnx/README.md
- examples/torch_onnx/llm_export.py
- modelopt/onnx/llm_export_utils/export_utils.py
- modelopt/onnx/llm_export_utils/quantization_utils.py
- modelopt/onnx/llm_export_utils/surgeon_utils.py
- tests/examples/torch_onnx/test_llm_export.py
💤 Files with no reviewable changes (5)
- tests/examples/torch_onnx/test_llm_export.py
- modelopt/onnx/llm_export_utils/surgeon_utils.py
- modelopt/onnx/llm_export_utils/quantization_utils.py
- modelopt/onnx/llm_export_utils/export_utils.py
- examples/torch_onnx/llm_export.py
…_template (#1225)

## Summary

- `apply_chat_template(..., return_tensors="pt")` returns a `BatchEncoding` in transformers 4.46+, which no longer subclasses `dict`
- The old guard `isinstance(tokenized, dict)` evaluates to `False` for `BatchEncoding`, so `input_ids` was set to the whole `BatchEncoding` object
- Calling `.shape[1]` on a `BatchEncoding` triggers `__getattr__("shape")` → `AttributeError`
- Fix: check `isinstance(tokenized, torch.Tensor)` instead, which correctly handles both old transformers (plain tensor) and new transformers (BatchEncoding)

This is causing `test_collect_hidden_states` to fail in the speculative decoding CI for all open PRs (#1207, #1210, #1221).

## Test plan

- [ ] `torch-pr (speculative_decoding, 26.01)` CI passes
- [ ] Verify fix handles both `torch.Tensor` return (old transformers) and `BatchEncoding` return (new transformers 4.46+)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
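The guard change in the commit message above can be illustrated with stand-in types, so the sketch needs neither torch nor transformers. `FakeTensor` and `FakeBatchEncoding` are simplified stand-ins introduced here; the real fix branches on `isinstance(tokenized, torch.Tensor)`:

```python
class FakeTensor:
    """Stand-in for torch.Tensor; only .shape matters for this sketch."""
    def __init__(self, shape):
        self.shape = shape

class FakeBatchEncoding:
    """Stand-in for transformers 4.46+ BatchEncoding: mapping-like, but NOT a dict subclass."""
    def __init__(self, data):
        self._data = data
    def __getitem__(self, key):
        return self._data[key]

def get_input_ids(tokenized):
    # Old, broken guard: `isinstance(tokenized, dict)` is False for BatchEncoding,
    # so the whole encoding leaked through and `.shape[1]` later raised AttributeError.
    # Fixed guard: branch on the tensor type instead.
    if isinstance(tokenized, FakeTensor):   # real code checks torch.Tensor
        return tokenized                    # old transformers: plain tensor
    return tokenized["input_ids"]           # new transformers: BatchEncoding

ids = FakeTensor((1, 7))
assert get_input_ids(ids).shape[1] == 7
assert get_input_ids(FakeBatchEncoding({"input_ids": ids})).shape[1] == 7
```

Checking the tensor type rather than the mapping type is the robust direction here, since the mapping type changed across transformers versions while the tensor type did not.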
b5246a8 to 14ba626 — Compare
🧹 Nitpick comments (3)
examples/torch_onnx/README.md (3)
210-247: Consider adding a version/date stamp for the supported-model matrix. This matrix will drift quickly as upstream support changes. Adding "validated against TensorRT-Edge-LLM &lt;version/date&gt;" (or linking to a canonical upstream support page) will keep expectations clear.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` around lines 210 - 247, The Supported Models matrix under the "Supported Models" header in README.md lacks a validation timestamp or version; add a clear version/date stamp or a link to a canonical upstream compatibility page adjacent to the matrix (e.g., "Validated against TensorRT-Edge-LLM vX.Y — YYYY-MM-DD" or a URL) so readers know when the table was last verified and which upstream version it corresponds to; update the header or add a short footnote beneath the LLM/VLM tables and include a changelog entry or comment indicating where to update this stamp in future.
85-97: Add an explicit migration note from removed `llm_export.py` to new CLI commands. This section introduces the new flow, but it doesn't directly help existing users map old commands/options to `tensorrt-edgellm-*`. A short "Migration from legacy `llm_export.py`" subsection would reduce upgrade friction and support burden.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` around lines 85 - 97, Add a short "Migration from legacy llm_export.py" subsection in the README that maps common llm_export.py commands/options to the new tensorrt-edgellm-* CLI equivalents; explicitly list the most-used llm_export.py actions (quantize/export/build/infer) and show the corresponding tensorrt-edgellm-<stage> command names, flag mappings, and any changed argument semantics so users can translate old invocations (e.g., quantize -> tensorrt-edgellm-quantize, export -> tensorrt-edgellm-export, build -> tensorrt-edgellm-build, inference -> tensorrt-edgellm-infer) and note any deprecated flags or required new flags.
100-112: Pin TensorRT-Edge-LLM to a tagged release (or commit) in docs. `git clone` + `pip install .` on the default branch is non-reproducible and can break over time. Please document a known-good tag/commit for this repo to make the instructions stable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` around lines 100 - 112, Update the README's repository checkout/install steps so they pin TensorRT-Edge-LLM to a known-good tag or commit instead of cloning the default branch; replace the current `git clone` and `pip3 install .` flow in the instructions with a step that clones a specific release tag or clones then checks out a pinned commit (e.g., use `git clone ... && git checkout <TAG_OR_COMMIT>` or `git clone --branch <TAG> --depth 1`) and state the exact tag/commit string used, so the subsequent `python3 -m venv venv`, `source venv/bin/activate`, and `pip3 install .` operate on a reproducible checkout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@examples/torch_onnx/README.md`:
- Around line 210-247: The Supported Models matrix under the "Supported Models"
header in README.md lacks a validation timestamp or version; add a clear
version/date stamp or a link to a canonical upstream compatibility page adjacent
to the matrix (e.g., "Validated against TensorRT-Edge-LLM vX.Y — YYYY-MM-DD" or
a URL) so readers know when the table was last verified and which upstream
version it corresponds to; update the header or add a short footnote beneath the
LLM/VLM tables and include a changelog entry or comment indicating where to
update this stamp in future.
- Around line 85-97: Add a short "Migration from legacy llm_export.py"
subsection in the README that maps common llm_export.py commands/options to the
new tensorrt-edgellm-* CLI equivalents; explicitly list the most-used
llm_export.py actions (quantize/export/build/infer) and show the corresponding
tensorrt-edgellm-<stage> command names, flag mappings, and any changed argument
semantics so users can translate old invocations (e.g., quantize ->
tensorrt-edgellm-quantize, export -> tensorrt-edgellm-export, build ->
tensorrt-edgellm-build, inference -> tensorrt-edgellm-infer) and note any
deprecated flags or required new flags.
- Around line 100-112: Update the README's repository checkout/install steps so
they pin TensorRT-Edge-LLM to a known-good tag or commit instead of cloning the
default branch; replace the current `git clone` and `pip3 install .` flow in the
instructions with a step that clones a specific release tag or clones then
checks out a pinned commit (e.g., use `git clone ... && git checkout
<TAG_OR_COMMIT>` or `git clone --branch <TAG> --depth 1`) and state the exact
tag/commit string used, so the subsequent `python3 -m venv venv`, `source
venv/bin/activate`, and `pip3 install .` operate on a reproducible checkout.
📒 Files selected for processing (1)
examples/torch_onnx/README.md
14ba626 to f273ad0 — Compare
Actionable comments posted: 1
🧹 Nitpick comments (1)
examples/torch_onnx/README.md (1)
213-250: Avoid duplicating the support matrix inline when you already link the canonical source. You already point to the live TensorRT-Edge-LLM support page on Line 213. Keeping large static ✅ tables here is likely to drift and create stale docs. Consider replacing these tables with a brief snapshot note (date/version) or removing them and relying on the canonical link.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` around lines 213 - 250, Summary: The README duplicates the live "TensorRT-Edge-LLM Supported Models" matrix (the linked page) causing potential drift; remove the large static LLMs/VLMs tables and replace them with a short snapshot note and canonical link. Fix: open the README content that contains the "TensorRT-Edge-LLM Supported Models" link and the following LLMs and VLMs tables, delete the duplicated tables labeled "LLMs" and "VLMs", and add a one-line snapshot summary like "Snapshot (YYYY-MM-DD): see canonical support matrix" plus the existing link to "TensorRT-Edge-LLM Supported Models"; optionally include a version or date field if you want to preserve a historic capture. Reference points: the linked text "TensorRT-Edge-LLM Supported Models" and the table headings "LLMs" and "VLMs" to locate the content to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/torch_onnx/README.md`:
- Around line 93-97: The overview "Quantize" step currently lists only "FP8,
INT4 AWQ, NVFP4" while the methods section (lines ~201-210) also documents "INT8
SmoothQuant" and "INT4 GPTQ"; update the overview's Quantize line to either
enumerate the full set (FP8, INT8 SmoothQuant, INT4 AWQ, INT4 GPTQ, NVFP4) to
match the methods section, or explicitly mark the overview as "common/default
examples" (e.g., prepend "Common examples:") so readers aren’t misled; ensure
you change the text in the README's overview "Quantize" step and keep the
methods section unchanged.
---
Nitpick comments:
In `@examples/torch_onnx/README.md`:
- Around line 213-250: Summary: The README duplicates the live
"TensorRT-Edge-LLM Supported Models" matrix (the linked page) causing potential
drift; remove the large static LLMs/VLMs tables and replace them with a short
snapshot note and canonical link. Fix: open the README content that contains the
"TensorRT-Edge-LLM Supported Models" link and the following LLMs and VLMs
tables, delete the duplicated tables labeled "LLMs" and "VLMs", and add a
one-line snapshot summary like "Snapshot (YYYY-MM-DD): see canonical support
matrix" plus the existing link to "TensorRT-Edge-LLM Supported Models";
optionally include a version or date field if you want to preserve a historic
capture. Reference points: the linked text "TensorRT-Edge-LLM Supported Models"
and the table headings "LLMs" and "VLMs" to locate the content to change.
📒 Files selected for processing (1)
examples/torch_onnx/README.md
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
f273ad0 to 4428528 — Compare
♻️ Duplicate comments (1)
examples/torch_onnx/README.md (1)
93-93: ⚠️ Potential issue | 🟡 Minor — Still incomplete: Quantization methods list in overview.
Line 93 lists only "FP8, INT4 AWQ, NVFP4" but the methods table (lines 208-210) also documents MXFP8, INT8 SmoothQuant, and INT4 GPTQ. This discrepancy was already flagged in a previous review but remains unresolved.
📝 Suggested fix
Either include the full set in the overview or label it as showing common examples:
```diff
-1. **Quantize** (x86 host with GPU) — Reduce model precision using ModelOpt (FP8, INT4 AWQ, NVFP4)
+1. **Quantize** (x86 host with GPU) — Reduce model precision using ModelOpt (FP8, INT4 AWQ/GPTQ, NVFP4, MXFP8, INT8 SmoothQuant)
```

Or:

```diff
-1. **Quantize** (x86 host with GPU) — Reduce model precision using ModelOpt (FP8, INT4 AWQ, NVFP4)
+1. **Quantize** (x86 host with GPU) — Reduce model precision using ModelOpt (common: FP8, INT4 AWQ, NVFP4; see [Quantization Methods](#quantization-methods) for full list)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` at line 93, The "Quantize" overview entry currently lists only "FP8, INT4 AWQ, NVFP4" but the methods table includes additional techniques (MXFP8, INT8 SmoothQuant, INT4 GPTQ); update the README overview to either enumerate the full set (add MXFP8, INT8 SmoothQuant, INT4 GPTQ) or explicitly mark the listed items as examples (e.g., "common examples: FP8, INT4 AWQ, NVFP4") so the overview and the methods table (the Quantize section and the methods table) are consistent.
🧹 Nitpick comments (1)
examples/torch_onnx/README.md (1)
123-123: Minor: Add 3B VRAM guidance to troubleshooting for completeness. Line 123 specifies "16 GB for models up to 3B" but line 254's troubleshooting section only mentions 4B and 8B. For consistency and completeness, consider adding the 3B guidance.
📝 Suggested addition
```diff
-- **GPU out of memory**: Use a larger GPU (40 GB for models up to 4B, 80 GB for models up to 8B) or try `--device cpu` (limited precision support).
+- **GPU out of memory**: Use a larger GPU (16 GB for models up to 3B, 40 GB for models up to 4B, 80 GB for models up to 8B) or try `--device cpu` (limited precision support).
```

Also applies to: 254-254
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` at line 123, Add the missing 3B VRAM guidance to the Troubleshooting section: locate the "Troubleshooting" heading and the existing GPU VRAM guidance that mentions 4B and 8B and insert a line mirroring the top-of-file note ("GPU VRAM: 16 GB for models up to 3B") so the Troubleshooting section consistently documents VRAM for 3B, 4B, and 8B models.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@examples/torch_onnx/README.md`:
- Line 93: The "Quantize" overview entry currently lists only "FP8, INT4 AWQ,
NVFP4" but the methods table includes additional techniques (MXFP8, INT8
SmoothQuant, INT4 GPTQ); update the README overview to either enumerate the full
set (add MXFP8, INT8 SmoothQuant, INT4 GPTQ) or explicitly mark the listed items
as examples (e.g., "common examples: FP8, INT4 AWQ, NVFP4") so the overview and
the methods table (the Quantize section and the methods table) are consistent.
---
Nitpick comments:
In `@examples/torch_onnx/README.md`:
- Line 123: Add the missing 3B VRAM guidance to the Troubleshooting section:
locate the "Troubleshooting" heading and the existing GPU VRAM guidance that
mentions 4B and 8B and insert a line mirroring the top-of-file note ("GPU VRAM:
16 GB for models up to 3B") so the Troubleshooting section consistently
documents VRAM for 3B, 4B, and 8B models.
📒 Files selected for processing (1)
examples/torch_onnx/README.md
Resolve conflict: accept deletion of modelopt/onnx/llm_export_utils/export_utils.py (removed on main by #1210). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
What does this PR do?
Type of change: Documentation, Cleanup
This PR removes the in-repo LLM ONNX export pipeline (`llm_export.py` and the `modelopt.onnx.llm_export_utils` package) and updates `examples/torch_onnx/README.md` to direct users to TensorRT-Edge-LLM, which provides a more complete and actively maintained pipeline for quantizing LLMs/VLMs with ModelOpt and exporting them to optimized ONNX for edge deployment (Jetson, DRIVE).

Removed:

- `examples/torch_onnx/llm_export.py` — standalone LLM export script
- `modelopt/onnx/llm_export_utils/` — supporting package (`export_utils.py`, `quantization_utils.py`, `surgeon_utils.py`)
- `tests/examples/torch_onnx/test_llm_export.py` — associated tests

Updated:

- `examples/torch_onnx/README.md` — rewrote the LLM section with TensorRT-Edge-LLM installation, CLI tools, usage examples (LLM, VLM, EAGLE speculative decoding), supported model matrix, quantization methods, and troubleshooting guidance

Usage
Users should now use TensorRT-Edge-LLM CLI tools instead of the removed `llm_export.py`.

Testing

- `torch_quant_to_onnx.py` and the mixed precision examples are unaffected.

Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Removed `llm_export.py` and the `modelopt.onnx.llm_export_utils` package; users should migrate to TensorRT-Edge-LLM.
- `CONTRIBUTING.md`: N/A
- Noted the removal of `llm_export_utils` and the migration to TensorRT-Edge-LLM.

Additional Information
Summary by CodeRabbit

- Documentation
- Chores