Replace in-repo LLM ONNX export with TensorRT-Edge-LLM #1210
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Removes the legacy LLM ONNX export pipeline and utilities, deletes the example export script and tests, and replaces the README with a TensorRT-Edge-LLM–focused LLM/VLM quantize→export→build→inference workflow and CLI example flows.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 1
🧹 Nitpick comments (1)
examples/torch_onnx/README.md (1)
105-110: Pin TensorRT-Edge-LLM to a stable release tag for reproducible docs. The current install flow tracks the default branch, so commands can silently drift and break over time. The latest stable release is `v0.6.0` (Mar 19, 2026), which includes both the `tensorrt-edgellm-quantize-llm` and `tensorrt-edgellm-export-llm` CLI commands. Update the snippet to clone with `--branch v0.6.0` or reference a specific commit SHA instead of relying on the default branch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` around lines 105 - 110, The clone step in the README currently tracks the default branch causing non-reproducible installs; update the git clone command for TensorRT-Edge-LLM to pin the stable release by adding --branch v0.6.0 (or replace with a specific commit SHA) so the sequence starting at the git clone line uses a fixed tag; ensure the README shows git clone --branch v0.6.0 https://github.com/NVIDIA/TensorRT-Edge-LLM.git (or the chosen SHA) before the subsequent git submodule update --init --recursive, venv creation, activation, and pip3 install steps.
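As an illustration of the pinned flow this comment describes, here is a dry-run sketch that prints the commands instead of executing them (the `v0.6.0` tag comes from the review comment above and is not verified here; substitute whichever tag or SHA you actually pin):

```shell
set -eu

# Tag taken from the review comment above; verify it exists upstream before pinning.
TAG="v0.6.0"
REPO="https://github.com/NVIDIA/TensorRT-Edge-LLM.git"

# --branch accepts tags as well as branches, and --depth 1 keeps the checkout small.
# The commands are printed rather than run so the sketch needs no network access.
CLONE_CMD="git clone --branch ${TAG} --depth 1 ${REPO}"
echo "${CLONE_CMD}"
echo "cd TensorRT-Edge-LLM"
echo "git submodule update --init --recursive"
```

Pinning via `--branch <tag>` rather than a post-clone `git checkout` keeps the clone shallow and the instructions one line shorter.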
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/torch_onnx/README.md`:
- Around line 117-123: The README's System requirements section currently lists
conflicting VRAM guidance ("GPU VRAM: 8-16 GB ... 20-48 GB for models up to 8B")
that contradicts the later statement saying "80 GB may be needed for up to 8B";
update both places under the "**System requirements:**" heading and the later
mention so they match and are split by model size, precision and loading method
(e.g., FP16/INT8 with tensor-parallel or offloading: 20–48 GB; FP32 or no
offload: ~80 GB), and add a short parenthetical note naming the methods that
require the higher VRAM (e.g., "full FP32/no offload") so readers can provision
correctly; ensure the same phrasing and numbers are used in both occurrences.
---
Nitpick comments:
In `@examples/torch_onnx/README.md`:
- Around line 105-110: The clone step in the README currently tracks the default
branch causing non-reproducible installs; update the git clone command for
TensorRT-Edge-LLM to pin the stable release by adding --branch v0.6.0 (or
replace with a specific commit SHA) so the sequence starting at the git clone
line uses a fixed tag; ensure the README shows git clone --branch v0.6.0
https://github.com/NVIDIA/TensorRT-Edge-LLM.git (or the chosen SHA) before the
subsequent git submodule update --init --recursive, venv creation, activation,
and pip3 install steps.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c95fd1ad-b69a-4966-ad57-7cfd28bbe508
📒 Files selected for processing (7)
- examples/torch_onnx/README.md
- examples/torch_onnx/llm_export.py
- modelopt/onnx/llm_export_utils/__init__.py
- modelopt/onnx/llm_export_utils/export_utils.py
- modelopt/onnx/llm_export_utils/quantization_utils.py
- modelopt/onnx/llm_export_utils/surgeon_utils.py
- tests/examples/torch_onnx/test_llm_export.py
💤 Files with no reviewable changes (6)
- tests/examples/torch_onnx/test_llm_export.py
- modelopt/onnx/llm_export_utils/__init__.py
- modelopt/onnx/llm_export_utils/surgeon_utils.py
- modelopt/onnx/llm_export_utils/quantization_utils.py
- examples/torch_onnx/llm_export.py
- modelopt/onnx/llm_export_utils/export_utils.py
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #1210 +/- ##
==========================================
+ Coverage 75.42% 76.72% +1.30%
==========================================
Files 353 353
Lines 40603 40596 -7
==========================================
+ Hits 30623 31149 +526
+ Misses 9980 9447 -533
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
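As a quick sanity check (not part of the original report), the percentages above follow directly from the hit/line counts in the table:

```python
# Recompute Codecov's numbers from the table above.
before = 30623 / 40603 * 100   # main: reported as 75.42%
after = 31149 / 40596 * 100    # PR head: reported as 76.72% (Codecov truncates)
delta = after - before         # reported as +1.30%

assert 40596 - 40603 == -7           # the "-7" lines row
assert abs(before - 75.42) < 0.01
assert abs(after - 76.72) < 0.01
assert abs(delta - 1.30) < 0.02
```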
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
The modelopt.onnx.llm_export_utils module was only used by the now-removed examples/torch_onnx/llm_export.py script. No other code in the codebase imports from it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
977e68c to 9f26f2c — Compare
♻️ Duplicate comments (1)
examples/torch_onnx/README.md (1)
117-123: ⚠️ Potential issue | 🟠 Major — Conflicting VRAM requirements remain unresolved.
Line 122 specifies 20-48 GB for models up to 8B, but Line 249 states 80 GB may be needed for up to 8B. This conflicts with the previous review comment and should be reconciled to avoid user confusion.
Also applies to: 249-249
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` around lines 117 - 123, The README's "System requirements" section contains conflicting GPU VRAM guidance for ~8B models: the "GPU VRAM" bullet states 20-48 GB while another place later claims 80 GB; reconcile them by picking a single authoritative recommendation or by clarifying conditions that produce different requirements (e.g., full-precision vs quantized, CPU/GPU offload, sequence length, and pipeline/PEFT usage). Update the "GPU VRAM" bullet in the System requirements and the later VRAM statement so they match and, if needed, add a short parenthetical explaining when 20-48 GB suffices vs when 80 GB might be required (e.g., no quantization and full context on a single GPU).
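One way to reconcile the two figures in the comment above is a weight-only back-of-envelope estimate. This sketch deliberately ignores activations, KV cache, and framework overhead, so real peak usage is higher; the point is only that precision alone moves the weight footprint by large factors:

```python
def weight_gib(params_billion: float, bytes_per_param: float) -> float:
    """Weight-only memory in GiB; activations, KV cache, and overhead are extra."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# For an 8B model, weights alone span a ~8x range across precisions, which is
# one plausible source of the 20-48 GB vs 80 GB discrepancy being discussed:
for name, bytes_pp in [("FP32", 4.0), ("FP16/BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{name:10s} ~{weight_gib(8, bytes_pp):5.1f} GiB")
```

Roughly 30 GiB of FP32 weights for 8B parameters, plus activations and calibration overhead, makes an 80 GB recommendation plausible for the unquantized path, while FP16/INT paths fit the 20-48 GB band.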
🧹 Nitpick comments (2)
examples/torch_onnx/README.md (2)
120-120: Clarify compute capability requirements. Line 120 states "Compute Capability 8.0+ (Ampere or newer)" as a general requirement, but Line 204 indicates FP8 specifically requires "SM89+ hardware (Hopper, Ada)", which is Compute Capability 8.9+. Consider clarifying that different quantization methods have different minimum requirements.
📝 Suggested clarification
```diff
-- GPU VRAM: 8-16 GB for models up to 3B, 20-48 GB for models up to 8B
+- NVIDIA GPU: Compute Capability 8.0+ (Ampere or newer) for INT4; 8.9+ (Hopper, Ada) for FP8; Blackwell for NVFP4
+- GPU VRAM: 8-16 GB for models up to 3B, 20-48 GB for models up to 8B
```

Also applies to: 204-204
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` at line 120, Update the README text to clearly distinguish general GPU requirements from FP8-specific requirements: change the general bullet "NVIDIA GPU with Compute Capability 8.0+ (Ampere or newer)" to note that most quantization methods require Compute Capability 8.0+ (Ampere), and add an explicit note that FP8 quantization requires SM89 / Compute Capability 8.9+ (Hopper, Ada) as stated later; ensure the README entries around the two mentions (the original 8.0+ line and the FP8 line at 204) are harmonized so readers know which quantization modes need SM89+ versus which work on 8.0+.
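The harmonized requirements this comment asks for can be expressed as a small capability gate. The thresholds below are the review's own claims (8.0 for INT4/INT8, 8.9 for FP8); the Blackwell-class `(10, 0)` value for NVFP4 is an assumption, and none of them are verified against TensorRT-Edge-LLM's docs:

```python
# Minimum compute capability per method, per the review's claims above
# (not verified against TensorRT-Edge-LLM's own documentation).
MIN_CC = {
    "INT4 AWQ": (8, 0),          # Ampere or newer
    "INT8 SmoothQuant": (8, 0),  # Ampere or newer
    "FP8": (8, 9),               # SM89+: Ada, Hopper
    "NVFP4": (10, 0),            # Blackwell-class (assumed value)
}

def supported(method: str, cc: tuple) -> bool:
    # Tuple comparison orders (major, minor) lexicographically,
    # which matches how compute capabilities compare.
    return cc >= MIN_CC[method]

assert not supported("FP8", (8, 6))   # an Ampere consumer part
assert supported("FP8", (8, 9))       # Ada
assert supported("INT4 AWQ", (8, 6))
```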
104-110: Consider pinning the TensorRT-Edge-LLM version. The installation clones the latest code from the main branch, which could introduce breaking changes over time. Consider adding a specific tag or commit hash for reproducibility.
📌 Suggested enhancement
```diff
 # Clone and install TensorRT-Edge-LLM
 git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
 cd TensorRT-Edge-LLM
+git checkout v1.0.0  # Pin to a specific stable version
 git submodule update --init --recursive
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` around lines 104 - 110, The README currently clones the latest TensorRT-Edge-LLM main branch which can break reproducibility; update the git clone step that references "TensorRT-Edge-LLM" so it pins a stable release (tag) or a specific commit hash (e.g., use a clone+checkout to the chosen tag/commit or clone with --branch <tag> --depth 1) and document which tag/commit you pinned so installs are deterministic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@examples/torch_onnx/README.md`:
- Around line 117-123: The README's "System requirements" section contains
conflicting GPU VRAM guidance for ~8B models: the "GPU VRAM" bullet states 20-48
GB while another place later claims 80 GB; reconcile them by picking a single
authoritative recommendation or by clarifying conditions that produce different
requirements (e.g., full-precision vs quantized, CPU/GPU offload, sequence
length, and pipeline/PEFT usage). Update the "GPU VRAM" bullet in the System
requirements and the later VRAM statement so they match and, if needed, add a
short parenthetical explaining when 20-48 GB suffices vs when 80 GB might be
required (e.g., no quantization and full context on a single GPU).
---
Nitpick comments:
In `@examples/torch_onnx/README.md`:
- Line 120: Update the README text to clearly distinguish general GPU
requirements from FP8-specific requirements: change the general bullet "NVIDIA
GPU with Compute Capability 8.0+ (Ampere or newer)" to note that most
quantization methods require Compute Capability 8.0+ (Ampere), and add an
explicit note that FP8 quantization requires SM89 / Compute Capability 8.9+
(Hopper, Ada) as stated later; ensure the README entries around the two mentions
(the original 8.0+ line and the FP8 line at 204) are harmonized so readers know
which quantization modes need SM89+ versus which work on 8.0+.
- Around line 104-110: The README currently clones the latest TensorRT-Edge-LLM
main branch which can break reproducibility; update the git clone step that
references "TensorRT-Edge-LLM" so it pins a stable release (tag) or a specific
commit hash (e.g., use a clone+checkout to the chosen tag/commit or clone with
--branch <tag> --depth 1) and document which tag/commit you pinned so installs
are deterministic.
📒 Files selected for processing (6)
- examples/torch_onnx/README.md
- examples/torch_onnx/llm_export.py
- modelopt/onnx/llm_export_utils/export_utils.py
- modelopt/onnx/llm_export_utils/quantization_utils.py
- modelopt/onnx/llm_export_utils/surgeon_utils.py
- tests/examples/torch_onnx/test_llm_export.py
💤 Files with no reviewable changes (5)
- tests/examples/torch_onnx/test_llm_export.py
- modelopt/onnx/llm_export_utils/surgeon_utils.py
- modelopt/onnx/llm_export_utils/quantization_utils.py
- modelopt/onnx/llm_export_utils/export_utils.py
- examples/torch_onnx/llm_export.py
…_template (#1225)

## Summary

- `apply_chat_template(..., return_tensors="pt")` returns a `BatchEncoding` in transformers 4.46+, which no longer subclasses `dict`
- The old guard `isinstance(tokenized, dict)` evaluates to `False` for `BatchEncoding`, so `input_ids` was set to the whole `BatchEncoding` object
- Calling `.shape[1]` on a `BatchEncoding` triggers `__getattr__("shape")` → `AttributeError`
- Fix: check `isinstance(tokenized, torch.Tensor)` instead, which correctly handles both old transformers (plain tensor) and new transformers (BatchEncoding)

This is causing `test_collect_hidden_states` to fail in the speculative decoding CI for all open PRs (#1207, #1210, #1221).

## Test plan

- [ ] `torch-pr (speculative_decoding, 26.01)` CI passes
- [ ] Verify fix handles both `torch.Tensor` return (old transformers) and `BatchEncoding` return (new transformers 4.46+)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
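The guard change in the commit message above can be illustrated with stand-in types, so the sketch needs neither torch nor transformers. `FakeTensor` and `FakeBatchEncoding` are simplified stand-ins introduced here; the real fix branches on `isinstance(tokenized, torch.Tensor)`:

```python
class FakeTensor:
    """Stand-in for torch.Tensor; only .shape matters for this sketch."""
    def __init__(self, shape):
        self.shape = shape

class FakeBatchEncoding:
    """Stand-in for transformers 4.46+ BatchEncoding: mapping-like, but NOT a dict subclass."""
    def __init__(self, data):
        self._data = data
    def __getitem__(self, key):
        return self._data[key]

def get_input_ids(tokenized):
    # Old, broken guard: `isinstance(tokenized, dict)` is False for BatchEncoding,
    # so the whole encoding leaked through and `.shape[1]` later raised AttributeError.
    # Fixed guard: branch on the tensor type instead.
    if isinstance(tokenized, FakeTensor):   # real code checks torch.Tensor
        return tokenized                    # old transformers: plain tensor
    return tokenized["input_ids"]           # new transformers: BatchEncoding

ids = FakeTensor((1, 7))
assert get_input_ids(ids).shape[1] == 7
assert get_input_ids(FakeBatchEncoding({"input_ids": ids})).shape[1] == 7
```

Checking the tensor type rather than the mapping type is the robust direction here, since the mapping type changed across transformers versions while the tensor type did not.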
b5246a8 to 14ba626 — Compare
🧹 Nitpick comments (3)
examples/torch_onnx/README.md (3)
210-247: Consider adding a version/date stamp for the supported-model matrix. This matrix will drift quickly as upstream support changes. Adding "validated against TensorRT-Edge-LLM &lt;version/date&gt;" (or linking to a canonical upstream support page) will keep expectations clear.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` around lines 210 - 247, The Supported Models matrix under the "Supported Models" header in README.md lacks a validation timestamp or version; add a clear version/date stamp or a link to a canonical upstream compatibility page adjacent to the matrix (e.g., "Validated against TensorRT-Edge-LLM vX.Y — YYYY-MM-DD" or a URL) so readers know when the table was last verified and which upstream version it corresponds to; update the header or add a short footnote beneath the LLM/VLM tables and include a changelog entry or comment indicating where to update this stamp in future.
85-97: Add an explicit migration note from removed `llm_export.py` to new CLI commands. This section introduces the new flow, but it doesn't directly help existing users map old commands/options to `tensorrt-edgellm-*`. A short "Migration from legacy `llm_export.py`" subsection would reduce upgrade friction and support burden.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` around lines 85 - 97, Add a short "Migration from legacy llm_export.py" subsection in the README that maps common llm_export.py commands/options to the new tensorrt-edgellm-* CLI equivalents; explicitly list the most-used llm_export.py actions (quantize/export/build/infer) and show the corresponding tensorrt-edgellm-<stage> command names, flag mappings, and any changed argument semantics so users can translate old invocations (e.g., quantize -> tensorrt-edgellm-quantize, export -> tensorrt-edgellm-export, build -> tensorrt-edgellm-build, inference -> tensorrt-edgellm-infer) and note any deprecated flags or required new flags.
100-112: Pin TensorRT-Edge-LLM to a tagged release (or commit) in docs. `git clone` + `pip install .` on the default branch is non-reproducible and can break over time. Please document a known-good tag/commit for this repo to make the instructions stable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` around lines 100 - 112, Update the README's repository checkout/install steps so they pin TensorRT-Edge-LLM to a known-good tag or commit instead of cloning the default branch; replace the current `git clone` and `pip3 install .` flow in the instructions with a step that clones a specific release tag or clones then checks out a pinned commit (e.g., use `git clone ... && git checkout <TAG_OR_COMMIT>` or `git clone --branch <TAG> --depth 1`) and state the exact tag/commit string used, so the subsequent `python3 -m venv venv`, `source venv/bin/activate`, and `pip3 install .` operate on a reproducible checkout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@examples/torch_onnx/README.md`:
- Around line 210-247: The Supported Models matrix under the "Supported Models"
header in README.md lacks a validation timestamp or version; add a clear
version/date stamp or a link to a canonical upstream compatibility page adjacent
to the matrix (e.g., "Validated against TensorRT-Edge-LLM vX.Y — YYYY-MM-DD" or
a URL) so readers know when the table was last verified and which upstream
version it corresponds to; update the header or add a short footnote beneath the
LLM/VLM tables and include a changelog entry or comment indicating where to
update this stamp in future.
- Around line 85-97: Add a short "Migration from legacy llm_export.py"
subsection in the README that maps common llm_export.py commands/options to the
new tensorrt-edgellm-* CLI equivalents; explicitly list the most-used
llm_export.py actions (quantize/export/build/infer) and show the corresponding
tensorrt-edgellm-<stage> command names, flag mappings, and any changed argument
semantics so users can translate old invocations (e.g., quantize ->
tensorrt-edgellm-quantize, export -> tensorrt-edgellm-export, build ->
tensorrt-edgellm-build, inference -> tensorrt-edgellm-infer) and note any
deprecated flags or required new flags.
- Around line 100-112: Update the README's repository checkout/install steps so
they pin TensorRT-Edge-LLM to a known-good tag or commit instead of cloning the
default branch; replace the current `git clone` and `pip3 install .` flow in the
instructions with a step that clones a specific release tag or clones then
checks out a pinned commit (e.g., use `git clone ... && git checkout
<TAG_OR_COMMIT>` or `git clone --branch <TAG> --depth 1`) and state the exact
tag/commit string used, so the subsequent `python3 -m venv venv`, `source
venv/bin/activate`, and `pip3 install .` operate on a reproducible checkout.
📒 Files selected for processing (1)
examples/torch_onnx/README.md
14ba626 to f273ad0 — Compare
Actionable comments posted: 1
🧹 Nitpick comments (1)
examples/torch_onnx/README.md (1)
213-250: Avoid duplicating the support matrix inline when you already link the canonical source. You already point to the live TensorRT-Edge-LLM support page on Line 213. Keeping large static ✅ tables here is likely to drift and create stale docs. Consider replacing these tables with a brief snapshot note (date/version) or removing them and relying on the canonical link.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` around lines 213 - 250, Summary: The README duplicates the live "TensorRT-Edge-LLM Supported Models" matrix (the linked page) causing potential drift; remove the large static LLMs/VLMs tables and replace them with a short snapshot note and canonical link. Fix: open the README content that contains the "TensorRT-Edge-LLM Supported Models" link and the following LLMs and VLMs tables, delete the duplicated tables labeled "LLMs" and "VLMs", and add a one-line snapshot summary like "Snapshot (YYYY-MM-DD): see canonical support matrix" plus the existing link to "TensorRT-Edge-LLM Supported Models"; optionally include a version or date field if you want to preserve a historic capture. Reference points: the linked text "TensorRT-Edge-LLM Supported Models" and the table headings "LLMs" and "VLMs" to locate the content to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/torch_onnx/README.md`:
- Around line 93-97: The overview "Quantize" step currently lists only "FP8,
INT4 AWQ, NVFP4" while the methods section (lines ~201-210) also documents "INT8
SmoothQuant" and "INT4 GPTQ"; update the overview's Quantize line to either
enumerate the full set (FP8, INT8 SmoothQuant, INT4 AWQ, INT4 GPTQ, NVFP4) to
match the methods section, or explicitly mark the overview as "common/default
examples" (e.g., prepend "Common examples:") so readers aren’t misled; ensure
you change the text in the README's overview "Quantize" step and keep the
methods section unchanged.
---
Nitpick comments:
In `@examples/torch_onnx/README.md`:
- Around line 213-250: Summary: The README duplicates the live
"TensorRT-Edge-LLM Supported Models" matrix (the linked page) causing potential
drift; remove the large static LLMs/VLMs tables and replace them with a short
snapshot note and canonical link. Fix: open the README content that contains the
"TensorRT-Edge-LLM Supported Models" link and the following LLMs and VLMs
tables, delete the duplicated tables labeled "LLMs" and "VLMs", and add a
one-line snapshot summary like "Snapshot (YYYY-MM-DD): see canonical support
matrix" plus the existing link to "TensorRT-Edge-LLM Supported Models";
optionally include a version or date field if you want to preserve a historic
capture. Reference points: the linked text "TensorRT-Edge-LLM Supported Models"
and the table headings "LLMs" and "VLMs" to locate the content to change.
📒 Files selected for processing (1)
examples/torch_onnx/README.md
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
f273ad0 to 4428528 — Compare
♻️ Duplicate comments (1)
examples/torch_onnx/README.md (1)
93-93: ⚠️ Potential issue | 🟡 Minor — Still incomplete: Quantization methods list in overview.
Line 93 lists only "FP8, INT4 AWQ, NVFP4" but the methods table (lines 208-210) also documents MXFP8, INT8 SmoothQuant, and INT4 GPTQ. This discrepancy was already flagged in a previous review but remains unresolved.
📝 Suggested fix
Either include the full set in the overview or label it as showing common examples:
```diff
-1. **Quantize** (x86 host with GPU) — Reduce model precision using ModelOpt (FP8, INT4 AWQ, NVFP4)
+1. **Quantize** (x86 host with GPU) — Reduce model precision using ModelOpt (FP8, INT4 AWQ/GPTQ, NVFP4, MXFP8, INT8 SmoothQuant)
```

Or:

```diff
-1. **Quantize** (x86 host with GPU) — Reduce model precision using ModelOpt (FP8, INT4 AWQ, NVFP4)
+1. **Quantize** (x86 host with GPU) — Reduce model precision using ModelOpt (common: FP8, INT4 AWQ, NVFP4; see [Quantization Methods](#quantization-methods) for full list)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` at line 93, The "Quantize" overview entry currently lists only "FP8, INT4 AWQ, NVFP4" but the methods table includes additional techniques (MXFP8, INT8 SmoothQuant, INT4 GPTQ); update the README overview to either enumerate the full set (add MXFP8, INT8 SmoothQuant, INT4 GPTQ) or explicitly mark the listed items as examples (e.g., "common examples: FP8, INT4 AWQ, NVFP4") so the overview and the methods table (the Quantize section and the methods table) are consistent.
🧹 Nitpick comments (1)
examples/torch_onnx/README.md (1)
123-123: Minor: Add 3B VRAM guidance to troubleshooting for completeness. Line 123 specifies "16 GB for models up to 3B" but line 254's troubleshooting section only mentions 4B and 8B. For consistency and completeness, consider adding the 3B guidance.
📝 Suggested addition
```diff
-- **GPU out of memory**: Use a larger GPU (40 GB for models up to 4B, 80 GB for models up to 8B) or try `--device cpu` (limited precision support).
+- **GPU out of memory**: Use a larger GPU (16 GB for models up to 3B, 40 GB for models up to 4B, 80 GB for models up to 8B) or try `--device cpu` (limited precision support).
```

Also applies to: 254-254
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/torch_onnx/README.md` at line 123, Add the missing 3B VRAM guidance to the Troubleshooting section: locate the "Troubleshooting" heading and the existing GPU VRAM guidance that mentions 4B and 8B and insert a line mirroring the top-of-file note ("GPU VRAM: 16 GB for models up to 3B") so the Troubleshooting section consistently documents VRAM for 3B, 4B, and 8B models.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@examples/torch_onnx/README.md`:
- Line 93: The "Quantize" overview entry currently lists only "FP8, INT4 AWQ,
NVFP4" but the methods table includes additional techniques (MXFP8, INT8
SmoothQuant, INT4 GPTQ); update the README overview to either enumerate the full
set (add MXFP8, INT8 SmoothQuant, INT4 GPTQ) or explicitly mark the listed items
as examples (e.g., "common examples: FP8, INT4 AWQ, NVFP4") so the overview and
the methods table (the Quantize section and the methods table) are consistent.
---
Nitpick comments:
In `@examples/torch_onnx/README.md`:
- Line 123: Add the missing 3B VRAM guidance to the Troubleshooting section:
locate the "Troubleshooting" heading and the existing GPU VRAM guidance that
mentions 4B and 8B and insert a line mirroring the top-of-file note ("GPU VRAM:
16 GB for models up to 3B") so the Troubleshooting section consistently
documents VRAM for 3B, 4B, and 8B models.
📒 Files selected for processing (1)
examples/torch_onnx/README.md
Resolve conflict: accept deletion of modelopt/onnx/llm_export_utils/export_utils.py (removed on main by #1210). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
What does this PR do?
Type of change: Documentation, Cleanup
This PR removes the in-repo LLM ONNX export pipeline (`llm_export.py` and the `modelopt.onnx.llm_export_utils` package) and updates `examples/torch_onnx/README.md` to direct users to TensorRT-Edge-LLM, which provides a more complete and actively maintained pipeline for quantizing LLMs/VLMs with ModelOpt and exporting them to optimized ONNX for edge deployment (Jetson, DRIVE).

Removed:

- `examples/torch_onnx/llm_export.py` — standalone LLM export script
- `modelopt/onnx/llm_export_utils/` — supporting package (`export_utils.py`, `quantization_utils.py`, `surgeon_utils.py`)
- `tests/examples/torch_onnx/test_llm_export.py` — associated tests

Updated:

- `examples/torch_onnx/README.md` — rewrote the LLM section with TensorRT-Edge-LLM installation, CLI tools, usage examples (LLM, VLM, EAGLE speculative decoding), supported model matrix, quantization methods, and troubleshooting guidance

Usage
Users should now use TensorRT-Edge-LLM CLI tools instead of the removed `llm_export.py`.

Testing

- `torch_quant_to_onnx.py` and the mixed precision examples are unaffected.

Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Removed `llm_export.py` and the `modelopt.onnx.llm_export_utils` package; users should migrate to TensorRT-Edge-LLM.
- `CONTRIBUTING.md`: N/A
- Noted the removal of `llm_export_utils` and the migration to TensorRT-Edge-LLM.

Additional Information
Summary by CodeRabbit

- Documentation
- Chores