[https://nvbugs/6162940][fix] Added a `SentencePieceTokenizer` wrapper in `examples/utils.py` that drives `sen by tensorrt-cicd · Pull Request #13983 · NVIDIA/TensorRT-LLM

tensorrt-cicd · 2026-05-11T08:39:08Z

Summary

Root cause: transformers v5.3 moved T5Tokenizer to the Rust tokenizers backend, so loading a raw SentencePiece .model vocab_file no longer populates the vocab — vocab_size became 104 and every token encoded/decoded as <unk>, producing rouge1=0.
Fix: Added a SentencePieceTokenizer wrapper in examples/utils.py that drives sentencepiece.SentencePieceProcessor directly (preserving vocab_size=256000, pad=0, eos=3) and used it instead of T5Tokenizer(vocab_file=...) for the NEMO/gpt-next path.
Automated fix generated by repair-bot
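
For readers without the diff at hand, here is a minimal sketch of what a wrapper of this kind could look like. It is not the code merged in this PR: the constructor taking a processor object, the `from_file` helper, and the reduced `encode`/`decode` signatures are assumptions made so the example stays small and can be exercised without a real `.model` file. Only `SentencePieceProcessor(model_file=...)`, `get_piece_size()`, `pad_id()`, `eos_id()`, `encode()` and `decode()` are taken from the sentencepiece Python API.

```python
from typing import List, Optional, Sequence


class SentencePieceTokenizer:
    """Illustrative transformers-like facade over a SentencePiece processor."""

    def __init__(self, processor) -> None:
        # `processor` is any object exposing the SentencePieceProcessor
        # methods used below (an assumption made for testability).
        self.sp = processor
        self.vocab_size: int = processor.get_piece_size()
        # SentencePiece reports an undefined special token as -1.
        pad, eos = processor.pad_id(), processor.eos_id()
        self.pad_token_id: Optional[int] = pad if pad >= 0 else None
        self.eos_token_id: Optional[int] = eos if eos >= 0 else None

    @classmethod
    def from_file(cls, vocab_file: str) -> "SentencePieceTokenizer":
        # Imported lazily so the class can be used without sentencepiece installed.
        import sentencepiece as spm

        return cls(spm.SentencePieceProcessor(model_file=vocab_file))

    def encode(self, text: str) -> List[int]:
        return list(self.sp.encode(text))

    def decode(self, ids: Sequence[int]) -> str:
        return self.sp.decode(list(ids))

    def batch_decode(self, sequences: Sequence[Sequence[int]]) -> List[str]:
        return [self.decode(seq) for seq in sequences]
```

Because the vocabulary is read by SentencePiece itself, `vocab_size` and the special token ids come straight from the `.model` file rather than from whatever the transformers backend reconstructs.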

Test plan

Verify fix on the same GPU type as the original failure
Check for regressions in related tests
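
The regression check can be made concrete by asserting on the two symptoms named in the Summary: a collapsed `vocab_size` and every token encoding as `<unk>`. The helper below is a hypothetical sketch, not part of this PR. The name `check_tokenizer_health`, the probe string, and the `unk_token_id` argument are assumptions, and the caller must supply the real unknown-token id for the model in question.

```python
from typing import List


def check_tokenizer_health(tokenizer,
                           expected_vocab_size: int,
                           unk_token_id: int,
                           probe: str = "Hello, world!") -> List[str]:
    """Return a list of problems; an empty list means the tokenizer looks sane."""
    problems: List[str] = []
    if tokenizer.vocab_size != expected_vocab_size:
        problems.append(
            f"vocab_size is {tokenizer.vocab_size}, expected {expected_vocab_size}")
    ids = list(tokenizer.encode(probe))
    # A tokenizer that never loaded its vocabulary maps every piece to <unk>.
    if ids and all(i == unk_token_id for i in ids):
        problems.append("every probe token encoded as <unk>")
    return problems
```

For the NEMO/gpt-next path described above, `expected_vocab_size` would be 256000. A tokenizer showing the reported failure (vocab_size of 104, all `<unk>`) would trip both checks.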

Links

Bug: https://nvbugs/6162940

Summary by CodeRabbit

Bug Fixes
- Fixed tokenizer loading to improve compatibility with specific model vocabulary file formats
- Enhanced tokenization reliability and consistency across supported model architectures
Refactor
- Streamlined tokenizer initialization and loading process while maintaining existing behavior and special token handling

coderabbitai · 2026-05-11T08:43:37Z

Walkthrough

This PR replaces T5Tokenizer with a custom SentencePieceTokenizer wrapper in examples/utils.py. The new wrapper directly loads sentencepiece.SentencePieceProcessor and provides transformers-compatible encode, decode, and batch_decode methods, preserving left-side padding and truncation semantics for SentencePiece .model vocab files.

Changes

SentencePiece Tokenizer Wrapper

| Layer / File(s) | Summary |
|---|---|
| SentencePieceTokenizer Class Definition `examples/utils.py` | New class loads `.model` via `sentencepiece.SentencePieceProcessor`, computes special token IDs with -1 fallback handling, and provides `encode()` with optional `return_tensors='pt'` support, `decode()`, and `batch_decode()` with left-side padding and truncation. |
| Tokenizer Loading Integration `examples/utils.py` | `_load_tokenizer()` now instantiates `SentencePieceTokenizer(vocab_file, padding_side='left', truncation_side='left')` for relevant models instead of `T5Tokenizer`. |
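
As a point of reference for `padding_side='left'` and `truncation_side='left'`, the sketch below shows how left-side truncation and padding are commonly applied to a list of token ids. It is not taken from the PR diff; the function names `truncate` and `pad` are hypothetical and the wrapper's actual behavior may differ.

```python
from typing import List, Sequence


def truncate(ids: Sequence[int], max_length: int, side: str = "left") -> List[int]:
    """Keep at most max_length ids, dropping the excess from the given side."""
    ids = list(ids)
    if len(ids) <= max_length:
        return ids
    if side == "left":
        # Drop the oldest tokens and keep the tail of the sequence.
        return ids[len(ids) - max_length:]
    return ids[:max_length]


def pad(ids: Sequence[int], length: int, pad_id: int, side: str = "left") -> List[int]:
    """Extend ids to `length` with pad_id on the given side; never shortens."""
    ids = list(ids)
    padding = [pad_id] * max(0, length - len(ids))
    return padding + ids if side == "left" else ids + padding
```

Left-side truncation keeps the most recent context, and left-side padding keeps the last real token at the end of the sequence, which is the position a decoder-only model continues from.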

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The title is cut off mid-sentence ('that drives `sen') and doesn't fully convey the main change; the complete message is unclear. | Complete the title to clearly summarize the change, e.g., '[https://nvbugs/6162940][fix] Add SentencePieceTokenizer wrapper to fix vocab loading for SentencePiece models'. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The description covers root cause, solution, and test plan comprehensively; it clearly explains the transformers v5.3 compatibility issue and the wrapper-based fix. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

examples/utils.py (1)
1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update SPDX copyright year range for this modified file.

Line 1 still ends at 2024, but this file is modified in 2026. Please extend the year range to include 2026.

As per coding guidelines: “Include NVIDIA copyright header on all new files; update year on modified files”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/utils.py` at line 1, Update the SPDX copyright year range in the
file's header comment so it includes 2026 (change the trailing year from 2024 to
2026); locate the SPDX header line that currently reads "SPDX-FileCopyrightText:
Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved."
and modify it to end with "2026" (e.g., "2022-2026") to reflect the file
modification year.

🧹 Nitpick comments (1)

examples/utils.py (1)

45-91: ⚡ Quick win

Add explicit type annotations to the new tokenizer API methods.

Several new method signatures leave parameters/return types implicit (__init__, encode, decode(ids, **kwargs), batch_decode(sequences, **kwargs)), which weakens interface clarity and violates the repo typing rule.

Proposed typing-focused patch

-from typing import List, Optional
+from typing import Optional
+from collections.abc import Sequence

 class SentencePieceTokenizer:
@@
-    def __init__(self,
-                 vocab_file: str,
-                 padding_side: str = 'left',
-                 truncation_side: str = 'left'):
+    def __init__(self,
+                 vocab_file: str,
+                 padding_side: str = 'left',
+                 truncation_side: str = 'left') -> None:
@@
-    def encode(self,
-               text: str,
-               return_tensors: Optional[str] = None,
-               add_special_tokens: bool = True,
-               truncation: bool = False,
-               max_length: Optional[int] = None,
-               **kwargs):
+    def encode(self,
+               text: str,
+               return_tensors: Optional[str] = None,
+               add_special_tokens: bool = True,
+               truncation: bool = False,
+               max_length: Optional[int] = None,
+               **kwargs) -> list[int] | torch.Tensor:
@@
-    def decode(self, ids, skip_special_tokens: bool = False, **kwargs) -> str:
+    def decode(self,
+               ids: Sequence[int] | torch.Tensor,
+               skip_special_tokens: bool = False,
+               **kwargs) -> str:
@@
-    def batch_decode(self,
-                     sequences,
-                     skip_special_tokens: bool = False,
-                     **kwargs) -> List[str]:
+    def batch_decode(self,
+                     sequences: Sequence[Sequence[int]] | torch.Tensor,
+                     skip_special_tokens: bool = False,
+                     **kwargs) -> list[str]:

As per coding guidelines: “Python code should use type annotations for all function arguments and return types”.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/utils.py` around lines 45 - 91, The new tokenizer methods lack
explicit type annotations; update __init__, encode, decode, and batch_decode to
include full parameter and return type hints (e.g., annotate vocab_file: str,
padding_side: str, truncation_side: str in __init__; for encode annotate text:
str, return_tensors: Optional[str], add_special_tokens: bool, truncation: bool,
max_length: Optional[int] and return Union[List[int], torch.Tensor]; for decode
annotate ids: Union[torch.Tensor, Sequence[int], List[int]],
skip_special_tokens: bool and return str; for batch_decode annotate sequences:
Sequence[Union[torch.Tensor, Sequence[int], List[int]]], skip_special_tokens:
bool and return List[str]). Also ensure required typing imports (Optional, List,
Sequence, Union) are present at top of file.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/utils.py`:
- Around line 56-57: Replace the inline lambda assigned to _opt with a small
named helper function (e.g., def _opt(value: int) -> Optional[int]: ...) that
checks if value >= 0 and returns the int or None, then call that helper to set
self.pad_token_id = _opt(sp.pad_id()); update imports to include typing.Optional
if needed and keep the function name _opt to minimize changes and satisfy the
linter rule (Ruff E731).

---

Outside diff comments:
In `@examples/utils.py`:
- Line 1: Update the SPDX copyright year range in the file's header comment so
it includes 2026 (change the trailing year from 2024 to 2026); locate the SPDX
header line that currently reads "SPDX-FileCopyrightText: Copyright (c)
2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved." and modify it
to end with "2026" (e.g., "2022-2026") to reflect the file modification year.

---

Nitpick comments:
In `@examples/utils.py`:
- Around line 45-91: The new tokenizer methods lack explicit type annotations;
update __init__, encode, decode, and batch_decode to include full parameter and
return type hints (e.g., annotate vocab_file: str, padding_side: str,
truncation_side: str in __init__; for encode annotate text: str, return_tensors:
Optional[str], add_special_tokens: bool, truncation: bool, max_length:
Optional[int] and return Union[List[int], torch.Tensor]; for decode annotate
ids: Union[torch.Tensor, Sequence[int], List[int]], skip_special_tokens: bool
and return str; for batch_decode annotate sequences:
Sequence[Union[torch.Tensor, Sequence[int], List[int]]], skip_special_tokens:
bool and return List[str]). Also ensure required typing imports (Optional, List,
Sequence, Union) are present at top of file.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 77c0822c-fdd3-4ad7-a4b2-2e801fe7c3d5

📥 Commits

Reviewing files that changed from the base of the PR and between 9547230 and 0a82ef9.

📒 Files selected for processing (1)

examples/utils.py

transformers v5 replaced the pure-Python SentencePiece backend of T5Tokenizer / LlamaTokenizer with the Rust 'tokenizers' backend, so passing a raw SentencePiece .model vocab file (as done for NEMO gpt-next in examples/utils.py) no longer reads the actual vocabulary: vocab_size collapses to 104 and all tokens encode/decode to <unk>, yielding rouge1=0.0 for TestGptNext::test_auto_dtype.

Replace the T5Tokenizer(vocab_file=...) path with a small SentencePiece-backed wrapper that exposes the transformers-like API (encode / decode / batch_decode / pad_token_id / eos_token_id / vocab_size) by delegating to SentencePieceProcessor directly.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>

longlee0622 · 2026-05-11T10:17:46Z

/bot run

Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>

longlee0622 · 2026-05-11T10:21:58Z

/bot run

tensorrt-cicd · 2026-05-11T10:23:21Z

PR_Github #47726 [ run ] triggered by Bot. Commit: d80ac0c Link to invocation

tensorrt-cicd · 2026-05-11T10:28:13Z

PR_Github #47729 [ run ] triggered by Bot. Commit: d80ac0c Link to invocation

tensorrt-cicd · 2026-05-11T11:31:57Z

PR_Github #47729 [ run ] completed with state SUCCESS. Commit: d80ac0c
/LLM/main/L0_MergeRequest_PR pipeline #37625 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

longlee0622 · 2026-05-12T00:07:36Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-12T00:13:14Z

PR_Github #47802 [ run ] triggered by Bot. Commit: d80ac0c Link to invocation

tensorrt-cicd · 2026-05-12T04:55:21Z

PR_Github #47802 [ run ] completed with state SUCCESS. Commit: d80ac0c
/LLM/main/L0_MergeRequest_PR pipeline #37693 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

longlee0622 · 2026-05-12T04:56:29Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-12T05:02:31Z

PR_Github #47874 [ run ] triggered by Bot. Commit: d80ac0c Link to invocation

tensorrt-cicd · 2026-05-12T06:22:16Z

PR_Github #47874 [ run ] completed with state SUCCESS. Commit: d80ac0c
/LLM/main/L0_MergeRequest_PR pipeline #37734 completed with status: 'SUCCESS'

CI Report

Link to invocation

…r in `examples/utils.py` that drives `sen (NVIDIA#13983)

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
Co-authored-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

tensorrt-cicd requested a review from a team as a code owner May 11, 2026 08:39

tensorrt-cicd requested review from Shixiaowei02 and chang-l May 11, 2026 08:39

tensorrt-cicd assigned longlee0622 May 11, 2026

github-actions Bot assigned tensorrt-cicd May 11, 2026

coderabbitai Bot reviewed May 11, 2026

Comment thread examples/utils.py Outdated

longlee0622 self-requested a review May 11, 2026 10:09

longlee0622 approved these changes May 11, 2026

longlee0622 force-pushed the repair-bot-bug6162940 branch from 0a82ef9 to 7a891c2 Compare May 11, 2026 10:11

Apply suggestion from @coderabbitai[bot]

4333ac8

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>

longlee0622 enabled auto-merge (squash) May 11, 2026 10:17

longlee0622 reviewed May 11, 2026

Comment thread examples/utils.py

Apply suggestion from @longlee0622

d80ac0c

Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>

chang-l approved these changes May 11, 2026

longlee0622 merged commit da7b8b3 into NVIDIA:main May 12, 2026
6 checks passed
