[https://nvbugs/6162940][fix] Added a SentencePieceTokenizer wrapper in examples/utils.py that drives `sen #13983

Merged
longlee0622 merged 3 commits into NVIDIA:main from tensorrt-cicd:repair-bot-bug6162940 on May 12, 2026
Conversation

@tensorrt-cicd
Collaborator

@tensorrt-cicd tensorrt-cicd commented May 11, 2026

Summary

  • Root cause: transformers v5.3 moved T5Tokenizer to the Rust tokenizers backend, so loading a raw SentencePiece .model file via vocab_file no longer populates the vocabulary. vocab_size dropped to 104 and every token encoded/decoded as <unk>, producing rouge1=0.
  • Fix: Added a SentencePieceTokenizer wrapper in examples/utils.py that drives sentencepiece.SentencePieceProcessor directly (preserving vocab_size=256000, pad=0, eos=3) and used it in place of T5Tokenizer(vocab_file=...) on the NEMO/gpt-next path.
  • Automated fix generated by repair-bot
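
The delegation described in the fix can be sketched roughly as follows. This is not the code merged in the PR: the class name `SentencePieceTokenizerSketch`, the `from_file` constructor, and the `_id_or_none` helper are hypothetical, and `return_tensors`, `skip_special_tokens`, padding and truncation are left out. It only assumes the processor exposes `get_piece_size()`, `pad_id()`, `eos_id()`, `encode(text, out_type=int)` and `decode(ids)`, as `sentencepiece.SentencePieceProcessor` does.

```python
from typing import Any, List, Optional, Sequence


class SentencePieceTokenizerSketch:
    """Hypothetical stand-in for the PR's wrapper; illustration only."""

    def __init__(self, processor: Any) -> None:
        # `processor` should behave like sentencepiece.SentencePieceProcessor.
        self._sp = processor
        # Read the vocabulary size from the .model itself, not from transformers.
        self.vocab_size: int = processor.get_piece_size()
        self.pad_token_id: Optional[int] = self._id_or_none(processor.pad_id())
        self.eos_token_id: Optional[int] = self._id_or_none(processor.eos_id())

    @classmethod
    def from_file(cls, vocab_file: str) -> "SentencePieceTokenizerSketch":
        # Imported lazily so the class can be exercised without the package.
        import sentencepiece

        return cls(sentencepiece.SentencePieceProcessor(model_file=vocab_file))

    @staticmethod
    def _id_or_none(token_id: int) -> Optional[int]:
        # SentencePiece reports -1 for a special token the model does not define.
        return token_id if token_id >= 0 else None

    def encode(self, text: str) -> List[int]:
        return list(self._sp.encode(text, out_type=int))

    def decode(self, ids: Sequence[int]) -> str:
        return self._sp.decode([int(i) for i in ids])

    def batch_decode(self, sequences: Sequence[Sequence[int]]) -> List[str]:
        return [self.decode(ids) for ids in sequences]
```

Taking the processor as a constructor argument is a choice made for this sketch so it can be tried with a stub; the wrapper in the PR is described as taking the vocab file path directly.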

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Bug Fixes

    • Fixed tokenizer loading to improve compatibility with specific model vocabulary file formats
    • Enhanced tokenization reliability and consistency across supported model architectures
  • Refactor

    • Streamlined tokenizer initialization and loading process while maintaining existing behavior and special token handling

@coderabbitai
Contributor

coderabbitai Bot commented May 11, 2026

📝 Walkthrough

This PR replaces T5Tokenizer with a custom SentencePieceTokenizer wrapper in examples/utils.py. The new wrapper directly loads sentencepiece.SentencePieceProcessor and provides transformers-compatible encode, decode, and batch_decode methods, preserving left-side padding and truncation semantics for SentencePiece .model vocab files.
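
As an illustration of the left-side semantics mentioned in the walkthrough, here is a minimal sketch on plain lists of token IDs. The helper names `truncate_left` and `pad_left` are hypothetical and do not come from the PR; the sketch assumes that left truncation drops tokens from the start of a sequence and that left padding prepends the pad ID.

```python
from typing import List, Sequence


def truncate_left(ids: Sequence[int], max_length: int) -> List[int]:
    """Keep the last `max_length` ids, dropping tokens from the start."""
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    overflow = max(len(ids) - max_length, 0)
    return list(ids[overflow:])


def pad_left(batch: Sequence[Sequence[int]], pad_id: int) -> List[List[int]]:
    """Left-pad every sequence to the length of the longest one."""
    width = max((len(ids) for ids in batch), default=0)
    return [[pad_id] * (width - len(ids)) + list(ids) for ids in batch]
```

With left-side handling the most recent tokens of a prompt stay adjacent to the position where generation starts, which is the usual reason decoder-only pipelines choose it.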

Changes

SentencePiece Tokenizer Wrapper

| Layer / File(s) | Summary |
| --- | --- |
| SentencePieceTokenizer Class Definition (examples/utils.py) | New class loads .model via sentencepiece.SentencePieceProcessor, computes special token IDs with -1 fallback handling, and provides encode() with optional return_tensors='pt' support, decode(), and batch_decode() with left-side padding and truncation. |
| Tokenizer Loading Integration (examples/utils.py) | _load_tokenizer() now instantiates SentencePieceTokenizer(vocab_file, padding_side='left', truncation_side='left') for relevant models instead of T5Tokenizer. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The title is cut off mid-sentence ('that drives `sen') and doesn't fully convey the main change; the complete message is unclear. | Complete the title to clearly summarize the change, e.g., '[https://nvbugs/6162940][fix] Add SentencePieceTokenizer wrapper to fix vocab loading for SentencePiece models'. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The description covers root cause, solution, and test plan comprehensively; it clearly explains the transformers v5.3 compatibility issue and the wrapper-based fix. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/utils.py (1)

1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update SPDX copyright year range for this modified file.

Line 1 still ends at 2024, but this file is modified in 2026. Please extend the year range to include 2026.

As per coding guidelines: “Include NVIDIA copyright header on all new files; update year on modified files”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/utils.py` at line 1, Update the SPDX copyright year range in the
file's header comment so it includes 2026 (change the trailing year from 2024 to
2026); locate the SPDX header line that currently reads "SPDX-FileCopyrightText:
Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved."
and modify it to end with "2026" (e.g., "2022-2026") to reflect the file
modification year.
🧹 Nitpick comments (1)
examples/utils.py (1)

45-91: ⚡ Quick win

Add explicit type annotations to the new tokenizer API methods.

Several new method signatures leave parameters/return types implicit (__init__, encode, decode(ids, **kwargs), batch_decode(sequences, **kwargs)), which weakens interface clarity and violates the repo typing rule.

Proposed typing-focused patch
-from typing import List, Optional
+from typing import Optional
+from collections.abc import Sequence

 class SentencePieceTokenizer:
@@
-    def __init__(self,
-                 vocab_file: str,
-                 padding_side: str = 'left',
-                 truncation_side: str = 'left'):
+    def __init__(self,
+                 vocab_file: str,
+                 padding_side: str = 'left',
+                 truncation_side: str = 'left') -> None:
@@
-    def encode(self,
-               text: str,
-               return_tensors: Optional[str] = None,
-               add_special_tokens: bool = True,
-               truncation: bool = False,
-               max_length: Optional[int] = None,
-               **kwargs):
+    def encode(self,
+               text: str,
+               return_tensors: Optional[str] = None,
+               add_special_tokens: bool = True,
+               truncation: bool = False,
+               max_length: Optional[int] = None,
+               **kwargs) -> list[int] | torch.Tensor:
@@
-    def decode(self, ids, skip_special_tokens: bool = False, **kwargs) -> str:
+    def decode(self,
+               ids: Sequence[int] | torch.Tensor,
+               skip_special_tokens: bool = False,
+               **kwargs) -> str:
@@
-    def batch_decode(self,
-                     sequences,
-                     skip_special_tokens: bool = False,
-                     **kwargs) -> List[str]:
+    def batch_decode(self,
+                     sequences: Sequence[Sequence[int]] | torch.Tensor,
+                     skip_special_tokens: bool = False,
+                     **kwargs) -> list[str]:

As per coding guidelines: “Python code should use type annotations for all function arguments and return types”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/utils.py` around lines 45 - 91, The new tokenizer methods lack
explicit type annotations; update __init__, encode, decode, and batch_decode to
include full parameter and return type hints (e.g., annotate vocab_file: str,
padding_side: str, truncation_side: str in __init__; for encode annotate text:
str, return_tensors: Optional[str], add_special_tokens: bool, truncation: bool,
max_length: Optional[int] and return Union[List[int], torch.Tensor]; for decode
annotate ids: Union[torch.Tensor, Sequence[int], List[int]],
skip_special_tokens: bool and return str; for batch_decode annotate sequences:
Sequence[Union[torch.Tensor, Sequence[int], List[int]]], skip_special_tokens:
bool and return List[str]). Also ensure required typing imports (Optional, List,
Sequence, Union) are present at top of file.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/utils.py`:
- Around line 56-57: Replace the inline lambda assigned to _opt with a small
named helper function (e.g., def _opt(value: int) -> Optional[int]: ...) that
checks if value >= 0 and returns the int or None, then call that helper to set
self.pad_token_id = _opt(sp.pad_id()); update imports to include typing.Optional
if needed and keep the function name _opt to minimize changes and satisfy the
linter rule (Ruff E731).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 77c0822c-fdd3-4ad7-a4b2-2e801fe7c3d5

📥 Commits

Reviewing files that changed from the base of the PR and between 9547230 and 0a82ef9.

📒 Files selected for processing (1)
  • examples/utils.py

Comment thread examples/utils.py Outdated
@longlee0622 longlee0622 self-requested a review May 11, 2026 10:09
transformers v5 replaced the pure-Python SentencePiece backend of
T5Tokenizer / LlamaTokenizer with the Rust 'tokenizers' backend, so
passing a raw SentencePiece .model vocab file (as done for NEMO
gpt-next in examples/utils.py) no longer reads the actual vocabulary:
vocab_size collapses to 104 and all tokens encode/decode to <unk>,
yielding rouge1=0.0 for TestGptNext::test_auto_dtype.

Replace the T5Tokenizer(vocab_file=...) path with a small
SentencePiece-backed wrapper that exposes the transformers-like API
(encode / decode / batch_decode / pad_token_id / eos_token_id /
vocab_size) by delegating to SentencePieceProcessor directly.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
@longlee0622 longlee0622 force-pushed the repair-bot-bug6162940 branch from 0a82ef9 to 7a891c2 Compare May 11, 2026 10:11
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
@longlee0622 longlee0622 enabled auto-merge (squash) May 11, 2026 10:17
@longlee0622
Collaborator

/bot run

Comment thread examples/utils.py
Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
@longlee0622
Collaborator

/bot run

@tensorrt-cicd
Collaborator Author

PR_Github #47726 [ run ] triggered by Bot. Commit: d80ac0c Link to invocation

@tensorrt-cicd
Collaborator Author

PR_Github #47729 [ run ] triggered by Bot. Commit: d80ac0c Link to invocation

@tensorrt-cicd
Collaborator Author

PR_Github #47729 [ run ] completed with state SUCCESS. Commit: d80ac0c
/LLM/main/L0_MergeRequest_PR pipeline #37625 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@longlee0622
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator Author

PR_Github #47802 [ run ] triggered by Bot. Commit: d80ac0c Link to invocation

@tensorrt-cicd
Collaborator Author

PR_Github #47802 [ run ] completed with state SUCCESS. Commit: d80ac0c
/LLM/main/L0_MergeRequest_PR pipeline #37693 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@longlee0622
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator Author

PR_Github #47874 [ run ] triggered by Bot. Commit: d80ac0c Link to invocation

@tensorrt-cicd
Collaborator Author

PR_Github #47874 [ run ] completed with state SUCCESS. Commit: d80ac0c
/LLM/main/L0_MergeRequest_PR pipeline #37734 completed with status: 'SUCCESS'

CI Report

Link to invocation

@longlee0622 longlee0622 merged commit da7b8b3 into NVIDIA:main May 12, 2026
6 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026
…r in `examples/utils.py` that drives `sen (NVIDIA#13983)

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
Co-authored-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
3 participants