[https://nvbugs/6162940][fix] Added a SentencePieceTokenizer wrapper in examples/utils.py that drives `sen#13983
Conversation
📝 WalkthroughWalkthroughThis PR replaces ChangesSentencePiece Tokenizer Wrapper
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/utils.py (1)
1-1:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUpdate SPDX copyright year range for this modified file.
Line 1 still ends at
2024, but this file is modified in 2026. Please extend the year range to include 2026.As per coding guidelines: “Include NVIDIA copyright header on all new files; update year on modified files”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/utils.py` at line 1, Update the SPDX copyright year range in the file's header comment so it includes 2026 (change the trailing year from 2024 to 2026); locate the SPDX header line that currently reads "SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved." and modify it to end with "2026" (e.g., "2022-2026") to reflect the file modification year.
🧹 Nitpick comments (1)
examples/utils.py (1)
45-91: ⚡ Quick winAdd explicit type annotations to the new tokenizer API methods.
Several new method signatures leave parameters/return types implicit (
__init__,encode,decode(ids, **kwargs),batch_decode(sequences, **kwargs)), which weakens interface clarity and violates the repo typing rule.Proposed typing-focused patch
-from typing import List, Optional +from typing import Optional +from collections.abc import Sequence class SentencePieceTokenizer: @@ - def __init__(self, - vocab_file: str, - padding_side: str = 'left', - truncation_side: str = 'left'): + def __init__(self, + vocab_file: str, + padding_side: str = 'left', + truncation_side: str = 'left') -> None: @@ - def encode(self, - text: str, - return_tensors: Optional[str] = None, - add_special_tokens: bool = True, - truncation: bool = False, - max_length: Optional[int] = None, - **kwargs): + def encode(self, + text: str, + return_tensors: Optional[str] = None, + add_special_tokens: bool = True, + truncation: bool = False, + max_length: Optional[int] = None, + **kwargs) -> list[int] | torch.Tensor: @@ - def decode(self, ids, skip_special_tokens: bool = False, **kwargs) -> str: + def decode(self, + ids: Sequence[int] | torch.Tensor, + skip_special_tokens: bool = False, + **kwargs) -> str: @@ - def batch_decode(self, - sequences, - skip_special_tokens: bool = False, - **kwargs) -> List[str]: + def batch_decode(self, + sequences: Sequence[Sequence[int]] | torch.Tensor, + skip_special_tokens: bool = False, + **kwargs) -> list[str]:As per coding guidelines: “Python code should use type annotations for all function arguments and return types”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/utils.py` around lines 45 - 91, The new tokenizer methods lack explicit type annotations; update __init__, encode, decode, and batch_decode to include full parameter and return type hints (e.g., annotate vocab_file: str, padding_side: str, truncation_side: str in __init__; for encode annotate text: str, return_tensors: Optional[str], add_special_tokens: bool, truncation: bool, max_length: Optional[int] and return Union[List[int], torch.Tensor]; for decode annotate ids: Union[torch.Tensor, Sequence[int], List[int]], skip_special_tokens: bool and return str; for batch_decode annotate sequences: Sequence[Union[torch.Tensor, Sequence[int], List[int]]], skip_special_tokens: bool and return List[str]). Also ensure required typing imports (Optional, List, Sequence, Union) are present at top of file.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/utils.py`:
- Around line 56-57: Replace the inline lambda assigned to _opt with a small
named helper function (e.g., def _opt(value: int) -> Optional[int]: ...) that
checks if value >= 0 and returns the int or None, then call that helper to set
self.pad_token_id = _opt(sp.pad_id()); update imports to include typing.Optional
if needed and keep the function name _opt to minimize changes and satisfy the
linter rule (Ruff E731).
---
Outside diff comments:
In `@examples/utils.py`:
- Line 1: Update the SPDX copyright year range in the file's header comment so
it includes 2026 (change the trailing year from 2024 to 2026); locate the SPDX
header line that currently reads "SPDX-FileCopyrightText: Copyright (c)
2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved." and modify it
to end with "2026" (e.g., "2022-2026") to reflect the file modification year.
---
Nitpick comments:
In `@examples/utils.py`:
- Around line 45-91: The new tokenizer methods lack explicit type annotations;
update __init__, encode, decode, and batch_decode to include full parameter and
return type hints (e.g., annotate vocab_file: str, padding_side: str,
truncation_side: str in __init__; for encode annotate text: str, return_tensors:
Optional[str], add_special_tokens: bool, truncation: bool, max_length:
Optional[int] and return Union[List[int], torch.Tensor]; for decode annotate
ids: Union[torch.Tensor, Sequence[int], List[int]], skip_special_tokens: bool
and return str; for batch_decode annotate sequences:
Sequence[Union[torch.Tensor, Sequence[int], List[int]]], skip_special_tokens:
bool and return List[str]). Also ensure required typing imports (Optional, List,
Sequence, Union) are present at top of file.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 77c0822c-fdd3-4ad7-a4b2-2e801fe7c3d5
📒 Files selected for processing (1)
examples/utils.py
transformers v5 replaced the pure-Python SentencePiece backend of T5Tokenizer / LlamaTokenizer with the Rust 'tokenizers' backend, so passing a raw SentencePiece .model vocab file (as done for NEMO gpt-next in examples/utils.py) no longer reads the actual vocabulary: vocab_size collapses to 104 and all tokens encode/decode to <unk>, yielding rouge1=0.0 for TestGptNext::test_auto_dtype. Replace the T5Tokenizer(vocab_file=...) path with a small SentencePiece-backed wrapper that exposes the transformers-like API (encode / decode / batch_decode / pad_token_id / eos_token_id / vocab_size) by delegating to SentencePieceProcessor directly. Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
0a82ef9 to
7a891c2
Compare
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
|
/bot run |
Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
|
/bot run |
|
PR_Github #47726 [ run ] triggered by Bot. Commit: |
|
PR_Github #47729 [ run ] triggered by Bot. Commit: |
|
PR_Github #47729 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #47802 [ run ] triggered by Bot. Commit: |
|
PR_Github #47802 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #47874 [ run ] triggered by Bot. Commit: |
|
PR_Github #47874 [ run ] completed with state |
…r in `examples/utils.py` that drives `sen (NVIDIA#13983) Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com> Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com> Co-authored-by: Jonas Li <6110159+longlee0622@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Summary
.modelvocab_file no longer populates the vocab — vocab_size became 104 and every token encoded/decoded as<unk>, producing rouge1=0.SentencePieceTokenizerwrapper inexamples/utils.pythat drivessentencepiece.SentencePieceProcessordirectly (preserving vocab_size=256000, pad=0, eos=3) and use it instead ofT5Tokenizer(vocab_file=...)for the NEMO/gpt-next path.Test plan
Links
Summary by CodeRabbit
Bug Fixes
Refactor