…1337) Signed-off-by: Parth Chadha <pchadha@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
📝 Walkthrough

Replaces the token-prefix merge helper with a new EOS-boundary-based replacement, updates tokenization and preprocessing to use it, adds an async OpenAI-serving tokenization override, deep-copies chat messages, handles tool_calls, and installs a vLLM log filter to suppress “Added request” messages. Tests are updated and new fixtures added.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant OpenAI API Server
    participant NeMoRLOpenAIServingTokenization as Tokenization Override
    participant Tokenizer
    participant Helper as _replace_prefix_tokens
    Client->>OpenAI API Server: ChatCompletion.create(request)
    OpenAI API Server->>Tokenization Override: create_tokenize(request, raw_request)
    Tokenization Override->>Tokenizer: tokenize(template/messages)
    Tokenizer-->>Tokenization Override: template_prefix_token_ids, template_token_ids
    Tokenization Override->>Tokenizer: tokenize(model prefix)
    Tokenizer-->>Tokenization Override: model_prefix_token_ids
    Tokenization Override->>Helper: replace(model_prefix, template_prefix, template_tokens)
    Helper-->>Tokenization Override: combined_token_ids
    Tokenization Override-->>OpenAI API Server: TokenizeResult(tokens=combined, count)
    OpenAI API Server-->>Client: proceed with generation
```
```mermaid
sequenceDiagram
    autonumber
    participant vLLM as vLLM Server Logger
    participant Filter as LoggingFilter
    vLLM->>Filter: LogRecord("Added request ...")
    alt message contains "Added request"
        Filter-->>vLLM: reject
    else other messages
        Filter-->>vLLM: pass through
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Pre-merge checks and finishing touches

❌ Failed checks (3 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
289-297: Use the actual template prefix tokens when splicing; the current call passes the wrong sequence

You compute the template prefix token IDs in actual_corresponding_token_ids but then pass request.required_prefix_token_ids twice. This can splice at the wrong EOS boundary and breaks monotonic token alignment. Fix by using actual_corresponding_token_ids.
```diff
-    final_prompt_token_ids = _replace_prefix_tokens(
-        tokenizer=tokenizer,
-        model_prefix_token_ids=request.required_prefix_token_ids,
-        template_prefix_token_ids=request.required_prefix_token_ids,
-        template_token_ids=engine_prompt["prompt_token_ids"],
-    )
+    final_prompt_token_ids = _replace_prefix_tokens(
+        tokenizer=tokenizer,
+        model_prefix_token_ids=request.required_prefix_token_ids,
+        template_prefix_token_ids=actual_corresponding_token_ids,
+        template_token_ids=engine_prompt["prompt_token_ids"],
+    )
```
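For intuition, here is a minimal, self-contained sketch of the EOS-boundary splice described above. The function name, the argument handling, and the no-EOS fallback (keep the template stream whole) are illustrative assumptions — the real helper also takes a tokenizer and lives in vllm_worker_async.py:

```python
def replace_prefix_tokens(
    model_prefix_token_ids: list[int],
    template_prefix_token_ids: list[int],
    template_token_ids: list[int],
    eos_token_id: int,
) -> list[int]:
    """Splice the model's own prefix tokenization onto the template stream.

    Scans the template prefix backwards for the last EOS token, drops the
    template tokens up to and including that boundary, and prepends the
    model-tokenized prefix so earlier turns keep their exact token IDs.
    If no EOS is found, the template stream is kept whole.
    """
    cut = 0
    for pos in reversed(range(len(template_prefix_token_ids))):
        if template_prefix_token_ids[pos] == eos_token_id:
            cut = pos + 1  # first position after the last EOS in the prefix
            break
    return model_prefix_token_ids + template_token_ids[cut:]
```

This also illustrates why the review comment matters: the splice point is derived from the *template* prefix tokenization, so passing the model prefix in its place can move the cut to the wrong EOS.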
🧹 Nitpick comments (4)
nemo_rl/models/generation/vllm/vllm_worker_async.py (3)
109-116: Compare EOS against the prefix slice directly for clarity and safety

The loop bounds by len(template_prefix_token_ids) but compares template_token_ids[pos]. Compare against template_prefix_token_ids[pos] to make the intent explicit and avoid accidental mismatches.
```diff
-    for pos in reversed(range(len(template_prefix_token_ids))):
-        if template_token_ids[pos] == eos_token_id:
+    for pos in reversed(range(len(template_prefix_token_ids))):
+        if template_prefix_token_ids[pos] == eos_token_id:
             template_cut_start = pos
             break
```
72-78: Fix docstring typo (“uppdate” → “update”)

Nit, but easy to clean up.
```diff
-    and image tokenization is non-unique, then we will need to uppdate this
+    and image tokenization is non-unique, then we will need to update this
     function.
```
426-437: Add an alias route /v1/tokenize to avoid clients using “/../tokenize”

Exposing /v1/tokenize alongside /tokenize removes the need for path backtracking in tests/clients.
```diff
 @app.post("/tokenize")
 async def tokenize(request: NeMoRLTokenizeRequest, raw_request: Request):
     generator = await openai_serving_tokenization.create_tokenize(
         request, raw_request
     )
@@
     elif isinstance(generator, TokenizeResponse):
         return JSONResponse(content=generator.model_dump())
+
+# Alias for versioned path to match /v1/chat/completions
+app.add_api_route("/v1/tokenize", tokenize, methods=["POST"])
```

tests/unit/models/generation/test_vllm_generation.py (1)
1188-1190: Prefer calling /v1/tokenize rather than “/../tokenize”

Backtracking from “/v1” to reach “/tokenize” is brittle. Once the server exposes /v1/tokenize, update these calls to use it directly.
After applying the server alias, update:
- Line 1189: f"{base_urls[0]}/v1/tokenize"
- Line 1476: f"{base_urls[0]}/v1/tokenize"
Also applies to: 1475-1477
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- nemo_rl/models/generation/vllm/vllm_worker_async.py (7 hunks)
- tests/unit/models/generation/test_vllm_generation.py (2 hunks)
- tests/unit/models/generation/test_vllmasyncgenerationworker_replace_prefix_worker.json (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
- tests/unit/models/generation/test_vllm_generation.py
- nemo_rl/models/generation/vllm/vllm_worker_async.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
🧠 Learnings (1)
📚 Learning: 2025-09-10T05:29:34.349Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.349Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
🧬 Code graph analysis (2)
tests/unit/models/generation/test_vllm_generation.py (2)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
_replace_prefix_tokens (38-121)

tests/unit/environments/test_code_environment.py (1)
tokenizer(85-94)
nemo_rl/models/generation/vllm/vllm_worker_async.py (2)
tests/unit/models/generation/test_vllm_generation.py (1)
tokenizer (238-241)

nemo_rl/algorithms/utils.py (1)
get_tokenizer(157-288)
🪛 Ruff (0.14.0)
nemo_rl/models/generation/vllm/vllm_worker_async.py
281-281: Local variable actual_corresponding_token_ids is assigned to but never used
Remove assignment to unused variable actual_corresponding_token_ids
(F841)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Lint check
- GitHub Check: Lint check
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (2)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
442-452: vLLM log noise filter looks good

Good use of a Filter to suppress spammy “Added request” lines while keeping errors visible.
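For reference, a log filter of this shape can be sketched as follows; the logger name "vllm" and the exact message text are assumptions rather than details confirmed by the diff:

```python
import logging


class AddedRequestFilter(logging.Filter):
    """Drop noisy per-request log lines while letting everything else through."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False suppresses the record; True passes it through.
        return "Added request" not in record.getMessage()


# Hypothetical attachment point; the real logger name comes from vLLM.
logging.getLogger("vllm").addFilter(AddedRequestFilter())
```

Because the filter matches on the rendered message rather than the level, errors and warnings that do not contain the phrase still reach the handlers unchanged.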
tests/unit/models/generation/test_vllmasyncgenerationworker_replace_prefix_worker.json (1)
1-5: Fixture OK

Structure and contents align with the new replace-prefix tests.
```python
async def create_tokenize(self, request, raw_request):
    """Override to handle required_prefix_token_ids for tokenization."""
    # Call parent's create_tokenize first
    result = await super().create_tokenize(request, raw_request)

    # If there's an error or no required_prefix_token_ids, return as-is
    if isinstance(result, ErrorResponse):
        return result

    # Only process chat requests (not completion requests)
    if not hasattr(request, "messages"):
        return result

    # Get the template-tokenized tokens from the result
    template_token_ids = result.tokens

    # Get the tokenizer from the engine client
    tokenizer = await self.engine_client.get_tokenizer()

    # Apply _replace_prefix_tokens to fix up the tokenization
    final_token_ids = _replace_prefix_tokens(
        tokenizer=tokenizer,
        model_prefix_token_ids=request.required_prefix_token_ids,
        template_prefix_token_ids=request.required_prefix_token_ids,
        template_token_ids=template_token_ids,
    )

    # Update the result with the corrected tokens
    result.tokens = final_token_ids
    result.count = len(final_token_ids)

    return result
```
Derive template_prefix_token_ids in /tokenize override; current code uses model prefix length as a proxy
In create_tokenize you pass template_prefix_token_ids=request.required_prefix_token_ids, which can misidentify the EOS cut point when template and model retokenizations differ. Compute the template prefix via the base _preprocess_chat for messages up to the last assistant turn, then splice.
```diff
 async def create_tokenize(self, request, raw_request):
     """Override to handle required_prefix_token_ids for tokenization."""
     # Call parent's create_tokenize first
     result = await super().create_tokenize(request, raw_request)
     # If there's an error or no required_prefix_token_ids, return as-is
     if isinstance(result, ErrorResponse):
         return result
     # Only process chat requests (not completion requests)
     if not hasattr(request, "messages"):
         return result
-    # Get the template-tokenized tokens from the result
-    template_token_ids = result.tokens
-
-    # Get the tokenizer from the engine client
-    tokenizer = await self.engine_client.get_tokenizer()
-
-    # Apply _replace_prefix_tokens to fix up the tokenization
-    final_token_ids = _replace_prefix_tokens(
-        tokenizer=tokenizer,
-        model_prefix_token_ids=request.required_prefix_token_ids,
-        template_prefix_token_ids=request.required_prefix_token_ids,
-        template_token_ids=template_token_ids,
-    )
+    # Get the template-tokenized tokens from the result
+    template_token_ids = result.tokens
+
+    # Get the tokenizer from the engine client
+    tokenizer = await self.engine_client.get_tokenizer()
+
+    # Find last assistant message index
+    last_assistant_message_idx = None
+    for i in reversed(range(len(request.messages))):
+        if request.messages[i].get("role") == "assistant":
+            last_assistant_message_idx = i
+            break
+    if last_assistant_message_idx is None:
+        return result
+
+    # Build messages up to and including the last assistant turn
+    messages_to_last_assistant = deepcopy(
+        request.messages[: last_assistant_message_idx + 1]
+    )
+
+    # Compute the template prefix tokens using the base _preprocess_chat
+    corresponding_res = await super(NeMoRLOpenAIServingMixin, self)._preprocess_chat(  # type: ignore[misc]
+        request=request,
+        tokenizer=tokenizer,
+        messages=messages_to_last_assistant,
+        chat_template=self.chat_template,
+        chat_template_content_format=self.chat_template_content_format,
+        add_generation_prompt=False,
+        continue_final_message=False,
+        tool_dicts=None,
+        documents=None,
+        chat_template_kwargs=None,
+        tool_parser=None,
+        truncate_prompt_tokens=None,
+        add_special_tokens=False,
+    )
+    template_prefix_token_ids = corresponding_res[2][0]["prompt_token_ids"]
+
+    # Apply _replace_prefix_tokens to fix up the tokenization
+    final_token_ids = _replace_prefix_tokens(
+        tokenizer=tokenizer,
+        model_prefix_token_ids=request.required_prefix_token_ids,
+        template_prefix_token_ids=template_prefix_token_ids,
+        template_token_ids=template_token_ids,
+    )
     # Update the result with the corrected tokens
     result.tokens = final_token_ids
     result.count = len(final_token_ids)
     return result
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
async def create_tokenize(self, request, raw_request):
    """Override to handle required_prefix_token_ids for tokenization."""
    # Call parent's create_tokenize first
    result = await super().create_tokenize(request, raw_request)
    # If there's an error or no required_prefix_token_ids, return as-is
    if isinstance(result, ErrorResponse):
        return result
    # Only process chat requests (not completion requests)
    if not hasattr(request, "messages"):
        return result
    # Get the template-tokenized tokens from the result
    template_token_ids = result.tokens
    # Get the tokenizer from the engine client
    tokenizer = await self.engine_client.get_tokenizer()
    # Find last assistant message index
    last_assistant_message_idx = None
    for i in reversed(range(len(request.messages))):
        if request.messages[i].get("role") == "assistant":
            last_assistant_message_idx = i
            break
    if last_assistant_message_idx is None:
        return result
    # Build messages up to and including the last assistant turn
    messages_to_last_assistant = deepcopy(
        request.messages[: last_assistant_message_idx + 1]
    )
    # Compute the template prefix tokens using the base _preprocess_chat
    corresponding_res = await super(NeMoRLOpenAIServingMixin, self)._preprocess_chat(  # type: ignore[misc]
        request=request,
        tokenizer=tokenizer,
        messages=messages_to_last_assistant,
        chat_template=self.chat_template,
        chat_template_content_format=self.chat_template_content_format,
        add_generation_prompt=False,
        continue_final_message=False,
        tool_dicts=None,
        documents=None,
        chat_template_kwargs=None,
        tool_parser=None,
        truncate_prompt_tokens=None,
        add_special_tokens=False,
    )
    template_prefix_token_ids = corresponding_res[2][0]["prompt_token_ids"]
    # Apply _replace_prefix_tokens to fix up the tokenization
    final_token_ids = _replace_prefix_tokens(
        tokenizer=tokenizer,
        model_prefix_token_ids=request.required_prefix_token_ids,
        template_prefix_token_ids=template_prefix_token_ids,
        template_token_ids=template_token_ids,
    )
    # Update the result with the corrected tokens
    result.tokens = final_token_ids
    result.count = len(final_token_ids)
    return result
```
🤖 Prompt for AI Agents
In nemo_rl/models/generation/vllm/vllm_worker_async.py around lines 382 to 414,
the override currently passes request.required_prefix_token_ids as
template_prefix_token_ids which is incorrect when template and model
tokenizations differ; instead call the base chat preprocessing (e.g.
_preprocess_chat) to derive the template tokenization for the messages up to the
last assistant turn, compute template_prefix_token_ids from that template token
stream, then pass those computed template_prefix_token_ids into
_replace_prefix_tokens and splice template_token_ids accordingly before updating
result.tokens and result.count.
…cing (1337)` into `r0.4.0` (#1360) Signed-off-by: Parth Chadha <pchadha@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Parth Chadha <pchadha@nvidia.com>
beep boop [🤖]: Hi @parthchadha 👋,
Summary by CodeRabbit