doc: example demonstrating how to train CosmosReason2 Eagle3 #977

Open

h-guo18 wants to merge 4 commits into main from haoguo/skierat-cr2-guide

Conversation


@h-guo18 h-guo18 commented Mar 5, 2026

What does this PR do?

Type of change: new example

Original PR: #965

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, using torch.load(..., weights_only=True), avoiding pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other source, did you follow IP policy in CONTRIBUTING.md?: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features
    • Added configuration template for speculative decoding model setups.
    • New utility script for processing and preparing chat conversation datasets.
    • Training guide notebook with step-by-step workflow, calibration instructions, multi-GPU execution, and deployment examples.


copy-pr-bot bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Walkthrough

This pull request introduces a complete speculative decoding training pipeline for the EAGLE3 draft head on Nvidia Cosmos-Reason2-8B. It includes a JSON configuration file defining model parameters, a comprehensive Jupyter notebook documenting the entire training workflow from data preparation through deployment, and a Python script for converting Nemotron dataset conversations into the required training format.

Changes

Cohort / File(s) Summary
  • Configuration — examples/speculative_decoding/guides/CR2_eagle_config.json: JSON configuration specifying draft vocabulary size (32000), attention implementation (flex_attention), and rope scaling parameters (yarn rope type with beta/factor settings).
  • Training Workflow — examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb: Jupyter notebook documenting the end-to-end training pipeline: Nemotron data preparation with format validation (89511 conversations expected), 32000-token vocabulary calibration, FSDP2-enabled multi-GPU training, checkpoint export to Hugging Face format, and vLLM deployment with speculative decoding configuration.
  • Data Preparation — examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py: Python script for ingesting Nemotron-Post-Training-Dataset-v2 from the Hugging Face Hub. Features parquet shard discovery, streaming/mapping-based modes for data loading, conversation role normalization (user/assistant), and dataset persistence via utility functions.
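The configuration entry above names three settings. As a rough sketch of what CR2_eagle_config.json could contain — the draft vocabulary size (32000) and flex_attention come from the summary, but the exact key names and all rope-scaling numbers below are illustrative assumptions, not copied from the PR:

```json
{
  "draft_vocab_size": 32000,
  "attn_implementation": "flex_attention",
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "beta_fast": 32.0,
    "beta_slow": 1.0
  }
}
```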

Sequence Diagram

sequenceDiagram
    participant User
    participant DataPrep as add_nemotron_chat.py
    participant HFHub as Hugging Face Hub
    participant Training as launch_train.sh<br/>(FSDP2)
    participant Export as export_hf_checkpoint.py
    participant vLLM as vLLM Server

    User->>DataPrep: Run with Nemotron dataset config
    DataPrep->>HFHub: Discover & download chat shards
    HFHub-->>DataPrep: Parquet data
    DataPrep->>DataPrep: Parse conversations,<br/>normalize roles
    DataPrep->>DataPrep: Calibrate 32k<br/>draft vocabulary
    DataPrep-->>User: input_conversations.jsonl<br/>+ d2t.pt mapping

    User->>Training: Launch with CR2_eagle_config.json<br/>+ prepared data
    Training->>Training: Multi-GPU training<br/>with FSDP2
    Training-->>User: Trained checkpoint

    User->>Export: Convert checkpoint<br/>to HF format
    Export-->>User: HF-compatible model

    User->>vLLM: Load model + speculative<br/>decoding config
    vLLM-->>User: Ready for inference
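The "Calibrate 32k draft vocabulary" step in the diagram amounts to picking the most frequent target-vocab tokens from calibration text and recording a draft-to-target index mapping (the d2t.pt artifact). A toy sketch, with a hypothetical function name rather than the actual modelopt implementation:

```python
from collections import Counter

def calibrate_draft_vocab(token_ids: list[int], draft_size: int) -> list[int]:
    # Keep the draft_size most frequent target-vocab token ids; the sorted
    # result acts as a d2t mapping: draft index i -> target token id d2t[i].
    counts = Counter(token_ids)
    top = [tok for tok, _ in counts.most_common(draft_size)]
    return sorted(top)

# Target-vocab ids observed while tokenizing calibration text:
d2t = calibrate_draft_vocab([5, 3, 5, 7, 3, 5, 9], draft_size=2)
print(d2t)  # [3, 5]: ids 5 and 3 occur most often
```

In the real pipeline the mapping is saved as a tensor (d2t.pt) and used to translate draft-head predictions back into the target model's vocabulary.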

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 75.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3)
  • Description Check ✅ — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ — The title clearly summarizes the main contribution: adding an example/guide demonstrating how to train CosmosReason2 with Eagle3 speculative decoding, which aligns with the primary changes (new notebook, configuration, and training script).
  • Security Anti-Patterns ✅ — Comprehensive scan found no security anti-patterns (torch.load, numpy.load, trust_remote_code, eval/exec, or nosec comments) in the repository.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch haoguo/skierat-cr2-guide

Comment @coderabbitai help to get the list of available commands and usage tips.

skierat added 4 commits March 5, 2026 00:39
Signed-off-by: Slawek Kierat <skierat@nvidia.com>
Signed-off-by: Slawek Kierat <skierat@nvidia.com>
Signed-off-by: Slawek Kierat <skierat@nvidia.com>
Signed-off-by: Slawek Kierat <skierat@nvidia.com>
@h-guo18 h-guo18 force-pushed the haoguo/skierat-cr2-guide branch from 1f8944f to 345b0bb (March 5, 2026 00:40)

copy-pr-bot bot commented Mar 5, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@h-guo18 h-guo18 marked this pull request as ready for review March 5, 2026 00:41
@h-guo18 h-guo18 requested a review from a team as a code owner March 5, 2026 00:41
@h-guo18 h-guo18 requested a review from yeyu-nvidia March 5, 2026 00:41
@h-guo18 h-guo18 changed the title from "Haoguo/skierat cr2 guide" to "doc: example demonstrating how to train CosmosReason2 Eagle3" Mar 5, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb (2)

84-94: Consider making the output paths explicit in the data-prep command.

Right now the notebook relies on add_nemotron_chat.py defaults for --output-dir / --output-split-name. For a guide, being explicit helps users avoid “where did my file go?” issues when their working directory differs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb`
around lines 84 - 94, The notebook cell calling add_nemotron_chat.py relies on
defaults for output locations; update the call to include explicit output flags
so users know where files will be written. Modify the bash cell that runs python
../prepare_input_conversations/add_nemotron_chat.py to add --output-dir with a
clear path (e.g., ./data/nemotron) and --output-split-name (e.g., train.jsonl)
so the script invocation explicitly sets output locations; keep the script name
add_nemotron_chat.py and the existing --mapping-file flag unchanged.

36-46: Clarify whether the notebook is meant to run against the repo checkout or the PyPI wheel (reproducibility).

On Line 40-41, pip install -U nvidia-modelopt[hf] plus pip install -r ../requirements.txt can lead to version skew (installed wheel vs checked-out scripts) and non-reproducible environments. Consider explicitly stating the intended mode, and optionally pinning a known-good nvidia-modelopt version for this guide.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb`
around lines 36 - 46, The notebook currently runs "pip install -U
nvidia-modelopt[hf]" and "pip install -r ../requirements.txt", which can produce
version skew between an installed PyPI wheel and the local repo; update the
notebook to explicitly state whether the guide targets the checked-out repo or
the PyPI wheel, and change the install cell to one of two clear options: (A) for
repo development, install local editable packages (e.g., instructions to pip
install -e . from the repo) and skip the PyPI wheel, or (B) for reproducible
runs, pin a specific known-good nvidia-modelopt version (replace "pip install -U
nvidia-modelopt[hf]" with a pinned version) before running "pip install -r
../requirements.txt"; reference the exact commands "pip install -U
nvidia-modelopt[hf]" and "pip install -r ../requirements.txt" in the notebook
text so readers know which mode to follow.
examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py (2)

162-236: Ensure --output-dir exists; also consider the memory profile in streaming mode.

  • If args.output_dir doesn’t exist, update_dataset_file_with_conversations(...) may fail depending on how utils is implemented. Creating it here makes the script more robust.
  • In streaming mode with --max-samples omitted (Line 206), the code can end up accumulating the entire split in input_conversations (Line 209), which may be very large.
Proposed robustness fix (directory creation)
 async def main(args: argparse.Namespace) -> None:
+    args.output_dir.mkdir(parents=True, exist_ok=True)
     # Resolve Hub download URLs for chat parquet shards only.
     # Using load_dataset("parquet", data_files=urls) bypasses the repo's split
     # metadata validation, so only the chat shards are ever downloaded.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py`
around lines 162 - 236, The script can fail if args.output_dir doesn't exist and
can OOM in streaming mode since input_conversations accumulates the whole split;
before calling update_dataset_file_with_conversations in main, ensure
args.output_dir is created (mkdir parents=True, exist_ok=True) and in the
streaming branch (where load_dataset(..., streaming=True) / stream and iterable
are used) avoid building input_conversations for unbounded runs — either
require/validate args.max_samples is set or switch to incremental writing:
process samples in batches and call update_dataset_file_with_conversations (or a
new append-mode helper) per-batch instead of collecting all entries in
input_conversations.
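The incremental-writing alternative described above can be sketched as follows (append_conversations_jsonl and write_in_batches are hypothetical helpers for illustration, not the existing utils API):

```python
import json
from itertools import islice
from pathlib import Path

def append_conversations_jsonl(path: Path, batch: list[dict]) -> None:
    # Hypothetical append-mode helper: one JSON object per line.
    with path.open("a", encoding="utf-8") as f:
        for conv in batch:
            f.write(json.dumps(conv) + "\n")

def write_in_batches(samples, path: Path, batch_size: int = 1000) -> int:
    # Stream samples to disk batch by batch instead of holding the whole
    # split in memory; also create the output directory up front.
    path.parent.mkdir(parents=True, exist_ok=True)
    it = iter(samples)
    written = 0
    while batch := list(islice(it, batch_size)):
        append_conversations_jsonl(path, batch)
        written += len(batch)
    return written
```

This keeps peak memory bounded by batch_size even when --max-samples is omitted in streaming mode, and the mkdir call addresses the missing-output-dir failure in the same pass.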

43-66: Consider truncating “available parquet files” in the error message.

On Line 61-63, the ValueError embeds a full list of parquet files. If the dataset grows, this can create extremely noisy exceptions (and logs).

Proposed tweak (limit output, keep it actionable)
     if not split_files:
+        available_parquet_files = [f for f in all_files if f.endswith(".parquet")]
+        preview = available_parquet_files[:25]
         err_msg = (
             f"No parquet files found for split '{split_name}' in {dataset_id}. "
-            f"Available files: {[f for f in all_files if f.endswith('.parquet')]}"
+            f"Available parquet files (showing {len(preview)}/{len(available_parquet_files)}): {preview}"
         )
         raise ValueError(err_msg)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py`
around lines 43 - 66, In get_split_parquet_urls, the ValueError currently embeds
the full list of available parquet files which can be huge; change the error
message construction in the ValueError raised in get_split_parquet_urls to
truncate the printed list (e.g., show the first N entries like 5, then "... (X
more)"), or if total <= N show them all, so the message stays actionable but
bounded. Keep the same error context (split_name and dataset_id) and include the
total count of parquet files when truncating to aid debugging.
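For context, the shard-discovery filter being discussed reduces to something like this (the data/<split>/ path layout is an assumption about the dataset repo, and the function is a simplified stand-in for get_split_parquet_urls):

```python
def filter_split_parquet_files(all_files: list[str], split_name: str) -> list[str]:
    # Keep only parquet shards that belong to the requested split.
    return [
        f for f in all_files
        if f.endswith(".parquet") and f"/{split_name}/" in f
    ]

repo_files = [
    "data/chat/part-0.parquet",
    "data/chat/part-1.parquet",
    "data/code/part-0.parquet",
    "README.md",
]
print(filter_split_parquet_files(repo_files, "chat"))
# ['data/chat/part-0.parquet', 'data/chat/part-1.parquet']
```

When this filter returns an empty list, the truncated error message proposed above is what the user would see.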

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2bf1ef45-2895-41c3-8ae2-94e18ce66eeb

📥 Commits

Reviewing files that changed from the base of the PR and between e8f9687 and 345b0bb.

⛔ Files ignored due to path filters (1)
  • examples/speculative_decoding/guides/nemotron_mapping.csv is excluded by !**/*.csv
📒 Files selected for processing (3)
  • examples/speculative_decoding/guides/CR2_eagle_config.json
  • examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb
  • examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py

Comment on lines +96 to +108
"cell_type": "code",
"metadata": {},
"source": [
"%%bash\n",
"# Expect exactly 89511 conversations.\n",
"count=$(wc -l < input_conversations/nemotron-chat.jsonl)\n",
"echo \"${count} conversations in ../input_conversations/nemotron-chat.jsonl\"\n",
"[ \"$count\" -eq 89511 ] || { echo \"ERROR: expected 89511, got ${count}\"; exit 1; }"
],
"execution_count": null,
"outputs": [],
"id": "d05b97d3"
},

⚠️ Potential issue | 🟡 Minor

Fix the echoed file path (currently inconsistent with what’s actually counted).

On Line 102, the cell counts input_conversations/nemotron-chat.jsonl but echoes ../input_conversations/nemotron-chat.jsonl.

Proposed fix
-        "echo \"${count} conversations in ../input_conversations/nemotron-chat.jsonl\"\n",
+        "echo \"${count} conversations in input_conversations/nemotron-chat.jsonl\"\n",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb`
around lines 96 - 108, The echoed path is inconsistent with the file actually
being counted: the bash cell computes count with count=$(wc -l <
input_conversations/nemotron-chat.jsonl) but prints
"../input_conversations/nemotron-chat.jsonl". Update the echo statement to
reference the same path used to compute count
(input_conversations/nemotron-chat.jsonl) so the message matches the counted
file; locate the bash cell containing the count assignment and the echo and make
the echo text consistent.

Comment on lines +211 to +225
"cell_type": "code",
"metadata": {},
"source": [
"%%bash\n",
"CKPT_DIR=ckpts/cosmos-reason2-8b-eagle3/checkpoint-110000\n",
"EXPORT_PATH=export/cosmos-reason2-8b-eagle3\n",
"\n",
"python scripts/export_hf_checkpoint.py \\\n",
" --model_path $CKPT_DIR \\\n",
" --export_path $EXPORT_PATH"
],
"execution_count": null,
"outputs": [],
"id": "63880f67"
},

⚠️ Potential issue | 🟠 Major

Export step likely runs in the wrong working directory (missing cd .. / wrong relative script path).

Earlier cells call ../scripts/... from the guides/ directory, and the training cell explicitly does cd .. before invoking ./launch_train.sh. This export cell (Line 218) calls python scripts/export_hf_checkpoint.py without changing directories, so it will likely fail when executed from guides/.

Proposed fix (mirror the training cell’s working directory)
-        "%%bash\n",
+        "%%bash\n",
+        "cd ..\n",
         "CKPT_DIR=ckpts/cosmos-reason2-8b-eagle3/checkpoint-110000\n",
         "EXPORT_PATH=export/cosmos-reason2-8b-eagle3\n",
         "\n",
         "python scripts/export_hf_checkpoint.py \\\n",
         "    --model_path $CKPT_DIR \\\n",
         "    --export_path $EXPORT_PATH"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb`
around lines 211 - 225, The export cell is running the script from the wrong
working directory; update the command that invokes
scripts/export_hf_checkpoint.py so it runs from the repo root like the training
cell does—either cd up one directory before running (mirror the training cell’s
cd .. behavior) or call the script with a relative path to the root (e.g.,
../scripts/export_hf_checkpoint.py); ensure CKPT_DIR and EXPORT_PATH environment
variables remain unchanged and that the invocation in the notebook references
the same script name (export_hf_checkpoint.py) so it finds the script and
checkpoint correctly.

Comment on lines +16 to +41
"""Add Nemotron-Post-Training-Dataset-v2 chat conversations to a conversation dataset.

Dataset: https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2

Note: This dataset requires agreeing to the terms of use on Hugging Face.
Make sure you are logged in with `huggingface-cli login` before running this script.
"""

from __future__ import annotations

import argparse
import itertools
from pathlib import Path

from datasets import load_dataset
from huggingface_hub import hf_hub_url, list_repo_files
from tqdm import tqdm
from utils import (
    dataset_splits_explanation,
    id_for_conversation,
    update_dataset_file_with_conversations,
)

NEMOTRON_DATASET_ID = "nvidia/Nemotron-Post-Training-Dataset-v2"
NEMOTRON_CHAT_SPLIT = "chat"


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Scripts executed (verification against the NVIDIA/Model-Optimizer repository):

# 1) Check repo-stated Python version constraints (pyproject/setup/CI).
rg -nP --glob='!**/.git/**' '(python_requires|Requires-Python|python-version|PYTHON_VERSION)' -S

# 2) Check that examples requirements mention the runtime deps.
fd -e 'requirements*.txt' examples/ | head -20
fd . 'examples/' -name 'requirements*.txt' | head -20
find examples/ -name 'requirements*.txt' -o -name 'pyproject.toml' -o -name 'setup.py'

# 3) Examine packaging metadata and the example script itself.
fd -e 'pyproject.toml' -e 'setup.py' -e 'setup.cfg' | head -10
cat pyproject.toml | head -100
cat -n examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py | head -50
sed -n '43,100p' examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py

# 4) Check example docs and the current requirements content.
find examples/speculative_decoding -name 'README*' -o -name '*.md'
cat examples/speculative_decoding/README.md 2>/dev/null || cat examples/speculative_decoding/prepare_input_conversations/README.md 2>/dev/null || echo "No README found"
cat examples/speculative_decoding/requirements.txt

# 5) Check typing syntax (PEP 604 unions) and sibling scripts' imports.
grep -n 'def \|: .*|' examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py | head -20
ls -la examples/speculative_decoding/prepare_input_conversations/
grep -h "^from\|^import" examples/speculative_decoding/prepare_input_conversations/*.py | sort -u | head -30

Add missing dependencies to examples/speculative_decoding/requirements.txt.

The script imports datasets, huggingface_hub, and tqdm, but these are missing from the example's requirements.txt. While the README.md instructs installing nvidia-modelopt[hf] (which provides some of these), the example's own requirements file should document all direct dependencies. Python 3.10+ requirement is properly documented in pyproject.toml and aligns with the PEP 604 union syntax used here.

Add to examples/speculative_decoding/requirements.txt:

  • datasets>=3.0.0
  • huggingface_hub>=0.24.0
  • tqdm (or verify it's already in base install)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py`
around lines 16 - 41, The example script imports external packages (datasets,
huggingface_hub, tqdm) — update the example's requirements file to list these
direct dependencies so the example is reproducible: add entries for
datasets>=3.0.0 and huggingface_hub>=0.24.0 and ensure tqdm is present (or
explicitly add it) in examples/speculative_decoding/requirements.txt; reference
the imports used in the script such as load_dataset, hf_hub_url, list_repo_files
and tqdm when adding the entries so the requirements match the actual runtime
dependencies.

Comment on lines +115 to +160
def parse_nemotron_conversation(raw_conversations: list) -> list[dict] | None:
    """Parse a Nemotron conversation into a list of messages with standardized roles.

    Args:
        raw_conversations: List of message dicts from the Nemotron dataset.

    Returns:
        List of parsed message dicts with 'role' and 'content' keys, or None if the
        conversation should be skipped.
    """
    msgs = []
    for msg in raw_conversations:
        # Resolve role field (datasets may use "from" or "role")
        raw_role = msg.get("from") or msg.get("role")
        if not isinstance(raw_role, str):
            continue
        role = raw_role.lower()

        # Normalize role names to standard values
        if role in ("human", "user"):
            role = "user"
        elif role in ("gpt", "assistant"):
            role = "assistant"
        elif role == "system":
            # Skip system messages; they are metadata not part of the conversation turns
            continue
        else:
            # Skip unrecognized roles rather than failing, as the dataset may evolve
            continue

        # Resolve content field
        if "value" in msg:
            content = msg["value"]
        elif "content" in msg:
            content = msg["content"]
        elif "text" in msg:
            content = msg["text"]
        else:
            continue

        content = content.strip()
        if content:
            msgs.append({"role": role, "content": content})

    return msgs if msgs else None


⚠️ Potential issue | 🟠 Major

Guard content.strip() against non-string payloads (can crash the whole run).

On Line 155, content.strip() will raise if content is None, a dict/list (common in some chat datasets / multimodal formats), or otherwise non-string. Safer to type-check and skip (or normalize) instead of crashing.

Proposed fix
         # Resolve content field
         if "value" in msg:
             content = msg["value"]
         elif "content" in msg:
             content = msg["content"]
         elif "text" in msg:
             content = msg["text"]
         else:
             continue
 
-        content = content.strip()
+        if not isinstance(content, str):
+            continue
+        content = content.strip()
         if content:
             msgs.append({"role": role, "content": content})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py`
around lines 115 - 160, In parse_nemotron_conversation, calling content.strip()
can raise for non-string payloads (None, dict, list); guard by ensuring content
is a str before stripping—if msg["value"]/["content"]/["text"] yields a non-str,
skip or coerce safely (e.g., if not isinstance(content, str): continue or
convert to str only when appropriate) so msgs.append only receives string
content; update references to the local variable content within
parse_nemotron_conversation to perform this type-check/coercion prior to calling
strip().


codecov bot commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.12%. Comparing base (e8f9687) to head (345b0bb).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #977   +/-   ##
=======================================
  Coverage   72.12%   72.12%           
=======================================
  Files         209      209           
  Lines       23628    23628           
=======================================
  Hits        17042    17042           
  Misses       6586     6586           


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants