doc: example demonstrating how to train CosmosReason2 Eagle3 #977

Open

h-guo18 wants to merge 4 commits into main from haoguo/skierat-cr2-guide

Conversation


@h-guo18 h-guo18 commented Mar 5, 2026

What does this PR do?

Type of change: new example

Original PR: #965

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, using torch.load(..., weights_only=True), avoiding pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other source, did you follow IP policy in CONTRIBUTING.md?: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features
    • Added configuration template for speculative decoding model setups.
    • New utility script for processing and preparing chat conversation datasets.
    • Training guide notebook with step-by-step workflow, calibration instructions, multi-GPU execution, and deployment examples.


copy-pr-bot bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Walkthrough

This pull request introduces a complete speculative decoding training pipeline for the EAGLE3 draft head on Nvidia Cosmos-Reason2-8B. It includes a JSON configuration file defining model parameters, a comprehensive Jupyter notebook documenting the entire training workflow from data preparation through deployment, and a Python script for converting Nemotron dataset conversations into the required training format.

Changes

Cohort / File(s) Summary
  • Configuration — examples/speculative_decoding/guides/CR2_eagle_config.json: JSON configuration specifying draft vocabulary size (32000), attention implementation (flex_attention), and rope scaling parameters (yarn rope type with beta/factor settings).
  • Training Workflow — examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb: Jupyter notebook documenting the end-to-end training pipeline: Nemotron data preparation with format validation (89511 conversations expected), 32000-token vocabulary calibration, FSDP2-enabled multi-GPU training, checkpoint export to Hugging Face format, and vLLM deployment with speculative decoding configuration.
  • Data Preparation — examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py: Python script for ingesting Nemotron-Post-Training-Dataset-v2 from the Hugging Face Hub. Features parquet shard discovery, streaming/mapping-based modes for data loading, conversation role normalization (user/assistant), and dataset persistence via utility functions.
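The configuration entry above names three settings. As a rough sketch of what CR2_eagle_config.json could contain — the draft vocabulary size (32000) and flex_attention come from the summary, but the exact key names and all rope-scaling numbers below are illustrative assumptions, not copied from the PR:

```json
{
  "draft_vocab_size": 32000,
  "attn_implementation": "flex_attention",
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "beta_fast": 32.0,
    "beta_slow": 1.0
  }
}
```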

Sequence Diagram

sequenceDiagram
    participant User
    participant DataPrep as add_nemotron_chat.py
    participant HFHub as Hugging Face Hub
    participant Training as launch_train.sh<br/>(FSDP2)
    participant Export as export_hf_checkpoint.py
    participant vLLM as vLLM Server

    User->>DataPrep: Run with Nemotron dataset config
    DataPrep->>HFHub: Discover & download chat shards
    HFHub-->>DataPrep: Parquet data
    DataPrep->>DataPrep: Parse conversations,<br/>normalize roles
    DataPrep->>DataPrep: Calibrate 32k<br/>draft vocabulary
    DataPrep-->>User: input_conversations.jsonl<br/>+ d2t.pt mapping

    User->>Training: Launch with CR2_eagle_config.json<br/>+ prepared data
    Training->>Training: Multi-GPU training<br/>with FSDP2
    Training-->>User: Trained checkpoint

    User->>Export: Convert checkpoint<br/>to HF format
    Export-->>User: HF-compatible model

    User->>vLLM: Load model + speculative<br/>decoding config
    vLLM-->>User: Ready for inference
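The "Calibrate 32k draft vocabulary" step in the diagram amounts to picking the most frequent target-vocab tokens from calibration text and recording a draft-to-target index mapping (the d2t.pt artifact). A toy sketch, with a hypothetical function name rather than the actual modelopt implementation:

```python
from collections import Counter

def calibrate_draft_vocab(token_ids: list[int], draft_size: int) -> list[int]:
    # Keep the draft_size most frequent target-vocab token ids; the sorted
    # result acts as a d2t mapping: draft index i -> target token id d2t[i].
    counts = Counter(token_ids)
    top = [tok for tok, _ in counts.most_common(draft_size)]
    return sorted(top)

# Target-vocab ids observed while tokenizing calibration text:
d2t = calibrate_draft_vocab([5, 3, 5, 7, 3, 5, 9], draft_size=2)
print(d2t)  # [3, 5]: ids 5 and 3 occur most often
```

In the real pipeline the mapping is saved as a tensor (d2t.pt) and used to translate draft-head predictions back into the target model's vocabulary.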

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 75.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3)
  • Description Check ✅ — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ — The title clearly summarizes the main contribution: adding an example/guide demonstrating how to train CosmosReason2 with Eagle3 speculative decoding, which aligns with the primary changes (new notebook, configuration, and training script).
  • Security Anti-Patterns ✅ — Comprehensive scan found no security anti-patterns (torch.load, numpy.load, trust_remote_code, eval/exec, or nosec comments) in the repository.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch haoguo/skierat-cr2-guide

Comment @coderabbitai help to get the list of available commands and usage tips.

skierat added 4 commits March 5, 2026 00:39
Signed-off-by: Slawek Kierat <skierat@nvidia.com>
Signed-off-by: Slawek Kierat <skierat@nvidia.com>
Signed-off-by: Slawek Kierat <skierat@nvidia.com>
Signed-off-by: Slawek Kierat <skierat@nvidia.com>
@h-guo18 h-guo18 force-pushed the haoguo/skierat-cr2-guide branch from 1f8944f to 345b0bb (March 5, 2026 00:40)

copy-pr-bot bot commented Mar 5, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@h-guo18 h-guo18 marked this pull request as ready for review March 5, 2026 00:41
@h-guo18 h-guo18 requested a review from a team as a code owner March 5, 2026 00:41
@h-guo18 h-guo18 requested a review from yeyu-nvidia March 5, 2026 00:41
@h-guo18 h-guo18 changed the title from "Haoguo/skierat cr2 guide" to "doc: example demonstrating how to train CosmosReason2 Eagle3" Mar 5, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb (2)

84-94: Consider making the output paths explicit in the data-prep command.

Right now the notebook relies on add_nemotron_chat.py defaults for --output-dir / --output-split-name. For a guide, being explicit helps users avoid “where did my file go?” issues when their working directory differs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb`
around lines 84 - 94, The notebook cell calling add_nemotron_chat.py relies on
defaults for output locations; update the call to include explicit output flags
so users know where files will be written. Modify the bash cell that runs python
../prepare_input_conversations/add_nemotron_chat.py to add --output-dir with a
clear path (e.g., ./data/nemotron) and --output-split-name (e.g., train.jsonl)
so the script invocation explicitly sets output locations; keep the script name
add_nemotron_chat.py and the existing --mapping-file flag unchanged.

36-46: Clarify whether the notebook is meant to run against the repo checkout or the PyPI wheel (reproducibility).

On Line 40-41, pip install -U nvidia-modelopt[hf] plus pip install -r ../requirements.txt can lead to version skew (installed wheel vs checked-out scripts) and non-reproducible environments. Consider explicitly stating the intended mode, and optionally pinning a known-good nvidia-modelopt version for this guide.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb`
around lines 36 - 46, The notebook currently runs "pip install -U
nvidia-modelopt[hf]" and "pip install -r ../requirements.txt", which can produce
version skew between an installed PyPI wheel and the local repo; update the
notebook to explicitly state whether the guide targets the checked-out repo or
the PyPI wheel, and change the install cell to one of two clear options: (A) for
repo development, install local editable packages (e.g., instructions to pip
install -e . from the repo) and skip the PyPI wheel, or (B) for reproducible
runs, pin a specific known-good nvidia-modelopt version (replace "pip install -U
nvidia-modelopt[hf]" with a pinned version) before running "pip install -r
../requirements.txt"; reference the exact commands "pip install -U
nvidia-modelopt[hf]" and "pip install -r ../requirements.txt" in the notebook
text so readers know which mode to follow.
examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py (2)

162-236: Ensure --output-dir exists; also consider the memory profile in streaming mode.

  • If args.output_dir doesn’t exist, update_dataset_file_with_conversations(...) may fail depending on how utils is implemented. Creating it here makes the script more robust.
  • In streaming mode with --max-samples omitted (Line 206), the code can end up accumulating the entire split in input_conversations (Line 209), which may be very large.
Proposed robustness fix (directory creation)
 async def main(args: argparse.Namespace) -> None:
+    args.output_dir.mkdir(parents=True, exist_ok=True)
     # Resolve Hub download URLs for chat parquet shards only.
     # Using load_dataset("parquet", data_files=urls) bypasses the repo's split
     # metadata validation, so only the chat shards are ever downloaded.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py`
around lines 162 - 236, The script can fail if args.output_dir doesn't exist and
can OOM in streaming mode since input_conversations accumulates the whole split;
before calling update_dataset_file_with_conversations in main, ensure
args.output_dir is created (mkdir parents=True, exist_ok=True) and in the
streaming branch (where load_dataset(..., streaming=True) / stream and iterable
are used) avoid building input_conversations for unbounded runs — either
require/validate args.max_samples is set or switch to incremental writing:
process samples in batches and call update_dataset_file_with_conversations (or a
new append-mode helper) per-batch instead of collecting all entries in
input_conversations.
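The incremental-writing alternative described above can be sketched as follows (append_conversations_jsonl and write_in_batches are hypothetical helpers for illustration, not the existing utils API):

```python
import json
from itertools import islice
from pathlib import Path

def append_conversations_jsonl(path: Path, batch: list[dict]) -> None:
    # Hypothetical append-mode helper: one JSON object per line.
    with path.open("a", encoding="utf-8") as f:
        for conv in batch:
            f.write(json.dumps(conv) + "\n")

def write_in_batches(samples, path: Path, batch_size: int = 1000) -> int:
    # Stream samples to disk batch by batch instead of holding the whole
    # split in memory; also create the output directory up front.
    path.parent.mkdir(parents=True, exist_ok=True)
    it = iter(samples)
    written = 0
    while batch := list(islice(it, batch_size)):
        append_conversations_jsonl(path, batch)
        written += len(batch)
    return written
```

This keeps peak memory bounded by batch_size even when --max-samples is omitted in streaming mode, and the mkdir call addresses the missing-output-dir failure in the same pass.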

43-66: Consider truncating “available parquet files” in the error message.

On Line 61-63, the ValueError embeds a full list of parquet files. If the dataset grows, this can create extremely noisy exceptions (and logs).

Proposed tweak (limit output, keep it actionable)
     if not split_files:
+        available_parquet_files = [f for f in all_files if f.endswith(".parquet")]
+        preview = available_parquet_files[:25]
         err_msg = (
             f"No parquet files found for split '{split_name}' in {dataset_id}. "
-            f"Available files: {[f for f in all_files if f.endswith('.parquet')]}"
+            f"Available parquet files (showing {len(preview)}/{len(available_parquet_files)}): {preview}"
         )
         raise ValueError(err_msg)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py`
around lines 43 - 66, In get_split_parquet_urls, the ValueError currently embeds
the full list of available parquet files which can be huge; change the error
message construction in the ValueError raised in get_split_parquet_urls to
truncate the printed list (e.g., show the first N entries like 5, then "... (X
more)"), or if total <= N show them all, so the message stays actionable but
bounded. Keep the same error context (split_name and dataset_id) and include the
total count of parquet files when truncating to aid debugging.
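For context, the shard-discovery filter being discussed reduces to something like this (the data/<split>/ path layout is an assumption about the dataset repo, and the function is a simplified stand-in for get_split_parquet_urls):

```python
def filter_split_parquet_files(all_files: list[str], split_name: str) -> list[str]:
    # Keep only parquet shards that belong to the requested split.
    return [
        f for f in all_files
        if f.endswith(".parquet") and f"/{split_name}/" in f
    ]

repo_files = [
    "data/chat/part-0.parquet",
    "data/chat/part-1.parquet",
    "data/code/part-0.parquet",
    "README.md",
]
print(filter_split_parquet_files(repo_files, "chat"))
# ['data/chat/part-0.parquet', 'data/chat/part-1.parquet']
```

When this filter returns an empty list, the truncated error message proposed above is what the user would see.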

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2bf1ef45-2895-41c3-8ae2-94e18ce66eeb

📥 Commits

Reviewing files that changed from the base of the PR and between e8f9687 and 345b0bb.

⛔ Files ignored due to path filters (1)
  • examples/speculative_decoding/guides/nemotron_mapping.csv is excluded by !**/*.csv
📒 Files selected for processing (3)
  • examples/speculative_decoding/guides/CR2_eagle_config.json
  • examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb
  • examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py

Comment on lines +96 to +108
"cell_type": "code",
"metadata": {},
"source": [
"%%bash\n",
"# Expect exactly 89511 conversations.\n",
"count=$(wc -l < input_conversations/nemotron-chat.jsonl)\n",
"echo \"${count} conversations in ../input_conversations/nemotron-chat.jsonl\"\n",
"[ \"$count\" -eq 89511 ] || { echo \"ERROR: expected 89511, got ${count}\"; exit 1; }"
],
"execution_count": null,
"outputs": [],
"id": "d05b97d3"
},

⚠️ Potential issue | 🟡 Minor

Fix the echoed file path (currently inconsistent with what’s actually counted).

On Line 102, the cell counts input_conversations/nemotron-chat.jsonl but echoes ../input_conversations/nemotron-chat.jsonl.

Proposed fix
-        "echo \"${count} conversations in ../input_conversations/nemotron-chat.jsonl\"\n",
+        "echo \"${count} conversations in input_conversations/nemotron-chat.jsonl\"\n",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb`
around lines 96 - 108, The echoed path is inconsistent with the file actually
being counted: the bash cell computes count with count=$(wc -l <
input_conversations/nemotron-chat.jsonl) but prints
"../input_conversations/nemotron-chat.jsonl". Update the echo statement to
reference the same path used to compute count
(input_conversations/nemotron-chat.jsonl) so the message matches the counted
file; locate the bash cell containing the count assignment and the echo and make
the echo text consistent.

Comment on lines +211 to +225
"cell_type": "code",
"metadata": {},
"source": [
"%%bash\n",
"CKPT_DIR=ckpts/cosmos-reason2-8b-eagle3/checkpoint-110000\n",
"EXPORT_PATH=export/cosmos-reason2-8b-eagle3\n",
"\n",
"python scripts/export_hf_checkpoint.py \\\n",
" --model_path $CKPT_DIR \\\n",
" --export_path $EXPORT_PATH"
],
"execution_count": null,
"outputs": [],
"id": "63880f67"
},

⚠️ Potential issue | 🟠 Major

Export step likely runs in the wrong working directory (missing cd .. / wrong relative script path).

Earlier cells call ../scripts/... from the guides/ directory, and the training cell explicitly does cd .. before invoking ./launch_train.sh. This export cell (Line 218) calls python scripts/export_hf_checkpoint.py without changing directories, so it will likely fail when executed from guides/.

Proposed fix (mirror the training cell’s working directory)
-        "%%bash\n",
+        "%%bash\n",
+        "cd ..\n",
         "CKPT_DIR=ckpts/cosmos-reason2-8b-eagle3/checkpoint-110000\n",
         "EXPORT_PATH=export/cosmos-reason2-8b-eagle3\n",
         "\n",
         "python scripts/export_hf_checkpoint.py \\\n",
         "    --model_path $CKPT_DIR \\\n",
         "    --export_path $EXPORT_PATH"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb`
around lines 211 - 225, The export cell is running the script from the wrong
working directory; update the command that invokes
scripts/export_hf_checkpoint.py so it runs from the repo root like the training
cell does—either cd up one directory before running (mirror the training cell’s
cd .. behavior) or call the script with a relative path to the root (e.g.,
../scripts/export_hf_checkpoint.py); ensure CKPT_DIR and EXPORT_PATH environment
variables remain unchanged and that the invocation in the notebook references
the same script name (export_hf_checkpoint.py) so it finds the script and
checkpoint correctly.

Comment on lines +16 to +41
"""Add Nemotron-Post-Training-Dataset-v2 chat conversations to a conversation dataset.

Dataset: https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2

Note: This dataset requires agreeing to the terms of use on Hugging Face.
Make sure you are logged in with `huggingface-cli login` before running this script.
"""

from __future__ import annotations

import argparse
import itertools
from pathlib import Path

from datasets import load_dataset
from huggingface_hub import hf_hub_url, list_repo_files
from tqdm import tqdm
from utils import (
    dataset_splits_explanation,
    id_for_conversation,
    update_dataset_file_with_conversations,
)

NEMOTRON_DATASET_ID = "nvidia/Nemotron-Post-Training-Dataset-v2"
NEMOTRON_CHAT_SPLIT = "chat"


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Scripts executed (verification against the NVIDIA/Model-Optimizer repository):

# 1) Check repo-stated Python version constraints (pyproject/setup/CI).
rg -nP --glob='!**/.git/**' '(python_requires|Requires-Python|python-version|PYTHON_VERSION)' -S

# 2) Check that examples requirements mention the runtime deps.
fd -e 'requirements*.txt' examples/ | head -20
fd . 'examples/' -name 'requirements*.txt' | head -20
find examples/ -name 'requirements*.txt' -o -name 'pyproject.toml' -o -name 'setup.py'

# 3) Examine packaging metadata and the example script itself.
fd -e 'pyproject.toml' -e 'setup.py' -e 'setup.cfg' | head -10
cat pyproject.toml | head -100
cat -n examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py | head -50
sed -n '43,100p' examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py

# 4) Check example docs and the current requirements content.
find examples/speculative_decoding -name 'README*' -o -name '*.md'
cat examples/speculative_decoding/README.md 2>/dev/null || cat examples/speculative_decoding/prepare_input_conversations/README.md 2>/dev/null || echo "No README found"
cat examples/speculative_decoding/requirements.txt

# 5) Check typing syntax (PEP 604 unions) and sibling scripts' imports.
grep -n 'def \|: .*|' examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py | head -20
ls -la examples/speculative_decoding/prepare_input_conversations/
grep -h "^from\|^import" examples/speculative_decoding/prepare_input_conversations/*.py | sort -u | head -30

Add missing dependencies to examples/speculative_decoding/requirements.txt.

The script imports datasets, huggingface_hub, and tqdm, but these are missing from the example's requirements.txt. While the README.md instructs installing nvidia-modelopt[hf] (which provides some of these), the example's own requirements file should document all direct dependencies. Python 3.10+ requirement is properly documented in pyproject.toml and aligns with the PEP 604 union syntax used here.

Add to examples/speculative_decoding/requirements.txt:

  • datasets>=3.0.0
  • huggingface_hub>=0.24.0
  • tqdm (or verify it's already in base install)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py`
around lines 16 - 41, The example script imports external packages (datasets,
huggingface_hub, tqdm) — update the example's requirements file to list these
direct dependencies so the example is reproducible: add entries for
datasets>=3.0.0 and huggingface_hub>=0.24.0 and ensure tqdm is present (or
explicitly add it) in examples/speculative_decoding/requirements.txt; reference
the imports used in the script such as load_dataset, hf_hub_url, list_repo_files
and tqdm when adding the entries so the requirements match the actual runtime
dependencies.

Comment on lines +115 to +160
def parse_nemotron_conversation(raw_conversations: list) -> list[dict] | None:
    """Parse a Nemotron conversation into a list of messages with standardized roles.

    Args:
        raw_conversations: List of message dicts from the Nemotron dataset.

    Returns:
        List of parsed message dicts with 'role' and 'content' keys, or None if the
        conversation should be skipped.
    """
    msgs = []
    for msg in raw_conversations:
        # Resolve role field (datasets may use "from" or "role")
        raw_role = msg.get("from") or msg.get("role")
        if not isinstance(raw_role, str):
            continue
        role = raw_role.lower()

        # Normalize role names to standard values
        if role in ("human", "user"):
            role = "user"
        elif role in ("gpt", "assistant"):
            role = "assistant"
        elif role == "system":
            # Skip system messages; they are metadata not part of the conversation turns
            continue
        else:
            # Skip unrecognized roles rather than failing, as the dataset may evolve
            continue

        # Resolve content field
        if "value" in msg:
            content = msg["value"]
        elif "content" in msg:
            content = msg["content"]
        elif "text" in msg:
            content = msg["text"]
        else:
            continue

        content = content.strip()
        if content:
            msgs.append({"role": role, "content": content})

    return msgs if msgs else None


⚠️ Potential issue | 🟠 Major

Guard content.strip() against non-string payloads (can crash the whole run).

On Line 155, content.strip() will raise if content is None, a dict/list (common in some chat datasets / multimodal formats), or otherwise non-string. Safer to type-check and skip (or normalize) instead of crashing.

Proposed fix
         # Resolve content field
         if "value" in msg:
             content = msg["value"]
         elif "content" in msg:
             content = msg["content"]
         elif "text" in msg:
             content = msg["text"]
         else:
             continue
 
-        content = content.strip()
+        if not isinstance(content, str):
+            continue
+        content = content.strip()
         if content:
             msgs.append({"role": role, "content": content})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py`
around lines 115 - 160, In parse_nemotron_conversation, calling content.strip()
can raise for non-string payloads (None, dict, list); guard by ensuring content
is a str before stripping—if msg["value"]/["content"]/["text"] yields a non-str,
skip or coerce safely (e.g., if not isinstance(content, str): continue or
convert to str only when appropriate) so msgs.append only receives string
content; update references to the local variable content within
parse_nemotron_conversation to perform this type-check/coercion prior to calling
strip().


codecov bot commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.12%. Comparing base (e8f9687) to head (345b0bb).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #977   +/-   ##
=======================================
  Coverage   72.12%   72.12%           
=======================================
  Files         209      209           
  Lines       23628    23628           
=======================================
  Hits        17042    17042           
  Misses       6586     6586           


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants