doc: example demonstrating how to train CosmosReason2 Eagle3 (#977)
Conversation
📝 Walkthrough

This pull request introduces a complete speculative decoding training pipeline for the EAGLE3 draft head on NVIDIA Cosmos-Reason2-8B. It includes a JSON configuration file defining model parameters, a comprehensive Jupyter notebook documenting the entire training workflow from data preparation through deployment, and a Python script for converting Nemotron dataset conversations into the required training format.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant DataPrep as add_nemotron_chat.py
    participant HFHub as Hugging Face Hub
    participant Training as launch_train.sh (FSDP2)
    participant Export as export_hf_checkpoint.py
    participant vLLM as vLLM Server
    User->>DataPrep: Run with Nemotron dataset config
    DataPrep->>HFHub: Discover & download chat shards
    HFHub-->>DataPrep: Parquet data
    DataPrep->>DataPrep: Parse conversations, normalize roles
    DataPrep->>DataPrep: Calibrate 32k draft vocabulary
    DataPrep-->>User: input_conversations.jsonl + d2t.pt mapping
    User->>Training: Launch with CR2_eagle_config.json + prepared data
    Training->>Training: Multi-GPU training with FSDP2
    Training-->>User: Trained checkpoint
    User->>Export: Convert checkpoint to HF format
    Export-->>User: HF-compatible model
    User->>vLLM: Load model + speculative decoding config
    vLLM-->>User: Ready for inference
```
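The "Calibrate 32k draft vocabulary" step in the diagram produces the d2t.pt draft-to-target mapping. Below is a minimal sketch of one plausible frequency-based calibration; the function name and the selection rule are illustrative assumptions, not the actual `add_nemotron_chat.py` logic, which may weight or select tokens differently.

```python
from collections import Counter


def calibrate_draft_vocab(token_ids: list[int], draft_vocab_size: int) -> list[int]:
    """Illustrative d2t calibration: keep the most frequent target-vocab
    token ids, so that d2t[i] is the target token id for draft index i.

    In the real pipeline the resulting mapping is stored as a tensor
    (d2t.pt); here a plain sorted list stands in for it.
    """
    counts = Counter(token_ids)
    kept = [tok for tok, _ in counts.most_common(draft_vocab_size)]
    # Sort so draft indices map to target ids deterministically.
    return sorted(kept)
```

For example, a toy corpus `[7, 7, 7, 3, 3, 5, 9]` with a draft vocabulary of 2 keeps target ids 3 and 7.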
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Signed-off-by: Slawek Kierat <skierat@nvidia.com>
Force-pushed 1f8944f to 345b0bb.
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
Actionable comments posted: 4
🧹 Nitpick comments (4)
examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb (2)
84-94: Consider making the output paths explicit in the data-prep command.

Right now the notebook relies on `add_nemotron_chat.py` defaults for `--output-dir`/`--output-split-name`. For a guide, being explicit helps users avoid "where did my file go?" issues when their working directory differs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb` around lines 84 - 94, The notebook cell calling add_nemotron_chat.py relies on defaults for output locations; update the call to include explicit output flags so users know where files will be written. Modify the bash cell that runs python ../prepare_input_conversations/add_nemotron_chat.py to add --output-dir with a clear path (e.g., ./data/nemotron) and --output-split-name (e.g., train.jsonl) so the script invocation explicitly sets output locations; keep the script name add_nemotron_chat.py and the existing --mapping-file flag unchanged.
36-46: Clarify whether the notebook is meant to run against the repo checkout or the PyPI wheel (reproducibility).

On lines 40-41, `pip install -U nvidia-modelopt[hf]` plus `pip install -r ../requirements.txt` can lead to version skew (installed wheel vs checked-out scripts) and non-reproducible environments. Consider explicitly stating the intended mode, and optionally pinning a known-good `nvidia-modelopt` version for this guide.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb` around lines 36 - 46, The notebook currently runs "pip install -U nvidia-modelopt[hf]" and "pip install -r ../requirements.txt", which can produce version skew between an installed PyPI wheel and the local repo; update the notebook to explicitly state whether the guide targets the checked-out repo or the PyPI wheel, and change the install cell to one of two clear options: (A) for repo development, install local editable packages (e.g., instructions to pip install -e . from the repo) and skip the PyPI wheel, or (B) for reproducible runs, pin a specific known-good nvidia-modelopt version (replace "pip install -U nvidia-modelopt[hf]" with a pinned version) before running "pip install -r ../requirements.txt"; reference the exact commands "pip install -U nvidia-modelopt[hf]" and "pip install -r ../requirements.txt" in the notebook text so readers know which mode to follow.

examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py (2)
162-236: Ensure `--output-dir` exists; also consider the memory profile in streaming mode.

- If `args.output_dir` doesn't exist, `update_dataset_file_with_conversations(...)` may fail depending on how `utils` is implemented. Creating it here makes the script more robust.
- In streaming mode with `--max-samples` omitted (line 206), the code can end up accumulating the entire split in `input_conversations` (line 209), which may be very large.

Proposed robustness fix (directory creation):

```diff
 async def main(args: argparse.Namespace) -> None:
+    args.output_dir.mkdir(parents=True, exist_ok=True)
     # Resolve Hub download URLs for chat parquet shards only.
     # Using load_dataset("parquet", data_files=urls) bypasses the repo's split
     # metadata validation, so only the chat shards are ever downloaded.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py` around lines 162 - 236, The script can fail if args.output_dir doesn't exist and can OOM in streaming mode since input_conversations accumulates the whole split; before calling update_dataset_file_with_conversations in main, ensure args.output_dir is created (mkdir parents=True, exist_ok=True) and in the streaming branch (where load_dataset(..., streaming=True) / stream and iterable are used) avoid building input_conversations for unbounded runs — either require/validate args.max_samples is set or switch to incremental writing: process samples in batches and call update_dataset_file_with_conversations (or a new append-mode helper) per-batch instead of collecting all entries in input_conversations.
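The incremental-writing suggestion above can be sketched concretely. The helper names below (`batched`, `write_conversations_incrementally`) are hypothetical stand-ins, not the script's actual `update_dataset_file_with_conversations` API; the point is the shape of the fix: bound memory to one batch and append as you go.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator


def batched(iterable: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield successive fixed-size batches from any iterable (including a
    streaming dataset), so at most one batch is held in memory at a time."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch


def write_conversations_incrementally(
    samples: Iterable[dict], output_path: Path, batch_size: int = 1000
) -> int:
    """Append conversations batch-by-batch instead of accumulating the whole
    split in a list; also creates the output directory up front."""
    output_path.parent.mkdir(parents=True, exist_ok=True)  # robustness fix
    written = 0
    with output_path.open("a", encoding="utf-8") as f:
        for batch in batched(samples, batch_size):
            for conv in batch:
                f.write(json.dumps(conv) + "\n")
            written += len(batch)
    return written
```

With a generator as input, this processes an unbounded streaming split at constant memory, which is exactly the concern the review raises for runs without `--max-samples`.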
43-66: Consider truncating "available parquet files" in the error message.

On lines 61-63, the `ValueError` embeds a full list of parquet files. If the dataset grows, this can create extremely noisy exceptions (and logs).

Proposed tweak (limit output, keep it actionable):

```diff
 if not split_files:
+    available_parquet_files = [f for f in all_files if f.endswith(".parquet")]
+    preview = available_parquet_files[:25]
     err_msg = (
         f"No parquet files found for split '{split_name}' in {dataset_id}. "
-        f"Available files: {[f for f in all_files if f.endswith('.parquet')]}"
+        f"Available parquet files (showing {len(preview)}/{len(available_parquet_files)}): {preview}"
     )
     raise ValueError(err_msg)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py` around lines 43 - 66, In get_split_parquet_urls, the ValueError currently embeds the full list of available parquet files which can be huge; change the error message construction in the ValueError raised in get_split_parquet_urls to truncate the printed list (e.g., show the first N entries like 5, then "... (X more)"), or if total <= N show them all, so the message stays actionable but bounded. Keep the same error context (split_name and dataset_id) and include the total count of parquet files when truncating to aid debugging.
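As a standalone illustration of the bounded-preview pattern suggested above (the helper name and default limit are assumptions, not part of the script):

```python
def preview_items(items: list[str], limit: int = 5) -> str:
    """Render a bounded, still-actionable preview of a possibly huge list:
    show everything when short, otherwise the first `limit` entries plus
    counts of what was omitted."""
    if len(items) <= limit:
        return str(items)
    shown = items[:limit]
    return f"{shown} ... ({len(items) - limit} more, {len(items)} total)"
```

Embedding `preview_items(available_parquet_files)` in the exception message keeps logs readable no matter how many shards the dataset grows to.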
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb`:
- Around line 211-225: The export cell is running the script from the wrong
working directory; update the command that invokes
scripts/export_hf_checkpoint.py so it runs from the repo root like the training
cell does—either cd up one directory before running (mirror the training cell’s
cd .. behavior) or call the script with a relative path to the root (e.g.,
../scripts/export_hf_checkpoint.py); ensure CKPT_DIR and EXPORT_PATH environment
variables remain unchanged and that the invocation in the notebook references
the same script name (export_hf_checkpoint.py) so it finds the script and
checkpoint correctly.
- Around line 96-108: The echoed path is inconsistent with the file actually
being counted: the bash cell computes count with count=$(wc -l <
input_conversations/nemotron-chat.jsonl) but prints
"../input_conversations/nemotron-chat.jsonl". Update the echo statement to
reference the same path used to compute count
(input_conversations/nemotron-chat.jsonl) so the message matches the counted
file; locate the bash cell containing the count assignment and the echo and make
the echo text consistent.
In
`@examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py`:
- Around line 115-160: In parse_nemotron_conversation, calling content.strip()
can raise for non-string payloads (None, dict, list); guard by ensuring content
is a str before stripping—if msg["value"]/["content"]/["text"] yields a non-str,
skip or coerce safely (e.g., if not isinstance(content, str): continue or
convert to str only when appropriate) so msgs.append only receives string
content; update references to the local variable content within
parse_nemotron_conversation to perform this type-check/coercion prior to calling
strip().
- Around line 16-41: The example script imports external packages (datasets,
huggingface_hub, tqdm) — update the example's requirements file to list these
direct dependencies so the example is reproducible: add entries for
datasets>=3.0.0 and huggingface_hub>=0.24.0 and ensure tqdm is present (or
explicitly add it) in examples/speculative_decoding/requirements.txt; reference
the imports used in the script such as load_dataset, hf_hub_url, list_repo_files
and tqdm when adding the entries so the requirements match the actual runtime
dependencies.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2bf1ef45-2895-41c3-8ae2-94e18ce66eeb
⛔ Files ignored due to path filters (1)
examples/speculative_decoding/guides/nemotron_mapping.csv is excluded by !**/*.csv
📒 Files selected for processing (3)
- examples/speculative_decoding/guides/CR2_eagle_config.json
- examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb
- examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py
```json
  "cell_type": "code",
  "metadata": {},
  "source": [
    "%%bash\n",
    "# Expect exactly 89511 conversations.\n",
    "count=$(wc -l < input_conversations/nemotron-chat.jsonl)\n",
    "echo \"${count} conversations in ../input_conversations/nemotron-chat.jsonl\"\n",
    "[ \"$count\" -eq 89511 ] || { echo \"ERROR: expected 89511, got ${count}\"; exit 1; }"
  ],
  "execution_count": null,
  "outputs": [],
  "id": "d05b97d3"
},
```
Fix the echoed file path (currently inconsistent with what’s actually counted).
On Line 102, the cell counts input_conversations/nemotron-chat.jsonl but echoes ../input_conversations/nemotron-chat.jsonl.
Proposed fix:

```diff
-    "echo \"${count} conversations in ../input_conversations/nemotron-chat.jsonl\"\n",
+    "echo \"${count} conversations in input_conversations/nemotron-chat.jsonl\"\n",
```
```json
  "cell_type": "code",
  "metadata": {},
  "source": [
    "%%bash\n",
    "CKPT_DIR=ckpts/cosmos-reason2-8b-eagle3/checkpoint-110000\n",
    "EXPORT_PATH=export/cosmos-reason2-8b-eagle3\n",
    "\n",
    "python scripts/export_hf_checkpoint.py \\\n",
    "    --model_path $CKPT_DIR \\\n",
    "    --export_path $EXPORT_PATH"
  ],
  "execution_count": null,
  "outputs": [],
  "id": "63880f67"
},
```
Export step likely runs in the wrong working directory (missing cd .. / wrong relative script path).
Earlier cells call ../scripts/... from the guides/ directory, and the training cell explicitly does cd .. before invoking ./launch_train.sh. This export cell (Line 218) calls python scripts/export_hf_checkpoint.py without changing directories, so it will likely fail when executed from guides/.
Proposed fix (mirror the training cell's working directory):

```diff
-    "%%bash\n",
+    "%%bash\n",
+    "cd ..\n",
     "CKPT_DIR=ckpts/cosmos-reason2-8b-eagle3/checkpoint-110000\n",
     "EXPORT_PATH=export/cosmos-reason2-8b-eagle3\n",
     "\n",
     "python scripts/export_hf_checkpoint.py \\\n",
     "    --model_path $CKPT_DIR \\\n",
     "    --export_path $EXPORT_PATH"
```
" --export_path $EXPORT_PATH"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "%%bash\n", | |
| "CKPT_DIR=ckpts/cosmos-reason2-8b-eagle3/checkpoint-110000\n", | |
| "EXPORT_PATH=export/cosmos-reason2-8b-eagle3\n", | |
| "\n", | |
| "python scripts/export_hf_checkpoint.py \\\n", | |
| " --model_path $CKPT_DIR \\\n", | |
| " --export_path $EXPORT_PATH" | |
| ], | |
| "execution_count": null, | |
| "outputs": [], | |
| "id": "63880f67" | |
| }, | |
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "%%bash\n", | |
| "cd ..\n", | |
| "CKPT_DIR=ckpts/cosmos-reason2-8b-eagle3/checkpoint-110000\n", | |
| "EXPORT_PATH=export/cosmos-reason2-8b-eagle3\n", | |
| "\n", | |
| "python scripts/export_hf_checkpoint.py \\\n", | |
| " --model_path $CKPT_DIR \\\n", | |
| " --export_path $EXPORT_PATH" | |
| ], | |
| "execution_count": null, | |
| "outputs": [], | |
| "id": "63880f67" | |
| }, |
```python
"""Add Nemotron-Post-Training-Dataset-v2 chat conversations to a conversation dataset.

Dataset: https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2

Note: This dataset requires agreeing to the terms of use on Hugging Face.
Make sure you are logged in with `huggingface-cli login` before running this script.
"""

from __future__ import annotations

import argparse
import itertools
from pathlib import Path

from datasets import load_dataset
from huggingface_hub import hf_hub_url, list_repo_files
from tqdm import tqdm
from utils import (
    dataset_splits_explanation,
    id_for_conversation,
    update_dataset_file_with_conversations,
)

NEMOTRON_DATASET_ID = "nvidia/Nemotron-Post-Training-Dataset-v2"
NEMOTRON_CHAT_SPLIT = "chat"
```
🧩 Analysis chain

🏁 Script executed:

```shell
# 1) Check repo-stated Python version constraints (pyproject/setup/CI).
rg -nP --glob='!**/.git/**' '(python_requires|Requires-Python|python-version|PYTHON_VERSION)' -S
```

Repository: NVIDIA/Model-Optimizer
Length of output: 323

🏁 Script executed:

```shell
# 2) Check that examples requirements mention the runtime deps.
fd -e 'requirements*.txt' examples/ | head -20
```

Repository: NVIDIA/Model-Optimizer
Length of output: 407

🏁 Script executed:

```shell
# 3) Find and examine pyproject.toml or setup.py files
fd -e 'pyproject.toml' -e 'setup.py' -e 'setup.cfg' | head -10
```

Repository: NVIDIA/Model-Optimizer
Length of output: 48

🏁 Script executed:

```shell
# Correct fd search for requirements files in examples directory
fd . 'examples/' -name 'requirements*.txt' | head -20
```

Repository: NVIDIA/Model-Optimizer
Length of output: 296

🏁 Script executed:

```shell
# Check for pyproject.toml at repo root
cat pyproject.toml | head -100
```

Repository: NVIDIA/Model-Optimizer
Length of output: 3889

🏁 Script executed:

```shell
# Examine the specific example file for PEP 604 syntax
cat -n examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py | head -50
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2371

🏁 Script executed:

```shell
# Check if examples directory or subdirectories have requirements files
find examples/ -name 'requirements*.txt' -o -name 'pyproject.toml' -o -name 'setup.py'
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1182

🏁 Script executed:

```shell
# Check for README or documentation in examples/speculative_decoding
find examples/speculative_decoding -name 'README*' -o -name '*.md'
```

Repository: NVIDIA/Model-Optimizer
Length of output: 157

🏁 Script executed:

```shell
# Look for any setup/install instructions in the example directory
cat examples/speculative_decoding/README.md 2>/dev/null || cat examples/speculative_decoding/prepare_input_conversations/README.md 2>/dev/null || echo "No README found"
```

Repository: NVIDIA/Model-Optimizer
Length of output: 14158

🏁 Script executed:

```shell
# Check the speculative_decoding requirements.txt
cat examples/speculative_decoding/requirements.txt
```

Repository: NVIDIA/Model-Optimizer
Length of output: 107

🏁 Script executed:

```shell
# Check if the example file uses PEP 604 union syntax (|) or type hints
grep -n 'def \|: .*|' examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py | head -20
```

Repository: NVIDIA/Model-Optimizer
Length of output: 323

🏁 Script executed:

```shell
# Check full function signatures in the example file to see typing syntax used
sed -n '43,100p' examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2254

🏁 Script executed:

```shell
# Check what other scripts are in the same directory and their imports
ls -la examples/speculative_decoding/prepare_input_conversations/
```

Repository: NVIDIA/Model-Optimizer
Length of output: 713

🏁 Script executed:

```shell
# Check if other scripts in the same directory have similar dependencies
grep -h "^from\|^import" examples/speculative_decoding/prepare_input_conversations/*.py | sort -u | head -30
```

Repository: NVIDIA/Model-Optimizer
Length of output: 346
Add missing dependencies to examples/speculative_decoding/requirements.txt.
The script imports datasets, huggingface_hub, and tqdm, but these are missing from the example's requirements.txt. While the README.md instructs installing nvidia-modelopt[hf] (which provides some of these), the example's own requirements file should document all direct dependencies. Python 3.10+ requirement is properly documented in pyproject.toml and aligns with the PEP 604 union syntax used here.
Add to examples/speculative_decoding/requirements.txt:
- `datasets>=3.0.0`
- `huggingface_hub>=0.24.0`
- `tqdm` (or verify it's already in the base install)
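With the reviewer's suggested floors, the resulting requirements.txt additions would look like the fragment below (the version pins are the review's suggestion, not verified minimum-compatible versions):

```text
datasets>=3.0.0
huggingface_hub>=0.24.0
tqdm
```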
```python
def parse_nemotron_conversation(raw_conversations: list) -> list[dict] | None:
    """Parse a Nemotron conversation into a list of messages with standardized roles.

    Args:
        raw_conversations: List of message dicts from the Nemotron dataset.

    Returns:
        List of parsed message dicts with 'role' and 'content' keys, or None if the
        conversation should be skipped.
    """
    msgs = []
    for msg in raw_conversations:
        # Resolve role field (datasets may use "from" or "role")
        raw_role = msg.get("from") or msg.get("role")
        if not isinstance(raw_role, str):
            continue
        role = raw_role.lower()

        # Normalize role names to standard values
        if role in ("human", "user"):
            role = "user"
        elif role in ("gpt", "assistant"):
            role = "assistant"
        elif role == "system":
            # Skip system messages; they are metadata not part of the conversation turns
            continue
        else:
            # Skip unrecognized roles rather than failing, as the dataset may evolve
            continue

        # Resolve content field
        if "value" in msg:
            content = msg["value"]
        elif "content" in msg:
            content = msg["content"]
        elif "text" in msg:
            content = msg["text"]
        else:
            continue

        content = content.strip()
        if content:
            msgs.append({"role": role, "content": content})

    return msgs if msgs else None
```
Guard content.strip() against non-string payloads (can crash the whole run).
On line 155, `content.strip()` will raise if `content` is None, a dict/list (common in some chat datasets / multimodal formats), or otherwise non-string. Safer to type-check and skip (or normalize) instead of crashing.
Proposed fix:

```diff
         # Resolve content field
         if "value" in msg:
             content = msg["value"]
         elif "content" in msg:
             content = msg["content"]
         elif "text" in msg:
             content = msg["text"]
         else:
             continue
-        content = content.strip()
+        if not isinstance(content, str):
+            continue
+        content = content.strip()
         if content:
             msgs.append({"role": role, "content": content})
```

🤖 Prompt for AI Agents
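The guard can be exercised in isolation. The following standalone sketch is a hypothetical helper, not the script's actual function; it shows the intended behavior for string and non-string payloads:

```python
from __future__ import annotations


def extract_content(msg: dict) -> str | None:
    """Return a stripped string payload from a chat message, or None when the
    payload is absent, empty after stripping, or not a string."""
    for key in ("value", "content", "text"):
        if key in msg:
            content = msg[key]
            # Type-check before .strip() so non-string payloads (None, dict,
            # list, ...) are skipped instead of crashing the whole run.
            if not isinstance(content, str):
                return None
            return content.strip() or None
    return None
```

A message like `{"content": None}` or a multimodal `{"text": [...]}` payload then yields None and is skipped, matching the review's suggested `isinstance` guard.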
msgs.append({"role": role, "content": content})🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py`
around lines 115 - 160, In parse_nemotron_conversation, calling content.strip()
can raise for non-string payloads (None, dict, list); guard by ensuring content
is a str before stripping—if msg["value"]/["content"]/["text"] yields a non-str,
skip or coerce safely (e.g., if not isinstance(content, str): continue or
convert to str only when appropriate) so msgs.append only receives string
content; update references to the local variable content within
parse_nemotron_conversation to perform this type-check/coercion prior to calling
strip().
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main     #977   +/-   ##
=======================================
  Coverage   72.12%   72.12%
=======================================
  Files         209      209
  Lines       23628    23628
=======================================
  Hits        17042    17042
  Misses       6586     6586
```

☔ View full report in Codecov by Sentry.
What does this PR do?
Type of change: new example
Original PR: #965
Usage
```python
# Add a code snippet demonstrating how to use this
```

Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).

Additional Information
Summary by CodeRabbit
Release Notes