feat(speculative): add vLLM data synthesis pipeline and Nemotron dataset preparation scripts #1176
Merged (+1,337 −39)
# Dataset Preparation Scripts

Utilities for building conversation datasets from NVIDIA Nemotron Post-Training
collections and other HuggingFace sources. These scripts produce datasets in
**standard OpenAI chat format** (`{"messages": [{"role": ..., "content": ...}]}`)
and can be used for any downstream fine-tuning task — SFT, distillation,
speculative decoding draft-model training, etc.
## Files

| File | Description |
|---|---|
| `make_nemotron_ptv3_dataset.py` | Build a dataset from the [Nemotron PT v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3) using a configurable YAML mix |
| `make_nemotron_ptv2_dataset.py` | Build a dataset from [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) |
| `make_dataset.py` | General-purpose mixer for arbitrary HuggingFace datasets (mtbench, sharegpt, ultrachat, magpie, etc.) |
| `conversation_utils.py` | Shared utilities: augmentation, role normalization, assistant-turn stripping |
| `add_nemotron_chat.py` | Add Nemotron v2 chat conversations to an existing dataset |
| `augmentations.yaml` | Augmentation variants (language redirects, style hints) for `make_nemotron_pt*.py` |
| `nemotron_ptv3_datasets.yaml` | Dataset mix config for `make_nemotron_ptv3_dataset.py` |
| `example_data_config.yaml` | Example YAML config for `make_dataset.py` |
## Quick Start

### Install dependencies

```bash
pip install datasets huggingface_hub pyyaml
huggingface-cli login  # required for gated datasets
```
### Build a Nemotron PT v3 dataset

```bash
# Synthetic data generation inputs (strips last assistant turn so a model can regenerate it)
python make_nemotron_ptv3_dataset.py --output-dir /tmp/ptv3_gen

# Full conversations for direct SFT training
python make_nemotron_ptv3_dataset.py --mode train --output-dir /tmp/ptv3_train

# Use a custom dataset mix
python make_nemotron_ptv3_dataset.py --config my_mix.yaml --output-dir /tmp/ptv3_custom
```
### Build a Nemotron PT v2 dataset

```bash
python make_nemotron_ptv2_dataset.py --output-dir /tmp/ptv2_gen
python make_nemotron_ptv2_dataset.py --mode train --output-dir /tmp/ptv2_train
```
### Build a general-purpose mixed dataset

```bash
python make_dataset.py --config example_data_config.yaml --output-dir /tmp/mixed
```
## Dataset Modes

Both `make_nemotron_pt*.py` scripts support two modes:

| Mode | Description | Use case |
|---|---|---|
| `generate` (default) | Strips assistant turns, optionally augments prompts | Input data for synthetic generation (query a target model to produce training responses) |
| `train` | Keeps all turns, normalizes to clean OpenAI format | Direct SFT / distillation training |
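The `generate`-mode stripping can be sketched in a few lines. The helper name below is hypothetical; the actual logic lives in `conversation_utils.py` and may differ:

```python
def to_generation_skeleton(messages):
    """Drop trailing assistant turns so the row ends with a user turn,
    ready for a target model to regenerate the response."""
    trimmed = list(messages)
    while trimmed and trimmed[-1]["role"] == "assistant":
        trimmed.pop()
    return trimmed


convo = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]
skeleton = to_generation_skeleton(convo)
```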
## Synthetic Generation Pipeline

The `generate` mode produces conversation skeletons that are fed to a target model
via `tools/launcher/common/query.py` (vLLM or TRT-LLM). The output becomes training
data for a draft model (e.g. EAGLE3 speculative decoding) or a distilled student:

```text
make_nemotron_ptv3_dataset.py --mode generate → skeleton.jsonl
        ↓
query.py (target model generates responses turn-by-turn)
        ↓
training data for draft model / student
```
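Conceptually, the turn-by-turn step rebuilds a conversation from a skeleton like this. This is only a sketch: `generate` is a stand-in callable, whereas the real pipeline queries the target model through `tools/launcher/common/query.py`:

```python
def fill_turns(skeleton, generate):
    """Rebuild a full conversation from a stripped skeleton.

    `generate` is a stand-in (prefix messages -> assistant text); the real
    pipeline calls a vLLM or TRT-LLM target model via query.py.
    """
    convo = []
    for msg in skeleton:
        convo.append(msg)
        if msg["role"] == "user":
            # The target model answers each user turn given the running prefix.
            convo.append({"role": "assistant", "content": generate(convo)})
    return convo


# Toy stand-in model for illustration only.
result = fill_turns(
    [{"role": "user", "content": "What is 2+2?"}],
    generate=lambda prefix: "4",
)
```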
## Augmentations

`augmentations.yaml` defines language-redirect and style-hint variants that are
applied cyclically across the dataset. Each enabled entry produces one augmented
copy of the source rows.

To customize augmentations:

- **Disable** a variant: add `enabled: false`
- **Add** a language redirect: append a `user_suffix` entry
- **Add** a system prompt: append a `system_prompt` entry
```yaml
augmentations:
  - type: user_suffix
    text: " Please reply in French instead of English."
  - type: system_prompt
    content: "You are a helpful assistant."
    enabled: false  # disable without deleting
```
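The cyclic application described above can be sketched as follows (a hypothetical helper, not the shipped code; it only pairs rows with variants):

```python
import itertools


def assign_variants(rows, variants):
    """Pair each source row with one enabled variant, cycling through the
    variant list so the augmented copy is the same size as the source."""
    enabled = [v for v in variants if v.get("enabled", True)]
    cycle = itertools.cycle(enabled)
    return [(row, next(cycle)) for row in rows]


variants = [
    {"type": "user_suffix", "text": " Please reply in French instead of English."},
    {"type": "system_prompt", "content": "You are a helpful assistant.",
     "enabled": False},  # skipped: disabled
]
pairs = assign_variants(["row0", "row1", "row2"], variants)
```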
## Dataset Mix Config (`nemotron_ptv3_datasets.yaml`)

Edit this file to add, remove, or re-weight datasets without touching the script:

```yaml
datasets:
  - repo_id: nvidia/Nemotron-Math-v2
    splits: [high_part00, high_part01]
    cap_per_split: 200000
    augment: true

  - repo_id: nvidia/OpenMathReasoning-mini
    splits: [train]
    augment: false  # multilingual — skip language-redirect augmentation
```
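A consumer of this config might expand it into one job per split, roughly as below. This is a sketch over an already-parsed dict; the field names follow the example above, but the script's internals may differ:

```python
def iter_mix(cfg):
    """Yield one (repo_id, split, cap, augment) job per configured split."""
    for entry in cfg["datasets"]:
        for split in entry["splits"]:
            yield (
                entry["repo_id"],
                split,
                entry.get("cap_per_split"),  # None means no cap
                entry.get("augment", False),
            )


cfg = {"datasets": [{
    "repo_id": "nvidia/Nemotron-Math-v2",
    "splits": ["high_part00", "high_part01"],
    "cap_per_split": 200000,
    "augment": True,
}]}
jobs = list(iter_mix(cfg))
```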
## Output Format

Every output row is a JSONL object with a single `messages` key:

```json
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is 2+2?"},
  {"role": "assistant", "content": "4"}
]}
```

In `generate` mode, assistant turns are stripped so the row ends with a user turn.
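Rows in this format can be sanity-checked with a small validator (a sketch for illustration, not part of the scripts):

```python
import json


def validate_row(line, generate_mode=False):
    """Parse one JSONL line and check the {"messages": [...]} schema."""
    row = json.loads(line)
    messages = row["messages"]
    assert messages, "conversation must not be empty"
    for m in messages:
        assert m["role"] in {"system", "user", "assistant"}
        assert isinstance(m["content"], str)
    if generate_mode:
        # generate-mode rows end with a user turn (assistant turns stripped)
        assert messages[-1]["role"] == "user"
    return row


row = validate_row('{"messages": [{"role": "user", "content": "What is 2+2?"}]}',
                   generate_mode=True)
```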
File renamed without changes.
`augmentations.yaml`:

```yaml
# Augmentation specs for make_nemotron_ptv2_dataset.py and make_nemotron_ptv3_dataset.py
#
# Each entry defines one augmentation variant applied cyclically across the dataset.
# The augmented copy is the same size as the source — each row gets exactly one variant.
#
# Supported types:
#
# user_suffix
#   Appends `text` to the content of every user message in the conversation.
#   Example use: language-redirect instructions, style/length hints.
#
# system_prompt
#   Prepends a {"role": "system", "content": <content>} message to the conversation.
#   Use this for model-specific flags (e.g. /no_think) or persona instructions.
#   Set `enabled: false` for variants that are not supported by your target model.
#
# To disable an entry without deleting it, add `enabled: false`.
# To add a new variant, append a new entry following the same schema.

augmentations:

  # --- Language redirects (user_suffix) ------------------------------------

  - type: user_suffix
    text: " Please reply in French instead of English."

  - type: user_suffix
    text: " Please reply in Italian instead of English."

  - type: user_suffix
    text: " Please reply in German instead of English."

  - type: user_suffix
    text: " Please reply in Spanish instead of English."

  - type: user_suffix
    text: " Please reply in Mandarin Chinese instead of English."

  - type: user_suffix
    text: " Please reply in Japanese instead of English."

  - type: user_suffix
    text: " Please reply in Korean instead of English."

  - type: user_suffix
    text: " Please reply in Turkish instead of English."

  - type: user_suffix
    text: " Please reply in Modern Standard Arabic instead of English."

  - type: user_suffix
    text: " Please reply in Russian instead of English."

  - type: user_suffix
    text: " Please reply in Brazilian Portuguese instead of English."

  - type: user_suffix
    text: " Please reply in Vietnamese instead of English."

  # --- Style / format hints (user_suffix) ----------------------------------

  - type: user_suffix
    text: " Be concise and answer in as few words as possible."

  - type: user_suffix
    text: " Provide a detailed, step-by-step explanation."

  - type: user_suffix
    text: " Format your response using Markdown (headers, bullet points, code blocks where appropriate)."

  - type: user_suffix
    text: " Do not use Markdown formatting; reply in plain text only."

  - type: user_suffix
    text: " Explain your answer as if I am a complete beginner with no prior knowledge."

  - type: user_suffix
    text: " Assume I am an expert; skip basic explanations and go straight to the details."

  - type: user_suffix
    text: " Think step by step before giving your final answer."

  # --- System-prompt variants (system_prompt) ------------------------------

  # /no_think: suppresses chain-of-thought in models that support it (e.g. Qwen3).
  # Set enabled: false if your target model does not support this flag.
  - type: system_prompt
    content: "You are a helpful assistant. /no_think"
    enabled: false

  # Generic helpful-assistant system prompt (no special flags).
  - type: system_prompt
    content: "You are a helpful, respectful, and honest assistant."
```
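The two variant types described in the file's header comment can be applied with logic along these lines (a hypothetical sketch of the stated semantics, not the shipped code):

```python
def apply_variant(messages, variant):
    """Apply one augmentation variant to a conversation."""
    if variant["type"] == "user_suffix":
        # Append the suffix to every user message's content.
        return [
            {**m, "content": m["content"] + variant["text"]}
            if m["role"] == "user" else dict(m)
            for m in messages
        ]
    if variant["type"] == "system_prompt":
        # Prepend a system message to the conversation.
        return [{"role": "system", "content": variant["content"]}, *messages]
    raise ValueError(f"unsupported augmentation type: {variant['type']!r}")


augmented = apply_variant(
    [{"role": "user", "content": "Name three colors."}],
    {"type": "user_suffix", "text": " Please reply in French instead of English."},
)
```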