`examples/dataset/README.md`:

# Dataset Preparation Scripts

Utilities for building conversation datasets from NVIDIA Nemotron Post-Training
collections and other HuggingFace sources. These scripts produce datasets in
**standard OpenAI chat format** (`{"messages": [{"role": ..., "content": ...}]}`)
and can be used for any downstream fine-tuning task — SFT, distillation,
speculative decoding draft-model training, etc.

## Files

| File | Description |
|---|---|
| `make_nemotron_ptv3_dataset.py` | Build a dataset from the [Nemotron PT v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3) using a configurable YAML mix |
| `make_nemotron_ptv2_dataset.py` | Build a dataset from [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) |
| `make_dataset.py` | General-purpose mixer for arbitrary HuggingFace datasets (mtbench, sharegpt, ultrachat, magpie, etc.) |
| `conversation_utils.py` | Shared utilities: augmentation, role normalization, assistant-turn stripping |
| `add_nemotron_chat.py` | Add Nemotron v2 chat conversations to an existing dataset |
| `augmentations.yaml` | Augmentation variants (language redirects, style hints) for `make_nemotron_pt*.py` |
| `nemotron_ptv3_datasets.yaml` | Dataset mix config for `make_nemotron_ptv3_dataset.py` |
| `example_data_config.yaml` | Example YAML config for `make_dataset.py` |

## Quick Start

### Install dependencies

```bash
pip install datasets huggingface_hub pyyaml
huggingface-cli login  # required for gated datasets
```

### Build a Nemotron PT v3 dataset

```bash
# Synthetic data generation inputs (strips last assistant turn so a model can regenerate it)
python make_nemotron_ptv3_dataset.py --output-dir /tmp/ptv3_gen

# Full conversations for direct SFT training
python make_nemotron_ptv3_dataset.py --mode train --output-dir /tmp/ptv3_train

# Use a custom dataset mix
python make_nemotron_ptv3_dataset.py --config my_mix.yaml --output-dir /tmp/ptv3_custom
```

### Build a Nemotron PT v2 dataset

```bash
python make_nemotron_ptv2_dataset.py --output-dir /tmp/ptv2_gen
python make_nemotron_ptv2_dataset.py --mode train --output-dir /tmp/ptv2_train
```

### Build a general-purpose mixed dataset

```bash
python make_dataset.py --config example_data_config.yaml --output-dir /tmp/mixed
```

## Dataset Modes

Both `make_nemotron_pt*.py` scripts support two modes:

| Mode | Description | Use case |
|---|---|---|
| `generate` (default) | Strips assistant turns, optionally augments prompts | Input data for synthetic generation (query a target model to produce training responses) |
| `train` | Keeps all turns, normalizes to clean OpenAI format | Direct SFT / distillation training |
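
The core difference between the two modes can be sketched in a few lines of Python. This is an illustrative helper, not the scripts' actual function name — in `generate` mode, trailing assistant turns are dropped so each row ends on a user message:

```python
def strip_trailing_assistant(messages):
    # Drop trailing assistant turns so the conversation ends with a
    # user turn, ready for a target model to regenerate the response.
    while messages and messages[-1]["role"] == "assistant":
        messages = messages[:-1]
    return messages

row = {"messages": [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]}
skeleton = {"messages": strip_trailing_assistant(row["messages"])}
```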

## Synthetic Generation Pipeline

The `generate` mode produces conversation skeletons that are fed to a target model
via `tools/launcher/common/query.py` (vLLM or TRT-LLM). The output becomes training
data for a draft model (e.g. EAGLE3 speculative decoding) or a distilled student:

```text
make_nemotron_ptv3_dataset.py --mode generate → skeleton.jsonl
        ↓
query.py (target model generates responses turn-by-turn)
        ↓
training data for draft model / student
```
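
Conceptually, the middle step fills each skeleton back in. A minimal sketch — `regenerate` below is a stand-in for the actual model call that `query.py` performs via vLLM or TRT-LLM:

```python
import json

def regenerate(messages):
    # Stand-in for the target-model call made by query.py.
    return "<generated response>"

def fill_skeletons(jsonl_lines):
    """Append a generated assistant turn to each skeleton row."""
    filled = []
    for line in jsonl_lines:
        row = json.loads(line)
        row["messages"].append(
            {"role": "assistant", "content": regenerate(row["messages"])})
        filled.append(json.dumps(row))
    return filled

out = fill_skeletons(
    ['{"messages": [{"role": "user", "content": "What is 2+2?"}]}'])
```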

## Augmentations

`augmentations.yaml` defines language-redirect and style-hint variants that are
applied cyclically across the dataset: each row gets exactly one variant, so the
augmented copy is the same size as the source.

To customize augmentations:
- **Disable** a variant: add `enabled: false`
- **Add** a language redirect: append a `user_suffix` entry
- **Add** a system prompt: append a `system_prompt` entry

```yaml
augmentations:
  - type: user_suffix
    text: " Please reply in French instead of English."
  - type: system_prompt
    content: "You are a helpful assistant."
    enabled: false  # disable without deleting
```
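
The cyclic application can be sketched as follows. The helper names are hypothetical — the scripts' own implementation (in `conversation_utils.py`) may differ in detail:

```python
import itertools

def apply_augmentation(messages, aug):
    # user_suffix: append the text to every user message.
    if aug["type"] == "user_suffix":
        return [{**m, "content": m["content"] + aug["text"]}
                if m["role"] == "user" else m
                for m in messages]
    # system_prompt: prepend a system message.
    if aug["type"] == "system_prompt":
        return [{"role": "system", "content": aug["content"]}] + messages
    return messages

enabled = [
    {"type": "user_suffix", "text": " Please reply in French instead of English."},
    {"type": "system_prompt", "content": "You are a helpful assistant."},
]

rows = [[{"role": "user", "content": f"Question {i}"}] for i in range(4)]
# Cycle through the enabled variants: each row gets exactly one.
augmented = [apply_augmentation(msgs, aug)
             for msgs, aug in zip(rows, itertools.cycle(enabled))]
```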

## Dataset Mix Config (`nemotron_ptv3_datasets.yaml`)

Edit this file to add, remove, or re-weight datasets without touching the script:

```yaml
datasets:
  - repo_id: nvidia/Nemotron-Math-v2
    splits: [high_part00, high_part01]
    cap_per_split: 200000
    augment: true

  - repo_id: nvidia/OpenMathReasoning-mini
    splits: [train]
    augment: false  # multilingual — skip language-redirect augmentation
```
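
A sketch of how such a config might be consumed, using `pyyaml` from the install step. The loop body is illustrative — the real script would call `datasets.load_dataset` where the comment indicates:

```python
import yaml

config_text = """
datasets:
  - repo_id: nvidia/Nemotron-Math-v2
    splits: [high_part00, high_part01]
    cap_per_split: 200000
    augment: true
"""

config = yaml.safe_load(config_text)

plan = []
for spec in config["datasets"]:
    cap = spec.get("cap_per_split")  # None means keep the whole split
    for split in spec["splits"]:
        # The real script would load_dataset(spec["repo_id"], split=split)
        # here and truncate the split to `cap` rows.
        plan.append((spec["repo_id"], split, cap))
```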

## Output Format

Every output row is a JSONL object with a single `messages` key:

```json
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is 2+2?"},
  {"role": "assistant", "content": "4"}
]}
```

In `generate` mode, assistant turns are stripped so the row ends with a user turn.
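
A small validation helper for output rows — a sketch with illustrative names, checking only the invariants stated above:

```python
import json

def validate_row(line, mode="train"):
    """Check that a JSONL row matches the expected chat format."""
    row = json.loads(line)
    assert set(row) == {"messages"}, "row must have a single 'messages' key"
    for msg in row["messages"]:
        assert msg["role"] in {"system", "user", "assistant"}
        assert isinstance(msg["content"], str)
    if mode == "generate":
        assert row["messages"][-1]["role"] == "user", \
            "generate-mode rows must end with a user turn"
    return row

row = validate_row(
    '{"messages": [{"role": "user", "content": "What is 2+2?"}]}',
    mode="generate")
```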
`examples/dataset/augmentations.yaml`:

# Augmentation specs for make_nemotron_ptv2_dataset.py and make_nemotron_ptv3_dataset.py
#
# Each entry defines one augmentation variant applied cyclically across the dataset.
# The augmented copy is the same size as the source — each row gets exactly one variant.
#
# Supported types:
#
# user_suffix
# Appends `text` to the content of every user message in the conversation.
# Example use: language-redirect instructions, style/length hints.
#
# system_prompt
# Prepends a {"role": "system", "content": <content>} message to the conversation.
# Use this for model-specific flags (e.g. /no_think) or persona instructions.
# Set `enabled: false` for variants that are not supported by your target model.
#
# To disable an entry without deleting it, add `enabled: false`.
# To add a new variant, append a new entry following the same schema.

augmentations:

  # --- Language redirects (user_suffix) ------------------------------------

  - type: user_suffix
    text: " Please reply in French instead of English."

  - type: user_suffix
    text: " Please reply in Italian instead of English."

  - type: user_suffix
    text: " Please reply in German instead of English."

  - type: user_suffix
    text: " Please reply in Spanish instead of English."

  - type: user_suffix
    text: " Please reply in Mandarin Chinese instead of English."

  - type: user_suffix
    text: " Please reply in Japanese instead of English."

  - type: user_suffix
    text: " Please reply in Korean instead of English."

  - type: user_suffix
    text: " Please reply in Turkish instead of English."

  - type: user_suffix
    text: " Please reply in Modern Standard Arabic instead of English."

  - type: user_suffix
    text: " Please reply in Russian instead of English."

  - type: user_suffix
    text: " Please reply in Brazilian Portuguese instead of English."

  - type: user_suffix
    text: " Please reply in Vietnamese instead of English."

  # --- Style / format hints (user_suffix) ----------------------------------

  - type: user_suffix
    text: " Be concise and answer in as few words as possible."

  - type: user_suffix
    text: " Provide a detailed, step-by-step explanation."

  - type: user_suffix
    text: " Format your response using Markdown (headers, bullet points, code blocks where appropriate)."

  - type: user_suffix
    text: " Do not use Markdown formatting; reply in plain text only."

  - type: user_suffix
    text: " Explain your answer as if I am a complete beginner with no prior knowledge."

  - type: user_suffix
    text: " Assume I am an expert; skip basic explanations and go straight to the details."

  - type: user_suffix
    text: " Think step by step before giving your final answer."

  # --- System-prompt variants (system_prompt) ------------------------------

  # /no_think: suppresses chain-of-thought in models that support it (e.g. Qwen3).
  # Set enabled: false if your target model does not support this flag.
  - type: system_prompt
    content: "You are a helpful assistant. /no_think"
    enabled: false

  # Generic helpful-assistant system prompt (no special flags).
  - type: system_prompt
    content: "You are a helpful, respectful, and honest assistant."