`examples/dataset/README.md`:

# Dataset Preparation Scripts

Utilities for building conversation datasets from NVIDIA Nemotron Post-Training
collections and other HuggingFace sources. These scripts produce datasets in
**standard OpenAI chat format** (`{"messages": [{"role": ..., "content": ...}]}`)
and can be used for any downstream fine-tuning task — SFT, distillation,
speculative decoding draft-model training, etc.

## Files

| File | Description |
|---|---|
| `make_nemotron_ptv3_dataset.py` | Build a dataset from the [Nemotron PT v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3) using a configurable YAML mix |
| `make_nemotron_ptv2_dataset.py` | Build a dataset from [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) |
| `make_dataset.py` | General-purpose mixer for arbitrary HuggingFace datasets (mtbench, sharegpt, ultrachat, magpie, etc.) |
| `conversation_utils.py` | Shared utilities: augmentation, role normalization, assistant-turn stripping |
| `add_nemotron_chat.py` | Add Nemotron v2 chat conversations to an existing dataset |
| `augmentations.yaml` | Augmentation variants (language redirects, style hints) for `make_nemotron_pt*.py` |
| `nemotron_ptv3_datasets.yaml` | Dataset mix config for `make_nemotron_ptv3_dataset.py` |
| `example_data_config.yaml` | Example YAML config for `make_dataset.py` |

## Quick Start

### Install dependencies

```bash
pip install datasets huggingface_hub pyyaml
huggingface-cli login  # required for gated datasets
```

### Build a Nemotron PT v3 dataset

```bash
# Synthetic data generation inputs (strips last assistant turn so a model can regenerate it)
python make_nemotron_ptv3_dataset.py --output-dir /tmp/ptv3_gen

# Full conversations for direct SFT training
python make_nemotron_ptv3_dataset.py --mode train --output-dir /tmp/ptv3_train

# Use a custom dataset mix
python make_nemotron_ptv3_dataset.py --config my_mix.yaml --output-dir /tmp/ptv3_custom
```

### Build a Nemotron PT v2 dataset

```bash
python make_nemotron_ptv2_dataset.py --output-dir /tmp/ptv2_gen
python make_nemotron_ptv2_dataset.py --mode train --output-dir /tmp/ptv2_train
```

### Build a general-purpose mixed dataset

```bash
python make_dataset.py --config example_data_config.yaml --output-dir /tmp/mixed
```

## Dataset Modes

Both `make_nemotron_pt*.py` scripts support two modes:

| Mode | Description | Use case |
|---|---|---|
| `generate` (default) | Strips assistant turns, optionally augments prompts | Input data for synthetic generation (query a target model to produce training responses) |
| `train` | Keeps all turns, normalizes to clean OpenAI format | Direct SFT / distillation training |
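
The core difference between the two modes can be sketched in a few lines of Python. This is an illustrative helper, not the scripts' actual function name — in `generate` mode, trailing assistant turns are dropped so each row ends on a user message:

```python
def strip_trailing_assistant(messages):
    # Drop trailing assistant turns so the conversation ends with a
    # user turn, ready for a target model to regenerate the response.
    while messages and messages[-1]["role"] == "assistant":
        messages = messages[:-1]
    return messages

row = {"messages": [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]}
skeleton = {"messages": strip_trailing_assistant(row["messages"])}
```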

## Synthetic Generation Pipeline

The `generate` mode produces conversation skeletons that are fed to a target model
via `tools/launcher/common/query.py` (vLLM or TRT-LLM). The output becomes training
data for a draft model (e.g. EAGLE3 speculative decoding) or a distilled student:

```text
make_nemotron_ptv3_dataset.py --mode generate → skeleton.jsonl
        ↓
query.py (target model generates responses turn-by-turn)
        ↓
training data for draft model / student
```
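
Conceptually, the middle step fills each skeleton back in. A minimal sketch — `regenerate` below is a stand-in for the actual model call that `query.py` performs via vLLM or TRT-LLM:

```python
import json

def regenerate(messages):
    # Stand-in for the target-model call made by query.py.
    return "<generated response>"

def fill_skeletons(jsonl_lines):
    """Append a generated assistant turn to each skeleton row."""
    filled = []
    for line in jsonl_lines:
        row = json.loads(line)
        row["messages"].append(
            {"role": "assistant", "content": regenerate(row["messages"])})
        filled.append(json.dumps(row))
    return filled

out = fill_skeletons(
    ['{"messages": [{"role": "user", "content": "What is 2+2?"}]}'])
```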

## Augmentations

`augmentations.yaml` defines language-redirect and style-hint variants that are
applied cyclically across the dataset: each row gets exactly one variant, so the
augmented copy is the same size as the source.

To customize augmentations:
- **Disable** a variant: add `enabled: false`
- **Add** a language redirect: append a `user_suffix` entry
- **Add** a system prompt: append a `system_prompt` entry

```yaml
augmentations:
  - type: user_suffix
    text: " Please reply in French instead of English."
  - type: system_prompt
    content: "You are a helpful assistant."
    enabled: false  # disable without deleting
```
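
The cyclic application can be sketched as follows. The helper names are hypothetical — the scripts' own implementation (in `conversation_utils.py`) may differ in detail:

```python
import itertools

def apply_augmentation(messages, aug):
    # user_suffix: append the text to every user message.
    if aug["type"] == "user_suffix":
        return [{**m, "content": m["content"] + aug["text"]}
                if m["role"] == "user" else m
                for m in messages]
    # system_prompt: prepend a system message.
    if aug["type"] == "system_prompt":
        return [{"role": "system", "content": aug["content"]}] + messages
    return messages

enabled = [
    {"type": "user_suffix", "text": " Please reply in French instead of English."},
    {"type": "system_prompt", "content": "You are a helpful assistant."},
]

rows = [[{"role": "user", "content": f"Question {i}"}] for i in range(4)]
# Cycle through the enabled variants: each row gets exactly one.
augmented = [apply_augmentation(msgs, aug)
             for msgs, aug in zip(rows, itertools.cycle(enabled))]
```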

## Dataset Mix Config (`nemotron_ptv3_datasets.yaml`)

Edit this file to add, remove, or re-weight datasets without touching the script:

```yaml
datasets:
  - repo_id: nvidia/Nemotron-Math-v2
    splits: [high_part00, high_part01]
    cap_per_split: 200000
    augment: true

  - repo_id: nvidia/OpenMathReasoning-mini
    splits: [train]
    augment: false  # multilingual — skip language-redirect augmentation
```
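
A sketch of how such a config might be consumed, using `pyyaml` from the install step. The loop body is illustrative — the real script would call `datasets.load_dataset` where the comment indicates:

```python
import yaml

config_text = """
datasets:
  - repo_id: nvidia/Nemotron-Math-v2
    splits: [high_part00, high_part01]
    cap_per_split: 200000
    augment: true
"""

config = yaml.safe_load(config_text)

plan = []
for spec in config["datasets"]:
    cap = spec.get("cap_per_split")  # None means keep the whole split
    for split in spec["splits"]:
        # The real script would load_dataset(spec["repo_id"], split=split)
        # here and truncate the split to `cap` rows.
        plan.append((spec["repo_id"], split, cap))
```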

## Output Format

Every output row is a JSONL object with a single `messages` key:

```json
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is 2+2?"},
  {"role": "assistant", "content": "4"}
]}
```

In `generate` mode, assistant turns are stripped so the row ends with a user turn.
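
A small validation helper for output rows — a sketch with illustrative names, checking only the invariants stated above:

```python
import json

def validate_row(line, mode="train"):
    """Check that a JSONL row matches the expected chat format."""
    row = json.loads(line)
    assert set(row) == {"messages"}, "row must have a single 'messages' key"
    for msg in row["messages"]:
        assert msg["role"] in {"system", "user", "assistant"}
        assert isinstance(msg["content"], str)
    if mode == "generate":
        assert row["messages"][-1]["role"] == "user", \
            "generate-mode rows must end with a user turn"
    return row

row = validate_row(
    '{"messages": [{"role": "user", "content": "What is 2+2?"}]}',
    mode="generate")
```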
`examples/dataset/augmentations.yaml`:

# Augmentation specs for make_nemotron_ptv2_dataset.py and make_nemotron_ptv3_dataset.py
#
# Each entry defines one augmentation variant applied cyclically across the dataset.
# The augmented copy is the same size as the source — each row gets exactly one variant.
#
# Supported types:
#
# user_suffix
# Appends `text` to the content of every user message in the conversation.
# Example use: language-redirect instructions, style/length hints.
#
# system_prompt
# Prepends a {"role": "system", "content": <content>} message to the conversation.
# Use this for model-specific flags (e.g. /no_think) or persona instructions.
# Set `enabled: false` for variants that are not supported by your target model.
#
# To disable an entry without deleting it, add `enabled: false`.
# To add a new variant, append a new entry following the same schema.

augmentations:

  # --- Language redirects (user_suffix) ------------------------------------

  - type: user_suffix
    text: " Please reply in French instead of English."

  - type: user_suffix
    text: " Please reply in Italian instead of English."

  - type: user_suffix
    text: " Please reply in German instead of English."

  - type: user_suffix
    text: " Please reply in Spanish instead of English."

  - type: user_suffix
    text: " Please reply in Mandarin Chinese instead of English."

  - type: user_suffix
    text: " Please reply in Japanese instead of English."

  - type: user_suffix
    text: " Please reply in Korean instead of English."

  - type: user_suffix
    text: " Please reply in Turkish instead of English."

  - type: user_suffix
    text: " Please reply in Modern Standard Arabic instead of English."

  - type: user_suffix
    text: " Please reply in Russian instead of English."

  - type: user_suffix
    text: " Please reply in Brazilian Portuguese instead of English."

  - type: user_suffix
    text: " Please reply in Vietnamese instead of English."

  # --- Style / format hints (user_suffix) ----------------------------------

  - type: user_suffix
    text: " Be concise and answer in as few words as possible."

  - type: user_suffix
    text: " Provide a detailed, step-by-step explanation."

  - type: user_suffix
    text: " Format your response using Markdown (headers, bullet points, code blocks where appropriate)."

  - type: user_suffix
    text: " Do not use Markdown formatting; reply in plain text only."

  - type: user_suffix
    text: " Explain your answer as if I am a complete beginner with no prior knowledge."

  - type: user_suffix
    text: " Assume I am an expert; skip basic explanations and go straight to the details."

  - type: user_suffix
    text: " Think step by step before giving your final answer."

  # --- System-prompt variants (system_prompt) ------------------------------

  # /no_think: suppresses chain-of-thought in models that support it (e.g. Qwen3).
  # Set enabled: false if your target model does not support this flag.
  - type: system_prompt
    content: "You are a helpful assistant. /no_think"
    enabled: false

  # Generic helpful-assistant system prompt (no special flags).
  - type: system_prompt
    content: "You are a helpful, respectful, and honest assistant."