Generate high-quality synthetic datasets from scratch or using your own seed data.
Data Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data.
- Generate diverse data using statistical samplers, LLMs, or existing seed datasets
- Control relationships between fields with dependency-aware generation
- Validate quality with built-in Python, SQL, and custom local and remote validators
- Score outputs using LLM-as-a-judge for quality assessment
- Iterate quickly with preview mode before full-scale generation
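To make "dependency-aware generation" concrete, here is a minimal sketch of the general idea: if a column's prompt references other columns through `{{ ... }}` placeholders, those columns have to be generated first. This is not Data Designer's implementation. The helper functions, the `columns` dictionary layout, and the `customer_name` and `review_summary` columns are hypothetical and exist only for illustration.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical column specs: column name -> prompt template.
# None marks a column that needs no other column (e.g. a sampler).
columns = {
    "review_summary": "Summarize this review in one sentence: {{ review }}",
    "review": "Write a brief review of a {{ product_category }} item by {{ customer_name }}.",
    "product_category": None,
    "customer_name": None,
}

# Matches placeholders such as {{ product_category }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def referenced_columns(template):
    """Return the set of column names a template refers to."""
    return set(PLACEHOLDER.findall(template)) if template else set()


def generation_order(specs):
    """Order columns so each one comes after the columns it references."""
    graph = {name: referenced_columns(tpl) for name, tpl in specs.items()}
    return list(TopologicalSorter(graph).static_order())


order = generation_order(columns)
print(order)
```

In this sketch, `product_category` and `customer_name` always precede `review`, which in turn precedes `review_summary`. The relative order of independent columns is not fixed.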
pip install data-designer
Or install from source:
git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
cd DataDesigner
make install
Get your API key from build.nvidia.com or OpenAI:
export NVIDIA_API_KEY="your-api-key-here"
# Or use OpenAI
export OPENAI_API_KEY="your-openai-api-key-here"
from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    LLMTextColumnConfig,
    PersonSamplerParams,
    SamplerColumnConfig,
    SamplerType,
)
# Initialize with default settings
data_designer = DataDesigner()
config_builder = DataDesignerConfigBuilder()
# Add a product category
config_builder.add_column(
    SamplerColumnConfig(
        name="product_category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"],
        ),
    )
)
# Generate personalized customer reviews
config_builder.add_column(
    LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="""Write a brief product review for a {{ product_category }} item you recently purchased.""",
    )
)
# Preview your dataset
preview = data_designer.preview(config_builder=config_builder)
preview.display_sample_record()
- Quick Start Guide: Detailed walkthrough with more examples
- Tutorial Notebooks: Step-by-step interactive tutorials
- Column Types: Explore samplers, LLM columns, validators, and more
- Validators: Learn how to validate generated data with Python, SQL, and remote validators
- Model Configuration: Configure custom models and providers
- Person Sampling: Learn how to sample realistic person data with demographic attributes
data-designer config providers # Configure model providers
data-designer config models # Set up your model configurations
data-designer config list # View current settings
- Contributing Guide: Help improve Data Designer
- GitHub Issues: Report bugs or request a feature
Apache License 2.0. See LICENSE for details.
If you use NeMo Data Designer in your research, please cite it using the following BibTeX entry:
@misc{nemo-data-designer,
  author = {The NeMo Data Designer Team},
  title = {NeMo Data Designer: A framework for generating synthetic data from scratch or based on your own seed data},
  howpublished = {\url{https://github.com/NVIDIA-NeMo/DataDesigner}},
  year = {2025},
  note = {GitHub Repository},
}