[New Skill]: Synthetic Data Generator (High-Entropy) #22

@rosspeili

Skill Name

data_engineering/synthetic_generator

What should this skill do?

The Problem: The supply of high-quality, human-written internet text available for training frontier models is finite, and data scarcity is becoming a bottleneck. Naive LLM-generated text is a poor substitute: it tends to be repetitive and low in diversity (low entropy), and training on it can contribute to "model collapse".
The Solution: A specialized agent skill that generates high-entropy, well-structured synthetic data designed specifically for fine-tuning other models. In effect, an agent with this skill acts as an automated synthetic data pipeline for ML engineers.

Documentation Requirement:
When submitting a Pull Request for this skill, the contributor must provide:

  1. A reference card at docs/skills/synthetic_generator.md detailing the entropy logic.
  2. Updates to docs/skills/README.md introducing the data_engineering category.
  3. Example usage in the examples/ directory showing an agent looping this skill to build a .jsonl fine-tuning dataset.
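For item 3, a minimal sketch of what such a loop might look like. The skill's real interface does not exist yet, so `make_stub_skill` is a hypothetical stand-in that only mimics the request/response shape proposed in this issue; `build_dataset`, the deduplication step, and the file name are likewise assumptions for illustration.

```python
import itertools
import json
import tempfile
from pathlib import Path


def make_stub_skill():
    """Hypothetical stand-in for the skill: returns placeholder samples."""
    counter = itertools.count()

    def skill(request: dict) -> dict:
        samples = [
            {
                "instruction": f"Stub instruction {next(counter)}",
                "input": request["domain"],
                "output": "stub output",
            }
            for _ in range(request["num_samples"])
        ]
        return {"samples": samples, "entropy_score": 0.0, "status": "success"}

    return skill


def build_dataset(skill, request: dict, target_size: int, out_path: Path) -> int:
    """Call the skill repeatedly and write unique samples as JSON Lines."""
    seen = set()
    written = 0
    with open(out_path, "w", encoding="utf-8") as f:
        while written < target_size:
            response = skill(request)
            if response.get("status") != "success":
                break
            added = 0
            for sample in response["samples"]:
                line = json.dumps(sample, sort_keys=True, ensure_ascii=False)
                if line in seen:
                    continue  # skip exact duplicates
                seen.add(line)
                f.write(line + "\n")
                written += 1
                added += 1
                if written >= target_size:
                    break
            if added == 0:
                break  # no new samples in this batch; avoid looping forever
    return written


request = {
    "domain": "medical_coding_disputes",
    "num_samples": 5,
    "entropy_temperature": 0.9,
    "diversity_prompt": "Ensure edge-case scenarios involving dual-insurance coverage.",
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset.jsonl"
    count = build_dataset(make_stub_skill(), request, target_size=12, out_path=path)
    lines = path.read_text(encoding="utf-8").splitlines()
```

The loop stops early if a batch fails or contributes nothing new, so a skill that keeps returning duplicates cannot hang the agent.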

Ideal Inputs & Outputs

Input:
{
  "domain": "medical_coding_disputes",
  "num_samples": 5,
  "entropy_temperature": 0.9,
  "diversity_prompt": "Ensure edge-case scenarios involving dual-insurance coverage."
}
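A possible way to validate this input before generation, sketched as a dataclass. The field names come from the example above; the class name `GenerationRequest`, the helper `parse_request`, and the specific checks (non-empty domain, at least one sample, non-negative temperature, optional diversity prompt) are assumptions, since the issue does not specify constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationRequest:
    """Hypothetical typed view of the skill input (constraints are assumed)."""

    domain: str
    num_samples: int
    entropy_temperature: float
    diversity_prompt: str = ""

    def __post_init__(self):
        if not isinstance(self.domain, str) or not self.domain.strip():
            raise ValueError("domain must be a non-empty string")
        # bool is a subclass of int, so reject it explicitly
        if (
            isinstance(self.num_samples, bool)
            or not isinstance(self.num_samples, int)
            or self.num_samples < 1
        ):
            raise ValueError("num_samples must be an integer >= 1")
        if self.entropy_temperature < 0:
            raise ValueError("entropy_temperature must be non-negative")


def parse_request(payload: dict) -> GenerationRequest:
    """Build a request from a decoded JSON object; unknown keys raise TypeError."""
    return GenerationRequest(**payload)


example = parse_request(
    {
        "domain": "medical_coding_disputes",
        "num_samples": 5,
        "entropy_temperature": 0.9,
        "diversity_prompt": "Ensure edge-case scenarios involving dual-insurance coverage.",
    }
)
```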

Output:
{
  "samples": [
    {"instruction": "...", "input": "...", "output": "..."},
    {"instruction": "...", "input": "...", "output": "..."}
  ],
  "entropy_score": 0.88,
  "status": "success"
}
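The issue does not define how `entropy_score` is computed. One possible proxy, shown here purely as an assumption, is the Shannon entropy of the whitespace-token distribution across all samples, normalized by its maximum so the result falls in [0, 1]. The function name `normalized_token_entropy` is hypothetical.

```python
import math
from collections import Counter


def normalized_token_entropy(samples: list) -> float:
    """Shannon entropy of token frequencies, divided by log2(vocabulary size).

    Assumed proxy only: 0.0 means a single repeated token, 1.0 means every
    distinct token occurs equally often.
    """
    tokens = []
    for sample in samples:
        for field in ("instruction", "input", "output"):
            tokens.extend(str(sample.get(field, "")).lower().split())

    counts = Counter(tokens)
    if len(counts) < 2:
        return 0.0  # entropy is zero (or undefined) with fewer than two token types

    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))


repetitive = [{"instruction": "same same", "input": "same", "output": "same"}]
varied = [{"instruction": "alpha beta", "input": "gamma", "output": "delta"}]

low = normalized_token_entropy(repetitive)
high = normalized_token_entropy(varied)
```

Because the score is normalized by the observed vocabulary, it measures how evenly tokens are used rather than how large the vocabulary is. A reference card for the skill would need to state whichever definition is actually adopted.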

Targeted Models (if applicable)

Model Agnostic (All)

Labels

enhancement: New feature or request
skill request: Request for a new capability to be added.