[New Skill]: Synthetic Data Generator (High-Entropy) #22
Description
Skill Name
data_engineering/synthetic_generator
What should this skill do?
The Problem: We are rapidly running out of human-written internet text to train frontier models. Data scarcity is the immediate bottleneck, and models trained on simple LLM-generated text often suffer from "model collapse" because that text has low entropy.
The Solution: A specialized agent skill that generates high-entropy, highly structured synthetic data designed for fine-tuning other models. This lets an agent act as an automated synthetic data pipeline for ML engineers.
Documentation Requirement:
When submitting a Pull Request for this skill, the contributor must provide:
- A reference card at docs/skills/synthetic_generator.md detailing the entropy logic.
- Updates to docs/skills/README.md introducing the data_engineering category.
- Example usage in the examples/ directory showing an agent looping this skill to build a .jsonl fine-tuning dataset.
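The looping example requested in the last item might look roughly like the sketch below. It is not the skill's real interface: `fake_synthetic_generator` is a hypothetical stand-in that returns the response shape proposed in this issue, and `build_dataset`, `target_size`, and `max_calls` are illustrative names.

```python
import json
from pathlib import Path
from typing import Callable


def fake_synthetic_generator(request: dict) -> dict:
    """Hypothetical stand-in for the skill; returns the response shape from this issue."""
    samples = [
        {
            "instruction": f"Resolve dispute {i} in {request['domain']}",
            "input": request["diversity_prompt"],
            "output": "...",
        }
        for i in range(request["num_samples"])
    ]
    # 0.88 is the placeholder value from the issue's example, not a computed score.
    return {"samples": samples, "entropy_score": 0.88, "status": "success"}


def build_dataset(
    skill: Callable[[dict], dict],
    request: dict,
    target_size: int,
    out_path: str,
    max_calls: int = 100,
) -> int:
    """Call the skill repeatedly and write one JSON object per line until target_size is reached."""
    written = 0
    with Path(out_path).open("w", encoding="utf-8") as fh:
        for _ in range(max_calls):  # hard cap so a failing skill cannot loop forever
            if written >= target_size:
                break
            response = skill(request)
            if response.get("status") != "success":
                continue  # skip failed batches
            for sample in response["samples"]:
                if written >= target_size:
                    break
                fh.write(json.dumps(sample, ensure_ascii=False) + "\n")
                written += 1
    return written
```

Calling `build_dataset(fake_synthetic_generator, request, 1000, "dataset.jsonl")` would keep requesting batches until 1000 lines are written or `max_calls` is exhausted. A real example would likely also vary `diversity_prompt` between calls to avoid near-duplicate batches.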
Ideal Inputs & Outputs
Input:
{
  "domain": "medical_coding_disputes",
  "num_samples": 5,
  "entropy_temperature": 0.9,
  "diversity_prompt": "Ensure edge-case scenarios involving dual-insurance coverage."
}
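An implementation would probably want to validate this payload before generating anything. The following is a minimal sketch under stated assumptions: `GeneratorRequest` and `parse_request` are hypothetical names, and the constraints (at least one sample, non-negative temperature, optional `diversity_prompt`) are guesses that this issue does not specify.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratorRequest:
    """Hypothetical typed view of the input payload proposed in this issue."""

    domain: str
    num_samples: int
    entropy_temperature: float
    diversity_prompt: str = ""  # assumed optional; the issue does not say

    def __post_init__(self) -> None:
        if not isinstance(self.domain, str) or not self.domain.strip():
            raise ValueError("domain must be a non-empty string")
        # bool is a subclass of int, so reject it explicitly
        if (
            isinstance(self.num_samples, bool)
            or not isinstance(self.num_samples, int)
            or self.num_samples < 1
        ):
            raise ValueError("num_samples must be an integer >= 1")
        if (
            isinstance(self.entropy_temperature, bool)
            or not isinstance(self.entropy_temperature, (int, float))
            or self.entropy_temperature < 0
        ):
            # Assumed constraint: temperature cannot be negative
            raise ValueError("entropy_temperature must be a number >= 0")


def parse_request(raw: str) -> GeneratorRequest:
    """Parse a JSON string into a validated request; unknown keys raise TypeError."""
    return GeneratorRequest(**json.loads(raw))
```

Rejecting unknown keys is a deliberate choice here, so that a misspelled field name fails loudly instead of being silently ignored.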
Output:
{
  "samples": [
    {"instruction": "...", "input": "...", "output": "..."},
    {"instruction": "...", "input": "...", "output": "..."}
  ],
  "entropy_score": 0.88,
  "status": "success"
}
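This issue does not define how `entropy_score` is computed. One possible interpretation, shown purely as a sketch, is the Shannon entropy of the whitespace-token distribution across all samples, divided by its maximum for the observed vocabulary so the result lands in [0, 1]. The function name and the tokenization are assumptions, not the skill's specified entropy logic.

```python
import math
from collections import Counter


def normalized_token_entropy(samples: list[dict]) -> float:
    """Shannon entropy of token frequencies, normalized to [0, 1].

    Assumption: tokens are lowercase whitespace-separated words drawn from
    the "instruction", "input", and "output" fields of every sample.
    """
    tokens: list[str] = []
    for sample in samples:
        for field in ("instruction", "input", "output"):
            tokens.extend(str(sample.get(field, "")).lower().split())

    counts = Counter(tokens)
    if len(counts) < 2:
        return 0.0  # zero or one distinct token carries no entropy

    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # log2(vocabulary size) is the entropy of a uniform distribution over the
    # observed tokens, which is the maximum attainable value
    return entropy / math.log2(len(counts))
```

A score like this measures how evenly tokens are used, not how large the vocabulary is, so a reference card would probably need to pair it with a duplicate or n-gram overlap check.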
Targeted Models (if applicable)
Model Agnostic (All)