An open-source, hybrid implementation of Meta AI's Autodata. Generates high-quality synthetic training data using Agentic Self-Instruct and Weak-to-Strong gaps.
Mini-Autodata: Agentic Self-Instruct Framework for High-Quality Synthetic Data

Mini-Autodata is a localized, hybrid implementation inspired by Meta AI's Autodata framework. It acts as an autonomous "data scientist," iteratively generating, evaluating, and refining training and benchmark data using an Agentic Self-Instruct pipeline.

Traditional synthetic data often fails to push the boundaries of model capabilities. Mini-Autodata solves this by employing a multi-agent architecture (Challenger, Weak Solver, Strong Solver, and Judge). The system strictly curates data based on a "Quality Gap": a question is only accepted if a Strong Model (e.g., DeepSeek API) succeeds while a Weak Model (e.g., local llama3.2:1b) fails. This ensures the generated datasets require deep, context-grounded reasoning rather than just general knowledge.
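The "Quality Gap" rule can be illustrated with a minimal sketch. Everything named here (`Candidate`, `passes_quality_gap`, `exact_match_judge`, and the solver and judge call signatures) is hypothetical and chosen for illustration; it is not taken from the Mini-Autodata codebase, and a real Judge would likely be an LLM call rather than a string comparison.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    """A generated question with its grounding context (hypothetical shape)."""
    context: str
    question: str
    reference_answer: str


# Assumed signatures: a solver maps (context, question) to an answer string,
# and a judge decides whether an answer is correct for a candidate.
Solver = Callable[[str, str], str]
Judge = Callable[[Candidate, str], bool]


def exact_match_judge(candidate: Candidate, answer: str) -> bool:
    """Stand-in judge: case-insensitive exact match against the reference."""
    return answer.strip().lower() == candidate.reference_answer.strip().lower()


def passes_quality_gap(
    candidate: Candidate,
    weak_solver: Solver,
    strong_solver: Solver,
    judge: Judge,
) -> bool:
    """Accept only if the strong solver succeeds and the weak solver fails."""
    strong_answer = strong_solver(candidate.context, candidate.question)
    if not judge(candidate, strong_answer):
        # Unanswerable or ill-posed for the strong model: reject early.
        return False
    weak_answer = weak_solver(candidate.context, candidate.question)
    # If the weak model also succeeds, the question is too easy.
    return not judge(candidate, weak_answer)


# Stub solvers standing in for the remote strong model and the local weak model.
candidate = Candidate(
    context="The pump was serviced on 14 March.",
    question="When was the pump serviced?",
    reference_answer="14 March",
)
strong_stub: Solver = lambda context, question: "14 March"
weak_stub: Solver = lambda context, question: "I don't know"

accepted = passes_quality_gap(candidate, weak_stub, strong_stub, exact_match_judge)
```

Checking the strong solver first avoids spending a weak-solver call on questions that would be rejected anyway; the opposite order would work equally well if the local model is the cheaper call.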
🧠 Weak-vs-Strong Gap Mechanism: Guarantees dataset difficulty by filtering out questions that small models can answer correctly via general knowledge.
⚡ Hybrid Backend Architecture: Prevents OOM crashes and model-swapping latency by routing Strong agent roles to the DS2API, while dedicating local RAM entirely to the Weak Solver (llama3.2:1b).
🛡️ Anti-Cheating Prompt Engineering: Uses tightly controlled, layer-separated prompts to prevent "Context Leakage" (giving away the answer in the prompt) and "Semantic Cheating".
🔁 Automated Feedback Loop: The Evaluator provides structured, historical feedback to the Challenger agent to autonomously improve question quality across multiple rounds.
💾 Fault Tolerance (Checkpointing): Built-in state-saving prevents data loss during API timeouts or system interruptions.
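The feedback loop and checkpointing features above can be sketched together as follows. This is an assumption-laden illustration, not the project's actual code: `load_checkpoint`, `save_checkpoint`, `run_rounds`, the JSON state layout, and the `challenger`/`evaluate` signatures are all hypothetical. The sketch writes state to a temporary file and then renames it, so an interruption mid-write should not corrupt the previous checkpoint.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List, Tuple

# Assumed signatures: the challenger sees all prior feedback and returns a new
# question; evaluate returns (accepted, feedback_text) for that question.
Challenger = Callable[[List[str]], str]
Evaluate = Callable[[str], Tuple[bool, str]]


def load_checkpoint(path: str) -> Dict:
    """Return saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"next_round": 0, "accepted": [], "feedback": []}


def save_checkpoint(path: str, state: Dict) -> None:
    """Write state to a temp file in the same directory, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # replaces the old checkpoint in one step


def run_rounds(
    challenger: Challenger, evaluate: Evaluate, rounds: int, path: str
) -> Dict:
    """Run the generate/evaluate loop, resuming from the last saved round."""
    state = load_checkpoint(path)
    for round_index in range(state["next_round"], rounds):
        question = challenger(state["feedback"])  # history steers generation
        accepted, feedback = evaluate(question)
        if accepted:
            state["accepted"].append(question)
        state["feedback"].append(feedback)
        state["next_round"] = round_index + 1
        save_checkpoint(path, state)  # persist after every round
    return state
```

Calling `run_rounds` again with the same checkpoint path after a crash or API timeout skips the rounds already recorded and continues with the accumulated feedback history.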
Meta AI's Autodata blog post: https://facebookresearch.github.io/RAM/blogs/autodata/