DuckCa/Autodata

Autodata

An open-source, hybrid implementation of Meta AI's Autodata. Generates high-quality synthetic training data using Agentic Self-Instruct and Weak-to-Strong gaps.

Mini-Autodata: Agentic Self-Instruct Framework for High-Quality Synthetic Data

Mini-Autodata is a localized, hybrid implementation inspired by Meta AI's Autodata framework. It acts as an autonomous "data scientist," iteratively generating, evaluating, and refining training and benchmark data using an Agentic Self-Instruct pipeline.

Traditional synthetic data often fails to push the boundaries of model capabilities. Mini-Autodata addresses this with a multi-agent architecture (Challenger, Weak Solver, Strong Solver, and Judge). The system curates data strictly by a "Quality Gap": a question is accepted only if a Strong Model (e.g., DeepSeek API) succeeds while a Weak Model (e.g., local llama3.2:1b) fails. This is meant to ensure that the generated datasets require deep, context-grounded reasoning rather than just general knowledge.
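The "Quality Gap" acceptance rule can be illustrated with a minimal sketch. The names used here (`Candidate`, `passes_quality_gap`, and the solver and judge callables) are hypothetical and are not taken from the repository; in a real run the solvers would wrap calls to the local model and the remote API, and the judge would be an LLM rather than a string comparison.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    """A generated question with its grounding context (hypothetical structure)."""
    context: str
    question: str
    reference_answer: str


# Assumed signatures: a solver maps (context, question) to an answer string,
# and a judge decides whether an answer is correct for a given candidate.
Solver = Callable[[str, str], str]
Judge = Callable[[Candidate, str], bool]


def passes_quality_gap(candidate: Candidate, weak: Solver,
                       strong: Solver, judge: Judge) -> bool:
    """Accept a question only if the strong solver succeeds and the weak one fails."""
    weak_ok = judge(candidate, weak(candidate.context, candidate.question))
    strong_ok = judge(candidate, strong(candidate.context, candidate.question))
    return strong_ok and not weak_ok
```

Under this rule a question that both solvers answer correctly is rejected as too easy, and a question that both get wrong is rejected as unanswerable or ill-posed; only the gap case is kept.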

🧠 Weak-vs-Strong Gap Mechanism: Guarantees dataset difficulty by filtering out questions that small models can answer correctly via general knowledge.

⚡ Hybrid Backend Architecture: Avoids out-of-memory (OOM) crashes and model-swapping latency by routing the Strong agent roles to the DS2API, while dedicating local RAM entirely to the Weak Solver (llama3.2:1b).

🛡️ Anti-Cheating Prompt Engineering: Features tightly controlled layer-separated prompts to strictly prevent "Context Leakage" (giving away the answer in the prompt) and "Semantic Cheating".

🔁 Automated Feedback Loop: The Evaluator provides structured, historical feedback to the Challenger agent to autonomously improve question quality across multiple rounds.

💾 Fault Tolerance (Checkpointing): Built-in state-saving prevents data loss during API timeouts or system interruptions.
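The feedback loop and checkpointing described in the features above could be combined roughly as follows. This is a sketch under stated assumptions, not the repository's actual code: the function names, the JSON state layout, and the `challenger` and `evaluate` callables are all hypothetical stand-ins for the real agents.

```python
import json
import os
from typing import Callable, List, Tuple


def load_checkpoint(path: str) -> dict:
    """Resume from a saved state if one exists, else start fresh."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"round": 0, "accepted": [], "feedback": []}


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then swap it in, so a crash never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_rounds(challenger: Callable[[List[dict]], str],
               evaluate: Callable[[str], Tuple[bool, str]],
               path: str, max_rounds: int) -> dict:
    """Generate one question per round, feeding all past feedback to the challenger."""
    state = load_checkpoint(path)
    while state["round"] < max_rounds:
        # The challenger sees the full feedback history from earlier rounds.
        question = challenger(state["feedback"])
        accepted, note = evaluate(question)
        if accepted:
            state["accepted"].append(question)
        state["feedback"].append(
            {"question": question, "accepted": accepted, "note": note}
        )
        state["round"] += 1
        # Persist after every round so an API timeout costs at most one round.
        save_checkpoint(path, state)
    return state
```

Calling `run_rounds` again with the same `path` after an interruption picks up at the saved round instead of restarting, and the challenger still receives the feedback accumulated before the interruption.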

Link to the Meta AI blog post: https://facebookresearch.github.io/RAM/blogs/autodata/
