[New Skill]: Novelty Extractor (Data Distillation) #24
Description
Skill Name
optimization/novelty_extractor
What should this skill do?
The Problem: "Information slop" is making model training increasingly expensive and slow. Throwing massive, unfiltered datasets at a model wastes tokens on redundant, low-value information.
The Solution: An AI curation skill that scans massive datasets, compares their semantic vectors against a baseline, and extracts only the small fraction (on the order of 1%) that contains novel, high-learning-value data. This acts as data-distillation middleware, ensuring models are trained on pure signal rather than noise.
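The filtering step described above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: it assumes chunks arrive pre-embedded as plain vectors, and it treats a chunk as redundant when its cosine similarity to any baseline vector meets or exceeds the novelty threshold. The function names (`cosine`, `extract_novel`) and the policy of folding accepted chunks back into the baseline are hypothetical choices for the sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def extract_novel(chunks, baseline, threshold=0.85):
    """Keep chunks whose similarity to every baseline vector is below threshold.

    chunks: list of (text, embedding) pairs.
    baseline: list of embeddings representing already-known content (mutated in place).
    Returns (novel_texts, dropped_count).
    """
    novel, dropped = [], 0
    for text, vec in chunks:
        max_sim = max((cosine(vec, b) for b in baseline), default=0.0)
        if max_sim < threshold:
            novel.append(text)
            baseline.append(vec)  # accepted chunks join the baseline, so later duplicates of them are dropped too
        else:
            dropped += 1
    return novel, dropped
```

For example, with a baseline of `[[1.0, 0.0]]`, a near-duplicate chunk embedded as `[0.99, 0.1]` is dropped while an orthogonal chunk embedded as `[0.0, 1.0]` is kept as novel.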
Documentation Requirement:
When submitting a Pull Request for this skill, the contributor must provide:
- A reference card at `docs/skills/novelty_extractor.md` detailing the threshold heuristics.
- Updates to `docs/skills/README.md` listing this skill under the `optimization` category.
- Example usage in `examples/` showing how to pipe a large text corpus through this skill.
Ideal Inputs & Outputs
Input:
```json
{
  "dataset_chunk": "[10,000 words of raw forum data]",
  "novelty_threshold": 0.85
}
```
Output:
```json
{
  "distilled_content": "[150 words of unique, high-value assertions]",
  "compression_ratio": "98.5%",
  "redundant_chunks_dropped": 42
}
```
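To make the output contract concrete, here is a hedged sketch of how the output payload could be assembled, assuming `compression_ratio` means the percentage of words removed relative to the input chunk (the helper name `make_output` is hypothetical, not part of the proposed skill API):

```python
def make_output(distilled_content, original_word_count, dropped_count):
    """Assemble the skill's output payload from the distillation results."""
    kept_words = len(distilled_content.split())
    # Fraction of input words removed, expressed as a percentage.
    ratio = 100.0 * (1.0 - kept_words / original_word_count)
    return {
        "distilled_content": distilled_content,
        "compression_ratio": f"{ratio:.1f}%",
        "redundant_chunks_dropped": dropped_count,
    }
```

Under this reading, distilling a 10,000-word chunk down to 150 words yields the `"98.5%"` ratio shown in the example output.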
Targeted Models (if applicable)
Model Agnostic (All)