
[New Skill]: Novelty Extractor (Data Distillation) #24

@rosspeili

Description


Skill Name

optimization/novelty_extractor

What should this skill do?

The Problem: "Information slop" makes model training needlessly expensive and slow. Throwing massive, unfiltered datasets at a model wastes tokens on redundant, low-value information.
The Solution: An AI curation skill that scans a large dataset, compares its semantic vectors against a baseline corpus, and extracts only the small fraction (on the order of 1%) that carries novel, high-learning-value content. It acts as data-distillation middleware, ensuring models train on pure signal rather than noise.
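The core idea can be sketched as a similarity filter: embed each chunk, drop anything too close to the baseline (or to chunks already kept). The sketch below is a hypothetical illustration only; it uses a toy bag-of-words vectorizer in place of a real sentence-embedding model, and the function names (`embed`, `extract_novel`) are not part of any existing skill.

```python
# Toy sketch of novelty extraction via similarity thresholding.
# A real implementation would swap `embed` for a sentence-embedding model.
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy stand-in for a semantic embedding: lowercase word counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def extract_novel(chunks, baseline, novelty_threshold=0.85):
    """Keep chunks whose similarity to the baseline (and to chunks
    already kept) stays below the threshold; count the rest as dropped."""
    kept, kept_vecs = [], [embed(b) for b in baseline]
    dropped = 0
    for chunk in chunks:
        v = embed(chunk)
        if any(cosine(v, kv) >= novelty_threshold for kv in kept_vecs):
            dropped += 1          # redundant: too similar to known content
        else:
            kept.append(chunk)    # novel: below the similarity threshold
            kept_vecs.append(v)
    return kept, dropped
```

Note the design choice: keeping each accepted chunk in the comparison set also deduplicates the input against itself, not just against the baseline.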

Documentation Requirement:
When submitting a Pull Request for this skill, the contributor must provide:

  1. A reference card at docs/skills/novelty_extractor.md detailing the threshold heuristics.
  2. Updates to docs/skills/README.md listing this skill under the optimization category.
  3. Example usage in examples/ showing how to pipe a large text corpus through this skill.

Ideal Inputs & Outputs

Input:
{
  "dataset_chunk": "[10,000 words of raw forum data]",
  "novelty_threshold": 0.85
}

Output:
{
  "distilled_content": "[150 words of unique, high-value assertions]",
  "compression_ratio": "98.5%",
  "redundant_chunks_dropped": 42
}
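A wrapper honoring this I/O contract might look like the sketch below. It is hypothetical: the field names follow the example payloads above, the chunking rule (split on blank lines) is an assumption, and the distillation step is stubbed out as exact-duplicate removal (the threshold is accepted but unused in the stub).

```python
def run_skill(payload: dict) -> dict:
    """Hypothetical wrapper matching the skill's input/output contract."""
    chunks = payload["dataset_chunk"].split("\n\n")   # naive chunking assumption
    threshold = payload.get("novelty_threshold", 0.85)  # unused in this stub
    # Stub distillation: keep only the first occurrence of each chunk.
    seen, kept, dropped = set(), [], 0
    for c in chunks:
        if c in seen:
            dropped += 1
        else:
            seen.add(c)
            kept.append(c)
    distilled = "\n\n".join(kept)
    in_words = len(payload["dataset_chunk"].split())
    out_words = len(distilled.split())
    ratio = 100.0 * (1 - out_words / in_words) if in_words else 0.0
    return {
        "distilled_content": distilled,
        "compression_ratio": f"{ratio:.1f}%",
        "redundant_chunks_dropped": dropped,
    }
```

Reporting the compression ratio as a percentage of words removed keeps the output consistent with the example above, where a 10,000-word input distilled to 150 words yields 98.5%.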

Targeted Models (if applicable)

Model Agnostic (All)

Metadata

Labels

enhancement (New feature or request), skill request (Request for a new capability to be added)
