Feat/issue 24 novelty extractor #116
|
Thanks @rizzoMartin, this is a strong implementation of #24. The skill bundle is complete, the logic is clear, tests cover the important paths (empty input, redundancy filtering, baseline multi-turn, sentence strategy), and the catalog page with provider snippets is thorough. I ran the skill tests plus
What looks good
Please address before we run CI / merge
Skill notes (non-blocking)
Once rebased and the index/agent_loops/ |
009e9d6 to c5fbdf8
|
Thanks @rizzoMartin, this is in great shape and you addressed the review feedback well. I ran the skill + issuer/loader tests on Python 3.13 (23/23 pass). 🎉
One thing left before merge:
Optional nit: index
Once |
|
Hi @rosspeili, everything should be addressed now and ready for merge. Thank you for the thorough reviews and detailed feedback throughout the process; it made the implementation much clearer. And thank you for trusting me with a new skill contribution. Being the first external contributor to add a skill to the registry means a lot. Looking forward to contributing more to the project. |
|
Thanks @rizzoMartin, latest commit looks flawless! Feel free to open a relevant issue, or I will ping you for relevant category skills if you're interested. <3 This skill is yours to maintain, update, or create new issues against; you can use the skill name as an issue template, e.g. [novelty_extractor]. |
Description
Implements the `data_engineering/novelty_extractor` skill requested in #24.

The skill filters large text datasets by semantic novelty using local embeddings (`fastembed`, `BAAI/bge-small-en-v1.5`). It retains only chunks that carry genuinely new information above a configurable cosine similarity threshold, making it useful for training data distillation, corpus curation, and multi-turn deduplication pipelines.
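
For readers unfamiliar with the threshold check, here is a minimal sketch of how a novelty score could be derived from cosine similarity. It is not taken from `skill.py`: the function names and the definition "novelty = 1 minus the highest similarity to anything already seen" are assumptions made for illustration, and the vectors are hand-written rather than produced by `fastembed`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty_score(candidate, seen):
    """Assumed definition: 1 minus the highest similarity to any seen vector."""
    if not seen:
        return 1.0  # nothing to compare against, so everything is new
    return 1.0 - max(cosine_similarity(candidate, s) for s in seen)


# Hand-written 3-d vectors standing in for real embeddings.
seen = [[1.0, 0.0, 0.0]]
near_duplicate = [0.9, 0.1, 0.0]   # points almost the same way as seen[0]
fresh = [0.0, 0.0, 1.0]            # orthogonal to seen[0]

near_score = novelty_score(near_duplicate, seen)
fresh_score = novelty_score(fresh, seen)
```

Under this assumed definition, a chunk would be kept when its score meets or exceeds the configured threshold, so `fresh` would pass almost any setting while `near_duplicate` would only pass a threshold very close to zero.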
Logic (`skill.py`):
- Splits the input into chunks (`paragraph` or `sentence`)
- Embeds chunks locally with `fastembed` (no API key, no cloud dependency)
- Compares each chunk against `baseline_chunks`
- Keeps chunks that clear `novelty_threshold`; discards the rest
- Returns `distilled_content`, `compression_ratio`, and `redundant_chunks_dropped`

Cognition (`instructions.md`):
Explains when to invoke the skill, how to interpret outputs, and how to handle multi-turn filtering by passing `distilled_content` as `baseline_chunks` across turns.

Governance (`constitution`):
The skill is stateless by design: no hidden global state, no external calls, no logging or transmission of input content.
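
To make the multi-turn pattern concrete, here is a rough, stateless sketch of the flow described above. It is not the code in `skill.py`. The function names, the default threshold, the list type of `distilled_content`, and the `compression_ratio` formula (kept chunks divided by total chunks) are all assumptions, and a word-count vector (`toy_embed`) stands in for `fastembed` so the example runs without any model download.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_chunks(text, strategy="paragraph"):
    """Split on blank lines (paragraph) or after ., ! or ? (sentence)."""
    if strategy == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", text)
    else:
        parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


def extract_novel(dataset_chunk, baseline_chunks=(), novelty_threshold=0.3,
                  strategy="paragraph"):
    """Keep chunks whose assumed novelty (1 - max similarity) clears the threshold."""
    chunks = split_chunks(dataset_chunk, strategy)
    seen = [toy_embed(b) for b in baseline_chunks]
    kept = []
    for chunk in chunks:
        vec = toy_embed(chunk)
        max_sim = max((cosine(vec, s) for s in seen), default=0.0)
        if 1.0 - max_sim >= novelty_threshold:
            kept.append(chunk)
            seen.append(vec)  # later chunks are also compared against this one
    return {
        "distilled_content": kept,
        # Assumed formula: fraction of chunks that survived filtering.
        "compression_ratio": len(kept) / len(chunks) if chunks else 0.0,
        "redundant_chunks_dropped": len(chunks) - len(kept),
    }


# Turn 1: no baseline, so only the in-batch duplicate is dropped.
turn_1 = ("The cache uses LRU eviction.\n\n"
          "The cache uses LRU eviction.\n\n"
          "Writes go through a queue.")
first = extract_novel(turn_1)

# Turn 2: the previous output is passed back in as the baseline.
turn_2 = "Writes go through a queue.\n\nReads are served from replicas."
second = extract_novel(turn_2, baseline_chunks=first["distilled_content"])
```

Because nothing is stored between calls, the caller carries the state: whatever came back as `distilled_content` in one turn is handed in as `baseline_chunks` in the next. In this sketch the second turn keeps only the sentence about replicas, since the queue sentence already appears in the baseline.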
Type of Change (Matches Issue Templates)
- Skill (`manifest.yaml`, `skill.py`, and `instructions.md`)
- Framework (`base_skill.py`, `loader.py`, etc.)

Checklist (all PRs)
- Ran `python -m flake8 .` and `pytest tests/` locally (or the subset relevant to this change).

New or updated skill (complete only if this PR adds or changes a skill under `skills/`)
Skip this section for framework-only, documentation-only, or other PRs that do not touch the skill registry.
Bundle & metadata
- Bundle lives under `skills/<category>/<skill_name>/` (copied from `templates/python_skill/` or equivalent).
- `manifest.yaml` has `name`, `version`, `description`, valid `parameters`, and `constitution`.
- `manifest.yaml` includes `issuer` with real `name` and `email` (not template placeholders).
- `issuer.github` and `issuer.org` set when applicable.
- `requirements` and `env_vars` are documented when the skill needs them.

Logic, cognition, and UI
- `skill.py` is deterministic Python (no arbitrary LLM-generated code paths).
- `instructions.md` explains when and how to use the skill.
- `card.json` is present and its `issuer` matches `manifest.yaml` (`name` and `email` at minimum).

Tests & loader
- `test_skill.py` covers execution and schema expectations.
- `SkillLoader.load_skill("<category>/<skill_name>")` succeeds (or missing deps are documented).

Documentation & catalog
- `docs/skills/<skill_name>.md` exists or is updated (ID, Issuer, usage).
- `docs/skills/README.md` lists the skill with ID and Issuer.

Constitution & Safety (if adding or modifying a skill)
This skill performs read-only, local-only operations on the provided text. It does not store, log, or transmit any content from `dataset_chunk` or `baseline_chunks`. All processing is deterministic and in-memory. No external APIs are called.
Related Issues
Fixes #24