Feat/issue 24 novelty extractor by rizzoMartin · Pull Request #116 · ARPAHLS/skillware

rizzoMartin · 2026-05-24T11:02:13Z

Description

Implements the data_engineering/novelty_extractor skill requested in #24.

The skill filters large text datasets by semantic novelty using local embeddings
(fastembed, BAAI/bge-small-en-v1.5). It retains only chunks that carry
genuinely new information, meaning their maximum cosine similarity to content
already seen stays below a configurable threshold. This makes it useful for
training data distillation, corpus curation, and multi-turn deduplication pipelines.

Logic (skill.py):

Splits input text into chunks using a configurable strategy (paragraph or sentence)
Embeds all chunks in a single batch call using fastembed (no API key, no cloud dependency)
Computes cosine similarity against a seen-vectors list seeded by optional baseline_chunks
Keeps chunks where max similarity is below novelty_threshold; discards the rest
Returns distilled_content, compression_ratio, and redundant_chunks_dropped
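
The steps above can be sketched roughly as follows. This is a minimal illustration, not the skill.py from this PR: `toy_embed` is a hypothetical bag-of-words stand-in for the fastembed model so the sketch runs without a model download, and the function names, the default threshold of 0.85, the list-valued distilled_content, and the choice to add only kept chunks to the seen list are all assumptions. compression_ratio is computed here as the percentage of chunks dropped.

```python
import math
import re
from collections import Counter


def split_chunks(text, strategy="paragraph"):
    """Split on blank lines (paragraph) or after sentence punctuation (sentence)."""
    if strategy == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    elif strategy == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", text)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [p.strip() for p in parts if p.strip()]


def toy_embed(chunks):
    """Hypothetical stand-in embedder: L2-normalized bag-of-words vectors."""
    counts = [Counter(re.findall(r"[a-z0-9]+", c.lower())) for c in chunks]
    vocab = sorted(set().union(*counts))
    vectors = []
    for cnt in counts:
        vec = [float(cnt[w]) for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def extract_novel(text, novelty_threshold=0.85, strategy="paragraph",
                  baseline_chunks=None, embed=toy_embed):
    baseline = list(baseline_chunks or [])
    chunks = split_chunks(text, strategy)
    # Embed baseline and new chunks together in one batch call.
    vectors = embed(baseline + chunks)
    seen = vectors[:len(baseline)]
    kept = []
    for chunk, vec in zip(chunks, vectors[len(baseline):]):
        max_sim = max((cosine(vec, s) for s in seen), default=0.0)
        if max_sim < novelty_threshold:
            kept.append(chunk)
            seen.append(vec)  # assumption: only kept chunks extend the seen list
    dropped = len(chunks) - len(kept)
    ratio = 100.0 * dropped / len(chunks) if chunks else 0.0
    return {
        "distilled_content": kept,
        "compression_ratio": ratio,  # percentage of chunks dropped
        "redundant_chunks_dropped": dropped,
    }


sample = (
    "The cat sat on the mat.\n\n"
    "The cat sat on the mat.\n\n"
    "Quantum computers use qubits."
)
result = extract_novel(sample, novelty_threshold=0.9)
print(result["distilled_content"], result["redundant_chunks_dropped"])
```

With a real embedding model the similarity scores would reflect meaning rather than word overlap, so paraphrases could also be dropped; the control flow would stay the same.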

Cognition (instructions.md):
Explains when to invoke the skill, how to interpret outputs, and how to handle
multi-turn filtering by passing distilled_content as baseline_chunks across turns.
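
The multi-turn pattern described above might look like the sketch below. `run_skill` is a hypothetical stand-in, not the real skill: it swaps embeddings for Jaccard token overlap so the example is self-contained, and its threshold and return shape are assumptions. Only the parameter names dataset_chunk and baseline_chunks and the output key distilled_content come from this PR.

```python
import re


def token_set(text):
    """Lowercased alphanumeric tokens of a chunk."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two token sets (stand-in for cosine similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def run_skill(dataset_chunk, baseline_chunks=(), novelty_threshold=0.8):
    """Hypothetical stand-in for the skill: paragraph split, Jaccard similarity."""
    seen = [token_set(c) for c in baseline_chunks]
    kept = []
    for para in (p.strip() for p in dataset_chunk.split("\n\n")):
        if not para:
            continue
        tokens = token_set(para)
        if max((jaccard(tokens, s) for s in seen), default=0.0) < novelty_threshold:
            kept.append(para)
            seen.append(tokens)
    return {"distilled_content": kept}


def distill_turns(turns):
    """Feed each turn's distilled output back in as the next turn's baseline."""
    baseline = []
    for turn in turns:
        result = run_skill(turn, baseline_chunks=baseline)
        baseline.extend(result["distilled_content"])
    return baseline


turns = [
    "Alpha service handles logins.\n\nBeta service stores files.",
    "Beta service stores files.\n\nGamma service sends email.",
]
corpus = distill_turns(turns)
print(corpus)
```

Because the skill is stateless, the caller owns the accumulated baseline; nothing carries over between calls unless it is passed in explicitly.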

Governance (constitution):
The skill is stateless by design — no hidden global state, no external calls,
no logging or transmission of input content.

Type of Change (Matches Issue Templates)

Skill Proposal: New Skill (Contains manifest.yaml, skill.py, and instructions.md)
Bug Report Fix: Non-breaking change which fixes an execution error or framework bug
Doc Fix: Documentation Update
Framework Feature / RFC Updates: Core Framework Update (Changes to base_skill.py, loader.py, etc.)

Checklist (all PRs)

My code follows the Agent Code of Conduct.
I have run python -m flake8 . and pytest tests/ locally (or the subset relevant to this change).

New or updated skill (complete only if this PR adds or changes a skill under `skills/`)

Skip this section for framework-only, documentation-only, or other PRs that do not touch the skill registry.

Bundle & metadata

Skill lives at skills/<category>/<skill_name>/ (copied from templates/python_skill/ or equivalent).
manifest.yaml has name, version, description, valid parameters, and constitution.
manifest.yaml includes issuer with real name and email (not template placeholders).
Optional: issuer.github and issuer.org set when applicable.
requirements and env_vars are documented when the skill needs them.

Logic, cognition, and UI

skill.py is deterministic Python (no arbitrary LLM-generated code paths).
instructions.md explains when and how to use the skill.
card.json is present and its issuer matches manifest.yaml (name and email at minimum).

Tests & loader

test_skill.py covers execution and schema expectations.
SkillLoader.load_skill("<category>/<skill_name>") succeeds (or missing deps are documented).

Documentation & catalog

docs/skills/<skill_name>.md exists or is updated (ID, Issuer, usage).
docs/skills/README.md lists the skill with ID and Issuer.

Constitution & Safety (if adding or modifying a skill)

This skill performs read-only, local-only operations on the provided text.
It does not store, log, or transmit any content from dataset_chunk or
baseline_chunks. All processing is deterministic and in-memory. No external
APIs are called.

Related Issues

Fixes #24

rosspeili · 2026-05-24T13:32:34Z

Thanks @rizzoMartin, this is a strong implementation of #24. The skill bundle is complete, the logic is clear, tests cover the important paths (empty input, redundancy filtering, baseline multi-turn, sentence strategy), and the catalog page with provider snippets is thorough. I ran the skill tests plus test_skill_issuer / test_loader locally on Python 3.13.1, all passed.

What looks good

Full bundle: manifest.yaml, skill.py, instructions.md, card.json, test_skill.py, __init__.py
docs/skills/novelty_extractor.md with Usage Examples for all five providers
Catalog row in docs/skills/README.md
examples/novelty_extractor_demo.py, good local multi-turn distillation demo (matches the issue’s “pipe a corpus through this skill” ask)
Deterministic, local-only, no API keys, fits the constitution well

Please address before we run CI / merge

Rebase on latest main: your branch is behind #107 (examples/README.md and contributor workflow updates). Resolve any conflicts.
examples/README.md: add a row for novelty_extractor_demo.py (required since Add examples index for runnable skill loops #107; see CONTRIBUTING / AI-native workflow Stage 5).
docs/usage/agent_loops.md: add this skill to the reference matrix (even as “Local execute” for the demo script).
skills/data_engineering/novelty_extractor/__init__.py: currently empty; export the class like synthetic_generator does (from .skill import NoveltyExtractor).
Optional but valuable: the catalog page already has Gemini/Ollama snippets; if you have bandwidth, add runnable agent-loop scripts under examples/ (e.g. gemini_novelty_extractor.py patterned on gemini_wallet_check.py, and an Ollama prompt-mode script like ollama_tos_evaluator.py). The local demo is enough for merge; agent-loop examples would make discovery easier via the new examples index.
Small doc fix: the DeepSeek snippet in docs/skills/novelty_extractor.md uses os.environ but omits import os.
Unicode pass: GitHub flagged hidden characters on the PR; run a quick ASCII check on new files before re-pushing.
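
One way to run the ASCII check mentioned in the last item is sketched below. The helper names are hypothetical, and the scan simply reports every code point above 127; it does not try to reproduce whatever rule GitHub uses for its hidden-character warning.

```python
import unicodedata
from pathlib import Path


def find_non_ascii(text):
    """Return (line, column, char, unicode name) for each non-ASCII character."""
    hits = []
    # Split on "\n" only, so Unicode line separators are reported, not swallowed.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


def scan_files(paths):
    """Map each file path that contains non-ASCII characters to its hits."""
    report = {}
    for path in paths:
        hits = find_non_ascii(Path(path).read_text(encoding="utf-8"))
        if hits:
            report[str(path)] = hits
    return report


sample_hits = find_non_ascii("a - b\nc \u00b7 d")
for line_no, col, ch, name in sample_hits:
    print(f"line {line_no}, col {col}: U+{ord(ch):04X} {name}")
```

In practice `scan_files` would be pointed at the files added by the PR, for example the paths listed by `git diff --name-only main`.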

Skill notes (non-blocking)

Document the first-run model download (~50 MB) in the PR description or limitations; CI and the first test run will be slower.
compression_ratio is the % dropped (matches your tests); worth one explicit line in instructions.md so agents don’t misread it as “% kept”.
Consider noting in limitations that similarity uses a raw dot product (this works if fastembed vectors are normalized; your tests behave correctly).
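
The point about the raw dot product can be checked with a small worked example. For unit-length vectors the dot product equals the cosine similarity; for unnormalized vectors it does not. The vectors below are made up for illustration and say nothing about what fastembed actually returns.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    n = norm(a)
    return [x / n for x in a]


a = [3.0, 4.0]
b = [4.0, 3.0]

raw = dot(a, b)                          # 3*4 + 4*3 = 24.0
cos = cosine(a, b)                       # 24 / (5 * 5) = 0.96
unit = dot(normalize(a), normalize(b))   # matches cos up to float rounding

print(raw, cos, unit)
```

If the embeddings were not unit length, a threshold tuned for cosine similarity would be compared against values on a different scale, which is why the limitation is worth stating.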

Once rebased and the index/agent_loops/__init__.py items are in, we’ll run CI and this should be ready to merge. Really nice and smooth work, and promising addition to the registry. <3

rosspeili · 2026-05-25T12:48:54Z

Thanks @rizzoMartin, this is in great shape and you addressed the review feedback well. I ran the skill + issuer/loader tests on Python 3.13 (23/23 pass). 🎉

One thing left before merge: agent_loops.md still shows (catalog page) for Gemini/Ollama on novelty_extractor, but you added gemini_novelty_extractor.py and ollama_novelty_extractor.py to the index. Please align that row with examples/README.md.

Optional nit: the index's "Required extra" entry for the demo could mention pip install fastembed numpy (manifest requirement).

Once agent_loops is synced, good to merge from my side. 🚀

rizzoMartin · 2026-05-25T13:35:35Z

Hi @rosspeili, everything should be addressed now and ready for merge.

Thank you for the thorough reviews and detailed feedback throughout the process; it made the implementation much clearer. And thank you for trusting me with a new skill contribution. Being the first external contributor to add a skill to the registry means a lot. Looking forward to contributing more to the project.

rosspeili · 2026-05-25T16:57:28Z

Thanks @rizzoMartin, latest commit looks flawless! agent_loops.md and the index are in sync now, and I re-ran the skill + issuer/loader tests on Python 3.13 (23/23 pass). This is ready to merge, nice work, and welcome to the registry. 🎉 Can't wait to see other data eng skills from you.

Feel free to open a relevant issue, or I will ping you for relevant category skills if you're interested. <3 This skill is yours to maintain, update, or open new issues against; you can use the skill name as an issue template, e.g. [novelty_extractor].

rizzoMartin added 15 commits May 25, 2026 12:47

New files created, and manifest.yaml

cc771ed

feat Skill.py developed

38f391e

card.json and instructions.md added

b39b0f6

tests added

fcd0219

README.md modified to include the new noveltry-extractor skill

67b817f

Add novelty_extractor catalog page

0843f11

Add usage example of the skill

f6a03d9

Formatting and linting

665221a

script added to examples/README.md

8184850

script added to agent_loops.md

1ad48d9

Add novelty_extractor import in __init__.py

b549e1c

Add import os to DeepSeek in docs/skills/novelty_extractor.md

f0ab1d0

change · for -

d7d2120

added ollama and gemini examples

2a71d96

formatting and linting

c5fbdf8

rizzoMartin force-pushed the feat/issue-24-novelty-extractor branch from 009e9d6 to c5fbdf8 May 25, 2026 11:08

docs/usage/agent_loops.md and examples/README.md updated

2b2b3dd

rosspeili merged commit 5030981 into ARPAHLS:main May 25, 2026
5 checks passed

rosspeili merged 16 commits into ARPAHLS:main from rizzoMartin:feat/issue-24-novelty-extractor
