Feat/issue 24 novelty extractor #116

Merged
rosspeili merged 16 commits into ARPAHLS:main from rizzoMartin:feat/issue-24-novelty-extractor
May 25, 2026

Conversation

@rizzoMartin
Contributor

Description

Implements the data_engineering/novelty_extractor skill requested in #24.

The skill filters large text datasets by semantic novelty using local embeddings
(fastembed, BAAI/bge-small-en-v1.5). It retains only chunks that carry
genuinely new information, meaning their maximum cosine similarity to
already-seen content falls below a configurable threshold. This makes it useful
for training data distillation, corpus curation, and multi-turn deduplication
pipelines.

Logic (skill.py):

  • Splits input text into chunks using a configurable strategy (paragraph or sentence)
  • Embeds all chunks in a single batch call using fastembed (no API key, no cloud dependency)
  • Computes cosine similarity against a seen-vectors list seeded by optional baseline_chunks
  • Keeps chunks where max similarity is below novelty_threshold; discards the rest
  • Returns distilled_content, compression_ratio, and redundant_chunks_dropped
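The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from skill.py: `split_chunks`, `toy_embed`, `similarity`, and `filter_novel` are hypothetical names, the bag-of-words embedder is a dependency-free stand-in for fastembed, the 0.85 default is made up, and computing compression_ratio over chunk counts is an assumption.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Optional, Sequence

Vector = Dict[str, float]


def split_chunks(text: str, strategy: str = "paragraph") -> List[str]:
    """Split on blank lines ("paragraph") or after ., !, ? ("sentence")."""
    pattern = r"\n\s*\n" if strategy == "paragraph" else r"(?<=[.!?])\s+"
    return [c.strip() for c in re.split(pattern, text) if c.strip()]


def toy_embed(chunks: Sequence[str]) -> List[Vector]:
    """Stand-in for fastembed: sparse bag-of-words vectors with unit length."""
    vectors = []
    for chunk in chunks:
        counts = Counter(re.findall(r"\w+", chunk.lower()))
        norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
        vectors.append({word: v / norm for word, v in counts.items()})
    return vectors


def similarity(a: Vector, b: Vector) -> float:
    """Dot product; equals cosine similarity because vectors are unit length."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def filter_novel(
    text: str,
    novelty_threshold: float = 0.85,  # assumed default, not taken from the PR
    strategy: str = "paragraph",
    baseline_chunks: Optional[Sequence[str]] = None,
    embed: Callable[[Sequence[str]], List[Vector]] = toy_embed,
) -> Dict[str, object]:
    chunks = split_chunks(text, strategy)
    # Seed the seen-vectors list from the optional baseline.
    seen = embed(baseline_chunks) if baseline_chunks else []
    kept = []
    for chunk, vec in zip(chunks, embed(chunks)):  # one batch embed call
        max_sim = max((similarity(vec, s) for s in seen), default=0.0)
        if max_sim < novelty_threshold:
            kept.append(chunk)
            seen.append(vec)  # assumption: only kept chunks join the seen list
    dropped = len(chunks) - len(kept)
    # Assumption: ratio is the percentage of chunks dropped.
    ratio = 100.0 * dropped / len(chunks) if chunks else 0.0
    return {
        "distilled_content": kept,
        "compression_ratio": ratio,
        "redundant_chunks_dropped": dropped,
    }


sample = "The cat sat on the mat.\n\nThe cat sat on the mat.\n\nQuantum computers use qubits."
result = filter_novel(sample)
print(result["redundant_chunks_dropped"])  # 1 (the repeated paragraph)
```

Passing a different `embed` callable is where a real embedding model would plug in; the filtering loop itself does not depend on how the vectors are produced, as long as `similarity` matches their representation.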

Cognition (instructions.md):
Explains when to invoke the skill, how to interpret outputs, and how to handle
multi-turn filtering by passing distilled_content as baseline_chunks across turns.
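One way to read the multi-turn pattern is sketched below. `run_skill` is a hypothetical stand-in for the real skill (Jaccard word overlap replaces embeddings so the example runs without dependencies), and accumulating every turn's distilled_content into one growing baseline is an assumption about how the hand-off is meant to work.

```python
import re
from typing import Dict, List, Optional, Set


def words_of(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))


def run_skill(
    dataset_chunk: str,
    baseline_chunks: Optional[List[str]] = None,
    novelty_threshold: float = 0.85,  # assumed value
) -> Dict[str, List[str]]:
    """Hypothetical stand-in: keeps paragraphs whose Jaccard overlap with
    everything seen so far stays below the threshold."""
    seen = [words_of(c) for c in (baseline_chunks or [])]
    kept = []
    for para in re.split(r"\n\s*\n", dataset_chunk):
        para = para.strip()
        if not para:
            continue
        words = words_of(para)
        max_sim = max(
            (len(words & s) / len(words | s) for s in seen if words | s),
            default=0.0,
        )
        if max_sim < novelty_threshold:
            kept.append(para)
            seen.append(words)
    return {"distilled_content": kept}


turns = [
    "Alpha release notes.\n\nBeta release notes.",
    "Alpha release notes.\n\nGamma roadmap draft.",
]

baseline: List[str] = []
for turn in turns:
    result = run_skill(turn, baseline_chunks=baseline)
    # Feed this turn's novel chunks forward as the next turn's baseline.
    baseline.extend(result["distilled_content"])

print(baseline)
```

Because the state lives in the caller's `baseline` list, the skill call itself stays stateless, which is consistent with the governance note that follows.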

Governance (constitution):
The skill is stateless by design — no hidden global state, no external calls,
no logging or transmission of input content.

Type of Change (Matches Issue Templates)

  • Skill Proposal: New Skill (Contains manifest.yaml, skill.py, and instructions.md)
  • Bug Report Fix: Non-breaking change which fixes an execution error or framework bug
  • Doc Fix: Documentation Update
  • Framework Feature / RFC Updates: Core Framework Update (Changes to base_skill.py, loader.py, etc.)

Checklist (all PRs)

  • My code follows the Agent Code of Conduct.
  • I have run python -m flake8 . and pytest tests/ locally (or the subset relevant to this change).

New or updated skill (complete only if this PR adds or changes a skill under skills/)

Skip this section for framework-only, documentation-only, or other PRs that do not touch the skill registry.

Bundle & metadata

  • Skill lives at skills/<category>/<skill_name>/ (copied from templates/python_skill/ or equivalent).
  • manifest.yaml has name, version, description, valid parameters, and constitution.
  • manifest.yaml includes issuer with real name and email (not template placeholders).
  • Optional: issuer.github and issuer.org set when applicable.
  • requirements and env_vars are documented when the skill needs them.

Logic, cognition, and UI

  • skill.py is deterministic Python (no arbitrary LLM-generated code paths).
  • instructions.md explains when and how to use the skill.
  • card.json is present and its issuer matches manifest.yaml (name and email at minimum).

Tests & loader

  • test_skill.py covers execution and schema expectations.
  • SkillLoader.load_skill("<category>/<skill_name>") succeeds (or missing deps are documented).

Documentation & catalog

  • docs/skills/<skill_name>.md exists or is updated (ID, Issuer, usage).
  • docs/skills/README.md lists the skill with ID and Issuer.

Constitution & Safety (if adding or modifying a skill)

This skill performs read-only, local-only operations on the provided text.
It does not store, log, or transmit any content from dataset_chunk or
baseline_chunks. All processing is deterministic and in-memory. No external
APIs are called.

Related Issues

Fixes #24

@rosspeili
Contributor

Thanks @rizzoMartin, this is a strong implementation of #24. The skill bundle is complete, the logic is clear, tests cover the important paths (empty input, redundancy filtering, baseline multi-turn, sentence strategy), and the catalog page with provider snippets is thorough. I ran the skill tests plus test_skill_issuer / test_loader locally on Python 3.13.1, all passed.

What looks good

  • Full bundle: manifest.yaml, skill.py, instructions.md, card.json, test_skill.py, __init__.py
  • docs/skills/novelty_extractor.md with Usage Examples for all five providers
  • Catalog row in docs/skills/README.md
  • examples/novelty_extractor_demo.py, good local multi-turn distillation demo (matches the issue’s “pipe a corpus through this skill” ask)
  • Deterministic, local-only, no API keys, fits the constitution well

Please address before we run CI / merge

  1. Rebase on latest main, your branch is behind #107 (examples/README.md and contributor workflow updates). Rebase and resolve any conflicts.

  2. examples/README.md, add a row for novelty_extractor_demo.py (required since #107, "Add examples index for runnable skill loops"; see CONTRIBUTING / AI-native workflow Stage 5).

  3. docs/usage/agent_loops.md, add this skill to the reference matrix (even as “Local execute” for the demo script).

  4. skills/data_engineering/novelty_extractor/__init__.py — currently empty; export the class like synthetic_generator does (from .skill import NoveltyExtractor).

  5. Optional but valuable, the catalog page already has Gemini/Ollama snippets; if you have bandwidth, add runnable agent-loop scripts under examples/ (e.g. gemini_novelty_extractor.py patterned on gemini_wallet_check.py, and an Ollama prompt-mode script like ollama_tos_evaluator.py). The local demo is enough for merge; agent-loop examples would make discovery easier via the new examples index.

  6. Small doc fix, DeepSeek snippet in docs/skills/novelty_extractor.md uses os.environ but omits import os.

  7. Unicode pass, GitHub flagged hidden characters on the PR; quick ASCII check on new files before re-push.
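For the ASCII check in item 7, a small script along these lines could be used. It is a sketch, not a project tool: `find_non_ascii` and `scan` are hypothetical helper names, and it flags every non-ASCII character, which is stricter than GitHub's hidden-character warning.

```python
import unicodedata
from pathlib import Path
from typing import Iterable, List, Tuple


def find_non_ascii(path: Path) -> List[Tuple[int, int, str]]:
    """Return (line, column, character name) for each non-ASCII character."""
    hits = []
    text = path.read_text(encoding="utf-8")
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                # Fall back to the code point when the character has no name.
                name = unicodedata.name(ch, f"U+{ord(ch):04X}")
                hits.append((line_no, col, name))
    return hits


def scan(paths: Iterable[str]) -> int:
    """Print one line per finding and return the total number of findings."""
    total = 0
    for raw in paths:
        for line_no, col, name in find_non_ascii(Path(raw)):
            print(f"{raw}:{line_no}:{col}: {name}")
            total += 1
    return total
```

Calling `scan(["skills/data_engineering/novelty_extractor/skill.py"])` on each new file before re-pushing would list any characters worth a second look; a return value of 0 means the file is plain ASCII.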

Skill notes (non-blocking)

  • Document first-run model download (~50 MB) in PR description or limitations, CI/first test run will be slower.
  • compression_ratio is % dropped (matches your tests); worth one explicit line in instructions.md so agents don’t misread it as “% kept”.
  • Consider noting in limitations that similarity uses raw dot product (works if fastembed vectors are normalized, your tests behave correctly).
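The dot-product note can be checked with a small worked example. The vectors below are made up for illustration; the point is only that a raw dot product matches cosine similarity once both vectors have unit length, and overstates it otherwise.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(a: List[float]) -> List[float]:
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


a, b = [3.0, 4.0], [4.0, 3.0]

raw = dot(a, b)                         # 24.0, not a valid similarity score
cos = cosine(a, b)                      # 24 / (5 * 5) = 0.96
unit = dot(normalize(a), normalize(b))  # matches cos up to rounding

print(raw, cos, unit)
```

If an embedding backend ever returned unnormalized vectors, a threshold tuned for cosine similarity would stop behaving as intended, which is why the limitation is worth writing down.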

Once rebased and the index/agent_loops/__init__.py items are in, we’ll run CI and this should be ready to merge. Really nice and smooth work, and promising addition to the registry. <3

@rizzoMartin force-pushed the feat/issue-24-novelty-extractor branch from 009e9d6 to c5fbdf8 on May 25, 2026 11:08
@rosspeili
Contributor

Thanks @rizzoMartin, this is in great shape and you addressed the review feedback well. I ran the skill + issuer/loader tests on Python 3.13 (23/23 pass). 🎉

One thing left before merge: agent_loops.md still shows (catalog page) for Gemini/Ollama on novelty_extractor, but you added gemini_novelty_extractor.py and ollama_novelty_extractor.py to the index, please align that row with examples/README.md.

Optional nit: index Required extra for the demo could mention pip install fastembed numpy (manifest requirement).

Once agent_loops is synced, good to merge from my side. 🚀

@rizzoMartin
Contributor Author

Hi @rosspeili, everything should be addressed now and ready for merge.

Thank you for the thorough reviews and detailed feedback throughout the process; it made the implementation much clearer. And thank you for trusting me with a new skill contribution. Being the first external contributor to add a skill to the registry means a lot. Looking forward to contributing more to the project.

@rosspeili
Contributor

Thanks @rizzoMartin, latest commit looks flawless! agent_loops.md and the index are in sync now, and I re-ran the skill + issuer/loader tests on Python 3.13 (23/23 pass). This is ready to merge, nice work, and welcome to the registry. 🎉 Can't wait to see other data eng skills from you.

Feel free to open a relevant issue, or I will ping you for relevant category skills if you're interested. <3 This skill is yours to maintain, update, or create new issues against; you can use the skill name as an issue template, e.g. [novelty_extractor].

@rosspeili merged commit 5030981 into ARPAHLS:main on May 25, 2026
5 checks passed

Development

Successfully merging this pull request may close these issues.

[New Skill]: Novelty Extractor (Data Distillation)

2 participants