feat: Croissant ML dataset discovery via federation-native PAP agents #291
Merged
toadkicker merged 3 commits into main on Apr 15, 2026
Adds parallel multi-source ML dataset discovery routed through the full
6-phase PAP handshake. A new `schema:DatasetAction` intent routes dataset
queries to all matching zero-disclosure agents concurrently via JoinSet,
rather than picking a single top-scored agent.
Architecture:
- Two TOML catalog agents (HuggingFace Hub, OpenML) require zero disclosure
and self-configure via the existing FederatedRegistry — adding a third
source requires only a new .toml file, no Rust changes
- `canvas_discover_datasets` fans out to all `schema:DatasetAction` agents
in parallel, merges results sorted by preference-blended relevance score,
and emits progressive `block_updated` events during the JoinSet fan-out
- Long-horizon memex integration: FTS5 prior-art check (7-day TTL) surfaces
cached results immediately while fresh handshakes run; per-agent episodes
feed PreferenceEngine signals so the system learns provider preference over time
- `DatasetState` Leptos context (keyed by block_id) carries discovery phase
FSM and memex hints to the `DatasetSearchTemplate` block renderer
- Mandate TTL set to 7 days — dataset metadata is durable; the existing
Active→Degraded→ReadOnly decay UI prompts re-query naturally
New files: dataset_types.rs, dataset_discovery.rs, dataset_template.rs,
dataset.rs (state), huggingface_datasets.toml, openml_datasets.toml
All 385 tests passing; papillon-shared, papillon backend, and papillon-frontend
WASM targets compile clean.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
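The fan-out and merge described in this commit can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the PR's code: scoped threads stand in for the tokio `JoinSet`, the two stub functions stand in for the 6-phase PAP handshake, and `DatasetHit`, `AgentQuery`, `discover_datasets`, and the preference weights are hypothetical names.

```rust
use std::collections::HashMap;
use std::thread;

/// One dataset hit returned by a catalog agent (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct DatasetHit {
    pub name: String,
    pub source_agent: String,
    pub relevance_score: f64,
}

/// Stand-in for "run the handshake against one agent and return its hits".
pub type AgentQuery = fn(&str) -> Vec<DatasetHit>;

/// Stub agent: pretends to be the HuggingFace catalog agent.
pub fn huggingface_stub(query: &str) -> Vec<DatasetHit> {
    vec![DatasetHit {
        name: format!("hf/{query}"),
        source_agent: "huggingface".to_string(),
        relevance_score: 0.8,
    }]
}

/// Stub agent: pretends to be the OpenML catalog agent.
pub fn openml_stub(query: &str) -> Vec<DatasetHit> {
    vec![DatasetHit {
        name: format!("openml/{query}"),
        source_agent: "openml".to_string(),
        relevance_score: 0.9,
    }]
}

/// Query every agent concurrently, blend each score with a per-agent
/// preference weight, and return the hits sorted best-first.
pub fn discover_datasets(
    query: &str,
    agents: &[AgentQuery],
    preference: &HashMap<String, f64>,
) -> Vec<DatasetHit> {
    // Fan out: spawn one scoped thread per agent before joining any of them.
    let mut merged: Vec<DatasetHit> = thread::scope(|s| {
        let handles: Vec<_> = agents
            .iter()
            .map(|run| s.spawn(move || run(query)))
            .collect();
        // An agent whose thread panicked contributes no hits.
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap_or_default())
            .collect()
    });

    // Blend: agents without a recorded preference keep their raw score.
    for hit in &mut merged {
        let weight = preference.get(&hit.source_agent).copied().unwrap_or(1.0);
        hit.relevance_score *= weight;
    }

    // Merge: highest blended score first.
    merged.sort_by(|a, b| b.relevance_score.total_cmp(&a.relevance_score));
    merged
}
```

With a preference weight of 1.5 for `huggingface`, its hit (0.8 raw) outranks the OpenML hit (0.9 raw) after blending. The sketch omits the progressive `block_updated` events and the memex prior-art check.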
Closes gaps identified in intent rule coverage:
- Free Dictionary: 3 new tests (define/meaning of/definition of). The rule
had zero coverage previously; any regression would have gone undetected
- Clean-query stripping: 3 tests verify the third return value of
detect_intent() (previously almost never checked); confirms "dataset
sentiment analysis" strips to "sentiment analysis", "define photosynthesis"
to "photosynthesis", etc.
- Rule ordering conflicts: 3 tests document first-match-wins behavior for
ambiguous inputs (weather beats dataset, dataset beats books, dataset beats
arXiv). This establishes intended semantics and guards against rule reordering
- Missing keywords: 2 tests for "temperature" (weather rule) and
"population of" (REST Countries rule), which were declared in RULES but
never exercised
All 47 intent tests pass; 0 regressions.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
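The first-match-wins ordering and clean-query stripping that these tests pin down can be illustrated as follows. This is only a sketch: the real `detect_intent()` and `RULES` are not shown in this PR page, so the rule table, the action labels other than `schema:DatasetAction`, the tuple layout `(action, matched keyword, clean query)`, and the name `detect_intent_sketch` are all assumptions.

```rust
/// Hypothetical ordered rule table: earlier rules win over later ones.
pub const RULES: &[(&str, &[&str])] = &[
    ("weather", &["weather", "temperature"]),
    ("schema:DatasetAction", &["dataset"]),
    ("dictionary", &["definition of", "meaning of", "define"]),
    ("books", &["book"]),
    ("arxiv", &["arxiv"]),
];

/// Returns (action, matched keyword, clean query) for the first rule with a
/// keyword contained in the query, or None when no rule matches.
/// The clean query is the lowercased input with the keyword removed.
pub fn detect_intent_sketch(query: &str) -> Option<(&'static str, &'static str, String)> {
    let lowered = query.to_lowercase();
    for &(action, keywords) in RULES {
        for &keyword in keywords {
            if let Some(pos) = lowered.find(keyword) {
                // Cut the keyword out and normalise the leftover whitespace.
                let mut rest = String::new();
                rest.push_str(&lowered[..pos]);
                rest.push(' ');
                rest.push_str(&lowered[pos + keyword.len()..]);
                let clean = rest.split_whitespace().collect::<Vec<_>>().join(" ");
                return Some((action, keyword, clean));
            }
        }
    }
    None
}
```

Because the table is scanned top to bottom, "weather dataset for Berlin" resolves to the weather rule even though it also contains "dataset". Reordering the table would silently change that outcome, which is the regression the ordering tests guard against.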
After rebasing on main (BlockUpdate typed events + BlockContext provider
pattern), the 6 emit sites in dataset_discovery.rs used the old CanvasBlock
struct in BlockEvent, which now expects BlockUpdate.
Changes:
- Replace CanvasBlock with BlockUpdate at all 6 emit sites, dropping
linked_block_ids (frontend-only field, absent from BlockUpdate by design)
- Introduce HandshakeConfig struct to reduce run_dataset_handshake from 8 to
5 args (fixes clippy::too_many_arguments)
- Add HandshakeTaskResult type alias for JoinSet element type (fixes
clippy::type_complexity)
- Use *= for relevance_score blending (fixes clippy::assign_op_pattern)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
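The two clippy fixes follow a common pattern: group arguments that always travel together into a struct, and name a nested result type with an alias. The sketch below shows the pattern only. The fields of `HandshakeConfig`, the shape of `HandshakeTaskResult`, and the function body are hypothetical, since the PR page does not show the real definitions.

```rust
use std::time::Duration;

/// Hypothetical parameter object: values that every handshake call needs.
#[derive(Debug, Clone)]
pub struct HandshakeConfig {
    pub agent_did: String,
    pub intent: String,
    pub zero_disclosure: bool,
    pub mandate_ttl: Duration,
}

/// Hypothetical alias for the per-task result: the agent DID paired with
/// either the dataset names it returned or an error message.
pub type HandshakeTaskResult = (String, Result<Vec<String>, String>);

/// Placeholder for the handshake: succeeds only for zero-disclosure agents.
pub fn run_dataset_handshake_sketch(query: &str, config: &HandshakeConfig) -> HandshakeTaskResult {
    let outcome = if config.zero_disclosure {
        Ok(vec![format!("{}:{}", config.intent, query)])
    } else {
        Err("agent requires disclosure".to_string())
    };
    (config.agent_did.clone(), outcome)
}
```

Callers build one `HandshakeConfig` per agent and pass it by reference, so adding a new setting later changes the struct rather than every call site.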
Benchmark Regression Report
Threshold: 10% regression vs baseline from main
Summary
- Two TOML catalog agents (`schema:DatasetAction`) for HuggingFace Dataset Search and OpenML Dataset Search: pure TOML, zero new Rust trait code, auto-discovered by `FederatedRegistry.query_local_satisfiable()`
- `JoinSet` handshake fan-out: All registered zero-disclosure dataset agents run the full 6-phase PAP handshake concurrently; results merged by preference-biased relevance score
- Memex integration: cached results surface as `RestoredFromMemex` immediately while fresh handshakes run in parallel; per-agent and aggregate episodes recorded; `PreferenceEngine` signals fed back for future ranking
- `DatasetSearchTemplate` block renderer: Reads `schema:ItemList` JSON-LD, shows memex provenance banner, source-agent badges, open-license tint, and receipt provenance footer
- Intent tests: `schema:DatasetAction` rule, free-dictionary coverage, strip-behaviour and ordering-conflict regression tests
- `BlockUpdate` migration: Adapted all 6 emit sites to the `BlockUpdate` wire type from "refactor: BlockUpdate typed events + BlockContext provider pattern" #289 (reviewed the refactor before applying; it is correct: compiler-enforced separation of backend vs. frontend-only state)

Architecture notes
Adding a new dataset source = add one TOML file. No Rust changes required. Federation self-configuration: `publish_agent(did, "pap://shared-registry")` makes any Papillon instance auto-discover it via `FederatedRegistry.merge_remote()`.

Test plan
- `cargo test --workspace` passes (271 shared-crate tests, all intent tests including 15 new ones)
- `cargo clippy --workspace --all-targets -- -D warnings` clean
- `cargo fmt --all -- --check` clean
- `cargo check --target wasm32-unknown-unknown` clean in `apps/papillon/frontend`

🤖 Generated with Claude Code
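The claim in the architecture notes, that a new source needs only a new TOML file, follows from agent selection being data-driven. A minimal sketch of that selection step, assuming a manifest has already been parsed from TOML: `AgentManifest`, its fields, and `zero_disclosure_agents` are hypothetical and are not the `FederatedRegistry` API.

```rust
/// Hypothetical in-memory view of one agent manifest after its TOML file is loaded.
#[derive(Debug, Clone, PartialEq)]
pub struct AgentManifest {
    pub name: String,
    pub action: String,
    /// Fields the agent needs disclosed; empty means zero disclosure.
    pub required_disclosures: Vec<String>,
}

/// Select every agent that handles `action` and requires no disclosure.
/// Nothing here names a specific provider, so a newly loaded manifest
/// is picked up without code changes.
pub fn zero_disclosure_agents<'a>(
    registry: &'a [AgentManifest],
    action: &str,
) -> Vec<&'a AgentManifest> {
    registry
        .iter()
        .filter(|agent| agent.action == action && agent.required_disclosures.is_empty())
        .collect()
}
```

Appending a third manifest with `action` set to `schema:DatasetAction` and no required disclosures makes it part of the next fan-out automatically.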