feat(fetch): add FreshCandidateCollector primitive for selection-time filtering #1808
Merged
… filtering Source-agnostic helper for fetch handlers that paginate or scan an external source and need to skip already-processed and currently-claimed candidates while continuing to look for fresh work. Handlers feed candidate identifiers to the collector imperatively. The collector reuses ExecutionContext::isItemProcessed/isItemClaimed (so the datamachine_should_reprocess_item filter applies consistently with final fetch dedupe) and exposes diagnostics: raw_seen, accepted, processed_skipped, claimed_skipped, duplicate_skipped, reprocess_accepted, max_items, and source_exhausted. Selection-time filtering only — FetchHandler::get_fetch_data() still owns final dedupe/claim/cap. Replaces the per-extension overfetch / pagination / processed-list workarounds in Intelligence MCP, data-machine-socials Reddit, and data-machine-events. Refs #1807.
Summary
Adds a source-agnostic Data Machine core primitive for fresh candidate collection before fetch dedupe, per #1807.
Fetch handlers that page through external sources need to skip already-processed and currently-claimed candidates while continuing to scan until they find schedulable work. Today that behavior lives as ad-hoc shims in Intelligence MCP, data-machine-socials Reddit, and data-machine-events. This PR adds the generic primitive those consumers can converge on with no backwards compatibility shims.
Located at DataMachine\Core\Steps\Fetch\FreshCandidateCollector.
Design
Handlers drive the collector imperatively as they paginate:
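The original snippet for this step is not reproduced here, so what follows is only a minimal, self-contained sketch of the pattern, not the real class. `SketchCollector`, `offer()`, `markSourceExhausted()` and `getAcceptedIds()` are hypothetical names; only `isFull()`, `getDiagnostics()` and the diagnostic keys come from this PR. The two closures stand in for `ExecutionContext::isItemProcessed()` and `isItemClaimed()`, the check order is an assumption, and the sketch models neither `reprocess_accepted` nor the `max_items = 0` case.

```php
<?php
declare(strict_types=1);

/** Illustrative stand-in only; NOT the real FreshCandidateCollector API. */
final class SketchCollector
{
    /** @var array<string, true> every identifier offered so far */
    private array $seen = [];
    /** @var list<string> identifiers accepted as fresh work */
    private array $acceptedIds = [];
    /** @var array<string, int> skip counters, keyed like the PR's diagnostics */
    private array $counts = [
        'raw_seen' => 0,
        'processed_skipped' => 0,
        'claimed_skipped' => 0,
        'duplicate_skipped' => 0,
    ];
    private bool $sourceExhausted = false;

    public function __construct(
        private \Closure $isProcessed, // stand-in for ExecutionContext::isItemProcessed()
        private \Closure $isClaimed,   // stand-in for ExecutionContext::isItemClaimed()
        private int $maxItems          // assumed positive in this sketch
    ) {
    }

    public function isFull(): bool
    {
        return count($this->acceptedIds) >= $this->maxItems;
    }

    /** Hypothetical method: returns true when $id is accepted as fresh work. */
    public function offer(string $id): bool
    {
        if ($this->isFull()) {
            return false; // short-circuit: nothing is counted once full
        }
        $this->counts['raw_seen']++;
        if (isset($this->seen[$id])) {
            $this->counts['duplicate_skipped']++;
            return false;
        }
        $this->seen[$id] = true;
        if (($this->isProcessed)($id)) {
            $this->counts['processed_skipped']++;
            return false;
        }
        if (($this->isClaimed)($id)) {
            $this->counts['claimed_skipped']++;
            return false;
        }
        $this->acceptedIds[] = $id;
        return true;
    }

    /** Hypothetical method: the handler ran out of pages before filling up. */
    public function markSourceExhausted(): void
    {
        $this->sourceExhausted = true;
    }

    /** @return list<string> */
    public function getAcceptedIds(): array
    {
        return $this->acceptedIds;
    }

    public function getDiagnostics(): array
    {
        return $this->counts + [
            'accepted' => count($this->acceptedIds),
            'max_items' => $this->maxItems,
            'source_exhausted' => $this->sourceExhausted,
        ];
    }
}

// Made-up paged source: each inner array is one page of candidate identifiers.
$pages = [['t3_a', 't3_b', 't3_c'], ['t3_c', 't3_d', 't3_e'], ['t3_f']];
$processed = ['t3_a' => true, 't3_d' => true];
$claimed = ['t3_b' => true];

$collector = new SketchCollector(
    fn (string $id): bool => isset($processed[$id]),
    fn (string $id): bool => isset($claimed[$id]),
    2
);

foreach ($pages as $page) {
    foreach ($page as $id) {
        $collector->offer($id);
        if ($collector->isFull()) {
            break 2; // enough fresh work: stop paginating
        }
    }
}
if (!$collector->isFull()) {
    $collector->markSourceExhausted();
}

echo json_encode($collector->getDiagnostics()), PHP_EOL;
```

With this made-up data the loop accepts `t3_c` and `t3_e` and stops partway through the second page, so the third page is never read. The accepted identifiers would still go through `FetchHandler::get_fetch_data()` for the authoritative dedupe/claim/cap.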
The collector reuses ExecutionContext::isItemProcessed() and isItemClaimed(), so datamachine_should_reprocess_item semantics stay consistent with the existing core dedupe path.
Selection-time, not authoritative
Final centralized dedupe/claim/cap in FetchHandler::get_fetch_data() is unchanged. The collector is purely a selection-time aid that lets handlers stop scanning once they have enough fresh work. Items it accepts still flow through the standard FetchHandler pipeline where the authoritative claim happens.
Diagnostics
getDiagnostics() returns raw_seen, accepted, processed_skipped, claimed_skipped, duplicate_skipped, reprocess_accepted, max_items, and source_exhausted. Cheap integers + bool: handlers can log them verbatim or surface them through engine data.
Tests
15 new unit tests under tests/Unit/Core/Steps/Fetch/FreshCandidateCollectorTest.php covering:
… isFull() short-circuit.
… max_items = 0).
The tests stub ExecutionContext via PHPUnit mocks plus a small subclass so the primitive runs without a database, matching the pattern already used in FetchHandlerDataPacketTest.
Test results
homeboy test data-machine --skip-lint:
… ImageGenerationPromptRefinementTest::test_refine_prompt_includes_post_context_when_provided).
The pre-existing failure is unrelated to this change and reproduces on origin/main.
vendor/bin/phpcs runs clean on both new files. Repo-wide homeboy lint reports 578 pre-existing JS findings; none are in the new PHP files.
Follow-ups (separate PRs, not in this scope)
Closes #1807 once the follow-up repo migrations land.
AI assistance