v0.11.1: default aggregator-page URL filter (mdBook /print.html, Hugo /_print/) #34
Merged
…o /_print/)

Reject single-render-of-whole-tree aggregator pages during crawl-time URL filtering. These pages contain the entire docs tree on one URL, so embedding-based retrieval ranks them above the dedicated chapter pages a user actually wants.

Patterns rejected pre-fetch (saves crawl budget): `*/print.html`, `*/_print`, `*/_print/`, `*/_print/*`, `*/print/index.html`

Opt out via the `include_aggregator_pages=True` engine kwarg or the `--include-aggregators` CLI flag for offline-archive use cases.

Motivation from the llm-crawler-benchmarks v1.4 cycle: markcrawl was returning `/print.html` in 49% of rust-book top-5 retrieval slots and `/_print/` in 39% of kubernetes-docs slots, while four of the five well-functioning competitors returned 0% `/_print/` on kubernetes-docs. Predicted MRR lift on the 9-site bench pool: +0.02 to +0.04, concentrated on rust-book and kubernetes-docs.

36 new tests in `tests/test_v011_1_aggregator_filter.py` covering default rejection, substring-match safety (`/blueprint.html`, `/preprint.html`, `/imprint/` all pass through), the opt-out flag, composition with user-supplied `exclude_paths` and `include_paths`, and sync + async engine parity. 647 tests total (was 611), no regressions.
AIMLPM added a commit that referenced this pull request May 16, 2026
Pre-existing lint failure on main since PR #34 (v0.11.1 ship) — ruff wants underscore-prefixed names sorted first in the import list. Applied `ruff check --fix`. No semantic change to the tests.
Summary
- Reject aggregator pages (`/print.html`, `/_print/`) during crawl-time URL filtering. These pages contain the entire docs tree on one URL and dominate cosine-similarity rankings in embedding-based retrieval, crowding out the dedicated chapter pages a user actually wants.
- Opt out via the `include_aggregator_pages=True` engine kwarg / `--include-aggregators` CLI flag for offline-archive use cases.

Motivation
From the public `llm-crawler-benchmarks` v1.4 cycle audit (`/print.html`, `/_print/`): Markcrawl is the only top-tier tool that didn't filter these pages. The 9-12 retrieval-bucket misses concentrated on this in v1.4 are addressable with a single 5-line filter at the URL-frontier hook in `path_excluded`.

Predicted MRR lift on the 9-site bench pool: +0.02 to +0.04, concentrated on rust-book and kubernetes-docs.
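A minimal sketch of what such a pre-fetch filter could look like. The pattern tuple mirrors the patterns listed in this PR, but the function name `path_excluded_sketch`, its signature, and the use of `fnmatchcase` for matching are assumptions for illustration, not markcrawl's actual code.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Patterns as listed in the PR description. The matching logic below is an
# assumption (shell-style globbing on the URL path), not markcrawl's code.
_DEFAULT_AGGREGATOR_PATH_PATTERNS = (
    "*/print.html",
    "*/_print",
    "*/_print/",
    "*/_print/*",
    "*/print/index.html",
)


def path_excluded_sketch(url, exclude_paths=(), include_aggregator_pages=False):
    """Return True if the URL should be dropped before it is fetched."""
    path = urlsplit(url).path or "/"
    # Aggregator patterns are checked first, unless the caller opted out.
    if not include_aggregator_pages and any(
        fnmatchcase(path, pat) for pat in _DEFAULT_AGGREGATOR_PATH_PATTERNS
    ):
        return True
    # User-supplied exclusions apply either way.
    return any(fnmatchcase(path, pat) for pat in exclude_paths)


# Hypothetical URLs for illustration only.
print(path_excluded_sketch("https://example.com/book/print.html"))
print(path_excluded_sketch("https://example.com/book/print.html",
                           include_aggregator_pages=True))
```

Under these assumptions the first call reports the page as excluded and the second, with the opt-out set, lets it through. Because the check happens on the URL alone, no request is spent on the aggregator page.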
What's in the patch
- `markcrawl/core.py`: `_DEFAULT_AGGREGATOR_PATH_PATTERNS` tuple; `include_aggregator_pages: bool = False` param threaded through both engines + wrapper functions; `path_excluded` checks aggregator patterns before user-supplied `exclude_paths`
- `markcrawl/cli.py`: `--include-aggregators` flag, threaded through both `crawl()` call sites
- `tests/test_v011_1_aggregator_filter.py`
- `CHANGELOG.md`
- `pyproject.toml`

Substring-match safety
Patterns are anchored to avoid over-matching:
- `/book/print.html`
- `/blueprint.html`: `print` is mid-word
- `/preprint.html`
- `/imprint/`
- `/_printer-friendly/css.css`: `_print` is prefix-only
- `/_print/index.html`

All six cases have explicit tests.
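The six paths above can be checked against the listed patterns with a short script. This is a sketch that assumes shell-style glob matching via `fnmatchcase`; the helper `is_aggregator_path` is hypothetical and the real matcher in `markcrawl/core.py` may differ.

```python
from fnmatch import fnmatchcase

# Patterns copied from the PR description.
AGGREGATOR_PATTERNS = (
    "*/print.html",
    "*/_print",
    "*/_print/",
    "*/_print/*",
    "*/print/index.html",
)


def is_aggregator_path(path):
    """True if the path matches any aggregator pattern (assumed glob semantics)."""
    return any(fnmatchcase(path, pat) for pat in AGGREGATOR_PATTERNS)


CASES = [
    "/book/print.html",
    "/blueprint.html",
    "/preprint.html",
    "/imprint/",
    "/_printer-friendly/css.css",
    "/_print/index.html",
]

results = {path: is_aggregator_path(path) for path in CASES}
for path, rejected in results.items():
    print(f"{path:30} {'rejected' if rejected else 'passes through'}")
```

With glob matching, every pattern requires a `/` immediately before `print` or `_print`, so only `/book/print.html` and `/_print/index.html` are rejected. The other four paths pass through because `print` appears mid-word or `_print` is only a prefix of a longer segment.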
Test plan
What this does NOT do
- `exclude_paths` / `include_paths`: composes with both
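One way the composition could work is sketched below. The function `url_path_allowed` and the precedence it uses (aggregator check, then `exclude_paths`, then `include_paths` as an allow-list) are assumptions for illustration; the PR only states that the aggregator check runs before user-supplied `exclude_paths`.

```python
from fnmatch import fnmatchcase

# Patterns copied from the PR description.
AGGREGATOR_PATTERNS = (
    "*/print.html",
    "*/_print",
    "*/_print/",
    "*/_print/*",
    "*/print/index.html",
)


def _matches(path, patterns):
    return any(fnmatchcase(path, pat) for pat in patterns)


def url_path_allowed(path, include_paths=(), exclude_paths=(),
                     include_aggregator_pages=False):
    """Hypothetical combined filter; precedence order is an assumption."""
    # 1. Default aggregator rejection, skipped when the caller opts out.
    if not include_aggregator_pages and _matches(path, AGGREGATOR_PATTERNS):
        return False
    # 2. User-supplied exclusions.
    if _matches(path, exclude_paths):
        return False
    # 3. If an allow-list is given, the path must match it.
    return not include_paths or _matches(path, include_paths)


# An allow-list covering /book/ does not re-admit the aggregator page.
print(url_path_allowed("/book/print.html", include_paths=("/book/*",)))
print(url_path_allowed("/book/ch01.html", include_paths=("/book/*",)))
```

In this sketch an `include_paths` entry that happens to cover an aggregator page does not override the default rejection; only `include_aggregator_pages=True` does.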