v0.11.1 — default aggregator-page URL filter (mdBook /print.html, Hugo /_print/)#34

Merged: AIMLPM merged 1 commit into main from v0.11.1-aggregator-filter on May 12, 2026

@AIMLPM AIMLPM commented May 12, 2026

Summary

  • Reject single-render-of-whole-tree aggregator pages (/print.html, /_print/) during crawl-time URL filtering. These pages contain the entire docs tree on one URL and dominate embedding-based retrieval rankings on cosine similarity, blocking the dedicated chapter pages a user actually wants.
  • Opt-out via include_aggregator_pages=True engine kwarg / --include-aggregators CLI flag for offline-archive use cases.
  • 36 new tests, 647 total (was 611), no regressions.
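
A minimal sketch of how a CLI flag like `--include-aggregators` could map onto the `include_aggregator_pages` kwarg. The parser, the `url` positional, and the `engine_kwargs` helper are hypothetical stand-ins for illustration, not markcrawl's actual `cli.py`:

```python
import argparse


def build_parser():
    # Hypothetical parser; only the flag name is taken from this PR.
    parser = argparse.ArgumentParser(prog="crawl-sketch")
    parser.add_argument("url")
    parser.add_argument(
        "--include-aggregators",
        action="store_true",
        help="keep /print.html and /_print/ pages (offline-archive use)",
    )
    return parser


def engine_kwargs(args):
    # argparse stores --include-aggregators as args.include_aggregators;
    # the engine-side name is include_aggregator_pages.
    return {"include_aggregator_pages": args.include_aggregators}


default_args = build_parser().parse_args(["https://example.com/docs/"])
archive_args = build_parser().parse_args(
    ["https://example.com/docs/", "--include-aggregators"]
)
print(engine_kwargs(default_args))
print(engine_kwargs(archive_args))
```

Because the flag uses `store_true`, omitting it leaves the filter on, which matches the default described above.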

Motivation

From the public llm-crawler-benchmarks v1.4 cycle audit:

| Site | markcrawl top-5 slots that are aggregator | Best competitor |
|---|---|---|
| rust-book | 49% /print.html | 30.7% (crawl4ai) |
| kubernetes-docs | 39% /_print/ | 0% (4 of 5 competitors) |

markcrawl is the only top-tier tool that didn't filter these pages. The 9 to 12 retrieval-bucket misses concentrated on these pages in v1.4 can be addressed with a single 5-line filter at the URL-frontier hook in path_excluded.

Predicted MRR lift on the 9-site bench pool: +0.02 to +0.04, concentrated on rust-book and kubernetes-docs.
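
For readers unfamiliar with the metric, here is a toy illustration of how an aggregator page in the top slot depresses MRR. The queries, URLs, and rankings are invented for this example and are not benchmark data; it also assumes the chapter page is the only relevant hit for each query:

```python
def mean_reciprocal_rank(rankings, relevant):
    """Average of 1/rank of the relevant URL per query (0 if it is absent)."""
    total = 0.0
    for query, ranked_urls in rankings.items():
        for rank, url in enumerate(ranked_urls, start=1):
            if url == relevant[query]:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Invented data: the page each query should surface.
relevant = {"ownership": "/ch04-01.html", "error handling": "/ch09-00.html"}

# With the aggregator crawled, it outranks the chapter page for one query.
with_aggregator = {
    "ownership": ["/print.html", "/ch04-01.html"],
    "error handling": ["/ch09-00.html", "/print.html"],
}

# With the aggregator filtered at crawl time, it never enters the index.
without_aggregator = {
    "ownership": ["/ch04-01.html"],
    "error handling": ["/ch09-00.html"],
}

print(mean_reciprocal_rank(with_aggregator, relevant))     # 0.75
print(mean_reciprocal_rank(without_aggregator, relevant))  # 1.0
```

The size of the jump here is an artifact of the toy data; it says nothing about the predicted +0.02 to +0.04, which remains unmeasured.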

What's in the patch

| File | Change |
|---|---|
| markcrawl/core.py | _DEFAULT_AGGREGATOR_PATH_PATTERNS tuple; include_aggregator_pages: bool = False param threaded through both engines + wrapper functions; path_excluded checks aggregator patterns before user-supplied exclude_paths |
| markcrawl/cli.py | --include-aggregators flag, threaded through both crawl() call sites |
| tests/test_v011_1_aggregator_filter.py | 36 new tests |
| CHANGELOG.md | 0.11.1 entry: motivation, mechanism, expected impact, test count |
| pyproject.toml | version 0.11.0 → 0.11.1 |
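
The check order in path_excluded can be sketched roughly as follows. This is an assumption-laden illustration, not the code in markcrawl/core.py: the function signature is hypothetical, the tuple contents are the five globs quoted in this PR's commit message, and glob matching via `fnmatchcase` is assumed for both the built-in patterns and the user's exclude_paths:

```python
from fnmatch import fnmatchcase

# Globs as listed in the commit message; matching semantics are assumed.
_DEFAULT_AGGREGATOR_PATH_PATTERNS = (
    "*/print.html",
    "*/_print",
    "*/_print/",
    "*/_print/*",
    "*/print/index.html",
)


def path_excluded(path, exclude_paths=(), include_aggregator_pages=False):
    """Return True if the crawler should skip `path` before fetching it."""
    # 1. Built-in aggregator filter, skipped only when the caller opts out.
    if not include_aggregator_pages:
        if any(fnmatchcase(path, pat) for pat in _DEFAULT_AGGREGATOR_PATH_PATTERNS):
            return True
    # 2. User-supplied excludes apply in either mode.
    return any(fnmatchcase(path, pat) for pat in exclude_paths)


print(path_excluded("/book/print.html"))                                 # True
print(path_excluded("/book/print.html", include_aggregator_pages=True))  # False
print(path_excluded("/drafts/ch01.html", exclude_paths=("/drafts/*",)))  # True
```

Opting out only disables step 1, so a user who passes include_aggregator_pages=True can still exclude /print.html explicitly through exclude_paths.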

Substring-match safety

Patterns are anchored to avoid over-matching:

| URL | Behavior | Reason |
|---|---|---|
| /book/print.html | rejected | mdBook aggregator |
| /blueprint.html | passes | print is mid-word |
| /preprint.html | passes | academic preprint, content page |
| /imprint/ | passes | legal page, not aggregator |
| /_printer-friendly/css.css | passes | asset path, _print is prefix-only |
| /_print/index.html | rejected | Hugo aggregator |

All six cases have explicit tests.
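
To show why anchoring matters, the sketch below compares a naive substring check with a check anchored on the path-segment boundary. The regex is an assumed equivalent written for this illustration, not the pattern set shipped in the patch; the six URLs and expected outcomes are the ones from the table:

```python
import re

# Assumed regex: the aggregator name must start right after a "/" and
# either end the path or be followed by a further "/...".
AGGREGATOR_RE = re.compile(r"/(?:print\.html|_print(?:/.*)?)$")


def naive_match(path):
    # Over-matches: any path containing the substring is rejected.
    return "print.html" in path or "_print" in path


def anchored_match(path):
    return AGGREGATOR_RE.search(path) is not None


# Expected outcome per URL: True means rejected as an aggregator page.
CASES = {
    "/book/print.html": True,
    "/blueprint.html": False,
    "/preprint.html": False,
    "/imprint/": False,
    "/_printer-friendly/css.css": False,
    "/_print/index.html": True,
}

for path, rejected in CASES.items():
    assert anchored_match(path) is rejected, path

# The naive check wrongly rejects three of the four content pages.
false_positives = [p for p, rejected in CASES.items()
                   if naive_match(p) and not rejected]
print(false_positives)
```

The leading `/` in the pattern is what keeps /blueprint.html and /preprint.html out, and the `(?:/.*)?$` tail is what keeps /_printer-friendly/ out.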

Test plan

  • 36 new tests pass on the new branch
  • Full suite 647 passed in 35s, no regressions
  • After merge, tag v0.11.1 and PyPI publish (separate step)
  • Bench measurement waits for the v1.5 helpful-pages-universe methodology: the current v1.4 methodology is anchor-biased and would give misleading numbers regardless of the underlying fix

What this does NOT do

  • Does NOT change default crawl behavior on sites that don't generate aggregator pages (most sites)
  • Does NOT affect retrieval-side code — purely a crawl-time URL filter
  • Does NOT bypass user-supplied exclude_paths / include_paths — composes with both
  • Does NOT measure the lift — measurement deferred to v1.5 bench cycle per the bench-cadence directive

…o /_print/)

Reject single-render-of-whole-tree aggregator pages during crawl-time URL
filtering. These pages contain the entire docs tree on one URL, so embedding-
based retrieval ranks them above the dedicated chapter pages a user actually
wants.

Patterns rejected pre-fetch (saves crawl budget):
  */print.html, */_print, */_print/, */_print/*, */print/index.html

Opt out via include_aggregator_pages=True engine kwarg or
--include-aggregators CLI flag for offline-archive use cases.

Motivation from llm-crawler-benchmarks v1.4 cycle: markcrawl was returning
/print.html in 49% of rust-book top-5 retrieval slots and /_print/ in 39%
of kubernetes-docs slots, while four of the five well-functioning competitors
returned 0% /_print/ on kubernetes-docs. Predicted MRR lift on the 9-site
bench pool: +0.02 to +0.04, concentrated on rust-book and kubernetes-docs.

36 new tests in tests/test_v011_1_aggregator_filter.py covering default
rejection, substring-match safety (/blueprint.html, /preprint.html,
/imprint/ all pass through), opt-out flag, composition with user-supplied
exclude_paths and include_paths, sync + async engine parity. 647 tests
total (was 611), no regressions.
@AIMLPM AIMLPM merged commit a5e158b into main May 12, 2026
4 of 5 checks passed
AIMLPM added a commit that referenced this pull request May 16, 2026
Pre-existing lint failure on main since PR #34 (v0.11.1 ship) — ruff
wants underscore-prefixed names sorted first in the import list. Applied
`ruff check --fix`. No semantic change to the tests.