v0.11.1 — default aggregator-page URL filter (mdBook /print.html, Hugo /_print/) by AIMLPM · Pull Request #34 · AIMLPM/markcrawl

AIMLPM · 2026-05-12T04:44:52Z

Summary

Reject single-render-of-whole-tree aggregator pages (/print.html, /_print/) during crawl-time URL filtering. These pages contain the entire docs tree on one URL, so they dominate cosine-similarity rankings in embedding-based retrieval and crowd out the dedicated chapter pages a user actually wants.
Opt-out via include_aggregator_pages=True engine kwarg / --include-aggregators CLI flag for offline-archive use cases.
36 new tests, 647 total (was 611), no regressions.

Motivation

From the public llm-crawler-benchmarks v1.4 cycle audit:

| Site | markcrawl top-5 slots that are aggregator pages | Best competitor |
| --- | --- | --- |
| rust-book | 49% `/print.html` | 30.7% (crawl4ai) |
| kubernetes-docs | 39% `/_print/` | 0% (4 of 5 competitors) |

Markcrawl is the only top-tier tool that didn't filter these pages. The 9-12 retrieval-bucket misses that concentrated on this issue in v1.4 are addressable with a single 5-line filter at the URL-frontier hook in path_excluded.

Predicted MRR lift on the 9-site bench pool: +0.02 to +0.04, concentrated on rust-book and kubernetes-docs.
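
To make the MRR argument concrete, here is a minimal sketch of how one aggregator page ranked above the wanted chapter lowers mean reciprocal rank. The rankings and URLs are hypothetical illustrations, not benchmark data, and the helper names are assumptions rather than anything from the bench harness.

```python
def reciprocal_rank(ranked_urls, relevant_url):
    """Return 1/rank of the relevant URL, or 0.0 if it is not in the list."""
    for rank, url in enumerate(ranked_urls, start=1):
        if url == relevant_url:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over (ranked_urls, relevant_url) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


# Hypothetical rankings for two queries; "/print.html" is the aggregator page.
with_aggregator = [
    (["/print.html", "/ch04-01-ownership.html"], "/ch04-01-ownership.html"),
    (["/ch10-02-traits.html", "/print.html"], "/ch10-02-traits.html"),
]

# The same rankings if the aggregator page had never been crawled.
without_aggregator = [
    ([u for u in ranked if u != "/print.html"], relevant)
    for ranked, relevant in with_aggregator
]

mrr_before = mean_reciprocal_rank(with_aggregator)    # (1/2 + 1/1) / 2
mrr_after = mean_reciprocal_rank(without_aggregator)  # (1/1 + 1/1) / 2
print(mrr_before, mrr_after)
```

Only the first query improves, because the aggregator outranked the chapter page only there. This is why the predicted lift is concentrated on sites that publish such pages.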

What's in the patch

| File | Change |
| --- | --- |
| `markcrawl/core.py` | `_DEFAULT_AGGREGATOR_PATH_PATTERNS` tuple; `include_aggregator_pages: bool = False` param threaded through both engines + wrapper functions; `path_excluded` checks aggregator patterns before user-supplied `exclude_paths` |
| `markcrawl/cli.py` | `--include-aggregators` flag, threaded through both `crawl()` call sites |
| `tests/test_v011_1_aggregator_filter.py` | 36 new tests |
| `CHANGELOG.md` | 0.11.1 entry: motivation, mechanism, expected impact, test count |
| `pyproject.toml` | version 0.11.0 → 0.11.1 |
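
The check order described for `markcrawl/core.py` could look roughly like the sketch below. This is not the code from the patch: the function signature, the use of `fnmatchcase`, and the assumption that `exclude_paths` holds glob patterns are all illustrative. Only the tuple name, the kwarg name, and the pattern list (taken from the commit message) come from the PR.

```python
from fnmatch import fnmatchcase

# Pattern list as quoted in the commit message. The real tuple in
# markcrawl/core.py may be defined or matched differently.
_DEFAULT_AGGREGATOR_PATH_PATTERNS = (
    "*/print.html",
    "*/_print",
    "*/_print/",
    "*/_print/*",
    "*/print/index.html",
)


def path_excluded(path, exclude_paths=(), include_aggregator_pages=False):
    """Hypothetical signature: return True if the URL path should be skipped."""
    # 1. Default aggregator filter, bypassed when the caller opts out.
    if not include_aggregator_pages:
        if any(fnmatchcase(path, p) for p in _DEFAULT_AGGREGATOR_PATH_PATTERNS):
            return True
    # 2. User-supplied exclusions (assumed to be globs) still apply afterwards.
    return any(fnmatchcase(path, p) for p in exclude_paths)


print(path_excluded("/book/print.html"))                                 # True
print(path_excluded("/book/print.html", include_aggregator_pages=True))  # False
```

Because the opt-out only skips the first step, a user who passes `include_aggregator_pages=True` can still exclude a specific aggregator URL through `exclude_paths`.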

Substring-match safety

Patterns are anchored to avoid over-matching:

| URL | Behavior | Reason |
| --- | --- | --- |
| `/book/print.html` | rejected | mdBook aggregator |
| `/blueprint.html` | passes | `print` is mid-word |
| `/preprint.html` | passes | academic preprint, content page |
| `/imprint/` | passes | legal page, not aggregator |
| `/_printer-friendly/css.css` | passes | asset path, `_print` is prefix-only |
| `/_print/index.html` | rejected | Hugo aggregator |

All six cases have explicit tests.
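
The sketch below shows why anchoring matters by comparing a naive substring check with a segment-anchored regex on the six URLs in the table. The regex is an assumed equivalent of the glob patterns listed in the commit message, and both function names are hypothetical; the patch itself may match differently.

```python
import re

# Anchored on a leading "/" and the end of the path, so "print" only matches
# as a whole path segment (assumed equivalent of the PR's glob patterns).
_AGGREGATOR_RE = re.compile(r"/(?:print\.html|_print(?:/.*)?|print/index\.html)$")

# (URL path, expected rejection) pairs from the table above.
CASES = [
    ("/book/print.html", True),
    ("/blueprint.html", False),
    ("/preprint.html", False),
    ("/imprint/", False),
    ("/_printer-friendly/css.css", False),
    ("/_print/index.html", True),
]


def naive_is_aggregator(path):
    """Over-matching baseline: any path that merely contains 'print'."""
    return "print" in path


def anchored_is_aggregator(path):
    """Segment-anchored check: reject only whole-segment aggregator paths."""
    return _AGGREGATOR_RE.search(path) is not None


for path, expected in CASES:
    print(path, naive_is_aggregator(path), anchored_is_aggregator(path), expected)
```

The naive check flags all six paths, including the four content and asset pages, while the anchored check agrees with the expected column in every row.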

Test plan

36 new tests pass on the new branch
Full suite 647 passed in 35s, no regressions
After merge, tag v0.11.1 and publish to PyPI (separate step)
Bench measurement waits for the v1.5 helpful-pages-universe methodology: the current v1.4 methodology is anchor-biased and would give misleading numbers regardless of the underlying fix

What this does NOT do

Does NOT change default crawl behavior on sites that don't generate aggregator pages (most sites)
Does NOT affect retrieval-side code — purely a crawl-time URL filter
Does NOT bypass user-supplied exclude_paths / include_paths — composes with both
Does NOT measure the lift — measurement deferred to v1.5 bench cycle per the bench-cadence directive

…o /_print/)

Reject single-render-of-whole-tree aggregator pages during crawl-time URL filtering. These pages contain the entire docs tree on one URL, so embedding-based retrieval ranks them above the dedicated chapter pages a user actually wants.

Patterns rejected pre-fetch (saves crawl budget): */print.html, */_print, */_print/, */_print/*, */print/index.html

Opt out via include_aggregator_pages=True engine kwarg or --include-aggregators CLI flag for offline-archive use cases.

Motivation from llm-crawler-benchmarks v1.4 cycle: markcrawl was returning /print.html in 49% of rust-book top-5 retrieval slots and /_print/ in 39% of kubernetes-docs slots, while four of the five well-functioning competitors returned 0% /_print/ on kubernetes-docs.

Predicted MRR lift on the 9-site bench pool: +0.02 to +0.04, concentrated on rust-book and kubernetes-docs.

36 new tests in tests/test_v011_1_aggregator_filter.py covering default rejection, substring-match safety (/blueprint.html, /preprint.html, /imprint/ all pass through), opt-out flag, composition with user-supplied exclude_paths and include_paths, sync + async engine parity.

647 tests total (was 611), no regressions.

Pre-existing lint failure on main since PR #34 (v0.11.1 ship) — ruff wants underscore-prefixed names sorted first in the import list. Applied `ruff check --fix`. No semantic change to the tests.

AIMLPM merged commit a5e158b into main May 12, 2026
4 of 5 checks passed

AIMLPM merged 1 commit into main from v0.11.1-aggregator-filter
