Add parse_sitemap_urls plugin for sitemap.xml URL discovery by ysalitrynskyi · Pull Request #29 · ArchiveBox/abx-plugins

ysalitrynskyi · 2026-05-25T20:36:58Z

Status: draft, needs more testing before merge

I'm opening this as a draft because the plugin has not been exercised
inside a live ArchiveBox install yet — only against the standalone
hook contract (the abx-plugins repo's test harness + manual runs
against a handful of real sitemaps). Before this is ready to land I'd
like to:

- run it end-to-end inside an actual archivebox add flow and confirm
  the emitted Snapshot records feed the host's crawl frontier as
  expected
- test the depth/limit interaction with the host's own max_depth /
  max_urls settings on a real install
- try it on more varied sites, especially SPAs that only publish a
  partial sitemap (pairs with parse_dom_outlinks)
- get feedback on whether the new MAX_SITEMAPS and
  ALLOW_PRIVATE_HOSTS knobs hit the right defaults for the
  ecosystem

If anyone on the abx-plugins side can spot something that won't survive
the host integration, I'd rather hear it now than after I've put more
hours into polish. Happy to keep iterating.

What this does

Adds a sitemap-driven URL discovery hook so a single seed URL can
expand into a full-site crawl without an external crawler in the
loop. Closes the long-standing
ArchiveBox/ArchiveBox#191 gap.

The host (ArchiveBox or abx-dl) still owns crawl ordering, dedup,
and the global max_depth / max_urls caps — this plugin only emits
Snapshot JSONL records, so existing crawl semantics are preserved.

Features

| Capability | Notes |
| --- | --- |
| Standard `urlset` parsing | Preserves `<lastmod>` as `bookmarked_at`, also reads `<priority>` and `<changefreq>`. |
| `sitemapindex` recursion | Configurable depth + total-fetch cap, cycle-safe via visited-set. |
| Gzip handling | Magic-byte sniff, `Content-Encoding: gzip` header, `.gz` URL suffix. |
| BOM-tolerant XML | UTF-8, UTF-16 LE, UTF-16 BE. |
| `robots.txt` `Sitemap:` discovery | Default on; handles multiple directives per file. |
| Fallback path probing | `/sitemap.xml`, `/sitemap_index.xml`, `/sitemap-index.xml`, `/wp-sitemap.xml`, `/sitemap.xml.gz` (configurable). |
| Include / exclude regex filters | With a scan-length cap to blunt catastrophic-backtracking risk. |
| `same_host_only` mode | Also applied to child sitemap fetches (per sitemaps.org §2.2). |
| Priority + changefreq filters | `PRIORITY_MIN` threshold with optional `REQUIRE_PRIORITY` strictness; `CHANGEFREQ_ALLOWED` allowlist. |
| Sort modes | `url` / `lastmod` / `priority` / `order` (sitemap order). |
| Image / video / news extensions | Optional. Media extras carry `tags=sitemap-media` and pass through the same URL policy as page URLs; news emits `Tag` records for publication names. |
| HTTP retry + backoff | 408, 429, 5xx, network errors; configurable count and base delay. |
| Bounded redirects | Hard cap on redirect count, rejects non-HTTP schemes and private hosts unless `ALLOW_PRIVATE_HOSTS=true`. |
| TLS verification toggle | Falls back to `CHECK_SSL_VALIDITY`. |
| Custom User-Agent + Accept-Language | Per-plugin override with global fallback. |
| Schemeless URL handling | `//host/path` resolved against the sitemap's own scheme. |
| Fragment stripping | `#anchor` variants don't produce duplicate snapshots. |
| `MAX_URLS` cap | Default 5000; applies before host's frontier caps. |
| `MAX_SITEMAPS` cap | Default 100; counts attempts, not successes, so an adversarial index can't pivot into thousands of fetches. |
| Verbose mode | One `fetching sitemap …` line per fetch on stderr. |
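
For reviewers who want a concrete picture of the gzip row combined with the decompression caps listed under Security posture, here is a rough sketch. It is not the plugin's code: `bounded_gunzip`, `DecompressionLimitExceeded`, and the parameter names are hypothetical, and only the 200 MiB / 100× defaults come from this description. The sketch covers the magic-byte sniff only, not the header or `.gz` suffix checks.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


class DecompressionLimitExceeded(Exception):
    """Hypothetical error type for a tripped size or ratio cap."""


def bounded_gunzip(
    raw: bytes,
    max_bytes: int = 200 * 1024 * 1024,  # absolute cap (default from this PR)
    max_ratio: int = 100,                # decompressed/compressed cap (default from this PR)
    chunk_size: int = 64 * 1024,
) -> bytes:
    """Return the decompressed payload, or raw unchanged when it is not gzip."""
    if not raw.startswith(GZIP_MAGIC):
        return raw
    # Express both caps as one byte budget so the read loop has a single check.
    budget = min(max_bytes, len(raw) * max_ratio)
    out = bytearray()
    with gzip.GzipFile(fileobj=io.BytesIO(raw)) as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            out += chunk
            if len(out) > budget:
                # Stop early instead of inflating the whole bomb into memory.
                raise DecompressionLimitExceeded(
                    f"decompressed more than {budget} bytes"
                )
    return bytes(out)
```

Reading in fixed-size chunks means memory use stays near the budget even when the input is hostile. A truncated stream would surface as `EOFError` from `gzip`, which a caller would still need to map to a failed fetch.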

No new binary dependencies. Stdlib (xml.etree, gzip, ssl,
urllib) plus rich_click, defusedxml, and
abx_plugins.plugins.base.utils; of these, only defusedxml is a new
runtime dependency (see Compatibility).
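
To illustrate how far the stdlib goes for the schemeless-URL, fragment-stripping, and scheme-allowlist rows, here is a minimal sketch using only `urllib.parse`. The function name `normalize_loc` and the `ALLOWED_SCHEMES` constant are hypothetical, and the plugin's real normalization may differ (for example in how it treats `file://`).

```python
from typing import Optional
from urllib.parse import urldefrag, urljoin, urlsplit

# Assumption: mirrors the http/https allowlist described in this PR.
ALLOWED_SCHEMES = {"http", "https"}


def normalize_loc(loc: str, sitemap_url: str) -> Optional[str]:
    """Resolve a <loc> value against its sitemap and strip any fragment.

    Returns None when the value is empty or the result is not http(s).
    """
    candidate = loc.strip()
    if not candidate:
        return None
    # urljoin resolves "//host/path" using the scheme of the base URL.
    absolute = urljoin(sitemap_url, candidate)
    # urldefrag drops "#anchor" so anchor variants collapse to one URL.
    without_fragment, _fragment = urldefrag(absolute)
    if urlsplit(without_fragment).scheme.lower() not in ALLOWED_SCHEMES:
        return None
    return without_fragment


base = "https://example.com/sitemap.xml"
print(normalize_loc("//cdn.example.com/a", base))
print(normalize_loc("https://example.com/page#top", base))
print(normalize_loc("javascript:alert(1)", base))
```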

Security posture

Untrusted sitemap input is the default threat model. Defenses:

- XML hardening via defusedxml.ElementTree.iterparse. DTDs and
  entities are refused; billion-laughs and XXE fail-closed.
- Response size cap before parsing (default 50 MiB).
- Decompression cap + ratio cap (default 200 MiB and 100×) so
  gzip bombs fail with status=failed rather than OOM.
- Scheme allowlist for emitted URLs: only http and https;
  javascript:, data:, ftp:, etc. are dropped.
- file:// containment: only allowed when the seed itself is
  file:// or the operator opts in.
- Bounded, validated redirects via a custom
  HTTPRedirectHandler that caps the chain and refuses non-HTTP
  schemes and private (loopback / RFC1918 / link-local / multicast)
  targets unless explicitly allowed.
- Sitemap attempt cap (MAX_SITEMAPS) defeats adversarial
  sitemap-indexes pointing at thousands of broken children.
- Streaming parse via iterparse with elem.clear() plus
  root.remove(elem) keeps resident XML bounded to a single
  <url> subtree even on 500k-URL documents.
- DNS-rebinding caveat documented in the README: the
  private-host check re-resolves on each policy call but cannot pin
  the IP through to urllib's connect-time lookup. If your threat
  model includes rebinding, run behind an outbound firewall.
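
As a rough illustration of the private-host gate (loopback / RFC1918 / link-local / multicast), the check can be approximated with `ipaddress` and `socket`. This is a sketch under assumptions, not the plugin's implementation: `is_private_host` is a hypothetical name, and failing closed on resolution errors is my choice rather than something stated above.

```python
import ipaddress
import socket


def _is_internal(addr) -> bool:
    """True for loopback, private-range, link-local, or multicast addresses."""
    return (
        addr.is_loopback
        or addr.is_private
        or addr.is_link_local
        or addr.is_multicast
    )


def is_private_host(host: str) -> bool:
    """Classify a hostname or IP literal as private (True) or public (False)."""
    try:
        # IP literals need no DNS lookup.
        return _is_internal(ipaddress.ip_address(host))
    except ValueError:
        pass  # not an IP literal, so resolve it below
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return True  # assumption: treat unresolvable hosts as refused
    # Refuse if ANY resolved address is internal; strip IPv6 zone ids first.
    return any(
        _is_internal(ipaddress.ip_address(info[4][0].split("%", 1)[0]))
        for info in infos
    )
```

The result only describes what the name resolved to at check time. As the caveat above says, the later connect-time lookup can return a different address, so a check of this kind does not defend against rebinding on its own.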

Hook contract

Standard on_Snapshot__* per abx-plugins/README.md:

- stdout: 0+ Tag records, 0+ Snapshot records, terminal
  ArchiveResult. Every non-empty stdout line is a JSON record (the
  test harness asserts this).
- stderr: discovery / fetch error lines and the human summary.
- succeeded if any URL emitted, noresults when filters or empty
  sitemaps prune everything, skipped when the plugin is disabled,
  failed only when every candidate sitemap is unfetchable /
  unparseable or a security guard tripped.
- SNAP_DIR/parse_sitemap_urls/urls.jsonl written atomically and
  removed on noresults / failed.
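
The "written atomically and removed on noresults / failed" behavior can be sketched with a temp file plus `os.replace`. This is illustrative only: `write_urls_jsonl` is a hypothetical helper, and the record fields in the usage comment (`type`, `url`) are placeholders rather than the actual Snapshot schema.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def write_urls_jsonl(out_dir: Path, records: list) -> Optional[Path]:
    """Write records to out_dir/urls.jsonl atomically.

    With no records, any stale urls.jsonl is removed and None is returned.
    """
    target = out_dir / "urls.jsonl"
    if not records:
        target.unlink(missing_ok=True)  # clear stale output from earlier runs
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so os.replace stays on one
    # filesystem, which is what makes the rename atomic.
    fd, tmp_name = tempfile.mkstemp(dir=out_dir, prefix=".urls.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record, sort_keys=True) + "\n")
        os.replace(tmp_name, target)
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)  # never leave partial files
        raise
    return target


# Usage (placeholder record shape):
# write_urls_jsonl(Path("snap/parse_sitemap_urls"),
#                  [{"type": "Snapshot", "url": "https://example.com/"}])
```

Readers of `urls.jsonl` then see either the previous complete file or the new complete file, never a half-written one.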

Summary string carries counters so logs make it obvious why nothing
emitted:

0 URLs parsed (visited 1 sitemap(s); skipped_filter=3 skipped_host=0
 skipped_priority=2 skipped_changefreq=0 skipped_scheme=1
 skipped_extras=0)
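
For reference, a summary in this shape is cheap to build from a `collections.Counter`. The sketch below assumes the string is emitted as a single line (it is wrapped above for readability); `format_summary` and `SKIP_CATEGORIES` are hypothetical names, while the counter labels are copied from the example.

```python
from collections import Counter

# Counter labels in the order they appear in the example summary.
SKIP_CATEGORIES = ("filter", "host", "priority", "changefreq", "scheme", "extras")


def format_summary(emitted: int, sitemaps_visited: int, skipped: Counter) -> str:
    """Render the human summary; missing Counter keys read as 0."""
    counters = " ".join(
        f"skipped_{name}={skipped[name]}" for name in SKIP_CATEGORIES
    )
    return f"{emitted} URLs parsed (visited {sitemaps_visited} sitemap(s); {counters})"


print(format_summary(0, 1, Counter(filter=3, priority=2, scheme=1)))
```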

Filename

Numbered 76 — sits after parse_dom_outlinks (75) and before any
later snapshot work. The 73 slot is already taken by
parse_netscape_urls.

What I tested

ruff check        → All checks passed
ruff format       → 5 files already formatted
pyright           → 0 errors, 0 warnings
pytest            → 83 passed in ~80s

83 tests across two files:

- tests/test_parse_sitemap_urls.py (34): basic urlset and
  sitemapindex parsing, gzip, filtering, malformed input, robots.txt
  discovery, HTTP fetch via pytest-httpserver, ordering and dedup.
- tests/test_parse_sitemap_urls_advanced.py (49): BOM stripping,
  unicode, schemeless URLs, priority and changefreq filtering, all
  four sort modes, image / video / news extensions, HTTP retry on
  5xx, retry exhaustion, redirect following, redirect count cap,
  scheme allowlist, remote → file:// chain refused, billion-laughs
  blocked, external entity blocked, gzip bomb capped, response body
  capped, truncated gzip handled cleanly, fragment stripping, media
  extras subject to full URL policy, MAX_SITEMAP_DEPTH
  semantics (0 walks only seed, 1 walks one child level),
  MAX_SITEMAPS enforced on attempts (not just successes),
  cross-host child sitemap refused under SAME_HOST_ONLY, JSONL
  contract enforced, 500k-URL streaming bounded by wall time.
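
The MAX_SITEMAP_DEPTH and MAX_SITEMAPS semantics exercised by these tests can be pictured with a small breadth-first walk. This is a sketch under assumptions, not the plugin's traversal: `walk_sitemaps` and the injected `fetch_children` callable are hypothetical, and the real code may order or bound its fetches differently.

```python
from collections import deque
from typing import Callable, Iterable, List


def walk_sitemaps(
    seed: str,
    fetch_children: Callable[[str], Iterable[str]],
    max_depth: int,
    max_sitemaps: int = 100,  # default quoted in this PR
) -> List[str]:
    """Return the sitemap URLs attempted, in fetch order.

    fetch_children(url) yields child sitemap URLs and may raise on failure.
    """
    visited = set()
    attempted: List[str] = []
    queue = deque([(seed, 0)])
    while queue and len(attempted) < max_sitemaps:
        url, depth = queue.popleft()
        if url in visited:
            continue  # cycle or duplicate reference
        visited.add(url)
        # Count the attempt BEFORE fetching so failures still use up budget.
        attempted.append(url)
        try:
            children = list(fetch_children(url))
        except Exception:
            continue  # broken child: budget is spent, nothing to enqueue
        if depth < max_depth:
            queue.extend((c, depth + 1) for c in children if c not in visited)
    return attempted


graph = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": []}
print(walk_sitemaps("a", graph.__getitem__, max_depth=0))
print(walk_sitemaps("a", graph.__getitem__, max_depth=1))
```

With `max_depth=0` only the seed is walked, and with `max_depth=1` one level of children is added, matching the semantics named in the test list.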

Manual smoke runs against real sites (all clean):

| Site | Setup | Result |
| --- | --- | --- |
| docs.python.org | root URL discovery via robots.txt | 8 URLs |
| github.blog | WP sitemap-index | 200-URL cap respected, 2 sitemaps walked, `bookmarked_at` populated |
| www.sitemaps.org | multi-locale gzipped sitemap | 20 URLs across `/en/`, `/es/` |
| bbc.com news archive | `SORT_BY=lastmod` | 15 newest-first articles |
| github.blog | `SAME_HOST_ONLY=true` | 10 URLs, all `github.blog` |
| docs.python.org | `EXCLUDE_REGEX=/3\.1[0-2]/` | 5 / 8 URLs (3 filtered) |
| sitemaps.org | `PRIORITY_MIN=0.8` | 0 / 84 (no priority hints declared) |

Compatibility

- New plugin, no existing behavior changes.
- No required_plugins, no required_binaries, no preflight impact.
- Pure stdlib + existing dev deps; the only new runtime dependency is
  defusedxml>=0.7.1 for XML hardening.

Open questions for reviewers

- Default for MAX_SITEMAPS is 100: reasonable, or should it be
  higher / lower? Tied to typical real-world sitemap-index fan-out.
- ALLOW_PRIVATE_HOSTS defaults to false, which surfaces a
  private_host error when someone tries to archive their own
  intranet without opting in. Discoverable enough, or should the
  default flip when the seed itself is private?
- The plugin pairs naturally with parse_dom_outlinks for SPAs.
  Worth dropping a one-liner in parse_dom_outlinks/README.md
  recommending sitemap discovery as the "first pass" of a full-site
  crawl?
- Anything in the hook contract I'm reading wrong from
  abx-plugins/README.md?

Marking ready-for-review only after the e2e run inside ArchiveBox is
clean.

Summary by cubic

Add parse_sitemap_urls plugin to discover URLs from sitemap.xml and emit Snapshot records, enabling full‑site crawls from a single seed without an external crawler. Preserves host crawl ordering, dedup, and depth/URL caps.

New Features
- Parse urlset and sitemapindex (recursive, cycle-safe); handles gzip and BOM.
- Discover via robots.txt and fallback probing; resolves schemeless URLs; strips fragments.
- Filters: same-host-only, include/exclude regex, priority/changefreq; sort by url/lastmod/priority/order.
- Safety: hardened XML via defusedxml, response/decompression caps, HTTP‑only allowlist, bounded redirects, private-host gate.
- Controls: hard caps on URLs and sitemap fetches, retry/backoff, TLS toggle, custom UA/Accept‑Language.
- Outputs JSONL Snapshot (and Tag for news); preserves <lastmod> as bookmarked_at.

Dependencies
- Add defusedxml>=0.7.1. No new binaries.

Written for commit 093c75d. Summary will update on new commits.

Adds a sitemap-driven URL discovery hook so a single seed URL can expand into a full-site crawl without an external crawler in the loop. Closes the long-standing ArchiveBox/ArchiveBox#191 gap.

The host (ArchiveBox or abx-dl) still owns crawl ordering, dedup, and the global max_depth / max_urls caps; this plugin only emits Snapshot JSONL records, so existing crawl semantics are preserved.

Capabilities
------------

* urlset parsing with bookmarked_at from <lastmod> and full <priority> / <changefreq> metadata
* sitemapindex recursion up to PARSE_SITEMAP_URLS_MAX_SITEMAP_DEPTH with cycle detection and a visited-set
* Gzip handling via magic-byte sniff, .gz suffix, and Content-Encoding: gzip, under hard size / ratio caps
* BOM-tolerant XML parsing for UTF-8, UTF-16 LE, and UTF-16 BE; the XML encoding declaration is rewritten when re-encoding UTF-16 to UTF-8 so the parser doesn't trip on the mismatch
* robots.txt Sitemap: directive discovery and configurable fallback paths (/sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, /wp-sitemap.xml, /sitemap.xml.gz by default)
* Include / exclude regex filters with scan-length cap, same-host-only mode, priority threshold (with optional REQUIRE_PRIORITY to also drop entries without a <priority> tag), changefreq allowlist
* Four sort modes: url (alpha), lastmod (newest first), priority (highest first), order (preserve sitemap order)
* Optional image / video / news sitemap extensions; image and video extras get tags=sitemap-media and pass through the same URL policy as page URLs, news emits Tag records for the publication name
* HTTP retry with exponential backoff for 408 / 429 / 5xx / network errors, custom User-Agent and Accept-Language headers, TLS verification toggle
* Schemeless URL (//host/path) resolution against the sitemap's own scheme, fragment stripping during URL normalization to avoid emitting duplicate snapshots for #anchor variants
* Hard PARSE_SITEMAP_URLS_MAX_URLS cap (default 5000), verbose stderr mode, summary counters for every filter category

Security posture
----------------

* XML parsing uses defusedxml: DTDs and entities are refused, so billion-laughs and XXE payloads fail-closed
* Response bytes capped before parsing (default 50 MiB)
* Decompression bounded by absolute size and decompressed/compressed ratio (default 200 MiB / 100x); gzip bombs fail with status=failed
* Scheme allowlist for emitted URLs: only http / https; javascript:, data:, ftp:, etc. are refused. file:// is allowed only when the seed itself is file:// or ALLOW_FILE_URLS is explicitly set
* Custom redirect handler bounds redirect count and rejects non-HTTP schemes and private (loopback / RFC1918 / link-local / multicast) targets unless ALLOW_PRIVATE_HOSTS is set
* Regex filters scan only the first REGEX_INPUT_CAP characters of each URL, blunting catastrophic-backtracking risk
* Seed URL also subject to scheme and private-host gates

Filename
--------

Numbered 76 (after parse_dom_outlinks at 75; the previous 73 slot collided with parse_netscape_urls).

Hook contract
-------------

Follows the standard on_Snapshot__* contract from abx-plugins/README.md:

* stdout: 0+ Tag records, 0+ Snapshot records, terminal ArchiveResult
* stderr: discovery / fetch error lines and the human summary
* succeeded if any URL emitted, noresults when filters or empty sitemaps prune everything, skipped when the plugin is disabled, failed only when every candidate sitemap is unfetchable / unparseable or a security guard tripped
* SNAP_DIR/parse_sitemap_urls/urls.jsonl written atomically; cleared on noresults / failed so stale data never lingers

Tests
-----

72 tests across two files:

* tests/test_parse_sitemap_urls.py: 34 tests covering basic urlset, sitemapindex (one and two levels, max-depth, cyclic), gzip, filtering, limits, malformed input (truncated XML, non-XML, empty urlset, missing file, unknown root element, unnamespaced sitemap), robots.txt discovery (direct seed, root-URL fallback, disabled), HTTP fetch via pytest-httpserver, ordering, dedup
* tests/test_parse_sitemap_urls_advanced.py: 38 tests covering BOM stripping (UTF-8 / UTF-16 LE), unicode URLs, whitespace normalization, schemeless URLs, priority / changefreq filters with and without REQUIRE_PRIORITY, all four sort modes, image / video / news extensions, HTTP retry on 503, retry exhaustion, redirect following, redirect to non-HTTP scheme refused, private-host seed refused, Content-Encoding: gzip, custom user-agent and Accept-Language, verbose stderr, multiple Sitemap: directives in robots.txt, custom fallback paths, 2000-URL stretch test, dedup of image extras against page URLs, scheme allowlist (javascript:, data:, ftp: refused), remote sitemap → file:// child refused, XML billion-laughs blocked, external entity blocked, gzip bomb rejected by decompression cap, response body cap, fragment stripping, image extras subject to same-host + include filters, MAX_SITEMAP_DEPTH=0 walks only the seed and MAX_SITEMAP_DEPTH=1 walks one child level

Manual smoke tests against docs.python.org, github.blog (WordPress sitemap-index), www.sitemaps.org (multi-locale), bbc.com news archive (lastmod sort) all clean. ruff / pyright clean, pytest 72/72 passing.

ysalitrynskyi wants to merge 1 commit into ArchiveBox:main from ysalitrynskyi:feat/parse-sitemap-urls
