
Add parse_sitemap_urls plugin for sitemap.xml URL discovery #29

Draft
ysalitrynskyi wants to merge 1 commit into ArchiveBox:main from ysalitrynskyi:feat/parse-sitemap-urls

ysalitrynskyi commented May 25, 2026

Status: draft, needs more testing before merge

I'm opening this as a draft because the plugin has not been exercised
inside a live ArchiveBox install yet — only against the standalone
hook contract (the abx-plugins repo's test harness + manual runs
against a handful of real sitemaps). Before this is ready to land I'd
like to:

  • run it end-to-end inside an actual archivebox add flow and confirm
    the emitted Snapshot records feed the host's crawl frontier as
    expected
  • test the depth/limit interaction with the host's own max_depth /
    max_urls settings on a real install
  • try it on more varied sites, especially SPAs that only publish a
    partial sitemap (pairs with parse_dom_outlinks)
  • get feedback on whether the new MAX_SITEMAPS and
    ALLOW_PRIVATE_HOSTS knobs hit the right defaults for the
    ecosystem

If anyone on the abx-plugins side can spot something that won't survive
the host integration, I'd rather hear it now than after I've put more
hours into polish. Happy to keep iterating.

What this does

Adds a sitemap-driven URL discovery hook so a single seed URL can
expand into a full-site crawl without an external crawler in the
loop. Closes the long-standing ArchiveBox/ArchiveBox#191 gap.

The host (ArchiveBox or abx-dl) still owns crawl ordering, dedup,
and the global max_depth / max_urls caps — this plugin only emits
Snapshot JSONL records, so existing crawl semantics are preserved.

Features

| Capability | Notes |
| --- | --- |
| Standard urlset parsing | Preserves <lastmod> as bookmarked_at, also reads <priority> and <changefreq>. |
| sitemapindex recursion | Configurable depth + total-fetch cap, cycle-safe via visited-set. |
| Gzip handling | Magic-byte sniff, Content-Encoding: gzip header, .gz URL suffix. |
| BOM-tolerant XML | UTF-8, UTF-16 LE, UTF-16 BE. |
| robots.txt Sitemap: discovery | Default on; handles multiple directives per file. |
| Fallback path probing | /sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, /wp-sitemap.xml, /sitemap.xml.gz (configurable). |
| Include / exclude regex filters | With a scan-length cap to blunt catastrophic-backtracking risk. |
| same_host_only mode | Also applied to child sitemap fetches (per sitemaps.org §2.2). |
| Priority + changefreq filters | PRIORITY_MIN threshold with optional REQUIRE_PRIORITY strictness; CHANGEFREQ_ALLOWED allowlist. |
| Sort modes | url / lastmod / priority / order (sitemap order). |
| Image / video / news extensions | Optional. Media extras carry tags=sitemap-media and pass through the same URL policy as page URLs; news emits Tag records for publication names. |
| HTTP retry + backoff | 408, 429, 5xx, network errors; configurable count and base delay. |
| Bounded redirects | Hard cap on redirect count, rejects non-HTTP schemes and private hosts unless ALLOW_PRIVATE_HOSTS=true. |
| TLS verification toggle | Falls back to CHECK_SSL_VALIDITY. |
| Custom User-Agent + Accept-Language | Per-plugin override with global fallback. |
| Schemeless URL handling | //host/path resolved against the sitemap's own scheme. |
| Fragment stripping | #anchor variants don't produce duplicate snapshots. |
| MAX_URLS cap | Default 5000; applies before host's frontier caps. |
| MAX_SITEMAPS cap | Default 100; counts attempts, not successes, so an adversarial index can't pivot into thousands of fetches. |
| Verbose mode | One fetching sitemap … line per fetch on stderr. |
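
For reviewers who want to reason about the schemeless-URL, fragment-stripping, and scheme-allowlist rows together, here is a minimal sketch of how those three steps could compose. It is not the plugin's code: normalize_loc and ALLOWED_SCHEMES are hypothetical names, and resolving relative <loc> values through urljoin is an assumption made for the illustration.

```python
from typing import Optional
from urllib.parse import urldefrag, urljoin, urlsplit

# Hypothetical constant for this sketch; the PR only states that
# http and https are the schemes allowed for emitted URLs.
ALLOWED_SCHEMES = {"http", "https"}


def normalize_loc(loc: str, sitemap_url: str) -> Optional[str]:
    """Resolve a <loc> value against its sitemap URL and drop the fragment.

    Returns None when the value is empty or uses a disallowed scheme.
    """
    candidate = loc.strip()
    if not candidate:
        return None
    # urljoin gives a schemeless //host/path the scheme of the sitemap URL.
    absolute = urljoin(sitemap_url, candidate)
    # urldefrag splits off "#anchor" so variants collapse to one URL.
    without_fragment, _fragment = urldefrag(absolute)
    if urlsplit(without_fragment).scheme.lower() not in ALLOWED_SCHEMES:
        return None
    return without_fragment


if __name__ == "__main__":
    base = "https://example.com/sitemap.xml"
    print(normalize_loc("//cdn.example.com/a", base))
    print(normalize_loc("https://example.com/page#top", base))
    print(normalize_loc("javascript:alert(1)", base))
```

Doing the scheme check after resolution means a schemeless value is judged by the scheme it actually ends up with, not by the raw string.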

No new binary dependencies. Stdlib (xml.etree, gzip, ssl,
urllib) plus the existing rich_click and
abx_plugins.plugins.base.utils; defusedxml is the one new runtime
dependency.

Security posture

Untrusted sitemap input is the default threat model. Defenses:

  • XML hardening via defusedxml.ElementTree.iterparse. DTDs and
    entities are refused; billion-laughs and XXE fail-closed.
  • Response size cap before parsing (default 50 MiB).
  • Decompression cap + ratio cap (default 200 MiB and 100×) so
    gzip bombs fail with status=failed rather than OOM.
  • Scheme allowlist for emitted URLs: only http and https;
    javascript:, data:, ftp:, etc. are dropped.
  • file:// containment: only allowed when the seed itself is
    file:// or the operator opts in.
  • Bounded, validated redirects via a custom
    HTTPRedirectHandler that caps the chain and refuses non-HTTP
    schemes and private (loopback / RFC1918 / link-local / multicast)
    targets unless explicitly allowed.
  • Sitemap attempt cap (MAX_SITEMAPS) defeats adversarial
    sitemap-indexes pointing at thousands of broken children.
  • Streaming parse via iterparse with elem.clear() plus
    root.remove(elem) keeps resident XML bounded to a single
    <url> subtree even on 500k-URL documents.
  • DNS-rebinding caveat documented in the README — the
    private-host check re-resolves on each policy call but cannot pin
    the IP through to urllib's connect-time lookup. If your threat
    model includes rebinding, run behind an outbound firewall.
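
To make the decompression cap + ratio cap item concrete, the following is a rough sketch of one way to bound gzip output with the stdlib alone. It is an illustration under assumptions, not the plugin's implementation: bounded_gunzip and DecompressionLimitError are hypothetical names, only the 200 MiB and 100× defaults come from this PR, and multi-member gzip files are not handled.

```python
import gzip
import zlib


class DecompressionLimitError(ValueError):
    """Raised when a gzip payload is truncated or exceeds the caps."""


def bounded_gunzip(
    data: bytes,
    max_bytes: int = 200 * 1024 * 1024,  # absolute cap (200 MiB default from the PR)
    max_ratio: int = 100,                # decompressed/compressed cap (100x default)
    chunk_size: int = 64 * 1024,
) -> bytes:
    """Decompress a single-member gzip payload without exceeding either cap."""
    limit = min(max_bytes, len(data) * max_ratio)
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    out = bytearray()
    pending = data
    while not decompressor.eof:
        # max_length bounds each step, so a bomb is detected after at most
        # one extra chunk instead of after it has been fully inflated.
        chunk = decompressor.decompress(pending, chunk_size)
        pending = decompressor.unconsumed_tail
        if not chunk and not pending:
            raise DecompressionLimitError("truncated gzip stream")
        out += chunk
        if len(out) > limit:
            raise DecompressionLimitError("decompressed size exceeds cap")
    return bytes(out)


if __name__ == "__main__":
    print(bounded_gunzip(gzip.compress(b"<urlset/>")))
    try:
        bounded_gunzip(gzip.compress(b"\0" * 5_000_000))
    except DecompressionLimitError as exc:
        print("rejected:", exc)
```

Raising a dedicated exception lets the caller map the failure to status=failed, as described above, instead of letting the process run out of memory.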

Hook contract

Standard on_Snapshot__* per abx-plugins/README.md:

  • stdout: 0+ Tag records, 0+ Snapshot records, terminal
    ArchiveResult. Every non-empty stdout line is a JSON record (the
    test harness asserts this).
  • stderr: discovery / fetch error lines and the human summary.
  • succeeded if any URL emitted, noresults when filters or empty
    sitemaps prune everything, skipped when the plugin is disabled,
    failed only when every candidate sitemap is unfetchable /
    unparseable or a security guard tripped.
  • SNAP_DIR/parse_sitemap_urls/urls.jsonl written atomically and
    removed on noresults / failed.
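
As an illustration of the last bullet, here is one common way to get "written atomically, removed on noresults / failed" behaviour using a temp file and os.replace. This is a sketch only: write_jsonl_atomically and clear_output are hypothetical helper names, and the record fields in the usage example are placeholders rather than the real Snapshot schema.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Iterable


def write_jsonl_atomically(path: Path, records: Iterable[dict]) -> None:
    """Write records as JSONL so readers never observe a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so os.replace stays on one
    # filesystem, which is what makes the final rename atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            for record in records:
                handle.write(json.dumps(record) + "\n")
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Leave no stray temp file behind if anything above failed.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def clear_output(path: Path) -> None:
    """Remove a previous urls.jsonl, e.g. on noresults or failed."""
    path.unlink(missing_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as snap_dir:
        out = Path(snap_dir) / "parse_sitemap_urls" / "urls.jsonl"
        # Placeholder record shape, assumed for this example only.
        write_jsonl_atomically(out, [{"type": "Snapshot", "url": "https://example.com/"}])
        print(out.read_text(encoding="utf-8"), end="")
        clear_output(out)
```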

Summary string carries counters so logs make it obvious why nothing
emitted:

0 URLs parsed (visited 1 sitemap(s); skipped_filter=3 skipped_host=0
 skipped_priority=2 skipped_changefreq=0 skipped_scheme=1
 skipped_extras=0)

Filename

Numbered 76 — sits after parse_dom_outlinks (75) and before any
later snapshot work. The 73 slot is already taken by
parse_netscape_urls.

What I tested

ruff check        → All checks passed
ruff format       → 5 files already formatted
pyright           → 0 errors, 0 warnings
pytest            → 83 passed in ~80s

83 tests across two files:

  • tests/test_parse_sitemap_urls.py (34) — basic urlset and
    sitemapindex parsing, gzip, filtering, malformed input, robots.txt
    discovery, HTTP fetch via pytest-httpserver, ordering and dedup.
  • tests/test_parse_sitemap_urls_advanced.py (49) — BOM stripping,
    unicode, schemeless URLs, priority and changefreq filtering, all
    four sort modes, image / video / news extensions, HTTP retry on
    5xx, retry exhaustion, redirect following, redirect count cap,
    scheme allowlist, remote → file:// chain refused, billion-laughs
    blocked, external entity blocked, gzip bomb capped, response body
    capped, truncated gzip handled cleanly, fragment stripping, media
    extras subject to full URL policy, MAX_SITEMAP_DEPTH
    semantics (0 walks only seed, 1 walks one child level),
    MAX_SITEMAPS enforced on attempts (not just successes),
    cross-host child sitemap refused under SAME_HOST_ONLY, JSONL
    contract enforced, 500k-URL streaming bounded by wall time.

Manual smoke runs against real sites (all clean):

| Site | Setup | Result |
| --- | --- | --- |
| docs.python.org | root URL discovery via robots.txt | 8 URLs |
| github.blog | WP sitemap-index | 200-URL cap respected, 2 sitemaps walked, bookmarked_at populated |
| www.sitemaps.org | multi-locale gzipped sitemap | 20 URLs across /en/, /es/ |
| bbc.com | news archive SORT_BY=lastmod | 15 newest-first articles |
| github.blog | SAME_HOST_ONLY=true | 10 URLs, all github.blog |
| docs.python.org | EXCLUDE_REGEX=/3\.1[0-2]/ | 5 / 8 URLs (3 filtered) |
| sitemaps.org | PRIORITY_MIN=0.8 | 0 / 84 (no priority hints declared) |

Compatibility

  • New plugin, no existing behavior changes.
  • No required_plugins, no required_binaries, no preflight impact.
  • Pure stdlib + existing dev deps; the only new runtime dependency is
    defusedxml>=0.7.1 for XML hardening.

Open questions for reviewers

  1. Default for MAX_SITEMAPS is 100 — reasonable, or should it be
    higher / lower? Tied to typical real-world sitemap-index fan-out.
  2. ALLOW_PRIVATE_HOSTS defaults to false, which surfaces a
    private_host error when someone tries to archive their own
    intranet without opting in. Discoverable enough, or should the
    default flip when the seed itself is private?
  3. The plugin pairs naturally with parse_dom_outlinks for SPAs.
    Worth dropping a one-liner in parse_dom_outlinks/README.md
    recommending sitemap discovery as the "first pass" of a full-site
    crawl?
  4. Anything in the hook contract I'm reading wrong from
    abx-plugins/README.md?

Marking ready-for-review only after the e2e run inside ArchiveBox is
clean.


Summary by cubic

Add parse_sitemap_urls plugin to discover URLs from sitemap.xml and emit Snapshot records, enabling full‑site crawls from a single seed without an external crawler. Preserves host crawl ordering, dedup, and depth/URL caps.

  • New Features

    • Parse urlset and sitemapindex (recursive, cycle-safe); handles gzip and BOM.
    • Discover via robots.txt and fallback probing; resolves schemeless URLs; strips fragments.
    • Filters: same-host-only, include/exclude regex, priority/changefreq; sort by url/lastmod/priority/order.
    • Safety: hardened XML via defusedxml, response/decompression caps, HTTP‑only allowlist, bounded redirects, private-host gate.
    • Controls: hard caps on URLs and sitemap fetches, retry/backoff, TLS toggle, custom UA/Accept‑Language.
    • Outputs JSONL Snapshot (and Tag for news); preserves <lastmod> as bookmarked_at.
  • Dependencies

    • Add defusedxml>=0.7.1. No new binaries.

Written for commit 093c75d. Summary will update on new commits.

Adds a sitemap-driven URL discovery hook so a single seed URL can
expand into a full-site crawl without an external crawler in the
loop. Closes the long-standing ArchiveBox/ArchiveBox#191 gap.

The host (ArchiveBox or abx-dl) still owns crawl ordering, dedup, and
the global max_depth / max_urls caps — this plugin only emits
Snapshot JSONL records, so existing crawl semantics are preserved.

Capabilities
------------

* urlset parsing with bookmarked_at from <lastmod> and full <priority>
  / <changefreq> metadata
* sitemapindex recursion up to PARSE_SITEMAP_URLS_MAX_SITEMAP_DEPTH
  with cycle detection and a visited-set
* Gzip handling via magic-byte sniff, .gz suffix, and
  Content-Encoding: gzip — under hard size / ratio caps
* BOM-tolerant XML parsing for UTF-8, UTF-16 LE, and UTF-16 BE; the
  XML encoding declaration is rewritten when re-encoding UTF-16 to
  UTF-8 so the parser doesn't trip on the mismatch
* robots.txt Sitemap: directive discovery and configurable fallback
  paths (/sitemap.xml, /sitemap_index.xml, /sitemap-index.xml,
  /wp-sitemap.xml, /sitemap.xml.gz by default)
* Include / exclude regex filters with scan-length cap, same-host-only
  mode, priority threshold (with optional REQUIRE_PRIORITY to also
  drop entries without a <priority> tag), changefreq allowlist
* Four sort modes: url (alpha), lastmod (newest first), priority
  (highest first), order (preserve sitemap order)
* Optional image / video / news sitemap extensions; image and video
  extras get tags=sitemap-media and pass through the same URL policy
  as page URLs, news emits Tag records for the publication name
* HTTP retry with exponential backoff for 408 / 429 / 5xx / network
  errors, custom User-Agent and Accept-Language headers, TLS
  verification toggle
* Schemeless URL (//host/path) resolution against the sitemap's own
  scheme, fragment stripping during URL normalization to avoid
  emitting duplicate snapshots for #anchor variants
* Hard PARSE_SITEMAP_URLS_MAX_URLS cap (default 5000), verbose stderr
  mode, summary counters for every filter category

Security posture
----------------

* XML parsing uses defusedxml — DTDs and entities are refused, so
  billion-laughs and XXE payloads fail-closed
* Response bytes capped before parsing (default 50 MiB)
* Decompression bounded by absolute size and decompressed/compressed
  ratio (default 200 MiB / 100x) — gzip bombs fail with status=failed
* Scheme allowlist for emitted URLs: only http / https; javascript:,
  data:, ftp:, etc. are refused. file:// is allowed only when the
  seed itself is file:// or ALLOW_FILE_URLS is explicitly set
* Custom redirect handler bounds redirect count and rejects non-HTTP
  schemes and private (loopback / RFC1918 / link-local / multicast)
  targets unless ALLOW_PRIVATE_HOSTS is set
* Regex filters scan only the first REGEX_INPUT_CAP characters of
  each URL, blunting catastrophic-backtracking risk
* Seed URL also subject to scheme and private-host gates
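
For discussion of the private-host gate, here is a minimal sketch of a loopback / RFC1918 / link-local / multicast check built on the stdlib ipaddress module. The names is_blocked_ip, resolve_host, and host_is_private are hypothetical, and treating a resolution failure as "private" (fail closed) is an assumption of this sketch, not something the PR states. Like any check of this shape, it cannot pin the resolved IP through to connect time, so it does not address DNS rebinding.

```python
import ipaddress
import socket
from typing import Callable, List


def is_blocked_ip(address: str) -> bool:
    """True for loopback, private (incl. RFC1918), link-local, or multicast IPs."""
    ip = ipaddress.ip_address(address)
    return ip.is_loopback or ip.is_private or ip.is_link_local or ip.is_multicast


def resolve_host(host: str) -> List[str]:
    """Resolve a hostname to its distinct IP addresses."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    # Drop any IPv6 scope suffix such as "%eth0" before parsing.
    return sorted({info[4][0].split("%", 1)[0] for info in infos})


def host_is_private(host: str, resolver: Callable[[str], List[str]] = resolve_host) -> bool:
    """True when any address for the host is blocked or resolution fails."""
    try:
        addresses = resolver(host)
    except OSError:
        return True  # assumption: fail closed when the host cannot be resolved
    return not addresses or any(is_blocked_ip(addr) for addr in addresses)


if __name__ == "__main__":
    for addr in ("127.0.0.1", "10.1.2.3", "169.254.10.10", "224.0.0.1", "8.8.8.8"):
        print(addr, is_blocked_ip(addr))
```

Checking every resolved address, rather than only the first, avoids letting a host through because one of several records happens to be public.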

Filename
--------

Numbered 76 (after parse_dom_outlinks at 75; the previous 73 slot
collided with parse_netscape_urls).

Hook contract
-------------

Follows the standard on_Snapshot__* contract from abx-plugins/README.md:

* stdout: 0+ Tag records, 0+ Snapshot records, terminal ArchiveResult
* stderr: discovery / fetch error lines and the human summary
* succeeded if any URL emitted, noresults when filters or empty
  sitemaps prune everything, skipped when the plugin is disabled,
  failed only when every candidate sitemap is unfetchable / unparseable
  or a security guard tripped
* SNAP_DIR/parse_sitemap_urls/urls.jsonl written atomically; cleared
  on noresults / failed so stale data never lingers

Tests
-----

72 tests across two files:

* tests/test_parse_sitemap_urls.py — 34 tests covering basic urlset,
  sitemapindex (one and two levels, max-depth, cyclic), gzip,
  filtering, limits, malformed input (truncated XML, non-XML, empty
  urlset, missing file, unknown root element, unnamespaced sitemap),
  robots.txt discovery (direct seed, root-URL fallback, disabled),
  HTTP fetch via pytest-httpserver, ordering, dedup
* tests/test_parse_sitemap_urls_advanced.py — 38 tests covering BOM
  stripping (UTF-8 / UTF-16 LE), unicode URLs, whitespace
  normalization, schemeless URLs, priority / changefreq filters with
  and without REQUIRE_PRIORITY, all four sort modes, image / video /
  news extensions, HTTP retry on 503, retry exhaustion, redirect
  following, redirect to non-HTTP scheme refused, private-host seed
  refused, Content-Encoding: gzip, custom user-agent and
  Accept-Language, verbose stderr, multiple Sitemap: directives in
  robots.txt, custom fallback paths, 2000-URL stretch test, dedup of
  image extras against page URLs, scheme allowlist (javascript:,
  data:, ftp: refused), remote sitemap → file:// child refused, XML
  billion-laughs blocked, external entity blocked, gzip bomb rejected
  by decompression cap, response body cap, fragment stripping, image
  extras subject to same-host + include filters, MAX_SITEMAP_DEPTH=0
  walks only the seed and MAX_SITEMAP_DEPTH=1 walks one child level

Manual smoke tests against docs.python.org, github.blog
(WordPress sitemap-index), www.sitemaps.org (multi-locale),
bbc.com news archive (lastmod sort) all clean.

ruff / pyright clean, pytest 72/72 passing.