Add parse_sitemap_urls plugin for sitemap.xml URL discovery #29
Draft
ysalitrynskyi wants to merge 1 commit into
Adds a sitemap-driven URL discovery hook so a single seed URL can expand into a full-site crawl without an external crawler in the loop. Closes the long-standing ArchiveBox/ArchiveBox#191 gap.

The host (ArchiveBox or abx-dl) still owns crawl ordering, dedup, and the global max_depth / max_urls caps. This plugin only emits Snapshot JSONL records, so existing crawl semantics are preserved.

Capabilities
------------
* urlset parsing with bookmarked_at from <lastmod> and full <priority> / <changefreq> metadata
* sitemapindex recursion up to PARSE_SITEMAP_URLS_MAX_SITEMAP_DEPTH with cycle detection and a visited-set
* Gzip handling via magic-byte sniff, .gz suffix, and Content-Encoding: gzip, under hard size / ratio caps
* BOM-tolerant XML parsing for UTF-8, UTF-16 LE, and UTF-16 BE; the XML encoding declaration is rewritten when re-encoding UTF-16 to UTF-8 so the parser doesn't trip on the mismatch
* robots.txt Sitemap: directive discovery and configurable fallback paths (/sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, /wp-sitemap.xml, /sitemap.xml.gz by default)
* Include / exclude regex filters with scan-length cap, same-host-only mode, priority threshold (with optional REQUIRE_PRIORITY to also drop entries without a <priority> tag), changefreq allowlist
* Four sort modes: url (alpha), lastmod (newest first), priority (highest first), order (preserve sitemap order)
* Optional image / video / news sitemap extensions; image and video extras get tags=sitemap-media and pass through the same URL policy as page URLs, news emits Tag records for the publication name
* HTTP retry with exponential backoff for 408 / 429 / 5xx / network errors, custom User-Agent and Accept-Language headers, TLS verification toggle
* Schemeless URL (//host/path) resolution against the sitemap's own scheme, fragment stripping during URL normalization to avoid emitting duplicate snapshots for #anchor variants
* Hard PARSE_SITEMAP_URLS_MAX_URLS cap (default 5000), verbose stderr mode, summary counters for every filter category

Security posture
----------------
* XML parsing uses defusedxml: DTDs and entities are refused, so billion-laughs and XXE payloads fail-closed
* Response bytes capped before parsing (default 50 MiB)
* Decompression bounded by absolute size and decompressed/compressed ratio (default 200 MiB / 100x); gzip bombs fail with status=failed
* Scheme allowlist for emitted URLs: only http / https; javascript:, data:, ftp:, etc. are refused. file:// is allowed only when the seed itself is file:// or ALLOW_FILE_URLS is explicitly set
* Custom redirect handler bounds redirect count and rejects non-HTTP schemes and private (loopback / RFC1918 / link-local / multicast) targets unless ALLOW_PRIVATE_HOSTS is set
* Regex filters scan only the first REGEX_INPUT_CAP characters of each URL, blunting catastrophic-backtracking risk
* Seed URL also subject to scheme and private-host gates

Filename
--------
Numbered 76 (after parse_dom_outlinks at 75; the previous 73 slot collided with parse_netscape_urls).

Hook contract
-------------
Follows the standard on_Snapshot__* contract from abx-plugins/README.md:
* stdout: 0+ Tag records, 0+ Snapshot records, terminal ArchiveResult
* stderr: discovery / fetch error lines and the human summary
* succeeded if any URL emitted, noresults when filters or empty sitemaps prune everything, skipped when the plugin is disabled, failed only when every candidate sitemap is unfetchable / unparseable or a security guard tripped
* SNAP_DIR/parse_sitemap_urls/urls.jsonl written atomically; cleared on noresults / failed so stale data never lingers

Tests
-----
72 tests across two files:
* tests/test_parse_sitemap_urls.py: 34 tests covering basic urlset, sitemapindex (one and two levels, max-depth, cyclic), gzip, filtering, limits, malformed input (truncated XML, non-XML, empty urlset, missing file, unknown root element, unnamespaced sitemap), robots.txt discovery (direct seed, root-URL fallback, disabled), HTTP fetch via pytest-httpserver, ordering, dedup
* tests/test_parse_sitemap_urls_advanced.py: 38 tests covering BOM stripping (UTF-8 / UTF-16 LE), unicode URLs, whitespace normalization, schemeless URLs, priority / changefreq filters with and without REQUIRE_PRIORITY, all four sort modes, image / video / news extensions, HTTP retry on 503, retry exhaustion, redirect following, redirect to non-HTTP scheme refused, private-host seed refused, Content-Encoding: gzip, custom user-agent and Accept-Language, verbose stderr, multiple Sitemap: directives in robots.txt, custom fallback paths, 2000-URL stretch test, dedup of image extras against page URLs, scheme allowlist (javascript:, data:, ftp: refused), remote sitemap → file:// child refused, XML billion-laughs blocked, external entity blocked, gzip bomb rejected by decompression cap, response body cap, fragment stripping, image extras subject to same-host + include filters, MAX_SITEMAP_DEPTH=0 walks only the seed and MAX_SITEMAP_DEPTH=1 walks one child level

Manual smoke tests against docs.python.org, github.blog (WordPress sitemap-index), www.sitemaps.org (multi-locale), bbc.com news archive (lastmod sort) all clean. ruff / pyright clean, pytest 72/72 passing.
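The decompression bound described in the commit message (an absolute size cap plus a decompressed/compressed ratio cap) can be sketched roughly as follows. This is a minimal stdlib illustration, not the plugin's actual code: bounded_gunzip, DecompressionLimitError, and both constants are hypothetical names, and the limits simply mirror the quoted 200 MiB / 100x defaults.

```python
import gzip
import io

# Hypothetical constants mirroring the defaults quoted above (200 MiB / 100x).
MAX_DECOMPRESSED_BYTES = 200 * 1024 * 1024
MAX_RATIO = 100


class DecompressionLimitError(Exception):
    """Raised when decompressed output exceeds the size or ratio cap."""


def bounded_gunzip(data: bytes,
                   max_bytes: int = MAX_DECOMPRESSED_BYTES,
                   max_ratio: int = MAX_RATIO,
                   chunk_size: int = 64 * 1024) -> bytes:
    """Decompress gzip data in chunks, aborting as soon as a cap is exceeded."""
    # Magic-byte sniff: gzip streams start with 0x1f 0x8b. Anything else is
    # returned untouched (a separate response-size cap is assumed upstream).
    if data[:2] != b"\x1f\x8b":
        return data

    # The effective limit is whichever cap is tighter for this payload.
    limit = min(max_bytes, max(len(data), 1) * max_ratio)
    out = bytearray()
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            out.extend(chunk)
            if len(out) > limit:
                raise DecompressionLimitError(
                    f"decompressed size exceeds {limit} bytes"
                )
    return bytes(out)
```

Reading in fixed-size chunks means a bomb is rejected after at most one chunk past the limit, rather than after the whole payload has been inflated into memory. Truncated or corrupt gzip input still raises the stdlib's own EOFError / OSError, which a caller would presumably map to a failed status.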
Status: draft, needs more testing before merge
I'm opening this as a draft because the plugin has not been exercised
inside a live ArchiveBox install yet — only against the standalone
hook contract (the abx-plugins repo's test harness + manual runs
against a handful of real sitemaps). Before this is ready to land I'd
like to:
* … archivebox add flow and confirm the emitted Snapshot records feed the host's crawl frontier as expected
* … max_depth / max_urls settings on a real install
* … partial sitemap (pairs with parse_dom_outlinks)
* … MAX_SITEMAPS and ALLOW_PRIVATE_HOSTS knobs hit the right defaults for the ecosystem
If anyone on the abx-plugins side can spot something that won't survive
the host integration, I'd rather hear it now than after I've put more
hours into polish. Happy to keep iterating.
What this does
Adds a sitemap-driven URL discovery hook so a single seed URL can
expand into a full-site crawl without an external crawler in the
loop. Closes the long-standing
ArchiveBox/ArchiveBox#191.
The host (ArchiveBox or abx-dl) still owns crawl ordering, dedup, and the global max_depth / max_urls caps. This plugin only emits Snapshot JSONL records, so existing crawl semantics are preserved.
Features
* urlset parsing: <lastmod> as bookmarked_at, also reads <priority> and <changefreq>.
* sitemapindex recursion
* Content-Encoding: gzip header, .gz URL suffix.
* robots.txt Sitemap: discovery
* /sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, /wp-sitemap.xml, /sitemap.xml.gz (configurable).
* same_host_only mode
* PRIORITY_MIN threshold with optional REQUIRE_PRIORITY strictness; CHANGEFREQ_ALLOWED allowlist.
* url / lastmod / priority / order (sitemap order).
* … tags=sitemap-media and pass through the same URL policy as page URLs; news emits Tag records for publication names.
* … ALLOW_PRIVATE_HOSTS=true.
* CHECK_SSL_VALIDITY.
* //host/path resolved against the sitemap's own scheme.
* #anchor variants don't produce duplicate snapshots.
* MAX_URLS cap
* MAX_SITEMAPS cap
* fetching sitemap … line per fetch on stderr.
No new binary dependencies. Stdlib (xml.etree, gzip, ssl, urllib) plus the existing rich_click, defusedxml, and abx_plugins.plugins.base.utils.
Security posture
Untrusted sitemap input is the default threat model. Defenses:
* defusedxml.ElementTree.iterparse. DTDs and entities are refused; billion-laughs and XXE fail-closed.
* … gzip bombs fail with status=failed rather than OOM.
* http and https; javascript:, data:, ftp:, etc. are dropped.
* file:// containment: only allowed when the seed itself is file:// or the operator opts in.
* HTTPRedirectHandler that caps the chain and refuses non-HTTP schemes and private (loopback / RFC1918 / link-local / multicast) targets unless explicitly allowed.
* … (MAX_SITEMAPS) defeats adversarial sitemap-indexes pointing at thousands of broken children.
* iterparse with elem.clear() plus root.remove(elem) keeps resident XML bounded to a single <url> subtree even on 500k-URL documents.
* private-host check re-resolves on each policy call but cannot pin the IP through to urllib's connect-time lookup. If your threat model includes rebinding, run behind an outbound firewall.
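The streaming pattern named above (iterparse with elem.clear() plus root.remove(elem)) might look roughly like this. It is a sketch under stated assumptions: the stdlib xml.etree.ElementTree stands in for defusedxml.ElementTree so the snippet runs without extra dependencies (the stdlib parser is not hardened and should not be pointed at untrusted input), and iter_locs is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET  # stand-in; the PR uses defusedxml.ElementTree

# Namespace from the sitemaps.org protocol, in ElementTree's {uri}tag form.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def iter_locs(stream):
    """Yield each <loc> from a urlset while holding one <url> subtree at a time."""
    root = None
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first start event is the document root
            continue
        if elem.tag != SITEMAP_NS + "url":
            continue
        loc = elem.findtext(SITEMAP_NS + "loc")
        if loc and loc.strip():
            yield loc.strip()
        # Drop the finished subtree: clear() frees its children, and
        # remove() detaches the now-empty <url> from the root so the
        # root's child list does not grow with the document.
        elem.clear()
        try:
            root.remove(elem)
        except ValueError:
            pass  # <url> was not a direct child of the root; nothing to detach
```

Without the root.remove(elem) call, every processed element stays attached to the root as an empty shell, so memory still grows linearly with the number of entries. This sketch only matches namespaced documents; the unnamespaced case mentioned in the tests would need separate handling.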
Hook contract
Standard on_Snapshot__* per abx-plugins/README.md:
* stdout: 0+ Tag records, 0+ Snapshot records, terminal ArchiveResult. Every non-empty stdout line is a JSON record (the test harness asserts this).
* succeeded if any URL emitted, noresults when filters or empty sitemaps prune everything, skipped when the plugin is disabled, failed only when every candidate sitemap is unfetchable / unparseable or a security guard tripped.
* SNAP_DIR/parse_sitemap_urls/urls.jsonl written atomically and removed on noresults / failed.
Summary string carries counters so logs make it obvious why nothing was emitted.
Filename
Numbered 76, which sits after parse_dom_outlinks (75) and before any later snapshot work. The 73 slot is already taken by parse_netscape_urls.
What I tested
83 tests across two files:
* tests/test_parse_sitemap_urls.py (34): basic urlset and sitemapindex parsing, gzip, filtering, malformed input, robots.txt discovery, HTTP fetch via pytest-httpserver, ordering and dedup.
* tests/test_parse_sitemap_urls_advanced.py (49): BOM stripping, unicode, schemeless URLs, priority and changefreq filtering, all four sort modes, image / video / news extensions, HTTP retry on 5xx, retry exhaustion, redirect following, redirect count cap, scheme allowlist, remote → file:// chain refused, billion-laughs blocked, external entity blocked, gzip bomb capped, response body capped, truncated gzip handled cleanly, fragment stripping, media extras subject to full URL policy, MAX_SITEMAP_DEPTH semantics (0 walks only seed, 1 walks one child level), MAX_SITEMAPS enforced on attempts (not just successes), cross-host child sitemap refused under SAME_HOST_ONLY, JSONL contract enforced, 500k-URL streaming bounded by wall time.
Manual smoke runs against real sites (all clean):
bookmarked_at populated; /en/, /es/; SORT_BY=lastmod; SAME_HOST_ONLY=true; github.blog; EXCLUDE_REGEX=/3\.1[0-2]/; PRIORITY_MIN=0.8
Compatibility
* No required_plugins, no required_binaries, no preflight impact.
* defusedxml>=0.7.1 for XML hardening.
Open questions for reviewers
* MAX_SITEMAPS is 100: reasonable, or should it be higher / lower? Tied to typical real-world sitemap-index fan-out.
* ALLOW_PRIVATE_HOSTS defaults to false, which surfaces a private_host error when someone tries to archive their own intranet without opting in. Discoverable enough, or should the default flip when the seed itself is private?
* … parse_dom_outlinks for SPAs. Worth dropping a one-liner in parse_dom_outlinks/README.md recommending sitemap discovery as the "first pass" of a full-site crawl?
* … abx-plugins/README.md?
Marking ready-for-review only after the e2e run inside ArchiveBox is clean.
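For reviewers weighing the ALLOW_PRIVATE_HOSTS question, a private-host gate of the kind described in this PR could be approximated with the stdlib ipaddress module. This is only a sketch: is_private_address and host_is_private are hypothetical names, the resolver is injectable purely so the example can run without network access, and, as noted under the security posture, a check like this resolves at policy time and cannot pin the address that urllib later connects to.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_private_address(addr: str) -> bool:
    """True for loopback, RFC1918 / other private, link-local, or multicast IPs."""
    ip = ipaddress.ip_address(addr)
    return ip.is_loopback or ip.is_private or ip.is_link_local or ip.is_multicast


def host_is_private(url: str, resolve=socket.getaddrinfo) -> bool:
    """Resolve the URL's host and report whether any address is private.

    Fails closed: a URL without a hostname or a host that does not
    resolve is treated as private.
    """
    host = urlsplit(url).hostname
    if not host:
        return True
    try:
        infos = resolve(host, None)
    except OSError:
        return True
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address, possibly with an IPv6 "%scope" suffix.
    return any(is_private_address(info[4][0].split("%")[0]) for info in infos)
```

Flagging a host when any of its resolved addresses is private is the conservative choice here, since a hostname with mixed public and private records could otherwise slip through.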
Summary by cubic
Add parse_sitemap_urls plugin to discover URLs from sitemap.xml and emit Snapshot records, enabling full-site crawls from a single seed without an external crawler. Preserves host crawl ordering, dedup, and depth/URL caps.
New Features
* urlset and sitemapindex (recursive, cycle-safe); handles gzip and BOM.
* robots.txt and fallback probing; resolves schemeless URLs; strips fragments.
* defusedxml, response/decompression caps, HTTP-only allowlist, bounded redirects, private-host gate.
* Snapshot (and Tag for news); preserves <lastmod> as bookmarked_at.
Dependencies
* defusedxml>=0.7.1. No new binaries.
Written for commit 093c75d. Summary will update on new commits.
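The schemeless-URL resolution, fragment stripping, and HTTP-only allowlist mentioned in the summary can be sketched with urllib.parse alone. normalize_loc and ALLOWED_SCHEMES are hypothetical names, and the use of urljoin is an assumption: it also resolves relative paths, which the PR does not claim the plugin does.

```python
from typing import Optional
from urllib.parse import urldefrag, urljoin, urlsplit

# Hypothetical allowlist matching the http / https policy described above.
ALLOWED_SCHEMES = {"http", "https"}


def normalize_loc(loc: str, sitemap_url: str) -> Optional[str]:
    """Normalize one <loc> value, or return None if it must be dropped."""
    loc = loc.strip()
    if not loc:
        return None
    # A schemeless //host/path inherits the scheme of the sitemap itself.
    absolute = urljoin(sitemap_url, loc)
    # Strip #fragment so anchor variants collapse to a single URL.
    cleaned, _fragment = urldefrag(absolute)
    # Refuse javascript:, data:, ftp:, and anything else off the allowlist.
    if urlsplit(cleaned).scheme.lower() not in ALLOWED_SCHEMES:
        return None
    return cleaned
```

Normalizing before dedup matters: two entries that differ only by fragment compare equal afterwards, so they yield one Snapshot record instead of two.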