Clone entire documentation sites into AI-friendly markdown via their sitemap.
Built for pulling documentation into context for AI agents when the source repo isn't public.
go install github.com/devon/docs-cloner@latest

Or build from source:
git clone https://github.com/devon/docs-cloner.git
cd docs-cloner
go build -o docs-cloner .

docs-cloner --url https://example.com/sitemap.xml -o ./docs

This fetches every page in the sitemap, extracts the main content area, converts it to clean markdown with YAML frontmatter, and writes files mirroring the site's URL structure.
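The first step, reading page URLs out of the sitemap, might look roughly like the sketch below, which uses Go's standard `encoding/xml` package. The names `sitemapDoc`, `entry`, and `parseSitemap` are illustrative assumptions, not the tool's actual code. It handles both a plain `<urlset>` and a sitemap index that points at sub-sitemaps.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// entry is one <url> or <sitemap> element; only <loc> is read.
type entry struct {
	Loc string `xml:"loc"`
}

// sitemapDoc accepts either root element: <urlset> fills Pages,
// <sitemapindex> fills Children. (Hypothetical type for this sketch.)
type sitemapDoc struct {
	XMLName  xml.Name
	Pages    []entry `xml:"url"`
	Children []entry `xml:"sitemap"`
}

// parseSitemap returns the page URLs and sub-sitemap URLs found in data.
func parseSitemap(data []byte) (pages, children []string, err error) {
	var doc sitemapDoc
	if err = xml.Unmarshal(data, &doc); err != nil {
		return nil, nil, err
	}
	for _, e := range doc.Pages {
		pages = append(pages, strings.TrimSpace(e.Loc))
	}
	for _, e := range doc.Children {
		children = append(children, strings.TrimSpace(e.Loc))
	}
	return pages, children, nil
}

func main() {
	sample := []byte(`<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/getting-started</loc></url>
  <url><loc>https://example.com/docs/api/endpoints</loc></url>
</urlset>`)
	pages, children, err := parseSitemap(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(pages, children)
}
```

A caller would fetch each URL in `children` and run `parseSitemap` again to collect the pages of a sitemap index.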
Some documentation sites serve raw .md files at alternate URLs. Use --fetch-md to skip HTML conversion entirely:
# Append .md to each page URL (default pattern)
docs-cloner --url https://example.com/sitemap.xml --fetch-md
# Custom pattern with placeholders
docs-cloner --url https://example.com/sitemap.xml --fetch-md "{url}?plain=1"

docs-cloner --url https://example.com/sitemap.xml --single-file -o ./docs

This writes individual files and a concatenated all-pages.md with a table of contents at the top.
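To make the single-file idea concrete, here is a minimal Go sketch of concatenating pages behind a table of contents. The exact layout of all-pages.md is not documented here, so the TOC format, the `---` separators, and the names `page` and `concatPages` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// page is a hypothetical in-memory record of one converted page.
type page struct {
	Title string
	Path  string // relative output path, e.g. docs/intro.md
	Body  string // markdown body without frontmatter
}

// concatPages writes a numbered TOC first, then every page body in order.
// The layout is an assumed one, not the tool's documented format.
func concatPages(pages []page) string {
	var b strings.Builder
	b.WriteString("# Table of Contents\n\n")
	for i, p := range pages {
		fmt.Fprintf(&b, "%d. %s (%s)\n", i+1, p.Title, p.Path)
	}
	for _, p := range pages {
		fmt.Fprintf(&b, "\n---\n\n# %s\n\n%s\n", p.Title, strings.TrimSpace(p.Body))
	}
	return b.String()
}

func main() {
	fmt.Print(concatPages([]page{
		{Title: "Getting Started", Path: "docs/getting-started.md", Body: "Install the tool.\n"},
		{Title: "Endpoints", Path: "docs/api/endpoints.md", Body: "List of endpoints.\n"},
	}))
}
```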
If the auto-detection picks up the wrong content area, specify a CSS selector:
docs-cloner --url https://example.com/sitemap.xml --selector ".docs-content"

# Only grab English docs
docs-cloner --url https://example.com/sitemap.xml --include docs/en/
# Grab everything except the blog
docs-cloner --url https://example.com/sitemap.xml --exclude blog/

Git Bash (Windows): Omit the leading / from filter patterns (use docs/en/ not /docs/en/). Git Bash rewrites arguments starting with / into Windows paths, which breaks the filter. Alternatively, set MSYS_NO_PATHCONV=1.
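The substring filters can be pictured with the short Go sketch below. It assumes that a URL is kept when it contains at least one --include value (if any were given) and none of the --exclude values, and that excludes take precedence. The flag help only says "containing this substring", so that combination rule and the name `keepURL` are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// keepURL reports whether pageURL passes the filters.
// Assumed semantics: any exclude match rejects the URL; otherwise it is kept
// if there are no includes or if it contains at least one include substring.
func keepURL(pageURL string, include, exclude []string) bool {
	for _, ex := range exclude {
		if strings.Contains(pageURL, ex) {
			return false
		}
	}
	if len(include) == 0 {
		return true
	}
	for _, in := range include {
		if strings.Contains(pageURL, in) {
			return true
		}
	}
	return false
}

func main() {
	urls := []string{
		"https://example.com/docs/en/intro",
		"https://example.com/docs/fr/intro",
		"https://example.com/blog/hello-world",
	}
	for _, u := range urls {
		fmt.Println(u, keepURL(u, []string{"docs/en/"}, []string{"blog/"}))
	}
}
```

Because the match is a plain substring test, `docs/en/` matches the same URLs as `/docs/en/` would in practice, which is why dropping the leading slash under Git Bash is harmless.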
docs-cloner --url https://example.com/sitemap.xml -c 2 -d 500

Each page produces a .md file with YAML frontmatter:
---
title: Page Title
source_url: https://example.com/docs/getting-started
crawl_date: 2026-02-13T15:30:00-05:00
---
Page content in clean markdown...

Files are organized to mirror the site structure:
output/
  docs/
    getting-started.md
    api/
      authentication.md
      endpoints.md
  blog/
    hello-world.md
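As a rough illustration of how a page URL could map onto that layout, the Go sketch below derives an output path from the URL path. The function name `outputPath` and the choice of `index.md` for the site root are assumptions. The sketch uses the slash-based `path` package so the result is the same on every platform; a real implementation would probably use `path/filepath`.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// outputPath maps a page URL to a .md file under outDir.
// Cleaning the path from a leading "/" resolves any ".." segments
// so the result cannot escape outDir.
func outputPath(outDir, pageURL string) (string, error) {
	u, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	p := strings.Trim(path.Clean("/"+u.Path), "/")
	if p == "" {
		p = "index" // assumed name for the site root
	}
	return path.Join(outDir, p+".md"), nil
}

func main() {
	for _, u := range []string{
		"https://example.com/docs/getting-started",
		"https://example.com/docs/api/endpoints/",
		"https://example.com/",
	} {
		p, err := outputPath("output", u)
		if err != nil {
			panic(err)
		}
		fmt.Println(u, "->", p)
	}
}
```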
Usage:
  docs-cloner [flags]

Flags:
      --url string          Sitemap URL (required)
  -o, --output string       Output directory (default "./output")
      --fetch-md [pattern]  Fetch raw markdown instead of converting HTML.
                            Without a value, appends .md to each URL.
                            With a value, uses it as a URL pattern.
                            Placeholders: {url}, {path}, {host}
  -c, --concurrency int     Parallel workers (default 5)
  -d, --delay int           Per-worker delay between requests in ms (default 200)
      --single-file         Also produce a single concatenated all-pages.md
      --selector string     CSS selector for main content (default: auto-detect)
      --include strings     Only process URLs containing this substring (repeatable)
      --exclude strings     Skip URLs containing this substring (repeatable)
      --clean               Remove output directory before writing
  -v, --verbose             Log every page
      --user-agent string   Custom User-Agent (default "docs-cloner/1.0")
  -h, --help                Show help
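The --fetch-md placeholders can be read as simple string substitution. The Go sketch below assumes {url} is the full page URL, {path} is its path component, and {host} is its host. Those meanings are inferred from the placeholder names, and `expandPattern` is a hypothetical helper, not the tool's own function.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// expandPattern builds the raw-markdown URL for a page.
// An empty pattern mimics the documented default of appending ".md".
func expandPattern(pattern, pageURL string) (string, error) {
	u, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	if pattern == "" {
		return pageURL + ".md", nil
	}
	r := strings.NewReplacer(
		"{url}", pageURL, // assumed: the full page URL
		"{path}", u.Path, // assumed: the path, with its leading slash
		"{host}", u.Host, // assumed: the host name
	)
	return r.Replace(pattern), nil
}

func main() {
	page := "https://example.com/docs/intro"
	for _, pat := range []string{"", "{url}?plain=1", "https://raw.{host}{path}.md"} {
		out, err := expandPattern(pat, page)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q -> %s\n", pat, out)
	}
}
```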
- Fetches and parses the XML sitemap (supports sitemap index files with sub-sitemaps)
- Fans out page URLs to a configurable worker pool
- Each worker fetches the page, extracts content using CSS selectors (heuristic cascade or explicit), and converts to markdown
- Strips navigation, sidebars, footers, and other noise
- Adds YAML frontmatter with title, source URL, and crawl date
- Writes .md files mirroring the site's URL path structure
- Optionally concatenates everything into a single file with a TOC
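The fan-out step in the list above can be sketched as a small worker pool in Go. This is an illustrative assumption about the shape of the code, not the tool's implementation: `crawl` and the injected `fetch` function are hypothetical, and the delay is applied per worker after each request, matching the wording of the -d flag.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// crawl sends urls to a pool of workers. Each worker calls fetch, stores the
// result, then sleeps for delay before taking the next URL.
func crawl(urls []string, workers int, delay time.Duration, fetch func(string) string) map[string]string {
	if workers < 1 {
		workers = 1 // avoid blocking forever with no consumers
	}
	jobs := make(chan string)
	results := make(map[string]string, len(urls))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				body := fetch(u)
				mu.Lock()
				results[u] = body
				mu.Unlock()
				time.Sleep(delay)
			}
		}()
	}
	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b", "https://example.com/c"}
	// A stand-in fetcher so the sketch runs without network access.
	fake := func(u string) string { return "content of " + u }
	out := crawl(urls, 2, 10*time.Millisecond, fake)
	fmt.Println(len(out), "pages fetched")
}
```

With this shape, the overall request rate is bounded by roughly workers divided by delay, which is why lowering -c and raising -d together makes a crawl gentler.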
When no --selector is provided, the tool tries these selectors in order and uses the first match with substantial content:
main > article > [role="main"] > .content > .main-content > #content > .markdown-body > .documentation-content > .docs-content > .page-content
Noise elements like nav, .sidebar, .toc, .breadcrumb, script, and style are removed before conversion.
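The cascade described above amounts to "first selector whose match is long enough wins". The Go sketch below shows that logic with the selector engine abstracted behind a `query` function. The 200-character threshold is a made-up stand-in for "substantial content", and `pickSelector` is a hypothetical name. Only the selector order comes from the list above.

```go
package main

import (
	"fmt"
	"strings"
)

// cascade is the selector order listed in this README.
var cascade = []string{
	"main", "article", `[role="main"]`, ".content", ".main-content",
	"#content", ".markdown-body", ".documentation-content",
	".docs-content", ".page-content",
}

// minChars is an assumed threshold for "substantial content".
const minChars = 200

// pickSelector returns the first selector whose extracted text reaches
// minChars. query stands in for a real CSS selector engine and returns the
// text matched by a selector ("" when nothing matches).
func pickSelector(query func(selector string) string) (string, bool) {
	for _, sel := range cascade {
		if len(strings.TrimSpace(query(sel))) >= minChars {
			return sel, true
		}
	}
	return "", false
}

func main() {
	// Fake page: <main> holds only a stub, the real text is in .markdown-body.
	texts := map[string]string{
		"main":           "Docs",
		".markdown-body": strings.Repeat("Real documentation text. ", 20),
	}
	sel, ok := pickSelector(func(s string) string { return texts[s] })
	fmt.Println(sel, ok)
}
```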
- Does not execute JavaScript. Sites that render content client-side will produce empty or incomplete output. Use --fetch-md as a workaround for sites that serve raw markdown.
- Respects the sitemap only. Pages not listed in the sitemap won't be cloned.
- No robots.txt checking. Be respectful with concurrency and delay settings.