Python library for working with Screaming Frog SEO Spider crawl data programmatically.

The public alpha is focused on DB-backed crawl workflows:

- query `.dbseospider` crawls without manual exports
- access all 628 mapped export/report surfaces
- run sitewide page and link queries, raw SQL, crawl diff, and chain analysis
- convert `.seospider` crawls into queryable DB-backed workflows
- use DuckDB as the default analysis engine, with Derby as the crawl source-of-truth

See methods.md for a complete method-level API reference.
Status:

- 601 / 628 tabs fully mapped; 15,490 / 15,589 fields mapped
- current `main` passes 195 tests (2 skipped live/optional tests)

Known limitations:

- Title and meta-description pixel-width filters are not implemented yet.
- Some hreflang edge cases still do not have exact Derby parity (incorrect language-code cases).
- `.seospider` conversion requires a local Screaming Frog CLI install.
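To illustrate what the missing pixel-width filters would measure, here is a rough, self-contained approximation (illustrative only — the GUI uses real font metrics, which is exactly why exact parity is non-trivial; the width table and threshold below are assumptions, not Screaming Frog's values):

```python
# Rough illustration only: approximate rendered pixel width with a tiny
# per-character width table. Real pixel-width filters use actual font metrics.
CHAR_WIDTH = {"i": 4, "j": 4, "l": 4, "m": 12, "w": 12, " ": 5}
DEFAULT_WIDTH = 8  # assumed average glyph width in pixels

def approx_pixel_width(text: str) -> int:
    return sum(CHAR_WIDTH.get(ch.lower(), DEFAULT_WIDTH) for ch in text)

# SEO tools commonly flag titles wider than roughly 561px in desktop SERPs
too_wide = [t for t in ("Short title", "A" * 80) if approx_pixel_width(t) > 561]
```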
```python
from screamingfrog import Crawl

crawl = Crawl.load("./exports")
for page in crawl.internal.filter(status_code=404):
    print(page.address)
```

`Crawl.load(...)` accepts several crawl sources:

```python
from screamingfrog import Crawl, list_crawls

# CSV exports directory
crawl = Crawl.load("./exports")

# SQLite database
crawl = Crawl.load("./crawl.db")

# DuckDB analytics cache
crawl = Crawl.load("./crawl.duckdb")

# Derby .dbseospider file -> auto-promotes into a sibling DuckDB cache by default
crawl = Crawl.load("./crawl.dbseospider")

# Screaming Frog .seospider crawl (default: convert to DB + DuckDB-backed analysis)
crawl = Crawl.load("./crawl.seospider")

# Disable .dbseospider materialization (still uses Derby from ProjectInstanceData)
crawl = Crawl.load(
    "./crawl.seospider",
    materialize_dbseospider=False,
)

# Force CSV mode for .seospider (CLI export -> CSV backend)
crawl = Crawl.load(
    "./crawl.seospider",
    seospider_backend="csv",
    export_dir="./exports_from_seospider",
    export_tabs=["Internal:All", "External:All", "Response Codes:All"],
)

# Kitchen-sink export profile (all tabs/bulk exports from SF UI)
crawl = Crawl.load(
    "./crawl.seospider",
    seospider_backend="csv",
    export_dir="./exports_kitchen",
    export_profile="kitchen_sink",
)

# DB crawl ID (DB mode) loads DuckDB-backed analysis by default
crawl = Crawl.load("138edb21-61d0-41cd-9e9b-725b592a471c", source_type="db_id")

# DB crawl ID -> export and load a DuckDB analytics cache directly
crawl = Crawl.load(
    "138edb21-61d0-41cd-9e9b-725b592a471c",
    source_type="db_id",
    db_id_backend="duckdb",
    duckdb_path="./crawl.duckdb",
    duckdb_tabs="all",
)

# Discover available DB crawls, then load one by ID
latest = list_crawls()[0]
crawl = Crawl.load(latest.db_id, source_type="db_id")
```

Notes:

- `.dbseospider`, DB crawl IDs, and `.seospider` conversions default to DuckDB-backed analysis.
- Use `dbseospider_backend="derby"` / `db_id_backend="derby"` / `seospider_backend="derby"` to stay on Derby.
- `.seospider` defaults to DB conversion (CLI load + Derby source, DuckDB analysis). Use `seospider_backend="csv"` for exports.
- `.seospider` auto-materializes a `.dbseospider` file next to the crawl (overwrite default).
- Set `materialize_dbseospider=False` to avoid creating the `.dbseospider` cache file.
- Set `dbseospider_overwrite=False` to reuse an existing `.dbseospider` cache.
- DB conversion can temporarily set `storage.mode=DB` in `spider.config` (set `ensure_db_mode=False` to skip).
- Internal DB crawl directories (e.g. `ProjectInstanceData/.../results_.../sql`) load via Derby.
- DB crawl IDs can force CSV exports with `db_id_backend="csv"`.
- DuckDB cache refresh defaults to `duckdb_if_exists="auto"` and rebuilds only when the Derby source changed.
- Set `SCREAMINGFROG_CLI` if the CLI executable is not in a standard install path.
- CLI exports default to the `Internal:All` tab unless `export_tabs` is provided.
- `export_profile="kitchen_sink"` uses bundled export lists captured from the SF UI.
- Derby loads can auto-fallback to CSV exports for missing columns or GUI filters (`csv_fallback=True`, `csv_fallback_profile="kitchen_sink"`).
- CSV fallback cache defaults to `csv_fallback_cache_dir` (next to the crawl); set `csv_fallback=False` to disable.
- `.duckdb` loads use the DuckDB analytics backend directly.
DuckDB is the default analysis layer for DB-backed crawl workflows. Derby remains the crawl source-of-truth:
```python
from screamingfrog import Crawl

derby_crawl = Crawl.load("./crawl.dbseospider", dbseospider_backend="derby", csv_fallback=False)
derby_crawl.export_duckdb("./crawl.duckdb", if_exists="auto")
fast = Crawl.load("./crawl.duckdb")

# one DuckDB file can also hold multiple crawls under separate namespaces
derby_crawl.export_duckdb("./portfolio.duckdb", namespace="client-a", if_exists="auto")
other_crawl.export_duckdb("./portfolio.duckdb", namespace="client-b", if_exists="auto")
namespaces = Crawl.duckdb_namespaces("./portfolio.duckdb")
client_a = Crawl.from_duckdb("./portfolio.duckdb", namespace="client-a")

pages_404 = fast.pages().filter(status_code=404).collect()
lightweight = fast.pages().select("Address", "Status Code", "Title 1").collect()
broken_inlinks = fast.links("in").select("Source", "Address", "Status Code").filter(status_code=404).collect()
matching_pages = fast.search("canonical", fields=["Address", "Title 1"]).collect()
links = fast.links("in").filter(status_code=404).collect()

rows = (
    fast.query("APP", "URLS")
    .select("ENCODED_URL", "RESPONSE_CODE")
    .where("RESPONSE_CODE >= ?", 400)
    .collect()
)
```

Notes:
- Derby remains the source-of-truth crawl store.
- DuckDB is the default analysis engine for DB-backed workflows.
- Default DB-backed loads now create a tiny sidecar DuckDB cache first, keep Derby prewarmed as the lazy source backend, and only materialize heavier relations if you actually ask for them.
- DuckDB caches can now store multiple crawls in one `.duckdb` file via namespaces; pass `namespace=...` on export and `Crawl.from_duckdb(..., namespace=...)` on load.
- Repeated DB-backed loads in the same Python process now reuse the cached Derby source backend for the same crawl fingerprint, so reopening the same crawl avoids paying Derby startup again.
- High-level page workflows (`crawl.pages()`, page counts, page iteration) now read from the internal model directly instead of forcing `internal_all` tab materialization on a cold cache.
- `crawl.pages().select(...)` now projects narrow page field subsets through a shared `internal_common` helper relation or the prewarmed Derby source backend, so lightweight page workflows avoid wide `internal_all` materialization too.
- `crawl.links(...).select(...)` now does the same against the shared `links_core` helper relation, so lightweight sitewide link queries avoid materializing `all_inlinks`/`all_outlinks` tabs on cold caches.
- Cold-cache projected page/link reads now prefer one-shot source-backed projections before writing helper relations into DuckDB, so first-use lightweight workflows stay closer to direct-query cost instead of paying a cache-write penalty up front.
- `compare()` now uses the same source-backed projection path for its wider internal field set, so cold-cache crawl diffs no longer fall back to full `crawl.internal` scans.
- Cold-cache graph workflows (`broken_links_report`, `broken_inlinks_report`, `nofollow_inlinks_report`) can execute directly from the prewarmed Derby source, so they return without first exporting wide `all_inlinks` tables into DuckDB.
- Generic `crawl.tab(...)`/`crawl.tab_columns(...)` calls also fall back to the prewarmed source backend when a tab is not cached yet, so first-use tab access no longer forces a DuckDB export round-trip.
- When DuckDB does need cached subsets, it now materializes narrow helper relations instead of forcing full `internal_all`/`all_inlinks` exports.
- `compare()` now uses a DuckDB-first projection path too, so crawl diffs only pull the internal fields required for diffing instead of full `internal_all` rows.
- `title_meta_audit()` runs DuckDB-first when `internal_all` is already cached, and otherwise falls back to the same high-level internal model.
- DuckDB `inlinks(url)`/`outlinks(url)` fall back to the source backend or narrow cached link relations, so they still work on lean caches without `all_inlinks`/`all_outlinks`.
- Issue-family helpers read DuckDB issue relations directly when they exist in the cache.
- Chain helpers now fall back to raw DuckDB traversal too, so redirect/canonical chain methods no longer require materialized chain tabs on lean caches.
- `summary()` keeps the core crawl counts fast on cold caches; issue-family and chain counts are `None` until those tab families are materialized.
- You can also export directly from a DB crawl id with `export_duckdb_from_db_id(...)`.
- `.dbseospider`, `.seospider`, and DB crawl ID loaders can all auto-promote to DuckDB.
- Use `tabs="all"` if you want to materialize every currently available mapped tab into the DuckDB cache.
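The chain fallback mentioned above boils down to walking a redirect map. A minimal, self-contained sketch of that traversal (plain Python for illustration, not the library's DuckDB implementation):

```python
def find_redirect_chains(redirects, min_hops=2):
    """Walk a {url: redirect_target} map and collect (start, hops) chains
    with at least min_hops hops; revisiting a URL marks a loop."""
    chains = []
    for start in redirects:
        hops, seen, current = [], {start}, start
        while current in redirects:
            nxt = redirects[current]
            hops.append(nxt)
            if nxt in seen:
                break  # loop detected; stop walking this chain
            seen.add(nxt)
            current = nxt
        if len(hops) >= min_hops:
            chains.append((start, hops))
    return chains

chains = find_redirect_chains(
    {"/a": "/b", "/b": "/c", "/c": "/final", "/x": "/y"}, min_hops=2
)
```

The same walk generalizes to canonical chains by swapping the redirect map for a canonical-target map.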
```python
from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider")

matching_pages = crawl.search("blog", fields=["Address", "Title 1"]).collect()
projected = crawl.pages().select("Address", "Status Code", "Title 1").filter(status_code=404).collect()
nofollow_links = crawl.links("in").search("nofollow", fields=["Follow"]).collect()
blog_inlinks = crawl.section("/blog").tab("all_inlinks").collect()

orphans = crawl.orphan_pages_report(only_indexable=True)
broken_inlinks = crawl.broken_inlinks_report()
security_issues = crawl.security_issues_report()
canonical_issues = crawl.canonical_issues_report()
hreflang_issues = crawl.hreflang_issues_report()
redirect_issues = crawl.redirect_issues_report()
summary = crawl.summary()
```

Use `list_crawls()` to enumerate DB-mode crawls in your local Screaming Frog `ProjectInstanceData` directory, without opening Derby or starting Java.
```python
from screamingfrog import list_crawls

for info in list_crawls():
    print(info.db_id, info.url, info.urls_crawled, info.modified)
```

`list_crawls(project_root=...)` returns `CrawlInfo` objects with:

- `db_id`: crawl UUID folder name
- `url`: crawl start URL
- `urls_crawled`: number of crawled URLs
- `percent_complete`: crawl completion percentage
- `modified`: last modified timestamp (UTC)
- `path`: absolute path to the crawl folder
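Conceptually this is a plain directory scan. A self-contained sketch of the idea (illustrative only — the real `CrawlInfo` also parses crawl metadata such as the start URL and completion percentage):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path

@dataclass
class CrawlFolder:
    db_id: str
    path: str
    modified: datetime

def scan_project_root(project_root):
    """Treat each subfolder of the project directory as a crawl, keyed by
    its UUID folder name, and return them newest-first by mtime."""
    folders = [
        CrawlFolder(
            db_id=p.name,
            path=str(p.resolve()),
            modified=datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc),
        )
        for p in Path(project_root).iterdir()
        if p.is_dir()
    ]
    return sorted(folders, key=lambda f: f.modified, reverse=True)
```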
In addition to the typed internal view, you can iterate any exported tab:
```python
from screamingfrog import Crawl

crawl = Crawl.load("./exports")

# List available CSV tabs
print(crawl.tabs)

# Access a tab by file name (extension optional)
for row in crawl.tab("response_codes_all"):
    print(row["Address"], row["Status Code"])

# Filter using column names or snake_case equivalents
for row in crawl.tab("internal_all").filter(status_code="404"):
    print(row["Address"])

# Apply GUI filters (when supported)
for row in crawl.tab("page_titles").filter(gui="Missing"):
    print(row["Address"], row["Title 1"])
```

Notes:
- The CSV backend exposes any `*.csv` in the export folder.
- The Derby backend exposes tabs mapped in `schemas/mapping.json` (or `SCREAMINGFROG_MAPPING`).
- Hybrid Derby+CSV fallback is enabled by default for `Crawl.load` and will export missing tabs on demand.
- The SQLite backend supports only a small set of high-value tabs (response codes, titles, meta description, `internal_all`).
- For exact GUI filter behavior, use CSV exports (e.g., `export_profile="kitchen_sink"`).
- Derby now natively supports `Response Codes > Internal Redirect Chain` and `Hreflang > Not Using Canonical`.
- HTTP canonical/rel fields in Derby are parsed from `HTTP_RESPONSE_HEADER_COLLECTION` when present.
- Derby-backed `crawl.internal` now materializes computed mapped fields like `Indexability` and `Indexability Status`.
- Derby filters now work against mapped expression fields and header-derived fields in both `crawl.internal` and `crawl.tab(...)`.
- Some link metrics (Link Score, % of Total, JS outlink counts) are not mapped in Derby yet.
Use first-class page/link views when you do not want to remember tab names:
```python
from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider")

pages_404 = crawl.pages().filter(status_code=404).collect()
nofollow_inlinks = crawl.links("in").filter(rel="nofollow").collect()
blog_pages = crawl.section("/blog").pages().collect()
blog_outlinks = crawl.section("/blog").links("out").collect()
```

Notes:

- `crawl.pages()` is a mapped sitewide page view backed by the internal page model, with DuckDB/source-backed fast paths for counts and iteration.
- `crawl.links("in")` / `crawl.links("out")` are sitewide mapped link views backed by cached link tabs when available and by the source backend when the cache is still lean.
- `crawl.section("/blog")` matches by URL path prefix; pass a full URL prefix if you want host-specific scoping.
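The path-prefix rule behind section scoping can be sketched in a few lines of plain Python (an illustration of the matching rule, not the library's implementation):

```python
from urllib.parse import urlsplit

def in_section(url: str, prefix: str) -> bool:
    """Path-prefix match; a full URL prefix scopes to that host as well."""
    if prefix.startswith(("http://", "https://")):
        return url.startswith(prefix)
    return urlsplit(url).path.startswith(prefix)
```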
When using a .dbseospider crawl, you can read inlinks/outlinks directly from Derby:
```python
from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider")
for link in crawl.inlinks("https://example.com/page"):
    if link.data.get("NoFollow"):
        print(link.source, "->", link.destination, link.data.get("Rel"))
```

Dedicated chain helpers are available on `Crawl`:
```python
from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider")

# Redirect chains with 3+ hops and no loop
for row in crawl.redirect_chains(min_hops=3, loop=False):
    print(row["Address"], row.get("Number of Redirects"))

# Canonical chains
for row in crawl.canonical_chains(min_hops=2):
    print(row["Address"], row.get("Number of Canonicals"))

# Mixed redirect+canonical chains
for row in crawl.redirect_and_canonical_chains(min_hops=4):
    print(row["Address"], row.get("Number of Redirects/Canonicals"))
```

Thin report helpers are available for common workflows:
```python
from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider")

broken = crawl.broken_links_report()
title_meta = crawl.title_meta_audit()
non_indexable = crawl.indexability_audit()
chains = crawl.redirect_chain_report(min_hops=3)
```

Notes:

- `broken_links_report()` returns broken internal URLs with inlink counts and sampled inlink sources when available.
- `title_meta_audit()` currently surfaces missing titles and missing meta descriptions as flat issue rows.
- `indexability_audit()` returns non-indexable pages with the key indexability fields that explain why.
- `redirect_chain_report()` is a collected helper over `crawl.redirect_chains(...)`.
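The aggregation behind a broken-links report can be sketched generically — group inlink rows by broken destination, count them, and sample a few sources (illustrative plain Python, not the library's query):

```python
from collections import defaultdict

def broken_links_summary(inlinks, max_samples=3):
    """Group (source, destination, status) link rows by broken destination,
    with an inlink count and a few sampled inlink sources."""
    grouped = defaultdict(list)
    for source, destination, status in inlinks:
        if status >= 400:
            grouped[destination].append(source)
    return [
        {
            "destination": dest,
            "inlink_count": len(sources),
            "sample_sources": sources[:max_samples],
        }
        for dest, sources in grouped.items()
    ]

report = broken_links_summary([
    ("/home", "/missing", 404),
    ("/blog", "/missing", 404),
    ("/home", "/ok", 200),
])
```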
Mapped fields are stable and documented. Raw access is available for advanced users who want immediate access to Derby/SQLite columns even when mappings are incomplete.
```python
from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider", csv_fallback=False)

# Raw table rows (Derby/SQLite only)
for row in crawl.raw("APP.URLS"):
    print(row["ENCODED_URL"], row["RESPONSE_CODE"])

# SQL passthrough (Derby/SQLite only)
for row in crawl.sql(
    "SELECT ENCODED_URL, RESPONSE_CODE FROM APP.URLS WHERE RESPONSE_CODE >= ?",
    [400],
):
    print(row)
```

Notes:

- `raw()` / `sql()` are not supported for CSV/CLI export backends.
- Raw column names may vary by backend and Screaming Frog version.
Use a chainable API for common SQL without writing full query strings:
```python
from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider", csv_fallback=False)

rows = (
    crawl.query("APP", "URLS")
    .select("ENCODED_URL", "RESPONSE_CODE", "TITLE_1")
    .where("RESPONSE_CODE >= ?", 400)
    .order_by("RESPONSE_CODE DESC", "ENCODED_URL ASC")
    .limit(100)
    .collect()
)
```

Notes:

- `crawl.query(...)` uses the backend SQL engine (Derby/Hybrid/SQLite).
- CSV/CLI export backends do not support SQL/query execution.
- Use `.to_sql()` if you want to inspect the generated SQL + params.
- `InternalView`, `TabView`, `LinkView`, `QueryView`, and `CrawlDiff` also support `to_pandas()` / `to_polars()` with optional dependencies installed.
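To picture what a chainable builder like this emits, here is a self-contained sketch of assembling parameterized SQL in the same shape (an illustration only; the library's actual `.to_sql()` output may differ):

```python
def build_sql(schema, table, select, where=None, order_by=None, limit=None):
    """Assemble a parameterized SELECT the way a chainable builder might.

    `where` is a ("clause with ?", value, ...) tuple; Derby uses
    FETCH FIRST ... ROWS ONLY rather than LIMIT.
    """
    sql = f"SELECT {', '.join(select)} FROM {schema}.{table}"
    params = []
    if where:
        clause, *vals = where
        sql += f" WHERE {clause}"
        params.extend(vals)
    if order_by:
        sql += " ORDER BY " + ", ".join(order_by)
    if limit is not None:
        sql += f" FETCH FIRST {limit} ROWS ONLY"
    return sql, params

sql, params = build_sql(
    "APP", "URLS",
    select=["ENCODED_URL", "RESPONSE_CODE"],
    where=("RESPONSE_CODE >= ?", 400),
    order_by=["RESPONSE_CODE DESC"],
    limit=100,
)
```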
```python
from screamingfrog import Crawl

old = Crawl.load("./crawl-2024-01.dbseospider")
new = Crawl.load("./crawl-2024-02.dbseospider")

diff = new.compare(old)
print(diff.summary())
for change in diff.status_changes[:5]:
    print(change.url, change.old_status, "->", change.new_status)
```

Notes:

- Title comparison uses `Title 1` by default (override via `compare(..., title_fields=...)`).
- Redirect changes are best-effort and depend on available columns/headers.
- Additional field changes are captured for canonical + canonical status, meta description/keywords/refresh, H1/H2/H3, word count, indexability, and robots + directives summary by default (override via `compare(..., field_groups=...)`).
- `diff.to_rows()` flattens all change buckets into one row list for export/dataframes.
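The flattening that a `to_rows()`-style helper performs can be sketched generically (illustrative, assuming change buckets are lists of dicts):

```python
def to_rows(buckets):
    """Flatten {bucket_name: [change_dict, ...]} into one row list,
    tagging each row with its change bucket for export/dataframes."""
    return [
        {"change_type": bucket, **change}
        for bucket, changes in buckets.items()
        for change in changes
    ]

rows = to_rows({
    "status_changes": [{"url": "/a", "old": 200, "new": 404}],
    "title_changes": [{"url": "/b", "old": "Old", "new": "New"}],
})
```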
Ready-to-run scripts are available in examples/:
- `examples/broken_links_report.py`
- `examples/title_meta_audit.py`
- `examples/crawl_diff.py`
```python
from screamingfrog import Crawl

crawl = Crawl.load("./exports")

# List GUI filter names for a tab
print(crawl.tab_filters("Page Titles"))

# Inspect columns (CSV header or Derby mapping)
print(crawl.tab_columns("page_titles"))

# Get both in one shot
print(crawl.describe_tab("page_titles"))
```

You can access the bundled kitchen-sink export lists directly:

```python
from screamingfrog.config import get_export_profile

profile = get_export_profile("kitchen_sink")
print(len(profile.export_tabs), len(profile.bulk_exports))
```

The package includes Python wrappers around the Screaming Frog CLI:
```python
from screamingfrog import export_crawl, start_crawl

# Start a crawl from a URL
start_crawl(
    "https://example.com",
    "./out",
    save_crawl=True,
    export_tabs=["Internal:All", "Response Codes:All"],
)

# Export from an existing crawl file (.seospider / .dbseospider)
export_crawl(
    "./crawl.seospider",
    "./exports",
    export_tabs=["Internal:All", "Page Titles:Missing"],
)
```

`.dbseospider` files are zip archives of a DB-mode crawl folder. You can pack or unpack them with helpers:
```python
from screamingfrog import (
    export_dbseospider_from_seospider,
    pack_dbseospider,
    pack_dbseospider_from_db_id,
    unpack_dbseospider,
)

# Package an internal DB crawl folder
dbseospider = pack_dbseospider(
    r"C:\Users\Antonio\.ScreamingFrogSEOSpider\ProjectInstanceData\<project_id>",
    r"C:\Users\Antonio\my-crawl.dbseospider",
)

# Package by DB crawl ID
dbseospider = pack_dbseospider_from_db_id(
    "7c356a1b-ea14-40f3-b504-36c3046432a2",
    r"C:\Users\Antonio\my-crawl.dbseospider",
)

# Convert a .seospider crawl into .dbseospider
dbseospider = export_dbseospider_from_seospider(
    r"C:\Users\Antonio\schema-discovery\actionnetwork_crawl\crawl.seospider",
    r"C:\Users\Antonio\actionnetwork.dbseospider",
)

# Extract a .dbseospider file
unpack_dbseospider(
    r"C:\Users\Antonio\my-crawl.dbseospider",
    r"C:\Users\Antonio\unpacked_crawl",
)
```

Notes:

- `export_dbseospider_from_seospider` runs the Screaming Frog CLI, then packages the newly created DB crawl folder. If your DB storage path is custom, set `SCREAMINGFROG_PROJECT_DIR` or pass `project_root=...`.
- The helper can force `storage.mode=DB` via `spider.config` (set `ensure_db_mode=False` to skip).
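Since `.dbseospider` archives are ordinary zips of the crawl folder, the pack/unpack step can be sketched with the stdlib alone (illustrative — the real helpers layer crawl-specific handling on top of this):

```python
import zipfile
from pathlib import Path

def pack_folder(folder, archive_path):
    """Zip a crawl folder's files under folder-relative archive names."""
    folder = Path(folder)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(folder))
    return archive_path

def unpack_archive(archive_path, dest):
    """Extract every archive member into the destination folder."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest)
    return dest
```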
Use ConfigPatches to build patch JSON for the Java ConfigBuilder:
```python
from screamingfrog import ConfigPatches, CustomSearch, CustomJavaScript

patches = ConfigPatches()
patches.set("mCrawlConfig.mRenderingMode", "JAVASCRIPT")
patches.add_custom_search(CustomSearch(name="Filter 1", query=".*", data_type="REGEX"))
patches.add_custom_javascript(
    CustomJavaScript(name="Extractor 1", javascript="return document.title;")
)
patch_json = patches.to_json()
```

Apply patches directly to a `.seospiderconfig` file:

```python
from screamingfrog import ConfigPatches, write_seospider_config

patches = ConfigPatches().set("mCrawlConfig.mMaxUrls", 5000)
write_seospider_config(
    "base.seospiderconfig",
    "alpha.seospiderconfig",
    patches,
)
```

Recommended install from PyPI:
```
python -m pip install screamingfrog
```

If you want the latest unreleased `main` branch instead:

```
python -m pip install "git+https://github.com/Amaculus/screaming-frog-api.git@main"
```

For local development from a clone:

```
python -m pip install -e .[dev]
```

Derby support (`.dbseospider`), DuckDB export, and `.seospiderconfig` writing are included in the base install. Optional extras still exist (`[derby]`, `[config]`, `[duckdb]`, `[alpha]`) but are not required for a standard install.
Bundled Derby jars are included with this package (Apache Derby 10.17.1.0), so
DERBY_JAR is optional. Set DERBY_JAR if you want to override the bundled jars
or use a different Derby install.
The Derby driver jars are bundled, but you still need a Java runtime (`java.exe` / `java`) available. If Java is missing, Derby loads raise:

```
RuntimeError: Java runtime not found. Set JAVA_HOME or add java to PATH.
```

Quick checks and fixes:

- Run `java -version` to confirm a runtime is available.
- If Screaming Frog desktop is installed, this library already tries these paths automatically:
  - `C:\Program Files (x86)\Screaming Frog SEO Spider\jre`
  - `C:\Program Files\Screaming Frog SEO Spider\jre`
- Otherwise install a JRE/JDK and set `JAVA_HOME` (or add Java to `PATH`).

Windows PowerShell example:

```powershell
$env:JAVA_HOME = "C:\Program Files\Java\jdk-21"
$env:Path = "$env:JAVA_HOME\bin;$env:Path"
```

Third-party notices for Apache Derby are included in `screamingfrog/vendor/derby/NOTICE`.
Derby tab mapping uses schemas/mapping.json. Set SCREAMINGFROG_MAPPING if
you store the mapping elsewhere.
To help map more GUI tabs to Derby (see Antonio's LinkedIn for progress):

- Source of truth: `schemas/mapping.json` (keys = normalized export filenames, e.g. `internal_all.csv`).
- Workflow: compare the CSV schema in `schemas/csv/` with the Derby schema in `schemas/db/tables/`; prefer `db_column` -> `db_expression` -> `header_extract`/`blob_extract`/`derived_extract`/`multi_row_extract` -> `NULL`; then add/update tests.
- Automation (run from the repo root):

```
python scripts/suggest_mappings.py --tab hreflang_all.csv   # suggestions for one tab
python scripts/suggest_mappings.py --tab-family hreflang    # all hreflang_* tabs
python scripts/suggest_mappings.py --list-unmapped          # tabs with unmapped columns
python scripts/suggest_mappings.py --patch --tab my_tab     # JSON fragment to merge into mapping.json
python scripts/suggest_mappings.py --report-nulls           # regenerate mapping_nulls.md content
```

- PRs: prefer PRs to `schemas/mapping.json` for new column coverage; for repeated Derby SQL incompatibilities, fix in `screamingfrog/backends/derby_backend.py`; for GUI filter parity, use `screamingfrog/filters/*.py`. See `scripts/README.md`, `schemas/mapping_nulls.md`, `schemas/inlinks_mapping_nulls.md`, and `MAPPING_BACKLOG.md` for the current backlog and known hard families.
```
python -m pip install -e .[dev]
pytest
```

Optional live smoke coverage for a real local SF crawl:

```
SCREAMINGFROG_RUN_LIVE_SMOKE=1 pytest -q tests/test_live_smoke.py -rs --basetemp .pytest-tmp
```