Discover, verify, and bulk-download every TP-Link GPL source release from one command-line pipeline.
Documentation • Installation • Quick start • Commands
Under the GPL, TP-Link must publish the corresponding source code for the open-source components in its firmware. That source lives in an S3-backed bucket (static.tp-link.com) that serves every file without authentication but blocks directory listing, so there is no index and files surface only on scattered regional support pages. tpwalk rebuilds the index: it discovers GPL archive URLs from nine independent sources, HEAD-checks each one against the S3 origin for live status and metadata, and emits a ready-to-run s5cmd manifest for mirroring the whole corpus. The same technique covers the Mercusys sub-brand (static.mercusys.com).
This is a FOSS-compliance and archival tool. Every file it touches is source code a vendor is legally obligated to distribute and already serves publicly; see Responsible use.
- Nine discovery sources: regional GPL pages, Wayback Machine, Common Crawl, GitHub code search, TP-Link's own GitHub, Mercusys regional pages, Reddit, forums, and Google, each writing its own result file.
- Two-pass regional crawl: parses the
productTreeJSON embedded in every country's GPL page, then follows per-modelphppagesub-pages that hide files the main page never links. Different regions expose different files, so tpwalk scrapes all of them. - S3-origin verification: HEAD-checks the raw S3 origin (not just the CDN) to capture
size,etag,last-modified, version-id, replication status, and server-side encryption per file. - Runnable download manifest:
verifyemitss5cmd_download.txt, acp --if-size-differscript you hand straight tos5cmd --no-sign-request run. - Active brute-force enumeration: optional date-path and model-name generators for files no public page references, with tiered coverage and hard safety valves.
- Async and bounded concurrency: HTTP/2 via
httpx, a configurable concurrency ceiling, and a live Rich progress display with a live/dead counter.
$ tpwalk scrape && tpwalk verify
Scrape complete
Regions scraped: 37
Regions failed: 0
Pass 1 URLs: 756
Pass 2 URLs: 420
Raw total: 1834
Unique (normalized): 1283
Regional crawl URLs: 1176 unique (1834 raw)
Wayback CDX URLs: 107 unique (612 raw)
Common Crawl URLs: 44 unique (980 raw)
GitHub search URLs: skipped (no GITHUB_TOKEN)
Mercusys URLs: 21 unique (21 raw)
...
Verify complete
URLs found (unique): 1283
Live: 1187
Dead: 96
$ s5cmd --no-sign-request run data/s5cmd_download.txt # download everythingtpwalk requires Python 3.11+. Install with uv:
uv tool install git+https://github.com/StrongWind1/tpwalkBulk downloading additionally requires s5cmd. For development, use uv sync instead (see Development).
The pipeline is three steps: discover, verify, download.
# 1. Discover URLs from all passive sources -> data/scrapes/<timestamp>/
tpwalk scrape
# 2. HEAD-check every discovered URL -> data/verified.{json,txt}, data/dead.{json,txt}, data/s5cmd_download.txt
tpwalk verify
# 3. Mirror every confirmed-live archive with s5cmd
s5cmd --no-sign-request run data/s5cmd_download.txtGitHub-based sources are skipped unless a token is present:
export GITHUB_TOKEN=ghp_... # enables `github_search` and `tplink_github`
tpwalk scrapetpwalk exposes three subcommands. Run tpwalk <command> --help for the full option list.
Discovers GPL archive URLs from every source and writes one .txt file per source into a timestamped directory under data/scrapes/.
| Option | Default | Description |
|---|---|---|
--data-dir |
data |
Root data directory; the scrapes/ subdirectory is written here. |
--model-sweep |
off | Also run the phppage model-wordlist sweep (heavy: thousands of requests to www.tp-link.com, several minutes). |
--sweep-max-models |
none | Cap the number of candidate models for --model-sweep. |
Reads every .txt under data/scrapes/, normalizes and deduplicates the URLs, HEAD-checks each against the S3 origin, and batch-writes the five output files.
| Option | Default | Description |
|---|---|---|
-c, --concurrency |
100 |
Maximum concurrent S3 HEAD requests. |
--data-dir |
data |
Root data directory; reads scrapes/ from here and writes output here. |
Actively enumerates candidate URLs the passive sources never reference, by crossing a date-path generator with a model-name generator and HEAD-checking each candidate.
| Option | Default | Description |
|---|---|---|
--strategy |
all |
Which generator(s) to run: dates, models, or all. |
--thorough |
off | Model strategy over ~389 known GPL date directories (medium coverage, ~40M HEADs). |
--exhaustive |
off | Full model x date-path cross (~203M HEADs). Always pair with --max-candidates. |
--max-candidates |
none | Hard cap on candidates checked (the safety valve for the heavy tiers). |
--dry-run |
off | Count candidates without issuing any HEAD requests. |
-c, --concurrency |
100 |
Maximum concurrent S3 HEAD requests. |
--data-dir |
data |
Root data directory; writes a timestamped run directory here. |
The
--exhaustivetier issues hundreds of millions of requests. Start with--dry-runto size the job, then bound it with--max-candidates.
scrape runs each source concurrently under a shared async client; a failure in one source never aborts the others.
| Source | Output file | What it does | Needs token |
|---|---|---|---|
| Regional crawl | regional_crawl.txt |
Two-pass crawl of per-country GPL pages: productTree JSON plus phppage sub-pages |
no |
| Wayback Machine | wayback_cdx.txt |
Internet Archive CDX API across five URL prefixes, with resumeKey pagination |
no |
| Common Crawl | common_crawl.txt |
CC index API across all historical indices | no |
| GitHub search | github_search.txt |
Code search for archive URLs across public repos | yes |
| TP-Link GitHub | tplink_github.txt |
TP-Link's own GitHub organizations and repos | yes |
| Mercusys regional | mercusys_regional.txt |
Per-region Mercusys GPL pages (static.mercusys.com) |
no |
reddit.txt |
Reddit search for archive links | no | |
| Forums | forums.txt |
Community forum threads | no |
google.txt |
Google result parsing | no | |
| Model sweep (opt-in) | model_sweep.txt |
phppage model-wordlist sweep, enabled by --model-sweep |
no |
static.tp-link.com denies ListBucket on every prefix, so the bucket cannot be enumerated. A direct GET (or HEAD) of any known key still succeeds with no authentication, on both the CloudFront CDN and the raw S3 origin. tpwalk verifies against the origin (s3.amazonaws.com/static.tp-link.com) because it returns richer metadata headers (version-id, replication status, and encryption) that the CDN strips. Two key conventions exist: a flat legacy layout (/resources/gpl/<file>, pre-2022) and a date-hierarchical modern layout (/upload/gpl-code/YYYY/YYYYMM/YYYYMMDD/<file>).
verify writes five files into --data-dir (default data/):
| File | Format | Contents |
|---|---|---|
verified.json |
JSON array | One object per live archive with full S3 metadata (size, etag, content_type, last_modified, version_id, replication_status, encryption, server, status, checked_at). |
verified.txt |
text | One s3:// URL per line (direct input for s5cmd). |
s5cmd_download.txt |
text | A runnable cp --if-size-differ manifest for s5cmd --no-sign-request run. |
dead.json |
JSON array | One object per dead URL, classified as http, timeout, or network. |
dead.txt |
text | One https:// URL per line. |
tpwalk/
|-- cli.py # Typer entry point: scrape / verify / bruteforce
|-- _client.py # shared httpx AsyncClient factory (HTTP/2, retries, headers)
|-- _normalize.py # URL canonicalization + s3:// / s5cmd line conversion
|-- models.py # frozen dataclasses: VerifiedEntry, DeadEntry, *Stats
|-- scrape/ # nine discovery sources + the ScrapeRunner orchestrator
|-- verify/ # read -> dedupe -> HEAD-check -> write the five output files
`-- bruteforce/ # active date-path / model-name enumeration
tests/ # pytest suite (mocked HTTP, no live network)
data/ # scrapes/ (discovered URLs incl. curated seed/), verify output, firmware_s3_listing.json
tpwalk uses uv for environment management, Ruff for linting and formatting, ty for type-checking, and pytest for tests.
uv sync # create the dev environment
make check # ruff format --check + ruff check + ty + pytest
make format # auto-fix formatting and lint
make test # pytest onlyThe test suite mocks all HTTP and never touches the live network, so it is safe to run anywhere.
Other projects in this collection:
- NFSWolf - native NFS security toolkit
- WPAWolf - WPA/WPA2/WPA3-FT-PSK handshake extraction from captures
- CredWolf - Active Directory credential validation
The GPL and LGPL require any party distributing covered binaries to make the corresponding source code available. tpwalk only locates and mirrors that license-mandated source, which TP-Link and Mercusys already publish on public, unauthenticated CDNs. It performs read-only requests (HEAD and GET) and never attempts to access non-public data.
That said, be a good neighbor: keep --concurrency reasonable, prefer the passive scrape/verify path over brute force, and treat the --thorough/--exhaustive tiers as deliberate, bounded jobs rather than defaults. You are responsible for complying with the robots directives and terms of service of every source the scraper touches.