Standalone Cloudflare Worker that fetches, validates, and stores robots.txt files for any domain. Results are saved into Supabase Postgres so PublisherLens and other projects can read canonical data via SQL views.
- 🔍 Fetch + lint robots.txt files (HTTPS fallback to HTTP, 512 KB cap, response metadata)
- ✅ Parser detects allow/disallow rules, crawl-delay, host, sitemap directives, and deprecated fields
- 🧪 Simulation endpoint (`path` + `userAgent` query params) for "is this URL crawlable?" (sketched below)
- 📦 Batch API to refresh multiple domains
- 🗃️ Persistence in `robots_txt_snapshots` with view `v_latest_robots_txt`
- ☁️ Cloudflare Worker deployment with caching + CORS-ready responses
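The simulator follows the usual robots.txt matching convention: the most specific (longest) matching rule wins, with ties going to allow. A minimal TypeScript sketch of that decision, independent of the actual `lib/parser.ts` implementation (wildcards, `$` anchors, and empty rules omitted for brevity):

```ts
// Hypothetical illustration of longest-match evaluation; the real logic
// lives in src/lib/parser.ts and may differ.
interface AgentRules {
  allows: string[];
  disallows: string[];
}

function isPathAllowed(rules: AgentRules, path: string): boolean {
  // Length of the longest pattern that prefixes the path, or -1 if none match.
  const longestMatch = (patterns: string[]) =>
    patterns
      .filter((p) => path.startsWith(p))
      .reduce((max, p) => Math.max(max, p.length), -1);

  const allow = longestMatch(rules.allows);
  const disallow = longestMatch(rules.disallows);

  // No matching disallow rule: the path is crawlable.
  if (disallow === -1) return true;
  // Most specific rule wins; allow wins ties.
  return allow >= disallow;
}
```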
```
src/
  index.ts              # Hono app + routing
  routes/robots.ts      # GET /api/v1/robots/:domain + POST /batch
  lib/
    parser.ts           # robots.txt parser + simulator
    fetchRobots.ts      # Fetch + metadata
    supabase.ts         # REST helpers
    domain.ts, hash.ts, cache.ts, cors.ts
  types/                # shared TypeScript interfaces
database/schema.sql     # Postgres schema + view
docs/                   # requirements + architecture
frontend/               # Next.js UI for robots-txt-checker.com
```
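For orientation, a hedged sketch of how `src/index.ts` could wire the Hono app to the routes documented later in this README; the actual file will differ in detail:

```ts
// Hypothetical sketch of src/index.ts; the real routing may differ.
import { Hono } from "hono";

const app = new Hono();

// Health probe (GET /api/v1/health).
app.get("/api/v1/health", (c) => c.json({ success: true }));

// Domain lookup: fetch + parse + optional simulation.
app.get("/api/v1/robots/:domain", async (c) => {
  const domain = c.req.param("domain");
  const path = c.req.query("path");
  // ...fetch robots.txt, parse, persist to Supabase, simulate `path`...
  return c.json({ success: true, data: { domain, path } });
});

export default app;
```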
```bash
cd robots-txt-checker
pnpm install
pnpm dev        # wrangler dev --local
```

To run type-check + lint:

```bash
pnpm typecheck
pnpm lint
```

Set in `wrangler.toml` or via the Cloudflare dashboard:
| Name | Description |
|---|---|
| `SUPABASE_URL` | Supabase REST base URL (`https://xxx.supabase.co`) |
| `SUPABASE_SERVICE_ROLE_KEY` | Service role key for Supabase REST calls |
| `DATABASE_SCHEMA` | Defaults to `robots_intel` |
| `CACHE_TTL_SECONDS` | Edge cache TTL in seconds (default 604800 = 7 days) |
| `DEFAULT_MAX_AGE_DAYS` | Acceptable data staleness (in days) before refetch |
| `ALLOWED_ORIGINS` | Comma-separated CORS allowlist |
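As a rough illustration (type and helper names here are assumptions; the real definitions live in `src/types/`), the Worker can read these bindings and apply defaults like so:

```ts
// Hypothetical bindings interface; the actual types live in src/types/.
interface WorkerEnv {
  SUPABASE_URL: string;
  SUPABASE_SERVICE_ROLE_KEY: string;
  DATABASE_SCHEMA?: string;   // defaults to "robots_intel"
  CACHE_TTL_SECONDS?: string; // wrangler vars arrive as strings
  DEFAULT_MAX_AGE_DAYS?: string;
  ALLOWED_ORIGINS?: string;   // comma-separated origins
}

// Resolve defaults once per request.
function resolveConfig(env: WorkerEnv) {
  return {
    schema: env.DATABASE_SCHEMA ?? "robots_intel",
    cacheTtlSeconds: Number(env.CACHE_TTL_SECONDS ?? 604800),
    maxAgeDays: env.DEFAULT_MAX_AGE_DAYS
      ? Number(env.DEFAULT_MAX_AGE_DAYS)
      : undefined,
    allowedOrigins: (env.ALLOWED_ORIGINS ?? "").split(",").filter(Boolean),
  };
}
```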
- `GET /api/v1/robots/:domain?path=/foo&userAgent=Googlebot&refresh=true` — fetch + simulate
- `POST /api/v1/robots/batch` — body: `{ "domains": ["example.com", "another.net"], "refresh": false }`
- `GET /api/v1/stats` — aggregate counters (total rows + last fetch timestamp)
- `GET /api/v1/health` — health probe
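A minimal TypeScript sketch of calling the lookup endpoint against a local `wrangler dev` instance (base URL and field access are assumptions based on the response shape below):

```ts
// Hypothetical client call; locally the Worker runs at http://localhost:8787 (wrangler dev).
const base = "http://localhost:8787";

const res = await fetch(
  `${base}/api/v1/robots/example.com?path=/private&userAgent=Googlebot`
);
const body = await res.json();

// `simulation.allowed` is false when /private is disallowed for the matched agent.
console.log(body.data.simulation.allowed);
```

A successful lookup returns a payload like: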
```json
{
"success": true,
"data": {
"domain": "example.com",
"fetchedAt": "2025-11-21T08:00:00.000Z",
"meta": {
"status": 200,
"finalUrl": "https://example.com/robots.txt",
"contentType": "text/plain",
"contentLength": 345,
"responseMs": 412
},
"directives": {
"domain": "example.com",
"agents": {
"*": {
"allows": ["/"],
"disallows": ["/private"],
"userAgent": "*"
}
},
"sitemaps": ["https://example.com/sitemap.xml"],
"issues": []
},
"issues": [],
"simulation": {
"allowed": false,
"matchedRule": { "type": "disallow", "pattern": "/private" },
"agentApplied": "*"
}
}
}
```

- Log into Supabase (same cluster as PublisherLens) and run `database/schema.sql`.
- Optionally add a view for PublisherLens-specific joins or grant a read-only role.
- Configure `SUPABASE_URL` + `SUPABASE_SERVICE_ROLE_KEY` as Cloudflare secrets: `wrangler secret put SUPABASE_URL`, etc.
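Once the schema is applied, downstream projects such as PublisherLens can read the canonical rows through the view. A hedged sketch using `@supabase/supabase-js` (client wiring and the `domain` filter are assumptions):

```ts
import { createClient } from "@supabase/supabase-js";

// Hypothetical consumer-side query; downstream projects should prefer a read-only key.
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

// v_latest_robots_txt lives in the robots_intel schema (see DATABASE_SCHEMA).
const { data, error } = await supabase
  .schema("robots_intel")
  .from("v_latest_robots_txt")
  .select("*")
  .eq("domain", "example.com")
  .maybeSingle();

if (error) throw error;
console.log(data);
```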
```bash
pnpm run deploy
```

This runs `wrangler deploy`, which publishes the Worker to `<account>.workers.dev` and the configured custom domain (robots-txt-checker.com).
- Located in `frontend/` (Next.js 16, Tailwind v4).
- Covers the researched keyword intents: robots txt checker, robots txt generator, robots txt validator, robots txt allow/disallow all, robots txt sitemap, robots txt block ai bots, etc.
- Major sections & landing pages:
  - `/` — Hero + validator + generator + resources (broad SEO coverage).
  - `/validator` — dedicated LP for "robots txt validator / tester / indexed though blocked".
  - `/generator` — LP for "robots txt generator / allow all / disallow all / WordPress / Shopify / AI bots".
  - `/examples` — LP for "robots txt example / best practices / sitemap / block ai bots".
- Shared components: live validator, generator presets, keyword spotlight, API showcase, guides/FAQ.
- Env var: `NEXT_PUBLIC_ROBOTS_API_BASE` (defaults to `http://localhost:8787`). Copy `.env.local.example`; see the sketch after this list.
- Commands:

```bash
pnpm web:dev     # next dev (frontend)
pnpm web:build   # next build
pnpm web:lint    # eslint
```
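As a sketch, the frontend could resolve the API base and wrap the lookup call in a small helper (file path and function name are hypothetical):

```ts
// frontend/src/lib/api.ts (hypothetical path)
// Falls back to the local wrangler dev server when the env var is unset.
const API_BASE =
  process.env.NEXT_PUBLIC_ROBOTS_API_BASE ?? "http://localhost:8787";

export async function checkRobots(domain: string, path = "/") {
  const url = `${API_BASE}/api/v1/robots/${encodeURIComponent(
    domain
  )}?path=${encodeURIComponent(path)}`;
  const res = await fetch(url);
  return res.json();
}
```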
- Durable Object queue for scheduled monitoring + alerts
- Diff history + alerting when robots.txt changes materially
- Slack/webhook notifications when files change or block Googlebot