Robots.txt Checker API

Standalone Cloudflare Worker that fetches, validates, and stores robots.txt files for any domain. Results are saved into Supabase Postgres so PublisherLens and other projects can read canonical data via SQL views.

Key Features

  • 🔍 Fetch + lint robots.txt files (HTTPS with fallback to HTTP, 512 KB body cap, response metadata captured)
  • ✅ Parser detects allow/disallow rules, crawl-delay, host, sitemap directives, and deprecated fields
  • 🧪 Simulation endpoint (path + userAgent query) for "is this URL crawlable?" (see the matching sketch after this list)
  • 📦 Batch API to refresh multiple domains
  • 🗃️ Persistence in robots_txt_snapshots with view v_latest_robots_txt
  • ☁️ Cloudflare Worker deployment with caching + CORS-ready responses
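
For context, the simulation uses the usual longest-match semantics. The sketch below is not the repository's parser.ts, only an illustration of how an allow/disallow decision is typically made (agent selection and edge cases such as empty Disallow lines are omitted):

// Illustration only, not the repo's parser.ts: Google-style longest-match
// evaluation of allow/disallow patterns for a single user-agent group.
interface AgentRules {
  allows: string[];
  disallows: string[];
}

function patternToRegex(pattern: string): RegExp {
  // Escape regex metacharacters, then map robots.txt wildcards:
  // * matches any sequence, a trailing $ anchors to the end of the path.
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&")
    .replace(/\*/g, ".*");
  return escaped.endsWith("\\$")
    ? new RegExp("^" + escaped.slice(0, -2) + "$")
    : new RegExp("^" + escaped);
}

function isAllowed(path: string, rules: AgentRules): boolean {
  // Longest matching pattern wins; allow beats disallow on equal length.
  let allowed = true;
  let bestLength = -1;
  for (const p of rules.allows) {
    if (p.length >= bestLength && patternToRegex(p).test(path)) {
      allowed = true;
      bestLength = p.length;
    }
  }
  for (const p of rules.disallows) {
    if (p.length > bestLength && patternToRegex(p).test(path)) {
      allowed = false;
      bestLength = p.length;
    }
  }
  return allowed;
}

// Matches the Response Example below: "/private" is disallowed for "*".
// isAllowed("/private", { allows: ["/"], disallows: ["/private"] }) === false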

Project Layout

src/
  index.ts              # Hono app + routing
  routes/robots.ts      # GET /api/v1/robots/:domain + POST /batch
  lib/
    parser.ts           # robots.txt parser + simulator
    fetchRobots.ts      # Fetch + metadata
    supabase.ts         # REST helpers
    domain.ts, hash.ts, cache.ts, cors.ts   # domain parsing, hashing, caching, CORS helpers
  types/                # shared TypeScript interfaces
database/schema.sql    # Postgres schema + view
docs/                  # requirements + architecture
frontend/              # Next.js UI for robots-txt-checker.com

Local Development

cd robots-txt-checker
pnpm install
pnpm dev               # wrangler dev --local

To run type-check + lint:

pnpm typecheck
pnpm lint
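
With wrangler dev running, a quick smoke check against the local Worker (port 8787 is wrangler's default and matches the frontend default noted below):

// Quick local check of the health probe (run from any TS/Node module with fetch available).
const health = await fetch("http://localhost:8787/api/v1/health");
console.log(health.status, await health.json());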

Required Environment Variables

Set these in wrangler.toml or via the Cloudflare dashboard:

Name                        Description
SUPABASE_URL                Supabase REST base URL (https://xxx.supabase.co)
SUPABASE_SERVICE_ROLE_KEY   Service role key for Supabase REST calls
DATABASE_SCHEMA             Postgres schema name (defaults to robots_intel)
CACHE_TTL_SECONDS           Edge cache TTL in seconds (default 604800, i.e. 7 days)
DEFAULT_MAX_AGE_DAYS        Acceptable data staleness in days before a refetch
ALLOWED_ORIGINS             Comma-separated CORS allowlist
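
For reference, a hedged sketch of how these bindings could be typed on the Worker side (names come from the table above; the repo's own types/ may declare them differently):

// Sketch only: env binding names taken from the table above.
export interface Env {
  SUPABASE_URL: string;
  SUPABASE_SERVICE_ROLE_KEY: string;
  DATABASE_SCHEMA?: string;        // defaults to robots_intel
  CACHE_TTL_SECONDS?: string;      // Worker vars arrive as strings; default 604800
  DEFAULT_MAX_AGE_DAYS?: string;
  ALLOWED_ORIGINS?: string;        // comma-separated CORS allowlist
}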

API Overview

  • GET /api/v1/robots/:domain?path=/foo&userAgent=Googlebot&refresh=true — fetch + simulate
  • POST /api/v1/robots/batch with JSON body { "domains": ["example.com", "another.net"], "refresh": false } — batch refresh
  • GET /api/v1/stats — aggregate counters (total rows + last fetch timestamp)
  • GET /api/v1/health — health probe
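
A usage sketch in TypeScript (endpoints as listed above; the base URL is an assumption, point it at your deployed Worker or local dev server):

// Sketch: calling the single-domain and batch endpoints with fetch (in an ESM/TS module).
const API_BASE = "https://robots-txt-checker.com"; // assumption: your Worker's URL

// Fetch + simulate for one domain.
const single = await fetch(
  `${API_BASE}/api/v1/robots/example.com?path=/private&userAgent=Googlebot`
).then((r) => r.json());

// Refresh several domains in one call.
const batch = await fetch(`${API_BASE}/api/v1/robots/batch`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ domains: ["example.com", "another.net"], refresh: false }),
}).then((r) => r.json());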

Response Example

{
  "success": true,
  "data": {
    "domain": "example.com",
    "fetchedAt": "2025-11-21T08:00:00.000Z",
    "meta": {
      "status": 200,
      "finalUrl": "https://example.com/robots.txt",
      "contentType": "text/plain",
      "contentLength": 345,
      "responseMs": 412
    },
    "directives": {
      "domain": "example.com",
      "agents": {
        "*": {
          "allows": ["/"],
          "disallows": ["/private"],
          "userAgent": "*"
        }
      },
      "sitemaps": ["https://example.com/sitemap.xml"],
      "issues": []
    },
    "issues": [],
    "simulation": {
      "allowed": false,
      "matchedRule": { "type": "disallow", "pattern": "/private" },
      "agentApplied": "*"
    }
  }
}
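
For downstream consumers, a TypeScript shape inferred from the example above (the canonical interfaces live in types/ and may differ):

// Inferred from the JSON example above, not copied from types/.
interface RobotsCheckResponse {
  success: boolean;
  data: {
    domain: string;
    fetchedAt: string; // ISO 8601
    meta: {
      status: number;
      finalUrl: string;
      contentType: string;
      contentLength: number;
      responseMs: number;
    };
    directives: {
      domain: string;
      agents: Record<string, { allows: string[]; disallows: string[]; userAgent: string }>;
      sitemaps: string[];
      issues: unknown[];
    };
    issues: unknown[];
    // Assumption: only present when a path/userAgent query is supplied.
    simulation?: {
      allowed: boolean;
      matchedRule: { type: "allow" | "disallow"; pattern: string };
      agentApplied: string;
    };
  };
}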

Database Setup

  1. Log into Supabase (same cluster as PublisherLens) and run database/schema.sql.
  2. Optionally add a view for PublisherLens-specific joins or grant read-only role.
  3. Configure SUPABASE_URL + SUPABASE_SERVICE_ROLE_KEY in Cloudflare secrets: wrangler secret put SUPABASE_URL, etc.
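
Downstream projects (e.g. PublisherLens) can read the view over Supabase's REST layer. A minimal sketch, assuming the schema/view names above and a domain column (check database/schema.sql for the real column names):

// Sketch: reading the latest snapshot for one domain through PostgREST.
// View/schema names come from this README; the domain column is an assumption.
const SUPABASE_URL = "https://xxx.supabase.co";     // your project URL
const SUPABASE_KEY = "<service-or-read-only-key>";

const res = await fetch(
  `${SUPABASE_URL}/rest/v1/v_latest_robots_txt?domain=eq.example.com&select=*`,
  {
    headers: {
      apikey: SUPABASE_KEY,
      Authorization: `Bearer ${SUPABASE_KEY}`,
      "Accept-Profile": "robots_intel", // non-default schema must be exposed and selected
    },
  }
);
const rows = await res.json();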

Deployment

pnpm run deploy

This runs wrangler deploy, which publishes the Worker to <account>.workers.dev and to the configured custom domain (robots-txt-checker.com).

Frontend (robots-txt-checker.com)

  • Located in frontend/ (Next.js 16, Tailwind v4).
  • Covers the researched keyword intents: robots txt checker, robots txt generator, robots txt validator, robots txt allow/disallow all, robots txt sitemap, robots txt block ai bots, etc.
  • Major sections & landing pages:
    • / — Hero + validator + generator + resources (broad SEO coverage).
    • /validator — dedicated LP for “robots txt validator / tester / indexed though blocked”.
    • /generator — LP for “robots txt generator / allow all / disallow all / WordPress / Shopify / AI bots”.
    • /examples — LP for “robots txt example / best practices / sitemap / block ai bots”.
    • Shared components: live validator, generator presets, keyword spotlight, API showcase, guides/FAQ.
  • Env var: NEXT_PUBLIC_ROBOTS_API_BASE (defaults to http://localhost:8787). Copy .env.local.example to .env.local and adjust as needed (a fetch sketch follows the commands below).
  • Commands:
    pnpm web:dev      # next dev (frontend)
    pnpm web:build    # next build
    pnpm web:lint     # eslint
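
A sketch of how a frontend component can reach the Worker through that variable (the helper name is illustrative, not from the repo):

// Sketch: client helper using the documented env var and endpoint.
const apiBase = process.env.NEXT_PUBLIC_ROBOTS_API_BASE ?? "http://localhost:8787";

export async function checkRobots(domain: string, path?: string) {
  const url = new URL(`/api/v1/robots/${encodeURIComponent(domain)}`, apiBase);
  if (path) url.searchParams.set("path", path);
  const res = await fetch(url);
  if (!res.ok) throw new Error(`robots check failed: ${res.status}`);
  return res.json();
}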

Next Steps / Backlog

  • Durable Object queue for scheduled monitoring + alerts
  • Diff history + alerting when robots.txt changes materially
  • Slack/webhook notifications when files change or block Googlebot
