Robots.txt Checker API

Standalone Cloudflare Worker that fetches, validates, and stores robots.txt files for any domain. Results are saved into Supabase Postgres so PublisherLens and other projects can read canonical data via SQL views.

Key Features

  • 🔍 Fetch + lint robots.txt files (HTTPS with fallback to HTTP, 512 KB body cap, response metadata captured)
  • ✅ Parser detects allow/disallow rules, crawl-delay, host, sitemap directives, and deprecated fields
  • 🧪 Simulation endpoint (path + userAgent query) for "is this URL crawlable?" (see the matching sketch after this list)
  • 📦 Batch API to refresh multiple domains
  • 🗃️ Persistence in robots_txt_snapshots with view v_latest_robots_txt
  • ☁️ Cloudflare Worker deployment with caching + CORS-ready responses
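
For context, the simulation uses the usual longest-match semantics. The sketch below is not the repository's parser.ts, only an illustration of how an allow/disallow decision is typically made (agent selection and edge cases such as empty Disallow lines are omitted):

// Illustration only, not the repo's parser.ts: Google-style longest-match
// evaluation of allow/disallow patterns for a single user-agent group.
interface AgentRules {
  allows: string[];
  disallows: string[];
}

function patternToRegex(pattern: string): RegExp {
  // Escape regex metacharacters, then map robots.txt wildcards:
  // * matches any sequence, a trailing $ anchors to the end of the path.
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&")
    .replace(/\*/g, ".*");
  return escaped.endsWith("\\$")
    ? new RegExp("^" + escaped.slice(0, -2) + "$")
    : new RegExp("^" + escaped);
}

function isAllowed(path: string, rules: AgentRules): boolean {
  // Longest matching pattern wins; allow beats disallow on equal length.
  let allowed = true;
  let bestLength = -1;
  for (const p of rules.allows) {
    if (p.length >= bestLength && patternToRegex(p).test(path)) {
      allowed = true;
      bestLength = p.length;
    }
  }
  for (const p of rules.disallows) {
    if (p.length > bestLength && patternToRegex(p).test(path)) {
      allowed = false;
      bestLength = p.length;
    }
  }
  return allowed;
}

// Matches the Response Example below: "/private" is disallowed for "*".
// isAllowed("/private", { allows: ["/"], disallows: ["/private"] }) === false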

Project Layout

src/
  index.ts              # Hono app + routing
  routes/robots.ts      # GET /api/v1/robots/:domain + POST /batch
  lib/
    parser.ts           # robots.txt parser + simulator
    fetchRobots.ts      # Fetch + metadata
    supabase.ts         # REST helpers
    domain.ts, hash.ts, cache.ts, cors.ts   # domain parsing, hashing, caching, CORS helpers
  types/                # shared TypeScript interfaces
database/schema.sql    # Postgres schema + view
docs/                  # requirements + architecture
frontend/              # Next.js UI for robots-txt-checker.com

Local Development

cd robots-txt-checker
pnpm install
pnpm dev               # wrangler dev --local

To run type-check + lint:

pnpm typecheck
pnpm lint
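
With wrangler dev running, a quick smoke check against the local Worker (port 8787 is wrangler's default and matches the frontend default noted below):

// Quick local check of the health probe (run from any TS/Node module with fetch available).
const health = await fetch("http://localhost:8787/api/v1/health");
console.log(health.status, await health.json());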

Required Environment Variables

Set these in wrangler.toml or via the Cloudflare dashboard:

Name                        Description
SUPABASE_URL                Supabase REST base URL (https://xxx.supabase.co)
SUPABASE_SERVICE_ROLE_KEY   Service role key for Supabase REST calls
DATABASE_SCHEMA             Postgres schema name (defaults to robots_intel)
CACHE_TTL_SECONDS           Edge cache TTL in seconds (default 604800, i.e. 7 days)
DEFAULT_MAX_AGE_DAYS        Acceptable data staleness in days before a refetch
ALLOWED_ORIGINS             Comma-separated CORS allowlist
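
For reference, a hedged sketch of how these bindings could be typed on the Worker side (names come from the table above; the repo's own types/ may declare them differently):

// Sketch only: env binding names taken from the table above.
export interface Env {
  SUPABASE_URL: string;
  SUPABASE_SERVICE_ROLE_KEY: string;
  DATABASE_SCHEMA?: string;        // defaults to robots_intel
  CACHE_TTL_SECONDS?: string;      // Worker vars arrive as strings; default 604800
  DEFAULT_MAX_AGE_DAYS?: string;
  ALLOWED_ORIGINS?: string;        // comma-separated CORS allowlist
}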

API Overview

  • GET /api/v1/robots/:domain?path=/foo&userAgent=Googlebot&refresh=true — fetch + simulate
  • POST /api/v1/robots/batch with JSON body { "domains": ["example.com", "another.net"], "refresh": false } — batch refresh
  • GET /api/v1/stats — aggregate counters (total rows + last fetch timestamp)
  • GET /api/v1/health — health probe
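
A usage sketch in TypeScript (endpoints as listed above; the base URL is an assumption, point it at your deployed Worker or local dev server):

// Sketch: calling the single-domain and batch endpoints with fetch (in an ESM/TS module).
const API_BASE = "https://robots-txt-checker.com"; // assumption: your Worker's URL

// Fetch + simulate for one domain.
const single = await fetch(
  `${API_BASE}/api/v1/robots/example.com?path=/private&userAgent=Googlebot`
).then((r) => r.json());

// Refresh several domains in one call.
const batch = await fetch(`${API_BASE}/api/v1/robots/batch`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ domains: ["example.com", "another.net"], refresh: false }),
}).then((r) => r.json());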

Response Example

{
  "success": true,
  "data": {
    "domain": "example.com",
    "fetchedAt": "2025-11-21T08:00:00.000Z",
    "meta": {
      "status": 200,
      "finalUrl": "https://example.com/robots.txt",
      "contentType": "text/plain",
      "contentLength": 345,
      "responseMs": 412
    },
    "directives": {
      "domain": "example.com",
      "agents": {
        "*": {
          "allows": ["/"],
          "disallows": ["/private"],
          "userAgent": "*"
        }
      },
      "sitemaps": ["https://example.com/sitemap.xml"],
      "issues": []
    },
    "issues": [],
    "simulation": {
      "allowed": false,
      "matchedRule": { "type": "disallow", "pattern": "/private" },
      "agentApplied": "*"
    }
  }
}
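
For downstream consumers, a TypeScript shape inferred from the example above (the canonical interfaces live in types/ and may differ):

// Inferred from the JSON example above, not copied from types/.
interface RobotsCheckResponse {
  success: boolean;
  data: {
    domain: string;
    fetchedAt: string; // ISO 8601
    meta: {
      status: number;
      finalUrl: string;
      contentType: string;
      contentLength: number;
      responseMs: number;
    };
    directives: {
      domain: string;
      agents: Record<string, { allows: string[]; disallows: string[]; userAgent: string }>;
      sitemaps: string[];
      issues: unknown[];
    };
    issues: unknown[];
    // Assumption: only present when a path/userAgent query is supplied.
    simulation?: {
      allowed: boolean;
      matchedRule: { type: "allow" | "disallow"; pattern: string };
      agentApplied: string;
    };
  };
}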

Database Setup

  1. Log into Supabase (same cluster as PublisherLens) and run database/schema.sql.
  2. Optionally add a view for PublisherLens-specific joins or grant read-only role.
  3. Configure SUPABASE_URL + SUPABASE_SERVICE_ROLE_KEY in Cloudflare secrets: wrangler secret put SUPABASE_URL, etc.
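
Downstream projects (e.g. PublisherLens) can read the view over Supabase's REST layer. A minimal sketch, assuming the schema/view names above and a domain column (check database/schema.sql for the real column names):

// Sketch: reading the latest snapshot for one domain through PostgREST.
// View/schema names come from this README; the domain column is an assumption.
const SUPABASE_URL = "https://xxx.supabase.co";     // your project URL
const SUPABASE_KEY = "<service-or-read-only-key>";

const res = await fetch(
  `${SUPABASE_URL}/rest/v1/v_latest_robots_txt?domain=eq.example.com&select=*`,
  {
    headers: {
      apikey: SUPABASE_KEY,
      Authorization: `Bearer ${SUPABASE_KEY}`,
      "Accept-Profile": "robots_intel", // non-default schema must be exposed and selected
    },
  }
);
const rows = await res.json();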

Deployment

pnpm run deploy

This runs wrangler deploy, which publishes the Worker to <account>.workers.dev and to the configured custom domain (robots-txt-checker.com).

Frontend (robots-txt-checker.com)

  • Located in frontend/ (Next.js 16, Tailwind v4).
  • Covers the researched keyword intents: robots txt checker, robots txt generator, robots txt validator, robots txt allow/disallow all, robots txt sitemap, robots txt block ai bots, etc.
  • Major sections & landing pages:
    • / — Hero + validator + generator + resources (broad SEO coverage).
    • /validator — dedicated LP for “robots txt validator / tester / indexed though blocked”.
    • /generator — LP for “robots txt generator / allow all / disallow all / WordPress / Shopify / AI bots”.
    • /examples — LP for “robots txt example / best practices / sitemap / block ai bots”.
    • Shared components: live validator, generator presets, keyword spotlight, API showcase, guides/FAQ.
  • Env var: NEXT_PUBLIC_ROBOTS_API_BASE (defaults to http://localhost:8787). Copy .env.local.example to .env.local and adjust as needed (a fetch sketch follows the commands below).
  • Commands:
    pnpm web:dev      # next dev (frontend)
    pnpm web:build    # next build
    pnpm web:lint     # eslint
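
A sketch of how a frontend component can reach the Worker through that variable (the helper name is illustrative, not from the repo):

// Sketch: client helper using the documented env var and endpoint.
const apiBase = process.env.NEXT_PUBLIC_ROBOTS_API_BASE ?? "http://localhost:8787";

export async function checkRobots(domain: string, path?: string) {
  const url = new URL(`/api/v1/robots/${encodeURIComponent(domain)}`, apiBase);
  if (path) url.searchParams.set("path", path);
  const res = await fetch(url);
  if (!res.ok) throw new Error(`robots check failed: ${res.status}`);
  return res.json();
}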

Next Steps / Backlog

  • Durable Object queue for scheduled monitoring + alerts
  • Diff history + alerting when robots.txt changes materially
  • Slack/webhook notifications when files change or block Googlebot
