A small, focused web crawler built with agentiny. Give it a seed URL and a topic — it crawls the site, has Claude score each page for relevance, and only follows links from pages that score above a threshold.
The point of the demo: the fetcher and the scorer run concurrently. While
page N+1 is being fetched, page N is being scored. That falls out naturally from
agentiny's reactive triggers; expressing it cleanly with plain await chains
would be awkward.
```
seed URL
    ↓
trigger 1 (fetch) ──────► trigger 2 (score) ──┐
    ▲                                         │
    │                                         ▼
    └────── new links if score >= 6 ◄─────────┘

trigger 3 (stop) — fires when nothing is fetching, nothing is scoring,
                   every page has a score, and the queue is drained
```
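The loop above needs shared state that both workers can reserve from and report into. A minimal sketch of what that state could look like — the names here are illustrative, not taken from `src/crawler.ts`:

```typescript
// Hypothetical shape of the crawl state (illustrative names only).
interface CrawlState {
  queue: string[];                 // URLs waiting to be fetched
  pages: Map<string, { html: string; score?: number }>; // fetched pages, scored or pending
  fetching: boolean;               // mutex flag reserved by the fetch trigger
  scoring: boolean;                // mutex flag reserved by the score trigger
}

// The agent starts with just the seed in the queue and nothing in flight.
const initial: CrawlState = {
  queue: ["https://example.com/"], // placeholder seed URL
  pages: new Map(),
  fetching: false,
  scoring: false,
};
```

The stop trigger's condition maps directly onto this shape: `!fetching && !scoring && queue.length === 0` and every entry in `pages` has a `score`.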
Both worker triggers use the same shape:
- Synchronously reserve a unit of work (flip a mutex flag).
- Kick off async work as fire-and-forget.
- When the promise resolves, mutate state in place and call `setState` to wake the loop for the next pass.
Because both triggers return immediately, the agent loop is never blocked. The HTTP request and the LLM request are both in flight at the same time.
```
npm install
cp .env.example .env   # add your Anthropic API key
npm start "https://thomas-wiegold.com" "AI coding tools"
```

The crawler accepts injected fetcher and scorer functions, so you can run it against a fake site with a fake scorer — no network, no Claude:

```
npm run mock
```

This drives the agent against an in-memory site of 10 pages and asserts that the focused crawl correctly skips the irrelevant subtree.
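Injection of this kind could look like the following sketch — the type signatures and the fake site are invented for illustration, not copied from the repo:

```typescript
// Hypothetical injection seams; the real signatures live in src/crawler.ts.
type Page = { title: string; links: string[]; text: string };
type Fetcher = (url: string) => Promise<Page>;
type Scorer = (text: string, topic: string) => Promise<number>;

// In-memory fake site: a relevant blog subtree and an irrelevant /about page.
const site: Record<string, Page> = {
  "/":      { title: "Home",  links: ["/blog", "/about"], text: "AI coding tools overview" },
  "/blog":  { title: "Blog",  links: [],                  text: "more on AI coding tools" },
  "/about": { title: "About", links: ["/about/team"],     text: "company history" },
};

const fakeFetcher: Fetcher = async (url) => site[url];
const fakeScorer: Scorer = async (text, topic) =>
  text.includes(topic) ? 9 : 2; // deterministic stand-in for the Claude call
```

With a deterministic scorer, the test can assert that `/about/team` is never fetched, because `/about` scores below the threshold.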
Sample output:

```
seed:  https://thomas-wiegold.com
topic: AI coding tools

[fetch] https://thomas-wiegold.com/ → Thomas Wiegold
[fetch] https://thomas-wiegold.com/blog → Blog | Thomas Wiegold
[score] https://thomas-wiegold.com/ → 7/10  +12 links
[fetch] https://thomas-wiegold.com/blog/claude-code-review → ...
[score] https://thomas-wiegold.com/blog → 9/10  +8 links
[fetch] https://thomas-wiegold.com/about → About
[score] https://thomas-wiegold.com/blog/claude-code-review → 10/10  +3 links
[score] https://thomas-wiegold.com/about → 3/10  (skipped)
...
[done] 10 pages crawled

─── TOP RESULTS ───
10/10  https://thomas-wiegold.com/blog/claude-code-review
 9/10  https://thomas-wiegold.com/blog
 7/10  https://thomas-wiegold.com/
...
```
- `src/crawler.ts` — agent state, three triggers, scoring helper (~140 lines)
- `src/extract.ts` — minimal HTML extractor
- `src/index.ts` — CLI runner
This is a demo, not a real crawler. For real use, you'd want:
- A real HTML parser (linkedom, cheerio) instead of regex
- robots.txt compliance
- A polite delay between requests
- Concurrency caps and backoff
- Persistence so a crash doesn't lose progress
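One of the items above, sketched for concreteness: a polite per-request delay, implemented as a wrapper around whatever fetcher is injected. This is illustrative only — the demo has no such throttle:

```typescript
// Wrap a fetcher so consecutive calls are spaced at least delayMs apart.
// (Sketch; the returned function is a stand-in for a real HTTP fetcher.)
function politeFetch(delayMs: number) {
  let last = 0; // timestamp of the previous request
  return async (url: string): Promise<string> => {
    const wait = Math.max(0, last + delayMs - Date.now());
    if (wait > 0) await new Promise((r) => setTimeout(r, wait));
    last = Date.now();
    return `fetched ${url}`; // stand-in for the real HTTP call
  };
}
```

Because the crawler takes the fetcher as an injected function, a throttle like this (or a robots.txt check) can be layered on without touching the trigger logic.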