A small, focused web crawler built with agentiny. Give it a seed URL and a topic — it crawls the site, has Claude score each page for relevance, and only follows links from pages that score above a threshold.
The point of the demo: the fetcher and the scorer run concurrently. While
page N+1 is being fetched, page N is being scored. That falls out naturally from
agentiny's reactive triggers; expressing it cleanly with plain await chains
would be awkward.
```
seed URL
    ↓
trigger 1 (fetch) ──────► trigger 2 (score) ──┐
    ▲                                         │
    │                                         ▼
    └────── new links if score >= 6 ◄─────────┘

trigger 3 (stop) — fires when nothing is fetching, nothing is scoring,
                   every page has a score, and the queue is drained
```
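The loop above needs shared state that both workers can reserve from and report into. A minimal sketch of what that state could look like — the names here are illustrative, not taken from `src/crawler.ts`:

```typescript
// Hypothetical shape of the crawl state (illustrative names only).
interface CrawlState {
  queue: string[];                 // URLs waiting to be fetched
  pages: Map<string, { html: string; score?: number }>; // fetched pages, scored or pending
  fetching: boolean;               // mutex flag reserved by the fetch trigger
  scoring: boolean;                // mutex flag reserved by the score trigger
}

// The agent starts with just the seed in the queue and nothing in flight.
const initial: CrawlState = {
  queue: ["https://example.com/"], // placeholder seed URL
  pages: new Map(),
  fetching: false,
  scoring: false,
};
```

The stop trigger's condition maps directly onto this shape: `!fetching && !scoring && queue.length === 0` and every entry in `pages` has a `score`.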
Both worker triggers use the same shape:
- Synchronously reserve a unit of work (flip a mutex flag).
- Kick off async work as fire-and-forget.
- When the promise resolves, mutate state in place and call `setState` to wake the loop for the next pass.
Because both triggers return immediately, the agent loop is never blocked. The HTTP request and the LLM request are both in flight at the same time.
```
npm install
cp .env.example .env   # add your Anthropic API key
npm start "https://thomas-wiegold.com" "AI coding tools"
```

The crawler accepts injected fetcher and scorer functions, so you can run it against a fake site with a fake scorer — no network, no Claude:

```
npm run mock
```

This drives the agent against an in-memory site of 10 pages and asserts that the focused crawl correctly skips the irrelevant subtree.
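Injection of this kind could look like the following sketch — the type signatures and the fake site are invented for illustration, not copied from the repo:

```typescript
// Hypothetical injection seams; the real signatures live in src/crawler.ts.
type Page = { title: string; links: string[]; text: string };
type Fetcher = (url: string) => Promise<Page>;
type Scorer = (text: string, topic: string) => Promise<number>;

// In-memory fake site: a relevant blog subtree and an irrelevant /about page.
const site: Record<string, Page> = {
  "/":      { title: "Home",  links: ["/blog", "/about"], text: "AI coding tools overview" },
  "/blog":  { title: "Blog",  links: [],                  text: "more on AI coding tools" },
  "/about": { title: "About", links: ["/about/team"],     text: "company history" },
};

const fakeFetcher: Fetcher = async (url) => site[url];
const fakeScorer: Scorer = async (text, topic) =>
  text.includes(topic) ? 9 : 2; // deterministic stand-in for the Claude call
```

With a deterministic scorer, the test can assert that `/about/team` is never fetched, because `/about` scores below the threshold.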
Sample output:

```
seed:  https://thomas-wiegold.com
topic: AI coding tools

[fetch] https://thomas-wiegold.com/ → Thomas Wiegold
[fetch] https://thomas-wiegold.com/blog → Blog | Thomas Wiegold
[score] https://thomas-wiegold.com/ → 7/10  +12 links
[fetch] https://thomas-wiegold.com/blog/claude-code-review → ...
[score] https://thomas-wiegold.com/blog → 9/10  +8 links
[fetch] https://thomas-wiegold.com/about → About
[score] https://thomas-wiegold.com/blog/claude-code-review → 10/10  +3 links
[score] https://thomas-wiegold.com/about → 3/10  (skipped)
...
[done] 10 pages crawled

─── TOP RESULTS ───
10/10  https://thomas-wiegold.com/blog/claude-code-review
 9/10  https://thomas-wiegold.com/blog
 7/10  https://thomas-wiegold.com/
...
```
- `src/crawler.ts` — agent state, three triggers, scoring helper (~140 lines)
- `src/extract.ts` — minimal HTML extractor
- `src/index.ts` — CLI runner
This is a demo, not a real crawler. For real use, you'd want:
- A real HTML parser (linkedom, cheerio) instead of regex
- robots.txt compliance
- A polite delay between requests
- Concurrency caps and backoff
- Persistence so a crash doesn't lose progress
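One of the items above, sketched for concreteness: a polite per-request delay, implemented as a wrapper around whatever fetcher is injected. This is illustrative only — the demo has no such throttle:

```typescript
// Wrap a fetcher so consecutive calls are spaced at least delayMs apart.
// (Sketch; the returned function is a stand-in for a real HTTP fetcher.)
function politeFetch(delayMs: number) {
  let last = 0; // timestamp of the previous request
  return async (url: string): Promise<string> => {
    const wait = Math.max(0, last + delayMs - Date.now());
    if (wait > 0) await new Promise((r) => setTimeout(r, wait));
    last = Date.now();
    return `fetched ${url}`; // stand-in for the real HTTP call
  };
}
```

Because the crawler takes the fetcher as an injected function, a throttle like this (or a robots.txt check) can be layered on without touching the trigger logic.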