Keldrik/aiseoagent
Topic Crawler Agent

A small, focused web crawler built with agentiny. Give it a seed URL and a topic: it crawls the site, asks Claude to score each page for relevance, and only follows links from pages that score above a threshold.

The point of the demo: the fetcher and the scorer run concurrently. While page N+1 is being fetched, page N is being scored. That falls out naturally from agentiny's reactive triggers; expressing it cleanly with plain await chains would be awkward.

How it works

seed URL
  ↓
trigger 1 (fetch)  ──────►  trigger 2 (score)  ──┐
       ▲                                          │
       │                                          ▼
       └────── new links if score >= 6 ◄──────────┘

trigger 3 (stop) — fires when nothing is fetching, nothing is scoring,
                   every page has a score, and the queue is drained
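Trigger 3's condition can be written as a pure predicate over the crawl state. This is a hedged sketch; the field names below are illustrative, not the repo's actual state shape:

```typescript
// Illustrative crawl state: field names are assumptions, not agentiny's API.
interface CrawlState {
  fetching: boolean;                   // is an HTTP request in flight?
  scoring: boolean;                    // is an LLM call in flight?
  queue: string[];                     // URLs waiting to be fetched
  scores: Map<string, number | null>;  // null = fetched but not yet scored
}

// The stop trigger fires exactly when all four conditions hold.
function isDone(s: CrawlState): boolean {
  return (
    !s.fetching &&
    !s.scoring &&
    s.queue.length === 0 &&
    [...s.scores.values()].every((score) => score !== null)
  );
}
```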

Both worker triggers use the same shape:

  1. Synchronously reserve a unit of work (flip a mutex flag).
  2. Kick off async work as fire-and-forget.
  3. When the promise resolves, mutate state in place and call setState to wake the loop for the next pass.

Because both triggers return immediately, the agent loop is never blocked. The HTTP request and the LLM request are both in flight at the same time.

Run it

npm install
cp .env.example .env   # add your Anthropic API key
npm start "https://thomas-wiegold.com" "AI coding tools"

Run without an API key

The crawler accepts injected fetcher and scorer functions, so you can run it against a fake site with a fake scorer — no network, no Claude:

npm run mock

This drives the agent against an in-memory site of 10 pages and asserts that the focused crawl correctly skips the irrelevant subtree.
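Injection of this kind might look like the following sketch. The type aliases and mock implementations are assumptions for illustration, not the repo's actual signatures:

```typescript
// Assumed shapes for the injectable dependencies (illustrative only).
type Fetcher = (url: string) => Promise<string>;
type Scorer = (text: string, topic: string) => Promise<number>;

// A tiny in-memory "site": URL → page body.
const fakeSite: Record<string, string> = {
  "https://fake.test/": `<a href="https://fake.test/ai">AI</a>`,
  "https://fake.test/ai": "All about AI coding tools",
};

// No network: look the page up in the map.
const mockFetcher: Fetcher = async (url) => fakeSite[url] ?? "";

// No Claude: a keyword match stands in for the LLM relevance score.
const mockScorer: Scorer = async (text, topic) =>
  text.toLowerCase().includes(topic.toLowerCase()) ? 9 : 2;
```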

Sample output:

seed:  https://thomas-wiegold.com
topic: AI coding tools

[fetch] https://thomas-wiegold.com/ → Thomas Wiegold
[fetch] https://thomas-wiegold.com/blog → Blog | Thomas Wiegold
[score] https://thomas-wiegold.com/ → 7/10  +12 links
[fetch] https://thomas-wiegold.com/blog/claude-code-review → ...
[score] https://thomas-wiegold.com/blog → 9/10  +8 links
[fetch] https://thomas-wiegold.com/about → About
[score] https://thomas-wiegold.com/blog/claude-code-review → 10/10  +3 links
[score] https://thomas-wiegold.com/about → 3/10  (skipped)
...
[done] 10 pages crawled

─── TOP RESULTS ───
10/10  https://thomas-wiegold.com/blog/claude-code-review
9/10   https://thomas-wiegold.com/blog
7/10   https://thomas-wiegold.com/
...

Files

  • src/crawler.ts — agent state, three triggers, scoring helper (~140 lines)
  • src/extract.ts — minimal HTML extractor
  • src/index.ts — CLI runner

Notes for production

This is a demo, not a real crawler. For real use, you'd want:

  • A real HTML parser (linkedom, cheerio) instead of regex
  • robots.txt compliance
  • A polite delay between requests
  • Concurrency caps and backoff
  • Persistence so a crash doesn't lose progress
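As one example, a polite delay with exponential backoff could be layered onto the fetcher roughly like this (illustrative only; the demo does none of this, and the helper names are made up):

```typescript
// Delay schedule: baseMs, 2×baseMs, 4×baseMs, ... per retry attempt.
function backoffDelay(attempt: number, baseMs = 500): number {
  return baseMs * 2 ** attempt;
}

// Wait before every request, retry failed ones with growing delays.
async function politeFetch(url: string, retries = 3): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    await new Promise((r) => setTimeout(r, backoffDelay(attempt)));
    const res = await fetch(url);
    if (res.ok || attempt >= retries) return res; // give up after `retries`
  }
}
```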
