Generate structured metadata for a list of URLs. The pipeline reads links.md, fetches page content, classifies each link, and writes results to JSON.
uv sync
cp .env.example .env

Set OPENROUTER_API_KEY in .env, add URLs to links.md, then run:

uv run main.py

Results are written to data/links.json.
- Install deps: uv sync (creates a local environment and installs packages)
- Run pipeline: uv run main.py (reads links.md and writes data/links.json)
- Run tests: uv run pytest (executes the test suite)
Data flow:
- links.md is parsed by utils.extract_urls.
- pipeline.process_links deduplicates URLs and runs concurrent classification.
- classifier.Classifier fetches metadata and calls the LLM.
- storage.LinkStore writes records to data/links.json.
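
The concurrent classification step in the data flow above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: classify_stub is a hypothetical stand-in for classifier.Classifier (it makes no network or LLM call), and the function signatures are assumptions.

```python
import asyncio


async def classify_stub(url: str) -> dict:
    """Hypothetical stand-in for classifier.Classifier: no fetch, no LLM call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"url": url, "category": "unknown"}


async def process_links(urls: list[str], max_concurrency: int = 8) -> list[dict]:
    """Deduplicate URLs (keeping first-seen order), then classify them concurrently."""
    unique = list(dict.fromkeys(urls))  # dicts preserve insertion order
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> dict:
        async with semaphore:  # cap the number of in-flight classifications
            return await classify_stub(url)

    # gather returns results in the order the awaitables were passed in
    return await asyncio.gather(*(bounded(u) for u in unique))


def run_pipeline(urls: list[str]) -> list[dict]:
    """Synchronous entry point for the sketch."""
    return asyncio.run(process_links(urls))
```

Bounding the work with a semaphore keeps the number of simultaneous requests at or below the configured limit, while asyncio.gather keeps the output in input order.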
Design principles:
- Deterministic input: URL order is preserved after deduplication.
- Idempotent writes: existing URLs are skipped by normalized URL.
- Simple I/O: JSON file storage to keep dependencies minimal.
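
As an illustration of the first two principles, the sketch below filters a batch of URLs against those already stored. The normalization rule (lowercase scheme and host, drop the fragment and any trailing slash) is an assumption made for this example; the project's actual rule may differ, and normalize_url and new_urls are hypothetical names.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Assumed normalization: lowercase scheme/host, drop fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def new_urls(candidates: list[str], existing_normalized: set[str]) -> list[str]:
    """Return candidates not already stored, deduplicated, in first-seen order."""
    seen = set(existing_normalized)  # copy so the caller's set is not mutated
    result = []
    for url in candidates:
        key = normalize_url(url)
        if key in seen:
            continue  # already stored, or already queued earlier in this run
        seen.add(key)
        result.append(url)
    return result
```

Running the same input twice against the updated set of stored URLs yields an empty list the second time, which is what makes the write step idempotent.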
data/links.json:
{
  "links": [
    {
      "id": 1,
      "url": "https://github.com/astral-sh/ruff",
      "normalized_url": "https://github.com/astral-sh/ruff",
      "domain": "github.com",
      "title": "Ruff",
      "description": "Fast Python linter and formatter written in Rust",
      "site_name": "GitHub",
      "image_url": null,
      "category": "code",
      "context": "Useful for Python developers seeking faster linting",
      "created_at": "2026-01-01T00:00:00+00:00"
    }
  ],
  "next_id": 2
}

Project layout:
- main.py: entry point
- src/: core modules (fetcher.py, classifier.py, pipeline.py, storage.py)
- tests/: pytest tests
- data/: output directory
- links.md: input file
- .env.example: configuration template
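
To show how the links and next_id fields in data/links.json fit together, here is a minimal sketch of a JSON-backed store. LinkStoreSketch and its methods are hypothetical and written only for this example; the real storage.LinkStore interface is not documented here and may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


class LinkStoreSketch:
    """Hypothetical JSON store using the {"links": [...], "next_id": N} layout."""

    def __init__(self, path: Path) -> None:
        self.path = path
        if path.exists():
            self.data = json.loads(path.read_text(encoding="utf-8"))
        else:
            self.data = {"links": [], "next_id": 1}

    def add(self, record: dict) -> bool:
        """Append a record unless its normalized_url is already stored."""
        known = {link["normalized_url"] for link in self.data["links"]}
        if record["normalized_url"] in known:
            return False  # idempotent: existing URLs are skipped
        stored = {
            "id": self.data["next_id"],
            **record,
            "created_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        }
        self.data["links"].append(stored)
        self.data["next_id"] += 1  # ids are never reused
        return True

    def save(self) -> None:
        """Write the whole document back to disk as indented JSON."""
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.data, indent=2), encoding="utf-8")
```

Keeping next_id in the file, rather than deriving it from the list length, means ids stay unique even if records are later removed by hand.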
Environment variables (via .env):
- OPENROUTER_API_KEY: required
- OPENROUTER_MODEL: default openai/gpt-4o-mini
- OPENROUTER_BASE_URL: default https://openrouter.ai/api/v1
- LINKREC_DATA_PATH: default data/links.json
- LINKREC_MAX_CONCURRENCY: default 8
- LINKREC_TIMEOUT: default 12 seconds
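
The variables and defaults listed above could be read along these lines. This is a sketch only: Settings and load_settings are hypothetical names, and it assumes the values from .env have already been loaded into the process environment by some other step.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    api_key: str
    model: str
    base_url: str
    data_path: str
    max_concurrency: int
    timeout: float  # seconds


def load_settings(env=None) -> Settings:
    """Build settings from a mapping (os.environ by default), applying the listed defaults."""
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY", "")
    if not api_key:
        raise ValueError("OPENROUTER_API_KEY is required")
    return Settings(
        api_key=api_key,
        model=env.get("OPENROUTER_MODEL", "openai/gpt-4o-mini"),
        base_url=env.get("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1"),
        data_path=env.get("LINKREC_DATA_PATH", "data/links.json"),
        max_concurrency=int(env.get("LINKREC_MAX_CONCURRENCY", "8")),
        timeout=float(env.get("LINKREC_TIMEOUT", "12")),
    )
```

Passing a plain dict instead of os.environ makes the loader easy to exercise in tests without touching the real environment.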
links.md:
https://github.com/astral-sh/ruff

Run uv run main.py to generate data/links.json.
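
For illustration, pulling URLs out of links.md might look like the sketch below. It is not the project's utils.extract_urls; the regular expression and the trailing-punctuation handling are assumptions chosen for this example.

```python
import re

# Assumed pattern: an http(s) URL runs until whitespace, an angle bracket,
# a parenthesis, or a square bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def extract_urls(markdown_text: str) -> list[str]:
    """Return every http(s) URL in the text, in order of appearance."""
    urls = []
    for match in URL_PATTERN.finditer(markdown_text):
        # Strip punctuation that usually belongs to the sentence, not the URL
        urls.append(match.group(0).rstrip(".,;:!?\"'"))
    return urls
```

This handles bare URLs as well as markdown links of the form [text](url), at the cost of truncating URLs that themselves contain parentheses.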