StarTrail-org/PixelRAG

PixelRAG

Visual Retrieval-Augmented Generation — a framework for building visual search systems from any document type.

PixelRAG renders documents (web pages, PDFs, images) as screenshots, embeds them with a vision-language model, builds FAISS indexes, and serves a search API. Wikipedia's 8.28M articles are the primary benchmark, but the system is general-purpose.
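To make the retrieval step concrete, here is a minimal, dependency-free sketch of what an exact inner-product search computes at query time. It is an illustration only, not PixelRAG's code: the vectors are toy values standing in for tile embeddings, the tile IDs are invented, and `normalize` and `top_k` are hypothetical helpers. A real deployment would hand this work to a FAISS index.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, tile_vectors, k):
    """Return (tile_id, score) pairs for the k highest inner products."""
    q = normalize(query)
    scored = [
        (tile_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for tile_id, vec in tile_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (assumed values); real ones would come
# from the vision-language model and have far more dimensions.
TILES = {
    "paris_tile_0": [0.9, 0.1, 0.0],
    "python_tile_0": [0.0, 0.2, 0.9],
    "france_tile_3": [0.8, 0.3, 0.1],
}

results = top_k([1.0, 0.2, 0.0], TILES, k=2)
for tile_id, score in results:
    print(f"{tile_id}: {score:.3f}")
```

An index library replaces the linear scan in `top_k` with a data structure that scales to millions of vectors, but the ranking criterion is the same idea.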

Architecture

Five packages, each independently installable:

| Package | What it does | Install |
| --- | --- | --- |
| pixelrag-render | Document → image tiles (Playwright CDP, PDF) | uv sync --package pixelrag-render |
| pixelrag-embed | Tiles → vectors → FAISS index (three independent tools) | uv sync --package pixelrag-embed |
| pixelrag-index | Orchestrates the full pipeline: source → ingest → embed → index | uv sync --package pixelrag-index |
| pixelrag-serve | FAISS search API (FastAPI, CPU or GPU) | uv sync --package pixelrag-serve |
| pixelrag-train | LoRA/DoRA fine-tuning for Qwen3-VL-Embedding | uv sync --package pixelrag-train |

render ←── index ──→ embed       serve (independent)       train → serve (HTTP)

Quick Start

Search pre-built Wikipedia index

uv sync --package pixelrag-serve

# Download a pre-built index (requires the AWS CLI)
aws s3 sync s3://wiki-screenshot-tiles-backup/kiwix_tiles/text_search_index_1024/ ./index/

# Start the API
pixelrag-serve --index-dir ./index --port 30001

# Query
curl -X POST http://localhost:30001/search \
  -H "Content-Type: application/json" \
  -d '{"queries": [{"text": "What is the capital of France?"}], "n_docs": 5}'
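The same query can be issued from Python. The sketch below mirrors the curl call above using only the standard library. The endpoint path, port, and payload shape are taken from that call; the helper names `build_search_request` and `search` are hypothetical, and treating the response body as JSON is an assumption, since the response schema is not documented here.

```python
import json
import urllib.request


def build_search_request(text, n_docs=5, base_url="http://localhost:30001"):
    """Build the POST request for /search without sending it."""
    payload = {"queries": [{"text": text}], "n_docs": n_docs}
    return urllib.request.Request(
        url=f"{base_url}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(text, n_docs=5, base_url="http://localhost:30001"):
    """Send the query and decode the reply (assumed to be JSON)."""
    request = build_search_request(text, n_docs=n_docs, base_url=base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running pixelrag-serve instance on port 30001.
    print(search("What is the capital of France?"))
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made.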

Build an index from local documents

uv sync --package pixelrag-index

# Create pixelrag.yaml
cat > pixelrag.yaml << 'EOF'
source:
  type: local
  path: ./my_docs

embed:
  model: Qwen/Qwen3-VL-Embedding-2B
  device: cuda
  gpu_ids: [0]

output: ./my_index
EOF

# Build
pixelrag-index build

# Serve
pixelrag-serve --index-dir ./my_index --port 30001
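When the build is driven from a script, the config can be generated rather than written by hand. This is a rough sketch under stated assumptions: it reproduces only the keys shown in the example config above, assumes `pixelrag-index build` picks up `pixelrag.yaml` from the current directory (as the steps above imply), and uses hypothetical helper names `render_config` and `build_index`.

```python
import subprocess
from pathlib import Path

# Template limited to the keys shown in the example config above.
CONFIG_TEMPLATE = """\
source:
  type: local
  path: {docs_path}

embed:
  model: {model}
  device: cuda
  gpu_ids: [{gpu_ids}]

output: {output}
"""


def render_config(docs_path, output,
                  model="Qwen/Qwen3-VL-Embedding-2B", gpu_ids=(0,)):
    """Return the pixelrag.yaml text for a local-documents build."""
    return CONFIG_TEMPLATE.format(
        docs_path=docs_path,
        model=model,
        gpu_ids=", ".join(str(g) for g in gpu_ids),
        output=output,
    )


def build_index(docs_path, output):
    """Write pixelrag.yaml in the current directory, then run the build."""
    Path("pixelrag.yaml").write_text(render_config(docs_path, output))
    # Assumption: the CLI reads pixelrag.yaml from the working directory.
    subprocess.run(["pixelrag-index", "build"], check=True)


if __name__ == "__main__":
    build_index("./my_docs", "./my_index")
```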

Render a single URL (agent use)

from pixelrag_render import render_url

tiles = render_url("https://en.wikipedia.org/wiki/Python", "./tiles")

Claude Code plugin — give Claude eyes

Setup (one-time):

./plugin/setup.sh

Then copy-paste any of these:

# "What does Hacker News look like right now?"
claude --plugin-dir ./plugin -p "screenshot https://news.ycombinator.com and summarize the top stories"

# "Read a research paper visually"
claude --plugin-dir ./plugin -p "screenshot https://arxiv.org/abs/2404.12387 and explain the key findings"

# "Check if my site looks right"
claude --plugin-dir ./plugin -p "screenshot http://localhost:3000 and tell me if anything looks broken"

Or start an interactive session and use the slash command:

claude --plugin-dir ./plugin
# then type: /screenshot https://example.com

No MCP server, no backend required — the plugin teaches Claude to call pixelrag-render directly via Bash and read the resulting tile images.

Embed tools (standalone)

Each tool works independently without the orchestrator:

pixelrag-chunk --tiles-dir ./tiles
pixelrag-embed --shard-dir ./tiles --output-dir ./embeddings --gpu-ids 0,1
pixelrag-build-index --embeddings-dir ./embeddings --output-dir ./index
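The three commands can also be chained from Python so that a failure in one stage stops the rest. The sketch below uses only the commands and flags listed above; `pipeline_commands` and `run_pipeline` are hypothetical wrapper names, and running the stages strictly in this order is an assumption based on how they are listed.

```python
import subprocess


def pipeline_commands(tiles_dir, embeddings_dir, index_dir, gpu_ids=(0, 1)):
    """Return the three standalone commands as argument lists, in order."""
    gpu_arg = ",".join(str(g) for g in gpu_ids)
    return [
        ["pixelrag-chunk", "--tiles-dir", tiles_dir],
        ["pixelrag-embed", "--shard-dir", tiles_dir,
         "--output-dir", embeddings_dir, "--gpu-ids", gpu_arg],
        ["pixelrag-build-index", "--embeddings-dir", embeddings_dir,
         "--output-dir", index_dir],
    ]


def run_pipeline(tiles_dir, embeddings_dir, index_dir, gpu_ids=(0, 1)):
    """Run each stage in turn; check=True raises on the first failure."""
    for command in pipeline_commands(tiles_dir, embeddings_dir,
                                     index_dir, gpu_ids):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline("./tiles", "./embeddings", "./index")
```

Passing argument lists instead of a shell string avoids quoting problems when directory names contain spaces.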

License

Apache-2.0

About

The end of web parsing. The beginning of scalable pixel-native search.
