# solid

A single-binary screen-watcher that detects on-screen questions and answers them with an LLM, exposing the result over HTTP.
solid polls the screen, identifies any quiz / homework / interview question, sends it to an LLM, and serves the answer over a tiny HTTP API (REST + Server-Sent Events). It runs as one binary on Windows 10/11 and macOS 14+.
- Features
- How it works
- Quick start
- Installation
- Configuration
- HTTP API
- G2 companion app
- Architecture
- Development
- Troubleshooting
- Roadmap & extending
- License
## Features

- Two strategies, one binary
  - Vision (default for `LLM_PROVIDER=anthropic`) — sends the screenshot directly to Claude Sonnet; no OCR step, handles "All of the above" / "Both A and B" catch-alls correctly.
  - OCR + text (for `LLM_PROVIDER=ollama`) — Tesseract extracts the question, then a local Ollama model answers it.
- Question typing — `multiple_choice`, `code`, `short_answer`, `plain` markers tune the prompt so MCQ replies stay one-letter-plus-justification while code replies stay in fenced blocks.
- Live updates over SSE — `GET /question/stream` pushes a new event the moment the worker writes a fresh answer.
- Smart de-duplication — a perceptual content hash skips unchanged screens; normalized Levenshtein distance collapses near-identical questions caused by OCR jitter.
- Fast capture paths — DXGI Output Duplication on Windows (GDI fallback), ScreenCaptureKit on macOS 14+ (CGDisplay fallback).
- Self-contained release bundles — the `*-ocr` archives ship Tesseract + Leptonica + `eng.traineddata` so you can extract and run with no extra install.
- G2 HUD companion app — mirrors the current Q/A onto Even Realities G2 smart glasses (see G2 companion app).
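The four question-type markers can be driven by simple text heuristics. The sketch below is illustrative only: the real classifier lives in the Rust `ocr/` module and its exact rules are not documented here, so every heuristic in this snippet is an assumption.

```python
import re

def classify_question(text: str) -> str:
    """Toy classifier for the four documented markers.

    The heuristics are assumptions for illustration, not solid's actual rules.
    """
    t = text.strip()
    # Two or more lettered options ("A) ...", "B. ...") suggest multiple choice.
    if len(re.findall(r"(?m)^\s*[A-D][).]\s", t)) >= 2:
        return "multiple_choice"
    # Code fences, common keywords, or braces suggest a coding question.
    if re.search(r"```|\bdef\b|\bfn\b|[{};]", t):
        return "code"
    # A short sentence ending in "?" reads like a short-answer prompt.
    if t.endswith("?") and len(t.split()) <= 30:
        return "short_answer"
    return "plain"
```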
## How it works

- Every 2 seconds, the worker captures the primary display.
- A perceptual hash skips frames where nothing changed.
- Depending on the strategy, the frame is either:
  - shipped to Claude vision, which returns `{question, answer}` or `{none: true}`, or
  - OCR'd, classified into a `QuestionType`, and forwarded to Ollama.
- Near-duplicate questions (normalized Levenshtein similarity ≥ 0.90 vs the previous one) are dropped.
- Fresh answers are written to SQLite, cached in memory, and broadcast to any open SSE subscribers.
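The de-duplication step boils down to an edit-distance ratio against the previous question. A minimal sketch in pure Python (not the crate solid actually uses); any text normalization such as lowercasing is left to the caller:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_near_duplicate(new: str, old: str, threshold: float = 0.90) -> bool:
    """Similarity = 1 - distance / longer length; drop when >= threshold."""
    if not new and not old:
        return True
    longest = max(len(new), len(old))
    return 1 - levenshtein(new, old) / longest >= threshold
```

With the documented 0.90 threshold, a one-character OCR flicker in a ten-character question still counts as a duplicate.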
## Quick start

The fastest path is the pre-built `*-ocr` archive from the latest release — Tesseract + Leptonica are linked in and `eng.traineddata` is staged under `tessdata/` next to the binary.
```bash
# 1. extract the archive, then:
export ANTHROPIC_API_KEY=sk-ant-...
./solid       # macOS
solid.exe     # Windows

# 2. while a question is on screen
curl http://127.0.0.1:8088/question

# 3. live stream
curl -N http://127.0.0.1:8088/question/stream
```

The slim archives (without `-ocr`) are smaller but require a local Tesseract install and your own `eng.traineddata` per the discovery rules in Tesseract data files.
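The same endpoint is easy to consume from a script. The helper below only interprets the two documented status codes (204 when nothing has been detected yet, 200 with a JSON `Answer`):

```python
import json

def parse_question_response(status: int, body: bytes):
    """Interpret a GET /question response per the HTTP API section."""
    if status == 204:          # no question detected yet
        return None
    if status == 200:          # JSON Answer payload
        return json.loads(body)
    raise RuntimeError(f"unexpected status {status}")
```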
## Installation

```bash
git clone https://github.com/0x000NULL/solid.git
cd solid
cargo build --release
./target/release/solid       # macOS
./target/release/solid.exe   # Windows
```

Requirements:
| Requirement | Source |
|---|---|
| Rust 1.75+ | https://rustup.rs/ |
| Windows 10/11 (x86-64) or macOS 14+ (Apple Silicon or Intel) | — |
| Tesseract `eng.traineddata` (when using OCR) | https://github.com/tesseract-ocr/tessdata_best |
| `ANTHROPIC_API_KEY` (when `LLM_PROVIDER=anthropic`) | https://console.anthropic.com/ |
| Ollama (when `LLM_PROVIDER=ollama`) | https://ollama.ai/ |
SQLite ships via `rusqlite` with the `bundled` feature — no separate install. To build without OCR support: `cargo build --release --no-default-features`.
### Tesseract data files

Only required when using the OCR strategy (`LLM_PROVIDER=ollama`). solid resolves the data directory in this order:

1. `TESSDATA_PREFIX` — if set, libtesseract uses this path. Point it at the directory that contains `tessdata/` (or directly at the `tessdata/` directory, depending on your Tesseract version).
2. `tessdata/` next to the executable — drop `eng.traineddata` into a `tessdata/` directory beside the binary.
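The lookup order above can be mirrored in a few lines. This is a sketch of the documented behavior, not the actual Rust code:

```python
from pathlib import Path

def resolve_tessdata(env: dict, exe_dir: Path):
    """Return the tessdata directory per the documented order, or None."""
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:                                   # 1. explicit override wins
        return Path(prefix)
    local = exe_dir / "tessdata"                 # 2. tessdata/ beside the binary
    if (local / "eng.traineddata").is_file():
        return local
    return None                                  # OCR startup will fail with an error
```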
Windows:

```
target/release/
├── solid.exe
└── tessdata/
    └── eng.traineddata
```
macOS (via Homebrew):

```bash
brew install tesseract pkgconf
export TESSDATA_PREFIX="$(brew --prefix tesseract)/share/"
```

ScreenCaptureKit and `CGDisplayCreateImage` both require Screen Recording permission. The first capture attempt prompts the user; grant the permission to the launching app (Terminal, iTerm2, your IDE, or the solid binary itself), then restart solid. If permission is missing, the SCK path is disabled for the lifetime of the process and the binary falls back to a black/empty CG capture.
## Configuration

All configuration is via environment variables. A `.env` file is loaded automatically if present.
| Variable | Purpose | Default | Required |
|---|---|---|---|
| `LLM_PROVIDER` | `anthropic` (vision) or `ollama` (OCR + text) | `anthropic` | No |
| `ANTHROPIC_API_KEY` | Claude API key | — | When `LLM_PROVIDER=anthropic` |
| `OLLAMA_MODEL` | Model name for Ollama | `llama3.1` | No |
| `DATABASE_PATH` | Path to SQLite file | `conversation.db` | No |
| `BIND_ADDR` | HTTP server bind address | `0.0.0.0:8088` | No |
| `TESSDATA_PREFIX` | Override Tesseract data lookup | — | No |
| `SOLID_FORCE_GDI` | Skip DXGI, use GDI capture only (Windows) | unset | No |
| `SOLID_FORCE_CG` | Skip ScreenCaptureKit, use CG capture only (macOS) | unset | No |
| `RUST_LOG` | Tracing filter (`error`, `warn`, `info`, `debug`, `trace`) | `info` | No |
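Putting the table together, a minimal `.env` for the default vision strategy might look like this (the key value is a placeholder; the loopback bind is a choice, not the default):

```bash
# .env is loaded automatically at startup; the API key below is a placeholder
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
# bind to loopback instead of the 0.0.0.0 default
BIND_ADDR=127.0.0.1:8088
RUST_LOG=info
```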
## HTTP API

Default bind: `0.0.0.0:8088`. CORS is permissive. All timestamps are UTC (stored as ISO 8601; serialized as Unix seconds in JSON).
| Method | Endpoint | Description | Success | Other statuses |
|---|---|---|---|---|
| GET | `/question` | Most recent question + answer | 200 JSON `Answer` | 204 when none detected yet |
| GET | `/history` | Up to 50 most recent records, newest first | 200 JSON `Answer[]` | 500 on DB error |
| GET | `/question/stream` | Server-Sent Events feed; emits an `Answer` event each time a new answer is recorded | 200 `text/event-stream` | — |
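`/question/stream` uses standard SSE framing: `data:` lines carry the payload and a blank line ends each event. A minimal Python parser for that framing, ignoring the optional `event:` and `id:` fields:

```python
def parse_sse(stream: str):
    """Yield the data payload of each SSE event in a text stream."""
    data_lines = []
    for line in stream.splitlines():
        if line.startswith("data:"):
            data_lines.append(line[5:].lstrip())   # strip the field name and one space
        elif line == "" and data_lines:
            yield "\n".join(data_lines)            # blank line terminates the event
            data_lines = []
    if data_lines:                                 # trailing event without a final blank line
        yield "\n".join(data_lines)
```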
`Answer` shape:

```json
{
  "question": "What is the capital of France?",
  "answer": "Paris.",
  "confidence": 0.9,
  "timestamp": 1745500000
}
```

Example SSE consumer:
```js
const es = new EventSource("http://127.0.0.1:8088/question/stream");
es.onmessage = (ev) => {
  const a = JSON.parse(ev.data);
  console.log(`Q: ${a.question}\nA: ${a.answer}`);
};
```

## G2 companion app

`g2-app/` is a Vite + TypeScript app that runs on Even Realities G2 smart glasses via the EvenHub SDK. It subscribes to the local solid server's `/question/stream` and renders the current Q/A onto the HUD, with scroll-up / scroll-down events from the temple touch surface.
```bash
cd g2-app
npm install
npm run dev        # local browser dev
npm run simulate   # run against the EvenHub simulator
npm run pack       # package as solid.ehpk for the glasses
```

Configure the host (e.g. `http://192.168.1.42:8088`) on the settings screen the first time the app starts.
## Architecture

```
┌──────────────┐     ┌────────────────────────────────┐     ┌──────────────┐
│    Screen    │────▶│ Strategy                       │────▶│     LLM      │
│   Capture    │     │  ├── Vision (Claude Sonnet)    │     │  (Anthropic  │
│  DXGI / SCK  │     │  └── OCR (Tesseract) + text    │     │  or Ollama)  │
└──────────────┘     └────────────────────────────────┘     └──────┬───────┘
       │                    │                                      │
       │ perceptual hash    │ qtype + dedup                        │ answer
       ▼                    ▼                                      ▼
┌──────────────┐     ┌──────────────┐◀────┌──────────────┐────▶┌──────────────┐
│   skip if    │     │   AppState   │     │ Worker Loop  │     │    SQLite    │
│  unchanged   │     │   (latest)   │     │   tick 2 s   │     │ conversation │
└──────────────┘     └──────┬───────┘     └──────┬───────┘     └──────────────┘
                            │ read               │ broadcast
                            ▼                    ▼
                     ┌──────────────────────────────────┐
                     │             HTTP API             │
                     │  /question  /history  /question/ │
                     │             stream               │
                     └──────────────────────────────────┘
```
All capture / OCR / LLM work happens on a single Tokio task with per-stage timeouts (capture 3 s, OCR 8 s, LLM 30 s, vision 20 s). The HTTP server runs on the same runtime; the worker pushes new answers over a tokio::sync::broadcast channel that the SSE handler subscribes to.
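The per-stage timeout pattern is simple to reproduce. Here it is sketched with Python's asyncio (the real worker uses Tokio; the budgets below are the numbers quoted above):

```python
import asyncio

# Per-stage budgets in seconds, as documented for the worker loop.
STAGE_TIMEOUTS = {"capture": 3, "ocr": 8, "llm": 30, "vision": 20}

async def run_stage(name: str, coro):
    """Await one pipeline stage under its budget; None signals a timed-out tick."""
    try:
        return await asyncio.wait_for(coro, timeout=STAGE_TIMEOUTS[name])
    except asyncio.TimeoutError:
        return None   # the worker skips this tick rather than blocking the loop
```

A stage that overruns its budget yields `None`, so a hung OCR or LLM call costs at most one tick instead of stalling the whole pipeline.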
```
solid/
├─ Cargo.toml
├─ src/
│  ├─ main.rs            # entry point, worker loop, server bind
│  ├─ lib.rs             # crate root for tests
│  ├─ state.rs           # in-memory state + JSON model
│  ├─ capture/           # platform-specific screen capture
│  │  ├─ windows.rs      # DXGI Output Duplication + GDI fallback
│  │  └─ macos.rs        # ScreenCaptureKit + CG fallback
│  ├─ ocr/               # Tesseract wrapper + question typing
│  ├─ llm.rs             # Anthropic + Ollama clients (text + vision)
│  ├─ http.rs            # actix-web handlers, SSE
│  ├─ db.rs              # SQLite persistence
│  ├─ perceptual_hash.rs # frame-change detection
│  └─ text.rs            # normalize + dedup helpers
├─ tests/                # integration tests (wiremock + font8x8 fixtures)
├─ g2-app/               # Even Realities G2 HUD companion (Vite + TS)
└─ spec.md               # original design notes
```
## Development

```bash
cargo fmt --all
cargo clippy --all-targets -- -D warnings
cargo test
cargo test --no-default-features   # build without OCR
```

Notable test fixtures:
- `tests/` uses `wiremock` to stub the Anthropic / Ollama HTTP endpoints.
- An OCR round-trip test renders "What is 2+2?" via `font8x8` and feeds it to `extract_question` (gated on the `ocr` feature, `#[ignore]`d by default).
CI runs fmt, clippy, and `cargo test` on Windows + macOS. Tagging `vX.Y.Z` triggers `release.yml`, which builds slim and OCR-bundled archives for `x86_64-pc-windows-msvc` and `aarch64-apple-darwin`.
## Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| `failed to open database` | Bad `DATABASE_PATH` or unwritable directory | Use an absolute, writable path |
| `Tesseract::Error` at startup | tessdata not found | Set `TESSDATA_PREFIX` or place `tessdata/` next to the binary |
| macOS: empty / black captures | Screen Recording permission missing | System Settings → Privacy & Security → Screen Recording → enable for the launching app, then restart solid |
| `401 Unauthorized` from Anthropic | Missing or invalid `ANTHROPIC_API_KEY` | Re-check the key; regenerate if expired |
| `/question` always 204 | No question detected yet | Wait for a question on screen, or check `RUST_LOG=debug` for OCR / vision output |
| Vision strategy never fires | Wrong provider | Confirm `LLM_PROVIDER` is unset or `anthropic` |
| Ollama connection refused | Service not running | `ollama serve`; confirm `:11434` is listening |
| Windows: TLS errors on large captures | Schannel limit | The vision path already downsizes to 1568 px; force GDI capture with `SOLID_FORCE_GDI=1` if DXGI is misbehaving |
## Roadmap & extending

| Goal | How |
|---|---|
| Add an LLM backend | New variant on `LlmProvider`, new branch in `query_llm` |
| Swap OCR engine | Replace `ocr/`; keep the `extract_question` signature |
| More metadata | Add columns to `conversation`; update `Db::insert` and `Answer` |
| Web UI | Mount static files in actix, consume `/question/stream` |
| Windows service | Wrap with the `windows-service` crate |
| `?limit=N` on `/history` | Parse via `web::Query<Limit>`, pass to `Db::recent` |
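The `?limit=N` item needs little more than a clamped integer parse. Sketched here in Python rather than actix: the default of 50 matches `/history`, while the cap of 500 is a hypothetical choice, not anything the project specifies:

```python
def parse_limit(query: dict, default: int = 50, cap: int = 500) -> int:
    """Parse ?limit=N; fall back to the default on missing or bad input, clamp to [1, cap]."""
    raw = query.get("limit")
    if raw is None:
        return default
    try:
        n = int(raw)
    except ValueError:
        return default
    return max(1, min(n, cap))
```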
See `spec.md` for the original design doc; some details (binary name, polling interval, port) have evolved since v0.1 — this README is the source of truth for the current behavior.
## License

MIT. Fork, modify, distribute. Attribution appreciated.