
solid

A single-binary screen-watcher that detects on-screen questions and answers them with an LLM, exposing the result over HTTP.


solid polls the screen, identifies any quiz / homework / interview question, sends it to an LLM, and serves the answer over a tiny HTTP API (REST + Server-Sent Events). It runs as one binary on Windows 10/11 and macOS 14+.



Features

  • Two strategies, one binary
    • Vision (default for LLM_PROVIDER=anthropic) — sends the screenshot directly to Claude Sonnet; no OCR step, handles "All of the above" / "Both A and B" catch-alls correctly.
    • OCR + text (for LLM_PROVIDER=ollama) — Tesseract extracts the question, then a local Ollama model answers it.
  • Question typing — multiple_choice, code, short_answer, and plain markers tune the prompt so MCQ replies stay one-letter-plus-justification while code replies stay in fenced blocks.
  • Live updates over SSE — GET /question/stream pushes a new event the moment the worker writes a fresh answer.
  • Smart de-duplication — perceptual content hash skips unchanged screens; normalized Levenshtein collapses near-identical questions caused by OCR jitter.
  • Fast capture paths — DXGI Output Duplication on Windows (GDI fallback), ScreenCaptureKit on macOS 14+ (CGDisplay fallback).
  • Self-contained release bundles — the *-ocr archives ship Tesseract + Leptonica + eng.traineddata so you can extract and run with no extra install.
  • G2 HUD companion app — mirrors the current Q/A onto Even Realities G2 smart glasses (see G2 companion app).
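
The smart de-duplication above starts with a perceptual content hash on each frame. A minimal average-hash sketch (an illustrative, std-only stand-in — not necessarily what perceptual_hash.rs actually implements) looks like this:

```rust
/// Average hash over a grayscale frame: bucket the image into an
/// 8x8 grid, then set one bit per cell whose mean brightness is
/// above the global mean. `gray` is assumed to be w*h bytes.
fn average_hash(gray: &[u8], w: usize, h: usize) -> u64 {
    let (gw, gh) = (8usize, 8usize);
    let mut cells = [0u64; 64];
    for gy in 0..gh {
        for gx in 0..gw {
            let (x0, x1) = (gx * w / gw, (gx + 1) * w / gw);
            let (y0, y1) = (gy * h / gh, (gy + 1) * h / gh);
            let (mut sum, mut n) = (0u64, 0u64);
            for y in y0..y1 {
                for x in x0..x1 {
                    sum += gray[y * w + x] as u64;
                    n += 1;
                }
            }
            cells[gy * gw + gx] = if n > 0 { sum / n } else { 0 };
        }
    }
    let mean: u64 = cells.iter().sum::<u64>() / 64;
    let mut hash = 0u64;
    for (i, c) in cells.iter().enumerate() {
        if *c > mean {
            hash |= 1u64 << i;
        }
    }
    hash
}
```

Two consecutive frames that produce the same 64-bit hash are treated as "nothing changed" and skipped before any OCR or LLM work happens.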

How it works

  1. Every 2 seconds, the worker captures the primary display.
  2. A perceptual hash skips frames where nothing changed.
  3. Depending on the strategy, the frame is either:
    • shipped to Claude vision, which returns {question, answer} or {none: true}, or
    • OCR'd, classified into a QuestionType, and forwarded to Ollama.
  4. Near-duplicate questions (normalized Levenshtein ≥ 0.90 vs the previous one) are dropped.
  5. Fresh answers are written to SQLite, cached in memory, and broadcast to any open SSE subscribers.
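
Step 4's near-duplicate check can be sketched with a plain normalized Levenshtein similarity (a std-only illustration; text.rs may normalize the strings differently before comparing):

```rust
/// Normalized similarity in [0, 1]; 1.0 means identical strings.
/// Per the pipeline above, a question with similarity >= 0.90 vs
/// the previous one is considered OCR jitter and dropped.
fn similarity(a: &str, b: &str) -> f64 {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let (m, n) = (a.len(), b.len());
    if m.max(n) == 0 {
        return 1.0;
    }
    // Two-row dynamic-programming Levenshtein distance.
    let mut prev: Vec<usize> = (0..=n).collect();
    let mut cur = vec![0usize; n + 1];
    for i in 1..=m {
        cur[0] = i;
        for j in 1..=n {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            cur[j] = (prev[j] + 1).min(cur[j - 1] + 1).min(prev[j - 1] + cost);
        }
        std::mem::swap(&mut prev, &mut cur);
    }
    1.0 - prev[n] as f64 / m.max(n) as f64
}
```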

Quick start

The fastest path is the pre-built *-ocr archive from the latest release — Tesseract + Leptonica are linked in and eng.traineddata is staged under tessdata/ next to the binary.

# 1. extract the archive, then:
export ANTHROPIC_API_KEY=sk-ant-...
./solid               # macOS
solid.exe             # Windows

# 2. while a question is on screen
curl http://127.0.0.1:8088/question

# 3. live stream
curl -N http://127.0.0.1:8088/question/stream

The slim archives (without -ocr) are smaller but require a local Tesseract install and your own eng.traineddata per the discovery rules in Tesseract data files.


Installation

From source

git clone https://github.com/0x000NULL/solid.git
cd solid
cargo build --release
./target/release/solid          # macOS
./target/release/solid.exe      # Windows

Requirements:

| Requirement | Source |
| --- | --- |
| Rust 1.75+ | https://rustup.rs/ |
| Windows 10/11 (x86-64) or macOS 14+ (Apple Silicon or Intel) | — |
| Tesseract eng.traineddata (when using OCR) | https://github.com/tesseract-ocr/tessdata_best |
| ANTHROPIC_API_KEY (when LLM_PROVIDER=anthropic) | https://console.anthropic.com/ |
| Ollama (when LLM_PROVIDER=ollama) | https://ollama.ai/ |

SQLite ships via rusqlite with the bundled feature — no separate install. To build without OCR support: cargo build --release --no-default-features.

Tesseract data files (tessdata)

Only required when using the OCR strategy (LLM_PROVIDER=ollama). solid resolves the directory in this order:

  1. TESSDATA_PREFIX — if set, libtesseract uses this path. Point it at the directory that contains tessdata/ (or directly at the tessdata/ directory, depending on your Tesseract version).
  2. tessdata/ next to the executable — drop eng.traineddata into a tessdata/ directory beside the binary.

Windows:

target/release/
├── solid.exe
└── tessdata/
    └── eng.traineddata

macOS (via Homebrew):

brew install tesseract pkgconf
export TESSDATA_PREFIX="$(brew --prefix tesseract)/share/"
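
The two-step lookup order can be sketched as follows (a hypothetical helper, not solid's actual code; the environment variable and executable directory are passed as parameters so the logic is testable in isolation):

```rust
use std::path::{Path, PathBuf};

/// Resolve the tessdata directory using the documented order:
/// 1) an explicit TESSDATA_PREFIX value wins;
/// 2) otherwise look for a tessdata/ directory beside the binary.
fn resolve_tessdata(prefix: Option<&str>, exe_dir: &Path) -> Option<PathBuf> {
    if let Some(p) = prefix {
        return Some(PathBuf::from(p));
    }
    let beside = exe_dir.join("tessdata");
    beside.exists().then_some(beside)
}
```

If neither source yields a usable directory, OCR initialization fails at startup (see Troubleshooting).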

macOS Screen Recording permission

ScreenCaptureKit and CGDisplayCreateImage both require Screen Recording permission. The first capture attempt prompts the user; grant the permission to the launching app (Terminal, iTerm2, your IDE, or the solid binary itself), then restart solid. If permission is missing, the SCK path is disabled for the process lifetime and the binary falls back to a black/empty CG capture.


Configuration

All configuration is via environment variables. A .env file is loaded automatically if present.

| Variable | Purpose | Default | Required |
| --- | --- | --- | --- |
| LLM_PROVIDER | anthropic (vision) or ollama (OCR + text) | anthropic | No |
| ANTHROPIC_API_KEY | Claude API key | — | When LLM_PROVIDER=anthropic |
| OLLAMA_MODEL | Model name for Ollama | llama3.1 | No |
| DATABASE_PATH | Path to SQLite file | conversation.db | No |
| BIND_ADDR | HTTP server bind address | 0.0.0.0:8088 | No |
| TESSDATA_PREFIX | Override Tesseract data lookup | — | No |
| SOLID_FORCE_GDI | Skip DXGI, use GDI capture only (Windows) | unset | No |
| SOLID_FORCE_CG | Skip ScreenCaptureKit, use CG capture only (macOS) | unset | No |
| RUST_LOG | Tracing filter (error, warn, info, debug, trace) | info | No |

HTTP API

Default bind: 0.0.0.0:8088. CORS is permissive. All timestamps are UTC and serialized as Unix seconds in JSON.

| Method | Endpoint | Description | Success | Other statuses |
| --- | --- | --- | --- | --- |
| GET | /question | Most recent question + answer | 200 JSON Answer | 204 when none detected yet |
| GET | /history | Up to 50 most recent records, newest first | 200 JSON Answer[] | 500 on DB error |
| GET | /question/stream | Server-Sent Events feed; emits an Answer event each time a new answer is recorded | 200 text/event-stream | — |

Answer shape:

{
  "question": "What is the capital of France?",
  "answer": "Paris.",
  "confidence": 0.9,
  "timestamp": 1745500000
}
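
On the Rust side, that payload maps naturally onto a plain struct (a sketch of the shape only; solid's real state.rs type may carry serde attributes or extra fields):

```rust
/// Mirror of the Answer JSON payload above.
#[derive(Debug, Clone, PartialEq)]
struct Answer {
    question: String,
    answer: String,
    confidence: f64, // 0.0..=1.0
    timestamp: u64,  // Unix seconds, UTC
}
```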

Example SSE consumer:

const es = new EventSource("http://127.0.0.1:8088/question/stream");
es.onmessage = (ev) => {
  const a = JSON.parse(ev.data);
  console.log(`Q: ${a.question}\nA: ${a.answer}`);
};

G2 companion app

g2-app/ is a Vite + TypeScript app that runs on Even Realities G2 smart glasses via the EvenHub SDK. It subscribes to the local solid server's /question/stream and renders the current Q/A onto the HUD, with scroll-up / scroll-down events from the temple touch surface.

cd g2-app
npm install
npm run dev          # local browser dev
npm run simulate     # run against the EvenHub simulator
npm run pack         # package as solid.ehpk for the glasses

Configure the host (e.g. http://192.168.1.42:8088) on the settings screen the first time the app starts.


Architecture

  ┌──────────────┐     ┌────────────────────────────────┐     ┌──────────────┐
  │ Screen       │────▶│ Strategy                       │────▶│ LLM          │
  │ Capture      │     │  ├── Vision (Claude Sonnet)    │     │ (Anthropic   │
  │ DXGI / SCK   │     │  └── OCR (Tesseract) + text    │     │  or Ollama)  │
  └──────────────┘     └────────────────────────────────┘     └──────┬───────┘
        │                       │                                    │
        │ perceptual hash       │ qtype + dedup                      │ answer
        ▼                       ▼                                    ▼
  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐  ┌──────────────┐
  │ skip if      │     │ AppState     │◀────│ Worker Loop  │─▶│ SQLite       │
  │ unchanged    │     │ (latest)     │     │ tick 2 s     │  │ conversation │
  └──────────────┘     └──────┬───────┘     └──────┬───────┘  └──────────────┘
                              │ read                │ broadcast
                              ▼                     ▼
                      ┌──────────────────────────────────┐
                      │ HTTP API                         │
                      │  /question  /history  /question/ │
                      │                       stream     │
                      └──────────────────────────────────┘

All capture / OCR / LLM work happens on a single Tokio task with per-stage timeouts (capture 3 s, OCR 8 s, LLM 30 s, vision 20 s). The HTTP server runs on the same runtime; the worker pushes new answers over a tokio::sync::broadcast channel that the SSE handler subscribes to.
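
The per-stage timeout idea can be pictured with a std-only stand-in (solid itself runs async on Tokio and would use tokio::time::timeout; the thread-plus-channel version below is only an illustration of the pattern):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `work` on a helper thread and give up after `limit`,
/// returning None on timeout instead of blocking the loop.
fn with_timeout<T: Send + 'static>(
    limit: Duration,
    work: impl FnOnce() -> T + Send + 'static,
) -> Option<T> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the receiver already gave up, the send just fails silently.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit).ok()
}
```

With budgets like the ones above (capture 3 s, OCR 8 s, LLM 30 s), a stuck stage yields None and the worker simply moves on to the next tick instead of stalling the whole pipeline.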


Project structure

solid/
├─ Cargo.toml
├─ src/
│  ├─ main.rs              # entry point, worker loop, server bind
│  ├─ lib.rs               # crate root for tests
│  ├─ state.rs             # in-memory state + JSON model
│  ├─ capture/             # platform-specific screen capture
│  │  ├─ windows.rs        # DXGI Output Duplication + GDI fallback
│  │  └─ macos.rs          # ScreenCaptureKit + CG fallback
│  ├─ ocr/                 # Tesseract wrapper + question typing
│  ├─ llm.rs               # Anthropic + Ollama clients (text + vision)
│  ├─ http.rs              # actix-web handlers, SSE
│  ├─ db.rs                # SQLite persistence
│  ├─ perceptual_hash.rs   # frame-change detection
│  └─ text.rs              # normalize + dedup helpers
├─ tests/                  # integration tests (wiremock + font8x8 fixtures)
├─ g2-app/                 # Even Realities G2 HUD companion (Vite + TS)
└─ spec.md                 # original design notes

Development

cargo fmt --all
cargo clippy --all-targets -- -D warnings
cargo test
cargo test --no-default-features    # build without OCR

Notable test fixtures:

  • tests/ uses wiremock to stub the Anthropic / Ollama HTTP endpoints.
  • An OCR round-trip test renders "What is 2+2?" via font8x8 and feeds it to extract_question (gated on the ocr feature, #[ignore]d by default).

CI runs fmt, clippy, and cargo test on Windows + macOS. Tagging vX.Y.Z triggers release.yml, which builds slim and OCR-bundled archives for x86_64-pc-windows-msvc and aarch64-apple-darwin.


Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| failed to open database | Bad DATABASE_PATH or unwritable directory | Use an absolute, writable path |
| Tesseract::Error at startup | tessdata not found | Set TESSDATA_PREFIX or place tessdata/ next to the binary |
| macOS: empty / black captures | Screen Recording permission missing | System Settings → Privacy & Security → Screen Recording → enable for the launching app, then restart solid |
| 401 Unauthorized from Anthropic | Missing or invalid ANTHROPIC_API_KEY | Re-check, regenerate if expired |
| /question always 204 | No question detected yet | Wait for a question on screen, or check RUST_LOG=debug for OCR / vision output |
| Vision strategy never fires | Wrong provider | Confirm LLM_PROVIDER is unset or anthropic |
| Ollama connection refused | Service not running | ollama serve; confirm :11434 is listening |
| Windows: TLS errors on large captures | Schannel limit | The vision path already downsizes to 1568 px; force GDI capture with SOLID_FORCE_GDI=1 if DXGI is misbehaving |

Roadmap & extending

| Goal | How |
| --- | --- |
| Add an LLM backend | New variant on LlmProvider, new branch in query_llm |
| Swap OCR engine | Replace ocr/; keep the extract_question signature |
| More metadata | Add columns to conversation; update Db::insert and Answer |
| Web UI | Mount static files in actix, consume /question/stream |
| Windows service | Wrap with the windows-service crate |
| ?limit=N on /history | Parse via web::Query<Limit>, pass to Db::recent |
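
The "new variant, new branch" extension point for LLM backends looks roughly like this (names follow the table; the bodies are placeholders, and the endpoint strings are the providers' public defaults, shown only for illustration):

```rust
/// Sketch of the provider enum; add a backend as a new variant.
enum LlmProvider {
    Anthropic,
    Ollama,
    // MyBackend,
}

/// ...then add a matching branch wherever requests are dispatched.
fn default_endpoint(provider: &LlmProvider) -> &'static str {
    match provider {
        LlmProvider::Anthropic => "https://api.anthropic.com/v1/messages",
        LlmProvider::Ollama => "http://127.0.0.1:11434/api/generate",
        // LlmProvider::MyBackend => "http://...",
    }
}
```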

See spec.md for the original design doc; some details (binary name, polling interval, port) have evolved since v0.1 — this README is the source of truth for the current behavior.


License

MIT. Fork, modify, distribute. Attribution appreciated.
