An autonomous AI tour guide for cultural and heritage sites — point your phone at any landmark and get a real-time, persona-driven narration with conversational follow-ups.
A tourist photographs the Brandenburg Gate. The system extracts GPS from the photo's EXIF metadata, classifies the scene with CLIP, finds the nearest landmark in its knowledge base, retrieves contextual facts via RAG, and streams a narration through GPT-4o-mini — all in one API call. The user can then ask follow-up questions in natural language; the agent detects intent, queries live web results when needed, and responds in character.
Core loop:

```
Photo → EXIF GPS → CLIP scene → nearest landmark → RAG context → GPT narration (streamed)
                                                                        ↓
                                            follow-up chat with intent detection + web search
```
```
┌─────────────────────────────────────────────────────────────────┐
│                          Streamlit UI                            │
│    Upload photo ──► SSE stream narration ──► Chat follow-ups     │
└────────────────────────────┬────────────────────────────────────┘
                             │ HTTP
┌────────────────────────────▼────────────────────────────────────┐
│                       FastAPI (port 8001)                        │
│                                                                  │
│   POST /agent/stream            POST /agent/followup             │
│         │                            │                           │
│  ┌──────▼──────────────────┐   ┌─────▼──────────────────────┐   │
│  │ LangGraph Pipeline      │   │ Intent Detector            │   │
│  │ vision → context →      │   │ keyword + GPT-4o-mini      │   │
│  │ narration (streamed)    │   │ routing to tools           │   │
│  └──────┬──────────────────┘   └─────┬──────────────────────┘   │
│         │                            │                           │
│  ┌──────▼────────┐   ┌───────────────▼──────────────────────┐   │
│  │ CLIP ViT-B/32 │   │ Tools                                │   │
│  │ scene class.  │   │ • RAG retriever (pgvector)           │   │
│  └───────────────┘   │ • Foursquare reverse geocoder        │   │
│                      │ • DuckDuckGo web search (live)       │   │
│  ┌──────────────┐    │ • Nearby places (DB)                 │   │
│  │ PostgreSQL   │    └──────────────────────────────────────┘   │
│  │ landmarks    │                                                │
│  │ pgvector     │    Session memory (DB-backed)                  │
│  │ conv. turns  │    Location context injected per request       │
│  └──────────────┘                                                │
└─────────────────────────────────────────────────────────────────┘
```
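The vision → context → narration flow can be sketched as three functions over a shared state dict. This is a plain-Python illustration of the node ordering, not the project's actual `GraphState` (field names and stub values here are made up); in the real pipeline LangGraph wires the same three steps as `StateGraph` nodes.

```python
from typing import TypedDict


class State(TypedDict, total=False):
    image_b64: str
    scene: str
    landmark: str
    context: str
    narration: str


def vision_node(state: State) -> State:
    # Real node: CLIP zero-shot classification of the uploaded photo.
    state["scene"] = "monument"
    return state


def context_node(state: State) -> State:
    # Real node: haversine nearest-landmark lookup + pgvector RAG retrieval.
    state["landmark"] = "Brandenburg Gate"
    state["context"] = "Completed in 1791 ..."
    return state


def narration_node(state: State) -> State:
    # Real node: GPT-4o-mini streams tokens; here we just build a string.
    state["narration"] = f"Standing before the {state['landmark']}..."
    return state


def run_pipeline(state: State) -> State:
    # LangGraph connects these with edges in exactly this order.
    for node in (vision_node, context_node, narration_node):
        state = node(state)
    return state
```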
| Layer | Technology |
|---|---|
| Language | Python 3.11 |
| Vision | OpenAI CLIP (ViT-B/32) via PyTorch — zero-shot scene classification |
| Orchestration | LangGraph StateGraph — vision → context → narration nodes |
| LLM | GPT-4o-mini via LangChain (langchain-openai) |
| RAG | pgvector + sentence-transformers (all-MiniLM-L6-v2) |
| Geolocation | EXIF GPS extraction + Foursquare Places API v3 |
| Web search | DuckDuckGo (duckduckgo-search) — live results, no API key |
| API | FastAPI + Uvicorn, SSE via StreamingResponse |
| Database | PostgreSQL 16 + pgvector extension |
| ORM | SQLAlchemy 2.0 with Mapped / mapped_column |
| Session memory | DB-backed (UserSession + ConversationTurn models) |
| Containerisation | Docker multi-stage build + docker-compose |
| CI | GitHub Actions — ruff lint + 84 pytest tests on every push |
| Demo UI | Streamlit with live SSE token streaming |
| Testing | pytest, SQLite StaticPool, all CLIP calls mocked |
| Linting | Ruff |
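Zero-shot classification with CLIP ranks candidate scene labels by cosine similarity between the image embedding and each label's text embedding, scaled by CLIP's learned logit scale (around 100) and softmaxed into the confidence scores shown in the SSE stream. A stdlib sketch of that scoring step, with toy 3-d vectors standing in for CLIP's real embeddings:

```python
import math


def zero_shot_scores(image_emb, label_embs, logit_scale=100.0):
    """Softmax over logit-scaled cosine similarities, as CLIP does at inference."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    logits = [logit_scale * cos(image_emb, e) for e in label_embs.values()]
    peak = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(label_embs, exps)}


# Toy embeddings standing in for CLIP's 512-d vectors.
probs = zero_shot_scores(
    image_emb=[0.9, 0.1, 0.1],
    label_embs={
        "monument": [1.0, 0.0, 0.0],
        "museum": [0.0, 1.0, 0.0],
        "park": [0.0, 0.0, 1.0],
    },
)
```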
```bash
# 1. Clone
git clone https://github.com/Rithub14/spatial-context-agent.git
cd spatial-context-agent

# 2. Copy env and fill in your keys
cp .env.example .env
#    Required: OPENAI_API_KEY
#    Optional: FOURSQUARE_API_KEY (falls back to DB-only landmarks without it)

# 3. Start API + PostgreSQL (pgvector image)
docker-compose up --build -d

# 4. Seed 18 Berlin landmarks
docker-compose exec api python -m src.db.seed

# 5. (Optional) Build the RAG knowledge base from Wikipedia
docker-compose exec api python -m src.db.build_knowledge_base

# 6. Launch Streamlit
pip install streamlit httpx pandas Pillow
streamlit run streamlit_app/app.py
```

Or run the API locally without Docker:

```bash
uv venv && uv pip install -r requirements.txt
uvicorn src.api.main:app --port 8001 --reload
```

### POST /agent/stream

Runs the full LangGraph pipeline and streams narration token-by-token.
Request
```json
{
  "image": "<base64 JPEG/PNG>",
  "latitude": 52.5163,
  "longitude": 13.3777,
  "persona": "historian",
  "session_id": null
}
```

`latitude` / `longitude` are optional if the image contains GPS EXIF metadata.
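When coordinates are omitted, the server falls back to the photo's EXIF GPS tags, which store latitude and longitude as degrees/minutes/seconds rationals. The conversion to signed decimal degrees looks roughly like this (the helper name is ours, not the project's; Pillow's `IFDRational` values are accepted by `Fraction`, which is the "IFDRational-safe" part):

```python
from fractions import Fraction


def dms_to_decimal(dms, ref):
    """Convert EXIF (degrees, minutes, seconds) rationals to signed decimal degrees."""
    deg, minutes, seconds = (
        float(Fraction(*v)) if isinstance(v, tuple) else float(v) for v in dms
    )
    decimal = deg + minutes / 60 + seconds / 3600
    # South and West hemispheres are negative.
    return -decimal if ref in ("S", "W") else decimal


# Brandenburg Gate: 52°30'58.7"N, 13°22'39.7"E
lat = dms_to_decimal(((52, 1), (30, 1), (587, 10)), "N")
lon = dms_to_decimal(((13, 1), (22, 1), (397, 10)), "E")
```

These round to the 52.5163 / 13.3777 shown in the request example above.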
persona options: historian · storyteller · local · child_friendly
SSE event stream
data: {"type": "step", "content": "👁️ Scene identified: monument (87% confidence)"}
data: {"type": "step", "content": "📍 Landmark found: Brandenburg Gate (42m away)"}
data: {"type": "step", "content": "📚 Retrieved knowledge context (1842 chars)"}
data: {"type": "token", "content": "Standing before"}
data: {"type": "token", "content": " the iconic"}
...
data: {"type": "done", "session_id": "...", "scene": {...}, "location": {...}, "metadata": {...}}
### POST /agent/followup

Request

```json
{
  "session_id": "abc123",
  "question": "any exhibitions happening nearby?",
  "persona": "historian"
}
```

Response
```json
{
  "session_id": "abc123",
  "answer": "As of March 2026, the Pergamon Museum is...",
  "intent": "current_events",
  "intent_confidence": 1.0
}
```

Detected intents:
| Intent | Triggered by | Action |
|---|---|---|
| `nearby_places` | "what else is nearby?" | DB proximity search |
| `historical_facts` | "when was it built?" | RAG knowledge retrieval |
| `current_events` | "any exhibitions?" | Live DuckDuckGo web search |
| `opening_hours` | "when does it open?" | LLM with advisory note |
| `directions` | "how do I get there?" | LLM using last known GPS |
| `translation` | "say that in German" | LLM |
| `photo_tip` | "best angle for a photo?" | LLM |
| `moved` | "I've moved" | Prompt to upload new photo |
| `general` | anything else | LLM |
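The keyword fallback behind the LLM classifier can be as simple as a first-match lookup over phrase lists. The phrases below are illustrative, not the project's actual keyword sets; since first match wins, ambiguous questions like "any exhibitions happening nearby?" are better left to the GPT-4o-mini path, which is why keywords are only the fallback:

```python
# Illustrative keyword sets; the real detector's lists differ.
KEYWORDS = {
    "nearby_places": ("nearby", "around here", "close by"),
    "historical_facts": ("history", "built", "who made"),
    "current_events": ("exhibition", "event", "happening"),
    "opening_hours": ("open", "hours"),
    "directions": ("how do i get", "directions", "how far"),
    "translation": ("translate", "in german"),
    "photo_tip": ("photo", "angle", "picture"),
    "moved": ("i've moved", "i moved", "new location"),
}


def detect_intent(question: str) -> str:
    """Return the first intent whose phrase list matches; 'general' otherwise."""
    q = question.lower()
    for intent, phrases in KEYWORDS.items():
        if any(p in q for p in phrases):
            return intent
    return "general"
```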
The user's GPS coordinates from their last uploaded photo are injected into every follow-up so distance questions ("how far is X?") are answered relative to their actual location.
Today's date is always injected so the LLM cannot hallucinate stale event information.
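Distance answers rest on the same haversine formula used for the nearest-landmark lookup; a stdlib version (the project's implementation may differ in detail):

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6_371_000  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# Brandenburg Gate → Reichstag building, roughly 275 m apart
d = haversine_m(52.5163, 13.3777, 52.5186, 13.3762)
```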
Same pipeline as /agent/stream but returns a single JSON response. Useful for testing or non-streaming clients.
Paginated list of all landmarks in the database.
Add a landmark (requires X-API-Key header when ENABLE_AUTH=true).
{"status": "ok", "model_loaded": true, "db_connected": true, "uptime_seconds": 142.7}spatial-context-agent/
├── src/
│ ├── api/
│ │ ├── main.py # FastAPI app, lifespan, middleware wiring
│ │ ├── routes/
│ │ │ ├── agent.py # All agent endpoints + intent routing helpers
│ │ │ └── health.py # GET /health
│ │ ├── middleware/
│ │ │ ├── auth.py # API key validation (toggleable)
│ │ │ └── rate_limiter.py # Per-IP sliding window (toggleable)
│ │ └── schemas/
│ │ ├── request.py # Pydantic request models
│ │ └── response.py # Pydantic response models
│ ├── agent/
│ │ ├── graph.py # LangGraph StateGraph (vision→context→narration)
│ │ ├── tools.py # LangChain @tool wrappers for pipeline components
│ │ └── memory.py # DB-backed session memory (UserSession, ConversationTurn)
│ ├── pipeline/
│ │ ├── clip_inference.py # CLIP model loading + logit-scaled inference
│ │ ├── scene_classifier.py # Zero-shot classification (12 categories)
│ │ ├── location_extractor.py # EXIF GPS extraction (IFDRational-safe)
│ │ ├── context_retriever.py # Haversine nearest-landmark lookup
│ │ ├── narration_engine.py # GPT-4o-mini narration + template fallback
│ │ ├── embedder.py # sentence-transformers singleton
│ │ ├── rag_retriever.py # pgvector cosine similarity retrieval
│ │ └── intent_detector.py # Intent classification (LLM + keyword fallback)
│ ├── db/
│ │ ├── models.py # Landmark, InferenceLog, LandmarkChunk,
│ │ │ # UserSession, ConversationTurn
│ │ ├── seed.py # 18 Berlin landmarks with GPS + narration templates
│ │ ├── build_knowledge_base.py # Embeds DB text + Wikipedia into pgvector
│ │ └── session.py # Engine, SessionLocal, get_db
│ └── config.py # pydantic-settings (all config from env vars)
├── tests/ # 84 tests — CLIP always mocked, SQLite StaticPool
├── streamlit_app/
│ └── app.py # Streamlit demo: SSE streaming + chat interface
├── scripts/
│ └── smoke_test.py # Manual health + analyze + locations check
├── docker/
│ └── init-pgvector.sql # CREATE EXTENSION vector (runs on DB init)
├── .github/workflows/
│ └── ci-cd.yml # CI: ruff + pytest on every push/PR
├── Dockerfile # Multi-stage: builder (gcc, git) + runtime
├── docker-compose.yml # pgvector/pgvector:pg16 + FastAPI
├── requirements.txt
└── .env.example
```bash
# Run all 84 tests
PYTHONPATH=. pytest tests/ -v

# With coverage
PYTHONPATH=. pytest --cov=src --cov-report=term-missing tests/

# Smoke test against a running instance
python scripts/smoke_test.py --url http://localhost:8001
```

CLIP is always mocked (it is slow to load). The test database uses SQLite in-memory with StaticPool so all connections share one instance; pgvector tests are skipped in SQLite mode.
```bash
# Database
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/spatial_agent

# Model
CLIP_MODEL_NAME=ViT-B/32
DEVICE=cpu

# API
API_HOST=0.0.0.0
API_PORT=8000

# Security (toggleable)
ENABLE_AUTH=false
ENABLE_RATE_LIMIT=false
API_KEY=dev-key-change-in-production
RATE_LIMIT_RPM=60

# Agentic system (required for LLM narration, intent detection, web search)
OPENAI_API_KEY=sk-...

# Foursquare (optional — enables worldwide reverse geocoding)
FOURSQUARE_API_KEY=...
```

MIT