RAGU is a local-first Streamlit application for retrieval-augmented question answering over structural biology and computational biology PDFs. It lets users upload papers, index them with self-hosted models, ask evidence-grounded questions, optionally visualize structures in Mol*, and optionally listen to the generated answer through Piper when text-to-speech is available.
The project runs:
- locally with Python and a local Ollama installation
- with Docker Compose using a dedicated Chroma service plus monitoring and observability services
- Upload one or more PDF documents.
- Parse and split them into chunks.
- Create embeddings through Ollama.
- Store vectors in Chroma.
- Retrieve candidate chunks for a question.
- Rerank them with a cross-encoder.
- Ask a local Ollama chat model to answer strictly from retrieved context.
- Detect a PDB code in the answer and render a Mol* structure viewer when possible.
- Optionally synthesize the answer with Piper and play it in the browser.
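The PDB detection step in the list above can be pictured with a small regular-expression sketch. This is an illustration only, not the code in app.py: the name `find_pdb_code` and the rule for skipping all-digit matches are assumptions made for this example. It relies on the classic PDB identifier format of four characters, a digit from 1 to 9 followed by three alphanumeric characters.

```python
import re
from typing import Optional

# Classic PDB identifiers are four characters: a digit 1-9 followed by
# three alphanumeric characters (for example 1TUP or 6VXX).
PDB_ID_PATTERN = re.compile(r"\b([1-9][A-Za-z0-9]{3})\b")


def find_pdb_code(answer: str) -> Optional[str]:
    """Return the first PDB-like identifier in the text, upper-cased, or None."""
    for match in PDB_ID_PATTERN.finditer(answer):
        candidate = match.group(1)
        # Heuristic (assumption): skip plain four-digit numbers such as years,
        # which also fit the pattern but are rarely structure identifiers.
        if candidate.isdigit():
            continue
        return candidate.upper()
    return None
```

A pattern this loose can still produce false positives on tokens that merely look like identifiers, so a real implementation would likely confirm the candidate before rendering a viewer.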
flowchart TD
User[User] --> App[Streamlit app]
App --> Upload[PDF upload]
Upload --> Parse[PyMuPDFLoader]
Parse --> Split[RecursiveCharacterTextSplitter]
Split --> Embed[Ollama embeddings]
Embed --> Chroma[Chroma service]
User --> Ask[Question]
Ask --> App
App --> Retrieve[Retrieve from Chroma]
Retrieve --> Rerank[Cross-encoder reranking]
Rerank --> Chat[Ollama chat]
Chat --> Answer[Streamed answer]
Answer --> PDB[PDB extraction]
PDB --> MolStar[Mol* structure viewer]
Answer --> Piper[Piper TTS if available]
flowchart LR
App[app]
Ollama[ollama]
Chroma[chroma]
Redis[redis]
OTel[otel-collector]
Zipkin[zipkin]
Exporter[ollama-exporter]
Prom[prometheus]
Grafana[grafana]
CAdvisor[cadvisor]
App --> Ollama
App --> Chroma
App -. future .-> Redis
Chroma --> OTel
OTel --> Zipkin
OTel --> Prom
Ollama --> Exporter
Exporter --> Prom
CAdvisor --> Prom
Prom --> Grafana
- Multi-PDF upload from the Streamlit UI
- Dedicated Chroma service over HTTP
- Local/self-hosted Ollama embedding and chat inference
- Cross-encoder reranking for better retrieval quality
- Streamed answer rendering in the UI
- PDB extraction from the answer text
- Mol* 3D structure viewer integration
- Optional text-to-speech through Piper with graceful fallback
- Chroma observability through OTEL Collector and Zipkin
- Ollama monitoring through exporter, Prometheus, Grafana, and cAdvisor
- Docker Compose stack with Redis provisioned for future use
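To illustrate the retrieve-then-rerank idea from the feature list, here is a minimal sketch with a pluggable scoring function. It does not use a real cross-encoder: `token_overlap_score` is a toy stand-in chosen so the example runs without any model download, and both function names are hypothetical. In the application the score would come from a Sentence Transformers cross-encoder instead.

```python
from typing import Callable, List, Tuple


def token_overlap_score(question: str, chunk: str) -> float:
    """Toy relevance score: fraction of question tokens that appear in the chunk."""
    q_tokens = set(question.lower().split())
    c_tokens = set(chunk.lower().split())
    if not q_tokens:
        return 0.0
    return len(q_tokens & c_tokens) / len(q_tokens)


def rerank(
    question: str,
    chunks: List[str],
    score: Callable[[str, str], float] = token_overlap_score,
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Score every (question, chunk) pair and keep the top_k highest-scoring chunks."""
    scored = [(score(question, chunk), chunk) for chunk in chunks]
    # list.sort() is stable, so ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Passing the scorer as a parameter keeps the ranking logic independent of the model, which makes it easy to swap in a stronger scorer later.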
.
├── AGENT.md
├── ARCHITECTURE.md
├── CONTRIBUTING.md
├── DESIGN.md
├── Dockerfile
├── OPERATIONS.md
├── ROADMAP.md
├── README.md
├── SECURITY.md
├── USERS.md
├── app.py
├── docker-compose.yml
├── grafana/
├── otel-collector-config.yaml
├── prometheus/
├── public/
├── requirements.txt
├── scripts/
│   └── ollama-entrypoint.sh
└── examples/
The application is still mostly implemented as a single Streamlit entrypoint in app.py.
Key runtime helpers in the current code:
- process_document: PDF loading and chunking
- get_chroma_client: Chroma client selection
- get_vector_collection: collection bootstrap with Ollama embeddings
- add_to_vector_collection: upsert chunks
- query_collection: retrieval
- re_rank_cross_encoders: reranking
- call_llm: Ollama chat generation
- search_pdb_code: PDB extraction
- get_piper_status / synthesize_speech: optional text-to-speech
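The chunking half of process_document can be pictured with a simplified sliding-window splitter. This is a sketch under stated assumptions, not the project's code: the application uses RecursiveCharacterTextSplitter, which also tries to split on separators such as paragraphs and sentences, while the hypothetical `chunk_text` below only cuts fixed-size character windows. The default sizes are illustrative.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 400, overlap: int = 100) -> List[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.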
- Python 3.12
- Streamlit
- Ollama
- ChromaDB
- LangChain document loading and text splitting utilities
- Sentence Transformers cross-encoder reranking
- PyMuPDF
- Piper
- Docker Compose
- Redis
- OpenTelemetry Collector
- Zipkin
- Prometheus
- Grafana
- cAdvisor
- Python 3.12 recommended
- pip
- Ollama installed and running locally
- Docker
- Docker Compose
Project runtime configuration is stored in .env.
Current variables:
STREAMLIT_HOST_PORT=8501
OLLAMA_HOST_PORT=11434
REDIS_HOST_PORT=6379
CHROMA_HOST_PORT=8000
ZIPKIN_HOST_PORT=9411
PROMETHEUS_HOST_PORT=9090
GRAFANA_HOST_PORT=3000
OLLAMA_EXPORTER_HOST_PORT=8001
CADVISOR_HOST_PORT=8081
OLLAMA_BASE_URL=http://ollama:11434
HOST_OLLAMA_BASE_URL=http://host.docker.internal:11434
HOST_OLLAMA_EXPORTER_HOST=host.docker.internal:11434
OLLAMA_CHAT_MODEL=qwen2.5:3b
OLLAMA_EMBED_MODEL=nomic-embed-text
OLLAMA_HEALTHCHECK_EMBED_MODEL=nomic-embed-text
OLLAMA_MODELS=qwen2.5:3b nomic-embed-text
CHROMA_CLIENT_MODE=http
CHROMA_HOST=chroma
CHROMA_PORT=8000
CHROMA_SSL=false
CHROMA_OPEN_TELEMETRY__SERVICE_NAME=chroma
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=admin
Notes:
- OLLAMA_BASE_URL points to the Compose service hostname when running in containers.
- HOST_OLLAMA_BASE_URL is used by docker-compose.host-ollama.yml when the app should call an Ollama service running on the Docker host.
- HOST_OLLAMA_EXPORTER_HOST is the matching host Ollama address for the optional exporter.
- OLLAMA_MODELS is used by the Ollama startup script to preload the required models.
- CHROMA_CLIENT_MODE=http is the intended Docker Compose mode.
- Redis is provisioned but not yet used by the Python code.
- Piper is optional and degrades to text-only mode when unavailable.
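As an illustration of how a Python process might consume these variables, the sketch below reads a subset of them into a typed settings object. The `Settings` class and `load_settings` function are hypothetical and are not taken from app.py; the variable names and fallback values mirror the .env listing above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    ollama_base_url: str
    chat_model: str
    embed_model: str
    chroma_client_mode: str
    chroma_host: str
    chroma_port: int
    chroma_ssl: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://ollama:11434"),
        chat_model=env.get("OLLAMA_CHAT_MODEL", "qwen2.5:3b"),
        embed_model=env.get("OLLAMA_EMBED_MODEL", "nomic-embed-text"),
        chroma_client_mode=env.get("CHROMA_CLIENT_MODE", "http"),
        chroma_host=env.get("CHROMA_HOST", "chroma"),
        # Environment values are strings, so numeric and boolean fields are parsed here.
        chroma_port=int(env.get("CHROMA_PORT", "8000")),
        chroma_ssl=env.get("CHROMA_SSL", "false").strip().lower() in {"1", "true", "yes"},
    )
```

Accepting an explicit mapping instead of always reading `os.environ` makes the function easy to exercise without touching the real environment.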
As of April 7, 2026, these Ollama model names are available in the official library and are reasonable options if you want more capability without jumping straight to very large local models.
- Chat: llama3.2:1b
- Embeddings: all-minilm
Why:
- llama3.2:1b is the smallest current Llama 3.2 text model in the Ollama library.
- all-minilm is a very small embedding model at about 46 MB.
Tradeoff:
- Fastest and lightest option
- Lowest answer quality of the recommended set
- Chat: llama3.2:3b
- Embeddings: nomic-embed-text
Why:
- llama3.2:3b is still relatively small at about 2.0 GB and is a solid default chat model.
- nomic-embed-text is compact at about 274 MB and has a stronger retrieval profile than very small embedding models.
Tradeoff:
- Good balance of quality and resource usage
- This remains the safest default recommendation for this project
- Chat: qwen2.5:3b
- Embeddings: nomic-embed-text
Why:
- qwen2.5:3b is still in the small-model range and is generally a stronger reasoning/instruction-following step up than the most lightweight options.
- nomic-embed-text remains a good retrieval fit for this app.
Tradeoff:
- Better quality than the lighter pairings
- Heavier than llama3.2:3b
- Chat: qwen2.5:7b
- Embeddings: all-minilm or nomic-embed-text
Why:
- qwen2.5:7b is a meaningful quality step up if your laptop can tolerate it.
- Use all-minilm if memory pressure is tight.
- Use nomic-embed-text if retrieval quality matters more than minimizing memory.
Tradeoff:
- This is no longer a universally safe low-end choice
- Treat this as borderline for low-end hardware, especially when RAM is limited
If chat speed is acceptable but retrieval quality is still your bottleneck, a reasonable heavier embedding upgrade is:
- Embeddings: mxbai-embed-large
Tradeoff:
- better retrieval potential than the smallest embedding models
- materially heavier than all-minilm and nomic-embed-text
If you are unsure, use:
OLLAMA_CHAT_MODEL=llama3.2:3b
OLLAMA_EMBED_MODEL=nomic-embed-text
OLLAMA_HEALTHCHECK_EMBED_MODEL=nomic-embed-text
OLLAMA_MODELS=llama3.2:3b nomic-embed-text
If your laptop struggles, drop to:
OLLAMA_CHAT_MODEL=llama3.2:1b
OLLAMA_EMBED_MODEL=all-minilm
OLLAMA_HEALTHCHECK_EMBED_MODEL=all-minilm
OLLAMA_MODELS=llama3.2:1b all-minilm
If you want a stronger model and your machine can handle it, try:
OLLAMA_CHAT_MODEL=qwen2.5:3b
OLLAMA_EMBED_MODEL=nomic-embed-text
OLLAMA_HEALTHCHECK_EMBED_MODEL=nomic-embed-text
OLLAMA_MODELS=qwen2.5:3b nomic-embed-text
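The three configurations above differ only in the chat and embedding model names, so they can be generated from a small table. The sketch below is illustrative: the preset labels (`light`, `default`, `stronger`) and the `env_lines` helper are hypothetical names, while the model pairings are the ones recommended in this section.

```python
from typing import Dict

# Pairings taken from the recommendations above; the preset names are hypothetical labels.
MODEL_PRESETS: Dict[str, Dict[str, str]] = {
    "light": {"chat": "llama3.2:1b", "embed": "all-minilm"},
    "default": {"chat": "llama3.2:3b", "embed": "nomic-embed-text"},
    "stronger": {"chat": "qwen2.5:3b", "embed": "nomic-embed-text"},
}


def env_lines(preset: str) -> str:
    """Render the four model-related .env lines for a named preset."""
    models = MODEL_PRESETS[preset]
    chat, embed = models["chat"], models["embed"]
    return "\n".join([
        f"OLLAMA_CHAT_MODEL={chat}",
        f"OLLAMA_EMBED_MODEL={embed}",
        # The health check uses the same embedding model as the app.
        f"OLLAMA_HEALTHCHECK_EMBED_MODEL={embed}",
        # OLLAMA_MODELS is a space-separated list of models to preload.
        f"OLLAMA_MODELS={chat} {embed}",
    ])
```

Deriving all four lines from one pair avoids the easy mistake of changing the chat model without also updating OLLAMA_MODELS.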
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
ollama pull qwen2.5:3b
ollama pull nomic-embed-text
If you run locally without Docker, make sure Ollama is available at:
http://localhost:11434
If needed, export local runtime variables before starting Streamlit:
export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_CHAT_MODEL=qwen2.5:3b
export OLLAMA_EMBED_MODEL=nomic-embed-text
streamlit run app.py
By default:
http://localhost:8501
Adjust ports, models, Grafana credentials, or related service configuration if needed.
The Compose file uses profiles:
- full: starts the complete stack with Streamlit, Chroma, Ollama, Redis, tracing, and monitoring.
- minimal: starts only the Streamlit app, Ollama, and Redis.
For the complete stack, run:
docker compose --profile full up --build
For the minimal stack, run:
CHROMA_CLIENT_MODE=persistent docker compose --profile minimal up --build
To use an Ollama service already running on the host instead of the bundled Compose service, first make sure the required models are available on the host:
ollama pull qwen2.5:3b
ollama pull nomic-embed-text
Then start Compose with the host override:
docker compose -f docker-compose.yml -f docker-compose.host-ollama.yml --profile minimal up --build
The override makes the app call HOST_OLLAMA_BASE_URL, which defaults to http://host.docker.internal:11434, and keeps the bundled ollama container out of the active profile. On Linux hosts, Ollama may need to listen on the Docker host gateway instead of only loopback, for example by starting Ollama with OLLAMA_HOST=0.0.0.0:11434; only do this on a trusted local network.
With the default .env values:
- Streamlit app: http://localhost:8501
- Chroma API: http://localhost:8000
- Ollama API: http://localhost:11434
- Redis: localhost:6379
- Zipkin tracing UI: http://localhost:9411
- Ollama Exporter metrics: http://localhost:8001/metrics
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000
- cAdvisor: http://localhost:8081
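Because every published port comes from a `*_HOST_PORT` variable, the endpoint list can be derived from the environment rather than hardcoded. The following sketch shows one way to do that and to probe each HTTP service. The helper names `service_urls` and `is_reachable` are hypothetical; the variable names, default ports, the Chroma heartbeat path, and the exporter `/metrics` path come from this document, and the root path used for the other services is an assumption. Redis is omitted because it does not speak HTTP.

```python
import urllib.error
import urllib.request
from typing import Dict, Mapping, Tuple

# (environment variable, default port, path to request) for each HTTP service.
HTTP_SERVICES: Dict[str, Tuple[str, int, str]] = {
    "streamlit": ("STREAMLIT_HOST_PORT", 8501, "/"),
    "chroma": ("CHROMA_HOST_PORT", 8000, "/api/v2/heartbeat"),
    "ollama": ("OLLAMA_HOST_PORT", 11434, "/"),
    "zipkin": ("ZIPKIN_HOST_PORT", 9411, "/"),
    "ollama-exporter": ("OLLAMA_EXPORTER_HOST_PORT", 8001, "/metrics"),
    "prometheus": ("PROMETHEUS_HOST_PORT", 9090, "/"),
    "grafana": ("GRAFANA_HOST_PORT", 3000, "/"),
    "cadvisor": ("CADVISOR_HOST_PORT", 8081, "/"),
}


def service_urls(env: Mapping[str, str], host: str = "localhost") -> Dict[str, str]:
    """Map each service name to the URL it is published on for the given environment."""
    urls: Dict[str, str] = {}
    for name, (var, default_port, path) in HTTP_SERVICES.items():
        port = int(env.get(var, default_port))
        urls[name] = f"http://{host}:{port}{path}"
    return urls


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with any HTTP response before the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        return False
```

After the stack is up, looping over `service_urls(os.environ).items()` and calling `is_reachable` on each URL gives a quick picture of which services are actually listening.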
Important:
- The first Ollama startup may take time because required models are downloaded automatically.
- The app service waits for Ollama and Redis health checks before starting.
- The minimal profile uses the app's embedded persistent Chroma store and does not start the dedicated Chroma, OpenTelemetry, Zipkin, Prometheus, Grafana, cAdvisor, or Ollama exporter containers.
- Chroma traces are exported through OpenTelemetry Collector to Zipkin.
- Piper installation happens inside the app image on supported architectures.
docker compose down
To also remove named volumes:
docker compose down -v

- Builds from Dockerfile
- Runs streamlit run app.py
- Connects to Chroma over HTTP
- Contains optional Piper text-to-speech support

- Uses the official chromadb/chroma image
- Stores Chroma data in the chroma_data volume
- Exports OpenTelemetry traces to the collector

- Uses otel/opentelemetry-collector-contrib
- Receives OTLP traces from Chroma
- Exports traces to Zipkin and logs them through the debug exporter

- Uses openzipkin/zipkin
- Provides the tracing UI for Chroma observability

- Uses the unofficial community exporter lucabecker42/ollama-exporter
- Scrapes Ollama and exposes Prometheus metrics at /metrics
- Provides Ollama-specific metrics such as version info, model inventory, running models, VRAM usage, and scrape stats

- Scrapes the Ollama exporter
- Scrapes cAdvisor for container CPU, RAM, and accelerator metrics
- Stores time-series metrics locally
- Provides the query backend for dashboards and alerting

- Uses the Docker Hub mirror litetex/ghcr.google.cadvisor
- Exposes container-level CPU and memory metrics for Docker services
- Can expose accelerator metrics when GPU support is available on the host

- Connects to Prometheus as the default data source
- Auto-provisions a starter Ollama Overview dashboard
- Provides the UI for Ollama monitoring

- Uses the official ollama/ollama image
- Runs scripts/ollama-entrypoint.sh
- Pulls the required chat and embedding models automatically
- Persists model data in a named Docker volume

- Uses redis:7-alpine
- Persists Redis data in a named Docker volume
- Present for future caching and memory features
Data persisted by the stack:
- app_data: embedded Chroma data used by the minimal Docker Compose profile
- chroma_data: Chroma database contents
- ollama_data: downloaded Ollama models
- redis_data: Redis state
- prometheus_data: Prometheus time-series data
- grafana_data: Grafana state
The Docker stack includes the Chroma observability pattern described in the Chroma Docker guide:
- Chroma emits OpenTelemetry traces
- Chroma can also emit OTLP metrics through the collector
- OpenTelemetry Collector receives them over OTLP
- Zipkin stores and visualizes the resulting traces
- Prometheus scrapes the collector's Prometheus metrics endpoint
Once the stack is running, open:
http://localhost:9411
Zipkin will start empty until requests hit Chroma. To generate a quick sample trace, call:
curl http://localhost:8000/api/v2/heartbeat
Then use the Zipkin UI and click Run Query.
If you see an error like:
unknown service opentelemetry.proto.collector.metrics.v1.MetricsService
it means Chroma is trying to export OTLP metrics but the OpenTelemetry Collector was configured only for traces. The included collector config in this repository now defines both:
- a traces pipeline for Zipkin
- a metrics pipeline with a Prometheus exporter on otel-collector:8889
The Docker stack now includes a full Ollama monitoring path:
- community Ollama exporter
- Prometheus
- Grafana
Endpoints:
- Ollama exporter metrics: http://localhost:8001/metrics
- Prometheus UI: http://localhost:9090
- Grafana UI: http://localhost:3000
Grafana credentials come from .env:
GRAFANA_ADMIN_USER
GRAFANA_ADMIN_PASSWORD
The included Grafana provisioning automatically:
- creates Prometheus as the default data source
- loads a starter dashboard named Ollama Overview
The dashboard now includes gauge panels for:
- Ollama CPU usage
- Ollama RAM usage
- Ollama GPU memory usage
GPU note:
- the GPU gauge depends on accelerator metrics being available from cAdvisor
- on systems without exposed GPU metrics, that gauge will stay at 0
RAGU supports optional text-to-speech with Piper.
Current behavior:
- if Piper is installed and the configured voice model exists, the app renders a browser audio player after generating an answer
- if Piper is unavailable, the app shows a visible informational message instead of failing silently
- if Piper synthesis fails, the text answer still works and the user sees a warning
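The graceful-fallback behavior described above amounts to checking prerequisites up front and returning a user-facing message instead of raising. A minimal sketch of such a check follows. It is not the project's get_piper_status implementation: the function name `piper_status`, the binary name, and the voice model path are all assumptions made for illustration.

```python
import shutil
from pathlib import Path
from typing import Tuple


def piper_status(
    binary: str = "piper",
    voice_model: str = "voices/en_US-voice.onnx",  # hypothetical path
) -> Tuple[bool, str]:
    """Report whether text-to-speech can run, with a message suitable for the UI."""
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(binary) is None:
        return False, f"Piper binary '{binary}' not found; running in text-only mode."
    if not Path(voice_model).is_file():
        return False, f"Voice model '{voice_model}' not found; running in text-only mode."
    return True, "Piper is available."
```

Returning a status and a message, rather than raising, lets the caller decide whether to show an informational notice or render the audio player, which matches the behavior listed above.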
Docker behavior:
- the app image attempts to install Piper for amd64
- the app image also attempts an arm64 install path using the aarch64 Piper release asset
- if Piper cannot be installed for the current architecture, the app remains text-only
Important:
- the app no longer uses hardcoded local Windows paths for Piper
- TTS playback happens in the browser via st.audio, not on the server/container speakers
The Ollama exporter used here is unofficial and community-maintained.
Example PDFs are available in examples/ for quick manual testing.
Suggested manual flow:
- Start the application.
- Upload one or more PDFs from examples/.
- Click Process.
- Ask a structural biology question.
- Inspect the retrieved chunks and reranked context in the UI expanders.
- The application is still implemented as a single-file Streamlit app.
- Redis is provisioned but not yet integrated into the Python runtime.
- There is no automated test suite yet.
- Error handling for dependency failures can still be improved.
See CONTRIBUTING.md for architecture and technical decisions.
If you are an AI agent modifying the repository, read AGENT.md first.
Additional project documentation:
The project was originally inspired by: