OpenAI-compatible API server for Rockchip NPU (RK3588/RK3576) running RKLLM models, designed as a drop-in backend for Open WebUI.
Built for single-board computers like the Orange Pi 5 Plus, this server bridges the gap between the librkllmrt.so C runtime and any OpenAI-compatible frontend — enabling local, private LLM inference on ARM hardware with zero cloud dependency.
- Features
- Architecture
- Requirements
- Installation
- Pre-Built Models
- Model Setup
- Running the Server
- API Endpoints
- Open WebUI Configuration
- Home Assistant Integration
- SearXNG Configuration
- RAG Pipeline
- Reasoning Models
- KV Cache Strategy
- Configuration Reference
- Logging
- Security
- Troubleshooting
- Testing
- File Structure
- V1 (Subprocess) vs V2 (ctypes) — Why We Migrated
- Tested Hardware
- Tested Models
- VL Model Evaluation
- Vision Encoder Resolution Comparison
- Re-Exporting VL Models at Higher Resolution
- Benchmarks
- Git Tags & Branches
- License
- Acknowledgements
- OpenAI-compatible API —
/v1/chat/completions,/v1/modelsendpoints work with any OpenAI client - Direct NPU access via ctypes binding to
librkllmrt.so(no subprocess overhead) - KV cache incremental mode — follow-up turns only prefill the new message (~50ms vs ~500ms)
- Prompt cache preloading — saves KV state to disk after first inference; subsequent model loads restore it instantly, skipping system prompt re-prefill
- Model-aware sampling profiles — per-family tuned sampling parameters (Qwen3, Gemma, Phi, DeepSeek) with
model_config.jsonoverride support - Context-aware sliding window — automatically trims oldest conversation turns when history exceeds context length, keeping the most recent exchange intact
- Context overflow hard-rejection — if the built prompt exceeds 110% of the model's context length (after sliding window), the request is rejected with a
400 context_length_exceedederror instead of sending an oversized prompt to the NPU runtime - Streaming & non-streaming responses with proper SSE (Server-Sent Events) format
- Auto-detection of all
.rkllmmodels in~/modelsdirectory - Context length auto-detection from filename patterns (2k/4k/8k/16k/32k)
- Auto-generated aliases — short names like
qwen,phiresolve automatically - Multi-turn conversation history — full chat context preserved via KV cache across turns
- Model hot-switching — request a different model and it loads automatically
- On-demand loading via
/v1/models/selectfor warm-up - Explicit unloading via
/v1/models/unloadto free NPU memory
- Request tracking with automatic stale-request cleanup (prevents deadlocks)
- Idle auto-unload — frees NPU memory after configurable idle period (default 5 min)
- Clean abort — native
rkllm_abort()for instant cancellation (no SIGKILL needed) - Graceful shutdown on SIGTERM/SIGINT with model cleanup
- RLock-based locking — prevents model switch deadlock scenarios
- Repetition loop detection — aborts generation when the model enters paragraph-level repetition loops (configurable window/threshold)
- SSE heartbeats during prefill — sends keep-alive comments every 15s during long prefill to prevent HTTP proxy/client timeouts
- Error callback state — detects C library errors and surfaces them as proper HTTP responses
- Automatic RAG detection when Open WebUI injects web search or document results
- Document/PDF RAG — works with Open WebUI's document upload and embedding pipeline
- Summarization detection — detects "summarize" queries and adds stronger multi-paragraph instructions
- Smart prompt restructuring — reading comprehension format optimized for small models
- 5-pass web content cleaning — strips navigation, boilerplate, cookie banners, stale dates
- Score-based paragraph selection — jusText-inspired content quality scoring
- Near-duplicate removal — Jaccard similarity deduplication across sources
- Quality floor — drops irrelevant search results instead of confusing the model
- Follow-up detection — 3-layer system prevents RAG on conversational replies
- Response caching — LRU cache eliminates redundant inference for repeated questions
- Context-dependent thinking — disables reasoning on small context models to save tokens
- Auto-capability detection — infers model type (thinking, instruct, VL, OCR) from folder names; only reasoning models get
<think>enabled
<think>tag parsing for Qwen3 and similar reasoning modelsreasoning_contentfield in both streaming deltas and non-streaming responses- Streaming state machine handles tags split across output chunks
- Thinking block stripping —
<think>...</think>blocks are automatically stripped from assistant history before re-sending to the model (per Qwen3 docs: "historical output should only include the final output") - Open WebUI integration — reasoning appears as collapsible thinking blocks
- Query generation shortcircuit — Open WebUI asks the model to generate search queries for retrieval; instead of wasting 5s of inference, the server extracts the user's actual question from the chat history and returns it as the query instantly (~0ms). For vague follow-ups ("can you verify that?"), it enriches the query with entities extracted from the assistant's previous response (bold text, quoted strings, capitalized phrases)
- Title generation shortcircuit — extracts the first user message as the chat title (~0ms instead of 5-10s inference)
- Tag generation shortcircuit — returns a default tag instantly (~0ms instead of 5-10s inference)
- Meta-task thinking disabled — auto-detects Open WebUI internal tasks (query gen, title gen, tags, autocomplete) and disables
<think>reasoning to avoid wasting 20+ seconds on trivial tasks - No JSON leakage — query generation shortcircuit prevents raw JSON from appearing in the chat display
- Auto-detection — recognizes Home Assistant requests by system prompt signatures (
smart home manager,Available Devices:,execute_services) - Thinking auto-disabled — skips
<think>reasoning for HA requests, cutting response time in half - Compatible with Extended OpenAI Conversation (HACS) — works as a drop-in conversation agent for Assist
- Prometheus metrics (optional) —
rkllm_tokens_generated,rkllm_prefill_duration,rkllm_decode_duration,rkllm_tokens_per_request,rkllm_queue_wait_seconds,rkllm_active_requests,rkllm_model_load_seconds,rkllm_current_model— exposed at/metrics - Graceful degradation — metrics are disabled automatically if
prometheus-flask-exporteris not installed
stream_options.include_usage— streaming token counts per OpenAI specsystem_fingerprintin all responsesmax_tokens/max_completion_tokenssupport- Request body size limit (50 MB)
- Proper error responses matching OpenAI error format
- Dual-model architecture — text model (e.g. Qwen3-1.7B) + VL model (e.g. Qwen3-VL-2B) loaded simultaneously
- Automatic image routing — requests with images route to VL model, text-only to text model
- Base64 image support — accepts
image_urlwithdata:image/...;base64,...format (Open WebUI compatible) - Direct NPU vision encoding via ctypes binding to
librknnrt.so(no Python RKNN toolkit needed) - Image preprocessing — auto square-pad (128,128,128 background) and resize to encoder input size
- Multiple VL model support — auto-detects
.rknnvision encoder alongside.rkllmdecoder - Configurable special tokens —
VL_MODEL_CONFIGSmaps model families to their image tokens - Multi-turn VL context — follow-up questions, RAG/web search data, and conversation history are included in VL prompts (not just the original image caption)
- Seamless Open WebUI experience — paste/upload images in chat, responses stream normally
┌──────────────┐ HTTP/SSE ┌──────────────────────────────────┐
│ Open WebUI │ ◄──────────────── │ api.py (Flask + gunicorn) │
│ or any │ ─────────────────►│ gthread, -w 1 │
│ OpenAI │ │ │
│ client │ │ ┌──────────────────────────┐ │
└──────────────┘ │ │ VL Auto-Router │ │
│ │ (image → VL, text → LLM)│ │
┌──────────┐ │ └────┬─────────────┬───────┘ │
│ SearXNG │ ◄──── Open │ │ text │ image │
│ (search) │ WebUI injects│ ▼ ▼ │
└──────────┘ results │ ┌─────────┐ ┌─────────────┐ │
│ │ Prompt │ │ Vision Enc. │ │
┌──────────┐ │ │ Builder │ │ librknnrt.so│ │
│ Ollama │ │ │ + RAG │ │ (.rknn NPU) │ │
│ (CPU) │ │ └────┬────┘ └──────┬──────┘ │
└──────────┘ │ │ │ │
optional │ ▼ ▼ │
│ ┌──────────────────────────┐ │
│ │ librkllmrt.so v1.2.3 │ │
│ │ Text: RKLLMWrapper │ │
│ │ VL: RKLLMWrapper #2 │ │
│ │ C callback → Queue │ │
│ └────────────┬─────────────┘ │
│ │ │
│ ┌────────────▼─────────────┐ │
│ │ RK3588 NPU (3 cores) │ │
│ │ 6 TOPS per core │ │
│ └──────────────────────────┘ │
│ │
│ ┌──────────────────────────┐ │
│ │ ThinkTagParser │ │
│ │ (reasoning_content) │ │
│ └──────────────────────────┘ │
└──────────────────────────────────┘
Key design decisions:
-
Plain text only — The rkllm runtime applies chat templates internally using actual token IDs. Special tokens (
<|im_start|>,<start_of_turn>, etc.) are stripped from the text vocabulary during model conversion. Sending them as literal text causes the model to see garbage. -
Single worker — The NPU can only load one model at a time. The server enforces
-w 1(one gunicorn worker) and rejects concurrent generation with HTTP 503. -
ctypes + callback — The C library's
rkllm_run()is blocking, so it runs in a worker thread. A C callback pushes tokens to aqueue.Queue, which the main thread reads and yields as SSE chunks. This keeps the KV cache in-process across turns. -
gthread, not gevent —
rkllm_run()is a blocking C function that freezes gevent's event loop. Using-k gthreadwith real OS threads avoids this. -
Dual-model VL — Text and VL models are loaded simultaneously into separate
RKLLMWrapperinstances. The vision encoder runs on a third ctypes binding (librknnrt.so). Image requests are auto-routed to the VL pipeline; text requests use the primary model. A shared_token_queueserialized byPROCESS_LOCKprevents interleaving. Default VL model: Qwen3-VL-2B (replaced DeepSeekOCR-3B — smaller, saves ~1.1 GB RAM).
This project was developed and tested on:
| Component | Details |
|---|---|
| Board | Orange Pi 5 Plus (16 GB RAM) |
| SoC | Rockchip RK3588 (3 NPU cores) |
| OS | Armbian Pelochus 24.11.0 — Armbian-Pelochus_24.11.0-OrangePi5-plus_jammy_vendor.7z |
| Kernel NPU Driver | 0.9.8 (included in the Pelochus image — no driver build required) |
| RKLLM Runtime | v1.2.3 (only the runtime library needs to be installed) |
Why Pelochus Armbian? The standard Armbian images ship with an older RKNPU driver (0.9.6 or earlier). The Pelochus builds bundle RKNPU driver 0.9.8 in the kernel, so you only need to install the RKLLM runtime — no kernel module compilation required.
- Rockchip RK3588 or RK3576 SBC (Orange Pi 5 Plus, Rock 5B, etc.)
- NPU driver installed and functional
- Minimum 8 GB RAM recommended (16 GB for larger models)
- Linux (ARM64) — tested on Ubuntu/Debian (Armbian)
- Python 3.8+
- RKNPU driver ≥ 0.9.6 (0.9.8 recommended — see Installation)
- RKLLM Runtime ≥ v1.2.0 (tested with v1.2.3) —
librkllmrt.soshared library (see Installation) - RKLLM models (
.rkllmformat) placed in~/models/ - RKNN Runtime (optional) —
librknnrt.soshared library (only needed for VL/multimodal models with.rknnvision encoders)
SDK Version Coupling: The ctypes struct definitions in
api.pytarget the RKLLM SDK v1.2.x C header (rkllm.h). Older SDK versions used a flat 112-byte reserved blob inRKLLMExtendParamand lacked fields liken_keep,n_batch,use_cross_attn, andenable_thinking. Running this server against an olderlibrkllmrt.so(pre-1.2) will cause silent struct-offset misalignment — the parameter block passed torkllm_init()would be corrupted, producing wrong sampling behaviour rather than a crash. Always use the runtime from the v1.2.x release or later.
# Core (required)
pip install flask flask-cors gunicorn
# Prometheus monitoring (optional — metrics at /metrics endpoint)
pip install prometheus-flask-exporter prometheus-client
# VL / multimodal support (optional — needed only for vision-language models)
pip install numpy PillowA zero-configuration setup script is included that handles everything — system packages, Python venv, RKLLM runtime installation, kernel module/driver verification, udev rules, systemd service, and NPU frequency fix:
git clone https://github.com/GatekeeperZA/RKLLM-API-Server.git
cd RKLLM-API-Server
chmod +x setup.sh
./setup.shDo NOT run as root. The script uses
sudointernally only where needed (installing system packages, copying libraries, creating the systemd service). User-level files (venv, models directory) are owned by your normal account.
The script is idempotent — safe to run multiple times. It detects what's already installed and skips those steps.
What it installs / verifies:
- System packages:
python3,python3-venv,python3-pip,build-essential,git,git-lfs - RKNPU kernel module check (
lsmod,modinfo,/dev/rknpu, udev rules,rendergroup) - RKLLM Runtime:
librkllmrt.so→/usr/lib/ - Python venv (
.venv) withflask,flask-cors,gunicorn - Systemd services:
rkllm-api(API server) +fix-freq(NPU/CPU frequency governor)
After setup, download a model and start the service:
# Download Qwen3-1.7B (recommended)
mkdir -p ~/models/Qwen3-1.7B && cd ~/models/Qwen3-1.7B
git lfs install && git clone https://huggingface.co/GatekeeperZA/Qwen3-1.7B-RKLLM-v1.2.3 .
# Start the server
sudo systemctl start rkllm-api
# Check status
sudo systemctl status rkllm-api
curl http://localhost:8000/v1/modelsClick to expand manual step-by-step instructions
git clone https://github.com/GatekeeperZA/RKLLM-API-Server.git
cd RKLLM-API-Server
# Install Python dependencies
pip install flask flask-cors gunicorn
# Create models directory
mkdir -p ~/modelsThe RKNPU kernel driver enables communication with the NPU hardware. Some board images ship with an older driver — you need ≥ 0.9.6 (0.9.8 recommended).
Check your current driver version:
dmesg | grep -i rknpu
# Look for a line like: "RKNPU driver loaded version 0.9.8"
# or:
cat /sys/kernel/debug/rknpu/version 2>/dev/null || echo "Check dmesg"If you need to update:
The driver source is included in the rknn-llm repository as a pre-built tarball. It must be compiled against your running kernel's headers.
# Clone the rknn-llm repo (if not already done)
git clone https://github.com/airockchip/rknn-llm.git
cd rknn-llm/rknpu-driver
# Extract the driver source
tar xjf rknpu_driver_0.9.8_20241009.tar.bz2
cd rknpu_driver_0.9.8
# Install kernel headers (required for compilation)
sudo apt update
sudo apt install -y linux-headers-$(uname -r) build-essential
# Build the driver module
make -C /lib/modules/$(uname -r)/build M=$(pwd)/drivers/rknpu modules
# Install the new driver
sudo cp drivers/rknpu/rknpu.ko /lib/modules/$(uname -r)/kernel/drivers/rknpu/
sudo depmod -a
# Load the new driver (or reboot)
sudo modprobe -r rknpu 2>/dev/null # unload old
sudo modprobe rknpu # load new
# Verify
dmesg | tail -5 | grep -i rknpuNote: Many Armbian and Orange Pi images already include RKNPU driver 0.9.8. Check before building. If
dmesg | grep rknpushows0.9.8, you're good.
Recommended: The Pelochus Armbian builds ship with RKNPU driver 0.9.8 pre-installed — no manual driver compilation needed. Use
Armbian-Pelochus_24.11.0-OrangePi5-plus_jammy_vendor.7z(or the latest release for your board) and skip straight to the runtime setup.
The RKLLM runtime provides the librkllmrt.so shared library that this API server loads via ctypes. The ctypes struct layouts in api.py require SDK v1.2.0 or later — see Requirements for details on version coupling.
# Clone the rknn-llm repo (if not already done)
git clone https://github.com/airockchip/rknn-llm.git
cd rknn-llm
# --- Install the runtime library ---
sudo cp rkllm-runtime/Linux/librkllm_api/aarch64/librkllmrt.so /usr/lib/
sudo ldconfig
# Verify the library is findable
ldconfig -p | grep rkllm
# Should show: librkllmrt.so => /usr/lib/librkllmrt.soVerify everything works:
# Check RKNPU driver
dmesg | grep -i rknpu
# Check runtime library
ldconfig -p | grep rkllmFor consistent performance, pin the NPU and CPU frequencies. The rknn-llm repo includes scripts for this:
cd rknn-llm/scripts
# RK3588
sudo bash fix_freq_rk3588.sh
# RK3576 (if using that platform)
sudo bash fix_freq_rk3576.shRun this after each reboot, or use the setup script which creates a systemd service for automatic frequency pinning.
Ready-to-run .rkllm models converted by the author for RK3588 NPU are available on HuggingFace:
| Model | Parameters | Quant | Context | Speed | RAM | Thinking | Link |
|---|---|---|---|---|---|---|---|
| Qwen3-1.7B | 1.7B | w8a8 | 4,096 | ~13.6 tok/s | ~2 GB | ✅ Yes | Download |
| Phi-3-mini-4k-instruct | 3.82B | w8a8 | 4,096 | ~6.8 tok/s | ~3.7 GB | ❌ No | Download |
Browse all models: huggingface.co/GatekeeperZA
All models are converted with RKLLM Toolkit v1.2.3, targeting RK3588 (3 NPU cores), and tested on an Orange Pi 5 Plus (16 GB RAM, RKNPU driver 0.9.8).
⚠️ DeepSeek-R1 on NPU — Currently Not UsableDeepSeek-R1 (including distilled variants like DeepSeek-R1-Distill-Qwen-1.5B) does not work correctly with RKLLM Runtime v1.2.3. The model converts without errors but produces garbage output — repeating
[PAD151935]tokens instead of real text (rknn-llm#424). The Airockchip team has acknowledged this is a known issue and stated it will be fixed in a future runtime version.For NPU reasoning, use Qwen3-1.7B instead — it supports
<think>tags, runs at ~13.6 tok/s on the NPU, and works reliably with RKLLM v1.2.3.If you need DeepSeek-R1, run
deepseek-r1:7bvia Ollama on CPU — it works correctly (just slower, ~2-3 tok/s on RK3588 ARM cores). See Using Ollama Alongside below.
# Install git-lfs (required for large files)
sudo apt install git-lfs
git lfs install
# Qwen3-1.7B (thinking/reasoning model — recommended)
mkdir -p ~/models/Qwen3-1.7B
cd ~/models/Qwen3-1.7B
git clone https://huggingface.co/GatekeeperZA/Qwen3-1.7B-RKLLM-v1.2.3 .
# Phi-3-mini (3.8B — strong at math/code, MIT licensed)
mkdir -p ~/models/Phi-3-mini-4k-instruct
cd ~/models/Phi-3-mini-4k-instruct
git clone https://huggingface.co/GatekeeperZA/Phi-3-mini-4k-instruct-w8a8 .Qwen3-1.7B — Hybrid thinking model. Produces <think>...</think> reasoning blocks that this API server parses into reasoning_content for Open WebUI's collapsible thinking display. Supports English and Chinese.
Phi-3-mini-4k-instruct — Microsoft's 3.8B parameter model excelling at reasoning, math (85.7% GSM8K), and code generation (57.3% HumanEval). English-primary. No thinking mode — this is a standard instruct model. MIT licensed.
Place each .rkllm model in its own subfolder under ~/models/:
~/models/
├── Qwen3-1.7B/
│ └── Qwen3-1.7B-w8a8-rk3588.rkllm
├── Qwen3-4B-Instruct-2507/
│ └── Qwen3-4B-Instruct-16k-w8a8-rk3588.rkllm
├── Gemma-3-4B-IT/
│ └── Gemma-3-4B-IT-w8a8-rk3588.rkllm
└── Phi-3-Mini-4K-Instruct/
└── Phi-3-Mini-4K-Instruct-w8a8-rk3588.rkllm
VL models require two files in the same folder: a .rkllm decoder and a .rknn vision encoder.
~/models/
├── Qwen3-1.7B/ # Text-only model
│ └── Qwen3-1.7B-w8a8-rk3588.rkllm
└── Qwen3-VL-2b/ # VL model (text + vision)
├── Qwen3-VL-2b-w8a8-rk3588.rkllm # LLM decoder
└── Qwen3-VL-2b-vision-encoder.rknn # Vision encoder
How it works:
- The server auto-detects
.rknnfiles alongside.rkllmfiles - The folder name is matched against
VL_MODEL_CONFIGS(supports DeepSeekOCR, Qwen2-VL, Qwen2.5-VL, Qwen3-VL, InternVL3, MiniCPM) - When a chat request includes an image (base64
image_url), it auto-routes to the VL model - Text-only requests continue using the text model normally
Supported VL models (model folder name must contain):
| Pattern | Model Family | Notes |
|---|---|---|
qwen3-vl |
Qwen3-VL | Recommended — best OCR, fastest |
qwen2.5-vl |
Qwen2.5-VL | Lower 392×392 encoder |
qwen2-vl |
Qwen2-VL | Lower 392×392 encoder |
internvl3 |
InternVL3 / InternVL3.5 | W8A8 precision loss, poor OCR |
deepseekocr |
DeepSeekOCR | Severe hallucination in RKNN conversion |
minicpm |
MiniCPM-V | Untested |
Requirements:
numpyandPillowPython packages (installed bysetup.sh)librknnrt.so(RKNN runtime library, usually at/usr/lib/librknnrt.so)- Sufficient RAM for both models (~4.4 GB for Qwen3-1.7B + Qwen3-VL-2B)
The server auto-detects context length from the filename or folder name:
| Pattern in name | Detected context |
|---|---|
-2k or _2k |
2,048 tokens |
-4k or _4k |
4,096 tokens |
-8k or _8k |
8,192 tokens |
-16k or _16k |
16,384 tokens |
-32k or _32k |
32,768 tokens |
| (none found) | 4,096 (default) |
The server auto-detects each model's capabilities from its folder name. This controls whether <think> reasoning is enabled and what metadata is exposed via /v1/models.
Auto-detected capabilities by model family:
| Folder pattern | Detected capabilities | Thinking |
|---|---|---|
qwen3 (not VL) |
instruct, thinking |
Yes |
deepseek*r1 / deepseek*r2 |
instruct, thinking |
Yes |
qwq |
instruct, thinking |
Yes |
deepseekocr |
instruct, ocr |
No |
qwen*vl |
instruct, vl |
No |
phi |
instruct |
No |
gemma |
instruct |
No |
llama |
instruct |
No |
mistral / mixtral |
instruct |
No |
internvl / minicpm |
instruct |
No |
(contains instruct or -it) |
instruct |
No |
| (no match) | base |
No |
Any model with a .rknn vision encoder file automatically gains the vl capability.
Override with model_config.json: Place a JSON file in the model folder to override auto-detection:
// ~/models/MyCustomModel/model_config.json
{
"capabilities": ["thinking", "instruct"]
}Effect on thinking: Only models with the thinking capability get enable_thinking=True on the RKLLM runtime. Non-thinking models (Phi-3, Gemma, etc.) always run with enable_thinking=False, preventing wasted tokens on models that don't support <think> blocks.
Model folder names are converted to IDs (lowercase, hyphens). Aliases are auto-generated:
| Model ID | Auto-Aliases |
|---|---|
qwen3-1.7b |
qwen, qwen3 |
qwen3-4b-instruct-2507 |
qwen3-4b, qwen3-4b-instruct |
gemma-3-4b-it |
gemma, gemma-3, gemma-3-4b |
phi-3-mini-4k-instruct |
phi, phi-3, phi-3-mini |
Aliases are only created when unambiguous (one model claims the alias). If two models share a prefix, that alias is skipped.
gunicorn -w 1 -k gthread --threads 4 --timeout 300 -b 0.0.0.0:8000 api:appCritical: Always use
-w 1(single worker). The NPU can only load one model at a time.Critical: Always use
-k gthread, NOT-k gevent.rkllm_run()is a blocking C call that freezes gevent's event loop.
python api.pyThis starts Flask's built-in server on 0.0.0.0:8000 with threading enabled.
The setup script creates this automatically. Manual setup:
[Unit]
Description=RKLLM API Server
After=network.target
[Service]
Type=simple
User=your-user
WorkingDirectory=/path/to/RKLLM-API-Server
ExecStart=/path/to/.venv/bin/gunicorn -w 1 -k gthread --threads 4 --timeout 300 -b 0.0.0.0:8000 api:app
Restart=always
RestartSec=5
Environment=RKLLM_API_LOG_LEVEL=INFO
[Install]
WantedBy=multi-user.target# Start/stop/restart
sudo systemctl start rkllm-api
sudo systemctl stop rkllm-api
sudo systemctl restart rkllm-api
# View logs
sudo journalctl -u rkllm-api -f
# Enable/disable auto-start on boot
sudo systemctl enable rkllm-api
sudo systemctl disable rkllm-apiList all detected models with capabilities and context length.
curl http://localhost:8000/v1/models{
"object": "list",
"data": [
{
"id": "qwen3-1.7b",
"object": "model",
"created": 1738972800,
"owned_by": "rkllm",
"capabilities": ["instruct", "thinking"],
"context_length": 4096
},
{
"id": "gemma-3-4b-it",
"object": "model",
"created": 1738972800,
"owned_by": "rkllm",
"capabilities": ["instruct"],
"context_length": 4096
},
{
"id": "qwen3-vl-2b",
"object": "model",
"created": 1738972800,
"owned_by": "rkllm",
"capabilities": ["instruct", "vl"],
"context_length": 4096
}
]
}Capability values:
| Capability | Meaning |
|---|---|
thinking |
Native <think> reasoning support (Qwen3, DeepSeek-R1, QwQ) |
instruct |
Instruction-tuned / chat model |
vl |
Vision-language model (image understanding) |
ocr |
Specialised for document OCR |
base |
Base / completion-only model (no chat template) |
OpenAI-compatible chat completions (streaming and non-streaming).
# Non-streaming
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-1.7b",
"messages": [{"role": "user", "content": "What is the capital of France?"}]
}'
# Streaming
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-1.7b",
"stream": true,
"messages": [{"role": "user", "content": "What is the capital of France?"}]
}'Supported parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
string | required | Model ID or alias |
messages |
array | required | OpenAI messages format |
stream |
bool | false |
Enable SSE streaming |
max_tokens |
int | 2048 |
Max completion tokens |
temperature |
float | (ignored) | Accepted but has no effect — rkllm uses model-compiled sampling |
top_p |
float | (ignored) | Accepted but has no effect |
frequency_penalty |
float | (ignored) | Accepted but has no effect |
presence_penalty |
float | (ignored) | Accepted but has no effect |
stream_options.include_usage |
bool | false |
Include token counts in stream |
Pre-load a model without generating (warm-up).
curl -X POST http://localhost:8000/v1/models/select \
-H "Content-Type: application/json" \
-d '{"model": "qwen3-1.7b"}'Explicitly unload the current model to free NPU memory.
curl -X POST http://localhost:8000/v1/models/unloadHealth check endpoint.
curl http://localhost:8000/health{
"status": "ok",
"current_model": "qwen3-1.7b",
"model_loaded": true,
"vl_model": {
"model": "qwen3-vl-2b",
"encoder_loaded": true,
"llm_loaded": true
},
"active_request": null,
"models_available": 4
}The
vl_modelfield isnullwhen no VL model is loaded.
Prometheus metrics endpoint (only available when prometheus-flask-exporter is installed).
curl http://localhost:8000/metricsExposes counters, histograms, and gauges for tokens generated, prefill/decode duration, tokens per request, queue wait time, active requests, model load time, and current model state.
Open WebUI runs as a Docker container on the same Orange Pi (or any machine on the network). All optimized settings are hardcoded as environment variables so they persist across container recreations.
Option A: Docker Compose (recommended)
A docker-compose.yml is included in this repo with all settings pre-configured:
# Copy docker-compose.yml to the Orange Pi and start:
docker compose up -d
# Update to latest Open WebUI image:
docker compose pull && docker compose up -d
# Update with full backup + rollback (recommended):
# Uses the automated backup script — backs up DB + uploads,
# verifies integrity, pulls latest image, health checks,
# and auto-rolls back on failure. See Backup & Update section below.
/home/armbian/scripts/openwebui_full_backup_update.sh
# Full reset (deletes all data + settings, env vars re-apply):
docker compose down -v && docker compose up -dOption B: Docker Run
Equivalent single command with all env vars:
docker run -d \
--name open-webui \
--restart always \
--add-host=host.docker.internal:host-gateway \
-p 3000:8080 \
-v open-webui:/app/backend/data \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
-e OPENAI_API_KEY=sk-unused \
-e RAG_EMBEDDING_MODEL=BAAI/bge-small-en-v1.5 \
-e RAG_RERANKING_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2 \
-e RAG_EMBEDDING_BATCH_SIZE=10 \
-e ENABLE_ASYNC_EMBEDDING=True \
-e RAG_SYSTEM_CONTEXT=True \
-e RAG_TOP_K=5 \
-e RAG_TOP_K_RERANKER=3 \
-e RAG_RELEVANCE_THRESHOLD=0.0 \
-e ENABLE_RAG_HYBRID_SEARCH=True \
-e ENABLE_RAG_HYBRID_SEARCH_ENRICHED_TEXTS=True \
-e RAG_HYBRID_BM25_WEIGHT=0.1 \
-e CHUNK_SIZE=1000 \
-e CHUNK_OVERLAP=0 \
-e CHUNK_MIN_SIZE_TARGET=400 \
-e ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER=True \
-e ENABLE_RETRIEVAL_QUERY_GENERATION=True \
-e 'RAG_TEMPLATE=### Task:
Answer the user'"'"'s question using ONLY the provided context. Be thorough and detailed.
### Guidelines:
- If the answer is in the context, provide a comprehensive response with all relevant details.
- If the context doesn'"'"'t contain the answer, say so clearly.
- Respond in the same language as the user'"'"'s query.
- Do not use XML tags in your response.
<context>
{{CONTEXT}}
</context>
<user_query>
{{QUERY}}
</user_query>' \
-e ENABLE_WEB_SEARCH=True \
-e WEB_SEARCH_ENGINE=searxng \
-e SEARXNG_QUERY_URL=http://host.docker.internal:8080/search?q=<query> \
-e WEB_SEARCH_RESULT_COUNT=5 \
-e WEB_SEARCH_CONCURRENT_REQUESTS=3 \
-e BYPASS_WEB_SEARCH_WEB_LOADER=True \
-e BYPASS_WEB_SEARCH_EMBEDDING_AND_RETRIEVAL=True \
-e FILE_IMAGE_COMPRESSION_WIDTH=672 \
-e FILE_IMAGE_COMPRESSION_HEIGHT=672 \
-e PDF_EXTRACT_IMAGES=True \
-e ENABLE_CODE_EXECUTION=False \
-e ENABLE_CODE_INTERPRETER=False \
-e ENABLE_CHANNELS=True \
-e ENABLE_MEMORIES=True \
-e ENABLE_NOTES=True \
-e ANONYMIZED_TELEMETRY=false \
-e DO_NOT_TRACK=true \
ghcr.io/open-webui/open-webui:mainEnvironment variables explained:
Connection:
| Variable | Value | Reason |
|---|---|---|
OPENAI_API_BASE_URL |
http://host.docker.internal:8000/v1 |
Auto-connects to the RKLLM API server on the host. No manual UI setup needed — models appear immediately |
OPENAI_API_KEY |
sk-unused |
The RKLLM server has no auth, but Open WebUI requires a non-empty key for OpenAI-compatible endpoints |
RAG Pipeline:
| Variable | Value | Reason |
|---|---|---|
ENABLE_RETRIEVAL_QUERY_GENERATION |
True |
Enables retrieval query generation — Open WebUI sends a query gen request, which the API server shortcircuits instantly (~0ms) with a context-enriched query instead of wasting 5-10s on inference |
RAG_SYSTEM_CONTEXT |
True |
Injects retrieved document/search content into the system message instead of user message, enabling KV prefix caching for faster follow-up turns |
RAG_EMBEDDING_MODEL |
BAAI/bge-small-en-v1.5 |
Best retrieval-quality embedding model that runs efficiently on ARM CPU (see Embedding Model below) |
RAG_RERANKING_MODEL |
cross-encoder/ms-marco-MiniLM-L-6-v2 |
Lightweight cross-encoder reranker (22M params, ~88MB RAM) — re-scores Top K results for much better precision. Open WebUI's sigmoid normalization is specifically designed for MS MARCO models |
RAG_EMBEDDING_BATCH_SIZE |
10 |
Processes 10 text chunks per embedding batch — speeds up document ingestion without excessive memory use on ARM |
ENABLE_ASYNC_EMBEDDING |
True |
Embeds documents asynchronously — prevents blocking the UI during file uploads |
ENABLE_RAG_HYBRID_SEARCH |
True |
Combines semantic (vector) + keyword (BM25) search for significantly better retrieval than vector-only |
RAG_HYBRID_BM25_WEIGHT |
0.1 |
10% keyword / 90% semantic — heavily semantic-leaning since bge-small-en-v1.5 delivers strong retrieval. Higher values dilute precision |
ENABLE_RAG_HYBRID_SEARCH_ENRICHED_TEXTS |
True |
Enriches BM25 index with document filenames, titles, and section headers — improves keyword recall for metadata-based queries |
ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER |
True |
Splits documents by Markdown headers (H1-H6) first, preserving document structure. The character splitter only runs as a secondary pass on oversized sections |
RAG_RELEVANCE_THRESHOLD |
0.0 |
Critical. Must be 0.0 — higher values filter out valid results because cross-encoder sigmoid scores are often below 0.1 for cross-lingual or loosely-related content. The reranker handles quality filtering instead |
CHUNK_SIZE |
1000 |
Maximum characters per chunk. Balanced for 4K context models — large enough for coherent passages, small enough to fit multiple chunks |
CHUNK_OVERLAP |
0 |
Zero overlap — Chroma Research showed overlap actively hurts retrieval IoU by returning redundant tokens. With Hybrid Search, overlap is unnecessary |
CHUNK_MIN_SIZE_TARGET |
400 |
Merges tiny fragments (<400 chars) with neighbors, preventing low-quality micro-chunks. Works with Markdown Header Splitter to reduce chunk count by up to 90% |
RAG_TOP_K |
5 |
Retrieves 5 candidate chunks, then reranker narrows to best 3 (Top K Reranker). Good funnel ratio for 4K context models |
RAG_TOP_K_RERANKER |
3 |
Reranker keeps top 3 from the 5 retrieved chunks — only the most relevant content reaches the model |
RAG_TEMPLATE |
(custom) | Custom reading-comprehension prompt that instructs the model to answer from context only. See docker-compose.yml for the full template |
Web Search (SearXNG):
| Variable | Value | Reason |
|---|---|---|
ENABLE_WEB_SEARCH |
True |
Enables the web search toggle in the chat UI |
WEB_SEARCH_ENGINE |
searxng |
Uses the self-hosted SearXNG instance for privacy and JSON API support |
SEARXNG_QUERY_URL |
http://host.docker.internal:8080/search?q=<query> |
SearXNG instance URL. Uses host.docker.internal to reach the host-side SearXNG container. Change if your SearXNG is on a different host or port |
WEB_SEARCH_RESULT_COUNT |
5 |
Number of search results to fetch. 5 gives good coverage — the API server's quality-floor filtering drops irrelevant results automatically |
WEB_SEARCH_CONCURRENT_REQUESTS |
3 |
Limits concurrent web search requests to 3 — prevents overwhelming SearXNG while keeping searches fast |
BYPASS_WEB_SEARCH_WEB_LOADER |
True |
Uses search engine snippets instead of scraping full pages — cleaner, faster, and more reliable for small models |
BYPASS_WEB_SEARCH_EMBEDDING_AND_RETRIEVAL |
True |
Sends search snippets directly to the model without embedding/retrieving — the API server builds its own optimized prompt internally |
File Upload / Image Compression:
| Variable | Value | Reason |
|---|---|---|
FILE_IMAGE_COMPRESSION_WIDTH |
672 |
Compresses uploaded images to 672px width — matches the active VL encoder resolution. Other options: 448 (faster, less detail), 896 (slower, more detail) |
FILE_IMAGE_COMPRESSION_HEIGHT |
672 |
Compresses uploaded images to 672px height — must match the width value. See Vision Encoder Resolution Comparison |
Document Processing:
| Variable | Value | Reason |
|---|---|---|
PDF_EXTRACT_IMAGES |
True |
Extracts text from scanned images inside PDFs using OCR (Tesseract inside the container) |
Code Execution:
| Variable | Value | Reason |
|---|---|---|
ENABLE_CODE_EXECUTION |
False |
Small NPU models (1.7B–4B) generate unreliable code — running it wastes time or produces wrong results |
ENABLE_CODE_INTERPRETER |
False |
Same reason — disable to prevent unreliable code interpretation |
Features:
| Variable | Value | Reason |
|---|---|---|
ENABLE_CHANNELS |
True |
Enables group chat channels |
ENABLE_MEMORIES |
True |
Enables persistent user memories across conversations |
ENABLE_NOTES |
True |
Enables the notes feature for saving snippets |
Privacy:
| Variable | Value | Reason |
|---|---|---|
ANONYMIZED_TELEMETRY |
false |
Disables telemetry (optional, recommended for privacy) |
DO_NOT_TRACK |
true |
Disables tracking (optional, recommended for privacy) |
Port mapping:
3000:8080— access Open WebUI athttp://<device-ip>:3000. Change3000to any port you prefer.
--add-hostflag (Linux-specific, required): On Linux, Docker does not resolvehost.docker.internalby default — this is a Docker Desktop feature for macOS/Windows only. The--add-host=host.docker.internal:host-gatewayflag maps it to the host's gateway IP, allowing the container to reach services running on the host (the RKLLM API server, Ollama, etc.). Without this flag, Open WebUI's default Ollama connection (http://host.docker.internal:11434) and any OpenAI connections usinghost.docker.internalwill fail withClientConnectorDNSError: Cannot connect to host host.docker.internal.
The goal is minimal user configuration — a fresh install should work correctly out of the box with zero manual settings. The docker-compose.yml file and database setup scripts together achieve this by hardcoding every setting that can be automated.
Three hardcoding layers are used:
| Layer | How | Survives Container Recreate | Survives Volume Delete |
|---|---|---|---|
Docker env vars (docker-compose.yml) |
PersistentConfig — env var sets the initial default, DB value takes precedence once changed in UI |
Yes | Yes (re-applies) |
Database scripts (tests/set_model_prompts.py, tests/fix_owui_models.py) |
Directly write to webui.db inside the container |
Yes (data on Docker volume) | No (must re-run) |
| Admin UI only | No env var or script — must configure manually | Yes (data on Docker volume) | No (must redo) |
Settings hardcoded via Docker env vars (auto-restore on fresh install):
| Category | Settings |
|---|---|
| Connection | API base URL, API key |
| RAG pipeline | Embedding model, reranking model, batch size, async embedding, hybrid search, BM25 weight, enriched texts, relevance threshold, top_k, top_k_reranker, custom RAG template |
| Chunking | Chunk size, overlap, min size target, markdown header splitter |
| Document processing | PDF image extraction via OCR (PDF_EXTRACT_IMAGES=True) — extracts text from scanned images inside PDFs |
| Web search | Engine, SearXNG URL, result count, concurrent requests, bypass modes |
| File upload | Image compression (672×672 for VL model — matches active encoder resolution) |
| Code execution | Disabled (ENABLE_CODE_EXECUTION=False) — small NPU models (1.7B–4B) generate unreliable code |
| Code interpreter | Disabled (ENABLE_CODE_INTERPRETER=False) — same reason; wastes time or produces wrong results |
| Features | Channels, memories, notes |
| Privacy | Telemetry disabled |
Settings hardcoded via database scripts (must re-run after volume reset):
| Script | What it sets |
|---|---|
tests/set_model_prompts.py |
System prompt on all models (date/time context) |
tests/fix_owui_models.py |
Model capability flags (vision, image_gen, code_interpreter, etc.) |
Settings only configurable via Admin UI (no env var available, must redo manually after volume reset):
| Setting | Current Value | Where to Set |
|---|---|---|
| Web search domain filter list | !reddit.com, !twitter.com, !x.com, !linkedin.com, !facebook.com, !instagram.com, !tripadvisor.com, !timeanddate.com |
Admin > Settings > Web Search > Domain Filter List |
| Model display order | qwen3-1.7b, qwen3-4b, phi-3-mini, gemma-3, deepseek-r1:7b, qwen3:8b, qwen3-vl-2b | Admin > Settings > Interface > Model Order |
| Prompt suggestions | 16 custom suggestions (study, coding, travel, etc.) | Admin > Settings > Interface > Prompt Suggestions |
Full recovery after volume reset (docker compose down -v):
# 1. Re-create container (env vars auto-apply)
docker compose up -d
# 2. Create admin account in browser, then run DB scripts:
docker cp tests/set_model_prompts.py open-webui:/tmp/
docker exec open-webui python3 /tmp/set_model_prompts.py
docker cp tests/fix_owui_models.py open-webui:/tmp/
docker exec open-webui python3 /tmp/fix_owui_models.py
# 3. Re-configure Admin UI-only settings manually (domain filters, model order, prompt suggestions)The RKLLM API server connection is auto-configured via OPENAI_API_BASE_URL and OPENAI_API_KEY env vars — no manual setup needed. Models appear in the dropdown immediately after container startup.
If you need to change the connection later: Admin > Settings > Connections > edit the OpenAI-compatible endpoint:
| Setting | Value |
|---|---|
| API Base URL | http://host.docker.internal:8000/v1 (default via env var) |
| API Key | sk-unused (default via env var) |
Ollama can be installed on the same board and added as a second connection in Open WebUI. This gives you access to CPU-only models (e.g., larger models that don't have RKLLM conversions) alongside your NPU models — both appear in the model selector.
# Install Ollama on the same ARM board
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model
ollama pull gemma2:2bAdmin > Settings > Connections:
Add Ollama as an additional connection (don't remove the RKLLM one):
| Setting | Value |
|---|---|
| Ollama API URL | http://localhost:11434 |
Both backends appear in Open WebUI's model dropdown:
- NPU models (fast, via this RKLLM API server) — use for everyday chat and web search
- CPU models (slower, via Ollama) — use for larger models or architectures not yet supported by RKLLM
Note: NPU and CPU inference don't conflict — they use different hardware. You can have an NPU model loaded via this server while Ollama runs a CPU model simultaneously.
Recommended Ollama models for RK3588:
# DeepSeek-R1 reasoning (works on CPU, broken on NPU — see Pre-Built Models note)
ollama pull deepseek-r1:7b
# Other useful CPU models
ollama pull gemma2:2b
ollama pull phi3:3.8bCPU models (Ollama) do NOT need the NPU-specific settings below. The system prompt, disabled "Builtin Tools", and other restrictions apply only to small NPU models served by this RKLLM API.
The system prompt is set at the model level in the database — it applies to all users automatically with zero user configuration required. New users get the correct prompt immediately without needing to set anything up.
Current prompt (set on all models):
Today is {{CURRENT_DATE}} ({{CURRENT_WEEKDAY}}), {{CURRENT_TIME}}. This is the ONLY correct current date. Ignore any conflicting dates from search results.
Open WebUI resolves the template variables server-side before sending to the model:
{{CURRENT_DATE}}→ e.g. "February 10, 2026"{{CURRENT_WEEKDAY}}→ e.g. "Tuesday"{{CURRENT_TIME}}→ e.g. "14:30:00"
How it works: The prompt is stored in each model's params.system field in the database. Open WebUI injects it server-side on every request via apply_system_prompt_to_body(). This is enforced regardless of which user sends the message.
To change the prompt: Edit SYSTEM_PROMPT in tests/set_model_prompts.py and re-run:
# Edit the prompt in the script, then:
docker cp tests/set_model_prompts.py open-webui:/tmp/
docker exec open-webui python3 /tmp/set_model_prompts.pyUsers can optionally add their own prompt in Settings > General > System Prompt. If set, both prompts are sent (they stack). Leave the user-level prompt empty to use only the model-level default.
Why model-level instead of user-level? User-level prompts must be configured manually by each user — if a new user joins and forgets to set it, models won't know today's date. Model-level prompts are server-enforced, zero-configuration, and apply to everyone.
Why "Ignore any conflicting dates"? Web search results often contain stale "current date" claims from cached pages (e.g. a time zone site showing "Today is October 26, 2025"). Small models (1.7B) can latch onto these and output the wrong date. This instruction, combined with the API server's date cleanup (see below), significantly reduces false dates.
When web search results contain date/time information, three layers work together to help the model use the correct date:
- System prompt — Explicitly states the current date with "This is the ONLY correct current date"
- Stale date cleanup (
api.py) — The API server automatically strips misleading "current date is X" and "today is Y" claims from web search snippets before they reach the model. Only factual date references (like DST transition dates) are preserved. - Date anchor injection (
api.py) — A[Current date: February 10, 2026. Any conflicting dates below are outdated.]line is prepended to all RAG context, placing the correct date immediately before the web content.
These are pure preprocessing steps — zero inference overhead. They significantly improve date accuracy, especially on larger models (4B+). The 1.7B model may still occasionally get confused with heavily date-laden content; for time-sensitive web search queries, prefer the 4B model.
Web search is auto-configured via Docker env vars — no manual UI setup needed. The search icon appears in the chat UI immediately.
Prerequisite: SearXNG must be running as a Docker container named
searxngon the same host (see SearXNG Configuration below). If your SearXNG container has a different name or IP, update theSEARXNG_QUERY_URLenv var.
All web search settings are hardcoded (see Docker Setup env var table above). To verify or adjust: Admin > Settings > Web Search.
| Setting | Value | Hardcoded |
|---|---|---|
| Web Search | ON | ENABLE_WEB_SEARCH=True |
| Search Engine | searxng |
WEB_SEARCH_ENGINE=searxng |
| SearXNG Query URL | http://host.docker.internal:8080/search?q=<query> |
SEARXNG_QUERY_URL |
| Result Count | 5 |
WEB_SEARCH_RESULT_COUNT=5 |
| Bypass Web Loader | ON | BYPASS_WEB_SEARCH_WEB_LOADER=True |
| Bypass Embedding & Retrieval | ON | BYPASS_WEB_SEARCH_EMBEDDING_AND_RETRIEVAL=True |
Why Bypass Web Loader? Search engine snippets are cleaner and faster than raw page scraping. Small models handle structured snippets better than noisy full-page HTML.
Why Bypass Embedding & Retrieval? The API server builds its own optimized reading-comprehension prompt internally. Sending snippets directly avoids unnecessary embedding overhead.
The embedding model determines how well Open WebUI finds the right document chunks when you ask a question. This runs on CPU (not NPU), so it needs to be small enough for ARM hardware.
Recommended: BAAI/bge-small-en-v1.5 — set via the Docker RAG_EMBEDDING_MODEL env var above.
| Model | Params | MTEB Avg | Retrieval Score | RAM | Verdict |
|---|---|---|---|---|---|
| BAAI/bge-small-en-v1.5 | 33M | 62.17 | 51.68 | ~150 MB | Best for RAG on ARM |
| sentence-transformers/all-MiniLM-L6-v2 | 22.7M | 56.08 | 41.95 | ~90 MB | Decent but lower retrieval quality |
| minishlab/potion-base-8M | 8M | 50.54 | 31.71 | ~30 MB | Ultra-fast but poor retrieval — wrong chunks = worse answers |
| TaylorAI/bge-micro-v2 | ~4M | ~45 | ~28 | ~16 MB | Too small for reliable RAG |
| BAAI/bge-base-en-v1.5 | 109M | 63.55 | 53.25 | ~450 MB | Marginal gain, 3× more RAM — overkill for ARM |
Why not a faster/smaller model? Embedding speed is not the bottleneck — embedding 5 chunks takes <100ms even with transformer models on ARM CPU. The NPU generation at 13 tok/s is the actual bottleneck. Trading retrieval quality for embedding speed is a bad trade: the model gets worse context and gives worse answers.
Changing the model: If you switch embedding models, go to Admin > Settings > Documents > Danger Zone and click Reindex Knowledge Base Vectors to re-embed all existing documents with the new model.
Admin > Settings > Documents:
These settings control how Open WebUI chunks, embeds, and retrieves uploaded documents. Most are hardcoded via Docker env vars (see Docker Setup) so they persist across container recreations. The values below are tuned for 1.5-4B parameter models on constrained ARM hardware, backed by Chroma Research and Open WebUI best practices:
| Setting | Value | Hardcoded | Reason |
|---|---|---|---|
| Text Splitter | Default (Character) |
default | RecursiveCharacterTextSplitter outperforms TokenTextSplitter (Chroma Research). Tokenizer-agnostic — no mismatch between tiktoken and BERT tokenizer |
| Markdown Header Splitter | ON | ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER=True |
Splits by H1-H6 headers first, preserving document structure. Character splitter only runs on oversized sections |
| Chunk Size | 1000 |
default | ~200-250 tokens. Chroma Research found 200 tokens optimal; 1000 chars is a good character equivalent |
| Chunk Overlap | 0 |
CHUNK_OVERLAP=0 |
Overlap actively hurts retrieval IoU (Chroma Research). With Hybrid Search + BM25, overlap is unnecessary |
| Min Chunk Size Target | 400 |
CHUNK_MIN_SIZE_TARGET=400 |
Merges tiny fragments with neighbors, reducing chunk count by up to 90% while improving accuracy |
| Embedding Model | BAAI/bge-small-en-v1.5 |
RAG_EMBEDDING_MODEL |
62.17 MTEB avg, 33M params, ~150MB RAM (see Embedding Model) |
| Reranking Model | cross-encoder/ms-marco-MiniLM-L-6-v2 |
RAG_RERANKING_MODEL |
22M params, ~88MB RAM. Re-scores Top K candidates for much better precision. Sigmoid normalization built-in for MS MARCO models |
| Top K | 5 |
RAG_TOP_K=5 |
Retrieves 5 chunks, reranker narrows to best 3. Good funnel ratio |
| Top K Reranker | 3 |
default | Keeps the 3 highest-scored chunks after reranking |
| Full Context Mode | OFF | default | Injecting the entire document overflows the 4K context window |
| Hybrid Search | ON | ENABLE_RAG_HYBRID_SEARCH=True |
Combines semantic (vector) + keyword (BM25) search |
| Enrich Hybrid Search Text | ON | ENABLE_RAG_HYBRID_SEARCH_ENRICHED_TEXTS=True |
Enriches BM25 index with filenames, titles, and section headers |
| BM25 Weight | 0.1 |
RAG_HYBRID_BM25_WEIGHT=0.1 |
10% keyword / 90% semantic — heavily semantic-leaning since bge-small-en-v1.5 delivers strong retrieval |
| Relevance Threshold | 0 |
RAG_RELEVANCE_THRESHOLD=0.0 |
Must be 0. Cross-encoder sigmoid scores are often below 0.1 for valid content — any threshold filters out real results. Let the reranker handle quality |
RAG Template: Use the default template (clear the field) — it includes inline citation support with
[id]format and comprehensive guidelines. The API server's RAG pipeline works with the default template.
Image Compression: Set to
672x672to match the active VL encoder resolution. Available resolutions: 448 (fast/low detail), 672 (balanced, default), 896 (slow/high detail). Change via Admin > Settings > Documents or thetests/owui_set_compression.pyscript.
After changing settings: Click "Reindex Knowledge Base Vectors" at the bottom of the Documents page to rebuild all embeddings with the new chunking/embedding configuration.
For vision-language (VL) models like Qwen3-VL-2B to work with Open WebUI image uploads and OCR:
Admin > Settings > Images:
| Setting | Value | Reason |
|---|---|---|
| Image Generation (Engine) | OFF (leave unset) | Do NOT enable image generation — it interferes with image upload routing. VL/OCR is handled by the chat API, not the image generation pipeline |
Workspace > Models > Edit (for all NPU models):
| Setting | Value | Reason |
|---|---|---|
| Vision capability | ON | Enables the image upload button in chat for every model. The API server auto-routes image requests to the VL pipeline regardless of which model is selected — users don't need to manually switch to the VL model |
| Builtin Tools | OFF | Small NPU models (1.5B–4B) cannot do function-calling |
How VL works: When you upload an image in chat, Open WebUI sends it as a base64-encoded image_url in the OpenAI multimodal content array format. The RKLLM API server auto-detects image content and routes the request to the VL model pipeline (vision encoder → NPU embedding → LLM decoder). No special configuration is needed beyond enabling the Vision capability on the model.
Supported image formats: JPEG, PNG, WebP, BMP, GIF (first frame). Images are automatically resized to the VL encoder's input resolution.
Tip: For OCR tasks (extracting text from screenshots, documents, photos), use the VL model (Qwen3-VL-2B). Simply upload an image in chat — it auto-routes to the VL pipeline.
Workspace > Models > Edit > Capabilities — configure for each NPU model:
| Setting | Value | Reason |
|---|---|---|
| Vision | ON | Enables image upload button in chat. The API server auto-routes images to the VL model — set this on all models, not just the VL model |
| Builtin Tools | OFF |
Required. Small NPU models (1.5B-4B) cannot do function-calling. Leaving this on injects tool-use instructions that confuse the model |
| File Context | ON | Enables document and search result injection |
Admin > Settings > Interface > Generation Settings:
| Setting | Value | Reason |
|---|---|---|
| Show Generation Settings | OFF | The RKLLM runtime handles sampling internally. UI sliders are ignored by the API |
The API server works as a conversation agent for Home Assistant via the Extended OpenAI Conversation HACS integration. This enables voice and text control of smart home devices using your local NPU.
- Install HACS in Home Assistant if not already installed
- Install Extended OpenAI Conversation from HACS
- Add the integration in Settings > Devices & Services:
| Field | Value |
|---|---|
| Name | RKLLM Orange Pi (or any name) |
| API Key | sk-no-key-required (any dummy value) |
| Base Url | http://<ORANGE_PI_IP>:8000/v1 |
| Skip Authentication | Checked |
| Api Provider | OpenAI |
- Configure the conversation agent:
| Setting | Recommended Value |
|---|---|
| chat_model | qwen3-1.7b (fast) or qwen3-4b-instruct-2507 (smarter) |
| Max tokens | 2048 |
| Temperature | 0.3 (low for reliable device control) |
| Top P | 0.9 |
| Max function calls | 3 |
| Context Threshold | 3500 (for 1.7B) or 13000 (for 4B) |
- Create a Voice Assistant in Settings > Voice Assistants using the new conversation agent
- Expose entities in the Expose tab (start with 5-10 lights/switches)
| Model | Prefill | Generate | Total | Best For |
|---|---|---|---|---|
| qwen3-1.7b | ~3s | ~0.5s | ~3.5s | Simple commands, fast response |
| qwen3-4b-instruct-2507 | ~19s | ~8s | ~27s | Complex queries, more entities |
The 1.7B model is recommended for HA — simple commands like "turn off the living room light" work reliably and respond in under 5 seconds. Keep exposed entities under 20 to stay within the 4096 context window.
The API server automatically detects Home Assistant requests and:
- Disables thinking — skips
<think>reasoning tokens, cutting latency significantly - Detection is based on system prompt signatures (
smart home manager of home assistant,available devices:,execute_services function) - This does not affect Open WebUI or other clients
- Complex multi-step commands may fail on 1.7B (e.g., "turn off all lights except the kitchen")
- Entity count affects prompt size — more entities = slower prefill and less room for response
- No native tool calling — relies on Extended OpenAI Conversation's prompt-based function calling
The included settings.yml is optimized for Open WebUI on ARM hardware. Key settings:
use_default_settings:
engines:
keep_only:
- google
- google news
- duckduckgo
- bing
- brave
- wikipedia
search:
formats:
- html
- json # REQUIRED for Open WebUI API accessInstallation:
cp settings.yml ~/Downloads/searxng-docker/searxng/settings.yml
cd ~/Downloads/searxng-docker
docker compose down && docker compose up -dA cron job runs every Sunday at 3 AM to back up Open WebUI data and pull the latest image:
0 3 * * 0 /home/armbian/scripts/openwebui_full_backup_update.sh >> /var/log/openwebui_backup.log 2>&1
What the script does:
- Stops the Open WebUI container
- Backs up
webui.db(SQLite) anduploads/+vector_db/directories - Verifies backup integrity (
sqlite3 PRAGMA integrity_check+tar -tzf) - Rotates backups — keeps the last 5 (oldest deleted automatically)
- Pulls the latest Open WebUI image
- Recreates the container via
docker compose up -d - Runs an HTTP health check (up to 180 s timeout)
- On failure: auto-rollback — restores DB + files, tags the old image, restarts with it
Backup location: /home/armbian/backups/openwebui/
Manual trigger:
/home/armbian/scripts/openwebui_full_backup_update.shThe docker-compose.yml uses an environment variable for the image tag so the rollback path can override it:
image: ${OPEN_WEBUI_IMAGE:-ghcr.io/open-webui/open-webui:main}During normal operation OPEN_WEBUI_IMAGE is unset, so Docker pulls main. During rollback the script sets OPEN_WEBUI_IMAGE to the previously-tagged image and runs docker compose up -d, which picks up the old version.
A separate hourly cron runs enforce-owui-models.sh to ensure the Open WebUI model list stays in sync with the RKLLM API server:
0 * * * * /home/armbian/scripts/enforce-owui-models.sh >> /var/log/enforce-owui-models.log 2>&1
When Open WebUI performs a web search or retrieves document chunks, the results are injected into the system message as <source> tags (or via a custom RAG template). The server detects this and activates a specialized RAG pipeline:
- Detection —
<source>tags in the system message trigger RAG mode - Extraction — Content extracted from between
<source>...</source>tags - Web Content Cleaning (5-pass):
- Pass 0: Strip misleading "current date/time" claims from cached web pages
- Pass 1: Remove known boilerplate phrases (cookies, sign-in, privacy policy, etc.)
- Pass 2: Remove navigation patterns (CamelCase runs, title-case-heavy lines, URL clusters)
- Pass 3: Collapse consecutive short-line menus (4+ short lines = navigation)
- Pass 4: Keep only lines with data signals (digits, prose punctuation, ≥40 chars)
- Deduplication — Exact prefix key + Jaccard word-similarity removal
- Score-based selection — jusText-inspired paragraph scoring:
- Stopword density (prose ≥ 30%, boilerplate < 15%)
- Length, sentence count, data presence
- Query keyword matching (3x weight)
- Negative signals: short fragments, navigation patterns, boilerplate keywords
- Quality floor — If best paragraph scores below threshold, RAG is dropped entirely
- Prompt construction — SQuAD-style reading comprehension format:
{reference data} According to the above, {question}. Answer in detail with specific facts and examples - Summarization boost — When the query contains "summarize", "summary", "overview", or "outline", a stronger instruction is appended: "Cover all major points, sections, and key details. Use multiple paragraphs."
Open WebUI searches SearXNG with the raw user message. Short follow-ups produce garbage results:
| Layer | Trigger | Example |
|---|---|---|
| Layer 0: Document-referential bypass | Query contains document-related words/phrases — forces RAG mode | "summarize this", "the attached file" |
| Layer 1: Word list | Exact match to known conversational words | "yes", "thanks", "tell me more" |
| Layer 2: Zero-overlap check | Zero query content-words found in reference text (w/ conversation history) | Off-topic follow-up after RAG |
Layer 0 fires first and overrides the other layers (document-referential queries always use RAG). When Layer 1 or 2 fires, RAG is skipped and the model uses normal conversation mode.
The server preserves full conversation context across turns within a chat session. Open WebUI sends the entire message history (system, user, and assistant messages) with each request, and the server formats them into a multi-turn prompt:
User: What is the capital of France?
Assistant: The capital of France is Paris.
User: What is its population?
The model sees all previous turns and can answer follow-up questions in context (e.g., "its" refers to Paris). With KV cache incremental mode, only the new user message is prefilled — prior turns are already in the NPU's KV cache.
RAG responses are cached in an LRU cache (key: model + question hash) to avoid redundant NPU inference:
| Setting | Default | Description |
|---|---|---|
RAG_CACHE_TTL |
300s | Cache lifetime |
RAG_CACHE_MAX_ENTRIES |
50 | Max cached responses |
Models like Qwen3 output chain-of-thought wrapped in <think>...</think> tags.
The server:
- Parses these tags from the token stream using a state machine (handles tags split across chunks)
- Sends
reasoning_contentin streaming deltas (Open WebUI displays these as collapsible thinking blocks) - Returns
reasoning_contentin non-streaming responses - Thinking blocks stripped from history — prior assistant responses have
<think>...</think>removed before re-sending to the model, per Qwen3 docs ("historical output should only include the final output part"). This saves tokens and prevents the model from mimicking its own chain-of-thought - Context-dependent thinking for RAG: On small context models (< 8k), thinking is disabled via
enable_thinking = falseto save tokens for the actual answer
Note: DeepSeek-R1 is currently not usable on the NPU with RKLLM Runtime v1.2.3 (produces
[PAD]garbage tokens). Use Qwen3-1.7B for NPU reasoning, or rundeepseek-r1:7bvia Ollama on CPU. See the Pre-Built Models section for details.
The NPU runtime maintains an internal KV cache. With keep_history=1, prior conversation turns are preserved, so follow-up messages only need to prefill the new tokens:
| Scenario | Strategy | Prefill Time | What's Sent |
|---|---|---|---|
| New conversation | clear_kv_cache() + keep_history=1 |
~90ms (full) | Full prompt |
| Follow-up turn | keep_history=1 |
~50ms (incremental) | Only new user message |
| RAG query | keep_history=0 |
~90ms (full) | RAG context + question |
| Model switch | New model loaded | ~90ms (full) | Full prompt |
- First turn — The server calls
rkllm_clear_kv_cache()then sends the full prompt withkeep_history=1. After generation, the KV cache contains the full conversation. - Follow-up turns — The server compares the list of prior user messages against what the KV cache already contains. If the lists match (same conversation, same model), only the new user message is sent with
keep_history=1. The NPU appends to the existing KV cache. - New conversation — List mismatch triggers
rkllm_clear_kv_cache()+ full prompt resend. - RAG queries — Always use
keep_history=0(standalone, no history needed).
This makes multi-turn conversations significantly faster — Turn 2+ take ~50ms to prefill regardless of total conversation length.
When PROMPT_CACHE_ENABLED = True (default), the server saves the KV state to disk after the first inference on a freshly loaded model. On subsequent model loads (e.g., after a model swap or service restart), this cache is restored automatically, pre-populating the system prompt tokens so the first turn starts faster.
- Cache file:
<model_dir>/prompt_cache.bin(e.g.,~/models/Qwen3-1.7B/prompt_cache.bin) - Save: Triggered on the first KV reset (new conversation) after model load, if no cache file exists yet
- Load: Called automatically during
load_model()if a cache file is found - Uses the RKLLM SDK's
rkllm_load_prompt_cache()/rkllm_release_prompt_cache()API - Graceful fallback: if the SDK version doesn't support the cache API, the feature is silently disabled
When conversation history exceeds the model's context window, the server automatically trims the oldest turns to make room. This prevents context overflow errors while preserving the most recent exchange:
- Reserves
HISTORY_CONTEXT_RESERVE(35%) of context for the current turn + generation output - Caps each prior assistant message at
ASSISTANT_HISTORY_CAP(1500 chars) to prevent single long responses from dominating history - Trims from the oldest turn first, always keeping at least the most recent user/assistant pair
- Strips
<think>...</think>blocks from assistant history before inclusion (per Qwen3 guidelines)
All configuration is at the top of api.py:
| Variable | Default | Description |
|---|---|---|
GENERATION_TIMEOUT |
600s | Max total generation time |
FIRST_TOKEN_TIMEOUT |
300s | Max wait for first token (includes prefill) |
FALLBACK_SILENCE |
20s | Max silence between tokens after first |
REPETITION_WINDOW |
200 chars | Sliding window size for repetition loop detection |
REPETITION_MAX_HITS |
2 | Abort after this many repeated windows detected |
| Variable | Default | Description |
|---|---|---|
MAX_TOKENS_DEFAULT |
2048 | Default max completion tokens |
CONTEXT_LENGTH_DEFAULT |
4096 | Fallback when not detected from filename |
| Variable | Default | Description |
|---|---|---|
RAG_MIN_QUALITY_SCORE |
2 | Minimum score for paragraph inclusion |
RAG_MAX_PARAGRAPHS |
10 | Max paragraphs (prevents "lost in the middle") |
RAG_QUALITY_FLOOR_THRESHOLD |
3 | Below this, RAG is dropped entirely |
RAG_DEDUP_SIMILARITY |
0.70 | Jaccard threshold for near-duplicate removal |
RAG_CACHE_TTL |
300 | Cache lifetime in seconds (0 to disable) |
RAG_CACHE_MAX_ENTRIES |
50 | Max cached responses |
DISABLE_THINK_FOR_RAG_BELOW_CTX |
8192 | Disable thinking for RAG when context < this |
| Variable | Default | Description |
|---|---|---|
VL_RAG_CONTEXT_CAP |
2000 chars | Max RAG reference text in VL multi-turn prompts |
VL_ASSISTANT_HISTORY_CAP |
500 chars | Max chars per prior assistant answer in VL prompts |
| Variable | Default | Description |
|---|---|---|
HISTORY_CONTEXT_RESERVE |
0.35 | Fraction of context reserved for current turn + output |
ASSISTANT_HISTORY_CAP |
1500 chars | Max chars per prior assistant answer in text prompts |
| Variable | Default | Description |
|---|---|---|
PROMPT_CACHE_ENABLED |
True |
Enable/disable KV state save/load on model init |
Model-aware sampling is configured via MODEL_SAMPLING_PROFILES in api.py. Each model family gets tuned defaults:
| Family | top_k | top_p | temp | repeat_penalty | presence_penalty |
|---|---|---|---|---|---|
qwen3 |
20 | 0.8 | 0.7 | 1.1 | 1.5 |
gemma |
40 | 0.95 | 0.7 | 1.1 | 0.0 |
phi |
40 | 0.9 | 0.6 | 1.1 | 0.0 |
deepseek |
20 | 0.9 | 0.6 | 1.1 | 1.0 |
Override per model: Add a "sampling" key in model_config.json inside the model directory:
{
"sampling": {
"top_k": 30,
"temperature": 0.5
}
}Only specified fields are overridden; unset fields use the family profile defaults.
| Variable | Default | Description |
|---|---|---|
REQUEST_STALE_TIMEOUT |
180s | Auto-clear tracked request after this idle time |
MONITOR_INTERVAL |
10s | Health check / idle monitoring frequency |
IDLE_UNLOAD_TIMEOUT |
300s | Auto-unload text model after idle (0 to disable) |
VL_IDLE_UNLOAD_TIMEOUT |
300s | Auto-unload VL model after idle (0 to disable) |
| Variable | Description |
|---|---|
RKLLM_LIB_PATH |
Path to librkllmrt.so (auto-detected from /usr/lib/ by default) |
RKNN_LIB_PATH |
Path to librknnrt.so for VL vision encoder (auto-detected from /usr/lib/ by default) |
RKLLM_API_LOG_LEVEL |
Python API log level: DEBUG, INFO, WARNING, ERROR |
Logs are written to both stderr and a rotating log file (api.log in the script directory):
- Max file size: 10 MB
- Backup count: 3 rotated files
- Default level:
DEBUG(setRKLLM_API_LOG_LEVEL=INFOfor production)
2026-02-08 17:45:12 [INFO] Detected: qwen3-1.7b (context=4096)
2026-02-08 17:45:12 [INFO] Models: ['qwen3-1.7b']
2026-02-08 17:45:12 [INFO] Aliases: {'qwen': 'qwen3-1.7b', 'qwen3': 'qwen3-1.7b'}
2026-02-08 17:45:30 [INFO] >>> NEW REQUEST chatcmpl-a1b2c3d4e5f6
2026-02-08 17:45:30 [INFO] Resolved alias 'qwen' -> 'qwen3-1.7b'
2026-02-08 17:45:30 [INFO] Loading model: qwen3-1.7b
2026-02-08 17:45:33 [INFO] Model loaded in 3.2s
2026-02-08 17:45:33 [DEBUG] KV incremental: sending only new user message (hash match)
2026-02-08 17:45:33 [DEBUG] First token in 0.05s
2026-02-08 17:45:45 [INFO] Request ENDED: chatcmpl-a1b2c3d4e5f6
This server has NO authentication. It is designed to run on a trusted local network.
- Binds to
0.0.0.0:8000— accessible from all network interfaces - No API key validation (any non-empty string works for Open WebUI)
- Request body limited to 50 MB
- Do NOT expose directly to the public internet
- Place behind a reverse proxy (nginx, Caddy) with authentication if external access is needed
- Ensure the
.rkllmfile is inside a subfolder of~/models/(not directly in~/models/) - Check folder naming — spaces become hyphens, underscores become hyphens
- Run
curl http://localhost:8000/v1/modelsto see detected models
- Check that
librkllmrt.sois in/usr/lib/:ldconfig -p | grep rkllm - Verify NPU driver is loaded:
dmesg | grep -i npu - Check
api.logfor init failure messages — may indicate corrupt.rkllmfile or version mismatch
- NPU is single-task — only one request at a time
- Previous request may be stuck — check
/healthendpoint - Stale requests auto-clear after 180s idle
- Check
FALLBACK_SILENCEtimeout (default 20s) — increase if model is slow - Large prompts near context limit may cause long prefill — increase
FIRST_TOKEN_TIMEOUT(default 300s) - Check
/healthendpoint for status
- Ensure you are using
-k gthread, not-k gevent.rkllm_run()is a blocking C call that freezes gevent's event loop - Check
gunicorncommand:gunicorn -w 1 -k gthread --threads 4 --timeout 300 -b 0.0.0.0:8000 api:app
- Verify SearXNG is returning JSON: add
jsontosearch.formatsin SearXNG settings - Check if "Bypass Web Loader" is ON in Open WebUI
- Set
RAG_QUALITY_FLOOR_THRESHOLDhigher to drop poor search results - Check logs for "RAG SKIP" and "Quality floor triggered" messages
- Set
IDLE_UNLOAD_TIMEOUTto auto-unload after idle periods - Use
/v1/models/unloadto manually free NPU memory - Smaller quantized models (W4A16) use less memory
Four test suites verify every code path against a live server — from unit-level parsing to full end-to-end integration across all models.
Section-by-section diagnostic covering 17 areas of the codebase — 108 tests total. Designed for copy-paste output analysis.
python tests/diagnostic_test.py # Run all 17 sections
python tests/diagnostic_test.py --skip-vl # Skip VL tests (faster)
python tests/diagnostic_test.py --section 4 # Run only section 4| Section | Coverage |
|---|---|
| 1 | Server connectivity & health endpoint structure |
| 2 | Model detection, /v1/models listing, response format |
| 3 | Alias generation & model name resolution |
| 4 | Error handling: bad body, empty messages, invalid types, bad base64 |
| 5 | Text generation (non-streaming): response structure, usage stats |
| 6 | Text generation (streaming): SSE format, chunk structure, include_usage |
| 7 | Think tag parsing (reasoning_content in SSE) |
| 8 | KV cache tracking & incremental mode (multi-turn memory) |
| 9 | Model select, unload, switch, idle state |
| 10 | Concurrent request rejection (single-NPU guard) |
| 11 | RAG pipeline: <source> extraction, boilerplate cleaning, skip detection |
| 12 | RAG response cache (generate vs cached timing) |
| 13 | Content normalization (multimodal arrays with text only) |
| 14 | VL auto-routing, image processing, streaming, model name in response |
| 15 | Text-after-VL (dual-model isolation) |
| 16 | Route variants (/chat/completions vs /v1/...), edge cases |
| 17 | Final system state consistency |
Focused integration tests across 17 categories — 68 assertions. Tests text generation, VL multimodal, streaming, error handling, model lifecycle, and concurrent rejection.
python tests/vl_test.py all # Run all tests
python tests/vl_test.py complete # Non-streaming tests only
python tests/vl_test.py stream # Streaming tests only| Suite | Total | Pass | Fail | Warn | Time |
|---|---|---|---|---|---|
tests/diagnostic_test.py |
91 | 91 | 0 | 0 | ~12 min |
tests/vl_test.py |
68 | 68 | 0 | 0 | ~5 min |
tests/e2e_test.py |
78 | 78 | 0 | 0 | ~11 min |
tests/deep_diagnostic.py |
84 | 84 | 0 | 1 | ~7 min |
tests/realworld_smoke.py |
47 | 47 | 0 | 0 | ~4 min |
Ad-hoc real-world smoke test with 15 scenarios — 47 checks. Exercises the API with natural prompts to verify end-to-end behavior including Q&A, streaming, reasoning, multi-turn context, RAG, shortcircuits, date awareness, Home Assistant, model switching, concurrent rejection, long output, and error handling.
python tests/realworld_smoke.py # Run all 15 scenariosFull integration test that exercises every model with real inference — 78 checks across 9 sections, run against all 4 text models. Covers streaming, non-streaming, shortcircuits, RAG, cache, web search, KV multi-turn, prompt building, Open WebUI database config, and API compliance. WebUI/SearXNG URLs are auto-derived from RKLLM_API host.
python tests/e2e_test.py # Run all 9 sections against all models
python tests/e2e_test.py --section 3 # Run only section 3
python tests/e2e_test.py --fast # Skip slow models (gemma, phi)| Section | Coverage |
|---|---|
| 1 | Per-model text generation (streaming + non-streaming, all 4 models) |
| 2 | Shortcircuit detection (title gen, tag gen, query gen) |
| 3 | RAG pipeline with real document context |
| 4 | RAG response cache (hit/miss timing) |
| 5 | Web search flow (SearXNG integration) |
| 6 | KV cache multi-turn memory |
| 7 | Prompt building & detection (date, HA, summarization) |
| 8 | Open WebUI database config |
| 9 | API compliance & edge cases (alias resolution, stream_options) |
Targeted deep-dive into 12 under-tested areas identified via gap analysis — 72 checks covering protocol compliance, edge cases, and stress scenarios.
python tests/deep_diagnostic.py # Run all 12 sections
python tests/deep_diagnostic.py --section 4 # Run only section 4| Section | Coverage |
|---|---|
| 1 | SSE stream format strict compliance (Content-Type, JSON validity, [DONE], role, finish_reason) |
| 2 | Concurrent request rejection (503 on second request, recovery after) |
| 3 | Model hot-swap correctness (swap timing, response model field, health state) |
| 4 | Unicode / special character handling (emoji, CJK, Arabic, HTML entities in stream) |
| 5 | ThinkTagParser edge cases (partial tags, char-by-char, multi-block, regex parity, 10K blocks) |
| 6 | CORS headers & preflight (OPTIONS, Access-Control-Allow-Origin, methods) |
| 7 | Token usage accuracy (prompt+completion=total, streaming include_usage, ranges) |
| 8 | Select / unload endpoints (invalid model, double unload, invalid JSON) |
| 9 | VL / OCR pipeline (image inference, streaming, invalid base64, URL-based image rejection) |
| 10 | Context overflow / large input (22K chars, 50-turn conversation, server recovery) |
| 11 | Message normalization (list content, integer coercion, null content, non-dict filtering) |
| 12 | Shortcircuit streaming SSE compliance (chunk count, usage block, system_fingerprint) |
All suites default to http://localhost:8000. To target a remote server, set the RKLLM_API environment variable:
RKLLM_API=http://192.168.x.x:8000 python tests/diagnostic_test.pyAutomated NPU benchmark tool that measures cold load time, warm TTFT, generation speed (tok/s), and NPU memory usage. Fetches real perf stats from the server log via SSH.
python tests/benchmark_test.py # Benchmark all models
python tests/benchmark_test.py --models qwen3-1.7b phi-3-mini-4k-instruct # Specific models only
python tests/benchmark_test.py --skip-vl # Skip VL model tests
python tests/benchmark_test.py --runs 3 # Average over 3 runs
python tests/benchmark_test.py --remote-log # Fetch NPU perf via SSHResults are saved to tests/benchmark_results.json and printed as formatted markdown tables.
RKLLM-API-Server/
├── api.py # Main API server (ctypes, v2.0)
├── docker-compose.yml # Open WebUI Docker config (all settings hardcoded)
├── setup.sh # Zero-config installer (762 lines)
├── settings.yml # SearXNG configuration for Open WebUI
├── README.md # This file
├── tests/
│ ├── diagnostic_test.py # Section-by-section diagnostic (17 sections, 108 tests)
│ ├── e2e_test.py # End-to-end integration (9 sections, 85 checks, all models)
│ ├── deep_diagnostic.py # Deep diagnostic (12 sections, 72 checks, edge cases)
│ ├── vl_test.py # Integration test suite (17 categories, 68 tests)
│ ├── realworld_smoke.py # Real-world smoke test (15 scenarios, 47 checks)
│ ├── benchmark_test.py # NPU model benchmark tool (tok/s, TTFT, memory)
│ ├── benchmark_results.json # Latest benchmark results
│ ├── set_model_prompts.py # Set system prompts on all OWUI models (DB script)
│ ├── fix_owui_models.py # Set model capabilities: vision, tools, etc. (DB script)
│ ├── remove_stale_models.py # Mark old/removed models as inactive in OWUI DB
│ ├── dump_owui_models_quick.py # Quick dump of all OWUI model records
│ ├── dump_owui_settings.py # Dump all OWUI admin settings from DB
│ ├── owui_set_compression.py # Set OWUI image compression (DB + runtime API)
│ ├── vl_multi_image_test.py # Multi-image VL model integration test
│ └── vl_multiturn_test.py # VL multi-turn context + RAG integration test
├── archive/
│ ├── api_v1_subprocess.py # Original subprocess version (archived)
│ └── CTYPES_MIGRATION_PLAN.md # V1→V2 migration planning document
└── .gitignore
The original server (archive/api_v1_subprocess.py) worked by spawning a separate C++ binary and communicating via stdin/stdout pipes. While functional, this architecture had significant limitations. The current version (api.py) uses direct ctypes bindings to the shared library, eliminating the process boundary entirely.
| Aspect | V1 — Subprocess | V2 — ctypes (current) |
|---|---|---|
| NPU communication | Pipes stdin/stdout to a C++ binary | Direct C library calls via ctypes |
| Token delivery | Parse stdout line-by-line | C callback pushes to queue.Queue |
| KV cache | Lost on every turn (binary restarts) | Preserved across turns (keep_history=1) |
| Prefill (Turn 2+) | ~500ms (re-process entire conversation) | ~50ms (only new user message) |
| Abort / cancel | SIGKILL the process |
rkllm_abort() — clean, instant |
| Performance stats | Parsed from stdout text | Native RKLLMResult.perf struct |
| Thinking mode toggle | Append /no_think to prompt text |
RKLLMInput.enable_thinking flag |
| Error handling | Detect process crash / timeout | C return codes + error callback state |
| Process management | ~500 lines (spawn, monitor, kill, restart) | 0 lines (no process to manage) |
| VL / multimodal | Not supported | Dual-model architecture with RKNN vision encoder |
| Code size | 2682 lines | ~3700 lines (text + VL + RAG) |
The biggest win is KV cache retention. In the subprocess architecture, every turn killed and restarted the C++ binary, destroying the NPU's key-value cache. This meant the model had to re-prefill the entire conversation history from scratch on every single message — growing linearly with conversation length.
With ctypes, the library stays loaded in-process. The KV cache persists between calls. On a 10-turn conversation, Turn 1 takes ~90ms to prefill. All subsequent turns take ~50ms regardless of conversation length, because only the new message is processed.
Performance impact (measured on Orange Pi 5 Plus, Qwen3-1.7B):
| Metric | V1 (Subprocess) | V2 (ctypes) | Improvement |
|---|---|---|---|
| Turn 1 prefill | ~90ms | ~90ms | Same |
| Turn 2 prefill | ~500ms | ~50ms | 10x faster |
| Turn 5 prefill | ~1200ms | ~50ms | 24x faster |
| Turn 10 prefill | ~2000ms+ | ~50ms | 40x faster |
| Model switch | ~5s (kill + restart + reload) | ~3s (destroy + init) | ~40% faster |
| Cancel generation | ~1s (SIGKILL + wait) | instant (rkllm_abort()) |
Near-instant |
The original subprocess version is preserved at archive/api_v1_subprocess.py (2682 lines, fully functional). You can also access it via the git tag:
# View the last working subprocess version
git checkout v1.0-subprocess -- api.py
# Return to current ctypes version
git checkout main -- api.pyThe V1 code may be useful as a reference if:
- You need to run on a system where ctypes binding is not possible
- You want to see how stdout parsing / process management was implemented
- You're porting to a different inference runtime that only provides a CLI binary
| Board | RAM | NPU Driver | Runtime | Status |
|---|---|---|---|---|
| Orange Pi 5 Plus | 16 GB | 0.9.8 | v1.2.3 | Fully tested, production use |
| Model | Quantization | Context | File Size | Speed | Status |
|---|---|---|---|---|---|
| Qwen3-1.7B | W8A8 | 4K | ~1.7 GB | 13.0 tok/s avg | Fully benchmarked |
| Phi-3-Mini-4K-Instruct | W8A8 | 4K | ~3.8 GB | 6.8 tok/s avg | Fully benchmarked |
| Qwen3-4B-Instruct | W8A8 | 16K | ~4 GB | ~6 tok/s | Tested |
| Gemma-3-4B-IT | W8A8 | 4K | ~4 GB | ~6 tok/s | Tested |
Pre-built RKLLM models available on HuggingFace: Qwen3-1.7B-RKLLM-v1.2.3 · Phi-3-mini-4k-instruct-w8a8 · Qwen3-VL-2B-Instruct-RKLLM-v1.2.3
| Model | Quantization | Encoder Res | Decode Speed | Encoder Time | Peak RAM | Status |
|---|---|---|---|---|---|---|
| Qwen3-VL-2B | W8A8 | 672×672 | ~15 tok/s | ~4s | ~6.5 GB | Active (recommended) |
| Qwen3-VL-2B | W8A8 | 448×448 | ~15 tok/s | ~2s | ~5.5 GB | Available (default export) |
| Qwen3-VL-2B | W8A8 | 896×896 | ~15 tok/s | ~12s | ~8.5 GB | Available (high-detail) |
| InternVL3.5-2B | W8A8 | 448×448 | ~12.1 tok/s | ~2.0s | ~3.0 GB | Tested — poor OCR accuracy |
| DeepSeekOCR-3B | W8A8 | 448×448 | ~31.8 tok/s | ~2.1s | ~3.0 GB | Tested — severe hallucination |
| Qwen2.5-VL-3B | W8A8 | 392×392 | ~8.7 tok/s | ~2.9s | ~5.3 GB | Supported (lower resolution) |
| Qwen2-VL-2B | W8A8 | 392×392 | ~16.6 tok/s | ~3.3s | ~3.0 GB | Supported |
| InternVL3-1B | W8A8 | 448×448 | ~TBD | ~TBD | ~TBD | Supported |
| MiniCPM-V-2.6 | W8A8 | 448×448 | ~TBD | ~TBD | ~TBD | Supported |
Qwen3-VL-2B is the recommended VL model with the vision encoder re-exported at 672×672 for 2.25× more visual detail than the default 448×448. Three encoder resolutions (448/672/896) are available — see Vision Encoder Resolution Comparison. To switch encoders, rename the
.rknnfiles (only one should have the.rknnextension; others use.rknn.alt).All Qwen3-VL-2B files (LLM + all 3 vision encoders) are on HuggingFace: GatekeeperZA/Qwen3-VL-2B-Instruct-RKLLM-v1.2.3. Pre-converted models for other architectures available in the RKLLM official model zoo (fetch code:
rkllm).
Measured on Orange Pi 5 Plus (16 GB) — RK3588, 3 NPU cores, RKNPU driver 0.9.8, librkllmrt.so v1.2.3. All measurements are server-side NPU timing from the RKLLMResult.perf struct (not client-side estimates). Both text and VL models remain loaded in NPU memory simultaneously during normal operation.
| Prompt | Prefill | Prefill Tokens | Generate Time | Output Tokens | tok/s | Cold TTFT |
|---|---|---|---|---|---|---|
| Short Q&A | 95 ms | 15 | 39.6s | 539 | 13.6 | 4.1s |
| Medium explanation | 128 ms | 26 | 145.3s | 1,829 | 12.6 | 4.1s |
| Long generation | 176 ms | 49 | 134.6s | 1,701 | 12.6 | 4.1s |
| Reasoning (step-by-step) | 140 ms | 33 | 67.0s | 889 | 13.3 | 4.2s |
| Average | 135 ms | 31 | — | 1,240 | 13.0 | 4.1s |
- Model load time: 2.8s
- NPU memory: 2,343 MB (standalone) · 7,308 MB (with VL model co-loaded)
- HuggingFace: GatekeeperZA/Qwen3-1.7B-RKLLM-v1.2.3
| Prompt | Prefill | Prefill Tokens | Generate Time | Output Tokens | tok/s | Cold TTFT |
|---|---|---|---|---|---|---|
| Short Q&A | 204 ms | 10 | 9.4s | 68 | 7.2 | 6.8s |
| Medium explanation | 258 ms | 25 | 141.3s | 913 | 6.5 | 6.9s |
| Long generation | 454 ms | 47 | 149.9s | 963 | 6.4 | 6.6s |
| Reasoning (step-by-step) | 263 ms | 29 | 29.4s | 207 | 7.0 | 6.9s |
| Average | 295 ms | 28 | — | 538 | 6.8 | 6.8s |
- Model load time: 4.5s (avg)
- NPU memory: 7,308 MB (with text model co-loaded)
- HuggingFace: GatekeeperZA/Phi-3-mini-4k-instruct-w8a8
- Qwen3-1.7B generates ~2× faster than Phi-3-Mini despite similar quantization — smaller parameter count means fewer operations per token
- Prefill scales linearly with input token count (~6–10 ms per token for Qwen3, ~10–15 ms for Phi-3)
- Cold TTFT includes model load — Qwen3 loads in 2.8s, Phi-3 in ~4.5s. Warm TTFT (model already loaded) is ~1.3s for Qwen3 and ~2.0s for Phi-3
- Generation speed is stable across prompt lengths — the NPU maintains consistent tok/s regardless of output length
- Reasoning prompts generate fewer tokens but at slightly higher tok/s (less KV cache pressure from shorter context)
We tested all available pre-converted VL models to find the best option for real-world OCR tasks (reading gas meters from phone photos). All models use the same 448×448 (or lower) vision encoder, which crushes high-resolution phone photos (14-15 MP) down to a tiny thumbnail — a fundamental bottleneck for OCR accuracy.
- Hardware: Orange Pi 5 Plus (16 GB), RK3588, RKNPU driver 0.9.8, rkllm-runtime v1.2.3
- Test images: Two real gas meter photos (14-15 MB JPEG, taken with phone camera)
- Task: Read the numeric meter display and identify the brand name printed on the meter
| Model | Source | Meter Reading | Brand Detection | Speed | Verdict |
|---|---|---|---|---|---|
| Qwen3-VL-2B | RKLLM model zoo | Produces plausible numbers (wrong but in range) | Detected similar text | 5-10s | Best available |
| InternVL3.5-2B | happyme531/InternVL3_5-2B-RKLLM | Completely wrong ("200.0", "48") | Hallucinated ("Bundix") | ~30s | Not usable for OCR |
| DeepSeekOCR-3B | RKLLM model zoo | Completely wrong ("1234") or empty | Hallucinated entire scenes (car dashboards, industrial panels) | 5-15s | Not usable — severe hallucination |
| Qwen2.5-VL-3B | vuong1/Qwen2.5-VL-3B-Instruct-RKLLM | Not tested (lower 392×392 resolution) | — | ~8.7 tok/s | Rejected — lower resolution than current |
- Qwen3-VL-2B is the best pre-converted option — while not perfectly accurate for OCR, it produces contextually plausible results and is the fastest
- InternVL3.5-2B has a larger language model (Qwen2.5-1.5B) but W8A8 quantization destroyed its vision capability — also generates Chinese chain-of-thought gibberish
- DeepSeekOCR-3B was specifically designed for OCR but the RKNN conversion is fundamentally broken — it hallucinates entirely different scenes
- All models share the same root problem: the 448×448 (or 392×392) vision encoder crushes phone photos too aggressively for reliable text/number reading
- The real solution is re-exporting the vision encoder at higher resolution — see Vision Encoder Resolution Comparison for results
After identifying that the 448×448 default vision encoder was the bottleneck, we re-exported the Qwen3-VL-2B vision encoder at 672×672 and 896×896 using the rknn-llm export scripts on an x86 host (Ubuntu, 15GB RAM + 36GB swap, CPU-only — no GPU required).
~/models/Qwen3-VL-2b/
qwen3-vl-2b-instruct_w8a8_rk3588.rkllm # LLM decoder (shared)
qwen3-vl-2b_vision_672_rk3588.rknn # Active encoder (672×672)
qwen3-vl-2b_vision_448_rk3588.rknn.alt # Fast, low-detail (inactive)
qwen3-vl-2b_vision_896_rk3588.rknn.alt # Slow, high-detail (inactive)
To switch encoder: rename the active .rknn to .rknn.alt and the desired one from .rknn.alt to .rknn, then sudo systemctl restart rkllm-api.
Tested on Orange Pi 5 Plus (16GB) with real 14-15MB JPEG gas meter photos:
| Resolution | Visual Tokens | RKNN Size | Encoder Time | Total Response | Peak RAM |
|---|---|---|---|---|---|
| 448×448 | 196 (14×14) | 812 MB | ~2s | 5–10s | ~5.5 GB |
| 672×672 ⭐ | 441 (21×21) | 854 MB | ~4s | 9–11s | ~6.5 GB |
| 896×896 | 784 (28×28) | 923 MB | ~12s | 25–28s | ~8.5 GB |
| Resolution | meter1.jpg | meter2.jpg | Consistency |
|---|---|---|---|
| 448×448 | 975648 |
3700211 |
Unique readings |
| 672×672 | 37866 |
3709217 |
— |
| 896×896 | 37866 |
57709217 |
672 & 896 agree on meter1 |
Key conclusions:
- 672×672 is the sweet spot — 2.25× more visual detail with only ~1s extra latency vs 448
- 896×896 is 3× slower (25-28s vs 9-11s) for marginal benefit
- The 2B LLM is the accuracy bottleneck, not the vision encoder — different resolutions produce different (all incorrect) readings, suggesting the model size limits OCR reliability
- Text decode speed is unaffected (~15 tok/s) — only the vision encode + prefill time increases
- All files are on HuggingFace: GatekeeperZA/Qwen3-VL-2B-Instruct-RKLLM-v1.2.3
The RKNN export scripts accept --height and --width parameters, so you can re-export the Qwen3-VL-2B vision encoder at a higher resolution (e.g., 672×672 or 896×896) to improve OCR accuracy. This only affects the .rknn vision encoder — the .rkllm language model stays the same.
- x86 Linux machine with Python 3.9-3.12 (the toolkits do not run on ARM)
- No GPU required — CPU-only export works (tested on Ubuntu 22.04, 15GB RAM)
- ~20 GB RAM+swap for 672×672, ~35 GB for 896×896 (use
fallocate+mkswapfor extra swap) rknn-toolkit2v2.3.2 (pip install rknn-toolkit2)torch==2.4.0,torchvision==0.19.0transformers>=4.57.0,onnx>=1.18.0
# Clone the export scripts
git clone https://github.com/airockchip/rknn-llm.git
cd rknn-llm/examples/multimodal_model_demo
# Install dependencies
pip install transformers==4.57.0 torch rkllm-toolkit rknn-toolkit2
# Download the original Qwen3-VL-2B HuggingFace model
# (needed as source for the vision encoder weights)
git clone https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct
# Step 1: Export vision encoder to ONNX at custom resolution
# Height/width must be divisible by (merge_size * patch_size) = 2 * 16 = 32 for Qwen3-VL
python export/export_vision.py \
--path ./Qwen3-VL-2B-Instruct \
--model_name qwen3-vl \
--height 672 --width 672
# Output: ./onnx/qwen3-vl_vision.onnx
# Step 2: Convert ONNX to RKNN for RK3588
python export/export_vision_rknn.py \
--path ./onnx/qwen3-vl_vision.onnx \
--model_name qwen3-vl \
--target-platform rk3588 \
--height 672 --width 672
# Output: ./rknn/qwen3-vl_vision_rk3588.rknnQwen3-VL uses patch_size=16 and merge_size=2, so resolution must be divisible by 32:
| Resolution | Visual Tokens | Encoder Time (measured) | RAM Impact | Notes |
|---|---|---|---|---|
| 448×448 | 196 | ~2s | Baseline (5.5 GB) | Default from model zoo |
| 672×672 | 441 | ~4s | +1 GB (6.5 GB) | ⭐ Recommended — 2.25× more pixels |
| 896×896 | 784 | ~12s | +3 GB (8.5 GB) | 4× more pixels, noticeably slower |
| 1120×1120 | 1225 | ~20s+ | +5 GB+ | May OOM on 16 GB devices |
Note: The
.rkllmlanguage model file does NOT need to be re-exported — only the.rknnvision encoder changes. Copy the new.rknnfile to the model folder on the Orange Pi alongside the existing.rkllmfile.
- Height and width must be divisible by
patch_size × merge_size(32 for Qwen3-VL, 28 for Qwen2/2.5-VL) - Higher resolution = more visual tokens = longer prefill time and more NPU memory
- The vision encoder runs on the NPU — very large resolutions may cause OOM on 16 GB devices
- The RKLLM LLM decoder has a fixed
max_context_len— ensure visual tokens + text tokens fit within it
| Tag / Branch | Description |
|---|---|
v1.0-subprocess-stable |
Last working subprocess version (V1) |
v1.1-ctypes-text-only |
Text-only ctypes version before VL additions |
subprocess-legacy |
Branch preserving the subprocess architecture |
main |
Current: ctypes + VL multimodal + meta-task shortcircuits + context-enriched query gen + document RAG + model-aware sampling + prompt cache + sliding window + NPU benchmarks + full test suites (321 checks, 0 failures) |
This project is provided as-is for personal and educational use. The rkllm runtime and model files are subject to their respective licenses from Rockchip and model authors.
- airockchip/rknn-llm — RKLLM runtime, toolkit, and multimodal demo
- airockchip/rknn-toolkit2 — RKNN runtime (
librknnrt.so) for vision encoder NPU inference - Pelochus/armbian-build-rknpu-updates — Armbian builds with RKNPU driver
- Open WebUI — Web interface for LLM interaction