Morxi/RKLLM-API-Server

 
 


RKLLM API Server

OpenAI-compatible API server for Rockchip NPU (RK3588/RK3576) running RKLLM models, designed as a drop-in backend for Open WebUI.

Built for single-board computers like the Orange Pi 5 Plus, this server bridges the gap between the librkllmrt.so C runtime and any OpenAI-compatible frontend — enabling local, private LLM inference on ARM hardware with zero cloud dependency.



Features

Core

  • OpenAI-compatible API: /v1/chat/completions and /v1/models endpoints work with any OpenAI client
  • Direct NPU access via ctypes binding to librkllmrt.so (no subprocess overhead)
  • KV cache incremental mode — follow-up turns only prefill the new message (~50ms vs ~500ms)
  • Prompt cache preloading — saves KV state to disk after first inference; subsequent model loads restore it instantly, skipping system prompt re-prefill
  • Model-aware sampling profiles — per-family tuned sampling parameters (Qwen3, Gemma, Phi, DeepSeek) with model_config.json override support
  • Context-aware sliding window — automatically trims oldest conversation turns when history exceeds context length, keeping the most recent exchange intact
  • Context overflow hard-rejection — if the built prompt exceeds 110% of the model's context length (after sliding window), the request is rejected with a 400 context_length_exceeded error instead of sending an oversized prompt to the NPU runtime
  • Streaming & non-streaming responses with proper SSE (Server-Sent Events) format
  • Auto-detection of all .rkllm models in ~/models directory
  • Context length auto-detection from filename patterns (2k/4k/8k/16k/32k)
  • Auto-generated aliases — short names like qwen, phi resolve automatically
  • Multi-turn conversation history — full chat context preserved via KV cache across turns
  • Model hot-switching — request a different model and it loads automatically
  • On-demand loading via /v1/models/select for warm-up
  • Explicit unloading via /v1/models/unload to free NPU memory
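
The sliding-window trim and the 110% hard rejection described above can be sketched as follows. This is a minimal illustration, not the server's code: `fit_to_context`, `estimate_tokens`, and `ContextLengthExceeded` are hypothetical names, and the 4-characters-per-token estimate is an assumption standing in for a real tokenizer.

```python
from typing import Dict, List

Message = Dict[str, str]


class ContextLengthExceeded(Exception):
    """Would be surfaced as an HTTP 400 context_length_exceeded error."""


def estimate_tokens(messages: List[Message]) -> int:
    # Assumption: roughly 4 characters per token. A real server would
    # count with the model's tokenizer instead.
    return sum(len(m["content"]) for m in messages) // 4 + 1


def fit_to_context(messages: List[Message], context_length: int) -> List[Message]:
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    # Drop the oldest turns first, but never touch the last two messages
    # (treated here as "the most recent exchange").
    while len(turns) > 2 and estimate_tokens(system + turns) > context_length:
        turns.pop(0)

    trimmed = system + turns
    # Hard rejection: still more than 110% of the context after trimming.
    if estimate_tokens(trimmed) > context_length * 1.10:
        raise ContextLengthExceeded(
            f"prompt needs ~{estimate_tokens(trimmed)} tokens, "
            f"model context is {context_length}"
        )
    return trimmed
```

Trimming before rejecting means a long conversation degrades gracefully, and only a single oversized message (for example a huge pasted document) triggers the 400.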

Robustness

  • Request tracking with automatic stale-request cleanup (prevents deadlocks)
  • Idle auto-unload — frees NPU memory after configurable idle period (default 5 min)
  • Clean abort — native rkllm_abort() for instant cancellation (no SIGKILL needed)
  • Graceful shutdown on SIGTERM/SIGINT with model cleanup
  • RLock-based locking — prevents model switch deadlock scenarios
  • Repetition loop detection — aborts generation when the model enters paragraph-level repetition loops (configurable window/threshold)
  • SSE heartbeats during prefill — sends keep-alive comments every 15s during long prefill to prevent HTTP proxy/client timeouts
  • Error callback state — detects C library errors and surfaces them as proper HTTP responses
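
As an illustration of paragraph-level repetition detection with a configurable window and threshold, here is a minimal sketch. The function name `is_repeating` and the default values are assumptions, not the server's actual settings.

```python
from collections import Counter


def is_repeating(text: str, window: int = 6, threshold: int = 3,
                 min_chars: int = 20) -> bool:
    """Return True if one paragraph occurs `threshold` or more times
    among the last `window` paragraphs of the generated text."""
    # Ignore very short paragraphs (list markers, "OK.", etc.).
    paragraphs = [p.strip() for p in text.split("\n\n")
                  if len(p.strip()) >= min_chars]
    recent = paragraphs[-window:]
    if not recent:
        return False
    _, count = Counter(recent).most_common(1)[0]
    return count >= threshold
```

A generation loop could call this on the accumulated output every few tokens and trigger an abort when it returns True.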

RAG (Retrieval-Augmented Generation)

  • Automatic RAG detection when Open WebUI injects web search or document results
  • Document/PDF RAG — works with Open WebUI's document upload and embedding pipeline
  • Summarization detection — detects "summarize" queries and adds stronger multi-paragraph instructions
  • Smart prompt restructuring — reading comprehension format optimized for small models
  • 5-pass web content cleaning — strips navigation, boilerplate, cookie banners, stale dates
  • Score-based paragraph selection — jusText-inspired content quality scoring
  • Near-duplicate removal — Jaccard similarity deduplication across sources
  • Quality floor — drops irrelevant search results instead of confusing the model
  • Follow-up detection — 3-layer system prevents RAG on conversational replies
  • Response caching — LRU cache eliminates redundant inference for repeated questions
  • Context-dependent thinking — disables reasoning on small context models to save tokens
  • Auto-capability detection — infers model type (thinking, instruct, VL, OCR) from folder names; only reasoning models get <think> enabled
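
Near-duplicate removal with Jaccard similarity can be sketched as below. This is a generic word-set version; the 0.8 threshold and the function names are assumptions for illustration only.

```python
import re
from typing import List


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' lowercase word sets."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def drop_near_duplicates(paragraphs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a paragraph only if it is not too similar to one already kept."""
    kept: List[str] = []
    for paragraph in paragraphs:
        if all(jaccard(paragraph, other) < threshold for other in kept):
            kept.append(paragraph)
    return kept
```

Because the first occurrence wins, ordering the sources by quality before deduplication keeps the best copy of each passage.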

Reasoning Model Support

  • <think> tag parsing for Qwen3 and similar reasoning models
  • reasoning_content field in both streaming deltas and non-streaming responses
  • Streaming state machine handles tags split across output chunks
  • Thinking block stripping: <think>...</think> blocks are automatically stripped from assistant history before re-sending to the model (per Qwen3 docs: "historical output should only include the final output")
  • Open WebUI integration — reasoning appears as collapsible thinking blocks
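
To show why a state machine is needed when tags are split across output chunks, here is a minimal sketch. It is not the project's ThinkTagParser; the class name `ThinkTagSplitter` and its methods are hypothetical.

```python
from typing import List, Tuple


class ThinkTagSplitter:
    """Splits streamed text into reasoning_content and content pieces."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self) -> None:
        self.in_think = False
        self.buf = ""

    def _kind(self) -> str:
        return "reasoning_content" if self.in_think else "content"

    def feed(self, chunk: str) -> List[Tuple[str, str]]:
        self.buf += chunk
        out: List[Tuple[str, str]] = []
        while self.buf:
            tag = self.CLOSE if self.in_think else self.OPEN
            idx = self.buf.find(tag)
            if idx != -1:
                # Complete tag found: emit the text before it, flip state.
                if idx:
                    out.append((self._kind(), self.buf[:idx]))
                self.buf = self.buf[idx + len(tag):]
                self.in_think = not self.in_think
                continue
            # No complete tag: hold back the longest suffix that could
            # still be the beginning of the tag, emit everything else.
            hold = 0
            for n in range(min(len(tag) - 1, len(self.buf)), 0, -1):
                if tag.startswith(self.buf[-n:]):
                    hold = n
                    break
            cut = len(self.buf) - hold
            if cut:
                out.append((self._kind(), self.buf[:cut]))
            self.buf = self.buf[cut:]
            break
        return out

    def flush(self) -> List[Tuple[str, str]]:
        """Emit whatever is still buffered at end of stream."""
        leftover, self.buf = self.buf, ""
        return [(self._kind(), leftover)] if leftover else []
```

Feeding the chunks `"<th"`, `"ink>plan"`, `"</thi"`, `"nk>Paris"` yields one reasoning piece (`plan`) and one content piece (`Paris`), with no tag fragments leaking into either.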

Open WebUI Meta-Task Shortcuts

  • Query generation shortcircuit — Open WebUI asks the model to generate search queries for retrieval; instead of wasting 5s of inference, the server extracts the user's actual question from the chat history and returns it as the query instantly (~0ms). For vague follow-ups ("can you verify that?"), it enriches the query with entities extracted from the assistant's previous response (bold text, quoted strings, capitalized phrases)
  • Title generation shortcircuit — extracts the first user message as the chat title (~0ms instead of 5-10s inference)
  • Tag generation shortcircuit — returns a default tag instantly (~0ms instead of 5-10s inference)
  • Meta-task thinking disabled — auto-detects Open WebUI internal tasks (query gen, title gen, tags, autocomplete) and disables <think> reasoning to avoid wasting 20+ seconds on trivial tasks
  • No JSON leakage — query generation shortcircuit prevents raw JSON from appearing in the chat display

Home Assistant

  • Auto-detection — recognizes Home Assistant requests by system prompt signatures (smart home manager, Available Devices:, execute_services)
  • Thinking auto-disabled — skips <think> reasoning for HA requests, cutting response time in half
  • Compatible with Extended OpenAI Conversation (HACS) — works as a drop-in conversation agent for Assist
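
Signature-based detection can be as simple as a substring check on the system prompt. The sketch below uses the three signatures listed above; the function name and the case-sensitive matching are assumptions.

```python
from typing import Any, Dict, List

# Signatures taken from the feature list above.
HA_SIGNATURES = ("smart home manager", "Available Devices:", "execute_services")


def is_home_assistant_request(messages: List[Dict[str, Any]]) -> bool:
    """True if any system message contains a Home Assistant signature."""
    system_text = " ".join(
        m["content"]
        for m in messages
        if m.get("role") == "system" and isinstance(m.get("content"), str)
    )
    return any(signature in system_text for signature in HA_SIGNATURES)
```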

Monitoring

  • Prometheus metrics (optional) — rkllm_tokens_generated, rkllm_prefill_duration, rkllm_decode_duration, rkllm_tokens_per_request, rkllm_queue_wait_seconds, rkllm_active_requests, rkllm_model_load_seconds, rkllm_current_model — exposed at /metrics
  • Graceful degradation — metrics are disabled automatically if prometheus-flask-exporter is not installed

Standards Compliance

  • stream_options.include_usage — streaming token counts per OpenAI spec
  • system_fingerprint in all responses
  • max_tokens / max_completion_tokens support
  • Request body size limit (50 MB)
  • Proper error responses matching OpenAI error format

Vision-Language (VL) / Multimodal

  • Dual-model architecture — text model (e.g. Qwen3-1.7B) + VL model (e.g. Qwen3-VL-2B) loaded simultaneously
  • Automatic image routing — requests with images route to VL model, text-only to text model
  • Base64 image support — accepts image_url with data:image/...;base64,... format (Open WebUI compatible)
  • Direct NPU vision encoding via ctypes binding to librknnrt.so (no Python RKNN toolkit needed)
  • Image preprocessing — auto square-pad (128,128,128 background) and resize to encoder input size
  • Multiple VL model support — auto-detects .rknn vision encoder alongside .rkllm decoder
  • Configurable special tokens: VL_MODEL_CONFIGS maps model families to their image tokens
  • Multi-turn VL context — follow-up questions, RAG/web search data, and conversation history are included in VL prompts (not just the original image caption)
  • Seamless Open WebUI experience — paste/upload images in chat, responses stream normally
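
The image routing and base64 handling above can be sketched with the standard library alone. This assumes the OpenAI message format in which `content` is a list of parts and images arrive as `image_url` parts with a `data:` URL; `extract_images` and `pick_pipeline` are hypothetical names.

```python
import base64
import re
from typing import Any, Dict, List

_DATA_URL = re.compile(r"^data:image/[\w.+-]+;base64,(.+)$", re.DOTALL)


def extract_images(messages: List[Dict[str, Any]]) -> List[bytes]:
    """Decode every base64 data-URL image found in the messages."""
    images: List[bytes] = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain string content carries no images
        for part in content:
            if part.get("type") != "image_url":
                continue
            url = part.get("image_url", {}).get("url", "")
            match = _DATA_URL.match(url)
            if match:
                images.append(base64.b64decode(match.group(1)))
    return images


def pick_pipeline(messages: List[Dict[str, Any]]) -> str:
    """Route to the VL model when at least one image is present."""
    return "vl" if extract_images(messages) else "text"
```

The decoded bytes would then go through padding and resizing before reaching the vision encoder; that step needs Pillow and is left out here.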

Architecture

┌──────────────┐     HTTP/SSE      ┌──────────────────────────────────┐
│  Open WebUI  │ ◄──────────────── │   api.py (Flask + gunicorn)      │
│  or any      │ ─────────────────►│   gthread, -w 1                  │
│  OpenAI      │                   │                                  │
│  client      │                   │  ┌──────────────────────────┐    │
└──────────────┘                   │  │  VL Auto-Router          │    │
                                   │  │  (image → VL, text → LLM)│    │
        ┌──────────┐               │  └────┬─────────────┬───────┘    │
        │ SearXNG  │ ◄──── Open    │       │ text        │ image      │
        │ (search) │  WebUI injects│       ▼             ▼            │
        └──────────┘  results      │  ┌─────────┐  ┌─────────────┐   │
                                   │  │ Prompt  │  │ Vision Enc. │   │
        ┌──────────┐               │  │ Builder │  │ librknnrt.so│   │
        │  Ollama  │               │  │ + RAG   │  │ (.rknn NPU) │   │
        │  (CPU)   │               │  └────┬────┘  └──────┬──────┘   │
        └──────────┘               │       │              │           │
         optional                  │       ▼              ▼           │
                                   │  ┌──────────────────────────┐   │
                                   │  │  librkllmrt.so v1.2.3    │   │
                                   │  │  Text: RKLLMWrapper      │   │
                                   │  │  VL:   RKLLMWrapper #2   │   │
                                   │  │  C callback → Queue      │   │
                                   │  └────────────┬─────────────┘   │
                                   │               │                 │
                                   │  ┌────────────▼─────────────┐   │
                                   │  │  RK3588 NPU (3 cores)    │   │
                                   │  │  6 TOPS total            │   │
                                   │  └──────────────────────────┘   │
                                   │                                  │
                                   │  ┌──────────────────────────┐   │
                                   │  │  ThinkTagParser          │   │
                                   │  │  (reasoning_content)     │   │
                                   │  └──────────────────────────┘   │
                                   └──────────────────────────────────┘

Key design decisions:

  1. Plain text only — The rkllm runtime applies chat templates internally using actual token IDs. Special tokens (<|im_start|>, <start_of_turn>, etc.) are stripped from the text vocabulary during model conversion. Sending them as literal text causes the model to see garbage.

  2. Single worker — The NPU can only load one model at a time. The server enforces -w 1 (one gunicorn worker) and rejects concurrent generation with HTTP 503.

  3. ctypes + callback — The C library's rkllm_run() is blocking, so it runs in a worker thread. A C callback pushes tokens to a queue.Queue, which the main thread reads and yields as SSE chunks. This keeps the KV cache in-process across turns.

  4. gthread, not gevent: rkllm_run() is a blocking C function that freezes gevent's event loop. Using -k gthread with real OS threads avoids this.

  5. Dual-model VL — Text and VL models are loaded simultaneously into separate RKLLMWrapper instances. The vision encoder runs on a third ctypes binding (librknnrt.so). Image requests are auto-routed to the VL pipeline; text requests use the primary model. A shared _token_queue serialized by PROCESS_LOCK prevents interleaving. Default VL model: Qwen3-VL-2B (replaced DeepSeekOCR-3B — smaller, saves ~1.1 GB RAM).
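
The worker-thread plus callback plus queue pattern from decision 3 can be sketched without the C library. In this illustration `fake_blocking_run` is a stand-in for the blocking rkllm_run() call, and all names are hypothetical.

```python
import json
import queue
import threading
from typing import Callable, Iterator, List

_DONE = object()  # sentinel marking the end of generation


def fake_blocking_run(tokens: List[str], on_token: Callable[[str], None]) -> None:
    """Stand-in for the blocking C call: invokes the callback per token."""
    for token in tokens:
        on_token(token)


def stream_sse(tokens: List[str]) -> Iterator[str]:
    token_queue: "queue.Queue[object]" = queue.Queue()

    def worker() -> None:
        try:
            # The callback only enqueues; it never touches the HTTP response.
            fake_blocking_run(tokens, token_queue.put)
        finally:
            token_queue.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()

    # The request thread drains the queue and formats SSE chunks.
    while True:
        item = token_queue.get()
        if item is _DONE:
            break
        chunk = {"choices": [{"delta": {"content": item}}]}
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"
```

Putting the sentinel in a `finally` block ensures the consumer loop ends even if the blocking call raises.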


Requirements

Tested System

This project was developed and tested on:

| Component | Details |
|---|---|
| Board | Orange Pi 5 Plus (16 GB RAM) |
| SoC | Rockchip RK3588 (3 NPU cores) |
| OS | Armbian Pelochus 24.11.0 (Armbian-Pelochus_24.11.0-OrangePi5-plus_jammy_vendor.7z) |
| Kernel NPU Driver | 0.9.8 (included in the Pelochus image; no driver build required) |
| RKLLM Runtime | v1.2.3 (only the runtime library needs to be installed) |

Why Pelochus Armbian? The standard Armbian images ship with an older RKNPU driver (0.9.6 or earlier). The Pelochus builds bundle RKNPU driver 0.9.8 in the kernel, so you only need to install the RKLLM runtime — no kernel module compilation required.

Hardware

  • Rockchip RK3588 or RK3576 SBC (Orange Pi 5 Plus, Rock 5B, etc.)
  • NPU driver installed and functional
  • Minimum 8 GB RAM recommended (16 GB for larger models)

Software

  • Linux (ARM64) — tested on Ubuntu/Debian (Armbian)
  • Python 3.8+
  • RKNPU driver ≥ 0.9.6 (0.9.8 recommended — see Installation)
  • RKLLM Runtime ≥ v1.2.0 (tested with v1.2.3) — librkllmrt.so shared library (see Installation)
  • RKLLM models (.rkllm format) placed in ~/models/
  • RKNN Runtime (optional) — librknnrt.so shared library (only needed for VL/multimodal models with .rknn vision encoders)

SDK Version Coupling: The ctypes struct definitions in api.py target the RKLLM SDK v1.2.x C header (rkllm.h). Older SDK versions used a flat 112-byte reserved blob in RKLLMExtendParam and lacked fields like n_keep, n_batch, use_cross_attn, and enable_thinking. Running this server against an older librkllmrt.so (pre-1.2) will cause silent struct-offset misalignment — the parameter block passed to rkllm_init() would be corrupted, producing wrong sampling behaviour rather than a crash. Always use the runtime from the v1.2.x release or later.

Python Dependencies

# Core (required)
pip install flask flask-cors gunicorn

# Prometheus monitoring (optional — metrics at /metrics endpoint)
pip install prometheus-flask-exporter prometheus-client

# VL / multimodal support (optional — needed only for vision-language models)
pip install numpy Pillow

Installation

Automated Setup (Recommended)

A zero-configuration setup script is included that handles everything — system packages, Python venv, RKLLM runtime installation, kernel module/driver verification, udev rules, systemd service, and NPU frequency fix:

git clone https://github.com/GatekeeperZA/RKLLM-API-Server.git
cd RKLLM-API-Server
chmod +x setup.sh
./setup.sh

Do NOT run as root. The script uses sudo internally only where needed (installing system packages, copying libraries, creating the systemd service). User-level files (venv, models directory) are owned by your normal account.

The script is idempotent — safe to run multiple times. It detects what's already installed and skips those steps.

What it installs / verifies:

  • System packages: python3, python3-venv, python3-pip, build-essential, git, git-lfs
  • RKNPU kernel module check (lsmod, modinfo, /dev/rknpu, udev rules, render group)
  • RKLLM Runtime: librkllmrt.so installed to /usr/lib/
  • Python venv (.venv) with flask, flask-cors, gunicorn
  • Systemd services: rkllm-api (API server) + fix-freq (NPU/CPU frequency governor)

After setup, download a model and start the service:

# Download Qwen3-1.7B (recommended)
mkdir -p ~/models/Qwen3-1.7B && cd ~/models/Qwen3-1.7B
git lfs install && git clone https://huggingface.co/GatekeeperZA/Qwen3-1.7B-RKLLM-v1.2.3 .

# Start the server
sudo systemctl start rkllm-api

# Check status
sudo systemctl status rkllm-api
curl http://localhost:8000/v1/models

Manual Installation

Click to expand manual step-by-step instructions

1. Clone This Repository

git clone https://github.com/GatekeeperZA/RKLLM-API-Server.git
cd RKLLM-API-Server

# Install Python dependencies
pip install flask flask-cors gunicorn

# Create models directory
mkdir -p ~/models

2. RKNPU Driver 0.9.8

The RKNPU kernel driver enables communication with the NPU hardware. Some board images ship with an older driver — you need ≥ 0.9.6 (0.9.8 recommended).

Check your current driver version:

dmesg | grep -i rknpu
# Look for a line like: "RKNPU driver loaded version 0.9.8"
# or:
cat /sys/kernel/debug/rknpu/version 2>/dev/null || echo "Check dmesg"

If you need to update:

The driver source is included in the rknn-llm repository as a tarball. It must be compiled against your running kernel's headers.

# Clone the rknn-llm repo (if not already done)
git clone https://github.com/airockchip/rknn-llm.git
cd rknn-llm/rknpu-driver

# Extract the driver source
tar xjf rknpu_driver_0.9.8_20241009.tar.bz2
cd rknpu_driver_0.9.8

# Install kernel headers (required for compilation)
sudo apt update
sudo apt install -y linux-headers-$(uname -r) build-essential

# Build the driver module
make -C /lib/modules/$(uname -r)/build M=$(pwd)/drivers/rknpu modules

# Install the new driver
sudo cp drivers/rknpu/rknpu.ko /lib/modules/$(uname -r)/kernel/drivers/rknpu/
sudo depmod -a

# Load the new driver (or reboot)
sudo modprobe -r rknpu 2>/dev/null  # unload old
sudo modprobe rknpu                  # load new

# Verify
dmesg | grep -i rknpu | tail -5

Note: Many Armbian and Orange Pi images already include RKNPU driver 0.9.8. Check before building. If dmesg | grep rknpu shows 0.9.8, you're good.

Recommended: The Pelochus Armbian builds ship with RKNPU driver 0.9.8 pre-installed — no manual driver compilation needed. Use Armbian-Pelochus_24.11.0-OrangePi5-plus_jammy_vendor.7z (or the latest release for your board) and skip straight to the runtime setup.

3. RKLLM Runtime ≥ v1.2.0 (tested with v1.2.3)

The RKLLM runtime provides the librkllmrt.so shared library that this API server loads via ctypes. The ctypes struct layouts in api.py require SDK v1.2.0 or later — see Requirements for details on version coupling.

# Clone the rknn-llm repo (if not already done)
git clone https://github.com/airockchip/rknn-llm.git
cd rknn-llm

# --- Install the runtime library ---
sudo cp rkllm-runtime/Linux/librkllm_api/aarch64/librkllmrt.so /usr/lib/
sudo ldconfig

# Verify the library is findable
ldconfig -p | grep rkllm
# Should show: librkllmrt.so => /usr/lib/librkllmrt.so

Verify everything works:

# Check RKNPU driver
dmesg | grep -i rknpu

# Check runtime library
ldconfig -p | grep rkllm

4. Fix NPU Frequency (Recommended)

For consistent performance, pin the NPU and CPU frequencies. The rknn-llm repo includes scripts for this:

cd rknn-llm/scripts

# RK3588
sudo bash fix_freq_rk3588.sh

# RK3576 (if using that platform)
sudo bash fix_freq_rk3576.sh

Run this after each reboot, or use the setup script which creates a systemd service for automatic frequency pinning.


Pre-Built Models

Ready-to-run .rkllm models converted by the author for RK3588 NPU are available on HuggingFace:

| Model | Parameters | Quant | Context | Speed | RAM | Thinking | Link |
|---|---|---|---|---|---|---|---|
| Qwen3-1.7B | 1.7B | w8a8 | 4,096 | ~13.6 tok/s | ~2 GB | ✅ Yes | Download |
| Phi-3-mini-4k-instruct | 3.82B | w8a8 | 4,096 | ~6.8 tok/s | ~3.7 GB | ❌ No | Download |

Browse all models: huggingface.co/GatekeeperZA

All models are converted with RKLLM Toolkit v1.2.3, targeting RK3588 (3 NPU cores), and tested on an Orange Pi 5 Plus (16 GB RAM, RKNPU driver 0.9.8).

⚠️ DeepSeek-R1 on NPU — Currently Not Usable

DeepSeek-R1 (including distilled variants like DeepSeek-R1-Distill-Qwen-1.5B) does not work correctly with RKLLM Runtime v1.2.3. The model converts without errors but produces garbage output — repeating [PAD151935] tokens instead of real text (rknn-llm#424). The Airockchip team has acknowledged this is a known issue and stated it will be fixed in a future runtime version.

For NPU reasoning, use Qwen3-1.7B instead — it supports <think> tags, runs at ~13.6 tok/s on the NPU, and works reliably with RKLLM v1.2.3.

If you need DeepSeek-R1, run deepseek-r1:7b via Ollama on CPU — it works correctly (just slower, ~2-3 tok/s on RK3588 ARM cores). See Using Ollama Alongside below.

Quick Download

# Install git-lfs (required for large files)
sudo apt install git-lfs
git lfs install

# Qwen3-1.7B (thinking/reasoning model — recommended)
mkdir -p ~/models/Qwen3-1.7B
cd ~/models/Qwen3-1.7B
git clone https://huggingface.co/GatekeeperZA/Qwen3-1.7B-RKLLM-v1.2.3 .

# Phi-3-mini (3.8B — strong at math/code, MIT licensed)
mkdir -p ~/models/Phi-3-mini-4k-instruct
cd ~/models/Phi-3-mini-4k-instruct
git clone https://huggingface.co/GatekeeperZA/Phi-3-mini-4k-instruct-w8a8 .

Model Notes

Qwen3-1.7B — Hybrid thinking model. Produces <think>...</think> reasoning blocks that this API server parses into reasoning_content for Open WebUI's collapsible thinking display. Supports English and Chinese.

Phi-3-mini-4k-instruct — Microsoft's 3.8B parameter model excelling at reasoning, math (85.7% GSM8K), and code generation (57.3% HumanEval). English-primary. No thinking mode — this is a standard instruct model. MIT licensed.


Model Setup

Place each .rkllm model in its own subfolder under ~/models/:

~/models/
├── Qwen3-1.7B/
│   └── Qwen3-1.7B-w8a8-rk3588.rkllm
├── Qwen3-4B-Instruct-2507/
│   └── Qwen3-4B-Instruct-16k-w8a8-rk3588.rkllm
├── Gemma-3-4B-IT/
│   └── Gemma-3-4B-IT-w8a8-rk3588.rkllm
└── Phi-3-Mini-4K-Instruct/
    └── Phi-3-Mini-4K-Instruct-w8a8-rk3588.rkllm

VL (Vision-Language) Model Setup

VL models require two files in the same folder: a .rkllm decoder and a .rknn vision encoder.

~/models/
├── Qwen3-1.7B/                          # Text-only model
│   └── Qwen3-1.7B-w8a8-rk3588.rkllm
└── Qwen3-VL-2b/                          # VL model (text + vision)
    ├── Qwen3-VL-2b-w8a8-rk3588.rkllm     # LLM decoder
    └── Qwen3-VL-2b-vision-encoder.rknn   # Vision encoder

How it works:

  1. The server auto-detects .rknn files alongside .rkllm files
  2. The folder name is matched against VL_MODEL_CONFIGS (supports DeepSeekOCR, Qwen2-VL, Qwen2.5-VL, Qwen3-VL, InternVL3, MiniCPM)
  3. When a chat request includes an image (base64 image_url), it auto-routes to the VL model
  4. Text-only requests continue using the text model normally
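
A folder scan along the lines of step 1 might look like the sketch below. It is an illustration only: `scan_models` is a hypothetical helper, and it simply takes the first .rkllm and .rknn file in each folder.

```python
from pathlib import Path
from typing import Dict, Optional


def scan_models(models_dir: Path) -> Dict[str, Dict[str, Optional[Path]]]:
    """Map model ID -> paths of its .rkllm decoder and optional .rknn encoder."""
    found: Dict[str, Dict[str, Optional[Path]]] = {}
    for folder in sorted(p for p in models_dir.iterdir() if p.is_dir()):
        decoders = sorted(folder.glob("*.rkllm"))
        if not decoders:
            continue  # not a model folder
        encoders = sorted(folder.glob("*.rknn"))
        # Assumption: the ID is the lowercased folder name with hyphens.
        model_id = folder.name.lower().replace("_", "-").replace(" ", "-")
        found[model_id] = {
            "rkllm": decoders[0],
            "rknn": encoders[0] if encoders else None,  # present => VL model
        }
    return found
```

Calling `scan_models(Path.home() / "models")` on the layout shown above would return one text-only entry and one entry with a vision encoder.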

Supported VL models (model folder name must contain):

| Pattern | Model Family | Notes |
|---|---|---|
| qwen3-vl | Qwen3-VL | Recommended: best OCR, fastest |
| qwen2.5-vl | Qwen2.5-VL | Lower 392×392 encoder |
| qwen2-vl | Qwen2-VL | Lower 392×392 encoder |
| internvl3 | InternVL3 / InternVL3.5 | W8A8 precision loss, poor OCR |
| deepseekocr | DeepSeekOCR | Severe hallucination in RKNN conversion |
| minicpm | MiniCPM-V | Untested |

Requirements:

  • numpy and Pillow Python packages (installed by setup.sh)
  • librknnrt.so (RKNN runtime library, usually at /usr/lib/librknnrt.so)
  • Sufficient RAM for both models (~4.4 GB for Qwen3-1.7B + Qwen3-VL-2B)

Context Length Detection

The server auto-detects context length from the filename or folder name:

| Pattern in name | Detected context |
|---|---|
| -2k or _2k | 2,048 tokens |
| -4k or _4k | 4,096 tokens |
| -8k or _8k | 8,192 tokens |
| -16k or _16k | 16,384 tokens |
| -32k or _32k | 32,768 tokens |
| (none found) | 4,096 (default) |
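
The table above maps onto a single regular expression. The following sketch is one way to implement it; the function name and the exact pattern are assumptions, and the server's own matching may differ in edge cases.

```python
import re

DEFAULT_CONTEXT = 4096

# A "-" or "_" followed by 2/4/8/16/32 and "k", not followed by another
# letter or digit (so "-4B" or "-4kb" do not count).
_CONTEXT_PATTERN = re.compile(r"[-_](2|4|8|16|32)k(?![a-z0-9])", re.IGNORECASE)


def detect_context_length(name: str) -> int:
    """Return the context length implied by a file or folder name."""
    match = _CONTEXT_PATTERN.search(name)
    if match is None:
        return DEFAULT_CONTEXT
    return int(match.group(1)) * 1024
```

For example, `Qwen3-4B-Instruct-16k-w8a8-rk3588.rkllm` resolves to 16,384 tokens, while `Qwen3-1.7B-w8a8-rk3588.rkllm` falls back to the 4,096 default.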

Model Capabilities Detection

The server auto-detects each model's capabilities from its folder name. This controls whether <think> reasoning is enabled and what metadata is exposed via /v1/models.

Auto-detected capabilities by model family:

| Folder pattern | Detected capabilities | Thinking |
|---|---|---|
| qwen3 (not VL) | instruct, thinking | Yes |
| deepseek*r1 / deepseek*r2 | instruct, thinking | Yes |
| qwq | instruct, thinking | Yes |
| deepseekocr | instruct, ocr | No |
| qwen*vl | instruct, vl | No |
| phi | instruct | No |
| gemma | instruct | No |
| llama | instruct | No |
| mistral / mixtral | instruct | No |
| internvl / minicpm | instruct | No |
| (contains instruct or -it) | instruct | No |
| (no match) | base | No |

Any model with a .rknn vision encoder file automatically gains the vl capability.
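
The detection table can be read as an ordered rule list in which the more specific patterns (OCR, VL) are checked before the general ones. The sketch below follows that reading; the rule order, the regular expressions, and the function name are assumptions rather than the server's implementation.

```python
import re
from typing import List

# Ordered: first match wins, so specific families come first.
_RULES = [
    (r"deepseekocr", ["instruct", "ocr"]),
    (r"qwen.*vl", ["instruct", "vl"]),
    (r"qwen3|deepseek.*r[12]|qwq", ["instruct", "thinking"]),
    (r"phi|gemma|llama|mistral|mixtral|internvl|minicpm|instruct|-it\b",
     ["instruct"]),
]


def detect_capabilities(folder_name: str, has_vision_encoder: bool = False) -> List[str]:
    name = folder_name.lower()
    capabilities = ["base"]  # fallback when nothing matches
    for pattern, detected in _RULES:
        if re.search(pattern, name):
            capabilities = list(detected)
            break
    # A .rknn encoder next to the decoder always adds the vl capability.
    if has_vision_encoder and "vl" not in capabilities:
        capabilities.append("vl")
    return capabilities
```

Only a result containing `thinking` would lead to enable_thinking=True being passed to the runtime.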

Override with model_config.json: Place a JSON file in the model folder to override auto-detection. For example, in ~/models/MyCustomModel/model_config.json:

{
  "capabilities": ["thinking", "instruct"]
}

Effect on thinking: Only models with the thinking capability get enable_thinking=True on the RKLLM runtime. Non-thinking models (Phi-3, Gemma, etc.) always run with enable_thinking=False, preventing wasted tokens on models that don't support <think> blocks.

Auto-Generated Aliases

Model folder names are converted to IDs (lowercase, hyphens). Aliases are auto-generated:

| Model ID | Auto-Aliases |
|---|---|
| qwen3-1.7b | qwen, qwen3 |
| qwen3-4b-instruct-2507 | qwen3-4b, qwen3-4b-instruct |
| gemma-3-4b-it | gemma, gemma-3, gemma-3-4b |
| phi-3-mini-4k-instruct | phi, phi-3, phi-3-mini |

Aliases are only created when unambiguous (one model claims the alias). If two models share a prefix, that alias is skipped.
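
The "only when unambiguous" rule can be sketched independently of how candidate aliases are derived. In this illustration the candidate lists are supplied by the caller, and `resolve_aliases` is a hypothetical helper; how the server generates candidates is not shown here.

```python
from collections import defaultdict
from typing import Dict, List, Set


def resolve_aliases(candidates: Dict[str, List[str]]) -> Dict[str, str]:
    """Keep an alias only if exactly one model claims it."""
    claims: Dict[str, Set[str]] = defaultdict(set)
    for model_id, aliases in candidates.items():
        for alias in aliases:
            claims[alias].add(model_id)
    return {
        alias: next(iter(owners))
        for alias, owners in claims.items()
        # Skip shared aliases and anything that collides with a real model ID.
        if len(owners) == 1 and alias not in candidates
    }
```

With two hypothetical Gemma models installed, `gemma` and `gemma-3` are claimed by both and therefore skipped, while `gemma-3-4b` and `gemma-3-1b` still resolve.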


Running the Server

Production (Recommended)

gunicorn -w 1 -k gthread --threads 4 --timeout 300 -b 0.0.0.0:8000 api:app

Critical: Always use -w 1 (single worker). The NPU can only load one model at a time.

Critical: Always use -k gthread, NOT -k gevent. rkllm_run() is a blocking C call that freezes gevent's event loop.
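
The same flags can also live in a gunicorn configuration file, which makes the two critical settings harder to drop by accident. This is an optional sketch; the repository is not stated to ship such a file, so the filename is an assumption.

```python
# gunicorn.conf.py (hypothetical file, equivalent to the command above)
bind = "0.0.0.0:8000"
workers = 1                # single worker, as required above
worker_class = "gthread"   # real OS threads; do not use gevent
threads = 4
timeout = 300              # seconds
```

It would be used as `gunicorn -c gunicorn.conf.py api:app`.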

Development

python api.py

This starts Flask's built-in server on 0.0.0.0:8000 with threading enabled.

Systemd Service

The setup script creates this automatically. Manual setup:

[Unit]
Description=RKLLM API Server
After=network.target

[Service]
Type=simple
User=your-user
WorkingDirectory=/path/to/RKLLM-API-Server
ExecStart=/path/to/.venv/bin/gunicorn -w 1 -k gthread --threads 4 --timeout 300 -b 0.0.0.0:8000 api:app
Restart=always
RestartSec=5
Environment=RKLLM_API_LOG_LEVEL=INFO

[Install]
WantedBy=multi-user.target

Manage the service:

# Start/stop/restart
sudo systemctl start rkllm-api
sudo systemctl stop rkllm-api
sudo systemctl restart rkllm-api

# View logs
sudo journalctl -u rkllm-api -f

# Enable/disable auto-start on boot
sudo systemctl enable rkllm-api
sudo systemctl disable rkllm-api

API Endpoints

GET /v1/models

List all detected models with capabilities and context length.

curl http://localhost:8000/v1/models

Example response:

{
  "object": "list",
  "data": [
    {
      "id": "qwen3-1.7b",
      "object": "model",
      "created": 1738972800,
      "owned_by": "rkllm",
      "capabilities": ["instruct", "thinking"],
      "context_length": 4096
    },
    {
      "id": "gemma-3-4b-it",
      "object": "model",
      "created": 1738972800,
      "owned_by": "rkllm",
      "capabilities": ["instruct"],
      "context_length": 4096
    },
    {
      "id": "qwen3-vl-2b",
      "object": "model",
      "created": 1738972800,
      "owned_by": "rkllm",
      "capabilities": ["instruct", "vl"],
      "context_length": 4096
    }
  ]
}

Capability values:

| Capability | Meaning |
|---|---|
| thinking | Native <think> reasoning support (Qwen3, DeepSeek-R1, QwQ) |
| instruct | Instruction-tuned / chat model |
| vl | Vision-language model (image understanding) |
| ocr | Specialised for document OCR |
| base | Base / completion-only model (no chat template) |

POST /v1/chat/completions

OpenAI-compatible chat completions (streaming and non-streaming).

# Non-streaming
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-1.7b",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

# Streaming
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-1.7b",
    "stream": true,
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

Supported parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | required | Model ID or alias |
| messages | array | required | OpenAI messages format |
| stream | bool | false | Enable SSE streaming |
| max_tokens | int | 2048 | Max completion tokens |
| temperature | float | (ignored) | Accepted but has no effect; rkllm uses model-compiled sampling |
| top_p | float | (ignored) | Accepted but has no effect |
| frequency_penalty | float | (ignored) | Accepted but has no effect |
| presence_penalty | float | (ignored) | Accepted but has no effect |
| stream_options.include_usage | bool | false | Include token counts in stream |
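
On the client side, a streaming response can be consumed by parsing the SSE lines. The sketch below assumes the usual OpenAI chunk shape (`choices[0].delta`) plus the `reasoning_content` field described earlier; `iter_deltas` is a hypothetical helper, not part of this project.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_deltas(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (field, text) pairs from the lines of an SSE response body."""
    for raw in lines:
        line = raw.strip()
        # Skip blank separators and ":" comment lines (heartbeats).
        if not line or line.startswith(":") or not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue  # e.g. a usage-only chunk
        delta = choices[0].get("delta", {})
        for field in ("reasoning_content", "content"):
            if delta.get(field):
                yield field, delta[field]
```

The input can be any iterable of decoded text lines, for instance the lines read from an HTTP response opened with `stream: true`.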

POST /v1/models/select

Pre-load a model without generating (warm-up).

curl -X POST http://localhost:8000/v1/models/select \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-1.7b"}'

POST /v1/models/unload

Explicitly unload the current model to free NPU memory.

curl -X POST http://localhost:8000/v1/models/unload

GET /health

Health check endpoint.

curl http://localhost:8000/health

Example response:

{
  "status": "ok",
  "current_model": "qwen3-1.7b",
  "model_loaded": true,
  "vl_model": {
    "model": "qwen3-vl-2b",
    "encoder_loaded": true,
    "llm_loaded": true
  },
  "active_request": null,
  "models_available": 4
}

The vl_model field is null when no VL model is loaded.

GET /metrics

Prometheus metrics endpoint (only available when prometheus-flask-exporter is installed).

curl http://localhost:8000/metrics

Exposes counters, histograms, and gauges for tokens generated, prefill/decode duration, tokens per request, queue wait time, active requests, model load time, and current model state.


Open WebUI Configuration

Docker Setup

Open WebUI runs as a Docker container on the same Orange Pi (or any machine on the network). All optimized settings are hardcoded as environment variables so they persist across container recreations.

Option A: Docker Compose (recommended)

A docker-compose.yml is included in this repo with all settings pre-configured:

# Copy docker-compose.yml to the Orange Pi and start:
docker compose up -d

# Update to latest Open WebUI image:
docker compose pull && docker compose up -d

# Update with full backup + rollback (recommended):
# Uses the automated backup script — backs up DB + uploads,
# verifies integrity, pulls latest image, health checks,
# and auto-rolls back on failure. See Backup & Update section below.
/home/armbian/scripts/openwebui_full_backup_update.sh

# Full reset (deletes all data + settings, env vars re-apply):
docker compose down -v && docker compose up -d

Option B: Docker Run

Equivalent single command with all env vars:

docker run -d \
  --name open-webui \
  --restart always \
  --add-host=host.docker.internal:host-gateway \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=sk-unused \
  -e RAG_EMBEDDING_MODEL=BAAI/bge-small-en-v1.5 \
  -e RAG_RERANKING_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2 \
  -e RAG_EMBEDDING_BATCH_SIZE=10 \
  -e ENABLE_ASYNC_EMBEDDING=True \
  -e RAG_SYSTEM_CONTEXT=True \
  -e RAG_TOP_K=5 \
  -e RAG_TOP_K_RERANKER=3 \
  -e RAG_RELEVANCE_THRESHOLD=0.0 \
  -e ENABLE_RAG_HYBRID_SEARCH=True \
  -e ENABLE_RAG_HYBRID_SEARCH_ENRICHED_TEXTS=True \
  -e RAG_HYBRID_BM25_WEIGHT=0.1 \
  -e CHUNK_SIZE=1000 \
  -e CHUNK_OVERLAP=0 \
  -e CHUNK_MIN_SIZE_TARGET=400 \
  -e ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER=True \
  -e ENABLE_RETRIEVAL_QUERY_GENERATION=True \
  -e 'RAG_TEMPLATE=### Task:
Answer the user'"'"'s question using ONLY the provided context. Be thorough and detailed.

### Guidelines:
- If the answer is in the context, provide a comprehensive response with all relevant details.
- If the context doesn'"'"'t contain the answer, say so clearly.
- Respond in the same language as the user'"'"'s query.
- Do not use XML tags in your response.

<context>
{{CONTEXT}}
</context>

<user_query>
{{QUERY}}
</user_query>' \
  -e ENABLE_WEB_SEARCH=True \
  -e WEB_SEARCH_ENGINE=searxng \
  -e 'SEARXNG_QUERY_URL=http://host.docker.internal:8080/search?q=<query>' \
  -e WEB_SEARCH_RESULT_COUNT=5 \
  -e WEB_SEARCH_CONCURRENT_REQUESTS=3 \
  -e BYPASS_WEB_SEARCH_WEB_LOADER=True \
  -e BYPASS_WEB_SEARCH_EMBEDDING_AND_RETRIEVAL=True \
  -e FILE_IMAGE_COMPRESSION_WIDTH=672 \
  -e FILE_IMAGE_COMPRESSION_HEIGHT=672 \
  -e PDF_EXTRACT_IMAGES=True \
  -e ENABLE_CODE_EXECUTION=False \
  -e ENABLE_CODE_INTERPRETER=False \
  -e ENABLE_CHANNELS=True \
  -e ENABLE_MEMORIES=True \
  -e ENABLE_NOTES=True \
  -e ANONYMIZED_TELEMETRY=false \
  -e DO_NOT_TRACK=true \
  ghcr.io/open-webui/open-webui:main

Environment variables explained:

Connection:

| Variable | Value | Reason |
|---|---|---|
| OPENAI_API_BASE_URL | http://host.docker.internal:8000/v1 | Auto-connects to the RKLLM API server on the host. No manual UI setup needed; models appear immediately |
| OPENAI_API_KEY | sk-unused | The RKLLM server has no auth, but Open WebUI requires a non-empty key for OpenAI-compatible endpoints |

RAG Pipeline:

| Variable | Value | Reason |
|---|---|---|
| ENABLE_RETRIEVAL_QUERY_GENERATION | True | Enables retrieval query generation: Open WebUI sends a query gen request, which the API server shortcircuits instantly (~0ms) with a context-enriched query instead of wasting 5-10s on inference |
| RAG_SYSTEM_CONTEXT | True | Injects retrieved document/search content into the system message instead of the user message, enabling KV prefix caching for faster follow-up turns |
| RAG_EMBEDDING_MODEL | BAAI/bge-small-en-v1.5 | Best retrieval-quality embedding model that runs efficiently on ARM CPU (see Embedding Model below) |
| RAG_RERANKING_MODEL | cross-encoder/ms-marco-MiniLM-L-6-v2 | Lightweight cross-encoder reranker (22M params, ~88MB RAM) that re-scores the Top K results for much better precision. Open WebUI's sigmoid normalization is specifically designed for MS MARCO models |
| RAG_EMBEDDING_BATCH_SIZE | 10 | Processes 10 text chunks per embedding batch, which speeds up document ingestion without excessive memory use on ARM |
| ENABLE_ASYNC_EMBEDDING | True | Embeds documents asynchronously, which prevents blocking the UI during file uploads |
| ENABLE_RAG_HYBRID_SEARCH | True | Combines semantic (vector) + keyword (BM25) search for significantly better retrieval than vector-only |
| RAG_HYBRID_BM25_WEIGHT | 0.1 | 10% keyword / 90% semantic; heavily semantic-leaning since bge-small-en-v1.5 delivers strong retrieval. Higher values dilute precision |
| ENABLE_RAG_HYBRID_SEARCH_ENRICHED_TEXTS | True | Enriches the BM25 index with document filenames, titles, and section headers, improving keyword recall for metadata-based queries |
| ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER | True | Splits documents by Markdown headers (H1-H6) first, preserving document structure. The character splitter only runs as a secondary pass on oversized sections |
| RAG_RELEVANCE_THRESHOLD | 0.0 | Critical. Must be 0.0: higher values filter out valid results because cross-encoder sigmoid scores are often below 0.1 for cross-lingual or loosely-related content. The reranker handles quality filtering instead |
| CHUNK_SIZE | 1000 | Maximum characters per chunk. Balanced for 4K context models: large enough for coherent passages, small enough to fit multiple chunks |
| CHUNK_OVERLAP | 0 | Zero overlap: Chroma Research showed overlap actively hurts retrieval IoU by returning redundant tokens. With Hybrid Search, overlap is unnecessary |
| CHUNK_MIN_SIZE_TARGET | 400 | Merges tiny fragments (<400 chars) with neighbors, preventing low-quality micro-chunks. Works with the Markdown Header Splitter to reduce chunk count by up to 90% |
| RAG_TOP_K | 5 | Retrieves 5 candidate chunks, then the reranker narrows to the best 3 (Top K Reranker). Good funnel ratio for 4K context models |
| RAG_TOP_K_RERANKER | 3 | Reranker keeps the top 3 of the 5 retrieved chunks, so only the most relevant content reaches the model |
| RAG_TEMPLATE | (custom) | Custom reading-comprehension prompt that instructs the model to answer from context only. See docker-compose.yml for the full template |

Web Search (SearXNG):

Variable Value Reason
ENABLE_WEB_SEARCH True Enables the web search toggle in the chat UI
WEB_SEARCH_ENGINE searxng Uses the self-hosted SearXNG instance for privacy and JSON API support
SEARXNG_QUERY_URL http://host.docker.internal:8080/search?q=<query> SearXNG instance URL. Uses host.docker.internal to reach the host-side SearXNG container. Change if your SearXNG is on a different host or port
WEB_SEARCH_RESULT_COUNT 5 Number of search results to fetch. 5 gives good coverage — the API server's quality-floor filtering drops irrelevant results automatically
WEB_SEARCH_CONCURRENT_REQUESTS 3 Limits concurrent web search requests to 3 — prevents overwhelming SearXNG while keeping searches fast
BYPASS_WEB_SEARCH_WEB_LOADER True Uses search engine snippets instead of scraping full pages — cleaner, faster, and more reliable for small models
BYPASS_WEB_SEARCH_EMBEDDING_AND_RETRIEVAL True Sends search snippets directly to the model without embedding/retrieving — the API server builds its own optimized prompt internally

File Upload / Image Compression:

Variable Value Reason
FILE_IMAGE_COMPRESSION_WIDTH 672 Compresses uploaded images to 672px width — matches the active VL encoder resolution. Other options: 448 (faster, less detail), 896 (slower, more detail)
FILE_IMAGE_COMPRESSION_HEIGHT 672 Compresses uploaded images to 672px height — must match the width value. See Vision Encoder Resolution Comparison

Document Processing:

Variable Value Reason
PDF_EXTRACT_IMAGES True Extracts text from scanned images inside PDFs using OCR (Tesseract inside the container)

Code Execution:

Variable Value Reason
ENABLE_CODE_EXECUTION False Small NPU models (1.7B–4B) generate unreliable code — running it wastes time or produces wrong results
ENABLE_CODE_INTERPRETER False Same reason — disable to prevent unreliable code interpretation

Features:

Variable Value Reason
ENABLE_CHANNELS True Enables group chat channels
ENABLE_MEMORIES True Enables persistent user memories across conversations
ENABLE_NOTES True Enables the notes feature for saving snippets

Privacy:

Variable Value Reason
ANONYMIZED_TELEMETRY false Disables telemetry (optional, recommended for privacy)
DO_NOT_TRACK true Disables tracking (optional, recommended for privacy)

Port mapping: 3000:8080 — access Open WebUI at http://<device-ip>:3000. Change 3000 to any port you prefer.

--add-host flag (Linux-specific, required): On Linux, Docker does not resolve host.docker.internal by default — this is a Docker Desktop feature for macOS/Windows only. The --add-host=host.docker.internal:host-gateway flag maps it to the host's gateway IP, allowing the container to reach services running on the host (the RKLLM API server, Ollama, etc.). Without this flag, Open WebUI's default Ollama connection (http://host.docker.internal:11434) and any OpenAI connections using host.docker.internal will fail with ClientConnectorDNSError: Cannot connect to host host.docker.internal.

What's Hardcoded and Why

The goal is minimal user configuration — a fresh install should work correctly out of the box with zero manual settings. The docker-compose.yml file and database setup scripts together achieve this by hardcoding every setting that can be automated.

Three hardcoding layers are used:

| Layer | How | Survives Container Recreate | Survives Volume Delete |
| --- | --- | --- | --- |
| Docker env vars (docker-compose.yml) | PersistentConfig: the env var sets the initial default; the DB value takes precedence once changed in the UI | Yes | Yes (re-applies) |
| Database scripts (tests/set_model_prompts.py, tests/fix_owui_models.py) | Write directly to webui.db inside the container | Yes (data on Docker volume) | No (must re-run) |
| Admin UI only | No env var or script; must be configured manually | Yes (data on Docker volume) | No (must redo) |

Settings hardcoded via Docker env vars (auto-restore on fresh install):

Category Settings
Connection API base URL, API key
RAG pipeline Embedding model, reranking model, batch size, async embedding, hybrid search, BM25 weight, enriched texts, relevance threshold, top_k, top_k_reranker, custom RAG template
Chunking Chunk size, overlap, min size target, markdown header splitter
Document processing PDF image extraction via OCR (PDF_EXTRACT_IMAGES=True) — extracts text from scanned images inside PDFs
Web search Engine, SearXNG URL, result count, concurrent requests, bypass modes
File upload Image compression (672×672 for VL model — matches active encoder resolution)
Code execution Disabled (ENABLE_CODE_EXECUTION=False) — small NPU models (1.7B–4B) generate unreliable code
Code interpreter Disabled (ENABLE_CODE_INTERPRETER=False) — same reason; wastes time or produces wrong results
Features Channels, memories, notes
Privacy Telemetry disabled

Settings hardcoded via database scripts (must re-run after volume reset):

Script What it sets
tests/set_model_prompts.py System prompt on all models (date/time context)
tests/fix_owui_models.py Model capability flags (vision, image_gen, code_interpreter, etc.)

Settings only configurable via Admin UI (no env var available, must redo manually after volume reset):

Setting Current Value Where to Set
Web search domain filter list !reddit.com, !twitter.com, !x.com, !linkedin.com, !facebook.com, !instagram.com, !tripadvisor.com, !timeanddate.com Admin > Settings > Web Search > Domain Filter List
Model display order qwen3-1.7b, qwen3-4b, phi-3-mini, gemma-3, deepseek-r1:7b, qwen3:8b, qwen3-vl-2b Admin > Settings > Interface > Model Order
Prompt suggestions 16 custom suggestions (study, coding, travel, etc.) Admin > Settings > Interface > Prompt Suggestions

Full recovery after volume reset (docker compose down -v):

# 1. Re-create container (env vars auto-apply)
docker compose up -d

# 2. Create admin account in browser, then run DB scripts:
docker cp tests/set_model_prompts.py open-webui:/tmp/
docker exec open-webui python3 /tmp/set_model_prompts.py
docker cp tests/fix_owui_models.py open-webui:/tmp/
docker exec open-webui python3 /tmp/fix_owui_models.py

# 3. Re-configure Admin UI-only settings manually (domain filters, model order, prompt suggestions)

Connection

The RKLLM API server connection is auto-configured via OPENAI_API_BASE_URL and OPENAI_API_KEY env vars — no manual setup needed. Models appear in the dropdown immediately after container startup.

If you need to change the connection later: Admin > Settings > Connections > edit the OpenAI-compatible endpoint:

Setting Value
API Base URL http://host.docker.internal:8000/v1 (default via env var)
API Key sk-unused (default via env var)

Using Ollama Alongside (CPU Models)

Ollama can be installed on the same board and added as a second connection in Open WebUI. This gives you access to CPU-only models (e.g., larger models that don't have RKLLM conversions) alongside your NPU models — both appear in the model selector.

# Install Ollama on the same ARM board
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull gemma2:2b

Admin > Settings > Connections:

Add Ollama as an additional connection (don't remove the RKLLM one):

Setting Value
Ollama API URL http://localhost:11434

Both backends appear in Open WebUI's model dropdown:

  • NPU models (fast, via this RKLLM API server) — use for everyday chat and web search
  • CPU models (slower, via Ollama) — use for larger models or architectures not yet supported by RKLLM

Note: NPU and CPU inference don't conflict — they use different hardware. You can have an NPU model loaded via this server while Ollama runs a CPU model simultaneously.

Recommended Ollama models for RK3588:

# DeepSeek-R1 reasoning (works on CPU, broken on NPU — see Pre-Built Models note)
ollama pull deepseek-r1:7b

# Other useful CPU models
ollama pull gemma2:2b
ollama pull phi3:3.8b

CPU models (Ollama) do NOT need the NPU-specific settings below. The system prompt, disabled "Builtin Tools", and other restrictions apply only to small NPU models served by this RKLLM API.

System Prompt (Pre-Configured)

The system prompt is set at the model level in the database — it applies to all users automatically with zero user configuration required. New users get the correct prompt immediately without needing to set anything up.

Current prompt (set on all models):

Today is {{CURRENT_DATE}} ({{CURRENT_WEEKDAY}}), {{CURRENT_TIME}}. This is the ONLY correct current date. Ignore any conflicting dates from search results.

Open WebUI resolves the template variables server-side before sending to the model:

  • {{CURRENT_DATE}} → e.g. "February 10, 2026"
  • {{CURRENT_WEEKDAY}} → e.g. "Tuesday"
  • {{CURRENT_TIME}} → e.g. "14:30:00"

How it works: The prompt is stored in each model's params.system field in the database. Open WebUI injects it server-side on every request via apply_system_prompt_to_body(). This is enforced regardless of which user sends the message.
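The substitution step can be pictured with a short sketch. This is not Open WebUI's code: the function name `resolve_prompt_variables` is hypothetical, and the output formats are only inferred from the examples listed above.

```python
from datetime import datetime

def resolve_prompt_variables(template: str, now: datetime) -> str:
    """Substitute the three date/time placeholders used in the system prompt."""
    values = {
        "{{CURRENT_DATE}}": f"{now:%B} {now.day}, {now.year}",
        "{{CURRENT_WEEKDAY}}": f"{now:%A}",
        "{{CURRENT_TIME}}": f"{now:%H:%M:%S}",
    }
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template

prompt = "Today is {{CURRENT_DATE}} ({{CURRENT_WEEKDAY}}), {{CURRENT_TIME}}."
resolved = resolve_prompt_variables(prompt, datetime(2026, 2, 10, 14, 30, 0))
print(resolved)  # Today is February 10, 2026 (Tuesday), 14:30:00.
```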

To change the prompt: Edit SYSTEM_PROMPT in tests/set_model_prompts.py and re-run:

# Edit the prompt in the script, then:
docker cp tests/set_model_prompts.py open-webui:/tmp/
docker exec open-webui python3 /tmp/set_model_prompts.py

Users can optionally add their own prompt in Settings > General > System Prompt. If set, both prompts are sent (they stack). Leave the user-level prompt empty to use only the model-level default.

Why model-level instead of user-level? User-level prompts must be configured manually by each user — if a new user joins and forgets to set it, models won't know today's date. Model-level prompts are server-enforced, zero-configuration, and apply to everyone.

Why "Ignore any conflicting dates"? Web search results often contain stale "current date" claims from cached pages (e.g. a time zone site showing "Today is October 26, 2025"). Small models (1.7B) can latch onto these and output the wrong date. This instruction, combined with the API server's date cleanup (see below), significantly reduces false dates.

Date Accuracy for Web Search

When web search results contain date/time information, three layers work together to help the model use the correct date:

  1. System prompt — Explicitly states the current date with "This is the ONLY correct current date"
  2. Stale date cleanup (api.py) — The API server automatically strips misleading "current date is X" and "today is Y" claims from web search snippets before they reach the model. Only factual date references (like DST transition dates) are preserved.
  3. Date anchor injection (api.py) — A [Current date: February 10, 2026. Any conflicting dates below are outdated.] line is prepended to all RAG context, placing the correct date immediately before the web content.

These are pure preprocessing steps — zero inference overhead. They significantly improve date accuracy, especially on larger models (4B+). The 1.7B model may still occasionally get confused with heavily date-laden content; for time-sensitive web search queries, prefer the 4B model.
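Layers 2 and 3 are plain string processing, which a minimal sketch can illustrate. The regular expression and the function name `clean_and_anchor` are assumptions made for this example; the actual patterns in api.py are not shown here and are likely broader.

```python
import re

# Hypothetical pattern: a sentence that claims to state the current date.
STALE_DATE_CLAIM = re.compile(
    r"[^.!?\n]*\b(?:today is|current date is)\b[^.!?\n]*[.!?]?\s*",
    re.IGNORECASE,
)

def clean_and_anchor(snippet: str, current_date: str) -> str:
    """Strip stale 'today is ...' claims, then prepend the date anchor line."""
    cleaned = STALE_DATE_CLAIM.sub("", snippet).strip()
    anchor = f"[Current date: {current_date}. Any conflicting dates below are outdated.]"
    return f"{anchor}\n{cleaned}"

snippet = "Today is October 26, 2025. Daylight saving time ends on November 2, 2025."
print(clean_and_anchor(snippet, "February 10, 2026"))
```

The factual DST sentence survives because it does not match the claim pattern; only the sentence asserting the current date is removed.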

Web Search (SearXNG)

Web search is auto-configured via Docker env vars — no manual UI setup needed. The search icon appears in the chat UI immediately.

Prerequisite: SearXNG must be running as a Docker container named searxng on the same host (see SearXNG Configuration below). If your SearXNG container has a different name or IP, update the SEARXNG_QUERY_URL env var.

All web search settings are hardcoded (see Docker Setup env var table above). To verify or adjust: Admin > Settings > Web Search.

Setting Value Hardcoded
Web Search ON ENABLE_WEB_SEARCH=True
Search Engine searxng WEB_SEARCH_ENGINE=searxng
SearXNG Query URL http://host.docker.internal:8080/search?q=<query> SEARXNG_QUERY_URL
Result Count 5 WEB_SEARCH_RESULT_COUNT=5
Bypass Web Loader ON BYPASS_WEB_SEARCH_WEB_LOADER=True
Bypass Embedding & Retrieval ON BYPASS_WEB_SEARCH_EMBEDDING_AND_RETRIEVAL=True

Why Bypass Web Loader? Search engine snippets are cleaner and faster than raw page scraping. Small models handle structured snippets better than noisy full-page HTML.

Why Bypass Embedding & Retrieval? The API server builds its own optimized reading-comprehension prompt internally. Sending snippets directly avoids unnecessary embedding overhead.

Embedding Model Recommendation

The embedding model determines how well Open WebUI finds the right document chunks when you ask a question. This runs on CPU (not NPU), so it needs to be small enough for ARM hardware.

Recommended: BAAI/bge-small-en-v1.5 — set via the Docker RAG_EMBEDDING_MODEL env var above.

| Model | Params | MTEB Avg | Retrieval Score | RAM | Verdict |
| --- | --- | --- | --- | --- | --- |
| BAAI/bge-small-en-v1.5 | 33M | 62.17 | 51.68 | ~150 MB | Best for RAG on ARM |
| sentence-transformers/all-MiniLM-L6-v2 | 22.7M | 56.08 | 41.95 | ~90 MB | Decent but lower retrieval quality |
| minishlab/potion-base-8M | 8M | 50.54 | 31.71 | ~30 MB | Ultra-fast but poor retrieval (wrong chunks lead to worse answers) |
| TaylorAI/bge-micro-v2 | ~4M | ~45 | ~28 | ~16 MB | Too small for reliable RAG |
| BAAI/bge-base-en-v1.5 | 109M | 63.55 | 53.25 | ~450 MB | Marginal gain, 3× more RAM; overkill for ARM |

Why not a faster/smaller model? Embedding speed is not the bottleneck: embedding 5 chunks takes <100ms even with transformer models on ARM CPU, while NPU generation at 13 tok/s dominates response time. Trading retrieval quality for embedding speed is a bad trade, because the model gets worse context and gives worse answers.

Changing the model: If you switch embedding models, go to Admin > Settings > Documents > Danger Zone and click Reindex Knowledge Base Vectors to re-embed all existing documents with the new model.

Document RAG Settings (Recommended for PDF/Document Upload)

Admin > Settings > Documents:

These settings control how Open WebUI chunks, embeds, and retrieves uploaded documents. Most are hardcoded via Docker env vars (see Docker Setup) so they persist across container recreations. The values below are tuned for 1.5-4B parameter models on constrained ARM hardware, backed by Chroma Research and Open WebUI best practices:

Setting Value Hardcoded Reason
Text Splitter Default (Character) default RecursiveCharacterTextSplitter outperforms TokenTextSplitter (Chroma Research). Tokenizer-agnostic — no mismatch between tiktoken and BERT tokenizer
Markdown Header Splitter ON ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER=True Splits by H1-H6 headers first, preserving document structure. Character splitter only runs on oversized sections
Chunk Size 1000 CHUNK_SIZE=1000 ~200-250 tokens. Chroma Research found 200 tokens optimal; 1000 chars is a good character equivalent
Chunk Overlap 0 CHUNK_OVERLAP=0 Overlap actively hurts retrieval IoU (Chroma Research). With Hybrid Search + BM25, overlap is unnecessary
Min Chunk Size Target 400 CHUNK_MIN_SIZE_TARGET=400 Merges tiny fragments with neighbors, reducing chunk count by up to 90% while improving accuracy
Embedding Model BAAI/bge-small-en-v1.5 RAG_EMBEDDING_MODEL 62.17 MTEB avg, 33M params, ~150MB RAM (see Embedding Model)
Reranking Model cross-encoder/ms-marco-MiniLM-L-6-v2 RAG_RERANKING_MODEL 22M params, ~88MB RAM. Re-scores Top K candidates for much better precision. Sigmoid normalization built-in for MS MARCO models
Top K 5 RAG_TOP_K=5 Retrieves 5 chunks, reranker narrows to best 3. Good funnel ratio
Top K Reranker 3 RAG_TOP_K_RERANKER=3 Keeps the 3 highest-scored chunks after reranking
Full Context Mode OFF default Injecting the entire document overflows the 4K context window
Hybrid Search ON ENABLE_RAG_HYBRID_SEARCH=True Combines semantic (vector) + keyword (BM25) search
Enrich Hybrid Search Text ON ENABLE_RAG_HYBRID_SEARCH_ENRICHED_TEXTS=True Enriches BM25 index with filenames, titles, and section headers
BM25 Weight 0.1 RAG_HYBRID_BM25_WEIGHT=0.1 10% keyword / 90% semantic — heavily semantic-leaning since bge-small-en-v1.5 delivers strong retrieval
Relevance Threshold 0 RAG_RELEVANCE_THRESHOLD=0.0 Must be 0. Cross-encoder sigmoid scores are often below 0.1 for valid content — any threshold filters out real results. Let the reranker handle quality
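A small worked example shows why a non-zero threshold is risky, assuming (as the table describes) that the reranker's raw output is a logit passed through a sigmoid. The logit values below are illustrative, not measurements from this setup.

```python
import math

def sigmoid(logit: float) -> float:
    """Map a raw cross-encoder logit to a 0..1 relevance score."""
    return 1.0 / (1.0 + math.exp(-logit))

candidate_logits = [-4.0, -2.5, 0.0, 3.0]  # illustrative values only
scores = [sigmoid(x) for x in candidate_logits]

kept_at_0_1 = [s for s in scores if s >= 0.1]  # a threshold of 0.1 drops two candidates
kept_at_0_0 = [s for s in scores if s >= 0.0]  # a threshold of 0.0 keeps all four

for logit, score in zip(candidate_logits, scores):
    print(f"logit {logit:+.1f} -> score {score:.3f}")
```

Any moderately negative logit lands below 0.1, so a chunk the reranker still ranks in its top 3 could be discarded by the threshold before it reaches the model.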

RAG Template: Use the default template (clear the field) — it includes inline citation support with [id] format and comprehensive guidelines. The API server's RAG pipeline works with the default template.

Image Compression: Set to 672x672 to match the active VL encoder resolution. Available resolutions: 448 (fast/low detail), 672 (balanced, default), 896 (slow/high detail). Change via Admin > Settings > Documents or the tests/owui_set_compression.py script.

After changing settings: Click "Reindex Knowledge Base Vectors" at the bottom of the Documents page to rebuild all embeddings with the new chunking/embedding configuration.

VL / Image Upload Settings

For vision-language (VL) models like Qwen3-VL-2B to work with Open WebUI image uploads and OCR:

Admin > Settings > Images:

Setting Value Reason
Image Generation (Engine) OFF (leave unset) Do NOT enable image generation — it interferes with image upload routing. VL/OCR is handled by the chat API, not the image generation pipeline

Workspace > Models > Edit (for all NPU models):

Setting Value Reason
Vision capability ON Enables the image upload button in chat for every model. The API server auto-routes image requests to the VL pipeline regardless of which model is selected — users don't need to manually switch to the VL model
Builtin Tools OFF Small NPU models (1.5B–4B) cannot do function-calling

How VL works: When you upload an image in chat, Open WebUI sends it as a base64-encoded image_url in the OpenAI multimodal content array format. The RKLLM API server auto-detects image content and routes the request to the VL model pipeline (vision encoder → NPU embedding → LLM decoder). No special configuration is needed beyond enabling the Vision capability on the model.
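The auto-detection step can be sketched as a scan over the OpenAI-style message list. The function name `has_image_content` is hypothetical; only the message shape (a content array with `image_url` parts) follows the OpenAI multimodal format mentioned above.

```python
def has_image_content(messages: list[dict]) -> bool:
    """Return True if any message carries an image_url content part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # multimodal messages use a list of parts
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False

text_only = [{"role": "user", "content": "Hello"}]
with_image = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What does this sign say?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
    ],
}]
print(has_image_content(text_only), has_image_content(with_image))
```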

Supported image formats: JPEG, PNG, WebP, BMP, GIF (first frame). Images are automatically resized to the VL encoder's input resolution.

Tip: For OCR tasks (extracting text from screenshots, documents, photos), use the VL model (Qwen3-VL-2B). Simply upload an image in chat — it auto-routes to the VL pipeline.

Per-Model Capabilities (Required)

Workspace > Models > Edit > Capabilities — configure for each NPU model:

Setting Value Reason
Vision ON Enables image upload button in chat. The API server auto-routes images to the VL model — set this on all models, not just the VL model
Builtin Tools OFF ⚠️ Required. Small NPU models (1.5B-4B) cannot do function-calling. Leaving this on injects tool-use instructions that confuse the model
File Context ON Enables document and search result injection

Interface Settings (Recommended)

Admin > Settings > Interface > Generation Settings:

Setting Value Reason
Show Generation Settings OFF The RKLLM runtime handles sampling internally. UI sliders are ignored by the API

Home Assistant Integration

The API server works as a conversation agent for Home Assistant via the Extended OpenAI Conversation HACS integration. This enables voice and text control of smart home devices using your local NPU.

Setup

  1. Install HACS in Home Assistant if not already installed
  2. Install Extended OpenAI Conversation from HACS
  3. Add the integration in Settings > Devices & Services:

     | Field | Value |
     | --- | --- |
     | Name | RKLLM Orange Pi (or any name) |
     | API Key | sk-no-key-required (any dummy value) |
     | Base Url | http://<ORANGE_PI_IP>:8000/v1 |
     | Skip Authentication | Checked |
     | Api Provider | OpenAI |

  4. Configure the conversation agent:

     | Setting | Recommended Value |
     | --- | --- |
     | chat_model | qwen3-1.7b (fast) or qwen3-4b-instruct-2507 (smarter) |
     | Max tokens | 2048 |
     | Temperature | 0.3 (low for reliable device control) |
     | Top P | 0.9 |
     | Max function calls | 3 |
     | Context Threshold | 3500 (for 1.7B) or 13000 (for 4B) |

  5. Create a Voice Assistant in Settings > Voice Assistants using the new conversation agent
  6. Expose entities in the Expose tab (start with 5-10 lights/switches)

Model Choice

| Model | Prefill | Generate | Total | Best For |
| --- | --- | --- | --- | --- |
| qwen3-1.7b | ~3s | ~0.5s | ~3.5s | Simple commands, fast response |
| qwen3-4b-instruct-2507 | ~19s | ~8s | ~27s | Complex queries, more entities |

The 1.7B model is recommended for HA: simple commands like "turn off the living room light" work reliably and respond in under 5 seconds. Keep exposed entities under 20 to stay within the 4096-token context window.

Automatic Optimizations

The API server detects Home Assistant requests automatically and disables thinking for them, skipping <think> reasoning tokens to cut latency significantly:

  • Detection is based on system prompt signatures (smart home manager of home assistant, available devices:, execute_services function)
  • Open WebUI and other clients are not affected
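A minimal sketch of signature-based detection follows. The function name and the rule that any single signature is enough are assumptions; the signature strings are the ones listed above.

```python
HA_SIGNATURES = (
    "smart home manager of home assistant",
    "available devices:",
    "execute_services function",
)

def is_home_assistant_request(messages: list[dict]) -> bool:
    """Check system messages for Home Assistant prompt signatures (case-insensitive)."""
    system_text = " ".join(
        m["content"] for m in messages
        if m.get("role") == "system" and isinstance(m.get("content"), str)
    ).lower()
    # Assumption: one matching signature is sufficient.
    return any(signature in system_text for signature in HA_SIGNATURES)

ha_request = [
    {"role": "system", "content": "I want you to act as smart home manager of Home Assistant."},
    {"role": "user", "content": "Turn off the living room light"},
]
chat_request = [{"role": "system", "content": "You are a helpful assistant."}]
print(is_home_assistant_request(ha_request), is_home_assistant_request(chat_request))
```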

Limitations

  • Complex multi-step commands may fail on 1.7B (e.g., "turn off all lights except the kitchen")
  • Entity count affects prompt size — more entities = slower prefill and less room for response
  • No native tool calling — relies on Extended OpenAI Conversation's prompt-based function calling

SearXNG Configuration

The included settings.yml is optimized for Open WebUI on ARM hardware. Key settings:

use_default_settings:
  engines:
    keep_only:
      - google
      - google news
      - duckduckgo
      - bing
      - brave
      - wikipedia

search:
  formats:
    - html
    - json    # REQUIRED for Open WebUI API access

Installation:

cp settings.yml ~/Downloads/searxng-docker/searxng/settings.yml
cd ~/Downloads/searxng-docker
docker compose down && docker compose up -d

Backup & Update

Automated Weekly Backup + Update

A cron job runs every Sunday at 3 AM to back up Open WebUI data and pull the latest image:

0 3 * * 0  /home/armbian/scripts/openwebui_full_backup_update.sh >> /var/log/openwebui_backup.log 2>&1

What the script does:

  1. Stops the Open WebUI container
  2. Backs up webui.db (SQLite) and uploads/ + vector_db/ directories
  3. Verifies backup integrity (sqlite3 PRAGMA integrity_check + tar -tzf)
  4. Rotates backups — keeps the last 5 (oldest deleted automatically)
  5. Pulls the latest Open WebUI image
  6. Recreates the container via docker compose up -d
  7. Runs an HTTP health check (up to 180 s timeout)
  8. On failure: auto-rollback — restores DB + files, tags the old image, restarts with it
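The real script is a shell script and is not reproduced here. As an illustration only, steps 3 and 4 could look like this in Python; the function names and the timestamped archive naming are assumptions.

```python
import sqlite3
import tempfile
from pathlib import Path

def sqlite_is_healthy(db_path) -> bool:
    """Run PRAGMA integrity_check; a healthy database returns a single 'ok' row."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]

def rotate_backups(backup_dir, keep: int = 5) -> list[str]:
    """Delete all but the newest `keep` archives; names are assumed to sort by date."""
    backups = sorted(Path(backup_dir).glob("*.tar.gz"), key=lambda p: p.name)
    stale = backups[:-keep] if keep > 0 else backups
    for old in stale:
        old.unlink()
    return [p.name for p in stale]

# Demo in a throwaway directory: seven archives, keep the last five.
with tempfile.TemporaryDirectory() as tmp:
    for day in range(1, 8):
        (Path(tmp) / f"openwebui_2026-02-{day:02d}.tar.gz").touch()
    removed = rotate_backups(tmp, keep=5)
    healthy = sqlite_is_healthy(Path(tmp) / "webui.db")
print(removed, healthy)
```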

Backup location: /home/armbian/backups/openwebui/

Manual trigger:

/home/armbian/scripts/openwebui_full_backup_update.sh

Rollback Support

The docker-compose.yml uses an environment variable for the image tag so the rollback path can override it:

image: ${OPEN_WEBUI_IMAGE:-ghcr.io/open-webui/open-webui:main}

During normal operation OPEN_WEBUI_IMAGE is unset, so Docker pulls main. During rollback the script sets OPEN_WEBUI_IMAGE to the previously-tagged image and runs docker compose up -d, which picks up the old version.

Model Enforcement

A separate hourly cron runs enforce-owui-models.sh to ensure the Open WebUI model list stays in sync with the RKLLM API server:

0 * * * *  /home/armbian/scripts/enforce-owui-models.sh >> /var/log/enforce-owui-models.log 2>&1

RAG Pipeline

When Open WebUI performs a web search or retrieves document chunks, the results are injected into the system message as <source> tags (or via a custom RAG template). The server detects this and activates a specialized RAG pipeline:

Processing Steps

  1. Detection: <source> tags in the system message trigger RAG mode
  2. Extraction — Content extracted from between <source>...</source> tags
  3. Web Content Cleaning (5-pass):
    • Pass 0: Strip misleading "current date/time" claims from cached web pages
    • Pass 1: Remove known boilerplate phrases (cookies, sign-in, privacy policy, etc.)
    • Pass 2: Remove navigation patterns (CamelCase runs, title-case-heavy lines, URL clusters)
    • Pass 3: Collapse consecutive short-line menus (4+ short lines = navigation)
    • Pass 4: Keep only lines with data signals (digits, prose punctuation, ≥40 chars)
  4. Deduplication — Exact prefix key + Jaccard word-similarity removal
  5. Score-based selection — jusText-inspired paragraph scoring:
    • Stopword density (prose ≥ 30%, boilerplate < 15%)
    • Length, sentence count, data presence
    • Query keyword matching (3x weight)
    • Negative signals: short fragments, navigation patterns, boilerplate keywords
  6. Quality floor — If best paragraph scores below threshold, RAG is dropped entirely
  7. Prompt construction — SQuAD-style reading comprehension format:
    {reference data}
    
    According to the above, {question}. Answer in detail with specific facts and examples
    
  8. Summarization boost — When the query contains "summarize", "summary", "overview", or "outline", a stronger instruction is appended: "Cover all major points, sections, and key details. Use multiple paragraphs."
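Steps 4 and 7 can be sketched compactly. This is an illustration under assumptions, not the code in api.py: the helper names and the 60-character prefix key are hypothetical, while the 0.70 Jaccard threshold and the prompt wording come from this README.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two paragraphs."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def dedupe(paragraphs, threshold=0.70, prefix_len=60):
    """Drop exact-prefix duplicates, then near-duplicates by Jaccard similarity."""
    kept, seen_prefixes = [], set()
    for para in paragraphs:
        key = para[:prefix_len].lower()
        if key in seen_prefixes or any(jaccard(para, k) >= threshold for k in kept):
            continue
        seen_prefixes.add(key)
        kept.append(para)
    return kept

def build_rag_prompt(paragraphs, question):
    """Assemble the reading-comprehension prompt: reference data first, question last."""
    reference = "\n\n".join(paragraphs)
    return (f"{reference}\n\nAccording to the above, {question}. "
            "Answer in detail with specific facts and examples")

paragraphs = [
    "Paris is the capital of France and has about two million residents.",
    "Paris is the capital of France and has about two million inhabitants.",
    "The Eiffel Tower was completed in 1889.",
]
unique = dedupe(paragraphs)
print(build_rag_prompt(unique, "what is the capital of France"))
```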

Follow-Up Detection (3 Layers)

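Open WebUI searches SearXNG with the raw user message, so short follow-ups (for example "yes" or "tell me more") return irrelevant results. The server therefore runs three checks to decide whether the retrieved content should be used: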

Layer Trigger Example
Layer 0: Document-referential bypass Query contains document-related words/phrases — forces RAG mode "summarize this", "the attached file"
Layer 1: Word list Exact match to known conversational words "yes", "thanks", "tell me more"
Layer 2: Zero-overlap check Zero query content-words found in reference text (w/ conversation history) Off-topic follow-up after RAG

Layer 0 fires first and overrides the other layers (document-referential queries always use RAG). When Layer 1 or 2 fires, RAG is skipped and the model uses normal conversation mode.
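The layer ordering can be sketched as follows. The word lists are only the examples from the table, the stopword set is a placeholder, and treating the conversation history as extra query words in Layer 2 is one possible reading of "w/ conversation history"; all of these are assumptions.

```python
DOCUMENT_PHRASES = ("summarize this", "the attached file")  # Layer 0 examples
CONVERSATIONAL = {"yes", "thanks", "tell me more"}          # Layer 1 examples
STOPWORDS = {"the", "a", "an", "is", "of", "to", "in", "and",
             "what", "how", "about", "it", "its"}           # placeholder list

def content_words(text: str) -> set[str]:
    words = {w.strip(".,!?\"'").lower() for w in text.split()}
    return words - STOPWORDS - {""}

def should_use_rag(query: str, reference_text: str, history: str = "") -> bool:
    q = query.strip().lower()
    if any(phrase in q for phrase in DOCUMENT_PHRASES):
        return True   # Layer 0: document-referential queries always use RAG
    if q.rstrip(".!?") in CONVERSATIONAL:
        return False  # Layer 1: known conversational follow-up
    # Layer 2: skip RAG when no content word overlaps the reference text
    return bool(content_words(query + " " + history) & content_words(reference_text))

reference = "Paris weather forecast: sunny"
print(should_use_rag("What is the weather in Paris?", reference))
print(should_use_rag("Do you like pizza?", reference))
```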

Multi-Turn Conversation History

The server preserves full conversation context across turns within a chat session. Open WebUI sends the entire message history (system, user, and assistant messages) with each request, and the server formats them into a multi-turn prompt:

User: What is the capital of France?
Assistant: The capital of France is Paris.
User: What is its population?

The model sees all previous turns and can answer follow-up questions in context (e.g., "its" refers to Paris). With KV cache incremental mode, only the new user message is prefilled — prior turns are already in the NPU's KV cache.
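As a rough sketch, flattening an OpenAI-style message list into the transcript shown above might look like this. The function name is hypothetical, and the server may apply a model-specific chat template on top of this format.

```python
def format_history(messages: list[dict]) -> str:
    """Render user/assistant messages as 'User: ...' / 'Assistant: ...' lines."""
    labels = {"user": "User", "assistant": "Assistant"}
    lines = [
        f"{labels[m['role']]}: {m['content']}"
        for m in messages
        if m["role"] in labels  # system messages are handled separately
    ]
    return "\n".join(lines)

messages = [
    {"role": "system", "content": "Today is February 10, 2026."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What is its population?"},
]
print(format_history(messages))
```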

Response Cache

RAG responses are cached in an LRU cache (key: model + question hash) to avoid redundant NPU inference:

| Setting | Default | Description |
| --- | --- | --- |
| RAG_CACHE_TTL | 300s | Cache lifetime |
| RAG_CACHE_MAX_ENTRIES | 50 | Max cached responses |
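A minimal sketch of an LRU cache with a TTL, keyed by model plus a hash of the question, is shown below. The class name, the SHA-256 choice, and the lowercase normalization are assumptions; only the two settings and the key composition come from this README.

```python
import hashlib
import time
from collections import OrderedDict

class ResponseCache:
    def __init__(self, ttl=300.0, max_entries=50, clock=time.monotonic):
        self.ttl, self.max_entries, self.clock = ttl, max_entries, clock
        self._entries = OrderedDict()  # key -> (stored_at, response)

    @staticmethod
    def make_key(model: str, question: str) -> str:
        digest = hashlib.sha256(question.strip().lower().encode("utf-8")).hexdigest()
        return f"{model}:{digest}"

    def get(self, model, question):
        key = self.make_key(model, question)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self.clock() - stored_at > self.ttl:  # expired: drop and report a miss
            del self._entries[key]
            return None
        self._entries.move_to_end(key)           # mark as most recently used
        return response

    def put(self, model, question, response):
        key = self.make_key(model, question)
        self._entries[key] = (self.clock(), response)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)    # evict least recently used

cache = ResponseCache()
cache.put("qwen3-1.7b", "What is RKLLM?", "cached answer")
print(cache.get("qwen3-1.7b", "what is rkllm?"))
```

Passing the clock as a parameter keeps the expiry logic testable without sleeping.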

Reasoning Models

Models like Qwen3 output chain-of-thought wrapped in <think>...</think> tags.

The server:

  • Parses these tags from the token stream using a state machine (handles tags split across chunks)
  • Sends reasoning_content in streaming deltas (Open WebUI displays these as collapsible thinking blocks)
  • Returns reasoning_content in non-streaming responses
  • Strips thinking blocks from history: prior assistant responses have <think>...</think> removed before being re-sent to the model, per Qwen3 docs ("historical output should only include the final output part"). This saves tokens and prevents the model from mimicking its own chain-of-thought
  • Disables thinking for RAG on small-context models (< 8k) via enable_thinking = false, saving tokens for the actual answer
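The tag-splitting problem is the subtle part, so here is a minimal state-machine sketch. The class name and the (kind, text) output shape are assumptions; the idea is to hold back any trailing text that could be the start of a tag until the next chunk arrives.

```python
class ThinkParser:
    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.in_think = False
        self.buffer = ""

    def feed(self, chunk: str) -> list[tuple[str, str]]:
        """Return (kind, text) pairs, where kind is 'reasoning' or 'content'."""
        self.buffer += chunk
        out = []
        while self.buffer:
            tag = self.CLOSE if self.in_think else self.OPEN
            kind = "reasoning" if self.in_think else "content"
            idx = self.buffer.find(tag)
            if idx != -1:  # complete tag found: emit text before it, flip state
                if idx:
                    out.append((kind, self.buffer[:idx]))
                self.buffer = self.buffer[idx + len(tag):]
                self.in_think = not self.in_think
                continue
            hold = 0       # length of a trailing partial tag to keep buffered
            for n in range(min(len(tag) - 1, len(self.buffer)), 0, -1):
                if self.buffer.endswith(tag[:n]):
                    hold = n
                    break
            cut = len(self.buffer) - hold
            if cut:
                out.append((kind, self.buffer[:cut]))
            self.buffer = self.buffer[cut:]
            break
        return out

    def flush(self) -> list[tuple[str, str]]:
        """Emit whatever is still buffered when the stream ends."""
        kind = "reasoning" if self.in_think else "content"
        rest, self.buffer = self.buffer, ""
        return [(kind, rest)] if rest else []

parser = ThinkParser()
pieces = []
for chunk in ["<th", "ink>Let me", " think</thi", "nk>Paris"]:
    pieces.extend(parser.feed(chunk))
pieces.extend(parser.flush())
reasoning = "".join(text for kind, text in pieces if kind == "reasoning")
content = "".join(text for kind, text in pieces if kind == "content")
print(repr(reasoning), repr(content))
```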

Note: DeepSeek-R1 is currently not usable on the NPU with RKLLM Runtime v1.2.3 (produces [PAD] garbage tokens). Use Qwen3-1.7B for NPU reasoning, or run deepseek-r1:7b via Ollama on CPU. See the Pre-Built Models section for details.


KV Cache Strategy

The NPU runtime maintains an internal KV cache. With keep_history=1, prior conversation turns are preserved, so follow-up messages only need to prefill the new tokens:

| Scenario | Strategy | Prefill Time | What's Sent |
| --- | --- | --- | --- |
| New conversation | clear_kv_cache() + keep_history=1 | ~90ms (full) | Full prompt |
| Follow-up turn | keep_history=1 | ~50ms (incremental) | Only new user message |
| RAG query | keep_history=0 | ~90ms (full) | RAG context + question |
| Model switch | New model loaded | ~90ms (full) | Full prompt |

How It Works

  1. First turn — The server calls rkllm_clear_kv_cache() then sends the full prompt with keep_history=1. After generation, the KV cache contains the full conversation.
  2. Follow-up turns — The server compares the list of prior user messages against what the KV cache already contains. If the lists match (same conversation, same model), only the new user message is sent with keep_history=1. The NPU appends to the existing KV cache.
  3. New conversation — List mismatch triggers rkllm_clear_kv_cache() + full prompt resend.
  4. RAG queries — Always use keep_history=0 (standalone, no history needed).

This makes multi-turn conversations significantly faster: from turn 2 onward, prefill takes ~50ms regardless of total conversation length.
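The decision logic in the four steps above can be sketched as a small tracker. The class name, the returned plan dictionary, and the choice to forget the tracked conversation after a RAG query are assumptions; the real server calls into librkllmrt.so rather than returning a plan.

```python
class KVCacheTracker:
    """Remember which user messages the NPU's KV cache is assumed to contain."""

    def __init__(self):
        self.model = None
        self.cached_user_messages = []

    def plan(self, model: str, messages: list[dict], is_rag: bool = False) -> dict:
        # Assumes the request contains at least one user message.
        users = [m["content"] for m in messages if m["role"] == "user"]
        if is_rag:
            # Assumption: a standalone RAG query invalidates the tracked conversation.
            self.model, self.cached_user_messages = None, []
            return {"clear_kv_cache": False, "keep_history": 0, "send": "rag_prompt"}
        prior, new = users[:-1], users[-1]
        incremental = bool(prior) and model == self.model and prior == self.cached_user_messages
        self.model, self.cached_user_messages = model, users
        if incremental:
            return {"clear_kv_cache": False, "keep_history": 1, "send": new}
        return {"clear_kv_cache": True, "keep_history": 1, "send": "full_prompt"}

tracker = KVCacheTracker()
turn1 = [{"role": "user", "content": "Hi"}]
turn2 = turn1 + [{"role": "assistant", "content": "Hello!"},
                 {"role": "user", "content": "How are you?"}]
first = tracker.plan("qwen3-1.7b", turn1)
second = tracker.plan("qwen3-1.7b", turn2)
print(first, second)
```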

Prompt Cache Preloading

When PROMPT_CACHE_ENABLED = True (default), the server saves the KV state to disk after the first inference on a freshly loaded model. On subsequent model loads (e.g., after a model swap or service restart), this cache is restored automatically, pre-populating the system prompt tokens so the first turn starts faster.

  • Cache file: <model_dir>/prompt_cache.bin (e.g., ~/models/Qwen3-1.7B/prompt_cache.bin)
  • Save: Triggered on the first KV reset (new conversation) after model load, if no cache file exists yet
  • Load: Called automatically during load_model() if a cache file is found
  • Uses the RKLLM SDK's rkllm_load_prompt_cache() / rkllm_release_prompt_cache() API
  • Graceful fallback: if the SDK version doesn't support the cache API, the feature is silently disabled

Context-Aware Sliding Window

When conversation history exceeds the model's context window, the server automatically trims the oldest turns to make room. This prevents context overflow errors while preserving the most recent exchange:

  • Reserves HISTORY_CONTEXT_RESERVE (35%) of context for the current turn + generation output
  • Caps each prior assistant message at ASSISTANT_HISTORY_CAP (1500 chars) to prevent single long responses from dominating history
  • Trims from the oldest turn first, always keeping at least the most recent user/assistant pair
  • Strips <think>...</think> blocks from assistant history before inclusion (per Qwen3 guidelines)
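A simplified sketch of the trimming rules follows. It measures the budget in characters for simplicity, which is an assumption (the server may count tokens), and the function name is hypothetical; the two constants match the defaults in this README.

```python
import re

HISTORY_CONTEXT_RESERVE = 0.35
ASSISTANT_HISTORY_CAP = 1500
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def _size(turn: tuple[str, str]) -> int:
    return len(turn[0]) + len(turn[1])

def trim_history(turns: list[tuple[str, str]], context_chars: int) -> list[tuple[str, str]]:
    """turns: (user_text, assistant_text) pairs, oldest first."""
    budget = int(context_chars * (1 - HISTORY_CONTEXT_RESERVE))
    # Strip thinking blocks first, then cap each assistant reply.
    cleaned = [(user, THINK_BLOCK.sub("", reply)[:ASSISTANT_HISTORY_CAP])
               for user, reply in turns]
    # Drop the oldest turns until the history fits, always keeping the newest pair.
    while len(cleaned) > 1 and sum(_size(t) for t in cleaned) > budget:
        cleaned.pop(0)
    return cleaned

turns = [
    ("a" * 100, "<think>x</think>" + "b" * 2000),
    ("c" * 100, "d" * 100),
    ("e" * 10, "f" * 10),
]
trimmed = trim_history(turns, context_chars=1000)
print(len(trimmed))
```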

Configuration Reference

All configuration is at the top of api.py:

Timeouts

Variable Default Description
GENERATION_TIMEOUT 600s Max total generation time
FIRST_TOKEN_TIMEOUT 300s Max wait for first token (includes prefill)
FALLBACK_SILENCE 20s Max silence between tokens after first
REPETITION_WINDOW 200 chars Sliding window size for repetition loop detection
REPETITION_MAX_HITS 2 Abort after this many repeated windows detected
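The table does not spell out how repetition loops are detected. One plausible reading, shown here purely as a hypothetical sketch, is to take the last REPETITION_WINDOW characters of the output and count how often that exact window already occurred earlier; is_repetition_loop is an illustrative name, not a function from api.py.

```python
REPETITION_WINDOW = 200    # chars
REPETITION_MAX_HITS = 2


def is_repetition_loop(text, window=REPETITION_WINDOW, max_hits=REPETITION_MAX_HITS):
    """True if the trailing window already appeared max_hits times before."""
    if len(text) < window * 2:
        return False  # not enough output yet to contain a repeat
    tail = text[-window:]
    earlier = text[:-window]
    # str.count() counts non-overlapping occurrences of the tail.
    return earlier.count(tail) >= max_hits
```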

Defaults

Variable Default Description
MAX_TOKENS_DEFAULT 2048 Default max completion tokens
CONTEXT_LENGTH_DEFAULT 4096 Fallback when not detected from filename

RAG Controls

Variable Default Description
RAG_MIN_QUALITY_SCORE 2 Minimum score for paragraph inclusion
RAG_MAX_PARAGRAPHS 10 Max paragraphs (prevents "lost in the middle")
RAG_QUALITY_FLOOR_THRESHOLD 3 Below this, RAG is dropped entirely
RAG_DEDUP_SIMILARITY 0.70 Jaccard threshold for near-duplicate removal
RAG_CACHE_TTL 300 Cache lifetime in seconds (0 to disable)
RAG_CACHE_MAX_ENTRIES 50 Max cached responses
DISABLE_THINK_FOR_RAG_BELOW_CTX 8192 Disable thinking for RAG when context < this
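For reference, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to near-duplicate removal with the RAG_DEDUP_SIMILARITY threshold; the lowercase whitespace tokenization and the function names are assumptions for illustration, and the server may tokenize differently.

```python
RAG_DEDUP_SIMILARITY = 0.70


def jaccard(a, b):
    """Jaccard similarity of the two texts' word sets (0.0 to 1.0)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def dedup_paragraphs(paragraphs, threshold=RAG_DEDUP_SIMILARITY):
    """Keep a paragraph only if it is below the threshold against all kept ones."""
    kept = []
    for paragraph in paragraphs:
        if all(jaccard(paragraph, other) < threshold for other in kept):
            kept.append(paragraph)
    return kept
```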

VL (Vision-Language) Limits

Variable Default Description
VL_RAG_CONTEXT_CAP 2000 chars Max RAG reference text in VL multi-turn prompts
VL_ASSISTANT_HISTORY_CAP 500 chars Max chars per prior assistant answer in VL prompts

History & Sliding Window

Variable Default Description
HISTORY_CONTEXT_RESERVE 0.35 Fraction of context reserved for current turn + output
ASSISTANT_HISTORY_CAP 1500 chars Max chars per prior assistant answer in text prompts

Prompt Cache

Variable Default Description
PROMPT_CACHE_ENABLED True Enable/disable KV state save/load on model init

Sampling Profiles

Model-aware sampling is configured via MODEL_SAMPLING_PROFILES in api.py. Each model family gets tuned defaults:

Family top_k top_p temp repeat_penalty presence_penalty
qwen3 20 0.8 0.7 1.1 1.5
gemma 40 0.95 0.7 1.1 0.0
phi 40 0.9 0.6 1.1 0.0
deepseek 20 0.9 0.6 1.1 1.0

Override per model: Add a "sampling" key in model_config.json inside the model directory:

{
  "sampling": {
    "top_k": 30,
    "temperature": 0.5
  }
}

Only specified fields are overridden; unset fields use the family profile defaults.
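The merge behaviour can be sketched as a dictionary update over the family defaults. The profile values are copied from the table above; resolve_sampling is a hypothetical helper, and the use of "temperature" as the key name for the temp column is inferred from the JSON example.

```python
import json
from pathlib import Path

MODEL_SAMPLING_PROFILES = {
    "qwen3":    {"top_k": 20, "top_p": 0.8,  "temperature": 0.7, "repeat_penalty": 1.1, "presence_penalty": 1.5},
    "gemma":    {"top_k": 40, "top_p": 0.95, "temperature": 0.7, "repeat_penalty": 1.1, "presence_penalty": 0.0},
    "phi":      {"top_k": 40, "top_p": 0.9,  "temperature": 0.6, "repeat_penalty": 1.1, "presence_penalty": 0.0},
    "deepseek": {"top_k": 20, "top_p": 0.9,  "temperature": 0.6, "repeat_penalty": 1.1, "presence_penalty": 1.0},
}


def resolve_sampling(family, model_dir):
    """Family defaults, with any "sampling" keys from model_config.json on top."""
    params = dict(MODEL_SAMPLING_PROFILES[family])  # copy, never mutate defaults
    config_file = Path(model_dir) / "model_config.json"
    if config_file.is_file():
        overrides = json.loads(config_file.read_text()).get("sampling", {})
        params.update(overrides)
    return params
```

With the JSON example above placed in a Qwen3 model directory, this would yield top_k=30 and temperature=0.5 while top_p, repeat_penalty, and presence_penalty keep their qwen3 defaults.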

Process Management

Variable Default Description
REQUEST_STALE_TIMEOUT 180s Auto-clear tracked request after this idle time
MONITOR_INTERVAL 10s Health check / idle monitoring frequency
IDLE_UNLOAD_TIMEOUT 300s Auto-unload text model after idle (0 to disable)
VL_IDLE_UNLOAD_TIMEOUT 300s Auto-unload VL model after idle (0 to disable)

Environment Variables

Variable Description
RKLLM_LIB_PATH Path to librkllmrt.so (auto-detected from /usr/lib/ by default)
RKNN_LIB_PATH Path to librknnrt.so for VL vision encoder (auto-detected from /usr/lib/ by default)
RKLLM_API_LOG_LEVEL Python API log level: DEBUG, INFO, WARNING, ERROR

Logging

Logs are written to both stderr and a rotating log file (api.log in the script directory):

  • Max file size: 10 MB
  • Backup count: 3 rotated files
  • Default level: DEBUG (set RKLLM_API_LOG_LEVEL=INFO for production)
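A logger with these properties can be built from the standard library alone. The sketch below is an approximation of the described setup, not the server's code: build_logger and the logger name are hypothetical, and the format string is inferred from the log examples in the next section.

```python
import logging
import os
import sys
from logging.handlers import RotatingFileHandler


def build_logger(log_dir, name="rkllm_api"):
    """Log to stderr and to a rotating api.log (10 MB, 3 backups)."""
    level_name = os.environ.get("RKLLM_API_LOG_LEVEL", "DEBUG").upper()
    level = getattr(logging, level_name, logging.DEBUG)
    if not isinstance(level, int):
        level = logging.DEBUG  # unknown names fall back to the default

    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    formatter = logging.Formatter(
        "%(asctime)s [%(levelname)s] %(message)s", datefmt="%Y-%m-%d %H:%M:%S"
    )
    file_handler = RotatingFileHandler(
        os.path.join(log_dir, "api.log"),
        maxBytes=10 * 1024 * 1024,
        backupCount=3,
    )
    for handler in (logging.StreamHandler(sys.stderr), file_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```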

Log Examples

2026-02-08 17:45:12 [INFO] Detected: qwen3-1.7b (context=4096)
2026-02-08 17:45:12 [INFO] Models: ['qwen3-1.7b']
2026-02-08 17:45:12 [INFO] Aliases: {'qwen': 'qwen3-1.7b', 'qwen3': 'qwen3-1.7b'}
2026-02-08 17:45:30 [INFO] >>> NEW REQUEST chatcmpl-a1b2c3d4e5f6
2026-02-08 17:45:30 [INFO] Resolved alias 'qwen' -> 'qwen3-1.7b'
2026-02-08 17:45:30 [INFO] Loading model: qwen3-1.7b
2026-02-08 17:45:33 [INFO] Model loaded in 3.2s
2026-02-08 17:45:33 [DEBUG] KV incremental: sending only new user message (hash match)
2026-02-08 17:45:33 [DEBUG] First token in 0.05s
2026-02-08 17:45:45 [INFO] Request ENDED: chatcmpl-a1b2c3d4e5f6

Security

This server has NO authentication. It is designed to run on a trusted local network.

  • Binds to 0.0.0.0:8000 — accessible from all network interfaces
  • No API key validation (any non-empty string works for Open WebUI)
  • Request body limited to 50 MB
  • Do NOT expose directly to the public internet
  • Place behind a reverse proxy (nginx, Caddy) with authentication if external access is needed

Troubleshooting

"Model not found"

  • Ensure the .rkllm file is inside a subfolder of ~/models/ (not directly in ~/models/)
  • Check folder naming: spaces and underscores in the folder name both become hyphens in the model ID
  • Run curl http://localhost:8000/v1/models to see detected models
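To predict the model ID for a given folder, a mapping along these lines may help. It is a guess at the rule, not the server's code: normalize_model_name is a hypothetical helper, and the lowercasing step is inferred only from the log examples, where the folder Qwen3-1.7B is detected as qwen3-1.7b.

```python
def normalize_model_name(folder_name):
    """Approximate the folder-name to model-ID mapping (assumed rules)."""
    name = folder_name.strip().lower()  # lowercasing is an assumption from the logs
    return name.replace(" ", "-").replace("_", "-")
```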

"Failed to load model"

  • Check that librkllmrt.so is in /usr/lib/: ldconfig -p | grep rkllm
  • Verify NPU driver is loaded: dmesg | grep -i npu
  • Check api.log for init failure messages — may indicate corrupt .rkllm file or version mismatch

"Another request is currently being processed" (503)

  • NPU is single-task — only one request at a time
  • Previous request may be stuck — check /health endpoint
  • Stale requests auto-clear after 180s idle
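Because the NPU serves one request at a time, a client that may overlap requests can simply retry on 503. The following is a client-side sketch using only the standard library; post_chat and retry_while_busy are hypothetical helper names, and the retry count and delay are arbitrary defaults.

```python
import json
import time
import urllib.error
import urllib.request


def post_chat(base_url, payload, timeout=600):
    """POST to /v1/chat/completions; return (status, parsed body or None)."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, json.loads(response.read())
    except urllib.error.HTTPError as err:
        return err.code, None


def retry_while_busy(send, retries=5, delay=2.0, sleep=time.sleep):
    """Call send() until it returns a status other than 503 or retries run out."""
    for attempt in range(retries + 1):
        status, body = send()
        if status != 503:
            break
        if attempt < retries:
            sleep(delay)
    return status, body
```

A call might look like retry_while_busy(lambda: post_chat("http://localhost:8000", payload)), where payload is an ordinary chat-completions request body.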

Streaming stops mid-response

  • Check FALLBACK_SILENCE timeout (default 20s) — increase if model is slow
  • Large prompts near context limit may cause long prefill — increase FIRST_TOKEN_TIMEOUT (default 300s)
  • Check /health endpoint for status

Server freezes on requests

  • Ensure you are using -k gthread, not -k gevent. rkllm_run() is a blocking C call that freezes gevent's event loop
  • Check gunicorn command: gunicorn -w 1 -k gthread --threads 4 --timeout 300 -b 0.0.0.0:8000 api:app

RAG returns irrelevant answers

  • Verify SearXNG is returning JSON: add json to search.formats in SearXNG settings
  • Check if "Bypass Web Loader" is ON in Open WebUI
  • Set RAG_QUALITY_FLOOR_THRESHOLD higher to drop poor search results
  • Check logs for "RAG SKIP" and "Quality floor triggered" messages

High memory usage

  • Set IDLE_UNLOAD_TIMEOUT to auto-unload after idle periods
  • Use /v1/models/unload to manually free NPU memory
  • Smaller quantized models (W4A16) use less memory

Testing

Five test suites exercise the server against a live instance, from unit-level parsing to full end-to-end integration across all models.

Diagnostic Test (tests/diagnostic_test.py)

Section-by-section diagnostic covering 17 areas of the codebase — 108 tests total. Designed for copy-paste output analysis.

python tests/diagnostic_test.py                # Run all 17 sections
python tests/diagnostic_test.py --skip-vl      # Skip VL tests (faster)
python tests/diagnostic_test.py --section 4    # Run only section 4

Section Coverage
1 Server connectivity & health endpoint structure
2 Model detection, /v1/models listing, response format
3 Alias generation & model name resolution
4 Error handling: bad body, empty messages, invalid types, bad base64
5 Text generation (non-streaming): response structure, usage stats
6 Text generation (streaming): SSE format, chunk structure, include_usage
7 Think tag parsing (reasoning_content in SSE)
8 KV cache tracking & incremental mode (multi-turn memory)
9 Model select, unload, switch, idle state
10 Concurrent request rejection (single-NPU guard)
11 RAG pipeline: <source> extraction, boilerplate cleaning, skip detection
12 RAG response cache (generate vs cached timing)
13 Content normalization (multimodal arrays with text only)
14 VL auto-routing, image processing, streaming, model name in response
15 Text-after-VL (dual-model isolation)
16 Route variants (/chat/completions vs /v1/...), edge cases
17 Final system state consistency

Integration Test (tests/vl_test.py)

Focused integration tests across 17 categories — 68 assertions. Tests text generation, VL multimodal, streaming, error handling, model lifecycle, and concurrent rejection.

python tests/vl_test.py all          # Run all tests
python tests/vl_test.py complete     # Non-streaming tests only
python tests/vl_test.py stream       # Streaming tests only

Test Results (Orange Pi 5 Plus, March 2026)

Suite Total Pass Fail Warn Time
tests/diagnostic_test.py 91 91 0 0 ~12 min
tests/vl_test.py 68 68 0 0 ~5 min
tests/e2e_test.py 78 78 0 0 ~11 min
tests/deep_diagnostic.py 84 84 0 1 ~7 min
tests/realworld_smoke.py 47 47 0 0 ~4 min

Real-World Smoke Test (tests/realworld_smoke.py)

Ad-hoc real-world smoke test with 15 scenarios — 47 checks. Exercises the API with natural prompts to verify end-to-end behavior including Q&A, streaming, reasoning, multi-turn context, RAG, shortcircuits, date awareness, Home Assistant, model switching, concurrent rejection, long output, and error handling.

python tests/realworld_smoke.py              # Run all 15 scenarios

End-to-End Test (tests/e2e_test.py)

Full integration test that exercises every model with real inference — 78 checks across 9 sections, run against all 4 text models. Covers streaming, non-streaming, shortcircuits, RAG, cache, web search, KV multi-turn, prompt building, Open WebUI database config, and API compliance. WebUI/SearXNG URLs are auto-derived from RKLLM_API host.

python tests/e2e_test.py                # Run all 9 sections against all models
python tests/e2e_test.py --section 3    # Run only section 3
python tests/e2e_test.py --fast          # Skip slow models (gemma, phi)

Section Coverage
1 Per-model text generation (streaming + non-streaming, all 4 models)
2 Shortcircuit detection (title gen, tag gen, query gen)
3 RAG pipeline with real document context
4 RAG response cache (hit/miss timing)
5 Web search flow (SearXNG integration)
6 KV cache multi-turn memory
7 Prompt building & detection (date, HA, summarization)
8 Open WebUI database config
9 API compliance & edge cases (alias resolution, stream_options)

Deep Diagnostic Test (tests/deep_diagnostic.py)

Targeted deep-dive into 12 under-tested areas identified via gap analysis — 72 checks covering protocol compliance, edge cases, and stress scenarios.

python tests/deep_diagnostic.py                # Run all 12 sections
python tests/deep_diagnostic.py --section 4    # Run only section 4

Section Coverage
1 SSE stream format strict compliance (Content-Type, JSON validity, [DONE], role, finish_reason)
2 Concurrent request rejection (503 on second request, recovery after)
3 Model hot-swap correctness (swap timing, response model field, health state)
4 Unicode / special character handling (emoji, CJK, Arabic, HTML entities in stream)
5 ThinkTagParser edge cases (partial tags, char-by-char, multi-block, regex parity, 10K blocks)
6 CORS headers & preflight (OPTIONS, Access-Control-Allow-Origin, methods)
7 Token usage accuracy (prompt+completion=total, streaming include_usage, ranges)
8 Select / unload endpoints (invalid model, double unload, invalid JSON)
9 VL / OCR pipeline (image inference, streaming, invalid base64, URL-based image rejection)
10 Context overflow / large input (22K chars, 50-turn conversation, server recovery)
11 Message normalization (list content, integer coercion, null content, non-dict filtering)
12 Shortcircuit streaming SSE compliance (chunk count, usage block, system_fingerprint)

All suites default to http://localhost:8000. To target a remote server, set the RKLLM_API environment variable:

RKLLM_API=http://192.168.x.x:8000 python tests/diagnostic_test.py

Benchmark Tool (tests/benchmark_test.py)

Automated NPU benchmark tool that measures cold load time, warm TTFT, generation speed (tok/s), and NPU memory usage. Fetches real perf stats from the server log via SSH.

python tests/benchmark_test.py                                     # Benchmark all models
python tests/benchmark_test.py --models qwen3-1.7b phi-3-mini-4k-instruct  # Specific models only
python tests/benchmark_test.py --skip-vl                            # Skip VL model tests
python tests/benchmark_test.py --runs 3                             # Average over 3 runs
python tests/benchmark_test.py --remote-log                         # Fetch NPU perf via SSH

Results are saved to tests/benchmark_results.json and printed as formatted markdown tables.


File Structure

RKLLM-API-Server/
├── api.py                          # Main API server (ctypes, v2.0)
├── docker-compose.yml              # Open WebUI Docker config (all settings hardcoded)
├── setup.sh                        # Zero-config installer (762 lines)
├── settings.yml                    # SearXNG configuration for Open WebUI
├── README.md                       # This file
├── tests/
│   ├── diagnostic_test.py          # Section-by-section diagnostic (17 sections, 108 tests)
│   ├── e2e_test.py                 # End-to-end integration (9 sections, 78 checks, all models)
│   ├── deep_diagnostic.py          # Deep diagnostic (12 sections, 72 checks, edge cases)
│   ├── vl_test.py                  # Integration test suite (17 categories, 68 tests)
│   ├── realworld_smoke.py           # Real-world smoke test (15 scenarios, 47 checks)
│   ├── benchmark_test.py           # NPU model benchmark tool (tok/s, TTFT, memory)
│   ├── benchmark_results.json      # Latest benchmark results
│   ├── set_model_prompts.py        # Set system prompts on all OWUI models (DB script)
│   ├── fix_owui_models.py          # Set model capabilities: vision, tools, etc. (DB script)
│   ├── remove_stale_models.py      # Mark old/removed models as inactive in OWUI DB
│   ├── dump_owui_models_quick.py   # Quick dump of all OWUI model records
│   ├── dump_owui_settings.py       # Dump all OWUI admin settings from DB
│   ├── owui_set_compression.py     # Set OWUI image compression (DB + runtime API)
│   ├── vl_multi_image_test.py      # Multi-image VL model integration test
│   └── vl_multiturn_test.py        # VL multi-turn context + RAG integration test
├── archive/
│   ├── api_v1_subprocess.py        # Original subprocess version (archived)
│   └── CTYPES_MIGRATION_PLAN.md    # V1→V2 migration planning document
└── .gitignore

V1 (Subprocess) vs V2 (ctypes) — Why We Migrated

The original server (archive/api_v1_subprocess.py) worked by spawning a separate C++ binary and communicating via stdin/stdout pipes. While functional, this architecture had significant limitations. The current version (api.py) uses direct ctypes bindings to the shared library, eliminating the process boundary entirely.

Architecture Comparison

Aspect V1 — Subprocess V2 — ctypes (current)
NPU communication Pipes stdin/stdout to a C++ binary Direct C library calls via ctypes
Token delivery Parse stdout line-by-line C callback pushes to queue.Queue
KV cache Lost on every turn (binary restarts) Preserved across turns (keep_history=1)
Prefill (Turn 2+) ~500ms (re-process entire conversation) ~50ms (only new user message)
Abort / cancel SIGKILL the process rkllm_abort() — clean, instant
Performance stats Parsed from stdout text Native RKLLMResult.perf struct
Thinking mode toggle Append /no_think to prompt text RKLLMInput.enable_thinking flag
Error handling Detect process crash / timeout C return codes + error callback state
Process management ~500 lines (spawn, monitor, kill, restart) 0 lines (no process to manage)
VL / multimodal Not supported Dual-model architecture with RKNN vision encoder
Code size 2682 lines ~3700 lines (text + VL + RAG)
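The "C callback pushes to queue.Queue" pattern in the table can be illustrated without the NPU runtime. In this sketch a plain Python callable stands in for the blocking rkllm_run() call and its callback. stream_tokens is a hypothetical name and the real callback signature differs. The silence timeout plays the role of FALLBACK_SILENCE.

```python
import queue
import threading

FALLBACK_SILENCE = 20  # seconds of allowed silence between tokens
_DONE = object()       # sentinel marking the end of generation


def stream_tokens(run_inference, silence_timeout=FALLBACK_SILENCE):
    """Yield tokens pushed by a blocking run_inference(on_token) call."""
    tokens = queue.Queue()

    def worker():
        try:
            run_inference(tokens.put)  # stand-in for the blocking C call
        finally:
            tokens.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        try:
            item = tokens.get(timeout=silence_timeout)
        except queue.Empty:
            raise TimeoutError("no token within the silence timeout") from None
        if item is _DONE:
            return
        yield item
```

Running the blocking call in its own thread and consuming from the queue is also why a thread-based worker fits this design: the request thread only ever waits on the queue.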

Why the Change Matters

The biggest win is KV cache retention. In the subprocess architecture, every turn killed and restarted the C++ binary, destroying the NPU's key-value cache. This meant the model had to re-prefill the entire conversation history from scratch on every single message — growing linearly with conversation length.

With ctypes, the library stays loaded in-process. The KV cache persists between calls. On a 10-turn conversation, Turn 1 takes ~90ms to prefill. All subsequent turns take ~50ms regardless of conversation length, because only the new message is processed.

Performance impact (measured on Orange Pi 5 Plus, Qwen3-1.7B):

Metric V1 (Subprocess) V2 (ctypes) Improvement
Turn 1 prefill ~90ms ~90ms Same
Turn 2 prefill ~500ms ~50ms 10x faster
Turn 5 prefill ~1200ms ~50ms 24x faster
Turn 10 prefill ~2000ms+ ~50ms 40x faster
Model switch ~5s (kill + restart + reload) ~3s (destroy + init) ~40% faster
Cancel generation ~1s (SIGKILL + wait) instant (rkllm_abort()) Near-instant

V1 Subprocess Code (Archived)

The original subprocess version is preserved at archive/api_v1_subprocess.py (2682 lines, fully functional). You can also access it via the git tag:

# View the last working subprocess version
git checkout v1.0-subprocess-stable -- api.py

# Return to current ctypes version
git checkout main -- api.py

The V1 code may be useful as a reference if:

  • You need to run on a system where ctypes binding is not possible
  • You want to see how stdout parsing / process management was implemented
  • You're porting to a different inference runtime that only provides a CLI binary

Tested Hardware

Board RAM NPU Driver Runtime Status
Orange Pi 5 Plus 16 GB 0.9.8 v1.2.3 Fully tested, production use

Tested Models

Text Models

Model Quantization Context File Size Speed Status
Qwen3-1.7B W8A8 4K ~1.7 GB 13.0 tok/s avg Fully benchmarked
Phi-3-Mini-4K-Instruct W8A8 4K ~3.8 GB 6.8 tok/s avg Fully benchmarked
Qwen3-4B-Instruct W8A8 16K ~4 GB ~6 tok/s Tested
Gemma-3-4B-IT W8A8 4K ~4 GB ~6 tok/s Tested

Pre-built RKLLM models available on HuggingFace: Qwen3-1.7B-RKLLM-v1.2.3 · Phi-3-mini-4k-instruct-w8a8 · Qwen3-VL-2B-Instruct-RKLLM-v1.2.3

VL (Vision-Language) Models

Model Quantization Encoder Res Decode Speed Encoder Time Peak RAM Status
Qwen3-VL-2B W8A8 672×672 ~15 tok/s ~4s ~6.5 GB Active (recommended)
Qwen3-VL-2B W8A8 448×448 ~15 tok/s ~2s ~5.5 GB Available (default export)
Qwen3-VL-2B W8A8 896×896 ~15 tok/s ~12s ~8.5 GB Available (high-detail)
InternVL3.5-2B W8A8 448×448 ~12.1 tok/s ~2.0s ~3.0 GB Tested — poor OCR accuracy
DeepSeekOCR-3B W8A8 448×448 ~31.8 tok/s ~2.1s ~3.0 GB Tested — severe hallucination
Qwen2.5-VL-3B W8A8 392×392 ~8.7 tok/s ~2.9s ~5.3 GB Supported (lower resolution)
Qwen2-VL-2B W8A8 392×392 ~16.6 tok/s ~3.3s ~3.0 GB Supported
InternVL3-1B W8A8 448×448 ~TBD ~TBD ~TBD Supported
MiniCPM-V-2.6 W8A8 448×448 ~TBD ~TBD ~TBD Supported

Qwen3-VL-2B is the recommended VL model with the vision encoder re-exported at 672×672 for 2.25× more visual detail than the default 448×448. Three encoder resolutions (448/672/896) are available — see Vision Encoder Resolution Comparison. To switch encoders, rename the .rknn files (only one should have the .rknn extension; others use .rknn.alt).

All Qwen3-VL-2B files (LLM + all 3 vision encoders) are on HuggingFace: GatekeeperZA/Qwen3-VL-2B-Instruct-RKLLM-v1.2.3. Pre-converted models for other architectures available in the RKLLM official model zoo (fetch code: rkllm).


Benchmarks

Measured on Orange Pi 5 Plus (16 GB) — RK3588, 3 NPU cores, RKNPU driver 0.9.8, librkllmrt.so v1.2.3. All measurements are server-side NPU timing from the RKLLMResult.perf struct (not client-side estimates). Both text and VL models remain loaded in NPU memory simultaneously during normal operation.

Qwen3-1.7B (W8A8, ctx=4096)

Prompt Prefill Prefill Tokens Generate Time Output Tokens tok/s Cold TTFT
Short Q&A 95 ms 15 39.6s 539 13.6 4.1s
Medium explanation 128 ms 26 145.3s 1,829 12.6 4.1s
Long generation 176 ms 49 134.6s 1,701 12.6 4.1s
Reasoning (step-by-step) 140 ms 33 67.0s 889 13.3 4.2s
Average 135 ms 31 n/a 1,240 13.0 4.1s

Phi-3-Mini-4K-Instruct (W8A8, ctx=4096)

Prompt Prefill Prefill Tokens Generate Time Output Tokens tok/s Cold TTFT
Short Q&A 204 ms 10 9.4s 68 7.2 6.8s
Medium explanation 258 ms 25 141.3s 913 6.5 6.9s
Long generation 454 ms 47 149.9s 963 6.4 6.6s
Reasoning (step-by-step) 263 ms 29 29.4s 207 7.0 6.9s
Average 295 ms 28 n/a 538 6.8 6.8s

Key Observations

  • Qwen3-1.7B generates ~2× faster than Phi-3-Mini despite similar quantization — smaller parameter count means fewer operations per token
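  • Prefill time grows with input token count on top of a fixed overhead: the tables above work out to ~4–6 ms per token for Qwen3 and ~9–20 ms per token for Phi-3, with the shortest prompts costing the most per token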
  • Cold TTFT includes model load — Qwen3 loads in 2.8s, Phi-3 in ~4.5s. Warm TTFT (model already loaded) is ~1.3s for Qwen3 and ~2.0s for Phi-3
  • Generation speed is stable across prompt lengths — the NPU maintains consistent tok/s regardless of output length
  • Reasoning prompts generate fewer tokens but at slightly higher tok/s (less KV cache pressure from shorter context)
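The tok/s column in both tables can be reproduced from the other columns as output tokens divided by generate time. The short check below uses the figures from the tables above; the variable and function names are illustrative only.

```python
# (output tokens, generate time in seconds) per prompt, copied from the tables
QWEN3_RUNS = {
    "Short Q&A": (539, 39.6),
    "Medium explanation": (1829, 145.3),
    "Long generation": (1701, 134.6),
    "Reasoning (step-by-step)": (889, 67.0),
}
PHI3_RUNS = {
    "Short Q&A": (68, 9.4),
    "Medium explanation": (913, 141.3),
    "Long generation": (963, 149.9),
    "Reasoning (step-by-step)": (207, 29.4),
}


def tokens_per_second(output_tokens, generate_seconds):
    """Decode speed rounded to one decimal, as reported in the tables."""
    return round(output_tokens / generate_seconds, 1)


def average_speed(runs):
    """Mean of the per-prompt tok/s values, rounded to one decimal."""
    speeds = [tokens_per_second(tokens, secs) for tokens, secs in runs.values()]
    return round(sum(speeds) / len(speeds), 1)
```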

VL Model Evaluation

We tested all available pre-converted VL models to find the best option for real-world OCR tasks (reading gas meters from phone photos). All models use the same 448×448 (or lower) vision encoder, which crushes high-resolution phone photos (14-15 MP) down to a tiny thumbnail — a fundamental bottleneck for OCR accuracy.

Test Setup

  • Hardware: Orange Pi 5 Plus (16 GB), RK3588, RKNPU driver 0.9.8, rkllm-runtime v1.2.3
  • Test images: Two real gas meter photos (14-15 MB JPEG, taken with phone camera)
  • Task: Read the numeric meter display and identify the brand name printed on the meter

Results

Model Source Meter Reading Brand Detection Speed Verdict
Qwen3-VL-2B RKLLM model zoo Produces plausible numbers (wrong but in range) Detected similar text 5-10s Best available
InternVL3.5-2B happyme531/InternVL3_5-2B-RKLLM Completely wrong ("200.0", "48") Hallucinated ("Bundix") ~30s Not usable for OCR
DeepSeekOCR-3B RKLLM model zoo Completely wrong ("1234") or empty Hallucinated entire scenes (car dashboards, industrial panels) 5-15s Not usable — severe hallucination
Qwen2.5-VL-3B vuong1/Qwen2.5-VL-3B-Instruct-RKLLM Not tested (lower 392×392 resolution) ~8.7 tok/s Rejected — lower resolution than current

Key Findings

  1. Qwen3-VL-2B is the best pre-converted option — while not perfectly accurate for OCR, it produces contextually plausible results and is the fastest
  2. InternVL3.5-2B has a larger language model (Qwen2.5-1.5B) but W8A8 quantization destroyed its vision capability — also generates Chinese chain-of-thought gibberish
  3. DeepSeekOCR-3B was specifically designed for OCR but the RKNN conversion is fundamentally broken — it hallucinates entirely different scenes
  4. All models share the same root problem: the 448×448 (or 392×392) vision encoder crushes phone photos too aggressively for reliable text/number reading
  5. The real solution is re-exporting the vision encoder at higher resolution — see Vision Encoder Resolution Comparison for results

Vision Encoder Resolution Comparison

After identifying that the 448×448 default vision encoder was the bottleneck, we re-exported the Qwen3-VL-2B vision encoder at 672×672 and 896×896 using the rknn-llm export scripts on an x86 host (Ubuntu, 15GB RAM + 36GB swap, CPU-only — no GPU required).

File Layout on Orange Pi

~/models/Qwen3-VL-2b/
    qwen3-vl-2b-instruct_w8a8_rk3588.rkllm   # LLM decoder (shared)
    qwen3-vl-2b_vision_672_rk3588.rknn        # Active encoder (672×672)
    qwen3-vl-2b_vision_448_rk3588.rknn.alt    # Fast, low-detail (inactive)
    qwen3-vl-2b_vision_896_rk3588.rknn.alt    # Slow, high-detail (inactive)

To switch encoder: rename the active .rknn to .rknn.alt and the desired one from .rknn.alt to .rknn, then sudo systemctl restart rkllm-api.
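The rename step can be scripted. This is a sketch that assumes the file naming shown in the layout above (a "_vision_<resolution>_" segment in each encoder filename); activate_encoder is a hypothetical helper and it does not restart the service.

```python
from pathlib import Path


def activate_encoder(model_dir, resolution):
    """Make the encoder for `resolution` the only file ending in .rknn."""
    model_dir = Path(model_dir)
    marker = f"_vision_{resolution}_"
    candidates = [
        p for p in model_dir.iterdir()
        if marker in p.name and p.name.endswith((".rknn", ".rknn.alt"))
    ]
    if len(candidates) != 1:
        raise FileNotFoundError(f"expected one encoder for {resolution}, found {len(candidates)}")
    target = candidates[0]

    # Park every other active encoder as .rknn.alt.
    for path in list(model_dir.glob("*.rknn")):
        if path != target:
            path.rename(path.with_name(path.name + ".alt"))

    # Activate the requested encoder if it is currently parked.
    if target.name.endswith(".rknn.alt"):
        active = target.with_name(target.name[: -len(".alt")])
        target.rename(active)
        target = active
    return target
```

After calling, for example, activate_encoder(Path.home() / "models" / "Qwen3-VL-2b", 896), the service still needs the restart shown above to pick up the new encoder.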

Resolution Benchmark Results

Tested on Orange Pi 5 Plus (16GB) with real 14-15MB JPEG gas meter photos:

Resolution Visual Tokens RKNN Size Encoder Time Total Response Peak RAM
448×448 196 (14×14) 812 MB ~2s 5–10s ~5.5 GB
672×672 441 (21×21) 854 MB ~4s 9–11s ~6.5 GB
896×896 784 (28×28) 923 MB ~12s 25–28s ~8.5 GB

OCR Test Results (Gas Meter Photos)

Resolution meter1.jpg meter2.jpg Consistency
448×448 975648 3700211 Unique readings
672×672 37866 3709217
896×896 37866 57709217 672 & 896 agree on meter1

Key conclusions:

  • 672×672 is the sweet spot: 2.25× more visual detail for ~2s of extra encoder time (~4s vs ~2s at 448)
  • 896×896 is roughly 2.5× slower end to end (25-28s vs 9-11s; the encoder alone is ~3× slower) for marginal benefit
  • The 2B LLM is the accuracy bottleneck, not the vision encoder — different resolutions produce different (all incorrect) readings, suggesting the model size limits OCR reliability
  • Text decode speed is unaffected (~15 tok/s) — only the vision encode + prefill time increases
  • All files are on HuggingFace: GatekeeperZA/Qwen3-VL-2B-Instruct-RKLLM-v1.2.3

Re-Exporting VL Models at Higher Resolution

The RKNN export scripts accept --height and --width parameters, so you can re-export the Qwen3-VL-2B vision encoder at a higher resolution (e.g., 672×672 or 896×896) to improve OCR accuracy. This only affects the .rknn vision encoder — the .rkllm language model stays the same.

Requirements

  • x86 Linux machine with Python 3.9-3.12 (the toolkits do not run on ARM)
  • No GPU required — CPU-only export works (tested on Ubuntu 22.04, 15GB RAM)
  • ~20 GB RAM+swap for 672×672, ~35 GB for 896×896 (use fallocate + mkswap for extra swap)
  • rknn-toolkit2 v2.3.2 (pip install rknn-toolkit2)
  • torch==2.4.0, torchvision==0.19.0
  • transformers>=4.57.0, onnx>=1.18.0

Step-by-Step: Re-Export Qwen3-VL at Higher Resolution

# Clone the export scripts
git clone https://github.com/airockchip/rknn-llm.git
cd rknn-llm/examples/multimodal_model_demo

# Install dependencies
pip install "transformers>=4.57.0" torch==2.4.0 torchvision==0.19.0 "onnx>=1.18.0" rkllm-toolkit rknn-toolkit2

# Download the original Qwen3-VL-2B HuggingFace model
# (needed as source for the vision encoder weights)
git clone https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct

# Step 1: Export vision encoder to ONNX at custom resolution
# Height/width must be divisible by (merge_size * patch_size) = 2 * 16 = 32 for Qwen3-VL
python export/export_vision.py \
  --path ./Qwen3-VL-2B-Instruct \
  --model_name qwen3-vl \
  --height 672 --width 672
# Output: ./onnx/qwen3-vl_vision.onnx

# Step 2: Convert ONNX to RKNN for RK3588
python export/export_vision_rknn.py \
  --path ./onnx/qwen3-vl_vision.onnx \
  --model_name qwen3-vl \
  --target-platform rk3588 \
  --height 672 --width 672
# Output: ./rknn/qwen3-vl_vision_rk3588.rknn

Resolution Options for Qwen3-VL

Qwen3-VL uses patch_size=16 and merge_size=2, so resolution must be divisible by 32:

Resolution Visual Tokens Encoder Time (measured) RAM Impact Notes
448×448 196 ~2s Baseline (5.5 GB) Default from model zoo
672×672 441 ~4s +1 GB (6.5 GB) Recommended — 2.25× more pixels
896×896 784 ~12s +3 GB (8.5 GB) 4× more pixels, noticeably slower
1120×1120 1225 ~20s+ +5 GB+ May OOM on 16 GB devices

Note: The .rkllm language model file does NOT need to be re-exported — only the .rknn vision encoder changes. Copy the new .rknn file to the model folder on the Orange Pi alongside the existing .rkllm file.

Important Constraints

  • Height and width must be divisible by patch_size × merge_size (32 for Qwen3-VL, 28 for Qwen2/2.5-VL)
  • Higher resolution = more visual tokens = longer prefill time and more NPU memory
  • The vision encoder runs on the NPU — very large resolutions may cause OOM on 16 GB devices
  • The RKLLM LLM decoder has a fixed max_context_len — ensure visual tokens + text tokens fit within it
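The divisibility rule and the visual token counts in the tables follow from one formula: tokens = (height / (patch_size × merge_size)) × (width / (patch_size × merge_size)). A small helper, with a hypothetical name, makes it easy to check a candidate resolution before exporting; the patch_size of 14 for Qwen2/2.5-VL is derived from the stated step of 28 with merge_size 2.

```python
def visual_tokens(height, width, patch_size=16, merge_size=2):
    """Visual tokens produced for an image size; defaults are for Qwen3-VL."""
    step = patch_size * merge_size
    if height % step or width % step:
        raise ValueError(f"height and width must be divisible by {step}")
    return (height // step) * (width // step)
```

For example, visual_tokens(672, 672) gives 441, and visual_tokens(392, 392, patch_size=14) gives 196 for the Qwen2/2.5-VL encoders. Comparing the result plus the expected text tokens against the decoder's max_context_len is a quick sanity check before committing to a long export.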

Git Tags & Branches

Tag / Branch Description
v1.0-subprocess-stable Last working subprocess version (V1)
v1.1-ctypes-text-only Text-only ctypes version before VL additions
subprocess-legacy Branch preserving the subprocess architecture
main Current: ctypes + VL multimodal + meta-task shortcircuits + context-enriched query gen + document RAG + model-aware sampling + prompt cache + sliding window + NPU benchmarks + full test suites (321 checks, 0 failures)

License

This project is provided as-is for personal and educational use. The rkllm runtime and model files are subject to their respective licenses from Rockchip and model authors.

Acknowledgements
