Jetson-first LLM runtime for local inference.
reflex-llm owns model loading, tokenizer/runtime state, memory budgeting,
KV-cache lifecycle, sampling, CLI/server surfaces, and integration with local
applications. CUDA kernel specialization is being split into reflex-infer,
which should become the reusable Jetson kernel library.
Status: rebrand in progress from the original genie-ai-runtime codebase.
The internal C++ namespace and environment variables still use jllm /
JLLM_* while the public repo, binaries, and architecture are moved to the
Reflex naming.
Initial target:
- Hardware: Jetson Orin Nano 8GB.
- Model: Qwen 3.5B or Qwen3 4B class GGUF model.
- Precision: GGUF
Q4_K_Mfirst, matching llama.cpp baselines. - Runtime mode: standalone CUDA first.
- Baselines:
llama.cppandtensorrt-edge-llm.
Future targets:
- Jetson Orin NX 16GB.
- Jetson AGX Orin 64GB.
- Jetson Thor.
- More model families, including Phi-4 and other compact edge models.
The intended architecture is:
reflex-llm: full LLM runtime and product integration layer.reflex-infer: Jetson-optimized CUDA kernel library, similar in spirit to FlashInfer but scoped to NVIDIA Jetson devices.
reflex-llm should use reflex-infer as an external dependency once the kernel
library has a stable C/C++ API. Until then, selected kernels may remain in this
repo as legacy implementation while their interfaces are extracted.
The split is documented in docs/reflex-infer-integration.md.
Prereqs on Jetson:
- JetPack 6 / L4T R36.x.
- CUDA 12.x from JetPack.
- CMake 3.20 or newer.
- C++17 compiler.
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)Outputs:
build/reflex-llm: CLI.libreflex_llm_core.a: embeddable runtime library.
The OpenAI-compatible HTTP server is opt-in:
cmake -B build -DCMAKE_BUILD_TYPE=Release -DREFLEX_LLM_BUILD_SERVER=ON
cmake --build build -j$(nproc)Server output:
build/reflex-llm-server
To wire the sibling reflex-infer checkout:
cmake -B build -DCMAKE_BUILD_TYPE=Release -DREFLEX_LLM_USE_REFLEX_INFER=ON
cmake --build build -j$(nproc)By default CMake looks for ../reflex-infer. Override with
-DREFLEX_INFER_SOURCE_DIR=/path/to/reflex-infer, or install reflex-infer
as a CMake package.
With REFLEX_LLM_USE_REFLEX_INFER=ON, the GGUF K-quant Q4 GEMV/GEMM wrappers
route through the reflex-infer dispatcher. Until external kernels move over,
reflex-llm registers its existing in-repo CUDA implementation as the fallback
backend.
Single prompt:
./build/reflex-llm -m /path/to/model.gguf -p "Hello"Interactive chat:
./build/reflex-llm -m /path/to/model.gguf -i --chatServer:
./build/reflex-llm-server -m /path/to/model.gguf -p 8080| Module | Responsibility |
|---|---|
src/engine/ |
GGUF loading, transformer forward pass, tokenizer, sampling |
src/memory/ |
memory budget, OOM guard, KV-cache pool, scratch pool |
src/jetson/ |
power, thermal, system probe, live stats |
src/kernels/ |
legacy in-repo CUDA kernels to migrate into reflex-infer |
src/persistence/ |
persistent KV-cache file format |
src/server/ |
optional OpenAI-compatible HTTP server |
Master header: include/jllm.h
| Doc | Purpose |
|---|---|
| docs/reflex-infer-integration.md | two-repo architecture and extraction contract |
| docs/architecture.md | runtime architecture |
| docs/kernels.md | current in-repo CUDA kernel notes |
| docs/memory.md | memory and KV-cache design |
| docs/jetson-hal.md | power and thermal handling |
| docs/qwen3-vs-our-runtime.md | Qwen bring-up notes |
| docs/performance.md | historical performance notes |
cd build
ctest --output-on-failureThis repository is Jetson-only. The top-level CMake configuration rejects non-aarch64 hosts.
MIT. See LICENSE.