reflex-llm

Jetson-first LLM runtime for local inference.

reflex-llm owns model loading, tokenizer/runtime state, memory budgeting, KV-cache lifecycle, sampling, CLI/server surfaces, and integration with local applications. CUDA kernel specialization is being split into reflex-infer, which should become the reusable Jetson kernel library.

Status: rebrand in progress from the original genie-ai-runtime codebase. The internal C++ namespace and environment variables still use jllm / JLLM_*, while the public repo, binaries, and architecture have moved to the Reflex naming.

Target

Initial target:

Hardware: Jetson Orin Nano 8GB.
Model: Qwen 3.5B or Qwen3 4B class GGUF model.
Precision: GGUF Q4_K_M first, matching llama.cpp baselines.
Runtime mode: standalone CUDA first.
Baselines: llama.cpp and tensorrt-edge-llm.

Future targets:

Jetson Orin NX 16GB.
Jetson AGX Orin 64GB.
Jetson Thor.
More model families, including Phi-4 and other compact edge models.

Two-Repo Split

The intended architecture is:

reflex-llm: full LLM runtime and product integration layer.
reflex-infer: Jetson-optimized CUDA kernel library, similar in spirit to FlashInfer but scoped to NVIDIA Jetson devices.

reflex-llm should use reflex-infer as an external dependency once the kernel library has a stable C/C++ API. Until then, selected kernels may remain in this repo as legacy implementations while their interfaces are extracted.

The split is documented in docs/reflex-infer-integration.md.

Build

Prereqs on Jetson:

JetPack 6 / L4T R36.x.
CUDA 12.x from JetPack.
CMake 3.20 or newer.
C++17 compiler.

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

Outputs:

build/reflex-llm: CLI.
libreflex_llm_core.a: embeddable runtime library.

The OpenAI-compatible HTTP server is opt-in:

cmake -B build -DCMAKE_BUILD_TYPE=Release -DREFLEX_LLM_BUILD_SERVER=ON
cmake --build build -j$(nproc)

Server output:

build/reflex-llm-server

To wire in a sibling reflex-infer checkout:

cmake -B build -DCMAKE_BUILD_TYPE=Release -DREFLEX_LLM_USE_REFLEX_INFER=ON
cmake --build build -j$(nproc)

By default CMake looks for ../reflex-infer. Override with -DREFLEX_INFER_SOURCE_DIR=/path/to/reflex-infer, or install reflex-infer as a CMake package.

With REFLEX_LLM_USE_REFLEX_INFER=ON, the GGUF K-quant Q4 GEMV/GEMM wrappers route through the reflex-infer dispatcher. Until the kernels themselves move into reflex-infer, reflex-llm registers its existing in-repo CUDA implementation as the fallback backend.
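The dispatcher-with-fallback arrangement described above can be pictured with a small sketch. Everything here is hypothetical: `KernelDispatcher`, `register_backend`, `GemvFn`, and the op name are illustrative names, not the reflex-infer API, and a plain CPU float GEMV stands in for the in-repo CUDA kernels so the example runs on any host.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical kernel signature: weights, input vector, output vector.
using GemvFn = std::function<void(const std::vector<float>& w,
                                  const std::vector<float>& x,
                                  std::vector<float>& y)>;

// Hypothetical dispatcher; not the actual reflex-infer interface.
class KernelDispatcher {
public:
    // The fallback is always available. In this sketch it plays the role of
    // the existing in-repo implementation.
    explicit KernelDispatcher(GemvFn fallback) : fallback_(std::move(fallback)) {}

    // A kernel library would register specialized kernels under an op name.
    void register_backend(const std::string& op, GemvFn fn) {
        backends_[op] = std::move(fn);
    }

    // Use the backend registered for `op` if there is one, else the fallback.
    // Returns true when a registered backend handled the call.
    bool run(const std::string& op, const std::vector<float>& w,
             const std::vector<float>& x, std::vector<float>& y) const {
        auto it = backends_.find(op);
        if (it != backends_.end()) {
            it->second(w, x, y);
            return true;
        }
        fallback_(w, x, y);
        return false;
    }

private:
    GemvFn fallback_;
    std::map<std::string, GemvFn> backends_;
};

// Reference row-major GEMV: y[r] = sum over c of w[r * cols + c] * x[c].
void reference_gemv(const std::vector<float>& w, const std::vector<float>& x,
                    std::vector<float>& y) {
    assert(!x.empty() && w.size() % x.size() == 0);
    const std::size_t cols = x.size();
    const std::size_t rows = w.size() / cols;
    y.assign(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            y[r] += w[r * cols + c] * x[c];
}
```

Constructing `KernelDispatcher d(reference_gemv)` and calling `d.run(...)` before any backend is registered takes the fallback path; after `register_backend` is called for the same op name, the registered kernel handles the call instead. The design choice this illustrates is that callers never branch on which repo owns a kernel.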

Run

Single prompt:

./build/reflex-llm -m /path/to/model.gguf -p "Hello"

Interactive chat:

./build/reflex-llm -m /path/to/model.gguf -i --chat

Server:

./build/reflex-llm-server -m /path/to/model.gguf -p 8080

Current Modules

| Module | Responsibility |
| --- | --- |
| `src/engine/` | GGUF loading, transformer forward pass, tokenizer, sampling |
| `src/memory/` | memory budget, OOM guard, KV-cache pool, scratch pool |
| `src/jetson/` | power, thermal, system probe, live stats |
| `src/kernels/` | legacy in-repo CUDA kernels to migrate into `reflex-infer` |
| `src/persistence/` | persistent KV-cache file format |
| `src/server/` | optional OpenAI-compatible HTTP server |

Master header: include/jllm.h
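As a rough illustration of the arithmetic behind a KV-cache memory budget, the sketch below computes the cache size for a given context length and the largest context that fits a byte budget. The struct and function names are hypothetical and not taken from include/jllm.h, and the formula is the generic one for an unquantized key/value cache; the actual `src/memory/` logic may account for more (scratch pools, paging, quantized cache entries).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical struct: field names are illustrative only.
struct KvCacheShape {
    std::uint64_t n_layers;        // transformer layers
    std::uint64_t n_kv_heads;      // key/value heads per layer
    std::uint64_t head_dim;        // elements per head
    std::uint64_t bytes_per_elem;  // e.g. 2 for FP16 cache entries
};

// Bytes needed to store keys and values (hence the factor 2) for n_tokens.
constexpr std::uint64_t kv_cache_bytes(const KvCacheShape& s,
                                       std::uint64_t n_tokens) {
    return 2 * s.n_layers * s.n_kv_heads * s.head_dim * s.bytes_per_elem *
           n_tokens;
}

// Largest context length whose KV cache fits within budget_bytes.
constexpr std::uint64_t max_context_tokens(const KvCacheShape& s,
                                           std::uint64_t budget_bytes) {
    const std::uint64_t per_token = kv_cache_bytes(s, 1);
    return per_token == 0 ? 0 : budget_bytes / per_token;
}
```

With a made-up shape of 32 layers, 8 KV heads, head dimension 128, and FP16 entries (not the parameters of any specific Qwen model), each token costs 128 KiB, so a 4096-token context needs 512 MiB and a 1 GiB budget holds 8192 tokens.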

Documentation

| Doc | Purpose |
| --- | --- |
| docs/reflex-infer-integration.md | two-repo architecture and extraction contract |
| docs/architecture.md | runtime architecture |
| docs/kernels.md | current in-repo CUDA kernel notes |
| docs/memory.md | memory and KV-cache design |
| docs/jetson-hal.md | power and thermal handling |
| docs/qwen3-vs-our-runtime.md | Qwen bring-up notes |
| docs/performance.md | historical performance notes |

Tests

cd build
ctest --output-on-failure

This repository is Jetson-only. The top-level CMake configuration rejects non-aarch64 hosts.

License

MIT. See LICENSE.
