
lfm 0.1.0 — Rust ONNX inference for LFM2.5-VL with llmtask Task contract #1

Merged
uqio merged 2 commits into main from 0.1.0 on May 10, 2026

Conversation


@uqio commented on May 9, 2026

First publishable cut of the crate: a Rust ONNX Runtime port of LiquidAI/LFM2.5-VL-450M, with schema-constrained sampling via llguidance and the full engine-agnostic llmtask::Task surface.

Three layers

  • lfm::Engine — sync, single-threaded; built on ort 2.0. Engine::generate(messages, images, opts) is the unconstrained free-form path. Engine::run<T: llmtask::Task>(task, messages, images, opts) is the constrained path: any Task whose Grammar is JSON Schema, Lark, or Regex routes through llguidance and the result is decoded by the task's parse impl. (A usage sketch follows this list.)
  • lfm::ImageAnalysisTask — built-in image-analysis preset that produces the canonical llmtask::ImageAnalysis output type, sharing the schema and resilient parser with qwen3-vl.
  • lfm::preproc — wasm-friendly image preprocessing surface (Preprocessor, TileGrid, EXIF-aware decode helpers). Compiles under --no-default-features --features decoders for use in contexts that don't need the inference runtime.
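
Putting the two Engine entry points together, a hypothetical calling sketch: the generate/run shapes follow the first bullet above, but `Message`, `Image`, and `Options` are placeholders for the crate's actual parameter types, and `ImageAnalysisTask::default()` is an assumption.

```rust
use lfm::{Engine, ImageAnalysisTask};

// `Message`, `Image`, and `Options` are stand-ins for the crate's real types.
fn describe(
    engine: &mut Engine,
    messages: &[Message],
    images: &[Image],
) -> Result<(), Box<dyn std::error::Error>> {
    // Unconstrained free-form path: plain sampling, raw text out.
    let text = engine.generate(messages, images, Options::default())?;
    println!("free-form: {text}");

    // Constrained path: the task's Grammar is compiled by llguidance into a
    // token mask, and the completion is decoded by the task's `parse` impl.
    let analysis =
        engine.run(&ImageAnalysisTask::default(), messages, images, Options::default())?;
    println!("typed: {analysis:?}");
    Ok(())
}
```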

llmtask-driven generic engine

The whole inference path takes &impl Task<Value = Value>, so a Task written once against the llmtask contract runs through lfm (llguidance) and qwen3-vl (mistralrs) without translation. Because lfm's backend is llguidance, all three Grammar variants (JSON Schema, Lark, Regex) are accepted; engines that only speak JSON Schema reject the others via UnsupportedGrammar and the caller can route to lfm.
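
The routing described in the last sentence can live entirely in caller code. A hedged sketch: the `UnsupportedGrammar` name comes from the paragraph above, but the engine types, error enum, and signatures below are placeholders, not the contract's literal API.

```rust
use serde_json::Value;

// Assumed shape of caller-side routing: try the JSON-Schema-only engine
// first, fall back to lfm when the task's grammar variant is unsupported.
fn run_routed<T: llmtask::Task<Value = Value>>(
    qwen: &mut qwen3_vl::Engine,
    lfm: &mut lfm::Engine,
    task: &T,
    messages: &[Message],
    images: &[Image],
    opts: &Options,
) -> Result<Value, EngineError> {
    match qwen.run(task, messages, images, opts) {
        // mistralrs only speaks JSON Schema; Lark and Regex grammars are
        // rejected up front...
        Err(EngineError::UnsupportedGrammar(_)) => {
            // ...while llguidance accepts all three variants.
            lfm.run(task, messages, images, opts)
        }
        other => other,
    }
}
```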

Strict from_dir + bundled escape hatch

  • Engine::from_dir byte-validates the supplied tokenizer.json, chat_template.jinja, preprocessor_config.json, and the text_config.max_position_embeddings field of config.json against the bundled blobs. A model directory whose ONNX shapes pass but whose tokenizer/template/preprocessor drifted would silently corrupt prompts; this fail-closed check forces the drift into a clear load-time error (a sketch of the check follows this list).
  • Engine::from_onnx_dir (under bundled feature) accepts an ONNX-only directory; the bundled tokenizer / chat template / configs are written to a temp file on first use.
  • Engine::from_paths is the unchecked escape hatch for advanced callers pairing custom tokenizers with custom ONNX.
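
Conceptually, the fail-closed check is nothing more than a byte comparison against blobs compiled into the crate. A minimal sketch, assuming a hypothetical helper (not the crate's actual code):

```rust
use std::{fs, io, path::Path};

/// Compare a supplied sidecar file byte-for-byte against the bundled blob,
/// turning silent drift into a load-time error. Illustrative only.
fn validate_sidecar(dir: &Path, name: &str, bundled: &[u8]) -> io::Result<()> {
    let supplied = fs::read(dir.join(name))?;
    if supplied != bundled {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("{name} differs from the bundled blob; refusing to load"),
        ));
    }
    Ok(())
}
```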

Architecture

Per-image vision encoding → text+image embedding splice → hybrid KV/conv-state cache decoder loop → optional schema-constrained sampling.

| Graph | Role |
| --- | --- |
| `vision_encoder.onnx` | SigLIP2 encoder; single image per call (multi-image batching produces silently wrong embeddings). |
| `embed_tokens.onnx` | Token embedding lookup. |
| `decoder_model_merged.onnx` | LFM2 hybrid LM: 10 conv-state + 6 KV-attn layers at sparse indices. The `Decoder` manages the non-contiguous cache layout transparently. |
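
In outline, one request flows through the three graphs like this. A hedged sketch: every method name below is a placeholder; only the graph boundaries and the one-image-per-call restriction come from the table above.

```rust
// Illustrative outline of the three-graph pipeline; not the crate's code.
fn forward(engine: &mut Engine, prompt_ids: &[i64], images: &[Image]) -> Result<String> {
    // 1. Vision: one vision_encoder.onnx call PER image -- batching several
    //    images into one call yields silently-wrong embeddings.
    let image_embeds: Vec<Embedding> = images
        .iter()
        .map(|img| engine.encode_image(img))
        .collect::<Result<_>>()?;

    // 2. Text: embed_tokens.onnx lookup, then splice the image embeddings
    //    into the token-embedding sequence at the image-placeholder positions.
    let mut embeds = engine.embed_tokens(prompt_ids)?;
    engine.splice_images(&mut embeds, &image_embeds)?;

    // 3. Decode: decoder_model_merged.onnx step loop over the hybrid cache
    //    (10 conv-state + 6 KV-attn layers at sparse indices), with optional
    //    llguidance token masking before each sample.
    engine.decode_loop(embeds)
}
```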

Compatibility with the current LFM2.5-VL-450M-ONNX exports

Two fixes were needed against the published HF repo:

  1. Cross-axis pad for pixel_values. The SigLIP2 NaFlex pos_embed Resize-target is computed as (max_h, max_w) = ReduceMax(spatial_shapes, axis=0) per axis and reshaped to [max_h * max_w, dim]. So pixel_values.shape[1] must equal max_h * max_w (cross-axis product), not the per-entry max(h * w). flatten_to_patches now pads accordingly; per-entry spatial_shapes and pixel_attention_mask still describe each entry's actual layout.
  2. Empty KV cache via allocator path. ort 2.0's Tensor::from_array rejects any zero-dim shape with "Invalid dimension #N; all dimensions must be >= 1". This broke Decoder::new_cache initialising the empty [1, 8, 0, 64] attn cache. The empty cache is now routed through Tensor::<f32>::new(allocator, shape) (the ONNX Runtime allocator path), which accepts zero-element shapes. (Sketches of both fixes follow.)
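
Fix 1 reduces to a few lines of shape arithmetic and fix 2 to a one-line construction swap. A sketch of both, with illustrative names: `padded_patch_len` is not the crate's API, and the `allocator` is assumed to come from the live session.

```rust
/// Fix 1 (sketch): padded sequence length for pixel_values, matching the
/// graph's per-axis ReduceMax -- max_h * max_w across ALL entries, which can
/// exceed every individual entry's h * w.
fn padded_patch_len(spatial_shapes: &[(usize, usize)]) -> usize {
    let max_h = spatial_shapes.iter().map(|&(h, _)| h).max().unwrap_or(0);
    let max_w = spatial_shapes.iter().map(|&(_, w)| w).max().unwrap_or(0);
    max_h * max_w
}

#[test]
fn cross_axis_product_not_per_entry_max() {
    // 32x8 and 8x32 are each 256 patches, but the graph expects
    // pixel_values.shape[1] == 32 * 32 = 1024.
    assert_eq!(padded_patch_len(&[(32, 8), (8, 32)]), 1024);
}

// Fix 2 (sketch): route zero-element cache shapes through the allocator,
// since from_array rejects any dim of 0. `allocator` is assumed to come
// from the live session:
//
//   let empty = ort::value::Tensor::<f32>::new(&allocator, [1, 8, 0, 64])?;
```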

Admission-control DoS guards

Bounded request-shape cap (max messages, max content parts), text-size cap, image-count lower bound from min_image_tokens, header-time decoded-buffer cap, and a special-token denylist seeded from the live tokenizer's added_vocabulary. All run BEFORE any image decode or template render.
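
The ordering is the point: every gate is computable from request metadata alone. A minimal self-contained sketch of the shape/size caps (limit names, types, and messages are illustrative; the min_image_tokens bound and the special-token denylist are omitted for brevity):

```rust
// Illustrative admission gate: all checks run before any image decode or
// chat-template render, so a hostile request fails before it costs anything.
struct Limits {
    max_messages: usize,
    max_parts_per_message: usize,
    max_text_bytes: usize,
    max_decoded_image_bytes: u64,
}

struct Part {
    text_len: usize,
    // Decoded size read from the image header alone, without decoding.
    image_header_decoded_size: Option<u64>,
}
struct Message {
    parts: Vec<Part>,
}

fn admit(messages: &[Message], lim: &Limits) -> Result<(), &'static str> {
    if messages.len() > lim.max_messages {
        return Err("too many messages");
    }
    let mut text_total = 0usize;
    for m in messages {
        if m.parts.len() > lim.max_parts_per_message {
            return Err("too many content parts");
        }
        for p in &m.parts {
            text_total += p.text_len;
            // Header-time cap: reject before allocating the decoded buffer.
            if let Some(sz) = p.image_header_decoded_size {
                if sz > lim.max_decoded_image_bytes {
                    return Err("decoded image would exceed buffer cap");
                }
            }
        }
    }
    if text_total > lim.max_text_bytes {
        return Err("total text exceeds cap");
    }
    Ok(())
}
```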

Hardware backends

ORT execution providers gated behind feature flags: cuda, tensorrt, directml, rocm, coreml. None are required for CPU inference; each requires its vendor SDK at build time.

Test plan

  • cargo test --lib --all-features — 138 lib tests, 0 failures
  • cargo check --no-default-features — bare wasm-friendly preproc surface compiles
  • cargo check --no-default-features --features decoders — pure preprocessing build (no ort, no tokenizers)
  • Integration suite vs LiquidAI/LFM2.5-VL-450M-ONNX (HEAD revision): 9/9 tests pass, including a cross-engine ImageAnalysis comparison against the airport thumbnails shared with qwen3-vl.
  • Codex adversarial review approved on round 1.

🤖 Generated with Claude Code

@uqio changed the title from 0.1.0 to lfm 0.1.0 — Rust ONNX inference for LFM2.5-VL with llmtask Task contract on May 10, 2026
… mismatch

The Windows CI test job fails to link with:

    libesaxx_rs … error LNK2038: mismatch detected for 'RuntimeLibrary':
      value 'MT_StaticRelease' doesn't match value 'MD_DynamicRelease'
      in libort_sys
    fatal error LNK1319: 1 mismatches detected

`esaxx-rs` is the C++ suffix-array trainer for tokenizers' Unigram
training. It's pulled in by tokenizers' default `esaxx_fast` feature
and built with the static CRT `/MT`, which conflicts with `ort_sys`
built with the dynamic CRT `/MD`.

lfm only uses `tokenizers::Tokenizer::from_file` + encode/decode at
inference time — the trainer paths are dead code for us. Drop
default features (`progressbar`, `onig`, `esaxx_fast`) and re-enable
just `fancy-regex`, the pure-Rust regex backend that JSON-defined
BPE tokenizers need at runtime. This matches the feature set the
sibling `toktrie_hf_tokenizers` 1.7 already uses on tokenizers 0.21
transitively, so the dep tree stays consistent.
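
For context, the entirety of what lfm asks of `tokenizers` at runtime looks roughly like this (file path illustrative); none of it touches the Unigram trainer that drags in `esaxx-rs`:

```rust
use tokenizers::Tokenizer;

fn roundtrip() -> Result<(), Box<dyn std::error::Error>> {
    // Load a pre-built JSON tokenizer -- no training, no suffix arrays.
    let tok = Tokenizer::from_file("tokenizer.json")?;

    // encode/decode are the only paths exercised at inference time; the BPE
    // pre-tokenizer regex runs on fancy-regex, pure Rust.
    let enc = tok.encode("a photo of an airport", false)?;
    let text = tok.decode(enc.get_ids(), true)?;
    println!("{:?} -> {text}", enc.get_ids());
    Ok(())
}
```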

138/138 lib tests pass locally with the new feature set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@uqio merged commit 0e957d1 into main on May 10, 2026 (2 of 11 checks passed).
@uqio deleted the 0.1.0 branch on May 10, 2026 at 05:54.