feat(llama): multi-GPU layer distribution + KV cache placement by MegalithOfficial · Pull Request #69 · LettuceAI/app

MegalithOfficial · 2026-06-30T12:55:07Z

Summary

Adds full multi-GPU support to the llama.cpp local runtime: users can split a model
across multiple discrete GPUs with a choice of distribution strategy, control where the
KV cache lives, and (per model) assign exact per-GPU and CPU layer counts. Builds on the
initial multi-GPU pass (db3deaf) and makes it actually configurable.

What's included

Distribution strategies (the "offloader mode"):

Balanced: even layer split across the selected GPUs
Proportional to VRAM: weighted by each GPU's free memory (good for mismatched cards)
Priority fill: fill the first GPU up to a VRAM limit, then overflow to the next
Manual per-GPU: explicit layer count per GPU plus a CPU layer count, with live
validation that the GPU and CPU counts sum to the model's total layers (per-model only)
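
As a rough illustration of how the three automatic strategies differ, here is a minimal sketch. It is not the PR's code: the `Strategy` enum, `split_layers`, the percentage-based VRAM limit, and the uniform `layer_bytes` cost are all assumptions made for the example.

```rust
/// Hypothetical strategy type; variant names echo the UI labels above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Strategy {
    Balanced,
    ProportionalToVram,
    /// Fill GPUs in order, each up to `limit_percent` of its free VRAM (assumed unit).
    PriorityFill { limit_percent: u64 },
}

/// Returns the number of layers assigned to each GPU, in device order.
/// Layers that fit on no GPU are left for the CPU and are not in the result.
/// `layer_bytes` is an assumed uniform per-layer cost, used only by PriorityFill.
fn split_layers(strategy: Strategy, total_layers: u32, free_vram: &[u64], layer_bytes: u64) -> Vec<u32> {
    let n = free_vram.len();
    if n == 0 {
        return Vec::new();
    }
    match strategy {
        Strategy::Balanced => {
            let base = total_layers / n as u32;
            let extra = (total_layers % n as u32) as usize;
            // The first `extra` GPUs take one additional layer each.
            (0..n).map(|i| base + u32::from(i < extra)).collect()
        }
        Strategy::ProportionalToVram => {
            let sum: u128 = free_vram.iter().map(|&v| u128::from(v)).sum();
            if sum == 0 {
                return vec![0; n];
            }
            // Floor of each GPU's proportional share.
            let mut out: Vec<u32> = free_vram
                .iter()
                .map(|&v| (u128::from(total_layers) * u128::from(v) / sum) as u32)
                .collect();
            // Hand leftover layers to the GPUs with the most free VRAM first.
            let mut order: Vec<usize> = (0..n).collect();
            order.sort_by(|&a, &b| free_vram[b].cmp(&free_vram[a]));
            let leftover = total_layers - out.iter().sum::<u32>();
            for &i in order.iter().take(leftover as usize) {
                out[i] += 1;
            }
            out
        }
        Strategy::PriorityFill { limit_percent } => {
            let mut remaining = total_layers;
            free_vram
                .iter()
                .map(|&v| {
                    // VRAM budget on this GPU, then how many layers fit in it.
                    let budget = u128::from(v) * u128::from(limit_percent.min(100)) / 100;
                    let fits = budget / u128::from(layer_bytes.max(1));
                    let take = fits.min(u128::from(remaining)) as u32;
                    remaining -= take;
                    take
                })
                .collect()
        }
    }
}
```

With 33 layers and two GPUs, `Balanced` gives 17 and 16. With free VRAM in a 3:1 ratio, `ProportionalToVram` gives roughly three quarters of the layers to the larger card. `PriorityFill` only spills to the second GPU once the first reaches its limit.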

KV cache placement: Auto / Split with layers / System RAM / Pin to one GPU
(maps to offload_kqv + main_gpu).
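
The description only says that KV placement maps to `offload_kqv` and `main_gpu`; the exact mapping is not spelled out. One plausible reading, sketched below with hypothetical `KvPlacement` and `KvParams` types, is that "System RAM" turns KV offload off, "Pin to one GPU" sets `main_gpu`, and the other two modes leave the device choice to the runtime.

```rust
/// Hypothetical mirror of the four UI options.
#[derive(Debug, Clone, Copy, PartialEq)]
enum KvPlacement {
    Auto,
    SplitWithLayers,
    SystemRam,
    /// Device index of the GPU that should hold the KV cache.
    PinToGpu(i32),
}

/// The two knobs named in the PR description. `None` for `main_gpu`
/// means "leave whatever the distribution plan chose" (an assumption).
#[derive(Debug, PartialEq)]
struct KvParams {
    offload_kqv: bool,
    main_gpu: Option<i32>,
}

/// Assumed mapping; the PR's real logic may differ.
fn kv_params(placement: KvPlacement) -> KvParams {
    match placement {
        // Keep the KV cache in host memory.
        KvPlacement::SystemRam => KvParams { offload_kqv: false, main_gpu: None },
        // Offload KV and name the GPU that should own it.
        KvPlacement::PinToGpu(idx) => KvParams { offload_kqv: true, main_gpu: Some(idx) },
        // Offload KV and let it follow the layer distribution.
        KvPlacement::Auto | KvPlacement::SplitWithLayers => {
            KvParams { offload_kqv: true, main_gpu: None }
        }
    }
}
```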

Guardrails:

Integrated GPUs are never used — excluded from enumeration, selection, validation, and
VRAM accounting (scoped to the multi-GPU paths; single-GPU iGPU-only hosts are unaffected)
The Multi-GPU toggle is disabled when fewer than 2 discrete GPUs are present, on both the
per-model editor and the global runtime-defaults page
Only layer split is supported — row/tensor split were removed; they require a fast
interconnect (NVLink) that's rare, while layer split is correct for PCIe consumer GPUs
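
The first two guardrails amount to a filter plus a count check. A minimal sketch, assuming a hypothetical `GpuInfo` record with an `integrated` flag (the PR's actual device type is not shown here):

```rust
/// Hypothetical device record; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct GpuInfo {
    name: String,
    integrated: bool,
    free_vram_bytes: u64,
}

/// Integrated GPUs are dropped before any multi-GPU selection or VRAM accounting.
fn discrete_gpus(all: &[GpuInfo]) -> Vec<&GpuInfo> {
    all.iter().filter(|g| !g.integrated).collect()
}

/// The Multi-GPU toggle is only enabled with at least two discrete GPUs.
fn multi_gpu_toggle_enabled(all: &[GpuInfo]) -> bool {
    discrete_gpus(all).len() >= 2
}

/// Free VRAM summed over discrete GPUs only.
fn discrete_free_vram(all: &[GpuInfo]) -> u64 {
    discrete_gpus(all).iter().map(|g| g.free_vram_bytes).sum()
}
```

A host with one discrete card and one iGPU therefore keeps the toggle disabled, which matches the single-GPU regression case described under Testing.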

Where it lives:

Per-model: EditModelPage gets the full controls (strategy, devices, manual counts, KV)
Global: LocalRuntimeDefaultsPage sets the inheritable defaults (strategy, priority limit,
KV placement); per-model values fall back to these via the session → model → settings chain
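
The fallback chain can be pictured as the first value that is set wins. The helper below is a hypothetical sketch that assumes each level stores its value as an `Option` and that session overrides model, which overrides the global settings:

```rust
/// Resolve one setting through the session -> model -> settings chain.
/// Returns `None` only when no level has a value.
fn resolve<T>(session: Option<T>, model: Option<T>, settings: Option<T>) -> Option<T> {
    session.or(model).or(settings)
}
```

For example, `resolve(None, Some("balanced"), Some("priority_fill"))` yields `Some("balanced")`: the per-model value shadows the global default, and the global default applies only when the model leaves the field unset.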

i18n: the multi-GPU runtime-defaults strings (previously English-only) are now translated
across all 19 non-English locales.

Implementation notes

No llama-cpp-rs fork change needed. The loader builds the raw llama_model_params
struct, so tensor_split (per-device proportions) and main_gpu are set directly. The
tensor_split buffer is held at function scope so it outlives llama_load_model_from_file.
New plan_multi_gpu_distribution (offload.rs) turns a strategy + per-device free VRAM +
model layer count into a concrete (n_gpu_layers, tensor_split, main_gpu).
Manual mode pins a fixed total and bypasses the smart backoff ladder; auto strategies keep
the ladder (proportions stay fixed while the total shrinks on VRAM-pressure fallback).
model_params_key includes the new fields so a strategy change forces a reload.
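
To make the shape of the planner output concrete, here is a minimal sketch of the manual path and of one backoff step. It is not the PR's `plan_multi_gpu_distribution`: `Plan`, `plan_manual`, and `backoff_step` are hypothetical names, and picking the first GPU with layers as `main_gpu` is an assumption.

```rust
/// Hypothetical planner output: the (n_gpu_layers, tensor_split, main_gpu) triple.
#[derive(Debug, PartialEq)]
struct Plan {
    n_gpu_layers: u32,
    /// Per-device proportions, in device order.
    tensor_split: Vec<f32>,
    main_gpu: i32,
}

/// Manual mode: validate that GPU + CPU counts equal the model's layer total,
/// then turn the per-GPU counts into proportions.
fn plan_manual(gpu_layers: &[u32], cpu_layers: u32, total_layers: u32) -> Result<Plan, String> {
    let on_gpu: u64 = gpu_layers.iter().map(|&c| u64::from(c)).sum();
    if on_gpu + u64::from(cpu_layers) != u64::from(total_layers) {
        return Err(format!(
            "GPU ({on_gpu}) + CPU ({cpu_layers}) layers must equal total ({total_layers})"
        ));
    }
    if on_gpu == 0 {
        return Err("no layers assigned to any GPU".to_string());
    }
    let tensor_split = gpu_layers.iter().map(|&c| c as f32 / on_gpu as f32).collect();
    // Assumption: the first GPU that holds layers acts as main_gpu.
    let main_gpu = gpu_layers.iter().position(|&c| c > 0).unwrap_or(0) as i32;
    Ok(Plan { n_gpu_layers: on_gpu as u32, tensor_split, main_gpu })
}

/// One rung of a backoff ladder for the automatic strategies:
/// the total shrinks, the proportions are left untouched.
fn backoff_step(plan: &Plan, new_total: u32) -> Plan {
    Plan {
        n_gpu_layers: new_total.min(plan.n_gpu_layers),
        tensor_split: plan.tensor_split.clone(),
        main_gpu: plan.main_gpu,
    }
}
```

Keeping the proportions as a separate vector is what lets the total drop under VRAM pressure without recomputing the split, as the notes above describe for the automatic strategies.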

Testing

bun run check (tsc + cargo check) passes clean, no warnings.
Verified locally: single-GPU regression (controls hidden, behavior unchanged), toggle
disabled with <2 discrete GPUs, manual-sum validation, context-info estimates, and all 19
locales (25 keys each, placeholders intact).
⚠️ Actual cross-GPU split, priority-fill overflow, and KV pin/system-RAM placement need QA
on real 2+ discrete-GPU hardware — the dev machine has a single discrete GPU.

MegalithOfficial added 4 commits June 30, 2026 15:46

Add llama.cpp multi-GPU runtime controls

a9a0ea3

feat(llama): add multi-GPU layer distribution and KV placement controls

8c2dbd0

feat(llama): expose multi-GPU offloader defaults on local runtime defaults page

bfaf0f0

i18n(runtime-defaults): translate multi-GPU settings across all locales

95692ab

MegalithOfficial merged commit 6d71bc7 into main Jun 30, 2026
4 checks passed

MegalithOfficial merged 4 commits into main from feature/llama-multi-gpu
