feat(llama): multi-GPU layer distribution + KV cache placement #69

Merged
MegalithOfficial merged 4 commits into main from feature/llama-multi-gpu on Jun 30, 2026

@MegalithOfficial

Contributor

Summary

Adds full multi-GPU support to the llama.cpp local runtime: users can split a model
across multiple discrete GPUs with a choice of distribution strategy, control where the
KV cache lives, and (per model) assign exact per-GPU and CPU layer counts. Builds on the
initial multi-GPU pass (db3deaf) and makes it actually configurable.

What's included

Distribution strategies (the "offloader mode"):

  • Balanced — even layer split across the selected GPUs
  • Proportional to VRAM — weighted by each GPU's free memory (good for mismatched cards)
  • Priority fill — fill the first GPU up to a VRAM limit, then overflow to the next
  • Manual per-GPU — explicit layer-count input per GPU plus a CPU layers input, with
    live GPU + CPU = total layers validation (per-model only)
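
For reviewers who want a feel for how the three automatic strategies differ, here is a minimal sketch in Rust. It is not the code in this PR: `Strategy`, `split_layers`, `manual_counts_valid`, and the `bytes_per_layer` estimate are hypothetical names, and the rounding rules (remainder to the first GPUs, largest-remainder for the proportional case) are assumptions.

```rust
/// Hypothetical strategy enum; variant names mirror the UI labels, not the PR's types.
#[derive(Debug, Clone, Copy)]
pub enum Strategy {
    Balanced,
    ProportionalToVram,
    /// Fill GPUs in order, using at most `limit_bytes` of each GPU's free VRAM.
    PriorityFill { limit_bytes: u64 },
}

/// Returns one layer count per GPU. Only PriorityFill checks capacity here;
/// under that strategy, layers that fit on no GPU are left for the CPU.
pub fn split_layers(
    strategy: Strategy,
    free_vram: &[u64],
    total_layers: u32,
    bytes_per_layer: u64, // assumed per-layer size estimate
) -> Vec<u32> {
    let n = free_vram.len();
    if n == 0 || total_layers == 0 {
        return vec![0; n];
    }
    match strategy {
        Strategy::Balanced => {
            // Even split; the first `extra` GPUs take one more layer each.
            let base = total_layers / n as u32;
            let extra = (total_layers % n as u32) as usize;
            (0..n).map(|i| base + u32::from(i < extra)).collect()
        }
        Strategy::ProportionalToVram => {
            let sum: u128 = free_vram.iter().map(|&v| v as u128).sum();
            if sum == 0 {
                return vec![0; n];
            }
            // Floor each share, then hand leftovers to the largest remainders.
            let mut counts = Vec::with_capacity(n);
            let mut remainders = Vec::with_capacity(n);
            for (i, &v) in free_vram.iter().enumerate() {
                let scaled = v as u128 * total_layers as u128;
                counts.push((scaled / sum) as u32);
                remainders.push((scaled % sum, i));
            }
            let assigned: u32 = counts.iter().sum();
            remainders.sort_by(|a, b| b.0.cmp(&a.0).then(a.1.cmp(&b.1)));
            for &(_, i) in remainders.iter().take((total_layers - assigned) as usize) {
                counts[i] += 1;
            }
            counts
        }
        Strategy::PriorityFill { limit_bytes } => {
            if bytes_per_layer == 0 {
                return vec![0; n];
            }
            let mut remaining = total_layers;
            free_vram
                .iter()
                .map(|&free| {
                    // Each GPU's budget is its free VRAM capped by the limit.
                    let budget = free.min(limit_bytes);
                    let take = (budget / bytes_per_layer).min(remaining as u64) as u32;
                    remaining -= take;
                    take
                })
                .collect()
        }
    }
}

/// Manual mode check: per-GPU counts plus CPU layers must equal the model total.
pub fn manual_counts_valid(gpu_layers: &[u32], cpu_layers: u32, total_layers: u32) -> bool {
    let gpu_sum: u64 = gpu_layers.iter().map(|&c| c as u64).sum();
    gpu_sum + cpu_layers as u64 == total_layers as u64
}
```

With 32 layers, Balanced over three GPUs would give 11/11/10 in this sketch, while Proportional over cards with a 3:1 free-VRAM ratio would give 24/8.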

KV cache placement: Auto / Split with layers / System RAM / Pin to one GPU
(maps to offload_kqv + main_gpu).
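
As a rough illustration of that mapping, the sketch below turns a placement choice into the two knobs named above. The `KvPlacement` and `KvParams` types are hypothetical, and treating Auto the same as "Split with layers" is an assumption, not something this PR states.

```rust
/// Hypothetical placement enum mirroring the four UI options.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KvPlacement {
    Auto,
    SplitWithLayers,
    SystemRam,
    /// Pin the KV cache to the GPU with this device index.
    PinToGpu(i32),
}

/// The two settings the placement resolves to. `main_gpu: None` means
/// "leave whatever the layer plan already chose".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct KvParams {
    pub offload_kqv: bool,
    pub main_gpu: Option<i32>,
}

pub fn kv_params(placement: KvPlacement) -> KvParams {
    match placement {
        // Assumption: Auto keeps the KV cache on the GPUs next to the layers.
        KvPlacement::Auto | KvPlacement::SplitWithLayers => KvParams {
            offload_kqv: true,
            main_gpu: None,
        },
        // Keeping the cache in system RAM means not offloading it at all.
        KvPlacement::SystemRam => KvParams {
            offload_kqv: false,
            main_gpu: None,
        },
        KvPlacement::PinToGpu(index) => KvParams {
            offload_kqv: true,
            main_gpu: Some(index),
        },
    }
}
```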

Guardrails:

  • Integrated GPUs are never used — excluded from enumeration, selection, validation, and
    VRAM accounting (scoped to the multi-GPU paths; single-GPU iGPU-only hosts are unaffected)
  • The Multi-GPU toggle is disabled when fewer than 2 discrete GPUs are present, on both the
    per-model editor and the global runtime-defaults page
  • Only layer split is supported. Row and tensor split were removed: they need a fast
    interconnect (NVLink), which is rare, while layer split is the right fit for PCIe consumer GPUs
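
The first two guardrails boil down to filtering on a single flag. A minimal sketch, assuming a hypothetical `GpuInfo` record with an `integrated` field (the real enumeration types in this PR may look different):

```rust
/// Hypothetical device record; field names are illustrative only.
#[derive(Debug, Clone)]
pub struct GpuInfo {
    pub index: usize,
    pub name: String,
    pub integrated: bool,
    pub free_vram: u64,
}

/// Integrated GPUs are dropped before any multi-GPU logic sees the list.
pub fn discrete_gpus(all: &[GpuInfo]) -> Vec<&GpuInfo> {
    all.iter().filter(|g| !g.integrated).collect()
}

/// The Multi-GPU toggle is only usable with at least two discrete GPUs.
pub fn multi_gpu_toggle_enabled(all: &[GpuInfo]) -> bool {
    discrete_gpus(all).len() >= 2
}

/// VRAM accounting for the multi-GPU paths ignores integrated GPUs.
pub fn usable_vram(all: &[GpuInfo]) -> u64 {
    discrete_gpus(all).iter().map(|g| g.free_vram).sum()
}
```

Routing enumeration, selection, validation, and VRAM accounting through one filter like `discrete_gpus` is one way to keep the four from drifting apart.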

Where it lives:

  • Per-model: EditModelPage gets the full controls (strategy, devices, manual counts, KV)
  • Global: LocalRuntimeDefaultsPage sets the inheritable defaults (strategy, priority limit,
    KV placement); per-model values fall back to these via the session → model → settings chain
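
The fallback chain can be pictured as layered `Option`s, where the most specific layer that sets a value wins. This is a sketch under assumptions: the struct and field names are hypothetical, and it reads "session → model → settings" as session overrides first, then per-model values, then the global defaults.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    Balanced,
    ProportionalToVram,
    PriorityFill,
}

/// Optional overrides, as a session or a model might carry them (hypothetical shape).
#[derive(Debug, Clone, Copy, Default)]
pub struct MultiGpuOverrides {
    pub strategy: Option<Strategy>,
    pub priority_limit_mb: Option<u64>,
}

/// Fully resolved values; the global defaults always have every field set.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MultiGpuResolved {
    pub strategy: Strategy,
    pub priority_limit_mb: u64,
}

/// Per field: session value if set, else model value, else the global default.
pub fn resolve(
    session: &MultiGpuOverrides,
    model: &MultiGpuOverrides,
    defaults: &MultiGpuResolved,
) -> MultiGpuResolved {
    MultiGpuResolved {
        strategy: session
            .strategy
            .or(model.strategy)
            .unwrap_or(defaults.strategy),
        priority_limit_mb: session
            .priority_limit_mb
            .or(model.priority_limit_mb)
            .unwrap_or(defaults.priority_limit_mb),
    }
}
```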

i18n: the multi-GPU runtime-defaults strings (previously English-only) are now translated
across all 19 non-English locales.

Implementation notes

  • No llama-cpp-rs fork change needed. The loader builds the raw llama_model_params
    struct, so tensor_split (per-device proportions) and main_gpu are set directly. The
    tensor_split buffer is held at function scope so it outlives llama_load_model_from_file.
  • New plan_multi_gpu_distribution (offload.rs) turns a strategy + per-device free VRAM +
    model layer count into a concrete (n_gpu_layers, tensor_split, main_gpu).
  • Manual mode pins a fixed total and bypasses the smart backoff ladder; auto strategies keep
    the ladder (proportions stay fixed while the total shrinks on VRAM-pressure fallback).
  • model_params_key includes the new fields so a strategy change forces a reload.
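
To make the last three notes concrete, the sketch below shows per-GPU layer counts becoming proportions, a backoff step that shrinks the total while leaving those proportions alone, and a reload key that changes when the plan does. Everything here is illustrative: `Plan`, `plan_from_counts`, `back_off`, and `reload_key` are hypothetical names, the "drop about a quarter" step size is an assumption, and the real `tensor_split` buffer handed to llama.cpp is sized for the maximum device count rather than the selected GPUs.

```rust
/// Hypothetical plan shape, loosely following the tuple described above.
#[derive(Debug, Clone, PartialEq)]
pub struct Plan {
    pub n_gpu_layers: u32,
    pub tensor_split: Vec<f32>,
    pub main_gpu: usize,
}

/// Turn per-GPU layer counts into a total plus per-device proportions.
pub fn plan_from_counts(counts: &[u32], main_gpu: usize) -> Plan {
    let total: u32 = counts.iter().sum();
    let tensor_split = if total == 0 {
        vec![0.0; counts.len()]
    } else {
        counts.iter().map(|&c| c as f32 / total as f32).collect()
    };
    Plan { n_gpu_layers: total, tensor_split, main_gpu }
}

/// One assumed backoff step: offload roughly a quarter fewer layers and keep
/// the proportions untouched. Returns None once nothing is left to shed.
pub fn back_off(plan: &Plan) -> Option<Plan> {
    if plan.n_gpu_layers == 0 {
        return None;
    }
    let reduced = plan.n_gpu_layers - (plan.n_gpu_layers / 4).max(1);
    Some(Plan { n_gpu_layers: reduced, ..plan.clone() })
}

/// A cache key that covers every field affecting how the model is loaded,
/// so that any change to the plan yields a different key.
pub fn reload_key(plan: &Plan) -> String {
    let split: Vec<String> = plan
        .tensor_split
        .iter()
        .map(|p| format!("{:.4}", p))
        .collect();
    format!(
        "ngl={};split={};main={}",
        plan.n_gpu_layers,
        split.join(","),
        plan.main_gpu
    )
}
```

In this sketch a 24/8 plan backs off from 32 to 24 offloaded layers with the 0.75/0.25 split unchanged, and the two plans produce different keys.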

Testing

  • bun run check (tsc + cargo check) passes clean, no warnings.
  • Verified locally: single-GPU regression (controls hidden, behavior unchanged), toggle
    disabled with <2 discrete GPUs, manual-sum validation, context-info estimates, and all 19
    locales (25 keys each, placeholders intact).
  • ⚠️ Actual cross-GPU split, priority-fill overflow, and KV pin/system-RAM placement need QA
    on real 2+ discrete-GPU hardware — the dev machine has a single discrete GPU.

@MegalithOfficial MegalithOfficial merged commit 6d71bc7 into main Jun 30, 2026
4 checks passed