feat(llama): multi-GPU layer distribution + KV cache placement#69
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds full multi-GPU support to the llama.cpp local runtime: users can split a model
across multiple discrete GPUs with a choice of distribution strategy, control where the
KV cache lives, and (per model) assign exact per-GPU and CPU layer counts. Builds on the
initial multi-GPU pass (db3deaf) and makes it actually configurable.
What's included
Distribution strategies (the "offloader mode"):
live
GPU + CPU = total layersvalidation (per-model only)KV cache placement: Auto / Split with layers / System RAM / Pin to one GPU
(maps to
offload_kqv+main_gpu).Guardrails:
VRAM accounting (scoped to the multi-GPU paths; single-GPU iGPU-only hosts are unaffected)
per-model editor and the global runtime-defaults page
interconnect (NVLink) that's rare, while layer split is correct for PCIe consumer GPUs
Where it lives:
EditModelPagegets the full controls (strategy, devices, manual counts, KV)LocalRuntimeDefaultsPagesets the inheritable defaults (strategy, priority limit,KV placement); per-model values fall back to these via the
session → model → settingschaini18n: the multi-GPU runtime-defaults strings (previously English-only) are now translated
across all 19 non-English locales.
Implementation notes
llama_model_paramsstruct, so
tensor_split(per-device proportions) andmain_gpuare set directly. Thetensor_splitbuffer is held at function scope so it outlivesllama_load_model_from_file.plan_multi_gpu_distribution(offload.rs) turns a strategy + per-device free VRAM +model layer count into a concrete
(n_gpu_layers, tensor_split, main_gpu).the ladder (proportions stay fixed while the total shrinks on VRAM-pressure fallback).
model_params_keyincludes the new fields so a strategy change forces a reload.Testing
bun run check(tsc + cargo check) passes clean, no warnings.disabled with <2 discrete GPUs, manual-sum validation, context-info estimates, and all 19
locales (25 keys each, placeholders intact).
on real 2+ discrete-GPU hardware — the dev machine has a single discrete GPU.