Compute buffer and KV-cache aware layer distribution for multi-GPU inference #14484
Compute buffer and KV-cache aware layer distribution for multi-GPU inference. Solves the problem of running setups with heterogeneous GPU VRAM sizes (e.g. 24GB cards alongside 6GB cards); previously, layers were assigned without accounting for the compute buffer, causing failure when one or more of the smaller GPUs could not hold the compute buffer.
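To sketch the idea (an illustrative toy example, not the code added by this PR; the struct, function names, and byte sizes are all assumptions), a per-device split would first reserve each GPU's estimated compute buffer, then assign layers according to how many weight-plus-KV-cache slices fit in the remaining VRAM, skipping devices where nothing fits:

// Illustrative sketch only: distribute n_layers across GPUs by the VRAM left
// after reserving a per-device compute buffer, counting per-layer KV cache
// alongside the layer weights. Not the actual llama.cpp implementation.
#include <cstdint>
#include <cstdio>
#include <vector>

struct gpu_info {                  // hypothetical description of one device
    const char * name;
    int64_t free_bytes;            // free VRAM reported by the device
    int64_t compute_buf_bytes;     // estimated compute buffer (grows with context)
};

// How many layers fit on one device once its compute buffer is reserved.
static int layers_that_fit(const gpu_info & g, int64_t layer_bytes, int64_t kv_per_layer) {
    int64_t usable = g.free_bytes - g.compute_buf_bytes;
    if (usable <= 0) return 0;     // device cannot even hold its compute buffer
    return (int)(usable / (layer_bytes + kv_per_layer));
}

// Greedy split: give each device as many layers as it can hold, in order.
static std::vector<int> split_layers(const std::vector<gpu_info> & gpus, int n_layers,
                                     int64_t layer_bytes, int64_t kv_per_layer) {
    std::vector<int> split(gpus.size(), 0);
    int remaining = n_layers;
    for (size_t i = 0; i < gpus.size() && remaining > 0; ++i) {
        int cap  = layers_that_fit(gpus[i], layer_bytes, kv_per_layer);
        split[i] = cap < remaining ? cap : remaining;
        remaining -= split[i];
    }
    // remaining > 0 here means model + context do not fit: the caller should abort.
    return split;
}

int main() {
    const int64_t GiB = 1024LL * 1024 * 1024;
    // Assumed free VRAM and compute-buffer sizes, loosely mirroring a 24GB + 6GB setup.
    std::vector<gpu_info> gpus = {
        { "RTX3090", 24 * GiB, 2 * GiB },
        { "RTX2060",  6 * GiB, 2 * GiB },
    };
    // Assumed per-layer weight and KV-cache sizes for some model/context combination.
    std::vector<int> split = split_layers(gpus, 48, 400LL * 1024 * 1024, 80LL * 1024 * 1024);
    for (size_t i = 0; i < gpus.size(); ++i)
        printf("%s: %d layers\n", gpus[i].name, split[i]);
    return 0;
}

With the illustrative numbers above, the smaller card still receives a few layers; shrink its free VRAM or grow the compute buffer and it is skipped entirely, which is the behaviour this PR targets.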
Modifications include:
TESTING DETAILS:
Primary server node:
./llama.cpp/build/bin/llama-server -m ./llama.cpp/models/YOUR_LLM_MODEL.gguf --rpc WORKER_IP_1:PORT1,WORKER_IP_2:PORT2,WORKER_IP_3:PORT3 --host 0.0.0.0 --port LLAMACPP_SERVER_PORT -ngl NUMBER_OF_LAYERS_TO_DISTRIBUTE_TO_GPUS -c DESIRED_CONTEXT_LENGTH
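A concrete example invocation (hypothetical IPs, ports, layer count, and context length; the model file is one of those listed under LLMs tested below):
./llama.cpp/build/bin/llama-server -m ./llama.cpp/models/gemma-3-27b-it-q4_0.gguf --rpc 192.168.1.11:50052,192.168.1.12:50052,192.168.1.13:50052 --host 0.0.0.0 --port 8080 -ngl 99 -c 8192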
Worker node (run separately per GPU ID, even if on the same machine):
cd /YOUR_PATH_TO/llama.cpp && CUDA_VISIBLE_DEVICES=GPU_ID_NUMBER ./build/bin/rpc-server --host 0.0.0.0 --port WORKER_PORT
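For example, to expose only GPU 1 of a worker machine on port 50052 (values are illustrative):
cd /YOUR_PATH_TO/llama.cpp && CUDA_VISIBLE_DEVICES=1 ./build/bin/rpc-server --host 0.0.0.0 --port 50052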
LLMs tested:
Devstral-Small-2505-Q5_K_M.gguf
DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf
gemma-3-27b-it-q4_0.gguf
Architecture tested:
Primary server machine: Ubuntu 24.04, 2 GPUs (NVIDIA RTX3090 (24GB) + NVIDIA RTX2060 (6GB))
Worker host machine: Proxmox + 2 VMs:
Various -c context lengths were tested for each model; the new logic properly excludes small GPUs when their KV cache and compute buffer do not fit. If the model plus context length is too large for the setup as a whole, the launch fails (as expected). Offload to CPU + system RAM has not been tested.