@@ -10,7 +10,7 @@ learning_objectives:
- Run a large quantized model (for example, Llama 3.1 405B) with distributed CPU inference on Arm machines

prerequisites:
- Three AWS c8g.16xlarge instances with at least 2 TB of EBS storage
- Three AWS c8g.4xlarge instances with at least 500 GB of EBS storage
- Python 3 installed on each instance
- Access to Meta's gated repository for the Llama 3.1 model family and a Hugging Face token to download models
- Familiarity with the Learning Path [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu)
@@ -8,23 +8,23 @@ layout: learningpathall

## Overview

This example runs on three AWS Graviton4 `c8g.16xlarge` instances. Each instance has 64 cores, 128 GB of RAM, and 2 TB of disk storage to store the downloaded and quantized model weights.
This example runs on three AWS Graviton4 `c8g.4xlarge` instances. Each instance has 16 cores, 32 GB of RAM, and 200 GB of disk storage to store the downloaded and quantized model weights.

In this Learning Path, you will:

- Download Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).
- Download Meta's [Llama 3.1 70B parameter model](https://huggingface.co/meta-llama/Llama-3.1-70B).
- Download and build `llama.cpp`, a C++ library for efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.
- Convert Meta's `safetensors` files to a single GGUF file.
- Quantize the 16-bit GGUF weights file to 4-bit weights.
- Load and run the model.

{{% notice Note %}}
The **Reading time** shown on the **Introduction** page does not include downloading, converting, and quantizing the model. These steps can take more than six hours. If you already have a quantized GGUF file, you can skip the download and quantization.
The **Reading time** shown on the **Introduction** page does not include downloading, converting, and quantizing the model. These steps can take 1-2 hours. If you already have a quantized GGUF file, you can skip the download and quantization.
{{% /notice %}}

## Set up dependencies

Before you start, make sure you have permission to access Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).
Before you start, make sure you have permission to access Meta's [Llama 3.1 70B parameter model](https://huggingface.co/meta-llama/Llama-3.1-70B).

{{% notice Note %}}
You must repeat the install steps on each device. However, only run the download and quantization steps once as `llama.cpp` caches the tensors for reuse across devices.
@@ -34,7 +34,7 @@ You must repeat the install steps on each device. However, only run the download

```bash
apt update
apt install python3.12-venv
apt install -y python3.12-venv
python3 -m venv myenv
source myenv/bin/activate
```
@@ -58,7 +58,6 @@ The build output is placed in the `build-rpc/bin` directory.
Verify that the build succeeded by running the help command:

```bash
cd build-rpc
bin/llama-cli -h
```

@@ -73,6 +72,7 @@ pip3 install huggingface_hub
Create a new Python file named `download.py`:

```bash
cd ../..
vi download.py
```

@@ -81,8 +81,7 @@ Add the following code:
```python
import os
from huggingface_hub import snapshot_download

model_id = "meta-llama/Llama-3.1-405B"
model_id = "meta-llama/Llama-3.1-70B"
local_dir = "llama-hf"

# Create the directory if it doesn't exist
@@ -120,10 +119,10 @@ Quantize the model to 4-bit weights:

```bash
cd llama.cpp/build-rpc
bin/llama-quantize ../../llama-hf/llama-3.1-405B-F16.GGUF Q4_0
bin/llama-quantize ../../llama-hf/Llama-3.1-70B-F16.gguf Q4_0
```

You can rename the output file to `model.GGUF` for easier use.
You can rename the output file to `model.gguf` for easier use.

Check available quantization options:

@@ -12,19 +12,19 @@ layout: learningpathall

Just over a year before this Learning Path was published, Radoslav Gerganov's (rgerganov) RPC code was merged into `llama.cpp`. This feature enables distributed inference of large LLMs across multiple CPU-based machines, even when the models don’t fit into the memory of a single machine.

In this Learning Path, you’ll explore how to run a 405B parameter model on Arm-based CPUs.
In this Learning Path, you’ll explore how to run a 70B parameter model on Arm-based CPUs.

For this demonstration, the experimental setup includes:

- Number of instances: 3
- Instance type: `c8g.16xlarge`
- Model: `model.GGUF` (Llama-3.1-405B_Q4_0)
- Total number of instances: 3
- Instance type: c8g.4xlarge
- Model: model.gguf (Llama-3.1-70B_Q4_0, ~38GB when quantized to 4 bits)
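
As a rough sanity check on the quantized size, you can estimate the file size from the parameter count and the average bits per weight. The sketch below assumes the figures that `llama.cpp` reports in the example output of this Learning Path (70.55 B parameters, 4.53 bits per weight); the function name `quantized_size_gib` is hypothetical and used only for illustration.

```python
def quantized_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Estimate the size of a quantized model file in GiB."""
    total_bytes = n_params * bits_per_weight / 8  # bits -> bytes
    return total_bytes / 2**30                    # bytes -> GiB


# Values taken from the llama.cpp output for the Q4_0 70B model
size_gib = quantized_size_gib(70.55e9, 4.53)
print(f"Estimated model size: {size_gib:.1f} GiB")
```

The estimate lands close to the 37.22 GiB that `llama.cpp` reports. That is more than the 32 GB of RAM on a single `c8g.4xlarge` instance, which is why the weights are split across the worker nodes.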

One of the three nodes serves as the master node, which physically hosts the model file. The other two nodes act as worker nodes. In `llama.cpp`, remote procedure calls (RPC) offload both the model and the computation over TCP connections between nodes. The master node forwards inference requests to the worker nodes, where computation is performed.

## Set up the worker nodes

Choose two of the three devices to act as backend workers. If the devices have varying compute capacities, select the ones with the highest compute, especially for a 405B model. Because all three devices in this setup are identical, you can select any two to serve as backend workers.
Choose two of the three devices to act as backend workers. If the devices have varying compute capacities, select the ones with the highest compute. Because all three devices in this setup are identical, you can select any two to serve as backend workers.

Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master, such as model parameters, tokens, hidden states, and other inference-related information.
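
Before starting inference, it can help to confirm that the master node can open a TCP connection to each worker's socket. The following is a minimal sketch using only the Python standard library; the function name `worker_reachable` is hypothetical, and port `50052` is assumed from the example output in this Learning Path.

```python
import socket


def worker_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
```

For example, calling `worker_reachable("172.31.27.42", 50052)` from the master node should return `True` once the worker is listening. This only checks that the port accepts connections; it does not verify that the listener is actually an RPC server.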

@@ -33,11 +33,11 @@ Escape character is '^]'.
Run distributed inference using `llama-cli`:

```bash
bin/llama-cli -m ../../model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 999
bin/llama-cli -m ../../model.gguf -p "Here's a knock knock joke for kids:" -n 128 --rpc "$worker_ips" -ngl 999
```

{{% notice Note %}}
Loading tensors on the worker nodes can take up to 30 minutes. Pre-loaded tensors are a requested enhancement for llama.cpp.
It will take a significant amount of time (~10 minutes) for inference to run.
{{% /notice %}}
## Understand the command flags

@@ -50,25 +50,25 @@ Loading tensors on the worker nodes can take up to 30 minutes. Pre-loaded tensor
## Review example output

```output
build: 5935 (2adf8d83) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu
build: 6209 (fb22dd07) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device RPC[172.31.110.11:50052] (RPC[172.31.110.11:50052]) - 126497 MiB free
llama_model_load_from_file_impl: using device RPC[172.31.110.12:50052] (RPC[172.31.110.12:50052]) - 126497 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 1138 tensors from /home/ubuntu/Llama-3.1-405B_Q4_0.gguf (version GGUF V3 (latest))
llama_model_load_from_file_impl: using device RPC[172.31.27.42:50052] (RPC[172.31.27.42:50052]) - 31491 MiB free
llama_model_load_from_file_impl: using device RPC[172.31.20.38:50052] (RPC[172.31.20.38:50052]) - 31491 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 724 tensors from model.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama Hf
llama_model_loader: - kv 3: general.size_label str = 406B
llama_model_loader: - kv 3: general.size_label str = 71B
llama_model_loader: - kv 4: general.license str = llama3.1
llama_model_loader: - kv 5: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 6: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 7: llama.block_count u32 = 126
llama_model_loader: - kv 7: llama.block_count u32 = 80
llama_model_loader: - kv 8: llama.context_length u32 = 131072
llama_model_loader: - kv 9: llama.embedding_length u32 = 16384
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 53248
llama_model_loader: - kv 11: llama.attention.head_count u32 = 128
llama_model_loader: - kv 9: llama.embedding_length u32 = 8192
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
@@ -87,27 +87,31 @@ llama_model_loader: - kv 26: tokenizer.ggml.add_bos_token bool
llama_model_loader: - kv 27: tokenizer.ggml.add_sep_token bool = false
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - kv 29: general.file_type u32 = 2
llama_model_loader: - type f32: 254 tensors
llama_model_loader: - type q4_0: 883 tensors
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_0: 561 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 213.13 GiB (4.51 BPW)
print_info: file size = 37.22 GiB (4.53 BPW)
load: printing all EOG tokens:
load: - 128001 ('<|end_of_text|>')
load: - 128008 ('<|eom_id|>')
load: - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 16384
print_info: n_layer = 126
print_info: n_head = 128
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 16
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
@@ -116,7 +120,7 @@ print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 53248
print_info: n_ff = 28672
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
@@ -127,8 +131,8 @@ print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 405.85 B
print_info: model type = 70B
print_info: model params = 70.55 B
print_info: general.name = Llama Hf
print_info: vocab type = BPE
print_info: n_vocab = 128256
@@ -143,68 +147,67 @@ print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
....................................................................................................
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: RPC[172.31.27.42:50052] model buffer size = 18821.56 MiB
load_tensors: RPC[172.31.20.38:50052] model buffer size = 18725.42 MiB
load_tensors: CPU_Mapped model buffer size = 563.62 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: kv_unified = true
llama_context: kv_unified = false
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.49 MiB
llama_kv_cache_unified: RPC[172.31.110.11:50052] KV buffer size = 800.00 MiB
llama_kv_cache_unified: RPC[172.31.110.12:50052] KV buffer size = 784.00 MiB
llama_kv_cache_unified: CPU KV buffer size = 432.00 MiB
llama_kv_cache_unified: size = 2016.00 MiB ( 4096 cells, 126 layers, 1/ 1 seqs), K (f16): 1008.00 MiB, V (f16): 1008.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: RPC[172.31.110.11:50052] compute buffer size = 1160.00 MiB
llama_context: RPC[172.31.110.12:50052] compute buffer size = 1160.00 MiB
llama_context: CPU compute buffer size = 1160.01 MiB
llama_context: graph nodes = 4668
llama_context: graph splits = 4
llama_kv_cache_unified: RPC[172.31.27.42:50052] KV buffer size = 656.00 MiB
llama_kv_cache_unified: RPC[172.31.20.38:50052] KV buffer size = 624.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_context: RPC[172.31.27.42:50052] compute buffer size = 588.01 MiB
llama_context: RPC[172.31.20.38:50052] compute buffer size = 588.01 MiB
llama_context: CPU compute buffer size = 28.01 MiB
llama_context: graph nodes = 2806
llama_context: graph splits = 3
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 64
main: llama threadpool init, n_threads = 16

system_info: n_threads = 64 (n_threads_batch = 64) / 64 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |
system_info: n_threads = 16 (n_threads_batch = 16) / 16 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |

sampler seed: 4077122424
sampler seed: 3485539003
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1

Tell me a joke! (or a funny story)
Thread starter Fiver
This thread is for any jokes you may want to share with other members. Please keep them clean!
Reactions: Fiver
A duck walks into a bar, and asks the bartender, "Have you got any bread?"
The bartender says, "No, we don't have any bread."
The duck leaves.
A few minutes later, the duck returns, and asks the bartender, "Have you got any bread?"
The bartender says, "No, I told you, we don't have any bread."
A few minutes later, the duck returns, and asks the bartender,

llama_perf_sampler_print: sampling time = 9.48 ms / 133 runs ( 0.07 ms per token, 14032.50 tokens per second)
llama_perf_context_print: load time = 1796754.73 ms
llama_perf_context_print: prompt eval time = 1925.98 ms / 5 tokens ( 385.20 ms per token, 2.60 tokens per second)
llama_perf_context_print: eval time = 77429.95 ms / 127 runs ( 609.68 ms per token, 1.64 tokens per second)
llama_perf_context_print: total time = 79394.06 ms / 132 tokens
llama_perf_context_print: graphs reused = 0
generate: n_ctx = 4096, n_batch = 2048, n_predict = 64, n_keep = 1

Here's a knock knock joke for kids: Knock, knock. Who's there? The interrupting cow. The interrupting cow wh- Mooooooo!
A: He had a little lamb.
Q: What do you get if you cross an elephant and a rhinoceros?
Q: What's the difference between a cat and a comma?
A:

llama_perf_sampler_print: sampling time = 5.42 ms / 74 runs ( 0.07 ms per token, 13643.07 tokens per second)
llama_perf_context_print: load time = 489542.78 ms
llama_perf_context_print: prompt eval time = 1854.82 ms / 10 tokens ( 185.48 ms per token, 5.39 tokens per second)
llama_perf_context_print: eval time = 36101.93 ms / 63 runs ( 573.05 ms per token, 1.75 tokens per second)
llama_perf_context_print: total time = 37989.35 ms / 73 tokens
llama_perf_context_print: graphs reused = 60
```
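
The KV cache size in the 70B output above can be reproduced from the reported hyperparameters. This is a small sketch, assuming an f16 cache (2 bytes per element) and the values `n_ctx = 4096`, `n_layer = 80`, and `n_embd_k_gqa = n_embd_v_gqa = 1024` from the log; the function name `kv_cache_mib` is hypothetical.

```python
def kv_cache_mib(n_ctx: int, n_layer: int, n_embd_kv: int, bytes_per_elem: int = 2):
    """Return (K size, total K+V size) of the KV cache in MiB."""
    # One K (or V) vector of n_embd_kv elements per cell, per layer
    k_bytes = n_ctx * n_layer * n_embd_kv * bytes_per_elem
    # V has the same shape as K here, so the total is twice the K size
    return k_bytes / 2**20, 2 * k_bytes / 2**20


k_mib, total_mib = kv_cache_mib(n_ctx=4096, n_layer=80, n_embd_kv=1024)
print(f"K: {k_mib:.2f} MiB, total: {total_mib:.2f} MiB")
```

The result matches the `llama_kv_cache_unified` line for the 70B model: 640 MiB each for K and V, 1280 MiB in total, split between the two RPC workers.
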
That's it! You have successfully run the llama-3.1-8B model on CPUs with the power of llama.cpp RPC functionality.

That's it! You have successfully run the llama-3.1-70B model on CPUs with the power of llama.cpp RPC functionality.

The following table provides a brief description of the metrics from `llama_perf`:

@@ -215,16 +218,4 @@ The following table provides brief description of the metrics from `llama_perf`:
| load time | Time required to load the model into memory and initialize weights and buffers |
| prompt eval time | Time to process the input prompt tokens before generation (fills KV cache) |
| eval time | Time to generate output tokens by forward-passing through the model |
| total time | Total time for both prompt processing and token generation (excludes model load) |

## Run distributed inference with llama-server

Lastly, to set up an OpenAI-compatible API, you can use `llama-server`. The process is described [here](/learning-paths/servers-and-cloud-computing/llama-cpu) under the "Access the chatbot using the OpenAI-compatible API" section. The following snippet shows how to start `llama-server` for distributed inference:
```bash
bin/llama-server -m ../../model.gguf --port 8080 --rpc "$worker_ips" -ngl 99
```
At the very end of the output to the above command, you will see something like the following:
```output
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
```
| total time | Total time for both prompt processing and token generation (excludes model load) |
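
To see how the per-token figures in the `llama_perf` lines relate to the raw timings, the following sketch recomputes them from the 70B example output. The helper name `throughput` is hypothetical; the input numbers are copied from the `prompt eval time` and `eval time` lines.

```python
def throughput(total_ms: float, n_tokens: int):
    """Return (milliseconds per token, tokens per second)."""
    ms_per_token = total_ms / n_tokens
    tokens_per_second = 1000.0 / ms_per_token
    return ms_per_token, tokens_per_second


# prompt eval time = 1854.82 ms / 10 tokens
prompt_ms, prompt_tps = throughput(1854.82, 10)
# eval time = 36101.93 ms / 63 runs
eval_ms, eval_tps = throughput(36101.93, 63)

print(f"prompt eval: {prompt_ms:.2f} ms/token, {prompt_tps:.2f} tokens/s")
print(f"eval:        {eval_ms:.2f} ms/token, {eval_tps:.2f} tokens/s")
```

The eval rate is usually the more useful figure for judging interactive performance, because it reflects the speed at which new tokens are generated.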