weight: 2
layout: learningpathall
---

## Profile LLMs on Arm CPUs with Streamline

Deploying Large Language Models (LLMs) on Arm CPUs provides a power-efficient and flexible solution for many applications. While larger models can benefit from GPU acceleration, techniques like quantization enable a wide range of LLMs to perform effectively on CPUs alone by reducing model precision to save memory.

Frameworks such as [llama.cpp](https://github.com/ggml-org/llama.cpp) provide a convenient way to run LLMs. However, understanding their performance characteristics requires specialized analysis tools. To optimize LLM execution on Arm platforms, you need both a basic understanding of transformer architectures and the right profiling tools to identify bottlenecks.

This Learning Path demonstrates how to use `llama-cli` from the command line together with Arm Streamline to analyze the efficiency of LLM inference on Arm CPUs. You'll gain insights into token generation performance at both the Prefill and Decode stages. You'll also understand how individual tensor operations contribute to overall execution time, and evaluate multi-threaded performance across multiple CPU cores.

You will run the `Qwen1_5-0_5b-chat-q4_0.gguf` model using `llama-cli` on Arm Linux and use Streamline for detailed performance analysis. The same methodology can also be applied on Android systems.

By the end of this Learning Path, you'll understand how to profile LLM inference, identify performance bottlenecks, and analyze multi-threaded execution patterns on Arm CPUs.
---
title: Explore llama.cpp architecture and the inference workflow
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Key concepts and architecture overview

llama.cpp is an open-source LLM framework implemented in C++ that supports both training and inference. This Learning Path focuses specifically on inference performance on Arm CPUs.

The `llama-cli` tool provides a command-line interface to run LLMs with the llama.cpp inference engine. It supports text generation, chat mode, and grammar-constrained output directly from the terminal.

{{% notice Note %}}
These are some key terms used in this Learning Path:
- *Inference*: the process of generating text from a trained model
- *GGUF format*: a file format optimized for storing and loading LLM models efficiently
- *Tokenization*: converting text into numerical tokens that the model can process
{{% /notice %}}
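
To make the tokenization term more concrete, here is a small conceptual C++ sketch of turning text pieces into token IDs. It is purely illustrative and is not the llama.cpp tokenizer: real tokenizers split text into subword pieces (for example with BPE) and map each piece to an ID from the vocabulary stored in the GGUF file, and the `lookup_in_vocabulary` helper below is a hypothetical stand-in for that machinery.

```cpp
// Conceptual illustration of tokenization (not the llama.cpp tokenizer).
// Real tokenizers split text into subword pieces and map each piece to an ID
// from the vocabulary stored in the GGUF model file.
#include <string>
#include <vector>

using TokenId = int;

// Hypothetical vocabulary lookup, assumed for illustration only.
TokenId lookup_in_vocabulary(const std::string &piece);

std::vector<TokenId> tokenize(const std::vector<std::string> &pieces) {
    std::vector<TokenId> ids;
    ids.reserve(pieces.size());
    for (const auto &piece : pieces) {
        ids.push_back(lookup_in_vocabulary(piece));  // each piece becomes a model-specific integer ID
    }
    return ids;
}
```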

## The llama-cli workflow

The following diagram shows the high-level workflow of llama-cli during inference:

![Workflow diagram showing llama-cli inference pipeline with input prompt processing through model loading, tokenization, parallel Prefill stage, and sequential Decode stage for token generation alt-text#center](images/llama_structure.png "The llama-cli inference workflow")

The workflow begins when you provide an input prompt to `llama-cli`. The tool loads the specified GGUF model file and tokenizes your prompt. It then processes the prompt through two distinct stages:

- Prefill stage: the entire prompt is processed in parallel to generate the first output token
- Decode stage: additional tokens are generated sequentially, one at a time

This process continues until the model generates a complete response or reaches a stopping condition.

## How does llama-cli process requests?

Here are the steps performed by `llama-cli` during inference:

- Load and interpret LLMs in GGUF format

- Build a compute graph based on the model structure:
  - A compute graph defines the mathematical operations required for inference
  - The graph is divided into subgraphs to optimize execution across available hardware backends
  - Each subgraph is assigned to the most suitable backend device; in this Learning Path, all subgraphs are assigned to the Arm CPU backend

- Allocate memory for tensor nodes using the graph planner
  - Tensor nodes represent data and operations in the compute graph

- Execute tensor nodes in the graph during the `graph_compute` stage
  - This stage traverses nodes and forwards work to backend devices

The compute graph building and tensor node execution stages are wrapped inside the function `llama_decode`. During both Prefill and Decode stages, `llama-cli` repeatedly calls `llama_decode` to generate tokens. The parameter `llama_batch` passed to `llama_decode` differs between stages. It contains input tokens, their count, and their positions.
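
The sketch below makes this flow more concrete by showing how an application such as `llama-cli` might drive `llama_decode` across the two stages. It is a simplified illustration rather than exact llama.cpp API usage: the `make_batch`, `sample_next_token`, and `is_end_of_generation` helpers are assumed for readability, and real llama.cpp signatures vary between library versions.

```cpp
// Simplified sketch of driving llama_decode() during Prefill and Decode.
// Error handling is omitted, and the helpers marked "assumed" are illustrative
// stand-ins, not part of the llama.cpp API; exact signatures vary by version.
#include <vector>
#include "llama.h"   // llama_context, llama_batch, llama_token, llama_decode()

// Assumed helpers for readability (hypothetical, not llama.cpp functions):
llama_batch make_batch(const std::vector<llama_token> &tokens, int first_pos);
llama_token sample_next_token(llama_context *ctx);
bool        is_end_of_generation(llama_token token);

void generate(llama_context *ctx,
              const std::vector<llama_token> &prompt_tokens,
              int max_new_tokens) {
    // Prefill: a single batch holding the whole prompt at positions 0..n-1,
    // processed in parallel (GEMM-heavy).
    llama_decode(ctx, make_batch(prompt_tokens, /*first_pos=*/0));

    int pos = static_cast<int>(prompt_tokens.size());
    llama_token next = sample_next_token(ctx);

    // Decode: one batch per step, holding a single new token at position `pos`,
    // generated sequentially (GEMV-heavy, KV-cache bound).
    for (int i = 0; i < max_new_tokens && !is_end_of_generation(next); ++i) {
        llama_decode(ctx, make_batch({next}, /*first_pos=*/pos++));
        next = sample_next_token(ctx);
    }
}
```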

## What are the components of llama.cpp?

The architecture of llama.cpp includes several key components that work together to provide efficient LLM inference, as shown in the diagram:

![Architecture diagram showing llama.cpp components including backends, ggml-cpu library, and KleidiAI integration alt-text#center](images/llama_components.jpg "llama.cpp components")

llama.cpp provides optimized support for Arm CPUs through its `ggml-cpu` library, which leverages Arm-specific vector instructions such as NEON and SVE, and includes an AArch64 trait that accelerates inference using 8-bit integer multiply (i8mm) instructions. The `ggml-cpu` library also integrates the Arm [KleidiAI](https://github.com/ARM-software/kleidiai) library as an additional trait. In addition to Arm CPU support, llama.cpp offers backends for GPU, CUDA, and OpenCL to enable inference on a variety of hardware platforms.
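
If you want to confirm which of these CPU features your target actually exposes before profiling, one option on 64-bit Arm Linux is to query the kernel's hardware capability bits, as in the minimal sketch below. It assumes kernel headers that define the relevant `HWCAP`/`HWCAP2` macros; you can also inspect the `Features` line in `/proc/cpuinfo`.

```cpp
// Report a few Arm CPU features that ggml-cpu can take advantage of.
// 64-bit Arm Linux only; each macro is guarded because older headers may lack it.
#include <cstdio>
#include <sys/auxv.h>   // getauxval()
#include <asm/hwcap.h>  // HWCAP_* / HWCAP2_* bits

int main() {
    unsigned long hwcap  = getauxval(AT_HWCAP);
    unsigned long hwcap2 = getauxval(AT_HWCAP2);

#if defined(HWCAP_ASIMD)
    std::printf("NEON (ASIMD): %s\n", (hwcap & HWCAP_ASIMD) ? "yes" : "no");
#endif
#if defined(HWCAP_SVE)
    std::printf("SVE:          %s\n", (hwcap & HWCAP_SVE) ? "yes" : "no");
#endif
#if defined(HWCAP2_I8MM)
    std::printf("i8mm:         %s\n", (hwcap2 & HWCAP2_I8MM) ? "yes" : "no");
#endif
    return 0;
}
```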

## Prefill and Decode in autoregressive LLMs

An autoregressive LLM is a type of Large Language Model that generates text by predicting the next token based on all the previously generated tokens. A token represents a word or word piece in the sequence.

The term *autoregressive* means the model uses its own previous outputs as inputs for generating subsequent outputs, creating a sequential generation process. For example, when generating the sentence "The cat sat on the...", an autoregressive LLM takes the input prompt as context and predicts the next most likely token, such as "mat". The model then uses the entire sequence, including "mat", to predict the following token, continuing this process token by token until completion. This sequential nature is why autoregressive LLMs have two distinct computational phases: Prefill (processing the initial prompt) and Decode (generating tokens one by one).
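
The following short sketch captures this feedback loop. It is conceptual C++ rather than llama.cpp code: `predict_next_token` and `is_stop_token` are hypothetical stand-ins for a full model forward pass and a stop-condition check. The point is simply that each generated token is appended to the context used for the next prediction.

```cpp
// Conceptual autoregressive generation loop (not the llama.cpp API).
#include <vector>

using Token = int;

// Hypothetical stand-ins, assumed for illustration only:
Token predict_next_token(const std::vector<Token> &context);  // one model forward pass
bool  is_stop_token(Token t);                                 // end-of-generation check

std::vector<Token> generate(std::vector<Token> context, int max_new_tokens) {
    for (int i = 0; i < max_new_tokens; ++i) {
        Token next = predict_next_token(context);  // uses ALL tokens generated so far
        if (is_stop_token(next)) break;
        context.push_back(next);                   // output becomes part of the next input
    }
    return context;
}
```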

Most autoregressive LLMs are decoder-only models. This refers to the transformer architecture, which consists only of decoder blocks from the original transformer paper. The alternatives to decoder-only models include encoder-only models used for tasks like classification and encoder-decoder models used for tasks like translation.

Decoder-only models like LLaMA have become dominant for text generation because they are simpler to train at scale, can handle both understanding and generation tasks, and are more efficient for text generation.

This diagram introduces the idea of Prefill and Decode stages of autoregressive LLMs:
![Diagram illustrating the two stages of autoregressive LLM inference: Prefill stage processing input tokens and Decode stage generating output tokens sequentially alt-text#center](images/llm_prefill_decode.jpg "Prefill and Decode stages")

The Prefill stage is shown below, and as you can see, multiple input tokens of the prompt are processed simultaneously.

In the context of Large Language Models (LLMs), a *matrix* is a two-dimensional array of numbers representing data such as model weights or token embeddings, while a *vector* is a one-dimensional array often used to represent a single token or feature set.

This stage mainly performs GEMM operations (General Matrix Multiply, where one matrix is multiplied by another matrix) to generate the first output token.

![Diagram showing the Prefill stage processing multiple input tokens in parallel through transformer blocks using GEMM operations alt-text#center](images/transformer_prefill.jpg "Prefill stage")

At the Decode stage, the model utilizes the [KV cache](https://huggingface.co/blog/not-lain/kv-caching) (Key-Value cache, the stored attention information from previous tokens). This stage mainly performs GEMV operations (General Matrix-Vector multiply, where a vector is multiplied by a matrix) to generate subsequent output tokens one by one.

![Diagram showing the Decode stage generating tokens one by one using KV cache and GEMV operations alt-text#center](images/transformer_decode.jpg "Decode stage")
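
To see why the two stages behave so differently, compare naive versions of the two operations below. In the matrix-matrix (Prefill-like) case, each weight loaded from memory is reused across every row of the input, so the CPU spends most of its time on arithmetic; in the matrix-vector (Decode-like) case, each weight is loaded and used only once, so performance is limited by memory bandwidth, including KV cache reads. This is a simplified illustration only; the real `ggml-cpu` kernels are blocked, vectorized, and operate on quantized data.

```cpp
// Naive GEMM (matrix x matrix, Prefill-like) vs GEMV (vector x matrix, Decode-like).
// Illustrates the difference in arithmetic intensity, not an optimized kernel.
#include <vector>

// C[M x N] = A[M x K] * B[K x N]; each element of B is reused M times (compute-bound).
void gemm(const std::vector<float> &A, const std::vector<float> &B,
          std::vector<float> &C, int M, int K, int N) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
}

// y[N] = x[K] * B[K x N]; each element of B is used exactly once (memory-bound).
void gemv(const std::vector<float> &x, const std::vector<float> &B,
          std::vector<float> &y, int K, int N) {
    for (int n = 0; n < N; ++n) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += x[k] * B[k * N + n];
        y[n] = acc;
    }
}
```

Prefill behaves roughly like `gemm()` with `M` equal to the number of prompt tokens, while each Decode step behaves like `gemv()`, which is why the two stages show such different CPU and memory profiles.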

## Summary

In this section, you learned about llama.cpp architecture and its inference workflow. The framework uses a two-stage process where the Prefill stage is compute-bound and dominated by large GEMM operations that process multiple tokens in parallel, while the Decode stage is memory-bound and dominated by KV cache access and GEMV operations that process one token at a time. You will see this distinction between Prefill and Decode stages reflected in the performance metrics and visualizations. In the next section, you'll integrate Streamline annotations into llama.cpp to enable detailed performance profiling of these stages.