1 change: 1 addition & 0 deletions assets/contributors.csv
Ker Liu,,,,,
Rui Chang,,,,,
Alejandro Martinez Vicente,Arm,,,,
Mohamad Najem,Arm,,,,
Zenon Zhilong Xiu,Arm,,zenon-zhilong-xiu-491bb398,,
---
title: Overview
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Overview: Profiling LLMs on Arm CPUs with Streamline

Large Language Models (LLMs) can run efficiently on Arm CPUs.
Frameworks such as [**llama.cpp**](https://github.com/ggml-org/llama.cpp) provide a convenient way to run LLMs, but they also come with a certain level of complexity.

To analyze their execution and use profiling insights for optimization, you need both a basic understanding of transformer architectures and the right analysis tools.

This learning path demonstrates how to use the **llama-cli** application from llama.cpp together with **Arm Streamline** to analyze the efficiency of LLM inference on Arm CPUs.

In this guide you will learn how to:
- Profile token generation at the **Prefill** and **Decode** stages
- Profile execution of individual tensor nodes and operators
- Profile LLM execution across **multiple threads and cores**

You will run the **Qwen1_5-0_5b-chat-q4_0.gguf** model with llama-cli on **Arm64 Linux** and use Streamline for analysis.
The same method can also be applied to **Arm64 Android** platforms.

## Prerequisites
Before starting this guide, you should have:
- A basic understanding of llama.cpp
- An understanding of transformer model architecture
- Working knowledge of Streamline
- An Arm Neoverse or Cortex-A hardware platform running Linux or Android to test the application
---
title: Understand llama.cpp
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Understand llama.cpp

**llama.cpp** is an open-source LLM framework implemented in C++ that supports both training and inference.
This learning path focuses only on **inference on the CPU**.

The **llama-cli** tool provides a command-line interface to run LLMs with the llama.cpp inference engine.
It supports text generation, chat mode, and grammar-constrained output directly from the terminal.

![text#center](images/llama_structure.png "Figure 1. llama-cli Flow")

### What llama-cli does
1. Load and interpret LLMs in **.gguf** format
2. Build a **compute graph** based on the model structure
   - The graph can be divided into subgraphs, each assigned to the most suitable backend device
   - In this guide, all operators are executed on the **CPU backend**
3. Allocate memory for tensor nodes using the **graph planner**
4. Execute tensor nodes in the graph during the **graph_compute** stage, which traverses nodes and forwards work to backend devices

Steps 2 to 4 are wrapped inside the function **`llama_decode`**.
During **Prefill** and **Decode**, `llama-cli` repeatedly calls `llama_decode` to generate tokens.
The parameter **`llama_batch`** passed to `llama_decode` differs between stages, containing input tokens, their count, and their positions.

### Components of llama.cpp
The components of llama.cpp include:
![text#center](images/llama_componetns.jpg "Figure 2. llama.cpp components")

llama.cpp supports various backends such as `CPU`, `GPU`, `CUDA`, and `OpenCL`.

For the CPU backend, it provides an optimized `ggml-cpu` library (mainly utilizing CPU vector instructions).
For Arm CPUs, the `ggml-cpu` library also offers an `aarch64` trait that leverages the new **I8MM** instructions for acceleration.
The `ggml-cpu` library also integrates the Arm [KleidiAI](https://github.com/ARM-software/kleidiai) library as an additional trait.

### Prefill and Decode in autoregressive LLMs
Most autoregressive LLMs are decoder-only models.
Here is a brief introduction to the Prefill and Decode stages of autoregressive LLMs.
![text#center](images/llm_prefill_decode.jpg "Figure 3. Prefill and Decode stage")

At the Prefill stage, multiple input tokens of the prompt are processed.
It mainly performs GEMM (matrix-matrix multiplication) operations to generate the first output token.
![text#center](images/transformer_prefill.jpg "Figure 4. Prefill stage")

At the Decode stage, by utilizing the [KV cache](https://huggingface.co/blog/not-lain/kv-caching), it mainly performs GEMV (matrix-vector multiplication) operations to generate subsequent output tokens one by one.
![text#center](images/transformer_decode.jpg "Figure 5. Decode stage")

Therefore,
- **Prefill** is **compute-bound**, dominated by large GEMM operations
- **Decode** is **memory-bound**, dominated by KV cache access and GEMV operations

This can be seen in the subsequent analysis with Streamline.
---
title: Integrating Streamline Annotations into llama.cpp
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Integrating Streamline Annotations into llama.cpp

To visualize token generation at the **Prefill** and **Decode** stages, we use **Streamline’s Annotation Marker** feature.
This requires integrating annotation support into the **llama.cpp** project.
More information about the Annotation Marker API can be found [here](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code?lang=en).

{{% notice Note %}}
You can either build natively on an **Arm platform**, or cross-compile on another architecture using an Arm cross-compiler toolchain.
{{% /notice %}}

### Step 1: Build Streamline Annotation library

Install [Arm DS](https://developer.arm.com/Tools%20and%20Software/Arm%20Development%20Studio) or [Arm Streamline](https://developer.arm.com/Tools%20and%20Software/Streamline%20Performance%20Analyzer) on your development machine first.

The Streamline Annotation support code can be found in the installation directory, for example *"Arm\Development Studio 2024.1\sw\streamline\gator\annotate"*.

For installation guidance, refer to the [Streamline installation guide](https://learn.arm.com/install-guides/streamline/).

Clone the gator repository that matches your Streamline version and build the `Annotation support library`.

The installation steps depend on your development machine.

For an Arm native build, use the following instructions to install the required packages.
For other machines, set up the cross-compilation environment by installing the [aarch64 gcc compiler toolchain](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads).
You can refer to this [guide](https://learn.arm.com/install-guides/gcc/cross/) for cross-compiler installation.

{{< tabpane code=true >}}
{{< tab header="Arm Native Build" language="bash">}}
apt-get update
apt-get install ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
cd ~
git clone https://github.com/ARM-software/gator.git
cd gator
./build-linux.sh

cd annotate
make
{{< /tab >}}
{{< tab header="Cross Compiler" language="bash">}}
apt-get update
apt-get install ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
cd ~
git clone https://github.com/ARM-software/gator.git

cd gator
make CROSS_COMPILE=/path/to/aarch64_linux_gcc_tool
{{< /tab >}}
{{< /tabpane >}}

Once complete, the static library **libstreamline_annotate.a** will be generated at `~/gator/annotate/libstreamline_annotate.a` and the header file at: `gator/annotate/streamline_annotate.h`

### Step 2: Integrate Annotation Marker into llama.cpp

Next, we need to install **llama.cpp** to run the LLM model.
To keep the profiling steps and results consistent, this Learning Path uses a specific release version of llama.cpp.

Before building **llama.cpp**, create a directory `streamline_annotation` and copy the `libstreamline_annotate.a` library and the `streamline_annotate.h` header file into it.

```bash
cd ~
wget https://github.com/ggml-org/llama.cpp/archive/refs/tags/b6202.tar.gz
tar -xvzf b6202.tar.gz
mv llama.cpp-b6202 llama.cpp
cd ./llama.cpp
mkdir streamline_annotation
cp ~/gator/annotate/libstreamline_annotate.a ~/gator/annotate/streamline_annotate.h streamline_annotation
```

To link the `libstreamline_annotate.a` library when building llama-cli, add the following lines at the end of `llama.cpp/tools/main/CMakeLists.txt`:

```cmake
set(STREAMLINE_LIB_PATH "${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a")
target_include_directories(llama-cli PRIVATE "${CMAKE_SOURCE_DIR}/streamline_annotation")
target_link_libraries(llama-cli PRIVATE "${STREAMLINE_LIB_PATH}")
```

To add Annotation Markers to llama-cli, modify the llama-cli code **llama.cpp/tools/main/main.cpp** by adding the include:

```c
#include "streamline_annotate.h"
```

After the call to `common_init()`, add the setup macro:

```c
common_init();
//Add the Annotation setup code
ANNOTATE_SETUP;
```

Finally, add an annotation marker inside the main loop:

```c
    for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
        int n_eval = (int) embd.size() - i;
        if (n_eval > params.n_batch) {
            n_eval = params.n_batch;
        }

        LOG_DBG("eval: %s\n", string_from(ctx, embd).c_str());

        // Add annotation marker code for Streamline
        {
            char printf_buf[200];
            snprintf(printf_buf, sizeof(printf_buf), "past %d, n_eval %d", n_past, n_eval);
            ANNOTATE_MARKER_STR(printf_buf);
        }
        // End of annotation marker

        if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval))) {
            LOG_ERR("%s : failed to eval\n", __func__);
            return 1;
        }
```

A string is added to the Annotation Marker to record the position of the input tokens and the number of tokens to be processed.

### Step 3: Build llama-cli

For convenience, llama-cli is **statically linked**.

First, create a new directory `build` under the llama.cpp root directory and go into it.

```bash
cd ~/llama.cpp
mkdir -p build && cd build
```

Then configure the project by running:

{{< tabpane code=true >}}
{{< tab header="Arm Native Build" language="bash">}}
cmake .. \
-DGGML_NATIVE=ON \
-DLLAMA_F16C=OFF \
-DLLAMA_GEMM_ARM=ON \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_EXE_LINKER_FLAGS="-static -g" \
-DGGML_OPENMP=OFF \
-DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
-DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
-DGGML_CPU_KLEIDIAI=ON \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_EXAMPLES=ON \
-DLLAMA_CURL=OFF
{{< /tab >}}
{{< tab header="Cross Compiler" language="bash">}}
cmake .. \
-DCMAKE_SYSTEM_NAME=Linux \
-DCMAKE_SYSTEM_PROCESSOR=arm \
-DCMAKE_C_COMPILER=aarch64-none-linux-gnu-gcc \
-DCMAKE_CXX_COMPILER=aarch64-none-linux-gnu-g++ \
-DLLAMA_NATIVE=OFF \
-DLLAMA_F16C=OFF \
-DLLAMA_GEMM_ARM=ON \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_EXE_LINKER_FLAGS="-static -g" \
-DGGML_OPENMP=OFF \
-DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
-DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
-DGGML_CPU_KLEIDIAI=ON \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_EXAMPLES=ON \
-DLLAMA_CURL=OFF
{{< /tab >}}
{{< /tabpane >}}


For the cross-compiler build, set `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER` to your cross-compiler path. Make sure that the **-march** value in `CMAKE_C_FLAGS` and `CMAKE_CXX_FLAGS` matches your Arm CPU hardware.


In this learning path, we run llama-cli on an Arm CPU that supports **NEON Dotprod** and **I8MM** instructions.
Therefore, we specify: **armv8.2-a+dotprod+i8mm**.

We also specify the **-static** and **-g** options:
- **-static**: produces a statically linked executable, so it can run on various Arm64 Linux/Android environments without needing shared libraries.
- **-g**: includes debug information, which makes source code and function-level profiling in Streamline much easier.

Now you can build the project by running:

```bash
cd ~/llama.cpp/build
cmake --build ./ --config Release
```

After the build completes, the llama-cli executable is generated in the **~/llama.cpp/build/bin/** directory.