diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md index ffcbe3a19b..2c62928a51 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md @@ -6,21 +6,14 @@ weight: 2 layout: learningpathall --- -## Profiling LLMs on Arm CPUs with Streamline +## Profile LLMs on Arm CPUs with Streamline -Deploying Large Language Models (LLMs) on Arm CPUs provides a power-efficient and flexible solution. While larger models may benefit from GPU acceleration, techniques like quantization enable a wide range of LLMs to perform effectively on CPUs alone. +Deploying Large Language Models (LLMs) on Arm CPUs provides a power-efficient and flexible solution for many applications. While larger models can benefit from GPU acceleration, techniques like quantization enable a wide range of LLMs to perform effectively on CPUs alone by reducing model precision to save memory. -Frameworks such as [**llama.cpp**](https://github.com/ggml-org/llama.cpp), provide a convenient way to run LLMs, but it also comes with a certain level of complexity. +Frameworks such as [llama.cpp](https://github.com/ggml-org/llama.cpp) provide a convenient way to run LLMs. However, understanding their performance characteristics requires specialized analysis tools. To optimize LLM execution on Arm platforms, you need both a basic understanding of transformer architectures and the right profiling tools to identify bottlenecks. -To analyze their execution and use profiling insights for optimization, you need both a basic understanding of transformer architectures and the right analysis tools. +This Learning Path demonstrates how to use `llama-cli` from the command line together with Arm Streamline to analyze the efficiency of LLM inference on Arm CPUs. You'll gain insights into token generation performance at both the Prefill and Decode stages. You'll also understand how individual tensor operations contribute to overall execution time, and evaluate multi-threaded performance across multiple CPU cores. -This Learning Path demonstrates how to use `llama-cli` from the command line together with Arm Streamline to analyze the efficiency of LLM inference on Arm CPUs. +You will run the Qwen1_5-0_5b-chat-q4_0.gguf model using `llama-cli` on Arm Linux and use Streamline for detailed performance analysis. The same methodology can also be applied on Android systems. -You will learn how to: -- Profile token generation at the Prefill and Decode stages -- Profile execution of individual tensor nodes and operators -- Profile LLM execution across multiple threads and cores - -You will run the `Qwen1_5-0_5b-chat-q4_0.gguf` model using `llama-cli` on Arm Linux and use Streamline for analysis. - -The same method can also be used on Android. +By the end of this Learning Path, you'll understand how to profile LLM inference, identify performance bottlenecks, and analyze multi-threaded execution patterns on Arm CPUs. 
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md index 2c1e4129f0..75bf788463 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md @@ -1,86 +1,89 @@ --- -title: Understand llama.cpp +title: Explore llama.cpp architecture and the inference workflow weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Understand llama.cpp +## Key concepts and architecture overview -llama.cpp is an open-source LLM framework implemented in C++ that supports both training and inference. +llama.cpp is an open-source LLM framework implemented in C++ that supports both training and inference. This Learning Path focuses specifically on inference performance on Arm CPUs. -This Learning Path focuses on inference on Arm CPUs. +The `llama-cli` tool provides a command-line interface to run LLMs with the llama.cpp inference engine. It supports text generation, chat mode, and grammar-constrained output directly from the terminal. -The `llama-cli` tool provides a command-line interface to run LLMs with the llama.cpp inference engine. -It supports text generation, chat mode, and grammar-constrained output directly from the terminal. +{{% notice Note %}} +These are some key terms used in this Learning Path: +- *Inference*: the process of generating text from a trained model +- *GGUF format*: a file format optimized for storing and loading LLM models efficiently +- *Tokenization*: converting text into numerical tokens that the model can process +{{% /notice %}} -![text#center](images/llama_structure.png "Figure 1. llama-cli Flow") +## The llama-cli workflow -### What does the Llama CLI do? +The following diagram shows the high-level workflow of llama-cli during inference: -Here are the steps performed by `llama-cli`: +![Workflow diagram showing llama-cli inference pipeline with input prompt processing through model loading, tokenization, parallel Prefill stage, and sequential Decode stage for token generation alt-text#center](images/llama_structure.png "The llama-cli inference workflow") -1. Load and interpret LLMs in GGUF format +The workflow begins when you provide an input prompt to `llama-cli`. The tool loads the specified GGUF model file and tokenizes your prompt. It then processes the prompt through two distinct stages: -2. Build a compute graph based on the model structure +- Prefill stage: the entire prompt is processed in parallel to generate the first output token +- Decode stage: additional tokens are generated sequentially, one at a time - The graph can be divided into subgraphs, each assigned to the most suitable backend device, but in this Learning Path all operations are executed on the Arm CPU backend. +This process continues until the model generates a complete response or reaches a stopping condition. -3. Allocate memory for tensor nodes using the graph planner +## How does llama-cli process requests? -4. Execute tensor nodes in the graph during the `graph_compute` stage, which traverses nodes and forwards work to backend devices +Here are the steps performed by `llama-cli` during inference: -Steps 2 to 4 are wrapped inside the function `llama_decode`. -During Prefill and Decode, `llama-cli` repeatedly calls `llama_decode` to generate tokens. 
+- Load and interpret LLMs in GGUF format -The parameter `llama_batch` passed to `llama_decode` differs between stages, containing input tokens, their count, and their positions. +- Build a compute graph based on the model structure: + - A compute graph defines the mathematical operations required for inference + - The graph is divided into subgraphs to optimize execution across available hardware backends + - Each subgraph is assigned to the most suitable backend device; in this Learning Path, all subgraphs are assigned to the Arm CPU backend + +- Allocate memory for tensor nodes using the graph planner + - Tensor nodes represent data and operations in the compute graph -### What are the components of llama.cpp? +- Execute tensor nodes in the graph during the `graph_compute` stage + - This stage traverses nodes and forwards work to backend devices -The components of llama.cpp include: +The compute graph building and tensor node execution stages are wrapped inside the function `llama_decode`. During both Prefill and Decode stages, `llama-cli` repeatedly calls `llama_decode` to generate tokens. The parameter `llama_batch` passed to `llama_decode` differs between stages. It contains input tokens, their count, and their positions. -![text#center](images/llama_components.jpg "Figure 2. llama.cpp components") +## What are the components of llama.cpp? -llama.cpp supports various backends such as `CPU`, `GPU`, `CUDA`, and `OpenCL`. +The architecture of llama.cpp includes several key components that work together to provide efficient LLM inference, as shown in the diagram: -For the CPU backend, it provides an optimized `ggml-cpu` library, mainly utilizing CPU vector instructions. +![Architecture diagram showing llama.cpp components including backends, ggml-cpu library, and KleidiAI integration alt-text#center](images/llama_components.jpg "llama.cpp components") -For Arm CPUs, the `ggml-cpu` library also offers an `aarch64` trait that leverages 8-bit integer multiply (i8mm) instructions for acceleration. +llama.cpp provides optimized support for Arm CPUs through its `ggml-cpu` library, which leverages Arm-specific vector instructions such as NEON and SVE, and includes an AArch64 trait that accelerates inference using 8-bit integer multiply (i8mm) instructions. The `ggml-cpu` library also integrates the Arm [KleidiAI](https://github.com/ARM-software/kleidiai) library as an additional trait. In addition to Arm CPU support, llama.cpp offers backends for GPU, CUDA, and OpenCL to enable inference on a variety of hardware platforms. -The `ggml-cpu` library also integrates the Arm [KleidiAI](https://github.com/ARM-software/kleidiai) library as an additional trait. +## Prefill and Decode in autoregressive LLMs -### Prefill and Decode in autoregressive LLMs +An autoregressive LLM is a type of Large Language Model that generates text by predicting the next token based on all the previously-generated tokens. A token represents a word or word piece in the sequence. -An autoregressive LLM is a type of Large Language Model that generates text by predicting the next token (word or word piece) in a sequence based on all the previously generated tokens. +The term *autoregressive* means the model uses its own previous outputs as inputs for generating subsequent outputs, creating a sequential generation process. For example, when generating the sentence "The cat sat on the...", an autoregressive LLM takes the input prompt as context and predicts the next most likely token, such as "mat". 
The model then uses the entire sequence including "mat" to predict the following token, continuing this process token by token until completion, which is why autoregressive LLMs have two distinct computational phases: Prefill (processing the initial prompt) and Decode (generating tokens one by one). -The term "autoregressive" means the model uses its own previous outputs as inputs for generating subsequent outputs, creating a sequential generation process. - -For example, when generating the sentence "The cat sat on the", an autoregressive LLM: -1. Takes the input prompt as context -2. Predicts the next most likely token (e.g., "mat") -3. Uses the entire sequence including "mat" to predict the following token -4. Continues this process token by token until completion - -This sequential nature is why autoregressive LLMs have two distinct computational phases: Prefill (processing the initial prompt) and Decode (generating tokens one by one). - -Most autoregressive LLMs are Decoder-only models. This refers to the transformer architecture they use, which consists only of decoder blocks from the original Transformer paper. The alternatives to decoder-only models include encoder-only models used for tasks like classification and encoder-decoder models used for tasks like translation. +Most autoregressive LLMs are decoder-only models. This refers to the transformer architecture, which consists only of decoder blocks from the original transformer paper. The alternatives to decoder-only models include encoder-only models used for tasks like classification and encoder-decoder models used for tasks like translation. Decoder-only models like LLaMA have become dominant for text generation because they are simpler to train at scale, can handle both understanding and generation tasks, and are more efficient for text generation. -Here is a brief introduction to Prefill and Decode stages of autoregressive LLMs. -![text#center](images/llm_prefill_decode.jpg "Figure 3. Prefill and Decode stages") +This diagram introduces the idea of Prefill and Decode stages of autoregressive LLMs: +![Diagram illustrating the two stages of autoregressive LLM inference: Prefill stage processing input tokens and Decode stage generating output tokens sequentially alt-text#center](images/llm_prefill_decode.jpg "Prefill and Decode stages") + +The Prefill stage is shown below, and as you can see, multiple input tokens of the prompt are processed simultaneously. -At the Prefill stage, multiple input tokens of the prompt are processed. +In the context of Large Language Models (LLMs), a *matrix* is a two-dimensional array of numbers representing data such as model weights or token embeddings, while a *vector* is a one-dimensional array often used to represent a single token or feature set. -It mainly performs GEMM (a matrix is multiplied by another matrix) operations to generate the first output token. +This stage mainly performs GEMM operations (General Matrix Multiply; where one matrix is multiplied by another matrix) to generate the first output token. -![text#center](images/transformer_prefill.jpg "Figure 4. 
Prefill stage") +![Diagram showing the Prefill stage processing multiple input tokens in parallel through transformer blocks using GEMM operations alt-text#center](images/transformer_prefill.jpg "Prefill stage") -At the Decode stage, by utilizing the [KV cache](https://huggingface.co/blog/not-lain/kv-caching), it mainly performs GEMV (a vector is multiplied by a matrix) operations to generate subsequent output tokens one by one. +At the Decode stage, the model utilizes the [KV cache](https://huggingface.co/blog/not-lain/kv-caching) (Key-Value cache; which is stored attention information from previous tokens). This stage mainly performs GEMV operations (General Matrix-Vector multiply - where a vector is multiplied by a matrix) to generate subsequent output tokens one by one. -![text#center](images/transformer_decode.jpg "Figure 5. Decode stage") +![Diagram showing the Decode stage generating tokens one by one using KV cache and GEMV operations alt-text#center](images/transformer_decode.jpg "Decode stage") -In summary, Prefill is compute-bound, dominated by large GEMM operations and Decode is memory-bound, dominated by KV cache access and GEMV operations. +## Summary -You will see this highlighted during the Streamline performance analysis. \ No newline at end of file +In this section, you learned about llama.cpp architecture and its inference workflow. The framework uses a two-stage process where the Prefill stage is compute-bound and dominated by large GEMM operations that process multiple tokens in parallel, while the Decode stage is memory-bound and dominated by KV cache access and GEMV operations that process one token at a time. You will see this distinction between Prefill and Decode stages reflected in the performance metrics and visualizations. In the next section, you'll integrate Streamline annotations into llama.cpp to enable detailed performance profiling of these stages. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md index cdb90f1223..f6288f05ce 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md @@ -6,32 +6,33 @@ weight: 4 layout: learningpathall --- -## Integrate Streamline Annotations into llama.cpp +## Set up performance annotation markers -To visualize token generation at the Prefill and Decode stages, you can use Streamline's Annotation Marker feature. +To visualize token generation at the Prefill and Decode stages, you can use Streamline's Annotation Marker feature. -This requires integrating annotation support into the llama.cpp project. +{{% notice Note %}} +*Annotation markers* are code markers that you insert into your application to identify specific events or time periods during execution. When Streamline captures performance data, these markers appear in the timeline, making it easier to correlate performance data with specific application behavior. +{{% /notice %}} -More information about the Annotation Marker API can be found in the [Streamline User Guide](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code?lang=en). +This requires integrating annotation support into the llama.cpp project. 
More information about the Annotation Marker API can be found in the [Streamline User Guide](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code?lang=en). {{% notice Note %}} You can either build natively on an Arm platform, or cross-compile on another architecture using an Arm cross-compiler toolchain. {{% /notice %}} -### Step 1: Build Streamline Annotation library +## Build the Streamline annotation library Download and install [Arm Performance Studio](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Studio#Downloads) on your development machine. {{% notice Note %}} You can also download and install [Arm Development Studio](https://developer.arm.com/Tools%20and%20Software/Arm%20Development%20Studio#Downloads), as it also includes Streamline. - {{% /notice %}} -Streamline Annotation support code is in the Arm Performance Studio installation directory in the `streamline/gator/annotate` directory. +Streamline Annotation support code is located in the Arm Performance Studio installation directory under `streamline/gator/annotate`. -Clone the gator repository that matches your Streamline version and build the `Annotation support library`. You can build it on your current machine using the native build instructions and you can cross compile it for another Arm computer using the cross compile instructions. +Clone the gator repository that matches your Streamline version and build the Annotation support library. You can build it natively on your current machine or cross-compile it for another Arm computer. -If you need to set up a cross compiler you can review the [GCC install guide](/install-guides/gcc/cross/). +If you need to set up a cross-compiler, you can review the [GCC install guide](/install-guides/gcc/cross/). {{< tabpane code=true >}} {{< tab header="Arm Native Build" language="bash">}} @@ -56,7 +57,7 @@ If you need to set up a cross compiler you can review the [GCC install guide](/i Once complete, the static library `libstreamline_annotate.a` will be generated at `~/gator/annotate/libstreamline_annotate.a` and the header file is at `gator/annotate/streamline_annotate.h`. -### Step 2: Integrate Annotation Marker into llama.cpp +## Integrate annotation marker into llama.cpp Next, you need to install llama.cpp to run the LLM model. @@ -64,7 +65,7 @@ Next, you need to install llama.cpp to run the LLM model. To make the performance profiling content easier to follow, this Learning Path uses a specific release version of llama.cpp to ensure the steps and results remain consistent. {{% /notice %}} -Before building llama.cpp, create a directory `streamline_annotation` and copy the library `libstreamline_annotate.a` and the header file `streamline_annotate.h` into the new directory. +Before building llama.cpp, create a directory `streamline_annotation` and copy the library `libstreamline_annotate.a` and the header file `streamline_annotate.h` into the new directory: ```bash cd ~ @@ -76,7 +77,7 @@ mkdir streamline_annotation cp ~/gator/annotate/libstreamline_annotate.a ~/gator/annotate/streamline_annotate.h streamline_annotation ``` -To link the `libstreamline_annotate.a` library when building llama-cli, use an editor to add the following lines at the end of `llama.cpp/tools/main/CMakeLists.txt`. 
+To link the `libstreamline_annotate.a` library when building llama-cli, use an editor to add the following lines at the end of `llama.cpp/tools/main/CMakeLists.txt`: ```makefile set(STREAMLINE_LIB_PATH "${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a") @@ -84,15 +85,15 @@ target_include_directories(llama-cli PRIVATE "${CMAKE_SOURCE_DIR}/streamline_ann target_link_libraries(llama-cli PRIVATE "${STREAMLINE_LIB_PATH}") ``` -To add Annotation Markers to `llama-cli`, edit the file `llama.cpp/tools/main/main.cpp` and make 3 modification. +To add Annotation Markers to `llama-cli`, edit the file `llama.cpp/tools/main/main.cpp` and make three modifications. -First, add the include file at the top of `main.cpp` with the other include files. +First, add the include file at the top of `main.cpp` with the other include files: ```c #include "streamline_annotate.h" ``` -Next, the find the `common_init()` call in the `main()` function and add the Streamline setup macro below it so that the code looks like: +Next, find the `common_init()` call in the `main()` function and add the Streamline setup macro below it so that the code looks like this: ```c common_init(); @@ -127,7 +128,7 @@ Finally, add an annotation marker inside the main loop. Add the complete code in A string is added to the Annotation Marker to record the position of input tokens and number of tokens to be processed. -### Step 3: Build llama-cli +## Compile llama-cli with annotation support For convenience, llama-cli is statically linked. @@ -138,7 +139,7 @@ cd ~/llama.cpp mkdir build && cd build ``` -Next, configure the project. +Next, configure the project: {{< tabpane code=true >}} {{< tab header="Arm Native Build" language="bash">}} @@ -194,4 +195,6 @@ cmake --build ./ --config Release -j $(nproc) After the building process completes, you can find the `llama-cli` in the `~/llama.cpp/build/bin/` directory. -You now have an annotated version of `llama-cli` ready for Streamline. \ No newline at end of file +## Summary + +You have successfully integrated Streamline annotations into llama.cpp and built an annotated version of `llama-cli`. The annotation markers you added will help identify token generation events during profiling. In the next section, you'll use this instrumented executable to capture performance data with Streamline and analyze the distinct characteristics between Prefill and Decode stages during LLM inference. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md index 00472c5863..58838ce1dc 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md @@ -1,34 +1,40 @@ --- -title: Run llama-cli and analyze the data with Streamline +title: Analyze token generation performance with Streamline profiling weight: 5 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Run llama-cli and analyze the data with Streamline +## Set up the profiling environment -After successfully building llama-cli, the next step is to set up the runtime environment on your Arm platform. This can be your development machine or another Arm system. +After successfully building llama-cli, the next step is to set up the runtime environment on your Arm platform.
This can be on your development machine or another Arm system. You'll configure the gator daemon for performance data collection and prepare your target system with the necessary executables and model files. This setup enables comprehensive performance analysis of both the compute-intensive Prefill stage and memory-bound Decode operations during LLM inference. -### Set up the gator daemon +## Set up the gator daemon -The gator daemon, `gatord`, is the Streamline collection agent that runs on the target device. It captures performance data including CPU metrics, PMU events, and annotations, then sends this data to the Streamline analysis tool running on your host machine. The daemon needs to be running on your target device before you can capture performance data. + Start with setting up the gator daemon. The setup process depends on your llama.cpp build method. -Depending on how you built llama.cpp: - -For the cross-compiled build flow: +{{% notice Note %}} +The daemon must be running on your target device before you can capture performance data. +{{% /notice %}} - - Copy the `llama-cli` executable to your Arm target. - - Copy the `gatord` binary from the Arm Performance Studio release. If you are targeting Linux, take it from `streamline\bin\linux\arm64` and if you are targeting Android take it from `streamline\bin\android\arm64`. +### For cross-compiled builds: +Copy the required files to your Arm target system: +- Transfer the `llama-cli` executable to your target device +- Copy the `gatord` binary from your Arm Performance Studio installation: + - Linux targets: Use `streamline\bin\linux\arm64\gatord` + - Android targets: Use `streamline\bin\android\arm64\gatord` -Put both of these programs in your home directory on the target system. +Place both programs in your home directory on the target system. -For the native build flow: - - Use the `llama-cli` from your local build in `llama.cpp/build/bin` and the `gatord` you compiled earlier at `~/gator/build-native-gcc-rel/gatord`. +### For native builds: +Use the locally built binaries: +- The `llama-cli` executable from `llama.cpp/build/bin` +- The `gatord` binary you compiled earlier at `~/gator/build-native-gcc-rel/gatord` -You now have the `gatord` and the `llama-cli` on the computer you want to run and profile. +Both programs are now ready for profiling on your target Arm system. -### Download a lightweight model +## Download a lightweight model You can download the LLM model to the target platform. @@ -39,7 +45,7 @@ cd ~ wget https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat-GGUF/resolve/main/qwen1_5-0_5b-chat-q4_0.gguf ``` -### Run the Gator daemon +## Run the gator daemon Start the gator daemon on your Arm target: @@ -56,115 +62,114 @@ Copyright (c) 2010-2025 Arm Limited. All rights reserved. Gator ready ``` -### Connect Streamline +## Connect Streamline Next, you can use Streamline to set up the collection of CPU performance data. -If you're accessing the Arm server via SSH, you need to forward port `8080` from the host platform to your local machine. +If you're accessing the Arm server via SSH, you need to forward port `8080` from the host platform to your local machine: ``` bash ssh -i user@arm-server -L 8080:localhost:8080 -N ``` -Append `-L 8080:localhost:8080 -N` to your original SSH command to enable local port forwarding, this allows Arm Streamline on your local machine to connect to the Arm server. +Append `-L 8080:localhost:8080 -N` to your original SSH command to enable local port forwarding. 
This allows Arm Streamline on your local machine to connect to the Arm server. -Then launch the Streamline application on your host machine, connect to the gatord running on your Arm target with either TCP or ADB connection. +Then launch the Streamline application on your host machine and connect to the gatord running on your Arm target with either TCP or ADB connection. You can select PMU events to be monitored at this point. {{% notice Note %}} If you are using ssh port forwarding, you need to select TCP `127.0.0.1:8080`. {{% /notice %}} -![text#center](images/streamline_capture.png "Figure 6. Streamline Start Capture ") +![Screenshot of Arm Streamline application showing the capture configuration interface with connection settings and PMU event selection alt-text#center](images/streamline_capture.png "Streamline start capture") -Set the path of llama-cli executable for Streamline so that its debug info can be used for analysis. -![text#center](images/streamline_capture_image.png "Figure 7. Streamline image path") +Set the path of llama-cli executable for Streamline so that its debug info can be used for analysis: +![Screenshot showing Streamline image path configuration for llama-cli executable debug information alt-text#center](images/streamline_capture_image.png "Streamline image path") -Click `Start Capture` button on Streamline to start collecting data from the Arm target. +Click the **Start Capture** button on Streamline to start collecting data from the Arm target. {{% notice Note %}} -This guide is not intended to introduce how to use Streamline, if you encounter any issues with gatord or Streamline, please refer to the [Streamline User Guide](https://developer.arm.com/documentation/101816/latest/?lang=en) +This Learning Path focuses on analyzing llama.cpp performance data. If you encounter issues with gatord or Streamline setup, check the [Streamline User Guide](https://developer.arm.com/documentation/101816/latest/?lang=en) for detailed troubleshooting steps. {{% /notice %}} -### Run llama-cli +## Run llama-cli -Run the `llama-cli` executable as below: +Run the `llama-cli` executable as shown below: ``` bash cd ~/llama.cpp/build/bin ./llama-cli -m qwen1_5-0_5b-chat-q4_0.gguf -p "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n<|im_start|>user\nTell me a story about a fox and a crow? Please do not tell the traditional story in Aesop's fables. Please tell me a positive story about friendship and love. The story should have no more than 400 words<|im_end|>\n<|im_start|>assistant\n" -st -t 1 ``` -After a while, you can stop the Streamline data collection by clicking the `Stop` button on Streamline. +After a while, you can stop the Streamline data collection by clicking the **Stop** button on Streamline. Streamline running on your host PC will start the data analysis. -### Analyze the data with Streamline +## Analyze the data with Streamline -From the timeline view of Streamline, you can see some Annotation Markers. Since an Annotation Marker is added before the llama_decode function, each Annotation Marker marks the start time of a token generation. -![text#center](images/annotation_marker_1.png "Figure 8. Annotation Marker") +From the timeline view of Streamline, you can see some annotation markers. Since an Annotation Marker is added before the llama_decode function, each Annotation Marker marks the start time of a token generation. 
+![Screenshot of Streamline timeline view showing annotation markers indicating token generation start points during llama.cpp execution alt-text#center](images/annotation_marker_1.png "Annotation marker") -The string in the Annotation Marker can be shown when clicking those Annotation Markers. For example, -![text#center](images/annotation_marker_2.png "Figure 9. Annotation String") +You can view the annotation details by clicking on any Annotation Marker in the timeline. This displays the marker string with token position and processing information: -The number after `past` indicates the position of input tokens, the number after `n_eval` indicates the number of tokens to be processed this time. +![Screenshot showing detailed annotation marker information with token position and count data displayed in Streamline alt-text#center](images/annotation_marker_2.png "Annotation string") + +The number after **past** indicates the position of input tokens, the number after **n_eval** indicates the number of tokens to be processed this time. By checking the string of Annotation Marker, the first token generation at Prefill stage has `past 0, n_eval 78`, which means that the position of input tokens starts at 0 and there are 78 input tokens to be processed. -You can see that the first token generated at Prefill stage takes more time, since 78 input tokens have to be processed at Prefill stage, it performs lots of GEMM operations. At Decode stage, tokens are generated one by one at mostly equal speed, one token takes less time than that of Prefill stage, thanks to the effect of KV cache. At Decode stage, it performs many GEMV operations. +You can see that the first token generated at the Prefill stage takes more time since 78 input tokens have to be processed at the Prefill stage, performing lots of GEMM operations. At the Decode stage, tokens are generated one by one at mostly equal speed; one token takes less time than that of the Prefill stage, thanks to the effect of KV cache. At the Decode stage, it performs many GEMV operations. -You can further investigate it with PMU event counters that are captured by Streamline. At Prefill stage, the amount of computation, which are indicated by PMU event counters that count number of Advanced SIMD (NEON), Floating point, Integer data processing instruction, is large. However, the memory access is relatively low. Especially, the number of L3 cache refill/miss is much lower than that of Decode stage. +You can further investigate it with PMU event counters that are captured by Streamline. At the Prefill stage, the amount of computation, which is indicated by PMU event counters that count the number of Advanced SIMD (NEON), floating-point, and integer data processing instructions, is large. However, the memory access is relatively low. Especially, the number of L3 cache refill/miss is much lower than that of the Decode stage. -At Decode stage, the amount of computation is relatively less (since the time of each token is less), but the number of L3 cache refill/miss goes much higher. +At Decode stage, the amount of computation is relatively less (since the time of each token is less), but the number of L3 cache refills/misses increases significantly. -![text#center](images/annotation_pmu_stall.png "Figure 11. 
Backend stall PMU event") +![Graph showing PMU backend stall cycles analysis comparing memory stalls between Prefill and Decode stages alt-text#center](images/annotation_pmu_stall.png "Backend stall PMU event") You can see that at Prefill stage, Backend Stall Cycles due to Memory stall are only about 10% of total Backend Stall Cycles. However, at Decode stage, Backend Stall Cycles due to Memory stall are around 50% of total Backend Stall Cycles. All those PMU event counters indicate that it is compute-bound at Prefill stage and memory-bound at Decode stage. -Now, you can further profile the code execution with Streamline. In the Call Paths view of Streamline, you can see the percentage of running time of functions that are organized in form of call stack. +Now, you can further profile the code execution with Streamline. In the **Call Paths** view of Streamline, you can see the percentage of running time of functions that are organized in form of call stack. -![text#center](images/annotation_prefill_call_stack.png "Figure 12. Call stack") +![Screenshot of Streamline Call Paths view showing function execution hierarchy and performance distribution alt-text#center](images/annotation_prefill_call_stack.png "Call stack") In the Functions view of Streamline, you can see the overall percentage of running time of functions. -![text#center](images/annotation_prefill_functions.png "Figure 13. Functions view") +![Screenshot of Streamline Functions view displaying execution time percentages for different functions during llama.cpp execution alt-text#center](images/annotation_prefill_functions.png "Functions view") + +As you can see, the function, graph_compute, takes the largest portion of the running time. It shows that large amounts of GEMM and GEMV operations take most of the time. With the `Qwen1_5-0_5b-chat-q4_0` model, the computation (GEMM and GEMV) of Q, K, V vectors and most of FFN layers: their weights are with Q4_0 data type and the input activations are with FP32 data type. The computation is forwarded to KleidiAI trait by `ggml_cpu_extra_compute_forward`. KleidiAI microkernels implemented with NEON dot product and i8mm vector instructions accelerate the computation. + +At the Prefill stage, `kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm` KleidiAI ukernel is used for GEMM (Matrix Multiply) operators. It takes advantage of i8mm instructions. Since the Prefill stage only takes a small percentage of the whole time, the percentage of this function is small as shown in figures above. However, if you focus only on the Prefill stage with Samplings view in Timeline, you see `kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm` takes the largest portion of the Prefill stage. -As you can see, the function, graph_compute, takes the largest portion of the running time. +![Screenshot showing Streamline analysis focused on Prefill stage execution with KleidiAI GEMM operations highlighted alt-text#center](images/prefill_only.png "Prefill only view") -It shows that large amounts of GEMM and GEMV operations take most of the time. +At the Decode stage, `kai_run_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod` KleidiAI ukernel is used for GEMV operators. It takes advantage of dot product instructions. If you look only at the Decode stage, you can see this function takes the second largest portion. 
-With the `Qwen1_5-0_5b-chat-q4_0` model, the computation (GEMM and GEMV) of Q, K, V vectors and most of FFN layers: their weights are with Q4_0 data type and the input activations are with FP32 data type. +![Screenshot showing Streamline analysis focused on Decode stage execution with KleidiAI GEMV operations highlighted alt-text#center](images/decode_only.png "Decode only view") -The computation is forwarded to KleidiAI trait by `ggml_cpu_extra_compute_forward`. KleidiAI microkernels implemented with NEON dot product and i8mm vector instructions accelerate the computation. +There is a `result_output` linear layer in the Qwen1_5-0_5b-chat-q4_0 model where the weights use Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation cannot be handled by KleidiAI yet, so it is handled by the `ggml_vec_dot_q6_K_q8_K` function in the ggml-cpu library. -At the Prefill stage, `kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm` KleidiAI ukernel is used for GEMM (Matrix Multiply) operators. It takes advantage of i8mm instructions. Since the Prefill stage only takes a small percentage of the whole time, the percentage of this function is small as shown in figures above. However, if you focus only on the Prefill stage with Samplings view in Timeline, you see `kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm` takes the largest portion of the Prefill stage. +The tensor nodes for Multi-Head attention computation are represented as three-dimensional matrices with FP16 data type (KV cache also holds FP16 values). These are computed by the `ggml_vec_dot_f16` function in the ggml-cpu library. -![text#center](images/prefill_only.png "Figure 14. Prefill only view") +The computation of RoPE, Softmax, and RMSNorm layers does not take a significant portion of the running time. -At the Decode stage, `kai_run_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod` KleidiAI ukernel is used for GEMV operators. It takes advantage of dot product instructions. If you look only at the Decode stage, you can see this function takes the second largest portion. +## Analyze results + +The profiling data reveals clear differences between Prefill and Decode stages: + +- Annotation Markers show token generation start points. The Prefill stage shows `past 0, n_eval 78`, indicating 78 input tokens processed simultaneously. During Decode, tokens are generated one at a time. -![text#center](images/decode_only.png "Figure 15. Decode only view") +- Performance characteristics differ significantly between stages. Prefill demonstrates compute-bound behavior with high SIMD, floating-point, and integer instruction counts but relatively few L3 cache misses. Decode shows memory-bound behavior with lighter compute workloads but frequent L3 cache accesses. -There is a result_output linear layer in Qwen1_5-0_5b-chat-q4_0 model, the weights are with Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation cannot be handled by KleidiAI yet, it is handled by the ggml_vec_dot_q6_K_q8_K function in ggml-cpu library. +- PMU events confirm this analysis. Backend stall cycles due to memory account for only ~10% of total stalls during Prefill, but increase to ~50% during Decode. This pattern indicates efficient compute utilization during Prefill and memory bottlenecks during Decode. 
-The tensor nodes for computation of Multi-Head attention are presented as three-dimension matrices with FP16 data type (KV cache also holds FP16 values), they are computed by ggml_vec_dot_f16 function in ggml-cpu library. +| Stage | Main Operations | Bottleneck | Key Observations | +|---------|----------------|----------------|-------------------------------------------------| +| Prefill | GEMM | Compute-bound | Heavy SIMD/FP/INT ops, few cache refills | +| Decode | GEMV | Memory-bound | Light compute, many L3 cache misses, ~50% stalls | -The computation of RoPE, Softmax, RMSNorm layers does not take significant portion of the running time. +The results demonstrate how KV caching transforms the computational profile from matrix-matrix operations during Prefill to vector-matrix operations during Decode, fundamentally changing the performance characteristics. -### Analyzing results -- Annotation Markers show token generation start points. -- Prefill stage: past 0, n_eval 78 → compute-bound (large GEMM). -- Decode stage: one token at a time → memory-bound (KV cache, GEMV). -- PMU events: SIMD/FP/INT instructions high in Prefill, L3 cache misses high in Decode. -- Backend stalls: ~10% memory stalls in Prefill vs ~50% in Decode. +## Summary -| Stage | Main Ops | Bottleneck | Observations | -|---------|----------|----------------|--------------------------------------------------| -| Prefill | GEMM | Compute-bound | Heavy SIMD/FP/INT ops, few cache refills | -| Decode | GEMV | Memory-bound | Light compute, many L3 cache misses, ~50% stalls | -| Prefill | GEMM | Compute-bound | Heavy SIMD/FP/INT ops, few cache refills | -| Decode | GEMV | Memory-bound | Light compute, many L3 cache misses, ~50% stalls | -|---------|----------|----------------|--------------------------------------------------| -| Prefill | GEMM | Compute-bound | Heavy SIMD/FP/INT ops, few cache refills | -| Decode | GEMV | Memory-bound | Light compute, many L3 cache misses, ~50% stalls | +You have successfully captured and analyzed LLM inference performance using Streamline. Use this data to optimize your applications by identifying the distinct characteristics between Prefill (compute-bound) and Decode (memory-bound) stages. Leverage the function execution time data and PMU event correlations to pinpoint performance bottlenecks in your inference pipeline. Apply these insights to make informed decisions about hardware selection and code optimization strategies. Take advantage of this foundation to dive deeper into operator-level analysis in the next section, where you'll gain even more granular control over your LLM performance optimization efforts. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/5_operator_deepdive.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/5_operator_deepdive.md index 8b2ebf8eb0..4ac130aafa 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/5_operator_deepdive.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/5_operator_deepdive.md @@ -1,16 +1,18 @@ --- -title: Deep dive into operators +title: Implement operator-level performance analysis with Annotation Channels weight: 6 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Deep dive into operators +## Overview of Annotation Channels -You can use Streamline Annotation Channels to analyze the execution time of each node in the compute graph. 
More details on Annotation Channels can be found in the [Group and Channel annotations](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code/User-space-annotations/Group-and-Channel-annotations?lang=en) section of the Streamline User Guide. +You can use Streamline Annotation Channels to analyze the execution time of each node in the compute graph, which is especially valuable for understanding and optimizing performance on Arm-based systems. Annotation Channels are specialized annotations that group related operations into separate visual channels in Streamline. Unlike simple markers, channels allow you to track multiple concurrent operations and see their relationships over time. -## Integrating Annotation Channels into llama.cpp +More details on Annotation Channels can be found in the [Group and Channel annotations](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code/User-space-annotations/Group-and-Channel-annotations?lang=en) section of the Streamline User Guide. + +## Integrate Annotation Channels into llama.cpp In llama.cpp, tensor nodes are executed in the CPU backend inside the function `ggml_graph_compute_thread()` in the file `~/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c`. @@ -23,15 +25,11 @@ for (int node_n = 0; node_n < cgraph->n_nodes && atomic_load_explicit(&tp->abort ggml_compute_forward(¶ms, node); ``` -To monitor operator execution time, you can create annotation channels for each type of operators such as `GGML_OP_MUL_MAT`, `GGML_OP_SOFTMAX`, `GGML_OP_ROPE` and `GGML_OP_MUL`. - -Since `GGML_OP_MUL_MAT` including both GEMM and GEMV operation takes significant portion of execution time, two dedicated annotation channels are created for GEMM and GEMV respectively. - -The annotation starts at the beginning of `ggml_compute_forward()` and stops at the end, so that the computation of tensor node/operator can be monitored. +To monitor operator execution time, you can create annotation channels for each type of operators such as `GGML_OP_MUL_MAT`, `GGML_OP_SOFTMAX`, `GGML_OP_ROPE`, and `GGML_OP_MUL`. Matrix operations (`GGML_OP_MUL_MAT`) take a significant portion of execution time. These operations include both GEMM (General Matrix Multiply) and GEMV (General Matrix-Vector multiply) operations. You'll create two dedicated annotation channels for GEMM and GEMV respectively to analyze their performance separately. The annotation starts at the beginning of `ggml_compute_forward()` and stops at the end. This approach allows you to monitor the computation time of each tensor node/operator. -### Step 1: Add annotation code +## Add annotation code to monitor operators -First, add Streamline annotation header file to the file `ggml-cpu.c`: +First, add the Streamline annotation header file to `ggml-cpu.c`: ```c #include "streamline_annotate.h" @@ -39,7 +37,7 @@ First, add Streamline annotation header file to the file `ggml-cpu.c`: Edit the `ggml_graph_compute_thread()` function in the file `~/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c`. -Add following code in front and after the `ggml_compute_forward(¶ms, node)`. +Add the following code in front and after the `ggml_compute_forward(¶ms, node)`. 
Your code now looks like: @@ -79,9 +77,9 @@ for (int node_n = 0; node_n < cgraph->n_nodes && atomic_load_explicit(&tp->abort // --- End Annotation Channel for Streamline ``` -### Step 2: Add tensor shape info (optional) +## Include tensor shape information (optional) -You can also add information of the shape and size of source tensor by replace sprintf function as follow: +You can also add information about the shape and size of source tensors by replacing the sprintf function as follows: ```c sprintf(printf_buf,"%s %s %d_%d_%d %d_%d_%d", node->name, ggml_get_name(node), \ @@ -94,9 +92,9 @@ You can also add information of the shape and size of source tensor by replace s ); ``` -### Step 3: Update CMakeLists +## Update build configuration -Edit `~/llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt` to include Streamline Annotation header file and `libstreamline_annotate.a` library by adding the lines: +Edit `~/llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt` to include the Streamline Annotation header file and `libstreamline_annotate.a` library by adding these lines: ```bash set(STREAMLINE_LIB_PATH "${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a") @@ -106,29 +104,27 @@ Edit `~/llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt` to include Streamline Annota Then, build `llama-cli` again. -### Analyze the data with Streamline +## Examine operator performance patterns -Run `llama-cli` and collect profiling data with Streamline as you did in the previous session. +Run `llama-cli` and collect profiling data with Streamline as you did in the previous section. -String annotations are displayed as text overlays inside the relevant channels in the details panel of the Timeline view. +Arm Streamline displays string annotations as text overlays in the relevant channels in the Timeline view, such as Channel 0, as shown in the following screenshot. -For example, inside Channel 0 in the following screenshot. +![Screenshot of Streamline annotation channels displaying operator execution timing with channel indicators alt-text#center](images/deep_dive_1.png "Annotation channel") -![text#center](images/deep_dive_1.png "Figure 16. Annotation Channel") +The letter `A` is displayed in the process list to indicate the presence of annotations. -The letter A is displayed in the process list to indicate the presence of annotations. +String annotations are also displayed in the **Message** column in the Log view. -String annotations are also displayed in the Message column in the Log view. +![Screenshot of Streamline Log view showing annotation messages with operator details and timing information alt-text#center](images/deep_dive_2.png "Annotation log") -![text#center](images/deep_dive_2.png "Figure 17. Annotation log") +## Compare GEMM operations during prefill -### View the individual operators at Prefill stage +The screenshot of annotation channel view at prefill stage is shown as below: -The screenshot of annotation channel view at Prefill stage is shown as below: +![Screenshot showing Streamline annotation channels during Prefill stage with operator categorization and timing visualization alt-text#center](images/prefill_annotation_channel.png "Annotation channel at Prefill stage") -![text#center](images/prefill_annotation_channel.png "Figure 18. Annotation Channel at Prefill stage") - -The name of operator in the screenshot above is manually edited. If the name of operator needs to be shown instead of Channel number by Streamline, ANNOTATE_NAME_CHANNEL can be added to ggml_graph_compute_thread function. 
+The operator name in the screenshot above is manually edited. If you want the operator name to be shown instead of the Channel number by Streamline, you can add ANNOTATE_NAME_CHANNEL to the `ggml_graph_compute_thread` function. This annotation macro is defined as: @@ -136,7 +132,7 @@ This annotation macro is defined as: ANNOTATE_NAME_CHANNEL(channel, group, string) ``` -For example, +For example: ```c ANNOTATE_NAME_CHANNEL(0, 0, "MUL_MAT_GEMV"); @@ -145,9 +141,9 @@ For example, The code above sets the name of annotation channel 0 as `MUL_MAT_GEMV` and channel 1 as `MUL_MAT_GEMM`. -By zooming into the timeline view, you can see more details: +Zoom into the timeline view to examine additional details: -![text#center](images/prefill_annotation_channel_2.png "Figure 19. Annotation Channel at Prefill stage") +![Detailed view of Streamline annotation channels showing individual operator execution blocks during Prefill stage alt-text#center](images/prefill_annotation_channel_2.png "Annotation channel at Prefill stage") When moving the cursor over an annotation channel, Streamline shows: @@ -155,9 +151,9 @@ When moving the cursor over an annotation channel, Streamline shows: - The operator type - The shape and size of the source tensors -![text#center](images/prefill_annotation_channel_3.png "Figure 20. Annotation Channel Zoom in") +![Close-up screenshot of annotation channel tooltip showing tensor node details including operator type and tensor dimensions alt-text#center](images/prefill_annotation_channel_3.png "Annotation channel zoom in") -In the example above, you see a `GGML_OP_MUL_MAT` operator for the `FFN_UP` node. +The example above shows a `GGML_OP_MUL_MAT` operator for the `FFN_UP` node. The source tensors have shapes [1024, 2816] and [1024, 68]. This view makes it clear that: @@ -165,18 +161,17 @@ This view makes it clear that: - There is also a large `MUL_MAT GEMV` operation in the `result_output` linear layer. - Other operators, such as MUL, Softmax, Norm, RoPE, consume only a small portion of execution time. -### View of individual operators at Decode stage +## Analyze GEMV operations during Decode The annotation channel view for the Decode stage is shown below: -![text#center](images/decode_annotation_channel.png "Figure 21. Annotation Channel at Decode stage") +![Screenshot showing Streamline annotation channels during Decode stage highlighting GEMV operations and reduced computation time alt-text#center](images/decode_annotation_channel.png "Annotation channel at Decode stage") Zooming in provides additional details: -![text#center](images/decode_annotation_channel_2.png "Figure 22. Annotation Channel string") +![Detailed view of Decode stage annotation channels showing shorter execution blocks compared to Prefill stage alt-text#center](images/decode_annotation_channel_2.png "Annotation channel string") +This view reveals that the majority of time in Decode is spent on `MUL_MAT GEMV` operations within the attention and FFN layers. Unlike the Prefill stage, no GEMM operations are executed in these layers during Decode. The `result_output` linear layer contains a large GEMV operation that takes an even larger proportion of runtime in Decode compared to Prefill. This pattern is expected since each token generation at Decode is shorter due to KV cache reuse, making the `result_output` layer more dominant in the overall execution profile. 
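+Putting the pieces of this section together, a minimal sketch of this kind of per-operator channel annotation is shown below. The channel numbering, the `pick_channel()` helper, and the single-column heuristic used to tell GEMV apart from GEMM are illustrative assumptions rather than the exact code added in this Learning Path:
+
+```c
+// Illustrative sketch only: channel numbers, pick_channel(), and the GEMV
+// heuristic are assumptions, not the exact code used in this Learning Path.
+#include "ggml.h"
+#include "streamline_annotate.h"
+
+enum { CHANNEL_MUL_MAT_GEMV = 0, CHANNEL_MUL_MAT_GEMM = 1, CHANNEL_OTHER_OPS = 2 };
+
+// Name the channels once so Streamline shows readable labels instead of numbers.
+static void name_annotation_channels(void) {
+    ANNOTATE_NAME_CHANNEL(CHANNEL_MUL_MAT_GEMV, 0, "MUL_MAT_GEMV");
+    ANNOTATE_NAME_CHANNEL(CHANNEL_MUL_MAT_GEMM, 0, "MUL_MAT_GEMM");
+    ANNOTATE_NAME_CHANNEL(CHANNEL_OTHER_OPS,    0, "OTHER_OPS");
+}
+
+// Pick a channel for a tensor node; a single-column src1 indicates a GEMV.
+static int pick_channel(const struct ggml_tensor * node) {
+    if (node->op == GGML_OP_MUL_MAT) {
+        if (node->src[1] != NULL && node->src[1]->ne[1] == 1) {
+            return CHANNEL_MUL_MAT_GEMV;
+        }
+        return CHANNEL_MUL_MAT_GEMM;
+    }
+    return CHANNEL_OTHER_OPS;
+}
+
+// Inside the node loop, each node's computation is bracketed like this:
+//     int ch = pick_channel(node);
+//     ANNOTATE_CHANNEL(ch, ggml_op_name(node->op));   // start of this node
+//     ggml_compute_forward(&params, node);
+//     ANNOTATE_CHANNEL_END(ch);                       // end of this node
+```
+
+The `ANNOTATE_CHANNEL` and `ANNOTATE_CHANNEL_END` macros from `streamline_annotate.h` mark the start and end of an activity on a given channel, which is what produces the per-operator blocks visible in the annotation channel screenshots above.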
+ +## Summary -From this view, you can see: -- The majority of time in Decode is spent on `MUL_MAT GEMV` operations in the attention and FFN layers. -- In contrast to Prefill, **no GEMM operations** are executed in these layers. -- The `result_output` linear layer has a large GEMV operation, which takes an even larger proportion of runtime in Decode. -- This is expected, since each token generation at Decode is shorter due to KV cache reuse, making the result_output layer more dominant. +You have successfully implemented Annotation Channels to analyze individual operators within llama.cpp. This detailed view reveals how different operators contribute to overall execution time and shows the stark differences between Prefill (GEMM-dominated) and Decode (GEMV-dominated) stages. The next section will explore how these operations utilize multiple CPU cores and threads. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/6_multithread_analyze.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/6_multithread_analyze.md index 00b0c4bf00..0d3afbc47e 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/6_multithread_analyze.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/6_multithread_analyze.md @@ -1,63 +1,60 @@ --- -title: Analyze multi-threaded performance +title: Examine multi-threaded performance patterns in llama.cpp weight: 7 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Analyze multi-threaded performance +## Understand llama.cpp multi-threading architecture -The CPU backend in llama.cpp uses multiple cores and threads to accelerate operator execution. +The CPU backend in llama.cpp uses multiple cores and threads to accelerate operator execution. Understanding how work is distributed across threads helps you optimize performance on Arm processors. -It creates a threadpool, where: -- The number of threads is controlled by the `-t` option -- If `-t` is not specified, it defaults to the number of CPU cores in the system +llama.cpp creates a threadpool where the number of threads is controlled by the `-t` option. If `-t` is not specified, it defaults to the number of CPU cores in the system. The `-C` option controls thread affinity, which determines which specific cores threads run on. -The entrypoint for secondary threads is the function `ggml_graph_compute_secondary_thread()`. +The entry point for secondary threads is the function `ggml_graph_compute_secondary_thread()`. When computing a tensor node/operator with a large workload, llama.cpp splits the computation into multiple parts and distributes these parts across threads for parallel execution. -When computing a tensor node/operator with a large workload, llama.cpp splits the computation into multiple parts and distributes them across threads. - -### Example: MUL_MAT Operator +## Example: MUL_MAT operator parallelization For the MUL_MAT operator, the output matrix C can be divided across threads: -![text#center](images/multi_thread.jpg "Figure 23. Multi-Thread") +![Diagram illustrating how MUL_MAT operator computation is distributed across multiple threads, with each thread computing a portion of the output matrix alt-text#center](images/multi_thread.jpg "Multi-thread") -In this example, four threads each compute one quarter of matrix C. +In this example, four threads each compute one quarter of matrix C. 
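+The following minimal C sketch illustrates the idea behind this row-based split. It is not the actual ggml-cpu kernel code, which distributes work through the `ith`/`nth` fields of its compute parameters and dispatches to optimized vector kernels, but the partitioning scheme is the same:
+
+```c
+// Illustrative sketch of splitting a matrix multiplication across threads.
+// This is not the ggml-cpu implementation; it only shows the partitioning idea.
+#include <stddef.h>
+
+// Compute rows [row_start, row_end) of C = A * B, where A is M x K and B is K x N.
+static void matmul_rows(const float *A, const float *B, float *C,
+                        size_t K, size_t N,
+                        size_t row_start, size_t row_end) {
+    for (size_t i = row_start; i < row_end; i++) {
+        for (size_t j = 0; j < N; j++) {
+            float sum = 0.0f;
+            for (size_t k = 0; k < K; k++) {
+                sum += A[i * K + k] * B[k * N + j];
+            }
+            C[i * N + j] = sum;
+        }
+    }
+}
+
+// Thread ith of nth computes one contiguous block of rows of C.
+static void matmul_thread_part(const float *A, const float *B, float *C,
+                               size_t M, size_t K, size_t N,
+                               int ith, int nth) {
+    size_t rows_per_thread = (M + (size_t)nth - 1) / (size_t)nth; // ceiling division
+    size_t row_start = (size_t)ith * rows_per_thread;
+    size_t row_end   = row_start + rows_per_thread;
+    if (row_start > M) { row_start = M; }
+    if (row_end   > M) { row_end   = M; }
+    matmul_rows(A, B, C, K, N, row_start, row_end);
+}
+```
+
+With four threads and a row count divisible by four, each thread computes exactly one quarter of the rows of C, matching the diagram above.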
-### Observing thread execution with Streamline
+## Profile thread execution with Streamline

-The execution of multiple threads on CPU cores can be observed using Core Map and Cluster Map modes in the Streamline Timeline.
+The execution of multiple threads on CPU cores can be observed using Core Map and Cluster Map modes in the Streamline Timeline. These visualization modes show how threads are distributed across CPU cores and help identify performance bottlenecks in parallel execution. Learn more about these modes in the [Core Map and Cluster Map modes](https://developer.arm.com/documentation/101816/9-7/Analyze-your-capture/Viewing-application-activity/Core-Map-and-Cluster-Map-modes) section of the Streamline User Guide.

-Run llama-cli with `-t 2 -C 0x3` to specify two threads and thread affinity as CPU core0 and core1.
+## Configure thread affinity for analysis
+
+Run llama-cli with `-t 2 -C 0x3` to specify two threads and thread affinity as CPU core0 and core1. Thread affinity ensures threads run on specific cores, making performance analysis more predictable.

```bash
./llama-cli -m qwen1_5-0_5b-chat-q4_0.gguf -p "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n<|im_start|>user\nTell me a story about a fox and a crow? Please do not tell the traditional story in Aesop's fables. Please tell me a positive story about friendship and love. The story should have no more than 400 words<|im_end|>\n<|im_start|>assistant\n" -st -t 2 -C 0x3
```

-### Streamline results
+## Analyze Streamline results

Collect profiling data with Streamline, then select Core Map and Cluster Map modes in the Timeline view.

-![text#center](images/multi_thread_core_map.png "Figure 24. Multi-Thread")
+![Screenshot of Streamline Core Map view showing thread execution across CPU cores with thread affinity mapping alt-text#center](images/multi_thread_core_map.png "Multi-thread core map")
+
+In the screenshot above, you can observe that two threads are created and they are running on CPU core0 and CPU core1, respectively. This confirms that the thread affinity configuration is working correctly.
+
+You can also use the Annotation Channel view to analyze operator execution on a per-thread basis. Each thread generates its own annotation channel independently, allowing you to see how work is distributed across parallel execution units.

-In the screenshot above:
- Two threads are created
- They are running on CPU core0 and CPU core1, respectively
+![Screenshot showing Streamline annotation channels with multiple threads executing the same tensor node simultaneously alt-text#center](images/multi_thread_annotation_channel.png "Multi-thread annotation channels")

-In addition, you can use the Annotation Channel view to analyze operator execution on a per-thread basis. Each thread generates its own annotation channel independently.
+In the screenshot above, at the highlighted time, both threads are executing the same node. In this particular case, the node is the result_output linear layer. You can see how the workload is distributed across threads, with each thread processing a different portion of the matrix computation. This visualization helps identify load balancing issues and optimization opportunities in parallel execution.

-![text#center](images/multi_thread_annotation_channel.png "Figure 25. Multi-Thread")
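+
+As a side note on the `-C 0x3` value used earlier: it selects CPU core 0 and core 1, which is consistent with a core bitmask in which bit i enables core i. The sketch below expresses the same idea directly with the Linux affinity API. It is a standalone illustration, not the code path llama-cli uses to handle `-C`, and the helper name `pin_to_mask` is made up for this example.
+
+```c
+// Decode a hex core mask and pin the calling thread to the selected cores.
+// Uses the GNU pthread_setaffinity_np() extension; Linux only.
+#define _GNU_SOURCE
+#include <pthread.h>
+#include <sched.h>
+#include <stdio.h>
+
+static int pin_to_mask(unsigned long mask) {
+    cpu_set_t set;
+    CPU_ZERO(&set);
+    for (int core = 0; core < (int)(8 * sizeof(mask)); core++) {
+        if (mask & (1UL << core)) {
+            CPU_SET(core, &set);
+            printf("enabling core %d\n", core);
+        }
+    }
+    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
+}
+
+int main(void) {
+    // 0x3 = binary 11, which selects core 0 and core 1 as in the -C 0x3 example.
+    if (pin_to_mask(0x3) != 0) {
+        fprintf(stderr, "failed to set thread affinity\n");
+        return 1;
+    }
+    printf("calling thread pinned to cores 0 and 1\n");
+    return 0;
+}
+```
+
+Pinning threads to known cores keeps the mapping between threads and cores stable from run to run, which makes the Core Map view easier to compare across captures.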
Multi-Thread") +## Summary -In the screenshot above, at the highlighted time: -- Both threads are executing the same node -- In this case, the node is the result_output linear layer +You have successfully completed the walkthrough of profiling an LLM model on an Arm CPU using advanced multi-threading analysis techniques. -You have completed the walkthrough of profiling an LLM model on an Arm CPU! +You now understand how to integrate Streamline annotations into LLM inference code for detailed profiling, capture and analyze performance data showing the distinct characteristics of Prefill and Decode stages, and use Annotation Channels to analyze individual operators and their execution patterns. Additionally, you can configure thread affinity and examine multi-threaded execution patterns across CPU cores while identifying performance bottlenecks and work distribution issues in parallel execution. -By combining Arm Streamline with a solid understanding of llama.cpp, you can visualize model execution, analyze code efficiency, and identify opportunities for optimization. +These skills enable you to optimize LLM performance on Arm CPUs by understanding where computational resources are spent and how to leverage multi-core parallelism effectively. By combining Arm Streamline with a solid understanding of llama.cpp threading architecture, you can visualize model execution, analyze code efficiency, and identify opportunities for optimization. -Keep in mind that adding annotation code to llama.cpp and gatord may introduce a small performance overhead, so profiling results should be interpreted with this in mind. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md index 1f97f58d6e..840ec69ccb 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md @@ -1,19 +1,15 @@ --- -title: Analyze llama.cpp with KleidiAI LLM performance using Streamline +title: Profile llama.cpp performance with Arm Streamline and KleidiAI LLM kernels -draft: true -cascade: - draft: true +minutes_to_complete: 60 -minutes_to_complete: 50 - -who_is_this_for: This is an advanced topic for software developers, performance engineers, and AI practitioners who want to run llama.cpp on Arm-based CPUs, learn how to use Arm Streamline to capture and analyze performance data, and understand how LLM inference behaves at the Prefill and Decode stages. +who_is_this_for: This is an advanced topic for software developers, performance engineers, and AI practitioners who want to optimize llama.cpp performance on Arm-based CPUs. 
learning_objectives:
- - Describe the architecture of llama.cpp and the role of the Prefill and Decode stages
+ - Describe the llama.cpp architecture and identify the role of the Prefill and Decode stages
- Integrate Streamline Annotations into llama.cpp for fine-grained performance insights
- Capture and interpret profiling data with Streamline
- - Use Annotation Channels to analyze specific operators during token generation
+ - Analyze specific operators during token generation using Annotation Channels
- Evaluate multi-core and multi-thread execution of llama.cpp on Arm CPUs

prerequisites:
@@ -36,7 +32,6 @@ tools_software_languages:
- Arm Streamline
- C++
- llama.cpp
- - KleidiAI
- Profiling
operatingsystems:
- Linux
@@ -45,16 +40,24 @@ operatingsystems:
further_reading:
- resource:
title: llama.cpp project
- link: https://github.com/ggml-org/llama.cpp
- type: source code
+ link: https://github.com/ggml-org/llama.cpp
+ type: website
- resource:
- title: Qwen1_5-0_5b-chat-q4_0.gguf
- link: https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat-GGUF/blob/main/qwen1_5-0_5b-chat-q4_0.gguf
- type: LLM model
+ title: Build and run llama.cpp on Arm servers
+ link: /learning-paths/servers-and-cloud-computing/llama-cpu/
+ type: website
+ - resource:
+ title: Run a Large Language Model chatbot with PyTorch using KleidiAI
+ link: /learning-paths/servers-and-cloud-computing/pytorch-llama/
+ type: website
- resource:
title: Arm Streamline User Guide
link: https://developer.arm.com/documentation/101816/9-7
type: website
+ - resource:
+ title: KleidiAI project
+ link: https://github.com/ARM-software/kleidiai
+ type: website

### FIXED, DO NOT MODIFY
# ================================================================================