diff --git a/assets/contributors.csv b/assets/contributors.csv
index ef6f06ea90..6d9123056d 100644
--- a/assets/contributors.csv
+++ b/assets/contributors.csv
@@ -102,3 +102,4 @@ Ker Liu,,,,,
 Rui Chang,,,,,
 Alejandro Martinez Vicente,Arm,,,,
 Mohamad Najem,Arm,,,,
+Zenon Zhilong Xiu,Arm,,zenon-zhilong-xiu-491bb398,,
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md
new file mode 100644
index 0000000000..790f5c66bd
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md
@@ -0,0 +1,31 @@
+---
+title: Overview
+weight: 2
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Overview: Profiling LLMs on Arm CPUs with Streamline
+
+Large Language Models (LLMs) run efficiently on Arm CPUs.
+Frameworks such as [**llama.cpp**](https://github.com/ggml-org/llama.cpp) make it convenient to run LLMs, but they also come with a certain level of complexity.
+
+To analyze how they execute and to use profiling insights for optimization, you need both a basic understanding of transformer architectures and the right analysis tools.
+
+This learning path demonstrates how to use the **llama-cli** application from llama.cpp together with **Arm Streamline** to analyze the efficiency of LLM inference on Arm CPUs.
+
+In this guide you will learn how to:
+- Profile token generation at the **Prefill** and **Decode** stages
+- Profile execution of individual tensor nodes and operators
+- Profile LLM execution across **multiple threads and cores**
+
+You will run the **Qwen1_5-0_5b-chat-q4_0.gguf** model with llama-cli on **Arm64 Linux** and use Streamline for analysis.
+The same method can also be applied to **Arm64 Android** platforms.
+
+## Prerequisites
+Before starting this guide, you will need:
+- A basic understanding of llama.cpp
+- An understanding of transformer models
+- Familiarity with Streamline
+- An Arm Neoverse or Cortex-A hardware platform running Linux or Android to test the application
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md
new file mode 100644
index 0000000000..addcdd28b4
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md
@@ -0,0 +1,57 @@
+---
+title: Understand llama.cpp
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Understand llama.cpp
+
+**llama.cpp** is an open-source LLM framework implemented in C++ that supports both training and inference.
+This learning path focuses only on **inference on the CPU**.
+
+The **llama-cli** tool provides a command-line interface to run LLMs with the llama.cpp inference engine.
+It supports text generation, chat mode, and grammar-constrained output directly from the terminal.
+
+![text#center](images/llama_structure.png "Figure 1. 
llama-cli Flow")
+
+### What llama-cli does
+- Load and interpret LLMs in **.gguf** format
+- Build a **compute graph** based on the model structure
+  - The graph can be divided into subgraphs, each assigned to the most suitable backend device
+  - In this guide, all operators are executed on the **CPU backend**
+- Allocate memory for tensor nodes using the **graph planner**
+- Execute tensor nodes in the graph during the **graph_compute** stage, which traverses nodes and forwards work to backend devices
+
+Steps 2 to 4 above are wrapped inside the function **`llama_decode`**.
+During **Prefill** and **Decode**, `llama-cli` repeatedly calls `llama_decode` to generate tokens.
+The parameter **`llama_batch`** passed to `llama_decode` differs between stages, containing input tokens, their count, and their positions.
+
+### Components of llama.cpp
+The components of llama.cpp include:
+![text#center](images/llama_componetns.jpg "Figure 2. llama.cpp components")
+
+llama.cpp supports various backends, such as `CPU`, `GPU`, `CUDA`, and `OpenCL`.
+
+For the CPU backend, it provides an optimized `ggml-cpu` library (mainly utilizing CPU vector instructions).
+For Arm CPUs, the `ggml-cpu` library also offers an `aarch64` trait that leverages the new **I8MM** instructions for acceleration.
+The `ggml-cpu` library also integrates the Arm [KleidiAI](https://github.com/ARM-software/kleidiai) library as an additional trait.
+
+### Prefill and Decode in autoregressive LLMs
+Most autoregressive LLMs are decoder-only models.
+Here is a brief introduction to the Prefill and Decode stages of autoregressive LLMs.
+![text#center](images/llm_prefill_decode.jpg "Figure 3. Prefill and Decode stages")
+
+At the Prefill stage, multiple input tokens of the prompt are processed.
+It mainly performs GEMM (a matrix multiplied by another matrix) operations to generate the first output token.
+![text#center](images/transformer_prefill.jpg "Figure 4. Prefill stage")
+
+At the Decode stage, by utilizing the [KV cache](https://huggingface.co/blog/not-lain/kv-caching), it mainly performs GEMV (a vector multiplied by a matrix) operations to generate subsequent output tokens one by one.
+![text#center](images/transformer_decode.jpg "Figure 5. Decode stage")
+
+Therefore,
+- **Prefill** is **compute-bound**, dominated by large GEMM operations
+- **Decode** is **memory-bound**, dominated by KV cache access and GEMV operations
+
+This can be seen in the subsequent analysis with Streamline.
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md
new file mode 100644
index 0000000000..b1ed11a127
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md
@@ -0,0 +1,196 @@
+---
+title: Integrating Streamline Annotations into llama.cpp
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Integrating Streamline Annotations into llama.cpp
+
+To visualize token generation at the **Prefill** and **Decode** stages, we use **Streamline’s Annotation Marker** feature.
+This requires integrating annotation support into the **llama.cpp** project.
+More information about the Annotation Marker API can be found [here](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code?lang=en).
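+
+At a high level, the integration only needs a couple of calls from the annotation header. The fragment below is a minimal sketch of the pattern used in the following steps; the exact places where these calls go inside llama.cpp are shown in detail later in this section:
+
+```c
+#include "streamline_annotate.h"   // header from the gator annotate sources
+
+int main(void) {
+    ANNOTATE_SETUP;                                    // initialize annotation support once, early in the program
+    ANNOTATE_MARKER_STR("prompt processing starts");   // emit a named marker that appears on the Streamline timeline
+    // ... the real integration emits one marker per llama_decode() call, as shown below ...
+    return 0;
+}
+```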
+
+{{% notice Note %}}
+You can either build natively on an **Arm platform**, or cross-compile on another architecture using an Arm cross-compiler toolchain.
+{{% /notice %}}
+
+### Step 1: Build Streamline Annotation library
+
+Install [Arm DS](https://developer.arm.com/Tools%20and%20Software/Arm%20Development%20Studio) or [Arm Streamline](https://developer.arm.com/Tools%20and%20Software/Streamline%20Performance%20Analyzer) on your development machine first.
+
+The Streamline Annotation support code can be found in the installation directory, for example *"Arm\Development Studio 2024.1\sw\streamline\gator\annotate"*.
+
+For installation guidance, refer to the [Streamline installation guide](https://learn.arm.com/install-guides/streamline/).
+
+Clone the gator repository that matches your Streamline version and build the Annotation support library.
+
+The installation steps depend on your development machine.
+
+For an Arm native build, you can use the following instructions to install the packages.
+For other machines, you need to set up the cross-compilation environment by installing the [aarch64 GCC compiler toolchain](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads).
+Refer to this [guide](https://learn.arm.com/install-guides/gcc/cross/) for cross-compiler installation.
+
+{{< tabpane code=true >}}
+  {{< tab header="Arm Native Build" language="bash">}}
+  apt-get update
+  apt-get install ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
+  cd ~
+  git clone https://github.com/ARM-software/gator.git
+  cd gator
+  ./build-linux.sh
+
+  cd annotate
+  make
+  {{< /tab >}}
+  {{< tab header="Cross Compiler" language="bash">}}
+  apt-get update
+  apt-get install ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
+  cd ~
+  git clone https://github.com/ARM-software/gator.git
+
+  cd gator/annotate
+  make CROSS_COMPILE=/path/to/aarch64_linux_gcc_tool
+  {{< /tab >}}
+{{< /tabpane >}}
+
+Once complete, the static library **libstreamline_annotate.a** will be generated at `~/gator/annotate/libstreamline_annotate.a` and the header file at `gator/annotate/streamline_annotate.h`.
+
+### Step 2: Integrate Annotation Marker into llama.cpp
+
+Next, we need to download and build **llama.cpp** to run the LLM model.
+To make the following performance profiling content easier to follow, this Learning Path uses a specific release version of llama.cpp to ensure the steps and results remain consistent.
+
+Before building **llama.cpp**, create a directory `streamline_annotation` and copy the library `libstreamline_annotate.a` and the header file `streamline_annotate.h` into that folder.
+
+```bash
+cd ~
+wget https://github.com/ggml-org/llama.cpp/archive/refs/tags/b6202.tar.gz
+tar -xvzf b6202.tar.gz
+mv llama.cpp-b6202 llama.cpp
+cd ./llama.cpp
+mkdir streamline_annotation
+cp ~/gator/annotate/libstreamline_annotate.a ~/gator/annotate/streamline_annotate.h streamline_annotation
+```
+
+To link the `libstreamline_annotate.a` library when building llama-cli, add the following lines at the end of `llama.cpp/tools/main/CMakeLists.txt`.
+
+```makefile
+set(STREAMLINE_LIB_PATH "${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a")
+target_include_directories(llama-cli PRIVATE "${CMAKE_SOURCE_DIR}/streamline_annotation")
+target_link_libraries(llama-cli PRIVATE "${STREAMLINE_LIB_PATH}")
+```
+
+To add Annotation Markers to llama-cli, change the llama-cli code **llama.cpp/tools/main/main.cpp** by adding:
+
+```c
+#include "streamline_annotate.h"
+```
+
+After the call to common_init(), add the setup macro:
+
+```c
+    common_init();
+    //Add the Annotation setup code
+    ANNOTATE_SETUP;
+```
+
+Finally, add an annotation marker inside the main loop:
+
+```c
+    for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
+        int n_eval = (int) embd.size() - i;
+        if (n_eval > params.n_batch) {
+            n_eval = params.n_batch;
+        }
+
+        LOG_DBG("eval: %s\n", string_from(ctx, embd).c_str());
+
+        // Add annotation marker code for Streamline
+        {
+            char printf_buf[200];
+            sprintf(printf_buf, "past %d, n_eval %d", n_past, n_eval);
+            ANNOTATE_MARKER_STR(printf_buf);
+        }
+        // End of annotation marker
+
+        if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval))) {
+            LOG_ERR("%s : failed to eval\n", __func__);
+            return 1;
+        }
+```
+
+A string is added to the Annotation Marker to record the position of the input tokens and the number of tokens to be processed.
+
+### Step 3: Build llama-cli
+
+For convenience, llama-cli is **statically linked**.
+
+First, create a new directory `build` under the llama.cpp root directory and go into it.
+
+```bash
+cd ~/llama.cpp
+mkdir ./build && cd ./build
+```
+
+Then configure the project by running:
+
+{{< tabpane code=true >}}
+  {{< tab header="Arm Native Build" language="bash">}}
+  cmake .. \
+   -DGGML_NATIVE=ON \
+   -DLLAMA_F16C=OFF \
+   -DLLAMA_GEMM_ARM=ON \
+   -DBUILD_SHARED_LIBS=OFF \
+   -DCMAKE_EXE_LINKER_FLAGS="-static -g" \
+   -DGGML_OPENMP=OFF \
+   -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
+   -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
+   -DGGML_CPU_KLEIDIAI=ON \
+   -DLLAMA_BUILD_TESTS=OFF \
+   -DLLAMA_BUILD_EXAMPLES=ON \
+   -DLLAMA_CURL=OFF
+  {{< /tab >}}
+  {{< tab header="Cross Compiler" language="bash">}}
+  cmake .. \
+   -DCMAKE_SYSTEM_NAME=Linux \
+   -DCMAKE_SYSTEM_PROCESSOR=arm \
+   -DCMAKE_C_COMPILER=aarch64-none-linux-gnu-gcc \
+   -DCMAKE_CXX_COMPILER=aarch64-none-linux-gnu-g++ \
+   -DLLAMA_NATIVE=OFF \
+   -DLLAMA_F16C=OFF \
+   -DLLAMA_GEMM_ARM=ON \
+   -DBUILD_SHARED_LIBS=OFF \
+   -DCMAKE_EXE_LINKER_FLAGS="-static -g" \
+   -DGGML_OPENMP=OFF \
+   -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
+   -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
+   -DGGML_CPU_KLEIDIAI=ON \
+   -DLLAMA_BUILD_TESTS=OFF \
+   -DLLAMA_BUILD_EXAMPLES=ON \
+   -DLLAMA_CURL=OFF
+  {{< /tab >}}
+{{< /tabpane >}}
+
+For the cross-compile build, set `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER` to your cross-compiler path. Make sure that **-march** in `CMAKE_C_FLAGS` and `CMAKE_CXX_FLAGS` matches your Arm CPU hardware.
+
+In this learning path, we run llama-cli on an Arm CPU that supports **NEON Dotprod** and **I8MM** instructions.
+Therefore, we specify **armv8.2-a+dotprod+i8mm**.
+
+We also specify the **-static** and **-g** options:
+- **-static**: produces a statically linked executable, so it can run on different Arm64 Linux/Android environments without needing shared libraries.
+- **-g**: includes debug information, which makes source code and function-level profiling in Streamline much easier.
+Now you can build the project by running:
+
+```bash
+cd ~/llama.cpp/build
+cmake --build ./ --config Release
+```
+
+After the build completes, the llama-cli executable will be generated in the **~/llama.cpp/build/bin/** directory.
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md
new file mode 100644
index 0000000000..6af4230564
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md
@@ -0,0 +1,144 @@
+---
+title: Running llama-cli and Analyzing Data with Streamline
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Running llama-cli and Analyzing Data with Streamline
+
+After successfully building **llama-cli**, the next step is to set up the runtime environment on your Arm64 platform.
+
+### Setup gatord
+
+Depending on how you built llama.cpp:
+
+- **Cross Build:**
+  - Copy the `llama-cli` executable to your Arm64 target.
+  - Also copy the `gatord` binary from the Arm DS or Streamline installation:
+    - Linux: `Arm\Development Studio 2024.1\sw\streamline\bin\linux\arm64`
+    - Android: `Arm\Development Studio 2024.1\sw\streamline\bin\android\arm64`
+
+- **Native Build:**
+  - Use the `llama-cli` from your local build and the `gatord` you compiled earlier (`~/gator/build-native-gcc-rel/gatord`).
+
+### Download a lightweight model
+
+Then, download the LLM model onto the target platform.
+For demonstration, we use the lightweight **Qwen1_5-0_5b-chat-q4_0.gguf** model, which can run on both Arm servers and resource-constrained edge devices:
+
+```bash
+cd ~
+wget https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat-GGUF/resolve/main/qwen1_5-0_5b-chat-q4_0.gguf
+```
+
+### Run gatord
+
+Start the gator daemon on your Arm64 target:
+```bash
+./gatord
+```
+
+You should see messages similar to the following:
+
+``` bash
+Streamline Data Recorder v9.4.0 (Build 9b1e8f8)
+Copyright (c) 2010-2024 Arm Limited. All rights reserved.
+Gator ready
+```
+
+### Connect Streamline
+
+Next, use Streamline to set up the collection of CPU performance data.
+
+If you’re accessing the Arm server via **SSH**, you need to forward port `8080` from the host platform to your local machine.
+``` bash
+ssh user@arm-server -L 8080:localhost:8080 -N
+```
+Append `-L 8080:localhost:8080 -N` to your original SSH command to enable local port forwarding. This allows Arm Streamline on your local machine to connect to the Arm server.
+
+Then launch the Streamline application on your host machine and connect to the gatord running on your Arm64 target with either a TCP or ADB connection.
+You can select the PMU events to be monitored at this point.
+
+{{% notice Note %}}
+If you are using SSH port forwarding, you need to select TCP `127.0.0.1:8080`.
+{{% /notice %}}
+
+![text#center](images/streamline_capture.png "Figure 6. Streamline Start Capture ")
+
+Set the path of the llama-cli executable for Streamline so that its debug info can be used for analysis.
+![text#center](images/streamline_capture_image.png "Figure 7. Streamline image path")
+
+Click the `Start Capture` button in Streamline to start collecting data from the Arm64 target.
+
+{{% notice Note %}}
+This guide does not cover how to use Streamline in detail. If you encounter any issues while setting up gatord or Streamline, refer to this [user guide](https://developer.arm.com/documentation/101816/latest/?lang=en).
+{{% /notice %}}
+
+### Run llama-cli
+
+Now, run the llama-cli executable as below:
+
+``` bash
+cd ~/llama.cpp/build/bin
+./llama-cli -m ~/qwen1_5-0_5b-chat-q4_0.gguf -p "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n<|im_start|>user\nTell me a story about a fox and a crow? Please do not tell the traditional story in Aesop's fables. Please tell me a positive story about friendship and love. The story should have no more than 400 words<|im_end|>\n<|im_start|>assistant\n" -st -t 1
+```
+
+After a while, stop the Streamline data collection by clicking the ‘Stop’ button in Streamline. The Streamline tool on your host PC will then start the data analysis.
+
+### Analyze the data with Streamline
+
+From the timeline view of Streamline, we can see some Annotation Markers. Since we added an Annotation Marker before the llama_decode function, each Annotation Marker marks the start time of a token generation.
+![text#center](images/annotation_marker_1.png "Figure 8. Annotation Marker")
+
+The string in an Annotation Marker is shown when you click it. For example:
+![text#center](images/annotation_marker_2.png "Figure 9. Annotation String")
+
+The number after `past` indicates the position of the input tokens, and the number after `n_eval` indicates the number of tokens to be processed this time.
+
+As shown in the timeline view below, with the help of Annotation Markers, we can clearly identify the Prefill and Decode stages.
+![text#center](images/annotation_marker_prefill.png "Figure 10. Annotation Marker at Prefill and Decode stage")
+
+By checking the Annotation Marker string, the first token generation at the Prefill stage has `past 0, n_eval 78`, which means that the position of the input tokens starts at 0 and there are 78 input tokens to be processed.
+
+The first token, generated at the Prefill stage, takes more time: all 78 input tokens have to be processed, which requires a large number of GEMM operations. At the Decode stage, tokens are generated one by one at a mostly constant rate, and each token takes less time than the first one thanks to the KV cache; the Decode stage mainly performs GEMV operations.
+
+We can investigate further with the PMU event counters captured by Streamline. At the Prefill stage, the amount of computation, indicated by the PMU event counters for Advanced SIMD (NEON), floating-point, and integer data processing instructions, is large. However, memory access is relatively low. In particular, the number of L3 cache refills/misses is much lower than at the Decode stage.
+
+At the Decode stage, there is relatively less computation (since each token takes less time), but the number of L3 cache refills/misses is much higher.
+We can also monitor the Backend Stall Cycles and Backend Stall Cycles due to Memory stall PMU events:
+![text#center](images/annotation_pmu_stall.png "Figure 11. Backend stall PMU event")
+
+At the Prefill stage, Backend Stall Cycles due to Memory stall are only about 10% of the total Backend Stall Cycles. At the Decode stage, however, they are around 50% of the total Backend Stall Cycles.
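+
+To relate these percentages to the raw counters, the fraction is simply the memory-related backend stalls divided by the total backend stalls. The short sketch below shows the calculation; the counter totals are illustrative placeholders, not values measured in this Learning Path:
+
+```c
+#include <stdio.h>
+
+int main(void) {
+    // Placeholder totals for one capture window exported from Streamline.
+    const unsigned long long backend_stall_cycles     = 1000000000ULL; // Backend Stall Cycles
+    const unsigned long long backend_stall_cycles_mem =  520000000ULL; // Backend Stall Cycles due to Memory stall
+
+    // Fraction of backend stalls caused by memory: roughly 10% at Prefill and 50% at Decode in the capture above.
+    const double mem_stall_pct =
+        100.0 * (double) backend_stall_cycles_mem / (double) backend_stall_cycles;
+
+    printf("Backend stalls due to memory: %.1f%%\n", mem_stall_pct);
+    return 0;
+}
+```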
+All those PMU event counters indicate that the Prefill stage is compute-bound and the Decode stage is memory-bound.
+
+Now, let us further profile the code execution with Streamline. In the ‘Call Paths’ view of Streamline, we can see the percentage of running time of functions, organized in the form of a call stack.
+![text#center](images/annotation_prefill_call_stack.png "Figure 12. Call stack")
+
+In the ‘Functions’ view of Streamline, we can see the overall percentage of running time of functions.
+![text#center](images/annotation_prefill_functions.png "Figure 13. Functions view")
+
+As we can see, the function graph_compute takes the largest portion of the running time, which shows that GEMM and GEMV operations take most of the time. With the Qwen1_5-0_5b-chat-q4_0 model:
+- The computation (GEMM and GEMV) of the Q, K, V vectors and most of the FFN layers: their weights use the Q4_0 data type and the input activations use the FP32 data type. The computation is forwarded to the KleidiAI trait by *ggml_cpu_extra_compute_forward*. KleidiAI ukernels implemented with NEON Dotprod and I8MM vector instructions are used to accelerate the computation.
+  - At the Prefill stage, the *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* KleidiAI ukernel is used for GEMM (matrix multiply) operators. It takes advantage of the NEON I8MM instruction. Since the Prefill stage takes only a small percentage of the whole run, the percentage of this function is small, as shown in the figures above. However, if we focus on the Prefill stage only, using the ‘Samplings’ view in the Timeline, we can see that *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* takes the largest portion of the Prefill stage.
+  ![text#center](images/prefill_only.png "Figure 14. Prefill only view")
+
+  - At the Decode stage, the *kai_run_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod* KleidiAI ukernel is used for GEMV operators. It takes advantage of the NEON Dotprod instruction. If we focus on the Decode stage only, we can see that this function takes the second largest portion.
+  ![text#center](images/decode_only.png "Figure 15. Decode only view")
+
+- The result_output linear layer in the Qwen1_5-0_5b-chat-q4_0 model has weights with the Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation cannot be handled by KleidiAI yet, so it is handled by the ggml_vec_dot_q6_K_q8_K function in the ggml-cpu library.
+- The tensor nodes for the computation of Multi-Head Attention are represented as three-dimensional matrices with the FP16 data type (the KV cache also holds FP16 values); they are computed by the ggml_vec_dot_f16 function in the ggml-cpu library.
+- The computation of the RoPE, Softmax, and RMSNorm layers does not take a significant portion of the running time.
+
+### Analyzing results
+- Annotation Markers show token generation start points.
+- Prefill stage: past 0, n_eval 78 → compute-bound (large GEMM).
+- Decode stage: one token at a time → memory-bound (KV cache, GEMV).
+- PMU events: SIMD/FP/INT instructions high in Prefill, L3 cache misses high in Decode.
+- Backend stalls: ~10% memory stalls in Prefill vs ~50% in Decode.
+
+| Stage   | Main Ops | Bottleneck     | Observations                                      |
+|---------|----------|----------------|---------------------------------------------------|
+| Prefill | GEMM     | Compute-bound  | Heavy SIMD/FP/INT ops, few cache refills          |
+| Decode  | GEMV     | Memory-bound   | Light compute, many L3 cache misses, ~50% stalls  |
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/5_operator_deepdive.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/5_operator_deepdive.md
new file mode 100644
index 0000000000..295169777d
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/5_operator_deepdive.md
@@ -0,0 +1,172 @@
+---
+title: Deep Dive Into Individual Operator
+weight: 6
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Deep Dive Into Individual Operator
+
+This section shows how to use **Streamline Annotation Channels** to analyze the execution time of each node in the compute graph. More details on Annotation Channels can be found [here](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code/User-space-annotations/Group-and-Channel-annotations?lang=en).
+
+## Integrating Annotation Channels into llama.cpp
+
+In llama.cpp, tensor nodes are executed in the CPU backend inside the function `ggml_graph_compute_thread` (`~/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c`).
+
+In our selected release tag, the loop over tensor nodes looks like this (around line 2862):
+
+```c
+for (int node_n = 0; node_n < cgraph->n_nodes && atomic_load_explicit(&tp->abort, memory_order_relaxed) != node_n; node_n++) {
+    struct ggml_tensor * node = cgraph->nodes[node_n];
+
+    ggml_compute_forward(&params, node);
+```
+
+To monitor operator execution time, let's create an annotation channel for each type of operator (such as `GGML_OP_MUL_MAT`, `GGML_OP_SOFTMAX`, `GGML_OP_ROPE`, and `GGML_OP_MUL`).
+
+Since `GGML_OP_MUL_MAT`, which includes both GEMM and GEMV operations, takes a significant portion of the execution time, two dedicated annotation channels are created for GEMM and GEMV respectively.
+
+The annotation starts at the beginning of `ggml_compute_forward` and stops at the end, so that the computation of each tensor node/operator can be monitored.
+
+### Step 1: Add Annotation Code
+
+First, add the Streamline annotation header file to ggml-cpu.c:
+
+```c
+#include "streamline_annotate.h"
+```
+
+Edit the `ggml_graph_compute_thread` function in `~/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c`.
+
+Add the following code before and after the call to **ggml_compute_forward(&params, node)**.
+
+Your code should look like this:
+
+```c
+
+for (int node_n = 0; node_n < cgraph->n_nodes && atomic_load_explicit(&tp->abort, memory_order_relaxed) != node_n; node_n++) {
+    struct ggml_tensor * node = cgraph->nodes[node_n];
+
+    // +++ Start Annotation Channel for Streamline
+    {
+        char printf_buf[256];
+        sprintf(printf_buf," %s, %s", node->name, ggml_get_name(node));
+
+        if(node->op==GGML_OP_MUL_MAT ) {
+            if (node->src[1]->ne[1] == 1)
+                ANNOTATE_CHANNEL(0, printf_buf); //It is GEMV
+            else
+                ANNOTATE_CHANNEL(1, printf_buf); //It is GEMM
+        }
+        else
+            ANNOTATE_CHANNEL((node->op)+2, printf_buf);
+    }
+    // --- Start Annotation Channel for Streamline
+
+    ggml_compute_forward(&params, node);
+
+    // +++ End Annotation Channel for Streamline
+    {
+        if(node->op==GGML_OP_MUL_MAT) {
+            if (node->src[1]->ne[1] == 1)
+                ANNOTATE_CHANNEL_END(0);
+            else
+                ANNOTATE_CHANNEL_END(1);
+        }
+        else
+            ANNOTATE_CHANNEL_END((node->op)+2);
+    }
+    // --- End Annotation Channel for Streamline
+```
+
+### Step 2: Add Tensor Shape Info (Optional)
+
+You can also add information about the shape and size of the source tensors by replacing the sprintf call as follows:
+
+```c
+    sprintf(printf_buf,"%s %s %d_%d_%d %d_%d_%d", node->name, ggml_get_name(node), \
+                        node->src[0]? node->src[0]->ne[0] : 0, \
+                        node->src[0]? node->src[0]->ne[1] : 0 , \
+                        node->src[0]? node->src[0]->ne[2] : 0 ,\
+                        node->src[1]? node->src[1]->ne[0] : 0, \
+                        node->src[1]? node->src[1]->ne[1] : 0, \
+                        node->src[1]? node->src[1]->ne[2] : 0 \
+                        );
+```
+
+### Step 3: Update CMakeLists
+
+Edit `~/llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt` to include the Streamline Annotation header file and the libstreamline_annotate.a library. Add the following lines inside the ggml_add_cpu_backend_variant_impl function:
+
+```bash
+    set(STREAMLINE_LIB_PATH "${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a")
+    target_include_directories( ${GGML_CPU_NAME} PRIVATE "${CMAKE_SOURCE_DIR}/streamline_annotation")
+    target_link_libraries(${GGML_CPU_NAME} PRIVATE ${STREAMLINE_LIB_PATH} )
+```
+
+Then, build `llama-cli` again.
+
+### Analyze the data with Streamline
+
+Run llama-cli and collect profiling data with Streamline as in the previous section.
+
+String annotations are displayed as text overlays inside the relevant channels in the details panel of the `Timeline` view.
+
+For example, see Channel 0 in the following screenshot.
+![text#center](images/deep_dive_1.png "Figure 16. Annotation Channel")
+
+The letter A is displayed in the process list to indicate the presence of annotations.
+String annotations are also displayed in the Message column in the Log view.
+![text#center](images/deep_dive_2.png "Figure 17. Annotation log")
+
+
+### View of individual operators at Prefill stage
+
+The annotation channel view at the Prefill stage is shown below:
+![text#center](images/prefill_annotation_channel.png "Figure 18. Annotation Channel at Prefill stage")
+
+Note that the operator names in the screenshot above were edited manually. If you want Streamline to show the operator name instead of the channel number, ANNOTATE_NAME_CHANNEL calls can be added to the ggml_graph_compute_thread function.
+This annotation macro is defined as:
+
+```c
+ANNOTATE_NAME_CHANNEL(channel, group, string)
+```
+
+For example:
+```c
+    ANNOTATE_NAME_CHANNEL(0, 0, "MUL_MAT_GEMV");
+    ANNOTATE_NAME_CHANNEL(1, 0, "MUL_MAT_GEMM");
+```
+
+The code above sets the name of annotation channel 0 to **MUL_MAT_GEMV** and channel 1 to **MUL_MAT_GEMM**.
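+
+If you also want readable names for the per-operator channels created with `(node->op) + 2`, the same macro can be used for them. The snippet below is only a sketch: the set of operators you name is up to you, and it assumes the ggml operator enum values used in the integration above. Call it once, for example before the node loop in `ggml_graph_compute_thread`:
+
+```c
+// Illustrative only: give names to a few of the channels used above.
+// Channels 0 and 1 are the dedicated GEMV/GEMM channels; the others follow
+// the (operator enum value + 2) numbering used by the integration code.
+ANNOTATE_NAME_CHANNEL(0, 0, "MUL_MAT_GEMV");
+ANNOTATE_NAME_CHANNEL(1, 0, "MUL_MAT_GEMM");
+ANNOTATE_NAME_CHANNEL(GGML_OP_MUL  + 2, 0, "MUL");
+ANNOTATE_NAME_CHANNEL(GGML_OP_ROPE + 2, 0, "ROPE");
+```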
+By zooming into the timeline view, you can see more details: +![text#center](images/prefill_annotation_channel_2.png "Figure 19. Annotation Channel at Prefill stage") + + +When moving the cursor over an annotation channel, Streamline shows: +- The tensor node name +- The operator type +- The shape and size of the source tensors +![text#center](images/prefill_annotation_channel_3.png "Figure 20. Annotation Channel Zoom in") + +In the example above, we see a `GGML_OP_MUL_MAT` operator for the **FFN_UP** node. +Its source tensors have shapes **[1024, 2816]** and **[1024, 68]**. + +This view makes it clear that: +- The majority of time at the **Prefill stage** is spent on **MUL_MAT GEMM** operations in the attention and FFN layers. +- There is also a large **MUL_MAT GEMV** operation in the `result_output` linear layer. +- Other operators, such as **MUL, Softmax, Norm, RoPE**, consume only a small portion of execution time. + +### View of individual operators at Decode stage +The annotation channel view for the **Decode stage** is shown below: +![text#center](images/decode_annotation_channel.png "Figure 21. Annotation Channel at Decode stage") + +Zooming in provides additional details: +![text#center](images/decode_annotation_channel_2.png "Figure 22. Annotation Channel string") + +From this view, we observe: +- The majority of time in **Decode** is spent on **MUL_MAT GEMV** operations in the attention and FFN layers. +- In contrast to Prefill, **no GEMM operations** are executed in these layers. +- The `result_output` linear layer has a **large GEMV operation**, which takes an even larger proportion of runtime in Decode. +- This is expected, since each token generation at Decode is shorter due to KV cache reuse, making the result_output layer more dominant. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/6_multithread_analyze.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/6_multithread_analyze.md new file mode 100644 index 0000000000..908766cf8c --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/6_multithread_analyze.md @@ -0,0 +1,64 @@ +--- +title: Analyzing Multi-Core/Multi-Thread Performance +weight: 7 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Analyzing Multi-Core/Multi-Thread Performance + +The CPU backend in **llama.cpp** uses multiple cores and threads to accelerate operator execution. +It creates a **threadpool**, where: +- The number of threads is controlled by the `-t` option +- If `-t` is not specified, it defaults to the number of CPU cores in the system + +The entrypoint for secondary threads is the function **`ggml_graph_compute_secondary_thread`**. + +When computing a tensor node/operator with a large workload, llama.cpp splits the computation into multiple parts and distributes them across threads. + +### Example: MUL_MAT Operator + +For the **MUL_MAT** operator, the output matrix **C** can be divided across threads: +![text#center](images/multi_thread.jpg "Figure 23. Multi-Thread") + +In this example, four threads each compute one quarter of matrix C. + +### Observing Thread Execution with Streamline + +The execution of multiple threads on CPU cores can be observed using **Core Map** and **Cluster Map** modes in the Streamline Timeline. +Learn more about these modes [here](https://developer.arm.com/documentation/101816/9-7/Analyze-your-capture/Viewing-application-activity/Core-Map-and-Cluster-Map-modes). 
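+
+To relate what you see in the Core Map back to the MUL_MAT splitting described above, the sketch below shows, in simplified form, how the output rows of matrix C can be partitioned between worker threads. It is illustrative only, not the actual llama.cpp scheduling code; `ith` and `nth` stand for the worker index and the total number of workers that ggml passes to each operator:
+
+```c
+// Simplified sketch of the split shown in Figure 23: each worker thread
+// computes a contiguous block of rows of C = A (MxK) * B (KxN).
+static void matmul_rows_for_thread(const float *A, const float *B, float *C,
+                                   int M, int N, int K, int ith, int nth) {
+    const int rows_per_thread = (M + nth - 1) / nth;          // ceil(M / nth)
+    const int row_start       = ith * rows_per_thread;
+    const int row_limit       = row_start + rows_per_thread;
+    const int row_end         = row_limit < M ? row_limit : M;
+
+    for (int i = row_start; i < row_end; i++) {               // rows owned by this worker
+        for (int j = 0; j < N; j++) {
+            float sum = 0.0f;
+            for (int k = 0; k < K; k++) {
+                sum += A[i * K + k] * B[k * N + j];
+            }
+            C[i * N + j] = sum;
+        }
+    }
+}
+```
+
+With two threads (`-t 2`), each worker handles roughly half of the rows, which is why both cores in the capture below are busy with the same node at the same time.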
+ +Run llama-cli with `-t 2 -C 0x3` to specify two threads and thread affinity as CPU core0 and core1, +* -t 2 → creates two worker threads +* -C 0x3 → sets CPU affinity to core0 and core1 + +```bash +./llama-cli -m qwen1_5-0_5b-chat-q4_0.gguf -p "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n<|im_start|>user\nTell me a story about a fox and a crow? Please do not tell the traditional story in Aesop's fables. Please tell me a positive story about friendship and love. The story should have no more than 400 words<|im_end|>\n<|im_start|>assistant\n" -st -t 2 -C 0x3 +``` + +### Streamline Results + +Collect profiling data with **Streamline**, then select **Core Map** and **Cluster Map** modes in the Timeline view. + +![text#center](images/multi_thread_core_map.png "Figure 24. Multi-Thread") + +In the screenshot above: +- Two threads are created +- They are running on **CPU core0** and **CPU core1**, respectively + +In addition, you can use the **Annotation Channel** view to analyze operator execution on a per-thread basis. +Each thread generates its own annotation channel independently. + +![text#center](images/multi_thread_annotation_channel.png "Figure 25. Multi-Thread") + +In the screenshot above, at the highlighted time: +- Both threads are executing the **same node** +- In this case, the node is the **result_output linear layer** + + +Congratulations — you have completed the walkthrough of profiling an LLM model on an Arm CPU. + +By combining **Arm Streamline** with a solid understanding of llama.cpp, you can visualize model execution, analyze code efficiency, and identify opportunities for optimization. + +Keep in mind that adding annotation code to llama.cpp and gatord may introduce a small performance overhead, so profiling results should be interpreted with this in mind. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Analyzing_token_generation_at_Prefill_and_Decode_stage.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Analyzing_token_generation_at_Prefill_and_Decode_stage.md deleted file mode 100644 index 3773982b6f..0000000000 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Analyzing_token_generation_at_Prefill_and_Decode_stage.md +++ /dev/null @@ -1,204 +0,0 @@ ---- -title: Analyze token generation at Prefill and Decode stage -weight: 4 - -### FIXED, DO NOT MODIFY -layout: learningpathall ---- - -# Analyze token generation at Prefill and Decode stage -To get a visible token generation view at Prefill and Decode stage, Annotation Marker feature of Streamline is used and the Annotation Marker generation code is integrated to the llama.cpp project. -You can find more information about Annotation Marker feature here, https://developer.arm.com/documentation/101816/9-7/Annotate-your-code?lang=en. - -## Steps of llama.cpp integration and Streamline setup - -### Step 1: Build Streamline Annotation library -Install ArmDS or Arm Streamline on your host PC first. -You can get Streamline Annotation support code in the installation directory such as *"Arm\Development Studio 2024.1\sw\streamline\gator\annotate"*. -You also can get the Annotation support code here, https://github.com/ARM-software/gator/tree/main , please download the right code that matches the version of Streamline tool on your host PC. 
- -Then you can build the Streamline Annotation Library by running -```bash -make CROSS_COMPILE=/path/to/aarch64_linux_gcc_tool -``` - -for example, -```bash -make CROSS_COMPILE=./Work/arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu- -``` -You can get the aarch64 gcc compiler toolchain here, https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads . - -The static linked library, libstreamline_annotate.a, will be produced. - -### Step 2: Integrate Annotation Marker code to llama.cpp -Download llama.cpp code from https://github.com/ggml-org/llama.cpp/archive/refs/tags/b6202.tar.gz -Go to llama.cpp root directory and create a directory ‘streamline_annotation’ there. -```bash -cd ./llama.cpp -mkdir streamline_annotation -``` - -Copy the library ‘libstreamline_annotate.a’ and the header file ‘streamline_annotate.h’ from Step 1 to the directory ‘streamline_annotation’. - -To link 'libstreamline_annotate.a' library when building llama-cli, change *llama.cpp\CMakeLists.txt* by adding following lines, - -```makefile -set(STREAMLINE_LIB_PATH ${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a) -target_include_directories(llama-cli PRIVATE ${CMAKE_SOURCE_DIR}/streamline_annotation) -target_link_libraries(${TARGET} PRIVATE ${STREAMLINE_LIB_PATH} ) -``` - -To add Annotation Markers to llama-cli, change the llama-cli code *llama.cpp/tools/main/main.cpp* by adding -```c -#include "streamline_annotate.h" -``` -and the Annotation Marker code in the 'main' function, - -Firstly, add the Streamline Annotation setup code after *common_init*, -```c - common_init(); - - //Add the Annotation setup code - ANNOTATE_SETUP; - -``` - - -then add the Annotation Marker generation code here, - - -```c - for (int i = 0; i < (int) embd.size(); i += params.n_batch) { - int n_eval = (int) embd.size() - i; - if (n_eval > params.n_batch) { - n_eval = params.n_batch; - } - - LOG_DBG("eval: %s\n", string_from(ctx, embd).c_str()); - - // Add annotation marker code for Streamline - { - char printf_buf[200]; - sprintf(printf_buf, "past %d, n_eval %d", n_past,n_eval ); - ANNOTATE_MARKER_STR(printf_buf); - } - // End of annotation marker - - if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval))) { - LOG_ERR("%s : failed to eval\n", __func__); - return 1; - } -``` - -A string is added to the Annotation Marker to record the position of input tokens and number of tokens to be processed. - -### Step 3: Build llama-cli executable -For convenience, llama-cli is static linked. - -Firstly, create a new directory ‘build’ understand llama.cpp root directory and go into it. -```bash -mkdir ./build & cd ./build -``` -Then configure the project by running -```bash -cmake .. -DCMAKE_SYSTEM_NAME=Linux -DCMAKE_SYSTEM_PROCESSOR=arm -DCMAKE_C_COMPILER=aarch64-none-linux-gnu-gcc -DCMAKE_CXX_COMPILER=aarch64-none-linux-gnu-g++ -DLLAMA_NATIVE=OFF -DLLAMA_F16C=OFF -DLLAMA_GEMM_ARM=ON -DBUILD_SHARED_LIBS=OFF -DCMAKE_EXE_LINKER_FLAGS="-static -g" -DGGML_OPENMP=OFF -DCMAKE_C_FLAGS="-march=armv8.2-a+i8mm+dotprod -g" -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" -DGGML_CPU_KLEIDIAI=ON -DGGML_OPENMP=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_CURL=OFF -``` - -Set CMAKE_C_COMPILER and DCMAKE_CXX_COMPILER to your cross compiler path. Make sure that “-march” in DCMAKE_C_FLAGS and CMAKE_CXX_FLAGS matches your Arm CPU hardware. 
- -In this guide, we run llama-cli on an Arm CPU which supports NEON Dotprod and I8MM instructions, so ‘-march’ is specified as ‘armv8.2-a+dotprod+i8mm’. We also specify ‘-static’ and ‘-g’ options so that the llama-cli executable is static linked and with debug info. This makes source code/function level profiling easier and the llama-cli executable runnable on various version of Arm64 Linux/Android. - -Now, we can build the project by running -```bash -cmake --build ./ --config Release -``` - -After the building process, you should find the llama-cli executable in *./build/bin/* directory. - -### Step 4: Run llama-cli and analyze the data with Streamline -Copy following files to your Arm64 platform, -* llama-cli executable -* the ‘gatord’ executable in Arm DS or Streamline installation folder, such as *Arm\Development Studio 2024.1\sw\streamline\bin\linux\arm64* for Linux and *Arm\Development Studio 2024.1\sw\streamline\bin\android\arm64* for Android -* the LLM model, Qwen1_5-0_5b-chat-q4_0.gguf - -Then run the gatord on your Arm64 target -```bash -./gatord -``` -You should see similar messages as below, - -``` bash -Streamline Data Recorder v9.4.0 (Build 9b1e8f8) -Copyright (c) 2010-2024 Arm Limited. All rights reserved. -Gator ready -``` - -Then launch the Streamline application on your host PC, connect to the gatord running on your Arm64 target with either TCP or ADB connection. You can select PMU events to be monitored at this point. - -![text#center](images/streamline_capture.png "Figure 6. Streamline Start Capture ") - -Set the path of llama-cli executable for Streamline so that its debug info can be used for analysis. - -![text#center](images/streamline_capture_image.png "Figure 7. Streamline image path") - -Click ‘Start Capture’ button on Streamline to start collecting data from the Arm64 target. - -*Note: This guide is not intended to introduce how to use Streamline, if you encounter any issue during setting up gatord or Streamline, please seek for help from Arm support.* - -Now, run the llama-cli executable as below, - -``` bash -./llama-cli -m qwen1_5-0_5b-chat-q4_0.gguf -p "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n<|im_start|>user\nTell me a story about a fox and a crow? Please do not tell the traditional story in Aesop's fables. Please tell me a positive story about friendship and love. The story should have no more than 400 words<|im_end|>\n<|im_start|>assistant\n" -st -t 1 -``` - -After a while, you can stop the Streamline data collection by clicking ‘Stop’ button on Streamline. Then Streamline tool on your host PC will start the data analysis. - -## Analyze the data with Streamline -From the timeline view of Streamline, we can see some Annotation Markers. Since we add an Annotation Marker before llama_decode function, each Annotation Marker marks the start time of a token generation. - -![text#center](images/annotation_marker_1.png "Figure 8. Annotation Marker") - -The string in the Annotation Marker can be shown when clicking those Annotation Markers. For example, - -![text#center](images/annotation_marker_2.png "Figure 9. Annotation String") - -The number after ‘past’ indicates the position of input tokens, the number after ‘n_eval’ indicates the number of tokens to be processed this time. - -As shown in the timeline view below, with help of Annotation Markers, we can clearly identify the Prefill stage and Decode stage. - -![text#center](images/annotation_marker_prefill.png "Figure 10. 
Annotation Marker at Prefill and Decode stage") - -By checking the string of Annotation Marker, the first token generation at Prefill stage has 'past 0, n_eval 78', which means that the position of input tokens starts at 0 and there are 78 input tokens to be processed. -We can see that the first token generated at Prefill stage takes more time, since 78 input tokens have to be processed at Prefill stage, it performs lots of GEMM operations. At Decode stage, tokens are generated one by one at mostly equal speed, one token takes less time than that of Prefill stage, thanks to the effect of KV cache. At Decode stage, it performs many GEMV operations. - -We can further investigate it with PMU event counters that are captured by Streamline. At Prefill stage, the amount of computation, which are indicated by PMU event counters that count number of Advanced SIMD (NEON), Floating point, Integer data processing instruction, is large. However, the memory access is relatively low. Especially, the number of L3 cache refill/miss is much lower than that of Decode stage. - -At Decode stage, the amount of computation is relatively less (since the time of each token is less), but the number of L3 cache refill/miss goes much higher. -By monitoring other PMU events, Backend Stall Cycles and Backend Stall Cycles due to Memory stall, - -![text#center](images/annotation_pmu_stall.png "Figure 11. Backend stall PMU event") - -We can see that at Prefill stage, Backend Stall Cycles due to Memory stall are only about 10% of total Backend Stall Cycles. However, at Decode stage, Backend Stall Cycles due to Memory stall are around 50% of total Backend Stall Cycles. -All those PMU event counters indicate that it is compute-bound at Prefill stage and memory-bound at Decode stage. - -Now, let us further profile the code execution with Streamline. In the ‘Call Paths’ view of Streamline, we can see the percentage of running time of functions that are organized in form of call stack. - -![text#center](images/annotation_prefill_call_stack.png "Figure 12. Call stack") - -In the ‘Functions’ view of Streamline, we can see the overall percentage of running time of functions. - -![text#center](images/annotation_prefill_functions.png "Figure 13. Functions view") - -As we can see, the function, graph_compute, takes the largest portion of the running time. It shows that large amounts of GEMM and GEMV operations take most of the time. With Qwen1_5-0_5b-chat-q4_0 model, -* The computation (GEMM and GEMV) of Q, K, V vectors and most of FFN layers: their weights are with Q4_0 data type and the input activations are with FP32 data type. The computation is forwarded to KleidiAI trait by *ggml_cpu_extra_compute_forward*. KleidiAI ukernels implemented with NEON Dotprod and I8MM vector instructions are used to accelerate the computation. - - At Prefill stage, *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* KleidiAI ukernel is used for GEMM (Matrix Multiply) operators. It takes the advantage of NEON I8MM instruction. Since Prefill stage only takes small percentage of the whole time, the percentage of this function is small as shown in figures above. However, if we focus on Prefill stage only, with ‘Samplings’ view in Timeline. We can see *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* takes the largest portion of the whole Prefill stage. - - ![text#center](images/Prefill_only.png "Figure 14. 
Prefill only view") - - - At Decode stage, *kai_run_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod* KleidiAI ukernel is used for GEMV operators. It takes advantage of NEON Dotprod instruction. If we focus on Decode stage only, we can see this function takes the second largest portion. - - ![text#center](images/Decode_only.png "Figure 15. Decode only view") - -* There is a result_output linear layer in Qwen1_5-0_5b-chat-q4_0 model, the wights are with Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation cannot be handled by KleidiAI yet, it is handled by the ggml_vec_dot_q6_K_q8_K function in ggml-cpu library. -* The tensor nodes for computation of Multi-Head attention are presented as three-dimension matrices with FP16 data type (KV cache also holds FP16 values), they are computed by ggml_vec_dot_f16 function in ggml-cpu library. -* The computation of RoPE, Softmax, RMSNorm layers does not take significant portion of the running time. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Conclusion.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Conclusion.md deleted file mode 100644 index 55adcb95bc..0000000000 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Conclusion.md +++ /dev/null @@ -1,13 +0,0 @@ ---- -title: Conclusion -weight: 7 - -### FIXED, DO NOT MODIFY -layout: learningpathall ---- - -# Conclusion -By leveraging the Streamline tool together with a good understanding of the llama.cpp code, the execution process of the LLM model can be visualized, which helps analyze code efficiency and investigate potential optimization. - -Note that additional annotation code in llama.cpp and gatord might somehow affect the performance. - diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Deep_dive.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Deep_dive.md deleted file mode 100644 index 3802be4996..0000000000 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Deep_dive.md +++ /dev/null @@ -1,132 +0,0 @@ ---- -title: Deep dive into individual operator -weight: 5 - -### FIXED, DO NOT MODIFY -layout: learningpathall ---- - -# Deep dive into individual operator -This session provides a guide on how to use the Streamline Annotation Channel feature to analyze execution time of each node in the compute graph. 
-More information about Streamline Annotation Channel can be found here https://developer.arm.com/documentation/101816/9-7/Annotate-your-code/User-space-annotations/Group-and-Channel-annotations?lang=en - -## Integrate Annotation Channel code to llama.cpp -In llama.cpp project, tensor nodes in compute graph are computed by the function ggml_graph_compute_thread in CPU backend, *llama.cpp\ggml\src\ggml-cpu\ggml-cpu.c* -```c -for (int node_n = 0; node_n < cgraph->n_nodes && atomic_load_explicit(&tp->abort, memory_order_relaxed) != node_n; node_n++) { - struct ggml_tensor * node = cgraph->nodes[node_n]; - - ggml_compute_forward(¶ms, node); -``` -To monitor the execution time of each node, we create a annotation channel for each type of operators (such as GGML_OP_MUL_MAT, GGML_OP_SOFTMAX, GGML_OP_ROPE, GGML_OP_MUL), since GGML_OP_MUL_MAT including both GEMM and GEMV operation takes significant portion of execution time, two dedicated annotation channels are created for GEMM and GEMV respectively. - -The annotation channel starts at the beginning of 'ggml_compute_forward’, it stops at the end of ‘ggml_compute_forward’, so that the computation of tensor node/operator can be monitored. - -Firstly, add Streamline annotation header file to ggml-cpu.c, -```c -#include "streamline_annotate.h" -``` -Then add annotation channel code in ggml_graph_compute_thread function, -```c -for (int node_n = 0; node_n < cgraph->n_nodes && atomic_load_explicit(&tp->abort, memory_order_relaxed) != node_n; node_n++) { - struct ggml_tensor * node = cgraph->nodes[node_n]; - // Start Annotation Channel for Streamline - { - char printf_buf[256]; - sprintf(printf_buf," %s, %s", node->name, ggml_get_name(node)); - - if(node->op==GGML_OP_MUL_MAT ) - { - if (node->src[1]->ne[1] == 1) - ANNOTATE_CHANNEL(0, printf_buf); //It is GEMV - else - ANNOTATE_CHANNEL(1, printf_buf); //It is GEMM - } - else - ANNOTATE_CHANNEL((node->op)+2, printf_buf); - } - - - - ggml_compute_forward(¶ms, node); - - - // End Annotation Channel for Streamline - { - if(node->op==GGML_OP_MUL_MAT) - { - if (node->src[1]->ne[1] == 1) - ANNOTATE_CHANNEL_END(0); - else - ANNOTATE_CHANNEL_END(1); - } - else - ANNOTATE_CHANNEL_END((node->op)+2); - } -``` - - -We also add tensor node names and the names of operation to the string annotation channels. - -If information of the shape and size of source tensors is required, we can change the code as below, -```c - sprintf(printf_buf,"%s %s %d_%d_%d %d_%d_%d", node->name, ggml_get_name(node), \ - node->src[0]? node->src[0]->ne[0] : 0, \ - node->src[0]? node->src[0]->ne[1] : 0 , \ - node->src[0]? node->src[0]->ne[2] : 0 ,\ - node->src[1]? node->src[1]->ne[0] : 0, \ - node->src[1]? node->src[1]->ne[1] : 0, \ - node->src[1]? node->src[1]->ne[2] : 0 \ - ); -``` -Then we need to change *llama.cpp\ggml\src\ggml-cpu\CMakeLists.txt* to include Streamline Annotation header file and libstreamline_annotate.a library by adding codes as below, -```bash - set(STREAMLINE_LIB_PATH ${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a) - target_include_directories( ${GGML_CPU_NAME} PRIVATE ${CMAKE_SOURCE_DIR}/streamline_annotation) - target_link_libraries(${GGML_CPU_NAME} PRIVATE ${STREAMLINE_LIB_PATH} ) -``` - -Then build llama-cli executable, run llama-cli and collect profiling data with Streamline as previous session. 
- - -## Analyze the data with Streamline -String annotations are displayed as text overlays inside the relevant channels in the details panel of the Timeline view, for example inside Channel 0 in the following screenshot. -![text#center](images/deep_dive_1.png "Figure 16. Annotation Channel") - -The letter A is displayed in the process list to indicate the presence of annotations. -String annotations are also displayed in the Message column in the Log view. -![text#center](images/deep_dive_2.png "Figure 17. Annotation log") - -### View of individual operators at Prefill stage - -The screenshot of annotation channel view at Prefill stage is shown as below, -![text#center](images/prefill_annotation_channel.png "Figure 18. Annotation Channel at Prefill stage") - -Note that the name of operator in the screenshot above is manually edited. If the name of operator needs to be shown instead of Channel number by Streamline, ANNOTATE_NAME_CHANNEL can be added to ggml_graph_compute_thread function. -This annotation macro is defined as, -```c -ANNOTATE_NAME_CHANNEL(channel, group, string) -``` -For example, -```c - ANNOTATE_NAME_CHANNEL(0, 0, "MUL_MAT_GEMV"); - ANNOTATE_NAME_CHANNEL(1, 0, "MUL_MAT_GEMM"); -``` -The code above sets the name of annotation channel 0 as ‘MUL_MAT_GEMV’, the name of annotation channel 1 as ‘MUL_MAT_GEMM’. -We can get more detailed information by zooming in the view, -![text#center](images/prefill_annotation_channel_2.png "Figure 18. Annotation Channel at Decode stage") - -When moving the cursor to the Annotation channel, the tensor node name, the name of operation, the shape and size of source tensor nodes will be shown. -![text#center](images/prefill_annotation_channel_3.png "Figure 19. Annotation Channel Zoom in") - -The screenshot above shows a GGML_OP_MUL_MAT operator of FFN_UP node, whose source tensors shape/size is [1024, 2816] and [1024, 68]. -The view clearly shows that the major time was spent on MUL_MAT GEMM operations of attention layers and FFN layers at Prefill stage. There is a large MUL_MAT GEMV operation at result_output linear layer. Other operators such as MUL, Softmax, Norm, RoPE do not take significant time. - -### View of individual operators at Decode stage -The screenshot of annotation channel view at Decode stage is shown as below, -![text#center](images/decode_annotation_channel.png "Figure 20. Annotation Channel at Decode stage") - -We can get more detailed information by zooming in the view, -![text#center](images/decode_annotation_channel_2.png "Figure 21. Annotation Channel string") - -The view shows that the major time was spent on MUL_MAT GEMV operations of attention layers and FFN layers at Decode stage. Comparing with Prefill stage, there is no GEMM at those layers, GEMV operations are performed instead. The large MUL_MAT GEMV operation at result_output linear layer takes more significant portion of time at Decode stage, since the time spent on each token generation at Decode stage is less due to utilization of KV cache. This corresponds to the percentage of execution time of the function ggml_vec_dot_q6_K_q8_K that we observed in previous session. 
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Introduction.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Introduction.md deleted file mode 100644 index bdc885dad5..0000000000 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Introduction.md +++ /dev/null @@ -1,20 +0,0 @@ ---- -title: Overview -weight: 2 - -### FIXED, DO NOT MODIFY -layout: learningpathall ---- - -# Overview -Large Language Models (LLM) run very smoothly on Arm CPUs. The framework that runs LLM models is usually complex. To analyze the execution of LLM and utilize profiling information for potential code optimization, a good understanding of transformer architecture and an appropriate analysis tool is required. -This guide uses llama-cli application from llama.cpp and Arm’s Streamline tool to analyze the efficiency of LLM running on arm CPU. - -The guide includes, -* How to profile LLM token generation at Prefill and Decode stage -* How to profile execution of individual tensor node/operator -* How to profile LLM execution with multi-thread/multi-core - -Understanding this guide requires prerequisite knowledge of transformer architecture, llama.cpp and Streamline. - -We run Qwen1_5-0_5b-chat-q4_0.gguf model with llama-cli on Arm64 Linux and use Streamline for analysis. This guide should also work on Arm64 Android platform. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Introduction_to_llama_cpp.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Introduction_to_llama_cpp.md deleted file mode 100644 index 15bc501c7d..0000000000 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Introduction_to_llama_cpp.md +++ /dev/null @@ -1,39 +0,0 @@ ---- -title: Introduction to llama.cpp -weight: 3 - -### FIXED, DO NOT MODIFY -layout: learningpathall ---- - -# Introduction to llama.cpp -llama.cpp is a LLM framework implemented in C++ that can be used for both training and inference. This guide only covers inference on the CPU. -llama-cli provides a terminal interface to interact with LLM using the llama.cpp inference engine. It enables LLM inference, chat mode, grammar-constrained generation directly from the command line. -![text#center](images/llama_structure.png "Figure 1. Annotation String") - -llama-cli does the following things, -* Load and interpret LLMs in .gguf format. -* Build a compute graph according to the model structure. The compute graph can be divided into subgraphs that are assigned to the most suitable backend devices. At this step, the model structure are converted into a compute graph with many tensor nodes/operators (such as ADD, MUL_MAT, NORM, SOFTMAX) that can be actually computed. -Since this guide only focuses on running LLM on CPU, all operators are assigned to CPU backend. -* Allocate memory for tensors nodes in the compute graph by the graph planner. -* Compute tensor nodes at the graph compute stage, where the ‘graph_compute’ function forwards the compute subgraphs to the backend devices. The computation is performed by traversing the tree of nodes in the compute graph. - -Those steps above are wrapped in the function ‘llama_decode’. At LLM Prefill and Decode stage, llama-cli calls ‘llama_decode’ repeatedly to generate tokens. However, the parameter ‘llama_batch’ passed to ‘llama_decode' is different at Prefill and Decode stage. 
-
-The components of llama.cpp include:
-![text#center](images/llama_componetns.jpg "Figure 2. llama.cpp components")
-
-llama.cpp supports various backends such as CPU, GPU, CUDA, and OpenCL.
-For the CPU backend, it provides an optimized ggml-cpu library (mainly utilizing CPU vector instructions). For Arm CPUs, the ggml-cpu library also offers an aarch64 trait that leverages the new I8MM instructions for acceleration. The ggml-cpu library also integrates the Arm KleidiAI library as an additional trait.
-
-Most autoregressive LLMs are decoder-only models. Here is a brief introduction to the Prefill and Decode stages of autoregressive LLMs.
-![text#center](images/llm_prefill_decode.jpg "Figure 3. Prefill and Decode stage")
-
-At the Prefill stage, multiple input tokens of the prompt are processed. This stage mainly performs GEMM (matrix-matrix multiplication) operations to generate the first output token.
-![text#center](images/transformer_prefill.jpg "Figure 4. Prefill stage")
-
-
-At the Decode stage, by utilizing the KV cache, it mainly performs GEMV (matrix-vector multiplication) operations to generate subsequent output tokens one by one.
-![text#center](images/transformer_decode.jpg "Figure 5. Decode stage")
-
-Therefore, the Prefill stage is compute-bound, while the Decode stage involves relatively little computation and is memory-bound due to heavy KV cache access. This can be seen in the subsequent analysis with Streamline.
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Multi_threads.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Multi_threads.md
deleted file mode 100644
index c3eee09f8c..0000000000
--- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Multi_threads.md
+++ /dev/null
@@ -1,39 +0,0 @@
----
-title: Use Streamline to analyze multi-core/multi-thread support in llama.cpp
-weight: 6
-
-### FIXED, DO NOT MODIFY
-layout: learningpathall
----
-
-# Use Streamline to analyze multi-core/multi-thread support in llama.cpp
-The CPU backend in llama.cpp uses multiple cores and threads to accelerate operator computation.
-llama.cpp creates a thread pool. The number of threads is set by the ‘-t’ option; if ‘-t’ is not specified, it defaults to the number of CPU cores in the system.
-The entry point of the secondary threads is ggml_graph_compute_secondary_thread.
-When computing a tensor node/operator in the compute graph, if the amount of work is large, llama.cpp splits the computation into multiple parts and distributes them across those threads.
-Here is an example of a MUL_MAT operator that demonstrates how the splitting is done.
-
-![text#center](images/multi_thread.jpg "Figure 23. Multi-thread work splitting")
-
-In this example, the result matrix C is split equally between four threads; each thread computes a quarter of matrix C.
-The execution of these threads on CPU cores can be observed with Streamline. The Core Map and Cluster Map modes in the Streamline Timeline view map threads to CPU cores.
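To show the kind of split described above in code, here is a minimal, runnable sketch of the row-partitioning idea. It follows the ith/nth convention of ggml's CPU backend, but the split_rows helper is a simplified illustration rather than the actual llama.cpp implementation.

```c
#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

// Simplified sketch: give thread 'ith' (0-based) its slice of the 'nr' result
// rows when 'nth' threads share one MUL_MAT node.
static void split_rows(int nr, int ith, int nth, int *ir0, int *ir1) {
    int dr = (nr + nth - 1) / nth;   // rows per thread, rounded up
    *ir0 = dr * ith;                 // first row for this thread
    *ir1 = MIN(*ir0 + dr, nr);       // one past the last row for this thread
}

int main(void) {
    const int nr  = 1024;            // rows of the result matrix C
    const int nth = 4;               // e.g. llama-cli run with -t 4
    for (int ith = 0; ith < nth; ith++) {
        int ir0, ir1;
        split_rows(nr, ith, nth, &ir0, &ir1);
        printf("thread %d computes rows [%d, %d)\n", ith, ir0, ir1);
    }
    return 0;
}
```

With nr = 1024 and four threads, each thread gets a contiguous block of 256 rows, which matches the "each thread computes a quarter of matrix C" description above.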
-
-More information about Core Map and Cluster Map modes can be found here:
-https://developer.arm.com/documentation/101816/9-7/Analyze-your-capture/Viewing-application-activity/Core-Map-and-Cluster-Map-modes
-
-Run llama-cli with ‘-t 2 -C 0x3’ to specify two threads and set the thread affinity to CPU core 0 and core 1:
-```bash
-./llama-cli -m qwen1_5-0_5b-chat-q4_0.gguf -p "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n<|im_start|>user\nTell me a story about a fox and a crow? Please do not tell the traditional story in Aesop's fables. Please tell me a positive story about friendship and love. The story should have no more than 400 words<|im_end|>\n<|im_start|>assistant\n" -st -t 2 -C 0x3
-```
-
-Collect profiling data with Streamline, then select the Core Map and Cluster Map modes in the Streamline Timeline view.
-
-![text#center](images/multi_thread_core_map.png "Figure 24. Threads mapped to CPU cores")
-
-As shown in the screenshot above, two threads are created and run on CPU core 0 and core 1, respectively.
-Furthermore, the per-operator annotation channel view can be used to see the two threads’ operators side by side.
-Note that annotation channels are created independently for each thread.
-
-![text#center](images/multi_thread_annotation_channel.png "Figure 25. Annotation channels of the two threads")
-
-As shown in the screenshot above, at a given point in time both threads are computing the same node; in this example, it is the result_output linear node.
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md index 28843b0bde..04c7fb8948 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md @@ -1,24 +1,26 @@ --- -title: Use Streamline to analyze LLM running on CPU with llama.cpp and KleidiAI +title: Use Streamline to analyze LLM on CPU with llama.cpp and KleidiAI -draft: true -cascade: - draft: true - minutes_to_complete: 50 -who_is_this_for: Engineers who want to learn LLM inference on CPU or profile and optimize llama.cpp code. +who_is_this_for: This advanced topic is for software developers, performance engineers, and AI practitioners who want to run llama.cpp on Arm-based CPUs, use Arm Streamline to capture and analyze performance data, and understand how LLM inference behaves at the Prefill and Decode stages. 
-learning_objectives: - - Be able to use Streamline to profile llama.cpp code - - Learn the execution of LLM on CPU +learning_objectives: + - Describe the architecture of llama.cpp and the role of Prefill and Decode stages + - Integrate Streamline Annotations into llama.cpp for fine-grained performance insights + - Capture and interpret profiling data with Streamline + - Use Annotation Channels to analyze specific operators during token generation + - Evaluate multi-core and multi-thread execution of llama.cpp on Arm CPUs prerequisites: - - Understanding of llama.cpp + - Basic understanding of llama.cpp - Understanding of transformer model - Knowledge of Streamline usage + - An Arm Neoverse or Cortex-A hardware platform running Linux or Android to test the application -author: Zenon(Zhilong) Xiu +author: + - Zenon Zhilong Xiu + - Odin Shen ### Tags skilllevels: Advanced @@ -29,6 +31,10 @@ armips: tools_software_languages: - Arm Streamline - C++ + - llama.cpp + - KleidiAI + - Neoverse + - Profiling operatingsystems: - Linux - Android @@ -47,8 +53,6 @@ further_reading: link: https://developer.arm.com/documentation/101816/9-7 type: website - - ### FIXED, DO NOT MODIFY # ================================================================================ weight: 1 # _index.md always has weight of 1 to order correctly diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/decode_annotation_channel_2.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/decode_annotation_channel_2.png index f6095be12c..e56067cabc 100644 Binary files a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/decode_annotation_channel_2.png and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/decode_annotation_channel_2.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/Decode_only.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/decode_only.png similarity index 100% rename from content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/Decode_only.png rename to content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/decode_only.png diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_componetns.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_componetns.png deleted file mode 100644 index 5fdf8f3a66..0000000000 Binary files a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_componetns.png and /dev/null differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread.png deleted file mode 100644 index 47188a01b8..0000000000 Binary files a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread.png and /dev/null differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread_annotation_channel.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread_annotation_channel.png index 1b435ae958..2d56abda3a 100644 Binary files a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread_annotation_channel.png and 
b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread_annotation_channel.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/prefill_annotation_channel_3.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/prefill_annotation_channel_3.png index b42cff8220..3fc7193779 100644 Binary files a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/prefill_annotation_channel_3.png and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/prefill_annotation_channel_3.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/Prefill_only.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/prefill_only.png similarity index 100% rename from content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/Prefill_only.png rename to content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/prefill_only.png