diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Analyzing_token_generation_at_Prefill_and_Decode_stage.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Analyzing_token_generation_at_Prefill_and_Decode_stage.md
new file mode 100644
index 0000000000..298f03c361
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Analyzing_token_generation_at_Prefill_and_Decode_stage.md
@@ -0,0 +1,204 @@
+---
+title: Analyze token generation at Prefill and Decode stage
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+# Analyze token generation at Prefill and Decode stage
+To visualize token generation at the Prefill and Decode stages, the Annotation Marker feature of Streamline is used, and the Annotation Marker generation code is integrated into the llama.cpp project.
+You can find more information about the Annotation Marker feature at https://developer.arm.com/documentation/101816/9-7/Annotate-your-code?lang=en.
+
+## Steps of llama.cpp integration and Streamline setup
+
+### Step 1: Build the Streamline Annotation library
+First, install Arm Development Studio (ArmDS) or Arm Streamline on your host PC.
+The Streamline annotation support code can be found in the installation directory, such as *"Arm\Development Studio 2024.1\sw\streamline\gator\annotate"*.
+The annotation support code is also available at https://github.com/ARM-software/gator/tree/main. Make sure you download the code that matches the version of the Streamline tool on your host PC.
+
+Then build the Streamline Annotation library by running
+```bash
+make CROSS_COMPILE=/path/to/aarch64_linux_gcc_tool
+```
+
+For example:
+```bash
+make CROSS_COMPILE=./Work/arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-
+```
+You can get the aarch64 GCC compiler toolchain from https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads.
+
+The statically linked library, libstreamline_annotate.a, will be produced.
+
+### Step 2: Integrate the Annotation Marker code into llama.cpp
+Download the llama.cpp source code from https://github.com/ggml-org/llama.cpp/archive/refs/tags/b6202.tar.gz
+Go to the llama.cpp root directory and create a directory named streamline_annotation:
+```bash
+cd ./llama.cpp
+mkdir streamline_annotation
+```
+
+Copy the library libstreamline_annotate.a and the header file streamline_annotate.h from Step 1 into the streamline_annotation directory.
+
+To link the libstreamline_annotate.a library when building llama-cli, add the following lines to *llama.cpp\CMakeLists.txt*:
+
+```cmake
+set(STREAMLINE_LIB_PATH ${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a)
+target_include_directories(llama-cli PRIVATE ${CMAKE_SOURCE_DIR}/streamline_annotation)
+target_link_libraries(${TARGET} PRIVATE ${STREAMLINE_LIB_PATH})
+```
+
+To add Annotation Markers to llama-cli, edit *llama.cpp/tools/main/main.cpp*. First, include the header:
+```c
+#include "streamline_annotate.h"
+```
+then add the Annotation Marker code to the main function.
+
+Add the Streamline annotation setup code right after the call to *common_init*:
+```c
+    common_init();
+
+    // Add the annotation setup code
+    ANNOTATE_SETUP;
+
+```
+
+Then add the Annotation Marker generation code to the batch-evaluation loop:
+
+```c
+    for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
+        int n_eval = (int) embd.size() - i;
+        if (n_eval > params.n_batch) {
+            n_eval = params.n_batch;
+        }
+
+        LOG_DBG("eval: %s\n", string_from(ctx, embd).c_str());
+
+        // Add annotation marker code for Streamline
+        {
+            char printf_buf[200];
+            snprintf(printf_buf, sizeof(printf_buf), "past %d, n_eval %d", n_past, n_eval);
+            ANNOTATE_MARKER_STR(printf_buf);
+        }
+        // End of annotation marker
+
+        if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval))) {
+            LOG_ERR("%s : failed to eval\n", __func__);
+            return 1;
+        }
+```
+
+A string is added to the Annotation Marker to record the position
of input tokens and the number of tokens to be processed.
+
+### Step 3: Build the llama-cli executable
+For convenience, llama-cli is statically linked.
+
+First, create a new directory named build under the llama.cpp root directory and go into it:
+```bash
+mkdir ./build && cd ./build
+```
+Then configure the project by running
+```bash
+cmake .. -DCMAKE_SYSTEM_NAME=Linux -DCMAKE_SYSTEM_PROCESSOR=arm -DCMAKE_C_COMPILER=aarch64-none-linux-gnu-gcc -DCMAKE_CXX_COMPILER=aarch64-none-linux-gnu-g++ -DLLAMA_NATIVE=OFF -DLLAMA_F16C=OFF -DLLAMA_GEMM_ARM=ON -DBUILD_SHARED_LIBS=OFF -DCMAKE_EXE_LINKER_FLAGS="-static -g" -DCMAKE_C_FLAGS="-march=armv8.2-a+i8mm+dotprod -g" -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" -DGGML_CPU_KLEIDIAI=ON -DGGML_OPENMP=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_CURL=OFF
+```
+
+Set CMAKE_C_COMPILER and CMAKE_CXX_COMPILER to your cross-compiler paths. Make sure that the "-march" value in CMAKE_C_FLAGS and CMAKE_CXX_FLAGS matches your Arm CPU hardware.
+
+In this guide, llama-cli runs on an Arm CPU that supports the NEON Dotprod and I8MM instructions, so "-march" is specified as "armv8.2-a+dotprod+i8mm". The "-static" and "-g" options are also specified so that the llama-cli executable is statically linked and carries debug info. This makes source-code and function-level profiling easier, and lets the executable run on various versions of Arm64 Linux and Android.
+
+Now build the project by running
+```bash
+cmake --build ./ --config Release
+```
+
+After the build finishes, you should find the llama-cli executable in the *./build/bin/* directory.
+
+### Step 4: Run llama-cli and analyze the data with Streamline
+Copy the following files to your Arm64 platform:
+* the llama-cli executable
+* the gatord executable from the ArmDS or Streamline installation folder, such as *Arm\Development Studio 2024.1\sw\streamline\bin\linux\arm64* for Linux and *Arm\Development Studio 2024.1\sw\streamline\bin\android\arm64* for Android
+* the LLM model, Qwen1_5-0_5b-chat-q4_0.gguf
+
+Then run gatord on your Arm64 target:
+```bash
+./gatord
+```
+You should see messages similar to:
+
+```bash
+Streamline Data Recorder v9.4.0 (Build 9b1e8f8)
+Copyright (c) 2010-2024 Arm Limited. All rights reserved.
+Gator ready
+```
+
+Then launch the Streamline application on your host PC and connect to the gatord running on your Arm64 target over either a TCP or an ADB connection. You can select the PMU events to be monitored at this point.
+
+![text#center](images/streamline_capture.png "Figure 6. Streamline Start Capture ")
+
+Set the path of the llama-cli executable in Streamline so that its debug info can be used for analysis.
+
+![text#center](images/streamline_capture_image.png "Figure 7. Streamline image path")
+
+Click the 'Start Capture' button in Streamline to start collecting data from the Arm64 target.
+
+*Note: This guide does not cover general Streamline usage. If you encounter any issue while setting up gatord or Streamline, refer to the Streamline documentation or ask Arm support for help.*
+
+Now run the llama-cli executable as below:
+
+```bash
+./llama-cli -m qwen1_5-0_5b-chat-q4_0.gguf -p "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n<|im_start|>user\nTell me a story about a fox and a crow? Please do not tell the traditional story in Aesop's fables. Please tell me a positive story about friendship and love. The story should have no more than 400 words<|im_end|>\n<|im_start|>assistant\n" -st -t 1
+```
+
+After a while, you can stop the Streamline data collection by clicking the 'Stop' button in Streamline.
Then the Streamline tool on your host PC will start the data analysis.
+
+## Analyze the data with Streamline
+In the timeline view of Streamline, we can see the Annotation Markers. Since an Annotation Marker is added before the llama_decode function, each marker indicates the start time of a token generation.
+
+![text#center](images/annotation_marker_1.png "Figure 8. Annotation Marker")
+
+The string attached to an Annotation Marker is shown when you click the marker. For example,
+
+![text#center](images/annotation_marker_2.png "Figure 9. Annotation String")
+
+The number after 'past' indicates the position of the input tokens, and the number after 'n_eval' indicates the number of tokens to be processed this time.
+
+As shown in the timeline view below, with the help of Annotation Markers, we can clearly identify the Prefill stage and the Decode stage.
+
+![text#center](images/annotation_marker_prefill.png "Figure 10. Annotation Marker at Prefill and Decode stage")
+
+The Annotation Marker string of the first token generation at the Prefill stage reads 'past 0, n_eval 78', which means that the position of the input tokens starts at 0 and there are 78 input tokens to be processed.
+The first token, generated at the Prefill stage, takes more time: since 78 input tokens have to be processed, the stage performs many GEMM operations. At the Decode stage, tokens are generated one by one at a roughly constant speed, and each token takes less time than the first one, thanks to the KV cache. The Decode stage mainly performs GEMV operations.
+We can investigate this further with the PMU event counters captured by Streamline. At the Prefill stage, the amount of computation, indicated by the PMU event counters that count Advanced SIMD (NEON), floating-point, and integer data-processing instructions, is large, while the memory access is relatively low.
In particular, the number of L3 cache refills/misses is much lower than at the Decode stage.
+
+At the Decode stage, the amount of computation per token is smaller (each token takes less time), but the number of L3 cache refills/misses is much higher.
+We can also monitor other PMU events, such as Backend Stall Cycles and Backend Stall Cycles due to memory stall:
+
+![text#center](images/annotation_pmu_stall.png "Figure 11. Backend stall PMU event")
+
+At the Prefill stage, Backend Stall Cycles due to memory stall are only about 10% of the total Backend Stall Cycles. At the Decode stage, however, they are around 50% of the total.
+All of these PMU event counters indicate that the Prefill stage is compute-bound and the Decode stage is memory-bound.
+
+Now, let us further profile the code execution with Streamline. In the 'Call Paths' view of Streamline, we can see the percentage of running time of functions organized in the form of a call stack.
+
+![text#center](images/annotation_prefill_call_stack.png "Figure 12. Call stack")
+
+In the 'Functions' view of Streamline, we can see the overall percentage of running time of each function.
+
+![text#center](images/annotation_prefill_functions.png "Figure 13. Functions view")
+
+As we can see, the function graph_compute takes the largest portion of the running time: large numbers of GEMM and GEMV operations take most of the time. With the Qwen1_5-0_5b-chat-q4_0 model:
+* The computation (GEMM and GEMV) of the Q, K, V vectors and most of the FFN layers: their weights use the Q4_0 data type and the input activations use the FP32 data type. The computation is forwarded to the KleidiAI trait by *ggml_cpu_extra_compute_forward*. KleidiAI ukernels implemented with NEON Dotprod and I8MM vector instructions are used to accelerate the computation.
+  - At the Prefill stage, the *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* KleidiAI ukernel is used for GEMM (matrix multiply) operators. It takes advantage of the NEON I8MM instructions. Since the Prefill stage takes only a small percentage of the whole run time, the percentage of this function is small in the figures above. However, if we focus on the Prefill stage only, using the 'Samplings' view in the Timeline, we can see that *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* takes the largest portion of the Prefill stage.
+
+    ![text#center](images/Prefill_only.png "Figure 14. Prefill only view")
+
+  - At the Decode stage, the *kai_run_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod* KleidiAI ukernel is used for GEMV operators. It takes advantage of the NEON Dotprod instruction. If we focus on the Decode stage only, we can see that this function takes the second largest portion.
+
+    ![text#center](images/Decode_only.png "Figure 15. Decode only view")
+
+* There is a result_output linear layer in the Qwen1_5-0_5b-chat-q4_0 model whose weights use the Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation cannot be handled by KleidiAI yet; it is handled by the ggml_vec_dot_q6_K_q8_K function in the ggml-cpu library.
+* The tensor nodes for the computation of multi-head attention are three-dimensional matrices with the FP16 data type (the KV cache also holds FP16 values); they are computed by the ggml_vec_dot_f16 function in the ggml-cpu library.
+* The computation of the RoPE, Softmax, and RMSNorm layers does not take a significant portion of the running time.
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Conclusion.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Conclusion.md
new file mode 100644
index 0000000000..94b21e56fa
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Conclusion.md
@@ -0,0 +1,13 @@
+---
+title: Conclusion
+weight: 7
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+# Conclusion
+By leveraging the Streamline tool together with a good understanding of the llama.cpp code, the execution of an LLM model can be visualized, which helps analyze code efficiency and investigate potential optimizations.
+
+Note that the additional annotation code in llama.cpp, and gatord itself, may slightly affect performance.
+
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Deep_dive.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Deep_dive.md
new file mode 100644
index 0000000000..3802be4996
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Deep_dive.md
@@ -0,0 +1,132 @@
+---
+title: Deep dive into individual operators
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+# Deep dive into individual operators
+This section shows how to use the Streamline Annotation Channel feature to analyze the execution time of each node in the compute graph.
+More information about the Streamline Annotation Channel feature can be found at https://developer.arm.com/documentation/101816/9-7/Annotate-your-code/User-space-annotations/Group-and-Channel-annotations?lang=en
+
+## Integrate the Annotation Channel code into llama.cpp
+In the llama.cpp project, the tensor nodes in the compute graph are computed by the function ggml_graph_compute_thread in the CPU backend, *llama.cpp\ggml\src\ggml-cpu\ggml-cpu.c*:
+```c
+for (int node_n = 0; node_n < cgraph->n_nodes && atomic_load_explicit(&tp->abort, memory_order_relaxed) != node_n; node_n++) {
+    struct ggml_tensor * node = cgraph->nodes[node_n];
+
+    ggml_compute_forward(&params, node);
+```
+To monitor the execution time of each node, we create an annotation channel for each type of operator (such as GGML_OP_MUL_MAT, GGML_OP_SOFTMAX, GGML_OP_ROPE, GGML_OP_MUL). Since GGML_OP_MUL_MAT, which covers both GEMM and GEMV operations, takes a significant portion of the execution time, two dedicated annotation channels are created for GEMM and GEMV respectively.
+
+The annotation channel starts at the beginning of ggml_compute_forward and stops at its end, so that the computation of each tensor node/operator can be monitored.
+
+Firstly, add the Streamline annotation header file to ggml-cpu.c:
+```c
+#include "streamline_annotate.h"
+```
+Then add the annotation channel code to the ggml_graph_compute_thread function:
+```c
+for (int node_n = 0; node_n < cgraph->n_nodes && atomic_load_explicit(&tp->abort, memory_order_relaxed) != node_n; node_n++) {
+    struct ggml_tensor * node = cgraph->nodes[node_n];
+    // Start annotation channel for Streamline
+    {
+        char printf_buf[256];
+        snprintf(printf_buf, sizeof(printf_buf), " %s, %s", node->name, ggml_op_name(node->op));
+
+        if (node->op == GGML_OP_MUL_MAT) {
+            if (node->src[1]->ne[1] == 1)
+                ANNOTATE_CHANNEL(0, printf_buf); // it is GEMV
+            else
+                ANNOTATE_CHANNEL(1, printf_buf); // it is GEMM
+        } else {
+            ANNOTATE_CHANNEL((node->op) + 2, printf_buf);
+        }
+    }
+
+    ggml_compute_forward(&params, node);
+
+    // End annotation channel for Streamline
+    {
+        if (node->op == GGML_OP_MUL_MAT) {
+            if (node->src[1]->ne[1] == 1)
+                ANNOTATE_CHANNEL_END(0);
+            else
+                ANNOTATE_CHANNEL_END(1);
+        } else {
+            ANNOTATE_CHANNEL_END((node->op) + 2);
+        }
+    }
+```
+
+This records each tensor node's name and its operation name in the string annotation channels.
+
+If information about the shape and size of the source tensors is also required, the format string can be extended as below:
+```c
+    sprintf(printf_buf, "%s %s %d_%d_%d %d_%d_%d", node->name, ggml_get_name(node), \
+            node->src[0] ? node->src[0]->ne[0] : 0, \
+            node->src[0] ? node->src[0]->ne[1] : 0, \
+            node->src[0] ? node->src[0]->ne[2] : 0, \
+            node->src[1] ? node->src[1]->ne[0] : 0, \
+            node->src[1] ? node->src[1]->ne[1] : 0, \
+            node->src[1] ? 
node->src[1]->ne[2] : 0 \
+    );
+```
+Then change *llama.cpp\ggml\src\ggml-cpu\CMakeLists.txt* to include the Streamline annotation header file and the libstreamline_annotate.a library by adding the following lines:
+```cmake
+    set(STREAMLINE_LIB_PATH ${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a)
+    target_include_directories( ${GGML_CPU_NAME} PRIVATE ${CMAKE_SOURCE_DIR}/streamline_annotation)
+    target_link_libraries(${GGML_CPU_NAME} PRIVATE ${STREAMLINE_LIB_PATH})
+```
+
+Then build the llama-cli executable, run it, and collect profiling data with Streamline as in the previous section.
+
+## Analyze the data with Streamline
+String annotations are displayed as text overlays inside the relevant channels in the details panel of the Timeline view, for example inside Channel 0 in the following screenshot.
+![text#center](images/deep_dive_1.png "Figure 16. Annotation Channel")
+
+The letter A is displayed in the process list to indicate the presence of annotations.
+String annotations are also displayed in the Message column in the Log view.
+![text#center](images/deep_dive_2.png "Figure 17. Annotation log")
+
+### View of individual operators at Prefill stage
+
+The annotation channel view at the Prefill stage is shown below:
+![text#center](images/prefill_annotation_channel.png "Figure 18. Annotation Channel at Prefill stage")
+
+Note that the operator names in the screenshot above were edited in manually. If you want Streamline to show operator names instead of channel numbers, ANNOTATE_NAME_CHANNEL calls can be added to the ggml_graph_compute_thread function.
+This annotation macro is defined as
+```c
+ANNOTATE_NAME_CHANNEL(channel, group, string)
+```
+For example,
+```c
+    ANNOTATE_NAME_CHANNEL(0, 0, "MUL_MAT_GEMV");
+    ANNOTATE_NAME_CHANNEL(1, 0, "MUL_MAT_GEMM");
+```
+The code above names annotation channel 0 'MUL_MAT_GEMV' and annotation channel 1 'MUL_MAT_GEMM'.
+
+We can get more detailed information by zooming into the view:
+![text#center](images/prefill_annotation_channel_2.png "Figure 18. Annotation Channel at Prefill stage, zoomed in")
+
+When you hover over the annotation channel, the tensor node name, the operation name, and the shape and size of the source tensor nodes are shown.
+![text#center](images/prefill_annotation_channel_3.png "Figure 19. Annotation Channel Zoom in")
+
+The screenshot above shows a GGML_OP_MUL_MAT operator of an FFN_UP node, whose source tensors have shapes [1024, 2816] and [1024, 68].
+The view clearly shows that at the Prefill stage most of the time is spent on MUL_MAT GEMM operations in the attention and FFN layers. There is also a large MUL_MAT GEMV operation in the result_output linear layer. Other operators such as MUL, Softmax, Norm, and RoPE do not take significant time.
+
+### View of individual operators at Decode stage
+The annotation channel view at the Decode stage is shown below:
+![text#center](images/decode_annotation_channel.png "Figure 20. Annotation Channel at Decode stage")
+
+We can get more detailed information by zooming into the view:
+![text#center](images/decode_annotation_channel_2.png "Figure 21. Annotation Channel string")
+
+The view shows that at the Decode stage most of the time is spent on MUL_MAT GEMV operations in the attention and FFN layers. Compared with the Prefill stage, there is no GEMM at those layers; GEMV operations are performed instead. The large MUL_MAT GEMV operation in the result_output linear layer takes a more significant portion of the time at the Decode stage, since each token generation takes less time thanks to the KV cache. This matches the percentage of execution time of the ggml_vec_dot_q6_K_q8_K function observed in the previous section.
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Introduction.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Introduction.md
new file mode 100644
index 0000000000..bdc885dad5
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Introduction.md
@@ -0,0 +1,20 @@
+---
+title: Overview
+weight: 2
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+# Overview
+Large Language Models (LLMs) can run efficiently on Arm CPUs. However, the frameworks that run LLM models are usually complex. To analyze the execution of an LLM and use profiling information for potential code optimization, a good understanding of the transformer architecture and an appropriate analysis tool are required.
+This guide uses the llama-cli application from llama.cpp and Arm's Streamline tool to analyze the efficiency of an LLM running on Arm CPUs.
+
+The guide covers:
+* How to profile LLM token generation at the Prefill and Decode stages
+* How to profile the execution of individual tensor nodes/operators
+* How to profile LLM execution with multiple threads/cores
+
+Understanding this guide requires prerequisite knowledge of the transformer architecture, llama.cpp, and Streamline.
+
+We run the Qwen1_5-0_5b-chat-q4_0.gguf model with llama-cli on Arm64 Linux and use Streamline for analysis. This guide should also work on Arm64 Android platforms.
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Introduction_to_llama_cpp.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Introduction_to_llama_cpp.md
new file mode 100644
index 0000000000..15bc501c7d
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Introduction_to_llama_cpp.md
@@ -0,0 +1,39 @@
+---
+title: Introduction to llama.cpp
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+# Introduction to llama.cpp
+llama.cpp is an LLM framework implemented in C++ that can be used for both training and inference. This guide only covers inference on the CPU.
+llama-cli provides a terminal interface for interacting with an LLM using the llama.cpp inference engine. It enables LLM inference, chat mode, and grammar-constrained generation directly from the command line.
+![text#center](images/llama_structure.png "Figure 1. llama-cli structure")
+
+llama-cli does the following:
+* Load and interpret LLMs in the .gguf format.
+* Build a compute graph according to the model structure. The compute graph can be divided into subgraphs that are assigned to the most suitable backend devices. At this step, the model structure is converted into a compute graph of tensor nodes/operators (such as ADD, MUL_MAT, NORM, SOFTMAX) that can actually be computed.
+Since this guide only focuses on running the LLM on the CPU, all operators are assigned to the CPU backend.
+* Allocate memory for the tensor nodes in the compute graph via the graph planner.
+* Compute the tensor nodes at the graph compute stage, where the graph_compute function forwards the compute subgraphs to the backend devices. The computation is performed by traversing the tree of nodes in the compute graph.
+
+The steps above are wrapped in the function llama_decode. At the Prefill and Decode stages, llama-cli calls llama_decode repeatedly to generate tokens.
However, the parameter llama_batch passed to llama_decode differs between the Prefill and Decode stages. llama_batch includes information such as the input tokens, the number of input tokens, and the positions of the input tokens.
+
+The components of llama.cpp include:
+![text#center](images/llama_componetns.jpg "Figure 2. llama.cpp components")
+
+llama.cpp supports various backends such as CPU, CUDA, OpenCL, and others.
+For the CPU backend, it provides an optimized ggml-cpu library (mainly utilizing CPU vector instructions). For Arm CPUs, the ggml-cpu library also offers an aarch64 trait that leverages the I8MM instructions for acceleration. The ggml-cpu library also integrates the Arm KleidiAI library as an additional trait.
+
+Most autoregressive LLMs are decoder-only models. Here is a brief introduction to the Prefill and Decode stages of autoregressive LLMs.
+![text#center](images/llm_prefill_decode.jpg "Figure 3. Prefill and Decode stage")
+
+At the Prefill stage, multiple input tokens of the prompt are processed. It mainly performs GEMM (matrix-matrix multiply) operations to generate the first output token.
+![text#center](images/transformer_prefill.jpg "Figure 4. Prefill stage")
+
+At the Decode stage, by utilizing the KV cache, it mainly performs GEMV (vector-matrix multiply) operations to generate subsequent output tokens one by one.
+![text#center](images/transformer_decode.jpg "Figure 5. Decode stage")
+
+Therefore, the Prefill stage is compute-bound, while the Decode stage has relatively less computation and is more memory-bound due to heavy KV cache memory access. This can be seen in the subsequent analysis with Streamline.
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Multi_threads.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Multi_threads.md
new file mode 100644
index 0000000000..c3eee09f8c
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Multi_threads.md
@@ -0,0 +1,39 @@
+---
+title: Use Streamline to analyze multi-core/multi-thread support in llama.cpp
+weight: 6
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+# Use Streamline to analyze multi-core/multi-thread support in llama.cpp
+The CPU backend in llama.cpp uses multiple cores/threads to accelerate the computation of operators.
+llama.cpp creates a threadpool. The number of threads in the threadpool is set by the '-t' option; if '-t' is not specified, it defaults to the number of CPU cores in the system.
+The entry point of the secondary threads is ggml_graph_compute_secondary_thread.
+When computing a tensor node/operator in the compute graph, if the work size is large, llama.cpp splits the computation into multiple parts across those threads.
+Here is an example of the MUL_MAT operator that demonstrates how the split is done.
+
+![text#center](images/multi_thread.jpg "Figure 22. Multi-thread")
+
+In this example, the result matrix C is split equally between four threads; each thread computes a quarter of matrix C.
+The execution of multiple threads on the CPU cores can be observed with Streamline. The Core Map and Cluster Map modes in the Streamline Timeline view map threads to CPU cores.
+
+More information about the Core Map and Cluster Map modes can be found at
+https://developer.arm.com/documentation/101816/9-7/Analyze-your-capture/Viewing-application-activity/Core-Map-and-Cluster-Map-modes
+
+Run llama-cli with '-t 2 -C 0x3' to specify two threads and set their affinity to CPU cores 0 and 1:
+```bash
+./llama-cli -m qwen1_5-0_5b-chat-q4_0.gguf -p "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n<|im_start|>user\nTell me a story about a fox and a crow? Please do not tell the traditional story in Aesop's fables. Please tell me a positive story about friendship and love. The story should have no more than 400 words<|im_end|>\n<|im_start|>assistant\n" -st -t 2 -C 0x3
+```
+
+Collect profiling data with Streamline, then select the Core Map and Cluster Map modes in the Streamline Timeline view.
+
+![text#center](images/multi_thread_core_map.png "Figure 23. Multi-thread")
+
+As shown in the screenshot above, two threads are created and run on CPU cores 0 and 1 respectively.
+Furthermore, the individual operator view with annotation channels can be used to view the two threads' operators in parallel.
+Note that annotation channels are created independently per thread.
+
+![text#center](images/multi_thread_annotation_channel.png "Figure 24. Multi-thread")
+
+As shown in the screenshot above, at any given time both threads are computing the same node; in this example, it is the result_output linear node.
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md
new file mode 100644
index 0000000000..f68a4ad4db
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md
@@ -0,0 +1,57 @@
+---
+title: Use Streamline to analyze LLM running on CPU with llama.cpp and KleidiAI
+
+draft: true
+cascade:
+  draft: true
+
+minutes_to_complete: 50
+
+who_is_this_for: Engineers who want to learn about LLM inference on CPUs, or to profile and optimize llama.cpp code.
+
+learning_objectives:
+  - Be able to use Streamline to profile llama.cpp code
+  - Learn how an LLM executes on a CPU
+
+prerequisites:
+  - Understanding of llama.cpp
+  - Understanding of the transformer model
+  - Knowledge of Streamline usage
+
+author: Zenon (Zhilong) Xiu
+
+### Tags
+skilllevels: Advanced
+subjects: ML
+armips:
+  - Cortex-A
+  - Neoverse
+tools_software_languages:
+  - Arm Streamline
+  - C++
+operatingsystems:
+  - Linux
+  - Android
+
+further_reading:
+  - resource:
+      title: llama.cpp project
+      link: https://github.com/ggml-org/llama.cpp
+      type: source code
+  - resource:
+      title: Qwen1_5-0_5b-chat-q4_0.gguf
+      link: https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat-GGUF/blob/main/qwen1_5-0_5b-chat-q4_0.gguf
+      type: LLM model
+  - resource:
+      title: Arm Streamline User Guide
+      link: https://developer.arm.com/documentation/101816/9-7
+      type: website
+
+
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1                       # _index.md always has weight of 1 to order correctly
+layout: "learningpathall"       # All files under learning paths have this same wrapper
+learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+--- diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/Decode_only.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/Decode_only.png new file mode 100644 index 0000000000..3d084767d8 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/Decode_only.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/Prefill_only.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/Prefill_only.png new file mode 100644 index 0000000000..68b6d57957 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/Prefill_only.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_marker_1.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_marker_1.png new file mode 100644 index 0000000000..8ee615057c Binary files /dev/null and 
b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_marker_1.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_marker_2.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_marker_2.png new file mode 100644 index 0000000000..c466f72fb1 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_marker_2.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_marker_prefill.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_marker_prefill.png new file mode 100644 index 0000000000..2b425ddb29 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_marker_prefill.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_pmu_stall.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_pmu_stall.png new file mode 100644 index 0000000000..dc00fd642f Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_pmu_stall.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_prefill_call_stack.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_prefill_call_stack.png new file mode 100644 index 0000000000..f1c29741e8 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_prefill_call_stack.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_prefill_functions.png 
b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_prefill_functions.png new file mode 100644 index 0000000000..f2c393f885 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/annotation_prefill_functions.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/decode_annotation_channel.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/decode_annotation_channel.png new file mode 100644 index 0000000000..5dc572a063 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/decode_annotation_channel.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/decode_annotation_channel_2.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/decode_annotation_channel_2.png new file mode 100644 index 0000000000..f6095be12c Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/decode_annotation_channel_2.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/deep_dive_1.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/deep_dive_1.png new file mode 100644 index 0000000000..e63b2b7b26 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/deep_dive_1.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/deep_dive_2.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/deep_dive_2.png new file mode 100644 index 0000000000..1fc58df987 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/deep_dive_2.png differ diff --git 
a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_componetns.jpg b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_componetns.jpg new file mode 100644 index 0000000000..55f56c2883 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_componetns.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_componetns.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_componetns.png new file mode 100644 index 0000000000..5fdf8f3a66 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_componetns.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_structure.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_structure.png new file mode 100644 index 0000000000..67cea85969 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_structure.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llm_prefill_decode.jpg b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llm_prefill_decode.jpg new file mode 100644 index 0000000000..9be52a78fd Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llm_prefill_decode.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread.jpg b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread.jpg new file mode 100644 index 0000000000..7b6fc6a7f8 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread.jpg differ diff --git 
a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread.png new file mode 100644 index 0000000000..47188a01b8 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread_annotation_channel.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread_annotation_channel.png new file mode 100644 index 0000000000..1b435ae958 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread_annotation_channel.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread_core_map.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread_core_map.png new file mode 100644 index 0000000000..505de28210 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/multi_thread_core_map.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/prefill_annotation_channel.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/prefill_annotation_channel.png new file mode 100644 index 0000000000..5bbca5fbb9 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/prefill_annotation_channel.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/prefill_annotation_channel_2.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/prefill_annotation_channel_2.png new file mode 100644 index 0000000000..e32eed9703 Binary files /dev/null and 
b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/prefill_annotation_channel_2.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/prefill_annotation_channel_3.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/prefill_annotation_channel_3.png new file mode 100644 index 0000000000..b42cff8220 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/prefill_annotation_channel_3.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/streamline_capture.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/streamline_capture.png new file mode 100644 index 0000000000..8deffcef4a Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/streamline_capture.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/streamline_capture_image.png b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/streamline_capture_image.png new file mode 100644 index 0000000000..1a6c359f52 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/streamline_capture_image.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/transformer_decode.jpg b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/transformer_decode.jpg new file mode 100644 index 0000000000..4618ca890f Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/transformer_decode.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/transformer_prefill.jpg 
b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/transformer_prefill.jpg new file mode 100644 index 0000000000..f501973bb4 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/transformer_prefill.jpg differ