diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index c2da7c098d..b9cc2dc326 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -45,6 +45,8 @@ Read the files in the directory `content/learning-paths/cross-platform/_example- Each Learning Path must have an _index.md file and a _next-steps.md file. The _index.md file contains the main content of the Learning Path. The _next-steps.md file contains links to related content and is included at the end of the Learning Path. +Additional resources and 'next steps' content should be placed in the `further_reading` section of `_index.md`, NOT in `_next-steps.md`. The `_next-steps.md` file should remain minimal and unmodified as indicated by "FIXED, DO NOT MODIFY" comments in the template. + The _index.md file should contain the following front matter and content sections: Front Matter (YAML format): @@ -60,6 +62,16 @@ Front Matter (YAML format): - `skilllevels`: Skill levels allowed are only Introductory and Advanced - `operatingsystems`: Operating systems used, must match the closed list on https://learn.arm.com/learning-paths/cross-platform/_example-learning-path/write-2-metadata/ +### Further Reading Curation + +Limit further_reading resources to 4-6 essential links. Prioritize: +- Direct relevance to the topic +- Arm-specific Learning Paths over generic external resources +- Foundation knowledge for target audience +- Required tools (install guides) +- Logical progression from basic to advanced + +Avoid overwhelming readers with too many links, which can cause them to leave the platform. All Learning Paths should generally include: Title: [Imperative verb] + [technology/tool] + [outcome] @@ -205,18 +217,23 @@ Some links are useful in content, but too many links can be distracting and read ### Internal links -Use a relative path format for internal links that are on learn.arm.com. -For example, use: descriptive link text pointing to a relative path like learning-paths/category/path-name/ +Use the full path format for internal links: `/learning-paths/category/path-name/` (e.g., `/learning-paths/cross-platform/docker/`). Do NOT use relative paths like `../path-name/`. Examples: -- learning-paths/servers-and-cloud-computing/csp/ (Arm-based instance) -- learning-paths/cross-platform/docker/ (Docker learning path) +- /learning-paths/servers-and-cloud-computing/csp/ (Arm-based instance) +- /learning-paths/cross-platform/docker/ (Docker learning path) ### External links Use the full URL for external links that are not on learn.arm.com, these open in a new tab. -This instruction set enables high-quality Arm Learning Paths content while maintaining consistency and technical accuracy. - +### Link Verification Process +When creating Learning Path content: +- Verify internal links exist before adding them +- Use semantic search or website browsing to confirm Learning Path availability +- Prefer verified external authoritative sources over speculative internal links +- Test link formats against existing Learning Path examples +- Never assume Learning Paths exist without verification +This instruction set enables high-quality Arm Learning Paths content while maintaining consistency and technical accuracy. 
\ No newline at end of file diff --git a/.wordlist.txt b/.wordlist.txt index dbc30450ad..3111457920 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -4976,4 +4976,5 @@ StatefulSets codemia multidisks testsh -uops \ No newline at end of file +uops +subgraph diff --git a/assets/contributors.csv b/assets/contributors.csv index a149228b12..a9e1572658 100644 --- a/assets/contributors.csv +++ b/assets/contributors.csv @@ -102,5 +102,6 @@ Ker Liu,,,,, Rui Chang,,,,, Alejandro Martinez Vicente,Arm,,,, Mohamad Najem,Arm,,,, +Ruifeng Wang,Arm,,,, Zenon Zhilong Xiu,Arm,,zenon-zhilong-xiu-491bb398,, -Zbynek Roubalik,Kedify,,,, +Zbynek Roubalik,Kedify,,,, \ No newline at end of file diff --git a/content/learning-paths/cross-platform/topdown-compare/1-top-down.md b/content/learning-paths/cross-platform/topdown-compare/1-top-down.md index d87ebb3e3e..cab6ed9313 100644 --- a/content/learning-paths/cross-platform/topdown-compare/1-top-down.md +++ b/content/learning-paths/cross-platform/topdown-compare/1-top-down.md @@ -14,17 +14,20 @@ Both Intel x86 and Arm Neoverse CPUs provide sophisticated Performance Monitorin While the specific counter names and formulas differ between architectures, both Intel x86 and Arm Neoverse have converged on top-down performance analysis methodologies that categorize performance bottlenecks into four key areas: -**Retiring** represents pipeline slots that successfully complete useful work, while **Bad Speculation** accounts for slots wasted on mispredicted branches. Additionally, **Frontend Bound** identifies slots stalled due to instruction fetch and decode limitations, and **Backend Bound** covers slots stalled by execution resource constraints. +- Retiring +- Bad Speculation +- Frontend Bound +- Backend Bound -This Learning Path provides a comparison of how x86 processors implement four-level hierarchical top-down analysis compared to Arm Neoverse's two-stage methodology, highlighting the similarities in approach while explaining the architectural differences in PMU counter events and formulas. +This Learning Path provides a comparison of how x86 processors implement multi-level hierarchical top-down analysis compared to Arm Neoverse's methodology, highlighting the similarities in approach while explaining the architectural differences in PMU counter events and formulas. ## Introduction to top-down performance analysis -The top-down methodology makes performance analysis easier by shifting focus from individual PMU counters to pipeline slot utilization. Instead of trying to interpret dozens of seemingly unrelated metrics, you can systematically identify bottlenecks by attributing each CPU pipeline slot to one of four categories. +The top-down methodology makes performance analysis easier by shifting focus from individual PMU counters to pipeline slot utilization. Instead of trying to interpret dozens of seemingly unrelated metrics, you can systematically identify bottlenecks by attributing each CPU pipeline slot to one of the four categories. **Retiring** represents pipeline slots that successfully complete useful work, while **Bad Speculation** accounts for slots wasted on mispredicted branches and pipeline flushes. **Frontend Bound** identifies slots stalled due to instruction fetch and decode limitations, whereas **Backend Bound** covers slots stalled by execution resource constraints such as cache misses or arithmetic unit availability. 
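To make slot-based attribution concrete, here is a minimal Python sketch (not part of either vendor's tooling) that divides hypothetical per-category slot counts into the four shares and picks the dominant category to drill into. The function name and counter values are invented for illustration only:

```python
# Minimal sketch of slot-based top-down attribution (illustrative values only).
# Every pipeline slot is attributed to exactly one of the four categories; the
# dominant share is the one worth investigating first.

def topdown_level1(retiring, bad_speculation, frontend_bound, backend_bound):
    total = retiring + bad_speculation + frontend_bound + backend_bound
    shares = {
        "Retiring": retiring / total,
        "Bad Speculation": bad_speculation / total,
        "Frontend Bound": frontend_bound / total,
        "Backend Bound": backend_bound / total,
    }
    return shares, max(shares, key=shares.get)

# Hypothetical slot counts for a memory-bound workload
shares, dominant = topdown_level1(retiring=90, bad_speculation=10,
                                  frontend_bound=20, backend_bound=280)
for name, value in shares.items():
    print(f"{name:16s} {value:6.1%}")
print("Drill down into:", dominant)
```

Both architectures derive these shares from PMU counters, but with different events and formulas, as the following sections show.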
-The methodology uses a hierarchical approach that allows you to drill down only into the dominant bottleneck category, and avoid the complexity of analyzing all possible performance issues at the same time. +The methodology allows you to drill down only into the dominant bottleneck category, avoiding the complexity of analyzing all possible performance issues at the same time. The next sections compare the Intel x86 methodology with the Arm top-down methodology. diff --git a/content/learning-paths/cross-platform/topdown-compare/1a-intel.md b/content/learning-paths/cross-platform/topdown-compare/1a-intel.md index 4ff98e1b1b..317eb3f105 100644 --- a/content/learning-paths/cross-platform/topdown-compare/1a-intel.md +++ b/content/learning-paths/cross-platform/topdown-compare/1a-intel.md @@ -1,5 +1,5 @@ --- -title: "Implement Intel x86 4-level hierarchical top-down analysis" +title: "Understand Intel x86 multi-level hierarchical top-down analysis" weight: 4 ### FIXED, DO NOT MODIFY @@ -8,9 +8,9 @@ layout: learningpathall ## Configure slot-based accounting with Intel x86 PMU counters -Intel uses a slot-based accounting model where each CPU cycle provides multiple issue slots. A slot is a hardware resource needed to process micro-operations (uops). More slots means more work can be done per cycle. The number of slots depends on the microarchitecture design but current Intel processor designs typically have four issue slots per cycle. +Intel uses a slot-based accounting model where each CPU cycle provides multiple issue slots. A slot is a hardware resource needed to process micro-operations (uops). More slots means more work can be done per cycle. The number of slots depends on the microarchitecture design, but current Intel processor designs typically have four issue slots per cycle. -Intel's methodology uses a multi-level hierarchy that extends to 4 levels of detail. Each level provides progressively more granular analysis, allowing you to drill down from high-level categories to specific microarchitecture events. +Intel's methodology uses a multi-level hierarchy that typically extends to 3-4 levels of detail. Each level provides progressively more granular analysis, allowing you to drill down from high-level categories to specific microarchitecture events. ## Level 1: Identify top-level performance categories @@ -27,18 +27,20 @@ Where `SLOTS = 4 * CPU_CLK_UNHALTED.THREAD` on most Intel cores. Once you've identified the dominant Level 1 category, Level 2 drills into each area to identify broader causes. This level distinguishes between frontend latency and bandwidth limits, or between memory and core execution stalls in the backend. -- Frontend Bound covers frontend latency in comparison with frontend bandwidth -- Backend Bound covers memory bound in comparison with core bound -- Bad Speculation covers branch mispredicts in comparison with machine clears -- Retiring covers base in comparison with microcode sequencer +- Frontend Bound covers frontend latency compared with frontend bandwidth +- Backend Bound covers memory bound compared with core bound +- Bad Speculation covers branch mispredicts compared with machine clears +- Retiring covers base compared with microcode sequencer ## Level 3: Target specific microarchitecture bottlenecks -After identifying broader cause categories in Level 2, Level 3 provides fine-grained attribution that pinpoints specific bottlenecks like DRAM latency, cache misses, or port contention. 
This precision makes it possible to identify the exact root cause and apply targeted optimizations. Memory Bound expands into detailed cache hierarchy analysis including L1 Bound, L2 Bound, L3 Bound, DRAM Bound, and Store Bound categories, while Core Bound breaks down into execution unit constraints such as Divider and Ports Utilization, along with many other specific microarchitecture-level categories that enable precise performance tuning. +After identifying broader cause categories in Level 2, Level 3 provides fine-grained attribution that pinpoints specific bottlenecks like DRAM latency, cache misses, or port contention. This precision makes it possible to identify the exact root cause and apply targeted optimizations. + +Memory Bound expands into detailed cache hierarchy analysis including L1 Bound, L2 Bound, L3 Bound, DRAM Bound, and Store Bound categories. Core Bound breaks down into execution unit constraints such as Divider and Ports Utilization, along with many other specific microarchitecture-level categories that enable precise performance tuning. ## Level 4: Access specific PMU counter events -The final level provides direct access to the specific microarchitecture events that cause the inefficiencies. At this level, you work directly with raw PMU counter values to understand the underlying hardware behavior causing performance bottlenecks. This enables precise tuning by identifying exactly which execution units, cache levels, or pipeline stages are limiting performance, allowing you to apply targeted code optimizations or hardware configuration changes. +Level 4 provides direct access to the specific microarchitecture events that cause the inefficiencies. At this level, you work directly with raw PMU counter values to understand the underlying hardware behavior causing performance bottlenecks. This enables precise tuning by identifying exactly which execution units, cache levels, or pipeline stages are limiting performance, allowing you to apply targeted code optimizations or hardware configuration changes. ## Apply essential Intel x86 PMU counters for analysis @@ -63,5 +65,5 @@ Intel processors expose hundreds of performance events, but top-down analysis re | `OFFCORE_RESPONSE.*` | Detailed classification of off-core responses (L3 vs. DRAM, local vs. remote socket) | -Using the above levels of metrics you can find out which of the four top-level categories are causing bottlenecks. +Using the above levels of metrics, you can determine which of the four top-level categories are causing bottlenecks. diff --git a/content/learning-paths/cross-platform/topdown-compare/1b-arm.md b/content/learning-paths/cross-platform/topdown-compare/1b-arm.md index 7ce61660b9..b328921f80 100644 --- a/content/learning-paths/cross-platform/topdown-compare/1b-arm.md +++ b/content/learning-paths/cross-platform/topdown-compare/1b-arm.md @@ -1,5 +1,5 @@ --- -title: "Implement Arm Neoverse 2-stage top-down analysis" +title: "Understand Arm Neoverse top-down analysis" weight: 5 ### FIXED, DO NOT MODIFY @@ -9,15 +9,15 @@ layout: learningpathall After understanding Intel's comprehensive 4-level hierarchy, you can explore how Arm approached the same performance analysis challenge with a different philosophy. Arm developed a complementary top-down methodology specifically for Neoverse server cores that prioritizes practical usability while maintaining analysis effectiveness. -The Arm Neoverse architecture uses an 8-slot rename unit for pipeline bandwidth accounting, differing from Intel's issue-slot model. 
Unlike Intel's hierarchical model, Arm employs a streamlined two-stage methodology that balances analysis depth with practical usability. +The Arm Neoverse architecture uses an 8-slot rename unit for pipeline bandwidth accounting, which differs from Intel's issue-slot model. Unlike Intel's hierarchical model, Arm employs a streamlined two-stage methodology that balances analysis depth with practical usability. ### Execute Stage 1: Calculate top-down performance categories -Stage 1 identifies high-level bottlenecks using the same four categories as Intel but with Arm-specific PMU events and formulas. This stage uses slot-based accounting similar to Intel's approach while employing Arm event names and calculations tailored to the Neoverse architecture. +Stage 1 identifies high-level bottlenecks using the same four categories as Intel, but with Arm-specific PMU events and formulas. This stage uses slot-based accounting similar to Intel's approach while employing Arm event names and calculations tailored to the Neoverse architecture. #### Configure Arm-specific PMU counter formulas -Arm uses different top-down metrics based on different events but the concept remains similar to Intel's approach. The key difference lies in the formula calculations and slot accounting methodology: +Arm uses different top-down metrics based on different events, but the concept remains similar to Intel's approach. The key difference lies in the formula calculations and slot accounting methodology: | Metric | Formula | Purpose | | :-- | :-- | :-- | @@ -32,7 +32,9 @@ Stage 2 focuses on resource-specific effectiveness metrics grouped by CPU compon #### Navigate resource groups without hierarchical constraints -Instead of Intel's hierarchical levels, Arm organizes detailed metrics into effectiveness groups that can be explored independently. **Branch Effectiveness** provides misprediction rates and MPKI, while **ITLB/DTLB Effectiveness** measures translation lookaside buffer efficiency. **Cache Effectiveness** groups (L1I/L1D/L2/LL) deliver cache hit ratios and MPKI across the memory hierarchy. Additionally, **Operation Mix** breaks down instruction types (SIMD, integer, load/store), and **Cycle Accounting** tracks frontend versus backend stall percentages. +Instead of Intel's hierarchical levels, Arm organizes detailed metrics into effectiveness groups that can be explored independently. + +**Branch Effectiveness** provides misprediction rates and MPKI, while **ITLB/DTLB Effectiveness** measures translation lookaside buffer efficiency. **Cache Effectiveness** groups (L1I/L1D/L2/LL) deliver cache hit ratios and MPKI across the memory hierarchy. Additionally, **Operation Mix** breaks down instruction types (SIMD, integer, load/store), and **Cycle Accounting** tracks frontend versus backend stall percentages. 
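To illustrate how Stage 1 can be derived from raw counter values, the following Python sketch applies the commonly published slot-based formulas for an 8-slot Neoverse core. It assumes the additional standard events `CPU_CYCLES` and `STALL_SLOT` alongside the counters listed in this Learning Path, uses placeholder counts, and omits the small per-core correction terms (for example, for branch mispredicts) that Arm's telemetry specifications add, so treat it as an approximation rather than a replacement for `topdown-tool`:

```python
# Approximate Arm Neoverse Stage 1 top-down metrics from raw PMU counts.
# Sketch only: real Neoverse telemetry formulas add per-core correction terms.

def neoverse_stage1(cpu_cycles, op_spec, op_retired,
                    stall_slot, stall_slot_frontend, stall_slot_backend,
                    slots_per_cycle=8):
    total_slots = cpu_cycles * slots_per_cycle
    retired_ratio = op_retired / op_spec           # useful fraction of issued ops
    issued = 1 - stall_slot / total_slots          # slots that were not stalled
    return {
        "Frontend Bound": stall_slot_frontend / total_slots,
        "Backend Bound": stall_slot_backend / total_slots,
        "Retiring": retired_ratio * issued,
        "Bad Speculation": (1 - retired_ratio) * issued,
    }

# Placeholder counter values for a heavily backend-bound workload
metrics = neoverse_stage1(cpu_cycles=1_000_000, op_spec=900_000, op_retired=850_000,
                          stall_slot=7_500_000, stall_slot_frontend=100_000,
                          stall_slot_backend=7_400_000)
for name, value in metrics.items():
    print(f"{name:16s} {value:6.1%}")
```

A workload like the floating-point division example later in this Learning Path would show Backend Bound dominating, matching the high backend stall percentages reported by `topdown-tool`.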
## Apply essential Arm Neoverse PMU counters for analysis diff --git a/content/learning-paths/cross-platform/topdown-compare/1c-compare-arch.md b/content/learning-paths/cross-platform/topdown-compare/1c-compare-arch.md index b87f0b03b0..f1541c9302 100644 --- a/content/learning-paths/cross-platform/topdown-compare/1c-compare-arch.md +++ b/content/learning-paths/cross-platform/topdown-compare/1c-compare-arch.md @@ -13,7 +13,7 @@ After understanding each architecture's methodology individually, you can now ex - Hierarchical analysis: broad classification followed by drill-down into dominant bottlenecks - Resource attribution: map performance issues to specific CPU micro-architectural components -## Compare 4-level hierarchical and 2-stage methodologies +## Compare multi-level hierarchical and resource groups methodologies | Aspect | Intel x86 | Arm Neoverse | | :-- | :-- | :-- | diff --git a/content/learning-paths/cross-platform/topdown-compare/2-code-examples.md b/content/learning-paths/cross-platform/topdown-compare/2-code-examples.md index bf7f23ec71..40e2e7152e 100644 --- a/content/learning-paths/cross-platform/topdown-compare/2-code-examples.md +++ b/content/learning-paths/cross-platform/topdown-compare/2-code-examples.md @@ -98,9 +98,9 @@ S0-D0-C1 1 8.5% 0.0% 0 6.052117775 seconds time elapsed ``` -You see a very large `backend bound` component for this program. +You see a very large `backend bound` component for this program. -You can also run with the `-M topdownl1` argument on Perf. +You can also run with the `-M topdownl1` argument with Perf. ```console taskset -c 1 perf stat -C 1 -M topdownl1 ./test 1000000000 @@ -129,7 +129,7 @@ Done. Final result: 0.000056 6.029283206 seconds time elapsed ``` -Again, showing `Backend_Bound` value very high (0.96). Notice the x86-specific PMU counters: +Again, showing a `Backend_Bound` value that is very high (0.96). Notice the x86-specific PMU counters: - `uops_issued.any` and `uops_retired.retire_slots` for micro-operation accounting - `idq_uops_not_delivered.core` for frontend delivery failures - `cpu_clk_unhalted.thread` for cycle normalization @@ -137,13 +137,13 @@ Again, showing `Backend_Bound` value very high (0.96). Notice the x86-specific P If you want to learn more, you can continue with the Level 2 and Level 3 hierarchical analysis. -## Use the Arm Neoverse 2-stage top-down methodology +## Use the Arm Neoverse top-down methodology -Arm's approach uses a 2-stage methodology with PMU counters like `STALL_SLOT_BACKEND`, `STALL_SLOT_FRONTEND`, `OP_RETIRED`, and `OP_SPEC` for Stage 1 analysis, followed by resource effectiveness groups in Stage 2. +Arm's approach uses a methodology with PMU counters like `STALL_SLOT_BACKEND`, `STALL_SLOT_FRONTEND`, `OP_RETIRED`, and `OP_SPEC` for Stage 1 analysis, followed by resource effectiveness groups in Stage 2. Make sure you install the Arm topdown-tool using the [Telemetry Solution install guide](/install-guides/topdown-tool/). -Collect Stage 2 general metrics including Instructions Per Cycle (IPC): +Collect general metrics including Instructions Per Cycle (IPC): ```console taskset -c 1 topdown-tool -m General ./test 1000000000 @@ -153,11 +153,17 @@ The output is similar to: ```output Performing 1000000000 dependent floating-point divisions... +Monitoring command: test. Hit Ctrl-C to stop. +Run 1 Done. 
Final result: 0.000056 -Stage 2 (uarch metrics) -======================= -[General] -Instructions Per Cycle 0.355 per cycle +CPU Neoverse V2 metrics +└── Stage 2 (uarch metrics) + └── General (General) + └── ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓ + ┃ Metric ┃ Value ┃ Unit ┃ + ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩ + │ Instructions Per Cycle │ 0.324 │ per cycle │ + └────────────────────────┴───────┴───────────┘ ``` Collect the Stage 1 topdown metrics using Arm's cycle accounting: @@ -170,52 +176,74 @@ The output is similar to: ```output Performing 1000000000 dependent floating-point divisions... +Monitoring command: test. Hit Ctrl-C to stop. +Run 1 Done. Final result: 0.000056 -Stage 1 (Topdown metrics) -========================= -[Cycle Accounting] -Frontend Stalled Cycles 0.04% cycles -Backend Stalled Cycles. 88.15% cycles +CPU Neoverse V2 metrics +└── Stage 2 (uarch metrics) + └── Cycle Accounting (Cycle_Accounting) + └── ┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━┓ + ┃ Metric ┃ Value ┃ Unit ┃ + ┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━┩ + │ Backend Stalled Cycles │ 93.22 │ % │ + │ Frontend Stalled Cycles │ 0.03 │ % │ + └─────────────────────────┴───────┴──────┘ ``` -This confirms the example has high backend stalls equivalent to x86's Backend_Bound category. Notice how Arm's Stage 1 uses percentage of cycles rather than Intel's slot-based accounting. +This confirms the example has high backend stalls, equivalent to x86's Backend_Bound category. Notice how Arm's Stage 1 uses percentage of cycles rather than Intel's slot-based accounting. You can continue to use the `topdown-tool` for additional microarchitecture exploration. For L1 data cache: ```console -taskset -c 1 topdown-tool -m L1D_Cache_Effectiveness ./test 1000000000 +taskset -c 1 topdown-tool -m L1D_Cache_Effectiveness ./test 1000000000 ``` The output is similar to: ```output Performing 1000000000 dependent floating-point divisions... +Monitoring command: test. Hit Ctrl-C to stop. +Run 1 Done. Final result: 0.000056 -Stage 2 (uarch metrics) -======================= -[L1 Data Cache Effectiveness] -L1D Cache MPKI............... 0.023 misses per 1,000 instructions -L1D Cache Miss Ratio......... 0.000 per cache access +CPU Neoverse V2 metrics +└── Stage 2 (uarch metrics) + └── L1 Data Cache Effectiveness (L1D_Cache_Effectiveness) + ├── Follows + │ └── Backend Bound (backend_bound) + └── ┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ + ┃ Metric ┃ Value ┃ Unit ┃ + ┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ + │ L1D Cache Miss Ratio │ 0.000 │ per cache access │ + │ L1D Cache MPKI │ 0.129 │ misses per 1,000 instructions │ + └──────────────────────┴───────┴───────────────────────────────┘ ``` For L1 instruction cache effectiveness: ```console -taskset -c 1 topdown-tool -m L1D_Cache_Effectiveness ./test 1000000000 +taskset -c 1 topdown-tool -m L1I_Cache_Effectiveness ./test 1000000000 ``` The output is similar to: ```output Performing 1000000000 dependent floating-point divisions... +Monitoring command: test. Hit Ctrl-C to stop. +Run 1 Done. Final result: 0.000056 -Stage 2 (uarch metrics) -======================= -[L1 Data Cache Effectiveness] -L1D Cache MPKI............... 0.022 misses per 1,000 instructions -L1D Cache Miss Ratio......... 
0.000 per cache access +CPU Neoverse V2 metrics +└── Stage 2 (uarch metrics) + └── L1 Instruction Cache Effectiveness (L1I_Cache_Effectiveness) + ├── Follows + │ └── Frontend Bound (frontend_bound) + └── ┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ + ┃ Metric ┃ Value ┃ Unit ┃ + ┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ + │ L1I Cache Miss Ratio │ 0.003 │ per cache access │ + │ L1I Cache MPKI │ 0.474 │ misses per 1,000 instructions │ + └──────────────────────┴───────┴───────────────────────────────┘ ``` For last level cache: @@ -228,13 +256,22 @@ The output is similar to: ```output Performing 1000000000 dependent floating-point divisions... +Monitoring command: test. Hit Ctrl-C to stop. +Run 1 Done. Final result: 0.000056 -Stage 2 (uarch metrics) -======================= -[Last Level Cache Effectiveness] -LL Cache Read MPKI.............. 0.017 misses per 1,000 instructions -LL Cache Read Miss Ratio........ 0.802 per cache access -LL Cache Read Hit Ratio......... 0.198 per cache access +CPU Neoverse V2 metrics +└── Stage 2 (uarch metrics) + └── Last Level Cache Effectiveness (LL_Cache_Effectiveness) + ├── Follows + │ ├── Backend Bound (backend_bound) + │ └── Frontend Bound (frontend_bound) + └── ┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ + ┃ Metric ┃ Value ┃ Unit ┃ + ┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ + │ LL Cache Read Hit Ratio │ nan │ per cache access │ + │ LL Cache Read Miss Ratio │ nan │ per cache access │ + │ LL Cache Read MPKI │ 0.000 │ misses per 1,000 instructions │ + └──────────────────────────┴───────┴───────────────────────────────┘ ``` For operation mix: @@ -247,25 +284,38 @@ The output is similar to: ```output Performing 1000000000 dependent floating-point divisions... +Monitoring command: test. Hit Ctrl-C to stop. +Run 1 Done. Final result: 0.000056 -Stage 2 (uarch metrics) -======================= -[Speculative Operation Mix] -Load Operations Percentage.......... 16.70% operations -Store Operations Percentage......... 16.59% operations -Integer Operations Percentage....... 33.61% operations -Advanced SIMD Operations Percentage. 0.00% operations -Floating Point Operations Percentage 16.45% operations -Branch Operations Percentage........ 16.65% operations -Crypto Operations Percentage........ 0.00% operations +CPU Neoverse V2 metrics +└── Stage 2 (uarch metrics) + └── Speculative Operation Mix (Operation_Mix) + ├── Follows + │ ├── Backend Bound (backend_bound) + │ └── Retiring (retiring) + └── ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━┓ + ┃ Metric ┃ Value ┃ Unit ┃ + ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━┩ + │ Barrier Operations Percentage │ ❌ │ % │ + │ Branch Operations Percentage │ ❌ │ % │ + │ Crypto Operations Percentage │ 0.00 │ % │ + │ Integer Operations Percentage │ 33.52 │ % │ + │ Load Operations Percentage │ 16.69 │ % │ + │ Floating Point Operations Percentage │ 16.51 │ % │ + │ Advanced SIMD Operations Percentage │ 0.00 │ % │ + │ Store Operations Percentage │ 16.58 │ % │ + │ SVE Operations (Load/Store Inclusive) Percentage │ 0.00 │ % │ + └──────────────────────────────────────────────────┴───────┴──────┘ ``` ## Cross-architecture performance analysis summary -Both Arm Neoverse and modern x86 cores expose hardware PMU events that enable equivalent top-down analysis, despite different counter names and calculation methods. 
Intel x86 processors use a four-level hierarchical methodology based on slot-based pipeline accounting, relying on PMU counters such as `UOPS_RETIRED.RETIRE_SLOTS`, `IDQ_UOPS_NOT_DELIVERED.CORE`, and `CPU_CLK_UNHALTED.THREAD` to break down performance into retiring, bad speculation, frontend bound, and backend bound categories. Linux Perf serves as the standard collection tool, using commands like `perf stat --topdown` and the `-M topdownl1` option for detailed breakdowns. +Both Arm Neoverse and modern x86 cores expose hardware PMU events that enable equivalent top-down analysis, despite different counter names and calculation methods. + +Intel x86 processors use a four-level hierarchical methodology based on slot-based pipeline accounting, relying on PMU counters such as `UOPS_RETIRED.RETIRE_SLOTS`, `IDQ_UOPS_NOT_DELIVERED.CORE`, and `CPU_CLK_UNHALTED.THREAD` to break down performance into retiring, bad speculation, frontend bound, and backend bound categories. Linux Perf serves as the standard collection tool, using commands like `perf stat --topdown` and the `-M topdownl1` option for detailed breakdowns. Arm Neoverse platforms implement a complementary two-stage methodology where Stage 1 focuses on topdown categories using counters such as `STALL_SLOT_BACKEND`, `STALL_SLOT_FRONTEND`, `OP_RETIRED`, and `OP_SPEC` to analyze pipeline stalls and instruction retirement. Stage 2 evaluates resource effectiveness, including cache and operation mix metrics through `topdown-tool`, which accepts the desired metric group via the `-m` argument. -Both architectures identify the same performance bottleneck categories, enabling similar optimization strategies across Intel and Arm platforms while accounting for methodological differences in measurement depth and analysis approach. +Both architectures identify the same performance bottleneck categories, enabling similar optimization strategies across Intel and Arm platforms while accounting for methodological differences in measurement depth and analysis approach. diff --git a/content/learning-paths/cross-platform/topdown-compare/_index.md b/content/learning-paths/cross-platform/topdown-compare/_index.md index d358ec630c..0acdd66b2e 100644 --- a/content/learning-paths/cross-platform/topdown-compare/_index.md +++ b/content/learning-paths/cross-platform/topdown-compare/_index.md @@ -1,16 +1,12 @@ --- title: Compare Arm Neoverse and Intel x86 top-down performance analysis with PMU counters -draft: true -cascade: - draft: true - minutes_to_complete: 30 who_is_this_for: This is an advanced topic for software developers and performance engineers who want to understand the similarities and differences between Arm Neoverse and Intel x86 top-down performance analysis using PMU counters, Linux Perf, and the topdown-tool. 
learning_objectives: - - Compare Intel x86 4-level hierarchical top-down methodology with Arm Neoverse 2-stage approach using PMU counters + - Compare Intel x86 multi-level hierarchical methodology with Arm Neoverse micro-architecture exploration methodology - Execute performance analysis using Linux Perf on x86 and topdown-tool on Arm systems - Analyze Backend Bound, Frontend Bound, Bad Speculation, and Retiring categories across both architectures diff --git a/content/learning-paths/embedded-and-microcontrollers/rpi-llama3/llama3.md b/content/learning-paths/embedded-and-microcontrollers/rpi-llama3/llama3.md index 5b576665d4..714ec069d3 100755 --- a/content/learning-paths/embedded-and-microcontrollers/rpi-llama3/llama3.md +++ b/content/learning-paths/embedded-and-microcontrollers/rpi-llama3/llama3.md @@ -90,7 +90,9 @@ cmake -DPYTHON_EXECUTABLE=python \ -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \ -DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON \ -DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=ON \ + -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \ -DEXECUTORCH_BUILD_EXTENSION_LLM=ON \ + -DEXECUTORCH_BUILD_KERNELS_LLM=ON \ -Bcmake-out . cmake --build cmake-out -j16 --target install --config Release ``` @@ -101,10 +103,7 @@ Next, compile and build `llama_runner` and `llama_main`: cmake -DPYTHON_EXECUTABLE=python \ -DCMAKE_INSTALL_PREFIX=cmake-out \ -DCMAKE_BUILD_TYPE=Release \ - -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \ -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \ - -DEXECUTORCH_BUILD_XNNPACK=ON \ - -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \ -Bcmake-out/examples/models/llama \ examples/models/llama cmake --build cmake-out/examples/models/llama -j16 --config Release diff --git a/content/learning-paths/laptops-and-desktops/win-resource-ps1/_index.md b/content/learning-paths/laptops-and-desktops/win-resource-ps1/_index.md new file mode 100644 index 0000000000..3e37b6e743 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/win-resource-ps1/_index.md @@ -0,0 +1,57 @@ +--- +title: Track resource usage of applications on Windows on Arm + +draft: true +cascade: + draft: true + +minutes_to_complete: 60 + +who_is_this_for: This is an introductory topic for developers who want to measure resource usage of applications on Windows on Arm devices. + +learning_objectives: + - Run video encode and decode tasks by using FFmpeg + - Benchmark video encode task + - Sample CPU / memory / power usage of video decode task + +prerequisites: + - A Windows on Arm computer such as the Lenovo Thinkpad X13s running Windows 11 + - Any code editor. [Visual Studio Code for Arm64](https://code.visualstudio.com/docs/?dv=win32arm64user) is suitable. 
+ +author: Ruifeng Wang + +### Tags +skilllevels: Introductory +subjects: Migration to Arm +armips: + - Cortex-A +tools_software_languages: + - FFmpeg + - PowerShell +operatingsystems: + - Windows + + + +further_reading: + - resource: + title: Recording for Resource-based Analysis + link: https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-8.1-and-8/hh448202(v=win.10) + type: documentation + - resource: + title: Get started with Arm64EC + link: https://learn.microsoft.com/en-us/windows/arm/arm64ec-build + type: documentation + - resource: + title: Arm64EC - Build and port apps for native performance on Arm + link: https://learn.microsoft.com/en-us/windows/arm/arm64ec + type: documentation + + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/laptops-and-desktops/win-resource-ps1/_next-steps.md b/content/learning-paths/laptops-and-desktops/win-resource-ps1/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/win-resource-ps1/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/laptops-and-desktops/win-resource-ps1/how-to-1.md b/content/learning-paths/laptops-and-desktops/win-resource-ps1/how-to-1.md new file mode 100644 index 0000000000..e07941f6d1 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/win-resource-ps1/how-to-1.md @@ -0,0 +1,86 @@ +--- +title: Application and data set +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Overview +System resource usage provides an approach to understand the performance of an application as a black box. This Learning Path demonstrates how to sample system resource usage by using a script. + +The application used is FFmpeg. It is a tool set that performs video encode and decode tasks. We will run the same tests with both x86_64 binary (through emulation) and Arm64 native binary. + +## Application +Binary builds are available. You don't need to build them from source. Download executable files for Windows: + +1. Download [x86_64 package](https://github.com/BtbN/FFmpeg-Builds/releases/download/autobuild-2025-07-31-14-15/ffmpeg-n7.1.1-56-gc2184b65d2-win64-gpl-7.1.zip). +2. Download [Arm64 native package](https://github.com/BtbN/FFmpeg-Builds/releases/download/autobuild-2025-07-31-14-15/ffmpeg-n7.1.1-56-gc2184b65d2-winarm64-gpl-7.1.zip). + +Unzip the downloaded packages. You can find the binaries in **bin** folder. Note paths to **ffmpeg.exe** and **ffplay.exe**. They are used in later steps. + +## Video source +Download test video [RaceNight](https://ultravideo.fi/video/RaceNight_3840x2160_50fps_420_8bit_YUV_RAW.7z) from a public dataset. 
Unzip the package and note path to the uncompressed yuv file. + +## Video encoding +The downloaded video file is in yuv raw format. It means playback of the video file involves no decoding effort. You need to encode the raw video with some compressing algorithms to add computation pressure at playback. + +Use **ffmpeg.exe** to compress the yuv raw video with x265 algorithm and convert file format to mp4. Open a terminal and run command: +```console +path\to\ffmpeg.exe -f rawvideo -pix_fmt yuv420p -s 3840x2160 -r 50 -i D:\path\to\RaceNight_YUV_RAW\RaceNight_3840x2160_50fps_8bit.yuv -vf scale=1920:1080 -c:v libx265 -preset medium -crf 20 D:\RaceNight_1080p.mp4 -benchmark -stats -report +``` + +{{% notice Note %}} +Modify the paths to `ffmpeg.exe` and yuv raw video file accordingly. +{{% /notice %}} + +The command transforms video size, compresses the video into a H.265 encoded mp4 file. `benchmark` option is turned on to show performance data at the same time. The generated file is at D:\RaceNight_1080p.mp4. + +### View results +Shown below is example output from running x86_64 version ffmpeg.exe: +```output +x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip mode=1 signhide tmvp +x265 [info]: tools: b-intra strong-intra-smoothing lslices=6 deblock sao +Output #0, mp4, to 'D:\RaceNight_1080p.mp4': + Metadata: + encoder : Lavf61.7.100 + Stream #0:0: Video: hevc (hev1 / 0x31766568), yuv420p(tv, progressive), 1920x1080, q=2-31, 50 fps, 12800 tbn + Metadata: + encoder : Lavc61.19.101 libx265 + Side data: + cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A +[out#0/mp4 @ 0000020e0d6f3880] video:13297KiB audio:0KiB subtitle:0KiB other streams:0KiB global headers:2KiB muxing overhead: 0.079970% +frame= 600 fps=8.4 q=29.4 Lsize= 13308KiB time=00:00:11.96 bitrate=9115.2kbits/s speed=0.167x +bench: utime=480.344s stime=10.203s rtime=71.548s +bench: maxrss=910112KiB +x265 [info]: frame I: 3, Avg QP:22.41 kb/s: 50202.13 +x265 [info]: frame P: 146, Avg QP:23.73 kb/s: 18265.18 +x265 [info]: frame B: 451, Avg QP:28.45 kb/s: 5827.62 +x265 [info]: Weighted P-Frames: Y:0.0% UV:0.0% + +encoded 600 frames in 71.51s (8.39 fps), 9075.96 kb/s, Avg QP:27.27 +``` + +Example output from running Arm64 native ffmpeg.exe: +```output +x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip mode=1 signhide tmvp +x265 [info]: tools: b-intra strong-intra-smoothing lslices=6 deblock sao +Output #0, mp4, to 'D:\RaceNight_1080p.mp4': + Metadata: + encoder : Lavf61.7.100 + Stream #0:0: Video: hevc (hev1 / 0x31766568), yuv420p(tv, progressive), 1920x1080, q=2-31, 50 fps, 12800 tbn + Metadata: + encoder : Lavc61.19.101 libx265 + Side data: + cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A +[out#0/mp4 @ 000001b3c215f8e0] video:13348KiB audio:0KiB subtitle:0KiB other streams:0KiB global headers:2KiB muxing overhead: 0.080169% +frame= 600 fps= 23 q=29.3 Lsize= 13359KiB time=00:00:11.96 bitrate=9150.2kbits/s speed=0.456x +bench: utime=169.891s stime=7.281s rtime=26.224s +bench: maxrss=1040836KiB +x265 [info]: frame I: 3, Avg QP:22.40 kb/s: 50457.20 +x265 [info]: frame P: 146, Avg QP:23.71 kb/s: 18246.21 +x265 [info]: frame B: 451, Avg QP:28.40 kb/s: 5878.38 +x265 [info]: Weighted P-Frames: Y:0.0% UV:0.0% + +encoded 600 frames in 26.20s (22.90 fps), 9110.78 kb/s, Avg QP:27.23 +``` \ No newline at end of file diff --git a/content/learning-paths/laptops-and-desktops/win-resource-ps1/how-to-2.md b/content/learning-paths/laptops-and-desktops/win-resource-ps1/how-to-2.md new file mode 100644 index 
0000000000..50708a2e35 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/win-resource-ps1/how-to-2.md @@ -0,0 +1,148 @@ +--- +title: Tracking system resource +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Sampling video decoding resource usage +A PowerShell script does all the work. It launches the video decoding task, samples CPU and memory usage, and outputs sampled data to a file with format. + +Open your code editor, copy content below and save it as `sample_decoding.ps1`. +```PowerShell { line_numbers = true } +param ( + [string]$exePath = "path\to\ffplay.exe", + [string[]]$argList = @("-loop", "15", "-autoexit", "D:\RaceNight_1080p.mp4"), + [int]$interval = 2, + [string]$outputFile = "usage_log.csv" +) + +"" | Out-File -FilePath $outputFile + +if (-Not (Test-Path $exePath)) { + Write-Host "Executable not found at path: $exePath" + exit 1 +} + +$zoneIdentifier = "$exePath`:Zone.Identifier" +if (Test-Path $zoneIdentifier) { + Write-Host "exe is locked. Trying to unlock..." + try { + Unblock-File -Path $exePath + Write-Host "Unlocked exe file." + } catch { + Write-Host "Failed to unlock exe: $($_.Exception.Message)" + } +} else { + Write-Host "exe is not locked." +} + +try { + $cmdLine = "`"$exePath`" $argList" + Write-Host "Executing: $cmdLine" + $process = Start-Process -FilePath $exePath -ArgumentList $argList -PassThru +} catch { + Write-Host "Failed to start process. Error: $_" + exit 1 +} + +$appPid = $process.Id +Write-Host "Parent PID: $appPid" + +Start-Sleep -Seconds 2 +$childProcess = Get-CimInstance -ClassName Win32_Process | Where-Object { $_.ParentProcessId -eq $appPid } + +$index = 1 +$outHead = @() +$outHead += "Timestamp,CPU Sum (s),Memory Sum (MB),Memory Private Sum (MB),CPU0 (s),Memory0 (MB),Memory Private0 (MB)" +foreach ($child in $childProcess) { + $childPid = $child.ProcessID + Write-Host " - Child: $childPid" + $outHead += "CPU$index (s),Memory$index (MB),Memory Private$index (MB)" + $index++ +} +$outHead -join "," | Out-File -Encoding utf8 $outputFile + +Write-Host "Sampling start..." + +while (-not $process.HasExited) { + $cpu = @() + $mem = @() + $memPriv = @() + $outLine = @() + + $timestamp = Get-Date -Format o + $outLine += $timestamp + $proc = Get-Process -Id $appPid -ErrorAction SilentlyContinue + if ($proc) { + $cpu += $proc.CPU + $mem += $proc.WorkingSet64 / 1MB + $memPriv += $proc.PrivateMemorySize64 / 1MB + + foreach ($child in $childProcess) { + $procChild = Get-Process -Id $child.ProcessId -ErrorAction SilentlyContinue + $cpu += $procChild.CPU + $mem += $procChild.WorkingSet64 / 1MB + $memPriv += $procChild.PrivateMemorySize64 / 1MB + } + + $outLine += ($cpu | Measure-Object -Sum).Sum + $outLine += "{0:F2}" -f ($mem | Measure-Object -Sum).Sum + $outLine += "{0:F2}" -f ($memPriv | Measure-Object -Sum).Sum + for ($i = 0; $i -lt $cpu.Count; $i++) { + $outLine += $cpu[$i] + $outLine += $mem[$i] + $outLine += $memPriv[$i] + } + + $outLine -join "," | Out-File -Append -Encoding utf8 $outputFile + } + + Start-Sleep -Seconds $interval + $process.Refresh() +} +``` + +{{% notice Note %}} +Modify the path to `ffplay.exe` on line 2 accordingly. +{{% /notice %}} + +Run the script: +```console +Set-ExecutionPolicy -Scope Process RemoteSigned +.\sample_decoding.ps1 +``` +A video starts playing. It ends in 3 minutes. And then you can find the sample results file **usage_log.csv** in current directory. + +{{% notice Note %}} +Script execution can be blocked due to policy configuration. 
The `Set-ExecutionPolicy` line allows local script to run during this session. +{{% /notice %}} + +### Script explained +The `param` section defines variables including binary path, video playback arguments, sampling interval and result file path. + +Line 15 - Line 26 check and modify binary file attribute. The binaries in use are downloaded from the web. They can be blocked to run due to lack of signature. These lines unlock the binaries. + +Line 41 gets all the child processes of the main process. The statistic data include resources used by all the processes spawned by the main process. + +The `while` setction collects processes' CPU and memory usage periodically until the application exits. The CPU usage is accumulated time length that the process runs on CPU. And the memory usage is size of memory occupation with or without shared spaces accounted. + +### View result +Shown below is example sample result from running x86_64 version ffplay.exe: +```output +Timestamp,CPU Sum (s),Memory Sum (MB),Memory Private Sum (MB),CPU0 (s),Memory0 (MB),Memory Private0 (MB),CPU1 (s),Memory1 (MB),Memory Private1 (MB) +2025-08-18T10:40:12.3480939+08:00,3.6875,378.65,342.16,3.671875,366.3515625,340.33984375,0.015625,12.296875,1.82421875 +...... +2025-08-18T10:43:09.7262439+08:00,396.375,391.71,355.00,396.359375,379.453125,353.2421875,0.015625,12.2578125,1.7578125 +``` + +Example result from running Arm64 native ffplay.exe: +```output +Timestamp,CPU Sum (s),Memory Sum (MB),Memory Private Sum (MB),CPU0 (s),Memory0 (MB),Memory Private0 (MB),CPU1 (s),Memory1 (MB),Memory Private1 (MB) +2025-08-18T10:36:04.3654823+08:00,3.296875,340.51,328.17,3.28125,328.18359375,326.359375,0.015625,12.32421875,1.8125 +...... +2025-08-18T10:39:01.7856168+08:00,329.109375,352.53,339.96,329.09375,340.23046875,338.20703125,0.015625,12.30078125,1.75390625 +``` + +The sample result file is in **csv** format. You can open it with spreadsheet applications like Microsoft Excel for a better view and plot lines for data analysis. diff --git a/content/learning-paths/laptops-and-desktops/win-resource-ps1/how-to-3.md b/content/learning-paths/laptops-and-desktops/win-resource-ps1/how-to-3.md new file mode 100644 index 0000000000..8365c33c66 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/win-resource-ps1/how-to-3.md @@ -0,0 +1,95 @@ +--- +title: Measuring power usage +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Sampling battery status +Querying battery status provides a way to measure power usage without an external power meter. It is also handy in that data collection and logging can be automatic. + +A PowerShell script does all the work. It launches the video decoding task, samples battery status, and outputs sampled data to a file with format. + +Open your code editor, copy content below and save it as `sample_power.ps1`. +```PowerShell { line_numbers = true } +param ( + [string]$exePath = "path\to\ffplay.exe", + [string[]]$argList = @("-loop", "150", "-autoexit", "D:\RaceNight_1080p.mp4"), + [int]$interval = 10, + [string]$outputFile = "watts.csv" +) + +# Clear or create log file +"" | Out-File -FilePath $outputFile + +try { + $cmdLine = "`"$exePath`" $argList" + Write-Host "Executing: $cmdLine" + $process = Start-Process -FilePath $exePath -ArgumentList $argList -PassThru +} catch { + Write-Host "Failed to start process. 
Error: $_" + exit 1 +} + +$appPid = $process.Id +Write-Host "Started application with PID: $appPid" + +$outHead = @() +$outHead += "Timestamp,RemainingCapacity(mWh),DischargeRate(mW)" +$outHead -join "," | Out-File -Encoding utf8 $outputFile + +Write-Host "Sampling start..." + +while (-not $process.HasExited) { + $outLine = @() + + $timestamp = Get-Date -Format o + $outLine += $timestamp + $proc = Get-Process -Id $appPid -ErrorAction SilentlyContinue + if ($proc) { + # Battery status sampling + $powerConsumption = Get-WmiObject -Namespace "root\wmi" -Class "BatteryStatus" + $remainingCapacity = $powerConsumption.RemainingCapacity + $dischargeRate = $powerConsumption.DischargeRate + $outLine += $remainingCapacity + $outLine += $dischargeRate + + $outLine -join "," | Out-File -Append -Encoding utf8 $outputFile + } + + Start-Sleep -Seconds $interval + $process.Refresh() +} +``` + +{{% notice Note %}} +Modify the path to `ffplay.exe` on line 2 accordingly. +{{% /notice %}} + +The battery data is system based and process agnostic. Full charge the battery. Close any unnecessary applications. Unplug the power cord. And run the script: +```console +.\sample_power.ps1 +``` +A video starts playing. It ends in 30 minutes. And then you can find the sample results file **watts.csv** in current directory. The test runs for a longer time so you can observe a distinct battery remaining capacity drop. + +The script collects battery remaining capacity and discharge rate periodically. You can track the battery remaining capacity to have an understanding of the power consumption. + +### View result +Shown below is example sample result from running x86_64 version ffplay.exe: +```output +Timestamp,RemainingCapacity(mWh),DischargeRate(mW) +2025-08-15T14:42:50.5231628+08:00,48438,4347 +...... +2025-08-15T15:12:38.2028188+08:00,43823,8862 +``` + +Example result from running Arm64 native ffplay.exe: +```output +Timestamp,RemainingCapacity(mWh),DischargeRate(mW) +2025-08-15T15:53:05.8430758+08:00,48438,3255 +...... +2025-08-15T16:22:55.3163530+08:00,44472,7319 +``` + +The sample result file is in **csv** format. You can open it with spreadsheet applications like Microsoft Excel for a better view and plot lines for data analysis. diff --git a/content/learning-paths/mobile-graphics-and-gaming/build-llama3-chat-android-app-using-executorch-and-xnnpack/5-run-benchmark-on-android.md b/content/learning-paths/mobile-graphics-and-gaming/build-llama3-chat-android-app-using-executorch-and-xnnpack/5-run-benchmark-on-android.md index 158c47cbc8..a9422241fc 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/build-llama3-chat-android-app-using-executorch-and-xnnpack/5-run-benchmark-on-android.md +++ b/content/learning-paths/mobile-graphics-and-gaming/build-llama3-chat-android-app-using-executorch-and-xnnpack/5-run-benchmark-on-android.md @@ -22,9 +22,9 @@ export ANDROID_NDK=$ANDROID_HOME/ndk/28.0.12433566/ Make sure you can confirm $ANDROID_NDK/build/cmake/android.toolchain.cmake is available for CMake to cross-compile. {{% /notice %}} -### 2. Build ExecuTorch and associated libraries for Android with KleidiAI +### 2. Build ExecuTorch and associated libraries for Android with KleidiAI -You are now ready to build ExecuTorch for Android by taking advantage of the performance optimization provided by the [KleidiAI](https://gitlab.arm.com/kleidi/kleidiai) kernels. 
+You are now ready to build ExecuTorch for Android by taking advantage of the performance optimization provided by the [KleidiAI](https://gitlab.arm.com/kleidi/kleidiai) kernels. Use `cmake` to cross-compile ExecuTorch: @@ -119,7 +119,7 @@ adb push cmake-out-android/examples/models/llama/llama_main /data/local/tmp/llam Use the Llama runner to execute the model on the phone with the `adb` command: ``` bash -adb shell "cd /data/local/tmp/llama && ./llama_main --model_path llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte --tokenizer_path tokenizer.model --prompt "<|start_header_id|>system<|end_header_id|>\nYour name is Cookie. you are helpful, polite, precise, concise, honest, good at writing. You always give precise and brief answers up to 32 words<|eot_id|><|start_header_id|>user<|end_header_id|>\nHey Cookie! how are you today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>" --warmup=1 --cpu_threads=5" +adb shell "cd /data/local/tmp/llama && ./llama_main --model_path llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte --tokenizer_path tokenizer.model --prompt '<|start_header_id|>system<|end_header_id|>\nYour name is Cookie. you are helpful, polite, precise, concise, honest, good at writing. You always give precise and brief answers up to 32 words<|eot_id|><|start_header_id|>user<|end_header_id|>\nHey Cookie! how are you today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>' --warmup=1 --cpu_threads=5" ``` The output should look something like this. diff --git a/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/1-introduction.md b/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/1-introduction.md new file mode 100644 index 0000000000..555322717a --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/1-introduction.md @@ -0,0 +1,34 @@ +--- +title: Install Model Gym and Explore Neural Graphics Examples +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## What is Neural Graphics? + +Neural graphics is an intersection of graphics and machine learning. Rather than relying purely on traditional GPU pipelines, neural graphics integrates learned models directly into the rendering stack. The techniques are particularly powerful on mobile devices, where battery life and performance constraints limit traditional compute-heavy rendering approaches. The goal is to deliver high visual fidelity without increasing GPU cost. This is achieved by training and deploying compact neural networks optimized for the device's hardware. + +## How does Arm support neural graphics? + +Arm enables neural graphics through the [**Neural Graphics Development Kit**](https://developer.arm.com/mobile-graphics-and-gaming/neural-graphics): a set of open-source tools that let developers train, evaluate, and deploy ML models for graphics workloads. + +At its core are the ML Extensions for Vulkan, which bring native ML inference into the GPU pipeline using structured compute graphs. These extensions (`VK_ARM_tensors` and `VK_ARM_data_graph`) allow real-time upscaling and similar effects to run efficiently alongside rendering tasks. + +The neural graphics models can be developed using well-known ML frameworks like PyTorch, and exported to deployment using Arm's hardware-aware pipeline. The workflow converts the model to `.vgf` via the TOSA intermediate representation, making it possible to do tailored model development for you game use-case. 
This Learning Path focuses on **Neural Super Sampling (NSS)** as the use case for training, evaluating, and deploying neural models using a toolkit called the [**Neural Graphics Model Gym**](https://github.com/arm/neural-graphics-model-gym). To learn more about NSS, you can check out the [resources on Hugging Face](https://huggingface.co/Arm/neural-super-sampling). Additonally, Arm has developed a set of Vulkan Samples to get started. Specifically, `.vgf` format is introduced in the `postprocessing_with_vgf` one. The Vulkan Samples and over-all developer resources for neural graphics is covered in the [introductory Learning Path](/learning-paths/mobile-graphics-and-gaming/vulkan-ml-sample). + +Starting in 2026, Arm GPUs will feature dedicated neural accelerators, optimized for low-latency inference in graphics workloads. To help developers get started early, Arm provides the ML Emulation Layers for Vulkan that simulate future hardware behavior, so you can build and test models now. + +## What is the Neural Graphics Model Gym? + +The Neural Graphics Model Gym is an open-source toolkit for fine-tuning and exporting neural graphics models. It is designed to streamline the entire model lifecycle for graphics-focused use cases, like NSS. + +Model Gym gives you: + +- A training and evaluation API built on PyTorch +- Model export to .vgf using ExecuTorch for real-time use in game development +- Support for quantization-aware training (QAT) and post-training quantization (PTQ) using ExecuTorch +- Optional Docker setup for reproducibility + +The toolkit supports workflows via both Python notebooks (for rapid experimentation) and command-line interface. This Learning Path will walk you through the demonstrative notebooks, and prepare you to start using the CLI for your own model development. diff --git a/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/2-devenv.md b/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/2-devenv.md new file mode 100644 index 0000000000..013148bcaa --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/2-devenv.md @@ -0,0 +1,58 @@ +--- +title: Set up your environment +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +In this section, you will install a few dependencies into your Ubuntu environment. You'll need a working Python 3.10+ environment with some ML and system dependencies. Make sure Python is installed by verifying that the version is >3.10: + +```bash +python3 --version +``` + +Next, install a few additional packages: + +```bash +sudo apt update +sudo apt install python3-venv python-is-python3 gcc make python3-dev -y +``` + +## Set up the examples repository + +The example notebooks are open-sourced in a GitHub repository. 
Start by cloning it: + +```bash +git clone https://github.com/arm/neural-graphics-model-gym-examples.git +cd neural-graphics-model-gym-examples +``` + +From inside the `neural-graphics-model-gym-examples/` folder, run the setup script: + +```bash +./setup.sh +``` + +This will: +- create a Python virtual environment called `nb-env` +- install the `ng-model-gym` package and required dependencies +- download the datasets and weights needed to run the notebooks + +Activate the virtual environment: + +```bash +source nb-env/bin/activate +``` + +Run the following in a python shell to confirm that the script was successful: + +```python +import torch +import ng_model_gym + +print("Torch version:", torch.__version__) +print("Model Gym version:", ng_model_gym.__version__) +``` + +You’re now ready to start walking through the training and evaluation steps. diff --git a/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/3-model-training.md b/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/3-model-training.md new file mode 100644 index 0000000000..37deaf987c --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/3-model-training.md @@ -0,0 +1,67 @@ +--- +title: Launch the training notebook +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +In this section, you'll get hands-on with how you can use the model gym to fine-tune the NSS use-case. + +## About NSS + +Arm Neural Super Sampling (NSS) is an upscaling technique designed to solve a growing challenge in real-time graphics: delivering high visual quality without compromising performance or battery life. Instead of rendering every pixel at full resolution, NSS uses a neural network to intelligently upscale frames, freeing up GPU resources and enabling smoother, more immersive experiences on mobile devices. + +The NSS model is available in two formats: + +| Model format | File extension | Used for | +|--------------|----------------|--------------------------------------------------------------------------| +| PyTorch | .pt | training, fine-tuning, or evaluation in or scripts using the Model Gym | +| VGF | .vgf | for deployment using ML Extensions for Vulkan on Arm-based hardware or emulation layers | + +Both formats are available in the [NSS repository on Hugging Face](https://huggingface.co/Arm/neural-super-sampling). You'll also be able to explore config files, model metadata, usage details and detailed documentation on the use-case. + +Aside from the model in HuggingFace, the Neural Graphics Development Kit features [an NSS plugin for game engines such as Unreal](/learning-paths/mobile-graphics-and-gaming/nss-unreal). + +## Run the training notebook + +With your environment set up, you're ready to launch the first step in the workflow: training your neural graphics model using the `model_training_example.ipynb` notebook. + +{{% notice Before you begin %}} +In this part of the Learning Path, you will run through two Jupyter Notebooks. Return to this tutorial when you're done to explore further resources and next steps. 
+{{% /notice %}}
+
+You will become familiar with the following steps:
+
+- Loading a model configuration
+- Launching a full training pipeline
+- Visualizing metrics with TensorBoard
+- Saving intermediate checkpoints
+
+### Start Jupyter Lab
+
+Launch Jupyter Lab with the following command:
+
+```bash
+jupyter lab
+```
+
+This prompts you to open your browser to `http://localhost:8888` and enter the token that is printed in the terminal output. Navigate to:
+
+```output
+neural-graphics-model-gym-examples/tutorials/nss/model_training_example.ipynb
+```
+
+Step through the notebook for training.
+
+Once your model is trained, the next step is evaluation. You'll measure accuracy, compare checkpoints, and prepare the model for export. Open the evaluation notebook at the following location:
+
+```output
+neural-graphics-model-gym-examples/tutorials/nss/model_evaluation_example.ipynb
+```
+
+At the end, you should see a visual comparison of the NSS upscaling and the ground truth image.
+
+Proceed to the final section to view the model structure and explore further resources.
+
+
diff --git a/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/4-model-explorer.md b/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/4-model-explorer.md
new file mode 100644
index 0000000000..7d3e720066
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/4-model-explorer.md
@@ -0,0 +1,49 @@
+---
+title: Visualize your model with Model Explorer
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## What is Model Explorer?
+
+Model Explorer is a visualization tool for inspecting neural network structures and execution graphs. Arm provides a VGF adapter for Model Explorer, allowing you to visualize `.vgf` models created from your training and export pipeline.
+
+This lets you inspect model architecture, tensor shapes, and graph connectivity before deployment. This can be a powerful way to debug and understand your exported neural graphics models.
+
+## Set up the VGF adapter
+
+The VGF adapter extends Model Explorer to support `.vgf` files exported from the Model Gym toolchain.
+
+### Install the VGF adapter with pip
+
+```bash
+pip install vgf-adapter-model-explorer
+```
+
+The source code is available on [GitHub](https://github.com/arm/vgf-adapter-model-explorer).
+
+### Install Model Explorer
+
+The next step is to make sure Model Explorer itself is installed. Use pip to set it up:
+
+```bash
+pip install torch ai-edge-model-explorer
+```
+
+### Launch the viewer
+
+Once installed, launch the explorer with the VGF adapter:
+
+```bash
+model-explorer --extensions=vgf_adapter_model_explorer
+```
+
+Use the file browser to open the `.vgf` model exported earlier in your training workflow.
+
+## Wrapping up
+
+Through this Learning Path, you’ve learned what neural graphics is and why it matters for game performance. You’ve stepped through the process of training and evaluating an NSS model using PyTorch and the Model Gym, and seen how to export that model into VGF (.vgf) for real-time deployment. You’ve also explored how to visualize and inspect the model’s structure using Model Explorer.
+
+As a next step, you can head over to the [Model Training Gym repository](https://github.com/arm/neural-graphics-model-gym/tree/main) documentation to explore integration into your own game development workflow.
You’ll find resources on fine-tuning, deeper details about the training and export process, and everything you need to adapt to your own content and workflows. \ No newline at end of file diff --git a/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/_index.md b/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/_index.md new file mode 100644 index 0000000000..0c809ef4ef --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/_index.md @@ -0,0 +1,59 @@ +--- +title: Fine-Tuning Neural Graphics Models with Model Gym + +draft: true +cascade: + draft: true + +minutes_to_complete: 45 + +who_is_this_for: This is an advanced topic for developers exploring neural graphics and interested in training and deploying upscaling models like Neural Super Sampling (NSS) using PyTorch and Arm’s hardware-aware backend. + +learning_objectives: + - Understand the principles of neural graphics and how it’s applied to game performance + - Learn how to fine-tune and evaluate a neural network for Neural Super Sampling (NSS) + - Use the Model Gym Python API and CLI to configure and train neural graphics models + - Visualize and inspect .vgf models using the Model Explorer tool + +prerequisites: + - Basic understanding of PyTorch and machine learning concepts + - A development machine running Ubuntu 22.04, with a CUDA-capable NVIDIA® GPU + - CUDA Toolkit version 11.8 or later + +author: Annie Tallund + +### Tags +skilllevels: Advanced +subjects: ML +armips: + - Mali +tools_software_languages: + - PyTorch + - Jupyter Notebook + - Vulkan +operatingsystems: + - Linux +further_reading: + - resource: + title: Model Gym GitHub Repository + link: https://github.com/arm/neural-graphics-model-gym + type: code + - resource: + title: NSS Fine-Tuning Guide + link: https://developer.arm.com/documentation/111141/latest + type: documentation + - resource: + title: Neural Graphics Development Kit + link: https://developer.arm.com/mobile-graphics-and-gaming/neural-graphics + type: website + - resource: + title: NSS on HuggingFace + link: https://huggingface.co/Arm/neural-super-sampling + type: website + + +### FIXED, DO NOT MODIFY +weight: 1 +layout: "learningpathall" +learning_path_main_page: "yes" +--- diff --git a/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/_next-steps.md b/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. 
+--- diff --git a/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/1-devenv-and-model.md b/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/1-devenv-and-model.md index b7118c9a15..1c8d42ec2e 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/1-devenv-and-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/1-devenv-and-model.md @@ -18,9 +18,9 @@ sudo apt update sudo apt install cmake git-lfs -y ``` -You can use Android Studio to obtain the NDK. +You can use Android Studio to obtain the NDK. -Click **Tools > SDK Manager** and navigate to the **SDK Tools** tab. +Click **Tools > SDK Manager** and navigate to the **SDK Tools** tab. Select the **NDK (Side by side)** and **CMake** checkboxes, as shown below: @@ -55,7 +55,7 @@ source vision_llm/bin/activate ## Set up Phone Connection -You need to set up an authorized connection with your phone. The Android SDK Platform Tools package, included with Android Studio, provides Android Debug Bridge (ADB) for transferring files. +You need to set up an authorized connection with your phone. The Android SDK Platform Tools package, included with Android Studio, provides Android Debug Bridge (ADB) for transferring files. Connect your phone to your computer using a USB cable, and enable USB debugging on your phone. To do this, tap the **Build Number** in your **Settings** app 7 times, then enable **USB debugging** in **Developer Options**. @@ -79,7 +79,9 @@ The pre-quantized model is available in Hugging Face, you can download with the ```bash git lfs install git clone https://huggingface.co/taobao-mnn/Qwen2.5-VL-3B-Instruct-MNN +cd Qwen2.5-VL-3B-Instruct-MNN git checkout a4622194b3c518139e2cb8099e147e3d71975f7a +cd .. ``` ## (Optional) Download and Convert the Model @@ -133,11 +135,11 @@ Verify that the model was built correctly by checking that the `Qwen2.5-VL-3B-In ## Push the model to Android device -Push the model onto the device: +Push the repository you cloned earlier onto the device: ```shell adb shell mkdir /data/local/tmp/models/ -adb push Qwen2.5-VL-3B-Instruct-MNN /data/local/tmp/models +adb push Qwen2.5-VL-3B-Instruct-MNN/ /data/local/tmp/models ``` With the model set up, you're ready to build and run an example application. diff --git a/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/2-generate-apk.md b/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/2-generate-apk.md new file mode 100644 index 0000000000..6e4de6e791 --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/2-generate-apk.md @@ -0,0 +1,53 @@ +--- +title: Benchmark the Vision Transformer performance with KleidiAI +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Clone Vision Language Models repo + +In this section, you will run the Qwen model in action using a demo application using a Android Package Kit (APK). + +This repository is set up to enable building the app as an Android Studio project. 
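+
+The demo app loads the model you pushed to the device in the previous section. Before continuing, you can optionally confirm that the model files are present on the device. This is a quick sanity check, assuming you used the same destination path as earlier:
+
+```bash
+# List the contents of the models directory created in the previous section
+adb shell ls /data/local/tmp/models
+```
+
+If the listing shows the model directory or its files, the device is ready for the app.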
+ +Run the following commands to clone the repository and checkout the source tree: + +```bash +git clone https://gitlab.arm.com/kleidi/kleidi-examples/vision-language-models +``` + +## Build the App Using Android Studio + +You can use Android Studio to build the app and create an APK. + +### Open project and build + +Open Android Studio. + +Go to **File > Open**. + +Navigate to the vision-language-models directories, and click `Open`. + +This triggers a build of the project, and you should see output similar to the following on completion: + +```output +BUILD SUCCESSFUL in 1m 42s +``` + +### Generate and Run the APK + +Navigate to **Build > Generate App Bundles or APKs**. Select **Generate APKs**. + +The build will be executed, and then the app will be copied and installed on the Android device. + +After opening the app, you will see the splash screen: + +![Loading screenshot](Loading_page.png) + +Finally, you can use the UI to chat with the app. Try uploading an image and ask a question on it. + +![Loading screenshot](chat2.png) + +The final step is to examine how KleidiAI can improve the performance of the model. Continue to the next section to find out. \ No newline at end of file diff --git a/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/2-benchmark.md b/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/3-benchmark.md similarity index 85% rename from content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/2-benchmark.md rename to content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/3-benchmark.md index 863eb1a49c..3849c44ed1 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/2-benchmark.md +++ b/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/3-benchmark.md @@ -1,6 +1,6 @@ --- title: Build the MNN Command-line ViT Demo -weight: 4 +weight: 5 ### FIXED, DO NOT MODIFY layout: learningpathall @@ -29,7 +29,7 @@ Run the following commands to clone the MNN repository and checkout the source t cd $HOME git clone https://github.com/alibaba/MNN.git cd MNN -git checkout a739ea5870a4a45680f0e36ba9662ca39f2f4eec +git checkout fa3b2161a9b38ac1e7dc46bb20259bd5eb240031 ``` Create a build directory and run the build script. @@ -40,10 +40,9 @@ The first time that you do this, build the binaries with the `-DMNN_KLEIDIAI` fl cd $HOME/MNN/project/android mkdir build_64 && cd build_64 -../build_64.sh "-DMNN_LOW_MEMORY=true -DLLM_SUPPORT_VISION=true -DMNN_KLEIDIAI=FALSE \ - -DMNN_CPU_WEIGHT_DEQUANT_GEMM=true -DMNN_BUILD_LLM=true \ - -DMNN_SUPPORT_TRANSFORMER_FUSE=true -DMNN_ARM82=true -DMNN_OPENCL=true \ - -DMNN_USE_LOGCAT=true -DMNN_IMGCODECS=true -DMNN_BUILD_OPENCV=true" +../build_64.sh "-DMNN_BUILD_LLM=true -DMNN_BUILD_LLM_OMNI=ON -DLLM_SUPPORT_VISION=true \ +-DMNN_BUILD_OPENCV=true -DMNN_IMGCODECS=true -DMNN_LOW_MEMORY=true \ +-DMNN_CPU_WEIGHT_DEQUANT_GEMM=true -DMNN_BUILD_LLM=true -DMNN_SUPPORT_TRANSFORMER_FUSE=true" ``` {{% notice Note %}} If your NDK toolchain isn't set up correctly, you might run into issues with the above script. Make a note of where the NDK was installed - this will be a directory named after the version you downloaded earlier. 
Try exporting the following environment variables before re-running `build_64.sh`: @@ -102,14 +101,19 @@ prefill speed = 192.28 tok/s ## Enable KleidiAI and Re-run Inference -The next step is to re-generate the binaries with KleidiAI activated. This is done by updating the flag `-DMNN_KLEIDIAI` to `TRUE`. +The next step is to re-generate the binaries with KleidiAI activated. This is done by inserting a hint into the code. + +From the `MNN` directory, run: +```bash +sed -i '/void Llm::setRuntimeHint(std::shared_ptr &rtg) {/a\ + rtg->setHint(MNN::Interpreter::CPU_ENABLE_KLEIDIAI, 1);' transformers/llm/engine/src/llm.cpp +``` From the `build_64` directory, run: ```bash -../build_64.sh "-DMNN_LOW_MEMORY=true -DLLM_SUPPORT_VISION=true -DMNN_KLEIDIAI=TRUE \ --DMNN_CPU_WEIGHT_DEQUANT_GEMM=true -DMNN_BUILD_LLM=true \ --DMNN_SUPPORT_TRANSFORMER_FUSE=true -DMNN_ARM82=true -DMNN_OPENCL=true \ --DMNN_USE_LOGCAT=true -DMNN_IMGCODECS=true -DMNN_BUILD_OPENCV=true" +../build_64.sh "-DMNN_BUILD_LLM=true -DMNN_BUILD_LLM_OMNI=ON -DLLM_SUPPORT_VISION=true \ +-DMNN_BUILD_OPENCV=true -DMNN_IMGCODECS=true -DMNN_LOW_MEMORY=true \ +-DMNN_CPU_WEIGHT_DEQUANT_GEMM=true -DMNN_BUILD_LLM=true -DMNN_SUPPORT_TRANSFORMER_FUSE=true" ``` ## Update Files on the Device diff --git a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/_index.md b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/_index.md index a841bab945..67d34163a6 100644 --- a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/_index.md @@ -1,17 +1,16 @@ --- -title: Learn about the impact of network interrupts on cloud workloads +title: Optimize network interrupt handling on Arm servers -draft: true -cascade: - draft: true - + minutes_to_complete: 20 -who_is_this_for: This is a specialized topic for developers and performance engineers who are interested in understanding how network interrupt patterns can impact performance on cloud servers. +who_is_this_for: This is an introductory topic for developers and performance engineers who are interested in understanding how network interrupt patterns can impact performance on cloud servers. 
learning_objectives: - Analyze the current interrupt request (IRQ) layout on an Arm Linux system - Experiment with different interrupt options and patterns to improve performance + - Configure optimal IRQ distribution strategies for your workload + - Implement persistent IRQ management solutions prerequisites: - An Arm computer running Linux @@ -36,6 +35,22 @@ further_reading: title: Perf for Linux on Arm (LinuxPerf) link: https://learn.arm.com/install-guides/perf/ type: website + - resource: + title: Tune network workloads on Arm-based bare-metal instances + link: /learning-paths/servers-and-cloud-computing/tune-network-workloads-on-bare-metal/ + type: learning-path + - resource: + title: Get started with Arm-based cloud instances + link: /learning-paths/servers-and-cloud-computing/csp/ + type: learning-path + - resource: + title: Linux kernel IRQ subsystem documentation + link: https://www.kernel.org/doc/html/latest/core-api/irq/index.html + type: website + - resource: + title: Microbenchmark and tune network performance with iPerf3 + link: /learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/ + type: learning-path ### FIXED, DO NOT MODIFY # ================================================================================ diff --git a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/_next-steps.md index c3db0de5a2..a648cb125c 100644 --- a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/_next-steps.md +++ b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/_next-steps.md @@ -6,3 +6,5 @@ weight: 21 # Set to always be larger than the content in this p title: "Next Steps" # Always the same, html page title. layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. --- + + diff --git a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/checking.md b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/checking.md index 91a635293c..3bea52e34b 100644 --- a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/checking.md +++ b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/checking.md @@ -1,5 +1,5 @@ --- -title: Understand and Analyze network IRQ configuration +title: Understand and analyze network IRQ configuration weight: 2 ### FIXED, DO NOT MODIFY @@ -10,18 +10,18 @@ layout: learningpathall In modern cloud environments, network performance is critical to overall system efficiency. Network interface cards (NICs) generate interrupt requests (IRQs) to notify the CPU when data packets arrive or need to be sent. These interrupts temporarily pause normal processing, allowing the system to handle network traffic. -By default, Linux distributes these network interrupts across available CPU cores. However, this distribution is not always optimal for performance: +By default, Linux distributes these network interrupts across available CPU cores. 
However, this distribution is not always optimal for performance, for the following reasons: -- High interrupt rates: In busy servers, network cards can generate thousands of interrupts per second -- CPU cache locality: Processing related network operations on the same CPU core improves cache efficiency -- Resource contention: When network IRQs compete with application workloads for the same CPU resources, both can suffer +- High interrupt rates: in busy servers, network cards can generate thousands of interrupts per second +- CPU cache locality: processing related network operations on the same CPU core improves cache efficiency +- Resource contention: when network IRQs compete with application workloads for the same CPU resources, both can suffer - Power efficiency: IRQ management can help reduce unnecessary CPU wake-ups, improving energy efficiency Understanding and optimizing IRQ assignment allows you to balance network processing loads, reduce latency, and maximize throughput for your specific workloads. ## Identifying IRQs on your system -To get started, run this command to display all IRQs on your system and their CPU assignments: +To get started, display all IRQs on your system and their CPU assignments: ```bash grep '' /proc/irq/*/smp_affinity_list | while IFS=: read path cpus; do @@ -31,7 +31,7 @@ grep '' /proc/irq/*/smp_affinity_list | while IFS=: read path cpus; do done ``` -The output is very long and looks similar to: +The output is long and looks similar to: ```output IRQ 104 -> CPUs 12 -> Device ens34-Tx-Rx-5 @@ -50,7 +50,7 @@ IRQ 26 -> CPUs 0-15 -> Device ACPI:Ged ## How to identify network IRQs -Network-related IRQs can be identified by looking at the "Device" column in the output. +Network-related IRQs can be identified by looking at the **Device** column in the output. You can identify network interfaces using the command: @@ -58,28 +58,17 @@ You can identify network interfaces using the command: ip link show ``` -Here are some common patterns to look for: +Look for common interface naming patterns in the output. Traditional ethernet interfaces use names like `eth0`, while wireless interfaces typically appear as `wlan0`. Modern Linux systems often use the predictable naming scheme, which creates names like `enP3p3s0f0` and `ens5-Tx-Rx-0`. -Common interface naming patterns include `eth0` for traditional ethernet, `enP3p3s0f0` and `ens5-Tx-Rx-0` for the Linux predictable naming scheme, or `wlan0` for wireless. - -The predictable naming scheme breaks down into: - -- en = ethernet -- P3 = PCI domain 3 -- p3 = PCI bus 3 -- s0 = PCI slot 0 -- f0 = function 0 - -This naming convention helps ensure network interfaces have consistent names across reboots by encoding their physical -location in the system. +The predictable naming scheme encodes the physical location within the interface name. For example, `enP3p3s0f0` breaks down as: `en` for ethernet, `P3` for PCI domain 3, `p3` for PCI bus 3, `s0` for PCI slot 0, and `f0` for function 0. This naming convention helps ensure network interfaces maintain consistent names across reboots by encoding their physical location in the system. ## Improve performance -Once you've identified the network IRQs, you can adjust their CPU assignments to try to improve performance. +Once you've identified the network IRQs, you can adjust their CPU assignments to improve performance. Identify the NIC (Network Interface Card) IRQs and adjust the system by experimenting and seeing if performance improves. 
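+
+As a minimal illustration, you can pin a single IRQ to a chosen core by writing to its `smp_affinity_list` file. The IRQ number and core used below are examples only; substitute values taken from the output on your own system:
+
+```bash
+# Pin IRQ 104 (an example NIC queue from the earlier output) to CPU core 2
+echo 2 | sudo tee /proc/irq/104/smp_affinity_list
+
+# Read the setting back to confirm the new assignment
+cat /proc/irq/104/smp_affinity_list
+```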
-You may notice that some NIC IRQs are assigned to the same CPU cores by default, creating duplicate assignments. +You might notice that some NIC IRQs are assigned to the same CPU cores by default, creating duplicate assignments. For example: @@ -95,13 +84,13 @@ IRQ 106 -> CPUs 10 -> Device ens34-Tx-Rx-7 ## Understanding IRQ performance impact -When network IRQs are assigned to the same CPU cores (as shown in the example above where IRQ 101 and 104 both use CPU 12), this can potentially hurt performance as multiple interrupts compete for the same CPU core's attention, while other cores remain underutilized. +When network IRQs are assigned to the same CPU cores (as shown in the example above where IRQ 101 and 104 both use CPU 12), this can potentially degrade performance as multiple interrupts compete for the same resources, while other cores remain underutilized. By optimizing IRQ distribution, you can achieve more balanced processing and improved throughput. This optimization is especially important for high-traffic servers where network performance is critical. -Suggested experiments are covered in the next section. +{{% notice Note%}} There are suggestions for experiments in the next section. {{% /notice %}} -### How can I reset my IRQs if I make performance worse? +## How can I reset my IRQs if I worsen performance? If your experiments reduce performance, you can return the IRQs back to default using the following commands: @@ -110,12 +99,14 @@ sudo systemctl unmask irqbalance sudo systemctl enable --now irqbalance ``` -If needed, install `irqbalance` on your system. For Debian based systems run: +If needed, install `irqbalance` on your system. + +For Debian based systems run: ```bash sudo apt install irqbalance ``` -### Saving these changes +## Saving the changes -Any changes you make to IRQs will be reset at reboot. You will need to change your system's settings to make your changes permanent. +Any changes you make to IRQs are reset at reboot. You will need to change your system's settings to make your changes permanent. diff --git a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/conclusion.md b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/conclusion.md index d73a807ef9..38cc22d165 100644 --- a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/conclusion.md +++ b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/conclusion.md @@ -8,44 +8,41 @@ layout: learningpathall ## Optimal IRQ Management Strategies -Testing across multiple cloud platforms reveals that IRQ management effectiveness varies significantly based on system size and workload characteristics. No single pattern works optimally for all scenarios, but clear patterns emerged during performance testing under heavy network loads. +Performance testing across multiple cloud platforms shows that IRQ management effectiveness depends heavily on system size and workload characteristics. While no single approach works optimally in all scenarios, clear patterns emerged during testing under heavy network loads. -## Recommendations by system size +## Recommendations for systems with 16 vCPUs or less -### Systems with 16 vCPUs or less +For smaller systems with 16 or fewer vCPUs, different strategies prove more effective: -For smaller systems with 16 or less vCPUs, concentrated IRQ assignment may provide measurable performance improvements. +- Concentrate network IRQs on just one or two CPU cores rather than spreading them across all available cores. 
+- Use the `smp_affinity` range assignment pattern with a limited core range (example: `0-1`). +- This approach works best when the number of NIC IRQs exceeds the number of available vCPUs. +- Focus on high-throughput network workloads where concentrated IRQ handling delivers the most significant performance improvements. -- Assign all network IRQs to just one or two CPU cores -- This approach showed the most significant performance gains -- Most effective when the number of NIC IRQs exceeds the number of vCPUs -- Use the `smp_affinity` range assignment pattern from the previous section with a very limited core range, for example `0-1` +Performance improves significantly when network IRQs are concentrated rather than dispersed across all available cores on smaller systems. This concentration reduces context switching overhead and improves cache locality for interrupt handling. -Performance improves significantly when network IRQs are concentrated rather than dispersed across all available cores on smaller systems. +## Recommendations for systems with more than 16 vCPUs -### Systems with more than 16 vCPUs +For larger systems with more than 16 vCPUs, different strategies prove more effective: -For larger systems with more than 16 vCPUs, the findings are different: +- Default IRQ distribution typically delivers good performance. +- Focus on preventing multiple network IRQs from sharing the same CPU core. +- Use the diagnostic scripts from the previous section to identify and resolve overlapping IRQ assignments. +- Apply the paired core pattern to ensure balanced distribution across the system. -- Default IRQ distribution generally performs well -- The primary concern is avoiding duplicate core assignments for network IRQs -- Use the scripts from the previous section to check and correct any overlapping IRQ assignments -- The paired core pattern can help ensure optimal distribution on these larger systems +On larger systems, interrupt handling overhead becomes less significant relative to total processing capacity. The primary performance issue occurs when high-frequency network interrupts compete for the same core, creating bottlenecks. -On larger systems, the overhead of interrupt handling is proportionally smaller compared to the available processing power. The main performance bottleneck occurs when multiple high-frequency network interrupts compete for the same core. +## Implementation considerations -## Implementation Considerations +When implementing these IRQ management strategies, several factors influence your success: -When implementing these IRQ management strategies, there are some important points to keep in mind. +- Consider your workload type first, as CPU-bound applications can benefit from different IRQ patterns than I/O-bound applications. Always benchmark your specific workload with different IRQ patterns rather than assuming one approach works universally. +- For real-time monitoring, use `watch -n1 'grep . /proc/interrupts'` to observe IRQ distribution as it happens. This helps you verify your changes are working as expected. +- On multi-socket systems, NUMA effects become important. Keep IRQs on cores close to the PCIe devices generating them to minimize cross-node memory access latency. Additionally, ensure your IRQ affinity settings persist across reboots by adding them to `/etc/rc.local` or creating a systemd service file. -Pay attention to the workload type. CPU-bound applications may benefit from different IRQ patterns than I/O-bound applications. 
+As workloads and hardware evolve, revisiting and adjusting IRQ management strategies might be necessary to maintain optimal performance. What works well today might need refinement as your application scales or changes. -Always benchmark your specific workload with different IRQ patterns. +## Next Steps -Monitor IRQ counts in real-time using `watch -n1 'grep . /proc/interrupts'` to observe IRQ distribution in real-time. +You have successfully learned how to optimize network interrupt handling on Arm servers. You can now analyze IRQ distributions, implement different management patterns, and configure persistent solutions for your workloads. -Also consider NUMA effects on multi-socket systems. Keep IRQs on cores close to the PCIe devices generating them to minimize cross-node memory access. - -Make sure to set up IRQ affinity settings in `/etc/rc.local` or a systemd service file to ensure they persist across reboots. - -Remember that as workloads and hardware evolve, revisiting and adjusting IRQ management strategies may be necessary to maintain optimal performance. diff --git a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/patterns.md b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/patterns.md index 46c788226e..a316e3d38c 100644 --- a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/patterns.md +++ b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/patterns.md @@ -12,28 +12,26 @@ Different IRQ management patterns can significantly impact network performance a Network interrupt requests (IRQs) can be distributed across CPU cores in various ways, each with potential benefits depending on your workload characteristics and system configuration. By strategically assigning network IRQs to specific cores, you can improve cache locality, reduce contention, and potentially boost overall system performance. -The following patterns have been tested on various systems and can be implemented using the provided scripts. An optimal pattern is suggested at the conclusion of this Learning Path, but your specific workload may benefit from a different approach. +The following patterns have been tested on various systems and can be implemented using the provided scripts. An optimal pattern is suggested at the conclusion of this Learning Path, but your specific workload might benefit from a different approach. -### Patterns +## Common IRQ distribution patterns -1. Default: IRQ pattern provided at boot. -2. Random: All IRQs are assigned a core and do not overlap with network IRQs. -3. Housekeeping: All IRQs outside of network IRQs are assigned to specific core(s). -4. NIC IRQs are assigned to single or multiple ranges of cores, including pairs. +Four main distribution strategies offer different performance characteristics: -### Scripts to change IRQ +- Default: uses the IRQ pattern provided at boot time by the Linux kernel +- Random: assigns all IRQs to cores without overlap with network IRQs +- Housekeeping: assigns all non-network IRQs to specific dedicated cores +- NIC-focused: assigns network IRQs to single or multiple ranges of cores, including pairs -The scripts below demonstrate how to implement different IRQ management patterns on your system. Each script targets a specific distribution strategy: +## Scripts to implement IRQ management patterns -Before running these scripts, identify your network interface name using `ip link show` and determine your system's CPU topology with `lscpu`. 
Always test these changes in a non-production environment first, as improper IRQ assignment can impact system stability. +The scripts below demonstrate how to implement different IRQ management patterns on your system. Each script targets a specific distribution strategy. Before running these scripts, identify your network interface name using `ip link show` and determine your system's CPU topology with `lscpu`. Always test these changes in a non-production environment first, as improper IRQ assignment can impact system stability. -To change the NIC IRQs or IRQs in general you can use the following scripts. +## Housekeeping pattern -### Housekeeping +The housekeeping pattern isolates non-network IRQs to dedicated cores, reducing interference with your primary workloads. -The housekeeping pattern isolates non-network IRQs to dedicated cores. - -You need to add more to account for other IRQs on your system. +Replace `#core range here` with your desired CPU range (for example: "0,3"): ```bash HOUSEKEEP=#core range here (example: "0,3") @@ -43,13 +41,11 @@ for irq in $(awk '/ACPI:Ged/ {sub(":","",$1); print $1}' /proc/interrupts); do done ``` -### Paired core - -The paired core assignment pattern distributes network IRQs across CPU core pairs for better cache coherency. +## Paired core pattern -This is for pairs on a 16 vCPU machine. +The paired core assignment pattern distributes network IRQs across CPU core pairs for better cache coherency. -You need to add the interface name. +This example works for a 16 vCPU machine. Replace `#interface name` with your network interface (for example: "ens5"): ```bash IFACE=#interface name (example: "ens5") @@ -68,13 +64,11 @@ for irq in "${irqs[@]}"; do done ``` -### Range assignment - -The range assignment pattern assigns network IRQs to a specific range of cores. +## Range assignment pattern -This will assign a specific core(s) to NIC IRQs only. +The range assignment pattern assigns network IRQs to a specific range of cores, providing dedicated network processing capacity. -You need to add the interface name. +Replace `#interface name` with your network interface (for example: "ens5"): ```bash IFACE=#interface name (example: "ens5") @@ -84,6 +78,6 @@ for irq in $(awk '/'$IFACE'/ {sub(":","",$1); print $1}' /proc/interrupts); do done ``` -Each pattern offers different performance characteristics depending on your workload. The housekeeping pattern reduces system noise, paired cores optimize cache usage, and range assignment provides dedicated network processing capacity. Test these patterns in your environment to determine which provides the best performance for your specific use case. +Each pattern offers different performance characteristics depending on your workload. The housekeeping pattern reduces system noise, paired cores optimize cache usage, and range assignment provides dedicated network processing capacity. Improper configuration can degrade performance or stability, so always test these patterns in a non-production environment to determine which provides the best results for your specific use case. Continue to the next section for additional guidance. 
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md index ffcbe3a19b..2c62928a51 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md @@ -6,21 +6,14 @@ weight: 2 layout: learningpathall --- -## Profiling LLMs on Arm CPUs with Streamline +## Profile LLMs on Arm CPUs with Streamline -Deploying Large Language Models (LLMs) on Arm CPUs provides a power-efficient and flexible solution. While larger models may benefit from GPU acceleration, techniques like quantization enable a wide range of LLMs to perform effectively on CPUs alone. +Deploying Large Language Models (LLMs) on Arm CPUs provides a power-efficient and flexible solution for many applications. While larger models can benefit from GPU acceleration, techniques like quantization enable a wide range of LLMs to perform effectively on CPUs alone by reducing model precision to save memory. -Frameworks such as [**llama.cpp**](https://github.com/ggml-org/llama.cpp), provide a convenient way to run LLMs, but it also comes with a certain level of complexity. +Frameworks such as [llama.cpp](https://github.com/ggml-org/llama.cpp) provide a convenient way to run LLMs. However, understanding their performance characteristics requires specialized analysis tools. To optimize LLM execution on Arm platforms, you need both a basic understanding of transformer architectures and the right profiling tools to identify bottlenecks. -To analyze their execution and use profiling insights for optimization, you need both a basic understanding of transformer architectures and the right analysis tools. +This Learning Path demonstrates how to use `llama-cli` from the command line together with Arm Streamline to analyze the efficiency of LLM inference on Arm CPUs. You'll gain insights into token generation performance at both the Prefill and Decode stages. You'll also understand how individual tensor operations contribute to overall execution time, and evaluate multi-threaded performance across multiple CPU cores. -This Learning Path demonstrates how to use `llama-cli` from the command line together with Arm Streamline to analyze the efficiency of LLM inference on Arm CPUs. +You will run the Qwen1_5-0_5b-chat-q4_0.gguf model using `llama-cli` on Arm Linux and use Streamline for detailed performance analysis. The same methodology can also be applied on Android systems. -You will learn how to: -- Profile token generation at the Prefill and Decode stages -- Profile execution of individual tensor nodes and operators -- Profile LLM execution across multiple threads and cores - -You will run the `Qwen1_5-0_5b-chat-q4_0.gguf` model using `llama-cli` on Arm Linux and use Streamline for analysis. - -The same method can also be used on Android. +By the end of this Learning Path, you'll understand how to profile LLM inference, identify performance bottlenecks, and analyze multi-threaded execution patterns on Arm CPUs. 
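+
+As background for the 4-bit model used in this Learning Path, the following sketch shows how a quantized GGUF file is typically produced with the llama.cpp quantization tool. This step is illustrative only; the Learning Path downloads a model that is already quantized, and the exact tool path and arguments can vary between llama.cpp releases:
+
+```bash
+# Hypothetical example: convert a full-precision GGUF model to 4-bit (Q4_0)
+# Assumes llama.cpp has been built and model-f16.gguf already exists
+./build/bin/llama-quantize model-f16.gguf model-q4_0.gguf Q4_0
+```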
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md index 2c1e4129f0..75bf788463 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md @@ -1,86 +1,89 @@ --- -title: Understand llama.cpp +title: Explore llama.cpp architecture and the inference workflow weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Understand llama.cpp +## Key concepts and architecture overview -llama.cpp is an open-source LLM framework implemented in C++ that supports both training and inference. +llama.cpp is an open-source LLM framework implemented in C++ that supports both training and inference. This Learning Path focuses specifically on inference performance on Arm CPUs. -This Learning Path focuses on inference on Arm CPUs. +The `llama-cli` tool provides a command-line interface to run LLMs with the llama.cpp inference engine. It supports text generation, chat mode, and grammar-constrained output directly from the terminal. -The `llama-cli` tool provides a command-line interface to run LLMs with the llama.cpp inference engine. -It supports text generation, chat mode, and grammar-constrained output directly from the terminal. +{{% notice Note %}} +These are some key terms used in this Learning Path: +- *Inference*: the process of generating text from a trained model +- *GGUF format*: a file format optimized for storing and loading LLM models efficiently +- *Tokenization*: converting text into numerical tokens that the model can process +{{% /notice %}} -![text#center](images/llama_structure.png "Figure 1. llama-cli Flow") +## The llama-cli workflow -### What does the Llama CLI do? +The following diagram shows the high-level workflow of llama-cli during inference: -Here are the steps performed by `llama-cli`: +![Workflow diagram showing llama-cli inference pipeline with input prompt processing through model loading, tokenization, parallel Prefill stage, and sequential Decode stage for token generation alt-text#center](images/llama_structure.png "The llama-cli inference workflow") -1. Load and interpret LLMs in GGUF format +The workflow begins when you provide an input prompt to `llama-cli`. The tool loads the specified GGUF model file and tokenizes your prompt. It then processes the prompt through two distinct stages: -2. Build a compute graph based on the model structure +- Prefill stage: the entire prompt is processed in parallel to generate the first output token +- Decode stage: additional tokens are generated sequentially, one at a time - The graph can be divided into subgraphs, each assigned to the most suitable backend device, but in this Learning Path all operations are executed on the Arm CPU backend. +This process continues until the model generates a complete response or reaches a stopping condition. -3. Allocate memory for tensor nodes using the graph planner +## How does llama-cli process requests? -4. Execute tensor nodes in the graph during the `graph_compute` stage, which traverses nodes and forwards work to backend devices +Here are the steps performed by `llama-cli` during inference: -Steps 2 to 4 are wrapped inside the function `llama_decode`. -During Prefill and Decode, `llama-cli` repeatedly calls `llama_decode` to generate tokens. 
+- Load and interpret LLMs in GGUF format -The parameter `llama_batch` passed to `llama_decode` differs between stages, containing input tokens, their count, and their positions. +- Build a compute graph based on the model structure: + - A compute graph defines the mathematical operations required for inference + - The graph is divided into subgraphs to optimize execution across available hardware backends + - Each subgraph is assigned to the most suitable backend device; in this Learning Path, all subgraphs are assigned to the Arm CPU backend + +- Allocate memory for tensor nodes using the graph planner + - Tensor nodes represent data and operations in the compute graph -### What are the components of llama.cpp? +- Execute tensor nodes in the graph during the `graph_compute` stage + - This stage traverses nodes and forwards work to backend devices -The components of llama.cpp include: +The compute graph building and tensor node execution stages are wrapped inside the function `llama_decode`. During both Prefill and Decode stages, `llama-cli` repeatedly calls `llama_decode` to generate tokens. The parameter `llama_batch` passed to `llama_decode` differs between stages. It contains input tokens, their count, and their positions. -![text#center](images/llama_components.jpg "Figure 2. llama.cpp components") +## What are the components of llama.cpp? -llama.cpp supports various backends such as `CPU`, `GPU`, `CUDA`, and `OpenCL`. +The architecture of llama.cpp includes several key components that work together to provide efficient LLM inference, as shown in the diagram: -For the CPU backend, it provides an optimized `ggml-cpu` library, mainly utilizing CPU vector instructions. +![Architecture diagram showing llama.cpp components including backends, ggml-cpu library, and KleidiAI integration alt-text#center](images/llama_components.jpg "llama.cpp components") -For Arm CPUs, the `ggml-cpu` library also offers an `aarch64` trait that leverages 8-bit integer multiply (i8mm) instructions for acceleration. +llama.cpp provides optimized support for Arm CPUs through its `ggml-cpu` library, which leverages Arm-specific vector instructions such as NEON and SVE, and includes an AArch64 trait that accelerates inference using 8-bit integer multiply (i8mm) instructions. The `ggml-cpu` library also integrates the Arm [KleidiAI](https://github.com/ARM-software/kleidiai) library as an additional trait. In addition to Arm CPU support, llama.cpp offers backends for GPU, CUDA, and OpenCL to enable inference on a variety of hardware platforms. -The `ggml-cpu` library also integrates the Arm [KleidiAI](https://github.com/ARM-software/kleidiai) library as an additional trait. +## Prefill and Decode in autoregressive LLMs -### Prefill and Decode in autoregressive LLMs +An autoregressive LLM is a type of Large Language Model that generates text by predicting the next token based on all the previously-generated tokens. A token represents a word or word piece in the sequence. -An autoregressive LLM is a type of Large Language Model that generates text by predicting the next token (word or word piece) in a sequence based on all the previously generated tokens. +The term *autoregressive* means the model uses its own previous outputs as inputs for generating subsequent outputs, creating a sequential generation process. For example, when generating the sentence "The cat sat on the...", an autoregressive LLM takes the input prompt as context and predicts the next most likely token, such as "mat". 
The model then uses the entire sequence including "mat" to predict the following token, continuing this process token by token until completion, which is why autoregressive LLMs have two distinct computational phases: Prefill (processing the initial prompt) and Decode (generating tokens one by one). -The term "autoregressive" means the model uses its own previous outputs as inputs for generating subsequent outputs, creating a sequential generation process. - -For example, when generating the sentence "The cat sat on the", an autoregressive LLM: -1. Takes the input prompt as context -2. Predicts the next most likely token (e.g., "mat") -3. Uses the entire sequence including "mat" to predict the following token -4. Continues this process token by token until completion - -This sequential nature is why autoregressive LLMs have two distinct computational phases: Prefill (processing the initial prompt) and Decode (generating tokens one by one). - -Most autoregressive LLMs are Decoder-only models. This refers to the transformer architecture they use, which consists only of decoder blocks from the original Transformer paper. The alternatives to decoder-only models include encoder-only models used for tasks like classification and encoder-decoder models used for tasks like translation. +Most autoregressive LLMs are decoder-only models. This refers to the transformer architecture, which consists only of decoder blocks from the original transformer paper. The alternatives to decoder-only models include encoder-only models used for tasks like classification and encoder-decoder models used for tasks like translation. Decoder-only models like LLaMA have become dominant for text generation because they are simpler to train at scale, can handle both understanding and generation tasks, and are more efficient for text generation. -Here is a brief introduction to Prefill and Decode stages of autoregressive LLMs. -![text#center](images/llm_prefill_decode.jpg "Figure 3. Prefill and Decode stages") +This diagram introduces the idea of Prefill and Decode stages of autoregressive LLMs: +![Diagram illustrating the two stages of autoregressive LLM inference: Prefill stage processing input tokens and Decode stage generating output tokens sequentially alt-text#center](images/llm_prefill_decode.jpg "Prefill and Decode stages") + +The Prefill stage is shown below, and as you can see, multiple input tokens of the prompt are processed simultaneously. -At the Prefill stage, multiple input tokens of the prompt are processed. +In the context of Large Language Models (LLMs), a *matrix* is a two-dimensional array of numbers representing data such as model weights or token embeddings, while a *vector* is a one-dimensional array often used to represent a single token or feature set. -It mainly performs GEMM (a matrix is multiplied by another matrix) operations to generate the first output token. +This stage mainly performs GEMM operations (General Matrix Multiply; where one matrix is multiplied by another matrix) to generate the first output token. -![text#center](images/transformer_prefill.jpg "Figure 4. 
Prefill stage") +![Diagram showing the Prefill stage processing multiple input tokens in parallel through transformer blocks using GEMM operations alt-text#center](images/transformer_prefill.jpg "Prefill stage") -At the Decode stage, by utilizing the [KV cache](https://huggingface.co/blog/not-lain/kv-caching), it mainly performs GEMV (a vector is multiplied by a matrix) operations to generate subsequent output tokens one by one. +At the Decode stage, the model utilizes the [KV cache](https://huggingface.co/blog/not-lain/kv-caching) (Key-Value cache; which is stored attention information from previous tokens). This stage mainly performs GEMV operations (General Matrix-Vector multiply - where a vector is multiplied by a matrix) to generate subsequent output tokens one by one. -![text#center](images/transformer_decode.jpg "Figure 5. Decode stage") +![Diagram showing the Decode stage generating tokens one by one using KV cache and GEMV operations alt-text#center](images/transformer_decode.jpg "Decode stage") -In summary, Prefill is compute-bound, dominated by large GEMM operations and Decode is memory-bound, dominated by KV cache access and GEMV operations. +## Summary -You will see this highlighted during the Streamline performance analysis. \ No newline at end of file +In this section, you learned about llama.cpp architecture and its inference workflow. The framework uses a two-stage process where the Prefill stage is compute-bound and dominated by large GEMM operations that process multiple tokens in parallel, while the Decode stage is memory-bound and dominated by KV cache access and GEMV operations that process one token at a time. You will see this distinction between Prefill and Decode stages reflected in the performance metrics and visualizations. In the next section, you'll integrate Streamline annotations into llama.cpp to enable detailed performance profiling of these stages. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md index cdb90f1223..f6288f05ce 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md @@ -6,32 +6,33 @@ weight: 4 layout: learningpathall --- -## Integrate Streamline Annotations into llama.cpp +## Set up performance annotation markers -To visualize token generation at the Prefill and Decode stages, you can use Streamline's Annotation Marker feature. +To visualize token generation at the Prefill and Decode stages, you can use Streamline's Annotation Marker feature. -This requires integrating annotation support into the llama.cpp project. +{{% notice Note %}} +*Annotation markers* are code markers that you insert into your application to identify specific events or time periods during execution. When Streamline captures performance data, these markers appear in the timeline, making it easier to correlate performance data with specific application behavior. +{{% /notice %}} -More information about the Annotation Marker API can be found in the [Streamline User Guide](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code?lang=en). +This requires integrating annotation support into the llama.cpp project. 
More information about the Annotation Marker API can be found in the [Streamline User Guide](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code?lang=en). {{% notice Note %}} You can either build natively on an Arm platform, or cross-compile on another architecture using an Arm cross-compiler toolchain. {{% /notice %}} -### Step 1: Build Streamline Annotation library +## Build the Streamline annotation library Download and install [Arm Performance Studio](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Studio#Downloads) on your development machine. {{% notice Note %}} You can also download and install [Arm Development Studio](https://developer.arm.com/Tools%20and%20Software/Arm%20Development%20Studio#Downloads), as it also includes Streamline. - {{% /notice %}} -Streamline Annotation support code is in the Arm Performance Studio installation directory in the `streamline/gator/annotate` directory. +Streamline Annotation support code is located in the Arm Performance Studio installation directory under `streamline/gator/annotate`. -Clone the gator repository that matches your Streamline version and build the `Annotation support library`. You can build it on your current machine using the native build instructions and you can cross compile it for another Arm computer using the cross compile instructions. +Clone the gator repository that matches your Streamline version and build the Annotation support library. You can build it natively on your current machine or cross-compile it for another Arm computer. -If you need to set up a cross compiler you can review the [GCC install guide](/install-guides/gcc/cross/). +If you need to set up a cross-compiler, you can review the [GCC install guide](/install-guides/gcc/cross/). {{< tabpane code=true >}} {{< tab header="Arm Native Build" language="bash">}} @@ -56,7 +57,7 @@ If you need to set up a cross compiler you can review the [GCC install guide](/i Once complete, the static library `libstreamline_annotate.a` will be generated at `~/gator/annotate/libstreamline_annotate.a` and the header file is at `gator/annotate/streamline_annotate.h`. -### Step 2: Integrate Annotation Marker into llama.cpp +## Integrate annotation marker into llama.cpp Next, you need to install llama.cpp to run the LLM model. @@ -64,7 +65,7 @@ Next, you need to install llama.cpp to run the LLM model. To make the performance profiling content easier to follow, this Learning Path uses a specific release version of llama.cpp to ensure the steps and results remain consistent. {{% /notice %}} -Before building llama.cpp, create a directory `streamline_annotation` and copy the library `libstreamline_annotate.a` and the header file `streamline_annotate.h` into the new directory. +Before building llama.cpp, create a directory `streamline_annotation` and copy the library `libstreamline_annotate.a` and the header file `streamline_annotate.h` into the new directory: ```bash cd ~ @@ -76,7 +77,7 @@ mkdir streamline_annotation cp ~/gator/annotate/libstreamline_annotate.a ~/gator/annotate/streamline_annotate.h streamline_annotation ``` -To link the `libstreamline_annotate.a` library when building llama-cli, use an editor to add the following lines at the end of `llama.cpp/tools/main/CMakeLists.txt`. 
+To link the `libstreamline_annotate.a` library when building llama-cli, use an editor to add the following lines at the end of `llama.cpp/tools/main/CMakeLists.txt`: ```makefile set(STREAMLINE_LIB_PATH "${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a") @@ -84,15 +85,15 @@ target_include_directories(llama-cli PRIVATE "${CMAKE_SOURCE_DIR}/streamline_ann target_link_libraries(llama-cli PRIVATE "${STREAMLINE_LIB_PATH}") ``` -To add Annotation Markers to `llama-cli`, edit the file `llama.cpp/tools/main/main.cpp` and make 3 modification. +To add Annotation Markers to `llama-cli`, edit the file `llama.cpp/tools/main/main.cpp` and make three modifications. -First, add the include file at the top of `main.cpp` with the other include files. +First, add the include file at the top of `main.cpp` with the other include files: ```c #include "streamline_annotate.h" ``` -Next, the find the `common_init()` call in the `main()` function and add the Streamline setup macro below it so that the code looks like: +Next, the find the `common_init()` call in the `main()` function and add the Streamline setup macro below it so that the code looks like this: ```c common_init(); @@ -127,7 +128,7 @@ Finally, add an annotation marker inside the main loop. Add the complete code in A string is added to the Annotation Marker to record the position of input tokens and number of tokens to be processed. -### Step 3: Build llama-cli +## Compile llama-cli with annotation support For convenience, llama-cli is statically linked. @@ -138,7 +139,7 @@ cd ~/llama.cpp mkdir build && cd build ``` -Next, configure the project. +Next, configure the project: {{< tabpane code=true >}} {{< tab header="Arm Native Build" language="bash">}} @@ -194,4 +195,6 @@ cmake --build ./ --config Release -j $(nproc) After the building process completes, you can find the `llama-cli` in the `~/llama.cpp/build/bin/` directory. -You now have an annotated version of `llama-cli` ready for Streamline. \ No newline at end of file +## Summary + +You have successfully integrated Streamline annotations into llama.cpp and built an annotated version of `llama-cli`. The annotation markers you added will help identify token generation events during profiling. In the next section, you'll use this instrumented executable to capture performance data with Streamline and analyze the distinct characteristics between Prefill and Decode stages during LLM inference. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md index 00472c5863..58838ce1dc 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md @@ -1,34 +1,40 @@ --- -title: Run llama-cli and analyze the data with Streamline +title: Analyze token generation performance with Streamline profiling weight: 5 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Run llama-cli and analyze the data with Streamline +## Set up the profiling environment -After successfully building llama-cli, the next step is to set up the runtime environment on your Arm platform. This can be your development machine or another Arm system. +After successfully building llama-cli, the next step is to set up the runtime environment on your Arm platform. 
This can be on your development machine or another Arm system. You'll configure the gator daemon for performance data collection and prepare your target system with the necessary executables and model files. This setup enables comprehensive performance analysis of both the compute-intensive Prefill stage and memory-bound Decode operations during LLM inference. -### Set up the gator daemon +## Set up the gator daemon -The gator daemon, `gatord`, is the Streamline collection agent that runs on the target device. It captures performance data including CPU metrics, PMU events, and annotations, then sends this data to the Streamline analysis tool running on your host machine. The daemon needs to be running on your target device before you can capture performance data. + Start with setting up the gator daemon. The setup process depends on your llama.cpp build method. -Depending on how you built llama.cpp: - -For the cross-compiled build flow: +{{% notice Note %}} +The daemon must be running on your target device before you can capture performance data. +{{% /notice %}} - - Copy the `llama-cli` executable to your Arm target. - - Copy the `gatord` binary from the Arm Performance Studio release. If you are targeting Linux, take it from `streamline\bin\linux\arm64` and if you are targeting Android take it from `streamline\bin\android\arm64`. +### For cross-compiled builds: +Copy the required files to your Arm target system: +- Transfer the `llama-cli` executable to your target device +- Copy the `gatord` binary from your Arm Performance Studio installation: + - Linux targets: Use `streamline\bin\linux\arm64\gatord` + - Android targets: Use `streamline\bin\android\arm64\gatord` -Put both of these programs in your home directory on the target system. +Place both programs in your home directory on the target system. -For the native build flow: - - Use the `llama-cli` from your local build in `llama.cpp/build/bin` and the `gatord` you compiled earlier at `~/gator/build-native-gcc-rel/gatord`. +### For native builds: +Use the locally built binaries: +- The `llama-cli` executable from `llama.cpp/build/bin` +- The `gatord` binary you compiled earlier at `~/gator/build-native-gcc-rel/gatord` -You now have the `gatord` and the `llama-cli` on the computer you want to run and profile. +Both programs are now ready for profiling on your target Arm system. -### Download a lightweight model +## Download a lightweight model You can download the LLM model to the target platform. @@ -39,7 +45,7 @@ cd ~ wget https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat-GGUF/resolve/main/qwen1_5-0_5b-chat-q4_0.gguf ``` -### Run the Gator daemon +## Run the gator daemon Start the gator daemon on your Arm target: @@ -56,115 +62,114 @@ Copyright (c) 2010-2025 Arm Limited. All rights reserved. Gator ready ``` -### Connect Streamline +## Connect Streamline Next, you can use Streamline to set up the collection of CPU performance data. -If you're accessing the Arm server via SSH, you need to forward port `8080` from the host platform to your local machine. +If you're accessing the Arm server via SSH, you need to forward port `8080` from the host platform to your local machine: ``` bash ssh -i user@arm-server -L 8080:localhost:8080 -N ``` -Append `-L 8080:localhost:8080 -N` to your original SSH command to enable local port forwarding, this allows Arm Streamline on your local machine to connect to the Arm server. +Append `-L 8080:localhost:8080 -N` to your original SSH command to enable local port forwarding. 
This allows Arm Streamline on your local machine to connect to the Arm server. -Then launch the Streamline application on your host machine, connect to the gatord running on your Arm target with either TCP or ADB connection. +Then launch the Streamline application on your host machine and connect to the gatord running on your Arm target with either TCP or ADB connection. You can select PMU events to be monitored at this point. {{% notice Note %}} If you are using ssh port forwarding, you need to select TCP `127.0.0.1:8080`. {{% /notice %}} -![text#center](images/streamline_capture.png "Figure 6. Streamline Start Capture ") +![Screenshot of Arm Streamline application showing the capture configuration interface with connection settings and PMU event selection alt-text#center](images/streamline_capture.png "Streamline start capture") -Set the path of llama-cli executable for Streamline so that its debug info can be used for analysis. -![text#center](images/streamline_capture_image.png "Figure 7. Streamline image path") +Set the path of llama-cli executable for Streamline so that its debug info can be used for analysis: +![Screenshot showing Streamline image path configuration for llama-cli executable debug information alt-text#center](images/streamline_capture_image.png "Streamline image path") -Click `Start Capture` button on Streamline to start collecting data from the Arm target. +Click the **Start Capture** button on Streamline to start collecting data from the Arm target. {{% notice Note %}} -This guide is not intended to introduce how to use Streamline, if you encounter any issues with gatord or Streamline, please refer to the [Streamline User Guide](https://developer.arm.com/documentation/101816/latest/?lang=en) +This Learning Path focuses on analyzing llama.cpp performance data. If you encounter issues with gatord or Streamline setup, check the [Streamline User Guide](https://developer.arm.com/documentation/101816/latest/?lang=en) for detailed troubleshooting steps. {{% /notice %}} -### Run llama-cli +## Run llama-cli -Run the `llama-cli` executable as below: +Run the `llama-cli` executable as shown below: ``` bash cd ~/llama.cpp/build/bin ./llama-cli -m qwen1_5-0_5b-chat-q4_0.gguf -p "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n<|im_start|>user\nTell me a story about a fox and a crow? Please do not tell the traditional story in Aesop's fables. Please tell me a positive story about friendship and love. The story should have no more than 400 words<|im_end|>\n<|im_start|>assistant\n" -st -t 1 ``` -After a while, you can stop the Streamline data collection by clicking the `Stop` button on Streamline. +After a while, you can stop the Streamline data collection by clicking the **Stop** button on Streamline. Streamline running on your host PC will start the data analysis. -### Analyze the data with Streamline +## Analyze the data with Streamline -From the timeline view of Streamline, you can see some Annotation Markers. Since an Annotation Marker is added before the llama_decode function, each Annotation Marker marks the start time of a token generation. -![text#center](images/annotation_marker_1.png "Figure 8. Annotation Marker") +From the timeline view of Streamline, you can see some annotation markers. Since an Annotation Marker is added before the llama_decode function, each Annotation Marker marks the start time of a token generation. 
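If you want to connect the markers in the timeline back to the code, the instrumentation added before `llama_decode()` in the earlier section reduces to a call like the sketch below. This is only an illustration: the `ANNOTATE_MARKER_STR` macro is assumed from `streamline_annotate.h`, and the helper function and the `n_past`/`n_eval` names are placeholders for the values llama-cli tracks for each batch.

```c
// Illustrative sketch of the marker call described earlier, not the exact llama.cpp patch.
// ANNOTATE_SETUP is assumed to have been called once at startup, as described in the build section.
#include <stdio.h>
#include "streamline_annotate.h"

static void mark_token_generation(int n_past, int n_eval) {
    char buf[128];
    // The string becomes visible when you click the marker in the Timeline view,
    // for example "past 0, n_eval 78" for the first Prefill batch.
    snprintf(buf, sizeof(buf), "past %d, n_eval %d", n_past, n_eval);
    ANNOTATE_MARKER_STR(buf);
}
```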
+![Screenshot of Streamline timeline view showing annotation markers indicating token generation start points during llama.cpp execution alt-text#center](images/annotation_marker_1.png "Annotation marker") -The string in the Annotation Marker can be shown when clicking those Annotation Markers. For example, -![text#center](images/annotation_marker_2.png "Figure 9. Annotation String") +You can view the annotation details by clicking on any Annotation Marker in the timeline. This displays the marker string with token position and processing information: -The number after `past` indicates the position of input tokens, the number after `n_eval` indicates the number of tokens to be processed this time. +![Screenshot showing detailed annotation marker information with token position and count data displayed in Streamline alt-text#center](images/annotation_marker_2.png "Annotation string") + +The number after **past** indicates the position of input tokens, the number after **n_eval** indicates the number of tokens to be processed this time. By checking the string of Annotation Marker, the first token generation at Prefill stage has `past 0, n_eval 78`, which means that the position of input tokens starts at 0 and there are 78 input tokens to be processed. -You can see that the first token generated at Prefill stage takes more time, since 78 input tokens have to be processed at Prefill stage, it performs lots of GEMM operations. At Decode stage, tokens are generated one by one at mostly equal speed, one token takes less time than that of Prefill stage, thanks to the effect of KV cache. At Decode stage, it performs many GEMV operations. +You can see that the first token generated at the Prefill stage takes more time since 78 input tokens have to be processed at the Prefill stage, performing lots of GEMM operations. At the Decode stage, tokens are generated one by one at mostly equal speed; one token takes less time than that of the Prefill stage, thanks to the effect of KV cache. At the Decode stage, it performs many GEMV operations. -You can further investigate it with PMU event counters that are captured by Streamline. At Prefill stage, the amount of computation, which are indicated by PMU event counters that count number of Advanced SIMD (NEON), Floating point, Integer data processing instruction, is large. However, the memory access is relatively low. Especially, the number of L3 cache refill/miss is much lower than that of Decode stage. +You can further investigate it with PMU event counters that are captured by Streamline. At the Prefill stage, the amount of computation, which is indicated by PMU event counters that count the number of Advanced SIMD (NEON), floating-point, and integer data processing instructions, is large. However, the memory access is relatively low. Especially, the number of L3 cache refill/miss is much lower than that of the Decode stage. -At Decode stage, the amount of computation is relatively less (since the time of each token is less), but the number of L3 cache refill/miss goes much higher. +At Decode stage, the amount of computation is relatively less (since the time of each token is less), but the number of L3 cache refills/misses increases significantly. -![text#center](images/annotation_pmu_stall.png "Figure 11. 
Backend stall PMU event") +![Graph showing PMU backend stall cycles analysis comparing memory stalls between Prefill and Decode stages alt-text#center](images/annotation_pmu_stall.png "Backend stall PMU event") You can see that at Prefill stage, Backend Stall Cycles due to Memory stall are only about 10% of total Backend Stall Cycles. However, at Decode stage, Backend Stall Cycles due to Memory stall are around 50% of total Backend Stall Cycles. All those PMU event counters indicate that it is compute-bound at Prefill stage and memory-bound at Decode stage. -Now, you can further profile the code execution with Streamline. In the Call Paths view of Streamline, you can see the percentage of running time of functions that are organized in form of call stack. +Now, you can further profile the code execution with Streamline. In the **Call Paths** view of Streamline, you can see the percentage of running time of functions that are organized in form of call stack. -![text#center](images/annotation_prefill_call_stack.png "Figure 12. Call stack") +![Screenshot of Streamline Call Paths view showing function execution hierarchy and performance distribution alt-text#center](images/annotation_prefill_call_stack.png "Call stack") In the Functions view of Streamline, you can see the overall percentage of running time of functions. -![text#center](images/annotation_prefill_functions.png "Figure 13. Functions view") +![Screenshot of Streamline Functions view displaying execution time percentages for different functions during llama.cpp execution alt-text#center](images/annotation_prefill_functions.png "Functions view") + +As you can see, the function, graph_compute, takes the largest portion of the running time. It shows that large amounts of GEMM and GEMV operations take most of the time. With the `Qwen1_5-0_5b-chat-q4_0` model, the computation (GEMM and GEMV) of Q, K, V vectors and most of FFN layers: their weights are with Q4_0 data type and the input activations are with FP32 data type. The computation is forwarded to KleidiAI trait by `ggml_cpu_extra_compute_forward`. KleidiAI microkernels implemented with NEON dot product and i8mm vector instructions accelerate the computation. + +At the Prefill stage, `kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm` KleidiAI ukernel is used for GEMM (Matrix Multiply) operators. It takes advantage of i8mm instructions. Since the Prefill stage only takes a small percentage of the whole time, the percentage of this function is small as shown in figures above. However, if you focus only on the Prefill stage with Samplings view in Timeline, you see `kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm` takes the largest portion of the Prefill stage. -As you can see, the function, graph_compute, takes the largest portion of the running time. +![Screenshot showing Streamline analysis focused on Prefill stage execution with KleidiAI GEMM operations highlighted alt-text#center](images/prefill_only.png "Prefill only view") -It shows that large amounts of GEMM and GEMV operations take most of the time. +At the Decode stage, `kai_run_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod` KleidiAI ukernel is used for GEMV operators. It takes advantage of dot product instructions. If you look only at the Decode stage, you can see this function takes the second largest portion. 
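To see why the two stages stress the hardware differently, compare the shape of the two operations. The sketch below is a deliberately naive illustration, not the code llama.cpp runs; the real work goes through the KleidiAI microkernels named above, but the arithmetic-intensity argument is the same.

```c
// Naive reference implementations, for illustration only.
// Prefill (GEMM): C[M][N] += A[M][K] * B[K][N]. Each weight in B is reused for
// all M rows of A, so multiply-accumulate throughput dominates (compute-bound).
void gemm(int M, int N, int K, const float *A, const float *B, float *C) {
    for (int m = 0; m < M; m++) {
        for (int n = 0; n < N; n++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++) {
                acc += A[m * K + k] * B[k * N + n];
            }
            C[m * N + n] += acc;
        }
    }
}

// Decode (GEMV): y[N] += x[K] * B[K][N]. Every weight is loaded from memory but
// used only once per generated token, so memory bandwidth dominates (memory-bound).
void gemv(int N, int K, const float *x, const float *B, float *y) {
    for (int n = 0; n < N; n++) {
        float acc = 0.0f;
        for (int k = 0; k < K; k++) {
            acc += x[k] * B[k * N + n];
        }
        y[n] += acc;
    }
}
```

This is also why the backend stall breakdown above shifts from compute to memory once the KV cache removes the need for full matrix-matrix products at Decode.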
-With the `Qwen1_5-0_5b-chat-q4_0` model, the computation (GEMM and GEMV) of Q, K, V vectors and most of FFN layers: their weights are with Q4_0 data type and the input activations are with FP32 data type. +![Screenshot showing Streamline analysis focused on Decode stage execution with KleidiAI GEMV operations highlighted alt-text#center](images/decode_only.png "Decode only view") -The computation is forwarded to KleidiAI trait by `ggml_cpu_extra_compute_forward`. KleidiAI microkernels implemented with NEON dot product and i8mm vector instructions accelerate the computation. +There is a `result_output` linear layer in the Qwen1_5-0_5b-chat-q4_0 model where the weights use Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation cannot be handled by KleidiAI yet, so it is handled by the `ggml_vec_dot_q6_K_q8_K` function in the ggml-cpu library. -At the Prefill stage, `kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm` KleidiAI ukernel is used for GEMM (Matrix Multiply) operators. It takes advantage of i8mm instructions. Since the Prefill stage only takes a small percentage of the whole time, the percentage of this function is small as shown in figures above. However, if you focus only on the Prefill stage with Samplings view in Timeline, you see `kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm` takes the largest portion of the Prefill stage. +The tensor nodes for Multi-Head attention computation are represented as three-dimensional matrices with FP16 data type (KV cache also holds FP16 values). These are computed by the `ggml_vec_dot_f16` function in the ggml-cpu library. -![text#center](images/prefill_only.png "Figure 14. Prefill only view") +The computation of RoPE, Softmax, and RMSNorm layers does not take a significant portion of the running time. -At the Decode stage, `kai_run_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod` KleidiAI ukernel is used for GEMV operators. It takes advantage of dot product instructions. If you look only at the Decode stage, you can see this function takes the second largest portion. +## Analyze results + +The profiling data reveals clear differences between Prefill and Decode stages: + +- Annotation Markers show token generation start points. The Prefill stage shows `past 0, n_eval 78`, indicating 78 input tokens processed simultaneously. During Decode, tokens are generated one at a time. -![text#center](images/decode_only.png "Figure 15. Decode only view") +- Performance characteristics differ significantly between stages. Prefill demonstrates compute-bound behavior with high SIMD, floating-point, and integer instruction counts but relatively few L3 cache misses. Decode shows memory-bound behavior with lighter compute workloads but frequent L3 cache accesses. -There is a result_output linear layer in Qwen1_5-0_5b-chat-q4_0 model, the weights are with Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation cannot be handled by KleidiAI yet, it is handled by the ggml_vec_dot_q6_K_q8_K function in ggml-cpu library. +- PMU events confirm this analysis. Backend stall cycles due to memory account for only ~10% of total stalls during Prefill, but increase to ~50% during Decode. This pattern indicates efficient compute utilization during Prefill and memory bottlenecks during Decode. 
-The tensor nodes for computation of Multi-Head attention are presented as three-dimension matrices with FP16 data type (KV cache also holds FP16 values), they are computed by ggml_vec_dot_f16 function in ggml-cpu library. +| Stage | Main Operations | Bottleneck | Key Observations | +|---------|----------------|----------------|-------------------------------------------------| +| Prefill | GEMM | Compute-bound | Heavy SIMD/FP/INT ops, few cache refills | +| Decode | GEMV | Memory-bound | Light compute, many L3 cache misses, ~50% stalls | -The computation of RoPE, Softmax, RMSNorm layers does not take significant portion of the running time. +The results demonstrate how KV caching transforms the computational profile from matrix-matrix operations during Prefill to vector-matrix operations during Decode, fundamentally changing the performance characteristics. -### Analyzing results -- Annotation Markers show token generation start points. -- Prefill stage: past 0, n_eval 78 → compute-bound (large GEMM). -- Decode stage: one token at a time → memory-bound (KV cache, GEMV). -- PMU events: SIMD/FP/INT instructions high in Prefill, L3 cache misses high in Decode. -- Backend stalls: ~10% memory stalls in Prefill vs ~50% in Decode. +## Summary -| Stage | Main Ops | Bottleneck | Observations | -|---------|----------|----------------|--------------------------------------------------| -| Prefill | GEMM | Compute-bound | Heavy SIMD/FP/INT ops, few cache refills | -| Decode | GEMV | Memory-bound | Light compute, many L3 cache misses, ~50% stalls | -| Prefill | GEMM | Compute-bound | Heavy SIMD/FP/INT ops, few cache refills | -| Decode | GEMV | Memory-bound | Light compute, many L3 cache misses, ~50% stalls | -|---------|----------|----------------|--------------------------------------------------| -| Prefill | GEMM | Compute-bound | Heavy SIMD/FP/INT ops, few cache refills | -| Decode | GEMV | Memory-bound | Light compute, many L3 cache misses, ~50% stalls | +You have successfully captured and analyzed LLM inference performance using Streamline. Use this data to optimize your applications by identifying the distinct characteristics between Prefill (compute-bound) and Decode (memory-bound) stages. Leverage the function execution time data and PMU event correlations to pinpoint performance bottlenecks in your inference pipeline. Apply these insights to make informed decisions about hardware selection and code optimization strategies. Take advantage of this foundation to dive deeper into operator-level analysis in the next section, where you'll gain even more granular control over your LLM performance optimization efforts. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/5_operator_deepdive.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/5_operator_deepdive.md index 8b2ebf8eb0..4ac130aafa 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/5_operator_deepdive.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/5_operator_deepdive.md @@ -1,16 +1,18 @@ --- -title: Deep dive into operators +title: Implement operator-level performance analysis with Annotation Channels weight: 6 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Deep dive into operators +## Overview of Annotation Channels -You can use Streamline Annotation Channels to analyze the execution time of each node in the compute graph. 
More details on Annotation Channels can be found in the [Group and Channel annotations](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code/User-space-annotations/Group-and-Channel-annotations?lang=en) section of the Streamline User Guide. +You can use Streamline Annotation Channels to analyze the execution time of each node in the compute graph, which is especially valuable for understanding and optimizing performance on Arm-based systems. Annotation Channels are specialized annotations that group related operations into separate visual channels in Streamline. Unlike simple markers, channels allow you to track multiple concurrent operations and see their relationships over time. -## Integrating Annotation Channels into llama.cpp +More details on Annotation Channels can be found in the [Group and Channel annotations](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code/User-space-annotations/Group-and-Channel-annotations?lang=en) section of the Streamline User Guide. + +## Integrate Annotation Channels into llama.cpp In llama.cpp, tensor nodes are executed in the CPU backend inside the function `ggml_graph_compute_thread()` in the file `~/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c`. @@ -23,15 +25,11 @@ for (int node_n = 0; node_n < cgraph->n_nodes && atomic_load_explicit(&tp->abort ggml_compute_forward(¶ms, node); ``` -To monitor operator execution time, you can create annotation channels for each type of operators such as `GGML_OP_MUL_MAT`, `GGML_OP_SOFTMAX`, `GGML_OP_ROPE` and `GGML_OP_MUL`. - -Since `GGML_OP_MUL_MAT` including both GEMM and GEMV operation takes significant portion of execution time, two dedicated annotation channels are created for GEMM and GEMV respectively. - -The annotation starts at the beginning of `ggml_compute_forward()` and stops at the end, so that the computation of tensor node/operator can be monitored. +To monitor operator execution time, you can create annotation channels for each type of operators such as `GGML_OP_MUL_MAT`, `GGML_OP_SOFTMAX`, `GGML_OP_ROPE`, and `GGML_OP_MUL`. Matrix operations (`GGML_OP_MUL_MAT`) take a significant portion of execution time. These operations include both GEMM (General Matrix Multiply) and GEMV (General Matrix-Vector multiply) operations. You'll create two dedicated annotation channels for GEMM and GEMV respectively to analyze their performance separately. The annotation starts at the beginning of `ggml_compute_forward()` and stops at the end. This approach allows you to monitor the computation time of each tensor node/operator. -### Step 1: Add annotation code +## Add annotation code to monitor operators -First, add Streamline annotation header file to the file `ggml-cpu.c`: +First, add the Streamline annotation header file to `ggml-cpu.c`: ```c #include "streamline_annotate.h" @@ -39,7 +37,7 @@ First, add Streamline annotation header file to the file `ggml-cpu.c`: Edit the `ggml_graph_compute_thread()` function in the file `~/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c`. -Add following code in front and after the `ggml_compute_forward(¶ms, node)`. +Add the following code in front and after the `ggml_compute_forward(¶ms, node)`. 
Your code now looks like: @@ -79,9 +77,9 @@ for (int node_n = 0; node_n < cgraph->n_nodes && atomic_load_explicit(&tp->abort // --- End Annotation Channel for Streamline ``` -### Step 2: Add tensor shape info (optional) +## Include tensor shape information (optional) -You can also add information of the shape and size of source tensor by replace sprintf function as follow: +You can also add information about the shape and size of source tensors by replacing the sprintf function as follows: ```c sprintf(printf_buf,"%s %s %d_%d_%d %d_%d_%d", node->name, ggml_get_name(node), \ @@ -94,9 +92,9 @@ You can also add information of the shape and size of source tensor by replace s ); ``` -### Step 3: Update CMakeLists +## Update build configuration -Edit `~/llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt` to include Streamline Annotation header file and `libstreamline_annotate.a` library by adding the lines: +Edit `~/llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt` to include the Streamline Annotation header file and `libstreamline_annotate.a` library by adding these lines: ```bash set(STREAMLINE_LIB_PATH "${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a") @@ -106,29 +104,27 @@ Edit `~/llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt` to include Streamline Annota Then, build `llama-cli` again. -### Analyze the data with Streamline +## Examine operator performance patterns -Run `llama-cli` and collect profiling data with Streamline as you did in the previous session. +Run `llama-cli` and collect profiling data with Streamline as you did in the previous section. -String annotations are displayed as text overlays inside the relevant channels in the details panel of the Timeline view. +Arm Streamline displays string annotations as text overlays in the relevant channels in the Timeline view, such as Channel 0, as shown in the following screenshot. -For example, inside Channel 0 in the following screenshot. +![Screenshot of Streamline annotation channels displaying operator execution timing with channel indicators alt-text#center](images/deep_dive_1.png "Annotation channel") -![text#center](images/deep_dive_1.png "Figure 16. Annotation Channel") +The letter `A` is displayed in the process list to indicate the presence of annotations. -The letter A is displayed in the process list to indicate the presence of annotations. +String annotations are also displayed in the **Message** column in the Log view. -String annotations are also displayed in the Message column in the Log view. +![Screenshot of Streamline Log view showing annotation messages with operator details and timing information alt-text#center](images/deep_dive_2.png "Annotation log") -![text#center](images/deep_dive_2.png "Figure 17. Annotation log") +## Compare GEMM operations during prefill -### View the individual operators at Prefill stage +The screenshot of annotation channel view at prefill stage is shown as below: -The screenshot of annotation channel view at Prefill stage is shown as below: +![Screenshot showing Streamline annotation channels during Prefill stage with operator categorization and timing visualization alt-text#center](images/prefill_annotation_channel.png "Annotation channel at Prefill stage") -![text#center](images/prefill_annotation_channel.png "Figure 18. Annotation Channel at Prefill stage") - -The name of operator in the screenshot above is manually edited. If the name of operator needs to be shown instead of Channel number by Streamline, ANNOTATE_NAME_CHANNEL can be added to ggml_graph_compute_thread function. 
+The operator name in the screenshot above is manually edited. If you want the operator name to be shown instead of the Channel number by Streamline, you can add ANNOTATE_NAME_CHANNEL to the `ggml_graph_compute_thread` function. This annotation macro is defined as: @@ -136,7 +132,7 @@ This annotation macro is defined as: ANNOTATE_NAME_CHANNEL(channel, group, string) ``` -For example, +For example: ```c ANNOTATE_NAME_CHANNEL(0, 0, "MUL_MAT_GEMV"); @@ -145,9 +141,9 @@ For example, The code above sets the name of annotation channel 0 as `MUL_MAT_GEMV` and channel 1 as `MUL_MAT_GEMM`. -By zooming into the timeline view, you can see more details: +Zoom into the timeline view to examine additional details: -![text#center](images/prefill_annotation_channel_2.png "Figure 19. Annotation Channel at Prefill stage") +![Detailed view of Streamline annotation channels showing individual operator execution blocks during Prefill stage alt-text#center](images/prefill_annotation_channel_2.png "Annotation channel at Prefill stage") When moving the cursor over an annotation channel, Streamline shows: @@ -155,9 +151,9 @@ When moving the cursor over an annotation channel, Streamline shows: - The operator type - The shape and size of the source tensors -![text#center](images/prefill_annotation_channel_3.png "Figure 20. Annotation Channel Zoom in") +![Close-up screenshot of annotation channel tooltip showing tensor node details including operator type and tensor dimensions alt-text#center](images/prefill_annotation_channel_3.png "Annotation channel zoom in") -In the example above, you see a `GGML_OP_MUL_MAT` operator for the `FFN_UP` node. +The example above shows a `GGML_OP_MUL_MAT` operator for the `FFN_UP` node. The source tensors have shapes [1024, 2816] and [1024, 68]. This view makes it clear that: @@ -165,18 +161,17 @@ This view makes it clear that: - There is also a large `MUL_MAT GEMV` operation in the `result_output` linear layer. - Other operators, such as MUL, Softmax, Norm, RoPE, consume only a small portion of execution time. -### View of individual operators at Decode stage +## Analyze GEMV operations during Decode The annotation channel view for the Decode stage is shown below: -![text#center](images/decode_annotation_channel.png "Figure 21. Annotation Channel at Decode stage") +![Screenshot showing Streamline annotation channels during Decode stage highlighting GEMV operations and reduced computation time alt-text#center](images/decode_annotation_channel.png "Annotation channel at Decode stage") Zooming in provides additional details: -![text#center](images/decode_annotation_channel_2.png "Figure 22. Annotation Channel string") +![Detailed view of Decode stage annotation channels showing shorter execution blocks compared to Prefill stage alt-text#center](images/decode_annotation_channel_2.png "Annotation channel string") +This view reveals that the majority of time in Decode is spent on `MUL_MAT GEMV` operations within the attention and FFN layers. Unlike the Prefill stage, no GEMM operations are executed in these layers during Decode. The `result_output` linear layer contains a large GEMV operation that takes an even larger proportion of runtime in Decode compared to Prefill. This pattern is expected since each token generation at Decode is shorter due to KV cache reuse, making the `result_output` layer more dominant in the overall execution profile. 
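Before moving to the summary, it can help to see the per-operator channel pattern from this section in one self-contained sketch. The exact lines added to `ggml_graph_compute_thread()` are the ones shown in the hunks above; the code below is only an illustration, assuming the `ANNOTATE_CHANNEL` and `ANNOTATE_CHANNEL_END` macros from `streamline_annotate.h`, with a stub tensor type and a simplified GEMV test (second source dimension equal to 1) standing in for the real ggml structures.

```c
// Minimal sketch of the channel start/stop pattern, not the actual llama.cpp patch.
#include <stdio.h>
#include "streamline_annotate.h"

enum fake_op { FAKE_OP_MUL_MAT, FAKE_OP_SOFTMAX, FAKE_OP_OTHER };

struct fake_tensor {
    enum fake_op op;     /* operator type                                   */
    long ne1;            /* stand-in for the second dimension: 1 means GEMV */
    const char *name;    /* tensor node name shown in the channel string    */
};

static void compute_with_channel(const struct fake_tensor *node) {
    /* Route MUL_MAT to channel 0 (GEMV) or 1 (GEMM); everything else to 2. */
    int channel = 2;
    if (node->op == FAKE_OP_MUL_MAT) {
        channel = (node->ne1 == 1) ? 0 : 1;
    }

    char buf[128];
    snprintf(buf, sizeof(buf), "%s", node->name);

    ANNOTATE_CHANNEL(channel, buf);      /* annotation starts before the operator */
    /* ggml_compute_forward(&params, node) runs here in the real code             */
    ANNOTATE_CHANNEL_END(channel);       /* and stops when the operator finishes  */
}
```

Combined with `ANNOTATE_NAME_CHANNEL`, as shown earlier, this is what produces the named `MUL_MAT_GEMV` and `MUL_MAT_GEMM` rows in the Timeline view.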
+ +## Summary -From this view, you can see: -- The majority of time in Decode is spent on `MUL_MAT GEMV` operations in the attention and FFN layers. -- In contrast to Prefill, **no GEMM operations** are executed in these layers. -- The `result_output` linear layer has a large GEMV operation, which takes an even larger proportion of runtime in Decode. -- This is expected, since each token generation at Decode is shorter due to KV cache reuse, making the result_output layer more dominant. +You have successfully implemented Annotation Channels to analyze individual operators within llama.cpp. This detailed view reveals how different operators contribute to overall execution time and shows the stark differences between Prefill (GEMM-dominated) and Decode (GEMV-dominated) stages. The next section will explore how these operations utilize multiple CPU cores and threads. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/6_multithread_analyze.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/6_multithread_analyze.md index 00b0c4bf00..0d3afbc47e 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/6_multithread_analyze.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/6_multithread_analyze.md @@ -1,63 +1,60 @@ --- -title: Analyze multi-threaded performance +title: Examine multi-threaded performance patterns in llama.cpp weight: 7 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Analyze multi-threaded performance +## Understand llama.cpp multi-threading architecture -The CPU backend in llama.cpp uses multiple cores and threads to accelerate operator execution. +The CPU backend in llama.cpp uses multiple cores and threads to accelerate operator execution. Understanding how work is distributed across threads helps you optimize performance on Arm processors. -It creates a threadpool, where: -- The number of threads is controlled by the `-t` option -- If `-t` is not specified, it defaults to the number of CPU cores in the system +llama.cpp creates a threadpool where the number of threads is controlled by the `-t` option. If `-t` is not specified, it defaults to the number of CPU cores in the system. The `-C` option controls thread affinity, which determines which specific cores threads run on. -The entrypoint for secondary threads is the function `ggml_graph_compute_secondary_thread()`. +The entry point for secondary threads is the function `ggml_graph_compute_secondary_thread()`. When computing a tensor node/operator with a large workload, llama.cpp splits the computation into multiple parts and distributes these parts across threads for parallel execution. -When computing a tensor node/operator with a large workload, llama.cpp splits the computation into multiple parts and distributes them across threads. - -### Example: MUL_MAT Operator +## Example: MUL_MAT operator parallelization For the MUL_MAT operator, the output matrix C can be divided across threads: -![text#center](images/multi_thread.jpg "Figure 23. Multi-Thread") +![Diagram illustrating how MUL_MAT operator computation is distributed across multiple threads, with each thread computing a portion of the output matrix alt-text#center](images/multi_thread.jpg "Multi-thread") -In this example, four threads each compute one quarter of matrix C. +In this example, four threads each compute one quarter of matrix C. 
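The even split shown in the figure comes down to a few lines of index arithmetic. The sketch below is conceptual only; the real distribution logic lives in the ggml CPU backend and is more involved, but it shows how four threads can each take one quarter of the output rows, as described above.

```c
// Conceptual sketch of splitting a MUL_MAT output across threads (illustrative only).
#include <stdio.h>

/* Rows [*row_start, *row_end) of the output assigned to thread `ith` of `nth`. */
static void thread_row_range(int ith, int nth, int rows, int *row_start, int *row_end) {
    int per_thread = (rows + nth - 1) / nth;   /* ceiling division */
    *row_start = ith * per_thread;
    *row_end = (*row_start + per_thread < rows) ? (*row_start + per_thread) : rows;
}

int main(void) {
    /* With 4 threads and a 1024-row output, each thread computes one quarter. */
    for (int ith = 0; ith < 4; ith++) {
        int start, end;
        thread_row_range(ith, 4, 1024, &start, &end);
        printf("thread %d computes rows [%d, %d)\n", ith, start, end);
    }
    return 0;
}
```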
-### Observing thread execution with Streamline +## Profile thread execution with Streamline -The execution of multiple threads on CPU cores can be observed using Core Map and Cluster Map modes in the Streamline Timeline. +The execution of multiple threads on CPU cores can be observed using Core Map and Cluster Map modes in the Streamline Timeline. These visualization modes show how threads are distributed across CPU cores and help identify performance bottlenecks in parallel execution. Learn more about these modes in the [Core Map and Cluster Map modes](https://developer.arm.com/documentation/101816/9-7/Analyze-your-capture/Viewing-application-activity/Core-Map-and-Cluster-Map-modes) section of the Streamline User Guide. -Run llama-cli with `-t 2 -C 0x3` to specify two threads and thread affinity as CPU core0 and core1. +## Configure thread affinity for analysis + +Run llama-cli with `-t 2 -C 0x3` to specify two threads and thread affinity as CPU core0 and core1. Thread affinity ensures threads run on specific cores, making performance analysis more predictable. ```bash ./llama-cli -m qwen1_5-0_5b-chat-q4_0.gguf -p "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n<|im_start|>user\nTell me a story about a fox and a crow? Please do not tell the traditional story in Aesop's fables. Please tell me a positive story about friendship and love. The story should have no more than 400 words<|im_end|>\n<|im_start|>assistant\n" -st -t 2 -C 0x3 ``` -### Streamline results +## Analyze Streamline results Collect profiling data with Streamline, then select Core Map and Cluster Map modes in the Timeline view. -![text#center](images/multi_thread_core_map.png "Figure 24. Multi-Thread") +![Screenshot of Streamline Core Map view showing thread execution across CPU cores with thread affinity mapping alt-text#center](images/multi_thread_core_map.png "Multi-thread core map") + +In the screenshot above, you can observe that two threads are created and they are running on CPU core0 and CPU core1, respectively. This confirms that the thread affinity configuration is working correctly. + +You can also use the Annotation Channel view to analyze operator execution on a per-thread basis. Each thread generates its own annotation channel independently, allowing you to see how work is distributed across parallel execution units. -In the screenshot above: -- Two threads are created -- They are running on CPU core0 and CPU core1, respectively +![Screenshot showing Streamline annotation channels with multiple threads executing the same tensor node simultaneously alt-text#center](images/multi_thread_annotation_channel.png "Multi-thread annotation channels") -In addition, you can use the Annotation Channel view to analyze operator execution on a per-thread basis. Each thread generates its own annotation channel independently. +In the screenshot above, at the highlighted time, both threads are executing the same node. In this particular case, the node is the result_output linear layer. You can see how the workload is distributed across threads, with each thread processing a different portion of the matrix computation. This visualization helps identify load balancing issues and optimization opportunities in parallel execution. -![text#center](images/multi_thread_annotation_channel.png "Figure 25. 
Multi-Thread") +## Summary -In the screenshot above, at the highlighted time: -- Both threads are executing the same node -- In this case, the node is the result_output linear layer +You have successfully completed the walkthrough of profiling an LLM model on an Arm CPU using advanced multi-threading analysis techniques. -You have completed the walkthrough of profiling an LLM model on an Arm CPU! +You now understand how to integrate Streamline annotations into LLM inference code for detailed profiling, capture and analyze performance data showing the distinct characteristics of Prefill and Decode stages, and use Annotation Channels to analyze individual operators and their execution patterns. Additionally, you can configure thread affinity and examine multi-threaded execution patterns across CPU cores while identifying performance bottlenecks and work distribution issues in parallel execution. -By combining Arm Streamline with a solid understanding of llama.cpp, you can visualize model execution, analyze code efficiency, and identify opportunities for optimization. +These skills enable you to optimize LLM performance on Arm CPUs by understanding where computational resources are spent and how to leverage multi-core parallelism effectively. By combining Arm Streamline with a solid understanding of llama.cpp threading architecture, you can visualize model execution, analyze code efficiency, and identify opportunities for optimization. -Keep in mind that adding annotation code to llama.cpp and gatord may introduce a small performance overhead, so profiling results should be interpreted with this in mind. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md index 1f97f58d6e..840ec69ccb 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md @@ -1,19 +1,15 @@ --- -title: Analyze llama.cpp with KleidiAI LLM performance using Streamline +title: Profile llama.cpp performance with Arm Streamline and KleidiAI LLM kernels -draft: true -cascade: - draft: true +minutes_to_complete: 60 -minutes_to_complete: 50 - -who_is_this_for: This is an advanced topic for software developers, performance engineers, and AI practitioners who want to run llama.cpp on Arm-based CPUs, learn how to use Arm Streamline to capture and analyze performance data, and understand how LLM inference behaves at the Prefill and Decode stages. +who_is_this_for: This is an advanced topic for software developers, performance engineers, and AI practitioners who want to optimize llama.cpp performance on Arm-based CPUs. 
learning_objectives: - - Describe the architecture of llama.cpp and the role of the Prefill and Decode stages + - Profile llama.cpp architecture and identify the role of the Prefill and Decode stages - Integrate Streamline Annotations into llama.cpp for fine-grained performance insights - Capture and interpret profiling data with Streamline - - Use Annotation Channels to analyze specific operators during token generation + - Analyze specific operators during token generation using Annotation Channels - Evaluate multi-core and multi-thread execution of llama.cpp on Arm CPUs prerequisites: @@ -36,7 +32,6 @@ tools_software_languages: - Arm Streamline - C++ - llama.cpp - - KleidiAI - Profiling operatingsystems: - Linux @@ -45,16 +40,24 @@ operatingsystems: further_reading: - resource: title: llama.cpp project - link: https://github.com/ggml-org/llama.cpp - type: source code + link: https://github.com/ggml-org/llama.cpp + type: website - resource: - title: Qwen1_5-0_5b-chat-q4_0.gguf - link: https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat-GGUF/blob/main/qwen1_5-0_5b-chat-q4_0.gguf - type: LLM model + title: Build and run llama.cpp on Arm servers + link: /learning-paths/servers-and-cloud-computing/llama-cpu/ + type: website + - resource: + title: Run a Large Language Model chatbot with PyTorch using KleidiAI + link: /learning-paths/servers-and-cloud-computing/pytorch-llama/ + type: website - resource: title: Arm Streamline User Guide link: https://developer.arm.com/documentation/101816/9-7 type: website + - resource: + title: KleidiAI project + link: https://github.com/ARM-software/kleidiai + type: website ### FIXED, DO NOT MODIFY # ================================================================================