Merged

Commits (36)
ff9f8bc
Add NSS model training gym LP
annietllnd Sep 5, 2025
cce3859
Update _index.md
annietllnd Sep 22, 2025
9a03b72
Update model-training-gym LP
annietllnd Sep 22, 2025
a4d06c6
Add Learning Path about tracking resource usage on WoA
Reyfone Sep 4, 2025
8d1da63
Update Llama3 on RPi LP
annietllnd Sep 26, 2025
8636871
Update Llama3 on Android LP
annietllnd Sep 29, 2025
ca31aba
Update NSS model training gym LP
annietllnd Sep 29, 2025
a36cce6
Add link to .vgf resources
annietllnd Sep 30, 2025
1c922f3
First pass content dev
madeline-underwood Oct 4, 2025
78abd4f
Updates to IRQ Tuning Guide
madeline-underwood Oct 4, 2025
6b779e7
Final checks
madeline-underwood Oct 5, 2025
7916acd
Refine IRQ tuning guide for clarity and conciseness in explanations a…
madeline-underwood Oct 5, 2025
5cbdf01
Review top-down methodology Learning Path
jasonrandrews Oct 6, 2025
b9dc153
Merge pull request #2398 from jasonrandrews/review
jasonrandrews Oct 6, 2025
ca803aa
Update topdown-tool output
jasonrandrews Oct 6, 2025
4922add
Merge pull request #2399 from jasonrandrews/review
jasonrandrews Oct 6, 2025
d61635a
Update Model Training Gym LP
annietllnd Oct 7, 2025
4f3cacd
Merge pull request #2396 from madeline-underwood/review
jasonrandrews Oct 8, 2025
7de90dc
update AI tool instructions
jasonrandrews Oct 8, 2025
8a2bab1
Merge pull request #2401 from jasonrandrews/review
jasonrandrews Oct 8, 2025
8e37aa2
fix: remove unnecessary line break in multi-threaded performance section
madeline-underwood Oct 8, 2025
9106ee6
Enhance llama.cpp performance profiling documentation with Streamline…
madeline-underwood Oct 8, 2025
0ffb662
Update Model Training Gym LP
annietllnd Oct 9, 2025
8c860d1
Merge pull request #2369 from annietllnd/updates
pareenaverma Oct 9, 2025
4a15673
Merge pull request #2301 from annietllnd/neural-graphics
pareenaverma Oct 9, 2025
2f3c546
Merge pull request #2402 from madeline-underwood/llamacpp
jasonrandrews Oct 9, 2025
5f9e962
Merge branch 'main' into woa
jasonrandrews Oct 9, 2025
93901e6
Merge pull request #2297 from Reyfone/woa
jasonrandrews Oct 9, 2025
67cf96f
Put Windows resource usage in draft mode for tech review
jasonrandrews Oct 9, 2025
8ffc1df
Merge pull request #2404 from jasonrandrews/review
jasonrandrews Oct 9, 2025
0746421
Update the vision llm page and add content of UI demo
HenryDen Oct 10, 2025
d33a304
Update 1-devenv-and-model.md
annietllnd Oct 10, 2025
050c9be
Merge pull request #2405 from HenryDen/ViT_update
pareenaverma Oct 10, 2025
0b4e614
Merge pull request #2406 from annietllnd/ViT_update
pareenaverma Oct 10, 2025
06f0d4a
Update _index.md
pareenaverma Oct 10, 2025
175144f
Update .wordlist.txt
pareenaverma Oct 10, 2025
29 changes: 23 additions & 6 deletions .github/copilot-instructions.md
@@ -45,6 +45,8 @@ Read the files in the directory `content/learning-paths/cross-platform/_example-

Each Learning Path must have an _index.md file and a _next-steps.md file. The _index.md file contains the main content of the Learning Path. The _next-steps.md file contains links to related content and is included at the end of the Learning Path.

Additional resources and 'next steps' content should be placed in the `further_reading` section of `_index.md`, NOT in `_next-steps.md`. The `_next-steps.md` file should remain minimal and unmodified as indicated by "FIXED, DO NOT MODIFY" comments in the template.

The _index.md file should contain the following front matter and content sections:

Front Matter (YAML format):
@@ -60,6 +62,16 @@ Front Matter (YAML format):
- `skilllevels`: Skill levels allowed are only Introductory and Advanced
- `operatingsystems`: Operating systems used, must match the closed list on https://learn.arm.com/learning-paths/cross-platform/_example-learning-path/write-2-metadata/

### Further Reading Curation

Limit further_reading resources to 4-6 essential links. Prioritize:
- Direct relevance to the topic
- Arm-specific Learning Paths over generic external resources
- Foundation knowledge for target audience
- Required tools (install guides)
- Logical progression from basic to advanced

Avoid overwhelming readers with too many links, which can cause them to leave the platform.

All Learning Paths should generally include:
Title: [Imperative verb] + [technology/tool] + [outcome]
@@ -205,18 +217,23 @@ Some links are useful in content, but too many links can be distracting and read

### Internal links

-Use a relative path format for internal links that are on learn.arm.com.
-For example, use: descriptive link text pointing to a relative path like learning-paths/category/path-name/
+Use the full path format for internal links: `/learning-paths/category/path-name/` (e.g., `/learning-paths/cross-platform/docker/`). Do NOT use relative paths like `../path-name/`.

Examples:
-- learning-paths/servers-and-cloud-computing/csp/ (Arm-based instance)
-- learning-paths/cross-platform/docker/ (Docker learning path)
+- /learning-paths/servers-and-cloud-computing/csp/ (Arm-based instance)
+- /learning-paths/cross-platform/docker/ (Docker learning path)

### External links

Use the full URL for external links that are not on learn.arm.com, these open in a new tab.

-This instruction set enables high-quality Arm Learning Paths content while maintaining consistency and technical accuracy.

### Link Verification Process

When creating Learning Path content:
- Verify internal links exist before adding them
- Use semantic search or website browsing to confirm Learning Path availability
- Prefer verified external authoritative sources over speculative internal links
- Test link formats against existing Learning Path examples
- Never assume Learning Paths exist without verification

+This instruction set enables high-quality Arm Learning Paths content while maintaining consistency and technical accuracy.
3 changes: 2 additions & 1 deletion .wordlist.txt
@@ -4976,4 +4976,5 @@ StatefulSets
codemia
multidisks
testsh
-uops
+uops
+subgraph
3 changes: 2 additions & 1 deletion assets/contributors.csv
@@ -102,5 +102,6 @@ Ker Liu,,,,,
Rui Chang,,,,,
Alejandro Martinez Vicente,Arm,,,,
Mohamad Najem,Arm,,,,
Ruifeng Wang,Arm,,,,
Zenon Zhilong Xiu,Arm,,zenon-zhilong-xiu-491bb398,,
-Zbynek Roubalik,Kedify,,,,
+Zbynek Roubalik,Kedify,,,,
@@ -14,17 +14,20 @@ Both Intel x86 and Arm Neoverse CPUs provide sophisticated Performance Monitorin

While the specific counter names and formulas differ between architectures, both Intel x86 and Arm Neoverse have converged on top-down performance analysis methodologies that categorize performance bottlenecks into four key areas:

-**Retiring** represents pipeline slots that successfully complete useful work, while **Bad Speculation** accounts for slots wasted on mispredicted branches. Additionally, **Frontend Bound** identifies slots stalled due to instruction fetch and decode limitations, and **Backend Bound** covers slots stalled by execution resource constraints.
+- Retiring
+- Bad Speculation
+- Frontend Bound
+- Backend Bound

-This Learning Path provides a comparison of how x86 processors implement four-level hierarchical top-down analysis compared to Arm Neoverse's two-stage methodology, highlighting the similarities in approach while explaining the architectural differences in PMU counter events and formulas.
+This Learning Path provides a comparison of how x86 processors implement multi-level hierarchical top-down analysis compared to Arm Neoverse's methodology, highlighting the similarities in approach while explaining the architectural differences in PMU counter events and formulas.

## Introduction to top-down performance analysis

-The top-down methodology makes performance analysis easier by shifting focus from individual PMU counters to pipeline slot utilization. Instead of trying to interpret dozens of seemingly unrelated metrics, you can systematically identify bottlenecks by attributing each CPU pipeline slot to one of four categories.
+The top-down methodology makes performance analysis easier by shifting focus from individual PMU counters to pipeline slot utilization. Instead of trying to interpret dozens of seemingly unrelated metrics, you can systematically identify bottlenecks by attributing each CPU pipeline slot to one of the four categories.

**Retiring** represents pipeline slots that successfully complete useful work, while **Bad Speculation** accounts for slots wasted on mispredicted branches and pipeline flushes. **Frontend Bound** identifies slots stalled due to instruction fetch and decode limitations, whereas **Backend Bound** covers slots stalled by execution resource constraints such as cache misses or arithmetic unit availability.

-The methodology uses a hierarchical approach that allows you to drill down only into the dominant bottleneck category, and avoid the complexity of analyzing all possible performance issues at the same time.
+The methodology allows you to drill down only into the dominant bottleneck category, avoiding the complexity of analyzing all possible performance issues at the same time.

The next sections compare the Intel x86 methodology with the Arm top-down methodology.

22 changes: 12 additions & 10 deletions content/learning-paths/cross-platform/topdown-compare/1a-intel.md
@@ -1,5 +1,5 @@
---
-title: "Implement Intel x86 4-level hierarchical top-down analysis"
+title: "Understand Intel x86 multi-level hierarchical top-down analysis"
weight: 4

### FIXED, DO NOT MODIFY
@@ -8,9 +8,9 @@ layout: learningpathall

## Configure slot-based accounting with Intel x86 PMU counters

-Intel uses a slot-based accounting model where each CPU cycle provides multiple issue slots. A slot is a hardware resource needed to process micro-operations (uops). More slots means more work can be done per cycle. The number of slots depends on the microarchitecture design but current Intel processor designs typically have four issue slots per cycle.
+Intel uses a slot-based accounting model where each CPU cycle provides multiple issue slots. A slot is a hardware resource needed to process micro-operations (uops). More slots means more work can be done per cycle. The number of slots depends on the microarchitecture design, but current Intel processor designs typically have four issue slots per cycle.

-Intel's methodology uses a multi-level hierarchy that extends to 4 levels of detail. Each level provides progressively more granular analysis, allowing you to drill down from high-level categories to specific microarchitecture events.
+Intel's methodology uses a multi-level hierarchy that typically extends to 3-4 levels of detail. Each level provides progressively more granular analysis, allowing you to drill down from high-level categories to specific microarchitecture events.

## Level 1: Identify top-level performance categories

@@ -27,18 +27,20 @@ Where `SLOTS = 4 * CPU_CLK_UNHALTED.THREAD` on most Intel cores.
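As a sketch, the Level 1 classification can be computed directly from raw counts. The formulas below follow Intel's published top-down definitions; treat the exact event names and the four-slot width as assumptions to verify for your specific core:

```python
# Sketch: Intel TMA Level 1 classification from raw PMU counts.
# Event names follow Intel's published top-down formulas; the slot
# width and recovery-cycle event vary by microarchitecture.

SLOTS_PER_CYCLE = 4  # typical for recent Intel cores

def tma_level1(counters):
    """Return the four Level 1 category fractions from raw counter values."""
    slots = SLOTS_PER_CYCLE * counters["CPU_CLK_UNHALTED.THREAD"]
    frontend_bound = counters["IDQ_UOPS_NOT_DELIVERED.CORE"] / slots
    retiring = counters["UOPS_RETIRED.RETIRE_SLOTS"] / slots
    bad_speculation = (counters["UOPS_ISSUED.ANY"]
                       - counters["UOPS_RETIRED.RETIRE_SLOTS"]
                       + SLOTS_PER_CYCLE * counters["INT_MISC.RECOVERY_CYCLES"]) / slots
    # Backend Bound is derived as the remainder of all slots
    backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
    return {"Frontend Bound": frontend_bound,
            "Bad Speculation": bad_speculation,
            "Retiring": retiring,
            "Backend Bound": backend_bound}
```

The four fractions always sum to 1.0, which is what makes slot accounting exhaustive: every slot is attributed to exactly one category.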

Once you've identified the dominant Level 1 category, Level 2 drills into each area to identify broader causes. This level distinguishes between frontend latency and bandwidth limits, or between memory and core execution stalls in the backend.

-- Frontend Bound covers frontend latency in comparison with frontend bandwidth
-- Backend Bound covers memory bound in comparison with core bound
-- Bad Speculation covers branch mispredicts in comparison with machine clears
-- Retiring covers base in comparison with microcode sequencer
+- Frontend Bound covers frontend latency compared with frontend bandwidth
+- Backend Bound covers memory bound compared with core bound
+- Bad Speculation covers branch mispredicts compared with machine clears
+- Retiring covers base compared with microcode sequencer

## Level 3: Target specific microarchitecture bottlenecks

-After identifying broader cause categories in Level 2, Level 3 provides fine-grained attribution that pinpoints specific bottlenecks like DRAM latency, cache misses, or port contention. This precision makes it possible to identify the exact root cause and apply targeted optimizations. Memory Bound expands into detailed cache hierarchy analysis including L1 Bound, L2 Bound, L3 Bound, DRAM Bound, and Store Bound categories, while Core Bound breaks down into execution unit constraints such as Divider and Ports Utilization, along with many other specific microarchitecture-level categories that enable precise performance tuning.
+After identifying broader cause categories in Level 2, Level 3 provides fine-grained attribution that pinpoints specific bottlenecks like DRAM latency, cache misses, or port contention. This precision makes it possible to identify the exact root cause and apply targeted optimizations.
+
+Memory Bound expands into detailed cache hierarchy analysis including L1 Bound, L2 Bound, L3 Bound, DRAM Bound, and Store Bound categories. Core Bound breaks down into execution unit constraints such as Divider and Ports Utilization, along with many other specific microarchitecture-level categories that enable precise performance tuning.

## Level 4: Access specific PMU counter events

-The final level provides direct access to the specific microarchitecture events that cause the inefficiencies. At this level, you work directly with raw PMU counter values to understand the underlying hardware behavior causing performance bottlenecks. This enables precise tuning by identifying exactly which execution units, cache levels, or pipeline stages are limiting performance, allowing you to apply targeted code optimizations or hardware configuration changes.
+Level 4 provides direct access to the specific microarchitecture events that cause the inefficiencies. At this level, you work directly with raw PMU counter values to understand the underlying hardware behavior causing performance bottlenecks. This enables precise tuning by identifying exactly which execution units, cache levels, or pipeline stages are limiting performance, allowing you to apply targeted code optimizations or hardware configuration changes.

## Apply essential Intel x86 PMU counters for analysis

@@ -63,5 +65,5 @@ Intel processors expose hundreds of performance events, but top-down analysis re
| `OFFCORE_RESPONSE.*` | Detailed classification of off-core responses (L3 vs. DRAM, local vs. remote socket) |


-Using the above levels of metrics you can find out which of the four top-level categories are causing bottlenecks.
+Using the above levels of metrics, you can determine which of the four top-level categories are causing bottlenecks.
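The drill-down decision itself can be sketched as a small helper. The Level 2 groupings mirror the lists above; the threshold value is an illustrative assumption, not an official Intel number:

```python
# Sketch: choose the dominant Level 1 category and report which
# Level 2 metrics to inspect next. Threshold is illustrative.

LEVEL2 = {
    "Frontend Bound": ["Frontend Latency", "Frontend Bandwidth"],
    "Backend Bound": ["Memory Bound", "Core Bound"],
    "Bad Speculation": ["Branch Mispredicts", "Machine Clears"],
    "Retiring": ["Base", "Microcode Sequencer"],
}

def drill_down(level1_fractions, threshold=0.2):
    """Return (dominant category, its Level 2 children), or None if
    no category exceeds the threshold."""
    category = max(level1_fractions, key=level1_fractions.get)
    if level1_fractions[category] < threshold:
        return None
    return category, LEVEL2[category]
```

This captures the key property of the hierarchy: you only expand the branch that dominates, rather than analyzing every metric at once.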

12 changes: 7 additions & 5 deletions content/learning-paths/cross-platform/topdown-compare/1b-arm.md
@@ -1,5 +1,5 @@
---
-title: "Implement Arm Neoverse 2-stage top-down analysis"
+title: "Understand Arm Neoverse top-down analysis"
weight: 5

### FIXED, DO NOT MODIFY
@@ -9,15 +9,15 @@ layout: learningpathall

After understanding Intel's comprehensive 4-level hierarchy, you can explore how Arm approached the same performance analysis challenge with a different philosophy. Arm developed a complementary top-down methodology specifically for Neoverse server cores that prioritizes practical usability while maintaining analysis effectiveness.

-The Arm Neoverse architecture uses an 8-slot rename unit for pipeline bandwidth accounting, differing from Intel's issue-slot model. Unlike Intel's hierarchical model, Arm employs a streamlined two-stage methodology that balances analysis depth with practical usability.
+The Arm Neoverse architecture uses an 8-slot rename unit for pipeline bandwidth accounting, which differs from Intel's issue-slot model. Unlike Intel's hierarchical model, Arm employs a streamlined two-stage methodology that balances analysis depth with practical usability.

### Execute Stage 1: Calculate top-down performance categories

-Stage 1 identifies high-level bottlenecks using the same four categories as Intel but with Arm-specific PMU events and formulas. This stage uses slot-based accounting similar to Intel's approach while employing Arm event names and calculations tailored to the Neoverse architecture.
+Stage 1 identifies high-level bottlenecks using the same four categories as Intel, but with Arm-specific PMU events and formulas. This stage uses slot-based accounting similar to Intel's approach while employing Arm event names and calculations tailored to the Neoverse architecture.

#### Configure Arm-specific PMU counter formulas

-Arm uses different top-down metrics based on different events but the concept remains similar to Intel's approach. The key difference lies in the formula calculations and slot accounting methodology:
+Arm uses different top-down metrics based on different events, but the concept remains similar to Intel's approach. The key difference lies in the formula calculations and slot accounting methodology:

| Metric | Formula | Purpose |
| :-- | :-- | :-- |
@@ -32,7 +32,9 @@ Stage 2 focuses on resource-specific effectiveness metrics grouped by CPU compon

#### Navigate resource groups without hierarchical constraints

-Instead of Intel's hierarchical levels, Arm organizes detailed metrics into effectiveness groups that can be explored independently. **Branch Effectiveness** provides misprediction rates and MPKI, while **ITLB/DTLB Effectiveness** measures translation lookaside buffer efficiency. **Cache Effectiveness** groups (L1I/L1D/L2/LL) deliver cache hit ratios and MPKI across the memory hierarchy. Additionally, **Operation Mix** breaks down instruction types (SIMD, integer, load/store), and **Cycle Accounting** tracks frontend versus backend stall percentages.
+Instead of Intel's hierarchical levels, Arm organizes detailed metrics into effectiveness groups that can be explored independently.
+
+**Branch Effectiveness** provides misprediction rates and MPKI, while **ITLB/DTLB Effectiveness** measures translation lookaside buffer efficiency. **Cache Effectiveness** groups (L1I/L1D/L2/LL) deliver cache hit ratios and MPKI across the memory hierarchy. Additionally, **Operation Mix** breaks down instruction types (SIMD, integer, load/store), and **Cycle Accounting** tracks frontend versus backend stall percentages.

## Apply essential Arm Neoverse PMU counters for analysis

@@ -13,7 +13,7 @@ After understanding each architecture's methodology individually, you can now ex
- Hierarchical analysis: broad classification followed by drill-down into dominant bottlenecks
- Resource attribution: map performance issues to specific CPU micro-architectural components

-## Compare 4-level hierarchical and 2-stage methodologies
+## Compare multi-level hierarchical and resource groups methodologies

| Aspect | Intel x86 | Arm Neoverse |
| :-- | :-- | :-- |