Merged
139 changes: 138 additions & 1 deletion .wordlist.txt
@@ -4812,4 +4812,141 @@ learnt
lof
BalenaOS
balenaCloud

MX
ARMFp
AndroidDemo
ApacheBench
ArmHalideAndroidDemo
Autoscheduler
BGR
BVM
BenchmarkBubbleSort
BenchmarkQuickSort
Botspot
BoundaryConditions
BubbleSort
ByteBuffer
DGGML
DNQZJ
DTLB
EPYC
ETag
EVEX
Esc
FuseAll
FuseBlurAndThreshold
GGG
GOPATH
GOROOT
GTK
GetByteArrayElements
Golang
Golang’s
HWC
Halide
Halide’s
ImageParam
Istio
KEDA
Kedify
Kedify’s
LLC
LLE
MPix
NIC’s
Netty
NoRuntime
OpenBMC’s
Parallelization
QCOW
QuickSort
RDom
RGBRGBRGB
RRR
RamFB
Recomputation
ReleaseByteArrayElements
Remmina
Roubalik
SAXPY
ScaledObject
Scaler
SetByteArrayRegion
SoL
Sor
Sysoev
TinyRPS
UFW
VLA
VTOR
VirtualService
WindowsOnArm
XMM
YMM
YUV
ZMM
Zbynek
adaptively
allocs
apiKey
armhalideandroiddemo
autounattend
autowiring
benchmarkHttpResponse
benchmem
blurThresholdImage
bvm
clusterName
coroutine
createBitmapFromGrayBytes
cv
extractGrayScaleBytes
fallbacks
firstlogin
golang
gosort
goweb
halide
httpd
inBytes
inlines
inputBuffer
insturction
jbyteArray
keda
kedify
keypress
kts
llmexport
loadImageFromAssets
microarchitectures
minikube
oOer
orgId
outputArray
outputBuffer
parallelization
parallelize
parallelized
parallelizes
preallocation
precomputing
qcow
recomputation
reconfig
reconversion
refetching
req
scaler
scalers
sprintf
stdev
thresholded
underperformed
underperforms
unvectorized
uop
walkthrough
warmups
xo
yi
9 changes: 5 additions & 4 deletions content/learning-paths/automotive/_index.md
@@ -12,21 +12,22 @@ title: Automotive
weight: 4
subjects_filter:
- Containers and Virtualization: 3
- Performance and Architecture: 5
- Performance and Architecture: 6
operatingsystems_filter:
- Baremetal: 1
- Linux: 7
- Linux: 8
- macOS: 1
- RTOS: 1
tools_software_languages_filter:
- Arm Development Studio: 1
- Arm Zena CSS: 1
- C: 2
- C++: 1
- Clang: 2
- Clang: 3
- DDS: 1
- Docker: 2
- GCC: 2
- FVP: 1
- GCC: 3
- Python: 2
- Raspberry Pi: 1
- ROS 2: 3
Original file line number Diff line number Diff line change
@@ -45,11 +45,12 @@ tools_software_languages_filter:
- CMSIS-DSP: 1
- CMSIS-Toolbox: 3
- CNN: 1
- Computer Vision: 1
- Containerd: 1
- DetectNet: 1
- Docker: 10
- DSTREAM: 2
- Edge AI: 1
- Edge AI: 2
- Edge Impulse: 1
- ExecuTorch: 3
- FastAPI: 1
10 changes: 7 additions & 3 deletions content/learning-paths/laptops-and-desktops/_index.md
@@ -9,13 +9,13 @@ maintopic: true
operatingsystems_filter:
- Android: 2
- ChromeOS: 2
- Linux: 33
- Linux: 34
- macOS: 9
- Windows: 44
- Windows: 45
subjects_filter:
- CI-CD: 5
- Containers and Virtualization: 7
- Migration to Arm: 28
- Migration to Arm: 29
- ML: 2
- Performance and Architecture: 27
subtitle: Create and migrate apps for power efficient performance
@@ -28,6 +28,7 @@ tools_software_languages_filter:
- Arm Performance Libraries: 2
- Arm64EC: 1
- Assembly: 1
- Bash: 1
- C: 8
- C#: 6
- C++: 11
@@ -48,6 +49,7 @@ tools_software_languages_filter:
- Intrinsics: 1
- JavaScript: 2
- Kubernetes: 1
- KVM: 1
- Linux: 1
- LLM: 1
- LLVM: 2
@@ -61,7 +63,9 @@ tools_software_languages_filter:
- OpenCV: 1
- perf: 4
- Python: 6
- QEMU: 1
- Qt: 2
- RDP: 1
- Remote.It: 1
- RME: 1
- Runbook: 18
7 changes: 4 additions & 3 deletions content/learning-paths/mobile-graphics-and-gaming/_index.md
@@ -9,15 +9,15 @@ key_ip:
- Mali
maintopic: true
operatingsystems_filter:
- Android: 31
- Android: 32
- Linux: 30
- macOS: 14
- Windows: 14
subjects_filter:
- Gaming: 6
- Graphics: 6
- ML: 12
- Performance and Architecture: 34
- Performance and Architecture: 35
subtitle: Optimize Android apps and build faster games using cutting-edge Arm tech
title: Mobile, Graphics, and Gaming
tools_software_languages_filter:
@@ -26,7 +26,7 @@ tools_software_languages_filter:
- Android: 4
- Android NDK: 2
- Android SDK: 1
- Android Studio: 10
- Android Studio: 11
- Arm Development Studio: 1
- Arm Mobile Studio: 1
- Arm Performance Studio: 3
@@ -38,6 +38,7 @@ tools_software_languages_filter:
- CCA: 1
- Clang: 12
- CMake: 1
- Coding: 1
- Docker: 1
- ExecuTorch: 1
- Frame Advisor: 1
22 changes: 16 additions & 6 deletions content/learning-paths/servers-and-cloud-computing/_index.md
@@ -8,7 +8,7 @@ key_ip:
maintopic: true
operatingsystems_filter:
- Android: 3
- Linux: 175
- Linux: 177
- macOS: 13
- Windows: 14
pinned_modules:
@@ -19,11 +19,11 @@ pinned_modules:
- migration
subjects_filter:
- CI-CD: 7
- Containers and Virtualization: 31
- Containers and Virtualization: 32
- Databases: 17
- Libraries: 9
- ML: 31
- Performance and Architecture: 71
- Performance and Architecture: 72
- Storage: 1
- Web: 12
subtitle: Optimize cloud native apps on Arm for performance and cost
@@ -72,7 +72,7 @@ tools_software_languages_filter:
- Capstone: 1
- CCA: 8
- Clair: 1
- Clang: 12
- Clang: 13
- ClickBench: 1
- ClickHouse: 1
- CMake: 1
@@ -89,7 +89,7 @@ tools_software_languages_filter:
- Fortran: 1
- FunASR: 1
- FVP: 7
- GCC: 24
- GCC: 25
- gdb: 1
- Geekbench: 1
- Generative AI: 12
@@ -106,21 +106,27 @@ tools_software_languages_filter:
- Google Cloud: 2
- Google Test: 1
- HammerDB: 1
- Helm: 1
- Herd7: 1
- Hugging Face: 11
- InnoDB: 1
- Intrinsics: 1
- iPerf3: 1
- ipmitool: 1
- Java: 4
- JAX: 1
- JMH: 1
- Kafka: 1
- KEDA: 1
- Kedify: 1
- Keras: 1
- Kubernetes: 10
- KleidiAI: 1
- Kubernetes: 11
- Libamath: 1
- libbpf: 1
- Linaro Forge: 1
- Litmus7: 1
- llama.cpp: 1
- Llama.cpp: 2
- LLM: 10
- llvm-mca: 1
@@ -135,19 +141,22 @@ tools_software_languages_filter:
- mpi: 1
- MySQL: 9
- NEON: 7
- Neoverse: 1
- Networking: 1
- Nexmark: 1
- NGINX: 4
- Node.js: 3
- Ollama: 1
- ONNX Runtime: 1
- OpenBLAS: 1
- OpenBMC: 1
- OpenJDK 21: 2
- OpenShift: 1
- Orchard Core: 1
- PAPI: 1
- perf: 6
- PostgreSQL: 4
- Profiling: 1
- Python: 31
- PyTorch: 9
- QEMU: 1
@@ -188,6 +197,7 @@ tools_software_languages_filter:
- wrk2: 2
- x265: 1
- YCSB: 1
- Yocto/BitBake: 1
- zlib: 1
- ZooKeeper: 1
weight: 1
Original file line number Diff line number Diff line change
@@ -69,4 +69,4 @@ done

### Saving these changes

Any changes you make to IRQs will be reset at reboot. You will need to change your systems settings to make your changes permanant.
Any changes you make to IRQs will be reset at reboot. You will need to change your systems settings to make your changes permanent.
Original file line number Diff line number Diff line change
@@ -26,7 +26,7 @@ For installation guidance, refer to the [Streamline installation guide](https://

Clone the gator repository that matches your Streamline version and build the `Annotation support library`.

The installation step is depends on your developement machine.
The installation step is depends on your development machine.

For Arm native build, you can use following insturction to install the packages.
For other machine, you need to set up the cross compiler environment by install [aarch64 gcc compiler toolchain](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads).
@@ -121,7 +121,7 @@ Finally, add an annotation marker inside the main loop:
}
```

A string is added to the Annotation Marker to record the position of input tokens and numbr of tokens to be processed.
A string is added to the Annotation Marker to record the position of input tokens and number of tokens to be processed.

### Step 3: Build llama-cli

Original file line number Diff line number Diff line change
@@ -113,7 +113,7 @@ By monitoring other PMU events, Backend Stall Cycles and Backend Stall Cycles du
We can see that at Prefill stage, Backend Stall Cycles due to Memory stall are only about 10% of total Backend Stall Cycles. However, at Decode stage, Backend Stall Cycles due to Memory stall are around 50% of total Backend Stall Cycles.
All those PMU event counters indicate that it is compute-bound at Prefill stage and memory-bound at Decode stage.

Now, let us further profile the code execution with Streamline. In the ‘Call Paths’ view of Streamline, we can see the percentage of running time of functions that are orginized in form of call stack.
Now, let us further profile the code execution with Streamline. In the ‘Call Paths’ view of Streamline, we can see the percentage of running time of functions that are organized in form of call stack.
![text#center](images/annotation_prefill_call_stack.png "Figure 12. Call stack")

In the ‘Functions’ view of Streamline, we can see the overall percentage of running time of functions.