92 changes: 91 additions & 1 deletion .wordlist.txt
@@ -4667,7 +4667,7 @@ Sommelier
chromeos
linuxcontainers
XPS
NIC's
NIC’s
offlines
passthrough
SLOs
@@ -4722,4 +4722,94 @@ ATtestation
CoCo
procedureS
NIC’s
httpbin
proxying
OpenBMC
PoC
PoCs
evb
ipmitool
openbmc
poc
IPMI
integrators
KCS
PLDM
MCTP
Redfish
hyperscalers
BMCs
OEM
NetFn
RDv
CSSv
penBmc
BMC's
socat
ZooKeeper
IRQs
IRQS
Friedt
namespaces
atlascli
benchmarkDB
cursorTest
replset
testCollection
Namespaces
mongotop
Mongotop
baselineDB
ef
netstat
tulnp
mongostat
arw
conn
getmore
qrw
vsize
conn
WiredTiger
GLE
getLastError
createIndex
getMore
getmore
RoT
lkvm
JMH
jmh
UseG
Xmx
Xms
JavaServer
servlets
RMSNorm
RoPE
FFN
ukernel
libstreamline
prefill
OpenCL
subgraphs
threadpool
worksize
Zhilong
Denoiser
RGGB
denoised
YGGV
Mohamad
Najem
kata
svl
svzero
anf
DynamIQ
Zena
learnt
lof
BalenaOS
balenaCloud

@@ -87,7 +87,7 @@ INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Total run time over 20 iterations: 2030.5525 ms
```

Re-run the Low Light Enhancment benchmark:
Re-run the Low Light Enhancement benchmark:

```bash
bin/low_light_image_enhancement_benchmark 20 resources/HDRNetLIME_lr_coeffs_v1_1_0_mixed_low_light_perceptual_l1_loss_float32.tflite
@@ -149,6 +149,6 @@ A successful test shows HTTP/1.1 200 OK with a JSON body from httpbin.org, for e
- **Successful connection:** The `curl` command successfully connected to the Envoy proxy on `localhost:10000`.
- **Correct status code:** Envoy forwards the request and receives a successful `200 OK` response from the upstream.
- **Host header rewrite:** Envoy rewrites `Host` to `httpbin.org` as configured.
- **End-to-end Success:** The proxy is operational; requests are received, processed, and forwarded to the ackend.
- **End-to-end Success:** The proxy is operational; requests are received, processed, and forwarded to the backend.

To stop Envoy in the first terminal, press **Ctrl+C**. This confirms that the end-to-end flow through the Envoy server is working correctly.
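For reference, a request like the following exercises the full path through the proxy (the `/get` path is an assumption; any httpbin.org endpoint works):

```bash
# Send a request through the local Envoy listener and show response headers.
# Envoy should rewrite the Host header to httpbin.org before forwarding.
curl -i http://localhost:10000/get
```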
@@ -6,7 +6,7 @@ weight: 2
layout: learningpathall
---

First you should run the following command to identify all IRQs on the system. Identify the NIC IRQs and adjust the system by experirmenting and seeing how performance improves.
First, run the following command to identify all IRQs on the system. Then identify the NIC IRQs and tune the system by experimenting to see how performance improves.

```
grep '' /proc/irq/*/smp_affinity_list | while IFS=: read path cpus; do
@@ -47,7 +47,7 @@ IRQ 104 -> CPUs 12 -> Device ens34-Tx-Rx-5
IRQ 105 -> CPUs 5 -> Device ens34-Tx-Rx-6
IRQ 106 -> CPUs 10 -> Device ens34-Tx-Rx-7
```
This can potential hurt performance. Suggestions and patterns to expertiment with will be on the next step.
This can potentially hurt performance. Suggestions and patterns to experiment with are covered in the next step.
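The body of the identification loop above is collapsed in this diff. A minimal sketch of a loop that produces output in this shape, assuming each IRQ's device name appears as a subdirectory of `/proc/irq/<N>/`, might look like:

```bash
# Map each IRQ to its CPU affinity list and, where present, its device name.
grep '' /proc/irq/*/smp_affinity_list | while IFS=: read -r path cpus; do
  irq=$(basename "$(dirname "$path")")
  # A registered action (for example ens34-Tx-Rx-5) shows up as a directory.
  device=$(find /proc/irq/"$irq" -mindepth 1 -maxdepth 1 -type d -printf '%f\n' 2>/dev/null | head -1)
  [ -n "$device" ] && echo "IRQ $irq -> CPUs $cpus -> Device $device"
done
```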

### Reset

@@ -69,4 +69,4 @@ done

### Saving these changes

Any changes you make to IRQs will be reset at reboot. You will need to change your systems settings to make your changes permenant.
Any changes you make to IRQs will be reset at reboot. You will need to change your system's settings to make your changes permanent.
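One way to persist an affinity setting is a oneshot systemd unit that re-applies it at boot. The sketch below is hypothetical: the unit name, IRQ number, and CPU number are placeholders for your own values.

```bash
# /etc/systemd/system/irq-affinity.service -- re-pin IRQ 101 to CPU 4 at boot.
cat <<'EOF' | sudo tee /etc/systemd/system/irq-affinity.service
[Unit]
Description=Pin NIC IRQ affinity
After=network-online.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 4 > /proc/irq/101/smp_affinity_list'

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable irq-affinity.service
```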
@@ -90,7 +90,7 @@ then add the Annotation Marker generation code here,
}
```

A string is added to the Annotation Marker to record the position of input tokens and numbr of tokens to be processed.
A string is added to the Annotation Marker to record the position of the input tokens and the number of tokens to be processed.

### Step 3: Build llama-cli executable
For convenience, llama-cli is statically linked.
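A static build can be produced with CMake; the following is a minimal sketch (exact option names vary across llama.cpp versions):

```bash
# Configure a static, optimized build and compile only the llama-cli target.
cmake -B build -DBUILD_SHARED_LIBS=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli -j"$(nproc)"
```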
@@ -181,7 +181,7 @@ By monitoring other PMU events, Backend Stall Cycles and Backend Stall Cycles du
We can see that at the Prefill stage, Backend Stall Cycles due to Memory stalls are only about 10% of the total Backend Stall Cycles. However, at the Decode stage, Backend Stall Cycles due to Memory stalls are around 50% of the total Backend Stall Cycles.
All these PMU event counters indicate that the Prefill stage is compute-bound and the Decode stage is memory-bound.
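These counters can also be read with Linux `perf` outside of Streamline. A minimal sketch, assuming a Neoverse-class core that exposes the architected stall events and using a placeholder model path and prompt:

```bash
# Count total cycles, backend stalls, and backend stalls caused by memory.
perf stat -e cycles,stall_backend,stall_backend_mem \
  ./llama-cli -m qwen1_5-0_5b-chat-q4_0.gguf -p "The quick brown fox"
```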

Now, let us further profile the code execution with Streamline. In the ‘Call Paths’ view of Streamline, we can see the percentage of running time of functions that are orginized in form of call stack.
Now, let us further profile the code execution with Streamline. In the ‘Call Paths’ view of Streamline, we can see the percentage of running time for functions, organized in the form of a call stack.

![text#center](images/annotation_prefill_call_stack.png "Figure 12. Call stack")

@@ -201,4 +201,4 @@ As we can see, the function, graph_compute, takes the largest portion of the run

* There is a result_output linear layer in the Qwen1_5-0_5b-chat-q4_0 model whose weights use the Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation cannot be handled by KleidiAI yet, so it is handled by the ggml_vec_dot_q6_K_q8_K function in the ggml-cpu library.
* The tensor nodes for the computation of Multi-Head Attention are presented as three-dimensional matrices with the FP16 data type (the KV cache also holds FP16 values); they are computed by the ggml_vec_dot_f16 function in the ggml-cpu library.
* The computation of RoPE, Softmax, RMSNorm layers does not take significant portion of the running time.
* The computation of the RoPE, Softmax, and RMSNorm layers does not take a significant portion of the running time.
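A rough operation count shows why the result_output layer dominates: a [1, 1024] x [1024, 151936] GEMV performs one multiply and one add per weight, so each generated token costs roughly

$$2 \times 1024 \times 151936 \approx 3.1 \times 10^{8} \ \text{FLOPs},$$

which dwarfs the per-token cost of the RoPE, Softmax, and RMSNorm layers.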
@@ -9,5 +9,5 @@ layout: learningpathall
# Conclusion
By leveraging the Streamline tool together with a good understanding of the llama.cpp code, the execution process of the LLM model can be visualized, which helps analyze code efficiency and investigate potential optimization.

Note that addtional annotation code in llama.cpp and gatord might somehow affect the performance.
Note that the additional annotation code in llama.cpp and gatord may have some impact on performance.

@@ -7,7 +7,7 @@ cascade:

minutes_to_complete: 50

who_is_this_for: Engineers who want to learn LLM inference on CPU or proflie and optimize llama.cpp code.
who_is_this_for: Engineers who want to learn LLM inference on CPU or profile and optimize llama.cpp code.

learning_objectives:
- Be able to use Streamline to profile llama.cpp code
@@ -9,7 +9,7 @@ layout: learningpathall
## Benchmark MongoDB with **mongotop** and **mongostat**

In this section, you will measure MongoDB's performance in real time.
You will install the official MongoDB database tools, start MongoDB and run a script to simulate heavy load. With the script running you will then meassure the database's live performance using **mongotop** and **mongostat**.
You will install the official MongoDB Database Tools, start MongoDB, and run a script to simulate heavy load. With the script running, you will then measure the database's live performance using **mongotop** and **mongostat**.
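As a preview, both tools take a polling interval in seconds as a positional argument; a typical invocation against a local instance (default connection settings assumed) looks like:

```bash
# Report per-collection read/write activity every 5 seconds.
mongotop 5
# Report server-wide counters (inserts, queries, memory, connections) every 5 seconds.
mongostat 5
```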

1. Install MongoDB Database Tools

@@ -43,4 +43,4 @@ Creating a virtual machine based on Azure Cobalt 100 is no different from creati

![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/final-vm.png "Figure 5: VM deployment confirmation in Azure portal")

While the virtual machine ready, proceed to the next section to delpoy MongoDB on your running instance.
Once the virtual machine is ready, proceed to the next section to deploy MongoDB on your running instance.
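If you prefer the CLI to the portal flow above, an equivalent deployment might look like the following sketch (the resource group and VM names are placeholders, and the image URN is assumed to be the Arm64 Ubuntu 24.04 offer available in your region):

```bash
# Create a Cobalt 100 (D4ps_v6) Arm64 VM with SSH key authentication.
az vm create \
  --resource-group my-rg \
  --name cobalt-vm \
  --image Canonical:ubuntu-24_04-lts:server-arm64:latest \
  --size Standard_D4ps_v6 \
  --admin-username azureuser \
  --generate-ssh-keys
```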
@@ -28,7 +28,7 @@ git config --global user.email "<your-email@example.com>"

## Step 2: Fetch the source code

The RD‑V3 platform firmware stack consists of multiple components, most maintained in separate Git respositories, such as:
The RD‑V3 platform firmware stack consists of multiple components, most maintained in separate Git repositories, such as:

- TF‑A
- SCP/MCP