Open Source Architecture Code Analyzer
This tool allows automatic instruction fetching of assembly code, auto-generation of test cases for assembly instructions (creating latency and throughput benchmarks for a specific instruction form), and throughput analysis and prediction for an innermost loop kernel.
On most systems with Python, pip, and setuptools installed, just run:
pip install --user osaca
for the latest release.
To build OSACA from source, clone this repository using
git clone https://github.com/RRZE-HPC/OSACA and run in the root directory:
python ./setup.py install
After installation, OSACA can be started with the command
osaca in the CLI.
Additional requirements are:
- Graphviz for dependency graph creation (minimal dependency is libgraphviz-dev on Ubuntu)
- Kerncraft for marker insertion
- ibench or asmbench for throughput/latency measurements
A schematic design of OSACA's workflow is shown below:
The usage of OSACA can be listed as:
osaca [-h] [-V] [--arch ARCH] [--fixed] [--db-check] [--import MICROBENCH] [--insert-marker] [--export-graph GRAPHNAME] FILEPATH
|-h, --help||Prints the help message.|
|-V, --version||Shows the program’s version number.|
|--arch ARCH||Replace ARCH with the abbreviation of the target architecture, for example CSX.|
|--fixed||Runs the throughput analysis with fixed probabilities for all suitable ports per instruction. Otherwise, OSACA prints the optimal port utilization for the kernel.|
|--db-check||Runs a sanity check on the database specified by "--arch". The output depends on the verbosity level. Keep in mind that you have to provide a (dummy) filename in any case.|
|--import MICROBENCH||Imports a given microbenchmark output file into the corresponding architecture instruction database. Define the type of microbenchmark either as "ibench" or "asmbench".|
|--insert-marker||OSACA calls the Kerncraft module to interactively insert IACA markers into suggested assembly blocks.|
|--export-graph GRAPHNAME||Output path for the .dot file export. If "." is given, the file is stored as "./osaca_dg.dot". After the file has been created, you can convert it to a PDF file using dot: dot -Tpdf osaca_dg.dot -o osaca_dependency_graph.pdf|
FILEPATH is the path to the file to work with and is always required.
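When OSACA is driven from a script, the command line can be assembled programmatically. The following is a minimal sketch, not part of OSACA itself: the helper name build_osaca_command and its parameters are hypothetical, and it only uses flags listed in the usage line above.

```python
import shlex


def build_osaca_command(filepath, arch, fixed=False, export_graph=None):
    """Assemble an osaca argument list (hypothetical helper, not an OSACA API)."""
    cmd = ["osaca", "--arch", arch]
    if fixed:
        cmd.append("--fixed")
    if export_graph is not None:
        cmd += ["--export-graph", export_graph]
    cmd.append(filepath)  # FILEPATH is always required and goes last
    return cmd


cmd = build_osaca_command("PATH/TO/FILE", "CSX", export_graph=".")
print(shlex.join(cmd))
# To actually run the analysis (requires osaca to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Keeping the arguments in a list rather than a single string avoids shell quoting problems when the file path contains spaces.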
The following sections describe OSACA's functionality.
Throughput & Latency analysis
As the main functionality of OSACA, this analysis runs by default. The core architecture must always be specified with the flag --arch ARCH, where ARCH is the abbreviation of the target architecture.
To extract the right kernel, one has to mark it beforehand. Currently, OSACA only supports the detection of markers in assembly code and therefore only the analysis of assembly files.
Marking a kernel means inserting byte markers in the assembly file before and after the loop. The start marker has to be inserted right in front of the loop label and the end marker directly after the jump instruction. For the convenience of the user, IACA byte markers are used in x86 assembly.
x86 Byte Markers
movl      $111,%ebx       #IACA/OSACA START MARKER
.byte     100,103,144     #IACA/OSACA START MARKER
Loop:
  # ...
movl      $222,%ebx       #IACA/OSACA END MARKER
.byte     100,103,144     #IACA/OSACA END MARKER
AArch64 Byte Markers
mov      x1, #111        // OSACA START
.byte    213,3,32,31     // OSACA START
  // ...
mov      x1, #222        // OSACA END
.byte    213,3,32,31     // OSACA END
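The placement rule for the markers (start marker directly before the loop label, end marker directly after the jump instruction) can be sketched in a few lines of Python. This is an illustration only, not how OSACA or Kerncraft implement it: the function mark_kernel is hypothetical, it handles x86 only, and it assumes the loop's back jump is the first line after the label whose last token is the label name, with no trailing comment.

```python
# x86 byte markers as listed above
X86_START = [
    "movl      $111,%ebx       #IACA/OSACA START MARKER",
    ".byte     100,103,144     #IACA/OSACA START MARKER",
]
X86_END = [
    "movl      $222,%ebx       #IACA/OSACA END MARKER",
    ".byte     100,103,144     #IACA/OSACA END MARKER",
]


def mark_kernel(asm_lines, loop_label):
    """Return a copy of asm_lines with markers around the loop at loop_label."""
    label_idx = None
    for i, line in enumerate(asm_lines):
        tokens = line.split()
        if label_idx is None:
            if line.strip() == loop_label + ":":
                label_idx = i  # the start marker goes right before this line
        elif tokens and tokens[-1] == loop_label:
            # First line after the label that targets it: assumed back jump.
            return (asm_lines[:label_idx] + X86_START
                    + asm_lines[label_idx:i + 1] + X86_END
                    + asm_lines[i + 1:])
    raise ValueError("loop label or jump to it not found: " + loop_label)


asm = [".L22:", "addq $32, %rax", "cmpq %rax, %r14", "jne .L22", "ret"]
print("\n".join(mark_kernel(asm, ".L22")))
```

For real assembly files, the interactive --insert-marker mode described below is the more robust choice, since it does not rely on these simplifying assumptions.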
Insert IACA markers
When the --insert-marker flag is given for a file, OSACA calls the implemented Kerncraft module to identify and mark the inner-loop block in manual mode. More information about how this is done can be found in the Kerncraft repository.
Note that this currently only works for x86 loop kernels.
To clarify the functionality of OSACA, a sample kernel is analyzed for an Intel CSX core below:
double a[N], b[N];
double s;

// loop
for(int i = 0; i < N; ++i)
    a[i] = s * b[i];
The code shows a simple scalar multiplication of a vector b and a floating-point number s. The result is written to vector a.
After inserting the OSACA byte markers into the assembly, one can start the analysis by typing
osaca --arch CSX PATH/TO/FILE
in the command line.
The output is:
Open Source Architecture Code Analyzer (OSACA) - v0.3
Analyzed file:      scale.s.csx.O3.s
Architecture:       csx
Timestamp:          2019-10-03 23:36:21

 P - Throughput of LOAD operation can be hidden behind a past or future STORE instruction
 * - Instruction micro-ops not bound to a port
 X - No throughput/latency information for this instruction in data file

Combined Analysis Report
-----------------------
                                     Port pressure in cycles
     |  0   - 0DV  |  1   |  2   -  2D  |  3   -  3D  |  4   |  5   |  6   |  7   ||  CP  | LCD  |
-------------------------------------------------------------------------------------------------
 170 |             |      |             |             |      |      |      |      ||      |      |   .L22:
 171 | 0.50        | 0.50 | 0.50   0.50 | 0.50   0.50 |      |      |      |      ||  8.0 |      |   vmulpd    (%r12,%rax), %ymm1, %ymm0
 172 |             |      | 0.50        | 0.50        | 1.00 |      |      |      ||  5.0 |      |   vmovapd   %ymm0, 0(%r13,%rax)
 173 | 0.25        | 0.25 |             |             |      | 0.25 | 0.25 |      ||      |  1.0 |   addq      $32, %rax
 174 | 0.00        | 0.00 |             |             |      | 0.50 | 0.50 |      ||      |      |   cmpq      %rax, %r14
 175 |             |      |             |             |      |      |      |      ||      |      | * jne       .L22

       0.75          0.75   1.00   0.50   1.00   0.50   1.00   0.75   0.75           13.0    1.0


Loop-Carried Dependencies Analysis Report
-----------------------------------------
 173 |  1.0 | addq      $32, %rax |
It shows the whole kernel together with the optimized port pressure of each instruction form and the overall port binding. Furthermore, the two columns on the right show the critical path (CP) and the longest loop-carried dependency (LCD) of the loop kernel. At the bottom, all loop-carried dependencies are shown, each with a list of the line numbers that are part of this dependency chain on the right.
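The summary row of the report can be reproduced by hand, which helps when reading the table. The sketch below copies the per-instruction values from the report above, sums them per port, and adds up the CP column. Treating the largest column sum as the throughput bound of the kernel is an assumption of this sketch (the usual reading of a port model), and the data layout is made up for illustration; it is not OSACA's internal representation.

```python
PORTS = ["0", "0DV", "1", "2", "2D", "3", "3D", "4", "5", "6", "7"]

# (instruction, port pressure in cycles, latency on the critical path)
# Values copied from the Combined Analysis Report; blank cells count as 0.
kernel = [
    ("vmulpd", {"0": 0.50, "1": 0.50, "2": 0.50, "2D": 0.50,
                "3": 0.50, "3D": 0.50}, 8.0),
    ("vmovapd", {"2": 0.50, "3": 0.50, "4": 1.00}, 5.0),
    ("addq", {"0": 0.25, "1": 0.25, "5": 0.25, "6": 0.25}, 0.0),
    ("cmpq", {"0": 0.00, "1": 0.00, "5": 0.50, "6": 0.50}, 0.0),
    ("jne", {}, 0.0),  # micro-ops not bound to a port
]

# Column sums, as in the last row of the report
totals = {p: sum(pressure.get(p, 0.0) for _, pressure, _ in kernel)
          for p in PORTS}
critical_path = sum(cp for _, _, cp in kernel)

# Assumption: the busiest port limits the throughput of one iteration.
bottleneck = max(totals.values())

for port in PORTS:
    print(f"port {port:>3}: {totals[port]:.2f}")
print(f"critical path: {critical_path:.1f} cycles")
print(f"busiest port:  {bottleneck:.2f} cycles per iteration")
```

Under this reading, ports 2, 3, and 4 are equally loaded and bound the kernel, while the critical path of 13.0 cycles is the sum of the vmulpd and vmovapd entries in the CP column.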
Implementation: Jan Laukemann